The AI Disclosures Project, a non-profit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, has alleged that OpenAI trained its GPT-4o model on copyrighted O’Reilly Media books without permission. This claim has sparked significant controversy.
AI models function as complex prediction engines. They learn from vast datasets like books, movies, and TV shows to respond to prompts. When an AI creates content, it draws from a large knowledge base rather than generating something completely new.
The training methods of AI models are evolving. Many labs, including OpenAI, are turning to AI-generated data because the supply of real-world data is shrinking. Still, numerous organizations prefer real-world data for training to avoid the risks associated with AI-generated data.
The project’s study suggests that GPT-4o shows stronger recognition of O’Reilly’s paid book content than GPT-3.5 Turbo does. The researchers used a method called DE-COP to detect copyrighted content in the training data.
They analyzed the knowledge of multiple OpenAI models, including GPT-4o and GPT-3.5 Turbo, using excerpts from 34 O’Reilly books. The results indicated that GPT-4o had a higher recognition rate of paid O’Reilly book content.
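As a rough illustration of what a recognition rate can mean in practice, the sketch below assumes a multiple-choice setup in which a model is shown one verbatim excerpt alongside paraphrases and must pick the verbatim one. This is not the researchers' code. The names `QuizItem`, `recognition_rate`, and `always_first` are hypothetical, and the stand-in guesser replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    """One multiple-choice question (hypothetical structure)."""
    options: List[str]   # one verbatim excerpt plus paraphrased versions
    answer_index: int    # position of the verbatim excerpt in `options`


def recognition_rate(items: List[QuizItem],
                     guess_fn: Callable[[List[str]], int]) -> float:
    """Fraction of questions where the guesser picks the verbatim excerpt."""
    if not items:
        raise ValueError("at least one quiz item is required")
    correct = sum(1 for item in items
                  if guess_fn(item.options) == item.answer_index)
    return correct / len(items)


def always_first(options: List[str]) -> int:
    """Stand-in for a model: always chooses the first option."""
    return 0


# Toy data only; the excerpts are placeholders, not real book text.
items = [
    QuizItem(["verbatim A", "para A1", "para A2", "para A3"], 0),
    QuizItem(["para B1", "verbatim B", "para B2", "para B3"], 1),
    QuizItem(["verbatim C", "para C1", "para C2", "para C3"], 0),
    QuizItem(["para D1", "para D2", "verbatim D", "para D3"], 2),
]

rate = recognition_rate(items, always_first)
chance = 1 / len(items[0].options)  # expected rate from blind guessing
print(f"recognition rate: {rate:.2f}, chance level: {chance:.2f}")
```

A rate well above the chance level on paid content, and not on comparable unseen content, is the kind of signal such a test looks for. On its own it does not show how the text reached the model.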
However, the researchers note that this isn’t conclusive evidence. OpenAI might have obtained the content through users copying and pasting it. Also, the study didn’t assess OpenAI’s latest models, leaving room for the possibility that they weren’t trained on O’Reilly’s paid books.
OpenAI already faces intense scrutiny regarding its data usage practices under current legal frameworks, despite paying for some training data and having agreements with certain entities. This new research adds to the legal challenges it faces regarding training data usage.