Monday, 16 June 2025

OpenAI under fire as new study reveals signs of using copyrighted content in training

A new study suggests OpenAI’s models memorised copyrighted content, raising concerns about fair use and data transparency.

A recent study suggests that some of OpenAI’s AI models may have learned directly from copyrighted material — without permission. This finding adds weight to ongoing legal battles by authors, developers, and other rights-holders who say their work has been unfairly used to build these models.

While OpenAI has argued that using such content is covered by “fair use,” the study raises new concerns over whether that defence holds up. Researchers from the University of Washington, Stanford University, and the University of Copenhagen worked together to develop a method for checking whether AI models have memorised specific pieces of text during training.

New method reveals hidden memorisation

The study focused on what the authors call “high-surprisal” words — uncommon words that stand out when placed in certain sentences. For example, “radar” in the sentence “Jack and I sat perfectly still with the radar humming” is considered a high-surprisal word. It’s less expected in this context than words like “engine” or “radio,” which might appear more often before the word “humming.”
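To make the idea concrete, surprisal is commonly defined as the negative log of a word's probability in its context, so rarer words score higher. The sketch below is only an illustration: the probabilities are invented for the example and are not taken from the study or from any real language model, and `surprisal_bits` is a hypothetical helper name.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Return the surprisal of an event in bits, defined as -log2(p)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# ASSUMPTION: made-up probabilities for the word that precedes "humming"
# in "Jack and I sat perfectly still with the ___ humming".
toy_probs = {"engine": 0.25, "radio": 0.125, "radar": 0.0078125}

# Lower probability means higher surprisal.
surprisals = {word: surprisal_bits(p) for word, p in toy_probs.items()}
most_surprising = max(surprisals, key=surprisals.get)

for word, bits in surprisals.items():
    print(f"{word}: {bits:.1f} bits")
print("highest surprisal:", most_surprising)
```

With these invented numbers, "radar" carries the most surprisal, which matches the article's intuition that it is the least expected of the three words.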

Using this idea, the researchers tested several of OpenAI’s models, including GPT-3.5 and GPT-4. They took text snippets from fiction books and articles published in The New York Times, removed the high-surprisal words, and asked the models to guess the missing word.

If a model reliably guessed the correct word, that suggested it had seen the exact phrase or passage during training, which is an indication of memorisation. Because high-surprisal words are hard to predict from context alone, a correct guess is difficult to explain any other way. Since the test data included copyrighted books and journalism, the results raise serious ethical and legal concerns.
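The masking-and-guessing procedure can be sketched roughly as follows. This is not the researchers' code: `mask_word`, `cloze_accuracy`, and `toy_guess` are hypothetical names, the "model" is a simple lookup table standing in for a real language model, and the second example sentence is invented for illustration.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"


def mask_word(sentence: str, target: str) -> str:
    """Replace the first whitespace-delimited occurrence of target with MASK.

    Simplification: words are split on whitespace, so punctuation attached
    to a word is not handled. Raises ValueError if target is absent.
    """
    words = sentence.split()
    words[words.index(target)] = MASK
    return " ".join(words)


def cloze_accuracy(
    examples: List[Tuple[str, str]],
    guess: Callable[[str], str],
) -> float:
    """Fraction of masked words that the guesser recovers exactly."""
    if not examples:
        return 0.0
    hits = 0
    for sentence, target in examples:
        prompt = mask_word(sentence, target)
        if guess(prompt).strip().lower() == target.lower():
            hits += 1
    return hits / len(examples)


def toy_guess(prompt: str) -> str:
    """ASSUMPTION: stand-in for a model that has memorised one passage."""
    memorised = {
        "Jack and I sat perfectly still with the [MASK] humming": "radar",
    }
    # For unseen text, fall back to a common, low-surprisal word.
    return memorised.get(prompt, "engine")


examples = [
    ("Jack and I sat perfectly still with the radar humming", "radar"),
    ("The kettle sat on the stove quietly whistling", "stove"),  # invented
]
accuracy = cloze_accuracy(examples, toy_guess)
print(f"cloze accuracy: {accuracy:.2f}")
```

In this toy setup the stand-in model recovers the masked word only for the passage it has "seen", which is the kind of signal the article describes as evidence of memorisation.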

GPT-4 showed signs of memorising books

The tests showed that GPT-4, OpenAI’s most advanced model, appears to have memorised sections of popular fiction. Some of this content came from a dataset called BookMIA, which includes samples from copyrighted ebooks. The model also appeared to have memorised parts of New York Times articles, though at a lower rate than fiction.

These findings point to the possibility that GPT-4 was trained, at least in part, on copyrighted materials. That’s a major issue for creators whose work may have been included without their consent.

Abhilasha Ravichander, a PhD student at the University of Washington and one of the study’s authors, explained the significance of this discovery. “To have large language models that are trustworthy, we need to be able to audit them and understand how they work,” she said. “Our study offers one way to investigate that, but it also highlights the urgent need for more transparency around the data these models are trained on.”

OpenAI has been the subject of several lawsuits over its use of copyrighted content. It has defended its approach by arguing that training AI on such content is fair use, a legal principle in the US that allows limited use of copyrighted material without permission.

At the same time, OpenAI has tried to show it takes content rights seriously. It has licensing agreements with some publishers and offers an “opt-out” process so creators can request that their work not be used in training.

Still, the company continues to lobby governments around the world to support looser rules regarding AI training. It wants clearer legal protections that would allow models to be trained on a broad range of online content, including some copyrighted material, without facing legal risks.

But as this new study shows, there’s a fine line between learning from data and copying it. And until lawmakers draw that line, the debate around fair use in AI training will likely remain heated.
