February 23, 2026
AIs can generate near-verbatim copies of novels from training data
LLMs memorize more training data than previously thought.

TL;DR
- Leading AI models can generate near-verbatim copies of copyrighted novels, contradicting previous industry claims.
- Studies show large language models memorize more training data than previously understood.
- This memorization ability poses significant challenges to AI groups fighting dozens of copyright lawsuits globally.
- AI companies have argued that their models 'learn' from data but do not store copies, a claim now being questioned.
- Researchers were able to extract thousands of words, and even entire novels, from models like Gemini 2.5 and Grok 3.
- Jailbreaking techniques allowed for the extraction of almost entire novels from models like Anthropic's Claude 3.7 Sonnet.
- Previous research also found that 'open' models like Meta's Llama memorize large portions of books.
- Memorization also raises privacy and confidentiality concerns in sectors such as healthcare and education.
- Legal experts suggest this could create significant liability for AI groups regarding copyright infringement.
- Past legal rulings have treated training on copyrighted content as 'fair use' when transformative, while treating the storage of pirated works as infringement.
- A German ruling found OpenAI infringed copyright due to its model memorizing song lyrics.
- While jailbreaking is considered impractical for ordinary users, legal experts say that extracting entire books without it would be a clear copyright violation.
- Companies like Anthropic state their models learn patterns rather than storing specific datasets.