February 23, 2026
AIs can generate near-verbatim copies of novels from training data
LLMs memorize more training data than previously thought.

TL;DR
- Leading AI models can generate near-verbatim copies of copyrighted novels, contradicting previous industry claims.
- Studies show large language models memorize more training data than previously understood.
- This memorization ability poses significant challenges to AI groups fighting dozens of copyright lawsuits globally.
- AI companies have argued that their models 'learn' from data but do not store copies, a claim now being questioned.
- Researchers were able to extract thousands of words, and even entire novels, from models like Gemini 2.5 and Grok 3.
- Jailbreaking techniques allowed for the extraction of almost entire novels from models like Anthropic's Claude 3.7 Sonnet.
- Previous research also found that 'open' models like Meta's Llama memorize large portions of books.
- Memorization also raises privacy and confidentiality concerns in sectors such as healthcare and education.
- Legal experts suggest this could create significant liability for AI groups regarding copyright infringement.
- Past legal rulings have treated training on copyrighted content as 'fair use' when transformative, while treating the storage of pirated works as infringement.
- A German ruling found OpenAI infringed copyright due to its model memorizing song lyrics.
- While jailbreaking is considered impractical for ordinary users, legal experts say that extracting entire books without it would be a clear copyright violation.
- Companies like Anthropic state their models learn patterns rather than storing specific datasets.