
February 23, 2026

AIs can generate near-verbatim copies of novels from training data

LLMs memorize more training data than previously thought.

TL;DR

  • Leading AI models can generate near-verbatim copies of copyrighted novels, contradicting previous industry claims.
  • Studies show large language models memorize more training data than previously understood.
  • This memorization ability poses significant challenges to AI groups fighting dozens of copyright lawsuits globally.
  • AI companies have argued that their models 'learn' from data but do not store copies, a claim now being questioned.
  • Researchers were able to extract thousands of words, and even entire novels, from models like Gemini 2.5 and Grok 3.
  • Jailbreaking techniques allowed for the extraction of almost entire novels from models like Anthropic's Claude 3.7 Sonnet.
  • Previous research also found that 'open' models like Meta's Llama memorize large portions of books.
  • The memorization feature has potential privacy and confidentiality implications in sectors like healthcare and education.
  • Legal experts suggest this could create significant liability for AI groups regarding copyright infringement.
  • Past legal rulings have considered training on copyrighted content as 'fair use' if transformative, but storing pirated works as infringing.
  • A German ruling found OpenAI infringed copyright due to its model memorizing song lyrics.
  • While jailbreaking is considered impractical for ordinary users, extracting entire books without it is regarded as a clear copyright violation.
  • Companies like Anthropic state their models learn patterns rather than storing specific datasets.
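A common way researchers quantify "near-verbatim" reproduction is to prompt a model with a prefix of a passage and measure how much of its continuation matches the source text verbatim. Below is a minimal sketch of such an overlap check; the `verbatim_overlap` helper and the Dickens snippet are illustrative assumptions, not code or data from the studies cited above:

```python
from difflib import SequenceMatcher

def verbatim_overlap(reference: str, generated: str) -> tuple[int, float]:
    """Compare a model continuation against the reference passage.

    Returns (length of the longest verbatim character run, fraction of
    the generated text that falls inside matching runs). A long run and
    a ratio near 1.0 suggest the model is reproducing memorized text
    rather than paraphrasing.
    """
    matcher = SequenceMatcher(None, reference, generated, autojunk=False)
    longest = matcher.find_longest_match(0, len(reference),
                                         0, len(generated)).size
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return longest, matched / max(len(generated), 1)

# Hypothetical illustration: a continuation that copies its source verbatim.
reference = "It was the best of times, it was the worst of times."
generated = "it was the worst of times."
longest, ratio = verbatim_overlap(reference, generated)
```

In practice, extraction studies run checks like this at scale, sliding prefixes through a book and flagging continuations whose verbatim runs exceed a threshold.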

