The news: Microsoft signed a deal with HarperCollins to use the book publisher’s nonfiction works for AI model training.
Authors need to opt in to the training program, and the AI model’s output will be limited to “no more than 200 consecutive words and/or 5% of a book’s text.” For context, 5% of Michelle McNamara’s “I’ll Be Gone in the Dark,” a nonfiction title published by HarperCollins, comes to about 18 pages of text.
Zooming out: The pool of public content for generative AI (genAI) training is running out, which could affect timelines for model improvement.
The obstacle: Getting writers on board with the licensing deal may be difficult.
Less than half (47%) of US adults trust companies to responsibly prevent their AI models from creating work that’s derivative of other work, per The Verge.
Why this could succeed: Book content could be safer from unauthorized data scraping than news content, since books are less frequently published in full online.
For authors who aren’t opposed to genAI learning from their work, this partnership could offer additional revenue with a concrete limit to the AI’s outputs.
Our take: It isn’t clear what role will be left for human creators in an AI-driven future, but with a finite amount of data left for model training, AI companies are likely to keep pursuing various publishers with lucrative licensing deals.
This article is part of EMARKETER’s client-only subscription Briefings, daily newsletters authored by industry analysts who are experts in marketing, advertising, media, and tech trends.