ⓘ Yellow links may be interesting further reading. Purple links are mostly just citing sources.
Tech companies are outrageously interested in generative AI. Every company under the sun is trying to integrate it into their products. The biggest tech companies are even creating their own models.
Here's a quick recap of how generative AI quality is going, focusing on OpenAI's models:
- GPT-2 was trained on a dataset of a mere 8 million web pages. That's 40 GB of text. That's a fragment of the internet so small it could fit on a USB stick.
- GPT-2 can write grammatically correct text that fits a theme, but it easily goes off the rails or gets stuck in loops, and it lacks logical consistency between the different things it writes in the same piece.
- The GPT-3 paper describes how much more data was used to train it: "Datasets for language models have rapidly expanded", "constituting nearly a trillion words", plus "books" (they won't say which books) and all of Wikipedia.
- 🗦new🗧Less than a month after I published this blog post, book authors took OpenAI to court, only to find that the evidence had been destroyed; OpenAI has still not revealed which books.
- GPT-3 can solve logic- and language-based puzzles, with questionable success.
- We can only guess what GPT-4 was trained on, but it's definitely a lot of (copyrighted) data. It uses everything GPT-3 used, plus a million hours of transcribed YouTube videos, and basically the whole internet via Common Crawl and OpenAI's spiders. Look, even my website is in there! I'm famous!
- These largest models have claimed to be sentient, multiple times, and convincingly enough that they have fooled people with real intelligence into believing them.
But where is this trend headed in the future? What will the inevitable GPT-5 be like?
Good models require lots of data to train on. Presumably, making bigger and better models will require even more training data.
But there's one big problem.
The internet has run out of training data
Virtually all information created by humanity is already part of an AI model.
- All written works before 1929 are in the model
- A surface-level view of all humanity's knowledge is in the model
So models already know about all the things. But it's not enough.
- Current, copyrighted books are in the model
- All of Wikipedia's source web pages are in the model
So models already know a lot about all the things. But it's not enough.
- News articles are in the model
- Everything on the internet is in the model
So models know everything about all the things. But it's not enough. They don't know how to act human.
So they need to study every online human conversation. They need social media. But social media is a walled garden — you can't just walk up and scrape every Facebook post, every Instagram image — not after those guys already did it. And besides, maybe copyright does apply to them after all!
So they make deals with the social media platforms' owners to access their subjects' data. Here's every platform I can find that has publicly announced AI training:
- Facebook has its own AI model, and all Facebook posts will be used for training it
- Same goes for Instagram and Threads posts, also owned by Facebook
- Twitter has its own AI model and is using Twitter posts to train it
- All conversations on Reddit were bought by Google for $60 million
- Tumblr, the black sheep of social media, is working out an AI deal
- YouTube video transcripts are in the model, as previously mentioned
- YouTube videos themselves might be in the model, but who can say?
- Stack Overflow, a Q&A site best known for its coding advice, also has a deal with Google
- Quora, another Q&A site, will train an AI on its user data
- Snapchat has added generative AI features, and conversations are used for training
- Sites that have fallen out of relevance, like Photobucket, have been contacted by multiple companies seeking a data deal
- Stock photo sites used to respond by suing, but Getty Images is now running its own model while Shutterstock sells access to OpenAI
- News publishers resisted having even snippets of their articles republished long before generative AI, but it turns out all they needed was the right price: NewsCorp, which owns the WSJ, The Sun, the New York Post, The Australian, and dozens of others, has now traded all its articles to AI companies
- Axel Springer, which owns Politico, Business Insider and others, has also signed a deal with OpenAI
- The Atlantic and, separately, Vox Media (which owns Vox, The Verge, Polygon, New York, NowThis, Intelligencer, Vulture, and others) have also signed deals with OpenAI
- They have even begun drawing on private conversations: your emails on Gmail are used for AI training
- 🗦new🗧Eight hours after I published this blog post, the Telegram chat app added Microsoft's generative AI chatbot. This falls under Microsoft's privacy policy, so your data is used for training.
That's all the social media sites! All of them! That's a copy of every digitally recorded human conversation!
The internet has run out of training data
That's all there is. After these deals finalise, there will be no more available training data on the internet. It's all been absorbed. Whatever the state of AI models is in a year or two, they cannot get any better after that, as there simply will not be any more data to train on.
In fact, as the amount of AI-generated text published on the internet begins to dwarf human-written text, models are expected to get worse.
You've probably been hearing that AI will just keep getting better and better, that any current problems won't be problems in the future, and that you have to embrace AI instead of fighting it. But contrary to what many believe, AI will not simply keep getting better over time. I believe we have already reached the peak of how "good" AI will ever get.
I can't make any firm predictions of what will happen to the internet in the future, but here's what I do believe:
- The quality of generative AI will decline uncontrollably. Soon.
- Users are already exploited for their attention; next we will be exploited for our humanity, as tech companies trade anything human-generated as a commodity in an attempt to improve model quality.
- Sludge will become even more sludgy.
- It will be impossible to find anything online.
— Cadence