cadence’s website.


The internet has run out of training data

Yellow links may be interesting further reading. Purple links are mostly just citing sources.

Tech companies are outrageously interested in generative AI. Every company under the sun is trying to integrate it into their product. The biggest tech companies are even creating their own models.

Here's a quick recap of how generative AI quality is going, focusing on OpenAI's models:

But where is this trend headed in the future? What will the inevitable GPT-5 be like?

Good models require lots of data to train. Presumably, making bigger, better models will require even more training data.

But there's one big problem.

The internet has run out of training data

Virtually all information created by humanity is already part of an AI model.

So models already know about all the things. But it's not enough.

  • Current, copyrighted books are in the model
  • All of Wikipedia's source web pages are in the model

So models already know a lot about all the things. But it's not enough.

So models know everything about all the things. But it's not enough. They don't know how to act human.

So they need to study every online human conversation. They need social media. But social media is a walled garden — you can't just walk up and scrape every Facebook post, every Instagram image — not after those guys already did it. And besides, maybe copyright does apply to them after all!

So they make deals with the social media platforms' owners to access their subjects' data. Here's every platform I can find that has publicly announced AI training:

That's all the social media sites! All of them! That's a copy of every digitally recorded human conversation!

The internet has run out of training data

That's all there is. After these deals finalise, there will be no more available training data on the internet. It's all been absorbed. Whatever the state of AI models is in a year or two, they cannot get any better after that, as there simply will not be any more data to train on.

In fact, as the amount of AI generated text published on the internet begins to dwarf human-written text, models are expected to get worse.

You've probably been hearing that AI will just keep getting better and better, that any current problems won't be problems in the future, and that you have to embrace AI instead of fighting it. But contrary to what many believe, AI will not simply keep getting better over time. I believe we have already reached the peak of how "good" AI will ever get.

I can't make any firm predictions of what will happen to the internet in the future, but here's what I do believe:

  • The quality of generative AI will decline uncontrollably. Soon.
  • Users are already exploited for their attention; next, we will be exploited for our humanity, as tech companies trade anything human-generated as a commodity in an attempt to improve model quality.
  • Sludge will become even more sludgy.
  • It will be impossible to find anything online.

— Cadence
