AI ‘gold rush’ for chatbot training data may run out of human-written text

Artificial intelligence systems like ChatGPT could soon run out of what makes them smarter: the tens of trillions of words people have written and shared online.

A new study published Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade, sometime between 2026 and 2032.

Tamay Besiroglu, an author of the study, likened it to a “literal gold rush” that depletes finite natural resources and said the AI field could face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.

In the short term, tech companies like ChatGPT maker OpenAI and Google are racing to secure, and sometimes pay for, high-quality data sources to train their AI large language models, for example by striking deals to tap into the steady stream of sentences coming out of Reddit forums and news media outlets.

In the longer term, there won’t be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private – such as emails or text messages – or to rely on less reliable ‘synthetic data’ spit out by the chatbots themselves.

“There is a serious bottleneck here,” Besiroglu said. “If you run into those limitations about the amount of data you have, you can’t really scale your models efficiently anymore. And scaling models has probably been the most important way to expand their capabilities and improve the quality of their output.”

The researchers first made their projections two years ago, shortly before ChatGPT’s debut, in a working paper that forecast a more imminent 2026 cutoff of high-quality text data. A lot has changed since then, including new techniques that have allowed AI researchers to make better use of the data they already have, sometimes “overtraining” on the same sources multiple times.

But there are limits, and after further research, Epoch now expects public text data to run out sometime in the next two to eight years.

The team’s latest study has been peer-reviewed and will be presented this summer at the International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by advocates of effective altruism — a philanthropic movement that has poured money into mitigating the worst risks of AI.

Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients — computing power and vast amounts of Internet data — could significantly improve the performance of AI systems.

The amount of text data fed into AI language models has grown about 2.5 times per year, while computing usage has grown about 4 times per year, according to the Epoch study. Facebook parent company Meta Platforms recently claimed that the largest version of its upcoming Llama 3 model – which has not yet been released – has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.
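
To see how quickly compounding growth at that rate eats into a fixed supply of text, the arithmetic can be sketched as follows. This is a minimal illustration, not Epoch's model: the function name and the 300-trillion-token stock are hypothetical placeholders chosen only for the example, while the 2.5x yearly growth and the 15-trillion-token starting point are the figures quoted above.

```python
import math


def years_until_exhausted(current_tokens, stock_tokens, growth_per_year):
    """Years until a training set growing by a constant yearly factor
    matches a fixed stock of text. Returns 0.0 if it already does."""
    if current_tokens <= 0 or stock_tokens <= 0:
        raise ValueError("token counts must be positive")
    if growth_per_year <= 1:
        raise ValueError("growth_per_year must be greater than 1")
    if current_tokens >= stock_tokens:
        return 0.0
    # Solve current_tokens * growth_per_year**t = stock_tokens for t.
    return math.log(stock_tokens / current_tokens) / math.log(growth_per_year)


if __name__ == "__main__":
    # Figures from the article: 15 trillion tokens, growing about 2.5x per year.
    current = 15e12
    growth = 2.5
    # ASSUMPTION: a hypothetical stock of public text, for illustration only.
    hypothetical_stock = 300e12
    years = years_until_exhausted(current, hypothetical_stock, growth)
    print(f"About {years:.1f} years until the hypothetical stock is matched")
```

Because the growth is exponential, the answer is not very sensitive to the assumed stock: at 2.5x per year, a supply ten times larger buys only about two and a half extra years.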

But the extent to which it is worth worrying about the data bottleneck is debatable.

“I think it’s important to keep in mind that we don’t necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and a researcher at the nonprofit Vector Institute for Artificial Intelligence.

Papernot, who was not involved in the Epoch study, said more skilled AI systems can also come from training models that are more specialized for specific tasks. But he worries about training generative AI systems on the same outputs they produce, which leads to degraded performance known as “model collapse.”

Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research has also found that it can further encode the errors, biases and unfairness already ingrained in the information ecosystem.

If real, human-made sentences remain a crucial AI data source, those who control the most sought-after troves – websites like Reddit and Wikipedia, as well as news and book publishers – will be forced to think carefully about how that material is used.

“Maybe you don’t cut the tops off every mountain,” jokes Selena Deckelmann, head of product and technology at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having conversations about natural resources over human-created data. I shouldn’t laugh at it, but I think it’s amazing.”

While some have tried to close off their data from AI training, often after it has already been collected without compensation, Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes there will remain incentives for people to keep contributing, especially as a flood of cheap and auto-generated “junk content” begins to pollute the Internet.

AI companies should be “concerned about how human-generated content persists and remains accessible,” she said.

From the perspective of AI developers, Epoch’s study says that paying millions of people to generate the text that AI models will need is “unlikely to be an economical way” to achieve better technical performance.

As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating a lot of synthetic data” for training.

“I think you need high-quality data. There is low quality synthetic data. There is low quality human data,” Altman said. But he also expressed reservations about relying too heavily on synthetic data instead of other technical methods to improve AI models.

“There would be something very strange if the best way to train a model was to just generate a trillion tokens of synthetic data and feed it back in,” Altman said. “Somehow that seems inefficient.”


The Associated Press and OpenAI have a licensing and technology agreement that gives OpenAI access to some of AP’s text archives.