Artificial intelligence prophets and newsmongers are predicting the end of the generative AI hype, warning of an impending, catastrophic “model collapse.”
But how realistic are these predictions? And what exactly is model collapse?
First discussed in 2023 and popularized more recently, “model collapse” refers to a hypothetical scenario in which future AI systems get progressively dumber because of the increase in AI-generated data on the internet.
Data needed
Modern AI systems are built using machine learning: programmers set up the underlying mathematical structures, but the actual “intelligence” comes from training the system to mimic patterns in data.
But not just any data will do: current generative AI systems require large amounts of high-quality data.
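As a rough, purely illustrative sketch of that division of labour (the programmer fixes the structure, the behaviour is learned from data), here is a toy Python example that fits a straight line to noisy measurements. None of this is drawn from any particular company's system.

```python
# Toy sketch: the programmer chooses the model's structure (a line y = a*x + b);
# the "intelligence" (the values of a and b) is learned from patterns in data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)   # noisy data to mimic

# Learn a and b from the data by least-squares fitting.
a, b = np.polyfit(x, y, deg=1)
print(f"learned a={a:.2f}, b={b:.2f}")   # close to the underlying 3 and 2
```

The same principle scales up: a large language model has a fixed architecture written by engineers, but everything it “knows” comes from fitting that architecture to training data.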
To get this data, big tech companies such as OpenAI, Google, Meta and Nvidia are constantly scouring the internet, amassing terabytes of content to feed their machines. But since the advent of widely available and useful generative AI systems in 2022, more and more people have been uploading and sharing content that was created, in part or in whole, by AI.
In 2023, researchers began to wonder whether AI could be trained exclusively on AI-created data, rather than on human-generated data, and still work well.
There are strong incentives to make this happen: In addition to its proliferation on the internet, AI-created content is much cheaper to acquire than human data, and there are no ethical or legal issues with collecting it in bulk.
But researchers have found that, without high-quality human data, AI systems trained on AI-created data get progressively dumber as each model learns from the previous one, like a digital version of inbreeding.
This “iterative training” appears to lead to a decline in the quality and diversity of model behavior. Quality here roughly means a combination of being helpful, harmless, and honest. Diversity refers to the variety of responses, and the cultural and social perspectives of people represented in the AI output.
This means that overuse of AI systems could potentially pollute the very data sources necessary to make them useful in the first place.
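To see how this kind of iterative degradation can arise in principle, here is a deliberately simplified Python sketch, not taken from any of the studies mentioned above. Each “generation” of a toy statistical model is trained only on samples produced by the previous generation, and its estimate of the data's spread (a crude stand-in for diversity) tends to drift and, on average, shrink.

```python
# Toy illustration of iterative training on synthetic data: each generation
# fits a simple Gaussian "model" to the previous generation's output only.
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is trained on "human" data: mean 0, standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(20):
    # "Training": estimate the data's mean and spread from the current data.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation sees only synthetic samples from this model.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

Real generative models are vastly more complex, but the underlying worry is the same: errors and narrowing compound when models feed on their own output.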
Avoiding collapse
Can't big tech companies just filter the content that their AI generates? Not really. Tech companies already spend a lot of time and money cleaning and filtering the data they collect, with one industry insider recently revealing that they sometimes discard 90% of the data they initially collect to train their models.
These efforts will become more demanding as the need to specifically remove AI-generated content grows. More importantly, in the long term AI-generated content will become harder and harder to distinguish from human content, making the filtering and removal of synthetic data a game of diminishing (monetary) returns.
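As a purely illustrative sketch of the filtering problem, consider how a training corpus might be pruned. The quality scores and AI-content detector below are assumptions for the example, not any company's real pipeline; in practice such detectors are imperfect, which is exactly why the returns diminish.

```python
# Hypothetical data-cleaning step: keep only documents that look high quality
# and human-written, according to (imperfect, assumed) upstream models.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Document:
    text: str
    quality_score: float     # from an assumed upstream quality model
    p_ai_generated: float    # from an assumed AI-content detector


def filter_corpus(
    docs: Iterable[Document],
    min_quality: float = 0.7,
    max_p_ai: float = 0.3,
) -> List[Document]:
    """Keep documents that pass both the quality and the human-origin checks."""
    return [
        doc for doc in docs
        if doc.quality_score >= min_quality and doc.p_ai_generated <= max_p_ai
    ]


# Example: a tiny corpus where most candidates are thrown away,
# echoing the heavy discarding described above.
corpus = [
    Document("hand-written tutorial", quality_score=0.9, p_ai_generated=0.1),
    Document("scraped spam page", quality_score=0.2, p_ai_generated=0.4),
    Document("likely AI-written listicle", quality_score=0.8, p_ai_generated=0.9),
]
print(len(filter_corpus(corpus)), "of", len(corpus), "documents kept")
```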
Ultimately, research to date has shown that human data cannot be eliminated entirely: the “I” in AI comes, after all, from humans.
Are we heading for disaster?
There are already signs that developers are going to great lengths to secure high-quality data: for example, the documentation accompanying the GPT-4 release notes an unprecedented number of staff involved in the data-related parts of the project.
New human data may also be drying up: some estimates suggest the pool of human-generated text data could be exhausted as early as 2026.
Perhaps that's why OpenAI and other companies are racing to forge exclusive partnerships with industry giants like Shutterstock, Associated Press and NewsCorp, which have vast amounts of proprietary human data that isn't readily available on the public internet.
But the likelihood of catastrophic model collapse may be exaggerated: most studies to date have examined cases where synthetic data replaces human data. In practice, human and AI data are more likely to accumulate side by side, which makes collapse less likely.
The most likely future scenario is that a reasonably diverse ecosystem of generative AI platforms, rather than a single monolithic model, will be used to create and publish content. Such an ecosystem would also be more robust against disruption.
This is a good reason for regulators to limit monopolies in the AI field, promote healthy competition, and fund technological development that serves the public interest.
Real concerns
Too much AI-created content also poses more subtle risks.
While the proliferation of synthetic content may not pose an existential threat to progress in AI development, it does pose a threat to the digital commons of the (human) internet.
For example, researchers found that a year after ChatGPT's release, activity on the coding Q&A website Stack Overflow had dropped by 16%, suggesting that AI assistance may already be reducing person-to-person interactions in some online communities.
Hyper-production by AI-powered content farms also makes it hard to find content that isn’t ad-stuffed clickbait.
It is becoming impossible to reliably distinguish human-generated content from AI-generated content. One remedy, as I and many others have recently highlighted, and as reflected in recent Australian Government interim legislation, is to watermark or label AI-generated content.
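A minimal sketch of what “labelling” might look like in practice is shown below. The field names are illustrative assumptions for this example, not taken from the legislation or from any real provenance standard such as C2PA.

```python
# Hypothetical example: attach machine-readable provenance metadata to
# generated text so downstream readers and crawlers can tell it is AI-made.
import json
from datetime import datetime, timezone


def label_ai_content(text: str, model_name: str) -> str:
    """Wrap generated text with a simple, illustrative provenance label."""
    record = {
        "content": text,
        "provenance": {
            "ai_generated": True,
            "model": model_name,
            "generated_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return json.dumps(record, indent=2)


print(label_ai_content("An example AI-written paragraph.", "example-model-v1"))
```

Visible watermarks, hidden statistical watermarks in the text itself, or metadata labels like the one sketched here all pursue the same goal: keeping human and synthetic content distinguishable.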
There are other risks, too: the systematic homogenization of AI-generated content risks a loss of socio-cultural diversity, and some groups of people could even experience cultural erasure. There is an urgent need for interdisciplinary research into the social and cultural challenges posed by AI systems.
Human interactions and human data are important, and we need to protect them – for our own sake, and perhaps to avoid the risk of future model collapse.
This article is republished from The Conversation under a Creative Commons license. Read the original article.
Image credit: Google DeepMind / Unsplash