With the rise of GPT-4, Stable Diffusion, and Midjourney, an increasing number of people are incorporating generative AI technology into their work and daily lives. Some have even begun to experiment with using AI-generated data to train AI models. Could this be the legendary "perpetual motion of data"?
Researchers say no. Over time, models trained on generated data tend to forget the underlying real-world data, even under nearly ideal long-term learning conditions. According to the researchers, this outcome is unavoidable.
As a result, researchers urge that if we want to maintain the superiority of models trained on large-scale data, we must take human-generated text seriously.
Training GPT-5 with GPT-4? Study warns: Toxicity in training AI with AI leads to model collapse.
The problem, however, is that what you take to be "human-generated data" may not have been written by humans at all.
The latest research from EPFL estimates that 33% to 46% of the crowd workers it studied used AI to produce data that was supposed to be human-generated.
Undoubtedly, large-scale language models have evolved to possess impressive capabilities. For example, GPT-4 can generate text that closely resembles human writing in certain contexts.
A significant reason for this, however, is that their training data comes largely from human interactions on the internet over the past few decades.
If future language models continue to rely on data crawled from the web, they will inevitably introduce their own generated text into the training set.
In light of this, researchers predict that by the time GPT reaches its nth generation, the models will suffer severe collapse.
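To make the idea of collapse concrete, here is a toy sketch in Python. It is not the experiment from the study: it assumes a one-dimensional Gaussian as a stand-in for a "model", and the function name `simulate_collapse` and its parameters are hypothetical, chosen only for illustration. Each generation is fitted solely to samples drawn from the previous generation, so no fresh real-world data ever enters the loop.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Toy illustration: repeatedly refit a Gaussian to its own samples.

    Generation 0 is the "real" distribution N(0, 1). Every later
    generation only sees data sampled from the generation before it.
    Returns the fitted standard deviation of every generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # the real-world distribution
    history = [sigma]
    for _ in range(generations):
        # "Generate" a training set from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood Gaussian fit.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 300):
        print(f"generation {generation:3d}: fitted std = {history[generation]:.3e}")
```

In this sketch the fitted spread drifts toward zero as generations pass, because each refit loses a little of the tails and nothing restores them. Real language models are far more complex, so this should be read only as an intuition for why self-generated training data can erase the original distribution.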
Therefore, since web crawls will inevitably capture LLM-generated content, preparing genuinely human-produced data for model training becomes crucial.
The renowned Amazon Mechanical Turk (MTurk) platform, which has been operating since 2005, has become a popular side gig for many people.
Researchers can post a wide range of small human intelligence tasks there, such as image labeling and surveys.
These tasks are typically beyond the capabilities of computers and algorithms. In fact, MTurk has become the "best choice" for some budget-constrained researchers and companies.
Even Bezos jokingly referred to MTurk's outsourced workers as "artificial artificial intelligence."
In addition to MTurk, crowdsourcing platforms like Prolific have become core resources for researchers and industry practitioners, offering ways to create, label, and summarize various types of data for surveys and experiments.
However, the EPFL study found that nearly half of this crucial, supposedly human-generated data comes from annotators who used AI to create it.