Consider this... is a recurring feature where we pose a provocative question and share our thoughts on the subject. We may not have answers, or even suggestions, but we will have a point of view, and hopefully make you think about something you haven't considered.
As more people use AI to create content, and AI platforms are trained on that content, how will that impact the quality of digital information over time?
It looks like we're kicking off this recurring feature with a mind-bending exercise in recursion, hence the title's reference to Ouroboros, the snake eating its own tail. Let's start with the most common sources of information that AI platforms use for training:
Books, articles, research papers, encyclopedias, documentation, and public forums
High-quality, licensed content that isn’t freely available to the public
Domain-specific content (e.g., programming languages, medical texts)
These represent the most common (and likely the largest) corpora that will contain AI-generated or AI-influenced information. And they're the most likely to increase in breadth and scope over time.
Training on these sources is a double-edged sword: good training content will be reinforced over time, but junk and erroneous content will be too. Complicating things, as the training set grows, it becomes increasingly difficult to validate. But hey, we can use AI to do that... can't we?
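To make the feedback loop concrete, here's a deliberately tiny sketch (our own toy, not an experiment from any study we cite): a Gaussian fit stands in for a "model," and each generation is trained only on samples produced by the previous generation. With small samples, estimation error compounds; in most runs the fitted mean drifts and the fitted spread wanders away from the original, often shrinking toward zero. The exact numbers will vary from run to run.

```python
import random
import statistics

def fit_model(samples):
    # "Training": estimate a mean and standard deviation from the data.
    return statistics.mean(samples), statistics.stdev(samples)

def generate_content(mu, sigma, n):
    # "AI-generated content": sample new data from the fitted model.
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)

# Generation 0 trains on "human" data: mean 0.0, standard deviation 1.0.
data = generate_content(0.0, 1.0, 25)

for gen in range(61):
    mu, sigma = fit_model(data)
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mean={mu:+.3f}  stdev={sigma:.3f}")
    # The next generation trains only on this generation's output.
    data = generate_content(mu, sigma, 25)
```

Nothing about this toy is specific to language models, but it captures the core worry: once a model's output becomes its own input, small errors stop averaging out and start accumulating.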
Here's another thing to think about: bad actors (e.g., geopolitical adversaries) are already poisoning training data through massive disinformation campaigns. According to Carnegie Mellon University's Security and Privacy Institute: “Modern AI systems that are trained to understand language are trained on giant crawls of the internet,” said Daphne Ippolito, assistant professor at the Language Technologies Institute. “If an adversary can modify 0.1 percent of the Internet, and then the Internet is used to train the next generation of AI, what sort of bad behaviors could the adversary introduce into the new generation?”
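And about that 0.1 percent: at web scale, it's a big number. Here's a quick back-of-the-envelope check; the corpus and subset sizes are made-up round figures for illustration, not measurements of any real crawl or dataset.

```python
# Back-of-the-envelope: how many documents is 0.1% of a web-scale corpus?
# Both sizes below are hypothetical round numbers, not real measurements.
corpus_docs = 3_000_000_000      # assume a crawl of ~3 billion pages
poisoned_fraction = 0.001        # the 0.1% from the quote above

print(f"{int(corpus_docs * poisoned_fraction):,} poisoned pages in the crawl")

# If filtering can't tell poison from legitimate text, a training subset
# inherits roughly the same fraction in expectation.
training_docs = 100_000_000      # assume a filtered 100-million-page subset
print(f"~{int(training_docs * poisoned_fraction):,} poisoned pages in training")
```

Millions of poisoned pages in the crawl, and filtering only helps if it can actually recognize the poison.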
We're only scratching the surface here. This topic will certainly become more prominent in the years to come, and tackling these issues is already a priority for AI companies. As Nature and others have determined, "AI models collapse when trained on recursively generated data."

We dealt with similar issues when the Internet boom first enabled wide-scale plagiarism and an easy path to bad information. AI has just amplified the problem through convenience and the assumption of correctness. As I wrote in a previous AI post, in spite of how helpful AI tools can be, the memes of AI fails may yet save us by educating the public on just how often AI is wrong, and that it doesn't actually think in the first place.
There's usually more to the story, so if you have questions or comments about this post, let us know!
Do you need a new software development partner for an upcoming project? We would love to work with you! From websites and mobile apps to cloud services and custom software, we can help!