AI Is Using Flawed Training Data
The limit to AI's improvement is all the data that is unavailable for it to train on
There's a common mentality that AI is just going to keep improving forever. Researchers will keep creating better models and have a greater amount of training data at their disposal. However, that runs on the assumption that all the training data AI will ever need is available and accurate. That's a pretty big assumption. I'd argue that some data is incredibly difficult to obtain, and its absence will act as a limiter on AI's progression.
What does AI have access to? Public web pages. Social media posts. Maybe even private text conversations, though we can debate whether that's actually happening or not.
There are some really big problems with that dataset. The first is that most of the training data comes from publicly available web pages, and that body of content is filled with decades of SEO fluff.
A lot of people have been feeling that Google search has gotten worse over the years. How is that possible? Google has thousands of smart software engineers. Shouldn't they be improving the algorithm, not making it worse?
One of the best quotes I've heard about this topic is "Is Google getting worse, or is it the internet that is getting worse?"
Google's success has had consequences. The primary way that most people find content on the web is through Google. No matter the motivation for writing, everyone would rather have more people read their content than fewer. That creates an incentive to produce content that looks good to Google's algorithm rather than content that benefits human readers.
Look at how that incentive manifests. When was the last time you looked up a recipe and got just the recipe? Probably sometime in the 1990s. These days you have to scroll through endless paragraphs about why the dish is delicious and healthy. You probably already know that, because why else would you look up the recipe? Those paragraphs aren't for you. They're for Google. They're written purely for SEO.
Software engineers have gone through countless web pages searching for solutions to technical problems. Those web pages overwhelmingly have lots of content regarding the basic concepts of the technology in question... and no solution to the actual problem. That's because those web pages aren't written for us. They're written for Google.
Try searching for whether there will be a new season of your favorite TV show. You'll get plenty of web pages with 6-8 paragraphs of content before you get to the part where they say no one knows because there has been no announcement yet. Those 6-8 paragraphs are completely useless to you. That's because they're not written for you. They're written for Google.
Google search hasn't technically gotten worse. Google's success has incentivized countless people to write garbage that makes the internet worse. If that content has had such a negative effect on Google search, what do you think it does to the training data for AI? If you find AI search products like Perplexity to be better than Google, what do you think will happen to publicly available content if those products become more dominant than Google?
The problem with public content isn't just all the SEO fluff. There's also a positivity bias. Ask ChatGPT to compare Postgres and Oracle. Under the positives for Oracle, it states that Oracle is better suited for large scale enterprises and high availability. That implies that Postgres is not good at those things, which is false.
This implication isn't there because there's a lot of negative content about Postgres. It is there because there's a lot of positive content about Oracle. People who use and like Oracle have reason to write good things about it. Putting content out there helps people build an audience, show display ads, promote other products, etc. Writing about things you like is easy and comes with no cost other than the time it takes to write the post.
Oracle certainly has good reason to write positive content about their own products. They want to sell those products and having lots of content out there increases the probability that Google will show their content.
What about everyone who dislikes Oracle products? There is content out there that is negative on Oracle, but it is outnumbered by the positive content. Why might this be? Well, do you feel like publicly airing all your bad experiences with technology? Or would you rather try and forget it? The cost of writing a negative piece isn't just the time it takes to write the post. You're also going to spend that time feeling miserable and/or angry. Why put yourself through that?
This affects every technology out there. The same thing happens with content about Postgres. And ReactJS. And AWS. And Kubernetes.
Speaking of Kubernetes, there's another reason for positivity bias. For a time, Kubernetes was seen as the bee's knees. Everyone was talking about how great it was... publicly. In private, many developers were struggling with the technology. The onboarding is now known to be notoriously bad, but that wasn't clear at first. If you're struggling with something everyone else is praising, are you likely to write something public about your struggles? Or are you more likely to try and hide that fact so that no one thinks less of you? Once again, there is little cost to writing something positive, but plenty of cost in writing something negative.
People may shy away from writing something negative publicly, but they don't shy away from complaining about things in person over drinks. People will say quite a bit in person that they will never say on the internet. I once talked to someone who was very, very bullish on microservices publicly. I asked him if he had ever implemented microservices before. The answer was no. I was once at a company that looked at all the marketing content written about a piece of technology. Privately, the account rep for the company that built that technology told us all the caveats. I once talked to a big proponent of remote work. Privately, he admitted to spending a lot of the work day doing house chores.
Side note: despite that, I'm still a proponent of remote work. I've been remote since 2016 and I run a company that is fully remote.
There are an endless number of topics where people are willing to say one thing in public but only go into the details in private. Those details are spoken and recorded only in human brains. That leaves a huge amount of information unavailable for AI training, which almost certainly skews the output provided by AI. Even if you believe AI is capable of reasoning, which I find unlikely, its ability to improve is still limited by the fact that the training data available is incomplete at best and incorrect at worst.