Human differences in judgment lead to problems for AI

Many people understand the concept of bias on an intuitive level. Racial and gender biases are well documented in society and in artificial intelligence systems.

If society could somehow eliminate prejudice, would all problems disappear? The late Nobel Prize winner Daniel Kahneman, a key figure in the field of behavioral economics, argued in his last book that bias is only one side of the coin. Errors in human judgment can be attributed to two sources: bias and noise.

Bias and noise both play important roles in areas such as law, medicine and financial forecasting, where human judgments are central. In our work as computer and information scientists, my colleagues and I have found that noise also plays a role in AI.

Statistical noise

Noise in this context means variation in the way people judge the same problem or situation. The noise problem is more widespread than it seems at first glance. A seminal work dating back to the Great Depression found that different judges imposed different sentences for similar cases.

Worryingly, sentences in court cases can depend on factors such as the temperature and whether the local football team won. Such factors contribute, at least in part, to the perception that the legal system is not only biased, but sometimes arbitrary.

Other examples: insurance adjusters may provide different estimates for similar claims, reflecting noise in their judgments. Noise is likely present in all kinds of competitions, from wine tastings to local beauty pageants to college admissions.

Noise in the data

At first glance, it seems unlikely that noise could affect the performance of AI systems. After all, machines aren't affected by the weather or football teams, so why would their judgments vary with circumstance? On the other hand, researchers know that bias affects AI because it is reflected in the data the AI is trained on.

For the new wave of AI models like ChatGPT, the gold standard is human performance on general intelligence problems like common sense. ChatGPT and its peers are measured against human-labeled, common-sense datasets.

Simply put, researchers and developers can ask the machine a common-sense question and compare its answer to human answers: "If I put a heavy rock on a paper table, will it collapse? Yes or no." If there is close agreement between the two – ideally perfect agreement – the machine approaches human-level common sense, according to the test.
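To make the comparison concrete, here is a minimal sketch in Python of how such a test can be scored. The `ask_model` function and the two example questions are hypothetical stand-ins for illustration, not the actual benchmark or its code.

```python
# Minimal sketch of scoring a model against human-labeled common-sense questions.
# `ask_model` is a hypothetical stand-in for querying a system like ChatGPT.

def ask_model(question: str) -> str:
    """Placeholder for a real model call; should return 'yes' or 'no'."""
    return "yes"  # stub answer so the sketch runs end to end

# Each item pairs a yes/no question with the answer human annotators gave it.
benchmark = [
    ("If I put a heavy rock on a paper table, will it collapse?", "yes"),
    ("Can a dog play volleyball?", "no"),
]

# Accuracy is simply the fraction of questions where the model matches the humans.
accuracy = sum(ask_model(q) == label for q, label in benchmark) / len(benchmark)
print(f"agreement with human labels: {accuracy:.0%}")
```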

So where would noise come in? The common-sense question above seems simple, and most people would probably agree on the answer. But there are many questions where there is more disagreement or uncertainty: "Is the following sentence plausible or implausible? My dog plays volleyball." In other words, there is potential for noise. Not surprisingly, interesting common-sense questions generate some noise.

But the problem is that most AI tests don't take this noise into account. Intuitively, questions whose human answers tend to agree with one another should be weighted more heavily than questions where the answers diverge – in other words, where there is noise. Researchers still don't know whether or how to weigh AI's responses in that situation, but a first step is to acknowledge that the problem exists.
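One possible weighting scheme, offered purely as an illustration since no standard exists yet, is to count each question in proportion to how strongly the human labelers agreed on it:

```python
# Purely illustrative: down-weight questions where human labelers disagreed.
# Each question counts in proportion to the share of annotators who picked
# the majority answer; a 50/50 split contributes half as much as unanimity.

from collections import Counter

def majority_and_agreement(labels):
    """Return the majority label and the fraction of annotators who chose it."""
    answer, count = Counter(labels).most_common(1)[0]
    return answer, count / len(labels)

def agreement_weighted_accuracy(predictions, annotations):
    """predictions: one model answer per question; annotations: the human answers per question."""
    total, earned = 0.0, 0.0
    for pred, labels in zip(predictions, annotations):
        majority, weight = majority_and_agreement(labels)
        total += weight
        if pred == majority:
            earned += weight
    return earned / total
```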

Detecting noise in the machine

Theory aside, the question remains whether all of the above is hypothetical or whether noise actually shows up in real tests of common sense. The best way to prove or disprove the presence of noise is to take an existing test, remove the answers, and have multiple people label it independently, each giving their own answers. By measuring the disagreement among those people, researchers can gauge how much noise the test contains.
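As a simplified illustration (the study itself uses more careful statistics, discussed next), disagreement can be summarized as the fraction of annotator pairs who gave different answers to the same question:

```python
# Simplified noise estimate: average pairwise disagreement among annotators.
from itertools import combinations

def pairwise_disagreement(labels):
    """Fraction of annotator pairs that answered one question differently."""
    pairs = list(combinations(labels, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical annotations: five people answering two questions.
annotations = [
    ["yes", "yes", "yes", "yes", "yes"],  # unanimous -> disagreement 0.0
    ["yes", "no", "yes", "no", "yes"],    # split     -> disagreement 0.6
]

noise = sum(pairwise_disagreement(a) for a in annotations) / len(annotations)
print(f"average disagreement across questions: {noise:.2f}")
```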

The details behind measuring this disagreement are complex and involve significant statistics and mathematics. Besides, who's to say how common sense should be defined? How do you know the human judges are motivated enough to really consider the question? These issues lie at the intersection of good experimental design and statistics. Robustness is key: a single result, test, or set of human labelers is unlikely to convince anyone. And, pragmatically, human labor is expensive. Perhaps for this reason, no studies had been done on possible noise in AI tests.

To close this gap, my colleagues and I designed such a study and published our findings in Nature Scientific Reports, showing that even in the realm of common sense, noise is unavoidable. Because the setting in which judgments are elicited can matter, we conducted two kinds of studies. One involved paid workers from Amazon Mechanical Turk, while the other was a smaller-scale labeling exercise in two labs at the University of Southern California and Rensselaer Polytechnic Institute.

You can think of the former as a more realistic online setting, reflecting how many AI tests are actually labeled before being released for training and evaluation. The latter is more of an extreme, guaranteeing high quality but at a much smaller scale. The question we wanted to answer was how unavoidable noise is, and whether it is just a matter of quality control.

The results were sobering. In both settings, even on common-sense questions that might have been expected to produce broad – even universal – agreement, we found a nontrivial degree of noise. The noise was high enough for us to conclude that between 4% and 10% of a system's performance could be attributed to it.

To emphasize what this means, suppose I built an AI system that scored 85% on a test, and you built one that scored 91%. Your system seems a lot better than mine. But if there is noise in the human labels used to score the answers, we can no longer be sure that the 6-percentage-point improvement means much. For all we know, there may be no real improvement.
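A toy simulation can show how this happens. The 10% noisy-label fraction and the question count below are assumptions for illustration, not numbers from the study: when a slice of the "gold" labels is effectively random because annotators disagreed, two systems with identical underlying ability can come out with different scores on the same test.

```python
# Toy simulation: label noise alone can separate two equally good systems.
import random

def observed_score(true_accuracy, noisy_fraction=0.10, n_questions=200, seed=None):
    """Score a system on a test where some labels are effectively coin flips."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        if rng.random() < noisy_fraction:
            correct += rng.random() < 0.5             # unreliable label: right by chance
        else:
            correct += rng.random() < true_accuracy   # reliable label: reflects real skill
    return correct / n_questions

# Two systems with the same true accuracy, scored against noisy labels.
print(observed_score(0.88, seed=1), observed_score(0.88, seed=2))
```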

On AI leaderboards, where large language models like ChatGPT are compared, the performance differences between competing systems are much smaller, typically less than 1%. As we show in the paper, standard metrics can't really distinguish the effects of noise from those of real performance improvements.

Noise audits

What’s the way forward? Returning to Kahneman’s book, he proposed the concept of a “noise audit” to quantify and ultimately reduce noise as much as possible. At the very least, AI researchers need to estimate what impact noise might have.

Checking AI systems for bias is fairly common, so we believe the concept of a noise audit should naturally follow. We hope that this study, and others like it, will lead to its adoption.

This article is republished from The Conversation, an independent nonprofit organization providing facts and analysis to help you understand our complex world.

It was written by Mayank Kejriwal, University of Southern California.


Mayank Kejriwal receives funding from DARPA.
