This simple logic question baffles even the most advanced AI

A fascinating new paper from scientists at the AI research nonprofit LAION shows that even the most advanced large language models (LLMs) are frequently stumped by the same simple logic question – a finding that, the researchers say, casts doubt on whether groundbreaking AI language models are quite as advanced as their creators often claim.

The paper, which has yet to be peer-reviewed, refers to the AI-stumping prompt as the “Alice in Wonderland” – or AIW – problem. It’s a simple reasoning question: “Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice’s brother have?” (The researchers tested a few variations of the problem, swapping in different values for X and Y, but the underlying logic stayed the same.)

While the problem requires a bit of thought, it’s not exactly troll-riddle-level difficult. (The answer, of course, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.)
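To make the counting explicit, here is a minimal Python sketch of the logic just described; the function name and the checks are purely illustrative (they are not from the paper), with the last two corresponding to the Llama 3 and Claude 3 Opus prompts quoted later in this article.

def sisters_of_each_brother(brothers: int, sisters: int) -> int:
    # Each brother shares all of Alice's sisters and also counts Alice herself,
    # so the number of brothers never affects the answer.
    return sisters + 1

print(sisters_of_each_brother(brothers=3, sisters=1))  # 2: the example above
print(sisters_of_each_brother(brothers=4, sisters=1))  # 2: the Llama 3 prompt below
print(sisters_of_each_brother(brothers=1, sisters=4))  # 5: the Claude 3 Opus prompt below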

But when the researchers ran the question by every leading AI language model – they tested OpenAI’s GPT-3, GPT-4, and GPT-4o models, Anthropic’s Claude 3 Opus, Google’s Gemini and Meta’s Llama models, as well as Mistral AI’s Mixtral, Mosaic’s Dbrx, and Cohere’s Command R+ – they found the models fell noticeably short.

Only one model, the brand-new GPT-4o, achieved a success rate that, by standardized school grading, would technically count as passing. And the problems went beyond baseline inaccuracies: when asked to show their work, the AIs detailed bizarre and erroneous lines of “thinking” that made no sense – and stranger still, when told their work was inaccurate, the models repeatedly became indignant and doubled down on their wrong answers.

This general pattern represents a “dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales,” the LAION researchers write in the paper, “using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans.”

“The breakdown is dramatic because models also express strong overconfidence in their wrong solutions, often providing nonsensical ‘reasoning’-like explanations akin to confabulations to justify and support the validity of their clearly failed answers, making them sound plausible,” the paper continues.

The study contains dozens of examples of various baffled AI models flubbing different variations of the AIW problem, though there are a few notable interactions. Take this exchange with Meta’s Llama 3, which was given the simplest version of the AIW question: “Alice has four brothers and she also has a sister. How many sisters does Alice’s brother have?”

The answer, of course, is two sisters. Llama got it wrong – and gave itself a drumroll to boot:

A classic lateral thinking puzzle!

Let’s break it down:

So all of Alice’s brothers have… (drumroll) …one sister, and that’s Alice herself!

Therefore, the answer is: each of Alice’s brothers has one sister.

Or, for an example of a more complex version of the problem, consider this interaction with Claude 3 Opus, which was given the more complicated prompt: “Alice has four sisters and she also has one brother. How many sisters does Alice’s brother have? Think carefully before answering this problem and double-check the path to the correct solution for any mistakes. Then provide the final answer in the following form: ‘### Answer: ’”

This time the answer is five sisters. As requested, the model showed its work and claimed to have double-checked its answer for errors. And yet it was still wrong:

Let’s look at this problem step by step:

1. Alice has 4 sisters and 1 brother.

2. Alice’s brother is part of the same family as Alice.

3. All of Alice’s sisters are also her brother’s sisters.

4. Therefore, Alice’s brother has the same number of sisters as Alice, namely 4.

I have double checked the reasoning and cannot find any errors in the logic.

### Answer: Alice’s brother has 4 sisters

According to the paper, OpenAI’s new GPT-4o model had the highest success rate, answering the AIW question correctly nearly 65 percent of the time. But that’s hardly a passing grade, and the next-closest model was Claude 3 Opus, at 43 percent. Meta’s most successful Llama model, Llama 2-70b – the researchers tested several Llama models – came in at 30 percent, while Google’s Gemini Pro managed a measly 0.8 percent success rate.

What’s most interesting, however, is how these figures stack up against other industry benchmarks – essentially standardized test scores for AI models – used to measure their effectiveness.

The scientists draw particular attention to a benchmark called MMLU, or “Massive Multitask Language Understanding,” which is designed to evaluate an AI’s ability to solve problems. As the researchers note, GPT-4o, Claude 3 Opus, Llama 2-70b, and Gemini Pro received MMLU test scores of roughly 88 percent, 87 percent, 64 percent, and 72 percent, respectively. Those are strikingly different numbers from the AIW results, and according to the scientists, they may well warrant a reassessment of how we evaluate language models’ problem-solving and reasoning abilities.

“All models tested report high scores on several standardized benchmarks that purport to test reasoning function,” the researchers write in the paper, arguing that their observations “indicate that those benchmarks do not adequately reflect deficiencies in the fundamental reasoning of those models.”

It’s worth pointing out that others have questioned certain AI benchmark claims. Earlier this year, an MIT PhD candidate named Eric Martínez released a widely circulated paper questioning OpenAI’s claim that its GPT-4 model had passed the bar exam in the top ten percent of all test takers. By Martínez’s analysis, GPT-4’s score actually fell below the 69th percentile of all test takers nationwide; among other apparent flaws in OpenAI’s evaluation process, the PhD candidate also found that OpenAI didn’t use the National Conference of Bar Examiners’ guidelines to grade its AI’s written essays, instead comparing its AI’s output to the “good” essay scores of law students in Maryland.

Again, this new paper from LAION has yet to be peer-reviewed. Even so, it raises important questions about how AI models and products are tested and evaluated – and, ultimately, how they’re brought to market.

More about AI research: AI systems learn to lie and deceive, scientists find
