At the headquarters of Google DeepMind, an artificial intelligence lab in London, researchers have a set ritual for announcing major results: They strike a large ceremonial gong.

In 2016, the gong sounded for AlphaGo, an AI system that defeated a human world champion at the game of Go. In 2017, the gong sounded again as AlphaZero conquered chess, defeating the strongest existing chess program. On each occasion, the algorithm had bested the reigning champion of its game.

Last week, DeepMind researchers pulled out the gong again to celebrate what Alex Davies, a leader of Google DeepMind’s mathematics initiative, described as a “huge breakthrough” in mathematical reasoning by an AI system. A pair of Google DeepMind models tried their luck at the problems posed in the 2024 International Mathematical Olympiad (IMO), held July 11-22 about 100 miles west of London at the University of Bath. The event was billed as the premier mathematics competition for the world’s “smartest mathematicians,” according to a promotional post on social media.

The human problem solvers — 609 high school students from 108 countries — won 58 gold medals, 123 silver and 145 bronze. The AI performed at the level of a silver medalist, solving four of the six problems for a total of 28 points. It was the first time that an AI had achieved a medal-worthy performance on Olympiad problems.

“It’s not perfect, we haven’t solved everything,” Pushmeet Kohli, Google DeepMind’s vice president of research, said in an interview. “We want to be perfect.”

Nevertheless, Kohli described the result as a “phase transition” – a transformative change – “in the use of AI in mathematics and the ability of AI systems to do mathematics.”

The lab asked two independent experts to judge the AI’s performance: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields Medalist who has been interested in the interplay of mathematics and AI for 25 years; and Joseph Myers, a software developer at Cambridge. Both won IMO gold in their time. Myers chaired this year’s problem selection committee and has served as a coordinator at previous Olympiads, judging human solutions. “I tried to grade the AI attempts consistently with how the human attempts were graded this year,” he said.

Gowers added in an email: “I was absolutely blown away.” The lab had discussed its Olympiad ambitions with him a few weeks in advance, so “my expectations were pretty high,” he said. “But the program met those expectations and in one or two cases significantly exceeded them.” The program found the “magic keys” that unlocked the problems, he said.

**Striking the gong**

After months of intensive training, the students sat for two exams, one on each of two days, with three problems apiece drawn from algebra, combinatorics, geometry and number theory.

The AI counterpart was working more or less simultaneously in the London lab. (The students didn’t know that Google DeepMind was participating, in part because the researchers didn’t want to steal the show.) Researchers moved the gong into the room where they had gathered to watch the system at work. “Every time the system solved a problem, we would bang the gong to celebrate,” said David Silver, a researcher.

Haojia Shi, a student from China, was ranked No. 1 and was the only contestant to receive a perfect score — 42 points for six problems; each problem is worth seven points for a complete solution. The U.S. team won first place with 192 points; China came in second with 190.

The Google system earned its 28 points for fully solving four problems — two in algebra, one in geometry and one in number theory. (It flopped on two combinatorics problems.) The system was given unlimited time; some problems took up to three days. Students were given just 4.5 hours per exam.

For the Google DeepMind team, speed is secondary to overall success, because “it’s really a matter of how much computing power you’re willing to put into these things,” Silver said.

“The fact that we’ve reached this threshold where it’s possible to tackle these problems at all is what represents a step change in the history of mathematics,” he added. “And hopefully it’s not just a step change in the IMO, but it also represents the point where we went from computers that could only prove very, very simple things to computers that could prove things that humans can’t.”

**Algorithmic ingredients**

Applying AI to mathematics has been part of DeepMind’s mission for years, often in collaboration with leading mathematical researchers.

“Math requires this interesting combination of abstract, precise, and creative reasoning,” Davies said. This repertoire of skills, he noted, is partly why math is a good litmus test for the ultimate goal: achieving so-called artificial general intelligence, or AGI, a system with capabilities ranging from emergent to competent to virtuoso to superhuman. Companies like OpenAI, Meta AI, and xAI are working toward similar goals.

Mathematics problems from the Olympiad are now considered the benchmark.

In January, a Google DeepMind system called AlphaGeometry solved a sample of Olympiad geometry problems to nearly the level of a human gold medalist. “AlphaGeometry 2 has now surpassed the gold medalists in solving IMO problems,” Thang Luong, the lead researcher, said in an email.

Building on that momentum, Google DeepMind ramped up its multidisciplinary Olympiad effort, with two teams: one led by Thomas Hubert, a research engineer in London, and another led by Luong and Quoc Le in Mountain View, each with about 20 researchers. For his “superhuman reasoning team,” Luong said he recruited a dozen IMO medalists — “by far the highest concentration of IMO medalists at Google!”

The lab’s entry at this year’s Olympiad used the improved version of AlphaGeometry. Not surprisingly, the model did quite well on the geometry problem, completing it in 19 seconds.

Hubert’s team developed a new model that is similar, but more generalized. Called AlphaProof, it is designed to tackle a wide range of mathematical topics. Overall, AlphaGeometry and AlphaProof used a number of different AI technologies.

One approach was an informal reasoning system, expressed in natural language. This system used Gemini, Google’s large language model, and was trained on a corpus of published problems, proofs and similar mathematical writing in English.

The informal system excels at identifying patterns and suggesting what comes next; it is creative and talks about ideas in an understandable way. Of course, large language models are prone to making things up — which may fly for poetry, but certainly not for mathematics. In this context, however, the LLM seems to have shown restraint; it was not immune to hallucinations, but their frequency was reduced.

Another approach was a formal reasoning system, based on logic and expressed in code. It used a theorem prover and proof assistant software called Lean, which guarantees that if the system says a proof is correct, it is correct. “We can check exactly whether the proof is correct or not,” Hubert said. “Every step is guaranteed to be logically correct.”
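A formal proof in Lean is simply code that the proof checker either accepts or rejects. As a rough illustration — this is a generic textbook-style example, not output from the DeepMind system — a short Lean 4 theorem looks like this:

```lean
-- A minimal Lean 4 proof (illustrative; not AlphaProof output).
-- Lean accepts this file only if every step is logically valid.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If any step were wrong, Lean would reject the file outright, which is what lets researchers trust a machine-generated proof without reading it line by line.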

Another key component was a reinforcement learning algorithm in the AlphaGo and AlphaZero lines. This type of AI is self-learning and can scale infinitely, said Silver, who is vice president of reinforcement learning at Google DeepMind. Because the algorithm doesn’t need a human teacher, it can “learn and keep learning and keep learning until it can eventually solve the hardest problems that humans can solve,” he said. “And then maybe one day go beyond that.”

Hubert added: “The system can rediscover knowledge for itself.” That’s what happened with AlphaZero: It started with zero knowledge, Hubert said, “and by just playing games and seeing who wins and who loses, it was able to rediscover all the knowledge of chess. It took us less than a day to rediscover all the knowledge of chess, and about a week to rediscover all the knowledge of Go. So we thought, let’s apply this to mathematics.”
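The self-play loop that Silver and Hubert describe can be sketched at toy scale. The following Python sketch is vastly simpler than AlphaZero, and every name and parameter in it is invented for illustration: an agent learns the game of Nim (take one or two stones; whoever takes the last stone wins) purely from win/loss signals in games against itself, with no human-provided strategy.

```python
import random

random.seed(0)  # deterministic runs for the example

def train(pile_size=10, episodes=30000, alpha=0.1, epsilon=0.1):
    """Tabular Monte-Carlo self-play for Nim. The only learning
    signal is who won each self-played game -- no human teacher."""
    Q = {}  # Q[(stones_left, move)] -> estimated value of that move
    for _ in range(episodes):
        stones = pile_size
        history = []  # (state, move) pairs, players alternating each ply
        while stones > 0:
            moves = [m for m in (1, 2) if m <= stones]
            if random.random() < epsilon:
                move = random.choice(moves)  # occasionally explore
            else:
                move = max(moves, key=lambda m: Q.get((stones, m), 0.0))
            history.append((stones, move))
            stones -= move
        # The player who took the last stone won. Walk backward,
        # crediting +1 to the winner's moves and -1 to the loser's.
        reward = 1.0
        for state, move in reversed(history):
            old = Q.get((state, move), 0.0)
            Q[(state, move)] = old + alpha * (reward - old)
            reward = -reward  # flip perspective each ply
    return Q

Q = train()
# After training, taking the last 2 stones is rated a sure win, and from
# 4 stones the agent prefers taking 1 (leaving the opponent a losing 3).
```

The agent starts with zero knowledge of the game and, in the same spirit Hubert describes, rediscovers its winning strategy from the outcomes of its own play.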

Gowers isn’t too worried about the long-term consequences. “You can imagine a situation where mathematicians have nothing to do,” he said. “That would be the case if computers got better and much faster at everything mathematicians do now.”

“There seems to be a long way to go before computers are able to do research-level mathematics,” he added. “It’s a pretty safe bet that if Google DeepMind can solve at least some hard IMO problems, a useful research tool can’t be that far away.”

A truly proficient tool could make mathematics accessible to more people, speed up the research process, push mathematicians off the beaten track. Ultimately, it might even spark new ideas that resonate.

© 2024 The New York Times Company