Self-driving cars still lack common sense. AI chatbot technology could be the answer

A quick search on the internet will yield countless videos showing self-driving car accidents, often eliciting a smile or a laugh. But why do we find this behavior funny? Perhaps because it is in stark contrast to how a human driver would handle similar situations.

Everyday situations that seem trivial to us can still pose a major challenge for self-driving cars. This is because they are designed using engineering methods that are fundamentally different from how the human mind works. However, recent developments in AI have opened up new possibilities.

New AI systems with language capabilities – such as the technology behind chatbots like ChatGPT – could hold the key to enabling self-driving cars to reason and behave more like human drivers.

Autonomous driving research got a huge boost in the late 2010s with the advent of deep neural networks (DNNs), a form of artificial intelligence (AI) that processes data in a way inspired by the human brain. This makes it possible to process images and videos of traffic scenarios to identify “critical elements,” such as obstacles.

To detect these, the system typically computes a 3D bounding box capturing each obstacle’s size, orientation and position. Applied to vehicles, pedestrians and cyclists, this process creates a representation of the world based on classes and spatial properties, including distance and speed relative to the self-driving car.
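A minimal sketch of what such an obstacle representation might look like in code (the class and field names here are illustrative, not taken from any particular driving stack):

```python
from dataclasses import dataclass

# Hypothetical representation of one detected road user: a class label
# plus the spatial properties of its 3D bounding box, all expressed
# relative to the self-driving (ego) car.
@dataclass
class Obstacle3D:
    label: str                            # e.g. "vehicle", "pedestrian", "cyclist"
    position: tuple[float, float, float]  # box centre relative to the ego car, in metres
    size: tuple[float, float, float]      # length, width, height of the box, in metres
    heading: float                        # orientation (yaw) in radians
    velocity: tuple[float, float]         # velocity relative to the ego car, in m/s
```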

This is the basis of the most widely used engineering approach to autonomous driving, known as “sense-think-act”. In this approach, sensor data is first processed by the DNN (“sense”); the resulting obstacle representation is then used to predict obstacle trajectories (“think”); finally, the system plans the car’s next actions (“act”).
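Schematically, the pipeline might look like the sketch below, assuming the hypothetical Obstacle3D class above; each placeholder function stands in for a full DNN or planning component:

```python
def sense(camera_frames, lidar_points) -> list[Obstacle3D]:
    """Perception: run a DNN over raw sensor data to detect obstacles as 3D boxes."""
    return []  # a real system would return detections here

def think(obstacles: list[Obstacle3D]) -> list[list[tuple[float, float]]]:
    """Prediction: estimate a future (x, y) trajectory for each obstacle."""
    return [[] for _ in obstacles]  # one (possibly empty) trajectory per obstacle

def act(trajectories) -> dict:
    """Planning: choose the car's next control action given predicted trajectories."""
    return {"steering": 0.0, "acceleration": 0.0}

def control_step(camera_frames, lidar_points) -> dict:
    obstacles = sense(camera_frames, lidar_points)  # sense
    trajectories = think(obstacles)                 # think
    return act(trajectories)                        # act
```

The strict ordering of the three stages is what makes the approach easy to debug, as the next paragraph notes, but it is also what the brain-inspired critique below takes issue with.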

While this approach offers advantages such as easy debugging, the sense-think-act framework has an important limitation: it is fundamentally different from the brain mechanisms behind human driving behavior.

Lessons from the brain

Much remains unknown about brain function, making it difficult to apply intuition from the human brain to self-driving vehicles. Nevertheless, there are several research efforts that draw inspiration from neuroscience, cognitive science, and psychology to improve autonomous driving.

A long-standing theory suggests that “sense” and “act” are not consecutive but closely linked processes. People perceive their environment in terms of their ability to act upon it.

For example, when preparing to turn left at an intersection, a driver focuses on specific parts of the environment and obstacles relevant to the turn. In contrast, the sense-think-act approach processes the entire scenario independently of current action intentions.

[Image: a Waymo self-driving car in San Francisco]

Another important difference from humans is that DNNs rely primarily on the data they were trained on. When exposed to even a small, unusual variation of a familiar scenario, they can fail or miss important information.

Such rare, underrepresented scenarios, known as “long-tail cases,” pose a major challenge. Current workarounds involve creating ever-larger training datasets, but the complexity and variability of real-life situations make it impossible to cover all possibilities.

As a result, data-driven approaches such as sense-think-act struggle to generalize to unseen situations. Humans, on the other hand, are good at dealing with novel situations.

A general knowledge of the world allows us to assess new scenarios using “common sense”: a mix of practical knowledge, reasoning and an intuitive understanding of how people generally behave, built up from a lifetime of experiences.

In fact, driving is another form of social interaction for humans, and common sense is key to interpreting the behavior of road users (other drivers, pedestrians, cyclists). This ability allows us to make informed judgments and decisions in unexpected situations.

Copying common sense

Reproducing common sense in DNNs has been a major challenge over the past decade, prompting scientists to call for a radical change in approach. Recent AI developments may finally provide a solution.

Large language models (LLMs) are the technology behind chatbots like ChatGPT and have demonstrated remarkable proficiency in understanding and generating human language. Their impressive abilities stem from being trained on large amounts of information across multiple domains, allowing them to develop a form of common sense similar to our own.

More recently, multimodal LLMs such as GPT-4o and GPT-4o-mini have emerged. These models combine language with images and video, integrating extensive world knowledge with the ability to reason about visual input.

These models can understand complex, unforeseen scenarios, provide explanations in natural language, and recommend appropriate actions, offering a promising solution to the long-tail problem.
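As an illustration, querying a multimodal model about a road scene can be a single API call. Here is a minimal sketch using the OpenAI Python SDK; the prompt and image URL are invented for illustration, and this is not code from any production driving system:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "You are assisting a self-driving car. Describe any hazards "
                     "in this dashcam frame and recommend an appropriate action."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/dashcam_frame.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

A real driving system would feed in live camera frames and impose strict time limits on the response, a constraint discussed later in this article.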

In robotics, vision-language-action models (VLAMs) are emerging, which combine linguistic and visual processing with robot actions. VLAMs show impressive early results in controlling robot arms via language instructions.

In autonomous driving, early research focuses on using multimodal models to provide driving commentary and explanations for motion planning decisions. For example, a model might say, “There’s a cyclist in front of me who’s starting to slow down,” giving insight into the decision-making process and increasing transparency. The company Wayve has shown promising early results in deploying voice-enabled self-driving cars at a commercial level.

Future of driving

While LLMs can address long-tail cases, they bring new challenges. Evaluating their reliability and safety is more complex than for modular approaches such as sense-think-act. Every component of an autonomous vehicle, including integrated LLMs, needs to be verified, requiring new test methodologies tailored to these systems.

Furthermore, multimodal LLMs are large and computationally demanding, leading to high latency (a delay between input and response). Self-driving cars require real-time operation, and current models cannot generate responses quickly enough. Running LLMs also requires significant processing power and memory, which clashes with the limited hardware available in vehicles.

Several research efforts are now focused on optimizing LLMs for use in vehicles, but it will likely be a few years before commercially viable LLM-based self-driving vehicles reach the streets.

The future of autonomous driving, however, looks bright. In AI models with language capabilities, we have a solid alternative to the sense-think-act paradigm, which is reaching its limits.

LLMs are widely considered the key to achieving vehicles that can reason and behave more like humans. This advancement is crucial, as approximately 1.19 million people die each year in traffic accidents.

Traffic accidents are the leading cause of death for children and young adults aged 5 to 29. The development of autonomous vehicles with human-like reasoning could significantly reduce these numbers, saving countless lives.

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Alice Plebe does not work for, consult for, own shares in, or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond her academic appointment.
