We humans often lament that we cannot predict the future, but perhaps we don’t give ourselves quite enough credit. With sufficient practice, our short-term predictive skills become truly remarkable.
Driving is a good example, particularly in urban environments. Navigating through a city, you become aware of a colossal number of dynamic aspects in your surroundings: other cars, some moving and some stationary, as well as pedestrians, cyclists, and changing traffic lights. As you drive, your mind generates predictions of how the world around you is likely to unfold: “that car looks likely to pull out in front of me”; “that pedestrian is about to sleepwalk off the sidewalk – be ready to hit the brake”; “the front wheels of that parked car have just turned, so it’s about to move”.
Your power of prediction and anticipation throws a protective buffer zone around you, your passengers, and everyone in your vicinity as you travel from A to B. It is a broad yet very nuanced power, making it incredibly hard to recreate in real-world robotics applications.
Nevertheless, the teams at Zoox have achieved noteworthy success.
The integration of cutting-edge hardware, sensor technology, and bespoke machine learning (ML) approaches has resulted in an autonomous robotaxi that can predict the trajectories of vehicles, people, and even animals in its surroundings, as far as 8 seconds into the future — more than enough to enable the vehicle to make sensible and safe driving decisions.
“Predicting the future — the intentions and movements of other agents in the scene — is a core component of safe, autonomous driving,” says Kai Wang, director of the Zoox Prediction team.
Perceiving, predicting, planning
The AI stack at the center of the Zoox driving system broadly consists of three processes, which occur in order: perception, prediction, and planning. These equate to seeing the world and how everything around the vehicle is currently moving, predicting how everything will move next, and deciding how to move from A to B given those predictions.
The Perception team gathers high-resolution data from the vehicle’s dozens of sensors, which include visual cameras, LiDAR, radar, and longwave-infrared cameras. These sensors, positioned high on the four corners of the vehicle, provide an overlapping, 360-degree field of view that can extend for over a hundred meters. To borrow a popular phrase, this vehicle can see everything, everywhere, all at once.
The robotaxi already carries a detailed semantic map of its environment, called the Zoox Road Network (ZRN), which encodes local infrastructure, road rules, speed limits, intersection layouts, locations of traffic signals, and so on.
Perception quickly identifies and classifies the other cars, pedestrians, and cyclists in the scene, which are dubbed “agents.” And crucially, it tracks each agent’s velocity and current trajectory. These data are then combined with the ZRN to provide the Zoox vehicle with an incredibly detailed understanding of its environment.
Before these combined data are passed to Prediction, they are boiled down to their essentials, into a format optimized for machine learning. What Prediction ultimately operates on is a top-down, spatially accurate graphical depiction of the vehicle and all the relevant dynamic and static aspects of its environment: a machine-readable, bird's-eye representation of the scene with the robotaxi at the center.
“We draw everything into a 2D image and present it to a convolutional neural network [CNN], which in turn determines what distances matter, what relationships between agents matter, and so on,” says Wang.
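To make the idea of drawing the scene into a 2D image concrete, here is a minimal sketch of rasterizing agents into a top-down grid centered on the vehicle. It is an illustration only, not Zoox's implementation: the `Agent` class, the `rasterize` function, the 64-cell grid, and the one-meter cell size are all assumptions, and agent heading is ignored for brevity. The CNN that would consume the grid is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Agent:
    x: float       # meters in the ego frame (ego vehicle at the origin)
    y: float
    length: float  # box extent along x
    width: float   # box extent along y


def rasterize(agents, size=64, meters_per_cell=1.0):
    """Draw axis-aligned agent boxes into a size x size occupancy grid.

    The ego vehicle sits at the grid center. Cells covered by a box are 1.
    """
    grid = [[0] * size for _ in range(size)]
    half = size // 2
    for a in agents:
        # Convert the box edges from meters to cell indices.
        c0 = math.floor((a.x - a.length / 2) / meters_per_cell) + half
        c1 = math.floor((a.x + a.length / 2) / meters_per_cell) + half
        r0 = math.floor((a.y - a.width / 2) / meters_per_cell) + half
        r1 = math.floor((a.y + a.width / 2) / meters_per_cell) + half
        # Clip to the grid so far-away agents are simply dropped.
        for r in range(max(r0, 0), min(r1, size - 1) + 1):
            for c in range(max(c0, 0), min(c1, size - 1) + 1):
                grid[r][c] = 1
    return grid
```

Because every cell corresponds to a fixed patch of road, distances between agents are preserved in the picture, which is what lets a convolutional network reason about spatial relationships.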
Learning from data-rich images
While a human can get the gist of this map, such as the relative positions of all the vehicles (represented by boxes) and pedestrians (different, smaller boxes) in the scene, it is not designed for human consumption, explains Andres Morales, staff software engineer.
“This is not an RGB image. It’s got about 60 channels, or layers, which also include semantic information,” he notes. “For example, because someone holding a smartphone tends to behave differently, we might have one channel that represents a pedestrian holding their phone as a ‘1’ and a pedestrian with no phone as a ‘0’.”
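The channel idea Morales describes can be sketched in a few lines. This is a hedged illustration: the three channel names below are hypothetical stand-ins (the article says only that there are about 60), and `blank_image` and `paint_pedestrian` are invented helpers, not part of any Zoox API.

```python
# Hypothetical channel layout; the real image has about 60 channels.
CHANNELS = ["occupancy", "is_pedestrian", "holding_phone"]


def blank_image(size):
    """Return a channels x size x size image filled with zeros."""
    return [[[0] * size for _ in range(size)] for _ in CHANNELS]


def paint_pedestrian(image, row, col, holding_phone):
    """Mark one pedestrian cell and set the phone channel to 1 or 0."""
    image[CHANNELS.index("occupancy")][row][col] = 1
    image[CHANNELS.index("is_pedestrian")][row][col] = 1
    image[CHANNELS.index("holding_phone")][row][col] = 1 if holding_phone else 0
```

Each channel answers one question about the same grid cell, so a network can learn, for instance, that a pedestrian cell with the phone flag set warrants a wider range of predicted paths.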
From this data-rich image, the ML system produces a probability distribution of potential trajectories for each and every dynamic agent in the scene, from trucks right down to that pet dog milling around near the crosswalk.
These predictions consider not only the current trajectory of each agent, but also include factors such as how cars are expected to behave on given road layouts, what the traffic lights are doing, the workings of crosswalks, and so on.
These predictions typically extend about 8 seconds into the future and are recalculated every tenth of a second as new information arrives from Perception.
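The relationship between the 8-second horizon and the tenth-of-a-second refresh can be worked out directly. The sketch below uses only those two figures from the article; the function names are hypothetical, and nothing here reflects how Zoox schedules its computation.

```python
HORIZON_S = 8.0  # prediction horizon quoted in the article
REFRESH_S = 0.1  # a fresh prediction every tenth of a second


def refreshes_per_horizon(horizon_s=HORIZON_S, refresh_s=REFRESH_S):
    """How many times a prediction is replaced before its horizon elapses."""
    return round(horizon_s / refresh_s)


def prediction_windows(n_cycles, horizon_s=HORIZON_S, refresh_s=REFRESH_S):
    """Return the (start, end) time in seconds covered by each prediction."""
    return [
        (round(k * refresh_s, 3), round(k * refresh_s + horizon_s, 3))
        for k in range(n_cycles)
    ]
```

With these numbers, any single prediction is superseded about 80 times before its own 8-second window has played out, so the vehicle always acts on a forecast that is at most a tenth of a second old.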
These weighted predictions are delivered to the Planner aspect of the AI stack — the vehicle’s executive decision-maker — which uses those predictions to help it decide how the Zoox vehicle will operate safely.
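One way to picture how a planner might consume weighted predictions is to add up the probability of every predicted trajectory that conflicts with a candidate path for the vehicle. The sketch below is an assumption-laden illustration, not the Zoox Planner: `softmax` turns raw scores into weights, `conflict_probability` and the 2-meter clearance are invented for the example, and trajectories are simplified to lists of (x, y) points at matching timesteps.

```python
import math


def softmax(scores):
    """Convert raw trajectory scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conflict_probability(ego_path, agent_paths, probs, clearance_m=2.0):
    """Sum the weight of every agent trajectory that gets too close to ego.

    Paths are compared point by point, so index k in each list is assumed
    to be the same moment in time.
    """
    risk = 0.0
    for path, p in zip(agent_paths, probs):
        if any(math.dist(e, a) < clearance_m for e, a in zip(ego_path, path)):
            risk += p
    return risk
```

For example, if a pedestrian is equally likely to step into the road or stay on the sidewalk, a path that crosses the first trajectory carries a conflict probability of 0.5, which a cautious planner would treat as far too high.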
From perception through to planning, the whole process runs in real time; this robotaxi has lightning-quick reactions, should it need them.
The team can be confident of its predictions because it has a vast pool of data with which to train its ML algorithms: millions of road miles of high-resolution sensor data collected by the Zoox test fleet. These are Toyota Highlanders retrofitted with a sensor architecture almost identical to the robotaxi's, mapping and driving autonomously in San Francisco, Seattle, and Las Vegas.
Zoox has a further advantage.
“We don’t need to label any data by hand, because our data show where things actually moved into the future,” says Wang. “My team doesn’t have a data problem. Our main challenge is that the future is inherently uncertain. Even humans cannot do this task perfectly.”
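Wang's point about not needing hand labels can be illustrated with a short sketch: any logged track can be cut into an observed past and the future that actually followed, and that future serves as the training target. The function name `make_training_pairs` and the step counts are hypothetical; how Zoox actually builds its training sets is not described in the article.

```python
def make_training_pairs(track, past_steps, future_steps):
    """Slice one logged track into (observed past, actual future) pairs.

    `track` is a time-ordered list of logged states, for example positions.
    The future slice is the label, so no human annotation is required.
    """
    pairs = []
    for t in range(past_steps, len(track) - future_steps + 1):
        past = track[t - past_steps:t]
        future = track[t:t + future_steps]
        pairs.append((past, future))
    return pairs
```

A single ten-step track with three steps of history and two steps of future yields six training examples, which hints at how quickly millions of logged miles add up.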
Utilizing graph neural networks
While perfect prediction is, by its nature, impossible, Wang’s team is currently taking steps on several fronts to raise the vehicle’s prediction capabilities to the next level, firstly by leveraging a graph neural network (GNN) approach.
“Think of the GNN as a message-passing system by which all the agents and static elements in the scene are interconnected,” says Mahsa Ghafarianzadeh, senior software engineer on the Prediction team.
“What this enables is the explicit encoding of the relationships between all the agents in the scene, as well as the Zoox vehicle, and how these relationships might develop into the future.”
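The message-passing idea can be sketched with a single round of updates over a small scene graph. This is a toy under stated assumptions: a real GNN applies learned weights to the messages, whereas the fixed 50/50 blend, the function name `message_passing_round`, and the node names used in the test scene are all invented for illustration.

```python
def message_passing_round(states, edges):
    """Run one message-passing round over a directed scene graph.

    `states` maps node name to a feature vector (list of floats).
    `edges` is a list of (source, destination) pairs. Each node averages
    the states sent by its neighbors and blends the result with its own.
    """
    incoming = {node: [] for node in states}
    for src, dst in edges:
        incoming[dst].append(states[src])

    updated = {}
    for node, own in states.items():
        messages = incoming[node]
        if not messages:
            updated[node] = list(own)  # no neighbors, state unchanged
            continue
        mean = [sum(vals) / len(messages) for vals in zip(*messages)]
        # Fixed blend stands in for the learned update of a real GNN.
        updated[node] = [0.5 * o + 0.5 * m for o, m in zip(own, mean)]
    return updated
```

After one round, each node's state already reflects its immediate neighbors; repeating the round lets influence spread further, so an agent two hops away can still shape a prediction.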
To give an everyday example, imagine yourself walking down the middle of a long corridor and seeing a stranger walking toward you, also in the middle of the corridor. That act of seeing each other is effectively the passing of a tacit message that would likely cause you both to alter your course slightly, so that by the time you reach each other, you won’t collide or require a sharp course-correction. That’s human nature.
So this GNN approach results in the prediction of more natural behaviors between everyone around the Zoox vehicle, because the algorithm, through training on Zoox’s vast pool of real-world road data, is better able to model how agents, on foot or in cars, affect each other’s behavior in the real world.
Another way the Prediction team is improving accuracy is by embracing the fact that what you do as a driver affects other drivers, which in turn affects you. For example, if you get into your parked car and pull out just a little into busy traffic, a driver coming up the road behind you may slow down or stop to let you out, or they may drive straight past, obliging you to wait for a better opportunity.
“Prediction doesn’t happen in a vacuum. Other people’s behaviors are dependent on how their world is changing. If you’re not capturing that within prediction, you’re limiting yourself,” says Wang.
Next steps
Work is now underway to integrate Prediction even more deeply with Planner, creating a feedback loop. Instead of simply receiving predictions and making a decision on how to proceed, the Planner can now interact with Prediction along these lines: “If I perform action X, or Y, or Z, how are the agents in my vicinity likely to adjust their own behavior in each case?”
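The "if I perform action X, or Y, or Z" query can be sketched as a loop over candidate actions, each one fed to a predictor that is conditioned on it. Everything below is hypothetical: the action names, the invented numbers inside `predicted_yield_probability` (a stand-in for a learned model), and the 0.9 threshold are assumptions chosen to echo the parked-car example earlier in the article.

```python
def predicted_yield_probability(ego_action, gap_s):
    """Toy conditional predictor: chance the approaching driver yields.

    The numbers are invented. A larger time gap and a more assertive ego
    action both make yielding more likely in this toy model.
    """
    assertiveness = {"creep_forward": 0.3, "pull_out": 0.5}[ego_action]
    return min(1.0, assertiveness + 0.1 * gap_s)


def choose_action(gap_s, required=0.9):
    """Ask the predictor about each action and pick the first safe one.

    Actions are tried from most to least progress; if none is predicted
    to make the other driver yield reliably, the vehicle waits.
    """
    for action in ("pull_out", "creep_forward"):
        if predicted_yield_probability(action, gap_s) >= required:
            return action
    return "wait"
```

The design point is that the prediction is no longer a fixed input: the planner receives a different forecast for each action it is considering and can weigh them against one another.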
In this way, the Zoox robotaxi will become even more naturalistic and adept at negotiations with other vehicles, while also creating a smoother-flowing ride for its customers.
“The team and I started to work on this new mode a couple years ago, just as a research project,” says Morales, “and now we’re focused on its integration, ironing everything out, reducing latency, and generally making it production-ready.”
The ever-increasing sophistication of the Zoox robotaxi’s predictive abilities is a clear source of pride for the team dedicated to it.
“I’ve been in this team for over five years. I’ve seen Prediction grow from being just three source code files implementing basic heuristics to predict trajectories to where it is now, at the cutting edge of deep learning. It’s incredible how fast everything is evolving,” says Ghafarianzadeh.
Indeed, at this rate, the Zoox robotaxi may ultimately become the most prescient vehicle on the road. Though that prediction comes with the usual caveat: Nobody can perfectly predict the future.