The Famous Byzantine Generals Problem Is Useful For Achieving Self-Driving Cars

Dr. Lance Eliot, AI Insider


[Ed. Note: For readers interested in Dr. Eliot’s ongoing business analyses about the advent of self-driving cars, see his online Forbes column:]

Let’s examine the topic of things that work only intermittently, which as you’ll soon see is a crucial topic for intelligently designing and building AI systems, especially for self-driving autonomous cars.

First, a story to illuminate the matter.

I went camping recently and the flashlight that I brought along was only working intermittently, so I shook it to get the bulb to shine, hoping to cast some steady light.

One moment the flashlight had a nice strong beam and the next moment it was faded and not of much use.

I’m guessing that you’ve probably had a similar circumstance with a flashlight, whereby sometimes it wants to work and sometimes not.

Of course, this kind of intermittent performance is not confined to flashlights.

Various mechanical contraptions can haunt us with intermittent performance, whether it might be a flashlight, a washing machine, a hair dryer, etc.

Introducing The Byzantine Generals Problem

The crux is that you might sometimes find yourself immersed in a system that has aspects that are not working as you imagined they would.

Is this because those other elements are purposefully doing so, or is it by happenstance?

In whatever manner it is occurring, what can you do to rectify the situation?

Are you even ready in case a system that you are immersed in suddenly begins to have such difficulties?

Welcome to the Byzantine Generals Problem.

The problem was first introduced in a 1982 paper, aptly entitled “The Byzantine Generals Problem,” that appeared in the Transactions on Programming Languages and Systems, published by the esteemed Association for Computing Machinery (ACM). There are numerous variants of the now-classic problem and of what to do about it.

It is a commonly described and taught problem in computer science classes and covers an important topic that anyone involved in systems development, especially real-time systems development, should be aware of.

It has to do with fault-tolerance.

You might have a system that contains elements or sub-components that might at one point or another suffer a fault.

Upon having a fault, the element or sub-component might not make life easy for you by simply failing outright and stopping.

In a sense, if an element stops working completely, you are in an “easier” diagnostic situation in that you can perhaps declare that element or sub-component “dead” and no longer usable, versus the more tortuous route of having an element or sub-component that kind of works but not entirely so.

What can be particularly trying is a situation of an element or sub-component that intermittently works.

In that case, you need to figure out how to handle something that might or might not work when you need it to work. If my flashlight had not worked at all, I would have assumed that the batteries were dead or that the wiring was bad, and I would not have toyed with the flashlight at all, figuring it was beyond hope. But, since the flashlight was nearly working, I was hopeful of trying to deal with the faults and see what could be done.
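To make the distinction between a component that has stopped outright and one that works only some of the time a bit more concrete, here is a minimal sketch. It assumes each component can run some kind of pass/fail self-check; the `HealthTracker` class, the window size, and the status labels are hypothetical names chosen for illustration, not part of any particular self-driving car system.

```python
from collections import deque


class HealthTracker:
    """Classifies a component from its most recent pass/fail self-checks.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window=10):
        # Keep only the most recent `window` results (True = check passed).
        self.results = deque(maxlen=window)

    def record(self, ok):
        self.results.append(bool(ok))

    def status(self):
        if not self.results:
            return "unknown"
        passed = sum(self.results)
        if passed == len(self.results):
            return "healthy"
        if passed == 0:
            return "fail-stop"      # nothing has worked within the window
        return "intermittent"       # the flashlight case: sometimes on, sometimes off


flashlight = HealthTracker(window=4)
for beam_ok in [True, False, True, True]:
    flashlight.record(beam_ok)
print(flashlight.status())  # intermittent
```

Keep in mind that a tracker like this only sees what the self-check reports, so a component that “lies” about its own health would sail right through it.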

You could of course declare outright that any element or sub-component that falters is considered “dead” and therefore you will henceforth pretend that it is. In the case of my flashlight, due to the intermittent nature of it, I might have just put it back into my backpack and decided that it was not worth playing around with. Sure, it did still kind of work, but I might decide to declare it dead and finished.

The downside there is that I’ve given up on something that still has some life to it, and therefore some practical value. Plus, there is an outside chance that it might opt to start working correctly, doing so all of the time and no longer be intermittent. And, there’s a chance that I might be able to play with it and get it to work properly, even if it won’t happen to do so of its own accord.

For the flashlight, I wasn’t sure what was the source of the underlying problem. Was it the batteries? Was it the wiring? Was it the bulb? This can be another difficulty associated with faults in a system. You might not know or readily be able to discern where the fault exists. You might know that overall the system isn’t working as intended, but the specific element or sub-component that is causing the trouble might be hidden or buried within lots of other elements or sub-components.

Intermingling Of Faults

One fault can at times intermingle with another fault.

This makes things doubly challenging.

It’s usually easiest (and often naive) to think that you can find “the one” element that is causing the difficulty and then deal with that element only.

In real life, it is often the case that you end up with several elements or sub-components at fault.

You can have what are considered “error avalanches” that cascade through a system and are due to one or more elements or sub-components that are suffering faults.

Remember too that a fault does not imply that the element or sub-component won’t work at all.

The faulty element can do its function in a half-baked way. If the batteries in the flashlight were low on energy, they were perhaps only able to provide enough of a charge to light the flashlight part of the time. They apparently weren’t completely depleted of their charge, since the flashlight was at least still partially able to light-up.

A fault can be even more inadvertently devious in that the element might not merely function in a half-baked way and might instead provide false or misleading outputs, not necessarily because it is purposely trying to lie to you.

I am not ruling out that an element or sub-component might intentionally lie. I am merely emphasizing that a fault can cause an element or sub-component to “lie” without any such intent, just as it is possible that the element or sub-component is purposely lying.

About Byzantine Fault Tolerance

Byzantine Fault Tolerance (BFT) is the notion that you need to design a system to be able to contend with so-called Byzantine faults. These are faults that might or might not involve an element or sub-component going entirely dead (known as fail-stop). The fault could instead allow the element or sub-component to keep functioning but in a half-baked way, or it might do worse and actually “lie” or distort whatever it is supposed to do.

And this can occur to any of the elements or sub-components, at any time, and intermittently. It can occur to only one element or sub-component at a time or might encompass multiple elements or sub-components that are each flickering in and out of functioning properly.

Why is this known as the Byzantine Generals Problem?

In the original 1982 setup of this intriguing “thought experiment” problem, the researchers proposed that you might have military generals in the Byzantine army that are trying to take a city or fort. Suppose that the generals will need to coordinate their attack and will be coming at the city or fort from different angles. The timing of the attack has to be done just right. They need to attack at the same time to effectively win the battle.

We’ll pretend that the generals can only communicate a simplistic message that says either to attack or to retreat. If you were a general, you would wait to see what the other generals have to say. If they are saying to attack, you would presumably attack too. If they say to retreat, you would presumably retreat too. The generals are not able to directly communicate with each other (because they didn’t have cell phones in those days, ha!), and instead they use their respective lieutenants to pass messages among the generals.

You can likely guess that the generals are our elements or sub-components of a system, and we can consider the lieutenants to be elements too, though one way to treat the lieutenants in this allegory is as messengers rather than purely as traditional elements of the system. I don’t want to make this too messy and long here, so I’ll keep things simpler. One aspect though to keep in mind is that a fault might occur not just in the functional items of interest, but it might also occur in the communicating of their efforts. The starter in my car might work perfectly and it is only the wire that connects it to the rest of the engine that has the fault (it’s the messenger that is at fault). That kind of thing.

Suppose that one or more of the generals is a traitor. To undermine the attack, the traitorous general(s) might send an attack message to some of the generals and simultaneously send a retreat message to others. This could then induce some of the generals into attacking and yet they might not be sufficient in numbers to win and take the city or fort. Those generals attacking might get wiped out. The loyal generals would be considered non-faulty, and the traitorous generals would be considered “faults” in terms of how they are functioning.

There are all kinds of proposed solutions to dealing with the Byzantine Generals Problem.

You can mathematically describe the situation and then try to show a mathematical solution, along with providing handy rules-of-thumb about it. For example, depending upon how you describe and restrict the nature of the problem, you could say that in certain situations, as long as fewer than a third of the participants are traitors, you can provide a method to deal with the traitorous acts (this comes from a mathematical formulation of n > 3t, wherein t is the number of traitors and n is the number of generals).
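To see the n > 3t rule in action, here is a small sketch of the recursive “oral messages” style of solution, in which a commanding general sends an order and the other generals relay to each other what they heard and then take a majority vote. This is my own simplified rendering, not a definitive implementation. The traitor behavior in `send` (attack to even-numbered receivers, retreat to odd-numbered ones) is an assumption picked so the outcome is repeatable, the tie-breaking default of retreat is likewise an assumption, and all function names are hypothetical.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def max_traitors(n):
    """Largest t that still satisfies n > 3t for n generals."""
    return (n - 1) // 3


def majority(values, default=RETREAT):
    """Most common value; a tie falls back to a fixed default order."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def send(sender, receiver, value, traitors):
    """Loyal senders relay faithfully; traitors follow a fixed lying pattern
    (an assumption made here so that the example is deterministic)."""
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run m rounds of relaying; returns {lieutenant: decided order}."""
    received = {lt: send(commander, lt, value, traitors) for lt in lieutenants}
    if m == 0:
        return received
    # Each lieutenant relays what it heard to all of the other lieutenants.
    relayed = {}
    for j in lieutenants:
        others = [lt for lt in lieutenants if lt != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Each lieutenant votes over its own copy plus the relayed copies.
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


# Four generals, one traitorous lieutenant (n = 4, t = 1, so n > 3t holds).
print(om(1, 0, [1, 2, 3], ATTACK, traitors={3}))
# Three generals, one traitorous lieutenant (n = 3, t = 1, so n > 3t fails).
print(om(1, 0, [1, 2], ATTACK, traitors={2}))
```

In the four-general run, the loyal lieutenants 1 and 2 both settle on the loyal commander's order to attack. In the three-general run, loyal lieutenant 1 ends up with one vote each way and falls back to retreat, contradicting the loyal commander, which is the kind of breakdown the n > 3t condition is meant to rule out.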

Byzantine Generals Problem And AI Autonomous Cars

What does this have to do with AI self-driving autonomous cars?

At the Cybernetic AI Self-Driving Car Institute, we are developing AI software for self-driving cars. The AI for a self-driving car is a real-time system and has hundreds upon hundreds if not thousands of elements or sub-components.

Some estimates suggest that the software for a self-driving car might amount to well over 250 million lines of code (though lines of code is a problematic metric).

The auto makers and tech firms crafting such complex real-time systems need to make sure they are properly taking into account the nature of Byzantine Fault Tolerance.

Bluntly, an AI self-driving car is a real-time system that involves life-or-death matters and must be able to contend with a wide variety of faults that can happen at the worst of times. Keep in mind that an AI self-driving car could ram into a wall or crash into another car, either of which might happen because the AI system itself suffered an internal fault and the fault-tolerance was insufficient to safely keep the self-driving car from getting into a wreck.

Dealing With Faults In AI Autonomous Car Systems

Returning to the Byzantine Fault Tolerance matter, let’s consider the various aspects of an AI self-driving car and how it needs to be designed and developed to contend with a myriad of potential faults.

Let’s start with the sensors.

An AI self-driving car has numerous sensors, including cameras, radar, ultrasonic, LIDAR, and other sensory devices. Any of those sensors can experience a fault. The fault might involve the sensor going “dead” and into a fail-stop state. Or, the fault might cause the sensor to report only partial data or only a partial interpretation of the data collected by the sensor. Worse still, the fault could cause the sensor to “lie” about the data or its interpretation of the data.

When I use the word “lying” it is not intended herein to imply necessarily that someone has been traitorous and gotten the sensor to purposely lie about what data it has or the interpretation of the data. I’m herein instead suggesting that the sensor might provide false data that doesn’t exist, or provide real data that has been changed to falsely represent the original data, or provide an interpretation of the data that maybe originally would have said one thing but instead gave something completely contrary. This could occur by happenstance due to the nature of the fault.

Those could also of course be purposeful and intentional “lies.” Suppose a nefarious person has hacked into the AI self-driving car and forced the sensors to internally tell falsehoods.

Or, maybe the bad-hat hacker has planted a computer virus that causes the sensors to tell falsehoods. The virus might not even be forcing the sensors to do so and instead be working as a man-in-the-middle attack that takes whatever the sensors report, blocks the messages, substitutes its own messages of a contrary nature, and sends them along. It could be that the AI self-driving car has been attacked by an outsider, or it could be that even an insider that aided the development of the AI self-driving car had implanted a virus that would at some future time become engaged.
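One common way to make a message substitution of that kind detectable is to have the sender attach an authentication tag that the receiver can check. Here is a minimal sketch using Python's standard `hmac` module. It assumes the sensor and the receiving component already share a secret key (how that key gets provisioned is outside the sketch), and the function names and the payload format are hypothetical.

```python
import hashlib
import hmac


def sign_reading(key: bytes, payload: bytes) -> bytes:
    """Tag a sensor payload with an HMAC-SHA256 computed under a shared key."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify_reading(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_reading(key, payload), tag)


# Hypothetical key and payload, for illustration only.
key = b"shared-secret-provisioned-at-build"
payload = b"camera_front:pedestrian=1"
tag = sign_reading(key, payload)

print(verify_reading(key, payload, tag))                       # True
print(verify_reading(key, b"camera_front:pedestrian=0", tag))  # False
```

Note that this only guards the messenger. If the sensor itself is faulty or has been compromised, it will happily sign a falsehood, so a check like this complements cross-checking among sensors rather than replacing it.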

Overall, the AI needs to protect itself from itself.

The AI developers should have considered beforehand the potential for faults occurring with the various elements and sub-components of the AI system. There should have been numerous checks-and-balances included within the AI system to try and detect the faults. Besides detecting the faults, there need to be systematic ways in which the faults are then dealt with.

In the case of the sensors, pretend that one of the cameras is experiencing a fault. The camera is still partially functioning. It is not entirely “dead” or at a fail-stop status. The images are filled with noise and it makes the images occluded or confused looking. The internal system software that deals with this particular camera does not realize that the camera is having troubles. The troubles come and go, meaning that at one moment the camera is providing pristine and accurate images, while the next moment it does not.
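As a rough illustration of how software sitting next to the camera might notice such come-and-go trouble, here is a minimal sketch that scores each frame for pixel-to-pixel jumpiness and flags the frames that exceed a threshold. It treats a frame as a plain list of rows of grayscale values from 0 to 255, and the `noise_score` measure, the threshold, and the function names are all assumptions for illustration. A real scene with lots of fine texture would also score high, so this is a crude proxy and not a true noise estimator.

```python
def noise_score(frame):
    """Mean absolute difference between horizontally adjacent pixels.

    `frame` is a list of rows of grayscale values (0-255).
    """
    diffs = [
        abs(row[i + 1] - row[i])
        for row in frame
        for i in range(len(row) - 1)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def flag_noisy_frames(frames, threshold=40.0):
    """Return one True/False flag per frame (True = suspiciously noisy)."""
    return [noise_score(frame) > threshold for frame in frames]


clean = [[10, 12, 14, 16], [10, 12, 14, 16]]      # smooth gradient
static = [[0, 255, 0, 255], [255, 0, 255, 0]]     # heavy static
print(flag_noisy_frames([clean, static, clean]))  # [False, True, False]
```

A run of flags that flips back and forth like this is the signature of an intermittent fault, as opposed to a camera that has gone entirely dead.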

Let’s say we’ve previously put in place a Machine Learning (ML) component that has been trained to be able to detect pedestrians. After having scrutinized thousands and thousands of street scenes with pedestrians, the Machine Learning algorithm using an Artificial Neural Network has gotten pretty good at picking out the shape of a pedestrian in even crowded street scenes. It does so with a rather high reliability.

The Machine Learning component gets fed a lousy camera image that has been populated with lots of static and noise, due to the subtle fault in the camera. This has made the portion that has a pedestrian in it very hazy and fuzzy, and the ML is unable to detect a pedestrian to any significant probability. The ML reports this to the sensor fusion portion of the AI system.

We now have a situation wherein a pedestrian exists in the street ahead of us, but the interpretation of the camera scene has indicated that there is not a pedestrian there. Is the Machine Learning component lying? In this case, it has done its genuine job and concluded that there is not a pedestrian there. I suppose we would say it is not lying per se. If it had been implanted with a computer virus that caused it to intentionally ignore the presence of a pedestrian and misreport as such to the rest of the AI, we might then consider that to be a lie.

One should be asking why the system element that drives the camera has not yet detected that the camera has a fault. Furthermore, we might expect the ML element to be suspicious of images that have static and noise, though of course that could be happening a lot of the time in a more natural manner that has nothing to do with faults. Presumably, once the interpretation reaches the sensor fusion portion of the AI system, the sensor fusion will try to triangulate the accuracy and “honesty” of the interpretation by comparing to the other sensors, including other cameras, radar, LIDAR, and the like.

You could liken the various sensors to the generals in the Byzantine Generals Problem. The sensor fusion must try to ferret out which of the generals (the sensors) are being truthful and which are not, though it is not quite so straightforward as a simple attack versus retreat kind of message. Instead, the matter is much more complex involving where objects are in the surrounding area and whether those objects are near to the AI self-driving car, or whether they pose a threat to the self-driving car, or whether the self-driving car poses a threat to them. And so on.
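A heavily simplified version of that ferreting out can be sketched as follows. It assumes each sensor pipeline has already been boiled down to a single confidence between 0 and 1 that a pedestrian is present, which glosses over all of the complexity just mentioned. The sensor names, the tolerance value, and the `fuse_detections` function are hypothetical.

```python
from statistics import median


def fuse_detections(confidences, tolerance=0.3):
    """Fuse per-sensor pedestrian confidences and flag outliers.

    `confidences` maps a sensor name to a confidence in [0, 1].
    Returns (fused_confidence, sorted list of suspect sensor names).
    """
    fused = median(confidences.values())
    suspects = sorted(
        name
        for name, confidence in confidences.items()
        if abs(confidence - fused) > tolerance
    )
    return fused, suspects


# The faulty camera misses the pedestrian; radar and LIDAR both see one.
readings = {"camera_front": 0.05, "radar": 0.90, "lidar": 0.85}
print(fuse_detections(readings))  # (0.85, ['camera_front'])
```

Using the median rather than the average means that, with three sensors, a single sensor reporting an arbitrary value cannot drag the fused answer outside the range spanned by the other two. It does nothing for you if two of the three are wrong in the same direction.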

The sensor fusion then reports to the virtual world model update component of the AI system. The virtual world model updater code would place a marker in the virtual world as to where the pedestrian is standing, though if the sensors misreported the presence of the pedestrian and the sensor fusion did not catch the fault, the virtual world model would now misrepresent the world around it. The AI action planner would then not realize a pedestrian is nearby.

The AI action planner might not issue car control commands to maneuver the car away from the pedestrian. The pedestrian might get run over by the AI self-driving car, all stemming from a subtle fault in a camera. This is a fault that the AI system should have been able to catch had it been better designed and constructed. There should have been other means established to deal with a potentially faulty sensor.


The tale of the Byzantine Generals Problem is helpful to serve as a reminder that modern day real-time systems need to be built with fault tolerance.

There are some AI developers that came from a university or research lab that might not have been particularly concerned with fault tolerance since they were devising experimental systems to explore new advances in AI.

When shifting such AI systems into everyday use, it is crucial that fault tolerance be baked into the very fabric of the AI system.

We are going to have the emergence of AI self-driving cars that will be on our streets and will be operating fully unattended by a human driver. We rightfully should expect that fault tolerance has been given a top priority for these real-time systems that are controlling multi-ton vehicles. Without proper and appropriate fault tolerance, the AI self-driving car you see coming down the street could go astray due to a subtle fault in some hidden area of the AI.

An error avalanche could allow the fault to cascade to a level that the AI self-driving car then gets into an untoward incident and human lives are jeopardized.

One of the greatest emperors of the Byzantines was Justinian I, and it is claimed that he had said that the safety of the state was the highest law.

For those AI developers involved in designing and building AI self-driving car systems, I hope that you will abide by Justinian’s advice and dutifully include Byzantine Fault Tolerance, or the equivalent thereof, so that safety receives the highest attention in your AI system.

Consider that an order by Roman Law, per the Codex Justinianus.


Copyright © 2019 Dr. Lance B. Eliot

Written by

Dr. Lance B. Eliot is a renowned global expert on AI, Stanford Fellow at Stanford University, was a professor at USC, headed an AI Lab, top exec at a major VC.
