On June 4, 1996, the maiden flight of the Ariane 5 satellite launcher ended spectacularly after only forty seconds, with bits of the vehicle, the product of a development program costing roughly $7 billion, and its payload spread over a fairly large part of French Guiana. The report issued July 19 by the International Inquiry Board noted that recovery of material proved difficult, as the area is nearly all mangrove swamp or savanna. The Board also noted that the fiery crash was due to a "chain of technical events." The details of that particular chain of events bear reviewing here.
This incident is one of a wide class of famous failures based on precision problems. One such problem was responsible for the Patriot missile failure in the Gulf War in 1991 that caused the death of twenty-eight people. Another led to the Bank of New York having to borrow $24 billion for a day from the Federal Reserve in 1985. The interesting thing about precision problems is that they are generally statically detectable and are therefore highly avoidable.
In the case of Ariane 5, the programmers had arranged the code so that a 64-bit floating-point number was shoehorned into a 16-bit integer. This is not an easy feat in Ada, the programming language used in the Ariane's software, as one has to manually override the compiler's objections to achieve it. Yet the programmers chose to do so not once but seven times. Of the seven overrides, only four were protected against the possibility of overflow. The other three were left unprotected because the programmers thought they could never overflow.
They were wrong.
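The difference between a protected and an unprotected conversion can be sketched in a few lines. What follows is a minimal illustration in Python rather than Ada, and it is not the flight code: the function names are hypothetical, and clamping to the 16-bit limits is only one plausible form of protection, assumed here for the sake of the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def to_int16_unprotected(x: float) -> int:
    """Convert with no guard: an out-of-range value raises an error.

    This mimics a range-checked conversion whose failure nobody planned for.
    """
    n = int(x)  # truncates toward zero; x is assumed to be finite
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x!r} does not fit in 16 bits")
    return n


def to_int16_protected(x: float) -> int:
    """Convert with a guard: out-of-range values are clamped (assumed policy)."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))
```

With these definitions, to_int16_protected(40000.0) yields 32767, while to_int16_unprotected(40000.0) raises an error that some caller must be prepared to handle.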
Two Wrongs Don't Make a Flight
The offending piece of software was actually reused from Ariane 4. (Reuse was also implicated in the tragic software failure in the Therac-25 radiation therapy machine, leading to the death of three people due to severe radiological overdose.) In fact, this recycled piece of software had no relevance to the flight of Ariane 5, any possible usefulness ending at the point of lift-off. The program continued to run, however, and approximately thirty-seven seconds into the flight, the 16-bit integer overflowed. That is, at that point the software attempted to store a value too big to fit.
Now in a sloppier language like C, the program would have continued happily humming along and, in all probability, would not have interfered with the flight. However, the Ada language is made of sterner stuff. Faced with this run-time problem, the program raised an exception to invoke special exception-handling code, as any reasonable language should in such a situation. However, the programmers did not handle the exception, because they assumed that the program was correct until proved at fault; this was apparently a feature of the programming culture for this system (an observation worth an article in itself). The default action was, regrettably, to shut the system down, including other components that were critical.
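The contrast between silent continuation and a raised exception can be made concrete with a small sketch. It is written in Python, not in C or Ada, and every name in it is hypothetical. The wraparound function imitates what a narrowing conversion commonly does on two's-complement hardware (the C standard itself does not guarantee this behavior), while the checked function raises an exception that the caller may or may not handle.

```python
def wrap_int16(n: int) -> int:
    """Keep only the low 16 bits, reinterpreted as a signed value."""
    return (n + 32768) % 65536 - 32768


def checked_int16(n: int) -> int:
    """Refuse values that do not fit, as a range-checked language would."""
    if not -32768 <= n <= 32767:
        raise OverflowError(f"{n} does not fit in 16 bits")
    return n


def read_with_handler(n: int, fallback: int = 32767) -> int:
    """Handle the exception locally instead of letting it stop the program."""
    try:
        return checked_int16(n)
    except OverflowError:
        return fallback  # assumed recovery policy, chosen only for illustration
```

Here wrap_int16(40000) quietly returns -25536, a wrong value that nothing flags. checked_int16(40000) raises instead, and what happens next depends entirely on whether anyone wrote a handler such as read_with_handler.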
At this point, Ariane 5 demonstrated a fundamental weakness. The offending piece of software was running in two SRIs (Inertial Reference Systems): a primary system and a "hot" backup. When the first failed, the backup jumped in and took over. Since each SRI contained the same hardware and software, when the backup SRI took over it failed for the same reason. From this point on, Ariane 5 assumed the aerodynamic properties of an overhead projector and shortly afterwards turned itself into twelve square kilometers of debris.
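Why identical redundancy offers no protection against a software fault can be shown with a toy model. This is a sketch under stated assumptions, not a model of the real SRI: the Unit class, its step method, and the failover function are all hypothetical, and the "shutdown" is reduced to a flag.

```python
from typing import List, Optional


class Unit:
    """A hypothetical reference unit; every instance runs the same code."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def step(self, value: float) -> int:
        n = int(value)  # value is assumed to be finite
        if not -32768 <= n <= 32767:
            self.alive = False  # default action: the whole unit shuts down
            raise OverflowError(f"{self.name}: {value!r} does not fit in 16 bits")
        return n


def read_with_failover(units: List[Unit], value: float) -> Optional[int]:
    """Try each live unit in turn; return None when no unit can answer."""
    for unit in units:
        if not unit.alive:
            continue
        try:
            return unit.step(value)
        except OverflowError:
            continue  # fail over to the next unit
    return None
```

Given units = [Unit("primary"), Unit("backup")], a call such as read_with_failover(units, 40000.0) returns None and leaves both units shut down: the input that stops the first unit stops the second as well, because the redundancy guards against hardware faults, not against a shared software error.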
What can we learn from this? There are several lessons:
Type and precision mismatching is once again identified as a primary source of computer systems failure.