Software Design (CSC-223 97F)
The Ariane 5 Failure
The first mission of the ESA's Ariane 5 failed about 40 seconds into
the flight. The problem was traced to a software failure which
persisted in spite of
Lions, J.L. et al. (1996).
"Report by the Inquiry Board on the Ariane 5 Flight 501 Failure."
- Redundant systems are little help if both systems can fail
(or are likely to fail) in the same way.
- "Some" error recovery may not be enough.
- The success of a system in one situation does not guarantee its
success in all situations.
- Sometimes thorough "hand testing" is as important as thorough analysis.
The Ariane 5 launcher was the fifth in a series of rocket launcher
designs produced by the ESA (European Space Agency). About 40 seconds into its
initial flight, it failed. In particular, it followed a normal flight
path until about 37 seconds after launch. It then suddenly veered off
its flight path, broke up, and exploded.
What went wrong? According to the general description in the report,
- Both the back-up and active Inertial Reference Systems failed
- The nozzles of the two solid boosters moved to extreme positions
- The launcher self-destructed when the solid boosters separated from
the core stage.
Suppose that you were requested to develop a key system for a rocket.
- What are some of the issues you would consider in your design?
- What are some of the limitations that you expect would be placed
on your design?
- What steps would you attempt to use to ensure that it would not fail?
The Ariane 5 keeps track of its attitude and movements through an
Inertial Reference System. To protect against failure, the launcher
has two identical inertial reference systems, both of which are
working, but only one of which is active.
- What are some other mechanisms for redundancy?
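As a starting point for that question, here is a minimal Python sketch
(not the Ariane design, and not Ada) contrasting a standby that runs
identical software with one that runs a different implementation. All
names (`UnitFailed`, `strict_unit`, `clamping_unit`, `read_with_failover`)
and the numeric limit are hypothetical, chosen only for illustration.

```python
class UnitFailed(Exception):
    """Hypothetical signal that a unit has shut itself down."""


LIMIT = 1000.0  # hypothetical largest input the strict software accepts


def strict_unit(reading: float) -> float:
    """Software that gives up on any input outside its expected range."""
    if abs(reading) > LIMIT:
        raise UnitFailed(f"unexpected reading {reading}")
    return reading


def clamping_unit(reading: float) -> float:
    """A differently written unit that clamps instead of giving up."""
    return max(-LIMIT, min(LIMIT, reading))


def read_with_failover(units, reading):
    """Try each (name, unit) pair in order; return the first answer, or None."""
    for name, unit in units:
        try:
            return name, unit(reading)
        except UnitFailed:
            continue  # switch over to the next unit
    return None  # every unit failed


identical = [("active", strict_unit), ("backup", strict_unit)]
diverse = [("active", strict_unit), ("backup", clamping_unit)]

print(read_with_failover(identical, 500.0))   # active unit answers
print(read_with_failover(identical, 5000.0))  # both copies fail the same way
print(read_with_failover(diverse, 5000.0))    # the different backup answers
```

Diverse implementations are only one option and carry their own costs.
The sketch only shows why an identical backup does not protect against
an input that both copies mishandle.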
The software for the inertial reference systems is written in Ada,
a language that permits more robust code through the use of exception
handlers (or variants thereof). Much of the code for the inertial
reference systems had appropriate error checking. The parts that
didn't had gone through an analysis of their possible values.
- What are some of the basic kinds of error checking that might go on in
an inertial reference system?
- What type of analysis might you be able to do to show that
error checking isn't necessary at certain points?
- Are there reasons you might not do error checking?
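One simple form of such an analysis is interval reasoning: bound the
inputs, propagate the bounds through the computation, and compare the
result with the range of the target type. The Python sketch below does
this for a value that is scaled and then stored in a 16-bit signed
integer. The function names and every number in it are made up for
illustration; they are not taken from the report.

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1  # -32768 .. 32767


def scaled_range(low: float, high: float, scale: float):
    """Return the (min, max) of x * scale for x in [low, high]."""
    a, b = low * scale, high * scale
    return (min(a, b), max(a, b))


def fits_int16(low: float, high: float) -> bool:
    """True if every value in [low, high] fits in a 16-bit signed integer."""
    return INT16_MIN <= low and high <= INT16_MAX


# Hypothetical input bounds for an "old" and a "new" operating environment.
old_bounds = (-100.0, 100.0)
new_bounds = (-1000.0, 1000.0)
SCALE = 50.0  # hypothetical scale factor

# Under the old bounds a run-time check would be provably redundant.
print(fits_int16(*scaled_range(*old_bounds, SCALE)))
# Under the new bounds the same argument no longer holds.
print(fits_int16(*scaled_range(*new_bounds, SCALE)))
```

The conclusion is only as good as the assumed bounds: if the operating
environment changes, the analysis has to be redone.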
The inertial reference system software for the Ariane 5 was
"practically the same" as that used successfully on the Ariane 4.
- What is "practically the same"?
- Is "practically the same" a guarantee of success?
A design decision was made that when an inertial reference system
failed, a number of steps should be taken:
- the failure should be signaled (presumably, to the central
flight control processor);
- the context of the failure should be stored in semi-permanent memory;
- the processor should shut down.
- What are some alternative strategies?
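One alternative is for the unit to record the failure and keep supplying
its last good value, marked as degraded, rather than shutting down. The
Python sketch below illustrates that idea. `Sample`, `BestEffortSensor`,
and the `measure` callback are hypothetical names, and whether stale data
is safer than no data depends on the system being built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    value: float
    degraded: bool  # True when value is a fallback, not a fresh measurement


class BestEffortSensor:
    """Wraps a measurement function so that a failure never stops the data."""

    def __init__(self, measure: Callable[[], float], initial: float = 0.0):
        self._measure = measure
        self._last_good = initial
        self.failure_log = []  # stands in for storing the failure context

    def read(self) -> Sample:
        try:
            self._last_good = self._measure()
            return Sample(self._last_good, degraded=False)
        except Exception as exc:
            # Record the failure, then fall back to the last good value
            # instead of shutting the unit down.
            self.failure_log.append(repr(exc))
            return Sample(self._last_good, degraded=True)
```

The consumer of the data still learns that something went wrong (through
the `degraded` flag) and can decide how much to trust the value.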
What went wrong? A routine in the inertial reference system used
only during take off continued to run for another forty or so seconds,
and then failed. The failure stopped the two inertial reference systems,
as per the design above. (Yes, it was part of the design of the system that
it continued to run. Why? For the Ariane 4, this permitted interrupts
in the count-down without significant realignment time.)
How did it fail? It attempted to convert a 64-bit floating point number
to a 16-bit signed integer. Unfortunately, the floating point number was
too big. (Yes, this had been identified as a potential trouble spot, but
hand analysis in the initial design of the software showed that it would
never happen, at least in the Ariane 4.)
- Why convert a float to an integer?
- Could this happen even if we had a 16-bit floating point number (or
a 64-bit integer)?
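To make the failure mode concrete, here is a small Python sketch of a
range-checked narrowing conversion, with one variant that raises and one
that saturates (clamps). The flight software was written in Ada, so this
only illustrates the idea. The names `to_int16_checked`,
`to_int16_saturating`, and `ConversionOverflow` are hypothetical, and the
inputs are assumed to be finite.

```python
INT16_MIN = -2**15      # -32768
INT16_MAX = 2**15 - 1   #  32767


class ConversionOverflow(Exception):
    """Raised when a value does not fit in a 16-bit signed integer."""


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it is out of range."""
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise ConversionOverflow(f"{value!r} does not fit in 16 bits")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping at the type's limits."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    print(to_int16_checked(1234.9))      # small value converts normally
    print(to_int16_saturating(1.0e6))    # too large: clamped to the maximum
    try:
        to_int16_checked(1.0e6)          # too large: reported as an error
    except ConversionOverflow as exc:
        print("overflow:", exc)
```

Neither variant is automatically right: raising only helps if something
sensible handles the exception, and clamping silently changes the value.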
Given that most components of space systems undergo thorough testing,
why was this failure allowed? While the hardware for the inertial reference
system was tested, and the software was shown to meet specifications,
the full inertial reference system was never tested. (It was deemed
both inappropriate and too expensive to test the inertial reference system.
The specifications for the software had originally been for the
Ariane 4, and did not take the acceleration of the Ariane 5 into account.)
These are the verbatim recommendations of the inquiry board:
- R1 Switch off the alignment function of the inertial reference system
immediately after lift-off. More generally, no software function should
run during flight unless it is needed.
- R2 Prepare a test facility including as much real equipment as technically
feasible, inject realistic input data, and perform complete, closed-loop,
system testing. Complete simulations must take place before any mission.
A high test coverage has to be obtained.
- R3 Do not allow any sensor, such as the inertial reference system, to
stop sending best effort data.
- R4 Organize, for each item of equipment incorporating software, a specific
software qualification review. The Industrial Architect shall take part
in these reviews and report on complete system testing performed with the
equipment. All restrictions on use of the equipment shall be made explicit
for the Review Board. Make all critical software a Configuration
Controlled Item (CCI).
- R5 Review all flight software (including embedded software), and
- Identify all implicit assumptions made by the code and its
justification documents on the values of quantities provided
by the equipment. Check these assumptions against the
restrictions on use of the equipment.
- Verify the range of values taken by any internal or
communication variables in the software.
- Solutions to potential
problems in the on-board computer software, paying particular
attention to on-board computer switch over, shall be proposed by
the project team and reviewed by a group of external experts,
who shall report to the on-board computer Qualification Board.
- R6 Wherever technically feasible, consider confining exceptions to
tasks and devise backup capabilities.
- R7 Provide more data to the telemetry upon failure of any component,
so that recovering equipment will be less essential.
- R8 Reconsider the definition of critical components, taking failures of
software origin into account (particularly single point failures).
- R9 Include external (to the project) participants when reviewing
specifications, code and justification documents. Make sure that these
reviews consider the substance of arguments, rather than check that
verifications have been made.
- R10 Include trajectory data in specifications and test requirements.
- R11 Review the test coverage of existing equipment and extend it
where it is deemed necessary.
- R12 Give the justification documents the same attention as code.
Improve the technique for keeping code and its justifications consistent.
- R13 Set up a team that will prepare the procedure for qualifying software,
propose stringent rules for confirming such qualification, and ascertain
that specification, verification and testing
of software are of a consistently high quality in the Ariane 5 programme.
Including external RAMS experts is to be considered.
- R14 A more transparent organisation of the cooperation among the partners
in the Ariane 5 programme must be considered. Close engineering cooperation,
with clear cut authority and responsibility, is needed to achieve system
coherence, with simple and clear interfaces between partners.
Disclaimer: Often, these pages were created "on the fly" with little, if any, proofreading. Any or all of the information on the pages may be incorrect. Please contact me if you notice errors.
Source text written by Samuel A. Rebelsky.
Source text last modified Fri Sep 5 12:31:49 1997.
This page generated on Fri Oct 17 09:04:48 1997 by SamR's Site Suite.