Wednesday, August 25, 2010

Malware in Spanair Fatal Air Crash Case: FUD or a real factor?

This week we have seen an extraordinary number of articles claiming that malware is at least partly to blame for the fatal crash of Spanair Flight 5022 which killed 154 of the 172 souls aboard.

Reading the headlines from this story leads one to believe that a trojan was used in a deliberate attempt to bring down an airliner.  Digging further into the facts of the story, nothing could be further from the truth.

The aircraft in question is a 150,000lb twin turbofan MD-82 airliner.  Aircraft like these do not ordinarily make "no flap" takeoffs.  The pilots failed to follow the checklists properly and applied takeoff power with the flaps retracted, and the aircraft stalled as it left ground effect.

Source: CIAIAC
The image on the left shows the layout of the forward pedestal on the MD-82.  There are multiple tactile and visual indications of flap/slat deflection in the cockpit of the MD-82.

The aircraft's crew made two passes through the pre-takeoff checklist.  On the first pass, the flaps were correctly set to 11°.  Before reaching the runway, however, the crew noticed an abnormal RAT (ram air temperature) indication and returned to the ramp for maintenance.  The ground crew "resolved" the issue with the RAT sensor by disabling the probe's heater, and the MD-82 again began taxiing for takeoff.  The cockpit voice recorder captured several signs of trouble on this second taxi, the first links in a cascading chain of failure.

Source: CIAIAC
The official report [1] from CIAIAC (the Spanish equivalent of the NTSB) shows that the flight was delayed, the cabin was hot, and the copilot had his mind on his dinner plans.  The pilot interrupted the pre-takeoff checklist to ask the copilot to call for takeoff clearance.  The copilot called for clearance on the wrong frequency.  This is understandable; people make mistakes.  What's unforgivable is that when the copilot ran through the "takeoff imminent" checklist, the pilot was "anticipating" rather than verifying, and read back a flap deflection of "11" from memory rather than actually confirming the position of the flaps.

Source: CIAIAC

Upon applying takeoff power, the TOWS (takeoff warning system) should have provided an audible warning that takeoff flaps/slats were not set properly.  This did not happen.  So the TOWS was infected with malware/trojans, right?  NO.  The TOWS itself was disabled, but this is an onboard aircraft system with no IP connection, no USB ports, and no operating system familiar to everyday malware authors.  There is quite a bit of misinformation being spread on this point, with security boffins and AV vendors latching on to the malware angle.  No, it was not the onboard TOWS which had been infected.  So why did the TOWS fail to call out a warning?

The report goes into significant technical detail on this point, and it's a bit more complicated than an open circuit breaker or a stray bit of malware.

Source: CIAIAC

The energized ram air temp heater was indicative of a relay failure that would put the aircraft's systems in "flight mode".  That would explain not only the high RAT (the probe has a heating element which is enabled in flight but disabled on the ground) but also the disabled TOWS (which only operates in "ground mode").
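To make that dependency concrete, here is a minimal sketch in Python.  It is a toy model and not the MD-82's actual logic: the function names and the single boolean "ground signal" are assumptions made for illustration, and the real aircraft distributes this signal through relays to many more systems.

```python
# Toy model: one ground/flight signal feeds two unrelated-looking systems.
# All names here are hypothetical; this is not the aircraft's real logic.

def rat_probe_heater_on(ground_signal: bool) -> bool:
    """The probe heater is meant to run in flight and stay off on the ground."""
    return not ground_signal


def tows_armed(ground_signal: bool) -> bool:
    """The takeoff warning system only operates in ground mode."""
    return ground_signal


def tows_sounds(ground_signal: bool, takeoff_power: bool,
                flaps_set_for_takeoff: bool) -> bool:
    """Warn only if armed, takeoff power is applied, and flaps/slats are not set."""
    return tows_armed(ground_signal) and takeoff_power and not flaps_set_for_takeoff


# The aircraft is physically on the ground in both cases below.
HEALTHY = True             # signal correctly reports "on the ground"
FAILED_TO_FLIGHT = False   # relay fault: systems believe the aircraft is airborne

for label, signal in (("healthy", HEALTHY), ("failed to flight mode", FAILED_TO_FLIGHT)):
    print(label,
          "| heater on:", rat_probe_heater_on(signal),
          "| TOWS sounds on no-flap takeoff:",
          tows_sounds(signal, takeoff_power=True, flaps_set_for_takeoff=False))
```

In this model a single bad signal produces both symptoms at once: the heater runs on the ground, and the warning stays silent during a no-flap takeoff.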

So, in a nutshell, if the ground sensing system fails to flight mode, we would have a situation where the TOWS would be disabled, and the ram air temp heater (among other things) would be enabled.  So how would we know if the ground sensing system had failed?

Well, digging deeper, we find that on this particular aircraft, there were numerous instances of high RAT readings on the ground:

Source: CIAIAC

So in the three days prior to the accident we had six abnormal RAT readings while the aircraft was on the ground, as recorded by the digital flight data recorder (DFDR).  Surely the crew would start to pick up on the fact that something was not right.

Source: CIAIAC

No, because the three RAT abnormalities that were actually entered into the aircraft's technical log book (ATLB) were reported by three different crews.  Doh!

So this is (finally) where the malware issue starts to enter the picture.  The off-board system in question was responsible for correlating, scoring, and alerting on situations just like this.  A computer system is ideal for this type of scenario, where the air crews and maintenance personnel rotate frequently.  So why didn't the system fire an alert?

On this issue the official report from the CIAIAC is silent, though perhaps the forthcoming inquiry [2] will shed some light.  Perhaps the threshold for RAT anomalies wasn't reached.  Maybe this was because the system was only made aware of the 3 anomalies that were actually entered into the ATLB, rather than the 6 that were detected by the DFDR on board the aircraft.  Or perhaps the malware present on the scoring and alerting system prevented the system from working as expected.
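The threshold possibility can be sketched in a few lines of Python.  Everything specific here is an assumption for illustration: the report does not describe how the airline's system scored faults, so the threshold of 4, the aircraft label, the crew names, and the fault code are all hypothetical.  Only the counts (three logged entries from three different crews, six recorded readings) come from the discussion above.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple


class Report(NamedTuple):
    aircraft: str  # hypothetical aircraft label
    fault: str     # hypothetical fault code
    crew: str      # who reported it; deliberately ignored when correlating


# Hypothetical threshold; the report does not state one.
ALERT_THRESHOLD = 4


def recurring_faults(reports: Iterable[Report],
                     threshold: int = ALERT_THRESHOLD) -> Set[Tuple[str, str]]:
    """Return (aircraft, fault) pairs seen at least `threshold` times, across all crews."""
    counts = Counter((r.aircraft, r.fault) for r in reports)
    return {key for key, n in counts.items() if n >= threshold}


FAULT = "RAT_HIGH_ON_GROUND"

# Three log book entries, each from a different crew.
logged = [Report("AIRCRAFT-1", FAULT, crew) for crew in ("crew A", "crew B", "crew C")]

# Six readings captured by the flight data recorder.
recorded = [Report("AIRCRAFT-1", FAULT, "DFDR") for _ in range(6)]

print("alert from log book entries:", recurring_faults(logged))
print("alert from DFDR readings:   ", recurring_faults(recorded))
```

With this made-up threshold, the three log book entries never trigger an alert while the six recorded readings would.  The point is only that a correlating system is limited by the data it is fed, whatever else may be wrong with it.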

In any case, this tragic event is typical of so many in aviation.  Its primary cause, in the opinion of this security consultant and pilot, is what we aviators call "Get There Itis".  The flight was delayed, the cabin was hot, the copilot had dinner plans, and the passengers and flight attendants were grumpy.  The pilot breezed through the "takeoff imminent" checklist, repeating from memory "11" degrees of flap deflection rather than verifying the position of the flaps/slats on the numerous indicators present in the cockpit.

The throttles were pushed forward, the stick was pulled back, and the jet momentarily became airborne before stumbling out of ground effect into a fireball that killed 154 people.

Did malware have anything to do with this tragedy?  Perhaps.  But it's certainly a tertiary factor.  A number of recommendations came out of the post-accident investigation, and these were specifically to:

  • Recommend that TOWS systems are checked for proper operation before each flight, rather than once per day
  • Recommend that checklists be streamlined to ensure that critical items (such as setting flap/slat deflection for takeoff) are performed without interruption, and verified

Source: CIAIAC

As to malware on maintenance systems?  Of course that's undesirable.  However, modifications to checklists and operational procedures are our best and most important defense against similar accidents going forward.  Keeping airlines' computer systems free of malware is a completely reasonable requirement.  Would a malware-free maintenance system have prevented this accident?  Perhaps.  But properly trained and professional aviators certainly would have mitigated this tragedy.



  1. I am certain that the pilots were "properly trained and professional aviators." I suspect this apparent smear on the pilots is related to psychological self-defense: "it can't happen to me." All humans, however well trained, make mistakes and get complacent sometimes.

    This was a cascade of failures: ground system computer didn't alert ground crews properly, ground crew disabled a malfunctioning system without full understanding of why it was malfunctioning or the side effects of disabling that system, pilots rushed and perhaps complacent because they expected warning system to function as a backup. ALL of those failures contributed to and were necessary aspects of the crash.

    The pilots KNEW exactly what had to be done, every step of the way, even from memory. All pilots like to think "I wouldn't make that mistake" but we have all made similar ones. We are simply lucky that the long chain of failures that brings a plane out of the sky wasn't present on the day we made a small error. Many people made small mistakes, and getting any one of those steps right could have prevented this tragedy. The pilots' error was only the last and (in retrospect) most obvious one.

  2. My blog update has my opinion:

    [Update 8/28 12:12] Clarity on the Spanair crash should be given: the maintenance computer found partially responsible was indeed infected with malware; however, this was not an onboard flight computer. Rather, it was the ground crew policy and procedure which was interfered with by the malware-ridden system. The flight would have been grounded according to policy had the alarm triggered; however, pilot error was ruled the primary cause of the mishap.

    So the pilot made an error, the takeoff warning system (TOWS) failed to alert the pilot to the error, and the TOWS fault would have grounded the plane had the malware-infected system the ground crew was using been operating properly. Resolving any one of the three issues would have saved 154 people, and that does include the malware on the maintenance system, which would have been ruled a ‘contributing factor to the mishap’ in Naval Aviation. Others have said it’s tertiary – there is no such thing. There are primary causes and contributing causes for a mishap. All contributing causes are equally to blame because without them the mishap may have been avoided, and that includes malware.

  3. I haven't read the entire report, but as a pilot who is still alive despite my share of bone-headed moves, and who has survived some interesting flights (including fire in the cockpit during single-pilot IMC), I think this does look to be a good assessment. It is not a "smear on pilots" to point to pilot error if there is such. Setting the takeoff flaps and slats is the responsibility of the pilot, and no one else.

    The FIRST rule of flying is AVIATE first. It seems that these pilots may have failed at that rule. There have been other accidents for exactly the same failure.

  4. What a tragic accident. So many lives have been lost. My condolences to their families.

  5. It breaks my heart reading through this transcript. My sincerest condolences to the families of the victims. I'd suggest for them to visit as it really helped me get through the loss of a loved one.