Wednesday, August 25, 2010

Malware in Spanair Fatal Air Crash Case: FUD or a real factor?

This week we have seen an extraordinary number of articles claiming that malware is at least partly to blame for the fatal crash of Spanair Flight 5022 which killed 154 of the 172 souls aboard.

Reading the headlines from this story leads one to believe that a trojan was used in a deliberate attempt to bring down an airliner.  Dig further into the facts of the story, however, and nothing could be further from the truth.

The aircraft in question is a 150,000lb twin turbofan MD-82 airliner.  Aircraft like these do not ordinarily make "no flap" takeoffs.  The pilots failed to follow the checklists properly and applied takeoff power with the flaps retracted, and the aircraft stalled when exiting ground effect.

Source: CIAIAC
The image on the left shows the layout of the forward pedestal on the MD-82.  There are multiple tactile and visual indications of flap/slat deflection in the cockpit of the MD-82.

The aircraft's crew made two passes through the pre-takeoff checklist.  On the first pass, the flaps were correctly set to 11°.  However, before reaching the runway the crew noticed an abnormal RAT (ram air temp) indication and returned to the ramp for maintenance.  The ground crew "resolved" the issue with the RAT sensor by disabling it, and the MD-82 again began taxiing for takeoff.  However, the cockpit voice recorder captured several signs of trouble, indicative of a forthcoming cascading chain of failure.

Source: CIAIAC
The official report [1] from CIAIAC (the Spanish equivalent of the NTSB) shows that the flight was delayed, the cabin was hot, and the copilot had his mind on his dinner plans.  The pilot interrupted the pre-takeoff checklist to ask the copilot to call for takeoff clearance.  The copilot called for clearance on the wrong frequency.  This is understandable; people make mistakes.  However, what's unforgivable is that when the copilot ran through the "takeoff imminent" checklist, the pilot was "anticipating" rather than verifying, and read back a flap deflection of "11" from memory rather than actually confirming the position of the flaps.

Source: CIAIAC

Upon applying takeoff power, the TOWS (takeoff warning system) should have provided an audible warning that takeoff flaps/slats were not set properly.  This did not happen.  So the TOWS was infected with malware/trojans, right?  NO.  The TOWS itself was disabled, but this is an onboard aircraft system with no IP connection, no USB ports, and no operating system familiar to everyday malware authors.  There is quite a bit of misinformation being spread on this point, with security boffins and AV vendors latching on to the malware angle.  No, it was not the onboard TOWS which had been infected.  So why did the TOWS fail to call out a warning?

The report goes into significant technical detail on this point, and it's a bit more complicated than an open circuit breaker or a stray bit of malware.

Source: CIAIAC

The issue with the energized ram air temp heater was indicative of a relay failure which would put the aircraft's systems in "flight mode".  That would explain not only the high RAT (the probe has a heating element which is enabled in flight but disabled on the ground) but also the disabled TOWS (which only operates in "ground mode").
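To make that coupling concrete, here is a toy sketch in Python.  It is not the MD-82's actual relay logic; the class and function names are made up for illustration, and the only rules it encodes are the two stated above: the probe heater runs in flight mode, and the TOWS is only armed in ground mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundSensing:
    """Toy stand-in for the air/ground sensing logic (hypothetical model)."""
    reports_on_ground: bool


def rat_probe_heater_on(sensing: GroundSensing) -> bool:
    # The probe heater is enabled in flight and disabled on the ground.
    return not sensing.reports_on_ground


def tows_armed(sensing: GroundSensing) -> bool:
    # The TOWS only operates in ground mode.
    return sensing.reports_on_ground


def tows_sounds(sensing: GroundSensing, takeoff_power: bool, flaps_set: bool) -> bool:
    # A warning needs an armed TOWS, takeoff power, and a bad configuration.
    return tows_armed(sensing) and takeoff_power and not flaps_set


healthy = GroundSensing(reports_on_ground=True)
# The aircraft is physically on the ground, but the sensing has failed to flight mode.
failed_to_flight = GroundSensing(reports_on_ground=False)

for label, sensing in [("healthy", healthy), ("failed to flight mode", failed_to_flight)]:
    print(label,
          "| heater on:", rat_probe_heater_on(sensing),
          "| TOWS sounds with flaps up:", tows_sounds(sensing, True, False))
```

In this simplified model a single bad input produces both symptoms at once: a hot RAT probe on the ramp and a silent TOWS at takeoff power.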

So, in a nutshell, if the ground sensing system fails to flight mode, we would have a situation where the TOWS would be disabled, and the ram air temp heater (among other things) would be enabled.  So how would we know if the ground sensing system had failed?

Well, digging deeper, we find that on this particular aircraft, there were numerous instances of high RAT readings on the ground:

Source: CIAIAC

So in the three days prior to the accident we had six abnormal RAT readings while the aircraft was on the ground, as recorded by the digital flight data recorder (DFDR).  Surely the crew would start to pick up on the fact that something was not right.

Source: CIAIAC

No, because the three RAT abnormalities that were actually entered into the aircraft's technical log book (ATLB) were reported by three different crews.  Doh!

So this is (finally) where the malware issue starts to enter the picture.  The off-board system in question was responsible for correlating, scoring and alerting on situations just like this.  A computer system is ideal for this type of scenario, where the air crews and maintenance personnel rotate frequently.  So why didn't the system fire an alert?

On this issue the official report from the CIAIAC is silent, though perhaps the forthcoming inquiry [2] will shed some light.  Perhaps the threshold for RAT anomalies wasn't reached.  Maybe this was because the system was only made aware of the 3 anomalies that were actually entered into the ATLB, rather than the 6 that were detected by the DFDR on board the aircraft.  Or perhaps the malware present on the scoring and alerting system prevented the system from working as expected.
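The threshold question can be sketched in a few lines of Python.  Everything here is an assumption made for illustration: the report does not describe how the real system scored events, the threshold of 4 is invented, and the aircraft and crew labels are placeholders.  The only figures taken from the report are the three ATLB entries from three different crews and the six events on the DFDR.

```python
from collections import Counter

# Hypothetical threshold; the report does not say what the real system used.
ALERT_THRESHOLD = 4

# Three ATLB entries from three different crews (labels are placeholders).
atlb_entries = [
    {"aircraft": "AIRCRAFT-1", "crew": "crew A"},
    {"aircraft": "AIRCRAFT-1", "crew": "crew B"},
    {"aircraft": "AIRCRAFT-1", "crew": "crew C"},
]

# Six on-ground high-RAT events recorded by the DFDR over the same three days.
dfdr_events = [{"aircraft": "AIRCRAFT-1"} for _ in range(6)]


def count_by(records, field):
    """Count records grouped by the given field."""
    return Counter(record[field] for record in records)


def aircraft_to_alert(records, threshold=ALERT_THRESHOLD):
    """Return the aircraft whose anomaly count meets the threshold."""
    counts = count_by(records, "aircraft")
    return sorted(name for name, n in counts.items() if n >= threshold)


print("Most any one crew saw:", max(count_by(atlb_entries, "crew").values()))
print("Alerts from ATLB feed:", aircraft_to_alert(atlb_entries))
print("Alerts from DFDR feed:", aircraft_to_alert(dfdr_events))
```

With this made-up threshold, no single crew sees a pattern, the ATLB-only feed stays quiet, and only the full DFDR record would trip an alert.  A different threshold would give a different answer, which is exactly why the silence of the report on this point matters.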

In any case, this tragic event is typical of so many in aviation.  Its primary cause, in the opinion of this security consultant and pilot, is what we aviators call "Get There Itis".  The flight was delayed, the cabin was hot, the copilot had dinner plans, and the passengers and flight attendants were grumpy.  The pilot breezed through the "takeoff imminent" checklist, repeating from memory "11" degrees of flap deflection rather than verifying the position of the flaps/slats on the numerous indicators present in the cockpit.

The throttles were pushed forward, the stick was pulled back, and the jet momentarily became airborne before stumbling out of ground effect into a fireball that killed 154 people.

Did malware have anything to do with this tragedy?  Perhaps.  But it's certainly a tertiary factor.  A number of recommendations came out of the post-accident investigation, and these were specifically to:

  • Recommend that TOWS systems are checked for proper operation before each flight, rather than once per day
  • Recommend that checklists be streamlined to ensure that critical items (such as setting flap/slat deflection for takeoff) are performed without interruption, and verified

Source: CIAIAC

As to malware on maintenance systems?  Of course that's undesirable.  However, modifications to checklists and operational procedures are our best and most important defense against similar accidents going forward.  Keeping airlines' computer systems free of malware is a completely reasonable requirement.  Would a malware-free maintenance system have prevented this accident?  Perhaps.  But properly trained and professional aviators certainly would have mitigated this tragedy.