Is this an aviation safety issue?


  • Is this an aviation safety issue?

    At first, I just saw this as a logistical nightmare. First Southwest, then Delta experience debilitating fleetwide cancellations due to ground computer network meltdowns. People are stuck safely on the ground. Not a safety issue.

    But what happens next? A great backlog of flights. Enormous pressure to get everything back on track as soon as possible. How does that affect pilot rostering? Turnaround times? MEL issues? How much stress and pressure are flight and ground crews under to get it all back to normal? What is the leading cause of accidents? Stress, pressure, fatigue, shortcuts, bad decisions based on get-there-itis, questionable dispatches, a lack of contingency for safety reasons, back office pressure...?

    The way I see it, such events DO heighten the risk of something going wrong. So shouldn't these systems, a relatively new but essential component of aviation, have FAR certification standards as well? Shouldn't these systems be required to conform to "aircraft-grade" reliability and proven redundancy requirements? Both Southwest and Delta have stated that they have back-up systems in place, but in both cases these systems failed to work. Flimsy stuff.

    Well, maybe simple economics will take care of it. I can well imagine the airlines not caring to invest in better back-up systems before such an event happens, but now Southwest is looking at "tens of millions" of dollars in damages. Perhaps better back-up systems are looking more appealing now...

    Still, it's the 21st century. Everything depends on these networks. Why isn't this a safety issue? Shouldn't we have FARs for this?

  • #2
    You do have a point. One thought though. Until it all actually goes to ratshit, how are you going to know that it's all going to go to ratshit?
    If it ain't broken... don't try to mend it!


    • #3
      Originally posted by brianw999
      You do have a point. One thought though. Until it all actually goes to ratshit, how are you going to know that it's all going to go to ratshit?

      Just curious, do most people give a ratsass if it is going to ratshit?


      • #4
        Originally posted by Evan
        At first, I just saw this as a logistical nightmare. First Southwest, then Delta experience debilitating fleetwide cancellations due to ground computer network meltdowns. People are stuck safely on the ground. Not a safety issue.

        But what happens next? A great backlog of flights. Enormous pressure to get everything back on track as soon as possible. How does that affect pilot rostering? Turnaround times? MEL issues? How much stress and pressure are flight and ground crews under to get it all back to normal? What is the leading cause of accidents? Stress, pressure, fatigue, shortcuts, bad decisions based on get-there-itis, questionable dispatches, a lack of contingency for safety reasons, back office pressure...?

        The way I see it, such events DO heighten the risk of something going wrong. So shouldn't these systems, a relatively new but essential component of aviation, have FAR certification standards as well? Shouldn't these systems be required to conform to "aircraft-grade" reliability and proven redundancy requirements? Both Southwest and Delta have stated that they have back-up systems in place, but in both cases these systems failed to work. Flimsy stuff.

        Well, maybe simple economics will take care of it. I can well imagine the airlines not caring to invest in better back-up systems before such an event happens, but now Southwest is looking at "tens of millions" of dollars in damages. Perhaps better back-up systems are looking more appealing now...

        Still, it's the 21st century. Everything depends on these networks. Why isn't this a safety issue? Shouldn't we have FARs for this?

        Noted.


        • #5
          Originally posted by Evan
          Still, it's the 21st century. Everything depends on these networks. Why isn't this a safety issue? Shouldn't we have FARs for this?
          As a certain political leaning often asks, "Do existing regulations not already cover this?"

          A network failure is not the only reason the system sometimes backs up...snowstorms are another good one.
          The basic rules of aviation discourage long periods of pulling up hard.


          • #6
            Originally posted by 3WE
            As a certain political leaning often asks, "Do existing regulations not already cover this?"

            A network failure is not the only reason the system sometimes backs up...snowstorms are another good one.
            AFAIK you can't regulate weather with the FARs. Electronic things, however...

            As I noted above, many accidents turn out to be precipitated by human factors issues like overly demanding pilot rosters and ground service turnaround pressures. By cutting capacity so thin, the airlines cannot recover from incidents like these without inducing fairly extreme versions of these human factors. I'm always of the mind that you solve an institutionalized problem by starting at the root cause. We now have regulations (weak ones) for pilot rostering. Why not for the robustness of these VERY MISSION CRITICAL systems? At least some tougher ARINC standards...

            Because obviously the back-up systems currently in place are made out of papier-mâché...


            • #7
              FWIW, when my airline had a sizable meltdown last summer, I felt under no pressure to do anything extraordinary to return things to "normal". The FARs hadn't changed, our manual hadn't changed, and the MEL hadn't changed, so I did my job same as always.

              Not sure why we need some "tougher ARINC standards"; ARINC wasn't an issue in this case anyway.


              • #8
                Originally posted by ATLcrew
                FWIW, when my airline had a sizable meltdown last summer, I felt under no pressure to do anything extraordinary to return things to "normal". The FARs hadn't changed, our manual hadn't changed, and the MEL hadn't changed, so I did my job same as always.

                Not sure why we need some "tougher ARINC standards"; ARINC wasn't an issue in this case anyway.
                No, but perhaps these systems should fall under the aegis of ARINC standards. If they were that robust, they would probably work when called upon.

                When your airline suffered a sizable meltdown, how long did it take to return things to normal? What was the backlog in flight cancellations? Southwest had to cancel about 2300 flights. Delta cancellations have continued into the second day. Hundreds, perhaps thousands of them. Obviously that creates backlog and requires unusual rostering to remediate. Surely most pilots will be cautious as ever and make safe decisions in flight. Most of them...

                But certainly you are aware of the rich history of bad decisions and hasty, improvised or deferred maintenance that have led to so many fatalities. Certainly you are aware that pressure and/or fatigue is almost always a factor behind these.

                Certainly you are aware that these catastrophic system meltdowns create fertile environments for pressure and fatigue. Certainly you are aware that safety abhors chaos...


                • #9
                  how far left do you go evan? now you want to start putting the government in charge of the computer hardware and network software private companies use? you do realize that just about every piece of electronic equipment utilized by the airlines is connected in one way or another to its network, right? so that a manager's iphone would have to meet ARINC standards because it is VPN capable and is actually connected to say, aa's network. and copiers and printers, keyboards, mice etc etc etc.

                  paper mache? now you're showing a high degree of ignorance.

                  has there been a single fatality you can link to an airline's corporate computer/network failure? what's that? not one? ah, thought so.....


                  • #10
                    Originally posted by TeeVee
                    how far left do you go evan? now you want to start putting the government in charge of the computer hardware and network software private companies use? you do realize that just about every piece of electronic equipment utilized by the airlines is connected in one way or another to its network, right? so that a manager's iphone would have to meet ARINC standards because it is VPN capable and is actually connected to say, aa's network. and copiers and printers, keyboards, mice etc etc etc.
                    Not that far, obviously. Essential systems and networks needed to avoid massive cancellations: that is how far I would go. The systems that went down Sunday are a menagerie of third-party systems, some dating back to the '90s and very nearly obsolete. Our national transportation system is relying on very fragile networks like this. So are you good with that, or how far do you want to go? As pax, we're on the same team, you know.

                    I've said this a thousand times here but it bears repeating: major industries do a terrible job of regulating themselves when money is involved. If they are not forced to upgrade these networks to something A LOT more reliable, they will take their sweet time.

                    Fortunately, as I said, in this case the economic costs of these failures might force them to get it in gear. Might...

                    has there been a single fatality you can link to an airline's corporate computer/network failure? what's that? not one? ah, thought so.....
                    What are you going to say when it happens? Couldn't have seen that coming?


                    • #11
                      Originally posted by Evan
                      The systems that went down Sunday are a menagerie of third-party systems, some dating back to the '90s and very nearly obsolete. Our national transportation system is relying on very fragile networks like this.

                      If they are not forced to upgrade these networks to something A LOT more reliable, they will take their sweet time.
                      A lot more reliable than what?

                      Let's take a look at the numbers. For ease of calculation, let's assume there are 10 "major" airlines in the USA and we'll examine the period of the last 20 years. So that's (10 airlines) * (365 days/year) * (20 years) = 73,000 airline-days of operation. And say over that period, there have been a total of 5 days of outages. So the uptime is 72995/73000 or 99.993%. Can you name another modern system in the consumer space that has that level of reliability?

                      Just for comparison from my own life... we've been in our current house for about 12 years. During that time I'd say we average 1 day / year of electrical outage, so its uptime is about 99.73%. Internet and telephone are about the same. Can your smartphone or tablet boast greater than 99.99% uptime? I doubt it.

                      In areas closer to what we're actually talking about, most server hosting and co-location services advertise an uptime of 99.9% and I've never seen one that would guarantee higher than 99.95% continuously... have you?
                      Be alert! America needs more lerts.

                      Eric Law
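The uptime arithmetic in this post can be reproduced in a few lines. This is only a sketch: the function names `uptime_percent` and `downtime_hours_per_year` are made up for illustration, and the inputs (10 airlines, 20 years, 5 outage days, 1 outage day per year at home) are the post's own assumptions, not measured data.

```python
def uptime_percent(total_days: float, outage_days: float) -> float:
    """Uptime as a percentage of total operating days."""
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    if not 0 <= outage_days <= total_days:
        raise ValueError("outage_days must be between 0 and total_days")
    return 100.0 * (total_days - outage_days) / total_days


def downtime_hours_per_year(uptime_pct: float) -> float:
    """Hours of downtime per 365-day year implied by an uptime percentage."""
    if not 0.0 <= uptime_pct <= 100.0:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100.0 - uptime_pct) / 100.0 * 365 * 24


# Assumed in the post: 10 major airlines over 20 years, 5 outage days in total.
airline_days = 10 * 365 * 20
fleet_uptime = uptime_percent(airline_days, 5)

# Assumed in the post: about 1 day of electrical outage per year at home.
home_uptime = uptime_percent(365, 1)

print(f"airline-days of operation: {airline_days}")
print(f"airline uptime: {fleet_uptime:.3f}%")
print(f"home power uptime: {home_uptime:.2f}%")
print(f"99.9% uptime allows {downtime_hours_per_year(99.9):.2f} h/year of downtime")
```

For scale, an advertised 99.9% uptime works out to roughly 8.76 hours of allowed downtime per year, and 99.95% to about 4.38 hours.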


                      • #12
                        Originally posted by elaw
                        A lot more reliable than what?

                        Let's take a look at the numbers. For ease of calculation, let's assume there are 10 "major" airlines in the USA and we'll examine the period of the last 20 years. So that's (10 airlines) * (365 days/year) * (20 years) = 73,000 airline-days of operation. And say over that period, there have been a total of 5 days of outages. So the uptime is 72995/73000 or 99.993%. Can you name another modern system in the consumer space that has that level of reliability?

                        Just for comparison from my own life... we've been in our current house for about 12 years. During that time I'd say we average 1 day / year of electrical outage, so its uptime is about 99.73%. Internet and telephone are about the same. Can your smartphone or tablet boast greater than 99.99% uptime? I doubt it.

                        In areas closer to what we're actually talking about, most server hosting and co-location services advertise an uptime of 99.9% and I've never seen one that would guarantee higher than 99.95% continuously... have you?
                        Maybe I haven't been clear enough. The issue is not just the reliability of primary systems. Uptime numbers are indeed good but we can't expect them to be perfect. The issue is reliability of backup systems as well. Like any mission-critical aspect of aviation, these networks as a whole need to be fail-operational. Modern turbofans are exceedingly reliable, yet having a second one makes propulsion fail-passive and the prospect of loss-of-propulsion extremely remote. The issue is having a system that is, in its entirety, fault-tolerant.

                        So how does that compare to what we actually have out there? Southwest's system failed for about one hour. The back-up failed to work as designed. It took thirteen hours to reboot the primary system, causing 2,300 flight cancellations and costing them tens of millions of dollars.

                        So much more reliable than that.
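To put rough numbers on the redundancy argument, here is a minimal sketch of how a backup changes overall availability. It assumes the primary and backup fail independently, and it adds a hypothetical `failover_success` parameter for the chance that the switchover actually works. All the figures are illustrative, not taken from Southwest or Delta.

```python
def availability_with_backup(primary: float, backup: float,
                             failover_success: float = 1.0) -> float:
    """Availability of a primary/backup pair.

    Assumes independent failures. The service is up when the primary is up,
    or when the primary is down, the failover works, and the backup is up.
    """
    for name, value in (("primary", primary), ("backup", backup),
                        ("failover_success", failover_success)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return primary + (1.0 - primary) * failover_success * backup


# Illustrative figures only: a 99.9% primary and a 99% backup.
ideal = availability_with_backup(0.999, 0.99)
broken = availability_with_backup(0.999, 0.99, failover_success=0.0)

print(f"working failover: {ideal:.5f}")   # 0.99999
print(f"failover never works: {broken:.5f}")  # 0.99900
```

With a failover that never works, the pair is no better than the primary alone, which is the situation described above.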


                        • #13
                          well said eric. i could argue that my macbook pro from February 2009 is pretty damn reliable since it crashed exactly ONCE since then, and only suffered one HDD failure, resulting in a several hour down period while i replaced it with an SSD.

                          funny/sad thing is modern is not always better. yes, it's gone now, but up until a few years ago the space shuttle flew to outer space using antiquated computers running god only knows what OS. the xbox360 has nearly 100x the processing power of the space shuttle's computer, which can only load one program at a time.

                          of course, older does not always mean more reliable. and lots of new systems crash very often, for many reasons. also, the logistics of migrating VAST, complex networks to newer ones is a frighteningly difficult task. this is why most of america's banks are running ancient systems that don't always talk to each other.

                          it's not as simple as saying, "hey, let's upgrade."


                          • #14
                            Originally posted by Evan
                            Southwest's system failed for about one hour. The back-up failed to work as designed. It took thirteen hours to reboot the primary system, causing 2,300 flight cancellations and costing them tens of millions of dollars.
                            Compared to how much to do what you're proposing? Hundreds of millions? As I'm sure you know this isn't a simple drop-in replacement like getting a new PC at home. You're talking lots of expensive equipment in lots of geographically diverse locations, lots of time spent creating and debugging custom software, training of thousands of people, etc.

                            And while "2,300 flights canceled" looks bad in a headline, again let's examine the numbers. Say on average each flight had 200 people booked on it. So that's 460,000 people that were inconvenienced by the event you refer to. I'd be willing to bet that worldwide, a comparable number of people miss their flights every day due to personal emergencies, poor planning, failure in transportation to the airport (car breaks down, gets stuck in traffic...), etc. as well as more directly airline-related things like missed connections due to inbound flights being late. Missing a flight stinks but the reality is it happens all the time.
                            Be alert! America needs more lerts.

                            Eric Law
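A quick check of the multiplication behind this estimate, as a sketch. The 200 passengers per flight is the post's assumption; the helper name `passengers_affected` and the smaller load figures are hypothetical, included only to show how sensitive the total is to that assumption.

```python
def passengers_affected(cancelled_flights: int, avg_booked: int) -> int:
    """Estimated passengers affected: cancelled flights times average bookings."""
    if cancelled_flights < 0 or avg_booked < 0:
        raise ValueError("inputs must be non-negative")
    return cancelled_flights * avg_booked


CANCELLED_FLIGHTS = 2300  # figure quoted in the thread

# 200 is the post's assumption; 150 and 175 are hypothetical alternatives.
for avg_booked in (150, 175, 200):
    total = passengers_affected(CANCELLED_FLIGHTS, avg_booked)
    print(f"{avg_booked} booked per flight -> {total:,} passengers")
```

At the post's assumed 200 passengers per flight, 2,300 cancellations comes to 460,000 passengers.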


                            • #15
                              Originally posted by elaw
                              Compared to how much to do what you're proposing? Hundreds of millions? As I'm sure you know this isn't a simple drop-in replacement like getting a new PC at home. You're talking lots of expensive equipment in lots of geographically diverse locations, lots of time spent creating and debugging custom software, training of thousands of people, etc.
                              Pay to play. A single aircraft can cost close to $300M. It's an expensive business to be in. 85% of American air travel is now served by only four major airlines. These are enormous companies with enormous resources currently lording over virtual monopolies and squeezing profit out of every nook and cranny. The cost of doing business involves meeting FAA requirements as well as those of other regulatory agencies. At least one of them should be regulating EVERY mission critical aspect of the industry. It doesn't happen overnight. It starts overnight though. A five-year requirement to have network redundancy in place at a given standard is not unreasonable. And in five years, it will be even more critical. So now is the time to get going on this.

                              And while "2,300 flights canceled" looks bad in a headline, again let's examine the numbers. Say on average each flight had 200 people booked on it. So that's 460,000 people that were inconvenienced by the event you refer to. I'd be willing to bet that worldwide, a comparable number of people miss their flights every day due to personal emergencies, poor planning, failure in transportation to the airport (car breaks down, gets stuck in traffic...), etc. as well as more directly airline-related things like missed connections due to inbound flights being late. Missing a flight stinks but the reality is it happens all the time.
                              What is it with you and "the numbers"? There is this widespread belief out there that everything can be rationalized by cherry-picking statistics instead of recognizing problems intellectually. That might end up being the undoing of modern society.

                              The numbers are: 2,300 DOMESTIC flight cancellations in ONE DAY. 6,000 flights delayed in ONE DAY. TENS OF MILLIONS of dollars lost in ONE DAY. TWO such events in ONE MONTH!!!

                              The cause? Reckless mergers and a lack of proper investment in IT to keep up with growing complexity.

                              DALLAS  — Twice in less than a month, a major airline was paralyzed by a computer outage that prevented passengers from checking in and flights from taking off. Last month, it took Southwest days t…


                              These are most definitely highly disruptive events that result in temporary chaos, extended stress and lingering pressure. Accidents rarely happen when things are going as expected. Disruption, chaos, stress and pressure are key parts of the accident equation.
