Is this an aviation safety issue?

  • #76
    Originally posted by Evan View Post
    So what if regulations were established that required airlines to:

    a) have back-up systems in place and a network designed to gracefully fail over;

    and

    b) thoroughly test this aspect of the network before it goes online.

    I think this is unlikely to work if the network is cobbled together from legacy networks left over from a litany of mergers and expansions and outsourcings. But a clean-sheet core network should easily meet these requirements, don't you think? Remember, you have four years and abundant financial resources...
    Evan, you seem utterly fixated on the network and I am completely baffled as to why. Cobbled together networks are not a huge problem. I am 99% sure this problem had nothing to do with the network. Outside of major fiber line cuts or massive DDoS attacks, networks are not generally a central point of failure, regardless of the age of the equipment. When it does fail, it tends to be easy to fix or replace.

    They also didn't suffer from failed backups. A backup is required if you lose data, and need to restore it. They didn't lose data. Backups were not a factor in this incident. The network was not a factor in this incident. I guarantee you they had a backup system in place and a network designed to gracefully fail over. That kind of regulation would be completely unnecessary and would not have changed this incident one iota. Testing for the effects of a fire in the data center is pretty hard to do.

    The central points of failure that typically cause serious problems at this level would be primary network trunks -- like a fiber line into the data center, which can be broken by a car accident into a pole or cut by a construction crew a hundred miles away (I had this happen to me twice last year) -- power (the problem here), primary data center loss due to fire/flood, or a software system upgrade (i.e., updating a system and having the new software cause problems).

    When you read about cobbled together systems, it isn't the network or even the hardware that people are worried about; it is the software systems they are worried about.



    • #77
      Originally posted by Schwartz View Post
      Evan, you seem utterly fixated on the network and I am completely baffled as to why. Cobbled together networks are not a huge problem. I am 99% sure this problem had nothing to do with the network. Outside of major fiber line cuts or massive DDoS attacks, networks are not generally a central point of failure, regardless of the age of the equipment. When it does fail, it tends to be easy to fix or replace.

      They also didn't suffer from failed backups. A backup is required if you lose data, and need to restore it. They didn't lose data. Backups were not a factor in this incident. The network was not a factor in this incident. I guarantee you they had a backup system in place and a network designed to gracefully fail over. That kind of regulation would be completely unnecessary and would not have changed this incident one iota. Testing for the effects of a fire in the data center is pretty hard to do.

      The central points of failure that typically cause serious problems at this level would be primary network trunks -- like a fiber line into the data center, which can be broken by a car accident into a pole or cut by a construction crew a hundred miles away (I had this happen to me twice last year) -- power (the problem here), primary data center loss due to fire/flood, or a software system upgrade (i.e., updating a system and having the new software cause problems).

      When you read about cobbled together systems, it isn't the network or even the hardware that people are worried about; it is the software systems they are worried about.
      I think we might be getting hung up on semantics. When I say 'network' I mean that to include everything a network is composed of, including software systems and network trunks. The weakness I see is that when a single point of failure such as a data center experiences a fire (or any kind of failure), there is no RELIABLE AND TESTED secondary system able to take over and preserve the network functionality. The other problem I see is the organic rather than architectural structure resulting from mergers and outsourcing, where nobody seems to have a comprehensive understanding of the vast monster they have created.



      • #78
        Originally posted by Evan View Post
        I think we might be getting hung up on semantics. When I say 'network' I mean that to include everything a network is composed of, including software systems and network trunks. The weakness I see is that when a single point of failure such as a data center experiences a fire (or any kind of failure), there is no RELIABLE AND TESTED secondary system able to take over and preserve the network functionality. The other problem I see is the organic rather than architectural structure resulting from mergers and outsourcing, where nobody seems to have a comprehensive understanding of the vast monster they have created.
        OK, that makes sense then. Semantics matter because if I were worried about an "airplane" and kept saying that the engines were cobbled together, everyone would be thoroughly confused.
        When mergers occur, the biggest problems are typically the software systems, not the hardware. I'm sure there are networks cobbled together with different hardware, but these aren't hard to replace and are typically easy to get talking to each other pretty well. You should be saying 'computing systems', not 'network functionality'. Network functionality has a very narrow meaning, which is not what you're talking about.

        When you look at a system you want to make really reliable -- an airplane for example -- part of the exercise is looking for the single points of failure, and you either put a ton of effort into making sure they will have an extremely low risk of failure, or you provide a backup. Airplanes have multiple redundant hydraulic systems, but if those systems all share a single hydraulic pump you have just shifted the single point of failure to a new place. Perhaps you can almost "guarantee" that the single pump won't fail and you'll decide to accept that minimal risk (reminds me of the Sikorsky 30-minute 0 PSI oil rating). Now when you say that "Current technology is far less costly and far more agile than it was even a decade ago" you are largely referring to hardware, and not the software components. It is true that certain types of software development have decreased in price -- especially simple software -- but at the enterprise scale, I say it is still extremely expensive due to its interconnectivity and complexity. Delta can't buy off-the-shelf software to run their business. They have to build it, and building that type of software is extremely expensive; I would argue the costs have not changed that much in the last 10 years.
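
        To put rough numbers on the pump example, here is a toy sketch in Python -- the failure probabilities are invented for illustration, not anyone's real figures:

            # Toy reliability numbers -- invented for illustration only.
            p_system = 1e-3   # assumed chance any one hydraulic system fails on a flight
            p_pump   = 1e-4   # assumed chance the single shared pump fails

            # Three independent systems: all three must fail for total loss.
            p_all_three = p_system ** 3                  # 1e-9

            # Everything still hangs off the one pump, so total loss is roughly:
            p_total = p_all_three + p_pump               # ~1.0e-4, dominated by the pump

            print(f"redundant systems alone: {p_all_three:.1e}")
            print(f"with the shared single pump: {p_total:.1e}")

        The redundancy buys almost nothing once the shared pump is in the picture; the single point of failure has just moved.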

        If you look at an airplane, if there is a fire in the cockpit, there is no backup. It isn't worth building an emergency cockpit in the back either. The more backup systems you build, the more complex the system gets, and testing complexity grows exponentially as you add complexity linearly. The same applies to airplanes in a way, but the big difference between hardware and software is that adding features to hardware is orders of magnitude more difficult and expensive, so they typically have a lot fewer features, making testing much simpler. Features in software can be added at the drop of a hat, and typically they have thousands of features, making the testing matrix something that is impossible to cover. This is a software fact which no company has solved -- except NASA 30 years ago, and they solved that problem back then by making the software so simple and giving it so few features that testing it was actually feasible.
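
        To see how fast that testing matrix blows up, here is a back-of-the-envelope sketch (the feature counts are invented, and real features interact in far messier ways than independent on/off switches):

            # Each independent on/off feature doubles the number of
            # configurations an exhaustive test pass would have to cover.
            for features in (10, 20, 40, 80):
                configs = 2 ** features
                print(f"{features:3d} features -> {configs:.2e} configurations")
            # 10 features is about a thousand combinations; 80 is ~1.2e24.
            # No budget -- and no regulation -- makes exhaustive coverage feasible.

        That is the exponential-versus-linear point: each added feature multiplies the test space, which is why NASA's answer was to have almost no features in the first place.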

        Arguably, you still run into similar problems in planes today, especially since they all have software now. You can have three hydraulic pumps but if you co-locate them a single impact can take all three of them out simultaneously. How do you test that? I guarantee that Delta backs up their data in the case of storage hardware failure, corruption, or malfeasance. I almost guarantee you they had redundant cooling in their data center, and probably redundant network links as well (for those construction diggers that always cut through lines). I will also bet they had redundant power going to the data center but when you have a fire, it might very well have damaged both the primary and backups, just like a cockpit fire renders useless all the redundant controls on an airplane. How do you test for that? The example I described earlier -- the maintenance electrician wiring the power incorrectly -- had all the backups and redundancies and the human who was maintaining them caused them all to fail.

        The software systems that your typical Stanford grad writes in their San Francisco startup are probably being tested in production, because they let their users find the problems, and it is too difficult to simulate real-world usage anyway.


        So far, Netflix seems to have the most resilient system because they even survived the outage of their provider (Amazon). Their approach is very interesting, because they have written software to inject continuous faults into various components of their system, forcing their developers to constantly deal with failure because they assume it will always happen. But then, they don't require data consistency -- if you lose a second of Game of Thrones it doesn't matter -- but Delta does. If 1 out of every 10,000 passengers lost their reservation, that would be a big problem.
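
        The pattern looks roughly like this -- a minimal sketch of the fault-injection idea, not Netflix's actual tooling; the service names and kill rate here are made up:

            import random

            # Hypothetical service names -- stand-ins, not any airline's real components.
            services = ["reservation-cache", "seat-maps", "check-in", "notifications"]

            def chaos_tick(kill_probability=0.01):
                """Randomly terminate a healthy instance so failure handling gets
                exercised continuously, not just during a real outage."""
                for name in services:
                    if random.random() < kill_probability:
                        print(f"killing one instance of {name}; traffic must re-route")

            chaos_tick()  # run on a schedule, usually in a test/staging environment first

        The catch is the one noted above: this proves the system survives losing instances, not that every reservation record stays consistent while it does.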

        So again, I am sure that Delta had backup systems in place, so regulation wouldn't have changed anything. I'm sure they also did the appropriate level of testing. I have been reading about software failure since the late '80s (starting in telecom) and even though the architectures keep improving, the frequency of systemic failure hasn't changed that much. Even the best companies with extensive distributed architectures have had systemic failures -- Amazon took out their system when their backup systems overwhelmed their own backup network during maintenance, causing a massive, long outage. AT&T suffered a cascading 1-800 failure in the '90s, contributed to by their backup systems. The point is that testing backup architectures without knowing the exact modes of failure or system state at the time is impossible. You can regulate that they "test" the system, and they will test it for known conditions and system state, just like I'm sure Delta did. Your regulation would not have changed the outcome. Complexity is the enemy of testing. I get your typical CS grad who says, look how fast I can write this software to do X and Y. Yep, it is fast and easy when it supports one user, one language, one computer, no integrations, no backups, etc. Supporting 10 users gets more complicated, and supporting 1M users is orders of magnitude harder.

        If you regulated that the system must be testable, then you would force them to remove most of the features and rely on basic, simple, slow changing systems functionality at an extreme loss of efficiency.

        EDIT: Actually, it was Nortel that suffered an outage due to the automated recovery systems. AT&T suffered their outage because a failure in one switch (buffer overload, if I remember correctly) cascaded to the next switch and triggered the same failure.



        • #79
          One other very interesting example: the F-35 is enormously over budget and is still not complete, but the hardware seems to be working reasonably well. The bigger issue is the software, no doubt because it is so complicated. The problem is that the software is integral to delivering the value promised by the SYSTEM the F-35 brings to the battlefield, so without it, the plane can't justify its cost.



          • #80
            Originally posted by Schwartz View Post
            [...]
            So again, I am sure that Delta had backup systems in place, so regulation wouldn't have changed anything. I'm sure they also did the appropriate level of testing. I have been reading about software failure since the late '80s (starting in telecom) and even though the architectures keep improving, the frequency of systemic failure hasn't changed that much. Even the best companies with extensive distributed architectures have had systemic failures -- Amazon took out their system when their backup systems overwhelmed their own backup network during maintenance, causing a massive, long outage. AT&T suffered a cascading 1-800 failure in the '90s, contributed to by their backup systems. The point is that testing backup architectures without knowing the exact modes of failure or system state at the time is impossible. You can regulate that they "test" the system, and they will test it for known conditions and system state, just like I'm sure Delta did. Your regulation would not have changed the outcome. Complexity is the enemy of testing. I get your typical CS grad who says, look how fast I can write this software to do X and Y. Yep, it is fast and easy when it supports one user, one language, one computer, no integrations, no backups, etc. Supporting 10 users gets more complicated, and supporting 1M users is orders of magnitude harder.

            If you regulated that the system must be testable, then you would force them to remove most of the features and rely on basic, simple, slow changing systems functionality at an extreme loss of efficiency.
            Since when did we get rid of the 10,000,000 letter-per-forum-entry restriction? And these are the junior members that I like. Between the lines, they reveal that they are not 19 years old. The last junior member who I confronted with the assumption, ok, you might be half as old as me, really said 'HA HA' (not really).

            For the last, let's say, 10 years, I've intensively tried one or another aviation forum. But why do I regularly return to Jetphotos? Because of Les Abend.
            Oops. No names. But there are forums which are attractive for men who know what an 8086 is, a 486-DX33, a Pentium I, et cetera et cetera et cetera.
            And there are other forums.

            'remove most of the features and rely on basic, simple, slow changing systems' ... that sounds as if you were comparing an A320neo HUD with a 747-200.
            'since they all have software now.'
            -- Well, as long as that does not prevent us from flying. So-called 'computer failure' often happens when 'the user' no longer understands 'the machine'.
            OK, since this topic has moved, my example is also in the right place. One of the pilots had assumed that only one engine was responsible for the smoke. Wherever he got that assumption from, probably from an earlier type, prior to the 737-400.

            Plus, one thing that hasn't been shown on TV, the B734 only has/had one VIB gauge for 2 engines?
            I know that the 744 has four, for four engines, but the 744 -- at least my avatar -- was built after that accident, so the cockpit is less unclear...
            'basic, simple' - easy to understand even under difficult circumstances... Here is what I saw on TV:
            A jet with a VIB problem, back in 1989.
            The German long haul is alive, 65 years and still kicking.
            The Gold Member in the 747 club, 50 years since the first LH 747.
            And constantly advanced, 744 and 748 /w upper and lower EICAS.
            This is Lohausen International airport speaking, echo delta delta lima.



            • #81
              Originally posted by LH-B744 View Post
              Since when did we get rid of the 10,000,000 letter-per-forum-entry restriction?
              Since I've arrived here?

              --- Judge what is said by the merits of what is said, not by the credentials of who said it. ---
              --- Defend what you say with arguments, not by imposing your credentials ---



              • #82
                Originally posted by LH-B744 View Post
                Since when did we get rid of the 10,000,000 letter-per-forum-entry restriction? And these are the junior members that I like. Between the lines, they reveal that they are not 19 years old. The last junior member who I confronted with the assumption, ok, you might be half as old as me, really said 'HA HA' (not really).

                For the last, let's say, 10 years, I've intensively tried one or another aviation forum. But why do I regularly return to Jetphotos? Because of Les Abend.
                Oops. No names. But there are forums which are attractive for men who know what an 8086 is, a 486-DX33, a Pentium I, et cetera et cetera et cetera.
                And there are other forums.

                'remove most of the features and rely on basic, simple, slow changing systems' ... that sounds as if you were comparing an A320neo HUD with a 747-200.
                'since they all have software now.'
                -- Well, as long as that does not prevent us from flying. So-called 'computer failure' often happens when 'the user' no longer understands 'the machine'.
                OK, since this topic has moved, my example is also in the right place. One of the pilots had assumed that only one engine was responsible for the smoke. Wherever he got that assumption from, probably from an earlier type, prior to the 737-400.

                Plus, one thing that hasn't been shown on TV, the B734 only has/had one VIB gauge for 2 engines?
                I know that the 744 has four, for four engines, but the 744 -- at least my avatar -- was built after that accident, so the cockpit is less unclear...
                'basic, simple' - easy to understand even under difficult circumstances... Here is what I saw on TV:
                A jet with a VIB problem, back in 1989.
                Summary: Not a safety issue. Regulation will not achieve the objective.



                • #83
                  And now this...

                  https://www.yahoo.com/news/m/1aea5a5...tch-leads.html

                  Hopefully the cowboy pilots pushing to make up for the delays and cutting safety corners will not kill anyone.

                  In other news, they seem to be having a tough time with push-backs at EWR: https://www.yahoo.com/news/m/64acecb...at-newark.html
                  The rules of basic aviation discourage long periods of hard pulling up.



                  • #84
                    Originally posted by 3WE View Post
                    https://www.yahoo.com/news/m/1aea5a5...tch-leads.html

                    Hopefully the cowboy pilots pushing to make up for the delays and cutting safety corners will not kill anyone.

                    In other news, they seem to be having a tough time with push-backs at EWR: https://www.yahoo.com/news/m/64acecb...at-newark.html
                    But let's not do anything about this predictable and recurrent issue. Because it's not an issue. Because the airline trusts will take care of it. Because they have our safety and well-being in mind. Because when you run an obese enterprise tied together by a network made of spiderwebs and magic, what could go wrong. In the meantime, you'll have to be patient because the first order of business is to feed the shareholders. They were ahead of you. Please step aside and let them through.



                    • #85
                      Originally posted by Evan View Post
                      But let's not do anything about this predictable and recurrent issue. Because it's not an issue. Because the airline trusts will take care of it. Because they have our safety and well-being in mind. Because when you run an obese enterprise tied together by a network made of spiderwebs and magic, what could go wrong. In the meantime, you'll have to be patient because the first order of business is to feed the shareholders. They were ahead of you. Please step aside and let them through.
                      If it was predictable, it wouldn't have happened. If you read the details, not all flights were affected, just a small number. Regulation would not have changed the outcome, and last time I checked, big storms cause a lot more delays than that. Plus, where is there a single bit of evidence of a safety issue here? Delays happen all the time, and they had better not be a safety issue.



                      • #86
                        Originally posted by Evan View Post
                        But let's not do anything about this predictable and recurrent issue. Because it's not an issue. Because the airline trusts will take care of it. Because they have our safety and well-being in mind. Because when you run an obese enterprise tied together by a network made of spiderwebs and magic, what could go wrong. In the meantime, you'll have to be patient because the first order of business is to feed the shareholders. They were ahead of you. Please step aside and let them through.
                        There is an old Russian saying which can be translated as "if what I say disagrees with the facts, it's all the worse for the facts".



                        • #87
                          Originally posted by Schwartz View Post
                          If it was predictable, it wouldn't have happened.
                          That goes against everything we have learned from aviation disasters over the past decade.

                          I realize it wasn't a total meltdown this time, but that's kind of like saying a 2.0 earthquake doesn't mean there's a major fault line down there somewhere. I call this a warning sign. Which will almost certainly be ignored in favor of procrastination and blind confidence.



                          • #88
                            Originally posted by ATLcrew View Post
                            There is an old Russian saying which can be translated as "if what I say disagrees with the facts, it's all the worse for the facts".
                            Any fluffy animals involved?
                            The rules of basic aviation discourage long periods of hard pulling up.



                            • #89
                              Originally posted by Evan View Post
                              I realize it wasn't a total meltdown this time, but that's kind of like saying a 2.0 earthquake doesn't mean there's a major fault line down there somewhere. I call this a warning sign. Which will almost certainly be ignored in favor of procrastination and blind confidence.
                              Based on what evidence? Sounds like they had a manual backup plan in place to deal with it.



                              • #90
                                I was recently reminded how, before every flight, a nasty, noisy, ANCIENT dot-matrix printer makes a roughly 12-ft-long printout of Lord knows what all information (like the pilots really read all that?). (And I guess that would solve TeeVee's mystery of what happens to all that crap the check-in agents type into the computer.)

                                After seeing that, it became clear that we must do something!
                                The rules of basic aviation discourage long periods of hard pulling up.

