I’ve had a couple of opportunities this week to reflect on the reliability of complex systems. None of these thoughts are especially profound, but they’re nonetheless interesting.
Earlier in the week, a line of powerful thunderstorms rolled through the DC area. At one point, over 25% of the region was without power. Most of the outages were brief, though there are a few places that are still awaiting power restoration two days later.
My family’s publishing business lost power for only a few seconds. But about an hour after the storm passed through, Internet service (delivered via a T1 line) failed. This was a little surprising, since they had power.
The long version of this story involves a great deal of incompetence and finger-pointing between no fewer than three telecom carriers. Normal behavior, in other words. But the root cause of the problem turned out to be pretty interesting. Somewhere between their office and Verizon’s central office, the T1 line is multiplexed using a remote equipment cabinet. This cabinet requires its own power. It’s battery-backed, but there’s no generator. And the batteries apparently only last a couple of hours. Once the batteries are discharged, every customer downstream of the cabinet (whether a data customer or simply an ordinary telephone user) will lose service.
A telephone company central office is a masterpiece of conservative engineering. Power is supplied by two completely separate sets of massive 48V batteries. Each set is sufficient to power the entire office. The batteries are constantly being charged by commercial power. Should the power fail, the batteries can carry the whole office for many hours–but in practice they just have to run for the few minutes necessary for a generator to come online.
This kind of reliability isn’t cheap. A switch installation can easily cost upwards of $20 million. Downtime is measured in seconds per year. And, of course, the customers of the telephone companies pay for all of this reliability in the end.
Unfortunately, all of this reliability isn’t going to do a whole lot of good if there are intermediate components that are engineered to much lower standards. And that seems to be just what’s happened.
The FCC seems to have noticed the issue. In October 2007, they issued rules that require carriers to have at least 24 hours of backup power at the central office, and at least eight hours for remote equipment. While the rules aren’t completely toothless, it looks like there’s a lot of room for carriers to maneuver: it seems to be good enough, for example, to deploy equipment that’s merely designed to meet the longevity requirement. I suspect that in practice that’ll be a lot like that laptop battery that’s designed to run the computer for four hours, but can barely manage two and a half.
I’m not trying to make the argument that the current state of affairs is necessarily bad. There’s a reasonable argument to be made that the telephone network has been overengineered at great expense. But it’s definitely true that the standards of reliability have changed, and that it’s no longer reasonable to assume that customer equipment is the weak link in the service-delivery chain.
My second weak-link experience this week also involved my family’s business.
As a publisher, they use Quark XPress layout software extensively. The cool kids are all using Adobe InDesign now, and my family’s office will eventually switch too. But they’ve got a substantial investment in training and templates, so for the moment they’re still using Quark.
Quark offers a network licensing model where you can install the software on as many machines as you like, but only N of those instances can be active at any one time. This is enforced by license management software running on a server.
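The mechanics of such a floating-license scheme are simple to sketch. The following is a hypothetical illustration of the check-out/check-in bookkeeping a license server would perform (the class and method names are my own invention, not Quark’s actual protocol):

```python
# Minimal sketch of a floating ("concurrent") license server: the
# software may be installed anywhere, but at most max_seats copies
# may hold a seat at any one time.

class LicenseServer:
    def __init__(self, max_seats):
        self.max_seats = max_seats
        self.active = set()  # workstation IDs currently holding a seat

    def check_out(self, workstation_id):
        """Grant a seat if one is free; deny otherwise."""
        if workstation_id in self.active:
            return True  # this workstation already holds a seat
        if len(self.active) >= self.max_seats:
            return False  # all N seats are in use
        self.active.add(workstation_id)
        return True

    def check_in(self, workstation_id):
        """Release a seat when the application exits."""
        self.active.discard(workstation_id)

server = LicenseServer(max_seats=2)
assert server.check_out("mac-01")      # seat 1 of 2 granted
assert server.check_out("mac-02")      # seat 2 of 2 granted
assert not server.check_out("mac-03")  # denied: pool exhausted
server.check_in("mac-01")
assert server.check_out("mac-03")      # a freed seat is granted again
```

Note that the entire scheme hinges on the server being reachable and well-behaved at check-out time, which is exactly where the trouble described below comes from.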
I’m not a fan of software-enforced licenses. We reluctantly put up with this one because it offers some benefit and it doesn’t depend on having the license server “phone home” in order to keep the software running.
Quark XPress has quite a reputation as a bug-ridden and strange piece of software, but most of the difficulties are really pretty minor. Over the last couple of years, we’ve had serious problems with Quark XPress maybe a dozen times. 100% of these problems have been caused by the license manager.
I had another such problem today. Quark’s tech support was actually pretty good (far, far better than in the past), but the fact remains that the license manager is a persistent source of trouble. When it fails, very little work gets done.
There’s probably no way to avoid this short of licensing the software another way. License management software, by design, has to strictly enforce a set of conditions–and good software design practice (considered in isolation) says that it should fail closed, with failures resulting in denial of service. This is indeed what happens.
In both instances (the Verizon outage and the Quark license manager failure), we lost service due to a weak link in an otherwise-robust chain. From a customer’s perspective, it’s irritating. From the designer’s perspective, it’s instructive to consider the weak-link possibilities when constructing a new system in order to avoid creating this kind of irritation.