North American Network Operators Group
RE: I think I jinxed Sprint
> From: Sean Donelan [mailto:[email protected]]
> Sent: Friday, November 24, 2000 2:44 AM
>
> Last week I gave Sprint some compliments on their success avoiding
> customer service affecting fiber cuts.
>
> Unfortunately, Murphy took it as a challenge. First, Sprint lost an
> STP power supply which blocked SS7 service in Sprint's southeastern
> network for 2 hours and 52 minutes.

After all the discussion that we had on the Datacenter list, I am surprised at this. You'd think that they'd have redundant PS's with redundant UPS's.

> Then on Tuesday, Murphy killed
> a disk drive on Sprint's SCP, blocking Sprint's nationwide network
> for 4 minutes until it could be taken out of service.

With the radical reduction in the cost of any form of RAID these days, I am surprised that a single disk-drive failure was able to do this. I'm even putting IDE RAID1 on critical workstations these days (Promise FastTrak66 and 2x10GB IDE drives. 3ware makes one with Linux drivers, and IBM has Linux drivers for their high-end RAID controllers as well [works on 3090's]). Of course, if they really insist on paying $money$ then they could spec EMC ...

> Maintaining 99.999% network availability is hard for any network,
> telephone or the Internet. But sometimes I wonder what the real
> requirement is. The Australian stock exchange went down for a few
> hours, it wasn't the end of the world. Sprint had some more blocked
> calls than normal, most people didn't notice.

Telephony and internet are substantially different, as are uptime and availability. The telcos have no choice. Local public utilities commissions set the requirement for them, and they negotiate how to measure it. Telcos are good at this sort of negotiation (what the meaning of "is" is; in this case, what the meaning of "99.999%" is. There is an amazing amount of variance <g>).

For the internet, I see an amazing number of systems with no redundancy whatsoever. Of course, the first hardware failure usually corrects the problem, at the cost of substantial down-time. But many second-tier ISPs and dot-coms are still operating on brand-new equipment that hasn't started hitting its MTBF specs yet, and they don't even have a clue about their MTTR ratings. In the next few years, I expect to see a lot more failures as the equipment starts to age.

> Are we setting artificial performance requirements, which don't reflect
> reality? Either in what can be achieved, or is necessary.

The internet is a lot less forgiving wrt outages than the telco. The telco can have a circuit outage, re-route to another circuit, and the customer never sees an availability gap. Also, a total outage during a low-traffic period, when no customer misses a dial-tone because nobody is trying to get one, is not an outage in telco terms. The internet may get away with similar issues, unless we start talking streaming video, streaming audio, and voice over IP. In those cases, packet losses can make a serious mess of things.

Also, congestion is treated differently between the two systems. Telcos will actually return a fast-busy when a switch becomes congested. The internet simply starts dropping packets. You can actually hear the latter when using www.dialpad.com or MS-Netmeeting (both of which I use extensively).
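
For anyone who wants to put numbers on the "99.999%" and MTBF/MTTR points above, here is a rough Python sketch. It assumes the naive reading of availability (every outage minute counts in full against a 365-day year), which is exactly the part the telcos negotiate, so treat it as illustration only. The function names and the MTBF/MTTR figures are made up for the example; the outage durations are the ones quoted from Sean's message.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of outage per year allowed at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic steady-state estimate: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Five nines leaves roughly 5.26 minutes of outage per year.
budget = downtime_budget_minutes(0.99999)

# Outage durations quoted in the thread.
stp_outage_minutes = 2 * 60 + 52   # regional SS7 outage: 172 minutes
scp_outage_minutes = 4             # nationwide SCP outage

# How many years of five-nines budget each outage would use up,
# under the naive "every minute counts in full" assumption.
stp_years_of_budget = stp_outage_minutes / budget
scp_years_of_budget = scp_outage_minutes / budget

# Hypothetical box: 50,000-hour MTBF, 4 hours to repair or replace.
example_availability = steady_state_availability(50_000, 4)

if __name__ == "__main__":
    print(f"five-nines budget: {budget:.2f} min/year")
    print(f"STP outage uses {stp_years_of_budget:.1f} years of budget")
    print(f"SCP outage uses {scp_years_of_budget:.2f} years of budget")
    print(f"example availability: {example_availability:.6f}")
```

The MTTR term is the one that bites: with no spare on the shelf, repair time is measured in days rather than hours, and the same formula gives a much worse figure for identical hardware.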