North American Network Operators Group


RE: I think I jinxed Sprint

  • From: Roeland Meyer
  • Date: Fri Nov 24 12:05:15 2000

> From: Sean Donelan [mailto:[email protected]]
> Sent: Friday, November 24, 2000 2:44 AM
> 
> Last week I gave Sprint some compliments on their success avoiding
> customer service affecting fiber cuts.
> 
> Unfortunately, Murphy took it as a challenge.  First, Sprint lost an
> STP power supply which blocked SS7 service in Sprint's southeastern
> network for 2 hours and 52 minutes.  

After all the discussion that we had on the Datacenter list, I am surprised
at this. You'd think that they'd have redundant power supplies fed by
redundant UPSes.

> Then on Tuesday, Murphy killed
> a disk drive on Sprint's SCP, blocking Sprint's nationwide network
> for 4 minutes until it could be taken out of service.

With the radical reduction in the cost of any form of RAID these days, I am
surprised that a single disk-drive failure was able to do this. I'm even
putting IDE RAID1 on critical workstations now (Promise FastTrak66 and
2x10GB IDE drives; 3ware makes one with Linux drivers, and IBM has Linux
drivers for their high-end RAID controllers as well [works on 3090's]). Of
course, if they really insist on paying $money$ then they could spec EMC ...

> Maintaining 99.999% network availability is hard for any network,
> telephone or the Internet.  But sometimes I wonder what the real
> requirement is.  The Australian stock exchange went down for a few
> hours, it wasn't the end of the world.  Sprint had some more blocked
> calls than normal, most people didn't notice.

Telephony and internet are substantially different, as are uptime and
availability. For the telcos, they have no choice. Local public utilities
commissions set the requirement for them and they negotiate how to measure
it. Telcos are good at this sort of negotiation (what the meaning of "is"
is; in this case, what the meaning of "99.999%" is. There is an amazing
amount of variance <g>).
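
To put rough numbers on "99.999%", here is a minimal sketch. It assumes the
figure is measured over a 365-day year against one network-wide total, which
is exactly the kind of detail that gets negotiated; the function names are
mine, for illustration only.

```python
# Downtime budget implied by an availability target, and the availability
# implied by an outage. Assumes a 365-day measurement period and that every
# outage minute counts in full (a real tariff may weight by region or traffic).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of outage allowed in the period at the given availability."""
    return (1.0 - availability) * period_minutes


def availability_from_downtime(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability achieved if the period contained this much outage."""
    return 1.0 - downtime_minutes / period_minutes


# Five nines allows a little over five minutes per year.
five_nines_budget = downtime_budget_minutes(0.99999)

# The two outages from the quoted message.
stp_outage = 2 * 60 + 52  # STP power supply: 2 hours 52 minutes
scp_outage = 4            # SCP disk drive: 4 minutes

print(f"five-nines budget: {five_nines_budget:.2f} min/year")
print(f"STP outage alone:  {availability_from_downtime(stp_outage):.5f}")
print(f"SCP outage alone:  {availability_from_downtime(scp_outage):.6f}")
```

Counted in full, the STP outage by itself would use up roughly thirty years
of five-nines budget. Counted as a regional event against a nationwide total,
it shrinks considerably, which is where the variance comes from.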

For the internet, I see an amazing number of systems with no redundancy
whatsoever. Of course, the first hardware failure usually corrects the
problem, at the cost of substantial downtime. But many second-tier ISPs and
dot-coms are still operating on brand-new equipment that hasn't started
hitting its MTBF specs yet, and they don't have a clue about their MTTR
ratings. In the next few years, I expect to see a lot more failures, as the
equipment starts to age.
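
For what MTBF and MTTR do to availability, a rough sketch using the usual
steady-state approximation, availability = MTBF / (MTBF + MTTR). The hour
figures are hypothetical, picked only for illustration, and the redundancy
formula assumes the units fail independently, which shared power or a shared
controller would violate.

```python
# Steady-state availability from MTBF and MTTR, with and without redundancy.
# All numeric inputs below are hypothetical examples, not vendor figures.


def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a single repairable unit is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(unit_availability, units=2):
    """Availability of N units in parallel, assuming independent failures:
    the group is down only when every unit is down at once."""
    return 1.0 - (1.0 - unit_availability) ** units


# Same hypothetical box, two repair regimes: spare on the shelf (4 hours)
# versus waiting on a shipment (48 hours).
fast_repair = steady_state_availability(50_000, 4)
slow_repair = steady_state_availability(50_000, 48)

# Mirroring the slow-repair box, under the independence assumption.
mirrored = redundant_availability(slow_repair, units=2)

print(f"MTTR  4h: {fast_repair:.5f}")
print(f"MTTR 48h: {slow_repair:.5f}")
print(f"MTTR 48h, mirrored: {mirrored:.7f}")
```

The point is that MTTR moves the result as much as MTBF does: with no spare
and no redundancy, one failure on otherwise good hardware is enough to drop
below three nines for the year.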

> Are we setting artificial performance requirements, which don't reflect
> reality?  Either in what can be achieved, or is necessary.

The internet is a lot less forgiving wrt outages than the telco. The telco
can have a circuit outage, re-route to another circuit, and the customer
never sees an availability gap. Also, a total outage, during reduced traffic
times, and no customer ever misses a dial-tone because they aren't trying to
get one, is not an outage in telco terms. The internet, on the other hand,
may have similar issues, unless we start talking streaming video, streaming
audio, and voice over IP. In those cases, packet losses can make a serious
mess of things. Also, congestion is treated differently between the two
systems. Telcos will actually return a fast-busy when a switch becomes
congested. The internet simply starts dropping packets. You can actually
hear the latter when using www.dialpad.com or MS-Netmeeting (both of which
I use extensively).