North American Network Operators Group


Re: DACS blamed for MCI / CSX train problems last week

  • From: Joe Abley
  • Date: Sat May 06 02:52:14 2000

[apologies in advance for the pitiful layout that Outlook Express will no
doubt inflict on you after I click send]

-----Original Message-----
From: Sean Donelan <[email protected]>
To: [email protected] <[email protected]>
Date: Saturday, May 06, 2000 3:04 PM
Subject: DACS blamed for MCI / CSX train problems last week


>This isn't the first DACS problem nor is MCI the only carrier affected
>by DACS problems.  Bell Atlantic had a multi-day DACS problem, I've
>experienced 18+ hour MCI DACS problems in the past.  What is it about
>DACS systems which seem to lead to such catastrophic problems?

I don't believe there is anything inherently unreliable in a DACS (we call
them Digital Cross-Connects, or DCCs, down here). In fact, most DCCs I have
seen are extraordinarily redundant and would probably quite happily continue
to function at 100% after a good seeing-to with an axe. The problem lies
elsewhere.

"Modern" telecommunications equipment seems to come equipped with two
configuration paradigms:

[1] will usually be a vendor-supplied Unix solution, probably running on Sun
or HP hardware, which provides a full database back-end and a simple
point-and-click interface for end-to-end service management. This is what
the engineering planning people see when the system is demonstrated to them,
and this obvious money-saving wonder is what will justify the inflated
budget required to purchase it.

[2] will probably be 9600 bps terminal access to a protracted series of
cryptic menus, about sixteen levels deep, through which the only practical
way to navigate is to type ahead long strings such as
"1,4,9,13-17,3,3,1" and then walk away for ten minutes to drink coffee.

[1] will usually stop working properly about four weeks after the vendors
disappear from site. Various technicians who have either a vague idea of the
network technology, or a self-taught-at-home expertise in Linux, will
attempt to fix the problem. The IT department will be called in, and will
limp through a SunOS 4.1.3 install from QIC tape, and may even get as far as
completing a successful reinstallation of the vendor's management system (at
which point it will become clear that the back-end is an embedded version of
Informix, for which the licence keys are either missing or have mysteriously
transformed themselves into one-week demo licences for some completely
unrelated product).

Hence [1] will quickly cease to be a useful tool for provisioning services.

Of course, the business doesn't stop just because one management system is
broken; we always have [2]. [2] is extremely vulnerable to radical
accidental reprogramming of the network due to caffeine-induced
finger-shake, but quickly becomes the preferred tool for programming the
network among a certain subset of the provisioning technicians.

Those who discover that they can script routine tasks using shareware
terminal packages will also become able to perform radical accidental
reprogramming of the network whilst standing on the keyboard, reaching back
to tape laser-printed Dilbert cartoons or semi-pornographic calendars to
cubicle walls.

Any change using [2] will naturally never be reflected in the database of
[1]. This fact will be recognised whenever someone manages to get [1]
temporarily back on its feet, and will trigger a seemingly endless series of
network audit projects, which will never succeed due to the underground
popularity of [2] and the fact that anybody competent enough to perform a
network audit does their very best to escape from the project at the
earliest opportunity, leaving no indication whatsoever of what they have
found out so far.

The unreliability of any network records means that any troubleshooting is
likely to involve far more random guesswork than is healthy, and the
probability of a cascade of outages following routine maintenance on a minor
issue is much higher than you would expect.

On the other hand, it could be simply that marketing departments have
observed that customers are happy with the phrase "DACS outage" and
therefore use it as a generic term to describe any incident which might
otherwise cause a customer to complain :)