North American Network Operators Group

RE: Quick question.

  • From: Paul Jakma
  • Date: Sun Aug 01 13:09:29 2004

On Sun, 1 Aug 2004, Michel Py wrote:

True; this would be like raid-0 arrays, the more disks the greater the chance of failure.
This holds true for most RAID-x levels.
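As a rough illustration of the point about disk counts, here is a minimal sketch assuming disks fail independently and each has the same failure probability over some period. The 3% figure is made up for the example, not a measured value.

```python
def any_disk_fails(p_disk: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks fails.

    For RAID-0 this equals the probability of losing the array, since
    a single disk failure destroys the stripe set.
    """
    return 1.0 - (1.0 - p_disk) ** n_disks


# Hypothetical per-disk failure probability over the period of interest.
P_DISK = 0.03

for n in (1, 2, 4, 8):
    print(f"{n} disk(s): {any_disk_fails(P_DISK, n):.4f}")
```

For redundant RAID levels the same function gives the chance of having to deal with some disk failure, not the chance of losing data.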

In other words, I don't really care if the second processor reduces the MTBF from 200k hours to 60k hours, but I do care if the second processor reduces the time to restore service from 24 hours to 20 minutes (7.5 minutes for SNMP to fail the query twice, 1.5 minutes for the tech to find out that either it's frozen or there's a BSOD, 6 minutes to have someone go there and reset, 5 minutes to reboot).
If a CPU dies, it's unlikely to come back up without removing the bad CPU, especially if the CPU has become unreliable rather than dying completely. Even if CPU 0 is good and the BIOS has no problems booting the OS, the SMP aware OS will quite probably hit problems with the bad CPU.
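The arithmetic behind the quoted trade-off can be sketched as follows, taking the quoted MTBF and restore-time figures at face value and using the usual steady-state approximation availability = MTBF / (MTBF + MTTR). The function name is illustrative only. Whether a reset really restores service when a CPU has gone bad is exactly what is disputed above, so treat the second figure as the optimistic case.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Restore-time breakdown from the quoted message, in minutes:
# SNMP detection + tech diagnosis + walk over and reset + reboot.
restore_minutes = 7.5 + 1.5 + 6 + 5

single_cpu = availability(200_000, 24)
dual_cpu = availability(60_000, restore_minutes / 60)

print(f"restore time:        {restore_minutes} minutes")
print(f"single CPU, 24 h:    {single_cpu:.6%}")
print(f"dual CPU, 20 min:    {dual_cpu:.6%}")
```

If the failed CPU prevents a clean reboot, the time to restore for the dual-CPU case goes back toward the 24-hour figure while the MTBF stays at the lower value, which makes the dual-CPU box the worse of the two.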

If you really want to guard against CPU failures, you need a machine designed for fault-tolerance, not a "cheap" SMP box; those are just *less* reliable.[1]

The dead processor still has to be replaced, but this is scheduled maintenance, not outage. A little extra ammo when you have to hunt five or six nines.
Just tape a spare CPU to the inside of the box if time-to-repair is important. Even better, just have a second system on standby.

Insignificant in my experience, and does not balance what Alexei mentioned yesterday:
Alexei is talking about something else.

A duallie will keep the system up when a faulty process hogs 100% CPU, because the second one is still available. That also increases the availability ratio.
This is a resource problem, not an availability problem. A spinning application is not going to take down the machine on any modern OS[2], and in any case it can be dealt with using resource limits, SMP or not, presuming your OS supports them.
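To make the resource-limit point concrete, here is a minimal sketch for a Unix-like OS using Python's standard `resource` and `subprocess` modules. The helper name `run_with_cpu_limit` and the one-second limit are assumptions for the example; a real deployment would more likely set limits through the shell's `ulimit` or the service manager.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(argv, cpu_seconds):
    """Run a command with a cap on the CPU time it may consume (Unix only)."""

    def set_limit():
        # Runs in the child just before exec. Once the process has used
        # this much CPU time, the kernel terminates it with a signal
        # (SIGXCPU or SIGKILL, depending on the platform).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(argv, preexec_fn=set_limit)


# A deliberately spinning child process, capped at one CPU-second.
result = run_with_cpu_limit([sys.executable, "-c", "while True: pass"], 1)

# On POSIX, a negative return code means the child was killed by a signal.
print("child return code:", result.returncode)
```

The limit applies per process regardless of how many CPUs the machine has, which is the sense in which this is a resource question and not an SMP one.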

The real problem with SMP is kernel complexity. Drivers that are rock solid on single-processor systems can have bugs that are only triggered under SMP. Threaded applications can also become unreliable on SMP systems.

The extra power of SMP systems might be a bonus, but trying to argue their benefits on the basis of reliability is misguided.

Michel.
1. Now, they may still be very reliable, and more than reliable enough for your needs, but they are still not as reliable as the exact same machine with terminators in all CPU sockets/slots bar one ;) The fault-tolerant systems are outrageously expensive.

2. Unless you're running MacOS 9 or Windows 3.11 on your server... I don't think either supports SMP, though ;).

regards,
--
Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
Fortune:
A Linux machine! because a 486 is a terrible thing to waste!
(By [email protected], Joe Sloan)