North American Network Operators Group


Re: Quick question.

  • From: John Underhill
  • Date: Sun Aug 01 15:09:49 2004

> If a CPU dies, it's unlikely to come back up without removing the bad
> CPU, especially if the CPU has become unreliable rather than dying
> completely. Even if CPU 0 is good and the BIOS has no problems
> booting the OS, the SMP aware OS will quite probably hit problems
> with the bad CPU.

Not necessarily. There have been a number of innovations in recent years in
the area of integrated fault tolerance, including BIOS-level controls over
component monitoring and management. Some of the more upscale Compaq G3
servers, for instance, can remove a processor from operation if it exceeds a
threshold of critical errors (this is also true for memory).
Alphas can boot even if the bootstrap processor fails at system start, by
simply selecting the next available processor. They also have hot-swap
processor capabilities (again, for the time being, only on upscale models).
Add to this features like hot-swap 'RAID memory' and PCI, redundant power
supplies, fans, and drives, and systems can be made to withstand many common
component failures with little or no interruption in service.
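
To make the idea concrete, here is a rough sketch of the threshold and failover behaviour described above. It is a toy model only, not how any vendor's firmware actually works: the class name, method names, and threshold value are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CpuMonitor:
    """Toy model of threshold-based processor removal (hypothetical names)."""
    threshold: int                                # critical errors tolerated per CPU
    errors: dict = field(default_factory=dict)    # cpu id -> error count
    offline: set = field(default_factory=set)     # cpus removed from operation

    def record_error(self, cpu: int) -> bool:
        """Count one critical error; return True if this error took the CPU offline."""
        if cpu in self.offline:
            return False
        self.errors[cpu] = self.errors.get(cpu, 0) + 1
        if self.errors[cpu] > self.threshold:
            self.offline.add(cpu)
            return True
        return False

    def pick_boot_cpu(self, cpus):
        """Return the first CPU still in operation, or None if all are offline."""
        for cpu in cpus:
            if cpu not in self.offline:
                return cpu
        return None


monitor = CpuMonitor(threshold=2)
for _ in range(3):
    monitor.record_error(0)            # the third error exceeds the threshold
boot_cpu = monitor.pick_boot_cpu([0, 1])
print(boot_cpu)
```

In this sketch CPU 0 is dropped once it exceeds the threshold, and the next available processor is chosen for boot.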
With the advent of technologies like hyperthreading, manufacturers are being
driven by market demands to create more reliable SMP drivers, and I think it
is likely that simultaneous multi-threading will eventually become the
standard.


> > a duallie will keep the system up when a faulty process hogs 100%
> > CPU, because the second one is still available. That also increases
> > availability ratio.

Well, it depends. The real differentiator is whether the system is truly
'symmetric', that is, dual processor, I/O, and memory buses. If both processors
share the same resources, competition between processors for regions of
memory and for locks on the PCI bus severely constrains the resources
available to each processor. So if a process runs amok on a single-bus
architecture, the second processor will not have the resources it needs
to run effectively.
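
A back-of-the-envelope way to see this, as a toy model only: assume a single shared bus with a fixed bandwidth that is split in proportion to demand when oversubscribed. The proportional split, the function name, and the numbers are my assumptions, not measurements of any real system.

```python
def bandwidth_share(bus_bandwidth, demands):
    """Split a shared bus among processors (hypothetical model).

    If total demand fits within the bus, every processor gets what it asks for.
    Otherwise each demand is scaled down proportionally, which is an assumed
    arbitration policy chosen only to keep the arithmetic simple.
    """
    total = sum(demands)
    if total <= bus_bandwidth:
        return list(demands)
    scale = bus_bandwidth / total
    return [d * scale for d in demands]


# A runaway process on CPU 0 wants four times what the bus can carry;
# CPU 1 only wants half the bus.
shares = bandwidth_share(100, [400, 50])
print([round(s, 1) for s in shares])
```

Under these assumptions the second processor ends up with only a fraction of the bandwidth it asked for, even though it is otherwise idle and available.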

> application is not going to take down the machine on any modern OS[2]
> and anyway can be dealt with with resource limits, SMP or not,
> presuming your OS supports resource limits.
>
> The real problem with SMP is kernel complexity. Drivers that are rock
> solid in single-processor can have bugs that are only triggered under
> SMP. Threaded applications can also become unreliable on SMP systems.
>
> The extra power of an SMP system might be a bonus, but trying to
> argue their benefits on the basis of reliability is misguided.
>
> > Michel.
>
> 1. Now, they may still be very reliable, and more than reliable
> enough for your needs, but they are still not as reliable as the
> exact same machine with terminators in all CPU sockets/slots bar one
> ;) The fault-tolerant systems are outrageously expensive.
>
> 2. Unless you're running MacOS 9 or Windows 3.11 on your server.. -
> don't think either supports SMP though ;).
>
> regards,
> -- 
> Paul Jakma [email protected] [email protected] Key ID: 64A2FF6A
> Fortune:
> A Linux machine! because a 486 is a terrible thing to waste!
> (By [email protected], Joe Sloan)