North American Network Operators Group

Re: Operate until failure

  • From: Bennett Todd
  • Date: Mon Jan 08 19:35:14 2001

2001-01-08-17:35:49 Sean Donelan:
> [...] You most likely don't want to shut down based on any
> automatic signal.  However, you do want a way for an operator to
> gracefully shut down a lot of equipment quickly when the decision
> is made.
>
> For a server farm, with potentially thousands of individual
> systems, is there any standard piece of software you can install
> on all of the systems to act as a receiver of a signal to begin a
> graceful shutdown that does not depend on a vendor's proprietary
> interface?  Preferably one which does not involve running a lot
> of additional wires.

I've got my own preference; when running even mere dozens of
machines in a tightly coordinated farm, I want the ability to manage
them all very quickly and easily, so I use a script I wrote for
parallel execution of a command. It takes a command-line and
operates on it with controlled parallelism; it's available at
<URL:http://people.oven.com/bet/multicmd>.

I'll install that on an admin server, which will be a very very
tightly secured machine indeed, since the account which I use on
that machine will have an ssh key, with no passphrase, that's
accepted for running root commands on every machine in the farm.
Given that setup, the answer to your question requires only a
pre-built list of the hostnames or IP addresses of the machines to
halt, at which point it's something along the rough lines of

	multicmd ssh \$1 'sh -c "sleep 10;halt" >&- 2>&- <&- &' \
		<hostlist

or thereabouts; after it's tested I'd save this, and any other
useful invocations, in scripts so I don't have to remember 'em.

For thousands of machines, the default options (10-at-a-time
parallel, 1-second delay between launches) wouldn't be quick: the
launch delay alone costs a second per host, so 2000 machines would
take over half an hour. For an emergency halt program, I'd probably
raise the parallelism to whatever my local system could handle well,
and drop the inter-command delay to maybe 0.01 sec. And of course
for platforms with software-controllable power switches, the "halt"
could be replaced with an invocation that would power the boxes
down.
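
For anyone without multicmd to hand, the same idea (a host list fanned
out with a cap on parallelism) can be roughly sketched with xargs.
This is not multicmd and makes no claim about its options: the
function name run_on_hosts, its argument order, and the {} placeholder
convention are assumptions made for illustration. The -P flag is a
common GNU/BSD xargs extension rather than strict POSIX, and this
sketch has no inter-launch delay.

```sh
#!/bin/sh
# Sketch only: run_on_hosts is a hypothetical helper, not multicmd.
# Usage: run_on_hosts HOSTLIST MAX_JOBS COMMAND [ARGS...]
# Runs COMMAND once per line of HOSTLIST, with every {} in ARGS
# replaced by that line, keeping at most MAX_JOBS running at once.
run_on_hosts() {
    roh_hostlist=$1
    roh_max_jobs=$2
    shift 2
    # Skip blank lines, then hand one host per invocation to xargs.
    # -I {} substitutes the host into the command; -P caps parallelism.
    grep -v '^[[:space:]]*$' "$roh_hostlist" |
        xargs -I {} -P "$roh_max_jobs" "$@"
}

# Example (commented out on purpose, since it halts machines):
# run_on_hosts hostlist 50 ssh -n -o BatchMode=yes {} \
#     'sh -c "sleep 10; halt" >/dev/null 2>&1 </dev/null &'
```

As with the multicmd line, a harmless command such as "echo {}" is the
thing to try first, before pointing this at anything that halts.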

-Bennett
