Re: OS, Hardware, Network - Logging, Monitoring, and Alerting

North American Network Operators Group

From: Adam Armstrong
Date: Fri Jun 27 13:50:57 2008

Mike wrote:

You can do most of this with Cacti out of the box. You can also add
the thold and monitoring plugins to get the additional things you
need. Cacti mainly uses SNMP, but you can also use external scripts to
gather information. It does not have future trending capabilities (that I
am aware of), but it can evaluate against baseline thresholds using the
thold plugin.

The Cacti community has created templates and add-ons for the most common network vendors and system types.
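
To make the "evaluate against baseline thresholds" idea concrete, here is a minimal sketch in Python. It is not how the thold plugin is implemented; the function name, the 20% tolerance, and the sample values are all assumptions made up for illustration. It simply compares a new reading against the mean of earlier readings.

```python
from statistics import mean

def breaches_baseline(history, current, tolerance_pct=20.0):
    """Return True if `current` deviates from the mean of `history`
    by more than `tolerance_pct` percent (hypothetical rule)."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    if baseline == 0:
        return current != 0  # any non-zero value deviates from a zero baseline
    deviation_pct = abs(current - baseline) / abs(baseline) * 100.0
    return deviation_pct > tolerance_pct

# Hypothetical samples: inbound traffic on one port, in Mbit/s (mean is 100).
samples = [100, 110, 90, 105, 95]
print(breaches_baseline(samples, 108))  # 8% off the mean, within tolerance
print(breaches_baseline(samples, 180))  # 80% off the mean, breaches
```

A real deployment would probably want a rolling window and separate high/low limits; the single symmetric tolerance here is only the simplest case.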

Cacti does graphs, but it's really just not useful enough for me. Neither was Nagios (on top of being a nightmare to configure). I found similar issues with other comparable solutions such as OpenNMS and JFFNMS. I generally used Cricket with the config-generation tool for graphing devices and ports; Cacti was prettier, but IMO slightly more complex than necessary.

Observer is intended to be autodiscovering, with as little manually configured as possible. This has made a few things quite hard to do properly, like alerting. It was written firstly to discover the network, secondly to graph and log it, and thirdly to alert you when it breaks. Unfortunately it turns out that I can't get my head around the alerting bit, so it remains a little unfinished!

My personal opinion is that all of the FOSS NMS solutions are sorely disappointing, Observer included. It seems to be something that no one has quite gotten right yet!

Adam.

On Fri, Jun 27, 2008 at 11:42 AM, Adam Armstrong wrote:

Rev. Jeffrey Paul wrote:

Hi.  I've a (theoretically) simple problem and I'm wondering how others
solve it.

I've recently deployed ~40 Linux instances on ~20 different Dell blades
and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
assorted switchable PDUs and whatnot.
We need to monitor standard things like cpu, memory, disk usage on all
OSes.  This is straightforward with net-snmp.  It would also be cool if
I could monitor more esoteric things, like ntp synchronization status,
i/o statistics, etc.
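
As one possible take on the "ntp synchronization status" check, the sketch below parses text in the format printed by `ntpq -p`, where a leading `*` marks the peer currently selected for synchronization. The function name, the sample hostnames, and the sample addresses are assumptions for illustration; how the result gets exported (for example through a script hooked into the SNMP agent) is left out.

```python
def ntp_sync_peer(ntpq_output):
    """Return the name of the selected sync peer, or None if there is none.

    Expects text in the format printed by `ntpq -p`, where the line for
    the selected system peer starts with '*'.
    """
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            fields = line[1:].split()
            if fields:
                return fields[0]
    return None

# Hypothetical sample output; hostnames and addresses are placeholders.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.net 192.0.2.10       2 u   33   64  377    1.234    0.456   0.078
+ntp2.example.net 192.0.2.11       2 u   40   64  377    2.345   -0.321   0.120
"""

peer = ntp_sync_peer(SAMPLE)
print("synchronized to", peer if peer else "nothing")
```

Returning `None` when no line is marked gives the caller a simple "not synchronized" condition to alert on.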

Other stuff we really need to keep an eye on is hardware - redundant PSU
status in our 7204s and Dells, temperatures and voltages (one of
our colos in New York peaked at over 40C a few weeks ago, for instance),
and disk array status (I'd like to know of a failed disk in a hardware RAID5
before I get calls about performance issues).  Our
blade chassis have DRACs in them and I think they export this data via
SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
other PowerEdges have the DRACs in them so some of this information may
need to be pulled via IPMI from within the host OS.  Presumably the
Cisco gear makes the temperature available via SNMP.

Finally, service checks - standard stuff (dns, http, https, ssh, smtp).

Now, to the questions.

1) Is SNMP the best way to do this?  Obviously some of the data (service
checks) will need to be collected other ways.

2) Is there any good solution that does both logging/trending of this
data and also notification/monitoring/alerting?  I've used both Nagios
and Cacti in the past, and, due to the number of individual things being
monitored (3-5 items per OS instance, 5-10 items per physical server,
10-50 things per network device), setting them both up independently
seems like a huge pain.  Also, I've never really liked Nagios that much.

I recently entertained the idea of writing a CGI that output all of this
information in a standard format (csv?), distributing and installing it,
then collecting it periodically at a central location and doing all the
rrd/notification myself, but then realized that this problem must've
been solved a million times already.
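
For what it's worth, the central-collection half of that CSV idea could look roughly like the Python sketch below. Everything specific in it is an assumption: the `host,metric,value` column layout, the metric names, and the threshold values are hypothetical, and fetching the reports over HTTP and writing to RRD are deliberately left out.

```python
import csv
import io

# Hypothetical alert limits, keyed by metric name.
THRESHOLDS = {"disk_used_pct": 90.0, "temp_c": 40.0}

def parse_report(text):
    """Parse 'host,metric,value' rows into (host, metric, float) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        host, metric, value = record
        try:
            rows.append((host, metric, float(value)))
        except ValueError:
            continue  # skip rows with a non-numeric value, such as a header
    return rows

def find_alerts(rows, thresholds=THRESHOLDS):
    """Return the rows whose value exceeds the limit set for their metric."""
    return [r for r in rows if r[1] in thresholds and r[2] > thresholds[r[1]]]

# Hypothetical report, as it might be collected from two hosts.
REPORT = """host,metric,value
web1,disk_used_pct,93.5
web1,temp_c,31.0
db1,temp_c,41.2
db1,cpu_pct,12.0
"""

for host, metric, value in find_alerts(parse_report(REPORT)):
    print(f"ALERT {host} {metric}={value}")
```

Metrics without a configured limit (`cpu_pct` here) are collected but never alert, which keeps trending and alerting decoupled.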

There's got to be a better way. What do you guys use?

I wrote an NMS to do something along these lines. It's focussed more towards
graphing than alerting. It knows where to find Dell/Cisco temperature
monitors via SNMP and will keep track of hardware and OS types/versions.
It's probably still not really ready for general consumption, but if you
think it would be useful to you, give me a shout and I'll see if I can help
you make it work properly for you.

http://www.project-observer.org

I wrote it mostly due to my own absolute hatred of Nagios and disappointment
at the other NMSes around (where are the aesthetics?)! :)

Thanks,
adam.
