North American Network Operators Group

Re: How do you (not how do I) calculate 95th percentile?

  • From: David W. Hankins
  • Date: Wed Feb 22 15:27:50 2006

On Wed, Feb 22, 2006 at 12:50:34PM -0600, Tom Sands wrote:
> >A lot of smaller folks check the counter every 5 min and use that same
> >value for the 95th percentile.  Most of us larger folks need to check more 
> >often to prevent 32bit counters from rolling over too often. 
> 
> Actually, a lot of people do 5 minutes... and I would say that larger 
> companies don't check them more often because they are using 64 bit 
> counters, as should anyone with over about 100Mbps of traffic.

Counter size alone is an incomplete basis for choosing a polling interval.
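
For scale, here is a rough sketch of the rollover arithmetic behind the
quoted "about 100Mbps" figure. It assumes an octet counter (as with
ifInOctets or ifHCInOctets) and a sustained line rate; the function
names are made up for illustration.

```python
# Sketch: how long an octet counter lasts at a sustained rate.
# Assumption: the counter counts bytes, so bits = bytes * 8.

def seconds_to_wrap(counter_bits: int, rate_bps: float) -> float:
    """Seconds until an octet counter of this width wraps once."""
    return (2 ** counter_bits) * 8 / rate_bps


def max_rate_bps(counter_bits: int, poll_interval_s: float) -> float:
    """Highest sustained rate at which the counter wraps at most once
    between polls (a single wrap can still be corrected; two cannot)."""
    return (2 ** counter_bits) * 8 / poll_interval_s


if __name__ == "__main__":
    # 32-bit counter at 100 Mbps: about 343.6 seconds, under six minutes.
    print(round(seconds_to_wrap(32, 100e6), 1))
    # 32-bit counter polled every 300 s: safe up to about 114.5 Mbps.
    print(round(max_rate_bps(32, 300) / 1e6, 1))
```

This only says when a counter becomes ambiguous. It says nothing about
what a lost poll does to the averages.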

If you need a 5 minute average and poll your routers once every five
minutes, what happens if an SNMP packet gets lost?

In the best case, a retransmission Y seconds later gets through, but
now you've got a (300+Y)-second average where a 300-second average was
expected. Your next datapoint will then be a (300-Y)-second average,
unless you also push the next poll back by Y seconds.
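
A minimal sketch of the bookkeeping, assuming the poller records the
time each response actually arrived. The function name and arguments
are hypothetical. Dividing by the real elapsed time gives a correct
rate, but the datapoint still covers 300+Y seconds rather than the
300-second window you wanted.

```python
# Sketch: turn two counter readings into a rate using real timestamps.
# Assumptions: octet counters, at most one wrap between readings.

def rate_bps(prev_count: int, prev_time: float,
             cur_count: int, cur_time: float,
             counter_bits: int = 32) -> float:
    """Average bits per second between two polls."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("polls must be in increasing time order")
    # Modular subtraction absorbs a single counter wrap.
    delta_octets = (cur_count - prev_count) % (2 ** counter_bits)
    return delta_octets * 8 / elapsed


if __name__ == "__main__":
    # Steady 10 Mbps, but the reply arrived at 310 s (Y = 10).
    octets = 387_500_000
    print(rate_bps(0, 0.0, octets, 310.0))  # divided by real elapsed time
    print(octets * 8 / 300)                 # wrongly assuming 300 s
```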

In the worst case, you've lost the datapoint entirely.  This loses not
just the one datapoint ending in that five minute span, but also the
next one, since the next counter delta now covers ten minutes.  Sure,
you can synthesize two 5 minute averages from one 10 minute average
(presuming your counters wouldn't roll), but this is still a loss of
data: one of those two datapoints should have been higher than the
other.
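
To put numbers on that, a toy illustration with invented traffic
figures (20 and 60 Mbps for the two real five-minute spans). The helper
names are hypothetical.

```python
# Sketch: splitting one 10-minute average into two 5-minute datapoints
# flattens whatever difference existed between the two halves.

def ten_minute_average(first_bps: float, second_bps: float) -> float:
    """Average over two equal-length 5-minute spans."""
    return (first_bps + second_bps) / 2


def split_average(avg_bps: float, parts: int = 2) -> list:
    """Synthesize equal datapoints from one longer average."""
    return [avg_bps] * parts


true_points = [20e6, 60e6]  # what two clean polls would have shown
synthesized = split_average(ten_minute_average(*true_points))

if __name__ == "__main__":
    print(synthesized)                          # both points identical
    print(max(true_points) - max(synthesized))  # peak that went missing
```

Total bytes are preserved, but the higher of the two datapoints is
gone, and the high datapoints are the ones a 95th percentile cares
about.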


At a previous employer, we solved this problem by using a 30 second (!)
polling interval and a home-written polling engine (in C, linked
against the UCD-SNMP library, now net-snmp) that did its best to emit
and receive as many queries in as short a space of time as it could
without flooding the monitored devices.

In these circumstances, we could lose several datapoints and still
construct valid 5-minute averages from the pieces (combinations of 30-,
60-, 90-second, etc. averages, weighting each by the number of seconds
it represents within the 300-second span).
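
The weighting described above can be sketched roughly as follows. This
is not the original C engine, just an illustration in Python. The
sample tuple layout and the function name are assumptions, and a sample
that straddles a window edge is assumed to have a uniform rate across
its span.

```python
# Sketch: rebuild a 300-second average from uneven shorter averages.
from typing import List, Optional, Tuple

# (start_s, end_s, avg_rate) for the span between two successful polls.
Sample = Tuple[float, float, float]


def window_average(samples: List[Sample], window_start: float,
                   window_len: float = 300.0) -> Optional[float]:
    """Time-weighted average over one window, or None if the samples
    do not cover the whole window."""
    window_end = window_start + window_len
    weighted = 0.0
    covered = 0.0
    for start, end, avg in samples:
        # Seconds of this sample that fall inside the window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            weighted += avg * overlap
            covered += overlap
    if covered < window_len - 1e-9:
        return None
    return weighted / covered


if __name__ == "__main__":
    # Rates in Mbps. Lost polls merged some 30 s spans into longer ones.
    samples = [(0, 30, 10.0), (30, 120, 20.0),
               (120, 180, 30.0), (180, 300, 40.0)]
    print(window_average(samples, 0.0))
```

Here a 90-second piece and a 120-second piece stand in for several lost
30-second polls, and the window still comes out as a proper 300-second
average.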

Our operations staff also enjoyed being able to see graphical response
to changes in traffic balancing within half a minute...better, faster
feedback.  That is another reason 'counter size' is a poor guide to
polling interval.

> In our setup, as with a lot of people likely, any data that is older 
> than 30 days is averaged.  However, we store the exact maximums for the 
> most current 30 days.

You keep no record?  What do you do if a customer challenges their
bill?  Synthesize 5 minute datapoints out of the larger averages?

I recommend keeping the 5 minute averages in perpetuity, even if that
means having an operator burn the data to CD and store it in a safe (not
under his desk in the pizza boxes, nor under his soft drink as a coaster).

-- 
David W. Hankins		"If you don't do it right the first time,
Software Engineer			you'll just have to do it again."
Internet Systems Consortium, Inc.		-- Jack T. Hankins
