North American Network Operators Group


Re: Zebra/linux device production networking?

  • From: alex
  • Date: Tue Jun 06 19:23:12 2006

On Tue, 6 Jun 2006, Nick Burke wrote:

> First, a little background.. My CTO made my stomach curdle today when he
> announced that he wanted to do away with all our cisco [routers] and
> instead use Linux/zebra boxen. We are a small company, so naturally
> penny pinching is the primary motivation. That, and the sheer joy of
> watching me squirm. He has informed me that he has found "many people"
> who do this for their "core devices". I'm not so certain about this
> whole situation, so I humbly ask:
> 
> How many of you have actually use(d) Zebra/Linux as a routing device 
> (core and/or regional, I'd be interested in both) in a production (read: 
> 99.999% required, hsrp, bgp, dot1q, other goodies) environment?
> 
> And, if you care to spend this much time, what pitfalls/benefits did you 
> find out about after implementation?

Having done exactly that previously, I wouldn't recommend it.

While it will work most of the time, reaching 99.999% will be a
challenge. The amount of engineering time you will spend to reach that
point (and to maintain your setup) will dwarf the cost of leasing
proper equipment.
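
To put the 99.999% target in perspective, here is a quick back-of-the-envelope
sketch of the downtime budget it implies. It is plain arithmetic, assuming a
365-day year; the function name is made up for illustration.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime allowed per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

At five nines the whole yearly budget is a bit over five minutes, so a
single reboot or one stuck routing daemon can use it up.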

Issues encountered: 
*) Performance under DDoS: The Linux routing stack is route-cache-based.
That means performance is a function of flows per second, and even a small
random-src/dst DDoS will kill you. Even when this is fixed, performance
will be limited by pps - and the "worst case" performance of a PC router
is not as impressive as "omg i can route 1gbit with p3/1ghz". In the end,
"worst case" performance is what really matters, and it isn't all that
awesome.

*) Management: It takes a certain amount of sysadmin time to manage each
PC router (tools/etc).

*) Integration: As it is not designed as a "complete system", you will
have little weirdnesses, such as Quagga not seeing kernel-installed
routes, or netlink not being able to keep up with route updates, etc. All
of those are fairly small things, but there are more than enough of them.

*) Troubleshooting/continuity of operations: It takes two orders of
magnitude more clue to troubleshoot a zebra network - there are simply
*lots* more things that can possibly go wrong - you don't worry just about
your links breaking, you have to worry about your software being buggy.
While any CCIE will most likely be able to troubleshoot and run a
cisco-based network, the pool of engineers sufficiently clued in the
myriad of things that relate to troubleshooting a PC router (i.e. someone
who is at once a network engineer, system admin, protocol engineer, kernel
hacker, and at times, zebra-source-code-hacker) is far smaller.

*) Maturity: While it has been improving, things like Quagga still have
stability issues and "weird issues that are resolved by killing
ospfd". Because of the greater state of flux in such an environment, you
are likely to encounter things like "oh, this bug is fixed in latest
release" - and then have to retest the new release, which has completely
different bugs. Yes, I know, you get that with proprietary vendors - but
at least you get the benefit of *them* doing at least some amount of
testing prior to release.

*) Redundancy: Adding more redundancy to such a system is not likely to
increase availability - in fact, it is likely to decrease availability
because of added complexity and "more things to break". Your problems
are not likely to be the PC losing power (complete failure). Your problems
will be things like zebra's idea of the routing table being different from
the kernel's idea, or zebra being unhappy after a transit flaps and sucking
up CPU time, leading to other things timing out, etc. Redundancy will
exacerbate these issues, making troubleshooting *harder*.
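
To illustrate the first issue above (route cache vs. random src/dst
traffic, and the pps ceiling), here is a toy sketch. It is not the kernel's
actual data structure: the cache is modelled as a plain LRU keyed on
(src, dst), and the cache size, flow counts and function names are all
invented for the example. The pps figure uses standard Ethernet framing
(20 bytes of preamble plus inter-frame gap on top of each frame).

```python
import random
from collections import OrderedDict


def cache_hit_rate(lookups, cache_size):
    """Replay (src, dst) lookups through an LRU cache; return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in lookups:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True               # miss: slow path, then cache it
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict least recently used entry
    return hits / total if total else 0.0


def line_rate_pps(link_bps, frame_bytes):
    """Max Ethernet packets/sec: frame plus 20 bytes preamble/inter-frame gap."""
    return link_bps / ((frame_bytes + 20) * 8)


rng = random.Random(1)       # fixed seed so runs are repeatable
N = 100_000                  # lookups per scenario (hypothetical)
CACHE_SIZE = 4096            # hypothetical cache capacity

# "Normal" traffic: many packets spread over 1000 long-lived flows.
normal = [(rng.randrange(1000), 0) for _ in range(N)]
# Random src/dst flood: practically every packet is a brand-new flow.
attack = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(N)]

normal_rate = cache_hit_rate(normal, CACHE_SIZE)
attack_rate = cache_hit_rate(attack, CACHE_SIZE)

print(f"normal hit rate: {normal_rate:.3f}")
print(f"attack hit rate: {attack_rate:.3f}")
print(f"1 Gbit/s, 64-byte frames:   {line_rate_pps(1e9, 64):,.0f} pps")
print(f"1 Gbit/s, 1518-byte frames: {line_rate_pps(1e9, 1518):,.0f} pps")
```

With long-lived flows nearly every lookup hits the cache; under the random
flood nearly every lookup misses and takes the slow path. The last two
lines show why "I can route 1gbit" says little: with large frames that is
roughly 81 kpps, while the worst case with minimum-size frames is roughly
1.49 Mpps.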

So, in conclusion, if you have a large number of clued linux hackers who
have nothing better to do, it may be a good idea. Otherwise, you'll
realize you are spending far more on sysadmin time than you are saving on
equipment cost.

--
Alex Pilosov    | DSL, Colocation, Hosting Services
President       | [email protected]    877-PILOSOFT x601
Pilosoft, Inc.  | http://www.pilosoft.com