
I'm currently evaluating monitoring software for (by my standards) a larger network expected to grow to around 3000 devices. I'm finding data on the hardware requirements for scaling hard to come by. (Edit: the devices are satellite receivers monitored by SNMP, so require an agentless monitor. Our main concern is to identify failing devices, and we don't need a great deal of analysis.)

The 3000 devices will have about 40 data points each, logged on a cycle of 5 to 10 minutes. At a 10-minute polling interval, that's 12,000 points per minute. That produces two sorts of load: CPU load for the polling application and, most critically, disk write load to store those data points.
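To make the sizing concrete, here is a back-of-the-envelope calculation of the figures above (a minimal sketch; the device and metric counts are taken straight from the question, and the 5-minute case is just the same numbers doubled):

```python
# Polling-load arithmetic for 3000 devices x 40 metrics.
devices = 3000
metrics_per_device = 40
poll_interval_min = 10  # worst case; a 5-minute cycle doubles the rates

points_per_cycle = devices * metrics_per_device            # per polling cycle
points_per_minute = points_per_cycle / poll_interval_min   # sustained ingest rate
points_per_second = points_per_minute / 60                 # rough disk-write rate

print(points_per_cycle, points_per_minute, points_per_second)
```

At 200 datapoints per second sustained, the storage backend (RRD files or a database) is the component to size first.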

I've looked at Solarwinds Orion, Zenoss, Zabbix, and OpenNMS. We have experience with Zenoss and Orion on smaller networks of a few hundred devices. My initial impressions are:

  • Zenoss doesn't have a very efficient RRD implementation, but allows us to scale horizontally by adding collectors, which store RRD data locally.
  • Orion allows us to add polling engines, but requires a shared SQL server for the performance data.
  • Zabbix claims to scale to this level, but I've not found any useful guidance. As it uses a database for performance data, database tuning is key.
  • OpenNMS looks like the performance leader, due to an optimized RRD implementation and support for grouping.

Does anybody have experience or performance data for monitoring this scale of network?

OpenNMS can do the job.

For that type of environment, the key will be CPU threads and something that can handle low latency disk writes. I would use a standalone server (versus a VM), provide 12 or more cores and plan around direct-attached storage that either has 6 or more spindles or can leverage SSDs for the OpenNMS RRD directories. OpenNMS can also be tuned on the data collection and logging fronts to make it more efficient. Reaching out to their professional services team to help with the install would be a good option.

    • I'll call this the answer, because OpenNMS has made it to our second round of evaluation. It's definitely up to the workload, though I'm uncomfortable with the alarm handling workflow.
    • We're used to Zenoss, which bundles loss of connection together with other events. It's odd to see a threshold alarm on a device, yet see zero outages. In our world, satellite signal below threshold is an outage, even if we still have IP connectivity. Also, alarm states may linger a long time, until somebody drives out and reattaches the satellite dish. We'd prefer acknowledged alarms to drop off the dashboard, as we're worried about losing new alarms amongst the clutter.
    • There may be a way to clean up the interface for that purpose. In an installation of your size, it may make sense to use the OpenNMS consulting package. They can come out and tune/tweak the installation for your environment.

As far as I know, Zabbix has installations with 10k+ devices. You may need to distribute the load, e.g. by moving the database server (if your solution needs one) onto a separate machine. You might also want to look at Zabbix Proxy.
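For illustration, a Zabbix proxy needs only a minimal configuration to start collecting on behalf of the central server. This is a sketch, not a complete config; the hostnames and paths are placeholders, and parameter names should be checked against the documentation for your Zabbix version:

```ini
# Minimal zabbix_proxy.conf sketch (placeholder values).
Server=zabbix-central.example.com        # address of the central Zabbix server
Hostname=proxy-site-a                    # must match the proxy name registered on the server
DBName=/var/lib/zabbix/zabbix_proxy.db   # SQLite is typically sufficient for a proxy
```

Each proxy polls its share of devices locally and forwards results in batches, which takes both polling CPU and much of the write burstiness off the central database.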


I have experience monitoring this size of network. In addition, I'm always evaluating new possibilities when it comes to monitoring solutions.

That said, I'm coming from more of a Microsoft perspective than you are, and I'm not even sure if I would consider some of the solutions you mentioned enterprise-level solutions, but I might still be able to help.

Almost every monitoring system is going to consist of a few common components - the database and the management servers. (NetIQ, Nimsoft, Quest, VMware, SCOM, just to name a few.)

The amount of hardware you're going to need depends greatly on just how you plan to do your monitoring - specifically - how many data points you want to capture. For the most basic stuff like CPU utilization, memory, storage space, etc., your requirements will be less. If you want to monitor a huge slew of application metrics like how many web requests per second your hosts are getting, scanning logfiles for keywords, etc., well then the amount of data collected by your monitoring system will be much larger, and your hardware requirements will increase across the board.

Other things to consider are factors such as: do you want to load agents on every machine (which typically allows for more detailed info), or do you want to try to go completely agentless? Are you monitoring all physical machines, all VMs, or a mixture of the two? How about network equipment, are you monitoring that too? In big heterogeneous networks like this, what you typically end up with is multiple solutions running together to cover all your bases. If you have a whole boatload of VMs to monitor, certain solutions like VMware VC Ops and Quest vFoglight get information from vCenter (or multiple vCenters) itself, which means a lot of the metrics are more accurate than if they were measured on the VM itself, and it also means you may not have to load an agent on the VM. You can also typically squeeze more machines onto a VM-only monitoring solution. VMware VC Ops has customers today that are running 10k VMs on a single instance of VC Ops.

That said, in my personal opinion VC Ops is more of a big fancy analytics engine than an actual monitoring solution. It's kinda' cool to see it tell you "based on your current growth, the ESXi host [x] in Datacenter [y] will reach capacity in 30 days."

Alright, so in general, there are a lot of different ways to design a database, but remember that you need high availability. You cannot work in such a huge network and take ownership of a monitoring solution that will go dark completely if one of your database nodes goes down. So don't buy 1 HP ProLiant server. Buy two. Or three. Cluster them. Plan for HA. So price that out -- $30 grand?

Secondly, a lot of these solutions are going to have a "management server" type of role in their infrastructure. In my experience these can typically be virtualized just fine. They act as intermediaries between the agents and the central repository, balancing the load and making sure all the data coming in from the thousands of agents gets inserted into the repository in an orderly fashion. You'll find that in these types of solutions, you have to have a few management servers for HA, but you don't want too many, as each additional management server will cause contention and locks as they all try to insert data into the repository.

So plan on one or two virtualization hosts for those. Another $15k maybe? That's just ball park. I don't know if your company is going to be building this on new Cisco UCS gear, or Dell PowerEdges that you buy off Craigslist.

Most enterprise-grade solutions are configurable enough to be able to leverage SQL Server or MySQL or even Postgres. However, very few of them are totally awesome at everything, and what I usually see a company doing is running two or more monitoring solutions in parallel.

edit: Also don't forget to plan for geographical distribution. I have servers that physically reside in Amsterdam that are being monitored from Miami. It's possible, but I'm not altogether proud to admit it.

edit #2: It's also important to note that while some companies are very squeamish about spending money on software - it just depends on the culture of the company - a good company will realize the value of Enterprise support. Just something to keep in mind.

    • Thanks for that - that's prompted a few more things to think about. We have to go agentless, as the devices are satellite receivers, and we can't modify the firmware.
  • Cool. In that case, HPSIM and Solarwinds both make good SNMP (and trap) solutions. Sorry I can't provide more feedback on the *nix-specific software!
    • Solarwinds is a strong contender. We're an odd case for Enterprise vendors: the equipment is weird, it's scattered over a whole continent, and we don't control all the network connecting it together.

Coming from a University environment where we did availability monitoring (Ok/Warning/Critical with alerts) and performance monitoring (graphing, RRD) of MANY network devices (mostly Cisco, but checking lots of metrics)...

I think this is being overanalyzed. First off, identify the minimum set of metrics that you need, what resolution, and how long you need to store them. Even if you really do need to poll each one of the 3,000 devices every 5 to 10 minutes, for 40 metrics, do you need to retain RRD graphed data on them, or can you just use something like Nagios to alert if a metric is outside a predefined threshold?
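To show what bare threshold alerting looks like, here is a hedged sketch of a Nagios check using the stock check_snmp plugin. The OID, community string, and thresholds are placeholders; substitute whatever signal-level OID the receivers actually expose:

```
# Hypothetical command definition using the standard check_snmp plugin.
# ".1.3.6.1.4.1.99999.1.1.0" is a placeholder OID, not a real one.
# "-w 40:" / "-c 30:" alert when the value falls below 40 / 30.
define command {
    command_name  check_signal_level
    command_line  $USER1$/check_snmp -H $HOSTADDRESS$ -C public -o .1.3.6.1.4.1.99999.1.1.0 -w 40: -c 30:
}

define service {
    use                  generic-service
    host_name            receiver-0001
    service_description  Satellite signal level
    check_command        check_signal_level
}
```

This gives you Ok/Warning/Critical states and alerting with no long-term datapoint storage at all, which is much cheaper on disk than graphing everything.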

Also, how reliable does this need to be?

Here's how I'd do it, keeping in mind that my default viewpoint is minimum cost, open source, and the assumption that whoever is implementing it can do some coding:

  • Identify some possible solutions (Nagios/Icinga? OpenNMS? Cacti or Cricket or mrtg?) that have a somewhat flexible UI.
  • Get 10 or 20 cheap, minimal 1U servers that can each handle 5% or 10% of the total load. Come up with an algorithm to distribute checking/polling of your 3,000 devices between these 10 or 20 hosts.
  • If you just need alerting, each host can live in isolation. It would probably be good to have a Nagios box to monitor these 10-20 hosts, just to make sure they're up and running and collecting data.
  • If you need graphing/trending with a common interface, you'll need to do some web work (PHP?), but you should be able to pull together an interface that links graphs/data/etc. from the appropriate polling node.
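The distribution algorithm in the second bullet can be as simple as a stable hash of the device name. A minimal sketch in Python, assuming devices are identified by name (the names below are made up for illustration):

```python
import hashlib
from collections import Counter

def poller_for(device_name: str, pollers: list[str]) -> str:
    """Deterministically map a device to one of N polling hosts.

    Uses a stable hash (not Python's per-process randomized hash())
    so the assignment survives restarts and is the same on every box.
    """
    digest = hashlib.md5(device_name.encode()).hexdigest()
    return pollers[int(digest, 16) % len(pollers)]

pollers = [f"poller{i:02d}" for i in range(1, 21)]       # 20 cheap 1U boxes
devices = [f"receiver-{i:04d}" for i in range(3000)]     # hypothetical names

load = Counter(poller_for(d, pollers) for d in devices)
print(load)  # roughly 3000/20 = 150 devices per poller
```

Each poller only needs its own slice of the device list, so the hosts can stay isolated as the earlier bullet suggests; the trade-off versus consistent hashing is that adding or removing a poller reshuffles most assignments.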
