
What am I looking for in a Monitoring Solution?






This is a Canonical Question about Monitoring Software.



Also Related: What tool do you use to monitor your servers?




I need to monitor my servers; what do I need to consider when deciding on a monitoring solution?



Answer



There are a lot of monitoring solutions out there. Everyone has their preference and each business has its own needs, so there is no correct answer. However, I can help you figure out what you might want to look for in choosing a monitoring solution.





In general, monitoring systems serve two primary purposes. The first is to collect and store data over time. For example, you might want to collect CPU utilization and graph it over time. The second purpose is to alert when things are either not responding or are not within certain thresholds. For example, you might want alerts if a certain server can't be reached by pings or if CPU utilization is above a certain percentage. There are also log monitoring systems such as Splunk, but I am treating those as a separate category here.



These two primary roles sometimes come in a single product; more commonly, there is a separate product dedicated to each purpose.
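As a rough illustration of the two purposes (storing samples over time and alerting on a threshold), here is a minimal Python sketch. The class name `MetricSeries`, the metric name, and the 90% threshold are all made up for the example; no real monitoring product works exactly like this.

```python
import time
from collections import deque


class MetricSeries:
    """Hypothetical helper: stores recent samples and checks one threshold."""

    def __init__(self, name, threshold, max_samples=1000):
        self.name = name
        self.threshold = threshold
        # Purpose 1: keep a bounded history of (timestamp, value) samples.
        self.samples = deque(maxlen=max_samples)

    def record(self, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.samples.append((ts, value))
        return self.check(value)

    def check(self, value):
        # Purpose 2: produce an alert message when the value is out of bounds.
        if value > self.threshold:
            return f"ALERT: {self.name} is {value}, above threshold {self.threshold}"
        return None


cpu = MetricSeries("cpu_percent", threshold=90.0)
for ts, value in enumerate([35.0, 60.5, 95.2]):
    message = cpu.record(value, timestamp=ts)
    if message:
        print(message)
```

The stored samples are what a graphing product would draw from, and the returned message is what an alerting product would route to a person.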






Pollers:
All monitoring systems need some sort of poller to collect the data. Not all data is collected in the same way. You should look at your environment and decide what data you need and how it might be collected. Then make sure the monitoring system you choose supports what you need. Some common methods include:




  • SNMP (Simple Network Management Protocol)

  • WMI (Windows Management Instrumentation)

  • Running Scripts (for example, running a script on the machine that is being monitored, or running a script from the monitoring box itself which uses its own polling method). These can include things like Bash scripts, Perl scripts, executables, and PowerShell scripts

  • Agent Based Monitoring. With these, a process runs on each client and collects the data. This data is either pushed to the monitoring server or the monitoring server polls the agent. Some admins are okay with agents; others don't like them because they can leave a larger footprint on the server being monitored.

  • Focused APIs (e.g. the VMware API or the ability to run SQL queries)




If you have mostly one OS in your environment, or a primary OS, certain systems might have more options than others.
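To make the "running scripts" method more concrete, here is a small sketch of a disk usage check in Python. It borrows the exit code convention used by Nagios-style plugins (0 for OK, 1 for warning, 2 for critical); the function names and the 80/90 percent thresholds are assumptions chosen for the example.

```python
import shutil

# Status codes following the Nagios-style plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a (status code, message) pair."""
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL - disk {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"WARNING - disk {used_percent:.1f}% used"
    return OK, f"OK - disk {used_percent:.1f}% used"


def check_disk(path="/"):
    """Measure real disk usage for the given path and classify it."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    return classify(used_percent)


# Demonstrate the classification with a fixed sample value.
code, message = classify(93.4)
print(code, message)
```

A poller that runs scripts would typically call something like `check_disk()`, print the message, and exit with the status code so the monitoring server can interpret the result.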



Configuration:
In monitoring systems there tends to be a lot of object reuse. For example, you want to monitor a certain application such as Apache or IIS on a bunch of servers. Or you want certain thresholds to apply to groups of servers. You might also have certain groups of people who are "on call". Therefore a good templating system is vital to a monitoring system.



The configuration is generally done through a user interface or text files. The user interface option will generally be easier, but text files tend to be better for reuse and variables. So depending on your IT staff you might prefer simplicity over power.
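Here is one way to picture template reuse, sketched in Python. The template names, settings keys, and the `resolve` / `host_config` helpers are all hypothetical; real systems each have their own template syntax, but the inheritance idea is similar.

```python
# Hypothetical templates: a child inherits from its parent and overrides values.
TEMPLATES = {
    "base-server": {
        "parent": None,
        "settings": {
            "check_interval_s": 300,
            "cpu_warn": 80,
            "cpu_crit": 90,
            "notify": "ops-oncall",
        },
    },
    "web-server": {
        "parent": "base-server",
        "settings": {"check_http": True, "check_interval_s": 60},
    },
}


def resolve(name, templates):
    """Merge settings from the root template down to the named one."""
    chain = []
    while name is not None:
        template = templates[name]
        chain.append(template["settings"])
        name = template["parent"]
    merged = {}
    for settings in reversed(chain):  # parents first, children override
        merged.update(settings)
    return merged


def host_config(template, overrides=None):
    """Build one host's config from a template plus per-host overrides."""
    config = resolve(template, TEMPLATES)
    config.update(overrides or {})
    return config


web01 = host_config("web-server")
web02 = host_config("web-server", {"cpu_crit": 95})
print(web01)
print(web02)
```

With this kind of layering, changing a threshold in `base-server` applies to every host built on it, while a single host can still carry its own exception.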



User Interface:
The most common interface for monitoring systems these days is a web interface. Some things to evaluate with regard to the web interface are:





  • Good overviews

  • Good detail pages

  • Speed (When you need to find information in crisis mode, a slow interface can be very frustrating)

  • General feeling. You will spend a lot of time in the interface; if it feels clunky, your IT staff will be reluctant to use it

  • Customization. Every organization has certain things that are important, and other things that are not. It is important to be able to customize it to your needs



Alerting Engine:
The alerting engine has to be flexible and reliable. There are lots of different ways to be notified including:





  • SMS

  • Email

  • Phone

  • Other things like IM/Jabber



Other features to look for are:




  • Escalations (Notify someone if the other person has not acknowledged or fixed the alert)


  • Rotations and Shifts

  • Groups (Certain groups need to be notified of certain things)
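As a sketch of how an escalation policy might be evaluated, consider the Python below. The contact names, the 15 and 30 minute delays, and the `contacts_to_notify` function are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    after_minutes: int  # how long the alert must be open before this step fires
    contact: str        # hypothetical contact or group name


POLICY = [
    EscalationStep(0, "primary-oncall"),
    EscalationStep(15, "secondary-oncall"),
    EscalationStep(30, "team-lead"),
]


def contacts_to_notify(minutes_open, acknowledged, policy=POLICY):
    """Return everyone who should have been notified by now.

    Once someone acknowledges the alert, escalation stops.
    """
    if acknowledged:
        return []
    return [step.contact for step in policy if minutes_open >= step.after_minutes]


print(contacts_to_notify(0, acknowledged=False))
print(contacts_to_notify(20, acknowledged=False))
print(contacts_to_notify(45, acknowledged=True))
```

Rotations and groups would sit on top of this: a name like `primary-oncall` would resolve to whoever is on shift at the time.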



It is important to trust that when something goes wrong you will get the alert. This comes down to two things:




  1. A reliable system

  2. A caveat-free configuration. In monitoring systems it is not uncommon to think you should get an alert, only to find that some detail in the configuration meant the alert was never triggered.




Data Store:
If the system collects and stores data (i.e. systems that include graphs), then it needs a data store. A very common implementation for both storage and graphing is RRD.



Some features to look for from the data store are:




  • Raw access to the data. This can be valuable for developing against or creating custom graphs with something like Excel.

  • Scalability. Depending on how much data you collect, it can add up fast. If you are going to collect a lot, you want to make sure the data store will scale.




Graphing Library:
Graphs can be useful to quickly identify trends and give context to the current state of something based on its history. Some include trending, which can help predict problems before they happen (e.g. running out of disk space). Make sure that the graphs will give you the information you think you are going to need in a clear way.
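To show what trending means in practice, here is a minimal Python sketch that fits a straight line to disk usage samples and estimates how many days remain until the disk is full. The sample data and function names are invented, and a simple least-squares line is only one possible approach; real products may use something more sophisticated.

```python
def linear_fit(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(points, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_fit(points)
    if slope <= 0:
        return None
    last_day = points[-1][0]
    full_day = (capacity - intercept) / slope
    return max(0.0, full_day - last_day)


# Invented samples: (day number, GB used) on a 100 GB disk.
samples = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until_full(samples, capacity=100.0))
```

An alert on "less than N days until full" is often more useful than a fixed percentage threshold, because it accounts for how fast the disk is actually filling.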



Access Controls:
If you have a large organization, you might need access controls because certain admins should only be able to adjust certain things. You might also want public-facing dashboards. If this is important, you should make sure the monitoring system has the controls you need.



Other Features



Reporting:
A system that provides good reports can help you identify what needs to be improved over long periods of time. For example, it can give a good answer to questions like "what systems go down the most?". This can be important when you are trying to convince management to spend money on certain things, because businesses like hard evidence.



Specialized Features:
Some monitoring systems are targeted at specific products or support some products better than others. For example, if the main thing you need to monitor is SQL Server, or if you make heavy use of VMware products, you should see how well these are supported.




Predefined Monitoring Templates:
A system that comes with a lot of predefined templates (or has a user base that has created many templates) can be a huge time saver.



Discovery:
If you have a large or changing environment, look for discovery features: some systems provide the ability to add new systems via an API or run scans to find new servers or components.



Distributed Monitoring:
If you have multiple locations to monitor, it can be helpful to have monitoring pollers in each location, instead of running a lot of independent systems or monitoring everything over the WAN.





There are a lot of monitoring systems out there. We have a list with a summary on this old question. For quick reference, some that I hear the most about are:





  • Nagios

  • Cacti

  • OpenNMS

  • SolarWinds

  • Zabbix

  • Various cloud-based monitoring systems

  • Microsoft System Center

  • This one isn't popular yet, but Stack Exchange has open sourced its monitoring system http://bosun.org






The reason I can't tell you what to use is that every organization has its own needs. If you want to make the right choice, you should think through all the above components and figure out what features are important to your organization. Then find a system or systems that claim to provide what you need and try them out. Some of these cost a little, some cost a lot, and some are free. Taking all of that into account, you can then make your choice. From what I have used, they are all far from perfect, but at least you can try to get something that fits.

