This is a Canonical Question about Monitoring Software.
Also Related: What tool do you use to monitor your servers?
I need to monitor my servers; what do I need to consider when deciding on a monitoring solution?
Answer
There are a lot of monitoring solutions out there. Everyone has their preference and each business has its own needs, so there is no correct answer. However, I can help you figure out what you might want to look for in choosing a monitoring solution.
In general, monitoring systems serve two primary purposes. The first is to collect and store data over time. For example, you might want to collect CPU utilization and graph it over time. The second purpose is to alert when things are either not responding or are not within certain thresholds. For example, you might want alerts if a certain server can't be reached by pings or if CPU utilization is above a certain percentage. There are also log monitoring systems such as Splunk, but I am treating those as a separate topic here.
These two primary roles sometimes come in a single product; more commonly, a separate product is dedicated to each purpose.
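To make the alerting role concrete, here is a minimal Python sketch of a threshold check. The `Threshold` class, the `evaluate` function, the state names, and the 80/95 percent limits are all made up for illustration; real products use their own vocabulary and far richer logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    metric: str
    warning: float
    critical: float


def evaluate(threshold: Threshold, value: Optional[float]) -> str:
    """Map one sample to a state; None means the check got no response."""
    if value is None:
        return "UNREACHABLE"
    if value >= threshold.critical:
        return "CRITICAL"
    if value >= threshold.warning:
        return "WARNING"
    return "OK"


# Illustrative limits only; pick values that suit your own servers.
cpu = Threshold(metric="cpu_percent", warning=80.0, critical=95.0)
samples = [42.0, 85.5, 97.2, None]
states = [evaluate(cpu, sample) for sample in samples]
print(states)
```

The same samples could also be written to a data store for graphing, which is the other half of the job.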
Pollers:
All monitoring systems need some sort of poller to collect the data. Not all data is collected in the same way. You should look at your environment and decide what data you need and how it might be collected. Then make sure the monitoring system you choose supports what you need. Some common methods include:
- SNMP (Simple Network Management Protocol)
- WMI (Windows Management Instrumentation)
- Running Scripts (For example, running a script on the machine that is being monitored or running a script from the monitoring box itself which uses its own polling method). These can include things like Bash scripts, Perl scripts, executables, and PowerShell scripts
- Agent Based Monitoring. With these, a process runs on each client and collects the data. This data is either pushed to the monitoring server or the monitoring server polls the agent. Some admins are okay with agents; others don't like them because they can leave a larger footprint on the server being monitored.
- Focused APIs (e.g. the VMware API or the ability to run SQL queries)
If you have mostly one OS in your environment or a primary OS, certain systems might have more options than others.
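As an illustration of the "running scripts" method, here is a minimal Python sketch of a poller that runs a check command from the monitoring box and turns its exit code into a state. It assumes the check script follows the exit code convention used by Nagios-style plugins (0 for OK, 1 for warning, 2 for critical); `run_check` and `fake_check` are hypothetical names, and the stand-in check is a one-line Python command so the example runs anywhere.

```python
import subprocess
import sys

# Assumed convention: 0 = OK, 1 = WARNING, 2 = CRITICAL, anything else = UNKNOWN.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def run_check(command, timeout=10.0):
    """Run one check command and return (state, first-hand output)."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A check that hangs or cannot be started is itself worth reporting.
        return "UNKNOWN", str(exc)
    return STATES.get(result.returncode, "UNKNOWN"), result.stdout.strip()


# Stand-in for a real check script: prints a message and exits with 1.
fake_check = [
    sys.executable,
    "-c",
    "import sys; print('load average is high'); sys.exit(1)",
]
state, output = run_check(fake_check)
print(state, output)
```

Whatever polling method you choose, deciding up front how timeouts and failed checks are reported saves confusion later.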
Configuration:
In monitoring systems there tends to be a lot of object reuse. For example, you want to monitor a certain application such as Apache or IIS on a bunch of servers. Or you want certain thresholds to apply to groups of servers. You might also have certain groups of people who are "on call". Therefore a good templating system is vital to a monitoring system.
The configuration is generally done through a user interface or text files. The user interface option will generally be easier, but text files tend to be better for reuse and variables. So depending on your IT staff you might prefer simplicity over power.
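To show why templating matters, here is a minimal Python sketch of template inheritance: a host picks a template, and a template can inherit from a parent and override individual settings. The `use` key, the template names, and the setting names are hypothetical and not taken from any particular product.

```python
def resolve(name, templates):
    """Merge a template with its parents; child settings override parent ones."""
    template = templates[name]
    merged = {}
    parent = template.get("use")
    if parent is not None:
        merged.update(resolve(parent, templates))
    # Apply this template's own settings last so they win over the parent's.
    merged.update({key: value for key, value in template.items() if key != "use"})
    return merged


templates = {
    "generic-server": {
        "check_interval": 300,
        "cpu_warning": 80,
        "contacts": "ops-oncall",
    },
    "web-server": {
        "use": "generic-server",
        "check_http": True,
        "check_interval": 60,
    },
}

# Each host only names its template; the details live in one place.
hosts = {"web01": "web-server", "web02": "web-server", "db01": "generic-server"}
effective = {host: resolve(tpl, templates) for host, tpl in hosts.items()}
print(effective["web01"])
```

Changing a threshold in `generic-server` then applies to every host that inherits from it, which is the reuse a good templating system gives you.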
User Interface:
The most common interface for monitoring systems these days is a web interface. Some things to evaluate in regards to the web interface are:
- Good overviews
- Good detail pages
- Speed (When you need to find information in crisis mode, a slow interface can be very frustrating)
- General feeling. You will spend a lot of time in the interface; if it feels clunky, your IT staff will be resistant to using it
- Customization. Every organization has certain things that are important, and other things that are not. It is important to be able to customize it to your needs
Alerting Engine:
The alerting engine has to be flexible and reliable. There are lots of different ways to be notified including:
- SMS
- Phone
- Other things like IM/Jabber
Other features to look for are:
- Escalations (Notify someone if the other person has not acknowledged or fixed the alert)
- Rotations and Shifts
- Groups (Certain groups need to be notified of certain things)
It is important to trust that when something goes wrong you will get the alert. This comes down to two things:
- A reliable system
- A caveat-free configuration. In monitoring systems it is not uncommon to think you should get an alert, only to find that some detail in the configuration meant the alert was never triggered.
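Escalations are easier to reason about with a small example. The following Python sketch decides who should have been notified for an alert based on how long it has gone unacknowledged. The chain, the contact names, and the 15 and 30 minute delays are invented for illustration.

```python
def contacts_to_notify(minutes_open, acknowledged, escalation_chain):
    """Return every contact whose escalation delay has been reached.

    escalation_chain is a list of (after_minutes, contact) pairs.
    An acknowledged alert stops escalating.
    """
    if acknowledged:
        return []
    return [
        contact
        for after_minutes, contact in escalation_chain
        if minutes_open >= after_minutes
    ]


# Hypothetical chain: page the primary at once, then widen the circle.
chain = [(0, "primary-oncall"), (15, "secondary-oncall"), (30, "team-lead")]

print(contacts_to_notify(0, False, chain))
print(contacts_to_notify(20, False, chain))
print(contacts_to_notify(45, True, chain))
```

Walking through a few cases like this against your real configuration is a cheap way to catch the "I thought I would get an alert" surprises before an outage does.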
Data Store:
If the system collects data over time (e.g. systems that include graphs), then it needs somewhere to store that data. A very common implementation for both the store and the graphing is RRD.
Some features to look for from the data store are:
- Raw access to the data. This can be valuable for developing against or creating custom graphs with something like Excel.
- Scalability. Depending on how much data you collect, it can add up fast. If you are going to collect a lot, you want to make sure the store will scale.
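RRD stands for round-robin database: the store has a fixed size, and new samples overwrite the oldest ones. The Python sketch below illustrates only that idea using a bounded deque; it is not RRDtool's file format or API, and `RoundRobinSeries` is a hypothetical name.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size series: once full, each new sample evicts the oldest."""

    def __init__(self, slots):
        self.samples = deque(maxlen=slots)

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))

    def average(self):
        """Mean of the retained values, or None if nothing is stored yet."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)


series = RoundRobinSeries(slots=3)
for timestamp, value in enumerate([10.0, 20.0, 30.0, 40.0]):
    series.add(timestamp, value)

# Only the three most recent samples remain; the first one was overwritten.
print(list(series.samples), series.average())
```

The appeal of this design is that storage per metric is bounded and known in advance; the trade-off is that old detail is lost, so check how much history a given system keeps and at what resolution.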
Graphing Library:
Graphs can be useful to quickly identify trends and give context to the current state of something based on its history. Some include trending, which can help predict things before they happen (e.g. running out of disk space). Make sure that the graphs will give you the information you think you are going to need in a clear way.
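As a rough idea of what trending can mean in practice, here is a Python sketch that fits a straight line (ordinary least squares) to disk usage samples and estimates how many days remain until the disk is full. The function name and the sample numbers are invented, and real growth is rarely this linear, so treat the result as an estimate only.

```python
def days_until_full(usage_by_day, capacity):
    """Estimate days after the last sample until usage reaches capacity.

    usage_by_day is a list of (day, used) pairs. Returns None when there
    are too few samples or usage is not growing.
    """
    n = len(usage_by_day)
    if n < 2:
        return None
    mean_x = sum(day for day, _ in usage_by_day) / n
    mean_y = sum(used for _, used in usage_by_day) / n
    var_x = sum((day - mean_x) ** 2 for day, _ in usage_by_day)
    if var_x == 0:
        return None
    slope = sum(
        (day - mean_x) * (used - mean_y) for day, used in usage_by_day
    ) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = usage_by_day[-1][0]
    return max(0.0, full_day - last_day)


# Made-up history: 50 GB used on day 0, growing 2 GB per day, 100 GB disk.
history = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until_full(history, capacity=100.0))
```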
Access Controls:
If you have a large organization you might need access controls because certain admins should only be able to adjust certain things. You might also want public facing dashboards. If this is important you should make sure the monitoring system has the controls you need.
Other Features
Reporting:
A system that provides good reports can help you identify what needs to be improved over long periods of time. For example, it can give a good answer to questions like "what systems go down the most?". This can be important when you are trying to convince management to spend money on certain things, because businesses like hard evidence.
Specialized Features:
Some monitoring systems are targeted at specific products or support some products better than others. For example, if the main thing you need to monitor is SQL Server, or if you make heavy use of VMware products, you should see how well these are supported.
Predefined Monitoring Templates:
A system that comes with a lot of predefined templates (or has a user base that has created many templates) can be a huge time saver.
Discovery:
If you have a large or changing environment, discovery can save a lot of manual work. Some systems provide the ability to add new systems via an API or run scans to find new servers or components.
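The scanning side of discovery boils down to comparing what answers on the network with what the monitoring system already knows. Here is a minimal Python sketch of that comparison. The `responds` callback stands in for a real probe (ping, SNMP, and so on) and is stubbed out here; the addresses come from the 192.0.2.0/24 documentation range.

```python
import ipaddress


def undiscovered(network, known_hosts, responds):
    """List addresses in the network that respond but are not yet monitored."""
    return [
        str(address)
        for address in ipaddress.ip_network(network).hosts()
        if str(address) not in known_hosts and responds(str(address))
    ]


# Hosts the monitoring system already tracks (hypothetical inventory).
known = {"192.0.2.1", "192.0.2.2"}

# Stub probe: pretend these addresses answered the scan.
alive = {"192.0.2.1", "192.0.2.2", "192.0.2.5"}

new_hosts = undiscovered("192.0.2.0/29", known, lambda ip: ip in alive)
print(new_hosts)
```

In a real system the resulting list would be fed to whatever API or configuration mechanism adds hosts, ideally with a template applied automatically.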
Distributed Monitoring:
If you have multiple locations to monitor, it can be helpful to have monitoring pollers in each location rather than a lot of independent systems or a single system monitoring everything over the WAN.
There are a lot of monitoring systems out there. We have a list with a summary on this old question. For quick reference some that I hear the most about are:
- Nagios
- Cacti
- OpenNMS
- SolarWinds
- Zabbix
- Various cloud-based monitoring systems
- Microsoft System Center
- This one isn't popular yet, but Stack Exchange has open sourced its monitoring system http://bosun.org
The reason I can't tell you what to use is that every organization has its own needs. If you want to make the right choice, you should think through all the above components and figure out what features are important to your organization. Then find a system or systems that claim to provide what you need and try them out. Some cost a little, some cost a lot, and some are free. Taking all of that into account, you can then make your choice. From what I have used, they are all far from perfect, but at least you can try to get something that fits.