
linux - Good introduction to server monitoring?




I'm currently developing a small web application using Linux, Apache, Django, and MySQL.



Being a developer with bare-minimum knowledge of Linux / shell scripting / server monitoring, I have no clue what kind of monitoring I'm supposed to do... However, some things I'd like to have are:




  • Easy access to the time series of CPU / memory usage.

  • Alerts sent out whenever server resources are being overused.

  • Easy access to Apache log files, with the ability to run quick analyses on them.




Also, I'm wondering if there are any other log files / services that I should keep an eye on.


Answer



Server monitoring depends on which metrics matter for the server's purpose. For a web application there are quite a few areas to cover. There is an endless number of metrics you could track, but you'll usually want at least these:




  • Availability of server and services

  • Disk space & usage

  • Network usage


  • Memory usage

  • CPU usage

  • Log files



The other part of monitoring, besides watching the present, is keeping a record of the past. This gives you the ability to:




  • Plan for the future

  • Identify causes when issues pop up




Will you run out of disk space in the next two months with the same growth? Are you seeing increases in CPU usage aligning with new feature deployments? Why are users having to wait four seconds to view a page?
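To answer the first question from recorded data, here is a minimal sketch assuming you keep one disk-usage reading per day and that growth stays roughly linear: fit a least-squares line to the samples and extrapolate to the disk's capacity. The function name and sample numbers are hypothetical.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until a disk fills, assuming linear growth.

    samples: list of (day_number, used_gb) pairs, e.g. one reading per day.
    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    # Ordinary least-squares slope: GB of growth per day.
    cov = sum((d - mean_x) * (u - mean_y) for d, u in samples)
    var = sum((d - mean_x) ** 2 for d, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one day")
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(d for d, _ in samples)
    # Day on which the fitted line reaches capacity, relative to the last sample.
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - last_day)
```

For example, days_until_full([(0, 40), (30, 46), (60, 52)], 100) gives roughly 240 days: usage grows 0.2 GB/day, so the fitted line reaches 100 GB on day 300. Real growth is rarely perfectly linear, so treat this as an early warning rather than a prediction.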



I'll touch on each of the above metrics:



Availability



The simplest availability monitoring uses the ping command, but a server that answers pings doesn't mean its services, such as the web server, are available; Apache may have crashed. More thorough monitoring would run a test transaction on the website every hour to ensure that users can buy products.
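A minimal sketch of a check that goes one step beyond ping, using only Python's standard library: it fetches a page and treats anything other than a 2xx/3xx answer as down. The function name and default timeout are my own choices; a real test transaction would also log in or submit a form, which depends entirely on your app.

```python
import http.client
import urllib.error
import urllib.request


def check_http(url, timeout=10.0):
    """Return True if `url` answers with a 2xx or 3xx status within `timeout` seconds.

    Unlike ping, this exercises the web server itself, so a crashed Apache or a
    broken Django app shows up as a failure even when the host is still up.
    """
    try:
        # urlopen follows redirects and raises HTTPError for 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, bad response, or an HTTP error status.
        return False
```

You could run something like check_http("http://www.example.com/") (hypothetical URL) from cron on a different machine and send yourself an alert whenever it returns False.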




Disk Space and Usage



The space metric is obvious: you'll want to know ahead of time, before your app stops working. The usage part is a bit more complex. Usage covers metrics like bytes read/written, input/output operations per second (IOPS), etc. These can be important because if you see an increase in site latency correlated with a drop in disk performance, you may have developed a bad disk that requires multiple seeks or reads to satisfy each request. Don't forget to measure inode usage too; that's a metric I've forgotten about a couple of times within OpenVZ.
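On Linux both halves can be read without extra tools. A rough sketch, assuming a Linux host: os.statvfs gives the space and inode counts (the same numbers df and df -i report), and /proc/diskstats holds cumulative read/write counters, so sampling it twice and dividing by the interval yields IOPS and throughput. The helper names are hypothetical.

```python
import os

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def disk_space_report(path="/"):
    """Percent of space and inodes used on the filesystem holding `path`."""
    st = os.statvfs(path)
    # Like df: used / (used + available to non-root users).
    used_blocks = st.f_blocks - st.f_bfree
    usable = used_blocks + st.f_bavail
    space_pct = 100.0 * used_blocks / usable if usable else 0.0
    # Some filesystems (e.g. btrfs) report 0 total inodes; treat that as "not applicable".
    inode_pct = None
    if st.f_files:
        inode_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files
    return {"space_used_pct": space_pct, "inodes_used_pct": inode_pct}


def parse_diskstats(text):
    """Map device name -> (reads, sectors_read, writes, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue
        # Fields: major minor name reads merged sectors ms writes merged sectors ...
        stats[f[2]] = (int(f[3]), int(f[5]), int(f[7]), int(f[9]))
    return stats


def io_rates(before, after, seconds):
    """IOPS and bytes/s per device between two parse_diskstats() snapshots."""
    rates = {}
    for dev, (r1, sr1, w1, sw1) in after.items():
        if dev not in before:
            continue
        r0, sr0, w0, sw0 = before[dev]
        rates[dev] = {
            "iops": ((r1 - r0) + (w1 - w0)) / seconds,
            "read_bps": (sr1 - sr0) * SECTOR_BYTES / seconds,
            "write_bps": (sw1 - sw0) * SECTOR_BYTES / seconds,
        }
    return rates
```

To get I/O rates, read /proc/diskstats twice some seconds apart and pass both parsed snapshots to io_rates together with the interval.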



Network Usage



Hitting your network bandwidth limit? Are you seeing the same numbers your ISP is seeing?
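To check bandwidth against your ISP, the kernel keeps cumulative byte counters per interface in /proc/net/dev. A minimal sketch, assuming Linux: parse two snapshots taken a known number of seconds apart and convert the difference to Mbit/s. Keeping these samples over time is what lets you compare against the ISP's figures. The function names are hypothetical.

```python
def parse_net_dev(text):
    """Map interface -> (rx_bytes, tx_bytes) from the contents of /proc/net/dev."""
    counters = {}
    for line in text.splitlines()[2:]:  # first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # The receive block has 8 columns, so transmit bytes is the 9th field.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_mbit(before, after, seconds):
    """Receive/transmit rate in Mbit/s per interface between two snapshots."""
    out = {}
    for iface, (rx1, tx1) in after.items():
        if iface in before:
            rx0, tx0 = before[iface]
            out[iface] = ((rx1 - rx0) * 8 / seconds / 1e6,
                          (tx1 - tx0) * 8 / seconds / 1e6)
    return out
```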



Memory Usage




When the system starts running out of memory, it will start swapping pages out to disk, which hurts performance.
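A sketch of a basic memory check, assuming a Linux kernel recent enough (3.14 or later) to report MemAvailable in /proc/meminfo. The 10% threshold and the function names are arbitrary examples.

```python
def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (mostly in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_status(info, min_available_pct=10.0):
    """Summarize memory pressure; flags when available RAM drops below a threshold."""
    avail_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
    swap_used_kb = info["SwapTotal"] - info["SwapFree"]
    return {
        "available_pct": avail_pct,
        "swap_used_kb": swap_used_kb,
        "low_memory": avail_pct < min_available_pct,
    }
```

Swap that is merely occupied is less alarming than pages actively moving in and out; the pswpin and pswpout counters in /proc/vmstat show the latter.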



CPU Usage



Is the CPU spiking to 100% during peak times? Maybe you can improve the users' experience by upgrading the server to a faster CPU or more CPUs. Does performance die because the CPU has to handle so many network controller interrupts? Maybe it's time to invest in a TCP offload card.
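CPU utilization can be derived from the cumulative counters on the first cpu line of /proc/stat. A minimal sketch, assuming Linux: take two samples and compute the share of non-idle time between them, counting iowait as idle. The irq and softirq columns on the same line are where heavy interrupt load would show up. The function names are hypothetical.

```python
def parse_cpu_line(line):
    """Return (busy, total) ticks from the aggregate 'cpu' line of /proc/stat."""
    # user nice system idle iowait irq softirq steal; guest time is already in user.
    fields = [int(x) for x in line.split()[1:9]]
    idle = fields[3] + fields[4]  # idle + iowait
    total = sum(fields)
    return total - idle, total


def cpu_percent(before_line, after_line):
    """Percent of CPU time spent busy between two samples of the 'cpu' line."""
    b0, t0 = parse_cpu_line(before_line)
    b1, t1 = parse_cpu_line(after_line)
    dt = t1 - t0
    return 100.0 * (b1 - b0) / dt if dt else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, the aggregate over all CPUs."""
    with open(path) as f:
        return f.readline()
```

Calling read_cpu_line() twice, a few seconds apart, and passing both lines to cpu_percent gives the utilization over that interval; an alert could fire when it stays above your chosen limit for several samples.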



Log Files




  • The MySQL slow query log: Queries that ran longer than your threshold (the long_query_time setting). Review this file and optimize as needed. If you can't improve them and the slow queries coincide with heavy system load, then it may be time to upgrade.



  • The application's log files: What were users doing that caused all the heavy system load? Were most of them viewing a specific page? Why did only half of the user uploads work today?


  • The Apache log files: Knowing the numbers is useful for site design effectiveness, usability, advertising campaign measurements, broken pages or images, etc.


  • The system's log files: Hack attempts, hardware errors, various daemon messages.
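AWStats can turn the Apache logs into reports, but for quick ad-hoc questions a short script is often enough. A sketch, assuming Apache's standard common or combined LogFormat: it counts status codes, the most requested paths, and the paths returning 404 (broken pages or images). The regex and function name are my own; a custom LogFormat would need a different pattern.

```python
import re
from collections import Counter

# Matches the start of Apache's common/combined format: %h %l %u %t "%r" %>s ...
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def summarise(lines, top=5):
    """Count status codes, the most requested paths, and paths returning 404."""
    statuses = Counter()
    paths = Counter()
    not_found = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # malformed line or a different LogFormat
        status = m.group("status")
        statuses[status] += 1
        paths[m.group("path")] += 1
        if status == "404":
            not_found[m.group("path")] += 1
    return statuses, paths.most_common(top), not_found.most_common(top)
```

Typical default log locations are /var/log/apache2/access.log on Debian/Ubuntu and /var/log/httpd/access_log on CentOS; since summarise accepts any iterable of lines, an open file object can be passed to it directly.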




It's usually best to ship system logs off to another server so an intruder can't cover their tracks.
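System daemons are normally forwarded by the syslog daemon itself (for example rsyslog), but an application's own logs can be shipped the same way. A minimal sketch using Python's standard logging.handlers.SysLogHandler; the logger name, host, port, and choice of the local0 facility are assumptions.

```python
import logging
import logging.handlers


def make_remote_logger(name, host, port=514):
    """Return a logger that forwards records to a remote syslog server over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

For example, make_remote_logger("myapp", "logs.example.com") with a hypothetical host. UDP is fire-and-forget and can drop messages; SysLogHandler also accepts socktype=socket.SOCK_STREAM for TCP. Call it once at startup, since each call adds another handler.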



Beyond these, there are lots of other things that can be monitored: transactions per second, server temperature, hard drive temperature & SMART data, RAID status, backup reports, batch job statuses, and so on.
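For RAID status, assuming Linux software RAID (md) rather than a hardware controller (which needs the vendor's own tool), the kernel lists each array's members in /proc/mdstat, where an underscore in the [UU] field marks a failed or missing disk. A minimal sketch; the function name is hypothetical.

```python
import re

# Status lines look like "... blocks super 1.2 [2/2] [UU]"; "_" marks a missing member.
MEMBER_RE = re.compile(r"\[\d+/\d+\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that have a failed or missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        if line.startswith("md"):
            current = line.split()[0]  # e.g. "md0"
            continue
        m = MEMBER_RE.search(line)
        if m and current and "_" in m.group(1):
            degraded.append(current)
    return degraded
```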



The Tools




There are quite a few tools to accomplish some of the above. More specific metrics will need to be self-coded if they aren't already available (showing the qmail queue size via SNMP is one such metric I've put together, because sometimes qmail would half-break: it would still accept new emails but not send any out).
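As a sketch of a self-coded metric, here is a queue-size check written to the Nagios plugin convention (exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, one status line, performance data after the "|") instead of SNMP. It assumes the queue directory holds one file per queued message, as in qmail's queue/mess tree; the path, thresholds, and function names are hypothetical, so adjust them for your MTA.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def count_files(root):
    """Count non-directory entries below `root` (one per queued message, by assumption)."""
    return sum(len(files) for _, _, files in os.walk(root))


def nagios_status(n, warn, crit):
    """Map a count to (exit_code, status_line) in Nagios plugin format."""
    code = CRITICAL if n >= crit else WARNING if n >= warn else OK
    # Text before '|' is shown to humans; after it is performance data for graphing.
    return code, f"QUEUE {NAMES[code]} - {n} messages queued | queue={n};{warn};{crit};0"


def check_queue(root, warn, crit):
    """Check the queue directory; UNKNOWN if it doesn't exist."""
    if not os.path.isdir(root):
        return UNKNOWN, f"QUEUE UNKNOWN - {root} not found"
    return nagios_status(count_files(root), warn, crit)
```

As an actual plugin, the script would call check_queue("/var/qmail/queue/mess", 100, 500), print the returned line, and exit with the returned code via sys.exit.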



Some of the tools I use that you can easily start with:




  • Nagios or Icinga - One of the most popular *nix monitoring tools. It has quite a few checks available, like MySQL slave monitoring. I generally use it specifically for availability monitoring of all services, set up to send an email to my phone's email-to-text address for alerts. Icinga is a fork of Nagios. Browse through the "commands" and see which ones you can use.

  • Munin or collectd - These give you the graphs. A breeze to set up on CentOS. Set up the MySQL monitoring plugin for database insights like buffer usage.

  • WebSitePulse - Be aware that availability monitoring works best when done remotely. I use their POP3 monitoring, via a script I made, to verify that Nagios is still running.

  • AWStats - Process the Apache log files into reports.


  • Google Analytics - More client details that aren't in the common Apache log, such as screen resolution and color depth.
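Munin plugins are just executables: called with the argument config they print the graph definition, otherwise they print lines of the form field.value N. A minimal sketch of such a plugin, graphing established IPv4 connections to port 80 from /proc/net/tcp (state 01 is ESTABLISHED). The port, field name, and warning/critical thresholds are assumptions; IPv6 connections live in /proc/net/tcp6 and are not counted here.

```python
import sys

HTTP_PORT = 80  # assumption: Apache listens on port 80


def established_on_port(proc_net_tcp_text, port):
    """Count ESTABLISHED (state 01) IPv4 connections whose local port is `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # "0100007F:0050" -> 80
        if local_port == port and fields[3] == "01":
            count += 1
    return count


def munin_output(args, tcp_text):
    """Return what the plugin prints for the given command-line arguments."""
    if args[:1] == ["config"]:
        return "\n".join([
            "graph_title Established HTTP connections",
            "graph_vlabel connections",
            "graph_category apache",
            "http_conns.label established",
            "http_conns.warning 200",
            "http_conns.critical 400",
        ])
    return f"http_conns.value {established_on_port(tcp_text, HTTP_PORT)}"


def main():
    with open("/proc/net/tcp") as f:
        print(munin_output(sys.argv[1:], f.read()))
```

To deploy it, call main() at the bottom of the script, make it executable, and link it into Munin's plugin directory (commonly /etc/munin/plugins).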

