I should note that I'm not a sysadmin. You'll figure that out very shortly. :)
In a nutshell: Apache keeps taking a breather during heavy loads and all processes go idle. This is a polling server that is used by applications. The polls come from a lot of different endpoints. From time to time (every 4-5 minutes) if I'm watching top, HTTPD processes go idle all at the same time, stalling traffic for 10 seconds or so. It then recovers. The delay is problematic.
- Server is serving a lot of traffic. These are application polls via HTTPS, not web pages (though I doubt Apache knows the difference)
- The pauses noted above cause the traffic to become lopsided: after some time, I get a WHOLE BUNCH OF TRAFFIC, then a lull, then a WHOLE BUNCH OF TRAFFIC again
- Each poll requires a small database dip
Apache logs
Sometimes, but not always (mostly after a restart), I get these messages in error_log. Most of the time when it happens, I see nothing in the error_log.
[Mon Jun 30 17:55:17 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 31 idle, and 98 total children
[Mon Jun 30 17:55:18 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 14 idle, and 98 total children
[Mon Jun 30 17:55:44 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 74 idle, and 99 total children
[Mon Jun 30 17:55:54 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 61 idle, and 99 total children
[Mon Jun 30 17:56:00 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 0 idle, and 97 total children
[Mon Jun 30 17:56:02 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 36 idle, and 99 total children
[Mon Jun 30 17:56:03 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 39 idle, and 99 total children
[Mon Jun 30 18:08:17 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 18 idle, and 99 total children
[Mon Jun 30 18:08:18 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 16 children, there are 63 idle, and 98 total children
[Mon Jun 30 18:08:19 2014] [info] server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 32 children, there are 74 idle, and 97 total children
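To see whether these "server seems busy" messages cluster in time, the log lines can be parsed and grouped into bursts. The following is a minimal sketch that assumes the Apache 2.2 error_log format shown above; the function names and the 60-second grouping window are illustrative choices, not anything Apache defines.

```python
import re
from datetime import datetime

# Matches the "server seems busy" lines in the format shown in the excerpt.
BUSY_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[info\] server seems busy, .*?"
    r"spawning (?P<spawn>\d+) children, there are (?P<idle>\d+) idle, "
    r"and (?P<total>\d+) total children"
)


def parse_busy_lines(lines):
    """Return (timestamp, spawning, idle, total) tuples for matching lines."""
    events = []
    for line in lines:
        m = BUSY_RE.search(line)
        if m is None:
            continue  # not a "server seems busy" line
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        events.append(
            (ts, int(m.group("spawn")), int(m.group("idle")), int(m.group("total")))
        )
    return events


def group_bursts(events, max_gap_seconds=60):
    """Group consecutive events separated by at most max_gap_seconds.

    The 60-second default is an arbitrary assumption for illustration.
    """
    bursts = []
    for event in events:
        if bursts and (event[0] - bursts[-1][-1][0]).total_seconds() <= max_gap_seconds:
            bursts[-1].append(event)
        else:
            bursts.append([event])
    return bursts
```

Feeding it the lines of error_log (for example via `open(path)`) gives one list per burst; comparing the first timestamp of each burst shows how far apart the episodes are. In the excerpt above, the two clusters are roughly 12 minutes apart.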
Apache Config (old config commented out)
just showing config items that I suspect are relevant
#Timeout 60
Timeout 20
KeepAlive on
MaxKeepAliveRequests 1000
KeepAliveTimeout 2
<IfModule prefork.c>
StartServers 85
MinSpareServers 85
MaxSpareServers 100
ServerLimit 100
MaxClients 100
#StartServers 60
#MinSpareServers 60
#MaxSpareServers 85
#ServerLimit 85
#MaxClients 85
MaxRequestsPerChild 1000
</IfModule>
Note that there's no difference between old and new configs in behavior.
Environment
EC2, c1.medium, mod_perl, persistent database connections, separate RDS server, no errors showing in MySQL error logs and no errors showing in Apache logs
As an aside, I've seen suggestions to install mod_status, but I haven't figured out how to do so, and I don't know what to look for if I do.
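On the "what to look for" part: once mod_status is enabled, its machine-readable view (`server-status?auto`) reports `BusyWorkers`, `IdleWorkers` and a `Scoreboard` string with one character per worker slot. The sketch below parses that text and tallies worker states. It assumes the status handler is already reachable; the URL passed to `fetch_status` is a hypothetical example, and the helper names are made up for illustration.

```python
from collections import Counter
from urllib.request import urlopen

# Scoreboard key as printed on the mod_status page.
SCOREBOARD_KEY = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def fetch_status(url, timeout=5.0):
    """Fetch the ?auto status text. The URL is whatever your server exposes,
    e.g. "http://localhost/server-status?auto" (an assumed location)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("ascii", errors="replace")


def parse_status_auto(text):
    """Turn 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize_scoreboard(scoreboard):
    """Count workers per state, keyed by a human-readable state name."""
    counts = Counter(scoreboard)
    return {SCOREBOARD_KEY.get(ch, ch): n for ch, n in counts.items()}
```

Sampling this every second or so during a stall may be more telling than top: a worker blocked inside a request handler (for example, waiting on a database reply) would typically show as "sending reply" even though it uses no CPU, whereas a truly idle worker shows as "waiting for connection".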
Answer
Mystery solved.
In case this happens to anyone else:
The network connection (inside VPC via LAN interface) between Apache and database server was getting congested. Upgrading the database server to a larger instance solved the problem (for the time being).
Background: Amazon takes snapshots of your database every 5 minutes for its point-in-time restore feature. It downloads the binary log on your RDS instance to do so.
Every 5 minutes, the binary log gets transmitted (presumably to another EBS), and in my case that transmission congested the LAN interface. Apache stalled each time while it waited on the network connection, and connections would pile up, with some ultimately aborting.
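For anyone wanting to check for the same pattern, one rough approach is to time a TCP connection to the database endpoint at regular intervals from the Apache host and look at how far apart the slow samples are. This is only a sketch: connect time is a crude proxy for congestion, the 1-second threshold is an arbitrary assumption, and the host and port passed to `connect_latency` would be your own database endpoint.

```python
import socket
import time


def connect_latency(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def stall_intervals(samples, threshold=1.0):
    """Return the gaps (in seconds) between the starts of consecutive stalls.

    samples is a list of (timestamp_seconds, latency_or_None) pairs in time
    order. A sample counts as slow if the connection failed or took longer
    than threshold seconds; consecutive slow samples form one stall.
    """
    onsets = []
    in_stall = False
    for ts, latency in samples:
        slow = latency is None or latency > threshold
        if slow and not in_stall:
            onsets.append(ts)
        in_stall = slow
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]
```

Collect samples in a loop (for example, `samples.append((time.time(), connect_latency(host, 3306)))` every few seconds, 3306 being the usual MySQL port) and pass them to `stall_intervals`. Gaps clustering around 300 seconds would be consistent with the five-minute cadence described above.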