
apache 2.2 - Server Freeze Up Under Load

I'm having a problem with a Debian server that I thought was due to bad RAM, but the problem is persisting.




It's a Dell PowerEdge 6800 with two dual-core 3.6GHz Xeon processors and 5GB of DDR2 ECC 333.



I've got a single 73GB SCSI drive.



I'm working it to death right now, pulling records from MySQL to build Asterisk .call files (small text files) which trigger SIP calls.
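
For context, here is a minimal sketch of how one such .call file might be built and dropped into the spool. It is not the actual dialer code: the channel string, context name, and directory arguments are hypothetical, and only the standard call-file keys (Channel, MaxRetries, RetryTime, WaitTime, Context, Extension, Priority) are used. The file is written in a staging directory and then renamed into the spool directory (usually /var/spool/asterisk/outgoing), so Asterisk never picks up a half-written file.

```python
import os
import tempfile


def build_call_file(channel, context, extension, priority=1,
                    max_retries=0, retry_time=60, wait_time=30):
    """Return the text of an Asterisk call file for one outbound call."""
    lines = [
        f"Channel: {channel}",
        f"MaxRetries: {max_retries}",
        f"RetryTime: {retry_time}",
        f"WaitTime: {wait_time}",
        f"Context: {context}",
        f"Extension: {extension}",
        f"Priority: {priority}",
    ]
    return "\n".join(lines) + "\n"


def spool_call_file(text, spool_dir, staging_dir):
    """Write the call file in staging_dir, then rename it into spool_dir.

    The rename is atomic only if both directories are on the same
    filesystem, which is assumed here.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.rename(tmp_path, final_path)
    return final_path
```

Each record pulled from MySQL would map to one `build_call_file` call followed by one `spool_call_file` call; the rate at which files are spooled is what sets the calls-per-minute figure.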



We manage it via a CGI interface, and the system is also running Citadel for our mail, but we have fewer than five users, so it's not a huge drain.



My peak usage seems to be about 460 calls per minute. Load hovers between 2.0 and 4.3; if I push it past that, it spikes to >22.0.




The problem I'm having is that, about an hour into a dial, it's freezing up on me. Last night I started it at 5:59, and at 6:55:17 the system became non-responsive. Nothing was logged, and I couldn't connect via ssh or http. It responded to ping, and nmap showed open ports which I was able to telnet to, but I couldn't elicit any response from them.



My sar data collection ran at 6:50, and at that time I was seeing heavy usage, as expected, but nothing outrageous, as far as I can tell.



The system had been complaining of a memory error in one of the new 2GB strips I'd installed, so after the first crash, I replaced that pair with the 512MB strips we upgraded from.



I'm currently dialing with a live sar data collection running, in case it crashes again. At least I'll be able to dial in with a little more granularity.
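
As a rough illustration of that kind of fine-grained collection, here is a minimal sketch (not a replacement for sar) that appends the load average to a file at a fixed interval and forces each sample to disk, so the last line written before a freeze has a chance of surviving the power-cycle. The file path, interval, and output format are assumptions made for the example, and it will not help if the disk I/O path itself is what hangs.

```python
import os
import time


def format_sample(timestamp, loadavg):
    """Format one sample as: epoch-seconds, then 1/5/15-minute load."""
    one, five, fifteen = loadavg
    return f"{timestamp:.0f} {one:.2f} {five:.2f} {fifteen:.2f}\n"


def log_load(path, interval=1.0, samples=None):
    """Append load-average samples to path, syncing after every write.

    Runs forever when samples is None; otherwise stops after that many
    samples and returns the number written.
    """
    count = 0
    with open(path, "a") as fh:
        while samples is None or count < samples:
            fh.write(format_sample(time.time(), os.getloadavg()))
            fh.flush()              # push Python's buffer to the kernel
            os.fsync(fh.fileno())   # ask the kernel to write it to disk
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)
    return count
```

Calling `log_load("/var/tmp/load.log", interval=1.0)` (a hypothetical path) would leave a one-second-resolution record to compare against the 10-second sar snapshot.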



Other than that, I'm lost as to how to diagnose the system freeze in the absence of any relevant log data or a crash dump, as the system is still running but completely non-responsive during this time, until I perform a power-cycle. Any ideas?




NOTE: I have new servers on order to take some of the load off of this system by distributing services, but in the meantime, it's a mean time where our production is relying on this workhorse.



Here's the sar data from last night's crash: http://bluedot.martythebanker.com/sar.txt



UPDATE: This sar snapshot was running in 10-second increments, last gathered 1 second prior to the freeze-up: http://bluedot.martythebanker.com/livesar.txt



I've purchased a terminal console server, and can now see what's going on when the system freezes up.



This set of messages just repeats every 30 seconds or so, cycling through CPU1 and CPU2:




[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
[17675.940127] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext]
[17675.940127]
[17675.940127] Pid: 4579, comm: asterisk Not tainted (2.6.32-5-686-bigmem #1) PowerEdge 6800
[17675.940127] EIP: 0060:[] EFLAGS: 00000202 CPU: 1
[17675.940127] EIP is at native_flush_tlb_others+0x85/0xa6
[17675.940127] EAX: 00000282 EBX: c14620ac ECX: c102fb3a EDX: 00000020
[17675.940127] ESI: 00000001 EDI: 00000040 EBP: c14620a0 ESP: f35d1a3c
[17675.940127] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[17675.940127] CR0: 80050033 CR2: b3f06946 CR3: 36787000 CR4: 000006f0
[17675.940127] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[17675.940127] DR6: ffff0ff0 DR7: 00000400
[17675.940127] Call Trace:
[17675.940127] [] ? flush_tlb_page+0x5d/0x65
[17675.940127] [] ? ptep_set_access_flags+0x59/0x63
[17675.940127] [] ? do_wp_page+0x3b9/0x7dd
[17675.940127] [] ? kmap_atomic_prot+0xd7/0xfc
[17675.940127] [] ? handle_mm_fault+0x982/0xa22
[17675.940127] [] ? lock_hrtimer_base+0x15/0x2f
[17675.940127] [] ? hrtimer_try_to_cancel+0x2f/0x35
[17675.940127] [] ? do_page_fault+0x2f1/0x307
[17675.940127] [] ? do_page_fault+0x0/0x307
[17675.940127] [] ? error_code+0x73/0x78
[17675.940127] [] ? copy_strings+0x94/0x1ba
[17675.940127] [] ? do_sys_poll+0x2c3/0x312
[17675.940127] [] ? __pollwait+0x0/0xa5
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? activate_task+0x1e/0x24
[17675.940127] [] ? push_rt_task+0x208/0x242
[17675.940127] [] ? post_schedule+0x31/0x3e
[17675.940127] [] ? schedule+0x78f/0x7dc
[17675.940127] [] ? futex_wait_setup+0x5c/0xcd
[17675.940127] [] ? futex_wait_queue_me+0x87/0x98
[17675.940127] [] ? sched_clock+0x5/0x7
[17675.940127] [] ? zone_watermark_ok+0x16/0x99
[17675.940127] [] ? cpupri_find+0x4c/0xd6
[17675.940127] [] ? get_page_from_freelist+0xc0/0x3c7
[17675.940127] [] ? check_preempt_curr_rt+0x76/0xe3
[17675.940127] [] ? smp_invalidate_interrupt+0x73/0x86
[17675.940127] [] ? __alloc_pages_nodemask+0xf3/0x4d9
[17675.940127] [] ? cpumask_any_but+0x20/0x2b
[17675.940127] [] ? flush_tlb_page+0x4a/0x65
[17675.940127] [] ? mutex_lock+0xb/0x24
[17675.940127] [] ? do_sync_read+0xc0/0x107
[17675.940127] [] ? do_send_sig_info+0x4f/0x59
[17675.940127] [] ? autoremove_wake_function+0x0/0x2d
[17675.940127] [] ? ktime_get_ts+0xcd/0xd5
[17675.940127] [] ? sys_poll+0x44/0x8d
[17675.940127] [] ? sysenter_do_call+0x12/0x28
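
Since the same report repeats and cycles between CPUs, it can help to tally the header lines from a captured console log. The following is a minimal sketch, assuming the console output has been saved as text and that each report starts with a "BUG: soft lockup" line in the format shown above; the function names are made up for the example.

```python
import re
from collections import Counter

# Matches e.g. "[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]"
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_lockups(log_text):
    """Return one dict per soft lockup header found in log_text."""
    events = []
    for m in LOCKUP_RE.finditer(log_text):
        events.append({
            "uptime": float(m.group("ts")),   # seconds since boot
            "cpu": int(m.group("cpu")),
            "stuck_for": int(m.group("secs")),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
        })
    return events


def count_by_cpu(events):
    """Count how many lockup reports each CPU produced."""
    return Counter(e["cpu"] for e in events)
```

Running `count_by_cpu(parse_lockups(text))` over a full capture would show whether the lockups are spread evenly across CPUs or concentrated on one, and the `comm` field shows whether it is always asterisk that is stuck.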


The first iteration had another set of modules listed:



[267866.376128] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap crc16 bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs binfmt_misc fuse loop radeon ttm psmouse drm_kms_helper serio_raw evdev pcspkr drm i2c_algo_bit rng_core i2c_core dcdbas shpchp button pci_hotplug processor ext3 jbd mbcache sd_mod crc_t10dif sg sr_mod cdrom ata_generic uhci_hcd ata_piix mptspi mptscsih ehci_hcd mptbase usbcore nls_base libata tg3 scsi_transport_spi scsi_mod floppy libphy thermal thermal_sys [last unloaded: scsi_wait_scan]


I installed intel-microcode and microcode.ctl, but I haven't figured out how to disable hyperthreading, as some other forums have suggested.
