apache 2.2 - Server Freeze Up Under Load

I'm having a problem with a Debian server that I thought was due to bad RAM, but it is persisting.

It's a Dell PowerEdge 6800 with two dual-core 3.6GHz Xeon processors and 5GB of DDR2 ECC 333.

I've got a single 73GB SCSI drive.

I'm working it to death right now, pulling records from MySQL to build Asterisk .call files (small text files) which trigger SIP calls.
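For context, here is a minimal sketch of how one of those .call files can be generated. It is not my production script: the trunk name `mytrunk`, the `outbound` context, the retry values, and the helper name `write_call_file` are all placeholders. The one design point it illustrates is writing the file outside the spool directory and then moving it in, so Asterisk never picks up a half-written file.

```python
import os
import tempfile


def write_call_file(spool_dir, staging_dir, number,
                    context="outbound", extension="s"):
    """Write one Asterisk call file, then move it into the spool directory.

    The trunk name, context, and retry values below are placeholders.
    """
    lines = [
        f"Channel: SIP/mytrunk/{number}",  # hypothetical trunk name
        "MaxRetries: 1",
        "RetryTime: 60",
        "WaitTime: 30",
        f"Context: {context}",
        f"Extension: {extension}",
        "Priority: 1",
    ]
    # Write the complete file in a staging directory first.
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    # The rename is only atomic when both directories share a filesystem.
    os.rename(tmp_path, final_path)
    return final_path
```

On a stock install the spool directory is usually /var/spool/asterisk/outgoing, and the staging directory should live on the same filesystem.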

We manage it via a CGI interface, and the system is also running Citadel for our mail, but we have fewer than five users. It's not a huge drain.

My peak usage seems to be about 460 calls per minute. Load hovers between 2.0 and 4.3; if I push it past that, it spikes to >22.0.
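One way to keep the dialer under that ceiling is to pace call generation against the 1-minute load average. This is only a sketch of the idea: the 460 calls per minute and the 4.3 load figure come from what I'm seeing, while the 5-second backoff and the function names are assumptions made up for illustration.

```python
import os


def pace_delay(load1, calls_per_minute=460, load_ceiling=4.3,
               backoff_seconds=5.0):
    """Return how many seconds to wait before queuing the next call.

    Below the load ceiling, calls are spaced evenly across the minute.
    Above it, the dialer pauses for a fixed backoff (an assumed value).
    """
    if load1 > load_ceiling:
        return backoff_seconds
    return 60.0 / calls_per_minute


def current_load():
    """Return the 1-minute load average reported by the OS."""
    return os.getloadavg()[0]
```

The dial loop would call `time.sleep(pace_delay(current_load()))` between call files.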

The problem I'm having is that, about an hour into a dial, it's freezing up on me. Last night I started it at 5:59, and at 6:55:17, the system became non-responsive. Nothing was logged, and I couldn't connect via ssh or http. It responded to ping, and nmap showed open ports which I was able to telnet to, but I couldn't elicit any response from them.

My sar data collection ran at 6:50, and at that time, I was seeing heavy usage, as expected, but nothing outrageous, as far as I can tell.

The system had been complaining of a memory error in one of the new 2GB strips I'd installed, so after the first crash, I replaced that pair with the 512MB strips we upgraded from.

I'm currently dialing with a live sar data collection running, in case it crashes again. At least I'll be able to dial in with a little more granularity.

Other than that, I'm lost as to how to diagnose the system freeze in the absence of any relevant log data or a crash dump, as the system is still running but completely nonresponsive during this time, until I perform a power-cycle. Any ideas?

NOTE: I have new servers on order to take some of the load off of this system by distributing services, but in the meantime, our production is relying on this workhorse.

Here's the sar data from last night's crash: http://bluedot.martythebanker.com/sar.txt

UPDATE: This sar snapshot was running in 10-second increments, last gathered 1 second prior to the freeze-up: http://bluedot.martythebanker.com/livesar.txt

I've purchased a terminal console server, and can now see what's going on when the system freezes up.

This set of messages just repeats every 30 seconds or so, cycling through CPU1 and CPU2:

[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
[17675.940127] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext]
[17675.940127]
[17675.940127] Pid: 4579, comm: asterisk Not tainted (2.6.32-5-686-bigmem #1) PowerEdge 6800
[17675.940127] EIP: 0060:[] EFLAGS: 00000202 CPU: 1
[17675.940127] EIP is at native_flush_tlb_others+0x85/0xa6
[17675.940127] EAX: 00000282 EBX: c14620ac ECX: c102fb3a EDX: 00000020
[17675.940127] ESI: 00000001 EDI: 00000040 EBP: c14620a0 ESP: f35d1a3c
[17675.940127] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[17675.940127] CR0: 80050033 CR2: b3f06946 CR3: 36787000 CR4: 000006f0
[17675.940127] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[17675.940127] DR6: ffff0ff0 DR7: 00000400
[17675.940127] Call Trace:
[17675.940127] [] ? flush_tlb_page+0x5d/0x65
[17675.940127] [] ? ptep_set_access_flags+0x59/0x63
[17675.940127] [] ? do_wp_page+0x3b9/0x7dd
[17675.940127] [] ? kmap_atomic_prot+0xd7/0xfc
[17675.940127] [] ? handle_mm_fault+0x982/0xa22
[17675.940127] [] ? lock_hrtimer_base+0x15/0x2f
[17675.940127] [] ? hrtimer_try_to_cancel+0x2f/0x35
[17675.940127] [] ? do_page_fault+0x2f1/0x307
[17675.940127] [] ? do_page_fault+0x0/0x307
[17675.940127] [] ? error_code+0x73/0x78
[17675.940127] [] ? copy_strings+0x94/0x1ba
[17675.940127] [] ? do_sys_poll+0x2c3/0x312
[17675.940127] [] ? __pollwait+0x0/0xa5
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? activate_task+0x1e/0x24
[17675.940127] [] ? push_rt_task+0x208/0x242
[17675.940127] [] ? post_schedule+0x31/0x3e
[17675.940127] [] ? schedule+0x78f/0x7dc
[17675.940127] [] ? futex_wait_setup+0x5c/0xcd
[17675.940127] [] ? futex_wait_queue_me+0x87/0x98
[17675.940127] [] ? sched_clock+0x5/0x7
[17675.940127] [] ? zone_watermark_ok+0x16/0x99
[17675.940127] [] ? cpupri_find+0x4c/0xd6
[17675.940127] [] ? get_page_from_freelist+0xc0/0x3c7
[17675.940127] [] ? check_preempt_curr_rt+0x76/0xe3
[17675.940127] [] ? smp_invalidate_interrupt+0x73/0x86
[17675.940127] [] ? __alloc_pages_nodemask+0xf3/0x4d9
[17675.940127] [] ? cpumask_any_but+0x20/0x2b
[17675.940127] [] ? flush_tlb_page+0x4a/0x65
[17675.940127] [] ? mutex_lock+0xb/0x24
[17675.940127] [] ? do_sync_read+0xc0/0x107
[17675.940127] [] ? do_send_sig_info+0x4f/0x59
[17675.940127] [] ? autoremove_wake_function+0x0/0x2d
[17675.940127] [] ? ktime_get_ts+0xcd/0xd5
[17675.940127] [] ? sys_poll+0x44/0x8d
[17675.940127] [] ? sysenter_do_call+0x12/0x28

The first iteration had another set of modules listed:

[267866.376128] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap crc16 bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs binfmt_misc fuse loop radeon ttm psmouse drm_kms_helper serio_raw evdev pcspkr drm i2c_algo_bit rng_core i2c_core dcdbas shpchp button pci_hotplug processor ext3 jbd mbcache sd_mod crc_t10dif sg sr_mod cdrom ata_generic uhci_hcd ata_piix mptspi mptscsih ehci_hcd mptbase usbcore nls_base libata tg3 scsi_transport_spi scsi_mod floppy libphy thermal thermal_sys [last unloaded: scsi_wait_scan]
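Since the console capture gets long, a small script can tally which CPUs and processes the soft lockup messages name. This is a rough sketch that assumes the capture has been saved as text; the regular expression is written against the "BUG: soft lockup" line format shown in the console output, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Count soft lockup reports per (cpu, process name) pair."""
    counts = Counter()
    for match in LOCKUP_RE.finditer(log_text):
        counts[(int(match["cpu"]), match["comm"])] += 1
    return counts
```

Feeding it the captured console text returns a counter keyed by CPU number and process name, which makes it easier to see whether the lockups always involve the same process.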

I installed intel-microcode and microcode.ctl, but haven't figured out how to disable hyperthreading, as some other forums have suggested.
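Before chasing that further, it helps to confirm whether hyperthreading is actually active. A minimal sketch, assuming an x86 Linux /proc/cpuinfo that exposes the `siblings` and `cpu cores` fields: if a package reports more siblings (logical CPUs) than cores, hyperthreading is on. The function name is hypothetical, and this only reports the state; it does not change it.

```python
def hyperthreading_enabled(cpuinfo_text):
    """Return True if a package reports more siblings than cores.

    Returns None when the fields are missing from the text.
    """
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        # The first processor entry with both fields is enough to decide.
        if siblings is not None and cores is not None:
            return siblings > cores
    return None


# Example use on the server itself:
#   with open("/proc/cpuinfo") as fh:
#       print(hyperthreading_enabled(fh.read()))
```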
