I'm having a problem with a Debian server that I thought was due to bad RAM, but the problem is persisting.
It's a Dell PowerEdge 6800 with two dual-core 3.6GHz Xeon processors and 5GB of DDR2 ECC 333. I've got a single 73GB SCSI drive.
I'm working it to death right now, pulling records from MySQL to build Asterisk .call files (small text files) which trigger SIP calls.
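For readers unfamiliar with Asterisk call files, here is a minimal Python sketch of generating one. It is illustrative only and is not the script used on this server: the function name `write_call_file`, the `SIP/trunk` channel prefix, the `outbound` context, the `s` extension, and both directory arguments are assumptions. The one technique it shows is writing the file in a staging directory and then moving it into the spool, so Asterisk never picks up a half-written file.

```python
import os
import tempfile


def write_call_file(number, spool_dir, staging_dir,
                    channel_prefix="SIP/trunk", context="outbound",
                    extension="s"):
    """Write one call file, then move it into the spool directory.

    All names and defaults here are hypothetical placeholders.
    """
    lines = [
        f"Channel: {channel_prefix}/{number}",
        "MaxRetries: 0",
        "WaitTime: 30",
        f"Context: {context}",
        f"Extension: {extension}",
        "Priority: 1",
    ]
    # Write the complete file outside the spool first.
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    final_path = os.path.join(spool_dir, f"{number}.call")
    # os.replace is atomic only if both directories share a filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

On a real system the spool directory is typically `/var/spool/asterisk/outgoing`, and the staging directory would need to live on the same filesystem for the move to be atomic.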
We manage it via a CGI interface, and the system is also running Citadel for our mail, but we have fewer than five users. It's not a huge drain.
My peak usage seems to be about 460 calls per minute. Load hovers between 2.0 and 4.3; if I push it past that, it spikes to >22.0.
The problem I'm having is that, about an hour into a dial, it's freezing up on me. Last night I started it at 5:59, and at 6:55:17 the system became non-responsive. Nothing was logged, and I couldn't connect via ssh or http. It responded to ping, and nmap showed open ports which I was able to telnet to, but I couldn't elicit any response from them.
My sar data collection ran at 6:50, and at that time I was seeing heavy usage, as expected, but nothing outrageous, as far as I can tell.
The system had been complaining of a memory error in one of the new 2GB strips I'd installed, so after the first crash, I replaced that pair with the 512MB strips we upgraded from.
I'm currently dialing with a live sar data collection running, in case it crashes again. At least I'll be able to dial in with a little more granularity.
Other than that, I'm lost as to how to diagnose the system freeze in the absence of any relevant log data or a crash dump, since the system is still running but completely nonresponsive during this time, until I perform a power-cycle. Any ideas?
NOTE: I have new servers on order to take some of the load off of this system by distributing services, but in the meantime, it's a mean time where our production is relying on this workhorse.
Here's the sar data from last night's crash: http://bluedot.martythebanker.com/sar.txt
UPDATE: This sar snapshot was running in 10-second increments, last gathered 1 second prior to the freeze-up: http://bluedot.martythebanker.com/livesar.txt
I've purchased a terminal console server, and can now see what's going on when the system freezes up.
This set of messages just repeats every 30 seconds or so, cycling through CPU1 and CPU2:
[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
[17675.940127] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext]
[17675.940127]
[17675.940127] Pid: 4579, comm: asterisk Not tainted (2.6.32-5-686-bigmem #1) PowerEdge 6800
[17675.940127] EIP: 0060:[] EFLAGS: 00000202 CPU: 1
[17675.940127] EIP is at native_flush_tlb_others+0x85/0xa6
[17675.940127] EAX: 00000282 EBX: c14620ac ECX: c102fb3a EDX: 00000020
[17675.940127] ESI: 00000001 EDI: 00000040 EBP: c14620a0 ESP: f35d1a3c
[17675.940127] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[17675.940127] CR0: 80050033 CR2: b3f06946 CR3: 36787000 CR4: 000006f0
[17675.940127] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[17675.940127] DR6: ffff0ff0 DR7: 00000400
[17675.940127] Call Trace:
[17675.940127] [] ? flush_tlb_page+0x5d/0x65
[17675.940127] [] ? ptep_set_access_flags+0x59/0x63
[17675.940127] [] ? do_wp_page+0x3b9/0x7dd
[17675.940127] [] ? kmap_atomic_prot+0xd7/0xfc
[17675.940127] [] ? handle_mm_fault+0x982/0xa22
[17675.940127] [] ? lock_hrtimer_base+0x15/0x2f
[17675.940127] [] ? hrtimer_try_to_cancel+0x2f/0x35
[17675.940127] [] ? do_page_fault+0x2f1/0x307
[17675.940127] [] ? do_page_fault+0x0/0x307
[17675.940127] [] ? error_code+0x73/0x78
[17675.940127] [] ? copy_strings+0x94/0x1ba
[17675.940127] [] ? do_sys_poll+0x2c3/0x312
[17675.940127] [] ? __pollwait+0x0/0xa5
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? activate_task+0x1e/0x24
[17675.940127] [] ? push_rt_task+0x208/0x242
[17675.940127] [] ? post_schedule+0x31/0x3e
[17675.940127] [] ? schedule+0x78f/0x7dc
[17675.940127] [] ? futex_wait_setup+0x5c/0xcd
[17675.940127] [] ? futex_wait_queue_me+0x87/0x98
[17675.940127] [] ? sched_clock+0x5/0x7
[17675.940127] [] ? zone_watermark_ok+0x16/0x99
[17675.940127] [] ? cpupri_find+0x4c/0xd6
[17675.940127] [] ? get_page_from_freelist+0xc0/0x3c7
[17675.940127] [] ? check_preempt_curr_rt+0x76/0xe3
[17675.940127] [] ? smp_invalidate_interrupt+0x73/0x86
[17675.940127] [] ? __alloc_pages_nodemask+0xf3/0x4d9
[17675.940127] [] ? cpumask_any_but+0x20/0x2b
[17675.940127] [] ? flush_tlb_page+0x4a/0x65
[17675.940127] [] ? mutex_lock+0xb/0x24
[17675.940127] [] ? do_sync_read+0xc0/0x107
[17675.940127] [] ? do_send_sig_info+0x4f/0x59
[17675.940127] [] ? autoremove_wake_function+0x0/0x2d
[17675.940127] [] ? ktime_get_ts+0xcd/0xd5
[17675.940127] [] ? sys_poll+0x44/0x8d
[17675.940127] [] ? sysenter_do_call+0x12/0x28
The first iteration had another set of modules listed:
[267866.376128] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap crc16 bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs binfmt_misc fuse loop radeon ttm psmouse drm_kms_helper serio_raw evdev pcspkr drm i2c_algo_bit rng_core i2c_core dcdbas shpchp button pci_hotplug processor ext3 jbd mbcache sd_mod crc_t10dif sg sr_mod cdrom ata_generic uhci_hcd ata_piix mptspi mptscsih ehci_hcd mptbase usbcore nls_base libata tg3 scsi_transport_spi scsi_mod floppy libphy thermal thermal_sys [last unloaded: scsi_wait_scan]
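Since the console server captures these messages as text, a small script can tally which CPUs and processes show up in the soft lockup reports over a whole capture. The sketch below is only an illustration under assumptions: the names `LOCKUP_RE` and `summarize_lockups` are made up here, and the pattern is based solely on the single "BUG: soft lockup" line format shown above.

```python
import re
from collections import Counter

# Matches lines like:
# BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Count soft lockup reports per (cpu, process name) pair."""
    counts = Counter()
    for match in LOCKUP_RE.finditer(log_text):
        counts[(int(match.group("cpu")), match.group("comm"))] += 1
    return counts


if __name__ == "__main__":
    sample = "[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]\n"
    for (cpu, comm), count in sorted(summarize_lockups(sample).items()):
        print(f"CPU#{cpu} {comm}: {count}")
```

Run against a full console capture, this would show whether the lockups really alternate between two CPUs and whether asterisk is always the process named.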
I installed intel-microcode and microcode.ctl, but I haven't figured out how to disable hyperthreading, as some other forums have suggested.
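Before hunting for the switch, it may help to confirm whether hyperthreading is actually active. The Python sketch below reads the `thread_siblings_list` topology files that Linux exposes under sysfs; it assumes those files are present on this kernel, and the function names `find_sibling_groups` and `hyperthreading_active` are invented for this example. It only detects sibling threads and does not disable anything.

```python
import glob
import os


def find_sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sibling lists reported by each logical CPU."""
    groups = set()
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "topology", "thread_siblings_list"
    )
    for path in glob.glob(pattern):
        with open(path) as handle:
            groups.add(handle.read().strip())
    return sorted(groups)


def hyperthreading_active(groups):
    """True if any core lists more than one logical CPU (e.g. "0,4" or "0-1")."""
    return any("," in group or "-" in group for group in groups)


if __name__ == "__main__":
    found = find_sibling_groups()
    print("sibling groups:", found)
    print("hyperthreading active:", hyperthreading_active(found))
```

If sibling pairs do show up, the usual places to turn them off are the BIOS setup or, on kernels built with CPU hotplug support, writing `0` to `/sys/devices/system/cpu/cpuN/online` for one CPU of each pair; neither has been verified on this machine.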