I'm having a problem with a Debian server that I thought was due to bad RAM, but it is persisting.
It's a Dell PowerEdge 6800 with two dual-core 3.6GHz Xeon processors and 5GB of DDR2 ECC 333.
I've got a single 73GB SCSI drive.
I'm working it to death right now, pulling records from MySQL to build Asterisk .call files (small text files), which trigger SIP calls.
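For readers unfamiliar with call files, here is a minimal sketch of what one might look like and how it could be dropped into the spool directory. It is not the script running on this server. The channel string, context name, and directory arguments are hypothetical placeholders; only the key names (Channel, MaxRetries, RetryTime, WaitTime, Context, Extension, Priority) are standard Asterisk call file fields.

```python
import os
import tempfile


def build_call_file(channel, context, extension, priority=1,
                    max_retries=0, retry_time=60, wait_time=30):
    """Return the text of an Asterisk call file for one outbound call."""
    lines = [
        f"Channel: {channel}",
        f"MaxRetries: {max_retries}",
        f"RetryTime: {retry_time}",
        f"WaitTime: {wait_time}",
        f"Context: {context}",
        f"Extension: {extension}",
        f"Priority: {priority}",
    ]
    return "\n".join(lines) + "\n"


def drop_call_file(text, staging_dir, spool_dir, name):
    """Write the file in staging_dir, then rename it into spool_dir.

    Both directories are assumptions for this sketch. The rename is
    atomic only when they sit on the same filesystem, which keeps
    Asterisk from picking up a half-written file.
    """
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".call")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    final_path = os.path.join(spool_dir, name)
    os.rename(tmp_path, final_path)
    return final_path
```

On a typical install the spool directory would be /var/spool/asterisk/outgoing, with the staging directory somewhere on the same filesystem. Writing first and renaming afterwards matters at several hundred files per minute, because the spool directory is being scanned while files are created.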
We manage it via a CGI interface, and the system is also running Citadel for our mail, but we have fewer than five users. It's not a huge drain.
My peak usage seems to be about 460 calls per minute. Load hovers between 2.0 and 4.3; if I push it past that, it spikes to >22.0.
The problem I'm having is that, about an hour into a dial, it's freezing up on me. Last night I started it at 5:59, and at 6:55:17 the system became unresponsive. Nothing was logged, and I couldn't connect via ssh or http. It still responded to ping, and nmap showed open ports that I was able to telnet to, but I couldn't elicit any response from them.
My sar data collection ran at 6:50, and at that time, I was seeing heavy usage, as expected, but nothing outrageous, as far as I can tell.
The system had been complaining of a memory error in one of the new 2GB strips I'd installed, so after the first crash, I replaced that pair with the 512MB strips we upgraded from.
I'm currently dialing with a live sar data collection running, in case it crashes again. At least I'll be able to dial in with a little more granularity.
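As a rough illustration of finer-grained collection alongside sar, the sketch below appends the load average to a file and syncs each sample to disk, so the last reading before a freeze has a better chance of surviving the power cycle. The function names, log path, and interval are assumptions made for the example, and it is no replacement for sar itself.

```python
import os
import time


def format_sample(timestamp, loads):
    """Format one log line: local timestamp plus 1/5/15-minute load."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    return "%s %.2f %.2f %.2f\n" % (stamp, loads[0], loads[1], loads[2])


def record_samples(path, count, interval):
    """Append `count` load samples to `path`, syncing each one to disk."""
    with open(path, "a") as log:
        for i in range(count):
            log.write(format_sample(time.time(), os.getloadavg()))
            log.flush()
            # fsync asks the kernel to push the data to the disk now,
            # rather than leaving it in the page cache.
            os.fsync(log.fileno())
            if i < count - 1:
                time.sleep(interval)
```

A call such as `record_samples("/var/tmp/load.log", 3600, 1)` would take one sample per second for an hour. If the freeze also stalls disk I/O, the final samples may still be lost.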
Other than that, I'm lost as to how to diagnose the system freeze in the absence of any relevant log data or a crash dump, since the system is still running but completely unresponsive during this time, until I perform a power cycle. Any ideas?
NOTE: I have new servers on order to take some of the load off this system by distributing services, but in the meantime our production relies on this workhorse.
Here's the sar data from last night's crash.
UPDATE: This sar snapshot was running in 10-second increments; the last sample was gathered 1 second prior to the freeze-up.
I've purchased a terminal console server, and can now see what's going on when the system freezes up.
This set of messages just repeats every 30 seconds or so, cycling through CPU1 and CPU2:
[17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
[17675.940127] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext]
[17675.940127]
[17675.940127] Pid: 4579, comm: asterisk Not tainted (2.6.32-5-686-bigmem #1) PowerEdge 6800
[17675.940127] EIP: 0060:[] EFLAGS: 00000202 CPU: 1
[17675.940127] EIP is at native_flush_tlb_others+0x85/0xa6
[17675.940127] EAX: 00000282 EBX: c14620ac ECX: c102fb3a EDX: 00000020
[17675.940127] ESI: 00000001 EDI: 00000040 EBP: c14620a0 ESP: f35d1a3c
[17675.940127] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[17675.940127] CR0: 80050033 CR2: b3f06946 CR3: 36787000 CR4: 000006f0
[17675.940127] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[17675.940127] DR6: ffff0ff0 DR7: 00000400
[17675.940127] Call Trace:
[17675.940127] [] ? flush_tlb_page+0x5d/0x65
[17675.940127] [] ? ptep_set_access_flags+0x59/0x63
[17675.940127] [] ? do_wp_page+0x3b9/0x7dd
[17675.940127] [] ? kmap_atomic_prot+0xd7/0xfc
[17675.940127] [] ? handle_mm_fault+0x982/0xa22
[17675.940127] [] ? lock_hrtimer_base+0x15/0x2f
[17675.940127] [] ? hrtimer_try_to_cancel+0x2f/0x35
[17675.940127] [] ? do_page_fault+0x2f1/0x307
[17675.940127] [] ? do_page_fault+0x0/0x307
[17675.940127] [] ? error_code+0x73/0x78
[17675.940127] [] ? copy_strings+0x94/0x1ba
[17675.940127] [] ? do_sys_poll+0x2c3/0x312
[17675.940127] [] ? __pollwait+0x0/0xa5
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? pollwake+0x0/0x65
[17675.940127] [] ? activate_task+0x1e/0x24
[17675.940127] [] ? push_rt_task+0x208/0x242
[17675.940127] [] ? post_schedule+0x31/0x3e
[17675.940127] [] ? schedule+0x78f/0x7dc
[17675.940127] [] ? futex_wait_setup+0x5c/0xcd
[17675.940127] [] ? futex_wait_queue_me+0x87/0x98
[17675.940127] [] ? sched_clock+0x5/0x7
[17675.940127] [] ? zone_watermark_ok+0x16/0x99
[17675.940127] [] ? cpupri_find+0x4c/0xd6
[17675.940127] [] ? get_page_from_freelist+0xc0/0x3c7
[17675.940127] [] ? check_preempt_curr_rt+0x76/0xe3
[17675.940127] [] ? smp_invalidate_interrupt+0x73/0x86
[17675.940127] [] ? __alloc_pages_nodemask+0xf3/0x4d9
[17675.940127] [] ? cpumask_any_but+0x20/0x2b
[17675.940127] [] ? flush_tlb_page+0x4a/0x65
[17675.940127] [] ? mutex_lock+0xb/0x24
[17675.940127] [] ? do_sync_read+0xc0/0x107
[17675.940127] [] ? do_send_sig_info+0x4f/0x59
[17675.940127] [] ? autoremove_wake_function+0x0/0x2d
[17675.940127] [] ? ktime_get_ts+0xcd/0xd5
[17675.940127] [] ? sys_poll+0x44/0x8d
[17675.940127] [] ? sysenter_do_call+0x12/0x28
The first iteration had another set of modules listed.
[267866.376128] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_conservative cpufreq_userspace parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap crc16 bluetooth rfkill nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs binfmt_misc fuse loop radeon ttm psmouse drm_kms_helper serio_raw evdev pcspkr drm i2c_algo_bit rng_core i2c_core dcdbas shpchp button pci_hotplug processor ext3 jbd mbcache sd_mod crc_t10dif sg sr_mod cdrom ata_generic uhci_hcd ata_piix mptspi mptscsih ehci_hcd mptbase usbcore nls_base libata tg3 scsi_transport_spi scsi_mod floppy libphy thermal thermal_sys [last unloaded: scsi_wait_scan]
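Since the same trace repeats for hours, it can help to tally the captured console output by CPU and process instead of reading it line by line. The sketch below assumes the console server's capture has been saved as plain text and that the header lines look like the one above; the function name is made up for the example.

```python
import re
from collections import Counter

# Matches header lines such as:
# [17675.940127] BUG: soft lockup - CPU#1 stuck for 61s! [asterisk:4579]
LOCKUP = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(lines):
    """Count soft lockup reports per (cpu, process name) pair."""
    counts = Counter()
    for line in lines:
        match = LOCKUP.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("comm"))] += 1
    return counts
```

Feeding it the lines of the saved capture returns a counter keyed by CPU number and command name, which shows at a glance whether it is always asterisk that is stuck and whether the reports really alternate between CPUs.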
I installed the intel-microcode and microcode.ctl packages, but I haven't figured out how to disable hyperthreading, as some other forums have suggested.
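Before trying to disable hyperthreading, it is worth confirming whether it is active at all. One common heuristic on x86 Linux is to compare the "siblings" and "cpu cores" fields in /proc/cpuinfo: when siblings is larger, the package is exposing more logical CPUs than physical cores. The sketch below only reports that; it changes nothing, and the function name is hypothetical. The switch itself is usually a BIOS setting.

```python
def hyperthreading_by_package(cpuinfo_text):
    """Map each physical id to True when siblings exceeds cpu cores."""
    result = {}
    # /proc/cpuinfo separates logical CPUs with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "siblings" in fields and "cpu cores" in fields:
            package = fields.get("physical id", "0")
            result[package] = (
                int(fields["siblings"]) > int(fields["cpu cores"])
            )
    return result
```

It can be run against the live system with `hyperthreading_by_package(open("/proc/cpuinfo").read())`. Kernels or virtual machines that omit those two fields simply produce an empty result.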