Is this a memory failure being logged? (CentOS web server)

$ sudo tail /var/log/messages

Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)

As you can see, this is being logged at a phenomenal rate. I don't know much about EDAC, but from what I understand this may indicate a faulty stick of RAM. Does that seem likely?

I understand this is little to go on. What else can I do to shed some light on this? Bear in mind that this is a live server, so I can't easily reboot it or take it down.

Answer

I wish my servers' ECC chips were supported by the EDAC code I'm running! Try dmidecode -t memory to see the ECC hardware you have.
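As a rough sketch, the dmidecode output can be trimmed to the fields that matter for ECC. The helper name memory_summary is made up for illustration, and the field labels are the ones dmidecode normally prints for DMI memory records; your BIOS may leave some of them empty.

```shell
# memory_summary: hypothetical helper that keeps only the ECC-relevant
# fields from `dmidecode -t memory` output read on stdin.
memory_summary() {
    grep -E '^[[:space:]]*(Error Correction Type|Locator|Bank Locator|Size|Total Width|Data Width):'
}

# Typical use (needs root, and does not require a reboot):
#   sudo dmidecode -t memory | memory_summary
```

A Total Width larger than the Data Width (for example 72 bits against 64) usually means the module carries the extra ECC bits, and the Locator strings may help map "dimm=1" in the log to a physical slot.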

In your logs you're getting a notification from an ECC chip. If your chip hadn't been supported (like mine!), you'd get silent ECC corrections. In your case, the ECC correction happened and you were also notified, because your hardware is supported.
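To see whether all the notifications point at the same place, you could tally the corrected-error lines by controller, row and channel. This is only a sketch: edac_ce_summary is a name I made up, and the pattern assumes the exact message format shown in the question.

```shell
# edac_ce_summary: hypothetical helper that counts "CE" (corrected error)
# lines per memory controller / row / channel in the log file given as $1.
edac_ce_summary() {
    grep -o 'EDAC MC[0-9]*: CE row [0-9]*, channel [0-9]*' "$1" |
        sort | uniq -c | sort -rn
}

# Typical use on a CentOS box (the log is usually readable by root only):
#   sudo sh -c '. ./edac_ce_summary.sh; edac_ce_summary /var/log/messages'
```

If every line collapses into a single row/channel pair, as in the excerpt above, that suggests one module or slot rather than errors scattered across the whole memory.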

I'd start by replacing that memory stick. On the other hand, you might have a faulty channel or a faulty processor core. I once diagnosed such a problem with memtest86 (memtest86.org); the original memtest86 has SMP support, so try it, or memtest86+.
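Before pulling anything, the EDAC counters in sysfs can hint at whether the errors sit on one row/channel or are spread out, and reading them needs no downtime. A minimal sketch, assuming the older csrow-style layout under /sys/devices/system/edac/mc; newer kernels may expose dimm or rank directories instead, and edac_counters is a hypothetical helper name.

```shell
# edac_counters: hypothetical helper that prints the corrected (ce) and
# uncorrected (ue) error counters the EDAC driver exposes in sysfs.
# $1 optionally overrides the base directory (assumed csrow-style layout).
edac_counters() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/ce_count "$base"/mc*/ue_count \
             "$base"/mc*/csrow*/ch*_ce_count; do
        # Skip patterns that matched nothing or files we cannot read.
        [ -r "$f" ] && printf '%s: %s\n' "${f#"$base"/}" "$(cat "$f")"
    done
    return 0
}

# Typical use:
#   edac_counters
```

A ce_count that keeps climbing on a single csrow/channel while ue_count stays at 0 fits the "one bad stick, still being corrected" picture; counts spread over several rows on the same channel would make me look at the channel instead.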

Disable ECC in the BIOS, boot memtest86 from a floppy/USB stick, and see whether a bunch of consecutive addresses get flagged; if so, it might be a memory channel problem.
