$ sudo tail /var/log/messages
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
As you can see, this is logging at a phenomenal rate. I'm not familiar with EDAC, though. From what I understand, this possibly indicates a faulty stick of RAM; does that seem likely?
I understand this is little to go on. What else can I do to shed some light on this? Bear in mind this is a live server, so I can't easily reboot it or take it down.
Answer
I wish my servers' ECC chips were supported by the EDAC code I'm running! Try dmidecode -t memory to see the ECC hardware you have.
In your logs you're getting a notification from an ECC chip. If your chip hadn't been supported (like mine!), you'd get silent ECC corrections. In your case, the ECC correction happened and you were also notified, because your hardware is supported.
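Since the EDAC driver is loaded on your box, the kernel should also keep running error counters in sysfs, which you can read on the live server without rebooting. Here is a minimal sketch, assuming the usual layout under /sys/devices/system/edac/mc; the function name edac_counts and its optional directory argument are my own, added only for illustration and testing.

```shell
#!/bin/sh
# Print corrected (CE) and uncorrected (UE) error counts per memory controller.
# The optional first argument overrides the sysfs base directory (hypothetical
# convenience, not part of any EDAC tool).
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip when the glob matched nothing, e.g. no EDAC driver is loaded.
        [ -r "$mc/ce_count" ] || continue
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "$(basename "$mc"): CE=$ce UE=$ue"
    done
}

edac_counts
```

A CE count that keeps climbing between runs, with UE staying at 0, matches what your log shows: errors are happening but are still being corrected.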
I'd change that memory stick first. On the other hand, you might have a faulty channel or a faulty processor core. I once diagnosed such a problem with memtest86 (memtest86.org); the original memtest86 has SMP support, so try it rather than memtest86+.
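Before pulling a stick, it helps to check whether every corrected error points at the same location. The sketch below tallies the EDAC lines by the location fields they report; it assumes the exact message format shown in the question, and the function name ce_by_location is made up for this example. How "Socket=0 channel=1 dimm=1" maps to a physical slot depends on your board, so check the motherboard manual.

```shell
#!/bin/sh
# Count EDAC "CE" log lines per reported location, most frequent first.
# Reads syslog text on stdin; lines that do not match are ignored.
ce_by_location() {
    sed -n 's/.*\(EDAC MC[0-9]*\): CE row \([0-9]*\), channel \([0-9]*\).*(\(.*\))$/\1 row=\2 channel=\3 \4/p' |
        sort | uniq -c | sort -rn
}

# Usage (reading the log usually needs root):
#   sudo cat /var/log/messages | ce_by_location
```

If all the counts land on a single line of output, one DIMM (or its channel) is the likely suspect; errors spread over several locations would point away from a single bad stick.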
Disable ECC in the BIOS, boot memtest86 from a floppy or USB stick, and see whether a run of consecutive addresses gets flagged. If that happens, it might be a memory channel problem.