$ sudo tail /var/log/messages
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
Jan 30 13:47:58 www kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (Socket=0 channel=1 dimm=1)
As you can see, this is being logged at a phenomenal rate. I don't know much about EDAC, though. From what I understand, this possibly indicates a faulty stick of RAM. Does that seem likely?
I understand this is little to go on. What else can I do to shed some light on this? Bear in mind that this is a live server, so I can't easily reboot it or take it down.
I wish my servers' ECC chips were supported by the EDAC code I'm running! Try
dmidecode -t memory
to see the ECC hardware you have.
In your logs you're getting a notification from an ECC chip. If your chip weren't supported (like mine!), you'd get silent ECC corrections. In your case, the ECC correction happened and you were also notified, because your hardware is supported.
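Since the messages arrive so quickly, it can help to tally them by location before deciding which stick to pull. This is only a rough sketch: the function name edac_ce_summary is made up for illustration, and it assumes every corrected-error line has the same layout as the ones in the question (EDAC MCn: CE row N, channel N, ...).

```shell
#!/bin/sh
# Hypothetical helper: count EDAC corrected-error (CE) lines per location.
# Reads syslog text on stdin and assumes the message layout from the question:
#   ... kernel: EDAC MC0: CE row 3, channel 0, label "": Corrected error (...)
edac_ce_summary() {
    # Keep only CE lines, reduced to "MCn row=N channel=N".
    sed -n 's/.*EDAC \(MC[0-9]*\): CE row \([0-9]*\), channel \([0-9]*\).*/\1 row=\2 channel=\3/p' |
        sort |
        uniq -c |
        # Normalise the padding that uniq -c adds before the count.
        awk '{ print $1, $2, $3, $4 }' |
        # Most frequent location first.
        sort -rn
}
```

With the function defined in your shell, something like sudo cat /var/log/messages | edac_ce_summary prints one line per controller/row/channel with its count. If everything lands on a single row and channel, as in your excerpt, that points at one DIMM (or its slot or channel) rather than at memory in general.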
I'd start by replacing that memory stick. On the other hand, you might have a faulty channel or a faulty processor core. I once diagnosed such a problem with memtest86.org (the original memtest86 has SMP support; try it rather than memtest86+).
Disable ECC in the BIOS, boot memtest86 from a floppy or USB stick, and see whether a bunch of addresses get flagged, all in a row. If that happens, it might be a memory channel problem.