linux - Deciphering continuing mpt2sas syslog messages

Summary

I have been getting these cryptic messages in syslog since I installed some new hardware and I can't figure out what the problem is, if it's serious, or what to do about it.

They're from the new SATA HBA and they follow a pattern: I get several copies of the first message, followed 5-30 seconds later by several copies of the second. Each burst is logged within the same second, and the number of copies varies between about 2 and 35. Minutes or hours can pass between bursts.

Example of the two messages:

Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)

It is always 0x31120303 followed by 0x31110d01.
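To check that the pattern really is "a burst of one code, then a burst of the other", the messages can be tallied per second. The following is a minimal Python sketch, assuming the syslog lines look exactly like the two samples above; the regular expression and the function name `count_bursts` are my own, not part of any tool.

```python
import re
from collections import Counter

# Written against the two sample lines only; other mpt2sas message
# shapes will simply not match and are skipped.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"(?P<dev>mpt2sas\d+): log_info\((?P<info>0x[0-9a-fA-F]{8})\)"
)

def count_bursts(lines):
    """Count log_info messages per (timestamp, device, log_info value)."""
    bursts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            bursts[(m.group("stamp"), m.group("dev"), m.group("info"))] += 1
    return bursts

if __name__ == "__main__":
    # The two lines quoted in the question.
    sample = [
        "Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: "
        "log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
        "Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: "
        "log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)",
    ]
    # Counter keeps insertion order, so bursts print in log order.
    for (stamp, dev, info), n in count_bursts(sample).items():
        print(f"{stamp} {dev} {info} x{n}")
```

Feeding it the lines of the real syslog file instead of `sample` gives one row per burst, which makes the spacing and size of the bursts easier to compare with array activity.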

mpt2sas is the driver for the SATA host bus adapter I'm using, but the messages are overly cryptic. They don't tell me what the problem is, which disk or port is involved, or how severe it is.

Hardware

Supermicro X9SCL with a Xeon E3-1220 and 8GB of RAM.

LSI SAS2008 based Supermicro AOC-USAS2-L8I SAS/SATA HBA connected to a Supermicro CSE-M35T-1B disk tray set. It has three Western Digital WD30EZRX and two Seagate ST3000DM001 drives plugged into it. All are 3TB drives (exact same number of sectors, actually). No port expanders in use.

The HBA, disk trays and 4 of the drives are new. One of the WD30EZRX drives has been in service for months and I've had no problems with it. It was previously connected to the integrated Intel SATA controller; I moved it into the drive bays with this new setup.

Initially I had problems with the HBA needing to reset frequently and getting really awful performance. I updated the firmware/BIOS to "Phase 12", the latest release available from Supermicro, and changed the type to IT (i.e. passthrough, instead of IR for integrated RAID, since I was going to use all software RAID): 2008IT12.FW. That update cleared up all the early issues, and I didn't start getting the above messages until later (see below).

The first four disks I added are all on the first SFF-8087 port (split to 4 SATA cables). The latest disk I added is on the other port, if that matters.

The only other disk on the system contains the OS, and is an older Intel 80GB SSD plugged into the integrated SATA controller.

Software

Ubuntu 11.10 (oneiric). Linux 3.0.0-14-server x86_64. Using the mpt2sas driver that comes with the OS.

Trying to build a RAID6 array using Linux md with those five disks. Started with a degraded array of 3 disks: the two Seagates and one of the new WD drives. This was fast and went very well, with no messages in the logs after I did the firmware update. Meanwhile, I am still using the old WD disk on port 0 of the same controller.

Added the other new WD disk to the array. Rebuild started and I am now getting those messages in syslog periodically. I'm not sure how long it's supposed to take to add a disk to the array but the estimated time (cat /proc/mdstat) ranges from thousands to tens of thousands of minutes, much longer than it took the first 3 disks. I do understand that the WD disks are much slower; I got different models to cut down on the chances of multiple disk failure, and those were the two cheapest 3TB models.
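For keeping an eye on how the estimate moves over time, the progress line in /proc/mdstat can be parsed rather than eyeballed. This is a rough Python sketch, assuming the usual `finish=...min speed=...K/sec` form of that line; the function name `parse_md_progress` is hypothetical and the sample text is made up for illustration, not taken from my machine.

```python
import re

# Assumed shape of an md progress line, e.g.
#   recovery =  5.2% (.../...) finish=4321.0min speed=10712K/sec
PROGRESS_RE = re.compile(
    r"(?P<action>recovery|resync|reshape|check)\s*=\s*(?P<pct>[\d.]+)%"
    r".*?finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)

def parse_md_progress(text):
    """Return one dict per progress line found in mdstat-style text."""
    results = []
    for m in PROGRESS_RE.finditer(text):
        results.append({
            "action": m.group("action"),
            "percent": float(m.group("pct")),
            "finish_min": float(m.group("finish")),
            "speed_kib_s": int(m.group("speed")),
        })
    return results

if __name__ == "__main__":
    # Illustrative sample only; the numbers are invented.
    sample = (
        "      [=>...................]  recovery =  5.2% "
        "(152345600/2930266584) finish=4321.0min speed=10712K/sec\n"
    )
    for entry in parse_md_progress(sample):
        hours = entry["finish_min"] / 60
        print(f'{entry["action"]}: {entry["percent"]}% done, '
              f'about {hours:.1f} h left at {entry["speed_kib_s"]} K/sec')
```

On the real system the input would be the contents of /proc/mdstat, read periodically, so that drops in speed can be lined up against the timestamps of the mpt2sas messages.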

Notes

SMART does not report any problems on any disks. There are no logged errors on any disks and none of the failure stats are anywhere near threshold.

The logged messages only started appearing after I added the last disk, which suggests that one may be having a problem but I have nothing else pointing to that.

I did find a header file that seems to correspond to the logging messages from this driver. The first message seems to be an abort (code 12) for a "sub code" 0303 that isn't listed. The second message is a reset (code 11) for a reason that also isn't clear. If I could determine what 0303 and 0d01 mean, that would be really helpful.
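The way the driver splits the 32-bit value can be read off the log lines themselves: 0x31120303 is printed as originator(PL), code(0x12), sub_code(0x0303). A small Python sketch of that split follows; the field layout is inferred from those printed values, and the IOP and IR names in the originator table are an assumption about the driver's other cases (only PL is confirmed by my logs).

```python
# Assumed layout, high bits to low: [bus_type:4][originator:4][code:8][sub_code:16]
# PL = 1 is confirmed by the log output; IOP and IR are assumed names.
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

def decode_log_info(value):
    """Split an mpt2sas log_info word into its printed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }

if __name__ == "__main__":
    for raw in (0x31120303, 0x31110D01):
        f = decode_log_info(raw)
        print(f'0x{raw:08x}: originator({f["originator"]}), '
              f'code(0x{f["code"]:02x}), sub_code(0x{f["sub_code"]:04x})')
```

This only reproduces what the driver already prints; it does not say what sub codes 0x0303 and 0x0d01 mean, which is the part I still can't find.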

I know that 4 disks in a 5 disk RAID6 is an incomplete array. I'm planning to copy the contents of the old disk to the array once it finishes integrating the 4th disk and then add the old disk to the array as well.

Answer

The most likely cause is a hardware problem somewhere in the chain from your disks up to and including your SAS RAID controller. I recommend trying the following:

Run any diagnostic tools from the vendor/s if they are available

Check/re-seat/replace cables

Strip out hardware components and swap out hardware in the chain that connects the disks to your RAID controller, including the controller itself (i.e., for you, try something other than the motherboard integrated RAID).

I had one out of two identical Dell PowerEdge R515 servers giving very similar messages (logs periodically filling up with mpt2sas0 messages, though I do not have the exact numeric codes). Dell's own bootable diagnostic picked these up as "hardware errors", and replacing the RAID SAS backplane solved the issue.

When I was investigating, I could not find a comprehensive resource of what various mpt2sas0 error codes mean. I suspect they may even be hardware-vendor-specific (someone who knows more about SAS needs to confirm or deny this). So your error codes could mean something widely different, but if SMART is clean it is hard to imagine other good reasons for mpt2sas0 to report error codes.

These errors can be very serious. My R515 worked seemingly OK with these messages for a week with a 12 disk Ubuntu Linux software raid 6, but then suddenly ejected all 12 disks out of the array as broken (!)

In my case, too, the SMART data for all disks was completely clean. A good check is a SMART extended self-test: start it with smartctl -t long /dev/sdX, and then check the results about a day later with smartctl -l selftest /dev/sdX. If all is OK, the test status should read "Completed without error" and the LBA_of_first_error column should show only a dash.
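With five disks it is easier to script the check than to read each log by hand. Below is a minimal Python sketch that runs smartctl -l selftest and reports whether every logged test finished cleanly. It assumes the usual ATA self-test log row layout (number, description, status, remaining, lifetime hours, LBA); the helper names and the regular expression are my own, and the status string it looks for may differ between smartctl versions or drive types.

```python
import re
import subprocess

# Assumed row shape, e.g.
# "# 1  Extended offline    Completed without error       00%      1234         -"
ROW_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)

def parse_selftest_log(output):
    """Extract (description, status, lba) tuples from selftest log text."""
    rows = []
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((m.group("desc"), m.group("status"), m.group("lba")))
    return rows

def all_tests_clean(output):
    """True only if at least one test is logged and none reports an error."""
    rows = parse_selftest_log(output)
    return bool(rows) and all(
        status.startswith("Completed without error") and lba == "-"
        for _desc, status, lba in rows
    )

def read_selftest_log(device):
    """Run smartctl for one device (needs root and smartmontools installed)."""
    # smartctl uses its exit status as a bit mask, so no check=True here.
    result = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        capture_output=True, text=True,
    )
    return result.stdout
```

A typical use would be `all_tests_clean(read_selftest_log("/dev/sdX"))` for each disk once the long tests have had time to finish.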
