
linux - Deciphering continuing mpt2sas syslog messages



Summary



I have been getting these cryptic messages in syslog since I installed some new hardware and I can't figure out what the problem is, if it's serious, or what to do about it.



They're from the new SATA HBA and they follow a pattern. I get several of the first message, followed 5-30 seconds later by several of the second message. They come in bursts that are all logged in the same second, and the number of each varies between about 2 and 35. It can be minutes or hours between bursts.




Example of the two messages:



Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)


It is always 0x31120303 followed by 0x31110d01.
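To check that pairing across a larger log, the messages can be counted per second and per code. The following is a minimal sketch, assuming syslog lines in exactly the format shown above; the function name and the regular expression are illustrative, not part of any driver tooling.

```python
import re
from collections import Counter

# Captures the whole-second part of the kernel timestamp and the raw
# log_info word from lines shaped like the two examples above.
LOG_RE = re.compile(
    r"\[\s*(\d+)\.\d+\]\s+mpt2sas\d+: log_info\((0x[0-9a-fA-F]{8})\)"
)

def count_bursts(lines):
    """Count messages per (kernel uptime second, log_info code)."""
    bursts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            second = int(match.group(1))
            code = match.group(2).lower()
            bursts[(second, code)] += 1
    return bursts

# The two example lines quoted above.
SAMPLE = [
    "Jul 13 06:06:23 durandal kernel: [366918.435596] mpt2sas0: "
    "log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)",
    "Jul 13 06:06:28 durandal kernel: [366923.145524] mpt2sas0: "
    "log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)",
]

bursts = count_bursts(SAMPLE)
for (second, code), count in sorted(bursts.items()):
    print(second, code, count)
```

Feeding it the full syslog instead of SAMPLE would show how many of each code land in each second and how far apart the two bursts are.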



mpt2sas is the driver for the SATA host bus adapter I'm using, but the error content is overly cryptic. It doesn't tell me what the problem is, which disk or port is affected, or how severe it is.




Hardware



Supermicro X9SCL with a Xeon E3-1220 and 8GB of RAM.



LSI SAS2008-based Supermicro AOC-USAS2-L8I SAS/SATA HBA connected to a Supermicro CSE-M35T-1B disk tray set. It has three Western Digital WD30EZRX and two Seagate ST3000DM001 drives plugged into it. All are 3TB drives (exact same number of sectors, actually). No port expanders in use.



The HBA, disk trays and 4 of the drives are new. One of the WD30EZRXes has been in for months, had no problems with it. Had it connected to the integrated Intel SATA controller previously, moved it into the drive bays with this new setup.



Had problems with the HBA needing to reset frequently and getting really awful performance. Updated the firmware/BIOS to "Phase 12", the latest release available from Supermicro (2008IT12.FW), and changed the type from IR (integrated RAID) to IT (i.e. passthrough), since I was going to use software RAID only. That update cleared up all the early issues and I didn't start getting the above messages until later (see below).




The first four disks I added are all on the first SFF-8087 port (split to 4 SATA cables). The latest disk I added is on the other port, if that matters.



The only other disk on the system contains the OS, and is an older Intel 80GB SSD plugged into the integrated SATA controller.



Software



Ubuntu 11.10 (oneiric). Linux 3.0.0-14-server x86_64. Using the mpt2sas driver that comes with the OS.



Trying to build a RAID6 array using Linux md with those five disks. Started with a degraded array of 3 disks: the two Seagates and one of the new WD drives. This was fast and went very well, with no messages in the logs after I did the firmware update. Meanwhile, I am still using the old WD disk on port 0 of the same controller.




Added the other new WD disk to the array. Rebuild started and I am now getting those messages in syslog periodically. I'm not sure how long it's supposed to take to add a disk to the array but the estimated time (cat /proc/mdstat) ranges from thousands to tens of thousands of minutes, much longer than it took the first 3 disks. I do understand that the WD disks are much slower; I got different models to cut down on the chances of multiple disk failure, and those were the two cheapest 3TB models.
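To put the estimate into perspective, the progress line from /proc/mdstat can be converted into days. This is a minimal sketch, assuming the usual "finish=...min speed=...K/sec" layout of that line; the sample line and its numbers are made up for illustration and are not taken from this system.

```python
import re

# Matches the progress line md prints while it rebuilds or reshapes an array.
PROGRESS_RE = re.compile(
    r"(recovery|resync|reshape)\s*=\s*([\d.]+)%"
    r".*?finish=([\d.]+)min\s+speed=(\d+)K/sec"
)

def parse_progress(text):
    """Return the first progress entry found in mdstat text, or None."""
    match = PROGRESS_RE.search(text)
    if match is None:
        return None
    return {
        "operation": match.group(1),
        "percent": float(match.group(2)),
        "finish_minutes": float(match.group(3)),
        "speed_k_per_sec": int(match.group(4)),
    }

# Hypothetical sample line; on a live system read /proc/mdstat instead.
SAMPLE = (
    "      [>....................]  recovery =  1.2% "
    "(35163264/2930265088) finish=14500.3min speed=3327K/sec"
)

progress = parse_progress(SAMPLE)
days = progress["finish_minutes"] / (24 * 60)
print("%s: %.1f%% done, about %.1f days left at %d K/sec" % (
    progress["operation"], progress["percent"], days,
    progress["speed_k_per_sec"]))
```

On a live system the text would come from open("/proc/mdstat").read(). Logging the speed figure over time makes it easier to see whether it drops when the mpt2sas messages appear.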



Notes



SMART does not report any problems on any disks. There are no logged errors on any disks and none of the failure stats are anywhere near threshold.



The logged messages only started appearing after I added the last disk, which suggests that one may be having a problem but I have nothing else pointing to that.



I did find a header file that seems to correspond to the logging messages from this driver. The first message seems to be an abort (code 0x12) with a "sub code" 0x0303 that isn't listed. The second message is a reset (code 0x11) for a reason that also isn't clear. If I could determine what 0x0303 and 0x0d01 mean, that would be really helpful.
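For reference, the fields the driver prints can be recovered from the raw 32-bit word by masking and shifting. The sketch below assumes the layout implied by the two logged lines (top nibble bus type, next nibble originator, then an 8-bit code and a 16-bit sub code); the originator name table is an assumption as well, and the meaning of the sub codes remains unknown here.

```python
# Originator names: "PL" matches the logged lines; the others are assumed.
ORIGINATOR_NAMES = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

def decode_log_info(word):
    """Split a 32-bit log_info word into its assumed fields."""
    return {
        "bus_type": (word >> 28) & 0xF,    # assumed: 0x3 marks SAS
        "originator": (word >> 24) & 0xF,
        "code": (word >> 16) & 0xFF,
        "sub_code": word & 0xFFFF,
    }

def describe(word):
    """Format the decoded fields similarly to the driver's own message."""
    fields = decode_log_info(word)
    name = ORIGINATOR_NAMES.get(fields["originator"], "unknown")
    return "bus_type(0x%x) originator(%s) code(0x%02x) sub_code(0x%04x)" % (
        fields["bus_type"], name, fields["code"], fields["sub_code"])

for word in (0x31120303, 0x31110d01):
    print(hex(word), describe(word))
```

Both words decode to the same originator, code and sub code values that appear in the syslog lines, so the raw word carries no extra detail beyond what is already printed, apart from the bus type nibble.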




I know that 4 disks in a 5 disk RAID6 is an incomplete array. I'm planning to copy the contents of the old disk to the array once it finishes integrating the 4th disk and then add the old disk to the array as well.


Answer



The most likely cause is a hardware problem somewhere between your disks and your SAS RAID controller, up to and including the controller itself. I recommend trying:




  1. Run any diagnostic tools from the vendor(s), if they are available.

  2. Check, re-seat or replace the cables.

  3. Strip out and swap hardware components in the chain that connects the disks to your RAID controller, including the controller itself (i.e., for you, try something other than the motherboard's integrated RAID).




I had one of two identical Dell PowerEdge R515 servers giving very similar messages (logs periodically filling up with mpt2sas0 messages, though I do not have the exact numeric codes). Dell's own bootable diagnostic picked these up as "hardware errors", and replacing the RAID SAS backplane solved the issue.



When I was investigating, I could not find a comprehensive resource on what the various mpt2sas0 error codes mean. I suspect they may even be hardware-vendor-specific (someone who knows more about SAS needs to confirm or deny this). So your error codes could mean something quite different, but if SMART is clean it is hard to imagine other good reasons for mpt2sas0 to report error codes.



These errors can be very serious. My R515 worked seemingly OK with these messages for a week with a 12 disk Ubuntu Linux software raid 6, but then suddenly ejected all 12 disks out of the array as broken (!)



Also, in my case SMART was completely clean on all disks. A good check is a long SMART self-test: run smartctl -t long /dev/sdX, then check the results about a day later with smartctl -l selftest /dev/sdX. If all is OK the test should say Completed and the LBA_first_err column should be empty.
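The check on the self-test log can be scripted once the output has been saved. This is a rough sketch, assuming entries in the usual ATA layout printed by smartctl -l selftest and assuming an empty error column is shown as "-"; the sample entries are invented for illustration, and column names and layout may differ by drive type and smartctl version.

```python
import re

# One self-test log entry: number, description, status, remaining %,
# lifetime hours, LBA of first error.
ENTRY_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s{2,}(\d+)%\s+(\d+)\s+(\S+)\s*$"
)

def parse_selftest_log(text):
    """Return one dict per self-test entry found in the given text."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            status, first_error = match.group(3), match.group(6)
            entries.append({
                "test": match.group(2),
                "status": status,
                "first_error_lba": first_error,
                # Clean means: finished without error and no error LBA.
                "clean": status == "Completed without error"
                         and first_error == "-",
            })
    return entries

# Invented sample entries in the assumed layout.
SAMPLE = """\
# 1  Extended offline    Completed without error       00%     12345         -
# 2  Extended offline    Completed: read failure       90%     12000         123456789
"""

entries = parse_selftest_log(SAMPLE)
for entry in entries:
    print(entry["test"], "|", entry["status"], "|",
          "OK" if entry["clean"] else "CHECK THIS DISK")
```

The text would normally come from saving the output of smartctl -l selftest /dev/sdX to a file for each disk.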

