Skip to main content

storage - Linux - real-world hardware RAID controller tuning (scsi and cciss)



Most of the Linux systems I manage feature hardware RAID controllers (mostly HP Smart Array). They're all running RHEL or CentOS.



I'm looking for real-world tunables to help optimize performance for setups that incorporate hardware RAID controllers with SAS disks (Smart Array, Perc, LSI, etc.) and battery-backed or flash-backed cache. Assume RAID 1+0 and multiple spindles (4+ disks).




I spend a considerable amount of time tuning Linux network settings for low-latency and financial trading applications. But many of those options are well-documented (changing send/receive buffers, modifying TCP window settings, etc.). What are engineers doing on the storage side?



Historically, I've made changes to the I/O scheduling elevator, recently opting for the deadline and noop schedulers to improve performance within my applications. As RHEL versions have progressed, I've also noticed that the compiled-in defaults for SCSI and CCISS block devices have changed as well. This has had an impact on the recommended storage subsystem settings over time. However, it's been awhile since I've seen any clear recommendations. And I know that the OS defaults aren't optimal. For example, it seems that the default read-ahead buffer of 128kb is extremely small for a deployment on server-class hardware.



The following articles explore the performance impact of changing read-ahead cache and nr_requests values on the block queues.


http://zackreed.me/articles/54-hp-smart-array-p410-controller-tuning

http://www.overclock.net/t/515068/tuning-a-hp-smart-array-p400-with-linux-why-tuning-really-matters

http://yoshinorimatsunobu.blogspot.com/2009/04/linux-io-scheduler-queue-size-and.html




For example, these are suggested changes for an HP Smart Array RAID controller:



echo "noop" > /sys/block/cciss\!c0d0/queue/scheduler 
blockdev --setra 65536 /dev/cciss/c0d0
echo 512 > /sys/block/cciss\!c0d0/queue/nr_requests
echo 2048 > /sys/block/cciss\!c0d0/queue/read_ahead_kb


What else can be reliably tuned to improve storage performance?


I'm specifically looking for sysctl and sysfs options in production scenarios.


Answer



I've found that when I've had to tune for lower latency vs throughput, I've tuned nr_requests down from it's default (to as low as 32). The idea being smaller batches equals lower latency.



Also for read_ahead_kb I've found that for sequential reads/writes, increasing this value offers better throughput, but I've found that this option really depends on your workload and IO pattern. For example on a database system that I've recently tuned I changed this value to match a single db page size which helped to reduce read latency. Increasing or decreasing beyond this value proved to hurt performance in my case.



As for other options or settings for block device queues:



max_sectors_kb = I've set this value to match what the hardware allows for a single transfer (check the value of the max_hw_sectors_kb (RO) file in sysfs to see what's allowed)




nomerges = this lets you disable or adjust lookup logic for merging io requests. (turning this off can save you some cpu cycles, but I haven't seen any benefit when changing this for my systems, so I left it default)



rq_affinity = I haven't tried this yet, but here is the explanation behind it from the kernel docs




If this option is '1', the block layer will migrate request completions to the
cpu "group" that originally submitted the request. For some workloads this
provides a significant reduction in CPU cycles due to caching effects.
For storage configurations that need to maximize distribution of completion
processing setting this option to '2' forces the completion to run on the
requesting cpu (bypassing the "group" aggregation logic)"





scheduler = you said that you tried deadline and noop. I've tested both noop and deadline, but have found deadline win's out for the testing I've done most recently for a database server.



NOOP performed well, but for our database server I was still able to achieve better performance adjusting the deadline scheduler.



Options for deadline scheduler located under /sys/block/{sd,cciss,dm-}*/queue/iosched/ :



fifo_batch = kind of like nr_requests, but specific to the scheduler. Rule of thumb is tune this down for lower latency or up for throughput. Controls the batch size of read and write requests.




write_expire = sets the expire time for write batches default is 5000ms. Once again decrease this value decreases your write latency while increase the value increases throughput.



read_expire = sets the expire time for read batches default is 500ms. Same rules apply here.



front_merges = I tend to turn this off, and it's on by default. I don't see the need for the scheduler to waste cpu cycles trying to front merge IO requests.



writes_starved = since deadline is geared toward reads the default here is to process 2 read batches before a write batch is processed. I found the default of 2 to be good for my workload.


Comments

Popular posts from this blog

linux - iDRAC6 Virtual Media native library cannot be loaded

When attempting to mount Virtual Media on a iDRAC6 IP KVM session I get the following error: I'm using Ubuntu 9.04 and: $ javaws -version Java(TM) Web Start 1.6.0_16 $ uname -a Linux aud22419-linux 2.6.28-15-generic #51-Ubuntu SMP Mon Aug 31 13:39:06 UTC 2009 x86_64 GNU/Linux $ firefox -version Mozilla Firefox 3.0.14, Copyright (c) 1998 - 2009 mozilla.org On Windows + IE it (unsurprisingly) works. I've just gotten off the phone with the Dell tech support and I was told it is known to work on Linux + Firefox, albeit Ubuntu is not supported (by Dell, that is). Has anyone out there managed to mount virtual media in the same scenario?

ubuntu - Monitoring CPU, Mem, disk, on a single server

I've been looking for a simple starter solution for monitoring my [currently] single server hosted solution. Other than Nagios and similar, are there other good (simple) solutions people are using? Answer Everything depends on what you want. For example Munin is very simple, you can install and configure it in less then 10 minutes (on one server), it can sends alarms, make graphs from monitoring cpu, mem. apache connections, eaccellerator, disk io and many many more (it has many plugins). But if you are planning in future get some more machines, munin may not be enough. For example in munin you cant monitor state of individual processes, can't monitor changes in files (for security purpose). So if you wanna only see what is the utilization of basics parameters on your server and don't plan to buy some more servers Munin is what you are looking for, but if you wanna be alarmed when some of your service is down, take more control on what is happeninig on...

hp proliant - Smart Array P822 with HBA Mode?

We get an HP DL360 G8 with an Smart Array P822 controller. On that controller will come a HP StorageWorks D2700 . Does anybody know, that it is possible to run the Smart Array P822 in HBA mode? I found only information about the P410i, who can run HBA. If this is not supported, what you think about the LSI 9207-8e controller? Will this fit good in that setup? The Hardware we get is used but all original from HP. The StorageWorks has 25 x 900 GB SAS 10K disks. Because the disks are not new I would like to use only 22 for raid6, and the rest for spare (I need to see if the disk count is optimal or not for zfs). It would be nice if I'm not stick to SAS in future. As OS I would like to install debian stretch with zfs 0.71 as file system and software raid. I have see that hp has an page for debian to. I would like to use hba mode because it is recommend, that zfs know at most as possible about the disk, and I'm independent from the raid controller. For us zfs have many benefits, ...