I have md RAID on two of my machines. They’ve grown over time and contain a random assortment of disks – the main server has 6 x 2TB drives, the other a mix of 2TB, 1.5TB and 1TB drives. I’m currently tidying up this setup, as I was getting dreadful throughput. This post covers the tuning tips I found in various places, and the options I’ve tweaked to get better throughput.
For raid performance, my reading tells me that there are three things that really matter:
- All partitions, any LVM on top of the RAID, and any other layers such as VMs should align on 1MB boundaries. Why? Disks use a 512-byte or 4KiB sector size, and you want anything you write to fit neatly within those sectors. If it doesn’t, the disk has to read the sector, update the part you want to change, then write it back out. I don’t feel this had a big impact for me, but it’s not hard to get right.
- RAID stripe cache. A RAID stripe is a series of blocks across your RAID array that together have parity calculated over them. Any write must update the parity for its stripe, so a partial write can force the rest of the stripe to be read back in first. As I understand it, the stripe cache holds your recently used stripes in memory, so if you write only a portion of a stripe, the blocks from the other disks are still in cache. Again, this avoids read/modify/write cycles.
- Read-ahead cache. This is about the physics of spinning disks: reading data is much faster than moving drive heads, so once you’ve moved the heads, the ideal is to read a decent amount of data before moving them somewhere else. The problem is that reads don’t always arrive in nicely organised chunks – particularly if several processes are each reading from different parts of the array. Read-ahead reads a set of blocks beyond what was requested before moving the head, on the assumption that you’re likely to want them soon. Reading a bit extra adds very little overhead, and much of the time it turns out you actually wanted that data.
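The 1MB alignment rule above comes down to simple arithmetic: a partition start is recorded in 512-byte sectors, and 1MiB is 2048 of them. A minimal sketch of the check – the device path in the comment is just an example:

```shell
# Check whether a partition start (given in 512-byte sectors) falls on a
# 1MiB boundary: 1MiB = 2048 sectors, so it must divide evenly by 2048.
aligned_1mib() {
    if [ $(( $1 % 2048 )) -eq 0 ]; then echo aligned; else echo misaligned; fi
}

# On a real system the start sector can be read from sysfs, e.g.
#   cat /sys/block/sda/sda1/start      (device name is an example)
aligned_1mib 2048    # modern partitioners start at 1MiB -> aligned
aligned_1mib 63      # the old fdisk default            -> misaligned
```

The start-at-sector-63 case is exactly why old partition tables perform badly on 4KiB-sector drives.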
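Read-ahead on the array device is set with blockdev, which counts in 512-byte sectors (two per KiB). A sketch, with md0 and the 65536-sector value as example choices rather than recommendations:

```shell
# blockdev --setra takes a count of 512-byte sectors; two sectors per KiB.
ra_sectors_to_kib() {
    echo $(( $1 / 2 ))
}

# Inspect and set read-ahead on the array (device and value are examples):
#   blockdev --getra /dev/md0
#   blockdev --setra 65536 /dev/md0
ra_sectors_to_kib 65536    # -> 32768 (KiB, i.e. 32MiB of read-ahead)
```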
There are some other things that could also be optimised, but that I either haven’t got to yet (time or laziness), or had no sensible way to do easily. One is the stride/block size/chunk size optimisation. The ideal is for your file system to know the geometry of your array, and to fit its block size and read/write requests to that geometry; the stride option does this on ext3/4 systems. My problem is that my arrays are quite fragmented due to the differing disk sizes, and my file systems were created over time – some are 5 or more years old and were originally created on a different server with completely different disks. LVM then sits over this …. mess …. so there’s no good way to sort it out. I’ve given up on stride.
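For anyone with a cleaner setup than mine, the stride arithmetic itself is straightforward, following the definitions in the mke2fs man page. A sketch with example numbers (a hypothetical 6-disk RAID5 with a 512KiB chunk and 4KiB filesystem blocks – not my actual layout):

```shell
# From the mke2fs man page:
#   stride       = chunk size / filesystem block size
#   stripe-width = stride * number of data disks (total disks minus parity)
ext_stride()       { echo $(( $1 / $2 )); }           # chunk KiB, block KiB
ext_stripe_width() { echo $(( $1 * ($2 - $3) )); }    # stride, disks, parity disks

# Example: 6-disk RAID5 (1 parity disk), 512KiB chunk, 4KiB blocks:
ext_stride 512 4            # -> 128
ext_stripe_width 128 6 1    # -> 640
# which would be passed at filesystem creation time as something like:
#   mkfs.ext4 -E stride=128,stripe-width=640 /dev/md0   (md0 is an example)
```

These are creation-time options, which is precisely the problem with retrofitting them onto filesystems that predate the current array.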