MD Raid Management Tools

I have md RAID on two of my machines.  They’ve grown over time and have a random assortment of disks in them – the main server has 6 x 2TB drives, the other a mix of 2TB, 1.5TB and 1TB drives.  I’m currently tidying up this setup, as I was getting dreadful throughput.  This post focuses on the tuning tips I found in various places, and the options I’ve tweaked to get better throughput.

For RAID performance, my reading tells me that there are three things that really matter:

  1. Alignment.  All partitions, any LVM on top of the RAID, and any other layers such as VMs should align on 1MB boundaries.  Why?  Basically all disks use a 512-byte or 4KB sector size, and you want anything you write to fit neatly inside those sectors.  If it doesn’t, the drive has to read the sector, update the bit you want to change, then write it back out.  I don’t feel like this had a big impact, but it’s not hard to get right.
  2. RAID stripe cache.  A RAID stripe is a series of blocks across your RAID array that together have parity calculated over them.  Any write to disk must update the parity for the whole stripe.  As I understand it, the stripe cache holds your recently used stripes in memory, so if you write only a portion of a stripe, you still have the blocks from the other disks in cache.  Again, this avoids read-update-write cycles.  (You can check your array’s stripe geometry with mdadm, as shown just below.)
  3. Read ahead cache.  This is about the physics of spinning disks.  Reading data is much faster than moving drive heads, so once you’ve moved the heads, the ideal is to read a decent amount of data before moving them somewhere else.  The problem is that reads don’t always turn up in nice organised chunks – particularly if you have a few different processes each trying to read from different bits of the array.  Read ahead reads a set of extra blocks before moving the head somewhere else, on the assumption that you’re likely to want them soon.  Reading a bit extra creates very little overhead, and much of the time it turns out that you actually wanted that data.
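
If you want to see the geometry of an existing array – level, device count and chunk (stripe unit) size – mdadm will tell you.  Something like this should work (my /dev/md5 is just an example device):

# show the stripe geometry of one array (run as root)
mdadm --detail /dev/md5 | grep -E 'Raid Level|Raid Devices|Chunk Size'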

There are some other things that can also be optimised, but that I either haven’t done yet (time or laziness), or that I didn’t have a sensible way to do easily.  One is the stride / stripe-width optimisation.  Basically the ideal is for your file system to know the geometry of your array, and to size its block allocation and read/write requests to fit.  The stride and stripe-width options do this on ext3/4 systems.  My problem is that my arrays are quite fragmented due to differing disk sizes, and my file systems have been created over time – some are 5 or more years old and were originally created on a different server with completely different disks.  LVM then sits over this …. mess …. so there’s no good way to sort it out.  I’ve given up on stride.
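
For what it’s worth, if you do have a clean, newly created array to format, the calculation isn’t hard.  A sketch, assuming a hypothetical 6-disk RAID6 (so 4 data disks) with a 512KB chunk and 4KB filesystem blocks:

# stride = chunk / block = 512KB / 4KB = 128
# stripe-width = stride x data disks = 128 x 4 = 512
mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/md5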

Firstly, partitioning to get the alignment right.  Basically, if you’re creating a new array and filesystem today, you’ll find that pretty much all the tools get this right by default.

By default, fdisk will start the first partition at sector 2048 (1MB).  This wastes the first 1MB of your disk, but given disks are measured in TB, who cares.  It also wastes up to 1MB between partitions, again making sure that each partition starts on a 1MB boundary.  You can use

fdisk -c -u

but recent builds seem to default to these options anyway (so far as I can tell).
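
If you’d rather check an existing disk than trust the tool, the kernel exposes each partition’s start sector in sysfs; a start sector divisible by 2048 (2048 x 512 bytes = 1MB) means the partition is 1MB-aligned.  A quick check, assuming your disk is /dev/sda:

# print each partition's start sector; divisible by 2048 means 1MB-aligned
for p in /sys/block/sda/sda[0-9]*/start; do
  echo "$p: $(cat $p)"
done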

Next, RAID stripe cache.  You set the stripe cache by

echo xxxxx > /sys/block/mdX/md/stripe_cache_size

Be aware that you can use a lot of memory by playing with this: you are setting the number of pages to cache, per device.  The calculation is number of pages x number of devices x 4 / 1024 (in MB).  So if you have a 6 disk array, and set this to 1024, you’re using 24MB of RAM for this array.  The other caveat is that one of my machines triggers the out-of-memory killer (which kills everything) when I set this to 4096 on each array (about 8 separate arrays).  My maths says it shouldn’t be running out of memory, but it is.  Use at your own risk.
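
If you want to sanity-check the total you’ve committed before the OOM killer does it for you, you can add it up from sysfs.  A rough sketch – this assumes each RAID5/6 array exposes stripe_cache_size and raid_disks under /sys/block/mdX/md:

# total stripe cache memory across all arrays: pages x devices x 4KB
total=0
for d in /sys/block/md*/md; do
  [ -f "$d/stripe_cache_size" ] || continue  # raid0/1 arrays have no stripe cache
  total=$(( total + $(cat "$d/stripe_cache_size") * $(cat "$d/raid_disks") * 4 / 1024 ))
done
echo "${total}MB committed to stripe caches"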

Lastly, read ahead cache.  This can be set at a partition level, at a physical disk level, or for an array as a whole.  It can also be set in LVM.  So far as I can tell, you basically get the largest setting of anything in the stack – which says to me that you only need to set it in one place.  I’m choosing to set it on my arrays.  To set it, you use

blockdev --setra XXXX /dev/mdX

In theory the memory usage here is scarier, but it seems to give me fewer problems.  You’re setting the number of 512 byte sectors, so the maths on memory usage is number of sectors x number of devices x 512 / 1024 / 1024, giving MB of RAM usage.  So setting 8192 on a 6 disk array gives 24MB of RAM usage.  8192 appears to be the default on many newly created arrays.
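
blockdev can read the value back as well as set it, which is handy for seeing what each layer of the stack currently has.  The device names here are just examples:

blockdev --getra /dev/md5        # the array itself
blockdev --getra /dev/sda        # one of the member disks
blockdev --getra /dev/vg0/lv0    # an LVM volume on top, if you have one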

In order to set and manipulate this stuff, I created a couple of perl scripts.  These are somewhat based on this script:

http://ubuntuforums.org/showthread.php?t=1916607

but since my shell scripting skills are really poor, and my Perl skills only slightly better, I rewrote what I wanted in Perl.  So what we have are 5 files (the files themselves embedded below):

1. RaidArray.pm    –  an object that represents a raid array, and gives getters and setters on key attributes

2. RaidLibraryFunctions.pm  – some generic functions that create a list of all arrays, and allow manipulating those arrays

3. get_array_details.pl   – interrogates all your raid arrays on a server, and prints out key metrics

4. set_readahead_size.pl  – sets readahead on all arrays on your server.  I tend to set it to 8192

5. set_stripe_cache_size.pl  – sets stripe cache on all arrays on your server.  I tend to set it to 1024 or 2048

As an example, this is the output I get from get_array_details.pl:

Name  | Device Name   | Level | Num Devices | RA Sectors | RA MB | Stripe Cache Pages | Stripe Cache MB
md8   | /dev/md8      |     6 |           6 |       8192 |    24 |               1024 |          24
md7   | /dev/md7      |     6 |           6 |       8192 |    24 |               1024 |          24
md6   | /dev/md6      |     6 |           6 |       8192 |    24 |               1024 |          24
md5   | /dev/md5      |     6 |           6 |       8192 |    24 |               1024 |          24
md11  | /dev/md11     |     6 |           6 |       8192 |    24 |               1024 |          24
md9   | /dev/md9      |     6 |           6 |       8192 |    24 |               1024 |          24
md10  | /dev/md10     |     6 |           6 |       8192 |    24 |               1024 |          24
md12  | /dev/md12     |     6 |           6 |       8192 |    24 |               1024 |          24
md1   | /dev/md1      |     1 |           6 |        256 |  0.75 |                  0 |           0
md2   | /dev/md2      |     1 |           6 |        256 |  0.75 |                  0 |           0
Total                                                      193.5                                192

The files are attached below.  I may update them at some point; at present the usage is a bit funny – you call a script and it tells you what it would do, then you call it again with a second parameter of “yes” and it actually does it.  It should really do it in one pass – tell you what it’s going to do, then ask if it should go ahead.
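
In the meantime, typical usage looks something like this (run as root, since the scripts call mdadm and write to sysfs):

perl get_array_details.pl              # show current settings for every array
perl set_readahead_size.pl 8192        # dry run - reports what it would change
perl set_readahead_size.pl 8192 yes    # actually apply it
perl set_stripe_cache_size.pl 1024 yes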

RaidArray.pm

package RaidArray;
use strict;

sub new
{
# instantiates an array given the array's name
# stores the array name that is passed in, calculates the device name, and gets the
# mdadm details upfront, avoiding calling mdadm every time we want a single attribute

my $class = shift;

my $self = {
_arrayName => shift,
_arrayDeviceName => undef,
_arrayDetails => undef };

chomp ($self->{_arrayName});
$self->{_arrayDeviceName} = "/dev/" . $self->{_arrayName};

my $exec_string = "mdadm --detail " . $self->{_arrayDeviceName};
$self->{_arrayDetails} = `$exec_string`;

bless $self, $class;
return $self;
}

sub getDetails
{
# only really used for debugging - returns the full dump of mdadm details for the array
my ($self) = shift;
return $self->{_arrayDetails};
}

sub getName
{
my ($self) = shift;
return $self->{_arrayName};
}

sub getDeviceName
{
my ($self) = shift;
return $self->{_arrayDeviceName};
}

sub getRaidLevel
{
# extracts the numeric level from "Raid Level : raidN" - \d+ so raid10 isn't misread as 1
my ($self) = shift;
$self->{_arrayDetails} =~ /Raid Level : raid(\d+)/;
return $1;
}

sub getNumRaidDevices
{
my ($self) = shift;
$self->{_arrayDetails} =~ /Raid Devices : (\d*)/;
return $1;
}

sub getReadAheadSectors
{
my ($self) = shift;
my $exec_string  = "blockdev --getra " . $self->{_arrayDeviceName} . "\n";
my $RASectors = `$exec_string`;
chomp($RASectors);
return $RASectors;
}

sub getReadAheadMB
{
my ($self) = shift;
my $RASectors = $self->getReadAheadSectors;
my $devices = $self->getNumRaidDevices;
return $RASectors * $devices * 512 / 1024 / 1024;
}

sub setReadAheadSectors
{
my ($self) = shift;
my $newSetting = shift;
my $exec_string =  "blockdev --setra " . $newSetting . " " . $self->{_arrayDeviceName} . "\n";
my $result = `$exec_string`;
return 0;
}

sub getStripeCachePages
{
my ($self) = shift;

if ($self->getRaidLevel > 1)
{
my $exec_string  = "cat /sys/block/" . $self->{_arrayName} . "/md/stripe_cache_size\n";
my $pages = `$exec_string`;
chomp($pages);
return $pages;
}
else
{
return 0;
}
}

sub setStripeCachePages
{
my ($self) = shift;
if ($self->getRaidLevel > 1)
{
my $newSetting = shift;
my $exec_string =  "echo " . $newSetting . " > /sys/block/" . $self->{_arrayName} . "/md/stripe_cache_size\n";
my $result = `$exec_string`;
}
return 0;
}

sub getStripeCacheMB
{
my ($self) = shift;
my $pages = $self->getStripeCachePages;
my $devices = $self->getNumRaidDevices;
my $memoryUsageMB = $pages * $devices * 4 / 1024;
return $memoryUsageMB;
}

sub getDetailsHeaderCSV
{
my ($self) = shift;
return "Name,DeviceName,RaidLevel,NumRaidDevices,ReadAheadSectors,ReadAheadMB,StripeCachePages,StripeCacheMB\n";
}

sub getDetailsCSV
{
my ($self) = shift;
my $details = "";
$details = $details . $self->getName . ",";
$details = $details . $self->getDeviceName . ",";
$details = $details . $self->getRaidLevel . ",";
$details = $details . $self->getNumRaidDevices . ",";
$details = $details . $self->getReadAheadSectors . ",";
$details = $details . $self->getReadAheadMB . ",";
$details = $details . $self->getStripeCachePages . ",";
$details = $details . $self->getStripeCacheMB . "\n" ;
return $details;
}

sub getDetailsHeaderPretty
{
my ($self) = shift;
return "Name  | Device Name   | Level | Num Devices | RA Sectors | RA MB | Stripe Cache Pages | Stripe Cache MB\n";
}

sub getDetailsPretty
{
my ($self) = shift;
my $details = "";
$details = $details . sprintf("%-6s", $self->getName) . "| ";
$details = $details . sprintf("%-14s", $self->getDeviceName) . "|";
$details = $details . sprintf("%6s", $self->getRaidLevel) . " |";
$details = $details . sprintf("%12s", $self->getNumRaidDevices) . " |";
$details = $details . sprintf("%11s", $self->getReadAheadSectors) . " |";
$details = $details . sprintf("%6s", $self->getReadAheadMB) . " |";
$details = $details . sprintf("%19s", $self->getStripeCachePages) . " |";
$details = $details . sprintf("%12s", $self->getStripeCacheMB) . "\n" ;
return $details;
}

1;
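
As an aside, you can use the module directly from the shell to poke at a single array – assuming the .pm files are in the current directory and you have an array called md5:

perl -I. -MRaidArray -e 'my $a = RaidArray->new("md5"); print $a->getDetailsHeaderPretty, $a->getDetailsPretty;'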

RaidLibraryFunctions.pm

#
# Library functions shared across a number of scripts that perform administration on
# arrays.

package RaidLibraryFunctions;
use RaidArray;

use strict;

sub GetArrayNames
{
# returns a list of array names, like "md10", parsed from /proc/mdstat

# We get a list of all the arrays from cat /proc/mdstat.  Relies on the format of this command.
# We are expecting lines in the format "md10 : active raid6 sdc10[6] sdb10[8] sda10[7] sdf10[5] sde10[4] sdd10[3]"
# There is also a line that starts with "Personalities" that we don't want, so we drop that out
# We're grabbing the first field - the md10 bit

my $exec_string = "cat /proc/mdstat | grep raid | grep -v Personalities | cut -d\" \" -f1\n";
my @arrayNames = `$exec_string`;

return @arrayNames;
}

sub GetArrays
{
# uses the array names to create a list of all of the arrays, each item in that list is
# an object representing that array, we can operate on those objects to get information or
# set values relating to that array

# get the list of array names
my @arrayNames = GetArrayNames();

# iterate through the names, creating an object for the array, and pushing it into the array list
my @arrays;
foreach (@arrayNames)
{
my $array = RaidArray->new($_);
push (@arrays, $array);
}

return @arrays;
}

1;

get_array_details.pl

#!/usr/bin/perl
#
# Gets details of key settings for all arrays on the system.
#

use strict;
use RaidLibraryFunctions;

my @arrays = RaidLibraryFunctions::GetArrays();

print $arrays[0]->getDetailsHeaderPretty;
foreach (@arrays)
{
print $_->getDetailsPretty;
}

my $totalRAMB = 0;
my $totalStripeCacheMB = 0;
foreach (@arrays)
{
$totalRAMB = $totalRAMB + $_->getReadAheadMB;
$totalStripeCacheMB = $totalStripeCacheMB + $_->getStripeCacheMB;
}
print sprintf("Total%59s %34s\n", $totalRAMB, $totalStripeCacheMB);

set_readahead_size.pl

#!/usr/bin/perl
#
# Sets read ahead on all arrays on this server
# Unless second parameter is "yes", prints out what it's going to do
#
#  set_readahead_size.pl 8192 yes
#  set_readahead_size.pl 8192
#
use strict;
use RaidLibraryFunctions;

my $size = shift;
my $execute = shift;

my @arrays = RaidLibraryFunctions::GetArrays();
my $count = 0;
my $totalRAMB = 0;

foreach (@arrays)
{
$count = $count + 1;
$totalRAMB = $totalRAMB + (512 * $_->getNumRaidDevices * $size / 1024 / 1024);
}

print "Setting read ahead for all arrays\n";

if ($execute ne "yes")
{
print "  Test mode\n";
print "  If you're happy with the below calculations, run again with yes as second parameter\n";
print "  Such as\n";
print "    perl set_readahead_size.pl 8092 yes\n";
print "  This command will use 512B per page per disk per array, and set " . $size . " pages\n";
print "  Given there are " . $count . " arrays on this system, and taking into account the number of disks in each array\n";
print "  this will use " . $totalRAMB . "MB of memory.\n\n";

foreach (@arrays)
{
my $beforeRA = $_->getReadAheadSectors;
print "  Will change " . $_->getDeviceName . " from " . $beforeRA . " to " . $size . "\n";
}
}
else
{
foreach (@arrays)
{
my $beforeRA = $_->getReadAheadSectors;
$_->setReadAheadSectors($size);
my $afterRA = $_->getReadAheadSectors;
print "Have changed " . $_->getDeviceName . " from " . $beforeRA . " to " . $afterRA . "\n";
}
}

set_stripe_cache_size.pl

#!/usr/bin/perl
#
# Sets stripe cache on all arrays on this server
# Unless second parameter is "yes", prints out what it's going to do
#
#  set_stripe_cache_size.pl 1024 yes
#  set_stripe_cache_size.pl 1024
#
use strict;
use RaidLibraryFunctions;

my $size = shift;
my $execute = shift;

my @arrays = RaidLibraryFunctions::GetArrays();
my $count = 0;
my $totalStripeCacheMB = 0;

foreach (@arrays)
{
if ($_->getRaidLevel != 1)
{
$count = $count + 1;
$totalStripeCacheMB = $totalStripeCacheMB + (4 * $_->getNumRaidDevices * $size / 1024);
}
}

print "Setting stripe cache for all arrays\n";

if ($execute ne "yes")
{
print "  Test mode\n";
print "  If you're happy with the below calculations, run again with yes as second parameter\n";
print "  Such as\n";
print "    perl set_stripe_cache_size.pl 1024 yes\n";
print "  This command will use 4K per page per disk per array, and set " . $size . " pages.\n";
print "  Stripe cache does not apply to raid level 1\n";
print "  Given there are " . $count . " suitable arrays on this system, and taking into account the number of disks in each array\n";
print "  this will use " . $totalStripeCacheMB . "MB of memory.\n\n";

foreach (@arrays)
{
if ($_->getRaidLevel != 1)
{
my $beforeSC = $_->getStripeCachePages;
print "  Will change " . $_->getDeviceName . " from " . $beforeSC . " to " . $size . "\n";
}
else
{
print "  Array " .  $_->getDeviceName . " is raid level " . $_->getRaidLevel . " and has no stripe cache.\n"
}
}
}
else
{
foreach (@arrays)
{
if ($_->getRaidLevel != 1)
{
my $beforeSC = $_->getStripeCachePages;
$_->setStripeCachePages($size);
my $afterSC = $_->getStripeCachePages;
print "  Have changed " . $_->getDeviceName . " from " . $beforeSC . " to " . $afterSC . "\n";
}
else
{
print "  Array " .  $_->getDeviceName . " is raid level " . $_->getRaidLevel . " and has no stripe cache.\n"
}
}
}
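
One caveat with all of this: neither the readahead nor the stripe cache settings survive a reboot, so they need reapplying at boot.  My lazy approach would be to call the scripts from /etc/rc.local – the path here is just where I’d put them:

# in /etc/rc.local (or your distro's equivalent); -I so perl finds the .pm modules
perl -I/root/bin /root/bin/set_readahead_size.pl 8192 yes
perl -I/root/bin /root/bin/set_stripe_cache_size.pl 1024 yes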