As I noted earlier, code providing native RAID5 and 6 on btrfs is available in kernel 3.9. I have some spare time this weekend, so I’m going to compile a new kernel on a virtual machine and have a play. This post contains instructions for anyone who might want to do the same.
I’m thinking of downloading the new kernel and trialling it on a virtual machine this weekend, at which point I can maybe post some experiences / howto information.
In the meantime, what I see that’s new is an update on the FAQ page. There is support for:
- RAID-5 (many devices, one parity)
- RAID-6 (many devices, two parity)
- The algorithm uses as many devices as are available (see note, below)
- No scrub support for fixing checksum errors
- No support in btrfs-progs for forcing parity rebuild
- No support for discard
- No support yet for n-way mirroring
- No support for a fixed-width stripe (see note, below)
It looks like the fixed-width stripe is the key one. The implication is that btrfs uses all the devices available, so if you happened to have 12 disks, it’d build a 12-disk RAID. This hurts seek times – every disk needs to seek to the right place simultaneously to fetch that data. Whereas if you had 12 disks and btrfs built stripes using 6 of them, only 6 disks would need to seek simultaneously, and on average the array could service two different IO requests in parallel.
Secondly, the balancing code doesn’t yet deal with different-sized disks. It sounds like if it didn’t always use as many devices as were available, it would deal OK with mixed-size disks, no doubt within limits.
All good progress. I think that to use it I’ll have to download and build a new kernel, and download and build a new version of btrfs-progs. We’ll see this weekend maybe.
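For the record, the build steps I expect to follow look something like this. A sketch only, assuming a Debian-ish box with the usual build tools already installed; the kernel version and repository URL are my guesses at the time of writing, so substitute whatever is current:

```shell
#!/bin/sh
# Sketch: fetch and build a 3.9 release candidate plus btrfs-progs.
# KVER and the URLs are assumptions - check kernel.org for the current rc.
KVER=3.9-rc1
SRC="https://www.kernel.org/pub/linux/kernel/v3.x/testing/linux-$KVER.tar.xz"
echo "would fetch $SRC"

# wget "$SRC" && tar xf "linux-$KVER.tar.xz" && cd "linux-$KVER"
# cp "/boot/config-$(uname -r)" .config   # start from the running kernel's config
# make oldconfig                          # say Y/M to BTRFS_FS when asked
# make -j"$(nproc)" deb-pkg               # on Debian, builds installable .deb packages
# sudo dpkg -i ../linux-image-*.deb

# btrfs-progs is built from its own repository:
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
# cd btrfs-progs && make && sudo make install
```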
Looks like I was wrong: according to Phoronix, the RAID6 code made it into the 3.9 merge window. Linus is unhappy with this being submitted late in the window, but accepted it. I reckon Chris Mason has a habit of doing stuff like this late in the window – it’s not the first time I’ve seen that complaint from Linus – but either way I’m happy, as it brings it that much closer to being stable.
I see a pull request for the RAID 5/6 code: http://email@example.com/msg22694.html. This means, I believe, that the code has gone into the base btrfs repository, which means a wider group of people will be testing it. Having said that, it’s still missing a few key features, so I wouldn’t expect to see it in a Linux kernel for a while – I can’t see it making 3.9, and it’d be shaky for 3.10.
Yet to be implemented features look to be:
- Parity logging, which looks to address the situation where a write has reached only some disks when a power failure or other crash occurs, which would otherwise leave the disks out of sync
- Providing for the scrubbing routines to notice any RAID discrepancies, use the checksums to work out which bits are incorrect, and rewrite the stripe
- Support in the user-space programs (btrfs-progs), which in some circumstances need to be aware of the raid levels
Overall, it feels close. I’m very keen on this, as I plan to eventually move my systems over to btrfs for all file systems. In theory this should mean the end of partitioning disks, of dealing with different drive sizes in RAID sets, and of creating different arrays for different use cases (one for RAID1, another for RAID0, etc.). It should also mean goodbye to silent data corruption, and hello to in-file-system compression for those bits of my file system that are compressible (obviously that doesn’t include media files such as photos, music and recorded TV, which make up the majority of my storage, but it would include virtual machine images).
I’ve been keeping an eye on btrfs for some time; it provides a next-generation file system for Linux that is in many ways equivalent to ZFS on Solaris. It’s still relatively new, so not something you’d trust irreplaceable data to, but long term it will provide some compelling features:
- Checksumming of all data and metadata. As drives get larger, the law of large numbers says that some pieces of data will become corrupt. For media files this isn’t critical – you might get one wrong pixel on one frame of a movie. For a spreadsheet or a program it could be critical. Checksumming tells you when your data is wrong on disk
- Integrated RAID support. RAID 0 and 1 are built in, and it allows different RAID levels on different directories, and different levels for metadata vs data. RAID5/6 has been “coming” for some time, but isn’t available yet
- Because the RAID is integrated, and because of checksumming, where a discrepancy is identified between two drives btrfs can pick the correct copy to repair from – not only does it make your data redundant, it can sensibly fix it when something goes wrong (md RAID essentially picks one block arbitrarily to get back in sync)
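To give a flavour of what the integrated RAID looks like in practice, here’s a sketch. The device names are hypothetical scratch disks, and the `-d`/`-m` profile split is the mkfs.btrfs feature described above:

```shell
#!/bin/sh
# Helper that builds a mkfs.btrfs command line with separate data and
# metadata RAID profiles. The device names below are hypothetical.
mkfs_cmd() {
    data=$1; meta=$2; shift 2
    echo "mkfs.btrfs -d $data -m $meta $*"
}

# Stripe the data, mirror the metadata, across two scratch disks:
CMD=$(mkfs_cmd raid0 raid1 /dev/sdb /dev/sdc)
echo "$CMD"

# Only run it on disks you can afford to wipe:
# $CMD
# mount /dev/sdb /mnt/btrfs
# btrfs scrub start /mnt/btrfs   # re-reads and verifies checksums
```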
The RAID5/6 patches have gone up into an experimental tree, which means they’re on the way to the kernel proper – I’d guess maybe 3.10. Which is huge news. Refer:
Further information in my saga.
On my main server, hosting my media centre / MythTV setup, I am still having issues. I’m pretty sure the recording dropouts I’m getting are directly correlated with performance. I’ve gotten rid of the errors I had with the recordings, so I’m now left with only a series of the following errors:
2013-02-07 14:41:44.538606 W [10163/15173] TFWWrite ThreadedFileWriter.cpp:500 (DiskLoop) - TFW(/usr/share/mythtv/recordings/6008_20130207030900.mpg:85): write(57716) cnt 19 total 1087956 - - took a long time, 1568 ms
This is then sometimes followed up with the following, and seems to be correlated with a recording dropout (which makes sense given the error message):
2013-02-06 10:18:35.013307 E [10163/1005] HDHRStreamHandler ThreadedFileWriter.cpp:217 (Write) - TFW(/usr/share/mythtv/recordings/6030_20130205231500.mpg:97): Maximum buffer size exceeded. file will be truncated, no further writing will be done. This generally indicates your disk performance is insufficient to deal with the number of on-going recordings, or you have a disk failure.
Clearly IO performance is an underlying issue here.
I note that when recording, myth can be quite hungry on disk – it’s often recording 3-4 streams or more (a number of channels simulcast shows – usually one in HD and a couple of random SD ones – and their subtly different programme listings mean myth often doesn’t notice they’re duplicates), and it then starts a commercial-flagging job for each in parallel, reading the recording back and scanning for ad breaks. Overall it’s still only doing about 15MB per second, but that seems to exceed the current array’s performance.
In the near term I am reducing the number of parallel commercial flagging jobs, reducing the likelihood of contention, but I still need to fix array performance.
Overnight I pushed my test server into single-user mode, on the theory that the variation in my results was driven by other jobs running on the machine. On a Debian server you do this with “init 1”. I also learned that you don’t do this with “init 0”, as that’s very similar to shutdown -h. I now have very consistent and repeatable results, which is great, and they look something like this:
This tells me I’m getting little variation from read-ahead, but a reasonably clear minimum threshold for stripe cache. The ideal point looks to be around 1024/1024 or 2048/2048. Of course, looking further into it, I have questions about whether my read-ahead setting is having any effect, and I note there is also a read-ahead setting on the LVM. So I will also try tuning that to see if I get more variation. I also have a slight concern that bonnie++ might not be a fully representative load, so I may try another load-testing tool as well to check I have the right process. Or maybe run bonnie++ in multi-threaded mode. But, overall, progress is being made.
It still leaves me questioning why one server regularly produces results around 200MB/s, and the other seems stuck at 15MB/s. That is really my underlying problem to resolve.
So, I wrote a script that would try various combinations of stripe_cache and read_ahead, using bonnie++ to measure the impacts on performance.
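The script itself is nothing fancy; a trimmed-down sketch is below. The array name (`md0`), mount point and candidate values are assumptions from my setup – adjust to taste – and it needs root to write to /sys:

```shell
#!/bin/sh
# Sketch of the sweep: try each (stripe_cache_size, read-ahead) pair and
# benchmark it with bonnie++. /dev/md0, /mnt/test and the candidate value
# list are assumptions from my own setup.
SIZES="256 512 1024 2048 4096"

# Emit every combination to try.
combos() {
    for sc in $SIZES; do
        for ra in $SIZES; do
            echo "$sc $ra"
        done
    done
}

run_sweep() {
    mkdir -p results
    combos | while read -r sc ra; do
        echo "$sc" > /sys/block/md0/md/stripe_cache_size   # pages per device
        blockdev --setra "$ra" /dev/md0                    # 512-byte sectors
        bonnie++ -d /mnt/test -u nobody -q > "results/bonnie_${sc}_${ra}.csv"
    done
}

# Only touch the hardware when the array is present and we can write to it.
if [ -w /sys/block/md0/md/stripe_cache_size ]; then
    run_sweep
fi
```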
On my test / lower-spec server, I ran it once, and the results looked a bit funny. I wrote a script that combined all the results and put them into a spreadsheet, normalising the throughput on each of the bonnie++ measures to 100%, and colour-coding the results: those close to 100% as green, those a bit worse as orange, those worse still as red. That gave me tables like this:
As expected, they were different across different measures, some things benefited from stripe_cache, some were harmed. So I consolidated into a min and average for each, trying to find the settings that didn’t impact any of the different metrics too much. This gave me:
The problem is that some of the measures just look funny. I was thinking that the 86.31% average, which had no measure worse than 70%, was pretty good (RA of 4096, stripe of 512). Thing is, it makes no sense that the min was 70%. So I did another run, which gave this summary:
And it looked nothing like the first one. So then I did a third run, and that gave me this:
Again, no correlation. Sigh.
My learning here is that something else is impacting my performance. There are definitely some middle-of-the-road settings that deliver benefit and don’t hurt any particular element of performance too much, but there’s no magic setting here. Further to the point, the setting that is perhaps best here is 4096/4096, but that consumes a lot of RAM. I’m not sure it’s worth it; it’s not that much better than some of the other settings. The scripts and spreadsheet are available if anyone wants them.
Next, I ran the same process on my other server (my main server). Interestingly, the throughput on this server is dramatically slower – all the tests ran at about 30% of the throughput of my nominally less capable server, at all settings. Hmm. I then tried using “disk utility” directly against the drives and the RAID arrays, and that gave much better results – similar to the second server. Hmm.
So, overall, my thought is that I need to look further afield for my performance issues. As noted in the earlier post, I’ve tracked my secondary server’s problem (where the main issue was VM throughput) to RAM settings, not RAID. And on my main server, I think the problem lies between the arrays and the file system – so I’ll be looking more closely at the LVM and ext4 settings (maybe stride and stripe are useful after all) when I get time.
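For when I get there, the stride/stripe-width arithmetic is at least straightforward. A sketch, assuming a 512KB md chunk, 4KB ext4 blocks and 4 data disks (i.e. a 5-disk RAID5) – the numbers and the logical volume name are placeholders for my real geometry:

```shell
#!/bin/sh
# Sketch: derive ext4 stride and stripe-width from the array geometry.
# All three inputs are assumptions - read the real chunk size from /proc/mdstat.
CHUNK_KB=512     # md chunk size per disk
BLOCK_KB=4       # ext4 block size
DATA_DISKS=4     # disks minus parity (5-disk RAID5 -> 4)

STRIDE=$((CHUNK_KB / BLOCK_KB))         # fs blocks per chunk
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))   # fs blocks per full stripe
echo "stride=$STRIDE stripe-width=$STRIPE_WIDTH"

# Applied at mkfs time, or retro-fitted with tune2fs (volume name hypothetical):
#   tune2fs -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/vg0/mythlv
# And the LVM layer carries its own read-ahead setting:
#   lvchange -r 2048 /dev/vg0/mythlv
```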