RAID performance again

Further information in my saga.

On my main server, which hosts my media centre / myth setup, I am still having issues.  I’m pretty sure that the recording dropouts I’m getting are directly correlated with disk performance.  I’ve gotten rid of the other errors that I had with the recordings, so I’m now left with only a series of warnings like the following:

2013-02-07 14:41:44.538606 W [10163/15173] TFWWrite ThreadedFileWriter.cpp:500 (DiskLoop) - TFW(/usr/share/mythtv/recordings/6008_20130207030900.mpg:85): write(57716) cnt 19 total 1087956 -- took a long time, 1568 ms

This is sometimes followed by the error below, which seems to be correlated with a recording dropout (which makes sense given the error message):

2013-02-06 10:18:35.013307 E [10163/1005] HDHRStreamHandler ThreadedFileWriter.cpp:217 (Write) - TFW(/usr/share/mythtv/recordings/6030_20130205231500.mpg:97): Maximum buffer size exceeded.
                        file will be truncated, no further writing will be done.
                        This generally indicates your disk performance 
                        is insufficient to deal with the number of on-going 
                        recordings, or you have a disk failure.

Clearly IO performance is an underlying issue here.
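To see how often and how badly the writes stall, the warnings can be pulled out of the backend log and sorted by duration. The following is only a rough sketch: it assumes each warning sits on a single line in the format shown above, and the log file location is not known here, so it is passed on the command line. The names `SLOW_WRITE` and `slow_writes` are made up for this example.

```python
import re
import sys

# Matches the ThreadedFileWriter "took a long time" warnings shown above.
# Assumes the whole warning is on one line of the log.
SLOW_WRITE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ W .*"
    r"TFW\((?P<path>[^:]+):\d+\): write\(\d+\)"
    r".*took a long time, (?P<ms>\d+) ms"
)


def slow_writes(lines):
    """Yield (timestamp, recording path, milliseconds) for each slow write."""
    for line in lines:
        match = SLOW_WRITE.search(line)
        if match:
            yield match["stamp"], match["path"], int(match["ms"])


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller; nothing is assumed about it.
    with open(sys.argv[1], errors="replace") as log:
        hits = sorted(slow_writes(log), key=lambda hit: hit[2], reverse=True)
    for stamp, path, ms in hits[:20]:
        print(f"{stamp}  {ms:6d} ms  {path}")
```

Lining the timestamps up against the "Maximum buffer size exceeded" errors should show whether the truncated recordings really do follow the longest stalls.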

I note that when recording, myth can be quite hungry on disk.  It’s often recording 3-4 streams or more, since a number of channels simulcast shows (usually one in HD and a couple of random SD ones) with subtly different program listings, so myth often doesn’t notice that they’re duplicates.  It then starts up a commercial flagging job for each recording in parallel, reading it back and scanning for ad breaks.  Overall it’s still only doing about 15MB per second, but that seems to exceed what the array can currently deliver.

In the near term I am reducing the number of parallel commercial flagging jobs to reduce the likelihood of contention, but I still need to fix array performance.

Overnight I pushed my test server into single user mode, on the theory that the variation in my results was perhaps driven by other jobs running on the machine.  On a Debian server you do this with “init 1”.  I also learned that you don’t do this with “init 0”, as that halts the machine, much like shutdown -h.  I now have very consistent and repeatable results, which is great, and they look something like this:
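For anyone wanting to run a similar grid, here is a rough sketch of how the combinations could be generated. It only prints the commands, it does not run them. The device name `md0`, the test directory `/mnt/test`, and the idea of setting read ahead with `blockdev --setra` are assumptions for illustration, not a record of exactly what I ran.

```python
import itertools


def sweep_commands(md_name, test_dir, stripe_sizes, readahead_sectors):
    """Return one list of shell commands per (stripe cache, read ahead) pair."""
    runs = []
    for stripe, ra in itertools.product(stripe_sizes, readahead_sectors):
        runs.append([
            # md stripe cache size (RAID 4/5/6 arrays only)
            f"echo {stripe} > /sys/block/{md_name}/md/stripe_cache_size",
            # block-layer read ahead, counted in 512-byte sectors
            f"blockdev --setra {ra} /dev/{md_name}",
            # benchmark: -d is the directory to test in, -u the user to run as
            f"bonnie++ -d {test_dir} -u root",
        ])
    return runs


if __name__ == "__main__":
    # md0 and /mnt/test are placeholders; substitute the real array and mount.
    grid = sweep_commands("md0", "/mnt/test",
                          [256, 512, 1024, 2048, 4096],
                          [1024, 2048, 4096, 8192])
    for commands in grid:
        print("\n".join(commands))
        print()
```

Printing rather than executing makes it easy to review the list before pasting it into a root shell in single user mode.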

Minimum
              RA 1024   RA 2048   RA 4096   RA 8192
Stripe  256    63.16%    68.30%    61.17%    63.73%
Stripe  512    60.43%    61.54%    65.05%    81.71%
Stripe 1024    83.11%    71.06%    80.88%    70.16%
Stripe 2048    68.76%    83.61%    79.30%    70.35%
Stripe 4096    76.21%    56.78%    76.64%    78.82%

Average
              RA 1024   RA 2048   RA 4096   RA 8192
Stripe  256    78.39%    86.11%    79.89%    81.64%
Stripe  512    88.22%    89.05%    89.40%    92.11%
Stripe 1024    92.35%    87.68%    90.50%    88.55%
Stripe 2048    93.46%    94.63%    94.12%    92.63%
Stripe 4096    90.89%    86.88%    90.02%    90.49%
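One way to read a grid like this is to average along each axis and compare how much each setting moves the result. A minimal sketch using the Average figures from the table above (the helper names `axis_means` and `spread` are made up for this example):

```python
from statistics import mean

RA = [1024, 2048, 4096, 8192]

# Average scores from the table, keyed by stripe cache size.
AVERAGE = {
    256:  [78.39, 86.11, 79.89, 81.64],
    512:  [88.22, 89.05, 89.40, 92.11],
    1024: [92.35, 87.68, 90.50, 88.55],
    2048: [93.46, 94.63, 94.12, 92.63],
    4096: [90.89, 86.88, 90.02, 90.49],
}


def axis_means(grid, columns):
    """Return (mean per stripe size, mean per read ahead value)."""
    by_stripe = {stripe: mean(row) for stripe, row in grid.items()}
    by_ra = {ra: mean(row[i] for row in grid.values())
             for i, ra in enumerate(columns)}
    return by_stripe, by_ra


def spread(values):
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


if __name__ == "__main__":
    by_stripe, by_ra = axis_means(AVERAGE, RA)
    for stripe, score in by_stripe.items():
        print(f"stripe {stripe:5d}: {score:.2f}")
    for ra, score in by_ra.items():
        print(f"RA     {ra:5d}: {score:.2f}")
    print(f"spread across stripe sizes: {spread(by_stripe.values()):.2f}")
    print(f"spread across read ahead:   {spread(by_ra.values()):.2f}")
```

On these figures the stripe means differ by roughly twelve points while the read ahead means differ by under half a point, and stripe 2048 has the highest row mean.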

This tells me I’m getting little variation from read ahead, but there is a reasonably clear minimum threshold for stripe cache.  The ideal point looks to be around 1024/1024 or 2048/2048.  Of course, looking further into it, I have a question about whether my read ahead setting is having any effect at all, and I note there is also a read ahead setting on the LVM.  So I will also try tuning that to see if I get more variation.  I also have a little concern that bonnie++ might not be a fully representative load, so I may try another load testing tool as well to check that I have the right process.  Or maybe run bonnie++ in multi-threaded mode.  But, overall, progress is being made.
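A quick way to check which read ahead value each layer actually has is to read it back from sysfs. This is a small sketch, assuming a Linux box where the md array and the LVM volumes show up under /sys/block (LVM volumes normally appear as dm-N). Note that `read_ahead_kb` is in KiB, whereas blockdev --setra counts 512-byte sectors. The function name is made up for this example.

```python
from pathlib import Path


def read_ahead_kb(sysfs_block="/sys/block"):
    """Map each block device name to its current read ahead in KiB."""
    report = {}
    for dev in sorted(Path(sysfs_block).iterdir()):
        setting = dev / "queue" / "read_ahead_kb"
        if setting.is_file():
            report[dev.name] = int(setting.read_text().strip())
    return report


if __name__ == "__main__" and Path("/sys/block").is_dir():
    for name, kb in read_ahead_kb().items():
        print(f"{name:10s} {kb:6d} KiB")
```

If the md device shows the tuned value but the dm device sitting on top of it still shows a default, that would explain why changing read ahead on the array alone makes so little difference.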

It still leaves me questioning why one server regularly produces results around 200MB/s, and the other seems stuck at 15MB/s.  That is really my underlying problem to resolve.

