btrfs: why I don’t see it as “rampant layer violation”

There are many who aren’t thrilled with the btrfs design, seeing it as a “rampant layer violation.”  What does this mean, and why do I think that’s not a problem?

Linux stacks its block devices (hard disks, SSDs, CD-ROMs etc) in layers.  A typical stack might be:

  1. The base devices – say a set of hard drives in your computer, typically called /dev/sda, /dev/sdb etc
  2. The md RAID driver, which combines those disks into a logical device, typically called /dev/md1 or similar.
  3. The logical volume manager, LVM2.  This will take your RAID device and expose it as a volume group, such as /dev/vgMain, and allow you to allocate logical volumes off that, such as /dev/vgMain/lvRoot, /dev/vgMain/lvHome and so forth.  These volumes act much like partitions on a traditional hard drive, but you can move them around without taking them offline – so you can alter the size or location of a logical volume while it is in use.
  4. The filesystem, ext4 or btrfs or jfs.  This formats the logical device to store data.

The advantage of this structure is that you can combine the layers in different ways.  For example, you can add an encryption layer, and use it to encrypt the base devices, or the RAID device, or just a single logical volume.  You can create logical volumes and then make RAID devices out of those, or you can add a distributed block device like DRBD.

This is very flexible and very standard, and it makes it easy for people to replace or do without elements of the stack.
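
The reason the layers recombine so freely is that each one presents the same block-device interface to the layer above it.  A minimal Python sketch of that idea follows; the class names (Disk, Mirror, XorLayer) are hypothetical, the disks are in-memory lists, and the XOR layer is only a toy stand-in for encryption, not anything the kernel actually does.

```python
class Disk:
    """Hypothetical base device: an in-memory list of fixed-size blocks."""

    def __init__(self, blocks: int, block_size: int = 4) -> None:
        self.blocks = [bytes(block_size) for _ in range(blocks)]

    def read(self, n: int) -> bytes:
        return self.blocks[n]

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = data


class Mirror:
    """Toy RAID1 layer: writes go to every member, reads come from the first."""

    def __init__(self, *members) -> None:
        self.members = members

    def read(self, n: int) -> bytes:
        return self.members[0].read(n)

    def write(self, n: int, data: bytes) -> None:
        for member in self.members:
            member.write(n, data)


class XorLayer:
    """Toy stand-in for an encryption layer. XOR is NOT real encryption."""

    def __init__(self, lower, key: int) -> None:
        self.lower = lower
        self.key = key

    def read(self, n: int) -> bytes:
        return bytes(b ^ self.key for b in self.lower.read(n))

    def write(self, n: int, data: bytes) -> None:
        self.lower.write(n, bytes(b ^ self.key for b in data))


# Encryption stacked on top of a mirror ...
sda, sdb = Disk(8), Disk(8)
top = XorLayer(Mirror(sda, sdb), key=0x5A)
top.write(0, b"home")

# ... or a mirror built out of individually encrypted disks.
sdc, sdd = Disk(8), Disk(8)
alt = Mirror(XorLayer(sdc, key=0x5A), XorLayer(sdd, key=0x5A))
alt.write(0, b"home")
```

Both orderings work because no layer needs to know what sits above or below it.  That same ignorance is the source of the limits described next.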

It also limits what you can do in some ways.  One example is syncing a new drive into a RAID array.  Given the layers described above, if I insert a new drive then the RAID layer will sync the whole disk, because it has no way of knowing which blocks the filesystem actually uses – even if that filesystem is mostly empty.  Tuning is another example: each layer has its own tunables, and when they’re not aligned performance is poor.  This isn’t critical, and someone with the right skills can tune it, but it’s not as easy as it could be.
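
To put rough numbers on the sync example, here is a back-of-the-envelope Python sketch.  The drive size, fill level and throughput are assumptions picked purely for illustration, not measurements.

```python
def resync_hours(bytes_to_copy: float, bytes_per_second: float) -> float:
    """Time to copy the given amount of data at a steady rate, in hours."""
    return bytes_to_copy / bytes_per_second / 3600


# Assumed figures, for illustration only.
DRIVE_BYTES = 4e12      # a 4 TB replacement drive
USED_FRACTION = 0.10    # the filesystem is 10% full
THROUGHPUT = 150e6      # 150 MB/s sustained copy rate

# A block-level RAID layer cannot see allocation, so it copies everything.
whole_device = resync_hours(DRIVE_BYTES, THROUGHPUT)

# A filesystem-aware rebuild only needs to copy what is in use.
used_only = resync_hours(DRIVE_BYTES * USED_FRACTION, THROUGHPUT)

print(f"whole device:   {whole_device:.1f} h")   # 7.4 h
print(f"used data only: {used_only:.1f} h")      # 0.7 h
```

Under these assumptions the saving is simply proportional to how empty the filesystem is.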

Btrfs is taking a different path.  It works directly on the physical devices, and offers a built-in RAID-like system, a volume manager and other functions.  In that sense it duplicates functionality that already lives elsewhere in the kernel, and probably does none of those things as well as the dedicated layer for that purpose.  But folding everything into a single monolithic layer brings real advantages in making sure that all these pieces work together sensibly.  Examples are:

  • Allowing quotas to be set at the directory level, not just for a logical device
  • Allowing RAID to be built from devices of unlike sizes, while still making use of all the space on them
  • Allowing different RAID levels for metadata and for data (mixed levels per directory or subvolume have been proposed, but as far as I know are not implemented)
  • Simplifying resync and replace operations – only the portions of the drive that are actually used need be synced
  • Allowing smarter error recovery – md RAID doesn’t have a good way to repair data when the parity doesn’t match the data read, because it cannot tell which of the two is wrong.  Btrfs checksums the data, so it can work out which drive holds the incorrect copy and recreate the correct data, rather than making a best guess as to which is right
  • Easier performance tuning – the whole stack runs off the same stripe sizes and caches, there isn’t duplication at the different layers
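
The error recovery point can be sketched in a few lines of Python.  This is a simplified illustration, not btrfs code: it uses a two-way mirror rather than parity, the function name read_with_repair is hypothetical, and the standard library’s CRC-32 stands in for whatever checksum the filesystem really stores.

```python
import zlib


def read_with_repair(copies: list, checksum: int) -> bytes:
    """Return the data from a copy that matches the stored checksum,
    and overwrite any copy that does not match with the good data."""
    good = next((bytes(c) for c in copies if zlib.crc32(c) == checksum), None)
    if good is None:
        raise IOError("no copy matches the stored checksum")
    for c in copies:
        if zlib.crc32(c) != checksum:
            c[:] = good  # repair the bad copy in place
    return good


data = b"important data"
stored = zlib.crc32(data)        # checksum recorded when the data was written

mirror_a = bytearray(data)
mirror_b = bytearray(data)
mirror_a[0] ^= 0xFF              # simulate silent corruption on one drive

result = read_with_repair([mirror_a, mirror_b], stored)
```

Without the stored checksum, a mirror that finds two differing copies has no basis for choosing between them.  With it, the bad copy is identified and rewritten.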

I think in concept all these advantages could be achieved in the old layered model if the right interfaces were in place – the RAID driver could ask the filesystem which blocks were actually in use, for example.  In practice that is unlikely to happen, and hasn’t happened to date.  I think that btrfs will give an easy to use and performant solution for the average Joe user, and is better than what we had before.  It is what an economist might call a second-best solution: one that is actually optimal if the theoretically best solution cannot be achieved.  I’d also note that many people speak very approvingly of ZFS, and ZFS does exactly this – it’s part of what makes it good.

I’m very keen for btrfs with the full RAID levels to become stable so that I can start using it – I reckon it’d make my setup much more flexible.  I’d probably still need some partitions with the old-school setup for running my virtual machines from, though – I doubt things like DRBD will work with btrfs.

