Linux-natives blog: Linux and open source – technology and strategy

The perfect Btrfs setup for a server


Btrfs is probably the most modern of all the filesystems widely used on Linux. In this article we explain how to use Btrfs as the only filesystem on a server machine, and how that enables some sweet capabilities: very resilient RAID-1, flexible adding and replacing of disk drives, snapshots for quick backups, and so on.


The techniques described in this article were tested on an Ubuntu 16.04 server install, but they apply to any system with roughly the same versions of btrfs, Grub (2.02), the kernel (4.4) and the like.

The hardware requirements for a Btrfs-based RAID-1 disk setup are very flexible. The number of disks can be two or more, and the disks in the RAID array do not need to be the same size, because Btrfs RAID-1 works at the data level and not only at the device level like traditional mdadm. Btrfs also includes the features traditionally provided by LVM, so it conveniently replaces both mdadm and LVM with a single easy-to-use tool. A good practice is to start with 2–4 disks and later add new disks when more space is needed, sized at whatever offers the best price per gigabyte at that time.

The Btrfs setup

When the hardware is ready, the next step is to install the operating system (i.e. Linux). During the partitioning phase, create one big partition on each disk that fills the whole disk. There is no need to create a /boot partition or a swap partition. For Grub compatibility we need to create a real partition (e.g. sda1, sdb1, ...) on every disk rather than assign the whole raw disk to Btrfs, even though Btrfs would support that too. Remember to mark every primary partition (e.g. sda1, sdb1, ...) as bootable in the partition table.
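If some of the disks are partitioned by hand rather than in the installer, the same layout can be created with parted; a minimal sketch (the device name is only an example, repeat for each disk):

# Create an MBR partition table with a single bootable Btrfs partition filling the disk
sudo parted -s /dev/sdb mklabel msdos
sudo parted -s /dev/sdb mkpart primary btrfs 1MiB 100%
sudo parted -s /dev/sdb set 1 boot on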

After the partitioning step, select the first disk partition (e.g. sda1) as the root filesystem and use Btrfs as the filesystem type. Complete the installation and boot.

After boot you can expand the root filesystem to use all disks with the command:

sudo btrfs device add /dev/sdb1 /dev/sdc1 /dev/sdd1 /

You can check the status of the Btrfs filesystem with btrfs fi show (fi is short for filesystem):

$ sudo btrfs fi show
Label: 'root' uuid: 31e77d75-c07d-44dd-b969-d640dfdf5f81
Total devices 4 FS bytes used 1.78GiB
devid 1 size 884.94GiB used 4.02GiB path /dev/sda1
devid 2 size 265.42GiB used 0.00B path /dev/sdb1
devid 3 size 283.18GiB used 0.00B path /dev/sdc1
devid 4 size 265.42GiB used 0.00B path /dev/sdd1

This pools the devices together and creates one big root filesystem. To make it RAID-1, run:

sudo btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /
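The conversion runs as a balance operation and can take a while on a filesystem that already contains data; its progress can be followed with:

sudo btrfs balance status /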

After this, the usable disk space roughly halves, but the filesystem becomes resilient against single disk failures. Read speeds may also improve a bit, as data can be accessed in parallel from at least two devices.

The newer btrfs fi usage command shows how disk space is used and how much is still available:

$ sudo btrfs fi usage /
Overall:
 Device size: 1.66TiB
 Device allocated: 6.06GiB
 Device unallocated: 1.65TiB
 Device missing: 0.00B
 Used: 3.53GiB
 Free (estimated): 846.76GiB (min: 846.76GiB)
 Data ratio: 2.00
 Metadata ratio: 2.00
 Global reserve: 32.00MiB (used: 0.00B)

Data,RAID1: Size:2.00GiB, Used:1.69GiB
 /dev/sda1 2.00GiB
 /dev/sdc1 2.00GiB

Metadata,RAID1: Size:1.00GiB, Used:72.81MiB
 /dev/sda1 1.00GiB
 /dev/sdc1 1.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
 /dev/sda1 32.00MiB
 /dev/sdc1 32.00MiB

Unallocated:
 /dev/sda1 881.90GiB
 /dev/sdb1 265.42GiB
 /dev/sdc1 280.15GiB
 /dev/sdd1 265.42GiB

By default the Linux system will hang at boot if any of the devices used by the Btrfs root filesystem is missing. This is not ideal in a server environment: we would rather have the system boot and continue to operate in a degraded mode, so that services keep running and administrators can log in remotely to assess the next steps.

To enable Btrfs to boot in degraded mode we need to add the ‘degraded‘ mount option in two locations. First we need to make sure that Grub can mount the root filesystem and access the kernel. To do that we edit the rootflags line in /etc/grub.d/10_linux to include the option ‘degraded‘ like this:

GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX}"

For the Grub config change to take effect we need to run ‘update-grub‘ and after that install the new Grub on the master boot record (MBR) of every disk. That can easily be scripted like this:

for x in a b c d; do sudo grub-install /dev/sd$x; done

Secondly, we need to allow the Linux system to mount its filesystems in degraded mode by adding the same option to /etc/fstab like this:

UUID=.... / btrfs degraded,noatime,nodiratime,subvol=@ 0 1

Note that the options noatime and nodiratime have also been selected. They increase performance at the cost of not recording access times for files and directories, a feature that is hardly ever used by anything, so in practice there is no drawback.
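After a reboot the active mount options of the root filesystem can be verified, for example with findmnt (part of util-linux):

findmnt -n -o OPTIONS /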

With the setup above we now have a system with four disks, each containing one partition, and those partitions are pooled together with Btrfs RAID-1. If any one of the disks fails, the system continues to operate and can also boot again after a reboot (thanks to the mount option ‘degraded’), and it does not matter which of the disks breaks, as any disk is good for booting (thanks to having Grub in every disk’s MBR). If a disk failure occurs, it is up to the system administrator to detect it (e.g. from syslog), add a new disk and run ‘btrfs replace start...‘ as explained in our Btrfs recovery article.
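As a rough sketch of that recovery step (the devid 2 and the device /dev/sde1 are only example values; the new disk must first be partitioned like the others):

sudo btrfs replace start 2 /dev/sde1 /   # replace the missing devid 2 with the new partition
sudo btrfs replace status /              # follow the progress of the rebuild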

Using ZRAM for swap

Note that this setup does not have any swap partitions. We can’t put a swap partition on a raw disk, as there would be no redundancy: if that disk failed, the swap partition and all memory paged out to it would be lost, and the kernel would most likely panic and halt. As Btrfs RAID-1 does not operate on the block level, we cannot put a swap partition on top of it either. We could have a swap file, but Btrfs isn’t any good for keeping swap files. Our solution was not to have any traditional swap partition at all, and instead use ZRAM to store swapped-out memory in a compressed format in RAM.

To install zram simply run:

sudo apt install zram-config

After the next reboot there will automatically be a zram device that the system uses for swapping. It does not matter how much RAM a system has: at some point the kernel will swap something out of active memory anyway in order to use the active memory more efficiently. Using ZRAM for swap keeps that data off the real disks and therefore makes both swapping out and swapping in faster (at the cost of some extra CPU use).
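Whether the ZRAM swap is active can be checked with the standard util-linux tools (the exact number of zram devices depends on the zram-config defaults):

swapon --show   # active swap areas; /dev/zram* devices should be listed
zramctl         # zram devices with their compression algorithm and sizes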

Using snapshots

Would you like to make a full system backup that initially does not consume any extra disk space? On a copy-on-write filesystem like Btrfs it is possible to create snapshots that act as a window into the state of the filesystem at a certain point in time.

A practical way to do this is to have a directory called /snapshots/ under the root filesystem and save snapshots there at regular intervals (a cron sketch follows the listing below). Using the -r option we make the snapshot read-only, which is ideal for backups.

$ sudo mkdir /snapshots
$ sudo btrfs subvolume snapshot -r / /snapshots/root.$(date +%Y%m%d-%H%M)
Create a readonly snapshot of '/' in '/snapshots/root.20160919-0954'

$ tree -L 3 /snapshots

/snapshots
`-- root.20160919-0954
 |-- bin
 |-- boot
 |-- dev
 |-- etc
 |-- home
 |-- initrd.img -> boot/initrd.img-4.4.0-36-generic
 |-- initrd.img.old -> boot/initrd.img-4.4.0-31-generic
 |-- lib
 |-- lib64
 |-- media
 |-- mnt

...
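Taking such snapshots at regular intervals is easy to automate; a sketch of a cron entry (the schedule, path and name format are only examples):

# /etc/cron.d/btrfs-snapshot: daily read-only snapshot of the root filesystem at 03:00
0 3 * * * root btrfs subvolume snapshot -r / /snapshots/root.$(date +\%Y\%m\%d)

Old snapshots can be removed with btrfs subvolume delete when they are no longer needed.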

To be able to track how much disk space a snapshot uses, or more precisely to view the amount of data that has changed between two snapshots, we can use Btrfs quota groups. They are not enabled by default, so start by running:

$ sudo btrfs quota enable /

After that you can view the disk usage of the subvolumes (including snapshots):

$ sudo btrfs qgroup show /

qgroupid    rfer      excl
--------    ----      ----
0/5         16.00KiB  16.00KiB
0/257       1.75GiB   47.74MiB
0/258       48.00KiB  48.00KiB
0/267       0.00B     16.00EiB
0/268       48.00KiB  16.00EiB
0/269       1.75GiB   44.95MiB

To find out which subvolume ID is mounted as what, list them with:

$ sudo btrfs subvolume list /
ID 257 gen 5367 top level 5 path @
ID 258 gen 5366 top level 5 path @home
ID 269 gen 5354 top level 257 path snapshots/root.20160919-0954

To make a subvolume the new root filesystem (after a reboot), study the btrfs subvolume set-default command, and to manipulate other properties of subvolumes, see the btrfs property command.
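For example, to boot into the snapshot with ID 269 from the listing above (note that if subvol=@ is hard-coded in /etc/fstab and in the Grub rootflags, those references need to be adjusted as well):

sudo btrfs subvolume get-default /       # show the current default subvolume
sudo btrfs subvolume set-default 269 /   # use subvolume 269 as the root on the next boot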

 


Linux-natives – a blog by Linux experts from Finland – is brought to you by Seravo, a Finnish company focused on open source software and services.

Our team provides premium hosting and upkeep for your WordPress website - with open source software.

24 thoughts on “The perfect Btrfs setup for a server”

  1. Gerrit says:

    Hi there,
    do you know any reason why I should NOT use raw devices (i.e. “/dev/sdb”) with btrfs in RAID1? Besides the GRUB compatibility issue mentioned in the article, of course.
    I read somewhere that there was/is(?) a bug in the btrfs RAID implementation where you can lose data if one disk fails in a RAID1 on raw devices. Did you ever hear about that?

    Thanks for your reply.

    Best regards
    Gerrit

    1. Greg says:

      I have been testing BTRFS in a single disk multiple partition config and have had all sorts of problems with booting and mounting even mounting with degraded read only options has proven at least for me impossible at times. I have assigned multiple partitions mounted by UUID and even with just one corrupt partition seen as a UUID device you get wrong fs, device missing errors and I have not worked out cloning either. I was hoping to run BTRFS for all my Linux machines with Windows REFS as intermediate storage while processing photos and use BTRFS for network storage for protection from corruption and bitrot. I’m using Freenas with ZFS and ECC memory for primary storage and NTFS, EXT4, BTRFS and sometimes XFS for laptops, desktop/workstations and may soon deploy REFS as intermediate storage. I don’t dedupe and get by fine with 16GB of memory and in fact pulled out one of the CPU’s on the ZFS machine since all 8 cores (two 4 cores) were idle during max samba transfer. Iscsi proved unreliable for remounting. I would suggest looking at SuSe’s sub-volume setup especially if you have a lot of writes. Test, test and then retest recovery/replacement scenarios before you even consider BTRFS for production use. I would want to see broad 3rd party tool support before mission critical deployment or at least a proven track record like ZFS.

  2. Jamie says:

    BTRFS documentation states that RAID1 “Needs at least two available devices always. Can get stuck in irreversible read-only mode if only one device is present.”(https://btrfs.wiki.kernel.org/index.php/Status).

    Specifically, degraded rw will work ONLY ONCE. Given this, would it be better NOT to mount degraded by default in order to make sure you are ready with a replacement disk?

  3. nero 50 says:

    would you show how you partitioned?

  4. foobar says:

    How about also including GPT in the examples?

  5. Kumo Isao says:

    Hello,

    Thanks for sharing. I have a question though. You mentioned creating a partition that fills the entire disk, but then said to create a real partition (sda1 etc.) and not assign the whole disk to btrfs. Isn’t this contradictory? Would you be able to elaborate on this? I am working on an EFI system, what’s your recommendation for partitioning the disks? Do I have to keep a separate ESP partition for the bootloader? Thanks.

  6. Anton says:

    Have you ever seen this problem on boot:

    BTRFS: failed to read the system array: -5
    BTRFS: open_ctree failed
    mount: mounting /dev/sda2 on /root failed: Invalid argument

    It occurs if you physically remove one disk from an array

  7. Simon says:

    Great article, thank you very much! Just one question for a 2 disc raid1 setup:
    – Do I need an EFI partition on both drives in case of a disc failure?

  8. HT Geek says:

    No. AVOID BTRFS’ native RAID.

    From the BTRFS wiki: “The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.”

    There was a RAID-5/6 code update August 2017 to fix a write hole issue, but so far I haven’t heard anything positive about it. The warning (quoted in part above) is still on the BTRFS Wiki.

    Source: RAID56, Btrfs wiki, 4 August 2017.

    1. Steve Davis says:

      I believe you are referring to RAID56, the btrfs Wiki states that RAID0 and RAID1 are both stable. They’re working on improving some extant performance problems with mirroring, but it’s definitely stable. Since btrfs is still under heavy, rapid development I don’t know that I’d use it in any mission-critical production environments anyway, but the need for broadly supported modern filesystems like this is something I’m at the very least willing to contribute to simply by using it.

  9. Greg Collins says:

    What about upgrading, making major changes or even switching from Ubuntu to SUSE or Red Hat while keeping data safe? Losing the OS is an annoyance and a time waster; losing data is more serious. It looks like the OS is redundant here, but should it not always be separate from the data? I understand sub-volumes, but it seems like the risk of overwriting or completely losing data is not worth the simplicity and perhaps a little space efficiency. I suppose the safest way would be to completely transfer the data first, but don’t you risk data corruption during transfer and risk not being able to mount btrfs? In my case btrfs would be strictly backup storage with data checksums, but it would need to be updated at least weekly. Certainly I don’t need to worry about read/write performance, probably not even fragmentation, but I do care about data integrity and safety. Using ReFS and ZFS now, ZFS strictly for storage; no more NTFS, ext4 or XFS except ext4 for the OS and NTFS for the OS.

  10. Mihail Gershkovich says:

    Testing an Opensuse build. I’ll make it simple:
    1. Fresh install on BTRFS with snapshots, no dedicated home directory, on dedicated ssd (or two SSDs). YAST will autopartition for you
    2. create dedicated btrfs “pools” / volumes with other block devices, no partitioning needed, use the raw drives.
    3. use ZRAM, zram is really cool! If you want to hibernate, use a dedicated swap device.
    4. Use btrfs compression: very-very cool, enable btrfs features on SAMBA (if you use samba).

    1. james hartley says:

      Did you get this to work with leap 43 suse. Specifically are you able to boot with one drive disconnected from your / mirror. I am using grub2 as well… if you got boot to work in degraded mode can you give me the details on how you got everything set up… I have everything working except for booting in degraded mode. I am not sure why .. but the boot appears to hang with the message

      start job for dev-disk dev=

      Thanks for any help.

      james

      1. james hartley says:

        again that is leap 42.3 … sorry

  11. Ersatzreifen says:

    I just invested in a new server box and motherboard, etc., with future expansion in mind. It can hold up to 12 sata drives. I have three 4TB drives which I want to configure as follows:
    sda – 4TB raid1
    sdb – 4TB raid1
    sdc – reserved exclusively as a manual backup drive.
    All with btrfs using raw drives (no partitions at all.)

    OS will be OpenSUSE, but without the separate home partition – just the raw device. I will use snapshots to back up the $home subvolume. I also don’t care at all for OpenSUSE’s subvolume layout, it’s counterintuitive and bloated, so I’ll customize it for better protection of my data.

    Since I’m new at btrfs, I need to know how to go about properly setting it up to work this way, beginning with setting up the drives. I have been told to not use the mbr, so can someone help guide me in getting it set up correctly? …and then how to get the OpenSUSE installer to use this setup without trying to install partitions?

    I’m planning to do the first post of this motherboard on January 1st. Maybe sooner, but I only have two SATA cables on hand.

  12. Jason says:

    Thank you for this, it’s really interesting. I am not able to remove any drive though and successfully reboot. Depending on which drive I pull, and how the drives enumerate, I may get stuck in grub or I may get stuck in initramfs.

    Grub is unable to find my device: /dev/sdc1 is missing. Initramfs gives the same BTRFS error listed above by another user. I have added degraded to my kernel options and fstab. Using Debian stretch.

    Seems like grub can’t cope with drive re-enumeration. Not sure why initramfs fails. Maybe the degraded flag isn’t being passed properly?

  13. Stephen Hill says:

    I have an ext4 boot partition, and just root as BTRFS RAID1 (3 disks), so I assume I only need the degraded option in /etc/fstab?

  14. james hartley says:

    I cannot get the system to boot when I break the root mirror. I am using a two drive setup and followed the instructions. The system will boot with two drives but fails when one or the other of the mirror drives is not plugged in. Also, Leap 4.3 uses grub2. My question is: did anyone get this to work with Leap 4.3 and grub2? If so, could you provide me with some details on how you did this: whether you installed in the MBR on each drive or just the root partition, and if you used zram, what you did, if anything, to configure that service.

    james

    1. james hartley says:

      oops, that’s Leap 42.3 SUSE Linux.

  15. Michael Gaehme says:

    Good morning

    I’ve been running this setup for some years now. In the last days I was upgrading 17.10 to 18.04.1 and got stuck in initramfs.
    The system was not able to mount the btrfs volume of 4 disks.
    Error was something like

    [38257.552648] BTRFS warning (device sdd): failed to read tree root
    [38257.586065] BTRFS error (device sdd): open_ctree failed

    I was afraid for my data integrity, but everything was in perfect condition, as shown by btrfs check, btrfs balance etc…

    Solution: add device=/dev/sda3,device=/dev/sdb3,device=….

    add it to the Grub-config too.

    Boots and runs perfectly now ;-)

    cheers
    Micha

  16. LinAdmin says:

    Although it normally will work, I would never ever want to have a server with only one huge partition for reasons of robustness.

    I always have first a root partition of approx. 5-10GB.

    Then a separate log partition because I absolutely want to avoid that in absurd situations logs could fill up my root partition.

    Data or home takes the remaining space in a huge partition.

  17. Frank Hardy says:

    I use btrfs but not on my boot drive; I stick with ext4 for that. First I delete all the partitions on the drives, then use btrfs to take control of them. I am not sure how the RAID 1 works and was worried about it losing data on multiple drives if one fails. I just mount every drive like drive1, drive2, drive3, then use mergerfs to join all the drives into a large volume. Until I can find out more about how the RAID 1 works I am avoiding it.
    So far this has been working, with the exception of a 12 TB btrfs drive that seems to take over a minute to load from fstab, so it fails to mount. I hope one of the developers finds a way to fix that. I have about 20 drives in my mount.
    Next I am creating another set of 10 drives just for handling text and such. I plan to try RAID 1 on this unit and see how it works if a drive disappears.

  18. Marcus says:

    I used this guide step by step to set up a Btrfs RAID 10 a couple of years ago. OMG!.. It has been working perfectly! No corruption or issues at all. Thank you so much!!

  19. Cerem Cem ASLAN says:

    Please add a bold warning about the side effects of placing the `degraded` option in Grub:

    “You can mount your rootfs as read-write only once. This means that you may end up with a read-only rootfs if your computer accidentally (or intentionally) restarts while you have already booted in degraded mode.”

    You SHOULD add this warning to the “degraded” option section.
