Added an SSD to my system - with ZFS!

Yay, a new SSD is in the house: a WD SN850X. I want to move the part of an existing partition that holds all my source code onto it; in my case that is /home/src. Let's do it with ZFS, which is a first for me. Why ZFS? Well, the benchmarks seemed enticing (see below).

Install

Everything has to be run as sudo/root!

Prerequisites

I am on Ubuntu 22.04.1 LTS and I need a few packages to get going:

apt install nvme-cli zfsutils-linux gdisk smartmontools

Setup SSD

Find device

I installed the device in a PCIe expansion card that can hold up to 4 NVMe SSDs. After changing the PCIe slot's BIOS setting to 4x4x4x4 (bifurcation), the SSD shows up as /dev/nvme2n1:

nvme list

Setup environment

To avoid stupid typos, I use a few environment variables. Because ZFS degrades horribly if it becomes too full, I dedicate 20% of the SSD to an untouchable "reservation" that prevents the pool from filling beyond 80%. With my ~1800 GB SSD, that reservation comes out to 360 GB.

SSD="/dev/nvme2n1"
PARTITION="${SSD}p1"
ZFS_RESERVATION="360G"
ZFS_MOUNTPOINT="/home/src"
ZFS_POOL="zfs-pool"
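
If you would rather derive the reservation than hard-code it, a minimal sketch (the 1800 is just my drive's usable size in GB, an assumption you should adjust):

SSD_SIZE_GB=1800                            # usable size of this particular SSD
ZFS_RESERVATION="$(( SSD_SIZE_GB / 5 ))G"   # 20% of 1800 = 360 -> "360G"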

Format SSD to 4K

Figure out the proper block size for the SSD with nvme id-ns -H ${SSD} and reformat it:

nvme format "${SSD}" --block-size=4096
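
If you want to see which LBA formats the drive offers (and which one is in use), something like this should work; the exact wording of the -H output differs between nvme-cli versions:

nvme id-ns -H "${SSD}" | grep "LBA Format"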

Partition the SSD

Wipe the existing partition table and create a single partition spanning the whole device:

sgdisk --zap-all "${SSD}"
sgdisk -a 4096 -n 1:0:0 "${SSD}"

Technically you can leave this out and use the unpartitioned SSD as a pool device, but I prefer it this way, and the next steps depend on a partition.

Setup ZFS

Create zfs pool on partition

Now add the partition to a fresh ZFS zpool:

zpool create -m none -o autotrim=on -o ashift=12 "${ZFS_POOL}" "${PARTITION}"

Create two filesystems

Leave out 20% of SSD space as insurance against ZFS degradation:

zfs create -o reservation="${ZFS_RESERVATION}" \
           -o atime=off \
           -o xattr=sa \
           -o acltype=posixacl \
           -o mountpoint=none \
           "${ZFS_POOL}"/performance-protection 

Let's not mount it at ${ZFS_MOUNTPOINT} right away, because some data from that location still needs to be copied onto the new filesystem:

zfs create -o atime=off \
           -o dedup=on \
           -o xattr=sa \
           -o acltype=posixacl \
           -o mountpoint="${ZFS_MOUNTPOINT}-new" \
           "${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"

zfs list should now give this:

NAME                              USED  AVAIL     REFER  MOUNTPOINT
zfs-pool                          360G  1.40T       96K  none
zfs-pool/performance-protection    96K  1.76T       96K  none
zfs-pool/src                       96K  1.40T       96K  /home/src-new

and the filesystem is already mounted!

There is a choice to be made between dedup and compression. The conventional wisdom is compression, but I chose dedup: we benchmarked compression vs. no compression on another system and the slowdown was significant (granted, that system isn't as powerful as my desktop). Doing both seems wasteful.
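
If you go the compression route instead, or want to check later whether dedup actually pays off, the standard ZFS properties are enough. A minimal sketch, using the pool and dataset names from above:

# use lz4 compression instead of (or in addition to) dedup
zfs set compression=lz4 "${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"

# see how well dedup and compression are doing
zpool get dedupratio "${ZFS_POOL}"
zfs get compressratio "${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"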

Copy and remount

rsync stuff over

Run this also as sudo. I don't use the rsync -S option, because it is slower if you don't have sparse files (like some VM images):

rsync -axHAWX --numeric-ids --info=progress2 \
   "${ZFS_MOUNTPOINT}/" \
   "${ZFS_MOUNTPOINT}-new"

Note

This was decidedly faster on my initial try, when the SSD still had 512B blocks. I suppose that's because of a block size mismatch with my ext4 drive (but I don't know).

But in the end the 4K blocks give me slightly faster build times, and the wear on the SSD should be lower.

Change the mount point

mv "${ZFS_MOUNTPOINT}" "${ZFS_MOUNTPOINT}-old"
zfs set mountpoint="${ZFS_MOUNTPOINT}" "${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"
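
A quick sanity check that the new dataset is now mounted where the old directory used to be (my addition, not part of the original steps):

zfs get mountpoint "${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"
df -h "${ZFS_MOUNTPOINT}"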

And that’s it.

Benchmarks Ext4 vs ZFS

Caveat: These benchmarks were done with a block size of 512 bytes. I later reformatted the SSD with 4K blocks and have rerun the last test (ninja) with 4K to show the difference.

fio

The "best" test for ZFS seems to be fio, for some reason. I am using this command:

fio --randrepeat=1 \
    --numjobs 3 \
    --ioengine=posixaio \
    --name=test \
    --bs=4k \
    --iodepth=2 \
    --readwrite=randrw \
    --rwmixread=98 \
    --size=1G \
    --filename=$PWD/testfile

and the results are

ext4:

READ: bw=494MiB/s (518MB/s), 164MiB/s-166MiB/s (172MB/s-174MB/s), io=3010MiB (3157MB), run=6057-6099msec
WRITE: bw=10.1MiB/s (10.6MB/s), 3431KiB/s-3486KiB/s (3513kB/s-3570kB/s), io=61.7MiB (64.7MB), run=6057-6099msec

zfs:

READ: bw=1813MiB/s (1902MB/s), 604MiB/s-655MiB/s (634MB/s-686MB/s), io=3010MiB (3157MB), run=1533-1660msec
WRITE: bw=37.2MiB/s (39.0MB/s), 12.5MiB/s-13.4MiB/s (13.1MB/s-14.0MB/s), io=61.7MiB (64.7MB), run=1533-1660msec

ZFS crushes this comparison.

mulle-sde with cmake / gmake

I am building a large Objective-C/C project with many dependencies; the actual command is mulle-sde clean all ; mulle-sde craft.
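
The real/user/sys numbers below are plain time output; a minimal sketch of how to capture them, assuming a bash shell:

time ( mulle-sde clean all ; mulle-sde craft )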

ext4:

real  2m7,117s
user  2m48,457s
sys   0m42,866s

zfs:

real  1m57,936s
user  2m39,020s
sys   0m44,386s

Ten seconds better is non-negligible, but it's not a crushing defeat for ext4 (which was ~50% full).

mulle-sde with cmake / ninja

I noticed at this point that the ninja version Ubuntu 22 installs is just too old and my build system doesn't like it. When mulle-sde fell back to make, it didn't pass the -j option, because ninja doesn't need it...

Corrected benchmark of mulle-sde with cmake/make with -j set:

zfs:

real  0m54,466s
user  3m16,574s
sys   0m53,979s

So I downloaded the newest ninja version and tried again:

ext4:

real  0m45,373s
user  3m13,417s
sys   0m42,272s

zfs:

real  0m46,803s
user  3m12,275s
sys   0m49,064s

zfs: (4K)

real  0m45,762s
user  3m9,124s
sys   0m45,804s

It’s curious and funny, but ext4 is ever so slightly faster in this benchmark.

ZFS considerations

Why lose 20% of the SSD ?

ZFS degrades badly when there is not enough free space. What counts as "not enough room for ZFS" probably depends on the exact configuration, but folklore (and I) say that an 80% fill level is OK and anything over that is bad.
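
To keep an eye on the fill level (and fragmentation), the standard pool overview is enough; a quick sketch using my pool name:

zpool list -o name,size,allocated,free,fragmentation,capacity "${ZFS_POOL}"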

Here are two fio benchmarks on an actual ZFS system:

filled up:

READ: bw=5347KiB/s (5475kB/s), 1782KiB/s-1785KiB/s (1824kB/s-1828kB/s), io=2764MiB (2898MB), run=528527-529277msec
WRITE: bw=596KiB/s (611kB/s), 197KiB/s-200KiB/s (202kB/s-205kB/s), io=308MiB (323MB), run=528527-529277msec

and after clean up:

READ: bw=240MiB/s (252MB/s), 80.0MiB/s-80.3MiB/s (83.9MB/s-84.2MB/s), io=2764MiB (2898MB), run=11472-11514msec
WRITE: bw=26.8MiB/s (28.1MB/s), 9076KiB/s-9203KiB/s (9294kB/s-9424kB/s), io=308MiB (323MB), run=11472-11514msec

To put this into perspective: ZFS managed to turn an NVMe SSD into an ATA/33 drive, and not a particularly fast one...

Don’t stack ZFS

You have a desktop or server running ZFS, fine. Don't create VMs with ZFS as their filesystem on top of it. Why? You are wasting resources. 20% of the SSD is gone for the desktop, so the usable space is a factor of 0.8. Now add a VM with ZFS and you get another factor of 0.8; the combined usable space is only 0.64 of the original capacity. Adding insult to injury, you don't really want a filesystem to be completely full anyway, lest you might not be able to log in. What is your comfort zone here? Maybe 10%?

That is (0.8 - 0.8/10) * (0.8 - 0.8/10) = 0.72 * 0.72 ~ 0.52. You have roughly halved your SSD.

And that's just disk space. ZFS also likes a lot of RAM, so again you are paying twice.

Is ZFS good or bad for SSD wear ?

I couldn't figure this one out. Here come some open-ended musings without a definite opinion or answer...

With sudo smartctl -x -A ${SSD} | egrep 'Written|Write|Read' I can see (amongst other values):

Data Units Read:                    150.783 [77,2 GB]
Data Units Written:                 1.248.697 [639 GB]
Host Read Commands:                 3.342.858
Host Write Commands:                23.090.896

Data units are 1000 * 512B, so 150783 comes out to 77.2 GB. I did start out with 512B blocks and copied lots of GB, then reformatted with 4K and copied pretty much the same amount again. Apparently the data unit size remains 512B even when the block size is 4KB. Curious.

Supposedly the ratio of "Data Units Written" to "Host Write Commands" is an indication of write amplification.

But the number of "Host Write Commands" doesn't tell you how much data was transferred per command: it could be 4K, it could be 512 bytes, it could be something else. So if WAF = "Data Units Written" / "Host Write Commands", then 1.248.697 / 23.090.896 comes out to 0,05. That seems ludicrously low for a WAF.
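
The same division as a shell one-liner, with the thousands separators stripped:

echo "scale=3; 1248697 / 23090896" | bc    # comes out to .054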

I checked the SSD with the smartctl command above, before and after a somewhat larger tar operation on ZFS.

before:

Data Units Read:                    150.783 [77,2 GB]
Data Units Written:                 1.248.741 [639 GB]
Host Read Commands:                 3.342.870
Host Write Commands:                23.093.189

after:

Data Units Read:                    151.358 [77,4 GB]
Data Units Written:                 1.251.795 [640 GB]
Host Read Commands:                 3.360.901
Host Write Commands:                23.108.672

That's 15483 host write commands and 3054 data units written: 3054 * 1000 * 512 ≈ 1.5 GB, which matches the tar file size.
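
Spelled out as plain shell arithmetic, using the before/after values quoted above:

echo $(( 23108672 - 23093189 ))    # 15483 host write commands
echo $(( 1251795 - 1248741 ))      # 3054 data units written
echo $(( 3054 * 1000 * 512 ))      # 1563648000 bytes, roughly 1.5 GB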

Let’s try with ext4 (512B formatted):

before:

Data Units Read:                    4.602.526 [2,35 TB]
Data Units Written:                 13.464.991 [6,89 TB]
Host Read Commands:                 47.778.814
Host Write Commands:                169.418.307

It's problematic to catch the right values, as the OS has apparently queued up a lot of writes and is dispensing them slowly even though tar has finished:

Data Units Read:                    4.602.526 [2,35 TB]
Data Units Written:                 13.468.907 [6,89 TB]
Host Read Commands:                 47.778.814
Host Write Commands:                169.423.843

5536 host write commands were issued and 3916 data units were written.

The relationship between host write commands and data units written on ext4 vs. ZFS is interesting: the factor is about 0.7 for ext4, whereas on ZFS it's about 0.2. Is this the WAF? Is it better or worse?
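
The two factors, computed from the deltas above (bc truncates, so these come out slightly low):

echo "scale=2; 3916 / 5536" | bc     # ext4: .70
echo "scale=2; 3054 / 15483" | bc    # zfs:  .19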

It seems that, generally, a higher number of "Host Write Commands" relative to "Data Units Written" is desirable, and therefore ZFS should be a better choice for SSDs (dedup or compression will also be beneficial for SSD wear).
