Added an SSD to my system - with ZFS!
Yay, a new SSD is in the house. It’s a WD SN850X.
Now I want to move the part of a partition that holds all my source code onto it. In my case that is /home/src. Let’s do it with ZFS, which is a first for me. Why ZFS? Well, the benchmarks seemed enticing (see below).
Install
Everything has to be run as sudo/root!
Prerequisites
I am on Ubuntu 22.04.1 LTS and I need a few packages to get going:
apt install nvme-cli zfsutils-linux gdisk smartmontools
Setup SSD
Find device
I installed the device in a PCIe expansion card that can hold up to 4
NVMe SSDs. After I changed the PCIe slot BIOS setting to 4x4x4x4 (bifurcation), I can
now see the SSD as /dev/nvme2n1:
nvme list
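If you want to double-check that you picked the right device before doing anything destructive, smartmontools can show model and capacity (the device name is just what it happens to be on my system):
# confirm model, capacity and firmware of the new SSD
smartctl -i /dev/nvme2n1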
Setup environment
To avoid stupid typos, I use a few environment variables. Because ZFS degrades horribly if it becomes too full, I dedicate 20% to an untouchable “reservation” that prevents the SSD from becoming more than 80% full. The reservation is 20% of the SSD size (1800GB), which comes out to 360GB in my case.
SSD="/dev/nvme2n1"
PARTITION="${SSD}p1"
ZFS_RESERVATION="360G"
ZFS_MOUNTPOINT="/home/src"
ZFS_POOL="zfs-pool"
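One more sanity check before the destructive steps below, just to be sure the target is really the empty new SSD and not something that is in use:
# the new SSD should show no filesystem and no mountpoint
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT "${SSD}"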
Format SSD to 4K
Figure out the proper block size for the SSD with nvme id-ns -H ${SSD}
and reformat it:
nvme format "${SSD}" --block-size=4096
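To verify that the 4K format took, the same nvme id-ns -H output can be checked again; the active LBA format is the one marked “(in use)” (exact wording may differ between nvme-cli versions):
# the 4096-byte LBA format should now be marked "(in use)"
nvme id-ns -H "${SSD}" | grep "LBA Format"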
Partition the SSD
Partition the SSD:
sgdisk --zap-all "${SSD}"
sgdisk -a 4096 -n 1:0:0 "${SSD}"
Technically you can leave this out and use the unpartitioned SSD as a pool device, but I prefer it this way, and the next steps depend on a partition.
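A quick look at the resulting partition table doesn’t hurt before handing it over to ZFS:
# print the GPT to confirm the single, properly aligned partition
sgdisk -p "${SSD}"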
Setup ZFS
Create zfs pool on partition
Now add the partition to a fresh ZFS zpool:
zpool create -m none -o autotrim=on -o ashift=12 "${ZFS_POOL}" "${PARTITION}"
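To check that the pool really got the intended settings (just a verification sketch):
# pool health plus the two properties set explicitly above
zpool status "${ZFS_POOL}"
zpool get ashift,autotrim "${ZFS_POOL}"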
Create two filesystems
Leave out 20% of SSD space as insurance against ZFS degradation:
zfs create -o reservation="${ZFS_RESERVATION}" \
-o atime=off \
-o xattr=sa \
-o acltype=posixacl \
-o mountpoint=none \
"${ZFS_POOL}"/performance-protection
Let’s not mount it at ${ZFS_MOUNTPOINT} right away, because some data from that location still needs to be copied onto the new filesystem:
zfs create -o atime=off \
-o dedup=on \
-o xattr=sa \
-o acltype=posixacl \
-o mountpoint="${ZFS_MOUNTPOINT}-new" \
"${ZFS_POOL}/`basename -- "${ZFS_MOUNTPOINT}"`"
zfs list
should now give this:
NAME                              USED  AVAIL  REFER  MOUNTPOINT
zfs-pool                          360G  1.40T    96K  none
zfs-pool/performance-protection    96K  1.76T    96K  none
zfs-pool/src                       96K  1.40T    96K  /home/src-new
and the filesystem is already mounted!
There is a choice to be made between dedup and compression. I chose dedup, because we benchmarked compression vs. no compression on another system and the slowdown was significant. Granted, that system isn’t as powerful as my desktop. The conventional wisdom is compression; I chose dedup anyway. Doing both seems wasteful.
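For completeness, going the conventional-wisdom route instead would look roughly like this (lz4 being the usual suggestion, zfs-pool/src being the dataset name from the listing above). The dedupratio column is how I can later check whether dedup actually pays off:
# alternative I did NOT take: compression instead of dedup
zfs set compression=lz4 "${ZFS_POOL}/src"
# see how much dedup actually saves on the pool
zpool list -o name,size,alloc,free,dedupratio "${ZFS_POOL}"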
Copy and remount
rsync stuff over
Run this also as sudo. (I don’t use the rsync ‘S’ option because it’s slower if you don’t have sparse files, like some VM images.)
rsync -axHAWX --numeric-ids --info=progress2 \
"${ZFS_MOUNTPOINT}/" \
"${ZFS_MOUNTPOINT}-new"
Note
This was decidedly faster on my initial try, when the SSD still had 512B blocks. I suppose it’s because of a block size mismatch with my ext4 drive (but I don’t know).
In the end, though, the 4K blocks give me a smidgen faster build times, and the wear on the SSD should be better.
Change the mount point
mv "${ZFS_MOUNTPOINT}" "${ZFS_MOUNTPOINT}-old"
zfs set mountpoint="${ZFS_MOUNTPOINT}" "${ZFS_POOL}/$(basename -- "${ZFS_MOUNTPOINT}")"
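A quick check that the dataset really moved to the old location (assuming the variables from above are still set):
# the dataset should now report mounted=yes at /home/src
zfs get mountpoint,mounted "${ZFS_POOL}/$(basename -- "${ZFS_MOUNTPOINT}")"
df -h "${ZFS_MOUNTPOINT}"
Once everything checks out, ${ZFS_MOUNTPOINT}-old can eventually be deleted to reclaim the ext4 space.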
And that’s it.
Benchmarks Ext4 vs ZFS
Caveat: These benchmarks were done with a block size of 512B. I later reformatted the SSD with 4K blocks. I have rerun the last test (ninja) with 4K to show the difference.
fio
The “best” test for zfs seems to be fio for some reason. I am using this command:
fio --randrepeat=1 \
--numjobs 3 \
--ioengine=posixaio \
--name=test \
--bs=4k \
--iodepth=2 \
--readwrite=randrw \
--rwmixread=98 \
--size=1G \
--filename=$PWD/testfile
and the results are
ext4:
READ: bw=494MiB/s (518MB/s), 164MiB/s-166MiB/s (172MB/s-174MB/s), io=3010MiB (3157MB), run=6057-6099msec
WRITE: bw=10.1MiB/s (10.6MB/s), 3431KiB/s-3486KiB/s (3513kB/s-3570kB/s), io=61.7MiB (64.7MB), run=6057-6099msec
zfs:
READ: bw=1813MiB/s (1902MB/s), 604MiB/s-655MiB/s (634MB/s-686MB/s), io=3010MiB (3157MB), run=1533-1660msec
WRITE: bw=37.2MiB/s (39.0MB/s), 12.5MiB/s-13.4MiB/s (13.1MB/s-14.0MB/s), io=61.7MiB (64.7MB), run=1533-1660msec
ZFS crushes this comparison.
mulle-sde with cmake / gmake
I am building a large Objective-C/C project with many dependencies;
the actual command is mulle-sde clean all ; mulle-sde craft.
ext4:
real 2m7,117s
user 2m48,457s
sys 0m42,866s
zfs:
real 1m57,936s
user 2m39,020s
sys 0m44,386s
Ten seconds better is non-negligible, but it’s not a crushing defeat for ext4 (which was ~50% full).
mulle-sde with cmake / ninja
I noticed at this point that the ninja version Ubuntu 22 installs is just too
old and my build system doesn’t like it. When it reverted to make, mulle-sde
didn’t use the ‘-j’ option, so the build wasn’t parallel.
Corrected benchmark of mulle-sde with cmake/make with -j set:
zfs:
real 0m54,466s
user 3m16,574s
sys 0m53,979s
So I downloaded the newest ninja version and tried again:
ext4:
real 0m45,373s
user 3m13,417s
sys 0m42,272s
zfs:
real 0m46,803s
user 3m12,275s
sys 0m49,064s
zfs: (4K)
real 0m45,762s
user 3m9,124s
sys 0m45,804s
It’s curious and funny, but ext4 is ever so slightly faster in this benchmark.
ZFS considerations
Why lose 20% of the SSD?
ZFS degrades badly when there is not enough free room. What counts as “not enough room for ZFS” probably depends on the exact configuration, but folklore (and I) say that an 80% fill level is OK and anything over that is bad.
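Keeping an eye on the fill level is cheap; the capacity column of zpool list is all that’s needed (cron it if you like):
# CAP should stay at or below 80%
zpool list -o name,size,alloc,free,capacity "${ZFS_POOL}"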
Here are some benchmarks on an actual ZFS system
filled up:
READ: bw=5347KiB/s (5475kB/s), 1782KiB/s-1785KiB/s (1824kB/s-1828kB/s), io=2764MiB (2898MB), run=528527-529277msec
WRITE: bw=596KiB/s (611kB/s), 197KiB/s-200KiB/s (202kB/s-205kB/s), io=308MiB (323MB), run=528527-529277msec
and after clean up:
READ: bw=240MiB/s (252MB/s), 80.0MiB/s-80.3MiB/s (83.9MB/s-84.2MB/s), io=2764MiB (2898MB), run=11472-11514msec
WRITE: bw=26.8MiB/s (28.1MB/s), 9076KiB/s-9203KiB/s (9294kB/s-9424kB/s), io=308MiB (323MB), run=11472-11514msec
To put this into perspective, ZFS managed to turn an NVMe SSD into an ATA/33 drive, and not a particularly fast one…
Don’t stack ZFS
You have a desktop or server running ZFS, fine. Don’t create VMs with ZFS as their filesystem on top of it. Why? You are wasting your resources. 20% of the SSD is gone for the desktop, so the usable space is a factor of 0.8. Now add a VM with ZFS and you get another factor of 0.8. The combined usable space is now only 0.64 of the original capacity. Adding insult to injury, you don’t really want your filesystem to be completely full either, lest you might not be able to log in. What is your comfort zone here? Maybe 10%?
That is (0.8 - (0.8 / 10)) * (0.8 - (0.8 / 10)) ≈ 0.52. You have effectively halved your SSD.
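Spelled out as arithmetic, with the 10% comfort zone applied to each 0.8 share:
# usable fraction after stacking two ZFS layers, each with a 10% comfort zone
echo '(0.8 - 0.8/10) * (0.8 - 0.8/10)' | bc -l
# .5184  -> roughly half the SSD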
And that’s just disk space. ZFS also likes a lot of RAM, so again you are paying twice.
Is ZFS good or bad for SSD wear?
I couldn’t figure this one out. Here come some open-ended musings, without a definite opinion or answer…
With sudo smartctl -x -A ${SSD} | egrep 'Written|Write|Read'
I can see (amongst other values):
Data Units Read: 150.783 [77,2 GB]
Data Units Written: 1.248.697 [639 GB]
Host Read Commands: 3.342.858
Host Write Commands: 23.090.896
Data units are 1000 * 512B, so 150783 comes out to 77.2GB. I did start
out with 512B blocks and copied lots of GB, then reformatted with 4K and copied
pretty much the same amount again. Apparently the data unit size remains 512B
even if the block size is 4KB. Curious.
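The conversion, spelled out (one data unit being 1000 * 512 bytes, as stated above):
# 150783 data units * 1000 * 512 bytes
echo '150783 * 1000 * 512 / 10^9' | bc -l
# 77.200896  -> the 77.2 GB reported by smartctl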
Supposedly the ratio of “Data Units Written” to “Host Write Commands” is an indication of write amplification.
But the number of “Host Write Commands” doesn’t tell you how many GB were
transferred. A command could write 4K, it could write 512 bytes, it could write
something else. So if WAF = “Data Units Written” / “Host Write Commands”, then
1.248.697 / 23.090.896 comes out to 0,05. That seems ludicrously low for a WAF.
I checked the SSD counters before and after a somewhat larger tar operation on ZFS.
before:
Data Units Read: 150.783 [77,2 GB]
Data Units Written: 1.248.741 [639 GB]
Host Read Commands: 3.342.870
Host Write Commands: 23.093.189
after:
Data Units Read: 151.358 [77,4 GB]
Data Units Written: 1.251.795 [640 GB]
Host Read Commands: 3.360.901
Host Write Commands: 23.108.672
That’s 15483 host write commands and 3054 data units written:
3054 * 1000 * 512 ≈ 1.5GB, which matches the tar file size.
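If you want to repeat this kind of measurement, a before/after snapshot of the counters is all it takes; the tar line is just a placeholder for whatever workload you want to measure:
# snapshot the SMART counters, run the workload, snapshot again, compare
smartctl -x -A "${SSD}" | egrep 'Written|Write|Read' > /tmp/smart-before
tar cf /tmp/workload.tar "${ZFS_MOUNTPOINT}/some-project"   # placeholder workload
smartctl -x -A "${SSD}" | egrep 'Written|Write|Read' > /tmp/smart-after
diff /tmp/smart-before /tmp/smart-after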
Let’s try with ext4 (512B formatted):
before:
Data Units Read: 4.602.526 [2,35 TB]
Data Units Written: 13.464.991 [6,89 TB]
Host Read Commands: 47.778.814
Host Write Commands: 169.418.307
after (it’s problematic to catch the right values, as the OS has apparently queued a lot of writes and is dispensing them slowly, even though tar has finished):
Data Units Read: 4.602.526 [2,35 TB]
Data Units Written: 13.468.907 [6,89 TB]
Host Read Commands: 47.778.814
Host Write Commands: 169.423.843
5536 host write commands were issued and 3916 data units were written.
The relationship between host write commands and data units on ext4 vs. zfs is interesting: the factor is about 0.7 for ext4, whereas on zfs it’s about 0.2. Is this WAF? Is this better or worse?
It seems that generally a higher number of “Host Write Commands” compared
to “Data Units Written” is desirable, and therefore ZFS should be a better
choice for SSDs (dedup or compression will also be beneficial for SSD wear).