ZFS Cheat Sheet
Serge Y. Stroobandt
Why ZFS?
The data integrity problem can be best described as follows:1
System administrators may feel that because they store their data on a redundant disk
array and maintain a well-designed tape-backup regimen, their data is adequately
protected. However, undetected data corruption can occur between backup periods.
Subsequently backing up corrupted data will yield corrupted data when restored.
The Zeta File System (ZFS) features the capability of being self-validating and
self-healing from silent data corruption or data rot through continuous data
block checksumming.2 Each block write operation yields a 256-bit block checksum.
The resulting block checksum is not stored with the block, but rather with its
parent block.3 Hence, the blocks of a ZFS storage pool form a Merkle tree in which
each block contains the checksum of all of its children.4 This allows the entire
pool to continuously self-validate its data with every operation on both accuracy
and correctness, ensuring data integrity.

Figure 1: How ZFS self-heals from silent data corruption or data rot through
continuous block checksumming. Source: root.cz
ZFS server hardware
To run ZFS, a server needs to meet a set of stringent
hardware requirements. A separate article deals
with this matter in detail.
I considered for a moment using CentOS for its ten years of support, compared
to a mere five years with Ubuntu Server LTS releases. However, in a home or
small office setting, one most probably expects a bit more from one's server than
just file serving on the local network. The CentOS repository is extremely
rudimentary. Even for the most basic use cases, one grows dependent on third-party
repositories.
Neither are FreeBSD-based distributions (like FreeNAS and NAS4Free) an option,
because of their lack of support for UTF-8 character encoding. This might be fine
in the English-speaking part of the world, but it certainly is not for the vast
majority in the rest of the world. This can be fixed, but I really do not want to
be dealing with something as basic as this.
Installation
The Lawrence Livermore National Laboratory has been working on porting the
native Solaris ZFS source to the Linux kernel as a kernel module. Follow the
installation instructions at zfsonlinux.org for your specific GNU/Linux
distribution.
$ dpkg -l zfs*
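The dpkg query above lists which ZFS packages are already installed. On Ubuntu
and its derivatives, for instance, the kernel module and user-space tools are
normally pulled in with a single package:

$ sudo apt install zfsutils-linux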
Virtual devices
A virtual device (VDEV) is a meta-device that can represent one or more devices.
ZFS supports seven different types of VDEV:
• File - a pre-allocated file
• Physical Drive (HDD, SSD, PCIe NVMe, etc.)
• Mirror - a standard RAID1 mirror
• ZFS software raidz1, raidz2, raidz3 'distributed' parity-based RAID
• Hot Spare - a hot spare for ZFS software RAID
• Cache - a device for the level 2 adaptive read cache (ZFS L2ARC)
• Log - ZFS Intent Log (ZFS ZIL)
A device can be added to a VDEV, but cannot be removed from it. For most
home or small office users, each VDEV usually corresponds to a single physical
drive. During pool creation, several of these VDEVs are combined to form a
mirror or RAIDZ.
• Use whole disks rather than partitions. ZFS can make better use of
the on-disk cache as a result. If you must use partitions, back up
the partition table, and take care when reinstalling data into the other
partitions, so you do not corrupt the data in your pool.
• Do not mix disk sizes or speeds within VDEVs, nor within storage pools.
If VDEVs vary in size, ZFS will favour the larger VDEV, which could lead to
performance bottlenecks.
• Do not create ZFS storage pools from files in other ZFS datasets. This will
cause all sorts of headaches and problems.
$ lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL,WWN,MOUNTPOINT
sdf   disk  1.8T  WDC_WD20EFRX-68EUZN0  WD-WCC4M4DKAVF1  0x50014ee262c435af
$ ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Aug 20 16:55 ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M4DKAVF1 -> ../../sdf
Note that a drive or a drive partition can have more than one by-id. Apart
from the ID based on the brand, model name and the serial number, there is also
a wwn- ID. This is the unique World Wide Name (WWN), which is also printed on
the drive case.
Both types of IDs work fine with ZFS, but the WWN is a bit less telling. If these
WWN IDs are not referenced by the production system (e.g. a root partition
or a ZFS that has not been exported yet), they may simply be removed with
sudo rm wwn-* . Trust me; I have done that. Nothing can go wrong as long as
the ZFS is in an exported state before doing this. After all, WWN IDs are mere
symbolic links to sd devices that are created at drive detection. They will
automatically reappear when the system is rebooted. Internally, Linux always
references sd devices.
For the physical identification of drives using storage enclosure LEDs, I created
the following bash script:
#!/usr/bin/env bash
# Blink a drive's activity LED by continuously reading from it.
# https://serverfault.com/a/1108701/175321
if [[ $# -gt 0 ]]
then
    while true
    do
        # Read the device and discard the data; retry with sudo if permission is denied.
        dd if=$1 of=/dev/null >/dev/null 2>&1 || sudo dd if=$1 of=/dev/null >/dev/null 2>&1
        sleep 1
    done
else
    echo -e '\nThis command requires a /dev argument.\n'
fi
Unlike ledctl from the ledmon package, this script also works fine with non-
Intel hard drive controllers.
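Saved under a hypothetical name such as blinkdrive.sh, the script is simply given
the device to identify and interrupted once the drive has been located:

$ ./blinkdrive.sh /dev/sdf    # press Ctrl+C to stop blinking the LED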
Zpool creation
A zpool is a pool of storage made from a collection of VDEVs. One or more ZFS
file systems can be created within a ZFS pool, as will be shown later. In
practical examples, zpools are often given the names pool , tank or backup ,
preferably followed by a digit to distinguish between multiple pools on a system.
-o ashift=12
The zpool create property -o ashift can only be set at pool creation time.
Its value corresponds to the base 2 logarithm of the pool sector size in bytes.
I/O operations are aligned to boundaries specified by this size. The default
value is 9, as 2⁹ = 512, which corresponds to the standard sector size of
operating system utilities used for both reading and writing data. In order to
achieve maximum performance from Advanced Format drives with 4 KiB boundaries,
the value should be set to ashift=12 , as 2¹² = 4096.
Here is a potential way of finding out the physical block size of your drives
using the operating system. However, this method is not foolproof! Western
Digital drives in particular may falsely report as being non-Advanced Format
(see the inset below). Anyhow, by installing hdparm , one can query the
microprocessor on the printed circuit board of a hard drive. (Yes, hard drives
are actually little computers of their own.)
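As a sketch, assuming the drive sits at /dev/sdf as in the earlier listing,
hdparm reports both the logical and the physical sector size:

$ sudo hdparm -I /dev/sdf | grep -i 'sector size'

A physical sector size of 4096 bytes indicates an Advanced Format drive, which
calls for ashift=12 .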
Figure 2: The label of a 2010 model EARS 1 TB Western Digital Green drive,
featuring Advanced Format. Source: tomshardware.com
-o autoexpand=on
The pool property -o autoexpand=on must be set to on before replacing the first
drive in the pool with a larger one. The property controls automatic pool
expansion. The default is off . After all drives in the pool have been replaced
with larger drives, the pool will automatically grow to the new, larger drive
size.
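For an existing pool, the property can also be switched on later; the pool name
below is assumed:

$ sudo zpool set autoexpand=on pool0
$ zpool get autoexpand pool0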
-O compression=on
Always enable compression. There is almost certainly no reason to keep it
disabled. It hardly touches the CPU and hardly touches throughput to the drive,
yet the benefits are amazing. Compression is disabled by default, which does not
make much sense with today's hardware. ZFS compression is extremely cheap,
extremely fast, and barely adds any latency to reads and writes. In fact, in
some scenarios, your disks will respond faster with compression enabled than
disabled. A further benefit is the massive space saving.
-O dedup=off
Even if you have the RAM for it, ZFS deduplication is, unfortunately, almost
certainly a loss.5 So, by all means avoid using deduplication, even on a machine
built to handle it. Unlike compression, deduplication is very costly on
the system. The deduplication table consumes massive amounts of RAM.
-f
The force option -f forces the use of the stated VDEVs, even if these appear to
be in use. Not all devices can be overridden in this manner.
Zpool mirror
$ cd /dev/disk/by-id/
$ ls
$ sudo zpool create -f -o ashift=12 -O compression=on -O dedup=off pool0 \
  mirror scsi-SATA_WDC_WD10EARS-00_WD-WCAV56475795 \
         scsi-SATA_WDC_WD10EARS-00_WD-WCAV56524564
A mirrored storage pool configuration requires at least two disks, when possible,
connected to separate controllers. Personally, I prefer running a three-way
mirror using three disks, even though this consumes 50% more electric power.
Here is the reason why. When one physical drive in a two-way mirror fails, the
remaining drive needs to be replicated (resilvered in ZFS speak) to a new
physical drive. Replication puts additional stress on a drive and it is not
inconceivable that the remaining drive would fail during the replication process.
"When it rains, it pours." By contrast, a three-way mirror with one failed disk
maintains 100% redundancy. A similar argument exists in favour of RAIDZ-2 and
RAIDZ-3 over RAIDZ-1.
Zpool RAIDZ
• https://www.openoid.net/zfs-you-should-use-mirror-vdevs-not-raidz/
• When considering performance, know that for sequential writes,
mirrors will always outperform RAID-Z levels. For sequential reads,
RAID-Z levels will perform more slowly than mirrors on smaller data
blocks and faster on larger data blocks. For random reads and writes,
mirrors and RAID-Z seem to perform in similar manners. Striped
mirrors will outperform mirrors and RAID-Z in both sequential, and
random reads and writes.
• Consider using RAIDZ-2 or RAIDZ-3 over RAIDZ-1. You have heard the phrase
"when it rains, it pours". This is true for disk failures. If a disk fails in
a RAIDZ-1 and the hot spare is being resilvered, you cannot afford another disk
failure before the data is fully copied, or you will suffer data loss. With
RAIDZ-2, you can suffer two disk failures instead of one, increasing the
probability that you have fully resilvered the necessary data before a second,
or even third, disk fails.
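As an illustrative sketch only (the six device names are hypothetical
placeholders), a RAIDZ-2 pool would be created along these lines:

$ sudo zpool create -f -o ashift=12 -O compression=on -O dedup=off pool0 \
  raidz2 ata-DISK1 ata-DISK2 ata-DISK3 ata-DISK4 ata-DISK5 ata-DISK6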
Export & import zpools
Exporting zpools
Storage pools should be explicitly exported to indicate that they are ready to
be migrated. This operation flushes any unwritten data to disk, writes data
to the disk indicating that the export was done, and removes all information
about the zpool from the system.
If a zpool is not explicitly exported, but instead physically removed, the
resulting pool can nonetheless still be imported on another system. However,
the last few seconds of data transactions may be lost. Moreover, the pool will
appear faulted on the original system because the devices are no longer present.
By default, the destination system cannot import a pool that has not been
explicitly exported. This condition is necessary to prevent accidentally
importing an active zpool consisting of network-attached storage that is still
in use by another system.
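Exporting is a single command; the pool name below is assumed:

$ sudo zpool export pool0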
Importing all zpools
To import all known storage pools, simply type:
~# zpool import -a
Renaming a zpool
Here is an example where a zpool called tank0 is renamed to pool0 .
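A pool is renamed by exporting it and importing it again under the new name:

$ sudo zpool export tank0
$ sudo zpool import tank0 pool0
$ zpool status pool0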
Upgrading the zpool version
Occasionally, a zpool status report may yield a message indicating that the pool
format is out of date and that new features are available. Check the installed
ZFS version and upgrade all pools as follows:
$ zpool --version
zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
$ sudo zpool upgrade -a
ZFS file systems
ZFS file system creation
One or more Zeta file systems can live on a zpool. Here are a number of points
to take into account; a minimal creation sketch follows the list below.
• Turning access time writing off with -o atime=off can result in significant
performance gains. However, doing so might confuse legacy mail clients and
similar utilities.
• Avoid running a ZFS root file system on GNU/Linux for the time being.
It is currently a bit too experimental for /boot and GRUB.
• However, do create file systems for /home and, if desired, /var/log and
/var/cache .
• For /home ZFS installations, set up nested file systems for each user.
For example, pool0/home/atoponce and pool0/home/dobbs . Consider
using quotas on these file systems.
• Further implications of creating a /home file system are described in
the next subsection.
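A minimal sketch of such a nested layout, reusing the example names above (the
quota value is purely illustrative):

$ sudo zfs create -o atime=off pool0/home
$ sudo zfs create pool0/home/atoponce
$ sudo zfs set quota=100G pool0/home/atoponce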
ZFS as /home
Mounting as /home
Here is how to mount a ZFS file system as home:
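A minimal sketch, assuming the pool0/home file system from above and an
otherwise empty /home to mount over:

$ sudo zfs set mountpoint=/home pool0/home
$ zfs get mountpoint pool0/home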
Unmounting /home
Certain Zeta pool and file system operations require prior unmounting of
the file system. However, the /home Zeta file system will refuse to unmount
because it is in use:
---
- name: 'PermitRootLogin no'
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin '
    line: 'PermitRootLogin no'    # Was: PermitRootLogin prohibit-password
  notify: restart ssh

- meta: flush_handlers
---
- name: 'restart ssh'
  service: name=ssh state=restarted
3. Since Ubuntu is being used, and if a root password has not been defined yet,
one should define one now using $ sudo passwd root . There really is no
other way.
4. In /etc/dhcp/dhclient.conf , set timeout to 15 . Otherwise, the next
step will cause the boot time to increase by 5 minutes (300 seconds)!
---
- name: 'Lower DHCP timeout to 15s.'
  lineinfile:
    path: /etc/dhcp/dhclient.conf
    regexp: '^timeout'
    line: 'timeout 15'    # Was: timeout 300
Scrubbing
Scrubbing examines all data to discover hardware faults or disk failures,
whereas resilvering examines only that data known to be out of date. Scrubbing
ZFS storage pools now happens automatically. It can also be initiated manually.
If possible, scrub consumer-grade SATA and SCSI disks weekly and
enterprise-grade SAS and FC disks monthly.
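A manual scrub is started and followed up as below; the pool name is assumed:

$ sudo zpool scrub pool0
$ zpool status pool0      # reports scrub progress and any repaired errors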
Monitoring
In order to preserve maximum performance, it is essential to keep pool
allocation under 80% of its full capacity. The following set of monitoring
commands helps to keep an eye on this and other aspects of zpool health.
For example, the file system may get heavily fragmented due to the copy-on-
write nature of ZFS. It might be useful to e-mail capacity reports monthly.
$ zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool0   928G   475G   453G         -     4%    51%  1.00x  ONLINE  -
$ zfs list
NAME         USED  AVAIL  REFER  MOUNTPOINT
pool0        475G   424G   136K  /pool0
pool0/home   475G   424G   450G  legacy
Use the zpool status command for status monitoring. Options are available for
verbose output and for automatically repeating the command every, for example,
five seconds. Consult the Oracle Solaris documentation for details about
zpool status output. The second command line is a remote server example.
$ zpool status -v 5
$ ssh -t servername 'sudo zpool status -v 5'
$ zpool iostat -v 5
This hddtemp invocation can be made the default by adding the following line
to .bash_aliases :
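The original alias is not preserved here; a purely hypothetical example of such
a line:

alias hddtemp='sudo hddtemp /dev/sd?'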
Snapshots
Snapshot creation
Snapshot Zeta file systems frequently and regularly.
Snapshots are cheap, and can keep a plethora of file
versions over time. Consider using something like
the zfs-auto-snapshot script.
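Taking a snapshot by hand is a one-liner; the file system and snapshot name
below are assumptions:

$ sudo zfs snapshot pool0/home@20230101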
To see the snapshot creation time, add the following zfs list options:
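The exact options are not reproduced here, but something along these lines shows
the creation property:

$ zfs list -t snapshot -o name,creation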
Accessing snapshots
Snapshots of file systems are accessible through the .zfs/snapshot/ directory
within the root of the containing file system. Here, the root of the file system is
mounted as /home .
cd /home/.zfs/snapshot/20120722
Renaming snapshots
$ sudo zfs rename pool0/home@2021070 pool0/home@2021070.bad
Destroying snapshots
After a year or so, snapshots may start to take up some disk space. Individual
snapshots can be destroyed as follows after listing them. This has no
consequences for the validity of other snapshots.
Snapshots can also be destroyed in sequences. For this to work, the range must
consist of two snapshots that actually do exist. It will not work with arbitrary
date ranges.
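A hedged sketch with assumed snapshot names; the percent sign denotes a range
between two existing snapshots:

$ zfs list -t snapshot
$ sudo zfs destroy pool0/home@20200101
$ sudo zfs destroy pool0/home@20200101%20201231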
Automated snapshots
To automate periodic snapshots, Sanoid is probably your best shot (pun
intended). Apart from facilitating automated snapshots, Sanoid also offers
essential infrastructure to simplify the backup process (see the next section).
The sanoid package is available from Ubuntu's standard repository.
[pool0/home]
frequently = 0
hourly = 0
daily = 30
monthly = 12
yearly = 3
autosnap = yes
autoprune = yes
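Assuming the packaged defaults, a stanza like the one above goes in
/etc/sanoid/sanoid.conf and the package's systemd timer takes care of the
scheduling:

$ sudo apt install sanoid
$ sudoedit /etc/sanoid/sanoid.conf
$ systemctl list-timers | grep -i sanoid    # confirm the timer is active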
Backups
“Consider data as intrinsically lost, unless a tested, offline, off‑site
backup is available.”
Local backup
For all backup activities, the use of Syncoid is strongly recommended. Syncoid
is an open-source replication tool, part of the sanoid package and written by
the IT consultant Jim Salter. The syncoid command tremendously facilitates
the asynchronous incremental replication of ZFS file systems, both local and
remote over SSH. Syncoid is, so to speak, the cherry on top of the ZFS cake.
The backup process will take some time. Therefore, start a screen session once
logged in on the server. It is important to note that the syncoid command will
by default start by taking a snapshot of the ZFS file system.
Also note that syncoid replicates ZFS file systems. To replicate an entire
zpool, the -r or --recursive command option is required:
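A local, recursive replication onto a second pool might look like this (pool
names assumed):

$ sudo syncoid -r pool0 backup0/pool0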
Chained backups
A chained backup process occurs, for example, when a server zpool is replicated
to a detachable drive, which in turn is replicated on another system.
By default, Syncoid takes a snapshot prior to starting the replication process.
This behaviour is undesired for chained backups. In the above example, it would
create, unbeknownst to the server, an unnecessary extra snapshot on the
detachable drive.
To counter this default snapshot behaviour, issue the syncoid command for
the second backup in the chain as follows:
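The pool and host names below are hypothetical; the key part is the
--no-sync-snap option, which stops Syncoid from taking its own snapshot first:

$ sudo syncoid -r --no-sync-snap backup0/pool0 user@othersystem:backup1/pool0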
Remote backup
Perform regular (at least weekly) backups of the full storage pool.
A backup consists of multiple copies. Having only redundant disks does not
suffice to guarantee data preservation in the event of a power failure, hardware
failure, disconnected cables, fire, a hacker ransom attack or a natural
disaster.
• https://www.openoid.net/why-sanoids-zfs-replication-matters/
• simplesnap
• zrep
Mounting a backup
A backup drive will attempt to mount at the same point as its original. This
results in an error similar to:
Changing the mountpoint for the backup resolves this issue. The backup zpool
will need to be exported and imported again for this mountpoint change to
take effect.
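A sketch with an assumed backup pool layout:

$ sudo zfs set mountpoint=/backup0/home backup0/home
$ sudo zpool export backup0
$ sudo zpool import backup0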
Older snapshots of file systems on the backup drive are accessible through
the .zfs/snapshot/ directory within the root of the containing file system.
cd /backup0/.zfs/snapshot/home/20170803
SFTP server
Workstations use the SSHFS (Secure Shell Filesystem) client to access server
files through SFTP (Secure Shell File Transfer Protocol).
• Complete file permission transparency is the main reason for preferring
SFTP over the Windows™-style Server Message Block (SMB). This, despite
the fact that ZFS has been integrated with Samba, the GNU/Linux SMB
implementation.
• Furthermore, SFTP handles renaming files to a different letter case well.
The same cannot be said about the latest Samba versions!
• SFTP significantly simplifies things. If the server is accessible over SSH,
SSHFS should also work.
• Running only an OpenSSH SFTP server significantly reduces exposure,
as SFTP is inherently more secure than SMB and NFS.
• Eavesdropping on the (W)LAN is not an issue, since all file transfers are
encrypted.
• The only drawback is the slower file transfer speed due to
the encryption overhead.
SSHFS clients
Here is the bash script that I wrote to mount the server through SSHFS on any
client computer. It gets executed at login, in my case by specifying the script
in Xubuntu's Session and Startup → Application Autostart . However,
the script can also be run manually, for example after connecting to a mobile
network.
The optimisation parameters are from the following article and tests. As a
prerequisite, the administrator needs to create a /$server/$USER mount point
directory for every user on the client system.
#!/usr/bin/env bash
mountpoint="/$server/$USER"
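The remainder of the script is not preserved here. A minimal sketch of such a
mount script, assuming a hard-coded server name and omitting the author's
optimisation parameters:

#!/usr/bin/env bash
server='servername'                      # assumption: replace with the actual host
mountpoint="/$server/$USER"
# Mount the remote home directory over SSHFS; reconnect after network drops.
sshfs -o reconnect "$USER@$server:/home/$USER" "$mountpoint"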
Drive attach & detach
Attaching more drives
The command zpool attach is used to attach an extra drive to an existing drive
in a zpool as follows:
$ cd /dev/disk/by-id/
$ ls -l
$ sudo zpool status
The last line is to monitor the resilvering process. If zpool attach is
complaining about the new drive being in use, and you know what you are doing,
simply add -f to force zpool attach into doing what you want it to do.
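The attach command itself takes the existing device and the new device; the
names below are placeholders:

$ sudo zpool attach pool0 ata-EXISTING-DISK ata-NEW-DISK
$ sudo zpool status 5        # repeat every five seconds to follow the resilver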
Replacing a failing drive
$ sudo zpool status
  pool: pool0
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 28K in 6h58m with 0 errors on Sun Aug 14 07:22:34 2022
config:
The zpool replace command comes in very handy when a failed or failing drive
of a redundant zpool needs to be replaced with a new drive in exactly
the same physical drive bay. It suffices to identify the drive that needs to be
replaced. ZFS will automatically detect the new drive and start the resilvering
process.
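When the new drive occupies the same bay, naming the old device is enough; the
device name below is a placeholder:

$ sudo zpool replace pool0 ata-FAILING-DISK
$ sudo zpool status 5        # follow the resilvering progress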
Detaching a drive
$ zpool status
$ sudo zpool detach pool0 ata-WDC_WD10EARS-00Y5B1_WD-WCAV56524564
Troubleshooting
When failing to create a zpool
I once ran into a case where zpool creation by-id did not work. Using the sdx
device name did work, however. Exporting and reimporting the pool by-id
kept everything nice and neat.
$ sudo zpool create -f -o ashift=12 -o autoexpand=on -O compression=on -O dedup=off \
  backup0 /dev/disk/by-id/ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M4DKAVF1
$ sudo zpool create -f -o ashift=12 -o autoexpand=on -O compression=on -O dedup=off \
  backup0 /dev/sdx
$ sudo zpool export backup0
$ sudo zpool import -d /dev/disk/by-id/ backup0
$ sudo zdb
$ sudo zpool attach pool0 15687870673060007972 \
  /dev/disk/by-id/scsi-SATA_WDC_WD10EADS-00_WD-WCAV51264701
Destroying a zpool
ZFS pools are virtually indestructible. If a zpool does not show up
immediately, do not presume too quickly that the pool is dead. In my
experience, digging around a bit will bring the pool back to life. Do
not unnecessarily destroy a zpool!
Real-world example
Here is a real-world ZFS example by Jim Salter of service provider Openoid,
involving reshuffling pool storage on the fly.
References
1. Michael H. Darden. Data integrity: The Dell|EMC distinction. Published
May 2002. https://www.dell.com/content/topics/global.aspx/power/en/ps2q02_darden