
Add BGP notes and cleanup ZFS notes

Håvard O. Nordstrand 4 years ago
parent
commit
0b40321ed6

+ 41 - 23
config/linux-server/debian.md

@@ -23,13 +23,13 @@ Using **Debian 10 (Buster)**.
 - Use an FQDN as the hostname. It'll set both the shortname and the FQDN.
 - Use separate password for root and your personal admin user.
 - System disk partitioning:
-    - (Recommended for "simple" systems) Manually partition: One partition using all space, mounted as EXT4 at `/`.
-    - (Recommended for "complex" systems) Manually partition, see [system storage](/config/linux-server/storage/#system-storage).
+    - "Simple" system: Guided, single partition, use all available space.
+    - "Complex" system: Manually partition, see [system storage](/config/linux-server/storage/#system-storage).
     - Swap can be set up later as a file or LVM volume.
     - When using LVM: Create the partition for the volume group, configure LVM (separate menu), configure the LVM volumes (filesystem and mount).
 - At the software selection menu, select only "SSH server" and "standard system utilities".
 - If it asks to install non-free firmware, take note of the packages so they can be installed later.
-- Install GRUB to the used disk.
+- Install GRUB to the used disk (not partition).
 
 ### Reconfigure Clones
 
@@ -69,6 +69,7 @@ If you didn't already configure this during the installation. Typically the case
         - Add PID monitor group: `groupadd -g 500 hidepid` (example GID)
         - Add your personal user to the PID monitor group: `usermod -aG hidepid <user>`
         - Enable hidepid in `/etc/fstab`: `proc /proc proc defaults,hidepid=2,gid=500 0 0`
+    - (Optional) Disable the tiny swap partition added by the guided installer by commenting it out in the fstab.
     - (Optional) Setup extra mount options: See [Storage](system.md).
     - Run `mount -a` to validate fstab.
     - (Optional) Restart the system.
@@ -77,7 +78,7 @@ If you didn't already configure this during the installation. Typically the case
     - Add the relevant groups (using `usermod -aG <group> <user>`):
         - `sudo` for sudo access.
         - `systemd-journal` for system log access.
-        - `pidmonitor` (whatever it's called) if using hidepid, to see all processes.
+        - `hidepid` (whatever it's called) if using hidepid, to see all processes.
     - Add your personal SSH pubkey to `~/.ssh/authorized_keys` and fix the owner and permissions (700 for dir, 600 for file).
         - Hint: Get `https://github.com/<user>.keys` and filter the results.
     - Try logging in remotely and gain root access through sudo.
@@ -114,15 +115,18 @@ If you didn't already configure this during the installation. Typically the case
     - (Optional) To install all common firmware and microcode, install `firmware-linux` (or `firmware-linux-free`) (includes e.g. microcode packages).
 1. Setup smartmontools to monitor S.M.A.R.T. disks:
     1. Install: `apt install smartmontools`
-    1. Monitor disk: `smartctl -s on <dev>`.
+    1. (Optional) Monitor disk: `smartctl -s on <dev>`.
 1. Setup lm_sensors to monitor sensors:
     1. Install: `apt install lm-sensors`
     1. Run `sensors` to make sure it runs without errors and shows some (default-ish) sensors.
     1. For further configuration (more sensors) and more info, see [Linux Server Applications: lm_sensors](/config/linux-server/applications/#lm_sensors).
 1. Check the performance governor and other frequency settings:
     1. Install `linux-cpupower`.
-    1. Run `cpupower frequency-info` to show the boost state (should be on) (Intel) and current performance governor (should be "ondemand" or "performance").
-    1. Fix it something is wrong: Google it.
+    1. Show: `cpupower frequency-info`
+        - Check that the boost state is on (Intel).
+        - Check the current performance governor (e.g. "powersave", "ondemand" or "performance").
+    1. (Optional) Temporarily change performance governor: `cpupower frequency-set -g <governor>`
+    1. (Optional) Permanently change performance governor: **TODO** (one approach is sketched after this list)
 1. (Optional) Mask `ctrl-alt-del.target` to disable CTRL+ALT+DEL reboot at the login screen.
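+
+A minimal sketch of one way to persist the governor, using a small systemd unit that runs `cpupower` at boot. The unit name, binary path and chosen governor are placeholders/assumptions:
+
+```
+# /etc/systemd/system/cpupower-governor.service
+[Unit]
+Description=Set CPU frequency governor
+
+[Service]
+Type=oneshot
+ExecStart=/usr/bin/cpupower frequency-set -g performance
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Enable it with `systemctl enable --now cpupower-governor.service`.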
 
 #### QEMU Virtual Host
@@ -133,22 +137,34 @@ If you didn't already configure this during the installation. Typically the case
 
 #### Network Manager
 
-Using ifupdown (not ifupdown2) (alternative 1, default):
+##### Using ifupdown (Alternative 1)
+
+This is the default network manager and the simplest option for basic setups.
 
 1. For VLAN support, install `vlan`.
 1. For bonding/LACP support, install `ifenslave`.
 1. Configure `/etc/network/interfaces` (see the example below).
-1. Validate the interfaces: `ifup --no-act <if>`
-1. Reload the config: Reboot or run `ifdown` and `ifup` on all changed interfaces.
+1. Reload the config (per interface): `systemctl restart ifup@<if>.service`
+    - Don't restart `networking.service` or call `ifup`/`ifdown` directly; this is deprecated and may cause problems.
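+
+A minimal `/etc/network/interfaces` sketch with a static address, RAs disabled and a VLAN subinterface (interface names, VLAN ID and addresses are placeholders):
+
+```
+auto eth0
+iface eth0 inet static
+    address 192.0.2.10/24
+    gateway 192.0.2.1
+
+iface eth0 inet6 static
+    address 2001:db8::10/64
+    gateway 2001:db8::1
+    # Don't autoconfigure from router advertisements
+    accept_ra 0
+
+# VLAN subinterface (requires the vlan package)
+auto eth0.10
+iface eth0.10 inet static
+    address 198.51.100.10/24
+```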
+
+##### Using systemd-networkd (Alternative 2)
 
-Using systemd-networkd (alternative 2):
+This is the systemd way of doing it and is recommended for more advanced setups as ifupdown is riddled with legacy/compatibility crap.
 
+1. Add a simple network config: Create `/etc/systemd/network/lan.network` based on [main.network](https://github.com/HON95/configs/blob/master/server/linux/networkd/main.network) (see the sketch below).
 1. Disable/remove the ifupdown config: `mv /etc/network/interfaces /etc/network/interfaces.old`
-1. Enable and (re)start systemd-networkd: `systemctl enable systemd-networkd`
+1. Enable the service: `systemctl enable --now systemd-networkd`
 1. Purge `ifupdown` and `ifupdown2`.
 1. Check status: `networkctl [status [-a]]`
-1. Restart the system and check if still working. This will also kill any dhclient daemons which could trigger a DHCP renew at some point.
+1. Restart the system to make sure all ifupdown stuff is stopped (like orphaned dhclients).
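+
+A minimal `lan.network` sketch with static addressing and RAs disabled (interface name and addresses are placeholders):
+
+```
+[Match]
+Name=eth0
+
+[Network]
+Address=192.0.2.10/24
+Gateway=192.0.2.1
+Address=2001:db8::10/64
+Gateway=2001:db8::1
+DNS=192.0.2.53
+# Don't autoconfigure from router advertisements
+IPv6AcceptRA=no
+```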
+
+##### Configure IPv6/NDP/RA Securely
+
+Prevent enabled (and potentially untrusted) interfaces from accepting router advertisements and autoconfiguring themselves, unless autoconfiguration is what you intended.
+
+- Using ifupdown: Set `accept_ra 0` for all `inet6` interface sections.
+- Using systemd-networkd: Set `IPv6AcceptRA=no` in the `[Network]` section of the relevant network files (as in the example above).
+- Using firewall: If the network manager can't be set to ignore RAs, just block them (example below). Alternatively, block all ICMPv6 in/out if IPv6 shouldn't be used on this interface at all.
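+
+A sketch of blocking incoming RAs with ip6tables (the interface name is a placeholder; adapt to whatever firewall tooling is actually in use):
+
+```
+# Drop incoming router advertisements (ICMPv6 type 134) on the given interface
+ip6tables -A INPUT -i eth0 -p icmpv6 --icmpv6-type router-advertisement -j DROP
+```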
 
 #### Firewall
 
@@ -160,26 +176,30 @@ Using systemd-networkd (alternative 2):
 
 #### DNS
 
-Manual (default, alternative 1):
+##### Using resolv.conf (Alternative 1)
+
+The simplest alternative, without any local system caching.
 
 1. Manually configure `/etc/resolv.conf`.
 
-Using systemd-resolved (alternative 2):
+##### Using systemd-resolved (Alternative 2)
 
 1. (Optional) Make sure no other local DNS servers (like dnsmasq) are running.
 1. Configure `/etc/systemd/resolved.conf` (example below):
     - `DNS`: A space-separated list of DNS servers.
     - `Domains`: A space-separated list of search domains.
 1. (Optional) If you're hosting a DNS server on this machine, set `DNSStubListener=no` to avoid binding to port 53.
-1. Enable and start `systemd-resolved.service`.
-1. Point `/etc/resolv.conf` to the one generated by systemd: `ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf`
+1. Enable the service: `systemctl enable --now systemd-resolved.service`
+1. Point `resolv.conf` to the one generated by systemd: `ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf`
 1. Check status: `resolvectl`
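+
+A minimal `resolved.conf` sketch (the upstream servers and search domain are just examples):
+
+```
+[Resolve]
+DNS=1.1.1.1 9.9.9.9
+Domains=example.net
+#DNSStubListener=no
+```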
 
 #### NTP
 
+This is typically correct by default.
+
 1. Check the timezone and network time status: `timedatectl`
-1. Fix the timezone: `timedatectl set-timezone Europe/Oslo`
-1. Fix enable network time: `timedatectl set-ntp true`
+1. (Optional) Fix the timezone: `timedatectl set-timezone Europe/Oslo`
+1. (Optional) Enable network time: `timedatectl set-ntp true`
 1. Configure `/etc/systemd/timesyncd.conf` (example below):
     - `NTP` (optional): A space-separated list of NTP servers. The defaults are fine.
 1. Restart `systemd-timesyncd`.
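+
+A minimal `timesyncd.conf` sketch (the server list is just an example):
+
+```
+[Time]
+NTP=0.debian.pool.ntp.org 1.debian.pool.ntp.org
+```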
@@ -238,10 +258,8 @@ Everything here is optional.
     - (Optional) Add a MOTD to `/etc/motd`.
     - (Optional) Clear or change the pre-login message in `/etc/issue`.
     - Test it: `su - <some-normal-user>`
-- Monitor free disk space (using custom script):
-    - Download [disk-space-checker.sh](https://github.com/HON95/scripts/blob/master/server/linux/general/disk-space-checker.sh) either to `/cron/cron.daily/` or to `/opt/bin` and create a cron job for it.
-    - Example cron job (15 minutes past every 4 hours): `15 */4 * * * root /opt/bin/disk-space-checker`
-    - Configure which disks/file systems it should exclude and how full they should be before it sends an email alert.
+- Setup monitoring:
+    - Use Prometheus with node exporter or something and set up alerts.
 
 ## Troubleshooting
 

+ 158 - 0
config/linux-server/storage-ceph.md

@@ -0,0 +1,158 @@
+---
+title: 'Linux Server Storage: Ceph'
+breadcrumbs:
+- title: Configuration
+- title: Linux Server
+---
+{% include header.md %}
+
+Using Debian.
+
+## Resources
+
+- [Ceph: Ceph PGs per Pool Calculator](https://ceph.com/pgcalc/)
+- [Ceph Documentation: Placement Group States](https://docs.ceph.com/docs/mimic/rados/operations/pg-states/)
+
+## Info
+
+- Distributed storage for HA.
+- Redundant and self-healing without any single point of failure.
+- The Ceph Storage Cluster consists of:
+    - Monitors (typically one per node) for monitoring the state of itself and other nodes.
+    - Managers (at least two for HA) for serving metrics and statuses to users and external services.
+    - OSDs (object storage daemons) (one per disk) for handling storage of data, replication, etc.
+    - Metadata Servers (MDSes) for storing metadata for POSIX file systems to function properly and efficiently.
+- At least three monitors are required for HA, because of quorum.
+- Each node connects directly to OSDs when handling data.
+- Pools consist of a number of placement groups (PGs) and OSDs, where each PG uses a number of OSDs.
+- Replication factor (aka size):
+    - Replication factor *n*/*m* (e.g. 3/2) means replication factor *n* with minimum replication factor *m*. One of them is often omitted.
+    - The replication factor specifies how many copies of the data will be stored.
+    - The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful and unblocks the write operation.
+    - Replication factor *n* means the data will be stored on *n* different OSDs/disks on different nodes,
+      and that *n-1* nodes may fail without losing data.
+- When an OSD fails, Ceph will try to rebalance the data (with replication factor over 1) onto other OSDs to regain the correct replication factor.
+- A PG must have state *active* in order to be accessible for RW operations.
+- The number of PGs in an existing pool can be increased but not decreased.
+- Clients only interact with the primary OSD in a PG.
+- The CRUSH algorithm is used for determining storage locations based on hashing the pool and object names. It avoids having to index file locations.
+- BlueStore (default OSD back-end):
+    - Creates two partitions on the disk: One for metadata and one for data.
+    - The metadata partition uses an XFS FS and is mounted to `/var/lib/ceph/osd/ceph-<osd-id>`.
+    - The metadata file `block` points to the data partition.
+    - The metadata file `block.wal` points to the journal device/partition if it exists (it does not by default).
+    - Separate OSD WAL/journal and DB devices may be set up, typically when using HDDs or a mix of HDDs and SSDs.
+    - One OSD WAL device can serve multiple OSDs.
+    - OSD WAL devices should be sized according to how much data they should "buffer".
+    - OSD DB devices should be at least 4% as large as the backing OSDs. If they fill up, they will spill onto the OSDs and reduce performance.
+    - If the fast storage space is limited (e.g. less than 1GB), use it as an OSD WAL. If it is large, use it as an OSD DB.
+    - Using a DB device will also provide the benefits of a WAL device, as the journal is always placed on the fastest device.
+    - A lost OSD WAL/DB is equivalent to losing all the OSDs it serves. (For the older Filestore back-end, it used to be possible to recover it.)
+
+## Guidelines
+
+- Use at least 3 nodes.
+- CPU: Metadata servers and, to some degree, OSDs are somewhat CPU intensive. Monitors are not.
+- RAM: OSDs should have ~1GB per 1TB of storage, even though they typically don't use that much.
+- Use a replication factor of at least 3/2.
+- Run OSes, OSD data and OSD journals on separate drives.
+- Network:
+    - Use an isolated, separate physical network for internal cluster traffic between nodes.
+    - Consider using 10G or higher with a spine-leaf topology.
+- Pool PG count:
+    - \<5 OSDs: 128
+    - 5-10 OSDs: 512
+    - 10-50 OSDs: 4096
+    - \>50 OSDs: See [pgcalc](https://ceph.com/pgcalc/).
+
+## Usage
+
+- General:
+    - List pools: `rados lspools` or `ceph osd lspools`
+- Show utilization:
+    - `rados df`
+    - `ceph df [detail]`
+    - `ceph osd df`
+- Show health and status:
+    - `ceph status`
+    - `ceph health [detail]`
+    - `ceph osd stat`
+    - `ceph osd tree`
+    - `ceph mon stat`
+    - `ceph osd perf`
+    - `ceph osd pool stats`
+    - `ceph pg dump pgs_brief`
+- Pools:
+    - Create: `ceph osd pool create <pool> <pg-num>`
+    - Delete: `ceph osd pool delete <pool> [<pool> --yes-i-really-mean-it]`
+    - Rename: `ceph osd pool rename <old-name> <new-name>`
+    - Make or delete snapshot: `ceph osd pool <mksnap|rmsnap> <pool> <snap>`
+    - Set or get values: `ceph osd pool <set|get> <pool> <key>`
+    - Set quota: `ceph osd pool set-quota <pool> [max_objects <count>] [max_bytes <bytes>]`
+- Interact with pools directly using RADOS:
+    - Ceph is built on top of RADOS.
+    - List files: `rados -p <pool> ls`
+    - Put file: `rados -p <pool> put <name> <file>`
+    - Get file: `rados -p <pool> get <name> <file>`
+    - Delete file: `rados -p <pool> rm <name>`
+- Manage RBD (Rados Block Device) images:
+    - Images are spread over multiple objects.
+    - List images: `rbd -p <pool> ls`
+    - Show usage: `rbd -p <pool> du`
+    - Show image info: `rbd info <pool/image>`
+    - Create image: `rbd create <pool/image> --object-size=<obj-size> --size=<img-size>`
+    - Export image to file: `rbd export <pool/image> <file>`
+    - Mount image: TODO (see the sketch after this list)
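+
+A rough sketch of mapping and mounting an RBD image from a client with a working Ceph config and keyring (pool, image and mount point names are placeholders):
+
+```
+# Map the image to a local block device (prints e.g. /dev/rbd0)
+rbd map tank/image1
+# Create a filesystem the first time (destructive!)
+mkfs.ext4 /dev/rbd0
+# Mount it
+mount /dev/rbd0 /mnt/image1
+# Unmount and unmap when done
+umount /mnt/image1
+rbd unmap /dev/rbd0
+```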
+
+### Failure Handling
+
+**Down + peering:**
+
+The placement group is offline because an OSD is unavailable and is blocking peering.
+
+1. `ceph pg <pg> query`
+1. Try to restart the blocked OSD.
+1. If restarting didn't help, mark OSD as lost: `ceph osd lost <osd>`
+    - No data loss should occur if using an appropriate replication factor.
+
+**Active degraded (X objects unfound):**
+
+Data loss has occurred, but metadata about the missing files exist.
+
+1. Check the hardware.
+1. Identify object names: `ceph pg <pg> query`
+1. Check which images the objects belong to: `ceph pg <pg> list_missing`
+1. Either restore or delete the lost objects: `ceph pg <pg> mark_unfound_lost <revert|delete>`
+
+**Inconsistent:**
+
+Typically combined with other states. May come up during scrubbing.
+Typically an early indicator of faulty hardware, so take note of which disk it is.
+
+1. Find inconsistent PGs: `ceph pg dump pgs_brief | grep -i inconsistent`
+    - Alternatively: `rados list-inconsistent-pg <pool>`
+1. Repair the PG: `ceph pg repair <pg>`
+
+### OSD Replacement
+
+1. Stop the daemon: `systemctl stop ceph-osd@<id>`
+    - Check: `systemctl status ceph-osd@<id>`
+1. Destroy OSD: `ceph osd destroy osd.<id> [--yes-i-really-mean-it]`
+    - Check: `ceph osd tree`
+1. Remove OSD from CRUSH map: `ceph osd crush remove osd.<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Remove the OSD: `ceph osd rm osd.<id>`
+    - Check that it's unmounted: `lsblk`
+    - Unmount it if not: `umount <dev>`
+1. Replace the physical disk.
+1. Zap the new disk: `ceph-disk zap <dev>`
+1. Create new OSD: `pveceph osd create <dev> [options]` (Proxmox VE)
+    - Optionally specify any WAL or DB devices.
+    - See [PVE: pveceph(1)](https://pve.proxmox.com/pve-docs/pveceph.1.html).
+    - Without PVE's `pveceph(1)`, a series of steps are required.
+    - Check that the new OSD is up: `ceph osd tree`
+1. Start the OSD daemon: `systemctl start ceph-osd@<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Check the health: `ceph health [detail]`
+
+{% include footer.md %}

+ 269 - 0
config/linux-server/storage-zfs.md

@@ -0,0 +1,269 @@
+---
+title: 'Linux Server Storage: ZFS'
+breadcrumbs:
+- title: Configuration
+- title: Linux Server
+---
+{% include header.md %}
+
+Using ZFS on Linux (ZoL) running on Debian.
+
+## Info
+
+Note: ZFS's history (Oracle) and license (CDDL, which is incompatible with the Linux mainline kernel) are pretty good reasons to avoid ZFS.
+
+### Features
+
+- Filesystem and physical storage decoupled
+- Always consistent
+- Intent log
+- Synchronous or asynchronous
+- Everything checksummed
+- Compression
+- Deduplication
+- Encryption
+- Snapshots
+- Copy-on-write (CoW)
+- Clones
+- Caching
+- Log-structured filesystem
+- Tunable
+
+### Terminology
+
+- Vdev
+- Pool
+- Dataset
+- Zvol
+- ZFS POSIX Layer (ZPL)
+- ZFS Intent Log (ZIL)
+- Adaptive Replacement Cache (ARC) and L2ARC
+- ZFS Event Daemon (ZED)
+
+### Encryption
+
+- ZoL v0.8.0 and newer supports native encryption of pools and datasets. This encrypts all data except some metadata like pool/dataset structure, dataset names and file sizes.
+- Datasets can be scrubbed, resilvered, renamed and deleted without unlocking them first.
+- Datasets will by default inherit encryption and the encryption key (the "encryption root") from the parent pool/dataset.
+- The encryption suite can't be changed after creation, but the keyformat can.
+
+## Setup
+
+### Installation
+
+The installation part is highly specific to Debian 10 (Buster). The backports repo is used to get the newest version of ZoL.
+
+1. Enable the Buster backports repo: See [Backports (Debian Wiki)](https://wiki.debian.org/Backports)
+    - Add the following lines to `/etc/apt/sources.list`
+        ```
+        deb http://deb.debian.org/debian buster-backports main contrib non-free
+        deb-src http://deb.debian.org/debian buster-backports main contrib non-free
+        ```
+1. Install: `apt install -t buster-backports zfsutils-linux`
+1. (Optional) Fix automatic unlocking of encrypted datasets: See encryption usage subsection.
+
+### Configuration
+
+1. (Optional) Set the max ARC size (worked example after this list):
+    - Command: `echo "options zfs zfs_arc_max=<bytes>" >> /etc/modprobe.d/zfs.conf`
+    - It should typically be around 15-25% of the physical RAM size on general nodes. It defaults to 50%.
+    - This is generally not required, ZFS should happily yield RAM to other processes that need it.
+1. Check that the cron scrub script exists:
+    - Typical location: `/etc/cron.d/zfsutils-linux`
+    - If it doesn't exist, add one which runs `/usr/lib/zfs-linux/scrub` e.g. monthly. It'll scrub all disks.
+1. Check that ZED is set up to send emails:
+    - In `/etc/zfs/zed.d/zed.rc`, make sure `ZED_EMAIL_ADDR="root"` is uncommented.
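+
+A worked example for a 4 GiB ARC limit (4 × 1024³ = 4294967296 bytes; the size is just an example):
+
+```
+# Persist the limit for future module loads
+echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
+# Apply it immediately if the module is already loaded
+echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
+```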
+
+## Usage
+
+### General
+
+- Show version: `zfs --version` or `modinfo zfs | grep '^version:'`
+- Be super careful when destroying stuff! ZFS never asks for confirmation. When entering dangerous commands, consider adding a `#` to the start to prevent running them half-way by accident.
+
+### Pools
+
+- Recommended pool options:
+    - Set the right physical block/sector size: `ashift=<9|12>` (for 2^9 and 2^12, use 12 if unsure)
+    - Enable compression: `compression=lz4` (use `zstd` when supported)
+    - Store extended attributes in the inodes: `xattr=sa` (`on` is default and stores them in a hidden file)
+    - Don't enable dedup.
+- Create pool (concrete sketch after this list):
+    - Format: `zpool create [options] <name> <levels-and-drives>`
+    - Basic example: `zpool create -o ashift=<9|12> -O compression=lz4 -O xattr=sa <name> [mirror|raidz|raidz2|...] <drives>`
+    - Create encrypted pool: See encryption section.
+    - Use absolute drive paths (`/dev/disk/by-id/` or similar).
+- View pool activity: `zpool iostat [-v] [interval]`
+    - Includes metadata operations.
+    - If no interval is specified, the operations and bandwidths are averaged from the system boot. If an interval is specified, the very first interval will still show this.
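+
+A concrete sketch of creating a mirrored, compressed pool (the pool name and disk IDs are placeholders):
+
+```
+zpool create -o ashift=12 -O compression=lz4 -O xattr=sa tank1 \
+    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2
+```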
+
+### Datasets
+
+- Recommended dataset options:
+    - Set quota: `quota=<size>`
+    - Set reservation: `reservation=<size>`
+- Create dataset:
+    - Format: `zfs create [options] <pool>/<name>`
+    - Use `-p` to create parent datasets if they don't already exist.
+    - Basic example: `zfs create -o quota=<size> -o reservation=<size> <pool>/<other-datasets>/<name>`
+- Properties:
+    - Properties may have the following sources, as seen in the "source" column: Local, default, inherited, temporary, received and none.
+    - Get: `zfs get {all|<property>} [-r] [dataset]` (`-r` for recursive)
+    - Set: `zfs set <property>=<value> <dataset>`
+    - Reset to default/inherit: `zfs inherit -S [-r] <property> <dataset>` (`-r` for recursive, `-S` to use the received value if one exists)
+
+### Snapshots
+
+- Create snapshot: `zfs snapshot [-r] <dataset>@<snapshot>`
+    - `-r` for "recursive".
+- Destroy snapshot: `zfs destroy [-r] <dataset>@<snapshot>`
+
+### Transfers
+
+- `zfs send` sends to STDOUT and `zfs recv` receives from STDIN.
+- Send basic: `zfs send [-R] <snapshot>`
+    - `-R` for "replication" to include descendant datasets, snapshots, clones and properties.
+- Receive basic: `zfs recv -Fus <snapshot>`
+    - `-F` to destroy existing datasets and snapshots.
+    - `-u` to avoid mounting it.
+    - `-s` to save a resumable token in case the transfer is interrupted.
+- Send incremental: `zfs send {-i|-I} <first-snapshot> <last-snapshot>` (example after this list)
+    - `-i` sends the delta between two snapshots, while `-I` sends the whole range of snapshots between the two mentioned.
+    - The first snapshot may be specified without the dataset name.
+- Resume interrupted transfer started with `recv -s`: Use `zfs get receive_resume_token` and `zfs send -t <token>`.
+- Send encrypted snapshots: See encryption subsection.
+- Send encrypted snapshot over SSH (full example): `sudo zfs send -Rw tank1@1 | pv | ssh node2 sudo zfs recv tank2/tank1`
+    - Make sure you don't need to enter a sudo password on the other node, as that would break the piped transfer.
+- Consider running it in a screen session or something to avoid interruption.
+- To show transfer info (duration, size, throughput), pipe it through `pv`.
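+
+A sketch of a full transfer followed by an incremental one (dataset and snapshot names are placeholders):
+
+```
+# Initial full transfer of snapshot @1
+zfs send -R tank1/data@1 | zfs recv -Fus tank2/data
+# Later: send all snapshots between @1 and @3 incrementally
+zfs send -R -I tank1/data@1 tank1/data@3 | zfs recv -us tank2/data
+```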
+
+### Encryption
+
+- Show stuff:
+    - Encryption root: `zfs get encryptionroot`
+    - Key status: `zfs get keystatus`. `unavailable` means locked and `-` means not encrypted.
+    - Mount status: `zfs get mountpoint` and `zfs get mounted`.
+- Fix automatic unlock when mounting at boot time:
+    1. Copy `/lib/systemd/system/zfs-mount.service` to `/etc/systemd/system/`.
+    1. Change `ExecStart=/sbin/zfs mount -a` to `ExecStart=/sbin/zfs mount -l -a` (add `-l`), so that it loads encryption keys.
+    1. Reboot and test. It may fail due to dependency/boot order stuff.
+- Create a password encrypted pool: 
+    - Create: `zpool create -O encryption=aes-128-gcm -O keyformat=passphrase ...`
+- Create a raw key encrypted pool:
+    - Generate the key: `dd if=/dev/urandom of=/root/keys/zfs/<tank> bs=32 count=1`
+    - Create: `zpool create <normal-options> -O encryption=aes-128-gcm -O keyformat=raw -O keylocation=file:///root/keys/zfs/<tank> <name> ...`
+- Encrypt an existing dataset by sending and receiving:
+    1. Rename the old dataset: `zfs rename <dataset> <old-dataset>`
+    1. Snapshot the old dataset: `zfs snapshot -r <dataset>@<snapshot-name>`
+    1. Command: `zfs send [-R] <snapshot> | zfs recv -o encryption=aes-128-gcm -o keyformat=raw -o keylocation=file:///root/keys/zfs/<tank> <new-dataset>`
+    1. Test the new dataset.
+    1. Delete the snapshots and the old dataset.
+    1. Note: All child datasets will be encrypted too (if `-r` and `-R` were used).
+    1. Note: The new dataset will become its own encryption root instead of inheriting from any parent dataset/pool.
+- Change encryption property:
+    - The key must generally already be loaded.
+    - Change `keyformat`, `keylocation` or `pbkdf2iters`: `zfs change-key -o <property>=<value> <dataset>`
+    - Inherit key from parent: `zfs change-key -i <dataset>`
+- Send raw encrypted snapshot:
+    - Example: `zfs send -Rw <dataset>@<snapshot> | <...> | zfs recv <dataset>`
+    - As with normal sends, `-R` is useful for including snapshots and metadata.
+    - Sending encrypted datasets requires using raw (`-w`).
+    - Encrypted snapshots sent as raw may be sent incrementally.
+
+### Error Handling and Replacement
+
+- Clear transient device errors: `zpool clear <pool> [device]`
+- If a pool is "UNAVAIL", it means it can't be recovered without corrupted data.
+- Replace a device and automatically copy data from the old device or from redundant devices: `zpool replace <pool> <old-device> <new-device>`
+- Bring a device online or offline: `zpool (online|offline) <pool> <device>`
+- Re-add device that got wiped: Take it offline and then online again.
+
+### Miscellanea
+
+- Add bind mount targeting ZFS mount:
+    - fstab entry (example): `/bravo/abc /export/abc none bind,defaults,nofail,x-systemd.requires=zfs-mount.service 0 0`
+    - The `x-systemd.requires=zfs-mount.service` is required to wait until ZFS has mounted the dataset.
+- For automatic creation and rotation of periodic snapshots, see the zfs-auto-snapshot subsection.
+
+## Best Practices and Suggestions
+
+- As far as possible, use raw disks and HBA disk controllers (or RAID controllers in IT mode).
+- Always use `/dev/disk/by-id/X`, not `/dev/sdX`.
+- Always manually set the correct ashift for pools.
+    - Should be the log-2 of the physical block/sector size of the device.
+    - E.g. 12 for 4kB (Advanced Format (AF), common on HDDs) and 9 for 512B (common on SSDs).
+    - Check the physical block size with `smartctl -i <dev>`.
+    - Keep in mind that some 4kB disks emulate/report 512B. They should be used as 4kB disks.
+- Always enable compression.
+    - Generally `lz4`. Maybe `zstd` when implemented. Maybe `gzip-9` for archiving.
+    - For uncompressible data, the worst case is that it does nothing (i.e. no loss from enabling it).
+    - The overhead is typically negligible. Only for super-high-bandwidth use cases (large NVMe RAIDs) may the compression overhead become noticeable.
+- Never use deduplication.
+    - It's generally not useful, depending on the use case.
+    - It's expensive.
+    - It may brick your ZFS server.
+- Generally always use quotas and reservations.
+- Avoid using more than 80% of the available space.
+- Make sure regular automatic scrubs are enabled.
+    - There should be a cron job/script or something.
+    - Run it e.g. every 2 weeks or monthly.
+- Snapshots are great for incremental backups. They're easy to send places.
+- Use quotas, reservations and compression.
+- Very frequent reads:
+    - E.g. for a static web root.
+    - Set `atime=off` to disable updating the access time for files.
+- Database (example dataset after this list):
+    - Disable `atime`.
+    - Use an appropriate recordsize with `recordsize=<size>`.
+        - InnoDB should use 16k for data files and 128k on log files (two datasets).
+        - PostgreSQL should use 8k (or 16k) for both data and WAL.
+    - Disable caching with `primarycache=metadata`. DBMSes typically handle caching themselves.
+        - For InnoDB.
+        - For PostgreSQL if the working set fits in RAM.
+    - Disable the ZIL with `logbias=throughput` to prevent writing twice.
+        - For InnoDB and PostgreSQL.
+        - Consider not using it for high-traffic applications.
+    - PostgreSQL:
+        - Use the same dataset for data and logs.
+        - Use one dataset per database instance. Requires you to specify it when creating the database.
+        - Don't use PostgreSQL checksums or compression.
+        - Example: `su postgres -c 'initdb --no-locale -E=UTF8 -n -N -D /db/pgdb1'`
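+
+A sketch of a PostgreSQL data dataset following the suggestions above (the pool/dataset names are placeholders):
+
+```
+zfs create -p -o recordsize=8k -o primarycache=metadata -o logbias=throughput -o atime=off tank1/db/pgdb1
+```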
+
+## Extra Notes
+
+- ECC memory is recommended but not required. It does not affect data corruption on disk.
+- It does not require large amounts of memory, but more memory allows it to cache more data. A minimum of around 1GB is suggested. Memory caching is termed ARC. By default it's limited to 1/2 of all available RAM. Under memory pressure, it releases some of it to other applications.
+- Compressed ARC is a feature which compresses and checksums the ARC. It's enabled by default.
+- A dedicated disk (e.g. an NVMe SSD) can be used as a secondary read cache. This is termed L2ARC (level 2 ARC). Only frequently accessed blocks are cached. The memory requirement will increase based on the size of the L2ARC. It should only be considered for pools with high read traffic, slow disks and lots of memory available.
+- A dedicated disk (e.g. an NVMe SSD) can be used for the ZFS intent log (ZIL), which is used for synchronous writes. This is termed SLOG (separate intent log). The disk must have low latency, high durability and should preferably be mirrored for redundancy. It should only be considered for pools with high synchronous write traffic on relatively slow disks.
+- Intel Optane is a perfect choice as both L2ARCs and SLOGs due to its high throughput, low latency and high durability.
+- Some SSD models come with a built-in cache. Make sure it actually flushes it on power loss.
+- ZFS is always consistent, even in case of data loss.
+- Bitrot is real.
+    - 4.2% to 34% of SSDs have at least one uncorrectable bit error (UBER: uncorrectable bit error rate) per year.
+    - External factors:
+        - Temperature.
+        - Bus power consumption.
+        - Data written by system software.
+        - Workload changes due to SSD failure.
+- Signs of drive failures:
+    - `zpool status <pool>` shows that a scrub has repaired any data.
+    - `zpool status <pool>` shows read, write or checksum errors (all values should be zero).
+- Database conventions:
+    - One app per database.
+    - Encode the environment and DBMS version into the dataset name, e.g. `theapp-prod-pg10`.
+
+## Related Software
+
+### zfs-auto-snapshot
+
+See [zfsonlinux/zfs-auto-snapshot (GitHub)](https://github.com/zfsonlinux/zfs-auto-snapshot/).
+
+- zfs-auto-snapshot automatically creates and rotates a certain number of snapshots for all/some datasets.
+- The project seems to be pretty dead, but it still works. Most alternatives are way more complex.
+- It uses a `zfs-auto-snapshot` script in each of the `cron.*` dirs, which may be modified to tweak things.
+- It uses the following intervals to snapshot all enabled datasets: `frequent` (15 minutes), `hourly`, `daily`, `weekly`, `monthly`
+- Installation: `apt install zfs-auto-snapshot`
+- By default it's enabled for all datasets. To disable it by default, add `--default-exclude` to each of the cron scripts, so that it's only enabled for datasets with property `com.sun:auto-snapshot` set to true.
+
+{% include footer.md %}

+ 2 - 335
config/linux-server/storage.md

@@ -209,343 +209,10 @@ To automount it, you need to actually enter it (or equivalent).
 
 ### Ceph
 
-#### Resources
-
-- (Ceph: Ceph PGs per Pool Calculator)[https://ceph.com/pgcalc/]
-- (Ceph Documentation: Placement Group States)[https://docs.ceph.com/docs/mimic/rados/operations/pg-states/]
-
-#### Info
-
-- Distributed storage for HA.
-- Redundant and self-healing without any single point of failure.
-- The Ceph Storeage Cluster consists of:
-    - Monitors (typically one per node) for monitoring the state of itself and other nodes.
-    - Managers (at least two for HA) for serving metrics and statuses to users and external services.
-    - OSDs (object storage daemon) (one per disk) for handles storing of data, replication, etc.
-    - Metadata Servers (MDSes) for storing metadata for POSIX file systems to function properly and efficiently.
-- At least three monitors are required for HA, because of quorum.
-- Each node connects directly to OSDs when handling data.
-- Pools consist of a number of placement groups (PGs) and OSDs, where each PG uses a number of OSDs.
-- Replication factor (aka size):
-    - Replication factor *n*/*m* (e.g. 3/2) means replication factor *n* with minimum replication factor *m*. One of them is often omitted.
-    - The replication factor specifies how many copies of the data will be stored.
-    - The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful and unblocks the write operation.
-    - Replication factor *n* means the data will be stored on *n* different OSDs/disks on different nodes,
-      and that *n-1* nodes may fail without losing data.
-- When an OSD fails, Ceph will try to rebalance the data (with replication factor over 1) onto other OSDs to regain the correct replication factor.
-- A PG must have state *active* in order to be accessible for RW operations.
-- The number of PGs in an existing pool can be increased but not decreased.
-- Clients only interact with the primary OSD in a PG.
-- The CRUSH algorithm is used for determining storage locations based on hashing the pool and object names. It avoids having to index file locations.
-- BlueStore (default OSD back-end):
-    - Creates two partitions on the disk: One for metadata and one for data.
-    - The metadata partition uses an XFS FS and is mounted to `/var/lib/ceph/osd/ceph-<osd-id>`.
-    - The metadata file `block` points to the data partition.
-    - The metadata file `block.wal` points to the journal device/partition if it exists (it does not by default).
-    - Separate OSD WAL/journal and DB devices may be set up, typically when using HDDs or a mix of HDDs and SSDs.
-    - One OSD WAL device can serve multiple OSDs.
-    - OSD WAL devices should be sized according to how much data they should "buffer".
-    - OSD DB devices should be at least 4% as large as the backing OSDs. If they fill up, they will spill onto the OSDs and reduce performance.
-    - If the fast storage space is limited (e.g. less than 1GB), use it as an OSD WAL. If it is large, use it as an OSD DB.
-    - Using a DB device will also provide the benefits of a WAL device, as the journal is always placed on the fastest device.
-    - A lost OSD WAL/DB will be equivalent to lose all OSDs. (For the older Filestore back-end, it used to be possible to recover it.)
-
-#### Guidelines
-
-- Use at least 3 nodes.
-- CPU: Metadata servers and partially OSDs are somewhat CPU intensive. Monitors are not.
-- RAM: OSDs should have ~1GB per 1TB storage, even though it typically doesn't use much.
-- Use a replication factor of at least 3/2.
-- Run OSes, OSD data and OSD journals on separate drives.
-- Network:
-    - Use an isolated separete physical network for internal cluster traffic between nodes.
-    - Consider using 10G or higher with a spine-leaf topology.
-- Pool PG count:
-    - \<5 OSDs: 128
-    - 5-10 OSDs: 512
-    - 10-50 OSDs: 4096
-    - \>50 OSDs: See (pgcalc)[https://ceph.com/pgcalc/].
-
-#### Usage
-
-- General:
-    - List pools: `rados lspools` or `ceph osd lspools`
-- Show utilization:
-    - `rados df`
-    - `ceph df [detail]`
-    - `deph osd df`
-- Show health and status:
-    - `ceph status`
-    - `ceph health [detail]`
-    - `ceph osd stat`
-    - `ceph osd tree`
-    - `ceph mon stat`
-    - `ceph osd perf`
-    - `ceph osd pool stats`
-    - `ceph pg dump pgs_brief`
-- Pools:
-    - Create: `ceph osd pool create <pool> <pg-num>`
-    - Delete: `ceph osd pool delete <pool> [<pool> --yes-i-really-mean-it]`
-    - Rename: `ceph osd pool rename <old-name> <new-name>`
-    - Make or delete snapshot: `ceph osd pool <mksnap|rmsnap> <pool> <snap>`
-    - Set or get values: `ceph osd pool <set|get> <pool> <key>`
-    - Set quota: `ceph osd pool set-quota <pool> [max_objects <count>] [max_bytes <bytes>]`
-- Interact with pools directly using RADOS:
-    - Ceph is built on based on RADOS.
-    - List files: `rados -p <pool> ls`
-    - Put file: `rados -p <pool> put <name> <file>`
-    - Get file: `rados -p <pool> get <name> <file>`
-    - Delete file: `rados -p <pool> rm <name>`
-- Manage RBD (Rados Block Device) images:
-    - Images are spread over multiple objects.
-    - List images: `rbd -p <pool> ls`
-    - Show usage: `rbd -p <pool> du`
-    - Show image info: `rbd info <pool/image>`
-    - Create image: `rbd create <pool/image> --object-size=<obj-size> --size=<img-size>`
-    - Export image to file: `rbd export <pool/image> <file>`
-    - Mount image: TODO
-
-##### Failure Handling
-
-**Down + peering:**
-
-The placement group is offline because an is unavailable and is blocking peering.
-
-1. `ceph pg <pg> query`
-1. Try to restart the blocked OSD.
-1. If restarting didn't help, mark OSD as lost: `ceph osd lost <osd>`
-    - No data loss should occur if using an appropriate replication factor.
-
-**Active degraded (X objects unfound):**
-
-Data loss has occurred, but metadata about the missing files exist.
-
-1. Check the hardware.
-1. Identify object names: `ceph pg <pg> query`
-1. Check which images the objects belong to: `ceph pg <pg list_missing>`
-1. Either restore or delete the lost objects: `ceph pg <pg> mark_unfound_lost <revert|delete>`
-
-**Inconsistent:**
-
-Typically combined with other states. May come up during scrubbing.
-Typically an early indicator of faulty hardware, so take note of which disk it is.
-
-1. Find inconsistent PGs: `ceph pg dump pgs_brief | grep -i inconsistent`
-    - Alternatively: `rados list-inconsistent pg <pool>`
-1. Repair the PG: `ceph pg repair <pg>`
-
-##### OSD Replacement
-
-1. Stop the daemon: `systemctl stop ceph-osd@<id>`
-    - Check: `systemctl status ceph-osd@<id>`
-1. Destroy OSD: `ceph osd destroy osd.<id> [--yes-i-really-mean-it]`
-    - Check: `ceph osd tree`
-1. Remove OSD from CRUSH map: `ceph osd crush remove osd.<id>`
-1. Wait for rebalancing: `ceph -s [-w]`
-1. Remove the OSD: `ceph osd rm osd.<id>`
-    - Check that it's unmounted: `lsblk`
-    - Unmount it if not: `umount <dev>`
-1. Replace the physical disk.
-1. Zap the new disk: `ceph-disk zap <dev>`
-1. Create new OSD: `pveceph osd create <dev> [options]` (Proxmox VE)
-    - Optionally specify any WAL or DB devices.
-    - See [PVE: pveceph(1)](https://pve.proxmox.com/pve-docs/pveceph.1.html).
-    - Without PVE's `pveceph(1)`, a series of steps are required.
-    - Check that the new OSD is up: `ceph osd tree`
-1. Start the OSD daemon: `systemctl start ceph-osd@<id>`
-1. Wait for rebalancing: `ceph -s [-w]`
-1. Check the health: `ceph health [detail]`
+See [Linux Server Storage: Ceph](../storage-ceph/).
 
 ### ZFS
 
-Using ZFS on Linux (ZoL).
-
-#### Info
-
-Note: ZFS's history (Oracle) and license (CDDL, which is incompatible with the Linux mainline kernel) are pretty good reasons to avoid ZFS.
-
-##### Features
-
-- Filesystem and physical storage decoupled
-- Always consistent
-- Intent log
-- Synchronous or asynchronous
-- Everything checksummed
-- Compression
-- Deduplication
-- Encryption
-- Snapshots
-- Copy-on-write (CoW)
-- Clones
-- Caching
-- Log-strucrured filesystem
-- Tunable
-
-##### Terminology
-
-- Vdev
-- Pool
-- Dataset
-- Zvol
-- ZFS POSIX Layer (ZPL)
-- ZFS Intent Log (ZIL)
-- Adaptive Replacement Cache (ARC) and L2ARC
-- ZFS Event Daemon (ZED)
-
-##### Encryption
-
-- ZoL v0.8.0 and newer supports native encryption of pools and datasets. This encrypts all data except some metadata like pool/dataset structure, dataset names and file sizes.
-- Datasets can be scrubbed, resilvered, renamed and deleted without unlocking them first.
-- Datasets will by default inherit encryption and the encryption key (the "encryption root") from the parent pool/dataset.
-- The encryption suite can't be changed after creation, but the keyformat can.
-
-#### Setup
-
-##### Installation
-
-The installation part is highly specific to Debian 10.
-Some guides recommend using backport repos, but this way avoids that.
-
-1. Enable the `contrib` and `non-free` repo areas.
-1. Install (will probably stall and fail): `apt install zfsutils-linux`
-1. Load the module: `modprobe zfs`
-1. Fix the install: `apt install`
-
-##### Configuration
-
-1. (Optional) Set the max ARC size: `echo "options zfs zfs_arc_max=<bytes>" >> /etc/modprobe.d/zfs.conf`
-    - It should typically be around 15-25% of the physical RAM size on general nodes. It defaults to 50%.
-    - This is generally not required, ZFS should happily yield RAM to other processes that need it.
-1. Check that the cron scrub script exists.
-    - Typical location: `/etc/cron.d/zfsutils-linux`
-    - If it doesn't exist, add one which runs `/usr/lib/zfs-linux/scrub` e.g. monthly. It'll scrub all disks.
-1. Check that ZED is set up to send emails.
-    - In `/etc/zfs/zed.d/zed.rc`, make sure `ZED_EMAIL_ADDR="root"` is uncommented.
-
-#### Usage
-
-- Recommended pool options:
-    - Set thr right physical block/sector size: `ashift=<9|12>` (for 2^9 and 2^12, use 12 if unsure)
-    - Enabel compression: `compression=lz4` (use `zstd` when supported)
-    - Store extended attributes in the inodes: `xattr=sa` (`on` is default and stores them in a hidden file)
-    - Don't enable dedup.
-- Recommended dataset options:
-    - Set quota: `quota=<size>`
-    - Set reservation: `reservation=<size>`
-- Create pool: `zpool create [options] <name> <levels-and-drives>`
-    - Create encrypted pool: See [encryption](#encryption-1).
-    - Example: `zpool create -o ashift=<9|12> -O compression=lz4 -O xattr=sa <name> [mirror|raidz|raidz2|...] <drives>`
-- Create dataset: `zfs create [options] <pool>/<name>`
-    - Example: `zfs create -o quota=<size> -o reservation=<size> <pool>/<other-datasets>/<name>`
-- Handle snapshots:
-    - Create: `zfs snapshot [-r] <dataset>@<snapshot>` (`-r` for "recursive")
-    - Destroy: `zfs destroy [-r] <dataset>@<snapshot>` (careful!)
-    - Send to STDOUT: `zfs send [-R] <snapshot>` (`-R` for "recursive")
-    - Receive from STDIN: `zfs recv <snapshot>`
-    - Resume interrupted transfer: Use `zfs get receive_resume_token` and `zfs send -t <token>`.
-    - Consider running it in a screen session or something to avoid interruption.
-    - If you want transfer information (throughput), pipe it through `pv`.
-- View activity: `zpool iostat [-v] [interval]`
-    - Includes metadata operations.
-    - If no interval is specified, the operations and bandwidths are averaged from the system boot. If an interval is specified, the very first interval will still show this.
-
-##### Error Handling and Replacement
-
-- Clear transient device errors: `zpool clear <pool> [device]`
-- If a pool is "UNAVAIL", it means it can't be recovered without corrupted data.
-- Replace a device and automatically copy data from the old device or from redundant devices: `zpool replace <pool> <old-device> <new-device>`
-- Bring a device online or offline: `zpool (online|offline) <pool> <device>`
-- Re-add device that got wiped: Take it offline and then online again.
-
-##### Encryption
-
-- Check stuff:
-    - Encryption root: `zfs get encryptionroot`
-    - Key status: `zfs get keystatus`. `unavailable` means locked and `-` means not encrypted.
-    - Mount status: `zfs get mountpoint` and `zfs get mounted`.
-- Fix automatic unlock when mounting at boot time:
-    1. Copy `/lib/systemd/system/zfs-mount.service` to `/etc/systemd/system/`.
-    1. Change `ExecStart=/sbin/zfs mount -a` to `ExecStart=/sbin/zfs mount -l -a` (add `-l`), so that it loads encryption keys.
-    1. Reboot and test. It may fail due to dependency/boot order stuff.
-- Create a password encrypted pool: `zpool create -O encryption=aes-128-gcm -O keyformat=passphrase ...`
-- Create a raw key encrypted pool:
-    - Generate the key: `dd if=/dev/random of=/root/.credentials/zfs/<tank> bs=64 count=1`
-    - Create the pool: `zpool create -O encryption=aes-128-gcm -O keyformat=raw -O keylocation=file:///root/.credentials/zfs/<tank> ...`
-- Encrypt an existing dataset by sending and receiving:
-    1. Rename the old dataset: `zfs rename <dataset> <old-dataset>`
-    1. Snapshot the old dataset: `zfs snapshot -r <dataset>@<snapshot-name>`
-    1. Command: `zfs send [-R] <snapshot> | zfs recv -o encryption=aes-128-gcm -o keyformat=raw -o keylocation=file:///root/.credentials/zfs/<tank> <new-dataset>`
-    1. Test the new dataset.
-    1. Delete the snapshots and the old dataset.
-    - All child datasets will be encrypted too (if `-r` and `-R` were used).
-    - The new dataset will become its own encryption root instead of inheriting from any parent dataset/pool.
-
-#### Best Practices and Suggestions
-
-- As far as possible, use raw disks and HBA disk controllers (or RAID controllers in IT mode).
-- Always use `/etc/disk/by-id/X`, not `/dev/sdX`.
-- Always manually set the correct ashift for pools.
-    - Should be the log-2 of the physical block/sector size of the device.
-    - E.g. 12 for 4kB (Advanced Format (AF), common on HDDs) and 9 for 512B (common on SSDs).
-    - Check the physical block size with `smartctl -i <dev>`.
-    - Keep in mind that some 4kB disks emulate/report 512B. They should be used as 4kB disks.
-- Always enable compression.
-    - Generally `lz4`. Maybe `zstd` when implemented. Maybe `gzip-9` for archiving.
-    - For uncompressable data, worst case it that it does nothing (i.e. no loss for enabling it).
-    - The overhead is typically negligible. Only for super-high-bandwidth use cases (large NVMe RAIDs), the compression overhead may become noticable.
-- Never use deduplication.
-    - It's generally not useful, depending on the use case.
-    - It's expensive.
-    - It may brick your ZFS server.
-- Generally always use quotas and reservations.
-- Avoid using more than 80% of the available space.
-- Make sure regular automatic scrubs are enabled.
-    - There should be a cron job/script or something.
-    - Run it e.g. every 2 weeks or monthly.
-- Snapshots are great for incremental backups. They're easy to send places.
-- Use quotas, reservations and compression.
-- Very frequent reads:
-    - E.g. for a static web root.
-    - Set `atime=off` to disable updating the access time for files.
-- Database:
-    - Disable `atime`.
-    - Use an appropriate recordsize with `recordsize=<size>`.
-        - InnoDB should use 16k for data files and 128k on log files (two datasets).
-        - PostgreSQL should use 8k (or 16k) for both data and WAL.
-    - Disable caching with `primarycache=metadata`. DMBSes typically handle caching themselves.
-        - For InnoDB.
-        - For PostgreSQL if the working set fits in RAM.
-    - Disable the ZIL with `logbias=throughput` to prevent writing twice.
-        - For InnoDB and PostgreSQL.
-        - Consider not using it for high-traffic applications.
-    - PostgreSQL:
-        - Use the same dataset for data and logs.
-        - Use one dataset per database instance. Requires you to specify it when creating the database.
-        - Don't use PostgreSQL checksums or compression.
-        - Example: `su postgres -c 'initdb --no-locale -E=UTF8 -n -N -D /db/pgdb1'`
-
-#### Extra Notes
-
-- ECC memory is recommended but not required. It does not affect data corruption on disk.
-- It does not require large amounts of memory, but more memory allows it to cache more data. A minimum of around 1GB is suggested. Memory caching is termed ARC. By default it's limited to 1/2 of all available RAM. Under memory pressure, it releases some of it to other applications.
-- Compressed ARC is a feature which compresses and checksums the ARC. It's enabled by default.
-- A dedicated disk (e.g. an NVMe SSD) can be used as a secondary read cache. This is termed L2ARC (level 2 ARC). Only frequently accessed blocks are cached. The memory requirement will increase based on the size of the L2ARC. It should only be considered for pools with high read traffic, slow disks and lots of memory available.
-- A dedicated disk (e.g. an NVMe SSD) can be used for the ZFS intent log (ZIL), which is used for synchronized writes. This is termed SLOG (separate intent log). The disk must have low latency, high durability and should preferrably be mirrored for redundancy. It should only be considered for pools with high synchronous write traffic on relatively slow disks.
-- Intel Optane is a perfect choice as both L2ARCs and SLOGs due to its high throughput, low latency and high durability.
-- Some SSD models come with a build-in cache. Make sure it actually flushes it on power loss.
-- ZFS is always consistent, even in case of data loss.
-- Bitrot is real.
-  - 4.2% to 34% of SSDs have one UBER (uncorrectable bit error rate) per year.
-  - External factors:
-    - Temperature.
-    - Bus power consumption.
-    - Data written by system software.
-    - Workload changes due to SSD failure.
-- Signs of drive failures:
-  - `zpool status <pool>` shows that a scrub has repaired any data.
-  - `zpool status <pool>` shows read, write or checksum errors (all values should be zero).
-- Database conventions:
-  - One app per database.
-  - Encode the environment and DMBS version into the dataset name, e.g. `theapp-prod-pg10`.
+See [Linux Server Storage: ZFS](../storage-zfs/).
 
 {% include footer.md %}

+ 2 - 0
index.md

@@ -53,6 +53,8 @@ Random collection of config notes and miscellaneous stuff. _Technically not a wi
 - [Debian](config/linux-server/debian/)
 - [Applications](config/linux-server/applications/)
 - [Storage](config/linux-server/storage/)
+- [Storage: ZFS](config/linux-server/storage-zfs/)
+- [Storage: Ceph](config/linux-server/storage-ceph/)
 - [Networking](config/linux-server/networking/)
 
 ### Media

+ 66 - 2
it/network/routing.md

@@ -74,6 +74,9 @@ breadcrumbs:
     - Update: Exchanges new route advertisements or withdrawals.
     - Notification: Signals errors and/or closes the session.
     - Keepalive: Shows it's still alive in the absence of update messages. Both keepalives and updates reset the hold timer.
+- Letter of Agency (LOA), Internet Routing Registry (IRR) and Resource Public Key Infrastructure (RPKI) are methods to secure BGP in order to prevent route leaks/hijacks.
+- The "default-free zone" (DFZ) is the set of ASes which have full-ish BGP tables instead of default routes.
+- Communities are used to exchange arbitrary policy information for announcements between peers. See [BGP Well-known Communities (IANA)](https://www.iana.org/assignments/bgp-well-known-communities/bgp-well-known-communities.xhtml).
 
 ### Attributes
 
@@ -91,11 +94,11 @@ Some important attributes:
 - Multi-exit discriminator (MED) (optional, non-transitive): When two ASes peer with multiple eBGP peerings, this number signals which of the two eBGP peerings should be used for incoming traffic (lower is preferred). This is only of significance between friendly ASes as ASes are selfish and free to ignore it (other alternatives for steering incoming traffic are AS path prepending, special communities and (as a very last resort) advertising more specific prefixes).
 - Local preference (well-known, discretionary, non-transitive): A number used to prioritise outgoing paths to another AS (higher is preferred).
 - Weight (Cisco-proprietary): Like local pref., but not exchanged between iBGP peers.
-- Community (optional, transitive): A bit of extra information used to group prefixes that should be treated similarly within or between ASes. There exists a few well-known communities such as "internet" (advertise to all neighbors), "no-advertise" (don't advertise toBGP neighbors), "no-export" (don't export to eBGP neighbors) and "local-as" (don't advertise outside the sub-AS).
+- Community (optional, transitive): A bit of extra information used to group routes that should be treated similarly within or between ASes.
 
 ### Path Selection
 
-The path selection algorithm is used to select a single best path for each prefix. The following shows an ordered list of decisions for which route to use (based on Cisco, may be inaccurate):
+The path selection algorithm is used to select a single best path for a prefix. The following shows an ordered list of decisions for which route to use, based on Cisco routers:
 
 1. (Before path selection) Longest prefix match.
 1. Highest weight (Cisco).
@@ -108,4 +111,65 @@ The path selection algorithm is used to select a single best path for each prefi
 1. Lowest IGP metric.
 1. Lowest BGP router ID.
 
+### Internet Routing Registry (IRR)
+
+- IRR is a mechanism for BGP route origin validation using a set of routing registries.
+- It consists of IRR routing policy records which are hosted in one of the multiple IRR registries.
+- Records are typically created for the route (`route`), the ASN (`aut-num`) and the upstream ISP AS-SET (`as-set`) (example `route` object after this list).
+- IRR is out-of-band, meaning it does not affect how originating routers are configured. It should, however, be used as a source for building filtering policies toward peering ASes.
+- IRR uses the Routing Policy Specification Language (RPSL) for describing routing policies.
+- Due to outdated, inaccurate or missing data, IRR has not seen global deployment.
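+
+A minimal RPSL `route` object sketch (the prefix, ASN and maintainer are placeholders from documentation ranges):
+
+```
+route:   192.0.2.0/24
+descr:   Example prefix
+origin:  AS64500
+mnt-by:  EXAMPLE-MNT
+source:  RIPE
+```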
+
+#### Setting Up IRR in the RIPE Database
+
+- See [Managing Route Objects in the IRR (RIPE)](https://www.ripe.net/manage-ips-and-asns/db/support/managing-route-objects-in-the-irr).
+- The RIPE Database is tightly coupled with its IRR.
+- IRR policies are handled by `route(6)` objects, containing the ASN and IPv4/IPv6 prefix.
+- Authorization for managing `route(6)` objects can be a little complicated. Generally, the LIR is always allowed to manage it.
+
+### Resource Public Key Infrastructure (RPKI)
+
+- RPKI is a mechanism for BGP route origin validation using cryptographic methods.
+- Like IRR, it validates the route origin only instead of the full path. Since routes typically use the shortest path due to both economical and operational incentives, this is generally not a big problem. It's also typically the case that route leaks are misconfigurations rather than malicious attacks, which origin validation would mostly prevent.
+- It's certificate authority (CA)-based, but RPKI calls the CAs "trust anchors" (TAs). The five RIRs act as root CAs, which are also the entities allocating the ASNs and IP prefixes which RPKI attempts to secure. This simplifies RPKI management, as it's managed in the same place as ASNs and IP prefixes, and helps lock down access control for which orgs may create ROAs for which resources.
+- The main components are route origin authorization (ROA) records, which are certificates containing a prefix and an ASN.
+- ROAs are X.509 certificates. See RFCs 5280 and 3779.
+- IANA maintains lists for which ASNs, IPv4 prefixes and IPv6 prefixes are assigned to which RIR, which is also used to determine which RIR to use for RPKI.
+- Unlike DNSSEC where IANA is the single CA root (and IANA reporting to the US government), RPKI uses separate trees/TAs for each RIR, slightly more similar to web CAs (with arguably _too many_ CAs). There are some legal/political issues when the RIRs operate as TAs too, though.
+- RPKI typically runs out-of-band on servers called "validators" paired with the routers. For routers supporting it, the RPKI router protocol (RTR) may be used to feed the list of validated ROAs (aka VRPs or the validated cache, see other notes). It's recommended to use multiple validators for each router for redundancy. To reduce the number of validators, many routers may access common, remote validators over some secure transport link. The validators must periodically update their local databases from the RIRs' ones. If the route validators run in parallel with the routers, they have a negligible impact on convergence speed.
+- RPKI Repository Delta Protocol (RRDP) (RFC 8182) is designed to fetch RPKI data from TAs and is based on HTTPS. It has replaced rsync due to rsync being inefficient and not scalable for the purpose.
+- If all validators become unavailable or all ROAs expire, RPKI will fall back to accepting all routes (the standard policy when a ROA is not found).
+- Trust anchor locators (TALs) are used to retrieve the RIRs' TAs and consists of the URL to retrieve it as well as a public key to verify its authenticity. This allows TAs to be rotated more easily.
+- RIPE, APNIC, AFRNIC and LACNIC distribute their TALs publicly, but for ARIN you have to explicitly agree to their terms before you can get it.
+- All RIRs offer hosted RPKI managed through the RIR portal, but it can also be hosted internally for large organizations, called delegated RPKI.
+- ROAs contain a max prefix length field, which limits how long prefixes the AS is allowed to advertise. This limits segmentation and helps prevent longer-prefix attacks.
+- Validation of a ROA results in a validated ROA payload (VRP), consisting of the IP prefix (same length or shorter), the maximum length and the origin ASN. Comparing router advertisements with VRPs has one of three possible outcomes:
+    - Valid: At least one VRP (maybe multiple) contains the prefix with the correct origin ASN and allowed prefix length. The route should be accepted.
+    - Invalid: A VRP for the prefix exists, but the ASN doesn't match or the length is longer than the maximum. The route should be rejected.
+    - Not found: No VRP with a matching prefix was found. The route should be accepted (until RPKI is globally deployed, at least).
+- ROAs are fetched and processed periodically (30-60 minutes preferably) to produce a list of VRPs, aka a validated cache. ROAs that are expired or are otherwise cryptographically erroneous are discarded and thus will not be used to validate route announcements.
+- Local overrides may be used for VRPs, e.g. for cases where a temporarily invalid announcement must be accepted. See Simplified Local Internet Number Resource Management with the RPKI (SLURM) (RFC 8416)
+
+#### Setting Up RPKI ROAs in the RIPE Database
+
+- See [Managing ROAs (RIPE)](https://www.ripe.net/manage-ips-and-asns/resource-management/rpki/resource-certification-roa-management).
+- For PA space, only the LIR is authorized to manage ROAs.
+
+#### Resources
+
+- [RPKI Documentation (NLnet Labs)](https://rpki.readthedocs.io)
+- [RPKI Test (RIPE)](http://www.ripe.net/s/rpki-test)
+
+### Best Practices
+
+- Announced prefix lengths (max /24 and /48): Generally, use a maximum length of 24 for IPv4 and 48 for IPv6, due to longer prefixes being commonly filtered. See [Visibility of IPv4 and IPv6 Prefix Lengths in 2019 (RIPE)](https://labs.ripe.net/Members/stephen_strowes/visibility-of-prefix-lengths-in-ipv4-and-ipv6).
+- IRR and RPKI: Add `route(6)` objects (for IRR) and ROAs (for RPKI) for all prefixes, both to avoid having your prefixes hijacked and to reduce the risk of getting filtered.
+- Explicit import & export policies: Always explicitly define the input and output policies to avoid route leakage. Certain routers default to announcing everything if no policy is defined, but RFC 8212 defines a safe default policy of filtering all routes if no policy is explicitly defined.
+- Enable large communities: Classic communities (with 2-byte ASN fields) are outdated; enable large communities (12 bytes) to allow for more advanced policies and to keep up with 4-byte ASNs. See RFCs 8092 and 8195.
+- Administrative shutdown message: When administratively shutting down a session (due to maintenance or something), set a message to explain why to the other peer. Peers should log received shutdown messages. See RFC 9003, which adds support for this free-form 128-byte UTF-8 message in the BGP notification message.
+- Voluntary shutdown (for BGP-speaking routers): Before maintenance where the router is unable to route traffic, shut down BGP peering sessions and wait for BGP convergence around the router to avoid/reduce temporary blackholing. Aka voluntary session culling and voluntary session teardown. See RFC 8327.
+- Involuntary shutdown (for IXPs): Before maintenance which will prevent connected routers from forwarding traffic through the IXP, apply an ACL or similar to filter all BGP communication (TCP/179) between directly connected routers and wait for BGP convergence around the IXP to avoid/reduce temporary blackholing. Multihop sessions may be allowed. This is an alternative to or in addition to voluntary shutdown, as the routers are generally managed by orgs other than the one managing the IXP. Related to voluntary shutdown and described by the same RFC.
+- Use and support the graceful shutdown community: The well-known community GRACEFUL_SHUTDOWN (65535:0) is used to signal graceful shutdown of announced routes. Peers should support this community by adding a policy matching the community, which reduces the LOCAL_PREF to 0 or similar such that other paths are preferred and installed in the routing table, to eliminate the impact when the router finally shuts down the session. See RFC 8326 and the filter sketch after this list.
+- Use and support the blackhole community: The well-known community BLACKHOLE (65535:666) is used to signal that the peer should discard traffic destined toward the prefix. This is mainly intended to stop DDoS attacks targeting a certain prefix before reaching the router advertising it, such that other non-targeted traffic may continue to use the link. While announced prefixes should generally avoid exceeding a certain max length, announcements with the blackhole community are typically allowed to be as specific as possible to narrow down the blackholed addresses (e.g. /32 for IPv4 and /128 for IPv6). See RFC 7999.
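+
+A sketch of honoring GRACEFUL_SHUTDOWN in an import filter, using BIRD 2-style syntax as an assumed example (the filter name is a placeholder; adapt to the routing daemon actually in use):
+
+```
+filter import_peer_example {
+    # Deprioritize routes tagged with GRACEFUL_SHUTDOWN (65535, 0)
+    if (65535, 0) ~ bgp_community then bgp_local_pref = 0;
+    accept;
+}
+```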
+
 {% include footer.md %}