Håvard O. Nordstrand 5 years ago
parent
commit
4c10e65f26
1 changed file with 114 additions and 9 deletions
  1. config/linux-server/storage.md (+114, -9)


@@ -85,6 +85,7 @@ This is just a suggestion for how to partition your main system drive. Since LVM
 ### Resources
 
 - [Ceph: Ceph PGs per Pool Calculator](https://ceph.com/pgcalc/)
+- [Ceph Documentation: Placement Group States](https://docs.ceph.com/docs/mimic/rados/operations/pg-states/)
 
 ### Info
 
@@ -95,24 +96,128 @@ This is just a suggestion for how to partition your main system drive. Since LVM
     - Managers (at least two for HA) for serving metrics and statuses to users and external services.
     - OSDs (object storage daemons) (one per disk) for handling data storage, replication, etc.
     - Metadata Servers (MDSs) for storing metadata for POSIX file systems to function properly and efficiently.
-- Multiple monitors, which uses quorum, are required for HA.
+- At least three monitors are required for HA, since monitors rely on quorum (a majority must remain available).
 - Each node connects directly to OSDs when handling data.
-- Pools consist of a number of placement groups (PGs) and OSDs.
-- Each PG uses a number of OSDs, as described by the replication factor.
+- Pools consist of a number of placement groups (PGs) and OSDs, where each PG uses a number of OSDs.
+- Replication factor (aka size):
+    - Replication factor *n*/*m* (e.g. 3/2) means replication factor *n* with minimum replication factor *m*. One of them is often omitted.
+    - The replication factor specifies how many copies of the data will be stored.
+    - The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful and unblocks the write operation.
+    - Replication factor *n* means the data will be stored on *n* different OSDs/disks on different nodes,
+      and that *n-1* nodes may fail without losing data. (See the sketch after this list.)
+- When an OSD fails, Ceph tries to rebalance the data onto other OSDs (assuming a replication factor above 1) in order to regain the correct replication factor.
+- A PG must have state *active* in order to be accessible for RW operations.
 - The number of PGs in an existing pool can be increased but not decreased.
-- The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful.
 - Clients only interact with the primary OSD in a PG.
 - The CRUSH algorithm is used for determining storage locations based on hashing the pool and object names. It avoids having to index file locations.
+- BlueStore (default OSD back-end):
+    - Creates two partitions on the disk: one for metadata and one for data.
+    - The metadata partition uses XFS and is mounted at `/var/lib/ceph/osd/ceph-<osd-id>`.
+    - The metadata file `block` points to the data partition.
+    - The metadata file `block.wal` points to the journal device/partition if it exists (it does not by default).
+    - Separate OSD WAL/journal and DB devices may be set up, typically when using HDDs or a mix of HDDs and SSDs.
+    - One OSD WAL device can serve multiple OSDs.
+    - OSD WAL devices should be sized according to how much data they should "buffer".
+    - OSD DB devices should be at least 4% as large as the backing OSDs. If they fill up, they will spill onto the OSDs and reduce performance.
+    - If the fast storage space is limited (e.g. less than 1GB), use it as an OSD WAL. If it is large, use it as an OSD DB.
+    - Using a DB device will also provide the benefits of a WAL device, as the journal is always placed on the fastest device.
+    - Losing an OSD WAL/DB device is equivalent to losing all OSDs backed by it. (For the older Filestore back-end, it used to be possible to recover it.)
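+
+As a minimal sketch of the replication factor settings and CRUSH mapping described above (the pool and object names are placeholders):
+
+```sh
+# Set replication factor 3/2 (size/min_size) on a pool.
+ceph osd pool set mypool size 3      # replication factor
+ceph osd pool set mypool min_size 2  # minimum replication factor
+ceph osd pool get mypool size        # verify
+# Show which PG and OSDs CRUSH maps a given object to.
+ceph osd map mypool myobject
+```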
+
+### Guidelines
+
+- Use at least 3 nodes.
+- CPU: Metadata servers and, to a lesser degree, OSDs are somewhat CPU intensive. Monitors are not.
+- RAM: OSDs should have ~1GB of RAM per 1TB of storage, even though they typically don't use that much.
+- Use a replication factor of at least 3/2.
+- Run OSes, OSD data and OSD journals on separate drives.
+- Network:
+    - Use a separate, isolated physical network for internal cluster traffic between nodes.
+    - Consider using 10G or higher with a spine-leaf topology.
+- Pool PG count (see the example after this list):
+    - \<5 OSDs: 128
+    - 5-10 OSDs: 512
+    - 10-50 OSDs: 4096
+    - \>50 OSDs: See [pgcalc](https://ceph.com/pgcalc/).
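+
+For example, applying the PG count guideline when creating a pool on a cluster with fewer than 5 OSDs (the pool name is a placeholder):
+
+```sh
+# Create a replicated pool with 128 PGs.
+ceph osd pool create mypool 128
+```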
 
 ### Usage
 
-- Interact with pools using rados:
-    - List pools: `rados lspools`
-    - Show utilization: `rados df`
+- General:
+    - List pools: `rados lspools` or `ceph osd lspools`
+    - Show pool utilization: `rados df`
+- Pools:
+    - Create: `ceph osd pool create <pool> <pg-num>`
+    - Delete: `ceph osd pool delete <pool> <pool> --yes-i-really-mean-it` (the pool name must be repeated to confirm)
+    - Rename: `ceph osd pool rename <old-name> <new-name>`
+    - Make or delete snapshot: `ceph osd pool <mksnap|rmsnap> <pool> <snap>`
+    - Set or get values: `ceph osd pool <set|get> <pool> <key>`
+    - Set quota: `ceph osd pool set-quota <pool> [max_objects <count>] [max_bytes <bytes>]`
+- PGs:
+    - Status of PGs: `ceph pg dump pgs_brief`
+- Interact with pools directly using RADOS:
+    - Ceph is built on RADOS (Reliable Autonomic Distributed Object Store).
     - List files: `rados -p <pool> ls`
     - Put file: `rados -p <pool> put <name> <file>`
     - Get file: `rados -p <pool> get <name> <file>`
     - Delete file: `rados -p <pool> rm <name>`
+- Manage RBD (RADOS Block Device) images:
+    - Images are spread over multiple objects.
+    - List images: `rbd -p <pool> ls`
+    - Show usage: `rbd -p <pool> du`
+    - Show image info: `rbd info <pool/image>`
+    - Create image: `rbd create <pool/image> --object-size=<obj-size> --size=<img-size>`
+    - Export image to file: `rbd export <pool/image> <file>`
+    - Mount image: TODO (see the sketch after this list).
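+
+The mount step is still marked TODO above. As a hedged sketch, the kernel RBD client can typically map an image to a block device, which is then mounted like any other (the pool/image names, device and mount point are placeholders):
+
+```sh
+# Map the image to a block device (prints e.g. /dev/rbd0).
+rbd map <pool>/<image>
+# Create a file system (first use only) and mount it.
+mkfs.ext4 /dev/rbd0
+mount /dev/rbd0 /mnt
+# Unmount and unmap when done.
+umount /mnt
+rbd unmap /dev/rbd0
+```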
+
+#### Failure Handling
+
+**Down + peering:**
+
+The placement group is offline because an OSD is unavailable and is blocking peering.
+
+1. Find the blocking OSD: `ceph pg <pg> query`
+1. Try to restart the blocking OSD.
+1. If restarting didn't help, mark the OSD as lost: `ceph osd lost <osd>`
+    - No data loss should occur if using an appropriate replication factor.
+
+**Active degraded (X objects unfound):**
+
+Data loss has occurred, but metadata about the missing objects still exists.
+
+1. Check the hardware.
+1. Identify object names: `ceph pg <pg> query`
+1. Check which images the objects belong to: `ceph pg <pg> list_missing`
+1. Either restore or delete the lost objects: `ceph pg <pg> mark_unfound_lost <revert|delete>`
+
+**Inconsistent:**
+
+Typically combined with other states and often surfaces during scrubbing.
+It is often an early indicator of faulty hardware, so take note of which disk it is.
+
+1. Find inconsistent PGs: `ceph pg dump pgs_brief | grep -i inconsistent`
+    - Alternatively: `rados list-inconsistent-pg <pool>`
+1. Repair the PG: `ceph pg repair <pg>`
+
+#### OSD Replacement
+
+1. Stop the daemon: `systemctl stop ceph-osd@<id>`
+    - Check: `systemctl status ceph-osd@<id>`
+1. Destroy OSD: `ceph osd destroy osd.<id> [--yes-i-really-mean-it]`
+    - Check: `ceph osd tree`
+1. Remove OSD from CRUSH map: `ceph osd crush remove osd.<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Remove the OSD: `ceph osd rm osd.<id>`
+    - Check that it's unmounted: `lsblk`
+    - Unmount it if not: `umount <dev>`
+1. Replace the physical disk.
+1. Zap the new disk: `ceph-disk zap <dev>`
+1. Create new OSD: `pveceph osd create <dev> [options]` (PVE)
+    - Specify any WAL or DB devices.
+    - See [PVE: pveceph(1)](https://pve.proxmox.com/pve-docs/pveceph.1.html).
+    - Without `pveceph osd create`, a series of manual steps is required.
+    - Check that the new OSD is up: `ceph osd tree`
+1. Start the OSD daemon: `systemctl start ceph-osd@<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Check the health: `ceph health`
 
 ## ZFS
 
@@ -150,7 +255,7 @@ This is just a suggestion for how to partition your main system drive. Since LVM
 #### Installation
 
 The installation part is highly specific to Debian 10.
-Some guides recommend using backport repos, but this way doesn't require that.
+Some guides recommend using backport repos, but this way avoids that.
 
 1. Enable the `contrib` and `non-free` repo areas.
 1. Install (it might give errors): `zfs-dkms zfsutils-linux zfs-zed`
@@ -245,7 +350,7 @@ Some guides recommend using backport repos, but this way doesn't require that.
 
 ### Troubleshooting
 
-**TODO** Test if this is actually working.
+**TODO** Test whether this actually works; it may have worked just by accident.
 
 - `zfs-import-cache.service` fails to import pools because disks are not found:
   - Set `options scsi_mod scan=sync` in `/etc/modprobe.d/zfs.conf` to wait for iSCSI disks to come online before ZFS starts.