Håvard O. Nordstrand 5 years ago
parent
commit
4c10e65f26
1 changed file with 114 additions and 9 deletions
  1. config/linux-server/storage.md (+114, -9)


@@ -85,6 +85,7 @@ This is just a suggestion for how to partition your main system drive. Since LVM
 ### Resources
 
 - [Ceph: Ceph PGs per Pool Calculator](https://ceph.com/pgcalc/)
+- [Ceph Documentation: Placement Group States](https://docs.ceph.com/docs/mimic/rados/operations/pg-states/)
 
 ### Info
 
@@ -95,24 +96,128 @@ This is just a suggestion for how to partition your main system drive. Since LVM
     - Managers (at least two for HA) for serving metrics and statuses to users and external services.
     - OSDs (object storage daemons) (one per disk) for handling data storage, replication, etc.
     - Metadata Servers (MDSs) for storing metadata for POSIX file systems to function properly and efficiently.
-- Multiple monitors, which uses quorum, are required for HA.
+- At least three monitors are required for HA, since monitors rely on quorum (a majority must remain available).
 - Each node connects directly to OSDs when handling data.
-- Pools consist of a number of placement groups (PGs) and OSDs.
-- Each PG uses a number of OSDs, as described by the replication factor.
+- Pools consist of a number of placement groups (PGs) and OSDs, where each PG uses a number of OSDs.
+- Replication factor (aka size):
+    - Replication factor *n*/*m* (e.g. 3/2) means replication factor *n* with minimum replication factor *m*. One of them is often omitted.
+    - The replication factor specifies how many copies of the data will be stored.
+    - The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful and unblocks the write operation.
+    - Replication factor *n* means the data will be stored on *n* different OSDs/disks on different nodes,
+      and that *n-1* nodes may fail without losing data. (See the sketch after this list.)
+- When an OSD fails, Ceph tries to rebalance the data onto other OSDs (assuming a replication factor above 1) in order to regain the correct replication factor.
+- A PG must have state *active* in order to be accessible for RW operations.
 - The number of PGs in an existing pool can be increased but not decreased.
-- The minimum replication factor describes the number of OSDs that must have received the data before the write is considered successful.
 - Clients only interact with the primary OSD in a PG.
 - The CRUSH algorithm is used for determining storage locations based on hashing the pool and object names. It avoids having to index file locations.
+- BlueStore (default OSD back-end):
+    - Creates two partitions on the disk: one for metadata and one for data.
+    - The metadata partition uses XFS and is mounted at `/var/lib/ceph/osd/ceph-<osd-id>`.
+    - The metadata file `block` points to the data partition.
+    - The metadata file `block.wal` points to the journal device/partition if it exists (it does not by default).
+    - Separate OSD WAL/journal and DB devices may be set up, typically when using HDDs or a mix of HDDs and SSDs.
+    - One OSD WAL device can serve multiple OSDs.
+    - OSD WAL devices should be sized according to how much data they should "buffer".
+    - OSD DB devices should be at least 4% as large as the backing OSDs. If they fill up, they will spill onto the OSDs and reduce performance.
+    - If the fast storage space is limited (e.g. less than 1GB), use it as an OSD WAL. If it is large, use it as an OSD DB.
+    - Using a DB device will also provide the benefits of a WAL device, as the journal is always placed on the fastest device.
+    - Losing an OSD WAL/DB device is equivalent to losing all OSDs backed by it. (For the older Filestore back-end, it used to be possible to recover it.)
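+
+As a minimal sketch of the replication factor settings and CRUSH mapping described above (the pool and object names are placeholders):
+
+```sh
+# Set replication factor 3/2 (size/min_size) on a pool.
+ceph osd pool set mypool size 3      # replication factor
+ceph osd pool set mypool min_size 2  # minimum replication factor
+ceph osd pool get mypool size        # verify
+# Show which PG and OSDs CRUSH maps a given object to.
+ceph osd map mypool myobject
+```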
+
+### Guidelines
+
+- Use at least 3 nodes.
+- CPU: Metadata servers and, to a lesser degree, OSDs are somewhat CPU intensive. Monitors are not.
+- RAM: OSDs should have ~1GB of RAM per 1TB of storage, even though they typically don't use that much.
+- Use a replication factor of at least 3/2.
+- Run OSes, OSD data and OSD journals on separate drives.
+- Network:
+    - Use a separate, isolated physical network for internal cluster traffic between nodes.
+    - Consider using 10G or higher with a spine-leaf topology.
+- Pool PG count (see the example after this list):
+    - \<5 OSDs: 128
+    - 5-10 OSDs: 512
+    - 10-50 OSDs: 4096
+    - \>50 OSDs: See [pgcalc](https://ceph.com/pgcalc/).
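+
+For example, applying the PG count guideline when creating a pool on a cluster with fewer than 5 OSDs (the pool name is a placeholder):
+
+```sh
+# Create a replicated pool with 128 PGs.
+ceph osd pool create mypool 128
+```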
 
 ### Usage
 
-- Interact with pools using rados:
-    - List pools: `rados lspools`
-    - Show utilization: `rados df`
+- General:
+    - List pools: `rados lspools` or `ceph osd lspools`
+    - Show pool utilization: `rados df`
+- Pools:
+    - Create: `ceph osd pool create <pool> <pg-num>`
+    - Delete: `ceph osd pool delete <pool> <pool> --yes-i-really-mean-it` (the pool name must be repeated to confirm)
+    - Rename: `ceph osd pool rename <old-name> <new-name>`
+    - Make or delete snapshot: `ceph osd pool <mksnap|rmsnap> <pool> <snap>`
+    - Set or get values: `ceph osd pool <set|get> <pool> <key>`
+    - Set quota: `ceph osd pool set-quota <pool> [max_objects <count>] [max_bytes <bytes>]`
+- PGs:
+    - Status of PGs: `ceph pg dump pgs_brief`
+- Interact with pools directly using RADOS:
+    - Ceph is built on RADOS (Reliable Autonomic Distributed Object Store).
     - List files: `rados -p <pool> ls`
     - Put file: `rados -p <pool> put <name> <file>`
     - Get file: `rados -p <pool> get <name> <file>`
     - Delete file: `rados -p <pool> rm <name>`
+- Manage RBD (RADOS Block Device) images:
+    - Images are spread over multiple objects.
+    - List images: `rbd -p <pool> ls`
+    - Show usage: `rbd -p <pool> du`
+    - Show image info: `rbd info <pool/image>`
+    - Create image: `rbd create <pool/image> --object-size=<obj-size> --size=<img-size>`
+    - Export image to file: `rbd export <pool/image> <file>`
+    - Mount image: TODO (see the sketch after this list).
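+
+The mount step is still marked TODO above. As a hedged sketch, the kernel RBD client can typically map an image to a block device, which is then mounted like any other (the pool/image names, device and mount point are placeholders):
+
+```sh
+# Map the image to a block device (prints e.g. /dev/rbd0).
+rbd map <pool>/<image>
+# Create a file system (first use only) and mount it.
+mkfs.ext4 /dev/rbd0
+mount /dev/rbd0 /mnt
+# Unmount and unmap when done.
+umount /mnt
+rbd unmap /dev/rbd0
+```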
+
+#### Failure Handling
+
+**Down + peering:**
+
+The placement group is offline because an OSD is unavailable and is blocking peering.
+
+1. Find the blocking OSD: `ceph pg <pg> query`
+1. Try to restart the blocking OSD.
+1. If restarting didn't help, mark the OSD as lost: `ceph osd lost <osd>`
+    - No data loss should occur if using an appropriate replication factor.
+
+**Active degraded (X objects unfound):**
+
+Data loss has occurred, but metadata about the missing objects still exists.
+
+1. Check the hardware.
+1. Identify object names: `ceph pg <pg> query`
+1. Check which images the objects belong to: `ceph pg <pg> list_missing`
+1. Either restore or delete the lost objects: `ceph pg <pg> mark_unfound_lost <revert|delete>`
+
+**Inconsistent:**
+
+Typically combined with other states and often surfaces during scrubbing.
+It is often an early indicator of faulty hardware, so take note of which disk it is.
+
+1. Find inconsistent PGs: `ceph pg dump pgs_brief | grep -i inconsistent`
+    - Alternatively: `rados list-inconsistent-pg <pool>`
+1. Repair the PG: `ceph pg repair <pg>`
+
+#### OSD Replacement
+
+1. Stop the daemon: `systemctl stop ceph-osd@<id>`
+    - Check: `systemctl status ceph-osd@<id>`
+1. Destroy OSD: `ceph osd destroy osd.<id> [--yes-i-really-mean-it]`
+    - Check: `ceph osd tree`
+1. Remove OSD from CRUSH map: `ceph osd crush remove osd.<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Remove the OSD: `ceph osd rm osd.<id>`
+    - Check that it's unmounted: `lsblk`
+    - Unmount it if not: `umount <dev>`
+1. Replace the physical disk.
+1. Zap the new disk: `ceph-disk zap <dev>`
+1. Create new OSD: `pveceph osd create <dev> [options]` (PVE)
+    - Specify any WAL or DB devices.
+    - See [PVE: pveceph(1)](https://pve.proxmox.com/pve-docs/pveceph.1.html).
+    - Without `pveceph osd create`, a series of manual steps is required.
+    - Check that the new OSD is up: `ceph osd tree`
+1. Start the OSD daemon: `systemctl start ceph-osd@<id>`
+1. Wait for rebalancing: `ceph -s [-w]`
+1. Check the health: `ceph health`
 
 ## ZFS
 
@@ -150,7 +255,7 @@ This is just a suggestion for how to partition your main system drive. Since LVM
 #### Installation
 
 The installation part is highly specific to Debian 10.
-Some guides recommend using backport repos, but this way doesn't require that.
+Some guides recommend using backport repos, but this way avoids that.
 
 1. Enable the `contrib` and `non-free` repo areas.
 1. Install (it might give errors): `zfs-dkms zfsutils-linux zfs-zed`
@@ -245,7 +350,7 @@ Some guides recommend using backport repos, but this way doesn't require that.
 
 ### Troubleshooting
 
-**TODO** Test if this is actually working.
+**TODO** Test whether this actually works; it may have worked just by accident.
 
 - `zfs-import-cache.service` fails to import pools because disks are not found:
   - Set `options scsi_mod scan=sync` in `/etc/modprobe.d/zfs.conf` to wait for iSCSI disks to come online before ZFS starts.