3 years ago · ef986ffd63
--- a/config/automation/ansible.md
+++ b/config/automation/ansible.md
@@ -6,14 +6,27 @@ breadcrumbs:
 
				 ---
			
 
				 {% include header.md %}
			
 
				 
			
 
				-## Ad Hoc Usage
			
 
				+## Usage
			
 
				 
			
 
				-- Run module for host: `ansible all -i <host>, -m <module> [-a <module-arg>]`
			
 
				-    - The comma after the host is required to treat it as a host list literal instead of an inventory file name.
			
 
				-    - Use `-i localhost, --connection=local` to run locally.
			
 
				-- Get facts (with optional filter): `ansible all -i <host>, -m setup -a 'filter=ansible_os_*'` (example fact filter)
			
 
				+### General
			
 
				 
			
 
				-## Playbooks
			
 
				+- Specify SSH password: `--ask-pass`
			
 
				+- Specify sudo password: `--ask-become-pass`
			
 
				+- Specify username: `--username=<username>`
			
 
				+- Specify SSH key: `--private-key=<key>` (use `/dev/null` to explicitly avoid SSH keys)
			
 
				+
			
 
				+### Ad Hoc
			
 
				+
			
 
				+- Basic usage: `ansible {all|<target>} -i <inventory> [-m <module>] [-a <module-arg>]`
			
 
				+    - To specify a hostname directly and not use an inventory file, specify `all -i <host>,` (with the comma).
			
 
				+    - To run locally, specify `all -i localhost, --connection=local`.
			
 
				+- Module examples:
			
 
				+    - Ping: `... -m ping`
			
 
				+    - Run command (default module): `... -a <cmd>`
			
 
				+    - Run complicated command (example): `... -a 'bash -c "nvidia-smi > /dev/null"'`
			
 
				+- Get facts (with optional filter): `ansible <...> -m setup -a 'filter=ansible_os_*'` (example fact filter)
			
 
				+
			
 
				+### Playbooks
			
 
				 
			
 
				 - Basic: `ansible-playbook <playbook>`
			
 
				 - Specify inventory file: `ansible-playbook -i <hosts> <playbook>`
			
@@ -21,7 +34,7 @@ breadcrumbs:
 
				 - Limit which tasks to run using tags (comma-separated): `ansible-playbook -t <tag> <playbook>`
			
 
				 - Use Vault password file: `ansible-playbook --vault-password-file <file> <...>`
			
 
				 
			
 
				-## Vault
			
 
				+### Vault
			
 
				 
			
 
				 - Use file for password: Just add the password as the only line in a file.
			
 
				 - Encrypt, prompt for secret, using password file: `ansible-vault encrypt_string --vault-password-file ~/.ansible_vault/stuff`
			
@@ -38,4 +51,10 @@ interpreter_python = /usr/bin/python3
 
				 host_key_checking = false
			
 
				 ```
			
 
				 
			
 
				+## Troubleshooting
			
 
				+
			
 
				+### Ansible Freezes when Connecting
			
 
				+
			
 
				+Probably caused by a password-protected SSH key. Add `--private-key=<keyfile>` to specify which SSH key to use or `--private-key=/dev/null` to avoid using any SSH key.
			
 
				+
			
 
				 {% include footer.md %}
			
--- a/config/virt-cont/docker.md
+++ b/config/virt-cont/docker.md
@@ -65,7 +65,10 @@ breadcrumbs:
 
				 
			
 
				 The toolkit is used for running CUDA applications within containers.
			
 
				 
			
 
				-See the [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker).
			
 
				+1. Add the repo: See the [installation guide](https://nvidia.github.io/nvidia-container-runtime).
			
 
				+1. Install: `apt install nvidia-container-toolkit` (not `nvidia-docker2`)
			
 
				+1. Fix [an ldconfig bug](https://github.com/NVIDIA/nvidia-docker/issues/1399) (Debian 11): In `/etc/nvidia-container-runtime/config.toml`, under the `nvidia-container-cli` section, set `ldconfig = "/sbin/ldconfig"` (remove the `@` prefix).
			
 
				+1. Test: `docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi`
			
 
				 
			
 
				 ## Usage
			
 
				 
			
--- a/config/virt-cont/podman.md
+++ b/config/virt-cont/podman.md
@@ -19,17 +19,21 @@ breadcrumbs:
 
				 
			
 
				 #### Debian
			
 
				 
			
 
				-1. Add Kubic repo (pre Debian 11∕Ubuntu 20.10 only):
			
 
				-    1. Install dependencies: `apt install curl wget gnupg2`
			
 
				+1. (Note) Debian 11, Ubuntu 20.10 etc. should have Podman in the main repos.
			
 
				+1. Add Kubic repo (Ubuntu 20.04 and older):
			
 
				+    1. Install dependencies: `apt install curl gnupg`
			
 
				     1. Get OS info: `source /etc/os-release`
			
 
				-    1. Add repo: `echo "deb http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /" | tee /etc/apt/sources.list.d/kubic-libcontainers.list`
			
 
				-    1. Add GPG key (old way): `wget -nv https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/xUbuntu_${VERSION_ID}/Release.key -O- | apt-key add -`
			
 
				+    1. Add GPG key: `curl -sSf https://download.opensuse.org/repositories/devel:kubic:libcontainers:stable/xUbuntu_${VERSION_ID}/Release.key | gpg --dearmor > /usr/share/keyrings/kubic-libcontainers-archive-keyring.gpg`
			
 
				+    1. Add repo: `echo "deb [signed-by=/usr/share/keyrings/kubic-libcontainers-archive-keyring.gpg] http://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /" | tee /etc/apt/sources.list.d/kubic-libcontainers.list`
			
 
				 1. Install: `apt install podman`
			
 
				-1. Enable (**TODO** required?): `systemctl enable --now podman.service podman.socket`
			
 
				+1. Enable auto-start:
			
 
				+    1. Enable: `systemctl enable --now podman-restart.service`
			
 
				+    1. (Note) The service is required to automatically start containers with `restart=always` on reboot.
			
 
				 1. Verify install: `podman info`
			
 
				 1. (Optional) Add Docker compat stuff:
			
 
				+    1. Set Docker executable link: `ln -s /usr/bin/podman /usr/bin/docker`
			
 
				     1. Set Docker socket path: `echo "DOCKER_HOST=unix:///run/podman/podman.sock" >> /etc/environment`
			
 
				-    1. Set Docker binary link: `ln -s /usr/bin/podman /usr/bin/docker`
			
 
				+    1. Set sudo to accept the socket path env var: `echo "Defaults env_keep += \"DOCKER_HOST\"" >> /etc/sudoers.d/podman-compat`
			
 
				 
			
 
				 #### Arch
			
 
				 
			
@@ -55,7 +59,10 @@ breadcrumbs:
 
				 
			
 
				 ### NVIDIA Container Toolkit
			
 
				 
			
 
				-**TODO**
			
 
				+1. Add the repo: See the [installation guide](https://nvidia.github.io/nvidia-container-runtime).
			
 
				+1. Install: `apt install nvidia-container-toolkit` (not `nvidia-docker2`)
			
 
				+1. Fix [an ldconfig bug](https://github.com/NVIDIA/nvidia-docker/issues/1399) (Debian 11): In `/etc/nvidia-container-runtime/config.toml`, under the `nvidia-container-cli` section, set `ldconfig = "/sbin/ldconfig"` (remove the `@` prefix).
			
 
				+1. Test: `podman run --privileged --rm docker.io/nvidia/cuda:11.0-base nvidia-smi`
			
 
				 
			
 
				 ## Usage
			
 
				 
			
@@ -64,14 +71,29 @@ breadcrumbs:
 
				 - See [Docker usage](../docker/#usage).
			
 
				     - Most commands are Docker clones and simply replacing `docker` with `podman` in the command will typically work.
			
 
				     - Configuration files are a bit different.
			
 
				-- Since Podman supports multiple default registries instead of just Docker Hub, it's recommended to prepend `docker.io/` to images you expect to find in Docker Hub.
			
 
				+- Registries:
			
 
				+    - Since Podman supports multiple default registries instead of just Docker Hub, it's recommended to prepend `docker.io/` to images you expect to find in Docker Hub.
			
 
				+- Auto-start:
			
 
				+    - The `podman-restart.service` provides auto-starting of containers.
			
 
				+    - Only containers with `restart=always` will be auto-started.
			
 
				+- Auto-updating:
			
 
				+    - Auto-updating is provided by a systemd timer and service.
			
 
				+    - Run `podman auto-update` to run manually.
			
 
				+    - Set label `io.containers.autoupdate=registry` on containers to enable auto-updates.
			
 
				+    - **TODO** Apparently this requires systemd-unit containers.
			
 
				 
			
 
				 ### Networking
			
 
				 
			
 
				+- Firewall:
			
 
				+    - Unlike Docker, you can't just restart some daemon to fix the firewall rules after reapplying your normal IPTables rules from a script or something.
			
 
				+    - (Bug) Doesn't open the ports when exposing ports from containers, for some reason. Works if changing the default forwarding actions to accept, but why would I do that. To work around it, you need to manually add forwarding accept rules to the container IP addresses.
			
 
				 - DNS:
			
 
				     - By default, the host's DNS domainname and servers will be set in the container's `/etc/resolv.conf`.
			
 
				 - IPv6:
			
 
				-    - Doesn't seem to be as broken/neglected as in Docker.
			
 
				+    - Doesn't seem to be as broken/neglected as in Docker. _To be continued ..._
			
 
				+    - (Bug) When creating a network with an IPv6 subnet, it ignores the provided IPv4 subnet and uses a default one instead.
			
 
				     - Add `--ipv6 --subnet=<subnet>/64` to enable on bridges (with NAT and firewalling, like IPv4).
			
 
				+- Miscellanea:
			
 
				+    - The MTU issues I had with Docker seem to be gone. It correctly receives packet-too-big messages when the upstream transport has a lower MTU.
			
 
				 
			
 
				 {% include footer.md %}
			
--- a/index.md
+++ b/index.md
@@ -124,13 +124,14 @@ Random collection of config notes and miscellaneous stuff. _Technically not a wi
 
				 
			
 
				 ### Network
			
 
				 
			
 
				-- [IPv4](/it/network/ipv4/)
			
 
				-- [IPv6](/it/network/ipv6/)
			
 
				+- [IPv4 Theory](/it/network/ipv4/)
			
 
				+- [IPv6 Theory](/it/network/ipv6/)
			
 
				 - [Network Architecture](/it/network/architecture/)
			
 
				-- [Switching](/it/network/switching/)
			
 
				-- [Routing](/it/network/routing/)
			
 
				+- [Routing Theory](/it/network/routing/)
			
 
				+- [BGP](/it/network/bgp/)
			
 
				+- [Switching Theory](/it/network/switching/)
			
 
				 - [Wireless Basics](/it/network/wireless-basics/)
			
 
				-- [WLAN](/it/network/wlan/)
			
 
				+- [WLAN Theory](/it/network/wlan/)
			
 
				 
			
 
				 ### Services
			
 
				 
			
--- a/it/network/bgp.md
+++ b/it/network/bgp.md
@@ -0,0 +1,135 @@
 
				+---
			
 
				+title: BGP
			
 
				+breadcrumbs:
			
 
				+- title: IT
			
 
				+- title: Network
			
 
				+---
			
 
				+{% include header.md %}
			
 
				+
			
 
				+### Related Pages
			
 
				+{:.no_toc}
			
 
				+
			
 
				+- [Routing Theory](../routing/)
			
 
				+
			
 
				+## General
			
 
				+
			
 
				+- A path vector protocol and the only EGP used on the Internet.
			
 
				+- Version 4 (BGP-4) with multiprotocol extensions (MBGP) is the most common version, which supports CIDR, route aggregation and _address families_ such as multicast IPv4, unicast IPv6 and VPN information for MPLS.
			
 
				+- Uses a set of attributes to describe a route (see subsection).
			
 
				+- Route filtering and RPKI are methods commonly used to prevent accidental or malicious misconfiguration where prefixes are routed to a place they should not be.
			
 
				+- Redistributing between BGP and an IGP should in most cases be avoided. BGP has huge routing tables. IGP is dumber than BGP. IGP flapping should not leak onto EGPs, where it may also be penalized by _BGP dampening_.
			
 
				+- Unlike typical IGPs, it does not support any kind of auto discovery of other BGP peers, but peers must instead be statically configured. It uses TCP port 179.
			
 
				+- Exterior BGP (eBGP) is used to advertise and receive routes from peers in other ASes, while interior BGP (iBGP) is used to distribute routes between all eBGP routers in the same AS. iBGP is used instead of an IGP because an IGP would lose BGP information.
			
 
				+- The iBGP split horizon rule: Routers are not allowed to advertise a route learned from one iBGP peering to another iBGP peering. This prevents loops in iBGP, but requires that peers must be connected in a full mesh. To reduce the complexity of the iBGP full mesh, techniques like route reflectors (RRs) (dividing the AS into clusters) and confederations (dividing the AS into sub-ASes) may be used.
			
 
				+- eBGP peers are generally required to be directly connected, which is enforced by using an IP TTL of 1. This limit may be relaxed by using multihop sessions. iBGP however is not subject to this requirement.
			
 
				+- Multihop sessions and TTL security are mutually exclusive features for increasing the number of allowed hops between eBGP peers, which is limited to 1 by default. TTL security (aka Generalized TTL Security Mechanism (GTSM), RFC 5082) inverts the TTL check, requiring a received TTL of at least 255 minus the configured number of hops. This prevents remote attackers from spoofing the number of hops, as the TTL is limited by its maximum value of 255.
			
 
				+- The synchronization rule: When a router receives a new route to announce from iBGP, it must first wait until it can validate the route from the IGP (in case iBGP is faster). This prevents announcing over eBGP a route that can't yet be routed within the AS.
			
 
				+- Full BGP tables are exchanged only during the start of peer sessions. Thereafter, only new announcements or withdrawals are exchanged.
			
 
				+- Network layer reachability information (NLRI) is basically what BGP calls prefixes/routes (and some extra information for address families other than IPv4).
			
 
				+- Message types:
			
 
				+    - Open: The first message sent when starting a session, for identifying each other's capabilities and exchanging basic information (not routes).
			
 
				+    - Update: Exchanges new route advertisements or withdrawals.
			
 
				+    - Notification: Signals errors and/or closes the session.
			
 
				+    - Keepalive: Shows it's still alive in the absence of update messages. Both keepalives and updates reset the hold timer.
			
 
				+- Internet Routing Registry (IRR) and Resource Public Key Infrastructure (RPKI) are methods to secure BGP in order to prevent route leaks/hijacks. All routes should use IRR and RPKI (for providing valid bindings of prefixes to ASNs).
			
 
				+- A Letter of Agency (aka Letter of Authorization) (LOA) is required in certain countries to be allowed to announce a prefix.
			
 
				+- The "default-free zone" (DFZ) is the set of ASes which have full-ish BGP tables instead of default routes.
			
 
				+- Communities are used to exchange arbitrary policy information for announcements between peers. See [BGP Well-known Communities (IANA)](https://www.iana.org/assignments/bgp-well-known-communities/bgp-well-known-communities.xhtml).
			
 
				+- "Soft reconfiguration" is a feature to cache all incoming raw announcements from peers, such that the BGP table can be quickly rebuilt if it needs to be cleared. This reduces the impact of clearing the table and is recommended, but does increase memory usage.
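
Review note: the TTL security bullet in this list can be illustrated with a small sketch. It follows the formula stated in the note (accept a packet only if its TTL is at least 255 minus the configured hop count). The function names are hypothetical, and real implementations may differ by one in how they count hops.

```python
MAX_TTL = 255  # largest value the 8-bit IP TTL field can hold


def gtsm_min_ttl(max_hops: int) -> int:
    """Lowest TTL accepted for a peer at most max_hops away (formula from the note)."""
    if not 0 <= max_hops <= MAX_TTL:
        raise ValueError("max_hops must be between 0 and 255")
    return MAX_TTL - max_hops


def gtsm_accepts(received_ttl: int, max_hops: int) -> bool:
    """Return True if a packet with this TTL passes the inverted TTL check."""
    return gtsm_min_ttl(max_hops) <= received_ttl <= MAX_TTL
```

A sender cannot start above 255 and every router on the way decrements the TTL, so a distant attacker cannot produce a TTL that passes the check.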
			
 
				+
			
 
				+## Attributes
			
 
				+
			
 
				+Classification:
			
 
				+
			
 
				+- Mandatory/discretionary: If the attribute must be included in all updates.
			
 
				+- Well-known/optional: If all implementations are required to recognize the attribute.
			
 
				+- Transitive/non-transitive: If an AS should advertise an attribute it received from one AS to other ASes or if it is only of significance between pairs of ASes/peers.
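
Review note: two of the three classification axes above are visible on the wire as bits in the attribute flags octet (RFC 4271). The sketch below decodes that octet. The helper names are hypothetical, and the mandatory/discretionary axis is not encoded in the flags, so it is left out.

```python
from dataclasses import dataclass

# Bit masks of the path attribute flags octet (RFC 4271, section 4.3).
OPTIONAL = 0x80         # 1 = optional, 0 = well-known
TRANSITIVE = 0x40       # 1 = transitive, 0 = non-transitive
PARTIAL = 0x20          # set when an optional transitive attribute was passed on unrecognized
EXTENDED_LENGTH = 0x10  # 1 = two-octet attribute length field


@dataclass(frozen=True)
class AttrFlags:
    optional: bool
    transitive: bool
    partial: bool
    extended_length: bool


def decode_attr_flags(flags: int) -> AttrFlags:
    """Split the flags octet into its four defined bits."""
    return AttrFlags(
        optional=bool(flags & OPTIONAL),
        transitive=bool(flags & TRANSITIVE),
        partial=bool(flags & PARTIAL),
        extended_length=bool(flags & EXTENDED_LENGTH),
    )


def describe(flags: int) -> str:
    """Render the classification in the wording used by the notes."""
    f = decode_attr_flags(flags)
    kind = "optional" if f.optional else "well-known"
    scope = "transitive" if f.transitive else "non-transitive"
    return f"{kind}, {scope}"
```

For example, `describe(0x80)` corresponds to how MED is classified in the list of important attributes.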
			
 
				+
			
 
				+Some important attributes:
			
 
				+
			
 
				+- Origin (well-known, mandatory): How the prefix entered BGP. 0/"i" means from IGP, 1/"e" means from EGP, 2/"?" means redistributed from other sources or static routes.
			
 
				+- AS path (well-known, mandatory): The path of ASes to pass through in order to reach the destination. An eBGP peer prepends its own ASN before advertising it to other peers. ASes are free to prepend their ASN multiple times in series to artificially make the path longer (BGP prefers the shortest AS path during the path selection algorithm). If an AS aggregates prefixes from other ASes, it may use AS sets to indicate all ASes from which it aggregated the prefixes, giving an AS path like e.g. `100, {200, 201}`.
			
 
				+- Next hop (well-known, mandatory): The address of the next hop towards the destination. eBGP peers will always change this to their own address but iBGP peers will never alter it.
			
 
				+- Multi-exit discriminator (MED) (optional, non-transitive): When two ASes peer with multiple eBGP peerings, this number signals which of the two eBGP peerings should be used for incoming traffic (lower is preferred). This is only of significance between friendly ASes as ASes are selfish and free to ignore it (other alternatives for steering incoming traffic are AS path prepending, special communities and (as a very last resort) advertising more specific prefixes).
			
 
				+- Local preference (well-known, discretionary, non-transitive): A number used to prioritise outgoing paths to another AS (higher is preferred).
			
 
				+- Weight (Cisco-proprietary): Like local pref., but not exchanged between iBGP peers.
			
 
				+- Community (optional, transitive): A bit of extra information used to group routes that should be treated similarly within or between ASes.
			
 
				+
			
 
				+## Path Selection
			
 
				+
			
 
				+The path selection algorithm is used to select a single best path for a prefix. The following shows an ordered list of decisions for which route to use, based on Cisco routers:
			
 
				+
			
 
				+1. (Before path selection) Longest prefix match.
			
 
				+1. Highest weight (Cisco).
			
 
				+1. Highest local pref.
			
 
				+1. Locally originated ("network" or "aggregate" command).
			
 
				+1. Shortest AS path.
			
 
				+1. Lowest origin (IGP then EGP then other).
			
 
				+1. Lowest MED (typically ignored).
			
 
				+1. eBGP over iBGP.
			
 
				+1. Lowest IGP metric.
			
 
				+1. Lowest BGP router ID.
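
Review note: the ordered decision list above maps naturally onto a tuple sort key. The sketch below is a simplified illustration, not any vendor's implementation. The `Path` fields and defaults (such as a local preference of 100) are assumptions, longest prefix match is assumed to have happened already, and MED is compared across all paths even though routers normally compare it only between paths from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Path:
    """One candidate path for a prefix (hypothetical, simplified model)."""
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: Tuple[int, ...] = ()
    origin: int = 0  # 0 = IGP, 1 = EGP, 2 = other/incomplete
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    router_id: int = 0


def preference_key(p: Path) -> tuple:
    """Smaller key = more preferred; fields follow the order of the list above."""
    return (
        -p.weight,                 # highest weight
        -p.local_pref,             # highest local pref
        not p.locally_originated,  # locally originated first
        len(p.as_path),            # shortest AS path
        p.origin,                  # lowest origin
        p.med,                     # lowest MED
        not p.is_ebgp,             # eBGP over iBGP
        p.igp_metric,              # lowest IGP metric
        p.router_id,               # lowest router ID
    )


def best_path(paths: Iterable[Path]) -> Path:
    """Pick the single best path among candidates for the same prefix."""
    return min(paths, key=preference_key)
```

Because tuples compare element by element, a later criterion is only consulted when all earlier ones tie, which mirrors how the list is meant to be read.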
			
 
				+
			
 
				+## Internet Routing Registry (IRR)
			
 
				+
			
 
				+- IRR is a mechanism for BGP route origin validation using a set of routing registries.
			
 
				+- It consists of IRR routing policy records which are hosted in one of the multiple IRR registries.
			
 
				+- Records are typically created for the route (`route`), the ASN (`aut-num`) and the upstream ISP AS-SET (`as-set`).
			
 
				+- IRR is out-of-band, meaning it does not affect how originating routers are configured. It should however be able to source filtering policies for peering ASNs somehow.
			
 
				+- IRR uses the Routing Policy Specification Language (RPSL) for describing routing policies.
			
 
				+- Due to outdated, inaccurate or missing data, IRR has not seen global deployment.
			
 
				+
			
 
				+### Setting Up IRR in the RIPE Database
			
 
				+
			
 
				+- See [Managing Route Objects in the IRR (RIPE)](https://www.ripe.net/manage-ips-and-asns/db/support/managing-route-objects-in-the-irr).
			
 
				+- The RIPE Database is tightly coupled with its IRR.
			
 
				+- IRR policies are handled by `route(6)` objects, containing the ASN and IPv4/IPv6 prefix.
			
 
				+- Authorization for managing `route(6)` objects can be a little complicated. Generally, the LIR is always allowed to manage it.
			
 
				+
			
 
				+## Resource Public Key Infrastructure (RPKI)
			
 
				+
			
 
				+- RPKI is a mechanism for BGP route origin validation using cryptographic methods.
			
 
				+- Like IRR, it validates the route origin only instead of the full path. Since routes typically use the shortest path due to both economical and operational incentives, this is generally not a big problem. It's also typically the case that route leaks are misconfigurations rather than malicious attacks, which origin validation would mostly prevent.
			
 
				+- It's certificate authority (CA)-based, but RPKI calls the CAs "trust anchors" (TAs). The five RIRs act as root CAs, which are also the entities allocating the ASN and IP prefixes which RPKI attempts to secure. This also simplifies RPKI management, as it's managed the same place as ASNs and IP prefixes. This also helps lock down access control for which orgs may create ROAs for which resources.
			
 
				+- The main components are route origin authorization (ROA) records, which are certificates containing a prefix and an ASN.
			
 
				+- ROAs are X.509 certificates. See RFCs 5280 and 3779.
			
 
				+- IANA maintains lists for which ASNs, IPv4 prefixes and IPv6 prefixes are assigned to which RIR, which is also used to determine which RIR to use for RPKI.
			
 
				+- Unlike DNSSEC where IANA is the single CA root (and IANA reporting to the US government), RPKI uses separate trees/TAs for each RIR, slightly more similar to web CAs (with arguably _too many_ CAs). There are some legal/political issues when the RIRs operate as TAs too, though.
			
 
				+- RPKI typically runs out-of-band on servers called "validators" paired with the routers. For routers supporting it, the RPKI router protocol (RTR) may be used to feed the list of validated ROAs (aka VRPs or the validated cache, see other notes). It's recommended to use multiple validators for each router for redundancy. To reduce the number of validators, many routers may access common, remote validators over some secure transport link. The validators must periodically update their local databases from the RIRs' ones. If the validators are running in parallel with the routers, they have a negligible impact on convergence speed.
			
 
				+- RPKI Repository Delta Protocol (RRDP) (RFC 8182) is designed to fetch RPKI data from TAs and is based on HTTPS. It has replaced rsync due to rsync being inefficient and not scalable for the purpose.
			
 
				+- If all validators become unavailable or all ROAs expire, RPKI will fall back to accepting all routes (the standard policy when a ROA is not found).
			
 
				+- Trust anchor locators (TALs) are used to retrieve the RIRs' TAs and consist of the URL to retrieve it as well as a public key to verify its authenticity. This allows TAs to be rotated more easily.
			
 
				+- RIPE, APNIC, AFRINIC and LACNIC distribute their TALs publicly, but for ARIN you have to explicitly agree to their terms before you can get it.
			
 
				+- All RIRs offer hosted RPKI managed through the RIR portal, but it can also be hosted internally for large organizations, called delegated RPKI.
			
 
				+- ROAs contain a max prefix length field, which limits how long prefixes the AS is allowed to advertise. This limits segmentation and helps prevent longer-prefix attacks.
			
 
				+- Validation of a ROA results in a validated ROA payload (VRP), consisting of the IP prefix (same length or shorter), the maximum length and the origin ASN. Comparing router advertisements with VRPs has one of three possible outcomes:
			
 
				+    - Valid: At least one VRP (maybe multiple) contains the prefix with the correct origin ASN and allowed prefix length. The route should be accepted.
			
 
				+    - Invalid: A VRP for the prefix exists, but the ASN doesn't match or the length is longer than the maximum. The route should be rejected.
			
 
				+    - Not found: No VRP with a matching prefix was found. The route should be accepted (until RPKI is globally deployed, at least).
			
 
				+- ROAs are fetched and processed periodically (30-60 minutes preferably) to produce a list of VRPs, aka a validated cache. ROAs that are expired or are otherwise cryptographically erroneous are discarded and thus will not be used to validate route announcements.
			
 
				+- Local overrides may be used for VRPs, e.g. for cases where a temporarily invalid announcement must be accepted. See Simplified Local Internet Number Resource Management with the RPKI (SLURM) (RFC 8416).
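
Review note: the three validation outcomes described in this list (valid, invalid, not found) can be sketched with the standard `ipaddress` module. This is a minimal illustration of the comparison only, with no fetching or cryptographic checks. The `VRP` class and `validate` function are hypothetical names, and the prefixes and ASNs in any usage are documentation values.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class VRP:
    """Validated ROA payload: prefix, maximum length and origin ASN."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        net = ipaddress.ip_network(vrp.prefix)
        # Skip VRPs of the other address family or that do not cover the route.
        if net.version != route.version or not route.subnet_of(net):
            continue
        covered = True
        # One matching VRP is enough, even if other covering VRPs disagree.
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    return "invalid" if covered else "not-found"
```

With a single VRP for `192.0.2.0/24` (max length 24, AS64496), the same prefix from AS64496 is valid, a `/25` inside it is invalid, and an unrelated prefix is not found.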
			
 
				+
			
 
				+### Setting Up RPKI ROAs in the RIPE Database
			
 
				+
			
 
				+- See [Managing ROAs (RIPE)](https://www.ripe.net/manage-ips-and-asns/resource-management/rpki/resource-certification-roa-management).
			
 
				+- For PA space, only the LIR is authorized to manage ROAs.
			
 
				+
			
 
				+### Resources
			
 
				+
			
 
				+- [RPKI Documentation (NLnet Labs)](https://rpki.readthedocs.io)
			
 
				+- [RPKI Test (RIPE)](http://www.ripe.net/s/rpki-test)
			
 
				+
			
 
				+## Best Practices
			
 
				+
			
 
				+- Announced prefix lengths (max /24 and /48): Generally, use a maximum length of 24 for IPv4 and 48 for IPv6, due to longer prefixes being commonly filtered. See [Visibility of IPv4 and IPv6 Prefix Lengths in 2019 (RIPE)](https://labs.ripe.net/Members/stephen_strowes/visibility-of-prefix-lengths-in-ipv4-and-ipv6).
			
 
				+- IRR and RPKI: Add `route(6)` objects (for IRR) and ROAs (for RPKI) for all prefixes, both to avoid having your prefixes hijacked and to reduce the risk of getting filtered.
			
 
				+- Explicit import & export policies: Always explicitly define the input and output policies to avoid route leakage. Certain routers default to announcing everything if no policy is defined, but RFC 8212 defines a safe default policy of filtering all routes if no policy is explicitly defined.
			
 
				+- Enable large communities: 2-byte communities are outdated, enable 12-byte communities to allow for more advanced policies and to keep up to date with 4-byte ASNs. See RFCs 8092 and 8195.
			
 
				+- Administrative shutdown message: When administratively shutting down a session (due to maintenance or something), set a message to explain why to the other peer. Peers should log received shutdown messages. See RFC 9003, which adds support for this free-form 128-byte UTF-8 message in the BGP notification message.
			
 
				+- Voluntary shutdown (for BGP-speaking routers): Before maintenance where the router is unable to route traffic, shut down BGP peering sessions and wait for BGP convergence around the router to avoid/reduce temporary blackholing. Aka voluntary session culling and voluntary session teardown. See RFC 8327.
			
 
				+- Involuntary shutdown (for IXPs): Before maintenance which will prevent connected routers from forwarding traffic through the IXP, apply an ACL or similar to filter all BGP communication (TCP/179) between directly connected routers and wait for BGP convergence around the IXP to avoid/reduce temporary blackholing. Multihop sessions may be allowed. This is an alternative to or an addition to voluntary shutdown, as the routers are generally managed by orgs other than the one managing the IXP. Related to voluntary shutdown and described by the same RFC.
			
 
				+- Use and support the graceful shutdown community: The well-known community GRACEFUL_SHUTDOWN (65535:0) is used to signal graceful shutdown of announced routes. Peers should support this community by adding a policy matching the community, which reduces the LOCAL_PREF to 0 or similar such that other paths are preferred and installed in the routing table, to eliminate the impact when the router finally shuts down the session. See RFC 8326.
			
 
				+- Use and support the blackhole community: The well-known community BLACKHOLE (65535:666) is used to signal that the peer should discard traffic destined toward the prefix. This is mainly intended to stop DDoS attacks targeting a certain prefix before they reach the router advertising it, such that other non-targeted traffic may continue to use the link. While announced prefixes should generally avoid exceeding a certain max length, announcements with the blackhole community are typically allowed to be as specific as possible to narrow down the blackhole addresses (e.g. /32 for IPv4 and /128 for IPv6). See RFC 7999.
			
 
				+- Add reject-by-default policies to avoid leaking routes when no policies have been explicitly defined.
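
Review note: a few of the practices above (the /24 and /48 length limits, GRACEFUL_SHUTDOWN and BLACKHOLE) can be combined into a toy import policy. This is only a sketch under assumptions. The function names, the returned action strings and the default local preference of 100 are hypothetical, and a real policy would live in the router configuration and would also verify that the peer is entitled to blackhole the prefix.

```python
import ipaddress
from typing import Set, Tuple

Community = Tuple[int, int]

# Well-known communities named in the notes (RFC 8326 and RFC 7999).
GRACEFUL_SHUTDOWN: Community = (65535, 0)
BLACKHOLE: Community = (65535, 666)

# Longest generally accepted prefix length per IP version.
MAX_LEN = {4: 24, 6: 48}


def encode_community(high: int, low: int) -> int:
    """Pack a high:low community into its 32-bit wire value."""
    return (high << 16) | low


def import_decision(prefix: str, communities: Set[Community],
                    local_pref: int = 100) -> Tuple[str, int]:
    """Return (action, local_pref) for a received announcement."""
    net = ipaddress.ip_network(prefix)
    if BLACKHOLE in communities:
        # Blackhole routes may be as specific as /32 or /128.
        return ("blackhole", local_pref)
    if net.prefixlen > MAX_LEN[net.version]:
        return ("reject", local_pref)
    if GRACEFUL_SHUTDOWN in communities:
        # Keep the route, but make every other path more attractive.
        return ("accept", 0)
    return ("accept", local_pref)
```

The blackhole check comes before the length filter on purpose, since the notes say blackhole announcements are typically allowed to exceed the usual maximum length.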
			
 
				+
			
 
				+{% include footer.md %}
			
--- a/it/network/routing.md
+++ b/it/network/routing.md
@@ -6,6 +6,11 @@ breadcrumbs:
 
				 ---
			
 
				 {% include header.md %}
			
 
				 
			
 
				+### Related Pages
			
 
				+{:.no_toc}
			
 
				+
			
 
				+- [BGP](../bgp/)
			
 
				+
			
 
				 ## General
			
 
				 
			
 
				 - Route source types:
			
@@ -55,124 +60,4 @@ breadcrumbs:
 
				 - May allow transit, where traffic flows through the AS to/from neighboring ASes, but neither the sources or destinations of the traffic is in the AS.
			
 
				 - Peering and transit between neighboring ASes physically happens at Internet exchange points (IXPs) or a private network interconnect (PNI).
			
 
				 
			
 
				-## BGP
			
 
				-
			
 
				-- A path vector protocol and the only EGP used on the Internet.
			
 
				-- Version 4 (BGP-4) with multiprotocol extensions (MBGP) is the most common version, which supports CIDR, route aggregation and _address families_ such as multicast IPv4, unicast IPv6 and VPN information for MPLS.
			
 
				-- Uses a set of attributes to describe a route (see subsection).
			
 
				-- Route filtering and RPKI are methods commonly used to prevent accidental or malicious misconfiguration where prefixes are routed a place they should not.
			
 
				-- Redistributing between BGP and an IGP should in most cases be avoided. BGP has huge routing tables. IGP is dumber than BGP. IGP flapping should not leak onto EGPs, this may also be penalized by _BGP dampening_.
			
 
				-- Unlike typical IGPs, it does not support any kind of auto discovery of other BGP peers, but peers must instead be statically configured. It uses TCP port 179.
			
 
				-- Exterior BGP (eBGP) is used to advertise and receive routes from peers in other ASes, while interior BGP (iBGP) is used to distribute routes between all eBGP routers in the same AS. iBGP is used instead of an IGP because an IGP would lose BGP information.
			
 
				-- The iBGP split horizon rule: Routers are not allowed to adversise a route learned from one iBGP peering to another iBGP peering. This prevents loops in iBGP, but requires that peers must be connected in a full mesh. To reduce the complexity of the iBGP full mesh, techniques like route reflectors (RRs) (dividing the AS into clusters) and confederations (dividing the AS into sub-ASes) may be used.
			
 
				-- eBGP peers are generally required to be directly connected, which is enforced by using an IP TTL of 1. This limit may be relaxed by using multihop sessions. iBGP however is not subject to thus requirement.
			
 
				-- Both multihop sessions and TTL security are mutually exclusive features for increasing the number of allowed hops between eBGP peers, which is limited to 1 by default. TTL security (aka Generalized TTL Security Mechanism (GTSM), RFC 5082) inverts the TTL check for a value of minimum 255 minus the configured number of hops. This prevents remote attackers from spoofing the number of hops, as the TTL is limited by it's maximum value of 255.
			
 
				-- The synchronization rule: When a router receives a new route to announce from iBGP, it must first wait until it can validate the route from the IGP (in case iBGP is faster). This prevents announcing over eBGP a route that can't yet be routed within the AS.
			
 
				-- Full BGP tables are exchanged only during the start of peer sessions. Thereafter, only new announcements or withdrawals are exchanged.
			
 
				-- Network layer reachability information (NLRI) is basically what BGP calls prefixes/routes (and some extra information for address families other than IPv4).
			
 
				-- Message types:
			
 
				-    - Open: The first message sent when starting a session, for identifying eachother's capabilities and exchange basic information (not routes).
			
 
				-    - Update: Exchanges new route advertisements or withdrawals.
			
 
				-    - Notification: Signals errors and/or closes the session.
			
 
				-    - Keepalive: Shows it's still alive in the absence of update messages. Both keepalives and updates reset the hold timer.
			
 
				-- Internet Routing Registry (IRR) and Resource Public Key Infrastructure (RPKI) are methods to secure BGP in order to prevent route leaks/hijacks. While all routes should use IRR and RPKI (for providing valid bindings of prefixes to ASNs).
			
 
				-- A Letter of Agency (aka Letter of Authorization) (LOA) is required in certain countries to be allowed to announce a prefix.
			
 
				-- The "default-free zone" (DFZ) is the set of ASes which have full-ish BGP tables instead of default routes.
			
 
				-- Communities are used to exchange arbitrary policy information for announcements between peers. See [BGP Well-known Communities (IANA)](https://www.iana.org/assignments/bgp-well-known-communities/bgp-well-known-communities.xhtml).
			
 
				-- "Soft reconfiguration" is a feature to cache all incoming raw announcements from peers, such that the BGP table can be quickly rebuilt if it needs to be cleared. This reduces the impact of clearing the table and is recommended, but does increase memory usage.
			
 
				-
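The TTL security (GTSM) check mentioned in the list above can be sketched in a few lines. This is only an illustration of the arithmetic as described in these notes (accept a TTL of at least 255 minus the configured number of hops); the function names are hypothetical and real implementations do this in the kernel or forwarding plane.

```python
MAX_TTL = 255  # Maximum value of the 8-bit IP TTL field.


def gtsm_min_ttl(allowed_hops: int) -> int:
    """Return the lowest TTL accepted for a peer at most `allowed_hops` away."""
    if not 1 <= allowed_hops <= 254:
        raise ValueError("allowed_hops must be in 1..254")
    return MAX_TTL - allowed_hops


def gtsm_accepts(received_ttl: int, allowed_hops: int) -> bool:
    """Accept a packet only if its TTL shows it travelled few enough hops."""
    return gtsm_min_ttl(allowed_hops) <= received_ttl <= MAX_TTL
```

A remote attacker cannot send a packet with a TTL above 255, and every router on the way decrements it, so a spoofed packet from far away arrives with a TTL below the threshold and is dropped.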
			
 
				-### Attributes
			
 
				-
			
 
				-Classification:
			
 
				-
			
 
				-- Mandatory/discretionary: Whether the attribute must be included in all updates.
			
 
				-- Well-known/optional: Whether all implementations are required to recognize the attribute.
			
 
				-- Transitive/non-transitive: Whether an AS should advertise an attribute it received from one AS to other ASes or whether it is only of significance between pairs of ASes/peers.
			
 
				-
			
 
				-Some important attributes:
			
 
				-
			
 
				-- Origin (well-known, mandatory): How the prefix entered BGP. 0/"i" means from IGP, 1/"e" means from EGP, 2/"?" means redistributed from other sources or static routes.
			
 
				-- AS path (well-known, mandatory): The path of ASes to pass through in order to reach the destination. An eBGP peer prepends its own ASN before advertising it to other peers. ASes are free to prepend their ASN multiple times in series to artificially make the path longer (BGP prefers the shortest AS path during the path selection algorithm). If an AS aggregates prefixes from other ASes, it may use AS sets to indicate all ASes from which it aggregated the prefixes, giving an AS path like e.g. `100, {200, 201}`.
			
 
				-- Next hop (well-known, mandatory): The address of the next hop towards the destination. By default, eBGP peers change this to their own address while iBGP peers leave it unaltered.
			
 
				-- Multi-exit discriminator (MED) (optional, non-transitive): When two ASes peer with multiple eBGP peerings, this number signals which of the two eBGP peerings should be used for incoming traffic (lower is preferred). This is only of significance between friendly ASes as ASes are selfish and free to ignore it (other alternatives for steering incoming traffic are AS path prepending, special communities and (as a very last resort) advertising more specific prefixes).
			
 
				-- Local preference (well-known, discretionary, non-transitive): A number used to prioritise outgoing paths to another AS (higher is preferred).
			
 
				-- Weight (Cisco-proprietary): Like local pref., but not exchanged between iBGP peers.
			
 
				-- Community (optional, transitive): A bit of extra information used to group routes that should be treated similarly within or between ASes.
			
 
				-
			
 
				-### Path Selection
			
 
				-
			
 
				-The path selection algorithm is used to select a single best path for a prefix. The following shows an ordered list of decisions for which route to use, based on Cisco routers:
			
 
				-
			
 
				-1. (Before path selection) Longest prefix match.
			
 
				-1. Highest weight (Cisco).
			
 
				-1. Highest local pref.
			
 
				-1. Locally originated ("network" or "aggregate" command).
			
 
				-1. Shortest AS path.
			
 
				-1. Lowest origin (IGP then EGP then other).
			
 
				-1. Lowest MED (typically ignored).
			
 
				-1. eBGP over iBGP.
			
 
				-1. Lowest IGP metric.
			
 
				-1. Lowest BGP router ID.
			
 
				-
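The ordered list above maps naturally onto a tuple comparison. The following is a simplified sketch, not a faithful router implementation: the `Path` fields and defaults (e.g. a local preference of 100) are assumptions for illustration, MED is compared unconditionally, and the longest prefix match step is left out since it happens before path selection.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Tuple

# Lower rank is preferred: IGP ("i"), then EGP ("e"), then other ("?").
ORIGIN_RANK = {"i": 0, "e": 1, "?": 2}


@dataclass(frozen=True)
class Path:
    """One candidate path for a prefix (hypothetical, simplified model)."""
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    as_path: Tuple[int, ...] = ()
    origin: str = "i"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    router_id: str = "0.0.0.0"


def preference_key(path: Path) -> tuple:
    """Build a sort key where the smallest tuple is the most preferred path."""
    return (
        -path.weight,                      # Highest weight.
        -path.local_pref,                  # Highest local pref.
        not path.locally_originated,       # Locally originated first.
        len(path.as_path),                 # Shortest AS path.
        ORIGIN_RANK[path.origin],          # Lowest origin.
        path.med,                          # Lowest MED.
        not path.is_ebgp,                  # eBGP over iBGP.
        path.igp_metric,                   # Lowest IGP metric.
        int(IPv4Address(path.router_id)),  # Lowest BGP router ID.
    )


def best_path(paths: Iterable[Path]) -> Path:
    """Select the single best path among the candidates."""
    return min(paths, key=preference_key)
```

Because tuples compare element by element, a later criterion is only consulted when all earlier ones tie, which mirrors how the decision list is walked from top to bottom.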
			
 
				-### Internet Routing Registry (IRR)
			
 
				-
			
 
				-- IRR is a mechanism for BGP route origin validation using a set of routing registries.
			
 
				-- It consists of IRR routing policy records which are hosted in one of the multiple IRR registries.
			
 
				-- Records are typically created for the route (`route`), the ASN (`aut-num`) and the upstream ISP AS-SET (`as-set`).
			
 
				-- IRR is out-of-band, meaning it does not affect how originating routers are configured. It should however be able to source filtering policies for peering ASNs somehow.
			
 
				-- IRR uses the Routing Policy Specification Language (RPSL) for describing routing policies.
			
 
				-- Due to outdated, inaccurate or missing data, IRR has not seen global deployment.
			
 
				-
			
 
				-#### Setting Up IRR in the RIPE Database
			
 
				-
			
 
				-- See [Managing Route Objects in the IRR (RIPE)](https://www.ripe.net/manage-ips-and-asns/db/support/managing-route-objects-in-the-irr).
			
 
				-- The RIPE Database is tightly coupled with its IRR.
			
 
				-- IRR policies are handled by `route(6)` objects, containing the ASN and IPv4/IPv6 prefix.
			
 
				-- Authorization for managing `route(6)` objects can be a little complicated. Generally, the LIR is always allowed to manage it.
			
 
				-
			
 
				-### Resource Public Key Infrastructure (RPKI)
			
 
				-
			
 
				-- RPKI is a mechanism for BGP route origin validation using cryptographic methods.
			
 
				-- Like IRR, it validates the route origin only instead of the full path. Since routes typically use the shortest path due to both economical and operational incentives, this is generally not a big problem. It's also typically the case that route leaks are misconfigurations rather than malicious attacks, which origin validation would mostly prevent.
			
 
				-- It's certificate authority (CA)-based, but RPKI calls the CAs "trust anchors" (TAs). The five RIRs act as root CAs, which are also the entities allocating the ASN and IP prefixes which RPKI attempts to secure. This also simplifies RPKI management, as it's managed in the same place as ASNs and IP prefixes. This also helps lock down access control for which orgs may create ROAs for which resources.
			
 
				-- The main components are route origin authorization (ROA) records, which are certificates containing a prefix and an ASN.
			
 
				-- ROAs are X.509 certificates. See RFCs 5280 and 3779.
			
 
				-- IANA maintains lists for which ASNs, IPv4 prefixes and IPv6 prefixes are assigned to which RIR, which is also used to determine which RIR to use for RPKI.
			
 
				-- Unlike DNSSEC where IANA is the single CA root (and IANA reporting to the US government), RPKI uses separate trees/TAs for each RIR, slightly more similar to web CAs (with arguably _too many_ CAs). There are some legal/political issues when the RIRs operate as TAs too, though.
			
 
				-- RPKI typically runs out-of-band on servers called "validators" paired with the routers. For routers supporting it, the RPKI router protocol (RTR) may be used to feed the list of validated ROAs (aka VRPs or the validated cache, see other notes). It's recommended to use multiple validators for each router for redundancy. To reduce the number of validators, many routers may access common, remote validators over some secure transport link. The validators must periodically update their local databases from the RIRs' ones. If the route validators are running in parallel with the routers, they have a negligible impact on convergence speed.
			
 
				-- RPKI Repository Delta Protocol (RRDP) (RFC 8182) is designed to fetch RPKI data from TAs and is based on HTTPS. It has replaced rsync due to rsync being inefficient and not scalable for the purpose.
			
 
				-- If all validators become unavailable or all ROAs expire, RPKI will fall back to accepting all routes (the standard policy when a ROA is not found).
			
 
				-- Trust anchor locators (TALs) are used to retrieve the RIRs' TAs and consist of the URL to retrieve it as well as a public key to verify its authenticity. This allows TAs to be rotated more easily.
			
 
				-- RIPE, APNIC, AFRINIC and LACNIC distribute their TALs publicly, but for ARIN you have to explicitly agree to their terms before you can get it.
			
 
				-- All RIRs offer hosted RPKI managed through the RIR portal, but it can also be hosted internally for large organizations, called delegated RPKI.
			
 
				-- ROAs contain a max prefix length field, which limits how long the prefixes advertised by the AS are allowed to be. This limits segmentation and helps prevent longer-prefix attacks.
			
 
				-- Validation of a ROA results in a validated ROA payload (VRP), consisting of the IP prefix (same length or shorter), the maximum length and the origin ASN. Comparing router advertisements with VRPs has one of three possible outcomes:
			
 
				-    - Valid: At least one VRP (maybe multiple) contains the prefix with the correct origin ASN and allowed prefix length. The route should be accepted.
			
 
				-    - Invalid: A VRP for the prefix exists, but the ASN doesn't match or the length is longer than the maximum. The route should be rejected.
			
 
				-    - Not found: No VRP with a matching prefix was found. The route should be accepted (until RPKI is globally deployed, at least).
			
 
				-- ROAs are fetched and processed periodically (30-60 minutes preferably) to produce a list of VRPs, aka a validated cache. ROAs that are expired or are otherwise cryptographically erroneous are discarded and thus will not be used to validate route announcements.
			
 
				-- Local overrides may be used for VRPs, e.g. for cases where a temporarily invalid announcement must be accepted. See Simplified Local Internet Number Resource Management with the RPKI (SLURM) (RFC 8416).
			
 
				-
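The three validation outcomes described in the list above can be illustrated with a small sketch. It assumes the VRPs have already been produced by a validator; the `VRP` class and `validate_origin` function are hypothetical names, and the prefixes and ASNs used with it should be documentation values.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Iterable


@dataclass(frozen=True)
class VRP:
    """Validated ROA payload: prefix, maximum length and origin ASN."""
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Classify an announcement as "valid", "invalid" or "not-found"."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ip_network(vrp.prefix)
        # Only VRPs of the same address family that cover the route matter.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if vrp.origin_asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "valid"
    # Covered by some VRP but matched by none: wrong ASN or too specific.
    return "invalid" if covered else "not-found"
```

For example, with a single VRP for `192.0.2.0/24` (max length 24, AS64496), the same prefix from AS64496 is valid, `192.0.2.0/25` from AS64496 is invalid because it exceeds the maximum length, and `198.51.100.0/24` is not found.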
			
 
				-#### Setting Up RPKI ROAs in the RIPE Database
			
 
				-
			
 
				-- See [Managing ROAs (RIPE)](https://www.ripe.net/manage-ips-and-asns/resource-management/rpki/resource-certification-roa-management).
			
 
				-- For PA space, only the LIR is authorized to manage ROAs.
			
 
				-
			
 
				-#### Resources
			
 
				-
			
 
				-- [RPKI Documentation (NLnet Labs)](https://rpki.readthedocs.io)
			
 
				-- [RPKI Test (RIPE)](http://www.ripe.net/s/rpki-test)
			
 
				-
			
 
				-### Best Practices
			
 
				-
			
 
				-- Announced prefix lengths (max /24 and /48): Generally, use a maximum length of 24 for IPv4 and 48 for IPv6, due to longer prefixes being commonly filtered. See [Visibility of IPv4 and IPv6 Prefix Lengths in 2019 (RIPE)](https://labs.ripe.net/Members/stephen_strowes/visibility-of-prefix-lengths-in-ipv4-and-ipv6).
			
 
				-- IRR and RPKI: Add `route(6)` objects (for IRR) and ROAs (for RPKI) for all prefixes, both to avoid having your prefixes hijacked and to reduce the risk of getting filtered.
			
 
				-- Explicit import & export policies: Always explicitly define the input and output policies to avoid route leakage. Certain routers default to announcing everything if no policy is defined, but RFC 8212 defines a safe default policy of filtering all routes if no policy is explicitly defined.
			
 
				-- Enable large communities: 2-byte communities are outdated, enable 12-byte communities to allow for more advanced policies and to keep up to date with 4-byte ASNs. See RFCs 8092 and 8195.
			
 
				-- Administrative shutdown message: When administratively shutting down a session (due to maintenance or something), set a message to explain why to the other peer. Peers should log received shutdown messages. See RFC 9003, which adds support for this free-form 128-byte UTF-8 message in the BGP notification message.
			
 
				-- Voluntary shutdown (for BGP-speaking routers): Before maintenance where the router is unable to route traffic, shut down BGP peering sessions and wait for BGP convergence around the router to avoid/reduce temporary blackholing. Aka voluntary session culling and voluntary session teardown. See RFC 8327.
			
 
				-- Involuntary shutdown (for IXPs): Before maintenance which will prevent connected routers from forwarding traffic through the IXP, apply an ACL or similar to filter all BGP communication (TCP/179) between directly connected routers and wait for BGP convergence around the IXP to avoid/reduce temporary blackholing. Multihop sessions may be allowed. This is an alternative to or an addition to voluntary shutdown, as the routers are generally managed by orgs other than the one managing the IXP. Related to voluntary shutdown and described by the same RFC.
			
 
				-- Use and support the graceful shutdown community: The well-known community GRACEFUL_SHUTDOWN (65535:0) is used to signal graceful shutdown of announced routes. Peers should support this community by adding a policy matching the community, which reduces the LOCAL_PREF to 0 or similar such that other paths are preferred and installed in the routing table, to eliminate the impact when the router finally shuts down the session. See RFC 8326.
			
 
				-- Use and support the blackhole community: The well-known community BLACKHOLE (65535:666) is used to signal that the peer should discard traffic destined toward the prefix. This is mainly intended to stop DDoS attacks targeting a certain prefix before reaching the router advertising it, such that other non-targeted traffic may continue to use the link. While announced prefixes should generally avoid exceeding a certain max length, announcements with the blackhole community are typically allowed to be as specific as possible to narrow down the blackhole addresses (e.g. /32 for IPv4 and /128 for IPv6). See RFC 7999.
			
 
				-
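Several of the practices above (maximum prefix lengths, the graceful shutdown community and the blackhole community) interact in an import policy. The sketch below shows one possible ordering of those checks; the function name, return values and default local preference are assumptions for illustration, and a real policy would also verify that the peer is authorized for the prefix it asks to blackhole.

```python
from ipaddress import ip_network
from typing import Iterable, Optional, Tuple

GRACEFUL_SHUTDOWN = (65535, 0)   # Well-known community, RFC 8326.
BLACKHOLE = (65535, 666)         # Well-known community, RFC 7999.
MAX_LENGTH = {4: 24, 6: 48}      # Longest generally accepted prefixes.
DEFAULT_LOCAL_PREF = 100         # Assumed default, varies per network.


def import_policy(
    prefix: str, communities: Iterable[Tuple[int, int]]
) -> Tuple[str, Optional[int]]:
    """Return (action, local_pref) for a received announcement."""
    net = ip_network(prefix)
    tags = set(communities)
    if BLACKHOLE in tags:
        # Blackhole routes may be as specific as /32 or /128.
        return ("blackhole", DEFAULT_LOCAL_PREF)
    if net.prefixlen > MAX_LENGTH[net.version]:
        return ("reject", None)
    if GRACEFUL_SHUTDOWN in tags:
        # Depreference the path so alternatives are installed first.
        return ("accept", 0)
    return ("accept", DEFAULT_LOCAL_PREF)
```

The blackhole check comes before the length filter because blackhole announcements are the one case where very specific prefixes are expected.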
			
 
				 {% include footer.md %}