
Add temporary CUDA notes and Betzy

Håvard Ose Nordstrand 4 years ago
parent
commit
8b86601a93
4 changed files with 279 additions and 76 deletions
  1. config/hpc/cuda.md (+14 −2)
  2. index.md (+70 −74)
  3. miscellanea/betzy.md (+87 −0)
  4. se/cuda-tmp.md (+108 −0)

+ 14 - 2
config/hpc/cuda.md

@@ -26,7 +26,7 @@ The toolkit on Linux can be installed in different ways:
 
 Note that the toolkit requires a matching NVIDIA driver to be installed.
 
-#### Ubuntu (Main Repos)
+#### Ubuntu (Ubuntu Repos)
 
 Note: May be outdated.
 
@@ -46,7 +46,14 @@ See [NVIDIA CUDA Installation Guide for Linux (NVIDIA)](https://docs.nvidia.com/
 1. Install CUDA from the new repo (includes the NVIDIA driver): `apt install cuda`
 1. Set up the path: In `/etc/environment`, append `:/usr/local/cuda/bin` to the end of the PATH list.
 
-## Running
+### Docker Containers
+
+- Docker containers may run NVIDIA applications using the NVIDIA runtime for Docker.
+- **TODO**
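+- Example (assuming Docker 19.03+ with the NVIDIA container toolkit installed): `docker run --gpus all nvidia/cuda:11.0-base nvidia-smi` runs `nvidia-smi` inside a CUDA container with all GPUs exposed.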
+
+## Usage
+
+### General
 
 - Gathering system/GPU information with `nvidia-smi`:
     - Show overview: `nvidia-smi`
@@ -56,4 +63,9 @@ See [NVIDIA CUDA Installation Guide for Linux (NVIDIA)](https://docs.nvidia.com/
     - Monitor device stats: `nvidia-smi dmon`
 - To specify which devices are available to the CUDA application and in which order, set the `CUDA_VISIBLE_DEVICES` env var to a comma-separated list of device IDs.
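+    - Example: `CUDA_VISIBLE_DEVICES=1,0 ./app` exposes devices 1 and 0 (in that order), so physical device 1 becomes CUDA device 0.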
 
+### DCGM
+
+- NVIDIA Data Center GPU Manager (DCGM), for monitoring GPU hardware and performance.
+- See the DCGM exporter for monitoring NVIDIA GPUs from Prometheus.
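+- The `dcgmi` CLI may be used for ad-hoc queries, e.g. `dcgmi discovery -l` to list the GPUs DCGM sees (assuming the DCGM host engine is running).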
+
 {% include footer.md %}

+ 70 - 74
index.md

@@ -10,148 +10,144 @@ Random collection of config notes and miscellaneous stuff. _Technically not a wi
 
 ### General
 
-- [General Notes](config/general/general/)
-- [Linux General Notes](config/general/linux-general/)
-- [Linux Examples](config/general/linux-examples/)
-- [Computer Testing](config/general/computer-testing/)
+- [General Notes](/config/general/general/)
+- [Linux General Notes](/config/general/linux-general/)
+- [Linux Examples](/config/general/linux-examples/)
+- [Computer Testing](/config/general/computer-testing/)
 
 ### Authentication, Authorization and Accounting (AAA)
 
-- [Kerberos](config/aaa/kerberos/)
+- [Kerberos](/config/aaa/kerberos/)
 
 ### Automation
 
-- [Ansible](config/automation/ansible/)
-- [Puppet](config/automation/puppet/)
+- [Ansible](/config/automation/ansible/)
+- [Puppet](/config/automation/puppet/)
 
 ### Computers
 
-- [Dell OptiPlex Series](config/computers/dell-optiplex/)
-- [Dell PowerEdge Series](config/computers/dell-poweredge/)
-- [HP ProLiant](config/computers/hp-proliant/)
-- [PCs](config/computers/pcs/)
+- [Dell OptiPlex Series](/config/computers/dell-optiplex/)
+- [Dell PowerEdge Series](/config/computers/dell-poweredge/)
+- [HP ProLiant](/config/computers/hp-proliant/)
+- [PCs](/config/computers/pcs/)
 
 ### Game Servers
 
-- [Counter-Strike: Global Offensive (CS:GO)](config/game-server/csgo/)
-- [Minecraft (Bukkit)](config/game-server/minecraft-bukkit/)
-- [Team Fortress 2 (TF2)](config/game-server/tf2/)
+- [Counter-Strike: Global Offensive (CS:GO)](/config/game-server/csgo/)
+- [Minecraft (Bukkit)](/config/game-server/minecraft-bukkit/)
+- [Team Fortress 2 (TF2)](/config/game-server/tf2/)
 
 ### HPC
 
-- [CUDA](config/hpc/cuda/)
-- [Open MPI](config/hpc/openmpi/)
-- [Slurm Workload Manager](config/hpc/slurm/)
+- [CUDA](/config/hpc/cuda/)
+- [Open MPI](/config/hpc/openmpi/)
+- [Slurm Workload Manager](/config/hpc/slurm/)
 
 ### IoT & Home Automation
 
-- [Raspberry Pi](config/iot-ha/raspberry-pi/)
-- [Home Assistant](config/iot-ha/home-assistant/)
+- [Raspberry Pi](/config/iot-ha/raspberry-pi/)
+- [Home Assistant](/config/iot-ha/home-assistant/)
 
 ### Linux Server
 
-- [Debian](config/linux-server/debian/)
-- [Applications](config/linux-server/applications/)
-- [Storage](config/linux-server/storage/)
-- [Storage: ZFS](config/linux-server/storage-zfs/)
-- [Storage: Ceph](config/linux-server/storage-ceph/)
-- [Networking](config/linux-server/networking/)
+- [Debian](/config/linux-server/debian/)
+- [Applications](/config/linux-server/applications/)
+- [Storage](/config/linux-server/storage/)
+- [Storage: ZFS](/config/linux-server/storage-zfs/)
+- [Storage: Ceph](/config/linux-server/storage-ceph/)
+- [Networking](/config/linux-server/networking/)
 
 ### Media
 
-- [Media Ripping](config/media/ripping/)
-- [Video Streaming](config/media/streaming/)
+- [Media Ripping](/config/media/ripping/)
+- [Video Streaming](/config/media/streaming/)
 
 ### Network
 
 #### General
 
-- [Routing](config/network/routing/)
-- [Switching](config/network/switching/)
-- [WLAN](config/network/wlan/)
-- [Security](config/network/security/)
+- [Routing](/config/network/routing/)
+- [Switching](/config/network/switching/)
+- [WLAN](/config/network/wlan/)
+- [Security](/config/network/security/)
 
 #### Specific
 
-- [Brocade FastIron Switches](config/network/brocade-fastiron-switches/)
-- [Cisco Hardware](config/network/cisco-hardware/)
-- [Cisco IOS General](config/network/cisco-ios-general/)
-- [Cisco IOS Routers](config/network/cisco-ios-routers/)
-- [Cisco IOS Switches](config/network/cisco-ios-switches/)
-- [FS FSOS Switches](config/network/fs-fsos-switches/)
-- [Juniper Hardware](config/network/juniper-hardware/)
-- [Juniper Junos General](config/network/juniper-junos-general/)
-- [Juniper Junos Switches](config/network/juniper-junos-switches/)
-- [Linksys LGS Switches](config/network/linksys-lgs/)
-- [Linux Switching & Routing](config/network/linux/)
-- [pfSense](config/network/pfsense/)
-- [TP-Link JetStream Switches](config/network/tplink-jetstream-switches/)
-- [Ubiquiti UniFi Controllers](config/network/ubiquiti-unifi-controllers/)
-- [Uniquiti UniFi Access Points](config/network/ubiquiti-unifi-aps/)
-- [VyOS](/config/network/vyos/)
+- [Brocade FastIron Switches](/config/network/brocade-fastiron-switches/)
+- [Cisco Hardware](/config/network/cisco-hardware/)
+- [Cisco IOS General](/config/network/cisco-ios-general/)
+- [Cisco IOS Routers](/config/network/cisco-ios-routers/)
+- [Cisco IOS Switches](/config/network/cisco-ios-switches/)
+- [FS FSOS Switches](/config/network/fs-fsos-switches/)
+- [Juniper Hardware](/config/network/juniper-hardware/)
+- [Juniper Junos General](/config/network/juniper-junos-general/)
+- [Juniper Junos Switches](/config/network/juniper-junos-switches/)
+- [Linksys LGS Switches](/config/network/linksys-lgs/)
+- [Linux Switching & Routing](/config/network/linux/)
+- [pfSense](/config/network/pfsense/)
+- [TP-Link JetStream Switches](/config/network/tplink-jetstream-switches/)
+- [Ubiquiti UniFi Controllers](/config/network/ubiquiti-unifi-controllers/)
+- [Ubiquiti UniFi Access Points](/config/network/ubiquiti-unifi-aps/)
+- [VyOS](/config/network/vyos/)
 
 ### PC
 
-- [Kubuntu](config/pc/kubuntu/)
-- [Windows](config/pc/windows/)
-- [PC Applications](config/pc/applications/)
+- [Kubuntu](/config/pc/kubuntu/)
+- [Windows](/config/pc/windows/)
+- [PC Applications](/config/pc/applications/)
 
 ### Power
 
-- [APC PDUs](config/power/apc-pdus/)
+- [APC PDUs](/config/power/apc-pdus/)
 
 ### Virtualization & Containerization
 
-- [Docker](config/virt-cont/docker/)
-- [libvirt & KVM](config/virt-cont/libvirt-kvm/)
-- [Proxmox VE](config/virt-cont/proxmox-ve/)
+- [Docker](/config/virt-cont/docker/)
+- [libvirt & KVM](/config/virt-cont/libvirt-kvm/)
+- [Proxmox VE](/config/virt-cont/proxmox-ve/)
 
 ## Information Technology
 
 ### Network
 
-- [IPv4](it/network/ipv4/)
-- [IPv6](it/network/ipv6/)
-- [Network Architecture](it/network/architecture/)
-- [Switching](it/network/switching/)
-- [Routing](it/network/routing/)
-- [Wireless Basics](it/network/wireless-basics/)
-- [WLAN](it/network/wlan/)
+- [IPv4](/it/network/ipv4/)
+- [IPv6](/it/network/ipv6/)
+- [Network Architecture](/it/network/architecture/)
+- [Switching](/it/network/switching/)
+- [Routing](/it/network/routing/)
+- [Wireless Basics](/it/network/wireless-basics/)
+- [WLAN](/it/network/wlan/)
 
 ### Services
 
-- [Email](it/services/email/)
-- [DNS](it/services/dns/)
+- [Email](/it/services/email/)
+- [DNS](/it/services/dns/)
 
 ## Media
 
 ### Audio
 
-- [Audio Basics](media/audio/basics/)
+- [Audio Basics](/media/audio/basics/)
 
 ## Software Engineering
 
 ### General
 
-- [Database Management Systems (DBMSes)](se/general/dbmses/)
-- [Software Licensing](se/general/licensing/)
+- [Database Management Systems (DBMSes)](/se/general/dbmses/)
+- [Software Licensing](/se/general/licensing/)
 
 ### Web
 
-- [Web Security](se/web/security/)
+- [Web Security](/se/web/security/)
 
 ## Guides
 
 ### Network
 
-- [Juniper EX3300 Fan Mod](guides/network/juniper-ex3300-fanmod/)
+- [Juniper EX3300 Fan Mod](/guides/network/juniper-ex3300-fanmod/)
 
-<!--
-## External Resources
+## Miscellanea
 
-- [My miscellaneous configs and scripts](https://github.com/HON95/configs)
-- [My miscellaneous code snippets and dev scripts](https://github.com/HON95/code)
-- [My miscellaneous Ansible playbooks](https://github.com/HON95/ansible-playbooks)
--->
+- [Betzy (Supercomputer)](/miscellanea/betzy/)
 
 {% include footer.md %}

+ 87 - 0
miscellanea/betzy.md

@@ -0,0 +1,87 @@
+---
+title: Betzy (Supercomputer)
+breadcrumbs:
+- title: Miscellanea
+---
+{% include header.md %}
+
+Norway's most powerful supercomputer as of 2020, managed by UNINETT Sigma2.
+
+## Specifications
+
+A mix of general BullSequana XH2000 specifications and Betzy-specific details.
+
+- Betzy overall specifications \[1\]\[2\]\[5\]\[9\]\[10\]:
+    - System: Atos BullSequana XH2000 with X2410 (AMD) and X2415 (A100) blades.
+    - OS: RHEL
+    - Compute nodes: 1’344x X2410 (AMD) + 4x X2415 (A100)
+    - CPUs total (excluding A100 nodes): 2’688 sockets, 86’016 cores, 172’032 threads
+    - Memory: 336TiB total (excluding A100 nodes)
+    - Storage: 7.8PB (2.5PB before the 2021 upgrade), DDN-powered, Lustre, 51GB/s bandwidth, 500k+ metadata OPS
+    - Interconnect topology: DragonFly+
+    - Queueing system: Slurm
+    - Footprint: 14.78m² (before the 2021 upgrade)
+    - Power: 952kW, with 95% of the heat captured to water (before the 2021 upgrade)
+    - Cooling: Liquid cooled
+- CPU specifications (excluding A100 nodes) \[1\]\[2\]\[3\]:
+    - AMD Epyc 7742
+    - 64 cores, 128 threads
+    - Clock: 2.25GHz base, 3.4GHz max boost
+    - PCIe: 4.0, x128
+    - Memory: DDR4, 8 channels, 3200MHz, 204.8GB/s per socket BW
+    - Supports AVX and AVX2.
+- Compute node specifications (excluding A100 nodes) \[1\]\[2\]:
+    - CPUs per node: 2 sockets, 128 cores, 256 threads
+    - Memory: 256GiB, split into 8 NUMA nodes
+    - Storage: 3x SATA or NVMe M.2 drives
+    - NIC: InfiniBand HDR 100
+- Blade specifications \[9\]:
+    - Betzy uses mainly X2410 blades (AMD), but also 4x X2415 blades (A100) (after the 2021 upgrade) \[10\].
+    - Size: 1U
+    - Cooling: Fanless, active liquid cooling.
+    - All blade types (both used and unused):
+        - X2410: 3x AMD EPYC Rome/Milan nodes (side-by-side) (6 CPUs total).
+        - X2415: 2x AMD EPYC Rome/Milan CPUs and 4x Nvidia A100 SXM4 GPUs (single node).
+        - X1120: 3x Intel Xeon nodes (side-by-side) (6 CPUs total).
+        - X1125: 2x Intel CPUs and 4x Nvidia V100 SXM2 GPUs (single node).
+- Cabinet specifications \[8\]\[9\]:
+    - Number of blades: 4-20 in front, 4-12 in back
+    - Management switches:
+        - Up to 2.
+        - Up to 48 1Gb/s or 10Gb/s ports.
+    - Interconnect switches:
+        - Up to 10.
+        - InfiniBand HDR100: 80 ports, 100Gb/s (Betzy)
+    - Alternative technologies:
+        - Bull eXascale Interconnect (BXI): 48 ports, 100Gb/s
+        - High-speed Ethernet: Up to 48 ports, up to 100Gb/s
+    - Topology: DragonFly+ (Betzy)
+    - Alternative topologies:
+        - Full Fat Tree
+    - PSU: 6x 15kW shelves
+    - Power input: 3x 63A 3-phase 400V (for EU)
+    - Cooling:
+        - Direct Liquid Cooling (DLC)
+        - Hydraulic chassis (HYC)
+        - Primary (external) loop connected to customer water loop.
+        - Secondary (internal) loop connected to blades, management switches, interconnect switches and PSUs.
+
+## History
+
+- 7 December 2020: Inauguration. \[11\]
+- April 2021 (exact date unknown): Four new X2415 blades (A100) and 5.3PB more storage added (from 2.5PB to 7.8PB). \[10\]
+
+## References
+
+- \[1\] UNINETT Sigma2. "Betzy." (Accessed 2020-09-03.) https://documentation.sigma2.no/hpc_machines/betzy.html
+- \[2\] UNINETT Sigma2. "Betzy Pilot Projects." (Accessed 2020-09-03.) https://documentation.sigma2.no/hpc_machines/betzy/betzy_pilot.html
+- \[3\] SPEC CPU 2017 Integer Rate Result for Atos BullSequana XH2000 (1 socket)
+- \[4\] AMD. "AMD EPYC 7742." (Accessed 2020-09-03.) https://www.amd.com/en/products/cpu/amd-epyc-7742
+- \[5\] Atos. "Atos to deliver most powerful supercomputer in Norway to national e-infrastructure provider Uninett Sigma2." (Accessed 2020-09-03.) https://atos.net/en/2019/press-release_2019_06_06/atos-to-deliver-most-powerful-supercomputer-in-norway-to-national-e-infrastructure-provider-uninett-sigma2
+- \[6\] Atos. "Atos expands BullSequana X supercomputer range to include AMD processors." (Accessed 2020-09-03.) https://atos.net/en/2018/news_2018_11_12/atos-expands-bullsequana-x-supercomputer-range-include-amd-processors
+- \[8\] Atos. "BullSequana XH2000 brochure." (Accessed 2020-09-03.) https://atos.net/wp-content/uploads/2019/11/BullSequana_XH2000_Brochure_Atos.pdf
+- \[9\] Atos. "BullSequana XH2000 features." (Accessed 2020-09-03.) https://atos.net/wp-content/uploads/2020/07/BullSequanaXH2000_Features_Atos_supercomputers.pdf
+- \[10\] Digi.no. "Sigma2 skal utvide to av de norske superdatamaskinene." (Accessed 2021-04-21.) https://www.digi.no/artikler/sigma2-skal-utvide-to-av-de-norske-superdatamaskinene/509303
+- \[11\] UNINETT Sigma2. "Betzy Inauguration." (Accessed 2021-04-21.) https://www.sigma2.no/betzy-inauguration
+
+{% include footer.md %}

+ 108 - 0
se/cuda-tmp.md

@@ -0,0 +1,108 @@
+## General
+
+- Introduced by NVIDIA in 2006. While GPU compute was possible before through hackish methods, CUDA provided a programming model for compute which included e.g. thread blocks, shared memory and synchronization barriers.
+- Modern NVIDIA GPUs contain _CUDA cores_, _tensor cores_ and _RT cores_ (ray tracing cores). Tensor cores may be accessed in CUDA through special CUDA calls, but RT cores are (as of writing) only accessible from OptiX and not CUDA.
+- The _compute capability_ describes the generation and supported features of a GPU.
+
+### Mapping the Programming Model to the Execution Model
+
+- The programmer decides the grid size (number of blocks and threads therein) when launching a kernel.
+- The device has a constant number of streaming multiprocessors (SMs) and CUDA cores (not to be confused with tensor cores or RT cores).
+- Each kernel launch (or rather its grid) is executed by a single GPU. To use multiple GPUs, multiple kernel launches are required by the CUDA application.
+- Each thread block is executed by a single SM and is bound to it for the entire execution. Each SM may execute multiple thread blocks.
+- Each CUDA core within an SM executes a thread from a block assigned to the SM.
+- **TODO** Warps and switches.
+
+## Programming
+
+### General
+
+- Branching:
+    - **TODO** How branching works and why it's bad.
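+
+A minimal sketch of warp divergence (background knowledge, assuming 32-thread warps; kernel names are illustrative): a condition that splits a warp forces both paths to execute serially, while a warp-aligned condition keeps every warp on a single path.
+
+```c++
+// Divergent: odd and even lanes of the same warp take different paths,
+// so the warp executes both branches serially.
+__global__ void divergent(float *x) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (threadIdx.x % 2 == 0) {
+        x[i] *= 2.0f;
+    } else {
+        x[i] += 1.0f;
+    }
+}
+
+// Uniform per warp: all 32 lanes of a warp take the same path,
+// so no serialization occurs.
+__global__ void uniformPerWarp(float *x) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if ((threadIdx.x / 32) % 2 == 0) {
+        x[i] *= 2.0f;
+    } else {
+        x[i] += 1.0f;
+    }
+}
+```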
+
+### Thread Hierarchy
+
+- Grids consist of a number of blocks, and blocks consist of a number of threads.
+- Threads and blocks are indexed in 1D, 2D or 3D space (separately), which threads may access through the 3-component vectors `gridDim`, `blockDim`, `blockIdx` and `threadIdx`.
+- The programmer decides the number of blocks and threads per block, as well as the dimensionality, for each kernel invocation (see the sketch below).
+- The number of threads per block is typically limited to 1024.
+- See the section about mapping the programming model to the execution model for a better understanding of why it's organized this way.
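+
+A minimal sketch of a 1D kernel launch (illustrative names and sizes): the grid is sized to cover `n` elements and each thread computes its global index from the built-in vectors.
+
+```c++
+__global__ void addOne(float *x, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) { // guard: the grid may contain more threads than elements
+        x[i] += 1.0f;
+    }
+}
+
+int main() {
+    const int n = 1000;
+    float *x;
+    cudaMalloc(&x, n * sizeof(float));
+    int threadsPerBlock = 256;                                // <= 1024
+    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // round up
+    addOne<<<blocks, threadsPerBlock>>>(x, n);
+    cudaDeviceSynchronize(); // kernel launches are asynchronous
+    cudaFree(x);
+    return 0;
+}
+```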
+
+### Memory Hierarchy
+
+- **TODO**
+- Memory spaces (local to global):
+    1. Registers: Per-thread, on-chip.
+    1. Local memory: Per-thread; despite the name, it resides in (cached) device memory and is used e.g. for register spills.
+    1. Shared memory: Per-block, on-chip; usable as a software-managed cache (see the sketch below).
+    1. Constant and texture memory: Read-only device memory with dedicated caches.
+    1. L1 and L2 caches: L1 per SM, L2 shared by all SMs.
+    1. Global memory: Device memory accessible by all threads (and the host).
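+
+A minimal sketch of using shared memory (illustrative names; assumes a power-of-two block size of 256): each block stages its elements in shared memory and reduces them to a single partial sum.
+
+```c++
+// Per-block sum using shared memory; assumes blockDim.x == 256.
+__global__ void blockSum(const float *in, float *out) {
+    __shared__ float tile[256];    // one copy per block, on-chip
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    tile[threadIdx.x] = in[i];     // global -> shared
+    __syncthreads();               // all loads done before anyone reads
+    // Tree reduction within the block.
+    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
+        if (threadIdx.x < s) {
+            tile[threadIdx.x] += tile[threadIdx.x + s];
+        }
+        __syncthreads();
+    }
+    if (threadIdx.x == 0) {
+        out[blockIdx.x] = tile[0]; // one partial sum per block
+    }
+}
+```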
+
+### Synchronization
+
+- **TODO**
+- `__syncthreads` (device) provides block-level barrier synchronization.
+- Grid-level barrier synchronization is not possible through an ordinary API call, although cooperative groups (CUDA 9+) provide a grid-wide barrier for cooperatively launched kernels.
+- `cudaDeviceSynchronize`/`cudaStreamSynchronize` (host) blocks until the device or stream has finished all tasks (kernels/copies/etc.), as shown in the sketch below.
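+
+A minimal host-side sketch (illustrative names): the launch returns immediately, so synchronize before reading results on the host.
+
+```c++
+#include <cstdio>
+
+__global__ void touch(int *x) {
+    if (threadIdx.x == 0) {
+        *x = 42;
+    }
+}
+
+int main() {
+    int *x;
+    cudaMallocManaged(&x, sizeof(int));        // host- and device-visible
+    touch<<<1, 32>>>(x);                       // asynchronous launch
+    cudaError_t err = cudaDeviceSynchronize(); // block until the device is idle
+    if (err != cudaSuccess) {
+        printf("Kernel failed: %s\n", cudaGetErrorString(err));
+        return 1;
+    }
+    printf("x = %d\n", *x);                    // safe to read after sync
+    cudaFree(x);
+    return 0;
+}
+```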
+
+### Measurements
+
+#### Time
+
+- To measure the total duration of a kernel invocation or memory copy on the CPU side, measure the duration from before the call to after it, including a `cudaDeviceSynchronize()` if the call is asynchronous.
+- To measure durations inside a kernel, use the CUDA event API (used in the rest of this section; see the sketch below).
+- Events are created using `cudaEventCreate(cudaEvent_t *)` and destroyed using `cudaEventDestroy(cudaEvent_t)`.
+- Events are recorded (captured) using `cudaEventRecord`. This will capture the state of the stream it's applied to. The "time" of the event is when all previous tasks have completed and not the time it was called.
+- Elapsed time between two events is calculated using `cudaEventElapsedTime`.
+- Wait for an event to complete (or happen) using `cudaEventSynchronize`. For an event to "complete" means that the preceding tasks (like a kernel) have finished executing. If the `cudaEventBlockingSync` flag is set for the event, the CPU will sleep while waiting (yielding the CPU), otherwise it will busy-wait.
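+
+A minimal sketch of timing a kernel with events (illustrative names and sizes):
+
+```c++
+#include <cstdio>
+
+__global__ void busy(float *x, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) {
+        x[i] = x[i] * 2.0f + 1.0f;
+    }
+}
+
+int main() {
+    const int n = 1 << 20;
+    float *x;
+    cudaMalloc(&x, n * sizeof(float));
+
+    cudaEvent_t start, stop;
+    cudaEventCreate(&start);
+    cudaEventCreate(&stop);
+
+    cudaEventRecord(start);                 // recorded in the null stream
+    busy<<<(n + 255) / 256, 256>>>(x, n);
+    cudaEventRecord(stop);
+
+    cudaEventSynchronize(stop);             // wait for `stop` to complete
+    float ms = 0.0f;
+    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
+    printf("Kernel took %.3f ms\n", ms);
+
+    cudaEventDestroy(start);
+    cudaEventDestroy(stop);
+    cudaFree(x);
+    return 0;
+}
+```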
+
+#### Bandwidth
+
+- To calculate the theoretical bandwidth, check the hardware specifications for the device, i.e. the memory clock rate, the memory bus width, and the data rate multiplier (e.g. 2 for DDR).
+- To measure the effective bandwidth, divide the sum of the read and written data by the measured total duration of the transfers.
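+- Example (illustrative, V100-like numbers): an 877MHz DDR memory clock and a 4096-bit bus give a theoretical bandwidth of 877×10⁶ × 2 × (4096/8) B/s ≈ 898GB/s.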
+
+#### Computational Throughput
+
+- Measured in FLOPS (or "FLOP/s" or "flops"), separately for each precision (half, single, double).
+- Measured by manually counting how many floating-point operations a compound operation consists of, multiplying by the number of times it was performed, and dividing by the total duration.
+- Make sure the kernel is not memory bound (or label it as such).
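+- Example (illustrative numbers): a kernel performing one FMA (2 FLOPs) per element over 10⁹ elements in 10ms achieves 2×10⁹ / 0.01s = 200 GFLOPS (single precision).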
+
+### Unified Virtual Addressing (UVA)
+
+- Causes CUDA to use a single address space for allocations for both the host and devices (if the host supports it).
+- Allows using `cudaMemcpy` without having to specify which device (or the host) the pointers belong to, e.g. by using the `cudaMemcpyDefault` direction (see the sketch below).
+- Allows _zero-copy_ memory where the GPU can access pinned/managed host memory over the PCIe interconnect (including the high latency for accessing off-device memory).
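+
+A minimal sketch of a direction-inferred copy under UVA (buffer names are illustrative):
+
+```c++
+int main() {
+    const int n = 1 << 20;
+    float *host, *dev;
+    cudaMallocHost(&host, n * sizeof(float)); // pinned host allocation
+    cudaMalloc(&dev, n * sizeof(float));      // device allocation
+    // With UVA, cudaMemcpyDefault infers the copy direction from the
+    // pointers themselves, so no cudaMemcpyHostToDevice etc. is needed.
+    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyDefault);
+    cudaFreeHost(host);
+    cudaFree(dev);
+    return 0;
+}
+```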
+
+### Unified Memory
+
+- Depends on the older UVA, which provides a single address space for both the host and devices, as well as zero-copy memory.
+- Virtually combines the pinned CPU/host memory and the GPU/device memory such that explicit memory copying between the two is no longer needed. Both the host and device may access the memory through a single pointer and data is automatically migrated (prefetched) between the two instead of demand-fetching it each time it's used (as for UVA).
+- Data migration happens automatically at page-level granularity and follows pointers in order to support deep copies. As it automatically migrates data to/from the devices instead of accessing it over the PCIe interconnect on demand, it yields much better performance than UVA.
+- As Unified Memory uses paging, it implicitly allows oversubscribing GPU memory.
+- Keep in mind that GPU page faulting will affect kernel performance.
+- Unified Memory also provides support for system-wide atomic memory operations, for multi-GPU cooperative programs.
+- Explicit memory management may still be used for optimization purposes, although use of streams and async copying is typically needed to actually increase the performance.
+- `cudaMallocManaged` and `cudaFree` are used to allocate and deallocate managed memory.
+- Since Unified Memory removes the need for `cudaMemcpy` when copying data back to the host after a kernel has finished (a call which would otherwise have synchronized implicitly), use e.g. `cudaDeviceSynchronize` to wait for the kernel to finish before accessing the managed data (see the sketch below).
+- Architecture support:
+    - The Kepler and Maxwell architectures support a limited version of Unified Memory, while Pascal is the first architecture with hardware support for page faulting and migration, via its Page Migration Engine.
+    - On pre-Pascal architectures, _all_ managed data is automatically copied to the GPU right before a kernel is launched on it, since those architectures can't page-fault on managed data currently residing on the host or another device. As a consequence, Pascal and later architectures include migration delays in the measured kernel run time, while pre-Pascal architectures migrate everything before the kernel begins executing (increasing the overall application run time instead).
+    - The lack of page faulting also prevents pre-Pascal GPUs from accessing managed data from both the CPU and the GPU concurrently (without segfaulting), as data coherence can't be assured. (Care must still be taken to avoid race conditions and data in invalid states on Pascal and later GPUs.)
+- Explicit prefetching may be used to assist the data migration through the `cudaMemPrefetchAsync` call.
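+
+A minimal sketch of Unified Memory usage (illustrative names and sizes): the host and the device access the same pointer, with a synchronization before the host reads the result.
+
+```c++
+#include <cstdio>
+
+__global__ void scale(float *x, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) {
+        x[i] *= 2.0f;
+    }
+}
+
+int main() {
+    const int n = 1 << 20;
+    float *x;
+    cudaMallocManaged(&x, n * sizeof(float)); // visible to host and device
+    for (int i = 0; i < n; i++) {
+        x[i] = 1.0f;                          // host writes, no explicit copy
+    }
+    scale<<<(n + 255) / 256, 256>>>(x, n);
+    cudaDeviceSynchronize();                  // wait before the host reads
+    printf("x[0] = %f\n", x[0]);              // migrated back on demand
+    cudaFree(x);
+    return 0;
+}
+```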
+
+### Streams
+
+- **TODO**
+- If no stream is specified, it defaults to stream 0, aka the "null stream".
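+
+A minimal sketch of overlapping copies and kernels with two streams (illustrative names and sizes; pinned host memory is needed for the copies to be truly asynchronous):
+
+```c++
+__global__ void scale(float *x, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) {
+        x[i] *= 2.0f;
+    }
+}
+
+int main() {
+    const int n = 1 << 20, half = n / 2;
+    float *h, *d;
+    cudaMallocHost(&h, n * sizeof(float)); // pinned host memory
+    cudaMalloc(&d, n * sizeof(float));
+
+    cudaStream_t streams[2];
+    for (int i = 0; i < 2; i++) {
+        cudaStreamCreate(&streams[i]);
+    }
+
+    // Each half is copied in, processed and copied out in its own stream,
+    // so copies in one stream may overlap the kernel in the other.
+    for (int i = 0; i < 2; i++) {
+        int off = i * half;
+        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
+                        cudaMemcpyHostToDevice, streams[i]);
+        scale<<<(half + 255) / 256, 256, 0, streams[i]>>>(d + off, half);
+        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
+                        cudaMemcpyDeviceToHost, streams[i]);
+    }
+    for (int i = 0; i < 2; i++) {
+        cudaStreamSynchronize(streams[i]); // wait for this stream's work
+        cudaStreamDestroy(streams[i]);
+    }
+    cudaFreeHost(h);
+    cudaFree(d);
+    return 0;
+}
+```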
+
+## Tools
+
+**TODO** Add stuff from other document.
+
+### Nsight
+
+- For debugging and profiling applications.
+- Comes as multiple variants:
+    - Nsight Systems: For general applications. Should also be used for CUDA and graphics applications.
+    - Nsight Compute: For CUDA applications.
+    - Nsight Graphics: For graphical applications.
+    - IDE integrations (e.g. Nsight Visual Studio Edition).
+- Replaces nvprof.
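+- Example CLI usage (assuming the tools are installed): `nsys profile ./app` profiles an application system-wide with Nsight Systems, and `ncu ./app` profiles its CUDA kernels with Nsight Compute.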