
CUDA, ROCm, HIP

Håvard Ose Nordstrand, 3 years ago
parent
commit
59471d5a06
7 files changed, 181 insertions and 14 deletions
  1. config/hpc/containers.md (+2 −3)
  2. config/hpc/cuda.md (+9 −7)
  3. config/hpc/enroot.md (+33 −0)
  4. config/hpc/hip.md (+67 −0)
  5. config/hpc/rocm.md (+55 −0)
  6. index.md (+2 −0)
  7. se/hpc/cuda.md (+13 −4)

+ 2 - 3
config/hpc/containers.md

@@ -55,12 +55,11 @@ breadcrumbs:
 - Supports using Docker images (and Docker Hub).
 - No daemon.
 - Slurm integration using NVIDIA's [Pyxis](https://github.com/NVIDIA/pyxis) SPANK plugin.
-- Support NVIDIA GPUs through NVIDIA's [libnvidia-container](https://github.com/nvidia/libnvidia-container) library and CLI utility.
-    - **TODO** AMD ROCm support?
+- Support NVIDIA GPUs through NVIDIA's [libnvidia-container](https://github.com/nvidia/libnvidia-container) library and CLI utility (_official_ support from NVIDIA unlike certain other solutions).
 
 ### Shifter
 
-I've never used it. It's very similar to Singularity.
+I've never used it. It's apparently very similar to Singularity.
 
 ## Best Practices
 

+ 9 - 7
config/hpc/cuda.md

@@ -11,7 +11,8 @@ NVIDIA CUDA (Compute Unified Device Architecture) Toolkit, for programming CUDA-
 ### Related Pages
 {:.no_toc}
 
-- [CUDA (software engineering)](/config/se/general/cuda.md)
+- [HIP](/config/hpc/hip/)
+- [CUDA (software engineering)](/se/general/cuda/)
 
 ## Resources
 
@@ -22,7 +23,7 @@ NVIDIA CUDA (Compute Unified Device Architecture) Toolkit, for programming CUDA-
 
 ## Setup
 
-### Linux
+### Linux Installation
 
 The toolkit on Linux can be installed in different ways:
 
@@ -34,19 +35,20 @@ If an NVIDIA driver is already installed, it must match the CUDA version.
 
 Downloads: [CUDA Toolkit Download (NVIDIA)](https://developer.nvidia.com/cuda-downloads)
 
-#### Ubuntu (NVIDIA CUDA Repo)
+#### Ubuntu w/ NVIDIA's CUDA Repo
 
 1. Follow the steps to add the NVIDIA CUDA repo: [CUDA Toolkit Download (NVIDIA)](https://developer.nvidia.com/cuda-downloads)
     - But don't install `cuda` yet.
-1. Remove anything NVIDIA or CUDA from the system to avoid conflicts: `apt purge --autoremove cuda nvidia-* libnvidia-*`
+1. Remove anything NVIDIA or CUDA from the system to avoid conflicts: `apt purge --autoremove 'cuda' 'cuda-*' 'nvidia-*' 'libnvidia-*'`
     - Warning: May break your PC. There may be better ways to do this.
 1. Install CUDA from the new repo (includes the NVIDIA driver): `apt install cuda`
-1. Setup path: In `/etc/environment`, append `:/usr/local/cuda/bin` to the end of the PATH list.
+1. Setup PATH: `echo 'export PATH=$PATH:/usr/local/cuda/bin' | sudo tee -a /etc/profile.d/cuda.sh`
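
The sequence above can be condensed into a shell sketch (assumes the NVIDIA CUDA repo has already been added; as noted, the purge step is aggressive and may break the system):

```shell
# Remove any existing NVIDIA/CUDA packages to avoid conflicts (risky!).
sudo apt purge --autoremove 'cuda' 'cuda-*' 'nvidia-*' 'libnvidia-*'

# Install the CUDA toolkit from the NVIDIA repo (pulls in a matching driver).
sudo apt update && sudo apt install cuda

# Expose the toolkit binaries on PATH for all users.
echo 'export PATH=$PATH:/usr/local/cuda/bin' | sudo tee -a /etc/profile.d/cuda.sh
```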
 
 ### Docker Containers
 
-- Docker containers may run NVIDIA applications using the NVIDIA runtime for Docker.
-- **TODO**
+Docker containers may run NVIDIA applications using the NVIDIA runtime for Docker.
+
+See [Docker](/config/virt-cont/docker/).
 
 ### DCGM
 

+ 33 - 0
config/hpc/enroot.md

@@ -0,0 +1,33 @@
+---
+title: Enroot
+breadcrumbs:
+- title: Configuration
+- title: High-Performance Computing (HPC)
+---
+{% include header.md %}
+
+A container technology for HPC, made by NVIDIA.
+
+## Information
+
+- For more general information and a comparison to other HPC container technologies, see [Containers](/config/hpc/containers/).
+
+## Configuration
+
+**TODO**
+
+## Usage
+
+### Running
+
+- **TODO**
+
+### Images
+
+- **TODO**
+
+### GPUs
+
+- **TODO**
+
+{% include footer.md %}

+ 67 - 0
config/hpc/hip.md

@@ -0,0 +1,67 @@
+---
+title: HIP
+breadcrumbs:
+- title: Configuration
+- title: High-Performance Computing (HPC)
+---
+{% include header.md %}
+
+HIP (Heterogeneous-compute Interface for Portability) is AMD ROCm's runtime API and kernel language, which is compilable for both AMD (through ROCm) and NVIDIA (through CUDA) GPUs.
+Compared to OpenCL (which both NVIDIA and AMD also support), HIP is much more similar to CUDA (making it _very_ easy to port CUDA code) and allows reusing existing CUDA and ROCm profiling tools and the like.
+
+### Related Pages
+{:.no_toc}
+
+- [ROCm](/config/hpc/rocm/)
+- [CUDA](/config/hpc/cuda/)
+
+## Resources
+
+- [HIP Installation (AMD ROCm Docs)](https://rocmdocs.amd.com/en/latest/Installation_Guide/HIP-Installation.html)
+
+## Info
+
+- HIP code can be compiled for AMD ROCm using the HIP-Clang compiler or for CUDA using the NVCC compiler.
+- If using both CUDA with an NVIDIA GPU and ROCm with an AMD GPU in the same system, HIP seems to prefer ROCm with the AMD GPU when building applications. I found no way of changing the target platform (**TODO**).
+
+## Setup
+
+### Linux Installation
+
+Using **Ubuntu 20.04 LTS**.
+
+#### Common Steps Before
+
+1. Add the ROCm package repo (overlaps with ROCm installation):
+    1. Install requirements: `sudo apt install libnuma-dev wget gnupg2`
+    1. Add public key: `wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -`
+    1. Add repo: `echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list`
+    1. Update cache: `sudo apt update`
+
+#### Steps for NVIDIA Platforms
+
+1. Install the CUDA toolkit and the NVIDIA driver: See [CUDA](/config/hpc/cuda/).
+1. Install: `sudo apt install hip-nvcc`
+
+#### Steps for AMD Platforms
+
+1. Install stuff: `sudo apt install mesa-common-dev clang comgr`
+1. Install ROCm: See [ROCm](/config/hpc/rocm/).
+
+#### Common Steps After
+
+1. Fix symlinks and PATH:
+    - (NVIDIA platforms only) CUDA symlink (`/usr/local/cuda`): Should already point to the right thing.
+    - (AMD platforms only) ROCm symlink (`/opt/rocm`): `sudo ln -s /opt/rocm-4.2.0 /opt/rocm` (example)
+    - Add to PATH: `echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/rocprofiler/bin:/opt/rocm/opencl/bin' | sudo tee -a /etc/profile.d/rocm.sh`
+1. Verify installation: `/opt/rocm/bin/hipconfig --full`
+1. (Optional) Try to build the square example program: [square (ROCm HIP samples)](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/0_Intro/square)
+
+## Usage and Tools
+
+- Show system info:
+    - Show lots of HIP stuff: `hipconfig --config`
+    - Show platform (`amd` or `nvidia`): `hipconfig --platform`
+- Convert CUDA program to HIP: `hipify-perl input.cu > output.cpp`
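
As a sketch of the port-and-build workflow (the source file name is hypothetical; `hipcc` is HIP's compiler driver, which wraps HIP-Clang on AMD platforms and NVCC on NVIDIA platforms):

```shell
# Translate CUDA API calls in a CUDA source file to their HIP equivalents.
hipify-perl vector_add.cu > vector_add.cpp

# Compile for whichever platform HIP detects (check with `hipconfig --platform`).
hipcc vector_add.cpp -o vector_add

# Run on the local GPU.
./vector_add
```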
+
+{% include footer.md %}

+ 55 - 0
config/hpc/rocm.md

@@ -0,0 +1,55 @@
+---
+title: ROCm
+breadcrumbs:
+- title: Configuration
+- title: High-Performance Computing (HPC)
+---
+{% include header.md %}
+
+AMD ROCm (Radeon Open Compute), for programming AMD GPUs. AMD's alternative to NVIDIA's CUDA toolkit.
+It uses the runtime API and kernel language HIP, which is compilable for both AMD and NVIDIA GPUs.
+
+### Related Pages
+{:.no_toc}
+
+- [HIP](/config/hpc/hip/)
+
+## Resources
+
+- [ROCm Documentation (AMD ROCm Docs)](https://rocmdocs.amd.com/)
+- [ROCm Installation (AMD ROCm Docs)](https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html)
+
+## Setup
+
+### Linux Installation
+
+Using **Ubuntu 20.04 LTS**.
+
+#### Notes
+
+- Official installation instructions: [ROCm Installation (AMD ROCm Docs)](https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html)
+- **TODO** `video` and `render` groups required to use it? Using `sudo` as a temporary solution works.
+
+#### Steps
+
+1. If the `amdgpu-pro` driver is installed then uninstall it to avoid conflicts.
+1. If using Mellanox ConnectX NICs then Mellanox OFED must be installed before ROCm.
+1. Add the ROCm package repo:
+    1. Install requirements: `sudo apt install libnuma-dev wget gnupg2`
+    1. Add public key: `wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -`
+    1. Add repo: `echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list`
+    1. Update cache: `sudo apt update`
+1. Install: `sudo apt install rocm-dkms`
+1. Fix symlinks and PATH:
+    - ROCm symlink (`/opt/rocm`): `sudo ln -s /opt/rocm-4.2.0 /opt/rocm` (example) (**TODO** Will this automatically point to the right thing?)
+    - Add to PATH: `echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/rocprofiler/bin:/opt/rocm/opencl/bin' | sudo tee -a /etc/profile.d/rocm.sh`
+1. Reboot.
+1. Verify:
+    - `sudo /opt/rocm/bin/rocminfo` (should show e.g. one agent for the CPU and one for the GPU)
+    - `sudo /opt/rocm/opencl/bin/clinfo`
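
The installation steps above can be sketched as a single sequence (the versioned directory `/opt/rocm-4.2.0` is an example and must match the installed release):

```shell
# Add AMD's ROCm apt repo and install ROCm with the DKMS driver.
sudo apt install libnuma-dev wget gnupg2
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/debian/ ubuntu main' \
    | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update && sudo apt install rocm-dkms

# Point the stable path at the versioned install dir and set up PATH.
sudo ln -s /opt/rocm-4.2.0 /opt/rocm
echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/rocprofiler/bin:/opt/rocm/opencl/bin' \
    | sudo tee -a /etc/profile.d/rocm.sh

# After a reboot, verify that the runtime sees the CPU and GPU agents.
sudo /opt/rocm/bin/rocminfo
```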
+
+## Usage and Tools
+
+- Show GPU info: `rocm-smi`
+
+{% include footer.md %}

+ 2 - 0
index.md

@@ -42,6 +42,8 @@ Random collection of config notes and miscellaneous stuff. _Technically not a wi
 - [Slurm Workload Manager](/config/hpc/slurm/)
 - [Containers](/config/hpc/containers/)
 - [Singularity](/config/hpc/singularity/)
+- [HIP](/config/hpc/hip/)
+- [ROCm](/config/hpc/rocm/)
 - [CUDA](/config/hpc/cuda/)
 - [Open MPI](/config/hpc/openmpi/)
 - [Interconnects](/config/hpc/interconnects/)

+ 13 - 4
se/hpc/cuda.md

@@ -238,12 +238,21 @@ breadcrumbs:
     - IDE integrations.
 - Replaces nvprof.
 
-#### Installation
+### Nsight Compute
 
-1. Download the run-files from the website for each variant (System, Compute, Graphics) you want.
-1. Run the run-files with sudo.
+#### Info
 
-### Nsight Compute
+- Replaces nvprof.
+
+#### Installation (Ubuntu)
+
+- Nsight Systems and Nsight Compute come with CUDA if installed through NVIDIA's repos.
+- If it complains about missing Qt libraries, install `libqt5xdg3`.
+- Access to performance counters:
+    - Access to GPU performance counters is restricted to protect against side-channel attacks (see [Security Notice: NVIDIA Response to “Rendered Insecure: GPU Side Channel Attacks are Practical” - November 2018 (NVIDIA)](https://nvidia.custhelp.com/app/answers/detail/a_id/4738)). Profiling must therefore run either with sudo (or as a user with `CAP_SYS_ADMIN`), or after setting a driver module option which disables the protection. For non-sensitive applications (e.g. for teaching), this protection is not required. See [NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters (NVIDIA)](https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters) for more info.
+    - Enable access for all users: Add `options nvidia "NVreg_RestrictProfilingToAdminUsers=0"` to e.g. `/etc/modprobe.d/nvidia.conf` and reboot.
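
The module-option approach above could be scripted roughly as follows (a sketch; `/etc/modprobe.d/nvidia.conf` is the example file from the text):

```shell
# Allow all users to read GPU performance counters.
# WARNING: this disables the side-channel mitigation, so only do it on
# non-sensitive systems (e.g. teaching or lab machines).
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' \
    | sudo tee -a /etc/modprobe.d/nvidia.conf

# Reboot (or reload the nvidia module) for the option to take effect.
sudo reboot
```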
+
+#### Usage
 
 - May be run from command line (`ncu`) or using the graphical application (`ncu-ui`).
 - Kernel replays: In order to run all profiling methods for a kernel execution, Nsight might have to run the kernel multiple times by storing the state before the first kernel execution and restoring it for every replay. It does not restore any host state, so in case of host-device communication during the execution, this is likely to put the application in an inconsistent state and cause it to crash or give incorrect results. To rerun the whole application (aka "application mode") instead of transparently replaying individual kernels (aka "kernel mode"), specify `--replay-mode=application` (or the equivalent option in the GUI).
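
As an illustration of the CLI and the replay mode described above (the binary name and report file are hypothetical):

```shell
# Profile ./app, rerunning the entire application for each profiling pass
# ("application mode") instead of transparently replaying single kernels,
# which avoids inconsistent host state when there is host-device communication.
sudo ncu --replay-mode=application -o report ./app

# Inspect the collected report in the GUI.
ncu-ui report.ncu-rep
```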