jeremy/HON95-wiki @ 930590a15b4a2152ed1b2ada0b7dccfa54d27e42

title: CUDA breadcrumbs:

title: Configuration
title: High-Performance Computing (HPC) --- {% include header.md %}

NVIDIA CUDA (Compute Unified Device Architecture) Toolkit, for programming CUDA-capable GPUs.

{:.no_toc}

Resources

Setup

Linux Installation

The toolkit on Linux can be installed in different ways:

Through an an existing package in your distro's repos (simplest and most compatible with other packages, but may be outdated).
Through a downloaded package manager package (up to date but may be incompatible with your installed NVIDIA driver).
Through a runfile (same as previous but more cross-distro and harder to manage).

If an NVIDIA driver is already installed, it must match the CUDA version.

Downloads: CUDA Toolkit Download (NVIDIA)

Ubuntu w/ NVIDIA's CUDA Repo

Follow the steps to add the NVIDIA CUDA repo: CUDA Toolkit Download (NVIDIA)
- Use the "deb (network)" method, which will show instructions for adding the repo.
- But don't install cuda yet.
Remove anything NVIDIA or CUDA from the system to avoid conflicts: apt purge --autoremove 'cuda' 'cuda-' 'nvidia-*' 'libnvidia-*'
- Warning: May break your PC. There may be better ways to do this.
Install CUDA from the new repo (includes the NVIDIA driver): apt install cuda
Setup PATH: echo 'export PATH=$PATH:/usr/local/cuda/bin' | sudo tee -a /etc/profile.d/cuda.sh

Docker Containers

Docker containers may run NVIDIA applications using the NVIDIA runtime for Docker.

See Docker.

DCGM

For monitoring GPU hardware and performance.
See the DCGM exporter for Prometheus for monitoring NVIDIA GPUs from Prometheus.

Programming

See CUDA (software engineering).

Usage and Tools

Gathering system/GPU information with nvidia-smi:
- Show overview: nvidia-smi
- Show topology matrix: nvidia-smi topo --matrix
- Show topology info: nvidia-smi topo <option>
- Show NVLink info: nvidia-smi nvlink --status -i 0 (for GPU #0)
- Monitor device stats: nvidia-smi dmon
To specify which devices are available to the CUDA application and in which order, set the CUDA_VISIBLE_DEVICES env var to a comma-separated list of device IDs.

Troubleshooting

"Driver/library version mismatch" and similar:

Other related error messages from various tools:

"Failed to initialize NVML: Driver/library version mismatch"
"forward compatibility was attempted on non supported HW"

Caused by the NVIDIA driver being updated without the kernel module being reloaded.

Solution: Reboot.

{% include footer.md %}

cuda.md 3.2 KB

Istoric Crud

Related Pages

Resources

Setup

Linux Installation

Ubuntu w/ NVIDIA's CUDA Repo

Docker Containers

DCGM

Programming

Usage and Tools

Troubleshooting

cuda.md 3.2 KB Istoric Crud

Related Pages

Resources

Setup

Linux Installation

Ubuntu w/ NVIDIA's CUDA Repo

Docker Containers

DCGM

Programming

Usage and Tools

Troubleshooting

cuda.md 3.2 KB

Istoric Crud