Håvard Ose Nordstrand 4 years ago
parent
commit
7d7d6b8599
2 changed files with 36 additions and 2 deletions
  1. 13 0
      config/hpc/cuda.md
  2. 23 2
      se/hpc/cuda.md

+ 13 - 0
config/hpc/cuda.md

@@ -67,4 +67,17 @@ See [CUDA (software engineering)](/config/se/general/cuda.md).
     - Monitor device stats: `nvidia-smi dmon`
 - To specify which devices are available to the CUDA application and in which order, set the `CUDA_VISIBLE_DEVICES` env var to a comma-separated list of device IDs.
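For example, to make only devices 2 and 0 visible, in that order (device 2 then appears as CUDA device 0 inside the application):

```shell
# Restrict and reorder the devices visible to CUDA applications.
# Physical device 2 becomes logical device 0, physical device 0 becomes 1.
export CUDA_VISIBLE_DEVICES=2,0
echo "$CUDA_VISIBLE_DEVICES"
```

The variable must be set in the environment of the CUDA application itself, e.g. as `CUDA_VISIBLE_DEVICES=2,0 <application>`.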
 
 
+## Troubleshooting
+
+**"Driver/library version mismatch" and similar**:
+
+Other related error messages from various tools:
+
+- "Failed to initialize NVML: Driver/library version mismatch"
+- "forward compatibility was attempted on non supported HW"
+
+Caused by the NVIDIA driver being updated without the kernel module being reloaded.
+
+Solution: Reboot.
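If a reboot is inconvenient, reloading the NVIDIA kernel modules may also resolve the mismatch. A sketch, assuming no processes are currently holding the GPU (the set of loaded modules varies by driver version):

```shell
# Unload the NVIDIA kernel modules (dependent modules first).
# This fails if any process is still using the GPU.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# Running nvidia-smi triggers a reload; it should now report matching versions.
sudo nvidia-smi
```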
+
 {% include footer.md %}

+ 23 - 2
se/hpc/cuda.md

@@ -115,13 +115,21 @@ breadcrumbs:
 
 
 ### Streams

-- **TODO**
-- If no stream is specified, it defaults to stream 0, aka the "null stream".
+- All device operations (kernels and memory operations) run sequentially in a single stream, which defaults to the "null stream" (stream 0) if none is specified.
+- The null stream is synchronized with all other streams. Pass the `cudaStreamNonBlocking` flag when creating a stream to avoid this synchronization with the null stream. CUDA 7 added a "per-thread default stream" option that changes the default behavior so that (1) each host thread gets its own default stream and (2) default streams act like regular streams (no longer synchronizing with every other stream). To enable it, pass `--default-stream per-thread` to `nvcc`. When enabled, `cudaStreamLegacy` may be used if the old null-stream behavior is needed for some reason.
+- If a streamed application runs less asynchronously than expected, make sure no operation is (accidentally) being issued to the null stream.
+- While streams are useful for decoupling and overlapping independent execution streams, the operations are still _somewhat_ performed in order (but potentially overlapping) on the device (on the different engines). Keep this in mind, e.g. when issuing multiple memory transfers for multiple streams.
+- Streams allow for running asynchronous memory transfers and kernel execution at the same time, by running them in _different_, _non-default_ streams. For memory transfers, the memory must be managed/pinned. Take care _not_ to use the default stream as this will synchronize with everything.
+- Streams are created with `cudaStreamCreate` and destroyed with `cudaStreamDestroy`.
+- Memory transfers using streams require `cudaMemcpyAsync` (with the stream specified) instead of `cudaMemcpy`. The variants `cudaMemcpy2DAsync` and `cudaMemcpy3DAsync` may be used for strided access.
+- Kernels are issued within a stream by specifying the stream as the fourth launch configuration parameter (the third parameter, the shared memory size, may be set to `0` to ignore it).
+- To wait for all operations for a stream and device to finish, use `cudaStreamSynchronize`. `cudaStreamQuery` may be used to query pending/unfinished operations without blocking. Events may also be used for synchronization. To wait for _all_ streams on a device, use the normal `cudaDeviceSynchronize` instead.
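The points above can be sketched as follows (a minimal sketch with error checking omitted; the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *host_buf, *dev_buf;
    cudaMallocHost(&host_buf, n * sizeof(float));  // pinned host memory, required for async copies
    cudaMalloc(&dev_buf, n * sizeof(float));

    // Non-default stream that does not synchronize with the null stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev_buf, 2.0f, n);  // stream as 4th launch parameter
    cudaMemcpyAsync(host_buf, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for all operations issued to this stream
    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```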
 
 
 ### Miscellanea

 - When transferring lots of small data arrays, try to combine them. For strided data, try to use `cudaMemcpy2D` or `cudaMemcpy3D`. Otherwise, try to copy the small arrays into a single, temporary, pinned array.
 - For getting device attributes/properties, `cudaDeviceGetAttribute` is significantly faster than `cudaGetDeviceProperties`.
+- Use `cudaDeviceReset` to reset all state for the device by destroying the CUDA context.
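A minimal sketch of the attribute API and the context reset (device 0 assumed):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int value = 0;
    // Much cheaper than cudaGetDeviceProperties when only one attribute is needed.
    cudaDeviceGetAttribute(&value, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("Max threads per block on device 0: %d\n", value);

    cudaDeviceReset();  // destroy the context, discarding all device state
    return 0;
}
```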
 
 
 ## Tools

@@ -141,6 +149,10 @@ breadcrumbs:
 - Example usage to show which CUDA calls and kernels take the longest to run: `sudo nvprof <application>`
 - Example usage to show interesting metrics: `sudo nvprof --metrics "eligible_warps_per_cycle,achieved_occupancy,sm_efficiency,alu_fu_utilization,dram_utilization,inst_replay_overhead,gst_transactions_per_request,l2_utilization,gst_requested_throughput,flop_count_dp,gld_transactions_per_request,global_cache_replay_overhead,flop_dp_efficiency,gld_efficiency,gld_throughput,l2_write_throughput,l2_read_throughput,branch_efficiency,local_memory_overhead" <application>`
 
 
+### NVIDIA Visual Profiler (nvvp)
+
+- The older, now-legacy visual profiler, superseded by the Nsight tools.
+
 ### Nsight

 - For debugging and profiling applications.
@@ -150,6 +162,15 @@ breadcrumbs:
     - Nsight Graphics: For graphical applications.
     - IDE integration.
 - Replaces nvprof.
+
+#### Installation
+
+1. Download the run-files from the website for each variant (System, Compute, Graphics) you want.
+1. Run the run-files with sudo.
+1. **TODO** Fix path or something.
+
+#### Usage
+
 - When it reruns kernels for different tests, it restores the GPU state but not the host state. If this causes incorrect behavior, set `--replay-mode=application` to rerun the entire application instead.
 
 
 {% include footer.md %}