Håvard Ose Nordstrand 4 years ago
parent
commit
7d7d6b8599
2 changed files with 36 additions and 2 deletions
  1. 13 0
      config/hpc/cuda.md
  2. 23 2
      se/hpc/cuda.md

+ 13 - 0
config/hpc/cuda.md

@@ -67,4 +67,17 @@ See [CUDA (software engineering)](/config/se/general/cuda.md).
     - Monitor device stats: `nvidia-smi dmon`
 - To specify which devices are available to the CUDA application and in which order, set the `CUDA_VISIBLE_DEVICES` env var to a comma-separated list of device IDs.
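For example, to make only devices 2 and 0 visible, in that order (device 2 then appears as CUDA device 0 inside the application):

```shell
# Restrict and reorder the devices visible to CUDA applications.
# Physical device 2 becomes logical device 0, physical device 0 becomes 1.
export CUDA_VISIBLE_DEVICES=2,0
echo "$CUDA_VISIBLE_DEVICES"
```

The variable must be set in the environment of the CUDA application itself, e.g. as `CUDA_VISIBLE_DEVICES=2,0 <application>`.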
 
 
+## Troubleshooting
+
+**"Driver/library version mismatch" and similar**:
+
+Other related error messages from various tools:
+
+- "Failed to initialize NVML: Driver/library version mismatch"
+- "forward compatibility was attempted on non supported HW"
+
+Caused by the NVIDIA driver being updated without the kernel module being reloaded.
+
+Solution: Reboot.
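If a reboot is inconvenient, reloading the NVIDIA kernel modules may also resolve the mismatch. A sketch, assuming no processes are currently holding the GPU (the set of loaded modules varies by driver version):

```shell
# Unload the NVIDIA kernel modules (dependent modules first).
# This fails if any process is still using the GPU.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# Running nvidia-smi triggers a reload; it should now report matching versions.
sudo nvidia-smi
```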
+
 {% include footer.md %}

+ 23 - 2
se/hpc/cuda.md

@@ -115,13 +115,21 @@ breadcrumbs:
 
 
 ### Streams

-- **TODO**
-- If no stream is specified, it defaults to stream 0, aka the "null stream".
+- All device operations (kernels and memory operations) run sequentially in a single stream, which defaults to the "null stream" (stream 0) if none is specified.
+- The null stream is synchronized with all other streams. Pass the `cudaStreamNonBlocking` flag when creating a stream to avoid this synchronization with the null stream. CUDA 7 added a "per-thread default stream" option that changes the default behavior so that (1) each host thread gets its own default stream and (2) default streams act like regular streams (no longer synchronizing with every other stream). To enable it, pass `--default-stream per-thread` to `nvcc`. When enabled, `cudaStreamLegacy` may be used if the old null-stream behavior is needed for some reason.
+- If a streamed application runs less asynchronously than expected, make sure no operation is (accidentally) being issued to the null stream.
+- While streams are useful for decoupling and overlapping independent execution streams, the operations are still _somewhat_ performed in order (but potentially overlapping) on the device (on the different engines). Keep this in mind, e.g. when issuing multiple memory transfers for multiple streams.
+- Streams allow for running asynchronous memory transfers and kernel execution at the same time, by running them in _different_, _non-default_ streams. For memory transfers, the memory must be managed/pinned. Take care _not_ to use the default stream as this will synchronize with everything.
+- Streams are created with `cudaStreamCreate` and destroyed with `cudaStreamDestroy`.
+- Memory transfers using streams require `cudaMemcpyAsync` (with the stream specified) instead of `cudaMemcpy`. The variants `cudaMemcpy2DAsync` and `cudaMemcpy3DAsync` may be used for strided access.
+- Kernels are issued within a stream by specifying the stream as the fourth launch configuration parameter (the third parameter, the shared memory size, may be set to `0` to ignore it).
+- To wait for all operations for a stream and device to finish, use `cudaStreamSynchronize`. `cudaStreamQuery` may be used to query pending/unfinished operations without blocking. Events may also be used for synchronization. To wait for _all_ streams on a device, use the normal `cudaDeviceSynchronize` instead.
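The points above can be sketched as follows (a minimal sketch with error checking omitted; the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *host_buf, *dev_buf;
    cudaMallocHost(&host_buf, n * sizeof(float));  // pinned host memory, required for async copies
    cudaMalloc(&dev_buf, n * sizeof(float));

    // Non-default stream that does not synchronize with the null stream.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    cudaMemcpyAsync(dev_buf, host_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev_buf, 2.0f, n);  // stream as 4th launch parameter
    cudaMemcpyAsync(host_buf, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for all operations issued to this stream
    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```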
 
 
 ### Miscellanea

 - When transferring lots of small data arrays, try to combine them. For strided data, try to use `cudaMemcpy2D` or `cudaMemcpy3D`. Otherwise, try to copy the small arrays into a single, temporary, pinned array.
 - For getting device attributes/properties, `cudaDeviceGetAttribute` is significantly faster than `cudaGetDeviceProperties`.
+- Use `cudaDeviceReset` to reset all state for the device by destroying the CUDA context.
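A minimal sketch of the attribute API and the context reset (device 0 assumed):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int value = 0;
    // Much cheaper than cudaGetDeviceProperties when only one attribute is needed.
    cudaDeviceGetAttribute(&value, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("Max threads per block on device 0: %d\n", value);

    cudaDeviceReset();  // destroy the context, discarding all device state
    return 0;
}
```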
 
 
 ## Tools

@@ -141,6 +149,10 @@ breadcrumbs:
 - Example usage to show which CUDA calls and kernels take the longest to run: `sudo nvprof <application>`
 - Example usage to show interesting metrics: `sudo nvprof --metrics "eligible_warps_per_cycle,achieved_occupancy,sm_efficiency,alu_fu_utilization,dram_utilization,inst_replay_overhead,gst_transactions_per_request,l2_utilization,gst_requested_throughput,flop_count_dp,gld_transactions_per_request,global_cache_replay_overhead,flop_dp_efficiency,gld_efficiency,gld_throughput,l2_write_throughput,l2_read_throughput,branch_efficiency,local_memory_overhead" <application>`
 
 
+### NVIDIA Visual Profiler (nvvp)
+
+- The older, now-legacy visual profiler, superseded by the Nsight tools.
+
 ### Nsight

 - For debugging and profiling applications.
@@ -150,6 +162,15 @@ breadcrumbs:
     - Nsight Graphics: For graphical applications.
     - IDE integration.
 - Replaces nvprof.
+
+#### Installation
+
+1. Download the run-files from the website for each variant (System, Compute, Graphics) you want.
+1. Run the run-files with sudo.
+1. **TODO** Fix path or something.
+
+#### Usage
+
 - When it reruns kernels for different tests, it restores the GPU state but not the host state. If this causes incorrect behavior, set `--replay-mode=application` to rerun the entire application instead.
 
 
 {% include footer.md %}