4 years ago · 7d7d6b8599
--- a/config/hpc/cuda.md
+++ b/config/hpc/cuda.md
@@ -67,4 +67,17 @@ See [CUDA (software engineering)](/config/se/general/cuda.md).
 
				     - Monitor device stats: `nvidia-smi dmon`
			
 
				 - To specify which devices are available to the CUDA application and in which order, set the `CUDA_VISIBLE_DEVICES` env var to a comma-separated list of device IDs.
			
 
				 
			
 
				+## Troubleshooting
			
 
				+
			
 
				+**"Driver/library version mismatch" and similar**:
			
 
				+
			
 
				+Other related error messages from various tools:
			
 
				+
			
 
				+- "Failed to initialize NVML: Driver/library version mismatch"
			
 
				+- "forward compatibility was attempted on non supported HW"
			
 
				+
			
 
				+Caused by the NVIDIA driver (user-space libraries) being updated while the old kernel module is still loaded.
			
 
				+
			
 
				+Solution: Reboot.
			
 
				+
			
 
				 {% include footer.md %}
			
--- a/se/hpc/cuda.md
+++ b/se/hpc/cuda.md
@@ -115,13 +115,21 @@ breadcrumbs:
 
				 
			
 
				 ### Streams
			
 
				 
			
 
				-- **TODO**
			
 
				-- If no stream is specified, it defaults to stream 0, aka the "null stream".
			
 
				+- Every device operation (kernel or memory operation) is issued into a stream, and operations within a single stream run sequentially. If no stream is specified, it defaults to the "null stream" (stream 0).
			
 
				+- The null stream is synchronous with all other streams. The `cudaStreamNonBlocking` flag may be specified when creating other streams (using `cudaStreamCreateWithFlags`) to avoid synchronizing with the null stream. CUDA 7 and later allow setting an option ("per-thread default stream") that changes the default behavior so that (1) each host thread has a separate default stream and (2) default streams act like regular streams (no longer synchronized with every other stream). To enable this, set `--default-stream per-thread` for `nvcc`. When enabled, `cudaStreamLegacy` may be used if you need the old null stream for some reason.
			
 
				+- If using streams and the application is running less asynchronously than it should, make sure you're not (accidentally) using the null stream for anything.
			
 
				+- While streams are useful for decoupling and overlapping independent sequences of work, the operations are still performed _somewhat_ in order (but potentially overlapping) on the device's different engines. Keep this in mind, e.g. when issuing multiple memory transfers across multiple streams.
			
 
				+- Streams allow asynchronous memory transfers and kernel execution to run at the same time, provided they are issued in _different_, _non-default_ streams. For memory transfers, the host memory must be managed or pinned. Take care _not_ to use the default stream, as it synchronizes with everything.
			
 
				+- Streams are created with `cudaStreamCreate` and destroyed with `cudaStreamDestroy`.
			
 
				+- Memory transfers using streams require `cudaMemcpyAsync` (with the stream specified) instead of `cudaMemcpy`. The variants `cudaMemcpy2DAsync` and `cudaMemcpy3DAsync` may also be used for strided access.
			
 
				+- Kernels are issued within a stream by specifying the stream as the fourth launch parameter (the third parameter, the dynamic shared memory size, may be set to `0` if it is not needed).
			
 
				+- To wait for all operations in a stream to finish, use `cudaStreamSynchronize`. `cudaStreamQuery` may be used to check for pending/unfinished operations without blocking. Events may also be used for synchronization. To wait for _all_ streams on a device, use the normal `cudaDeviceSynchronize` instead.
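
The stream notes above can be illustrated with a minimal sketch. It is not part of the commit. The `scale` kernel, the `CHECK` macro, and the buffer sizes are hypothetical names chosen for illustration; the runtime calls are the ones named in the notes, plus `cudaMallocHost` for pinned host memory and `cudaStreamCreateWithFlags` for passing `cudaStreamNonBlocking`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical kernel: multiplies each element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 2;
    const int kChunk = 1 << 20;  // elements handled per stream
    const size_t kBytes = kChunk * sizeof(float);

    float *host[kStreams], *dev[kStreams];
    cudaStream_t streams[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        // Pinned host memory, so the async copies can overlap with kernels.
        CHECK(cudaMallocHost(&host[s], kBytes));
        CHECK(cudaMalloc(&dev[s], kBytes));
        // Non-blocking: no implicit synchronization with the null stream.
        CHECK(cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking));
        for (int i = 0; i < kChunk; ++i) host[s][i] = 1.0f;
    }

    const int block = 256;
    const int grid = (kChunk + block - 1) / block;
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaMemcpyAsync(dev[s], host[s], kBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        // Third launch parameter (dynamic shared memory) is 0,
        // fourth is the stream the kernel is issued into.
        scale<<<grid, block, 0, streams[s]>>>(dev[s], kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host[s], dev[s], kBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    for (int s = 0; s < kStreams; ++s) {
        // Wait only for this stream's copy-kernel-copy sequence.
        CHECK(cudaStreamSynchronize(streams[s]));
        std::printf("stream %d: host[0] = %.1f\n", s, host[s][0]);
        CHECK(cudaStreamDestroy(streams[s]));
        CHECK(cudaFree(dev[s]));
        CHECK(cudaFreeHost(host[s]));
    }
    return 0;
}
```

Within each stream the copy, kernel, and copy-back run in order. Whether work from the two streams actually overlaps depends on the device's copy and compute engines, so a profiler timeline is the way to confirm it.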
			
 
				 
			
 
				 ### Miscellanea
			
 
				 
			
 
				 - When transferring lots of small data arrays, try to combine them. For strided data, try to use `cudaMemcpy2D` or `cudaMemcpy3D`. Otherwise, try to copy the small arrays into a single, temporary, pinned array.
			
 
				 - For getting device attributes/properties, `cudaDeviceGetAttribute` is significantly faster than `cudaGetDeviceProperties`.
			
 
				+- Use `cudaDeviceReset` to reset all state for the device by destroying the CUDA context.
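
As a rough sketch of the attribute and reset notes above (not part of the commit), the following queries a few attributes per device with `cudaDeviceGetAttribute` and then calls `cudaDeviceReset`. The choice of attributes is only an example, and error handling is kept minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA driver/device found\n");
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        int sms = 0, major = 0, minor = 0;
        // Each call fetches a single attribute instead of filling the
        // whole cudaDeviceProp struct as cudaGetDeviceProperties does.
        cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, dev);
        std::printf("device %d: %d SMs, compute capability %d.%d\n",
                    dev, sms, major, minor);
    }

    // Destroys the context and all state for the current device
    // in this process, e.g. before exiting.
    cudaDeviceReset();
    return 0;
}
```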
			
 
				 
			
 
				 ## Tools
			
 
				 
			
@@ -141,6 +149,10 @@ breadcrumbs:
 
				 - Example usage to show which CUDA calls and kernels take the longest to run: `sudo nvprof <application>`
			
 
				 - Example usage to show interesting metrics: `sudo nvprof --metrics "eligible_warps_per_cycle,achieved_occupancy,sm_efficiency,alu_fu_utilization,dram_utilization,inst_replay_overhead,gst_transactions_per_request,l2_utilization,gst_requested_throughput,flop_count_dp,gld_transactions_per_request,global_cache_replay_overhead,flop_dp_efficiency,gld_efficiency,gld_throughput,l2_write_throughput,l2_read_throughput,branch_efficiency,local_memory_overhead" <application>`
			
 
				 
			
 
				+### NVIDIA Visual Profiler (nvvp)
			
 
				+
			
 
				+- **TODO** Seems like an older version of Nsight.
			
 
				+
			
 
				 ### Nsight
			
 
				 
			
 
				 - For debugging and profiling applications.
			
@@ -150,6 +162,15 @@ breadcrumbs:
 
				     - Nsight Graphics: For graphical applications.
			
 
				     - IDE integration.
			
 
				 - Replaces nvprof.
			
 
				+
			
 
				+#### Installation
			
 
				+
			
 
				+1. Download the run-files from the website for each variant (System, Compute, Graphics) you want.
			
 
				+1. Run the run-files with sudo.
			
 
				+1. **TODO** Fix path or something.
			
 
				+
			
 
				+#### Usage
			
 
				+
			
 
				 - When it reruns kernels for different tests, it restores the GPU state but not the host state. If this causes incorrect behavior, set `--replay-mode=application` to rerun the entire application instead.
			
 
				 
			
 
				 {% include footer.md %}