|
@@ -44,26 +44,43 @@ breadcrumbs:
|
|
|
|
|
|
### Memory
|
|
|
|
|
|
-#### Memory Hierarchy
|
|
|
-
|
|
|
-- **TODO**
|
|
|
-- Memories (local to global):
|
|
|
- 1. **TODO** Fix, these names are wrong.
|
|
|
- 1. Registers.
|
|
|
- 1. Shared memory (block cache).
|
|
|
- 1. Read-only memories.
|
|
|
- 1. SM cache.
|
|
|
- 1. Global memory.
|
|
|
+#### Memory Hierarchy Overview
|
|
|
+
|
|
|
+- Memories:
|
|
|
+ - Register.
|
|
|
+ - Local.
|
|
|
+ - Shared.
|
|
|
+ - Global.
|
|
|
+ - Constant.
|
|
|
+ - Texture.
|
|
|
+- Accessibility and lifetimes:
|
|
|
+ - The global, constant and texture memories are accessible from the host and persist between kernel executions.
|
|
|
+ - The register and local memories are thread-local (where the latter is automatically spilled into when the register memory is full).
|
|
|
+ - The shared memories are block-local.
|
|
|
+- Read-write access:
|
|
|
+ - The constant and texture memories are read-only (wrt. the device).
|
|
|
+- Caching:
|
|
|
+ - The constant and texture memories are cached.
|
|
|
+ - The global and local memories are cached in L1 and L2 on newer devices.
|
|
|
+ - The register and shared memories are on-chip and fast, so they don't need to be cached.
|
|
|
|
|
|
#### Global Memory
|
|
|
|
|
|
- The largest and slowest memory on the device.
|
|
|
- Resides in the GPU DRAM.
|
|
|
- Variables may persist for the lifetime of the application.
|
|
|
-- The only memory the host can copy data into or out of.
|
|
|
+- One of the memories the host can access (outside of kernels).
|
|
|
- The only memory threads from different blocks can share data in.
|
|
|
- Statically declared in global scope using the `__device__` declaration or dynamically allocated using `cudaMalloc`.
|
|
|
-- Global memory coalescing: When multiple threads in a warp access global memory in strided fashion (e.g. when all threads in the warp access sequential parts of an array), the device will try to _coalesce_ the access into as few transactions as possible in order to mimimize memory load.
|
|
|
+- Global memory coalescing:
|
|
|
+ - When multiple threads in a warp access global memory in an aligned and sequential fashion (e.g. when all threads in the warp access sequential parts of an array), the device will try to _coalesce_ the access into as few 32-byte transactions as possible in order to reduce the number of transaction and increase the ratio of useful to fetched data.
|
|
|
+ - This description overlaps a bit with data alignment, which is described elsewhere on this page.
|
|
|
+ - Since the global memory will be accessed using 32-byte transactions, the data should be aligned to 32 bytes and preferably not too fragmented, to request as few 32-byte segments as possible. Note that memory allocated through the CUDA API is guaranteed to be aligned to 256 bytes.
|
|
|
+ - Special care should be taken to ensure that this is always done right.
|
|
|
+ - Caching will typically mitigate the impact of unaligned memory accesses.
|
|
|
+ - Thread block sizes that are multiple of the warp size (32) will give the most optimal alignments.
|
|
|
+ - Older hardware coalesce accesses within half warps instead of the whole warp.
|
|
|
+ - **TODO** More info.
|
|
|
|
|
|
#### Local Memory
|
|
|
|
|
@@ -94,9 +111,12 @@ breadcrumbs:
|
|
|
|
|
|
#### Data Alignment
|
|
|
|
|
|
-- Memory is accessed in 4, 8 or 16 byte transactions.
|
|
|
-- Accessing data with unaligned pointers incurs a performance hit.
|
|
|
-- Allocated date is always aligned to a 32-byte boundary, but elements within the array are generally not.
|
|
|
+- Memory is accessed in 4, 8 or 16 byte transactions. (**TODO** 32 byte?)
|
|
|
+- Accessing data with unaligned pointers generally incurs a performance hit, since it may fetch more segments than for aligned data or since it may prevent coalesced access.
|
|
|
+- Related to e.g. global memory coalescing (described somewhere else on this page).
|
|
|
+- Caching will typically mitigate somewhat the impact of unaligned memory accesses.
|
|
|
+- Memory allocated through the CUDA API is guaranteed to be aligned to 256 bytes.
|
|
|
+- Elements _within_ allocated arrays are generally not aligned unless special care is taken.
|
|
|
- To make sure array elements are aligned, use structs/classes with the `__align__(n)` qualifier and `n` as some multiple of the transaction sizes.
|
|
|
|
|
|
### Synchronization
|