Understanding device memory accesses is crucial for optimizing CUDA programs. Instructions that interact with addressable memory, including global, local, shared, constant, or texture memory, can be re-issued multiple times depending on how memory addresses are distributed across threads within a warp. This distribution significantly impacts instruction throughput, varying across different memory types. For global memory, scattered addresses generally lead to reduced throughput.
Global Memory and Coalescing
Global memory resides in device memory, accessed through 32-, 64-, or 128-byte memory transactions. These transactions must be naturally aligned, meaning only memory segments aligned to their size can be accessed.
When a warp executes a global memory access instruction, the memory accesses from threads within the warp are coalesced into one or more transactions. The number of transactions depends on the word size accessed by each thread and the distribution of memory addresses. More transactions often mean more unused words transferred, reducing throughput. For instance, 32-byte transactions for each thread’s 4-byte access can divide throughput by eight.
The number of transactions and throughput impact varies with the device’s compute capability. Refer to specific compute capability documentation for detailed handling of global memory accesses.
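To make the coalescing behavior concrete, here is a minimal sketch (the kernel names and the stride parameter are illustrative, not taken from the text above) contrasting an access pattern that coalesces well with one that scatters addresses across a warp:

__global__ void coalescedRead(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // consecutive threads read consecutive 4-byte words
}

__global__ void stridedRead(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride]; // addresses spread out, so a warp needs more transactions
}

In the first kernel a warp’s 32 loads fall into a few contiguous segments; in the second, a large stride forces separate transactions whose extra words go unused.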
To maximize global memory throughput, effective coalescing is essential:
- Optimal Access Patterns: Adhere to access patterns optimized for your device’s compute capability.
- Data Type Alignment: Use data types that meet size and alignment requirements.
- Data Padding: In certain scenarios, like accessing 2D arrays, padding data can improve coalescing.
Size and Alignment Requirements for Global Memory
Global memory instructions support word sizes of 1, 2, 4, 8, or 16 bytes for reading and writing. A single global memory instruction is generated for data access only if:
- The data type size is 1, 2, 4, 8, or 16 bytes.
- The data is naturally aligned (address is a multiple of its size).
Failing to meet these requirements results in multiple instructions with interleaved access patterns, hindering full coalescing. Using compliant data types for global memory is highly recommended. Built-in vector types automatically fulfill alignment requirements.
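For example (a small sketch with illustrative names), reading a built-in vector type such as float4 lets each thread access one naturally aligned 16-byte word in a single instruction:

__global__ void copyFloat4(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // one 16-byte load and one 16-byte store per thread
}

Since memory returned by the allocation routines is aligned to at least 256 bytes, float4 accesses through such pointers satisfy the 16-byte alignment requirement.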
For structures, alignment specifiers like __align__(8) or __align__(16) can be used:
struct __align__(8) { float x; float y; };
struct __align__(16) { float x; float y; float z; };
Memory allocated in global memory (via driver or runtime API routines) is always aligned to at least 256 bytes.
Incorrect results can occur when reading non-naturally aligned 8-byte or 16-byte words. Special attention is needed for alignment, especially in custom global memory allocation schemes where multiple arrays are allocated from a single large block. Ensure each array’s starting address is correctly aligned relative to the block’s start.
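One possible way to keep such sub-allocations aligned (a sketch with hypothetical names; the element types and counts are illustrative) is to round each array’s byte offset within the block up to the alignment of its element type:

#include <cuda_runtime.h>

// Round offset up to the next multiple of alignment.
static size_t roundUp(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

cudaError_t allocateTwoArrays(char** block, float** a, float4** b, size_t nA, size_t nB)
{
    size_t offA = 0;                                    // float needs 4-byte alignment
    size_t offB = roundUp(offA + nA * sizeof(float),
                          sizeof(float4));              // float4 needs 16-byte alignment
    cudaError_t err = cudaMalloc((void**)block, offB + nB * sizeof(float4));
    if (err != cudaSuccess) return err;
    *a = (float*)(*block + offA);                       // the block itself is >= 256-byte aligned
    *b = (float4*)(*block + offB);
    return cudaSuccess;
}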
Optimizing Access to Two-Dimensional Arrays in CUDA
Accessing elements in a 2D array is a common pattern in CUDA programming. Consider a 2D array of width width, located at BaseAddress, with element type type. Threads with index (tx, ty) often access elements using the address:
BaseAddress + width * ty + tx
For full coalescing in such accesses, both the thread block width and the array width should be multiples of the warp size. If the array width isn’t a multiple, allocate the array with a width rounded up to the nearest multiple of the warp size and pad the rows accordingly. Functions like cudaMallocPitch() and cuMemAllocPitch(), together with the associated memory copy functions, are designed to allocate arrays that meet these constraints in a hardware-independent manner.
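A minimal sketch of the pitched-allocation pattern (kernel and variable names are illustrative): cudaMallocPitch() returns a row stride in bytes, the pitch, that already includes any padding, and kernels index rows through that pitch rather than the logical width.

#include <cuda_runtime.h>

__global__ void scale2D(float* data, size_t pitch, int width, int height, float s)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx < width && ty < height) {
        // Row ty starts pitch bytes (not width * sizeof(float)) after row ty - 1.
        float* row = (float*)((char*)data + ty * pitch);
        row[tx] *= s;
    }
}

void example(int width, int height)
{
    float* dData;
    size_t pitch;
    cudaMallocPitch((void**)&dData, &pitch, width * sizeof(float), height);
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale2D<<<grid, block>>>(dData, pitch, width, height, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(dData);
}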
Local Memory Considerations
Local memory usage in CUDA kernels is tied to automatic variables under specific conditions:
- Arrays with non-constant indices.
- Large structures or arrays exceeding register space.
- Variables in kernels using more registers than available (register spilling).
Compiler output, particularly PTX assembly code (obtained with the -ptx or -keep options), reveals whether a variable is initially placed in local memory: it will be declared with the .local mnemonic and accessed via ld.local and st.local instructions. However, later compilation phases might still move variables to local memory due to register pressure; tools like cuobjdump can confirm local memory usage in the cubin object. The compiler also reports total local memory usage (lmem) when compiling with --ptxas-options=-v. Be aware that some math functions might internally use local memory.
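As a hedged illustration (the kernel and the array size are hypothetical), an automatic array indexed with a data-dependent value is a typical candidate for local memory placement:

__global__ void localMemoryExample(const float* in, const int* indices, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float buf[64];                        // automatic array
    for (int k = 0; k < 64; ++k)
        buf[k] = in[i] * k;               // values not known at compile time
    out[i] = buf[indices[i] & 63];        // non-constant, data-dependent index
}

Compiling this with -ptx would typically show ld.local and st.local accesses to buf, and --ptxas-options=-v would report the corresponding lmem usage.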
Local memory resides in device memory, mirroring global memory in latency and bandwidth characteristics. It’s also subject to global memory coalescing rules. However, local memory is organized to allow fully coalesced accesses when threads in a warp access the same relative address (e.g., same array index or structure member).
From compute capability 5.x onwards, local memory is cached in L2 cache, similar to global memory.
Shared Memory for High-Bandwidth Access
Shared memory, being on-chip, offers significantly higher bandwidth and lower latency compared to local or global memory. It’s divided into banks for concurrent access. Ideally, memory requests to distinct banks are serviced simultaneously, maximizing bandwidth.
Bank conflicts occur when multiple addresses in a request fall into the same bank, leading to serialized access. The hardware resolves conflicts by splitting requests into conflict-free parts, reducing throughput. n-way bank conflicts indicate the initial request is split into n separate requests.
Optimizing shared memory usage involves understanding address-to-bank mapping to minimize bank conflicts. Refer to compute capability-specific documentation for details on bank organization and conflict avoidance strategies for devices of different compute capabilities.
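One widely used idiom (shown here as a sketch, assuming a device with 32 banks and a 32x32 thread block) is to pad a shared-memory tile with one extra column so that column-wise accesses map to distinct banks:

#define TILE 32

__global__ void transpose(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];    // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // row-wise write: conflict-free

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;                   // transposed coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read of the tile
}

Without the extra column, the column-wise reads of the tile would hit the same bank 32 times, producing a 32-way conflict and serializing the request.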
Constant Memory and Caching
Constant memory, also located in device memory, is cached in a constant cache. Constant memory is optimized for uniform access across threads within a warp.
Accessing different addresses within a warp in constant memory leads to serialized requests. Each unique address in the request becomes a separate request, reducing throughput proportionally. Performance then depends on cache hits and misses: cache hits provide constant cache throughput, while misses fall back to device memory throughput.
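A short sketch of the uniform-access pattern constant memory is designed for (the symbol name and sizes are illustrative):

#include <cuda_runtime.h>

__constant__ float coeffs[16];

__global__ void applyCoeffs(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    float acc = 0.0f;
    for (int k = 0; k < 16; ++k)
        acc = acc * x + coeffs[k];   // every thread of the warp reads the same coeffs[k]
    out[i] = acc;
}

void uploadCoeffs(const float hostCoeffs[16])
{
    // Copy host data into the __constant__ symbol before launching the kernel.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));
}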
Texture and Surface Memory: Caching for Spatial Locality
Texture and surface memory reside in device memory and are cached in a texture cache. Texture fetches and surface reads only incur device memory access on a cache miss; otherwise, they are served from the texture cache. The texture cache is designed for 2D spatial locality, favoring threads in a warp accessing nearby 2D texture or surface addresses. It’s optimized for streaming fetches with constant latency. Cache hits reduce DRAM bandwidth demand but not fetch latency.
Using texture or surface fetching for device memory reads can be advantageous:
- Improved Bandwidth for Non-Coalesced Accesses: When global or constant memory access patterns are suboptimal, texture/surface memory can provide higher bandwidth if there is spatial locality in accesses.
- Offloaded Address Calculations: Dedicated units handle addressing calculations outside the kernel.
- Packed Data Broadcasting: Packed data can be broadcast to separate variables in a single operation.
- Optional Data Conversion: 8-bit and 16-bit integer input data can be converted to 32-bit floating-point values within [0.0, 1.0] or [-1.0, 1.0] ranges.
Utilizing texture and surface memory effectively depends on the application’s memory access patterns and data characteristics. For workloads exhibiting spatial locality, texture and surface memory can be powerful tools in CUDA programming.
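As a sketch of the texture path (names are illustrative, and the code assumes a pitched float array already resident in device memory, e.g. from cudaMallocPitch()), a texture object is created over the array and then sampled in a kernel where neighboring threads fetch neighboring texels:

#include <cuda_runtime.h>

__global__ void blurRow(cudaTextureObject_t tex, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Neighboring threads fetch neighboring texels: good 2D spatial locality.
        float v = tex2D<float>(tex, x - 1, y) + tex2D<float>(tex, x, y)
                + tex2D<float>(tex, x + 1, y);
        out[y * width + x] = v / 3.0f;
    }
}

cudaTextureObject_t makeTexture(float* dData, size_t pitch, int width, int height)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dData;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width = width;
    resDesc.res.pitch2D.height = height;
    resDesc.res.pitch2D.pitchInBytes = pitch;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

The optional conversion to normalized floating-point values mentioned above applies to 8-bit and 16-bit integer channels and is selected with the cudaReadModeNormalizedFloat read mode.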
Figure 1: Memory Hierarchy
Figure 2: Thread execution flow
Figure 3: Memory Transactions