[AI/System] CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API model developed by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing (GPGPU) tasks, in addition to their role in rendering graphics.
CUDA C/C++
CUDA extends the C/C++ programming languages with specific syntax and constructs for parallel programming, enabling developers to write parallel code that runs on GPUs.
Host: CPU and its memory (host memory)
Device: GPU and its memory (device memory)
PCI Bus: Peripheral Component Interconnect. Standard bus interface used in computers for connecting hardware components such as network cards, graphics cards, and other peripherals to the motherboard.
Device Code: The portion of the program that runs on the GPU. This code is optimized for the architecture and capabilities of the target device.
Host Code: The part of the program that runs on the CPU. This code manages the overall flow of the application: data initialization, memory management, launching kernels (device code) on the GPU, and handling results from the device.
Kernel: A specialized function written in CUDA C/C++ that is designed to be executed on the GPU.
Process Flow
- Copy input data from CPU memory to GPU memory
- Load GPU code and execute it, caching data on chip for performance
- Copy results from GPU memory to CPU memory
Examples
Hello World! with standard C that runs on the host.
NVIDIA compiler (nvcc) can be used to compile programs with no device code.
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
// output:
// $ nvcc hello_world.cu
// $ a.out
// Hello World!
Hello World! with Device Code.
- CUDA C/C++ keyword __global__ indicates a function that:
- runs on the device
- is called from host code
- nvcc separates source code into host and device components
- device functions (e.g. mykernel()) processed by NVIDIA compiler
- host functions (e.g. main()) processed by host compiler. (gcc, cl.exe)
- triple angle brackets (<<< >>>) mark a call from host code to device code
- also called kernel launch
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
A simple kernel to add two integers.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
- add() runs on device, so a, b, and c must point to device memory
- we need to allocate memory on the GPU
Memory Management
- device pointers point to GPU memory.
- may be passed to/from host code
- may not be dereferenced in host code
- host pointers point to CPU memory
- may be passed to/from device code
- may not be dereferenced in device code
- simple CUDA API for handling device memory
- cudaMalloc(), cudaFree(), cudaMemcpy()
- similar to malloc(), free(), memcpy()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
// allocate space for device copies of a, b, c
cudaMalloc((void**)&d_a, size); // sets d_a to address of the allocated memory block on GPU
cudaMalloc((void**)&d_b, size);
cudaMalloc((void**)&d_c, size);
a = 2;
b = 7;
// copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
// copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
// cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Running in Parallel
Instead of executing add() once, execute N times in parallel
add<<< N, 1 >>>();
With add() running in parallel, we can do vector addition. Each invocation can refer to its block index, blockIdx.x.
Block: each parallel invocation of add() is referred to as a block. The set of blocks is referred to as a grid.
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
#include <stdlib.h>

void random_ints(int *x, int n);   // helper assumed to fill an array with random ints

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// allocate space for device copies of a, b, c
cudaMalloc((void**)&d_a, size); // sets d_a to address of the allocated memory block on GPU
cudaMalloc((void**)&d_b, size);
cudaMalloc((void**)&d_c, size);
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
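random_ints() is not a standard C or CUDA function; the example assumes a helper that fills an array with random values. A minimal version (an assumption, just to make the example self-contained) could be defined alongside main():

// fill an array of n ints with pseudo-random values
void random_ints(int *x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = rand();
}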
Threads and Blocks
CUDA Threads
- a block can be split into parallel threads
- threadIdx.x instead of blockIdx.x
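For example, the vector addition above can use N threads in a single block instead of N blocks; a sketch of the same add() kernel indexed by threadIdx.x, reusing d_a, d_b, d_c and N from the example above:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// launch add() kernel on GPU with 1 block of N threads
add<<<1,N>>>(d_a, d_b, d_c);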
Combining Threads and Blocks
- if M threads per block, a unique index for each thread is given by:
- int index = threadIdx.x + blockIdx.x * M;
- using the built-in variable blockDim.x (see the kernel sketch after this list):
- int index = threadIdx.x + blockIdx.x * blockDim.x;
- why bother with threads?
- unlike parallel blocks, threads have mechanisms to efficiently:
- communicate
- synchronize
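A sketch of add() using the combined index, with N and THREADS_PER_BLOCK chosen only for illustration:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// launch with enough blocks for N threads in total
add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);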
Cooperating Threads
Sharing Data Between Threads
- within a block, threads share data via shared memory.
- extremely fast on-chip memory
- allocated per block
- declare using __shared__
- void __syncthreads()
- synchronizes all threads within a block
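As a minimal illustration (reverse_block and BLOCK_SIZE are made-up names, not from the original notes), a kernel in which each block reverses its own BLOCK_SIZE-element chunk using shared memory:

#define BLOCK_SIZE 64

__global__ void reverse_block(int *d) {
    __shared__ int s[BLOCK_SIZE];                    // on-chip, allocated per block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[t] = d[i];                                     // stage this block's chunk in shared memory
    __syncthreads();                                 // wait until every thread in the block has written
    d[i] = s[BLOCK_SIZE - 1 - t];                    // read the mirrored element of the chunk
}

// launch: one BLOCK_SIZE-thread block per chunk of n elements
// reverse_block<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_array);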
Managing the Device
Coordinating Host & Device
- Kernel launches are asynchronous
- control returns to CPU immediately
- CPU needs to synchronize before consuming the results
- cudaMemcpy()
- copy begins when all preceding CUDA calls have completed
- blocks the CPU until the copy is complete
- cudaMemcpyAsync()
- asynchronous. does not block CPU
- cudaDeviceSynchronize()
- blocks CPU until all preceding CUDA calls have completed
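A typical ordering pattern (sketch, reusing names from the vector-add example above):

add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);   // returns immediately
cudaDeviceSynchronize();                                          // block host until the kernel has finished
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);                 // blocking copy; would also enforce the ordering by itself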
Reporting Errors
- All CUDA API calls return an error code (cudaError_t)
- error in API call itself or,
- error in earlier asynchronous operation
- cudaError_t cudaGetLastError(void)
- get the error code for the last error
- const char *cudaGetErrorString(cudaError_t)
- get string to describe the error
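A common pattern wraps API calls in a checking macro and queries the last error after a kernel launch (sketch; the CHECK name is made up):

#include <stdio.h>
#include <stdlib.h>

// abort with a readable message on any CUDA error
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error: %s\n", cudaGetErrorString(err));  \
            exit(1);                                              \
        }                                                         \
    } while (0)

CHECK(cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice));
add<<<N,1>>>(d_a, d_b, d_c);
CHECK(cudaGetLastError());   // catches errors from the asynchronous launch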
Device Management
- Apps can query and select GPUs
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
- Multiple host threads can share a device
- A single host thread can manage multiple devices
- cudaSetDevice(i) to select current device
- cudaMemcpy(...) for peer-to-peer copies
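For example (sketch), enumerating the available devices and selecting one:

int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s, compute capability %d.%d\n",
           i, prop.name, prop.major, prop.minor);
}
cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls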
Compute Capability
The compute capability of a device describes its architecture: the number of registers, the sizes of its memories, and its supported features and capabilities.
IDs and Dimensions
- kernel is launched as a grid of blocks of threads
- blockIdx and threadIdx are 3D
- we showed only one dimension (x)
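A 2D launch sketch (the scale kernel and the names w, h, d_img are illustrative, not from the original notes):

// one thread per pixel of a w x h image
__global__ void scale(float *img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < w && y < h)
        img[y * w + x] *= 2.0f;
}

dim3 block(16, 16);                          // 16 x 16 threads per block
dim3 grid((w + 15) / 16, (h + 15) / 16);     // enough blocks to cover the image
scale<<<grid, block>>>(d_img, w, h);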
More
- CUDA Programming Guide