[AI/System] CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API model developed by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing (GPGPU) tasks, in addition to their role in rendering graphics.
CUDA C/C++
CUDA extends the C/C++ programming languages with specific syntax and constructs for parallel programming, enabling developers to write parallel code that runs on GPUs.
Host: CPU and its memory (host memory)
Device: GPU and its memory (device memory)
PCI Bus: Peripheral Component Interconnect. Standard bus interface used in computers for connecting hardware components such as network cards, graphics cards, and other peripherals to the motherboard.
Device Code: The portion of the program that runs on the GPU. This code is optimized for the architecture and capabilities of the target device.
Host Code: The part of the program that runs on the CPU. This code manages the overall flow of the application: data initialization, memory management, launching kernels (device code) on the GPU, and handling results from the device.
Kernel: A specialized function written in CUDA C/C++ that is designed to be executed on the GPU.
Process Flow
- Copy input data from CPU memory to GPU memory
- Load GPU code and execute it, caching data on chip for performance
- Copy results from GPU memory to CPU memory
Examples
Hello World! with standard C that runs on the host.
NVIDIA compiler (nvcc) can be used to compile programs with no device code.
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
// output:
// $ nvcc hello_world.cu
// $ a.out
// Hello World!
Hello World! with Device Code.
- CUDA C/C++ keyword __global__ indicates a function that:
- runs on the device
- is called from host code
- nvcc separates source code into host and device components
- device functions (e.g. mykernel()) processed by NVIDIA compiler
- host functions (e.g. main()) processed by host compiler. (gcc, cl.exe)
- triple angle brackets (<<< >>>) mark a call from host code to device code
- also called kernel launch
#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
A simple kernel to add two integers.
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
- add() runs on device, so a, b, and c must point to device memory
- we need to allocate memory on the GPU
Memory Management
- device pointers point to GPU memory.
- may be passed to/from host code
- may not be dereferenced in host code
- host pointers point to CPU memory
- may be passed to/from device code
- may not be dereferenced in device code
- simple CUDA API for handling device memory
- cudaMalloc(), cudaFree(), cudaMemcpy()
- similar to malloc(), free(), memcpy()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
// allocate space for device copies of a, b, c
cudaMalloc((void**)&d_a, size); // sets d_a to address of the allocated memory block on GPU
cudaMalloc((void**)&d_b, size);
cudaMalloc((void**)&d_c, size);
a = 2;
b = 7;
// copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
// copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
// cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
Running in Parallel
Instead of executing add() once, execute N times in parallel
add<<< N, 1 >>>();
With add() running in parallel, we can do vector addition. Each invocation can refer to its block index, blockIdx.x.
Block: each parallel invocation of add() is referred to as a block. The set of blocks is referred to as a grid.
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
#include <stdlib.h>

void random_ints(int *x, int n);   // helper assumed to fill an array with random ints

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// allocate space for device copies of a, b, c
cudaMalloc((void**)&d_a, size); // sets d_a to address of the allocated memory block on GPU
cudaMalloc((void**)&d_b, size);
cudaMalloc((void**)&d_c, size);
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
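random_ints() is not a standard C or CUDA function; the example assumes a helper that fills an array with random values. A minimal version (an assumption, just to make the example self-contained) could be defined alongside main():

// fill an array of n ints with pseudo-random values
void random_ints(int *x, int n) {
    for (int i = 0; i < n; ++i)
        x[i] = rand();
}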
Threads and Blocks
CUDA Threads
- a block can be split into parallel threads
- threadIdx.x instead of blockIdx.x
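For example, the vector addition above can use N threads in a single block instead of N blocks; a sketch of the same add() kernel indexed by threadIdx.x, reusing d_a, d_b, d_c and N from the example above:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// launch add() kernel on GPU with 1 block of N threads
add<<<1,N>>>(d_a, d_b, d_c);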
Combining Threads and Blocks
- if M threads per block, a unique index for each thread is given by:
- int index = threadIdx.x + blockIdx.x * M;
- using the built-in variable blockDim.x (see the kernel sketch after this list):
- int index = threadIdx.x + blockIdx.x * blockDim.x;
- why bother with threads?
- unlike parallel blocks, threads have mechanisms to efficiently:
- communicate
- synchronize
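A sketch of add() using the combined index, with N and THREADS_PER_BLOCK chosen only for illustration:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

// launch with enough blocks for N threads in total
add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);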
Cooperating Threads
Sharing Data Between Threads
- within a block, threads share data via shared memory.
- extremely fast on-chip memory
- allocated per block
- declare using __shared__
- void __syncthreads()
- synchronizes all threads within a block
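As a minimal illustration (reverse_block and BLOCK_SIZE are made-up names, not from the original notes), a kernel in which each block reverses its own BLOCK_SIZE-element chunk using shared memory:

#define BLOCK_SIZE 64

__global__ void reverse_block(int *d) {
    __shared__ int s[BLOCK_SIZE];                    // on-chip, allocated per block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[t] = d[i];                                     // stage this block's chunk in shared memory
    __syncthreads();                                 // wait until every thread in the block has written
    d[i] = s[BLOCK_SIZE - 1 - t];                    // read the mirrored element of the chunk
}

// launch: one BLOCK_SIZE-thread block per chunk of n elements
// reverse_block<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_array);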
Managing the Device
Coordinating Host & Device
- Kernel launches are asynchronous
- control returns to CPU immediately
- CPU needs to synchronize before consuming the results
- cudaMemcpy()
- copy begins when all preceding CUDA calls have completed
- blocks the CPU until the copy is complete
- cudaMemcpyAsync()
- asynchronous. does not block CPU
- cudaDeviceSynchronize()
- blocks CPU until all preceding CUDA calls have completed
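A typical ordering pattern (sketch, reusing names from the vector-add example above):

add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);   // returns immediately
cudaDeviceSynchronize();                                          // block host until the kernel has finished
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);                 // blocking copy; would also enforce the ordering by itself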
Reporting Errors
- All CUDA API calls return an error code (cudaError_t)
- error in API call itself or,
- error in earlier asynchronous operation
- cudaError_t cudaGetLastError(void)
- get the error code for the last error
- const char *cudaGetErrorString(cudaError_t)
- get string to describe the error
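A common pattern wraps API calls in a checking macro and queries the last error after a kernel launch (sketch; the CHECK name is made up):

#include <stdio.h>
#include <stdlib.h>

// abort with a readable message on any CUDA error
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error: %s\n", cudaGetErrorString(err));  \
            exit(1);                                              \
        }                                                         \
    } while (0)

CHECK(cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice));
add<<<N,1>>>(d_a, d_b, d_c);
CHECK(cudaGetLastError());   // catches errors from the asynchronous launch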
Device Management
- Apps can query and select GPUs
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- cudaGetDevice(int *device)
- cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
- Multiple host threads can share a device
- A single host thread can manage multiple devices
- cudaSetDevice(i) to select current device
- cudaMemcpy(...) for peer-to-peer copies
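For example (sketch), enumerating the available devices and selecting one:

int count;
cudaGetDeviceCount(&count);
for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s, compute capability %d.%d\n",
           i, prop.name, prop.major, prop.minor);
}
cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls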
Compute Capability
The compute capability of a device describes its architecture: the number of registers, the sizes of its memories, and its supported features and capabilities.
IDs and Dimensions
- kernel is launched as a grid of blocks of threads
- blockIdx and threadIdx are 3D
- we showed only one dimension (x)
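A 2D launch sketch (the scale kernel and the names w, h, d_img are illustrative, not from the original notes):

// one thread per pixel of a w x h image
__global__ void scale(float *img, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < w && y < h)
        img[y * w + x] *= 2.0f;
}

dim3 block(16, 16);                          // 16 x 16 threads per block
dim3 grid((w + 15) / 16, (h + 15) / 16);     // enough blocks to cover the image
scale<<<grid, block>>>(d_img, w, h);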
More
- CUDA Programming Guide