In this post, I will show you how to write a vector addition code using CUDA . The code is listed below:
// Includes #include <stdio.h>; // CUDA includes #include <cuda_runtime.h>; #include <cutil_inline.h>; #include <cuda_runtime_api.h>; #define N 10 //Size of the array //Kernel function __global__ void add (float* a, float* b, float* c) { int tid = threadIdx.x + blockIdx.x * blockDim.x; // A thread id if (tid < N) { c[tid] = a[tid] + b[tid]; } } int main() { //Initialising inputs float* a; float* b; float* c; float* dev_a; float* dev_b; float* dev_c; //CUDA event timers cudaEvent_t start, stop; cudaEventCreate(&start); cudaEventCreate(&stop); float time; //Allocating memory on the host a = (float*)malloc(N*sizeof(float)); b = (float*)malloc(N*sizeof(float)); c = (float*)malloc(N*sizeof(float)); for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = (float)i; c[i] = 0.0; } //Allocating memory on the device cutilSafeCall(cudaMalloc( (void**)&dev_a, N*sizeof(float) )); cutilSafeCall(cudaMalloc( (void**)&dev_b, N*sizeof(float) )); cutilSafeCall(cudaMalloc( (void**)&dev_c, N*sizeof(float) )); //Copying data from host to device cutilSafeCall(cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice)); cutilSafeCall(cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice)); cutilSafeCall(cudaMemcpy(dev_c, c, N*sizeof(float), cudaMemcpyHostToDevice)); //Starting CUDA timer cudaEventRecord(start, 0); //Launching kernel add<<<N,1 >>>(dev_a, dev_b, dev_c); cudaThreadSynchronize(); //Stopping CUDA timer cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); cudaEventDestroy(start); cudaEventDestroy(stop); printf("Time taken by kernel: %f\n", time); //Copying data back to host cutilSafeCall(cudaMemcpy(c, dev_c, N*sizeof(float), cudaMemcpyDeviceToHost)); for(int i = 0; i < N; ++i) { printf("c[%d] = %f\n",i,c[i]); } //Freeing memory cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c); free(a); free(b); free(c); return 0; }
Let me describe the code in detail.
Lines 1-6 includes the necessary header files.
Line 8 defines the size of the array. Well size 10 is too small a number for GPU vector addition. But for experimental purposes, this should be fine.
In lines 11-18, kernel function is defined. tid is a unique thread id.
Main function starts at line 20. In lines 23-28, input variables are defined.
In 31-33, CUDA event timers are defined which are defined to calculate the time taken on GPU. CPU timers might not have enough precision to measure the low times taken by the kernel on GPU.
In lines 37-39, memory is allocated on the host.In 41-46, inputs are initialized.
In 49-51, memory is allocated on device using cudaMalloc. cutilSafeCall makes sure that the commands are properly executed. If there’s any error in executing the command, cutilSafeCall returns an error at that line number. It’s a good practice to do this, to avoid bugs.
In lines 54-56, data is copied from host to device. This is done using cudaMemcpy. cudaMemcpyHostToDevice means the copy is from host to device.
In line 59, CUDA timer is started.
In line 62, CUDA kernel is called. It’s done using execution configuration syntax <<< >>>. The first argument inside it represents the number of blocks, the second argument being the number of threads per block. More details on these numbers, I will discuss in future posts.
cudaThreadSynchronize in line 63 is sort of a barrier synchronization which makes sure that all the threads have reached a certain point, in this case the end of kernel.
In line 66, we stop the CUDA timer.
In line 76, results are copied back to host from device. Note the cudaMemcpyDeviceToHost flag.
In 83-89, we free up the memory.
Makefile:
I am giving a general Makefile for compiling a CUDA code. Further details regarding the flags used, I will discuss in future posts.
# Add the root directory for the NVidia SDK installation ROOTDIR := [Path to NVIDIA_CUDA SDK]/C/src # Keep the executable here ROOTBINDIR := bin # Add source files here EXECUTABLE := vectoradd # Cuda source files (compiled with cudacc) CUFILES_sm_20 := vectoradd.cu # CUDA Dependencies CU_DEPS := \ # C/C++ source files (compiled with gcc / c++) CCFILES := \ # Do not link with CUTIL OMIT_CUTIL_LIB := 1 # Additional libraries needed by the project -po maxrregcount=15 USECUFFT := 1 CFLAGS = -pg -lc -fPIC -Wall -litpp -lblas -llapack CUDACCFLAGS := --use_fast_math --ptxas-options=-v ############################################################# # Rules and targets include $(ROOTDIR)/../common/common.mk
Then type make in the terminal. Output :
ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_30' ptxas info : Function properties for _Z3addPfS_S_ 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 4 registers, 332 bytes cmem[0] ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_10' ptxas info : Used 4 registers, 12+16 bytes smem, 4 bytes cmem[1] ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_20' ptxas info : Function properties for _Z3addPfS_S_ 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 4 registers, 44 bytes cmem[0]
The executable is in the bin directory (bin/darwin/release)
./vectoradd Time taken by kernel: 0.134816 c[0] = 0.000000 c[1] = 2.000000 c[2] = 4.000000 c[3] = 6.000000 c[4] = 8.000000 c[5] = 10.000000 c[6] = 12.000000 c[7] = 14.000000 c[8] = 16.000000 c[9] = 18.000000