Vector addition – CUDA

In this post, I will show you how to write a vector addition code using CUDA . The code is listed below:

// Includes
#include  <stdio.h>;
// CUDA includes
#include <cuda_runtime.h>;
#include <cutil_inline.h>;
#include <cuda_runtime_api.h>;

#define N 10               //Size of the array

//Kernel function
__global__ void add (float* a, float* b, float* c)
int tid = threadIdx.x + blockIdx.x * blockDim.x;     // A thread id
if (tid < N)
        c[tid] = a[tid] + b[tid];

int main()
    //Initialising inputs
    float* a;
    float* b;
    float* c;
    float* dev_a;
    float* dev_b;
    float* dev_c;

    //CUDA event timers
    cudaEvent_t start, stop;
    float time;

    //Allocating memory on the host
    a = (float*)malloc(N*sizeof(float));
    b = (float*)malloc(N*sizeof(float));
    c = (float*)malloc(N*sizeof(float));

    for (int i = 0; i < N; ++i)
            a[i] = (float)i;
            b[i] = (float)i;
            c[i] = 0.0;

    //Allocating memory on the device
    cutilSafeCall(cudaMalloc( (void**)&dev_a, N*sizeof(float) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_b, N*sizeof(float) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_c, N*sizeof(float) ));

    //Copying data from host to device
    cutilSafeCall(cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_c, c, N*sizeof(float), cudaMemcpyHostToDevice));

    //Starting CUDA timer
    cudaEventRecord(start, 0);

    //Launching kernel
    add<<<N,1 >>>(dev_a, dev_b, dev_c);

    //Stopping CUDA timer
    cudaEventRecord(stop, 0);

    cudaEventElapsedTime(&time, start, stop);

    printf("Time taken by kernel: %f\n", time);

    //Copying data back to host
    cutilSafeCall(cudaMemcpy(c, dev_c, N*sizeof(float), cudaMemcpyDeviceToHost));
    for(int i = 0; i < N; ++i)
            printf("c[%d] = %f\n",i,c[i]);

    //Freeing memory


    return 0;

Let me describe the code in detail.

Lines 1-6 includes the necessary header files.

Line 8 defines the size of the array. Well size 10 is too small a number for GPU vector addition. But for experimental purposes, this should be fine.

In lines 11-18, kernel function is defined. tid is a unique thread id.

Main function starts at line 20. In lines 23-28, input variables are defined.

In 31-33, CUDA event timers are defined which are defined to calculate the time taken on GPU. CPU timers might not have enough precision to measure the low times taken by the kernel on GPU.

In lines 37-39, memory is allocated on the host.In 41-46, inputs are initialized.

In 49-51, memory is allocated on device using cudaMalloccutilSafeCall makes sure that the commands are properly executed. If there’s any error in executing the command, cutilSafeCall returns an error at that line number. It’s a good practice to do this, to avoid bugs.

In lines 54-56, data is copied from host to device. This is done using cudaMemcpycudaMemcpyHostToDevice means the copy is from host to device.

In line 59, CUDA timer is started.

In line 62, CUDA kernel is called. It’s done using execution configuration syntax <<<  >>>. The first argument inside it represents the number of blocks, the second argument being the number of threads per block. More details on these numbers, I will discuss in future posts.

cudaThreadSynchronize in line 63 is sort of a barrier synchronization which makes sure that all the threads have reached a certain point, in this case the end of kernel.

In line 66, we stop the CUDA timer.

In line 76, results are copied back to host from device. Note the cudaMemcpyDeviceToHost flag.

In 83-89, we free up the memory.

I am giving a general Makefile for compiling a CUDA code. Further details regarding the flags used, I will discuss in future posts.

# Add the root directory for the NVidia SDK installation
# Keep the executable here

# Add source files here
EXECUTABLE := vectoradd
# Cuda source files (compiled with cudacc)
CUFILES_sm_20 :=
# CUDA Dependencies
CU_DEPS :=  \
# C/C++ source files (compiled with gcc / c++)

# Do not link with CUTIL

# Additional libraries needed by the project -po maxrregcount=15
CFLAGS = -pg -lc -fPIC -Wall -litpp -lblas -llapack
CUDACCFLAGS := --use_fast_math --ptxas-options=-v
# Rules and targets

include $(ROOTDIR)/../common/

Then type make in the terminal. Output :

ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_30'
ptxas info : Function properties for _Z3addPfS_S_
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 332 bytes cmem[0]
ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_10'
ptxas info : Used 4 registers, 12+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_20'
ptxas info : Function properties for _Z3addPfS_S_
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 44 bytes cmem[0]

The executable is in the bin directory (bin/darwin/release)


Time taken by kernel: 0.134816
c[0] = 0.000000
c[1] = 2.000000
c[2] = 4.000000
c[3] = 6.000000
c[4] = 8.000000
c[5] = 10.000000
c[6] = 12.000000
c[7] = 14.000000
c[8] = 16.000000
c[9] = 18.000000


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s