Vector addition – CUDA

In this post, I will show you how to write a vector addition program using CUDA. The code is listed below:


// Includes
#include <stdio.h>
// CUDA includes
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <cuda_runtime_api.h>

#define N 10               //Size of the array

//Kernel function
__global__ void add(float* a, float* b, float* c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // Unique thread id
    if (tid < N)
    {
        c[tid] = a[tid] + b[tid];
    }
}

int main()
{
    //Initialising inputs
    float* a;
    float* b;
    float* c;
    float* dev_a;
    float* dev_b;
    float* dev_c;

    //CUDA event timers
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float time;

    //Allocating memory on the host
    a = (float*)malloc(N*sizeof(float));
    b = (float*)malloc(N*sizeof(float));
    c = (float*)malloc(N*sizeof(float));

    for (int i = 0; i < N; ++i)
        {
            a[i] = (float)i;
            b[i] = (float)i;
            c[i] = 0.0;
        }

    //Allocating memory on the device
    cutilSafeCall(cudaMalloc( (void**)&dev_a, N*sizeof(float) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_b, N*sizeof(float) ));
    cutilSafeCall(cudaMalloc( (void**)&dev_c, N*sizeof(float) ));

    //Copying data from host to device
    cutilSafeCall(cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(dev_c, c, N*sizeof(float), cudaMemcpyHostToDevice));

    //Starting CUDA timer
    cudaEventRecord(start, 0);

    //Launching kernel
    add<<<N, 1>>>(dev_a, dev_b, dev_c);
    cudaThreadSynchronize();

    //Stopping CUDA timer
    cudaEventRecord(stop, 0);

    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Time taken by kernel: %f\n", time);

    //Copying data back to host
    cutilSafeCall(cudaMemcpy(c, dev_c, N*sizeof(float), cudaMemcpyDeviceToHost));
    for(int i = 0; i < N; ++i)
        {
            printf("c[%d] = %f\n",i,c[i]);
        }

    //Freeing memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    free(a);
    free(b);
    free(c);

    return 0;
}

Let me describe the code in detail.

The includes at the top pull in the standard C I/O header and the CUDA headers; cutil_inline.h comes from the NVIDIA GPU Computing SDK and provides the cutilSafeCall helper used below.

The #define N 10 sets the size of the arrays. A size of 10 is far too small to gain anything from GPU vector addition, but for experimental purposes it is fine.

Next comes the kernel function add. tid is a unique thread id computed from the block and thread indices; the if (tid < N) guard ensures that no thread accesses memory outside the arrays. With the launch configuration used below (N blocks of one thread each), blockDim.x is 1, so tid simply equals blockIdx.x.

The main function starts by declaring the host pointers (a, b, c) and the corresponding device pointers (dev_a, dev_b, dev_c).

The CUDA event timers start and stop are then created; they are used to measure the time spent on the GPU. CPU timers might not have enough precision to measure the short times taken by a kernel on the GPU.

Memory is allocated on the host with malloc, and the inputs are initialized in a simple loop.

Memory is allocated on the device using cudaMalloc. The cutilSafeCall wrapper makes sure each call executes successfully; if a call fails, it reports the error together with the file and line number where it occurred. Wrapping CUDA API calls like this is good practice and helps avoid hard-to-find bugs.
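Note that cutil_inline.h and cutilSafeCall come from the GPU Computing SDK rather than the CUDA toolkit itself. If you prefer not to depend on the SDK, a minimal sketch of an equivalent check (a macro of my own, not an NVIDIA API) could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Minimal replacement for cutilSafeCall: print the error with its
// location and abort if a CUDA runtime call fails.
#define CUDA_SAFE_CALL(call)                                          \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error at %s:%d: %s\n",                      \
                   __FILE__, __LINE__, cudaGetErrorString(err));      \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_SAFE_CALL(cudaMalloc((void**)&dev_a, N*sizeof(float)));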

The input data is copied from host to device using cudaMemcpy; the cudaMemcpyHostToDevice flag indicates the direction of the copy.

The CUDA timer is started by recording the start event.

The CUDA kernel is then launched using the execution configuration syntax <<< >>>. The first argument is the number of blocks and the second is the number of threads per block; here we launch N blocks of one thread each. I will discuss how to choose these numbers in future posts.
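With only one thread per block, as above, most of the GPU sits idle. For larger arrays you would normally use many threads per block; here is a sketch of a common configuration (256 threads per block is my own choice, not something prescribed by CUDA):

// Enough blocks of 256 threads each to cover all N elements.
// The if (tid < N) guard in the kernel handles the final partial block.
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
add<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_c);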

cudaThreadSynchronize acts as a barrier: it blocks the host until all threads of the kernel have finished, i.e. until the kernel has completed.
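cudaThreadSynchronize has been deprecated in later toolkits; from CUDA 4.0 onwards the equivalent call is cudaDeviceSynchronize, and checking its return value is an easy way to catch kernel failures. A minimal sketch:

// Wait for the kernel to finish and report any error it produced.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(err));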

The stop event is then recorded, and cudaEventElapsedTime retrieves the elapsed time (in milliseconds) between the two events.

The results are copied back from device to host. Note the cudaMemcpyDeviceToHost flag.

Finally, the device memory is released with cudaFree and the host memory with free.

Makefile:
Below is a general Makefile for compiling CUDA code against the NVIDIA SDK build system. I will discuss the flags used here in more detail in future posts.

# Add the root directory for the NVidia SDK installation
ROOTDIR := [Path to NVIDIA_CUDA SDK]/C/src
# Keep the executable here
ROOTBINDIR := bin

# Add source files here
EXECUTABLE := vectoradd
# Cuda source files (compiled with cudacc)
CUFILES_sm_20 := vectoradd.cu
# CUDA Dependencies
CU_DEPS :=  \
# C/C++ source files (compiled with gcc / c++)
CCFILES := \

# Do not link with CUTIL
OMIT_CUTIL_LIB := 1

# Additional libraries needed by the project -po maxrregcount=15
USECUFFT := 1
CFLAGS = -pg -lc -fPIC -Wall -litpp -lblas -llapack
CUDACCFLAGS := --use_fast_math --ptxas-options=-v
#############################################################
# Rules and targets

include $(ROOTDIR)/../common/common.mk

Then type make in the terminal. Output:

ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_30'
ptxas info : Function properties for _Z3addPfS_S_
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 332 bytes cmem[0]
ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_10'
ptxas info : Used 4 registers, 12+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_Z3addPfS_S_' for 'sm_20'
ptxas info : Function properties for _Z3addPfS_S_
 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 44 bytes cmem[0]

The executable is in the bin directory (bin/darwin/release)

./vectoradd

Time taken by kernel: 0.134816
c[0] = 0.000000
c[1] = 2.000000
c[2] = 4.000000
c[3] = 6.000000
c[4] = 8.000000
c[5] = 10.000000
c[6] = 12.000000
c[7] = 14.000000
c[8] = 16.000000
c[9] = 18.000000

CUDA vs OpenCL

CUDA and OpenCL are two major programming frameworks for GPU computing. I briefly introduced them in one of the previous posts. Now, if you want to learn GPU computing, which one should you choose – CUDA or OpenCL?

Until recently, CUDA attracted most of the attention from developers, especially in the High Performance Computing realm, because of the good support from NVIDIA itself, particularly through its forums. But OpenCL is gaining ground rapidly, and its software has now reached the point where GPU programmers are taking a second look.

CUDA and OpenCL do mostly the same thing – it's like Italians and the French arguing over who has the more beautiful language, when both are Romance languages.

nVidia's CUDA is vendor-specific. It has better tools, better performance, and there is a lot of sample code, documentation and utilities available. If you have an actual GPU project to deliver in the short term and you can be certain that you only need to support high-end nVidia hardware, then CUDA is the way to go. OpenCL, on the other hand, provides an open, industry-standard framework. As such, it has garnered support from nearly all processor manufacturers, including AMD, Intel, and nVidia, as well as others that serve the mobile and embedded computing markets. As a result, applications developed in OpenCL are portable across a variety of GPUs and CPUs. Being an open standard, OpenCL allows any vendor to implement support for it on its products; Intel has announced that it will support OpenCL on future CPU products.

Ok, now you have two frameworks – which one to choose? Well, it depends on a lot of factors. If you are planning to implement a GPU project solely on nVidia’s cards, then CUDA is a better option. But if your application is to be deployed over a range of architectures then you need to work with OpenCL.

To start off with, though, I personally prefer CUDA because of the detailed documentation that nVidia has provided and the vast community support. You can post a question on the nVidia forums (which are offline at the moment due to security issues) and get clarifications from experts, and there is also Stack Overflow. The basic ideas behind CUDA and OpenCL are the same, so the skills and knowledge you develop while working with CUDA will mostly transfer to OpenCL later if needed. There are also tools such as Swan that convert CUDA code into OpenCL code. So, essentially, if you learn one, you can easily work with the other. A good comparison of CUDA and OpenCL is shown here and here. You can also look in the references for more information.

Concluding,

CUDA

  • Better marketing
  • Good support and documentation
  • Many features and toolsets
  • Works only on nVidia cards

OpenCL

  • Supports many architectures
  • It's an open standard – which we always want
  • No proper documentation
  • Provided by different vendors in various packages – no universal package

Recently, OpenCL has been gaining ground on CUDA – this might be why nVidia released its source code to developers and also stopped bundling OpenCL support in newer releases of CUDA. That indicates there is stiff competition going on, and I personally feel it is only a matter of time before OpenCL reaches the level of CUDA.

References:

  1. http://www.streamcomputing.eu/blog/2011-06-22/opencl-vs-cuda-misconceptions/
  2. http://www.hpcwire.com/hpcwire/2012-02-28/opencl_gains_ground_on_cuda.html
  3. http://wiki.tiker.net/CudaVsOpenCL
  4. http://blog.accelereyes.com/blog/2012/02/17/opencl_vs_cuda_webinar_recap/

Installing CUDA on Mac OS

CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

In this post, I will tell you how to get started with CUDA on Mac OS. To use CUDA on your system, you will need the following installed:

  1. CUDA-enabled GPU. A list of such GPUs is available here
  2. Mac OS X v. 10.5.6 or later (10.6.3 or later for 64-bit CUDA applications)
  3. The gcc compiler and toolchain installed using Xcode
  4. CUDA software (available at no cost from http://developer.nvidia.com/cuda/cuda-downloads)

Once you have verified that you have a supported NVIDIA GPU and a supported version of Mac OS X, you need to download the CUDA software. Download the following packages for the latest version of the Development Tools from the site above:

  1. CUDA Driver
  2. CUDA Toolkit
  3. GPU Computing SDK

Installation:

  1. Install the CUDA Driver
    Install the CUDA driver package by executing the installer and following the on-screen prompts. This will install /Library/Frameworks/CUDA.framework and the UNIX-compatibility stub /usr/local/cuda/lib/libcuda.dylib that refers to it.
  2. Install the CUDA Toolkit
    Install the CUDA Toolkit by executing the Toolkit installer package and following the on-screen prompts. The CUDA Toolkit supplements the CUDA Driver with compilers and additional libraries and header files that are installed into /usr/local/cuda by default
  3. Define the environment variables
    – The PATH variable needs to include /usr/local/cuda/bin
    – DYLD_LIBRARY_PATH needs to contain /usr/local/cuda/lib
    The typical way to place these values in your environment is with the following commands:
    export PATH=/usr/local/cuda/bin:$PATH
    export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
    To make these settings permanent, place them in ~/.bash_profile
  4. Install CUDA SDK
    The default installation process places the files in /Developer/GPU Computing

To compile the examples, cd into /Developer/GPU Computing/C and type make. The resulting binaries will be installed under the home directory in /Developer/GPU Computing/C/bin/darwin/release

Verify the installation by running ./deviceQuery, the output of which should be something like this

Sample CUDA deviceQuery Program
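If you want a quick sanity check of your own beyond the SDK's deviceQuery, the following short program (a sketch using the standard runtime API; the formatting is my own) prints the basic properties of each detected GPU. Compile it with nvcc and run the resulting binary:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);

    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %lu MB of global memory\n",
               dev, prop.name, prop.major, prop.minor,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }
    return 0;
}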

Now, you are all set to start with CUDA programming!

 

References:

  1. CUDA getting started guide for Mac OS

General Purpose Graphics Processing Unit (GPGPU)

In the previous post I described the GPU, which was originally intended for graphics acceleration. But how do we, the common people, use the parallel computing capability of a GPU? For that, let me first talk about the General Purpose Graphics Processing Unit.

General-purpose computing on graphics processing units (GPGPU, GPGP or, less often, GP²U) means using a graphics processing unit (GPU), which typically handles computations only for computer graphics, to perform computations in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. The general-purpose CPU is still better at some things, though: general programming, random memory access, executing steps in order, everyday work. It's true, though, that CPUs are sprouting cores and looking more and more like GPUs in some respects.

In brief, the catch with parallel processing (using tons of cores to break work up and crunch it all at once) is that applications have to be programmed to take advantage of it. That's not easy, which is why Intel at this point hires more software engineers than hardware engineers. So even if the hardware is there, you still need the software to get there, and it's a whole different kind of programming. Which brings us to OpenCL (Open Computing Language) and CUDA. They're frameworks that make it much easier to use graphics cards for general purpose computing. OpenCL is the "open standard for parallel programming of heterogeneous systems" standardized by the Khronos Group; AMD, Apple, IBM, Intel, Nvidia, Samsung and a bunch of others are involved. In semi-English, it's a cross-platform standard for parallel programming across different kinds of hardware, using both CPU and GPU, that anyone can use for free. CUDA is Nvidia's own architecture for parallel programming on its graphics cards.

Ok…parallel computing is pretty great for scientists and organizations dealing with computationally intensive simulations that run for days. But does it make our stuff go faster? The answer is, of course. Anyone with a good graphics card has a potential supercomputer at their side. Converting, decoding, creating and streaming videos (stuff you're probably doing more now than you did a couple of years ago) have improved dramatically. Say bye-bye to 20-minute renders. Ditto for image editing; there will be less waiting for effects to propagate across giant images (Photoshop CS5, Illustrator, Fireworks, etc. already use GPU acceleration). Many applications such as Mathematica, MATLAB and other modelling software are also making use of GPUs for acceleration.