Vector Addition – OpenCL

In this post, I will show you how to write vector addition code using OpenCL. The code is listed below:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

#define DATA_SIZE 10

using namespace std;

const char *ProgramSource =
"__kernel void add(__global float *inputA, __global float *inputB, __global float *output)\n"
"{\n"
"  size_t id = get_global_id(0);\n"
"  output[id] = inputA[id] + inputB[id];\n"
"}\n";

int main(void)
{
    cl_context context;
    cl_context_properties properties[3];
    cl_kernel kernel;
    cl_command_queue command_queue;
    cl_program program;
    cl_int err;
    cl_uint num_of_platforms = 0;
    cl_platform_id platform_id;
    cl_device_id device_id;
    cl_uint num_of_devices = 0;
    cl_mem inputA, inputB, output;

    // global work size: one work-item per vector element
    size_t global = DATA_SIZE;

    float inputDataA[DATA_SIZE] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float inputDataB[DATA_SIZE] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float results[DATA_SIZE] = {0};

    int i;

    // retrieve a list of available platforms
    if (clGetPlatformIDs(1, &platform_id, &num_of_platforms) != CL_SUCCESS)
    {
        printf("Unable to get platform_id\n");
        return 1;
    }

    // try to get a supported GPU device
    if (clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &num_of_devices) != CL_SUCCESS)
    {
        printf("Unable to get device_id\n");
        return 1;
    }

    // context properties list - must be terminated with 0
    properties[0] = CL_CONTEXT_PLATFORM;
    properties[1] = (cl_context_properties) platform_id;
    properties[2] = 0;

    // create a context with the GPU device
    context = clCreateContext(properties, 1, &device_id, NULL, NULL, &err);

    // create command queue using the context and device
    command_queue = clCreateCommandQueue(context, device_id, 0, &err);

    // create a program from the kernel source code
    program = clCreateProgramWithSource(context, 1, (const char **) &ProgramSource, NULL, &err);

    // compile the program
    if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS)
    {
        printf("Error building program\n");
        return 1;
    }

    // specify which kernel from the program to execute
    kernel = clCreateKernel(program, "add", &err);

    // create buffers for the input and output
    inputA = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);
    inputB = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * DATA_SIZE, NULL, NULL);

    // load data into the input buffers
    clEnqueueWriteBuffer(command_queue, inputA, CL_TRUE, 0, sizeof(float) * DATA_SIZE, inputDataA, 0, NULL, NULL);
    clEnqueueWriteBuffer(command_queue, inputB, CL_TRUE, 0, sizeof(float) * DATA_SIZE, inputDataB, 0, NULL, NULL);

    // set the argument list for the kernel command
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &inputB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);

    // enqueue the kernel command for execution
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    // copy the results out of the output buffer (blocking read)
    clEnqueueReadBuffer(command_queue, output, CL_TRUE, 0, sizeof(float) * DATA_SIZE, results, 0, NULL, NULL);

    // print the results
    printf("output: ");

    for (i = 0; i < DATA_SIZE; i++)
    {
        printf("%f ", results[i]);
    }
    printf("\n");

    // cleanup - release OpenCL resources
    clReleaseMemObject(inputA);
    clReleaseMemObject(inputB);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(command_queue);
    clReleaseContext(context);

    return 0;
}


To compile the code on a Mac, open terminal and type

g++ -o add add.c -framework OpenCL

The output is:

output: 2.000000 4.000000 6.000000 8.000000 10.000000 12.000000 14.000000 16.000000 18.000000 20.000000
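A quick way to sanity-check the device output is to compute the same sum on the host and compare. The sketch below is plain C with no OpenCL calls; the function names add_reference and results_match are hypothetical helpers, not part of the OpenCL API, and the loop index simply stands in for get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference: the same element-wise sum the "add" kernel performs.
   The loop index plays the role of get_global_id(0). */
void add_reference(const float *a, const float *b, float *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t id = 0; id < n; id++)
        out[id] = a[id] + b[id];
}

/* Returns 1 if every device result is within tol of the reference, else 0. */
int results_match(const float *device, const float *reference, size_t n, float tol)
{
    for (size_t i = 0; i < n; i++) {
        float diff = device[i] - reference[i];
        if (diff < 0)
            diff = -diff;
        if (diff > tol)
            return 0;
    }
    return 1;
}
```

In the listing above, one could call add_reference(inputDataA, inputDataB, expected, DATA_SIZE) into a second host array (here called expected, an assumed name) and then pass results and expected to results_match after the read-back.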

Take a look at the clGetDeviceIDs( ) call: CL_DEVICE_TYPE_GPU is used to select a GPU device. Other alternatives for this flag include CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_ACCELERATOR, and CL_DEVICE_TYPE_ALL. Refer to the OpenCL documentation for more details.


OpenCL code structure

OpenCL is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software.

The Khronos consortium that manages the OpenCL standard has developed an applications programming interface (API) that is general enough to run on significantly different architectures while being adaptable enough that each hardware platform can still obtain high performance. The OpenCL API is a C API, with a C++ wrapper API that is defined in terms of the C API. There are third-party bindings for many languages, including Java, Python, and .NET. The code that executes on an OpenCL device, which in general is not the same device as the host CPU, is written in the OpenCL C language. OpenCL C is a restricted version of the C99 language with extensions appropriate for executing data-parallel code on a variety of heterogeneous devices.

Let’s get started with the OpenCL program structure. I will cover the analogy between CUDA and OpenCL commands later, so that it is easy to learn CUDA and OpenCL side by side. In general, writing OpenCL code can be broken down into the following steps:

  1. Discover and initialize the platforms
  2. Discover and initialize the devices
  3. Create a context
  4. Create a command queue
  5. Create device buffers
  6. Write host data to device buffers
  7. Create and compile the program
  8. Create the kernel
  9. Set the kernel arguments
  10. Configure the work-item structure
  11. Enqueue the kernel for execution
  12. Read the output buffer back to the host
  13. Release OpenCL resources

Discover and initialize the platforms

In the OpenCL platform model, there is a single host that coordinates execution on one or more devices. The API function clGetPlatformIDs( ) is used to discover the set of available platforms for a given system.

Discover and initialize the devices

clGetDeviceIDs( ) is used to discover the devices. clGetDeviceInfo( ) is called to retrieve information such as name, type, and vendor from each device.

Create a context

A context is an abstract container that exists on the host. A context coordinates the mechanisms for host–device interaction, manages the memory objects that are available to the devices, and keeps track of the programs and kernels that are created for each device. The API function to create a context is clCreateContext( ).

Create a command queue

Communication with a device occurs by submitting commands to a command queue. The command queue is the mechanism that the host uses to request action by the device. The API clCreateCommandQueue( ) is used to create a command queue and associate it with a device.

Create device buffers

In order for data to be transferred to a device, it must first be encapsulated as a memory object. The API function clCreateBuffer( ) allocates the buffer and returns a memory object.

Write host data to device buffers

Data contained in host memory is transferred to and from an OpenCL buffer using the commands clEnqueueWriteBuffer( ) and clEnqueueReadBuffer( ), respectively.

Create and compile the program

OpenCL C code is called a program. A program is a collection of functions called kernels, where kernels are units of execution that can be scheduled to run on a device.

The process of creating a program is as follows:

  • The OpenCL C source code is stored in a character string. If the source code is stored in a file on a disk, it must be read into memory and stored as a character array.
  • The source code is turned into a program object, cl_program, by calling clCreateProgramWithSource( ).
  • The program object is then compiled, for one or more OpenCL devices, with clBuildProgram( ).
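When the kernel lives in its own file rather than in a string literal, the first step above amounts to reading the file into a NUL-terminated character array. The following is a minimal sketch using only the C standard library; read_kernel_source is a hypothetical helper name, not an OpenCL function.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file into a newly allocated, NUL-terminated string.
   Returns NULL on any failure; the caller must free() the result.
   If length_out is not NULL, it receives the number of bytes read. */
char *read_kernel_source(const char *path, size_t *length_out)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *source = (char *)malloc((size_t)size + 1);
    if (source == NULL) { fclose(fp); return NULL; }

    size_t got = fread(source, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(source); return NULL; }

    source[size] = '\0';
    if (length_out != NULL)
        *length_out = (size_t)size;
    return source;
}
```

The returned pointer can then be handed to clCreateProgramWithSource( ) in place of ProgramSource, in the same way the listing passes its string literal, and freed once the program object has been created.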

Create the kernel

The final step in obtaining a cl_kernel object that can be used to execute kernels on a device is to extract the kernel from the cl_program. Extracting a kernel from a program is similar to obtaining an exported function from a dynamic library. The name of the kernel that the program exports is used to request it from the compiled program object. The name of the kernel is passed to clCreateKernel( ), along with the program object, and the kernel object will be returned if the program object was valid and the particular kernel is found.

Set the kernel arguments

Each kernel argument is set individually using the function clSetKernelArg( ).

Configure the work-item structure

Define an index space (global work size) of work items for execution.
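In the vector addition listing, the global work size is simply the number of elements and the local work size is left as NULL, so the runtime picks the work-group size. If you choose an explicit work-group size instead, OpenCL 1.x expects the global size to be a multiple of it, and a common approach is to round the element count up. The helper below is a sketch of that arithmetic; round_up_global_size is a hypothetical name, and it assumes the kernel is given the real element count and skips work-items whose id is past it (the add kernel in this post has no such check).

```c
#include <assert.h>
#include <stddef.h>

/* Rounds num_elements up to the next multiple of local_size, so the
   global work size divides evenly into work-groups of local_size. */
size_t round_up_global_size(size_t num_elements, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = num_elements % local_size;
    if (remainder == 0)
        return num_elements;
    return num_elements + (local_size - remainder);
}
```

For example, with 10 elements and a work-group size of 64, the padded global size is 64, so 54 work-items would have nothing to do and must return early.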

Enqueue the kernel for execution

Requesting that a device begin executing a kernel is done with a call to clEnqueueNDRangeKernel( ).

Read the output buffer back to the host

Use clEnqueueReadBuffer( ) to read the OpenCL output buffer back into host memory.

Release OpenCL resources

This is done using appropriate clRelease commands.

In the next post, I will show an OpenCL equivalent of CUDA vector addition code from the previous post and then the command analogy between CUDA and OpenCL.



CUDA vs OpenCL

CUDA and OpenCL are two major programming frameworks for GPU computing. I briefly described them in one of the previous posts. Now, if you want to learn GPU computing, which one should you choose: CUDA or OpenCL?

Until recently, CUDA attracted most of the attention from developers, especially in the high performance computing realm, because of the good support from NVIDIA itself, particularly on its forums. But OpenCL is gaining ground rapidly. OpenCL software has now reached the point where GPU programmers are taking a second look.

CUDA and OpenCL do mostly the same thing. It’s like Italians and French fighting over who has the most beautiful language, while they’re both Romance languages.

nVidia’s CUDA is vendor-specific. It has better tools and better performance, and there is a lot of sample code, documentation and utilities available. If you have an actual GPU project that you need to work on in the short term and you can be certain that you only need to support high-end nVidia hardware, then CUDA is the way to go. OpenCL provides an open, industry-standard framework. As such, it has garnered support from nearly all processor manufacturers including AMD, Intel, and nVidia, as well as others that serve the mobile and embedded computing markets. As a result, applications developed in OpenCL are now portable across a variety of GPUs and CPUs. OpenCL, being an open standard, allows any vendor to implement OpenCL support on its products. Intel has announced that it will support OpenCL on future CPU products.

Ok, now you have two frameworks – which one to choose? Well, it depends on a lot of factors. If you are planning to implement a GPU project solely on nVidia’s cards, then CUDA is a better option. But if your application is to be deployed over a range of architectures then you need to work with OpenCL.

But to start off with, I personally prefer CUDA, because of the detailed documentation that nVidia has provided and also the vast community support. You can post a question in the nVidia forums (which are off-line now due to some security issues) and get clarifications from experts. There is also Stack Overflow. The basic idea behind learning CUDA and OpenCL is the same. The skills and knowledge you develop while working with CUDA will mostly be transferable to OpenCL later if needed. Also, some tools, such as Swan, convert CUDA code into OpenCL code. So, basically, if you learn one, you can very easily work with the other. A good comparison of CUDA and OpenCL is shown here and here. You can also look in the references for more information.



CUDA

  • Better marketing
  • Good support and documentation
  • Many features and toolsets
  • Works only on nVidia cards


OpenCL

  • Supports many architectures
  • It’s an open standard – which we always want
  • No proper documentation
  • Provided by different vendors in various packages – no universal package

OpenCL has recently been gaining ground on CUDA. This might be a reason that nVidia recently released its source code to developers and also stopped providing OpenCL support in newer releases of CUDA. That indicates there is stiff competition going on, and I personally feel it’s only a matter of time before OpenCL reaches the level of CUDA.



General Purpose Graphics Processing Unit (GPGPU)

In the previous post I described the GPU, which was intended for graphics acceleration. But how do we, the common people, use the parallel computing capability of a GPU? For that, first let me talk about the General Purpose Graphics Processing Unit.

General-purpose computing on graphics processing units (GPGPU, GPGP or less often GP²U) is the means of using a graphics processing unit (GPU), which typically handles computations only for computer graphics, to perform computations in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. The general-purpose CPU is better at some stuff though: general programming, accessing memory randomly, executing steps in order, everyday stuff. It’s true, though, that CPUs are sprouting cores, looking more and more like GPUs in some respects.

In brief, the thing about parallel processing (using tons of cores to break stuff up and crunch it all at once) is that applications have to be programmed to take advantage of it. It’s not easy, which is why Intel at this point hires more software engineers than hardware ones. So even if the hardware’s there, you still need the software to get there, and it’s a whole different kind of programming. Which brings us to OpenCL (Open Computing Language) and CUDA. They’re frameworks that make it way easier to use graphics cards for general purpose computing. OpenCL is the “open standard for parallel programming of heterogeneous systems” standardized by the Khronos Group; AMD, Apple, IBM, Intel, Nvidia, Samsung and a bunch of others are involved. In semi-English, it’s a cross-platform standard for parallel programming across different kinds of hardware, using both CPU and GPU, that anyone can use for free. CUDA is Nvidia’s own architecture for parallel programming on its graphics cards.

Ok…parallel computing is pretty great for scientists and organizations who deal with computationally intensive simulations that go on for days. But does it make our stuff go faster? The answer is: of course. Anyone with a good graphics card has a potential supercomputer by their side. Converting, decoding, creating and streaming videos (stuff you’re probably doing more now than you did a couple of years ago) have improved dramatically. Say bye-bye to 20-minute renders. Ditto for image editing; there’ll be less waiting for effects to propagate with giant images (Photoshop CS5, Illustrator, Fireworks etc. already use GPU acceleration). Many applications like Mathematica, MATLAB and other modelling software are also making use of GPUs for acceleration.

Hello world!

The future of supercomputing can be summed up in three letters: GPU. GPU stands for graphics processing unit. In the world of high-performance computing, graphics processing units are the talk of the town.

The GPU is a specialized circuit designed to accelerate the image output in a frame buffer intended for output to a display.
GPUs are very efficient at manipulating computer graphics and are generally more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel. GPUs are moving from video games into high performance computing in a big way since companies like Nvidia and AMD began focusing on software and revised their hardware designs to make them easier to use. Basically, a GPU has a large number of cores, each capable of executing an operation of its own.

GPU-based high performance computers are starting to play a significant role in large-scale modelling. Three of the five most powerful supercomputers in the world take advantage of GPU acceleration. Not coincidentally, this is exactly what China has done to achieve the world’s fastest speeds with its “Tianhe-1A” supercomputer. That computer combines about 7,000 Nvidia GPUs with 14,000 Intel CPUs: the only hybrid CPU-GPU system in the world of that scale.

Nvidia’s Russell gave this example to illustrate the difference between a traditional CPU and a GPU: If you were looking for a word in a book, and handed the task to a CPU, it would start at page 1 and read it all the way to the end, because it’s a “serial” processor. It would be fast, but would take time because it has to go in order. A GPU, which is a “parallel” processor, “would tear the book into a thousand pieces” and read it all at the same time. Even if each individual word is read more slowly, the book may be read in its entirety quicker, because words are read simultaneously.