OpenCL code structure

OpenCL (Open Computing Language) is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. It can greatly improve speed and responsiveness for a wide spectrum of applications, in market categories ranging from gaming and entertainment to scientific and medical software.

The Khronos Group, the consortium that manages the OpenCL standard, has developed an application programming interface (API) that is general enough to run on significantly different architectures while being adaptable enough that each hardware platform can still obtain high performance. The OpenCL API is a C API, with a C++ wrapper API that is defined in terms of the C API. There are third-party bindings for many languages, including Java, Python, and .NET. The code that executes on an OpenCL device, which in general is not the same device as the host CPU, is written in the OpenCL C language. OpenCL C is a restricted version of the C99 language with extensions appropriate for executing data-parallel code on a variety of heterogeneous devices.

Let’s get started with the structure of an OpenCL program. In a later post, I will also point out the analogy between CUDA and OpenCL commands, so that it is easy to learn CUDA and OpenCL side by side. In general, writing an OpenCL program can be broken down into the following steps:

  1. Discover and initialize the platforms
  2. Discover and initialize the devices
  3. Create a context
  4. Create a command queue
  5. Create device buffers
  6. Write host data to device buffers
  7. Create and compile the program
  8. Create the kernel
  9. Set the kernel arguments
  10. Configure the work-item structure
  11. Enqueue the kernel for execution
  12. Read the output buffer back to the host
  13. Release OpenCL resources

Discover and initialize the platforms

In the OpenCL platform model, there is a single host that coordinates execution on one or more devices. The API function clGetPlatformIDs( ) is used to discover the set of available platforms for a given system.

Discover and initialize the devices

clGetDeviceIDs( ) is used to discover the devices. clGetDeviceInfo( ) is called to retrieve information such as name, type, and vendor from each device.

Create a context

A context is an abstract container that exists on the host. A context coordinates the mechanisms for host–device interaction, manages the memory objects that are available to the devices, and keeps track of the programs and kernels that are created for each device. The API function to create a context is clCreateContext( ).

Create a command queue

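Communication with a device occurs by submitting commands to a command queue. The command queue is the mechanism that the host uses to request action by the device. The API function clCreateCommandQueue( ) is used to create a command queue and associate it with a device. (Since OpenCL 2.0 this function is deprecated in favor of clCreateCommandQueueWithProperties( ), but it remains the call to use with OpenCL 1.x.)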

Create device buffers

In order for data to be transferred to a device, it must first be encapsulated as a memory object. The API function clCreateBuffer( ) allocates the buffer and returns a memory object.

Write host data to device buffers

Data contained in host memory is transferred to and from an OpenCL buffer using the commands clEnqueueWriteBuffer( ) and clEnqueueReadBuffer( ), respectively.

Create and compile the program

OpenCL C code is called a program. A program is a collection of functions called kernels, where kernels are units of execution that can be scheduled to run on a device.

The process of creating and building a program is as follows:

  • The OpenCL C source code is stored in a character string. If the source code is stored in a file on a disk, it must be read into memory and stored as a character array.
  • The source code is turned into a program object, cl_program, by calling clCreateProgramWithSource( ).
  • The program object is then compiled, for one or more OpenCL devices, with clBuildProgram( ).
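The first step above, reading a source file into a character array, needs only standard C. A minimal sketch follows; the helper name read_kernel_source is hypothetical, and the file is assumed to be small enough to fit in memory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file into a newly allocated,
 * NUL-terminated character array. The caller must free( ) the result.
 * Returns NULL on failure; *length (if not NULL) receives the size. */
char *read_kernel_source(const char *path, size_t *length)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Seek to the end to find out how large the file is. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    /* One extra byte for the terminating NUL. */
    char *source = malloc((size_t)size + 1);
    if (source == NULL) { fclose(fp); return NULL; }

    size_t got = fread(source, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(source); return NULL; }

    source[got] = '\0';
    if (length != NULL)
        *length = got;
    return source;
}
```

Because the array is NUL-terminated, it can be handed to clCreateProgramWithSource( ) with a NULL lengths argument.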

Create the kernel

The final step in obtaining a cl_kernel object that can be used to execute kernels on a device is to extract the kernel from the cl_program. Extracting a kernel from a program is similar to obtaining an exported function from a dynamic library. The name of the kernel that the program exports is used to request it from the compiled program object. The name of the kernel is passed to clCreateKernel( ), along with the program object, and the kernel object is returned if the program object is valid and the named kernel is found.

Set the kernel arguments

Each kernel argument is set individually using the function clSetKernelArg( ).

Configure the work-item structure

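Define an index space (the global work size) of work-items for execution. Each work-item runs one instance of the kernel, and the index space can have one, two, or three dimensions.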
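One practical detail when sizing the index space: in OpenCL 1.x, if you choose an explicit work-group (local) size, the global work size must be a multiple of it. A common approach is to round the problem size up and have the kernel skip the extra work-items. The sketch below shows the rounding; the helper name round_up_global_size is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rounds n up to the next multiple of local_size,
 * so the result is a valid global work size for that work-group size. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return (remainder == 0) ? n : n + (local_size - remainder);
}
```

For example, 1000 elements with a work-group size of 64 give a global work size of 1024, so the kernel would need a bounds check such as if (i < n) to ignore the last 24 work-items.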

Enqueue the kernel for execution

Requesting that a device begin executing a kernel is done with a call to clEnqueueNDRangeKernel( ).

Read the output buffer back to the host

Use clEnqueueReadBuffer( ) to copy the contents of the output buffer from the device back into host memory.

Release OpenCL resources

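This is done using the appropriate clRelease functions, such as clReleaseKernel( ), clReleaseProgram( ), clReleaseMemObject( ), clReleaseCommandQueue( ), and clReleaseContext( ).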

In the next post, I will show an OpenCL equivalent of CUDA vector addition code from the previous post and then the command analogy between CUDA and OpenCL.

References:

  1. http://www.khronos.org/opencl/
  2. http://en.wikipedia.org/wiki/OpenCL
  3. Book – Heterogeneous Computing With OpenCL: http://www.amazon.com/Heterogeneous-Computing-OpenCL-Benedict-Gaster/dp/0123877660