HCC is a single-source, C/C++ compiler for heterogeneous computing. It's optimized with HSA (http://www.hsafoundation.com/).
HC API : Moving Beyond C++AMP for Accelerated GPU Computing

HC is a C++ API for accelerated computing provided by the HCC compiler. It has some similarities to C++ AMP and therefore, reference materials (blogs, articles, books) that describe C++ AMP also proivide an excellent way to become familiar with HC. For example, both APIs use a parallel_for_each construct to specify a parallel execution region that runs on accelerator. However, HC has several important differences from C++ AMP, including the removal of the "restrict" keyword to annotate device code, an explicit asynchronous launch behavior for parallel_for_each, the support for non-constant tile size, the support for memory pointer, etc..


HC comes with two header files as of now:

Most HC APIs are stored under "hc" namespace, and the class name is the same as their counterpart in C++AMP "Concurrency" namespace. Users of C++AMP should find it easy to switch from C++AMP to HC.

Concurrency::accelerator hc::accelerator
Concurrency::accelerator_view hc::accelerator_view
Concurrency::extent hc::extent
Concurrency::index hc::index
Concurrency::completion_future hc::completion_future
Concurrency::array hc::array
Concurrency::array_view hc::array_view

How to build programs with HC API

Use "hcc-config", instead of "clamp-config", or you could manually add "-hc" when you invoke clang++. Also, "hcc" is added as an alias for "clang++".

For example:

1 hcc `hcc-config --cxxflags --ldflags` foo.cpp -o foo

HCC built-in macros

Built-in macros:

Macro Meaning
__HCC__ always be 1
__hcc_major__ major version number of HCC
__hcc_minor__ minor version number of HCC
__hcc_patchlevel__ patchlevel of HCC
__hcc_version__ combined string of __hcc_major__, __hcc_minor__, __hcc_patchlevel__

The rule for __hcc_patchlevel__ is: yyWW-(HCC driver git commit #)-(HCC clang git commit #)

Macros for language modes in use:

Macro Meaning
__HCC_AMP__ 1 in case in C++ AMP mode (-std=c++amp)
__HCC_HC__ 1 in case in HC mode (-hc)

Compilation mode: HCC is a single-source compiler where kernel codes and host codes can reside in the same file. Internally HCC would trigger 2 compilation iterations, and the following macros can be user by user programs to determine which mode the compiler is in.

Macro Meaning
__HCC_ACCELERATOR__ not 0 in case the compiler runs in kernel code compilation mode
__HCC_CPU__ not 0 in case the compiler runs in host code compilation mode

HC-specific features

Differences between HC API and C++ AMP

Despite HC and C++ AMP share a lot of similarities in terms of programming constructs (e.g. parallel_for_each, array, array_view, etc.), there are several significant differences between the two APIs.

Support for explicit asynchronous parallel_for_each

In C++ AMP, the parallel_for_each appears as a synchronous function call in a program (i.e. the host waits for the kernel to complete); howevever, the compiler may optimize it to execute the kernel asynchronously and the host would synchronize with the device on the first access of the data modified by the kernel. For example, if a parallel_for_each writes the an array_view, then the first access to this array_view on the host after the parallel_for_each would block until the parallel_for_each completes.

HC supports the automatic synchronization behavior as in C++ AMP. In addition, HC's parallel_for_each supports explicit asynchronous execution. It returns a completion_future (similar to C++ std::future) object that other asynchronous operations could synchronize with, which provides better flexibility on task graph construction and enables more precise control on optimization.

Annotation of device functions

C++ AMP uses the restrict(amp) keyword to annotatate functions that runs on the device.

1 void foo() restrict(amp) {
2 ..
3 }
4 ...
5 parallel_for_each(...,[=] () restrict(amp) {
6  foo();
7 });

HC uses a function attribute ([[hc]] or __attribute__((hc)) ) to annotate a device function.

1 void foo() [[hc]] {
2 ..
3 }
4 ...
5 parallel_for_each(...,[=] () [[hc]] {
6  foo();
7 });

The [[hc]] annotation for the kernel function called by parallel_for_each is optional as it is automatically annotated as a device function by the hcc compiler. The compiler also supports partial automatic [[hc]] annotation for functions that are called by other device functions within the same source file:

1 // Since bar is called by foo, which is a device function, the hcc compiler
2 // will automatically annotate bar as a device function
3 void bar() {
4 ...
5 }
7 void foo() [[hc]] {
8  bar();
9 }

Dynamic tile size

C++ AMP doesn't support dynamic tile size. The size of each tile dimensions has to be a compile-time constant specified as template arguments to the tile_extent object:

1 extent<2> ex(x, y);
3 // create a tile extent of 8x8 from the extent object
4 // note that the tile dimensions have to be constant values
5 tiled_extent<8,8> t_ex(ex);
7 parallel_for_each(t_ex, [=](tiled_index<8,8> t_id) restrict(amp) {
8 ...
9 });

HC supports both static and dynamic tile size:

1 extent<2> ex(x,y)
3 // create a tile extent from dynamically calculated values
4 // note that the the tiled_extent template takes the rank instead of dimensions
5 tx = test_x ? tx_a : tx_b;
6 ty = test_y ? ty_a : ty_b;
7 tiled_extent<2> t_ex(ex, tx, ty);
9 parallel_for_each(t_ex, [=](tiled_index<2> t_id) [[hc]] {
10 ...
11 });

Support for memory pointer

C++ AMP doens't support lambda capture of memory pointer into a GPU kernel.

HC supports capturing memory pointer by a GPU kernel.

1 // allocate GPU memory through the HSA API
2 int* gpu_pointer;
3 hsa_memory_allocate(..., &gpu_pointer);
4 ...
5 parallel_for_each(ext, [=](index i) [[hc]] {
6  gpu_pointer[i[0]]++;
7 }

For HSA APUs that supports system wide shared virtual memory, a GPU kernel can directly access system memory allocated by the host:

1 int* cpu_memory = (int*) malloc(...);
2 ...
3 parallel_for_each(ext, [=](index i) [[hc]] {
4  cpu_memory[i[0]]++;
5 });