HCC
HCC is a single-source, C/C++ compiler for heterogeneous computing. It's optimized with HSA (http://www.hsafoundation.com/).
|
HC is a C++ API for accelerated computing provided by the HCC compiler. It has some similarities to C++ AMP and therefore, reference materials (blogs, articles, books) that describe C++ AMP also proivide an excellent way to become familiar with HC. For example, both APIs use a parallel_for_each construct to specify a parallel execution region that runs on accelerator. However, HC has several important differences from C++ AMP, including the removal of the "restrict" keyword to annotate device code, an explicit asynchronous launch behavior for parallel_for_each, the support for non-constant tile size, the support for memory pointer, etc..
HC comes with two header files as of now:
Most HC APIs are stored under "hc" namespace, and the class name is the same as their counterpart in C++AMP "Concurrency" namespace. Users of C++AMP should find it easy to switch from C++AMP to HC.
Use "hcc-config", instead of "clamp-config", or you could manually add "-hc" when you invoke clang++. Also, "hcc" is added as an alias for "clang++".
For example:
Built-in macros:
Macro | Meaning |
---|---|
__HCC__ | always be 1 |
__hcc_major__ | major version number of HCC |
__hcc_minor__ | minor version number of HCC |
__hcc_patchlevel__ | patchlevel of HCC |
__hcc_version__ | combined string of __hcc_major__ , __hcc_minor__ , __hcc_patchlevel__ |
The rule for __hcc_patchlevel__
is: yyWW-(HCC driver git commit #)-(HCC clang git commit #)
Macros for language modes in use:
Macro | Meaning |
---|---|
__HCC_AMP__ | 1 in case in C++ AMP mode (-std=c++amp) |
__HCC_HC__ | 1 in case in HC mode (-hc) |
Compilation mode: HCC is a single-source compiler where kernel codes and host codes can reside in the same file. Internally HCC would trigger 2 compilation iterations, and the following macros can be user by user programs to determine which mode the compiler is in.
Macro | Meaning |
---|---|
__HCC_ACCELERATOR__ | not 0 in case the compiler runs in kernel code compilation mode |
__HCC_CPU__ | not 0 in case the compiler runs in host code compilation mode |
Despite HC and C++ AMP share a lot of similarities in terms of programming constructs (e.g. parallel_for_each, array, array_view, etc.), there are several significant differences between the two APIs.
parallel_for_each
In C++ AMP, the parallel_for_each
appears as a synchronous function call in a program (i.e. the host waits for the kernel to complete); howevever, the compiler may optimize it to execute the kernel asynchronously and the host would synchronize with the device on the first access of the data modified by the kernel. For example, if a parallel_for_each
writes the an array_view, then the first access to this array_view on the host after the parallel_for_each
would block until the parallel_for_each
completes.
HC supports the automatic synchronization behavior as in C++ AMP. In addition, HC's parallel_for_each
supports explicit asynchronous execution. It returns a completion_future
(similar to C++ std::future) object that other asynchronous operations could synchronize with, which provides better flexibility on task graph construction and enables more precise control on optimization.
C++ AMP uses the restrict(amp)
keyword to annotatate functions that runs on the device.
HC uses a function attribute ([[hc]]
or __attribute__((hc))
) to annotate a device function.
The [[hc]] annotation for the kernel function called by parallel_for_each
is optional as it is automatically annotated as a device function by the hcc compiler. The compiler also supports partial automatic [[hc]] annotation for functions that are called by other device functions within the same source file:
C++ AMP doesn't support dynamic tile size. The size of each tile dimensions has to be a compile-time constant specified as template arguments to the tile_extent object:
HC supports both static and dynamic tile size:
C++ AMP doens't support lambda capture of memory pointer into a GPU kernel.
HC supports capturing memory pointer by a GPU kernel.
For HSA APUs that supports system wide shared virtual memory, a GPU kernel can directly access system memory allocated by the host: