HCC
HCC is a single-source C/C++ compiler for heterogeneous computing, built on top of HSA (http://www.hsafoundation.com/).
Represents a logical (isolated) accelerator view of a compute accelerator.

#include <hc.hpp>
Public Member Functions

accelerator_view (const accelerator_view &other)
    Copy-constructs an accelerator_view object.
accelerator_view & operator= (const accelerator_view &other)
    Assigns an accelerator_view object to "this" accelerator_view object and returns a reference to "this" object.
queuing_mode get_queuing_mode () const
    Returns the queuing mode that this accelerator_view was created with.
execute_order get_execute_order () const
    Returns the execution order of this accelerator_view.
bool get_is_auto_selection ()
    Returns a boolean value indicating whether the accelerator view, when passed to a parallel_for_each, would result in automatic selection of an appropriate execution target by the runtime.
unsigned int get_version () const
    Returns a 32-bit unsigned integer representing the version number of this accelerator view.
accelerator get_accelerator () const
    Returns the accelerator that this accelerator_view has been created on.
bool get_is_debug () const
    Returns a boolean value indicating whether the accelerator_view supports debugging through extensive error reporting.
void wait (hcWaitMode waitMode=hcWaitModeBlocked)
    Performs a blocking wait for completion of all commands submitted to the accelerator view prior to calling wait().
void flush ()
    Sends the queued-up commands in the accelerator_view to the device for execution.
completion_future create_marker (memory_scope fence_scope=system_scope) const
    Inserts a marker event into the accelerator_view's command queue.
completion_future create_blocking_marker (completion_future &dependent_future, memory_scope fence_scope=system_scope) const
    Inserts a marker event into the accelerator_view's command queue with a prior dependent asynchronous event.
completion_future create_blocking_marker (std::initializer_list< completion_future > dependent_future_list, memory_scope fence_scope=system_scope) const
    Inserts a marker event into the accelerator_view's command queue with an arbitrary number of dependent asynchronous events.
template<typename InputIterator >
completion_future create_blocking_marker (InputIterator first, InputIterator last, memory_scope scope) const
    Inserts a marker event into the accelerator_view's command queue with an arbitrary number of dependent asynchronous events.
void copy (const void *src, void *dst, size_t size_bytes)
    Copies size_bytes bytes from src to dst.
void copy_ext (const void *src, void *dst, size_t size_bytes, hcCommandKind copyDir, const hc::AmPointerInfo &srcInfo, const hc::AmPointerInfo &dstInfo, const hc::accelerator *copyAcc, bool forceUnpinnedCopy)
    Copies size_bytes bytes from src to dst.
void copy_ext (const void *src, void *dst, size_t size_bytes, hcCommandKind copyDir, const hc::AmPointerInfo &srcInfo, const hc::AmPointerInfo &dstInfo, bool forceUnpinnedCopy)
    Copies size_bytes bytes from src to dst.
completion_future copy_async (const void *src, void *dst, size_t size_bytes)
    Copies size_bytes bytes from src to dst.
completion_future copy_async_ext (const void *src, void *dst, size_t size_bytes, hcCommandKind copyDir, const hc::AmPointerInfo &srcInfo, const hc::AmPointerInfo &dstInfo, const hc::accelerator *copyAcc)
    Copies size_bytes bytes from src to dst.
bool operator== (const accelerator_view &other) const
    Compares "this" accelerator_view with the passed accelerator_view object to determine if they represent the same underlying object.
bool operator!= (const accelerator_view &other) const
    Compares "this" accelerator_view with the passed accelerator_view object to determine if they represent different underlying objects.
size_t get_max_tile_static_size ()
    Returns the maximum size of the tile static area available on this accelerator view.
int get_pending_async_ops ()
    Returns the number of pending asynchronous operations on this accelerator view.
bool get_is_empty ()
    Returns true if the accelerator_view is currently empty.
void * get_hsa_queue ()
    Returns an opaque handle which points to the underlying HSA queue.
void * get_hsa_agent ()
    Returns an opaque handle which points to the underlying HSA agent.
void * get_hsa_am_region ()
    Returns an opaque handle which points to the AM region on the HSA agent.
void * get_hsa_am_system_region ()
    Returns an opaque handle which points to the AM system region on the HSA agent.
void * get_hsa_am_finegrained_system_region ()
    Returns an opaque handle which points to the finegrained AM system region on the HSA agent.
void * get_hsa_kernarg_region ()
    Returns an opaque handle which points to the Kernarg region on the HSA agent.
bool is_hsa_accelerator ()
    Returns whether the accelerator view is based on HSA.
void dispatch_hsa_kernel (const hsa_kernel_dispatch_packet_t *aql, const void *args, size_t argsize, hc::completion_future *cf=nullptr, const char *kernel_name=nullptr)
    Dispatches a kernel into the accelerator_view.
bool set_cu_mask (const std::vector< bool > &cu_mask)
    Sets a CU affinity on this accelerator_view's command queue.
Friends

class accelerator
template<typename Q , int K>
class array
template<typename Q , int K>
class array_view
template<typename Kernel >
void * Kalmar::mcw_cxxamp_get_kernel (const std::shared_ptr< Kalmar::KalmarQueue > &, const Kernel &)
template<typename Kernel , int dim_ext>
void Kalmar::mcw_cxxamp_execute_kernel_with_dynamic_group_memory (const std::shared_ptr< Kalmar::KalmarQueue > &, size_t *, size_t *, const Kernel &, void *, size_t)
template<typename Kernel , int dim_ext>
std::shared_ptr< Kalmar::KalmarAsyncOp > Kalmar::mcw_cxxamp_execute_kernel_with_dynamic_group_memory_async (const std::shared_ptr< Kalmar::KalmarQueue > &, size_t *, size_t *, const Kernel &, void *, size_t)
template<typename Kernel , int dim_ext>
void Kalmar::mcw_cxxamp_launch_kernel (const std::shared_ptr< Kalmar::KalmarQueue > &, size_t *, size_t *, const Kernel &)
template<typename Kernel , int dim_ext>
std::shared_ptr< Kalmar::KalmarAsyncOp > Kalmar::mcw_cxxamp_launch_kernel_async (const std::shared_ptr< Kalmar::KalmarQueue > &, size_t *, size_t *, const Kernel &)
template<int N, typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const extent< N > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const extent< 1 > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const extent< 2 > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const extent< 3 > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const tiled_extent< 3 > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const tiled_extent< 2 > &, const Kernel &)
template<typename Kernel >
completion_future parallel_for_each (const accelerator_view &, const tiled_extent< 1 > &, const Kernel &)
Represents a logical (isolated) accelerator view of a compute accelerator.
An object of this type can be obtained by reading the default_view property or by calling the create_view member function on an accelerator object.
accelerator_view::accelerator_view (const accelerator_view &other) [inline]
Copy-constructs an accelerator_view object.
This function does a shallow copy with the newly created accelerator_view object pointing to the same underlying view as the "other" parameter.
Parameters:
    [in] other    The accelerator_view object to be copied.
void accelerator_view::copy (const void *src, void *dst, size_t size_bytes) [inline]
Copies size_bytes bytes from src to dst.
Src and dst must not overlap. Note that src is the first parameter and dst the second, following C++ convention. The copy command will execute after any commands already inserted into the accelerator_view finish. This is a synchronous copy command: the copy operation completes before this call returns.
completion_future accelerator_view::copy_async (const void *src, void *dst, size_t size_bytes) [inline]
Copies size_bytes bytes from src to dst.
Src and dst must not overlap. Note the src is the first parameter and dst is second, following C++ convention. This is an asynchronous copy command, and this call may return before the copy operation completes. If the source or dest is host memory, the memory must be pinned or a runtime exception will be thrown. Pinned memory can be created with am_alloc with flag=amHostPinned flag.
The copy command will be implicitly ordered with respect to commands previously equeued to this accelerator_view:
completion_future accelerator_view::copy_async_ext (const void *src, void *dst, size_t size_bytes, hcCommandKind copyDir, const hc::AmPointerInfo &srcInfo, const hc::AmPointerInfo &dstInfo, const hc::accelerator *copyAcc) [inline]
Copies size_bytes bytes from src to dst.
Src and dst must not overlap. Note the src is the first parameter and dst is second, following C++ convention. This is an asynchronous copy command, and this call may return before the copy operation completes. If the source or dest is host memory, the memory must be pinned or a runtime exception will be thrown. Pinned memory can be created with am_alloc with flag=amHostPinned flag.
The copy command will be implicitly ordered with respect to commands previously enqueued to this accelerator_view:
The copy_async_ext flavor allows caller to provide additional information about each pointer, which can improve performance by eliminating replicated lookups, and also allow control over which device performs the copy. This interface is intended for language runtimes such as HIP.
Parameters:
    copyDir    Specifies the direction of the copy. Must be hcMemcpyHostToHost, hcMemcpyHostToDevice, hcMemcpyDeviceToHost, or hcMemcpyDeviceToDevice.
    copyAcc    Specifies which accelerator performs the copy operation. The specified accelerator must have access to the source and destination pointers, either because the memory is allocated on that device or because the accelerator has peer access to the memory. If copyAcc is nullptr, the copy will be performed by the host; in this case, the host accelerator must have access to both pointers. The copy operation will be performed by the specified engine but is not synchronized with respect to any operations on that device.
void accelerator_view::copy_ext (const void *src, void *dst, size_t size_bytes, hcCommandKind copyDir, const hc::AmPointerInfo &srcInfo, const hc::AmPointerInfo &dstInfo, const hc::accelerator *copyAcc, bool forceUnpinnedCopy) [inline]
Copies size_bytes bytes from src to dst.
Src and dst must not overlap. Note that src is the first parameter and dst the second, following C++ convention. The copy command will execute after any commands already inserted into the accelerator_view finish. This is a synchronous copy command: the copy operation completes before this call returns. The copy_ext flavor allows the caller to provide additional information about each pointer, which can improve performance by eliminating replicated lookups. This interface is intended for language runtimes such as HIP.
Parameters:
    copyDir    Specifies the direction of the copy. Must be hcMemcpyHostToHost, hcMemcpyHostToDevice, hcMemcpyDeviceToHost, or hcMemcpyDeviceToDevice.
    forceUnpinnedCopy    Forces the copy to be performed with host involvement rather than with accelerator copy engines.
completion_future accelerator_view::create_blocking_marker (completion_future &dependent_future, memory_scope fence_scope=system_scope) const [inline]
This command inserts a marker event into the accelerator_view's command queue with a prior dependent asynchronous event.
This marker is returned as a completion_future object. When its dependent event and all commands submitted prior to the marker event creation have been completed, the future is ready.
Regardless of the accelerator_view's execute_order (execute_any_order, execute_in_order), the marker always ensures older commands complete before the returned completion_future is marked ready. Thus, markers provide a mechanism to enforce order between commands in an execute_any_order accelerator_view.
fence_scope controls the scope of the acquire and release fences applied after the marker executes.
dependent_future may be recorded in another queue or on another accelerator; if so, the runtime performs cross-accelerator synchronization.
completion_future accelerator_view::create_blocking_marker (std::initializer_list< completion_future > dependent_future_list, memory_scope fence_scope=system_scope) const [inline]
This command inserts a marker event into the accelerator_view's command queue with arbitrary number of dependent asynchronous events.
This marker is returned as a completion_future object. When its dependent events and all commands submitted prior to the marker event creation have been completed, the completion_future is ready.
Regardless of the accelerator_view's execute_order (execute_any_order, execute_in_order), the marker always ensures older commands complete before the returned completion_future is marked ready. Thus, markers provide a mechanism to enforce order between commands in an execute_any_order accelerator_view.
fence_scope controls the scope of the acquire and release fences applied after the marker executes.
template<typename InputIterator >
completion_future accelerator_view::create_blocking_marker (InputIterator first, InputIterator last, memory_scope scope) const [inline]
This command inserts a marker event into the accelerator_view's command queue with arbitrary number of dependent asynchronous events.
This marker is returned as a completion_future object. When its dependent events and all commands submitted prior to the marker event creation have been completed, the completion_future is ready.
Regardless of the accelerator_view's execute_order (execute_any_order, execute_in_order), the marker always ensures older commands complete before the returned completion_future is marked ready. Thus, markers provide a mechanism to enforce order between commands in an execute_any_order accelerator_view.
completion_future accelerator_view::create_marker (memory_scope fence_scope=system_scope) const [inline]
This command inserts a marker event into the accelerator_view's command queue.
This marker is returned as a completion_future object. When all commands that were submitted prior to the marker event creation have completed, the future is ready.
Regardless of the accelerator_view's execute_order (execute_any_order, execute_in_order), the marker always ensures older commands complete before the returned completion_future is marked ready. Thus, markers provide a mechanism to enforce order between commands in an execute_any_order accelerator_view.
fence_scope controls the scope of the acquire and release fences applied after the marker executes.
void accelerator_view::dispatch_hsa_kernel (const hsa_kernel_dispatch_packet_t *aql, const void *args, size_t argsize, hc::completion_future *cf=nullptr, const char *kernel_name=nullptr) [inline]
Dispatch a kernel into the accelerator_view.
This function is intended to provide a gateway to dispatch code objects, with some assistance from HCC. Kernels are specified in the standard code object format, and can be created from a variety of compiler tools, including the assembler, offline CL compilers, or other tools. The caller also specifies the execution configuration and kernel arguments. HCC will copy the kernel arguments into an appropriate segment and insert the packet into the queue. HCC will also automatically handle signal and kernarg allocation and deallocation for the command.
The kernel is dispatched asynchronously, and thus this API may return before the kernel finishes executing.
Kernels dispatched with this API may be interleaved with other copy and kernel commands generated from copy or parallel_for_each commands. The kernel honors the execute_order associated with the accelerator_view. Specifically, if execute_order is execute_in_order, the kernel will wait for older data and kernel commands in the same queue before beginning execution. If execute_order is execute_any_order, the kernel may begin executing without regard to the state of older kernels. This call honors the packet barrier bit (1 << HSA_PACKET_HEADER_BARRIER) if set in the aql.header field; if set, this provides the same synchronization behavior as execute_in_order for the command generated by this API.
Parameters:
    aql    An HSA-format "AQL" packet. The following fields must be set by the caller:
        aql.kernel_object
        aql.group_segment_size : includes static + dynamic group size
        aql.private_segment_size
        aql.grid_size_x, aql.grid_size_y, aql.grid_size_z
        aql.group_size_x, aql.group_size_y, aql.group_size_z
        aql.setup : the 2 bits at HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS
        aql.header : must specify the desired memory fence operations and the barrier bit (if desired). A typical conservative setting would be:
            aql.header = (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
                         (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE) |
                         (1 << HSA_PACKET_HEADER_BARRIER);
        The following fields are ignored; the API will set them up before dispatching the AQL packet: aql.completion_signal, aql.kernarg.
    args    Pointer to kernel arguments with the size and alignment expected by the kernel. The args are copied and then passed directly to the kernel; after this function returns, the args memory may be deallocated.
    argSz    Size of the arguments.
    cf    Written with a completion_future that can be used to track the status of the dispatch. May be NULL, in which case no completion_future is returned and the caller must use other synchronization techniques, such as calling accelerator_view::wait() or waiting on a younger command in the same queue.
    kernel_name    Optionally specifies the name of the kernel for debug and profiling. May be null. If specified, the caller is responsible for ensuring the memory for the name remains allocated until the kernel completes.
As described above, the dispatch_hsa_kernel call copies the kernel arguments into an appropriate segment, handles signal and kernarg allocation and deallocation, and inserts the dispatch packet into the queue.
void accelerator_view::flush () [inline]
Sends the queued up commands in the accelerator_view to the device for execution.
An accelerator_view internally maintains a buffer of commands such as data transfers between the host memory and device buffers, and kernel invocations (parallel_for_each calls). This member function sends the commands to the device for processing. Normally, these commands are sent to the device automatically whenever the runtime determines it is necessary, such as when the command buffer is full or when waiting for transfer of data from the device buffers to host memory. The flush member function sends the commands to the device manually.
Calling this member function incurs an overhead and must be used with discretion. A typical use of this member function would be when the CPU waits for an arbitrary amount of time and would like to force the execution of queued device commands in the meantime. It can also be used to ensure that resources on the accelerator are reclaimed after all references to them have been removed.
Because flush operates asynchronously, it can return either before or after the device finishes executing the buffered commands; however, the commands will eventually always complete.
If the queuing_mode is queuing_mode_immediate, this function has no effect.
void * accelerator_view::get_hsa_agent () [inline]
Returns an opaque handle which points to the underlying HSA agent.
void * accelerator_view::get_hsa_am_finegrained_system_region () [inline]
Returns an opaque handle which points to the finegrained AM system region on the HSA agent.
This region can be used to allocate finegrained system memory which is accessible from the specified accelerator.
void * accelerator_view::get_hsa_am_region () [inline]
Returns an opaque handle which points to the AM region on the HSA agent.
This region can be used to allocate accelerator memory which is accessible from the specified accelerator.
void * accelerator_view::get_hsa_am_system_region () [inline]
Returns an opaque handle which points to the AM system region on the HSA agent.
This region can be used to allocate system memory which is accessible from the specified accelerator.
void * accelerator_view::get_hsa_kernarg_region () [inline]
Returns an opaque handle which points to the Kernarg region on the HSA agent.
void * accelerator_view::get_hsa_queue () [inline]
Returns an opaque handle which points to the underlying HSA queue.
bool accelerator_view::get_is_auto_selection () [inline]
Returns a boolean value indicating whether the accelerator view, when passed to a parallel_for_each, would result in automatic selection of an appropriate execution target by the runtime.
In other words, this is the accelerator view that will be automatically selected if parallel_for_each is invoked without explicitly specifying an accelerator view.
bool accelerator_view::get_is_debug () const [inline]
Returns a boolean value indicating whether the accelerator_view supports debugging through extensive error reporting.
The is_debug property of the accelerator view is usually the same as that of the parent accelerator.
bool accelerator_view::get_is_empty () [inline]
Returns true if the accelerator_view is currently empty.
Care must be taken to use this API in a thread-safe manner. As the accelerator completes work, the queue may become empty after this function returns false.
int accelerator_view::get_pending_async_ops () [inline]
Returns the number of pending asynchronous operations on this accelerator view.
Care must be taken to use this API in a thread-safe manner.
queuing_mode accelerator_view::get_queuing_mode () const [inline]
Returns the queuing mode that this accelerator_view was created with.
See "Queuing Mode".
unsigned int accelerator_view::get_version () const [inline]
Returns a 32-bit unsigned integer representing the version number of this accelerator view.
The format of the integer is major.minor, where the major version number is in the high-order 16 bits, and the minor version number is in the low-order bits.
The version of the accelerator view is usually the same as that of the parent accelerator.
bool accelerator_view::operator!= (const accelerator_view &other) const [inline]
Compares "this" accelerator_view with the passed accelerator_view object to determine if they represent different underlying objects.
Parameters:
    [in] other    The accelerator_view object to be compared against.
accelerator_view & accelerator_view::operator= (const accelerator_view &other) [inline]
Assigns an accelerator_view object to "this" accelerator_view object and returns a reference to "this" object.
This function does a shallow assignment, with "this" accelerator_view pointing to the same underlying view as the passed accelerator_view parameter.
Parameters:
    [in] other    The accelerator_view object to be assigned from.
bool accelerator_view::operator== (const accelerator_view &other) const [inline]
Compares "this" accelerator_view with the passed accelerator_view object to determine if they represent the same underlying object.
Parameters:
    [in] other    The accelerator_view object to be compared against.
bool accelerator_view::set_cu_mask (const std::vector< bool > &cu_mask) [inline]
Set a CU affinity to specific command queues.
The setting is permanent until the queue is destroyed or the CU affinity is set again. The setting is atomic: it does not affect dispatches already in flight.
Parameters:
    cu_mask    A bool vector indicating which CUs to use; true means the CU is enabled. The first 32 elements represent the first 32 CUs, and so on. If the vector is larger than the number of physical CUs, the extra elements are ignored. It is the user's responsibility to make sure the input is meaningful.
void accelerator_view::wait (hcWaitMode waitMode=hcWaitModeBlocked) [inline]
Performs a blocking wait for completion of all commands submitted to the accelerator view prior to calling wait().
Parameters:
    [in] waitMode    An optional parameter specifying the wait mode; the default is hcWaitModeBlocked. hcWaitModeActive can be used to reduce latency at the expense of using one CPU core for active waiting.