# HCC

HCC is a single-source C/C++ compiler for heterogeneous computing. It is optimized for HSA (http://www.hsafoundation.com/).
## HCC Profile Mode

HCC supports a low-overhead profiler that traces or summarizes command timestamp information to stderr for any HCC or HIP program. The profiler messages are interleaved with the trace output from the application, which is handy for identifying the region of interest and can complement deeper analysis with the CodeXL GUI. Additionally, the HCC profiler requires only console-mode access and can be used on machines where graphics are not available or are hard to access.

The profiler is controlled with the HCC_PROFILE environment variable:

HCC_PROFILE=2 enables a profile message after each command (kernel or data movement) completes. HCC_PROFILE=1 is a future capability that will show a summary of kernel and data commands when HCC exits (under development).
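
Since the profile records go to stderr, they can be captured separately from the application's normal output. A minimal sketch (./myapp stands in for any HCC or HIP program here):

```
# Keep the profile records separate from the application's stdout.
$ HCC_PROFILE=2 ./myapp > app.out 2> prof.out
```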

## One-liners

```
# Trace collection:
$ HCC_PROFILE=2 RunMyApp &> prof.out

# Post-processing examples:
# Show summary
$ rpt prof.out

# Show summary for a specified ROI (lines 100-200 of the file)
$ rpt prof.out -r @100 -R @200

# Generate text trace to stdout
$ rpt prof.out -t

# Generate gui trace for viewing with chrome://tracing
$ rpt prof.out -g prof.json
```

## HCC text profile format

### Kernel Commands

This example shows profiled kernel commands:

```
$ HCC_PROFILE=2 ./my-hcc-app ...
profile: kernel; Im2Col; 17.8 us; 94859076277181; 94859076277181; 94859076294941; #0.3.1;
profile: kernel; tg_betac_alphaab; 32.6 us; 94859076277181; 94859537593679; 94859537626319; #0.3.2;
profile: kernel; MIOpenConvUni; 125.4 us; 94859076277181; 94860077852212; 94860077977651; #0.3.3;
```

```
PROFILE: TYPE; KERNEL_NAME ; DURATION; ENQUEUE; START ; STOP ; ID
```
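
In the records above, STOP minus START (the timestamps appear to be in nanoseconds) matches the reported DURATION; for example, 94859076294941 - 94859076277181 = 17760 ns ≈ 17.8 us for Im2Col. A small sketch that recomputes kernel durations from a saved trace, assuming the semicolon-separated layout shown above and a trace captured to prof.out:

```
# Recompute each kernel's duration from its START/STOP timestamps and
# print it alongside the DURATION reported by the profiler.
$ grep '^profile: kernel' prof.out | \
    awk -F';' '{ printf "%-30s %10.2f us (reported:%s)\n", $2, ($6 - $5) / 1000, $3 }'
```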

### Memory Copy Commands

This example shows memory copy commands with full verbose output:

```
profile: copyslo; HostToDevice_sync_slow; 909.2 us; 94858703002; 94858703102; 94858704012; #0.0.0; 2359296 bytes; 2.2 MB; 2.5 GB/s;
profile: copy; DeviceToHost_sync_fast; 117.0 us; 94858726008; 94858726408; 94858726525; #0.0.0; 1228800 bytes; 1.2 MB; 10.0 GB/s;
profile: copy; DeviceToHost_sync_fast; 9.0 us; 94858726068; 94858726668; 94858726677; #0.0.0; 400 bytes; 0.0 MB; 0.0 GB/s;
profile: copy; HostToDevice_sync_fast; 15.2 us; 94858727039; 94858727639; 94858727654; #0.0.0; 9600 bytes; 0.0 MB; 0.6 GB/s;
profile: copy; HostToDevice_async_fast; 131.5 us; 94858729098; 94858729198; 94858729330; #0.6.1; 1228800 bytes; 1.2 MB; 8.9 GB/s;
PROFILE: TYPE; COPY_NAME ; DURATION; ENQUEUE; START; STOP; ID ; SIZE_BYTES; SIZE_MB; BANDWIDTH;
```
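
The copy records also include the transfer size and the achieved bandwidth, so slow or unexpectedly large transfers can be spotted directly from the raw trace. A minimal sketch, assuming the trace was saved to prof.out and the layout above (field 8 is SIZE_BYTES):

```
# List host<->device copies sorted by size, largest first.
# The pattern matches both "copy" and "copyslo" records.
$ grep '^profile: copy' prof.out | \
    awk -F';' '{ print $8, "-", $2 }' | sort -rn | head
```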

### Barrier Commands

An example barrier command with full verbosity:

```
profile: barrier; deps:0_acq:none_rel:sys; 5.3 us; 94858731419410; 94858731419410; 94858731424690; #0.0.2;
PROFILE: TYPE; BARRIER_NAME ; DURATION; ENQUEUE; START ; STOP ; ID ;
```

## Overhead

The HCC profiler does not add any additional synchronization between commands or queues. Profile information is recorded when a command is "disposed" by the HCC runtime, which may differ from the order in which commands are dispatched; the rpt tool should be used to recover the original command order. Profile mode allocates a signal for each command to record the timestamp information, which can add 1-2 us to the overall program execution for commands which do not already use a completion signal. However, the command duration (start to stop) is still accurate. Trace mode writes profile records to stderr, which will likely impact the overall application execution time; however, the GPU durations and timestamps are still valid.
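
A simple way to bound the profiler's cost for a particular workload is to time the application with and without profiling enabled (a sketch using the shell's time builtin; ./myapp stands in for the real program):

```
# Compare wall-clock time with and without trace-mode profiling.
# Discarding the records avoids skew from terminal rendering of stderr.
$ time ./myapp
$ time HCC_PROFILE=2 ./myapp 2> /dev/null
```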

## Additional Details and Tips

### rpt

The ROCm Profiling Tool ("rpt") is a command-line tool that post-processes the output from HCC_PROFILE mode. Its key features are described in the sections below.

#### Resources and Gaps

rpt organizes profile records into the parent resources where they execute. Resources can currently be GPUs (numbered 0..N-1) or the DATA bus. GPU resources execute kernel and barrier commands. DATA resources execute copy commands.

Gaps indicate periods of time when the resource is not utilized. rpt computes gap time by examining the delta between the stop and start times of consecutive records on the resource. Commands may overlap or be completely hidden by a command that starts earlier; in this case the gap duration is 0.

The summary view uses a histogram to aggregate gaps into buckets and shows the range covered by each bucket. For example, the "gap >=10000us" bucket summarizes all gaps of 10000us or more. Gaps are inclusive of the lower bound and exclusive of the upper bound. The --gaps option can be used to customize the histogram boundaries, which can be useful to isolate the causes of gaps in some cases.
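
As a sketch of customizing the buckets - note that the argument format shown here is an assumption, not confirmed syntax, so check rpt's help output:

```
# Hypothetical example: finer-grained gap buckets (boundaries in us).
# The comma-separated boundary list is an assumed format.
$ rpt prof.out --gaps 10,50,100,500,1000,10000
```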

Large gaps can indicate the ROI is not set correctly - see the next section for suggestions on how to set the ROI. Inside a valid ROI, the remaining gaps represent periods when the resource is genuinely idle.

#### Specifying ROI

A common problem when performing profiling analysis is to determine which region to focus the analysis on, and in particular to exclude initialization, warmup, and tear-down activity from the results. RPT provides options to specify the start and stop of the "Region of Interest" (ROI) using line numbers from the input file, timestamps, or regular expressions. The latter is particularly useful - modify the program to emit a marker message and then search for the specified marker.

ROI_POINTs can be specified with line numbers, start times, or search strings.
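
For example, using the two forms that appear elsewhere in this document (a line number prefixed with @, and the Nth occurrence of a regular expression written as N%"regex"):

```
# ROI bounded by line numbers of the input file:
$ rpt prof.out -r @100 -R @200

# ROI bounded by the 3rd and 4th occurrences of a marker line
# emitted by the application itself:
$ rpt prof.out -r 3%"^iteration start" -R 4%"^iteration start"
```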

#### Text trace

The -t option generates a text trace, with one row per profile record. Gaps between the records are also computed and displayed.

```
Resource Start(ms) +Time(us) Type #Dev.Queue.Cmd LineNum Name
GPU0 980.530247: +0.00 barrier #0.2.2586 195139: depcnt=0,acq=none,rel=sys
DATA 166.264953: +814268.55 gap ==============
DATA 980.533506: +2.67 copy #0.2.2587 195140: HostToDevice_async_fast_128_bytes
GPU0 980.530247: +11.70 gap ====
GPU0 980.541950: +0.00 barrier #0.2.2588 195138: depcnt=1,acq=none,rel=acc
GPU0 980.541950: +223.59 gap ==========
GPU0 980.765535: +5.63 kernel #0.1.192922 201310: TensorKernel
GPU0 980.771165: +14.37 gap ====
GPU0 980.785536: +0.00 barrier #0.1.192923 201309: depcnt=0,acq=none,rel=acc
GPU0 980.785536: +5.94 gap ==
GPU0 980.791477: +0.00 barrier #0.3.304 195141: depcnt=1,acq=none,rel=acc
GPU0 980.791477: +7.70 gap ==
GPU0 980.799181: +0.00 barrier #0.3.305 195143: depcnt=0,acq=none,rel=sys
DATA 980.536173: +272.19 gap ==========
DATA 980.808366: +2.37 copy #0.3.306 195144: DeviceToHost_async_fast_4_bytes
```

Text output from the application is shown interleaved with the profile records, preceded by a leading ">". The --hide_app_text option can be used to suppress this if desired.
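
For example (assuming -t and --hide_app_text can be combined on one command line):

```
# Text trace without the interleaved application output.
$ rpt prof.out -t --hide_app_text
```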

The gaps are shown with a series of '=' characters indicating their length - longer bars indicate a longer gap. The --gaps parameter (or the defaults, if not specified) also controls the split points for the bar display. The intent is to provide a visually distinct way to gauge the length of the gaps.

The combined view interleaves the records for all resources, sorted by start time. However, the time deltas and gaps are computed on a per-resource basis; for example, in the case above the large "DATA" gaps show the time since the last DATA command.

Copy commands show the transfer direction (HostToDevice or DeviceToHost), whether the copy is synchronous or asynchronous, whether the fast or slow path was used, and the transfer size - all encoded in the command name (for example, HostToDevice_async_fast_128_bytes).

Barrier commands show the number of input dependencies and the acquire and release fences, also encoded in the command name (for example, depcnt=0,acq=none,rel=sys).

More information is available in the HSA Platform System Architecture Specification, section 2.9 ("Architected Queuing Language").

The ROI controls apply to the text trace as well, and are useful to limit the generated text.
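
For example, combining the ROI and trace options from the one-liners above to keep only the region of interest:

```
# Text trace limited to the ROI, saved for later inspection.
$ rpt prof.out -t -r @100 -R @200 > roi_trace.txt
```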

#### GUI trace

The -g or --gui_trace option generates a JSON-format file that can be loaded into chrome://tracing. Chrome tracing was originally designed for viewing browser performance information, but it is also a fully capable timeline viewer that is readily available on many platforms. To access the GUI, start the Chrome browser and open the site "chrome://tracing".

In the timeline view, each row shows a GPU queue or data resource. Rectangles indicate commands that execute on that row. More information can be obtained by clicking on the rectangle.

The ROI controls apply to the GUI trace as well.

```
# Use rpt to create a json gui trace file
$ rpt prof.out -g prof.json
```

Start the Chrome browser and open the site "chrome://tracing". In the browser, select "Load" from the upper left-hand corner and navigate to the saved JSON file.

#### Example profiling analysis

```
# Collect the profiling information for the application
$ HCC_PROFILE=2 myapp &> myapp.prof
```

In this example, we had previously modified the program to print the message "iteration start" once every 10 iterations of the critical loop. We use these markers to filter the records processed by rpt, so we are examining exactly one marked region (specifically the profile records printed between the 3rd and 4th occurrences of the phrase "iteration start" at the beginning of a line). By default, rpt emits a summary table showing the top users of each resource:

```
$ rpt myapp.prof -r 3%"^iteration start" -R 4%"^iteration start"

ROI_START: GPU0 0.000000: +0.00 kernel 195169: miog_alphaab #0.1.199064
ROI_STOP : GPU0 6819.188139: +0.00 kernel 288991: miog_betac_alphaab #0.1.282703
ROI_TIME= 6.819 secs
Resource=GPU0 Showing 20/84 records 78.43% busy
 Total(%) Time(ms) Calls Avg(us) Min(us) Max(us) Name
 17.69% 1206185.8 23064 52.3 20.1 676.2 miog_betac_alphaab
 11.34% 773334.6 10 77333.5 75585.2 84103.3 gap>=10000us
 9.17% 625546.4 769 813.5 204.0 1810.3 MIOpenConv1x1
 7.98% 544091.0 420 1295.5 417.9 3776.7 sp3AsmConvRxSF
 6.77% 461441.2 81499 5.7 0.1 9.6 gap<10us
 5.60% 382048.1 7718 49.5 28.7 145.2 miog_alphaab
 4.55% 310281.5 11200 27.7 12.0 95.6 Col2Im
 3.39% 231113.1 50 4622.3 2562.4 12843.2 MIOpenConvUni
 3.37% 229728.0 12478 18.4 7.0 65.9 Im2Col
 2.80% 190663.2 273 698.4 121.0 3323.0 mloPooling
 2.25% 153615.5 99 1551.7 700.8 7101.0 gcnAsmConv3x3WrW
 1.99% 135681.8 199 681.8 416.3 2355.9 sp3AsmConv3x3F
 1.80% 122478.6 85 1440.9 191.4 2055.6 mloPoolingAveBwd
 1.55% 105427.5 90 1171.4 693.3 1438.2 MLOpenCvBwdWrWLmap
 1.48% 106004.4 940 112.8 17.5 1590.7 TensorAdder::ScalarSumSquares<float>
 1.38% 94248.5 331 284.7 101.0 997.7 gap<1000us
 1.37% 93703.6 40 2342.6 1347.6 5417.6 MIOpenCvBwdWrW
 1.29% 87982.8 26 3384.0 1035.7 7178.7 gap<10000us
 1.20% 82023.0 280 292.9 53.2 1909.4 BatchNormBwdSpatialDX
 1.05% 71775.9 2340 30.7 3.9 131.7 TensorOp

Resource=DATA Showing 8/8 records 0.00% busy
 Total(%) Time(ms) Calls Avg(us) Min(us) Max(us) Name
 93.20% 6355648.1 9 706183.1 693182.3 716972.0 gap>=10000us
 0.12% 8100.3 21 385.7 161.7 636.0 gap<1000us
 0.01% 666.9 8 83.4 76.4 94.5 gap<100us
 0.00% 70.1 2 35.0 24.7 45.3 gap<50us
 0.00% 67.7 30 2.3 1.9 2.7 DeviceToHost_async_fast_4_bytes
 0.00% 41.4 9 4.6 2.7 6.3 gap<10us
 0.00% 28.6 10 2.9 2.7 3.1 HostToDevice_async_fast_4_bytes
 0.00% 27.7 10 2.8 2.7 3.1 HostToDevice_async_fast_128_bytes
```

#### Observations and further analysis
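
From the summary above, GPU0 is 78.43% busy over the ROI, the single largest consumer is the miog_betac_alphaab kernel (17.69% of the time across 23064 calls), and roughly 11% of the time falls into ten large gaps (gap>=10000us); the DATA resource is almost entirely idle. One way to dig further is to pull the raw records for a kernel of interest back out of the trace - a sketch, assuming the trace was saved to myapp.prof as above:

```
# Extract every record for one kernel named in the summary, e.g. to
# inspect its per-call durations and queue IDs.
$ grep '^profile: kernel; MIOpenConvUni' myapp.prof
```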

## oprofile

### Introduction

OProfile is a system-wide statistical profiler for Linux. In this context it is useful for profiling the host (CPU) side of the application.

### Installation

```
$ apt-get install oprofile
```

### Data Collection

Start the program of interest:

```
$ ./myapp
```

In another terminal, determine the process ID (perhaps with pgrep) and pass it to operf: `` $ operf --pid `pgrep myapp` ``

By default, operf will back up any previously recorded profiling data and create a new file.

### Data Reporting

Show the callgraph:

```
$ opreport -cl
```

Show hotspots in the code:

```
$ opreport -dg
```