Profiler
Overview
Taichi includes a collection of profiling tools to help with code debugging and optimization. These tools collect hardware and Taichi-related information to measure program performance and identify bottlenecks.
Currently, Taichi provides two profiling tools:
- ScopedProfiler, which is responsible for analyzing the performance of the Taichi JIT compiler (host).
- KernelProfiler, which is responsible for analyzing the performance of Taichi kernels (device). Its advanced mode, which works with the CUDA backend only, provides detailed low-level performance metrics, such as memory bandwidth consumption.
ScopedProfiler
ScopedProfiler tracks the time spent on host tasks, such as JIT compilation.
This profiler is enabled by default. To display results in a hierarchical format, call ti.profiler.print_scoped_profiler_info().
For example:
import taichi as ti

ti.init(arch=ti.cpu)
var = ti.field(ti.f32, shape=1)

@ti.kernel
def compute():
    var[0] = 1.0
    print("Setting var[0] =", var[0])

compute()
# Print the host-side timings collected so far
ti.profiler.print_scoped_profiler_info()
note
ScopedProfiler is a C++ class in Taichi.
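To make the idea of scoped, hierarchical timing concrete, here is a minimal pure-Python sketch. It is not Taichi's C++ ScopedProfiler and does not reflect its internals; the class names ToyScopedProfiler and ScopeNode and the scope names "compile", "parse", and "codegen" are hypothetical and chosen only for illustration. It shows how nested scopes can be timed and then reported with each scope indented under its parent.

```python
import time
from contextlib import contextmanager


class ScopeNode:
    """One timed scope plus the scopes opened while it was active."""

    def __init__(self, name):
        self.name = name
        self.elapsed = 0.0  # seconds
        self.children = []


class ToyScopedProfiler:
    """Hypothetical stand-in used only to illustrate scoped timing."""

    def __init__(self):
        self.root = ScopeNode("root")
        self._stack = [self.root]  # innermost active scope is last

    @contextmanager
    def scope(self, name):
        node = ScopeNode(name)
        self._stack[-1].children.append(node)  # attach to the enclosing scope
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.elapsed = time.perf_counter() - start
            self._stack.pop()

    def report(self, node=None, depth=0):
        """Return report lines, indenting each scope under its parent."""
        node = node or self.root
        lines = []
        for child in node.children:
            lines.append(f"{'  ' * depth}{child.name}: {child.elapsed * 1e3:.3f} ms")
            lines.extend(self.report(child, depth + 1))
        return lines


profiler = ToyScopedProfiler()
with profiler.scope("compile"):
    with profiler.scope("parse"):
        sum(range(10_000))
    with profiler.scope("codegen"):
        sum(range(20_000))

print("\n".join(profiler.report()))
```

In this sketch the time of an outer scope includes the time of every scope nested inside it, which is why a hierarchical view is useful for locating where host time is spent.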
KernelProfiler
KernelProfiler retrieves the kernel profiling records from the backend, counts them in the Python scope, and prints the results to the console. Note that kernel_profiler supports CPU and CUDA only. Ensure that you call ti.sync() before performance profiling if your program is running on GPU.
- To enable this profiler, set kernel_profiler=True when calling ti.init().
- Use ti.profiler.print_kernel_profiler_info() to see the profiling results. There are two printing modes:
  - In "count" mode (the default), profiling records with the same kernel name are merged into a single profiling result.
  - In "trace" mode, the profiler lists every kernel launched on hardware during the profiling period. This option displays more comprehensive performance and hardware characteristics for each kernel.
- Use ti.profiler.clear_kernel_profiler_info() to clear the entries in this profiler.
For example:
import taichi as ti

ti.init(ti.cpu, kernel_profiler=True)
x = ti.field(ti.f32, shape=1024*1024)

@ti.kernel
def fill():
    for i in x:
        x[i] = i

for i in range(8):
    fill()
ti.profiler.print_kernel_profiler_info('trace')
ti.profiler.clear_kernel_profiler_info()  # Clears all records

for i in range(100):
    fill()
ti.profiler.print_kernel_profiler_info()  # The default mode: 'count'
The profiler outputs the following:
=========================================================================
X64 Profiler(trace)
=========================================================================
[ % | time ] Kernel name
[ 0.00% | 0.000 ms] jit_evaluator_0_kernel_0_serial
[ 60.11% | 2.668 ms] fill_c4_0_kernel_1_range_for
[ 6.06% | 0.269 ms] fill_c4_0_kernel_1_range_for
[ 5.73% | 0.254 ms] fill_c4_0_kernel_1_range_for
[ 5.68% | 0.252 ms] fill_c4_0_kernel_1_range_for
[ 5.61% | 0.249 ms] fill_c4_0_kernel_1_range_for
[ 5.63% | 0.250 ms] fill_c4_0_kernel_1_range_for
[ 5.61% | 0.249 ms] fill_c4_0_kernel_1_range_for
[ 5.59% | 0.248 ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 0.004 s number of records: 9
=========================================================================
=========================================================================
X64 Profiler(count)
=========================================================================
[ % total count | min avg max ] Kernel name
[100.00% 0.033 s 100x | 0.244 0.329 2.970 ms] fill_c4_0_kernel_1_range_for
-------------------------------------------------------------------------
[100.00%] Total kernel execution time: 0.033 s number of records: 1
=========================================================================
note
- jit_evaluator_xxx can be ignored because it is automatically generated by the system.
- Taichi recommends running performance profiling multiple times to observe the minimum or average execution time.
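The relationship between the two printing modes can be sketched in plain Python. The snippet below is an illustration only, not Taichi's implementation: it takes the kernel names and times from the "trace" output above and groups them by kernel name, which is the kind of per-name summary that "count" mode reports. The function name summarize_by_name and the dictionary keys are hypothetical.

```python
from collections import defaultdict

# (kernel name, elapsed ms) pairs copied from the "trace" output above
trace_records = [
    ("jit_evaluator_0_kernel_0_serial", 0.000),
    ("fill_c4_0_kernel_1_range_for", 2.668),
    ("fill_c4_0_kernel_1_range_for", 0.269),
    ("fill_c4_0_kernel_1_range_for", 0.254),
    ("fill_c4_0_kernel_1_range_for", 0.252),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.250),
    ("fill_c4_0_kernel_1_range_for", 0.249),
    ("fill_c4_0_kernel_1_range_for", 0.248),
]


def summarize_by_name(records):
    """Group per-launch records by kernel name and compute summary statistics."""
    grouped = defaultdict(list)
    for name, elapsed_ms in records:
        grouped[name].append(elapsed_ms)

    summary = {}
    for name, times in grouped.items():
        summary[name] = {
            "count": len(times),
            "total_ms": sum(times),
            "min_ms": min(times),
            "avg_ms": sum(times) / len(times),
            "max_ms": max(times),
        }
    return summary


summary = summarize_by_name(trace_records)
for name, stats in summary.items():
    print(
        f"{stats['count']:>3}x | min {stats['min_ms']:.3f}  "
        f"avg {stats['avg_ms']:.3f}  max {stats['max_ms']:.3f} ms | {name}"
    )
```

In this data the first fill launch (2.668 ms) takes roughly ten times longer than the later ones, so the average sits well above the minimum. This is one reason to look at the minimum or the average over many runs rather than at a single launch.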
Advanced mode
KernelProfiler offers an experimental GPU profiling toolkit for the CUDA backend, based on NVIDIA CUPTI, which has minimal and predictable profiling overhead and can record over 6,000 hardware metrics.
Prerequisites for using CUPTI:
- Install the CUDA Toolkit.
- Build Taichi from source with the CUDA Toolkit:
  TAICHI_CMAKE_ARGS="-DTI_WITH_CUDA_TOOLKIT:BOOL=ON" python3 setup.py develop --user
- Resolve the privilege issue of the NVIDIA profiling module (run with sudo to get administrative privileges):
  - Add options nvidia NVreg_RestrictProfilingToAdminUsers=0 to /etc/modprobe.d/nvidia-kernel-common.conf
  - Then reboot; this should resolve the permission issue (you may need to run update-initramfs -u before reboot)
  - See also ERR_NVGPUCTRPERM.
- Add