Allocation
Using API tracing to elucidate benchmarking results
A benchmark comparing Kokkos with native CUDA can be found in examples.kokkos.view.example_allocation_benchmarking.
To better understand the results, a comprehensive CUDA API tracing test is required.
Indeed, as tested in examples.kokkos.view.example_allocation_tracing.TestNSYS,
the code path followed by Kokkos depends on the memory space in which the allocation happens
as well as its size.
The tracing also reveals that Kokkos always copies the Kokkos::View shared allocation
header, which significantly hurts performance compared to allocating manually with CUDA (see also https://github.com/kokkos/kokkos/pull/8440).
Benchmarking
Comparison of repeated buffer allocation/deallocation using Kokkos or native CUDA, with and without stream-ordered allocation.
CUDA 13.0.0, Kokkos 4.7.01, NVIDIA GeForce RTX 5070 Ti, reprospect@3fd1b24.
Note that the results may vary with machine setup.
Comparing Kokkos::View allocation against a native CUDA implementation
cudaMallocAsync calls are immediately followed by a stream or device synchronization, as seen in
https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L209-L220.
Moreover, Kokkos opts not to use cudaMallocAsync when allocation sizes fall below a threshold defined at
https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.
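As a rough mental model of this decision, the expected API calls can be sketched as follows. This is a simplification for illustration only: the authoritative branch lives in the Kokkos sources linked above, and `allocation_calls` is a hypothetical helper, not a Kokkos API.

```python
THRESHOLD = 40000  # bytes; below this, Kokkos avoids stream-ordered allocation


def allocation_calls(size: int) -> list[str]:
    """Model of the CUDA API calls for a Kokkos::CudaSpace allocation of `size` bytes."""
    if size < THRESHOLD:
        # Plain allocation below the threshold.
        return ['cudaMalloc']
    # Stream-ordered allocation, immediately fenced by a stream (or device) synchronization.
    return ['cudaMallocAsync', 'cudaStreamSynchronize']


print(allocation_calls(39000))  # ['cudaMalloc']
print(allocation_calls(41000))  # ['cudaMallocAsync', 'cudaStreamSynchronize']
```

The sizes 39000 and 41000 are the ones exercised by the tracing tests below, straddling the 40000-byte threshold.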
Additionally, at least for its CUDA backend, Kokkos copies a shared allocation header (mainly used for debugging), as seen e.g. in
https://github.com/kokkos/kokkos/blob/bba2d1f60741b6a2023b36313016c0a0dd125f42/core/src/impl/Kokkos_SharedAlloc.hpp#L325-L327.
This copy operation invariably uses cudaMemcpyAsync, as demonstrated in examples.kokkos.view.example_allocation_tracing.TestNSYS.
Consequently, Kokkos will always:
- allocate buffers that are 128 bytes larger than requested;
- incur additional API calls.
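The size overhead amounts to a simple back-of-the-envelope calculation (simplified: any alignment or padding beyond the header is ignored, and `allocated_bytes` is an illustrative helper, not a Kokkos API):

```python
HEADER_SIZE = 128  # bytes; size of Kokkos::Impl::SharedAllocationHeader


def allocated_bytes(requested: int) -> int:
    """Bytes actually allocated for a requested Kokkos::View payload (simplified)."""
    # The shared allocation header is prepended to the user data.
    return HEADER_SIZE + requested


print(allocated_bytes(41000))  # 41128
```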
This copy of the shared allocation header has triggered the discussion at https://github.com/kokkos/kokkos/issues/8441.
The benchmark results clearly show that allocating with Kokkos consistently incurs additional overhead compared to a native CUDA implementation.
References:
- class examples.kokkos.view.example_allocation_benchmarking.Framework(*values)
  Bases: StrEnum
  - CUDA = 'CUDA'
  - KOKKOS = 'Kokkos'
  - __str__()
    Return str(self).
- class examples.kokkos.view.example_allocation_benchmarking.HandleSubtitle(xpad=0.0, ypad=0.0, update_func=None)
  Bases: HandlerBase
- class examples.kokkos.view.example_allocation_benchmarking.Parameters
  Bases: TypedDict
- class examples.kokkos.view.example_allocation_benchmarking.Subtitle(text: str)
  Bases: object
  - __init__(text: str)
  - get_label() → str
- class examples.kokkos.view.example_allocation_benchmarking.TestAllocation
  Bases: CMakeAwareTestCase
  Run the companion executable and make a nice visualization.
  - PATTERN: Final[Pattern[str]] = re.compile('^With(CUDA|Kokkos)<(true|false)>/((?:cuda|kokkos)(?:_async)?)/count:([0-9]+)/size:([0-9]+)')
  - THRESHOLD: Final[int] = 40000
    Threshold for using stream-ordered allocation, see https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L147.
  - classmethod get_target_name() → str
  - classmethod params(*, name: str) → Parameters
    Parse the name of a case and return parameters.
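For illustration, here is how a benchmark case name matching PATTERN can be parsed. The case name is invented for the example; the meaning of the boolean template parameter is defined by the companion executable.

```python
import re

# Same regular expression as TestAllocation.PATTERN above.
PATTERN = re.compile(
    r'^With(CUDA|Kokkos)<(true|false)>'
    r'/((?:cuda|kokkos)(?:_async)?)'
    r'/count:([0-9]+)/size:([0-9]+)'
)

# Hypothetical case name, for illustration only.
match = PATTERN.match('WithKokkos<true>/kokkos_async/count:100/size:41000')
assert match is not None
framework, flag, variant, count, size = match.groups()
print(framework, variant, int(count), int(size))  # Kokkos kokkos_async 100 41000
```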
  - pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
  - raw() → dict[str, dict]
    Run the benchmark and return the raw JSON-based results.
    Warning: be sure to remove --benchmark_min_time for better-converged results.
  - results(raw: dict) → DataFrame
    Processed results.
  - test_memory_pool_attributes(raw) → None
    Retrieve the memory pool attributes and check consistency and behavior.
  - test_visualize(results: DataFrame) → None
    Create a visualization of the results.
Tracing
- class examples.kokkos.view.example_allocation_tracing.Memory(*values)
  Bases: StrEnum
  - DEVICE = 'DEVICE'
  - SHARED = 'MANAGED'
  - __str__()
    Return str(self).
- class examples.kokkos.view.example_allocation_tracing.TestAllocation
  Bases: CMakeAwareTestCase
  Trace the CUDA API calls during Kokkos::View allocation under different scenarios. It uses examples/kokkos/view/example_allocation_tracing.cpp.
  - KOKKOS_TOOLS_NVTX_CONNECTOR_LIB
    Used in TestNSYS.report().
  - classmethod get_target_name() → str
- class examples.kokkos.view.example_allocation_tracing.TestNSYS
  Bases: TestAllocation
  nsys-focused analysis.
  - HEADER_SIZE: Final[int] = 128
    Size of the Kokkos::Impl::SharedAllocationHeader type, see https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.
  - checks(*, report: Report, expt_cuda_api_calls_allocation: Sequence[str], expt_cuda_api_calls_deallocation: Sequence[str], selectors: dict[str, ReportPatternSelector | None], memory: Memory, size: int) → None
  - static get_memory_id(report: Report, memory: Memory) → int64
    Retrieve the id from ENUM_CUDA_MEM_KIND whose name matches memory.
  - pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
  - report() → Report
    Analyse with nsys; uses reprospect.tools.nsys.Cacher.
  - test_above_41000_CudaSpace(report: Report) → None
    Check what happens above the threshold for Kokkos::CudaSpace (requested size is 41000).
  - test_above_41000_CudaUVMSpace(report: Report) → None
    Check what happens above the threshold for Kokkos::CudaUVMSpace (requested size is 41000).
  - test_under_39000_CudaSpace(report: Report) → None
    Check what happens under the threshold for Kokkos::CudaSpace (requested size is 39000).
  - test_under_39000_CudaUVMSpace(report: Report) → None
    Check what happens under the threshold for Kokkos::CudaUVMSpace (requested size is 39000).