Allocation

The use of API tracing to elucidate benchmarking results

A benchmark comparing Kokkos with native CUDA can be found in examples.kokkos.view.example_allocation_benchmarking.

To better understand these results, a comprehensive CUDA API tracing test is required. As shown in examples.kokkos.view.example_allocation_tracing.TestNSYS, the code path followed by Kokkos depends on both the memory space in which the allocation happens and its size. The tracing also brings to light that Kokkos always copies the Kokkos::View shared allocation header, which significantly hurts performance compared to allocating manually with CUDA (see also https://github.com/kokkos/kokkos/pull/8440).

Benchmarking

Figure: example_allocation_benchmarking.svg

Comparison of repeated buffer allocation/deallocation using Kokkos or native CUDA, with and without stream-ordered allocation. CUDA 13.0.0, Kokkos 4.7.01, NVIDIA GeForce RTX 5070 Ti, reprospect@3fd1b24. Note that the results may vary with machine setup.

Comparing Kokkos::View allocation against native CUDA implementation

cudaMallocAsync calls are immediately followed by a stream or device synchronization, as seen in https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L209-L220.

Moreover, Kokkos opts not to use cudaMallocAsync when allocation sizes fall below a threshold defined at https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.
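The selection logic can be sketched as follows. This is a minimal Python illustration, not Kokkos source: the threshold value mirrors the TestAllocation.THRESHOLD constant (40000 bytes) documented below, the function name is hypothetical, and the behavior exactly at the threshold is an assumption, since the tests only probe 39000 and 41000 bytes.

```python
# Minimal sketch (not actual Kokkos code): how the allocation path is chosen
# for Kokkos::CudaSpace. THRESHOLD mirrors TestAllocation.THRESHOLD (40000 bytes);
# the function name is hypothetical, and the comparison at exactly the threshold
# is an assumption.
THRESHOLD = 40000  # bytes

def uses_stream_ordered_allocation(size: int, async_enabled: bool = True) -> bool:
    """True if the allocation would go through cudaMallocAsync rather than cudaMalloc."""
    return async_enabled and size > THRESHOLD
```

This mirrors what the tracing tests observe: requests of 41000 bytes take the stream-ordered path, while requests of 39000 bytes fall back to a plain allocation.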

Additionally, at least for its CUDA backend, Kokkos copies a shared allocation header (mainly used for debugging), as seen for instance in https://github.com/kokkos/kokkos/blob/bba2d1f60741b6a2023b36313016c0a0dd125f42/core/src/impl/Kokkos_SharedAlloc.hpp#L325-L327. This copy operation invariably uses cudaMemcpyAsync, as demonstrated in examples.kokkos.view.example_allocation_tracing.TestNSYS. Consequently, Kokkos will always:

  1. allocate buffers that are 128 bytes larger than requested

  2. issue additional API calls

This copy of the shared allocation header has triggered the discussion at https://github.com/kokkos/kokkos/issues/8441.
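The size overhead can be made concrete with a small sketch, based on the 128-byte SharedAllocationHeader size noted above; the helper name is hypothetical.

```python
# Sketch of the size Kokkos actually allocates for a Kokkos::View:
# the requested payload plus the 128-byte SharedAllocationHeader.
# The helper name is hypothetical, not a Kokkos API.
HEADER_SIZE = 128  # bytes, sizeof(Kokkos::Impl::SharedAllocationHeader)

def effective_allocation_size(requested: int) -> int:
    """Bytes Kokkos requests from the CUDA allocator for a given payload size."""
    return requested + HEADER_SIZE
```

For example, a view of 41000 bytes leads to a 41128-byte allocation.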

The benchmark results clearly show that allocating with Kokkos consistently incurs additional overhead compared to a native CUDA implementation.

References:

class examples.kokkos.view.example_allocation_benchmarking.Framework(*values)

Bases: StrEnum

CUDA = 'CUDA'
KOKKOS = 'Kokkos'
__str__()

Return str(self).

class examples.kokkos.view.example_allocation_benchmarking.HandleSubtitle(xpad=0.0, ypad=0.0, update_func=None)

Bases: HandlerBase

create_artists(legend: Legend, orig_handle: Artist, xdescent: float, ydescent: float, width: float, height: float, fontsize: float, trans: Transform) list[Artist]
class examples.kokkos.view.example_allocation_benchmarking.Parameters

Bases: TypedDict

count: int
framework: Framework
size: int
use_async: bool
class examples.kokkos.view.example_allocation_benchmarking.Subtitle(text: str)

Bases: object

__init__(text: str)
get_label() str
class examples.kokkos.view.example_allocation_benchmarking.TestAllocation

Bases: CMakeAwareTestCase

Run the companion executable and make a nice visualization.

PATTERN: Final[Pattern[str]] = re.compile('^With(CUDA|Kokkos)<(true|false)>/((?:cuda|kokkos)(?:_async)?)/count:([0-9]+)/size:([0-9]+)')
THRESHOLD: Final[int] = 40000

Threshold for using stream-ordered allocation, see https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L147.

TIME_UNIT: Final = 'ns'

Time unit of the benchmark.

classmethod get_target_name() str
classmethod params(*, name: str) Parameters

Parse the name of a case and return parameters.
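For illustration, a benchmark case name can be decomposed directly with the class's PATTERN, reproduced verbatim below; the sample case name is invented, and params() then turns the captured fields into a Parameters mapping.

```python
import re

# TestAllocation.PATTERN, reproduced verbatim from the class above;
# the sample benchmark case name below is invented for illustration.
PATTERN = re.compile(
    '^With(CUDA|Kokkos)<(true|false)>'
    '/((?:cuda|kokkos)(?:_async)?)/count:([0-9]+)/size:([0-9]+)'
)

match = PATTERN.match('WithKokkos<true>/kokkos_async/count:100/size:41000')
assert match is not None
framework, use_async, variant, count, size = match.groups()
print(framework, use_async, variant, count, size)
# prints: Kokkos true kokkos_async 100 41000
```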

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
raw() dict[str, dict]

Run the benchmark and return the raw JSON-based results.

Warning

Be sure to remove --benchmark_min_time for better-converged results.

results(raw: dict) DataFrame

Processed results.

test_memory_pool_attributes(raw) None

Retrieve the memory pool attributes and check consistency and behavior.

test_visualize(results: DataFrame) None

Create a visualization of the results.

Tracing

class examples.kokkos.view.example_allocation_tracing.Memory(*values)

Bases: StrEnum

DEVICE = 'DEVICE'
SHARED = 'MANAGED'
__str__()

Return str(self).

class examples.kokkos.view.example_allocation_tracing.TestAllocation

Bases: CMakeAwareTestCase

Trace the CUDA API calls during Kokkos::View allocation under different scenarios.

It uses examples/kokkos/view/example_allocation_tracing.cpp.

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNSYS.report().

classmethod get_target_name() str
class examples.kokkos.view.example_allocation_tracing.TestNSYS

Bases: TestAllocation

nsys-focused analysis.

HEADER_SIZE: Final[int] = 128

Size of the Kokkos::Impl::SharedAllocationHeader type, see https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.

checks(*, report: Report, expt_cuda_api_calls_allocation: Sequence[str], expt_cuda_api_calls_deallocation: Sequence[str], selectors: dict[str, ReportPatternSelector | None], memory: Memory, size: int) None
static get_memory_id(report: Report, memory: Memory) int64

Retrieve the id from ENUM_CUDA_MEM_KIND whose name matches memory.

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() Report

Analyse with nsys, using reprospect.tools.nsys.Cacher.

test_above_41000_CudaSpace(report: Report) None

Check what happens above the threshold for Kokkos::CudaSpace (requested size is 41000).

test_above_41000_CudaUVMSpace(report: Report) None

Check what happens above the threshold for Kokkos::CudaUVMSpace (requested size is 41000).

test_under_39000_CudaSpace(report: Report) None

Check what happens under the threshold for Kokkos::CudaSpace (requested size is 39000).

test_under_39000_CudaUVMSpace(report: Report) None

Check what happens under the threshold for Kokkos::CudaUVMSpace (requested size is 39000).