Allocation

The use of API tracing to elucidate benchmarking results

A benchmark comparing Kokkos with native CUDA can be found in examples.kokkos.view.example_allocation_benchmarking.

To better understand these results, a comprehensive CUDA API tracing test is required. As shown in examples.kokkos.view.example_allocation_tracing.TestNSYS, the code path followed by Kokkos depends on both the memory space in which the allocation happens and its size. The tracing also brings to light that Kokkos always copies the Kokkos::View shared allocation header, which significantly hurts performance compared to allocating manually with CUDA (see also https://github.com/kokkos/kokkos/pull/8440).

Benchmarking

Figure: example_allocation_benchmarking.svg

Comparison of repeated buffer allocation/deallocation using Kokkos or native CUDA, with and without stream-ordered allocation. CUDA 13.0.0, Kokkos 4.7.01, NVIDIA GeForce RTX 5070 Ti, reprospect@3fd1b24. Note that the results may vary with machine setup.

Comparing Kokkos::View allocation against native CUDA implementation

cudaMallocAsync calls are immediately followed by a stream or device synchronization, as seen in https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L209-L220.

Moreover, Kokkos opts not to use cudaMallocAsync when allocation sizes fall below a threshold defined at https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.
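The selection logic can be sketched as follows. This is a minimal Python illustration, not Kokkos source: the threshold value mirrors the TestAllocation.THRESHOLD constant (40000 bytes) documented below, the function name is hypothetical, and the behavior exactly at the threshold is an assumption, since the tests only probe 39000 and 41000 bytes.

```python
# Minimal sketch (not actual Kokkos code): how the allocation path is chosen
# for Kokkos::CudaSpace. THRESHOLD mirrors TestAllocation.THRESHOLD (40000 bytes);
# the function name is hypothetical, and the comparison at exactly the threshold
# is an assumption.
THRESHOLD = 40000  # bytes

def uses_stream_ordered_allocation(size: int, async_enabled: bool = True) -> bool:
    """True if the allocation would go through cudaMallocAsync rather than cudaMalloc."""
    return async_enabled and size > THRESHOLD
```

This mirrors what the tracing tests observe: requests of 41000 bytes take the stream-ordered path, while requests of 39000 bytes fall back to a plain allocation.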

Additionally, at least for its CUDA backend, Kokkos copies a shared allocation header (mainly used for debugging), as seen for instance in https://github.com/kokkos/kokkos/blob/bba2d1f60741b6a2023b36313016c0a0dd125f42/core/src/impl/Kokkos_SharedAlloc.hpp#L325-L327. This copy operation invariably uses cudaMemcpyAsync, as demonstrated in examples.kokkos.view.example_allocation_tracing.TestNSYS. Consequently, Kokkos will always:

  1. allocate buffers that are 128 bytes larger than requested

  2. issue additional API calls

This copy of the shared allocation header has triggered the discussion at https://github.com/kokkos/kokkos/issues/8441.
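The size overhead can be made concrete with a small sketch, based on the 128-byte SharedAllocationHeader size noted above; the helper name is hypothetical.

```python
# Sketch of the size Kokkos actually allocates for a Kokkos::View:
# the requested payload plus the 128-byte SharedAllocationHeader.
# The helper name is hypothetical, not a Kokkos API.
HEADER_SIZE = 128  # bytes, sizeof(Kokkos::Impl::SharedAllocationHeader)

def effective_allocation_size(requested: int) -> int:
    """Bytes Kokkos requests from the CUDA allocator for a given payload size."""
    return requested + HEADER_SIZE
```

For example, a view of 41000 bytes leads to a 41128-byte allocation.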

The benchmark results clearly show that allocating with Kokkos consistently incurs additional overhead compared to a native CUDA implementation.

References:

class examples.kokkos.view.example_allocation_benchmarking.Framework(*values)

Bases: StrEnum

CUDA = 'CUDA'
KOKKOS = 'Kokkos'
__str__()

Return str(self).

class examples.kokkos.view.example_allocation_benchmarking.HandleSubtitle(xpad=0.0, ypad=0.0, update_func=None)

Bases: HandlerBase

create_artists(legend: Legend, orig_handle: Artist, xdescent: float, ydescent: float, width: float, height: float, fontsize: float, trans: Transform) list[Artist]
class examples.kokkos.view.example_allocation_benchmarking.Parameters

Bases: TypedDict

count: int
framework: Framework
size: int
use_async: bool
class examples.kokkos.view.example_allocation_benchmarking.Subtitle(text: str)

Bases: object

__init__(text: str)
get_label() str
class examples.kokkos.view.example_allocation_benchmarking.TestAllocation

Bases: CMakeAwareTestCase

Run the companion executable and make a nice visualization.

PATTERN: Final[Pattern[str]] = re.compile('^With(CUDA|Kokkos)<(true|false)>/((?:cuda|kokkos)(?:_async)?)/count:([0-9]+)/size:([0-9]+)')
THRESHOLD: Final[int] = 40000

Threshold for using stream-ordered allocation, see https://github.com/kokkos/kokkos/blob/146241cf3a68454527994a46ac473861c2b5d4f1/core/src/Cuda/Kokkos_CudaSpace.cpp#L147.

TIME_UNIT: Final = 'ns'

Time unit of the benchmark.

classmethod get_target_name() str
classmethod params(*, name: str) Parameters

Parse the name of a case and return parameters.
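For illustration, a benchmark case name can be decomposed directly with the class's PATTERN, reproduced verbatim below; the sample case name is invented, and params() then turns the captured fields into a Parameters mapping.

```python
import re

# TestAllocation.PATTERN, reproduced verbatim from the class above;
# the sample benchmark case name below is invented for illustration.
PATTERN = re.compile(
    '^With(CUDA|Kokkos)<(true|false)>'
    '/((?:cuda|kokkos)(?:_async)?)/count:([0-9]+)/size:([0-9]+)'
)

match = PATTERN.match('WithKokkos<true>/kokkos_async/count:100/size:41000')
assert match is not None
framework, use_async, variant, count, size = match.groups()
print(framework, use_async, variant, count, size)
# prints: Kokkos true kokkos_async 100 41000
```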

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
raw() dict[str, dict]

Run the benchmark and return the raw JSON-based results.

Warning

Be sure to remove --benchmark_min_time for better-converged results.

results(raw: dict) DataFrame

Processed results.

test_memory_pool_attributes(raw) None

Retrieve the memory pool attributes and check consistency and behavior.

test_visualize(results: DataFrame) None

Create a visualization of the results.

Tracing

class examples.kokkos.view.example_allocation_tracing.Memory(*values)

Bases: StrEnum

DEVICE = 'DEVICE'
SHARED = 'MANAGED'
__str__()

Return str(self).

class examples.kokkos.view.example_allocation_tracing.TestAllocation

Bases: CMakeAwareTestCase

Trace the CUDA API calls during Kokkos::View allocation under different scenarios.

It uses examples/kokkos/view/example_allocation_tracing.cpp.

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNSYS.report().

classmethod get_target_name() str
class examples.kokkos.view.example_allocation_tracing.TestNSYS

Bases: TestAllocation

nsys-focused analysis.

HEADER_SIZE: Final[int] = 128

Size of the Kokkos::Impl::SharedAllocationHeader type, see https://github.com/kokkos/kokkos/blob/c1a715cab26da9407867c6a8c04b2a1d6b2fc7ba/core/src/impl/Kokkos_SharedAlloc.hpp#L23.

checks(*, report: Report, expt_cuda_api_calls_allocation: Sequence[str], expt_cuda_api_calls_deallocation: Sequence[str], selectors: dict[str, ReportPatternSelector | None], memory: Memory, size: int) None
static get_memory_id(report: Report, memory: Memory) int64

Retrieve the id from ENUM_CUDA_MEM_KIND whose name matches memory.

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() Report

Analyse with nsys, using reprospect.tools.nsys.Cacher.

test_above_41000_CudaSpace(report: Report) None

Check what happens above the threshold for Kokkos::CudaSpace (requested size is 41000).

test_above_41000_CudaUVMSpace(report: Report) None

Check what happens above the threshold for Kokkos::CudaUVMSpace (requested size is 41000).

test_under_39000_CudaSpace(report: Report) None

Check what happens under the threshold for Kokkos::CudaSpace (requested size is 39000).

test_under_39000_CudaUVMSpace(report: Report) None

Check what happens under the threshold for Kokkos::CudaUVMSpace (requested size is 39000).