Allocation ========== The use of API tracing to elucidate benchmarking results -------------------------------------------------------- A benchmark comparing :code:`Kokkos` with native CUDA can be found in :py:mod:`examples.kokkos.view.example_allocation_benchmarking`. To better understand the results, a comprehensive CUDA API tracing test is required. Indeed, as tested in :py:class:`examples.kokkos.view.example_allocation_tracing.TestNSYS`, the code path followed by :code:`Kokkos` depends on the memory space in which the allocation happens as well as its size. The tracing also brings to light that :code:`Kokkos` always copies the :code:`Kokkos::View` shared allocation header, which greatly impacts the performance compared to allocating manually with CUDA (see also https://github.com/kokkos/kokkos/pull/8440). Benchmarking ------------ .. figure:: example_allocation_benchmarking.svg Comparison of repeated buffer allocation/deallocation using :code:`Kokkos` or native CUDA, with stream-ordered allocation or not. CUDA 13.0.0, :code:`Kokkos` 4.7.01, NVIDIA GeForce RTX 5070 Ti, :lastcommit:`docs/source/examples/kokkos/view/example_allocation_benchmarking.svg`. Note that the results may vary with machine setup. .. automodule:: examples.kokkos.view.example_allocation_benchmarking Tracing ------- .. automodule:: examples.kokkos.view.example_allocation_tracing