Alignment

Because std::complex lacks __host__ __device__ annotations, the Kokkos library provides its own annotated implementation Kokkos::complex.

However, Kokkos::complex is not a drop-in replacement for std::complex. Beyond the annotations, there are other differences, including:

Alignment: Whereas std::complex is aligned to only the size of the underlying real type, Kokkos::complex is aligned to twice the size of the underlying real type by default (PR #2259).
Division operator: Kokkos::complex carries out the complex number division in a special way that provides better overflow protection (Issue #7618).
Operator overloads: There are also differences in allowed member and non-member operator overloads (PR #8212).

This example analyzes the impact of the specified alignment of Kokkos::complex. It analyzes a kernel that reads from and writes to contiguous arrays of 128-bit structures:

Type

sizeof [bytes]

alignof [bytes]

Alignment.DEFAULT

custom

16

8

Alignment.SPECIFIED

Kokkos::complex<double>

16

16

The example proceeds in two steps:

TestSASS examines compiler-generated instructions from the binary.

It shows that with specified alignment, reading or writing a Kokkos::complex<double> value in global memory compiles to 128-bit instructions.

By contrast, with the default alignment, the compiler generates two successive 64-bit instructions, the first for the real part, and the second for the imaginary part.
TestNCU performs a kernel profiling.
- Default alignment leads to twice more sector requests than specified alignment (see TestNCU.test_l1tex_memory_traffic_sector_count()).
- However, the additional sector requests with default alignment hit in L1 cache (see TestNCU.test_l2_memory_traffic_sector_count()).

See also:

https://github.com/NVIDIA/cuda-samples/blob/c94ff366aed18c797b8a85dfaac7817b0228b420/Samples/6_Performance/alignedTypes/alignedTypes.cu

class examples.kokkos.complex.example_alignment.Alignment(*values)View on GitHub 

Bases: StrEnum

DEFAULT = 'default'

SPECIFIED = 'specified'

__str__(): Return str(self).

class examples.kokkos.complex.example_alignment.TestAlignmentView on GitHub 

Bases: CMakeAwareTestCase

The kernel under study loads 3 arrays of ELEMENT_COUNT elements and writes to 1 array of ELEMENT_COUNT elements.

The threads work on consecutive elements of the arrays, in such a way that each thread reads a total of 3 elements and writes 1 element.

COMPLEX_DOUBLE_SIZE: Final[int] = 16

ELEMENT_COUNT: Final[int] = 1024

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB: Used in TestNCU.report().

LOAD_COUNT: Final[int] = 3

SECTOR_SIZE: Final[int] = 32

STORE_COUNT: Final[int] = 1

WARP_COUNT: Final[int] = 32

WARP_SIZE: Final[int] = 32

classmethod get_target_name() → strView on GitHub 

class examples.kokkos.complex.example_alignment.TestNCUView on GitHub 

Bases: TestAlignment

Kernel profiling.

METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='smsp__inst_executed', pretty_name='smsp__inst_executed', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCorrelation(name='sass__inst_executed_per_opcode'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_global_st', pretty_name='L1/TEX cache global store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_local_st', pretty_name='L1/TEX cache local store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))

NVTX_INCLUDES: Final[tuple[str, ...]] = ('alignment',)

metrics(report: Report) → dict[Alignment, ProfilingMetrics]View on GitHub 

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]

report() → ReportView on GitHub 

test_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) → NoneView on GitHub : With specified alignment, half the load/store instructions are executed. Other instruction counts remain unchanged.

test_l1tex_memory_traffic_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) → NoneView on GitHub : Runtime behavior corresponding to TestSASS.test_global_memory_instructions().

test_l1tex_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) → NoneView on GitHub 

default	The real parts are read first, and then the imaginary parts, thus requiring each sector to be read again.
specified	The loads are coalesced into memory transactions of at least 32 bytes. Two consecutive complex double values are always loaded together in a single sector load.

test_l2_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) → NoneView on GitHub 

The traffic out to L2 and out to DRAM is the same with both default and specialized alignments.

With the default alignment, there are two consecutive loads, but the second load concerns the same sector as the first load and can thus be expected to hit in L1 cache.

class examples.kokkos.complex.example_alignment.TestSASSView on GitHub 

Bases: TestAlignment

Binary analysis.

SIGNATURE: Final[dict[Alignment, Pattern[str]]] = {Alignment.DEFAULT: re.compile('MulAddFunctor<Kokkos::View<reprospect::examples::kokkos::complex::Complex<double>\\s*\\*, Kokkos::CudaSpace>>'), Alignment.SPECIFIED: re.compile('MulAddFunctor<Kokkos::View<Kokkos::complex<double>\\s*\\*, Kokkos::CudaSpace>>')}

property cubin: PathView on GitHub 

cuobjdump() → CuObjDumpView on GitHub 

decoder(cuobjdump: CuObjDump) → dict[Alignment, Decoder]View on GitHub 

test_global_memory_instructions(decoder: dict[Alignment, Decoder]) → NoneView on GitHub 

Check the type and count of global load and store instructions used.

default	Each element is loaded/stored with 2 instructions of 64 bits.
specified	Each element is loaded/stored with a single instruction of 128 bits.