Alignment

Because std::complex lacks __host__ __device__ annotations, the Kokkos library provides its own annotated implementation Kokkos::complex.

However, Kokkos::complex is not a drop-in replacement for std::complex. Beyond the annotations, there are other differences, including:

  • Alignment: Whereas std::complex is aligned to only the size of the underlying real type, Kokkos::complex is aligned to twice the size of the underlying real type by default (PR #2259).

  • Division operator: Kokkos::complex carries out the complex number division in a special way that provides better overflow protection (Issue #7618).

  • Operator overloads: There are also differences in allowed member and non-member operator overloads (PR #8212).

This example analyzes the impact of the specified alignment of Kokkos::complex. It analyzes a kernel that reads from and writes to contiguous arrays of 128-bit structures:

Type

sizeof [bytes]

alignof [bytes]

Alignment.DEFAULT

custom

16

8

Alignment.SPECIFIED

Kokkos::complex<double>

16

16

The example proceeds in two steps:

  1. TestSASS examines compiler-generated instructions from the binary.

    It shows that with specified alignment, reading or writing a Kokkos::complex<double> value in global memory compiles to 128-bit instructions.

    By contrast, with the default alignment, the compiler generates two successive 64-bit instructions, the first for the real part, and the second for the imaginary part.

  2. TestNCU performs a kernel profiling.

See also:

class examples.kokkos.complex.example_alignment.Alignment(*values)View on GitHub

Bases: StrEnum

DEFAULT = 'default'
SPECIFIED = 'specified'
__str__()

Return str(self).

class examples.kokkos.complex.example_alignment.TestAlignmentView on GitHub

Bases: CMakeAwareTestCase

The kernel under study loads 3 arrays of ELEMENT_COUNT elements and writes to 1 array of ELEMENT_COUNT elements.

The threads work on consecutive elements of the arrays, in such a way that each thread reads a total of 3 elements and writes 1 element.

COMPLEX_DOUBLE_SIZE: Final[int] = 16
ELEMENT_COUNT: Final[int] = 1024
KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNCU.report().

LOAD_COUNT: Final[int] = 3
SECTOR_SIZE: Final[int] = 32
STORE_COUNT: Final[int] = 1
WARP_COUNT: Final[int] = 32
WARP_SIZE: Final[int] = 32
classmethod get_target_name() strView on GitHub
class examples.kokkos.complex.example_alignment.TestNCUView on GitHub

Bases: TestAlignment

Kernel profiling.

METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='smsp__inst_executed', pretty_name='smsp__inst_executed', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCorrelation(name='sass__inst_executed_per_opcode'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_global_st', pretty_name='L1/TEX cache global store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_local_st', pretty_name='L1/TEX cache local store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
NVTX_INCLUDES: Final[tuple[str, ...]] = ('alignment',)
metrics(report: Report) dict[Alignment, ProfilingMetrics]View on GitHub
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() ReportView on GitHub
test_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub

With specified alignment, half the load/store instructions are executed. Other instruction counts remain unchanged.

test_l1tex_memory_traffic_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub

Runtime behavior corresponding to TestSASS.test_global_memory_instructions().

test_l1tex_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub

default

The real parts are read first, and then the imaginary parts, thus requiring each sector to be read again.

specified

The loads are coalesced into memory transactions of at least 32 bytes. Two consecutive complex double values are always loaded together in a single sector load.

test_l2_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub

The traffic out to L2 and out to DRAM is the same with both default and specialized alignments.

With the default alignment, there are two consecutive loads, but the second load concerns the same sector as the first load and can thus be expected to hit in L1 cache.

class examples.kokkos.complex.example_alignment.TestSASSView on GitHub

Bases: TestAlignment

Binary analysis.

SIGNATURE: Final[dict[Alignment, Pattern[str]]] = {Alignment.DEFAULT: re.compile('MulAddFunctor<Kokkos::View<reprospect::examples::kokkos::complex::Complex<double>\\s*\\*, Kokkos::CudaSpace>>'), Alignment.SPECIFIED: re.compile('MulAddFunctor<Kokkos::View<Kokkos::complex<double>\\s*\\*, Kokkos::CudaSpace>>')}
property cubin: PathView on GitHub
cuobjdump() CuObjDumpView on GitHub
decoder(cuobjdump: CuObjDump) dict[Alignment, Decoder]View on GitHub
test_global_memory_instructions(decoder: dict[Alignment, Decoder]) NoneView on GitHub

Check the type and count of global load and store instructions used.

default

Each element is loaded/stored with 2 instructions of 64 bits.

specified

Each element is loaded/stored with a single instruction of 128 bits.