Alignment
Because std::complex lacks __host__ __device__ annotations, the Kokkos library
provides its own annotated implementation Kokkos::complex.
However, Kokkos::complex is not a drop-in replacement for std::complex.
Beyond the annotations, there are other differences, including:
Alignment: Whereas
std::complexis aligned to only the size of the underlying real type,Kokkos::complexis aligned to twice the size of the underlying real type by default (PR #2259).Division operator:
Kokkos::complexcarries out the complex number division in a special way that provides better overflow protection (Issue #7618).Operator overloads: There are also differences in allowed member and non-member operator overloads (PR #8212).
This example analyzes the impact of the specified alignment of Kokkos::complex.
It analyzes a kernel that reads from and writes to contiguous arrays of 128-bit structures:
Type
sizeof[bytes]
alignof[bytes]custom
16
8
Kokkos::complex<double>16
16
The example proceeds in two steps:
TestSASSexamines compiler-generated instructions from the binary.It shows that with specified alignment, reading or writing a
Kokkos::complex<double>value in global memory compiles to 128-bit instructions.By contrast, with the default alignment, the compiler generates two successive 64-bit instructions, the first for the real part, and the second for the imaginary part.
TestNCUperforms a kernel profiling.Default alignment leads to twice more sector requests than specified alignment (see
TestNCU.test_l1tex_memory_traffic_sector_count()).However, the additional sector requests with default alignment hit in L1 cache (see
TestNCU.test_l2_memory_traffic_sector_count()).
See also:
- class examples.kokkos.complex.example_alignment.Alignment(*values)View on GitHub
Bases:
StrEnum- DEFAULT = 'default'
- SPECIFIED = 'specified'
- __str__()
Return str(self).
- class examples.kokkos.complex.example_alignment.TestAlignmentView on GitHub
Bases:
CMakeAwareTestCaseThe kernel under study loads 3 arrays of
ELEMENT_COUNTelements and writes to 1 array ofELEMENT_COUNTelements.The threads work on consecutive elements of the arrays, in such a way that each thread reads a total of 3 elements and writes 1 element.
- KOKKOS_TOOLS_NVTX_CONNECTOR_LIB
Used in
TestNCU.report().
- classmethod get_target_name() strView on GitHub
- class examples.kokkos.complex.example_alignment.TestNCUView on GitHub
Bases:
TestAlignmentKernel profiling.
- METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='smsp__inst_executed', pretty_name='smsp__inst_executed', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCorrelation(name='sass__inst_executed_per_opcode'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_global_st', pretty_name='L1/TEX cache global store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='smsp__sass_inst_executed_op_local_st', pretty_name='L1/TEX cache local store instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
- metrics(report: Report) dict[Alignment, ProfilingMetrics]View on GitHub
- pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
- report() ReportView on GitHub
- test_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub
With specified alignment, half the load/store instructions are executed. Other instruction counts remain unchanged.
- test_l1tex_memory_traffic_instruction_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub
Runtime behavior corresponding to
TestSASS.test_global_memory_instructions().
- test_l1tex_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub
default
The real parts are read first, and then the imaginary parts, thus requiring each sector to be read again.
specified
The loads are coalesced into memory transactions of at least 32 bytes. Two consecutive complex double values are always loaded together in a single sector load.
- test_l2_memory_traffic_sector_count(metrics: dict[Alignment, ProfilingMetrics]) NoneView on GitHub
The traffic out to L2 and out to DRAM is the same with both default and specialized alignments.
With the default alignment, there are two consecutive loads, but the second load concerns the same sector as the first load and can thus be expected to hit in L1 cache.
- class examples.kokkos.complex.example_alignment.TestSASSView on GitHub
Bases:
TestAlignmentBinary analysis.
- SIGNATURE: Final[dict[Alignment, Pattern[str]]] = {Alignment.DEFAULT: re.compile('MulAddFunctor<Kokkos::View<reprospect::examples::kokkos::complex::Complex<double>\\s*\\*, Kokkos::CudaSpace>>'), Alignment.SPECIFIED: re.compile('MulAddFunctor<Kokkos::View<Kokkos::complex<double>\\s*\\*, Kokkos::CudaSpace>>')}
- property cubin: PathView on GitHub
- cuobjdump() CuObjDumpView on GitHub
- test_global_memory_instructions(decoder: dict[Alignment, Decoder]) NoneView on GitHub
Check the type and count of global load and store instructions used.
default
Each element is loaded/stored with 2 instructions of 64 bits.
specified
Each element is loaded/stored with a single instruction of 128 bits.