Pointer aliasing, __restrict__ and read-only cached load

Using the __restrict__ qualifier to hint to the compiler that pointers do not alias can bring significant gains, such as:

  1. Using the .CONSTANT modifier for load instructions when the data is read-only.

  2. Reducing redundant memory loads, since the compiler may assume no aliasing between pointers.

See NVI25 for more details.

However, depending on the context in which the __restrict__ qualifier is used, compilers may not be able to take advantage of it.
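
As a minimal illustration, consider the operation below written as a plain __global__ kernel, with and without __restrict__ qualified arguments (a hypothetical sketch corresponding to Method.GLOBAL_KERNEL; the actual test sources may differ):

// Without __restrict__, the compiler must assume dst may alias src_a or
// src_b, so the second statement reloads src_a[i] and src_b[i] after the
// store to dst[i].
__global__ void add_may_alias(int* dst, const int* src_a, const int* src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i]  = src_a[i] + src_b[i];
    dst[i] += src_a[i] + src_b[i]; // src_a[i] and src_b[i] reloaded
}

// With __restrict__, the compiler may keep the loaded values in registers
// and emit a single loads/operation/store sequence.
__global__ void add_no_alias(int* __restrict__ dst,
                             const int* __restrict__ src_a,
                             const int* __restrict__ src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i]  = src_a[i] + src_b[i];
    dst[i] += src_a[i] + src_b[i]; // values reused from registers
}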

This example examines the SASS code generated by:

#define OPERATION(dst, src_a, src_b, index)    \
    dst[index]  = src_a[index] + src_b[index]; \
    dst[index] += src_a[index] + src_b[index];

for the following implementations:

  1. Method.GLOBAL_KERNEL, see TestSASS.test_global_kernel()

  2. Method.RESTRICT_RECAST_LAMBDA, see TestSASS.test_restrict_recast_lambda()

  3. Method.RESTRICT_RECAST_LOCAL, see TestSASS.test_restrict_recast_local()

  4. Method.RESTRICT_ACCESSOR, see TestSASS.test_restrict_accessor()

  5. Method.RESTRICT_MEMBER, see TestSASS.test_restrict_member()

  6. Method.RESTRICT_VIEW_MEMORY_TRAIT, see TestSASS.test_restrict_view_memory_trait()

  7. Method.LDG_ACCESSOR, see TestSASS.test_ldg_accessor()

These implementations either generate a single loads/operation/store sequence when the compiler determines that the pointers do not alias (TestSASS.match_single()), or two such sequences otherwise (TestSASS.match_repeated()). When two sequences are generated, memory traffic increases significantly, as shown in TestNCU.

Strategies that prevent pointer aliasing

The following strategies map to a single sequence, but each has a drawback:

  1. Method.GLOBAL_KERNEL

    Caution

    Writing a __global__ kernel with __restrict__ qualified arguments is not portable.

  2. Method.RESTRICT_RECAST_LAMBDA

    Caution

Requires substantial boilerplate code and significantly hinders readability (see the sketch after this list).

  3. Method.LDG_ACCESSOR

    Caution

    Unsuitable for non-constant data buffers. Not directly related to pointer aliasing, but may have similar effects.

    Note

According to NVIf, __ldg (or any other cache strategy) supports all C++ fundamental types, CUDA vector types (except those with x3 components), and extended floating-point types. However, as of CUDA 13.1, it seems to support only objects up to double2, i.e. up to 128 bits in size. This seems to contradict reprospect.test.features.Memory.max_transaction_size(), which states that 256-bit objects are supported as of reprospect.tools.architecture.NVIDIAFamily.BLACKWELL and CUDA 13.
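
A minimal sketch of the lambda-recast idiom from strategy 2 (a hypothetical shape; the actual FunctorRestrictRecastLambda may differ): the call operator forwards the functor's pointer members to a local lambda whose parameters are __restrict__ qualified, so the no-aliasing guarantee is expressed where compilers reliably honor it, namely on function parameters:

template <typename T>
struct FunctorRestrictRecastLambda
{
    T* dst;
    const T* src_a;
    const T* src_b;

    __host__ __device__ void operator()(const int i) const
    {
        // The __restrict__ qualified lambda parameters let the compiler
        // assume no aliasing inside the lambda body.
        auto apply = [i](T* __restrict__ d,
                         const T* __restrict__ a,
                         const T* __restrict__ b)
        {
            d[i]  = a[i] + b[i];
            d[i] += a[i] + b[i];
        };
        apply(dst, src_a, src_b);
    }
};

This is the boilerplate the caution above refers to: every call operator has to wrap its body in such a lambda.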

.CONSTANT load

To ensure that a global load with the .CONSTANT modifier is emitted, use either of the following (the __ldg path is sketched after the list):

  1. Method.GLOBAL_KERNEL

  2. Method.LDG_ACCESSOR
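
A minimal sketch of the read-only path using the __ldg intrinsic directly (hypothetical; the actual FunctorLDGAccessor wraps this in an accessor):

// __ldg forces a load through the read-only L1/Tex cache path, which
// typically shows up as LDG.E.CONSTANT in SASS. It must only be used for
// data that is read-only for the lifetime of the kernel.
__global__ void add_ldg(int* dst, const int* src_a, const int* src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int a = __ldg(src_a + i); // read-only cached load
    const int b = __ldg(src_b + i);
    dst[i]  = a + b;
    dst[i] += a + b; // a and b stay in registers; no reload needed
}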

Strategies that fail to prevent pointer aliasing

The following strategies cannot be used to avoid pointer aliasing:

  1. Method.RESTRICT_RECAST_LOCAL

It does not work across all compilers (see the sketch after this list).

  2. Method.RESTRICT_ACCESSOR

    Writing a custom accessor, e.g. for std::mdspan, does not work.

    This is the strategy employed in PR 8151.

  3. Method.RESTRICT_MEMBER

This strategy is also ineffective (sketched after this list), as already noted in:

  4. Method.RESTRICT_VIEW_MEMORY_TRAIT

    The implementation is similar to Method.RESTRICT_ACCESSOR.
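
Hedged sketches of two of the ineffective variants (hypothetical shapes; the actual functors may differ):

// Method.RESTRICT_RECAST_LOCAL: recast through __restrict__ qualified
// local variables. Per TestSASS.test_restrict_recast_local(), nvcc emits
// a single sequence, but clang still emits two.
template <typename T>
struct FunctorRestrictRecastLocal
{
    T* dst;
    const T* src_a;
    const T* src_b;

    __host__ __device__ void operator()(const int i) const
    {
        T* __restrict__ d = dst;
        const T* __restrict__ a = src_a;
        const T* __restrict__ b = src_b;
        d[i]  = a[i] + b[i];
        d[i] += a[i] + b[i];
    }
};

// Method.RESTRICT_MEMBER: the qualifier sits on the data members, but all
// accesses still go through the same `this` pointer, so both
// loads/add/store sequences are still emitted.
template <typename T>
struct FunctorRestrictMember
{
    T* __restrict__ dst;
    const T* __restrict__ src_a;
    const T* __restrict__ src_b;

    __host__ __device__ void operator()(const int i) const
    {
        dst[i]  = src_a[i] + src_b[i];
        dst[i] += src_a[i] + src_b[i];
    }
};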

class examples.kokkos.view.example_restrict.Method(*values)

Bases: StrEnum

GLOBAL_KERNEL = 'global_kernel'

Use a __global__ kernel with __restrict__ qualified pointer arguments.

LDG_ACCESSOR = 'FunctorLDGAccessor'

Use an ldg accessor that uses __ldg to enforce a read-only L1/Tex cache load.

RESTRICT_ACCESSOR = 'FunctorRestrictAccessor'

Use a restrict accessor that declares the reference type as __restrict__.

RESTRICT_MEMBER = 'FunctorRestrictMember'

Declare pointer members as __restrict__.

RESTRICT_RECAST_LAMBDA = 'FunctorRestrictRecastLambda'

Recast pointers to __restrict__ qualified pointers through an intermediate lambda within the call operator.

RESTRICT_RECAST_LOCAL = 'FunctorRestrictRecastLocal'

Recast pointers to __restrict__ qualified pointers through intermediate local variables within the call operator.

RESTRICT_VIEW_MEMORY_TRAIT = 'FunctorRestrictViewMemoryTrait'

Use the Kokkos::Restrict memory trait.

__str__()

Return str(self).

class examples.kokkos.view.example_restrict.TestNCU

Bases: TestRestrict

Kernel profiling.

ELEMENT_COUNT: Final[int] = 128
ELEMENT_SIZE: Final[int] = 4
METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_st', pretty_name='L1/TEX cache global store sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
NVTX_INCLUDES: Final[tuple[str, ...]] = ('global_kernel/', 'restrict_recast_lambda/', 'restrict_recast_local/', 'restrict_accessor/', 'restrict_member/', 'restrict_view_memory_trait/', 'ldg_accessor/')
SECTOR_COUNT_LOAD: Final[int] = 32

Expected number of load sectors requested for the single sequence scenario.

SECTOR_COUNT_STORE: Final[int] = 16

Expected number of store sectors for the single sequence scenario.

SECTOR_SIZE: Final[int] = 32
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() → Report
results(report: Report) → ProfilingResults
test(method: Method, results: ProfilingResults) → None

If there are two sequences, twice SECTOR_COUNT_LOAD sectors are requested, but half of them are served from the cache.
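
A back-of-the-envelope check of these constants (a hypothetical sketch, not part of the test sources):

// Each buffer holds ELEMENT_COUNT * ELEMENT_SIZE = 128 * 4 = 512 bytes,
// i.e. 16 sectors of SECTOR_SIZE = 32 bytes.
constexpr int element_count = 128; // ELEMENT_COUNT
constexpr int element_size  = 4;   // ELEMENT_SIZE, sizeof(int)
constexpr int sector_size   = 32;  // SECTOR_SIZE
constexpr int sectors_per_buffer = element_count * element_size / sector_size;

static_assert(2 * sectors_per_buffer == 32); // SECTOR_COUNT_LOAD: two source buffers
static_assert(sectors_per_buffer == 16);     // SECTOR_COUNT_STORE: one destination buffer

With two sequences, 2 × 32 = 64 load sectors are requested, of which roughly half hit in the L1/TEX cache.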

class examples.kokkos.view.example_restrict.TestRestrict

Bases: CMakeAwareTestCase

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNCU.report().

SCALAR_TYPE: Final[str] = 'int'
SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::view::{functor}<{scalar}>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'

Signature template when Kokkos::parallel_for is used.

property cubin: Path
classmethod get_target_name() → str
signature() → dict[Method, str]
class examples.kokkos.view.example_restrict.TestSASS

Bases: TestRestrict

Binary analysis.

cuobjdump() → CuObjDump
decoder(signature: dict[Method, str], cuobjdump: CuObjDump) → dict[Method, Decoder]
match_repeated(cfg: Graph) → bool

Match two loads/add/store sequences, such as:

LDG.E R0, desc[UR4][R2.64]
LDG.E R17, desc[UR4][R4.64]
...
IADD3 R17, PT, PT, R0, R17, RZ
...
STG.E desc[UR4][R6.64], R17
...
LDG.E R0, desc[UR4][R2.64]
LDG.E R8, desc[UR4][R4.64]
...
IADD3 R9, PT, PT, R17, R8, R0
...
STG.E desc[UR4][R6.64], R9
match_single(*, readonly: bool, cfg: Graph) → bool

Match a single loads/add/store sequence, such as:

LDG.E.CONSTANT R2, desc[UR4][R2.64]
LDG.E.CONSTANT R5, desc[UR4][R4.64]
...
IADD3 R0, PT, PT, R2, R5, RZ
IADD3 R9, PT, PT, R0, R0, RZ
...
STG.E desc[UR4][R6.64], R9
test_global_kernel(decoder: dict[Method, Decoder]) → None

Test for Method.GLOBAL_KERNEL.

It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.

test_ldg_accessor(decoder: dict[Method, Decoder]) → None

Test for Method.LDG_ACCESSOR.

It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.

test_restrict_accessor(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_ACCESSOR.

It generates two loads/add/store sequences.

test_restrict_member(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_MEMBER.

It generates two loads/add/store sequences.

test_restrict_recast_lambda(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_RECAST_LAMBDA.

It generates a single loads/add/store sequence, but misses the .CONSTANT modifier on recent architectures.

test_restrict_recast_local(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_RECAST_LOCAL.

With nvcc, it generates a single loads/add/store sequence and always uses the .CONSTANT load path. With clang, it generates two loads/add/store sequences.

test_restrict_view_memory_trait(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_VIEW_MEMORY_TRAIT.

It generates two loads/add/store sequences.