Pointer aliasing, __restrict__ and read-only cached load

Using the __restrict__ qualifier to hint to the compiler that pointers do not alias can bring significant gains, such as:

  1. Using the .CONSTANT modifier for load instructions when the data is read-only.

  2. Reducing redundant memory loads, since the compiler may assume no aliasing between pointers.

See NVI25 for more details.

However, depending on the context in which the __restrict__ qualifier is used, compilers may not be able to take advantage of it.
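
As a minimal illustration, consider the operation below written as a plain __global__ kernel, with and without __restrict__ qualified arguments (a hypothetical sketch corresponding to Method.GLOBAL_KERNEL; the actual test sources may differ):

// Without __restrict__, the compiler must assume dst may alias src_a or
// src_b, so the second statement reloads src_a[i] and src_b[i] after the
// store to dst[i].
__global__ void add_may_alias(int* dst, const int* src_a, const int* src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i]  = src_a[i] + src_b[i];
    dst[i] += src_a[i] + src_b[i]; // src_a[i] and src_b[i] reloaded
}

// With __restrict__, the compiler may keep the loaded values in registers
// and emit a single loads/operation/store sequence.
__global__ void add_no_alias(int* __restrict__ dst,
                             const int* __restrict__ src_a,
                             const int* __restrict__ src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i]  = src_a[i] + src_b[i];
    dst[i] += src_a[i] + src_b[i]; // values reused from registers
}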

This example examines the SASS code generated by:

#define OPERATION(dst, src_a, src_b, index)    \
    dst[index]  = src_a[index] + src_b[index]; \
    dst[index] += src_a[index] + src_b[index];

for the following implementations:

  1. Method.GLOBAL_KERNEL, see TestSASS.test_global_kernel()

  2. Method.RESTRICT_RECAST_LAMBDA, see TestSASS.test_restrict_recast_lambda()

  3. Method.RESTRICT_RECAST_LOCAL, see TestSASS.test_restrict_recast_local()

  4. Method.RESTRICT_ACCESSOR, see TestSASS.test_restrict_accessor()

  5. Method.RESTRICT_MEMBER, see TestSASS.test_restrict_member()

  6. Method.RESTRICT_VIEW_MEMORY_TRAIT, see TestSASS.test_restrict_view_memory_trait()

  7. Method.LDG_ACCESSOR, see TestSASS.test_ldg_accessor()

These implementations either generate a single loads/operation/store sequence when the compiler determines that the pointers do not alias (TestSASS.match_single()), or two such sequences otherwise (TestSASS.match_repeated()). When two sequences are generated, memory traffic increases significantly, as shown in TestNCU.

Strategies that prevent pointer aliasing

The following strategies map to a single sequence, but each has a drawback:

  1. Method.GLOBAL_KERNEL

    Caution

    Writing a __global__ kernel with __restrict__ qualified arguments is not portable.

  2. Method.RESTRICT_RECAST_LAMBDA

    Caution

Requires substantial boilerplate code and significantly hinders readability (see the sketch after this list).

  3. Method.LDG_ACCESSOR

    Caution

    Unsuitable for non-constant data buffers. Not directly related to pointer aliasing, but may have similar effects.

    Note

According to NVIf, __ldg (or any other cache strategy) supports all C++ fundamental types, CUDA vector types (except those with x3 components), and extended floating-point types. However, as of CUDA 13.1, it seems to support only objects up to double2, i.e. up to 128 bits in size. This seems to contradict reprospect.test.features.Memory.max_transaction_size(), which states that 256-bit objects are supported as of reprospect.tools.architecture.NVIDIAFamily.BLACKWELL and CUDA 13.
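
A minimal sketch of the lambda-recast idiom from strategy 2 (a hypothetical shape; the actual FunctorRestrictRecastLambda may differ): the call operator forwards the functor's pointer members to a local lambda whose parameters are __restrict__ qualified, so the no-aliasing guarantee is expressed where compilers reliably honor it, namely on function parameters:

template <typename T>
struct FunctorRestrictRecastLambda
{
    T* dst;
    const T* src_a;
    const T* src_b;

    __host__ __device__ void operator()(const int i) const
    {
        // The __restrict__ qualified lambda parameters let the compiler
        // assume no aliasing inside the lambda body.
        auto apply = [i](T* __restrict__ d,
                         const T* __restrict__ a,
                         const T* __restrict__ b)
        {
            d[i]  = a[i] + b[i];
            d[i] += a[i] + b[i];
        };
        apply(dst, src_a, src_b);
    }
};

This is the boilerplate the caution above refers to: every call operator has to wrap its body in such a lambda.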

.CONSTANT load

To ensure that a global load with the .CONSTANT modifier is emitted, use either of the following (the __ldg path is sketched after the list):

  1. Method.GLOBAL_KERNEL

  2. Method.LDG_ACCESSOR
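
A minimal sketch of the read-only path using the __ldg intrinsic directly (hypothetical; the actual FunctorLDGAccessor wraps this in an accessor):

// __ldg forces a load through the read-only L1/Tex cache path, which
// typically shows up as LDG.E.CONSTANT in SASS. It must only be used for
// data that is read-only for the lifetime of the kernel.
__global__ void add_ldg(int* dst, const int* src_a, const int* src_b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int a = __ldg(src_a + i); // read-only cached load
    const int b = __ldg(src_b + i);
    dst[i]  = a + b;
    dst[i] += a + b; // a and b stay in registers; no reload needed
}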

Strategies that fail to prevent pointer aliasing

The following strategies cannot be used to avoid pointer aliasing:

  1. Method.RESTRICT_RECAST_LOCAL

It does not work across all compilers (see the sketch after this list).

  2. Method.RESTRICT_ACCESSOR

    Writing a custom accessor, e.g. for std::mdspan, does not work.

    This is the strategy employed in PR 8151.

  3. Method.RESTRICT_MEMBER

This strategy is also ineffective (sketched after this list), as already noted in:

  4. Method.RESTRICT_VIEW_MEMORY_TRAIT

    The implementation is similar to Method.RESTRICT_ACCESSOR.
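
Hedged sketches of two of the ineffective variants (hypothetical shapes; the actual functors may differ):

// Method.RESTRICT_RECAST_LOCAL: recast through __restrict__ qualified
// local variables. Per TestSASS.test_restrict_recast_local(), nvcc emits
// a single sequence, but clang still emits two.
template <typename T>
struct FunctorRestrictRecastLocal
{
    T* dst;
    const T* src_a;
    const T* src_b;

    __host__ __device__ void operator()(const int i) const
    {
        T* __restrict__ d = dst;
        const T* __restrict__ a = src_a;
        const T* __restrict__ b = src_b;
        d[i]  = a[i] + b[i];
        d[i] += a[i] + b[i];
    }
};

// Method.RESTRICT_MEMBER: the qualifier sits on the data members, but all
// accesses still go through the same `this` pointer, so both
// loads/add/store sequences are still emitted.
template <typename T>
struct FunctorRestrictMember
{
    T* __restrict__ dst;
    const T* __restrict__ src_a;
    const T* __restrict__ src_b;

    __host__ __device__ void operator()(const int i) const
    {
        dst[i]  = src_a[i] + src_b[i];
        dst[i] += src_a[i] + src_b[i];
    }
};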

class examples.kokkos.view.example_restrict.Method(*values)

Bases: StrEnum

GLOBAL_KERNEL = 'global_kernel'

Use a __global__ kernel with __restrict__ qualified pointer arguments.

LDG_ACCESSOR = 'FunctorLDGAccessor'

Use an ldg accessor that uses __ldg to enforce a read-only L1/Tex cache load.

RESTRICT_ACCESSOR = 'FunctorRestrictAccessor'

Use a restrict accessor that declares the reference type as __restrict__.

RESTRICT_MEMBER = 'FunctorRestrictMember'

Declare pointer members as __restrict__.

RESTRICT_RECAST_LAMBDA = 'FunctorRestrictRecastLambda'

Recast pointers to __restrict__ qualified pointers through an intermediate lambda within the call operator.

RESTRICT_RECAST_LOCAL = 'FunctorRestrictRecastLocal'

Recast pointers to __restrict__ qualified pointers through intermediate local variables within the call operator.

RESTRICT_VIEW_MEMORY_TRAIT = 'FunctorRestrictViewMemoryTrait'

Use the Kokkos::Restrict memory trait.

__str__()

Return str(self).

class examples.kokkos.view.example_restrict.TestNCU

Bases: TestRestrict

Kernel profiling.

ELEMENT_COUNT: Final[int] = 128
ELEMENT_SIZE: Final[int] = 4
METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_st', pretty_name='L1/TEX cache global store sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
NVTX_INCLUDES: Final[tuple[str, ...]] = ('global_kernel/', 'restrict_recast_lambda/', 'restrict_recast_local/', 'restrict_accessor/', 'restrict_member/', 'restrict_view_memory_trait/', 'ldg_accessor/')
SECTOR_COUNT_LOAD: Final[int] = 32

Expected number of load sectors requested for the single sequence scenario.

SECTOR_COUNT_STORE: Final[int] = 16

Expected number of store sectors for the single sequence scenario.

SECTOR_SIZE: Final[int] = 32
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() → Report
results(report: Report) → ProfilingResults
test(method: Method, results: ProfilingResults) → None

If there are two sequences, twice SECTOR_COUNT_LOAD sectors are requested, but half of them are served from the cache.
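
A back-of-the-envelope check of these constants (a hypothetical sketch, not part of the test sources):

// Each buffer holds ELEMENT_COUNT * ELEMENT_SIZE = 128 * 4 = 512 bytes,
// i.e. 16 sectors of SECTOR_SIZE = 32 bytes.
constexpr int element_count = 128; // ELEMENT_COUNT
constexpr int element_size  = 4;   // ELEMENT_SIZE, sizeof(int)
constexpr int sector_size   = 32;  // SECTOR_SIZE
constexpr int sectors_per_buffer = element_count * element_size / sector_size;

static_assert(2 * sectors_per_buffer == 32); // SECTOR_COUNT_LOAD: two source buffers
static_assert(sectors_per_buffer == 16);     // SECTOR_COUNT_STORE: one destination buffer

With two sequences, 2 × 32 = 64 load sectors are requested, of which roughly half hit in the L1/TEX cache.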

class examples.kokkos.view.example_restrict.TestRestrict

Bases: CMakeAwareTestCase

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNCU.report().

SCALAR_TYPE: Final[str] = 'int'
SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::view::{functor}<{scalar}>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'

Signature template when Kokkos::parallel_for is used.

property cubin: Path
classmethod get_target_name() → str
signature() → dict[Method, str]
class examples.kokkos.view.example_restrict.TestSASS

Bases: TestRestrict

Binary analysis.

cuobjdump() → CuObjDump
decoder(signature: dict[Method, str], cuobjdump: CuObjDump) → dict[Method, Decoder]
match_repeated(cfg: Graph) → bool

Match two loads/add/store sequences, such as:

LDG.E R0, desc[UR4][R2.64]
LDG.E R17, desc[UR4][R4.64]
...
IADD3 R17, PT, PT, R0, R17, RZ
...
STG.E desc[UR4][R6.64], R17
...
LDG.E R0, desc[UR4][R2.64]
LDG.E R8, desc[UR4][R4.64]
...
IADD3 R9, PT, PT, R17, R8, R0
...
STG.E desc[UR4][R6.64], R9
match_single(*, readonly: bool, cfg: Graph) → bool

Match a single loads/add/store sequence, such as:

LDG.E.CONSTANT R2, desc[UR4][R2.64]
LDG.E.CONSTANT R5, desc[UR4][R4.64]
...
IADD3 R0, PT, PT, R2, R5, RZ
IADD3 R9, PT, PT, R0, R0, RZ
...
STG.E desc[UR4][R6.64], R9
test_global_kernel(decoder: dict[Method, Decoder]) → None

Test for Method.GLOBAL_KERNEL.

It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.

test_ldg_accessor(decoder: dict[Method, Decoder]) → None

Test for Method.LDG_ACCESSOR.

It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.

test_restrict_accessor(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_ACCESSOR.

It generates two loads/add/store sequences.

test_restrict_member(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_MEMBER.

It generates two loads/add/store sequences.

test_restrict_recast_lambda(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_RECAST_LAMBDA.

It generates a single loads/add/store sequence, but misses the .CONSTANT modifier on recent architectures.

test_restrict_recast_local(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_RECAST_LOCAL.

With nvcc, it generates a single loads/add/store sequence and always uses the .CONSTANT load path. With clang, it generates two loads/add/store sequences.

test_restrict_view_memory_trait(decoder: dict[Method, Decoder]) → None

Test for Method.RESTRICT_VIEW_MEMORY_TRAIT.

It generates two loads/add/store sequences.