Pointer aliasing, __restrict__ and read-only cached load
Using the __restrict__ qualifier to hint the compiler that pointers do not
alias can bring significant gains, such as:

- Use the .CONSTANT modifier for the load instruction when the data is read-only.
- Reduce redundant memory loads by assuming no aliasing between pointers.

See NVI25 for more details.
However, depending on the context in which the __restrict__ qualifier is used, compilers
may not be able to take advantage of it.
This example examines the SASS code generated by:
#define OPERATION(dst, src_a, src_b, index) \
dst[index] = src_a[index] + src_b[index]; \
dst[index] += src_a[index] + src_b[index];
for the following implementations:
- Method.RESTRICT_RECAST_LAMBDA, see TestSASS.test_restrict_recast_lambda()
- Method.RESTRICT_RECAST_LOCAL, see TestSASS.test_restrict_recast_local()
- Method.RESTRICT_ACCESSOR, see TestSASS.test_restrict_accessor()
- Method.RESTRICT_VIEW_MEMORY_TRAIT, see TestSASS.test_restrict_view_memory_trait()
These implementations either generate a single loads/operation/store sequence
when the compiler determines that the pointers do not alias (TestSASS.match_single()),
or two sequences otherwise (TestSASS.match_repeated()).
When two sequences are generated, memory traffic increases significantly, as shown in
TestNCU.
Strategies preventing pointer aliasing
The following strategies map to a single sequence, but each has a drawback:
- Caution: writing a __global__ kernel with __restrict__ qualified arguments is not portable.
- Caution: requires significant boilerplate code, which hinders readability.
- Caution: unsuitable for non-constant data buffers. Not directly related to pointer aliasing, but may have similar effects.
Note
According to NVIf, __ldg (or any other cache strategy) supports all C++ fundamental types, CUDA vector types (except those with x3 components), and extended floating-point types. However, as of CUDA 13.1, it seems to only support objects up to double2, i.e. up to 128 bits in size. This seems to contradict reprospect.test.features.Memory.max_transaction_size(), which states that 256-bit objects are supported as of reprospect.tools.architecture.NVIDIAFamily.BLACKWELL and CUDA 13.
.CONSTANT load
To ensure that a global load with .CONSTANT is used, use either of:

- Method.GLOBAL_KERNEL, see TestSASS.test_global_kernel()
- Method.LDG_ACCESSOR, see TestSASS.test_ldg_accessor()
Strategies that fail at preventing pointer aliasing
The following strategies cannot be used to avoid pointer aliasing:
- It does not work across all compilers.
- Writing a custom accessor, e.g. for std::mdspan, does not work.
  This is the strategy employed in PR 8151.
- This strategy is also ineffective, as already noted in Method.RESTRICT_VIEW_MEMORY_TRAIT.
  The implementation is similar to Method.RESTRICT_ACCESSOR.
- class examples.kokkos.view.example_restrict.Method(*values) View on GitHub
Bases: StrEnum
- GLOBAL_KERNEL = 'global_kernel'
Use a __global__ kernel with __restrict__ qualified pointer arguments.
- LDG_ACCESSOR = 'FunctorLDGAccessor'
Use an ldg accessor that uses __ldg to enforce a read-only L1/Tex cache load.
References:
[NVIf]
- RESTRICT_ACCESSOR = 'FunctorRestrictAccessor'
Use a restrict accessor that declares the reference type as __restrict__.
- RESTRICT_MEMBER = 'FunctorRestrictMember'
Declare pointer members as __restrict__.
- RESTRICT_RECAST_LAMBDA = 'FunctorRestrictRecastLambda'
Recast pointers to __restrict__ qualified pointers through an intermediate lambda within the call operator.
- RESTRICT_RECAST_LOCAL = 'FunctorRestrictRecastLocal'
Recast pointers to __restrict__ qualified pointers through intermediate local variables within the call operator.
- RESTRICT_VIEW_MEMORY_TRAIT = 'FunctorRestrictViewMemoryTrait'
Use the Kokkos::Restrict memory trait.
- __str__()
Return str(self).
- class examples.kokkos.view.example_restrict.TestNCU View on GitHub
Bases: TestRestrict
Kernel profiling.
- METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld_lookup_miss', pretty_name='L1/TEX cache global load sectors miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_st', pretty_name='L1/TEX cache global store sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', pretty_name='lts__t_sectors_srcunit_tex_op_read_lookup_miss', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
- NVTX_INCLUDES: Final[tuple[str, ...]] = ('global_kernel/', 'restrict_recast_lambda/', 'restrict_recast_local/', 'restrict_accessor/', 'restrict_member/', 'restrict_view_memory_trait/', 'ldg_accessor/')
- SECTOR_COUNT_LOAD: Final[int] = 32
Expected number of load sectors requested for the single sequence scenario.
- SECTOR_COUNT_STORE: Final[int] = 16
Expected number of store sectors for the single sequence scenario.
- pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
- report() → Report View on GitHub
- results(report: Report) → ProfilingResults View on GitHub
- test(method: Method, results: ProfilingResults) → None View on GitHub
If there are two sequences, twice SECTOR_COUNT_LOAD sectors are requested, but half are served from the cache.
- class examples.kokkos.view.example_restrict.TestRestrict View on GitHub
Bases: CMakeAwareTestCase
- KOKKOS_TOOLS_NVTX_CONNECTOR_LIB
Used in TestNCU.report().
- SIGNATURE_TEMPLATE: Final[str] = 'void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<reprospect::examples::kokkos::view::{functor}<{scalar}>, Kokkos::RangePolicy<>, Kokkos::Cuda>>'
Signature template when Kokkos::parallel_for is used.
- property cubin: Path View on GitHub
- classmethod get_target_name() → str View on GitHub
- signature() → dict[Method, str] View on GitHub
- class examples.kokkos.view.example_restrict.TestSASS View on GitHub
Bases: TestRestrict
Binary analysis.
- cuobjdump() → CuObjDump View on GitHub
- match_repeated(cfg: Graph) → bool View on GitHub
Match two loads/add/store sequences, such as:

LDG.E R0, desc[UR4][R2.64]
LDG.E R17, desc[UR4][R4.64]
...
IADD3 R17, PT, PT, R0, R17, RZ
...
STG.E desc[UR4][R6.64], R17
...
LDG.E R0, desc[UR4][R2.64]
LDG.E R8, desc[UR4][R4.64]
...
IADD3 R9, PT, PT, R17, R8, R0
...
STG.E desc[UR4][R6.64], R9
- match_single(*, readonly: bool, cfg: Graph) → bool View on GitHub
Match a single loads/add/store sequence, such as:

LDG.E.CONSTANT R2, desc[UR4][R2.64]
LDG.E.CONSTANT R5, desc[UR4][R4.64]
...
IADD3 R0, PT, PT, R2, R5, RZ
IADD3 R9, PT, PT, R0, R0, RZ
...
STG.E desc[UR4][R6.64], R9
- test_global_kernel(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.GLOBAL_KERNEL.
It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.
- test_ldg_accessor(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.LDG_ACCESSOR.
It generates a single loads/add/store sequence, and always uses the .CONSTANT load path.
- test_restrict_accessor(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.RESTRICT_ACCESSOR.
It generates two loads/add/store sequences.
- test_restrict_member(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.RESTRICT_MEMBER.
It generates two loads/add/store sequences.
- test_restrict_recast_lambda(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.RESTRICT_RECAST_LAMBDA.
It generates a single loads/add/store sequence, but misses the .CONSTANT modifier for recent architectures.
- test_restrict_recast_local(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.RESTRICT_RECAST_LOCAL.
It generates a single loads/add/store sequence, and always uses the .CONSTANT load path for nvcc. It generates two loads/add/store sequences for clang.
- test_restrict_view_memory_trait(decoder: dict[Method, Decoder]) → None View on GitHub
Test for Method.RESTRICT_VIEW_MEMORY_TRAIT.
It generates two loads/add/store sequences.