`Kokkos::atomic_add`

Kokkos provides extended atomic support for objects of arbitrary size. Therefore, it has to support types that are not directly handled by the backend. This is achieved through the desul library [TrottLebrunGrandieArndt+22], which, depending on the size of the object and the target hardware, maps each atomic operation to one of:

atomic instruction
CAS loop
sharded lock table

Historically, CUDA atomics supported operations on at most 64 bits. Since compute capability 9.0, CUDA also supports atomic CAS on objects of up to 128 bits, and there has been work in Kokkos to expose this support through desul. For instance, Kokkos::atomic_add on a 128-bit aligned Kokkos::complex<double> should use the sharded lock table implementation only for compute capabilities below 9.0, and the CAS-based implementation otherwise.
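The selection rule described above can be summarised in a few lines of Python. This is only a sketch: expected_path and Path are hypothetical names that are not part of Kokkos, desul, or this package, the compute capability is assumed to be encoded as an integer (90 for 9.0), and the object is assumed to be aligned to its size and to have no native atomic instruction.

```python
from enum import Enum


class Path(Enum):
    """Code paths considered in this sketch (hypothetical enumeration)."""
    CAS = "CAS loop"
    LOCK = "sharded lock table"


def expected_path(size_bits: int, compute_capability: int) -> Path:
    """Hypothetical helper: code path expected for an object of ``size_bits`` bits.

    Assumes the object is aligned to its size and that no native atomic
    instruction exists for the operation.
    """
    # Atomic CAS covers up to 64 bits, and up to 128 bits as of compute capability 9.0.
    max_cas_bits = 128 if compute_capability >= 90 else 64
    return Path.CAS if size_bits <= max_cas_bits else Path.LOCK
```

With these assumptions, a 128-bit object yields Path.LOCK for compute capability 8.0 and Path.CAS for 9.0, while a 256-bit object yields Path.LOCK in both cases.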

To check that Kokkos takes the right code path, the matchers from examples.kokkos.atomic.desul, documented below, can be used.

The following tests:

examples.kokkos.atomic.example_add_complex64.TestAtomicAddComplex64()
examples.kokkos.atomic.example_add_complex128.TestAtomicAddComplex128()
examples.kokkos.atomic.example_add_double256.TestAtomicAddDouble256()
examples.kokkos.atomic.example_add_int128.TestAtomicAddInt128()

verify that Kokkos::atomic_add maps to the right implementation by looking for an instruction sequence pattern.

class examples.kokkos.atomic.desul.AtomicAcquireMatcher

Bases: object

Matcher for the attempt to acquire a lock through an atomic exchange.

See:

classmethod build(arch: NVIDIAArch, compiler_id: str) → OrderedInSequenceMatcher

class examples.kokkos.atomic.desul.AtomicReleaseMatcher

Bases: object

Matcher for the release of a lock through an atomic exchange.

See:

classmethod build(arch: NVIDIAArch, compiler_id: str) → InstructionMatcher

class examples.kokkos.atomic.desul.DeviceAtomicThreadFenceMatcher

Bases: object

Matcher for the device atomic thread fence block.

See:

classmethod build(arch: NVIDIAArch) → OrderedInSequenceMatcher

class examples.kokkos.atomic.desul.LockBasedAtomicMatcher(*, arch: NVIDIAArch, operation: Operation, compiler_id: str, size: int = 128, level: int = 20, load: SequenceMatcher | None = None, store: SequenceMatcher | None = None)

Bases: SequenceMatcher

Matcher for the desul lock-based atomic code path.

See:

https://github.com/desul/desul/blob/79f928075837ffb5d302aae188e0ec7b7a79ae94/atomics/include/desul/atomics/Lock_Based_Fetch_Op_CUDA.hpp#L39-L44

__init__(*, arch: NVIDIAArch, operation: Operation, compiler_id: str, size: int = 128, level: int = 20, load: SequenceMatcher | None = None, store: SequenceMatcher | None = None) → None

collect(matched: list[InstructionMatch], new: InstructionMatch | list[InstructionMatch]) → int

match(instructions: Sequence[Instruction | str]) → list[InstructionMatch] | None

Note: For data types that require many loads or stores, the operation instructions might be interleaved, such that the sequence within the memory thread fences is not strictly load/operation/store.
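To illustrate why interleaving matters, here is a minimal, standard-library-only sketch of an ordered match that tolerates unrelated instructions between two matched ones. It is not the implementation of LockBasedAtomicMatcher; match_in_order is a hypothetical helper, and the opcode strings are placeholders rather than real compiler output.

```python
import re
from typing import List, Optional, Sequence


def match_in_order(instructions: Sequence[str], patterns: Sequence[str]) -> Optional[List[int]]:
    """Return the index of the instruction matched by each pattern, in order.

    Instructions that match no pattern may sit between two matches.
    Returns None as soon as one pattern cannot be matched.
    """
    indices: List[int] = []
    start = 0
    for pattern in patterns:
        regex = re.compile(pattern)
        for index in range(start, len(instructions)):
            if regex.match(instructions[index]):
                indices.append(index)
                start = index + 1
                break
        else:
            # The inner loop found no instruction for this pattern.
            return None
    return indices


# Placeholder opcodes: two loads and two stores with interleaved operations.
instructions = ["FENCE", "LOAD", "ADD", "LOAD", "ADD", "STORE", "STORE", "FENCE"]
found = match_in_order(instructions, ["FENCE", "LOAD", "ADD", "STORE", "FENCE"])
```

Here found is [0, 1, 2, 5, 7]: the second LOAD and ADD are skipped over rather than breaking the match, which is the kind of tolerance the note above calls for.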

property next_index: int

class examples.kokkos.atomic.desul.Operation(*args, **kwargs)

Bases: Protocol

__init__(*args, **kwargs)

build(loads: Collection[InstructionMatch]) → SequenceMatcher

examples.kokkos.atomic.desul.get_atomic_memory_suffix(compiler_id: str) → Literal['G', '']

See tests.test.sass.test_atomic.TestAtomicMatcher.test_exch_device_ptr().

class examples.kokkos.atomic.add.TestCase

Bases: CMakeAwareTestCase

Derived types must define SIGNATURE_MATCHER.

SIGNATURE_MATCHER: ClassVar[Pattern[str]]

property cubin: Path

cuobjdump() → CuObjDump

decoder(cuobjdump: CuObjDump) → Decoder

test() → None

Run the executable.

class examples.kokkos.atomic.example_add_complex64.AddComplex64

Bases: object

Addition of two 64-bit complex values.

build(loads: Collection[InstructionMatch] | None = None) → OrderedInterleavedInSequenceMatcher | UnorderedInterleavedInSequenceMatcher

class examples.kokkos.atomic.example_add_complex64.TestAtomicAddComplex64

Bases: TestCase

Tests for Kokkos::complex<float>.

SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>\\s*\\*\\s*, Kokkos::CudaSpace>>')

classmethod get_target_name() → str

test_cas_atomic(decoder: Decoder) → None

This test proves that it uses the CAS-based implementation.
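As an illustration of what a SIGNATURE_MATCHER pattern accepts, the sketch below applies the pattern of TestAtomicAddComplex64 to a few candidate strings with the standard re module. The candidate strings are hypothetical, written for this example only; they are not taken from a real binary.

```python
import re

# Same pattern as TestAtomicAddComplex64.SIGNATURE_MATCHER, written as a raw string.
SIGNATURE_MATCHER = re.compile(
    r"AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>\s*\*\s*, Kokkos::CudaSpace>>"
)

# Hypothetical signatures, for illustration only.
signatures = [
    "AtomicAddFunctor<Kokkos::View<Kokkos::complex<float> *, Kokkos::CudaSpace>>",
    "AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>*, Kokkos::CudaSpace>>",
    "AtomicAddFunctor<Kokkos::View<Kokkos::complex<double> *, Kokkos::CudaSpace>>",
]

# Keep only the signatures that the pattern recognises.
selected = [signature for signature in signatures if SIGNATURE_MATCHER.search(signature)]
```

The \s* on both sides of the pointer star makes the pattern insensitive to spacing, so the first two candidates are selected and the Kokkos::complex<double> one is rejected.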

class examples.kokkos.atomic.example_add_complex128.AddComplex128

Bases: object

Addition of two 128-bit complex values.

real parts
imaginary parts
possibly with NOP

build(loads: Collection[InstructionMatch] | None = None) → UnorderedInSequenceMatcher

class examples.kokkos.atomic.example_add_complex128.TestAtomicAddComplex128
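One reading of the list above is that the additions of the real and imaginary parts may appear in either order, possibly separated by a NOP. The sketch below shows an order-insensitive check of that kind using only the standard library. It is not the UnorderedInSequenceMatcher implementation; match_unordered is a hypothetical helper, and the instruction strings and register names are illustrative.

```python
import re
from itertools import permutations
from typing import Sequence


def match_unordered(instructions: Sequence[str], patterns: Sequence[str], ignore: str = r"NOP\b") -> bool:
    """Return True if the instructions match all patterns in some order.

    Instructions matching ``ignore`` are dropped first; each remaining
    instruction must then be matched by exactly one pattern.
    """
    kept = [instruction for instruction in instructions if not re.match(ignore, instruction)]
    if len(kept) != len(patterns):
        return False
    return any(
        all(re.match(pattern, instruction) for pattern, instruction in zip(order, kept))
        for order in permutations(patterns)
    )


# Illustrative patterns for the two additions (register names are assumptions).
real = r"DADD R4, "
imag = r"DADD R6, "

in_order = match_unordered(["DADD R4, R4, R8", "DADD R6, R6, R10"], [real, imag])
swapped_with_nop = match_unordered(["DADD R6, R6, R10", "NOP", "DADD R4, R4, R8"], [real, imag])
```

Both in_order and swapped_with_nop are True, whereas a window that contains the same addition twice would be rejected.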

Bases: TestCase

Tests for Kokkos::complex<double>.

SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<Kokkos::complex<double>\\s*\\*\\s*, Kokkos::CudaSpace>>')

classmethod get_target_name() → str

test_cas_atomic_as_of_hopper90(decoder: Decoder) → None

This test proves that it uses the CAS-based implementation.

test_lock_atomic_before_hopper90(decoder: Decoder) → None

This test proves that it uses the lock-based implementation.

class examples.kokkos.atomic.example_add_double256.AddDouble4(arch: NVIDIAArch)

Bases: object

Addition of two double4 values (whatever the alignment).

__init__(arch: NVIDIAArch) → None

build(loads: Collection[InstructionMatch] | None = None) → UnorderedInSequenceMatcher

class examples.kokkos.atomic.example_add_double256.Load256Matcher(arch: NVIDIAArch)

Bases: object

__init__(arch: NVIDIAArch) → None

build() → SequenceMatcher

class examples.kokkos.atomic.example_add_double256.Store256Matcher(arch: NVIDIAArch)

Bases: object

__init__(arch: NVIDIAArch) → None

build() → SequenceMatcher

class examples.kokkos.atomic.example_add_double256.TestAtomicAddDouble256

Bases: TestCase

Verify that Kokkos::atomic_add for double4 maps to the desul lock-based array implementation (whatever the alignment).

SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<reprospect::examples::kokkos::atomic::Double4Aligned32\\s*\\*\\s*, Kokkos::CudaSpace>>')

classmethod get_target_name() → str

test_lock_atomic(decoder: Decoder) → None

This test proves that it uses the lock-based implementation.

class examples.kokkos.atomic.example_add_int128.AddInt128

Bases: object

Addition of two __int128 values that uses a specific set of registers.

build(loads: Collection[InstructionMatch] | None = None) → AddInt128Matcher

class examples.kokkos.atomic.example_add_int128.TestAtomicAddInt128

Bases: TestCase

Tests for __int128.

SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<__int128\\s*\\*\\s*, Kokkos::CudaSpace>>')

classmethod get_target_name() → str

test_cas_atomic_as_of_hopper90(decoder: Decoder) → None

This test proves that it uses the CAS-based implementation.

test_lock_atomic_before_hopper90(decoder: Decoder) → None

This test proves that it uses the lock-based implementation.
