Kokkos::atomic_add
Kokkos provides extended atomic support for objects of arbitrary size.
It therefore has to support types that the backend does not handle directly.
This is achieved through the desul library [TrottLebrunGrandieArndt+22],
which, depending on the size of the object and the targeted hardware, maps each atomic operation to one of:
- an atomic instruction
- a compare-and-swap (CAS) loop
- a sharded lock table
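A minimal Python simulation of the CAS-loop strategy follows. It is not desul code: `AtomicCell` and `atomic_add_cas` are hypothetical names, and the `threading.Lock` inside `compare_exchange` only emulates the atomicity that a single hardware CAS instruction provides. The loop reads the current value, computes the new one, and retries whenever another thread changed the value in between.

```python
import threading


class AtomicCell:
    """Hypothetical memory location that supports a compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()  # emulates the atomicity of one CAS instruction

    def load(self):
        return self._value

    def compare_exchange(self, expected, desired):
        """Store `desired` if the cell holds `expected`; always return the previous value.

        Real hardware compares bit patterns; this sketch compares Python values.
        """
        with self._guard:
            old = self._value
            if old == expected:
                self._value = desired
            return old


def atomic_add_cas(cell, value):
    """Atomically add `value` through a CAS loop; return the previous value."""
    expected = cell.load()
    while True:
        old = cell.compare_exchange(expected, expected + value)
        if old == expected:  # nobody interfered: our sum was stored
            return old
        expected = old  # another thread won the race: retry from its value


cell = AtomicCell(complex(0.0, 0.0))  # stands in for a Kokkos::complex<float>


def worker():
    for _ in range(1000):
        atomic_add_cas(cell, complex(1.0, 2.0))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cell.load())  # (4000+8000j)
```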
Traditionally, CUDA atomics supported operations on objects of up to 64 bits.
Since compute capability 9.0, CUDA also supports atomic CAS on 128-bit objects.
Work has therefore been done in Kokkos to expose this support through desul.
For instance, Kokkos::atomic_add on a 128-bit aligned Kokkos::complex<double> should use
the sharded lock table implementation only for compute capability below 9.0,
and the CAS-based implementation otherwise.
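The selection rule described above can be summarized as a small decision function. This is a hypothetical, simplified model inferred from this page, not desul's dispatch logic: `CodePath` and `expected_code_path` are illustrative names, and treating 32-bit objects like 64-bit ones, as well as requiring natural alignment for CAS, are assumptions.

```python
from enum import Enum


class CodePath(Enum):
    ATOMIC_INSTRUCTION = "atomic instruction"
    CAS_LOOP = "CAS loop"
    LOCK_TABLE = "sharded lock table"


def expected_code_path(size_bits, alignment_bits, compute_capability, has_native_op=False):
    """Hypothetical, simplified model of the path expected for an atomic add.

    compute_capability is encoded as an integer, e.g. 80 for 8.0 and 90 for 9.0.
    """
    if has_native_op:
        # The hardware implements this operation on this type directly.
        return CodePath.ATOMIC_INSTRUCTION
    if size_bits in (32, 64) and alignment_bits >= size_bits:
        # Assumption: 32- and 64-bit objects fit a hardware CAS on every supported GPU.
        return CodePath.CAS_LOOP
    if size_bits == 128 and alignment_bits >= 128 and compute_capability >= 90:
        # 128-bit CAS is only available as of compute capability 9.0.
        return CodePath.CAS_LOOP
    # Anything else (e.g. 128-bit before 9.0, 256-bit double4) falls back to locks.
    return CodePath.LOCK_TABLE


for label, args in [
    ("Kokkos::complex<float>, CC 8.0", (64, 64, 80)),
    ("Kokkos::complex<double>, CC 8.0", (128, 128, 80)),
    ("Kokkos::complex<double>, CC 9.0", (128, 128, 90)),
    ("double4, CC 9.0", (256, 256, 90)),
]:
    print(f"{label}: {expected_code_path(*args).value}")
```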
To check that Kokkos selects the right code path, the matchers documented below can be used.
The following tests:
- examples.kokkos.atomic.example_add_complex64.TestAtomicAddComplex64()
- examples.kokkos.atomic.example_add_complex128.TestAtomicAddComplex128()
- examples.kokkos.atomic.example_add_double256.TestAtomicAddDouble256()
- examples.kokkos.atomic.example_add_int128.TestAtomicAddInt128()
verify that Kokkos::atomic_add maps to the right implementation
by looking for an instruction sequence pattern in the generated SASS.
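The idea of looking for an instruction sequence pattern can be illustrated with a much simpler matcher than those documented on this page. The function `find_ordered_in_sequence` is hypothetical: it checks that a list of regular expressions matches instructions in the given order, while allowing unrelated instructions in between. The SASS lines are illustrative, with made-up operands, not actual compiler output; `ATOMG?` accepts both the `ATOMG` and `ATOM` spellings, mirroring the two suffixes `'G'` and `''` that get_atomic_memory_suffix can return.

```python
import re


def find_ordered_in_sequence(patterns, instructions):
    """Return the indices of instructions matching `patterns` in order, or None.

    Unrelated instructions may appear between two matched ones.
    """
    indices = []
    start = 0
    for pattern in map(re.compile, patterns):
        for index in range(start, len(instructions)):
            if pattern.match(instructions[index]):
                indices.append(index)
                start = index + 1
                break
        else:
            return None  # this pattern never matched after the previous one
    return indices


# Illustrative SASS for a CAS loop on a 64-bit complex value (operands made up).
sass = [
    "LDG.E.64 R4, [R2.64]",
    "FADD R6, R4, R8",
    "FADD R7, R5, R9",
    "ATOMG.E.CAS.64.STRONG.GPU R6, [R2.64], R4, R6",
    "ISETP.NE.U32.AND P0, PT, R6, R4, PT",
    "@P0 BRA 0x50",
]
print(find_ordered_in_sequence([r"LDG", r"ATOMG?\.E\.CAS", r"@P\d BRA"], sass))  # [0, 3, 5]
```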
- class examples.kokkos.atomic.desul.AtomicAcquireMatcher
Bases: object
Matcher for the attempt to acquire a lock through an atomic exchange.
See:
- classmethod build(arch: NVIDIAArch, compiler_id: str) → OrderedInSequenceMatcher
- class examples.kokkos.atomic.desul.AtomicReleaseMatcher
Bases: object
Matcher for the release of a lock through an atomic exchange.
See:
- classmethod build(arch: NVIDIAArch, compiler_id: str) → InstructionMatcher
- class examples.kokkos.atomic.desul.DeviceAtomicThreadFenceMatcher
Bases: object
Matcher for the device atomic thread fence block.
See:
- classmethod build(arch: NVIDIAArch) → OrderedInSequenceMatcher
- class examples.kokkos.atomic.desul.LockBasedAtomicMatcher(*, arch: NVIDIAArch, operation: Operation, compiler_id: str, size: int = 128, level: int = 20, load: SequenceMatcher | None = None, store: SequenceMatcher | None = None)
Bases: SequenceMatcher
Matcher for the desul lock-based atomic code path.
See:
- __init__(*, arch: NVIDIAArch, operation: Operation, compiler_id: str, size: int = 128, level: int = 20, load: SequenceMatcher | None = None, store: SequenceMatcher | None = None) → None
- collect(matched: list[InstructionMatch], new: InstructionMatch | list[InstructionMatch]) → int
- match(instructions: Sequence[Instruction | str]) → list[InstructionMatch] | None
Note
For data types that require multiple loads or stores, the operation instructions may be interleaved with them, so the sequence between the memory thread fences is not strictly load, then operation, then store.
- property next_index: int
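The code path this matcher looks for (lock acquisition through an atomic exchange, a device thread fence, load/operation/store, another fence, then lock release through an atomic exchange) can be mimicked in Python. This is a minimal sketch, not desul's implementation: `LockTable`, `lock_based_atomic`, `add4`, the table size `NUM_LOCKS` and the address hash are all hypothetical. Python has no device thread fence, so the fences only appear as comments, and a `threading.Lock` merely emulates the atomicity of one exchange on a lock word (0 when free, 1 when held).

```python
import threading

NUM_LOCKS = 64  # hypothetical table size


class LockTable:
    """Sharded lock table: every address hashes to one of a fixed number of lock words."""

    def __init__(self, num_locks=NUM_LOCKS):
        self._words = [0] * num_locks  # 0 = free, 1 = held
        self._guard = threading.Lock()  # only emulates the atomicity of one exchange

    def _shard(self, address):
        # Hypothetical hash: drop the low 4 bits, then take the remainder.
        return (address >> 4) % len(self._words)

    def _exchange(self, shard, value):
        """Atomically write `value` to a lock word and return its previous value."""
        with self._guard:
            old = self._words[shard]
            self._words[shard] = value
            return old

    def try_acquire(self, address):
        # Acquisition attempt: exchange in 1, succeed if the word was 0.
        return self._exchange(self._shard(address), 1) == 0

    def release(self, address):
        # Release: exchange in 0.
        self._exchange(self._shard(address), 0)


def lock_based_atomic(memory, table, address, value, op):
    """Apply memory[address] = op(memory[address], value) under the shard lock."""
    while not table.try_acquire(address):
        pass  # spin until the lock word is free
    # (a device thread fence would sit here on the GPU)
    old = memory[address]  # load
    memory[address] = op(old, value)  # operation, then store
    # (a device thread fence would sit here on the GPU)
    table.release(address)
    return old


def add4(a, b):
    """Element-wise addition, standing in for adding two double4 values."""
    return tuple(x + y for x, y in zip(a, b))


memory = {0x1000: (0.0, 0.0, 0.0, 0.0)}
table = LockTable()


def worker():
    for _ in range(250):
        lock_based_atomic(memory, table, 0x1000, (1.0, 1.0, 1.0, 1.0), add4)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(memory[0x1000])  # (1000.0, 1000.0, 1000.0, 1000.0)
```

Because many addresses share one lock word, the table stays small regardless of how much memory is accessed atomically, at the cost of occasional false contention between unrelated addresses.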
- class examples.kokkos.atomic.desul.Operation(*args, **kwargs)
Bases: Protocol
- __init__(*args, **kwargs)
- build(loads: Collection[InstructionMatch]) → SequenceMatcher
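Because Operation is a `typing.Protocol`, any class exposing a compatible `build` method satisfies it structurally, without inheriting from it. Judging by their signatures, the Add* classes documented below appear to follow this shape, which would let them serve as the `operation` of LockBasedAtomicMatcher; this is an inference, not something the page states. The self-contained sketch below uses hypothetical stand-ins (`FakeMatch`, `OperationLike`, `AddDoubles`) and returns regular expressions instead of a real SequenceMatcher.

```python
from collections.abc import Collection
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class FakeMatch:
    """Hypothetical stand-in for InstructionMatch: keeps only a loaded register."""

    register: str


@runtime_checkable
class OperationLike(Protocol):
    """Hypothetical mirror of Operation that returns regexes instead of a SequenceMatcher."""

    def build(self, loads: Collection[FakeMatch]) -> list[str]: ...


class AddDoubles:
    """Never inherits from OperationLike, yet satisfies it structurally."""

    def build(self, loads: Collection[FakeMatch]) -> list[str]:
        # One DADD per loaded register, reading the value the load produced.
        return [rf"DADD R\d+, {match.register}, R\d+" for match in loads]


operation = AddDoubles()
print(isinstance(operation, OperationLike))  # True: the build method is present
print(operation.build([FakeMatch("R4"), FakeMatch("R6")]))
```

Note that `runtime_checkable` only checks that the method exists; static type checkers are what verify the parameter and return types.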
- examples.kokkos.atomic.desul.get_atomic_memory_suffix(compiler_id: str) → Literal['G', '']
See tests.test.sass.test_atomic.TestAtomicMatcher.test_exch_device_ptr().
- class examples.kokkos.atomic.add.TestCase
Bases: CMakeAwareTestCase
Derived types must define SIGNATURE_MATCHER.
- property cubin: Path
- cuobjdump() → CuObjDump
- decoder(cuobjdump: CuObjDump) → Decoder
- test() → None
Run the executable.
- class examples.kokkos.atomic.example_add_complex64.AddComplex64
Bases: object
Addition of two 64-bit complex values.
- build(loads: Collection[InstructionMatch] | None = None) → OrderedInterleavedInSequenceMatcher | UnorderedInterleavedInSequenceMatcher
- class examples.kokkos.atomic.example_add_complex64.TestAtomicAddComplex64
Bases: TestCase
Tests for Kokkos::complex<float>.
- SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>\\s*\\*\\s*, Kokkos::CudaSpace>>')
- classmethod get_target_name() → str
- test_cas_atomic(decoder: Decoder) → None
Verifies that the CAS-based implementation is used.
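SIGNATURE_MATCHER is a compiled regular expression that TestCase requires derived types to define; judging by its content, it presumably selects the AtomicAddFunctor kernel instantiation by its demangled signature. The sketch below reuses the TestAtomicAddComplex64 pattern verbatim (written as a raw string) and applies it to hand-written, hypothetical demangled names. The `\s*\*\s*` part makes the match insensitive to the spacing around the pointer star.

```python
import re

# Same pattern as TestAtomicAddComplex64.SIGNATURE_MATCHER, written as a raw string.
SIGNATURE_MATCHER = re.compile(
    r"AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>\s*\*\s*, Kokkos::CudaSpace>>"
)

# Hypothetical demangled names and whether they should be selected.
candidates = {
    "void launch<AtomicAddFunctor<Kokkos::View<Kokkos::complex<float>*, Kokkos::CudaSpace>>>()": True,
    "void launch<AtomicAddFunctor<Kokkos::View<Kokkos::complex<float> *, Kokkos::CudaSpace>>>()": True,
    "void launch<AtomicAddFunctor<Kokkos::View<Kokkos::complex<double> *, Kokkos::CudaSpace>>>()": False,
}
for name, expected in candidates.items():
    selected = SIGNATURE_MATCHER.search(name) is not None
    print(f"{selected!s:5} {name}")
```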
- class examples.kokkos.atomic.example_add_complex128.AddComplex128
Bases: object
Addition of two 128-bit complex values:
- real parts
- imaginary parts
- possibly with a NOP
- build(loads: Collection[InstructionMatch] | None = None) → UnorderedInSequenceMatcher
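For Kokkos::complex<double>, the additions of the real and imaginary parts are independent, so the compiler may emit them in either order, which is why an unordered matcher is returned. The function `match_unordered_in_sequence` below is a hypothetical, simplified illustration of unordered in-sequence matching: it tries every ordering of the patterns against consecutive instructions, skipping NOPs. The DADD lines and their registers are illustrative. AddDouble4, whose build method also returns an UnorderedInSequenceMatcher, would fit the same idea with four additions.

```python
import itertools
import re


def match_unordered_in_sequence(patterns, instructions, start=0):
    """Match every pattern exactly once, in any order, on consecutive instructions.

    NOP instructions are skipped. Returns the matched indices, or None.
    """
    non_nop = [i for i in range(start, len(instructions)) if not instructions[i].startswith("NOP")]
    window = non_nop[: len(patterns)]
    if len(window) < len(patterns):
        return None
    compiled = [re.compile(p) for p in patterns]
    # Brute force over orderings: fine for the two to four patterns of interest here.
    for ordering in itertools.permutations(compiled):
        if all(p.match(instructions[i]) for p, i in zip(ordering, window)):
            return window
    return None


real = r"DADD R4, R4, R8"  # real parts (illustrative registers)
imag = r"DADD R6, R6, R10"  # imaginary parts

print(match_unordered_in_sequence([real, imag], ["DADD R6, R6, R10", "NOP", "DADD R4, R4, R8"]))  # [0, 2]
print(match_unordered_in_sequence([real, imag], ["DADD R4, R4, R8", "DADD R6, R6, R10"]))  # [0, 1]
print(match_unordered_in_sequence([real, imag], ["DADD R4, R4, R8", "FADD R1, R2, R3"]))  # None
```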
- class examples.kokkos.atomic.example_add_complex128.TestAtomicAddComplex128
Bases: TestCase
Tests for Kokkos::complex<double>.
- SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<Kokkos::complex<double>\\s*\\*\\s*, Kokkos::CudaSpace>>')
- classmethod get_target_name() → str
- test_cas_atomic_as_of_hopper90(decoder: Decoder) → None
Verifies that the CAS-based implementation is used for compute capability 9.0 and above.
- test_lock_atomic_before_hopper90(decoder: Decoder) → None
Verifies that the lock-based implementation is used for compute capability below 9.0.
- class examples.kokkos.atomic.example_add_double256.AddDouble4(arch: NVIDIAArch)
Bases: object
Addition of two double4 values (regardless of alignment).
- __init__(arch: NVIDIAArch) → None
- build(loads: Collection[InstructionMatch] | None = None) → UnorderedInSequenceMatcher
- class examples.kokkos.atomic.example_add_double256.Load256Matcher(arch: NVIDIAArch)
Bases: object
- __init__(arch: NVIDIAArch) → None
- build() → SequenceMatcher
- class examples.kokkos.atomic.example_add_double256.Store256Matcher(arch: NVIDIAArch)
Bases: object
- __init__(arch: NVIDIAArch) → None
- build() → SequenceMatcher
- class examples.kokkos.atomic.example_add_double256.TestAtomicAddDouble256
Bases: TestCase
Verifies that Kokkos::atomic_add for double4 maps to the desul lock-based array implementation (regardless of alignment).
- SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<reprospect::examples::kokkos::atomic::Double4Aligned32\\s*\\*\\s*, Kokkos::CudaSpace>>')
- classmethod get_target_name() → str
- test_lock_atomic(decoder: Decoder) → None
Verifies that the lock-based implementation is used.
- class examples.kokkos.atomic.example_add_int128.AddInt128
Bases: object
Addition of two __int128 values that uses a specific set of registers.
- build(loads: Collection[InstructionMatch] | None = None) → AddInt128Matcher
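There is no single GPU instruction for a 128-bit integer addition; it is typically lowered to a chain of 32-bit additions that propagate a carry (in SASS, usually IADD3 followed by IADD3.X instructions passing the carry through predicates), which is the kind of multi-instruction pattern a matcher such as AddInt128Matcher has to recognize. The Python sketch below models that lowering under this assumption: `split_limbs`, `join_limbs`, `add_int128`, `to_unsigned` and `to_signed` are hypothetical helpers, with each 32-bit limb standing in for one register.

```python
MASK32 = (1 << 32) - 1
MASK128 = (1 << 128) - 1


def split_limbs(value, count=4):
    """Split a non-negative integer into `count` 32-bit limbs, least significant first."""
    return [(value >> (32 * i)) & MASK32 for i in range(count)]


def join_limbs(limbs):
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def add_int128(a, b):
    """Add two 128-bit patterns limb by limb, propagating the carry (wraps modulo 2**128)."""
    limbs = []
    carry = 0
    for x, y in zip(split_limbs(a), split_limbs(b)):
        total = x + y + carry
        limbs.append(total & MASK32)  # low 32 bits stay in this limb's register
        carry = total >> 32  # carry-out (0 or 1) feeds the next limb
    return join_limbs(limbs)  # the final carry-out is discarded


def to_unsigned(v):
    """Two's-complement bit pattern of a signed __int128 value."""
    return v & MASK128


def to_signed(u):
    return u - (1 << 128) if u >> 127 else u


a, b = -1, (1 << 64) + 5
print(to_signed(add_int128(to_unsigned(a), to_unsigned(b))))  # 18446744073709551620
```

Since two's-complement addition is the same bit-level operation for signed and unsigned operands, the same carry chain serves both `__int128` and `unsigned __int128`.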
- class examples.kokkos.atomic.example_add_int128.TestAtomicAddInt128
Bases: TestCase
Tests for __int128.
- SIGNATURE_MATCHER: ClassVar[Pattern[str]] = re.compile('AtomicAddFunctor<Kokkos::View<__int128\\s*\\*\\s*, Kokkos::CudaSpace>>')
- classmethod get_target_name() → str
- test_cas_atomic_as_of_hopper90(decoder: Decoder) → None
Verifies that the CAS-based implementation is used for compute capability 9.0 and above.
- test_lock_atomic_before_hopper90(decoder: Decoder) → None
Verifies that the lock-based implementation is used for compute capability below 9.0.