How do atomic operations translate from CUDA `C` to `SASS` ?

This example demonstrates how ReProspect can be used to quickly analyse how a given C operation is translated into its corresponding SASS instructions.

This approach offers several advantages:

Pedagogical — useful for teaching and learning
Comparative — helps understand how the mapping from C to SASS evolves across architectures

Source code

Let’s take a closer look at atomicAdd, described here. The kernel atomically adds a source array to a destination array, element-wise.

CODE = """\
#include "cuda.h"

__global__ void my_kernel(int * __restrict__ const dst, const int* __restrict__ const src) {
    const auto index = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&dst[index], src[index]);
}
"""

Compilation

Compile the source code for a few architectures of your choice. As this analysis won’t actually run the executable, any architecture supported by nvcc should work.

import logging
import pathlib
import subprocess
import tempfile

from reprospect.tools import architecture

logging.basicConfig(level=logging.INFO)

ARCHES = [
    architecture.NVIDIAArch.from_compute_capability(80),
    architecture.NVIDIAArch.from_compute_capability(90),
    architecture.NVIDIAArch.from_compute_capability(120),
]

logging.info(subprocess.check_output(('nvcc', '--version')).decode())

with tempfile.TemporaryDirectory(delete=False) as tmpdir:
    for arch in ARCHES:
        source = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}.cu'
        output = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}'

        source.write_text(CODE)

        subprocess.check_call(('nvcc', f'--generate-code=arch={arch.as_compute},code=[{arch.as_sm}]', '-O3', '-c', source, '-o', output))

INFO:root:nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Binary analysis

Let’s inspect the generate SASS instructions.

from reprospect.tools import binaries

cuobjdump = {}

for arch in ARCHES:
    cuobjdump[arch], _ = binaries.CuObjDump.extract(
        file=pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}',
        arch=arch,
        cwd=pathlib.Path(tmpdir),
        cubin=f'atomic.1.{arch.as_sm}.cubin',
    )

INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_80.cubin using ('cuobjdump', '--gpu-architecture', 'sm_80', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_80.cubin')).
INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_90.cubin using ('cuobjdump', '--gpu-architecture', 'sm_90', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_90.cubin')).
INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_120.cubin using ('cuobjdump', '--gpu-architecture', 'sm_120', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_120.cubin')).

from reprospect.tools.sass import Decoder

SIGNATURE = 'my_kernel(int *, const int *)'

for arch in ARCHES:
    logging.info(Decoder(code=cuobjdump[arch].functions[SIGNATURE].code))

Show code cell output

Hide code cell output

INFO:root:┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┓
┃ offset ┃ instruction                                     ┃ stall ┃ yield ┃ b0 ┃ b1 ┃ b2 ┃ b3 ┃ b4 ┃ b5 ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━┩
│ 0000   │ MOV R1, c[0x0][0x28]                            │ 2     │ True  │    │    │    │    │    │    │
│ 0010   │ S2R R2, SR_CTAID.X                              │ 1     │ True  │ Wr │    │    │    │    │    │
│ 0020   │ HFMA2.MMA R7, -RZ, RZ, 0, 2.384185791015625e-07 │ 1     │ True  │    │    │    │    │    │    │
│ 0030   │ ULDC.64 UR4, c[0x0][0x118]                      │ 2     │ True  │    │    │    │    │    │    │
│ 0040   │ S2R R3, SR_TID.X                                │ 2     │ True  │ Wr │    │    │    │    │    │
│ 0050   │ IMAD R2, R2, c[0x0][0x0], R3                    │ 5     │ False │ Wa │    │    │    │    │    │
│ 0060   │ IMAD.WIDE.U32 R4, R2, R7, c[0x0][0x168]         │ 6     │ False │    │    │    │    │    │    │
│ 0070   │ LDG.E.CONSTANT R5, [R4.64]                      │ 1     │ True  │    │    │ Wr │    │    │    │
│ 0080   │ IMAD.WIDE.U32 R2, R2, R7, c[0x0][0x160]         │ 5     │ False │    │    │    │    │    │    │
│ 0090   │ RED.E.ADD.STRONG.GPU [R2.64], R5                │ 1     │ True  │    │    │ Wa │    │    │    │
│ 00a0   │ EXIT                                            │ 5     │ True  │    │    │    │    │    │    │
│ 00b0   │ BRA 0xb0                                        │ 0     │ False │    │    │    │    │    │    │
│ 00c0   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 00d0   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 00e0   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 00f0   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0100   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0110   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0120   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0130   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0140   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0150   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0160   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
│ 0170   │ NOP                                             │ 0     │ False │    │    │    │    │    │    │
└────────┴─────────────────────────────────────────────────┴───────┴───────┴────┴────┴────┴────┴────┴────┘
INFO:root:┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┓
┃ offset ┃ instruction                                ┃ stall ┃ yield ┃ b0 ┃ b1 ┃ b2 ┃ b3 ┃ b4 ┃ b5 ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━┩
│ 0000   │ LDC R1, c[0x0][0x28]                       │ 1     │ True  │    │    │    │    │    │    │
│ 0010   │ S2R R0, SR_TID.X                           │ 7     │ True  │ Wr │    │    │    │    │    │
│ 0020   │ S2UR UR4, SR_CTAID.X                       │ 8     │ True  │ Wr │    │    │    │    │    │
│ 0030   │ LDC R7, c[0x0][RZ]                         │ 8     │ True  │ Wr │    │    │    │    │    │
│ 0040   │ LDC.64 R4, c[0x0][0x218]                   │ 8     │ True  │    │ Wr │    │    │    │    │
│ 0050   │ LDC.64 R2, c[0x0][0x210]                   │ 1     │ True  │    │    │ Wr │    │    │    │
│ 0060   │ IMAD R7, R7, UR4, R0                       │ 1     │ True  │ Wa │    │    │    │    │    │
│ 0070   │ ULDC.64 UR4, c[0x0][0x208]                 │ 3     │ False │    │    │    │    │    │    │
│ 0080   │ IMAD.WIDE.U32 R4, R7, 0x4, R4              │ 6     │ False │    │ Wa │    │    │    │    │
│ 0090   │ LDG.E.CONSTANT R5, desc[UR4][R4.64]        │ 1     │ True  │    │    │    │ Wr │    │    │
│ 00a0   │ IMAD.WIDE.U32 R2, R7, 0x4, R2              │ 5     │ False │    │    │ Wa │    │    │    │
│ 00b0   │ REDG.E.ADD.STRONG.GPU desc[UR4][R2.64], R5 │ 1     │ True  │    │    │    │ Wa │    │    │
│ 00c0   │ EXIT                                       │ 5     │ True  │    │    │    │    │    │    │
│ 00d0   │ BRA 0xd0                                   │ 0     │ False │    │    │    │    │    │    │
│ 00e0   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 00f0   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0100   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0110   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0120   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0130   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0140   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0150   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0160   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0170   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
└────────┴────────────────────────────────────────────┴───────┴───────┴────┴────┴────┴────┴────┴────┘
INFO:root:┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┳━━━━┓
┃ offset ┃ instruction                                ┃ stall ┃ yield ┃ b0 ┃ b1 ┃ b2 ┃ b3 ┃ b4 ┃ b5 ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━╇━━━━┩
│ 0000   │ LDC R1, c[0x0][0x37c]                      │ 1     │ True  │    │    │    │    │    │    │
│ 0010   │ S2R R0, SR_TID.X                           │ 7     │ True  │ Wr │    │    │    │    │    │
│ 0020   │ S2UR UR6, SR_CTAID.X                       │ 1     │ True  │ Wr │    │    │    │    │    │
│ 0030   │ LDCU.64 UR4, c[0x0][0x358]                 │ 7     │ True  │    │ Wr │    │    │    │    │
│ 0040   │ LDC R7, c[0x0][0x360]                      │ 8     │ True  │ Wr │    │    │    │    │    │
│ 0050   │ LDC.64 R4, c[0x0][0x388]                   │ 8     │ True  │    │    │ Wr │    │    │    │
│ 0060   │ LDC.64 R2, c[0x0][0x380]                   │ 1     │ True  │    │    │    │ Wr │    │    │
│ 0070   │ IMAD R7, R7, UR6, R0                       │ 4     │ False │ Wa │    │    │    │    │    │
│ 0080   │ IMAD.WIDE.U32 R4, R7, 0x4, R4              │ 6     │ False │    │    │ Wa │    │    │    │
│ 0090   │ LDG.E.CONSTANT R5, desc[UR4][R4.64]        │ 1     │ True  │    │ Wa │ Wr │    │    │    │
│ 00a0   │ IMAD.WIDE.U32 R2, R7, 0x4, R2              │ 5     │ False │    │    │    │ Wa │    │    │
│ 00b0   │ REDG.E.ADD.STRONG.GPU desc[UR4][R2.64], R5 │ 1     │ True  │    │    │ Wa │    │    │    │
│ 00c0   │ EXIT                                       │ 5     │ True  │    │    │    │    │    │    │
│ 00d0   │ BRA 0xd0                                   │ 0     │ False │    │    │    │    │    │    │
│ 00e0   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 00f0   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0100   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0110   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0120   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0130   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0140   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0150   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0160   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
│ 0170   │ NOP                                        │ 0     │ False │    │    │    │    │    │    │
└────────┴────────────────────────────────────────────┴───────┴───────┴────┴────┴────┴────┴────┴────┘

For the kernel defined above, the atomic operation is always translated into a Reduction Operation on Generic Memory on the target architectures.

However, the exact encoding varies with compute capability:

On architectures prior to Hopper, it appears as RED.
On architectures starting from Hopper, it appears as REDG.

for arch, dump in cuobjdump.items():
    if arch.compute_capability < 90:
        assert 'RED.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code
    else:
        assert 'REDG.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code

How do atomic operations translate from CUDA C to SASS ?

Source code

Compilation

Binary analysis

How do atomic operations translate from CUDA `C` to `SASS` ?