How do atomic operations translate from CUDA C to SASS ?
This example demonstrates how ReProspect can be used to quickly analyse how a given C operation is translated into its corresponding SASS instructions.
This approach offers several advantages:
Pedagogical — useful for teaching and learning
Comparative — helps understand how the mapping from
CtoSASSevolves across architectures
Source code
Let’s take a closer look at atomicAdd, described here.
The kernel atomically adds a source array to a destination array, element-wise.
CODE = """\
#include "cuda.h"
__global__ void my_kernel(int * __restrict__ const dst, const int* __restrict__ const src) {
const auto index = blockIdx.x * blockDim.x + threadIdx.x;
atomicAdd(&dst[index], src[index]);
}
"""
Compilation
Compile the source code for a few architectures of your choice.
As this analysis won’t actually run the executable, any architecture supported by nvcc should work.
import logging
import pathlib
import subprocess
import tempfile
from reprospect.tools import architecture
logging.basicConfig(level=logging.INFO)
ARCHES = [
architecture.NVIDIAArch.from_compute_capability(80),
architecture.NVIDIAArch.from_compute_capability(90),
architecture.NVIDIAArch.from_compute_capability(120),
]
logging.info(subprocess.check_output(('nvcc', '--version')).decode())
with tempfile.TemporaryDirectory(delete=False) as tmpdir:
for arch in ARCHES:
source = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}.cu'
output = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}'
source.write_text(CODE)
subprocess.check_call(('nvcc', f'--generate-code=arch={arch.as_compute},code=[{arch.as_sm}]', '-O3', '-c', source, '-o', output))
INFO:root:nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
Binary analysis
Let’s inspect the generate SASS instructions.
from reprospect.tools import binaries
cuobjdump = {}
for arch in ARCHES:
cuobjdump[arch], _ = binaries.CuObjDump.extract(
file=pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}',
arch=arch,
cwd=pathlib.Path(tmpdir),
cubin=f'atomic.1.{arch.as_sm}.cubin',
)
INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_80.cubin using ('cuobjdump', '--gpu-architecture', 'sm_80', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_80.cubin')).
INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_90.cubin using ('cuobjdump', '--gpu-architecture', 'sm_90', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_90.cubin')).
INFO:root:Extracting 'SASS' from /tmp/tmprmyypxmq/atomic.1.sm_120.cubin using ('cuobjdump', '--gpu-architecture', 'sm_120', '--dump-sass', '--dump-resource-usage', PosixPath('/tmp/tmprmyypxmq/atomic.1.sm_120.cubin')).
from reprospect.tools.sass import Decoder
SIGNATURE = 'my_kernel(int *, const int *)'
for arch in ARCHES:
logging.info(Decoder(code=cuobjdump[arch].functions[SIGNATURE].code))
For the kernel defined above, the atomic operation is always translated into a Reduction Operation on Generic Memory on the target architectures.
However, the exact encoding varies with compute capability:
On architectures prior to
Hopper, it appears asRED.On architectures starting from
Hopper, it appears asREDG.
for arch, dump in cuobjdump.items():
if arch.compute_capability < 90:
assert 'RED.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code
else:
assert 'REDG.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code