{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "How do atomic operations translate from CUDA `C` to `SASS` ?\n",
    "=======================================================\n",
    "\n",
    "This example demonstrates how `ReProspect` can be used to quickly analyse how a given `C` operation is translated into its corresponding `SASS` instructions.\n",
    "\n",
    "This approach offers several advantages:\n",
    "- **Pedagogical** — useful for teaching and learning\n",
    "- **Comparative** — helps understand how the mapping from `C` to `SASS` evolves across architectures"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "Source code\n",
    "-----------\n",
    "\n",
    "Let's take a closer look at `atomicAdd`, described [here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomicadd).\n",
    "The kernel atomically adds a source array to a destination array, element-wise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "CODE = \"\"\"\\\n",
    "#include \"cuda.h\"\n",
    "\n",
    "__global__ void my_kernel(int * __restrict__ const dst, const int* __restrict__ const src) {\n",
    "    const auto index = blockIdx.x * blockDim.x + threadIdx.x;\n",
    "    atomicAdd(&dst[index], src[index]);\n",
    "}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "Compilation\n",
    "-----------\n",
    "\n",
    "Compile the source code for a few architectures of your choice.\n",
    "As this analysis won't actually run the executable, any architecture supported by ``nvcc`` should work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import pathlib\n",
    "import subprocess\n",
    "import tempfile\n",
    "\n",
    "from reprospect.tools import architecture\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)\n",
    "\n",
    "ARCHES = [\n",
    "    architecture.NVIDIAArch.from_compute_capability(80),\n",
    "    architecture.NVIDIAArch.from_compute_capability(90),\n",
    "    architecture.NVIDIAArch.from_compute_capability(120),\n",
    "]\n",
    "\n",
    "logging.info(subprocess.check_output(('nvcc', '--version')).decode())\n",
    "\n",
    "with tempfile.TemporaryDirectory(delete=False) as tmpdir:\n",
    "    for arch in ARCHES:\n",
    "        source = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}.cu'\n",
    "        output = pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}'\n",
    "\n",
    "        source.write_text(CODE)\n",
    "\n",
    "        subprocess.check_call(('nvcc', f'--generate-code=arch={arch.as_compute},code=[{arch.as_sm}]', '-O3', '-c', source, '-o', output))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "Binary analysis\n",
    "---------------\n",
    "\n",
    "Let's inspect the generate `SASS` instructions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from reprospect.tools import binaries\n",
    "\n",
    "cuobjdump = {}\n",
    "\n",
    "for arch in ARCHES:\n",
    "    cuobjdump[arch], _ = binaries.CuObjDump.extract(\n",
    "        file=pathlib.Path(tmpdir) / f'atomic.{arch.as_sm}',\n",
    "        arch=arch,\n",
    "        cwd=pathlib.Path(tmpdir),\n",
    "        cubin=f'atomic.1.{arch.as_sm}.cubin',\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "from reprospect.tools.sass import Decoder\n",
    "\n",
    "SIGNATURE = 'my_kernel(int *, const int *)'\n",
    "\n",
    "for arch in ARCHES:\n",
    "    logging.info(Decoder(code=cuobjdump[arch].functions[SIGNATURE].code))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "For the kernel defined above, the atomic operation is always translated into a *Reduction Operation on Generic Memory* on the target architectures.\n",
    "\n",
    "However, the exact encoding varies with compute capability:\n",
    "- On architectures **prior to** ``Hopper``, it appears as ``RED``.\n",
    "- On architectures **starting from** ``Hopper``, it appears as ``REDG``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "for arch, dump in cuobjdump.items():\n",
    "    if arch.compute_capability < 90:\n",
    "        assert 'RED.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code\n",
    "    else:\n",
    "        assert 'REDG.E.ADD.STRONG.GPU' in dump.functions[SIGNATURE].code"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv-3.12",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}