Bibliography
Control codes — nervanasystems / maxas wiki. https://github.com/NervanaSystems/maxas/wiki/Control-Codes.
Kokkos ecosystem — a linux foundation project. https://kokkos.org.
Parsing nvidia 'ptxas' output memory types. https://github.com/openwall/john/wiki/Parsing-nvidia-'ptxas'-output---memory-types.
Ari B. Hayes, Fei Hua, Jin Huang, Yanhao Chen, and Eddy Z. Zhang. Decoding CUDA binary - file format. February 2019. URL: https://doi.org/10.5281/zenodo.2339027, doi:10.5281/zenodo.2339027.
Nhut-Minh Ho and Weng-Fai Wong. Exploiting half precision arithmetic in Nvidia GPUs. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2017. URL: http://ieeexplore.ieee.org/document/8091072/, doi:10.1109/HPEC.2017.8091072.
Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, and Antonio González. Analyzing modern NVIDIA GPU cores. 2025. URL: https://arxiv.org/abs/2503.20481.
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA volta GPU architecture via microbenchmarking. 2018. URL: https://arxiv.org/abs/1804.06826.
NVIDIA. Nsight compute. https://developer.nvidia.com/nsight-compute.
NVIDIA. Nsight systems. https://developer.nvidia.com/nsight-systems.
NVIDIA. Printing code generation statistics. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#printing-code-generation-statistics.
NVIDIA. Cuobjdump. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump.
NVIDIA. CUDA binary utilities. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html.
NVIDIA. CUDA c++ programming guide: low-level load and store functions. https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html#low-level-load-and-store-functions. Accessed: 2025-12-28.
NVIDIA. CUDA GPU compute capability. https://developer.nvidia.com/cuda-gpus.
NVIDIA. CUDA c++ programming guide: __restrict__ pointers. https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html#restrict-pointers, 2025. Accessed: 2025-12-28.
Magnus Strengert. Requests, wavefronts, sectors metrics: understanding and optimizing memory-bound kernels with nsight compute. https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32089/, 2021. Presentation at GTC21.
Da Yan, Wei Wang, and Xiaowen Chu. Optimizing batched winograd convolution on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '20, 32–44. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3332466.3374520, doi:10.1145/3332466.3374520.
Mengchi Zhang, Ahmad Alawneh, and Timothy G. Rogers. Judging a type by its pointer: optimizing gpu virtual functions. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, 2021. doi:10.1145/3445814.3446734.
Christian R. Trott, Damien Lebrun-Grandié, Daniel Arndt, Jan Ciesko, Vinh Dang, Nathan Ellingwood, Rahulkumar Gayatri, Evan Harvey, Daisy S. Hollman, Dan Ibanez, Nevin Liber, Jonathan Madsen, Jeff Miles, David Poliakoff, Amy Powell, Sivasankaran Rajamanickam, Mikael Simberg, Dan Sunderland, Bruno Turcksin, and Jeremiah Wilke. Kokkos 3: programming model extensions for the exascale era. IEEE Transactions on Parallel and Distributed Systems, 33(4):805–817, January 2022. doi:10.1109/TPDS.2021.3097283.