Integration tests
Tests that exercise many features, but do not seek to exemplify use cases. For use cases, see Examples.
Half types
Analyze two kernels that compute the square of half-precision values with and without packing using:
many matchers from
reprospect.test.sasssome profiling with
reprospect.tools.ncu
- class tests.integration.test_half.TestNCUView on GitHub
Bases:
objectncu-based analysis of the individual vs packed implementation.
- METRICS: Final[tuple[Metric | MetricCorrelation | MetricDeviceAttribute, ...]] = (MetricDeviceAttribute(name='display_name'), MetricCounter(name='smsp__sass_inst_executed_op_global_ld', pretty_name='L1/TEX cache global load instructions sass', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_requests_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load requests', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='l1tex__t_sectors_pipe_lsu_mem_global_op_ld', pretty_name='L1/TEX cache global load sectors', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), Metric(name='launch__grid_dim_x', pretty_name='launch__grid_dim_x', subs=None), Metric(name='launch__grid_dim_y', pretty_name='launch__grid_dim_y', subs=None), Metric(name='launch__grid_dim_z', pretty_name='launch__grid_dim_z', subs=None), Metric(name='launch__block_dim_x', pretty_name='launch__block_dim_x', subs=None), Metric(name='launch__block_dim_y', pretty_name='launch__block_dim_y', subs=None), Metric(name='launch__block_dim_z', pretty_name='launch__block_dim_z', subs=None))
- pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
- results(workdir: Path, bindir: Path) ProfilingResultsView on GitHub
- test_memory(results: ProfilingResults) NoneView on GitHub
Compare the memory traffic.
- class tests.integration.test_half.TestSASSView on GitHub
Bases:
objectTests that combine different half-precision SASS instructions.
- cuobjdump(workdir: Path, parameters: Parameters, cmake_file_api: FileAPI) CuObjDumpView on GitHub
- pytestmark = [Mark(name='parametrize', args=('parameters', (Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.VOLTA: 'VOLTA'>, compute_capability=ComputeCapability(major=7, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.TURING: 'TURING'>, compute_capability=ComputeCapability(major=7, minor=5))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.AMPERE: 'AMPERE'>, compute_capability=ComputeCapability(major=8, minor=6))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.ADA: 'ADA'>, compute_capability=ComputeCapability(major=8, minor=9))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.HOPPER: 'HOPPER'>, compute_capability=ComputeCapability(major=9, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=10, minor=0))), Parameters(arch=NVIDIAArch(family=<NVIDIAFamily.BLACKWELL: 'BLACKWELL'>, compute_capability=ComputeCapability(major=12, minor=0))))), kwargs={'ids': <class 'str'>, 'scope': 'class'})]
- test_individual(parameters: Parameters, cuobjdump: CuObjDump) NoneView on GitHub
Analyse the individual implementation.
It loads only 16 bits at once and does a “broadcast” of the lower lane (
H0_H0) becauseHMUL2works with 2 lanes (packed instruction).Typically:
LDG.E.U16.CONSTANT.SYS R2, [R2] HMUL2 R0, R2.H0_H0, R2.H0_H0 STG.E.U16.SYS [R4], R0
- test_packed(parameters: Parameters, cuobjdump: CuObjDump) NoneView on GitHub
Analyse the packed implementation.
First, there is a block that performs the “odd” element and therefore looks like the individual implementation:
LDG.E.U16.CONSTANT.SYS R2, [R2] HMUL2 R0, R2.H0_H0, R2.H0_H0
Then, there is another block that performs the packed multiplication. It loads 32 bits at once. Typically:
LDG.E.CONSTANT.SYS R2, [R2] HMUL2 R7, R2, R2
Note that, even though the PTX always reads:
mul.f16x2 %r8,%r9,%r9
ptxasmay choose to useHFMA2:HFMA2 R7, R2, R2, -RZ
or even
HFMA2.MMA:HFMA2.MMA R7, R2, R2, -RZ
instead of
HMUL2, depending on the targeted architecture.