Dispatch
flowchart TD
block_0["LDC R1, c[0x0][0x37c]<br/>S2R R3, SR_TID.X<br/>S2UR UR4, SR_CTAID.X<br/>LDCU UR5, c[0x0][0x388]<br/>LDC R6, c[0x0][0x360]<br/>IMAD R6, R6, UR4, R3<br/>ISETP.GE.U32.AND P0, PT, R6, UR5, PT<br/>@P0 EXIT"]:::myblock
block_128["LDCU.64 UR4, c[0x0][0x358]<br/>LDC.64 R2, c[0x0][0x380]<br/>LDG.E.64 R2, desc[UR4][R2.64]<br/>LD.E R8, desc[UR4][R2.64]<br/>LDCU.64 UR4, c[0x0][0x380]<br/>HFMA2 R21, -RZ, RZ, 0, 0<br/>BSSY.RECONVERGENT B6, 0x150<br/>MOV R20, 0x140<br/>MOV R4, UR4<br/>MOV R5, UR5<br/>LDC.64 R8, c[0x2][R8]<br/>CALL.REL.NOINC R8 0x0<br/>BSYNC.RECONVERGENT B6"]:::myblock
block_336["EXIT"]:::myblock
block_352["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x270<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_720["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_896["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x490<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1264["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1440["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x6b0<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1808["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1984["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 171<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2192["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 175<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2400["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 187<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2608["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 191<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2816["BRA 0xb00"]:::myblock
block_0 --> block_128
block_128 --> block_336
block_352 --> block_0
block_720 --> block_0
block_896 --> block_0
block_1264 --> block_0
block_1440 --> block_0
block_1808 --> block_0
block_1984 --> block_0
block_2192 --> block_0
block_2400 --> block_0
block_2608 --> block_0
block_2816 --> block_2816
classDef myblock text-align:left
Control flow graph that partitions the SASS code for dynamic dispatch of virtual functions on device.
CUDA supports polymorphic classes in device code. However, dynamic dispatch for virtual functions incurs overhead:

Direct overhead:
- vtable lookups incur additional instructions and memory traffic
- an indirect call may cause register spills/fills and a jump

Indirect overhead:
- prevents inlining and other compiler optimizations
Zhang et al. [ZAR21] identified the instruction sequence pattern that dynamic dispatch for virtual functions typically generates in machine code:
- load the vtable pointer
- access the vtable to obtain the function offset
- resolve the function address via an additional kernel-specific level of indirection through constant memory
- perform the indirect call
This example compares dynamic dispatch for a virtual function call on device with static dispatch. It analyzes resource usage for both dispatch types and then programmatically verifies the presence of the dynamic dispatch instruction pattern identified by Zhang et al. [ZAR21].
In this way, this example demonstrates how to use ReProspect to create research artifacts that may accompany publications on CUDA code analysis.
- class examples.cuda.virtual_functions.example_dispatch.Derived(*values)
  Bases: StrEnum
  - DERIVED_A = 'DerivedA'
  - DERIVED_B = 'DerivedB'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.Dispatch(*values)
  Bases: StrEnum
  - DYNAMIC = 'dynamic'
  - STATIC = 'static'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.MemberFunction(*values)
  Bases: StrEnum
  - BAR = 'bar'
  - FOO = 'foo'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.TestAllImplementationsAreInAllKernels
  Bases: TestBinaryAnalysis
  TestVtableLookupFooVsBar.test_all_instructions_identical_except_load_function_offset() shows that the instructions generated for dynamic_foo_kernel and dynamic_bar_kernel are identical except for the instruction that loads the function offset from the vtable. This finding is surprising because it indicates that the SASS code for each kernel must contain the implementations of both virtual functions, even though each kernel only calls one of them. In fact, inspecting the SASS code shows that each kernel contains the implementations of all virtual member functions of all derived classes in the compilation unit.
  - test(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> None
    Assert that the implementations of
    - DerivedA::foo(unsigned int)
    - DerivedB::foo(unsigned int)
    - DerivedA::bar(unsigned int)
    - DerivedB::bar(unsigned int)
    are present in the SASS code of each kernel.
- class examples.cuda.virtual_functions.example_dispatch.TestBinaryAnalysis
  Bases: TestDispatch
  - MARKER: Final[dict[tuple[Derived, MemberFunction], int]] = {(Derived.DERIVED_A, MemberFunction.BAR): 171, (Derived.DERIVED_A, MemberFunction.FOO): 175, (Derived.DERIVED_B, MemberFunction.BAR): 187, (Derived.DERIVED_B, MemberFunction.FOO): 191}
  - SIGNATURE: Final[dict[tuple[Dispatch, MemberFunction], Pattern[str]]] = {(Dispatch.DYNAMIC, MemberFunction.BAR): re.compile('dynamic_bar_kernel'), (Dispatch.DYNAMIC, MemberFunction.FOO): re.compile('dynamic_foo_kernel'), (Dispatch.STATIC, MemberFunction.FOO): re.compile('static_foo_kernel')}
  - property cubin: Path
  - cuobjdump() -> CuObjDump
  - decoder(function: dict[tuple[Dispatch, MemberFunction], Function]) -> dict[tuple[Dispatch, MemberFunction], Decoder]
  - function(cuobjdump: CuObjDump) -> dict[tuple[Dispatch, MemberFunction], Function]
    Collect the SASS code and parse the resource usage information.
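The resource usage information comes from cuobjdump's textual output. As a hedged illustration (not ReProspect's actual parser), a line in the typical cuobjdump resource-usage shape could be parsed like this; the exact line format is an assumption for this sketch:

```python
import re

# Sketch: parse a resource-usage line such as
#   "REG:18 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:928"
# into a dict. The line shape is an assumption modeled on typical
# cuobjdump output, not a documented ReProspect interface.
def parse_resource_usage(line: str) -> dict[str, int]:
    return {key: int(value)
            for key, value in re.findall(r"([A-Z]+(?:\[\d+\])?):(\d+)", line)}

usage = parse_resource_usage("REG:18 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:928")
print(usage["REG"], usage["CONSTANT[0]"])  # -> 18 928
```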
- class examples.cuda.virtual_functions.example_dispatch.TestDispatch
  Bases: CMakeAwareTestCase
  - classmethod get_target_name() -> str
  - test_run() -> None
- class examples.cuda.virtual_functions.example_dispatch.TestDynamicDispatchInstructionSequence
  Bases: TestBinaryAnalysis
  Look for the dynamic dispatch instruction pattern identified by Zhang et al. [ZAR21].
  - MEMBER_FUNCTION: Final[MemberFunction] = 'foo'
  - basic_block_dynamic_call(cfg: Graph) -> BasicBlock
    Find the basic block that contains the dynamic dispatch instruction sequence by looking for a basic block that contains an indirect call instruction.
  - basic_blocks_function_implementations(cfg: Graph) -> tuple[BasicBlock, ...]
    Find the basic blocks that contain the function implementations by looking for basic blocks that contain a FADD instruction with the expected operand.
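A minimal sketch of such a lookup, assuming basic blocks are represented as plain lists of SASS instruction strings keyed by their offset (a simplification, not ReProspect's actual BasicBlock/Graph types), using marker immediates like those in TestBinaryAnalysis.MARKER:

```python
# Sketch: select the basic blocks whose FADD instruction carries the expected
# immediate operand (the marker value). The dict-of-strings representation is
# a stand-in for ReProspect's real data model.
def blocks_with_marker(blocks: dict[int, list[str]], marker: int) -> tuple[int, ...]:
    return tuple(
        offset
        for offset, instructions in blocks.items()
        if any(ins.startswith("FADD") and ins.endswith(f", {marker}")
               for ins in instructions)
    )

# Two blocks modeled after block_1984 and block_2192 in the graph above:
blocks = {
    1984: ["LD.E R0, desc[UR4][R6.64]", "FADD R5, R0, 171"],
    2192: ["LD.E R0, desc[UR4][R6.64]", "FADD R5, R0, 175"],
}
print(blocks_with_marker(blocks, 171))  # -> (1984,)
```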
  - cfg(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> Graph
  - constant_bank_vtable(cuobjdump: CuObjDump, function: dict[tuple[Dispatch, MemberFunction], Function]) -> bytes
    Read the constant memory bank expected to hold the function address that the dynamic dispatch resolves to.
  - test_constant_bank_vtable(basic_blocks_function_implementations: tuple[BasicBlock, ...], constant_bank_vtable: bytes) -> None
    Show that the kernel-specific function address resolution via the constant bank, as in:
      LDC.64 R8, c[0x2][R8]
    may indeed resolve the indirect call, as in:
      CALL.REL.NOINC R10 0x0
    to a function implementation basic block. This is done by verifying that each function implementation basic block offset is among the entries held in the constant bank.
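The final membership check can be sketched as follows; the 64-bit little-endian entry width is an assumption made for illustration, and the bank contents below are hypothetical:

```python
import struct

# Sketch: verify that every function implementation basic block offset appears
# among the entries of the constant bank. Entries are decoded as 64-bit
# little-endian values, which is an assumption for this illustration.
def offsets_resolvable(constant_bank: bytes, block_offsets: tuple[int, ...]) -> bool:
    entries = {value for (value,) in struct.iter_unpack("<Q", constant_bank)}
    return all(offset in entries for offset in block_offsets)

# Hypothetical bank holding the offsets of the four implementation blocks:
bank = struct.pack("<4Q", 1984, 2192, 2400, 2608)
print(offsets_resolvable(bank, (1984, 2608)))  # -> True
print(offsets_resolvable(bank, (1984, 4096)))  # -> False
```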
  - test_instruction_sequence_pattern_dynamic_call(basic_block_dynamic_call: BasicBlock) -> None
    Verify the presence of the instruction sequence pattern described in Zhang et al. [ZAR21] within the dynamic call basic block.
    The instruction sequence pattern looks like:
      LDC.64 R2, c[0x0][0x380]       # Load object pointer (this)
      LDG.E.64 R2, desc[UR4][R2.64]  # Load vtable pointer (dereference this)
      LD.E R8, desc[UR4][R2.64]      # Load function offset from vtable
      ...
      LDC.64 R8, c[0x2][R8]          # Resolve kernel-specific function address via constant bank
      ...
      CALL.REL.NOINC R8 0x0          # Indirect call
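A hedged sketch of how such a pattern could be detected, matching the opcode shapes above as an ordered subsequence over a block's instruction strings; the regular expressions are illustrative approximations, not ReProspect's actual matcher:

```python
import re

# Sketch: the five steps of the dynamic dispatch pattern, each approximated
# by a regular expression over the instruction text shown above.
PATTERN = [
    re.compile(r"LDC\.64 .*c\[0x0\]"),  # load object pointer (this)
    re.compile(r"LDG\.E\.64 "),         # load vtable pointer
    re.compile(r"LD\.E "),              # load function offset from vtable
    re.compile(r"LDC\.64 .*c\[0x2\]"),  # resolve address via constant bank
    re.compile(r"CALL\.REL\.NOINC "),   # indirect call
]

def matches_dispatch_pattern(instructions: list[str]) -> bool:
    # Standard subsequence check: each step must match some instruction
    # strictly after the instruction matched by the previous step.
    it = iter(instructions)
    return all(any(step.search(ins) for ins in it) for step in PATTERN)

block = [
    "LDC.64 R2, c[0x0][0x380]",
    "LDG.E.64 R2, desc[UR4][R2.64]",
    "LD.E R8, desc[UR4][R2.64]",
    "MOV R4, UR4",
    "LDC.64 R8, c[0x2][R8]",
    "CALL.REL.NOINC R8 0x0",
]
print(matches_dispatch_pattern(block))  # -> True
```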
- class examples.cuda.virtual_functions.example_dispatch.TestResourceUsage
  Bases: TestBinaryAnalysis
  Compare resource usage between Dispatch.STATIC and Dispatch.DYNAMIC dispatch for MemberFunction.FOO.
  - MEMBER_FUNCTION: Final[MemberFunction] = 'foo'
  - detailed_register_usage(function: dict[tuple[Dispatch, MemberFunction], Function], nvdisasm: NVDisasm) -> dict[Dispatch, dict[RegisterType, tuple[int, int]]]
  - dynamic(function: dict[tuple[Dispatch, MemberFunction], Function]) -> ResourceUsage
  - nvdisasm(cuobjdump: CuObjDump) -> NVDisasm
  - static(function: dict[tuple[Dispatch, MemberFunction], Function]) -> ResourceUsage
  - test_dynamic_has_bytes_in_constant_bank_for_vtable(static: ResourceUsage, dynamic: ResourceUsage) -> None
  - test_dynamic_uses_more_gprs(static: ResourceUsage, dynamic: ResourceUsage) -> None
  - test_dynamic_uses_more_registers(detailed_register_usage: dict[Dispatch, dict[RegisterType, tuple[int, int]]]) -> None
  - test_dynamic_uses_stack(static: ResourceUsage, dynamic: ResourceUsage) -> None
- class examples.cuda.virtual_functions.example_dispatch.TestVtableLookupFooVsBar
  Bases: TestBinaryAnalysis
  Compare the instructions generated for dynamic_foo_kernel and dynamic_bar_kernel.
  - test_all_instructions_identical_except_load_function_offset(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> None
    All instructions are identical between the two kernels, except for the instruction that loads the function offset from the vtable, which differs by an 8-byte memory address offset.
    For dynamic_foo_kernel, the instruction that loads the function offset from the vtable looks like:
      LD.E R8, desc[UR4][R2.64]
    whereas for dynamic_bar_kernel it looks like:
      LD.E R8, desc[UR4][R2.64+0x8]
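Such a comparison can be sketched over plain instruction strings (a simplification of ReProspect's decoded instructions; the helper name is hypothetical):

```python
# Sketch: positions at which two equally long instruction listings differ.
# The test described above expects exactly one differing position: the load
# of the function offset from the vtable.
def differing_positions(a: list[str], b: list[str]) -> list[int]:
    assert len(a) == len(b), "listings must have equal length"
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

foo = ["LDG.E.64 R2, desc[UR4][R2.64]",
       "LD.E R8, desc[UR4][R2.64]",
       "CALL.REL.NOINC R8 0x0"]
bar = ["LDG.E.64 R2, desc[UR4][R2.64]",
       "LD.E R8, desc[UR4][R2.64+0x8]",
       "CALL.REL.NOINC R8 0x0"]
print(differing_positions(foo, bar))  # -> [1]
```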