Dispatch
flowchart TD
block_0["LDC R1, c[0x0][0x37c]<br/>S2R R3, SR_TID.X<br/>S2UR UR4, SR_CTAID.X<br/>LDCU UR5, c[0x0][0x388]<br/>LDC R6, c[0x0][0x360]<br/>IMAD R6, R6, UR4, R3<br/>ISETP.GE.U32.AND P0, PT, R6, UR5, PT<br/>@P0 EXIT"]:::myblock
block_128["LDCU.64 UR4, c[0x0][0x358]<br/>LDC.64 R2, c[0x0][0x380]<br/>LDG.E.64 R2, desc[UR4][R2.64]<br/>LD.E R8, desc[UR4][R2.64]<br/>LDCU.64 UR4, c[0x0][0x380]<br/>HFMA2 R21, -RZ, RZ, 0, 0<br/>BSSY.RECONVERGENT B6, 0x150<br/>MOV R20, 0x140<br/>MOV R4, UR4<br/>MOV R5, UR5<br/>LDC.64 R8, c[0x2][R8]<br/>CALL.REL.NOINC R8 0x0<br/>BSYNC.RECONVERGENT B6"]:::myblock
block_336["EXIT"]:::myblock
block_352["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x270<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_720["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_896["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x490<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1264["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1440["IADD R1, R1, -0x8<br/>MOV R9, R5<br/>MOV R8, R4<br/>STL [R1+0x4], R16<br/>STL [R1], R2<br/>LDC.64 R2, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>MOV R6, 0x0<br/>MOV R4, R8<br/>MOV R5, R9<br/>LDC.64 R6, c[0x4][R6]<br/>IADD.64 R2, R2, 0x10<br/>ST.E.64 desc[UR4][R8.64], R2<br/>MOV R2, R20<br/>MOV R16, R21<br/>LEPC R20, 0x6b0<br/>CALL.ABS.NOINC R6<br/>MOV R20, R2<br/>MOV R21, R16<br/>LDL R2, [R1]<br/>LDL R16, [R1+0x4]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1808["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDC.64 R4, c[0x4][0x8]<br/>LDCU.64 UR4, c[0x0][0x358]<br/>IADD.64 R4, R4, 0x10<br/>ST.E.64 desc[UR4][R2.64], R4<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_1984["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 171<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2192["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 175<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2400["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 187<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2608["IADD R1, R1, -0x8<br/>MOV R3, R5<br/>STL [R1], R2<br/>MOV R2, R4<br/>LDCU.64 UR4, c[0x0][0x358]<br/>LD.E.64 R2, desc[UR4][R2.64+0x8]<br/>IMAD.WIDE.U32 R6, R6, 0x4, R2<br/>LD.E R0, desc[UR4][R6.64]<br/>FADD R5, R0, 191<br/>ST.E desc[UR4][R6.64], R5<br/>LDL R2, [R1]<br/>IADD R1, R1, 0x8<br/>RET.REL.NODEC R20 0x0"]:::myblock
block_2816["BRA 0xb00"]:::myblock
block_0 --> block_128
block_128 --> block_336
block_352 --> block_0
block_720 --> block_0
block_896 --> block_0
block_1264 --> block_0
block_1440 --> block_0
block_1808 --> block_0
block_1984 --> block_0
block_2192 --> block_0
block_2400 --> block_0
block_2608 --> block_0
block_2816 --> block_2816
classDef myblock text-align:left
Control flow graph that partitions the SASS code for dynamic dispatch of virtual functions on device.
CUDA supports polymorphic classes in device code. However, dynamic dispatch for virtual functions incurs overhead:

Direct overhead:
- vtable lookups incur additional instructions and memory traffic
- an indirect call may cause register spills/fills and a jump

Indirect overhead:
- prevents inlining and other compiler optimizations
Zhang et al. [ZAR21] identified the instruction sequence pattern that dynamic dispatch for virtual functions typically generates in machine code:
- load the vtable pointer
- access the vtable to obtain the function offset
- resolve the function address via an additional kernel-specific level of indirection through constant memory
- perform the indirect call
This example compares dynamic dispatch for a virtual function call on device with static dispatch. It analyzes resource usage for both dispatch types and then programmatically verifies the presence of the dynamic dispatch instruction pattern identified by Zhang et al. [ZAR21].
In this way, this example demonstrates how to use ReProspect to create research artifacts that may accompany publications on CUDA code analysis.
- class examples.cuda.virtual_functions.example_dispatch.Derived(*values)
  Bases: StrEnum
  - DERIVED_A = 'DerivedA'
  - DERIVED_B = 'DerivedB'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.Dispatch(*values)
  Bases: StrEnum
  - DYNAMIC = 'dynamic'
  - STATIC = 'static'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.MemberFunction(*values)
  Bases: StrEnum
  - BAR = 'bar'
  - FOO = 'foo'
  - __str__()
    Return str(self).
- class examples.cuda.virtual_functions.example_dispatch.TestAllImplementationsAreInAllKernels
  Bases: TestBinaryAnalysis
  TestVtableLookupFooVsBar.test_all_instructions_identical_except_load_function_offset() shows that the instructions generated for dynamic_foo_kernel and dynamic_bar_kernel are identical except for the instruction that loads the function offset from the vtable. This finding is surprising because it indicates that the SASS code for each kernel must contain the implementations of both virtual functions, even though each kernel only calls one of them. In fact, inspecting the SASS code shows that each kernel contains the implementations of all virtual member functions of all derived classes in the compilation unit.
  - test(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> None
    Assert that the implementations of
    - DerivedA::foo(unsigned int)
    - DerivedB::foo(unsigned int)
    - DerivedA::bar(unsigned int)
    - DerivedB::bar(unsigned int)
    are present in the SASS code of each kernel.
- class examples.cuda.virtual_functions.example_dispatch.TestBinaryAnalysis
  Bases: TestDispatch
  - MARKER: Final[dict[tuple[Derived, MemberFunction], int]] = {(Derived.DERIVED_A, MemberFunction.BAR): 171, (Derived.DERIVED_A, MemberFunction.FOO): 175, (Derived.DERIVED_B, MemberFunction.BAR): 187, (Derived.DERIVED_B, MemberFunction.FOO): 191}
  - SIGNATURE: Final[dict[tuple[Dispatch, MemberFunction], Pattern[str]]] = {(Dispatch.DYNAMIC, MemberFunction.BAR): re.compile('dynamic_bar_kernel'), (Dispatch.DYNAMIC, MemberFunction.FOO): re.compile('dynamic_foo_kernel'), (Dispatch.STATIC, MemberFunction.FOO): re.compile('static_foo_kernel')}
  - property cubin: Path
  - cuobjdump() -> CuObjDump
  - decoder(function: dict[tuple[Dispatch, MemberFunction], Function]) -> dict[tuple[Dispatch, MemberFunction], Decoder]
  - function(cuobjdump: CuObjDump) -> dict[tuple[Dispatch, MemberFunction], Function]
    Collect the SASS code and parse the resource usage information.
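The resource usage information comes from cuobjdump's textual output. As a hedged illustration (not ReProspect's actual parser), a line in the typical cuobjdump resource-usage shape could be parsed like this; the exact line format is an assumption for this sketch:

```python
import re

# Sketch: parse a resource-usage line such as
#   "REG:18 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:928"
# into a dict. The line shape is an assumption modeled on typical
# cuobjdump output, not a documented ReProspect interface.
def parse_resource_usage(line: str) -> dict[str, int]:
    return {key: int(value)
            for key, value in re.findall(r"([A-Z]+(?:\[\d+\])?):(\d+)", line)}

usage = parse_resource_usage("REG:18 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:928")
print(usage["REG"], usage["CONSTANT[0]"])  # -> 18 928
```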
- class examples.cuda.virtual_functions.example_dispatch.TestDispatch
  Bases: CMakeAwareTestCase
  - classmethod get_target_name() -> str
  - test_run() -> None
- class examples.cuda.virtual_functions.example_dispatch.TestDynamicDispatchInstructionSequence
  Bases: TestBinaryAnalysis
  Look for the dynamic dispatch instruction pattern identified by Zhang et al. [ZAR21].
  - MEMBER_FUNCTION: Final[MemberFunction] = 'foo'
  - basic_block_dynamic_call(cfg: Graph) -> BasicBlock
    Find the basic block that contains the dynamic dispatch instruction sequence by looking for a basic block that contains an indirect call instruction.
  - basic_blocks_function_implementations(cfg: Graph) -> tuple[BasicBlock, ...]
    Find the basic blocks that contain the function implementations by looking for basic blocks that contain a FADD instruction with the expected operand.
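A minimal sketch of such a lookup, assuming basic blocks are represented as plain lists of SASS instruction strings keyed by their offset (a simplification, not ReProspect's actual BasicBlock/Graph types), using marker immediates like those in TestBinaryAnalysis.MARKER:

```python
# Sketch: select the basic blocks whose FADD instruction carries the expected
# immediate operand (the marker value). The dict-of-strings representation is
# a stand-in for ReProspect's real data model.
def blocks_with_marker(blocks: dict[int, list[str]], marker: int) -> tuple[int, ...]:
    return tuple(
        offset
        for offset, instructions in blocks.items()
        if any(ins.startswith("FADD") and ins.endswith(f", {marker}")
               for ins in instructions)
    )

# Two blocks modeled after block_1984 and block_2192 in the graph above:
blocks = {
    1984: ["LD.E R0, desc[UR4][R6.64]", "FADD R5, R0, 171"],
    2192: ["LD.E R0, desc[UR4][R6.64]", "FADD R5, R0, 175"],
}
print(blocks_with_marker(blocks, 171))  # -> (1984,)
```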
  - cfg(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> Graph
  - constant_bank_vtable(cuobjdump: CuObjDump, function: dict[tuple[Dispatch, MemberFunction], Function]) -> bytes
    Read the constant memory bank expected to hold the function address that the dynamic dispatch resolves to.
  - test_constant_bank_vtable(basic_blocks_function_implementations: tuple[BasicBlock, ...], constant_bank_vtable: bytes) -> None
    Show that the kernel-specific function address resolution via the constant bank, as in:
      LDC.64 R8, c[0x2][R8]
    may indeed resolve the indirect call, as in:
      CALL.REL.NOINC R10 0x0
    to a function implementation basic block. This is done by verifying that each function implementation basic block offset is among the entries held in the constant bank.
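The final membership check can be sketched as follows; the 64-bit little-endian entry width is an assumption made for illustration, and the bank contents below are hypothetical:

```python
import struct

# Sketch: verify that every function implementation basic block offset appears
# among the entries of the constant bank. Entries are decoded as 64-bit
# little-endian values, which is an assumption for this illustration.
def offsets_resolvable(constant_bank: bytes, block_offsets: tuple[int, ...]) -> bool:
    entries = {value for (value,) in struct.iter_unpack("<Q", constant_bank)}
    return all(offset in entries for offset in block_offsets)

# Hypothetical bank holding the offsets of the four implementation blocks:
bank = struct.pack("<4Q", 1984, 2192, 2400, 2608)
print(offsets_resolvable(bank, (1984, 2608)))  # -> True
print(offsets_resolvable(bank, (1984, 4096)))  # -> False
```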
  - test_instruction_sequence_pattern_dynamic_call(basic_block_dynamic_call: BasicBlock) -> None
    Verify the presence of the instruction sequence pattern described in Zhang et al. [ZAR21] within the dynamic call basic block.
    The instruction sequence pattern looks like:
      LDC.64 R2, c[0x0][0x380]       # Load object pointer (this)
      LDG.E.64 R2, desc[UR4][R2.64]  # Load vtable pointer (dereference this)
      LD.E R8, desc[UR4][R2.64]      # Load function offset from vtable
      ...
      LDC.64 R8, c[0x2][R8]          # Resolve kernel-specific function address via constant bank
      ...
      CALL.REL.NOINC R8 0x0          # Indirect call
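A hedged sketch of how such a pattern could be detected, matching the opcode shapes above as an ordered subsequence over a block's instruction strings; the regular expressions are illustrative approximations, not ReProspect's actual matcher:

```python
import re

# Sketch: the five steps of the dynamic dispatch pattern, each approximated
# by a regular expression over the instruction text shown above.
PATTERN = [
    re.compile(r"LDC\.64 .*c\[0x0\]"),  # load object pointer (this)
    re.compile(r"LDG\.E\.64 "),         # load vtable pointer
    re.compile(r"LD\.E "),              # load function offset from vtable
    re.compile(r"LDC\.64 .*c\[0x2\]"),  # resolve address via constant bank
    re.compile(r"CALL\.REL\.NOINC "),   # indirect call
]

def matches_dispatch_pattern(instructions: list[str]) -> bool:
    # Standard subsequence check: each step must match some instruction
    # strictly after the instruction matched by the previous step.
    it = iter(instructions)
    return all(any(step.search(ins) for ins in it) for step in PATTERN)

block = [
    "LDC.64 R2, c[0x0][0x380]",
    "LDG.E.64 R2, desc[UR4][R2.64]",
    "LD.E R8, desc[UR4][R2.64]",
    "MOV R4, UR4",
    "LDC.64 R8, c[0x2][R8]",
    "CALL.REL.NOINC R8 0x0",
]
print(matches_dispatch_pattern(block))  # -> True
```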
- class examples.cuda.virtual_functions.example_dispatch.TestResourceUsage
  Bases: TestBinaryAnalysis
  Compare resource usage between Dispatch.STATIC and Dispatch.DYNAMIC dispatch for MemberFunction.FOO.
  - MEMBER_FUNCTION: Final[MemberFunction] = 'foo'
  - detailed_register_usage(function: dict[tuple[Dispatch, MemberFunction], Function], nvdisasm: NVDisasm) -> dict[Dispatch, dict[RegisterType, tuple[int, int]]]
  - dynamic(function: dict[tuple[Dispatch, MemberFunction], Function]) -> ResourceUsage
  - nvdisasm(cuobjdump: CuObjDump) -> NVDisasm
  - static(function: dict[tuple[Dispatch, MemberFunction], Function]) -> ResourceUsage
  - test_dynamic_has_bytes_in_constant_bank_for_vtable(static: ResourceUsage, dynamic: ResourceUsage) -> None
  - test_dynamic_uses_more_gprs(static: ResourceUsage, dynamic: ResourceUsage) -> None
  - test_dynamic_uses_more_registers(detailed_register_usage: dict[Dispatch, dict[RegisterType, tuple[int, int]]]) -> None
  - test_dynamic_uses_stack(static: ResourceUsage, dynamic: ResourceUsage) -> None
- class examples.cuda.virtual_functions.example_dispatch.TestVtableLookupFooVsBar
  Bases: TestBinaryAnalysis
  Compare the instructions generated for dynamic_foo_kernel and dynamic_bar_kernel.
  - test_all_instructions_identical_except_load_function_offset(decoder: dict[tuple[Dispatch, MemberFunction], Decoder]) -> None
    All instructions are identical between the two kernels, except for the instruction that loads the function offset from the vtable, which differs by an 8-byte memory address offset.
    For dynamic_foo_kernel, the instruction that loads the function offset from the vtable looks like:
      LD.E R8, desc[UR4][R2.64]
    whereas for dynamic_bar_kernel it looks like:
      LD.E R8, desc[UR4][R2.64+0x8]
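Such a comparison can be sketched over plain instruction strings (a simplification of ReProspect's decoded instructions; the helper name is hypothetical):

```python
# Sketch: positions at which two equally long instruction listings differ.
# The test described above expects exactly one differing position: the load
# of the function offset from the vtable.
def differing_positions(a: list[str], b: list[str]) -> list[int]:
    assert len(a) == len(b), "listings must have equal length"
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

foo = ["LDG.E.64 R2, desc[UR4][R2.64]",
       "LD.E R8, desc[UR4][R2.64]",
       "CALL.REL.NOINC R8 0x0"]
bar = ["LDG.E.64 R2, desc[UR4][R2.64]",
       "LD.E R8, desc[UR4][R2.64+0x8]",
       "CALL.REL.NOINC R8 0x0"]
print(differing_positions(foo, bar))  # -> [1]
```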