Division

Dividing complex numbers is more subtle than it appears. The naïve formula:

\[\frac{a + bi}{c + di} = \frac{ac + bd + (bc - ad)\,i}{c^2 + d^2}\]

computes $c^2 + d^2$, which can overflow for large inputs. Underflow regimes exist as well. Many algorithms have therefore been proposed to make complex division more robust [BS12], each with different trade-offs in numerical accuracy and performance.

Some implementations also claim ISO/IEC 60559 compliance [InternationalOfStandardizationInternationalECommission24], which mandates specific handling of edge cases such as infinite or zero operands. The compliance test can be used to validate several implementations against these requirements. Note that libraries such as CCCL expose an opt-out mechanism for this handling, which can save computation cycles when such operands are not expected.

This example focuses on performance using the Newton fractal [con25] as a benchmark, which requires one complex division per iteration per pixel.

$../../../_images/fractal.svg$

Newton fractal for $z^4 - 1$ with relaxation factor $a = 1$.

class examples.kokkos.complex.example_division.Method(*values)View on GitHub 

Bases: StrEnum

ILogbScalbn = 'ILogbScalbn'

LogbScalbn = 'LogbScalbn'

NormDivision = 'NormDivision'

__str__(): Return str(self).

class examples.kokkos.complex.example_division.TestDivisionView on GitHub 

Bases: CMakeAwareTestCase

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB: Used in TestNCU.report().

classmethod get_target_name() → strView on GitHub 

class examples.kokkos.complex.example_division.TestNCUView on GitHub 

Bases: TestDivision

Kernel profiling.

ELEMENT_COUNT: Final[int] = 256

METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='smsp__inst_executed', pretty_name='smsp__inst_executed', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCorrelation(name='sass__inst_executed_per_opcode'), MetricCounter(name='sm__inst_executed_pipe_fp64', pretty_name='sm__inst_executed_pipe_fp64', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='sm__pipe_fp64_cycles_active', pretty_name='sm__pipe_fp64_cycles_active', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))

NVTX_INCLUDES: Final[tuple[str, ...]] = ('division',)

WARP_COUNT: Final[int] = 8

WARP_SIZE: Final[int] = 32

metrics(report: Report) → dict[Method, ProfilingMetrics]View on GitHub 

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]

report() → ReportView on GitHub 

test_fp64_division_instructions(metrics: dict[Method, ProfilingMetrics]) → NoneView on GitHub : The number of executed MUFU instructions is equal to the number of divisions performed per work item times WARP_COUNT.

test_fp64_instructions_and_cycles(metrics: dict[Method, ProfilingMetrics]) → NoneView on GitHub : In terms of FP64 instructions executed and active pipeline cycles, Method.ILogbScalbn outperforms Method.LogbScalbn, which in turn outperforms Method.NormDivision.

test_fp64_predicate_instructions(metrics: dict[Method, ProfilingMetrics]) → NoneView on GitHub 

Method.LogbScalbn and Method.ILogbScalbn differ in their DSETP count.

Method.LogbScalbn calls Kokkos::isfinite on the Kokkos::logb result, emitting a DSETP. Method.ILogbScalbn replaces this with an integer comparison, staying on the integer pipeline.

test_i2f_f2i_instructions(metrics: dict[Method, ProfilingMetrics]) → NoneView on GitHub : Confirms TestSASS.test_i2f_f2i_instructions().

class examples.kokkos.complex.example_division.TestSASSView on GitHub 

Bases: TestDivision

Binary analysis.

SIGNATURE: Final[dict[Method, Pattern[str]]] = {Method.ILogbScalbn: re.compile('reprospect::examples::kokkos::complex::DivisorLogbScalbn<(\$bool\$1|true), (\$bool\$1|true)>'), Method.LogbScalbn: re.compile('reprospect::examples::kokkos::complex::DivisorLogbScalbn<(\$bool\$1|true), (\$bool\$0|false)>'), Method.NormDivision: re.compile('reprospect::examples::kokkos::complex::DivisorNormDivision<(\$bool\$1|true)>')}

property cubin: PathView on GitHub 

cuobjdump() → CuObjDumpView on GitHub 

decoder(function: dict[Method, Function]) → dict[Method, Decoder]View on GitHub 

detailed_register_usage(function: dict[Method, Function], nvdisasm: NVDisasm) → dict[Method, dict[RegisterType, tuple[int, int]]]View on GitHub 

function(cuobjdump: CuObjDump) → dict[Method, Function]View on GitHub 

nvdisasm(cuobjdump: CuObjDump) → NVDisasmView on GitHub 

test_detailed_register_usage(detailed_register_usage: dict[Method, dict[RegisterType, tuple[int, int]]]) → NoneView on GitHub 

test_i2f_f2i_instructions(decoder: dict[Method, Decoder]) → NoneView on GitHub 

All methods need to execute 4 INT64 to FP64 conversion instructions to convert the clock64() reading to FP64.

Method.LogbScalbn needs additional INT32 to FP64 conversions, likely inside Kokkos::logb, as well as one F2I.F64 instruction for static_cast<int>(logbw).

test_norm_division_uses_more_rcp(decoder: dict[Method, Decoder]) → NoneView on GitHub 

Division is implemented using a Newton-Raphson-like method. It starts by computing an approximation to the reciprocal of the denominator using the MUFU.RCP64H instruction.

References:

https://fp32.org/newton_raphson_division.html

class examples.kokkos.complex.example_division_benchmarking.Method(*values)View on GitHub 

Bases: StrEnum

ILogbScalbn = 'ILogbScalbn'

LogbScalbn = 'LogbScalbn'

NormDivision = 'NormDivision'

__str__(): Return str(self).

property is_ilog: boolView on GitHub 

property is_norm: boolView on GitHub 

class examples.kokkos.complex.example_division_benchmarking.ParametersView on GitHub 

Bases: TypedDict

branching_or_compliance: bool

height: int

method: Method

width: int

class examples.kokkos.complex.example_division_benchmarking.TestDivisionView on GitHub 

Bases: CMakeAwareTestCase

Run the companion executable and make a nice visualization.

PATTERN: Final[Pattern[str]] = regex.Regex('^NewtonFractal<Divisor(?P<divisor>NormDivision|LogbScalbn)(?:<(?:(?P<branching_or_compliance>true|false))?(?:[, ]*(?P<ilogb>true|false))?>)?>/(?P<full>[A-Za-z]+)/width:(?P<width>\\d+)/height:(?P<height>\\d+)', flags=regex.V0)

TIME_UNIT: Final = 'us': Time unit of the benchmark.

classmethod get_target_name() → strView on GitHub 

classmethod params(*, name: str) → ParametersView on GitHub : Parse the name of a case and return parameters.

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]

raw() → dict[str, dict]View on GitHub : Run the benchmark and return the raw JSON-based results.

Warning

Be sure to remove –benchmark_min_time for better converged results.

results(raw: dict) → DataFrameView on GitHub : Processed results.

test_visualize(results: DataFrame) → NoneView on GitHub : Create a visualization of the results.

class examples.kokkos.complex.example_division_plot.TestDivisionView on GitHub 

Bases: CMakeAwareTestCase

EXTENT: Final[tuple[int, ...]] = (-1, 1, -1, 1)

data() → tuple[ndarray, ndarray]View on GitHub : Run the executable and read the output arrays.

classmethod get_target_name() → strView on GitHub 

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]

test(data: tuple) → NoneView on GitHub : Plot the colors and iterations with some artist touch.