Division

Dividing complex numbers is more subtle than it appears. The naïve formula:

\[\frac{a + bi}{c + di} = \frac{ac + bd + (bc - ad)\,i}{c^2 + d^2}\]

computes \(c^2 + d^2\), which can overflow for large inputs. Underflow regimes exist as well. Many algorithms have therefore been proposed to make complex division more robust [BS12], each with different trade-offs in numerical accuracy and performance. These algorithms typically use a scaling strategy. A further challenge in complex division is the handling of edge cases such as infinite or zero operands.

The standard ISO/IEC 9899 (Annex G) [InternationalOfStandardizationInternationalECommission24] raises the issue of overflow and underflow and specifies the expected outcomes for the edge cases to comply with IEC 60599 real floating-point arithmetic. It also suggests an implementation that adopts a scaling approach using the scalbn function and relies on branching to handle the edge cases. Compilers such as LLVM and libraries such as CCCL follow this suggested implementation. They also expose mechanisms to disable the branching, which can save computation cycles when such operands are not expected.

The current implementation of complex division in Kokkos does not follow the ISO/IEC 9899 (Annex G) implementation. Instead, it adopts a scaling approach using a division, and it only partially complies with the handling of the edge cases. Their implementation thus entails more divisions than the ISO/IEC 9899 (Annex G) implementation, see examples.kokkos.complex.example_division.TestSASS.test_norm_division_uses_more_rcp().

This example also explores a potential improvement of the ISO/IEC 9899 (Annex G) implementation that replaces the use of logb followed by a cast to int with a use of ilogb. This change decreases the number of fp64 instructions and cycles, see examples.kokkos.complex.example_division.TestNCU.test_fp64_instructions_and_cycles().

This example thus compares three implementations:

  1. examples.kokkos.complex.example_division.Method.NormDivision

  2. examples.kokkos.complex.example_division.Method.LogbScalbn

  3. examples.kokkos.complex.example_division.Method.ILogbScalbn

Both examples.kokkos.complex.example_division.Method.LogbScalbn and examples.kokkos.complex.example_division.Method.ILogbScalbn use a common C++ implementation templated on a flag that allows the handling of edge cases to be disabled and another flag that controls the use of ilogb.

This example has three parts:

  1. A compliance test that checks whether each implementation complies with the expected outcomes for the edge cases.

  2. A binary analysis in examples.kokkos.complex.example_division.TestSASS() and a kernel profiling in examples.kokkos.complex.example_division.TestNCU().

  3. A benchmark in examples.kokkos.complex.example_division_benchmarking.TestDivision().

The benchmark compares the performance of the three implementations using the Newton fractal [con25], which requires one complex division per iteration per pixel:

../../../_images/fractal.svg

Newton fractal for \(z^4 - 1\) with relaxation factor \(a = 1\).

class examples.kokkos.complex.example_division.Method(*values)View on GitHub

Bases: StrEnum

ILogbScalbn = 'ILogbScalbn'

Same as Method.LogbScalbn, but using the ilogb function.

LogbScalbn = 'LogbScalbn'

Implementation that uses a scaling with the scalbn function as suggested in [InternationalOfStandardizationInternationalECommission24].

NormDivision = 'NormDivision'

Implementation that uses the division-based scaling currently implemented in Kokkos.

__str__()

Return str(self).

class examples.kokkos.complex.example_division.TestDivisionView on GitHub

Bases: CMakeAwareTestCase

KOKKOS_TOOLS_NVTX_CONNECTOR_LIB

Used in TestNCU.report().

classmethod get_target_name() strView on GitHub
class examples.kokkos.complex.example_division.TestNCUView on GitHub

Bases: TestDivision

Kernel profiling.

ELEMENT_COUNT: Final[int] = 256
METRICS: tuple[Metric | MetricCorrelation, ...] = (MetricCounter(name='smsp__inst_executed', pretty_name='smsp__inst_executed', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCorrelation(name='sass__inst_executed_per_opcode'), MetricCounter(name='sm__inst_executed_pipe_fp64', pretty_name='sm__inst_executed_pipe_fp64', subs=(<MetricCounterRollUp.SUM: 'sum'>,)), MetricCounter(name='sm__pipe_fp64_cycles_active', pretty_name='sm__pipe_fp64_cycles_active', subs=(<MetricCounterRollUp.SUM: 'sum'>,)))
NVTX_INCLUDES: Final[tuple[str, ...]] = ('division',)
WARP_COUNT: Final[int] = 8
WARP_SIZE: Final[int] = 32
metrics(report: Report) dict[Method, ProfilingMetrics]View on GitHub
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
report() ReportView on GitHub
test_fp64_division_instructions(metrics: dict[Method, ProfilingMetrics]) NoneView on GitHub

The number of executed MUFU instructions is equal to the number of divisions performed per work item times WARP_COUNT.

test_fp64_instructions_and_cycles(metrics: dict[Method, ProfilingMetrics]) NoneView on GitHub

In terms of FP64 instructions executed and active pipeline cycles, Method.ILogbScalbn outperforms Method.LogbScalbn, which in turn outperforms Method.NormDivision.

test_fp64_predicate_instructions(metrics: dict[Method, ProfilingMetrics]) NoneView on GitHub

Method.LogbScalbn and Method.ILogbScalbn differ in their DSETP count.

Method.LogbScalbn calls Kokkos::isfinite on the Kokkos::logb result, emitting a DSETP. Method.ILogbScalbn replaces this with an integer comparison, staying on the integer pipeline.

test_i2f_f2i_instructions(metrics: dict[Method, ProfilingMetrics]) NoneView on GitHub

Confirms TestSASS.test_i2f_f2i_instructions().

class examples.kokkos.complex.example_division.TestSASSView on GitHub

Bases: TestDivision

Binary analysis.

SIGNATURE: Final[dict[Method, Pattern[str]]] = {Method.ILogbScalbn: re.compile('reprospect::examples::kokkos::complex::DivisorLogbScalbn<(\\(bool\\)1|true), (\\(bool\\)1|true)>'), Method.LogbScalbn: re.compile('reprospect::examples::kokkos::complex::DivisorLogbScalbn<(\\(bool\\)1|true), (\\(bool\\)0|false)>'), Method.NormDivision: re.compile('reprospect::examples::kokkos::complex::DivisorNormDivision<(\\(bool\\)1|true)>')}
property cubin: PathView on GitHub
cuobjdump() CuObjDumpView on GitHub
decoder(function: dict[Method, Function]) dict[Method, Decoder]View on GitHub
detailed_register_usage(function: dict[Method, Function], nvdisasm: NVDisasm) dict[Method, dict[RegisterType, tuple[int, int]]]View on GitHub
function(cuobjdump: CuObjDump) dict[Method, Function]View on GitHub
nvdisasm(cuobjdump: CuObjDump) NVDisasmView on GitHub
test_detailed_register_usage(detailed_register_usage: dict[Method, dict[RegisterType, tuple[int, int]]]) NoneView on GitHub
test_i2f_f2i_instructions(decoder: dict[Method, Decoder]) NoneView on GitHub

All methods need to execute 4 INT64 to FP64 conversion instructions to convert the clock64() reading to FP64.

Method.LogbScalbn needs additional INT32 to FP64 conversions, likely inside Kokkos::logb, as well as one F2I.F64 instruction for static_cast<int>(logbw).

test_norm_division_uses_more_rcp(decoder: dict[Method, Decoder]) NoneView on GitHub

Division is implemented using a Newton-Raphson-like method. It starts by computing an approximation to the reciprocal of the denominator using the MUFU.RCP64H instruction.

References:

class examples.kokkos.complex.example_division_benchmarking.Method(*values)View on GitHub

Bases: StrEnum

ILogbScalbn = 'ILogbScalbn'
LogbScalbn = 'LogbScalbn'
NormDivision = 'NormDivision'
__str__()

Return str(self).

property is_ilog: boolView on GitHub
property is_norm: boolView on GitHub
class examples.kokkos.complex.example_division_benchmarking.ParametersView on GitHub

Bases: TypedDict

branching_or_compliance: bool
height: int
method: Method
width: int
class examples.kokkos.complex.example_division_benchmarking.TestDivisionView on GitHub

Bases: CMakeAwareTestCase

Run the companion executable and make a nice visualization.

PATTERN: Final[Pattern[str]] = regex.Regex('^NewtonFractal<Divisor(?P<divisor>NormDivision|LogbScalbn)(?:<(?:(?P<branching_or_compliance>true|false))?(?:[, ]*(?P<ilogb>true|false))?>)?>/(?P<full>[A-Za-z]+)/width:(?P<width>\\d+)/height:(?P<height>\\d+)', flags=regex.V0)
TIME_UNIT: Final = 'us'

Time unit of the benchmark.

classmethod get_target_name() strView on GitHub
classmethod params(*, name: str) ParametersView on GitHub

Parse the name of a case and return parameters.

pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
raw() dict[str, dict]View on GitHub

Run the benchmark and return the raw JSON-based results.

Warning

Be sure to remove –benchmark_min_time for better converged results.

results(raw: dict) DataFrameView on GitHub

Processed results.

test_visualize(results: DataFrame) NoneView on GitHub

Create a visualization of the results.

class examples.kokkos.complex.example_division_plot.TestDivisionView on GitHub

Bases: CMakeAwareTestCase

EXTENT: Final[tuple[int, ...]] = (-1, 1, -1, 1)
data() tuple[ndarray, ndarray]View on GitHub

Run the executable and read the output arrays.

classmethod get_target_name() strView on GitHub
pytestmark = [Mark(name='skipif', args=(True,), kwargs={'reason': 'needs a GPU'})]
test(data: tuple) NoneView on GitHub

Plot the colors and iterations with some artist touch.