Aligning quantization scales before incompatible operations

30 May 2023 by David Corvoysier

As explained in my introduction to Machine Learning quantization, important restrictions apply to operations performed on quantized inputs.

First, additions between the integer mantissas of quantized inputs can only be performed if they are expressed in the same scale.

This comes from the representation of the quantized numbers:

$a = (n - zeropoint_a) * scale_a$

$b = (m - zeropoint_b) * scale_b$

The integer mantissas of $a$ and $b$ can only be added if $scale_a == scale_b$, which allows us to write directly:

$a + b = (n - zeropoint_a + m - zeropoint_b) * scale_a$

Intuitively, this is analogous to saying that you cannot add two quantities expressed in different units (like bytes and kilobytes) without first converting one representation to the other.
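As a quick sanity check, here is a minimal NumPy sketch (with made-up scales and zero-points) showing that the integer mantissas can be summed directly precisely because the two scales are equal:

```python
import numpy as np

# Hypothetical quantization parameters, for illustration only
scale_a, zp_a = 0.05, 3
scale_b, zp_b = 0.05, -7   # same scale as a: mantissas can be added directly

a = np.array([0.4, -1.2, 2.0], dtype=np.float32)
b = np.array([1.5, 0.3, -0.6], dtype=np.float32)

# Quantize: n = round(x / scale) + zeropoint
n = np.round(a / scale_a).astype(np.int32) + zp_a
m = np.round(b / scale_b).astype(np.int32) + zp_b

# Integer addition of the mantissas, valid only because scale_a == scale_b:
# with scale_b = 0.1, the same sum would be meaningless without a rescale.
s = (n - zp_a) + (m - zp_b)

# Dequantizing the integer sum recovers the float sum
print(s * scale_a)  # ~[1.9, -0.9, 1.4]
print(a + b)        # ~[1.9, -0.9, 1.4]
```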

The same kind of restriction also extends to operations that combine the channels of the inputs, such as Matrix Multiplication or Convolution.

For such operations, all channels must use the same scale: in other words, the inputs of these operations must be quantized per-tensor.

The first restriction is a major issue for all Machine Learning models that are not purely sequential. In other words, it is a major issue for virtually all models of the 2020’s, as they all include parallel branches that are eventually merged by an addition layer.

The second restriction used to be rather harmless: most models used to have very homogeneous activations, allowing a lossless quantization to 8-bit per-tensor.

This changed with the introduction of Transformer models, whose activation ranges can vary by a factor of up to 100 between channels, making per-tensor quantization far less effective.

On devices that support float arithmetic, not being able to use the integer mantissas directly is hardly a problem, except maybe for efficiency.

On devices supporting only integer arithmetic, however, this is a serious issue.

In the next paragraphs I will detail a method to align inputs using only integer operations.

Explicitly apply the input scale using fixed-point arithmetic

In a previous post, I introduced the fixed-point representation and explained how it relates to quantization.

Going back to our problem, we see immediately that if the input scales were powers of two, then the inputs could be interpreted as fixed-point numbers, and it would become trivial to align them.
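For instance, an input quantized with $scale = 2^{-7}$ is already a fixed-point number with 7 fractional bits, $x = (n - zeropoint) \cdot 2^{-7}$, and aligning two such inputs is just a matter of shifting their mantissas.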

Here comes the trick: it is actually not that difficult to obtain a fixed-point representation of the inputs, even with a scale that is not a power of two.

As a reminder, a quantized number is represented as:

$x = (n - zeropoint) * scale$

Our goal here is to obtain a fixed-point representation of $x$.

The thing is: fixed-point arithmetic operations produce fixed-point numbers, and the first term $(n - zeropoint)$ is already an 8-bit integer, i.e. a fixed-point number with zero fractional bits, so all we have to do is make sure the scale is also a fixed-point number.

Since the inputs are quantized to 8-bit anyway, an 8-bit mantissa is enough to accurately represent a float32 scale, so we only need to keep the eight most significant bits of the scale mantissa.

You can refer to this fixed-point conversion algorithm for an example of how we can convert the scale to a fixed-point representation.
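As a rough illustration of the idea (this is my own minimal sketch, not the linked algorithm), the conversion boils down to finding the number of fractional bits for which the rounded mantissa still fits on 8 bits:

```python
def to_fixed_point(scale, bits=8):
    """Approximate a positive float scale as mantissa * 2**-frac_bits,
    keeping only the `bits` most significant bits of the mantissa."""
    assert scale > 0
    frac_bits = 0
    # Increase the precision until the mantissa occupies the full `bits` range
    while round(scale * 2 ** (frac_bits + 1)) < 2 ** bits:
        frac_bits += 1
    return round(scale * 2 ** frac_bits), frac_bits

# A scale of 0.0078 is approximated as 128 * 2**-14 = 0.0078125
print(to_fixed_point(0.0078))  # (128, 14)
```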

Now that we have a fixed-point representation of the scale as:

$scale \approx i_s \cdot 2^{-fracbits_s}$

We can derive an approximated fixed-point representation of $x$:

$x \approx ((n - zeropoint) * i_s) \cdot 2^{-fracbits_s}$

Due to the multiplication of the two integers, this representation has a higher bitwidth than the original quantized number, but this should not be an issue since the resulting mantissa only needs to be calculated when the operation is performed, using an intermediate buffer with a larger bitwidth.

Note: if that is an issue, the bitwidth could still be reduced using a right bitshift whose magnitude would be evaluated using the calibration information.
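For illustration, here is a minimal NumPy sketch of this conversion; the helper name and the int32 accumulator are arbitrary choices:

```python
import numpy as np

def to_fixed_point_mantissa(n, zeropoint, scale_mantissa):
    """Return the integer mantissa of x ~ ((n - zeropoint) * i_s) * 2**-frac_bits.

    The result is accumulated in int32 because the product of an 8-bit input
    by an 8-bit scale mantissa does not fit on 8 bits anymore."""
    return (n.astype(np.int32) - zeropoint) * scale_mantissa

# Example with a hypothetical int8 input quantized with scale ~ 0.0078
n = np.array([-12, 0, 87], dtype=np.int8)
i_s, frac_bits = 128, 14          # 128 * 2**-14 ~ 0.0078 (see previous sketch)
x_i = to_fixed_point_mantissa(n, zeropoint=5, scale_mantissa=i_s)
print(x_i * 2.0 ** -frac_bits)    # ~ (n - 5) * 0.0078
```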

Align inputs explicitly after converting them to fixed-point

Using the fixed-point scales obtained as explained in the previous paragraph, it is now possible to align inputs expressed with different scales. Denoting $p$ and $q$ the fixed-point mantissas of $scale_a$ and $scale_b$, with respectively $fracbits_a$ and $fracbits_b$ fractional bits:

$a \approx ((n - zeropoint_a) * p) \cdot 2^{-fracbits_a} = a_i \cdot 2^{-fracbits_a}$

$b \approx ((m - zeropoint_b) * q) \cdot 2^{-fracbits_b} = b_i \cdot 2^{-fracbits_b}$

At quantization time, we can evaluate channel-wise the maximum number of fractional bits for the two inputs we want to combine and produce two relative shifts to be applied to each one of them:

$maxfracbits = max(fracbits_a, fracbits_b)$

$shift_a = maxfracbits - fracbits_a$

$shift_b = maxfracbits - fracbits_b$
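For instance, with $fracbits_a = 14$ and $fracbits_b = 12$, we get $maxfracbits = 14$, $shift_a = 0$ and $shift_b = 2$: the mantissa of $b$ must be shifted left by two bits to be expressed with 14 fractional bits.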

Then the sequence of operations before the addition is to:

  • convert inputs integer mantissa to a fixed-point representation:

$a_i = (n - zeropoint_a) * p$

$b_i = (m - zeropoint_b) * q$

  • align the resulting fixed-point:

$a_i = a_i << shift_a$

$b_i = b_i << shift_b$

  • perform the integer addition

$s_i = a_i + b_i$

This produces a fixed-point tensor with an implicit scale of $2^{-maxfracbits}$.
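Putting the previous steps together, here is a minimal NumPy sketch of the whole sequence (the function name, scales and zero-points are hypothetical):

```python
import numpy as np

def align_and_add(n, zp_a, a_scale_i, a_frac_bits,
                  m, zp_b, b_scale_i, b_frac_bits):
    """Add two quantized tensors using integer arithmetic only.

    The scales are given in fixed-point form: scale ~ scale_i * 2**-frac_bits.
    Returns the integer sum and its implicit number of fractional bits."""
    # 1. convert the integer mantissas to fixed-point (wider accumulator)
    a_i = (n.astype(np.int32) - zp_a) * a_scale_i
    b_i = (m.astype(np.int32) - zp_b) * b_scale_i
    # 2. align both mantissas on the highest number of fractional bits
    max_frac_bits = max(a_frac_bits, b_frac_bits)
    a_i = a_i << (max_frac_bits - a_frac_bits)
    b_i = b_i << (max_frac_bits - b_frac_bits)
    # 3. perform the integer addition
    return a_i + b_i, max_frac_bits

# Hypothetical int8 inputs quantized with different scales
n = np.array([10, -30, 60], dtype=np.int8)
m = np.array([5, 25, -40], dtype=np.int8)
s_i, frac = align_and_add(n, 2, 128, 14, m, -3, 102, 12)
print(s_i * 2.0 ** -frac)  # ~ (n - 2) * 0.0078 + (m + 3) * 0.0249
```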

This additional scale needs to be taken into account when quantizing the outputs of the addition.

Mathematically, this means that the implicit $2^{-maxfracbits}$ factor must be folded into the scale obtained after calibration when requantizing the sum.
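In other words, if $scale_o$ denotes the output scale obtained from calibration, the output mantissa is obtained (ignoring the output zero-point) as:

$o \approx (s_i * 2^{-maxfracbits}) / scale_o$

i.e. the integer sum is effectively requantized with a scale of $scale_o * 2^{maxfracbits}$.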

Note: as mentioned in a previous note, I will explain in another post how this can be achieved using integer arithmetic only.

Generalization to per-axis inputs

The same kind of alignment can be applied to inputs quantized per-axis when reaching an operation that requires per-tensor inputs.

The only difference is that the maximum number of fractional bits is now a scalar value, evaluated across all channels, corresponding to the aligned per-tensor scale:

$maxfracbits = max(fracbits_a)$
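As a sketch (again with hypothetical per-channel fixed-point scales), aligning a per-axis tensor to a single per-tensor scale amounts to shifting each channel up to the common number of fractional bits:

```python
import numpy as np

def align_per_axis(x_i, frac_bits):
    """Align per-channel fixed-point mantissas on a single per-tensor scale.

    x_i: integer mantissas of shape (..., channels)
    frac_bits: per-channel fractional bits of shape (channels,)
    Returns the aligned mantissas and the common number of fractional bits."""
    max_frac_bits = int(np.max(frac_bits))
    # Shift each channel so that they all share max_frac_bits fractional bits
    return x_i << (max_frac_bits - frac_bits), max_frac_bits

# Two channels quantized with different fixed-point scales
x_i = np.array([[100, -20], [7, 33]], dtype=np.int32)
frac_bits = np.array([12, 14])
aligned, frac = align_per_axis(x_i, frac_bits)
print(aligned)  # channel 0 shifted left by 2 bits
print(frac)     # 14: the common scale is 2**-14
```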
