<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kaizou</title>
    <link>http://www.kaizou.org/</link>
    <atom:link href="http://www.kaizou.org/feed/index.xml" rel="self" type="application/rss+xml" />
    <description>Technology blog about Web development and Open Source</description>
    <language>en-us</language>
    <pubDate>Sun, 11 Jun 2023 10:25:07 +0000</pubDate>
    <lastBuildDate>Sun, 11 Jun 2023 10:25:07 +0000</lastBuildDate>

    
    <item>
      <title>
          <![CDATA[
          Aligning quantization scales before incompatible operations
          ]]>
      </title>
      <link>http://www.kaizou.org/2023/05/quantization-scales-alignment.html</link>
      <pubDate>Tue, 30 May 2023 12:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2023/05/quantization-scales-alignment</guid>
      <description>
          <![CDATA[
          <p>As explained in my introduction to <a href="/2023/05/ml-quantization-introduction#quantized-linear-operations">Machine Learning quantization</a>,
 important restrictions apply to operations performed on quantized inputs.</p>

<p>First, additions between the integer mantissas of quantized inputs can only be performed if they share the same scale.</p>

<p>This comes from the representation of the quantized numbers:</p>

<p>$a = (n - zeropoint_a) * scale_a$</p>

<p>$b = (m - zeropoint_b) * scale_b$</p>

<p>The integer mantissas of $a$ and $b$ can only be added if $scale_a == scale_b$, allowing us to write directly:</p>

<p>$a + b = (n - zeropoint_a + m - zeropoint_b) * scale_a$</p>

<p>Intuitively, this is analogous to saying that you cannot add two quantities expressed in different units (like bytes and kilobytes) without first converting one
representation to the other.</p>

<!--more-->

<p>The same kind of restriction can also be extended to operations that combine the channels of the inputs, such as the Matrix Multiplication or the
Convolution.</p>

<p>For such operations, the channels must all be in the same scale: in other words, the inputs of these operations must be quantized per-tensor.</p>

<p>The first restriction is a major issue for all Machine Learning models that are not purely sequential. In other words, it is a major issue for virtually all models
of the 2020s, as they all include parallel branches that are eventually merged with an addition layer.</p>

<p>The second restriction used to be rather harmless: most models used to have very homogeneous activations, allowing a lossless quantization to 8-bit per-tensor.</p>

<p>This changed with the introduction of Transformer models, whose activation ranges can vary by a factor of up to 100 between channels, making
per-tensor quantization less efficient.</p>

<p>On devices that support float arithmetic, not being able to use the integer mantissa directly is hardly a problem, except maybe for efficiency.</p>

<p>On devices supporting only integer arithmetic, this is a serious issue.</p>

<p>In the next paragraphs I will detail a method to align inputs using only integer operations.</p>

<h2 id="explicitly-apply-input-scale-using-fixed-point-arithmetics">Explicitly apply input scale using fixed-point arithmetics</h2>

<p>In a previous post, I introduced the <a href="/2023/05/quantization-fixed-point">fixed-point representation</a> and explained how it relates to quantization.</p>

<p>Going back to our problem, we see immediately that if the scales of the inputs were power-of-two’s, then the inputs
could be interpreted as fixed-point numbers, and it would become trivial to align them.</p>

<p>Here comes the trick: it is actually not that difficult to obtain a fixed-point representation of the inputs, even
with a scale that is not a power-of-two.</p>

<p>As a reminder, a quantized number is represented as:</p>

<p>$x = (n - zeropoint) * scale$</p>

<p>Our goal here is to obtain a fixed-point representation of $x$.</p>

<p>The thing is: fixed-point arithmetic operations produce fixed-point numbers, and the first term is already an 8-bit integer,
i.e. a fixed-point with zero fractional bits, so all we have to do is to make sure the scale is a fixed-point number.</p>

<p>Since the inputs are quantized to 8-bit anyway, an 8-bit mantissa is enough to accurately represent a <code class="language-plaintext highlighter-rouge">float32</code> scale, so
we only need to keep the 8 most significant bits of the scale mantissa.</p>

<p>You can refer to this <a href="/2023/05/quantization-fixed-point">fixed-point conversion algorithm</a> for an example of how we can
convert the scale to a fixed-point representation.</p>

<p>Now that we have a fixed-point representation of the scale as:</p>

<p>$scale \approx i_s . 2^{-fracbits_s}$</p>

<p>We can derive an approximated fixed-point representation of $x$:</p>

<p>$x \approx ((n - zeropoint) * i_s). 2^{-fracbits_s}$</p>

<p>Due to the multiplication of the two integers, this representation has a higher bitwidth than the original quantized
number, but this should not be an issue since the resulting mantissa only needs to be calculated when the operation is
performed, using an intermediate buffer with a larger bitwidth.</p>

<blockquote>
  <p>Note: If that is an issue, then it could still be reduced using a right bitshift whose magnitude would be evaluated using the
calibration information.</p>
</blockquote>

<h2 id="align-inputs-explicitly-after-converting-them-to-fixed-point">Align inputs explicitly after converting them to fixed-point</h2>

<p>Using the fixed-point scales obtained as specified in the previous paragraph, it is now possible to align
inputs expressed with different scales.</p>

<p>$a \approx ((n - zeropoint_a) * p) . 2^{-fracbits_a} = a_i . 2^{-fracbits_a}$</p>

<p>$b \approx ((m - zeropoint_b) * q) . 2^{-fracbits_b} = b_i . 2^{-fracbits_b}$</p>

<p>At quantization time, we can evaluate channel-wise the maximum number of fractional bits for the two inputs we
want to combine and produce two relative shifts to be applied to each one of them:</p>

<p>$maxfracbits = max(fracbits_a, fracbits_b)$</p>

<p>$shift_a = maxfracbits - fracbits_a$</p>

<p>$shift_b = maxfracbits - fracbits_b$</p>

<p>Then the sequence of operations before the addition is to:</p>

<ul>
  <li>convert the inputs’ integer mantissas to a fixed-point representation:</li>
</ul>

<p>$a_i = (n - zeropoint_a) * p$</p>

<p>$b_i = (m - zeropoint_b) * q$</p>

<ul>
  <li>align the resulting fixed-point:</li>
</ul>

<p>$a_i = a_i << shift_a$</p>

<p>$b_i = b_i << shift_b$</p>

<ul>
  <li>perform the integer addition</li>
</ul>

<p>$s_i = a_i + b_i$</p>

<p>This produces a fixed-point tensor with an implicit scale of $2^{-maxfracbits}$.</p>
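<p>To make the procedure concrete, below is a minimal Python sketch of the whole alignment on scalar inputs. The quantization parameters and the simplified <code>to_fixed_point</code> helper are purely illustrative:</p>

```python
import numpy as np

def to_fixed_point(x, bitwidth=8):
    """Return (mantissa, frac_bits) such that x ~= mantissa * 2**-frac_bits."""
    frac_bits = bitwidth - int(np.ceil(np.log2(abs(x))))
    return int(round(x * 2.0 ** frac_bits)), frac_bits

# Illustrative quantized inputs: (integer mantissa, zero-point, float scale)
n, zp_a, scale_a = 100, 5, 0.043
m, zp_b, scale_b = 50, -3, 0.017

# At quantization time: convert each scale to fixed-point
p, frac_a = to_fixed_point(scale_a)   # scale_a ~= p * 2**-frac_a
q, frac_b = to_fixed_point(scale_b)   # scale_b ~= q * 2**-frac_b
maxfrac = max(frac_a, frac_b)

# At inference time: integer-only conversion, alignment and addition
a_i = (n - zp_a) * p                  # fixed-point, frac_a fractional bits
b_i = (m - zp_b) * q                  # fixed-point, frac_b fractional bits
s_i = (a_i << (maxfrac - frac_a)) + (b_i << (maxfrac - frac_b))

# s_i * 2**-maxfrac approximates the float sum
s_float = (n - zp_a) * scale_a + (m - zp_b) * scale_b
assert abs(s_i * 2.0 ** -maxfrac - s_float) < 1e-2
```

<p>Only the two multiplications and shifts happen at inference time: the fixed-point scales and the shift amounts are evaluated once, at quantization time.</p>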

<p>This additional scale needs to be taken into account when quantizing the outputs of the addition.</p>

<p>Mathematically, this means that the scale of the outputs obtained after calibration must be multiplied
by $2^{-maxfracbits}$.</p>

<blockquote>
  <p>Note: as mentioned in a previous note, I will explain in another post how this can be achieved using integer arithmetics
only.</p>
</blockquote>

<h2 id="generalization-to-per-axis-inputs">Generalization to per-axis inputs</h2>

<p>The same kind of alignment can be applied to inputs quantized per-axis when reaching an operation that requires
per-tensor inputs.</p>

<p>The only difference is that the maximum number of fractional bits is a scalar value corresponding to the aligned
per-tensor scale:</p>

<p>$maxfracbits = max(fracbits_a)$</p>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Resolve quantization scales after an operation
          ]]>
      </title>
      <link>http://www.kaizou.org/2023/05/quantization-scale-out.html</link>
      <pubDate>Mon, 29 May 2023 12:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2023/05/quantization-scale-out</guid>
      <description>
          <![CDATA[
          <p>As explained in my introduction to <a href="/2023/05/ml-quantization-introduction#quantized-linear-operations">Machine Learning quantization</a>,
 the inputs, weights and outputs of a quantized operation are quantized each with a different scale.</p>

<p>In the same post, I explain how these scales can be folded into a single output scale, allowing the operation to be performed on the integer mantissa
of the quantized inputs and weights:</p>

<p>$scale_{folded} = \frac{scale_{out}}{scale_{in} . scale_{w}}$</p>

<p>In <a href="/2023/05/quantization-scales-alignment">another post</a> I explain how heterogeneous input scales can be converted to a fixed-point representation
and aligned before the operation, resulting in yet another implicit scale expressed as a power-of-two that needs to be applied to the output scale.</p>

<p>In this post I explain how these output scales can be applied using integer arithmetics only.</p>

<!--more-->

<h2 id="reminder-how-are-output-scales-applied-in-a-quantized-graph">Reminder: how are output scales applied in a quantized graph</h2>

<p>As a general principle, the last step of a quantized operation is a downscale to reduce the output bitwidth.</p>

<p>When applied to float outputs, the general formula for the downscale is:</p>

<p>$outputs_{uint8} = saturate(round(\frac{outputs_{float32}}{scale_{out}}) + zp_{out})$</p>

<p>For a quantized output of scale $scale_{out}$ and zero-point $zp_{out}$.</p>

<p>As explained in my <a href="/2023/05/ml-quantization-introduction#quantized-linear-operations">quantization introduction</a>,
some compatible operations can be applied directly on the integer mantissa of the quantized inputs and weights,
folding the inputs and weights scale into the output scale.</p>

<p>The downscale operation then becomes:</p>

<p>$outputs_{uint8} = saturate(round(\frac{outputs_{int32}}{scale_{folded}}) + zp_{out})$</p>

<p>with $scale_{folded} = \frac{scale_{out}}{scale_{in} . scale_{w}}$</p>

<p>This operation still requires a division and a rounding that are not easily implemented using integer arithmetic operators.</p>

<h2 id="use-fixed-point-folded-scale-reciprocal-to-obtain-rescaled-fixed-point-outputs">Use fixed-point folded scale reciprocal to obtain rescaled fixed-point outputs</h2>

<p>The idea is to convert the scale to a fixed-point representation to be able to take advantage of integer arithmetic operators
and obtain a fixed-point representation of the downscaled outputs.</p>

<p>Since the fixed-point division is a lossy operation, instead of dividing by the folded output scale, we can multiply by its reciprocal $\frac{1}{scale_{folded}}$.</p>

<p>The first step is to obtain a fixed-point representation of the reciprocal of the folded scale:</p>

<p>$rec_{folded} = to\_fixed\_point(\frac{scale_{in}.scale_{w}}{scale_{out}}) = rec_{int} . 2^{-fracbits_{rec}}$</p>

<p>You can refer to this <a href="/2023/05/quantization-fixed-point">fixed-point conversion algorithm</a> for an example of how we can
convert the scale to a fixed-point representation.</p>

<p>Then the rescaled outputs are simply evaluated as:</p>

<p>$outputs_{int32} = outputs_{int32} . rec_{int}$</p>

<h2 id="reduce-the-precision-of-the-fixed-point-rescaled-outputs-using-a-rounded-right-shift">Reduce the precision of the fixed-point rescaled outputs using a rounded right-shift</h2>

<p>The rescaled outputs are represented as a fixed-point number with an implicit scale of $2^{-fracbits_{rec}}$.</p>

<p>To obtain the actual 8-bit integer values corresponding to the original downscale operation, we must apply this implicit
scale.</p>

<p>We use the rounded right-shift operation described in the <a href="/2023/05/quantization-fixed-point">fixed-point introduction post</a>:</p>

<p>$outputs_{int8} = (outputs_{int32} + 2^{fracbits_{rec} - 1}) >> fracbits_{rec}$</p>

<p>Then we can apply the zero-point:</p>

<p>$outputs_{uint8} = saturate(outputs_{int8} + zp_{out})$</p>
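<p>Putting the three steps together, here is a minimal <code>numpy</code> sketch of the integer-only downscale. The scales and accumulator values are illustrative, and the compact <code>to_fixed_point</code> helper stands in for the full conversion algorithm:</p>

```python
import numpy as np

def to_fixed_point(x, bitwidth=8):
    """Return (mantissa, frac_bits) such that x ~= mantissa * 2**-frac_bits."""
    frac_bits = bitwidth - int(np.ceil(np.log2(abs(x))))
    return int(round(x * 2.0 ** frac_bits)), frac_bits

# Illustrative calibration results
scale_in, scale_w, scale_out = 0.05, 0.002, 0.1
zp_out = 3

# Offline: fixed-point reciprocal of the folded scale
rec = scale_in * scale_w / scale_out               # 1 / scale_folded
rec_int, frac_rec = to_fixed_point(rec)

# At inference, on the int32 accumulator outputs:
outputs_int32 = np.array([12345, -6789], dtype=np.int64)
rescaled = outputs_int32 * rec_int                 # fixed-point, frac_rec bits
rounded = (rescaled + (1 << (frac_rec - 1))) >> frac_rec   # rounded right-shift
outputs_uint8 = np.clip(rounded + zp_out, 0, 255)

# The result matches the float downscale formula on these values
reference = np.clip(np.round(outputs_int32 * rec) + zp_out, 0, 255)
assert (outputs_uint8 == reference).all()
```

<p>Note that only the multiplication, addition and shift appear in the inference path: the reciprocal conversion happens once, offline.</p>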

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Fixed-point representation for quantization
          ]]>
      </title>
      <link>http://www.kaizou.org/2023/05/quantization-fixed-point.html</link>
      <pubDate>Fri, 26 May 2023 12:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2023/05/quantization-fixed-point</guid>
      <description>
          <![CDATA[
          <p>As explained in my introduction to <a href="/2023/05/ml-quantization-introduction.html#quantized-linear-operations">Machine Learning quantization</a>,
 the quantization of a ML model produces a graph of operations applied on quantized tensors.</p>

<p>Quantized tensors are actually integer tensors that share the same float scale and integer zero-point.</p>

<p>The implementation of the quantized operations is device-specific.</p>

<p>One of the main design decisions is how the input, weight and output float scales are propagated and applied in the quantized graph.</p>

<p>In two other posts I will explain how it is possible to use integer arithmetic operators for that purpose if the scales are represented
as fixed-point numbers.</p>

<p>This post is a brief introduction to the fixed-point representation and to the fixed-point arithmetic operators.</p>

<!--more-->

<h2 id="fixed-point-representation">Fixed-point representation</h2>

<p>Before the introduction of the floating point representation, decimal values were expressed using a fixed-point representation.</p>

<p>This representation also uses a mantissa and an exponent, but the latter is implicit: it defines the number of bits in the mantissa
dedicated to the fractional part of the number.</p>

<p>The minimum non-zero value that can be represented for a given number of fractional bits is $2^{-fracbits}$.</p>

<p>For instance, with three fractional bits, the smallest non-zero float number that can be represented is $2^{-3} = 0.125$.</p>

<p>Below is an example of an unsigned 8-bit fixed-point number with 4 fractional bits.</p>

<pre class="diagram">
.------------------------------------.  
|  0   1   0   1 |  1   1   1    0   |
.------------------------------------.  
|  integer bits  |  fractional bits  |
.------------------------------------.  
|  3   2   1   0 | -1  -2  -3   -4   |
'------------------------------------'  
</pre>

<p>The decimal value of that number is: $2^{2} + 2^{0} + 2^{-1} + 2^{-2} + 2^{-3} = 5.875$</p>

<p>The precision of the representation is directly related to the number of fractional bits.</p>

<p>Below are some more examples of PI represented with unsigned 8-bit fixed-point numbers using different numbers of fractional bits:</p>

<table>
  <thead>
    <tr>
      <th>float</th>
      <th>frac_bits</th>
      <th>mantissa</th>
      <th>binary</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>3.140625</td>
      <td>6</td>
      <td>201</td>
      <td>11001001</td>
    </tr>
    <tr>
      <td>3.15625</td>
      <td>5</td>
      <td>101</td>
      <td>01100101</td>
    </tr>
    <tr>
      <td>3.125</td>
      <td>4</td>
      <td>50</td>
      <td>00110010</td>
    </tr>
    <tr>
      <td>3.125</td>
      <td>3</td>
      <td>25</td>
      <td>00011001</td>
    </tr>
    <tr>
      <td>3.25</td>
      <td>2</td>
      <td>13</td>
      <td>00001101</td>
    </tr>
    <tr>
      <td>3.0</td>
      <td>1</td>
      <td>6</td>
      <td>00000110</td>
    </tr>
  </tbody>
</table>

<h2 id="obtaining-a-fixed-point-representation-of-a-float">Obtaining a fixed-point representation of a float</h2>

<p>As a reminder, a float number is represented as:</p>

\[x = mantissa * 2^{exponent}\]

<p>Our goal here is to obtain a fixed-point representation of $x$.</p>

<p>Technically, we could directly take the float mantissa, but it is 24-bit, with a high risk of overflows in the downstream
fixed-point operations.</p>

<p>For the range of numbers used in Machine Learning, an 8-bit mantissa is usually enough to accurately represent a <code class="language-plaintext highlighter-rouge">float32</code> number.</p>

<p>As a consequence, we only need to keep the 8 most significant bits of the mantissa, which effectively means quantizing the float to
8-bit with the power-of-two scale that minimizes the precision loss.</p>

<p>This can be achieved in several ways depending on the level of abstraction you are comfortable with: below is an algorithm
relying only on high-level mathematical operations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">to_fixed_point</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">bitwidth</span><span class="p">,</span> <span class="n">signed</span><span class="o">=</span><span class="bp">True</span><span class="p">):</span>
    <span class="s">"""Convert a number to a FixedPoint representation

    The representation is composed of a mantissa and an implicit exponent expressed as
    a number of fractional bits, so that:

    x ~= mantissa . 2 ** -frac_bits

    The mantissa is an integer whose bitwidth and signedness are specified as parameters.

    Args:
        x: the source number or array
        bitwidth: the bitwidth of the mantissa
        signed: whether the mantissa is signed (defaults to True)

    Returns:
        the integer mantissa and the number of fractional bits
    """</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="c1"># Evaluate the number of bits available for the mantissa
</span>    <span class="n">mantissa_bits</span> <span class="o">=</span> <span class="n">bitwidth</span> <span class="o">-</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">signed</span> <span class="k">else</span> <span class="n">bitwidth</span>
    <span class="c1"># Evaluate the number of bits required to represent the whole part of x
</span>    <span class="c1"># as the power of two enclosing the absolute value of x
</span>    <span class="c1"># Note that it can be negative if x &lt; 0.5
</span>    <span class="n">whole_bits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ceil</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">x</span><span class="p">))).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span>
    <span class="c1"># Deduce the number of bits required for the fractional part of x
</span>    <span class="c1"># Note that it can be negative if the whole part exceeds the mantissa
</span>    <span class="n">frac_bits</span> <span class="o">=</span> <span class="n">mantissa_bits</span> <span class="o">-</span> <span class="n">whole_bits</span>
    <span class="c1"># Evaluate the 'scale', which is the smallest value that can be represented (as 1)
</span>    <span class="n">scale</span> <span class="o">=</span> <span class="mf">2.</span> <span class="o">**</span> <span class="o">-</span><span class="n">frac_bits</span>
    <span class="c1"># Evaluate the minimum and maximum values for the mantissa
</span>    <span class="n">mantissa_min</span> <span class="o">=</span> <span class="o">-</span><span class="mi">2</span> <span class="o">**</span> <span class="n">mantissa_bits</span> <span class="k">if</span> <span class="n">signed</span> <span class="k">else</span> <span class="mi">0</span>
    <span class="n">mantissa_max</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">mantissa_bits</span> <span class="o">-</span> <span class="mi">1</span>
    <span class="c1"># Evaluate the mantissa by quantizing x with the scale, clipping to the min and max
</span>    <span class="n">mantissa</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">x</span> <span class="o">/</span> <span class="n">scale</span><span class="p">),</span> <span class="n">mantissa_min</span><span class="p">,</span> <span class="n">mantissa_max</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">mantissa</span><span class="p">,</span> <span class="n">frac_bits</span>


</code></pre></div></div>

<p>The algorithm above produces a fixed-point representation of $x$ such that:</p>

\[x_{float} \approx x_{int} . 2^{-x_{fracbits}}\]
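<p>As a quick sanity check, the snippet below (a condensed, self-contained copy of the function above) recovers the first row of the PI table:</p>

```python
import numpy as np

def to_fixed_point(x, bitwidth, signed=True):
    # Condensed copy of the conversion algorithm above
    mantissa_bits = bitwidth - 1 if signed else bitwidth
    whole_bits = np.ceil(np.log2(np.abs(x))).astype(np.int32)
    frac_bits = mantissa_bits - whole_bits
    scale = 2.0 ** -frac_bits
    mantissa_min = -2 ** mantissa_bits if signed else 0
    mantissa_max = 2 ** mantissa_bits - 1
    mantissa = np.clip(np.round(x / scale), mantissa_min, mantissa_max).astype(np.int32)
    return mantissa, frac_bits

mantissa, frac_bits = to_fixed_point(np.pi, 8, signed=False)
# pi ~= 201 * 2**-6 = 3.140625, the first row of the PI table above
```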

<h2 id="fixed-point-addition-or-subtraction">Fixed-point addition (or subtraction)</h2>

<p>The reason why the fixed-point representation comes to mind when it comes to quantization is that it has exactly the same
restrictions regarding the addition of numbers: they must be expressed using the same amount of fractional bits.</p>

<p>The addition can then be performed directly on the underlying integer.</p>

<p>The resulting sum is a fixed-point number with the same fractional bits. It is exact unless it overflows.</p>

<p>What is really interesting here is that the alignment of fixed-point numbers is trivial: it can just be performed
using a left bitshift.</p>

<p>Example:</p>

<p>The following fixed-point (values, fractional bits) pairs represent the following float values:</p>

<p>$a: (84, 3) = 84 * 2^{-3} = 10.5$</p>

<p>$b: (113, 4) = 113 * 2^{-4} = 7.0625$</p>

<p>Before summing $a$ and $b$, we need to shift $a$ to the left to align it with $b$:</p>

<p>$s = a + b = (84 << 1) + 113 = 168 + 113 = 281$</p>

<p>The sum is a fixed-point number with 4 fractional bits:</p>

<p>$s: (281, 4) = 281 * 2^{-4} = 17.5625$</p>
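<p>The same example, replayed in a few lines of Python for clarity:</p>

```python
# a: (84, 3) = 10.5 and b: (113, 4) = 7.0625
a, frac_a = 84, 3
b, frac_b = 113, 4

# Align both integers to the larger number of fractional bits, then add
maxfrac = max(frac_a, frac_b)
s = (a << (maxfrac - frac_a)) + (b << (maxfrac - frac_b))

# s = 281 with 4 fractional bits, i.e. 281 * 2**-4 = 17.5625
```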

<h2 id="fixed-point-multiplication">Fixed-point multiplication</h2>

<p>The multiplication of two fixed-point numbers can be performed directly on the underlying integer numbers.</p>

<p>The resulting product is a fixed-point number with a number of fractional bits corresponding to the sum of the fractional bits of the inputs. It is exact unless it overflows.</p>

<p>Example:</p>

<p>Going back to our two numbers:</p>

<p>$a: (84, 3) = 84 * 2^{-3} = 10.5$</p>

<p>$b: (113, 4) = 113 * 2^{-4} = 7.0625$</p>

<p>Their fixed-point product is:</p>

<p>$p = a.b = (84 . 113, 3 + 4) = (9492, 7) = 74.15625$</p>

<h2 id="fixed-point-downscale">Fixed-point downscale</h2>

<p>The mantissa of the product of two fixed-point numbers can grow very quickly, which would eventually lead to an overflow when chaining multiple operations.</p>

<p>It is therefore common to ‘downscale’ the result of a multiplication using a right-shift.</p>

<p>Example:</p>

<p>Going back to our previous product:</p>

<p>$p = a.b = (84 . 113, 3 + 4) = (9492, 7) = 74.15625$</p>

<p>It can be downscaled to fit in 8-bit by shifting right and adjusting the fractional bits:</p>

<p>$downscale(p) = p >> 6 = (148, 1) = 74$</p>

<p>Note that the right-shift operation always performs a floor, which may lead to a loss of precision.</p>

<p>For that reason, it is often implemented as a ‘rounded’ right-shift by adding $2^{n-1}$ before shifting by $n$.</p>
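<p>A short Python sketch of the product and its downscale, with the numbers used above:</p>

```python
a, frac_a = 84, 3     # 10.5
b, frac_b = 113, 4    # 7.0625

# Fixed-point product: multiply the integers, add the fractional bits
p, frac_p = a * b, frac_a + frac_b    # (9492, 7) = 74.15625

# Downscale by 6 bits with a plain (flooring) right-shift
floored = p >> 6                      # (148, 1) = 74.0

# Rounded right-shift: add 2**(6 - 1) before shifting
rounded = (p + (1 << 5)) >> 6         # here the fractional remainder is below
                                      # one half, so the result is unchanged
```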

<blockquote>
  <p>Note: this is mathematically equivalent to adding $0.5$ to $\frac{x}{2^{n}}$ before taking its floor.</p>
</blockquote>

<h2 id="fixed-point-division">Fixed-point division</h2>

<p>The division of two fixed-point numbers can be performed directly on the underlying integer numbers.</p>

<p>The resulting quotient is a fixed-point number with a number of fractional bits corresponding to the subtraction of the fractional bits of the inputs. It is usually not exact.</p>

<p>Example:</p>

<p>Going back to our two numbers:</p>

<p>$a: (84, 3) = 84 * 2^{-3} = 10.5$</p>

<p>$b: (113, 4) = 113 * 2^{-4} = 7.0625$</p>

<p>Their fixed-point division is:</p>

<p>$p = \frac{b}{a} = (\frac{113}{84}, 4 - 3) = (1, 1) = 0.5$</p>

<p>A possible mitigation is to left-shift the dividend before the division to increase its precision: the resulting quotient will in turn have an increased precision.</p>

<p>$b: (113, 4) << 3 = (113 << 3, 4 + 3) = (904, 7) = 904 * 2^{-7} = 7.0625$</p>

<p>$p = \frac{b}{a} = (\frac{904}{84}, 7 - 3) = (10, 4) = 0.625$</p>
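<p>The two division examples, replayed in Python:</p>

```python
a, frac_a = 84, 3     # 10.5
b, frac_b = 113, 4    # 7.0625

# Naive fixed-point division: divide the integers, subtract the fractional bits
q_naive = (b // a, frac_b - frac_a)                # (1, 1) = 0.5, very imprecise

# Left-shift the dividend first to increase the quotient precision
shift = 3
q = ((b << shift) // a, frac_b + shift - frac_a)   # (10, 4) = 0.625
```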

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          A brief introduction to Machine Learning models quantization
          ]]>
      </title>
      <link>http://www.kaizou.org/2023/05/machine-learning-quantization-introduction.html</link>
      <pubDate>Thu, 25 May 2023 12:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2023/05/machine-learning-quantization-introduction</guid>
      <description>
          <![CDATA[
          <p>Even before the development of Large Language Models (LLM), the increasing
memory and computing requirements of Deep Neural Networks (DNN) has been a concern.</p>

<p>Functionally, DNN are graphs of arithmetic operations: the inputs are fed at the
stem and the chain of operations produces the outputs at the head.</p>

<p>From an implementation perspective, the operations are performed on floating point
numbers, which are a digital representation of decimal numbers composed of a mantissa and an
exponent:</p>

\[x = mantissa . 2^{exponent}\]

<!--more-->

<p>The 32-bit floating point representation is the most common, as it can represent
numbers in a range that is sufficient for most operations. The <code class="language-plaintext highlighter-rouge">float32</code> mantissa is composed of
24 bits (including the sign), and the exponent is 8-bit.</p>

<p>Each operation performed at an operating node in the inference device requires its inputs
to be transferred from either a static memory location or the previous processing nodes.</p>

<p>The cost of these transfers adds up to the cost of the operations themselves.</p>

<p>The DNN terminology for operation data is “weights” for static inputs and “activations” for dynamic inputs/outputs.</p>

<p>Note: the outputs of an operation are designated as “activations” even if the operation is not actually an activation function.</p>

<p>The process of representing the n-bit weights and activations of a DNN with a smaller
number of bits is called quantization<sup id="fnref:quant" role="doc-noteref"><a href="#fn:quant" class="footnote" rel="footnote">1</a></sup>.</p>

<p>It is typically used in DNN to “quantize” <code class="language-plaintext highlighter-rouge">float32</code> weights and activations into 8-bit integers.</p>

<p>This brings several benefits:</p>
<ul>
  <li>reducing the weights to 8-bit requires 4 times less memory on the device to store them,</li>
  <li>reducing the activations to 8-bits reduces the amount of data exchanged between nodes, which impacts latency,</li>
  <li>using 8-bit instead of 32-bit inputs for an operation improves vectorization (multiple data processed at the same time for a single operation),</li>
  <li>all standard integer arithmetic operations but the division are faster than their floating point counterpart,</li>
  <li>GPU devices may include specific mechanisms to process 8-bit inputs (like NVIDIA’s 8-bit Tensor cores).</li>
</ul>

<h2 id="a-mathematical-formulation-of-linear-quantization">A mathematical formulation of linear quantization</h2>

<p>The most widespread type of quantization is the <em>linear</em> or <em>affine</em> quantization scheme first introduced in tensorflow lite<sup id="fnref:qtf" role="doc-noteref"><a href="#fn:qtf" class="footnote" rel="footnote">2</a></sup>.</p>

<p>The representation of a linearly quantized number is composed of:</p>
<ul>
  <li>an integer mantissa,</li>
  <li>a float scale,</li>
  <li>an integer zero-point.</li>
</ul>

\[x = (mantissa - zeropoint).scale\]

<p>The scale is used to project back the integer numbers into a float representation.</p>

<p>The zero point corresponds to the value that zero takes in the target representation.</p>

<p>Comparing that formula with the floating point representation, one can see
immediately that each floating point number can be represented exactly with the same
mantissa, a scale corresponding to the exponent and a null zero-point.</p>

<p>Of course, this representation would be very inefficient, because it would require two
integers and a float to represent each number.</p>

<h2 id="applicability-of-quantization-to-machine-learning">Applicability of quantization to Machine-Learning</h2>

<p>When quantizing Machine-Learning models, one can take advantage of the fact that
training produces weights and activations that stay within reasonably stable ranges
for a given operation.</p>

<p>This comes from several empirical techniques used to improve convergence:</p>
<ul>
  <li>weights initialization<sup id="fnref:qinit" role="doc-noteref"><a href="#fn:qinit" class="footnote" rel="footnote">3</a></sup>,</li>
  <li>weights and/or activation regularization<sup id="fnref:qreg" role="doc-noteref"><a href="#fn:qreg" class="footnote" rel="footnote">4</a></sup>,</li>
  <li>explicit normalization layers<sup id="fnref:qbn" role="doc-noteref"><a href="#fn:qbn" class="footnote" rel="footnote">5</a></sup>.</li>
</ul>

<p>This means that the weights and activations tensors for a specific operation can be represented
using the same scale and zero-point, thus leading to a very compact representation.</p>

<blockquote>
  <p>Note: this is why quantization is often categorized as a form of compression, although unlike most
compression techniques, it produces numbers that can be directly used for arithmetic operations.</p>
</blockquote>

<p>There are various subtypes of quantization.</p>

<p>The first two subtypes are related to the dimensions of the scale and zero-point:</p>
<ul>
  <li><em>per-tensor</em> quantization uses a single scalar value for scale and zero-point for a whole
tensor of weights or activations,</li>
  <li><em>per-axis</em> quantization uses a vector of scales and zero-points whose length corresponds
to a single axis of the tensor (typically the <em>channels</em> or <em>embeddings</em> axis).</li>
</ul>

<p>The second pair of subtypes relates to the <em>symmetry</em> of the resulting quantized numbers:</p>
<ul>
  <li><em>symmetric</em> quantization assumes that the quantization range is symmetric, which leads to a zero-point equal
to zero and a signed integer representation of the values,</li>
  <li><em>asymmetric</em> quantization does not assume anything, and zero-point is typically non-null.</li>
</ul>

<p>Weights are typically quantized symmetrically per-axis.</p>

<p>Activations are typically quantized asymmetrically, most of the time per-tensor.</p>

<h2 id="quantizing-a-float-tensor">Quantizing a float tensor</h2>

<p>The first step to quantize a float tensor is to choose the quantization range, i.e. the
minimum and maximum float values one wants to represent: $[Min, Max]$.</p>

<p>Since the weights are constant tensors, they are typically quantized using the minimum and maximum
values of the tensor, globally or along the channel axis.</p>

<p>Evaluating the quantization range of the activations is more difficult, as they depend on the inputs
of the previous operation. Their range is therefore evaluated globally inside a model, as explained in the next
paragraph.</p>

<p>For a target bitwidth of $n$ bits for the mantissa, one evaluates the scale as:</p>

\[scale = \frac{Max - Min}{2^n - 1}\]

<p>The zero-point is then deduced from the scale to make sure that $Min$ is mapped to the
lowest integer value and $Max$ to the highest integer value.</p>

<p>This leads to the following formulas for signed/unsigned representations:</p>

<ul>
  <li>unsigned: $zeropoint = -round(\frac{Min}{scale})$</li>
  <li>signed: $zeropoint = -round(\frac{Min}{scale}) - 2^{n - 1}$</li>
</ul>

<p>The quantization of a float tensor is then:</p>

\[mantissa = saturate(round(\frac{x}{scale}) + zeropoint)\]

<p>Again, the saturation depends on the signedness of the target representation:</p>
<ul>
  <li>unsigned: $[0, 2^n - 1]$,</li>
  <li>signed: $[-2^{n-1}, 2^{n-1} - 1]$.</li>
</ul>

<p>Note that the zero-point always has the same signedness as the mantissa.</p>
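<p>Putting the formulas of this section together, here is a minimal NumPy sketch of tensor quantization and dequantization (function names are mine, not a specific framework's API):</p>

```python
import numpy as np

def quantize(x, n=8, signed=False):
    # Quantization range taken as the min/max of the tensor
    fmin, fmax = float(x.min()), float(x.max())
    scale = (fmax - fmin) / (2**n - 1)
    # The zero-point maps fmin to the lowest integer value
    offset = -(2**(n - 1)) if signed else 0
    zeropoint = -round(fmin / scale) + offset
    lo, hi = (offset, -offset - 1) if signed else (0, 2**n - 1)
    mantissa = np.clip(np.round(x / scale) + zeropoint, lo, hi).astype(np.int32)
    return mantissa, scale, zeropoint

def dequantize(mantissa, scale, zeropoint):
    return (mantissa - zeropoint) * scale

x = np.array([0.0, 0.5, 1.0])
m, s, zp = quantize(x)           # unsigned: zero-point is 0 since fmin is 0
x_approx = dequantize(m, s, zp)  # error is at most one quantization step
```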

<h2 id="quantizing-a-machine-learning-model">Quantizing a Machine Learning Model</h2>

<p>As mentioned before, a Machine Learning model uses two types of tensors: weights and activations.</p>

<p>The static weights need to be quantized only once, each weight tensor producing three new static
tensors for the mantissa, scale and zero-point.</p>

<p>Since weights can contain positive and negative values, they are typically quantized into <code class="language-plaintext highlighter-rouge">int8</code>.</p>

<pre class="diagram">
             .----------.
             |  Weights |
             |  float32 |
             | constant |
             +----+-----+
            /     |      \
           v      v       v
.----------. .----------. .------------.
|  Weights | |  scale   | | zero-point |
|   int8   | | float32  | |    int8    |
| constant | | constant | |  constant  |
'----------' '----------' '------------'
</pre>

<p>The dynamic activations on the other hand need to be quantized on-the-fly by inserting the quantization
operations in the graph:</p>

<ul>
  <li>evaluate the quantization range,</li>
  <li>quantize.</li>
</ul>

<p>The evaluation of the quantization range is costly because it requires a full scan of the activations tensor,
which is a bottleneck for parallel processing.</p>

<p>For that reason, the activations quantization ranges are often evaluated before the inference on a selected
number of samples: this is called the calibration of the quantized model.</p>

<blockquote>
  <p>Note: the operations that clip their outputs like the bounded ReLU are an exception and don’t require an
explicit calibration, since the exact range of their outputs is known in advance.</p>
</blockquote>
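<p>Calibration can be sketched as a small observer that records the running range of an activation over the calibration samples, then derives the quantization parameters (class and method names are mine; real frameworks use more robust range estimators):</p>

```python
import numpy as np

class MinMaxObserver:
    # Tracks the running range of an activation over calibration samples
    def __init__(self):
        self.min, self.max = np.inf, -np.inf

    def observe(self, x):
        self.min = min(self.min, float(x.min()))
        self.max = max(self.max, float(x.max()))

    def qparams(self, n=8):
        # Unsigned (uint8) parameters, e.g. for the output of a ReLU
        scale = (self.max - self.min) / (2**n - 1)
        zeropoint = -round(self.min / scale)
        return scale, zeropoint

obs = MinMaxObserver()
for batch in (np.array([0.0, 1.0]), np.array([0.2, 4.0])):  # calibration data
    obs.observe(batch)
scale, zeropoint = obs.qparams()
```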

<p>After calibration, each activation float variable is mapped to an integer variable and two static tensors.</p>

<pre class="diagram">
               .-----------.
              | Activations |
              |   float32   |
              |  variable   |
              /'-----+-----'\
             /       |       \
            v        v        v
 .-----------.  .----------. .------------.
| Activations | |  scale   | | zero-point |
|   (u)int8   | | float32  | |  (u)int8   |
|  variable   | | constant | |  constant  |
 '-----------'  '----------' '------------'
</pre>

<blockquote>
  <p>Note: the activations can be quantized into either <code class="language-plaintext highlighter-rouge">int8</code> or <code class="language-plaintext highlighter-rouge">uint8</code>. It is simpler to quantize them to <code class="language-plaintext highlighter-rouge">uint8</code>
if they correspond to the output of a ReLU operation, since in that case the zero-point will be 0.</p>
</blockquote>

<p>Conceptually, the resulting graph is a clone of the original graph where all compatible operations are replaced
by a version that operates on tuples of (mantissa, scale, zero-point).</p>

<p>Separating the constant and variable tensors, this leads to the following graphs:</p>

<pre class="diagram">
              .---------.                   .--------.  .----------. .------------.
             |  Inputs   |                 |  Inputs  | |  scale   | | zero-point |
             |  float32  |                 |  (u)int8 | | float32  | |  (u)int8   |
             | variable  |                 | variable | | constant | |  constant  |
              '----+----'                   '----+---'  '-----+----' '------+-----'
                   |             .               '------------+-------------'
.----------.       v             |\      .----------.         |
| Weights  |   .------.       +--' \     | Weights  |         |
| float32  +-&gt;| Matmul |      +--. /     |  int8    +-.       |
| constant |   '---+--'          |/      | constant | |       |         .------------.
'----------'       |             '       '----------' |       |         |   scale    |
                   v                                  |       |       .-+  float32   |
              .---------.                .----------. |       v       | |  constant  |
             |  Outputs  |               |  scale   | |   .-------.   | '------------'
             |  float32  |               | float32  +-+-&gt;| QMatMul |&lt;-+
             |  variable |               | constant | |   '---+---'   | .------------.
              '---------'                '----------' |       |       | | zero-point |
                                                      |       |       '-+  (u)int8   |
                                         .----------. |       |         |  constant  |
                                         |zero-point| |       |         '------------'
                                         |  int8    +-'       |
                                         | constant |         |
                                         '----------'         |
                                                              v
                                                          .--------.
                                                         | Outputs  |
                                                         |  (u)int8 |
                                                         | variable |
                                                          '--------'
</pre>

<h2 id="quantized-linear-operations">Quantized linear operations</h2>

<p>Most basic Machine Learning operations can be performed using integer arithmetics, which makes them compatible
with linearly quantized inputs.</p>

<p>This does not mean however that one can simply replace every floating point operation by an equivalent integer operation:
 the scale and zero-point of all weights and activations must be taken into account to produce an equivalent graph.</p>

<p>Also, there are two important restrictions with respect to the inputs quantization:</p>
<ul>
  <li>additions between the integer mantissa of inputs can only be performed if they are in the same scale,</li>
  <li>operations that combine the integer mantissa of the input channels can only be performed if the channels are in the same scale,
i.e. if the inputs are quantized per-tensor.</li>
</ul>

<blockquote>
  <p>Note: in <a href="/2023/05/quantization-scales-alignment.html">another post</a> I explain how it is possible to add two inputs quantized with different scales
by adding an explicit alignment operation beforehand.</p>
</blockquote>

<p>From an implementation perspective, operations accepting linearly quantized inputs are very specific to each device.</p>

<p>In the next paragraph, I will detail a possible implementation of a quantized matrix multiplication.</p>

<h2 id="wrap-up-example-a-quantized-matrix-multiplication">Wrap-up example: a quantized matrix multiplication</h2>

<p>Let’s consider a simple matrix multiplication of an $X(I, J)$ input by a $W(J, K)$ set of weights:</p>

<p>$Y = X.W$</p>

<p>Since the matrix multiplication multiplies all inputs along the dimension of length $J$ and adds them,
 $X$ cannot be quantized per-axis, as that would lead to additions of quantized numbers that are not in the same scale.</p>

<p>There is no such restriction on $W$, since the filters along $K$ are all applied independently.</p>

<p>After quantization of the weights per-axis and calibration of the inputs per-tensor, we obtain:</p>

<p>$X \approx X_s * (X_q - X_{zp})$, with $X_s()$, $X_q(I, J)$, $X_{zp}()$</p>

<p>$W \approx W_s * (W_q - W_{zp})$, with $W_s(K)$, $W_q(J, K)$, $W_{zp}(K)$</p>

<p>We can also approximate the outputs per-axis, assuming that the next operation does not require per-tensor inputs.</p>

<p>$Y \approx Y_s * (Y_q - Y_{zp})$, with $Y_s(K)$, $Y_q(I, K)$, $Y_{zp}(K)$</p>

<p>The operation is summarized on the graph below (note that the intermediate integer output $Y_q$ can be implicit):</p>

<pre class="diagram">
    .-----.  .-----. .------.
   |  X_q  | | X_s | | X_zp |
    '--+--'  '--+--' '--+---'
       '--------+-------'
.-----.         |
| W_q +-.       |
'-----' |       |          .-----.
        |       v        .-+ Y_s |
.-----. |  .---------.   | '-----'
| W_s +-+-&gt;| QMatMul |&lt;--+
'-----' |  '----+----'   | .-----.
        |       |        '-+ Y_zp|
.-----. |       |          '-----'
|W_zp +-'       |(Y_q)
'-----'         |
                v
               .-.
              | Y |
               '-'
</pre>

<p>Going through the graph step by step:</p>

<ul>
  <li>evaluate the matrix multiplication of the quantized inputs to produce float outputs</li>
</ul>

<p>$O = X_s * (X_q - X_{zp}) . W_s * (W_q - W_{zp})$</p>

<ul>
  <li>quantize the float outputs to obtain 8-bit integer outputs</li>
</ul>

<p>$Y_q = saturate(round(\frac{O}{Y_s}) + Y_{zp})$</p>

<ul>
  <li>convert back the 8-bit integer outputs to float outputs</li>
</ul>

<p>$Y \approx Y_s * (Y_q - Y_{zp})$</p>

<p>Since $X_s$ is a scalar, and $W_s$ has the same dimension as the last dimension of the outputs,
the first operation can also be written:</p>

<p>$O = (X_s * W_s) * (X_q - X_{zp}) . (W_q - W_{zp})$</p>

<p>This means that the matrix multiplication can equivalently be performed on integer values,
and the result is a quantized integer number whose scale is the product of the input and weight
scales, with a null zero-point.</p>

<p>The quantized sequence of operations is then to:</p>

<ul>
  <li>evaluate the matrix multiplication of the 8-bit integer inputs to produce n-bit integer outputs</li>
</ul>

<p>$O_q = (X_q - X_{zp}) . (W_q - W_{zp})$</p>

<ul>
  <li>convert the n-bit integer outputs to float outputs</li>
</ul>

<p>$O = (X_s * W_s) * O_q$</p>

<ul>
  <li>quantize the float outputs to obtain 8-bit integer outputs</li>
</ul>

<p>$Y_q = saturate(round(\frac{O}{Y_s}) + Y_{zp})$</p>

<ul>
  <li>convert back the 8-bit integer outputs to float outputs</li>
</ul>

<p>$Y \approx Y_s * (Y_q - Y_{zp})$</p>
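<p>The four steps above can be checked numerically. Below is a NumPy sketch assuming symmetric per-axis weights (so $W_{zp} = 0$) and symmetric per-axis outputs:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 2, 4, 3

# Float reference operation: Y = X.W
X = rng.uniform(-1.0, 1.0, (I, J))
W = rng.uniform(-1.0, 1.0, (J, K))

# Per-tensor asymmetric int8 inputs: scalar X_s and X_zp
X_s = (X.max() - X.min()) / 255.0
X_zp = -round(X.min() / X_s) - 128
X_q = np.clip(np.round(X / X_s) + X_zp, -128, 127).astype(np.int32)

# Per-axis symmetric int8 weights: W_s has shape (K,), W_zp is 0
W_s = np.abs(W).max(axis=0) / 127.0
W_q = np.clip(np.round(W / W_s), -127, 127).astype(np.int32)

# Step 1: integer matrix multiplication (wide accumulators)
O_q = (X_q - X_zp) @ W_q

# Step 2: convert to float with the product of the scales
O = (X_s * W_s) * O_q

# Steps 3-4: requantize to 8-bit per-axis symmetric outputs, then dequantize
Y_s = np.abs(O).max(axis=0) / 127.0
Y_q = np.clip(np.round(O / Y_s), -128, 127).astype(np.int32)
Y = Y_s * Y_q
```

<p>The dequantized result stays within a few quantization steps of the float matrix multiplication.</p>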

<p>The question that should immediately arise at this stage is: why do we need another quantization
operation after the matrix multiplication, since we already have a quantized output?</p>

<p>The reason is simply the bitwidth of the outputs: we need an explicit quantization to make
sure that the results of the integer matrix multiplication fit in 8 bits.</p>

<blockquote>
  <p>Note: when the operation is followed by a bias addition, the biases are typically quantized to
32-bit with a scale precisely equal to $X_s * W_s$ so that they can be added directly to the outputs
before quantizing.</p>
</blockquote>

<p>Going one step further and replacing $O$, since $Y_s$ has the same shape as $X_s * W_s$, we can fold the
float conversion into the quantization step and write directly:</p>

<ul>
  <li>evaluate the matrix multiplication of the integer inputs to produce n-bit integer outputs</li>
</ul>

<p>$O_q = (X_q - X_{zp}) . (W_q - W_{zp})$</p>

<ul>
  <li>quantize the n-bit integer outputs to obtain 8-bit integer outputs</li>
</ul>

<p>$Y_q = saturate(round(\frac{X_s * W_s}{Y_s} * O_q) + Y_{zp})$</p>

<ul>
  <li>convert back the 8-bit integer outputs to float outputs</li>
</ul>

<p>$Y \approx Y_s * (Y_q - Y_{zp})$</p>

<p>This reveals that we can directly ‘downscale’ the integer outputs of the operation with a folded scale
  $F_s = \frac{Y_s}{X_s * W_s}$.</p>

<p>The downscaling operation can be implemented as a float division and a round.</p>

<blockquote>
  <p>Note: I will detail in another post an implementation using only integer arithmetic.</p>
</blockquote>

<p>The simplified graph can be summarized below:</p>

<pre class="diagram">
        .-----.   .------. 
       |  X_q  |  | X_zp |
        '--+--'   '--+---'
           '----+----'
.-----.         |
| W_q +-.       v
'-----' |  .----------.
        +-&gt;|IntMatMul |
.-----. |  '----+-----'
|W_zp +-'       |         .-----.   
'-----'         v       .-+ F_s |  
           .---------.  | '-----'   
           |Downscale|&lt;-+         
           '----+----'  | .-----.      
                v       '-+ Y_zp|     
               .-.        '-----'
              | Y |
               '-'
</pre>

<p>This can be further simplified by removing the zero-points if we assume a symmetric quantization.</p>

<pre class="diagram">
           .-----.  
          |  X_q  | 
           '--+--'  
              |
              v 
.-----.  .----------.
| W_q +-&gt;|IntMatMul |
'-----'  '----+-----'
              |             
              v          
         .---------.  .-----.   
         |Downscale|&lt;-+ F_s |         
         '----+----'  '-----'  
              v          
             .-.        
            | Y |
             '-'
</pre>

<blockquote>
  <p>Note: the quantized matrix multiplication can be implemented in very different ways on devices that do not have efficient
implementations of the integer Matrix Multiplication.</p>
</blockquote>

<h2 id="references">References</h2>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:quant" role="doc-endnote">
      <p>Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, “Compressing Deep Convolutional Networks using Vector Quantization”
      <a href="https://arxiv.org/abs/1412.6115">arxiv</a>, 2014. <a href="#fnref:quant" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:qtf" role="doc-endnote">
      <p>Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko,
    “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”
    <a href="https://arxiv.org/abs/1712.05877">arxiv</a>, 2017. <a href="#fnref:qtf" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:qinit" role="doc-endnote">
      <p>Stone Yun, Alexander Wong, “Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks”,
      <a href="https://arxiv.org/abs/2011.14578">arxiv</a>, 2020. <a href="#fnref:qinit" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:qreg" role="doc-endnote">
      <p>Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker, “Intriguing Properties of Quantization at Scale”,
     <a href="https://arxiv.org/abs/2305.19268">arxiv</a>, 2023. <a href="#fnref:qreg" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:qbn" role="doc-endnote">
      <p>Elaina Teresa Chai, “Analysis of quantization and normalization effects in deep neural networks”, <a href="https://searchworks.stanford.edu/view/13971425">stanford</a>, 2021. <a href="#fnref:qbn" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Identify Repeating Patterns using Spiking Neural Networks in Tensorflow
          ]]>
      </title>
      <link>http://www.kaizou.org/2018/07/stdp-tensorflow.html</link>
      <pubDate>Thu, 26 Jul 2018 10:38:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2018/07/stdp-tensorflow</guid>
      <description>
          <![CDATA[
          <p>Spiking neural networks (<a href="https://en.wikipedia.org/wiki/Spiking_neural_network">SNN</a>) are the 3rd generation of neural networks.</p>

<p>SNNs do not react to each stimulus, but rather accumulate inputs until they reach a threshold potential and generate a ‘spike’.</p>

<p>Because of their very nature, SNNs cannot be trained like 2nd generation neural networks using gradient descent.</p>

<p>Spike Timing Dependent Plasticity (<a href="https://en.wikipedia.org/wiki/Spike-timing-dependent_plasticity">STDP</a>) is a biological process that
inspired an unsupervised training method for SNNs.</p>

<p>In this article, I will provide an illustration of how STDP can be used to teach a single neuron to identify a repeating pattern in a continuous stream of input spikes.</p>

<!--more-->

<p>For this, I will reproduce the STDP experiments described in 
<a href="https://www.semanticscholar.org/paper/Spike-Timing-Dependent-Plasticity-Finds-the-Start-Masquelier-Guyonneau/432b5bfa6fc260289fef45544a43ebcd8892915e">Masquelier &amp; Thorpe (2008)</a> using <a href="https://www.tensorflow.org/">Tensorflow</a> instead of Matlab.</p>

<h2 id="lif-neuron-model">LIF neuron model</h2>

<p>The LIF neuron model used in this experiment is based on Gerstner’s <a href="http://lcn.epfl.ch/~gerstner/SPNM/node26.html#SECTION02311000000000000000">Spike Response Model</a>.</p>

<p>At every time-step, the neuron membrane potential p is given by the formula:</p>

\[p=\eta(t-t_{i})+\sum_{j|t_{j}&gt;t_{i}}w_{j}\varepsilon(t-t_{j})\]

<p>where $\eta(t-t_{i})$ is the membrane response after a spike at time $t_{i}$:</p>

\[\eta(t-t_{i})=K_{1}exp(-\frac{t-t_{i}}{\tau_{m}})-K_{2}(exp(-\frac{t-t_{i}}{\tau_{m}})-exp(-\frac{t-t_{i}}{\tau_{s}}))\]

<p>and $\varepsilon(t)$ describes the Excitatory Post-Synaptic Potential of each synapse spike at time $t_{j}$:</p>

\[\varepsilon(t-t_{j})=K(exp(-\frac{t-t_{j}}{\tau_{m}})-exp(-\frac{t-t_{j}}{\tau_{s}}))\]

<p>Note that K has to be chosen so that the max of $\varepsilon(t)$ is 1, knowing that $\varepsilon(t)$ is maximum when:
\(t=\frac{\tau_{m}\tau_{s}}{\tau_{m}-\tau_{s}}ln(\frac{\tau_{m}}{\tau_{s}})\)</p>
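<p>The normalization constant can be computed directly from that expression. A short sketch, assuming for illustration $\tau_{m}=10\,ms$ and $\tau_{s}=2.5\,ms$:</p>

```python
import math

tau_m, tau_s = 10.0, 2.5  # membrane and synapse time constants (ms)

# Time at which the unnormalized EPSP kernel peaks
t_max = (tau_m * tau_s) / (tau_m - tau_s) * math.log(tau_m / tau_s)

# Choose K so that the peak of the EPSP kernel is exactly 1
K = 1.0 / (math.exp(-t_max / tau_m) - math.exp(-t_max / tau_s))

def epsilon(t):
    # Excitatory Post-Synaptic Potential, normalized to peak at 1
    return K * (math.exp(-t / tau_m) - math.exp(-t / tau_s)) if t >= 0 else 0.0
```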

<p>In this simplified version of the neuron, the synaptic weights $w_{j}$ remain constant.</p>

<p>The main graph operations are described below (please refer to my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/STDP_masquelier_2008.ipynb">jupyter notebook</a> for details):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1"># Excitatory post-synaptic potential (EPSP)
</span>    <span class="k">def</span> <span class="nf">epsilon_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="c1"># We only use the negative value of the relative spike times
</span>        <span class="n">spikes_t_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">negative</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_spikes</span><span class="p">)</span>

        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">K</span> <span class="o">*</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">spikes_t_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_m</span><span class="p">)</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">spikes_t_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_s</span><span class="p">))</span>
    
    <span class="c1"># Membrane spike response
</span>    <span class="k">def</span> <span class="nf">eta_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="c1"># We only use the negative value of the relative time
</span>        <span class="n">t_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">negative</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">last_spike</span><span class="p">)</span>
        
        <span class="c1"># Evaluate the spiking positive pulse
</span>        <span class="n">pos_pulse_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">K1</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">t_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_m</span><span class="p">)</span>
        
        <span class="c1"># Evaluate the negative spike after-potential
</span>        <span class="n">neg_after_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">K2</span> <span class="o">*</span> <span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">t_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_m</span><span class="p">)</span> <span class="o">-</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">t_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_s</span><span class="p">))</span>

        <span class="c1"># Evaluate the new post synaptic membrane potential
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">T</span> <span class="o">*</span> <span class="p">(</span><span class="n">pos_pulse_op</span> <span class="o">-</span> <span class="n">neg_after_op</span><span class="p">)</span>
    
    <span class="c1"># Neuron behaviour during integrating phase (t_rest = 0)
</span>    <span class="k">def</span> <span class="nf">w_epsilons_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="c1"># Evaluate synaptic EPSPs. We ignore synaptic spikes older than the last neuron spike
</span>        <span class="n">epsilons_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">logical_and</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_spikes</span> <span class="o">&gt;=</span><span class="mi">0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_spikes</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">last_spike</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">tau_rest</span><span class="p">),</span>
                               <span class="bp">self</span><span class="p">.</span><span class="n">epsilon_op</span><span class="p">(),</span>
                               <span class="bp">self</span><span class="p">.</span><span class="n">t_spikes</span><span class="o">*</span><span class="mf">0.0</span><span class="p">)</span>
                          
        <span class="c1"># Agregate weighted incoming EPSPs 
</span>        <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">w</span> <span class="o">*</span> <span class="n">epsilons_op</span><span class="p">)</span>  
   <span class="p">...</span>
   <span class="k">def</span> <span class="nf">default_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="c1"># Update weights
</span>        <span class="n">w_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">default_w_op</span><span class="p">()</span>
        
        <span class="c1"># By default, the membrane potential is given by the sum of the eta kernel and the weighted epsilons
</span>        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">w_op</span><span class="p">]):</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">eta_op</span><span class="p">()</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">w_epsilons_op</span><span class="p">()</span>
        
    <span class="k">def</span> <span class="nf">integrating_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="c1"># Evaluate the new membrane potential, integrating both synaptic input and spike dynamics
</span>        <span class="n">p_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">eta_op</span><span class="p">()</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">w_epsilons_op</span><span class="p">()</span>

        <span class="c1"># We have a different behavior if we reached the threshold
</span>        <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="n">p_op</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">T</span><span class="p">,</span>
                       <span class="bp">self</span><span class="p">.</span><span class="n">firing_op</span><span class="p">,</span>
                       <span class="bp">self</span><span class="p">.</span><span class="n">default_op</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">get_potential_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="c1"># Update our internal memory of the synapse spikes (age older spikes, add new ones)
</span>        <span class="n">update_spikes_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">update_spikes_times</span><span class="p">()</span>
        
        <span class="c1"># Increase the relative time of the last spike by the time elapsed
</span>        <span class="n">last_spike_age_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">last_spike</span><span class="p">.</span><span class="n">assign_add</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">)</span>
        
        <span class="c1"># Update the internal state of the neuron and evaluate membrane potential
</span>        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">update_spikes_op</span><span class="p">,</span> <span class="n">last_spike_age_op</span><span class="p">]):</span>
            <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">cond</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_rest</span> <span class="o">&gt;</span> <span class="mf">0.0</span><span class="p">,</span>
                           <span class="bp">self</span><span class="p">.</span><span class="n">resting_op</span><span class="p">,</span>
                           <span class="bp">self</span><span class="p">.</span><span class="n">integrating_op</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="stimulate-neuron-with-predefined-synapse-input">Stimulate neuron with predefined synapse input</h2>

<p>We replicate $figure\,3$ of the original paper by stimulating a LIF neuron with six consecutive synapse spikes (dotted gray lines on the figure).</p>

<p>The neuron has a refractory period of $1\,ms$ and a threshold of $1$.</p>

<p><img src="/images/posts/masquelier_1.png" alt="LIF Neuron response" /></p>

<p>As in the original paper, we see that because of the leaky nature of the neuron, the stimulating spikes have to be nearly synchronous
for the threshold to be reached.</p>

<h2 id="generate-poisson-spike-trains-with-varying-rate">Generate Poisson spike trains with varying rate</h2>

<p>The original paper uses Poisson spike trains with a rate varying in the $[0, 90]\,Hz$ interval, with a variation speed that itself varies in the $[-1800, 1800]\,Hz$ interval (in random uniform increments in the $[-360,360]$ interval).</p>

<p>Optionally, we may force each synapse to spike at least every $\Delta_{max}\,ms$.</p>

<p>Please refer to my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/STDP_masquelier_2008.ipynb">jupyter notebook</a> for the details of the Spike
trains generator.</p>
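<p>For reference, here is a simplified, self-contained sketch of such a generator (I assume here that the variation speed takes a fresh uniform increment at every time step; the notebook differs in detail):</p>

```python
import numpy as np

def poisson_train(duration, dt=0.001, rng=None):
    # Poisson spike train whose rate drifts in [0, 90] Hz; the drift speed
    # takes uniform increments in [-360, 360] and is clipped to [-1800, 1800]
    if rng is None:
        rng = np.random.default_rng()
    steps = int(duration / dt)
    spikes = np.zeros(steps, dtype=bool)
    rate, speed = rng.uniform(0.0, 90.0), 0.0
    for i in range(steps):
        spikes[i] = rng.random() < rate * dt
        speed = np.clip(speed + rng.uniform(-360.0, 360.0), -1800.0, 1800.0)
        rate = np.clip(rate + speed * dt, 0.0, 90.0)
    return spikes

spikes = poisson_train(duration=10.0, rng=np.random.default_rng(42))
```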

<p>We test our spike trains generator and draw the corresponding spikes.
Both sets of spike trains use varying rates in the $[0, 90]\,Hz$ interval.
The second set imposes $\Delta_{max}=50\,ms$.</p>

<p><img src="/images/posts/masquelier_2.png" alt="Varying spike trains" />
<img src="/images/posts/masquelier_2_1.png" alt="Varying spike trains with delta_max" /></p>

<p>We note the increased mean rate of the second set of spike trains, due to the minimum $20\,Hz$ rate we impose (i.e. the maximum interval we allow between two spikes is $50\,ms$).</p>

<h2 id="stimulate-a-lif-neuron-with-random-spike-trains">Stimulate a LIF Neuron with random spike trains</h2>

<p>We now feed the neuron with $500$ synapses that generate spikes at random interval with varying rates.</p>

<p>The synaptic efficacy weights are arbitrarily set to $0.475$ and remain constant throughout the simulation.</p>

<p>We draw the neuron membrane response to the $500$ random synaptic spike trains.</p>

<p><img src="/images/posts/masquelier_3.png" alt="Varying spike trains" />
<img src="/images/posts/masquelier_3_1.png" alt="LIF Neuron response" /></p>

<p>We can see that the neuron mostly saturates and continuously generates spikes.</p>

<h2 id="introduce-spike-timing-dependent-plasticity">Introduce Spike Timing Dependent Plasticity</h2>

<p>We extend the LIFNeuron by allowing it to modify its synapse weights using a Spike Timing Dependent Plasticity algorithm (<strong>STDP</strong>).</p>

<p>The <strong>STDP</strong> algorithm rewards synapses where spikes occurred immediately before a neuron spike, and inflicts penalties to the synapses where spikes occur after the neuron spike.</p>

<p>The ‘rewards’ are called Long Term synaptic Potentiation (<strong>LTP</strong>), and the penalties Long Term synaptic Depression (<strong>LTD</strong>).</p>

<p>For each synapse that spiked $\Delta{t}$ before a neuron spike:</p>

\[\Delta{w} = a^{+}exp(-\frac{\Delta{t}}{\tau^{+}})\]

<p>For each synapse that spikes $\Delta{t}$ after a neuron spike:</p>

\[\Delta{w} = -a^{-}exp(-\frac{\Delta{t}}{\tau^{-}})\]

<p>As in the original paper, we only apply <strong>LTP</strong>, resp. <strong>LTD</strong> to the first spike before, resp. after a neuron spike on each synapse.</p>
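<p>Outside of the Tensorflow graph, the update rule itself fits in a few lines of NumPy (the constants below are illustrative defaults, not necessarily the ones used in the experiment):</p>

```python
import numpy as np

# Illustrative STDP constants (see the notebook for the exact values)
A_PLUS, A_MINUS = 0.03125, 0.85 * 0.03125
TAU_PLUS, TAU_MINUS = 16.8, 33.7  # ms

def ltp(w, dt):
    # Reward synapses whose last spike occurred dt ms *before* the neuron spike
    return np.minimum(w + A_PLUS * np.exp(-dt / TAU_PLUS), 1.0)

def ltd(w, dt):
    # Penalize synapses whose first spike occurred dt ms *after* the neuron spike
    return np.maximum(w - A_MINUS * np.exp(-dt / TAU_MINUS), 0.0)

w = np.full(4, 0.475)                         # initial synaptic weights
w = ltp(w, np.array([1.0, 5.0, 20.0, 60.0]))  # closer spikes earn larger rewards
```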

<p>The main <strong>STDP</strong> graph operations are described below (please refer to my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/STDP_masquelier_2008.ipynb">jupyter notebook</a> for details:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1"># Long Term synaptic Potentiation
</span>    <span class="k">def</span> <span class="nf">LTP_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="c1"># We only consider the last spike of each synapse from our memory
</span>        <span class="n">last_spikes_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">reduce_min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_spikes</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

        <span class="c1"># Reward all last synapse spikes that happened after the previous neuron spike
</span>        <span class="n">rewards_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">last_spikes_op</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">last_spike</span><span class="p">,</span>
                              <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">a_plus</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n_syn</span><span class="p">])</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">negative</span><span class="p">(</span><span class="n">last_spikes_op</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_plus</span><span class="p">)),</span>
                              <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n_syn</span><span class="p">]))</span>
        
        <span class="c1"># Evaluate new weights
</span>        <span class="n">new_w_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">w</span><span class="p">,</span> <span class="n">rewards_op</span><span class="p">)</span>
        
        <span class="c1"># Update with new weights clamped to [0,1]
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">new_w_op</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">))</span>
    
    <span class="c1"># Long Term synaptic Depression
</span>    <span class="k">def</span> <span class="nf">LTD_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="c1"># Inflict penalties on new spikes on synapses that have not spiked
</span>        <span class="c1"># The penalty is equal for all new spikes, and inversely exponential
</span>        <span class="c1"># to the time since the last spike
</span>        <span class="n">penalties_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">logical_and</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">new_spikes</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">logical_not</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">syn_has_spiked</span><span class="p">)),</span>
                                <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">a_minus</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n_syn</span><span class="p">])</span> <span class="o">*</span> <span class="n">tf</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">negative</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">last_spike</span><span class="o">/</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_minus</span><span class="p">)),</span>
                                <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n_syn</span><span class="p">]))</span>
        
        <span class="c1"># Evaluate new weights
</span>        <span class="n">new_w_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">w</span><span class="p">,</span> <span class="n">penalties_op</span><span class="p">)</span>
        
        <span class="c1"># Update the list of synapses that have spiked
</span>        <span class="n">new_spikes_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">syn_has_spiked</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">syn_has_spiked</span> <span class="o">|</span> <span class="bp">self</span><span class="p">.</span><span class="n">new_spikes</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">new_spikes_op</span><span class="p">]):</span>
            <span class="c1"># Update with new weights clamped to [0,1]
</span>            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">clip_by_value</span><span class="p">(</span><span class="n">new_w_op</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">))</span>
</code></pre></div></div>

<h2 id="test-stdp-with-predefined-input">Test STDP with predefined input</h2>

<p>We apply the same predefined spike train to an <strong>STDP</strong> capable LIFNeuron with a limited number of synapses, and draw the resulting rewards (<em>green</em>) and penalties (<em>red</em>).</p>

<p><img src="/images/posts/masquelier_4.png" alt="Synapse spikes and STDP" />
<img src="/images/posts/masquelier_4_1.png" alt="LIF Neuron response" /></p>

<p>On the graph above, we verify that the rewards (<em>green</em> dots) are assigned only when the neuron spikes, and that they are assigned to synapses where a spike occurred before the neuron spike (big <em>blue</em> dots).</p>

<p>Note: a reward is assigned even if the synapse spike is not synchronous with the neuron spike, but it is then smaller.</p>

<p>We also verify that a penalty (<em>red</em> dot) is inflicted on every synapse where a first spike occurs after a neuron spike.</p>

<p>Note: these penalties may later be counter-balanced by a reward if a neuron spike closely follows.</p>

<h2 id="stimulate-an-stdp-lif-neuron-with-random-spike-trains">Stimulate an STDP LIF Neuron with random spike trains</h2>

<p>The goal here is to check the effects of the <strong>STDP</strong> learning on the neuron behaviour when it is stimulated with our random spike trains.</p>

<p>We test the neuron response with three sets of spike trains, with mean rates of $35$, $45$ and $55$ $Hz$ respectively.</p>

<p><img src="/images/posts/masquelier_5.png" alt="LIF Neuron response 35Hz" />
<img src="/images/posts/masquelier_5_1.png" alt="Mean weights 35 Hz" />
<img src="/images/posts/masquelier_5_2.png" alt="LIF Neuron response 45Hz" />
<img src="/images/posts/masquelier_5_3.png" alt="Mean weights 45 Hz" />
<img src="/images/posts/masquelier_5_4.png" alt="LIF Neuron response 55Hz" />
<img src="/images/posts/masquelier_5_5.png" alt="Mean weights 55 Hz" /></p>

<p>We see that the evolution of the synapse weights in response to this steady stimulation is highly dependent on the mean input frequency.</p>

<p>If the mean input frequency is too low, the synaptic efficacy weights slowly decrease, down to the point where the neuron is no longer able to fire.</p>

<p>If the mean input frequency is too high, the synaptic efficacy weights on the contrary increase, up to the point where the neuron fires regardless of the input.</p>

<p>Using the <strong>STDP</strong> values of the original paper, only the exact mean frequency of $45$ $Hz$ (the one also used in the paper) exhibits some kind of stability.</p>

<p>In conclusion, either our implementations differ, or the adverse effect of this particular <strong>STDP</strong> algorithm was overlooked in the original paper, because, as we will see later, the actual mean stimulation rate will be around $64$ $Hz$.</p>

<h2 id="generate-recurrent-spike-trains">Generate recurrent spike trains</h2>

<p>We don’t follow exactly the same procedure as in the original paper, since modern hardware and software allow us to generate spike trains more easily. The result, however, is equivalent.</p>

<p>We generate $2000$ spike trains, and force the first $1000$ to repeat a $50\,ms$ pattern at random intervals.</p>

<p>The time to the next pattern is chosen with a probability of $0.25$ among the next slices of $50\,ms$ (omitting the first one to avoid consecutive patterns).</p>
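<p>A possible sketch of this timing procedure (the function name and default values are mine, not the notebook’s):</p>

```python
import numpy as np

def pattern_starts(duration_ms, pattern_ms=50, p=0.25, seed=0):
    """Draw the start times of the repeated pattern.

    After each pattern, every following 50 ms slice is selected with
    probability p, skipping the slice right next to the pattern, which
    amounts to a gap of pattern_ms * (1 + Geometric(p))."""
    rng = np.random.default_rng(seed)
    starts = [0]
    while True:
        # geometric draw over 50 ms slices, offset by the one skipped slice
        nxt = starts[-1] + pattern_ms * (1 + rng.geometric(p))
        if nxt + pattern_ms > duration_ms:
            return starts
        starts.append(nxt)
```

<p>By construction, two consecutive pattern starts are always at least two slices apart, so patterns are never back to back.</p>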

<p>We display the resulting synapse mean spiking rates, and some samples of the spike trains, identifying the pattern (<em>gray</em> areas).</p>

<p><img src="/images/posts/masquelier_6.png" alt="Synapses Mean firing rate" />
<img src="/images/posts/masquelier_6_1.png" alt="Spike trains with pattern 1" />
<img src="/images/posts/masquelier_6_2.png" alt="Spike trains with pattern 2" />
<img src="/images/posts/masquelier_6_3.png" alt="Spike trains with pattern 3" /></p>

<p>We verify that the mean spiking rate is the same for both populations of synapses (approximately $64\,Hz = 54\,Hz + 10\,Hz$).</p>

<p>We nevertheless notice that the standard deviation is much higher for the synapses involved in the pattern.</p>

<p>On the spike trains samples, one can visually recognize the patterns thanks to the <em>gray</em> background, but otherwise
they would go unnoticed by the human eye.</p>

<p>We also verify that each pattern is slightly modified by the $10\,Hz$ spontaneous activity.</p>

<h2 id="stimulate-an-stdp-lif-neuron-with-recurrent-spiking-trains">Stimulate an STDP LIF neuron with recurrent spiking trains</h2>

<p>We perform a simulation on our <strong>STDP</strong> LIF neuron with the generated spike trains, and draw the neuron response at the 
beginning, middle and end of the simulation.</p>

<p>On each sample, we identify the pattern interval with a <em>gray</em> background.</p>

<p><img src="/images/posts/masquelier_7.png" alt="STDP training 1" />
<img src="/images/posts/masquelier_7_1.png" alt="STDP training 2" />
<img src="/images/posts/masquelier_7_2.png" alt="STDP training 3" />
<img src="/images/posts/masquelier_7_3.png" alt="STDP training 4" /></p>

<p>At the beginning of the stimulation, the neuron spikes continuously, inside and outside the pattern.</p>

<p>Midway through the stimulation, the neuron fires mostly inside the pattern and occasionally outside it (false positives).</p>

<p>At the end of the stimulation, the neuron fires only inside the pattern.</p>

<blockquote>
  <p><strong>Important note:</strong>
With the rates specified in the original paper, the neuron quickly saturates and doesn’t learn anything.
With a tweaked LTD factor $a^{-}$, which seems to depend on the spike trains themselves, the neuron learns the pattern after only a few seconds of presentation: Hurray!
For a given set of spike trains, you may need to adjust this factor to achieve a successful training.</p>
</blockquote>

<p>The neuron has become more and more selective as the pattern presentations were repeated, up to the point where the synapses involved in the pattern have dominant weights, as displayed on the graph below.</p>

<p><img src="/images/posts/masquelier_8.png" alt="Weights after training" /></p>

<h2 id="discussion">Discussion</h2>

<p>We managed to reproduce the experiments described in <a href="https://www.semanticscholar.org/paper/Spike-Timing-Dependent-Plasticity-Finds-the-Start-Masquelier-Guyonneau/432b5bfa6fc260289fef45544a43ebcd8892915e">Masquelier &amp; Thorpe (2008)</a> using <a href="https://www.tensorflow.org/">Tensorflow</a>.</p>

<p>However, we found out that the <strong>STDP</strong> parameters needed to be tweaked to adjust to the input spike train mean rate,
and possibly also to adjust to the generated spike trains themselves, as for a given rate, the neuron did not react
identically for different sets of spike trains.</p>

<p>Also, we found out that the neuron doesn’t necessarily identify the beginning of the pattern, but sometimes its end.</p>

<p>These differences with the original paper raise questions about the differences between our implementation and the original one done in Matlab.</p>


          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Leaky Integrate and Fire neuron with Tensorflow
          ]]>
      </title>
      <link>http://www.kaizou.org/2018/07/lif-neuron-tensorflow.html</link>
      <pubDate>Wed, 25 Jul 2018 10:38:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2018/07/lif-neuron-tensorflow</guid>
      <description>
          <![CDATA[
<p>Spiking Neural Networks (SNN) are the next generation of neural networks. They operate using spikes, 
which are discrete events that take place at points in time, rather than continuous values.</p>

<p>Essentially, once a stimulated neuron reaches a certain potential, it spikes, and the potential of that neuron is reset.</p>

<p>In this article, I will detail how the Leaky Integrate and Fire (LIF) spiking neuron model can be implemented
using <a href="https://www.tensorflow.org/">Tensorflow</a>.</p>

<!--more-->

<h2 id="leaky-integrate-and-fire-model">Leaky-integrate-and-fire model</h2>

<p>We use the model described in <a href="http://lcn.epfl.ch/~gerstner/SPNM/node26.html#SECTION02311000000000000000">§ 4.1 of “Spiking Neuron Models”, by Gerstner and Kistler (2002)</a>.</p>

<p>The leaky integrate-and-fire (LIF) neuron is probably one of the simplest spiking neuron models, but it is still very popular due to the ease with which it can be analyzed and simulated.</p>

<p>The basic circuit of an integrate-and-fire model consists of a capacitor $C$ in parallel with a resistor $R$, driven by a current $I(t)$:</p>

<p><img alt="Leaky Integrate and Fire model" src="/images/posts/gerstner.gif" style="margin: auto; display:block" /></p>

<p>The driving current can be split into two components, $I(t) = I_{R} + I_{C}$.</p>

<p>The first component is the resistive current $I_{R}$ which passes through the linear resistor $R$.</p>

<p>It can be calculated from Ohm’s law as $I_{R} = \frac{u}{R}$, where $u$ is the voltage across the resistor.</p>

<p>The second component $I_{C}$ charges the capacitor $C$.</p>

<p>From the definition of the capacity as $C = \frac{q}{u}$ (where $q$ is the charge and $u$ the voltage), we find a capacitive current $I_{C} = C\frac{du}{dt}$. Thus:</p>

\[I(t) = \frac{u(t)}{R} + C\frac{du}{dt}\]

<p>By multiplying the equation by $R$ and introducing the time constant $\tau_{m} = RC$ this yields the standard form:</p>

\[\tau_{m}\frac{du}{dt}=-u(t) + RI(t)\]

<p>where $u(t)$ represents the membrane potential at time $t$, $\tau_{m}$ is the membrane time constant and $R$ is the
membrane resistance.</p>

<p>When the membrane potential reaches the spiking threshold $u_{thresh}$, the neuron ‘spikes’ and enters a resting state for a duration $\tau_{rest}$.</p>

<p>During the resting period, the membrane potential remains constant at $u_{rest}$.</p>
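<p>The model can be checked with a direct Euler integration in plain NumPy (all parameter values below are illustrative):</p>

```python
import numpy as np

tau_m, R = 10.0, 1.0           # membrane time constant (ms) and resistance
u_thresh, u_rest = 1.0, 0.0    # spiking threshold and resting potential
tau_rest, dt = 4.0, 0.1        # refractory duration (ms) and time step (ms)

def simulate(I):
    """Return the membrane potential over time for an input current array."""
    u, t_rest, trace = u_rest, 0.0, []
    for i_t in I:
        if t_rest > 0.0:              # resting phase: potential stays at u_rest
            u, t_rest = u_rest, t_rest - dt
        elif u > u_thresh:            # firing phase: reset, start refractory period
            u, t_rest = u_rest, tau_rest
        else:                         # integration: tau_m du/dt = -u + R I(t)
            u += dt * (-u + R * i_t) / tau_m
        trace.append(u)
    return np.array(trace)
```

<p>With a constant current such that $RI &gt; u_{thresh}$, the potential crosses the threshold and the neuron spikes periodically; below that, the potential converges to $RI$ without ever firing.</p>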

<h2 id="step-1-create-a-single-lif-model">Step 1: Create a single LIF model</h2>

<p>In a first step, we create a Tensorflow graph to evaluate the membrane response of a LIF neuron.</p>

<p>For encapsulation and isolation, the graph is a member of a LIFNeuron object that takes all model parameters at initialization.</p>

<p>The LIFNeuron object exposes the membrane potential Tensorflow ‘operation’ as a member.</p>

<p>The input current and considered time interval are passed as Tensorflow placeholders.</p>

<p>The main graph operations are described below (please refer to my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/leaky_integrate_fire.ipynb">jupyter notebook</a> for details):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="c1"># Neuron behaviour during integration phase (below threshold)
</span>    <span class="k">def</span> <span class="nf">get_integrating_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="c1"># Get input current
</span>        <span class="n">i_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_input_op</span><span class="p">()</span>

        <span class="c1"># Update membrane potential
</span>        <span class="n">du_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">r</span><span class="p">,</span> <span class="n">i_op</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">),</span> <span class="bp">self</span><span class="p">.</span><span class="n">tau</span><span class="p">)</span> 
        <span class="n">u_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">.</span><span class="n">assign_add</span><span class="p">(</span><span class="n">du_op</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">)</span>
        <span class="c1"># Refractory period is 0
</span>        <span class="n">t_rest_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_rest</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t_rest_op</span><span class="p">]):</span>
            <span class="k">return</span> <span class="n">u_op</span>

    <span class="c1"># Neuron behaviour during firing phase (above threshold)    
</span>    <span class="k">def</span> <span class="nf">get_firing_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>                  

        <span class="c1"># Reset membrane potential
</span>        <span class="n">u_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">u_rest</span><span class="p">)</span>
        <span class="c1"># Refractory period starts now
</span>        <span class="n">t_rest_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_rest</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tau_rest</span><span class="p">)</span>

        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t_rest_op</span><span class="p">]):</span>
            <span class="k">return</span> <span class="n">u_op</span>

    <span class="c1"># Neuron behaviour during resting phase (t_rest &gt; 0)
</span>    <span class="k">def</span> <span class="nf">get_resting_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="c1"># Membrane potential stays at u_rest
</span>        <span class="n">u_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">u_rest</span><span class="p">)</span>
        <span class="c1"># Refractory period is decreased by dt
</span>        <span class="n">t_rest_op</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">t_rest</span><span class="p">.</span><span class="n">assign_sub</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="n">tf</span><span class="p">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t_rest_op</span><span class="p">]):</span>
            <span class="k">return</span> <span class="n">u_op</span>

    <span class="k">def</span> <span class="nf">get_potential_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        
        <span class="k">return</span> <span class="n">tf</span><span class="p">.</span><span class="n">case</span><span class="p">(</span>
            <span class="p">[</span>
                <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">t_rest</span> <span class="o">&gt;</span> <span class="mf">0.0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_resting_op</span><span class="p">),</span>
                <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">u</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">u_thresh</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_firing_op</span><span class="p">),</span>
            <span class="p">],</span>
            <span class="n">default</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">get_integrating_op</span>
        <span class="p">)</span>
</code></pre></div></div>
<h2 id="step-2-stimulation-by-a-square-input-current">Step 2: Stimulation by a square input current</h2>

<p>We stimulate the neuron with three square input currents of varying intensity: 0.5, 1.2 and 1.5 mA.</p>

<p><img alt="Square input current" src="/images/posts/lif_1.png" />
<img alt="LIF neuron response" src="/images/posts/lif_1_1.png" /></p>

<p>The first current step is not sufficient to trigger a spike. The other two trigger several spikes whose frequency increases with the input current.</p>

<h2 id="step-3-stimulation-by-a-random-varying-input-current">Step 3: Stimulation by a random varying input current</h2>

<p>We now stimulate the neuron with a varying current corresponding to a normal distribution of mean 1.5 mA and standard deviation 1.0 mA.</p>

<p><img alt="Varying input current" src="/images/posts/lif_2.png" />
<img alt="LIF neuron response" src="/images/posts/lif_2_2.png" /></p>

<p>The input current triggers spikes at regular intervals: the neuron mostly saturates, with consecutive spikes separated by the resting period.</p>

<h2 id="step-4-stimulate-neuron-with-synaptic-currents">Step 4: Stimulate neuron with synaptic currents</h2>

<p>We now assume that the neuron is connected to input neurons through $m$ synapses.</p>

<p>The contribution of the synapses to the neuron input current is given by the general formula below:</p>

\[I =\sum_{i}^{}w_{i}\sum_{f}{}I_{syn}(t-t_i^{(f)})\]

<p>Where $t_i^{(f)}$ is the time of the f-th spike of the synapse $i$.</p>

<p>A typical implementation of the $I_{syn}$ function is:</p>

\[I_{syn}(t)=\frac{q}{\tau}exp(-\frac{t}{\tau})\]

<p>where $q$ is the total charge that is injected in a postsynaptic neuron via a synapse with efficacy $w_{i} = 1$.</p>

<p>Note that $\frac{dI_{syn}}{dt}=-\frac{I_{syn}(t)}{\tau}$.</p>
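<p>As a quick illustration, the kernel and the double sum over synapses and spike times can be written directly (the values of $q$ and $\tau$ below are hypothetical):</p>

```python
import numpy as np

q, tau = 1.5, 10.0   # illustrative total charge and synaptic time constant (ms)

def i_syn(t):
    """Postsynaptic current kernel: (q / tau) * exp(-t / tau) for t >= 0, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0.0, (q / tau) * np.exp(-t / tau), 0.0)

def total_current(t, weights, spike_times):
    """I(t) = sum_i w_i * sum_f I_syn(t - t_i^(f))."""
    return sum(w * i_syn(t - np.asarray(ts)).sum()
               for w, ts in zip(weights, spike_times))
```

<p>Each spike injects a current that peaks at $\frac{q}{\tau}$ and decays exponentially, and the contributions of all spikes on all synapses simply add up.</p>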

<p>We create a new neuron model derived from the LIFNeuron.</p>

<p>The graph for this neuron includes a modified operation to evaluate the input current at each time step based on a memory of synaptic spikes.</p>

<p>The graph requires a new boolean Tensorflow placeholder that contains the synapse spikes over the last time step.</p>

<p>The modified operation is displayed below (please refer to my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/leaky_integrate_fire.ipynb">jupyter notebook</a> for details):</p>

<pre><code class="language-python">    # Override parent get_input_op method
    def get_input_op(self):
        
        # Update our memory of spike times with the new spikes
        t_spikes_op = self.update_spike_times()

        # Evaluate synaptic input current for each spike on each synapse
        i_syn_op = tf.where(t_spikes_op &gt;=0,
                            self.q/self.tau_syn * tf.exp(tf.negative(t_spikes_op/self.tau_syn)),
                            t_spikes_op*0.0)

        # Add each synaptic current to the input current
        i_op =  tf.reduce_sum(self.w * i_syn_op)
        
        return tf.add(self.i_app, i_op)     
</code></pre>

<p>Each synapse spikes according to an independent Poisson process at $\lambda = 20$ $Hz$.</p>

<p>We perform a simulation by evaluating the contribution of each synapse to the input current over time.</p>

<p>At every time step, we draw a single sample $r$ from a uniform distribution in the $[0,1]$ interval, and if it is lower than
the probability of a spike over the time interval (i.e. $r &lt; \lambda \cdot dt$), then a spike occurred.</p>

<p>Note that this assumes that the chosen time interval is much smaller than the mean synapse inter-spike interval.</p>
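<p>This Bernoulli approximation of a Poisson process can be sketched as follows (the function name and default values are mine):</p>

```python
import numpy as np

def poisson_spike_trains(rate_hz=20.0, dt_ms=1.0, steps=1000, n_syn=25, seed=0):
    """Approximate independent Poisson processes by drawing, at each time
    step, a Bernoulli spike with probability rate * dt (valid when
    rate * dt << 1)."""
    rng = np.random.default_rng(seed)
    p_spike = rate_hz * dt_ms * 1e-3              # spike probability per step
    return rng.random((steps, n_syn)) < p_spike   # boolean spike raster
```

<p>Over a long enough simulation, the empirical spiking rate of each synapse converges to $\lambda$.</p>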

<p><img alt="Synapse spikes" src="/images/posts/lif_3.png" />
<img alt="Synaptic input current" src="/images/posts/lif_3_1.png" />
<img alt="LIF neuron response" src="/images/posts/lif_3_2.png" /></p>

<p>As expected, the neuron spikes when several synapses spike together.</p>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Simulating spiking neurons with Tensorflow
          ]]>
      </title>
      <link>http://www.kaizou.org/2018/07/simulating-spiking-neurons-with-tensorflow.html</link>
      <pubDate>Tue, 24 Jul 2018 10:38:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2018/07/simulating-spiking-neurons-with-tensorflow</guid>
      <description>
          <![CDATA[
<p>Spiking Neural Networks are the next generation of machine learning, according to the literature.</p>

<p>After the feed-forward perceptrons of the last century and the bi-directional deep networks trained
using gradient descent of today, this 3rd generation of neural networks uses biologically-realistic
models of neurons to carry out computation.</p>

<p>A spiking neural network (SNN) operates using spikes, which are discrete events that take place at
points in time, rather than continuous values. The occurrence of a spike is determined by differential
equations that represent the membrane potential of the neuron.
Essentially, once a neuron reaches a certain potential, it spikes, and the potential of that neuron is reset.</p>

<p>In this article, I will detail how this kind of network can be modelled using <a href="https://www.tensorflow.org/">Tensorflow</a>.</p>

<!--more-->

<p>You can find a jupyter notebook corresponding to this article in my 
<a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/simple_spiking_model.ipynb">tensorflow sandbox</a>.</p>

<p>The article is based on an existing exercise using <a href="http://www.mjrlab.org/wp-content/uploads/2014/05/CSHA_matlab_2012.pdf">Matlab</a>.</p>

<h2 id="spiking-neuron-model">Spiking neuron model</h2>

<p>The neuron model is based on <a href="http://www.izhikevich.org/publications/spikes.htm">“Simple Model of Spiking Neurons”</a>, by Eugene M. Izhikevich.</p>

<p><img src="/images/posts/izhik.gif" alt="Simple model on spiking neuron" width="100%" /></p>

<p>Electronic version of the figure and reproduction permissions are freely available at www.izhikevich.com</p>

<p>The behaviour of the neuron is determined by its membrane potential $v$, which increases over time when the neuron is stimulated by an input current $I$.
Whenever the membrane potential reaches the spiking threshold, it is reset.</p>

<p>The membrane potential increase is counteracted by a recovery effect modelled by the $u$ variable.</p>

<p>Tensorflow doesn’t natively solve differential equations, so we approximate the evolution of the membrane potential and
membrane recovery by evaluating their variations over small time intervals $dt$:</p>

\[dv = 0.04v^2 + 5v + 140 -u + I\]

\[du = a(bv -u)\]

<p>We can then apply the variations, multiplied by the time interval $dt$:</p>

\[v \leftarrow v + dv \cdot dt\]

\[u \leftarrow u + du \cdot dt\]

<p>As stated in the model, the $0.04$, $5$ and $140$ values have been defined so that $v$ is in $mV$, $I$ is in $A$ and $t$ in $ms$.</p>
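<p>As a sanity check, the update rule can be run for a single neuron in plain Python, using the regular-spiking parameters from Izhikevich’s paper (the input current and time step below are illustrative):</p>

```python
# Regular-spiking parameters from Izhikevich (2003); I and dt are illustrative
a, b, c, d = 0.02, 0.2, -65.0, 8.0
dt, I = 0.25, 10.0                     # time step (ms) and constant input current

v, u = c, b * c                        # initial membrane potential and recovery
spike_times = []
for step in range(int(1000 / dt)):     # simulate one second
    if v >= 30.0:                      # spike: reset potential, bump recovery
        spike_times.append(step * dt)
        v, u = c, u + d
    dv = 0.04 * v * v + 5.0 * v + 140.0 - u + I
    du = a * (b * v - u)
    v, u = v + dv * dt, u + du * dt
```

<p>Under this constant current the neuron fires tonically, which is the behaviour the Tensorflow graph reproduces for a whole population of neurons.</p>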

<p>The corresponding Tensorflow code looks like this (see the <a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/simple_spiking_model.ipynb">jupyter notebook</a> for details):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Evaluate membrane potential increment for the considered time interval
# dv = 0 if the neuron fired, dv = 0.04v*v + 5v + 140 + I -u otherwise
</span><span class="n">dv_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">has_fired_op</span><span class="p">,</span>
                 <span class="n">tf</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">v</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span>
                 <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">add_n</span><span class="p">([</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">v_reset_op</span><span class="p">),</span> <span class="mf">0.04</span><span class="p">),</span>
                                       <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">v_reset_op</span><span class="p">,</span> <span class="mf">5.0</span><span class="p">),</span>
                                       <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="mf">140.0</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n</span><span class="p">]),</span>
                                       <span class="n">i_op</span><span class="p">]),</span>
                             <span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">))</span>
                        
<span class="c1"># Evaluate membrane recovery decrement for the considered time interval
# du = 0 if the neuron fired, du = a*(b*v -u) otherwise
</span><span class="n">du_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">has_fired_op</span><span class="p">,</span>
                 <span class="n">tf</span><span class="p">.</span><span class="n">zeros</span><span class="p">([</span><span class="bp">self</span><span class="p">.</span><span class="n">n</span><span class="p">]),</span>
                 <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">A</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">B</span><span class="p">,</span> <span class="n">v_reset_op</span><span class="p">),</span> <span class="n">u_reset_op</span><span class="p">)))</span>
    
<span class="c1"># Increment membrane potential, and clamp it to the spiking threshold
# v += dv * dt
</span><span class="n">v_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">v</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">minimum</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">SPIKING_THRESHOLD</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">n</span><span class="p">]),</span>
                                                 <span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">v_reset_op</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">dv_op</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">))))</span>

<span class="c1"># Decrease membrane recovery
</span><span class="n">u_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">u</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">u_reset_op</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">du_op</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">)))</span>
</code></pre></div></div>

<h2 id="simulate-a-single-neuron-with-injected-current">Step 1: Simulate a single neuron with injected current</h2>

<p>As a first step, we stimulate the neuron model with a square input current.</p>

<p><img src="/images/posts/simulating_spiking_1_0.png" alt="square input current" />
<img src="/images/posts/simulating_spiking_1.png" alt="Neuron response with square input current" /></p>

<p>The neuron spikes at regular intervals. After each spike, the membrane potential returns to its resting value
before starting to increase again.</p>

<h2 id="step-2-simulate-a-single-neuron-with-synaptic-input">Step 2: Simulate a single neuron with synaptic input</h2>

<p>This is a simple variation of the previous experiment, where the input current is the combination of the currents coming from several synapses (here, a hundred).</p>

<p>The synaptic current is the weighted sum of the currents generated by each synapse:</p>

\[Isyn = \sum_{j}w_{in}(j).Isyn(j)\]

<p>The current $Isyn(j)$ generated by each synapse is the product of:</p>
<ul>
  <li>a linear response to the membrane potential, driving it towards a target potential $E_{in}(j)$: ($E_{in}(j) - v$),</li>
  <li>a conductance term $g_{in}(j)$ with exponential decay, defined by the differential equation below.</li>
</ul>

\[\frac{dg_{in}(j)}{dt} = -\frac{g_{in}(j)}{\tau}\]

<p>Each input synapse emits spikes following a Poisson process of rate $frate$. The probability that a synapse fires during the time interval $dt$ is thus $frate.dt$.</p>

<p>To simulate the neuron, we draw a random number $r$ in the $[0,1]$ interval for each synapse at each timestep, and if $r$ is less than $frate.dt$, we generate a synapse spike by increasing the conductance term for that synapse:</p>

\[g_{in}(j) = g_{in}(j) + 1\]
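<p>This random draw can be sketched in a few lines of plain Python (the <code>draw_synapse_spikes</code> helper is hypothetical; the resulting boolean vector plays the role of <code>self.syn_has_spiked</code> in the code below):</p>

```python
import random

# Bernoulli approximation of a Poisson process: at each timestep, each of
# the n_syn synapses fires with probability frate * dt (e.g. frate = 2 Hz
# and dt = 0.001 s give a 0.002 firing probability per timestep).
def draw_synapse_spikes(n_syn, frate, dt, rng=random.random):
    return [rng() < frate * dt for _ in range(n_syn)]
```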

<p>The complete synaptic current formula at each timestep is:</p>

\[Isyn = \sum_{j}w_{in}(j)g_{in}(j)(E_{in}(j) - v(t)) = \sum_{j}w_{in}(j)g_{in}(j)E_{in}(j) - (\sum_{j}w_{in}(j)g_{in}(j)).v(t)\]

<p>The corresponding Tensorflow code looks like this (see the <a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/simple_spiking_model.ipynb">jupyter notebook</a> for details):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First, update synaptic conductance dynamics:
# - increment by one the current factor of synapses that fired
# - decrease by tau the conductance dynamics in any case
</span><span class="n">g_in_update_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">syn_has_spiked</span><span class="p">,</span>
                          <span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g_in</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">g_in</span><span class="p">.</span><span class="n">shape</span><span class="p">)),</span>
                          <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g_in</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">,</span><span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g_in</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tau</span><span class="p">))))</span>

<span class="c1"># Update the g_in variable
</span><span class="n">g_in_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g_in</span><span class="p">,</span> <span class="n">g_in_update_op</span><span class="p">)</span>

<span class="c1"># We can now evaluate the synaptic input currents
# Isyn = Σ w_in(j)g_in(j)E_in(j) - (Σ w_in(j)g_in(j)).v(t)
</span><span class="n">i_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'nm,m-&gt;n'</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_in</span><span class="p">),</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">g_in_op</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">E_in</span><span class="p">))),</span>
                   <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'nm,m-&gt;n'</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W_in</span><span class="p">),</span> <span class="n">g_in_op</span><span class="p">),</span> <span class="n">v_op</span><span class="p">))</span>
</code></pre></div></div>
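<p>The <code>einsum</code> expressions compute the same double weighted sum as this plain-Python sketch (a hypothetical <code>synaptic_current</code> helper, shown for a single timestep only):</p>

```python
# Plain-Python sketch of the synaptic current formula:
# Isyn(i) = sum_j w[i][j] * g[j] * (e[j] - v[i])
# for n neurons and m synapses (w is an n x m matrix).
def synaptic_current(w, g, e, v):
    return [sum(w[i][j] * g[j] * (e[j] - v[i]) for j in range(len(g)))
            for i in range(len(w))]
```

<p>For instance, a single neuron at $v = -65$ with two unit-conductance synapses of weights $1$ and $2$ and target potentials $0$ and $-85$ receives a current of $65 - 40 = 25$.</p>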
<p>We stimulate a neuron with $100$ synapses, each firing at $2 Hz$, between $200$ and $700 ms$.</p>

<p>Every millisecond, there are $0.001 * 2 * 100 = 0.2$ synapse spikes on average.</p>

<p>In other words, a synapse spike occurs every $5 ms$ on average.</p>

<p>The resulting membrane potential is displayed below:</p>

<p><img src="/images/posts/simulating_spiking_2_0.png" alt="synaptic input current" />
<img src="/images/posts/simulating_spiking_2.png" alt="Neuron response with synaptic input current" /></p>

<p>The synaptic input current oscillates around a mean value of approximately $10 mA$.</p>

<p>Due to the increased input current, the neuron spikes faster than in the previous stimulation.</p>

<h2 id="step-3-simulate-1000-neurons-with-synaptic-input">Step 3: Simulate 1000 neurons with synaptic input</h2>

<p>Each neuron is either:</p>

<ul>
  <li>an inhibitory fast-spiking neuron $(a=0.1, d=2.0)$,</li>
  <li>or an excitatory regular spiking neuron $(a=0.02, d=8.0)$.</li>
</ul>

<p>with a proportion of $20\%$ inhibitory neurons.</p>

<p>We therefore define a random uniform vector $p$ on $[0,1]$, and condition the $a$ and $d$ vectors of our neuron population on $p$.</p>

\[a[p&lt;0.2] = 0.1,\quad a[p&gt;=0.2] = 0.02\]

\[d[p&lt;0.2] = 2.0,\quad d[p&gt;=0.2] = 8.0\]
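<p>This conditioning can be sketched in plain Python (the <code>make_population</code> helper is hypothetical, not taken from the notebook):</p>

```python
import random

# Sketch of the population setup: draw p uniformly on [0, 1], then pick the
# (a, d) parameters of each neuron depending on whether it is inhibitory
# (p < 0.2, fast spiking) or excitatory (regular spiking).
def make_population(n, ratio=0.2, seed=None):
    rng = random.Random(seed)
    p = [rng.random() for _ in range(n)]
    a = [0.1 if pi < ratio else 0.02 for pi in p]
    d = [2.0 if pi < ratio else 8.0 for pi in p]
    return a, d
```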

<p>Each neuron is randomly connected to $10\%$ of the input synapses, and thus receives an input synapse spike every $50 ms$ on average.</p>

<p>Instead of displaying the membrane potentials, we just plot the neuron spikes for inhibitory (blue) and excitatory (yellow) neurons:</p>

<p><img src="/images/posts/simulating_spiking_3.png" alt="Inhibitory and Excitatory spikes" /></p>

<p>The neurons spike in ‘stripes’ at somewhat regular intervals, with a bit of dispersion.</p>

<p>The neuron dynamics seem to act as a regulator of the synaptic ‘noise’.</p>

<h2 id="step-4-simulate-1000-neurons-with-recurrent-connections">Step 4: Simulate 1000 neurons with recurrent connections</h2>

<p>Each neuron $i$ is sparsely connected (with probability $prc = 0.1$) to the other neurons $j$.</p>

<p>Neuron $i$ thus receives an additional recurrent current $Isyn(i)$ of the same form as the synaptic input:</p>

\[Isyn(i) = \sum_{j}w(i,j)g(j)(E(j) - v(t))\]

<p>The corresponding Tensorflow code looks like this (see the <a href="https://github.com/kaizouman/tensorsandbox/blob/master/snn/simple_spiking_model.ipynb">jupyter notebook</a> for details):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># First, update recurrent conductance dynamics:
# - increment by one the current factor of synapses that fired
# - decrease by tau the conductance dynamics in any case
</span><span class="n">g_update_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">has_fired_op</span><span class="p">,</span>
                       <span class="n">tf</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">g</span><span class="p">.</span><span class="n">shape</span><span class="p">)),</span>
                       <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dt</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">divide</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tau</span><span class="p">))))</span>
        
<span class="c1"># Update the g variable
</span><span class="n">g_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">assign</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">g</span><span class="p">,</span> <span class="n">g_update_op</span><span class="p">)</span>

<span class="c1"># We can now evaluate the recurrent conductance
# I_rec = Σ wjgj(Ej -v(t))
</span><span class="n">i_rec_op</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">einsum</span><span class="p">(</span><span class="s">'ij,j-&gt;i'</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">W</span><span class="p">),</span> <span class="n">tf</span><span class="p">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">g_op</span><span class="p">,</span> <span class="n">tf</span><span class="p">.</span><span class="n">subtract</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">constant</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">E</span><span class="p">),</span> <span class="n">v_op</span><span class="p">)))</span>

<span class="c1"># Get the synaptic input currents from parent
</span><span class="n">i_in_op</span> <span class="o">=</span> <span class="nb">super</span><span class="p">(</span><span class="n">SimpleSynapticRecurrentNeurons</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">get_input_ops</span><span class="p">(</span><span class="n">has_fired_op</span><span class="p">,</span> <span class="n">v_op</span><span class="p">)</span>
        
<span class="c1"># The actual current is the sum of both currents
</span><span class="n">i_op</span> <span class="o">=</span> <span class="n">i_in_op</span> <span class="o">+</span> <span class="n">i_rec_op</span>
</code></pre></div></div>

<p>Weights $w$ are Gamma distributed (scale $0.003$, shape $2$).</p>

<p>Inhibitory to excitatory connections are twice as strong.</p>

<p>$E(j)$ is set to $-85$ for inhibitory neurons, and to $0$ otherwise.</p>
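<p>Putting these three rules together, the connectivity can be sketched in plain Python using the standard library Gamma generator (the <code>make_recurrent_weights</code> helper is hypothetical, not the notebook code):</p>

```python
import random

# Sketch of the recurrent connectivity: each connection exists with
# probability prc, weights are Gamma distributed (shape 2, scale 0.003),
# inhibitory-to-excitatory weights are doubled, and E(j) is -85 for
# inhibitory pre-synaptic neurons and 0 otherwise.
def make_recurrent_weights(inhibitory, prc=0.1, seed=None):
    rng = random.Random(seed)
    n = len(inhibitory)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):          # post-synaptic neuron
        for j in range(n):      # pre-synaptic neuron
            if i != j and rng.random() < prc:
                w_ij = rng.gammavariate(2, 0.003)
                if inhibitory[j] and not inhibitory[i]:
                    w_ij *= 2.0  # inhibitory to excitatory: twice as strong
                w[i][j] = w_ij
    e = [-85.0 if inh else 0.0 for inh in inhibitory]
    return w, e
```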

<p>We again plot the neuron spikes for inhibitory (blue) and excitatory (yellow) neurons:</p>

<p><img src="/images/posts/simulating_spiking_4.png" alt="Inhibitory and Excitatory spikes with recurrent connections" /></p>

<p>The addition of recurrent connections has drastically reduced the dispersion of the neuron spikes.</p>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Explore Tensorflow features with the CIFAR10 dataset
          ]]>
      </title>
      <link>http://www.kaizou.org/2017/06/tensorflow-cifar10.html</link>
      <pubDate>Mon, 26 Jun 2017 16:51:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2017/06/tensorflow-cifar10</guid>
      <description>
          <![CDATA[
<p>The reason I started using Tensorflow was the limitations of my
experiments so far, where I had coded my models from scratch following the
guidance of the <a href="http://cs231n.github.io/">CNN for visual recognition</a> course.</p>

<p>I already knew how CNNs worked, and already had a good sense of what it
takes to train a good model. I had also read a lot of papers presenting
multiple variations of CNN topologies, some aiming at increasing accuracy,
others at reducing model complexity and size.</p>

<p>I work in the embedded world, so performance is obviously one of my primary
concerns, but I soon realized that the CNN state of the art for computer vision
had not yet reached a consensus on the best compromise between accuracy and
performance.</p>

<!--more-->

<p>In particular, I noticed that some papers had neglected to investigate how the
multiple characteristics of their models contribute to the overall results they
obtain: I assume that this is because it takes an awful lot of time to train a
single model, thus leaving no time for musing around.</p>

<p>Anyway, my goal was therefore to multiply experiments on several models to
better isolate how each feature contributes to the efficiency of the training
and to the performance of the inference.</p>

<p>More specifically, my goals were:</p>

<ul>
  <li>to verify that Tensorflow allowed me to improve the efficiency of my
trainings (going numpy-only is desperately slow, even with BLAS and/or MKL),</li>
  <li>to use this efficiency to multiply experiments, changing one model parameter
at a time to see how it contributes to the overall accuracy,</li>
  <li>to experiment with alternative CNN models to verify the claims in the
corresponding papers.</li>
</ul>

<p>Thanks to the <a href="http://cs231n.github.io/">CNN for visual recognition</a> course, I
had already used the CIFAR10 dataset extensively, and I was sure that its
complexity was compatible with the hardware setup I had.</p>

<p>I therefore used the <a href="https://www.tensorflow.org/tutorials/deep_cnn">tensorflow CIFAR10 image
tutorial</a> as a starting point.</p>

<h2 id="setting-up-a-tensorflow-environment">Setting up a Tensorflow environment</h2>

<p>I have a pretty good experience in setting up development environments, and am
very much aware of the mess your host system can become if you don’t maintain
a good isolation between these development environments.</p>

<p>After having tried several containment techniques (including chroots, Virtual
Machines and virtual env), I now use <a href="https://www.docker.com/">docker</a>, like
everybody else in the industry.</p>

<p>Google provides <a href="https://hub.docker.com/r/tensorflow/tensorflow/">docker images</a>
for the latest Tensorflow versions (both CPU and GPU), and also a development
image that you can use to rebuild Tensorflow with various optimizations for
your SoC.</p>

<p>You can refer to my <a href="https://github.com/kaizouman/tensorsandbox/tree/master/docker">step by step recipe</a>
to create your environment using docker.</p>

<h2 id="creating-a-cifar10-training-framework">Creating a CIFAR10 training framework</h2>

<p>Taking the Tensorflow image tutorial as an inspiration, I developed a
generic model training framework for the CIFAR10 dataset.</p>

<p>The framework uses several types of scripts for training and evaluation.</p>

<p>All scripts rely on the same data provider based on the tensorflow <a href="https://www.tensorflow.org/programmers_guide/reading_data">batch input
pipeline</a>.</p>

<p>The training scripts use Tensorflow <a href="https://www.tensorflow.org/api_docs/python/tf/train/MonitoredTrainingSession">monitored training sessions</a>, whose benefits
are twofold:</p>
<ul>
  <li>they neatly take care of tedious tasks like logs, saving checkpoints and
summaries,</li>
  <li>they almost transparently give access to the <a href="https://www.tensorflow.org/deploy/distributed">Tensorflow distributed
mode</a> to create training clusters.</li>
</ul>

<p>There is one script for training on a single host and another one for clusters.</p>

<p>There is also a single evaluation script, and a script to ‘freeze’ a model, i.e.
combine its graph definition with its trained weights into a single <a href="https://www.tensorflow.org/extend/tool_developers/">model
file</a> that can be loaded by
another Tensorflow application.</p>

<p>I tested the framework on a model I had already created for the assignments of
my course, verifying that I achieved the same accuracy.</p>

<p>The framework is in this <a href="https://github.com/kaizouman/tensorsandbox/tree/master/cifar10">github
repository</a>.</p>

<h2 id="reproducing-the-tutorial-performance">Reproducing the tutorial performance</h2>

<p>The next step was to start experimenting to figure out what really matters in
a CNN model for the CIFAR10 dataset.</p>

<p>The idea was to isolate the specific characteristics of the tutorial model to
evaluate how they contribute to the overall model accuracy.</p>

<p>As a first step, I implemented the same model as the tutorial in my framework,
but without all the training bells and whistles.</p>

<h3 id="basic-hyperparameters">Basic hyperparameters</h3>

<p>Learning rate and batch size are two of the most important hyperparameters, and
are usually well evaluated by model designers, as they have a direct impact on
model convergence.</p>

<p>So I would assume they are usually well-defined. I nevertheless tried different
training parameters, and finally decided to keep the ones provided by the
tutorial, as they gave the best results:</p>
<ul>
  <li>learning rate = 0.1,</li>
  <li>batch size = 128.</li>
</ul>

<p>Note: the learning rate is more related to the model, and the batch size to the
dataset.</p>

<h3 id="initialization">Initialization</h3>

<p>For the initialization parameters, I was a bit reluctant to investigate much,
as there were too many variations.</p>

<p>Moreover, I had already tried the <a href="http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf">Xavier initialization</a>
with good success, so I decided to initialize all variables with a Xavier
initializer.</p>
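<p>As a reminder, the Xavier (Glorot) uniform scheme draws weights from $U[-limit, limit]$ with $limit = \sqrt{6 / (fan_{in} + fan_{out})}$. A minimal Python sketch (the <code>xavier_uniform</code> helper is hypothetical; Tensorflow provides equivalent initializers):</p>

```python
import math
import random

# Sketch of the Xavier (Glorot) uniform initialization: weights are drawn
# uniformly in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)),
# which keeps the variance of activations roughly constant across layers.
def xavier_uniform(fan_in, fan_out, rng=random.uniform):
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```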

<h3 id="weight-decay">Weight decay</h3>

<p>For the weight decay, I used a global parameter for each model, but refined
it for each variable, dividing it by the matrix size: my primary concern was
to make sure that the induced loss did not explode.</p>

<h3 id="gradually-improving-from-my-first-results">Gradually improving from my first results</h3>

<p>With my basic setup, I achieved results a bit lower than the tutorial (for
exactly the same model):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>75.3% accuracy after 10,000 iterations instead of 81.3%.
</code></pre></div></div>

<p>Then, I added data augmentation, which smoothed the training process a lot:</p>

<ul>
  <li>drastic reduction of the overfitting,</li>
  <li>lower results for early iterations,</li>
  <li>much higher results after 5000+ iterations.</li>
</ul>

<p>With data augmentation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>78.8% accuracy after 10,000 iterations.
</code></pre></div></div>

<p>Finally, I used moving averages of the trainable variables instead of their raw
values, and it gave me the extra missing accuracy to match the tutorial performance:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>81.4% accuracy after 10,000 iterations.
</code></pre></div></div>
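<p>The moving average mechanism itself is simple: each trainable variable gets a shadow copy that is updated at every step and used at evaluation time. A minimal sketch of the update rule (Tensorflow provides this through <code>tf.train.ExponentialMovingAverage</code>; the decay value is an assumption):</p>

```python
# Sketch of the exponential moving average applied to trainable variables:
# the shadow value slowly tracks the raw value, smoothing out the noise of
# individual gradient updates.
def ema_update(shadow, value, decay=0.999):
    return decay * shadow + (1.0 - decay) * value
```

<p>Evaluating with the shadow values instead of the raw ones averages the model over the last training steps, which is likely what provides the extra accuracy.</p>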

<p>After 300,000 iterations, the model with data augmentation even reached 87%
accuracy.</p>

<h3 id="conclusion">Conclusion</h3>

<p>For the CIFAR10 dataset, data augmentation is a key factor for a successful
training, and using variable moving averages really helps convergence.</p>

<h3 id="tutorial-model-metrics">Tutorial model metrics</h3>

<p>Without data augmentation (32x32x3 images):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Size  : 1.76 Millions of parameters
Flops : 66.98 Millions of operations
</code></pre></div></div>

<p>With data augmentation (24x24x3 images):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Size     : 1.07 Millions of parameters
Flops    : 37.75 Millions of operations
</code></pre></div></div>

<h2 id="experimenting-with-the-tutorial-model-topology">Experimenting with the tutorial model topology</h2>

<p>To better understand the tutorial model topology, I tested a few <a href="https://github.com/kaizouman/tensorsandbox/tree/master/cifar10/models/alex">AlexNet-style
model</a>
variants.</p>

<p>Note: I call these models Alex-like as the tutorial is based on the models
defined by Alex Krizhevsky, winner of the ImageNet challenge in 2012.</p>

<p>I didn’t save all variants I tried, but to summarize my experiments:</p>

<ul>
  <li>Local-response-normalization is useless,</li>
  <li>One of the FC layer can be removed without harming accuracy too much,</li>
  <li>For the same amount of parameters, more filters with smaller kernels are
equivalent to the base setup.</li>
</ul>

<p>My conclusion is that the tutorial model can be improved a bit in terms of size
and processing power (see the Alex 4 variant for instance), but that it is
already a good model for that specific topology that combines two standard
convolutional layers with two dense layers.</p>

<h2 id="experimenting-with-alternative-models">Experimenting with alternative models</h2>

<p>The next step was to experiment further with different models:</p>

<ul>
  <li><a href="https://github.com/kaizouman/tensorsandbox/tree/master/cifar10/models/nin">NiN
networks</a> that remove dense layers altogether,</li>
  <li><a href="https://github.com/kaizouman/tensorsandbox/tree/master/cifar10/models/squeeze">SqueezeNets</a> that parallelize convnets.</li>
</ul>

<p>The idea was to stay within the same range in terms of computational cost and
model size, but trying to find a better compromise between model accuracy,
model size and inference performance.</p>

<p>The figure below provides accuracy for the three best models I obtained,
compared to the tutorial version and one of the Alex-style variants.</p>

<p><img src="/images/posts/cifar10@300000.jpg" alt="cifar10 accuracy for various models after 300,000 iterations" /></p>

<p>For each model, I evaluated the model size in number of parameters, and its
computational cost in number of operations.</p>

<p>To put these theoretical counters in perspective, I also got ‘real’ numbers by
checking:</p>
<ul>
  <li>the actual disk size of the saved models,</li>
<li>the inference time using the C++ label_image tool (I added some traces).</li>
</ul>

<p>The ratio between the number of parameters and the actual size on disk is
consistent for all models, but the inference time is not, and may vary greatly
depending on the actual local optimizations. The winner is however the model
with the fewest operations.</p>

<p>Here are the detailed numbers for all trained models:</p>

<h3 id="tuto">Tuto</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy : 87.2%
Size     : 1.07 Millions of parameters  / 4,278,750 bytes
Flops    : 37.75 Millions of operations / 44 ms
</code></pre></div></div>

<h3 id="alex-alex4">Alex (alex4)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy : 87.5%
Size     : 1.49 Millions of parameters  / 5,979,938 bytes
Flops    : 35.20 Millions of operations / 50 ms
</code></pre></div></div>

<h3 id="nin-nin2">NiN (nin2)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy : 89.8%
Size     : 0.97 Millions of parameters   / 3,881,548 bytes
Flops    : 251.36 Millions of operations / 90 ms
</code></pre></div></div>

<h3 id="squeezenet-squeeze1">SqueezeNet (squeeze1)</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Accuracy : 87.8%
Size     : 0.15 Millions of parameters   / 602,892 bytes
Flops    : 22.84 Millions of operations  / 27 ms
</code></pre></div></div>

<h3 id="conclusion-1">Conclusion</h3>

<p>Of all the model topologies I studied here, the SqueezeNet architecture is by far
the most efficient, reaching the same level of accuracy with a model that is
more than six times lighter than the tutorial version, and more than 1.5 times
faster.</p>

<h2 id="further-experiments">Further experiments</h2>

<p>In my alternative models, I had first included <a href="https://arxiv.org/abs/1409.4842">Inception</a>, but I ruled it out
after finding out how costly NiN already was: it would nevertheless be
interesting to evaluate <a href="https://arxiv.org/pdf/1610.02357.pdf">Xception</a>, one of
its derivatives that uses depthwise separable convolutions.</p>

<p>Last, I would like to check how these models could be compressed using iterative
pruning and quantization.</p>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Build and boot a minimal Linux system with qemu
          ]]>
      </title>
      <link>http://www.kaizou.org/2016/09/boot-minimal-linux-qemu.html</link>
      <pubDate>Fri, 23 Sep 2016 16:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2016/09/boot-minimal-linux-qemu</guid>
      <description>
          <![CDATA[
          <p>When you want to build a Linux system for an embedded target these days, it is very unlikely that you decide to do it from scratch.</p>

<p>Embedded Linux build systems are really smart and efficient, and will fit almost all use cases: should you need only a simple system, <a href="https://buildroot.org/">buildroot</a> should be your first choice, and if you want to include more advanced features, or even create a full distribution, <a href="https://www.yoctoproject.org/">Yocto</a> is the way to go.</p>

<p>That said, even if these tools will do all the heavy-lifting for you, they are not perfect, and if you are using less common configurations, you may stumble upon issues that were not expected. In that case, it may be important to understand what happens behind the scenes.</p>

<p>In this post, I will describe step-by-step how you can build a minimal Linux system for an embedded target and boot it using <a href="http://wiki.qemu.org/Main_Page">QEMU</a>.</p>

<!--more-->

<h1 id="install-qemu">Install QEMU</h1>

<p><a href="http://wiki.qemu.org/Main_Page">QEMU</a> is available for all major distros.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>qemu
</code></pre></div></div>

<p>In this post I will create a system for an ARM target, so that I don’t mix up my host and target systems (see the last paragraph of <a href="https://landley.net/writing/docs/cross-compiling.html">this introduction on cross-compilation</a>).</p>

<p>You can list the ARM machines your <a href="http://wiki.qemu.org/Main_Page">QEMU</a> setup supports from the command-line:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-system-arm <span class="nt">--machine</span> <span class="nb">help
</span>Supported machines are:
versatileab          ARM Versatile/AB <span class="o">(</span>ARM926EJ-S<span class="o">)</span>
...
mainstone            Mainstone II <span class="o">(</span>PXA27x<span class="o">)</span>
...
midway               Calxeda Midway <span class="o">(</span>ECX-2000<span class="o">)</span>
virt                 ARM Virtual Machine
borzoi               Borzoi PDA <span class="o">(</span>PXA270<span class="o">)</span>
</code></pre></div></div>

<p>In this tutorial, I will use an old Intel ARM platform, the Mainstone.</p>

<blockquote>
  <p>The only reason I chose this platform is that its maintainer is Robert Jarzmik, who has been sitting next to me in the open space for the last year. He is <em>very</em> knowledgeable about the Kernel, and also very nice. Thanks, Bob!</p>
</blockquote>

<h1 id="generate-the-toolchain">Generate the toolchain</h1>

<p>To generate the binaries for our embedded target, we need a toolchain, which is a set of tools targeting the corresponding processor architecture.</p>

<p>Most of the time, the board manufacturer will have provided the toolchain as part of the BSP (Board Support Package).</p>

<p>Generating a toolchain used to be quite painful, but since the awesome <a href="http://crosstool-ng.org/">crosstool-ng</a> tool has been made available, this is a piece of cake.</p>

<blockquote>
  <p>More namedropping: kudos to my friend <a href="http://ymorin.is-a-geek.org/">Yann E. Morin</a> for developing <a href="http://crosstool-ng.org/">crosstool-ng</a>.</p>
</blockquote>

<p>First, we need to fetch and install the tool.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>wget http://crosstool-ng.org/download/crosstool-ng/crosstool-ng-1.22.0.tar.xz
<span class="nv">$ </span><span class="nb">tar </span>xf crosstool-ng-1.22.0.tar.xz
<span class="nv">$ </span><span class="nb">cd </span>crosstool-ng/
<span class="nv">$ </span>./configure
<span class="nv">$ </span>make
<span class="nv">$ </span><span class="nb">sudo </span>make <span class="nb">install</span>
</code></pre></div></div>

<p>You can list the pre-configured toolchains that your crosstool-NG version supports:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ct-ng list-samples
Status  Sample name
  LN    config
  MKDIR config.gen
  IN    config.gen/arch.in
  IN    config.gen/kernel.in
  IN    config.gen/cc.in
  IN    config.gen/binutils.in
  IN    config.gen/libc.in
  IN    config.gen/debug.in
<span class="o">[</span>G..]   alphaev56-unknown-linux-gnu
...
<span class="o">[</span>G..]   armeb-unknown-linux-uclibcgnueabi
...
<span class="o">[</span>G..]   xtensa-unknown-linux-uclibc
 L <span class="o">(</span>Local<span class="o">)</span>       : sample was found <span class="k">in </span>current directory
 G <span class="o">(</span>Global<span class="o">)</span>      : sample was installed with crosstool-NG
 X <span class="o">(</span>EXPERIMENTAL<span class="o">)</span>: sample may use EXPERIMENTAL features
 B <span class="o">(</span>BROKEN<span class="o">)</span>      : sample is currently broken
</code></pre></div></div>

<p>For the Mainstone board, we will use a generic ARM toolchain with <a href="https://www.uclibc.org/">uClibc</a>, a smaller C library for embedded targets.</p>

<p>You can get the details of the toolchain that will be produced from the command-line:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ct-ng show-arm-unknown-linux-uclibcgnueabi
  IN    config.gen/arch.in
  IN    config.gen/kernel.in
  IN    config.gen/cc.in
  IN    config.gen/binutils.in
  IN    config.gen/libc.in
<span class="o">[</span>G..]   arm-unknown-linux-uclibcgnueabi
    OS             : linux-4.3
    Companion libs : gmp-6.0.0a mpfr-3.1.3 mpc-1.0.3 libelf-0.8.13 expat-2.1.0 ncurses-6.0
    binutils       : binutils-2.25.1
    C compilers    : gcc  |  5.2.0
    Languages      : C,C++
    C library      : uClibc-ng-1.0.9 <span class="o">(</span>threads: nptl<span class="o">)</span>
    Tools          : dmalloc-5.5.2 duma-2_5_15 gdb-7.10 ltrace-0.7.3 strace-4.10
</code></pre></div></div>

<p>Let’s generate (this <em>will</em> take a while):</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ct-ng arm-unknown-linux-uclibcgnueabi
<span class="nv">$ </span>ct-ng build
</code></pre></div></div>

<p>By default, the toolchain will be installed under $HOME/x-tools/arm-unknown-linux-uclibcgnueabi. In order to use it, we add the toolchain bin directory to the PATH:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">PATH</span><span class="k">}</span><span class="s2">:</span><span class="k">${</span><span class="nv">HOME</span><span class="k">}</span><span class="s2">/x-tools/arm-unknown-linux-gnueabi/bin"</span>
<span class="nv">$ </span>arm-unknown-linux-uclibcgnueabi-gcc <span class="nt">--version</span>
arm-unknown-linux-uclibcgnueabi-gcc <span class="o">(</span>crosstool-NG crosstool-ng-1.22.0<span class="o">)</span> 5.2.0
Copyright <span class="o">(</span>C<span class="o">)</span> 2015 Free Software Foundation, Inc.
This is free software<span class="p">;</span> see the <span class="nb">source </span><span class="k">for </span>copying conditions.  There is NO
warranty<span class="p">;</span> not even <span class="k">for </span>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
</code></pre></div></div>

<blockquote>
  <p>Note that I have added a small routine to my shell startup script to automatically add paths to toolchains:</p>
  <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span><span class="nb">dir </span><span class="k">in</span> <span class="sb">`</span><span class="nb">ls</span> ~/x-tools<span class="sb">`</span><span class="p">;</span> <span class="k">do
</span><span class="nv">PATH</span><span class="o">=</span>~/x-tools/<span class="nv">$dir</span>/bin:<span class="nv">$PATH</span>
<span class="k">done
</span><span class="nb">export </span>PATH
</code></pre></div>  </div>
</blockquote>

<h1 id="sanity-check-test-cross-compilation-environment">Sanity check: test cross-compilation environment</h1>

<p>It is always a good practice to verify at regular intervals that your setup is correct. Here, we will make sure that the toolchain is able to generate ARM code that can be run by qemu-arm, the QEMU ARM CPU emulator.</p>

<p>For those unfamiliar with cross-compilation, this may also help to put things in perspective.</p>

<p>We will compile a very simple program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main.c

#include &lt;stdio.h&gt;

int main(int argc, char *argv[])
{
	printf("Genuinely generated by the toolchain\n");
	return 0;
}
</code></pre></div></div>

<p>Let’s first build with a naive command:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>arm-unknown-linux-uclibcgnueabi-gcc main.c <span class="nt">-o</span> sanity
<span class="nv">$ </span><span class="nb">chmod</span> +x sanity
</code></pre></div></div>

<p>We verify that sanity is an ARM executable that cannot run on our host system:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./sanity
bash: ./sanity: cannot execute binary file: Exec format error
<span class="nv">$ </span>file sanity
sanity: ELF 32-bit LSB  executable, ARM, EABI5 version 1 <span class="o">(</span>SYSV<span class="o">)</span>, dynamically linked <span class="o">(</span>uses shared libs<span class="o">)</span>, not stripped
</code></pre></div></div>

<p>Now, let’s try to run it with QEMU:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-arm sanity
/lib/ld-uClibc.so.0: No such file or directory
</code></pre></div></div>

<p>What happened? We get this error because, by default, GCC generated a sanity executable that requires dynamic linking against system libraries, as the following command reveals:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>readelf <span class="nt">-d</span> sanity

Dynamic section at offset 0x4f0 contains 18 entries:
  Tag        Type                         Name/Value
 0x00000001 <span class="o">(</span>NEEDED<span class="o">)</span>                     Shared library: <span class="o">[</span>libc.so.1]
 0x0000000c <span class="o">(</span>INIT<span class="o">)</span>                       0x102d4
...
</code></pre></div></div>

<p>Here, QEMU needs to find the C library and load it using the dynamic linker, which is itself a library, ld-uClibc.so.0, as the INTERP program header reveals:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>readelf <span class="nt">-l</span> sanity
Elf file <span class="nb">type </span>is EXEC <span class="o">(</span>Executable file<span class="o">)</span>
Entry point 0x10334
There are 6 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x00010034 0x00010034 0x000c0 0x000c0 R E 0x4
  INTERP         0x0000f4 0x000100f4 0x000100f4 0x00014 0x00014 R   0x1
      <span class="o">[</span>Requesting program interpreter: /lib/ld-uClibc.so.0]
...
</code></pre></div></div>

<p>Both libraries are under the toolchain ‘sysroot’ directory.</p>

<blockquote>
  <p>Should you decide to support dynamic linking, the dynamic linker and the C library should at some point end up on your target.</p>
</blockquote>

<p>Specifically for that purpose, QEMU supports specifying the path to dynamically linked libraries using the -L option or the QEMU_LD_PREFIX environment variable.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-arm <span class="nt">-L</span> ~/x-tools/arm-unknown-linux-uclibcgnueabi/arm-unknown-linux-uclibcgnueabi/sysroot/ sanity
Genuinely generated by the toolchain
</code></pre></div></div>

<p>or</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ QEMU_LD_PREFIX</span><span class="o">=</span>~/x-tools/arm-unknown-linux-uclibcgnueabi/arm-unknown-linux-uclibcgnueabi/sysroot/ qemu-arm sanity
Genuinely generated by the toolchain
</code></pre></div></div>

<p>If you want to avoid these linking issues, you can tell GCC to generate a static executable instead:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>arm-unknown-linux-uclibcgnueabi-gcc <span class="nt">-static</span> main.c <span class="nt">-o</span> sanity
<span class="nv">$ </span>qemu-arm sanity
Genuinely generated by the toolchain
</code></pre></div></div>

<h1 id="configure-and-build-the-linux-kernel">Configure and build the Linux Kernel</h1>

<p>At the time of writing, the latest stable Kernel version is 4.7.5.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir </span>linux
<span class="nv">$ </span>wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.7.5.tar.xz <span class="nt">-O</span> linux/linux-4.7.5.tar.xz
<span class="nv">$ </span><span class="nb">tar </span>xf linux/linux-4.7.5.tar.xz <span class="nt">-C</span> linux
</code></pre></div></div>

<p>We select the mainstone configuration to build the Kernel:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make <span class="nt">-C</span> linux/linux-4.7.5 <span class="nv">ARCH</span><span class="o">=</span>arm mainstone_defconfig <span class="nv">O</span><span class="o">=</span>linux/build
</code></pre></div></div>

<blockquote>
  <p>You need to specify the architecture to tell the Kernel where it should look for existing configurations (here arch/arm/configs)</p>
</blockquote>

<p>The Linux Kernel is very versatile in the way it boots, and it can frankly be overwhelming if you consider all the options.</p>

<p>In this article, I will illustrate two boot modes: a stand-alone Kernel with a RAM initrd, and a Kernel that boots on a root filesystem on an SD card.</p>

<p>As per the Linux Kernel documentation:</p>

<blockquote>
  <p>initrd provides the capability to load a RAM disk by the boot loader.
This RAM disk can then be mounted as the root file system and programs
can be run from it. Afterwards, a new root file system can be mounted
from a different device. The previous root (from initrd) is then moved
to a directory and can be subsequently unmounted.</p>

  <p>initrd is mainly designed to allow system startup to occur in two phases,
where the kernel comes up with a minimum set of compiled-in drivers, and
where additional modules are loaded from initrd.</p>
</blockquote>

<p>initrd is primarily intended to be a bootstrap in RAM that allows the Kernel to get access to the ‘real’ rootfs, but we can also use it to simply boot the Kernel without providing a rootfs.</p>

<p>We will see how we can create an initrd in the subsequent paragraphs.</p>

<p>The mainstone default configuration is fairly minimal, and we will need to add a few options to support these two boot modes.</p>

<p>First, we need to add initrd support by activating the BLK_DEV_INITRD configuration option.</p>

<p>Second, we need to add SD card support for the mainstone board, which belongs to the PXA family. The driver is the MultiMedia Card driver for PXA, and it requires Direct Memory Access: we therefore need to select MMC, MMC_PXA, DMADEVICES and PXA_DMA.</p>

<p>We also need to activate the AEABI configuration to make sure the Kernel uses the latest ARM EABI convention. As per the Linux Kernel documentation:</p>

<blockquote>
  <p>This option allows for the kernel to be compiled using the latest ARM ABI (aka EABI).  This is only useful if you are using a user space environment that is also compiled with EABI.</p>
</blockquote>

<p>We need to add these options manually using the curses menuconfig interface:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make <span class="nt">-C</span> linux/build <span class="nv">ARCH</span><span class="o">=</span>arm menuconfig
</code></pre></div></div>

<blockquote>
  <ul>
    <li>General Setup-&gt;Initial RAM filesystem and RAM disk (initramfs/initrd) support</li>
    <li>Device Drivers-&gt;MMC/SD/SDIO card support-&gt;Intel PXA25x/.. Multimedia Card Interface support</li>
    <li>Device Drivers-&gt;DMA Engine support-&gt;PXA DMA support</li>
    <li>Kernel Features-&gt;Use the ARM EABI to compile the kernel</li>
  </ul>
</blockquote>
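<p>For reference, once these entries are selected, the generated .config should contain options along these lines (a sketch; verify the exact symbol names against your Kernel version):</p>

```
CONFIG_BLK_DEV_INITRD=y
CONFIG_MMC=y
CONFIG_MMC_PXA=y
CONFIG_DMADEVICES=y
CONFIG_PXA_DMA=y
CONFIG_AEABI=y
```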

<p>Once our Kernel has been properly configured, we can build it:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>make <span class="nt">-C</span> linux/build <span class="nv">ARCH</span><span class="o">=</span>arm <span class="nv">CROSS_COMPILE</span><span class="o">=</span>arm-unknown-linux-uclibcgnueabi-
</code></pre></div></div>

<p>At the end of the build, our Kernel will be under linux/build/arch/arm/boot.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ls </span>linux/build/arch/arm/boot/
compressed  Image  zImage
</code></pre></div></div>

<h1 id="sanity-check-launch-the-linux-kernel-with-qemu">Sanity check: launch the Linux Kernel with QEMU</h1>

<p>We verify that the Kernel has been properly generated by launching it with qemu-system-arm, the <a href="http://wiki.qemu.org/Main_Page">QEMU</a> system emulator (note the difference with qemu-arm, the CPU emulator).</p>

<p>We pass four parameters on the command-line:</p>

<ul>
  <li>kernel: path to our Kernel,</li>
  <li>machine: the machine we use (here ‘mainstone’),</li>
  <li>serial: set to ‘stdio’ to get the Kernel printk logs in the console,</li>
  <li>append: parameters to add to the Kernel command-line</li>
</ul>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-system-arm <span class="nt">-kernel</span> linux/zImage <span class="nt">-serial</span> stdio <span class="nt">-append</span> <span class="s1">'console=ttyS0'</span> <span class="nt">-M</span> mainstone
Two flash images must be given with the <span class="s1">'pflash'</span> parameter
</code></pre></div></div>

<p>The mainstone board has two 64 MB flash banks whose images must be provided on the qemu-system-arm command-line.</p>

<p>We create two empty images:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>mainstone-flash0.img <span class="nv">bs</span><span class="o">=</span>1024 <span class="nv">count</span><span class="o">=</span>65536
<span class="nv">$ </span><span class="nb">dd </span><span class="k">if</span><span class="o">=</span>/dev/zero <span class="nv">of</span><span class="o">=</span>mainstone-flash1.img <span class="nv">bs</span><span class="o">=</span>1024 <span class="nv">count</span><span class="o">=</span>65536
</code></pre></div></div>

<p>We can now launch the Kernel.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-system-arm <span class="nt">-kernel</span> linux/zImage <span class="nt">-append</span> <span class="s1">'console=ttyS0'</span> <span class="nt">-machine</span> mainstone <span class="nt">-serial</span> stdio <span class="nt">-pflash</span> mainstone-flash0.img <span class="nt">-pflash</span> mainstone-flash1.img
Booting Linux on physical CPU 0x0
Linux version 4.7.5 <span class="o">(</span>xxx@yyy<span class="o">)</span> <span class="o">(</span>gcc version 5.2.0 <span class="o">(</span>crosstool-NG crosstool-ng-1.22.0<span class="o">)</span> <span class="o">)</span> <span class="c">#1 Tue Sep 27 09:35:52 CEST 2016</span>
CPU: XScale-PXA270 <span class="o">[</span>69054117] revision 7 <span class="o">(</span>ARMv5TE<span class="o">)</span>, <span class="nv">cr</span><span class="o">=</span>00007977
...
XScale iWMMXt coprocessor detected.
VFS: Cannot open root device <span class="s2">"(null)"</span> or unknown-block<span class="o">(</span>0,0<span class="o">)</span>: error <span class="nt">-6</span>
Please append a correct <span class="s2">"root="</span> boot option<span class="p">;</span> here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block<span class="o">(</span>0,0<span class="o">)</span>
...
</code></pre></div></div>

<p>It still fails because we provided neither a rootfs nor an initrd.</p>

<h1 id="create-a-tiny-init">Create a tiny init</h1>

<p>Let’s create a simplistic bootstrap:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main.c:

#include &lt;stdio.h&gt;

int main(void)
{
	printf("Tiny init ...\n");
	while (1);
}
</code></pre></div></div>

<p>We compile it using the ARM toolchain, passing a few CFLAGS to specify the mainstone CPU instruction set:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>arm-unknown-linux-uclibcgnueabi-gcc <span class="nt">-static</span> <span class="nt">-march</span><span class="o">=</span>armv5te <span class="nt">-mtune</span><span class="o">=</span>xscale <span class="nt">-Wa</span>,-mcpu<span class="o">=</span>xscale main.c <span class="nt">-o</span> init
<span class="nv">$ </span><span class="nb">chmod</span> +x init
</code></pre></div></div>

<p>We will now use that bootstrap to boot the system after the Kernel has been loaded.</p>

<h1 id="ram-boot-using-initrd">RAM boot using initrd</h1>

<p>We create a CPIO RAM image that contains only the init program:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo </span>init | cpio <span class="nt">-o</span> <span class="nt">--format</span><span class="o">=</span>newc <span class="o">&gt;</span> initramfs
</code></pre></div></div>

<p>Now, if we launch the Kernel again, specifying our initramfs, we end up in the tiny init loop:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-system-arm <span class="nt">-kernel</span> linux/zImage <span class="nt">-append</span> <span class="s1">'console=ttyS0'</span> <span class="nt">-machine</span> mainstone <span class="nt">-serial</span> stdio <span class="nt">-pflash</span> mainstone-flash0.img <span class="nt">-pflash</span> mainstone-flash1.img <span class="nt">-initrd</span> initramfs
Booting Linux on physical CPU 0x0
Linux version 4.7.5 <span class="o">(</span>xxx@yyy<span class="o">)</span> <span class="o">(</span>gcc version 5.2.0 <span class="o">(</span>crosstool-NG crosstool-ng-1.22.0<span class="o">)</span> <span class="o">)</span> <span class="c">#1 Tue Sep 27 09:35:52 CEST 2016</span>
CPU: XScale-PXA270 <span class="o">[</span>69054117] revision 7 <span class="o">(</span>ARMv5TE<span class="o">)</span>, <span class="nv">cr</span><span class="o">=</span>00007977
...
XScale iWMMXt coprocessor detected.
Freeing unused kernel memory: 148K <span class="o">(</span>c03cf000 - c03f4000<span class="o">)</span>
This architecture does not have kernel memory protection.
Tiny init ...
</code></pre></div></div>

<h1 id="boot-on-a-sd-card-image">Boot on a SD card image</h1>

<p>We will now create an SD card image containing the tiny init code.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-img create init.img 128K
</code></pre></div></div>

<p>We format the SD card image with an ext2 file-system.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mkfs.ext2 init.img
mke2fs 1.42.13 <span class="o">(</span>17-May-2015<span class="o">)</span>
Discarding device blocks: <span class="k">done
</span>Creating filesystem with 128 1k blocks and 16 inodes

Allocating group tables: <span class="k">done
</span>Writing inode tables: <span class="k">done
</span>Writing superblocks and filesystem accounting information: <span class="k">done</span>
</code></pre></div></div>

<p>Then, we can mount it and copy the init program into the image:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir </span>tmp
<span class="nv">$ </span><span class="nb">sudo </span>mount <span class="nt">-o</span> loop init.img tmp
<span class="nv">$ </span><span class="nb">sudo mkdir</span> <span class="nt">-p</span> tmp/sbin
<span class="nv">$ </span><span class="nb">sudo cp </span>init tmp/sbin/
<span class="nv">$ </span><span class="nb">sudo </span>umount tmp
<span class="nv">$ </span><span class="nb">rmdir </span>tmp
</code></pre></div></div>

<blockquote>
  <p>Note that the Kernel expects the init bootstrap to be under /sbin/init, and not at the root of the file system like in the initram file system.</p>
</blockquote>

<p>We can now launch the Kernel specifying that the rootfs is on /dev/mmcblk0, which is the pseudo-device for the SD card passed to <a href="http://wiki.qemu.org/Main_Page">QEMU</a> with the -sd option.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-system-arm <span class="nt">-kernel</span> linux/zImage <span class="nt">-append</span> <span class="s1">'console=ttyS0 root=/dev/mmcblk0'</span> <span class="nt">-machine</span> mainstone <span class="nt">-serial</span> stdio <span class="nt">-pflash</span> mainstone-flash0.img <span class="nt">-pflash</span> mainstone-flash1.img <span class="nt">-sd</span> init.img
Booting Linux on physical CPU 0x0
Linux version 4.7.5 <span class="o">(</span>xxx@yyy<span class="o">)</span> <span class="o">(</span>gcc version 5.2.0 <span class="o">(</span>crosstool-NG crosstool-ng-1.22.0<span class="o">)</span> <span class="o">)</span> <span class="c">#1 Tue Sep 27 09:35:52 CEST 2016</span>
CPU: XScale-PXA270 <span class="o">[</span>69054117] revision 7 <span class="o">(</span>ARMv5TE<span class="o">)</span>, <span class="nv">cr</span><span class="o">=</span>00007977
...
XScale iWMMXt coprocessor detected.
mmc0: host does not support reading read-only switch, assuming write-enable
mmc0: new SD card at address 4567
mmcblk0: mmc0:4567 QEMU! 1.00 GiB
VFS: Mounted root <span class="o">(</span>ext2 filesystem<span class="o">)</span> <span class="nb">readonly </span>on device 179:0.
Freeing unused kernel memory: 152K <span class="o">(</span>c03ee000 - c0414000<span class="o">)</span>
This architecture does not have kernel memory protection.
Tiny init ...
</code></pre></div></div>

<p>Voilà!</p>

<p>In a follow-up article, I will demonstrate how to create a small rootfs using <a href="https://busybox.net/">BusyBox</a>.</p>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Benchmarking build systems for a large C project
          ]]>
      </title>
      <link>http://www.kaizou.org/2016/09/build-benchmark-large-c-project.html</link>
      <pubDate>Thu, 01 Sep 2016 16:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2016/09/build-benchmark-large-c-project</guid>
      <description>
          <![CDATA[
          <p>The performance of build systems has been discussed at large in the developer community, with a strong emphasis made on the limitations of the legacy Make tool when dealing with large/complex projects.</p>

<p>I recently had to develop a build-system to create firmwares for embedded targets from more than 1000 source files.</p>

<p>The requirements were to use build recipes that could be customized for each directory and file in the source tree, similar to what the Linux Kernel does with <a href="https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt">kbuild</a>.</p>

<p>I designed a custom recursive Make solution inspired by <a href="https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt">kbuild</a>.</p>

<!--more-->

<blockquote>
  <p>Note: for those interested, this is the build system used in the <a href="https://github.com/CurieBSP/main">Intel Curie SDK for wearables</a>.</p>
</blockquote>

<p>After one major release, I had some time to muse over the abundant literature on build systems, and in particular the infamous <a href="http://aegis.sourceforge.net/auug97.pdf">“Recursive Make considered harmful”</a>, and started to wonder whether I had made the right design choice.</p>

<p>Obviously, my solution had the same limitation as all recursive Make solutions: it was unable to express explicit dependencies from one part of the tree to another. We had easily overcome that, however, by relying solely on headers to express dependencies, and taking advantage of the <a href="http://www.evanjones.ca/makefile-dependencies.html">GCC automatic dependencies</a>, pretty much like all C projects do anyway.</p>

<p>The solution was also relatively fast, which seemed to contradict the claims of <a href="http://stackoverflow.com/questions/559216/what-is-your-experience-with-non-recursive-make">many people</a>.</p>

<p>I therefore decided to do a little benchmark to sort it out.</p>

<blockquote>
  <p>You can check for yourself the several solutions in this <a href="https://github.com/kaizouman/build-benchmark">repo</a>.</p>
</blockquote>

<h1 id="the-benchmark">The benchmark</h1>

<p>The benchmark is to compile a hierarchical source tree whose directories each contain two source files (a header and an implementation), and one build fragment specifying a custom preprocessor definition. Each directory’s implementation ‘depends’ on its child directories’ sources by including their headers.</p>

<blockquote>
  <p>Yes, it is a wacky design, but I just wanted to challenge the build system.</p>
</blockquote>
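<p>To give a concrete idea of the input, here is a sketch of how such a tree could be generated (names, depth and fragment contents are illustrative, not the actual generator from the benchmark repo):</p>

```shell
# Generate a chain of nested directories, each with a header, an
# implementation, and a Makefile fragment defining a custom CFLAG.
path=tree
for n in d0 d1 d2; do
    path="$path/$n"
    mkdir -p "$path"
    printf 'void %s_hello(void);\n' "$n" > "$path/$n.h"
    printf '#include "%s.h"\nvoid %s_hello(void) {}\n' "$n" "$n" > "$path/$n.c"
    printf 'obj-y = %s.c\ncflags-y = -DFRAG_%s\n' "$n" "$n" > "$path/Makefile"
done
```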

<p>The benchmark script tests several build-system invocations in four configurations:</p>

<ul>
  <li>cold start (full build from a fresh tree),</li>
  <li>full rebuild (touch all sources and rebuild),</li>
  <li>build leaf (only touch one of the leaf headers),</li>
  <li>nothing to do.</li>
</ul>

<h1 id="the-solutions">The solutions</h1>

<h2 id="kbuild">Kbuild</h2>

<p>The first solution is a variant of my kbuild clone. The design is dead simple:</p>

<ul>
  <li>each directory has a Makefile fragment that produces a C static library,</li>
  <li>a directory archive aggregates the object files in this directory and the static libraries of its subdirectories,</li>
  <li>a generic Makefile is launched recursively on the source tree to generate libraries and aggregate them to the top.</li>
</ul>

<p>The syntax of the Makefile fragments is the same as the one used by the Linux Kernel:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>obj-y = foo.c bar/
cflags-y = -Isomepath -DFOO
</code></pre></div></div>

<p>The generic Makefile is a bit cryptic for those not familiar with the Make syntax, but it is actually not very complicated.</p>

<p>This Makefile starts by including the Makefile fragment, then does some processing on the local obj-y variable, to identify local objects and subdirectories.</p>

<p>It then defines rules to:</p>

<ul>
  <li>build subdirectory archives by relaunching itself on each subdirectory,</li>
  <li>build local objects, taking into account local CFLAGS,</li>
  <li>create the directory library as a ‘thin’ archive, i.e. a list of references to the actual object files.</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>THIS_FILE := $(abspath $(lastword $(MAKEFILE_LIST)))

all:

# Those are supposed to be passed on the command line
OUT ?= build
SRC ?= src

# Look for a Makefile in the current source directory
-include $(SRC)/Makefile

# First, identify if there are any directories specified in obj-y that we need
# to descend into
subdir-y := $(sort $(patsubst %/,%,$(filter %/, $(obj-y))))

# Next, update the list of objects, replacing any specified directory by the
# aggregated object that will be produced when descending into it
obj-y := $(patsubst %/, %/built-in.a, $(obj-y))

# Prepend the subdirectories with the actual source directory
subdir-y := $(addprefix $(SRC)/,$(subdir-y))

# Prepend the objects with the actual build DIR
obj-y := $(addprefix $(OUT)/$(SRC)/,$(obj-y))

# Fake target used to force subdirectories to be visited on every Make call
.FORCE:
# Go into each subdirectory to build aggregated objects
$(OUT)/$(SRC)/%/built-in.a: .FORCE
	$(MAKE) -f $(THIS_FILE) SRC=$(SRC)/$* OUT=$(OUT)

# Include dependency files that may have been produced by a previous build
-include $(OUT)/$(SRC)/*.d

# Evaluate local CFLAGS
LOCAL_CFLAGS := -MD $(CFLAGS) $(cflags-y)

# Build C files
$(OUT)/$(SRC)/%.o: $(SRC)/%.c
	mkdir -p $(OUT)/$(SRC)
	$(CC) $(LOCAL_CFLAGS) -c -o $@ $&lt;

# Create an aggregated object for this directory
$(OUT)/$(SRC)/built-in.a: $(obj-y)
	mkdir -p $(OUT)/$(SRC)
	$(AR) -rcT $@ $^

all: $(OUT)/$(SRC)/built-in.a
</code></pre></div></div>

<p>Note that since we cannot ‘guess’ if a nested library needs to be rebuilt, we force going into each subdirectory using a fake target. This is the main drawback of this solution, as every single directory of the source tree will be parsed even if no file has changed in the source tree.</p>

<p>The top-level Makefile has only two targets:</p>

<ul>
  <li>one to create the target executable based on the top aggregated library,</li>
  <li>one to create the library by launching the generic Makefile at the top of the source tree.</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: $(OUT)/foo

$(OUT)/foo: $(OUT)/built-in.a
        $(CC) -o $@ $^

$(OUT)/built-in.a: .FORCE
        mkdir -p $(OUT)
        $(MAKE) -C $(SRC) -f $(CURDIR)/Makefile.kbuild \
                SRC=. \
                OUT=$(OUT)

.FORCE:
</code></pre></div></div>

<h2 id="non-recursive-makefile">Non recursive Makefile</h2>

<p>The second solution is inspired by the principles of Peter Miller’s paper, <em>Recursive Make Considered Harmful</em>.</p>

<p>It uses the same Makefile fragments, but instead of recursively launching Make on subdirectories, it recursively includes the fragments.</p>

<p>The whole process is implemented using a recursive <a href="https://www.gnu.org/software/make/manual/html_node/Eval-Function.html">GNU Make template</a>.</p>

<p>For performance reasons, we use a single parameterized generic rule to build the objects in the source tree.</p>

<p>During the evaluation of each subdirectory, we gather object files in a global variable, and customize the generic build rule by defining the value
of the CFLAGS for each object in the subdirectory.</p>

<blockquote>
  <p>I first designed a variant that created a rule for each subdirectory, but its performance degraded exponentially with the number of directories.</p>
</blockquote>

<p>At the end of the Makefile, we use a single <code class="language-plaintext highlighter-rouge">foreach</code> instruction to include dependency files based on the list of objects.</p>

<blockquote>
  <p>I also tried to include these during the subdirectories evaluation, but it performed worse</p>
</blockquote>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># These are actually passed to us, but provide default values for easier reuse
OUT ?= build
SRC ?= src

# We parse each subdirectory to gather object files
OBJS :=

# Sub-directory parsing function
define parse_subdir

# Reset sub-Makefile variables as a precaution
obj-y :=
cflags-y :=

# Include sub-Makefile
include $(1)/Makefile

# Isolate objects from subdirectories and prefix them with the output directory
_objs := $$(addprefix $(OUT)/$(1)/,$$(sort $$(filter-out %/, $$(obj-y))))

# Define a specific CFLAGS for objects in this subdir
$$(_objs): SUBDIR_CFLAGS := -MD $$(CFLAGS) $$(cflags-y)

# Add subdir objects to global list
OBJS += $$(_objs)

# Isolate subdirs from objects and prefix them with source directory
_subdirs := $$(addprefix $(1)/,$$(sort $$(patsubst %/,%,$$(filter %/, $$(obj-y)))))

# Recursively parse subdirs
$$(foreach subdir,$$(_subdirs), $$(eval $$(call parse_subdir,$$(subdir))))

endef

# Start parsing subdirectories at the root of the source tree
$(eval $(call parse_subdir,$(SRC)))

# Generic rule to compile C files
$(OUT)/%.o: %.c
        mkdir -p $(dir $@)
        $(CC) $(SUBDIR_CFLAGS) -c -o $@ $&lt;

# Include GCC dependency files for each source file
$(foreach obj,$(OBJS),$(eval -include $(obj:%.o=%.d)))

</code></pre></div></div>

<p>The top-level Makefile just includes the “subdirectories” Makefile.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: $(OUT)/foo

include $(CURDIR)/Makefile.subdir

$(OUT)/foo: $(OBJS)
        $(CC) -o $@ $^
</code></pre></div></div>

<blockquote>
  <p>It could be a single Makefile, but I found it neater to keep the “generic” template in a separate file.</p>
</blockquote>

<h2 id="custom-generated-makefile">Custom generated Makefile</h2>

<p>As a variant to the previous solution, I tried to parse the Makefile fragments only once to generate a Makefile in the output directory, which is then used to build the target.</p>

<p>This uses basically the same template as the previous solution: the only difference is that the list of objects and custom CFLAGS per directory are evaluated in memory AND written to the actual Makefile.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># These are actually passed to us, but provide default values for easier reuse
OUT ?= build
SRC ?= src

# The only goal of this Makefile is to generate the actual Makefile
all: $(OUT)/Makefile

...


$(OUT)/Makefile::
        @echo "Generating $@"
        @echo "all: $(OUT)/foo" &gt;&gt; $@

# Sub-directory parsing function
define parse_subdir

...

# Include sub-Makefile
include $(1)/Makefile

...

# Define a specific CFLAGS for objects in this subdir
$(1)_CFLAGS := -MD $$(CFLAGS) $$(cflags-y)
# Insert the corresponding goal modifier in the target Makefile
$(OUT)/Makefile::
        echo "$(OUT)/$(1)/%.o: LOCAL_CFLAGS=$$($(1)_CFLAGS)" &gt;&gt; $$@

...

endef

# Start parsing subdirectories at the root of the source tree
$(eval $(call parse_subdir,$(SRC)))

# Finalize target Makefile inserting generic C compilation rule and GCC
# dependencies for each source file
$(OUT)/Makefile::
        echo "OBJS:= $(OBJS)" &gt;&gt; $@
        echo "$(OUT)/%.o: %.c" &gt;&gt; $@
        echo '  mkdir -p $$(dir $$@)' &gt;&gt; $@
        echo '  $$(CC) $$(LOCAL_CFLAGS) -c -o $$@ $$&lt;' &gt;&gt; $@
        echo "" &gt;&gt; $@
        $(foreach obj,$(OBJS),echo "-include $(obj:%.o=%.d)" &gt;&gt; $@;)
        @echo "$(OUT)/foo: $(OBJS)" &gt;&gt; $@
        @echo ' $$(CC) -o $$@ $$^' &gt;&gt; $@
        @echo "Done $@"
</code></pre></div></div>

<p>The top-level Makefile includes the generated Makefile and provides a rule to generate it: GNU Make takes care of the rest.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>all: $(OUT)/foo

$(OUT)/foo: $(OUT)/Makefile .FORCE
        $(MAKE) -C $(OUT)

FRAGMENTS := \
        $(shell find $(SRC) -name Makefile -cnewer $(OUT)/Makefile 2&gt;/dev/null)

$(OUT)/Makefile: $(FRAGMENTS)
        mkdir -p $(OUT)
        $(MAKE) -C $(SRC) -f $(CURDIR)/Makefile.gen \
                SRC=$(SRC) \
                OUT=$(OUT)

.FORCE:
</code></pre></div></div>

<blockquote>
  <p>Note the trick used to make sure the Makefile is properly regenerated: since Make has difficulty coping with a large number of dependencies, we use the shell to identify the fragments that have changed.</p>
</blockquote>

<h2 id="cmake">CMake</h2>

<p>CMake is a Makefile generator. I added a CMake solution to compare it with the previous custom generated Makefile.</p>

<p>Two issues I had to solve were:</p>

<ul>
  <li>how to recursively select sources for the final target</li>
  <li>how to express different CFLAGS for a directory</li>
</ul>

<p>The simple solution I found was to use CMake subdirectories and to define a static library in each one of them.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(output_src_1 STATIC foo.c)
ADD_SUBDIRECTORY(1)
TARGET_LINK_LIBRARIES(output_src_1 output_src_1_1)
...
TARGET_LINK_LIBRARIES(output_src_1 output_src_1_9)
ADD_SUBDIRECTORY(10)
TARGET_LINK_LIBRARIES(output_src_1 output_src_1_10)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -D'CURDIR=output/src/1'")
</code></pre></div></div>

<blockquote>
  <p>It seems to lead CMake to create a recursive Makefile. It would be interesting to try a different approach, using <code class="language-plaintext highlighter-rouge">include</code> to gather fragments and per-source properties to set the CFLAGS.</p>
</blockquote>

<p>The top-level Makefile has two rules: one to build the generated Makefile, the other one to create the target using it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$(OUT)/foo: $(OUT)/Makefile .FORCE
        $(MAKE) -C $(OUT)

-include $(OUT)/Makefile

$(OUT)/Makefile:
        mkdir -p $(OUT)
        cd $(OUT) &amp;&amp; cmake -Wno-dev $(SRC)

.FORCE:
</code></pre></div></div>

<blockquote>
  <p>Note that the generated Makefile will automatically detect changes made to the Makefile fragments and regenerate the target Makefile, thanks to CMake’s built-in checks.</p>
</blockquote>

<h2 id="boilermake">Boilermake</h2>

<p><a href="https://github.com/dmoulding/boilermake">Boilermake</a> is an awesome generic non-recursive Make template. I included it in order to compare it to my own non-recursive solution.</p>

<h2 id="cninja-cmake--ninja">Cninja (CMake + Ninja)</h2>

<p>CMake is able to generate <a href="https://ninja-build.org/">Ninja</a> files, so I only had to adapt my CMake-based solution to compare the generated GNU Make build with the generated Ninja build.</p>

<p>One issue I had with Ninja is that it doesn’t cope well with large command lines.</p>

<p>There is an ugly fix that was introduced to address that for the WebKit project.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set(CMAKE_NINJA_FORCE_RESPONSE_FILE 1)
</code></pre></div></div>

<blockquote>
  <p>Guys, gotcha: when are you gonna fix this?</p>
</blockquote>

<h2 id="ninja">Ninja</h2>

<p>The CMake generated Ninja build performance was awesome for incremental builds, but not so good for full builds as soon as the number of files increased.</p>

<p>I suspected it might come from the way CMake generated the Ninja file, especially with the intermediate libraries I had to declare.</p>

<p>Having received similar feedback from the Ninja mailing-list, I wrote a small parser in Python to generate the <code class="language-plaintext highlighter-rouge">build.ninja</code> file directly from the Makefile fragments.</p>

<p>Here is how the generated file looks:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rule cc
  deps = gcc
  depfile = $out.d
  command = cc -MD -MF $out.d $cflags -c $in -o $out
rule ld
  command = cc @$out.rsp -o $out
  rspfile = $out.rsp
  rspfile_content = $in
build /home/david/dev/make-benchmark/output/src/main.o: cc /home/david/dev/make-benchmark/output/src/main.c
  cflags = -D'CURDIR=output/src'
build /home/david/dev/make-benchmark/output/src/foo.o: cc /home/david/dev/make-benchmark/output/src/foo.c
  cflags = -D'CURDIR=output/src'
build /home/david/dev/make-benchmark/output/src/1/foo.o: cc /home/david/dev/make-benchmark/output/src/1/foo.c
  cflags = -D'CURDIR=output/src/1'

...

build foo : ld /home/david/dev/make-benchmark/output/src/main.o ...

</code></pre></div></div>

<p>The results are indeed much better, as you will see in the next section.</p>

<h1 id="the-raw-results">The raw results</h1>

<p>I ran the benchmark on an Intel Core i7 with 16 GB RAM and an SSD drive.</p>

<p>All build times are in seconds.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$make --version
GNU Make 3.81
$cmake --version
cmake version 2.8.12.2
$ninja --version
1.3.4
</code></pre></div></div>

<p>Tree = 2 levels, 10 subdirectories per level (12 .c files)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|               | kbuild | nrecur | static | cmake | b/make | cninja | ninja |
|---------------|--------|--------|--------|-------|--------|--------|-------|
| cold start    |  0.08  |  0.06  |  0.08  | 0.55  |  0.08  |  0.36  | 0.08  |
| full rebuild  |  0.06  |  0.06  |  0.06  | 0.23  |  0.07  |  0.04  | 0.06  |
| rebuild leaf  |  0.04  |  0.03  |  0.03  | 0.16  |  0.04  |  0.05  | 0.02  |
| nothing to do |  0.01  |  0.00  |  0.00  | 0.06  |  0.01  |  0.00  | 0.00  |
</code></pre></div></div>

<p>Tree = 3 levels, 10 subdirectories per level (112 .c files)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|               | kbuild | nrecur | static | cmake | b/make | cninja | ninja |
|---------------|--------|--------|--------|-------|--------|--------|-------|
| cold start    |  0.47  |  0.45  |  0.51  | 1.84  |  0.52  |  0.91  | 0.53  |
| full rebuild  |  0.48  |  0.46  |  0.44  | 1.34  |  0.54  |  0.39  | 0.32  |
| rebuild leaf  |  0.10  |  0.09  |  0.09  | 0.46  |  0.11  |  0.07  | 0.05  |
| nothing to do |  0.06  |  0.05  |  0.06  | 0.40  |  0.07  |  0.00  | 0.01  |
</code></pre></div></div>

<p>Tree = 4 levels, 10 subdirectories per level (1112 .c files)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|               | kbuild | nrecur | static | cmake | b/make | cninja | ninja |
|---------------|--------|--------|--------|-------|--------|--------|-------|
| cold start    |  4.62  |  4.57  |  5.78  | 16.72 |  5.48  |  7.50  |  4.00 |
| full rebuild  |  4.85  |  4.57  |  4.78  | 15.12 |  5.56  |  6.39  |  3.90 |
| rebuild leaf  |  0.98  |  0.86  |  1.04  |  4.47 |  1.07  |  0.28  |  0.21 |
| nothing to do |  0.53  |  0.67  |  0.82  |  4.44 |  0.88  |  0.05  |  0.03 |
</code></pre></div></div>

<p>Tree = 5 levels, 10 subdirectories per level (11112 .c files)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|               | kbuild | nrecur | static | cmake  | b/make | cninja | ninja |
|---------------|--------|--------|--------|--------|--------|--------|-------|
| cold start    |  59.01 |  54.07 | 118.00 | 509.96 |  72.41 | 175.58 | 46.98 |
| full rebuild  |  63.41 |  61.38 | 103.95 | 376.40 |  80.17 | 101.76 | 46.66 |
| rebuild leaf  |  10.86 |  17.18 |  59.03 | 215.44 |  20.19 |   2.81 |  2.28 |
| nothing to do |   5.13 |  14.95 |  56.87 | 220.49 |  17.78 |   0.47 |  0.03 |
</code></pre></div></div>

<h1 id="my-two-cents">My two cents</h1>

<p>From the results above, I conclude that:</p>

<ul>
  <li>for my use case, and with my hardware (I suspect SSD is a huge bonus for recursive Make), non-recursive and recursive Makefiles are equivalent,</li>
  <li>my generated Makefile is completely suboptimal (would need to investigate),</li>
  <li>CMake generated Makefiles are pretty darn slow …</li>
  <li>As long as you don’t generate the <code class="language-plaintext highlighter-rouge">build.ninja</code> with CMake, Ninja is faster than any Make based solution, especially when only a few files have changed.</li>
</ul>


          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Decentralized modules declarations in C using ELF sections
          ]]>
      </title>
      <link>http://www.kaizou.org/2016/08/decentralized-modules-c-elf-sections.html</link>
      <pubDate>Wed, 17 Aug 2016 16:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2016/08/decentralized-modules-c-elf-sections</guid>
      <description>
          <![CDATA[
          <p>In modular programming, a standard practice is to define common interfaces allowing the same type of operation to be performed on a set of otherwise independent modules.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
modules = [a,b,...]

for each m in modules:
    m.foo
    m.bar

</code></pre></div></div>

<p>To implement this pattern, two mechanisms are required:</p>

<ul>
  <li>instantiation, to allow each module to define an ‘instance’ of the common interface,</li>
  <li>registration, to allow each module to ‘provide’ this instance to other modules.</li>
</ul>

<p>Instantiation is typically supported natively in high-level languages.</p>

<p>Registration is more difficult and usually requires specific code to be written, or relying on external frameworks.</p>

<p>Let’s see how these two mechanisms can be implemented for C programs.</p>

<!--more-->

<blockquote>
  <p>Note: the code snippets in this post can be browsed on the following github <a href="https://github.com/kaizouman/c_modules_section_sample">repo</a></p>
</blockquote>

<h2 id="interface-instantiation">Interface instantiation</h2>

<p>In C programs, interface instantiation is implemented using function pointers: basically, the common interface is specified using a struct whose members are the functions that need to be implemented.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.h:

struct module {
	void (*foo)(void);
	void (*bar)(void);
};

a.c:

#include "module.h"

static void foo_a(void)
{

}

static void bar_a(void)
{

}

struct module module_a = {
	.foo = foo_a,
	.bar = bar_a
};

b.c:

#include "module.h"

static void foo_b(void)
{

}

static void bar_b(void)
{

}

struct module module_b = {
	.foo = foo_b,
	.bar = bar_b
};
</code></pre></div></div>

<h2 id="interface-registration">Interface registration</h2>

<p>The goal here is to allow client code to be able to ‘find’ the interface instances provided by the modules.</p>

<p>The first question we need to address is whether we register interfaces statically at design time or dynamically at runtime.</p>

<p>Some systems like Linux provide mechanisms for special ‘constructor’ functions to be called at program initialization. We could take advantage of that feature to allow each module to register its interfaces: see a full example <a href="https://github.com/idjelic/lttng2lxt">here</a>.</p>

<p>In this article, I assume that we are on a system without such a capability, and that, as a consequence, we can only rely on static registration.</p>

<blockquote>
  <p>Note that static registration is also more efficient, and always desirable on devices with limited hardware resources.</p>
</blockquote>

<p>A first solution for static registration of modules is to give the client code a direct access to the interface instances, by exposing them in public headers.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.h:

struct module {
	void (*foo)();
	void (*bar)();
};

extern struct module module_a;
extern struct module module_b;

foo.c

#include "module.h"

void foo()
{
    module_a.foo();
    module_b.foo();
}

bar.c

#include "module.h"

void bar()
{
    module_a.bar();
    module_b.bar();
}

</code></pre></div></div>

<p>This works, but it is not quite satisfactory: as more modules are added to the program, the client code needs to be modified.</p>

<p>A better solution would be to store the instances anonymously in a static array:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.h:

struct module {
	void (*foo)(void);
	void (*bar)(void);
};

extern struct module *modules[];

extern int modules_size;

module.c:

#include "module.h"

extern struct module module_a;
extern struct module module_b;

struct module *modules[2] = {
	&amp;module_a,
	&amp;module_b
};

int modules_size = 2;

foo.c

#include "module.h"

void foo()
{
	int i;
	for (i = 0; i &lt; modules_size; i++) {
		modules[i]-&gt;foo();
	}
}

bar.c

#include "module.h"

void bar()
{
	int i;
	for (i = 0; i &lt; modules_size; i++) {
		modules[i]-&gt;bar();
	}
}

</code></pre></div></div>

<p>This is quite neat, as we will only need to modify the <code class="language-plaintext highlighter-rouge">module.c</code> file when a new module is added.</p>

<p>This could be even better though: what if we could add modules without editing any other files?</p>

<h2 id="taking-advantage-of-elf-sections-to-create-decentralized-module-tables">Taking advantage of ELF sections to create decentralized module tables</h2>

<p>The only reason why we need to edit the <code class="language-plaintext highlighter-rouge">module.c</code> file is because we need to add new entries to the global modules static array.</p>

<p>The array in itself is just a bunch of pointers written one after the other in a contiguous memory space: what if we could find a way to populate it directly from the modules themselves?</p>

<p>This cannot be achieved by either the preprocessor or the compiler, as they process compilation units atomically (when a file is processed, the compiler has no knowledge of the other files it has compiled or will compile in the future).</p>

<p>The linker, however, has knowledge of all symbols declared in the program, and is even capable of grouping them according to section definitions, as long as we specify them in a custom linker script.</p>

<p>We can take advantage of that to make sure that all references to the interface instances are stored in the same section, and define the modules array as being the start address of the section.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.lds:

SECTIONS
{
	.modules : {
		modules_start = .;
		*(.modules)
		modules_end = .;
	}
}
INSERT AFTER .rodata;

module.h:

struct module {
	void (*foo)();
	void (*bar)();
};

extern const struct module modules_start[];
extern const struct module modules_end[];

Makefile:

OBJS := main.o a.o b.o foo.o bar.o

modules: $(OBJS)
	gcc -o $@ -T module.lds $(OBJS)

</code></pre></div></div>

<p>What we do here is extend the generic linker script with a <code class="language-plaintext highlighter-rouge">.modules</code> section.
We also insert two labels at the beginning and end of the section that can be accessed from the C code.</p>

<p>In the <code class="language-plaintext highlighter-rouge">module.h</code> file, we use these labels to declare external references to the start and end of the section.</p>

<blockquote>
  <p>Note that the external references have to be declared as arrays, and not pointers, to make sure the compiler correctly maps the address to the memory region containing the modules: had we declared them as pointers, the compiler would have mapped the beginning of the memory region to a pointer, then dereferenced it to get access to the modules.
You can refer to <a href="http://eli.thegreenplace.net/2009/10/21/are-pointers-and-arrays-equivalent-in-c">this post</a> for a really good explanation of the differences between arrays and pointers.</p>
</blockquote>

<p>The modules have to be slightly modified, to make sure they assign their interfaces to the new section:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
a.c:

...
struct module __attribute__ ((section (".modules"))) module_a = {
	.foo = foo_a,
	.bar = bar_a
};

b.c:

...
struct module __attribute__ ((section (".modules"))) module_b = {
	.foo = foo_b,
	.bar = bar_b
};

</code></pre></div></div>

<p>The syntax is quite ugly, so you would probably hide it inside a preprocessor macro in the <code class="language-plaintext highlighter-rouge">module.h</code> file.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#define DECLARE_MODULE(name, ...) \
    struct module __attribute__ ((section (".modules"))) name = { __VA_ARGS__ };
</code></pre></div></div>

<p>Now we just have to access the global array from the client code using the variables defining its boundaries:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
foo.c:

#include "module.h"

void foo()
{
	const struct module *m = modules_start;
	while (m &lt; modules_end) {
		m-&gt;foo();
		m++;
	}
}

bar.c:

#include "module.h"

void bar()
{
	const struct module *m = modules_start;
	while (m &lt; modules_end) {
		m-&gt;bar();
		m++;
	}
}

</code></pre></div></div>

<p>What we have now is a modules framework that can be extended without modifying its core. The module registration being static, it is very efficient both in terms of RAM and CPU consumption.</p>

<h2 id="pitfalls-with-interface-sections">Pitfalls with interface sections</h2>

<p>There are a few things that you need to be aware of when using this framework.</p>

<p>First, you need to make sure that the linker aligns the modules the same way the compiler would: otherwise, when going through the table, the iteration may drift and access the wrong data.</p>

<p>This is usually taken care of by enforcing alignment in the linker script:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.lds:

SECTIONS
{
	.modules ALIGN(8) : {
		modules_start = .;
		*(.modules)
		modules_end = .;
	}
}
INSERT AFTER .rodata;
</code></pre></div></div>

<p>Second, depending on your link configuration, your modules section may be optimized out, as the linker has no way of knowing that it is actually used.</p>

<p>In particular, the <code class="language-plaintext highlighter-rouge">--gc-sections</code> option will definitely make your table disappear.</p>

<p>The workaround is to explicitly tell the linker that it should keep these symbols:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
module.lds:

SECTIONS
{
	.modules : {
		modules_start = .;
		KEEP(*(.modules))
		modules_end = .;
	}
}
INSERT AFTER .rodata;
</code></pre></div></div>

<p>Last, if some of your modules are distributed as static libraries, the linker may also optimize out the corresponding symbols when linking the whole binary.</p>

<p>The workaround in that case is to prevent optimization by using the linker <code class="language-plaintext highlighter-rouge">--whole-archive</code> option.</p>


          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Better understanding Linux secondary dependencies solving with examples
          ]]>
      </title>
      <link>http://www.kaizou.org/2015/01/linux-libraries.html</link>
      <pubDate>Thu, 08 Jan 2015 14:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2015/01/linux-libraries</guid>
      <description>
          <![CDATA[
<p>A few months ago I stumbled upon a linking problem with secondary dependencies that I couldn’t solve without <a href="https://wiki.mageia.org/en/Overlinking_issues_in_packaging"><strong>overlinking</strong></a> the corresponding libraries.</p>

<p>I only realized today in a discussion with my friend <a href="http://ymorin.is-a-geek.org/">Yann E. Morin</a> that not only did I use the wrong solution for that particular problem, but that my understanding of the gcc linking process was not as good as I had imagined.</p>

<p>This blog post is to summarize what I have now understood.</p>

<p>There is also a <a href="https://github.com/kaizouman/linux-shlib-link-samples">small repository on github</a> with the mentioned samples.</p>

<!--more-->

<h1 id="a-few-words-about-linux-libraries">A few words about Linux libraries</h1>

<p>This paragraph is only a brief summary of what is very well described in <a href="http://tldp.org/HOWTO/Program-Library-HOWTO/introduction.html">The Linux Documentation Project library howto</a>.</p>

<p>Man pages for the Linux <a href="http://linux.die.net/man/1/ld">linker</a> and <a href="http://linux.die.net/man/8/ld-linux">loader</a> are also a good source of information.</p>

<p>There are three kinds of libraries in Linux: static, shared and dynamically loaded (DL).</p>

<p>Dynamically loaded libraries are very specific to some use cases like plugins, and would deserve an article on their own. I will only focus here on static and shared libraries.</p>

<h2 id="static-libraries">Static libraries</h2>

<p>A static library is simply an archive of object files conventionally starting with the <code class="language-plaintext highlighter-rouge">lib</code> prefix and ending with the <code class="language-plaintext highlighter-rouge">.a</code> suffix.</p>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>libfoobar.a
</code></pre></div></div>

<p>Static libraries are created using the <strong>ar</strong> program:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ar rcs libfoobar.a foo.o bar.o
</code></pre></div></div>

<p>Linking a program with a static library is as simple as adding it to the link command either directly with its full path:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c /path/to/foobar/libfoobar.a
</code></pre></div></div>

<p>or indirectly using <a href="http://linux.die.net/man/1/ld">the <code class="language-plaintext highlighter-rouge">-l</code>/<code class="language-plaintext highlighter-rouge">-L</code> options</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -lfoobar -L/path/to/foobar
</code></pre></div></div>

<h2 id="shared-libraries">Shared libraries</h2>

<p>A shared library is an <strong>ELF</strong> object loaded by programs when they start.</p>

<p>Shared libraries follow the same naming conventions as static libraries, but with the <code class="language-plaintext highlighter-rouge">.so</code> suffix instead of <code class="language-plaintext highlighter-rouge">.a</code>.</p>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>libfoobar.so
</code></pre></div></div>

<p>Shared library objects need to be compiled with the <code class="language-plaintext highlighter-rouge">-fPIC</code> option that produces position-independent code, i.e. code that can be relocated in memory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -fPIC -c foo.c
$ gcc -fPIC -c bar.c
</code></pre></div></div>

<p>The <strong>gcc</strong> command to create a shared library is similar to the one used to create a program, with the addition of the <code class="language-plaintext highlighter-rouge">-shared</code> option.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -o libfoobar.so foo.o bar.o
</code></pre></div></div>

<p>Linking against a shared library is achieved using the exact same commands as linking against a static library:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c libfoobar.so
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -lfoobar -L/path/to/foobar
</code></pre></div></div>

<h2 id="shared-libraries-and-undefined-symbols">Shared libraries and undefined symbols</h2>

<p>An <strong>ELF</strong> object maintains a table of all the symbols it uses, including symbols belonging to another <strong>ELF</strong> object that are marked as undefined.</p>

<p>At compilation time, the linker will try to <strong>resolve</strong> an undefined symbol by linking it either statically to code included in the overall output <strong>ELF</strong> object or dynamically to code provided by a shared library.</p>

<p>If an undefined symbol is found in a shared library, a <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry is created for that library in the output <strong>ELF</strong> target.</p>

<p>The content of the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> field depends on the link command:</p>

<ul>
  <li>the full path to the library if the library was linked with an absolute path,</li>
  <li>the library name otherwise (or the library <a href="#library-versioning-and-compatibility"><strong>soname</strong></a> if it was defined).</li>
</ul>

<p>You can check the dependencies of an <strong>ELF</strong> object using the <strong>readelf</strong> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d main
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d libbar.so
</code></pre></div></div>

<p>When producing an executable, any symbol that remains undefined after the link raises an error: all dependencies must therefore be available to the linker in order to produce the output binary.</p>

<p>For historical reasons, this behavior is disabled when building a shared library: you need to specify the <code class="language-plaintext highlighter-rouge">--no-undefined</code> (or <code class="language-plaintext highlighter-rouge">-z defs</code>) flag explicitly if you want errors to be raised when an undefined symbol is not resolved.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,--no-undefined -shared -o libbar.so -fPIC bar.c
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -Wl,-zdefs -shared -o libbar.so -fPIC bar.c
</code></pre></div></div>

<blockquote>
  <p>Note that when producing a static library, which is just an archive of object files, no actual ‘linking’ operation is performed, and undefined symbols are kept unchanged.</p>
</blockquote>

<h2 id="library-versioning-and-compatibility">Library versioning and compatibility</h2>

<p>Several versions of the same library can coexist in the system.</p>

<p>By convention, two versions of the same library will use the same library name with a different version suffix composed of three numbers:</p>

<ul>
  <li>major revision,</li>
  <li>minor revision,</li>
  <li>build revision.</li>
</ul>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>libfoobar.so.1.2.3
</code></pre></div></div>

<p>This is often referred to as the library <strong>real name</strong>.</p>

<p>Also by convention, the library major version should be modified every time the library binary interface (<a href="http://en.wikipedia.org/wiki/Application_binary_interface">ABI</a>) is modified.</p>

<p>Following that convention, an executable compiled with a shared library version is theoretically able to link with another version of the <strong>same major revision</strong>.</p>

<p>This concept is so fundamental for expressing compatibility between programs and shared libraries that each shared library can be associated with a <strong>soname</strong>, which is the library name followed by a period and the major revision:</p>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>libfoobar.so.1
</code></pre></div></div>

<p>The library <strong>soname</strong> is stored in the <code class="language-plaintext highlighter-rouge">DT_SONAME</code> field of the <strong>ELF</strong> shared object.</p>

<p>The <strong>soname</strong> has to be passed as a linker option to <strong>gcc</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -Wl,-soname,libfoobar.so.1 -o libfoobar.so foo.o bar.o
</code></pre></div></div>

<p>As mentioned before, whenever a library defines a <strong>soname</strong>, it is that <strong>soname</strong> that is stored in the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> field of <strong>ELF</strong> objects linked against that library.</p>

<h2 id="solving-versioned-libraries-dependencies-at-build-time">Solving versioned libraries dependencies at build time</h2>

<p>As mentioned before, libraries to be linked against can be specified using a shortened name and a path:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -lfoobar -L/path/to/foobar
</code></pre></div></div>

<p>When installing a library, the installer program will typically create a symbolic link from the library <strong>real name</strong> to its <strong>linker name</strong> (the bare <code class="language-plaintext highlighter-rouge">.so</code> name used with the <code class="language-plaintext highlighter-rouge">-l</code> option) to allow the linker to find the actual library file.</p>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/lib/libfoobar.so -&gt; libfoobar.so.1.5.3
</code></pre></div></div>

<p>The linker uses the following search paths to locate required shared libraries:</p>

<ul>
  <li>directories specified by <code class="language-plaintext highlighter-rouge">-rpath-link</code> options (more on that later)</li>
  <li>directories specified by <code class="language-plaintext highlighter-rouge">-rpath</code> options (more on that later)</li>
  <li>directories specified by the environment variable <code class="language-plaintext highlighter-rouge">LD_RUN_PATH</code></li>
  <li>directories specified by the environment variable <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code></li>
  <li>directories specified in <code class="language-plaintext highlighter-rouge">DT_RUNPATH</code> or <code class="language-plaintext highlighter-rouge">DT_RPATH</code> of a shared library are searched for shared libraries needed by it</li>
  <li>default directories, normally <code class="language-plaintext highlighter-rouge">/lib</code> and <code class="language-plaintext highlighter-rouge">/usr/lib</code></li>
  <li>directories listed in the <code class="language-plaintext highlighter-rouge">/etc/ld.so.conf</code> file</li>
</ul>

<h2 id="solving-versioned-shared-libraries-dependencies-at-runtime">Solving versioned shared libraries dependencies at runtime</h2>

<p>On GNU glibc-based systems, which includes most Linux systems, starting up an <strong>ELF</strong> binary executable automatically causes the program loader to be loaded and run.</p>

<p>On Linux systems, this loader is named <a href="http://linux.die.net/man/8/ld-linux"><code class="language-plaintext highlighter-rouge">/lib/ld-linux.so.X</code></a> (where X is a version number). This loader, in turn, finds and loads recursively all other shared libraries listed in the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> fields of the <strong>ELF</strong> binary.</p>

<p>Please note that if a <strong>soname</strong> was specified for a library when the executable was compiled, the loader will look for the <strong>soname</strong> instead of the library real name. For that reason, installation tools automatically create symbolic links from the library <strong>soname</strong> to its <strong>real name</strong>.</p>

<p><em>Example:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/usr/lib/libfoobar.so.1 -&gt; libfoobar.so.1.5.3
</code></pre></div></div>

<p>When looking for a specific library, if the value stored in the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry doesn’t contain a <code class="language-plaintext highlighter-rouge">/</code>, the loader will look consecutively in:</p>

<ul>
  <li>directories specified at compilation time in the <strong>ELF</strong> object <code class="language-plaintext highlighter-rouge">DT_RPATH</code> (deprecated),</li>
  <li>directories specified using the environment variable <code class="language-plaintext highlighter-rouge">LD_LIBRARY_PATH</code>,</li>
  <li>directories specified at compile time in the <strong>ELF</strong> object <code class="language-plaintext highlighter-rouge">DT_RUNPATH</code>,</li>
  <li>from the cache file <code class="language-plaintext highlighter-rouge">/etc/ld.so.cache</code>, which contains a compiled list of candidate libraries previously found in the augmented library path (can be disabled at compilation time),</li>
  <li>in the default path <code class="language-plaintext highlighter-rouge">/lib</code>, and then <code class="language-plaintext highlighter-rouge">/usr/lib</code> (can be disabled at compilation time).</li>
</ul>

<h1 id="proper-handling-of-secondary-dependencies">Proper handling of secondary dependencies</h1>

<p>As mentioned in the introduction, my issue was related to secondary dependencies, i.e. shared library dependencies that one library exports to a target.</p>

<p>Let’s imagine for instance a program <strong>main</strong> that depends on a library <strong>libbar</strong> that itself depends on a shared library <strong>libfoo</strong>.</p>

<p>We will use either a static <strong>libbar.a</strong> or a shared <strong>libbar.so</strong>.</p>

<p><em>foo.c</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int foo()
{
    return 42;
}
</code></pre></div></div>

<p><em>bar.c</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int foo();

int bar()
{
    return foo();
}
</code></pre></div></div>

<p><em>main.c</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>int bar();

int main(int argc, char** argv)
{
    return bar();
}
</code></pre></div></div>

<h2 id="creating-the-libfooso-shared-library">Creating the libfoo.so shared library</h2>

<p><strong>libfoo</strong> has no dependencies other than the <strong>libc</strong>, so we can create it with the simplest command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -o libfoo.so -fPIC foo.c
</code></pre></div></div>

<h2 id="creating-the-libbara-static-library">Creating the libbar.a static library</h2>

<p>As said before, static libraries are just archives of object files, without any means to declare external dependencies.</p>

<p>In our case, there is therefore no explicit connection whatsoever between libbar.a and libfoo.so.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c bar.c
$ ar rcs libbar.a bar.o
</code></pre></div></div>

<h2 id="creating-the-libbarso-dynamic-library">Creating the libbar.so dynamic library</h2>

<p>The proper way to create the <strong>libbar.so</strong> shared library is to specify explicitly that it depends on <strong>libfoo</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -o libbar.so -fPIC bar.c -lfoo -L$(pwd)
</code></pre></div></div>

<p>This will create the library with a proper <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry for <strong>libfoo</strong>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d libbar.so
Dynamic section at offset 0xe08 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libfoo.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...
</code></pre></div></div>

<p>However, since undefined symbols are not by default resolved when building a shared library, we can also create a “dumb” version without any <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -shared -o libbar_dumb.so -fPIC bar.c
</code></pre></div></div>

<p>Note that it is very unlikely that someone would create such an incomplete library on purpose, but it may happen that by misfortune you encounter one of these beasts in binary form and still <strong>need</strong> to link against it (yeah, sh… happens!).</p>

<h2 id="linking-against-the-libbara-static-library">Linking against the libbar.a static library</h2>

<p>As mentioned before, when linking an executable, the linker must resolve all undefined symbols before producing the output binary.</p>

<p>Trying to link only with <strong>libbar.a</strong> produces an error, since it has an undefined symbol and the linker has no clue where to find it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app_s main.c libbar.a
libbar.a(bar.o): In function `bar':
bar.c:(.text+0xa): undefined reference to `foo'
collect2: error: ld returned 1 exit status
</code></pre></div></div>

<p>Adding <strong>libfoo.so</strong> to the link command solves the problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c libbar.a -L$(pwd) -lfoo
</code></pre></div></div>

<p>You can verify that the <strong>app</strong> binary now explicitly depends on <strong>libfoo</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d app
Dynamic section at offset 0xe18 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libfoo.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...
</code></pre></div></div>

<p>At run-time, the dynamic linker will look for <strong>libfoo.so</strong>, so unless you have installed it in standard directories (<code class="language-plaintext highlighter-rouge">/lib</code> or <code class="language-plaintext highlighter-rouge">/usr/lib</code>) you need to tell it where it is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LD_LIBRARY_PATH=$(pwd) ./app
</code></pre></div></div>

<p>To summarize, when linking an executable against a static library, you need to specify explicitly all dependencies towards shared libraries introduced by the static library on the link command.</p>

<blockquote>
  <p>Note however that expressing, discovering and adding implicit static libraries dependencies is typically a feature of your build system (<strong>autotools</strong>, <strong>cmake</strong>).</p>
</blockquote>
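<p>With <strong>cmake</strong> for instance, declaring the dependency as <code class="language-plaintext highlighter-rouge">PUBLIC</code> on the static library is enough for consumers to inherit it; a sketch with hypothetical target names:</p>

```cmake
# Sketch (hypothetical targets): PUBLIC link requirements on the static
# library propagate to every target that links against it.
add_library(bar STATIC bar.c)
target_link_libraries(bar PUBLIC foo)    # bar needs foo

add_executable(app main.c)
target_link_libraries(app PRIVATE bar)   # foo is added to app's link line automatically
```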

<h2 id="linking-against-the-libbarso-shared-library">Linking against the libbar.so shared library</h2>

<p>As specified in the <a href="http://linux.die.net/man/1/ld">linker documentation</a>, when the linker encounters an input shared library it processes all its <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entries as secondary dependencies:</p>

<ul>
  <li>if the linker output is a shared relocatable <strong>ELF</strong> object (i.e. a shared library), and the <code class="language-plaintext highlighter-rouge">--copy-dt-needed-entries</code> option is set (this is the legacy behavior), it will add all <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entries from the input library as new <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entries in the output,</li>
  <li>if the linker output is a shared relocatable <strong>ELF</strong> object (i.e. a shared library), and the <code class="language-plaintext highlighter-rouge">--no-copy-dt-needed-entries</code> option is set (this is the new default behavior for binutils, following <a href="http://fedoraproject.org/wiki/UnderstandingDSOLinkChange">a move initiated by major distros like Fedora</a>), it will simply ignore all <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entries from the input library,</li>
  <li>if the linker output is a non-shared, non-relocatable link (our case), it will automatically add the libraries listed in the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entries of the input library to the link command line, producing an error if it can’t locate them.</li>
</ul>

<p>So, let’s see what happens when dealing with our two shared libraries.</p>

<h3 id="linking-against-the-dumb-library">Linking against the “dumb” library</h3>

<p>When trying to link an executable against the “dumb” version of <strong>libbar.so</strong>, the linker encounters undefined symbols in the library itself that it cannot resolve, since the library lacks the <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry related to <strong>libfoo</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -L$(pwd) -lbar_dumb
libbar_dumb.so: undefined reference to `foo'
collect2: error: ld returned 1 exit status
</code></pre></div></div>

<p>Let’s see how we can solve this.</p>

<h4 id="adding-explicitly-the-libfooso-dependency">Adding explicitly the libfoo.so dependency</h4>

<p>Just like we did when we linked against the static version, we can just add <strong>libfoo</strong> to the link command to solve the problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -L$(pwd) -lbar_dumb -lfoo
</code></pre></div></div>

<p>It creates an explicit dependency in the <strong>app</strong> binary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d app
Dynamic section at offset 0xe18 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libbar_dumb.so]
 0x0000000000000001 (NEEDED)             Shared library: [libfoo.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...
</code></pre></div></div>

<p>Again, at runtime you may need to tell the dynamic linker where <strong>libfoo.so</strong> is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_LIBRARY_PATH=$(pwd) ./app
</code></pre></div></div>

<p>Note that having an explicit dependency on <strong>libfoo</strong> is not quite right, since our application doesn’t directly use any symbols from <strong>libfoo</strong>. What we’ve just done here is called <a href="https://wiki.mageia.org/en/Overlinking_issues_in_packaging"><strong>overlinking</strong></a>, and it is <strong>BAD</strong>.</p>

<p>Let’s imagine for instance that in the future we provide a newer version of <strong>libbar</strong> with the same <strong>ABI</strong>, but based on a new version of <strong>libfoo</strong> with a different <strong>ABI</strong>. We should theoretically be able to use that new version of <strong>libbar</strong> without recompiling our application, but what would really happen is that the dynamic linker would try to load the two versions of <strong>libfoo</strong> at the same time, leading to unpredictable results. We would therefore need to recompile our application even though it is still compatible with the newest <strong>libbar</strong>.</p>

<blockquote>
  <p>As a matter of fact, this <a href="https://lists.debian.org/debian-devel-announce/2005/11/msg00016.html">actually happened in the past</a>: a libfreetype update in the debian distro caused 583 packages to be recompiled, with only 178 of them actually using it.</p>
</blockquote>

<h4 id="ignoring-libfoo-dependency">Ignoring libfoo dependency</h4>

<p>There is another option you can use when dealing with the “dumb” library: tell the linker to ignore its undefined symbols altogether:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -L$(pwd) -lbar_dumb -Wl,--allow-shlib-undefined
</code></pre></div></div>

<p>This will produce a binary that doesn’t declare its hidden dependencies towards <strong>libfoo</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d app
Dynamic section at offset 0xe18 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libbar_dumb.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...
</code></pre></div></div>

<p>This isn’t without consequences at runtime though, since the dynamic linker is now unable to resolve the executable dependencies:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_LIBRARY_PATH=$(pwd) ./app
./app: symbol lookup error: ./libbar_dumb.so: undefined symbol: foo
</code></pre></div></div>

<p>Your only option is then to load <strong>libfoo</strong> explicitly (yes, this is getting uglier and uglier):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LD_PRELOAD=$(pwd)/libfoo.so LD_LIBRARY_PATH=$(pwd) ./app
</code></pre></div></div>

<h3 id="linking-against-the-correct-library">Linking against the “correct” library</h3>

<h4 id="doing-it-the-right-way">Doing it the right way</h4>

<p>As mentioned before, when linking against the correct shared library, the linker encounters the <strong>libfoo.so</strong> <code class="language-plaintext highlighter-rouge">DT_NEEDED</code> entry, adds it to the link command and finds it at the path specified by <code class="language-plaintext highlighter-rouge">-L</code>, thus solving the undefined symbols … or at least that is what I expected:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -L$(pwd) -lbar
/usr/bin/ld: warning: libfoo.so, needed by libbar.so, not found (try using -rpath or -rpath-link)
/home/diec7483/dev/linker-example/libbar.so: undefined reference to `foo'
collect2: error: ld returned 1 exit status
</code></pre></div></div>

<p>Why the error? I thought I had done everything by the book!</p>

<p>Okay, let’s take a look at the <code class="language-plaintext highlighter-rouge">ld</code> man page again, looking at the <code class="language-plaintext highlighter-rouge">-rpath-link</code> option. This says:</p>

<blockquote>
  <p>When using ELF or SunOS, one shared library may require another. This happens when an “ld -shared” link includes a shared library as one of the input files.
When the linker encounters such a dependency when doing a non-shared, non-relocatable link, it will automatically try to locate the required shared library and include it in the link, if it is not included explicitly. In such a case, the -rpath-link option specifies the first set of directories to search. The -rpath-link option may specify a sequence of directory names either by specifying a list of names separated by colons, or by appearing multiple times.</p>
</blockquote>

<p>Ok, this is not crystal-clear, but what it actually means is that when specifying the path for a secondary dependency, you should not use <code class="language-plaintext highlighter-rouge">-L</code> but <code class="language-plaintext highlighter-rouge">-rpath-link</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o app main.c -L$(pwd) -lbar -Wl,-rpath-link=$(pwd)
</code></pre></div></div>

<p>You can now verify that <strong>app</strong> depends only on <strong>libbar</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ readelf -d app
Dynamic section at offset 0xe18 contains 25 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libbar.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
...
</code></pre></div></div>

<p>And this is <strong>finally how things should be done</strong>.</p>

<blockquote>
  <p>You may also use <code class="language-plaintext highlighter-rouge">-rpath</code> instead of <code class="language-plaintext highlighter-rouge">-rpath-link</code>, but in that case the specified path will be stored in the resulting executable, which is not suitable if you plan to relocate your binaries. Tools like <strong>cmake</strong> use <code class="language-plaintext highlighter-rouge">-rpath</code> during the build phase (<code class="language-plaintext highlighter-rouge">make</code>), but remove the specified path from the executable during the installation phase (<code class="language-plaintext highlighter-rouge">make install</code>).</p>
</blockquote>

<h1 id="conclusion">Conclusion</h1>

<p>To summarize, when linking an executable against:</p>

<ul>
  <li>
    <p>a <strong>static</strong> library, you need to specify all dependencies towards other shared libraries this static library depends on explicitly on the link command.</p>
  </li>
  <li>
    <p>a <strong>shared</strong> library, you don’t need to specify dependencies towards other shared libraries this shared library depends on, but you may need to specify the path to these libraries on the link command using the <code class="language-plaintext highlighter-rouge">-rpath</code>/<code class="language-plaintext highlighter-rouge">-rpath-link</code> options.</p>
  </li>
</ul>

<blockquote>
  <p>Note however that expressing, discovering and adding implicit libraries dependencies is typically a feature of your build system (<strong>autotools</strong>, <strong>cmake</strong>), as demonstrated in my samples.</p>
</blockquote>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          Unit testing with GoogleTest and CMake
          ]]>
      </title>
      <link>http://www.kaizou.org/2014/11/gtest-cmake.html</link>
      <pubDate>Wed, 05 Nov 2014 22:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2014/11/gtest-cmake</guid>
      <description>
          <![CDATA[
          <p>Continuous integration requires a robust test environment to be able to detect regressions as early as possible.</p>

<p>A typical test environment is composed of integration tests for the whole system and unit tests for each component.</p>

<p>This post explains how to create unit tests for a <code class="language-plaintext highlighter-rouge">C++</code> component using <a href="https://github.com/google/googletest"><strong>GoogleTest</strong></a> and <a href="http://www.cmake.org/"><strong>CMake</strong></a>.</p>

<!--more-->

<h2 id="project-structure">Project structure</h2>

<p>I will assume here that the project structure follows the model described in a <a href="/2014/11/typical-cmake-project/">previous post</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-- CMakeLists.txt
+-- main
|    +-- CMakeLists
|    +-- main.cpp
|
+-- test
|    +-- CMakeLists.txt
|    +-- testfoo
|       +-- CMakeLists.txt
|       +-- main.cpp
|       +-- testfoo.h
|       +-- testfoo.cpp
|       +-- mockbar.h
|
+-- libfoo
|    +-- CMakeLists.txt
|    +-- foo.h
|    +-- foo.cpp
|
+-- libbar
     +-- CMakeLists.txt
     +-- bar.h
     +-- bar.cpp

</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">main</code> subdirectory contains the main project target, an executable providing the super-useful <code class="language-plaintext highlighter-rouge">libfoo</code> service using the awesome <code class="language-plaintext highlighter-rouge">libbar</code> backend (for example <code class="language-plaintext highlighter-rouge">libfoo</code> could be a generic face recognition library and <code class="language-plaintext highlighter-rouge">libbar</code> a GPU-based image processing library).</p>

<p>The <code class="language-plaintext highlighter-rouge">test</code> directory contains a single executable allowing to test the <code class="language-plaintext highlighter-rouge">libfoo</code> service using a <em>mock</em> version of <code class="language-plaintext highlighter-rouge">libbar</code>.</p>

<blockquote>
  <p>From <a href="http://en.wikipedia.org/wiki/Mock_object">Wikipedia</a>: In object-oriented programming, mock objects are simulated objects that mimic the behavior of real objects in controlled ways.</p>
</blockquote>

<p>For those interested, the code for this sample project is on <a href="https://github.com/kaizouman/gtest-cmake-example">github</a>.</p>

<h2 id="a-closer-look-at-the-test-directory">A closer look at the test directory</h2>

<p>In my simplistic example, there is only one subdirectory under <code class="language-plaintext highlighter-rouge">test</code>, but in a typical project, it would contain several subdirectories, one for each test program.</p>

<p>Tests programs are based on Google’s <a href="https://github.com/google/googletest/blob/master/googletest/docs/Primer.md">Googletest</a> framework and its <a href="https://github.com/google/googletest/blob/master/googlemock/README.md">GoogleMock</a> extension.</p>

<p>Since all test programs will be using these packages, the root <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> file should contain all directives required to resolve the corresponding dependencies. This is where things get a bit hairy, since Google <a href="https://github.com/google/googletest/blob/master/googletest/docs/FAQ.md">does not recommend installing these packages in binary form</a>, but instead recommends recompiling them with your project.</p>

<h3 id="resolving-googletest-and-googlemock-dependencies">Resolving GoogleTest and GoogleMock dependencies</h3>

<p>There are at least three options to integrate your project with <strong>GoogleTest</strong> and <strong>GoogleMock</strong>.</p>

<h4 id="having-both-packages-integrated-in-your-build-system">Having both packages integrated in your build system</h4>

<p>Obviously, this is only an option if you actually <em>do</em> have a build system, but if you do, this would be my recommendation.</p>

<p>Depending on how your buildsystem is structured, your mileage may vary, but in the end you should be able to declare <strong>GoogleTest</strong> and <strong>GoogleMock</strong> as dependencies using <code class="language-plaintext highlighter-rouge">CMake</code> functions like the built-in <code class="language-plaintext highlighter-rouge">find_package</code> or the <code class="language-plaintext highlighter-rouge">pkg-config</code> based <code class="language-plaintext highlighter-rouge">pkg_check_modules</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find_package(PkgConfig)
pkg_check_modules(GTEST REQUIRED gtest&gt;=1.7.0)
pkg_check_modules(GMOCK REQUIRED gmock&gt;=1.7.0)

include_directories(
    ${GTEST_INCLUDE_DIRS}
    ${GMOCK_INCLUDE_DIRS}
)
</code></pre></div></div>

<h4 id="add-both-packages-sources-to-your-project">Add both packages sources to your project</h4>

<p>Adding the <strong>GoogleTest</strong> and <strong>GoogleMock</strong> sources as subdirectories of <code class="language-plaintext highlighter-rouge">test</code> would allow you to compile them as part of your project.</p>

<p>This is however really ugly, and I wouldn’t recommend doing that …</p>

<h4 id="add-both-packages-as-external-cmake-projects">Add both packages as external CMake projects</h4>

<p>According to various answers posted on <a href="http://stackoverflow.com/questions/9689183/cmake-googletest">StackOverflow</a>, this seems to be the recommended way of resolving <strong>GoogleTest</strong> and <strong>GoogleMock</strong> dependencies on a per project basis.</p>

<p>It takes advantage of the <code class="language-plaintext highlighter-rouge">CMake</code> <code class="language-plaintext highlighter-rouge">ExternalProject</code> module to fetch <strong>GoogleTest</strong> and <strong>GoogleMock</strong> sources from the internet and compile them as third-party dependencies in your project.</p>

<p>Below is a working example, with a few comments explaining what’s going on:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># We need thread support
find_package(Threads REQUIRED)

# Enable ExternalProject CMake module
include(ExternalProject)

# Download and install GoogleTest
ExternalProject_Add(
    gtest
    URL https://github.com/google/googletest/archive/master.zip
    PREFIX ${CMAKE_CURRENT_BINARY_DIR}/gtest
    # Disable install step
    INSTALL_COMMAND ""
)

# Get GTest source and binary directories from CMake project
ExternalProject_Get_Property(gtest source_dir binary_dir)

# Create a libgtest target to be used as a dependency by test programs
add_library(libgtest IMPORTED STATIC GLOBAL)
add_dependencies(libgtest gtest)

# Set libgtest properties
set_target_properties(libgtest PROPERTIES
    "IMPORTED_LOCATION" "${binary_dir}/googlemock/gtest/libgtest.a"
    "IMPORTED_LINK_INTERFACE_LIBRARIES" "${CMAKE_THREAD_LIBS_INIT}"
)

# Create a libgmock target to be used as a dependency by test programs
add_library(libgmock IMPORTED STATIC GLOBAL)
add_dependencies(libgmock gtest)

# Set libgmock properties
set_target_properties(libgmock PROPERTIES
    "IMPORTED_LOCATION" "${binary_dir}/googlemock/libgmock.a"
    "IMPORTED_LINK_INTERFACE_LIBRARIES" "${CMAKE_THREAD_LIBS_INIT}"
)

# I couldn't make it work with INTERFACE_INCLUDE_DIRECTORIES
include_directories("${source_dir}/googletest/include"
                    "${source_dir}/googlemock/include")
</code></pre></div></div>

<blockquote>
  <p>Note: It should theoretically be possible to set the <strong>GoogleTest</strong> and <strong>GoogleMock</strong> include directories as target properties using INTERFACE_INCLUDE_DIRECTORIES, but this fails because these directories don’t exist yet when the imported targets are declared. As a workaround, I had to specify them explicitly using include_directories.</p>
</blockquote>
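<p>For the record, a workaround sometimes suggested (which I haven’t verified here) is to create the include directories at configure time, so that CMake accepts them as <code class="language-plaintext highlighter-rouge">INTERFACE_INCLUDE_DIRECTORIES</code> on the imported targets:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create the directories before declaring them, so that CMake
# does not complain that they don't exist yet
file(MAKE_DIRECTORY "${source_dir}/googletest/include")
file(MAKE_DIRECTORY "${source_dir}/googlemock/include")

set_target_properties(libgtest PROPERTIES
    "INTERFACE_INCLUDE_DIRECTORIES" "${source_dir}/googletest/include")
set_target_properties(libgmock PROPERTIES
    "INTERFACE_INCLUDE_DIRECTORIES" "${source_dir}/googlemock/include")
</code></pre></div></div>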

<h3 id="writing-a-testfoo-test-program-for-libfoo">Writing a <strong>testfoo</strong> test program for <strong>libfoo</strong></h3>

<p>The <strong>testfoo</strong> program depends on <strong>libfoo</strong>, <strong>GoogleTest</strong> and <strong>GoogleMock</strong>.</p>

<p>Here is what the <strong>testfoo</strong> <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> file would look like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file(GLOB SRCS *.cpp)

add_executable(testfoo ${SRCS})

target_link_libraries(testfoo
    libfoo
    libgtest
    libgmock
)

install(TARGETS testfoo DESTINATION bin)
</code></pre></div></div>

<p>The libraries required for the build are listed under <code class="language-plaintext highlighter-rouge">target_link_libraries</code>.
CMake will then add the appropriate include directories and link options.</p>

<p>The <strong>testfoo</strong> program will provide unit tests for the <code class="language-plaintext highlighter-rouge">Foo</code> class of the <strong>libfoo</strong> library defined below.</p>

<h4 id="fooh"><code class="language-plaintext highlighter-rouge">foo.h</code></h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Bar;

class Foo
{
public:
    Foo(const Bar&amp; bar);
    bool baz(bool useQux);
protected:
    const Bar&amp; m_bar;
};
</code></pre></div></div>

<h4 id="foocpp"><code class="language-plaintext highlighter-rouge">foo.cpp</code></h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "bar.h"
#include "foo.h"

Foo::Foo(const Bar&amp; bar)
 : m_bar(bar) {}

bool Foo::baz(bool useQux) {
    if (useQux) {
        return m_bar.qux();
    } else {
        return m_bar.norf();
    }
}
</code></pre></div></div>

<p>The sample Test program described in the <a href="http://code.google.com/p/googletest/wiki/Primer">GoogleTest Documentation</a> fits in a single file, but I prefer splitting the Unit Tests code in three types of files.</p>

<h3 id="maincpp"><code class="language-plaintext highlighter-rouge">main.cpp</code></h3>

<p>The <code class="language-plaintext highlighter-rouge">main.cpp</code> file will contain only the test program <code class="language-plaintext highlighter-rouge">main</code> function.
This is where you will put the generic <strong>GoogleTest</strong> macro invocation that launches the tests, plus any initializations that need to happen in <code class="language-plaintext highlighter-rouge">main</code> (none in this particular case).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "gtest/gtest.h"

int main(int argc, char **argv)
{
    ::testing::InitGoogleTest(&amp;argc, argv);
    int ret = RUN_ALL_TESTS();
    return ret;
}
</code></pre></div></div>

<h3 id="testfooh"><code class="language-plaintext highlighter-rouge">testfoo.h</code></h3>

<p>This file contains the declaration of the <code class="language-plaintext highlighter-rouge">FooTest</code> class, which is the test fixture for the <code class="language-plaintext highlighter-rouge">Foo</code> class.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "gtest/gtest.h"
#include "mockbar.h"

// The fixture for testing class Foo.
class FooTest : public ::testing::Test {

protected:

    // You can do set-up work for each test here.
    FooTest();

    // You can do clean-up work that doesn't throw exceptions here.
    virtual ~FooTest();

    // If the constructor and destructor are not enough for setting up
    // and cleaning up each test, you can define the following methods:

    // Code here will be called immediately after the constructor (right
    // before each test).
    virtual void SetUp();

    // Code here will be called immediately after each test (right
    // before the destructor).
    virtual void TearDown();

    // The mock bar library shared by all tests
    MockBar m_bar;
};
</code></pre></div></div>

<h3 id="mockbarh"><code class="language-plaintext highlighter-rouge">mockbar.h</code></h3>

<p>Assuming the <strong>libbar</strong> library implements a public <code class="language-plaintext highlighter-rouge">Bar</code> interface, we use <strong>GoogleMock</strong> to provide a fake implementation for test purposes only:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "bar.h"
#include "gmock/gmock.h"

class MockBar: public Bar
{
public:
    MOCK_METHOD0(qux, bool());
    MOCK_METHOD0(norf, bool());
};
</code></pre></div></div>

<p>This will allow us to inject controlled values into the <strong>libfoo</strong> library when it invokes the <code class="language-plaintext highlighter-rouge">Bar</code> class methods.</p>

<blockquote>
  <p>Please refer to the <a href="http://code.google.com/p/googlemock/wiki/V1_7_ForDummies">GoogleMock documentation</a> for a detailed description of the <code class="language-plaintext highlighter-rouge">GoogleMock</code> features.</p>
</blockquote>

<h3 id="testfoocpp"><code class="language-plaintext highlighter-rouge">testfoo.cpp</code></h3>

<p>This file contains the implementation of the <code class="language-plaintext highlighter-rouge">FooTest</code> fixture class.</p>

<p>This is where the actual tests are written.</p>

<p>We will test the output of the <code class="language-plaintext highlighter-rouge">Foo::baz()</code> method, first having default values for the <code class="language-plaintext highlighter-rouge">Bar::qux()</code> and <code class="language-plaintext highlighter-rouge">Bar::norf()</code> methods returned by our mock, then overriding the value returned by <code class="language-plaintext highlighter-rouge">Bar::norf()</code> with a value specific to our test.</p>

<p>In all test cases, we use <strong>GoogleTest</strong> expectations to verify the output of the <code class="language-plaintext highlighter-rouge">Foo::baz</code> method.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "foo.h"
#include "mockbar.h"
#include "testfoo.h"

using ::testing::Return;

FooTest::FooTest()
{
    // Have qux return true by default
    ON_CALL(m_bar,qux()).WillByDefault(Return(true));
    // Have norf return false by default
    ON_CALL(m_bar,norf()).WillByDefault(Return(false));
}

FooTest::~FooTest() {}

void FooTest::SetUp() {}

void FooTest::TearDown() {}

TEST_F(FooTest, ByDefaultBazTrueIsTrue) {
    Foo foo(m_bar);
    EXPECT_EQ(foo.baz(true), true);
}

TEST_F(FooTest, ByDefaultBazFalseIsFalse) {
    Foo foo(m_bar);
    EXPECT_EQ(foo.baz(false), false);
}

TEST_F(FooTest, SometimesBazFalseIsTrue) {
    Foo foo(m_bar);
    // Have norf return true for once
    EXPECT_CALL(m_bar,norf()).WillOnce(Return(true));
    EXPECT_EQ(foo.baz(false), true);
}

</code></pre></div></div>

<blockquote>
  <p>Please refer to the <a href="http://code.google.com/p/googletest/wiki/Primer">GoogleTest documentation</a> for a much more detailed presentation of how to create unit tests with GoogleTest.</p>
</blockquote>

<h2 id="building-tests">Building tests</h2>

<p>As usual, it is recommended to build your program out-of-tree, i.e. in a directory separate from the sources.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir build
cd build
</code></pre></div></div>

<p>First, you need to invoke the <code class="language-plaintext highlighter-rouge">cmake</code> command to generate the build files.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake ..
</code></pre></div></div>

<p>This should produce an output similar to this one:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The C compiler identification is GNU 4.8.2
-- The CXX compiler identification is GNU 4.8.2
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Configuring done
-- Generating done
-- Build files have been written to: ~/gtest-cmake-example/build
</code></pre></div></div>

<p>Then, build the project targets.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make
</code></pre></div></div>

<p>The following output corresponds to the case where <strong>GoogleTest</strong> and <strong>GoogleMock</strong> are automatically fetched from their repositories and built as third-party dependencies.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Scanning dependencies of target libfoo
[  7%] Building CXX object libfoo/CMakeFiles/libfoo.dir/foo.cpp.o
Linking CXX static library liblibfoo.a
[  7%] Built target libfoo
Scanning dependencies of target libbar
[ 15%] Building CXX object libbar/CMakeFiles/libbar.dir/bar.cpp.o
Linking CXX static library liblibbar.a
[ 15%] Built target libbar
Scanning dependencies of target myApp
[ 23%] Building CXX object main/CMakeFiles/myApp.dir/main.cpp.o
Linking CXX executable myApp
[ 23%] Built target myApp
Scanning dependencies of target gtest
[ 30%] Creating directories for 'gtest'
[ 38%] Performing download step (download, verify and extract) for 'gtest'
-- downloading...
     src='https://github.com/google/googletest/archive/master.zip'
     dst='/home/david/perso/gtest-cmake-example/build/test/gtest/src/master.zip'
     timeout='none'
-- [download 0% complete]
-- [download 1% complete]
...
-- [download 99% complete]
-- [download 100% complete]
-- downloading... done
-- verifying file...
     file='/home/david/perso/gtest-cmake-example/build/test/gtest/src/master.zip'
-- verifying file... warning: did not verify file - no URL_HASH specified?
-- extracting...
     src='/home/david/perso/gtest-cmake-example/build/test/gtest/src/master.zip'
     dst='/home/david/perso/gtest-cmake-example/build/test/gtest/src/gtest'
-- extracting... [tar xfz]
-- extracting... [analysis]
-- extracting... [rename]
-- extracting... [clean up]
-- extracting... done
[ 46%] No patch step for 'gtest'
[ 53%] No update step for 'gtest'
[ 61%] Performing configure step for 'gtest'
-- The C compiler identification is GNU 4.8.4
-- The CXX compiler identification is GNU 4.8.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found PythonInterp: /usr/bin/python (found version "2.7.6")
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home/david/perso/gtest-cmake-example/build/test/gtest/src/gtest-build
[ 69%] Performing build step for 'gtest'
Scanning dependencies of target gmock
[ 14%] Building CXX object googlemock/CMakeFiles/gmock.dir/__/googletest/src/gtest-all.cc.o
[ 28%] Building CXX object googlemock/CMakeFiles/gmock.dir/src/gmock-all.cc.o
Linking CXX static library libgmock.a
[ 28%] Built target gmock
Scanning dependencies of target gmock_main
[ 42%] Building CXX object googlemock/CMakeFiles/gmock_main.dir/__/googletest/src/gtest-all.cc.o
[ 57%] Building CXX object googlemock/CMakeFiles/gmock_main.dir/src/gmock-all.cc.o
[ 71%] Building CXX object googlemock/CMakeFiles/gmock_main.dir/src/gmock_main.cc.o
Linking CXX static library libgmock_main.a
[ 71%] Built target gmock_main
Scanning dependencies of target gtest
[ 85%] Building CXX object googlemock/gtest/CMakeFiles/gtest.dir/src/gtest-all.cc.o
Linking CXX static library libgtest.a
[ 85%] Built target gtest
Scanning dependencies of target gtest_main
[100%] Building CXX object googlemock/gtest/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
Linking CXX static library libgtest_main.a
[100%] Built target gtest_main
[ 76%] No install step for 'gtest'
[ 84%] Completed 'gtest'
[ 84%] Built target gtest
Scanning dependencies of target testfoo
[ 92%] Building CXX object test/testfoo/CMakeFiles/testfoo.dir/main.cpp.o
[100%] Building CXX object test/testfoo/CMakeFiles/testfoo.dir/testfoo.cpp.o
Linking CXX executable testfoo
[100%] Built target testfoo
</code></pre></div></div>

<h2 id="running-tests">Running tests</h2>

<p>Once the test programs have been built, you can run them individually …</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>test/testfoo/testfoo
</code></pre></div></div>

<p>… producing a detailed output …</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from FooTest
[ RUN      ] FooTest.ByDefaultBazTrueIsTrue

GMOCK WARNING:
Uninteresting mock function call - taking default action specified at:
~/gtest-cmake-example/test/testfoo/testfoo.cpp:8:
    Function call: qux()
          Returns: true
Stack trace:
[       OK ] FooTest.ByDefaultBazTrueIsTrue (0 ms)
[ RUN      ] FooTest.ByDefaultBazFalseIsFalse

GMOCK WARNING:
Uninteresting mock function call - taking default action specified at:
~/gtest-cmake-example/test/testfoo/testfoo.cpp:10:
    Function call: norf()
          Returns: false
Stack trace:
[       OK ] FooTest.ByDefaultBazFalseIsFalse (0 ms)
[ RUN      ] FooTest.SometimesBazFalseIsTrue
[       OK ] FooTest.SometimesBazFalseIsTrue (0 ms)
[----------] 3 tests from FooTest (0 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test case ran. (0 ms total)
[  PASSED  ] 3 tests.
</code></pre></div></div>

<blockquote>
  <p>Note: You can get rid of <strong>GoogleMock</strong> warnings by using a <a href="https://github.com/google/googletest/blob/master/googlemock/docs/CheatSheet.md"><strong>nice</strong> <strong>mock</strong></a>.</p>
</blockquote>
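<p>As a sketch (reusing the <code class="language-plaintext highlighter-rouge">FooTest</code> fixture above), wrapping the mock member in <code class="language-plaintext highlighter-rouge">::testing::NiceMock</code> is enough to silence those warnings:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "gmock/gmock.h"
#include "mockbar.h"

class FooTest : public ::testing::Test {
protected:
    // NiceMock ignores uninteresting calls, but keeps the ON_CALL
    // default actions defined in the fixture constructor
    ::testing::NiceMock&lt;MockBar&gt; m_bar;
};
</code></pre></div></div>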

<p>… or globally through <code class="language-plaintext highlighter-rouge">CTest</code> …</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make test
</code></pre></div></div>

<p>… producing only a test summary.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Running tests...
Test project ~/gtest-cmake-example/build
    Start 1: testfoo
1/1 Test #1: testfoo ..........................   Passed    0.00 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.00 sec
</code></pre></div></div>

          ]]>
      </description>
    </item>
    
    <item>
      <title>
          <![CDATA[
          A typical Linux project using CMake
          ]]>
      </title>
      <link>http://www.kaizou.org/2014/11/typical-cmake-project.html</link>
      <pubDate>Mon, 03 Nov 2014 22:00:00 +0000</pubDate>
      <author>kaizouman@kaizou.org (David Corvoysier)</author>
      <guid>http://www.kaizou.org/2014/11/typical-cmake-project</guid>
      <description>
          <![CDATA[
<p>When it comes to choosing a make system on Linux, you basically have only two options: autotools or CMake. I have always found Autotools a bit counter-intuitive, but was reluctant to make the effort to switch to CMake because I was worried the learning curve would be too steep for a task you don’t have to perform that often (I mean, you usually spend more time writing code than writing build rules).</p>

<p>A recent project of mine required writing a lot of new Linux packages, and I decided it was a good time to give CMake a try. This article is about how I have used it to build plain old Linux packages almost effortlessly.</p>

<!--more-->

<p>Although CMake is fairly well documented, I personally found the documentation (and especially the tutorial) a bit too CMake-oriented, pushing me to use CMake-dedicated tools for tasks I already had tools for (tests and delivery, for instance).</p>

<p>This is therefore my own tutorial to CMake, based on my primary requirement: just generate the makefiles using CMake, and use my own tools for everything else.</p>

<h2 id="project-structure">Project structure</h2>

<p>The project structure is partly driven by the project design, but it would usually contain at least two common sub-directories, along with several “module” sub-directories:</p>

<pre class="diagram">
project
.
+-. main
+-. test
+-. moduleA
+-. moduleB
</pre>

<p>The <code class="language-plaintext highlighter-rouge">main</code> subdirectory contains the main project target, typically an executable.</p>

<p>The <code class="language-plaintext highlighter-rouge">test</code> directory contains one or more test executables.</p>

<p>The <code class="language-plaintext highlighter-rouge">moduleX</code> directories contain libraries to be used by either the tests or main executables.</p>

<p>At the root of the project, the main <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> should contain the common CMake directives that apply to all subdirectories.</p>

<p>First, the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> would specify a minimum CMake version, name your project and define a few common behaviours.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CMAKE_MINIMUM_REQUIRED(VERSION 2.8)

PROJECT(MyProject)

SET(CMAKE_INCLUDE_CURRENT_DIR ON)
</code></pre></div></div>

<p>Here, I only set one option, but it is of the utmost importance if you want to build out-of-tree AND generate some of your source files automatically (which you almost certainly do if you are using any modern framework like Qt). It adds <code class="language-plaintext highlighter-rouge">${CMAKE_CURRENT_SOURCE_DIR}</code> and, more importantly, <code class="language-plaintext highlighter-rouge">${CMAKE_CURRENT_BINARY_DIR}</code> to the include path, allowing generated include files to be found by the compiler.</p>
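<p>As a hypothetical illustration, with this option on, a header generated into the build tree can be included as if it sat next to the sources:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># config.h is generated into ${CMAKE_CURRENT_BINARY_DIR}
CONFIGURE_FILE(config.h.in config.h)
# With CMAKE_INCLUDE_CURRENT_DIR set to ON, sources can simply do
# #include "config.h", even when building out-of-tree
</code></pre></div></div>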

<p>Finally, the <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> would list all subdirectories to be included in the project:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_SUBDIRECTORY(main)
ADD_SUBDIRECTORY(test)
ADD_SUBDIRECTORY(moduleA)
ADD_SUBDIRECTORY(moduleB)
...
</code></pre></div></div>

<h2 id="configuring-modules">Configuring Modules</h2>

<p>As explained in the previous paragraph, each subdirectory would contain at least either one executable or one library defined in a dedicated <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> file.</p>

<p>Executables are declared using the <a href="http://www.cmake.org/cmake/help/v3.0/command/add_executable.html#command:add_executable"><code class="language-plaintext highlighter-rouge">ADD_EXECUTABLE</code></a> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_EXECUTABLE(myapp
    ${MY_SRCS}
)
</code></pre></div></div>

<p>Libraries are declared using the <a href="http://www.cmake.org/cmake/help/v3.0/command/add_library.html#command:add_library"><code class="language-plaintext highlighter-rouge">ADD_LIBRARY</code></a> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(mylib STATIC
    ${MY_SRCS}
)
</code></pre></div></div>

<p>Source files are specified either explicitly or using a wildcard:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET(MY_SRC
    fileA.cpp
    fileB.cpp
    ...
)
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>file(GLOB MY_SRC
    "*.h"
    "*.cpp"
)
</code></pre></div></div>

<blockquote>
  <p>Note that using a wildcard, you need to rerun CMake if you add more files to a module</p>
</blockquote>

<h2 id="solving-dependencies-between-modules">Solving dependencies between modules</h2>

<h3 id="link-dependencies">Link dependencies</h3>

<p>Link dependencies between modules are solved using the <a href="http://www.cmake.org/cmake/help/v3.0/command/target_link_libraries.html"><code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code></a> command.</p>

<p>CMake maintains throughout the whole project a named object for each target created by a command such as <code class="language-plaintext highlighter-rouge">ADD_EXECUTABLE()</code> or <code class="language-plaintext highlighter-rouge">ADD_LIBRARY()</code>.</p>

<p>This target name can be passed to the <a href="http://www.cmake.org/cmake/help/v3.0/command/target_link_libraries.html"><code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code></a> command to tell CMake that an object A depends on an object B.</p>

<p>Example:</p>

<p>Given a library defined in a specific subdirectory</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(mylib STATIC
    ${MY_LIBSRCS}
)
</code></pre></div></div>

<p>One can specify a dependency from an application to that library</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_EXECUTABLE(myapp
    ${MY_APPSRCS}
)

TARGET_LINK_LIBRARIES(myapp
    mylib
)
</code></pre></div></div>

<h3 id="include-dependencies">Include dependencies</h3>

<p>Include dependencies are automatically solved for dependent libraries declared in the <a href="http://www.cmake.org/cmake/help/v3.0/command/target_link_libraries.html"><code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code></a> command if the corresponding libraries have properly declared their include directories using the <a href="http://www.cmake.org/cmake/help/v3.0/command/target_include_directories.html"><code class="language-plaintext highlighter-rouge">TARGET_INCLUDE_DIRECTORIES</code></a> command.</p>

<p>Example:</p>

<p>Given a library defined in a specific subdirectory</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(mylib STATIC
    ${MY_LIBSRCS}
)
</code></pre></div></div>

<p>Specifying</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TARGET_INCLUDE_DIRECTORIES(mylib PUBLIC
    /path/to/includes
)
</code></pre></div></div>

<p>allows a dependent app to be aware of the mylib include path simply by adding the library to its <code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code> command:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_EXECUTABLE(myapp
    ${MY_APPSRCS}
)

TARGET_LINK_LIBRARIES(myapp
    mylib
)
</code></pre></div></div>

<p>Additional include dependencies can be solved explicitly using the <a href="http://www.cmake.org/cmake/help/v3.0/command/include_directories.html"><code class="language-plaintext highlighter-rouge">INCLUDE_DIRECTORIES</code></a> command, but most of the time, you won’t need it unless you have nested sub-directories that don’t have a <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> of their own (as a matter of fact, needing to add an explicit <code class="language-plaintext highlighter-rouge">INCLUDE_DIRECTORIES</code> may be a good hint that something is wrong with your other directives).</p>

<h2 id="resolving-dependencies-towards-external-packages">Resolving Dependencies towards external packages</h2>

<h3 id="packages-known-by-cmake">Packages known by CMake</h3>

<p>CMake provides a set of tools to register and retrieve information about packages stored in a CMake package registry.</p>

<p>Dependencies on CMake packages are easily solved by declaring them using the built-in CMake <code class="language-plaintext highlighter-rouge">FIND_PACKAGE</code> command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FIND_PACKAGE(Qt5Core)
</code></pre></div></div>

<p>This command will create a CMake target Qt5::Core that can be referenced in <a href="http://www.cmake.org/cmake/help/v3.0/command/target_link_libraries.html"><code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code></a> commands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(mylib STATIC
    ${MY_LIBSRCS}
)

TARGET_LINK_LIBRARIES(mylib
    Qt5::Core
)
</code></pre></div></div>

<blockquote>
  <p>Note: The <code class="language-plaintext highlighter-rouge">FIND_PACKAGE</code> command will also export <a href="http://qt-project.org/doc/qt-5/cmake-manual.html#variable-reference">several related variables</a>.</p>
</blockquote>

<p>Just like when referencing an internal module, the paths to the specific includes of libraries found using <code class="language-plaintext highlighter-rouge">FIND_PACKAGE</code> are automatically added to the include search path. There is therefore no need to add them explicitly using an <code class="language-plaintext highlighter-rouge">INCLUDE_DIRECTORIES</code> directive.</p>

<h3 id="other-packages-pkg-config">Other packages: pkg-config</h3>

<p>For packages whose definition is not maintained in CMake (i.e. there is no FIND_PACKAGE module written for them), you may rely on the generic pkg-config tool instead.</p>

<p><a href="http://www.freedesktop.org/wiki/Software/pkg-config/">pkg-config</a> is a helper tool used when compiling applications and libraries. It helps you insert the correct compiler options on the command line, so an application can for instance be built with <code class="language-plaintext highlighter-rouge">gcc -o test test.c $(pkg-config --libs --cflags glib-2.0)</code> rather than hard-coding where to find glib (or other libraries). It is language-agnostic, so it can also be used to define the location of documentation tools, for instance.</p>

<p>pkg-config compatible packages declare their include path, compiler options and linking flags in dedicated <code class="language-plaintext highlighter-rouge">.pc</code> files installed on the system.</p>

<p>Here is for instance the <code class="language-plaintext highlighter-rouge">glib-2.0</code> pkg-config file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prefix=/usr
exec_prefix=${prefix}
libdir=${prefix}/lib/x86_64-linux-gnu
includedir=${prefix}/include

glib_genmarshal=glib-genmarshal
gobject_query=gobject-query
glib_mkenums=glib-mkenums

Name: GLib
Description: C Utility Library
Version: 2.36.0
Requires.private: libpcre
Libs: -L${libdir} -lglib-2.0 
Libs.private: -pthread  -lpcre    
Cflags: -I${includedir}/glib-2.0 -I${libdir}/glib-2.0/include
</code></pre></div></div>

<p>Before using pkg-config, you need to make sure the tool is available by inserting the following line in your <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FIND_PACKAGE(PkgConfig)
</code></pre></div></div>

<p>Then, insert the following <a href="http://www.cmake.org/cmake/help/v3.0/module/FindPkgConfig.html">PKG_CHECK_MODULES</a> command in your <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code> file to tell CMake to resolve pkg-config dependencies for a specific package:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PKG_CHECK_MODULES(GLIB2 REQUIRED glib-2.0&gt;=2.36.0)
</code></pre></div></div>

<p>The command will export several variables, including <code class="language-plaintext highlighter-rouge">XXX_LIBRARIES</code>, which can be used in <a href="http://www.cmake.org/cmake/help/v3.0/command/target_link_libraries.html"><code class="language-plaintext highlighter-rouge">TARGET_LINK_LIBRARIES</code></a> commands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ADD_LIBRARY(mylib STATIC
    ${MY_LIBSRCS}
)

TARGET_LINK_LIBRARIES(mylib
    ${GLIB2_LIBRARIES}
)
</code></pre></div></div>

<p>Unfortunately, I was unable to get the include paths of libraries found through pkg-config added automatically to the compiler include paths, as happens when using the standard <code class="language-plaintext highlighter-rouge">FIND_PACKAGE</code> function, so I had to add them explicitly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INCLUDE_DIRECTORIES(
    ${GLIB2_INCLUDE_DIRS}
)
</code></pre></div></div>

<h2 id="exporting-dependencies-towards-external-packages">Exporting dependencies towards external packages</h2>

<p>Although CMake supports its <a href="http://www.cmake.org/Wiki/CMake:How_To_Find_Libraries">own mechanism to export dependencies</a>, it is recommended to take advantage of the more generic pkg-config files.</p>

<p>CMake doesn’t provide any specific mechanism to generate <code class="language-plaintext highlighter-rouge">.pc</code> files.</p>

<p>However, one can take advantage of CMake variable substitution to generate a specific pkg-config file from a predefined template.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CONFIGURE_FILE(
  "${CMAKE_CURRENT_SOURCE_DIR}/pkg-config.pc.cmake"
  "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}.pc"
)
</code></pre></div></div>

<p>A typical <code class="language-plaintext highlighter-rouge">.pc</code> template could be:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Name: ${PROJECT_NAME}
Description: ${PROJECT_DESCRIPTION}
Version: ${PROJECT_VERSION}
Requires: ${PKG_CONFIG_REQUIRES}
prefix=${CMAKE_INSTALL_PREFIX}
includedir=${PKG_CONFIG_INCLUDEDIR}
libdir=${PKG_CONFIG_LIBDIR}
Libs: ${PKG_CONFIG_LIBS}
Cflags: ${PKG_CONFIG_CFLAGS}
</code></pre></div></div>

<p>Where the following variables are provided by CMake:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PROJECT_NAME</code></li>
  <li><code class="language-plaintext highlighter-rouge">PROJECT_DESCRIPTION</code></li>
  <li><code class="language-plaintext highlighter-rouge">PROJECT_VERSION</code></li>
  <li><code class="language-plaintext highlighter-rouge">CMAKE_INSTALL_PREFIX</code></li>
</ul>

<p>And these ones need to be specified explicitly:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PKG_CONFIG_REQUIRES</code></li>
  <li><code class="language-plaintext highlighter-rouge">PKG_CONFIG_INCLUDEDIR</code></li>
  <li><code class="language-plaintext highlighter-rouge">PKG_CONFIG_LIBDIR</code></li>
  <li><code class="language-plaintext highlighter-rouge">PKG_CONFIG_LIBS</code></li>
  <li><code class="language-plaintext highlighter-rouge">PKG_CONFIG_CFLAGS</code></li>
</ul>

<p>Example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET(PKG_CONFIG_REQUIRES glib-2.0)
SET(PKG_CONFIG_LIBDIR
    "\${prefix}/lib"
)
SET(PKG_CONFIG_INCLUDEDIR
    "\${prefix}/include/mylib"
)
SET(PKG_CONFIG_LIBS
    "-L\${libdir} -lmylib"
)
SET(PKG_CONFIG_CFLAGS
    "-I\${includedir}"
)

CONFIGURE_FILE(
  "${CMAKE_CURRENT_SOURCE_DIR}/pkg-config.pc.cmake"
  "${CMAKE_CURRENT_BINARY_DIR}/${PROJECT_NAME}.pc"
)
</code></pre></div></div>
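<p>For illustration, with the values above and assuming a hypothetical project named <code class="language-plaintext highlighter-rouge">mylib</code> at version <code class="language-plaintext highlighter-rouge">1.0</code>, with a <code class="language-plaintext highlighter-rouge">PROJECT_DESCRIPTION</code> of “My utility library”, installed under the <code class="language-plaintext highlighter-rouge">/usr</code> prefix, the generated <code class="language-plaintext highlighter-rouge">mylib.pc</code> would read:</p>

```
Name: mylib
Description: My utility library
Version: 1.0
Requires: glib-2.0
prefix=/usr
includedir=${prefix}/include/mylib
libdir=${prefix}/lib
Libs: -L${libdir} -lmylib
Cflags: -I${includedir}
```

<p>Note how the escaped <code class="language-plaintext highlighter-rouge">\${prefix}</code> occurrences survive as literal pkg-config variables, while the unescaped CMake variables have been substituted.</p>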

<h2 id="installing-files-on-target">Installing files on target</h2>

<p>Installing files on target is as simple as adding the corresponding <a href="http://www.cmake.org/cmake/help/v3.0/command/install.html"><code class="language-plaintext highlighter-rouge">INSTALL</code></a> command to the target <code class="language-plaintext highlighter-rouge">CMakeLists.txt</code>.</p>

<p>To install the main targets of a project, use the <code class="language-plaintext highlighter-rouge">TARGETS</code> directive:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSTALL(TARGETS myapp
        DESTINATION bin)
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSTALL(TARGETS mylib ARCHIVE
        DESTINATION lib)
</code></pre></div></div>

<blockquote>
  <p>Note: The files will be installed relative to the path specified in the <code class="language-plaintext highlighter-rouge">CMAKE_INSTALL_PREFIX</code> CMake variable, prepended by the <code class="language-plaintext highlighter-rouge">DESTDIR</code> variable passed on the command line (i.e. <code class="language-plaintext highlighter-rouge">make install DESTDIR=/home/toto</code>).</p>
</blockquote>
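<p>The final location is thus a plain concatenation of <code class="language-plaintext highlighter-rouge">DESTDIR</code> and the install prefix, as this shell sketch illustrates (with the same illustrative <code class="language-plaintext highlighter-rouge">/home/toto</code> staging directory as in the note above):</p>

```shell
# The install location is DESTDIR prepended to CMAKE_INSTALL_PREFIX
DESTDIR=/home/toto
CMAKE_INSTALL_PREFIX=/usr/local   # CMake's default prefix

# A target installed with "DESTINATION bin" ends up here:
echo "${DESTDIR}${CMAKE_INSTALL_PREFIX}/bin"
# → /home/toto/usr/local/bin
```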

<p>Other project files can also be installed using the <code class="language-plaintext highlighter-rouge">FILES</code> directive:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSTALL(FILES header.h
        DESTINATION include/mylib)
</code></pre></div></div>

<p>or</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INSTALL(FILES "${CMAKE_BINARY_DIR}/${PROJECT_NAME}.pc"
        DESTINATION lib/pkgconfig)
</code></pre></div></div>

<h2 id="building-the-project">Building the project</h2>

<p>I personally always recommend building a project out-of-tree, i.e. putting all build subproducts into a separate directory. Incidentally, building out-of-tree is also a good way to find out whether your project is properly configured …</p>

<p>So, the first step is to create a build directory:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir build &amp;&amp; cd build
</code></pre></div></div>

<p>Then you need to tell CMake to generate the project makefiles, following any specific directives you may pass on the command line (typically by setting variables).
Most of the time, you can let CMake apply default values:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake ..
</code></pre></div></div>

<p>But you may need, for instance, to specify a custom installation prefix (by default CMake will use <code class="language-plaintext highlighter-rouge">/usr/local</code>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cmake -DCMAKE_INSTALL_PREFIX:PATH=usr ..
</code></pre></div></div>

<p>Once the makefiles have been generated, you can simply build the project using make commands.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make
</code></pre></div></div>

<p>Finally, you can install the targets, either using defaults …</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make install
</code></pre></div></div>

<p>… or specifying the destination directory (CMake uses <code class="language-plaintext highlighter-rouge">/</code> as the default destination directory):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DESTDIR=/custom-destdir make install
</code></pre></div></div>

          ]]>
      </description>
    </item>
    

  </channel> 
</rss>
