Cortex M4 floating point optimisation

Carefully hand-written assembly language routines can bring dramatic savings in the runtime and RAM/ROM budget, for deeply embedded systems. This is especially true on very tiny Wearable devices, where small micro controllers are very often used, running at low frequency, with very limited memory resources.

For example, ARM Cortex M4 has a small by powerful floating point unit, which offers very complex instructions, like MAC’s, Square root calculation, conversions, Near-ZO loops. However, except for trivial floating point operations, the compiler is just unable to use those instructions in the generated code. Several reasons for that:

Cortex M4 has not built-in support for double, but only for float. This means that the following C code

#define COEFF (2.0)
float a = 2.0f;
float b = a * COEFF; // COEFF is double ...

will imply a conversion from float to double (according to the C langage), and therefore a double precision floating point library will be used instead of native code.

C language has no “power” operator (like Fortran) or Sqrt. So in that case, again a software library will be pulled-in.

In one recent project, i had to implement a time-critical routine in a Wearable device, using a Cortex M4 at 14 Mhz … (yes, not GHz !). By creating some very small assembly routines, savings in cycle times by a 100+ fold could be achieved (!), and more than 100% saving in code size !

How to achieve that ?

The key routine was delivered to us by the customer in pseudo-code, as follows:

typedef struct
{
    double value;
    double weight;
} weighted_value;

void nf_update(weighted_value *accumulator, const weighted_value *newValue, uint32_t power)
{
//Calculate new accumulated total
 
double total = (pow(accumulator->value, power) * accumulator->weight) + (pow(newValue->value, power) * newValue->weight);

//Calculate new weight
double weight = accumulator->weight + newValue->weight;

//Update accumulator value and weight
// power can be 2 or 4
accumulator->value = pow(total / weight, 1.0 / power);

accumulator->weight = weight;
}

As delivered, the pseudo-code can be compiled and the generated code will be bloated but functionally correct.

First optimisation

As the pseudo code was using “double” variables, we started challenging that by considering the required precision of the algorithm and the allowed dynamic of float and double types

Single precision is encoded in a 32-bit word, where

1-bit is used by the sign
7-bits for the exponent
24-bits for the mantissa

The exponent is encoded as a biased representation, hence the allowed dynamic is much higher than what the number of bits normally allow, to the cost of a lowered precision.

Double precision is encoded in a 64-bit double word, where

1-bit is used for the sign
10-bits for the exponent
53-bits for the mantissa

The dynamic allowed by these numbers representation are: 1.2 E-38 to 3.4 E38 (single precision) and 2.2 E-308 to 1.8 E308 for the double precision.

After profiling, we saw that the “accumulator” value was growing, in the worst case, to 4.810377E+20 which is largely within the allowed dynamic of single precision floating. Exit double precision 🙂

Second optimization

Looking at the operations involved in our “specification”, we see “power” can take values of 2 or 4. This means we have Exponentiation to the power of 2, 4 and both squared root and fourth root.

Knowing that M4 FP Unit has a one cycle VSQRT instruction (FP Squared Root) and clever MAC operations, we can craft optimized versions for both the pow() and sqrt() library, to be used instead.

Header file for our small library
 /*
 * @author Jacques Samoun
 */
#ifndef _FWK_SQRT_S_H
#define _FWK_SQRT_S_H

#include <stdint.h>
#include <stddef.h>

float norm_f32 (float a, float b, float c);
float norm_u32 (int a, int b, int c);
float abs_f32 (float);
int abs_32 (int);

float pow_s (float value, unsigned int exp);
float root2_s (float arg);
float root2n_s (float arg, unsigned int n);

#endif // _FWK_SQRT_S_H

All these functions can be put in one single small assembly file. I removed the “public” declarations, SECTION and END statements for clarity. This can be found in any Cortex Assembly reference manual.

Use of Near-ZO loop for computing average

By successive calls to VMLA (Multiply Accumulate) and clever register usage, we can compute the SQRT( a**2 + b**2 + c**2) (using ** to denote exponent). Intermediate result is accumulated in S3, and the final sqrt result is in S0. An Unsigned U32 version and a F32 version have been made.

; FP UNIT must be enabled before. The following functions assume this is the case

; float norm (float a, float b, float c)
norm_fuel_f32:
 ; mov arguments to FP registers
 MOV R0, #0
 VMOV.F32 S3, R0 ; weird but mandatory
 VMLA.F32 S3, S0,S0
 VMLA.F32 S3, S1,S1
 VMLA.F32 S3, S2,S2
 VSQRT.F32 S0, S3

BX LR ; return (result in S0)

; float norm_fuel (uint32_t a, uint32_t b, uint32_t c)
; Note: as MLA returns low-32 bits of the 64-bits 32x32 result,
; the result will always be true, whatever the sign (as the difference
; lies only in the upper 32-bits
norm_fuel_u32:
 ; Computation starts here
 MOV R3, #0
 MLA R3, R0, R0, R3
 MLA R3, R1, R1, R3
 MLA R3, R2, R2, R3

; Move value to FP register, so SQRT can take it from there
 VMOV.F32 S1, R3
 VCVT.F32.U32 S1, S1
 VSQRT.F32 S0, S1

BX LR ; return (result in S0)

Using native SQRT and ABS for powerful inliners

Direct leveraging of VSQRT and VABS instructions. This allow 1-2 lines in-liners. By optimizing the register usage (ARM calling convention) we can achieve VERY small functions.

abs_fuel_f32:
   VABS.F32 S0, S0
   BX LR;

abs_fuel_32:
   CMP R0, #0
   IT LT
   RSBLT R0, R0, #0
   BX LR

; float root2_s (float arg)
root2_s 
   VSQRT.F32 S0, S0
   BX LR

Using MAC’s and conditional instructions for more complex operations

pow () routine is implemented by successive calls to the FP Multiply Accumulate and Near-ZO loop. Note that we also leverage the very powerful branch predication (found in the ARM instruction set since the very first ARM instruction set). This allow to embed a condition code in an instruction, which is therefore conditionally executed transparently

; POW
; takes the first argument and raises it to the power given by the second
; argument
; both args are double precision floating values
; postive power

; float pow_s (float value, unsigned int exp);
pow_s:
 ; handle the trivial x^^0 first
 CMP R0, #0
 ITT EQ
 VMOVEQ.F32 S0, #1.0
 BXEQ LR
 
 ; From here, exponent is at least equal to 1
 ; handled this trivial case as well
 
 VMOV.F32 S2, S0
 VMOV.F32 S1, S0 ; return value for the trivial case exp 1
_pow_s$$1
 ; positive argument treatment - exp is at least 2
 CMP R0, #1
 ITTTT HI
 VMULHI.F32 S1,S2,S0
 VMOVHI.F32 S2,S1
 SUBHI R0, #1
 BHI _pow_s$$1
 
 VMOV.F32 S0, S1
 BX LR
 

; ROOT2N
; computes the 2^n root of the argument.
; 2^^n means 1,2,4,...2^^n
; simple calculation using square root multiple times
; This routine is heavily optimised to specific values, upto the 
; caller to check the args first as very little error check is done

; float root2n_s (float arg, unsigned int n)
root2n_s
 CMP R0, #2
 BCC _end

; argument is greater or equal to 2, so (hopefully) 2, 4, .. 2^n
 ; test and loop
 ; Test is still valid
_root2n_s$$1
 CMP R0, #2
 ITTT GE
 VSQRTGE.F32 S0, S0
 MOVGE R0, R0, ASR #1
 BGE _root2n_s$$1
_end
 BX LR

Putting all these together, we have conducted a comparative benchmark (using IAR Workbench IDE and Compiler chain) for the same algorithm, using DP Library (Double Precision), SP (Single Precision) library and our optimized routines … the results are impressive !

	Code Size(byte)	Static RAM (bytes)	runtime(cycles)
Using SPF SW library	17 144	11 028	10 470 (747 us @ 14 MHz)
Using hand optimized assembler routines (Single precision)	11 568	11 028	150 (10.7 us @ 14 MHz)
Using DPF SW library	15 232	11 028	15850 (1.3 ms @ 14 MHz)

To be fair the the Compiler library, our optimized libraries are doing heavy assumptions on the input parameters, as they are tailored to one specific application. This saves some code for parameter checking, which the library HAS TO DO in a conservative way. Moreover, these kind of optimizations are better put in time-critical loops and algorithm parts.

For the rest of the code (state-machines, message passing, ..) a decent compiler and library would do the job perfectly.