Thumb vs AVR Performance

The other day, I was working on an embedded application for driving stepper motors. In order to control my motor drivers, I was using the popular Atmega 328p MCU to speak with the host over serial, calculate paths, and pulse the stepper driver ICs at varying rates to control the speed of the system. Everything was working well, but I was beginning to run up against the limit of what the AVR could do in between the pulses of the motor clock, which at full tilt would request a motor update at 20KHz per motor. Since I’d been experimenting with the STM32 series of ARM based microcontrollers for other applications, I decided to spin a similar motor controller board based on the value line STM32F070 MCU, running at a respectable 48MHz (compared to the 16MHz of the AVR). My naïve assumption was that after porting the lower level parts of the code, I’d be able to deploy to the new MCU and immediately see a dramatic improvement in performance from the higher clock speed.

I was wrong.

When I deployed the code with only a single axis enabled, the MCU wasn’t able to keep up at all - the motor ready interrupt was firing at 20KHz, but the main event loop of the application, even heavily pared down to just the critical motor speed control code, was only able to execute at around 7KHz. What gives! In order to try and figure out where the performance gap is, I started off by writing a very simple benchmark loop for both devices - set a pin high, loop an 8 bit value from 0 to 255, and then clear the pin. Let’s start with the AVR version of the code:

void __attribute__ ((noinline)) test_loop() {
    PORTD |= (1 << DDD1); // Set indicator pin
    // Loop from 0..255. Include a NOP so that the compiler doesn't
    // (rightly) eliminate our useless loop.
    for (uint8_t i = 0; i < 0xFF; i++)
        asm __volatile__("nop");
    PORTD &= ~(1 << DDD1); // Clear indicator pin

AVR Cycle-counting

After compiling with avr-gcc -O2, we can disassemble with objdump to take a look at what’s actually running on our AVR core. As we’d expect, it’s fairly straight forward: the AVR instruction set has single-opcode bit set/clear ops, so the first and last lines become single instructions. The loop then consists of one initialization and three repeated instructions. Comments have been added below to explain each instruction:

00000080 <_Z4loopv>:
  80:   59 9a           sbi     0x0b, 1 ; 11    ; Set indicator pin
  82:   8f ef           ldi     r24, 0xFF       ; Load loop variable (255)
  84:   00 00           nop                     ; Loop body (nop)
  86:   81 50           subi    r24, 0x01       ; Subtract one from loop counter
  88:   e9 f7           brne    .-6             ; Jump back to loop label
  8a:   59 98           cbi     0x0b, 1 ; 11    ; Clear indicator pin
  8c:   08 95           ret                     ; Done

Now, let’s get a read on how fast this can run. I loaded it onto another of the same Atmega3 328p chips, let it run and measured the output signal on the logic analyzer. Each high section of the output waveform was a very consistent 63.88µS. This is all very well and good, but let’s prove to ourselves that this is a correct measurement. Since our cores aren’t doing any fancy prefetching/pipelining, we should be able to reason fairly well about how many cycles we expect assembly code to take. So let’s turn to the AVR instruction set datasheet: according to the tables in here, it seems we can expect the loop instructions subi and nop to execute in a single clock cycle, and the brne instruction to execute in 2 cycles if the condition is true (i != 0). Armed with this, we can check our measured number by calculating the expected clock cycle count:

                   1 // sbi
                   1 // ldi
   255 * (1 + 1 + 2) // nop + subi + brne (taken)
+                  1 // cbi
                1023 // Total instructions

At a clock speed of 16MHz, this should take 63.93µS. This lines up very nicely with our measured number! Now let’s take a look at our surprisingly underperforming ARM core.

ARM Performance test

Like before, we’re going to use a very simple loop. The only difference here is that we’ve replaced our AVR port manipulation logic with the STM32 equivalent. As before, we will compile with the -O2 flag.

void __attribute__ ((noinline)) test_loop() {
    GPIO_BSRR(GPIOB) = GPIO8; // Set indicator pin
    for (uint8_t i = 0; i < 0xFF; i++)
        asm __volatile__("nop");
    GPIO_BSRR(GPIOB) = GPIO8 << 16; // Clear indicator pin

Now, let’s flash that to our ARM core. Since it’s operating at a whopping 48MHz instead of the puny 16MHz of our AVR core, surely we’d expect it to execute our test loop in around (64 / (48/16)) = 21.3µS?

Spoiler alert: it does not

45.19µS! That’s almost half the speed we’d expect with a similarly implemented loop. Let’s take a look at the disassembly of our ARM loop and see what we can see (comments added on right):

080000c0 <_Z4loopv>:
 80000c0:       2280            movs    r2, #128        ; Load 0x80
 80000c2:       4b07            ldr     r3, [pc, #28]   ; Load GPIOB memory address
 80000c4:       0052            lsls    r2, r2, #1      ; Logical shift 0x80 left once to get 0x100
 80000c6:       601a            str     r2, [r3, #0]    ; Store 0x100 into GPIOB memory addr (set indicator)
 80000c8:       23ff            movs    r3, #255        ; Load loop variable
 80000ca:       46c0            nop                     ; Loop body
 80000cc:       3b01            subs    r3, #1          ; Subtract one from loop counter
 80000ce:       b2db            uxtb    r3, r3          ; Extend 8 bit value to 32 bit
 80000d0:       2b00            cmp     r3, #0          ; Compare to 0
 80000d2:       d1fa            bne.n   80000ca <_Z4loopv+0xa> ; Jump back into loop if != 0
 80000d4:       2280            movs    r2, #128        ; Load 0x80
 80000d6:       4b02            ldr     r3, [pc, #8]    ; Load GPIOB memory address
 80000d8:       0452            lsls    r2, r2, #17     ; Left shift again to get bit 8 set
 80000da:       601a            str     r2, [r3, #0]    ; Clear GPIOB pin 8
 80000dc:       4770            bx      lr              ; Return
 80000de:       46c0            nop                     ; (Dead code for alignment)
 80000e0:       48000418        stmdami r0, {r3, r4, sl}; GPIOB memory mapped IO location

The core of our loop seems pretty similar to the AVR: we execute a nop, a subtract, compare and a branch. We also have a sign-extend instruction on our loop counter in r3, which we can eliminate by converting our uint8_t loop counter to a uint32_t. But there are some goings on in the lean-in and lead-out that are a bit confusing - such as why are we loading 0x80 and left shifting it instead of loading 0x100? The answer is that our ARM core isn’t actually running the ARM instruction set - it’s running in Thumb mode.

Thumb mode

One of the downsides of a RISC ISA is that you tend to need more instructions in your binary than you would on a CISC ISA with more operations baked into silicon as a single opcode. For embedded and other space-sensitive applications, having lots of 32-bit ARM instructions can waste precious program ROM. To get around this, ARM created the Thumb instruction set, with (mostly) 16-bit instructions. So long as you’re mostly using common instructions, this can easily close to double your instruction density! But like all things, there are tradeoffs. One visible here is that the movs Thumb instruction only supports immediate values in the range [0..255], so loading our bit constant (1 << 8) cannot be done as a single movs - we have to load (1 << 7) and then shift it once more once it’s loaded to a register. The alternative for loading constants is what’s been done by GCC for the memory-mapped IO region for port B, which is to embed it as part of the function below the bx lr return call - 0x48000418 is our MMIO location, which is loaded using ldr r3, [pc, #28] to load a full word constant from memory. Presumably our 0x100 constant is not big enough for GCC to think it worth using the same trick here.

So let’s calculate our our cycle counts for the ARM code as we did for the AVR, using this list of instructions. We have a couple cycles overhead for our lead in and lead out, but the core of our loop has the following cycle characteristics:

80000c0:       2280            movs    r2, #128        ; 1 cycle
80000c2:       4b07            ldr     r3, [pc, #28]   ; 2 cycles
80000c4:       0052            lsls    r2, r2, #1      ; 1 cycle
80000c6:       601a            str     r2, [r3, #0]    ; 2 cycles
80000c8:       23ff            movs    r3, #255        ; 1 cycle
80000ca:       46c0            nop                     ; 1 cycle
80000cc:       3b01            subs    r3, #1          ; 1 cycle
80000ce:       b2db            uxtb    r3, r3          ; 1 cycle
80000d0:       2b00            cmp     r3, #0          ; 1 cycle
80000d2:       d1fa            bne.n   80000ca         ; 3 if taken, 1 if not taken
80000d4:       2280            movs    r2, #128        ; 1 cycle
80000d6:       4b02            ldr     r3, [pc, #8]    ; 2 cycles
80000d8:       0452            lsls    r2, r2, #17     ; 1 cycle
80000da:       601a            str     r2, [r3, #0]    ; 2 cycles
80000dc:       4770            bx      lr              ; 3 cycles

Calculating it out, that’s approximately 7 + (7 * 255) + 6 = 1,789 cycles! Close to double the count of our AVR code. At 48MHz, that’s a theoretical execution time of 37.27µS. Our actual measured time (45.19µS) is somewhat slower - my assumption is that this is due to flash wait states, as the STM32F070 this test was run on has 1 wait state for clock speeds > 24MHz. These delays should only be incurred for non-linear flash accesses, and so our repeated loop branching is a worse-case scenario here. Inserting an extra cycle into our loop for a prefetch miss results in a total cycle count of 2,053, and an expected runtime of 42.77µS, which is much closer to our measured time.

What should I take away from all this

At the end of digging into all this, I have a newfound respect for the AVR architecture - single cycle IO access makes it much more competitive for applications with lots of standalone IO calls (but perhaps not for standard protocols like SPI, which tend to be offloaded to baked-in peripherals on the ARM cores, and with DMA support can be extremely powerful). It’s also useful to be aware of the limitations and quirks of the Thumb instruction set - keeping constants small, avoiding spurious resize instructions and minimizing branches are ways to eke more performance out of tightly looped controllers. If I were to summarize some key takeaways, they might be: