Bootloader (Flash over CAN)

This project is still in the early design stages, so this page serves both as an intro to the project and as a living design document.

Goal

Currently, the only way to get firmware onto a controller board’s microcontroller is to connect a USB programmer to the board being flashed. This works, but flashing multiple boards this way is very slow, and updating firmware programmatically is very difficult.

Instead, what we’d like to do is write a bootloader which will allow us to replace the firmware on any board on the CAN network by sending the data over CAN. The bootloader will occupy the first part of the MCU’s flash and will take care of responding to commands, replacing or updating its application code or its config, and jumping to the application firmware.

As an example of the interface, to update the MCI board’s firmware, you might attach your laptop to the CAN bus, then type

make boot-update PROJECT=mci

Or to update all the firmware in the car at once, you might just type

make boot-update

There are many other exciting applications of this beyond the primary one. For example, if the x86-side bootloader interface is encapsulated in a Python script, we could put that script on an internet-connected Raspberry Pi to enable over-the-air updates. We could also write automatic multi-board smoke tests by flashing a smoke test to multiple boards at once, in effect allowing automatic hardware integration tests.

Background

Our controller boards have STM32F072 microcontrollers with 128KiB of flash memory (persistent) and 16KiB of RAM.

Our current flashing practice is as follows. We use a CMSIS-DAP USB programmer, which connects to the controller board via its Serial Wire Debug (SWD) pins; SWD is a debug protocol with direct access to the MCU’s memory. We use OpenOCD (Open On-Chip Debugger) scripts to overwrite flash through SWD with a binary file. That file is created by linking all the object files produced by compilation into an ELF file and then converting that ELF file into a .bin file. The linking process is controlled by a linker script, which determines the addresses of each section: text (code), data (initialized global variables), bss (zero-initialized global variables), and others.

We want our existing flashing process to still be usable as a backup - if the bootloader fails, we still need to be able to flash our other projects onto the boards the old-fashioned way. Thus, any changes made to existing projects as part of the bootloader must be backwards-compatible: projects flashed normally must be unaffected.

Prior Art

STM32 devices include bootloaders over various peripherals in ROM: see this application note. Unfortunately, the STM32F072 devices we use only include USB, I2C, and USART bootloaders, not CAN, and we’d have to upgrade to STM32F1xx to use a native CAN bootloader, as investigated here: https://uwmidsun.atlassian.net/l/c/avCveAoH. This isn’t really practical, since changing MCUs would be a big pain.

Many non-native CAN bootloaders have been written before. This design takes heavy inspiration from cvra/can-bootloader (on GitHub), the bootloader CVRA uses to flash their CAN-connected boards. A frontend for the native STM32 CAN bootloader is marcinbor85/can-prog (on GitHub), a command-line tool for flashing devices over CAN bus. This bootloader tutorial is also useful: https://interrupt.memfault.com/blog/how-to-write-a-bootloader-from-scratch.

Architecture

Addressing controller boards

One major issue is how to address a certain MCU on the CAN network. Unlike the systems for which many other bootloaders were designed, Midnight Sun uses controller boards to hold microcontrollers for each board, rather than having an MCU built into each board. Thus, a certain controller board can’t be guaranteed to stay with one board forever, and we can’t just call one MCU “power_distribution”.

Therefore, the bootloader system should be able to identify a controller board by any of the following:

  • A numeric ID, set when flashing the bootloader and stored in its config. We can then physically label each controller board with its ID. ID 0 should be reserved for the client.

  • (Optional) A human-friendly string name, set when flashing the bootloader and stored in its config. Each controller board can also be labelled with its name. This is optional, but I think “centre console is on Gemini” sounds more fun than “centre console is on controller board 5”.

  • The name of the current project on the board, as a string. This will be set whenever the bootloader loads a new application project. This way, we can just say “update steering” and whichever controller board currently has the steering project will be updated, rather than having to find which board has steering.

  • Optionally, additional identifying information set by the application project. For example, we could differentiate between front and rear power distribution this way, and then we could selectively update rear power distribution.

The user should also be able to specify multiple boards to load the same firmware onto (e.g. boards 2,5,7), and if multiple boards are running the same project, we should be able to update all of them by saying something like “update power_distribution”. The ID/name should be settable when we flash the bootloader.

Memory Layout and Config

This part is *heavily* inspired by cvra/can-bootloader (on GitHub).

The STM32F072 has 128KiB of flash and 16KiB of RAM. A flash page is 2KiB, and a flash sector is 4KiB. (This matters because we can’t just write to flash normally: we have to erase a whole flash page at a time. As well, write protection works per sector.)

We will partition the flash into four sections:

  • the bootloader code

  • two identical bootloader config pages

  • the application code

  • (also: a section for the calib flash page, should be in the normal linker script too)
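To get a feel for the numbers, here is a rough Python sketch that turns the partition list above into addresses. The page counts for the bootloader and the calib page, and the calib page's position at the end of flash, are hypothetical placeholders (nothing in this design has fixed them yet); only the flash size, page size, and the two config pages come from this document. 0x08000000 is where flash starts on STM32 parts.

```python
FLASH_BASE = 0x08000000   # start of main flash on STM32 parts
FLASH_SIZE = 128 * 1024   # from this document: 128KiB of flash
PAGE_SIZE = 2 * 1024      # from this document: 2KiB erase granularity

# Hypothetical page counts, for illustration only.
BOOTLOADER_PAGES = 8      # assumed 16KiB bootloader; not decided yet
CALIB_PAGES = 1           # assumed: a single calib page at the end of flash


def build_layout():
    """Return {region_name: (start_address, size_in_bytes)}."""
    total_pages = FLASH_SIZE // PAGE_SIZE
    # Two config pages are reserved between the bootloader and the application.
    app_pages = total_pages - BOOTLOADER_PAGES - 2 - CALIB_PAGES
    regions = [
        ("bootloader", BOOTLOADER_PAGES),
        ("config_1", 1),
        ("config_2", 1),
        ("application", app_pages),
        ("calib", CALIB_PAGES),
    ]
    layout = {}
    address = FLASH_BASE
    for name, pages in regions:
        layout[name] = (address, pages * PAGE_SIZE)
        address += pages * PAGE_SIZE
    return layout


if __name__ == "__main__":
    for name, (start, size) in build_layout().items():
        print(f"{name:12s} 0x{start:08X}  {size // 1024:4d} KiB")
```

With these placeholder numbers the bootloader ends on a 4KiB boundary, which is worth preserving in the real layout since write protection works per 4KiB sector.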

The two config pages are for redundancy. We will use the persist module to manage storing the config blob in those pages. A CRC (cyclic redundancy check - a quick hash function) will be stored along with the blob to ensure its integrity; if one page has an invalid blob, we overwrite it with the other one. This way we always have a valid config page, since fixing invalid config would otherwise require manually reflashing the bootloader onto the board.
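The recovery rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the CRC is placed in the first 4 bytes of the page, zlib's CRC-32 stands in for whichever CRC32 variant the firmware ends up using, and when both pages are valid but differ, the first page is treated as the newer one (because writes go to it first). The real code would go through the persist module instead.

```python
import zlib


def make_page(payload: bytes) -> bytes:
    """Build a config page: 4-byte little-endian CRC32, then the payload."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def page_is_valid(page: bytes) -> bool:
    """A page is valid if its stored CRC matches the CRC of the rest."""
    if len(page) < 4:
        return False
    stored = int.from_bytes(page[:4], "little")
    return stored == zlib.crc32(page[4:])


def recover_config(page_a: bytes, page_b: bytes):
    """Return the repaired (page_a, page_b), or None if both are invalid.

    Assumption: page A is always written first, so if both pages are valid
    but differ, A holds the newer config and B is overwritten with it.
    """
    if page_is_valid(page_a):
        return page_a, page_a
    if page_is_valid(page_b):
        return page_b, page_b
    return None
```

If both pages are invalid there is nothing to recover from, which is the case that would need a manual reflash.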

In the config, we will also store a CRC of the application code. Before jumping to the application code, the bootloader will compute its CRC; if it doesn’t match the config, something must have been corrupted and it will refuse to boot.

A preliminary list of what we might want to store in the config:

  • config CRC (4 bytes) - CRC32 of the config blob

  • controller board ID (1 byte, 0<id<64) - numeric ID of the controller board

  • controller board name (64-byte C-string) - human-friendly name of the controller board

  • project name (64-byte C-string) - name of the current project, e.g. power_distribution. The empty string should mean “no project” (i.e. not flashed).

  • project info? (64-byte C-string?) - possible extra string set by the project to differentiate different boards, like rear for rear power distro

  • git version info (32-byte (?) C-string) - commit hash of the branch we flashed from, like 0bdfdd8-dirty; this is what’s printed by git_version.c. We could even try going for branch name.

  • application CRC (4 bytes) - CRC32 of the application code

  • application size (4 bytes) - needed for the CRC

Also, possibly a “project present?” bool.
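As a sanity check on the sizes, here is how the preliminary field list could be laid out as a packed little-endian blob, sketched with Python's struct module. The field order, the absence of padding, ASCII-only strings, and the use of zlib's CRC-32 are all assumptions; the "project present?" bool is left out since it is still only a maybe.

```python
import struct
import zlib

# Everything after the config CRC: ID, name, project, info, git version,
# application CRC, application size. "<" means little-endian with no padding.
BODY_FORMAT = "<B64s64s64s32sII"
CONFIG_SIZE = 4 + struct.calcsize(BODY_FORMAT)  # 237 bytes in total


def _c_string(text: str, size: int) -> bytes:
    raw = text.encode("ascii")
    if len(raw) >= size:
        raise ValueError("string does not fit with its NUL terminator")
    return raw  # struct pads the rest of the field with NUL bytes


def pack_config(board_id, board_name, project, info, git_version, app_crc, app_size):
    if not 0 < board_id < 64:
        raise ValueError("board ID must satisfy 0 < id < 64")
    body = struct.pack(
        BODY_FORMAT,
        board_id,
        _c_string(board_name, 64),
        _c_string(project, 64),
        _c_string(info, 64),
        _c_string(git_version, 32),
        app_crc,
        app_size,
    )
    return struct.pack("<I", zlib.crc32(body)) + body


def unpack_config(blob: bytes) -> dict:
    if len(blob) < CONFIG_SIZE:
        raise ValueError("config blob is too short")
    (stored_crc,) = struct.unpack_from("<I", blob, 0)
    body = blob[4:CONFIG_SIZE]
    if zlib.crc32(body) != stored_crc:
        raise ValueError("config blob failed its CRC check")
    board_id, name, project, info, git, app_crc, app_size = struct.unpack(BODY_FORMAT, body)

    def text(field: bytes) -> str:
        return field.split(b"\0", 1)[0].decode("ascii")

    return {
        "board_id": board_id,
        "board_name": text(name),
        "project": text(project),  # empty string means "no project"
        "info": text(info),
        "git_version": text(git),
        "app_crc": app_crc,
        "app_size": app_size,
    }
```

At 237 bytes the blob fits comfortably in a single 2KiB flash page, so there is room to grow the config later.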

We might consider write-protecting the bootloader code and possibly the config (when not intentionally writing to it) to prevent the bootloader or application code from overwriting those sections. (See section 3.3.2 of the stm32f0xx manual.)

Linker Scripts

To support this new memory layout, we will need a new linker script for building the bootloader and projects to be loaded via the bootloader. We will maintain a second set of linker scripts for the bootloader and switch to them when building the bootloader / its associated projects.

A consequence of the memory layout is that size requirements for the bootloader are very strict: if the bootloader grows too large, updates in the linker scripts are required and all the controller boards will have to be reflashed.

Protocol: Very high level

Like Babydriver, the bootloader will use a command-based, master-slave-style architecture. I’ll discuss here the kinds of operations that I think should take place; this protocol will be implemented using CAN.

Participants: the client is the computer sending commands, and the controller boards running the bootloader receive and respond to the commands. After a power cycle (maybe) or after being forced back into the bootloader from the application code, the bootloader should wait some time for the client to send commands before automatically jumping to the application code. (While we’re in the bootloader, the red LED should blink so we have a visual indication that we’re in the bootloader!)

A key concept here is pattern matching a set of controller boards. The idea is that the user can specify criteria which might match one, several, or even zero controller boards on the CAN network, and the operation should apply equally to all of them. For example, the user might specify the following (in pseudocode):

  • id=5 - match controller board 5, if it’s on the network

  • id=2,4,9,10 - match however many of controller boards 2, 4, 9, and 10 that are on the network

  • name=delta,tango - match controller boards delta and tango, if they’re on the network

  • project=power_distribution - match all controller boards running power distribution on the network

  • project=power_distribution, info=rear - match all controller boards running power distribution whose applications have set project info rear (could be used to select rear power distribution)

  • (maybe) commit-hash=0bdfd - match all controller boards whose git commit hashes start with 0bdfd (or if the commit hash specified is longer than the commit hash stored, all controller boards whose git commit hashes are a prefix of the one specified)

  • id=5, project=bms_carrier, name=curiosity - match controller board 5 running BMS carrier named curiosity, if such a board exists on the network. Otherwise, match nothing.

This can be implemented with a special pattern-matching operation: the client sends out a message with all the information the user specified, and the controller boards that match respond.

All of the following operations should be considered to use pattern matching, in that they apply to the set of controller boards specified by one of the above methods.
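To make the matching rules concrete, here is a small Python sketch that parses the pseudocode patterns above and applies them to board records. The pattern syntax is taken literally from the examples; the dictionary keys (id, name, project, info, git_version) and the function names are hypothetical, and whether matching runs on the client or on each board is left open.

```python
def parse_pattern(text: str) -> dict:
    """Parse e.g. 'id=2,4,9,10' or 'project=power_distribution, info=rear'.

    Returns {key: [values]}. A comma-separated token without '=' is treated
    as another value for the most recent key.
    """
    criteria = {}
    key = None
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, value = (part.strip() for part in token.split("=", 1))
            criteria.setdefault(key, []).append(value)
        elif key is not None:
            criteria[key].append(token)
        else:
            raise ValueError(f"value without a key: {token!r}")
    return criteria


def board_matches(board: dict, criteria: dict) -> bool:
    """A board matches only if it satisfies every criterion given."""
    for key, values in criteria.items():
        if key == "id":
            matched = board["id"] in {int(v) for v in values}
        elif key == "commit-hash":
            # Compare only the hash part of e.g. '0bdfdd8-dirty'; either
            # string may be a prefix of the other.
            stored = board["git_version"].split("-", 1)[0]
            matched = bool(stored) and any(
                stored.startswith(v) or v.startswith(stored) for v in values
            )
        elif key in ("name", "project", "info"):
            matched = board[key] in values
        else:
            raise ValueError(f"unknown criterion: {key}")
        if not matched:
            return False
    return True


def select_boards(boards: list, pattern_text: str) -> list:
    """Return the IDs of all boards matching the pattern, in input order."""
    criteria = parse_pattern(pattern_text)
    return [b["id"] for b in boards if board_matches(b, criteria)]
```

An empty pattern matches every board, which lines up with the idea of running an operation on the whole network when no criteria are given.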

Querying

We should have a way to retrieve config information from matched controller boards.

The client sends out a message with all the pattern-matching information entered by the user. Each controller board responds with the following information from its config: numeric ID, name, current project name, project info (if used), and git version info (or even branch name!). The client could then e.g. display a printout like this:

ID   Name      Current Project      Info    Git Version
5    newton    centre_console       -       f8df7d2-clean
2    galileo   bms_carrier          -       23daff3-dirty
8    maxwell   power_distribution   front   6a4a7bb-dirty
11   einstein  steering             -       bc869ef-clean
7    curie     power_distribution   rear    c6e8925-clean
16   hawking   mci                  -       798fe65-dirty
4    faraday   pedal_board          -       5131f78-clean
9    turing    charger              -       9e6987b-dirty

(The client can also use git to look up the branch name from the commit hash and display it automagically!)

This command can also be used to implement pattern-matching for all of the following commands. To implement pattern-matching, all that’s technically required is to get a list of IDs that match a pattern, but the extra information can be used to display a warning before potentially-dangerous commands like flash, or just to list the boards that a command applies to.

Ping

Very simple life check. The client sends out a list of IDs, or none to ping all boards on the network. Each board sends back a message with its ID if it’s in the bootloader and ready to receive commands. Useful as a lightweight version of querying for internal uses.

Jump to application

Direct the matched boards to jump to the application code. The client sends out a list of IDs it wants to jump; each matched controller board computes the CRC of the application code, checks it, responds with a status code, and jumps to the application. (Actually doing this is super tricky: info here https://interrupt.memfault.com/blog/how-to-write-a-bootloader-from-scratch.)

Depending on how the design works out, we might be able to skip the CRC.
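The check before the jump is simple enough to model in Python. This sketch assumes the application CRC and size come from the config, that the CRC covers exactly the first app_size bytes of the application region, and that zlib's CRC-32 stands in for the firmware's CRC32 variant. The jump itself (the tricky part) is not modelled.

```python
import zlib


def application_is_bootable(app_region: bytes, app_size: int, expected_crc: int) -> bool:
    """Return True only if the stored size is sane and the CRC matches.

    app_region stands in for the application section of flash; bytes past
    app_size (e.g. erased 0xFF filler) are not part of the CRC.
    """
    if app_size == 0 or app_size > len(app_region):
        return False  # "no project", or a corrupted size field
    return zlib.crc32(app_region[:app_size]) == expected_crc
```

For example, an 8-byte image followed by erased flash passes the check, while the same region with one flipped bit in the image does not.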

Update ID / Update name

Two separate commands. Update the ID or name of the matched controller board. The client should not allow this to be run when more than one controller board is matched, or when the ID/name matches the ID/name of another controller board on the network; however, it can’t detect if the ID/name is used in a controller board not on the network, so be careful.

The client sends out an ID to update and the new ID or name. The controller board overwrites the config (being careful to write one page, check it, and then write the other), and responds back with a status code.

Flash application code

The client sends out IDs of the boards to flash to, metadata like project name, git version (+branch?) info, and application CRC and size, then the application code itself. The controller boards write the application code to flash (making sure not to keep it all in memory at once to avoid overflows) and compute their own CRC of the application code. If it matches, they overwrite the project name/git version info/application CRC/application size in the config and clear the project info (being careful to write one page, check it, and then write the other), then respond back with an “OK” status code message. If it doesn’t match, they mark their config as “no project” (again being careful) and respond back with an appropriate status code.
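Here is a rough Python model of that flow, mainly to show that the board never needs the whole image in RAM: the client splits the image into page-sized chunks, and the board keeps a running CRC as each chunk arrives. The class and function names, the status strings, and the one-chunk-per-flash-page granularity are all assumptions; a bytearray stands in for real flash writes.

```python
import zlib

PAGE_SIZE = 2048  # one flash page per chunk (assumed granularity)


def prepare_flash(image: bytes, project: str, git_version: str):
    """Client side: build the metadata and split the image into chunks."""
    metadata = {
        "project": project,
        "git_version": git_version,
        "app_crc": zlib.crc32(image),
        "app_size": len(image),
    }
    pages = [image[i:i + PAGE_SIZE] for i in range(0, len(image), PAGE_SIZE)]
    return metadata, pages


class FlashSession:
    """Board side: write chunks as they arrive, tracking a running CRC."""

    def __init__(self, metadata: dict):
        self.metadata = metadata
        self.running_crc = 0
        self.bytes_written = 0
        self.flash = bytearray()  # stand-in for erasing and writing flash

    def write_page(self, page: bytes) -> None:
        self.flash += page
        # zlib.crc32 accepts the previous CRC as a starting value, so only
        # the current chunk has to be in memory.
        self.running_crc = zlib.crc32(page, self.running_crc)
        self.bytes_written += len(page)

    def finish(self) -> str:
        """Return a (hypothetical) status code for the response message."""
        ok = (
            self.bytes_written == self.metadata["app_size"]
            and self.running_crc == self.metadata["app_crc"]
        )
        return "OK" if ok else "CRC_MISMATCH"
```

On "OK" the board would go on to update its config pages one at a time; on a mismatch it would mark the config as "no project".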

Protocol: Some CAN implementation considerations

The best ideas are stolen: this section is heavily inspired by PROTOCOL.markdown in cvra/can-bootloader.

Transport layer (Datagrams)

As seen with Babydriver (e.g. the SPI and I2C modules), it’s a big pain to deal with the 8-byte CAN message data limit when trying to send variable-length data, or just data longer than 8 bytes. So, let’s use the following structure to transmit variable-length datagrams so we don’t have to deal with it.

This part isn’t really specific to the bootloader and should just be added to ms-common (using any two message IDs and taking node IDs (i.e. controller board IDs) as input).

A datagram will be represented as a stream of CAN messages, all sent sequentially. The first message in the sequence is the “start message”: it will have a different, higher ID than the other messages, which all have the same ID. (Lower IDs have priority on the CAN bus, so the start message should have a higher ID than the rest to discourage datagram interruptions on a low level.)

Datagrams should have the following format:

  1. Datagram protocol version (1 byte) - a constant, initially 0x00. Versions that don’t match should be silently ignored. Useful for backwards compatibility in the future.

  2. CRC32 of the whole datagram after this point (4 bytes)

  3. Datagram type ID (1 byte) - an ID specifying what the datagram is and how the data field is formatted, like a command ID. Sort of like a babydriver ID.

  4. Number of node IDs / controller board IDs addressed (n) (1 byte) - the special value 0 means every controller board / node on the network should receive the datagram.

  5. List of node IDs (n bytes, 1 byte per node ID)

  6. Data size in bytes (m) (2 bytes) - this value could physically go up to 65535, but the STM32F072 only has 16KiB of RAM, which has to hold all of the data plus other stuff on the stack, global variables, etc. So, the data size MUST be less than or equal to 2048 bytes (2KiB). This is an arbitrary limit which is subject to change, but this value lets us transfer a whole 2KiB flash page in one datagram.

  7. Data (m bytes)

Note: all multi-byte integers (i.e. the CRC32 and data size in bytes) are in little-endian order, with the least significant byte first. (This is the default on STM32.)
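The format above maps directly onto a few lines of Python, which might be a starting point for the client side. This is a sketch, not a settled implementation: zlib's CRC-32 stands in for whichever CRC32 variant is chosen, and returning None for anything invalid is just one way to express "silently ignore".

```python
import struct
import zlib

PROTOCOL_VERSION = 0x00
MAX_DATA_SIZE = 2048
MIN_DATAGRAM_SIZE = 9  # version + CRC + type + ID count + data size


def encode_datagram(type_id: int, node_ids: list, data: bytes) -> bytes:
    """Serialize a datagram; an empty node_ids list addresses every node."""
    if len(data) > MAX_DATA_SIZE:
        raise ValueError("data must be at most 2048 bytes")
    if len(node_ids) > 255:
        raise ValueError("at most 255 node IDs fit in the count byte")
    body = (
        bytes([type_id, len(node_ids)])
        + bytes(node_ids)
        + struct.pack("<H", len(data))
        + data
    )
    # The CRC covers everything after the CRC field itself.
    return struct.pack("<BI", PROTOCOL_VERSION, zlib.crc32(body)) + body


def decode_datagram(raw: bytes):
    """Return (type_id, node_ids, data), or None if it should be ignored."""
    if len(raw) < MIN_DATAGRAM_SIZE or raw[0] != PROTOCOL_VERSION:
        return None
    (stored_crc,) = struct.unpack_from("<I", raw, 1)
    body = raw[5:]
    if zlib.crc32(body) != stored_crc:
        return None
    type_id, count = body[0], body[1]
    if len(body) < 2 + count + 2:
        return None
    node_ids = list(body[2:2 + count])
    (size,) = struct.unpack_from("<H", body, 2 + count)
    data = body[4 + count:]
    if size > MAX_DATA_SIZE or len(data) != size:
        return None
    return type_id, node_ids, bytes(data)
```

The smallest possible datagram (no node IDs, no data) is 9 bytes, so even an empty command needs two CAN messages.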

Datagram messages will have the node ID (controller board ID) of the source node as part of the message’s arbitration ID, so the source of each message is identifiable. Thus multiple datagrams from different sources may be sent at the same time. Since the bootloader protocol is operating under a master-slave command-based architecture, the controller boards need only store and take action on messages from the client (with special ID 0), while the client must store datagrams from every controller board.

A timeout of 25ms (arbitrary, subject to change) should apply between messages in a datagram, after which the datagram transmission should be considered to have ended and the contents should be discarded.
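Splitting a datagram into 8-byte messages and putting it back together could look roughly like this in Python. Two things here are assumptions rather than part of the design so far: the receiver decides a datagram is complete by reading the length fields out of the header as they arrive, and a continuation message that arrives after the timeout simply causes the partial datagram to be dropped. One Reassembler would be needed per source node.

```python
CAN_PAYLOAD_SIZE = 8
TIMEOUT_S = 0.025  # 25ms between messages, per the text


def fragment(datagram: bytes) -> list:
    """Split a datagram into (is_start, payload) CAN messages."""
    return [
        (offset == 0, datagram[offset:offset + CAN_PAYLOAD_SIZE])
        for offset in range(0, len(datagram), CAN_PAYLOAD_SIZE)
    ]


def expected_length(buffer: bytes):
    """Total datagram length from its header, or None if not yet known."""
    if len(buffer) < 7:
        return None
    count = buffer[6]  # number of node IDs
    if len(buffer) < 9 + count:
        return None
    size = int.from_bytes(buffer[7 + count:9 + count], "little")
    return 9 + count + size


class Reassembler:
    """Rebuilds datagrams from one source node's messages."""

    def __init__(self):
        self.buffer = None
        self.last_rx = None

    def on_message(self, is_start: bool, payload: bytes, now: float):
        """Feed one message; return the full datagram once it is complete."""
        if is_start:
            self.buffer = bytearray(payload)  # a start message always resets
        else:
            if self.buffer is None:
                return None  # continuation with no datagram in progress
            if now - self.last_rx > TIMEOUT_S:
                self.buffer = None  # timed out: discard the partial datagram
                return None
            self.buffer += payload
        self.last_rx = now
        total = expected_length(self.buffer)
        if total is not None and len(self.buffer) >= total:
            done = bytes(self.buffer[:total])
            self.buffer = None
            return done
        return None
```

The returned bytes still need their version and CRC checked before anything acts on them.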

CAN ID allocation

A remaining consideration is what IDs the messages should have. The CAN standard message format gives us an 11-bit arbitration ID, where lower IDs have priority. Our normal CAN infrastructure partitions this into 3 parts: a 6-bit message ID, a 1-bit ACK flag, and a 4-bit device ID (set per project). We include the device ID in the arbitration ID because if two nodes on the CAN network send messages with the same ID at the same time, arbitration can’t pick a winner and the frames collide on the bus, so we guarantee different IDs by embedding a device ID in the arbitration ID.

A 4-bit device ID works fine for our normal CAN system since we have <16 boards with MCUs, but we might have up to 50 or so controller boards, so we need a 6-bit ID. We’ve also got device ID 0 reserved for some reason. Thus I propose the following structure for a bootloader datagram frame arbitration ID:

  • A 6-bit source/node/controller board ID, taking up the message ID slot

  • A 1-bit start-of-datagram flag, 1 for start messages and 0 for the rest, taking up the ACK flag’s bit

  • 4 bits of zeros for the device ID - we can even call this SYSTEM_CAN_DEVICE_ID_BOOTLOADER.
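In code, the proposed structure is just some shifting and masking. The Python sketch below assumes the three fields are packed most-significant-first in the order listed (node ID in the top 6 bits, then the start flag, then the device ID in the low 4 bits); the function names are made up for illustration.

```python
BOOTLOADER_DEVICE_ID = 0  # the 4 zero bits, i.e. SYSTEM_CAN_DEVICE_ID_BOOTLOADER


def make_arbitration_id(node_id: int, is_start: bool) -> int:
    """Build the 11-bit arbitration ID for a bootloader datagram message."""
    if not 0 <= node_id < 64:
        raise ValueError("node ID must fit in 6 bits")
    return (node_id << 5) | (int(is_start) << 4) | BOOTLOADER_DEVICE_ID


def parse_arbitration_id(arbitration_id: int):
    """Return (node_id, is_start), or None if this isn't a bootloader message."""
    if arbitration_id & 0xF != BOOTLOADER_DEVICE_ID:
        return None
    node_id = (arbitration_id >> 5) & 0x3F
    is_start = bool((arbitration_id >> 4) & 0x1)
    return node_id, is_start
```

Under this packing, the client (node ID 0) sends continuation messages with arbitration ID 0 and start messages with arbitration ID 16.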

This can be implemented via a very minimal change in can_fsm.c: in prv_handle_rx, if rx_msg.device_id == SYSTEM_CAN_DEVICE_ID_BOOTLOADER, either call a bootloader function to handle it if we’re in the bootloader or else just ignore the message/jump back to the bootloader.

This scheme has the disadvantage that the node ID is at the beginning, so bootloader datagrams from controller boards aren’t given a higher priority in general. However, the client (which is the only party broadcasting extremely long and important datagrams like flashing content) has the special node ID 0, so the client’s non-starting datagram messages have the highest priority on the bus (all zeros) while the client’s starting messages have close to it - in our setup, bested only by the BPS heartbeat.

We will have to work around an x86 thing: see line 262 in x86/can_hw.c.

Under this scheme, code flashed via the bootloader can coexist with code flashed the traditional way, but the scheme does require that we reflash all boards so that every node on the network is aware of and at least ignores bootloader messages.

Breaking into the bootloader

One other topic: we should have a way to jump from the application code back to the bootloader via a CAN message to run more commands. We can either do it upon receipt of any bootloader datagram start message (and pass the start message back to the bootloader), or we can do it with a normal CAN message with a handler pre-registered. In any case, we’d have to initialize CAN in smoke tests and small projects in order for them to be accessible via this method.

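One way to do this without starting up into the bootloader on initialization is described in ST’s application note on on-the-fly firmware updates for dual-bank STM32 microcontrollers: https://www.st.com/resource/en/application_note/dm00230416-onthefly-firmware-update-for-dual-bank-stm32-microcontrollers-stmicroelectronics.pdf. That approach uses two boot banks.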

Client and API

The core of the client should be implemented as a modular Python script so that it can be deployed not just on x86 but also on e.g. a Raspberry Pi in the car to enable over-the-air firmware updates. The make interface to the bootloader client can be implemented by just shelling out to the Python script.

The make interface might look something like the make boot-update PROJECT=mci and make boot-update examples in the Goal section.

Task List

  • Finalize design of the bootloader

  • Design the exact CAN protocol for each operation in terms of datagrams (maybe look into MessagePack?)

  • Implement the datagram protocol, both in Python (client) and in C (ms-common)

  • Create new linker scripts

  • Implement each command from both client (Python) and bootloader sides

    • The trickiest one is jumping to the application code - figure out how to do that while deinitializing everything

  • Write the main function / glue code of the bootloader, taking care to be very safe (using the redundancy in the config, checking CRCs)

Remember: we don’t want bugs in the bootloader! They’re hard to correct and can have bad consequences. Keep it simple and robust!