...

Code Block
make boot-update-all

There are many other exciting applications of this beyond the primary one. For example, if the x86-side bootloader interface is encapsulated in a Python script, we could put that script on an internet-connected Raspberry Pi to enable over-the-air updates. We could also write automatic multi-board smoke tests by flashing a smoke test to multiple boards at once, in effect allowing automatic hardware integration tests.

...

Our controller boards have STM32F072 microcontrollers with 128KiB of flash memory (persistent) and 16KiB of RAM.

Our current flashing practice is as follows. We use a CMSIS-DAP USB programmer, which connects to the controller board via its Serial Wire Debug (SWD) pins; SWD is a protocol with direct access to the MCU’s memory. We use OpenOCD (Open On-Chip Debugger) scripts to overwrite flash through SWD with a binary file. That file is created by linking all the object files produced by compilation into an ELF file and then converting the ELF file into a .bin file. The linking process is controlled by a linker script, which determines the addresses of each section: text (code), data (initialized global variables), bss (zero-initialized global variables), and others.

...

  • A numeric ID, set when flashing the bootloader and stored in its config. We can then physically label each controller board with its ID. ID 0 should be reserved for the client.

  • (Optional) A human-friendly string name, set when flashing the bootloader and stored in its config. Each controller board can also be labelled with its name. This is optional, but I think “centre console is on Gemini” sounds more fun than “centre console is on controller board 5”.

  • The name of the current project on the board, as a string. This will be set whenever the bootloader loads a new application project. This way, we can just say “update steering” and whichever controller board currently has the steering project will be updated, rather than having to find which board has steering.

  • Optionally, additional identifying information set by the application project. For example, we could differentiate between front and rear power distribution this way, and then we could selectively update rear power distribution.

The user should also be able to specify multiple boards to load the same firmware onto (e.g. boards 2,5,7), and if multiple boards are running the same project, we should be able to update all of them by saying something like “update power_distribution”. The ID/name should be settable when we flash the bootloader.

Memory Layout and Config

This part is *heavily* inspired by https://github.com/cvra/can-bootloader .

The STM32F072 has 128KiB of flash and 16KiB of RAM. A flash page is 2KiB, and a flash sector is 4KiB. (This matters because we can’t just write to flash normally: we have to erase a whole flash page at a time. As well, write protection works per sector.)
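
To make the page/sector arithmetic concrete, here is a minimal Python sketch of the geometry described above. The helper name pages_to_erase is made up for illustration, and 0x08000000 is the usual start address of main flash on STM32 parts; nothing here is existing project code.

```python
FLASH_BASE = 0x08000000   # usual start of main flash on STM32 parts
FLASH_SIZE = 128 * 1024   # 128KiB of flash
PAGE_SIZE = 2 * 1024      # erase granularity (one page)
SECTOR_SIZE = 4 * 1024    # write-protection granularity (one sector)

NUM_PAGES = FLASH_SIZE // PAGE_SIZE       # 64 pages
NUM_SECTORS = FLASH_SIZE // SECTOR_SIZE   # 32 sectors, 2 pages each


def pages_to_erase(start_addr, length):
    """Return the index of every flash page touched by the byte range
    [start_addr, start_addr + length). Hypothetical helper."""
    if length <= 0:
        return []
    offset = start_addr - FLASH_BASE
    if offset < 0 or offset + length > FLASH_SIZE:
        raise ValueError("range falls outside flash")
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return list(range(first, last + 1))
```

Even a two-byte write that straddles a page boundary forces both pages to be erased and rewritten, which is why the layout should keep the bootloader, config, and application on page boundaries.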

...

  • config CRC (4 bytes) - CRC32 of the config blob

  • controller board ID (1 byte?, 0<id<64) - numeric ID of the controller board

  • controller board name (64-byte C-string) - human-friendly name of the controller board

  • project name (64-byte C-string) - name of the current project, e.g. power_distribution. The empty string should mean “no project” (i.e. not flashed).

  • project info? (64-byte C-string?) - possible extra string set by the project to differentiate different boards, like rear for rear power distro

  • git version info (32-byte (?) C-string) - commit hash of the branch we flashed from, like 0bdfdd8-dirty; this is what’s printed by git_version.c. We could even try going for branch name.

  • application CRC (4 bytes) - CRC32 of the application code

  • application size (4 bytes) - needed for the CRC

Also, possibly a “project present?” bool.
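
As a rough illustration of the config blob, here is a Python sketch that packs and checks the fields listed above. Several things are assumptions on my part: the byte order (little-endian), the field sizes still marked with question marks, the CRC covering everything after the CRC field, and the use of zlib's CRC32 variant. The function names are hypothetical.

```python
import struct
import zlib

# Fields after the leading CRC32, in the order listed above (sizes assumed):
# board ID, board name, project name, project info, git version,
# application CRC, application size.
_BODY = struct.Struct("<B64s64s64s32sII")
_CRC = struct.Struct("<I")
CONFIG_SIZE = _CRC.size + _BODY.size  # 237 bytes with these sizes


def _cstr(text, size):
    """Encode a string, leaving room for the terminating NUL of a C-string."""
    raw = text.encode("ascii")
    if len(raw) >= size:
        raise ValueError("string too long for a %d-byte C-string" % size)
    return raw  # struct pads the rest of the field with NUL bytes


def pack_config(board_id, name, project, info, git_version, app_crc, app_size):
    if not 0 < board_id < 64:
        raise ValueError("board ID must satisfy 0 < id < 64")
    body = _BODY.pack(board_id, _cstr(name, 64), _cstr(project, 64),
                      _cstr(info, 64), _cstr(git_version, 32),
                      app_crc, app_size)
    return _CRC.pack(zlib.crc32(body)) + body


def unpack_config(blob):
    """Return the config as a dict, or None if the size or CRC is wrong."""
    if len(blob) != CONFIG_SIZE:
        return None
    (crc,) = _CRC.unpack_from(blob, 0)
    body = blob[_CRC.size:]
    if zlib.crc32(body) != crc:
        return None
    board_id, name, project, info, git, app_crc, app_size = _BODY.unpack(body)
    text = lambda raw: raw.split(b"\0", 1)[0].decode("ascii")
    return {"board_id": board_id, "name": text(name), "project": text(project),
            "info": text(info), "git_version": text(git),
            "app_crc": app_crc, "app_size": app_size}
```

With these sizes the whole blob fits easily in one 2KiB flash page, and an empty project string naturally encodes “no project”.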

We might consider write-protecting the bootloader code and possibly the config (when not intentionally writing to it) to prevent the bootloader or application code from overwriting those sections. (See section 3.3.2 of the stm32f0xx manual.)

...

A consequence of the memory layout is that size requirements for the bootloader are very strict: if the bootloader grows too large, updates in the linker scripts are required and all the controller boards will have to be reflashed.

Protocol: Very high level

Like Babydriver, the bootloader will use a command-based, master-slave-style architecture. I’ll discuss here the kinds of operations that I think should take place; this protocol will be implemented using CAN.

Participants: the client is the computer sending commands, and the controller boards running the bootloader receive and respond to the commands. After a power cycle (maybe) or after being forced back into the bootloader from the application code, the bootloader should wait some time for the client to send commands before automatically jumping to the application code. (While we’re in the bootloader, the red LED should blink so we have a visual indication that we’re in the bootloader!)

A key concept here is pattern matching a set of controller boards. The idea is that the user can specify criteria which might match one, several, or even zero controller boards on the CAN network, and the operation should apply equally to all of them. For example, the user might specify the following (in pseudocode):

  • id=5 - match controller board 5, if it’s on the network

  • id=2,4,9,10 - match however many of controller boards 2, 4, 9, and 10 that are on the network

  • name=delta,tango - match controller boards delta and tango, if they’re on the network

  • project=power_distribution - match all controller boards running power distribution on the network

  • project=power_distribution, info=rear - match all controller boards running power distribution whose applications have set project info rear (could be used to select rear power distribution)

  • (maybe) commit-hash=0bdfd - match all controller boards whose git commit hashes start with 0bdfd (or if the commit hash specified is longer than the commit hash stored, all controller boards whose git commit hashes are a prefix of the one specified)

  • id=5, project=bms_carrier, name=curiosity - match controller board 5 running BMS carrier named curiosity, if such a board exists on the network. Otherwise, match nothing.

This can be implemented with a special pattern-matching operation: the client sends out a message with all the information the user specified, and the controller boards that match it respond.

All of the following operations should be considered to use pattern matching, in that they apply to the set of controller boards specified by one of the above methods.
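
The matching rules above could be sketched in Python roughly as follows. The dict keys and function names are hypothetical. I’m assuming that criteria on different fields are ANDed, that several values for one field are ORed, and that the stored git version is compared without its -dirty/-clean suffix.

```python
def matches(board, pattern):
    """Return True if one board's config satisfies every given criterion.

    board:   dict with keys id, name, project, info, commit
    pattern: dict with any of id/name/project/info (collections of accepted
             values) and commit (a single hash prefix); missing = don't care
    """
    for key in ("id", "name", "project", "info"):
        wanted = pattern.get(key)
        if wanted and board[key] not in wanted:
            return False
    commit = pattern.get("commit")
    if commit:
        # Assumption: compare only the hash part of e.g. "0bdfdd8-dirty".
        stored = board["commit"].split("-")[0]
        # Match when either hash is a prefix of the other.
        if not stored or not (stored.startswith(commit)
                              or commit.startswith(stored)):
            return False
    return True


def select(boards, pattern):
    """Return the boards matched by the pattern (possibly none)."""
    return [board for board in boards if matches(board, pattern)]


if __name__ == "__main__":
    boards = [
        {"id": 8, "name": "maxwell", "project": "power_distribution",
         "info": "front", "commit": "6a4a7bb-dirty"},
        {"id": 7, "name": "curie", "project": "power_distribution",
         "info": "rear", "commit": "c6e8925-clean"},
    ]
    rear = select(boards, {"project": {"power_distribution"}, "info": {"rear"}})
    print([board["name"] for board in rear])
```

Whether this logic runs on the client (after a query) or on each controller board is a separate design choice; the sketch only pins down what “match” means.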

Querying

We should have a way to retrieve config information from matched controller boards.

The client sends out a message with all the pattern-matching information entered by the user. Each controller board responds with the following information from its config: numeric ID, name, current project name, project info (if used), and git version info (or even branch name!). The client could then e.g. display a printout like this:

Code Block
ID   Name      Current Project     Info   Git Version
5    newton    centre_console             f8df7d2-clean
2    galileo   bms_carrier                23daff3-dirty
8    maxwell   power_distribution  front  6a4a7bb-dirty
11   einstein  steering                   bc869ef-clean
7    curie     power_distribution  rear   c6e8925-clean
16   hawking   mci                        798fe65-dirty
4    faraday   pedal_board                5131f78-clean
9    turing    charger                    9e6987b-dirty

(this might be even more useful if we added branch names!)

This command can also be used to implement pattern-matching for all of the following commands. To implement pattern-matching, all that’s technically required is to get a list of IDs that match a pattern, but the extra information can be used to display a warning before potentially-dangerous commands like flash, or just a list of the boards that a command applies to, like this:

Code Block
Flashing the following controller boards:
ID   Name      Current Project     Info   Git Version
8    maxwell   power_distribution  front  6a4a7bb-dirty
7    curie     power_distribution  rear   c6e8925-clean

Ping

Very simple life check. The client sends out a list of IDs, or none to ping all boards on the network. Each board sends back a message with its ID if it’s in the bootloader and ready to receive commands. Useful as a lightweight version of querying for internal uses.

Jump to application

Direct the matched boards to jump to the application code. The client sends out a list of IDs it wants to jump; each matched controller board computes the CRC of the application code, checks it, responds with a status code, and jumps to the application. (Actually doing this is super tricky: info here https://interrupt.memfault.com/blog/how-to-write-a-bootloader-from-scratch.)

Depending on how the design works out, we might be able to skip the CRC.

Update ID / Update name

Two separate commands. Update the ID or name of the matched controller board. The client should not allow this to be run when more than one controller board is matched, or when the ID/name matches the ID/name of another controller board on the network; however, it can’t detect if the ID/name is used in a controller board not on the network, so be careful.

The client sends out an ID to update and the new ID or name. The controller board overwrites the config (being careful to write one page, check it, and then write the other), and responds back with a status code.
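
The “write one page, check it, then write the other” step could look something like this Python simulation. It assumes the config is kept as two identical copies in two flash pages, each prefixed with a CRC32; FakeFlash and the function names are stand-ins for illustration, not real driver code.

```python
import zlib


def _seal(body):
    """Prefix a config body with its CRC32 (little-endian, assumed)."""
    return zlib.crc32(body).to_bytes(4, "little") + body


def _valid(blob):
    return (len(blob) >= 4
            and int.from_bytes(blob[:4], "little") == zlib.crc32(blob[4:]))


class FakeFlash:
    """Two config pages; broken_page simulates a write that fails."""

    def __init__(self, broken_page=None):
        self.pages = [b"", b""]
        self.broken_page = broken_page

    def write_page(self, index, blob):
        self.pages[index] = b"" if index == self.broken_page else bytes(blob)

    def read_page(self, index):
        return self.pages[index]


def update_config(flash, body):
    """Write copy 0, verify it, then write copy 1. Stop at the first failure
    so the other copy (old or new) stays intact."""
    blob = _seal(body)
    for index in (0, 1):
        flash.write_page(index, blob)
        if flash.read_page(index) != blob:
            return False
    return True


def load_config(flash):
    """Return the first copy whose CRC checks out, or None."""
    for index in (0, 1):
        blob = flash.read_page(index)
        if _valid(blob):
            return blob[4:]
    return None
```

The point of the ordering is that at every instant at least one page holds a config with a good CRC, so a power loss mid-update can’t leave the board without an identity.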

Flash application code

The client sends out IDs of the boards to flash to, metadata like project name, git version (+branch?) info, and application CRC and size, then the application code itself. The controller boards write the application code to flash (making sure not to keep it all in memory at once to avoid overflows) and compute their own CRC of the application code. If it matches, they overwrite the project name/git version info/application CRC/application size in the config and clear the project info (being careful to write one page, check it, and then write the other), then respond back with an “OK” status code message. If it doesn’t match, they mark their config as “no project” (again being careful) and respond back with an appropriate status code.
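
Here is a small Python model of the board-side flashing logic, mainly to show that the CRC can be computed incrementally so the image never has to sit in RAM all at once. The function name and the write_chunk callback are hypothetical, and zlib’s CRC32 is an assumption (the real bootloader may use a different CRC32 variant).

```python
import zlib


def receive_application(chunks, expected_crc, expected_size, write_chunk):
    """Write each received chunk to flash as it arrives and keep a running
    CRC32. Returns True if the size and CRC match the metadata the client
    sent, so the caller knows whether to commit the new config or mark the
    board as having no project."""
    crc = 0
    offset = 0
    for chunk in chunks:
        if offset + len(chunk) > expected_size:
            return False  # more data than the client announced
        write_chunk(offset, chunk)      # only one chunk is held in memory
        crc = zlib.crc32(chunk, crc)    # continue the CRC from the last chunk
        offset += len(chunk)
    return offset == expected_size and crc == expected_crc


if __name__ == "__main__":
    image = bytes(range(256)) * 20
    chunks = [image[i:i + 2048] for i in range(0, len(image), 2048)]
    flash = bytearray(len(image))

    def write_chunk(offset, chunk):
        flash[offset:offset + len(chunk)] = chunk

    ok = receive_application(chunks, zlib.crc32(image), len(image), write_chunk)
    print("flash OK" if ok else "flash failed")
```

If each chunk is one 2KiB datagram, a chunk maps onto exactly one flash page, which keeps the erase/write bookkeeping simple.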

Protocol: Some CAN implementation considerations

The best ideas are stolen: this section is heavily inspired by https://github.com/cvra/can-bootloader/blob/master/PROTOCOL.markdown

Transport layer (Datagrams)

As seen with Babydriver (e.g. the SPI and I2C modules), it’s a big pain to deal with the 8-byte CAN message data limit when trying to send variable-length data, or just data longer than 8 bytes. So, let’s use the following structure to transmit variable-length datagrams so we don’t have to deal with it.

This part isn’t really specific to the bootloader and should just be added to ms-common (using any two message IDs and taking node IDs (i.e. controller board IDs) as input).

A datagram will be represented as a stream of CAN messages, all sent sequentially. The first message in the sequence is the “start message”: it will have a different, higher ID than the other messages, which all have the same ID. (Lower IDs have priority on the CAN bus, so the start message should have a higher ID than the rest to discourage datagram interruptions on a low level.)

Datagrams should have the following format:

  1. Datagram protocol version (1 byte) - a constant, initially 0x00. Versions that don’t match should be silently ignored. Useful for backwards compatibility in the future.

  2. CRC32 of the whole datagram after this point (4 bytes)

  3. Number of node IDs / controller board IDs addressed (n) (1 byte) - the special value 0 means every controller board / node on the network should receive the datagram.

  4. List of node IDs (n bytes, 1 byte per node ID)

  5. Data size in bytes (m) (2 bytes) - this value could physically go up to 65535, but the STM32F072 only has 16KiB of RAM, which has to hold all of the data plus other stuff on the stack, global variables, etc. So, the data size MUST be less than or equal to 2048 bytes (2KiB). This is an arbitrary limit which is subject to change, but this value lets us transfer a whole 2KiB flash page in one datagram.

  6. Data (m bytes)
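
For the client side, the datagram format above could be serialized roughly like this in Python. The byte order (little-endian) and the CRC32 variant (zlib’s) are my assumptions, since the format doesn’t pin them down yet, and the function names are made up.

```python
import struct
import zlib

PROTOCOL_VERSION = 0x00
MAX_DATA_SIZE = 2048  # the 2KiB limit from field 5


def pack_datagram(node_ids, data):
    """Serialize a datagram. An empty node_ids list addresses every node."""
    if len(node_ids) > 255:
        raise ValueError("at most 255 node IDs fit in the 1-byte count")
    if any(not 0 <= node_id < 64 for node_id in node_ids):
        raise ValueError("node IDs are 6-bit values")
    if len(data) > MAX_DATA_SIZE:
        raise ValueError("data exceeds %d bytes" % MAX_DATA_SIZE)
    body = (bytes([len(node_ids)]) + bytes(node_ids)
            + struct.pack("<H", len(data)) + bytes(data))
    # The CRC covers everything after the CRC field itself.
    return bytes([PROTOCOL_VERSION]) + struct.pack("<I", zlib.crc32(body)) + body


def unpack_datagram(raw):
    """Return (node_ids, data), or None if the datagram should be ignored."""
    if len(raw) < 8 or raw[0] != PROTOCOL_VERSION:
        return None  # too short, or a version we silently ignore
    (crc,) = struct.unpack_from("<I", raw, 1)
    body = raw[5:]
    if zlib.crc32(body) != crc:
        return None
    count = body[0]
    if len(body) < 1 + count + 2:
        return None
    node_ids = list(body[1:1 + count])
    (size,) = struct.unpack_from("<H", body, 1 + count)
    data = body[3 + count:]
    if size > MAX_DATA_SIZE or len(data) != size:
        return None
    return node_ids, data
```

The fixed overhead is 8 bytes plus one byte per addressed node, so a full 2KiB page sent to three boards is a 2059-byte datagram.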

Datagram messages will have the node ID (controller board ID) of the source node as part of the message’s arbitration ID, so the source of each message is identifiable. Thus multiple datagrams from different sources may be sent at the same time. Since the bootloader protocol is operating under a master-slave command-based architecture, the controller boards need only store and take action on messages from the client (with special ID 0), while the client must store datagrams from every controller board.

A timeout of 25ms (arbitrary, subject to change) should apply between messages in a datagram, after which the datagram transmission should be considered to have ended and the contents should be discarded.
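
One way the receive side could handle reassembly and the timeout is sketched below in Python. Timestamps are passed in explicitly so the logic is easy to test. The class name is made up, the header offsets assume the little-endian layout of the format above, and CRC checking is left out to keep the sketch short.

```python
def expected_length(buf):
    """Total datagram length once enough of the header has arrived, else None.
    Layout: version(1) crc(4) count(1) ids(count) size(2) data(size)."""
    if len(buf) < 6:
        return None
    count = buf[5]
    if len(buf) < 8 + count:
        return None
    size = int.from_bytes(buf[6 + count:8 + count], "little")
    return 8 + count + size


class DatagramReceiver:
    """Reassembles one datagram from a single source node."""

    def __init__(self, timeout=0.025):
        self.timeout = timeout   # seconds allowed between messages
        self._buf = None         # None means no datagram in progress
        self._last_rx = 0.0

    def feed(self, is_start, payload, now):
        """Feed one CAN message; returns the full datagram when complete."""
        if self._buf is not None and now - self._last_rx > self.timeout:
            self._buf = None     # timed out: discard the partial datagram
        if is_start:
            self._buf = bytearray()   # a start message begins a new datagram
        elif self._buf is None:
            return None          # continuation without a start: ignore it
        self._buf += payload
        self._last_rx = now
        total = expected_length(self._buf)
        if total is not None and len(self._buf) >= total:
            done = bytes(self._buf[:total])
            self._buf = None
            return done
        return None
```

A controller board would need only one of these (for the client), while the client would keep one per controller board ID.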

CAN ID allocation

A remaining consideration is what IDs the messages should have. The CAN standard message format gives us an 11-bit arbitration ID, where lower IDs have priority. Our normal CAN infrastructure partitions this into 3 parts: a 6-bit message ID, a 1-bit ACK flag, and a 4-bit device ID (set per project). We include the device ID in the arbitration ID because if two nodes on the CAN network try to send a message with the same ID at the same time, arbitration can’t pick a winner and the two frames collide on the bus, which causes errors. So we guarantee different IDs by embedding a device ID in the arbitration ID.

A 4-bit device ID works fine for our normal CAN system since we have <16 boards with MCUs, but we might have up to 50 or so controller boards, so we need a 6-bit ID. We’ve also got device ID 0 reserved for some reason. Thus I propose the following structure for a bootloader datagram frame arbitration ID:

  • A 6-bit source/node/controller board ID, taking up the message ID slot

  • A 1-bit start-of-datagram flag, 1 for start messages and 0 for the rest, taking up the ACK flag’s bit

  • 4 bits of zeros for the device ID - we can even call this SYSTEM_CAN_DEVICE_ID_BOOTLOADER.

This can be implemented via a very minimal change in can_fsm.c: in prv_handle_rx, if rx_msg.device_id == SYSTEM_CAN_DEVICE_ID_BOOTLOADER, either call a bootloader function to handle it if we’re in the bootloader or else just ignore the message/jump back to the bootloader.

This scheme has the disadvantage that the node ID is at the beginning, so bootloader datagrams from controller boards aren’t given a higher priority in general. However, the client (which is the only party broadcasting extremely long and important datagrams like flashing content) has the special node ID 0, so the client’s non-starting datagram messages have the highest priority on the bus (all zeros) while the client’s starting messages have close to it - in our setup, bested only by the BPS heartbeat.
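
To check the priority argument, here is a Python sketch of the proposed arbitration ID and of splitting a datagram into 8-byte CAN messages. I’m assuming the three fields are laid out from most significant to least significant bit in the order listed (node ID, start flag, device ID); the function names are hypothetical.

```python
BOOTLOADER_DEVICE_ID = 0  # the proposed SYSTEM_CAN_DEVICE_ID_BOOTLOADER
CLIENT_NODE_ID = 0        # node ID reserved for the client
CAN_MAX_PAYLOAD = 8       # data bytes per classic CAN message


def bootloader_arbitration_id(node_id, is_start):
    """Build the 11-bit ID: 6-bit node ID | 1-bit start flag | 4-bit device ID."""
    if not 0 <= node_id < 64:
        raise ValueError("node ID must fit in 6 bits")
    return (node_id << 5) | (int(is_start) << 4) | BOOTLOADER_DEVICE_ID


def split_into_frames(node_id, datagram):
    """Return (arbitration_id, payload) pairs; only the first is a start message."""
    frames = []
    for offset in range(0, len(datagram), CAN_MAX_PAYLOAD):
        arbitration_id = bootloader_arbitration_id(node_id, offset == 0)
        frames.append((arbitration_id, datagram[offset:offset + CAN_MAX_PAYLOAD]))
    return frames


if __name__ == "__main__":
    for arbitration_id, payload in split_into_frames(CLIENT_NODE_ID, bytes(20)):
        print(hex(arbitration_id), len(payload))
```

Under this layout the client’s continuation messages get ID 0x000 and its start messages 0x010, while the largest possible bootloader ID (node 63, start flag set) is 0x7F0, which still fits in 11 bits.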

Under this scheme, code flashed via the bootloader can coexist with code flashed the traditional way, but the scheme does require that we reflash all boards so that every node on the network is aware of and at least ignores bootloader messages.

Breaking into the bootloader

One other topic: we should have a way to jump from the application code back to the bootloader via a CAN message to run more commands. This is a peripheral feature since we can just power cycle the system. We can either do it upon receipt of any bootloader datagram start message (and pass the start message back to the bootloader), or we can do it with a normal CAN message with a handler pre-registered. In any case, we’d have to initialize CAN in smoke tests and small projects in order for them to be accessible via this method.

Client and API

The core of the client should be implemented as a modular Python script so that it can be deployed not just on x86 but also on e.g. a Raspberry Pi in the car to enable over-the-air firmware updates. The make interface to the bootloader client can be implemented as just shelling out to the Python script.

The make interface might look something like this:

Code Block
# Query info from the boards on the network
# (shown here with all specifiers: these are used to match boards in all commands)
make boot-query ID=2,3,4 NAME=alpha,bravo CURRENT=solar,mci INFO=rear COMMIT=0bd1e
# note: CURRENT is so named to not conflict with PROJECT when flashing
# possible alternatives/aliases: CURRENTPROJECT, CURRPROJECT
make boot-ls  # possible alias
make ls  # possible alias

# Ping some IDs just to see if they're alive - mostly for debugging
make boot-ping ID=5,6,2

# Jump to the application code
make boot-start NAME=inky,winky,blinky,clyde
make boot  # possible alias

# Update ID or name (should have an "are you sure? y/n" prompt)
make boot-update-id ID=2 NEWID=3
make boot-update-name NAME=charlie NEWNAME=chaplin

# Flash the board/boards with one project over CAN
make boot-flash NAME=alpha PROJECT=bms_carrier

# Get each board's current project and flash that board with the same project
make boot-update ID=2,8,12

# We should also have a way to get make build to use the new linker scripts for debugging/CI
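
On the Python side, the script that make shells out to might parse those specifiers with argparse along these lines. This is only a sketch: the program name, subcommand names, and flag names are all hypothetical, and only the specifiers shared by every command are shown.

```python
import argparse


def csv(value):
    """Split a comma-separated specifier such as "alpha,bravo"."""
    return [item for item in value.split(",") if item]


def csv_ints(value):
    """Split a comma-separated list of numeric IDs such as "2,3,4"."""
    return [int(item) for item in csv(value)]


def build_parser():
    parser = argparse.ArgumentParser(prog="bootloader_client")
    commands = parser.add_subparsers(dest="command", required=True)
    for name in ("query", "ping", "start", "flash", "update"):
        command = commands.add_parser(name)
        # Board specifiers, mirroring ID=, NAME=, CURRENT=, INFO=, COMMIT=
        command.add_argument("--id", type=csv_ints, default=[])
        command.add_argument("--name", type=csv, default=[])
        command.add_argument("--current", type=csv, default=[])
        command.add_argument("--info", type=csv, default=[])
        command.add_argument("--commit", default=None)
        if name == "flash":
            command.add_argument("--project", required=True)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

A make target like boot-query would then just forward its variables, e.g. passing ID=2,3,4 through as --id 2,3,4.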

Task List

  • Finalize design of the bootloader

  • Design the exact CAN protocol for each operation in terms of datagrams (maybe look into MessagePack?)

  • Implement the datagram protocol, both in Python (client) and in C (ms-common)

  • Create new linker scripts

  • Implement each command from both client (Python) and bootloader sides

    • The trickiest one is jumping to the application code - figure out how to do that while uninitializing everything

  • Write the main function / glue code of the bootloader, taking care to be very safe (using the redundancy in the config, checking CRCs)

Remember: we don’t want bugs in the bootloader! They’re hard to correct and can have bad consequences. Keep it simple and robust!