We have recently been adding support to the libnvme and MCTP components of Linux that allows out-of-band communication with NVMe storage devices. One of the neat applications of this is managing device firmware: a BMC can update firmware over an out-of-band (OOB) channel, without any intervention from the host system.

The protocol for the firmware updates is specified by the NVM Express standard, so device support is generally pretty good. As long as the device supports NVMe-MI, out-of-band firmware update should work.

We have published an earlier article covering the background of libnvme and NVMe-MI, but as a brief overview:

Since the firmware update process is just a couple of NVMe Admin commands, we can do firmware management over the hardware I2C connection.

One note first: this article describes the low-level mechanisms for performing firmware updates and adjusting communication parameters to improve transfer speeds. Work is currently underway to provide upper-layer components for firmware- and device management, like the OpenBMC NVMe infrastructure. Those components will provide a much more integrated approach to firmware management on NVMe devices.

General MCTP setup

For this to work, we'll need to be able to send MCTP messages to the NVMe device. There are a few background details in our previous MCTP on Linux document, but we've included some specific steps here too.

Firstly, we need to define the hardware connection, by defining an I2C controller as an MCTP-over-I2C interface. This is usually a matter of using the MCTP I2C transport binding format in the I2C controller's device tree node.

This would typically look something like:

    i2c6 {
      compatible = "vendor,your-i2c-controller";
      #address-cells = <1>;
      #size-cells = <0>;

      mctp-controller;

      mctp@10 {
        compatible = "mctp-i2c-controller";
        reg = <(0x10 | I2C_OWN_SLAVE_ADDRESS)>;
      };
    };

The mctp-controller and mctp@x nodes define this I2C device as capable of transferring MCTP messages, using a local I2C address of 0x10 in this example, but that is an arbitrary choice. This configuration can be applied to any I2C controller node, as long as the controller driver supports both host- and client modes.

In Linux, this will create a mctpi2cN network interface, where N is derived from the Linux I2C bus number. We use bus 6 on our development hardware, so we'll use that for the examples in this document.

To check that the interface is present, use mctp link:

$ mctp link show
dev lo index 1 address 0x00:00:00:00:00:00 net 1 mtu 65536 up
dev mctpi2c6 index 6 address 0x10 net 1 mtu 254 down

Next, we can apply the MCTP network configuration. This can be done in a number of ways, but here we'll set everything up manually using the mctp utility, with a local MCTP endpoint address (EID) of 8:

$ mctp address add 8 dev mctpi2c6
$ mctp link set mctpi2c6 up

We'll use the MCTP daemon mctpd for enumeration and address assignments, so get that started with something like:

$ systemctl start mctpd

Then we can ask mctpd to enumerate our NVMe device, by giving it the interface and physical address of the I2C endpoint. We do this via mctpd's dbus interface, using busctl on the command line:

$ busctl --no-pager call \
    au.com.codeconstruct.MCTP1 \
    /au/com/codeconstruct/mctp1/interfaces/mctpi2c6 \
    au.com.codeconstruct.MCTP.BusOwner1 \
    SetupEndpoint ay 1 0x1d
yisb 9 1 "/au/com/codeconstruct/mctp1/networks/1/endpoints/9" true

[this command has been changed recently for mctpd v2.0; see below for a v1.x equivalent]

The 0x1d value here is the physical (i.e., I2C) address of the device. 0x1d is the standard default I2C address for NVMe devices.

The dbus response from mctpd indicates that the EID 9 was assigned to the NVMe device, and it belongs to MCTP network 1 (the default).
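When scripting this step, the reply line needs to be split back into its fields. The following is a minimal Python sketch, assuming the reply has the `yisb` shape shown above; the function name is hypothetical, and the final boolean is passed through unchanged since its meaning is not described here.

```python
import shlex


def parse_setup_endpoint_reply(line):
    """Split a busctl 'yisb' reply into (eid, network, object_path, flag)."""
    # shlex handles the quoted object path in the busctl output.
    signature, eid, net, path, flag = shlex.split(line)
    if signature != "yisb":
        raise ValueError(f"unexpected reply signature: {signature!r}")
    return int(eid), int(net), path, flag == "true"


reply = 'yisb 9 1 "/au/com/codeconstruct/mctp1/networks/1/endpoints/9" true'
eid, net, path, flag = parse_setup_endpoint_reply(reply)
print(f"EID {eid} on network {net}, object {path}")
```

The EID and network values are what the later nvme and mi-mctp commands need.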

We can query mctpd to see some details about the MCTP support:

$ busctl --no-pager introspect \
    au.com.codeconstruct.MCTP1 \
    /au/com/codeconstruct/mctp1/networks/1/endpoints/9 \
    xyz.openbmc_project.MCTP.Endpoint
NAME                    TYPE      SIGNATURE RESULT/VALUE FLAGS
.EID                    property  y         9            const
.NetworkId              property  i         1            const
.SupportedMessageTypes  property  ay        1 4          const

The SupportedMessageTypes property contains an array of the MCTP message type values that the device advertises support for. busctl prints an array as its element count followed by the elements, so 1 4 here is a single entry: message type 4, NVMe-MI.

NVMe-MI communication

Once we have MCTP communication established, we can use a recent version of the nvme-cli tools (hosted at https://github.com/linux-nvme/nvme-cli), by using device references of the format:

mctp:<network-id>,<endpoint-id>[:<controller-id>]

If the controller-id value is not specified, it defaults to 0.
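To make the reference format concrete, here is a small Python sketch of a parser for it. This is illustrative only and is not how nvme-cli parses the string; it assumes decimal fields, and the function name is hypothetical.

```python
import re

# mctp:<network-id>,<endpoint-id>[:<controller-id>]
_DEVICE_REF = re.compile(r"^mctp:(\d+),(\d+)(?::(\d+))?$")


def parse_device_ref(ref):
    """Return (network_id, endpoint_id, controller_id) for an mctp: reference."""
    m = _DEVICE_REF.match(ref)
    if m is None:
        raise ValueError(f"not an MCTP device reference: {ref!r}")
    net, eid, ctrl = m.groups()
    # The controller ID is optional and defaults to 0.
    return int(net), int(eid), (int(ctrl) if ctrl is not None else 0)


print(parse_device_ref("mctp:1,9"))
print(parse_device_ref("mctp:1,9:2"))
```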

The device in our examples above is present on network 1, EID 9. So, we can query the device identify data with:

$ nvme id-ctrl mctp:1,9
NVME Identify Controller:
vid       : 0xccde
ssvid     : 0x0123
sn        : 5314F9222890        
mn        : Code Construct NVMe device
fr        : CC000002
rab       : 3
ieee      : 00a075
cmic      : 0
mdts      : 10   
[...]

This shows that we have NVMe-MI communication working with our device.

Firmware updates

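Now that we have established that NVMe-MI communication works, we can perform a firmware update. This is a sequence of two commands: a Firmware Image Download, which transfers the new image to the device, followed by a Firmware Commit, which assigns that image to a firmware slot and optionally activates it.

There is also an NVMe Log Page that gives us information about the firmware images present on the device.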

We can inspect the firmware log page with:

$ nvme fw-log mctp:1,9
Firmware Log for device:mctp:1,9
afi  : 0x1
frs1 : 0x3130303030304343 (CC000001)
frs2 : 0x3230303030304343 (CC000002)

- which shows two slots of firmware, with different versions. The id-ctrl output above indicates that we're currently using the second (CC000002).
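The frs values are 8-byte ASCII revision strings, printed by nvme-cli as a 64-bit number with the first character in the least-significant byte. A short Python sketch of the decoding (the function name is hypothetical):

```python
def decode_frs(value):
    """Decode a firmware-slot revision, printed as a 64-bit integer, to ASCII."""
    # The first character of the revision string is the least-significant byte.
    raw = value.to_bytes(8, "little")
    # Revisions shorter than 8 characters are padded; strip the padding.
    return raw.decode("ascii").rstrip(" \x00")


print(decode_frs(0x3130303030304343))
print(decode_frs(0x3230303030304343))
```

This is the same conversion that produces the strings in parentheses in the fw-log output.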

Send a new firmware image to the device with nvme fw-download:

$ nvme fw-download mctp:1,9 --fw firmware-CC000003.bin
Firmware download success

- this may take a while, depending on the firmware size. More on that below.

After sending the firmware to the device, we can activate that recent download with nvme fw-commit:

$ nvme fw-commit mctp:1,9 --action 1 --slot 3
Success activating firmware action:1 slot:3, but firmware requires subsystem reset
Multiple Update Detected (MUD) Value: 0

The --action value of 1 indicates a "replace and activate" request - the new image replaces whatever was in slot 3 (from --slot 3), and sets it to be activated on the next device reset.

The possible actions are given in section 5.12 of the NVMe specification, or here in the libnvme source code.
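As a quick reference, the commit action values most relevant here can be written out as a Python enum. This is a sketch: the member names are our own shorthand rather than identifiers from libnvme or the specification, the boot-partition actions are omitted, and the helper only builds an argument list matching the command shown above.

```python
from enum import IntEnum


class CommitAction(IntEnum):
    # Replace the image in the slot, without activating it.
    REPLACE = 0
    # Replace the image in the slot, and activate it at the next reset.
    REPLACE_AND_ACTIVATE = 1
    # Activate the image already in the slot at the next reset.
    ACTIVATE_EXISTING = 2
    # Replace the image in the slot, and activate it immediately.
    REPLACE_AND_ACTIVATE_NOW = 3


def fw_commit_args(device, action, slot):
    """Build the nvme fw-commit argument list for a device reference."""
    return ["nvme", "fw-commit", device,
            "--action", str(int(action)), "--slot", str(slot)]


print(fw_commit_args("mctp:1,9", CommitAction.REPLACE_AND_ACTIVATE, 3))
```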

We should see the new firmware in the Firmware Slot Information log page:

$ nvme fw-log mctp:1,9
Firmware Log for device:mctp:1,9
afi  : 0x1
frs1 : 0x3130303030304343 (CC000001)
frs2 : 0x3230303030304343 (CC000002)
frs3 : 0x3330303030304343 (CC000003)

and after the next reset, the fr field of the Identify data should reflect the new firmware version:

$ nvme id-ctrl mctp:1,9
NVME Identify Controller:
vid       : 0xccde
ssvid     : 0x0123
sn        : 5314F9222890        
mn        : Code Construct NVMe device
fr        : CC000003
rab       : 3
ieee      : 00a075
cmic      : 0
mdts      : 10   
[...]

Hooray!

Performance: bus speed, protocol overhead and packet size

The firmware download process can potentially involve sending a lot of data over the MCTP transport, itself using I2C as the hardware bus. A firmware payload needs to be broken up into NVMe messages, and those messages are themselves divided into MCTP packets.

Given a 3MB firmware image, a 68-byte MCTP MTU and a 100kHz I2C bus:

We would need to split the firmware transfer into 4096-byte chunks (the maximum payload size of NVMe-MI commands), each sent as a separate command, and therefore a separate MCTP message. Each message requires a 64 byte NVMe Admin command header, plus a 4-byte Message Integrity Check (MIC).

So, we have 64 (header) + 4096 (payload) + 4 (MIC) = 4164 bytes for the message.

Because of the 68 byte MCTP MTU, these 4164-byte NVMe-MI command messages need to be broken up into 66 MCTP packets (65 × 64 bytes, 1 × 4 bytes). Each packet then requires 4 bytes of MCTP header.

When transmitting over I2C, each MCTP packet requires an extra 5 bytes: four bytes of I2C transport header, and 1 byte of PEC. We also need start and stop bits, but we'll disregard those for this calculation.

So, the total number of bytes transferred for each NVMe-MI command:

(64 + 4 + 5) × 65 + (4 + 4 + 5) × 1 = 4758 bytes over I2C

We need one response from the drive for each message: that will be a 44 byte NVMe-MI message, so a single MCTP packet with 5 bytes of I2C overhead.

This gives

4758 + 49 = 4807 bytes over I2C

- for transferring each 4096-byte chunk of firmware.

We then need 768 of those messages to transfer the entire 3MB payload, so 3691776 bytes - giving a 17% protocol overhead.

With an I2C bus running at 100kHz, and assuming zero packet- and message turnaround times (which is definitely not the case in practice), that would take

            3691776 × 8 = 29534208 bits
29534208 ÷ (100 × 1000) = 295 seconds

So our best-case transfer time for this configuration is just under 5 minutes.

In practice, the NVMe-message turnaround times can be around 20ms - 100ms, giving another 15 - 75 seconds for our sequence of 768 messages. This will vary between NVMe device implementations, of course.

We can improve this with two mechanisms: increasing the MCTP MTU, and increasing the I2C bus frequency.

MCTP MTU

By increasing the MCTP MTU, we can reduce the number of MCTP headers and I2C headers & PEC bytes. The maximum allowable MCTP-over-I2C packet size is 254 bytes, so we would have 17 packets for each message (16 × 254 byte, 1 × 100 byte).

That gives us:

(254 + 4 + 5) × 16 + (100 + 4 + 5) × 1 = 4317 bytes over I2C

Including the response, which is still a single 49-byte MCTP message:

4317 + 49 = 4366 bytes over I2C

- for transferring each 4096-byte chunk of firmware.

In this case, the total bytes transferred is 3353088, and our protocol overhead has dropped to 6%. Timing for this transfer would be, at minimum:

            3353088 × 8 = 26824704 bits
26824704 ÷ (100 × 1000) = 268 seconds

We're a little better, at 4.5 minutes. We may also need to account for the same message-turnaround time of 15-75 seconds, as we have the same number of NVMe-MI messages sent. However, any inter-packet delays will be reduced to approximately one quarter.
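The arithmetic in this section is easy to get wrong by hand, so here is a small Python sketch that reproduces it. The constants are the figures quoted in the text (including the 49-byte response), and like the text it counts 254 bytes of payload per packet; the function names are our own. It assumes the image is a whole number of 4096-byte chunks, and slightly overestimates otherwise.

```python
import math

NVME_HEADER = 64   # NVMe Admin command header bytes per message
MIC = 4            # Message Integrity Check bytes per message
CHUNK = 4096       # firmware bytes carried by each NVMe-MI command
MCTP_HEADER = 4    # MCTP header bytes per packet
I2C_OVERHEAD = 5   # I2C transport header (4) plus PEC (1) per packet
RESPONSE = 49      # bytes on the bus for each single-packet response


def bytes_per_chunk(packet_payload):
    """Bytes on the I2C bus to send one firmware chunk and get its response."""
    message = NVME_HEADER + CHUNK + MIC
    packets = math.ceil(message / packet_payload)
    return message + packets * (MCTP_HEADER + I2C_OVERHEAD) + RESPONSE


def transfer_estimate(image_bytes, packet_payload, bus_hz=100_000):
    """Return (total bytes, fractional overhead, best-case seconds)."""
    chunks = math.ceil(image_bytes / CHUNK)
    total = chunks * bytes_per_chunk(packet_payload)
    return total, total / image_bytes - 1, total * 8 / bus_hz


total, overhead, seconds = transfer_estimate(3 * 1024 * 1024, 254)
print(f"{total} bytes, {overhead:.1%} overhead, {seconds:.0f} s")
```

Passing a different per-packet payload or bus frequency gives the corresponding best-case estimate; none of these figures include packet or message turnaround time.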

Setting the MTU requires two adjustments: changing the MTU on the Linux MCTP stack, and changing the MTU on the NVMe device.

For the NVMe device, we can query the maximum supported MTU through the Read MI Data Structure commands, for the SMBus port:

$ mi-mctp 1 9 info
NVMe MI subsys info:
 num ports: 2
 major ver: 1
 minor ver: 1
NVMe MI port info:
  port 0
    type PCIe[1]
    MCTP MTU: 240
    MEB size: 0
    PCIe max payload: 0x200
    PCIe link speeds: 0x0f
    PCIe current speed: 0x01
    PCIe max link width: 0x04
    PCIe neg link width: 0x01
    PCIe port: 0x00
  port 1
    type SMBus[2]
    MCTP MTU: 250
    MEB size: 131072
    SMBus address: 0x53
    VPD access freq: 0x03
    MCTP address: 0x1d
    MCTP access freq: 0x03
    NVMe basic management: enabled
NVMe MI subsys health:
 subsystem status:  0x30
 smart warnings:    0xff
 composite temp:    59
 drive life used:   0%
 controller status: 0x0000

... and the currently-configured MTU through MI Get Configuration commands, specifying that same port index (1 for SMBus, shown above):

$ mi-mctp 1 9 get-config 1
SMBus access frequency (port 1): 100k [0x1]
MCTP MTU (port 1): 64

Then, we can set the MTU to the advertised maximum:

$ mi-mctp 1 9 set-config 1 mtu 250
$ mi-mctp 1 9 get-config 1
SMBus access frequency (port 1): 100k [0x1]
MCTP MTU (port 1): 250

For the Linux side, we can set a route-specific MTU for this particular device, by deleting the current route to that endpoint, and re-creating with a larger MTU:

$ mctp route del 9 via mctpi2c6
$ mctp route add 9 via mctpi2c6 mtu 254

Alternatively, we can use mctpd's dbus interface to set the MTU on the route for this device:

$ busctl --no-pager call \
    au.com.codeconstruct.MCTP1 \
    /au/com/codeconstruct/mctp1/networks/1/endpoints/9 \
    au.com.codeconstruct.MCTP.Endpoint1 \
    SetMTU u 254

Note that we're using 254 for the route here, and 250 for the NVMe configuration. The latter does not include the MCTP overhead, while the former does.
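The relationship between the two values can be captured in a few lines of Python. This sketch only derives the route MTU and builds the argument lists for the mctp route commands shown above; it does not run them, and the helper names are hypothetical.

```python
MCTP_HEADER_LEN = 4  # bytes of MCTP header carried in every packet


def route_mtu(device_mtu):
    """Linux route MTU for a device-side (NVMe-MI) MTU setting."""
    # The NVMe-MI value excludes the MCTP header; the Linux route includes it.
    return device_mtu + MCTP_HEADER_LEN


def route_commands(eid, ifname, device_mtu):
    """Argument lists to replace the route to eid with a larger-MTU one."""
    return [
        ["mctp", "route", "del", str(eid), "via", ifname],
        ["mctp", "route", "add", str(eid), "via", ifname,
         "mtu", str(route_mtu(device_mtu))],
    ]


for argv in route_commands(9, "mctpi2c6", 250):
    print(" ".join(argv))
```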

There is a small utility in the libnvme example code, mi-conf.c, which can query the device and configure the MTUs appropriately:

$ mi-conf mctp:1,9
MTU for port 1 set to 250 (was 64)

I2C bus frequency

The NVMe-MI specification provides for multiple bus frequencies; 100kHz being the default, but devices may advertise support for 400kHz and 1MHz too.

Increasing the bus speed should give us a theoretical linear performance improvement, as we're sending the same bits, just faster. With our 254-byte MCTP MTU, the same 3MB firmware transfer (26824704 bits) would take, at best, about 268 seconds at 100kHz, 67 seconds at 400kHz, or 27 seconds at 1MHz.

Of course, this depends on the NVMe device, I2C controller and platform implementation all supporting those speeds.

Like the MTU, we need to configure this on both the Linux I2C side, and the NVMe device.

The Linux bus speed configuration determines the bus speed while the kernel is driving the I2C bus - i.e., when messages are being sent from host to device. The NVMe device configuration determines the bus speed while the device is driving the I2C bus - i.e., when responses are being sent from device to host.

These should be configured to the same value. In practice, we have found that devices are fine receiving I2C data at any speed (up to their hardware maximum), and the NVMe device configuration only determines the speed of I2C transmissions.

The Linux-driven bus speed is determined by the I2C controller driver itself, and is typically global for each I2C bus. It is not currently possible to have different speeds for individual remote I2C clients.

This configuration is typically part of the I2C controller node in the device tree:

    i2c6 {
      compatible = "vendor,your-i2c-controller";
      #address-cells = <1>;
      #size-cells = <0>;

      /* clock i2c bus at 400kHz; the property name depends on the
       * controller's binding (clock-frequency is the generic one) */
      bus-speed = <400000>;

      mctp-controller;

      mctp@10 {
        compatible = "mctp-i2c-controller";
        reg = <(0x10 | I2C_OWN_SLAVE_ADDRESS)>;
      };
    };

Then, for the device, we can use the same MI Set Configuration command as for the MTU:

$ mi-mctp 1 9 set-config 1 freq 400k
$ mi-mctp 1 9 get-config 1
SMBus access frequency (port 1): 400k [0x2]
MCTP MTU (port 1): 250

Postscript: equivalents for mctpd version 1.x

The command examples above are given for mctpd version 2.0, which is fairly recent. This section gives equivalents for the dbus interface in mctpd 1.x.

For the SetupEndpoint call:

$ busctl --no-pager call \
    xyz.openbmc_project.MCTP \
    /xyz/openbmc_project/mctp au.com.CodeConstruct.MCTP \
    SetupEndpoint say mctpi2c6 1 0x1d
yisb 9 1 "/xyz/openbmc_project/mctp/1/9" true

For introspecting the created device:

$ busctl --no-pager introspect \
    xyz.openbmc_project.MCTP \
    /xyz/openbmc_project/mctp/1/9 \
    xyz.openbmc_project.MCTP.Endpoint
NAME                    TYPE      SIGNATURE RESULT/VALUE FLAGS
.EID                    property  y         9            const
.NetworkId              property  i         1            const
.SupportedMessageTypes  property  ay        1 4          const

For the SetMTU call:

$ busctl --no-pager call \
    xyz.openbmc_project.MCTP \
    /xyz/openbmc_project/mctp/1/9 \
    au.com.CodeConstruct.MCTP.Endpoint \
    SetMTU u 254

Updates