NVMe out-of-band firmware updates
We have recently been adding support to the libnvme and MCTP components of Linux that allows out-of-band communication with NVMe storage devices. One of the neat applications of this is managing device firmware - allowing a BMC to update firmware over an out-of-band (OOB) channel, without any intervention from the host system.
The protocol for the firmware updates is specified by the NVM Express standard, so device support is generally pretty good. As long as the device supports NVMe-MI, out-of-band firmware update should work.
We have published an earlier article covering the background of libnvme and NVMe-MI, but as a brief overview:
- the connector standards for NVMe devices (M.2 and U.2) define an I2C channel for out-of-band communication
- the NVMe spec defines a Management Interface (MI) component, using MCTP over this I2C channel
- the NVMe-MI spec provides a way to transfer NVMe Administration commands using NVMe-MI (over MCTP over I2C)
Since the firmware update process is just a couple of Administration commands, we can do firmware management over the hardware I2C connection.
One note first: this article describes the low-level mechanisms for performing firmware updates and adjusting communication parameters to improve transfer speeds. Work is currently underway to provide upper-layer components for firmware- and device management, like the OpenBMC NVMe infrastructure. Those components will provide a much more integrated approach to firmware management on NVMe devices.
General MCTP setup
For this to work, we'll need to be able to send MCTP messages to the NVMe device. There are a few background details in our previous MCTP on Linux document, but we've included some specific steps here too.
Firstly, we need to define the hardware connection, by defining an I2C controller as an MCTP-over-I2C interface. This is usually a matter of using the MCTP I2C transport binding format in the I2C controller's device tree node.
This would typically look something like:
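i2c6 {
    compatible = "vendor,your-i2c-controller";
    #address-cells = <1>;
    #size-cells = <0>;
    /* mark this bus as able to carry MCTP traffic */
    mctp-controller;

    mctp@10 {
        compatible = "mctp-i2c-controller";
        /* our own (local) I2C address; 0x10 is an arbitrary choice */
        reg = <(0x10 | I2C_OWN_SLAVE_ADDRESS)>;
    };
};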
The mctp-controller and mctp@x nodes define this I2C device as capable of transferring MCTP messages, using a local I2C address of 0x10 in this example, but that is an arbitrary choice. This configuration can be applied to any I2C controller node, as long as the controller driver supports both host- and client modes.
In Linux, this will create a mctpi2cN network interface, where N is derived from the Linux I2C bus number. We use bus 6 on our development hardware, so we'll use that for the examples in this document.
To check that the interface is present, use mctp link:
Next, we can apply the MCTP network configuration. This can be done a number of ways, but here we'll set everything up manually, just through the mctp utility. Here we'll use a local MCTP endpoint address (EID) of 8:
We'll use the MCTP daemon mctpd for enumeration and address assignments, so get that started with something like:
Then we can tell mctpd to enumerate our NVMe device, by telling it about the I2C endpoint, by interface and physical address. We do this via mctpd's dbus interface, using busctl on the command line:
[this command has been changed recently for mctpd v2.0; see below for a v1.x equivalent]
The 0x1d value here is the physical (i.e., I2C) address of the device. 0x1d is the standard default I2C address for NVMe devices.
The dbus response from mctpd indicates that the EID 9 was assigned to the NVMe device, and it belongs to MCTP network 1 (the default).
We can query mctpd to see some details about the device's MCTP support:
The SupportedMessageTypes property contains an array of the MCTP message type values that the device advertises support for. In this case, it's reporting support for the MCTP Control Protocol (type 0), and NVMe-MI (type 4).
NVMe-MI communication
Once we have MCTP communication established, we can use a recent version of the nvme-cli tools (hosted at https://github.com/linux-nvme/nvme-cli), by using device references of the format:
mctp:<network-id>,<endpoint-id>[:<controller-id>]
If the controller-id value is not specified, it defaults to 0.
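To make the reference format concrete, here is a small Python sketch that parses it. This is only an illustration of the format described above, not how nvme-cli itself parses the argument; the MctpRef type and parse_mctp_ref function are hypothetical names, and the sketch assumes decimal values only.

```python
from typing import NamedTuple


class MctpRef(NamedTuple):
    """Hypothetical container for a parsed mctp: device reference."""
    network: int
    eid: int
    controller: int = 0  # the article states that 0 is the default


def parse_mctp_ref(ref: str) -> MctpRef:
    """Parse 'mctp:<network-id>,<endpoint-id>[:<controller-id>]'."""
    scheme, sep, rest = ref.partition(":")
    if scheme != "mctp" or not sep:
        raise ValueError(f"not an MCTP device reference: {ref!r}")
    # Split off the optional controller ID, then the network/EID pair.
    addr, _, ctrl = rest.partition(":")
    net, comma, eid = addr.partition(",")
    if not comma:
        raise ValueError(f"expected <network-id>,<endpoint-id> in {ref!r}")
    # Assumption: decimal values only; other bases are not handled here.
    return MctpRef(int(net), int(eid), int(ctrl) if ctrl else 0)


print(parse_mctp_ref("mctp:1,9"))
print(parse_mctp_ref("mctp:1,9:2"))
```

The first call leaves the controller ID at its default; the second selects controller 2 on the same endpoint.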
The device in our examples above is present on network 1, with EID 9. So, we can query the device identify data with:
$ nvme id-ctrl mctp:1,9
This shows that we have NVMe-MI communication working with our device.
Firmware updates
Now that we have established that NVMe-MI communication works, we can perform a firmware update. This is a sequence of two commands:
- Firmware Download - to send the firmware data, possibly in small chunks, to the device
- Firmware Commit - to activate the downloaded data as a new firmware image
There is also an NVMe Log Page that gives us information about the firmware images present on the device.
We can inspect the firmware log page with:
$ nvme fw-log mctp:1,9
- which shows two slots of firmware of different versions. The id-ctrl output above indicates that we're currently using the second (CC000002).
Send a new firmware image to the device with nvme fw-download:
$ nvme fw-download mctp:1,9 --fw firmware-CC000003.bin
Firmware download success
- this may take a while, depending on the firmware size. More on that below.
After sending the firmware to the device, we can activate that recent download with nvme fw-commit:
$ nvme fw-commit mctp:1,9 --action 1 --slot 3
Success activating firmware action:1 slot:3, but firmware requires subsystem reset
Multiple Update Detected (MUD) Value: 0
The --action value of 1 indicates a "replace and activate" request - the new image replaces whatever was in slot 3 (from --slot 3), and sets it to be activated on the next device reset.
The possible actions are given in section 5.12 of the NVMe specification, or in the libnvme source code.
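As a quick reference, the commit action values can be sketched as a Python enum, together with a helper that builds the fw-commit command line used above. The action numbers follow the Firmware Commit definition in the NVMe base specification; the enum member names and the fw_commit_argv helper are hypothetical names chosen for this illustration, not identifiers from libnvme.

```python
from enum import IntEnum


class CommitAction(IntEnum):
    """Firmware Commit action values (member names are illustrative)."""
    REPLACE = 0                     # store image in slot, do not activate
    REPLACE_ACTIVATE_ON_RESET = 1   # store image, activate at next reset
    ACTIVATE_ON_RESET = 2           # activate existing slot image at next reset
    REPLACE_ACTIVATE_IMMEDIATE = 3  # store image, activate without a reset
    REPLACE_BOOT_PARTITION = 6      # image replaces a boot partition
    ACTIVATE_BOOT_PARTITION = 7     # mark a boot partition as active


def fw_commit_argv(network: int, eid: int, action: CommitAction, slot: int) -> list:
    """Build an argv list matching the nvme fw-commit example above."""
    return [
        "nvme", "fw-commit", f"mctp:{network},{eid}",
        "--action", str(int(action)),
        "--slot", str(slot),
    ]


argv = fw_commit_argv(1, 9, CommitAction.REPLACE_ACTIVATE_ON_RESET, 3)
print(" ".join(argv))
```

Building the argv as a list (rather than a single string) makes it straightforward to pass to something like subprocess.run without shell quoting concerns.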
We should see the new firmware in the Firmware Slot Information log page:
and after the next reset, the fr field of the Identify data should reflect the new firmware version:
Hooray!
Performance: bus speed, protocol overhead and packet size
The firmware download process can potentially involve sending a lot of data over the MCTP transport, itself using I2C as the hardware bus. A firmware payload needs to be broken up into NVMe messages, and those messages are themselves divided into MCTP packets.
Given:
- A 100kHz I2C bus speed
- A 3MB firmware download
- A 68 byte MCTP Maximum Transmission Unit (MTU) (4 byte header, 64 byte payload)
We would need to split the firmware transfer into 4096-byte chunks (the maximum payload size of NVMe-MI commands), each sent as a separate command, and therefore a separate MCTP message. Each message requires a 64 byte NVMe Admin command header, plus a 4-byte Message Integrity Check (MIC).
So, we have 64 (header) + 4096 (payload) + 4 (MIC) = 4164 bytes for the message.
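The chunking step can be sketched in Python as below. This is a rough model of the sizes described here rather than what nvme-cli does internally; the function and constant names are hypothetical, and the 3MB image is just a zero-filled placeholder.

```python
CHUNK_SIZE = 4096    # firmware bytes per NVMe-MI command (from the article)
ADMIN_HEADER = 64    # NVMe Admin command header, in bytes
MIC = 4              # Message Integrity Check, in bytes


def firmware_chunks(image: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (offset, data) pairs covering the whole firmware image."""
    for offset in range(0, len(image), chunk_size):
        yield offset, image[offset:offset + chunk_size]


def message_size(payload_len: int) -> int:
    """Size of one NVMe-MI command message carrying payload_len bytes."""
    return ADMIN_HEADER + payload_len + MIC


image = bytes(3 * 1024 * 1024)  # placeholder 3MB image, all zeroes
chunks = list(firmware_chunks(image))
print(len(chunks), "commands of", message_size(len(chunks[0][1])), "bytes")
```

Each (offset, data) pair corresponds to one Firmware Download command, so the number of pairs is also the number of MCTP messages sent to the device.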
Because of the 68 byte MCTP MTU, these 4164-byte NVMe-MI command messages need to be broken up into 66 MCTP packets (65 × 64 bytes, 1 × 4 bytes). Each packet then requires 4 bytes of MCTP header.
When transmitting over I2C, each MCTP packet requires an extra 5 bytes: four bytes of I2C transport header, and 1 byte of PEC. We also need start and stop bits, but we'll disregard those for this calculation.
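The packet split can be checked with a few lines of Python. This is only a sketch using the sizes given in this section; split_into_packets is a hypothetical helper name.

```python
MCTP_HEADER = 4     # bytes of MCTP header per packet
I2C_OVERHEAD = 5    # 4 bytes of I2C transport header + 1 byte of PEC


def split_into_packets(message_len: int, payload_per_packet: int) -> list:
    """Return the payload size of each MCTP packet for one message."""
    full, tail = divmod(message_len, payload_per_packet)
    sizes = [payload_per_packet] * full
    if tail:
        sizes.append(tail)  # final, partially-filled packet
    return sizes


sizes = split_into_packets(4164, 64)
print(len(sizes), "packets; last packet carries", sizes[-1], "bytes")
print("per-packet overhead:", MCTP_HEADER + I2C_OVERHEAD, "bytes")
```

The per-packet overhead is fixed, so the fewer packets we need for a message, the less of the bus time is spent on headers.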
So, the total number of bytes transferred for each NVMe-MI command:
(64 + 4 + 5) × 65 + (4 + 4 + 5) × 1 = 4758 bytes over I2C
We need one response from the drive for each message: that will be a 44 byte NVMe-MI message, so a single MCTP packet with 5 bytes of I2C overhead.
This gives
4758 + 49 = 4807 bytes over I2C
- for transferring each 4096-byte chunk of firmware.
We then need 768 of those messages to transfer the entire 3MB payload, so 3691776 bytes - giving a 15% protocol overhead.
With an I2C bus running at 100kHz, and ignoring packet and message turnaround times (which are certainly not zero in practice), that would take
3691776 × 8 = 29534208 bits
29534208 ÷ (100 × 1000) = 295 seconds
So our best-case transfer time for this configuration is just under 5 minutes.
In practice, the NVMe-message turnaround times can be around 20ms - 100ms, giving another 15 - 75 seconds for our sequence of 768 messages. This will vary between NVMe device implementations, of course.
We can improve this with two mechanisms:
- Increasing the MCTP MTU, to reduce protocol overhead
- Increasing the I2C bus speed
MCTP MTU
By increasing the MCTP MTU, we can reduce the number of MCTP headers and I2C headers & PEC bytes. The maximum allowable MCTP-over-I2C message size is 254 bytes, so we would have 17 packets for each message (16 × 254 byte, 1 × 100 byte).
That gives us:
(254 + 4 + 5) × 16 + (100 + 4 + 5) × 1 = 4317 bytes over I2C
Including the response, which is still a single 49-byte MCTP message:
4317 + 49 = 4366 bytes over I2C
- for transferring each 4096-byte chunk of firmware.
In this case, the total bytes transferred is 3353088, and our protocol overhead has dropped to 6%. Timing for this transfer would be, at minimum:
3353088 × 8 = 26824704 bits
26824704 ÷ (100 × 1000) = 268 seconds
We're a little better, at 4.5 minutes. We may also need to account for the same message-turnaround time of 15-75 seconds, as we have the same number of NVMe-MI messages sent. However, any inter-packet delays will be reduced to approximately one quarter.
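The arithmetic for the larger MTU can be reproduced with a short Python model. It is a sketch built on the same simplifications as the text (start/stop bits and turnaround times are ignored, and the response is counted as 49 bytes on the wire); the function name and parameters are hypothetical.

```python
FIRMWARE_BYTES = 3 * 1024 * 1024   # 3MB firmware image
CHUNK_SIZE = 4096                  # firmware bytes per NVMe-MI command
MESSAGE_LEN = 64 + CHUNK_SIZE + 4  # header + payload + MIC


def bytes_per_command(message_len: int, payload_per_packet: int,
                      per_packet_overhead: int = 4 + 5,
                      response_wire_bytes: int = 49) -> int:
    """I2C bytes for one command message plus its single-packet response."""
    full, tail = divmod(message_len, payload_per_packet)
    packets = full + (1 if tail else 0)
    return message_len + packets * per_packet_overhead + response_wire_bytes


commands = FIRMWARE_BYTES // CHUNK_SIZE
total = commands * bytes_per_command(MESSAGE_LEN, 254)
overhead = 1 - FIRMWARE_BYTES / total   # share of the bus used by protocol
seconds = total * 8 / 100_000           # 100kHz bus, one bit per clock

print(f"{total} bytes, {overhead:.0%} overhead, {seconds:.0f} seconds")
```

Changing the second argument of bytes_per_command lets you compare other packet payload sizes under the same assumptions.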
Setting the MTU requires two adjustments: changing the MTU on the Linux MCTP stack, and changing the MTU on the NVMe device.
For the NVMe device, we can query the maximum supported MTU through the Read MI Data Structure commands, for the SMBus port:
... and the currently-configured MTU through MI Get Configuration commands, specifying that same port index (1 for SMBus, shown above):
Then, we can set the MTU to the advertised maximum:
For the Linux side, we can set a route-specific MTU for this particular device, by deleting the current route to that endpoint, and re-creating with a larger MTU:
$ mctp route del 9 via mctpi2c6
$ mctp route add 9 via mctpi2c6 mtu 254
Alternatively, we can use mctpd's dbus interface to set the MTU on the route for this device:
Note that we're using 254 for the route here, and 250 for the NVMe configuration. The latter does not include the MCTP overhead, while the former does.
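The relationship between the two values is just the 4-byte MCTP header. A minimal Python sketch, assuming that header size and using a hypothetical helper name:

```python
MCTP_HEADER_LEN = 4  # bytes of MCTP packet header


def nvme_mi_mtu(route_mtu: int) -> int:
    """Convert a Linux MCTP route MTU to the NVMe-MI MTU setting.

    The route MTU counts the MCTP header; the NVMe-MI value does not.
    """
    if route_mtu <= MCTP_HEADER_LEN:
        raise ValueError("route MTU must be larger than the MCTP header")
    return route_mtu - MCTP_HEADER_LEN


print(nvme_mi_mtu(254))  # value for the NVMe device configuration
print(nvme_mi_mtu(68))   # the baseline 68-byte MTU
```

Keeping this conversion in one place helps avoid configuring the two sides with mismatched sizes.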
There is a small utility in the libnvme example code, mi-conf.c, which can query the device and configure the MTUs appropriately:
$ mi-conf mctp:1,9
MTU for port 1 set to 250 (was 64)
I2C bus frequency
The NVMe-MI specification provides for multiple bus frequencies; 100kHz being the default, but devices may advertise support for 400kHz and 1MHz too.
Increasing the bus speed should give us a theoretical linear performance improvement, as we're sending the same bits, just faster. With our 254-byte MCTP MTU, we can perform the same 3MB firmware transfer in:
- 268 seconds at 100kHz
- 67 seconds at 400kHz
- 27 seconds at 1MHz
Of course, this depends on the NVMe device, I2C controller and platform implementation all supporting those speeds.
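These figures follow directly from the total byte count of the 254-byte MTU case. A small Python sketch of the scaling, again ignoring turnaround times and start/stop bits (the function name is hypothetical):

```python
TOTAL_BYTES = 3353088  # bytes over I2C for the 3MB transfer at MTU 254


def transfer_seconds(total_bytes: int, bus_hz: int) -> float:
    """Best-case transfer time, assuming one bit per bus clock cycle."""
    return total_bytes * 8 / bus_hz


for bus_hz in (100_000, 400_000, 1_000_000):
    print(f"{bus_hz // 1000}kHz: {transfer_seconds(TOTAL_BYTES, bus_hz):.0f} seconds")
```

In practice the fixed per-message turnaround time does not shrink with bus speed, so it becomes a larger share of the total at higher frequencies.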
Like the MTU, we need to configure this on both the Linux I2C side, and the NVMe device.
The Linux bus speed configuration determines the bus speed while the kernel is driving the I2C bus - i.e., when messages are being sent from host to device. The NVMe device configuration determines the bus speed while the device is driving the I2C bus - i.e., when responses are being sent from device to host.
These should be configured to the same value. In practice, we have found that devices are fine receiving I2C data at any speed (up to their hardware maximum), and the NVMe device configuration only determines the speed of I2C transmissions.
The Linux-driven bus speed is determined by the I2C controller driver itself, and is typically global for each I2C bus. It is not currently possible to have different speeds for individual remote I2C clients.
This configuration is typically part of the I2C controller node in the device tree:
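i2c6 {
    compatible = "vendor,your-i2c-controller";
    #address-cells = <1>;
    #size-cells = <0>;
    /* clock i2c bus at 400kHz; the property name depends on the
     * controller's binding (commonly clock-frequency, or
     * bus-frequency on some controllers) */
    clock-frequency = <400000>;
    mctp-controller;

    mctp@10 {
        compatible = "mctp-i2c-controller";
        reg = <(0x10 | I2C_OWN_SLAVE_ADDRESS)>;
    };
};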
Then, for the device, we can use the same MI Set Configuration command as for the MTU:
Postscript: equivalents for mctpd version 1.x
The command examples above are given for mctpd version 2.0, which is fairly recent. This section gives equivalents for the dbus interface in mctpd 1.x.
For the SetupEndpoint call:
For introspecting the created device:
For the SetMTU call:
Updates
- 2024-09-19: Added mctpd v2.0 call interfaces