The Non-Volatile Memory Express (NVMe) standard describes an interface between storage controllers and host hardware. Alongside this, there's also a "management interface" (MI) standard for interacting with NVMe devices over a lighter-weight channel - typically to allow a management controller to discover and manage storage devices, over an out-of-band link, typically I²C/SMBus.

At Code Construct, we've been contributing some changes to the existing libnvme project, adding support for the NVMe-MI interface alongside the current NVMe-direct interface. By contributing the MI protocol support to this existing project, we're able to share a lot of the core NVMe protocol details, including standard data structures for device management.

The NVMe-MI support as been merged through an initial MI protocol support PR, plus a number of follow-up changes adding further MI functionality and expanding on test and documentation coverage. We owe a big thanks to Daniel Wagner, a maintainer of the libnvme project, for the collaboration and review on this work.

NVMe-MI components & command sets🔗

The MI spec provides for a few different types of commands to communicate with NVMe hardware:

Our focus is on providing management facilities for NVMe devices, so the MI & Admin command sets are the interesting ones here. The PCIe command-set is optional for devices to implement, so we'll leave that aside for now.

To summarise the command support:

Command setDefined inOperates onlibnvme-mi object
MINVMe-MIEndpointnvme_mi_ep_t
AdminNVMe baseControllernvme_mi_ctrl_t

Note: the NVMe-MI channel allows all sorts of control over the NVMe device, including potential Sanitize (ie., erase), read and write operations. If you're implementing an NVMe-MI channel to NVMe hardware, ensure you have suitable controls in place to ensure that these facilities can only be used with appropriate authorisation.

While the NVMe-MI communication can be performed over a few different transports, we have implemented this initial code using on top of the in-kernel MCTP infrastructure, which would typically be used on top of I²C/SMBus hardware. We have more details on the MCTP channel in our MCTP on Linux document.

That said, the libnvme-mi design is fairly transport-agnostic, adding others transports is definitely possible - they would just need their own endpoint constructor, and transport backend implementation.

The libnvme-mi API🔗

The MI-specific functions are built into a new library object, libnvme-mi.so, separate from the NVMe core interface in libnvme.so. However, the headers are mostly common across both libraries.

There are a couple of resources for using the libnvme-mi API:

but for the sake of examples, we'll show a small usage of the libnvme API to query a little information about the health status of a device, and retrieve some log information. Since this is example code, we have omitted error-checking; be sure to add it in if you're expanding on these examples.

First, we'll need to create a nvme_root_t - this is a handle to the "library-level" data structures, to handle logging and enumeration of devices.

    nvme_root_t root;
    
    root = nvme_mi_create_root(NULL, DEFAULT_LOGLEVEL);

Next, we open a handle to a specific NVMe-MI-over-MCTP endpoint. These are addressed by MCTP network-id and endpoint-id values:

    int net = 1;
    uint8_t eid = 9;

    nvme_mi_ep_t ep;
    ep = nvme_mi_open_mctp(root, net, eid);

This assumes we have prior knowledge of the MCTP network and EID - we have some facilities to enumerate available endpoints. More detail on that is in the enumeration section below.

Once we have our endpoint object, we can now issue MI command-set commands. libnvme-mi exposes these through a set of functions with a prefix of nvme_mi_mi_. To query the subsystem health status, we issue a NVMe-MI Subsystem Health Status Poll command (defined in section 5.6 of the NVMe-MI specification), using:

    struct nvme_mi_nvm_ss_health_status status;
    nvme_mi_mi_subsystem_health_status_poll(ep, false, &status);

This populates a struct nvme_mi_nvm_ss_health_status (defined in src/types.h), giving us the drive temperature and percentage of drive life used:

    printf("composite temperature: %d°C\n", status.ctemp);
    printf("drive life used: %d%%\n", status.pdlu);

This gives us a fairly straightforward method of querying the overall device status information.

We can do some more specific queries using the NVMe Admin command set. Unlike the MI commands, these are defined by the NVMe base specification, rather than the NVMe Management Interface specification. These Admin commands are exactly those described in the base spec, just tunneled over the MI channel.

Unlike MI commands, which operate on an NVMe-MI endpoint, the Admin commands operate on a NVMe controller. So, we need to create a controller object, which is a combination of endpoint and a controller id. We create controller objects with nvme_mi_init_ctrl:

    uint16_t ctrl_id = 1;
    nvme_mi_ctrl_t ctrl;
    ctrl = nvme_mi_init_ctrl(ep, ctrl_id);

Like the endpoint creation, this assumes we already know the controller ID. There is a little infrastructure to enumerate the controllers too, covered in the enumeration section below.

Using the ctrl object, we can now perform Admin command operations, like "Get Log Page" to query the device SMART log data:

    struct nvme_smart_log log;
    struct nvme_get_log_args args = {
            .args_size = sizeof(args),
            .log = &log,
            .len = sizeof(log),
            .lid = NVME_LOG_LID_SMART,
    };
    nvme_mi_admin_get_log_page(ctrl, &args);

    /* NVMe data is generally little-endian, convert */
    printf("time above warn temp: %d mins\n",
           le32toh(log.warning_temp_time));

    printf("time above crit temp: %d mins\n",
           le32toh(log.critical_comp_time));

The admin channel commands follow the conventions of the libnvme core API; most take an _args structure, which specifies the command parameters, and provides pointers for output data.

For the full set of available MI-channel Admin commands, check out the mi.h header.

Finally, don't forget to clean up!

    nvme_mi_free_root(root)

- this will free the root, including any endpoint and controller objects under that root. If more fine-grained control is required, nvme_mi_close() is available for freeing endpoints, and nme_mi_close_ctrl() for freeing controllers.

Device enumeration🔗

In the code samples above, we used two functions to create our endpoint and controller objects:

/* create an MCTP-connected endpoint from (net,eid) address data */
nvme_mi_ep_t nvme_mi_open_mctp(nvme_root_t root, int net, uint8_t eid)

/* create a controller object, from its controller id */
nvme_mi_ctrl_t nvme_mi_open_ctrl(nvme_mi_ep_t, uin16_t id);

However, these require us to know the IDs for network, endpoint and controller. These may be available through other means, but it would be nice if we can have these queried for us.

There are a couple of functions available in libnvme-mi to query the device topology. Firstly, we can find which endpoints are available with:

    nvme_root_t nvme_mi_scan_mctp(void);

This is a MCTP-specific function, that queries the system MCTP daemon ("mctpd"), over D-Bus, for NVMe-MI-capable endpoints that have been discovered. The MCTP daemon is a separate utility, available as part of the MCTP userspace tools, at https://github.com/CodeConstruct/mctp. Because it uses D-Bus, this requires that the libnvme-mi library was originally built with the libsystemd dependency enabled. If not, this will just return NULL.

On return, the root object will be populated with a list of discovered endpoints. We can iterate this list with the nvme_mi_for_each_endpoint macro:

    nvme_root_t root = nvme_mi_scan_mctp();

    nvme_mi_for_each_endpoint(root, ep) {
            do_something_with_endpoint(ep);
    }

If we're just performing MI commands, the endpoint reference will be sufficient. However, if we're also performing Admin commands, we'll need a controller object too. The controller-level discovery can be performed with:

    int nvme_mi_scan_ep(nvme_mi_ep_t ep, bool force_rescan);

This takes an existing nvme_mi_ep_t object, and populates its list of controllers. It does this by sending a Get MI Data command to the endpoint, requesting the list of attached controllers IDs, then creates a controller object for each reported ID.

We can iterate this list of discovered controllers with the nvme_mi_for_each_ctrl macro:

    nvme_mi_scan_ep(ep, false);

    nvme_mi_for_each_ctrl(ep, ctrl) {
        do_something_with_controller(ctrl);
    }

Using these examples, we can discover all of the connected NVMe controllers, without having advanced knowledge of the network, endpoint or controller IDs.

Future work🔗

Our work on the NVMe-MI command support is ongoing, and we have a few more changes queued, and in development, for further Admin commands. If you're interested in following-along with these, keep an eye on the libnvme upstream repository.

We'll also welcome any feedback you may have, particularly test reports for communicating with different NVMe-MI-capable hardware. Please send any comments or queries to Jeremy on jk@codeconstruct.com.au.