Describing binary data with Deku
Recently we've been diving into Deku, both working with it to implement an NVMe-MI emulator and various PLDM specifications, and working on it to bend it to our needs.
It's a project that sits in an interesting niche for embedded and systems work - providing declarative tools for bespoke communication and storage formats. As a small demonstration we'll explore using it to parse some ELF headers.
ELF is the Executable and Linkable Format, a binary structure for describing the various organisations of executable instructions that make up the on-disk and in-memory images of our applications. Its history stretches all the way back to System V Release 4.0 from 1988, and it is the dominant binary executable format used on UNIX and Linux systems.
A core philosophy of ELF is its extensibility, aiming to gracefully support common designs with 8-bit bytes and 32-bit or 64-bit architectures of either endianness[1], while still allowing for both larger and smaller machines. Part of the extensibility lies in its self-describing, machine-independent format. However, often where there's flexibility there's also complexity.
ELF currently divides its structures into two classes, Elf32 and
Elf64, targeting common respective machine architectures. With its
commitment to either byte-ordering, before we've finished parsing the ELF
header we already have to deal with the possibility of any of the four
combinations. To bootstrap to the point where we know which format and
endianness we require, ELF provides us with 16 bytes of identification
data, which
also gives us a nice entry-point to start implementing our Deku-based parser.
Reading an ELF header
Applying a C approach to Rust we may write:
There are a few concerns here, in that:
- we have to manually check or populate the ei_magic value,
- there's the ei_pad value, which is uninteresting for users of the struct, and
- we have to manually map this struct to the underlying bytes.
To address the last point we might apply #[repr(C)], but the former two
problems remain.
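To make these concerns concrete, here is a minimal hand-rolled sketch using only the standard library. The field names follow the identification bytes discussed above; the helper name `parse_ident`, the `ELF_MAGIC` constant and the `String` error type are assumptions made for illustration, not part of any library.

```rust
/// C-style mirror of the 16 identification bytes, magic and padding included.
#[derive(Debug, PartialEq)]
struct Identification {
    ei_magic: [u8; 4],
    ei_class: u8,
    ei_data: u8,
    ei_version: u8,
    ei_osabi: u8,
    ei_abiversion: u8,
    ei_pad: [u8; 7],
}

/// The four magic bytes that open every ELF file: 0x7f 'E' 'L' 'F'.
const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];

/// Hypothetical helper: maps raw bytes onto the struct by hand.
fn parse_ident(bytes: &[u8]) -> Result<Identification, String> {
    if bytes.len() < 16 {
        return Err(format!("need 16 bytes, got {}", bytes.len()));
    }
    // Concern 1: the magic value has to be checked explicitly.
    let mut ei_magic = [0u8; 4];
    ei_magic.copy_from_slice(&bytes[0..4]);
    if ei_magic != ELF_MAGIC {
        return Err("not an ELF file: bad magic".to_string());
    }
    // Concern 2: the padding is carried around even though nobody wants it.
    let mut ei_pad = [0u8; 7];
    ei_pad.copy_from_slice(&bytes[9..16]);
    // Concern 3: every field is mapped to its byte offset manually.
    Ok(Identification {
        ei_magic,
        ei_class: bytes[4],
        ei_data: bytes[5],
        ei_version: bytes[6],
        ei_osabi: bytes[7],
        ei_abiversion: bytes[8],
        ei_pad,
    })
}
```

Each of the three concerns shows up as code that has to be written, reviewed and kept in sync with the struct definition.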
Using Deku we can instead represent it as below, resolving all three concerns simultaneously:
extern crate deku;
Here we introduce several Deku features:
- automatic implementation of Deku traits by deriving DekuRead and DekuWrite,
- the magic attribute, and
- the pad_bytes_after attribute.
Together these remove the need for explicit handling of the magic value in both the deserialisation and serialisation paths, and ensure we don't have to describe the padding as a struct member in its own right as we did with the C-inspired approach.
Assuming a Linux-based system, we can use a little main() that reads its own
executable as a means to exercise the definitions:
use deku::DekuContainerRead;
This yields:
Identification {
ei_class: 2,
ei_data: 1,
ei_version: 1,
ei_osabi: 0,
ei_abiversion: 0,
}
So far so good. However, ELF specifies only a small set of valid values for each of the members, while the format fields and our current definition allow for arbitrary values.
Ergonomics and constraints with enum
In order to only decode against expected values in the e_ident fields, we can
expand our definitions to describe those values as enums:
extern crate deku;
// ...
Running this with the main() from above yields:
Identification {
ei_class: Elf64,
ei_data: Lsb,
ei_version: Current,
ei_osabi: None,
ei_abiversion: 0,
}
At once, with a relatively small amount of work, we have created an API that both provides the usual Rust ergonomics and constrains the accepted binaries, rejecting those with values outside the supported version of the specification.
The implementation takes advantage of the naturally specified enum
discriminant values through the id_type and repr attributes. In other
circumstances, an id attribute should be provided for each variant.
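For intuition, the constraint that the enum definitions express can be sketched by hand with `TryFrom<u8>`. This is a standard-library illustration rather than Deku's generated code. The discriminants are the ELFCLASS32/ELFCLASS64 and ELFDATA2LSB/ELFDATA2MSB values from the ELF specification; the type name `Class` and the `String` error type are assumptions for the example.

```rust
use std::convert::TryFrom;

/// File class, stored in ei_class. The discriminants are the on-disk values.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum Class {
    Elf32 = 1,
    Elf64 = 2,
}

/// Data encoding, stored in ei_data.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u8)]
enum DataEncoding {
    Lsb = 1,
    Msb = 2,
}

impl TryFrom<u8> for Class {
    type Error = String;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            1 => Ok(Class::Elf32),
            2 => Ok(Class::Elf64),
            // Anything else is rejected rather than carried along as a raw byte.
            other => Err(format!("unsupported ei_class value {}", other)),
        }
    }
}

impl TryFrom<u8> for DataEncoding {
    type Error = String;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            1 => Ok(DataEncoding::Lsb),
            2 => Ok(DataEncoding::Msb),
            other => Err(format!("unsupported ei_data value {}", other)),
        }
    }
}
```

Because the discriminants double as the serialised values, `Class::Elf64 as u8` gives the byte to write back out, which is the property the derive relies on when no per-variant id is given.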
Static vs Dynamic Endianness
We'll take a short detour to discuss Deku's handling of endianness in the face of ELF's flexibility. For problem domains such as networking, the endianness of the protocol is often statically defined by its specification. This makes for a slower-paced introduction.
Taking IP as an example
and borrowing a BSD-sockets addressing definition, we might represent struct in_addr as such:
As yet this does not handle endian-conversion, but with the addition of the field-level endian attribute we can make it so:
s_addr: u32,
}
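For comparison, the conversion that a statically declared big-endian field stands for can be written by hand with the standard library. This is a sketch only: the type name `InAddr` and the method names `from_wire` and `to_wire` are invented for the example.

```rust
/// Mirrors the BSD-sockets `struct in_addr`: a single 32-bit address.
#[derive(Debug, PartialEq)]
struct InAddr {
    s_addr: u32,
}

impl InAddr {
    /// Decode four wire bytes in network byte order (big-endian).
    fn from_wire(bytes: [u8; 4]) -> Self {
        InAddr {
            s_addr: u32::from_be_bytes(bytes),
        }
    }

    /// Encode back to network byte order.
    fn to_wire(&self) -> [u8; 4] {
        self.s_addr.to_be_bytes()
    }
}
```

With this, `InAddr::from_wire([192, 168, 0, 1])` holds `0xC0A80001` regardless of the host's native byte order, and `to_wire()` restores the original four bytes.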
We can also apply endian as a top-level Deku attribute on the struct.
In this specific case the choice will not make much difference as the
struct has only one primitive field, but with multiple fields this
can save quite a bit of effort. Further, Deku automatically propagates
the endian property to substructures through its ctx attribute
system, which we
can use to maintain consistency. By expanding our IP example slightly we can see
this propagation in action:
use deku::ctx::Endian;
With this understood we're most of the way to tackling ELF's variable endianness. We will return to ELF now.
Data Model ⨯ Endianness
Having defined the identification fields we now know the binary's primitive
ELF sizes (from ei_class) and the endianness (from ei_data). For a general
implementation we need to solve both problems simultaneously - omitting either
limits the implementation to a subset of the supported platforms.
To process the remainder of the ELF header we need to provide this information as context for the subsequent data-type definitions. This motivates some design choices and the use of further Deku attributes.
The first choice, which goes slightly against the grain of the specification,
is separating e_ident from the definition of Ehdr. This is a practical
decision, as it enables the use of top-level attributes to define the endianness
of fields of substructures. Subsequently, correct parsing with respect to
ei_class motivates implementation of elf::Ehdr as an enum over class-specific
definitions - elf32::Ehdr and elf64::Ehdr[2]. We wrap the combination of
e_ident and an instance of elf::Ehdr in a struct Elf:
Working from the bottom up, we find the definition of struct Elf discussed
above the code sample. In this case I've named the invented field hdr, without
the e_ prefix, as a means to separate the implementation choice from fields
defined by the specification.
The hdr field definition carries the ctx attribute with e_ident as
its value. We see the use of this in expressions for both the top-level id
and endian attributes defined on elf::Ehdr. The variant-specific id is also
used this time, to define the mapping from the ID provided by the context to the
relevant variant of the type.
This compact description is the solution to both problems of data model and
endianness. During parsing, Deku selects the appropriate elf::Ehdr variant
by the value of elf::Identification::ei_class, and propagates the endianness
defined by elf::Identification::ei_data by way of conversion through the
TryFrom<DataEncoding> implementation for deku::ctx::Endian.
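The two-way dispatch can be illustrated without Deku. The sketch below selects a header variant from the class and decodes integers according to the data encoding. It handles only the first four header fields and inlines them into the enum, whereas the article's elf::Ehdr wraps separate elf32::Ehdr and elf64::Ehdr types. The names `Class`, `read_uint` and `parse_ehdr` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Class {
    Elf32,
    Elf64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataEncoding {
    Lsb,
    Msb,
}

/// Truncated header: only the leading fields, enough to show the dispatch.
#[derive(Debug, PartialEq)]
enum Ehdr {
    Elf32 { e_type: u16, e_machine: u16, e_version: u32, e_entry: u32 },
    Elf64 { e_type: u16, e_machine: u16, e_version: u32, e_entry: u64 },
}

/// Decode an unsigned integer of up to 8 bytes in the given byte order.
fn read_uint(bytes: &[u8], data: DataEncoding) -> u64 {
    match data {
        // Most significant byte first: fold left to right.
        DataEncoding::Msb => bytes.iter().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)),
        // Least significant byte first: fold right to left.
        DataEncoding::Lsb => bytes.iter().rev().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)),
    }
}

/// `body` is the data following the 16 identification bytes.
/// Returns None if `body` is too short for the selected class.
fn parse_ehdr(class: Class, data: DataEncoding, body: &[u8]) -> Option<Ehdr> {
    // Endianness applies to every field, whichever variant is chosen.
    let e_type = read_uint(body.get(0..2)?, data) as u16;
    let e_machine = read_uint(body.get(2..4)?, data) as u16;
    let e_version = read_uint(body.get(4..8)?, data) as u32;
    // The class decides the variant and with it the width of e_entry.
    Some(match class {
        Class::Elf32 => Ehdr::Elf32 {
            e_type,
            e_machine,
            e_version,
            e_entry: read_uint(body.get(8..12)?, data) as u32,
        },
        Class::Elf64 => Ehdr::Elf64 {
            e_type,
            e_machine,
            e_version,
            e_entry: read_uint(body.get(8..16)?, data),
        },
    })
}
```

In the hand-written form both parameters have to be threaded through every read. The declarative form states them once on the type and leaves the propagation to the derive.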
To view the new structure we make a small change to our main():
- let (_, elf) = elf::Identification::from_bytes((exe.as_slice(), 0))?;
+ let (_, elf) = elf::Elf::from_bytes((exe.as_slice(), 0))?;
And with everything in place, a test parse yields:
Elf {
e_ident: Identification {
ei_class: Elf64,
ei_data: Lsb,
ei_version: Current,
ei_osabi: None,
ei_abiversion: 0,
},
hdr: Elf64(
Ehdr {
e_type: Dyn,
e_machine: X86_64,
e_version: Word(1),
e_entry: Addr(109776),
e_phoff: Off(64),
e_shoff: Off(4630240),
e_flags: Word(0),
e_ehsize: Half(64),
e_phentsize: Half(56),
e_phnum: Half(12),
e_shentsize: Half(64),
e_shnum: Half(43),
e_shstrndx: Half(41),
},
),
}
Mercifully, this matches the output of readelf -h:
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Position-Independent Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1acd0
Start of program headers: 64 (bytes into file)
Start of section headers: 4630240 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 12
Size of section headers: 64 (bytes)
Number of section headers: 43
Section header string table index: 41
Other Machines and Classes
With a small change to the command-line handling, we can start to point our implementation at other binaries:
use deku::DekuContainerRead;
/usr/bin/true for Debian S390x
$ file true
true: ELF 64-bit MSB pie executable, IBM S/390, version 1 (SYSV), dynamically linked
$ ./target/debug/example-3 true
Elf {
e_ident: Identification {
ei_class: Elf64,
ei_data: Msb,
ei_version: Current,
ei_osabi: None,
ei_abiversion: 0,
},
hdr: Elf64(
Ehdr {
e_type: Dyn,
e_machine: S390,
e_version: Word(1),
e_entry: Addr(8160),
e_phoff: Off(64),
e_shoff: Off(41592),
e_flags: Word(0),
e_ehsize: Half(64),
e_phentsize: Half(56),
e_phnum: Half(10),
e_shentsize: Half(64),
e_shnum: Half(28),
e_shstrndx: Half(27),
},
),
}
/usr/bin/true for Debian ARMEL
$ file true
true: ELF 32-bit LSB pie executable, ARM, EABI5 version 1 (SYSV), dynamically linked
$ ./target/debug/example-3 true
Elf {
e_ident: Identification {
ei_class: Elf32,
ei_data: Lsb,
ei_version: Current,
ei_osabi: None,
ei_abiversion: 0,
},
hdr: Elf32(
Ehdr {
e_type: Dyn,
e_machine: Arm,
e_version: Word(1),
e_entry: Addr(5412),
e_phoff: Off(52),
e_shoff: Off(66128),
e_flags: Word(83886592),
e_ehsize: Half(52),
e_phentsize: Half(32),
e_phnum: Half(10),
e_shentsize: Half(40),
e_shnum: Half(29),
e_shstrndx: Half(28),
},
),
}
An observation in this last case is that there are set bits in e_flags. Values
in the field are defined by the appropriate Processor-Specific ABI (psABI), which
for the binary at hand are in the "ELF Header" section of the ELF for the Arm®
Architecture
(AAELF32) specification. Fields in e_flags are important for understanding the
binary capabilities, so let's parse them as well.
AAELF32 Flags
Note:
This serves as a demonstration of the bit-precise attributes, but makes some important assumptions. ELF defines e_flags as a Word, and without evidence to the contrary, it's assumed that the psABI (such as AAELF32) defines fields inside e_flags in terms of the Word extracted with respect to the endianness defined by ei_data, and not in the serialised bit order, which is how the code below treats it. To make the difference clear we add a Deku assert attribute to ensure the endianness required by the struct definition matches what was parsed for the purpose of the demonstration. Proper treatment needs a strategy along the lines of implementing Deku traits for a flags crate, which we will not cover here.
We define the following additional data types:
// ...
And make a small change to our existing data-model-specific Ehdr
definitions:
- e_flags: crate::elf::Word,
+ #[deku(ctx = "e_machine")]
+ e_flags: crate::elf::Eflags,
The usual context-passing technique is involved, with the new twist of bit-precise attributes defining fields and padding. We also encounter the bit_order attribute, which defines the direction in which the struct member order extracts bits from the underlying buffer. An insightful explanation of bit order can be found in the bitvec documentation.
With these changes in place we find the ARM-specific flags are now extracted as appropriate:
Elf {
e_ident: Identification {
ei_class: Elf32,
ei_data: Lsb,
ei_version: Current,
ei_osabi: None,
ei_abiversion: 0,
},
hdr: Elf32(
Ehdr {
e_type: Dyn,
e_machine: Arm,
e_version: Word(1),
e_entry: Addr(5412),
e_phoff: Off(52),
e_shoff: Off(66128),
e_flags: Arm(
Eflags {
ef_arm_abi_float_hard: false,
ef_arm_abi_float_soft: true,
ef_arm_be8: false,
ef_arm_gccmask: false,
ef_arm_abimask: 5,
},
),
e_ehsize: Half(52),
e_phentsize: Half(32),
e_phnum: Half(10),
e_shentsize: Half(40),
e_shnum: Half(29),
e_shstrndx: Half(28),
},
),
}
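As a cross-check on the decoded values, the same fields can be pulled out of the raw Word with plain masks, treating e_flags as a value rather than as serialised bits. The mask constants below are the EF_ARM_* values as we understand them from AAELF32 and should be verified against the specification. The struct and function names are invented for the example, and EF_ARM_GCCMASK is left out because it overlaps the float bits.

```rust
// Mask values believed to match AAELF32; check them against the specification.
const EF_ARM_ABIMASK: u32 = 0xff00_0000;
const EF_ARM_BE8: u32 = 0x0080_0000;
const EF_ARM_ABI_FLOAT_HARD: u32 = 0x0000_0400;
const EF_ARM_ABI_FLOAT_SOFT: u32 = 0x0000_0200;

/// Hypothetical value-level view of the Arm e_flags word.
#[derive(Debug, PartialEq)]
struct ArmFlags {
    abi_version: u8,
    be8: bool,
    float_hard: bool,
    float_soft: bool,
}

/// Decode from the already endian-corrected Word.
fn decode_arm_flags(word: u32) -> ArmFlags {
    ArmFlags {
        // The EABI version occupies the top byte of the word.
        abi_version: ((word & EF_ARM_ABIMASK) >> 24) as u8,
        be8: word & EF_ARM_BE8 != 0,
        float_hard: word & EF_ARM_ABI_FLOAT_HARD != 0,
        float_soft: word & EF_ARM_ABI_FLOAT_SOFT != 0,
    }
}
```

Feeding in the `Word(83886592)` seen earlier for the ARMEL binary, which is `0x05000200`, gives EABI version 5 with the soft-float bit set, in agreement with the bit-precise parse.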
Writing an ELF header
A consequence of Deku's declarative approach is that serialising from the
in-memory representation is as straightforward as deserialising to it. To
demonstrate a small change with a wide-ranging impact, we will switch the
endianness of the ELF metadata before writing the binary back out. We do so
through an updated implementation of our main():
use ;
// ...
Of significance is that the implementation doesn't change any of the definitions
or fields beyond updating ei_data. ei_data dictates the endianness of what
follows, and our use of Deku does the work of propagating that change via the
endian attribute as discussed above.
Testing against an x86-64 copy of /usr/bin/true, readelf -h reports:
$ diff -u \
> <(readelf -h true 2>&1 | head -n20) \
> <(readelf -h true.switched 2>&1 | head -n20)
--- /dev/fd/63 2025-11-20 14:56:34.917164518 +1030
+++ /dev/fd/62 2025-11-20 14:56:34.917164518 +1030
@@ -1,11 +1,11 @@
ELF Header:
- Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
+ Magic: 7f 45 4c 46 02 02 01 00 00 00 00 00 00 00 00 00
Class: ELF64
- Data: 2's complement, little endian
+ Data: 2's complement, big endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
- Type: DYN (Position-Independent Executable file)
+ Type: DYN (Shared object file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x27e0
We see what we hoped for - readelf reports that the endianness of the binary has
switched, notably without significant[3] change to the other reported values.
Inspecting the header using hexdump we find that the endianness of all
relevant fields has been updated in accordance with the change to ei_data:
$ hexdump -Xn 64 true
0000000 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
0000010 03 00 3e 00 01 00 00 00 e0 27 00 00 00 00 00 00
0000020 40 00 00 00 00 00 00 00 28 a2 00 00 00 00 00 00
0000030 00 00 00 00 40 00 38 00 0e 00 40 00 1e 00 1d 00
0000040
$ hexdump -Xn 64 true.switched
0000000 7f 45 4c 46 02 02 01 00 00 00 00 00 00 00 00 00
0000010 00 03 00 3e 00 00 00 01 00 00 00 00 00 00 27 e0
0000020 00 00 00 00 00 00 00 40 00 00 00 00 00 00 a2 28
0000030 00 00 00 00 00 40 00 38 00 0e 00 40 00 1e 00 1d
0000040
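The byte-level effect visible in the two dumps can be reproduced by hand for the ELF64 case. The sketch below flips ei_data and reverses the bytes of each header field in place, which is equivalent to re-serialising each integer in the other byte order. It is a standard-library illustration with invented names, not the Deku-based program, and it deliberately ignores Elf32.

```rust
/// Widths in bytes, in file order, of the ELF64 header fields after e_ident:
/// e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
/// e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx.
const ELF64_FIELD_WIDTHS: [usize; 13] = [2, 2, 4, 8, 8, 8, 4, 2, 2, 2, 2, 2, 2];

/// Offset of ei_data within e_ident.
const EI_DATA: usize = 5;
/// Length of e_ident, which is byte-order independent.
const EI_NIDENT: usize = 16;

/// Hypothetical helper: switch a 64-byte ELF64 header to the other endianness.
fn switch_endianness(header: &[u8; 64]) -> [u8; 64] {
    let mut out = *header;
    // 1 is little-endian (LSB) and 2 is big-endian (MSB); others are left alone.
    out[EI_DATA] = match header[EI_DATA] {
        1 => 2,
        2 => 1,
        other => other,
    };
    // Reverse each multi-byte field individually; e_ident is untouched.
    let mut offset = EI_NIDENT;
    for &width in ELF64_FIELD_WIDTHS.iter() {
        out[offset..offset + width].reverse();
        offset += width;
    }
    out
}
```

Applied to the first 64 bytes of the original binary this yields the bytes shown for true.switched, and applying it twice restores the input.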
And with that, our exploration of Deku through ELF headers is complete! The full program listing is provided for interest.
[1] For simplicity we will forget about middle-endian.
[2] For the purpose of brevity, we'll assume appropriate and equivalent implementations for the 32-bit ELF class while elaborating on the 64-bit definitions.
[3] The change from DYN (Position-Independent Executable file) to
DYN (Shared object file) is the result of gluing the remainder
of the unmodified binary back onto the tail of the endian-adjusted
header.