GTIRB  v2.2.0
GrammaTech Intermediate Representation for Binaries

Binary Representation with GTIRB

GTIRB portably encodes binaries from a range of standard executable and linkable formats, such as ELF, PE, and Mach-O, allowing conversion between these formats and GTIRB. The standard GTIRB data structures are intended to be general across all binary representations, so they are lossy for many aspects of any particular format. To ensure that all information from the original binary is retained, encoders are encouraged to include all raw bytes of the original file in the GTIRB encoding. In addition, GTIRB encodes information above and beyond what these formats store: control flow, symbolization (reference) information, and other analysis results, with the goal of providing all essential information to support subsequent binary analysis and rewriting. Finally, GTIRB allows user-extensible data to be included in the form of AuxData tables, which can easily reference other GTIRB elements, letting tools communicate with each other in a single standard in-file format.

Representing Binaries

Although executable file formats differ in many ways, they tend to share a similar structure. The bytes of the image are divided into sections, which contain the code and data bytes along with information about how to load and adjust them at run-time. To facilitate linking with shared libraries, these files have a symbol table, which lists the names of the entities the file provides or requires. To facilitate relocation in memory, these files often contain a relocation table.

GTIRB contains all this information in standard forms. In GTIRB, a single executable or library is encoded as a module. A GTIRB file may have multiple modules, enclosed in a single IR. GTIRB encodes the standard features of all binary formats in the following structures:

Sections

Modules in GTIRB contain multiple sections. A section has a name reflecting any name given in the original file (e.g., .text), a set of properties, and a set of contents stored in byte intervals.
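
The containment described above (an IR holds modules, a module holds sections, a section holds byte intervals) can be pictured with a minimal sketch. The classes below are hypothetical stand-ins written for illustration only; they are not the GTIRB API, and the field names and property strings are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ByteInterval:
    contents: bytes
    address: Optional[int] = None  # None models an interval with no fixed address


@dataclass
class Section:
    name: str                                   # e.g. ".text", as in the original file
    properties: Set[str] = field(default_factory=set)
    byte_intervals: List[ByteInterval] = field(default_factory=list)


@dataclass
class Module:
    name: str                                   # one executable or library
    sections: List[Section] = field(default_factory=list)


@dataclass
class IR:
    modules: List[Module] = field(default_factory=list)


# Build a one-module IR with a single .text section (illustrative values).
text = Section(
    name=".text",
    properties={"readable", "executable"},
    byte_intervals=[ByteInterval(b"\x90\xc3", address=0x1000)],
)
ir = IR(modules=[Module(name="example.so", sections=[text])])

# Walk the hierarchy: IR -> modules -> sections.
section_names = [s.name for m in ir.modules for s in m.sections]
```

Walking the structure top-down, as in the last line, is the usual way a tool would enumerate sections across every module in a file.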

Symbols

Rather than storing a symbol table as a section, GTIRB stores a set of symbols associated with every module. These symbols have a name, a set of properties, and a referent. A referent may be an integer, indicating that the symbol is a numeric constant or fixed address, or a reference to a block. A block may be one of:

Block Kind     Description
code block     a series of executable instructions
data block     a series of data bytes
proxy block    indicates that the symbol is defined in another module
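
As a rough illustration of the referent kinds listed above, the following sketch models a symbol whose referent is either an integer or one of the three block kinds. All class and function names here are hypothetical and chosen for the example; they do not reproduce the GTIRB API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CodeBlock:
    size: int  # number of bytes of executable instructions


@dataclass
class DataBlock:
    size: int  # number of data bytes


@dataclass
class ProxyBlock:
    pass  # placeholder for something defined in another module


# A referent is either a plain integer or a reference to a block.
Referent = Union[int, CodeBlock, DataBlock, ProxyBlock]


@dataclass
class Symbol:
    name: str
    referent: Referent


def describe(symbol: Symbol) -> str:
    """Classify a symbol by the kind of thing it refers to."""
    referent = symbol.referent
    if isinstance(referent, int):
        return "numeric constant or fixed address"
    if isinstance(referent, CodeBlock):
        return "code block"
    if isinstance(referent, DataBlock):
        return "data block"
    return "proxy block (defined in another module)"


symbols = [
    Symbol("main", CodeBlock(size=16)),
    Symbol("counter", DataBlock(size=4)),
    Symbol("printf", ProxyBlock()),
    Symbol("PAGE_SIZE", 4096),
]
descriptions = {s.name: describe(s) for s in symbols}
```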

Byte Intervals

The bytes of a section are subdivided into chunks of bytes called byte intervals. This indirection layer serves two purposes:

  • Indicate which blocks can be moved independently of each other. Any two byte intervals in a section can be reordered relative to each other, and doing so is guaranteed to preserve the program's semantics.
  • Support the generation of blocks with no original address. Byte intervals may have a fixed address, but they may also be unfixed, likely indicating that the byte interval was generated by a binary rewriting tool or is freely movable to any address.
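
The two purposes above can be sketched with a small model in which a byte interval's address is optional and a block is located by an offset into its interval. This is an illustrative assumption about how such a layout could be represented; `ByteInterval` and `block_address` below are hypothetical names, not GTIRB API calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ByteInterval:
    contents: bytes
    address: Optional[int] = None  # None: the interval has no fixed address


def block_address(interval: ByteInterval, offset: int) -> Optional[int]:
    """Address of a block at `offset` in `interval`, or None if the interval is unfixed."""
    if interval.address is None:
        return None
    return interval.address + offset


# An interval recovered from the original binary, with a known address.
fixed = ByteInterval(b"\x55\x48\x89\xe5\xc3", address=0x401000)
# An interval produced by a rewriting tool, with no original address.
generated = ByteInterval(b"\x90\x90")

fixed_block_address = block_address(fixed, 4)
generated_block_address = block_address(generated, 0)

# Moving the whole interval moves every block inside it by the same amount.
fixed.address = 0x500000
moved_block_address = block_address(fixed, 4)
```

Because blocks are positioned relative to their interval in this model, relocating an interval only requires changing one address.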

Two byte intervals in the same section may not overlap in addresses (although sections can overlap with each other in some cases, such as in object code). Byte intervals contain code blocks or data blocks. The blocks within a byte interval can overlap. Examples of overlapping blocks include:

  • Overlapping data blocks are common.
    • A data block representing an array may overlap many data blocks representing elements of the array.
    • Compilers often overlap strings with shared suffixes. The data blocks representing these strings will similarly overlap.
  • Overlapping code blocks are rare. However, particularly clever or malicious code in variable-width ISAs may contain overlapping code blocks when two different sequences of instructions serialize to machine-code bytes that share common subsequences.
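
The difference between the two overlap rules (byte intervals in a section may not overlap, blocks inside an interval may) can be sketched with a simple half-open range check. The tuples and helper names below are hypothetical and exist only for this example.

```python
def ranges_overlap(start_a: int, size_a: int, start_b: int, size_b: int) -> bool:
    """True if [start_a, start_a + size_a) and [start_b, start_b + size_b) intersect."""
    return start_a < start_b + size_b and start_b < start_a + size_a


# One byte interval holding a string whose suffix is reused as a second string.
interval_contents = b"hello world\x00"

# Hypothetical data blocks, each given as (offset within the interval, size).
whole_string = (0, 12)   # "hello world\0"
shared_suffix = (6, 6)   # "world\0"


def block_bytes(block):
    """Return the bytes of the interval covered by a (offset, size) block."""
    offset, size = block
    return interval_contents[offset:offset + size]


# Blocks within one byte interval are allowed to overlap.
blocks_overlap = ranges_overlap(*whole_string, *shared_suffix)

# Byte intervals in the same section, given as (address, size), must not overlap.
interval_a = (0x1000, 12)
interval_b = (0x100C, 4)
intervals_overlap = ranges_overlap(*interval_a, *interval_b)
```

A consistency checker built along these lines would reject a section only when `intervals_overlap` is true, and would accept overlapping blocks such as the shared string suffix.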

Byte intervals also hold symbolic expressions which indicate symbolic contents of code or data blocks.

Symbolic Expressions

To encode relocations, GTIRB associates symbolic expressions with byte intervals. These specify that certain bytes in the binary refer to the address of a symbol's referent. This allows these bytes to be recalculated when the referent is moved in the binary image.

GTIRB does not specify exactly how symbolic expressions are transformed into bytes. This depends on where the symbolic expression is located: inside a code block, it depends on which part of an instruction the expression belongs to, while inside a data block, it depends on the size of the data block.

There are currently three kinds of symbolic expressions:

Kind            Description
SymAddrConst    the address of the referent of a symbol, plus or minus a fixed offset
SymAddrAddr     the difference between the addresses of two symbols, divided by a scale, plus an offset
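
The arithmetic in the table can be sketched as follows. This is a minimal model under stated assumptions, not the GTIRB implementation: the classes are hypothetical stand-ins that borrow the names from the table, symbols are resolved through a plain name-to-address dictionary, the division is assumed to be integer division, and the final step assumes an 8-byte little-endian data block, which GTIRB itself does not prescribe.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class SymAddrConst:
    symbol: str
    offset: int = 0


@dataclass
class SymAddrAddr:
    symbol1: str
    symbol2: str
    scale: int = 1
    offset: int = 0


def evaluate(expr: Union[SymAddrConst, SymAddrAddr], addresses: Dict[str, int]) -> int:
    """Compute the numeric value of a symbolic expression for a given layout."""
    if isinstance(expr, SymAddrConst):
        # address of the referent, plus a (possibly negative) offset
        return addresses[expr.symbol] + expr.offset
    if isinstance(expr, SymAddrAddr):
        # difference of two addresses, divided by a scale, plus an offset
        # (integer division is an assumption of this sketch)
        difference = addresses[expr.symbol1] - addresses[expr.symbol2]
        return difference // expr.scale + expr.offset
    raise TypeError(f"unsupported expression: {expr!r}")


# Illustrative layout: symbol names and addresses are made up for the example.
addresses = {"handler": 0x1000, "table_start": 0x2000, "table_end": 0x2040}

pointer = SymAddrConst("handler", offset=4)
entry_count = SymAddrAddr("table_end", "table_start", scale=8)

pointer_before = evaluate(pointer, addresses)
count = evaluate(entry_count, addresses)

# After a rewriter moves the referent, the same expression yields a new value,
# which can then be re-encoded into the bytes that hold it.
addresses["handler"] = 0x3000
pointer_after = evaluate(pointer, addresses)
encoded = pointer_after.to_bytes(8, "little")  # assumed 8-byte little-endian slot
```

Keeping the expression symbolic and evaluating it only at layout time is what lets the referenced bytes be recalculated after the referent moves.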