GTIRB
v2.2.0
GrammaTech Intermediate Representation for Binaries
GTIRB portably encodes binaries from a range of standard executable and linkable formats, such as ELF, PE, and Mach-O, allowing conversion between these formats and GTIRB. The standard GTIRB data structures are intended to be general across all binary representations, so they are lossy for many aspects of any particular format; to ensure that all information from the original binary is retained, encoders are encouraged to include all raw bytes of the original file in the GTIRB encoding. In addition, GTIRB encodes information above and beyond what these formats store: it stores control flow, symbolization (reference) information, and other analysis results, with the goal of providing all essential information to support subsequent binary analysis and rewriting. Finally, GTIRB allows user-extensible data to be included in the form of AuxData tables, which can easily reference other GTIRB elements, letting tools communicate with each other in a single standard in-file format.
Although executable file formats differ in many ways, they tend to share a similar structure. The bytes of the image are divided into sections, which contain the code and data bytes along with information about how to load and adjust them at run-time. To facilitate linking with shared libraries, they have a symbol table, which lists the names of the entities the file provides or requires. To facilitate relocation in memory, these files often contain a relocation table.
GTIRB contains all this information in standard forms. In GTIRB, a single executable or library is encoded as a module. A GTIRB file may have multiple modules, enclosed in a single IR. GTIRB encodes the standard features of all binary formats in the following structures:
Modules in GTIRB contain multiple sections. A section has a name reflecting any name given in the original file (e.g., `.text`), a set of properties, and a set of contents stored in byte intervals.
Rather than storing a symbol table as a section, GTIRB stores a set of symbols associated with every module. These symbols have a name, a set of properties, and a referent. A referent may be an integer, indicating that the symbol is a numeric constant or fixed address, or a reference to a block. A block may be one of:
Block Kind | Description |
---|---|
code block | a series of executable instructions |
data block | a series of data bytes |
proxy block | indicating that the symbol is defined in another module |
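The relationship between symbols, referents, and block kinds can be sketched with plain Python dataclasses. This is only an illustrative model: the class names, fields, and the `describe` helper are assumptions made for this example and are not the GTIRB library API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CodeBlock:
    """A series of executable instructions."""
    size: int


@dataclass
class DataBlock:
    """A series of data bytes."""
    size: int


@dataclass
class ProxyBlock:
    """Stands in for an entity defined in another module."""


Block = Union[CodeBlock, DataBlock, ProxyBlock]


@dataclass
class Symbol:
    name: str
    # An int referent is a numeric constant or fixed address;
    # any other referent is a reference to a block.
    referent: Union[int, Block]


@dataclass
class Module:
    name: str
    symbols: List[Symbol] = field(default_factory=list)


def describe(symbol: Symbol) -> str:
    """Classify a symbol by the kind of referent it carries."""
    ref = symbol.referent
    if isinstance(ref, int):
        return "constant or fixed address"
    if isinstance(ref, CodeBlock):
        return "code block"
    if isinstance(ref, DataBlock):
        return "data block"
    return "proxy block (defined in another module)"


# Hypothetical module contents, chosen only to exercise each referent kind.
module = Module("example.so", [
    Symbol("main", CodeBlock(size=16)),
    Symbol("table", DataBlock(size=64)),
    Symbol("printf", ProxyBlock()),
    Symbol("image_base", 0x400000),
])
kinds = {s.name: describe(s) for s in module.symbols}
```

Keeping symbols as a set attached to the module, rather than as bytes inside a section, means a tool can look up or rewrite a referent without parsing a format-specific symbol table.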
The bytes of a section are subdivided into chunks of bytes called byte intervals. This indirection layer serves two purposes:
Two byte intervals in the same section may not overlap in addresses (although sections can overlap with each other in some cases, such as in object code). Byte intervals contain code blocks or data blocks. The blocks within a byte interval can overlap. Examples of overlapping blocks include:
Byte intervals also hold symbolic expressions which indicate symbolic contents of code or data blocks.
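The two overlap rules above (byte intervals in one section must be disjoint, while blocks inside a byte interval may overlap) can be checked with a short sketch. The classes and helper functions here are hypothetical and written for illustration only; the sketch also assumes every byte interval has a known address and a non-zero size.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Tuple


@dataclass
class Block:
    offset: int  # offset from the start of the enclosing byte interval
    size: int


@dataclass
class ByteInterval:
    address: int
    size: int
    blocks: List[Block] = field(default_factory=list)


def ranges_overlap(start_a: int, size_a: int, start_b: int, size_b: int) -> bool:
    """True when two non-empty half-open ranges [start, start + size) intersect."""
    return start_a < start_b + size_b and start_b < start_a + size_a


def intervals_are_disjoint(intervals: List[ByteInterval]) -> bool:
    """Check the rule that byte intervals in one section do not overlap."""
    return not any(
        ranges_overlap(a.address, a.size, b.address, b.size)
        for a, b in combinations(intervals, 2)
    )


def overlapping_blocks(interval: ByteInterval) -> List[Tuple[Block, Block]]:
    """List the pairs of blocks that share bytes, which is permitted."""
    return [
        (a, b)
        for a, b in combinations(interval.blocks, 2)
        if ranges_overlap(a.offset, a.size, b.offset, b.size)
    ]


# Two adjacent intervals: [0x1000, 0x1010) and [0x1010, 0x1020).
first = ByteInterval(0x1000, 0x10, [Block(0, 4), Block(2, 4), Block(8, 4)])
second = ByteInterval(0x1010, 0x10)
section_ok = intervals_are_disjoint([first, second])
shared = overlapping_blocks(first)
```

Here only the blocks at offsets 0 and 2 share bytes; the block at offset 8 overlaps neither of them.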
To encode relocations, GTIRB associates symbolic expressions with byte intervals. These specify that certain bytes in the binary refer to the address of a symbol's referent. This allows these bytes to be recalculated when the referent is moved in the binary image.
GTIRB does not specify exactly how symbolic expressions are transformed into bytes. This depends on where the symbolic expression is located: inside a code block, it depends on which part of an instruction the expression belongs to, while inside a data block, it depends on the size of the data block.
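As a rough illustration of recalculating bytes after a referent moves, the sketch below patches a byte buffer from a table of symbolic expressions keyed by offset. Because GTIRB leaves the encoding open, the sketch simply assumes every expression is stored as a 4-byte little-endian absolute address; the `AddressPlusOffset` class and `apply_expressions` function are hypothetical names, not part of GTIRB.

```python
import struct
from dataclasses import dataclass
from typing import Dict


@dataclass
class AddressPlusOffset:
    """Hypothetical expression: the address of a named referent plus an offset."""
    symbol: str
    offset: int


def apply_expressions(
    contents: bytes,
    expressions: Dict[int, AddressPlusOffset],
    addresses: Dict[str, int],
) -> bytes:
    """Rewrite the bytes at each expression's position from current addresses.

    Assumption: each expression occupies 4 bytes, little-endian, and holds
    an absolute address. Real encodings depend on the instruction or data.
    """
    patched = bytearray(contents)
    for position, expr in expressions.items():
        value = addresses[expr.symbol] + expr.offset
        struct.pack_into("<I", patched, position, value)
    return bytes(patched)


contents = bytes(8)  # eight zero bytes standing in for a byte interval
expressions = {4: AddressPlusOffset("table", 8)}  # bytes 4..7 refer to table+8

before = apply_expressions(contents, expressions, {"table": 0x2000})
after = apply_expressions(contents, expressions, {"table": 0x3000})
```

Moving `table` from `0x2000` to `0x3000` changes only the four bytes covered by the expression; the rest of the buffer is left untouched.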
There are currently three kinds of symbolic expressions:
Kind | Description |
---|---|
SymAddrConst | the address of the referent of a symbol, plus or minus a fixed offset |
SymAddrAddr | the difference between the addresses of the referents of two symbols, divided by a scale, plus an offset |