code_inference

This module computes the valid instructions organized in blocks of code. It infers the facts code_in_block, block, and code.

Potential code inference failures are recorded in the block_still_overlap and interval_schedule_tie relations.

The disassembly consists of several phases:

  • Traverse backward from invalid instructions through “must” edges

  • Traverse forward from “may” edges from all possible instructions to generate a superset of all possible code blocks in code_in_block_candidate. This traversal has a recursive component and a linear-sweep component.

  • Generate data block candidates using references from code block candidates

  • Split common block tails to ensure code blocks are non-overlapping in code_in_block_candidate_refined

  • Populate known_block based on binary metadata and propagate forward through “must” edges

  • Populate impossible_block with blocks which overlap known_blocks, and propagate backward through “must” edges

  • Identify remaining overlapping blocks in unresolved_block and assign points in block_points and block_points_proportional according to heuristics.

  • Resolve blocks with different boundaries using weighted interval scheduling, which weights blocks according to points and binary coverage

  • Resolve blocks with identical boundaries according to their points.

code_in_block(EA:address, Block:address)

The instruction at address ‘EA’ belong to the block ‘Block’. The block address is the address of the first instruction in the block. These are final addresses where the code is located. They are organized in blocks of sequential code and the block identifier is the address of the first instruction in the block.

block(Block:address)

There is a code block starting at ‘Block’.

code(EA:address)

The instruction at address ‘EA’ is code.

WARNING: Predicate not present in compiled Datalog program (Dead Code)

block_last_instruction(Block:address, EA:address)

The last instruction of ‘Block’ is at address ‘EA’

block_boundaries(block:address, BegAddr:address, EndAddr:address)

The block defined by address ‘Block’ extends from ‘BegAddr’ to ‘EndAddr’. In most cases ‘BegAddr’ and ‘Block’ coincide, but there are exceptions such as ARM 32 Thumb code blocks.

overlapping_instruction(EA:address, EA2: address)

Predicate to capture overlapping instructions. Certain overlapping instructions are permitted and do not generate block conflicts. Those overlapping instructions are captured by this predicate.

data_in_code(Begin:address, End:address)

There is a segment of data in a code section that spans from ‘Begin’ to ‘End’.

block_still_overlap(Block1:address, Type1:block_type, Size1:unsigned, Block2:address, Type2:block_type, Size2:unsigned)

This predicate if there are still overlapping blocks after the code inference. In that case the analysis has failed to resolve all conflicts.

may_fallthrough(From:address, To:address)

Instruction at address From might fallthrough to instruction at address To.

must_fallthrough(From:address, To:address)

Instruction at address From must fallthrough to instruction at address To.

likely_fallthrough(From:address, To:address)

This predicate is a special case of may_fallthrough. must_fallthrough is very conservative and does not include things like conditional jumps. However, conditional jumps can almost always fallthrough (unless there is obfuscation). likely_fallthrough captures those instructions.

may_have_symbolic_immediate(src:address, dest:address)

unlikely_have_symbolic_immediate(EA:address)

invalid(EA:address, Type:symbol)

max_instruction_size(Size:unsigned)

Largest instruction in the binary

possible_ea(EA:address)

WARNING: Predicate not present in compiled Datalog program (Dead Code)

known_code(EA:address)

The address EA is already known to be code in the input GTIRB.

basic_target(ea:address)

Starting addresses for code exploration.

block_limit(EA:address)

transition_block_limit(EA:address, Next:address)

We want to split blocks that go from non-padding to padding or from padding to non-padding.

However, this cannot be a regular block_limit because several instructions could fallthrough into another one. We need to consider the source address too.

possible_target_from(dest:address, src:address)

possible_target(Target:address)

code_in_block_candidate(EA:address, EA_block:address)

aligned_address_in_data(EA:address, Value:address)

Aligned address shown in data

padding_block_limit(Limit:address)

Addresses where the propagation of padding blocks should be limited.

Similar to (and a superset of) block_limit for code blocks.

nop_in_padding_candidate(EA:address, Start:address)

Build padding candidates from nop instructions

This is analogous to code_in_block_candidate, but for blocks of type “padding”.

padding_block_candidate(EA:address, Size:unsigned)

Padding block candidates can help fill in space when evaluating overlapping models.

after_end(EA:address, End:address)

Linear component of the forward traversal. after_end traverses the code lineraly skipping nops. Traversals starts at points where there is a no-fallthough, after jump tables and after literal pools. End represents where the traversal started and EA represents the current address.

common_tail(EA:address)

code_in_block_candidate_refined(EA:address, Block:address)

composite_data_access(EA:address, EA_load:address, Data:address, Size:unsigned)

A pair of instructions form an access to data with a predetermined size and location

data_block_candidate(Block:address, Size:unsigned)

A data block candidate in a code section. The candidate is located at address ‘Block’ and has ‘Size’ bytes.

block_candidate_boundaries(Block:address, Type:block_type, BegAddr:address, EndAddr:address)

The block candidate defined by address ‘Block’ extends from ‘BegAddr’ to ‘EndAddr’. In most cases ‘BegAddr’ and ‘Block’ coincide, but there are exceptions such as ARM 32 Thumb code blocks. These block candidates can be code blocks or data blocks.

multiple_fallthrough_to(Block:address)

WARNING: Predicate not present in compiled Datalog program (Dead Code)

block_overlap(Block1:address, Type1:block_type, Size1:unsigned, Block2:address, Type2:block_type, Size2:unsigned)

Two blocks overlap.

block_candidate_dependency_edge(BlockA:address, TypeA:block_type, SizeA:unsigned, BlockB:address, TypeB:block_type, SizeB:unsigned)

The block BlockA,TypeA implies the existence of of BlockB,TypeB

block_implies_block(BlockA:address, TypeA:block_type, SizeA:unsigned, BlockB:address, TypeB:block_type, SizeB:unsigned)

If BlockA exists, BlockB must also exist.

known_block(Block:address, Type:block_type, Size:unsigned, Reason:symbol)

Known blocks are blocks that must exist due to metadata.

impossible_block(Block:address, Type:block_type, Size:unsigned, Reason:symbol)

Logically impossible blocks.

unresolved_block(Block:address, Type:block_type, Size:unsigned)

Blocks that are still overlapping after known_block and impossible_block are determined.

unresolved_block_overlap(Block1:address, Type1:block_type, Size1:unsigned, Block2:address, Type2:block_type, Size2:unsigned)

Blocks that still overlap after known_block and impossible_block are determined.

type_ordering_map(Type:block_type, IntervalType:interval_type, AddrAdjust:address)

This predicate allows us to translate back and forth from block types (padding, code, or data) into numerical interval types that are easily sorted. For padding and code blocks, we can have Default or Thumb blocks and we distinguish those with a different interval type. The field AddrAdjust allows us to adjust the addresses of Thumb blocks.

This predicate also defines an implicit priority. If two blocks have the same weight and the same boundaries, the one with higher interval_type will be chosen. This derives from the way we deal with ties in the WIS algorithm. In case of a tie, the algorithm selects the block, which means that the last block in the ordering will be selected first.

unresolved_interval(Start:address, End:address, TypeOrd:interval_type, Weight:number)

Valid unresolved intervals

next_type(Start:address, End:address, PrevType:interval_type, NextType:interval_type)

Auxiliary predicate to help sorting the intervals. next_type captures the next type for a fixed start and end addresses.

next_start(PrevStart:address, End:address, NextStart:address)

Auxiliary predicate to help sorting the intervals. next_start captures the next starting address for a fixed end address.

next_end(PrevEnd:address, NextEnd:address)

Auxiliary predicate to help sorting the intervals. next_end captures the next end address overall.

unresolved_interval_order(ID:unsigned, Start:address, End:address, Type:interval_type, Weight:number)

Sort intervals lexicographically by <end address, start address, interval_type>

wis_has_prior(I:unsigned, Count:unsigned)

Weighted interval schedule: Count of intervals that end prior to the start of interval I.

Only calculated for intervals that have prior intervals.

wis_prior(I:unsigned, Count:unsigned)

wis_prior extends wis_has_prior, including intervals that have no prior intervals (i.e., with Count of 0).

wis_memo(I:unsigned, Value:number, Pred:unsigned)

Weighted interval schedule: memoized subproblems

Value is the optimized total weight for intervals [1:I].

Value is the total weight for selected intervals, and Pred (i.e., the Predecessor) is the previous entry in wis_memo, later used to reconstruct the optimal solution.

interval_schedule_tie(BlockA:address, TypeA:block_type, SizeA:unsigned, BlockB:address, TypeB:block_type, SizeB:unsigned)

Blocks A and B had equal weights during interval scheduling, and Block A was selected.

wis_schedule(Interval:unsigned)

Weighted interval schedule: reconstruct schedule of selected intervals

wis_schedule_iter(Iter:unsigned)

block_total_points(Block:address, Type:block_type, Size:unsigned, Points:number)

discarded_block(Block:address, Type:block_type, Size:unsigned, Reason:symbol, BlockPropagated:address)

Block: block to be discarded BlockPropagated (for debugging purpose):

  • previous block that the reason is propagated from (for “propagated”)

  • block that this block overlapped with (for “first instruction overlap”)

candidate_block_is_not_padding(Block:address)

candidate_block_is_padding(Block:address)

block_heuristic(Block:address, Type:block_type, Size:unsigned, Instance:address, HeuristicName:symbol)

block_heuristic defines properties of blocks that will receive points. This predicate uniquely identifies a block and the ‘HeuristicName’ property associated to it.

Some heuristics can be assigned to a block multiple times, this is made possible by the ‘Instance’ field. E.g. if we give points for specific instructions in the block or for each block predecessor. For heuristics that can only be applied once per block, the instance field is always 0.

negative_block_heuristic(Block:address, Type:block_type, Size:unsigned, Instance:address, HeuristicName:symbol)

Negative block heuristics are collected in their own relation.

This allows any block with negative_block_heuristic to be marked as unresolved_block without causing cyclic negation.

block_points(Block:address, Type:block_type, Size:unsigned, Instance:address, Weight:number, HeuristicName:symbol)

A block defined by its address ‘Block’, ‘Type’, and ‘Size’ has been assigned a heuristic ‘HeuristicName’ with weight ‘Weight’. The ‘Instance’ field can be used to assign a single heuristic to a block multiple times, see block_heuristics. The heuristic weights are determined by the heuristic_weight predicate.

block_points_proportional(Block:address, Type:block_type, Size:unsigned, WeightMult:number, HeuristicName:symbol)

This predicate specifies heuristics whose weight is proportional to the block size. A block defined by its address ‘Block’, ‘Type’, and ‘Size’ has been assigned a heuristic ‘HeuristicName’

with weight ‘WeightMult’*’Size’.

resolved_reaches(Block:address, Size:unsigned, Type:symbol)

Points for reachability

Give points to blocks that are reachable from a non-overlapping block. We distinguish two kinds of reachability “strong” and “weak” which get different amount of points. “weak” reachability is for may fallthroughs that have higher uncertainty such as function calls.

data_in_code_propagate(Current:address, Initial:address, EndByteInterval:address)

Auxiliary predicate to compute data_in_code. It propagates through the holes between code blocks and between the last code block of a byte interval and the end of the interval

first_block_in_byte_interval(Begin:address, End:address, BlockBeg:address)

Auxiliary predicate to compute data_in_code. BlockBeg contains the address of the first code block in the byte interval defined by the addresses [Begin,End). If a code byte interval has no code, no predicate will be generated.

next_block_in_byte_interval(Block:address, NextBlock:address)