Structured Text (Software Evolution Library)

2.1.5 Structured Text

The classes generated for tree-sitter use the rules stored in each language's grammar file to enable implicit source text reproduction at the class level. This makes working with and mutating the AST much simpler. For example, if an 'else' clause is added to an 'if' statement AST that lacks one, the source text of the AST reflects the new 'else' clause without any other updates. (Prior to structured text, slots holding connective whitespace and punctuation required manual updates to accompany most changes to the content of an AST.)
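The idea can be illustrated with a small toy model. The sketch below is written in Python and does not use the library's API: `IfStatement`, its slots, and the hard-coded rule are hypothetical stand-ins for a generated class and its grammar rule. It only shows how connective tokens can be derived from a rule instead of being stored in slots.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IfStatement:
    """Toy AST node: only the variable parts are stored in slots."""
    condition: str
    consequence: str
    alternative: Optional[str] = None  # None means there is no 'else' clause

    def source_text(self) -> str:
        # Keywords, parentheses, and spacing come from the (hard-coded) rule,
        # not from slots, so they never need manual updates.
        text = f"if ({self.condition}) {self.consequence}"
        if self.alternative is not None:
            text += f" else {self.alternative}"
        return text


without_else = IfStatement("x > 0", "{ y = 1; }")
# Copy the node with one slot changed; the 'else' keyword appears on its own.
with_else = replace(without_else, alternative="{ y = 2; }")

print(without_else.source_text())  # if (x > 0) { y = 1; }
print(with_else.source_text())     # if (x > 0) { y = 1; } else { y = 2; }
```

In this model the only change is to the `alternative` slot; the 'else' token itself is never stored anywhere.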

Each generated class can have multiple subclasses, one for each form its source text can take. For example, the update expression in C covers both pre-increment and post-increment. Two subclasses are generated to disambiguate between the source text representations: one for pre-increment and one for post-increment.

These subclass ASTs are frequently copied with slight modifications to their slot values. This can leave the copy in a state that is invalid for the subclass it was copied from. When this is detected, the AST's class is changed dynamically to the first subclass of the base class that can successfully produce source text with the given slot values. This behavior also applies to objects created with the base class, but it may choose a subclass whose source text is not the desired representation, so it is best to specify the exact subclass in cases where this matters, such as update expressions in C.
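The selection behavior can be sketched with a toy model. The Python below is not the library's implementation: the class names, the `can_render` check, and the ordering of `CANDIDATES` (prefix first) are all assumptions made for illustration. It shows why an object created from the base class may end up with an unintended representation.

```python
class UpdateExpression:
    """Toy base class; the slots alone do not fix the token order."""

    def __init__(self, operator: str, argument: str):
        self.operator = operator
        self.argument = argument

    def can_render(self) -> bool:
        return False  # the base class has no text representation of its own

    def render(self) -> str:
        raise NotImplementedError

    def source_text(self) -> str:
        if not self.can_render():
            # Switch to the first subclass whose rule fits the slot values.
            for candidate in CANDIDATES:
                if candidate.can_render(self):
                    self.__class__ = candidate
                    break
            else:
                raise ValueError("no subclass fits these slot values")
        return self.render()


class PrefixUpdate(UpdateExpression):
    def can_render(self) -> bool:
        return self.operator in ("++", "--") and bool(self.argument)

    def render(self) -> str:
        return f"{self.operator}{self.argument}"


class PostfixUpdate(UpdateExpression):
    def can_render(self) -> bool:
        return self.operator in ("++", "--") and bool(self.argument)

    def render(self) -> str:
        return f"{self.argument}{self.operator}"


CANDIDATES = [PrefixUpdate, PostfixUpdate]  # assumed order; the first fit wins

generic = UpdateExpression("++", "i")
print(generic.source_text())                   # ++i
print(type(generic).__name__)                  # PrefixUpdate
print(PostfixUpdate("++", "i").source_text())  # i++
```

Both toy subclasses accept the same slot values, so the base-class object always resolves to whichever comes first. Constructing `PostfixUpdate` directly is the only way to get `i++` in this model.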

Structured text ASTs contain at least four slots that store information which isn't implicit in the AST or its parent AST:

before-text :: stores text that directly precedes the AST but is not part of the rule associated with the AST. This is generally whitespace. This slot is preferred over the after-text slot when creating ASTs from a string with #'convert.
after-text :: stores text that directly follows the AST but is not part of the rule associated with the AST. This is generally whitespace. This slot is preferred when a terminal token directly follows the AST which does not have a before-text slot due to being implicit source text.
before-asts :: stores comment and error ASTs that occur before the AST and before the contents of the before-text slot. The contents of this slot are considered children of the parent AST. This slot is preferred over the after-asts slot when creating ASTs from a string with #'convert.
after-asts :: stores comment and error ASTs that occur after the AST and after the contents of the after-text slot. The contents of this slot are considered children of the parent AST. This slot is preferred when a terminal token directly follows the AST which does not have a before-asts slot due to being implicit source text.
internal-asts-|#| :: stores ASTs that sit between two terminal tokens which are implicit source text. This slot can contain comment, error, and inner-whitespace ASTs.
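The ordering described for the surrounding slots can be sketched as follows. This is a toy Python model, not the library's classes: `Node` and `full_text` are hypothetical names, and the comment ASTs are represented as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str                      # text produced by the node's own rule
    before_text: str = ""          # usually whitespace
    after_text: str = ""           # usually whitespace
    before_asts: List[str] = field(default_factory=list)  # e.g. comments
    after_asts: List[str] = field(default_factory=list)

    def full_text(self) -> str:
        # before-asts come before before-text; after-asts come after after-text.
        parts = (
            self.before_asts
            + [self.before_text, self.text, self.after_text]
            + self.after_asts
        )
        return "".join(parts)


stmt = Node(
    "x = 1;",
    before_asts=["/* init */"],
    before_text="\n  ",
    after_text=" ",
    after_asts=["// done"],
)
print(repr(stmt.full_text()))  # '/* init */\n  x = 1; // done'
```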

The internal-asts slots are generated based on the rule associated with the AST. An internal-asts slot is placed at every position in the rule where two terminal tokens can appear consecutively.
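A rough way to picture this placement is to scan a rule for pairs of terminals that can end up adjacent. The Python sketch below is a simplified illustration under stated assumptions, not the library's generator: `Item`, `internal_slot_gaps`, and the flattened 'for' rule are hypothetical, and only non-terminal items are allowed to be optional.

```python
from typing import List, NamedTuple, Tuple


class Item(NamedTuple):
    name: str
    terminal: bool          # True for literal tokens such as "(" or ";"
    optional: bool = False  # True if a (non-terminal) item may be absent


def internal_slot_gaps(rule: List[Item]) -> List[Tuple[str, str]]:
    """Return pairs of terminals that can appear consecutively in the rule."""
    gaps = []
    for i, left in enumerate(rule):
        if not left.terminal:
            continue
        for right in rule[i + 1:]:
            if right.terminal:
                gaps.append((left.name, right.name))
                break
            if not right.optional:
                break  # a required child always separates the terminals
    return gaps


# Hypothetical flattened rule for a C-like 'for' statement.
for_rule = [
    Item("for", True),
    Item("(", True),
    Item("initializer", False, optional=True),
    Item(";", True),
    Item("condition", False, optional=True),
    Item(";", True),
    Item("update", False, optional=True),
    Item(")", True),
    Item("body", False),
]

for index, gap in enumerate(internal_slot_gaps(for_rule)):
    print(index, gap)
# 0 ('for', '(')
# 1 ('(', ';')
# 2 (';', ';')
# 3 (';', ')')
```

Each gap in this model corresponds to one place where a comment or inner whitespace could appear without any child AST to attach to.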

A further 'text' slot is also used for the subset of ASTs known as computed-text ASTs. These ASTs hold information that is variable and must be computed and stored when the AST is created. Computed-text ASTs can be identified with computed-text-node-p.

When creating ASTs, patch-whitespace can be used to insert whitespace in relevant places. It uses whitespace-between to determine how much whitespace should be placed in each slot. It does not currently populate inner-asts whitespace.
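The division of labor between the two functions can be pictured with a toy model. The Python below only borrows the names; the spacing rule, the token-list input, and the return shape are assumptions for illustration and do not reflect the library's signatures or style rules.

```python
from typing import List, Optional, Tuple


def whitespace_between(left: str, right: str) -> str:
    """Toy style rule: no space after '(' or before ',', ';', ')'."""
    if left == "(" or right in (",", ";", ")"):
        return ""
    return " "


def patch_whitespace(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with the before-text chosen for it."""
    patched = []
    previous: Optional[str] = None
    for token in tokens:
        before = "" if previous is None else whitespace_between(previous, token)
        patched.append((before, token))
        previous = token
    return patched


def render(tokens: List[str]) -> str:
    return "".join(before + token for before, token in patch_whitespace(tokens))


print(render(["x", "=", "a", "+", "b", ";"]))  # x = a + b;
print(render(["(", "a", ",", "b", ")"]))       # (a, b)
```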