7 Internals
This section provides some information about the implementation of this package. It is not required at all to understand how to use any of the C Metaprogramming Utilities. None of the modules documented in this section should be used externally, and are subject to incompatible changes at any time.
7.1 Parser
C is not an easy language to parse. The grammar is essentially LALR(1), except for one nasty wrinkle: there are two lexical classes that are considered distinct in the grammar but whose lexical definitions are identical: identifier and typedef_name.
The language distinguish these two classes by their bindings, so parsers must maintain an environment and make the environment available to the lexer. However, because the grammar is LALR(1), the lexer may sometimes look ahead of the current production(s) being parsed. Consequently, the lexer must also make sure that it does not look ahead to an apparent identifier before committing all necessary updates to the environment.
typedef int T; |
T a; |
The following articles were useful in developing the parser:
ISO/IEC 9899:TC3—
the C99 standard. A Tour of C99—
a 2004 article on some of the new, exotic features of C99. Practical Parsing for ANSI C—
an article in Dr. Dobb’s Journal with a few helpful tips on parsing C. It has some errors, however, and shouldn’t be considered authoritative. Parsing C, the last word from comp.compilers—
a message from Jim Roskind back in 1992 with some tips on tracking typedef bindings. Parsing C typedefs from comp.compilers—
a message from Dale R. Worley back in 1992 with excellent advice on how to allow typedef names in certain identifier contexts without generating LALR conflicts.
7.1.1 Grammar Definitions
A declarator terminator is a token of token-name 'COMMA, '=, or 'SEMI_COLON.
A declarator id is the value of the DeclaratorId attribute of a ‹Declarator› node, using the following attribution rules for the language grammar:
| ‹Declarator›X | ::= | [‹Pointer›] ‹DirectDeclarator›X $0.DeclaratorId ← $1.DeclaratorId |
| ‹DirectDeclarator›X | ::= | X $0.DeclaratorId ← $1 |
|
| | | "(" ‹Declarator›X ")" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "[" ‹TypeQualifier›* [‹AssignmentExpression›] "]" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "[" "static" ‹TypeQualifier›* ‹AssignmentExpression› "]" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "[" TypeQualifier+ "static" ‹AssignmentExpression› "]" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "[" ‹TypeQualifier›* "*" "]" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "(" ‹ParameterTypeList› ")" $0.DeclaratorId ← $1.DeclaratorId |
|
| | | ‹DirectDeclarator›X "(" [‹List›‹Identifier›] ")" $0.DeclaratorId ← $1.DeclaratorId |
7.1.2 Grammar Invariants
Following are some invariants—
Every ‹Declarator› has a declarator id.
In a ‹Declaration› or ‹StructDeclaration› production, a ‹TypedefName› can be used as a declarator id iff no ‹TypeSpecifier› precedes it in the declaration.
In a valid program, the token immediately following a ‹Declarator› is always a declarator terminator.
7.1.3 Syntactic Contexts
The internal module c/private/syntactic-context defines structures that are used internally to maintain information about parsing and lexing state. It could be accessed via
(require c/private/syntactic-context) | package: c-utils |
but generally shouldn’t be used.
struct
(struct lexer-state ( read? source offset declarators brace-depth parenthesis-depth previous-token) #:extra-constructor-name make-lexer-state) read? : boolean? source : any offset : (or/c position? #f) declarators : (listof (box/c (or/c symbol? #f))) brace-depth : exact-nonnegative-integer? parenthesis-depth : exact-nonnegative-integer? previous-token : (or/c token? #f)
The read? field is used by the lexer for implementing a C reader. When read? is #t, the lexer reports an EOF token when it reaches a terminating delimiter.
The source field is used to identify the input source, such as with a path. The offset is used as a base offset for generating source location information. When offset is #f the base location (make-position 1 1 0) is used.
The declarators stack contains the cache for the current declarator id if the current context is a declaration context. Both the lexer state and parser state maintain views of the declarators stack, which can be compared to determine whether the lexer is currently looking ahead. The car of the stack is the current declarator cache.
The brace-depth tracks the nesting depth of curly-brace delimiters, and the parenthesis-depth tracks the nesting depth of parenthesis delimiters.
The previous-token stores the previously lexed token or #f if no tokens have been read yet.
'block—
a block context, such as the top-level of a program or compound statement. 'formals—
the formal arguments to a function definition or function signature. 'preamble—
the preamble to a function definition body, i.e., the declarations preceding the function body in a K & R-style function definition. 'struct—
the body of a struct type. 'union—
the body of a union type. 'enum—
the body of an enum type. 'statement—
a non-declaration context such as a statement or expression.
'typedef—
a typedef declaration in a 'block major context. 'for—
a for statement in a 'statement major context. #f—
anything else.
struct
(struct parser-state (env declarators major-context minor-context) #:extra-constructor-name make-parser-state) env : (listof (listof (cons symbol (or/c 'var 'type)))) declarators : (listof (box/c (or/c symbol? #f))) major-context : (listof major context) minor-context : (listof minor context)
The declarators stack contains the parser’s view of the same cache as the lexer-state-declarators stack.
The major-context stack tracks the major context. The car of the stack is the current major context. The current major context is a declaration context if it is one of 'block, 'formals, or 'preamble.
The minor-context stack tracks the minor context. The car of the stack is the current minor context. A minor context is a typedef context if its value is 'typedef.
7.1.4 Lexer and Parser Invariants
Following are some invariants about the state of the lexer and parser.
After an ‹InitDeclaratorList› there is one extra declarator id left in both the lexer-state-declarators and parser-state-declarators stack.
In a declaration context, when the lexer reaches a declarator terminator it pops the lexer-state-declarators stack.
In a declaration context, the declarator id of every ‹Declarator› is either cached by the parser before the lexer reaches the next declarator terminator or immediately precedes the declarator terminator token.
In a declaration context, when the parser reaches a declarator terminator it adds the saved declarator id to the environment.
In a declaration context, the declarator id is added to the current environment when the lexer reaches a declarator terminator.
In a non-declaration context, declarator id’s are not cached, nor is the current environment frame ever extended.
In a declaration context, if the lexer reaches a declarator terminator and the parser-state-declarators and lexer-state-declarators are the same, then the parser has not reduced the preceding ‹Declarator›, i.e., the lexer is looking ahead.
In a declaration context, if the parser reduces a ‹Declarator› and the parser-state-declarators and lexer-state-declarators are the same, then the lexer has not reached the declarator terminator, i.e., the lexer has not looked ahead.
The lexer never reads more than one token ahead of the innermost production preceding the parser’s current position in the grammar.
7.1.5 Type Names as Identifiers
A popular but non-standard extension to C is to allow ‹TypedefName› tokens to be used in many of the contexts that expect an ‹Identifier›. For example, a variable declaration may shadow an existing name that was already bound by typedef:
typedef int T; |
void foo(char T) { } |
However, these extensions do result in one esoteric ambiguity: in a parameter declaration such as
typedef void (*fn)(int (T)); |
it is possible to interpret the parenthesized ‹TypedefName› T as either a
parenthesized declaration—
To be a conservative extension of standard C, however, the ambiguity must be resolved with the latter interpretation: the token T is treated as a type name, not an identifier.
The implementation of this resolution actually requires forking the definition of ‹DirectDeclarator› to provide an alternative version whose parenthesized production does not allow occurrences of ‹TypedefName›.
7.2 Header Compilation
The compilation of headers uses a curious computational idiom of two separate monads in two distinct stages. The first monadic pass (“precompilation”) generates a computation in the second monadic pass (“compilation”).
More to follow.