On this page:
7.1 Parser
7.1.1 Grammar Definitions
7.1.2 Grammar Invariants
7.1.3 Syntactic Contexts
lexer-state
parser-state
7.1.4 Lexer and Parser Invariants
7.1.5 Type Names as Identifiers
7.2 Header Compilation

7 Internals

This section provides some information about the implementation of this package. It is not required at all to understand how to use any of the C Metaprogramming Utilities. None of the modules documented in this section should be used externally, and are subject to incompatible changes at any time.

7.1 Parser

C is not an easy language to parse. The grammar is essentially LALR(1), except for one nasty wrinkle: there are two lexical classes that are considered distinct in the grammar but whose lexical definitions are identical: identifier and typedef_name.

The language distinguish these two classes by their bindings, so parsers must maintain an environment and make the environment available to the lexer. However, because the grammar is LALR(1), the lexer may sometimes look ahead of the current production(s) being parsed. Consequently, the lexer must also make sure that it does not look ahead to an apparent identifier before committing all necessary updates to the environment.

For example, in a program such as:

typedef int T;

T a;

the second occurrence of T might be tokenized during lookahead before the parser has had a chance to update the the environment from the first declaration. As a result, the lexer must be careful whenever it reaches the end of a potential declaration to commit any outstanding updates to the environment.

The following articles were useful in developing the parser:

7.1.1 Grammar Definitions

A declarator terminator is a token of token-name 'COMMA, '=, or 'SEMI_COLON.

A declarator id is the value of the DeclaratorId attribute of a ‹Declarator› node, using the following attribution rules for the language grammar:

 

DeclaratorX

 ::= 

[‹Pointer›] DirectDeclaratorX   $0.DeclaratorId  $1.DeclaratorId

 

DirectDeclaratorX

 ::= 

X   $0.DeclaratorId  $1

 

  |  

"(" DeclaratorX ")"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier›* [‹AssignmentExpression›] "]"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" "static" TypeQualifier›* AssignmentExpression "]"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier+ "static" AssignmentExpression "]"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "[" TypeQualifier›* "*" "]"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "(" ParameterTypeList ")"   $0.DeclaratorId  $1.DeclaratorId

 

  |  

DirectDeclaratorX "(" [‹ListIdentifier] ")"   $0.DeclaratorId  $1.DeclaratorId

7.1.2 Grammar Invariants

Following are some invariants—essentially, informal lemmas—about the implemented grammar:

7.1.3 Syntactic Contexts

The internal module c/private/syntactic-context defines structures that are used internally to maintain information about parsing and lexing state. It could be accessed via

 (require c/private/syntactic-context) package: c-utils

but generally shouldn’t be used.

struct

(struct lexer-state (read?
    source
    offset
    declarators
    brace-depth
    parenthesis-depth
    previous-token)
    #:extra-constructor-name make-lexer-state)
  read? : boolean?
  source : any
  offset : (or/c position? #f)
  declarators : (listof (box/c (or/c symbol? #f)))
  brace-depth : exact-nonnegative-integer?
  parenthesis-depth : exact-nonnegative-integer?
  previous-token : (or/c token? #f)
State pertaining to the lexer.

The read? field is used by the lexer for implementing a C reader. When read? is #t, the lexer reports an EOF token when it reaches a terminating delimiter.

The source field is used to identify the input source, such as with a path. The offset is used as a base offset for generating source location information. When offset is #f the base location (make-position 1 1 0) is used.

The declarators stack contains the cache for the current declarator id if the current context is a declaration context. Both the lexer state and parser state maintain views of the declarators stack, which can be compared to determine whether the lexer is currently looking ahead. The car of the stack is the current declarator cache.

The brace-depth tracks the nesting depth of curly-brace delimiters, and the parenthesis-depth tracks the nesting depth of parenthesis delimiters.

The previous-token stores the previously lexed token or #f if no tokens have been read yet.

A major context is one of
  • 'blocka block context, such as the top-level of a program or compound statement.

  • 'formalsthe formal arguments to a function definition or function signature.

  • 'preamblethe preamble to a function definition body, i.e., the declarations preceding the function body in a K & R-style function definition.

  • 'structthe body of a struct type.

  • 'unionthe body of a union type.

  • 'enumthe body of an enum type.

  • 'statementa non-declaration context such as a statement or expression.

A minor context is one of
  • 'typedefa typedef declaration in a 'block major context.

  • 'fora for statement in a 'statement major context.

  • #fanything else.

struct

(struct parser-state (env declarators major-context minor-context)
    #:extra-constructor-name make-parser-state)
  env : (listof (listof (cons symbol (or/c 'var 'type))))
  declarators : (listof (box/c (or/c symbol? #f)))
  major-context : (listof major context)
  minor-context : (listof minor context)
State pertaining to the parser. The env field is the current environment.

The declarators stack contains the parser’s view of the same cache as the lexer-state-declarators stack.

The major-context stack tracks the major context. The car of the stack is the current major context. The current major context is a declaration context if it is one of 'block, 'formals, or 'preamble.

The minor-context stack tracks the minor context. The car of the stack is the current minor context. A minor context is a typedef context if its value is 'typedef.

7.1.4 Lexer and Parser Invariants

Following are some invariants about the state of the lexer and parser.

7.1.5 Type Names as Identifiers

A popular but non-standard extension to C is to allow ‹TypedefName› tokens to be used in many of the contexts that expect an ‹Identifier›. For example, a variable declaration may shadow an existing name that was already bound by typedef:

typedef int T;

void foo(char T) { }

However, these extensions do result in one esoteric ambiguity: in a parameter declaration such as

typedef void (*fn)(int (T));

it is possible to interpret the parenthesized ‹TypedefNameT as either a parenthesized declaration—i.e., an int parameter, named T, to the fn function type—or a nested function parameter list—i.e., an anonymous parameter of type T to a function type of return type int.

To be a conservative extension of standard C, however, the ambiguity must be resolved with the latter interpretation: the token T is treated as a type name, not an identifier.

The implementation of this resolution actually requires forking the definition of ‹DirectDeclarator› to provide an alternative version whose parenthesized production does not allow occurrences of ‹TypedefName›.

7.2 Header Compilation

The compilation of headers uses a curious computational idiom of two separate monads in two distinct stages. The first monadic pass (“precompilation”) generates a computation in the second monadic pass (“compilation”).

More to follow.