How to parse contextual keywords in a programming language

Thanks to Lucian Wischik for reviewing this article.

Table of contents:

What’s the problem with contextual keywords?
- Parsing can be difficult when there is ambiguity
Approachs
Related posts
Comments

What’s the problem with contextual keywords?

Ordinarily in a programming language, a keyword isn’t usable as an identifier. For example, you often can’t name a variable int, because int is the keyword used to denote the integral numeric type.

Sometimes, you want to add a new keyword to a language, but can’t break backward compatibility. For example, when C# added async/await, they needed to retain async as an identifier, so that old programs that used variables or classes named async would still compile.

For example, you may have written code that used async as an identifier in a program written for a previous version of C#:

bool async = true;

But you should now also be able to write code using async with its special meaning as a keyword:

async Task<void> foo()
{
  // ...
}

That is to say, the token async should be treated as an identifier under some circumstances, and as a keyword under others. A keyword with this property is called a “contextual keyword” in C#. Other languages have identical concepts, such as override and final in C++.

Lexing is typically strictly separated from parsing. Then the question is: when should we lex the string “async” appearing in the source code as an identifier, and when should we lex it as the async token? We need information about the surrounding tokens to decide that, but we don’t have that until the parsing phase, which runs later.

Parsing can be difficult when there is ambiguity

Sometimes, a keyword has no ambiguity during parsing. If the async keyword can never be used in a place where a valid identifier could appear, and vice versa, then the problem can be solved in a straightforward manner, discussed shortly.

Unfortunately, this isn’t the case in many languages with async/await. For example, consider the following code snippet of C#:

async(foo, bar)

Depending on the following token, this could be either a function call or a lambda declaration. If an operator follows in such away that this appears to be expression, then we have a function call:

async(foo, bar) + 3 // calling the function named `async`

But if a double-arrow follows, then it must be a lambda:

async(foo, bar) => foo + bar // declaring an async lambda

You would have to look ahead an arbitrary number of tokens (from the async keyword to the next operator) in order to disambiguate these two cases, so oftentimes implementing a contextual keyword can’t be solved with a small change to the grammar.

Approachs

I asked some coworkers who have worked on languages how this is implemented, and detailed how various languages (mainly C#) implement contextual keywords, with examples.

Parsing `partial` directly in the grammar

C# has the partial keyword which is used to mark a class as being implemented in multiple source files. The grammar is such that the partial keyword only appears before the class keyword in a class definition. For example:

partial class Foo {
  void DoSomething() {
    int partial = 3;
    // ...
  }
}

The partial keyword can be parsed unambiguously, because an identifier can’t appear as a modifier when declaring a class, and the partial token can’t appear anywhere the grammar expects an identifier.

The idea is allow the partial token to be a valid value for an identifier. This is especially useful for generated parsers, since it requires only a grammar change, and doesn’t fundamentally change lexing or parsing.

It can be done by amending your grammar as follows:

partial ::= "partial"
ident_ ::= [a-zA-Z]+
ident ::=
  | ident_
  | partial
  | /* any other contextual keywords... */

This allows you to write partial anywhere you need an identifier, but also have access to the partial keyword. When extracting the actual identifier name from the identifier node in the parse tree, you just need to handle the case of the identifier being the partial token, and return the literal string “partial” in that case.

Having special syntax for a verbatim identifier

C# supports prefixing any identifier with the @ symbol to use any name as an identifier, regardless of conflicts with keywords. For example, you could write @for to refer to an identifier named “for”.

The @ symbol used in this way is unambiguous in the C# grammar, so it’s easily implemented. However, it doesn’t actually solve the backward-compatibility problem, since any new keywords you minted will still be parsed incorrectly in older programs.

Verbatim identifiers are useful mostly in a case such as interoperating with another language where the identifier in question is not a keyword. Depending on your needs, this may be sufficient, and will obviate reworking the grammar of your language.

Parsing `async` syntactically using arbitrary lookahead

C# and Flow, probably among others, use a purely-syntactic approach to disambiguate the async token from an identifier named “async”. The idea is to scan ahead in the token stream until you see a token that disambiguates whether it’s a keyword or an identifier. The exact implementation will depend on the details of the grammar in question, but you can find the current implementation of parsing async in C# here.

Parsing `var` using semantic information

C# uses the var keyword in the place of an actual type name to tell the compiler to infer the type of the given variable. That is, you can type

var foo = MakeAnInt();

and the compiler will determine from the return type of MakeAnInt that the variable foo should be of type int.

Fundamentally, this can’t be resolved in a purely syntactic way. By definition, var is allowed to appear in a strict subset of places where an identifier (referring to a type name) could appear.

C# solves this problem by looking up the var symbol in the semantic analyzer, and only treating it as the var keyword if there is no type named var in the current scope.

Parsing `async` syntactically using constant lookahead

The Hack language added async with the same backward-compatibility issues present in other languages that also added it. In particular, the lambda syntax is hard to parse in the same way that it’s hard to parse in C# and Javascript. For example, the following is hard to parse:

async ($foo, $bar)

It could be a function call:

async ($foo, $bar) + 3

Or it could be lambda syntax:

async ($foo, $bar) ==> $foo + $bar

Hack “solves” this problem while only needing constant lookahead. If there is a space after async, it is parsed as the async keyword. Otherwise it is parsed as an identifier. Thus, the above example is unambiguous: it is parsed as a lambda.

This has serious consequences, because a developer has to be aware of this implementation detail, or otherwise incorrectly add spaces to their code and get inscrutable parse errors.

But hey — “if it compiles, it works”, right?

The following are hand-curated posts which you might find interesting.

Date		Title
09 Apr 2017	(this post)	How to parse contextual keywords in a programming language
23 Dec 2017		Why LINQ syntax differs from SQL, list comprehensions, etc.

Want to see more of my posts? Follow me on Twitter or subscribe via RSS.

What’s the problem with contextual keywords?

Parsing can be difficult when there is ambiguity

Approachs

Parsing partial directly in the grammar

Having special syntax for a verbatim identifier

Parsing async syntactically using arbitrary lookahead

Parsing var using semantic information

Parsing async syntactically using constant lookahead

Related posts

Comments

Parsing `partial` directly in the grammar

Parsing `async` syntactically using arbitrary lookahead

Parsing `var` using semantic information

Parsing `async` syntactically using constant lookahead