
Parsing is Difficult


Blog Theme

https://news.ycombinator.com/item?id=13821829 -- Real parsers are hand-written; they use code generators or meta-languages.

https://news.ycombinator.com/item?id=13041646 -- metalanguage feedback in response to the old META II paper. The Ruby parser is complex; the JRuby parser is an example of the same language being parsed twice.

https://news.ycombinator.com/item?id=13630686 -- my own metalanguage

Language Design and Theory of Computation -- CFGs aren't powerful enough. Java has two grammars.

A note about error messages: are parse errors really hard? They all seem similar, as long as you have location info. Python doesn't seem to have any special handling of parse errors; it uses a grammar.

I think type-checking error messages are hard, like Clang diagnostics. And runtime errors to an extent. But parse errors may be easy if the language is straightforward enough.

The one place I might want an advanced error message is for $((: you want to know where the parser started treating things as arithmetic. The place you fix is different from the place the error occurred.
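
For example, here is a hypothetical sketch (not Oil's actual code; the ParseError and parse_arith_sub names are made up for illustration) of recording where $(( began, so the error message can point at both the place parsing failed and the place to fix:

class ParseError(Exception):
    def __init__(self, msg, pos, started_at):
        super().__init__(msg)
        self.msg = msg
        self.pos = pos                 # where the error was detected
        self.started_at = started_at   # where $(( began -- the place to fix

def parse_arith_sub(src, i):
    """Skip over '$(( ... ))' where src[i] is the '$' (crude: ignores nesting)."""
    start = i
    close = src.find('))', i + 3)
    if close == -1:
        raise ParseError("expected '))' to close arithmetic substitution",
                         pos=len(src), started_at=start)
    return close + 2

try:
    parse_arith_sub('echo $(( 1 + 2 ', 5)
except ParseError as e:
    print('%s at offset %d (the $(( starting at offset %d is what needs fixing)'
          % (e.msg, e.pos, e.started_at))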

Real Languages have at least Two Parsers

  • Python: CPython parser, lib2to3, redbaron (see the lossless round-trip sketch after this list)

  • Ruby: MRI parser, JRuby parser

  • Go: C parser, Go parser

  • Scala: seems like scala-meta might have another parser

  • JavaScript: many different parsers

  • Unified parsers:

    • Clang
    • Roslyn
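
One concrete reason for the extra parsers: the "official" parser is usually lossy. A small illustration of the difference, using only the Python standard library: the ast module drops comments and whitespace, while lib2to3 keeps them as node prefixes, so its tree can round-trip the source exactly.

import ast
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver

src = "x  =  1  # oddly spaced, with a comment\n"

# The compiler's view: no comment, no spacing.
print(ast.dump(ast.parse(src)))

# lib2to3 attaches whitespace and comments to tokens as "prefix" text,
# so printing the tree reproduces the original source exactly.
drv = driver.Driver(pygram.python_grammar, convert=pytree.convert)
tree = drv.parse_string(src)
assert str(tree) == src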

From coverage.py: the Python stdlib tokenize module isn't good enough, because it's not lossless! It's also a second tokenizer in its own right, since CPython's C tokenizer isn't exposed to Python code.

def phys_tokens(toks):
    """Return all physical tokens, even line continuations.

    tokenize.generate_tokens() doesn't return a token for the backslash that
    continues lines.  This wrapper provides those tokens so that we can
    re-create a faithful representation of the original source.

    Returns the same values as generate_tokens()
    """
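
The body of the wrapper is omitted here; per its docstring, it fills in the tokens the stdlib tokenizer drops. To see the gap it papers over, run the stdlib tokenizer on a line with a backslash continuation (a small demo, not coverage.py's code):

import io
import tokenize

src = "x = 1 + \\\n    2\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))

# The '+' ends at (1, 7) and the '2' starts at (2, 4); no token accounts for
# the backslash-newline in between, so joining the token strings cannot
# re-create the original source.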

Gaps Between Theory and Practice
