
Polyglot Language Understanding

andychu edited this page Jan 2, 2024 · 36 revisions

This is a survey of projects/research that try to understand multiple programming languages in a "unified" way.

There are many lexing / syntax-highlighting-only projects toward the end of the page. The more interesting ones attempt something closer to parsing, and even semantic analysis.

But the simpler projects naturally support the most languages. They're valuable "corpora" of language info.

This page is editable -- feel free to add other projects, with links, a description, and why they're interesting.


I made a rough categorization into light vs. heavy, which refers to how much code is shared between language "back ends". If no code is shared, it's "heavy".

That is, you could "simply" import entire compiler front ends and output protobufs, which I believe is what Google Kythe did. That would be heavy. Or you could rewrite lightweight lexers/parsers for every language in your own DSL, which would be light.

(Note: light is not necessarily better than heavy!)
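To make the "light" end of the spectrum concrete, here is a minimal sketch in Python. It assumes a hypothetical `RULES` table as the "DSL": each language contributes only a few regexes, and one `lex` function is shared by all of them. The rule names and patterns are illustrative only and are not taken from any project on this page.

```python
import re

# Hypothetical per-language rule tables. This is the only language-specific part.
RULES = {
    "python": [
        ("comment", r"#[^\n]*"),
        ("string", r"'[^'\n]*'|\"[^\"\n]*\""),
        ("word", r"[A-Za-z_]\w*"),
    ],
    "c": [
        ("comment", r"//[^\n]*|/\*.*?\*/"),
        ("string", r"\"(?:\\.|[^\"\\\n])*\""),
        ("word", r"[A-Za-z_]\w*"),
    ],
}


def lex(lang, text):
    """Shared engine: return (kind, lexeme) pairs covering the whole input.

    Text that no rule matches is emitted as an 'other' token, so joining
    the lexemes always reproduces the input.
    """
    pattern = re.compile(
        "|".join(f"(?P<{name}>{regex})" for name, regex in RULES[lang]),
        re.DOTALL,  # lets /* ... */ span lines
    )
    tokens = []
    pos = 0
    for m in pattern.finditer(text):
        if m.start() > pos:
            tokens.append(("other", text[pos:m.start()]))
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(text):
        tokens.append(("other", text[pos:]))
    return tokens
```

Adding a language here means adding a table entry, not code. A "heavy" design would instead call a separate, full front end per language and share only the output format.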


Note that finding patterns for syntax highlighting kind of "bleeds into" the problem of finding patterns that indicate bugs and security issues.

Lightweight Implementations

Concept: Island Grammars. An island grammar only precisely defines small portions of the syntax of a language. The rest of the syntax is defined imprecisely, for instance as a list of characters, or a list of tokens.
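As a rough illustration of the idea, the sketch below treats Java-style `import` statements as the only "island" and everything else as undifferentiated "water". The `ISLAND` pattern and the `parse_islands` helper are hypothetical names made up for this example. A real island grammar would normally be written in a grammar formalism rather than as a single regex.

```python
import re

# The island: a Java-style import statement, defined precisely.
ISLAND = re.compile(
    r"\bimport\s+(?:static\s+)?"
    r"([A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*(?:\.\*)?)\s*;"
)


def parse_islands(text):
    """Return ('island', imported_name) and ('water', raw_text) nodes in order."""
    nodes = []
    pos = 0
    for m in ISLAND.finditer(text):
        if m.start() > pos:
            # Water: kept as an opaque run of characters, never parsed.
            nodes.append(("water", text[pos:m.start()]))
        nodes.append(("island", m.group(1)))
        pos = m.end()
    if pos < len(text):
        nodes.append(("water", text[pos:]))
    return nodes
```

The imprecision cuts both ways: the rest of the file never has to parse, but an `import` that appears inside a comment or string literal would be reported as an island too.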

Heavyweight Implementations

Polyglot Interfaces

Syntax Highlighting

  • TODO: link to these

    • TextMate Grammars
    • Vim grammars
    • Google code prettify (archived JavaScript library)
  • ctags (Universal, Exuberant) -- Integrated with vim. Very approximate, text-only analysis of languages.

    • See FAQ on "what happens when it's wrong?"
    • Although it's not clear how much sharing there is
    • Used by the OpenGrok source browser (written in Java)
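To show what "very approximate, text-only analysis" can look like, here is a small line-by-line tagger in Python. It is not how ctags is implemented. `TAG_PATTERNS` and `find_tags` are hypothetical names, and the regexes are deliberately naive.

```python
import re

# Hypothetical per-language patterns; group 1 captures the definition name.
TAG_PATTERNS = {
    "python": [
        ("function", re.compile(r"^\s*def\s+(\w+)")),
        ("class", re.compile(r"^\s*class\s+(\w+)")),
    ],
    "go": [
        # Optional receiver, e.g. "func (s *S) Run()".
        ("function", re.compile(r"^func\s+(?:\([^)]*\)\s*)?(\w+)")),
        ("type", re.compile(r"^type\s+(\w+)")),
    ],
}


def find_tags(lang, text):
    """Return (name, line_number, kind) for every line that looks like a definition."""
    tags = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in TAG_PATTERNS[lang]:
            m = pattern.match(line)
            if m:
                tags.append((m.group(1), lineno, kind))
                break  # at most one tag per line
    return tags
```

Because the tagger never builds a syntax tree, a line such as `def fake():` inside a multi-line string is tagged like a real definition. That is the trade-off for covering many languages with very little code.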

Other Surveys
