Polyglot Language Understanding
This is a survey of projects/research that try to understand multiple programming languages in a "unified" way.
There are many lexing / syntax-highlighting-only projects toward the end of the page. The more interesting ones attempt something closer to parsing, and even semantic analysis.
But the simpler projects are naturally the most comprehensive in terms of the number of languages supported. They're valuable "corpuses" of language info.
This page is editable -- feel free to add other projects, with links, a description, and why they're interesting.
I made a rough categorization by light vs. heavy. It refers to how much code is shared between language "back ends". If no code is shared, it's "heavy".
That is, you could "simply" import entire compiler front ends and output protobufs, which I believe is what Google Kythe did. That would be heavy. Or you could rewrite lightweight lexers/parsers for every language in your own DSL, which would be light.
(Note: light is not necessarily better than heavy!)
Note that finding patterns for syntax highlighting kind of "bleeds in" to the problem of finding patterns that indicate bugs and security issues.
- uchex / microchex (Stanford paper, 2016)
- implemented with Haskell Parsec, original implementation was Python
- How to Build Static Checking Systems Using Orders of Magnitude Less Code (PDF)
- Morning Paper Writeup
- micro-grammars, parser combinators
- "belief-style checkers" (not the only supported technique)
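To make the micro-grammar idea concrete, here is a minimal sketch in Python (not the uchex implementation, which uses Haskell Parsec). It precisely describes only three things: the `if` keyword, a balanced parenthesized condition, and a stray `;`. Everything else is skipped, so the same checker runs unchanged on C, Java, or JavaScript. The names `find_empty_ifs` and `skip_balanced` are made up for illustration, and the sketch ignores parentheses inside strings and comments.

```python
import re

IF_KEYWORD = re.compile(r"\bif\b")


def skip_balanced(src, i):
    """Given src[i] == '(', return the index just past the matching ')', or -1."""
    depth = 0
    while i < len(src):
        if src[i] == "(":
            depth += 1
        elif src[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1


def find_empty_ifs(src):
    """Return 1-based line numbers of `if (...) ;` statements (empty bodies)."""
    hits = []
    for m in IF_KEYWORD.finditer(src):
        i = m.end()
        while i < len(src) and src[i].isspace():
            i += 1
        if i >= len(src) or src[i] != "(":
            continue  # not the shape we describe; skip it
        j = skip_balanced(src, i)
        if j == -1:
            continue  # unbalanced condition; give up on this candidate
        while j < len(src) and src[j].isspace():
            j += 1
        if j < len(src) and src[j] == ";":
            hits.append(src.count("\n", 0, m.start()) + 1)
    return hits


SAMPLE = (
    "int f(int x) {\n"
    "    if (g(x) > 0);\n"
    "        return 1;\n"
    "    if (x) { return 2; }\n"
    "    return 0;\n"
    "}\n"
)

if __name__ == "__main__":
    print(find_empty_ifs(SAMPLE))
```

The trade-off is the one the paper title hints at: far less code per language, at the cost of false positives and negatives wherever the skipped text matters.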
- Comby
- Implemented in OCaml
- https://comby.dev/en/projects - CMU paper
- Strange Loop 2019 - "Parser Parser Combinators for Program Transformation" by Rijnard van Tonder
- sylver
- not open source?
- Sylver is a language-agnostic tool for source code exploration and analysis.
- *Using the SYLQ query language REPL, you can perform syntax-aware search on your codebase to find style issues and potential bugs.*
Concept: Island Grammars. An island grammar only precisely defines small portions of the syntax of a language. The rest of the syntax is defined imprecisely, for instance as a list of characters, or a list of tokens.
- semgrep / coccinelle (OCaml)
- Semgrep: a static analysis journey (2021) - How an academic project for the Linux kernel evolved into a multilingual security tool
- INRIA -> Facebook -> r2c
- facebook/pfff repo (OCaml)
- https://github.com/github/semantic -- appears inactive
- Haskell
- Google Kythe - open source version of code search project started by Steve Yegge
- ROSE
- Developed at Lawrence Livermore National Laboratory (LLNL), ROSE is an open source compiler infrastructure to build source-to-source program transformation and analysis tools for large-scale C (C89 and C98), C++ (C++98 and C++11), UPC, Fortran (77, 95, 2003), OpenMP, Java, Python, PHP, and Binary applications.
- ROSE is particularly well suited for building custom tools for static analysis, program optimization, arbitrary program transformation, domain-specific optimizations, complex loop optimizations, performance analysis, and cyber-security
- Written in C++ - https://github.com/rose-compiler/rose/tree/weekly/src/AstNodes/Expression
- Doxygen
- [Doxygen] automates the generation of documentation from source code comments, parsing information about classes, functions, and variables to produce output in formats like HTML and PDF
- Doxygen provides robust support for documenting C++ code, recognizing the intricacies of the language and generating comprehensive documentation.
- Next to C++, Doxygen also supports C, Python, PHP, Java, C#, Objective-C, Fortran, VHDL, Splice, IDL, and Lex.
- Atom
- [Atom] is a novel intermediate representation and a cli tool for parsing and slicing codebases in multiple programming languages
- Generate usages, data flows, and reachable flow slices for codebases in json format
- Export the various representations including data flows to graphml and dot format for advanced visualization and analysis
- Written in Scala and distributed as a container image and npm package.
- SCIP - a better code indexing format than LSIF (Sourcegraph, 2022)
- Sourcegraph code navigation such as “Go to definition” comes in two flavors: search-based and precise. Search-based code navigation is available out-of-the-box. It is fast and always available, but it can occasionally return false-positive and false-negative results. Precise code navigation, on the other hand, requires custom configuration to set up, but the results are compiler-accurate and work across repositories. Both search-based and precise code navigation are useful in their own ways. While search-based is less powerful, it is a quick and convenient solution. Precise is more powerful, but it also requires more upfront investment to configure.
- scip-typescript: a new TypeScript and JavaScript indexer
- https://github.com/sourcegraph/scip-java
- Language Server Protocol
- TODO: link to these
- TextMate Grammars - Manual
- Vim grammars - vimdoc Manual
- https://github.com/googlearchive/code-prettify (archived JavaScript syntax highlighting library)
- ctags (Universal, Exuberant) -- Integrated with vim. Very approximate, text-only analysis of languages.
- See FAQ on "what happens when it's wrong?" https://ctags.sourceforge.net/faq.html#10
- Although it's not clear how much sharing there is
- Used by the OpenGrok source browser (written in Java)
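For a feel of what "very approximate, text-only analysis" means, here is a rough Python sketch of a regex-driven tag extractor. It is not how ctags is implemented; the `PATTERNS` table and `extract_tags` are invented for illustration, and real ctags parsers are much more elaborate. The sample shows the kind of wrong answer such a tool gives: a `def` inside a string literal is reported as a definition.

```python
import re

# Hypothetical per-language patterns: one regex per language, nothing shared
# except the driver loop below.
PATTERNS = {
    "python": re.compile(r"^[ \t]*def[ \t]+(\w+)", re.MULTILINE),
    "shell": re.compile(r"^[ \t]*(\w+)[ \t]*\(\)", re.MULTILINE),
}


def extract_tags(lang, src):
    """Return (name, 1-based line) pairs for lines that look like definitions."""
    tags = []
    for m in PATTERNS[lang].finditer(src):
        line = src.count("\n", 0, m.start(1)) + 1
        tags.append((m.group(1), line))
    return tags


PY_SAMPLE = (
    "def real():\n"
    "    pass\n"
    "\n"
    'DOC = """\n'
    "def fake():\n"
    '"""\n'
)

SH_SAMPLE = "greet() {\n  echo hi\n}\n"

if __name__ == "__main__":
    # 'fake' lives inside a string literal, but a text-only scan cannot tell.
    print(extract_tags("python", PY_SAMPLE))
    print(extract_tags("shell", SH_SAMPLE))
```

In this sketch the only shared code is the driver loop, so adding a language is one table entry. That makes it "light" in the sense used on this page, and also easy to fool.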
- Highlight, by Andre Simon
- https://gitlab.com/saalen/highlight
- Used by cgit (as a source filter)
- C++ Boost Regexes in Lua Config Files
- 250+ languages
- Shell - https://gitlab.com/saalen/highlight/-/blob/master/langDefs/shellscript.lang?ref_type=heads
- some here doc support -- as a Lua function plugin
- does it have nested strings support? I'm very curious to test many implementations
- this one is a command line tool, so it should be easy to test
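Here is one possible test case for the nested strings question, sketched in Python. It compares a naive regex for double-quoted strings against a small recursive scanner on a shell line where a command substitution inside a string contains another string. Both functions are illustrative only and say nothing about how Highlight itself works. The scanner deliberately ignores single quotes, backticks, here docs, `$(( ))` arithmetic, and bare parentheses inside `$( )`.

```python
import re

# Naive approach: a string is a quote, non-quote characters, then a quote.
NAIVE_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')


def naive_strings(src):
    return [m.group(0) for m in NAIVE_STRING.finditer(src)]


def scan_dquote(src, i):
    """src[i] is an opening '"'. Return the index just past its closing quote."""
    i += 1
    while i < len(src):
        if src[i] == "\\":
            i += 2
        elif src[i] == '"':
            return i + 1
        elif src.startswith("$(", i):
            i = scan_command_sub(src, i + 2)  # code nested inside the string
        else:
            i += 1
    raise ValueError("unterminated string")


def scan_command_sub(src, i):
    """i is just past '$('. Return the index just past the matching ')'."""
    while i < len(src):
        if src[i] == "\\":
            i += 2
        elif src[i] == ")":
            return i + 1
        elif src[i] == '"':
            i = scan_dquote(src, i)  # string nested inside the code
        elif src.startswith("$(", i):
            i = scan_command_sub(src, i + 2)
        else:
            i += 1
    raise ValueError("unterminated command substitution")


def nested_strings(src):
    """Return the top-level double-quoted strings in src."""
    out = []
    i = 0
    while i < len(src):
        if src[i] == "\\":
            i += 2
        elif src[i] == '"':
            j = scan_dquote(src, i)
            out.append(src[i:j])
            i = j
        else:
            i += 1
    return out


LINE = 'echo "outer $(echo "inner") end"'

if __name__ == "__main__":
    print(naive_strings(LINE))   # two bogus fragments
    print(nested_strings(LINE))  # one string spanning the whole argument
```

Feeding `LINE` to a command line highlighter and looking at where the string color starts and stops should show which of the two behaviors it has.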