Exhaustive MySQL Parser #157

adamziel · 2024-08-17T10:43:37Z

Context

This PR ships an exhaustive MySQL lexer and parser that produce a MySQL query AST. This is the first step toward significantly improving MySQL compatibility and expanding WordPress plugin support on SQLite. It's a simpler, more stable, and easier-to-maintain approach than the current token processing. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.

This PR is part of the Advanced MySQL support project.

See the MySQL parser proposal for additional context.

This PR ships

A MySQL lexer, adapted from the AI-generated one by @adamziel. It's over 3x smaller and close to 2x faster.
A MySQL grammar written in ANTLR v4 format, adapted from the MySQL Workbench grammar by adding and fixing some cases and reordering some rules.
A script to factor, convert, and compress the grammar to a PHP array.
A dynamic recursive parser implemented by @adamziel.
A script to extract tests from the MySQL repository.
A test suite of almost 70k queries.
WIP SQLite driver by @adamziel, a demo and foundation for the next phase.

At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.

Running tests

The lexer and parser test suite is not yet integrated into the CI and existing test commands. To run the tests, use:

php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php

The first command lexes all ~70k queries; the second lexes and parses them.

Implementation

Parser

A simple recursive parser to transform (token stream, grammar) => parse tree. In this PR, we use MySQL tokens and the MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have some specific properties).

The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.
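To illustrate the idea of a parser driven entirely by grammar data, here's a minimal sketch. It is not the parser from this PR: the grammar array layout, the `parse_rule()` name, and the toy token types are all assumptions made for the example, and it tries branches in order without any lookahead.

```php
<?php
// Toy grammar (hypothetical layout): rule name => list of branches.
// Each branch is a sequence of symbols. A symbol that is a key of the
// grammar is a rule; anything else is a token type. An empty branch is ε.
$grammar = [
    'query'      => [['SELECT', 'columns', 'FROM', 'IDENTIFIER']],
    'columns'    => [['IDENTIFIER', 'columns_rr']],
    'columns_rr' => [[',', 'IDENTIFIER', 'columns_rr'], []],
];

// Try to match $rule at position $pos of $tokens.
// Returns [parse tree, next position], or false when no branch matches.
function parse_rule(array $grammar, array $tokens, string $rule, int $pos) {
    foreach ($grammar[$rule] as $branch) {
        $children = [];
        $cursor   = $pos;
        $matched  = true;
        foreach ($branch as $symbol) {
            if (isset($grammar[$symbol])) {
                // Rule reference: recurse into the referenced rule.
                $result = parse_rule($grammar, $tokens, $symbol, $cursor);
                if (false === $result) {
                    $matched = false;
                    break;
                }
                $children[] = $result[0];
                $cursor     = $result[1];
            } elseif (isset($tokens[$cursor]) && $tokens[$cursor]['type'] === $symbol) {
                // Token: consume it.
                $children[] = $tokens[$cursor];
                $cursor++;
            } else {
                $matched = false;
                break;
            }
        }
        if ($matched) {
            return [['rule' => $rule, 'children' => $children], $cursor];
        }
    }
    return false;
}

// Tokens for: SELECT a, b FROM t
$tokens = [
    ['type' => 'SELECT', 'text' => 'SELECT'],
    ['type' => 'IDENTIFIER', 'text' => 'a'],
    ['type' => ',', 'text' => ','],
    ['type' => 'IDENTIFIER', 'text' => 'b'],
    ['type' => 'FROM', 'text' => 'FROM'],
    ['type' => 'IDENTIFIER', 'text' => 't'],
];
$result = parse_rule($grammar, $tokens, 'query', 0);
```

All the language knowledge lives in `$grammar`; swapping in a different grammar array and token stream would reuse the same function unchanged.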

run-mysql-driver.php

A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand, and maintain than the current implementation. I stand by this, even though the temporary ParseTreeTools/SQLiteTokenFactory API included in this PR seems annoying, and I'd like to ship something better than that. Here's a glimpse:

function translateQuery($subtree, $rule_name=null) {
    if(is_token($subtree)) {
        $token = $subtree;
        switch ($token->type) {
            case MySQLLexer::EOF: return new SQLiteExpression([]);
            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue($token)
                );

            default:
                return SQLiteTokenFactory::raw($token->text);
        }
    }

    switch($rule_name) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's
            // skip them.
            return null;

        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular
            // FROM DUAL statement, as FROM mytable, DUAL is a syntax
            // error.
            if(
                ParseTreeTools::hasChildren($subtree, MySQLLexer::DUAL_SYMBOL) &&
                !ParseTreeTools::hasChildren($subtree, 'tableReferenceList')
            ) {
                return null;
            }
            // Don't fall through into the 'functionCall' case.
            break;

        case 'functionCall':
            $name = $subtree[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall($name, $subtree[0]['udfExprList']);
    }
}

Technical details

MySQL Grammar

We use the MySQL Workbench grammar, manually adapted, modified, and fixed, and converted from ANTLR4 format to a PHP array.

The grammar conversion pipeline is done by convert-grammar.php and goes like this:

Parse MySQLParser.g4 grammar into a PHP tree.
Flatten the grammar so that any nested rules become top-level and are referenced by generated names. This factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= SELECT %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
Expand *, +, ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
Compress and export the grammar as a PHP array. It replaces all string names with integers and ships an int->string map to reduce the file size.
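The modifier expansion step can be sketched roughly as follows for the `*` case. This is only an illustration, not the code in convert-grammar.php: the input format, the `expand_star()` name, and the helper naming scheme are assumptions.

```php
<?php
// Expand `(...)*` items into right-recursive helper rules.
//
// Hypothetical input format: rule name => list of branches. Each branch is
// a list of items; an item is either a symbol name (string) or
// ['*', [symbols...]] for a sequence repeated zero or more times.
function expand_star(array $grammar): array {
    $expanded = [];
    foreach ($grammar as $name => $branches) {
        $expanded[$name] = [];
        $helpers_created = 0;
        foreach ($branches as $branch) {
            $new_branch = [];
            foreach ($branch as $item) {
                if (is_array($item) && '*' === $item[0]) {
                    // First helper is "<rule>_rr", later ones get a number.
                    $helper = $name . '_rr' . ($helpers_created ? $helpers_created : '');
                    $helpers_created++;
                    // helper ::= sequence helper | ε
                    $expanded[$helper] = [
                        array_merge($item[1], [$helper]),
                        [],
                    ];
                    $new_branch[] = $helper;
                } else {
                    $new_branch[] = $item;
                }
            }
            $expanded[$name][] = $new_branch;
        }
    }
    return $expanded;
}

// columns ::= column (',' column)*
$input = [
    'columns' => [
        ['column', ['*', [',', 'column']]],
    ],
];
$output = expand_star($input);
```

The helper rule refers to itself at the end of its first branch, which is what lets it match any number of repetitions, while the empty second branch (ε) ends the repetition.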

The mysql-grammar.php file is ~70 KB, which is small enough. The parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1000/second. There are many further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term. If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most of the time spent on dealing with ambiguities.
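As one illustration of what a lookahead table could look like, here's a sketch that computes, for each rule, the set of token types it can start with (classic FIRST sets). It doesn't describe how this PR's parser computes or uses lookahead; the grammar layout and the `first_sets()` name are assumptions.

```php
<?php
// Toy grammar (hypothetical layout): rule name => list of branches.
// Symbols that are not grammar keys are token types; [] is the ε branch.
$grammar = [
    'query'      => [['SELECT', 'columns', 'FROM', 'IDENTIFIER']],
    'columns'    => [['IDENTIFIER', 'columns_rr']],
    'columns_rr' => [[',', 'IDENTIFIER', 'columns_rr'], []],
];

// Returns rule name => set of token types (as array keys) that can start
// the rule. The special key 'ε' marks rules that can match nothing.
function first_sets(array $grammar): array {
    $first = [];
    foreach ($grammar as $rule => $_) {
        $first[$rule] = [];
    }
    // Iterate until no set grows anymore (fixed point).
    do {
        $changed = false;
        foreach ($grammar as $rule => $branches) {
            foreach ($branches as $branch) {
                $all_nullable = true;
                foreach ($branch as $symbol) {
                    if (isset($grammar[$symbol])) {
                        // Rule: take its current FIRST set, minus ε.
                        $add      = $first[$symbol];
                        $nullable = isset($add['ε']);
                        unset($add['ε']);
                    } else {
                        // Token type: it starts the branch itself.
                        $add      = [$symbol => true];
                        $nullable = false;
                    }
                    foreach ($add as $token_type => $_) {
                        if (!isset($first[$rule][$token_type])) {
                            $first[$rule][$token_type] = true;
                            $changed = true;
                        }
                    }
                    if (!$nullable) {
                        $all_nullable = false;
                        break;
                    }
                }
                if ($all_nullable && !isset($first[$rule]['ε'])) {
                    $first[$rule]['ε'] = true;
                    $changed = true;
                }
            }
        }
    } while ($changed);
    return $first;
}

$first = first_sets($grammar);
```

With such a table, a parser could skip a rule immediately when the next token's type is not in the rule's set and the set doesn't contain `ε`, instead of trying each branch.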

Known issues

There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:

A very special case in the lexer is not handled — While identifiers can't consist solely of numbers, in the identifier part after a ., this is possible (e.g., 1ea10.1 is a table name & column name). This is not handled yet, and it may be worth checking if all cases in the identifier part after a . are handled correctly.
Another very special case in the lexer: while the lexer does support version comments, such as /*!80038 ... */, and nested comments within them, a nested comment within a non-matched version is not supported (e.g., SELECT 1 /*!99999 /* */ */). Additionally, we currently support only 5-digit version specifiers (80038), but 6 digits should probably work as well (080038).
Version specifiers are not propagated to the PHP grammar yet, and versions are not applied in the grammar yet (only in the lexer). This will be better to bring in together with version-specific test cases.
Some rules in the grammar may not have version specifiers, or they may be incorrect.
The _utf8 underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this by SHOW CHARACTER SET on different MySQL versions.
The PHPized grammar now contains array indexes of the main rules, while previously they were not listed. It seems there are numeric gaps. It might be a regression caused when manually parsing the grammar. I suppose it's an easy fix.
Some components need better test coverage (although the E2E 70k query test suite is pretty good for now).
~~The tests are not run on CI yet.~~
I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it.

This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist.
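For the version specifier issue in the list above, here's a rough sketch of reading either a 5-digit or a 6-digit version at the start of a version comment. It's not the lexer code from this PR: the function names and return shape are hypothetical, and treating six digits the same way as five is an assumption that would need checking against MySQL itself.

```php
<?php
// Inspect the start of a version comment at $pos.
// Returns false if there is no "/*!" at $pos; otherwise returns the version
// (or null when no version is given) and the number of bytes consumed.
function read_version_comment_start(string $sql, int $pos) {
    if ('/*!' !== substr($sql, $pos, 3)) {
        return false;
    }
    // Count ASCII digits right after "/*!", looking at 6 bytes at most.
    $digits = strspn($sql, '0123456789', $pos + 3, 6);
    if (5 !== $digits && 6 !== $digits) {
        // No usable version number: treat it as an unversioned "/*! ... */".
        return ['version' => null, 'length' => 3];
    }
    return [
        'version' => (int) substr($sql, $pos + 3, $digits),
        'length'  => 3 + $digits,
    ];
}

// Assumption: the comment content applies when no version is given or when
// the server version is at least the version in the comment.
function version_comment_applies($version, int $server_version): bool {
    return null === $version || $version <= $server_version;
}

$five = read_version_comment_start('SELECT 1 /*!80038 + 1 */', 9);
$six  = read_version_comment_start('/*!080038 + 1 */', 0);
$none = read_version_comment_start('/*! + 1 */', 0);
```

Both `80038` and `080038` come out as the integer 80038 here, so the comparison against the server version is the same for either form.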

Updates

Since the thread here is pretty long, here are quick links to the work-in-progress updates:

Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

Bring in a comprehensive MySQL queries test suite, similar to WHATWG URL test data for parsing URLs. First, just ensure the parser either returns null or any parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures.
Create a MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
Port MySQL features already supported by the SQLite database integration plugin to the new MySQLOnSQLite driver. For example, SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
Run SQLite database integration plugin test suite on the new MySQLOnSQLite driver and ensure they pass.
Rewire this plugin to use the new MySQLOnSQLite driver instead of the current plumbing.

adamziel · 2024-09-13T11:19:14Z

@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to the new parser would solve many of these errors and give us a proper foundation to add support for more MySQL features. CC @JanJakes

JanJakes · 2024-11-12T15:38:16Z

@adamziel Thanks for the great and detailed review! Great comments and ideas! I think I've now gone through all of them: resolved most, replied to some, and documented some with TODOs. I did a lot of iterations on the lexer, while on the parser side, I added a few more TODOs to think through and address some of the ideas in another PR, if that makes sense.

Some of the improvements are:

The lexer now needs neither PCRE nor the UTF-8 decoder. I realized we only ever need to match the U+0080-U+FFFF range and added manual checks, which are faster, together with tests that compare the whole range against what PCRE matches.
Simplified and optimized many places with strspn and strcspn in the lexer.
Significantly simplified the lexer state. Basically, we now only change $this->bytes_already_read. Also removed the "channel", etc.
Tokenizing methods in the lexer now have unified naming and API.
A lot of docs added (but more needed still).
Implemented integer type detection in the lexer (int vs. long, etc.).
Cleanup, naming, better docs in the parser, but also a lot of TODO comments in these classes.
All lexer and parser tests are now run in CI via PHPUnit.
Polished and documented all the scripts.
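To give a flavor of the strspn/strcspn approach mentioned above, here's a small sketch of scanning runs of bytes without regular expressions. These helpers are hypothetical and much simpler than the real lexer methods. For instance, the hex literal reader doesn't check for an even number of digits.

```php
<?php
// Read a run of decimal digits starting at $offset ('' if there is none).
function read_digits(string $sql, int $offset): string {
    // strspn() counts the leading bytes that are all in the given set.
    return substr($sql, $offset, strspn($sql, '0123456789', $offset));
}

// Read everything up to (not including) the next newline, e.g. the rest
// of a "-- comment" line.
function read_until_newline(string $sql, int $offset): string {
    // strcspn() counts the leading bytes that are NOT in the given set.
    return substr($sql, $offset, strcspn($sql, "\n", $offset));
}

// Read a hex literal such as x'ab12' or X'ab12'. Returns false if the
// input at $offset is not a complete hex literal.
function read_hex_literal(string $sql, int $offset) {
    $prefix = substr($sql, $offset, 2);
    if ("x'" !== $prefix && "X'" !== $prefix) {
        return false;
    }
    $length = strspn($sql, '0123456789abcdefABCDEF', $offset + 2);
    // The digits must be followed by the closing quote.
    if ("'" !== substr($sql, $offset + 2 + $length, 1)) {
        return false;
    }
    // Prefix (2 bytes) + digits + closing quote (1 byte).
    return substr($sql, $offset, $length + 3);
}

$digits  = read_digits('123abc', 0);
$comment = read_until_newline("-- hi\nSELECT 1", 0);
$hex     = read_hex_literal("x'ab12' + 1", 0);
```

Each helper only needs the input string and a byte offset, which fits a lexer whose main state is a single "bytes already read" counter: after reading a token, the caller advances the offset by the token's byte length.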

There is still more to be done, better documented, and polished, but I think now it would make sense to do that in another pass. I added TODOs for all issues I'm aware of, including naming, missing documentation, etc.

I understand we're testing for PHP errors and parsing failures. How do we know the produced parse tree is correct? Could we also have unit tests for that? If there's no easy way to do it for the entire test set, can we do it for a limited subset of the most tricky queries?

If we run a development version of MySQL, we can use the SHOW PARSE_TREE statement to inspect what parse tree MySQL produces. I haven't yet investigated to what extent we could really use it at scale (e.g., if we can somehow map it to our ASTs), but it may be something worth exploring.

Otherwise, I think we may also need to create a test suite of expected ASTs manually, and then gradually expand it.

adamziel · 2024-11-13T12:56:28Z

@JanJakes marvelous work ❤️ This looks really good and I don't see anything blocking. Minor nitpicks aside, we're good to merge – thank you so much for your hard work and perseverance here!

adamziel · 2024-11-13T12:57:39Z

I can't approve my own PR, but I can merge it. Just say when and I'll click the button 🎉

JanJakes · 2024-11-13T15:38:41Z

@adamziel Thanks! Addressed all but #157 (comment). When I'm sure I know what we want here, I can fix that too.

adamziel · 2024-11-18T11:19:17Z

This PR is in a great place, monumental work here @JanJakes!

Exhaustive MySQL Parser (WordPress#157)

MySQL AST Parser

ccc341b

adamziel changed the title ~~Custom MySQL AST Parser~~ Exhaustive MySQL Parser Aug 17, 2024

adamziel mentioned this pull request Aug 17, 2024

Explorations: MySQL Query -> AST Parser #153

Closed

1 task

adamziel added 3 commits August 17, 2024 19:14

Fix parser overriding parts of the parse tree as it constructs them.

78fdf69

Output ParseTree using a class, not an array for much simpler processing

c8652d5

logic. Refactor the grammar.

Manually factor left recursion into right recursion in the grammar fi…

137d6ca

…le for a more useful parse tree, 15x smaller grammar file, and faster parsing time

adamziel mentioned this pull request Aug 18, 2024

Support nuances of the SELECT syntax: WITH, UNION, subqueries etc. #106

Open

adamziel added 2 commits August 20, 2024 15:37

Explore support for SQL_CALC_FOUND_ROWS

0a2440c

Support VALUES() call

0406d71

adamziel mentioned this pull request Aug 27, 2024

StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) adamziel/wxr-normalize#1

Closed

adamziel mentioned this pull request Sep 11, 2024

Use WordPress Playground swissspidy/wp-performance-action#173

Merged

3 tasks

adamziel mentioned this pull request Sep 16, 2024

CI: Monitor support for all the SQL queries used by all the WordPress plugins #159

Open

JanJakes added 11 commits September 26, 2024 19:50

Extract queries from MySQL test suite and test the parser against them

87573f2

Implement handling for manually added lexer symbols

9629702

Fix passing nulls to "ctype_" functions

d63bc6e

Add support for hex format x'ab12', X'ab12', and bin format x'01' and…

8e7e2e8

… X'01'

Fix wrong MySQL version conditions (AI hallucinations)

ebcc17e

Implement the checkCharset() placeholder function

1551b0e

Document manual grammar factoring

cdd84b4

Fix "alterOrderList" that has a wrong definition in the original grammar

f50b515

Fix "createUser" that was incorrectly converted from ANTLR to EBNF

e267f67

Fix "castType" that was incomplete in the original grammar

cd543af

Fix "SELECT ... WHERE ... INTO @var" using a negative lookahead

135f29f

JanJakes force-pushed the recursive-descent-parser branch from d3e623d to 135f29f Compare September 26, 2024 17:51

adamziel commented Sep 26, 2024

custom-parser/parser/DynamicRecursiveDescentParser.php

JanJakes added 4 commits September 27, 2024 10:28

Fix "EXPLAIN FORMAT=..." by reordering grammar rules

27524dd

Fix special "WINDOW" and "OVER" cases by adjusting grammar rules

069342f

Fix "GRANT" and "REVOKE" by adjusting grammar rules to solve conflicts

9bfc977

Use ebnfutils to dump grammar conflicts

ca4de77

JanJakes force-pushed the recursive-descent-parser branch 2 times, most recently from 6618be1 to 9971062 Compare November 12, 2024 15:19

JanJakes added 2 commits November 12, 2024 16:20

Use false rather than null when a parser subtree doesn't match

dec0e3f

Declare mbstring in dev dependencies

afc70bb

JanJakes force-pushed the recursive-descent-parser branch from 9971062 to afc70bb Compare November 12, 2024 15:21