Exhaustive MySQL Parser #157

Open
wants to merge 128 commits into
base: develop

@adamziel adamziel commented Aug 17, 2024

Note

You're most welcome to take over this PR. I won't be able to drive it to completion before November. Do you need reliable SQLite support? You can make this happen by driving this PR to completion (and forking if needed)! Are you dreaming about running WordPress on PostgreSQL? Finishing this PR is the first step there.

Read the MySQL parser proposal for the full context on this PR.

Ships an exhaustive MySQL Lexer and Parser that produce a structured parse tree. This is the first step towards supporting multiple databases. It's a simpler, more stable, and easier-to-maintain method than the token processing we use now. It will also dramatically improve the WordPress Playground experience – database integration is the single largest source of issues.

We don't have an AST yet, but we have a decent parse tree. That may already be sufficient – by adjusting the grammar file we can mold it into almost anything we want. If that won't be sufficient or convenient for any reason, converting that parse tree into an AST is a simple tree mapping. It could be done with a well crafted MySQLASTRecursiveIterator and a good AI prompt.

Implementation

The three focal points of this PR are:

  • MySQLLexer.php – turns an SQL query into a stream of tokens
  • DynamicRecursiveDescentParser.php – turns a stream of tokens into a parse tree
  • run-mysql-driver.php – proof of concept of a parse-tree-based query conversion

Before diving further, check out a few parse trees this parser generated:

MySQLLexer.php

This is an AI-generated lexer I initially proposed in #153. It needs a few passes from a human to inline most methods and cover a few tokens it doesn't currently produce, but overall it seems solid.
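
To make "turns an SQL query into a stream of tokens" concrete, here is a minimal sketch of the idea. It is not the MySQLLexer from this PR: the `SKETCH_*` constants, the `sketch_tokenize()` function, and the three-keyword list are all hypothetical and cover only a tiny subset of MySQL.

```php
<?php
// Hypothetical token types; the real lexer defines far more.
const SKETCH_KEYWORD    = 1;
const SKETCH_IDENTIFIER = 2;
const SKETCH_NUMBER     = 3;
const SKETCH_SYMBOL     = 4;

function sketch_tokenize(string $sql): array {
    $keywords = ['SELECT', 'FROM', 'WHERE']; // Assumed subset, for illustration only.
    $tokens   = [];
    $offset   = 0;
    $length   = strlen($sql);
    while ($offset < $length) {
        // Skip whitespace. \G anchors the match at $offset.
        if (preg_match('/\G\s+/', $sql, $m, 0, $offset)) {
            $offset += strlen($m[0]);
            continue;
        }
        if (preg_match('/\G[A-Za-z_][A-Za-z0-9_]*/', $sql, $m, 0, $offset)) {
            $type = in_array(strtoupper($m[0]), $keywords, true)
                ? SKETCH_KEYWORD
                : SKETCH_IDENTIFIER;
        } elseif (preg_match('/\G[0-9]+/', $sql, $m, 0, $offset)) {
            $type = SKETCH_NUMBER;
        } else {
            // Anything else becomes a single-character symbol token.
            $m    = [$sql[$offset]];
            $type = SKETCH_SYMBOL;
        }
        $tokens[] = ['type' => $type, 'text' => $m[0]];
        $offset  += strlen($m[0]);
    }
    return $tokens;
}

$tokens = sketch_tokenize('SELECT id FROM posts');
```

A real lexer also has to handle quoted strings, comments, multi-character operators, and version-dependent keywords, which this sketch deliberately leaves out.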

DynamicRecursiveDescentParser.php

A simple recursive parser to transform (token stream, grammar) => parse tree. In this PR we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have specific properties).

The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.
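
The following sketch illustrates what "all of the parsing rules are provided by the grammar" can mean in practice. It is not the parse_recursive() method from this PR: `sketch_parse()` and the toy grammar are hypothetical, tokens are plain strings, and the first matching branch wins without backtracking into earlier choices.

```php
<?php
// Grammar format assumed for this sketch: rule name => list of branches,
// each branch a list of symbols. A symbol that is a key in $grammar is a
// rule; anything else is a literal token. An empty branch matches nothing (ε).
function sketch_parse(array $grammar, array $tokens, string $rule, int $pos = 0) {
    foreach ($grammar[$rule] as $branch) {
        $children = [];
        $cursor   = $pos;
        $matched  = true;
        foreach ($branch as $symbol) {
            if (isset($grammar[$symbol])) {
                $result = sketch_parse($grammar, $tokens, $symbol, $cursor);
                if (null === $result) {
                    $matched = false;
                    break;
                }
                list($node, $cursor) = $result;
                $children[] = $node;
            } elseif (($tokens[$cursor] ?? null) === $symbol) {
                $children[] = $symbol;
                $cursor++;
            } else {
                $matched = false;
                break;
            }
        }
        if ($matched) {
            // Return the subtree and the position right after it.
            return [[$rule => $children], $cursor];
        }
    }
    return null; // No branch matched at this position.
}

$grammar = [
    'query'      => [['SELECT', 'columns', 'FROM', 'table']],
    'columns'    => [['column', 'columns_rr']],
    'columns_rr' => [[',', 'column', 'columns_rr'], []],
    'column'     => [['a'], ['b']],
    'table'      => [['t']],
];
$result = sketch_parse($grammar, ['SELECT', 'a', ',', 'b', 'FROM', 't'], 'query');
```

The parser itself knows nothing about SQL. Swapping the `$grammar` array is enough to parse a different language, as long as the grammar has no left recursion.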

run-mysql-driver.php

A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin.

The tree transformation is an order of magnitude easier to read, expand, and maintain than the current token processing. I stand by this, even though the temporary ParseTreeTools/SQLiteTokenFactory API included in this PR seems annoying and I'd like to ship something better than that. Here's a glimpse:

function translateQuery($subtree, $rule_name=null) {
    // Leaf nodes are tokens: translate them one at a time.
    if(is_token($subtree)) {
        $token = $subtree;
        switch ($token->type) {
            case MySQLLexer::EOF: return new SQLiteExpression([]);
            case MySQLLexer::IDENTIFIER:
                return SQLiteTokenFactory::identifier(
                    SQLiteTokenFactory::identifierValue($token)
                );

            default:
                return SQLiteTokenFactory::raw($token->text);
        }
    }

    // Inner nodes are translated based on the grammar rule that produced them.
    switch($rule_name) {
        case 'indexHintList':
            // SQLite doesn't support index hints. Let's
            // skip them.
            return null;

        case 'fromClause':
            // Skip `FROM DUAL`. We only care about a singular
            // FROM DUAL statement, as FROM mytable, DUAL is a syntax
            // error.
            if(
                ParseTreeTools::hasChildren($subtree, MySQLLexer::DUAL_SYMBOL) &&
                !ParseTreeTools::hasChildren($subtree, 'tableReferenceList')
            ) {
                return null;
            }
            // Without this break, other FROM clauses would fall through
            // into 'functionCall'. Their translation is not part of this glimpse.
            break;

        case 'functionCall':
            $name = $subtree[0]['pureIdentifier'][0]['IDENTIFIER'][0]->text;
            return translateFunctionCall($name, $subtree[0]['udfExprList']);
    }
}

A deeper technical dive

MySQL Grammar

We use the MySQL Workbench grammar converted from ANTLR4 format to a PHP array.

You can tweak the MySQLParser-reordered.ebnf file and regenerate the PHP grammar with the create_grammar.sh script. You'll need to run npm install before you do that.

The grammar conversion pipeline goes like this:

  1. g4 ➔ EBNF with grammar-converter
  2. EBNF ➔ JSON with node-ebnf. This already factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= SELECT %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
  3. Rule expansion with a Python script: expand the *, +, and ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
  4. JSON ➔ PHP with a PHP script. It replaces all string names with integers and ships an int->string map to reduce the file size.
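
As an illustration of the rule expansion in step 3, the sketch below rewrites a `head (repeated)*` rule into a pair of right-recursive rules. The pipeline in this PR does this in a Python script; `sketch_expand_star()` and its array format are hypothetical and only show the shape of the transformation.

```php
<?php
// $head:     symbols that come before the repetition, e.g. ['column']
// $repeated: symbols inside the (...)* group, e.g. ["','", 'column']
// Returns rule name => list of branches; an empty branch stands for ε.
function sketch_expand_star(string $name, array $head, array $repeated): array {
    $rr = $name . '_rr';
    return [
        // columns ::= column columns_rr
        $name => [array_merge($head, [$rr])],
        // columns_rr ::= ',' column columns_rr | ε
        $rr   => [array_merge($repeated, [$rr]), []],
    ];
}

$rules = sketch_expand_star('columns', ['column'], ["','", 'column']);
```

The helper rule refers to itself at the end of its first branch. That self-reference is what lets it match any number of repetitions, and the empty branch is what lets it stop.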

I ignored nuances like MySQL version-specific rules and output channels for this initial exploration. I'm now confident the approach from this PR will work. We're in a good place to start thinking about incorporating these nuances. I wonder if we even have to distinguish between MySQL 5 vs 8 syntax; perhaps we could just assume version 8 or use a union of all the rules.

✅ The grammar file is large, but fine for v1

Edit: I factored the grammar manually instead of using the automated factoring algorithm, and the grammar.php file size went down to 70kb. This one is now solved. Everything until the next header is no longer relevant and I'm only leaving it here for context.

grammar.php is 1.2MB, or 100kb gzipped. This already is a "compressed" form where all rules and tokens are encoded as integers.
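
A rough sketch of the integer encoding described above, assuming a grammar stored as rule name => list of branches. `sketch_encode_grammar()` is a hypothetical helper, not the conversion script from this PR.

```php
<?php
// Replaces every rule and token name with an integer ID and returns the
// encoded rules together with an int -> string map for decoding.
function sketch_encode_grammar(array $grammar): array {
    $ids   = [];
    $id_of = function (string $name) use (&$ids): int {
        if (!isset($ids[$name])) {
            $ids[$name] = count($ids); // Assign the next free ID.
        }
        return $ids[$name];
    };

    $rules = [];
    foreach ($grammar as $rule => $branches) {
        $rule_id = $id_of($rule);
        foreach ($branches as $branch) {
            $rules[$rule_id][] = array_map($id_of, $branch);
        }
    }
    return ['rules' => $rules, 'names' => array_flip($ids)];
}

$encoded = sketch_encode_grammar([
    'query'   => [['SELECT', 'columns']],
    'columns' => [['column']],
]);
// $encoded['rules'] is [0 => [[1, 2]], 2 => [[3]]]
// $encoded['names'] is [0 => 'query', 1 => 'SELECT', 2 => 'columns', 3 => 'column']
```

Each name is stored once in the map, so long rule names that appear in hundreds of branches cost only a few digits per occurrence.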

I see three ways to reduce the size:

  1. Explore further factorings of the grammar. Run left factoring to deduplicate any ambiguous rules, then extract AB|AC|AD into A(B|C|D) etc.
  2. Remove a large part of the grammar. We can either drop support for hand-picked concepts like CREATE PROCEDURE, or modularize the grammar and lazy-load the parts we actually need to use. For example, most of the time we won't need anything related to GRANT PRIVILEGES or DROP INDEX.
  3. Collapse some tokens into the same token. Perhaps we don't need the same granularity as the original grammar.
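
The AB|AC|AD ➔ A(B|C|D) extraction from the first option could start with a step like the one sketched below: group the branches of a rule by their first symbol. `sketch_left_factor()` is a hypothetical helper that performs only this single grouping step; a full left-factoring pass would repeat it on the resulting suffixes and emit new helper rules.

```php
<?php
// Groups branches by their first symbol.
// Input:  list of branches, each a list of symbols.
// Output: first symbol => list of remaining suffixes.
function sketch_left_factor(array $branches): array {
    $groups = [];
    foreach ($branches as $branch) {
        $first = $branch[0] ?? ''; // '' collects empty (ε) branches.
        $groups[$first][] = array_slice($branch, 1);
    }
    return $groups;
}

$factored = sketch_left_factor([
    ['A', 'B'],
    ['A', 'C'],
    ['A', 'D'],
    ['X', 'Y'],
]);
// $factored is ['A' => [['B'], ['C'], ['D']], 'X' => [['Y']]]
```

Here the shared prefix A is stored once instead of three times, which is where the size reduction would come from.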

The speed is decent

The proposed parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1000/second. There are many further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term.
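
One way to read "memoize the matching results per token" is to cache, for every (rule, token position) pair, where the match ended. The sketch below does that for a toy recognizer. `sketch_match()` and the grammar format are hypothetical and not taken from this PR, and the function only reports the end position instead of building a tree.

```php
<?php
// Returns the position after a successful match of $rule at $pos, or null.
// $memo caches results per "rule@position", so a rule is never re-matched
// at the same token, even when several branches start with it.
function sketch_match(array $grammar, array $tokens, string $rule, int $pos, array &$memo) {
    $key = $rule . '@' . $pos;
    if (array_key_exists($key, $memo)) {
        return $memo[$key];
    }
    $end = null;
    foreach ($grammar[$rule] as $branch) {
        $cursor = $pos;
        foreach ($branch as $symbol) {
            if (isset($grammar[$symbol])) {
                $cursor = sketch_match($grammar, $tokens, $symbol, $cursor, $memo);
            } elseif (($tokens[$cursor] ?? null) === $symbol) {
                $cursor++;
            } else {
                $cursor = null;
            }
            if (null === $cursor) {
                break; // This branch failed; try the next one.
            }
        }
        if (null !== $cursor) {
            $end = $cursor;
            break;
        }
    }
    return $memo[$key] = $end;
}

$grammar = [
    'expr' => [['term', '+', 'expr'], ['term']],
    'term' => [['n']],
];
$memo = [];
$end  = sketch_match($grammar, ['n', '+', 'n'], 'expr', 0, $memo);
```

In this example both branches of expr start with term, so the second branch reuses the cached result for term at position 2 instead of matching it again.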

If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most time spent on dealing with ambiguities.

Next steps

These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:

  • Bring in a comprehensive MySQL queries test suite, similar to WHATWG URL test data for parsing URLs. First, just ensure the parser either returns null or any parse tree where appropriate. Then, once we have more advanced tree processing, actually assert the parser outputs the expected query structures.
  • Create a MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
  • Port MySQL features already supported by the SQLite database integration plugin to the new MySQLOnSQLite driver. For example, SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
  • Run the SQLite database integration plugin test suite on the new MySQLOnSQLite driver and ensure it passes.
  • Rewire this plugin to use the new MySQLOnSQLite driver instead of the current plumbing.

@adamziel adamziel changed the title from "Custom MySQL AST Parser" to "Exhaustive MySQL Parser" on Aug 17, 2024

adamziel commented Sep 13, 2024

@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. A lot of these are about SQL queries. Just migrating to the new parser would solve many of these errors and give us a proper foundation to add support for more MySQL features. CC @JanJakes

@@ -422,6 +422,16 @@ private function parse_recursive($rule_id) {
$node->append_child($subnode);
}
}

// Negative lookahead for INTO after a valid SELECT statement.

Great comments!


JanJakes commented Oct 31, 2024

Wrap up

I think it's time to wrap up this pull request as a first iteration of the exhaustive MySQL lexer & parser, and start looking into the next phase — the SQL query translation.

Of almost 70,000 queries, only 11 are failing (or invalid) now. There are likely to be more edge-case issues and missing version specifiers, but we can focus on improving the correctness later on. In part, we could use some automated checks.

This PR ships

  1. A MySQL lexer, adapted from the AI-generated one by @adamziel. It's over 3x smaller and close to 2x faster.
  2. A MySQL grammar written in ANTLR v4 format, adapted from the MySQL Workbench grammar by adding and fixing some cases and reordering some rules.
  3. A script to factor and convert the grammar to a PHP array. There are no 3rd party deps now, it's just a PHP script.
  4. A dynamic recursive parser implemented by @adamziel.
  5. A script to extract tests from the MySQL repository.
  6. A test suite of almost 70k queries.
  7. WIP SQLite driver by @adamziel, a demo and foundation for the next phase.

At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.

Running tests

The lexer & parser test suite is not yet integrated into the CI and existing test commands. To run the tests, use:

php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php

The first command lexes, and the second one lexes and parses, all ~70k queries.

Known issues

There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:

  1. A very special case in the lexer is not handled — While identifiers can't consist solely of numbers, in the identifier part after a ., this is possible (e.g., 1ea10.1 is a table name & column name). This is not handled yet, and it may be worth checking if all cases in the identifier part after a . are handled correctly.
  2. Another very special case in the lexer: while the lexer does support version comments, such as /*!80038 ... */, and nested comments within them, a nested comment within a non-matched version is not supported (e.g., SELECT 1 /*!99999 /* */ */). Additionally, we currently support only 5-digit version specifiers (80038), but 6 digits should probably work as well (080038).
  3. Version specifiers are not propagated to the PHP grammar yet, and versions are not applied in the grammar yet (only in the lexer). This will be better to bring in together with version-specific test cases.
  4. Some rules in the grammar may not have version specifiers, or they may be incorrect.
  5. The _utf8 underscore charset should be version-dependent (only on MySQL 5), and maybe some others are too. We can check this by SHOW CHARACTER SET on different MySQL versions.
  6. The PHPized grammar now contains array indexes of the main rules, while previously they were not listed. It seems there are numeric gaps. It might be a regression caused when manually parsing the grammar. I suppose it's an easy fix.
  7. Some components need better test coverage (although the E2E 70k query test suite is pretty good for now).
  8. The tests are not run on CI yet.
  9. I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it.
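
To illustrate the version comments from item 2, here is a minimal sketch of deciding whether the body of a /*!NNNNN ... */ comment applies to a given server version. `sketch_version_comment_is_active()` is a hypothetical helper, not the lexer code from this PR. It assumes versions are encoded as major × 10000 + minor × 100 + patch (so 80038 means 8.0.38), accepts 5 or 6 digits, and does not handle the nested-comment cases mentioned above.

```php
<?php
function sketch_version_comment_is_active(string $comment, int $server_version): bool {
    // Match the `/*!` opener, optionally followed by a 5- or 6-digit version.
    if (!preg_match('/^\/\*!(\d{5,6})?/', $comment, $m)) {
        return false; // A regular comment: its body is never executed.
    }
    if (!isset($m[1]) || '' === $m[1]) {
        return true; // `/*! ... */` without a version always applies.
    }
    // "080038" and "80038" both become the integer 80038.
    return $server_version >= (int) $m[1];
}

$active   = sketch_version_comment_is_active('/*!80038 STRAIGHT_JOIN */', 80040);
$inactive = sketch_version_comment_is_active('/*!99999 STRAIGHT_JOIN */', 80040);
```

When the comment is not active, the lexer has to skip everything up to the matching */, and that is exactly the step where nested comments become tricky.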

This list is mainly for me, in order not to forget these. I will later port it into a tracking issue with a checklist.

Previous updates

Since the thread here is pretty long, here are quick links to the previous updates:

Ready for review

I would consider this to be ready for review now. I fixed all coding styles, and excluded the tests for now. Additionally, all the new files are excluded from the plugin build at the moment (.gitattributes).

I guess you can now have a look, @adamziel, @brandonpayton, @bgrgicak, and anybody I missed. I'll try to incorporate any suggestions soon (likely on Monday, as I'm AFK tomorrow).

const GRAMMAR_FILE = __DIR__ . '/../../wp-includes/mysql/mysql-grammar.php';

// Convert the original MySQLParser.g4 grammar to a JSON format.
// The grammar is also flattened and expanded to an ebnf-to-json-like format.

Let's add a TODO comment to migrate it to a grammar-based g4 parser at one point.

@adamziel
Collaborator Author

Let's make sure the known issues are @TODOs somewhere in the code – that will act as a reminder and also make the limitations clear when working with the code.

@adamziel
Collaborator Author

"Some rules in the grammar may not have version specifiers, or they may be incorrect."

What does it mean that they can be incorrect?


adamziel commented Oct 31, 2024

"I'm not sure if the new code fully satisfies the plugin PHP version requirement. We need to check that — e.g., that there are no PHP 7.1 features used. Not fully sure, but I think there's no lint for PHP version in the repo, so we could add it."

I think we're good with PHP 7.2+ since WordPress bumped the minimal version recently. Also, the CI runner uses PHP 7.2+ so enabling the unit tests in CI may be enough to ensure compatibility.

@JanJakes
Collaborator

"What does it mean that they can be incorrect?"

Sometimes, there are little issues like this one or a few missing rules like this. The MySQL Workbench grammar is not 100% equivalent to the original MySQL grammar in some small details. There could be more, although passing the large test suite should make us confident we're >99% correct. That said, we could compare all the tiny details with the original MySQL server grammar, if needed, and explore if some comparison could be made automatically (e.g., fetching the MySQL server grammar in all different versions and for some tokens, etc.).
