Exhaustive MySQL Parser #157
base: develop
logic. Refactor the grammar.
…le for a more useful parse tree, 15x smaller grammar file, and faster parsing time
@bgrgicak tested all plugins in the WordPress plugin directory for installation errors. The top 1000 results are published at https://github.com/bgrgicak/playground-tester/blob/main/logs/2024-09-13-09-22-17/wordpress-seo. Many of these errors are caused by SQL queries. Just migrating to the new parser would solve many of them and give us a proper foundation to add support for more MySQL features. CC @JanJakes
d3e623d to 135f29f
@@ -422,6 +422,16 @@ private function parse_recursive($rule_id) {
$node->append_child($subnode);
}
}

// Negative lookahead for INTO after a valid SELECT statement.
Great comments!
17b2613 to d869785
These rules are now actually packed to %N names, so this has no effect.
23f10b5 to 2cf39d4
Wrap up
I think it's time to wrap up this pull request as a first iteration of the exhaustive MySQL lexer & parser, and start looking into the next phase: the SQL query translation. From almost 70 000 queries, only 11 are failing (or invalid) now. There are likely to be more edge-case issues and missing version specifiers, but we can focus on improving the correctness later on. In part, we could use some automated checks.

This PR ships
At the moment, all the new files are omitted from the plugin build, so they have no effect on production whatsoever.

Running tests
The lexer & parser test suite is not yet integrated into the CI and existing test commands. To run the tests, use:
php tests/parser/run-lexer-tests.php
php tests/parser/run-parser-tests.php
This will lex / lex & parse all the ~70k queries.

Known issues
There are some small issues and incomplete edge cases. Here are the ones I'm currently aware of:
This list is mainly for me, so that I don't forget them. I will later port it into a tracking issue with a checklist.

Previous updates
Since the thread here is pretty long, here are quick links to the previous updates:

Ready for review
I would consider this to be ready for review now. I fixed all coding style issues, and excluded the tests for now. Additionally, all the new files are excluded from the plugin build at the moment ( I guess you can now have a look, @adamziel, @brandonpayton, @bgrgicak, and anybody I missed. I'll try to incorporate any suggestions soon (likely on Monday, as I'm AFK tomorrow).
const GRAMMAR_FILE = __DIR__ . '/../../wp-includes/mysql/mysql-grammar.php';

// Convert the original MySQLParser.g4 grammar to a JSON format.
// The grammar is also flattened and expanded to an ebnf-to-json-like format.
Let's add a TODO comment to migrate it to a grammar-based g4 parser at some point.
Let's make sure the known issues are
What does it mean that they can be incorrect?
I think we're good with PHP 7.2+ since WordPress bumped the minimal version recently. Also, the CI runner uses PHP 7.2+, so enabling the unit tests in CI may be enough to ensure compatibility.
Sometimes, there are little issues like this one or a few missing rules like this. The MySQL Workbench grammar is not 100% equivalent to the original MySQL grammar in some small details. There could be more, although satisfying the large test suite should make us confident we're >99% correct. That said, we could compare all the tiny details with the original MySQL server grammar, if needed, and explore whether some comparison could be made automatically (e.g., fetching the MySQL server grammar in all different versions and for some tokens, etc.).
Note
You're most welcome to take over this PR. I won't be able to drive this to completion before November. Do you need reliable SQLite support? You can make this happen by driving this PR to completion (and forking if needed)! Are you dreaming about running WordPress on PostgreSQL? Finishing this PR is the first step there.
Read the MySQL parser proposal for the full context on this PR.
Ships an exhaustive MySQL Lexer and Parser that produce a structured parse tree. This is the first step towards supporting multiple databases. It's a simpler, more stable, and easier-to-maintain method than the token processing we use now. It will also dramatically improve the WordPress Playground experience, since database integration is the single largest source of issues.
We don't have an AST yet, but we have a decent parse tree. That may already be sufficient: by adjusting the grammar file we can mold it into almost anything we want. If that won't be sufficient or convenient for any reason, converting that parse tree into an AST is a simple tree mapping. It could be done with a well crafted MySQLASTRecursiveIterator and a good AI prompt.
Implementation
The three focal points of this PR are:
Before diving further, check out a few parse trees this parser generated:
MySQLLexer.php
This is an AI-generated lexer I initially proposed in #153. It needs a few passes from a human to inline most methods and cover a few tokens it doesn't currently produce, but overall it seems solid.
DynamicRecursiveDescentParser.php
A simple recursive parser to transform (token stream, grammar) => parse tree. In this PR we use MySQL tokens and MySQL grammar, but the same parser could also support XML, IMAP, and many other grammars (as long as they have specific properties).
The parse_recursive() method is just 100 lines of code (excluding comments). All of the parsing rules are provided by the grammar.
run-mysql-driver.php
A quick and dirty implementation of what a MySQL parse tree ➔ SQLite database driver could look like. It easily supports WITH and UNION queries that would be really difficult to implement in the current SQLite integration plugin.
The tree transformation is an order of magnitude easier to read, expand, and maintain than the current one. I stand by this, even though the temporary ParseTreeTools/SQLiteTokenFactory API included in this PR seems annoying and I'd like to ship something better than that. Here's a glimpse:
A deeper technical dive
MySQL Grammar
We use the MySQL Workbench grammar converted from ANTLR4 format to a PHP array.
You can tweak the MySQLParser-reordered.ebnf file and regenerate the PHP grammar with the create_grammar.sh script. You'll need to run npm install before you do that.
The grammar conversion pipeline goes like this:
1. g4 ➔ EBNF with grammar-converter.
2. EBNF ➔ JSON with node-ebnf. This already factors compound rules into separate rules, e.g. query ::= SELECT (ALL | DISTINCT) becomes query ::= select %select_fragment0 and %select_fragment0 ::= ALL | DISTINCT.
3. Expansion of the *, +, ? modifiers into separate, right-recursive rules. For example, columns ::= column (',' column)* becomes columns ::= column columns_rr and columns_rr ::= ',' column columns_rr | ε.
4. JSON ➔ PHP with a PHP script. It replaces all string names with integers and ships an int->string map to reduce the file size.
I ignored nuances like MySQL version-specific rules and output channels for this initial exploration. I'm now confident the approach from this PR will work. We're in a good place to start thinking about incorporating these nuances. I wonder if we even have to distinguish between MySQL 5 and 8 syntax; perhaps we could just assume version 8 or use a union of all the rules.
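The modifier expansion and the integer encoding steps of the pipeline can be illustrated with a small sketch. This is not the conversion script from this PR: the helper names expand_repetition() and encode_names(), and the array layout (rule name => list of branches, with an empty branch standing for ε), are assumptions made for illustration. The tail rule refers to itself so that it can match any number of repetitions.

```php
<?php
/**
 * Hypothetical helper: rewrite `head (repeated)*` as right-recursive rules.
 * Each rule is a list of branches; each branch is a list of symbols.
 * An empty branch stands for ε (match nothing).
 */
function expand_repetition( string $rule_name, array $head, array $repeated ): array {
	$rr_name = $rule_name . '_rr';
	return array(
		$rule_name => array( array_merge( $head, array( $rr_name ) ) ),
		$rr_name   => array(
			array_merge( $repeated, array( $rr_name ) ), // one repetition, then recurse
			array(),                                     // ε ends the repetition
		),
	);
}

/**
 * Hypothetical helper: replace every symbol name with an integer ID and
 * return the encoded rules together with an int => string name map.
 */
function encode_names( array $rules ): array {
	$ids   = array();
	$id_of = function ( string $name ) use ( &$ids ): int {
		if ( ! isset( $ids[ $name ] ) ) {
			$ids[ $name ] = count( $ids );
		}
		return $ids[ $name ];
	};

	$encoded = array();
	foreach ( $rules as $name => $branches ) {
		$rule_id             = $id_of( $name );
		$encoded[ $rule_id ] = array();
		foreach ( $branches as $branch ) {
			$encoded[ $rule_id ][] = array_map( $id_of, $branch );
		}
	}
	return array(
		'names' => array_flip( $ids ),
		'rules' => $encoded,
	);
}

// columns ::= column (',' column)*
$rules   = expand_repetition( 'columns', array( 'column' ), array( "','", 'column' ) );
$grammar = encode_names( $rules );
```

Here $rules holds the two expanded rules keyed by name, and $grammar holds the same rules with every symbol replaced by an integer, plus the names map needed to turn IDs back into readable names when debugging a parse tree.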
✅ The grammar file is large, but fine for v1
Edit: I factored the grammar manually instead of using the automated factoring algorithm, and the grammar.php file size went down to 70kb. This one is now solved. Everything until the next header is no longer relevant and I'm only leaving it here for context.
grammar.php is 1.2MB, or 100kb gzipped. This already is a "compressed" form where all rules and tokens are encoded as integers.
I see three ways to reduce the size:
AB|AC|AD into A(B|C|D) etc.
CREATE PROCEDURE, or modularize the grammar and lazy-load the parts we actually need to use. For example, most of the time we won't need anything related to GRANT PRIVILEGES or DROP INDEX.
The speed is decent
The proposed parser can handle about 1000 complex SELECT queries per second on a MacBook Pro. It only took a few easy optimizations to go from 50/second to 1000/second. There are many further optimization opportunities once we need more speed. We could factor the grammar in different ways, explore other types of lookahead tables, or memoize the matching results per token. However, I don't think we need to do that in the short term.
If we spend enough time factoring the grammar, we could potentially switch to a LALR(1) parser and cut most time spent on dealing with ambiguities.
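To make the memoization idea concrete, here is a minimal sketch of a grammar-driven recursive descent parser that caches the result of each (rule, token position) pair. It is not the DynamicRecursiveDescentParser from this PR: the class name Memoizing_Grammar_Parser, the string-based grammar layout, the toy grammar, and the memo_hits counter are all assumptions made for illustration. Branches are tried in order and the first match wins.

```php
<?php
// Hypothetical sketch; not the PR's DynamicRecursiveDescentParser.
// Grammar: rule name => list of branches; branch => list of symbols.
// A symbol without a grammar entry is treated as a terminal token type.
class Memoizing_Grammar_Parser {
	private $grammar;
	private $tokens;
	private $position = 0;
	private $memo     = array();
	public $memo_hits = 0;

	public function __construct( array $grammar, array $tokens ) {
		$this->grammar = $grammar;
		$this->tokens  = $tokens;
	}

	public function parse( string $start_rule ) {
		$this->position = 0;
		$this->memo     = array();
		$tree           = $this->parse_recursive( $start_rule );
		// Only accept a match that consumed the whole token stream.
		$consumed_all = count( $this->tokens ) === $this->position;
		return ( null !== $tree && $consumed_all ) ? $tree : null;
	}

	private function parse_recursive( string $symbol ) {
		if ( ! isset( $this->grammar[ $symbol ] ) ) {
			// Terminal: match the current token type and advance.
			$token = $this->tokens[ $this->position ] ?? null;
			if ( $token === $symbol ) {
				++$this->position;
				return $symbol;
			}
			return null;
		}

		$start = $this->position;
		$key   = $symbol . '@' . $start;
		if ( isset( $this->memo[ $key ] ) ) {
			// Reuse the earlier result for this rule at this position.
			++$this->memo_hits;
			list( $tree, $end ) = $this->memo[ $key ];
			$this->position     = $end;
			return $tree;
		}

		$tree = null;
		foreach ( $this->grammar[ $symbol ] as $branch ) {
			$this->position = $start; // backtrack before trying a branch
			$children       = array();
			foreach ( $branch as $child_symbol ) {
				$child = $this->parse_recursive( $child_symbol );
				if ( null === $child ) {
					continue 2; // this branch failed, try the next one
				}
				$children[] = $child;
			}
			$tree = array( 'rule' => $symbol, 'children' => $children );
			break;
		}
		if ( null === $tree ) {
			$this->position = $start;
		}
		$this->memo[ $key ] = array( $tree, $this->position );
		return $tree;
	}
}

// Toy grammar (an assumption): both statement branches start with select_core.
$grammar = array(
	'statement'   => array( array( 'select_core', 'INTO', 'IDENTIFIER' ), array( 'select_core' ) ),
	'select_core' => array( array( 'SELECT', 'columns', 'FROM', 'IDENTIFIER' ) ),
	'columns'     => array( array( 'IDENTIFIER', 'columns_rr' ) ),
	'columns_rr'  => array( array( ',', 'IDENTIFIER', 'columns_rr' ), array() ),
);
$parser = new Memoizing_Grammar_Parser(
	$grammar,
	array( 'SELECT', 'IDENTIFIER', ',', 'IDENTIFIER', 'FROM', 'IDENTIFIER' )
);
$tree   = $parser->parse( 'statement' );
```

In this toy run the first statement branch fails at the missing INTO token, and the second branch reuses the cached select_core result instead of parsing it again. Whether such a cache pays off on the real grammar would need to be measured.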
Next steps
These could be implemented either in follow-up PRs or as updates to this PR – whichever is more convenient:
MySQLOnSQLite database driver to enable running MySQL queries on SQLite. Read this comment for more context. Use any method that's convenient for generating SQLite queries. Feel free to restructure and redo any APIs proposed in this PR. Be inspired by the idea we may build a MySQLOnPostgres driver one day, but don't actually build any abstractions upfront. Make the driver generic so it can be used without WordPress. Perhaps it could implement a PDO driver interface?
MySQLOnSQLite driver. For example, SQL_CALC_FOUND_ROWS option or the INTERVAL syntax.
MySQLOnSQLite driver and ensure they pass.
MySQLOnSQLite driver instead of the current plumbing.