
Explorations: MySQL Query -> AST Parser #153

Closed · wants to merge 17 commits

Conversation

@adamziel (Collaborator) commented Aug 11, 2024

Note

Read the MySQL parser proposal for the full context on this PR.

Description

This PR explores a full MySQL lexer and parser to enable an AST-based MySQL -> SQLite, PostgreSQL, etc. bridge. This would be much more stable, easier to maintain, and easier to expand than the token-processing approach we use now.

The proposed MySQLLexer.php and MySQLParser.php have plenty of opportunities for optimization and refactoring, but overall they are a fantastic starting point for this work and should save us a few months of exploration. The entire code is ~1MB (or ~100KB gzipped), has no dependencies, and can parse 500 complex SELECT queries in ~800ms. I think we can reduce that time by 10x or so.
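To make the idea concrete, here is a toy illustration of what an AST-based MySQL -> SQLite rewrite can look like. This is not the PR's actual MySQLLexer/MySQLParser API — just a sketch where an AST node is a plain array and the transpiler is a recursive walk:

```php
<?php
// Toy illustration (not the PR's parser API): a hand-built AST node for
// CONCAT(first, last), and a transpiler that walks it recursively,
// rewriting MySQL-isms into SQLite syntax.
$ast = [
	'type' => 'function_call',
	'name' => 'CONCAT',
	'args' => [
		[ 'type' => 'identifier', 'value' => 'first' ],
		[ 'type' => 'identifier', 'value' => 'last' ],
	],
];

function to_sqlite( array $node ): string {
	switch ( $node['type'] ) {
		case 'identifier':
			// SQLite quotes identifiers with double quotes.
			return '"' . $node['value'] . '"';
		case 'function_call':
			$args = array_map( 'to_sqlite', $node['args'] );
			// Older SQLite has no CONCAT(); rewrite to the || operator.
			if ( 'CONCAT' === $node['name'] ) {
				return '(' . implode( ' || ', $args ) . ')';
			}
			return $node['name'] . '(' . implode( ', ', $args ) . ')';
	}
	return '';
}

echo to_sqlite( $ast ); // ("first" || "last")
```

Because the rewrite operates on structure rather than on a token stream, each MySQL-specific construct becomes one well-scoped case in the walker.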

How was it done?

I fed the official MySQLParser.g4 grammar to Google Gemini Pro and asked it to build a PHP parser based on it. The 2-million-token input context window made this viable. Gemini can only output ~8,000 tokens, so I had to take each response and feed it back to the model as input. It took some time, so I ran a loop overnight, and in the morning I had a decent starting point.

I used the generate_parser.py script included in this PR to generate the code. I built that script with AI Studio, where I initially uploaded the MySQL grammar, tuned the model parameters, and perfected the prompt. After the export, I only had to make a few adjustments, such as adding a loop, a local file cache, etc.

Once the parser was ready, I used the lexing grammar and a similar prompt to generate the Lexer. Yes, the Lexer came second. I asked Gemini to make the Lexer class plug-and-play with the Parser class. However, once the Lexer was done, I realized I had used the wrong Parser grammar and regenerated the entire Parser from scratch, this time including the Lexer class in my prompt for reference.

The entire project cost ~$520 in Google Cloud charges, $300 of which I covered using free-trial credits.

Rationale

As the proposal explains:

  • Processing an AST is much easier than processing tokens
  • No existing parser available to us can correctly parse the MySQL syntax
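A tiny illustration of the first point (illustrative code, not the plugin's actual implementation): with a flat token stream, even a rule as simple as "the table is the token after FROM" breaks on subqueries, while an AST answers the question structurally.

```php
<?php
// Token processing is positional and fragile: the naive rule
// "table name = token after FROM" picks up '(' for a subquery.
$tokens = [ 'SELECT', '*', 'FROM', '(', 'SELECT', 'id', 'FROM', 'users', ')', 't' ];

$i           = array_search( 'FROM', $tokens, true );
$naive_table = $tokens[ $i + 1 ]; // '(' — wrong, it's a subquery

// An AST (hand-built here for illustration) makes the answer structural,
// not positional: the FROM clause node says what it is.
$ast = [
	'type' => 'select',
	'from' => [ 'type' => 'subquery', 'alias' => 't', 'query' => [ /* ... */ ] ],
];
$from_kind = $ast['from']['type']; // 'subquery' — unambiguous
```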

Testing instructions

Run PHPUnit tests as follows:

phpunit -c ./phpunit.xml.dist --filter WP_MySQL_Parser_Tests

Next steps

  • Discuss the idea, look for any blockers
  • Ask Gemini to implement all the missing methods
  • Refactor the Lexer to avoid all the extra work it does on lookaheads
  • Evaluate switching from PHP Arrays to PHP Objects where applicable (huge speed gains may be possible)
  • Bring in a solid test suite from one of MySQL projects to ensure we can parse all kinds of queries
  • If everything goes well, bring this parser into the project behind a try/catch statement: try to use it by default and fall back to the existing approach on failure. Collect logs from users who opt in. Once we achieve feature and stability parity, switch to this parser entirely.
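The fallback step in the last bullet could be shaped roughly like this. All function names below are hypothetical stand-ins; only the try/catch shape is the point:

```php
<?php
// Sketch of the proposed rollout strategy: prefer the new AST-based
// translator, and fall back to the existing token-based one on failure.
// Both translators here are stand-in stubs for illustration.

function ast_based_translate( string $sql ): string {
	// Stand-in for the new parser; pretend it can't handle this query yet.
	throw new Exception( 'not implemented' );
}

function token_based_translate( string $sql ): string {
	// Stand-in for the existing token-processing translator.
	return $sql;
}

function translate_query( string $sql ): string {
	try {
		// Try the new AST-based path by default.
		return ast_based_translate( $sql );
	} catch ( Exception $e ) {
		// Here is where failures would be logged for opted-in users.
		return token_based_translate( $sql );
	}
}

echo translate_query( 'SELECT 1' ); // falls back: SELECT 1
```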

cc @dmsnell @aristath @brandonpayton @schlessera

@JanJakes (Collaborator)

@adamziel Wow, this is an incredible demonstration of what's possible with current AI tools! Intuitively, I would've never guessed that it could do such a great job.

I'll add one question that's on my mind: Why not apply this to a language that can be compiled to WASM, such as Rust, for instance? How would that affect performance and bundle size? Would there be a binding cost (since PHP is WASM as well)?

Additionally, would new builds for new releases of MySQL be created similarly, or would the parser require manual maintenance from now on?

@adamziel (Collaborator, Author)

Thank you @JanJakes, and great questions!

I'll add one question that's on my mind: Why not apply this to a language that can be compiled to WASM, such as Rust, for instance? How would that affect performance and bundle size? Would there be a binding cost (since PHP is WASM as well)?

For Playground, that would be brilliant. For WordPress core, it wouldn't help us use SQLite at all, since PHP has no WASM support yet. I think we may eventually maintain a semi-automatically generated C or Rust implementation to speed things up.

Additionally, would new builds for new releases of MySQL be created similarly, or would the parser require a manual maintenance from now on?

The ANTLR .g4 grammar doesn't seem to be maintained. We'd either need to find an up-to-date and well-maintained MariaDB grammar file and figure out a semi-automated workflow or, if we can't find one, do manual maintenance.

@adamziel (Collaborator, Author) commented Aug 12, 2024

As I'm working through this parser, there are quite a few places that require:

  • Small adjustments in the condition structure
  • Reordering a few if/else branches
  • Looking ahead a few tokens more than what Gemini did

It can all be done in a week or two, but I now wonder whether converting MySQLParser.g4 to another format would enable generating a parser that correctly handles all these nuances off the bat. I'll explore that.
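For illustration, this is the kind of extra lookahead involved (a generic sketch, not code from this PR): distinguishing `NOT BETWEEN` from other `NOT ...` forms needs a peek two tokens ahead, not one.

```php
<?php
// Generic recursive-descent sketch: with one token of lookahead, NOT could
// start NOT BETWEEN, NOT LIKE, NOT IN, etc. Peeking one token further
// picks the right branch before committing to it.
$tokens = [ 'NOT', 'BETWEEN', '1', 'AND', '10' ];
$pos    = 0;

function peek( array $tokens, int $pos, int $ahead = 0 ): ?string {
	return $tokens[ $pos + $ahead ] ?? null;
}

if ( 'NOT' === peek( $tokens, $pos ) && 'BETWEEN' === peek( $tokens, $pos, 1 ) ) {
	$branch = 'not_between';
} else {
	$branch = 'not_expression';
}

echo $branch; // not_between
```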

public const BIGINT_SYMBOL = 43;
public const REAL_SYMBOL = 44;
public const DOUBLE_SYMBOL = 45;
public const FLOAT_SYMBOL = 46;
Member

next time tell the AI to run its output through WPCS 😆

public const BACKUP_SYMBOL = 100;
public const BEFORE_SYMBOL = 101;
public const BEGIN_SYMBOL = 102;
public const BETWEEN_SYMBOL = 103;
Member

is 104 skipped in the grammar?

const PipesAsConcat = 1;
const HighNotPrecedence = 2;
const NoBackslashEscapes = 4;
public const ANSI_QUOTES = 8;
Member

this is an interesting bit. it appears like this was arbitrarily chosen as the only escaping mode? given an arbitrary enum value?
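For context, the values 1, 2, 4, 8 are powers of two, which suggests bit flags rather than arbitrary enum values: several SQL modes can be OR-ed into one integer and tested with a bitwise AND, mirroring how MySQL's `sql_mode` flags (PIPES_AS_CONCAT, ANSI_QUOTES, etc.) combine. A sketch of that usage (not code from this PR):

```php
<?php
// Bit-flag sketch: each SQL mode occupies its own bit, so a single integer
// can hold any combination of modes.
const PIPES_AS_CONCAT      = 1;
const HIGH_NOT_PRECEDENCE  = 2;
const NO_BACKSLASH_ESCAPES = 4;
const ANSI_QUOTES          = 8;

// Combine modes with bitwise OR...
$sql_mode = PIPES_AS_CONCAT | ANSI_QUOTES; // 9

// ...and test for a mode with bitwise AND.
var_dump( (bool) ( $sql_mode & ANSI_QUOTES ) );          // true
var_dump( (bool) ( $sql_mode & NO_BACKSLASH_ESCAPES ) ); // false
```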

@dmsnell (Member) left a comment

Crazy idea: ask it to replace all const values with string literals and see how that impacts the build size.

@adamziel (Collaborator, Author)

I'll close this PR in favor of #157.

@adamziel closed this Aug 17, 2024