
Explorations: MySQL Query -> AST Parser #153

Closed · wants to merge 17 commits

Conversation

@adamziel (Collaborator) commented Aug 11, 2024

Note

Read the MySQL parser proposal for the full context on this PR.

Description

This PR explores a full MySQL lexer and parser to enable an AST-based MySQL -> SQLite, PostgreSQL, etc. bridge. This would be much more stable, easier to maintain, and easier to expand than the token-processing approach we use now.

The proposed MySQLLexer.php and MySQLParser.php have plenty of opportunities for optimization and refactoring, but overall they are a fantastic starting point for this work and should save us a few months of exploration. The entire code is ~1MB (or ~100KB gzipped), has no dependencies, and can parse 500 complex SELECT queries in ~800ms. I think we can reduce that time by 10x or so.
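To make the idea concrete, here is a toy illustration of what an AST-based MySQL -> SQLite rewrite can look like. This is not the PR's actual MySQLLexer/MySQLParser API — just a sketch where an AST node is a plain array and the transpiler is a recursive walk:

```php
<?php
// Toy illustration (not the PR's parser API): a hand-built AST node for
// CONCAT(first, last), and a transpiler that walks it recursively,
// rewriting MySQL-isms into SQLite syntax.
$ast = [
	'type' => 'function_call',
	'name' => 'CONCAT',
	'args' => [
		[ 'type' => 'identifier', 'value' => 'first' ],
		[ 'type' => 'identifier', 'value' => 'last' ],
	],
];

function to_sqlite( array $node ): string {
	switch ( $node['type'] ) {
		case 'identifier':
			// SQLite quotes identifiers with double quotes.
			return '"' . $node['value'] . '"';
		case 'function_call':
			$args = array_map( 'to_sqlite', $node['args'] );
			// Older SQLite has no CONCAT(); rewrite to the || operator.
			if ( 'CONCAT' === $node['name'] ) {
				return '(' . implode( ' || ', $args ) . ')';
			}
			return $node['name'] . '(' . implode( ', ', $args ) . ')';
	}
	return '';
}

echo to_sqlite( $ast ); // ("first" || "last")
```

Because the rewrite operates on structure rather than on a token stream, each MySQL-specific construct becomes one well-scoped case in the walker.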

How was it done?

I fed the official MySQLParser.g4 grammar to Google Gemini Pro and asked it to build a PHP parser based on it. The 2-million-token input context window made this viable. Gemini can only output ~8,000 tokens, so I had to take each response and feed it back to the model as input. It took some time, so I ran a loop overnight, and in the morning I had a decent starting point.

I used the generate_parser.py script included in this PR to generate the code. I built that script with AI Studio, where I initially uploaded the MySQL grammar, tuned the model parameters, and perfected the prompt. After the export, I only had to make a few adjustments, such as adding a loop, a local file cache, etc.

Once the parser was ready, I used the lexing grammar and a similar prompt to generate the Lexer. Yes, the Lexer came second. I asked Gemini to make the Lexer class plug-and-play with the Parser class. However, once the Lexer was done, I realized I had used the wrong Parser grammar and regenerated the entire Parser from scratch, this time including the Lexer class in my prompt for reference.

The entire project cost ~$520 in Google Cloud charges, $300 of which I covered using free-trial credits.

Rationale

As the proposal explains:

  • Processing an AST is much easier than processing tokens
  • No existing parser available to us can correctly parse the MySQL syntax
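A tiny illustration of the first point (illustrative code, not the plugin's actual implementation): with a flat token stream, even a rule as simple as "the table is the token after FROM" breaks on subqueries, while an AST answers the question structurally.

```php
<?php
// Token processing is positional and fragile: the naive rule
// "table name = token after FROM" picks up '(' for a subquery.
$tokens = [ 'SELECT', '*', 'FROM', '(', 'SELECT', 'id', 'FROM', 'users', ')', 't' ];

$i           = array_search( 'FROM', $tokens, true );
$naive_table = $tokens[ $i + 1 ]; // '(' — wrong, it's a subquery

// An AST (hand-built here for illustration) makes the answer structural,
// not positional: the FROM clause node says what it is.
$ast = [
	'type' => 'select',
	'from' => [ 'type' => 'subquery', 'alias' => 't', 'query' => [ /* ... */ ] ],
];
$from_kind = $ast['from']['type']; // 'subquery' — unambiguous
```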

Testing instructions

Run PHPUnit tests as follows:

phpunit -c ./phpunit.xml.dist --filter WP_MySQL_Parser_Tests

Next steps

  • Discuss the idea, look for any blockers
  • Ask Gemini to implement all the missing methods
  • Refactor the Lexer to avoid all the extra work it does on lookaheads
  • Evaluate switching from PHP Arrays to PHP Objects where applicable (huge speed gains may be possible)
  • Bring in a solid test suite from one of MySQL projects to ensure we can parse all kinds of queries
  • If everything goes well, bring this parser into the project behind a try/catch statement: try to use it by default and fall back to the existing approach on failure. Collect logs from users who opt in. Once we achieve feature and stability parity, switch to this parser entirely.
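The fallback step in the last bullet could be shaped roughly like this. All function names below are hypothetical stand-ins; only the try/catch shape is the point:

```php
<?php
// Sketch of the proposed rollout strategy: prefer the new AST-based
// translator, and fall back to the existing token-based one on failure.
// Both translators here are stand-in stubs for illustration.

function ast_based_translate( string $sql ): string {
	// Stand-in for the new parser; pretend it can't handle this query yet.
	throw new Exception( 'not implemented' );
}

function token_based_translate( string $sql ): string {
	// Stand-in for the existing token-processing translator.
	return $sql;
}

function translate_query( string $sql ): string {
	try {
		// Try the new AST-based path by default.
		return ast_based_translate( $sql );
	} catch ( Exception $e ) {
		// Here is where failures would be logged for opted-in users.
		return token_based_translate( $sql );
	}
}

echo translate_query( 'SELECT 1' ); // falls back: SELECT 1
```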

cc @dmsnell @aristath @brandonpayton @schlessera

@JanJakes (Collaborator)

@adamziel Wow, this is an incredible demonstration of what's possible with current AI tools! Intuitively, I would've never guessed that it could do such a great job.

I'll add one question that's on my mind: Why not apply this to a language that can be compiled to WASM, such as Rust, for instance? How would that affect performance and bundle size? Would there be a binding cost (since PHP is WASM as well)?

Additionally, would new builds for new releases of MySQL be created similarly, or would the parser require manual maintenance from now on?

@adamziel (Collaborator, Author)

Thank you @JanJakes, and great questions!

I'll add one question that's on my mind: Why not apply this to a language that can be compiled to WASM, such as Rust, for instance? How would that affect performance and bundle size? Would there be a binding cost (since PHP is WASM as well)?

For Playground, that would be brilliant. For WordPress core, it wouldn't help us use SQLite at all, since PHP has no WASM support yet. I think we may eventually maintain a semi-automatically generated C or Rust implementation to speed things up.

Additionally, would new builds for new releases of MySQL be created similarly, or would the parser require a manual maintenance from now on?

The ANTLR .g4 grammar doesn't seem to be maintained. We'd either need to find an up-to-date and well-maintained MariaDB grammar file and figure out a semi-automated workflow or, if we can't find one, do manual maintenance.

@adamziel (Collaborator, Author) commented Aug 12, 2024

As I'm working through this parser, there are quite a few places that require:

  • Small adjustments in the condition structure
  • Reordering a few if/else branches
  • Looking ahead a few tokens more than what Gemini did

It can all be done in a week or two, but I now wonder whether converting MySQLParser.g4 to another format would enable generating a parser that correctly handles all these nuances off the bat. I'll explore that.
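For illustration, this is the kind of extra lookahead involved (a generic sketch, not code from this PR): distinguishing `NOT BETWEEN` from other `NOT ...` forms needs a peek two tokens ahead, not one.

```php
<?php
// Generic recursive-descent sketch: with one token of lookahead, NOT could
// start NOT BETWEEN, NOT LIKE, NOT IN, etc. Peeking one token further
// picks the right branch before committing to it.
$tokens = [ 'NOT', 'BETWEEN', '1', 'AND', '10' ];
$pos    = 0;

function peek( array $tokens, int $pos, int $ahead = 0 ): ?string {
	return $tokens[ $pos + $ahead ] ?? null;
}

if ( 'NOT' === peek( $tokens, $pos ) && 'BETWEEN' === peek( $tokens, $pos, 1 ) ) {
	$branch = 'not_between';
} else {
	$branch = 'not_expression';
}

echo $branch; // not_between
```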

public const BIGINT_SYMBOL = 43;
public const REAL_SYMBOL = 44;
public const DOUBLE_SYMBOL = 45;
public const FLOAT_SYMBOL = 46;
Member

next time tell the AI to run its output through WPCS 😆

public const BACKUP_SYMBOL = 100;
public const BEFORE_SYMBOL = 101;
public const BEGIN_SYMBOL = 102;
public const BETWEEN_SYMBOL = 103;
Member

is 104 skipped in the grammar?

const PipesAsConcat = 1;
const HighNotPrecedence = 2;
const NoBackslashEscapes = 4;
public const ANSI_QUOTES = 8;
Member

this is an interesting bit. it appears like this was arbitrarily chosen as the only escaping mode? given an arbitrary enum value?
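For context, the values 1, 2, 4, 8 are powers of two, which suggests bit flags rather than arbitrary enum values: several SQL modes can be OR-ed into one integer and tested with a bitwise AND, mirroring how MySQL's `sql_mode` flags (PIPES_AS_CONCAT, ANSI_QUOTES, etc.) combine. A sketch of that usage (not code from this PR):

```php
<?php
// Bit-flag sketch: each SQL mode occupies its own bit, so a single integer
// can hold any combination of modes.
const PIPES_AS_CONCAT      = 1;
const HIGH_NOT_PRECEDENCE  = 2;
const NO_BACKSLASH_ESCAPES = 4;
const ANSI_QUOTES          = 8;

// Combine modes with bitwise OR...
$sql_mode = PIPES_AS_CONCAT | ANSI_QUOTES; // 9

// ...and test for a mode with bitwise AND.
var_dump( (bool) ( $sql_mode & ANSI_QUOTES ) );          // true
var_dump( (bool) ( $sql_mode & NO_BACKSLASH_ESCAPES ) ); // false
```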

@dmsnell (Member) left a comment

Crazy idea: ask it to replace all const values with string literals and see how that impacts the build size.

@adamziel (Collaborator, Author)

I'll close this PR in favor of #157.

@adamziel closed this Aug 17, 2024