Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace proof-of-sql-parser with sqlparser. #235

Open
4 tasks
JayWhite2357 opened this issue Oct 7, 2024 · 28 comments
Open
4 tasks

Replace proof-of-sql-parser with sqlparser. #235

JayWhite2357 opened this issue Oct 7, 2024 · 28 comments
Labels
💎 Bounty enhancement New feature or request

Comments

@JayWhite2357
Copy link
Contributor

JayWhite2357 commented Oct 7, 2024

Background and Motivation

Currently, we have an in-house parser that is built on the lalrpop parser-generator. This has been good while the supported syntax has been simple. However, as the supported syntax has grown, we need a more comprehensive parser.

The sqlparser crate is the parser used by DataFusion, which is part of the Arrow ecosystem. It is a feature-rich parser that ultimately will require less code maintenance. It is no_std compatible, so there should be no issues integrating it.

Changes Required

  • Replace proof-of-sql-parser usage within the proof-of-sql crate with sqlparser usage.
  • Ensure that no functionality is removed.
  • Add thorough testing.
  • Remove the workspace and simplify the repo to be a single crate: the proof-of-sql crate.
@JayWhite2357
Copy link
Contributor Author

@iajoiner may be able to provide a bit of guidance on this issue, however, there is no concrete plan for this.
It would be wise to discuss this plan with @iajoiner and myself to ensure that things are heading in the right direction before building in earnest.

@JayWhite2357 JayWhite2357 added the enhancement New feature or request label Oct 8, 2024
@JayWhite2357
Copy link
Contributor Author

/bounty $10000

Copy link

algora-pbc bot commented Oct 9, 2024

💎 $10,000 bounty • Space and Time

Steps to solve:

  1. Start working: (Optional) Comment /attempt #235 with your implementation plan. Note: we will only assign an issue if you include an implementation plan with a time estimate. Additionally, to be assigned an issue, you must have previously contributed to the project. You can still work on an issue and submit a PR without being assigned.
  2. Submit work: Create a pull request including /claim #235 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to spaceandtimelabs/sxt-proof-of-sql!

Add a bountyShare on socials

Attempt Started (GMT+0) Solution
🔴 @varshith257 Oct 9, 2024, 1:50:03 PM WIP
🔴 @animeshd9 Oct 9, 2024, 8:21:53 PM WIP
🟢 @TomBebb Oct 15, 2024, 10:39:12 PM WIP
🟢 @deependujha Oct 16, 2024, 12:17:06 PM WIP

@varshith257
Copy link
Contributor

varshith257 commented Oct 9, 2024

/attempt #235

Algora profile Completed bounties Tech Active attempts Options
@varshith257 15 bounties from 7 projects
Go, Scala,
TypeScript & more
﹟239
Cancel attempt

@varshith257
Copy link
Contributor

@JayWhite2357 I will connect on discord for the same

@animeshd9
Copy link

animeshd9 commented Oct 9, 2024

/attempt #235

@varshith257
Copy link
Contributor

varshith257 commented Oct 10, 2024

Hi @JayWhite2357 and @iajoiner,I'd like to proceed with the transition from the LALRPOP-based parser to sqlparser for the proof-of-sql-parser crate. I have gone through current implementation and working of proof-of-sql parser. Here's the approach I have in mind:

  • I will add the sqlparser crate to handle the parsing of SQL queries.

  • Then I move ahead to the replace the LALRPOP Parser

    • The LALRPOP-based parser defined in sql.lalrpop will be replaced by a new parsing function that utilizes sqlparser to parse
    • This will specifically focus on the SELECT statements following PostgreSQL syntax as it currently does(Select Statement Parsing, Expressions and Operators, Lexer Rules, ORDER/GROUP BY, Literals and Identifiers etc...)
  • Next I step to map the sqlparser AST to existing intermediate AST (intermediate_ast::SelectStatement and other structures) to minimize changes to the rest of the crate.This will ensure the current logic built on top of the AST stays intact
    Next use the sqlparser::ast structures entirely, phasing out the current intermediate AST.

  • I will thoroughly test the changes incorporate at every stage to ensure all current functionality is maintained.

  • I think we can also adapt the existing ParseError and ParseResult types to handle errors from sqlparser

Before moving forward with this plan in my mind, I wanted to check in and ensure this approach aligns with the migration goals. If there are any specific concerns or alternative suggestions, I’d be happy to adjust the plan.

I am looking forward to your feedback!

@JayWhite2357
Copy link
Contributor Author

I'm looking into it a bit, and I feel like sqlparser::ast::Query is thing that would be analogous to intermediate_ast::SelectStatement.

My initial intent was for the sqlparser AST to replace the intermediate AST. This would mean that QueryExpr::try_new would accept a sqlparser::ast::Query (or similar) instead of a intermediate_ast::SelectStatement. In this situation, most (if not all) of the code changes would be in the proof_of_sql::sql::parse module.

Only replacing lalrpop with sqlparser, but not replacing intermediate AST is an interesting idea that I hadn't thought of. Perhaps it makes sense as a stepping stone, but I feel like it can't be the end goal here.

@iajoiner might have some feedback on this.

@varshith257
Copy link
Contributor

varshith257 commented Oct 11, 2024

Thanks @JayWhite2357 for your view on this. @iajoiner Any insights on this

@JayWhite2357
Copy link
Contributor Author

Thanks @JayWhite2357 for your view on this. @iajoiner Any insights on this

I chatted with him. He's on the same page. The goal here should be to remove the intermediate AST altogether.

@varshith257
Copy link
Contributor

Got it! I just started with basic SELECT parsing logic to see how sqlparser works with an example.

@JayWhite2357 I have joined Discord. If we have a thread in a related channel on discord, we can easily track all progress and discuss more about it in moving forward.

@iajoiner
Copy link
Contributor

@varshith257 I can chat with you on Discord. What's your handle there?

@varshith257
Copy link
Contributor

@varshith257 I can chat with you on Discord. What's your handle there?

@iajoiner I'd: vamshi_257

@iajoiner
Copy link
Contributor

Cool! Just sent you a message there.

@TomBebb
Copy link

TomBebb commented Oct 15, 2024

/attempt #235

1 similar comment
@deependujha
Copy link

deependujha commented Oct 16, 2024

/attempt #235

@varshith257
Copy link
Contributor

@JayWhite2357 Is @iajoiner is on holiday? I am also thinking of connecting with you on discord :

Here's my ID: : vamshi_257

@JayWhite2357
Copy link
Contributor Author

@varshith257 we're all a bit swamped at the moment. I connected with you on discord as well.

@JayWhite2357
Copy link
Contributor Author

The main effort here should be adding a function here:


The function should look like this:

pub fn try_new_from_sqlparser(
    ast: sqlparser::ast::Query,
    default_schema: Identifier,
    schema_accessor: &dyn SchemaAccessor,
) -> ConversionResult<Self>;

This should be able to be added without changing any existing code. Once it is in place and well tested, we can rename it to try_new and drop the existing try_new method. After that, the proof-of-sql-parser crate will be unused and we can drop it along with related code.

There are probably some complexities here that I'm missing. It may not be as straightforward as this. (For example, since QueryExpr depends on Identifier, which is a proof-of-sql-parser type, we will need to move or replace it before dropping proof-of-sql-parser completely.)

@iajoiner
Copy link
Contributor

iajoiner commented Oct 21, 2024

@varshith257 @TomBebb @deependujha
Some possible replacements

  • proof_of_sql_parser::Identifier -> sqlparser::ast::Ident
  • proof_of_sql_parser::ResourceId -> Vec<Ident> (we can move ResourceId into proof-of-sql and redefine it as [Ident; 2] if necessary)
  • proof_of_sql_parser::intermediate_ast::BinaryOperator -> sqlparser::ast::BinaryOperator
  • proof_of_sql_parser::intermediate_ast::UnaryOperator -> sqlparser::ast::UnaryOperator
  • proof_of_sql_parser::intermediate_ast::Expression -> sqlparser::ast::Expr
  • proof_of_sql_parser::SelectStatement -> sqlparser::ast::Query

@iajoiner
Copy link
Contributor

iajoiner commented Oct 21, 2024

A few more ideas:

  • proof_of_sql_parser::utility needs to be modified to use sqlparser structs instead.
  • For timestamp it requires a bit more work. What we need to aim at is most likely Timestamp(Option<u64>, TimezoneInfo). (@Dustin-Ray any insights?)

@iajoiner
Copy link
Contributor

iajoiner commented Oct 21, 2024

@varshith257 @TomBebb @deependujha Let's get this done ASAP and we will be more than willing to help unblock you guys if necessary. Please feel free to shoot me a message either here or on Discord (username: chloemargaret) if you guys have any questions.

@varshith257
Copy link
Contributor

Sure @iajoiner. Lets aim to complete this ASAP within a few weeks

@JayWhite2357
Copy link
Contributor Author

Just discussed this with @iajoiner some more.
The plan should be to focus on adding the pub fn try_new_from_sqlparser function.
Once this is added, we can make the appropriate replacements.
This has the benefit of not being a breaking change and not requiring large amounts of refactorization at any point. Instead the swap can be done after the new sqlparser functionality is well-tested.

This should be able to be added without changing any existing code: specifically, the proof-of-sql-parser crate should not need to be modified until after the try_new_from_sqlparser function is completed.

@varshith257
Copy link
Contributor

@JayWhite2357 @iajoiner, I have DM you on Discord about a blocker. Have a look

@iajoiner
Copy link
Contributor

iajoiner commented Nov 4, 2024

@JayWhite2357 @animeshd9 @deependujha @TomBebb @varshith257
Proposed steps

pub fn try_new(
    ast: sqlparser::ast::Query,
    default_schema: sqlparser::ast::Ident,
    schema_accessor: &dyn SchemaAccessor,
) -> ConversionResult<Self>; 

There should be no construct of proof-of-sql-parser still used in the proof-of-sql crate any more.

@iajoiner
Copy link
Contributor

iajoiner commented Nov 5, 2024

@animeshd9 @TomBebb @varshith257 Since we need proof-of-sql-parser -> sqlparser adaptation to make sure we don't add additional use cases of proof-of-sql-parser constructs in proof-of-sql crate (in fact #334 already adds additional use cases which complicates this issue) I decided to do the first step myself.

The rest of the subtasks will be added as separated issues linked back here with separate bounties. We will set up deadlines so that you guys know when each task has to be completed before the bounty period ends and we start to work on the task internally. We will usually give a heads up at least a week prior to the deadline for each subtask.

iajoiner added a commit that referenced this issue Nov 7, 2024
Please be sure to look over the pull request guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md#submit-pr.

# Please go through the following checklist
- [x] The PR title and commit messages adhere to guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md.
In particular `!` is used if and only if at least one breaking change
has been introduced.
- [x] I have run the ci check script with `source
scripts/run_ci_checks.sh`.

# Rationale for this change
In order to allow #235 to be done in time for JOIN-related integrations
we need to get `proof-of-sql-parser` -> `sqlparser` adaptions done.
Large parts of the work going forward can then become more manageable.
<!--
Why are you proposing this change? If this is already explained clearly
in the linked issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

 Example:
 Add `NestedLoopJoinExec`.
 Closes #345.

Since we added `HashJoinExec` in #323 it has been possible to do
provable inner joins. However performance is not satisfactory in some
cases. Hence we need to fix the problem by implement
`NestedLoopJoinExec` and speed up the code
 for `HashJoinExec`.
-->

# What changes are included in this PR?
- add `sqlparser.rs` with adaptations
<!--
There is no need to duplicate the description in the ticket here but it
is sometimes worth providing a summary of the individual changes in this
PR.

Example:
- Add `NestedLoopJoinExec`.
- Speed up `HashJoinExec`.
- Route joins to `NestedLoopJoinExec` if the outer input is sufficiently
small.
-->

# Are these changes tested?
Yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

Example:
Yes.
-->
iajoiner added a commit that referenced this issue Nov 12, 2024
…parser::ast::UnaryOp` in the proof-of-sql crate (#363)

Please be sure to look over the pull request guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md#submit-pr.

# Please go through the following checklist
- [x] The PR title and commit messages adhere to guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md.
In particular `!` is used if and only if at least one breaking change
has been introduced.
- [x] I have run the ci check script with `source
scripts/run_ci_checks.sh`.

# Rationale for this change
This PR addresses the need to replace the
`proof_of_sql_parser::intermediate_ast::UnaryOp` with the
`sqlparser::ast::UnaryOp` in the `proof-of-sql` crate as part of a
larger transition toward integrating the `sqlparser` .

This change is a subtask of issue #235, with the main goal of
streamlining the repository by switching to the `sqlparser` crate and
gradually replacing intermediary constructs like
`proof_of_sql_parser::intermediate_ast` with `sqlparser::ast`.
<!--
Why are you proposing this change? If this is already explained clearly
in the linked issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

 Example:
 Add `NestedLoopJoinExec`.
 Closes #345.

Since we added `HashJoinExec` in #323 it has been possible to do
provable inner joins. However performance is not satisfactory in some
cases. Hence we need to fix the problem by implement
`NestedLoopJoinExec` and speed up the code
 for `HashJoinExec`.
-->

# What changes are included in this PR?
- All instances of `proof_of_sql_parser::intermediate_ast::UnaryOp` have
been replaced with `sqlparser::ast::UnaryOp`
- Every usage of `UnaryOp` has been updated to maintain the original
functionality, ensuring no changes to the logic or behavior.
- Any unsupported `UnaryOp` variants from `sqlparser` have been
appropriately handled using existing error handling mechanisms (i.e.,
the `Unsupported `variant in `ExpressionEvaluationError`).
 
<!--
There is no need to duplicate the description in the ticket here but it
is sometimes worth providing a summary of the individual changes in this
PR.

Example:
- Add `NestedLoopJoinExec`.
- Speed up `HashJoinExec`.
- Route joins to `NestedLoopJoinExec` if the outer input is sufficiently
small.
-->

# Are these changes tested?

Yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

Example:
Yes.
-->

Part of #235
iajoiner added a commit that referenced this issue Nov 17, 2024
…lparser::ast::BinaryOp` in the proof-of-sql crate (#362)

Please be sure to look over the pull request guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md#submit-pr.

# Please go through the following checklist
- [x] The PR title and commit messages adhere to guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md.
In particular `!` is used if and only if at least one breaking change
has been introduced.
- [x] I have run the ci check script with `source
scripts/run_ci_checks.sh`.

# Rationale for this change
This PR addresses the need to replace the
`proof_of_sql_parser::intermediate_ast::BinaryOp` with the
`sqlparser::ast::BinaryOp` in the `proof-of-sql` crate as part of a
larger transition toward integrating the `sqlparser` .

This change is a subtask of issue #235, with the main goal of
streamlining the repository by switching to the `sqlparser` crate and
gradually replacing intermediary constructs like
`proof_of_sql_parser::intermediate_ast` with `sqlparser::ast`.
<!--
Why are you proposing this change? If this is already explained clearly
in the linked issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

 Example:
 Add `NestedLoopJoinExec`.
 Closes #345.

Since we added `HashJoinExec` in #323 it has been possible to do
provable inner joins. However performance is not satisfactory in some
cases. Hence we need to fix the problem by implement
`NestedLoopJoinExec` and speed up the code
 for `HashJoinExec`.
-->

# What changes are included in this PR?
- All instances of `proof_of_sql_parser::intermediate_ast::BinaryOp`
have been replaced with `sqlparser::ast::BinaryOp`
- Every usage of `BianryOp` has been updated to maintain the original
functionality, ensuring no changes to the logic or behavior.
- Any unsupported `BinaryOp` variants from `sqlparser` have been
appropriately handled using existing error handling mechanisms (i.e.,
the `Unsupported `variant in `ExpressionEvaluationError`).

<!--
There is no need to duplicate the description in the ticket here but it
is sometimes worth providing a summary of the individual changes in this
PR.

Example:
- Add `NestedLoopJoinExec`.
- Speed up `HashJoinExec`.
- Route joins to `NestedLoopJoinExec` if the outer input is sufficiently
small.
-->

# Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

Example:
Yes.
-->
Yes

Closes #349 
Part of #235
@iajoiner
Copy link
Contributor

The deadline for this ticket is Jan 5th 2025.

iajoiner added a commit that referenced this issue Dec 16, 2024
…sqlparser::ast::Ident` in the proof-of-sql crate (#382)

Please be sure to look over the pull request guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md#submit-pr.

# Please go through the following checklist
- [x] The PR title and commit messages adhere to guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md.
In particular `!` is used if and only if at least one breaking change
has been introduced.
- [x] I have run the ci check script with `source
scripts/run_ci_checks.sh`.

# Rationale for this change
This PR addresses the need to replace the
`proof_of_sql_parser::Identifier` with the `sqlparser::ast::Ident` in
the `proof-of-sql` crate as part of a larger transition toward
integrating the `sqlparser` .

This change is a subtask of issue #235, with the main goal of
streamlining the repository by switching to the `sqlparser` crate and
gradually replacing intermediary constructs like
`proof_of_sql_parser::intermediate_ast` with `sqlparser::ast`.
<!--
Why are you proposing this change? If this is already explained clearly
in the linked issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

 Example:
 Add `NestedLoopJoinExec`.
 Closes #345.

Since we added `HashJoinExec` in #323 it has been possible to do
provable inner joins. However performance is not satisfactory in some
cases. Hence we need to fix the problem by implement
`NestedLoopJoinExec` and speed up the code
 for `HashJoinExec`.
-->

# What changes are included in this PR?
- All instances of `proof_of_sql_parser::Identifier` have been replaced
with `sqlparser::ast::Ident`
- A few of them required an identifier (e.g. Expression::Column, etc..),
which is dependent on the Identifier and will be migrated at the
refactoring of Exprs.
- Every usage of `Identifier` has been updated to maintain the original
functionality, ensuring no changes to the logic or behavior.
- The breaking change here is that `Ident` doesn't support `Copy` trait
so we have needed the clones in the places where values are moved
- Deleted the test
`we_cannot_convert_a_record_batch_if_it_has_repeated_column_names`
because the `sqlparser` now differentiates between uppercase and
lowercase identifiers. Case normalization is no longer applied and
`sqlparser` treats `a` and `A` as distinct identifiers.

- Examples are updated to align with `sqlparser`'s case-sensitive
behavior.

<!--
There is no need to duplicate the description in the ticket here but it
is sometimes worth providing a summary of the individual changes in this
PR.

Example:
- Add `NestedLoopJoinExec`.
- Speed up `HashJoinExec`.
- Route joins to `NestedLoopJoinExec` if the outer input is sufficiently
small.
-->

# Are these changes tested?

Yes
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

Example:
Yes.
-->

Part of #235
iajoiner added a commit that referenced this issue Dec 18, 2024
Please be sure to look over the pull request guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md#submit-pr.

# Please go through the following checklist
- [ ] The PR title and commit messages adhere to guidelines here:
https://github.com/spaceandtimelabs/sxt-proof-of-sql/blob/main/CONTRIBUTING.md.
In particular `!` is used if and only if at least one breaking change
has been introduced.
- [ ] I have run the ci check script with `source
scripts/run_ci_checks.sh`.

# Rationale for this change

To update docs to reflect the migration of Identifier -> Ident
<!--
Why are you proposing this change? If this is already explained clearly
in the linked issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.

 Example:
 Add `NestedLoopJoinExec`.
 Closes #345.

Since we added `HashJoinExec` in #323 it has been possible to do
provable inner joins. However performance is not satisfactory in some
cases. Hence we need to fix the problem by implement
`NestedLoopJoinExec` and speed up the code
 for `HashJoinExec`.
-->

# What changes are included in this PR?

- Updated docs in the proof-of-sql crate with Ident
<!--
There is no need to duplicate the description in the ticket here but it
is sometimes worth providing a summary of the individual changes in this
PR.

Example:
- Add `NestedLoopJoinExec`.
- Speed up `HashJoinExec`.
- Route joins to `NestedLoopJoinExec` if the outer input is sufficiently
small.
-->

# Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

Example:
Yes.
-->
Part of  #235
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💎 Bounty enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

7 participants
@JayWhite2357 @iajoiner @TomBebb @animeshd9 @deependujha @varshith257 and others