Support setting time range in Copy From statement #3511

Closed
MichaelScofield opened this issue Mar 14, 2024 · 10 comments · Fixed by #4405
Labels
good first issue Good for newcomers

Comments

@MichaelScofield
Collaborator

What type of enhancement is this?

User experience

What does the enhancement do?

As title.

Currently there's a timestamp_range: Option<TimestampRange> field in the struct CopyTableRequest, but it's never set. We can set it from the SQL parser first, then check whether it works.

COPY FROM in PostgreSQL supports a WHERE condition (https://www.postgresql.org/docs/current/sql-copy.html); maybe we can do that, too. For this issue, we can extract the time range from the WHERE condition and leave more general filtering for the future.
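
For orientation, here is a minimal sketch of where the parsed range would end up. Everything except the timestamp_range field itself is a simplified stand-in for the real greptimedb types:

// Simplified stand-ins; only `timestamp_range` is taken from the issue text.
pub struct TimestampRange {
    pub start: Option<i64>, // inclusive lower bound, in the table's time unit
    pub end: Option<i64>,   // exclusive upper bound
}

pub struct CopyTableRequest {
    // ... table name, file location, WITH options, etc. ...
    pub timestamp_range: Option<TimestampRange>, // currently never populated
}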

Implementation challenges

No response

@MichaelScofield MichaelScofield added the good first issue Good for newcomers label Mar 14, 2024
@naosense

Hi @MichaelScofield, I want to take this. Could you give me some further advice?

@MichaelScofield
Collaborator Author

@naosense Sure. The "copy from" statement is parsed in the file "copy_parser.rs"; you can start from there. First you might want to see how "where" can be parsed in sqlparser. If you're lucky, the "copy from" statement from sqlparser may already carry the "where" part. Then you need to find out how to extract the desired "time range" from the filters in "where". Finally, set the time range in the struct CopyTableRequest and run some tests to see if it works!
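
For the first step (seeing how sqlparser hands back the WHERE part), a standalone sketch along these lines might help. It uses the sqlparser crate directly and is not greptimedb code; the exact Parser construction API varies between sqlparser versions:

use sqlparser::dialect::GenericDialect;
use sqlparser::keywords::Keyword;
use sqlparser::parser::{Parser, ParserError};

fn main() -> Result<(), ParserError> {
    // Pretend we are the copy parser and have consumed everything up to WHERE.
    let rest = "WHERE ts >= '2024-03-01 00:00:00' AND ts < '2024-03-02 00:00:00'";
    let dialect = GenericDialect {};
    let mut parser = Parser::new(&dialect).try_with_sql(rest)?;

    // Optional WHERE: consume the keyword if present, then parse one expression.
    let predicate = if parser.parse_keyword(Keyword::WHERE) {
        Some(parser.parse_expr()?)
    } else {
        None
    };

    // Prints an Expr::BinaryOp tree with op = And at the top.
    println!("{predicate:?}");
    Ok(())
}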

@WenyXu
Member

WenyXu commented Mar 20, 2024

Hi @MichaelScofield, I want to take this. Could you give me some further advice?

I think we can add a where clause to the Copy From statement, then forward the filter from the where clause to the scan request (the Copy From statement sends a scan request to retrieve data).
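
A minimal sketch of that idea, with the scan request reduced to a hypothetical placeholder (the real request type in greptimedb carries much more than filters):

use sqlparser::ast::Expr;

/// Hypothetical, stripped-down scan request used only to illustrate forwarding
/// the parsed WHERE filter along with the scan.
struct ScanRequest {
    filters: Vec<Expr>,
}

fn forward_where_filter(predicate: Option<Expr>) -> ScanRequest {
    ScanRequest {
        // Whatever was parsed after WHERE simply travels with the request;
        // in practice it would be converted to the engine's filter expression.
        filters: predicate.into_iter().collect(),
    }
}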

@naosense

After my initial research, I don't have much of an idea of how to solve this issue, probably because I'm not familiar with greptimedb and sqlparser. I hope someone can take over; I'm afraid this issue is beyond my ability 😅.

@okJiang

okJiang commented Mar 23, 2024

I spent quite a while trying to solve this problem. I've made some small progress, but since I'm a newbie in Rust, I find it difficult to sort out Rust syntax issues in a short time, such as error handling, type conversion, and syntactic sugar. So I've decided to post my progress here; ideas are welcome.

First, I successfully parsed the where clause and got the expression. As a rookie, it took me a while to get the if-else to return the same type of result...

let predicate = if self.parser.parse_keyword(Keyword::WHERE) {
    Some(
        self.parser
            .parse_expr()
            .context(error::SyntaxSnafu)?,
    )
} else {
    None
};

Then I found that the returned expression did not meet the needs of the issue: what needs to be returned here is a TimestampRange. So the next step is to convert the sqlparser expression into a TimestampRange. I think the steps are (a rough sketch of the conversion follows the list):

  1. check if the type of the expression is AnyOp or AllOp
  2. check if the data type is timestamp
  3. handle various >, < situations
  4. ...
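
A rough, standalone sketch of that conversion, with the type checks reduced to a bare minimum and time-zone handling ignored. It assumes the sqlparser and chrono crates, and the range type is a simplified stand-in rather than greptimedb's real TimestampRange:

use chrono::NaiveDateTime;
use sqlparser::ast::{BinaryOperator, Expr};

/// Simplified stand-in for greptimedb's TimestampRange (bounds in epoch
/// milliseconds; start inclusive, end exclusive).
#[derive(Debug, Default)]
struct SimpleRange {
    start: Option<i64>,
    end: Option<i64>,
}

/// Naive literal parsing; the real code would respect the session time zone.
fn parse_ts_millis(s: &str) -> Option<i64> {
    NaiveDateTime::parse_from_str(s.trim_matches('\''), "%Y-%m-%d %H:%M:%S")
        .ok()
        .map(|dt| dt.and_utc().timestamp_millis())
}

/// Step 3: fold one comparison operator into the range bounds.
fn apply_comparison(range: &mut SimpleRange, op: &BinaryOperator, ts: i64) {
    match op {
        BinaryOperator::Gt => range.start = Some(ts + 1),
        BinaryOperator::GtEq => range.start = Some(ts),
        BinaryOperator::Lt => range.end = Some(ts),
        BinaryOperator::LtEq => range.end = Some(ts + 1),
        BinaryOperator::Eq => {
            range.start = Some(ts);
            range.end = Some(ts + 1);
        }
        _ => {} // anything else is out of scope for this issue
    }
}

/// Steps 1-2 (shape and type checks) are kept minimal: AND is recursed into,
/// and a comparison is expected to be `<column> <op> '<timestamp literal>'`.
fn extract(expr: &Expr, range: &mut SimpleRange) {
    if let Expr::BinaryOp { left, op, right } = expr {
        if *op == BinaryOperator::And {
            extract(left, range);
            extract(right, range);
        } else if matches!(left.as_ref(), Expr::Identifier(_)) {
            // Display gives us the literal's quoted text without matching on
            // sqlparser's Value variants, which differ across versions.
            if let Some(ts) = parse_ts_millis(&right.to_string()) {
                apply_comparison(range, op, ts);
            }
        }
    }
}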

@MichaelScofield
Collaborator Author

@naosense that's ok, feel free to take another good first issue!

@MichaelScofield
Collaborator Author

@okJiang we can make things a lot simpler by restricting the allowed syntax here. The filters in the copy from statement should only take one of the following 3 forms:

  • where <F> and <F>
  • where <F> or <F>
  • where <F>

where F = <column> (> | >= | < | <= | =) <timestamp literal value>
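
A quick shape check for this restricted grammar could look something like the following standalone sketch (column and literal validation omitted; not greptimedb code):

use sqlparser::ast::{BinaryOperator, Expr};

/// F: `<column> <op> <timestamp literal>`; only the operator is checked here.
fn is_comparison(expr: &Expr) -> bool {
    matches!(
        expr,
        Expr::BinaryOp { op, .. } if matches!(
            op,
            BinaryOperator::Gt
                | BinaryOperator::GtEq
                | BinaryOperator::Lt
                | BinaryOperator::LtEq
                | BinaryOperator::Eq
        )
    )
}

/// Accepts exactly `F`, `F AND F`, or `F OR F`.
fn matches_allowed_forms(expr: &Expr) -> bool {
    match expr {
        Expr::BinaryOp { left, op, right }
            if matches!(op, BinaryOperator::And | BinaryOperator::Or) =>
        {
            is_comparison(left) && is_comparison(right)
        }
        _ => is_comparison(expr),
    }
}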

@kysshsy

kysshsy commented May 22, 2024

I'm trying to solve this problem. It seems that I need to pass arguments to fn to_copy_table_request so that I can:

  1. convert the string to a timestamp with a timezone
  2. check that the specified columns are correct.

However, I can't find an elegant way to pass the parsed arguments.

pub type WhereClause = Option<(
    Option<(ObjectName, ParserBinaryOperator, String)>,
    Option<ParserBinaryOperator>,
    Option<(ObjectName, ParserBinaryOperator, String)>,
)>;

Should I simply pass the Expr to the to_copy_table_request function and parse it again, or is there some way to make the WhereClause more elegant?
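
One possible way to make the parsed clause more self-describing than a nested tuple is a small struct/enum pair mirroring the three allowed forms above. This is purely illustrative and not the shape the maintainers settled on:

use sqlparser::ast::{BinaryOperator, ObjectName};

/// A single `<column> <op> <literal>` filter (BinaryOperator is what the
/// snippet above aliases as ParserBinaryOperator).
pub struct TimeFilter {
    pub column: ObjectName,
    pub op: BinaryOperator,
    pub literal: String,
}

/// `F`, `F AND F`, or `F OR F`.
pub enum WhereClause {
    Single(TimeFilter),
    And(TimeFilter, TimeFilter),
    Or(TimeFilter, TimeFilter),
}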

@evenyag
Contributor

evenyag commented May 24, 2024

Should I simply pass the Expr to the to_copy_table_request function and parse it again, or is there some way to make the WhereClause more elegant?

@kysshsy For simplicity, what about reusing the time range option?

let timestamp_range = timestamp_range_from_option_map(&with, &query_ctx)?;

Then we use the time range to filter the output record batches from the files.

If we'd like to support WHERE, maybe we should provide the full capabilities of the where expression. We could create a new issue for WHERE support in the COPY statement.
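
As a sketch of the "reuse the time range option" idea: the statement's WITH map already reaches to_copy_table_request, so the range can be derived from two option values. The real helper is timestamp_range_from_option_map shown above (which also takes the query context); the option key names and the plain-integer parsing below are assumptions for illustration only:

use std::collections::HashMap;

/// Hypothetical option keys; the real helper defines which keys it recognizes.
const START_KEY: &str = "start_time";
const END_KEY: &str = "end_time";

/// Returns (start, end) in epoch milliseconds if both options are present and
/// form a valid range. Real parsing would handle timestamp strings and the
/// session time zone instead of bare integers.
fn range_from_options(with: &HashMap<String, String>) -> Option<(i64, i64)> {
    let start = with.get(START_KEY)?.parse::<i64>().ok()?;
    let end = with.get(END_KEY)?.parse::<i64>().ok()?;
    (start < end).then_some((start, end))
}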

@v0y4g3r
Contributor

v0y4g3r commented May 24, 2024

@kysshsy Since we've marked this issue as a GFI, I think we can resort to a simple solution. We don't need to support complex time ranges expressed by a where clause; instead, we can reuse the time range in the WITH options, which is already in CopyTableRequest:

pub timestamp_range: Option<TimestampRange>,

This will cover 90% of the use cases.

Once we've built a record batch stream via build_read_stream:

async fn build_read_stream(

we can evaluate the yielded batches just as is done here:

// Build a filter on the timestamp column from the requested time range.
let filters = table
    .schema()
    .timestamp_column()
    .and_then(|c| {
        common_query::logical_plan::build_filter_from_timestamp(
            &c.name,
            req.timestamp_range.as_ref(),
        )
    })
    .into_iter()
    .collect::<Vec<_>>();

let table_provider = Arc::new(DfTableProviderAdapter::new(table));
let table_source = Arc::new(DefaultTableSource::new(table_provider));

// Scan the table with the filters, then apply them again on top of the scan.
let mut builder = LogicalPlanBuilder::scan_with_filters(
    df_table_ref,
    table_source,
    None,
    filters.clone(),
)
.context(BuildDfLogicalPlanSnafu)?;
for f in filters {
    builder = builder.filter(f).context(BuildDfLogicalPlanSnafu)?;
}

// Execute the resulting logical plan to get the filtered output.
let plan = builder.build().context(BuildDfLogicalPlanSnafu)?;
let output = self
    .query_engine
    .execute(LogicalPlan::DfPlan(plan), query_ctx)
    .await
    .context(ExecLogicalPlanSnafu)?;
