Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make COPY TO align with CREATE EXTERNAL TABLE #9604

Merged
merged 8 commits into from
Mar 18, 2024

Conversation

metesynnada
Copy link
Contributor

Which issue does this PR close?

Closes #9369.

Rationale for this change

Changing

OPTIONS (
    format X,
    X.foo.bar baz
)

into

STORED AS X
OPTIONS (
    format.foo.bar baz
)

What changes are included in this PR?

  • "format" prefix is defined for all formats.
  • COPY TO statement is now closer to CREATE EXTERNAL TABLE syntax.

More talk on synnada-ai#10

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 14, 2024
Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the PR looks good, but I have some concerns about the syntax we are dropping. Most of all, I think queries like this should continue to work:

COPY table to 'file.csv'

Copy is useful on the commandline for quickly moving files around and this syntax is more compact and intuitive than

COPY table to 'file.csv' STORED AS CSV

So, I think we should continue supporting the ability to infer the format when copying to a single file with an unabiguous extension.

@@ -397,7 +397,7 @@ CopyTo: format=csv output_url=output.csv options: ()

#[test]
fn plan_explain_copy_to() {
let sql = "EXPLAIN COPY test_decimal to 'output.csv'";
let sql = "EXPLAIN COPY test_decimal to 'output.csv' STORED AS CSV";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should continue to support the original syntax here, which is in my opinion much more clear and concise.

Copy link
Contributor Author

@metesynnada metesynnada Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this, I made STORED AS optional.

@@ -23,7 +23,7 @@
# create.sql came from
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
# Data file made with DuckDB:
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' STORED AS PARQUET;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think we should also maintain backwards compatibility. It is nice to have a dedicated keyword for the often required format option, but I think user's queries in the old syntax should continue working

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I firmly believe that a singular approach to the format introduction is the most logical and maintainable route. Thus implementing a breaking change aligns better with our goals for clarity and efficiency.

For this particular highlight, I didn't read the DuckDB reminder, this is why I changed it back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that supporting the duckdb style would be nice, but that this PR's consistency is also nice

I wonder if it would be possible (as a follow on PR) to add some special case backwards compatibility code to support the (FORMAT PARQUET) style? It seems to me like this could be a small amount of additional, localized code, and then we could going forward use the unified approach in this PR by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9713 to track

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @metesynnada -- other than the commented out test, this PR looks good to go in my opinion.

It would be good to get @devinjdangelo 's opinion as well

I think the follow ups would be:

  1. make the format. prefix optional
  2. Support (FORMAT PARQUET) style option for backwards compatibility

@@ -23,7 +23,7 @@
# create.sql came from
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
# Data file made with DuckDB:
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should revert this change as it is a comment showing what command was used in duckdb (not a datafusion command)

@ozankabak
Copy link
Contributor

Thanks @metesynnada -- other than the commented out test, this PR looks good to go in my opinion.

It would be good to get @devinjdangelo 's opinion as well

I think the follow ups would be:

  1. make the format. prefix optional
  2. Support (FORMAT PARQUET) style option for backwards compatibility

Sounds good. Let's get the consistent base syntax in with this PR, and have follow-ons for shortcuts (1) and backwards compatibility (2).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @metesynnada -- I think this PR is good to go from my perspective after:

  1. We file a ticket to track backwards compatible (FORMAT PARQUET) syntax
  2. We file a ticket about what is going on with escaped quotes (aka COPY TO allign with CREATE EXTERNAL TABLE synnada-ai/datafusion-upstream#10 (comment))
  3. @devinjdangelo gives it a final review

CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the expalanation of why this test is commented out: synnada-ai#10 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9714 to track

Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @metesynnada this looks great! It definitely is nice now that copying data into files and then creating a table over the files have very similar syntaxes. And it remains easy/compact to copy to a single file, which is also very nice! 🚀

DataFusion CLI v36.0.0
❯ copy (values (1, 2, 3), (4, 5, 6)) to 'file.csv';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.020 seconds.

❯ select * from 'file.csv';
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | 2       | 3       |
| 4       | 5       | 6       |
+---------+---------+---------+
2 rows in set. Query took 0.005 seconds.
❯ copy (values ('1', 2, 3), ('4', 5, 6)) to 'partitioned_csv/' partitioned by (column1) stored as csv;
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.027 seconds.
❯ create external table partitioned_csv stored as CSV location 'partitioned_csv' partitioned by (column1) with header row;
0 rows in set. Query took 0.001 seconds.

❯ select * from partitioned_csv;
+---------+---------+---------+
| column2 | column3 | column1 |
+---------+---------+---------+
| 5       | 6       | 4       |
| 2       | 3       | 1       |
+---------+---------+---------+
2 rows in set. Query took 0.001 seconds.

/// CSV Header row?
pub has_header: bool,
/// File type (Parquet, NDJSON, CSV, etc)
pub stored_as: Option<String>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -54,8 +54,8 @@ select * from validate_partitioned_parquet_bar order by col1;

# Copy to directory as partitioned files
query ITT
COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/'
(format parquet, partition_by 'column2, column3', 'parquet.compression' 'zstd(10)');
COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' STORED AS parquet PARTITIONED BY (column2, column3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty cool. Definitely an improvement in readability over the prior syntax 👍

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left three minor comments

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/sql/src/parser.rs Outdated Show resolved Hide resolved
@ozankabak
Copy link
Contributor

This seems good to go - I will wait for a little bit more in case there are any more comments and then merge this today.

@ozankabak ozankabak merged commit b137f60 into apache:main Mar 18, 2024
24 checks passed
@ozankabak
Copy link
Contributor

Let's address any remaining issues in quick follow-ons.

@alamb
Copy link
Contributor

alamb commented Mar 20, 2024

Filed #9716 to track the enhancement to avoid repeating .format over and over

@alamb
Copy link
Contributor

alamb commented Mar 23, 2024

Thanks to @devinjdangelo and @tinfoil-knight I think we have completed all the follow on items #9716 and #9713

I filed #9754 to update the documentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aligning COPY TO and CREATE TABLE Syntax
4 participants