-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make COPY TO align with CREATE EXTERNAL TABLE #9604
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think the PR looks good, but I have some concerns about the syntax we are dropping. Most of all, I think queries like this should continue to work:
COPY table to 'file.csv'
Copy is useful on the commandline for quickly moving files around and this syntax is more compact and intuitive than
COPY table to 'file.csv' STORED AS CSV
So, I think we should continue supporting the ability to infer the format when copying to a single file with an unabiguous extension.
@@ -397,7 +397,7 @@ CopyTo: format=csv output_url=output.csv options: () | |||
|
|||
#[test] | |||
fn plan_explain_copy_to() { | |||
let sql = "EXPLAIN COPY test_decimal to 'output.csv'"; | |||
let sql = "EXPLAIN COPY test_decimal to 'output.csv' STORED AS CSV"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should continue to support the original syntax here, which is in my opinion much more clear and concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this, I made STORED AS optional.
@@ -23,7 +23,7 @@ | |||
# create.sql came from | |||
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql | |||
# Data file made with DuckDB: | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET); | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' STORED AS PARQUET; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think we should also maintain backwards compatibility. It is nice to have a dedicated keyword for the often required format option, but I think user's queries in the old syntax should continue working
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I firmly believe that a singular approach to the format introduction is the most logical and maintainable route. Thus implementing a breaking change aligns better with our goals for clarity and efficiency.
For this particular highlight, I didn't read the DuckDB reminder, this is why I changed it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that supporting the duckdb style would be nice, but that this PR's consistency is also nice
I wonder if it would be possible (as a follow on PR) to add some special case backwards compatibility code to support the (FORMAT PARQUET)
style? It seems to me like this could be a small amount of additional, localized code, and then we could going forward use the unified approach in this PR by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #9713 to track
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @metesynnada -- other than the commented out test, this PR looks good to go in my opinion.
It would be good to get @devinjdangelo 's opinion as well
I think the follow ups would be:
- make the
format.
prefix optional - Support
(FORMAT PARQUET)
style option for backwards compatibility
@@ -23,7 +23,7 @@ | |||
# create.sql came from | |||
# https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql | |||
# Data file made with DuckDB: | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET); | |||
# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should revert this change as it is a comment showing what command was used in duckdb (not a datafusion command)
Sounds good. Let's get the consistent base syntax in with this PR, and have follow-ons for shortcuts (1) and backwards compatibility (2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @metesynnada -- I think this PR is good to go from my perspective after:
- We file a ticket to track backwards compatible (
FORMAT PARQUET
) syntax - We file a ticket about what is going on with escaped quotes (aka COPY TO allign with CREATE EXTERNAL TABLE synnada-ai/datafusion-upstream#10 (comment))
- @devinjdangelo gives it a final review
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV | ||
LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'"); | ||
|
||
## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the expalanation of why this test is commented out: synnada-ai#10 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #9714 to track
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @metesynnada this looks great! It definitely is nice now that copying data into files and then creating a table over the files have very similar syntaxes. And it remains easy/compact to copy to a single file, which is also very nice! 🚀
DataFusion CLI v36.0.0
❯ copy (values (1, 2, 3), (4, 5, 6)) to 'file.csv';
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.020 seconds.
❯ select * from 'file.csv';
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1 | 2 | 3 |
| 4 | 5 | 6 |
+---------+---------+---------+
2 rows in set. Query took 0.005 seconds.
❯ copy (values ('1', 2, 3), ('4', 5, 6)) to 'partitioned_csv/' partitioned by (column1) stored as csv;
+-------+
| count |
+-------+
| 2 |
+-------+
1 row in set. Query took 0.027 seconds.
❯ create external table partitioned_csv stored as CSV location 'partitioned_csv' partitioned by (column1) with header row;
0 rows in set. Query took 0.001 seconds.
❯ select * from partitioned_csv;
+---------+---------+---------+
| column2 | column3 | column1 |
+---------+---------+---------+
| 5 | 6 | 4 |
| 2 | 3 | 1 |
+---------+---------+---------+
2 rows in set. Query took 0.001 seconds.
/// CSV Header row? | ||
pub has_header: bool, | ||
/// File type (Parquet, NDJSON, CSV, etc) | ||
pub stored_as: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -54,8 +54,8 @@ select * from validate_partitioned_parquet_bar order by col1; | |||
|
|||
# Copy to directory as partitioned files | |||
query ITT | |||
COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' | |||
(format parquet, partition_by 'column2, column3', 'parquet.compression' 'zstd(10)'); | |||
COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' STORED AS parquet PARTITIONED BY (column2, column3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty cool. Definitely an improvement in readability over the prior syntax 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, left three minor comments
This seems good to go - I will wait for a little bit more in case there are any more comments and then merge this today. |
Let's address any remaining issues in quick follow-ons. |
Filed #9716 to track the enhancement to avoid repeating |
Thanks to @devinjdangelo and @tinfoil-knight I think we have completed all the follow on items #9716 and #9713 I filed #9754 to update the documentation |
Which issue does this PR close?
Closes #9369.
Rationale for this change
Changing
into
What changes are included in this PR?
More talk on synnada-ai#10
Are these changes tested?
Are there any user-facing changes?