Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer ListingTable Path Creation To Write #8007

Closed
wants to merge 1 commit into from

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Part of #7994

Rationale for this change

This is part of the general ticket of making the file IO story more coherent and less surprising.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Oct 31, 2023
/// If you wish to specify a path that does not exist on the local
/// machine you must provide it as a fully-qualified [file URI]
/// e.g. `file:///myfile.txt`
/// Otherwise, the path will be resolved to an absolute path, and converted
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change that is necessary to accomodate this, we need to permit ListingTableUrl to be created for paths that don't currently exist

@@ -250,7 +245,6 @@ single_file_test(a bigint, b bigint)
STORED AS csv
LOCATION 'test_files/scratch/insert_to_external/single_csv_table.csv'
OPTIONS(
create_local_path 'true',
single_file 'true',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to replace single_file with inference based on the ListingTableUrl, i.e. if it ends with a path separator, as a follow on PR

@@ -214,7 +178,12 @@ impl ListingTableUrl {
}
}
},
false => futures::stream::once(store.head(&self.prefix)).boxed(),
false => futures::stream::once(store.head(&self.prefix))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the behaviour consistent with listing above

@@ -855,14 +855,17 @@ impl TableProvider for ListingTable {
let input_partitions = input.output_partitioning().partition_count();
let writer_mode = match self.options.insert_mode {
ListingTableInsertMode::AppendToFile => {
if input_partitions > file_groups.len() {
if file_groups.is_empty() && self.options.single_file {
// This is a hack, longer term append should be split out (#7994)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this we get an error if this is the first write as there are no file_groups. Longer-term we shouldn't be conflating ListingTable IO and that of streaming use-cases

@tustvold tustvold marked this pull request as draft October 31, 2023 14:06
@tustvold tustvold force-pushed the do-not-canonicalize-paths branch from 3efbe44 to da0554d Compare October 31, 2023 14:30
@@ -46,37 +46,54 @@ pub struct ListingTableUrl {
impl ListingTableUrl {
/// Parse a provided string as a `ListingTableUrl`
///
/// A URL can either refer to a single object, or a collection of objects with a
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't changing the pre-existing behaviour but attempting to better document it

@tustvold tustvold force-pushed the do-not-canonicalize-paths branch from da0554d to 72778b1 Compare October 31, 2023 14:31
/// If you wish to specify a path that does not exist on the local
/// machine you must provide it as a fully-qualified [file URI]
/// e.g. `file:///myfile.txt`
/// If the path already exists in the local filesystem this will be used to determine if this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary to preserve the pre-existing behaviour where it would use the filesystem to determine if the path refers to a directory or not

@@ -362,8 +352,7 @@ statement ok
CREATE EXTERNAL TABLE
table_without_values(field1 BIGINT NULL, field2 BIGINT NULL)
STORED AS parquet
LOCATION 'test_files/scratch/insert_to_external/external_parquet_table_q2'
OPTIONS (create_local_path 'true');
LOCATION 'test_files/scratch/insert_to_external/external_parquet_table_q2/';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As these paths don't exist anymore, we need to use the trailing / to indicate that we intend for a directory, as opposed to a file

@tustvold tustvold added api change Changes the API exposed to users of the crate and removed api change Changes the API exposed to users of the crate labels Oct 31, 2023
@tustvold
Copy link
Contributor Author

I'm not sure if this counts as an API change, I defer to the reviewers

| Option | Description | Default Value |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| CREATE_LOCAL_PATH | If true, the folder or file backing this table will be created on the local file system if it does not already exist when running INSERT INTO queries. | false |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behaviour is now effectively enabled by default

@tustvold tustvold force-pushed the do-not-canonicalize-paths branch from 72778b1 to b96615a Compare October 31, 2023 14:53
@tustvold tustvold force-pushed the do-not-canonicalize-paths branch from b96615a to 7edff96 Compare October 31, 2023 14:58
} else {
return Err(err);
}
create_external_table_test(location, &sql).await.unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This no longer returns an error, making it consistent with the other stores

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Apr 22, 2024
@github-actions github-actions bot closed this Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant