Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove FileWriterMode and ListingTableInsertMode (#7994) #8017

Merged
merged 4 commits into from
Nov 17, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Nov 1, 2023

Which issue does this PR close?

Relates to #7994
Closes #7991

Rationale for this change

Following #8021 append semantics are handled by a separate TableProvider, this allows removing the corresponding logic from ListingTable.

Further cleanups will likely follow, but this is all that is necessary to unblock #8029

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@tustvold tustvold closed this Nov 1, 2023
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Nov 1, 2023
@tustvold tustvold reopened this Nov 15, 2023
@tustvold tustvold force-pushed the remove-listing-table-append branch from 1d7e8bd to 76624a6 Compare November 15, 2023 15:08
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Nov 15, 2023
@tustvold tustvold added the api change Changes the API exposed to users of the crate label Nov 15, 2023
@tustvold tustvold changed the title POC: Remove ListingTable Append Support (#7994) Remove ListingTable WriterMode Support (#7994) Nov 15, 2023
/// `AsyncPutWriter` is an object that facilitates asynchronous writing to object stores.
/// It is specifically designed for the `object_store` crate's `put` method and sends
/// whole bytes at once when the buffer is flushed.
pub struct AsyncPutWriter {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't actually hooked up anywhere (and was actually incorrect and wouldn't have worked anyway #7991)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I double checked and it appears to be dead code (unused and untested):

https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20AsyncPutWriter&type=code

Screenshot 2023-11-16 at 8 40 03 AM

@@ -229,16 +116,6 @@ impl<W: AsyncWrite + Unpin + Send> AsyncWrite for AbortableWrite<W> {
}
}

/// An enum that defines different file writer modes.
#[derive(Debug, Clone, Copy)]
pub enum FileWriterMode {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given FileWriterMode::Put is not implemented by anything, and we are removing FileWriterMode::Append, it seemed simplest to just remove this. This also frees up implementations to choose an adaptive IO strategy, where they can determine if Put is appropriate

@tustvold tustvold force-pushed the remove-listing-table-append branch from 76624a6 to 1e86301 Compare November 15, 2023 15:23
@tustvold tustvold changed the title Remove ListingTable WriterMode Support (#7994) Remove FileWriterMode (#7994) Nov 15, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 15, 2023
@@ -267,8 +267,6 @@ INSERT INTO single_file_test values (4, 5), (6, 7);
query II
select * from single_file_test;
----
1 2
Copy link
Contributor Author

@tustvold tustvold Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change worth highlighting, inserting into a ListingTable backed by a single CSV or JSON file now overwrites. We could also make this error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be an error. It would also be really good to call out how to get "append" semantics (which I believe requires using CREATE UNBOUNDED TABLE ...)?

@tustvold
Copy link
Contributor Author

@tustvold tustvold marked this pull request as ready for review November 15, 2023 15:27
///Data is appended as new files in existing TablePaths
AppendNewFiles,
///Throw an error if insert into is attempted on this table
Error,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth highlighting this also removes ListingTableInsertMode::Error, I wasn't sure that this was important to preserve but happy to restore if people feel strongly. @devinjdangelo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing error is fine by me. A formal access control mechanism would be better than the error mode option.

I do think there is some value to append mode for ListingTable. Specifically it enables tables backed by a single file where inserts append to that file.

This is a relatively niche usecase so I'm not bothered if it is removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically it enables tables backed by a single file where inserts append to that file.

I believe this use-case is now covered by StreamTable unless I am missing something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be (I haven't dug into StreamTable yet). The usecase for append I'm thinking of isn't really streaming related. More like a batch job that appends to one file or a fixed set of files on disk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StreamTable will do the former, but cannot, yet, handle multiple files. We could add this but I would rather scream test if anyone is actually doing this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to hear about a usecase of appending to a set of existing files -- it is not a usecase I am familiar with but also my background is in analytic database where it is more common to make new files (and then "compact/rewrite" them as subsequent step) rather than modifying existing files

@tustvold tustvold changed the title Remove FileWriterMode (#7994) Remove FileWriterMode and ListingTableInsertMode (#7994) Nov 15, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this (and #8021) @tustvold -- I think this looks like a nice alighnment to separate object_store backed ListingTables and the more flexible StreamingTable.

I do think we should make writing into an existing file error (rather than change the behavior from append to overwrite) as it is a significant change that I think could result in dataloss for existing users who expected the old behavior.

I also don't think we should merge it until @ozankabak and @metegenez have a chance to weigh in and approve, given the potential overlap between what I believe they are working on and the changes in this PR.

/// `AsyncPutWriter` is an object that facilitates asynchronous writing to object stores.
/// It is specifically designed for the `object_store` crate's `put` method and sends
/// whole bytes at once when the buffer is flushed.
pub struct AsyncPutWriter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed I double checked and it appears to be dead code (unused and untested):

https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20AsyncPutWriter&type=code

Screenshot 2023-11-16 at 8 40 03 AM

/// A wrapper struct with abort method and writer
pub(crate) struct AbortableWrite<W: AsyncWrite + Unpin + Send> {
writer: W,
mode: AbortMode,
multipart: MultiPart,
}

impl<W: AsyncWrite + Unpin + Send> AbortableWrite<W> {
/// Create a new `AbortableWrite` instance with the given writer, and write mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the doc comments may also need updating

///Data is appended as new files in existing TablePaths
AppendNewFiles,
///Throw an error if insert into is attempted on this table
Error,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to hear about a usecase of appending to a set of existing files -- it is not a usecase I am familiar with but also my background is in analytic database where it is more common to make new files (and then "compact/rewrite" them as subsequent step) rather than modifying existing files

@@ -267,8 +267,6 @@ INSERT INTO single_file_test values (4, 5), (6, 7);
query II
select * from single_file_test;
----
1 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be an error. It would also be really good to call out how to get "append" semantics (which I believe requires using CREATE UNBOUNDED TABLE ...)?

@ozankabak
Copy link
Contributor

Thank you, we will take a look tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tustvold -- this PR is good to go from my perspective, but just to be over communicative I do not think we should merge it until review and approval from @ozankabak / @metesynnada

@@ -283,7 +283,12 @@ struct StreamWrite(Arc<StreamConfig>);

impl DisplayAs for StreamWrite {
fn fmt_as(&self, _t: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result {
write!(f, "{self:?}")
f.debug_struct("StreamWrite")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change you end up with an absolutely massive display line

@ozankabak
Copy link
Contributor

Thanks @tustvold -- this PR is good to go from my perspective, but just to be over communicative I do not think we should merge it until review and approval from @ozankabak / @metesynnada

Thanks, we will circle back tomorrow morning. I took a quick look and it looks good to me, but I want to consult with @metesynnada just to be sure.

@tustvold
Copy link
Contributor Author

Do you require further time? Happy to wait, just checking where we stand

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold, AFAICT this is good to go from our perspective. In case we are overlooking anything, we can send a quick follow-on PR. But all seems good now 🚀

@tustvold tustvold merged commit 325a3fb into apache:main Nov 17, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AsyncPutWriter::poll_shutdown_inner Incorrect
4 participants