-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove FileWriterMode and ListingTableInsertMode (#7994) #8017
Conversation
1d7e8bd
to
76624a6
Compare
/// `AsyncPutWriter` is an object that facilitates asynchronous writing to object stores. | ||
/// It is specifically designed for the `object_store` crate's `put` method and sends | ||
/// whole bytes at once when the buffer is flushed. | ||
pub struct AsyncPutWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't actually hooked up anywhere (and was actually incorrect and wouldn't have worked anyway #7991)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I double checked and it appears to be dead code (unused and untested):
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20AsyncPutWriter&type=code
@@ -229,16 +116,6 @@ impl<W: AsyncWrite + Unpin + Send> AsyncWrite for AbortableWrite<W> { | |||
} | |||
} | |||
|
|||
/// An enum that defines different file writer modes. | |||
#[derive(Debug, Clone, Copy)] | |||
pub enum FileWriterMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given FileWriterMode::Put is not implemented by anything, and we are removing FileWriterMode::Append, it seemed simplest to just remove this. This also frees up implementations to choose an adaptive IO strategy, where they can determine if Put
is appropriate
76624a6
to
1e86301
Compare
@@ -267,8 +267,6 @@ INSERT INTO single_file_test values (4, 5), (6, 7); | |||
query II | |||
select * from single_file_test; | |||
---- | |||
1 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change worth highlighting, inserting into a ListingTable backed by a single CSV or JSON file now overwrites. We could also make this error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be an error. It would also be really good to call out how to get "append" semantics (which I believe requires using CREATE UNBOUNDED TABLE ...
)?
///Data is appended as new files in existing TablePaths | ||
AppendNewFiles, | ||
///Throw an error if insert into is attempted on this table | ||
Error, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth highlighting this also removes ListingTableInsertMode::Error, I wasn't sure that this was important to preserve but happy to restore if people feel strongly. @devinjdangelo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing error is fine by me. A formal access control mechanism would be better than the error mode option.
I do think there is some value to append mode for ListingTable. Specifically it enables tables backed by a single file where inserts append to that file.
This is a relatively niche usecase so I'm not bothered if it is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically it enables tables backed by a single file where inserts append to that file.
I believe this use-case is now covered by StreamTable unless I am missing something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could be (I haven't dug into StreamTable yet). The usecase for append I'm thinking of isn't really streaming related. More like a batch job that appends to one file or a fixed set of files on disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StreamTable will do the former, but cannot, yet, handle multiple files. We could add this but I would rather scream test if anyone is actually doing this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to hear about a usecase of appending to a set of existing files -- it is not a usecase I am familiar with but also my background is in analytic database where it is more common to make new files (and then "compact/rewrite" them as subsequent step) rather than modifying existing files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this (and #8021) @tustvold -- I think this looks like a nice alighnment to separate object_store backed ListingTable
s and the more flexible StreamingTable
.
I do think we should make writing into an existing file error (rather than change the behavior from append to overwrite) as it is a significant change that I think could result in dataloss for existing users who expected the old behavior.
I also don't think we should merge it until @ozankabak and @metegenez have a chance to weigh in and approve, given the potential overlap between what I believe they are working on and the changes in this PR.
/// `AsyncPutWriter` is an object that facilitates asynchronous writing to object stores. | ||
/// It is specifically designed for the `object_store` crate's `put` method and sends | ||
/// whole bytes at once when the buffer is flushed. | ||
pub struct AsyncPutWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I double checked and it appears to be dead code (unused and untested):
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20AsyncPutWriter&type=code
/// A wrapper struct with abort method and writer | ||
pub(crate) struct AbortableWrite<W: AsyncWrite + Unpin + Send> { | ||
writer: W, | ||
mode: AbortMode, | ||
multipart: MultiPart, | ||
} | ||
|
||
impl<W: AsyncWrite + Unpin + Send> AbortableWrite<W> { | ||
/// Create a new `AbortableWrite` instance with the given writer, and write mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the doc comments may also need updating
///Data is appended as new files in existing TablePaths | ||
AppendNewFiles, | ||
///Throw an error if insert into is attempted on this table | ||
Error, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to hear about a usecase of appending to a set of existing files -- it is not a usecase I am familiar with but also my background is in analytic database where it is more common to make new files (and then "compact/rewrite" them as subsequent step) rather than modifying existing files
@@ -267,8 +267,6 @@ INSERT INTO single_file_test values (4, 5), (6, 7); | |||
query II | |||
select * from single_file_test; | |||
---- | |||
1 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be an error. It would also be really good to call out how to get "append" semantics (which I believe requires using CREATE UNBOUNDED TABLE ...
)?
Thank you, we will take a look tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tustvold -- this PR is good to go from my perspective, but just to be over communicative I do not think we should merge it until review and approval from @ozankabak / @metesynnada
@@ -283,7 +283,12 @@ struct StreamWrite(Arc<StreamConfig>); | |||
|
|||
impl DisplayAs for StreamWrite { | |||
fn fmt_as(&self, _t: DisplayFormatType, f: &mut Formatter) -> std::fmt::Result { | |||
write!(f, "{self:?}") | |||
f.debug_struct("StreamWrite") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change you end up with an absolutely massive display line
Thanks, we will circle back tomorrow morning. I took a quick look and it looks good to me, but I want to consult with @metesynnada just to be sure. |
Do you require further time? Happy to wait, just checking where we stand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold, AFAICT this is good to go from our perspective. In case we are overlooking anything, we can send a quick follow-on PR. But all seems good now 🚀
Which issue does this PR close?
Relates to #7994
Closes #7991
Rationale for this change
Following #8021 append semantics are handled by a separate TableProvider, this allows removing the corresponding logic from ListingTable.
Further cleanups will likely follow, but this is all that is necessary to unblock #8029
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?