feat: implement S3 log store with transactions backed by DynamoDb #1904
Conversation
Not completely finished yet; there are a bunch of open questions / follow-up items that need to be addressed or discussed.
Exciting work! Just some initial comments.
I'm not too invested in S3, so my opinion probably shouldn't have too much weight 😆. I believe the overall direction we were taking is to make this configurable. Specifically, there is a PR (#1825) about to be merged that separates the S3 lock into a separate crate. Would it maybe be possible to base this off that PR?
Makes sense to have that in a
makes sense.
Haven't looked through the actual code yet, but I seem to remember that the databricks mechanism uses a single dynamo table to manage locking for multiple delta tables? If so, it may make sense that creation is separate.
Elsewhere we have been adopting the visibility crate, to be able to expose some APIs but make it clear that they are not part of the official API contract of the crate.
@roeap: unless I'm missing something, the linked PR #1825 only moves catalog-related logic into a separate crate, and S3 itself and its locking logic are unchanged. I don't think there would be any difficulties in rebasing, but I also don't think the changes there are related. It touches on a different point, though: should I switch to the official AWS SDK instead of rusoto? (@rtyler)
@dispanser - you are of course correct! I guess AWS just triggered something 😆 ... Since rusoto is no longer maintained, I think the AWS SDK is the way to go ...
Correct, there's a single DynamoDb lock table for a possibly very large number of tables.
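For orientation, the core of that shared-table protocol is a conditional write: each commit attempt puts one item keyed by table path and commit file name, and the condition fails if another writer got there first. A minimal sketch using aws-sdk-dynamodb follows; the table and attribute names (`delta_log`, `tablePath`, `fileName`) are assumptions for illustration, not necessarily the schema this PR ends up with.

```rust
// Sketch only: a conditional put into a single shared DynamoDb table, one item
// per (table path, commit file). Attribute and table names are assumptions
// here, and module paths follow recent aws-sdk-dynamodb releases.
use aws_sdk_dynamodb::{types::AttributeValue, Client};

async fn try_record_commit(
    client: &Client,
    lock_table: &str, // single shared table, e.g. "delta_log"
    table_path: &str, // e.g. "s3://bucket/path/to/table"
    file_name: &str,  // e.g. "00000000000000000042.json"
) -> Result<(), Box<dyn std::error::Error>> {
    client
        .put_item()
        .table_name(lock_table)
        .item("tablePath", AttributeValue::S(table_path.to_owned()))
        .item("fileName", AttributeValue::S(file_name.to_owned()))
        // The condition is what makes the put act like a commit transaction:
        // only one writer can create the item for a given (tablePath, fileName)
        // pair; everyone else gets ConditionalCheckFailedException and has to
        // retry against the next table version.
        .condition_expression("attribute_not_exists(tablePath) AND attribute_not_exists(fileName)")
        .send()
        .await?;
    Ok(())
}
```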
I've read through the discussion regarding the crate split at #1713, and your response w.r.t. object stores.
If the goal of the split is to create crates that make sense to release independently, I'm not sure what a good scope would be: all of S3, or just the lock client code? My proposal would be to do these things in separate PRs, and I'm happy to assist in all of them :-). For now, I'll look into the rusoto -> AWS SDK migration for the new lock client logic, but this would result in a …
I created a separate branch to play with the rusoto replacement at https://github.com/dispanser/delta-rs/tree/s3dynamodb-logstore-aws-sdk. The mechanics are mostly easy, but one thing that stands out is that the central config loading mechanism is async; from the docs: `let config = aws_config::load_from_env().await;`. This leads to …
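For concreteness, a minimal sketch of that configuration flow (function names per the aws-config / aws-sdk-dynamodb docs; illustrative, not code from this branch):

```rust
// Minimal sketch of the official SDK's configuration flow. Unlike rusoto,
// where a client could be constructed synchronously, loading region and
// credentials is itself a future, so everything upstream of client
// construction ends up needing to be async (or has to block on it somewhere).
use aws_sdk_dynamodb::Client;

#[tokio::main]
async fn main() {
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);
    let _ = client; // use the client for put_item / get_item calls, etc.
}
```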
I've been playing with …
All my attempts to work (i.e., google) my way around this have so far failed, e.g. following this blog post. Additionally, everyone on the Rust forums and reddit tells you that it's simply a mistake to even try the async-wrapping-sync-wrapping-async layering I'm trying to do here (sketched below). It seems that the requirement for de-serializing … I'm out of my depth here. As for the rest of this PR, I'd really like to get a somewhat authoritative answer to some of the questions I raised above.
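For reference, the sync-over-async bridge alluded to above looks roughly like the sketch below, assuming a multi-threaded tokio runtime. It is the block_in_place + Handle::block_on combination shown in the tokio docs, and exactly the kind of layering the forum advice warns against.

```rust
// Sketch of the sync-over-async bridge being discussed, assuming a
// multi-threaded tokio runtime. On a current-thread runtime block_in_place
// panics, and calling block_on directly from an async context (or spinning up
// a nested Runtime there) also panics, which is why the usual advice is to
// keep the whole call chain async instead.
use tokio::{runtime::Handle, task};

fn load_config_blocking() -> aws_config::SdkConfig {
    task::block_in_place(|| Handle::current().block_on(aws_config::load_from_env()))
}
```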
The existing locking code in the dynamodb-lock crate is more buggy than I am happy with. I am comfortable making a "hard" break in terms of functionality for the …
I think putting it into a separate crate in this workspace is the correct path forward here. One of the goals for #1713 is that we remove some of the feature surface area of … As to the challenges on the …
I went ahead and created a separate crate, …
I ran a bunch of "manual integration tests" with Spark, and verified a number of things:
Some things I noticed:
In general, I wasn't able to get Spark to use …
There are some things which need to be cleaned up here in pulling all the AWS things into the AWS crate. There are still some cross-dependencies which I have started to fix in a topic branch based on this work.
I am going to merge it so that work can continue off `main` prior to the next release.
I guess that's fine if the work is done quickly, but in the past especially QP always emphasised that we wanted to keep main in a releasable state at all times. For future scenarios I think a topic branch would be a better approach to collaborate on partial work.
@roeap Main has not been in a releasable state for some time because of the substantial work going on with splitting crates and other things, which is why I had created the 0.16.x release branch.
@rtyler - good point; since we are splitting, we need to rebuild the release anyhow ...
Description
This log store implementation provides an alternative approach to allow for multi-cluster writes for use with S3 and DynamoDb. It is compatible with the approach taken in the upstream delta library and enables writers on the JVM (e.g., Apache Spark) and `delta-rs` to write to the same delta table concurrently.
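A hypothetical usage sketch (not part of this PR's diff) of how a Rust writer might opt into the DynamoDb-backed coordination via storage options; the option keys shown are assumptions modeled on the existing S3 locking configuration and may not match the names this PR settles on:

```rust
// Hypothetical sketch: opening a table with storage options that select the
// DynamoDb-backed coordination. The option keys below are assumptions, not
// necessarily the names chosen in this PR.
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    let options = HashMap::from([
        ("AWS_S3_LOCKING_PROVIDER".to_string(), "dynamodb".to_string()),
        ("DELTA_DYNAMO_TABLE_NAME".to_string(), "delta_log".to_string()),
    ]);
    let table =
        deltalake::open_table_with_storage_options("s3://my-bucket/my-table", options).await?;
    println!("loaded table at version {}", table.version());
    Ok(())
}
```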