Add Delta table support for filesystem destination #1382

Conversation
dlt/common/destination/reference.py
Outdated
@@ -309,8 +309,12 @@ def restore_file_load(self, file_path: str) -> LoadJob:
    """Finds and restores already started loading job identified by `file_path` if destination supports it."""
    pass

def can_do_logical_replace(self, table: TTableSchema) -> bool:
Perhaps this can become a destination capability if we turn Delta into a full destination.
IMO we do not need this on the highest abstraction level. This belongs only to the filesystem client, and there it is enough to override `should_truncate_table_before_load`.

I'm trying to keep the abstract classes as simple as possible. The two methods below are already a stretch (but I do not have an idea where to move them).
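A minimal sketch of what that override might look like in the filesystem client; only `should_truncate_table_before_load` comes from the comment above, while the class skeleton and return logic are assumptions:

```py
from dlt.common.destination.reference import JobClientBase  # path assumed
from dlt.common.schema.typing import TTableSchema

# Sketch only, not the PR's actual code: let delta tables "replace"
# logically (via an overwrite) instead of truncating the folder first.
class FilesystemClient(JobClientBase):
    def should_truncate_table_before_load(self, table: TTableSchema) -> bool:
        # delta rewrites the table in its transaction log, so no
        # physical truncation is needed before loading
        return table.get("table_format") != "delta"
```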
assert isinstance(self.config.credentials, AwsCredentials)
storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name
The `deltalake` library requires that `AWS_REGION` is provided. We need to add it to `DLT_SECRETS_TOML` under `[destination.filesystem.credentials]` to make `s3` tests pass on CI.
yeah! this also may come from the machine default credentials. nevertheless we should warn or exit when this is not set
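A sketch of such a guard, assuming it sits right after the snippet above (the placement and message text are illustrative):

```py
# Sketch: fail fast when no region is resolved, since the deltalake
# library requires AWS_REGION.
if not storage_options.get("AWS_REGION"):
    raise ValueError(
        "AWS_REGION could not be resolved from credentials or machine "
        "defaults; set it under [destination.filesystem.credentials]."
    )
```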
assert isinstance(self.config.credentials, GcpServiceAccountCredentials)
gcs_creds = self.config.credentials.to_gcs_credentials()
gcs_creds["token"]["private_key_id"] = "921837921798379812"
This must be changed so that `private_key_id` is fetched from configuration.
hmmmm OK, when you authenticate in Python you do not need to do that... we can add this as an optional field. This also means that OAuth authentication will not work? I think it is fine.

btw, can delta-rs find default google credentials? you can check with `has_default_credentials()` and then leave `token` as None. works for fsspec
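A sketch of that fallback; `has_default_credentials()` and leaving `token` as None come from the comment, the placement is an assumption:

```py
# Sketch: prefer application default credentials when available and
# let delta-rs / fsspec resolve them, instead of passing a token.
gcs_creds = self.config.credentials.to_gcs_credentials()
if self.config.credentials.has_default_credentials():  # helper assumed
    gcs_creds["token"] = None
```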
storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name
# https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/#enable-unsafe-writes-in-s3-opt-in
storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
Setting `AWS_S3_ALLOW_UNSAFE_RENAME` to `true` is the simplest setup. Perhaps we can later extend this and let the user configure a locking provider. Context: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/.
interesting. we have a locking provider but it is probably not compatible with delta. it is called `transactional_file.py`
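For context, the locking-provider setup from the linked delta-rs page looks roughly like this (option names per those docs; the DynamoDB table name is illustrative):

```py
# Sketch: a DynamoDB locking provider instead of unsafe renames.
storage_options = {
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log",  # illustrative name
}
```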
@rudolfix Can you review? Delta tables are managed at the folder level, not the file level. Hence, they are treated differently than the […]

I'll add docs after we've settled on the user interface.
ohhh I was sure we can have delta tables as a single file. when I look at your code I think we should do something else: make a destination out of that. it could be based on filesystem and use the same credentials. OFC we support only append and replace.

- we do not use file format. we should use `table_format`, and we add `delta` to it (along `pyiceberg`)
- you can create different jobs within the `filesystem` destination depending on `table_format` (see the sketch below)
- check how we do it in athena. I think here it will be much simpler - you can then separate delta code from regular file code, and in the future we can add pyiceberg support easily
- I'm not sure we can add `merge` support this way... however we can always abuse the existing merge mechanism (when there's a merge write disposition, the delta job does nothing but requests a followup job, so at the end we process all table files at once)
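A rough sketch of the dispatch suggested above (method and job class names are illustrative, not the PR's actual code):

```py
# Sketch: pick the job type inside the filesystem destination based on
# the resolved table_format, keeping delta code separate from file code.
def start_file_load(self, table: TTableSchema, file_path: str, load_id: str) -> LoadJob:
    if table.get("table_format") == "delta":
        return DeltaLoadJob(table, file_path)  # hypothetical delta job
    return FilesystemLoadJob(table, file_path)  # hypothetical regular job
```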
def _write_delta_table(
    self, path: str, table: "pa.Table", write_disposition: TWriteDisposition  # type: ignore[name-defined]  # noqa
) -> None:
    """Writes in-memory Arrow table to on-disk Delta table."""
a few questions here:

- we can have many files for a given table. are we able to write them at once?
- to the above: writing several tables at once in parallel: is it supported? (it should be, really :))
- do we really need to load the parquet file into memory? I know that you clean it up, but we can implement parquet alignment differently, i.e. via another "flavour" of parquet that a given destination can request.
import pyarrow as pa
from deltalake import write_deltalake


def adjust_arrow_schema(
all of those look like utility functions that could be available independently and also unit tested
I moved them to `utils.py` and made them independent. Haven't added unit tests yet.
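A sketch of the kind of unit test meant here; the import path and the exact signature of `adjust_arrow_schema` are assumptions:

```py
import pyarrow as pa
from dlt.destinations.impl.filesystem import utils  # module path assumed

def test_adjust_arrow_schema() -> None:
    # minimal happy path: the adjusted schema is still a pyarrow schema
    schema = pa.schema([pa.field("id", pa.int64())])
    adjusted = utils.adjust_arrow_schema(schema)  # signature assumed
    assert isinstance(adjusted, pa.Schema)
```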
dlt/common/destination/reference.py
Outdated
@@ -214,6 +214,20 @@ def exception(self) -> str:
    pass


class DirectoryLoadJob:
Very minimal for now. Want to get some feedback before further polishing.
dlt/common/storages/load_package.py
Outdated
@@ -177,6 +178,15 @@ def __str__(self) -> str:
    return self.job_id()


class ParsedLoadJobDirectoryName(NamedTuple):
Also very minimal. Same as above.
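Purely as an illustration of the shape such a tuple might take (the fields are assumptions, mirroring the existing `ParsedLoadJobFileName`):

```py
from typing import NamedTuple

class ParsedLoadJobDirectoryName(NamedTuple):
    """Sketch: parsed components of a load job directory name."""
    table_name: str
    dir_id: str  # assumed, by analogy with file_id
```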
def restore_file_load(self, file_path: str) -> LoadJob:
    return EmptyLoadJob.from_file_path(file_path, "completed")

def start_dir_load(self, table: TTableSchema, dir_path: str, load_id: str) -> DirectoryLoadJob:
Perhaps `iceberg` will also be a directory job if we add it.

@rudolfix Can you review once more? I addressed some of your feedback. Biggest changes since last review:
@@ -216,6 +222,244 @@ def some_source():
    assert table.column("value").to_pylist() == [1, 2, 3, 4, 5]


@pytest.mark.essential
I only marked this test as essential. Perhaps the other tests also need the mark. Do we have a guideline for this?
- if we add a new feature, it makes sense to write 1-2 tests with happy paths and mark them as essential
- if tests finish quickly (like all the local pipeline tests here), they can also be marked as essential

essential is used to have fast smoke-testing for destinations in normal CI runs
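Applied to this PR, that guideline amounts to marking the happy-path test, as the diff above already does:

```py
import pytest

@pytest.mark.essential  # included in fast smoke-test CI runs
def test_delta_table_happy_path() -> None:
    ...  # 1-2 happy paths for the new feature
```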
LGTM! amazing work!
I added a few small fixes mostly to handle failing pydantic tests. see commits for details
Description

This PR enables writing datasets to Delta tables in the `filesystem` destination. A user can specify `delta` as `table_format` in a resource definition:
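A sketch of the described usage, assuming the filesystem bucket and credentials are configured via `secrets.toml` (resource name and data are illustrative):

```py
import dlt

@dlt.resource(table_format="delta")
def my_delta_resource():
    yield [{"id": 1}, {"id": 2}]

pipeline = dlt.pipeline(destination="filesystem")
# delta tables are written from parquet files
pipeline.run(my_delta_resource, loader_file_format="parquet")
```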
Related Issues

Contributes to #978