Add Delta table support for filesystem destination #1382

Merged: 50 commits merged into devel from 978-filesystem-delta-table on Jun 6, 2024

Conversation

jorritsandbrink (Collaborator) commented on May 17, 2024

Description

This PR enables writing datasets to Delta tables in the filesystem destination.

A user can specify delta as table_format in a resource definition:

@dlt.resource(table_name="a_delta_table", table_format="delta")
def a_resource():
    ...
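
For context, a minimal sketch of running such a resource end to end with the filesystem destination. The bucket URL, pipeline name, and dataset name below are placeholders, and the factory-style `dlt.destinations.filesystem(...)` call is assumed to be available in the dlt version at hand:

```python
import dlt

# Placeholder bucket URL; a local path or any other supported bucket works the same way.
pipeline = dlt.pipeline(
    pipeline_name="delta_demo",
    destination=dlt.destinations.filesystem("s3://my-bucket/delta"),
    dataset_name="demo_dataset",
)

load_info = pipeline.run(a_resource())  # a_resource as defined above
print(load_info)
```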

Related Issues

Contributes to #978

jorritsandbrink added the enhancement (New feature or request) label and self-assigned this PR on May 17, 2024

@@ -309,8 +309,12 @@ def restore_file_load(self, file_path: str) -> LoadJob:
"""Finds and restores already started loading job identified by `file_path` if destination supports it."""
pass

def can_do_logical_replace(self, table: TTableSchema) -> bool:

jorritsandbrink (author):
Perhaps this can become a destination capability if we turn Delta into a full destination.

rudolfix (reviewer):

IMO we do not need this at the highest abstraction level. It belongs only to the filesystem client, and there it is enough to override should_truncate_table_before_load.

I'm trying to keep the abstract classes as simple as possible. The two methods below are already a stretch (though I do not have an idea where to move them).


assert isinstance(self.config.credentials, AwsCredentials)
storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name

jorritsandbrink (author):

The deltalake library requires that AWS_REGION is provided. We need to add it to DLT_SECRETS_TOML under [destination.filesystem.credentials] to make s3 tests pass on CI.

rudolfix (reviewer):

Yeah! This may also come from the machine's default credentials. Nevertheless, we should warn or exit when it is not set.
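
A minimal sketch of how the region check suggested here might look when assembling the options passed to `write_deltalake`. The helper name is hypothetical; `to_session_credentials()` and `region_name` are taken from the snippet above:

```python
import logging

logger = logging.getLogger(__name__)


def build_aws_storage_options(credentials) -> dict:
    """Hypothetical helper: assemble deltalake storage options from dlt AWS credentials."""
    storage_options = credentials.to_session_credentials()
    if not credentials.region_name:
        # warn (or raise) when no region is configured, as suggested above
        logger.warning("No AWS region configured; deltalake requires AWS_REGION for S3 writes.")
    else:
        storage_options["AWS_REGION"] = credentials.region_name
    return storage_options
```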


assert isinstance(self.config.credentials, GcpServiceAccountCredentials)
gcs_creds = self.config.credentials.to_gcs_credentials()
gcs_creds["token"]["private_key_id"] = "921837921798379812"

jorritsandbrink (author):

This must be changed so that private_key_id is fetched from configuration.

rudolfix (reviewer):

Hmm, OK. When you authenticate in Python you do not need to do that... we can add this as an optional field. This also means that OAuth authentication will not work? I think that is fine.

By the way, can delta-rs find default Google credentials? You can check has_default_credentials() and then leave the token as None. That works for fsspec.
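
A hypothetical sketch of the fallback pattern described here. The `has_default_credentials` check is passed in as a flag rather than imported, since its exact location in dlt is not shown in this thread, and `to_gcs_credentials()` is taken from the snippet above:

```python
def build_gcs_token(credentials, default_credentials_available: bool) -> dict:
    """Hypothetical helper: prefer machine default credentials over an explicit token."""
    gcs_creds = credentials.to_gcs_credentials()
    if default_credentials_available:
        # leave the token as None so the underlying client resolves default credentials itself
        gcs_creds["token"] = None
    elif not gcs_creds["token"].get("private_key_id"):
        # otherwise private_key_id must come from configuration, not be hardcoded
        raise ValueError("private_key_id must be configured for service account authentication")
    return gcs_creds
```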

storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name
# https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/#enable-unsafe-writes-in-s3-opt-in
storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"

jorritsandbrink (author):

Setting AWS_S3_ALLOW_UNSAFE_RENAME to true is the simplest setup. Perhaps we can extend this later and let the user configure a locking provider.

Context: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/.

rudolfix (reviewer):

Interesting. We have a locking provider, but it is probably not compatible with Delta. It is called transactional_file.py.
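
If a user-configurable locking provider is added later, the delta-rs docs linked above describe a DynamoDB-based option. A rough sketch of how that choice might be expressed; the `lock_table` parameter is hypothetical, not an existing dlt setting, and the option names should be verified against the delta-rs version in use:

```python
from typing import Optional


def build_s3_write_options(credentials, lock_table: Optional[str] = None) -> dict:
    """Hypothetical helper: pick between unsafe renames and a DynamoDB locking provider."""
    storage_options = credentials.to_session_credentials()
    storage_options["AWS_REGION"] = credentials.region_name
    if lock_table:
        # option names as described in the delta-rs S3 writing documentation
        storage_options["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
        storage_options["DELTA_DYNAMO_TABLE_NAME"] = lock_table
    else:
        # simplest setup: opt in to unsafe renames (safe only with a single writer)
        storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
    return storage_options
```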

jorritsandbrink (author):

@rudolfix Can you review?

Delta tables are managed at the folder level, not the file level. Hence, they are treated differently from the csv, jsonl, and parquet file formats.

I'll add docs after we've settled on the user interface.

jorritsandbrink requested a review from rudolfix on May 19, 2024, 14:46

rudolfix left a review comment:

Ohhh, I was sure we could have Delta tables as a single file. Looking at your code, I think we should do something else:

make a destination out of that. It could be based on filesystem and use the same credentials. Of course, we support only append and replace.

  1. We do not use file format; we should use table_format and add delta to it (alongside pyiceberg).
  2. You can create different jobs within the filesystem destination depending on table_format; check how we do it in athena. I think it will be much simpler here. (A rough dispatch sketch follows after this list.)
  3. You can then separate the Delta code from the regular file code, and in the future we can add pyiceberg support easily.
  4. I'm not sure we can add merge support this way... however, we can always abuse the existing merge mechanism (when there is a merge write disposition, the delta job does nothing but requests a followup job, so at the end we process all table files at once).
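
A rough, non-authoritative sketch of point 2, dispatching different job classes by table_format inside a filesystem-style client. The class names are illustrative; only DeltaLoadFilesystemJob is mentioned later in this PR, and none of this mirrors dlt's actual internals:

```python
from typing import Any, Dict


class FilesystemLoadJobBase:
    """Illustrative placeholder for a filesystem load job."""


class StandardFilesystemJob(FilesystemLoadJobBase):
    """Would copy a single csv/jsonl/parquet file as-is."""


class DeltaLoadFilesystemJob(FilesystemLoadJobBase):
    """Would commit all parquet files of a table to a Delta table at once."""


def create_load_job(table: Dict[str, Any]) -> FilesystemLoadJobBase:
    # dispatch on the resource's table_format hint, analogous to how the
    # athena destination picks iceberg-specific jobs
    if table.get("table_format") == "delta":
        return DeltaLoadFilesystemJob()
    return StandardFilesystemJob()
```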

def _write_delta_table(
self, path: str, table: "pa.Table", write_disposition: TWriteDisposition # type: ignore[name-defined] # noqa
) -> None:
"""Writes in-memory Arrow table to on-disk Delta table."""

rudolfix (reviewer):

A few questions here:

  1. We can have many files for a given table. Are we able to write them all at once?
  2. Related to the above: is writing several tables at once, in parallel, supported? (It really should be :))
  3. Do we really need to load the parquet file into memory? I know that you clean it up, but we could implement parquet alignment differently, i.e. via another "flavour" of parquet that a given destination can request.

jorritsandbrink (author):

I introduced the concept of a "directory job", so 1 is possible now. 2 is also possible. Both are tested in test_pipeline_delta_filesystem_destination. 3 seems not possible, as discussed in chat (screenshot omitted).
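
For illustration, a minimal sketch of what the core of such a Delta writer could look like, assuming append maps to deltalake's "append" mode and replace to "overwrite". This mapping is an assumption based on the discussion, not the merged implementation:

```python
import pyarrow as pa
from deltalake import write_deltalake


def write_delta_table(path: str, table: pa.Table, write_disposition: str, storage_options: dict) -> None:
    """Sketch: write an in-memory Arrow table to a Delta table at `path`."""
    # assumed mapping of dlt write dispositions onto deltalake write modes
    mode = "overwrite" if write_disposition == "replace" else "append"
    write_deltalake(path, table, mode=mode, storage_options=storage_options)
```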



import pyarrow as pa
from deltalake import write_deltalake

def adjust_arrow_schema(

rudolfix (reviewer):

All of those look like utility functions that could be made available independently and also unit tested.

jorritsandbrink (author):

I moved them to utils.py and made them independent. Haven't added unit tests yet.
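
As an example of the kind of unit test that could cover these utilities, here is a self-contained sketch using a locally defined stand-in. The real function's name, signature, and exact behavior are not shown in this diff, so the Delta-compatibility rule below (downcasting nanosecond timestamps) is an assumption:

```python
import pyarrow as pa


def _ensure_delta_compatible_schema(schema: pa.Schema) -> pa.Schema:
    """Stand-in for the PR's schema-adjustment utility: Delta has no nanosecond
    timestamps, so downcast them to microsecond precision."""
    fields = [
        pa.field(f.name, pa.timestamp("us"))
        if pa.types.is_timestamp(f.type) and f.type.unit == "ns"
        else f
        for f in schema
    ]
    return pa.schema(fields)


def test_nanosecond_timestamps_are_downcast():
    schema = pa.schema([pa.field("ts", pa.timestamp("ns")), pa.field("id", pa.int64())])
    adjusted = _ensure_delta_compatible_schema(schema)
    assert adjusted.field("ts").type == pa.timestamp("us")
    assert adjusted.field("id").type == pa.int64()
```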

rudolfix mentioned this pull request on May 22, 2024.
@@ -214,6 +214,20 @@ def exception(self) -> str:
pass


class DirectoryLoadJob:

jorritsandbrink (author):

Very minimal for now. Want to get some feedback before further polishing.

@@ -177,6 +178,15 @@ def __str__(self) -> str:
return self.job_id()


class ParsedLoadJobDirectoryName(NamedTuple):

jorritsandbrink (author):

Also very minimal. Same as above.


def restore_file_load(self, file_path: str) -> LoadJob:
return EmptyLoadJob.from_file_path(file_path, "completed")

def start_dir_load(self, table: TTableSchema, dir_path: str, load_id: str) -> DirectoryLoadJob:

jorritsandbrink (author):

Perhaps iceberg will also be a directory job if we add it.

jorritsandbrink (author):

@rudolfix Can you review once more?

I addressed some of your feedback. Biggest changes since last review:

  1. Introduction of the "directory job" so multiple files can be loaded at once (an illustrative skeleton follows after this list). Thus far, the concept of a job was tightly coupled with a single file (if I understand correctly). But if there are multiple Parquet files, we want to be able to load them into the Delta table in a single commit. That is now possible. The code is still rough/minimal, but I want some feedback before polishing it up.
  2. Dedicated job for Delta table loads: DeltaLoadFilesystemJob.
  3. Turned Delta-related methods into static utility methods.
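
As referenced in point 1, an illustrative skeleton of the directory-job idea. The method and attribute names are assumptions, not the merged interface:

```python
from abc import ABC, abstractmethod
from typing import Sequence


class DirectoryLoadJob(ABC):
    """Sketch: one job owns every file of a table for a load, so all parquet
    files can land in the Delta table within a single commit."""

    def __init__(self, table_name: str, dir_path: str, load_id: str) -> None:
        self.table_name = table_name
        self.dir_path = dir_path
        self.load_id = load_id

    @abstractmethod
    def run(self, file_paths: Sequence[str]) -> None:
        """Load all files belonging to the table at once."""
```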

@@ -216,6 +222,244 @@ def some_source():
assert table.column("value").to_pylist() == [1, 2, 3, 4, 5]


@pytest.mark.essential

jorritsandbrink (author):

I only marked this test as essential. Perhaps the other tests also need the mark. Do we have a guideline for this?

rudolfix (reviewer):

  1. If we add a new feature, it makes sense to write 1-2 happy-path tests and mark them as essential.
  2. If tests finish quickly (like all the local pipeline tests here), they can also be marked as essential.

essential is used for fast smoke-testing of destinations in normal CI runs.

rudolfix previously approved these changes on Jun 5, 2024.

rudolfix left a review comment:

LGTM! Amazing work!

I added a few small fixes, mostly to handle failing pydantic tests. See the commits for details.

rudolfix merged commit 1c1ce7e into devel on Jun 6, 2024 (48 of 50 checks passed).
rudolfix deleted the 978-filesystem-delta-table branch on June 6, 2024, 08:27.