docs: add delta lake best practices #2147

Merged: 2 commits merged into delta-io:main from the docs-delta-best-practices branch on Feb 19, 2024

MrPowers
Contributor

Adds a docs page on the Delta Lake best practices.

This is a first pass and I think this should evolve over time.

This is some of the most important content for our users IMO.

Collaborator

@ion-elgreco ion-elgreco left a comment

Nice docs as always, just left two comments :)

It may also be worthwhile to mention that for data science use cases you would want to keep the logs indefinitely, or for as long as you need traceability. The moment you remove the logs, you lose the ability to time travel or check the change data feed, which could mean you cannot explain to business or legal why your model said Y in 2022, for example.


* It’s only suitable for low-cardinality columns.
* It can create many small files, especially if you use the wrong partition key or frequently update the Delta table.
* It can cause some queries that don’t rely on the partition key to run slower (because of the excessive number of small files).
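
To make the small-file concern in the bullets above concrete, here is a rough back-of-the-envelope sketch. It is not part of the PR or of delta-rs: the function name `estimate_partition_layout`, the 128 MiB target file size, and the assumption that every write adds one file to every partition are all hypothetical simplifications chosen for illustration.

```python
from math import ceil

MIB = 1024**2
GIB = 1024**3


def estimate_partition_layout(total_bytes, distinct_keys, writes,
                              target_file_bytes=128 * MIB):
    """Toy model of a partitioned table's file layout.

    Assumes (hypothetically) that each write touches every partition
    and adds exactly one file to each of them.
    Returns (file_count, average_file_bytes, ideal_file_count).
    """
    file_count = distinct_keys * writes
    average_file_bytes = total_bytes / file_count
    # Number of files you would have if every file hit the target size.
    ideal_file_count = max(1, ceil(total_bytes / target_file_bytes))
    return file_count, average_file_bytes, ideal_file_count


# A 10 GiB table written in 100 batches, partitioned two different ways.
low_cardinality = estimate_partition_layout(10 * GIB, distinct_keys=12, writes=100)
high_cardinality = estimate_partition_layout(10 * GIB, distinct_keys=50_000, writes=100)

for label, (files, avg_bytes, ideal) in [("12 keys", low_cardinality),
                                         ("50,000 keys", high_cardinality)]:
    print(f"{label}: {files} files, ~{avg_bytes / MIB:.3f} MiB each "
          f"(vs. {ideal} files at the target size)")
```

Under these assumptions the file count grows with the number of distinct partition values times the number of writes, which is why a high-cardinality key or frequent updates push the average file size far below any reasonable target.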
Collaborator

Maybe highlight that having a lot of small files is problematic for I/O throughput.

Contributor Author

Added this note


Create a good vacuum strategy for your tables to minimize your storage costs.

## Registering tables in a metastore/catalog
Collaborator

I think we need to make a huge note here, because none of the catalogs can be read from outside the vendor's ecosystem, so delta-rs cannot read any catalog.

Contributor Author

Yea, I am just going to take out this whole section for now. Hopefully we have a better catalog story soon.

Collaborator

It would be really nice if Databricks opened up Unity Catalog to be accessible from the outside; this would solve the disconnect between the delta-rs world and the Databricks Spark Delta world.

@MrPowers MrPowers requested a review from ion-elgreco February 7, 2024 11:16
@ion-elgreco ion-elgreco force-pushed the docs-delta-best-practices branch from eeb0d15 to a67ed9c Compare February 19, 2024 14:21
@ion-elgreco ion-elgreco enabled auto-merge (squash) February 19, 2024 14:22
@ion-elgreco ion-elgreco merged commit 0449db9 into delta-io:main Feb 19, 2024
23 checks passed
2 participants