docs: add delta lake best practices #2147
Nice docs as always, just left two comments :)
It may also be worthwhile to mention that for data science use cases you would want to keep the logs indefinitely, or for as long as you need traceability. The moment you remove the logs, you lose the ability to time travel or check the change data feed, which could mean you cannot explain to business or legal why your model said Y in 2022, for example.
docs/delta-lake-best-practices.md
Outdated
* It’s only suitable for low-cardinality columns.
* It can create many small files, especially if you use the wrong partition key or frequently update the Delta table.
* It can cause some queries that don’t rely on the partition key to run slower (because of the excessive number of small files)
Maybe highlight that a lot of small files is problematic for IO throughput
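To make the IO point concrete, here is a rough back-of-the-envelope sketch in Python. It is only a toy cost model, not a measurement of delta-rs or any storage system: the function name `estimated_scan_seconds`, the 50 ms per-file overhead, and the 100 MiB/s bandwidth are all assumptions picked for illustration.

```python
def estimated_scan_seconds(
    total_bytes: int,
    num_files: int,
    per_file_overhead_s: float = 0.05,            # assumed fixed cost to open/request one file
    bandwidth_bytes_per_s: float = 100 * 1024**2,  # assumed sustained read bandwidth (100 MiB/s)
) -> float:
    """Toy model of a sequential scan: fixed overhead per file plus transfer time."""
    transfer_s = total_bytes / bandwidth_bytes_per_s
    return num_files * per_file_overhead_s + transfer_s


total = 1024**3  # the same 1 GiB of data in both layouts

few_large = estimated_scan_seconds(total, num_files=8)        # 8 files of 128 MiB
many_small = estimated_scan_seconds(total, num_files=10_000)  # 10,000 files of ~105 KiB

print(f"8 files:      {few_large:.2f} s")
print(f"10,000 files: {many_small:.2f} s")
```

Under these assumed numbers the transfer time is identical in both layouts, so the gap comes entirely from the per-file overhead, which is the part that grows with the number of small files.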
Added this note
docs/delta-lake-best-practices.md
Outdated
Create a good vacuum strategy for your tables to minimize your storage costs.

## Registering tables in a metastore/catalog
I think we need to add a prominent note here, because none of the catalogs allow tables to be read outside of the vendor's ecosystem, so delta-rs cannot read any catalog.
Yea, I am just going to take out this whole section for now. Hopefully we have a better catalog story soon.
It would be really nice if Databricks opened up Unity Catalog to be accessible from the outside. This would bridge the disconnect between delta-rs and the Databricks Spark Delta world.
eeb0d15 to a67ed9c
Adds a docs page on Delta Lake best practices.
This is a first pass and I think this should evolve over time.
This is some of the most important content for our users IMO.