docs: on append, overwrite, delete and z-ordering #1897
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
# Appending to and overwriting a Delta Lake table | ||
|
||
This section explains how to append to an existing Delta table and how to overwrite a Delta table. | ||
|
||
## Delta Lake append transactions | ||
|
||
Suppose you have a Delta table with the following contents: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
+-------+----------+ | ||
``` | ||
|
||
Append two additional rows of data to the table: | ||
|
||
```python | ||
import pandas as pd | ||
from deltalake import write_deltalake | ||
|
||
df = pd.DataFrame({"num": [8, 9], "letter": ["dd", "ee"]}) | ||
write_deltalake("tmp/some-table", df, mode="append") | ||
``` | ||
|
||
Here are the updated contents of the Delta table: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 8 | dd | | ||
| 9 | ee | | ||
+-------+----------+ | ||
``` | ||
|
||
Now let's see how to perform an overwrite transaction. | ||
|
||
## Delta Lake overwrite transactions | ||
|
||
The following code overwrites the existing Delta table with two new rows. | ||
|
||
```python | ||
df = pd.DataFrame({"num": [11, 22], "letter": ["aa", "bb"]}) | ||
write_deltalake("tmp/some-table", df, mode="overwrite") | ||
``` | ||
|
||
Here are the contents of the Delta table after the overwrite operation: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 11 | aa | | ||
| 22 | bb | | ||
+-------+----------+ | ||
``` | ||
|
||
An overwrite is only a logical delete: the transaction log marks the old files as removed, but the previous data is not physically deleted from storage. Time travel back to the previous version to confirm that the old version of the table is still accessible. | ||
|
||
``` | ||
from deltalake import DeltaTable | ||
dt = DeltaTable("tmp/some-table", version=1) | ||
|
||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 8 | dd | | ||
| 9 | ee | | ||
+-------+----------+ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Creating a Delta Lake Table | ||
|
||
This section explains how to create a Delta Lake table. | ||
|
||
You can easily write a DataFrame to a Delta table. | ||
|
||
```python | ||
from deltalake import write_deltalake | ||
# Review comment: We expose write_deltalake also on the deltalake module directly. Also in other docs we import directly from there, maybe do it here as well for consistency
# Reply: I get this error: You sure
# Reply: I had a typo. You're right. Updated. |
||
import pandas as pd | ||
|
||
df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]}) | ||
write_deltalake("tmp/some-table", df) | ||
``` | ||
|
||
Here are the contents of the Delta table in storage: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
+-------+----------+ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Deleting rows from a Delta Lake table | ||
|
||
This section explains how to delete rows from a Delta Lake table. | ||
|
||
Suppose you have the following Delta table with four rows: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 4 | d | | ||
+-------+----------+ | ||
``` | ||
|
||
Here's how to delete all the rows where `num` is greater than 2: | ||
|
||
```python | ||
from deltalake import DeltaTable | ||
dt = DeltaTable("tmp/my-table") | ||
dt.delete("num > 2") | ||
``` | ||
|
||
Here are the contents of the Delta table after the delete operation: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
+-------+----------+ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Delta Lake Z Order | ||
|
||
This section explains how to Z Order a Delta table. | ||
|
||
Z Ordering colocates similar data in the same files, which allows for better file skipping and faster queries. |
Fixed, thanks |
||
|
||
Suppose you have a table with `first_name`, `age`, and `country` columns. | ||
|
||
If you Z Order the data by the `country` column, then individuals from the same country will be stored in the same files. When you subsequently query the data for individuals from a given country, the query will execute faster because more files can be skipped. | ||
|
||
Here's how to Z Order a Delta table: | ||
|
||
```python | ||
from deltalake import DeltaTable | ||
dt = DeltaTable("tmp") | ||
dt.optimize.z_order(["country"]) | ||
``` |
I think you forgot to remove this import as well
Updated.