docs: on append, overwrite, delete and z-ordering #1897
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Appending to and overwriting a Delta Lake table | ||
|
||
This section explains how to append to an existing Delta table and how to overwrite a Delta table. | ||
|
||
## Delta Lake append transactions | ||
|
||
Suppose you have a Delta table with the following contents: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
+-------+----------+ | ||
``` | ||
|
||
Append two additional rows of data to the table: | ||
|
||
```python | ||
df = pd.DataFrame({"num": [8, 9], "letter": ["dd", "ee"]}) | ||
dl.writer.write_deltalake("tmp/some-table", df, mode="append") | ||
``` | ||
|
||
Here are the updated contents of the Delta table: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 8 | dd | | ||
| 9 | ee | | ||
+-------+----------+ | ||
``` | ||
|
||
Now let's see how to perform an overwrite transaction. | ||
|
||
## Delta Lake overwrite transactions | ||
|
||
Here's how to overwrite the existing Delta table: | ||
|
||
```python | ||
df = pd.DataFrame({"num": [11, 22], "letter": ["aa", "bb"]}) | ||
dl.writer.write_deltalake("tmp/some-table", df, mode="overwrite") | ||
``` | ||
|
||
Here are the contents of the Delta table after the overwrite operation: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 11 | aa | | ||
| 22 | bb | | ||
+-------+----------+ | ||
``` | ||
|
||
Overwriting performs only a logical delete: it doesn't physically remove the previous data from storage. Time travel back to the previous version to confirm that the old version of the table is still accessible. | ||
|
||
``` | ||
dt = dl.DeltaTable("tmp/some-table", version=1) | ||
|
||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 8 | dd | | ||
| 9 | ee | | ||
+-------+----------+ | ||
``` |
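
To make the "logical delete" idea concrete, here is a minimal sketch of a toy transaction log in plain Python. It is not how the `deltalake` library is implemented: `ToyDeltaLog`, its `write` and `read` methods, and the `part-N.parquet` file names are hypothetical names made up for illustration. The sketch assumes that each commit records which files were added and which were logically removed, and that reading a version replays the commits up to that version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaLog:
    """Toy stand-in for a transaction log; NOT the deltalake implementation."""

    # One entry per committed version: (files added, files logically removed).
    commits: list = field(default_factory=list)
    # Hypothetical in-memory "storage": file name -> rows.
    storage: dict = field(default_factory=dict)

    def _active_files(self, version):
        # Replay commits 0..version to find the files visible at that version.
        active = []
        for added, removed in self.commits[: version + 1]:
            active = [f for f in active if f not in removed] + added
        return active

    def write(self, rows, mode="append"):
        # Every write stores a new file; nothing is ever deleted from storage.
        name = f"part-{len(self.storage)}.parquet"
        self.storage[name] = list(rows)
        latest = len(self.commits) - 1
        # An overwrite only marks the currently active files as removed.
        removed = self._active_files(latest) if mode == "overwrite" else []
        self.commits.append(([name], removed))

    def read(self, version=None):
        if version is None:
            version = len(self.commits) - 1
        return [row for f in self._active_files(version) for row in self.storage[f]]


log = ToyDeltaLog()
log.write([(1, "a"), (2, "b"), (3, "c")])              # version 0: create
log.write([(8, "dd"), (9, "ee")], mode="append")       # version 1: append
log.write([(11, "aa"), (22, "bb")], mode="overwrite")  # version 2: overwrite

latest_rows = log.read()              # only the overwritten contents
version_1_rows = log.read(version=1)  # the five rows from before the overwrite
```

In this toy model the overwrite only records removals in the log, so the files backing version 1 stay in storage. That is why reading an earlier version still returns the old rows.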
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
# Creating a Delta Lake Table | ||
|
||
This section explains how to create a Delta Lake table. | ||
|
||
You can easily write a DataFrame to a Delta table. | ||
|
||
```python | ||
import deltalake as dl | ||
# Review comment: I think you forgot to remove this import as well
# Reply: Updated.
|
||
import pandas as pd | ||
|
||
df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]}) | ||
dl.writer.write_deltalake("tmp/some-table", df) | ||
``` | ||
|
||
Here are the contents of the Delta table in storage: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
+-------+----------+ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Deleting rows from a Delta Lake table | ||
|
||
This section explains how to delete rows from a Delta Lake table. | ||
|
||
Suppose you have the following Delta table with four rows: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
| 3 | c | | ||
| 4 | d | | ||
+-------+----------+ | ||
``` | ||
|
||
Here's how to delete all the rows where the `num` is greater than 2: | ||
|
||
```python | ||
dt = dl.DeltaTable("tmp/my-table") | ||
dt.delete("num > 2") | ||
``` | ||
|
||
Here are the contents of the Delta table after the delete operation has been performed: | ||
|
||
``` | ||
+-------+----------+ | ||
| num | letter | | ||
|-------+----------| | ||
| 1 | a | | ||
| 2 | b | | ||
+-------+----------+ | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Delta Lake Z Order | ||
|
||
This section explains how to Z Order a Delta table. | ||
|
||
Z Ordering colocates similar data in the same files, which allows for better file skipping and faster queries. | ||
Reply: Fixed, thanks
|
||
|
||
Suppose you have a table with `first_name`, `age`, and `country` columns. | ||
|
||
If you Z Order the data by the `country` column, then individuals from the same country will be stored in the same files. When you subsequently query the data for individuals from a given country, the query will execute faster because more data can be skipped. | ||
|
||
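
As a rough illustration of why colocating data helps, the sketch below models file skipping with per-file minimum and maximum values for `country`. It is a simplified model, not the `deltalake` query path: the `files_to_scan` helper, the file names, and the statistics are all hypothetical, and it assumes a reader can skip any file whose min/max range cannot contain the requested value.

```python
def files_to_scan(file_stats, country):
    """Return the files whose [min, max] country range could contain `country`."""
    return [name for name, lo, hi in file_stats if lo <= country <= hi]


# Hypothetical statistics before Z Ordering: every file spans many countries.
unordered = [
    ("part-0", "Argentina", "Vietnam"),
    ("part-1", "Brazil", "Uruguay"),
    ("part-2", "Canada", "Zambia"),
]

# Hypothetical statistics after Z Ordering by country: ranges barely overlap.
z_ordered = [
    ("part-0", "Argentina", "Canada"),
    ("part-1", "China", "Portugal"),
    ("part-2", "Qatar", "Zambia"),
]

unordered_scan = files_to_scan(unordered, "Portugal")
z_ordered_scan = files_to_scan(z_ordered, "Portugal")
```

With the unordered layout every file has to be read, while the Z Ordered layout narrows the scan to a single file.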
Here's how to Z Order a Delta table: | ||
|
||
```python | ||
from deltalake import DeltaTable | ||
dt = DeltaTable("tmp") | ||
# Column names are passed as strings | ||
dt.optimize.z_order(["country"]) | ||
``` |
Review comment: Should we showcase examples where you import it as `dl`? I think we generally import directly, right?

Reply: @ion-elgreco - I updated this. Also going to create an issue to discuss this in more detail. Thank you!