docs: on append, overwrite, delete and z-ordering #1897

Merged
merged 4 commits into from
Nov 22, 2023
Changes from 2 commits
78 changes: 78 additions & 0 deletions docs/usage/appending-overwriting-delta-lake-table.md
@@ -0,0 +1,78 @@
# Appending to and overwriting a Delta Lake table

This section explains how to append to an existing Delta table and how to overwrite a Delta table.

## Delta Lake append transactions

Suppose you have a Delta table with the following contents:

```
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
| 3 | c |
+-------+----------+
```

Append two additional rows of data to the table:

```python
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"num": [8, 9], "letter": ["dd", "ee"]})
write_deltalake("tmp/some-table", df, mode="append")
```

Here are the updated contents of the Delta table:

```
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
| 3 | c |
| 8 | dd |
| 9 | ee |
+-------+----------+
```

Now let's see how to perform an overwrite transaction.

## Delta Lake overwrite transactions

Here's how to overwrite the existing Delta table:

```python
df = pd.DataFrame({"num": [11, 22], "letter": ["aa", "bb"]})
write_deltalake("tmp/some-table", df, mode="overwrite")
```

Here are the contents of the Delta table after the overwrite operation:

```
+-------+----------+
| num | letter |
|-------+----------|
| 11 | aa |
| 22 | bb |
+-------+----------+
```

Overwriting performs only a logical delete: the previous data files are not physically removed from storage. Time travel back to the previous version to confirm that the old version of the table is still accessible.

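```python
from deltalake import DeltaTable

# Assuming the table was created at version 0, version 1 is the state after the append.
dt = DeltaTable("tmp/some-table", version=1)
dt.to_pandas()
```

```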
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
| 3 | c |
| 8 | dd |
| 9 | ee |
+-------+----------+
```
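
The difference between a logical and a physical delete can be sketched with a toy transaction log. This is a minimal illustration in plain Python, not how delta-rs is implemented: `ToyTable`, its methods, and the `part-N.parquet` file names are all hypothetical. Each commit records which files it adds and which it logically removes, and reading a version replays the log up to that commit.

```python
# Toy model of a Delta-style transaction log; every name here is hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # Each commit is a pair (files_added, files_removed); the list index is the version.
    commits: list = field(default_factory=list)
    # "Storage": file name -> rows. A commit never deletes anything from here.
    storage: dict = field(default_factory=dict)

    def write(self, rows, mode="append"):
        name = f"part-{len(self.storage)}.parquet"
        self.storage[name] = list(rows)
        # An overwrite logically removes every file that is currently live.
        removed = self.live_files() if mode == "overwrite" else []
        self.commits.append(([name], removed))

    def live_files(self, version=None):
        # Replay the log up to and including `version` (default: latest).
        last = len(self.commits) - 1 if version is None else version
        live = []
        for added, removed in self.commits[: last + 1]:
            live = [f for f in live if f not in removed] + added
        return live

    def read(self, version=None):
        return [row for f in self.live_files(version) for row in self.storage[f]]


table = ToyTable()
table.write([(1, "a"), (2, "b"), (3, "c")])               # version 0: create
table.write([(8, "dd"), (9, "ee")], mode="append")        # version 1: append
table.write([(11, "aa"), (22, "bb")], mode="overwrite")   # version 2: overwrite

print(table.read())           # [(11, 'aa'), (22, 'bb')]
print(table.read(version=1))  # the five rows that existed before the overwrite
print(len(table.storage))     # 3: no file was physically removed
```

In this model the overwrite only adds a commit that marks the older files as removed, which is why reading `version=1` still returns the earlier rows.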
26 changes: 26 additions & 0 deletions docs/usage/create-delta-lake-table.md
@@ -0,0 +1,26 @@
# Creating a Delta Lake Table

This section explains how to create a Delta Lake table.

You can easily write a DataFrame to a Delta table.

```python
import deltalake as dl
Collaborator: I think you forgot to remove this import as well

Contributor Author: Updated.

from deltalake.writer import write_deltalake
Collaborator: We expose write_deltalake also on the deltalake module directly. Also in other docs we import directly from there, maybe do it here as well for consistency

Contributor Author: I get this error: cannot import name 'DeltaLake' from 'deltalake'

You sure write_deltalake is exposed on deltalake?

Contributor Author: I had a typo. You're right. Updated.

import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]})
write_deltalake("tmp/some-table", df)
```

Here are the contents of the Delta table in storage:

```
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
| 3 | c |
+-------+----------+
```
34 changes: 34 additions & 0 deletions docs/usage/deleting-rows-from-delta-lake-table.md
@@ -0,0 +1,34 @@
# Deleting rows from a Delta Lake table

This section explains how to delete rows from a Delta Lake table.

Suppose you have the following Delta table with four rows:

```
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
+-------+----------+
```

Here's how to delete all the rows where the `num` is greater than 2:

```python
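from deltalake import DeltaTable

dt = DeltaTable("tmp/my-table")
# The predicate is a SQL-style expression; matching rows are deleted.
dt.delete("num > 2")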
```

Here are the contents of the Delta table after the delete operation has been performed:

```
+-------+----------+
| num | letter |
|-------+----------|
| 1 | a |
| 2 | b |
+-------+----------+
```
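
As with an overwrite, a delete can be thought of as a logical operation on files. The sketch below is a simplified copy-on-write model in plain Python, assuming no deletion vectors are in use. It is not the delta-rs implementation, and `delete_where` and the `part-N` file names are hypothetical. Files with no matching rows are left alone, and files with matching rows are marked as removed and rewritten with only the surviving rows.

```python
# Simplified copy-on-write delete; the helper and file names are hypothetical.
def delete_where(files, predicate):
    """Return (new_files, removed_names) after deleting rows matching `predicate`.

    `files` maps a file name to its list of row dicts and is not modified.
    """
    new_files, removed = {}, []
    for name, rows in files.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            new_files[name] = rows  # no matching rows: file is left untouched
            continue
        removed.append(name)  # file is logically removed from the new version
        if kept:
            new_files[name + ".rewritten"] = kept  # survivors go to a new file
    return new_files, removed


files = {
    "part-0": [{"num": 1, "letter": "a"}],
    "part-1": [{"num": 2, "letter": "b"}, {"num": 3, "letter": "c"}],
    "part-2": [{"num": 4, "letter": "d"}],
}
new_files, removed = delete_where(files, lambda row: row["num"] > 2)

print(removed)            # ['part-1', 'part-2']
print(sorted(new_files))  # ['part-0', 'part-1.rewritten']
```

The original `files` mapping is unchanged after the call, which mirrors the idea that earlier table versions stay readable until old files are physically cleaned up.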
16 changes: 16 additions & 0 deletions docs/usage/optimize/delta-lake-z-order.md
@@ -0,0 +1,16 @@
# Delta Lake Z Order

This section explains how to Z Order a Delta table.

Z Ordering colocates similar data in the same files, which allows for more better file skipping and faster queries.
Contributor: more or better, right? not both

Contributor Author: Fixed, thanks


Suppose you have a table with `first_name`, `age`, and `country` columns.

If you Z Order the data by the `country` column, then individuals from the same country will be stored in the same files. When you subsequently query the data for individuals from a given country, the query will execute faster because more files can be skipped.

Here's how to Z Order a Delta table:

```python
from deltalake import DeltaTable

dt = DeltaTable("tmp")
# Column names are passed as strings.
dt.optimize.z_order(["country"])
```
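
To see why colocating similar data helps, here is a toy illustration of file skipping in plain Python. It assumes a reader that keeps per-file min/max values for `country` and skips any file whose range cannot contain the requested value. The data, `chunk`, and `files_to_scan` are hypothetical, and sorting by a single column stands in for Z Ordering, which for one column behaves much like a sort.

```python
# Toy illustration of file skipping; the data and helper names are hypothetical.
def chunk(rows, size):
    """Split rows into 'files' of at most `size` rows each."""
    return [rows[i : i + size] for i in range(0, len(rows), size)]


def files_to_scan(files, country):
    """Count files whose min/max country range could contain `country`."""
    count = 0
    for rows in files:
        values = [row["country"] for row in rows]
        if min(values) <= country <= max(values):
            count += 1
    return count


countries = ["US", "BR", "IN", "US", "IN", "BR", "BR", "US", "IN"]
rows = [
    {"first_name": f"p{i}", "age": 20 + i, "country": c}
    for i, c in enumerate(countries)
]

unordered = chunk(rows, 3)
ordered = chunk(sorted(rows, key=lambda row: row["country"]), 3)

print(files_to_scan(unordered, "IN"))  # 3: every file might contain "IN"
print(files_to_scan(ordered, "IN"))    # 1: the other two files are skipped
```

With the rows ordered by `country`, each file covers a narrow range of values, so a query for one country only needs to open the files whose range includes it.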
17 changes: 11 additions & 6 deletions mkdocs.yml
@@ -19,12 +19,17 @@ nav:
   - Usage:
     - Installation: usage/installation.md
     - Overview: usage/index.md
-    - Loading a Delta Table: usage/loading-table.md
-    - Examining a Delta Table: usage/examining-table.md
-    - Querying a Delta Table: usage/querying-delta-tables.md
-    - Managing a Delta Table: usage/managing-tables.md
-    - Writing Delta Tables: usage/writing-delta-tables.md
-    - Small file compaction: usage/small-file-compaction-with-optimize.md
+    - Creating a table: usage/create-delta-lake-table.md
+    - Loading a table: usage/loading-table.md
+    - Append/overwrite tables: usage/appending-overwriting-delta-lake-table.md
+    - Examining a table: usage/examining-table.md
+    - Querying a table: usage/querying-delta-tables.md
+    - Managing a table: usage/managing-tables.md
+    - Writing a table: usage/writing-delta-tables.md
+    - Deleting rows from a table: usage/deleting-rows-from-delta-lake-table.md
+    - Optimize:
+      - Small file compaction: usage/optimize/small-file-compaction-with-optimize.md
+      - Z Order: usage/optimize/delta-lake-z-order.md
   - API Reference:
     - api/delta_table.md
     - api/schema.md