Skip to content

Commit fa51d5f

Browse files
MrPowersion-elgreco
authored andcommitted
docs: add pandas integration
1 parent 59bb321 commit fa51d5f

File tree

3 files changed

+281
-0
lines changed

3 files changed

+281
-0
lines changed
+279
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,279 @@
1+
# Using Delta Lake with pandas
2+
3+
Delta Lake is a great storage system for pandas analyses. This page shows how it's easy to use Delta Lake with pandas, the unique features Delta Lake offers pandas users, and how Delta Lake can make your pandas analyses run faster.
4+
5+
Delta Lake is very easy to install for pandas analyses, just run `pip install deltalake`.
6+
7+
Delta Lake allows for performance optimizations, so pandas queries can run much faster than the query run on data stored in CSV or Parquet. See the following chart for the query runtime for the a Delta tables compared with CSV/Parquet.
8+
9+
![](pandas-query-csv-parquet-delta.png)
10+
11+
Z Ordered Delta tables run this query much faster than when the data is stored in Parquet or CSV. Let's dive in deeper and see how Delta Lake makes pandas faster.
12+
13+
## Delta Lake makes pandas queries run faster
14+
15+
There are a few reasons Delta Lake can make pandas queries run faster:
16+
17+
1. column pruning: only grabbing the columns relevant for a query
18+
2. file skipping: only reading files with data for the query
19+
3. row group skipping: only reading row groups with data for the query
20+
4. Z ordering data: colocating similar data in the same files, so file skipping is more effective
21+
22+
Reading less data (fewer columns and/or fewer rows) is how Delta Lake makes pandas queries run faster.
23+
24+
Parquet allows for column pruning and row group skipping, but doesn't support file-level skipping or Z Ordering. CSV doesn't support any of these performance optimizations.
25+
26+
Let's take a look at a sample dataset and run a query to see the performance enhancements offered by Delta Lake.
27+
28+
Suppose you have a 1 billion row dataset with 9 columns, here are the first three rows of the dataset:
29+
30+
```
31+
+-------+-------+--------------+-------+-------+--------+------+------+---------+
32+
| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 |
33+
|-------+-------+--------------+-------+-------+--------+------+------+---------|
34+
| id016 | id046 | id0000109363 | 88 | 13 | 146094 | 4 | 6 | 18.8377 |
35+
| id039 | id087 | id0000466766 | 14 | 30 | 111330 | 4 | 14 | 46.7973 |
36+
| id047 | id098 | id0000307804 | 85 | 23 | 187639 | 3 | 5 | 47.5773 |
37+
+-------+-------+--------------+-------+-------+--------+------+------+---------+
38+
```
39+
40+
The dataset is roughly 50 GB when stored as an uncompressed CSV files. Let's run some queries on a 2021 Macbook M1 with 64 GB of RAM.
41+
42+
Start by running the query on an uncompressed CSV file:
43+
44+
```python
45+
(
46+
pd.read_csv(f"{Path.home()}/data/G1_1e9_1e2_0_0.csv", usecols=["id1", "id2", "v1"])
47+
.query("id1 == 'id016'")
48+
.groupby("id2")
49+
.agg({"v1": "sum"})
50+
)
51+
```
52+
53+
This query takes 234 seconds to execute. It runs out of memory if the `usecols` parameter is not set.
54+
55+
Now let's convert the CSV dataset to Parquet and run the same query on the data stored in a Parquet file.
56+
57+
```python
58+
(
59+
pd.read_parquet(
60+
f"{Path.home()}/data/G1_1e9_1e2_0_0.parquet", columns=["id1", "id2", "v1"]
61+
)
62+
.query("id1 == 'id016'")
63+
.groupby("id2")
64+
.agg({"v1": "sum"})
65+
)
66+
```
67+
68+
This query takes 118 seconds to execute.
69+
70+
Parquet stores data in row groups and allows for skipping when the `filters` predicates are set. Run the Parquet query again with row group skipping enabled:
71+
72+
```python
73+
(
74+
pd.read_parquet(
75+
f"{Path.home()}/data/G1_1e9_1e2_0_0.parquet",
76+
columns=["id1", "id2", "v1"],
77+
filters=[("id1", "==", "id016")],
78+
)
79+
.query("id1 == 'id016'")
80+
.groupby("id2")
81+
.agg({"v1": "sum"})
82+
)
83+
```
84+
85+
This query runs in 19 seconds. Lots of row groups can be skipped for this particular query.
86+
87+
Now let's run the same query on a Delta table to see the out-of-the box performance:
88+
89+
```python
90+
(
91+
DeltaTable(f"{Path.home()}/data/deltalake_baseline_G1_1e9_1e2_0_0", version=0)
92+
.to_pandas(filters=[("id1", "==", "id016")], columns=["id1", "id2", "v1"])
93+
.query("id1 == 'id016'")
94+
.groupby("id2")
95+
.agg({"v1": "sum"})
96+
)
97+
```
98+
99+
This query runs in 8 seconds, which is a significant performance enhancement.
100+
101+
Now let's Z Order the Delta table by `id1` which will make the data skipping even better. Run the query again on the Z Ordered Delta table:
102+
103+
```python
104+
(
105+
DeltaTable(f"{Path.home()}/data/deltalake_baseline_G1_1e9_1e2_0_0", version=1)
106+
.to_pandas(filters=[("id1", "==", "id016")], columns=["id1", "id2", "v1"])
107+
.query("id1 == 'id016'")
108+
.groupby("id2")
109+
.agg({"v1": "sum"})
110+
)
111+
```
112+
113+
The query now executes in 2.4 seconds.
114+
115+
Delta tables can make certain pandas queries run much faster.
116+
117+
## Delta Lake lets pandas users time travel
118+
119+
Start by creating a Delta table:
120+
121+
```python
122+
from deltalake import write_deltalake, DeltaTable
123+
124+
df = pd.DataFrame({"num": [1, 2, 3], "letter": ["a", "b", "c"]})
125+
write_deltalake("tmp/some-table", df)
126+
```
127+
128+
Here are the contents of the Delta table (version 0 of the Delta table):
129+
130+
```
131+
+-------+----------+
132+
| num | letter |
133+
|-------+----------|
134+
| 1 | a |
135+
| 2 | b |
136+
| 3 | c |
137+
+-------+----------+
138+
```
139+
140+
Now append two rows to the Delta table:
141+
142+
```python
143+
df = pd.DataFrame({"num": [8, 9], "letter": ["dd", "ee"]})
144+
write_deltalake("tmp/some-table", df, mode="append")
145+
```
146+
147+
Here are the contents after the append operation (version 1 of the Delta table):
148+
149+
```
150+
+-------+----------+
151+
| num | letter |
152+
|-------+----------|
153+
| 1 | a |
154+
| 2 | b |
155+
| 3 | c |
156+
| 8 | dd |
157+
| 9 | ee |
158+
+-------+----------+
159+
```
160+
161+
Now perform an overwrite transaction:
162+
163+
```python
164+
df = pd.DataFrame({"num": [11, 22], "letter": ["aa", "bb"]})
165+
write_deltalake("tmp/some-table", df, mode="overwrite")
166+
```
167+
168+
Here are the contents after the overwrite operation (version 2 of the Delta table):
169+
170+
```
171+
+-------+----------+
172+
| num | letter |
173+
|-------+----------|
174+
| 8 | dd |
175+
| 9 | ee |
176+
+-------+----------+
177+
```
178+
179+
Read in the Delta table and it will grab the latest version by default:
180+
181+
```
182+
DeltaTable("tmp/some-table").to_pandas()
183+
184+
+-------+----------+
185+
| num | letter |
186+
|-------+----------|
187+
| 11 | aa |
188+
| 22 | bb |
189+
+-------+----------+
190+
```
191+
192+
You can easily time travel back to version 0 of the Delta table:
193+
194+
```
195+
DeltaTable("tmp/some-table", version=0).to_pandas()
196+
197+
+-------+----------+
198+
| num | letter |
199+
|-------+----------|
200+
| 1 | a |
201+
| 2 | b |
202+
| 3 | c |
203+
+-------+----------+
204+
```
205+
206+
You can also time travel to version 1 of the Delta table:
207+
208+
```
209+
DeltaTable("tmp/some-table", version=1).to_pandas()
210+
211+
+-------+----------+
212+
| num | letter |
213+
|-------+----------|
214+
| 1 | a |
215+
| 2 | b |
216+
| 3 | c |
217+
| 8 | dd |
218+
| 9 | ee |
219+
+-------+----------+
220+
```
221+
222+
Time travel is a powerful feature that pandas users cannot access with CSV or Parquet.
223+
224+
## Schema enforcement
225+
226+
Delta tables only allow you to append DataFrame with matching schema by default. Suppose you have a DataFrame with `num` and `animal` columns, which is different from the Delta table that has columns with `num` and `letter` columns.
227+
228+
Try to append this DataFrame with a mismatched schema to the existing table:
229+
230+
```
231+
df = pd.DataFrame({"num": [5, 6], "animal": ["cat", "dog"]})
232+
write_deltalake("tmp/some-table", df)
233+
```
234+
235+
This transaction will be rejected and will return the following error message:
236+
237+
```
238+
ValueError: Schema of data does not match table schema
239+
Data schema:
240+
num: int64
241+
animal: string
242+
-- schema metadata --
243+
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 474
244+
Table Schema:
245+
num: int64
246+
letter: string
247+
```
248+
249+
Schema enforcement protects your table from getting corrupted by appending data with mismatched schema. Parquet and CSV don't offer schema enforcement for pandas users.
250+
251+
## In-memory vs. in-storage data changes
252+
253+
It's important to distinguish between data stored in-memory and data stored on disk when understanding the functionality offered by Delta Lake.
254+
255+
pandas loads data from storage (CSV, Parquet, or Delta Lake) into in-memory DataFrames.
256+
257+
pandas makes it easy to modify the data in memory, say update a column value. It's not easy to update a column value in storage systems like CSV or Parquet using pandas.
258+
259+
Delta Lake makes it easy for pandas users to update data in storage.
260+
261+
## Why Delta Lake allows for faster queries
262+
263+
Delta tables store data in many files and metadata about the files in the transaction log. Delta Lake allows for certain queries to skip entire files, which makes pandas queries run much faster.
264+
265+
## More resources
266+
267+
See this talk on why Delta Lake is the best file format for pandas analyses to learn more:
268+
269+
<iframe width="560" height="315" src="https://www.youtube.com/embed/A8bvJlG6phk?si=xHVZB5LhaWfTBU0r" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
270+
271+
## Conclusion
272+
273+
Delta Lake provides many features that make it an excellent format for pandas analyses:
274+
275+
* performance optimizations make pandas queries run faster
276+
* data management features make pandas analyses more reliable
277+
* advanced features allow you to perform more complex pandas analyses
278+
279+
Python deltalake offers pandas users a better experience compared with CSV/Parquet.
Loading

mkdocs.yml

+2
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,8 @@ nav:
3434
- api/delta_table.md
3535
- api/schema.md
3636
- api/storage.md
37+
- Integrations:
38+
- pandas: integrations/delta-lake-pandas.md
3739
not_in_nav: |
3840
/_build/
3941

0 commit comments

Comments
 (0)