added initial parquet doc from LB #19

Merged · 2 commits · May 12, 2022
Empty file removed: language-tips/R/scalability.md

Deleted: language-tips/python/missingness.md (0 additions, 86 deletions)

Deleted: language-tips/python/scalability.md (0 additions, 130 deletions)

Deleted: language-tips/python/set-operations.md (0 additions, 19 deletions)

Added: tips/files/parquet.md (27 additions, 0 deletions)
@@ -0,0 +1,27 @@
Author: LB
Maintainer: BP

## Parquet Files
What is Parquet? According to Databricks.com, “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk”.

## What this means
Parquet files read, store, and write data by column rather than by row. For analytical work that touches only a subset of columns, moving data this way turns out to be far less expensive than scanning row by row.
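
To see the column layout directly, here is a minimal sketch (assuming pandas and pyarrow are installed; the file name `events.parquet` is just an example) that writes a small table and inspects how Parquet stores it:

```python
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({
    "event_id": [1, 2, 3],
    "event_type": ["party", "party", "meeting"],
})
df.to_parquet("events.parquet")

# Inside the file, each column is stored as its own chunk within a row group,
# which is what makes column-at-a-time reads possible.
meta = pq.ParquetFile("events.parquet").metadata
for i in range(meta.num_columns):
    print(meta.row_group(0).column(i).path_in_schema)
# event_id
# event_type
```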

## Benefits
According to Russell Jurney, a column-oriented format's ability to store each column of data together, and to load columns one at a time, leads to two performance optimizations (illustrated in the sketch after this list):
1. You only pay for the columns you load. This is called columnar storage.
Let m be the total number of columns in a file and n be the number of columns requested by the user. Loading n columns results in just n/m of the raw I/O volume.
2. The similarity of values within a single column results in more efficient compression. This is called columnar compression.
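
As a rough illustration of the first point, pandas can push a column list down to the Parquet reader so that only the requested column chunks are read from disk (the file and column names here are assumptions carried over from the sketch above):

```python
import pandas as pd

# Only the event_type column chunk is read from disk; event_id is never touched.
events = pd.read_parquet("events.parquet", columns=["event_type"])
print(events.head())
```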

Note the event_type column in both the row- and column-oriented formats in the diagram below. A compression algorithm has a much easier time compressing repeats of the value ‘party’ in this column when they make up the entire value for that row, as in the column-oriented format. By contrast, the row-oriented format requires the compression algorithm to figure out that repeats occur at some offset in the row, an offset that varies with the values in the preceding columns. This is a much more difficult task.

(Image not included: diagram comparing the same table in row-oriented and column-oriented storage formats)

The column-oriented storage format can load just the columns of interest. Within these columns, similar or repeated values such as ‘party’ within the ‘event_type’ column compress more efficiently.
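
One way to see columnar compression at work (a sketch, not a benchmark; file names are illustrative) is to write the same highly repetitive column as CSV and as Parquet and compare the resulting file sizes:

```python
import os
import pandas as pd

# One million repeats of the same value: the worst case for row-oriented
# text storage and a best case for columnar compression.
df = pd.DataFrame({"event_type": ["party"] * 1_000_000})
df.to_csv("events.csv", index=False)
df.to_parquet("events_repeated.parquet")  # snappy-compressed by default

print("CSV bytes:    ", os.path.getsize("events.csv"))
print("Parquet bytes:", os.path.getsize("events_repeated.parquet"))
```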
Columnar storage combines with columnar compression to produce dramatic performance improvements for most applications that do not require every column in the file. I have often used PySpark to load CSV or JSON data that took a long time to parse, converted it to Parquet, and found that working with it afterwards, whether in PySpark or even on a single computer in Pandas, was quick and painless.
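
For reference, a CSV-to-Parquet conversion like the one described above might look as follows in PySpark (paths and options are placeholders, not the author's exact workflow):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the slow-to-parse CSV once, then persist it in Parquet so later
# reads only pay for the columns they need.
(spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/events.csv")
    .write
    .mode("overwrite")
    .parquet("data/events.parquet"))
```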

More later...
- PyArrow
- fastparquet

done.