streamlit_app_duck.py
: Inspired by / copied from the Streamlit demo repo. Analyzes a month of NYC Uber pickup location data. The original is from the Streamlit demo gallery.
- A Streamlit demo converted to use DuckDB, so the analysis runs faster and on more data than with raw pandas.
01_duck_streamlit.py
: Inspired by / copied from the DuckDB and Arrow blog post. Analyzes 10 years (1.5 billion rows / 40 GB) of NYC Taxi pickup location data and demos some other filter optimizations over pandas.
- A blog post on the power of DuckDB + Arrow converted to an interactive demo in Streamlit.
Read more in the accompanying blog post ✍🏻
Check out the speed up on loading data. From left to right:
5.087 s
: streamlit example (100,000 rows)
54.306 s
: streamlit example (Full Dataset using pd.read_csv)
1.178 s
: this example (Full Dataset using pyarrow + duckdb)
(Note: Profiled with pyinstrument, see more on the caveats / how it works in this post)
I wrote the load_data function above to match what the original code does, which is load the data into a DataFrame, not just load the schema.
After it's loaded, pandas and numpy are used for some additional filtering and computation.
The real point of duckdb is to do your filtering and computation before loading all of your data into memory.
For parity with the streamlit demo, load_data originally would be:
data = duckdb.arrow(data)
return data.arrow().to_pandas()
Just returning the duckdb instance will drop the time load_data takes to ~0.1 s!
Then you have an in-memory analysis object ready to go.
data = duckdb.arrow(data)
return data
git clone git@github.com:gerardrbentley/uber-nyc-pickups-duckdb.git duckdb-streamlit
cd duckdb-streamlit
python -m venv venv
. ./venv/bin/activate
python -m pip install -r requirements.txt
streamlit run streamlit_app_duck.py
NOTE: The following will download 40 GB of data to your machine. Not available on streamlit cloud due to storage limitations.
Going deeper into the DuckDB / Arrow power, we can filter and analyze even larger datasets.
We can select 304,851 interesting rows from all 1,547,741,381 in the 10 year dataset in < 3 seconds on a laptop!
The following will download necessary files and then run the app
# Setup
python -m pip install boto3
# Download datasets
wget https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
python 00_download_nyc_data.py
# Run the demo
streamlit run 01_duck_streamlit.py