Open in Streamlit

Streamlit + DuckDB Demo: Uber / Taxi Pickups in New York City

  • streamlit_app_duck.py: Inspired by / copied from the Streamlit demo repo
    • Analyzes a month of NYC Uber pickup location data. The original is from the Streamlit demo gallery.
    • A Streamlit demo converted to use DuckDB, so the analysis runs faster and on more data than with raw pandas.
  • 01_duck_streamlit.py: Inspired by / copied from the DuckDB and Arrow blog post
    • Analyzes 10 years (1.5 billion rows / 40 GB) of NYC taxi pickup location data and demos some other filter optimizations over pandas.
    • A blog post on the power of DuckDB + Arrow converted to an interactive demo in Streamlit.

Read more in the accompanying blog post ✍🏻

One Month Uber Dataset

Check out the speedup when loading data. From left to right:

  • 5.087 s: streamlit example (100,000 rows)
  • 54.306 s: streamlit example (Full Dataset using pd.read_csv)
  • 1.178 s: this example (Full Dataset using pyarrow + duckdb)

[Image: load data speedup comparison]

(Note: profiled with pyinstrument; see this post for the caveats and how it works.)

Analysis

I wrote the load_data function above to match what the original code does, which is to load the data into a DataFrame, not just load the schema. Once it is loaded, pandas and numpy are used for some additional filtering and computation.

The real point of duckdb is to do your filtering and computation before loading all of your data into memory.

For parity with the Streamlit demo, load_data originally ended with:

    data = duckdb.arrow(data)
    return data.arrow().to_pandas()

Just returning the DuckDB relation instead drops the time load_data takes to ~0.1 s! You then have an in-memory analysis object ready to go.

    data = duckdb.arrow(data)
    return data

Run this demo locally

git clone git@github.com:gerardrbentley/uber-nyc-pickups-duckdb.git duckdb-streamlit
cd duckdb-streamlit
python -m venv venv
. ./venv/bin/activate
python -m pip install -r requirements.txt

streamlit run streamlit_app_duck.py

10 Years of data

NOTE: The following will download 40 GB of data to your machine. Not available on streamlit cloud due to storage limitations.

Going deeper into the DuckDB / Arrow power, we can filter and analyze even larger datasets.

We can select 304,851 interesting rows from all 1,547,741,381 in the 10 year dataset in < 3 seconds on a laptop!

The following will download the necessary files and then run the app:

# Setup
python -m pip install boto3
# Download datasets
wget https://github.com/cwida/duckdb-data/releases/download/v1.0/lineitemsf1.snappy.parquet
python 00_download_nyc_data.py
# Run the demo
streamlit run 01_duck_streamlit.py