
Releases: gumdropsteve/turbo-telegram

SQL Based + Full 2015 Data Download & Processing

03 Feb 11:48
66756c2

Turbo-telegram v0.0.3-beta

This release brings 2 major updates for the price of 1!

Main PRs: #14, #17

BlazingSQL based NYC Dashboard

Rework of taxi_dashboard.ipynb to utilize SQL queries when producing all DataFrames.

  • BlazingSQL table of query results that's then focused accordingly
  • Apply cuDF's .to_pandas() for HoloViews plots

This eliminates post-query filtering of results, freeing up GPU memory & enabling use of much larger datasets.
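
To illustrate filtering inside the query rather than after it, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for BlazingSQL (which needs a GPU). The table name, columns, values, and thresholds are hypothetical and are not taken from taxi_dashboard.ipynb.

```python
import sqlite3

# Hypothetical miniature "taxi" table; sqlite3 stands in for BlazingSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi (passenger_count INTEGER, fare_amount REAL)")
conn.executemany(
    "INSERT INTO taxi VALUES (?, ?)",
    [(1, 7.5), (2, 12.0), (1, 30.0), (4, 9.0), (2, 55.5)],
)

# Post-query filtering: fetch everything, then discard rows in Python.
all_rows = conn.execute("SELECT passenger_count, fare_amount FROM taxi").fetchall()
post_filtered = [row for row in all_rows if row[0] <= 2 and row[1] < 50]

# Query-side filtering: only the rows that are needed are ever materialized.
sql_filtered = conn.execute(
    "SELECT passenger_count, fare_amount FROM taxi "
    "WHERE passenger_count <= ? AND fare_amount < ?",
    (2, 50),
).fetchall()

conn.close()
```

Both approaches select the same rows, but the second never holds the full result in memory, which is the effect this release describes for GPU memory.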

2015 Taxi Data Download & Processing

Users can now download & pre-process all 12 months of 2015 NYC yellow cab data. Total download size is ~20.07 GB before processing and ~18.94 GB (CSV) after processing [0].

NOTE: taxi_dashboard.ipynb does NOT yet point to this new data. This will be implemented soon, but issues such as optimizing big data integration for single-GPU users need to be addressed first.

New Files

  • download_data.ipynb

    • based off HoloViz taxi_preprocessing_example.py
    • downloads & processes all 12 months of 2015 NYC taxi data
    • uses BlazingSQL & NumPy [1] to configure data for use with Datashader / HoloViews
      • single node / processes 1 month at a time to ensure anyone w/ compatible GPU can run
      • tested w/ 16GB Tesla T4 GPU on AWS, runs end-to-end in 7-8 min [2]
      • GPU capacity test via final visualization under "Extra" (at end) queries January thru August (8/12 months) [3][4]
  • sql_check.py

    • based off RAPIDS sql_check.py
    • checks for installation of BlazingSQL & installs via Anaconda if not found
    • called in download_data.ipynb imports section if BSQL not found & user wants to install
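
The check-then-install flow described for sql_check.py can be sketched roughly as below. This is not the contents of sql_check.py: the function names are hypothetical, and because these notes do not give the actual install command, the caller passes it in as an argument list.

```python
import importlib.util
import subprocess


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name, install_cmd, ask=input):
    """Check for a module; offer to run `install_cmd` if it is missing.

    `install_cmd` is supplied by the caller (e.g. a conda command as an
    argument list); nothing is installed without a "y" answer.
    """
    if is_installed(module_name):
        return True
    answer = ask(f"{module_name} not found. Install now? [y/N] ")
    if answer.strip().lower() != "y":
        return False
    subprocess.run(install_cmd, check=True)
    importlib.invalidate_caches()  # make the fresh install visible to find_spec
    return is_installed(module_name)
```

Asking before installing mirrors the "if BSQL not found & user wants to install" behavior noted above; the `ask` parameter only exists so the prompt can be replaced in non-interactive runs.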

Footnotes

[0] 12 files, 18 columns (each) * 135,216,505 rows (total/combined)
[1] elimination of NumPy expected w/ resolution of BlazingDB/blazingsql#334 (UPDATE 4 Feb: BSQL-only version merged to master branch, 95c963c)
[2] last run: 4m 27s download; 3m 25s processing (largely from writing .to_csv()); 7m 52s total
[3] sticking to consecutive months starting with January, this was the largest table query to process w/o kernel crashing: ~12.6GB CSV, which is ~25GB on GPU, running on a single 16GB Tesla T4 GPU AWS EC2 instance
[4] Here's how that plot looked: [image: download (1)]

More data, less bugs [Taxi Dashboard Update]

13 Jan 06:42
047234a

NYC Taxi Dashboard Update

improvements

  • more data
    • added filtered & converted data for February & March of 2015
    • base table now created from Q1 2015 via wildcard (*) in file path
  • simplified code; clearer notes & docstrings
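
As a rough illustration of what a wildcard (*) in a file path buys, the sketch below uses Python's standard glob module to show which files a single pattern picks up. The file names are hypothetical, and how BlazingSQL itself resolves the wildcard is not shown here.

```python
import glob
import os
import tempfile

# Hypothetical file names; the real data files in the repo may be named differently.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("taxi_2015-01.csv", "taxi_2015-02.csv", "taxi_2015-03.csv", "notes.txt"):
        with open(os.path.join(data_dir, name), "w") as f:
            f.write("placeholder\n")

    # One pattern covers every monthly CSV, so adding a month needs no code change.
    pattern = os.path.join(data_dir, "taxi_2015-*.csv")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

The three monthly CSVs match and notes.txt does not, so a table built from such a pattern spans the whole quarter.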

issues addressed

  • resolved #8: moved riders & fare input checks out of common_filtering so they run on its input first; common_filtering is no longer engaged until the input is checked
  • resolved #9, simplified if/elif/else statements under map outputs
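
The validate-before-filter ordering from #8 might look something like the sketch below. Everything here is an assumption for illustration: `check_inputs` and `update_dashboard` are hypothetical names, and this `common_filtering` is a plain-Python stand-in that shares only its name with the notebook's function, whose real signature and logic are not given in these notes.

```python
def check_inputs(riders, fare):
    """Validate raw widget values; return an error message or None."""
    if not isinstance(riders, int) or riders < 1:
        return "riders must be a positive whole number"
    if not isinstance(fare, (int, float)) or fare < 0:
        return "fare must be a non-negative number"
    return None


def common_filtering(trips, riders, fare):
    """Stand-in filter: keep trips within the limits (inputs assumed valid)."""
    return [t for t in trips if t["riders"] <= riders and t["fare"] <= fare]


def update_dashboard(trips, riders, fare):
    """Check inputs first; only call common_filtering once they pass."""
    error = check_inputs(riders, fare)
    if error is not None:
        return {"error": error, "trips": []}
    return {"error": None, "trips": common_filtering(trips, riders, fare)}
```

With the checks hoisted to the caller, the filtering function can assume clean input and bad values never trigger a query.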

extra

  • removed seasonal Christmas NYC map

NYC Taxi Dashboard

26 Dec 20:26
6a110f3
  • Simple dashboard for exploring NYC Taxi dataset
  • Relies on:
    • BlazingSQL for data processing
    • ipywidgets for engagement
    • HoloViews for visualization
    • Jupyter Notebook for environment