Welcome to the extended notebooks repo!
The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer, show why, how, and when including RAPIDS in a data science pipeline makes sense, and gather community contributions of RAPIDS knowledge.
Many of these notebooks use additional PyData ecosystem packages and include code for downloading datasets, so they require network connectivity. If you are running on a system with no network access, please use the core notebooks repo.
Please refer to BUILD.md for the prerequisite packages and installation steps.
Please see our guide for contributing to notebooks-extended.
- getting_started_notebooks - "how to start using RAPIDS". Contains notebooks showing "hello world" examples, getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
- intermediate - "how to accomplish your workflows with RAPIDS". Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
- advanced - "how to master RAPIDS". Contains notebooks showing kernel customization and advanced end-to-end workflows.
- colab_notebooks - contains Colab versions of popular notebooks to quickly try out in browser.
- blog notebooks - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities.
- conference notebooks - contains notebooks used in conferences, such as GTC.
- competition notebooks - contains notebooks used in competitions, such as Kaggle.
The /data folder contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites; a sketch of that download pattern follows. The /data folder is also symlinked into /rapids/notebooks/extended/data so you can browse it from JupyterLab's UI.
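As a rough illustration, such a download cell typically looks like the sketch below; the URL and filename are placeholders, not an actual dataset used by any notebook in this repo.

```python
# Hypothetical download cell: the URL and filename are placeholders only.
import os
import urllib.request

DATA_DIR = "data"  # small samples live here; symlinked to /rapids/notebooks/extended/data
url = "https://example.com/larger_dataset.csv"      # placeholder URL
path = os.path.join(DATA_DIR, "larger_dataset.csv")

if not os.path.isfile(path):                        # skip the download if the file exists
    os.makedirs(DATA_DIR, exist_ok=True)
    urllib.request.urlretrieve(url, path)
print("dataset available at", path)
```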
Please view our Industry Topics README to see which notebooks align with which industries (coming soon!)
Folder | Notebook Title | Description |
---|---|---|
basics | Dask_Hello_World | This notebook shows how to quickly set up Dask and run a "Hello World" example. |
basics | Getting_Started_with_cuDF | This notebook shows how to get started with GPU DataFrames using cuDF in RAPIDS (a minimal sketch follows this table). |
basics | hello_streamz | This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API. |
basics | streamz_weblogs | This notebook provides an example of how to do streaming web-log processing with RAPIDS, Dask, and Streamz. |
intro_tutorials | 01_Introduction_to_RAPIDS | This notebook shows, at a high level, what each of the packages in RAPIDS is and what it does. |
intro_tutorials | 02_Introduction_to_cuDF | This notebook shows how to work with cuDF DataFrames in RAPIDS. |
intro_tutorials | 03_Introduction_to_Dask | This notebook shows how to work with Dask using basic Python primitives like integers and strings. |
intro_tutorials | 04_Introduction_to_Dask_using_cuDF_DataFrames | This notebook shows how to work with cuDF DataFrames using Dask. |
intro_tutorials | 05_Introduction_to_Dask_cuDF | This notebook shows how to work with cuDF DataFrames distributed across multiple GPUs using Dask. |
intro_tutorials | 06_Introduction_to_Supervised_Learning | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. |
intro_tutorials | 07_Introduction_to_XGBoost | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. |
intro_tutorials | 08_Introduction_to_Dask_XGBoost | This notebook shows how to work with Dask XGBoost in RAPIDS. |
intro_tutorials | 09_Introduction_to_Dimensionality_Reduction | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. |
intro_tutorials | 10_Introduction_to_Clustering | This notebook shows how to do GPU accelerated Clustering in RAPIDS. |
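To give a feel for the starting point, the kind of "hello world" covered by Dask_Hello_World and Getting_Started_with_cuDF comes down to a few lines. The sketch below assumes a working RAPIDS environment with a visible GPU; it is an illustration, not the notebooks' exact code.

```python
# Minimal cuDF and Dask "hello world"; assumes cudf and dask.distributed are installed.
import cudf
from dask.distributed import Client

# a GPU DataFrame built directly from host data
gdf = cudf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
print(gdf.groupby("key").sum().to_pandas())    # groupby/aggregate runs on the GPU

# a tiny Dask "hello world": run a function across a local cluster
client = Client()                              # spins up a local Dask cluster
futures = client.map(lambda x: x + 1, range(4))
print(client.gather(futures))                  # -> [1, 2, 3, 4]
```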
Folder | Notebook Title | Description |
---|---|---|
examples | DBSCAN_Demo_FULL | This notebook shows how to use the DBSCAN algorithm and its GPU accelerated implementation present in RAPIDS. |
examples | Dask_with_cuDF_and_XGBoost | In this notebook we show how to quickly set up Dask and train an XGBoost model using cuDF. |
examples | Dask_with_cuDF_and_XGBoost_Disk | In this notebook we show how to quickly set up Dask and train an XGBoost model using cuDF, reading the data from disk with cuIO. |
examples | One_Hot_Encoding | In this notebook we show how to use Dask and cuDF to one-hot encode data and train an XGBoost model. |
examples | PCA_Demo_Full | In this notebook we will show how to use PCA and its GPU accelerated implementation present in RAPIDS (a sketch of this CPU-vs-GPU pattern follows this table). |
examples | linear_regression_demo.ipynb | In this notebook we will show how to use linear regression and its GPU accelerated implementation present in RAPIDS. |
examples | ridge_regression_demo | In this notebook we will show how to use ridge regression and its GPU accelerated implementation present in RAPIDS. |
examples | umap_demo | In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS. |
examples | rf_demo | Demonstration of using both cuML and scikit-learn to train a RandomForestClassifier on the Higgs dataset. |
E2E-> mortgage | mortgage_e2e | This is an end-to-end notebook consisting of ETL, data conversion, and machine learning training operations performed on the mortgage dataset. |
E2E-> mortgage | mortgage_e2e_deep_learning | This notebook combines the RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency. |
E2E-> taxi | NYCTaxi | Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost |
E2E-> synthetic_3D | rapids_ml_workflow_demo | A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU]. |
E2E-> census | census_education2income_demo | In this notebook we use 50 years of census data to see how education affects income. |
E2E-> gdelt | Ridge_regression_with_feature_encoding | An end-to-end example using ridge regression on the GDELT dataset. Includes ETL with cuDF, feature scaling/encoding, and model training and evaluation with cuML. |
benchmarks | cuml_benchmarks | The purpose of this notebook is to benchmark all of the single GPU cuML algorithms against their scikit-learn counterparts, while also providing the ability to find and verify upper bounds. |
benchmarks-> cugraph_benchmarks | louvain_benchmark | This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX. |
benchmarks-> cugraph_benchmarks | pagerank_benchmark | This notebook benchmarks performance improvement of running PageRank within cuGraph against NetworkX. |
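Many of the example and benchmark notebooks above (PCA_Demo_Full, linear_regression_demo, umap_demo, cuml_benchmarks) share the same pattern: run a cuML estimator next to its scikit-learn counterpart and compare. A hedged sketch of that pattern using PCA and random stand-in data, not the datasets used in the notebooks:

```python
# Sketch of the cuML-vs-scikit-learn comparison pattern; random data is a stand-in
# for the real datasets used in the notebooks.
import numpy as np
from sklearn.decomposition import PCA as skPCA
from cuml import PCA as cumlPCA

X = np.random.rand(10_000, 64).astype(np.float32)

cpu_embedding = skPCA(n_components=2).fit_transform(X)     # CPU baseline
gpu_embedding = cumlPCA(n_components=2).fit_transform(X)   # GPU accelerated

print(cpu_embedding.shape, gpu_embedding.shape)
```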
Folder | Notebook Title | Description |
---|---|---|
tutorials | rapids_customized_kernels | This notebook shows how to create customized kernels using CUDA to make your workflow in RAPIDS even faster (a minimal Numba CUDA sketch follows this table). |
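One common way to write custom GPU kernels from Python is Numba's CUDA JIT, sketched below; this illustrates the idea, not necessarily the exact approach taken in rapids_customized_kernels.

```python
# Minimal custom CUDA kernel via Numba; a sketch of the idea, not the notebook's exact code.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)                   # global thread index
    if i < arr.size:
        arr[i] += 1.0

data = cuda.to_device(np.zeros(1024, dtype=np.float32))
threads = 128
blocks = (data.size + threads - 1) // threads
add_one[blocks, threads](data)         # launch the kernel on the GPU
print(data.copy_to_host()[:5])         # -> [1. 1. 1. 1. 1.]
```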
Folder | Notebook Title | Description |
---|---|---|
cyber -> flow_classification | flow_classification_rapids | The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost. |
cyber -> network_mapping | lanl_network_mapping_using_rapids | The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw Windows event logs using cuDF and uses cuGraph's PageRank model to build a network graph (a toy PageRank sketch follows this table). |
cyber -> raw_data_generator | run_raw_data_generator | The cyber folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. The notebook is used to showcase how to generate raw logs from the parsed LANL 2017 JSON data. The intent is to use the raw data to demonstrate parsing capabilities using cuDF. |
databricks | RAPIDS_PCA_demo_avro_read | The databricks folder is the companion file repository to the blog RAPIDS can now be accessed on Databricks Unified Analytics Platform by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebook's purpose is to showcase RAPIDS on Databricks using their sample datasets and show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import. |
regression | regression_blog_notebook | This is the companion notebook for the blog Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series by Paul Mahler. It showcases an end to end notebook using the try_this dataset and cuML's implementation of ridge regression. |
nlp->show_me_the_word_count_gutenberg | show_me_the_word_count_gutenberg | This is the notebook for the blog Show Me The Word Count by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases the NLP pre-processing capabilities of nvstrings+cuDF on the Gutenberg dataset. |
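As a flavor of what the cyber notebooks do with graphs, lanl_network_mapping_using_rapids builds a graph and runs cuGraph's PageRank. The gist of that call, sketched with a toy edge list rather than the parsed LANL logs:

```python
# Toy cuGraph PageRank call; the two-column edge list stands in for parsed log data.
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

scores = cugraph.pagerank(G)   # returns a cuDF DataFrame with vertex and pagerank columns
print(scores.sort_values("pagerank", ascending=False).to_pandas())
```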
Folder | Notebook Title | Description |
---|---|---|
GTC_SJ_2019 | GTC_tutorial_instructor | Description coming soon! |
GTC_SJ_2019 | GTC_tutorial_student | Description coming soon! |
Folder | Notebook Title | Description |
---|---|---|
kaggle-> landmark | cudf_stratifiedKfold_1000x_speedup | Description coming soon! |
kaggle-> malware | malware_time_column_explore | Description coming soon! |
kaggle-> malware | rapids_solution_gpu_only | Description coming soon! |
kaggle-> malware | rapids_solution_gpu_vs_cpu | Description coming soon! |
kaggle-> plasticc-> notebooks | rapids_lsst_full_demo | Description coming soon! |
kaggle-> plasticc-> notebooks | rapids_lsst_gpu_only_demo | Description coming soon! |
kaggle-> santander | cudf_tf_demo | Description coming soon! |
kaggle-> santander | E2E_santander_pandas | Description coming soon! |
kaggle-> santander | E2E_santander | Description coming soon! |
- The cuml folder also includes a small subset of the Mortgage Dataset used in the notebooks and the full image set from the Fashion MNIST dataset.
- utils: contains a set of useful scripts for interacting with RAPIDS.
- For additional, community driven notebooks, which will include our blogs, tutorials, workflows, and more intricate examples, please see the Notebooks Extended Repo.