Submissions for the COMPLETED ✔️ Springboard Data Science curriculum.
- Submissions
  - Guided Capstone (external repo)
  - Capstone Two
  - Capstone Three
All assignments and submissions practice the data science methodology.
Unit | Name, link | Description | Status | Skills |
---|---|---|---|---|
4.3 | London Calling! | London Housing Case Study | ✔️ | pandas, matplotlib, seaborn, Python data types |
6 | Guided Capstone | First Capstone Data Science Method (DSM) exercise, contained in external repository. Seven submissions in total. | ✔️ | statistics review, scikit-learn, data wrangling/exploration/visualization, PCA, cross-validation, regression, model selection, RandomForest, hyperparameter tuning, data quantity assessment |
7.2 | API Mini-Project | Last assignment of the Data Wrangling portion (7.2) of the DSM. | ✔️ | RESTful APIs, JSON, requests, financial calculations |
7.4 | NASA Meteorites Report | Notebook example showing use of ydata-profiling to generate reports from DataFrames, part of 7.4 - Data Definitions. Example HTML reports contained in folder. | ✔️ | automated EDA, NumPy |
8.3 | SQL Case Study | SQL wrap-up using country club facility/booking/member data. Issues with Springboard's phpMyAdmin server noted. | ✔️ | SQL, SQLite3, joins, filters, etc. |
11.1 | Frequentist Inference Part A, Part B | Statistical inference exercises; introduction to the SciPy package. | ✔️✔️ | SciPy, statistical tests and parameters, confidence intervals, population distributions and sampling, Central Limit Theorem |
11.3 | Hypothesis Testing - Integrating Apps | Case study: were user reviews for X better than for Y? | ✔️ | data cleaning, null and alternate hypotheses, permutation tests, tqdm |
11.4 | Linear Regression - Red Wine | Case study using a Kaggle wine dataset for regression practice. | ✔️ | EDA, correlation, train/test splits, statsmodels, linear regression (Ordinary least squares, multiple, weighted), multicollinearity, feature selection |
11.4 | Linear Regression - Boston Housing | Mini-project: predict house prices using OLS regression. | ✔️ | EDA, linear regression, interpretation of model coefficients, error analysis, coefficient of determination, model comparisons, information criteria, F-statistic, QQ plots, influence plots, outlier analysis, ethical data (see sklearn load_boston) |
14.2 | Logistic Regression | Case study using healthcare patient data, predicting heart disease. Brief discussion of model tuning. Added extra notes to the Confusion Matrix / Precision / Recall section. | ✔️ | logistic regression, wrangling, EDA, preprocessing, categorical feature encoding, training, confusion matrix, precision, recall, accuracy, hyperparameter tuning, GridSearchCV, discriminative vs. generative models |
14.3 | Decision Trees - RR Diner Coffee | Use a customer survey to predict whether other customers will buy a new coffee. | ✔️ | dummy encoding, decision tree classifiers, decision tree hyperparameters, bagging classifier, RandomForest, classification metrics |
14.4 | Random Forest | Random Forest overview and discussion, with a basic demonstration using patient data and classification. | ✔️ | Graphviz, RandomForest, ExtraTrees, data imputation, data scaling and normalization, model feature importances |
14.5 | Gradient Boosting | Gradient Boosting basic demonstration for curve fitting and with Titanic survivors dataset. | ✔️ | Gradient boosting, regression and classification, ROC-AUC, GradientBoosting tuning |
15.2 | Calculating Distances | Visual demonstrations of Euclidean vs Manhattan distance calculations (a minimal sketch of these metrics follows this table). | ✔️ | distance metrics and calculation, Euclidean, Manhattan, PCA |
15.5 | Cosine Similarity | Brief example of calculation using numerical and text data (with preprocessing steps); also covered in the sketch after this table. | ✔️ | cosine similarity, 3D plotting, text data, TF-IDF |
15.6 | PCA, K-Means Clustering - Customer Segmentation | Customer survey clustering example with limited features. Variety of algorithms and evaluation metrics used. | ✔️ | K-means clustering, inertia, elbow, silhouette, and gap statistic methods, PCA, other clustering algorithms: (H)DBSCAN, AffinityPropagation, Spectral, Agglomerative |
16.2 | featuretools Automated Feature Engineering | Tutorial notebook for the featuretools package, customer churn prediction from a Kaggle dataset. | ✔️ | featuretools, dask, automated feature engineering, deep feature synthesis, custom primitives, selected primitives, churn prediction |
18.2 | Grid Search with kNN | Brief hyperparameter tuning example using nearest neighbors and random forest models (a grid search sketch follows this table). | ✔️ | nearest neighbors classification, hyperparameter tuning |
18.2 | Bayesian Optimization | Bayesian optimization (via the bayes_opt package) for hyperparameter tuning of LightGBM and CatBoost models. | ✔️ | bayes_opt, Bayesian optimization for hyperparameter tuning, CatBoost classification, LightGBM, feature encoding, transformation, and engineering |
20.3 | Storytelling | Choose a dataset, explore it, and build a narrative: NFL QB Draft Picks since 1990. | ✔️ | applied data science methodology with focus on data visualization and interpretation |
21.1 | Time Series Analysis, ARIMA model | Case study forecasting sales data using ARIMA. | ✔️ | Time series analysis, ARIMA models, decomposition: trend, seasonality, and noise, stationarity, KPSS, ARIMA scoring, parameters, forecasting |
25.2 | PySpark, Databricks Exercises | Examples of interacting with data and fitting models. | ✔️ | external link, Databricks, Spark, SparkSQL, Spark ML, pipelines |
27.2 | Take Home One | Three-part take-home challenge: time series, experiment design, and classification modeling. | ✔️ | demonstration of skills, EDA, DoE, modeling, hyperparameter tuning with RandomizedSearch grids, relative permutation feature importance, error analysis |
27.2 | Take Home Two | Classification modeling, data analysis and discussion. | ✔️ | demonstration of skills, gradient boosting classifiers (HistGB, LGBM, CatBoost, XGBoost) |
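
As a quick reference for the distance and similarity material in units 15.2 and 15.5 above, here is a minimal sketch using toy vectors; the values and variable names are illustrative assumptions and are not taken from the notebooks.

```python
# Minimal sketch (toy vectors, not from the notebooks): Euclidean vs Manhattan
# distance and cosine similarity with NumPy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 / straight-line distance
manhattan = np.sum(np.abs(a - b))           # L1 / city-block distance
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

print(euclidean, manhattan, cosine)
```

The same quantities underpin the k-means and nearest-neighbors work elsewhere in the table.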
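
Likewise, a minimal sketch of the grid search workflow practiced in unit 18.2, shown here on scikit-learn's built-in iris dataset rather than the notebook's data; the parameter grid is an illustrative assumption.

```python
# Minimal sketch (assumed data and grid, not from the notebook): grid search
# over k-nearest-neighbors hyperparameters with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# Best hyperparameter combination and its cross-validated accuracy
print(search.best_params_, search.best_score_)
```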
Unit | Name, link | Description | Status |
---|---|---|---|
7.1 | Project Proposal | Final PDF of proposal after discussion and approval. Project ideas not uploaded to repository folder. | ✔️ |
7.6 | Data Wrangling | Notebook containing initial data cleaning steps and descriptions. | ✔️ |
11.5 | Exploratory Data Analysis | Notebook containing initial data exploration steps and descriptions. | ✔️ |
16.3 | Pre-processing and Training | Notebook containing initial data pre-processing and model training steps and descriptions. | ✔️ |
18.3 | Modeling | Notebook containing initial modeling steps and descriptions. | ✔️ |
20.4 | Final Report | Final report for Capstone Two, with brief summary. | ✔️
20.4 | Final Model | Final model parameters and metrics for Capstone Two. | ✔️ |
20.4 | Final Presentation | Final slides for Capstone Two. | ✔️ |
Unit | Name, link | Description | Status |
---|---|---|---|
24.4.1 | Project Proposal | Final PDF of proposal after discussion and approval. Based on Kaggle PlantTraits competition. | ✔️ |
26.2.1 | Data Wrangling and EDA | Data wrangling and EDA notebook. | ✔️ |
28.1.1 | Pre-processing and Modeling | Notebook containing data pre-processing and model training. | ✔️ |
28.1.2 | Documentation | Final report for Capstone Three. | ✔️
28.1.3 | Presentation | Final slides for Capstone Three. | ✔️
- Notes, in progress
- Statistics
- Stat Book
  - Chapter summaries, mostly incomplete
- LI Learning
  - statistical inference, statistical modeling, Bayesian inference
- Stat Book
- Review Topics for Interviews
  - Python basics, SQL, various interview articles
- Machine Learning Units
  - supervised and unsupervised learning, feature engineering, applications
- DataCamp notebooks
  - supervised and unsupervised learning, feature engineering, time-series analysis, PySpark
  - brief list of completed courses
- Statistics
- External