Statistical learning: an introduction

Regression vs. K-nearest neighbors
Cross validation

Linear models

Regularization and feature selection

Classification

Trees and ensembles

Clustering

Basics of clustering; K-means clustering; hierarchical clustering.

Scripts and data:

protein.R and protein.csv
cars.R and cars.csv
we8there.R
hclust_examples.R

Readings:

ISL Sections 10.1 and 10.3, or Elements Chapter 14.3 (more advanced)
K-means examples: a few stylized examples to build your intuition for how k-means behaves.
Hierarchical clustering notes: some slides on hierarchical clustering.
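K-means++: the original paper, or the simpler explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization: it spreads the starting centers out by choosing each new one with probability proportional to its squared distance from the nearest center already chosen.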
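
To make the k-means++ reading above concrete, here is a minimal sketch of the seeding step only (not the full k-means loop). It is written in plain Python rather than the R used in the course scripts, and the function names `sq_dist` and `kmeanspp_init` are made up for illustration. It assumes the data contain at least k distinct points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, rng=None):
    """Pick k starting centers from `points` using k-means++ seeding.

    Assumes `points` holds at least k distinct tuples.
    """
    rng = rng or random.Random(0)
    # First center: chosen uniformly at random from the data.
    centers = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen center so far.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: sampled with probability proportional to d2, so
        # points far from every existing center are favored.
        nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(old, sq_dist(p, nxt)) for p, old in zip(points, d2)]
    return centers


# Toy data (invented for this example): three loose groups in the plane.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (20.0, 1.0)]
print(kmeanspp_init(toy, 3))
```

A point that is already a center has squared distance zero and therefore zero sampling weight, so the same point is never picked twice. The centers returned here would then be handed to the usual k-means iterations.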

Latent-feature models

Principal component analysis (PCA). Using PCA for dimensionality reduction in regression.

Scripts and data:

pca_intro.R
congress109.R, congress109.csv, and congress109members.csv
FXmonthly.R, FXmonthly.csv, and currency_codes.txt

If time:

gasoline.R and gasoline.csv

Readings:

ISL Section 10.2

Supplemental readings (optional and more advanced):

Elements Chapter 14.5
Shalizi Chapters 18 and 19. In particular, Chapter 19 has much more advanced material on factor models, beyond what we covered in class.

Networks and association rules

Networks and association rule mining.

Scripts and data:

medici.R and medici.txt
playlists.R and playlists.csv

Readings:

Intro slides on networks
Notes on association rule mining
In-depth explanation of the Apriori algorithm
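
As a companion to the association-rule readings above, the sketch below computes the three standard summaries of a rule "lhs -> rhs": support, confidence, and lift. It is plain Python rather than the course's R, the function name `rule_metrics` and the toy baskets are invented for illustration, and it assumes the left-hand side appears in at least one basket and the right-hand side in at least one basket.

```python
def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs.

    `baskets` is a list of item collections; `lhs` and `rhs` are item sets.
    Assumes lhs and rhs each occur in at least one basket.
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs <= set(b))
    n_rhs = sum(1 for b in baskets if rhs <= set(b))
    n_both = sum(1 for b in baskets if (lhs | rhs) <= set(b))
    support = n_both / n              # P(lhs and rhs)
    confidence = n_both / n_lhs       # P(rhs | lhs)
    lift = confidence / (n_rhs / n)   # P(rhs | lhs) / P(rhs)
    return support, confidence, lift


# Toy "playlists" (made up): each basket is the set of artists on one list.
playlists = [
    {"artist_a", "artist_b"},
    {"artist_a", "artist_b", "artist_c"},
    {"artist_a"},
    {"artist_b", "artist_c"},
    {"artist_c"},
]
s, c, l = rule_metrics(playlists, {"artist_a"}, {"artist_b"})
print(round(s, 3), round(c, 3), round(l, 3))  # prints 0.4 0.667 1.111
```

A lift above 1 means the right-hand side shows up more often in baskets containing the left-hand side than it does overall. Algorithms such as Apriori exist to search for high-support item sets efficiently; this sketch only scores a single rule that you name yourself.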

Miscellaneous:

Gephi, a great piece of software for exploring graphs
The Gephi quick-start tutorial
A little Python utility for scraping Spotify playlists

Monte Carlo simulation

Using the bootstrap to approximate value at risk (VaR).

Scripts:

R walkthrough on Monte Carlo simulation
portfolio.R

Readings:

Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
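
For intuition about how the bootstrap connects to VaR, here is a rough single-asset sketch. It is not the method in portfolio.R: it is plain Python, the function name `bootstrap_var`, its parameters, and the daily returns are all invented for illustration, and it assumes daily returns can be resampled independently with replacement.

```python
import random


def bootstrap_var(returns, horizon=20, n_sims=5000, wealth=100_000.0,
                  alpha=0.05, seed=1):
    """Approximate value at risk over `horizon` days by bootstrapping.

    Each simulated path draws `horizon` daily returns with replacement from
    the historical `returns` and compounds them. VaR is reported as a
    positive number: the loss at the alpha-quantile of simulated profit/loss.
    """
    rng = random.Random(seed)
    pnl = []
    for _ in range(n_sims):
        w = wealth
        for _ in range(horizon):
            w *= 1.0 + rng.choice(returns)  # resample one historical day
        pnl.append(w - wealth)
    pnl.sort()
    idx = int(alpha * n_sims)  # position of the alpha-quantile
    return -pnl[idx]


# Made-up daily returns, purely for illustration.
history = [0.004, -0.012, 0.007, 0.001, -0.003, 0.010, -0.020, 0.002]
print(bootstrap_var(history))
```

With several assets, one would typically resample whole days (one row of returns across all assets at a time) so that the dependence between assets on the same day is preserved.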

Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).

Scripts and data:

tm_examples.R and selections from the Reuters newswire.
congress109_classify.R
art_examples.R

Readings:

Intro slides on text
Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
Great blog post about word vectors.
Using the tm package for text mining in R.
Dave Blei's survey of topic models.
A pretty long blog post on naive-Bayes classification.
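
To illustrate the TF-IDF weighting covered in the readings above, the sketch below uses one common variant: term frequency as a share of the document's tokens, times the natural log of (number of documents / number of documents containing the term). It is plain Python rather than the tm package, the function name `tf_idf` and the tiny corpus are invented, and tokenization is just lowercasing and splitting on whitespace. Libraries often use different smoothing or normalization, so their numbers will not match exactly.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / tokens in doc) * ln(N / doc frequency)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Tiny made-up corpus.
corpus = ["the cat sat", "the dog barked", "the cat purred"]
for row in tf_idf(corpus):
    print(row)
```

A term that appears in every document (here "the") gets weight zero, since ln(N / N) = 0, while a term unique to one document gets the largest weight for its frequency.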

Further topics

Causal inference meets statistical learning.

Neural networks.
