- Regression vs. K-nearest neighbors
- Cross validation
Basics of clustering; K-means clustering; hierarchical clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3, or Elements Chapter 14.3 (more advanced)
- K-means examples: a few stylized examples to build your intuition for how k-means behaves.
- Hierarchical clustering notes: some slides on hierarchical clustering.
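- K-means++: the original paper, or a simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization: each new center is drawn with probability proportional to its squared distance from the nearest center already chosen, so the starting centers tend to be spread out.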
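
To make the k-means++ initialization concrete, here is a minimal sketch in plain Python. It is an illustration only, not one of the course scripts (which are in R); the function names `squared_distance` and `kmeans_pp_init` and the toy `points` are made up for this example, and the sketch assumes there are at least `k` distinct points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    """Choose k initial centers using the k-means++ seeding rule.

    The first center is drawn uniformly at random. Each later center is
    drawn with probability proportional to the squared distance from a
    point to its nearest already-chosen center.
    Assumes at least k distinct points; otherwise all weights can be zero.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Toy data (hypothetical): three loose groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]
centers = kmeans_pp_init(points, k=3, rng=random.Random(1))
print(centers)
```

These centers would then be handed to the usual k-means iterations in place of uniformly random starting points. A point that is already a center has weight zero, so it cannot be chosen twice.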
Principal component analysis (PCA). Using PCA for dimensionality reduction in regression.
Scripts and data:
- pca_intro.R
- congress109.R, congress109.csv, and congress109members.csv
- FXmonthly.R, FXmonthly.csv, and currency_codes.txt
If time:
Readings:
- ISL Section 10.2
Supplemental readings (optional and more advanced):
- Elements Chapter 14.5
- Shalizi Chapters 18 and 19. In particular, Chapter 19 covers factor models in much more depth than we did in class.
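
As a rough illustration of what PCA computes, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix, then projects the centered data onto it. This is a plain-Python sketch, not the method used in pca_intro.R; the function name `first_principal_component` and the toy `rows` are hypothetical, and power iteration is just one simple way to get the leading eigenvector.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length numeric tuples, one per observation.
    Assumes the data are not all identical and that the starting vector
    is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (divides by n - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores: each centered observation projected onto the direction v.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Toy data (hypothetical): the second variable is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

For dimensionality reduction in regression, the `scores` would stand in for the original variables as a single summary feature.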
Networks and association rule mining.
Scripts and data:
- medici.R and medici.txt
- playlists.R and playlists.csv
Readings:
- Intro slides on networks
- Notes on association rule mining
- In-depth explanation of the Apriori algorithm
Miscellaneous:
- Gephi, a great piece of software for exploring graphs
- The Gephi quick-start tutorial
- A little Python utility for scraping Spotify playlists
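
The three standard quantities used to judge an association rule (support, confidence, and lift) are easy to compute by hand on a toy example. The sketch below does so in plain Python; the baskets and the artist labels are invented for illustration and are not taken from playlists.csv, and the helper names `support` and `rule_metrics` are hypothetical.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs.

    Assumes lhs and rhs each appear in at least one basket.
    """
    s_lhs = support(baskets, lhs)
    s_rhs = support(baskets, rhs)
    s_both = support(baskets, set(lhs) | set(rhs))
    confidence = s_both / s_lhs   # estimate of P(rhs | lhs)
    lift = confidence / s_rhs     # P(rhs | lhs) / P(rhs)
    return s_both, confidence, lift

# Toy playlists (hypothetical), each a set of artists.
baskets = [
    {"artist_a", "artist_b"},
    {"artist_a", "artist_b", "artist_c"},
    {"artist_a", "artist_c"},
    {"artist_b", "artist_c"},
    {"artist_a", "artist_b"},
]
supp, conf, lift = rule_metrics(baskets, {"artist_a"}, {"artist_b"})
print(supp, conf, lift)
```

A lift below 1, as in this toy example, means that seeing the left-hand side makes the right-hand side slightly less likely than its baseline rate. Apriori's job is to search efficiently for itemsets whose support clears a threshold, so that rules like this never have to be enumerated exhaustively.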
Using the bootstrap to approximate value at risk (VaR).
Scripts:
Readings:
- Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
- Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
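
The idea of bootstrapping value at risk can be sketched in a few lines: resample past returns with replacement to simulate many possible futures, then read off a lower quantile of the simulated profit and loss. The plain-Python sketch below makes several simplifying assumptions that are not from the course materials: a single asset, a made-up list of daily returns, and a simple order-statistic convention for the quantile. The function name `bootstrap_var` is hypothetical.

```python
import random

def bootstrap_var(returns, initial_wealth, horizon, n_sims, alpha, rng):
    """Bootstrap estimate of value at risk at level alpha.

    Each simulation draws `horizon` returns with replacement from the
    observed returns and compounds them. VaR is reported as a positive
    number: the loss at the alpha quantile of simulated profit and loss.
    """
    profits = []
    for _ in range(n_sims):
        wealth = initial_wealth
        for _ in range(horizon):
            wealth *= 1 + rng.choice(returns)
        profits.append(wealth - initial_wealth)
    profits.sort()
    # Simple empirical quantile: the order statistic at index alpha * n_sims.
    return -profits[int(alpha * n_sims)]

# Hypothetical daily returns for one asset.
daily_returns = [0.012, -0.008, 0.003, -0.021, 0.007, 0.015, -0.004, 0.001, -0.013, 0.009]
var_5 = bootstrap_var(daily_returns, initial_wealth=10000, horizon=20,
                      n_sims=2000, alpha=0.05, rng=random.Random(42))
print(var_5)
```

For a portfolio of several assets, one would resample whole days (rows of returns across all assets) rather than each asset separately, so that the correlation between assets is preserved.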
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Scripts and data:
Readings:
- Intro slides on text
- Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
- Great blog post about word vectors.
- Using the tm package for text mining in R.
- Dave Blei's survey of topic models.
- A fairly long blog post on naive Bayes classification.
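
To show what TF-IDF weighting does, here is a minimal plain-Python sketch using one common variant: term frequency as a within-document proportion, multiplied by the log of the inverse document frequency. Other variants (smoothing, different logarithms, normalization) are widely used, and this is not the tm package's implementation. The function name `tf_idf` and the toy `docs` are hypothetical, and tokenization is just lowercasing and splitting on whitespace.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(n_docs / doc_frequency)
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical).
docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that appears in every document, like "the" here, gets weight zero, while a term unique to one document gets the largest weight. Each dict is a sparse vector-space representation of its document.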
Causal inference meets statistical learning.
Neural networks.