Some notes to guide the conversation in Shaffer's group meeting for October 31, 2017.
Some goals of modeling:
- description
- prediction
- clustering (figuring out what groups there are)
- confirmation (statistical explanation)
- interpretation (explanation in common-sense terms)
- generative models (as generators, as predictors/classifiers)
- embeddings
Some classes of situations:
- supervised learning
- unsupervised learning
- semi-supervised learning
- active learning
- interactive learning
Some distinctions worth noting:
- classification vs. clustering
- classification vs. regression
- predict vs. explain
- generalization
- generate vs. discriminate (see the sketch after this list)
- P(X|Y) vs. P(Y|X)
- generate: model the distribution of things in the world, P(X|Y) (given the class, how likely is this thing?)
- discriminate: model P(Y|X) directly (given this thing, how likely is it to be in the class?)
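As a concrete illustration of the P(X|Y) vs. P(Y|X) split, here is a minimal sketch (not from the notes; the toy data and model choices are illustrative) using scikit-learn: GaussianNB is generative, LogisticRegression is discriminative.

```python
# Sketch: generative (fit P(X|Y), invert with Bayes' rule) vs. discriminative
# (fit P(Y|X) directly). Toy data; not from the notes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

gen = GaussianNB().fit(X, y)           # generative: per-class Gaussians for P(X|Y)
disc = LogisticRegression().fit(X, y)  # discriminative: P(Y|X) as a logistic function

# Both expose P(Y|X) at predict time; the generative model gets there via Bayes' rule.
print(gen.predict_proba(X[:3]))
print(disc.predict_proba(X[:3]))
```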
Some specific models:
- prediction as function approximation
- generalization
- Linear models (classifiers, regressors)
- SVMs vs. Linear Discriminants (LDA) vs. Logistic Regression (compared in the sketch after this list)
- Non-Linear models (and kernels)
- Rule-based models
- Decision Trees
- Neural Network Models
- "Deep"-ness
- Fancy architectures (convolution, recurrence, LSTM)
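A hedged sketch of the model families above, assuming scikit-learn; the toy data is illustrative, and the point is only the shared fit/score API across linear, kernel, and tree models.

```python
# Sketch: the model families above on shared toy data; only the API is the point.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

models = {
    "SVM (linear)": LinearSVC(),
    "Linear Discriminant (LDA)": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(),
    "SVM (RBF kernel, non-linear)": SVC(kernel="rbf"),
    "Decision Tree (rule-based)": DecisionTreeClassifier(max_depth=3),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))  # training accuracy, just to show the API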
Performance Analysis:
- testing generalization vs. testing ...
- cross validation, bootstrapping, holdout, ...
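A minimal sketch of two of the strategies above, holdout vs. k-fold cross-validation, assuming scikit-learn; the data and split sizes are arbitrary.

```python
# Sketch: two ways to estimate generalization -- a single holdout split vs. k-fold CV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=2)

# Holdout: fit on one portion, score on the untouched remainder.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)
print("holdout:", LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))

# Cross-validation: average test score over k rotating held-out folds.
print("5-fold CV:", cross_val_score(LogisticRegression(), X, y, cv=5).mean())
```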
User-Centric Issues in Modeling and ML
- Interpretability
- Explanations
- Interpretable Models
- Interpretable vs. Black-Box Models
- simplification
- new model type choices
- other strategies
- performance metrics (see the sketch after this list)
- simple (accuracy, precision, recall, MCC, ...)
- confidence-based (AUC-PR, AUC-ROC)
- information gain
- feature selection
- feature engineering
- latent features / concepts
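A sketch of the performance metrics above as scikit-learn calls; the labels, predictions, and confidence scores here are made up for illustration.

```python
# Sketch: the "simple" metrics take hard labels; the AUC-style metrics take confidences.
from sklearn.metrics import (accuracy_score, average_precision_score, matthews_corrcoef,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard labels from some classifier
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]   # confidences from the same classifier

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))
print("AUC-PR:   ", average_precision_score(y_true, y_score))
```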
Warning: I am not a language expert.
Some things that make language hard; a few top ones... you can think of more:
- context dependent
- ambiguous (at all levels)
- redundant (designed to tolerate errors)
- multi-layered (syntax <-> semantics <-> higher level semantics)
- many ways to say things
- reference to other things
- other communication modalities (intonation, gesture, common reference)
- reference to shared knowledge
Common text-processing steps (sketched after this list):
- tokenization
- standardization
- lemmatization (reduce words to their dictionary form)
- stemming (cruder: just chop off endings)
- tagging / labeling
- "Stop Words" removal
- sentiment analysis
- Bag of words models
- Bag of feature models
- Parsed representations
- Semantic representations
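A toy, library-free sketch of the basic processing steps above (tokenization, standardization, stop-word removal, crude stemming); a real pipeline would use NLTK or spaCy, and the stop-word list here is a tiny illustrative stand-in.

```python
# Toy sketch of a text-processing pipeline; every list and rule here is illustrative.
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}  # tiny stand-in list

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())        # tokenization + standardization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    # Crude suffix-stripping as a stand-in for stemming/lemmatization:
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]

print(preprocess("The dogs were running to the park"))
# ['dog', 'were', 'runn', 'park']  -- crude stemming mangles 'running'
```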
Operations on Bag of Words models
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
- TF/IDF (see the sketch after this list)
- Why not naively use the document/word matrix?
- sparsity issues
- synonyms, other ways to say it, ...
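A sketch of the raw document/word matrix next to its TF-IDF re-weighting, assuming scikit-learn; the three-document corpus is illustrative.

```python
# Sketch: raw counts vs. TF-IDF re-weighting on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

counts = CountVectorizer().fit_transform(docs)  # raw document/word matrix (sparse)
tfidf = TfidfVectorizer().fit_transform(docs)   # same shape, TF * IDF weights

print(counts.shape, tfidf.shape)  # TF-IDF downweights words that appear in many documents
```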
Alternatives to Bag of Words
- Sliding windows
- chunked texts
- Skip-Grams (see the sketch after this list)
- recursive models
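A toy sketch of the window-based alternatives above; the function names are mine, not standard APIs.

```python
# Toy sketch: sliding windows and (center, context) skip-gram pairs over a token list.
def sliding_windows(tokens, size=3):
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def skip_grams(tokens, window=2):
    # (center, context) pairs within +/- window, as used by word2vec-style models
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(sliding_windows(tokens))
print(skip_grams(tokens)[:6])
```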
Simple Methods - why/why not
- rules/dictionaries
- interpretability
Latent Semantic Analysis (LSA)
- Do truncated SVD (essentially PCA) on the document/word matrix
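A sketch of LSA as truncated SVD, assuming scikit-learn; TruncatedSVD skips the mean-centering of textbook PCA, which is why it works directly on sparse matrices. The corpus is illustrative.

```python
# Sketch: LSA = truncated SVD of a (TF-IDF) document/word matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stocks fell sharply today"]

X = TfidfVectorizer().fit_transform(docs)  # sparse document/word matrix
lsa = TruncatedSVD(n_components=2).fit(X)
print(lsa.transform(X))                    # each document in a 2-D latent "concept" space
```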
Topic Modeling
- as a general concept
- documents in topics
- documents as mixtures of topics
- topic space as an embedding for documents
- words as evidence that a document is in a topic
- Latent Dirichlet Allocation (LDA) models (see the sketch after this list)
- The Naive version:
- document matrix (embedding), word matrix (embedding)
- The "real" version (with the dice game)
- Serendip & Serendip Slim
- What topic models are/are not good for
- Why topic models are not good for classification
- Fancier forms of topic models
- Semi-Supervised Topic Models
- Naive projections of topic models
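A sketch of LDA in scikit-learn showing both readings above: documents as mixtures of topics (transform) and topics as distributions over words (components_); the corpus and topic count are illustrative.

```python
# Sketch: LDA fit on raw counts (not TF-IDF); 2 topics on a 4-document toy corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats and dogs and pets",
        "dogs chase cats",
        "stocks and bonds",
        "bond markets fell"]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))       # rows: documents as mixtures over the 2 topics
print(lda.components_.shape)  # rows: topics as (unnormalized) distributions over words
```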
Word-Vector Embeddings
- The idea
- The methods
- Word2Vec
- GloVe
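A sketch of training skip-gram word vectors, assuming gensim >= 4 (where the dimension argument is vector_size); the two-sentence corpus is far too small for meaningful vectors.

```python
# Sketch: skip-gram word2vec with gensim; min_count=1 only because the corpus is tiny.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["cat"])               # the learned 10-D embedding for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbors in the vector space
```

GloVe vectors, by contrast, are usually downloaded pretrained rather than trained locally.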