classifying-text

Classifying text with bag-of-words, using data from a Kaggle competition: Bag of Words Meets Bags of Popcorn. Improved version of the original Kaggle tutorial.

bow_predict.py - train and predict, save a submission file
bow_validate.py - create train/test split, train, get validation score
bow_validate_tfidf.py - an improved validation script, with TF-IDF and n-grams

fofe - a directory containing FOFE vectorizer and sample code
fofe_validate.py - validation scores for count vectorizer vs FOFE

KaggleWord2VecUtility.py - the original script from the Kaggle tutorial

See http://fastml.com/classifying-text-with-bag-of-words-a-tutorial/ for a description.

FOFE

Fixed-size Ordinally-Forgetting Encoding is an order-weighted bag-of-words, proposed in A Fixed-Size Encoding Method for Variable-Length Sequences with its Application to Neural Network Language Models (http://arxiv.org/abs/1505.01504).

The authors use it with neural networks, but since it's a variation on BoW (and as such high-dimensional and sparse), I use it with a linear model. In validation it's slightly better than a vanilla count vectorizer, but worse than TF-IDF. FOFE is also sensitive to its one hyperparameter, alpha.
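For intuition, the paper defines the encoding by the recurrence z_t = alpha * z_{t-1} + e_t, where e_t is the one-hot vector of the t-th word. Below is a minimal pure-Python sketch of that recurrence for a single document. It is not the code in fofe/fofe.py: the name fofe_encode, the default alpha, and the choice to skip out-of-vocabulary words are assumptions made for illustration.

```python
def fofe_encode(tokens, vocabulary, alpha=0.5):
    """Encode one tokenized document as a dense FOFE vector.

    tokens: list of words; vocabulary: dict mapping word -> index.
    Hypothetical helper, assumed for illustration only.
    """
    z = [0.0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is None:
            # Assumption: words outside the vocabulary are ignored entirely.
            continue
        # Decay everything seen so far, then add the current word.
        z = [alpha * value for value in z]
        z[index] += 1.0
    return z


vocabulary = {"a": 0, "b": 1, "c": 2}
forward = fofe_encode(["a", "b", "c"], vocabulary)
backward = fofe_encode(["c", "b", "a"], vocabulary)
```

Unlike a plain count vector, the two documents above get different encodings, because words nearer the end of a document carry more weight. With alpha = 1 the decay disappears and the result is an ordinary count vector.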

fofe/fofe.py contains a readable but slow and memory-hungry implementation (naive_transform), as well as a more efficient function that constructs a sparse matrix (transform).

Both functions expect two arguments, docs and vocabulary:

docs is a list of documents, where each document is a list of words (tokens)
vocabulary is a dictionary mapping words to indices

You can get a dictionary from CountVectorizer - see fofe_validate.py.
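To make the expected shapes concrete, here is a small dependency-free sketch that builds docs and vocabulary from raw strings. The helpers tokenize and build_inputs are hypothetical and only approximate what CountVectorizer does by default (lowercasing, tokens of two or more word characters, indices assigned in sorted order); with scikit-learn, the fitted vectorizer's vocabulary_ attribute gives a dictionary of the same form.

```python
import re


def tokenize(text):
    # Approximates CountVectorizer's default tokenization:
    # lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def build_inputs(texts):
    """Return (docs, vocabulary) in the shapes described above.

    Hypothetical helper, assumed for illustration only.
    """
    docs = [tokenize(text) for text in texts]
    words = sorted({word for doc in docs for word in doc})
    vocabulary = {word: index for index, word in enumerate(words)}
    return docs, vocabulary


texts = ["A great movie", "Not a great plot"]
docs, vocabulary = build_inputs(texts)
```

Here docs is a list of token lists and vocabulary maps each word to a column index, which is the pair of arguments that naive_transform and transform expect. See fofe_validate.py for how the repository itself prepares them.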
