Homework II: n-gram and co-occurrence estimation
Given a text corpus, there are two objectives:
- Calculate a list of n-grams and their frequencies in the corpus: n-gram list
- Create, for a given n-gram, a sorted list of similar words: co-occurrence list
You can work with any of the following corpora:
1. ukWaC (this has been used, but not uploaded)
2. Wikipedia corpus
- Lemmatize the text corpus, if needed.
- Go over the processed text and create a list of all n-grams (unigrams and bigrams).
- Then calculate the frequency of each of these n-grams in the corpus.
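The counting step can be sketched roughly as follows. This is a minimal illustration, not the reference solution: it assumes the corpus has already been lemmatized and split into sentences, each a list of tokens, and the function names `ngrams` and `ngram_frequencies` as well as the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(sentences, sizes=(1, 2)):
    """Count unigrams and bigrams (by default) over tokenized sentences.

    Assumption: n-grams do not cross sentence boundaries.
    """
    counts = Counter()
    for tokens in sentences:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


# Toy, already lemmatized input (hypothetical data, for illustration only).
sentences = [
    ["the", "cat", "sit", "on", "the", "mat"],
    ["the", "dog", "sit", "on", "the", "rug"],
]
freqs = ngram_frequencies(sentences)

# The n-gram list: n-grams sorted by descending frequency.
for gram, count in freqs.most_common(5):
    print(" ".join(gram), count)
```

Storing n-grams as tuples keeps unigrams and bigrams in one `Counter` without ambiguity; for a corpus the size of ukWaC or Wikipedia, the sentences would probably need to be streamed from disk rather than held in a list.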
Co-occurrence estimation has been done with the Jaccard similarity coefficient.
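For two sets A and B, the Jaccard coefficient is |A ∩ B| / |A ∪ B|. The assignment text does not say which sets are compared, so the sketch below makes an assumption: each n-gram is represented by the set of sentence indices in which it occurs, and candidate words are ranked by the Jaccard coefficient between their set and the target's set. The names `jaccard`, `occurrence_sets`, and `cooccurrence_list`, and the toy sentences, are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def occurrence_sets(sentences, sizes=(1, 2)):
    """Map each n-gram to the set of sentence indices in which it occurs.

    Assumption: the sentence is the co-occurrence window.
    """
    occ = defaultdict(set)
    for idx, tokens in enumerate(sentences):
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                occ[tuple(tokens[i:i + n])].add(idx)
    return occ


def cooccurrence_list(target, occ):
    """Rank single words by Jaccard similarity to the target n-gram.

    Words that are part of the target itself are skipped. Ties are
    broken alphabetically so the ordering is deterministic.
    """
    target_set = occ.get(target, set())
    scored = [
        (gram[0], jaccard(target_set, ids))
        for gram, ids in occ.items()
        if len(gram) == 1 and gram[0] not in target
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy, already lemmatized input (hypothetical data, for illustration only).
sentences = [
    ["cat", "chase", "mouse"],
    ["cat", "eat", "mouse"],
    ["dog", "chase", "cat"],
    ["dog", "eat", "bone"],
]
occ = occurrence_sets(sentences)
ranked = cooccurrence_list(("cat",), occ)
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

Other choices of window (a fixed number of neighbouring tokens, or whole documents) fit the same structure; only `occurrence_sets` would change.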