This is a short self-contained Clojure implementation of:
Simple Type-Level Unsupervised POS Tagging Yoong Keok Lee, Aria Haghighi and Regina Barzilay To appear in proceedings of EMNLP 2010
Simply run the script with the following arguments
infile: path to a file where each line is a sentence and tokens are space separate
outfile: path to write mapping of words to tags (represented by an integer)
num-iters: number of iterations to run Gibbs Sampler
K: number of tag states to use
alpha: hyper-parameter for type-level distributions (try 1)
beta: hyper-parameter for token-level distributions (try 0.1)
If you want to test this on a corpus of the appropriate size, I have the Brown corpus (approximately 1 million tokens) which you can use as input at http://db.tt/aYl0tfx. Only available for non-commercial purposes.
Aria Haghighi ([email protected])
Email author with any issues.
Copyright (C) 2010 Aria Haghighi
Distributed under the Eclipse Public License, the same as Clojure uses. See the file License.