Topic Model
Sharan Srivatsa edited this page Apr 7, 2018 · 1 revision
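// Imports used by the snippets on this page
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;
import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();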
This list holds the pipes that preprocess the input. Pipes handle steps such as lowercasing, tokenizing, removing stopwords, and mapping tokens to features.
// Read dependent files
ClassLoader classLoader = TopicModel.class.getClassLoader();
File en_txt = new File(classLoader.getResource("en.txt").getFile()); // English stop words
File ap_txt = new File(classLoader.getResource("ap.txt").getFile()); // example input file
- Create your pipes
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(en_txt, "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );
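To see what the token pattern above keeps and drops, here is a small sketch that uses only `java.util.regex`, not MALLET. The class `TokenPatternDemo` and its `tokenize` method are hypothetical names made up for this illustration, and the sketch only approximates the lowercase and tokenize pipes: it assumes the tokenizer collects successive matches of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MALLET: applies the same token regex by hand.
class TokenPatternDemo {
    // Same pattern as passed to CharSequence2TokenSequence above:
    // a letter, then one or more letters or punctuation marks, then a letter.
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Lowercase the text, then collect every non-overlapping match of TOKEN.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the tokens found in the sample text from the AP example line.
        System.out.println(tokenize("A 16-year-old student"));
    }
}
```

Because the pattern needs a letter at both ends and at least one character in between, tokens shorter than three characters (such as "A") and purely numeric tokens (such as "16") are dropped, while internal punctuation (as in "year-old") is kept.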
- Read your test data
InstanceList instances = new InstanceList (new SerialPipes(pipeList));
Reader fileReader = new InputStreamReader(new FileInputStream(ap_txt), "UTF-8");
instances.addThruPipe(new CsvIterator (fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),3, 2, 1));
// The numbers 3, 2, 1 are the regex groups for the data, label, and name fields
/* Don't freak out. All that's happening here is reading an input line into instances. */
// Example input: AP881218-0003 X A 16-year-old student
// So extract the ID, a label, and the text.
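The line pattern can be tried on the example input without MALLET. The sketch below uses only `java.util.regex`; `LinePatternDemo` and `parse` are hypothetical names introduced for this illustration, and the group numbers mirror the `3, 2, 1` arguments passed to `CsvIterator` above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of MALLET: splits one input line by hand.
class LinePatternDemo {
    // Same pattern as passed to CsvIterator above.
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {name, label, data} for one line, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the name (ID), group 2 the label, group 3 the text.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("AP881218-0003 X A 16-year-old student");
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("data  = " + fields[2]);
    }
}
```

The first two groups each take one whitespace-free field, and the third group takes the rest of the line, so the document text may contain spaces and commas.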
- Create a model object and specify the number of topics we want
// Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
// Note that the first parameter is passed as the sum over topics, while
// the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = 100;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
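The relationship between the `1.0` passed to the constructor and the `alpha_t = 0.01` in the comment is plain arithmetic. The sketch below only works through that division; `DirichletPriorDemo` and `perTopicAlpha` are hypothetical names for this illustration, and it assumes the sum is spread evenly over the topics, as the comment above implies.

```java
// Hypothetical helper, not part of MALLET: shows how the alpha sum maps to a per-topic value.
class DirichletPriorDemo {
    // Assumes a symmetric prior: the sum over topics is split evenly.
    static double perTopicAlpha(double alphaSum, int numTopics) {
        return alphaSum / numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 100;
        double alphaSum = 1.0; // first prior argument in the constructor call above
        double beta = 0.01;    // second prior argument, already a per-dimension value
        System.out.println("alpha_t = " + perTopicAlpha(alphaSum, numTopics));
        System.out.println("beta_w  = " + beta);
    }
}
```

So if the number of topics changes and the per-topic alpha should stay at 0.01, the first prior argument has to change with it (for example, 0.5 for 50 topics).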
- Create threads and run the model