Topic Model

Sharan Srivatsa edited this page Apr 7, 2018 · 1 revision

Step 1:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

This list holds the pipes that will be applied, in order, to every document. Pipes handle preprocessing steps such as lowercasing, tokenizing, removing stopwords, and mapping tokens to features.
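To make the idea of a pipe chain concrete, here is a minimal plain-Java sketch of the four stages mentioned above. It does not use MALLET: the class `PipeSketch`, its method names, and the two-word stopword set are made up for illustration, and the real pipes may behave differently in the details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeSketch {
    // A token is a letter, then one or more letters or punctuation marks, then a letter
    // (so every token is at least three characters long).
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Stages 1 and 2: lowercase the text, then split it into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Stage 3: drop every token that appears in the stopword set.
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopwords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stage 4: map each token to an integer id, assigning a new id the first time a token is seen.
    static int[] toFeatures(List<String> tokens, Map<String, Integer> alphabet) {
        int[] ids = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            Integer id = alphabet.get(t);
            if (id == null) {
                id = alphabet.size();
                alphabet.put(t, id);
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "and")); // hypothetical stopword list
        Map<String, Integer> alphabet = new LinkedHashMap<>();
        List<String> tokens = removeStopwords(tokenize("The student and the Teacher"), stopwords);
        System.out.println(tokens + " -> " + Arrays.toString(toFeatures(tokens, alphabet)));
    }
}
```

Each stage takes the output of the previous one, which is the same idea as chaining pipes in a list.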

Step 2:

// Read dependent files from the classpath
ClassLoader classLoader = TopicModel.class.getClassLoader();
File en_txt = new File(classLoader.getResource("en.txt").getFile()); // English stopword list
File ap_txt = new File(classLoader.getResource("ap.txt").getFile()); // example input file

Step 3:

  • Create your pipes
// Pipes: lowercase, tokenize, remove stopwords, map to features
pipeList.add( new CharSequenceLowercase() );
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
pipeList.add( new TokenSequenceRemoveStopwords(en_txt, "UTF-8", false, false, false) );
pipeList.add( new TokenSequence2FeatureSequence() );

Step 4:

  • Read your input data
InstanceList instances = new InstanceList (new SerialPipes(pipeList));

Reader fileReader = new InputStreamReader(new FileInputStream(ap_txt), "UTF-8");
// Regex groups 3, 2, and 1 are the data, label, and name fields, in that order.
instances.addThruPipe(new CsvIterator(fileReader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1));

// This looks dense, but all it does is read each input line into instances.
// Example input line: AP881218-0003 	X	A 16-year-old student
// The regex extracts the ID (name), the label, and the text (data) from that line.
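To see what the line regex above actually captures, here is a small standalone sketch using only `java.util.regex`. The class `LineFormatSketch` and its `parse` method are hypothetical names for illustration; only the pattern and the example line come from this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFormatSketch {
    // Same pattern as above: name, optional separators, label, optional separators, rest of line.
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");

    // Returns {name, label, data} for one input line, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = parse("AP881218-0003\tX\tA 16-year-old student");
        System.out.println("name  = " + fields[0]);
        System.out.println("label = " + fields[1]);
        System.out.println("data  = " + fields[2]);
    }
}
```

Group 1 is the ID, group 2 the label, and group 3 the text, which is why the iterator is given the group numbers 3, 2, 1 for data, label, and name.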

Step 5:

Create a model object and specify the number of topics that we want.

// Create a model with 100 topics, alpha_t = 0.01, beta_w = 0.01
//  Note that the first parameter is passed as the sum over topics, while
//  the second is the parameter for a single dimension of the Dirichlet prior.
int numTopics = 100;
ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
model.addInstances(instances);
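The comment above says the first prior is passed as a sum over topics. A tiny sketch of that arithmetic, assuming the sum is spread evenly across topics (the class and method names are hypothetical):

```java
class PriorSketch {
    // Per-topic alpha when the constructor receives the sum over all topics.
    static double alphaPerTopic(double alphaSum, int numTopics) {
        return alphaSum / numTopics;
    }

    public static void main(String[] args) {
        // With alphaSum = 1.0 and 100 topics, each topic gets alpha_t = 0.01.
        System.out.println(alphaPerTopic(1.0, 100));
    }
}
```

So if you change numTopics and want to keep the same per-topic alpha, the first constructor argument has to be scaled along with it, while beta is given per dimension and stays as is.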

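Step 6:

Train the model. The threading is handled internally by the model, so there is nothing to implement here: set the number of threads and iterations, then run the estimation (in MALLET's ParallelTopicModel this is typically done with setNumThreads, setNumIterations, and estimate).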