This repository has been archived by the owner on Feb 25, 2021. It is now read-only.

Training BTM Topic Model

BTM (the Biterm Topic Model) is a probabilistic graphical model inspired by LDA and designed for topic modeling of very short texts. The crucial insight in BTM is to directly model word co-occurrence patterns as being generated from the topic distribution. In contrast, LDA models the generation of each document rather than of word pairs.
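To make the contrast concrete, here is the core quantity as we understand it from the BTM paper (the notation below is ours). A biterm $b = (w_i, w_j)$ is an unordered pair of words that co-occur in the same short text, and the model assigns it the probability

$$
P(b) = \sum_{z=1}^{K} P(z)\, P(w_i \mid z)\, P(w_j \mid z)
$$

where $z$ ranges over the $K$ topics. Both $P(z)$ and $P(w \mid z)$ are shared across the whole corpus, so no per-document topic mixture is estimated during training. A document's topic proportions are recovered afterwards from its biterms:

$$
P(z \mid d) = \sum_{b \in d} P(z \mid b)\, P(b \mid d),
\qquad
P(z \mid b) = \frac{P(z)\, P(w_i \mid z)\, P(w_j \mid z)}{\sum_{z'} P(z')\, P(w_i \mid z')\, P(w_j \mid z')}
$$

with $P(b \mid d)$ taken as the relative frequency of biterm $b$ in document $d$.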

The BTM code is on GitHub and the paper is online.

We chose to use BTM instead of the Deep Autoencoder Topic Model (DATM) originally presented in the RevUP paper. DATM is a restricted Boltzmann machine with a modified cost function that encourages the model to learn latent topics that are selective (i.e., coherent and strongly encoding the latent topic) and sparse (i.e., sentences do not express a plethora of latent topics).

Our decision to use BTM rather than DATM was motivated by practical reasons. Since this proof-of-concept work was restricted to a short timeframe (one working week), it was important not to become bogged down in deep, detailed work that would not help realize the entire concept. As a result, we searched for existing, off-the-shelf topic modeling programs to use.

A key insight gleaned from the RevUP research is that traditional topic models fail on short texts due to information sparsity. Specifically, the RevUP authors conclude that the main failing of document-oriented topic models (e.g., LDA) applied to short texts is an inherent sparsity of document-word co-occurrences. In a traditional document, one expects there to be multiple instances of many words. However, in a short text or sentence, it's unlikely that we'd encounter any word more than once or twice. As a result, a document-based topic modeling algorithm will only have a few words to use per short text during learning.

Importantly, the main method for circumventing this data-sparsity issue is to model language patterns as they occur across the entire corpus, and to assume that the topics directly influence these language patterns, rather than influencing the documents, which in turn influence the language patterns. These ideas and this new set of assumptions underlie the BTM algorithm.
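As a rough illustration of what "language patterns across the entire corpus" means here, the sketch below pools unordered word pairs (biterms) from every short text into one corpus-wide count. It is not the BTM implementation: the three sentences are made-up toy data, and real preprocessing (stop-word removal, word indexing) is skipped.

```shell
# Toy corpus of short texts (made-up illustration data, not biology.txt).
corpus=$(mktemp)
printf '%s\n' \
  'cell membrane protein' \
  'cell membrane lipid' \
  'protein folding' > "$corpus"

# For each line, emit every pair of word positions i < j. The two words are
# ordered alphabetically so that "a b" and "b a" count as the same biterm.
biterm_counts=$(awk '{
  for (i = 1; i <= NF; i++)
    for (j = i + 1; j <= NF; j++) {
      a = $i; b = $j
      if (a > b) { t = a; a = b; b = t }
      count[a " " b]++
    }
}
END { for (k in count) print count[k], k }' "$corpus" | sort -k1,1nr -k2)

echo "$biterm_counts"
rm -f "$corpus"
```

No single sentence repeats a word, yet the pair "cell membrane" is seen twice once the counts are pooled. That pooled signal is what a corpus-level model can learn from even when each text is tiny.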

Because the BTM code is readily available and well suited to topic modeling of short texts, it serves as an excellent substitute for DATM.

Commands

In order to re-train the topic model as it's used in this project, do the following:

First, clone the BTM repo:

git clone git@github.com:xiaohuiyan/BTM.git

Second, copy the biology.txt file to the BTM/sample-data/ directory under the name doc_info.txt:

cp $DEV/auto-gfqg/data/from_authors/biology.txt $DEV/BTM/sample-data/doc_info.txt

Third, modify the runExample.sh script so that K=25. This Bash command will do the trick:

sed -i 's/K=50/K=25/g' $DEV/BTM/script/runExample.sh
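The substitution silently does nothing if the script's K line is not exactly K=50, and a bare `-i` is rejected by BSD/macOS sed. A slightly more defensive variant, sketched here against a temporary stand-in file rather than the real runExample.sh (whose other contents we do not reproduce), checks for the line first and uses the `-i.bak` form, which both GNU and BSD sed accept:

```shell
# Hypothetical stand-in for script/runExample.sh, for illustration only.
script=$(mktemp)
printf '#!/bin/bash\nK=50\n' > "$script"

if grep -q '^K=50' "$script"; then
  # -i.bak edits in place and keeps a backup copy next to the file.
  sed -i.bak 's/^K=50/K=25/' "$script"
else
  echo "No K=50 line found; check the current value of K by hand." >&2
fi

# Confirm what the script will now use.
k_line=$(grep '^K=' "$script")
echo "$k_line"
rm -f "$script" "$script.bak"
```

To apply this to the real script, point `script` at $DEV/BTM/script/runExample.sh and drop the `printf` and `rm` lines.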

Finally, move into the script/ directory and execute the script:

cd $DEV/BTM/script/
time ./runExample.sh

After the script finishes, the output directory in the root of the BTM repository will hold all relevant information. Importantly, the learned model parameters are in output/model/ as the files:

  • k25.pw_z: conditional word probability given a topic
  • k25.pz: prior probability of each topic
  • k25.pz_d: posterior topic probability for each sentence

The voca.txt file consists of every unique word in the corpus, and the doc_wids.txt file consists of the indices of all words that occur in each sentence.
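To sanity-check a trained model, it helps to list the highest-probability words per topic. The sketch below does this on toy stand-in files. It assumes, without having verified it against the BTM code, that voca.txt holds one "id<TAB>word" pair per line with 0-based ids, and that the pw_z file holds one whitespace-separated row of word probabilities per topic. Adjust it if the real files differ.

```shell
workdir=$(mktemp -d)
# Toy stand-ins: a 4-word vocabulary and a 2-topic P(w|z) matrix.
printf '0\tcell\n1\tmembrane\n2\tgene\n3\tdna\n' > "$workdir/voca.txt"
printf '0.5 0.4 0.05 0.05\n0.1 0.1 0.3 0.5\n' > "$workdir/toy.pw_z"

top_words=$(awk -v n=2 '
  NR == FNR { word[$1] = $2; next }     # first file: map word id -> word
  {
    line = "topic " (FNR - 1) ":"
    for (r = 1; r <= n; r++) {          # pick the n largest entries of this row
      best = 0
      for (i = 1; i <= NF; i++)
        if (!(i in used) && (best == 0 || $i > $best)) best = i
      used[best] = 1
      line = line " " word[best - 1]    # column i corresponds to word id i-1
    }
    split("", used)                     # clear the set before the next topic
    print line
  }' "$workdir/voca.txt" "$workdir/toy.pw_z")

echo "$top_words"
rm -rf "$workdir"
```

For the real model, the two file arguments would be the voca.txt and k25.pw_z files in the output directory, with `n` raised to however many words per topic are worth reading.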

NOTE

Make sure that DEV is where you checked out the auto-gfqg and BTM repositories.
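A small check along these lines can catch a wrong DEV before any of the commands above run. The default path is a hypothetical placeholder; substitute wherever the two repositories actually live.

```shell
# Print the name of each expected checkout that is missing under a directory.
missing_repos() {
  local dev_dir=$1 repo
  for repo in auto-gfqg BTM; do
    [ -d "$dev_dir/$repo" ] || echo "$repo"
  done
}

# Hypothetical default location, used only if DEV is not already set.
export DEV="${DEV:-$HOME/dev}"

missing=$(missing_repos "$DEV")
if [ -n "$missing" ]; then
  echo "Not found under $DEV:" $missing >&2
fi
```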