Demo 1: Toy problem in text categorization

This demo starts from a freshly built (ant clean build) copy of ProPPR, and assumes basic familiarity with the bash shell.

First, we'll create our dataset.

Dataset creation

Make a place for the dataset to live:

ProPPR$ mkdir demos
ProPPR$ cd demos/
ProPPR/demos$ mkdir textcattoy
ProPPR/demos$ cd textcattoy/
ProPPR/demos/textcattoy$

Now we'll write our rules file. This file will be used in the inference step, and tells ProPPR how to generate a list of candidate answers to your queries, and how to apply features to different steps of the proof. These features will, in turn, be used in the learning step.

ProPPR/demos/textcattoy$ cat > textcattoy.rules
predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y)  #r.
related(W,Y) :- # w(W,Y).
^D

For this demo, we want to answer queries of the form predict(someDocument,?) to predict the correct label for a document. Let's look at the first rule.

predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y)  #r.

ProPPR rules look a lot like rules in Prolog: they are clauses. This one says, in English, "Predict label Y for a document X if X has a word W and W is related to a label Y." The order of the goals in the body of the rule is important, because we want to bind the variables in order from left to right. When the rule is used, X will be bound to a document. We then use hasWord to look up a word in the document, and bind W to that word. We then use isLabel to look up a label, and bind Y to that label. Then we use related to see whether W and Y are related. If they are, then Y is returned as a valid label for the document. ProPPR doesn't stop at the first valid label, but will proceed through all possible bindings for Y, checking all labels to see if they are related to W. Then it will go through all the other words in the document as alternate bindings for W, and check all the labels for each word. Unsuccessful combinations are discarded.

Every time a particular binding of Y makes the body of the clause true, that binding gets a little more credit in the system. So if a lot of words in the document are related to label "foo", and fewer words are related to label "bar", then the overall ranking of the labels will have "foo" first and "bar" second.

We'll store the lookup tables for hasWord and isLabel in a database, which we'll create in a moment. related will be defined as a rule later on in the file.

The last part of the rule comes after the '#' character, and is a list of features. Whenever ProPPR invokes this rule, it will mark that transition with the feature r, which in this case is just a placeholder.

Now let's look at the second rule.

related(W,Y) :- # w(W,Y).

This rule says, in English, "A word W is always related to a label Y." Because there is no body for this rule, it always succeeds. The smart part is in the feature.

The feature for this rule includes the bindings of W and Y, which effectively generates a different feature for every W-Y combination produced by the first rule: for example, when W is bound to red and Y to pos, the instantiated feature is w(red,pos). During learning, ProPPR can then use these features to discover which words are most closely related to which labels, across all the words, labels, and documents in the training set.

Now let's go back and create the databases for hasWord and isLabel.

ProPPR supports a few different types of databases, but the most basic is a cfacts file, which is tab-separated:

ProPPR/demos/textcattoy$ cat > labels.cfacts
isLabel	neg
isLabel	pos
^D

Here we've said that neg is a label and pos is a label.

The hasWord database is a little bigger, so we'll start with an easy-to-read format and then use a shell script to convert it.

ProPPR/demos/textcattoy$ cat > documents.txt 
dh	a pricy doll house
ft	a little red fire truck
rw	a red wagon
sc	a pricy red sports car
bk	punk queen barbie and ken
rb	a little red bike
mv	a big 7-seater minivan with an automatic transmission
hs	a big house in the suburbs with a crushing mortgage
ji	a job for life at IBM
tf	a huge pile of tax forms due yesterday
jm	a huge pile of junk mail bills and catalogs
pb	a pricy barbie doll
yc	a little yellow toy car
rb2	a red 10 speed bike
rp	a red convertible porshe
bp	a big pile of paperwork
he	a huge backlog of email
wt	a life of woe and trouble
^D

Here we've listed a bunch of documents, with the document identifier in one field and the document text in the other. Now we need to put each word in a separate entry in the database, and label it with the functor hasWord. I like awk for stuff like this. We'll test the script first to see that it looks good, then write the whole thing to a cfacts file.

ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); \
for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt | head
hasWord	dh	a
hasWord	dh	pricy
hasWord	dh	doll
hasWord	dh	house
hasWord	ft	a
hasWord	ft	little
hasWord	ft	red
hasWord	ft	fire
hasWord	ft	truck
hasWord	rw	a
ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); \
for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt > hasWord.cfacts
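
If you'd like a quick sanity check before moving on (this is optional and not part of ProPPR), the one-liner below verifies that every line of the new file has exactly three tab-separated fields; it should report zero malformed lines:

ProPPR/demos/textcattoy$ awk -F'\t' 'NF != 3 { bad++ } END { print NR " facts, " bad+0 " malformed lines" }' hasWord.cfacts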

Now that we have our rules file and database files, we have everything we need for inference.

Next we have to prepare our labelled data for the learning phase. Labelled data for ProPPR is in a tab-separated format with one example per line, starting with the query and then listing solutions labelled with + (correct/positive) or - (incorrect/negative). For ProPPR to work properly, each example must have at least one positive and at least one negative solution, and these solutions must be reachable by the logic program as defined in the rules file. If you only have + solutions available, you can have QueryAnswerer give you other reachable solutions from which you can sample - labels. For this problem, though, we have both + and - labels. We'll use the first 11 documents for training, and the rest for testing:

ProPPR/demos/textcattoy$ cat > train.data 
predict(dh,Y)	-predict(dh,neg)	+predict(dh,pos)
predict(ft,Y)	-predict(ft,neg)	+predict(ft,pos)
predict(rw,Y)	-predict(rw,neg)	+predict(rw,pos)
predict(sc,Y)	-predict(sc,neg)	+predict(sc,pos)
predict(bk,Y)	-predict(bk,neg)	+predict(bk,pos)
predict(rb,Y)	-predict(rb,neg)	+predict(rb,pos)
predict(mv,Y)	+predict(mv,neg)	-predict(mv,pos)
predict(hs,Y)	+predict(hs,neg)	-predict(hs,pos)
predict(ji,Y)	+predict(ji,neg)	-predict(ji,pos)
predict(tf,Y)	+predict(tf,neg)	-predict(tf,pos)
predict(jm,Y)	+predict(jm,neg)	-predict(jm,pos)
^D
ProPPR/demos/textcattoy$ cat > test.data 
predict(pb,Y)	-predict(pb,neg)	+predict(pb,pos)
predict(yc,Y)	-predict(yc,neg)	+predict(yc,pos)
predict(rb2,Y)	-predict(rb2,neg)	+predict(rb2,pos)
predict(rp,Y)	-predict(rp,neg)	+predict(rp,pos)
predict(bp,Y)	+predict(bp,neg)	-predict(bp,pos)
predict(he,Y)	+predict(he,neg)	-predict(he,pos)
predict(wt,Y)	+predict(wt,neg)	-predict(wt,pos)
^D
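
As noted above, every example needs at least one + and at least one - solution. A quick optional check (again, just a shell convenience, not a ProPPR tool) will flag any line that's missing one:

ProPPR/demos/textcattoy$ awk -F'\t' '{ p = n = 0
  for (i = 2; i <= NF; i++) { if ($i ~ /^\+/) p++; if ($i ~ /^-/) n++ }
  if (!p || !n) print FILENAME ": line " FNR " needs both a + and a - solution" }' train.data test.data

If everything is formatted correctly, this prints nothing.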

Now we have all the raw data we need, and we can start running ProPPR tools.

Compiling the dataset

Because we used tab-separated formats for our database, the only thing we need to compile is the rules file. ProPPR has a script to do this; it has to be run from the root ProPPR directory, so we'll hop up there and then come back to our dataset.

ProPPR/demos/textcattoy$ cd ../../
ProPPR$ scripts/compile.sh demos/textcattoy/
compiling demos/textcattoy/textcattoy.rules to demos/textcattoy/textcattoy.crules
parsing demos/textcattoy/textcattoy.rules
ProPPR$ cd -
ProPPR/demos/textcattoy$

The resulting .crules format isn't binary, but it is a bit hard for humans to read: each rule is flattened onto one line, with variables replaced by numeric indices (-1, -2, ...), followed by the list of features and then the original variable names. This is why we usually make edits in the plain .rules format.

ProPPR/demos/textcattoy$ cat textcattoy.crules 
predict,-1,-2 & hasWord,-1,-3 & isLabel,-2 & related,-3,-2 # r # X,Y,W
related,-1,-2 # w,-1,-2 # W,Y

On to inference!

Inference: Tester

Let's see how ProPPR does without any training. Depending on how the rules are written, the untrained ranking can already be quite sophisticated, so it's usually a good idea to collect untrained results as a baseline.

ProPPR/demos/textcattoy$ export PROPPR=/path/to/ProPPR
ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester \
--programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [Tester] All answers ranked equally for query goal(predict,c[pb],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[yc],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rb2],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[bp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[he],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[wt],v[-2])
INFO [Tester] pairTotal 7.0 pairErrors 7.0 errorRate 1.0 map 0.7142857142857143
result= running time 531
result= pairs 7.0 errors 7.0 errorRate 1.0 map 0.7142857142857143

...while the default ranking can be quite sophisticated, in this case it's not. ProPPR's pairwise performance metric looks at every positively labeled solution for a query, compares its score against every negatively labeled solution for the same query, and counts that pair as correct if the positive solution is ranked higher. Ties count as incorrect rankings, which is what the "All answers ranked equally" messages are about. The MAP is based on this pairwise metric, and is thus a bit counterintuitive. We're working on better metrics to report, so feel free to crack the hood and send us a pull request if you develop a nice one.
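
To make that pairwise metric concrete, here's a small illustrative awk sketch. It assumes a hypothetical tab-separated file, scores.tsv, with one solution per line (query, + or - label, score); Tester computes all of this internally and doesn't write such a file, so the file and its format are purely for illustration:

# pairwise.awk (illustration only)
# For each query, compare the score of every + solution against every - solution.
# A pair counts as an error when the + score is not strictly higher; ties are errors.
BEGIN { FS = "\t" }
$2 == "+" { pscore[$1, ++pcnt[$1]] = $3 }
$2 == "-" { nscore[$1, ++ncnt[$1]] = $3 }
END {
  for (q in pcnt)
    for (i = 1; i <= pcnt[q]; i++)
      for (j = 1; j <= ncnt[q]; j++) {
        total++
        if (pscore[q, i] + 0 <= nscore[q, j] + 0) errors++
      }
  printf "pairs %d errors %d errorRate %.3f\n", total, errors, errors / total
}

Running it as awk -f pairwise.awk scores.tsv would reproduce the pairs/errors/errorRate numbers above. Untrained, every test query here ties its + and - solutions, so all 7 pairs count as errors.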

Let's back up and look at that command line:

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester \
--programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data

edu.cmu.ml.praprolog.Tester is a metric-focused inference class. There are other inference classes we could run, but Tester gives a good first look at performance.

--programFiles is the list of rule files and database files we need to run our logic program.

--test gives the file containing the queries and labels we want Tester to run.

There are loads of other options we could include to specify which proof engine we want it to use, how accurate we want it to be, and to activate multithreading for faster computation on machines with multiple cores. We'll talk about some of those later.

In the meantime, we clearly need to train.

Learning: Trainer

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Trainer \
--programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --train train.data --output train.cooked --params params.wts
INFO [Trainer] flags: 0x46b9
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Trainer] Cooking train.data...
INFO [ExampleCooker] totalPos: 11 totalNeg: 11 coveredPos: 11 coveredNeg: 11
INFO [ExampleCooker] For positive examples 11/11 proveable [100.0%]
INFO [ExampleCooker] For negative examples 11/11 proveable [100.0%]
INFO [Trainer] Finished cooking in 846 ms
INFO [Trainer] Training model parameters on train.cooked...
INFO [Trainer] Training on cooked examples...
INFO [Trainer] epoch 1 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 2 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 3 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 4 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 5 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] Finished in 1292 ms
INFO [Trainer] Finished training in 1294 ms
INFO [Trainer] Saving parameters to params.wts...

Breaking down that command line,

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Trainer \
--programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --train train.data --output train.cooked --params params.wts

edu.cmu.ml.praprolog.Trainer is the learning class; unlike inference, it's really the only choice here.

--programFiles is here again because we need to do inference on the training examples before we can train the system. This part of the process is called "grounding" in our publications and "cooking" in the code.

--train specifies the file containing the queries and labels we want Trainer to run.

--output gives the filename where we want Trainer to save the grounded/cooked training examples.

--params gives the filename where we want Trainer to save the trained feature weights.

There are loads of other options we could include at training time, including alternate learning scaffolds and algorithms, and parameters controlling accuracy, the balance between log loss and regularization, and how feature weights are combined. We won't cover those in this example, but we're currently collecting data that will help us write guidelines on how to get the best performance for different kinds of datasets.

In the meantime, we can peek at the weights Trainer came up with:

ProPPR/demos/textcattoy$ head params.wts 
#! weightingScheme=tanh
#! programFiles=textcattoy.crules:labels.cfacts:hasWord.cfacts
#! prover=dpr:0.000100000:0.100000:throw
w(junk,pos)	0.996540
r	0.970365
w(barbie,pos)	1.02734
w(with,neg)	1.03418
w(due,pos)	0.986711
w(catalogs,pos)	0.989665
w(crushing,neg)	1.01162

Here we can see how the w(W,Y) feature specification in the rules file created a different feature for different word-label combinations, and how each of those features received a different weight. When we re-run inference using these weights in place of the defaults, ProPPR will generate a different ranking of the candidate solutions. If training has worked effectively, the correct solutions will be at the top of the list.
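
As a rough intuition (ProPPR's actual ranking comes from a random-walk process over the proof graph, not a simple sum), you can approximate the effect of these weights by adding up the learned w(word,label) weight for each word in a document, falling back to the default weight of 1.0 for words the trainer never saw; the Tester warnings in the next section show that default. Here's a sketch for the test document pb ("a pricy barbie doll"):

ProPPR/demos/textcattoy$ awk -F'\t' -v words="a pricy barbie doll" '
  BEGIN { n = split(words, w, " ") }
  NF == 2 { wt[$1] = $2 }                    # load feature weights from params.wts
  END {
    labels[1] = "pos"; labels[2] = "neg"
    for (l = 1; l <= 2; l++) {
      s = 0
      for (i = 1; i <= n; i++) {
        f = "w(" w[i] "," labels[l] ")"
        s += (f in wt) ? wt[f] : 1.0         # unknown features keep the default weight 1.0
      }
      print labels[l], s
    }
  }' params.wts

If a document's words carry more total weight for one label than the other, that label tends to rise in the real ranking as well, but again, this sum is only an intuition, not ProPPR's actual computation.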

Post-Training Inference: Tester

Now we'll run inference on our test set using the trained weights:

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester \
--programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data --params params.wts 
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[pos]) (this message only prints once)
INFO [Tester] pairTotal 7.0 pairErrors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286
result= running time 537
result= pairs 7.0 errors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286

Before training, we made 7 pair errors in 7 pairs for a 100% error rate. After training, we made only 1 pair error. It worked!

There's one new command-line option:

--params specifies the trained weights file.

The warnings about using a default weight appear because our test set includes some words that weren't in our training set. A few of these non-overlapping features are fine, but if the number of unknown features approaches the number of trained features, that's a sign your training and testing data are too different and you won't get good transfer.
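
One rough way to gauge that overlap (just a shell sketch, not a ProPPR tool) is to list the distinct words in the test documents and count how many of them already have a trained w(word,...) weight in params.wts:

ProPPR/demos/textcattoy$ for id in pb yc rb2 rp bp he wt; do
    awk -F'\t' -v d="$id" '$1 == "hasWord" && $2 == d { print $3 }' hasWord.cfacts
  done | sort -u | while read w; do
    if grep -qF "w($w," params.wts; then echo known; else echo unknown; fi
  done | sort | uniq -c

If the unknown count starts to rival the known count, it's worth asking whether your training documents are representative of the ones you'll be testing on.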