krivard edited this page Aug 21, 2014 · 11 revisions

Demo 1: Toy problem in text categorization

This demo starts from a freshly built (ant clean build) copy of ProPPR, and assumes basic familiarity with the bash shell.

First, we'll create our dataset.

Dataset creation

Make a place for the dataset to live:

ProPPR$ mkdir demos
ProPPR$ cd demos/
ProPPR/demos$ mkdir textcattoy
ProPPR/demos$ cd textcattoy/
ProPPR/demos/textcattoy$

Now we'll write our rules file. This file will be used in the inference step, and tells ProPPR how to generate a list of candidate answers to your queries, and how to apply features to different steps of the proof. These features will, in turn, be used in the learning step.

ProPPR/demos/textcattoy$ cat > textcattoy.rules
predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y)  #r.
related(W,Y) :- # w(W,Y).
^D

For this demo, we want to answer queries of the form predict(someDocument,?) to predict the correct label for a document. Let's look at the first rule.

predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y)  #r.

ProPPR rules look a lot like rules in Prolog: they are clauses. This one says, in English, "Predict label Y for a document X if X has a word W and W is related to a label Y." The order of the goals in the body of the rule is important, because we want to bind the variables in order from left to right.

When the rule is used, X will be bound to a document. We then use hasWord to look up a word in the document, and bind W to that word. We then use isLabel to look up a label, and bind Y to that label. Finally, we use related to check whether W and Y are related. If they are, then Y is returned as a valid label for the document. ProPPR doesn't stop at the first valid label: it proceeds through all possible bindings for Y, checking every label against W, and then through all the other words in the document as alternate bindings for W, checking every label for each word. Unsuccessful combinations are discarded.

Every time a particular binding of Y makes the body of the clause true, it gets a little more credit in the system. So if a lot of words in the document are related to label "foo", and fewer words are related to label "bar", then the overall ranking of the labels will have "foo" first and "bar" second.
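That intuition can be sketched with a quick shell experiment. The related pairs below are hypothetical stand-ins for what ProPPR will eventually learn, and the tally is only an illustration of the ranking intuition, not ProPPR's actual scoring (which is based on random-walk probabilities over the proof graph):

```shell
# Hypothetical related(W,Y) pairs (word<TAB>label) -- stand-ins, not real data.
printf 'pricy\tpos\ndoll\tpos\nmortgage\tneg\n' > related.tsv

# Words of document dh ("a pricy doll house"), one per line.
echo "a pricy doll house" | tr ' ' '\n' > docwords.txt

# Count, per label, how many document words hit a related pair.
awk 'NR==FNR { words[$1]; next }
     ($1 in words) { count[$2]++ }
     END { for (y in count) print y, count[y] }' docwords.txt related.tsv
# pos 2
```

Here pos gets credit from two words (pricy, doll) while neg gets none, so pos would rank first for dh.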

We'll store the lookup tables for hasWord and isLabel in a database, which we'll create in a moment. related, on the other hand, is defined by the second rule in the file.

The last part of the rule comes after the '#' character, and is a list of features. Whenever ProPPR invokes this rule, it will mark that transition with the feature r, which in this case is just a placeholder.

Now let's look at the second rule.

related(W,Y) :- # w(W,Y).

This rule says, in English, "A word W is always related to a label Y." Because the rule has no body, it always succeeds. The smart part is in the feature.

The feature for this rule includes the bindings of W and Y, which effectively generates a different feature for every W-Y combination reached from the first rule. During learning, ProPPR uses these features to discover which words are most closely related to which labels, across all the words, labels, and documents in the training set.

Now let's go back and create the databases for hasWord and isLabel.

ProPPR supports a few different types of databases, but the most basic is a cfacts file, which is tab-separated:

ProPPR/demos/textcattoy$ cat > labels.cfacts
isLabel	neg
isLabel	pos
^D

Here we've said that neg is a label and pos is a label.
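Because the format is whitespace-sensitive, a quick sanity check can save debugging later. This is a hypothetical one-liner, not a ProPPR tool; it just verifies that every line has a tab-separated functor plus at least one argument:

```shell
# Recreate the file from above, then verify each line has >= 2 tab-separated fields.
printf 'isLabel\tneg\nisLabel\tpos\n' > labels.cfacts
awk -F'\t' 'NF < 2 { print "bad line " NR ": " $0; bad = 1 }
            END { exit bad }' labels.cfacts && echo "labels.cfacts OK"
# labels.cfacts OK
```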

The hasWord database is a little bigger, so we'll start with an easy-to-read format and then use a shell script to convert it.

ProPPR/demos/textcattoy$ cat > documents.txt 
dh	a pricy doll house
ft	a little red fire truck
rw	a red wagon
sc	a pricy red sports car
bk	punk queen barbie and ken
rb	a little red bike
mv	a big 7-seater minivan with an automatic transmission
hs	a big house in the suburbs with a crushing mortgage
ji	a job for life at IBM
tf	a huge pile of tax forms due yesterday
jm	a huge pile of junk mail bills and catalogs
pb	a pricy barbie doll
yc	a little yellow toy car
rb2	a red 10 speed bike
rp	a red convertible porshe
bp	a big pile of paperwork
he	a huge backlog of email
wt	a life of woe and trouble
^D

Here we've listed a bunch of documents, with the document identifier in one field and the document text in the other. Now we need to put each word in a separate entry in the database, and label it with the functor hasWord. I like awk for stuff like this. We'll test the script first to see that it looks good, then write the whole thing to a cfacts file.

ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt | head
hasWord	dh	a
hasWord	dh	pricy
hasWord	dh	doll
hasWord	dh	house
hasWord	ft	a
hasWord	ft	little
hasWord	ft	red
hasWord	ft	fire
hasWord	ft	truck
hasWord	rw	a
ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt > hasWord.cfacts
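As a quick consistency check on the conversion (hypothetical sample filenames; substitute documents.txt and hasWord.cfacts for the real data), the number of generated facts should equal the total word count of the text column:

```shell
# Sketch on a two-document sample, using the same awk conversion as above.
printf 'd1\ta b c\nd2\tx y\n' > sample.txt
awk 'BEGIN{FS=OFS="\t"}{n = split($2, w, " "); for (i = 1; i <= n; i++) print "hasWord", $1, w[i]}' sample.txt > sample.cfacts

words=$(awk -F'\t' '{ n += split($2, w, " ") } END { print n }' sample.txt)
facts=$(wc -l < sample.cfacts)
[ "$words" -eq "$facts" ] && echo "counts match"
# counts match
```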

Now that we have our rules file and database files, we have everything we need for inference.

Next we have to prepare our labelled data for the learning phase. Labelled data for ProPPR is in a tab-separated format with one example per line, starting with the query and then listing solutions labelled with + (correct/positive) or - (incorrect/negative). For ProPPR to work properly, each example must have at least one positive and at least one negative solution, and these solutions must be reachable by the logic program as defined in the rules file. If you only have + solutions available, you can have QueryAnswerer generate other reachable solutions from which you can sample - labels. For this problem, though, we have both + and - labels. We'll use the first 11 documents for training, and the rest for testing:

ProPPR/demos/textcattoy$ cat > train.data 
predict(dh,Y)	-predict(dh,neg)	+predict(dh,pos)
predict(ft,Y)	-predict(ft,neg)	+predict(ft,pos)
predict(rw,Y)	-predict(rw,neg)	+predict(rw,pos)
predict(sc,Y)	-predict(sc,neg)	+predict(sc,pos)
predict(bk,Y)	-predict(bk,neg)	+predict(bk,pos)
predict(rb,Y)	-predict(rb,neg)	+predict(rb,pos)
predict(mv,Y)	+predict(mv,neg)	-predict(mv,pos)
predict(hs,Y)	+predict(hs,neg)	-predict(hs,pos)
predict(ji,Y)	+predict(ji,neg)	-predict(ji,pos)
predict(tf,Y)	+predict(tf,neg)	-predict(tf,pos)
predict(jm,Y)	+predict(jm,neg)	-predict(jm,pos)
^D
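Since ProPPR requires both a + and a - solution on every example, it's worth validating the file. Here is a hypothetical check (not part of ProPPR) run on a one-line sample; the same awk program can be pointed at train.data or test.data:

```shell
# One sample line in the training format (tab-separated).
printf 'predict(dh,Y)\t-predict(dh,neg)\t+predict(dh,pos)\n' > sample.data

# Flag any line lacking a + solution or a - solution; exit nonzero if found.
awk -F'\t' '{
  pos = neg = 0
  for (i = 2; i <= NF; i++) {
    if ($i ~ /^\+/) pos++
    else if ($i ~ /^-/) neg++
  }
  if (!pos || !neg) { print "line " NR " missing a + or - solution"; bad = 1 }
} END { exit bad }' sample.data && echo "sample.data OK"
# sample.data OK
```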
ProPPR/demos/textcattoy$ cat > test.data
predict(pb,Y)	-predict(pb,neg)	+predict(pb,pos)
predict(yc,Y)	-predict(yc,neg)	+predict(yc,pos)
predict(rb2,Y)	-predict(rb2,neg)	+predict(rb2,pos)
predict(rp,Y)	-predict(rp,neg)	+predict(rp,pos)
predict(bp,Y)	+predict(bp,neg)	-predict(bp,pos)
predict(he,Y)	+predict(he,neg)	-predict(he,pos)
predict(wt,Y)	+predict(wt,neg)	-predict(wt,pos)
^D

Now we have all the raw data we need, and we can start running ProPPR tools.

First, compile the rules file into ProPPR's internal format:

ProPPR/demos/textcattoy$ cd ../../
ProPPR$ scripts/compile.sh demos/textcattoy/
compiling demos/textcattoy/textcattoy.rules to demos/textcattoy/textcattoy.crules
parsing demos/textcattoy/textcattoy.rules
ProPPR$ cd -

Next, run the Tester on the untrained model to get a baseline:

ProPPR/demos/textcattoy$ export PROPPR=~/git/rinkitink/ProPPR
ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [Tester] All answers ranked equally for query goal(predict,c[pb],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[yc],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rb2],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[bp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[he],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[wt],v[-2])
INFO [Tester] pairTotal 7.0 pairErrors 7.0 errorRate 1.0 map 0.7142857142857143
result= running time 531
result= pairs 7.0 errors 7.0 errorRate 1.0 map 0.7142857142857143

With no trained parameters, every answer is ranked equally, as the warnings indicate. Now train on train.data:

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/:$PROPPR/conf/ edu.cmu.ml.praprolog.Trainer --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --train train.data --output train.cooked --params params.wts
INFO [Trainer] flags: 0x46b9
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Trainer] Cooking train.data...
INFO [ExampleCooker] totalPos: 11 totalNeg: 11 coveredPos: 11 coveredNeg: 11
INFO [ExampleCooker] For positive examples 11/11 proveable [100.0%]
INFO [ExampleCooker] For negative examples 11/11 proveable [100.0%]
INFO [Trainer] Finished cooking in 846 ms
INFO [Trainer] Training model parameters on train.cooked...
INFO [Trainer] Training on cooked examples...
INFO [Trainer] epoch 1 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 2 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 3 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 4 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 5 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] Finished in 1292 ms
INFO [Trainer] Finished training in 1294 ms
INFO [Trainer] Saving parameters to params.wts...

The learned weights are stored in params.wts, one feature per line:

ProPPR/demos/textcattoy$ head params.wts
#! weightingScheme=tanh
#! programFiles=textcattoy.crules:labels.cfacts:hasWord.cfacts
#! prover=dpr:0.000100000:0.100000:throw
w(junk,pos) 0.996540
r 0.970365
w(barbie,pos) 1.02734
w(with,neg) 1.03418
w(due,pos) 0.986711
w(catalogs,pos) 0.989665
w(crushing,neg) 1.01162

Finally, run the Tester again with the learned parameters:

ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data --params params.wts
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[pos]) (this message only prints once)
INFO [Tester] pairTotal 7.0 pairErrors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286
result= running time 537
result= pairs 7.0 errors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286

The warnings are expected: words that appear only in the test documents were never seen in training, so their features fall back to the default weight. Even so, the error rate drops from 1.0 to 0.14 and MAP rises from 0.71 to 0.93, meaning the learned word-label weights let ProPPR rank the correct label first for six of the seven test documents.