Walkthroughs
This demo starts from a freshly built (`ant clean build`) copy of ProPPR, and assumes basic familiarity with the bash shell.
First, we'll create our dataset.
Make a place for the dataset to live:
ProPPR$ mkdir demos
ProPPR$ cd demos/
ProPPR/demos$ mkdir textcattoy
ProPPR/demos$ cd textcattoy/
ProPPR/demos/textcattoy$
Now we'll write our rules file. This file will be used in the inference step, and tells ProPPR how to generate a list of candidate answers to your queries, and how to apply features to different steps of the proof. These features will, in turn, be used in the learning step.
ProPPR/demos/textcattoy$ cat > textcattoy.rules
predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y) #r.
related(W,Y) :- # w(W,Y).
^D
For this demo, we want to answer queries of the form `predict(someDocument,?)` to predict the correct label for a document. Let's look at the first rule.
predict(X,Y) :- hasWord(X,W),isLabel(Y),related(W,Y) #r.
ProPPR rules look a lot like rules in Prolog: they are clauses. This one says, in English, "Predict label Y for a document X if X has a word W and W is related to a label Y." The order of the goals in the body of the rule is important, because we want to bind the variables in order from left to right. When the rule is used, X will be bound to a document. We then use `hasWord` to look up a word in the document, and bind W to that word. We then use `isLabel` to look up a label, and bind Y to that label. Then we use `related` to see whether W and Y are related. If they are, then Y is returned as a valid label for the document. ProPPR doesn't stop at the first valid label, but proceeds through all possible bindings for Y, checking all labels to see if they are related to W. Then it goes through all the other words in the document as alternate bindings for W, and checks all the labels for each word. Unsuccessful combinations are discarded.
Every time a particular binding of Y makes the body of the clause true, it gets a little more credit in the system. So if a lot of words in the document are related to label "foo", and fewer words are related to label "bar", then the overall ranking of the labels will have "foo" first and "bar" second.
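To get a feel for that ranking intuition, here's a minimal sketch that just counts how many words in a document are related to each label. The `related` pairs below are invented for illustration, and ProPPR's real scores come from random-walk probabilities rather than raw counts, so treat this only as an approximation of the idea:

```shell
# Hypothetical sketch: rank labels by counting word-label matches.
# The 'related' pairs are made up; ProPPR's actual scoring uses
# personalized PageRank over the proof graph, not raw counts.
doc="a pricy doll house"
related="pricy:pos doll:pos pricy:neg"
scores=$(for y in neg pos; do
  n=0
  for w in $doc; do
    # substring match against the space-delimited pair list
    case " $related " in *" $w:$y "*) n=$((n+1)) ;; esac
  done
  echo "$y $n"
done)
echo "$scores"
```

Here two words support `pos` and only one supports `neg`, so `pos` would rank first, mirroring the credit-accumulation behavior described above.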
We'll store the lookup tables for `hasWord` and `isLabel` in a database, which we'll create in a moment. `related` will be defined as a rule later on in the file.
The last part of the rule comes after the '#' character, and is a list of features. Whenever ProPPR invokes this rule, it will mark that transition with the feature `r`, which in this case is just a placeholder.
Now let's look at the second rule.
related(W,Y) :- # w(W,Y).
This rule says, in English, "A word W is always related to a label Y." Because there is no body for this rule, it always succeeds. The smart part is in the feature.
The feature for this rule includes the bindings of W and Y, which effectively generates a different feature for every W-Y combination generated by the first rule. In inference, ProPPR will be able to use this rule to learn which words are most closely related to which labels, across all the words and labels and documents in the training set.
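Concretely, the set of features this template can generate is the cross product of words and labels. As a sketch (using one hypothetical document from the toy dataset), these are the `w(W,Y)` features the second rule would produce:

```shell
# Sketch: enumerate the w(W,Y) features the second rule can generate
# for one document's words -- one feature per word-label pair.
words="a pricy doll house"
labels="neg pos"
feats=$(for w in $words; do
  for y in $labels; do echo "w($w,$y)"; done
done)
echo "$feats"
```

Four words times two labels gives eight distinct features, each of which gets its own learned weight during training.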
Now let's go back and create the databases for `hasWord` and `isLabel`.
ProPPR supports a few different types of databases, but the most basic is a cfacts file, which is tab-separated:
ProPPR/demos/textcattoy$ cat > labels.cfacts
isLabel neg
isLabel pos
^D
Here we've said that `neg` is a label and `pos` is a label.
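Because cfacts files must be tab-separated, a stray space instead of a tab is an easy mistake to make. A quick awk check (just a sketch, not a ProPPR utility) catches malformed lines; it writes a small sample file so it is self-contained:

```shell
# Sketch: flag any cfacts line that isn't split into at least two
# tab-separated fields. (Not part of ProPPR; just a sanity check.)
printf 'isLabel\tneg\nisLabel\tpos\n' > /tmp/labels.cfacts
bad=$(awk -F'\t' 'NF < 2 {print "line " NR ": not tab-separated"}' /tmp/labels.cfacts)
echo "${bad:-ok}"
```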
The `hasWord` database is a little bigger, so we'll start with an easy-to-read format and then use a shell script to convert it.
ProPPR/demos/textcattoy$ cat > documents.txt
dh a pricy doll house
ft a little red fire truck
rw a red wagon
sc a pricy red sports car
bk punk queen barbie and ken
rb a little red bike
mv a big 7-seater minivan with an automatic transmission
hs a big house in the suburbs with a crushing mortgage
ji a job for life at IBM
tf a huge pile of tax forms due yesterday
jm a huge pile of junk mail bills and catalogs
pb a pricy barbie doll
yc a little yellow toy car
rb2 a red 10 speed bike
rp a red convertible porshe
bp a big pile of paperwork
he a huge backlog of email
wt a life of woe and trouble
^D
Here we've listed a bunch of documents, with the document identifier in one field and the document text in the other. Now we need to put each word in a separate entry in the database, and label it with the functor `hasWord`. I like awk for stuff like this. We'll test the script first to see that it looks good, then write the whole thing to a cfacts file.
ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt | head
hasWord dh a
hasWord dh pricy
hasWord dh doll
hasWord dh house
hasWord ft a
hasWord ft little
hasWord ft red
hasWord ft fire
hasWord ft truck
hasWord rw a
ProPPR/demos/textcattoy$ awk 'BEGIN{FS=OFS="\t"}{nwords = split($2,words," "); for (i=1;i<=nwords;i++) { print "hasWord",$1,words[i] }}' documents.txt > hasWord.cfacts
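One quick sanity check on the conversion (a sketch, run here on a two-document sample so it is self-contained): the number of `hasWord` facts should equal the total word count in the source file.

```shell
# Sketch: verify the conversion produced exactly one fact per word.
printf 'dh\ta pricy doll house\nft\ta little red fire truck\n' > /tmp/documents.txt
awk 'BEGIN{FS=OFS="\t"}{n=split($2,w," "); for(i=1;i<=n;i++) print "hasWord",$1,w[i]}' \
  /tmp/documents.txt > /tmp/hasWord.cfacts
expected=$(awk -F'\t' '{n += split($2, w, " ")} END {print n}' /tmp/documents.txt)
actual=$(wc -l < /tmp/hasWord.cfacts)
echo "$expected $actual"
```

If the two counts disagree, a line in the source file probably has a missing tab or an empty text field.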
Now that we have our rules file and database files, we have everything we need for inference.
Next we have to prepare our labelled data for the learning phase. Labelled data for ProPPR is in a tab-separated format with one example per line, starting with the query and then listing solutions labelled with +(correct/positive) or -(incorrect/negative). For ProPPR to work properly, each example must have at least one positive and at least one negative solution, and these solutions must be reachable by the logic program as defined in the rules file. If you only have + solutions available, you can have QueryAnswerer give you other reachable solutions from which you can sample - labels. For this problem though we have both + and - labels. We'll use the first 11 documents for training, and the rest for testing:
ProPPR/demos/textcattoy$ cat > train.data
predict(dh,Y) -predict(dh,neg) +predict(dh,pos)
predict(ft,Y) -predict(ft,neg) +predict(ft,pos)
predict(rw,Y) -predict(rw,neg) +predict(rw,pos)
predict(sc,Y) -predict(sc,neg) +predict(sc,pos)
predict(bk,Y) -predict(bk,neg) +predict(bk,pos)
predict(rb,Y) -predict(rb,neg) +predict(rb,pos)
predict(mv,Y) +predict(mv,neg) -predict(mv,pos)
predict(hs,Y) +predict(hs,neg) -predict(hs,pos)
predict(ji,Y) +predict(ji,neg) -predict(ji,pos)
predict(tf,Y) +predict(tf,neg) -predict(tf,pos)
predict(jm,Y) +predict(jm,neg) -predict(jm,pos)
^D
ProPPR/demos/textcattoy$ cat > test.data
predict(pb,Y) -predict(pb,neg) +predict(pb,pos)
predict(yc,Y) -predict(yc,neg) +predict(yc,pos)
predict(rb2,Y) -predict(rb2,neg) +predict(rb2,pos)
predict(rp,Y) -predict(rp,neg) +predict(rp,pos)
predict(bp,Y) +predict(bp,neg) -predict(bp,pos)
predict(he,Y) +predict(he,neg) -predict(he,pos)
predict(wt,Y) +predict(wt,neg) -predict(wt,pos)
^D
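Since each example must have at least one + and one - solution, it's worth checking the data files before training. This is a sketch (not a ProPPR tool) that flags any example line missing one of the two; it builds a small sample file so it runs standalone:

```shell
# Sketch: verify every example line has at least one +solution and one
# -solution, as ProPPR's training requires.
printf 'predict(dh,Y)\t-predict(dh,neg)\t+predict(dh,pos)\npredict(mv,Y)\t+predict(mv,neg)\t-predict(mv,pos)\n' > /tmp/check.data
bad=$(awk -F'\t' '{p=0; m=0
  for(i=2;i<=NF;i++){ if($i ~ /^\+/) p++; else if($i ~ /^-/) m++ }
  if(p==0 || m==0) print "line " NR " lacks a + or a - solution"}' /tmp/check.data)
echo "${bad:-ok}"
```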
Now we have all the raw data we need, and we can start running ProPPR tools.
ProPPR/demos/textcattoy$ cd ../../
ProPPR$ scripts/compile.sh demos/textcattoy/
compiling demos/textcattoy/textcattoy.rules to demos/textcattoy/textcattoy.crules
parsing demos/textcattoy/textcattoy.rules
ProPPR$ cd -
ProPPR/demos/textcattoy$ export PROPPR=~/git/rinkitink/ProPPR
ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [Tester] All answers ranked equally for query goal(predict,c[pb],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[yc],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rb2],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[rp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[bp],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[he],v[-2])
WARN [Tester] All answers ranked equally for query goal(predict,c[wt],v[-2])
INFO [Tester] pairTotal 7.0 pairErrors 7.0 errorRate 1.0 map 0.7142857142857143
result= running time 531
result= pairs 7.0 errors 7.0 errorRate 1.0 map 0.7142857142857143
ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/:$PROPPR/conf/ edu.cmu.ml.praprolog.Trainer --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --train train.data --output train.cooked --params params.wts
INFO [Trainer] flags: 0x46b9
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Trainer] Cooking train.data...
INFO [ExampleCooker] totalPos: 11 totalNeg: 11 coveredPos: 11 coveredNeg: 11
INFO [ExampleCooker] For positive examples 11/11 proveable [100.0%]
INFO [ExampleCooker] For negative examples 11/11 proveable [100.0%]
INFO [Trainer] Finished cooking in 846 ms
INFO [Trainer] Training model parameters on train.cooked...
INFO [Trainer] Training on cooked examples...
INFO [Trainer] epoch 1 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 2 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 3 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 4 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] epoch 5 ...
INFO [Trainer] 11 examples processed
INFO [Trainer] Finished in 1292 ms
INFO [Trainer] Finished training in 1294 ms
INFO [Trainer] Saving parameters to params.wts...
ProPPR/demos/textcattoy$ head params.wts
#! weightingScheme=tanh
#! programFiles=textcattoy.crules:labels.cfacts:hasWord.cfacts
#! prover=dpr:0.000100000:0.100000:throw
w(junk,pos) 0.996540
r 0.970365
w(barbie,pos) 1.02734
w(with,neg) 1.03418
w(due,pos) 0.986711
w(catalogs,pos) 0.989665
w(crushing,neg) 1.01162
ProPPR/demos/textcattoy$ java -cp $PROPPR/bin:$PROPPR/lib/*:$PROPPR/conf/ edu.cmu.ml.praprolog.Tester --programFiles textcattoy.crules:labels.cfacts:hasWord.cfacts --test test.data --params params.wts
INFO [Tester] flags: 0x259
INFO [Component] Consolidating all .cfacts components together
INFO [Component] Consolidation complete
INFO [Tester] Testing on test.data...
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[yellow],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[toy],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[10],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[speed],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[convertible],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[porshe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[paperwork],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[backlog],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[email],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[woe],c[pos]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[neg]) (this message only prints once)
WARN [InnerProductWeighter] Using default weight 1.0 for unknown feature goal(w,c[trouble],c[pos]) (this message only prints once)
INFO [Tester] pairTotal 7.0 pairErrors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286
result= running time 537
result= pairs 7.0 errors 1.0 errorRate 0.14285714285714285 map 0.9285714285714286