Skip to content
wwcohen edited this page Aug 29, 2014 · 20 revisions

Demo 2: Learning and Inference with the WebKB data

Before you do this walkthough, you should have ProPPR installed and on your Java classpath. You'll also want to set up a shell variable that points to the root of the ProPPR installation, so you can call scripts easily: for instance, you might use something like:

export PROPPR_HOME=~/code/Praprolog

Inference, Queries, Answers and Provers

A sample database

Inference with ProPPR is our name for using ProPPR to answer queries. To answer queries, you need to supply a database, and you will also usually supply a set of rules. To do inference, ProPPR uses something called a prover. There are several implemented provers, with different pros and cons. We're going to working with a smallish database which was produced from the classic WebKB dataset. This dataset is based on a few thousand web pages, classified into categories like "course home page" or "department home page", which were crawled many years ago from some CS departments. Step one is to download a copy of that database in the ProPPR format, maybe with

wget http://www.cs.cmu.edu/~wcohen/proppr/webkb-full.graph

This is one format for a ProPPR database, called a graph file. A graph file contains triples, somewhat like RDF triples, consisting of a relation name, a source node, and a destination node. As an example, if you type

grep -w r98 webkb-full.graph

you will see all the triples concerning the node r98, which look like this:

hasTerm r98     algorithm
hasTerm r98     approach
hasTerm r98     cern
hasTerm r98     check
hasTerm r98     comput
...
inDoc   algorithm       r98
inDoc   approach        r98
inDoc   cern    r98
...
linkFrom        r98     r237
linkTo  r237    r98

Not to be too mysterious about this, r98 represents a web page, the hasTerm triples are stemmed versions of words extracted from that page, and the linkFrom triples is a hyperlink from r98 to another page, r237. For every triple, there is an inverse of it (with the relation inDoc or linkTo, and the source and destination nodes swapped). The purpose of storing both of these is that graph databases are indexed specially: ProPPR will be able to go quickly from a node (like r98) to nodes that it is directly connected (like r237, or the node for the word algorithm).

A little more exploration: we can find the most frequent terms in this dataset using:

grep hasTerm webkb-full.graph | cut -f3 | sort | uniq -c | sort -nr | head -40

and if we do, we'll see they are strings like "text", "server", "type", "html", "gmt", "content", and so on - uninformative terms that are probably taken from headers.

Querying the database

To submit some queries against this database, edit the file tutorial1.queries to contain

linkFrom(r98,X)
linkTo(r98,Y)
hasTerm(r98,W)
inDoc(databas,DocId)

Each of these queries is a pattern to match against a database triple, where uppercase identifiers (like X,Y,DocId) are variables. Hence, in each pattern, we have specified the relation and source nodes, and are asking which destinations match the variables. To answer the queries use this command:

java edu.cmu.ml.praprolog.QueryAnswerer --queries tutorial1.queries --output tutorial1.answers --programFiles webkb-full.graph --prover ppr  

which will produce output like this:

INFO [QueryAnswerer] Running queries from tutorial1.queries; saving results to tutorial1.answers
INFO [QueryAnswerer] Querying: goal(linkFrom,c[r98],v[-1])
INFO [QueryAnswerer] Writing 1 solutions...
INFO [QueryAnswerer] Querying: goal(linkTo,c[r98],v[-2])
INFO [QueryAnswerer] Writing 0 solutions...
INFO [QueryAnswerer] Querying: goal(hasTerm,c[r98],v[-3])
INFO [QueryAnswerer] Writing 38 solutions...
INFO [QueryAnswerer] Querying: goal(inDoc,c[databas],v[-4])
INFO [QueryAnswerer] Writing 144 solutions...

and fill the file tutorial1.answers with something like the output below. The 'comment' lines (starting with hash) document the query that was given, except that variables appearing in queries are replaced with negative integers. The other lines give potential answers to a query, along with a rank for that answer (the first column) and a score.

# proved 1      linkFrom(r98,-1)        4 msec
1       1.0     -1=c[r237]
# proved 2      linkTo(r98,-2)  0 msec
# proved 3      hasTerm(r98,-3) 31 msec
1       0.02631578947368423     -3=c[random]
2       0.02631578947368423     -3=c[suggest]
3       0.02631578947368423     -3=c[modifi]
4       0.02631578947368423     -3=c[fridai]
5       0.02631578947368423     -3=c[cornell]
6       0.02631578947368423     -3=c[year]
...
# proved 4      inDoc(databas,-4)       84 msec
1       0.006944444444444444    -4=c[r1073]
2       0.006944444444444444    -4=c[r126]
3       0.006944444444444444    -4=c[r1130]
4       0.006944444444444444    -4=c[r349]
...

Querying the database with rules

Edit the file tutorial2.rules to contain this text:

simword(Word1,Word2) :- inDoc(Word1,Doc),hasTerm(Doc,Word2).

To read this, think of ":-" as "if" and the comma as "and", and think of variables in the "if" part (i.e., the body of the rule) as existentially quantified: so the rule above says roughly "Word1 and Word2 are similar (i.e., satisfy the simword relation) if there is a document Doc such that Word1 is in the document Doc, and Word2 is term in Doc".

Now type the command

python $PROPPR_HOME/scripts/rulecompiler.py tutorial2.rules tutorial2.crules

and edit tutorial2.queries to contain

simword(databas,W)                                                              |
simword(exam,W)

and finally submit this command (making sure you modify the --programFiles argument to include the new compiled rule you created):

 java edu.cmu.ml.praprolog.QueryAnswerer --queries tutorial2.queries --output tutorial2.answers --programFiles webkb-full.graph:tutorial2.crules --prover ppr  

which should give you output, in tutorial2.answers, something like the below. In this case the rankings are a little more interesting. For "databas", after we slog past the dozen or so most frequent terms, we get recognizable stems, like the stems for "computer science", "systems", and "department", which might be from department home pages; for "exam", we can recognize stems of "instructor", "office hours", "grade", and other course-related terms.

# proved 1      simword(databas,-1)     1426 msec
1       0.014081628776463089    -1=c[text]
2       0.014081628776463089    -1=c[server]
...
11      0.012019056632918192    -1=c[comput]
12      0.011043926246969014    -1=c[scienc]
13      0.011034829093448897    -1=c[page]
14      0.010695982535238914    -1=c[nov]
15      0.010103870754654385    -1=c[univers]
16      0.00999472020313304     -1=c[home]
17      0.00952860240362735     -1=c[system]
18      0.008060673694055016    -1=c[depart]
19      0.007813967154745428    -1=c[research]
20      0.007640605114867448    -1=c[version]
21      0.007407893785288104    -1=c[ncsa]
...
# proved 2      simword(exam,-1)        819 msec
...
15      0.008826322452046495    -1=c[offic]
16      0.008814365300357042    -1=c[hour]
17      0.00845835624269028     -1=c[comput]
18      0.008348982179901694    -1=c[assign]
19      0.007929864414629046    -1=c[home]
20      0.007839277354855954    -1=c[instructor]
21      0.00777219391877796     -1=c[class]
22      0.00714994475489742     -1=c[grade]
23      0.006699516133144052    -1=c[fall]
24      0.006611328295957702    -1=c[program]
25      0.006547088944118696    -1=c[final]
...

TODO:

  • with prover --prover dpr:0.0001:0.1
  • tracing
  • more complex similarities - eg, tutorial3.rules
  • heuristics for epsilon? timing?
  • should there be a script for comparison of ranks? eg (ppr-dpr)/ppr by rank?

Semi-supervised learning

TODO:

true classes

  • cp classes.graph ~/afs-home/www/proppr
  • sample seeds and predicted(Y,X1,Y1) :- seed(Y,X0),simDoc(X0,X1),class(X1,Y1).

similarity and propagation

  • webkb as SSL problem
  • experiment with train/test splits and multiple seeds

supervised learning

  • webkb as SL problem
  • experiment with train/test splits and multiple seeds

relational learning

  • look at linkto.graph classes and see if it's informative
  • add linkto features
  • other algorithms?