Berlin Buzzwords LTR Demo
diegoceccarelli committed Jul 23, 2017
1 parent 99093ca commit fc67b5c
Showing 72 changed files with 10,532 additions and 18 deletions.
92 changes: 82 additions & 10 deletions README.txt
@@ -1,13 +1,85 @@
Apache Lucene/Solr
## Berlin Buzzwords Demo

### install the demo

from the folder `solr` run:

ant dist
ant server
bin/solr -e wikipedia -Dsolr.ltr.enabled=true

then download and index the dump:

cd py-solr-buzzwords
# get the simple wikipedia dump
wget FIXME
./index_wikipedia.py simplewiki-20170501-pages-articles.json.gz

install the required Python packages:

pip install pysolr
pip install flask

### running the demo
run:

cd py-solr-buzzwords
./demo.py

## 1. Collect query-document judgements

You can mark results as relevant, and add new queries to the dataset (stored in `dataset.json`).

./annotate_queries.py

## 2. Extract query-document features

First you need to load the features into Solr:

curl -XPUT 'http://localhost:8983/solr/wikipedia/schema/feature-store' --data-binary "@./features.json" -H 'Content-type:application/json'
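The contents of `features.json` are not shown in this README; a Solr LTR feature store is a JSON array of feature definitions. A minimal sketch (the feature names and the title query are illustrative; the `class` values are standard Solr LTR feature classes):

```json
[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${query}" }
  }
]
```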

Then, you can extract features for a query-document pair by using the LTR document transformer; for example, try the `berlin` query:

http://localhost:8983/solr/wikipedia/select?indent=on&q=berlin&wt=json&fl=title,score,[features%20efi.query=berlin]
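In the response, each returned document carries a `[features]` field holding comma-separated `name=value` pairs. A small sketch of how a client might parse that field (the feature names below are made up):

```python
def parse_features(features_field):
    """Parse the comma-separated name=value string that Solr's
    [features] transformer returns into a dict of floats."""
    pairs = (kv.split('=', 1) for kv in features_field.split(','))
    return {name: float(value) for name, value in pairs}

# hypothetical value of the [features] field for one document
print(parse_features("originalScore=1.5,titleMatch=0.0"))
```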

## 3. Train a linear model

The script will fetch the features for each query-document pair in the `dataset.json` file and produce a training file that will be used to train a model. It will then train the model and upload it to Solr.

./train_linear_model.py
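The exact training file is not shown here; RankLib (bundled as `RankLib.jar`) expects LETOR-style lines, one per query-document pair. A hedged sketch of producing one such line (the query id, relevance grade, and feature values are illustrative):

```python
def letor_line(relevance, qid, feature_values, doc_id):
    """Format one RankLib/LETOR training line:
    <relevance> qid:<qid> 1:<f1> 2:<f2> ... # <doc_id>"""
    feats = ' '.join('{0}:{1}'.format(i + 1, v)
                     for i, v in enumerate(feature_values))
    return '{0} qid:{1} {2} # {3}'.format(relevance, qid, feats, doc_id)

print(letor_line(1, 7, [1.5, 0.0], 'Berlin'))
```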

If you run the script and then run (or refresh) `demo.py`, you will see the performance of the model on the right side of the screen.
If you click on the name of the model, you will see how documents are ranked using that model.
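The uploaded model itself is stored in Solr as JSON in the model store. For a linear model it looks roughly like this (the `class` is the standard Solr LTR linear model class; the model name, features, and weights are illustrative):

```json
{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "myLinearModel",
  "features": [ { "name": "originalScore" }, { "name": "titleMatch" } ],
  "params": { "weights": { "originalScore": 1.0, "titleMatch": 0.5 } }
}
```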

## 4. Train a tree model

Same as above, but this time we will train a tree model (LambdaMART).

./train_linear_model.py

LambdaMART is trained to optimize a particular quality metric; by default it optimizes Precision at 10, but you can change the metric, e.g.:

./train_linear_model.py P@1
./train_linear_model.py NDCG@10

will optimize the model for Precision at 1 or Normalized Discounted Cumulative Gain (NDCG) [1].
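As a reference for what these metrics measure, NDCG@k can be computed from a ranked list of relevance grades; a minimal sketch:

```python
import math

def dcg(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg(rels, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# a ranking that puts a non-relevant document first is penalised
print(ndcg([0, 2, 1], 3))  # less than 1.0
print(ndcg([2, 1, 0], 3))  # ideal order gives 1.0
```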

You can also increase the number of trees used, e.g.,

./train_linear_model.py P@1 100
./train_linear_model.py NDCG@10 1000

More trees will probably make the model more precise, but will slow down performance at query time.

[1] https://en.wikipedia.org/wiki/Discounted_cumulative_gain

lucene/ is a search engine library
solr/ is a search engine server that uses lucene

To compile the sources run 'ant compile'
To run all the tests run 'ant test'
To setup your ide run 'ant idea', 'ant netbeans', or 'ant eclipse'
For Maven info, see dev-tools/maven/README.maven

For more information on how to contribute see:
http://wiki.apache.org/lucene-java/HowToContribute
http://wiki.apache.org/solr/HowToContribute
Binary file added py-solr-buzzwords/RankLib.jar
Binary file not shown.
65 changes: 65 additions & 0 deletions py-solr-buzzwords/annotate_queries.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python

from flask import Flask
from flask import request
from flask import render_template

from dataset import Dataset
from rankers import Rankers

app = Flask(__name__)
MAX_RESULTS = 30
dataset = Dataset()
rankers = Rankers()


@app.route("/query", methods=['POST', 'GET'])
def query():
    q = request.args.get('q')
    try:
        results = rankers.query('default', q)
    except Exception as e:
        print(e)
        return "Cannot connect with Lucene/Solr"
    return render_template('annotate.html', query=q, results=results)


@app.route("/annotate", methods=['POST', 'GET'])
def annotate():
    q = request.args.get('q')
    rank = int(request.args.get('rank'))
    try:
        results = rankers.query('default', q)
    except Exception as e:
        print(e)
        return "Cannot connect with Lucene/Solr"
    if len(results.docs) == 0:
        return "No results for query " + q
    # wrap around so the rank always points at a returned document
    rank = rank % min(MAX_RESULTS, len(results.docs))
    article = results.docs[rank]
    rel = dataset.get_relevance(q, article['wikiTitle'])
    dataset.annotate(q, article['wikiTitle'], rel)
    return render_template('annotate_res.html', query=q, article=article, rank=rank, rel=rel)


@app.route("/store", methods=['POST', 'GET'])
def store():
    q = request.args.get('q')
    rank = int(request.args.get('rank'))
    rel = int(request.args.get('rel'))
    results = rankers.query('default', q)
    doc = results.docs[rank]
    dataset.annotate(q, doc["wikiTitle"], rel)
    return 'ok'


if __name__ == "__main__":
    import threading, webbrowser
    port = 5000
    url = "http://localhost:{0}/annotate?q=berlin&rank=0".format(port)
    # open the annotation UI in the browser shortly after the server starts
    threading.Timer(1.25, lambda: webbrowser.open(url)).start()
    app.run()

