Berlin Buzzwords LTR Demo
diegoceccarelli committed Jul 23, 2017
1 parent 99093ca commit fc67b5c
Showing 72 changed files with 10,532 additions and 18 deletions.
92 changes: 82 additions & 10 deletions README.txt
@@ -1,13 +1,85 @@
Apache Lucene/Solr
## Berlin Buzzwords Demo

### install the demo

from the folder `solr` run:

ant dist
ant server
bin/solr -e wikipedia -Dsolr.ltr.enabled=true

then download and index the dump:

cd py-solr-buzzwords
# get the simple wikipedia dump
wget FIXME
./index_wikipedia.py simplewiki-20170501-pages-articles.json.gz

install the required Python packages:

pip install pysolr
pip install flask

### running the demo
run:

cd py-solr-buzzwords
./demo.py

## 1. Collect query-document judgements

You can mark results as relevant, and add new queries to the dataset (stored in `dataset.json`).

./annotate_queries.py

## 2. Extract query-document features

First you need to load the features into Solr:

curl -XPUT 'http://localhost:8983/solr/wikipedia/schema/feature-store' --data-binary "@./features.json" -H 'Content-type:application/json'
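The contents of `features.json` are not shown in this README; a Solr LTR feature store is a JSON array of feature definitions. A minimal sketch (the feature names and the title query are illustrative; the `class` values are standard Solr LTR feature classes):

```json
[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${query}" }
  }
]
```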

Then, you can extract features for a query-document pair by using the LTR document transformer; for example, try the `berlin` query:

http://localhost:8983/solr/wikipedia/select?indent=on&q=berlin&wt=json&fl=title,score,[features%20efi.query=berlin]
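In the response, each returned document carries a `[features]` field holding comma-separated `name=value` pairs. A small sketch of how a client might parse that field (the feature names below are made up):

```python
def parse_features(features_field):
    """Parse the comma-separated name=value string that Solr's
    [features] transformer returns into a dict of floats."""
    pairs = (kv.split('=', 1) for kv in features_field.split(','))
    return {name: float(value) for name, value in pairs}

# hypothetical value of the [features] field for one document
print(parse_features("originalScore=1.5,titleMatch=0.0"))
```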

## 3. Train a linear model

The script will fetch the features for each query-document pair in the `dataset.json` file and produce a training file that will be used to train a model. It will then train the model and upload it to Solr.

./train_linear_model.py
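The exact training file is not shown here; RankLib (bundled as `RankLib.jar`) expects LETOR-style lines, one per query-document pair. A hedged sketch of producing one such line (the query id, relevance grade, and feature values are illustrative):

```python
def letor_line(relevance, qid, feature_values, doc_id):
    """Format one RankLib/LETOR training line:
    <relevance> qid:<qid> 1:<f1> 2:<f2> ... # <doc_id>"""
    feats = ' '.join('{0}:{1}'.format(i + 1, v)
                     for i, v in enumerate(feature_values))
    return '{0} qid:{1} {2} # {3}'.format(relevance, qid, feats, doc_id)

print(letor_line(1, 7, [1.5, 0.0], 'Berlin'))
```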

If you run the script and then run (or refresh) `demo.py`, you will see the performance of the model on the right side of the screen.
If you click on the name of the model, you will see how documents are ranked using that model.
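The uploaded model itself is stored in Solr as JSON in the model store. For a linear model it looks roughly like this (the `class` is the standard Solr LTR linear model class; the model name, features, and weights are illustrative):

```json
{
  "class": "org.apache.solr.ltr.model.LinearModel",
  "name": "myLinearModel",
  "features": [ { "name": "originalScore" }, { "name": "titleMatch" } ],
  "params": { "weights": { "originalScore": 1.0, "titleMatch": 0.5 } }
}
```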

## 4. Train a tree model

Same as above, but this time we will train a tree model (LambdaMART).

./train_linear_model.py

LambdaMART is trained to optimize a particular quality metric; by default it optimizes Precision at 10, but you can change the metric, e.g.:

./train_linear_model.py P@1
./train_linear_model.py NDCG@10

will optimize the model for Precision at 1 or Normalized Discounted Cumulative Gain (NDCG) [1].
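As a reference for what these metrics measure, NDCG@k can be computed from a ranked list of relevance grades; a minimal sketch:

```python
import math

def dcg(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg(rels, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# a ranking that puts a non-relevant document first is penalised
print(ndcg([0, 2, 1], 3))  # less than 1.0
print(ndcg([2, 1, 0], 3))  # ideal order gives 1.0
```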

You can also increase the number of trees used, e.g.,

./train_linear_model.py P@1 100
./train_linear_model.py NDCG@10 1000

More trees will probably make the model more precise, but will slow down performance at query time.

[1] https://en.wikipedia.org/wiki/Discounted_cumulative_gain

lucene/ is a search engine library
solr/ is a search engine server that uses lucene

To compile the sources run 'ant compile'
To run all the tests run 'ant test'
To setup your ide run 'ant idea', 'ant netbeans', or 'ant eclipse'
For Maven info, see dev-tools/maven/README.maven

For more information on how to contribute see:
http://wiki.apache.org/lucene-java/HowToContribute
http://wiki.apache.org/solr/HowToContribute
Binary file added py-solr-buzzwords/RankLib.jar
Binary file not shown.
65 changes: 65 additions & 0 deletions py-solr-buzzwords/annotate_queries.py
@@ -0,0 +1,65 @@
#!/usr/bin/env python

from flask import Flask
from flask import request
from flask import render_template

from dataset import Dataset
from rankers import Rankers

app = Flask(__name__)
MAX_RESULTS = 30
dataset = Dataset()
rankers = Rankers()


@app.route("/query", methods=['POST', 'GET'])
def query():
    q = request.args.get('q')
    try:
        results = rankers.query('default', q)
    except Exception as e:
        print(e)
        return "Cannot connect with Lucene/Solr"
    return render_template('annotate.html', query=q, results=results)


@app.route("/annotate", methods=['POST', 'GET'])
def annotate():
    q = request.args.get('q')
    rank = int(request.args.get('rank'))
    try:
        results = rankers.query('default', q)
    except Exception as e:
        print(e)
        return "Cannot connect with Lucene/Solr"
    if len(results.docs) == 0:
        return "No results for query " + q
    # wrap around so the rank always points at a returned document
    rank = rank % min(MAX_RESULTS, len(results.docs))
    article = results.docs[rank]
    rel = dataset.get_relevance(q, article['wikiTitle'])
    dataset.annotate(q, article['wikiTitle'], rel)
    return render_template('annotate_res.html', query=q, article=article, rank=rank, rel=rel)


@app.route("/store", methods=['POST', 'GET'])
def store():
    q = request.args.get('q')
    rank = int(request.args.get('rank'))
    rel = int(request.args.get('rel'))
    results = rankers.query('default', q)
    doc = results.docs[rank]
    dataset.annotate(q, doc["wikiTitle"], rel)
    return 'ok'


if __name__ == "__main__":
    import threading, webbrowser
    port = 5000
    url = "http://localhost:{0}/annotate?q=berlin&rank=0".format(port)
    # open the annotation UI in the browser shortly after the server starts
    threading.Timer(1.25, lambda: webbrowser.open(url)).start()
    app.run()

