forked from apache/lucene-solr
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent
99093ca
commit fc67b5c
Showing
72 changed files
with
10,532 additions
and
18 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,85 @@ | ||
Apache Lucene/Solr | ||
## Berlin Buzzword Demo | ||
|
||
### install the demo | ||
|
||
from the folder `solr` run: | ||
|
||
ant dist | ||
ant server | ||
bin/solr -e wikipedia -Dsolr.ltr.enabled=true | ||
|
||
then download and index the dump: | ||
|
||
cd py-solr-buzzwords | ||
# get the simple wikipedia dump | ||
wget FIXME | ||
./index_wikipedia.py simplewiki-20170501-pages-articles.json.gz | ||
|
||
install the needed python packages: | ||
|
||
pip install pysolr | ||
pip install flask | ||
|
||
### running the demo | ||
run: | ||
|
||
cd py-solr-buzzwords | ||
./demo.py | ||
|
||
## 1. Collect query document judgements | ||
|
||
You can mark results as relevant and add new queries to the dataset (stored in `dataset.json`). | ||
|
||
./annotate_queries.py | ||
|
||
## 2. Extract query-document features | ||
|
||
First, load the features into Solr: | ||
|
||
curl -XPUT 'http://localhost:8983/solr/wikipedia/schema/feature-store' --data-binary "@./features.json" -H 'Content-type:application/json' | ||
|
||
Then you can extract features for a query-document pair with the LTR `[features]` document transformer. For example, this request passes `berlin` as the `efi.query` value: | ||
|
||
http://localhost:8983/solr/wikipedia/select?indent=on&q=asd&wt=json&fl=title,score,[features%20efi.query=berlin] | ||
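
The same kind of request can be built programmatically. The following is a minimal sketch, not part of this commit: the helper name `feature_extraction_url` is hypothetical, and it assumes the `wikipedia` core on `localhost:8983` used elsewhere in this README. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Assumption: same host and core as in the README examples.
SELECT_URL = "http://localhost:8983/solr/wikipedia/select"


def feature_extraction_url(user_query, base=SELECT_URL):
    """Build a select URL that asks the [features] transformer for feature values.

    Hypothetical helper for illustration only. It works for single-term
    queries; a multi-word efi.query value would need extra quoting.
    """
    params = {
        "q": user_query,
        "wt": "json",
        "indent": "on",
        # efi.query hands the query text to the features as external feature info.
        "fl": "title,score,[features efi.query={0}]".format(user_query),
    }
    # urlencode percent-escapes the brackets, commas and '=' inside fl.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(feature_extraction_url("berlin"))
```

Unlike the URL above, this sketch uses the same text for `q` and `efi.query`.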
|
||
## 3. Train a linear model | ||
|
||
The script fetches the features for each query-document pair in the `dataset.json` file and writes a training file. It then trains a model on that file and uploads it to Solr. | ||
|
||
./train_linear_model.py | ||
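
For orientation, here is a hedged sketch of what a linear model looks like when uploaded to Solr's LTR model store. It is not taken from `train_linear_model.py`, whose internals are not shown here. The helper name, the model name, and the feature names are hypothetical; only the JSON layout (`class`, `name`, `features`, `params.weights`) follows the Solr LTR `LinearModel` format.

```python
import json


def linear_model_payload(name, weights):
    """Return the JSON body describing a Solr LTR LinearModel.

    `weights` maps feature names (hypothetical here) to learned weights.
    Every feature listed must already exist in the feature store.
    """
    feature_names = sorted(weights)
    model = {
        "class": "org.apache.solr.ltr.model.LinearModel",
        "name": name,
        "features": [{"name": f} for f in feature_names],
        "params": {"weights": {f: weights[f] for f in feature_names}},
    }
    return json.dumps(model, indent=2)


if __name__ == "__main__":
    # Hypothetical feature names and weights, for illustration only.
    print(linear_model_payload("demo_linear", {"title_match": 1.5, "text_match": 0.3}))
```

A body like this could be sent with a PUT request to `http://localhost:8983/solr/wikipedia/schema/model-store`, in the same way the features are uploaded to the feature store in step 2.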
|
||
If you run the script and then run (or refresh) `demo.py`, you will see the performance of the model on the right side of the screen. | ||
If you click on the name of the model, you will see how documents are ranked using that model. | ||
|
||
## 4. Train a tree model | ||
|
||
Same as above, but this time we will train a tree model (LambdaMART). | ||
|
||
./train_linear_model.py | ||
|
||
LambdaMART is trained to optimize a particular quality metric. By default it optimizes Precision at 10, but you can change the metric, e.g.: | ||
|
||
./train_linear_model.py P@1 | ||
./train_linear_model.py NDCG@10 | ||
|
||
These commands optimize the model for Precision at 1 or Normalized Discounted Cumulative Gain at 10 [1], respectively. | ||
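
To make the two metrics concrete, here is a small sketch of how they can be computed for one ranked result list. It is an illustration, not the code used by the training script. It assumes graded relevance labels where 0 means not relevant, and it uses the linear-gain form of DCG; some libraries use `2**rel - 1` as the gain instead.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k positions holding a relevant result (label > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / float(k)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each label is discounted by log2(position + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Labels of the returned documents, in ranked order (made-up example).
    ranked_labels = [0, 1, 1, 0]
    print(precision_at_k(ranked_labels, 1))
    print(ndcg_at_k(ranked_labels, 10))
```

P@1 only looks at the first result, while NDCG@10 rewards placing highly relevant documents near the top of the first ten.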
|
||
You can also increase the number of trees used, e.g., | ||
|
||
./train_linear_model.py P@1 100 | ||
./train_linear_model.py NDCG@10 1000 | ||
|
||
More trees will probably make the model more accurate, but will slow down ranking at query time. | ||
|
||
[1] https://en.wikipedia.org/wiki/Discounted_cumulative_gain | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
lucene/ is a search engine library | ||
solr/ is a search engine server that uses lucene | ||
|
||
To compile the sources run 'ant compile' | ||
To run all the tests run 'ant test' | ||
To setup your ide run 'ant idea', 'ant netbeans', or 'ant eclipse' | ||
For Maven info, see dev-tools/maven/README.maven | ||
|
||
For more information on how to contribute see: | ||
http://wiki.apache.org/lucene-java/HowToContribute | ||
http://wiki.apache.org/solr/HowToContribute |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
#!/usr/bin/env python | ||
|
||
from flask import Flask | ||
from flask import request | ||
from flask import render_template | ||
import json | ||
import pysolr | ||
import os | ||
from dataset import Dataset | ||
from rankers import Rankers | ||
|
||
app = Flask(__name__) | ||
MAX_RESULTS = 30 | ||
dataset = Dataset() | ||
rankers = Rankers() | ||
|
||
@app.route("/query", methods=['POST', 'GET']) | ||
def query(): | ||
    q = request.args.get('q') | ||
    try: | ||
        results = rankers.query('default', q) | ||
    except Exception as e: | ||
        # Log the error and answer with a 500; an exception object is not a valid Flask status. | ||
        print(e) | ||
        return "Cannot connect with Lucene/Solr", 500 | ||
    return render_template('annotate.html', query=q, results=results) | ||
|
||
@app.route("/annotate", methods=['POST', 'GET']) | ||
def annotate(): | ||
    q = request.args.get('q') | ||
    rank = int(request.args.get('rank')) | ||
    try: | ||
        results = rankers.query('default', q) | ||
    except Exception as e: | ||
        print(e) | ||
        return "Cannot connect with Lucene/Solr", 500 | ||
    if len(results.docs) == 0: | ||
        return "No results for query " + q | ||
    # Wrap the rank so it always points at one of the first MAX_RESULTS documents. | ||
    rank = rank % min(MAX_RESULTS, len(results.docs)) | ||
    article = results.docs[rank] | ||
    rel = dataset.get_relevance(q, article['wikiTitle']) | ||
    dataset.annotate(q, article['wikiTitle'], rel) | ||
    return render_template('annotate_res.html', query=q, article=article, rank=rank, rel=rel) | ||
|
||
|
||
@app.route("/store", methods=['POST', 'GET']) | ||
def store(): | ||
    q = request.args.get('q') | ||
    rank = int(request.args.get('rank')) | ||
    rel = int(request.args.get('rel')) | ||
    results = rankers.query('default', q) | ||
    doc = results.docs[rank] | ||
    dataset.annotate(q, doc["wikiTitle"], rel) | ||
    return 'ok' | ||
|
||
|
||
if __name__ == "__main__": | ||
    import threading, webbrowser | ||
    port = 5000 | ||
    url = "http://localhost:{0}/annotate?q=berlin&rank=0".format(port) | ||
    # Open the browser shortly after the server has started. | ||
    threading.Timer(1.25, lambda: webbrowser.open(url)).start() | ||
    app.run(port=port) | ||
|
||
|