UCL Search Engine

UCL IRDM 2017 Group Project - Option 1

We use two open-source IR packages: Apache Nutch for crawling and Apache Solr for indexing.

Nutch Crawling

Build Nutch with ant.
Create a urls directory under apache-nutch-1.12/runtime/local.
Create a seed.txt file in urls and put http://www.cs.ucl.ac.uk/ in it.
Create a new crawldb by running bin/nutch inject crawl/crawldb urls from the apache-nutch-1.12/runtime/local folder.
Start crawling with our fetch.sh script, located in the nutch_shell folder, in the form ./fetch.sh <Iterations>.
Deduplicate the crawldb with bin/nutch dedup crawl/crawldb.
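
The setup steps above can be sketched in Python, the language of the other scripts in this repository. This is only an illustration: the helper names write_seed_file and crawl_commands are hypothetical, and the commands are returned as lists rather than executed, since bin/nutch is run from apache-nutch-1.12/runtime/local while fetch.sh lives in nutch_shell.

```python
from pathlib import Path
from typing import List

SEED_URL = "http://www.cs.ucl.ac.uk/"


def write_seed_file(nutch_local: Path, seed_url: str = SEED_URL) -> Path:
    """Create urls/seed.txt under the given runtime/local directory."""
    urls_dir = nutch_local / "urls"
    urls_dir.mkdir(parents=True, exist_ok=True)
    seed_file = urls_dir / "seed.txt"
    # Nutch reads one seed URL per line.
    seed_file.write_text(seed_url + "\n", encoding="utf-8")
    return seed_file


def crawl_commands(iterations: int) -> List[List[str]]:
    """Return the README commands in order, without running them."""
    return [
        # Run from apache-nutch-1.12/runtime/local.
        ["bin/nutch", "inject", "crawl/crawldb", "urls"],
        # Run from the nutch_shell folder.
        ["./fetch.sh", str(iterations)],
        # Run from apache-nutch-1.12/runtime/local.
        ["bin/nutch", "dedup", "crawl/crawldb"],
    ]
```

Each returned list could be passed to subprocess.run with the appropriate cwd once the paths on the local machine are known.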

Generate the webgraph with bin/nutch webgraph -webgraphdb crawl/webgraphdb -segment crawl/segments/*.
Run PageRank with bin/nutch org.apache.nutch.scoring.webgraph.PageRank -webgraphdb crawl/webgraphdb.
Update the scores in the crawldb with bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb.
Add scoring-link to the <value> tag of the property with <name>plugin.includes</name> in apache-nutch-1.12/runtime/local/conf/nutch-site.xml. Alternatively, add it in apache-nutch-1.12/conf/nutch-site.xml and rebuild with ant.
Reindex Solr.
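
The plugin.includes edit can also be scripted. The sketch below is a minimal illustration using Python's standard xml.etree.ElementTree; the function name enable_plugin and the SAMPLE configuration (including its plugin list) are hypothetical and not taken from this repository's nutch-site.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
  </property>
</configuration>"""


def enable_plugin(site_xml: str, plugin: str = "scoring-link") -> str:
    """Append `plugin` to the plugin.includes value if it is not already listed."""
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "plugin.includes":
            value = prop.find("value")
            if value is None:
                raise ValueError("plugin.includes has no <value> element")
            current = (value.text or "").strip()
            # Simple top-level check: the value is a '|'-separated regex.
            if plugin not in current.split("|"):
                value.text = current + "|" + plugin if current else plugin
            break
    else:
        raise ValueError("plugin.includes property not found")
    return ET.tostring(root, encoding="unicode")


updated = enable_plugin(SAMPLE)
```

Note that ElementTree drops XML comments when parsing, so for a heavily commented nutch-site.xml a manual edit may be preferable.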

Start the Solr server.
Create a new core named ucl with bin/solr create -c ucl.
Modify the schema of ucl, either by editing managed-schema.xml and restarting the server or through the Solr Schema API: change the type of the content field to text_general.
Index with Nutch using bin/nutch solrindex http://localhost:8983/solr/ucl crawl/crawldb crawl/segments/* -normalize -deleteGone.
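
For the schema change through the Solr API, a request along the following lines should work. This is a sketch under assumptions: it uses the Schema API's replace-field command, which presumes the content field already exists (add-field would be needed otherwise), and the stored attribute is a guess because replace-field replaces the whole field definition. The helper name replace_field_request is hypothetical.

```python
import json
import urllib.request


def replace_field_request(core_url: str, field: str, field_type: str) -> urllib.request.Request:
    """Build (but do not send) a Schema API request that changes a field's type."""
    payload = {
        "replace-field": {
            "name": field,
            "type": field_type,
            "stored": True,  # assumption: keep the field stored
        }
    }
    return urllib.request.Request(
        core_url.rstrip("/") + "/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = replace_field_request("http://localhost:8983/solr/ucl", "content", "text_general")
# With the Solr server running, apply it with: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before touching a live core.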
