UCL IRDM 2017 Group Project - Option 1
This project uses two open source IR packages, Apache Nutch 1.12 and Apache Solr. The setup steps are:
- Build Nutch with `ant`.
- Create a `urls` directory under `apache-nutch-1.12/runtime/local`.
- Create a `seed.txt` file under `urls` and put `http://www.cs.ucl.ac.uk/` into it.
- Create a new crawldb by running `bin/nutch inject crawl/crawldb urls` from the `apache-nutch-1.12/runtime/local` folder.
- Start crawling with our `fetch.sh` script, located in the `nutch_shell` folder, in the form `./fetch.sh <Iterations>`.
- Deduplicate the crawldb with `bin/nutch dedup crawl/crawldb`.
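For readers without access to `fetch.sh`, the following is a minimal sketch of what one iterative crawl loop in Nutch 1.x typically looks like. It is not the project's actual script: the function name `crawl_rounds` and the `NUTCH` override variable are hypothetical, and the real `fetch.sh` may pass additional options or steps.

```shell
#!/bin/sh
# Hypothetical sketch of an iterative crawl loop; the project's fetch.sh may differ.
# NUTCH can be overridden, e.g. NUTCH=echo for a dry run that only prints commands.
NUTCH="${NUTCH:-bin/nutch}"

crawl_rounds() {
  iterations="$1"
  i=1
  while [ "$i" -le "$iterations" ]; do
    # Select the next batch of URLs; stop if the generate step fails.
    $NUTCH generate crawl/crawldb crawl/segments || break
    # Segment names are timestamps, so the newest segment sorts last.
    segment="$(ls -d crawl/segments/* | tail -n 1)"
    $NUTCH fetch "$segment"
    $NUTCH parse "$segment"
    # Fold newly discovered links back into the crawldb.
    $NUTCH updatedb crawl/crawldb "$segment"
    i=$((i + 1))
  done
}

# Usage (from apache-nutch-1.12/runtime/local, after the inject step):
#   crawl_rounds 3
```

Each round fetches only the pages selected by `generate`, so more iterations reach pages further from the seed URL.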
- Generate the webgraph with `bin/nutch webgraph -webgraphdb crawl/webgraphdb -segment crawl/segments/*`.
- Run PageRank with `bin/nutch org.apache.nutch.scoring.webgraph.PageRank -webgraphdb crawl/webgraphdb`.
- Update the scores in the crawldb with `bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb`.
- Add `scoring-link` to the `<value>` tag of the property with `<name>plugin.includes</name>` in `apache-nutch-1.12/runtime/local/conf/nutch-site.xml`. Alternatively, add it in `apache-nutch-1.12/conf/nutch-site.xml` and rebuild with `ant`.
- Reindex Solr.
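The webgraph, PageRank, and score update commands depend on each other's output, so it can help to chain them so that a failure stops the sequence. The sketch below assumes the same paths as the steps above; the function name `update_scores` and the `NUTCH` override variable are hypothetical and not part of the project.

```shell
#!/bin/sh
# Sketch: run the webgraph, PageRank, and score update steps in order.
# NUTCH can be overridden, e.g. NUTCH=echo to print the commands only.
NUTCH="${NUTCH:-bin/nutch}"

update_scores() {
  webgraphdb="${1:-crawl/webgraphdb}"
  crawldb="${2:-crawl/crawldb}"
  # && stops the chain at the first failing step.
  $NUTCH webgraph -webgraphdb "$webgraphdb" -segment crawl/segments/* &&
  $NUTCH org.apache.nutch.scoring.webgraph.PageRank -webgraphdb "$webgraphdb" &&
  $NUTCH scoreupdater -crawldb "$crawldb" -webgraphdb "$webgraphdb"
}

# Usage (from apache-nutch-1.12/runtime/local):
#   update_scores
```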
- Start the Solr server.
- Create a new core `ucl` with `bin/solr create -c ucl`.
- Modify the schema of `ucl`, either by editing `managed-schema.xml` and restarting the server or through the Solr API: change the type of `content` to `text_general`.
- Index with Nutch using `bin/nutch solrindex http://localhost:8983/solr/ucl crawl/crawldb crawl/segments/* -normalize -deleteGone`.
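As an illustration of the API route for the schema change, the sketch below posts a `replace-field` command to the Solr Schema API for the `ucl` core. The `stored` and `indexed` values are assumptions and should be matched to the existing `content` field definition; the function names and the `SOLR_URL` variable are hypothetical.

```shell
#!/bin/sh
# Sketch: change the type of the `content` field through the Solr Schema API.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/ucl}"

# Print the JSON body for a replace-field command.
# stored/indexed are assumptions; match them to the existing field definition.
content_field_payload() {
  printf '%s' '{"replace-field":{"name":"content","type":"text_general","stored":true,"indexed":true}}'
}

# POST the payload to the core's schema endpoint (needs a running Solr and curl).
set_content_type() {
  content_field_payload |
    curl -sS -X POST -H 'Content-type:application/json' --data-binary @- "$SOLR_URL/schema"
}

# Usage:
#   bin/solr start
#   bin/solr create -c ucl
#   set_content_type
```

The API route avoids a server restart, but it only works when the core uses a managed schema.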