forked from juanplopes/hadoop2solr
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
66 lines (48 loc) · 2.55 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
This project contains the code and data for the dzone article on using Solr as a NoSQL endpoint for a Hadoop workflow.
Below are the steps that would get you up and running, assuming you have an AWS account set up as per http://openbixo.org/documentation/running-bixo-in-ec2/, and a git client installed.
====================================================
Building the Hadoop job jar
====================================================
% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git
% cd hadoop2solr
% ant job
====================================================
Setting up the Hadoop cluster
====================================================
% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../hadoop2solr/build/ hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr
====================================================
Running the Hadoop jop (on the hadoop2solr cluster)
====================================================
% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb
Monitor the job progress using your browser until the job completes successfully
% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index
Monitor the job progress using your browser until the job completes successfully
The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index.
% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index
====================================================
Setting up Solr (on the hadoop2solr master)
====================================================
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &
====================================================
Cleaning up/testing
====================================================
Use the AWS Console to kill off the slave server.
Use the AWS Console to open up TCP port 8983 on the master server.
Open a browser window on http://<ec2 server public name>:8983/solr/admin