README

This project contains the code and data for the DZone article on using Solr as a NoSQL endpoint for a Hadoop workflow.

The steps below get you up and running, assuming you have an AWS account set up as described at http://openbixo.org/documentation/running-bixo-in-ec2/ and a git client installed.
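Before starting, it can help to confirm that the command-line tools used in the steps are on your PATH. The helper below is a minimal sketch; the function name require_tools is hypothetical, and the tool list in the usage note (git, ant, java) is an assumption drawn from the commands in this README.

```shell
# Sketch: report every named tool that is missing from PATH.
# Returns 0 if all tools are found, 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, "require_tools git ant java && echo ready" prints "ready" only when all three are installed.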

====================================================
Building the Hadoop job jar
====================================================

% mkdir hadoop2solr-home
% cd hadoop2solr-home
% git clone git://github.com/bixolabs/hadoop2solr.git 
% cd hadoop2solr
% ant job

====================================================
Setting up the Hadoop cluster
====================================================

% cd hadoop2solr-home
% git clone git://github.com/bixo/bixo.git bixo
% cd bixo/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster hadoop2solr 1 m1.large
% hadoop-ec2 push hadoop2solr ../../hadoop2solr/build/hadoop2solr-job-1.0-SNAPSHOT.jar
% hadoop-ec2 screen hadoop2solr

====================================================
Running the Hadoop job (on the hadoop2solr cluster)
====================================================

% hadoop fs -mkdir /user/root/working
% hadoop distcp s3n://bixolabs-dzone/urls /user/root/working/crawldb

Monitor the copy job in your browser (the Hadoop JobTracker web UI, by default on port 50030 of the master) until it completes successfully.

% hadoop jar hadoop2solr-job-1.0-SNAPSHOT.jar com.bixolabs.tools.IndexTool -input working/crawldb -output working/solr-index

Monitor the indexing job's progress in your browser until it completes successfully.

The output directory will contain a single 'part-00000' directory, which contains a set of Lucene files for a single index.

% hadoop fs -copyToLocal /user/root/working/solr-index /mnt/solr-index
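After the copy, a quick sanity check can confirm that the local directory has the layout described above. The helper below is a sketch, not part of the project; the name check_index_dir is hypothetical. It relies on the fact that a Lucene index directory holds at least one commit file named segments_N.

```shell
# Sketch: verify that DIR/part-00000 exists and contains a Lucene
# commit file (segments_<generation>). Returns 0 on success, 1 otherwise.
check_index_dir() {
    part="$1/part-00000"
    if [ ! -d "$part" ]; then
        echo "no part-00000 directory under $1" >&2
        return 1
    fi
    # If the glob matches nothing it stays literal, and the -e test fails.
    for f in "$part"/segments_*; do
        if [ -e "$f" ]; then
            return 0
        fi
    done
    echo "no segments_N file in $part" >&2
    return 1
}
```

With the paths used in this README, the call would be "check_index_dir /mnt/solr-index".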

====================================================
Setting up Solr (on the hadoop2solr master)
====================================================

% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr.zip
% wget --no-check-certificate https://github.com/downloads/bixolabs/hadoop2solr/solr-conf.zip
% unzip solr.zip
% unzip solr-conf.zip
% mkdir solr-data
% cd solr-data
% ln -s /mnt/solr-index/part-00000 index
% cd ../solr
% java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar > jetty.log 2>&1 &
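Because Solr is started in the background, it may take a few seconds before it accepts requests. The sketch below polls a URL with curl until it answers; the function name wait_for_http and its default retry count and delay are assumptions chosen for illustration.

```shell
# Sketch: poll URL until it returns a non-error HTTP status.
# Arguments: URL [TRIES, default 30] [DELAY in seconds, default 2].
# Returns 0 as soon as a request succeeds, 1 if every try fails.
wait_for_http() {
    url="$1"
    tries="${2:-30}"
    delay="${3:-2}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors (4xx/5xx) as failure, -o: discard the body
        if curl -s -f --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

On the master, "wait_for_http http://localhost:8983/solr/admin" should return once Jetty is up. If it times out, jetty.log is the first place to look.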

====================================================
Cleaning up/testing
====================================================

Use the AWS Console to terminate the slave server, which is no longer needed.

Use the AWS Console to open up TCP port 8983 on the master server.

Point a browser at http://<ec2 server public name>:8983/solr/admin to confirm that Solr is serving the index.
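The index can also be tested from the command line. The sketch below builds a query URL for Solr's standard /select handler; it assumes the supplied solr-conf exposes that handler at /solr/select, which is the usual default but is not confirmed by this README. The function name solr_select_url is hypothetical, and the query string is passed through without URL encoding, so it must already be URL-safe.

```shell
# Sketch: build a Solr select URL.
# Arguments: HOST [QUERY, default *:* (match all documents)] [ROWS, default 0].
# Assumes the standard request handler is mounted at /solr/select.
solr_select_url() {
    host="$1"
    query="${2:-*:*}"
    rows="${3:-0}"
    printf '%s\n' "http://${host}:8983/solr/select?q=${query}&rows=${rows}"
}
```

Running curl "$(solr_select_url <ec2 server public name>)" should return a response whose numFound value is the number of documents in the index.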