0.4 release milestone discussion

Aron Ahmadia edited this page Oct 4, 2015 · 5 revisions

Aron is using this Wiki page to keep track of development notes and issues regarding the 0.4 release milestone. This isn't a discussion board or issue tracker, and the notes here will be migrated to documentation or code as part of the release.

Nutch REST Refactor

To build Nutch from source with the NUTCH-2132 patch applied:

git clone http://github.com/apache/nutch.git
cd nutch
curl -O https://issues.apache.org/jira/secure/attachment/12764747/NUTCH-2132.patch
patch -p0 < NUTCH-2132.patch
ant

To start the Nutch REST server after it's been built:

cd runtime/local
# /usr/libexec/java_home exists only on macOS; Linux folks, point JAVA_HOME at your JDK install directory yourselves:
export JAVA_HOME=$(/usr/libexec/java_home)
./bin/nutch startserver
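
The JAVA_HOME line above only works on macOS. A minimal sketch of a more portable version follows. The `detect_java_home` helper is hypothetical (it is not part of Nutch), and the Linux branch assumes a `java` binary on the PATH and a `readlink` that supports `-f` (GNU coreutils):

```shell
# Hypothetical helper: print a best-guess JAVA_HOME for the current platform.
detect_java_home() {
    # macOS ships a helper that prints the active JDK's home directory.
    if [ -x /usr/libexec/java_home ]; then
        /usr/libexec/java_home
        return
    fi
    # Elsewhere, resolve the `java` found on PATH through any symlinks,
    # then strip the trailing /bin/java to get the installation root.
    java_bin=$(command -v java) || return 1
    java_real=$(readlink -f "$java_bin") || return 1
    dirname "$(dirname "$java_real")"
}

# Export JAVA_HOME only if detection succeeded; otherwise say so on stderr.
if java_home=$(detect_java_home); then
    export JAVA_HOME="$java_home"
else
    echo "Could not locate a JDK; set JAVA_HOME by hand." >&2
fi
```

Run this from runtime/local in place of the export line, then start the server with ./bin/nutch startserver as before. On systems where the `java` on the PATH is a wrapper script rather than a symlink into the JDK, the guess may be wrong and JAVA_HOME should be set manually.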

Interacting with Nutch Pub/Sub Stream

  • Nutch currently streams its events to RabbitMQ. Since Redis could also fill this role, we should consider implementing the stream in Redis instead, to avoid increasing the number of services required for a Memex Explorer installation.

  • The RabbitMQ producer doesn't route events from different crawls to different queues. Per-crawl queues are essential if we'd like to support independent monitoring of concurrent crawls.
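
One possible shape for per-crawl queues, sketched as a dry run that only prints rabbitmqadmin commands. It assumes the producer would publish each event to a direct exchange with the sanitized crawl ID as the routing key, which the current patch does not do. The exchange name `nutch.events`, the `crawl.` queue prefix, and the `per_crawl_queue_commands` helper are all hypothetical:

```shell
# Hypothetical names: the exchange and the queue prefix are assumptions,
# not values taken from the Nutch patch.
EXCHANGE="nutch.events"
QUEUE_PREFIX="crawl"

# Print the rabbitmqadmin commands that would give one crawl its own queue.
# Nothing is sent to a broker; pipe the output to `sh` to actually run it.
per_crawl_queue_commands() {
    crawl_id=$1
    # Replace anything outside letters, digits, '_' and '-' with '_' so the
    # ID is safe to use as a queue name suffix and as a routing key.
    safe_id=$(printf '%s' "$crawl_id" | tr -c 'A-Za-z0-9_-' '_')
    queue="${QUEUE_PREFIX}.${safe_id}"
    echo "rabbitmqadmin declare queue name=${queue} durable=false"
    echo "rabbitmqadmin declare binding source=${EXCHANGE} destination=${queue} routing_key=${safe_id}"
}

per_crawl_queue_commands "news-crawl"
per_crawl_queue_commands "forum crawl #2"
```

With one queue bound per crawl, a monitor for a given crawl would consume only from that crawl's queue. The same idea carries over to Redis by using one pub/sub channel per crawl ID.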

DataWake Seed Integration