This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Crawler in a box

William Bartholomew edited this page Mar 20, 2017 · 4 revisions

For convenience, a Docker configuration for running the crawler is available. It comes pre-configured with RabbitMQ for queuing, MongoDB for document storage, Redis for caching, Metabase for insights, and the GHCrawler dashboard for configuration and control. Each of these runs in its own container. The compose file is docker/docker-compose.yml in the GHCrawler repo.

NOTE This is an evolving solution, and the steps for running it will be simplified once ready-to-use images are published on Docker Hub. For now, follow these steps:

  1. Clone the Microsoft/ghcrawler and Microsoft/crawler-dashboard repos.
  2. In a command prompt go to ghcrawler/docker and run docker-compose up.
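The two steps above can be sketched as shell functions. This is a minimal sketch, not an official setup script: the HTTPS clone URLs are assumed from the repo names, the function names clone_repos and start_stack are hypothetical, and cloning both repos side by side in one parent directory is an assumption about the layout the compose file expects.

```shell
# Assumption: both repos are reachable at their standard GitHub HTTPS URLs.
GHCRAWLER_URL="https://github.com/Microsoft/ghcrawler.git"
DASHBOARD_URL="https://github.com/Microsoft/crawler-dashboard.git"

# Step 1: clone both repos into the current directory (hypothetical helper).
clone_repos() {
  git clone "$GHCRAWLER_URL" &&
  git clone "$DASHBOARD_URL"
}

# Step 2: start every container defined in ghcrawler/docker/docker-compose.yml.
# Runs in a subshell so the caller's working directory is left unchanged.
start_stack() {
  (cd ghcrawler/docker && docker-compose up)
}

# Uncomment to run both steps in order:
# clone_repos && start_stack
```

Because docker-compose up runs in the foreground, the crawler messages mentioned below appear in the same terminal; press Ctrl+C to stop the containers.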

Once the containers are up and running, you should see crawler-related messages in the crawler container's console output every few seconds. You can control the crawler using either the cc command line tool or the browser-based dashboard, both of which are described below.

You can also hook up directly to the crawler infrastructure. By default, the containers expose a number of endpoints on different localhost ports. If you have trouble starting the containers due to port conflicts, either shut down the services that are using these ports or edit the docker/docker-compose.yml file to change the ports.

  • Crawler Dashboard (4000) -- Open http://localhost:4000 in your browser to see what's happening and control some behaviors and configurations
  • Crawler (3000) -- http://localhost:3000 gives you direct access to the REST API for the crawler
  • MongoDB (27017 and 28017) -- Direct access to MongoDB
  • Redis (6379) -- Observe what's happening in Redis. Not much else for you to do here
  • RabbitMQ (5672 and 15672) -- Open http://localhost:15672 in your browser to see and manage the RabbitMQ queues
  • Metabase (5000) -- Open http://localhost:5000 in your browser to get live insights via Metabase
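To spot port conflicts before starting the containers, you can probe the ports listed above. The following is a rough sketch rather than part of GHCrawler: the variable CRAWLER_PORTS and the function ports_in_use are hypothetical names, and it assumes an nc (netcat) build that supports the -z flag is installed.

```shell
# Ports published on localhost, taken from the endpoint list above.
CRAWLER_PORTS="4000 3000 27017 28017 6379 5672 15672 5000"

# Print each port in CRAWLER_PORTS that already accepts TCP connections.
# Assumption: nc supports -z (connect and close without sending data).
ports_in_use() {
  for port in $CRAWLER_PORTS; do
    if nc -z localhost "$port" >/dev/null 2>&1; then
      echo "$port"
    fi
  done
}

busy=$(ports_in_use)
if [ -n "$busy" ]; then
  # $busy is left unquoted on purpose so the ports print on one line.
  echo "Ports already in use:" $busy
else
  echo "All crawler ports appear to be free."
fi
```

Run it before docker-compose up. Any port it reports needs either the conflicting service stopped or the matching port mapping changed in docker/docker-compose.yml. If nc is missing, every probe fails and all ports are reported as free, so check that nc is available first.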

Updating the default Metabase for Docker configurations:

The Metabase configured by default has some canned queries and a dashboard. If you want to clear that out and start fresh, do the following:

  1. Ensure you're starting from a completely clean container (docker-compose down && docker-compose up).
  2. Crawl a small org to populate Mongo so you have schema/sample data to work with.
  3. Open the Metabase URL and configure the questions, dashboard, etc. you want.
  4. REMEMBER: Any changes you make will be persisted.
  5. Copy the Metabase database by changing to the docker/metabase folder in the GHCrawler repository and running:
  docker cp docker_metabase_1:/var/opt/metabase/dockercrawler.db.mv.db .