Configuration

Configuring the crawler has two aspects: infrastructure settings and runtime configuration. The infrastructure settings identify the storage, queuing, and Redis services, and define crawler instance identifiers. The runtime configuration covers topics such as concurrency, timeouts, and org filters. Most (though not all) of the runtime configuration can be changed dynamically and takes effect on the currently running crawlers. Some runtime configuration and all infrastructure settings require a restart of the crawler processes to take effect.

Runtime configuration

The runtime configuration is expressed as JSON and is broken into a discrete object for each of the crawler subsystems, each of which is detailed below. Because many instances of the crawler can work together at once, the configuration is shared in Redis and changed centrally. Each crawler subscribes to changes, so changing a configuration for one crawler changes it for all crawlers with the same name.
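The sharing model described above can be pictured with a small in-memory sketch. It is not the crawler's actual implementation and does not talk to Redis: `SharedConfigStore`, `subscribe`, and `update` are hypothetical names, and `concurrency` stands in for whichever runtime option is being changed.

```typescript
// Illustrative in-memory stand-in for the configuration shared in Redis.
// All names here (SharedConfigStore, subscribe, update) are hypothetical.
type SubsystemConfig = Record<string, unknown>;
type Listener = (config: SubsystemConfig) => void;

class SharedConfigStore {
  // One configuration object and one listener list per crawler name.
  private configs: Record<string, SubsystemConfig> = {};
  private listeners: Record<string, Listener[]> = {};

  // A crawler instance registers under its crawler name and, if a
  // configuration already exists for that name, receives a copy of it.
  subscribe(crawlerName: string, listener: Listener): void {
    const list = this.listeners[crawlerName] || [];
    list.push(listener);
    this.listeners[crawlerName] = list;
    const current = this.configs[crawlerName];
    if (current !== undefined) {
      listener({ ...current });
    }
  }

  // A central change is merged into the stored configuration and pushed
  // to every subscriber registered under the same crawler name.
  update(crawlerName: string, patch: SubsystemConfig): void {
    const next = { ...(this.configs[crawlerName] || {}), ...patch };
    this.configs[crawlerName] = next;
    for (const listener of this.listeners[crawlerName] || []) {
      listener({ ...next });
    }
  }
}

// Two instances share the name "crawler-a"; a third uses a different name.
const store = new SharedConfigStore();
const seen: Record<string, SubsystemConfig> = {};
store.subscribe("crawler-a", (c) => { seen["a1"] = c; });
store.subscribe("crawler-a", (c) => { seen["a2"] = c; });
store.subscribe("crawler-b", (c) => { seen["b1"] = c; });

// One update reaches both "crawler-a" instances and leaves "crawler-b" untouched.
store.update("crawler-a", { concurrency: 4 });
```

In this sketch the crawler name is the only key that decides who receives a change, which mirrors the statement that a change applies to all crawlers with the same name.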

Crawler

Infrastructure settings

{
  "NODE_ENV": "localhost",
  "CRAWLER_MODE": "Standard",
  "CRAWLER_OPTIONS_PROVIDER": ["defaults" | "memory" | "redis"],
  "CRAWLER_INSIGHTS_KEY": "[SECRET]",
  "CRAWLER_ORGS_FILE": "../orgs",
  "CRAWLER_GITHUB_TOKENS": "[SECRET]",
  "CRAWLER_REDIS_URL": "peoplesvc-dev.redis.cache.windows.net",
  "CRAWLER_REDIS_ACCESS_KEY": "[SECRET]",
  "CRAWLER_REDIS_PORT": 6380,
  "CRAWLER_QUEUE_PROVIDER": "amqp10",
  "CRAWLER_AMQP10_URL": "amqps://RootManageSharedAccessKey:[SECRET]@ghcrawlerdev.servicebus.windows.net",
  "CRAWLER_QUEUE_PREFIX": "ghcrawlerdev",
  "CRAWLER_STORE_PROVIDER": "azure",
  "CRAWLER_STORAGE_NAME": "ghcrawlerdev",
  "CRAWLER_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_STORAGE_KEY": "[SECRET]",
  "CRAWLER_DOCLOG_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_DOCLOG_STORAGE_KEY": "[SECRET]"
}
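
Note that the `CRAWLER_OPTIONS_PROVIDER` entry in the listing is shorthand for choosing one value among `defaults`, `memory`, and `redis`, so the listing is not literal JSON as written. A minimal sketch of checking that choice, along with the Redis port, before starting a crawler might look like the following. The helper name `checkInfrastructureSettings` and the specific checks are assumptions for illustration; the crawler's own startup validation may differ.

```typescript
// Hypothetical helper; not part of the crawler's documented API.
const OPTIONS_PROVIDERS = ["defaults", "memory", "redis"];

type Settings = Record<string, string | number | undefined>;

function checkInfrastructureSettings(settings: Settings): string[] {
  const problems: string[] = [];

  // The listing offers exactly three choices for the options provider.
  const provider = settings["CRAWLER_OPTIONS_PROVIDER"];
  if (typeof provider !== "string" || OPTIONS_PROVIDERS.indexOf(provider) === -1) {
    problems.push("CRAWLER_OPTIONS_PROVIDER must be one of: " + OPTIONS_PROVIDERS.join(", "));
  }

  // The port may be a number (as in the listing) or a string, for example
  // when it comes from an environment variable. Only checked when present.
  const rawPort = settings["CRAWLER_REDIS_PORT"];
  if (rawPort !== undefined) {
    const port = Number(rawPort);
    if (!(port > 0 && port < 65536 && Math.floor(port) === port)) {
      problems.push("CRAWLER_REDIS_PORT must be an integer between 1 and 65535");
    }
  }
  return problems;
}

// A valid pair of settings yields no problems; an invalid pair yields two.
const ok = checkInfrastructureSettings({ CRAWLER_OPTIONS_PROVIDER: "redis", CRAWLER_REDIS_PORT: 6380 });
const bad = checkInfrastructureSettings({ CRAWLER_OPTIONS_PROVIDER: "disk", CRAWLER_REDIS_PORT: "abc" });
```

Returning a list of problems rather than throwing on the first one lets all misconfigured settings be reported in a single pass, which is useful given that infrastructure settings only take effect after a restart.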
