This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Configuration

Jeff McAffer edited this page Mar 11, 2017 · 10 revisions

There are two aspects to configuring the crawler: infrastructure settings and runtime configuration. The infrastructure settings identify the storage, queuing, and Redis services, and define crawler instance identifiers. The runtime configuration covers topics such as concurrency, timeouts, and org filters. Most (though not all) runtime configuration values can be changed dynamically and affect the currently running crawlers; the remaining configuration values, and all infrastructure settings, require a restart of the crawler processes to take effect.

Runtime configuration

The runtime configuration is expressed as JSON and is broken into a discrete object for each of the crawler subsystems, each of which is detailed below. Since many instances of the crawler can work together at once, the configuration is shared in Redis and changed centrally. Each crawler subscribes to changes, so changing a configuration value for one crawler changes it for all crawlers with the same name.

restart -- This configuration value requires a restart of each associated crawler instance

dynamic -- Changes to this configuration value take effect immediately across all associated crawlers

Crawler

A typical crawler configuration looks like

crawler: {
  name: config.get('CRAWLER_NAME') || 'crawler',
  count: 0,
  pollingDelay: 5000,
  processingTtl: 60 * 1000,
  orgList: CrawlerFactory.loadOrgs()
}

name restart -- The name of this crawler. The name is used in several places to qualify shared resources such as database and queue names. It is important that associated crawlers use the same name, and that unrelated crawlers sharing infrastructure use different names.

count dynamic -- Each crawler can run many concurrent processing loops. Setting the count to 0 stops processing. Most of the crawler's processing is light on CPU, so bump up this number until the crawler's Node processes max out the core on which they are running. Since config values are shared, this affects all associated crawlers, so watch the behavior if using heterogeneous compute platforms.

pollingDelay dynamic -- Each loop (see count) polls the queues for work. If there is no work, the loop waits for pollingDelay milliseconds before polling again.
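The polling behavior above can be sketched as a simple loop. The `queue.pop` and `handle` shapes here are hypothetical stand-ins, not the crawler's real queue or processing API:

```javascript
const pollingDelay = 5000; // milliseconds, as in the configuration above

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// One processing loop: pop a request; if the queue is empty, back off for
// pollingDelay before polling again. `iterations` is only here so the sketch
// can be run for a bounded number of passes.
async function processingLoop(queue, handle, iterations = Infinity) {
  for (let i = 0; i < iterations; i++) {
    const request = await queue.pop();
    if (!request) {
      await sleep(pollingDelay);
      continue;
    }
    await handle(request);
  }
}
```

A crawler with count set to N would run N such loops concurrently against the shared queues.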

processingTtl dynamic -- Crawler requests have a unique signature based on their type, URL, and policy. Requests with the same signature must be processed sequentially. The processingTtl is the upper bound, in milliseconds, on how long the lock on a request signature will live.
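The idea of a signature lock with a TTL can be sketched in memory. The real crawler takes this lock against a shared service so that all instances see it; the class below is illustrative only, and the signature strings are made up:

```javascript
const processingTtl = 60 * 1000; // default from the configuration above

// In-memory sketch of a per-request-signature lock with a time-to-live.
// The TTL guarantees a stuck or crashed processor cannot hold a signature forever.
class SignatureLocks {
  constructor(ttl = processingTtl) {
    this.ttl = ttl;
    this.locks = new Map(); // signature -> expiry timestamp (ms)
  }

  // Try to acquire the lock for a signature; returns true on success.
  // An expired lock is treated as free and reclaimed.
  acquire(signature, now = Date.now()) {
    const expiry = this.locks.get(signature);
    if (expiry !== undefined && expiry > now) return false; // still held
    this.locks.set(signature, now + this.ttl);
    return true;
  }

  release(signature) {
    this.locks.delete(signature);
  }
}
```

Two loops attempting the same signature would serialize: the second acquire fails until the first releases the lock or its TTL elapses.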

orgList dynamic -- If specified, the crawler will only queue and process requests for entities in the listed set of GitHub orgs.
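The org filter amounts to a simple membership check before a request is queued or processed. The request shape and org names below are assumptions for illustration:

```javascript
const orgList = ['contoso', 'fabrikam']; // hypothetical org list

// A request is queued/processed only if its org is in the configured list.
// An empty or missing list means no filtering at all.
function shouldProcess(request, orgs = orgList) {
  if (!orgs || orgs.length === 0) return true;
  return orgs.includes(request.org.toLowerCase());
}
```

Because orgList is dynamic, updating it centrally narrows or widens what all associated crawlers will accept without a restart.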

Fetcher

Queuing

Storage

Locker

Infrastructure settings

The infrastructure settings are supplied to the crawler process (for example, as environment variables) and require a restart to change. A typical set of values looks like

{
  "NODE_ENV": "localhost",
  "CRAWLER_MODE": "Standard",
  "CRAWLER_OPTIONS_PROVIDER": ["defaults" | "memory" | "redis"],
  "CRAWLER_INSIGHTS_KEY": "[SECRET]",
  "CRAWLER_ORGS_FILE": "../orgs",
  "CRAWLER_GITHUB_TOKENS": "[SECRET]",
  "CRAWLER_REDIS_URL": "peoplesvc-dev.redis.cache.windows.net",
  "CRAWLER_REDIS_ACCESS_KEY": "[SECRET]",
  "CRAWLER_REDIS_PORT": 6380,
  "CRAWLER_QUEUE_PROVIDER": "amqp10",
  "CRAWLER_AMQP10_URL": "amqps://RootManageSharedAccessKey:[SECRET]@ghcrawlerdev.servicebus.windows.net",
  "CRAWLER_QUEUE_PREFIX": "ghcrawlerdev",
  "CRAWLER_STORE_PROVIDER": "azure",
  "CRAWLER_STORAGE_NAME": "ghcrawlerdev",
  "CRAWLER_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_STORAGE_KEY": "[SECRET]",
  "CRAWLER_DOCLOG_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_DOCLOG_STORAGE_KEY": "[SECRET]"
}
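Since these settings are read once at process startup, changing any of them requires restarting the crawler. A minimal sketch of reading one of them, assuming environment variables as the source (`getSetting` is a hypothetical helper, not the crawler's actual config API):

```javascript
// Read an infrastructure setting from the environment, with an optional default.
// Hypothetical helper for illustration; the crawler reads these values at startup.
function getSetting(name, defaultValue) {
  const value = process.env[name];
  return value === undefined ? defaultValue : value;
}

// e.g. the Redis port above, with its documented default
const redisPort = Number(getSetting('CRAWLER_REDIS_PORT', 6380));
```

Secrets such as `CRAWLER_STORAGE_KEY` should come from a secret store or protected environment rather than being committed in a file.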