# Configuration
There are two aspects to configuring the crawler: infrastructure settings and runtime configuration. The infrastructure settings identify the storage, queuing, and Redis services, and define crawler instance identifiers. The runtime configuration covers topics such as concurrency, timeouts, and org filters. Most (though not all) of the runtime configuration can be changed dynamically and affects the currently running crawlers; the remaining configuration values and all infrastructure settings require a restart of the crawler processes to take effect.
The runtime configuration is expressed as JSON and is broken into a discrete object for each of the crawler subsystems, each of which is detailed below. Since you can have many instances of the crawler working together at once, the configuration is shared in Redis and changed centrally. Each crawler subscribes to changes, so changing a configuration value for one changes it for all crawlers with the same name.
* `restart` -- This configuration value requires a restart of each associated crawler instance.
* `dynamic` -- Changes to this configuration value take effect immediately across all associated crawlers.
* `config.get('NAME_HERE')` -- Identifies a value that comes either from an environment variable or an `env.json` file. See the [Infrastructure settings] below for details.
## Crawler

The crawler subsystem is the engine that continually loops, getting and processing requests. A typical `crawler` configuration looks like the following:

```javascript
crawler: {
  name: config.get('CRAWLER_NAME') || 'crawler',
  count: 0,
  pollingDelay: 5000,
  processingTtl: 60 * 1000,
  orgList: CrawlerFactory.loadOrgs()
}
```
### name

`static` -- The name of this crawler. The name of a crawler is used in several places to qualify shared resources such as database and queue names. It is important that associated crawlers use the same name and unrelated crawlers use different names if they are sharing infrastructure.
### count

`dynamic` -- Each crawler can run many concurrent processing loops. Setting the `count` to 0 stops processing. Most of the processing the crawler does is light on CPU. Bump up this number until the crawler node processes max out the core on which they are running. Since config values are shared, this affects all associated crawlers, so watch the behavior if using heterogeneous compute platforms.
### pollingDelay

`dynamic` -- Each loop (see `count`) polls the queues for work. If there is no work, the loop will wait for `pollingDelay` milliseconds.
### processingTtl

`dynamic` -- Crawler requests have a unique signature based on their type, url and policy. Requests with the same request signature must be processed sequentially. The `processingTtl` is the upper bound on the number of milliseconds the lock on a request signature will live.
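The lock's behavior can be sketched with an in-memory map keyed by signature, where each lock simply expires after `processingTtl` milliseconds. This is an illustrative stand-in with hypothetical names; the crawler itself coordinates this lock through Redis:

```javascript
// Sketch: a request-signature lock whose lifetime is bounded by processingTtl.
const locks = new Map(); // signature -> expiry timestamp (ms)

function tryLock(signature, processingTtl, now = Date.now()) {
  const expiry = locks.get(signature);
  if (expiry !== undefined && expiry > now) {
    return false; // an identical request is already being processed
  }
  // Take the lock; it dies on its own after processingTtl ms even if never released.
  locks.set(signature, now + processingTtl);
  return true;
}

function unlock(signature) {
  locks.delete(signature);
}
```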
### orgList

`dynamic` -- If specified, the crawler will only queue and process requests for entities in the listed set of GitHub orgs.
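The filter amounts to a membership check on the org segment of a request's GitHub URL. A minimal sketch, assuming API URLs of the usual `api.github.com/repos/<org>/<repo>` shape (the function name and URL parsing here are illustrative, not the crawler's actual filter code):

```javascript
// Sketch: decide whether a request URL falls inside the configured org list.
function inOrgList(orgList, url) {
  if (!orgList || orgList.length === 0) {
    return true; // no filter configured: process everything
  }
  // GitHub API urls typically look like https://api.github.com/repos/<org>/<repo>/...
  const match = url.match(/api\.github\.com\/(?:repos|orgs|users)\/([^/]+)/);
  if (!match) {
    return true; // not an org-scoped url; let it through
  }
  const lowered = orgList.map(org => org.toLowerCase());
  return lowered.includes(match[1].toLowerCase());
}
```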
## Fetcher

The fetcher materializes resources referenced by or included in requests. Typically this means a REST call to the GitHub API or fetching content from the crawler's store of responses. A typical `fetcher` configuration looks like the following:
```javascript
fetcher: {
  baselineFrequency: 60,  // seconds
  callCapLimit: 30,       // calls
  callCapWindow: 1,       // seconds
  computeLimit: 15000,    // milliseconds
  computeLimitStore: 'memory',
  computeWindow: 15,      // seconds
  deferDelay: 500,
  metricsStore: 'redis',
  tokenLowerBound: 50
}
```
### baselineFrequency

`dynamic` -- How often to ping GitHub to check network latency. In an effort to treat GitHub's APIs well, the crawler bounds the amount of compute it is asking for. To do this it periodically senses the network latency to GitHub. This number is then subtracted from a request's round trip time to get an idea of the compute cost of a request.
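The cost estimate described above reduces to a subtraction, clamped so a fast response never counts as negative compute. A sketch with hypothetical names:

```javascript
// Sketch: estimate GitHub compute milliseconds for one API call.
// baselineLatencyMs is refreshed every baselineFrequency seconds by pinging GitHub.
function computeCost(roundTripMs, baselineLatencyMs) {
  // Subtract network time; never report a negative compute cost.
  return Math.max(0, roundTripMs - baselineLatencyMs);
}
```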
### callCapLimit

`dynamic` -- The max number of calls to GitHub a given crawler (Node) process will do per `callCapWindow`. This avoids flooding GitHub with requests.
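Such a cap can be sketched as a counter that resets each `callCapWindow`. This is an illustrative per-process sketch (names are hypothetical), not the crawler's actual limiter:

```javascript
// Sketch: allow at most callCapLimit calls per callCapWindow seconds.
function makeCallCap(callCapLimit, callCapWindow) {
  let windowStart = 0;
  let calls = 0;
  return function tryCall(now) { // now in milliseconds
    if (now - windowStart >= callCapWindow * 1000) {
      windowStart = now; // start a fresh window
      calls = 0;
    }
    if (calls >= callCapLimit) {
      return false; // over the cap: the caller should defer the request
    }
    calls++;
    return true;
  };
}
```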
### computeLimit

`dynamic` -- The amount of GitHub compute milliseconds to consume per `computeWindow` seconds of wall-clock time. Generally you should keep the limit and the window about the same amount of time. While not perfect, that seems to avoid GitHub's secondary throttling.
### computeLimitStore

`dynamic` -- Where to track consumed GitHub compute time. Valid values are `memory` and `redis`. Use `redis` if you want to coordinate compute across a set of processes on one machine (IP address).
### computeWindow

`dynamic` -- The number of seconds over which to track GitHub compute usage. See `computeLimit`.
### tokenLowerBound

`dynamic` -- The number of GitHub API calls to leave on a token before setting it aside until GitHub resets the token (every hour). Set it at least as big as the total loop `count` across all crawlers using a shared token. The first crawler loop to hit the lower bound will bench the token, but the other loops may not notice. Give them a chance to avoid hitting GitHub's rate limiting.
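The benching decision can be sketched as: once a token's remaining calls dip to the bound, set it aside until its reset time. The field names below mirror GitHub's `X-RateLimit-Remaining` / `X-RateLimit-Reset` response headers; the helper names are hypothetical:

```javascript
// Sketch: bench a token when fewer than tokenLowerBound calls remain on it.
function shouldBench(token, tokenLowerBound, now = Date.now()) {
  if (token.benchedUntil && token.benchedUntil > now) {
    return true; // still benched from an earlier decision
  }
  return token.remaining <= tokenLowerBound;
}

function bench(token, resetEpochSeconds) {
  // GitHub resets tokens hourly; sit this one out until then.
  token.benchedUntil = resetEpochSeconds * 1000;
}
```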
## Infrastructure settings

The infrastructure settings are supplied as environment variables or via an `env.json` file (see `config.get` above). A typical set of settings looks like the following:

```json
{
  "NODE_ENV": "localhost",
  "CRAWLER_MODE": "Standard",
  "CRAWLER_OPTIONS_PROVIDER": ["defaults" | "memory" | "redis"],
  "CRAWLER_INSIGHTS_KEY": "[SECRET]",
  "CRAWLER_ORGS_FILE": "../orgs",
  "CRAWLER_GITHUB_TOKENS": "[SECRET]",
  "CRAWLER_REDIS_URL": "peoplesvc-dev.redis.cache.windows.net",
  "CRAWLER_REDIS_ACCESS_KEY": "[SECRET]",
  "CRAWLER_REDIS_PORT": 6380,
  "CRAWLER_QUEUE_PROVIDER": "amqp10",
  "CRAWLER_AMQP10_URL": "amqps://RootManageSharedAccessKey:[SECRET]@ghcrawlerdev.servicebus.windows.net",
  "CRAWLER_QUEUE_PREFIX": "ghcrawlerdev",
  "CRAWLER_STORE_PROVIDER": "azure",
  "CRAWLER_STORAGE_NAME": "ghcrawlerdev",
  "CRAWLER_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_STORAGE_KEY": "[SECRET]",
  "CRAWLER_DOCLOG_STORAGE_ACCOUNT": "ghcrawlerdev",
  "CRAWLER_DOCLOG_STORAGE_KEY": "[SECRET]"
}
```