Configuration
There are two aspects to configuring the crawler: infrastructure settings and runtime configuration. The infrastructure settings include storage, queuing, and Redis service location identifiers and credentials, and define crawler instance identifiers. The runtime configuration covers topics such as concurrency, timeouts, and org filters. Most (though not all) of the runtime configuration can be changed dynamically and affects the currently running crawlers; the remaining configuration values and all infrastructure settings require a restart of the crawler processes to take effect.
The runtime configuration is expressed as JSON and is broken into a discrete object for each of the crawler subsystems, each of which is detailed below. As you can have many instances of the crawler working together at once, the configuration is shared in Redis and changed centrally. Each crawler subscribes to changes, so changing a configuration value for one changes it for all crawlers with the same name.
static -- This configuration value requires a restart of each associated crawler instance to take effect.
dynamic -- Changes to this configuration value take effect immediately across all associated crawlers.
config.get('NAME_HERE') -- Identifies a value that comes either from an environment variable or an env.json file. See the [Infrastructure settings] below for details.
The crawler subsystem is the engine that continually loops getting and processing requests. A typical crawler
configuration looks like the following:
```javascript
crawler: {
  name: config.get('CRAWLER_NAME') || 'crawler',
  count: 0,
  pollingDelay: 5000,
  processingTtl: 60 * 1000,
  orgList: CrawlerFactory.loadOrgs()
}
```
name
static -- The name of this crawler. The name of a crawler is used in several places to qualify shared resources such as database and queue names. It is important that associated crawlers use the same name and unrelated crawlers use different names if they are sharing infrastructure.
count
dynamic -- Each crawler can run many concurrent processing loops. Setting the count to 0 stops processing. Most of the processing the crawler does is light on CPU, so bump up this number until the crawler's Node processes max out the cores on which they are running. Since config values are shared, this affects all associated crawlers, so watch the behavior if using heterogeneous compute platforms.
pollingDelay
dynamic -- Each loop (see count) polls the queues for work. If there is no work, the loop waits for pollingDelay milliseconds.
processingTtl
dynamic -- Crawler requests have a unique signature based on their type, url, and policy. Requests with the same signature must be processed sequentially. The processingTtl is the upper bound, in milliseconds, on how long the lock on a request signature will live.
orgList
dynamic -- If specified, the crawler will only queue and process requests for entities in the listed set of GitHub orgs.
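Putting the crawler options together, a single processing loop behaves roughly like the sketch below. This is a simplified illustration, not the crawler's actual implementation: queue access and request handling are stubbed, and the orgList filter is shown as a simple membership check.

```javascript
// Simplified sketch of one crawler processing loop: poll for work, wait
// pollingDelay ms when the queues are empty, apply the orgList filter,
// and stop when the configured count drops to zero. Illustrative only.
const options = { count: 1, pollingDelay: 5000, orgList: ['contoso'] };

async function runLoop(queue, processRequest,
                       delay = ms => new Promise(resolve => setTimeout(resolve, ms))) {
  while (options.count > 0) {
    const request = queue.pop();
    if (!request) {
      await delay(options.pollingDelay); // no work; back off before polling again
      continue;
    }
    // orgList acts as a filter: skip requests outside the configured orgs
    if (options.orgList && !options.orgList.includes(request.org)) continue;
    await processRequest(request);
  }
}
```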
The fetcher materializes resources referenced by or included in requests. Typically this means a REST call to the GitHub API or fetching content from the crawler's store of previous responses. A typical fetcher configuration looks like the following:
```javascript
fetcher: {
  baselineFrequency: 60,      // seconds
  callCapLimit: 30,           // calls
  callCapWindow: 1,           // seconds
  computeLimit: 15000,        // milliseconds
  computeLimitStore: 'memory',
  computeWindow: 15,          // seconds
  deferDelay: 500,
  metricsStore: 'redis',
  tokenLowerBound: 50
}
```
baselineFrequency
dynamic -- How often, in seconds, to ping GitHub to check network latency. In an effort to treat GitHub's APIs well, the crawler bounds the amount of compute it asks for. To do this it periodically senses the network latency to GitHub. This latency is then subtracted from a request's round-trip time to get an idea of the compute cost of the request.
callCapLimit
dynamic -- The maximum number of calls to GitHub a given crawler (Node) process will make per callCapWindow. This avoids flooding GitHub with requests.
computeLimit
dynamic -- The amount of GitHub compute, in milliseconds, to consume per computeWindow seconds of wall-clock time. Generally you should keep the limit and the window about the same amount of time. While not perfect, that seems to avoid GitHub's secondary throttling.
computeLimitStore
dynamic -- Where to track consumed GitHub compute time. Valid values are memory and redis. Use redis if you want to coordinate compute across a set of processes on one machine (IP address).
computeWindow
dynamic -- The number of seconds over which to track GitHub compute usage. See computeLimit.
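The compute accounting described above can be sketched as a sliding window: each response contributes its round-trip time minus the sensed baseline latency, and work is refused once computeLimit milliseconds accumulate within computeWindow seconds. This is a hypothetical in-memory version; a redis-backed computeLimitStore would share the same state across processes.

```javascript
// In-memory sketch of the compute limiter: track (timestamp, cost)
// entries, where cost = round-trip ms - baseline network latency ms,
// and deny further work once the window's total exceeds computeLimit.
class ComputeLimiter {
  constructor(computeLimit, computeWindowSeconds, baselineLatency) {
    this.limit = computeLimit;            // e.g., 15000 ms of GitHub compute
    this.window = computeWindowSeconds * 1000;
    this.baseline = baselineLatency;      // sensed periodically in the real crawler
    this.entries = [];                    // [{ time, cost }]
  }

  record(roundTripMs, now = Date.now()) {
    const cost = Math.max(0, roundTripMs - this.baseline);
    this.entries.push({ time: now, cost });
  }

  consumed(now = Date.now()) {
    this.entries = this.entries.filter(entry => now - entry.time < this.window);
    return this.entries.reduce((sum, entry) => sum + entry.cost, 0);
  }

  allowed(now = Date.now()) {
    return this.consumed(now) < this.limit;
  }
}
```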
deferDelay
dynamic -- When the fetcher is unable to get a token, delay the requesting loop by deferDelay milliseconds so we don't feverishly retry benched tokens.
metricsStore
static -- The provider to use for storing fetcher metrics like number of hits on GitHub.
tokenLowerBound
dynamic -- The number of GitHub API calls to leave on a token before setting it aside until GitHub resets the token (every hour). Set this at least as big as the total loop count across all crawlers using a shared token: the first crawler loop to hit the lower bound will bench the token, but the other loops may not notice immediately. Give them headroom to avoid hitting GitHub's rate limiting.
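A minimal sketch of this benching behavior, assuming a hypothetical token record that carries GitHub's remaining-call count and reset time (field and function names here are illustrative, not the crawler's actual API):

```javascript
// Sketch of benching tokens near exhaustion: once the remaining call
// count reported by GitHub drops to tokenLowerBound, set the token
// aside until its reset time passes. All names are illustrative.
const tokenLowerBound = 50;

function pickToken(tokens, now = Date.now()) {
  for (const token of tokens) {
    if (token.benchedUntil && token.benchedUntil > now) continue; // still resting
    if (token.remaining <= tokenLowerBound) {
      token.benchedUntil = token.resetAt; // bench until GitHub resets it
      continue;
    }
    return token;
  }
  return null; // all tokens benched; caller defers by deferDelay ms
}
```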
The queuing subsystem is the heart of the crawler. Each request to process a resource goes on one of several queues, whether it is a user request, the result of an event, or derived from processing some other document. The queuing provider is pluggable. The key requirement of a provider implementation is that it is durable. Since the crawler process itself is stateless, all pending and discovered work is stored in the queues; loss of queue entries results in data loss. A typical setup is shown below. Note that some options are only relevant to particular providers.
```javascript
queuing: {
  provider: config.get('CRAWLER_QUEUE_PROVIDER') || 'amqp',
  queueName: config.get('CRAWLER_QUEUE_PREFIX') || 'crawler',
  weights: { events: 10, immediate: 3, soon: 2, normal: 3, later: 2 },
  parallelPush: 10,
  metricsStore: 'redis',
  events: {
    provider: config.get('CRAWLER_EVENT_PROVIDER') || 'webhook',
    topic: config.get('CRAWLER_EVENT_TOPIC_NAME') || 'crawler',
    queueName: config.get('CRAWLER_EVENT_QUEUE_NAME') || 'crawler'
  },
  attenuation: {
    ttl: 3000
  },
  tracker: {
    ttl: 60 * 60 * 1000
  },
  socketOptions: {},
  credit: 100,
  messageSize: 240,
  pushRateLimit: 200
}
```
provider
static -- The name of the provider to use for queuing. Current valid values are amqp, amqp10, and memory. In the amqp* cases you are free to use whatever infrastructure supports the selected queuing protocol. For example, RabbitMQ supports AMQP and Azure Service Bus supports AMQP 1.0. Configurations for these technologies are provided with the crawler. Others can be added.
queueName
static -- The queue name prefix to use. It is common to use the same queuing infrastructure for multiple scenarios (e.g., development and production). Using the queueName you can differentiate your development queues from those of others and from the production queues. It is important that this be unique amongst all distinct crawler configurations sharing the same infrastructure.
weights
dynamic -- The crawler runs with five queues which are consulted in a fixed order (events, immediate, soon, normal, later), starting at a randomly chosen queue. This property sets the relative weight of each queue when picking that starting point. If the consulted queue does not have any messages, the next queue in the order is consulted.
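The weighted start-point selection can be sketched as follows. This is a hypothetical helper, with queue names and weights taken from the example configuration; the wrap-around fall-through is an assumption about how "the next queue in the order" behaves at the end of the list.

```javascript
// Sketch of weighted queue selection: choose a starting queue with
// probability proportional to its weight, then fall through the order
// (events, immediate, soon, normal, later) until a message is found.
const order = ['events', 'immediate', 'soon', 'normal', 'later'];
const weights = { events: 10, immediate: 3, soon: 2, normal: 3, later: 2 };

function startIndex(random = Math.random()) {
  const total = order.reduce((sum, name) => sum + weights[name], 0);
  let threshold = random * total;
  for (let i = 0; i < order.length; i++) {
    threshold -= weights[order[i]];
    if (threshold < 0) return i;
  }
  return order.length - 1;
}

function popWeighted(queues, random = Math.random()) {
  const start = startIndex(random);
  for (let i = 0; i < order.length; i++) {
    const queue = queues[order[(start + i) % order.length]];
    if (queue.length) return queue.shift();
  }
  return null; // nothing queued anywhere
}
```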
parallelPush
dynamic -- When pushing batches of requests to a queue, run parallelPush pushes in parallel. This improves performance through concurrency.
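A sketch of pushing a batch with bounded parallelism (a hypothetical helper; the real queue providers have their own push APIs):

```javascript
// Sketch of parallelPush: send a batch of requests in chunks of
// `parallelism` so that at most that many pushes are in flight at once.
async function pushAll(requests, pushOne, parallelism = 10) {
  for (let i = 0; i < requests.length; i += parallelism) {
    const chunk = requests.slice(i, i + parallelism);
    await Promise.all(chunk.map(pushOne)); // up to `parallelism` concurrent pushes
  }
}
```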
metricsStore
static -- The provider to use for storing queuing metrics related to queue pushes, pops, accepts and rejects.
attenuation.ttl
dynamic -- The millisecond window in which request pushes will be deduplicated. This prevents rapid-fire queuing of the same request. The scenario happens, for example, when processing a pull request and queuing the user, merged_by and assignee values which may all be the same person.
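The attenuation window can be sketched as a small cache keyed by request signature. This is an in-memory illustration of the idea, not the crawler's implementation:

```javascript
// Sketch of attenuation: remember each pushed request signature for
// ttl milliseconds and drop duplicates arriving inside that window.
class Attenuator {
  constructor(ttl) {
    this.ttl = ttl;                 // e.g., 3000 ms
    this.seen = new Map();          // signature -> last push time
  }

  shouldPush(signature, now = Date.now()) {
    const last = this.seen.get(signature);
    if (last !== undefined && now - last < this.ttl) return false; // duplicate
    this.seen.set(signature, now);
    return true;
  }
}
```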
tracker.ttl
dynamic -- Each request being processed is locked to prevent concurrent processing of a duplicate request by another crawler loop. This is the time to live for that lock. Should be set to greater than the max expected processing time for a request.
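The tracker behaves like a lock with an expiry: the lock lapses after ttl milliseconds so a crashed loop cannot strand a request forever. An in-memory sketch of the shape (the real provider shares this state via Redis):

```javascript
// In-memory sketch of the tracker: lock a request signature while it is
// being processed; the lock expires after ttl ms. Illustrative only.
class Tracker {
  constructor(ttl) {
    this.ttl = ttl;                 // e.g., 60 * 60 * 1000
    this.locks = new Map();         // signature -> expiry time
  }

  tryLock(signature, now = Date.now()) {
    const expiry = this.locks.get(signature);
    if (expiry !== undefined && expiry > now) return false; // another loop has it
    this.locks.set(signature, now + this.ttl);
    return true;
  }

  unlock(signature) {
    this.locks.delete(signature); // processing finished normally
  }
}
```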
credit
dynamic -- [AMQP10] The number of messages the queuing system can/should push to the crawler. This number needs to be bigger than the practical processing rate of the crawler given its CPU, network and loop count. Making this setting too big will cause unnecessary locking and redelivery of messages.
messageSize
static -- [AMQP10] The max size, in kilobytes, of a queued message. Some events come with quite a large payload and that payload is included in the queued request. We've had good success with ~240.
pushRateLimit
dynamic -- [AMQP10] To avoid overrunning the queuing system, set this to limit the number of requests pushed to the queues per second.
socketOptions
static -- [AMQP] Options to use when creating the socket(s) used to talk to the queuing system. In particular, this can be used to update the Certificate Authority chain when using services with SSL and self-signed or alternatively signed certificates.
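For example, a self-signed RabbitMQ endpoint might be configured along these lines. This is illustrative: the exact option names depend on the AMQP client library in use, and the certificate path is a placeholder.

```javascript
// Illustrative socketOptions adding a custom Certificate Authority.
// Option names vary by AMQP client; the path is hypothetical.
const fs = require('fs');

const queuingConfig = {
  socketOptions: {
    ca: [fs.readFileSync('/path/to/your-ca.pem')]  // hypothetical CA bundle path
  }
};
```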
The storage subsystem maintains all the previously seen API responses. It also supports a delta store that can be used to siphon off updates and feed them into other systems (e.g., to drive an eventing mechanism that responds to new content). There are out-of-the-box providers based on MongoDB (mongo), Azure Blob storage (azure), and in-memory storage (memory). An example configuration is shown here:
```javascript
storage: {
  ttl: 3 * 1000,
  provider: config.get('CRAWLER_STORE_PROVIDER') || 'azure',
  delta: {
    provider: config.get('CRAWLER_DELTA_PROVIDER')
  }
}
```
provider
static -- The name of the provider to use for storage. Valid values currently are memory, azure, and mongo. For azure or mongo you will also need to supply credential and location information in the Infrastructure settings.
ttl
dynamic -- Some providers cache recently read documents to reduce latency for tightly correlated calls like getting the etag and then getting the content. Control how long the cache lives by setting this property. Keep it tight, around 3 seconds.
delta.provider
static -- The name of the provider to use for handling updated documents. The delta provider is called in addition to the storage provider. Using this you can maintain change logs, alert on changes etc. Leave this property out if you do not want a delta store.
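A delta provider only needs to mirror the store's write path. A minimal sketch that records change entries alongside the primary store (the class and method names are assumptions for illustration, not the crawler's actual provider contract):

```javascript
// Sketch of a delta store layered over the real store: every upsert is
// forwarded to the primary provider and also recorded as a change log
// entry. Names are illustrative; a real provider might write to a blob
// container or publish to a topic instead of an in-memory array.
class DeltaStore {
  constructor(baseStore, log = []) {
    this.baseStore = baseStore;
    this.log = log;
  }

  async upsert(document) {
    await this.baseStore.upsert(document);               // normal storage path
    this.log.push({ savedAt: Date.now(), document });    // change record for consumers
  }
}
```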
The locker is the mechanism used to ensure that a given request (type and url) is only processed by one crawler at any given time. If you are using a single process system then an in-memory locker is fine. If you have multiple processes then you need some shared infrastructure like Redis to manage the locks. A typical configuration is as follows:
```javascript
locker: {
  provider: 'redis',
  retryCount: 3,
  retryDelay: 200
}
```
provider
static -- The name of the locker provider to use. Valid values are memory and redis.
retryCount
static -- [Redis] The number of times Redis should retry its operations.
retryDelay
static -- [Redis] The number of milliseconds to wait between Redis operation retries.
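The retry behavior can be sketched as a loop around a single lock attempt. This is an illustration of the shape only; in the real provider the attempt is a Redis operation, stubbed here as tryAcquire.

```javascript
// Sketch of acquiring a lock with up to retryCount retries spaced
// retryDelay ms apart. tryAcquire stands in for the Redis operation.
async function acquireLock(tryAcquire, retryCount = 3, retryDelay = 200,
                           delay = ms => new Promise(resolve => setTimeout(resolve, ms))) {
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    if (await tryAcquire()) return true;
    if (attempt < retryCount) await delay(retryDelay); // wait before retrying
  }
  return false; // lock never acquired; caller gives up on this request for now
}
```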
Infrastructure settings define how the system is structured and identify resources such as which services to use. In some cases these settings show up in the runtime configuration; in others they are purely internal. All of these settings can be supplied as environment variables (e.g., in Node's process.env) or in a painless-config env.json file. Changes to any infrastructure setting require a restart of the associated crawlers. An example env.json file is shown below.
```json
{
  "NODE_ENV": "production",
  "CRAWLER_SERVICE_PORT": "random",
  "CRAWLER_NAME": "<your crawler name>",
  "CRAWLER_OPTIONS_PROVIDER": "redis",
  "CRAWLER_MODE": "Standard",
  "CRAWLER_INSIGHTS_KEY": "<your key>",
  "CRAWLER_ORGS_FILE": "../orgs",
  "CRAWLER_GITHUB_TOKENS": "<semi-colon separated token/attribute list>",
  "========== Redis settings ==========": "",
  "CRAWLER_REDIS_URL": "<redis machine name>",
  "CRAWLER_REDIS_ACCESS_KEY": "<your secret>",
  "CRAWLER_REDIS_PORT": "6380",
  "CRAWLER_REDIS_TLS": "true",
  "========== Queue settings ==========": "",
  "CRAWLER_QUEUE_PROVIDER": "amqp",
  "CRAWLER_AMQP_URL": "amqps://<user:password>@machine:5671",
  "CRAWLER_RABBIT_MANAGER_ENDPOINT": "https://<user:password>@<machine>:15672",
  "CRAWLER_QUEUE_PREFIX": "<your prefix>",
  "========== Event settings ==========": "",
  "CRAWLER_EVENT_PROVIDER": "webhook",
  "CRAWLER_WEBHOOK_SECRET": "<your secret>",
  "========== Storage settings ==========": "",
  "CRAWLER_STORE_PROVIDER": "azure",
  "CRAWLER_STORAGE_ACCOUNT": "<account name>",
  "CRAWLER_STORAGE_NAME": "<your storage differentiator>",
  "CRAWLER_STORAGE_KEY": "<your secret>",
  "CRAWLER_DELTA_PROVIDER": "azure",
  "CRAWLER_DELTA_STORAGE_ACCOUNT": "<delta storage account>",
  "CRAWLER_DELTA_STORAGE_KEY": "<your secret>"
}
```
NODE_ENV
-- The standard Node environment values are valid here. This value is used to differentiate resources in shared infrastructure like Redis.
CRAWLER_SERVICE_PORT
-- The port on which the crawler service should listen. This is useful for connecting the dashboard or CLI. Set this to a port number, or "random" if you are running multiple crawler processes on the same machine.
CRAWLER_NAME
-- The name of the crawler. See name in the crawler runtime configuration.
CRAWLER_OPTIONS_PROVIDER
-- The provider used to store and share the runtime configuration. Use redis to share options across associated crawlers.
CRAWLER_MODE
--
CRAWLER_INSIGHTS_KEY
-- The Application Insights instrumentation key used for logging and metrics.
CRAWLER_ORGS_FILE
-- Path to a file listing the GitHub orgs to process. Used to populate orgList in the crawler runtime configuration.
CRAWLER_GITHUB_TOKENS
-- A semi-colon separated list of GitHub API tokens and their attributes used to make API calls.
CRAWLER_REDIS_URL
-- The host name of the Redis service.
CRAWLER_REDIS_ACCESS_KEY
-- The access key for the Redis service.
CRAWLER_REDIS_PORT
-- The port on which to connect to the Redis service.
CRAWLER_REDIS_TLS
-- Set to "true" to connect to the Redis service over TLS.
CRAWLER_QUEUE_PROVIDER
-- The queuing provider to use. See provider in the queuing runtime configuration.
CRAWLER_AMQP_URL
-- [AMQP] The connection URL for the AMQP service, including credentials.
CRAWLER_RABBIT_MANAGER_ENDPOINT
-- [AMQP] The RabbitMQ management API endpoint, including credentials.
CRAWLER_QUEUE_PREFIX
-- The queue name prefix. See queueName in the queuing runtime configuration.
CRAWLER_EVENT_PROVIDER
-- The event provider to use (e.g., webhook).
CRAWLER_WEBHOOK_SECRET
-- The secret used to validate incoming GitHub webhook events.
CRAWLER_STORE_PROVIDER
-- The storage provider to use. See provider in the storage runtime configuration.
CRAWLER_STORAGE_ACCOUNT
-- [Azure] The storage account in which to store documents.
CRAWLER_STORAGE_NAME
-- [Azure] The name used to differentiate this crawler's storage within the account.
CRAWLER_STORAGE_KEY
-- [Azure] The access key for the storage account.
CRAWLER_DELTA_PROVIDER
-- The delta store provider to use, if any. See delta.provider in the storage runtime configuration.
CRAWLER_DELTA_STORAGE_ACCOUNT
-- [Azure] The storage account in which to store delta documents.
CRAWLER_DELTA_STORAGE_KEY
-- [Azure] The access key for the delta storage account.