
Architecture


GHCrawler works like a traditional web crawler except it is crawling APIs rather than web pages. The crawler itself is a set of infinite loops where each iteration of the loop gets a request, fetches the corresponding resource, processes the response (optionally pushing more requests on the queues), and saves the processed document.

The key elements of the system are listed here and detailed further on this wiki:

  1. Request -- A description of a resource to crawl and how to process and traverse it
  2. Queues -- A collection of queues in priority order
  3. Fetcher -- A mechanism for GET'ing resources from GitHub or the crawler's previous work
  4. Processor -- A set of functions for processing the fetched resources
  5. Store -- A means of storing processed documents and providing them for future processing
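The sketch below shows one way these five elements might be shaped. It is illustrative TypeScript only; the names and signatures are assumptions made for this page, not the actual ghcrawler classes.

```typescript
// Illustrative shapes for the five key elements; names and signatures are
// assumptions for the purposes of this page, not the real ghcrawler API.

interface Request {
  type: string;                                   // e.g. "repo", "user", "issue"
  url: string;                                    // GitHub API URL for the entity
  policy: string;                                 // traversal policy to apply
  context: { qualifier?: string; [key: string]: unknown };
}

interface Queues {
  pop(): Promise<Request | null>;                 // next request in priority order
  push(requests: Request[]): Promise<void>;       // queue follow-on requests
}

interface Fetcher {
  fetch(request: Request): Promise<unknown>;      // GitHub response or a cached copy
}

interface Processor {
  process(request: Request, response: unknown): { document: unknown; newRequests: Request[] };
}

interface Store {
  upsert(document: unknown): Promise<void>;       // save a processed document
  get(type: string, url: string): Promise<unknown | null>;
}
```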

Execution model

To work with the crawler code you need to get into the mindset that there are only four primitive operations: Fetch, Process, Queue, and Save. Each iteration of the crawler loop pops a request off a queue, fetches at most one resource as indicated by the request, and produces at most one document to be saved. The processing of one request may queue any number of additional requests. The processing of a document must be completely stateless and, by the definition above, cannot fetch additional resources.
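Put together, one iteration of the loop might look like the hypothetical sketch below, which reuses the illustrative interfaces above; error handling is left out here and discussed next.

```typescript
// One iteration of the crawler loop expressed as the four primitives.
// Hypothetical code built on the illustrative interfaces sketched above.
async function processOne(queues: Queues, fetcher: Fetcher, processor: Processor, store: Store): Promise<void> {
  const request = await queues.pop();                  // take the next request, if any
  if (!request) return;

  const response = await fetcher.fetch(request);       // Fetch: at most one resource
  const { document, newRequests } = processor.process(request, response); // Process: stateless

  await queues.push(newRequests);                      // Queue: any number of follow-on requests
  if (document) {
    await store.upsert(document);                      // Save: at most one document
  }
}
```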

If an error happens anywhere in the crawler loop, the current request is requeued for reprocessing at a future time. If the same request is requeued too many times, it is put in a deadletter box for operator assessment. Of course, some errors are fatal and the request is deadlettered immediately.
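As a rough sketch of that retry behaviour, reusing the illustrative types above (the attempt limit and the fatal flag on the error are assumed names, not ghcrawler's actual fields):

```typescript
// Assumed retry/deadletter policy; the attempt limit and error shape are illustrative.
const MAX_ATTEMPTS = 5;

async function handleFailure(
  request: Request & { attempts?: number },
  error: Error & { fatal?: boolean },
  queues: Queues,
  deadLetter: Store
): Promise<void> {
  request.attempts = (request.attempts ?? 0) + 1;

  if (error.fatal || request.attempts >= MAX_ATTEMPTS) {
    await deadLetter.upsert({ request, error: error.message });  // park for operator assessment
  } else {
    await queues.push([request]);                                // requeue for a later retry
  }
}
```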

The execution of the crawler is depicted in the image below and detailed in the discussion that follows.

Requests

A request has four main parts (an example request is sketched after this list):

  • type -- The type of entity being requested. This typically matches the name of the entity in the GitHub API (e.g., the repo type refers to repositories, the user type to users, etc.).
  • url -- The GitHub API URL for the entity being requested. This is NOT the normal web URL that you browse in Chrome or Edge. For example, https://api.github.com/repos/contoso/test is the URL for the test repo in the contoso org.
  • policy -- The traversal policy to use when processing the requested entity.
  • context -- A JSON object that contains additional information to use when processing this request. Typically this info comes from the processing of a prior request.
    • qualifier -- the URN of the "parent" entity for this request. For example, the request to process an issue will have a context.qualifier of urn:repo:<repo number>. This value is then used when constructing the URN (unique identifier) for the issue.
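For example, a request to process an issue of the contoso/test repo might look like the sketch below; the issue number, the parent repo's number, and the policy name are made up for illustration.

```typescript
// Hypothetical request for issue 123 of contoso/test; the parent repo's number (42)
// and the "default" policy name are assumptions for illustration.
const issueRequest: Request = {
  type: "issue",
  url: "https://api.github.com/repos/contoso/test/issues/123",
  policy: "default",
  context: {
    qualifier: "urn:repo:42"   // URN of the parent repo, used to build the issue's URN
  }
};
```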

Queuing

The crawler typically maintains a set of queues that are consulted in priority order according to a simple weighting scheme. The queues are aptly named event, immediate, soon, normal, and later. Events coming in from GitHub are volatile and must be handled first. Most other requests are placed on the normal queue.
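One plausible reading of such a weighting scheme is a weighted pick across the named queues, as sketched below; the weights are invented for illustration and are not the crawler's actual configuration.

```typescript
// A weighted pick across the priority queues. The weights are illustrative only.
const weights: Record<string, number> = { event: 10, immediate: 5, soon: 3, normal: 2, later: 1 };
const queueNames = Object.keys(weights);

function chooseQueue(): string {
  const total = queueNames.reduce((sum, name) => sum + weights[name], 0);
  let roll = Math.random() * total;
  for (const name of queueNames) {
    roll -= weights[name];
    if (roll <= 0) return name;   // higher-weight queues are consulted more often
  }
  return "normal";
}
```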

The queue implementation is pluggable and can range from in-memory to AMQP-based technologies. There are a few key characteristics that are typical of queues:

  • deduplicated -- The same request (same type, url, and policy) should be on a queue only once. Given the iterative nature of the crawler processing, a request on the queue is guaranteed to be processed at some point; putting it on twice does not change that. Request processing should be idempotent, so processing two identical requests is harmless, but it wastes processing and GitHub API calls. A small in-memory deduplication sketch follows this list.

  • guaranteed -- A request placed on a queue must be guaranteed to be processed at some point, even if the request is pulled off and processing fails.

  • unordered -- While the queues themselves may be ordered (most are), processing and execution should not depend on that ordering. Any request may fail and be requeued, thus changing the ordering.
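A minimal in-memory sketch of that de-duplication, keyed on type, url, and policy (a real deployment would lean on the queue technology's own de-duplication where available):

```typescript
// In-memory de-duplication keyed on (type, policy, url); illustrative only.
function dedupKey(request: Request): string {
  return `${request.type}:${request.policy}:${request.url}`;
}

class DedupingQueue {
  private pending = new Map<string, Request>();

  push(request: Request): void {
    const key = dedupKey(request);
    if (!this.pending.has(key)) {
      this.pending.set(key, request);     // identical requests are silently dropped
    }
  }

  pop(): Request | undefined {
    for (const [key, request] of this.pending) {
      this.pending.delete(key);
      return request;                     // Map preserves insertion order
    }
    return undefined;
  }
}
```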

Fetching

Given a request popped off a queue, the crawler proceeds to fetch the requested entity. In practice this amounts to the following (a conditional-GET sketch follows the list):

  • Check cache -- Determine whether the resource has been fetched previously. If so, check for an etag that can be used to validate the cached copy with the origin. Using etags reduces call cost and helps avoid rate limiting.
  • Get a token -- Acquire a GitHub API token to authenticate the call and manage rate limit use.
  • Do the call -- Issue the GET against the GitHub API, passing the etag (if any) so an unchanged resource comes back as a 304 rather than a full response.
  • Annotate request -- Attach the fetched (or cached) document and relevant response metadata to the request so the processor can work with it.
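A hedged sketch of an etag-aware fetch is below; the cache shape and token handling are assumptions, not ghcrawler's actual fetcher.

```typescript
// Conditional GET using a cached etag. The cache shape and token argument are assumed.
// A 304 response lets the crawler reuse its cached copy and, per GitHub's API rules,
// does not count against the rate limit.
async function fetchEntity(
  request: Request,
  cache: Map<string, { etag: string; document: unknown }>,
  token: string
): Promise<unknown> {
  const cached = cache.get(request.url);
  const headers: Record<string, string> = { Authorization: `token ${token}` };
  if (cached) headers["If-None-Match"] = cached.etag;            // validate with the origin

  const response = await fetch(request.url, { headers });
  if (response.status === 304 && cached) return cached.document; // unchanged, reuse cached copy

  const document = await response.json();
  const etag = response.headers.get("etag");
  if (etag) cache.set(request.url, { etag, document });          // remember for next time
  return document;
}
```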

Processing
