This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Data architecture

Jeff McAffer edited this page Jun 9, 2017 · 2 revisions

The final output of the crawler is a store of JSON documents. Each document includes the full response from GitHub for a given entity, annotated with metadata that makes traversal and interconnection easier. Concretely, each output document has the following structure:

{
  <content from GitHub>
  "_metadata": {
    "type": "clones",
    "url": "https://api.github.com/repos/OSTC/php-sdk-binary-tools/traffic/clones",
    "fetchedAt": "2017-03-10T02:49:11.524Z",
    "links": {
      "self": {
        "href": "urn:repo:10075507:clones:2017_03_10",
        "type": "resource"
      },
      "repo": {
        "href": "urn:repo:10075507",
        "type": "resource"
      }
    },
    "etag": "\"be220f1fd43c5b6ac9b4302a2ed3b167\"",
    "version": 13,
    "processedAt": "2017-03-10T02:49:11.525Z"
  }
}

Most of these properties are self-explanatory. type and links deserve a closer look.

The type of a document corresponds to the type of entity in the GitHub API, typically as a direct 1:1 match. The same type name also shows up in crawler Requests, in the GitHubProcessor.js function names, and in the URN values in the links. The key is that they all match.

The links property is a collection of relationships between entities. Every document has a self link that is its unique, global identity. The href of a link is a URN. URNs are semantically unstructured (i.e., they should be treated as opaque strings), but syntactically they are type:id pairs separated by colons. So urn:repo:21 refers to the repo whose GitHub id is 21. URNs never include entity names, only ids, because names can change and ids do not.
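As a minimal sketch of the URN syntax described above, the helper below joins type and id segments with colons. The function name make_urn is hypothetical and is not part of the crawler; it only illustrates the shape of the strings.

```python
def make_urn(*segments):
    """Join type/id segments into a URN string such as 'urn:repo:21'.

    Hypothetical helper for illustration only; the crawler builds its
    URNs internally and consumers should treat them as opaque strings.
    """
    return ":".join(["urn", *map(str, segments)])


repo_urn = make_urn("repo", 21)
issues_urn = make_urn("repo", 6476660, "issues")
```

Because URNs are opaque, code that reads the store should compare them as whole strings rather than splitting them apart to recover ids.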

A document can have additional links; how many depends on the type of the document. The names of the links are generally the same as the related property name in the GitHub API response. So the clones document above is related to the repo identified in its repo link.

A document may also have a siblings link if it is a member of a collection of related documents. For example, the document for an issue might have the following link:

  "siblings": {
    "href": "urn:repo:6476660:issues",
    "type": "collection"
  },

That signifies that the issue is related to all the other issues that have a siblings link with the same href. This particular href value is the URN for a given repo with :issues on the end. So this siblings entry relates the issue to all the other issues in the same repo. Given this structure, you can find all the issues for a repo by querying the data store for all documents whose _metadata.links.siblings.href property equals the URN urn:repo:<repo id>:issues. This approach allows collection members (e.g., issues) to be added and removed by manipulating the individual documents -- no need to update a master list somewhere.
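The siblings query can be sketched against an in-memory list of documents. This is only an illustration: the find_siblings helper, the sample store, and the id values in it are made up, and a real deployment would express the same filter as a query on _metadata.links.siblings.href in whatever document store holds the crawler output.

```python
def find_siblings(documents, collection_urn):
    """Return every document whose siblings link points at collection_urn."""
    matches = []
    for doc in documents:
        links = doc.get("_metadata", {}).get("links", {})
        if links.get("siblings", {}).get("href") == collection_urn:
            matches.append(doc)
    return matches


def issue(sample_id, repo_id):
    """Build a tiny stand-in issue document (sample data, not real output)."""
    siblings = {"href": f"urn:repo:{repo_id}:issues", "type": "collection"}
    return {"id": sample_id, "_metadata": {"links": {"siblings": siblings}}}


# Two issues in repo 6476660 and one in another repo.
store = [issue(1, 6476660), issue(2, 6476660), issue(3, 17843034)]
issues = find_siblings(store, "urn:repo:6476660:issues")
```

Removing an issue from the collection is then just a matter of deleting its document; nothing else in the store needs to change.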

Notice that each link also has a type. Valid types are resource, collection, and relation.

resource

A resource link is a reference to a single entity. For example, a repo links to its owner resource.

collection

A collection link points to a set of resources that are typically owned by some other entity, like a repo's issues. These are always 1:N relationships. Essentially a collection link points to all the resources that are siblings of one another. So the document for repo 17843034 would point to its issues with a link like the following:

  "issues": {
    "href": "urn:repo:17843034:issues",
    "type": "collection"
  }

relation

A relation is used to represent an M:N relationship such as that between repos and teams. One repo can have many teams and one team can have many repos. A link of type relation points to a mapping table that relates the two entities. So, for example, a repo document might have a teams link like:

  "teams": {
    "href": "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898",
    "type": "relation"
  }

Here the href points to a set of documents that have a unique link (sort of like the siblings link) with the matching value. Those documents list out all of the related entities. For example, the following is from a relation document that maps our repo to the teams listed in its resources link.

  "unique": {
    "href": "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898",
    "type": "collection"
  },
  "resources": {
    "hrefs": [
      "urn:team:652356",
      "urn:team:652429",
      "urn:team:1778605"
    ],
    "type": "resource"
  }

It is important to remember that there may be many of these documents, all with the same unique href. Each one represents a page of results that came back from GitHub. The unique value is used to group together the set of pages that make up the complete list.
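A rough sketch of how the pages might be stitched back together, assuming the unique and resources entries live under _metadata.links like the other links shown on this page. The related_resources helper, the shortened page identifier, and the team id on the second page are made up for the example.

```python
def related_resources(documents, relation_urn):
    """Collect resource URNs from every page whose unique link matches."""
    hrefs = []
    for doc in documents:
        links = doc.get("_metadata", {}).get("links", {})
        if links.get("unique", {}).get("href") == relation_urn:
            hrefs.extend(links.get("resources", {}).get("hrefs", []))
    return hrefs


def page(relation_urn, team_urns):
    """Build a stand-in relation page document (sample data only)."""
    links = {
        "unique": {"href": relation_urn, "type": "collection"},
        "resources": {"hrefs": team_urns, "type": "resource"},
    }
    return {"_metadata": {"links": links}}


# "example-page-set" is a placeholder for the generated page identifier.
teams_urn = "urn:repo:17843034:teams:pages:example-page-set"
pages = [
    page(teams_urn, ["urn:team:652356", "urn:team:652429"]),
    page(teams_urn, ["urn:team:1"]),  # made-up team id on a second page
    page("urn:repo:1:teams:pages:other", ["urn:team:2"]),  # unrelated relation
]
team_urns = related_resources(pages, teams_urn)
```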

This can be a little confusing, but it is really a "follow the links" game where the type of the link tells you how to follow it:

  • resource means look for something with a self link that matches
  • collection means look for documents that have a siblings link that matches
  • relation means look for all of the documents that have a unique link that matches and then follow the aggregated resources links.
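The three rules can be sketched as a single dispatch on the link type. This is an illustration over an in-memory list, not the crawler's own code: the follow and make_doc helpers are hypothetical, and the self URNs used for the issue and the relation page, as well as the page identifier, are made up.

```python
def _links(doc):
    return doc.get("_metadata", {}).get("links", {})


def follow(documents, link):
    """Return the documents a link points at, based on the link's type."""
    href, kind = link["href"], link["type"]
    if kind == "resource":
        return [d for d in documents if _links(d).get("self", {}).get("href") == href]
    if kind == "collection":
        return [d for d in documents if _links(d).get("siblings", {}).get("href") == href]
    if kind == "relation":
        # Gather resource URNs from all matching pages, then resolve each one.
        targets = []
        for d in documents:
            if _links(d).get("unique", {}).get("href") == href:
                targets.extend(_links(d).get("resources", {}).get("hrefs", []))
        return [d for d in documents if _links(d).get("self", {}).get("href") in targets]
    raise ValueError(f"unknown link type: {kind}")


def make_doc(self_urn, **links):
    """Build a stand-in document with a self link plus any extra links."""
    all_links = {"self": {"href": self_urn, "type": "resource"}, **links}
    return {"_metadata": {"links": all_links}}


issues_urn = "urn:repo:17843034:issues"
teams_urn = "urn:repo:17843034:teams:pages:example-page-set"  # placeholder id
store = [
    make_doc("urn:repo:17843034",
             issues={"href": issues_urn, "type": "collection"},
             teams={"href": teams_urn, "type": "relation"}),
    make_doc("urn:sample:issue:1",
             siblings={"href": issues_urn, "type": "collection"}),
    make_doc("urn:sample:teams-page:1",
             unique={"href": teams_urn, "type": "collection"},
             resources={"hrefs": ["urn:team:652356"], "type": "resource"}),
    make_doc("urn:team:652356"),
]
repo = store[0]
repo_issues = follow(store, _links(repo)["issues"])
repo_teams = follow(store, _links(repo)["teams"])
```

Starting from the repo document, following its issues link yields the issue document and following its teams link yields the team document, without the repo ever holding a list of either.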