Data architecture
The final output of the crawler is a store of JSON documents that both include all of the response from GitHub for a given entity and are annotated with metadata that enables easier traversal and interconnection. Concretely, each output document has the following structure:
{
  <content from GitHub>
  _metadata: {
    "type": "clones",
    "url": "https://api.github.com/repos/OSTC/php-sdk-binary-tools/traffic/clones",
    "fetchedAt": "2017-03-10T02:49:11.524Z",
    "links": {
      "self": {
        "href": "urn:repo:10075507:clones:2017_03_10",
        "type": "resource"
      },
      "repo": {
        "href": "urn:repo:10075507",
        "type": "resource"
      }
    },
    "etag": "\"be220f1fd43c5b6ac9b4302a2ed3b167\"",
    "version": 13,
    "processedAt": "2017-03-10T02:49:11.525Z"
  }
}
Most of these properties are self-explanatory, but type and links are worth a closer look.
The type
of a document very much relates to the type of entity in the GitHub API. They are typically 1:1 direct matches. This type also shows up in crawler Requests, in the GitHubProcessor.js function names, and in the funky urn
values in the links
. The key is that they all match.
The links property is a collection of relationships between entities. Every document has a self link that is its unique, global identity. The value of a link is a URN. URNs are semantically unstructured (i.e., they are just opaque strings), but syntactically they are type:id pairs separated by another :. So urn:repo:21 refers to the repo whose GitHub id is 21. URNs never include entity names, only ids, because names can change and ids do not.
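The URN syntax described above can be sketched in a few lines. This is only an illustration in Python (the crawler itself is JavaScript), and make_urn is a hypothetical helper, not part of the crawler. Because URNs are opaque, the sketch only builds them and compares them as whole strings; it does not try to parse meaning back out of them.

```python
def make_urn(*pairs):
    """Join (type, id) pairs into a URN string such as "urn:repo:21".

    Hypothetical helper for illustration only; ids are used, never names.
    """
    if not pairs:
        raise ValueError("at least one (type, id) pair is required")
    segments = ["urn"]
    for entity_type, entity_id in pairs:
        segments.append(str(entity_type))
        segments.append(str(entity_id))
    return ":".join(segments)


# A repo identified by its GitHub id.
repo_urn = make_urn(("repo", 21))

# The self link of the clones document shown earlier.
clones_urn = make_urn(("repo", 10075507), ("clones", "2017_03_10"))

# URNs are compared as opaque strings.
same_entity = repo_urn == "urn:repo:21"
```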
A document can have any number of additional links; how many depends on the type of the document. Link names are generally the same as the related property name in the GitHub API response. So the clones document above is related to the repo identified in its repo link.
A document may also have a siblings link if it is a member of a collection of related documents. For example, the document for an issue might have the following link:
  "siblings": {
    "href": "urn:repo:6476660:issues",
    "type": "collection"
  },
That signifies that the issue is related to all the other issues that have a siblings link with the same href. This particular href value is the URN for a given repo with :issues on the end, so this siblings entry relates the issue to all the other issues in the same repo. Given this structure, you can find all the issues for a repo by querying the data store for all documents whose _metadata.links.siblings.href property equals the URN urn:repo:<repo id>:issues. This approach allows collection members (e.g., issues) to be added and removed by manipulating the individual documents, with no need to update a master list somewhere.
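The siblings query described above might look like the following minimal sketch. It assumes the documents are held in an in-memory Python list rather than a real data store, and find_siblings, the issue numbers, and the issue self URNs are made up for illustration.

```python
def find_siblings(documents, collection_urn):
    """Return every document whose siblings link href equals collection_urn."""
    matches = []
    for doc in documents:
        links = doc.get("_metadata", {}).get("links", {})
        if links.get("siblings", {}).get("href") == collection_urn:
            matches.append(doc)
    return matches


def make_issue(repo_id, number, issue_id):
    """Build a tiny stand-in issue document (hypothetical ids and self URN)."""
    return {
        "number": number,
        "_metadata": {
            "type": "issue",
            "links": {
                "self": {"href": f"urn:repo:{repo_id}:issue:{issue_id}", "type": "resource"},
                "siblings": {"href": f"urn:repo:{repo_id}:issues", "type": "collection"},
            },
        },
    }


store = [
    make_issue(6476660, 1, 101),
    make_issue(6476660, 2, 102),
    make_issue(999, 7, 103),  # an issue in a different repo
]

issues = find_siblings(store, "urn:repo:6476660:issues")
issue_numbers = [doc["number"] for doc in issues]
```

Removing an issue from the collection is then just a matter of deleting its document from the store; no other document needs to change.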
Notice that each link also has a type. Valid types are resource, collection, and relation.
A resource link is a reference to a single entity. For example, a repo links to its owner resource.
A collection link points to a set of resources that are typically owned by some other entity, like a repo's issues. These are always 1:N relationships. Essentially, a collection link points to all the resources that are siblings of one another. So the document for a repo (here, repo 17843034) would point to its issues with a link like the following:
  "issues": {
    "href": "urn:repo:17843034:issues",
    "type": "collection"
  }
A relation is used to represent an M:N relationship such as that between repos and teams. One repo can have many teams and one team can have many repos. A link of type relation points to a mapping table that relates the two entities. So, for example, a repo document might have a teams link like:
  "teams": {
    "href": "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898",
    "type": "relation"
  }
Here the href points to a set of documents that have a unique link (sort of like the siblings link) with the matching value. Those documents list all of the related entities. For example, the following is from a relation document that maps our repo to the teams listed in its resources property.
  "unique": {
    "href": "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898",
    "type": "collection"
  },
  "resources": {
    "hrefs": [
      "urn:team:652356",
      "urn:team:652429",
      "urn:team:1778605"
    ],
    "type": "resource"
  }
It is important to remember that there may be many of these documents, all with the same unique href. Each one represents a page of results that came back from GitHub. The unique value groups together the pages that make up the complete list.
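Gathering the full list from several pages could be sketched as below. This assumes that the unique and resources entries both live under _metadata.links, as the fragment above suggests, and that documents sit in an in-memory list. related_urns is a hypothetical helper, and the team id on the second page is invented.

```python
RELATION_URN = "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898"


def related_urns(documents, relation_urn):
    """Merge resources.hrefs from every page whose unique href matches relation_urn."""
    seen = set()
    result = []
    for doc in documents:
        links = doc.get("_metadata", {}).get("links", {})
        if links.get("unique", {}).get("href") != relation_urn:
            continue  # not one of the pages for this relation
        for href in links.get("resources", {}).get("hrefs", []):
            if href not in seen:  # keep first occurrence, preserve order
                seen.add(href)
                result.append(href)
    return result


def make_page(relation_urn, hrefs):
    """Build a stand-in relation page document."""
    return {
        "_metadata": {
            "links": {
                "unique": {"href": relation_urn, "type": "collection"},
                "resources": {"hrefs": list(hrefs), "type": "resource"},
            }
        }
    }


pages = [
    make_page(RELATION_URN, ["urn:team:652356", "urn:team:652429", "urn:team:1778605"]),
    make_page(RELATION_URN, ["urn:team:42"]),  # hypothetical second page
    make_page("urn:repo:1:teams:pages:other", ["urn:team:7"]),  # unrelated relation
]

team_urns = related_urns(pages, RELATION_URN)
```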
This can be a little confusing but it is really a "follow the links" game where the type of the link tells you how to follow:
- resource means look for something with a self link that matches
- collection means look for documents that have a siblings link that matches
- relation means look for all of the documents that have a unique link that matches and then follow the aggregated resources links.
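The three rules above can be combined into one small dispatcher. The following is a sketch under the same assumptions as the rest of this page's examples: documents are in an in-memory list, unique and resources are stored under _metadata.links, and the follow function, the issue self URN, and the sample store are all hypothetical.

```python
def _links(doc):
    """Return the links dictionary of a document, or an empty dict."""
    return doc.get("_metadata", {}).get("links", {})


def _with_link(documents, name, href):
    """Documents whose link called `name` has the given href."""
    return [d for d in documents if _links(d).get(name, {}).get("href") == href]


def follow(documents, link):
    """Resolve a link to the documents it refers to, based on the link type."""
    href, kind = link["href"], link["type"]
    if kind == "resource":
        return _with_link(documents, "self", href)
    if kind == "collection":
        return _with_link(documents, "siblings", href)
    if kind == "relation":
        # Gather hrefs from all pages, then resolve each as a resource.
        targets = set()
        for page in _with_link(documents, "unique", href):
            targets.update(_links(page).get("resources", {}).get("hrefs", []))
        return [d for d in documents if _links(d).get("self", {}).get("href") in targets]
    raise ValueError(f"unknown link type: {kind!r}")


TEAMS_URN = "urn:repo:17843034:teams:pages:d9a3e50f-386d-4236-9805-80fb7ff54898"

repo = {"_metadata": {"type": "repo", "links": {
    "self": {"href": "urn:repo:17843034", "type": "resource"},
    "issues": {"href": "urn:repo:17843034:issues", "type": "collection"},
    "teams": {"href": TEAMS_URN, "type": "relation"}}}}
issue = {"_metadata": {"type": "issue", "links": {
    "self": {"href": "urn:repo:17843034:issue:1", "type": "resource"},
    "siblings": {"href": "urn:repo:17843034:issues", "type": "collection"},
    "repo": {"href": "urn:repo:17843034", "type": "resource"}}}}
team = {"_metadata": {"type": "team", "links": {
    "self": {"href": "urn:team:652356", "type": "resource"}}}}
page = {"_metadata": {"links": {
    "unique": {"href": TEAMS_URN, "type": "collection"},
    "resources": {"hrefs": ["urn:team:652356", "urn:team:652429"], "type": "resource"}}}}

store = [repo, issue, team, page]

repo_issues = follow(store, _links(repo)["issues"])
repo_teams = follow(store, _links(repo)["teams"])
issue_repo = follow(store, _links(issue)["repo"])
```

In this sample store only one of the two teams named on the relation page has been stored, so following the teams link returns just that one document.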