- Summary
- Problem
- Proposal
- Guiding principles
- Items we've tried to address
- Items for further work
- Draft list of endpoints
/ (Root)
/resource/{path}
/documents
/documents/{content_id}
/documents/{content_id}/{locale}
/documents/{content_id}/{locale}/editions
/documents/{content_id}/{locale}/change-notes
/documents/{content_id}/{locale}/editions/live
/documents/{content_id}/{locale}/editions/version/{version}
/editions
/editions/{id}
/editions/{id}/change-note
/editions/change-notes
/locations
/locations/lookup/{path}
/locations/{id}
/gones
/gones/{id}
- Draft list of entities
- Next steps
- Answers to hypothetical questions
- Why REST? Isn't everyone using GraphQL now?
- How might this be rolled out?
- Do we need a content store or should we just be querying the Publishing API?
- Why use the word "live" when it is already used in the context of live content store?
- I'm not too sure about the naming of "x"?
- Is it preferred to look up content via content_id than to use a path?
- Is there a plan for how to document this?
- Does this proposal include attachments to content, such as PDF or image files?
- Amendments following feedback
This RFC serves as an introduction to the approach that we (the API for Content team) have taken to defining the next iteration of the GOV.UK Content API which considers historical content.
The purpose of this RFC is to present our approach and ideas for what we feel is the next logical iteration of the Content API. We are seeking feedback from the wider GOV.UK developer community on this approach and looking for community consensus that this is a sensible path to proceed further on. We welcome questions and are happy to explain further how we arrived at the suggestions we have proposed.
This RFC presents the principles we have applied to defining things, the items we have considered inside/outside scope, a draft list of endpoints, and a draft list of types reflected in the new API.
GOV.UK, as a member of Open Government Partnership (OGP), has made a commitment to:
- Provide APIs for government content
- Provide a full version history of every published page
Currently we loosely meet the first criterion of this commitment - with our unofficially supported /api/content endpoint - and don't meet the second one.
To meet both of these we need the means to access historic content, which would logically be through the content store, and to assess how well our current Content API meets the first commitment.
We have identified the following problems with the current Content API:
- There is just a single lookup for content - by path - and no means to navigate through content.
- There is useful information not exposed (such as routing) since it is not used to render pages.
- The information provided is hard to understand without knowledge of schemas and other sub-systems.
- Somewhat confusing responses are returned for non-content items such as redirects and gones.
When considered in the context of historic content we have these additional problems:
- The model of a ContentItem represents the compound of a Document and an Edition, which causes significant data replication.
- The means of lookup (path) blurs the lines between current and historic content.
- There is no approach to identify which content is current and which is historic.
There are a number of basic principles we have followed in defining this API. An understanding of these may be useful in understanding the "why" behind some of our suggestions.
If we are to have a single API that we use internally and externally we gain the following:
- Less to maintain - only 1 API
- Higher chance of catching issues internally
- External API evolves implicitly with our internal needs
And we accept the following problems:
- Data included in the API that may be of no use outside GOV.UK
- More frequent changes to accommodate the changes needed for GOV.UK's evolution
We've structured the endpoints based on what data we have and the logical way the resources fit together rather than considering what use cases may be. The reasoning for this (aside from general REST API recommendations) is that we are building this to enable usage of our data and we aren't trying to anticipate what those uses may be.
However, in contrast to this, we have considered how the current content/{path} endpoint can be replaced without requiring any additional API lookups, and have chosen for endpoints to have a preference for returning "live" content.
We chose to start from the principle that every resource returns an entity or a list of entities. Each one of these entities has a defined structure and a canonical URL.
This offers us a number of advantages compared to returning arbitrary data:
- Consistency - if the same data is used in two places you can expect it to be structured the same.
- Easier to model consumption of the API - the entity has a name you can use, same classes can be used for multiple API responses.
- Expansion of the API needs an integrated consideration of the system.
And we accept the following problems:
- In some cases responses may be more verbose than necessary to return an entity.
- We may need to implicitly embed related entities in responses to provide a holistic response in endpoints.
- There could end up being a lot of entities in the system were we to model each schema.
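The entity principle above can be sketched as follows. This is only an illustration: the class and field names are assumptions, and the canonical URLs simply mirror the draft /documents/{content_id}/{locale} and /editions/{id} endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical entity; the field names are illustrative, not defined by this RFC."""
    content_id: str
    locale: str

    @property
    def canonical_url(self) -> str:
        # Mirrors the draft /documents/{content_id}/{locale} endpoint.
        return f"/documents/{self.content_id}/{self.locale}"


@dataclass(frozen=True)
class Edition:
    """Hypothetical entity that embeds its related Document."""
    id: str
    title: str
    document: Document  # the same structure wherever a Document appears

    @property
    def canonical_url(self) -> str:
        # Mirrors the draft /editions/{id} endpoint.
        return f"/editions/{self.id}"
```

Because an embedded Document has the same structure as one fetched directly, a consumer could reuse one class for both responses, at the cost of a more verbose Edition response.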
We have tried to follow industry best practices on REST APIs wherever appropriate. Some aspects of this are:
- Usage of nouns not verbs in endpoints
- Intention to provide filtering, sorting and pagination of collections
- Usage of links within the responses to communicate state and relationships
This API proposal is designed around historical content being as easy to look up as content that is currently live. There is the expectation that a user of the API can filter based upon just live, just past or a mixture of content.
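A minimal sketch of that filtering, assuming each edition carries timestamps for when it was live. The field names (live_from, live_until) and the state values ("live", "past", "all") are hypothetical and not part of this proposal.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Edition:
    id: str
    live_from: datetime
    live_until: Optional[datetime] = None  # assumed convention: None means still live


def filter_editions(editions, state="live"):
    """Return editions matching state: 'live', 'past' or 'all' (hypothetical values)."""
    if state == "all":
        return list(editions)
    if state == "live":
        return [e for e in editions if e.live_until is None]
    if state == "past":
        return [e for e in editions if e.live_until is not None]
    raise ValueError(f"unknown state: {state!r}")
```

Defaulting to "live" matches the stated preference for endpoints to return live content unless asked otherwise.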
We've tried to stay clear of the concept of unpublishing as is used in Publishing API and other publishing apps. The concept of unpublishing is confusing in a historic context (and arguably even in our current context) as it implies the inverse of publishing but something unpublished would still be in the history.
We propose handling unpublishings through content replacing the Resource at a particular URL and having timestamps that indicate when the resource was live.
The term "replace" is not actually used in the proposal for the API however the term "retired" has been introduced.
A new concept introduced in this proposal is subtyping Gone into RetiredGone and RevokedGone. The former is intended for the current scenario that Gone is used for - a document that was once on GOV.UK but reached the end of its useful life.
The concept of RevokedGone is intended for handling legally sensitive information. It would be used to replace either an Edition or Document that has needed to be removed for legal reasons and cannot continue to be in our history.
Note: The ideas surrounding this approach have been revised.
This proposal introduces a number of endpoints for navigating and filtering collections of content. Potential usages this may provide are:
- Looking up all editions published within a time window
- Following the change notes of all content being published
- Tracking all content being removed from GOV.UK
Currently in the content store paths that configure the router to return redirect or gone responses are modelled as ContentItems. This proposal suggests that these should be modelled separately.
Redirects are suggested to be modelled as a type of routing and not be associated with a resource - as we do not store data beyond the routing with a redirect.
Gones are considered to be a distinct resource from an Edition.
By separating these concerns from content we are able to make the rules for validity of a ContentItem stricter - as there is less variance in what they might contain - and provide a more meaningful response when accessing Gones/Redirects.
There are a number of items that we are considering or are postponing investigating/iteration. These are listed here to indicate current thoughts on them.
We are considering what structure we will use to represent the data. We are looking for a way to express the type, hypertext links and meta information without this being confused with the data. We'd like to apply this consistently across all API responses.
We have investigated json:api and the HAL specification and felt neither standard was an ideal fit for us. So we're looking to define something simple, which seems to be the more common approach. We're taking inspiration from the aforementioned standards and popular APIs such as Stripe.
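As a rough illustration of keeping type, hypertext links and meta information apart from the data, a response could be wrapped in a simple envelope. The key names ("type", "links", "meta", "data") are assumptions made for this sketch; no structure has been decided.

```python
import json


def envelope(entity_type, data, links, meta=None):
    """Wrap entity data so that type, links and meta never mix with the data itself."""
    return {
        "type": entity_type,
        "links": links,
        "meta": meta or {},
        "data": data,
    }


response = envelope(
    "Edition",
    data={"title": "Example title", "description": "Example description"},
    links={"self": "/editions/123", "document": "/documents/abc/en"},
)
print(json.dumps(response, indent=2))
```

Applying the same wrapper to every endpoint is what would give the consistency across API responses described above.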
This proposal does not consider what might need to change in how data is written to the Content API. This is because only a single application, Publishing API, writes to the Content Store whereas there are many users of the read API.
We intend to have done due diligence that it is easy to write to the API before any of the suggestions here are implemented.
One of the challenging parts of explaining content in the Content API is the pre-requisite of understanding of what govuk-content-schemas are and how they can be used to describe the fields that make up content.
This proposal does not consider the effect they have, though there is the possibility that these may be defined as a subtype of Edition. We intend to perform an investigation into how these can be used to explain content in the current content store and apply our learnings to the next iteration.
This proposal offers numerous endpoints that are the canonical method to look up an entity, which may require the lookup by ID.
The current Content API does not expose any sequential IDs, and it could be that by introducing IDs that are sequential we accidentally reveal information that is not intended to be public (such as the ordering that policies are drafted in).
It may be appropriate to use UUID for all ID purposes, although this could cause confusion with our content_id format.
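The difference between the two kinds of ID can be shown briefly. This sketch uses Python's standard uuid module; the example numbers are made up.

```python
import uuid

# A sequential ID leaks ordering: 1042 was clearly created after 1041.
sequential_ids = [1041, 1042]

# A random (version 4) UUID carries no ordering information.
edition_id = uuid.uuid4()
print(edition_id)
```

The trade-off noted above still applies: an edition ID in this format would look just like any other UUID, so a consumer could mistake it for a different kind of identifier.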
One of the problems we have with introducing historic content is how to handle the links of that external content. It's a problem we have been considering passively and want to explore in more detail. This proposal does not attempt to address it.
Some of the options considered are:
- Keeping all expanded links up to date
- Separating links into Document and Edition links
- Tying Edition links to a particular major version of a piece of content.
A requested feature for the current Content API is a richer method than expanded links for including data from a different content item, e.g. a method to pick particular relevant data from an expanded link.
This proposal does not attempt to solve this problem.
By revealing historic content we begin to see content which isn't particularly well suited to the document/edition model. An example of this is a smart answer which is written in code and has an edition updated every deploy.
We expect that by introducing history this problem will be revealed more and believe it should be investigated but is not a priority.
An early idea is having an Application entity to handle content that is published automatically.
This is a list of the endpoints we are proposing for the next iteration of the Content API.
The root of the API would return information to help someone get started with the API and links.
Entity returned: A custom one
This is used to access a resource by the path it is available at. It is synonymous with the /api/content/{path} endpoint from the current Content API.
For a path that is a RedirectRoute this will return a redirect. Where the resource is a type of Gone, a 410 will be returned with the Gone response.
A timestamp parameter could be provided to return the resource available at a particular time.
Entity returned: Resource
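The lookup behaviour described for /resource/{path} could be sketched like this. The Location fields, the kind labels, the 301 status for redirects and the 404 fallback are all assumptions; the text above only specifies that a redirect is returned for a RedirectRoute and a 410 for a Gone.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Location:
    path: str
    kind: str  # "edition", "gone" or "redirect" (hypothetical labels)
    live_from: datetime
    live_until: Optional[datetime] = None  # assumed convention: None means currently live
    destination: Optional[str] = None  # only set for redirects


def resolve(locations, path, timestamp=None):
    """Return (status, location) for the path at the given time (default: now)."""
    at = timestamp or datetime.now(timezone.utc)
    for loc in locations:
        ended = loc.live_until is not None and loc.live_until <= at
        if loc.path == path and loc.live_from <= at and not ended:
            # Status codes other than 410 are assumptions for this sketch.
            status = {"edition": 200, "gone": 410, "redirect": 301}[loc.kind]
            return status, loc
    return 404, None
```

With timezone-aware datetimes, the same path can resolve to an Edition at one timestamp and to a Gone at a later one, which is how a path's history would surface through a single endpoint.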
An endpoint to navigate through all documents that have been available on GOV.UK. It would default to showing those which have a live edition.
Could be used to track when new documents are added to GOV.UK.
Entity returned: List<Document>
This endpoint allows a user to browse the locales a document is available in for a particular content_id.
Entity returned: List<Document>
This is the canonical path for a particular document. Used to look up details of a Document.
Entity returned: Document
This is used to browse through all editions available for a particular document. Could be used to compare how a piece of content has changed over time. Could be filtered by whether minor changes are shown.
Entity returned: List<Edition>
This is used to browse through the change notes for a particular document.
Entity returned: List<ChangeNote>
Used to access the live edition for a document. Live meaning the version that is currently on the particular content store.
Entity returned: Edition
Used to look up a particular edition of a document by the version number that describes it. Would offer links to navigate to earlier versions.
Entity returned: Edition
This returns a paginated list of editions that match parameters. By default it would return just live ones. This endpoint could be used to track changes to particular groupings and to track when new items are published on GOV.UK.
Entity returned: List<Edition>
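A paginated collection with navigation links might look like the sketch below. The page and per_page parameter names, the default page size and the response keys are hypothetical; pagination details are not yet defined.

```python
def paginate(items, page=1, per_page=50, base_url="/editions"):
    """Return one page of items with links to neighbouring pages (hypothetical shape)."""
    start = (page - 1) * per_page
    chunk = items[start:start + per_page]
    links = {"self": f"{base_url}?page={page}"}
    if start + per_page < len(items):
        links["next"] = f"{base_url}?page={page + 1}"
    if page > 1:
        links["previous"] = f"{base_url}?page={page - 1}"
    return {"results": chunk, "links": links}
```

A consumer tracking newly published editions could then follow the "next" link until it is absent, rather than constructing URLs itself.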
This is the canonical method to look up an Edition.
Entity returned: Edition
This is the canonical method to look up the change note for a particular edition.
Entity returned: ChangeNote
This endpoint can be used to browse through all the change notes for every edition (defaulting to live ones). It can be used to track the reasons why things are changing on GOV.UK.
Entity returned: List<ChangeNote>
This endpoint is to browse what is on GOV.UK from a path perspective. It can be used to browse the history of a path and to determine what was on GOV.UK at a particular time.
Entity returned: List<Location>
This endpoint is used to look up the routing data for a particular path. It can be provided with a timestamp to determine the time you are looking up. Unlike /resource/{path} this returns the routing data rather than the resource.
Entity returned: Location
The canonical method to look up a Location.
Entity returned: Location
This endpoint is used to look up content that has gone from GOV.UK, which could be because it was retired or revoked. This could be used to keep track of what is being taken off GOV.UK.
Entity returned: List<Gone>
Note: The ideas surrounding this approach have been revised.
The canonical method to look up a Gone.
Entity returned: Gone
This is a list of the entities envisioned to be used for the API, with brief descriptions of their purpose.
An object that represents all editions of a piece of content. Would store information consistent across all editions, such as content_id, locale, first_published_at. Could be used to access the latest iteration of a piece of content.
This object represents an edition of a document, which is therefore a piece of content. This stores information such as title, content, description.
This describes a change that has been made to an edition. It includes information such as a note and timestamp.
This object is used to represent a collection of routes that is associated with a Resource. It will have information such as base_path and the timestamps between which the route was live.
This represents something that can be at a Location. Currently the known items would be an Edition or a Gone, however this could expand in future.
This represents a single route that would be included in a Location. It would contain a path and whether it is a prefix or not.
This would be a subtype of route that has additional information associated with a redirect, such as destination, segment mode, etc.
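To illustrate how a Route's path and prefix flag might be used to match a request, here is a small sketch. The field names and the rule that a prefix route only matches on a path segment boundary are assumptions of this example, not behaviour specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    path: str
    prefix: bool = False  # False means the route matches the exact path only


def matches(route, request_path):
    """Return True if request_path is served by route."""
    if not route.prefix:
        return request_path == route.path
    # Assumed rule: a prefix route matches itself and anything beneath it,
    # but only on a segment boundary ("/foo" does not match "/foobar").
    base = route.path.rstrip("/")
    return request_path == route.path or request_path.startswith(base + "/")
```

A RedirectRoute could extend Route with a destination and reuse the same matching.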
A Gone is a generic type that represents a piece of content that is no longer available. This is available in two types: RetiredGone and RevokedGone.
Note: The ideas surrounding this approach have been revised.
A RetiredGone is used as a resource for a document that is no longer available. This is synonymous with a Publishing API unpublishing of type “gone”.
A RevokedGone is a type to represent a new concept in the Content API which is for content that has been removed for legal reasons.
The purpose of this RFC is to present and explain the approach the API for Content team have taken towards defining an API. We feel that this is a suitable point for seeking opinions as we've established ideas and patterns but everything is still malleable enough that there's scope to reconsider.
We're particularly interested to learn whether the changes suggested will cause problems for users of the content store or, alternatively, might solve problems that currently exist.
We welcome any ideas or insights on the things we are suggesting here, particularly with reference to projects that we may not be aware of.
Our next step with this work is likely to be prototyping these endpoints within the Publishing API, to help inform the structure of fields.
GraphQL is an interesting proposition to us and has clearly been gaining traction. It still seems an unlikely first choice for us as it would require substantial re-tooling of our stack and feels like a potentially unfamiliar interface for partners.
We are, however, interested in GraphQL and we feel that the usage of entities keeps open an opportunity to provide a GraphQL interface at a later date if there is sufficient need for it.
There would be a plan to allow access to both the current Content API and the next iteration to co-exist for a period of time. This would allow time for migration.
There are some unknowns about the process of enabling historic content to be viewable and who needs to be consulted/what data needs to be fixed. We might decouple this proposal from this concern by initially reducing the scope of this API to be just content that is currently live.
This is a question that has come up a number of times since we started considering the differences in responsibilities the Content Store has once historic content is a factor.
Outside of history we do have some advantages in a content store, such as a separation between draft and live content store and data optimised for fast reading.
We feel that we will learn in time whether the shared responsibilities between Content Store and Publishing API are frustrating or not, however we feel that by continuing to use the Content Store we have an easier path to completing this work and not requiring large changes to the service stack.
We agree that this isn't ideal, we're just not sure of alternatives. We initially considered using the word "current" but this felt too generic, and we want to avoid the term "published" because past items that are available are still published. So we concluded that "live" was the most fitting word to describe this and wouldn't be a problem for external users - since they don't have access to the draft content store.
Any ideas/thoughts on this or a more suitable name are welcome.
Please tell us so we can consider it. Any naming suggestions are definitely welcome, but we'll probably not want to get too involved in a naming debate at this stage, to avoid bikeshedding.
In a nutshell the answer is no.
In longer form though, content_ids and paths serve different purposes which, due to the relations of data, can eventually lead to the same result, e.g. a Location (path) is associated with an Edition, which is associated with a Document (content_id). A user can use a content_id to get a historic overview of every edition of a piece of content, whereas path is used to get a single edition of a piece of content. If you aren't concerned with past editions you need only consider path.
The number of endpoints involving a content_id compared to path seems to have created an impression of content_id being the preferred method. This isn't intended to be so; however, it is a natural side effect of content_id being part of a solid data model, whereas path and its associated Location model are used to link to a number of different data models, so navigating through them is impractical due to their generic representation.
The intention is to document this in a continuation of the approach established for the current iteration of the Content API which is done through a microsite that is generated based on an OpenAPI v3 specification.
The choice to use OpenAPI is based on a proposal to standardise on the usage of OpenAPI v3.
This proposal is based around the Content API being populated by data available from the Publishing API. Currently the Publishing API only has explicit records for HTML Attachment and not other ones. This RFC does not propose any changes to this and considers how/whether attachments belong in Publishing API/Content API to be a distinct problem from what this RFC is addressing.
As was pointed out by @edent there is an HTTP Status Code provided for content that is "Unavailable for Legal Reasons". This was very helpful as it has given us an opportunity to reconsider having two types of Gone entity for different scenarios.
Initially we had intended that both types of Gone would return HTTP 410 Gone responses, which was a key factor in them sharing a common super type. However, if we consider that we want to return a different 451 response, then this calls into question the wisdom of a shared common approach.
In light of this the following changes are proposed:
- Only one type of Gone, which will no longer be a generic super type but will instead replace RetiredGone.
- The type of RevokedGone would be replaced by an UnavailableLegal type.
- An /unavailable-legal endpoint would be created to browse through content unavailable for legal reasons.
- Future investigation would be required for deciding a strategy for cases where a URL itself is unavailable to be displayed due to legal reasons.
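The revised split between Gone and UnavailableLegal can be summarised as a mapping from entity type to HTTP status. The dictionary and function names are hypothetical; the 410 and 451 codes come from the discussion above, and the status constants are from Python's standard http module (451 requires Python 3.8 or later).

```python
from http import HTTPStatus

# Hypothetical mapping from entity type name to the HTTP status returned.
STATUS_BY_TYPE = {
    "Edition": HTTPStatus.OK,
    "Gone": HTTPStatus.GONE,
    "UnavailableLegal": HTTPStatus.UNAVAILABLE_FOR_LEGAL_REASONS,
}


def status_for(entity_type):
    """Return the HTTP status for a resource of the given entity type."""
    return STATUS_BY_TYPE[entity_type]
```

Since the two types no longer share a status code, there is less reason for them to share a super type.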