More narrowly, “web caching” is the term for storing HTTP responses, such as entire pages or API responses, in order to make subsequent responses faster.
This report focuses specifically on web caching, and more particularly on pages and API responses. While media files and static assets also benefit from web caching, requirements for caching them are already well-solved in Apostrophe.
🌌 What web caching isn’t
Apostrophe has a module called @apostrophecms/cache. This is a low-level cache for programmers, which can be used to cache any value for a period of time. By itself, it is not a web caching solution, although it can be part of one.
Where web caching happens
Web caching can occur in several places:
Inside the Apostrophe Node.js application itself (the “backend application”).
Inside a single reverse proxy server (nginx) operating on or near the same server as Apostrophe (the “reverse proxy”).
On a CDN, such as Cloudflare or CloudFront, which positions content closer to the consumer (“at the edge”). In this scenario the server on which Apostrophe actually runs is known as an “origin server.”
Caches provided by corporate proxies.
In the browser’s own built-in cache.
There are many approaches to actually implementing web caches. However, with the exception of simple caching strategies such as express-cache-on-demand, it is not necessary or wise to reinvent them.
Instead, Apostrophe should strive to become a “good caching citizen” by taking advantage of features built into the HTTP protocol specification, specifically the Cache-Control header. This will allow those using Apostrophe to fully benefit from CDNs, including free options like Cloudflare, without awkward workarounds for logged-in users.
To understand how to do this, we’ll start by understanding the challenges surrounding what can be cached, and for how long.
The problem of stale content
The trouble with caching is knowing when it is OK to serve cached content, and when it is not. In Apostrophe, cached content can become outdated in many ways:
Because the page in question has been edited.
Because a document related to that page, such as an image piece, has been edited.
Because a document indirectly related to that page, for instance via a “recent articles” async component, has been edited.
Because an external source, such as a third-party API accessed by an async component in Apostrophe, has been updated.
Because of the passage of time: a piece has reached its publication date, or its expiration date.
Since there are so many ways for content to appear indirectly on a page, it is difficult to know for certain when a cached response is out of date. This has bearing on our choice of strategy, as described below.
Invalidation
One approach to caching is invalidation. In the invalidation approach, Apostrophe outputs an ETag header with each response, which the browser or an intermediate cache can later use to ask Apostrophe if the response has changed or not.
In many systems this is a simple version number or modification timestamp. Unfortunately, because of the many indirect ways content can become stale, this is not really adequate.
Another approach is for Apostrophe to set the ETag to a cryptographic hash (a checksum) of the content. This does work, in that a later request from the browser or intermediate cache will send an If-None-Match header with the ETag, and Apostrophe can compute a new hash and compare them; if they are the same, Apostrophe doesn’t have to actually send the data. However, Apostrophe still has to do the computational work, so the win here is only in bandwidth, which is usually not the most important concern driving the request for caching. CPU, RAM and performance issues are much more common concerns and these are not helped if the response must be fully computed every time.
The Last-Modified and If-Modified-Since headers can be used to accomplish something very similar, with the same limitations.
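To make the limitation concrete, here is a minimal sketch of hash-based ETag handling using only Node's built-in crypto module. The `respond` and `renderPage` names are hypothetical stand-ins, not Apostrophe APIs, and the comparison ignores weak validators and multi-value If-None-Match headers for brevity.

```javascript
const crypto = require('node:crypto');

// Compute a strong ETag from the fully rendered response body.
function etagFor(body) {
  const hash = crypto.createHash('sha256').update(body).digest('hex');
  return `"${hash}"`;
}

// Answer a conditional GET. `renderPage` is a hypothetical stand-in for
// the expensive work of producing the page.
function respond(ifNoneMatch, renderPage) {
  // The expensive step still runs on every request: we cannot know the
  // hash without rendering the body first.
  const body = renderPage();
  const etag = etagFor(body);
  if (ifNoneMatch === etag) {
    // Only the transfer of the body is saved.
    return { status: 304, headers: { ETag: etag }, body: '' };
  }
  return { status: 200, headers: { ETag: etag }, body };
}
```

Note that `renderPage` is called even when the answer turns out to be 304, which is exactly why this approach saves bandwidth but not CPU.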
To avoid the performance hit of regenerating the entire response, Apostrophe could attempt to invalidate the response based on whether the page or its directly related documents have changed. To do that efficiently, we would have to add “backlinks” to Apostrophe — Apostrophe would have to keep track of the documents that reference each document, not just the other way around as it is today.
However, this would still not be enough to guarantee no false cache matches, because of the other sources of stale content mentioned above: indirect relationships (“most recent articles” components), time-related changes (the article is scheduled to be published at time X), and third-party API consumption in async components.
For this reason, it is recommended that invalidation based on backlinks be used only as an adjunct to expiration, to give it a more immediate feel in most but not all cases.
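As a rough illustration of what “backlinks” would mean as a data structure, the sketch below builds a reverse index from a list of documents. The document shape (`_id`, `relatedIds`) and both function names are assumptions made for the example, not how Apostrophe stores relationships.

```javascript
// Build a reverse index ("backlinks"): for each document id, the set of
// documents that reference it. Each doc is assumed to look like
// { _id: 'home', relatedIds: ['image1'] } (a hypothetical shape).
function buildBacklinks(docs) {
  const backlinks = new Map();
  for (const doc of docs) {
    for (const relatedId of doc.relatedIds) {
      if (!backlinks.has(relatedId)) {
        backlinks.set(relatedId, new Set());
      }
      backlinks.get(relatedId).add(doc._id);
    }
  }
  return backlinks;
}

// When a document is edited, return the ids whose cached responses
// should be treated as stale: the document itself plus everything that
// references it directly.
function staleAfterEdit(backlinks, editedId) {
  return new Set([editedId, ...(backlinks.get(editedId) || [])]);
}
```

This only covers direct relationships. A “recent articles” component, a publication date, or a third-party API would not show up in the index, which is why expiration is still needed.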
Expiration
Expiration is a vastly more straightforward approach to caching, with a straightforward tradeoff: sometimes responses are stale, for a known amount of time.
Any HTTP response can include a Cache-Control header specifying a max-age. This is a simple way to indicate how long the response can be kept and reused, not just by the browser cache but also by intermediate caches, such as an nginx cache (in a simple single-server reverse proxy configuration) or a CDN like Cloudflare.
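The sketch below shows the idea from both sides: building the header value, and the freshness check a cache would perform against it. The function names are made up for illustration, and real caches also account for things like the Age header and other directives, which are left out here.

```javascript
// Build the header value for a response that may be reused for `seconds`.
function cacheControl(seconds) {
  return `max-age=${seconds}`;
}

// Read max-age back out of a Cache-Control value, as a cache would.
// Returns null if the directive is absent.
function maxAgeOf(headerValue) {
  const match = /(?:^|,)\s*max-age=(\d+)/i.exec(headerValue);
  return match ? Number(match[1]) : null;
}

// A stored response is fresh while its age is below max-age.
function isFresh(storedAtMs, headerValue, nowMs) {
  const maxAge = maxAgeOf(headerValue);
  if (maxAge === null) {
    return false;
  }
  return (nowMs - storedAtMs) / 1000 < maxAge;
}
```

With a one-hour setting, `cacheControl(3600)` yields `max-age=3600`, and any cache holding the response may reuse it without contacting Apostrophe until the hour is up.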
Past versions of Apostrophe have featured a one-hour cache option, implemented directly in Apostrophe. In addition, they sometimes featured a “clear the whole cache” button. In practice this worked very well: customers almost never really needed to click “clear the whole cache,” but they had that option for emergencies, real or perceived.
However, those operating larger sites will likely prefer caching at the edge because it yields performance benefits that aren’t possible if every request must reach Apostrophe itself. That rules out a “clear cache” button implemented purely by Apostrophe. These customers could instead address the problem by:
Displaying or refreshing time-critical notifications (“snow emergency,” etc.) via browser API calls that are not cached (we need to provide easy ways to opt individual APIs out of any overall cache headers sent by Apostrophe)
Accepting that content one hour old is not a serious concern for most sites
Asking us to implement cache invalidation based on the backlinks approach, while also maintaining a fairly short expiration time (one hour)
If these concerns are addressed, an approach based on max-age should be acceptable and appropriate for the great majority of larger customers.
For smaller sites, the built-in “clear cache” button is more important. We could consider implementing an in-Apostrophe cache powered by the apos.cache module, using a least-recently-used algorithm to discard content and with appropriate safeguards on total space consumed by the cache. This would allow us to provide a “clear cache” button to these customers. Alternatively, we could offer a simple integration to clear an nginx cache running on the same server. This would be fairly simple too, and faster, although it may require a component that runs with sudo privileges. However, it seems most likely that even those operating small sites would prefer to take advantage of free caches like Cloudflare, which have their own "clear the cache" UI.
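For a sense of scale, a least-recently-used cache with a cap on total size can be sketched in a few dozen lines. This version keeps everything in an in-memory Map rather than going through apos.cache, whose API is not reproduced here, and the class name is hypothetical.

```javascript
// A least-recently-used cache of response bodies (strings), capped by
// total size in bytes. A JavaScript Map iterates in insertion order, so
// the first key is always the least recently used one.
class LruResponseCache {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.bytes = 0;
    this.entries = new Map();
  }

  get(url) {
    const body = this.entries.get(url);
    if (body === undefined) {
      return undefined;
    }
    // Re-insert to mark this entry as most recently used.
    this.entries.delete(url);
    this.entries.set(url, body);
    return body;
  }

  set(url, body) {
    if (this.entries.has(url)) {
      this.bytes -= Buffer.byteLength(this.entries.get(url));
      this.entries.delete(url);
    }
    const size = Buffer.byteLength(body);
    if (size > this.maxBytes) {
      return; // too large to cache at all
    }
    this.entries.set(url, body);
    this.bytes += size;
    // Evict from the least recently used end until we fit again.
    for (const [oldUrl, oldBody] of this.entries) {
      if (this.bytes <= this.maxBytes) {
        break;
      }
      this.entries.delete(oldUrl);
      this.bytes -= Buffer.byteLength(oldBody);
    }
  }

  // What a "clear cache" button would call.
  clear() {
    this.entries.clear();
    this.bytes = 0;
  }
}
```

A cache like this lives in a single Node.js process, so a multi-process or multi-server deployment would need a shared store for the “clear cache” button to behave consistently.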
The problem of “editor” content versus “public” content
The discussion of caching headers above is simplified, avoiding one important topic: all HTTP caching relies by default on the idea that two GET requests for the same URL should always return the same thing, until that document actually changes.
In particular, Apostrophe’s in-context editing is sometimes in direct conflict with this, because editors and the public see different content.
This is also an issue for most intranet sites, which might present different responses for the same page or hide a page entirely from some users.
The HTTP specification suggests one way to address this: the Vary header.
Vary allows us to say that the cache should consider two responses to be separate for caching purposes based on additional headers, not just the URL.
One obvious approach is:
Vary: Cookie
In this approach, any difference in cookies causes a cache to treat the request as distinct, so a response stored for one set of cookies is never served to a request with another. At first this sounds good: logged-in users have a session cookie that logged-out users do not.
Unfortunately there are two problems. One is easily solved (and has been in a new PR), the other is not:
A3 actually gives a session cookie to everyone even if their session is empty. This has been fixed in this PR. Related to this, A3 also gives a unique CSRF cookie to everyone, even though this is not required post-IE11, as long as SameSite: lax is configured. This is also fixed in the same PR.
Third-party integrations like Google Analytics create tons of cookies which have nothing to do with cacheability. Unfortunately this problem is much less tractable. We cannot ask developers to stop using Google Analytics.
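The second problem is easier to see with a toy model of how a shared cache picks its key. The `cacheKey` function below is a simplification written for this example, not any particular cache's algorithm; `_ga` is the cookie Google Analytics sets, and the values are invented.

```javascript
// Build the key a shared cache might use for a stored response.
// `vary` lists the header names from the response's Vary header, and
// `requestHeaders` uses lowercase names, as Node.js does.
function cacheKey(url, requestHeaders, vary) {
  const parts = vary.map((name) => {
    const lower = name.toLowerCase();
    return `${lower}=${requestHeaders[lower] || ''}`;
  });
  return [url, ...parts].join('|');
}

// Two logged-out visitors who differ only in their analytics cookie.
const visitorA = { cookie: '_ga=GA1.2.111.111' };
const visitorB = { cookie: '_ga=GA1.2.222.222' };

// With Vary: Cookie the keys differ, so neither visitor can reuse the
// response cached for the other.
const fragmented =
  cacheKey('/', visitorA, ['Cookie']) !== cacheKey('/', visitorB, ['Cookie']);

// Without Vary they share one cache entry.
const shared = cacheKey('/', visitorA, []) === cacheKey('/', visitorB, []);
```

Since nearly every visitor carries a unique analytics cookie, Vary: Cookie effectively gives each visitor a private cache entry, which defeats the purpose of a shared cache.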
Solving the limitations of Vary with a custom header
In my research, a popular solution to the problems of using Vary with Cookie is to create a custom header just for the sake of the Vary header. Browser API calls would be responsible for including this header where appropriate.
In Apostrophe, this header could look like:
X-Apos-Identity: [session cookie, or "public"]
This would be sent by the browser in every request.
We can then send:
Vary: X-Apos-Identity
As part of the HTTP response.
Unfortunately, while this works for API responses, **there is no good way to incorporate a custom header into ordinary page requests during regular navigation on the site.** “Boiling down” the session cookie to a custom header is thus usually implemented in intermediary caches rather than the browser itself, as described by Fastly. But this approach is very “high touch” in terms of how much configuration is required on the part of devops teams.
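The “boiling down” step itself is small. Here is a sketch of the logic an intermediary would run on each incoming request, written in JavaScript only for illustration, since a real deployment would express it in the cache's own configuration language. The session cookie name `myproject.sid` is a placeholder, not a name Apostrophe guarantees.

```javascript
// Hypothetical session cookie name; a real project would use its own.
const SESSION_COOKIE = 'myproject.sid';

// Parse a Cookie request header into a plain object of name/value pairs.
function parseCookies(header) {
  const cookies = {};
  for (const pair of (header || '').split(';')) {
    const index = pair.indexOf('=');
    if (index === -1) {
      continue;
    }
    cookies[pair.slice(0, index).trim()] = pair.slice(index + 1).trim();
  }
  return cookies;
}

// Reduce the whole Cookie header to the one value that matters for
// caching: the session cookie if present, or "public" otherwise.
function identityFor(cookieHeader) {
  return parseCookies(cookieHeader)[SESSION_COOKIE] || 'public';
}
```

The intermediary would set X-Apos-Identity to this value before consulting its cache, so that analytics cookies no longer affect the cache key while logged-in users still get separate entries.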
Avoiding the limitations of Vary with a cookie and location.reload
A simpler solution, which should work for our needs, is for all logged-in responses to carry a Cache-Control: no-store header to ensure they are never cached at all (at least by default).
In addition, if a user logs in and navigates to a new page but receives a cached “public” response with logged-out content only, Apostrophe must be able to recognize this situation and remedy it by forcing a full refresh.
Fortunately this is possible. The body can contain a simple attribute indicating whether a user is logged in or not, and a small stub of JavaScript in our standard public library can check whether this agrees with a cookie that is only present in the browser when the user is logged in. If not, location.reload can be triggered. A reload typically makes the browser revalidate the page rather than reuse its cached copy, by sending Cache-Control: no-cache or max-age=0 with the request.
Since session cookies themselves are typically set to HttpOnly for better security, this will likely have to be a new, parallel cookie that is not set to HttpOnly, such as shortname.loggedIn, that is always set and cleared at the same time and does not carry the actual session identifier but rather just acts as a flag.
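The check described above could look something like this. The cookie name `shortname.loggedIn` comes from the text; the `data-apos-logged-in` attribute name and both function names are assumptions for the sake of the sketch.

```javascript
// Flag cookie name as suggested above; "shortname" stands for the
// project's short name.
const FLAG_COOKIE = 'shortname.loggedIn';

// True if the flag cookie appears in a document.cookie style string.
function hasFlagCookie(cookieString) {
  return cookieString
    .split(';')
    .some((pair) => pair.trim().split('=')[0] === FLAG_COOKIE);
}

// `pageSaysLoggedIn` is what the served markup claims, and the cookie is
// what the browser knows. A mismatch means we were handed a response
// meant for the other audience.
function needsReload(pageSaysLoggedIn, cookieString) {
  return pageSaysLoggedIn !== hasFlagCookie(cookieString);
}

// In the browser, this might be wired up roughly as follows
// (data-apos-logged-in is a hypothetical attribute name):
//
//   const pageSaysLoggedIn = document.body.hasAttribute('data-apos-logged-in');
//   if (needsReload(pageSaysLoggedIn, document.cookie)) {
//     location.reload();
//   }
```

A real implementation would probably also want a guard against reloading repeatedly if a cache between the browser and Apostrophe keeps returning the same public response.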
🌌 Apostrophe already has a mechanism to force a refresh on login or logout, based on session storage, called aposStateChange. However this mechanism is probably not well-suited to this new problem, and might in fact be something we can remove partially or completely in favor of this simpler solution. Or they may coexist, depending on what we learn in further study.
Summary of recommendations
My recommendations are:
Make Apostrophe a better caching citizen by eliminating unnecessary cookies (already implemented).
Offer a simple way to configure max-age for Apostrophe page and GET REST API responses. However, there must be a simple way to opt an API out of this, which will require tech design.
To support invalidation, implement backlinks for Apostrophe documents, set ETag to the last time the document or any of its backlinks was modified (stored as a new property of the document), and implement If-None-Match support as a fast check for a match with that single property. This unlocks the possibility of performant invalidation for the most common cases, although it can never be perfect for the reasons explained above, so there should always be a max-age too when using any kind of page or API request web caching in Apostrophe. Implementing backlinks also unlocks possibilities like showing users where an image is used, merging two tags with good performance, etc. However, this is the most challenging recommendation to implement.
Ship express-cache-on-demand, enabled by default, because it is an immediate benefit to all users that doesn’t have any of the downsides discussed above, although the benefit is mainly for single pages receiving very high traffic.
Other than express-cache-on-demand, don’t implement the cache itself directly in Apostrophe. Instead, our documentation should cover how to cache “at the edge” by enabling caching in nginx (for smaller customers and the open source community), as well as in Cloudflare (a good choice and even free for many sites) and CloudFront (which can be a good choice for some enterprise customers).
To avoid stale content and assets after a fresh deployment, the release ID should be part of every ETag header.
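To show how the ETag recommendations fit together, here is a sketch that combines the release ID with a single stored timestamp. The property name `cacheInvalidatedAt` and the function names are hypothetical; the point is only that the If-None-Match check becomes a cheap comparison against one property instead of a full render.

```javascript
// Build an ETag from the release ID and the last time the document or
// anything linked to it changed. `cacheInvalidatedAt` is a hypothetical
// Date property maintained with the help of backlinks.
function docEtag(releaseId, doc) {
  return `"${releaseId}:${doc.cacheInvalidatedAt.getTime()}"`;
}

// Fast check for a conditional GET: no rendering required, just one
// property lookup and a string comparison.
function isNotModified(ifNoneMatch, releaseId, doc) {
  return ifNoneMatch === docEtag(releaseId, doc);
}
```

Because the release ID is part of the tag, a fresh deployment changes every ETag at once, so cached pages are revalidated even if no content was edited.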
Notes on CSRF protection and why it is so much simpler after the recent PR
CSRF protection is a lot simpler in modern browsers! While the OWASP security vulnerabilities site still warns against using a “secret cookie” as a CSRF protection mechanism, because IE11 will send cookies for all requests, even those triggered from a third-party site (a classic CSRF attack), all modern browsers have ceased to do so as long as the cookie is set with SameSite: lax or SameSite: strict.
All modern browsers other than Safari now default to SameSite: lax. However, Safari still defaults to SameSite: none, so it is important to set this explicitly for both the CSRF cookie and the session cookie, which is fixed in a recent PR.
This has led to a major simplification in the same PR: in addition to not setting a unique, secret CSRF token in a cookie, we also do not “double-send” that cookie in a header and check that they are equal, as that was only necessary to accommodate IE11’s lack of support for SameSite. All the server has to do is verify that the cookie is set to its constant, well-known value, because requests that do not satisfy the SameSite policy will not be able to send it at all. This means that using apos.http.post is no longer mandatory to satisfy Apostrophe’s CSRF check after the PR, although it is still a good practice to encourage, in case we have to reintroduce something in this area.
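In outline, the simplified check amounts to something like the sketch below. This is not the code from the PR: the cookie name, its value, and the choice to skip the check for GET, HEAD and OPTIONS are all assumptions made for illustration.

```javascript
// Hypothetical cookie name and constant value.
const CSRF_COOKIE = 'myproject.csrf';
const CSRF_VALUE = 'csrf';

// Methods assumed not to change state, and so not checked (an
// assumption of this sketch).
const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

// Value for the Set-Cookie response header. SameSite=Lax is what keeps
// the browser from attaching the cookie to cross-site POST requests.
function csrfSetCookie() {
  return `${CSRF_COOKIE}=${CSRF_VALUE}; Path=/; SameSite=Lax`;
}

// The server-side check: for state-changing methods, the cookie must be
// present with its constant, well-known value.
function passesCsrfCheck(method, cookieHeader) {
  if (SAFE_METHODS.has(method.toUpperCase())) {
    return true;
  }
  return (cookieHeader || '')
    .split(';')
    .some((pair) => pair.trim() === `${CSRF_COOKIE}=${CSRF_VALUE}`);
}
```

The value does not need to be secret. The protection comes from the browser refusing to send the cookie on cross-site requests, not from an attacker being unable to guess it.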
After the recent PR is published, A3 will have no CSRF protection built-in for IE11. However, IE11 reaches its official, final, blown-off-the-desktop-forever end of life date in June 2022, and is already functionally gone for the general public. None of our own clients have expressed a requirement for IE11 compatibility going forward, and you can't edit Apostrophe content in IE11 anyway, so we consider a lack of CSRF protection in a deprecated browser to be an acceptable loss. Those who must have CSRF protection for customer-facing APIs in IE11 can implement their own.
Hello community! What follows is a proposal to better support web caching in Apostrophe 3.x. Your input is welcome.
If you don't have time for the whole thing, please check out the summary of recommendations near the end.
What is web caching?
As DigitalOcean's "web caching basics" page says, caching is the term for storing reusable responses in order to make subsequent responses faster.