
RFC: We have to load a lot of data from the Publishing API and that can timeout #1021

Open
kevindew opened this issue May 10, 2019 · 5 comments


@kevindew
Member

This is an attempt at posting a draft of an architectural suggestion in the form of an issue rather than as a long open ADR. I'm suggesting an approach we could use to deal with Content Publisher's timeout issues. I'm posting this to clarify my thoughts and as a means to try to identify where there are agreements and disagreements before we proceed to working on this feature. 🤞

Background

Content Publisher provides users with the means to select from lengthy lists of options (examples being organisations, contacts and the taxonomy). To do this it requests a large amount of data from the Publishing API. These requests can take a long time, which can cause delays and errors for a user.

We have received numerous tickets from users regarding timeouts and some of their confusing consequences. Our monitoring also shows that we often have timeouts communicating with GDS API services and can often return slow responses.

Other applications have faced similar problems. In Whitehall this was encountered when presenting the taxonomy. The solution implemented was for the web application to read from a Redis datastore at runtime, with an hourly background task re-populating that cache.

Proposal

Content Publisher should avoid loading bulk quantities of data during web requests. This data should be stored in Redis. Sidekiq should be used as the means to update the data, which should happen periodically. The web process should access the data from Redis and should raise an error if no data is available.

Data is considered to be bulk data, and a candidate for this caching approach, when it meets the following criteria: it requires a slow web request or many web requests to load; it changes infrequently; and it is acceptable for data to be out of date. Examples of this in Content Publisher are the Taxonomy, Contacts and the various Publishing API linkables that are loaded.

The suggestion to use Redis is based on: Content Publisher already has a dependency on Redis; Redis, as a memory-based store, is well suited for this type of task; and there is GOV.UK precedent for solving this problem with Redis.

The data should be accessed in Redis through a key. A key should be prefixed with a namespace of the application name, because Redis is shared between GOV.UK applications. A key should also include some versioning identification, to allow the application to change its expectations of the structure of the stored data. Data set in Redis should have an expiry (albeit a long one) so data is eventually flushed automatically should a key no longer be used.

Sidekiq is already used as a means to run background jobs, and it can be scheduled using the sidekiq-scheduler gem, which already has precedent for use on GOV.UK and is compatible with ActiveJob. Sidekiq can make requests for the bulk data with long timeouts and can abort or retry if a request fails. The suggestion would be to use the interval scheduling type, as this would run the job when the process starts and then at a set interval thereafter. My initial suggestion would be that these jobs run every 30 minutes.
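As a rough illustration, here is a sketch of what one of these jobs might look like. The class name, the fetch_linkables helper and the key/expiry values are all hypothetical, and it assumes a Rails app with the redis gem available; only GdsApi::BaseError comes from the existing gds-api-adapters dependency.

```ruby
# A hypothetical ActiveJob scheduled by sidekiq-scheduler every 30 minutes.
class PopulateBulkDataJob < ApplicationJob
  # Retry a few times if the Publishing API is slow or unavailable.
  retry_on GdsApi::BaseError, wait: 30.seconds, attempts: 3

  # Namespaced, versioned key as described above.
  REDIS_KEY = "content-publisher:bulk-data:v1:organisation-linkables".freeze

  def perform
    # Store the data with a long expiry so an unused key is eventually flushed.
    Redis.new.set(REDIS_KEY, JSON.generate(fetch_linkables), ex: 1.week.to_i)
  end

  private

  def fetch_linkables
    # The existing (slow) Publishing API request(s) would go here; elided in
    # this sketch, since the point is moving them off the web request cycle.
    raise NotImplementedError
  end
end
```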

The web process should read data from Redis. It should be unconcerned as to the age of the data. If the data is not available an exception should be raised, which can be caught in a similar way to how GdsApi::BaseError exceptions are caught currently.
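The read side could look something like the following sketch, where BulkDataCache, NoDataError and the error view are hypothetical names, and the rescue_from mirrors (rather than reuses) the existing GdsApi::BaseError handling.

```ruby
# Hypothetical cache reader used by the web process.
class BulkDataCache
  class NoDataError < RuntimeError; end

  def self.read(key)
    data = Redis.new.get("content-publisher:bulk-data:v1:#{key}")
    raise NoDataError, "no cached data available for #{key}" unless data

    JSON.parse(data)
  end
end

class ApplicationController < ActionController::Base
  # Handled much like GdsApi::BaseError is today: show an error page rather
  # than waiting on a slow upstream request.
  rescue_from BulkDataCache::NoDataError do
    render "errors/service_unavailable", status: :service_unavailable
  end
end
```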

Complications

Non-production environments

The proposed implementation maps quite easily to a production environment; however, there are different considerations for development and test environments.

In a development environment this cache should be running (unlike the Rails cache), since these are things that are frequently too slow to load at runtime. To ease development it should be simple to reset aspects of this cache. Whilst we can expect a production environment to nearly always have data cached, it is quite likely that development environments won't have the data initially (this may well be offset by the use of interval in sidekiq scheduling), and thus a user may initially get errors until data is available.

In a test environment, we'd likely not want to use Redis as a storage means as we'd not want data to be persisted between tests or shared between two parallel tests. Thus it might well make sense to use an alternative to Redis for feature tests.

Another concern in a test environment is that, due to the application erroring if the data isn't available, there would need to be steps to populate the data before each feature test. An alternative approach that could resolve this would be allowing the application to fall back to loading the data at runtime if it isn't available.

Risks of concurrent data loading

A consequence of Content Publisher running Sidekiq on multiple machines is that any scheduled jobs will run on each machine, likely at the same time.

This is a problem that Email Alert API faced too; it uses the with_advisory_lock gem to restrict execution to one job at a time.
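If we went the same way, the job's work could be wrapped like the sketch below (PopulateBulkDataJob is the hypothetical job from the proposal above; with_advisory_lock is the method the gem adds to ActiveRecord models).

```ruby
class PopulateBulkDataJob < ApplicationJob
  def perform
    # Takes a Postgres advisory lock for the duration of the block, so when
    # each machine's scheduler fires at the same time only one run does the
    # work; the gem also supports timeouts if we'd rather skip than wait.
    ApplicationRecord.with_advisory_lock("populate-bulk-data") do
      # ...fetch from the Publishing API and write to Redis as sketched above...
    end
  end
end
```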

How frequently to query Redis

A query to Redis should be quick, but as the data involved would be large it isn't free, so querying and parsing the same data multiple times per request should be discouraged. Similarly, the data would be suitable to persist across requests and could be memoised in the application. However, it shouldn't be persisted for longer than the interval at which the data is refreshed.
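One possible shape for this is to memoise within an object that lives for the length of a request, as in the sketch below (ContactsLookup and BulkDataCache are hypothetical names and the contact fields are illustrative): the Redis read and JSON parse then happen at most once per request and nothing outlives the refresh interval.

```ruby
# Hypothetical request-scoped lookup: the controller builds one of these per
# request and passes it to whatever needs contact data.
class ContactsLookup
  def by_content_id(content_id)
    contacts.find { |contact| contact["content_id"] == content_id }
  end

  private

  def contacts
    # Memoised so Redis is queried and the payload parsed at most once per
    # instance (and therefore at most once per request).
    @contacts ||= BulkDataCache.read("contacts")
  end
end
```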

A proposed implementation strategy

This is based on ContactsService, which is currently a single class that has the following responsibilities:

  • Interfacing with the Publishing API to request contacts
  • Interfacing with the Rails cache
  • Providing the means to look up a single contact
  • Providing the means to look up contacts by organisation

This could be broken into

A ContactsService class

with the responsibilities of:

  • Providing the means to look up a single contact
  • Providing the means to look up contacts by organisation

A ContactsService::Store class

which has the responsibilities of:

  • Communicating with the Publishing API
  • Interfacing with the BulkDataCache

A BulkDataCache class

which provides the means to write data to, read data from and clear data from an adapter.

There would be two adapters for BulkDataCache: one to interface with Redis (for development and production environments) and a memory one that stores data in a Ruby hash (used in a test environment).
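A sketch of how these classes could hang together is below. All names beyond ContactsService are hypothetical, the contact fields are illustrative, and the Publishing API call feeding the Store is elided.

```ruby
# Looks up contacts; knows nothing about where the data comes from.
class ContactsService
  def by_content_id(content_id)
    all_contacts.find { |contact| contact["content_id"] == content_id }
  end

  def all_by_organisation(organisation_content_id)
    all_contacts.select do |contact|
      contact.dig("links", "organisations").to_a.include?(organisation_content_id)
    end
  end

  private

  def all_contacts
    @all_contacts ||= Store.new.contacts
  end

  # Knows how to read and write contacts via the BulkDataCache; the scheduled
  # job would call populate with data fetched from the Publishing API (the
  # slow request itself is elided in this sketch).
  class Store
    def contacts
      BulkDataCache.read("contacts")
    end

    def populate(contacts)
      BulkDataCache.write("contacts", contacts)
    end
  end
end

# Thin wrapper over an adapter, chosen per environment.
class BulkDataCache
  class NoDataError < RuntimeError; end

  def self.adapter
    @adapter ||= Rails.env.test? ? MemoryAdapter.new : RedisAdapter.new
  end

  def self.read(key)
    adapter.read(key) || raise(NoDataError, "no cached data for #{key}")
  end

  def self.write(key, value)
    adapter.write(key, value)
  end

  def self.clear(key)
    adapter.clear(key)
  end

  # Test environment: nothing persists between tests once cleared.
  class MemoryAdapter
    def initialize
      @data = {}
    end

    def read(key)
      @data[key]
    end

    def write(key, value)
      @data[key] = value
    end

    def clear(key)
      @data.delete(key)
    end
  end

  # Development and production: namespaced, versioned keys with a long expiry.
  class RedisAdapter
    KEY_PREFIX = "content-publisher:bulk-data:v1".freeze

    def read(key)
      data = redis.get("#{KEY_PREFIX}:#{key}")
      JSON.parse(data) if data
    end

    def write(key, value)
      redis.set("#{KEY_PREFIX}:#{key}", JSON.generate(value), ex: 1.week.to_i)
    end

    def clear(key)
      redis.del("#{KEY_PREFIX}:#{key}")
    end

    private

    def redis
      @redis ||= Redis.new
    end
  end
end
```

The adapter split keeps feature tests away from the shared Redis instance, which addresses the test environment concern above.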

@benthorner
Contributor

This all makes sense to me - the reasoning about which tools to use is very clear :-). I've got a few questions and thoughts about the approach, which I've split up to make it easier to digest.

Cache or Error

I like the decision to go for a 'cache-or-nothing' approach to make things clear for users, but this does make me a little concerned about the robustness of the end result. The way we're looking to use Redis seems quite similar to an API, in that we rely on it working and error when it doesn't. Thinking about the approach this way has helped me to come up with a couple of possible improvements:

  • Would it be helpful to introduce alerts for when the cache is not ready? This could be part of the app healthcheck
  • Would it be helpful to kick-off a background job to fill the cache, if a user queries the app when the cache is not ready?

We do have some precedent for the cache being empty unexpectedly, so I'm trying to think of ways the app can self-heal, without manual intervention.

Frequent Queries

The complication around frequent queries for subsets of the cached data is interesting. In the case of the taxonomy data, we load the entire structure and persist it across the request, carefully making sure we're working on the local copy of the data and not querying it again.

Reading this paragraph, I'm not sure if you're suggesting some other approaches (the wording is a little vague): re-use the cache to store some of the smaller subsets; abstract the memoisation of the entire data structure so it's not a recurring concern.

It would be good to get a little more detail on what you think a data flow would look like in practice.

Scheduling

I was surprised Sidekiq scheduler isn't so good in a distributed system. Looking at the gem, though, it seems that it might be OK with certain types of schedule. But then this article makes me less confident the documentation is accurate :-|.

On the other hand, the operation we're doing is presumably write-only, so perhaps we don't care too much if two jobs accidentally run at the same time once in a while?

@kevindew
Member Author

Thank you @benthorner

Cache or Error

I like the decision to go for a 'cache-or-nothing' approach to make things clear for users, but this does make me a little concerned about the robustness of the end result. The way we're looking to use Redis seems quite similar to an API, in that we rely on it working and error when it doesn't. Thinking about the approach this way has helped me to come up with a couple of possible improvements:

  • Would it be helpful to introduce alerts for when the cache is not ready? This could be part of the app healthcheck

Yup, I think that’d be useful. I’m not sure quite what the storage means would be for it, though. It would be useful for it to be critical if the cache is empty and a warning if it’s not been updated for a long period of time.
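For illustration, a hypothetical check along those lines; the interface it plugs into and the populated?/updated_at helpers are assumptions, and a written-at timestamp would need to be stored alongside the data.

```ruby
# Hypothetical healthcheck: critical when there's no cached data at all,
# warning when it hasn't been refreshed for a while.
class BulkDataHealthcheck
  def status
    return :critical unless BulkDataCache.populated?
    return :warning if BulkDataCache.updated_at < 3.hours.ago

    :ok
  end
end
```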

  • Would it be helpful to kick-off a background job to fill the cache, if a user queries the app when the cache is not ready?

I’m not sure about that; it’s a route I hadn’t considered. The two main options I considered were erroring or loading the data (and just dealing with a slow request). The perfectionist in me liked falling back to loading data, but the pragmatist felt it added a bunch of extra code paths.

This could be a good middle ground, but I’d consider it something that’d be a nice-to-have if it doesn’t create strange scenarios / dependencies.

We do have some precedent for the cache being empty unexpectedly, so I'm trying to think of ways the app can self-heal, without manual intervention.

Good thinking 👍

Frequent Queries

The complication around frequent queries for subsets of the cached data is interesting. In the case of the taxonomy data, we load the entire structure and persist it across the request, carefully making sure we're working on the local copy of the data and not querying it again.

Reading this paragraph, I'm not sure if you're suggesting some other approaches (the wording is a little vague): re-use the cache to store some of the smaller subsets; abstract the memoisation of the entire data structure so it's not a recurring concern.

It would be good to get a little more detail on what you think a data flow would look like in practice.

I don’t actually have a suggestion for that. My brain came up inconclusive on something that would be idiomatic for sharing data across just a request. The way I could imagine it is something available at a controller level and injected into objects and views by the controller. However, we don’t have precedent for that or anything similar, so it may seem strange.

I’m mostly expecting we may just have to hit Redis multiple times on occasion and then deal with any smells that produces (if any). It feels better than having to deal with a long-living object or a singleton approach. Maybe some of the view lookups of the data could use Rails.cache.
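For the Rails.cache idea, something like the sketch below could sit behind a view helper. The key, the expiry and the all_by_organisation method are illustrative, and the expiry should stay shorter than the interval at which the bulk data is refreshed.

```ruby
# Caches a per-organisation lookup between requests so repeated renders don't
# re-query Redis and re-parse the full contacts payload each time.
def contacts_for_organisation(organisation_content_id)
  Rails.cache.fetch("contacts-for-#{organisation_content_id}", expires_in: 15.minutes) do
    ContactsService.new.all_by_organisation(organisation_content_id)
  end
end
```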

Scheduling

I was surprised Sidekiq scheduler isn't so good in a distributed system. Looking at the gem, though, it seems that it might be OK with certain types of schedule. But then this article makes me less confident the documentation is accurate :-|.

On the other hand, the operation we're doing is presumably write-only, so perhaps we don't care too much if two jobs accidentally run at the same time once in a while?

Yeah, the cron and at types are OK because the jobs are pushed to Redis and whichever machine gets there first runs them, whereas with the other types each machine just tries to run the job at the scheduled time and Redis isn’t involved.

I think it’s fine though, as long as we have something that stops them running concurrently so we're not being needlessly wasteful. Like you say, it’s not much of a problem if occasionally the job runs twice.

@benthorner
Contributor

Discussed in-person and all makes sense to me. Let's do it!

@kevindew
Member Author

kevindew commented Aug 5, 2019

I did some work on this in https://github.com/alphagov/content-publisher/tree/2nd-pass-bulk-data-cache

Next steps forward seem to be modelling-related challenges:

  1. How do we model the mediator classes that pull data from the cache, and what data do they return? I expect we may consider them adapters, place them in app/adapters and have different classes for different data on GOV.UK. I figured they'd probably have a private class for dealing with the API and return some sort of low-fi model, though I'm not too sure. The quick thing I put together is: 437fde3
  2. The classes and their roles for error handling. If cached data isn't available we'd want to show a view and add a job to update the cache; I'm not sure what level this happens at.
  3. Decide whether to have multiple jobs for updating the data or a single one, and how to handle failures.

@kevindew
Member Author

kevindew commented Dec 4, 2019

I've been doing some tinkering with this in: https://github.com/alphagov/content-publisher/compare/bulk-data-cache-nov - going to try to move it towards something potentially mergeable.
