Load government data in an asynchronous task #1522

Merged: 30 commits into master from bulk-data-cache-nov on Dec 17, 2019

Conversation

kevindew (Member):

Trello: https://trello.com/c/TQ6GPOHU/1228-spike-into-loading-governments-from-publishing-api

This introduces a new system, called BulkData, which is used to load data asynchronously for use by Content Publisher. It is intended for situations where we need to load and cache a large amount of data; loading this on demand can cause timeouts at various places across the app, which is undesirable. The replacement system uses a regular ActiveJob that updates the data every 15 minutes, making the performance impact a background concern. This builds on the work identified in #1021 and is the first step towards implementing it to resolve various timeout problems.

What this PR does specifically is start an ActiveJob whenever the application is deployed, and run it again every 15 minutes. This job calls the Publishing API to fetch all governments and stores the response in Redis for 24 hours.

Whenever Content Publisher needs to look up governments it queries Redis for this data. If the data is unavailable it schedules the job to populate the data and throws an exception. Users will be shown an error until the data is re-populated.

Users are likely to see this error fairly often in development environments if they are not used to starting the worker alongside the app. As a follow-up we should update govuk-docker to start content-publisher with a worker.

Screenshots of error pages:

Production / Test

[Screenshot: production/test error page]

Development

[Screenshot: development error page]

Further information in the commits


module BulkData
class Cache
include Singleton
Contributor:

Nice! I'm surprised you don't need to require "singleton" here.

Member Author:

Ah Rails auto/eager loading takes care of it.

@ChrisBAshton (Contributor) left a comment:

Well done on getting this significant piece of work to this point! I've attempted to review it; some of my questions/suggestions may be missing context, so do feel free to bat back on them.

lib/bulk_data/cache.rb (outdated; resolved)
lib/bulk_data/cache.rb (resolved)

describe ".wrote_at" do
it "returns the time the entry was created" do
time = 2.hours.ago.change(usec: 0)
Contributor:

A shame that the .change(usec: 0) bit is needed. Why is that?

Member Author:

It is a shame indeed. It's because travel_to truncates the microseconds from the time it freezes, so we need ours to match: https://github.com/rails/rails/blob/8bec77cc0f1fd47677a331a64f68c5918efd2ca9/activesupport/lib/active_support/testing/time_helpers.rb#L145

I could use an absolute time, but I need it to be within the last 24 hours, so it's tricky :-/
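For illustration, a minimal sketch of the mismatch; this assumes the .wrote_at API from the spec excerpt above, and the write call is also an assumption:

```
time = Time.zone.now                    # e.g. ...09:15:32.123456
travel_to(time) do                      # travel_to freezes ...09:15:32.000000
  BulkData::Cache.write("key", "value")
end

BulkData::Cache.wrote_at("key") == time                 # false, usec differ
BulkData::Cache.wrote_at("key") == time.change(usec: 0) # true
```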

spec/lib/bulk_data/cache_spec.rb (outdated; resolved)
spec/lib/bulk_data/cache_spec.rb (outdated; resolved)
app/models/edition.rb (resolved)
spec/models/edition_spec.rb (resolved)
@@ -64,6 +64,7 @@ def and_a_publish_intent_is_created

def when_the_publishing_job_runs
travel_to(Time.zone.parse("2019-6-22 9:00"))
populate_default_government_bulk_data
Contributor:

I think the fact that you've had to make this change in 3 places suggests we should move the when_the_publishing_job_runs implementation somewhere shared 🤔; all three implementations are almost identical.

Member Author:

I think ideally we'd just have fewer feature tests for this, with more of the other conditions handled by request specs, where this would not be a concern. It doesn't seem out of place to do this sort of thing a few times in our feature specs.

I was rather sad when I realised the effect of the travel_to on clearing the cache and that I had to do this.

Contributor:

What was the effect of travel_to on the cache?

Member Author:

We cache bulk data for 24 hours. If we travel more than 24 hours past that time in Ruby code, the cache is invalidated. Here we jump from June 21st 0:00am to June 22nd 9:00am.

spec/services/resync_service_spec.rb (outdated; resolved)
<%= render partial: "errors/error", locals: { title: title, body_govspeak: body_govspeak } %>
<% else %>
<%= render partial: "errors/error", locals: t("errors.local_data_unavailable") %>
<% end %>
Contributor:

This whole template is smashing 👍

Member Author:

Thanks 😁

@leenagupte (Contributor) left a comment:

This looks great! 💯
The commit messages are great and make the code changes easy to follow, and the tests are really clear. Thanks for breaking up the commit history into smaller commits too, it made reviewing this much easier.

I only have a few minor comments and things that could be clarified, but nothing major.

spec/lib/bulk_data/cache_spec.rb (outdated; resolved)
spec/models/government_spec.rb (outdated; resolved)
lib/bulk_data/government_repository.rb (resolved)
spec/lib/bulk_data/government_repository_spec.rb (outdated; resolved)
spec/lib/bulk_data/government_repository_spec.rb (outdated; resolved)
spec/jobs/application_job_spec.rb (outdated; resolved)
describe "POST /documents/:document/unwithdraw" do
context "when the edition is in history mode" do
let(:edition) do
create(:edition, :withdrawn, :political, government: past_government)
end
Contributor:

I'm not sure why this is being defined twice now, rather than once at the beginning?

Member Author:

This is now moot, as I had to rebase against Peter's changes, which have superseded this.

I tend to try to keep lets near the code that uses them. I don't mind a bit of repetition if it's easy to find where the let is defined, mainly because I find it difficult to read tests where the let definitions occur a long way before the test.

describe "POST /documents/:document/withdraw" do
context "when the edition is in history mode" do
let(:edition) do
create(:edition, :published, :political, government: past_government)
end
Contributor:

Same here. Is there a reason for duplicating this setup?

spec/jobs/populate_bulk_data_job_spec.rb (resolved)
lib/bulk_data/cache.rb (resolved)
Bulk data is the namespace that we will be using for downloading large amounts of data periodically for use in Content Publisher. Some examples of candidates for this are topics, contacts and organisations. These are data entities where we need to download all of the Publishing API's data rather than a small sample on demand.

Previously we loaded this data on demand, which risks periodic timeouts for the application. As this data can be loaded in numerous places, it also makes performance issues difficult to pin down, since the loading may be a contributing factor to any other task we think is slow.

This module contains two common exceptions that we anticipate using. LocalDataUnavailableError is for situations where we expect data to be available in our cache but it is not present. RemoteDataUnavailableError is for when we try to load remote data and this fails.
@kevindew (Member Author):

@leenagupte and @ChrisBAshton thanks for the reviews. I've addressed the comments.

This class is intended to be the entry point for writing any BulkData.
It uses the Rails Redis cache as an underlying store.

Using Redis was a design choice for the following reasons:

- Redis offers caching across multiple GOV.UK boxes, so one write will
cover all machines
- Redis is fast to read and write from

And using the Rails cache implementation offered the following
advantages:

- Can be configured with [middleware][1] to avoid needing to hit Redis multiple times in a request
- Redis namespacing
- Automatic serialisation/deserialisation

An important caveat is that we need to ignore Rails' guidance to use an LRU Redis cache, as we don't want keys to be automatically evicted from Redis. Instead we just want them to expire naturally.

The methods on this class allow adding an item to the cache, removing an item, and removing all items.

On top of this there are time-based methods, intended to determine when an item was added to the cache. This enables checking how old something is when deciding whether to re-populate it. Redis doesn't offer this by default, so it's simplest to store the write time as a separate entry. The Rails cache entry does hold some information about creation time, but there's no obvious way to access it, and it's probably nicer not to have to pull out the whole cache entry just to check its age.

[1]: https://github.com/rails/rails/blob/9926fd86626690746c7ee78174c4d3577bdfa58d/activesupport/lib/active_support/cache/redis_cache_store.rb#L44
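As a rough, hedged sketch of the shape this describes (the method names, key layout and expiry constant are assumptions rather than the actual implementation):

```
require "singleton"

module BulkData
  class Cache
    include Singleton

    EXPIRES_IN = 24.hours

    def self.write(key, value)
      instance.store.write(key, value, expires_in: EXPIRES_IN)
      # Redis has no easily reachable "created at" metadata per key, so the
      # write time is recorded as a separate entry alongside the data.
      instance.store.write("#{key}:wrote-at", Time.zone.now, expires_in: EXPIRES_IN)
    end

    def self.read(key)
      instance.store.read(key)
    end

    def self.wrote_at(key)
      instance.store.read("#{key}:wrote-at")
    end

    def self.delete(key)
      instance.store.delete(key)
      instance.store.delete("#{key}:wrote-at")
    end

    def self.clear
      instance.store.clear
    end

    def store
      # Entries expire naturally rather than being evicted by an LRU policy.
      @store ||= ActiveSupport::Cache::RedisCacheStore.new(namespace: "bulk-data")
    end
  end
end
```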
This sets up this model to work with data from the Publishing API. It no
longer has class methods to look up the model as these are going to be
on a different class.

The input we expect from the Publishing API is a hash of base_path,
content_id, details, locale and title. We can look up started_on,
ended_on and current dates from the details hash.
This allows classes that mix this module in to be serialised to a hash.
This has been added so that instances that use this module can be
serialised to fake cache data.
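A hedged sketch of such a mixin; the module name and attribute handling are assumptions:

```
module AsHash
  # Serialise an object's instance variables to a plain hash, suitable for
  # writing to the cache or for faking cache data in tests.
  def to_h
    instance_variables.each_with_object({}) do |variable, hash|
      hash[variable.to_s.delete_prefix("@")] = instance_variable_get(variable)
    end
  end
end
```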
@ChrisBAshton (Contributor) left a comment:

Have skimmed through all the commits again, looks good to me 👍

The purpose of this class is to mediate communication between Publishing
API and our cache adapter for governments.

It writes entries to the cache using a cache key that is defined as a
constant in this class. This is named as "government-v1" to indicate
that this is version 1 of the data we are caching. If we are to make any
backwards incompatible changes to the data we expect (such as adding an
extra mandatory field) then we will need to increment this number. This
will ensure that when the code change is deployed the old cache will not
be used anymore and will expire naturally.

It provides a number of methods for finding a particular government and a method to populate the cache.
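A hedged sketch of the repository; the cache key convention comes from the commit above, while the finder methods and the Publishing API adapter call are assumptions:

```
module BulkData
  class GovernmentRepository
    CACHE_KEY = "government-v1" # bump for backwards-incompatible data changes

    def find(content_id)
      all.find { |government| government.content_id == content_id }
    end

    def for_date(date)
      all.find { |government| government.covers?(date) }
    end

    def current
      all.find(&:current?)
    end

    def populate_cache
      # Fetch every government from the Publishing API; the exact adapter
      # call and parameters here are assumptions.
      data = GdsApi.publishing_api.get_editions(document_types: %w[government]).to_hash
      Cache.write(CACHE_KEY, data)
    end

    private

    def all
      data = Cache.read(CACHE_KEY) || raise(LocalDataUnavailableError)
      data.map { |attributes| Government.new(attributes) }
    end
  end
end
```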
This clears the BulkData::Cache after every spec and adds helpers to populate the cache before tests.

Default governments are set up for the most common case where we just
need to know about the current and previous governments.

It also provides helper methods for current and past government as these
are likely to be needed quite frequently.
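A hedged sketch of how this support file might look; populate_default_government_bulk_data is named elsewhere in this PR, while the factory traits here are assumptions:

```
RSpec.configure do |config|
  # Avoid cache state leaking between examples.
  config.after { BulkData::Cache.clear }

  # Feature and request specs get default government data up front.
  %i[feature request].each do |type|
    config.before(:each, type: type) { populate_default_government_bulk_data }
  end
end

def populate_default_government_bulk_data
  governments = [build(:government), build(:government, :past)]
  BulkData::Cache.write("government-v1", governments.map(&:to_h))
end
```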
This creates instances of the GovernmentRepository class whenever we
need to load governments in the context of an edition. The factory has
been updated to reflect that governments can't be determined without an
external dependency and thus governments are always injected.

This does raise some potential performance concerns about the number of objects initialised if we frequently look up government information. Mostly, though, I expect these lookups not to occur multiple times in the same request, so I don't think it'll cause any actual performance issues.
This removes unnecessary instance variables and reduces long lines; it also removes an unnecessary test expectation.
By using I18n.t we wouldn't raise errors if a translation was missing. We don't actually need to call I18n directly, as Rails has a t view helper which roughly corresponds to I18n.t!
This field is referred to as a title in the Publishing API
This sets up some default governments in advance of tests so that the government data is populated. These are set up before all feature and request tests.
By jumping more than 24 hours into the future the cache entries for
governments expire, thus the government data needs to be re-stubbed.
As we no longer have past_government and current_government traits, this injects these objects in instead.
This cuts back on the boilerplate needed to run a worker job on a single box and turn away other boxes that try to run it at the same time. This situation occurs with the Sidekiq scheduler, where every box with a Sidekiq process tries to run the job at the same time and the first one to start it wins.

We counter this with a PostgreSQL advisory lock, which prevents any other client from acquiring the same lock while it is held (they gracefully exit). Ideally we take this lock inside a transaction so that if anything goes wrong the lock is released automatically, whereas if we take it outside a transaction and something crashes we may need to release the lock manually.
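A hedged sketch of the helper this describes; the method name and the SQL are assumptions, showing one standard way to take a transaction-scoped advisory lock:

```
require "zlib"

class ApplicationJob < ActiveJob::Base
  def run_exclusively(lock_name: self.class.name)
    ActiveRecord::Base.transaction do
      # pg_try_advisory_xact_lock returns false immediately if another client
      # holds the lock, and releases it automatically when the transaction
      # ends, even if the job crashes midway.
      acquired = ActiveRecord::Base.connection.select_value(
        "SELECT pg_try_advisory_xact_lock(#{Zlib.crc32(lock_name)})",
      )

      yield if acquired # boxes that lose the race gracefully exit
    end
  end
end
```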
It was more common for us to use Sidekiq::Testing.fake! than to rely on Sidekiq::Testing.inline!, so switching to fake! as the default saves us code. It's also the preferable approach, since it keeps us more honest by making us explicitly tell Sidekiq to perform actions rather than expecting them to just happen synchronously.
The first_published_at value should always be set on documents that have
a live edition.

I didn't know a nice way to specify an optional trait to the association method, so I've ended up with this array monstrosity. If you know a better technique, feel free to change it.
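For what it's worth, a hedged sketch of the FactoryBot pattern this likely refers to; the factory and trait names are assumptions:

```
factory :edition do
  # An association can name a factory plus traits as an array, which is the
  # "array monstrosity" way to apply an optional trait.
  association :document, factory: %i[document with_live_edition]
end
```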
This moves the it block outside of a context so that it sits before the contexts, as per the more common convention. It also removes the government and political tests from their draft and live contexts, since they aren't resultant effects of those particular contexts.

Finally, it makes the draft and live tests more consistent with the other edition state tests by creating an edition rather than a document with FactoryBot.
This sets up a job that is run every 15 minutes to update the bulk data.
If accessing the data fails this job is retried.

The schedule is set with an interval of 15 minutes and a delay of 0s, which makes the job start immediately whenever the worker process is restarted. As this job runs on a schedule it will fire on all machines at the same time; to counter this, the job is run exclusively using the method on the parent class.
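A hedged sketch of the job; the class name matches the spec file referenced earlier in this thread, while the retry policy is an assumption:

```
class PopulateBulkDataJob < ApplicationJob
  retry_on BulkData::RemoteDataUnavailableError, wait: :exponentially_longer

  # Scheduled via sidekiq-scheduler with a 15 minute interval and a 0s
  # initial delay, so it also runs as soon as a worker process starts.
  def perform
    run_exclusively do
      BulkData::GovernmentRepository.new.populate_cache
    end
  end
end
```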
This schedules a background job to load the data if, for some reason, it's not available in the current environment and a service has tried to use it.

There is a risk that this could cause a spike of identical queued jobs were a bunch of users to hit this at the same time. This shouldn't present much of a problem, as the job only executes one at a time and won't do anything if the data was updated in the last 5 minutes.
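A hedged sketch of what this fallback might look like in the repository's read path (names are assumptions):

```
def all
  data = Cache.read(CACHE_KEY)

  unless data
    PopulateBulkDataJob.perform_later # repopulate in the background
    raise LocalDataUnavailableError   # callers show an error page meanwhile
  end

  data.map { |attributes| Government.new(attributes) }
end
```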
If the application is ever in a state where local data is unavailable, it will return a 503 response and a simple error message. As developers may often be the people who see this message (at least while Sidekiq isn't run by default with the app), I thought it was useful to have explicit instructions in a development environment.

It was pretty hard to think of the best way to test this, so I went with this rather contrived example, since documents_show is one of the main routes. I expect that when the BulkData technique is used as part of index page filters we can replace this test with a less contrived one.
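A hedged sketch of the controller handling; rescue_from is standard Rails, while the template path is an assumption:

```
class ApplicationController < ActionController::Base
  rescue_from BulkData::LocalDataUnavailableError do
    # The error template renders developer instructions in development.
    render "errors/local_data_unavailable", status: :service_unavailable # 503
  end
end
```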
This is no longer needed given we are now using the Publishing API
This spec relies on a government existing at the date we specify, which caused this test to fail. By creating the government in the test we resolve this complication.
This sets up a local cache that wraps the Redis cache layer of BulkData::Cache for the duration of a request.

For this to work in a development environment, the BulkData::Cache class must not be reloaded automatically. I achieved this with direct require calls; I first tried using the Rails autoload_once_paths option but didn't have much luck.

As an example, before fixed require calls:

```
irb(main):002:0> BulkData::Cache.object_id
=> 70342543814920
irb(main):003:0> reload!
Reloading...
=> true
irb(main):004:0> BulkData::Cache.object_id
=> 70342595490220
```

and after:

```
irb(main):002:0> BulkData::Cache.object_id
=> 70249028613320
irb(main):004:0> reload!
Reloading...
=> true
irb(main):005:0> BulkData::Cache.object_id
=> 70249028613320
```
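A hedged sketch of wiring this up; RedisCacheStore supports ActiveSupport's LocalCache strategy, whose middleware memoises reads per request, though the insertion point and accessor here are assumptions:

```
# config/application.rb (sketch)
config.middleware.insert_before(
  Rack::Runtime,
  BulkData::Cache.instance.store.middleware,
)
```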
By default the RedisCacheStore takes a fault-tolerant approach: it returns nil if it can't access Redis and writes to a logger that we haven't configured [1].

This isn't an ideal configuration for our use case, as an unavailable cache is a critical problem for us. To mitigate connection problems we allow 3 reconnect attempts if we fail to connect to Redis or lose our connection.

In the event of an error reading from the cache, we report the error to Sentry, which should help make problems easier to diagnose.

[1]: https://github.com/rails/rails/blob/09a2979f75c51afb797dd60261a8930f84144af8/activesupport/lib/active_support/cache/redis_cache_store.rb#L472-L485
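A hedged sketch of the configuration; reconnect_attempts and error_handler are real RedisCacheStore options, while the GovukError reporting call assumes the GOV.UK Sentry wrapper:

```
ActiveSupport::Cache::RedisCacheStore.new(
  namespace: "bulk-data", # assumed namespace
  reconnect_attempts: 3,
  error_handler: lambda do |method:, returning:, exception:|
    # An unavailable cache is critical for us, so report instead of only
    # logging and silently returning nil.
    GovukError.notify(exception)
  end,
)
```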
There's not a whole lot we can do when a dependent service is erroring, so let's not report it and instead wait to see if it causes problems in the app.

This resolves an issue where we started getting server errors overnight in integration at the point of the data sync.
@leenagupte (Contributor) left a comment:

Thanks for addressing the comments.

LGTM 👍 🥇

@kevindew kevindew merged commit 151f9b4 into master Dec 17, 2019
@kevindew kevindew deleted the bulk-data-cache-nov branch December 17, 2019 08:59
kevindew added a commit to alphagov/govuk-docker that referenced this pull request Jan 6, 2020
Since alphagov/content-publisher#1522 was introduced, Content Publisher has required a worker to operate normally. Because of this, it is now added as an app dependency.