
Load government data in an asynchronous task #1522

Merged
30 commits merged on Dec 17, 2019

Commits on Dec 13, 2019

  1. Create a module for BulkData

    BulkData is the namespace we will use for periodically downloading large
    amounts of data for use in Content Publisher. Examples of candidates for
    this are topics, contacts and organisations: data entities where we need
    to download all of the Publishing API's data rather than a small sample
    on demand.

    Previously we loaded this data on demand, which risks periodic timeouts
    for the application. Because the data can be loaded in numerous places,
    it also makes performance issues difficult to pin down, as the loading
    may be a contributing factor to any other task we think is slow.

    This module contains two common exceptions that we anticipate using.
    LocalDataUnavailableError is for situations where we expect data to be
    available in our cache but it is not present. RemoteDataUnavailableError
    is for when an attempt to load remote data fails.
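
    A minimal sketch of how this module might look, based on the description
    above (treating both exceptions as plain RuntimeError subclasses is an
    assumption):

    ```
    module BulkData
      # Raised when we expect data to be present in our cache but it is not.
      class LocalDataUnavailableError < RuntimeError; end

      # Raised when an attempt to load data from a remote source fails.
      class RemoteDataUnavailableError < RuntimeError; end
    end
    ```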
    kevindew committed Dec 13, 2019 (62bab09)

Commits on Dec 15, 2019

  1. BulkData cache singleton class

    This class is intended to be the entry point for writing any BulkData.
    It uses the Rails Redis cache as an underlying store.
    
    Using Redis was a design choice for the following reasons:
    
    - Redis offers caching across multiple GOV.UK boxes, so one write will
    cover all machines
    - Redis is fast to read and write from
    
    And using the Rails cache implementation offered the following
    advantages:
    
    - Can be configured with [middleware][1] to avoid the need to hit Redis
      multiple times in a request
    - Redis namespacing
    - Automatic serialisation/deserialisation
    
    An important concern is that we need to ignore Rails' guidance to use an
    LRU Redis cache, as we don't want keys to be automatically evicted from
    Redis. Instead we just want them to expire naturally.

    The methods on this class allow adding an item to the cache, removing an
    item, and removing all items.

    On top of this there are time-based methods intended to determine when an
    item was added to the cache, so we can check how old something is when
    deciding whether to refresh it. Redis doesn't offer this by default, so
    it's simplest to store the write time as a separate entry. The Rails
    cache entry does hold some information about creation time, but I have no
    idea how to access it, and it's probably nicer not to pull out the whole
    cache entry just to check it.
    
    [1]: https://github.com/rails/rails/blob/9926fd86626690746c7ee78174c4d3577bdfa58d/activesupport/lib/active_support/cache/redis_cache_store.rb#L44
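
    As a rough sketch of the adapter (the method names, the 24 hour expiry
    and the separate "<key>:created" entry used to record the write time are
    assumptions based on the description above):

    ```
    require "singleton"

    module BulkData
      class Cache
        include Singleton

        class << self
          delegate :write, :read, :delete, :clear, :written_at, to: :instance
        end

        def write(key, value)
          store.write(key, value, expires_in: 24.hours)
          # A separate entry records when the key was written, since Redis
          # doesn't expose this by default.
          store.write("#{key}:created", Time.zone.now, expires_in: 24.hours)
        end

        def read(key)
          store.read(key)
        end

        def written_at(key)
          store.read("#{key}:created")
        end

        def delete(key)
          store.delete(key)
          store.delete("#{key}:created")
        end

        def clear
          store.clear
        end

        def store
          @store ||= ActiveSupport::Cache::RedisCacheStore.new(namespace: "bulk-data")
        end
      end
    end
    ```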
    kevindew committed Dec 15, 2019 (a764576)
  2. Adapt Government model for loading from Publishing API

    This sets up the model to work with data from the Publishing API. It no
    longer has class methods to look up the model, as these are going to
    live on a different class.

    The input we expect from the Publishing API is a hash of base_path,
    content_id, details, locale and title. We can look up the started_on and
    ended_on dates and the current flag from the details hash.
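
    A sketch of the reworked model under those expectations (the helper
    methods and the string-keyed access to details are assumptions for
    illustration):

    ```
    class Government
      include InitializeWithHash

      attr_reader :base_path, :content_id, :details, :locale, :title

      def started_on
        Date.parse(details["started_on"])
      end

      def ended_on
        Date.parse(details["ended_on"]) if details["ended_on"]
      end

      def current?
        details["current"] == true
      end

      # Expects a Date; checks whether this government was in office then.
      def covers?(date)
        return false if date < started_on
        return false if ended_on && date > ended_on

        true
      end
    end
    ```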
    kevindew committed Dec 15, 2019 (659c391)
  3. Provide a to_h method on InitializeWithHash

    This allows instances of classes that mix this module in to be serialised
    to a hash. It has been added so that those instances can be serialised
    into fake cache data.
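
    A hedged sketch of the addition, assuming the module assigns each hash
    key to an instance variable on initialisation (the real implementation
    may differ):

    ```
    module InitializeWithHash
      def initialize(params = {})
        params.each { |key, value| instance_variable_set("@#{key}", value) }
      end

      # Serialise the instance back to a hash of its attributes.
      def to_h
        instance_variables.each_with_object({}) do |name, hash|
          hash[name.to_s.delete_prefix("@").to_sym] = instance_variable_get(name)
        end
      end
    end
    ```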
    kevindew committed Dec 15, 2019 (ec71269)

Commits on Dec 16, 2019

  1. GovernmentRepository class for BulkData

    The purpose of this class is to mediate communication between Publishing
    API and our cache adapter for governments.
    
    It writes entries to the cache using a cache key that is defined as a
    constant in this class. The key is named "government-v1" to indicate that
    this is version 1 of the data we are caching. If we make any backwards
    incompatible changes to the data we expect (such as adding an extra
    mandatory field) then we will need to increment this number. This ensures
    that when the code change is deployed the old cache entry will no longer
    be used and will expire naturally.

    It provides a number of methods for finding a particular government and a
    method to populate the cache.
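
    A simplified sketch of the repository; the finder method names, the
    Publishing API query and the error wrapping are assumptions drawn from
    this and the later commit descriptions:

    ```
    class BulkData::GovernmentRepository
      CACHE_KEY = "government-v1"

      def find(content_id)
        all.find { |government| government.content_id == content_id }
      end

      def current
        all.find(&:current?)
      end

      def for_date(date)
        all.find { |government| date && government.covers?(date) }
      end

      def populate_cache(older_than: nil)
        written_at = BulkData::Cache.written_at(CACHE_KEY)
        return if older_than && written_at && written_at > older_than

        # The exact gds-api-adapters call and parameters are illustrative.
        results = GdsApi.publishing_api.get_editions(
          document_types: %w[government],
          fields: %w[base_path content_id details locale title],
          states: %w[published],
          per_page: 1000,
        )["results"]

        BulkData::Cache.write(CACHE_KEY, results)
      rescue GdsApi::BaseError => error
        raise BulkData::RemoteDataUnavailableError, error.message
      end

      private

      def all
        data = BulkData::Cache.read(CACHE_KEY)
        raise BulkData::LocalDataUnavailableError unless data

        data.map { |hash| Government.new(hash) }
      end
    end
    ```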
    kevindew committed Dec 16, 2019 (77a50c2)
  2. Set up spec helpers for bulk data

    This clears the BulkData::Cache after every spec and adds helpers to
    populate the cache before tests.

    Default governments are set up for the most common case, where we just
    need to know about the current and previous governments.

    It also provides helper methods for the current and past governments, as
    these are likely to be needed quite frequently.
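
    A hedged sketch of the kind of helpers this could add (the helper names
    and the factory traits are illustrative, not the exact code):

    ```
    module GovernmentHelpers
      def populate_government_bulk_data(*governments)
        BulkData::Cache.write(
          BulkData::GovernmentRepository::CACHE_KEY,
          governments.map(&:to_h),
        )
      end

      def populate_default_government_bulk_data
        populate_government_bulk_data(past_government, current_government)
      end

      def current_government
        @current_government ||= build(:government)
      end

      def past_government
        @past_government ||= build(:government, :past)
      end
    end

    RSpec.configure do |config|
      config.include GovernmentHelpers
      config.after { BulkData::Cache.clear }
    end
    ```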
    kevindew committed Dec 16, 2019 (2e55195)
  3. Transition edition to use BulkData::GovernmentRepository

    This creates instances of the GovernmentRepository class whenever we
    need to load governments in the context of an edition. The factory has
    been updated to reflect that governments can't be determined without an
    external dependency and thus governments are always injected.
    
    This does raise some potential performance concerns about the number of
    objects initialised if we are frequently looking up government
    information. Mostly though I expect these look-ups not to occur multiple
    times in the same request, so I don't think it'll cause any actual
    performance issues.
    kevindew committed Dec 16, 2019 (10bd447)
  4. 5a5fc1a
  5. 5aff300
  6. Minor refactoring of document request spec

    This removes unnecessary instance variables and reduces long lines; it
    also removes an unnecessary test expectation.
    kevindew committed Dec 16, 2019 (3bac691)
  7. Fix incorrect I18n call

    By using I18n.t we wouldn't raise errors when a translation was missing.
    We don't actually need to call I18n directly, as Rails has a t view
    helper which roughly corresponds to I18n.t!
    kevindew committed Dec 16, 2019 (49240a9)
  8. dbe6b81
  9. Switch government name to title

    This field is referred to as a title in the Publishing API
    kevindew committed Dec 16, 2019 (59e57e4)
  10. Stub governments for tests

    This sets up some default governments in advance of tests so that the
    government data is populated. These are set up before all feature and
    request tests.
    kevindew committed Dec 16, 2019 (7ede62a)
  11. Re-populate governments when working with future dates

    By jumping more than 24 hours into the future the cache entries for
    governments expire, thus the government data needs to be re-stubbed.
    kevindew committed Dec 16, 2019 (5cc6f84)
  12. Adjust request/feature tests for change in government trait

    As we no longer have past_government and current_government traits, this
    injects these objects in instead.
    kevindew committed Dec 16, 2019 (06d58ae)
  13. Helper method to run ActiveJobs exclusively

    This cuts back on the boilerplate needed to run a worker job on a single
    box and deny other boxes that try to run it at the same time. This is a
    situation that occurs with the Sidekiq scheduler, where every box with a
    Sidekiq process tries to run the job at the same time and the first one
    to start it wins.

    We counter this with a PostgreSQL advisory lock, which prevents any other
    client from taking the same lock while it is held (they gracefully exit).
    Ideally we want to take this lock inside a transaction so that if
    anything goes wrong the lock is automatically released, whereas if we
    take it outside a transaction and something crashes we may need to
    release the lock manually.
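
    A minimal sketch of such a helper, assuming a hypothetical
    run_exclusively method on the base job class and an arbitrary lock id:

    ```
    class ApplicationJob < ActiveJob::Base
      # Any fixed bigint works; every job contends for the same lock.
      EXCLUSIVE_LOCK_ID = 1

      def run_exclusively
        ApplicationRecord.transaction do
          # pg_try_advisory_xact_lock returns true only for the first session
          # to claim the lock, and the lock is released automatically when
          # the transaction ends.
          acquired = ApplicationRecord.connection.select_value(
            "SELECT pg_try_advisory_xact_lock(#{EXCLUSIVE_LOCK_ID})",
          )

          yield if acquired
        end
      end
    end
    ```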
    kevindew committed Dec 16, 2019 (edcf545)
  14. Default to fake sidekiq testing

    It was more common for us to use Sidekiq::Testing.fake! than to rely on
    Sidekiq::Testing.inline!, so switching the default saves us code. This is
    also the preferred approach since it keeps us more honest: we have to
    explicitly tell Sidekiq to perform actions rather than expecting them to
    just happen synchronously.
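
    For illustration, using Sidekiq's generic example worker rather than one
    of our own jobs, fake mode queues jobs in memory and a test drains them
    explicitly:

    ```
    # With fake mode as the default, perform_async only enqueues the job.
    HardWorker.perform_async
    expect(HardWorker.jobs.size).to eq(1)

    # When a test actually needs the side effects, run the queued jobs.
    HardWorker.drain
    ```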
    kevindew committed Dec 16, 2019 (c59ee81)
  15. 070b673
  16. Add document live trait that sets first_published_at

    The first_published_at value should always be set on documents that have
    a live edition.
    
    I didn't know a nice way to specify an optional trait via the association
    method, so I have ended up with this array monstrosity. If you know a
    better technique, feel free to change it.
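
    A hedged guess at what that looks like: FactoryBot accepts an array of
    factory name plus traits for an association (which factory hosts the
    association is an assumption):

    ```
    FactoryBot.define do
      factory :document do
        trait :live do
          first_published_at { Time.zone.now }
        end
      end

      factory :edition do
        trait :live do
          # The "array monstrosity": factory name plus trait as an array.
          association :document, factory: %i[document live]
        end
      end
    end
    ```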
    kevindew committed Dec 16, 2019 (a7fd0d4)
  17. 4f42b1b
  18. Minor refactoring of ResyncService specs

    This moves the it block outside of a context to sit before the contexts,
    as per the more common convention. It also removes the government and
    political tests from the draft and live contexts, since they aren't
    effects of those particular contexts.

    Finally it makes the draft and live tests more consistent with the other
    edition state tests by creating an edition, rather than a document, with
    FactoryBot.
    kevindew committed Dec 16, 2019 (3d3245c)
  19. Add job to populate bulk data asynchronously

    This sets up a job that runs every 15 minutes to update the bulk data.
    If accessing the data fails, the job is retried.

    The schedule is set with an interval of 15 minutes and a delay of 0s,
    which makes the job start immediately whenever the worker process is
    restarted. As this job runs on a schedule it will execute on all machines
    at the same time; to counter this, the job is run exclusively using the
    method on the parent class.
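
    A sketch of such a job, reusing the exclusive-run helper from the earlier
    commit; the job name, the retry policy and the 5 minute threshold are
    assumptions based on the descriptions in this PR:

    ```
    class PopulateBulkDataJob < ApplicationJob
      # Retry when the remote data can't be accessed rather than failing.
      retry_on BulkData::RemoteDataUnavailableError, wait: :exponentially_longer

      def perform
        run_exclusively do
          # Skip the refresh if the cached data is less than 5 minutes old.
          BulkData::GovernmentRepository.new.populate_cache(older_than: 5.minutes.ago)
        end
      end
    end
    ```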
    kevindew committed Dec 16, 2019 (69f4964)
  20. Asynchronously load bulk data if it is not available

    This will schedule a background job to load the data if, for some reason,
    it's not available in the current environment and a service has tried to
    use it.

    There is a risk that this could cause a spike in identical queued jobs if
    a bunch of users hit this at the same time. That shouldn't present much
    of a problem, as the job only executes one at a time and won't do
    anything if the data was updated in the last 5 minutes.
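
    A hedged sketch of the fallback, shown here as a revision to the
    repository's cache read (where this logic actually lives is an
    assumption):

    ```
    def all
      data = BulkData::Cache.read(CACHE_KEY)

      unless data
        # Queue a refresh and still raise so the caller can handle the miss.
        PopulateBulkDataJob.perform_later
        raise BulkData::LocalDataUnavailableError
      end

      data.map { |hash| Government.new(hash) }
    end
    ```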
    kevindew committed Dec 16, 2019 (dfb1eb2)
  21. Add an error view for BulkData::LocalDataUnavailableError

    If the application is ever in a state where local data is unavailable, it
    will return a 503 response and a simple error message. As I anticipate
    that developers may often be the people who see this message (at least
    while Sidekiq isn't run by default with the app) I thought it was useful
    to have explicit instructions in a development environment.

    It was pretty hard to think of the best way to test this, so I went with
    this rather contrived example, since documents_show is one of the main
    routes. I expect that when the BulkData technique is used as part of
    index page filters we can replace this test with a less contrived one.
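
    Illustratively, the error handling could look something like this (the
    controller and view path are assumptions):

    ```
    class ApplicationController < ActionController::Base
      rescue_from BulkData::LocalDataUnavailableError do
        # Render the new error view with a 503 Service Unavailable status.
        render "errors/local_data_unavailable", status: :service_unavailable
      end
    end
    ```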
    kevindew committed Dec 16, 2019 (b177dc2)
  22. Remove hardcoded government config

    This is no longer needed given we are now using the Publishing API
    kevindew committed Dec 16, 2019 (d2ad3e7)
  23. Use specific government in history mode spec

    This spec relies on a government existing on the date we specify, which
    caused this test to fail. By creating the government in the test we
    resolve this complication.
    kevindew committed Dec 16, 2019 (be7cb1f)
  24. Use local cache Rack middleware with BulkData::Cache

    This sets up a local cache that wraps around the Redis cache layer of
    BulkData::Cache for the duration of a request.

    In order for this to work in a development environment we need the
    BulkData::Cache class not to be reloaded automatically. I achieved this
    with direct require calls. I first tried using the Rails
    autoload_once_paths option but didn't have much luck.

    As an example, before the explicit require calls:
    
    ```
    irb(main):002:0> BulkData::Cache.object_id
    => 70342543814920
    irb(main):003:0> reload!
    Reloading...
    => true
    irb(main):004:0> BulkData::Cache.object_id
    => 70342595490220
    ```
    
    and after:
    
    ```
    irb(main):002:0> BulkData::Cache.object_id
    => 70249028613320
    irb(main):004:0> reload!
    Reloading...
    => true
    irb(main):005:0> BulkData::Cache.object_id
    => 70249028613320
    ```
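
    A sketch of the wiring, assuming BulkData::Cache exposes its underlying
    RedisCacheStore; the require path and application module name are
    guesses:

    ```
    # config/application.rb
    require_relative "../app/models/bulk_data/cache"

    module ContentPublisher
      class Application < Rails::Application
        # Strategy::LocalCache#middleware memoises cache reads in memory for
        # the duration of each request, so repeated look-ups within a request
        # don't hit Redis again.
        config.middleware.use BulkData::Cache.instance.store.middleware
      end
    end
    ```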
    kevindew committed Dec 16, 2019 (8399dd5)
  25. Configure Redis for resiliency and reporting

    By default the RedisCacheStore takes a fault-tolerant approach: it
    returns nil if it can't access Redis and writes to a logger that we
    haven't configured [1].

    This isn't an ideal configuration for our use case, as it's a critical
    problem if the cache isn't available. To try to resolve this we allow 3
    reconnect attempts if we don't manage to connect to Redis or have lost
    our connection.

    In the event that we have an error reading from the cache, this will
    report the error to Sentry, which will hopefully make problems easier to
    diagnose.
    
    [1]: https://github.com/rails/rails/blob/09a2979f75c51afb797dd60261a8930f84144af8/activesupport/lib/active_support/cache/redis_cache_store.rb#L472-L485
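
    A sketch of the configuration described here; the use of GovukError for
    the Sentry reporting and the extra context are assumptions:

    ```
    ActiveSupport::Cache::RedisCacheStore.new(
      namespace: "bulk-data",
      # Retry the connection a few times before giving up.
      reconnect_attempts: 3,
      # Report cache errors to Sentry instead of silently logging them.
      error_handler: lambda do |method:, returning:, exception:|
        GovukError.notify(exception, extra: { method: method, returning: returning })
      end,
    )
    ```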
    kevindew committed Dec 16, 2019 (fa27e4c)
  26. Don't report BulkData::RemoteDataUnavailableError on server errors

    There's not a whole lot we can do when a dependent service is erroring,
    so let's not report it and wait to see if it causes problems in the app.

    This resolves an issue where we started getting server errors overnight
    in integration at the point of the data sync.
    kevindew committed Dec 16, 2019 (d395aa5)