-
**Option: Use dependabot for updating base image dependency**
We previously had dependabot configured to automatically detect updates for our base docker image. However, all dependabot PRs failed the linter because they did not update the changelog, so this didn't save us any effort. We could potentially save ourselves some time by re-enabling this and loosening the changelog rules.
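
For reference, re-enabling this would mean restoring something roughly like the config below in .github/dependabot.yml. This is a sketch only: the directory path, schedule and PR limit are illustrative, not necessarily what we had configured before.

```yaml
version: 2
updates:
  # Watch the FROM lines in the lambda Dockerfiles and open a PR
  # whenever a newer tag of the base image is published.
  - package-ecosystem: "docker"
    directory: "/containers" # hypothetical path to the Dockerfiles
    schedule:
      interval: "weekly"
    open-pull-requests-limit: 5
```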
-
**Option: avoid pinning the base image version in Dockerfiles**
@murdo-moj and @PriyaBasker23 suggested we could avoid pinning our Dockerfiles to a specific version of the base image. If we do this without any additional changes, the data-platform repo's container workflow wouldn't automatically trigger when we update the base image. We would still need to bump the version number of each container lambda (although we could write a script to bump all versions at once).
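
To make the trade-off concrete, each lambda's Dockerfile would switch its FROM line roughly as below. The registry, account, region and tag values here are placeholders, not our real ones.

```dockerfile
# Pinned (current approach): every base image release means editing this line
# in every lambda Dockerfile.
# FROM <account>.dkr.ecr.<region>.amazonaws.com/daap-python-base:6.1.0

# Unpinned alternative: each build picks up whatever the latest base image is,
# but nothing automatically triggers a rebuild when the base image changes.
FROM <account>.dkr.ecr.<region>.amazonaws.com/daap-python-base:latest
```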
-
**Option: avoid pinning lambda versions in terraform vars**
Currently, we pin specific versions of the lambdas in a terraform file, application_variables.auto.tfvars.json - this is the reason we need to raise PRs against the modernisation-platform-environments repo in addition to the data-platform repo. If we removed this version file, we could skip the step of updating this version for each change, and the terraform would always deploy the latest images. We would need to be able to trigger the workflow without raising a PR, because we need to be able to update images even when no terraform code has changed. A drawback of this change is that we could no longer deploy old versions of the images to specific environments. Instead, if we needed to roll back a change, we would have to revert the PR that introduced it, thus creating a new image that can be deployed.
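
For anyone unfamiliar with the file, application_variables.auto.tfvars.json holds per-lambda version pins along these lines; the variable names and version numbers below are invented for illustration. Removing the file (or these entries) is what would let terraform always deploy the latest images.

```json
{
  "data_product_create_lambda_version": "1.4.0",
  "data_product_presigned_url_lambda_version": "2.0.1",
  "data_product_athena_load_lambda_version": "3.2.0"
}
```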
-
**Option: separate out "application" deployments from "infrastructure" deployments**
Currently all deployments go via the modernisation-platform-environments repo. The workflows in the data-platform repo only go as far as pushing images to ECR. Could we extend this to push images to lambda as well? I.e. once the PR is raised/approved/merged, the deployment requests are raised from the data-platform repo, without us having to go to the modernisation-platform-environments repo as well.
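
As a sketch of what this could look like, a post-merge job in the data-platform repo could point the lambda at the newly pushed image directly. Everything named below is a placeholder (deploy role, region, function name, image variable), and it assumes the target account would allow us to assume a suitable deploy role via OIDC.

```yaml
name: Deploy lambda image (sketch)

on:
  push:
    branches: [main]

jobs:
  deploy-lambda:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # OIDC auth to AWS instead of long-lived keys
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.DEPLOY_ROLE_ARN }} # hypothetical deploy role
          aws-region: eu-west-2
      - name: Point the lambda at the newly pushed image
        env:
          IMAGE_URI: ${{ vars.ECR_REPOSITORY }}:${{ github.sha }} # placeholder; would come from the image build job
        run: |
          aws lambda update-function-code \
            --function-name data_product_create \
            --image-uri "$IMAGE_URI"
```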
-
**Implications for unit testing**
I recently added an additional workflow to data-platform to run unit tests based on what's changed. This has some quirks, though. Because dependency management of the lambdas is done at the container level, and we (currently) run tests outside of docker, I made the decision to always run the tests against the latest version of the shared libraries (i.e. whatever is in the repo, not what is published as an image). This does the wrong thing if the dependency on the base image is not kept up to date, but becomes a non-issue if we pin to latest.
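
For context, the change detection is roughly of the shape below; the directory names are illustrative, not the real repo layout. The key point is the install step: the shared library comes straight from the repo checkout rather than from a published image, which is why the tests always see the latest version of it.

```yaml
on:
  pull_request:
    paths:
      - "containers/my-lambda/**" # hypothetical lambda directory
      - "python-libraries/daap-python-base/**" # hypothetical location of the shared library

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install the shared library from the repo checkout, not the published image,
      # so tests run against whatever is on the branch.
      - run: pip install ./python-libraries/daap-python-base -r containers/my-lambda/requirements.txt
      - run: pytest containers/my-lambda/tests
```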
-
**Update**
I think we have 4 complementary proposals here that we could implement some or all of.

**Next steps**
The DPL subteam will discuss the above on the 10th October, and if we agree we will create tickets for us to implement the above suggestions.
-
**Outcome following yesterday's meeting**
We had a team meeting yesterday to go through the proposals in this discussion thread and agree the way forward. Attending were @jemnery @LavMatt @mitchdawson1982 @PriyaBasker23 @tom-webber. Since there are a whole bunch of issues raised here, we will likely need to make multiple changes. So I reckon we can close off this discussion, with the action being that we will write up spike tickets for the things we want to investigate further during the alpha. We all agreed that we want to make changes to the deployment, and nobody wants to keep the current process 🚮

**Meeting notes**
**Agreed spikes**
We agreed to put the following spikes into the backlog, and prioritise the first two (will link in issues shortly):
-
**Context**
We have a growing amount of code that supports the ingestion of data products. This includes daap-python-base, which is used by all of our lambda containers.

**Problem**
Currently, when we modify any of this code, we have to go through a multi-step process:
Each of the first 3 stages adds delay to completing a user story, since each PR requires a review step, even when the PR contains nothing more than a version number bump and a changelog update.
In practice, every time we update the shared library, we find we need to update multiple lambdas, which means either going through this process multiple times, or combining version updates for multiple lambdas into one PR.
Particularly while we are in alpha, we benefit from moving fast and experimenting, so we would like to remove as much friction from the deployment and review process as possible while being confident we are deploying something that works.
There are a few suggestions I will raise in comments, but the general theme of this discussion is: How might we streamline this process to reduce busywork and unhelpful blockers?