georgethebeatle edited this page Jun 10, 2021 · 5 revisions

Concourse

For our CI pipeline we've chosen Concourse. In their own words, Concourse is a "pipeline-based continuous thing-doer". If you're unfamiliar with it, it lets you define pipelines in yaml, composed of resources and jobs that act on those resources. Resources can be anything from git repositories to docker images and s3 buckets. Jobs are a series of steps, such as getting or uploading a resource or running a shell script. Every step runs inside a container, on either linux or windows workers. In fact, Concourse uses Garden to run these containers. For more information, have a look at the docs and the examples pages.
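To make these concepts concrete, here is a minimal, illustrative pipeline sketch (the job name, trigger setting, and task file path are made up, not taken from our actual pipeline):

```yaml
# A minimal pipeline: one git resource and one job that acts on it.
resources:
- name: garden-repo
  type: git
  source:
    uri: https://github.com/cloudfoundry/garden.git
    branch: develop

jobs:
- name: run-tests
  plan:
  - get: garden-repo
    trigger: true                                # re-run whenever the repo changes
  - task: unit-tests
    file: garden-repo/ci/tasks/unit-tests.yml    # hypothetical task file
```

You would install such a pipeline with something like `fly -t garden-ci set-pipeline -p example -c pipeline.yml`.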

On the Garden team, we manage our own Concourse deployment using BOSH, and you can look at the dashboard over at garden.ci.cf-app.com.

Show me some yaml

As mentioned, Concourse uses yaml to define a pipeline, and our yaml definitions can be found in the garden-ci GitHub repository. It contains the pipeline definition as well as all the BOSH deployments of Garden that the pipeline runs tests against. The yaml is under the ci directory and has the following structure:

ci/
├── pipelines/ -> definitions for pipelines
│   ├── common -> resources or jobs used in more than one pipeline
│   ├── garden-godoc -> definitions for the garden-godoc pipeline
│   └── main -> definitions for the main pipeline
├── scripts/ -> scripts used by tasks
├── tasks/ -> task yaml definitions
└── vars/ -> yaml files containing [variables](https://concourse-ci.org/vars.html) used when reconfiguring the pipeline

Additionally, the structure of a pipeline directory (like ci/pipelines/main) can be found here
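The files under vars/ are plain yaml maps whose keys are interpolated into the pipeline via Concourse's `((var))` syntax. A hypothetical sketch (file name and values are made up):

```yaml
# vars/example.yml -- hypothetical variables file
deployment-name: garden-test
worker-count: 3
```

A pipeline definition can then reference `((deployment-name))`, and the value is substituted when the vars file is passed to `fly set-pipeline` with `-l` (`--load-vars-from`).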

The tasks and scripts directories do not contain all of the task definitions that the pipelines use; other repositories keep task and script definitions in a ci directory at their root. For example, the guardian job, which runs the tests for guardian, has its task definition and script in garden-runc-release.

The Concourse CLI

If you want to make changes to any of those task definitions and test them locally before pushing them and running them in the web UI, you can use fly, the Concourse CLI. You can download fly from the Concourse homepage or from their releases page on GitHub. Here is how to run the aforementioned guardian task:

```
fly -t garden-ci execute --inputs-from main/garden --image garden-ci-image -p -c "$HOME/workspace/garden-runc-release/ci/tasks/guardian.yml" -i gr-release-develop="$HOME/workspace/garden-runc-release"
```

You have to provide all inputs declared in the task yaml, some of which might be outputs of other tasks. For more information, check out the fly help page.
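For reference, a task definition declares its platform, image, and inputs, and fly execute will refuse to run until every declared input is mapped. A hypothetical task yaml in roughly the shape of guardian.yml (paths and resource type are assumptions):

```yaml
# Hypothetical task definition -- the run path is illustrative.
platform: linux

image_resource:
  type: registry-image
  source:
    repository: cfgarden/garden-ci       # the team's CI image

inputs:
- name: gr-release-develop               # satisfied via -i or --inputs-from

run:
  path: gr-release-develop/ci/scripts/guardian
```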

The scripts/remote-fly script in garden-runc-release makes this even easier:

```
"$HOME/workspace/garden-runc-release/scripts/remote-fly" "$HOME/workspace/garden-runc-release/ci/tasks/guardian.yml"
```

The garden-ci docker image

All Concourse jobs in Garden CI run on top of the cfgarden/garden-ci docker image. This image contains all the dependencies of our tests and the scripts used to run them. If you need to add something to that image, do the following:

  1. Modify the garden-ci Dockerfile and build a new version of the image by running make garden-ci
  2. Tag the image with the desired new version (check the latest version here)
  3. Push the new version of the image
  4. Change the version of the concourse resource and reconfigure the pipeline
  5. Run garden-runc-release/scripts/test -a to make sure nothing is broken (repeat the steps above if it is)
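Step 4 above amounts to bumping the tag on the image resource in the pipeline yaml. A sketch of what that looks like (the resource type and version number are assumptions):

```yaml
resources:
- name: garden-ci-image
  type: registry-image       # may be docker-image in older pipelines
  source:
    repository: cfgarden/garden-ci
    tag: "1.23.0"            # bump to the version you just pushed
```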

Applying your changes

To apply your changes to the pipeline, run the scripts/reconfigure-pipeline script in the garden-ci repository. Note that you need LastPass access to the garden shared vault to do this. Also keep in mind that in order to change tasks or scripts you need to push to the repository containing them.

Upgrading

To upgrade Concourse, check out the desired version tag in the concourse-bosh-deployment submodule, then run the deploy script against the BOSH deployment in eden. If you run into problems, check out this wiki page for some troubleshooting suggestions.

The main pipeline

The main pipeline has several groups (tabs) which do different things. They are:

  • gating - this is the part of the pipeline that gets triggered after a push to garden-runc-release, so make sure everything goes green after committing a new change. It runs all the tests for individual components like guardian and garden. If those pass, it creates a release candidate and deploys that version with different features enabled. Go to this page for a complete list.

  • non-gating - a collection of jobs that do various things, most notably running benchmarks against a CF deployment (how long a cf push and a cf scale take) and uploading new stemcell releases.

  • periodics - the same tests that run against the different garden environments in the gating group. They run every 40 minutes and are there to catch flakes.

  • release - running the garden-runc-shipit will create a new release on GitHub, upload the new bosh release to bosh.io and advance the master branch. Similarly, running cpu-entitlement-plugin-shipit will release the cpu-entitlement-plugin.

  • dependachore - deploys dependachore, a Google Cloud Function that moves the dependabot PR stories generated in the Garden tracker to their own section in the icebox and converts them to chores.

  • groot - runs groot tests for both linux and windows (also has a periodic). It is separate since it is not a part of garden-runc-release

  • cpu-entitlements-plugin - runs tests for the cpu-entitlement-plugin
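The tabs above correspond to the pipeline's groups section, where each group simply lists the jobs it shows. A sketch with partly made-up job names:

```yaml
groups:
- name: gating
  jobs: [guardian, garden, create-release-candidate]   # illustrative job names
- name: periodics
  jobs: [periodic-guardian, periodic-garden]
- name: release
  jobs: [garden-runc-shipit, cpu-entitlement-plugin-shipit]
```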

There is only one other pipeline: garden-godoc. Since Garden is a client meant as the main way for users to interact with us, its godocs need to be refreshed whenever we make a change.

Releasing Garden

As already mentioned above, garden-runc-release is released by running the garden-runc-shipit job. By default this results in a new patch version release. If you want to bump the major or minor version, run garden-runc-bump-major-version or garden-runc-bump-minor-version accordingly. The version is a semver concourse resource backed by a file in an s3 bucket.

When you hit shipit the following steps take place:

  1. A new bosh release is created from the latest release candidate
  2. The bosh release tarball is uploaded to an s3 bucket (so that it is available on https://bosh.io/releases/github.com/cloudfoundry/garden-runc-release)
  3. A new github draft release is created with the all-in-one garden binary attached
  4. The new bosh release yml is committed and pushed to master
  5. The garden-runc-merge-master job merges the release commit back into develop

If everything is successful, you need to add some release notes to the draft github release and make it public. If something fails, I have bad news for you. You have to determine what part of the release process succeeded and what failed and decide what to do - either retrigger and hope nothing corrupt was pushed to s3/github or sacrifice a patch version and try again.

Flakes

Given the number of tests running in our pipeline, it often happens that a couple of tests fail sporadically. We call this a snowflake, or just flake for short. Sometimes these flakes indicate problems in the code, but more often they are test problems or simply instabilities of the environments. We tend to ignore problems that occur just once or twice or are related to external factors (e.g. docker hub is down), but if a problem persists we look into it in order to make our code and tests more stable.

But how do you know if a test failure is a one-off or a regular flake? We have built a tool to search for particular test failures in the Concourse history. It is called flake-hunter. Here is how you use it:

```
concourse-flake-hunter -c https://garden.ci.cf-app.com/ -n main search <regexp>
```

The tool will keep listing matching build failures until you Ctrl+c or the history is exhausted. This way you can determine how a flake behaves over time.

For more information on our experiences with flakes, check out this blog post by one of Garden's former team members.

Hanging tests

Another nasty type of test failure is when a test hangs indefinitely. This is hard to spot, since it looks the same as any other running job, but it blocks the pipeline. To spot these problems we run all garden-runc-release tests with a tool called slowmobius. Slowmobius watches the tests and, if they take too long, fails the job and posts to slack. The timeout as well as the slowmobius slack icon can be configured here. The pipeline needs to be reloaded for the changes to take effect.