Mario Nitchev edited this page Jun 7, 2021 · 2 revisions

Don't panic, it's fine

Don't panic!

This document provides tips on troubleshooting Garden in the unlikely event that it does not work as expected.

dontpanic

Your best friend in troubleshooting Garden is the dontpanic report. dontpanic is an in-house Garden tool that gathers various diagnostic information and packs it into an archive. The reporter is expected to run the tool and share the report, together with a reasonable description of the problem, with the Garden developers for further investigation.

Show me the code

https://github.com/cloudfoundry/dontpanic

What's in it for me?

Lots of useful stuff, mostly the output of various commands executed on the diego cell. Refer to the dontpanic README for more information, or just look at the code to see the commands it runs.

How to get it?

dontpanic ships as a package within the garden-runc-release, which means that the binary is available on every diego cell running Garden 1.17.1+. In the unlikely event that you are troubleshooting an older Garden version, you can download the binary from the dontpanic release page.
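Whether a given cell should already carry the binary can be decided with a quick version comparison. The sketch below is only an illustration: garden_has_dontpanic is a hypothetical helper name, not part of any Garden tooling, and it assumes GNU sort with the -V (version sort) option is available.

```shell
# Hypothetical helper: succeeds when the given Garden version is 1.17.1 or
# newer, i.e. when dontpanic should ship as a package on the cell.
# Assumes GNU sort, which provides -V (version sort).
garden_has_dontpanic() {
  local version="$1" minimum="1.17.1"
  # The smaller of the two versions sorts first; if that is the minimum,
  # the given version is at least the minimum.
  [ "$(printf '%s\n%s\n' "$minimum" "$version" | sort -V | head -n 1)" = "$minimum" ]
}
```

For example, garden_has_dontpanic 1.19.0 succeeds, while garden_has_dontpanic 1.16.9 fails and points you to the release page download instead.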

NOTE: The latest dontpanic release may be outdated, so consider releasing a new version. There is no CI pipeline for that; you would have to do that manually.

How does it work?

  1. The reporter logs onto the diego cell via ssh (see how to ssh onto a diego cell below).
  2. The reporter switches to root by running sudo su -
  3. The reporter runs /var/vcap/packages/dontpanic/bin/dontpanic and relaxes until dontpanic is done. Once it is done, it prints the location of the produced report (it is produced in the /var/vcap/data/tmp directory).
  4. The reporter shares the report with the Garden team. The reporter might want to copy the report to their machine via bosh scp in order to make sharing easier.
  5. The Garden team looks at the report, figures out what is wrong and saves the day.
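If the location printed in step 3 has scrolled out of the terminal, the newest archive can be found again by listing the report directory. This is a minimal sketch: latest_report is a hypothetical helper, and the os-report-*.tar.gz pattern is an assumption based on the example file name used elsewhere on this page.

```shell
# Hypothetical helper: prints the newest dontpanic report archive in a directory.
# The os-report-*.tar.gz naming pattern is an assumption, not a documented contract.
latest_report() {
  local dir="${1:-/var/vcap/data/tmp}"
  # ls -t lists the newest file first; errors are silenced when nothing matches.
  ls -t "$dir"/os-report-*.tar.gz 2>/dev/null | head -n 1
}
```

Called without arguments on the diego cell, it looks in /var/vcap/data/tmp and prints nothing when no report is present.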

Sometimes it is useful to see what Garden is up to at the moment the dontpanic report is taken. If so, the reporter can supply the --sigquit flag to the dontpanic binary. This sends SIGQUIT to the Guardian server (gdn) process, which makes gdn dump its goroutine stacks into the Garden error log.

WARNING The --sigquit flag would terminate the Guardian server!
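To confirm that a goroutine dump actually landed in the error log, it can help to count the stack headers in it. The Go runtime starts each stack in a dump with a line such as "goroutine 42 [running]:". The helper below is a hypothetical sketch, and the log file to pass to it is something you need to look up in the Garden log directory.

```shell
# Hypothetical helper: counts goroutine stack headers in a log file.
# Each stack in a Go runtime dump begins with "goroutine <id> [<state>]:".
count_goroutines() {
  grep -c '^goroutine [0-9][0-9]* \[' "$1"
}
```

A result of 0 suggests that the dump went to a different file or that gdn never received the signal.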

Note1 dontpanic collects Garden logs, monit logs, kernel logs and various data from the /proc filesystem. We do not expect this data to contain any secrets (such as passwords or certificates), but the reporter is advised to double-check that.

Note2 The report archive might be big, so sharing it might involve some sort of shared cloud storage.

How do I use it

Once you get a dontpanic report you should have all the troubleshooting details you need. If that is not the case, consider enhancing dontpanic.

Here is a sample algorithm you can follow:

  • Usually you would start with monit summary to figure out whether all jobs are running
  • If Garden is not running, have a look at its logs (in the garden directory in the report). If the gdn process is crashing, its error logs should contain clues. Also consider looking at the containerd logs.
  • Have a look at the configuration files (the config directory). Do the options there make sense?
  • Are there errors in the Garden logs?
  • Look at the process tree:
    • Is the gdn process (the Guardian server) alive? If not, look into the Garden error logs for clues.
    • Are there any dadoo or containerd-shim processes that have no children? Such processes could indicate that a container exited without the shim noticing.
  • Look at the running containers (garden-containers.log). Are there too many of them?
  • Look at disk usage (df.log) - could the cell be running out of disk space?
  • Depending on the issue you are looking at, look at the lsof, iptables, meminfo, etc. information.
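The disk usage check from the list above can be partly automated. The sketch below is a hypothetical helper that assumes df.log holds plain df output: a header line followed by rows whose fifth field is Use% and whose sixth field is the mount point. Adjust the field numbers if the report uses different df options.

```shell
# Hypothetical helper: prints mount points whose usage is at or above a limit.
# Assumes plain df output: header line first, Use% in field 5, mount point in field 6.
full_filesystems() {
  local file="$1" limit="${2:-90}"
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign
    if (use + 0 >= limit + 0) print $6, $5
  }' "$file"
}
```

For example, full_filesystems df.log 90 lists every file system that is at least 90% full, one per line, and prints nothing when the cell has plenty of space.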

Useful tips

How do I ssh onto a diego cell VM

  1. Once you have targeted the deployment with bosh, you need to figure out the VM instance ID of the diego cell you want to connect to. In order to list all of the deployment's VMs, just run bosh -d <dep name> vms.
  2. Pick the diego cell VM you want to connect to and run bosh -d <dep name> ssh <diego cell VM ID>, for example bosh -d cf ssh diego-cell/d3e1b55a-078d-4d9d-9c0a-76306e891dea. If working on a deployment with a single cell, bosh -d <dep name> ssh diego-cell would also do the trick.
  3. Once connected to the diego cell it is useful to become root via sudo su -
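On a deployment with many VMs it can be handy to filter the bosh vms listing down to the diego cells. The helper below is a hypothetical sketch: it reads the listing on stdin and assumes that the instance name (group/ID) is the first column of each row and that the instance group is called diego-cell, as in the example above.

```shell
# Hypothetical helper: reads `bosh -d <dep name> vms` output on stdin and
# prints the instance names of the diego cells.
# Assumes the instance name (group/ID) is the first column of each row.
diego_cells() {
  awk '$1 ~ /^diego-cell\// { print $1 }'
}
```

Usage would look like bosh -d cf vms | diego_cells, and any printed name can then be passed to bosh -d cf ssh.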

How do I copy the dontpanic report to my local machine

Provided that you have ssh-ed onto a diego cell, run dontpanic and been told where the report archive is, run bosh scp on your local machine, for example:

bosh -d cf scp diego-cell/d3e1b55a-078d-4d9d-9c0a-76306e891dea:/var/vcap/data/tmp/os-report-ed0db125-3c54-43b4-8023-78d2ff53a39e-2021-06-01-09-29-20.084414748.tar.gz .
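Once the archive is on your machine, it can be unpacked with tar. The helper below is a small hypothetical convenience wrapper; the default destination directory name is an arbitrary choice for illustration.

```shell
# Hypothetical helper: unpacks a report archive into its own directory.
# The default destination name is arbitrary.
unpack_report() {
  local archive="$1" dest="${2:-./dontpanic-report}"
  # Create the destination first so that tar -C has somewhere to extract to.
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```

Extracting into a dedicated directory keeps reports from different cells apart when you are comparing several of them.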

Where are

  • Garden logs: /var/vcap/sys/log/garden
  • Garden depot: /var/vcap/data/garden/depot
  • Garden configuration files: /var/vcap/jobs/garden/config
  • Garden binaries: /var/vcap/packages/{dontpanic,garden-idmapper,greenskeeper,guardian,netplugin-shim,thresholder}
  • Garden job monit scripts: /var/vcap/jobs/garden/bin
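When a cell misbehaves, it is worth confirming that these locations exist at all. The sketch below is a hypothetical helper that reports, for each path given to it, whether the path is present on the machine.

```shell
# Hypothetical helper: reports which of the given paths exist.
check_paths() {
  local path
  for path in "$@"; do
    if [ -e "$path" ]; then
      printf 'ok       %s\n' "$path"
    else
      printf 'missing  %s\n' "$path"
    fi
  done
}
```

On a diego cell it could be called as check_paths /var/vcap/sys/log/garden /var/vcap/data/garden/depot /var/vcap/jobs/garden/config /var/vcap/jobs/garden/bin.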

Reproducing it locally

In order to troubleshoot an issue, it is best to reproduce it in a minimal local environment. In most cases it is sufficient to create a pure Garden deployment (without all the CF machinery) where you can call the Garden API with the proper input to reproduce the problem.

Create your sandbox environment

Check out the Creating-sandbox-environments-for-debugging wiki page.

Calling the Garden API

The most convenient way to call Garden is to use the unofficial Garden client, gaol. It provides a command line interface to create and delete containers, run processes, etc. (see its README for usage). In most cases the client should be sufficient as is. If it is not, you can always change it to your liking (e.g. make it create containers with a hardcoded memory limit), build it and use your own version instead.

Running gaol from your local machine is only possible with a local bosh-lite Garden deployment where Garden is configured to listen on an HTTP port. For more realistic deployments it is most convenient to copy the gaol binary to the diego cell and call it after ssh-ing. The bosh-inject-garden-tools.sh script automates that task.

Alternatively, you may want to create a Ginkgo test that calls the Garden API and reproduces the issue. This approach is great because you can push the test after fixing the issue to ensure that the bug never appears again. Existing GAT tests are a great starting point, as their fixtures set up a ready-to-use Garden client.