
Refactor Docker and Dev Container setup using Buildkit #4392

Draft
wants to merge 66 commits into base: main


Member

@ruffsl ruffsl commented Jun 3, 2024

This refactors the Docker-based CI and unifies it with a Dev Container workflow.

Instead of using cron jobs to periodically rebuild the base image used for CI, or installing any missing dependencies as initial steps in CircleCI, this migrates the CI pipeline to dynamically rebuild the base image on demand, leveraging the BuildKit cache backend to do so efficiently. This also unifies the Docker image build process for Dev Containers, making it simple to rebuild a development image locally, or to bootstrap one by pulling image layers remotely from GHCR as provided by CI. Lastly, this also provides additional room to build release Docker images, to quickly ship a minimal but pre-built nav2 workspace for select branches and open pull requests, streamlining end-user experimentation and testing.

Related:


To get started, simply clone this PR with git submodules and follow along with the included quick start guide:

git clone --recurse-submodules -j8 \
  --branch buildkit \
  git@github.com:ros-navigation/navigation2.git

@ruffsl ruffsl changed the title from "WIP | Refactor Docker and Dev Container setup using Buildkit optimizations" to "WIP | Refactor Docker and Dev Container setup using Buildkit" Jun 3, 2024
Contributor

mergify bot commented Jun 6, 2024

This pull request is in conflict. Could you fix it @ruffsl?

@tonynajjar
Contributor

@ruffsl just FYI tried to run it and got:

0.367 E: Unable to locate package ros-rolling-nav2-minimal-tb3-sim
0.367 E: Unable to locate package ros-rolling-nav2-minimal-tb4-sim

@ruffsl
Member Author

ruffsl commented Jun 13, 2024

@tonynajjar , yeah, looks like we have another un-released dependency back in our underlay.repos file:

@tonynajjar
Contributor

@ruffsl new error

2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-control-msgs/ros-rolling-control-msgs_5.1.0-1noble.20240429.102647_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-hardware-interface/ros-rolling-hardware-interface_4.11.0-1noble.20240514.082551_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-controller-interface/ros-rolling-controller-interface_4.11.0-1noble.20240514.083301_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-diff-drive-controller/ros-rolling-diff-drive-controller_4.8.0-1noble.20240514.114350_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-common-vendor/ros-rolling-gz-common-vendor_0.1.0-1noble.20240503.181130_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-msgs-vendor/ros-rolling-gz-msgs-vendor_0.1.0-1noble.20240503.181547_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-fuel-tools-vendor/ros-rolling-gz-fuel-tools-vendor_0.1.0-1noble.20240503.182511_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-rendering-vendor/ros-rolling-gz-rendering-vendor_0.1.0-1noble.20240507.212408_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-transport-vendor/ros-rolling-gz-transport-vendor_0.1.0-1noble.20240503.182514_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-gui-vendor/ros-rolling-gz-gui-vendor_0.1.0-1noble.20240507.214434_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-sdformat-vendor/ros-rolling-sdformat-vendor_0.1.0-1noble.20240503.181458_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-physics-vendor/ros-rolling-gz-physics-vendor_0.1.0-1noble.20240503.182124_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-sensors-vendor/ros-rolling-gz-sensors-vendor_0.1.0-1noble.20240507.214434_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-sim-vendor/ros-rolling-gz-sim-vendor_0.1.0-1noble.20240507.215704_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-joint-state-broadcaster/ros-rolling-joint-state-broadcaster_4.8.0-1noble.20240514.114403_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-bridge/ros-rolling-ros-gz-bridge_1.0.0-1noble.20240507.145005_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-image/ros-rolling-ros-gz-image_1.0.0-1noble.20240507.151109_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-sim/ros-rolling-ros-gz-sim_1.0.0-1noble.20240507.225051_amd64.deb  404  Not Found [IP: 140.211.166.134 80]

@ruffsl
Member Author

ruffsl commented Jun 16, 2024

@tonynajjar , are you partially rebuilding the image from a prior cache? At present, the Dockerfile only runs apt update once for the entire build.

apt-get update && echo v1

This speeds up all the apt install steps, allows later layers to be rebuilt offline if the local apt cache has already downloaded the debians, and ensures that all packages installed across the layers originate from the same sync. But if there are debian versions you haven't downloaded locally that no longer exist on the apt repo, then it's probably best to rebuild the apt-update layer so all the following layers are on the same sync.

If the ROS repos receive a new sync, then the apt lists baked into the earlier layers can become stale, pointing to package versions that the ROS repos have since purged, since, outside of the ROS snapshot repos, older packages are not yet archived.

So, we could either:

  • rebuild the dev container with --no-cache to ensure all packages are installed from the same sync
  • automate a cache mechanism, such as an ENV, to bust the apt update layer automatically
  • switch to using apt snapshots to ensure all packages originate from the same sync archive

While I see there are snapshots for ROS 2 Jazzy, there doesn't seem to be any for Rolling:

We could also pin the rolling image by image ID/sha to automate cache busting via dependabot, though that needs some more work to complete the upstream docker build automation:

I think I may just go with the ENV ROS_SYNC_DATE= approach in the meantime for the local Dockerfile.
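
As a minimal sketch of the ROS_SYNC_DATE idea (not the Dockerfile from this PR): assuming the Dockerfile declares a matching `ARG ROS_SYNC_DATE` ahead of its `apt-get update` layer, changing the value invalidates that layer and every layer after it. The `sync_date_token` helper and the daily granularity are hypothetical; the value could just as well be bumped by hand after each ROS sync.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical helper: print today's UTC date as a cache-busting token.
# A new value at most once per day limits how often the apt layer rebuilds.
sync_date_token() {
  date -u +%Y-%m-%d
}

# Dry run: print the build command instead of executing it.
# Assumes the Dockerfile has `ARG ROS_SYNC_DATE` before `apt-get update`.
echo docker build --build-arg "ROS_SYNC_DATE=$(sync_date_token)" .
```

Printing the command keeps the sketch safe to run on a host without Docker; dropping the leading `echo` would run the actual build.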

@tonynajjar
Contributor

tonynajjar commented Jun 16, 2024

I see, yes building without cache fixes it. On to the next error, basically all the nav2 packages are failing to build in the updateContentCommand because of this:

[2024-06-16T17:19:24.311Z] Failed   <<< nav2_velocity_smoother [0.00s, exited with code 1]
[2024-06-16T17:19:24.311Z] Starting >>> nav2_costmap_2d
[2024-06-16T17:19:24.312Z] --- stderr: nav2_costmap_2d
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/colcon_core/executor/__init__.py", line 91, in __call__
    rc = await self.task(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/colcon_core/task/__init__.py", line 93, in __call__
    return await task_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/colcon_cache/task/lock/dirhash.py", line 179, in lock
    assert lockfile.lock_type == ENTRY_TYPE
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
---

It seems to be because of colcon cache lock

@ruffsl
Member Author

ruffsl commented Jun 16, 2024 via email

@tonynajjar
Contributor

rebuild the dev container with --no-cache to ensure all packages are installed from the same sync

By this, do you mean "Rebuild Container Without Cache"? It still seems to build with the cache, maybe because you're using image instead of Dockerfile in devcontainer.json.

@tonynajjar
Contributor

Regarding the colcon cache, I cleaned out a bunch of things and it works now. I'll keep an eye out if it reproduces as part of a "normal workflow".

Can we somehow have the option to not rebuild the packages, to save time, since the image is built quite often? For me that's a big plus. I guess commenting out the updateContentCommand from the devcontainer would do it? I even think this should be the default. What do you think?

ruffsl added 19 commits July 9, 2024 21:02
friendly for multi user hosts envs
to make it simple to share files
such as ros bags and artifacts
to avoid being too intrusive to user home dir
and sort purposes for each mount
as this is normally done automatically via VSCode
however this feature is still exclusive to MS remote extension
and has not yet been upstreamed to the FOSS CLI
- devcontainers/cli#441 (comment)
to avoid halting dev container build
to force github actions to complete workflow
to create dev container from CI images
with pre-built artifacts
for self consistency with dev container config
and with CI images
to readily re-use CI debugger image
that build colcon workspace without it
so that colcon install can be standalone
allowing it to be set from global env as well
i.e. the installed ROS distro
such as re-using generated CI image
to start dev containers with pre-built workspaces
# MARK: Pull image - download image from CI and GHCR for local dev container
# REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:main-debugger
# docker pull $REFERENCE_IMAGE
# export DEV_FROM_STAGE=$REFERENCE_IMAGE
Member Author

To create a dev container using the CI images from this PR, including a pre-built colcon workspace, simply uncomment the lines above and change the following tagname to match the current branch before using dev container tooling to create the container.

-REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:main-debugger
+REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:buildkit-debugger
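
The tag change above can be sketched as a small helper that derives the image reference from a branch name. This is only an illustration: the `reference_image_for_branch` function is hypothetical, and it assumes the `<branch>-debugger` tag pattern shown in the two lines above holds for other branches.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical helper: build the CI image reference for a given branch,
# following the `<branch>-debugger` tag pattern from the snippet above.
reference_image_for_branch() {
  printf 'ghcr.io/ros-navigation/navigation2:%s-debugger\n' "$1"
}

REFERENCE_IMAGE="$(reference_image_for_branch buildkit)"
echo "$REFERENCE_IMAGE"

# With Docker available, the remaining steps from the commented lines are:
#   docker pull "$REFERENCE_IMAGE"
#   export DEV_FROM_STAGE="$REFERENCE_IMAGE"
```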

Member Author

For more information on getting started, see the included .devcontainer/README.md.

@ruffsl ruffsl mentioned this pull request Jul 10, 2024
@ruffsl ruffsl changed the title from "WIP | Refactor Docker and Dev Container setup using Buildkit" to "Refactor Docker and Dev Container setup using Buildkit" Jul 10, 2024
@ruffsl ruffsl marked this pull request as ready for review July 10, 2024 18:44
@SteveMacenski
Member

@ruffsl do you want this reviewed? It's not green yet, so I'm not sure where this stands.

@ruffsl
Member Author

ruffsl commented Jul 11, 2024

Sure, feel free. It's mostly complete aside from cleaning out the old CI. Some colcon tests seem to be failing locally as well, so I'm not sure if that's the CI or the tests themselves just yet. Were these failing on the old CI as well?

@SteveMacenski
Member

Were these failing on the old CI as well?

Our current CI is green https://app.circleci.com/pipelines/github/ros-navigation/navigation2/12221/workflows/23fc544d-eaee-4fdc-a8cc-340605485154/jobs/37129, so you should be able to build successfully, but you may need to rebase for that if you haven't updated the base recently.

Update CI Image / Rebuild CI Image (push) Failing after 12s

This is also failing: Error: buildx failed with: ERROR: failed to solve: failed to read dockerfile: open Dockerfile: no such file or directory

.devcontainer/README.md (outdated)
// "--device=/dev/dri", // enable Intel integrated graphics
// "--ulimit", "nofile=1024:4096", // increase file descriptor limit for valgrind
//
"--runtime=nvidia", // enable NVIDIA Container Toolkit
Member

What happens if no NV GPU exists?

Member Author

Then the user should comment out this option and use the appropriate one for their local hardware, like --device=/dev/dri for Intel integrated graphics. NVIDIA is just enabled by default as it's so common in robotics and AI development on Linux (my own bias). We could leave all hardware acceleration options commented out by default instead; that would be just a minor inconvenience for me.
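
For users unsure which option applies to their machine, a rough host-side check could look like the sketch below. The `suggest_gpu_run_arg` helper is hypothetical and not part of this PR, and finding `nvidia-smi` on the PATH is only a hint: it does not prove the NVIDIA Container Toolkit is installed.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical helper: suggest one runArgs entry for the local hardware.
# The result is a suggestion only; the NVIDIA Container Toolkit must
# still be installed for `--runtime=nvidia` to work.
suggest_gpu_run_arg() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "--runtime=nvidia"
  elif [ -e /dev/dri ]; then
    echo "--device=/dev/dri"
  else
    # No known GPU found: leave all acceleration options commented out.
    echo ""
  fi
}

echo "suggested runArg: $(suggest_gpu_run_arg)"
```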

Member

Remove the submodule and use a .repos file to maintain the underlay workspace as needed. This is still a very important feature to have, as we need to add new dependencies that aren't released yet and/or bug fixes that we can use in our CI.

Member Author

I'll admit that submodules are trickier to manage than a YAML file, but they are less opaque to git VCS. In this case, I was trying to avoid the need for any ROS specifics in the checkout job, but I'll swap to a containerized job using a ROS base image instead to simplify the bootstrapping, so we can stick with YAML files.

@@ -0,0 +1,3 @@
[submodule "nav2_minimal_turtlebot_simulation"]
Member

Remove as well

Member Author

👍

@@ -1,6 +1,6 @@
name: Lint
on:
-  pull_request:
+  # pull_request:
Member

Readd?

Member Author

Commented out temporarily while WIP to avoid wasting credits. Will revisit before merging.

account-type: org
org-name: ros-navigation
# keep-at-least: 0
# TODO: come up with a better way to filter out the PRs
Member

TODO

Member Author

Should be able to remove, think I have a better method for this now.

@@ -0,0 +1,151 @@
name: "Bake prod stages"
Member

Question: How much does baking prod/base vary? Is there a way they can share the majority of their code and use a input to flip the difference?

Member Author

The base stages don't have a workspace installed, while the prod stages do and are simply stacked on top of the base stage layers. The fancy thing about the releaser stage is that it builds from the runner stage, so you get a minimal prod image that only includes runtime dependencies. The debugger stage includes the same workspace layer but builds FROM the builder stage, which includes build dependencies, so you could rebuild/debug in production if needed, at the cost of a larger image.

@@ -0,0 +1,57 @@
name: "Cache Source Checkout"
description: "GitHub Action to cache checkout of source repos"
Member

Question: why is this necessary for CI? Why not check out each time?

Member Author

Some repo projects can take quite a while to clone, even with only a shallow clone. So it can be faster to cache the one-off checkout that triggered the workflow, so that each job in the workflow (which runs on a new/different runner VM) can skip re-checking out the same code. Also, because you may want to use a repos YAML file that pins the source code to a tag or branch, rather than a fixed commit SHA, it is possible for checkouts to differ over time and introduce race conditions, resulting in inconsistent sources for the duration of a given workflow.
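
The race comes from a branch or tag moving between jobs. As a rough illustration (not the caching action from this PR), a ref could be resolved to a fixed commit SHA once per workflow and reused by every job. The `sha_from_ls_remote` helper and the sample input line are hypothetical placeholders.

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical helper: read `git ls-remote` output ("<sha><TAB><ref>" per
# line) on stdin and print the SHA from the first line only.
sha_from_ls_remote() {
  cut -f1 | head -n 1
}

# Offline demo with a placeholder line in ls-remote format (not a real commit).
printf '0123abc\trefs/heads/main\n' | sha_from_ls_remote

# Against a real remote (needs network), this would be:
#   git ls-remote <repo-url> <branch-or-tag> | sha_from_ls_remote
```

Caching the checkout, as this PR does, sidesteps the problem without needing the pin, since every job restores the same source tree.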

@@ -0,0 +1,60 @@
# get-layer-metadata

GitHub Action to get layer metadata from Docker Buildx Bake output result.
Member

Can you provide a "why" explanation?

@@ -0,0 +1,106 @@
name: Build Prod Images
Member

Please explain differences in the build prod / colcon / integration workflows

Co-authored-by: Steve Macenski <[email protected]>
Signed-off-by: Ruffin <[email protected]>
@ruffsl ruffsl marked this pull request as draft December 19, 2024 19:39
3 participants