-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Docker and Dev Container setup using Buildkit #4392
base: main
Are you sure you want to change the base?
Conversation
This pull request is in conflict. Could you fix it @ruffsl? |
and refactor into dev container scripts later on
to persist dotfiles across container lifecycle such as bash history
@ruffsl just FYI tried to run it and got:
|
@tonynajjar , yeah, looks like we have another un-released dependency back in our underlay.repos file: |
@ruffsl new error
|
@tonynajjar , are you partially re-build the image from a prior cache? At present, the Dockerfile only apt updates once for the entire build. navigation2/.docker/Dockerfile Line 28 in 18bf8a4
This speeds up all the apt install steps, allows for later layers to be rebuilt offline if the local apt cache has already downloaded the debians, and ensures that all packages installed across the layers are originating from the same sync. But if there are debians versions you haven't downloaded locally, and not longer exist on the apt repo, then it's probably best rebuild the apt-update layer so all the following layers are on the same sync. If the ros repos receive a new sync, then the apt list that was baked in the earlier layers can become stale, pointing to package version that the ros repos have since purged, as besides the ros snapshot repos, older packages are not yet archived. So, we could either:
While I see there are snapshots for ROS 2 Jazzy, there doesn't seem to be any for Rolling: We could also pin the rolling image by image ID/sha to automate cache busting via dependabot, though that needs some more work to complete the upstream docker build automation: I think I may just go with the |
I see, yes building without cache fixes it. On to the next error, basically all the nav2 packages are failing to build in the updateContentCommand because of this:
It seems to be because of |
Looks like you may be trying to combine two different colcon cache lock
files, either derived from get revision control hashes, or der hash that
hashes the files directly. You could try deleting all the colcon cache lock
files in the colcon build base path, or just delete the workspace volume,
or rename it to make a new one, from the dev container config json.
…On Sun, Jun 16, 2024, 12:24 PM Tony Najjar ***@***.***> wrote:
I see, yes building without cache fixes it. On to the next error,
basically all the nav2 packages are failing to build because of this:
[2024-06-16T17:19:24.311Z] Failed <<< nav2_velocity_smoother [0.00s, exited with code 1]
[2024-06-16T17:19:24.311Z] �]0;colcon cache [12/39 done] [1 ongoing]��]0;colcon cache [13/39 done] [0 ongoing]�Starting >>> nav2_costmap_2d
[2024-06-16T17:19:24.312Z] --- stderr: nav2_costmap_2d
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/colcon_core/executor/__init__.py", line 91, in __call__
rc = await self.task(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/colcon_core/task/__init__.py", line 93, in __call__
return await task_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/colcon_cache/task/lock/dirhash.py", line 179, in lock
assert lockfile.lock_type == ENTRY_TYPE
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
---
am i doing something wrong?
—
Reply to this email directly, view it on GitHub
<#4392 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AARP6RMOSDWDLLAEV6GYIY3ZHXC4PAVCNFSM6AAAAABIWYVBKGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZRG43TOOBZHA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
With this you mean "Rebuild Container Without Cache"? it still seems to build with cache. Maybe because you're using |
Regarding the colcon cache, I cleaned out a bunch of things and it works now. I'll keep an eye out if it reproduces as part of a "normal workflow". Can we somehow have the option to not rebuild the packages to save time since the image is build quite often? For me that's a big plus. I guess commenting out the updateContentCommand from the devcontainer would do it? I even think this should be the default. What do you think? |
friendly for multi user hosts envs
to make it simple to share files such as ros bags and artifacts
to avoid being to intrusive to user home dir
and sort purposes for each mount
as this is normally done automatically via VSCode however this feature is still exclusive to MS remote extension and has not yet been upstream to the FOSS CLI - devcontainers/cli#441 (comment)
to avoid halting dev container build
to force github actions to complete workflow
to create dev container from CI images with pre-built artifacts
for self constancy with dev container config and with CI images
to readily re-use CI debugger image that build colcon workspace without it so that colcon install can be standalone
allowing it to be set from global env as well
i.e. the installed ROS distro
such as re-using generated CI image to start dev containers with pre-built workspaces
# MARK: Pull image - download image from CI and GHCR for local dev container | ||
# REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:main-debugger | ||
# docker pull $REFERENCE_IMAGE | ||
# export DEV_FROM_STAGE=$REFERENCE_IMAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To create a dev container using the CI images from this PR, including a pre-built colcon workspace, simply uncomment the lines above and change the following tagname to match the current branch before using dev container tooling to create the container.
-REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:main-debugger
+REFERENCE_IMAGE=ghcr.io/ros-navigation/navigation2:buildkit-debugger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For more information in getting started, check the included .devcontainer/README.md
for details.
@ruffsl do you want this reviewed? Its not green yet, not sure where this stands |
Sure, feel free. It's mostly complete aside from cleaning out the old CI. Some colcon tests seem to be failing locally as well, so I'm not sure if that's the CI or the tests themselves just yet. Were these failing on the old CI as well? |
Our current CI is green https://app.circleci.com/pipelines/github/ros-navigation/navigation2/12221/workflows/23fc544d-eaee-4fdc-a8cc-340605485154/jobs/37129, so you should be able to build successfully, but you may need to rebase for that if you haven't updated the base recently.
This is also failing too |
// "--device=/dev/dri", // enable Intel integrated graphics | ||
// "--ulimit", "nofile=1024:4096", // increase file descriptor limit for valgrind | ||
// | ||
"--runtime=nvidia", // enable NVIDIA Container Toolkit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if no NV GPU exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the user should comment out this device option and use the appropriate command for their local hardware, like --device=/dev/dri
for Intel integrated graphics. Nvidia is just enabled by default as it's so common in robotics and AI development on linux (my own bias). We could leave all hardware acceleration options commented out by default instead, just a minor inconveniences to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove submodule - use .repos
file to maintain the underlay workspace as needed - this is a very important feature still to have as we need to add new dependencies that aren't released yet and/or bugs fixed that we can use in our CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll admit that sub modules are more tricky to manage than a yaml file, but they are less opaque to git VCS. In this case, I was trying to avoid the need of any ros specifics in the checkout job, but I'll swap to use a containerized job using a ros base image instead to simply the bootstraping so we can stick with yaml files instead.
@@ -0,0 +1,3 @@ | |||
[submodule "nav2_minimal_turtlebot_simulation"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -1,6 +1,6 @@ | |||
name: Lint | |||
on: | |||
pull_request: | |||
# pull_request: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Readd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented out temporarily while WIP to avoid wasting credits. Will revisit before merging.
account-type: org | ||
org-name: ros-navigation | ||
# keep-at-least: 0 | ||
# TODO: come up with a better way to filter out the PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to remove, think I have a better method for this now.
@@ -0,0 +1,151 @@ | |||
name: "Bake prod stages" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: How much does baking prod/base vary? Is there a way they can share the majority of their code and use a input to flip the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base stages don't have a workspace installed, while the prod stages do and are just stacked on top of the base stage layers. The fancy thing about the releaser stage is that it builds from the runner stage, so that you have a minimal prod image that only includes runtime dependencies, while the debugger includes the same workspace layer but build FROM the builder stage that includes build dependencies so you could rebuild/debug in production if also needed at the cost of greater image size.
@@ -0,0 +1,57 @@ | |||
name: "Cache Source Checkout" | |||
description: "GitHub Action to cache checkout of source repos" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why is this necessary for CI? Why not check out each time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some repo projects can take quite a while to clone everything, even if it's only a shallow clone. So it can be faster to just cache the one-off checkout that triggered the workflow, so that each job in the workflow (that runs on a new/different runner VM) can skip the time in re-checking out the same code. Also, because you want to use a repos yaml file that may pin the source code to a tag or branch, rather than a fixed commit sha, it is possible for checkouts to differ of time and introduce race conditions, resulting in non-consistent sources of the duration of a given workflow.
@@ -0,0 +1,60 @@ | |||
# get-layer-metadata | |||
|
|||
GitHub Action to get layer metadata from Docker Buildx Bake output result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a "why" explanation?
@@ -0,0 +1,106 @@ | |||
name: Build Prod Images |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain differences in the build prod / colcon / integration workflows
Co-authored-by: Steve Macenski <[email protected]> Signed-off-by: Ruffin <[email protected]>
This refactors the docker based CI and unifies it with a Dev Container workflow.
Instead of use cron jobs to priodickly rebuild the base image used for CI, or installing any missing dependencies as initial steps in CircleCI, this migrates the CI pipeline to dynamically rebuild the base image on demand while leveraging buildkit cache backend to do so efficiently. This also unifies the docker image build process forDev Containers, making it simple to rebuild a development image locally, or bootstrap one by pulling image layer remotely from GHCR provided by CI. Lastly, this also provide additional room to build release docker images, to quickly ship a minimal but pre-built nav2 workspace for select branches and open pull requests to streamline end user experimentation and testing.
Related:
To get started, simply clone this PR with git submodules and follow along with the include quick start guide: