feat: add reusable workflow to build openedx images #1
Conversation
.github/workflows/build.yml (outdated)
```yaml
with:
  repository: ${{ inputs.STRAIN_REPOSITORY }}
  ref: ${{ inputs.STRAIN_REPOSITORY_BRANCH }}
  token: ${{ secrets.REPO_DEPLOYMENT_KEY }}
```
[comment to reviewers]: the default value in the ednx-strains workflow is the default GitHub token (see here), which is unnecessary since the workflow is being used in the same repository. Instead, the token should correspond to a personal access token from a service account, or an SSH key also belonging to a service account, so the workflow can check out other repositories: https://github.com/actions/checkout?tab=readme-ov-file#usage
Which one do you recommend?
I mentioned checking out other repositories to keep backward compatibility, but if we decide to stick with this implementation, we should revisit whether there's a use case for it.
Are we expecting that the other organizations interested in using this workflow will all have their own strains repo?
Not necessarily. Unless teams need to change defaults from the caller workflow (e.g., https://github.com/eduNEXT/ednx-strains/blob/master/.github/workflows/build.yml#L39-L42), they shouldn't need to.
We still need to figure out whether to configure a token or an SSH key. Our current setup uses an SSH key associated with a service account, so I'll implement the same approach in another commit.
Here's the commit: a43ca13. I created another strains repo (private) to test whether the workflow checks it out correctly: https://github.com/eduNEXT/strains
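For reference, an SSH-key-based checkout of the strain repository would look roughly like this (a minimal sketch based on the documented actions/checkout inputs; the secret name is illustrative and may not match the actual commit):

```yaml
# Sketch only: check out a private strains repo with a service account's SSH key.
- name: Checkout strain repository
  uses: actions/checkout@v4
  with:
    repository: ${{ inputs.STRAIN_REPOSITORY }}
    ref: ${{ inputs.STRAIN_REPOSITORY_BRANCH }}
    # SSH_PRIVATE_KEY is a hypothetical secret holding the service account's private key
    ssh-key: ${{ secrets.SSH_PRIVATE_KEY }}
```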
```diff
@@ -0,0 +1,161 @@
name: Picasso V2
```
[comment to reviewers]: this is a private repository, so only repositories from the eduNEXT organization can use the workflow; see this documentation for more details. According to our previous discussions, the plan is for this workflow to also be used by other organizations, which means this repository should be public.
Is that feasible for this repository, considering that the Jenkins pipeline implementation is private as well?
At first glance, I don't see any major issues with making this repo public. I would like to hear from those who worked on the Jenkins implementation to see if this was previously discussed and whether there was any specific reason for keeping the implementation private. It might simply be that they didn't see a need for it to be public, since Jenkins allows access to the service for as many people as needed without requiring access to the script repository (?)
I'm sorry, I should've added this context to my previous comment but I forgot. I created the dedalo-scripts repository and also committed the first changes. I didn't know at the time whether it would be a long-term solution or what implementations would be hosted there, so I used private as the default visibility setting. Now, I'd argue against this decision. The dedalo-scripts repository looks like a team-specific solution, but it shouldn't be, mainly because it's solving the tech-wide problem of how to build Open edX images.
Given this, the repo should be public regardless of the decisions made about the visibility of dedalo-scripts.
.github/workflows/build.yml (outdated)
```yaml
inputs:
  STRAIN_REPOSITORY:
    description: 'The repository for strains to checkout. It should follow the format: ORGANIZATION/REPO'
    required: true
    type: string
  STRAIN_REPOSITORY_BRANCH:
    description: 'The branch of the repository to checkout'
    required: true
    type: string
  STRAIN_PATH:
    description: 'The path within the repository for strains'
    required: true
    type: string
  SERVICE:
    description: 'The service name to build'
    required: true
    type: string
secrets:
  DOCKERHUB_USERNAME:
    description: 'DockerHub username for login'
    required: true
  DOCKERHUB_PASSWORD:
    description: 'DockerHub password for login'
    required: true
  REPO_DEPLOYMENT_KEY:
    description: 'Deployment key for repository checkout'
    required: true
```
[comment to reviewers]: What other types of configurations do you find useful?
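For context, a caller workflow would wire these inputs and secrets roughly as follows (a minimal sketch; the repository path and values are illustrative, not taken from this PR):

```yaml
jobs:
  build:
    # Hypothetical path to this reusable workflow; replace with the real ORGANIZATION/REPO
    uses: ORGANIZATION/REPO/.github/workflows/build.yml@main
    with:
      STRAIN_REPOSITORY: eduNEXT/ednx-strains
      STRAIN_REPOSITORY_BRANCH: master
      STRAIN_PATH: strains
      SERVICE: mfe
    secrets:
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      DOCKERHUB_PASSWORD: ${{ secrets.DOCKERHUB_PASSWORD }}
      REPO_DEPLOYMENT_KEY: ${{ secrets.REPO_DEPLOYMENT_KEY }}
```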
```yaml
run: |
  . .tvm/bin/activate

  major_version=$(pip show tutor | grep Version | awk '{print $2}' | cut -d'.' -f1)
  distro_version=$(
    git ls-remote --tags https://github.com/eduNEXT/tutor-contrib-edunext-distro \
      | grep -E "$major_version\\.[0-9]+\\.[0-9]+$" \
      | tail -n 1 \
      | awk -F'/' '{print $NF}' \
  )

  pip install git+https://github.com/eduNEXT/tutor-contrib-edunext-distro@$distro_version
  tutor plugins enable distro
```
[comment to reviewers]: I removed these lines from the original script since we're not using the plugin index to install distro:
tutor plugins update
tutor plugins list
Please let me know if you think otherwise.
```yaml
shell: bash
working-directory: ${{ inputs.STRAIN_PATH }}/${{ env.TUTOR_APP_NAME }}
run: |
  . .tvm/bin/activate
```
[comment to reviewers]: Which environment variables does this script set? I think we can store those in the GitHub Environment, so there's no need to activate the tvm environment before running each step.
I agree that we could find a workaround to avoid activating the environment in each step, but I think we can leave it as an improvement for another PR.
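For future reference, one possible workaround is to persist what the activation script sets into the job's environment once, so later steps inherit it (a sketch; the exact variables tvm sets are an assumption here):

```yaml
- name: Export tvm environment for later steps
  shell: bash
  run: |
    . .tvm/bin/activate
    # Persist the tvm-managed binary directory so later steps find tutor on PATH
    echo "$PWD/.tvm/bin" >> "$GITHUB_PATH"
    # TUTOR_ROOT is an assumed example of a variable the activation script may set
    echo "TUTOR_ROOT=$TUTOR_ROOT" >> "$GITHUB_ENV"
```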
echo "[worker.oci]" > buildkit.toml | ||
echo "max-parallelism = 2" >> buildkit.toml | ||
docker buildx create --use --node=max2cpu --driver=docker-container --config=./buildkit.toml |
[note to author]: add a comment explaining why we're doing this.
Does this connection error with the parallel build also occur in the Jenkins pipeline? Could this potentially slow down build times or impact us in other ways?
I asked internally about the Jenkins agent's configuration but am still waiting for an answer, so I'll ask again. I'll post here once I know more.
And yes, limiting parallelism will slow down the building process compared to the default buildkit configuration. Here's a successful build with this buildkit config, which takes about 20 minutes. I'll compare this time with our current building time for MFEs in Jenkins.
I was also thinking about taking this to the Tutor Users' Group meeting because it seems like it hasn't been addressed upstream yet.
Thanks.
Taking this to the Tutor Users' Group seems like a good idea. If it's something that can be resolved in the near future, and we can help push it forward, then I don't think it should be a major concern for us.
I agree with you that this is not a blocker for the MVP. I opened this issue so we can work on it later: #2
```yaml
run: |
  . .tvm/bin/activate
  tutor config save
  tutor images build $SERVICE --no-cache
```
[note to author]: I don't think we need the --no-cache argument.
Could you explain why you think we wouldn't need this? Especially considering that Picasso builds the images with the same --no-cache argument, and we want to compare the build times as fairly and accurately as possible.
Of course. Our current Jenkins setup uses dynamically provisioned agents, each with a maximum number of concurrent builds. As I understand it, --no-cache prevents one of those concurrent builds from using Docker cache layers from another concurrent build. This wouldn't happen in this setup, since no concurrent builds exist on the same GH runner. Also, each time a runner is provisioned, it's done from scratch without using cache of any form, including from previous builds. Therefore, this setup would behave the same with or without the --no-cache flag.
@Alec4r @MaferMazu, please correct me if I'm wrong. Thanks!
We can decide after we get some numbers and run some tests with the MVP, but I agree with you so far.
Thanks for this @mariajgrimaldi. We could make other adjustments later if we need to, but I think this is good enough for an MVP.
This seems to be functional enough to show to the other teams and get their feedback on whether this new implementation is useful and easy to adapt to their use cases.
@magajh, thank you for your feedback! I've published this PR in our discussion post, so other teams can also review the solution.
@mariajgrimaldi I was thinking that maybe, instead of creating a workflow for building Docker images, it might be more convenient to do it as a GitHub Action. This way, each team or company could create their own workflow using the action, which would be much more flexible. Additionally, this would make it easier to support pushing to different Docker registries in the future. Centralizing everything in a GitHub Action would give us the flexibility to adapt to different needs without having to modify multiple workflows. What do you think? Could we evaluate the possibility of doing it as an action instead of a workflow?
Hi, @mariajgrimaldi. I don't have any specific comments on the implementation; however, I do have some general comments.
Thank you, @Alec4r, for your comments!
Thank you for the suggestion! Implementing this as a composite action was my first approach; it looked something like this, but then I changed it to a reusable workflow. I found debugging difficult when using it as an action: all the workflow steps happen in a single stage of the caller workflow, which makes reviewing the logs harder. Here's an example of the output when using a composite action: https://github.com/eduNEXT/ednx-strains/actions/runs/10172503969/job/28135183468. I don't think a reusable workflow fully solves this problem, because there are still plenty of logs to go through when debugging, but it's more organized since every single step is logged independently. Also, when implementing these bundles of instructions, we're assuming a couple of things, like that they run on Ubuntu; how can we ensure that's satisfied when using an action? In any case, I'm not against implementing it as an action, but I'd like to understand better why that option would be preferable, so I opened a new issue where we can discuss it further: #4. Changing this implementation to a composite action is not much work, so we can keep the conversation open.
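For readers following along, the composite-action alternative would be shaped roughly like this (a hypothetical skeleton, not the earlier approach linked above):

```yaml
# action.yml -- sketch of a composite action wrapping the build steps
name: Build Open edX image
description: Build an Open edX Docker image from a strain definition
inputs:
  service:
    description: The service name to build
    required: true
runs:
  using: composite
  steps:
    - name: Build image
      shell: bash  # composite run steps must declare a shell explicitly
      run: |
        . .tvm/bin/activate
        tutor config save
        tutor images build "${{ inputs.service }}"
```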
This is enough to use a reusable workflow: https://github.com/eduNEXT/ednx-strains/blob/master/.github/workflows/build.yml#L30-L42. The reusable workflow is called within a job, unlike actions, which are called as a step, but the how is very similar. Therefore, we shouldn't have to modify multiple workflows, just our centralized definition.
Thank you @Squirrel18, for the feedback!
Thanks for the suggestion. I opened a new issue to discuss this further: #5
Yes! Our main goal with this workflow is that it can be easily used in other repositories by calling it from any workflow, e.g.: https://github.com/eduNEXT/ednx-strains/blob/master/.github/workflows/build.yml#L30-L42
You're right. We don't need TVM since the environment is already isolated.
This would work if and only if internal means within the eduNEXT organization: https://docs.github.com/en/actions/creating-actions/sharing-actions-and-workflows-from-your-private-repository#about-github-actions-access-to-private-repositories. I'm not sure how we can bypass that constraint so other orgs can use it.
Thanks for the answers, @mariajgrimaldi
I meant making it an open source project. In my opinion, there is no need to make it open source, as each vendor has its own build system and there is not much benefit in making it open source in terms of reputation.
Hi all, @Alec4r, @Squirrel18, and @magajh. Thanks for your comments, I really appreciate them. I'll merge this tomorrow morning (EST) as the first workflow version so we can start working on all the improvements discussed here and in other channels. Thanks again.
Description
Add a reusable workflow to build Open edX images based on the Picasso V2 pipeline definition. The main goal of this implementation is to prove the feasibility of adopting GitHub Actions to replace Jenkins for our current build process. For more details on the team's stance on the potential migration, please visit this post.
Implementation details
For this implementation, we first translated each step in the current Picasso V2 pipeline to its counterpart in GH Actions. Then, we refactored them into smaller steps when possible for easier management. Here's an overview of the translation of each stage into steps:
Jenkins pipeline stages
3.1 Set strain name for TVM use
3.2 Set tutor version for TVM use
4.1 Install TVM
4.2 Initialize TVM project
4.3 Copies strain into project folder
4.4 Generates tutor environment
5.1 Update plugin index
5.2 List plugins
5.3 Enable distro
5.4 Run syntax validation
5.5 Run distro run-extra-commands
5.6 Generates tutor environment
GitHub steps
5.1 Install TVM
5.2 Initialize TVM project
5.3 Copies strain into project folder
Other design details to take into account:
How to test
The build.yml workflow actively uses the picasso/build.yml workflow and sets the necessary configurations to run correctly. To start an ednx-strains/.github/build.yml workflow execution, go to Actions > Build Open edX strain, fill in the necessary configuration, or use the default. I suggest using the MJG/redwood-test-image strain branch to avoid overriding the current image on DockerHub.
See here a successful build: https://github.com/eduNEXT/ednx-strains/actions/runs/10208033113/job/28243841700
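As a possible shortcut, the same dispatch could likely be triggered from the GitHub CLI instead of the Actions tab (a sketch; it assumes the workflow file name above and runs it from the test strain branch):

```bash
# Sketch: trigger the workflow_dispatch from the CLI.
# --ref runs the workflow definition from the given branch.
gh workflow run build.yml \
  --repo eduNEXT/ednx-strains \
  --ref MJG/redwood-test-image
```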