Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements for K8S deployment (especially ITs) #18514

Merged
merged 14 commits into from
Sep 10, 2024

Conversation

almahmoud
Copy link
Member

@almahmoud almahmoud commented Jul 8, 2024

  1. Adds GHCR build, to get a k8s container image built and pushed to ghcr.io under repository owner, with no credentials needed (complementing the DockerHub/Quay.io pushed that are only for galaxyproject repo and require credentials). Demonstrated for the Bioconductor fork (https://github.com/Bioconductor/galaxy/pkgs/container/galaxy), and could also be useful for AnVIL fork where it could replace the custom workflow that is currently being used.

  2. Updates pykube-ng version to latest available, making the deployment compatible with newer versions of Kubernetes (would error out on an API call for ingresses otherwise). It seems pykube-ng is not being updated anymore though, and there is an official Kubernetes API client for python now, which should likely be the long-term solution here imo.

  3. Add ability to override the TLS secret for ITs' ingress resources, notably useful for easily setting up a wildcard cert instead of relying on cert-manager to issue a new certificate per subdomain (which eventually hits the rate limit at least for letsencrypt).

  4. Adds ability to override ingress version, currently useful for running on old kubernetes versions (and added back conditional logic to work with old v1beta1 API version).

I don't remember ITs on Kubernetes being tested, so I didn't really look into adding tests. The changes aren't too complicated though. One thing I did notice is that although default is set here, runner_params.get() still returns None when it's not set. Is that a bug in the runner_params parsing?

cc: @nuwang @ksuderman @afgane

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@dannon dannon requested a review from ksuderman July 9, 2024 14:41
@ksuderman
Copy link
Contributor

I am testing this now.

@ksuderman
Copy link
Contributor

I haven't had a chance to look into it, but there does seem to be a bug in the way the runner_params are parsed. For testing I worked around the problem by providing defaults to the get methods here and here, but that simply masks the underlying problem. However, until that problem is fixed we may want to consider including the work around in the PR otherwise we just get unhandled KeyErrors.

Copy link
Contributor

@ksuderman ksuderman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, we may want to include a work around to the runner_params parsing problem until that has been resolved.

Copy link
Member

@nuwang nuwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! I think the default param issue can be fixed by directly subscripting instead of using dict.get()

lib/galaxy/jobs/runners/kubernetes.py Outdated Show resolved Hide resolved
@almahmoud
Copy link
Member Author

almahmoud commented Jul 16, 2024

I added another commit that makes jobs in kubernetes not be marked as running until the pod is actually running, as opposed to for eg waiting for container image to be pulled. I also changed the .active for IT entrypoints to look for whether the job is running as opposed to just not finished, which fixes the "FIXME" comment about queued being detected as active in some cases. @ksuderman this fixes the issue I briefly brought up today on zoom call, with ITs being marked as active too soon when configured with wildcard certs but then leading to a 503 page when clicked, and doesn't require a new galaxy.yml configuration like the previous implementation. It won't fix them being marked as ready before a certificate is issued when using subdomains and no wildcard cert though, but that should be more of a dev setting anyway so I think that's fine?

lib/galaxy/jobs/runners/kubernetes.py Outdated Show resolved Hide resolved
lib/galaxy/model/__init__.py Show resolved Hide resolved
@nuwang nuwang merged commit 0d4cb10 into galaxyproject:dev Sep 10, 2024
52 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants