Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jhony/mcdaniel integrating Karpenter #41

Merged
merged 45 commits into from
Dec 5, 2023

Conversation

jfavellar90
Copy link
Contributor

@jfavellar90 jfavellar90 commented Jul 11, 2023

This is a duplicate of #24 with a couple of changes on top to adjust:

  • Code rebase
  • commit-lint errors
  • Changes to the new way of managing the helm chart.

I copy the original PR description as a reference:

For #7: refactor the new infra-example folder into infra-examples, and add Terraform modules for AWS reference resources.
Adds generic AWS Virtual Private Cloud and Elastic Kubernetes Service modules, both of which are preconfigured as necessary to support use of Karpenter for node auto-scaling and pod bin packing.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Jul 11, 2023
@openedx-webhooks
Copy link

openedx-webhooks commented Jul 11, 2023

Thanks for the pull request, @jfavellar90! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@itsjeyd itsjeyd added the waiting for eng review PR is ready for review. Review and merge it, or suggest changes. label Jul 11, 2023
@itsjeyd
Copy link

itsjeyd commented Jul 11, 2023

@lpm0073 @jfavellar90 Thank you for this contribution :)

@gabor-boros
Copy link
Contributor

@jfavellar90 I still need to create a cluster with Karpenter and deploy the helm chart, but looks good to me! Nice work @lpm0073 @jfavellar90! 🎉

@mphilbrick211 mphilbrick211 removed the waiting for eng review PR is ready for review. Review and merge it, or suggest changes. label Jul 17, 2023
@itsjeyd itsjeyd added core contributor PR author is a Core Contributor (who may or may not have write access to this repo). waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Jul 18, 2023
@cmltaWt0
Copy link
Contributor

cmltaWt0 commented Jul 25, 2023

@jfavellar90 @lpm0073 great job!

I've started testing it.

Issues I've found so far:

  1. Hashicorp/Template v2.2.0 is deprecated (archived) and has no distribution for M1:
template v2.2.0 does not have a package available for your current platform, darwin_arm64

Workaround for M1 users is following (tested):

brew install kreuzwerker/taps/m1-terraform-provider-helper
m1-terraform-provider-helper activate
m1-terraform-provider-helper install hashicorp/template -v v2.2.0
  1. k8s-cluster terraforming doesn't work for me from the first try (repeated twice to confirm):
│ Error: waiting for Security Group (sg-021189bbba3cc2252) Rule (sgrule-2633141658) create: couldn't find resource
│
│   with module.eks.aws_security_group_rule.node["port_8443"],
│   on .terraform/modules/eks/node_groups.tf line 207, in resource "aws_security_group_rule" "node":
│  207: resource "aws_security_group_rule" "node" {
│
╵

I've repeated terraform apply in aws/k8s-cluster folder since the SG was actually created in AWS.
Second try completed successfully.

Eventually I have a working Cluster, I can update my kube config and connect to it 🎉


Next, I'll try to deploy our Chart and couple of openedx installations.
Also we need to rebase the PR on the latest main and re-test.

@lpm0073
Copy link
Contributor

lpm0073 commented Jul 25, 2023

confirming that i ran into both issues for template last week on an unrelated project. comments:

  • the M1 incompatibility seems to be permanent. contributors didn't get around to addressing the apple silicon problem until after the M2 had been released. it seems that the Terraform community has moved on and doesn't have plans to address the problem for M1
  • the deprecation can be resolved with the following version bump

`terraform {
required_version = "~> 1.3"

required_providers {
template = {
source = "hashicorp/template"
version = "~> 2.2"
}
}
}`

Copy link
Contributor

@gabor-boros gabor-boros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found out that I didn't post my pending review comments for the past two weeks. 🤦 GitHub UI tricked me. Sorry about that. Posting now.

@cmltaWt0 and @lpm0073 thank you for helping out in functional reviewing!

charts/harmony-chart/values.yaml Outdated Show resolved Hide resolved
charts/harmony-chart/values.yaml Outdated Show resolved Hide resolved
# More details in https://karpenter.sh/preview/concepts/provisioners/#speclimitsresources
limits:
resources:
cpu: "400" # 100 * 4 cpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. These seems to be really high limits. I don't mind, but it may not make sense. I did not see a 100 node cluster for a while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reduced the number of nodes to 50, is it a better number?

infra-examples/aws/k8s-cluster/doc/aws-vpc-eks.png Outdated Show resolved Hide resolved
infra-examples/aws/k8s-cluster/main.tf Outdated Show resolved Hide resolved
@itsjeyd
Copy link

itsjeyd commented Aug 1, 2023

Hey @jfavellar90, a friendly reminder to follow up on @gabor-boros's comments above.

@jfavellar90
Copy link
Contributor Author

@gabor-boros I think this is ready for a second review. I'll be adding documentation to the main README file

@gabor-boros
Copy link
Contributor

@jfavellar90 Should I wait for the readme change or shall I proceed? 🚀

@itsjeyd
Copy link

itsjeyd commented Oct 17, 2023

Hey @jfavellar90, a friendly reminder to get back to @gabor-boros's question above.

@jfavellar90
Copy link
Contributor Author

@gabor-boros sorry for the late answer, I already added the docs to install Karpenter in the example infrastructure. However, there are changes we need to apply to the instructions in order to test the PR. In step 4 the Helm chart is installed in the cluster. Since the new version of the chart including Karpenter has not been released, step 4 must be replaced by the following tasks to install the chart from source:

  • Update the dependencies:
# From the repo root path
cd charts/harmony-chart
helm dependency update
  • Install the helm chart from source:
# From the repo root path
helm install --namespace harmony --create-namespace -f values-example.yaml harmony ./charts/harmony-chart --wait

Once the Helm chart is released, the instructions indicated in step 4 will work as expected.

@jfavellar90 jfavellar90 force-pushed the Jhony/mcdaniel_monitoring_services branch from 33b0567 to 7649372 Compare October 17, 2023 16:22
@gabor-boros
Copy link
Contributor

Ah! Thank you @jfavellar90 I'll review this after the meeting we have now. 😇

@itsjeyd itsjeyd added waiting for eng review PR is ready for review. Review and merge it, or suggest changes. and removed waiting on author PR author needs to resolve review requests, answer questions, fix tests, etc. labels Oct 24, 2023
@gabor-boros
Copy link
Contributor

FTR: The PR is reviewed, I just want to check one more thing before approving.

Copy link
Contributor

@gabor-boros gabor-boros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jfavellar90,

I just approved this PR, though I was not able to setup the kubernetes cluster. The hashicorp/template provider is deprecated and we are still depending on it (not just directly, but probably through dependencies as well?). Hence it cannot be installed on my system.

That should be a task of a different PR to fix that as the scope of the current PR is already huge. This is a good point to raise during the meeting today.

I read through the changes many times, so I can confidently approve this.

@itsjeyd itsjeyd removed the waiting for eng review PR is ready for review. Review and merge it, or suggest changes. label Nov 17, 2023
@gabor-boros gabor-boros changed the title Jhony/mcdaniel monitoring services Jhony/mcdaniel integrating Karpenter Nov 21, 2023
@itsjeyd
Copy link

itsjeyd commented Nov 29, 2023

@gabor-boros Based on your review above it sounds like this PR is ready for merge. Or are there any blockers that are keeping you from hitting the button? :)

@gabor-boros
Copy link
Contributor

@jfavellar90 the latest commit fixed the issue, though there is some other issues during EKS install with CBS and other things. That needs a separate issue to fix and waaaaay out of scope for this PR, so I'm merging this one now.

README.md Outdated Show resolved Hide resolved
@gabor-boros gabor-boros merged commit 3d3bc53 into main Dec 5, 2023
2 checks passed
@openedx-webhooks
Copy link

@jfavellar90 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@gabor-boros gabor-boros deleted the Jhony/mcdaniel_monitoring_services branch December 5, 2023 05:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core contributor PR author is a Core Contributor (who may or may not have write access to this repo). open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants