Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus add nodes gauge for SQS mode #1083

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

phuhung273
Copy link

@phuhung273 phuhung273 commented Oct 29, 2024

Issue #, if available:
Close #785

Description of changes:

  • Count nodes/instances being tracked. Eg:
    k get node return 5
    aws ec2 describe-instances return 2
    Identify 3 nodes no longer under NTH control

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@phuhung273 phuhung273 requested a review from a team as a code owner October 29, 2024 14:34
@phuhung273 phuhung273 marked this pull request as draft October 29, 2024 14:34
@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from 6e910d0 to 91e34ac Compare October 30, 2024 14:05
@phuhung273 phuhung273 changed the title [WIP] Prometheus export nodes counter [WIP] IMDS mode: prometheus add nodes counter Nov 1, 2024
@phuhung273 phuhung273 changed the title [WIP] IMDS mode: prometheus add nodes counter [WIP] IMDS mode: prometheus add nodes gauge Nov 1, 2024
@phuhung273 phuhung273 changed the title [WIP] IMDS mode: prometheus add nodes gauge [WIP] SQS mode: prometheus add nodes gauge Nov 11, 2024
@phuhung273
Copy link
Author

@stevehipwell can you please give me a help on this. Now i only stuck at writing unit test for opentelemetry.go since it doesn
t have any test.

Do you think i should refactor this file along with this PR or should I open another separate PR (eg: make opentelemetry.go testable)

@stevehipwell
Copy link
Contributor

@phuhung273 I'm not a maintainer here but I like to see untestable code refactored to be testable, following the boy scout rule. If you do the work in this PR you can always split the refactoring to a separate PR before it's merged.

@LikithaVemulapalli what do you think?

@LikithaVemulapalli
Copy link
Contributor

Yes I agree with @stevehipwell here, let's separate refactoring PR to have a clear idea on the changes made. @phuhung273 if you want to test your changes for this PR let me know I will approve and run so for the future commits you can verify if the existing tests are working, if there are any conflicts or not, once you change the PR status to ready I will run the workflow, appreciate for your contribution. Thanks!

Copy link

github-actions bot commented Dec 4, 2024

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this PR to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 4, 2024
@phuhung273
Copy link
Author

/remove-lifecycle stale

@github-actions github-actions bot removed the stale Issues / PRs with no activity label Dec 6, 2024
Copy link

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this PR to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 22, 2024
@phuhung273 phuhung273 marked this pull request as ready for review December 26, 2024 15:13
@phuhung273 phuhung273 changed the title [WIP] SQS mode: prometheus add nodes gauge Prometheus add nodes gauge for SQS mode Dec 26, 2024
@Lu-David Lu-David removed the stale Issues / PRs with no activity label Dec 27, 2024
@tiationg-kho tiationg-kho self-assigned this Dec 27, 2024
@tiationg-kho
Copy link
Contributor

Hi @phuhung273, could you resolve the conflicts in pkg/node/node_test.go?

@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from b9e1617 to 95fc5a9 Compare January 1, 2025 02:25
@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from 95fc5a9 to f4ae6de Compare January 1, 2025 02:32
@phuhung273
Copy link
Author

Thanks and Happy new year @tiationg-kho. Conflicts resolved

Copy link
Contributor

@tiationg-kho tiationg-kho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @phuhung273,

Thanks for resolving the conflicts. I have left some comments.
Also, we can run the e2e test in local (docker desktop) to make sure our modification is valid.

  • Run all e2e test cases: make e2e-test
  • Run certain e2e test case: ./test/k8s-local-cluster-test/run-test -a ./test/e2e/<test-case> -d -b e2e-test

pkg/ec2helper/ec2helper.go Outdated Show resolved Hide resolved
pkg/ec2helper/ec2helper.go Show resolved Hide resolved
pkg/ec2helper/ec2helper.go Outdated Show resolved Hide resolved
pkg/ec2helper/ec2helper.go Outdated Show resolved Hide resolved
pkg/node/node.go Outdated Show resolved Hide resolved
pkg/observability/opentelemetry.go Outdated Show resolved Hide resolved
pkg/observability/opentelemetry.go Show resolved Hide resolved
cmd/node-termination-handler.go Outdated Show resolved Hide resolved
cmd/node-termination-handler.go Outdated Show resolved Hide resolved
@phuhung273
Copy link
Author

Thanks @tiationg-kho for your instruction on the e2e test, took a while to realize where does it fail. I will try to include the 2 new metrics in e2e.

@phuhung273
Copy link
Author

e2e metrics test in SQS mode added, phew... Please let me know your thought @tiationg-kho, thanks

Copy link
Contributor

@tiationg-kho tiationg-kho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding a new e2e test case! Just left few comments.

@LikithaVemulapalli could you please take a look too?

cmd/node-termination-handler.go Outdated Show resolved Hide resolved
@@ -215,6 +215,12 @@ func main() {
}
log.Debug().Msgf("AWS Credentials retrieved from provider: %s", creds.ProviderName)

ec2Client := ec2.New(sess)

if nthConfig.EnablePrometheus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add nil check for metrics here in case we have any error during observability.InitMetrics?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

Copy link
Author

@phuhung273 phuhung273 Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to, but the compiler doesn't like it. Can you guys have a try?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, "nil check" is incorrect. Since we return a zero-value struct when err exists.

Could we rename the error then check that error here?

metrics, initMetricsErr := observability.InitMetrics(nthConfig.EnablePrometheus, nthConfig.PrometheusPort)
if initMetricsErr != nil {
    nthConfig.Print()
    log.Fatal().Err(initMetricsErr).Msg("Unable to instantiate observability metrics,")
}

test/e2e/prometheus-metrics-sqs-test Show resolved Hide resolved
test/e2e/prometheus-metrics-sqs-test Outdated Show resolved Hide resolved
if [[ $cordoned -eq 0 ]]; then
echo "❌ Worker node was not cordoned"
else
echo "❌ regular-pod-test was not evicted"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not quite correct in the context of this test case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first, I just tried to test the 2 new metrics. Later found out that https://github.com/aws/aws-node-termination-handler/blob/main/test/e2e/prometheus-metrics-test only test the metrics in IMDS mode.

So I decided to also include other metrics as well. Thats why the file name is prometheus-metrics-sqs: test all metrics in SQS mode. How do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could remove this if-else check, and replace by adding a status variable before TAINT_CHECK_CYCLES for-loop, and check it after the for-loop to prevent the misleading scenario here (even we already cordoned and evicted but still get a "not evicted" in our log)?

The metric part is good.

Copy link
Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some inline comments, thanks for your contribution :)

pkg/node/node.go Show resolved Hide resolved
pkg/node/node.go Outdated Show resolved Hide resolved
@@ -215,6 +215,12 @@ func main() {
}
log.Debug().Msgf("AWS Credentials retrieved from provider: %s", creds.ProviderName)

ec2Client := ec2.New(sess)

if nthConfig.EnablePrometheus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

pkg/ec2helper/ec2helper.go Show resolved Hide resolved
pkg/ec2helper/ec2helper.go Show resolved Hide resolved
pkg/ec2helper/ec2helper_test.go Show resolved Hide resolved
pkg/node/node_test.go Show resolved Hide resolved
pkg/observability/opentelemetry.go Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics to show the number of nodes being tracked
5 participants