Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use scrapeconfig instead of servicemonitor #607

Merged
merged 5 commits into from
Jul 10, 2024

Conversation

csibbitt
Copy link
Collaborator

@csibbitt csibbitt commented Jun 19, 2024

This allows us to create a prometheus config where targets are specified by DNS name, rather than IP addresses.

Solves a problem discovered here: #599 (comment)

Risks:

  • This may not work when using observability_strategy: use_community on default deployments of OCP <= 4.14 (at least)
    • the prometheus operator from openshift-monitoring looks new enough, but the scrapeconfig CRD was missing in my test environments
    • observability_strategy: use_redhat w/ obo-prometheus-operator from COO 0.2.0 works fine
  • I haven't done additional testing, for example to ensure metrics labeling compatibility with all the other parts of the system, or impact on the (deprecated) HA mode

@csibbitt csibbitt added the do-not-merge Code is not ready to be merged label Jun 19, 2024
- kind: ScrapeConfigs
name: scrapeconfigs.monitoring.coreos.com
version: v1alpha1
- kind: ServiceMonitors
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missing previously, and I'm not 100% sure the effects. My guess is it would affected garbage collection in the operator-sdk, preventing these servicemonitors from being cleaned up automatically when the servicetelemetry object is deleted.

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/f801d6c3273d471990d68e8f495dbc93

stf-crc-ocp_412-local_build FAILURE in 38m 04s
stf-crc-ocp_414-local_build FAILURE in 37m 39s
stf-crc-ocp_412-local_build-index_deploy FAILURE in 47m 40s
stf-crc-ocp_414-local_build-index_deploy FAILURE in 47m 35s

@csibbitt
Copy link
Collaborator Author

I believe the smoketest failures are related to label changes. I'll take a closer look at this locally and see how bad it is (I see a lot of labels in that part of the code)

@csibbitt
Copy link
Collaborator Author

I adjusted the labels to exactly match what we were producing before and the smoketests passed locally.

@csibbitt csibbitt removed the do-not-merge Code is not ready to be merged label Jun 26, 2024
Copy link
Collaborator

@vkmc vkmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good Chris, thanks! Considering we are removing the service monitors, what happen in brownfield deployments? Do the existing STF continue to run without change and the new scrape config component is deployed with new STF deployments? Thanks!

@csibbitt
Copy link
Collaborator Author

Looks good Chris, thanks! Considering we are removing the service monitors, what happen in brownfield deployments? Do the existing STF continue to run without change and the new scrape config component is deployed with new STF deployments? Thanks!

No. The code as-is deploys the new ScrapeConfig objects, and then deletes the old ServiceMonitor objects[1]
If users have manually configured additional servicemonitors via the "servicemonitor_manifest" option[2], then they will be preserved.

Upgrade testing will be required, but I don't forsee any difficulties. The labels of the new metrics exactly match the old ones.

[1] https://github.com/infrawatch/service-telemetry-operator/pull/607/files#diff-6b852f8c4b44841b7ddd16194f6304203cf7c6b134b7a008a0fc4b3c043b4d1cR87
[2] https://github.com/infrawatch/service-telemetry-operator?tab=readme-ov-file#overriding-default-manifests

Copy link
Contributor

@vyzigold vyzigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

roles/servicetelemetry/tasks/component_scrapeconfig.yml Outdated Show resolved Hide resolved
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/1f17428efe534a5b87a77ec40cc61ec4

stf-crc-ocp_412-local_build FAILURE in 35m 42s
✔️ stf-crc-ocp_414-local_build SUCCESS in 32m 58s
✔️ stf-crc-ocp_412-local_build-index_deploy SUCCESS in 42m 54s
✔️ stf-crc-ocp_414-local_build-index_deploy SUCCESS in 35m 49s

@csibbitt csibbitt dismissed vyzigold’s stale review July 10, 2024 13:14

change commited

Copy link
Member

@paramite paramite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didnt' test this, but it seems lgtm to me.

Copy link
Collaborator

@vkmc vkmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vkmc
Copy link
Collaborator

vkmc commented Jul 10, 2024

recheck

@csibbitt csibbitt enabled auto-merge (squash) July 10, 2024 16:40
@csibbitt csibbitt merged commit 43726ee into master Jul 10, 2024
10 checks passed
@csibbitt csibbitt deleted the csibbitt_STF-1776_ipv6-scraping branch July 10, 2024 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants