
Service and Endpoints for the node exporters are not correctly configured #826

Open
SSvilen opened this issue Nov 23, 2021 · 39 comments
Labels: lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness.)

Comments

SSvilen commented Nov 23, 2021

The Windows node exporter is installed on all Windows worker nodes, but the required Service and Endpoints resources are not created at all.
A Service object is created, but it is of type ClusterIP, which will not work in this case.
The Service should be of type 'ExternalName' and the Endpoints should be updated by the operator on every node join/deletion operation.
For instance:

apiVersion: v1
kind: Service
metadata:
  labels:
    name: windows-exporter
  name: windows-exporter
  namespace: openshift-windows-machine-config-operator
spec:
  type: ExternalName
  ports:
    - name: metrics
      port: 9182
      protocol: TCP
      targetPort: 9182
  externalName: nodexporter
---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    name: windows-exporter
  name: windows-exporter
  namespace: openshift-windows-machine-config-operator
subsets:
  - addresses:
      - ip: 1.1.1.1
        targetRef:
          kind: Node
          name: winmach-q84jj
          uid: ab8028e7-a0ed-4f83-89e5-b577be2231ed
      - ip: 1.1.1.1
        targetRef:
          kind: Node
          name: winmach-t5vgm
          uid: 1b710328-88d5-4142-a78f-dd414705cc19
    ports:
      - name: metrics
        port: 9182
        protocol: TCP
@mansikulkarni96 (Member)

@SSvilen thanks for the provided information.

As you can see in manifests/windows-exporter_v1_service.yaml, the type is not set to ClusterIP.
The required Service and Endpoints names should both be windows-exporter, as that name is used to get the resources in the operator code.
I suspect monitoring is not enabled in the operator namespace. Please ensure the label openshift.io/cluster-monitoring=true is present on the openshift-windows-machine-config-operator namespace, which is required for monitoring resources to be created by WMCO in that namespace.
If it is not enabled, you will see a log line like install the prometheus-operator to enable Prometheus configuration in the WMCO logs.
Community Operators have a checkbox to enable monitoring in the operator namespace. If you are building from source, you can use oc label ns openshift-windows-machine-config-operator openshift.io/cluster-monitoring=true --overwrite to set the label.
Let us know if that resolves the issue!
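
For reference, one quick way to confirm whether the label is already present (the namespace name is the default one used throughout this thread):

# Check the labels on the operator namespace; openshift.io/cluster-monitoring=true should be listed
oc get namespace openshift-windows-machine-config-operator --show-labels

# If it is missing, set it (same command as above)
oc label ns openshift-windows-machine-config-operator openshift.io/cluster-monitoring=true --overwrite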

SSvilen (Author) commented Nov 23, 2021

@mansikulkarni96,

Monitoring for the namespace is enabled. The problem is that the node exporter is installed on the Windows worker nodes and is not running as a pod, as it is for the Linux-based OS, so the Prometheus operator cannot properly discover the endpoint for that ServiceMonitor.
I therefore had to recreate the Service and manually create the Endpoints object, which in turn points to the Windows nodes.
Or am I overthinking this?

@mansikulkarni96 (Member)

@SSvilen Thanks for confirming that.
The behaviour you see is expected: windows-exporter runs as a Windows service on the Windows nodes, which differs from its Linux counterpart for supportability reasons.
The Prometheus operator should still be able to discover the endpoint; if you take a look at service_monitor.yaml, you can see how the relabelings are applied to make the endpoint discoverable.
What you are expecting is exactly what the operator does: it updates the Endpoints objects on every node join/deletion operation. See metrics.go if you are interested in the code base.
If you could provide Windows Machine Config Operator logs and details about the exact operator version, OCP version and the steps followed to reach this point, I should be able to help you out further.
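
As a side note (not from the original comment), the Endpoints the operator manages can be inspected directly to see whether the node addresses were populated; the resource name windows-exporter and the namespace are the ones mentioned earlier in this thread:

# Show the Endpoints object the operator is expected to keep in sync with the Windows nodes
oc get endpoints windows-exporter -n openshift-windows-machine-config-operator -o yaml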

SSvilen (Author) commented Nov 23, 2021

OK, I see what's happening.

controller.windowsmachine    invalid Machine    {"name": "winmach-t5vgm", "error": "no internal IP address associated",

and based on the code in metrics.go an internal IP address is expected.

The status field of the machine shows type 'InternalDNS'

status:
  addresses:
    - address: winmach-q84jj
      type: InternalDNS

I'm not sure why that is.
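
For anyone debugging the same thing, the address types reported for a Machine can be checked directly (Machine objects live in the openshift-machine-api namespace; the machine name here is the one from the status block above):

# Print the addresses the Machine API has recorded for this machine
oc -n openshift-machine-api get machine winmach-q84jj -o jsonpath='{.status.addresses}'

If only InternalDNS entries come back, the operator's internal-IP lookup will keep failing with the error shown above.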

@mansikulkarni96 (Member)

@SSvilen can you provide details about the WMCO version, cloud provider, OCP version and the Windows Server version used for the VM?
This is what the support matrix looks like: Supported Cloud Providers based on OKD/OCP Version and WMCO version, and Supported Windows Server versions.

SSvilen (Author) commented Nov 24, 2021

@mansikulkarni96 ,

WMCO 3.1
OCP 4.8
Windows 20H2

It would also be beneficial if there were a bit more logging, for instance here. That would make troubleshooting easier.

mansikulkarni96 (Member) commented Nov 24, 2021

@SSvilen logging info noted.
According to your comment, the Windows worker node is present; was it added by using WMCO?
If yes, then the "no internal IP address associated" error should have resolved on its own, as the IP address is not just required for metrics but also for the SSH connection to the VM.
I would request some more information for further debugging:

  1. Cloud provider information: is it VMware vSphere?
  2. Node configuration method used here; provide info from one of the two:
     - Full output of oc logs -f deployment/windows-machine-config-operator -n openshift-windows-machine-config-operator
     - Windows MachineSet yaml / ConfigMap yaml, depending on the Node configuration method used.
  3. Output of oc get network.operator cluster -o yaml

SSvilen (Author) commented Nov 25, 2021

@mansikulkarni96 ,

1. Cloud provider information: is it VMware vSphere?

Yes.

2. Node configuration method used here, provide info from one of the two:
- BYOH
- MachineSet

MachineSet

network.txt
operatorlogs.txt
machineSet.txt

Thanks!

@mansikulkarni96 (Member)

@SSvilen Thanks for providing the logs. From the operator logs I can see that the IP address cannot be found to configure the Windows machine into a node. You should see the same issue if you oc describe the Machine object it is trying to configure. I suspect it has to do with the golden image creation for vSphere. Please make sure you have followed all the steps described in vsphere-golden-image.md.

SSvilen (Author) commented Dec 6, 2021

@mansikulkarni96,

ok thanks. We'll look at it again.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Mar 6, 2022
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Apr 5, 2022
@MattPOlson

@mansikulkarni96,

ok thanks. We'll look at it again.

I'm seeing the same issue on vSphere; did you ever figure anything out?

mansikulkarni96 (Member) commented May 9, 2022

@MattPOlson Can you provide details about your setup as requested in this comment, so I can help you further?

SSvilen (Author) commented May 10, 2022

@mansikulkarni96,
ok thanks. We'll look at it again.

I'm seeing the same issue on vSphere; did you ever figure anything out?

You need working reverse DNS; during the addition of the Windows worker node, the operator creates the Endpoints.
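
A simple way to check this (illustrative; replace <node-ip> with the Windows VM's internal address) is a reverse lookup from a host that uses the same DNS servers as the cluster:

# The PTR lookup should return the Windows machine's hostname
nslookup <node-ip>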

@MattPOlson

@MattPOlson Can you provide details about your setup as requested in this comment, so I can help you further?

The cluster is running in vSphere and we are using MachineSets to provision the servers. If I change the Service to be of type 'ExternalName' and create an Endpoints object that includes the node, it works fine; it's just not happening automatically like it should.

MattPOlson commented May 10, 2022

@mansikulkarni96,
ok thanks. We'll look at it again.

I'm seeing the same issue on vSphere; did you ever figure anything out?

You need a working reverse DNS - during the addition of the windows worker node, the operator creates the endpoints.

Reverse DNS lookup works fine in our network, but the internal IP still isn't being populated on the Machine, so the Endpoints object isn't being created.

ping -a 10.33..

Pinging k8s-se-****************** [10.33..] with 32 bytes of data:
Reply from 10.33..: bytes=32 time=2ms TTL=121
Reply from 10.33..: bytes=32 time=2ms TTL=121

SSvilen (Author) commented May 10, 2022

@MattPOlson ,

What do the logs from the operator say when you add a new machine?
Are they BYOH, or do you provision with MachineSets?

@MattPOlson

@MattPOlson ,

What do the logs from the operator say when you add a new machine? Are they BYOH, or do you provision with MachineSets?

It's throwing this error. I'm trying to figure out where/how in the code the operator gets the internal IP address. They are provisioned with MachineSets.

DEBUG controller.windowsmachine invalid Machine {"name": "k8s-se-platform-01-bq57b-win-lprdv", "error": "no internal IP address associated", "errorVerbose": "no internal IP address associated\ngithub.com/openshift/windows-machine-config-operator/controllers.getInternalIPAddress\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:523\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).isValidMachine\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:203\ngithub.com/openshift/windows-machine-config-operator/controllers.(*WindowsMachineReconciler).SetupWithManager.func2\n\t/build/windows-machine-config-operator/controllers/windowsmachine_controller.go:114\nsigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Update\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/predicate/predicate.go:87\nsigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnUpdate\n\t/build/windows-machine-config-operator/vendor/sigs.k8s.io/controller-runtime/pkg/source/internal/eventsource.go:88\nk8s.io/client-go/tools/cache.(*processorListener).run.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:775\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/cache.(*processorListener).run\n\t/build/windows-machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:771\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/build/windows-machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:73\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}

@mansikulkarni96 (Member)

@MattPOlson can you add the full WMCO log snippet? Those are just the initial debug logs, which should resolve themselves once the IP for the machine is available.


@MattPOlson

Any updates on this? I feel like this is either a legitimate issue or something isn't documented correctly as far as the setup goes. I looked through the code, but I can't figure out why the internal IP still isn't being populated on the Machine, so the Endpoints object isn't being created.

@saifshaikh48 (Contributor)

@MattPOlson can I ask what OCP and WMCO version you are using?
In the log you shared, I see some failures to watch/get the OperatorCondition k8s resource. The fix for this was backported to WMCO 3.1.1 and 4.0.1 for OCP 4.8 and 4.9 respectively.
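
For reference, one way to confirm which WMCO version is installed (assuming the operator was installed via OLM into the default namespace used in this thread):

# The ClusterServiceVersion name includes the operator version
oc get csv -n openshift-windows-machine-config-operator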

@MattPOlson

@saifshaikh48 sure:
operator: community-windows-machine-config-operator.v4.0.1
cluster: 4.9.0-0.okd-2022-02-12-140851

@saifshaikh48 (Contributor)

Interesting; that version should have the proper permissions.

@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this as completed on Jun 24, 2022
openshift-ci bot commented Jun 24, 2022

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MattPOlson

This is still an issue in version 5.1.1. I have to update the Endpoints manually to get any metrics back from the Windows nodes.

/reopen

openshift-ci bot commented Jul 12, 2022

@MattPOlson: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

This is still an issue in version 5.1.1. I have to update the Endpoints manually to get any metrics back from the Windows nodes.

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sebsoto (Contributor) commented Jul 13, 2022

I'll look into this today

/reopen

openshift-ci bot reopened this on Jul 13, 2022
openshift-ci bot commented Jul 13, 2022

@sebsoto: Reopened this issue.

In response to this:

I'll look into this today

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


sebsoto (Contributor) commented Jul 13, 2022

Seeing

1.657650479492904e+09	DEBUG	events	Warning	{"object": {"kind":"Namespace","name":"openshift-windows-machine-config-operator","uid":"6fabb20a-a268-4c58-8fc7-30e887bb7dce","apiVersion":"v1","resourceVersion":"27258196"}, "reason": "labelValidationFailed", "message": "Cluster monitoring openshift.io/cluster-monitoring=true label is not enabled in openshift-windows-machine-config-operator namespace"}

and

1.6576521713493032e+09	INFO	metrics	install the prometheus-operator to enable Prometheus configuration

in the logs, but the namespace has the correct openshift.io/cluster-monitoring=true label on it.

sebsoto (Contributor) commented Jul 13, 2022

WMCO checks whether metrics are enabled on the namespace it is deployed in only at startup.
WMCO ignores the change if metrics are enabled or disabled while it is running.

Thinking about two potential options to fix this:

  1. WMCO watches the namespace and enables/disables its metrics functionality depending on the label
  2. WMCO checks the namespace label anytime it needs to reconcile the endpoint object
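
Until either option lands, a plausible workaround (not confirmed in this thread, but it follows from the startup-only check described above) is to apply the label and then restart the operator deployment so the check runs again:

oc label ns openshift-windows-machine-config-operator openshift.io/cluster-monitoring=true --overwrite
# Restart WMCO so it re-evaluates the label at startup
oc -n openshift-windows-machine-config-operator rollout restart deployment/windows-machine-config-operator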

@mtnbikenc (Member)

/remove-lifecycle rotten

openshift-ci bot removed the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) on Jul 19, 2022
@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 18, 2022
sebsoto (Contributor) commented Oct 18, 2022

This can be solved through https://issues.redhat.com/browse/WINC-545

sebsoto (Contributor) commented Oct 18, 2022

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Oct 18, 2022
sebsoto (Contributor) commented Oct 18, 2022

/lifecycle frozen

openshift-ci bot added the lifecycle/frozen label (Indicates that an issue or PR should not be auto-closed due to staleness.) on Oct 18, 2022