[BUG] Join fails when using custom CA certs #7453

Open
nnewc opened this issue Jan 27, 2025 · 17 comments
Labels
area/rancher Rancher related including internal and external backport-needed/1.4.2 kind/bug Issues that are defects reported by users or that we know have reached a real release known-issue-v1.4.1 priority/0 Must be fixed in this release regression reproduce/always Reproducible 100% of the time require/doc Improvements or additions to documentation severity/1 Function broken (a critical incident with very high impact)

Comments

@nnewc

nnewc commented Jan 27, 2025

Describe the bug
After setting a custom CA certificate, additional Harvester nodes fail to join.

To Reproduce

  1. Install the first Harvester node with the custom CA set in system_settings.ssl-certificates and system_settings.additional-ca
  2. Add the CA to system_settings.additional-ca in the join config
  3. Join a second Harvester node
  4. Observe second node never joins:
harvey03:/home/rancher # journalctl -fu rancher-system-agent
Jan 24 23:28:08 harvey03 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.
Jan 24 23:28:13 harvey03 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 45.
Jan 24 23:28:13 harvey03 systemd[1]: Stopped Rancher System Agent.
Jan 24 23:28:13 harvey03 systemd[1]: Started Rancher System Agent.
Jan 24 23:28:13 harvey03 rancher-system-agent[3515]: time="2025-01-24T23:28:13Z" level=info msg="Rancher System Agent version v0.3.9 (0d64f6e) is starting"
Jan 24 23:28:13 harvey03 rancher-system-agent[3515]: time="2025-01-24T23:28:13Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Jan 24 23:28:13 harvey03 rancher-system-agent[3515]: time="2025-01-24T23:28:13Z" level=info msg="Starting remote watch of plans"
Jan 24 23:28:13 harvey03 rancher-system-agent[3515]: time="2025-01-24T23:28:13Z" level=fatal msg="error while connecting to Kubernetes cluster: Get \"https://192.168.60.155/version\": tls: failed to verify certificate: x509: certificate signed by unknown authority"
Jan 24 23:28:13 harvey03 systemd[1]: rancher-system-agent.service: Main process exited, code=exited, status=1/FAILURE
Jan 24 23:28:13 harvey03 systemd[1]: rancher-system-agent.service: Failed with result 'exit-code'.

Expected behavior
Additional harvester nodes join with custom CA certificates.

Environment

  • Harvester v1.4.0
  • Baremetal PXE boot

Additional context

The custom CAs seem to be added to the OS root store, because curl verifies the certificate correctly:

harvey03:/home/rancher # curl -v https://192.168.60.155/version
*   Trying 192.168.60.155:443...
* Connected to 192.168.60.155 (192.168.60.155) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=harvester_homelab
*  start date: Jan 24 04:22:14 2025 GMT
*  expire date: Oct 21 04:22:14 2027 GMT
*  subjectAltName: host "192.168.60.155" matched cert's IP address!
*  issuer: CN=harvester_homelab_intermediate_ca
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /version]
* h2h3 [:scheme: https]
* h2h3 [:authority: 192.168.60.155]
* h2h3 [user-agent: curl/8.0.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x556b48ca93e0)
> GET /version HTTP/2
> Host: 192.168.60.155
> user-agent: curl/8.0.1
> accept: */*
>
< HTTP/2 200
< date: Fri, 24 Jan 2025 23:33:16 GMT
< content-type: application/json
< content-length: 285
< audit-id: aa9307ff-d269-4766-a41b-0e70004aeb76
< cache-control: no-cache, no-store, must-revalidate
< cache-control: no-cache, private
< x-api-cattle-auth: false
< x-content-type-options: nosniff
< x-kubernetes-pf-flowschema-uid: 4bbf68ff-a56f-4ce2-b095-e448479a66a0
< x-kubernetes-pf-prioritylevel-uid: 0e7ff0ed-a680-4bef-acc8-77f3886d5c26
< strict-transport-security: max-age=31536000; includeSubDomains
<
{
  "major": "1",
  "minor": "29",
  "gitVersion": "v1.29.9+rke2r1",
  "gitCommit": "114a1f58037bd70f90d9e630e591c5e52dd9b298",
  "gitTreeState": "clean",
  "buildDate": "2024-09-12T02:23:19Z",
  "goVersion": "go1.22.6 X:boringcrypto",
  "compiler": "gc",
  "platform": "linux/amd64"
* Connection #0 to host 192.168.60.155 left intact
}
@nnewc nnewc added kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jan 27, 2025
@ihcsim
Contributor

ihcsim commented Jan 27, 2025

Does your custom cert have the management VIP added to its CN or SAN? This sounds very similar to https://docs.harvesterhci.io/v1.4/install/index/#fail-to-join-nodes-using-fqdn-to-a-cluster-which-has-custom-ssl-certificate-configured.

@nnewc
Author

nnewc commented Jan 27, 2025

It does.

X509v3 Subject Alternative Name:
IP Address:192.168.60.155, DNS:harvester.harvey.lab

@nnewc
Author

nnewc commented Jan 27, 2025

The problem appears to be that an incorrect CA cert is being used in /var/lib/rancher/agent/rancher2_connection_info.json to connect back to Rancher.

Mine shows the default "dynamiclistener" CA cert when it should be the custom cert I specified:

harvey03:/home/rancher # cat /var/lib/rancher/agent/rancher2_connection_info.json | jq .kubeConfig | yq -P | yq '.clusters.[] | select(.name == "agent")'.cluster.certificate-authority-data | base64 -d | openssl x509 -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
        Signature Algorithm: ecdsa-with-SHA256
        Issuer: O = dynamiclistener-org, CN = dynamiclistener-ca@1738008916
        Validity
            Not Before: Jan 27 20:15:16 2025 GMT
            Not After : Jan 25 20:15:16 2035 GMT
        Subject: O = dynamiclistener-org, CN = dynamiclistener-ca@1738008916
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:a5:52:e8:e7:68:10:6f:fb:a0:a0:73:fc:d3:74:
                    85:ef:ac:db:3a:71:10:7a:46:04:15:da:6d:2c:f6:
                    26:cf:11:7b:42:2e:35:2d:ff:c7:40:f0:46:95:af:
                    6e:a7:1e:1f:35:c8:a5:80:3f:97:29:49:50:1c:1a:
                    6d:f4:f8:aa:03
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment, Certificate Sign
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Subject Key Identifier:
                AE:4C:BC:23:15:6D:C8:BD:D9:EE:36:6C:B6:B7:08:F0:37:E9:5F:B7
    Signature Algorithm: ecdsa-with-SHA256
         30:45:02:21:00:90:10:19:b7:80:27:65:71:d5:21:b4:16:bd:
         c8:f8:f4:10:50:94:15:1f:4a:45:1b:8e:f1:7a:62:03:57:b1:
         a5:02:20:33:28:96:23:5d:9b:44:2f:7e:06:c1:92:eb:4f:89:
         32:69:03:62:ab:d5:db:24:e1:9a:68:85:9d:7d:a5:fe:17
-----BEGIN CERTIFICATE-----
MIIBvTCCAWOgAwIBAgIBADAKBggqhkjOPQQDAjBGMRwwGgYDVQQKExNkeW5hbWlj
bGlzdGVuZXItb3JnMSYwJAYDVQQDDB1keW5hbWljbGlzdGVuZXItY2FAMTczODAw
ODkxNjAeFw0yNTAxMjcyMDE1MTZaFw0zNTAxMjUyMDE1MTZaMEYxHDAaBgNVBAoT
E2R5bmFtaWNsaXN0ZW5lci1vcmcxJjAkBgNVBAMMHWR5bmFtaWNsaXN0ZW5lci1j
YUAxNzM4MDA4OTE2MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEpVLo52gQb/ug
oHP803SF76zbOnEQekYEFdptLPYmzxF7Qi41Lf/HQPBGla9upx4fNcilgD+XKUlQ
HBpt9PiqA6NCMEAwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB/wQFMAMBAf8wHQYD
VR0OBBYEFK5MvCMVbci92e42bLa3CPA36V+3MAoGCCqGSM49BAMCA0gAMEUCIQCQ
EBm3gCdlcdUhtBa9yPj0EFCUFR9KRRuO8XpiA1expQIgMyiWI12bRC9+BsGS60+J
MmkDYqvV2yThmmiFnX2l/hc=
-----END CERTIFICATE-----

In my experience, when deploying Rancher with custom CA certs outside of Harvester, the certs served at the Rancher endpoint /cacerts are used for connecting to Rancher. We tested this on 1.3.2: certificate-authority-data was the same there, but the connection worked and the Harvester node joined without problems. On 1.4, verification seems to be stricter and fails with the error above. If we edit rancher2_connection_info.json directly, the join succeeds, so somewhere the custom CA cert is not getting picked up.
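
For anyone checking their own node, the long openssl dump above can be reduced to a one-line verdict. This is only a rough sketch: the helper name classify_pinned_ca is made up for illustration, and it assumes the default CA can be recognised by the "dynamiclistener" string in its issuer line, as in the output above.

```shell
# Hypothetical helper: reads the output of `openssl x509 -noout -issuer`
# on stdin and reports whether the issuer looks like the default
# dynamiclistener CA (assumption: its issuer contains "dynamiclistener").
classify_pinned_ca() {
    if grep -q 'dynamiclistener'; then
        echo "default dynamiclistener CA"
    else
        echo "other CA"
    fi
}

# Usage: take the extraction pipeline above and replace the final
# `openssl x509 -text` with:
#   ... | base64 -d | openssl x509 -noout -issuer | classify_pinned_ca
```

If it prints "default dynamiclistener CA" on a cluster where a custom CA was configured, the node is probably in the state described in this issue.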

@irishgordo
Contributor

@nnewc - thanks for reporting - I can reproduce this, I'll adjust the labels 😄

@irishgordo irishgordo added severity/1 Function broken (a critical incident with very high impact) reproduce/always Reproducible 100% of the time and removed reproduce/needed Reminder to add a reproduce label and to remove this one severity/needed Reminder to add a severity label and to remove this one labels Jan 28, 2025
@nnewc
Author

nnewc commented Jan 28, 2025

Thanks @irishgordo. I'm not sure how those labels got added. 😄 I also tried the recommendation to install certs manually per this doc and that didn't seem to make any difference.

@irishgordo irishgordo added regression area/rancher Rancher related including internal and external labels Jan 29, 2025
@ibrokethecloud
Contributor

ibrokethecloud commented Jan 29, 2025

Rancher v2.9.x seems to have made a change to enable strict TLS verification when setting up rancher-system-agent.

In Harvester v1.4.x installs, the rancher-system-agent unit definition contains an extra variable, CATTLE_AGENT_STRICT_VERIFY, which was added in Rancher v2.9.x.

Harvester v1.4.0 uses Rancher v2.9.2 for provisioning the local cluster, which is why we are seeing this error.

[Unit]
Description=Rancher System Agent
Documentation=https://www.rancher.com
Wants=network-online.target
After=network-online.target
[Install]
WantedBy=multi-user.target
[Service]
EnvironmentFile=-/etc/default/rancher-system-agent
EnvironmentFile=-/etc/sysconfig/rancher-system-agent
EnvironmentFile=-/etc/systemd/system/rancher-system-agent.env
Type=simple
Restart=always
RestartSec=5s
Environment=CATTLE_LOGLEVEL=info
Environment=CATTLE_AGENT_CONFIG=/etc/rancher/agent/config.yaml
Environment=CATTLE_AGENT_STRICT_VERIFY=true
ExecStart=/opt/rancher-system-agent/bin/rancher-system-agent sentinel
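
To check whether a given node's unit has this variable set, the unit text can be scanned for it. A minimal sketch follows; the helper name strict_verify_enabled is illustrative, and it only looks at Environment= lines in the unit text, so a value coming from one of the EnvironmentFile paths above would not be detected.

```shell
# Hypothetical helper: reads systemd unit text on stdin and reports whether
# an Environment= line sets CATTLE_AGENT_STRICT_VERIFY=true.
strict_verify_enabled() {
    if grep -Eq '^[[:space:]]*Environment=CATTLE_AGENT_STRICT_VERIFY=true[[:space:]]*$'; then
        echo "strict"
    else
        echo "not-strict"
    fi
}

# Usage on a node (systemctl cat prints the unit file and any drop-ins):
#   systemctl cat rancher-system-agent | strict_verify_enabled
```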

The behaviour is controlled either by the Rancher setting agent-tls-mode

https://github.com/rancher/rancher/blob/v2.9.2/pkg/settings/setting.go#L63

or by defining AgentEnvVars in the local cluster.provisioning object

https://github.com/rancher/rancher/blob/v2.9.2/pkg/controllers/capr/managesystemagent/managesystemagent.go#L218

Since the embedded Rancher only manages the local cluster, we can easily turn off strict verification through the agent-tls-mode setting.

Current workaround:

  • provision the first node and set up the custom TLS certs
  • edit setting.management agent-tls-mode on the first node and set value to system-store
apiVersion: management.cattle.io/v3
customized: false
default: strict
kind: Setting
metadata:
  creationTimestamp: "2025-01-29T00:37:23Z"
  generation: 2
  name: agent-tls-mode
  resourceVersion: "13758"
  uid: c0c40c7f-3d9c-47b0-8217-7140b73be1a2
source: ""
value: "system-store"

After this change, subsequent nodes should join the Harvester cluster successfully.
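
The edit can also be done non-interactively. The following is a sketch, not a tested procedure: it assumes kubectl on the first node is already pointed at the local cluster, that the resource is addressable as settings.management.cattle.io (derived from the apiVersion and kind in the YAML above), and the function names are made up for illustration.

```shell
# Print the current value of the agent-tls-mode setting.
show_agent_tls_mode() {
    kubectl get settings.management.cattle.io agent-tls-mode \
        -o jsonpath='{.value}{"\n"}'
}

# Build the JSON merge patch that sets the top-level "value" field.
agent_tls_mode_patch() {
    printf '{"value":"%s"}' "$1"
}

# Apply the patch; "system-store" is the value shown in the YAML above.
set_agent_tls_mode() {
    kubectl patch settings.management.cattle.io agent-tls-mode \
        --type merge -p "$(agent_tls_mode_patch "$1")"
}

# Usage:
#   show_agent_tls_mode
#   set_agent_tls_mode system-store
```

Building the patch in its own function keeps the JSON easy to inspect before anything is sent to the cluster.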

@bk201 @innobead we need to include a fix for this in v1.5.0 and likely backport to v1.4.2

@ibrokethecloud ibrokethecloud added this to the v1.5.0 milestone Jan 29, 2025
@ibrokethecloud
Contributor

we can also change the behaviour via https://github.com/rancher/rancher/blob/main/chart/values.yaml#L43

@nnewc
Author

nnewc commented Jan 29, 2025

Could this be supported by setting privateCA and providing the CA as the secret tls-ca in cattle-system namespace?

I'm pretty sure rancher will handle getting the custom CA to the agents if this is set.

@ibrokethecloud
Contributor

@nnewc that change requires setting both tls-ca and server-url, which has the side effect of triggering a fleet re-registration on the local cluster to pick up the server-url, and is likely to complicate upgrades.

@nnewc
Author

nnewc commented Jan 31, 2025

@ibrokethecloud makes sense.

Also, we would definitely like to see this backported to 1.4.2 if possible. Let me know if there is anything I can do to push this along.

@bk201 bk201 added priority/0 Must be fixed in this release require/doc Improvements or additions to documentation known-issue-v1.4.1 labels Feb 5, 2025
@bk201
Member

bk201 commented Feb 5, 2025

@khushboo-rancher @irishgordo do we have a test plan that joins nodes with self-signed certificates?

@irishgordo
Contributor

@bk201 tentatively, something like this:

@starbops
Member

starbops commented Feb 6, 2025

It might not be directly related, but seeing the agent-tls-mode setting reminds me of #7105.


FWIW, the relevant part in the Rancher doc is at https://ranchermanager.docs.rancher.com/getting-started/installation-and-upgrade/installation-references/tls-settings#agent-tls-enforcement

@harvesterhci-io-github-bot
Collaborator

added backport-needed/1.4.2 issue: #7597.

@harvesterhci-io-github-bot
Collaborator

harvesterhci-io-github-bot commented Feb 13, 2025

Pre Ready-For-Testing Checklist

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot
Collaborator

Automation e2e test issue: harvester/tests#1858

@HoustonDad

Can this be backported into v1.3 as well?
