
Change default labels to use Kubernetes resource name #698

Merged: 9 commits into k8ssandra:master on Oct 24, 2024

Conversation

@burmanm (Contributor) commented Sep 4, 2024

What this PR does:
Changes the datacenter label value to be the Kubernetes resource name of the Datacenter again. Provides backwards compatibility with older versions that used the wrong value.

All the label values are updated after a successful reconcile (this is an existing feature). Tested with a Helm installation of 1.22.1 and then updating only the cass-operator image.
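
For example (illustrative values): a CassandraDatacenter named dc1 whose spec overrides the datacenter name to My_Super_Dc would change roughly as follows:

    before: cassandra.datastax.com/datacenter: My_Super_Dc   (cleaned override name)
    after:  cassandra.datastax.com/datacenter: dc1           (Kubernetes resource name)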

Which issue(s) this PR fixes:
Fixes #691
Fixes #699

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@burmanm marked this pull request as ready for review September 6, 2024 07:43
@burmanm requested a review from a team as a code owner September 6, 2024 07:43
@Miles-Garnsey (Member) left a comment

I need to do more testing on this in several respects:

  1. Does an upgrade from the previous version to this one work correctly when an override name is in place?
  2. Does this work when incorporated into downstream projects (e.g. k8ssandra-operator)? If not, should we make changes to the downstream projects, ensure everything works, and only then merge the changes?

Also, @burmanm, can you give me an overview of how the backwards compatibility is supposed to work? I don't think this was on the original ticket, and I'm curious whether the status field you're using is pre-existing, and where it comes from generally.

@@ -596,7 +599,7 @@ func (dc *CassandraDatacenter) SetCondition(condition DatacenterCondition) {
 // GetDatacenterLabels ...
 func (dc *CassandraDatacenter) GetDatacenterLabels() map[string]string {
 	labels := dc.GetClusterLabels()
-	labels[DatacenterLabel] = CleanLabelValue(dc.DatacenterName())
+	labels[DatacenterLabel] = CleanLabelValue(dc.Name)
@Miles-Garnsey (Member):

Question: Is CleanLabelValue still necessary now that we're using the Kubernetes meta.name which will be DNS compliant anyway?

@burmanm (Contributor, Author):

Well, not really, but they do have slightly different rules:

	dns1123SubdomainFmt string = "[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
	dns1123LabelFmt     string = "(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?"

So better safe than sorry, I guess, especially if Kubernetes were to update these one day.
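
For illustration, a minimal sketch comparing the two rule sets, assuming the validation helpers from k8s.io/apimachinery are available:

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/util/validation"
    )

    func main() {
        // A valid label value may contain capitals and underscores...
        fmt.Println(validation.IsValidLabelValue("My_Super_Dc")) // [] (valid)
        // ...but the same string is not a valid DNS-1123 subdomain,
        // i.e. it could never be a Kubernetes resource name.
        fmt.Println(validation.IsDNS1123Subdomain("My_Super_Dc")) // errors (invalid)
        // A resource name such as "dc1" satisfies both rule sets.
        fmt.Println(validation.IsValidLabelValue("dc1"))  // []
        fmt.Println(validation.IsDNS1123Subdomain("dc1")) // []
    }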

@Miles-Garnsey (Member):

Suggestion: I think the dns1123LabelFmt is more liberal than the dns1123SubdomainFmt, and I don't think k8s will ever change these (since they're linked to the DNS spec).

I'd remove the extra function call just so it is clear to readers that the label is always going to be the exact meta.Name, and not something else that we've interfered with.

This is just a suggestion though, so it isn't blocking.

@burmanm (Contributor, Author) commented Sep 11, 2024

Yes, the operator upgrade test will deploy with an override name in place (using an operator version that uses the override naming) and update from there.

The status field is not pre-existing, but it wouldn't matter either way. A CRD upgrade will add it, and the default is treated as 0 (for example, when the field was not previously populated or does not exist). If the CRD is an old one, the operator would simply fetch too much information on every reconcile, but nothing would break. I did test this manually against an existing cluster.
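
As a minimal sketch of that default-to-zero behaviour (the field name below is hypothetical; the PR's actual field may differ):

    package main

    import "fmt"

    // Hypothetical status field: optional and numeric, so it decodes as 0 on
    // objects written before the CRD upgrade added it.
    type CassandraDatacenterStatus struct {
        // +optional
        LabelSchemaVersion int `json:"labelSchemaVersion,omitempty"`
    }

    // Objects from older operator versions report 0 here, which is exactly
    // the signal that their labels still need migrating.
    func needsLabelMigration(status CassandraDatacenterStatus) bool {
        return status.LabelSchemaVersion < 1
    }

    func main() {
        old := CassandraDatacenterStatus{} // field absent, decodes as 0
        current := CassandraDatacenterStatus{LabelSchemaVersion: 1}
        fmt.Println(needsLabelMigration(old), needsLabelMigration(current)) // true false
    }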

If you wish to test that, the simplest way is to deploy cass-operator with Helm and do an upgrade. For example:

helm install --namespace cass-operator --create-namespace k8ssandra k8ssandra/cass-operator --set global.clusterScoped=true

Deploy something, like the manifest used in the operator upgrade test:

kubectl apply -f tests/testdata/default-three-rack-three-node-dc-4x.yaml

And then update cass-operator to a version built from this repo:

helm upgrade --namespace cass-operator k8ssandra k8ssandra/cass-operator --reuse-values --set image.registry="" --set image.tag=v1.23.0-dev.3f1fd20-20240906

And the cluster should stay healthy and get the updated values. This process does not update the CRD, only the operator image.
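
To verify the relabeling, something like the following should show the resources carrying the new label value (a suggested check; substitute the Datacenter's resource name):

kubectl get pods,services,pvc -l cassandra.datastax.com/datacenter=<datacenter-resource-name>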

As for downstream projects, this isn't really the place to think about that (if the downstream is unable to use upstream, then the fix must happen in the downstream). That said, k8ssandra-operator uses these methods directly, so it would work as soon as the first reconcile here has finished.

@Miles-Garnsey (Member):

> As for downstream projects, this isn't really the place to think about that (if the downstream is unable to use upstream, then the fix must happen in the downstream). That said, k8ssandra-operator uses these methods directly, so it would work as soon as the first reconcile here has finished.

In theory that's true, but if we merge this to master and then cut a cass-operator release, then we are blocked on incorporating that release into downstream projects until they are also updated and tested to ensure that the new cass-operator version doesn't break them. As a result, I tend to think we should prep and test the downstream changes before we merge this.

On the other hand, changes to cass-operator can always break downstream repos, which is why we do integration tests in those downstreams before release. So I'm not necessarily going to block this PR on that basis if you really want to merge now. Just saying that we may need to do quite a lot of work in the downstreams before we can upgrade cass-operator.

> If you wish to test that, the simplest way is to deploy cass-operator with Helm and do an upgrade. For example:

This is very helpful, thank you! That'll help me get this tested much faster.

@Miles-Garnsey (Member) left a comment

Issue: There's a doc string in the types you need to change that currently reads:

Suggestion: It would be good to have some webhook protection against excessively long dc and cluster names to avoid this kind of problem:

2024-09-12T07:45:13.712Z	ERROR	Could not create headless service	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"correct-dc-name","namespace":"default"}, "namespace": "default", "name": "correct-dc-name", "reconcileID": "3e8bfbde-c188-408b-8244-0a74c55edb56", "namespace": "default", "datacenterName": "dcnamespecialcharacters", "clusterName": "cluster1&&OveriddenName", "error": "Service \"cluster1overiddenname-dcnamespecialcharacters-additional-seed-service\" is invalid: metadata.name: Invalid value: \"cluster1overiddenname-dcnamespecialcharacters-additional-seed-service\": must be no more than 63 characters"}
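
A minimal sketch of such a check, using the derived service-name pattern from the log above (the helper name is hypothetical):

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/util/validation"
    )

    // validateDerivedNameLengths is a hypothetical admission-time check:
    // reject cluster/dc name combinations whose longest derived object name
    // would exceed the 63-character DNS-1123 limit, rather than failing
    // later in the middle of a reconcile.
    func validateDerivedNameLengths(clusterName, dcName string) error {
        derived := fmt.Sprintf("%s-%s-additional-seed-service", clusterName, dcName)
        if len(derived) > validation.DNS1123LabelMaxLength {
            return fmt.Errorf("derived service name %q is %d characters, max is %d",
                derived, len(derived), validation.DNS1123LabelMaxLength)
        }
        return nil
    }

    func main() {
        // The names from the error log above would be rejected up front.
        fmt.Println(validateDerivedNameLengths("cluster1overiddenname", "dcnamespecialcharacters"))
    }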

Checks:

Master Branch

Pods get cassandra.datastax.com/datacenter: dcNmCharacters
Services get cassandra.datastax.com/datacenter: dcNmCharacters
PVCs get cassandra.datastax.com/datacenter: dcNmCharacters
sts gets cassandra.datastax.com/datacenter: dcNmCharacters
pvs aren't labelled at all.
ConfigMaps - it doesn't look like there are any.

In the status I see datacenterName: dcNm!&^^ Characters

One odd thing with secrets: it looked like I ended up with two, cl1oridename-superuser and cluster1-superuser. One had cassandra.datastax.com/datacenter: My_Super_Dc and the other had cassandra.datastax.com/datacenter: dcNmCharacters. I think the former might not have been cleaned up. Suggestion: investigate whether this still happens after this PR, since it is possible that the cleanup failures are related to the buggy labelling behaviour that we are fixing here.

After upgrade

Pods get cassandra.datastax.com/datacenter: dcNmCharacters
Services get cassandra.datastax.com/datacenter: dcNmCharacters
I'm gonna stop here, it doesn't look like things have been upgraded...

In the status I see datacenterName: dcNm!&^^ Characters

Checking the logs, I get

kubectl logs -n cass-operator k8ssandra-cass-operator-84ff466f58-cxlxq
Error from server (BadRequest): container "cass-operator" in pod "k8ssandra-cass-operator-84ff466f58-cxlxq" is waiting to start: trying and failing to pull image

I assume this is because the image isn't published anywhere? Do I need to build it myself and then push it somewhere to get this tested @burmanm ?

@burmanm (Contributor, Author) commented Sep 17, 2024

> I assume this is because the image isn't published anywhere? Do I need to build it myself and then push it somewhere to get this tested @burmanm ?

Right, it should be available in michaelburman290/cass-operator with the tag I mentioned. Your comments around the doc string don't show up at all; no idea what happened there.

@burmanm (Contributor, Author) commented Oct 15, 2024

@Miles-Garnsey ping

@Miles-Garnsey (Member):

Going again... datacenterName: DC&n_!!

Checks:
Master Branch

Pods get cassandra.datastax.com/datacenter: DCn
Services get cassandra.datastax.com/datacenter: DCn
PVCs get cassandra.datastax.com/datacenter: DCn
sts gets cassandra.datastax.com/datacenter: DCn
pvs aren't labelled at all.
ConfigMaps - it doesn't look like there are any.

In the status I see datacenterName: DC&n_!!

After upgrade

Pods get cassandra.datastax.com/datacenter: test-dc
Services get cassandra.datastax.com/datacenter: test-dc
PVCs get cassandra.datastax.com/datacenter: test-dc
statefulset gets cassandra.datastax.com/datacenter: DCn (@burmanm is this expected?)

Otherwise this looks about right to me.

@Miles-Garnsey (Member) left a comment

@burmanm has raised another ticket to investigate the STS labels not being changed here. I'll approve this since he wants to investigate that behaviour elsewhere.

@burmanm merged commit c82c668 into k8ssandra:master on Oct 24, 2024
47 checks passed
Development

Successfully merging this pull request may close these issues.

  • allow-updatespec once can be removed prematurely
  • Revert label override values
2 participants