Change Flyte CR naming scheme to better support namespace_mapping #5480

ddl-ebrown · 2024-06-14T22:38:44Z

@ddl-rliu did most of the work on this one - making this an upstream PR as it resolved a real issue for us.

Tracking issue

Why are the changes needed?

Typically Flyte is configured so that each project / domain has its
own Kubernetes namespace.

Certain environments may change this behavior by using the flyteadmin
namespace_mapping setting to put all executions in fewer Kubernetes
namespaces (or a single one). This is problematic because it can lead to
collisions in the naming of the CR that flyteadmin generates.

What changes were proposed in this pull request?

This patch fixes 2 important things to make this work properly inside
of Flyte:
- It adds a random element to the CR name so that a CR created by
  flyteadmin is named by the execution plus some unique value.
  
  Without this change, an execution Foo in project A will prevent an
  execution Foo in project B from launching, because the name of the
  CR that's generated in Kubernetes assumes that project A and
  project B put their CRs into different namespaces.
  
  When namespace_mapping is set to a single value, that assumption
  is wrong.
- It makes sure that when flytepropeller cleans up the CR, it uses
  Kubernetes labels to find the correct one: instead of assuming that
  it can use the execution name, it uses the project, domain and
  execution labels.
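
The collision described above can be illustrated with a small sketch. This is not Flyte code: `renderNamespace` and `crKey` are hypothetical helpers, and the only assumptions are that the namespace template substitutes `{{ project }}` and `{{ domain }}` and that Kubernetes identifies a namespaced object by its namespace plus name.

```go
package main

import (
	"fmt"
	"strings"
)

// renderNamespace is a hypothetical stand-in for template handling:
// it substitutes the project and domain placeholders.
func renderNamespace(template, project, domain string) string {
	r := strings.NewReplacer("{{ project }}", project, "{{ domain }}", domain)
	return r.Replace(template)
}

// crKey mimics how a namespaced Kubernetes object is identified:
// namespace/name, where the name is the bare execution name.
func crKey(template, project, domain, execName string) string {
	return renderNamespace(template, project, domain) + "/" + execName
}

func main() {
	// Default template: each project/domain pair gets its own namespace.
	fmt.Println(crKey("{{ project }}-{{ domain }}", "a", "dev", "foo"))
	fmt.Println(crKey("{{ project }}-{{ domain }}", "b", "dev", "foo"))
	// A single shared namespace: both executions map to the same key.
	fmt.Println(crKey("shared", "a", "dev", "foo"))
	fmt.Println(crKey("shared", "b", "dev", "foo"))
}
```

With the default template the two keys differ by namespace; with a constant template they are identical, so the second create would be rejected.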

How was this patch tested?

This is deployed in a live Flyte setup where we have automated tests. We observed that the CR names were correctly unique after this and the initial collision no longer occurred.

Setup process

Screenshots

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Related PRs

Docs link

codecov · 2024-06-14T22:42:40Z

Codecov Report

Attention: Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.

Project coverage is 36.18%. Comparing base (7136919) to head (a268b61).

Files with missing lines	Patch %	Lines
...ropeller/pkg/compiler/transformers/k8s/workflow.go	80.00%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5480   +/-   ##
=======================================
  Coverage   36.18%   36.18%           
=======================================
  Files        1302     1302           
  Lines      109614   109644   +30     
=======================================
+ Hits        39660    39680   +20     
- Misses      65809    65818    +9     
- Partials     4145     4146    +1

Flag	Coverage Δ
unittests-datacatalog	`51.37% <ø> (ø)`
unittests-flyteadmin	`55.32% <100.00%> (-0.01%)`	⬇️
unittests-flytecopilot	`12.17% <ø> (ø)`
unittests-flytectl	`62.18% <ø> (ø)`
unittests-flyteidl	`7.12% <ø> (ø)`
unittests-flyteplugins	`53.34% <ø> (ø)`
unittests-flytepropeller	`41.74% <80.00%> (+0.02%)`	⬆️
unittests-flytestdlib	`55.35% <ø> (ø)`

ddl-ebrown · 2024-06-14T22:49:26Z

flyteadmin/pkg/workflowengine/impl/k8s_executor.go

+		ctx,
+		v1.DeleteOptions{PropagationPolicy: &deletePropagationBackground},
+		v1.ListOptions{
+			LabelSelector: v1.FormatLabelSelector(executionLabelSelector(data.ExecutionID)),


Even though new executions will have different CR names, this deletion mechanism is fully backwards compatible - thanks for a good solution @ddl-rliu !
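
A rough sketch of why label-based deletion does not depend on the CR name. Everything here is illustrative: the `executionID` type, the label keys, and `formatSelector` are assumptions standing in for the real identifier type and the Kubernetes selector helpers, using only the standard library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// executionID is a simplified stand-in for a workflow execution identifier.
type executionID struct {
	Project, Domain, Name string
}

// executionLabels returns the labels assumed to be present on the CR.
// The label keys are illustrative assumptions, not confirmed names.
func executionLabels(id executionID) map[string]string {
	return map[string]string{
		"project":      id.Project,
		"domain":       id.Domain,
		"execution-id": id.Name,
	}
}

// formatSelector renders an equality-based selector such as "a=1,b=2".
// Keys are sorted so the output is stable.
func formatSelector(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	id := executionID{Project: "a", Domain: "dev", Name: "foo"}
	fmt.Println(formatSelector(executionLabels(id)))
}
```

Because the selector is built only from the execution's identity, it matches a CR whether it was created under the old name or a suffixed one, provided both carry these labels.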

ddl-ebrown · 2024-06-14T23:53:43Z

flytepropeller/pkg/compiler/test/compiler_test.go

@@ -165,6 +165,10 @@ func TestDynamic(t *testing.T) {
 					Name:    "name",
 				},
 				"namespace")
+			// make sure real CR has randomized suffix


This is probably the simplest way to ensure existing tests continue to pass

ddl-ebrown · 2024-06-15T00:03:05Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

+	rand.Seed(seed)
+	// K8s has a limitation of 63 chars
+	name = name[:minInt(63-ExecutionIDSuffixLength, len(name))]
+	execName := name + "-" + rand.String(ExecutionIDSuffixLength-1)


Without this randomization, use of namespace_mapping in the config

flyte/flyteadmin/pkg/runtime/namespace_config_provider.go

Lines 11 to 18 in 02bf85f

	const (
		namespaceMappingKey = "namespace_mapping"
		defaultTemplate     = "{{ project }}-{{ domain }}"
	)

	var namespaceMappingConfig = config.MustRegisterSection(namespaceMappingKey, &interfaces.NamespaceMappingConfig{
		Template: defaultTemplate,
	})

with a value like foo causes problems when executions have the same names across projects.

hamersaw · 2024-06-17T19:59:27Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

@@ -159,6 +161,17 @@ func generateName(wfID *core.Identifier, execID *core.WorkflowExecutionIdentifie
 	}
 }

+const ExecutionIDSuffixLength = 21


Can we make this a configurable value and, if set to 0 (default?), have it disabled (so no random characters are appended)? My concern is that if anything relies on the CR name, this will break it.

I think we found / updated the spots where the name is a "contract" -- but if we want to be extra super careful we could make this configurable.

That said, I think the tradeoff we have to consider is:

- backward compatibility vs
- adding extra config / managing different behaviors

I would probably vote for not introducing an extra config and keeping this behavior not configurable (I'd argue prior behavior was a bug), but admit to not knowing the potential blast radius beyond core Flyte (i.e. plugins and such).

I'm happy to go either way since it's not my project :)

Yeah, I understand the config bloat all too well :). As you suggest, my main concern is breaking backwards compatibility here. I know there are Flyte users that rely on the FlyteWorkflow CR being named identically to the execution ID, which this would break. For me to be comfortable merging this, I feel it should default to the current behavior. cc @eapolinario thoughts?

Ah thanks!

If you know there are users depending on the existing CR naming scheme somehow, then making this behavior configurable seems like the only thing to do right now. I can update my PR.

Not sure if you track potential breaking change tickets anywhere, but maybe file one away to remove the option to configure this on the next major release boundary?

Thanks for the feedback! I've updated the PR to address this and explain a bit more here: #5480 (comment)

kumare3 · 2024-06-25T14:31:03Z

I am not in favor of this, as randomness will lead to leaky workflows and duplicates. We should use the project id itself or generate a consistent hash to increase inter project execution entropy

ddl-ebrown · 2024-06-25T15:19:10Z

I am not in favor of this, as randomness will lead to leaky workflows and duplicates. We should use the project id itself or generate a consistent hash to increase inter project execution entropy

Ah thanks @kumare3 for the heads up! We clearly didn't realize there was something internal to Flyte that depends on deterministic naming for CRs -- will make some updates taking that into account as well

ddl-ebrown · 2024-06-25T17:23:12Z

Also, should mention @kumare3 that if by "leaky" you meant "CR might not be deleted from the cluster", the deletion process is robust because this uses the actual key of the workflow in conjunction with CR labels to perform deletes, rather than the CR name.

If there are dupe CRs for the same workflow though, that's clearly an issue regardless :)

EngHabu · 2024-06-25T19:04:20Z

@ddl-ebrown I agree with not introducing randomization... especially since the name already starts with a random string :-)

Instead, I would update this call to use something like project-domain-rand(10) and hash that and that becomes the execution name...

I would also make the length of the execution name configurable in flyteadmin, so in your deployment you can make it longer and get better entropy...

ddl-rliu · 2024-07-18T22:49:33Z

flytepropeller/pkg/compiler/test/compiler_test.go

 				}
-				// make sure real CR has randomized suffix
-				assert.Regexp(t, regexp.MustCompile("name-[bcdfghjklmnpqrstvwxz2456789]{20}"), flyteWf.Name)
-				// then reset for the sake of serialization comparison
-				flyteWf.Name = "name"



The cr-name-scheme-change is now non-default, and tests are added here, so removing these checks.

ddl-rliu · 2024-07-18T22:50:51Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

+	hashedIdentifier := hashIdentifier(core.Identifier{
+		Project: project,
+		Domain:  domain,
+		Name:    name,
+	})
+	rand.Seed(int64(hashedIdentifier))


By hashing exactly these fields project, domain, name we prevent duplicate CRs from being created for the same execution (they will hash to the same CR name and k8s will prevent those duplicate CRs).

Edit: Regarding the hashIdentifier function, see this for reference.
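
A minimal sketch of the deterministic-suffix idea, under stated assumptions: `hashIdentifier` here is a hypothetical FNV-1a implementation (the real helper may differ), the alphabet is copied from the character class in the test regexp quoted earlier in this thread, and a local generator is used instead of seeding a shared one.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// alphabet mirrors the character class in the test regexp quoted earlier.
const alphabet = "bcdfghjklmnpqrstvwxz2456789"

// hashIdentifier is a hypothetical stand-in: FNV-1a over the three fields,
// each followed by a zero byte so ("ab", "c") and ("a", "bc") hash differently.
func hashIdentifier(project, domain, name string) uint64 {
	h := fnv.New64a()
	for _, s := range []string{project, domain, name} {
		h.Write([]byte(s))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

// deterministicSuffix derives n characters from a generator seeded with the
// identifier hash, so the same identifier always yields the same suffix.
func deterministicSuffix(project, domain, name string, n int) string {
	rng := rand.New(rand.NewSource(int64(hashIdentifier(project, domain, name))))
	b := make([]byte, n)
	for i := range b {
		b[i] = alphabet[rng.Intn(len(alphabet))]
	}
	return string(b)
}

func main() {
	fmt.Println(deterministicSuffix("a", "dev", "foo", 10))
	fmt.Println(deterministicSuffix("a", "dev", "foo", 10)) // same as above
	fmt.Println(deterministicSuffix("b", "dev", "foo", 10)) // different project
}
```

Using a generator local to the call avoids any dependence on global seed state, which matters if names are generated concurrently.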

ddl-rliu · 2024-07-18T22:51:41Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

+	if workflowCRNameHashLength := config.GetConfig().WorkflowCRNameHashLength; workflowCRNameHashLength > 0 {
+		obj.ObjectMeta.Name = rand.String(workflowCRNameHashLength)
+	} else {
+		obj.ObjectMeta.Name = name
+	}


By default WorkflowCRNameHashLength is 0 so the hash CR naming scheme will only be used when the deployment is specially configured to do so. See 93bb9a5#diff-67941e1320bd66fda570ebcf1f1c5c5221d608b136d662705e168a07a5d166ef

ddl-ebrown · 2024-07-20T00:10:50Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

+	rand.Seed(int64(hashedIdentifier))
+
+	if workflowCRNameHashLength := config.GetConfig().WorkflowCRNameHashLength; workflowCRNameHashLength > 0 {
+		obj.ObjectMeta.Name = rand.String(workflowCRNameHashLength)


Couldn't the name of the CR just incorporate the k8s namespace though and still be stable? I think using just a hash for the CR name loses some valuable information when you're glancing at the resource names now.

Updated #5480 (comment)

Now, for a setting like WorkflowCRNameHashLength: 32, the CR name will look like mywfname-f5ptmmd66pktp59nphfdjkqr9szp8fv5 or mywfname-wnw9zlg22z5wmdjggdcrnzdrpw59m4bc. Note that the suffix is deterministic, and uses a hash of the name+project+domain.

I opted for this approach rather than mywfname-project because of the character limit on the CR name: for executions that have both a long execution name and a long project name, improperly handled truncation will lead to undesired behavior, such as executions not starting as a result of name conflicts. For reference.

ddl-rliu · 2024-07-23T18:03:30Z

flytepropeller/pkg/compiler/transformers/k8s/workflow.go

+		base := name + "-"
+		maxNameLength := allowedExecutionNameLength - workflowCRNameHashLength
+		if len(base) > maxNameLength {
+			base = base[:maxNameLength]
+		}
+		obj.ObjectMeta.Name = fmt.Sprintf("%s%s", base, rand.String(workflowCRNameHashLength))


Based off of https://github.com/clarklee92/kubernetes/blob/519f1ba07379108c301c0474d5eb61b2012bd427/staging/src/k8s.io/apiserver/pkg/storage/names/generate.go#L57-L63

Not to bikeshed too much on this one, but a couple things...

You can skip some of that slice logic around length if you just use format specifiers to set the "max"

	maxNameLength := allowedExecutionNameLength - workflowCRNameHashLength - 1
	// strconv.Itoa renders the int as the decimal precision in the format string
	format := "%." + strconv.Itoa(maxNameLength) + "s-%s"
	obj.ObjectMeta.Name = fmt.Sprintf(format, base, rand.String(workflowCRNameHashLength))

Even better -- does WorkflowCRNameHashLength really need to be configurable or can it just be on / off? If it's long enough to prevent collisions (and I think we can pick something), then you could simplify this length calculation logic and just limit the string statically using a format specifier like this (let's assume it's set to 24, for example):

	// maxBase = allowedExecutionNameLength - workflowCRNameHashLength - 1, i.e. 38 when the hash length is 24
	obj.ObjectMeta.Name = fmt.Sprintf("%.38s-%s", base, rand.String(workflowCRNameHashLength))
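
The format-specifier idea can also be written with `%.*s`, which reads the precision from an int argument, so no format string has to be assembled at runtime. This is a sketch only: `truncatedName` is a hypothetical helper, and the suffix values are placeholders.

```go
package main

import "fmt"

// truncatedName keeps at most maxBase characters of base and appends
// "-" plus the suffix. For strings, the precision in %.*s limits how much
// of the input is formatted.
func truncatedName(base, suffix string, maxBase int) string {
	return fmt.Sprintf("%.*s-%s", maxBase, base, suffix)
}

func main() {
	fmt.Println(truncatedName("a-very-long-execution-name", "f5ptmmd66p", 6))
	fmt.Println(truncatedName("short", "f5ptmmd66p", 38))
}
```

A base shorter than the precision is left untouched, so no explicit length check is needed.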

The k8s rand util you're using has 32 possible characters... I think once you hit a string of length 24 you're in UUID territory (which is 5.31x10^36 and generally considered more than anyone would ever need)

I don't think something that long is remotely necessary for this use case though - the primary concern is that the seeded random number generator doesn't produce a duplicate random string given two different seeds (i.e. for an execution with the same name and domain, but in a different project). For that, even something like 10 characters should be enough I think, which is why I would advocate for picking something sensible. I know maintainers suggested making it configurable, but that feels unnecessary IMHO.

32^10 is 1.13x10^15

Using the collision calculator at https://devina.io/collision-calculator to understand this in more human-friendly terms

~1 year (or 3.43e+7 seconds) needed, in order to have a 1% probability of at least one collision if 500 ID's are generated every hour.

A fixed suffix length of 10 sounds good to me. Agreed about the sufficient amount of entropy on relatively high values for the suffix length. Just updated the PR: 15b57cb

- Typically Flyte is configured so that each project / domain has its own Kubernetes namespace.

  Certain environments may change this behavior by using the Flyteadmin namespace_mapping setting to put all executions in fewer (or a singular) Kubernetes namespace. This is problematic because it can lead to collisions in the naming of the CR that flyteadmin generates.

- This patch fixes 2 important things to make this work properly inside of Flyte:

  * it adds a random element to the CR name in Flyte so that the CR is named by the execution + some unique value when created by flyteadmin

    Without this change, an execution Foo in project A will prevent an execution Foo in project B from launching, because the name of the CR that's generated in Kubernetes *assumes* that the namespace the CRs are put into is different for project A and project B

    When namespace_mapping is set to a singular value, that assumption is wrong

  * it makes sure that when flytepropeller cleans up the CR resource that it uses Kubernetes labels to find the correct CR -- so instead of assuming that it can use the execution name, it instead uses the project, domain and execution labels

- Use deterministic hash of execution id, name, project, as the FlyteWorkflow CR name. Add workflow-cr-name-hash-length to flytepropeller config.

Signed-off-by: ddl-ebrown <[email protected]>
Signed-off-by: ddl-rliu <[email protected]>

ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from 806c40b to 438dd97 Compare June 14, 2024 22:39

ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from 438dd97 to 532888b Compare June 14, 2024 22:44

ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch 2 times, most recently from 33e6e7d to f97c77a Compare June 14, 2024 23:27

ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from f97c77a to d994304 Compare June 14, 2024 23:52

davidmirror-ops requested a review from hamersaw June 17, 2024 12:21

ddl-rliu force-pushed the change-flyte-CR-naming-scheme branch 6 times, most recently from 33a99f0 to 93bb9a5 Compare July 18, 2024 22:46

ddl-rliu force-pushed the change-flyte-CR-naming-scheme branch 2 times, most recently from f46550d to d4a0e72 Compare July 23, 2024 18:02

ddl-rliu force-pushed the change-flyte-CR-naming-scheme branch 2 times, most recently from 70accda to 15b57cb Compare July 25, 2024 23:22

davidmirror-ops requested a review from EngHabu August 14, 2024 13:40

ddl-ebrown force-pushed the change-flyte-CR-naming-scheme branch from 15b57cb to a268b61 Compare August 28, 2024 16:38

davidmirror-ops requested a review from pvditt October 3, 2024 16:50
