Skip to content

Conversation

shawkins
Copy link
Collaborator

@shawkins shawkins commented Oct 22, 2025

Replacement / alternative to #2999.

Layered on the tombstone removal changes / default enablement of parsing resource versions. It turns out that some of the tombstone changes needed refinement anyway. For example, we need to compare to a localized latest version, not to the cache.

This seems like the most straight-forward way to do this with locking - but of course that means we have to be very careful with what locks are already held. If this seems too risky, then queuing managed in the TemporaryResourceCache could be used instead, but that will need more code restructuring.

These changes also assume we only want to deal with this skipping problem when using the InformerEventSource. That could be generalized if needed for #3011

Much like the PrimaryUpdateAndCacheUtils, users will need to call InformerEventSource.PrimaryUpdateAndCacheUtils, to properly use this via custom logic.

This does not do any cleanup the extraneous puts into the cache because of onCreated / onUpdated being called after the updateAndCacheResource method.

This also does not try to reason about deletes.

cc @csviri @metacosm

Signed-off-by: Steve Hawkins <[email protected]>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 22, 2025
@shawkins shawkins mentioned this pull request Oct 22, 2025
@shawkins shawkins force-pushed the annotation_removal-b branch from d650f86 to 8558c1a Compare October 23, 2025 13:21
Copy link
Collaborator

@csviri csviri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally like this approarch. What we should do is also document it on high level, like javadoc on TemporaryResourceCache. Like the general idea; using lock striping for IO operations, etc.

Not sure if all the scenarios are covered with unit tests, probably those could be also added.

Will have to spend more time to think through all the scenarion, overall LGTM, added some comments, and questions to clarify.

}
try {
temporaryResourceCache.startModifying(id);
var updated = updateMethod.apply(resourceToUpdate);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to iterate over this, the start modifyn and recent resource update should be called from an utils, but fine by me now as it is

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had initially started down the path of a common method with the primary cache utils - but for clarity this seemed like a better place to start.

Are you expecting that we'll want to push the event skipping logic down into ManagedInformerEventSource, and/or are you thinking that we'll effectively be able to use a Context rather than a direct reference to the event source?

.equals(previousResourceVersion))
|| isLaterResourceVersion(resourceId, newResource, cachedResource))) {
// first check against the source in general - this also prevents resurrecting resources when
// we've already seen the deletion event
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned on other PR , would be nice to describe the resurrection scenario more in detial

boolean[] known = new boolean[1];
synchronized (this) {
if (!unknownState) {
latestResourceVersion = resource.getMetadata().getResourceVersion();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While is this better this way as using the latest sync version?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are effectively using the TemporaryResourceCache to reason about things the reconciler was responsible for creating / updating.

If we skip putting something in the cache, then we may not skip that event as intended.

Let's say the informer has seen:

  • create v1
  • modify v2 - the reconciler did this
  • modify v3 - something else did this

But has only delivered to the TemporaryResourceCache

  • create v1

And we're now attempting a put with v2. If we use the informer latest version, we won't actually put v2 into the TemporaryResourceCache as it looks out-of-date. But then on onAddOrUpdateEvent we'll answer that the event wasn't known and trigger a reconciliation.

So by using a local notion of latestResourceVersion we're reasoning over what the TemporaryResourceCache actually knows.

Also because the get logic has changed to check both the TemporaryResourceCache and the informer cache, we won't actually return an out-of-date resource due to the put.

private final GroupVersionKind groupVersionKind;
private final InformerConfiguration<R> informerConfig;
private final KubernetesClient kubernetesClient;
private final boolean comparableResourceVersions;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this also to @Informer annotation


@Override
public void onDelete(R resource, boolean b) {
public synchronized void onDelete(R resource, boolean b) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we have this synchronized, could you elaborate pls?

Copy link
Collaborator Author

@shawkins shawkins Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic has start and the onAddOrUpdate methods synchronized. I'm assuming that the intention there is to effectively block event handling until the primaryToSecondaryIndex has been built from the initial listing. I don't think delete should be handled differently.

That could also be done with a more granular lock around the usages of primaryToSecondaryIndex.

In general there's effectively a single threaded model of event delivery, so unless users are calling the onXXX methods manually, those calls will be serialized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants