Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch : Fix context used when managing internal connection #5582

Merged
merged 3 commits into from
Dec 11, 2023

Conversation

johnhaddon
Copy link
Member

This concludes my lengthy investigation into the strange Unexpected message : ERROR : Emitting signal : Unknown error CI failures we've been seeing on main lately. See the final commit message for all the gory details.

The context is in fact required to be non-null.
This came to light due to a CI failure on `main` :

```
Error: FAIL: testInvalidColumns (GafferImageUITest.CatalogueUITest.CatalogueUITest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/gaffer/gaffer/build/python/GafferImageUITest/CatalogueUITest.py", line 58, in tearDown
    GafferUITest.TestCase.tearDown( self )
  File "/__w/gaffer/gaffer/build/python/GafferUITest/TestCase.py", line 88, in tearDown
    GafferTest.TestCase.tearDown( self )
  File "/__w/gaffer/gaffer/build/python/GafferTest/TestCase.py", line 119, in tearDown
    self.fail( f"Unexpected message : {message}" )
AssertionError: Unexpected message : ERROR : Emitting signal : Unknown error
```

Investigation showed that the "Unknown error" was in fact an `IECore::Cancelled` exception, and
that the signal in question was `plugInputChangedSignal()` on a Switch node inside a Text node inside an
InternalImage node inside a Catalogue node. The Catalogue node was being destroyed, and as the Text node
was torn down, its internal connections were modified and `Switch::inputChanged()` was called on the
internal Switch node. This was all happening in a background thread whose identity I havent ascertained,
and where the current context was cancelled. The destruction was actually being triggered from
`IECorePython::WrapperGarbageCollector::collect()`, which is why I believe the problem only occurred on
`main` - collection needs to occur at precisely the right time for the problem to show, and is highly
sensitive to how many wrappers have been created previously.

So what was the problem exactly (other than that garbage collection is a
horribly unpredictable compromise we'd rather avoid)? In `Switch::inputChanged()`
we make a shortcutting internal connection across the Switch if we can determine
that the index is not context-sensitive, which we check using
`PlugAlgo::dependsOnCompute()`. When the index isn't context sensitive, we call
`indexPlug()->getValue()` to get its current value and then make the appropriate
connection. But in this case, the index plug (which is an IntPlug) is driven by
a BoolPlug, which while still not context-sensitive, _will_ require type conversion
work in `getValue()`. That work is done by ValuePlug inside a Process, and the
Process checks for cancellation as a matter of course.

It's a bit hard to know exactly where to point the finger of blame here.
It's arguable that we don't need to wrap the type conversion in a Process.
But we currently do because we think it's possible that some conversions
might be expensive and therefore warrant monitoring (and caching). It's also
arguable that `dependsOnCompute()` should return `True` when a type conversion
is involved, and that what we really should mean by "Compute" is
"something involving a Process". But that's hard to explain, and would
defeat the Switch's internal optimisation in cases like this where it is
completely valid.

Instead I've blamed the poor old `Switch` node, and fixed it by scoping a default
context in `updateInternalConnection()`. This is much more limited in scope than
other potential fixes, and therefore much less likely to have unintended side
effects.

But I think the _real_ culprit is the garbage collection. It's occuring
at times completely out of our control, disposing of nodes on threads we don't want
them disposed on, and emitting signals on threads we don't want them emitted on. But
fixing that is far far _far_ harder.
@danieldresser-ie
Copy link
Contributor

I can't really say I fully understand all this - some of the more intrusive solutions do sound more elegant.

But I think you're right that a minimally intrusive fix to just make the problem go away is probably the right solution for now, and it sounds like you've thought through the options pretty thoroughly.

@johnhaddon johnhaddon merged commit 686c626 into GafferHQ:main Dec 11, 2023
4 checks passed
@johnhaddon johnhaddon deleted the catalogueErrorFix branch March 15, 2024 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants