Realm: Assertion `size <= ib_seg_size' failed #1769
Comments
This error only seems to appear when I run with profiling.
Which branch are you using? There is no line 2xxx in runtime_impl.h.
I thought you were using the new barrier branch. The error means the reduction value used by the barrier is too big. Currently, the active message upper limit of UCX is 8K, so I am surprised that Legion uses such a big value with a barrier. You can try increasing the size to 16K for now: https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/ucx/ucp_internal.cc?ref_type=heads#L67
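To make the failing condition concrete, here is a minimal sketch of the kind of check the assertion expresses. It is not Realm's code: `kDefaultSegSize`, `fits_in_segment`, and `check_payload` are hypothetical names, and the 8 KiB default is only taken from the limit mentioned in this comment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the segment size limit. 8 KiB mirrors the
// active message upper limit mentioned in the thread; it is not read
// from Realm's sources.
constexpr std::size_t kDefaultSegSize = 8 * 1024;

// Returns true when a payload of `size` bytes fits in one segment.
constexpr bool fits_in_segment(std::size_t size, std::size_t ib_seg_size) {
  return size <= ib_seg_size;
}

// Same shape as the failing check: abort when the payload is too large.
inline void check_payload(std::size_t size,
                          std::size_t ib_seg_size = kDefaultSegSize) {
  assert(fits_in_segment(size, ib_seg_size));
}
```

Under this model, any barrier reduction payload above 8192 bytes trips the check with the default limit, and raising the limit only moves the threshold.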
It needs to be 32K. I also have to run with
@eddy16112 do you expect this will go away with the new barrier implementation, or should I leave this issue open for now?
In the new barrier branch, we need to send the tree to child nodes, so we have seen cases that Let's keep this bug open for now.
Yeah, I am surprised we are hitting this on the legacy branch. Likely that's just been there for a while and never tested.
The new critical path profiling infrastructure will bang on it in a way that it rarely got exercised before, which is very likely what is happening here.
@lightsighter I am surprised that the critical path profiling uses such large reduction data, almost 36K.
Legion is not using that large of a reduction. It's reducing this data structure, which is not 36K: I suspect there is a performance bug in Realm where it always sends the reduced value for all generations, which keeps growing larger and larger, rather than sending the reduced values only for the difference in subscribed generations.
I think it's close but not exactly: the owner can collapse subscribe generations into a single notify active message depending on the latest subscribe generation observed. In case we are running 1000 gens, that would result in a single notification for
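The size difference being discussed can be illustrated with a toy model. This is only a sketch under stated assumptions, not Realm's barrier code: it assumes one fixed-size reduced value is carried per generation, and `payload_bytes`, `all_generations_payload`, and `delta_payload` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to carry one reduced value per generation in the range
// (first_gen, last_gen]. Assumes a fixed value size (illustrative only).
inline std::size_t payload_bytes(std::uint64_t first_gen,
                                 std::uint64_t last_gen,
                                 std::size_t value_bytes) {
  assert(first_gen <= last_gen);
  return static_cast<std::size_t>(last_gen - first_gen) * value_bytes;
}

// Suspected behavior: send values for every generation up to current_gen.
inline std::size_t all_generations_payload(std::uint64_t current_gen,
                                           std::size_t value_bytes) {
  return payload_bytes(0, current_gen, value_bytes);
}

// Alternative: send only generations the subscriber has not yet seen.
inline std::size_t delta_payload(std::uint64_t last_seen_gen,
                                 std::uint64_t current_gen,
                                 std::size_t value_bytes) {
  return payload_bytes(last_seen_gen, current_gen, value_bytes);
}
```

In this model, with a hypothetical 32-byte value, a single notification covering 1000 generations carries 32000 bytes, well past an 8K limit, while a notification covering only the last 10 generations carries 320 bytes.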
@syamajala Is this blocking? I can consider doing a separate fix if we need it immediately; otherwise we can wait until we merge the scalable barrier branch, which should in my opinion address it.
I have a workaround for right now, so it's not blocking.
I am seeing the following assertion when running cunumeric with ucx:
Here is a stack trace: