Flags not set on error #1325

Open
biddisco opened this issue Apr 12, 2017 · 2 comments

@biddisco

A number of large runs of our code failed due to errors when polling the send queue.
When an error is reported, we read the error entry with

                struct fi_cq_err_entry e = {};
                int err_sz = fi_cq_readerr(txcq_, &e, 0);

The error entry that came back reported

Flags = 0
length = 0

so we have very little information to debug with.

Should the flags be set to SEND or RMA on an error entry, or is a flags value of 0 valid?

@hppritcha
Member

@biddisco just to make sure: were you getting -FI_EAVAIL back from fi_cq_read or its equivalent? The GNI provider should normally set something in the flags field, although there appears to be a path through the provider for receives that may end up posting an entry to the receive error CQ with flags set to 0. Do you use the same CQ for both send/write and receive operations?

@biddisco
Author

We are using separate CQs for TX and RX, so a receive completion event should not be able to show up on the send queue.

The code has been heavily changed over the last few weeks by several people, so I'm not 100% certain of my facts. But there may have been a chance that an EAGAIN got through and triggered this error path when it was not in fact an EAVAIL. I'm tempted to suggest that you close this issue as "not a bug", and I will reopen it if we get further errors of the same kind.
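To make the EAGAIN versus EAVAIL distinction concrete, here is a minimal sketch of how the return value of the CQ read could be classified before deciding whether to call fi_cq_readerr. It is not our actual code: `classify_cq_read`, `enum cq_action`, and the `SKETCH_*` macros are hypothetical names, and the macros are stand-ins so the snippet compiles without libfabric. Real code would compare against FI_EAGAIN and FI_EAVAIL from `<rdma/fi_errno.h>` instead.

```c
#include <errno.h>
#include <sys/types.h>

/* Stand-ins so this sketch builds without libfabric installed.
 * Real code would use FI_EAGAIN and FI_EAVAIL from <rdma/fi_errno.h>.
 * The numeric value of SKETCH_EAVAIL is arbitrary and only meaningful here. */
#define SKETCH_EAGAIN EAGAIN
#define SKETCH_EAVAIL 100000

/* What the polling loop should do with the value returned by the CQ read. */
enum cq_action {
    CQ_HANDLE_COMPLETIONS, /* ret > 0: that many completions were read      */
    CQ_NOTHING_READY,      /* empty queue: poll again later, NOT an error   */
    CQ_READ_ERR_ENTRY,     /* error entry pending: only now read it         */
    CQ_FATAL               /* any other negative value                      */
};

/* Hypothetical helper: map the return value of the CQ read to an action. */
static enum cq_action classify_cq_read(ssize_t ret)
{
    if (ret > 0)
        return CQ_HANDLE_COMPLETIONS;
    if (ret == 0 || ret == -SKETCH_EAGAIN)
        return CQ_NOTHING_READY;
    if (ret == -SKETCH_EAVAIL)
        return CQ_READ_ERR_ENTRY;
    return CQ_FATAL;
}
```

With a check like this, fi_cq_readerr would only be called in the `CQ_READ_ERR_ENTRY` case. If it is called after a plain EAGAIN, there is no error entry to read, so the zero-initialised struct would still show Flags = 0 and length = 0, which would match what we saw.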

[Additionally, we've implemented a resend of our message in the event that an FI_MSG SEND error occurs, so we should be able to recover from that. We have not implemented recovery for RMA, though we can and should if we get errors of that kind.]
