Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probable Incorrect Error Handling in collective_create() #761

Open
rgmiller opened this issue Mar 10, 2023 · 0 comments
Open

Probable Incorrect Error Handling in collective_create() #761

rgmiller opened this issue Mar 10, 2023 · 0 comments

Comments

@rgmiller
Copy link
Contributor

System information

Type Version/Name
Operating System All of them, I suspect
OS Version
Architecture
UnifyFS Version

Describe the problem you're observing

In collective_create() (around about line 303) we attempt to acquire a child request handle. If we can't get a handle for some reason, we do NOT clean up and exit (as we do with most other error conditions). Instead, we log an error and set the pointer to NULL, but otherwise continue processing. If no other errors occur, then the function will return a pointer to a coll_request struct, but that struct will contain an unexpected NULL pointer.

Describe how to reproduce the problem

The bug is in the error handling for this function, so you'd have to trigger the specific error. I'm not sure how you'd do that (and I rather suspect it's never happened on a production system).

Include any warning or errors or releveant debugging data

This was discovered during a code inspection as part of the work on #758. For reference, Michael Brim's comment about it is reproduced below:

Regarding the TODO, the reason we don't return immediately in that error case is because that would leak the allocated coll_request structure and its member array allocations. However, it looks like we're missing the logic to notice that we failed to get one of the child handles and need to cleanup the structure before returning NULL. Unfortunately, it appears it's not as easy as just calling collective_cleanup(), since that would prematurely free the original group rpc request handle, which we use in the callers to collective_create() to send back an error. This is complicated enough that we probably should open an issue and address in a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

1 participant