Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R3BUcesbSource2 and add runtime struct info checking #910

Merged
merged 1 commit into from
Jan 12, 2024

Conversation

YanzhaoW
Copy link
Contributor

@YanzhaoW YanzhaoW commented Nov 23, 2023

List of features:

  • Rewriting R3BUcesbSource into a new class R3BUcesbSource2. The interfaces of the new class is compatible with the current version of "R3BUcesbSource" except the features related to the text file input. Which means if you want to use the new class, the only thing to do is to change "R3BUcesbSource" to "R3BUcesbSource2" (EDIT: Also change the filename wirldcard * to .* since the R3BUcesbSource2 only input a regex for filenames.).
  • Use boost::asio to launch ucesb server instead of popen.
  • Add a runtime struct info item checking (based on regex search on item names) such that the current ext headers can be used for older experiments.

Motivation:

R3BUcesbSource crashes the program when running NeuLAND lmd2cal task due to the double closing:

R3BUcesbSource::~R3BUcesbSource()
{
R3BLOG(debug1, "R3BUcesbSource destructor.");
if (fReaders)
{
fReaders->Delete();
delete fReaders;
}
R3BUcesbSource::Close();
}

As has been discussed in #712 and #852 with @inkdot7 and @klenze, it would be better if the ext headers of the latest R3BRoot version could be compatible and used for the old experiments, without risking the "not-found" situations.

With the need to add this new feature and correct the existing malicious coding practices in the ucesb source class, the change of the old ucesb source class could result in 90% of its code to be rewritten. It's almost a certain that some people would be against it considering the new experiments coming very soon next year. Therefore, it's better to create a new class named "R3BUcesbSource2".

Testing result:

It's been tested with NeuLAND lmd data, both from s509 and s444, using the same ext header. For the s444, the number of neuland dp must be set to 8. Setting a higher number causes the program to terminate because of the unfound items (which should happen). However, the bar number value from s444 is somehow not correct. it could be caused by larger array sizes (causing "array_fewer" flag) of the struct defined in its sever side (need further investigation).

The implementation of runtime struct item checking of NeuLAND can be seen here in PR #890.

The implementation from the ucesb source side can be found in R3BUcesbStructInfo class.

If you have any suggestions, please comment below.


Checklist:

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 23, 2023

R3BUcesbSource crashes the program when running NeuLAND lmd2cal task due to the double closing:

That could and should be repaired?

@YanzhaoW
Copy link
Contributor Author

Hi, thanks for the comment.

That could and should be repaired?

The problem is the other closing from FairRoot is also not deterministic neither. Sometimes it closes and sometimes it doesn't. Instead of simply removing the calling of the close method, the best way, in my opinion, is to just let Close do nothing and handle the resource closing directly using RAII. That's also the reason why the close method of the new class is empty.

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 23, 2023

The problem is the other closing from FairRoot is also not deterministic neither. Sometimes it closes and sometimes it doesn't.

That sounds like a bug that should be reported.

@YanzhaoW
Copy link
Contributor Author

YanzhaoW commented Nov 24, 2023

That sounds like a bug that should be reported.

Yes, I have reported to FairRoot devs few days ago. Closings on both FairSource and FairSink have a lot of problems. They are also trying to fix them. See this pr and this issue.

@klenze
Copy link
Contributor

klenze commented Nov 24, 2023

My hot take on destructors:
I generally believe that cleaning up after things which are not likely to be instantiated within the event loop should be a low priority.

I so often read code in which people decide to just decide to call delete on any member variables which are (raw) pointers with no regard for ownership considerations. For Ucesb source, the only thing we might check is that the unpacker process is indeed dead. But I would hope that in any case, either processing stopped because the unpacker quit (out of data or error) or that it would try to continue to send data through the broken pipe which will also kill it.

Also, I see that you are already compromising with regard to using clear ownership semantics. Returning a raw pointer to a unique_ptr does not seem 100% kosher. Perhaps a shared_ptr would be more appropriate?

(Again, I do not think this matters at all. If no R3BReaders were ever deleted, that would also be an ok outcome for me.)

@YanzhaoW
Copy link
Contributor Author

Hi @klenze. Thanks for the comment

Also, I see that you are already compromising with regard to using clear ownership semantics. Returning a raw pointer to a unique_ptr does not seem 100% kosher. Perhaps a shared_ptr would be more appropriate?

I am not sure which case do you refer to exactly. Could you elaborate on the "compromising with regard to using clear ownership semantics"?

Regarding to returning a raw pointer from unique_ptr, it's actually a good practice to do that as long as there is no ownership transfer. I would say one of the biggest misunderstanding of C++ smart pointers is to use them everywhere. Old raw pointer is still perfect to use in non-owning situations.

In my experience, I very rarely use shared_ptr except when dealing with concurrency. In a synchronous programming, with a good design, somehow it can always determine the life-time of an object during the compile time and assign it to an owner, for which unique_ptr is enough. Shared_ptr is only needed when there is no way to determine the life-time of an object, such as the situation where an object is shared by multiple threads and you basically can't tell which thread will close first and which will close last (The one closes last should call the dtor of the object).

@YanzhaoW
Copy link
Contributor Author

Just one notice: The compile time error from the sofia folder is caused by rootcint as rootcint doesn't use the included path in cmake targets. See this discussion.

@YanzhaoW YanzhaoW marked this pull request as draft November 27, 2023 17:22
@YanzhaoW
Copy link
Contributor Author

Found ghost process from ucesb server unclosed. Under investigation.

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 27, 2023

Found ghost process from ucesb server unclosed. Under investigation.

Yikes. I thought that was repaired again in ucesb. (There was a problem where SIGPIPE got ignored; thus, the process did not die when an output pipe got closed for whatever reason at the other end (also crashes or Ctrl-C).) That should be these commits:

https://git.chalmers.se/expsubphys/ucesb/-/commit/7cc495a7757e40b1713e09ea2791110153b482a8
https://git.chalmers.se/expsubphys/ucesb/-/commit/62c42c7ed97d3e0f9cdd8cb4e7c0da1a660c89d8

This is the 'last defence' against ghosts. It is, of course, nice if one tries to close it more deliberately as well.

@YanzhaoW
Copy link
Contributor Author

I'm using the updated version. Would the server process die if it gets a SIGKILL signal?

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 28, 2023

I'm using the updated version. Would the server process die if it gets a SIGKILL signal?

I think nothing should be able to survive a SIGKILL. Only case that comes to mind is some filesystem hang. But then at least it would reasonably not use CPU any longer. Is it using CPU? Or if it is a zombie process.

@YanzhaoW
Copy link
Contributor Author

It's a zombie process. I found it because of some undeletable .nfs files with running processes that use no CPU power.

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 28, 2023

It's a zombie process. I found it because of some undeletable .nfs files with running processes that use no CPU power.

If it is a zombie, then its parent should still be alive, but have not yet asked for the zombie's exit status. If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

@YanzhaoW
Copy link
Contributor Author

I somehow can't reproduce this zombie process now. Maybe they were created before I added terminate for child process.

If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

Is that so for ucesb server? Normally an orphaned process will be assign to the init process (with PID 1) when its parent dies. How does ucesb handle this?

@inkdot7
Copy link
Contributor

inkdot7 commented Nov 29, 2023

If parent not alive, zombies are made children of pid 0, which typically asks for the exit status, to get rid of the zombies.

Is that so for ucesb server? Normally an orphaned process will be assign to the init process (with PID 1) when its parent dies. How does ucesb handle this?

My bad, I should have looked it up... As you say, PID 1, not 0. So described the sam as you: what the system does with zombies, ucesb cannot change that behaviour (only do its part by wait()ing for the process (get its exit status)).

Was the zombie a ucesb itself (i.e. the unpacker, which would be a child of r3broot), or was it the ext_... something (ext_struct_writer)? That latter would be a child of the unpacker process. The data flows from the unpacker to ext_... to r3broot in this case.

@YanzhaoW
Copy link
Contributor Author

I'm not sure. But I saw "Unexpected end of file from external writer." printout with yellow background when I killed the ghost processes by their PIDs.

But now I can't reproduce the ghost processes anymore. Even when I abort the R3BRoot in the middle of the run (Both close and terminate are not called), I still can't find any .nfs files leftover.

They may have existed for a long time when I run with old versions of ucesb.

@YanzhaoW YanzhaoW force-pushed the edwin_ucesbreader branch 4 times, most recently from 732fc56 to f5af579 Compare January 7, 2024 18:26
@YanzhaoW YanzhaoW marked this pull request as ready for review January 7, 2024 18:26
@YanzhaoW
Copy link
Contributor Author

YanzhaoW commented Jan 7, 2024

Completed and ready for further reviews. If there is no other issues, we could merge it in one or two days.

@YanzhaoW
Copy link
Contributor Author

Hi, @jose-luis-rs , it's ready to be merged.

@jose-luis-rs
Copy link
Contributor

Thanks @YanzhaoW, could you also do a PR on the fast branch?
Just to have the same code on both branches.

@YanzhaoW
Copy link
Contributor Author

Sure, it seems that Github can't do this automatically. By the way, what's the plan for the "fast branch"? Will you merge it back to dev branch after the experiment? If not, I don't need to worry about possible merge conflicts.

@jose-luis-rs
Copy link
Contributor

The idea is to delete the "fast branch" after the experiments and for this reason one must do the same PR in both branches, if the PR is accepted in the "fast branch" but proposed for improvements in the dev branch, the corresponding author should take care of it as soon as possible.
Anyhow, I think that the best is to do the same PR in both branches without changing the "title", just to avoid conflicts.

@jose-luis-rs jose-luis-rs merged commit 4e1f226 into R3BRootGroup:dev Jan 12, 2024
8 checks passed
@YanzhaoW YanzhaoW deleted the edwin_ucesbreader branch January 10, 2025 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants