BUG, MAINT: log with negative heatmap bin values (from Slack) #943
Comments
It looks like there may be another problem with this log (either something orthogonal, or possibly the root cause of record corruption elsewhere). My first attempt to parse the log manually using a current origin/main build produced the following:
My Darshan build is using gcc 12.2 and address sanitizer. Address sanitizer is triggering a segfault while darshan-parser is extracting Lustre module data, which happens before it gets to the heatmap module. I can't validate yet, but it's possible that memory corruption is leading to the bogus heatmap data. I'll try hacking up the parser to extract only the heatmap data and see what it looks like.
Ok, that was a red herring. When I hacked up darshan-parser to skip straight to the heatmap module I definitely see negative values:
It looks like a little over 30 bins have negative values.
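For anyone wanting to repeat that count on the Python side, here is a minimal sketch of scanning a sequence of bin values for negatives. The helper name `find_negative_bins` and the example values are hypothetical, made up for illustration; they are not taken from the log and not part of PyDarshan.

```python
def find_negative_bins(bins):
    """Return (index, value) pairs for every negative entry in a sequence of bin values."""
    return [(i, v) for i, v in enumerate(bins) if v < 0]


# Made-up values for illustration only, not taken from the log.
example_bins = [0, 4096, -9223372036854775000, 8192, -1]

for index, value in find_negative_bins(example_bins):
    print(f"bin {index}: {value}")
```

Something like this could be applied to whatever sequence the debug print of `rec['write_bins']` shows, assuming it can be iterated as plain integers.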
I tried this little test program to see what the max value of an int64 counter is:
The negative values here look like they are very close to the most negative value representable in a signed 64-bit integer. It's impossible for the application to have accessed enough data to simply wrap around, though. This has to be a bug in Darshan or some sort of corner case in the function arguments we are tracking.
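As a rough check on the wrap-around argument, the arithmetic can be sketched in Python. This is not the test program mentioned above; the helper names `wrap_int64` and `bytes_to_wrap` are hypothetical, and the sketch only assumes the counter is a signed 64-bit integer with two's complement wrap-around.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def wrap_int64(value):
    """Emulate two's complement wrap-around of an arbitrary integer into int64 range."""
    return (value + 2**63) % 2**64 - 2**63


def bytes_to_wrap(start=0):
    """Bytes that must be added to a non-negative int64 counter before it passes INT64_MAX."""
    return INT64_MAX - start + 1


print("INT64_MAX:", INT64_MAX)
print("INT64_MIN:", INT64_MIN)
# One byte past the maximum lands on the most negative value.
print("INT64_MAX + 1 wraps to:", wrap_int64(INT64_MAX + 1))
# Starting from zero, a bin would need 2**63 bytes (8 EiB) to wrap.
print("EiB needed to wrap from 0:", bytes_to_wrap() / 2**60)
```

A single bin would need on the order of 8 EiB of accumulated I/O to wrap from zero, which is why plain accumulation looks like an implausible explanation.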
Another piece of the puzzle: all of these faulty values are in the stdio heatmap:
I think we might need a runtime reproducer to track this down. I'd like to see what the arguments look like on the way in to the
Nafiseh also provided a log from a different run that does not exhibit this problem (this is again with the parser hacked to only show the heatmap data):
So this is possibly an application problem, but it would still be good to know if there was something we could have done to prevent bogus data from getting to the log or if it was some sort of uncontrollable problem (like a memory corruption).
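To make the "something we could have done" idea concrete, here is a purely illustrative sketch of a sanity check on a bin update. It is written in Python for brevity and does not reflect how darshan-runtime actually updates heatmap bins; `add_to_bin` and its arguments are hypothetical names.

```python
INT64_MAX = 2**63 - 1


def add_to_bin(current, nbytes):
    """Return current + nbytes, or raise ValueError if the update looks bogus.

    Hypothetical guard: rejects negative sizes and updates that would not
    fit in a signed 64-bit counter.
    """
    if nbytes < 0:
        raise ValueError(f"negative byte count: {nbytes}")
    if current > INT64_MAX - nbytes:
        raise ValueError("update would overflow a signed 64-bit bin")
    return current + nbytes


print(add_to_bin(4096, 8192))
```

A check of this kind would only catch bad arguments on the way in; it would not help if the bins were damaged afterwards by memory corruption.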
This log was provided by Nafise Moti on Slack, with debug prints in the `darshan/datatypes/heatmap.py` code suggesting we get negative heatmap bin values with `print(rec['write_bins'])`.
moti_nek5000.tar.gz
I asked to see if they could regenerate the log with the latest `darshan-runtime`, in case it is simply an issue with the older runtime. If it is reproducible with the newer runtime, then it may be interesting to diagnose on the runtime side.
However, the initial log above is probably enough for us to start looking for evidence of, e.g., overflow producing large negative numbers, and to see whether `darshan-parser` agrees with the PyDarshan bindings.