
Darshan parser not generating any output #1000

Open
wadudmiah opened this issue Jul 23, 2024 · 6 comments

Comments
@wadudmiah

Hi,

I have generated the attached Darshan log file, but when I try to produce the text output with darshan-parser, I get the following errors and no output:

$ /home/kunet.ae/ku1324/darshan-3.4.0/bin/darshan-parser /l/proj/test0001/test_dir/darshan-logs/2024/7/23/ku1324@k_ben3027_1.darshan
Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

This is a run of the benchio benchmark (https://github.com/EPCCed/benchio/issues) on 4 nodes with 16 MPI processes per node (ppn=16), i.e. 64 MPI processes in total. However, Darshan profiles it correctly when I run the same benchmark on 2 nodes (ppn=16), i.e. 32 MPI processes. Strangely, it also works on 16 nodes (ppn=16), i.e. 256 MPI processes.

Any help will be greatly appreciated.

ku1324@k_benchio.x_id1172710-18348_7-23-42668-17459906300090053027_1.darshan.txt

@carns
Contributor

carns commented Jul 28, 2024

Hi @wadudmiah, can you confirm that no error messages are being produced at runtime?

Can you also tell us what sort of file system the log is being written to?
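
(For reference, something along these lines, run against the log directory from the original report, should show the file system type; the path below is just the one quoted above.)

$ df -T /l/proj/test0001/test_dir/darshan-logs
$ stat -f /l/proj/test0001/test_dir/darshan-logs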

This is an unusual error. It indicates that the gzip-compressed data contained in a portion of the log file cannot be parsed by libz. Usually, even if something goes wrong in Darshan itself, the generated data can at least be uncompressed before any problem interpreting it shows up.

I'm not sure what's wrong, but there may be some workarounds we can try that will help narrow down the problem. I'd like to confirm the questions above first, though, to get a better idea of the situation.

@Arun8765

Arun8765 commented Sep 5, 2024

I am facing the same issue, with Darshan version 3.4.4, OpenMPI version 4.1.2, and GCC/mpicc version 11.4.0. I was not able to get a reliable log, and I got two types of errors when I ran:

$ darshan-parser .darshan

The first one:

HEATMAP 44 16592106915301738621 HEATMAP_WRITE_BIN_2 0 heatmap:POSIX UNKNOWN UNKNOWN
HEATMAP 44 16592106915301738621 HEATMAP_WRITE_BIN_3 0 heatmap:POSIX UNKNOWN UNKNOWN
Error: unable to inflate darshan log data.
Error: failed to read module HEATMAP data from darshan log file.
HEATMAP 44 3668870418325792824 HEATMAP_F_BIN_WIDTH_SECONDS 0.100000 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_READ_BIN_0 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_READ_BIN_1 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_READ_BIN_2 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_READ_BIN_3 56053533965 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_WRITE_BIN_0 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_WRITE_BIN_1 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_WRITE_BIN_2 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
HEATMAP 44 3668870418325792824 HEATMAP_WRITE_BIN_3 940422246894996749 heatmap:MPIIO UNKNOWN UNKNOWN
Error: unable to inflate darshan log data.
Error: failed to read module HEATMAP data from darshan log file.
Error: failed to parse HEATMAP module record.

The second kind of error was:

Error: unable to inflate darshan log data.
Error: failed to read name hash from darshan log file.

The results vary based on my mpirun command.
When I run

$ mpirun --oversubscribe -n 32 ./my_executable_file

It gives me proper output 7 times out of 10, without any issues.

But when I run

$ mpirun --oversubscribe -n 8 ./my_executable_file

It gives me errors 19 out of 20 times.

Even though I follow exactly the same steps from compiling to running, one of these two errors pops up, and there is no pattern to which error occurs.
It seems likely that this is caused by some kind of race condition.
The errors appear for both compile-time instrumentation and runtime instrumentation.
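
(For context, runtime instrumentation here means preloading libdarshan.so into the application, roughly along these lines; the library path is only a placeholder for wherever Darshan is installed.)

$ mpirun --oversubscribe -x LD_PRELOAD=/path/to/darshan/lib/libdarshan.so -n 8 ./my_executable_file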

CPU information

Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx

@carns
Contributor

carns commented Sep 6, 2024

Actually, @wadudmiah's original bug report is probably addressed by #1002, which is available in origin/main and will be included in the next release. I didn't make the connection earlier.

@Arun8765 I'm not sure if your error is related or not. Are you able to share one of the log files that you are unable to parse?

@Arun8765

Arun8765 commented Sep 6, 2024

Hi @carns, here are the two Darshan files that throw the above errors when I use darshan-parser:

arunk_mpiParaio_id118913-118913_9-6-4541-16124893853151831010_1.darshan.txt
arunk_mpiParaio_id7147-7147_9-6-66677-18197439800644243110_1.darshan.txt

@shanedsnyder
Contributor

I just had a look at both example logs shared by @wadudmiah and @Arun8765, and I can confirm the reported parsing errors. Using Darshan's main branch doesn't appear to help, so these both look like new issues (and likely related ones, given that decompression failures like this have been rare in Darshan).

We are looking into the problem to see if we have any ideas/suggestions. In the meantime, if you can provide the following info it might be of some help:

  • Can you share ldd <your_executable> output so we can see what libraries are being used at runtime? (See the example below.)
  • If at all possible, can you boil this down to a really simple reproducer (i.e., a barebones MPI program or something well-known and easy to build/run, like IOR)? I could then try reproducing it myself using the additional details provided above about @Arun8765's software environment.
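
(For example, using the executable name from the earlier commands; the IOR options below are only illustrative, not a required configuration.)

$ ldd ./my_executable_file
$ mpirun --oversubscribe -n 8 ./ior -a MPIIO -b 16m -t 1m -o /tmp/ior_testfile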

@shanedsnyder
Contributor

We've not been able to reproduce this directly ourselves, but we did confirm from looking at each of your log files that they appear to contain corrupted data. The log data is compressed (without error) and then immediately appended to the Darshan log, so there is very little opportunity for it to be corrupted. One possibility, beyond a potential corner-case bug in Darshan itself that we don't yet understand, is a problem with the MPI-IO implementation and how Darshan uses it.

We have a few suggestions that might work around the issue:

  1. Use the DARSHAN_LOGFILE env var to force Darshan to write the log file to a different shared file system, if possible (see the example commands after this list).
  2. Unset all of Darshan's default MPI-IO hints using export DARSHAN_LOGHINTS=""; we've seen these hints expose bugs in MPI-IO implementations in the past, so that wouldn't be unprecedented.
  3. Use a different MPI implementation altogether -- at least one of you mentions using OpenMPI, so it might be worth trying MPICH or something else if at all feasible.
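
(A minimal sketch of how suggestions 1 and 2 might be applied; the log path under /scratch is only a hypothetical example of a different file system.)

$ export DARSHAN_LOGFILE=/scratch/$USER/test_run.darshan
$ export DARSHAN_LOGHINTS=""
$ mpirun --oversubscribe -n 8 ./my_executable_file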
