ENH: Python derived/accum interface #839
Conversation
Does that change things/help to start with? The darshan_derived_metrics struct (and anything that branches off of it) will be the crucial datatypes for the Python bindings, I would think. Happy to make some helpers if that particular struct is awkward for bindings, though.
Not really, I think the exposed accumulation functions are just too easy to segfault right now. Granted, the segfault may happen from misuse, but it should be possible to make segfaults effectively impossible with enough struct/type inspection guards that return useful feedback. For example, I isolated the segfault in the current version of this branch to this block of code:

```c
if(!mod_logutils[acc->module_id]->log_agg_records ||
   !mod_logutils[acc->module_id]->log_sizeof_record ||
   !mod_logutils[acc->module_id]->log_record_metrics) {
    /* this module doesn't support this operation */
    return(-1);
}
```

Only one of those three conditions is needed to trigger a segfault.
As for going out of bounds on the array index there, we could put a guard on the module id before it is used. I don't want to go overboard checking for rational values in that accumulator input struct, because at the end of the day it's C and a user can pass in whatever invalid pointer they want and it will segfault anyway as soon as we try to dereference the struct.
The memory is either getting freed between the create and inject calls because of CFFI/language barrier details, or some of the fancy pointer stuff involved is the problem.
* returning `-1` from a public API C function isn't sufficient to provide useful error information when working in Python/CFFI--it only tells you something went wrong if you check for a return code in the first place (which isn't normally done in Python anyway--normally you end execution at the error point and `exit()` with an appropriate error message)
* when there are multiple conditions that can trigger the `-1` return value, the situation is even worse: one literally has to sprinkle `printf` calls through the source to figure out what went wrong where
* as a compromise, I'll leave the `-1` approach in since that is quite common in standard `C`, but I'm going to add in prints to `stderr` so that Python can then intercept the `-1` and refer the user to `stderr`
* also, `darshan_accumulator_inject()` assumed that the `module_id` was reasonable because it was set/checked in `darshan_accumulator_create()`; however, my experience in darshan-hpcgh-839 was that the accumulator memory location can get freed, or not properly set at the CFFI boundary after calling the creation function, so I think the assumption that a previous function was called and worked perfectly is too fragile--I'm adding error handling to prevent a hard segfault on nonsense values of that structure member as a result (a sketch of the idea is below)
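A minimal sketch of the kind of guard described in the last two bullets (not the actual patch; `DARSHAN_MAX_MODS` is used here as a stand-in for whatever constant actually bounds the `mod_logutils[]` array):

```c
/* sketch only: reject out-of-range or unknown module ids before indexing
 * mod_logutils[], and say something on stderr so Python callers have a clue */
if(acc->module_id < 0 || acc->module_id >= DARSHAN_MAX_MODS ||
   !mod_logutils[acc->module_id]) {
    fprintf(stderr, "Error: invalid module id %d passed to accumulator routine.\n",
            acc->module_id);
    return(-1);
}
```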
Re: the potential for memory to be freed, there isn't any memory that the caller should be aware of that it could plausibly free. For example, in this function prototype:
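(The prototype in question, approximately -- see darshan-logutils.h for the authoritative signature:)

```c
/* approximate reproduction of the darshan-logutils.h declarations; check the
 * header for the exact types */
typedef struct darshan_accumulator_st* darshan_accumulator;   /* opaque handle */

int darshan_accumulator_create(darshan_module_id id,
                               int64_t job_nprocs,
                               darshan_accumulator* new_accumulator);
```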
the accumulator argument is an opaque type as far as the caller is concerned; there isn't anything behind it that the caller allocates or needs to free.

I'll have a look at the error handling PR later today; I just wanted to clarify that point. I don't know how Python bindings work really, but I could imagine that an integer type might possibly be easier to handle than an opaque type; if that's a change that would be useful then we can update that.
There may be some things that are surprising re: memory ownership from https://cffi.readthedocs.io/en/latest/using.html, though I'm not clear if that's actually biting us here.
I figured exposing a single function that calls the create, inject, emit, and destroy functions internally, keeping the memory handling/passage of implementation details/"private structs" completely concealed on the C side, would reduce the likelihood of complications, though I'm not sure that's worth doing at this point. It is almost certainly way easier for me to do this in Cython, because it will just produce a C file that passes things around in C like you expect them to be, but then I'd have to contend with a debate about introducing another dependency/technology. The fix on the bindings side is probably some trick about first allocating a pointer, then a pointer to that pointer, or that kind of thing. CFFI certainly wasn't my choice, but it is what I have to work with at the moment. I basically end up iterating with stuff like the following.
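(A rough sketch of that pointer-then-handle dance; `ffi` and `libdutil` are the objects used in `cffi_backend.py`, while `mod_idx`, `nprocs`, and `rec_buf` are placeholders for this illustration:)

```python
# hypothetical CFFI sketch -- not the code in the branch
from darshan.backend.cffi_backend import ffi, libdutil  # assuming these are importable

acc = ffi.new("darshan_accumulator *")  # storage the C library fills with an opaque handle
ret = libdutil.darshan_accumulator_create(mod_idx, nprocs, acc)
# later calls take the handle itself rather than the address we allocated above:
ret = libdutil.darshan_accumulator_inject(acc[0], rec_buf, 1)
```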
I took a closer look at this and found a couple of issues:

I think the existing definitions in the CFFI layer are part of the problem.

I've got to take off for now, but I can clean up and push code tomorrow that fixes 1.
Indeed, see: #840 -- I'd prefer to know not only that there was an error, but why, so I tried to make that clearer over there.
I did intentionally mutate some of the prototypes to deal with other errors, for now. As I understand it, I've even exposed some additional functionality along the way.

Also, I still think this is a pain point: why does it make sense to have the user of the API manage 4 functions instead of wrapping them in 1, where the internal handling of inter-function data passage is managed privately? i.e., "give me the derived metrics data for this module", rather than: create, did create work? inject... did inject work? emit... did emit work? destroy... did destroy work?

A mock-up might be something like this:

```c
int log_get_derived_metrics(fd log_file, char* mod_name, struct derived_metrics) {
    // the darshan_accumulator struct is effectively a private implementation detail
    // that is not part of the public interface of this function; we only ask the user
    // to provide a file handle for a log file, and a module name for which they wish
    // to retrieve the derived metrics struct--CFFI will own the memory of the derived_metrics
    // struct--this C function is only allowed to assign to its structure members

    // call the create, inject, emit, and destroy accum functions as needed, passing whatever
    // private structures are needed between them without bothering the consumer of the
    // interface with these details; if a failure occurs, try to return a non-zero exit code, preferably
    // one that is unique to the failure type so Python can make sensible decisions based on it, and
    // also try to write to `stderr` when this happens
}
```

You may need to add an argument for the number of records you wish to accumulate if there are compelling use cases for not using them all, etc. I guess my contention here is that if we need to be doing a bunch of shimming and error checking, we may as well just do it in C and expose the thing we actually want to call. Popping in and out of the CFFI language boundary isn't very fluid, and IMO should be restricted to just the most crucial calls that give us what we want.
I think the main driver for this decision is that this accumulator API is envisioned for more use cases than just "accumulate every record in this module". Otherwise, I agree your proposed changes make sense for that workflow. You could imagine users wanting to accumulate metrics for different reasons, e.g., for particular subsets of a module's records.

So, the answer is really just that the API allows for a lot of flexibility right now. I do agree that we could hide a lot of the back-and-forth in a helper function to simplify the "give me everything for this module" approach, especially since we'll be using that in the job summary tool. We should probably still offer bindings for all functions though, in case PyDarshan users want more control over what records to accumulate on.
I just pushed a commit that appears to correctly create an accumulator and inject into it, using the opaque types Darshan defines in darshan-logutils.h. You still ultimately get a segfault for the existing tests, since there is no error handling currently to limit to modules that actually implement an accumulator, and the log under test includes the Lustre module (which does not implement an accumulator). Not sure what the cleanest way to handle that in PyDarshan is, but maybe rather than maintaining a separate list of supported modules, we just catch any error and gracefully skip the module for now? I agree that ideally the darshan-util library would have actual return codes that PyDarshan could try and catch, but that's a more involved change -- I think we can get by just by catching general errors for now.
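A minimal sketch of that catch-and-skip idea, assuming the Python wrapper converts a nonzero return code into a Python exception (as the `RuntimeError` in the diff further down does):

```python
# hypothetical: skip modules (e.g. LUSTRE) that don't implement an accumulator,
# rather than maintaining a separate list of supported modules
try:
    derived = log_get_derived_metrics(log_path, mod_name)
except RuntimeError:
    derived = None  # leave this module out of the summary for now
```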
Thanks @shanedsnyder. Re: the complexity of the accumulator API, that's right that the intent was to be able to run it on more granular subsets than the entire module (and darshan-parser already uses that functionality). We could make a helper function to hide this for sure, but there is a potential performance downside to hiding the individual records; it would require an additional pass over the log file. The way it's set up right now, you can read a record and do whatever you would like with it in addition to passing it through an accumulator (and the accumulator calculation overhead is small). The darshan-parser utility also takes advantage of that; no matter which options you select, it only does one linear pass through a log file. If the Python tools can do something similar (use a single pass both to retrieve individual records and to do derived metric accumulation), then we might want to keep the same setup. If the Python tools are going to go back to the log separately for the derived metrics anyway, then we can just wrap it in a helper.
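For the record, a rough sketch of what that single-pass pattern could look like on the Python side (the record-fetching call and its return convention are assumptions here, and `process_record` is just a hypothetical per-record hook):

```python
# hypothetical single-pass sketch: hand each record to per-record analysis and
# inject it into the accumulator in the same loop, avoiding a second read of the log
acc = ffi.new("darshan_accumulator *")
libdutil.darshan_accumulator_create(mod_idx, nprocs, acc)
buf = ffi.new("void **")
while libdutil.darshan_log_get_record(log_handle, mod_idx, buf) > 0:
    process_record(buf[0])                                   # per-record analysis
    libdutil.darshan_accumulator_inject(acc[0], buf[0], 1)   # accumulate as we go
```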
Force-pushed from a6cadf4 to ae5b0c1.
In case it helps, with this diff I get a bit farther, past the POSIX and to MPI-IO:

```diff
--- a/darshan-util/pydarshan/darshan/backend/cffi_backend.py
+++ b/darshan-util/pydarshan/darshan/backend/cffi_backend.py
@@ -705,9 +705,10 @@ def log_get_derived_metrics(log_path: str, mod_name: str):
     print("after inject")
     darshan_derived_metrics = ffi.new("struct darshan_derived_metrics *")
     print("before emit")
+    buf_agg = ffi.new("char[81920]")
     r = libdutil.darshan_accumulator_emit(darshan_accumulator[0],
                                           darshan_derived_metrics,
-                                          rbuf[0])
+                                          buf_agg)
     if r != 0:
         raise RuntimeError("A nonzero exit code was received from "
                            "darshan_accumulator_emit() at the C level. "
@@ -716,4 +717,7 @@ def log_get_derived_metrics(log_path: str, mod_name: str):
                            "stream.")
     print("after emit")
     #libdutil.darshan_accumulator_destroy(darshan_accumulator)
+    print("darshan_derived_metrics:", darshan_derived_metrics)
+    print("darshan_derived_metrics.total_bytes:", darshan_derived_metrics.total_bytes)
+    print("darshan_derived_metrics.unique_io_total_time_by_slowest:", darshan_derived_metrics.unique_io_total_time_by_slowest)
```
* this is a Python/pandas-only version of doing some simple derived metrics; I don't think we'll actually do this, but I was exploring a bit because of the difficulties in darshan-hpcgh-839 (see the sketch below for the flavor of it)
* this matches pretty well with the `perl` based reports for total bytes, but even simple cases can sometimes disagree on bandwidth per darshan-hpcgh-847, so now I'm curious what is going on
* one problem with doing this is that we'd have the same algorithms implemented in two different languages; the advantages include:
  - not reading all the records in a second time, crossing the CFFI boundary each time
  - easier to debug/maintain because of bounds checking/no segfaults, etc.
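For flavor, the pandas-only math is along these lines (a sketch only, assuming a `report` object with the POSIX records already read in; the counter names are the standard POSIX module counters):

```python
# hypothetical pandas-only total-bytes calculation, not the code in gh-848
posix_counters = report.records["POSIX"].to_df()["counters"]
total_bytes = int((posix_counters["POSIX_BYTES_READ"] +
                   posix_counters["POSIX_BYTES_WRITTEN"]).sum())
```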
I may test the waters with moving the "wrapper" to C itself, a bit like copying/modifying what is done in `darshan-parser`.

A Python-only approach is in #848, but I don't think we'll rewrite the algorithms beyond the prototype there at this point, though it does have potential advantages someday in the future perhaps.
I pushed some fixes just now that have the test working for me.

C memory leak checkers report a lot of leaks running this test, but they're probably false positives because the checkers don't understand Python memory management.
I started expanding the code and testing to compare against the strings from the old Perl report, but it looks like only a subset of modules/cases currently reproduce the numbers in the old report--you can see that in the test cases.

My initial question is whether I should be able to easily reproduce strings like the ones below using only the data present in the derived metrics struct I'm getting back, in all cases/for all reports and modules. Even for some of the simple cases that do reproduce here, I'm having to go back and use Python/pandas to do arithmetic on the report record data, which brings me back to the concerns raised in the Python version of this PR (gh-848)--if I can't cherry-pick most of what I need from the derived structs, and still have to encode some complex logic on what is in the records, what am I gaining through the structs? At the moment all I get is the total bytes, which is trivial to extract, it seems.

Hopefully I'm just missing something, and all the values I want, like the bandwidth in MiB/s, can be directly cherry-picked from the structs, or at least reconstituted with some simple math, without complex logic on a per-module basis or needing switches for shared and unique I/O in my math code.
Hi @tylerjereddy (sorry for the delay, was busy at SC22) -- you should be able to use this single value directly: https://github.com/darshan-hpc/darshan/blob/main/darshan-util/darshan-logutils.h#L367

It should match the Perl report exactly, because that report is relying on the same C code, by way of invoking the darshan-parser utility as a shell command. (And we should continue to make sure to mark the value as an "estimate", because the true bandwidth is tricky to represent for some applications, particularly if they are doing truly uncoordinated I/O.)
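In other words, something along these lines should be all that's needed (a sketch; the field names are assumed from the struct definition linked above, and `total_bytes` also appears earlier in this thread):

```python
# hypothetical sketch: build the summary string straight from the derived-metrics
# struct rather than redoing the math in pandas (field names assumed)
total_mib = darshan_derived_metrics.total_bytes / 2 ** 20
mib_per_s = darshan_derived_metrics.agg_perf_by_slowest
print(f"I/O performance estimate (at the {mod_name} layer): "
      f"transferred {total_mib:.1f} MiB at {mib_per_s:.2f} MiB/s")
```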
Force-pushed from fbb726e to 3714795.
Ok, this has some substantial updates now. I've added comments to the branch for sections where discussion may still be needed. I'll try to highlight a few points below as well:
Thanks @tylerjereddy. On the first point (re: shims and the POSIX/MPI-IO bytes discrepancy), let's keep it simple, remove the special-case handling, and have each module report its own statistics (if present) in all cases. I'm not sure why the old report would have been reporting the POSIX bytes in the MPI-IO summary; I'm guessing that was a mistake in the old code. The io-imbalance.darshan log definitely shows different volumes at the two levels, perhaps because MPI-IO elided some overlapping bytes or because one module hit the memory limit at a different time from the other. At any rate, that would be the correct thing to report from an analysis-tools perspective.

On the second point (re: text color), I agree. Might be nice to have it stand out a little since it is a key piece of information, maybe blue or something, but red has a bad connotation and we aren't necessarily reporting bad information here :)

I don't know about the 3rd point (APMPI), but I wanted to go ahead and respond about the other two. Is this issue particular to the accumulator / derived metric stuff? Or a more general question?
The old Perl reports only show POSIX estimates if there is no MPI-IO data in the log, so it shouldn't ever report estimates for both MPI-IO and POSIX for the same log. I think the original reasoning was that apps that use MPI-IO might naturally just prefer MPI-IO level metrics, since they are closer to what is observed by the app. I agree it's confusing in hindsight -- no reason we shouldn't just provide estimates at all layers.

What's up with the reported byte values in the most recent report screenshots? I had to double check the log to confirm, but Phil is right that total bytes moved are different for MPI-IO and POSIX for that log.
Yes, I treated the Perl report as the Oracle of all truth and added some spaghetti to do what it did--sounds like you and Phil are giving me the ok to remove the shims and just report by module, which will give you that difference.
Yeah, ideally the accumulator API just knows that AutoPerf modules aren't supported and returns 0 for them as well, but I guess that's not the case? There is some AutoPerf data in the log in question.
* a number of additional `MPI-IO` and `STDIO` test cases were added from the logs repo to `test_derived_metrics_bytes_and_bandwidth()`
* for the `MPI-IO` cases to pass, special casing was added to `log_get_bytes_bandwidth()` such that `total_bytes` is actually extracted from `POSIX`
* removed an invalid `darshan_free()` from `log_get_derived_metrics()`--the `buf` object didn't even exist at that point in the control flow
* add a `LUSTRE` test case, which raises a `RuntimeError` as expected
* add a tentative `POSIX` test case, which reports a bandwidth string at the Python level, but is not included in the Perl summary reports...
* when `log_get_derived_metrics()` receives a module name that doesn't exist in the log file it received, it will now raise a `ValueError` for clarity of feedback
* update `test_derived_metrics_bytes_and_bandwidth()` accordingly, and also start regex matching on expected error messages in this test
* add the bandwidth summary string to the Python report proper, and include a test for the presence of this string in logs repo-based summary reports
* add one of the tricky `APMPI` cases I discovered to `test_derived_metrics_bytes_and_bandwidth()`, pending discussion with team re: how I should handle this
* adjust tests to more closely match `darshan-parser` instead of the old Perl report in cases where MPI-IO and POSIX are both involved; this allows me to remove the weird MPI-IO shim in `log_get_bytes_bandwidth()`
* the bandwidth text in the Python summary report is now colored "blue," along with a regression test, based on reviewer feedback
* added `skew-app.darshan` log to `test_derived_metrics_bytes_and_bandwidth()`--we get the same results as `darshan-parser`
* replaced the `xfail` for `e3sm_io_heatmap_only.darshan` with an expected `KeyError` when handling `APMPI` (this should already be handled gracefully/ignored by the Python summary report)
* the testsuite now always uses `DarshanReport` with a context manager (see the sketch below) to avoid shenanigans with `__del__` and garbage collection/`pytest`/multiple threads
* this appears to fix the problem with testsuite hangs described in darshan-hpcgh-839 and darshan-hpcgh-851
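A minimal sketch of that context-manager pattern (the log name is taken from elsewhere in this thread; `read_all` and the `records` accessor are assumed to behave as in current PyDarshan):

```python
# hypothetical usage sketch, not a snippet from the testsuite
import darshan
from darshan.log_utils import get_log_path

with darshan.DarshanReport(get_log_path("imbalanced-io.darshan"), read_all=True) as report:
    posix_counters = report.records["POSIX"].to_df()["counters"]
# the report (and its CFFI resources) are cleaned up on exit rather than in __del__
```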
Force-pushed from 2419408 to b89884f.
* `cffi_backend` module changes requested from PR review:
  - remove a spurious `darshan_free` from `_log_get_heatmap_record()`
  - fix the scoping of the `darshan_free` of the `buf` object used with `darshan_accumulator_inject` in `log_get_derived_metrics`
  - add a missing `log_close()` to `log_get_derived_metrics` (maybe we can wrap in Python contexts in the future though)
  - use a separate buffer for `darshan_accumulator_emit()` inside `log_get_derived_metrics`
* note that making the above CFFI/free-related changes caused a segfault in the testsuite, so in the end I adjusted the location of the memory freeing as I saw fit to avoid segfaults--I'd say at this point please provide concrete evidence with a memory leak plot or failing test for additional adjustments there, or just push the change in
* in the end, there is a slightly more concise usage of `darshan_free()` but no meaningful change in the free operations
* I also reverted the suggested change to `darshan_accumulator_emit()` usage--there was no testable evidence of an issue, and it was also causing segfaults..
* address many of the discussion points that came up in darshan-hpcgh-868:
  - `log_get_derived_metrics()` now uses an LRU cache (see the sketch after this list), which effectively means that we use memoization to return derived metrics data rather than doing another pass over the log file if the same log path and module name have already been accumulated from; we still need to pass over a given log twice in most cases--once at initial read-in and once for `log_get_derived_metrics`; how we decide to add filtering of records prior to accumulation in the Python interface is probably a deeper discussion/for later
  - `log_get_bytes_bandwidth()` and its associated testing have been migrated to modules not named after "CFFI", as in the above PR, because I think we should only use the "CFFI"-named modules for direct CFFI interaction/testing, and for other analyses we should probably use more distinct names. Also, to some extent everything depends on the CFFI layer, so trying to restrict "CFFI" modules to direct rather than indirect interaction will help keep them manageably sized, especially given the proclivity for surprising memory issues/segfaults in those parts of the code.
  - add a proper docstring with examples for `log_get_bytes_bandwidth()`
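For reference, the memoization amounts to something like this (a sketch, not the exact decorator arguments used in the branch):

```python
import functools

@functools.lru_cache(maxsize=None)
def log_get_derived_metrics(log_path: str, mod_name: str):
    # open the log, accumulate the module's records, and return the derived
    # metrics; repeat calls with the same (log_path, mod_name) pair hit the
    # cache instead of doing another pass over the log file
    ...
```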
Force-pushed from b89884f to 96806fa.
Revisions pushed in, from commit message:

Memory pressure check with:

```python
from darshan.log_utils import get_log_path
from darshan.lib.accum import log_get_bytes_bandwidth

for i in range(10_000_000):
    log_path = get_log_path("imbalanced-io.darshan")
    log_get_bytes_bandwidth(log_path, "POSIX")
```
Changes look good. My bad on the wild goose chase on pointers/segfaults. I'm not sure exactly how the code changed from previous commits, but I think what I see there now makes sense. I see that you are allocating the initial record pointer ahead of the loop, but what I may have missed is that the C library will only allocate the memory on the very first call.

I'll have another look at #868 before approving/merging this just to be safe.
* move `log_get_bytes_bandwidth()` out of the CFFI module to darshan/lib/accum.py
* adopt LRU cache and other cleanup tweaks for `log_get_derived_metrics()`
I just approved, but I'll wait to merge until double checking some things in #868.
Change of plans. I just merged #886 into |
We replaced this with #898 and merged into main, so closing this.
For now this definitely doesn't work, and it feels like I'm basically reconstituting a C control flow in CFFI/Python instead of using a sensible exposure point between C and Python to pull out populated structs from a single entry point.
Perhaps folks can just help me sort out the current issues noted in the source changes rather than providing a convenient API, though once thorough regression tests are in place that might be something to consider in the future... (or even just maintaining it in pandas/Python someday if the performance is ~similar)