WIP, ENH: more efficient hmaps #801
* even if we add a switch to disable DXT data analysis for Python summary reports, runtime `HEATMAP` data can still explode the memory footprint in cases where there are many IO-inactive ranks, because the current Python code adds explicit all-zero rows for the full set of ranks
* this branch removes the requirement to have explicit all-zero rows added for IO-inactive ranks, for both DXT and runtime `HEATMAP`, and the performance improvements in our `asv` suite below are clear, both on memory footprint and timing
* the caveat is that, although the testsuite does pass, the new plotting strategy will require some more work, so this is just to get things rolling/WIP with TODO list similar to:
  - [ ] adjust the new scattering approach to use some kind of custom markers to fit the bin sizes, instead of fixed-size scatter markers
  - [ ] check the plotting/testing-related changes carefully; most likely comparing the HTML heatmap results for a large number of logs once things are more refined
  - [ ] restore the colorbar label to the heatmap plots; apparently the presence of this bar is not tested for in the suite
* adjust `test_get_heatmap_df` for more efficient single-rank heatmap data structures (no more explicit all-zero ranks!)
* adjust `test_set_y_axis_ticks_and_labels` for new heatmap plotting approach
np.asarray(hmap_df.index))
# x and y both have shape (active_ranks, xbins)
# rather than (nprocs, xbins)
hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker="s")
Either for my own reference, or if, e.g., a student comes back to look at this later: one idea might be a custom rectangle/marker, if we could control the dimensions with sufficient granularity. I believe that since the number of ranks and the bin widths are fixed for a given plot, we should only have to do the calculation once and then be able to reuse that for each "scatter marker."
Maybe something similar to: https://stackoverflow.com/a/58552620/2942522
If we can sort that out, I think it is worth it, since we potentially eliminate thousands of float64 `0`s, in case that is not clear.
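To make the "compute the cell geometry once and reuse it" idea concrete, here is a minimal sketch. It swaps in a different technique from the one in this branch: instead of custom scatter markers, it draws one `Rectangle` per (active rank, time bin) cell in data coordinates via a `PatchCollection`, so each cell is exactly one bin wide and one rank tall regardless of figure size. The helper name `draw_sparse_heatmap` and the toy `active_ranks`/`values` arrays are hypothetical stand-ins for the index and contents of `hmap_df`; nothing here is taken from the PyDarshan code base.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PatchCollection
from matplotlib.colors import LogNorm
from matplotlib.patches import Rectangle


def draw_sparse_heatmap(ax, active_ranks, values, nprocs):
    """Draw one data-space rectangle per (active rank, time bin) cell.

    Hypothetical helper: `active_ranks` plays the role of `hmap_df.index`
    and `values` the role of the `hmap_df` contents.
    """
    n_active, xbins = values.shape
    # every cell is 1 bin wide and 1 rank tall in data units, so the
    # geometry is decided once and reused for all cells
    patches = [
        Rectangle((col, rank), width=1.0, height=1.0)
        for rank in active_ranks
        for col in range(xbins)
    ]
    cells = PatchCollection(patches, cmap="YlOrRd", norm=LogNorm())
    # row-major flattening matches the rank-then-bin order of `patches`;
    # zeros are masked because a log color scale cannot represent them
    flat = values.ravel().astype(float)
    cells.set_array(np.ma.masked_less_equal(flat, 0))
    ax.add_collection(cells)
    ax.set_xlim(0, xbins)
    ax.set_ylim(0, nprocs)
    return cells


fig, ax = plt.subplots()
active_ranks = np.array([0, 7, 42])  # only ranks that did IO
values = np.array([[1.0, 10.0, 0.0, 5.0],
                   [2.0, 0.0, 0.0, 8.0],
                   [0.0, 3.0, 4.0, 100.0]])
cells = draw_sparse_heatmap(ax, active_ranks, values, nprocs=64)
fig.colorbar(cells, ax=ax, orientation="vertical")
```

Inactive ranks never appear in `patches`, so no all-zero rows are needed. Whether this scales acceptably to logs with many active ranks and bins is untested here.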
The diff below the fold produces heatmaps that are perhaps a bit closer, but I need to dig a bit deeper on differences in more complex cases.

--- a/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
+++ b/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
@@ -23,6 +23,7 @@ npt = MagicMock()
import pandas as pd
import seaborn as sns
+import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
@@ -383,8 +384,6 @@ def plot_heatmap(
hmap_df = heatmap_handling.get_heatmap_df(agg_df=agg_df, xbins=xbins, nprocs=nprocs)
elif mod == "HEATMAP":
hmap_df = report.heatmaps[submodule].to_df(ops=ops)
- # mirror the DXT approach to heatmaps by
- # adding all-zero rows for inactive ranks
xbins = hmap_df.shape[1]
# build the joint plot with marginal histograms
@@ -401,9 +400,27 @@ def plot_heatmap(
x, y = np.meshgrid(np.arange(xbins),
np.asarray(hmap_df.index))
+ print("x:", x)
+ print("y:", y)
+ print("hmap_df:", hmap_df)
# x and y both have shape (active_ranks, xbins)
# rather than (nprocs, xbins)
- hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker="s")
+ print(hmap_df.columns)
+ bin_1 = hmap_df.columns[0]
+ tmax, runtime = determine_hmap_runtime(report=report)
+ bin_width = (x[0][1] - x[0][0])
+ print("bin_width:", bin_width)
+ bin_height = (xbins / nprocs) * bin_width
+ custom_path = [[0, 0],
+ [0, bin_height],
+ [bin_width, bin_height],
+ [bin_width, 0],
+ [0,0]]
+ marker_width_pixels, marker_height_pixels = jgrid.ax_joint.transData.transform((bin_width, bin_height))
+ size = marker_width_pixels * marker_height_pixels * (12/16) * (12/16) / 2
+ print("width (pixels):", marker_width_pixels)
+ print("height (pixels):", marker_height_pixels)
+ hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker=custom_path, s=size)
jgrid.ax_joint.set_ylim(0, nprocs)
jgrid.fig.colorbar(hmap, ax=jgrid.ax_joint, orientation='vertical')
@@ -445,9 +462,9 @@ def plot_heatmap(
align="edge",
)
- tmax, runtime = determine_hmap_runtime(report=report)
# scale the x-axis to span the calculated run time
xbin_max = xbins * (runtime / tmax)
+ print("xbin_max:", xbin_max)
jgrid.ax_joint.set_xlim(0.0, xbin_max)
# set the x and y tick locations and labels using the runtime
set_x_axis_ticks_and_labels(jointgrid=jgrid, tmax=runtime, bin_max=xbin_max, n_xlabels=4)
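One detail that may matter for the marker-size calculation in the diff above: `transData.transform` applied to a single point returns that point's position in display (pixel) space, not the extent of a box. A minimal sketch of one way to get an extent is to transform two corners and subtract, so the axes offset cancels. The helper name `data_extent_in_points` is hypothetical, and the axis limits used are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def data_extent_in_points(ax, width, height):
    """Convert a width/height in data units into typographic points.

    Hypothetical helper; the axis limits must already be final when this
    is called, because `transData` depends on them.
    """
    # transform two corners and subtract, so the axes offset cancels
    (x0, y0), (x1, y1) = ax.transData.transform([(0.0, 0.0), (width, height)])
    px_per_point = ax.figure.dpi / 72.0  # 1 point is 1/72 inch
    return abs(x1 - x0) / px_per_point, abs(y1 - y0) / px_per_point


fig, ax = plt.subplots(figsize=(6, 4), dpi=100)
ax.set_xlim(0, 200)  # e.g. the number of x bins
ax.set_ylim(0, 64)   # e.g. nprocs
w_pt, h_pt = data_extent_in_points(ax, width=1.0, height=1.0)
```

Since the `s` argument of `scatter` is documented in points squared, working in points rather than pixels may remove the need for the `(12/16)` factors in the diff; that is a guess about their purpose, not something stated in this thread. Note also that the diff calls `set_ylim` and `set_xlim` after computing the size, so the transform there may reflect limits that later change.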
* even if we add a switch to disable DXT data analysis for Python summary reports (see: ENH: use threshold to avoid plotting DXT heatmaps in logs with DXT and HEATMAP data #729), runtime `HEATMAP` data can still explode the memory footprint in cases where there are many IO-inactive ranks, because the current Python code adds explicit all-zero rows for the full set of ranks
* this branch removes the requirement to have explicit all-zero rows added for IO-inactive ranks, for both DXT and runtime `HEATMAP`, and the performance improvements in our `asv` suite below are clear, both on memory footprint and timing (larger log files can have other bottlenecks as well, and there are other PRs open related to some of these)
* the caveat is that, although the testsuite does pass, the new plotting strategy will require some more work, so this is just to get things rolling/WIP with TODO list similar to:
  - [ ] adjust the new scattering approach to use some kind of custom markers to fit the bin sizes, instead of fixed-size scatter markers (or pivot to another suitable plotting strategy with all-zero rows absent)
  - [ ] check the plotting/testing-related changes carefully; most likely comparing the HTML heatmap results for a large number of logs once things are more refined
  - [ ] restore the colorbar label to the heatmap plots; apparently the presence of this label is not tested for in the suite
* adjust `test_get_heatmap_df` for more efficient single-rank heatmap data structures (no more explicit all-zero ranks!)
* adjust `test_set_y_axis_ticks_and_labels` for new heatmap plotting approach
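To illustrate the scale of the saving described above, here is a minimal sketch that compares a dense `(nprocs, xbins)` float64 array against one holding rows only for IO-active ranks. The job size, bin count, and active ranks are made-up values chosen for illustration, not measurements from any Darshan log, and plain NumPy arrays stand in for the heatmap DataFrames.

```python
import numpy as np

nprocs, xbins = 4096, 200               # hypothetical job size and bin count
active_ranks = np.array([0, 17, 1024])  # hypothetical IO-active ranks

# dense layout: one float64 row per rank, almost all of them zeros
dense = np.zeros((nprocs, xbins), dtype=np.float64)
# compact layout: rows only for the ranks that actually did IO
compact = np.zeros((active_ranks.size, xbins), dtype=np.float64)

# the plotting coordinates follow the compact layout as well, giving
# shape (active_ranks, xbins) rather than (nprocs, xbins)
x, y = np.meshgrid(np.arange(xbins), active_ranks)

print(dense.nbytes)    # 4096 * 200 * 8 bytes
print(compact.nbytes)  # 3 * 200 * 8 bytes
print(x.shape, y.shape)
```

In this toy case the dense array is larger by a factor of `nprocs / active_ranks.size`; real logs will differ, and the `asv` results are the authoritative numbers.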
benchmarking with `asv continuous -e -b ".*heatmap.*" main treddy_heatmap_rank_efficiency`:

To give reviewers an idea of the visual differences with the new plotting approach, two comparisons against the `main` branch are below. Maybe this is something I can come back to, or we could eventually find a student to drive forward a bit more. I think what we're seeing is pretty much entirely the result of using a fixed scatter marker "square" instead of filling areas based on widths of bins/ranks on the axes. That means that some data looks "too big" and other times "too small," sometimes simultaneously in both axis directions.

For `treddy_runtime_heatmap_inactive_ranks.darshan`, with this branch result followed by `main`:

For `snyder_acme.exe_id1253318_9-27-24239-1515303144625770178_2.darshan`, with this branch result followed by `main`: