WIP, ENH: more efficient hmaps #801

Open, 1 commit to be merged into base: main

tylerjereddy
Collaborator

  • even if we add a switch to disable DXT
    data analysis for Python summary reports (see: ENH: use threshold to avoid plotting DXT heatmaps in logs with DXT and HEATMAP data #729),
    runtime HEATMAP data can still explode the
    memory footprint in cases where there are many
    IO-inactive ranks, because the current Python
    code adds explicit all-zero rows for the full
    set of ranks

  • this branch removes the requirement to have
    explicit all-zero rows added for IO-inactive
    ranks, for both DXT and runtime HEATMAP,
    and the performance improvements in our
    asv suite below are clear, both on memory
    footprint and timing (larger log files can have
    other bottlenecks as well, and there are other PRs open
    related to some of these)

  • the caveat is that, although the test suite
    does pass, the new plotting strategy will
    require some more work, so this is just to get
    things rolling/WIP, with a TODO list similar to:

  • adjust the new scatterplot approach to use
    some kind of custom markers to fit the bin
    sizes, instead of fixed-size scatter markers (or pivot
    to another suitable plotting strategy with all-zero
    rows absent)
  • check the plotting/testing-related changes carefully; most
    likely comparing the HTML heatmap results for a large number
    of logs once things are more refined
  • restore the colorbar label to the heatmap plots--apparently
    the presence of this label is not tested for in the suite
  • adjust test_get_heatmap_df for more efficient single-rank
    heatmap data structures (no more explicit all-zero ranks!)

  • adjust test_set_y_axis_ticks_and_labels for new heatmap
    plotting approach

  • benchmarking with asv continuous -e -b ".*heatmap.*" main treddy_heatmap_rank_efficiency:

       before           after         ratio
     [3fb01f49]       [32397878]
     <main>           <treddy_heatmap_rank_efficiency>
-            134M             114M     0.85  dxt_heatmap.GetHeatMapDf.peakmem_get_heatmap_df(10000, 250, 0.01)
-            132M             111M     0.84  dxt_heatmap.GetHeatMapDf.peakmem_get_heatmap_df(10000, 250, 0.001)
-         800±2ms          576±2ms     0.72  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/ior_hdf5_example.darshan', 1000)
-         799±5ms          574±2ms     0.72  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/dxt.darshan', 1000)
-         780±4ms          549±1ms     0.70  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('tests/input/sample-dxt-simple.darshan', 1000)
-         271±1ms        137±0.8ms     0.51  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/dxt.darshan', 10)
-       353±0.6ms        178±0.4ms     0.51  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/ior_hdf5_example.darshan', 100)
-       272±0.7ms        137±0.3ms     0.50  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/ior_hdf5_example.darshan', 10)
-         354±2ms        178±0.4ms     0.50  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('examples/example-logs/dxt.darshan', 100)
-         372±1ms        175±0.5ms     0.47  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('tests/input/sample-dxt-simple.darshan', 100)
-       279±0.4ms        128±0.4ms     0.46  dxt_heatmap.PlotDXTHeatMapSmall.time_plot_heatmap_builtin_logs('tests/input/sample-dxt-simple.darshan', 10)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
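
For a rough sense of why the explicit all-zero rows matter, here is a minimal sketch of the arithmetic. The function name and the numbers are illustrative assumptions, not values taken from pydarshan or from any particular log; it only compares the raw `float64` storage of a dense `(nprocs, xbins)` array against one that keeps rows for active ranks only.

```python
import numpy as np


def heatmap_nbytes(nprocs, xbins, n_active_ranks):
    """Return (dense_bytes, active_only_bytes) for float64 heatmap data.

    dense:       one row per rank, including all-zero rows for
                 IO-inactive ranks
    active_only: rows only for the ranks that actually did IO
    """
    itemsize = np.dtype(np.float64).itemsize  # 8 bytes per value
    dense_bytes = nprocs * xbins * itemsize
    active_only_bytes = n_active_ranks * xbins * itemsize
    return dense_bytes, active_only_bytes


# hypothetical job: 10,000 ranks, 250 time bins, only 100 ranks doing IO
dense_bytes, active_bytes = heatmap_nbytes(nprocs=10_000,
                                           xbins=250,
                                           n_active_ranks=100)
print(f"dense: {dense_bytes / 1e6:.1f} MB, active only: {active_bytes / 1e6:.1f} MB")
```

This ignores pandas index overhead and any temporaries created during plotting, so the real peak memory difference can look quite different from this lower bound.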

To give reviewers an idea of the visual differences with the new plotting approach, two comparisons against the main branch are below. Maybe this is something I can come back to, or we could eventually find a student to drive it forward a bit more. I think what we're seeing is pretty much entirely the result of using a fixed-size square scatter marker instead of filling areas based on the widths of the bins/ranks on the axes. That means that some data looks "too big" and other data "too small," sometimes simultaneously in both axis directions.

For treddy_runtime_heatmap_inactive_ranks.darshan, with this branch result followed by main:

image
image

For snyder_acme.exe_id1253318_9-27-24239-1515303144625770178_2.darshan, with this branch result followed by main:

image
image

tylerjereddy added the enhancement (New feature or request) and pydarshan labels on Sep 3, 2022
x, y = np.meshgrid(np.arange(xbins),
                   np.asarray(hmap_df.index))
# x and y both have shape (active_ranks, xbins)
# rather than (nprocs, xbins)
hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker="s")

Either for my own reference, or in case, e.g., a student comes back to look at this later: one idea might be a custom rectangle/marker, if we could control its dimensions with sufficient granularity. Since the number of ranks and the bin widths are fixed for a given plot, I believe we should only have to do the calculation once and could then reuse the result for each "scatter marker."

Maybe something similar to: https://stackoverflow.com/a/58552620/2942522


In case it is not clear: if we can sort that out, I think it is worth it, since we potentially eliminate thousands of float64 zeros.
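
As one possible alternative to custom scatter markers, the cells could be drawn as rectangles sized in data units, so they always match the bin width and rank height. The sketch below is only an illustration of that idea, not what this branch does: `cell_rectangles` is a hypothetical helper, and the ranks and values are made-up data. It builds one rectangle per nonzero cell and draws them with matplotlib's `PolyCollection`, so inactive ranks never need a row.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import PolyCollection
from matplotlib.colors import LogNorm


def cell_rectangles(ranks, values, bin_width=1.0, rank_height=1.0):
    """Build one rectangle per nonzero cell of an active-ranks-only heatmap.

    ranks:  1-D array of the rank ids that actually did IO
    values: 2-D array with shape (len(ranks), xbins)
    Returns (verts, cell_values), with verts of shape (n_cells, 4, 2).
    """
    ranks = np.asarray(ranks)
    row_idx, col_idx = np.nonzero(values)
    x0 = col_idx * bin_width
    y0 = ranks[row_idx] * rank_height
    x1 = x0 + bin_width
    y1 = y0 + rank_height
    # corners per cell: lower-left, upper-left, upper-right, lower-right
    verts = np.stack([
        np.column_stack([x0, y0]),
        np.column_stack([x0, y1]),
        np.column_stack([x1, y1]),
        np.column_stack([x1, y0]),
    ], axis=1)
    return verts, values[row_idx, col_idx]


# made-up data: 3 active ranks out of a hypothetical nprocs = 10
nprocs = 10
ranks = np.array([0, 3, 9])
values = np.array([[1.0, 0.0, 5.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 8.0]])
verts, cell_values = cell_rectangles(ranks, values)

fig, ax = plt.subplots()
coll = PolyCollection(verts, cmap="YlOrRd", norm=LogNorm())
coll.set_array(cell_values)  # color each rectangle by its cell value
ax.add_collection(coll)
ax.set_xlim(0, values.shape[1])
ax.set_ylim(0, nprocs)
fig.colorbar(coll, ax=ax, orientation="vertical")
fig.canvas.draw()
```

Dropping the zero cells also sidesteps passing zeros to `LogNorm`. Whether this integrates cleanly with the seaborn `JointGrid` and the marginal histograms used in `plot_heatmap` has not been checked here.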

@tylerjereddy

The diff below the fold produces heatmaps that are perhaps a bit closer, but I need to dig a bit deeper on differences in more complex cases.

--- a/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
+++ b/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
@@ -23,6 +23,7 @@ npt = MagicMock()
 
 import pandas as pd
 import seaborn as sns
+import matplotlib
 import matplotlib.pyplot as plt
 from matplotlib.colors import LogNorm
 
@@ -383,8 +384,6 @@ def plot_heatmap(
         hmap_df = heatmap_handling.get_heatmap_df(agg_df=agg_df, xbins=xbins, nprocs=nprocs)
     elif mod == "HEATMAP":
         hmap_df = report.heatmaps[submodule].to_df(ops=ops)
-        # mirror the DXT approach to heatmaps by
-        # adding all-zero rows for inactive ranks
         xbins = hmap_df.shape[1]
 
     # build the joint plot with marginal histograms
@@ -401,9 +400,27 @@ def plot_heatmap(
 
     x, y = np.meshgrid(np.arange(xbins),
                        np.asarray(hmap_df.index))
+    print("x:", x)
+    print("y:", y)
+    print("hmap_df:", hmap_df)
     # x and y both have shape (active_ranks, xbins)
     # rather than (nprocs, xbins)
-    hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker="s")
+    print(hmap_df.columns)
+    bin_1 = hmap_df.columns[0]
+    tmax, runtime = determine_hmap_runtime(report=report)
+    bin_width = (x[0][1] - x[0][0])
+    print("bin_width:", bin_width)
+    bin_height = (xbins / nprocs) * bin_width
+    custom_path = [[0, 0],
+                   [0, bin_height],
+                   [bin_width, bin_height],
+                   [bin_width, 0],
+                   [0,0]]
+    marker_width_pixels, marker_height_pixels = jgrid.ax_joint.transData.transform((bin_width, bin_height))
+    size = marker_width_pixels * marker_height_pixels * (12/16) * (12/16) / 2
+    print("width (pixels):", marker_width_pixels)
+    print("height (pixels):", marker_height_pixels)
+    hmap = jgrid.ax_joint.scatter(x, y, c=hmap_df, cmap="YlOrRd", norm=LogNorm(), marker=custom_path, s=size)
     jgrid.ax_joint.set_ylim(0, nprocs)
     jgrid.fig.colorbar(hmap, ax=jgrid.ax_joint, orientation='vertical')
 
@@ -445,9 +462,9 @@ def plot_heatmap(
         align="edge",
     )
 
-    tmax, runtime = determine_hmap_runtime(report=report)
     # scale the x-axis to span the calculated run time
     xbin_max = xbins * (runtime / tmax)
+    print("xbin_max:", xbin_max)
     jgrid.ax_joint.set_xlim(0.0, xbin_max)
     # set the x and y tick locations and labels using the runtime
     set_x_axis_ticks_and_labels(jointgrid=jgrid, tmax=runtime, bin_max=xbin_max, n_xlabels=4)

image

image
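
One detail that may matter for the marker sizing in the diff above: `transData.transform` maps a data point to its pixel position, which includes the offset of the axes within the figure, so transforming a single `(bin_width, bin_height)` pair does not by itself give an extent. A hedged sketch of one way to get a width and height in points (1/72 inch, the unit underlying scatter's `s`, which is an area in points squared) is to transform two points and take the difference. `data_extent_to_points` is a hypothetical helper, and the figure setup is made up so the numbers are easy to check.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def data_extent_to_points(ax, width, height):
    """Convert a (width, height) extent in data units to points (1/72 inch).

    Assumes linear axes whose limits have already been set.
    """
    origin_px = ax.transData.transform((0.0, 0.0))
    corner_px = ax.transData.transform((width, height))
    # difference of two transformed points removes the axes offset
    width_px, height_px = np.abs(corner_px - origin_px)
    px_to_pt = 72.0 / ax.figure.dpi
    return width_px * px_to_pt, height_px * px_to_pt


# made-up setup: axes filling a 4x4 inch figure at 100 DPI (400x400 pixels)
fig = plt.figure(figsize=(4, 4), dpi=100)
ax = fig.add_axes([0, 0, 1, 1])
ax.set_xlim(0, 10)   # e.g. 10 time bins
ax.set_ylim(0, 20)   # e.g. 20 ranks
width_pt, height_pt = data_extent_to_points(ax, width=1.0, height=1.0)
print(width_pt, height_pt)
```

How a custom marker path is rescaled before `s` is applied is a separate question that this sketch does not address, and the result is only valid until the axis limits, figure size, or DPI change.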

@tylerjereddy

If we gradually increase the DPI and/or figure dimensions of the snyder_acme.. heatmap on main to check on the "ground truth" of where data is (taking screencaps of the higher res results, not actually pasting at that resolution), we do start to see more data showing up at higher DPI/larger size, so we should exercise some caution in what we target as "truth."

700 DPI:
image

2000 DPI:
image

300 DPI, but expanding physical dimensions of the figure to (15, 30) inches:
image
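
A back-of-the-envelope way to reason about why more data appears at higher DPI or larger figure sizes is to estimate how many pixels each rank row receives. The sketch below uses a hypothetical rank count and assumes the heatmap axes span the full figure height (in practice they cover only a fraction of it); neither number comes from the snyder_acme log.

```python
def pixels_per_rank(fig_height_in, dpi, nprocs, axes_fraction=1.0):
    """Estimate the vertical pixels available to a single rank row.

    axes_fraction is the share of the figure height occupied by the
    heatmap axes (1.0 means the axes fill the figure).
    """
    return fig_height_in * dpi * axes_fraction / nprocs


# hypothetical job size; figure heights and DPI values are for illustration
nprocs = 16384
for height_in, dpi in [(8, 300), (8, 700), (8, 2000), (30, 300)]:
    px = pixels_per_rank(height_in, dpi, nprocs)
    print(f"{height_in} in tall at {dpi} DPI: {px:.3f} px per rank")
```

When the estimate is well below one pixel per rank, several ranks share a pixel row and the rasterized result depends on how the renderer resolves them, which is consistent with the caution above about what to treat as "truth."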
