Custom Hierarchies #1432

Open
wants to merge 29 commits into
base: dev
from

Conversation

franzpoeschel
Contributor

@franzpoeschel franzpoeschel commented May 3, 2023

The openPMD standard works by defining "what must be there", but does not impose restrictions as to "what must not be there". By this principle, openPMD is an extensible standard.
So far, standard extensions have relied mostly on defining additional metadata in terms of attributes, e.g. for storing the name of the employed field solver for the ED-PIC extension. Custom hierarchies and custom n-dimensional datasets ("heavy" data in comparison to lightweight metadata) have not been employed so far, even though the openPMD standard permits them in principle. The major hindrance to such data organization has been the lack of support at the level of the openPMD-api, i.e. the implementation of the standard.

As the first part of this PR, the openPMD-api now supports writing custom-defined hierarchies and datasets within the basepath, i.e. within Iterations. This change is entirely independent from the standard as it makes use of the already existing liberty within the standard's conception as explained in the introduction.

This alone finds useful applications already:

  • Data that has been marked up according to another standard can be embedded side-by-side with openPMD-formatted particle-mesh data. A short example is given as part of this PR that writes an openPMD-formatted temperature mesh side by side with a simple NeXus example. The resulting dataset is shown below:
      string       /basePath                                                attr   = "/data/%T/"
      string       /date                                                    attr   = "2024-08-12 16:58:01 +0200"
      string       /iterationEncoding                                       attr   = "groupBased"
      string       /iterationFormat                                         attr   = "/data/%T/"
      string       /meshesPath                                              attr   = "meshes/"
      string       /openPMD                                                 attr   = "1.1.0"
      uint32_t     /openPMDextension                                        attr   = 0
      string       /software                                                attr   = "openPMD-api"
      string       /softwareVersion                                         attr   = "0.16.0-dev"
      double       /data/100/dt                                             attr   = 1
      double       /data/100/time                                           attr   = 0
      double       /data/100/timeUnitSI                                     attr   = 1
      string       /data/100/Scan/NX_class                                  attr   = "NXentry"
      string       /data/100/Scan/data/NX_class                             attr   = "NXdata"
      string       /data/100/Scan/data/axes                                 attr   = {"two_theta"}
      int64_t      /data/100/Scan/data/counts                               {15} = 0 / 0
      string       /data/100/Scan/data/counts/long_name                     attr   = "photodiode counts"
      string       /data/100/Scan/data/counts/units                         attr   = "counts"
      string       /data/100/Scan/data/signal                               attr   = "counts"
      double       /data/100/Scan/data/two_theta                            {15} = 0 / 0
      string       /data/100/Scan/data/two_theta/long_name                  attr   = "two_theta (degrees)"
      string       /data/100/Scan/data/two_theta/units                      attr   = "degrees"
      uint8_t      /data/100/Scan/data/two_theta_indices                    attr   = {0}
      string       /data/100/Scan/default                                   attr   = "data"
      double       /data/100/meshes/temperature                             {5, 5} = 0 / 0
      string       /data/100/meshes/temperature/axisLabels                  attr   = {"x", "y"}
      string       /data/100/meshes/temperature/dataOrder                   attr   = "C"
      string       /data/100/meshes/temperature/geometry                    attr   = "cartesian"
      double       /data/100/meshes/temperature/gridGlobalOffset            attr   = {0, 0}
      double       /data/100/meshes/temperature/gridSpacing                 attr   = {1, 1}
      double       /data/100/meshes/temperature/gridUnitSI                  attr   = 1
      long double  /data/100/meshes/temperature/position                    attr   = {0.5, 0.5}
      float        /data/100/meshes/temperature/timeOffset                  attr   = 0
      double       /data/100/meshes/temperature/unitDimension               attr   = {0, 0, 1, 0, 0, 0, 0}
      double       /data/100/meshes/temperature/unitSI                      attr   = 1
    
  • Embedding non-physical information into output files. An example is the particle-in-cell simulation PIConGPU, which uses openPMD for regular output as well as for checkpoint-restart output. For checkpoint-restart, internal program state must be serialized along with the physical state of the simulation. Currently this is only possible by pretending that the internal state is a mesh, which confuses many post-processing tools such as visualizers. PIConGPU has been adapted to make use of this change on this Git tree, check here for a diff. A shortened example output is pasted below, demonstrating that internal state information is now cleanly separated from physical data:
      float     /data/100/fields/E/x                                      {192, 1024, 192}
      float     /data/100/fields/E/y                                      {192, 1024, 192}
      float     /data/100/fields/E/z                                      {192, 1024, 192}
      float     /data/100/particles/e/momentum/x                          {71958528}
      float     /data/100/particles/e/momentum/y                          {71958528}
      float     /data/100/particles/e/momentum/z                          {71958528}
      float     /data/100/particles/e/position/x                          {71958528}
      float     /data/100/particles/e/position/y                          {71958528}
      float     /data/100/particles/e/position/z                          {71958528}
      int32_t   /data/100/particles/e/positionOffset/x                    {71958528}
      int32_t   /data/100/particles/e/positionOffset/y                    {71958528}
      int32_t   /data/100/particles/e/positionOffset/z                    {71958528}
      float     /data/100/particles/e/weighting                           {71958528}
      char      /data/100/picongpu_internal/RNG/RNGProvider3XorMin        {48, 128, 147456}
      uint64_t  /data/100/picongpu_internal/idProvider/nextId             {1, 1, 1}
      uint64_t  /data/100/picongpu_internal/idProvider/startId            {1, 1, 1}
    

Building on top of this, the other logical component of this PR is support for this standard extension. While the PR as described so far brings custom hierarchies and datasets to the openPMD-api in a way that is transparent to the standard itself, the purpose of this next standard extension is to make the standard aware of these hierarchies by embedding openPMD markup within them.

The schematic idea behind this is pictured below:
[Figure: schematic of openPMD markup embedded within a custom hierarchy]

With this, the data organization can step back into openPMD markup from anywhere within a custom-defined hierarchy. This further extends the use of this PR to:

  • Using openPMD markup within another standard, rather than merely beside it. This is currently being explored in this script for a sample dataset collected in the POLARIS laboratory.
  • For more complex setups, this permits a better organization of output data. As an example, meshes can be of different kinds such as 3-dimensional physical fields or 2-dimensional images; also there might be similar kinds of dependencies between particle data. It is desirable to group such data in a way that reflects the logical adjacencies and interdependencies between them.
  • A particular instance of the above is mesh refinement, currently proposed in a standard extension as a suffix-based naming scheme. This comment details a more natural and more easily parsed approach to mesh refinement that is based on custom hierarchies instead. A mesh-refined dataset of this type might be structured as follows:
    /data/0/refined_mesh_levels/0/meshes/E
    /data/0/refined_mesh_levels/0/meshes/B
    /data/0/refined_mesh_levels/1/meshes/E
    /data/0/refined_mesh_levels/1/meshes/B
    /data/0/refined_mesh_levels/2/meshes/E
    /data/0/refined_mesh_levels/2/meshes/B
    +++++++ ––––––––––––––––––––– ++++++++
    standard        custom        standard
    
    /data/0/simulation_internal/some_checkpointing_info
    +++++++ –––––––––––––––––––––––––––––––––––––––––––
    standard                  custom
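
To make the standard/custom/standard annotation above concrete, here is a minimal C++17 sketch that splits such a path into its three parts. It is not part of the openPMD-api: `SplitPath` and `splitPath` are hypothetical names, and the sketch assumes a fixed base path `/data/<iteration>/` and a shorthand meshes path such as `meshes/`.

```cpp
#include <optional>
#include <string>

// Hypothetical illustration type, not an openPMD-api class.
struct SplitPath
{
    std::string iteration; // standard prefix, e.g. "/data/0"
    std::string custom;    // custom groups in between, may be empty
    std::string openPMD;   // standard suffix, e.g. "meshes/E", may be empty
};

// Splits `path` into standard prefix, custom part and standard suffix.
// Assumptions: the base path is "/data/<iteration>/" and `meshesPath`
// is given in shorthand notation (no leading slash, trailing slash).
std::optional<SplitPath>
splitPath(std::string const &path, std::string const &meshesPath = "meshes/")
{
    std::string const base = "/data/";
    if (path.compare(0, base.size(), base) != 0)
        return std::nullopt; // not below the assumed base path

    // The iteration index ends at the next slash after "/data/".
    auto const iterEnd = path.find('/', base.size());
    if (iterEnd == std::string::npos)
        return SplitPath{path, "", ""};

    SplitPath res;
    res.iteration = path.substr(0, iterEnd);

    // Prepend a slash so that the meshes path is only matched as a
    // whole path component ("/meshes/"), never as part of a name.
    std::string const padded = "/" + path.substr(iterEnd + 1);
    auto const pos = padded.find("/" + meshesPath);
    if (pos == std::string::npos)
    {
        // No openPMD markup below: everything is custom.
        res.custom = padded.substr(1);
        return res;
    }
    if (pos > 0)
        res.custom = padded.substr(1, pos - 1);
    res.openPMD = padded.substr(pos + 1);
    return res;
}
```

For `/data/0/refined_mesh_levels/1/meshes/B` this yields the parts `/data/0`, `refined_mesh_levels/1` and `meshes/B`; for `/data/0/simulation_internal/some_checkpointing_info` the openPMD part stays empty.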
    

TODO

  • Merge first: Remove necessity for RecordComponent::SCALAR #1154
  • Await Pybind11 release that has merged this fix: Introduce recursive_container_traits pybind/pybind11#4623
  • Implement custom groups at the Iteration level that can hold custom attributes
  • Implement custom datasets inside custom hierarchy
  • Implement openPMD-defined meshes/particles-data from anywhere in the hierarchy
  • Implement extended meshesPath/particlesPath
  • Update the openPMD standard, see Allow user to store non-openPMD information openPMD-standard#115 (comment)
  • Lenient parsing in CustomHierarchy class
  • Maybe lazy parsing of the custom hierarchy?
  • Use the new SharedAttributableData pattern to better implement variable-based encoding (where series.iterations and series.iterations[0] are the same backend objects)
  • Replace Iteration::meshes with Iteration::mesh("subdir/E") and Iteration::allMeshes() -> std::map<std::string, Mesh>, similar Iteration::species("subdir/e") and Iteration::allSpecies() -> std::map<std::string, ParticleSpecies>. But should it be species("subdir/particles/e") or species("subdir/e")?
  • Generalize to Attributable::openAsCustomHierarchy()?
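
The `allMeshes()` idea from the list above could look roughly like the following C++17 sketch. It uses stand-in types only: `Mesh`, `Group`, `collectMeshes` and this `allMeshes` are hypothetical and do not reflect the actual openPMD-api classes. Keys are the full paths relative to the iteration, which corresponds to the longer of the two key conventions asked about above; the sketch does not settle that question.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins, not the openPMD-api classes.
struct Mesh
{
    std::string geometry = "cartesian";
};

struct Group
{
    std::map<std::string, std::unique_ptr<Group>> groups; // custom subgroups
    std::map<std::string, Mesh> meshes; // meshes stored directly in this group

    // Opens a subgroup, creating it on first access.
    Group &operator[](std::string const &name)
    {
        auto &ptr = groups[name];
        if (!ptr)
            ptr = std::make_unique<Group>();
        return *ptr;
    }
};

// Recursively collects every mesh below `group`. `prefix` is the path of
// `group` relative to the root, either empty or ending in a slash.
void collectMeshes(
    Group const &group,
    std::string const &prefix,
    std::map<std::string, Mesh> &out)
{
    for (auto const &[name, mesh] : group.meshes)
        out.emplace(prefix + name, mesh);
    for (auto const &[name, sub] : group.groups)
        if (sub)
            collectMeshes(*sub, prefix + name + "/", out);
}

// Flat view of all meshes, keyed by path relative to the iteration.
std::map<std::string, Mesh> allMeshes(Group const &iteration)
{
    std::map<std::string, Mesh> res;
    collectMeshes(iteration, "", res);
    return res;
}
```

With a mesh `E` in the group `meshes` and a mesh `B` in `subdir/meshes`, the returned map has the keys `meshes/E` and `subdir/meshes/B`.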

Diff: https://github.com/franzpoeschel/openPMD-api/compare/topic-remove-scalar-component..topic-custom-hierarchies


private:
template <typename... Arg>
iterator makeIterator(Arg &&...arg)

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable arg is not used.
return iterator{this, std::forward<Arg>(arg)...};
}
template <typename... Arg>
const_iterator makeIterator(Arg &&...arg) const

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable arg is not used.
REQUIRE(r["x"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["y"].unitSI() == 1);
REQUIRE(r["y"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["z"].unitSI() == 1);

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
// unitSI is set upon flushing
// REQUIRE(r["x"].unitSI() == 1);
REQUIRE(r["x"].resetDataset(dset).numAttributes() == 0); /* unitSI */
// REQUIRE(r["y"].unitSI() == 1);

Check notice

Code scanning / CodeQL

Commented-out code Note test

This comment appears to contain commented-out code.
@@ -966,6 +968,27 @@
#endif
}

TEST_CASE("baserecord_test", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_32 is unreachable (
autoRegistrar33
must be removed at the same time)
Comment on lines 874 to 893
// for (auto it = this->container().begin(); it != end; ++it)
// {
// if (it->first == RecordComponent::SCALAR)
// {
// this->container().erase(it);
// throw error::WrongAPIUsage(detail::NO_SCALAR_INSERT);
// }
// }

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Comment on lines 855 to 874
// for (auto it = this->container().begin(); it != end; ++it)
// {
// if (it->first == RecordComponent::SCALAR)
// {
// this->container().erase(it);
// throw error::WrongAPIUsage(detail::NO_SCALAR_INSERT);
// }
// }

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
@@ -1353,3 +1378,44 @@
UniquePtrWithLambda<int[]> arrptrFilledCustom{
new int[5]{}, [](int const *p) { delete[] p; }};
}

TEST_CASE("scalar_and_vector", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_56 is unreachable (
autoRegistrar57
must be removed at the same time)
@@ -156,6 +159,39 @@
}
}

TEST_CASE("custom_hierarchies", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_4 is unreachable (
autoRegistrar5
must be removed at the same time)
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from c8a68a5 to 6c87958 Compare May 11, 2023 09:19
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 86d8a73 to 399e6cd Compare May 30, 2023 12:43
@@ -156,6 +159,129 @@
}
}

TEST_CASE("custom_hierarchies", "[core]")

Check warning

Code scanning / CodeQL

Poorly documented large function Warning test

Poorly documented function: fewer than 2% comments for a function of 194 lines.
@franzpoeschel
Contributor Author

franzpoeschel commented Jun 19, 2023

comment removed, updated version in comments below

@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 8c28fab to 605bd55 Compare June 29, 2023 11:11
test/CoreTest.cpp Fixed
@franzpoeschel
Contributor Author

franzpoeschel commented Jul 13, 2023

For the meshesPath (equivalently for particlesPath), I have now implemented a prototype that does the following:

A path /data/0/custom/group/meshes/E is a mesh if the meshesPath contains any of the following:

  1. Full path to the group containing the mesh: /custom/group/meshes/
  2. Full path to the mesh itself: /custom/group/meshes/E (no longer supported)
  3. Shorthand notation: meshes/

The underlying rule: Full paths are denoted by a leading slash and are based on the data path (/data/%T)

Remark: The shorthand notation achieves backwards compatibility with old openPMD files
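
As a rough illustration of the matching rule above, the following C++ sketch checks whether a group holds meshes. It is not the prototype's actual code: `isMeshesGroup` is a hypothetical helper, and it assumes that the shorthand notation matches any group whose trailing path component equals the shorthand, at any depth. Variant 2 is left out since it is no longer supported.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of openPMD-api.
// `groupPath` is the group that directly contains the candidate mesh,
// relative to the data path (/data/%T), with a leading and a trailing
// slash, e.g. "/custom/group/meshes/".
bool isMeshesGroup(
    std::string const &groupPath, std::vector<std::string> const &meshesPath)
{
    for (std::string const &entry : meshesPath)
    {
        if (entry.empty())
            continue;
        if (entry.front() == '/')
        {
            // Variant 1: a full path must match the group exactly.
            if (entry == groupPath)
                return true;
        }
        else
        {
            // Variant 3: a shorthand matches the end of the group path.
            // The prepended slash keeps "meshes/" from matching a group
            // that is merely named "submeshes/".
            std::string const suffix = "/" + entry;
            if (groupPath.size() >= suffix.size() &&
                groupPath.compare(
                    groupPath.size() - suffix.size(),
                    suffix.size(),
                    suffix) == 0)
                return true;
        }
    }
    return false;
}
```

Under these assumptions, `/custom/group/meshes/` is recognized for the entries `/custom/group/meshes/` and `meshes/`, but not for the full path `/meshes/`.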

@franzpoeschel
Contributor Author

franzpoeschel commented Jul 13, 2023

One nontrivial design question is how to deal with the traditional openPMD hierarchy, especially with the paths /data/%T/meshes and /data/%T/particles. The openPMD standard defines no physical data for those groups themselves: a normal openPMD file contains no attributes /data/%T/meshes/<attr_name>.

This suggests to me that in the extended openPMD standard with custom hierarchies these paths should be treated as "nothing special". Rather, they become the canonical, but not mandatory layout/organization of a simple openPMD dataset.

Two somewhat tricky consequences from this point of view:

1. There might be more than one meshes path in the same group
E.g. the paths /data/%T/meshes and /data/%T/images might exist side by side. In the openPMD standard, this is no problem, but in the openPMD-api it becomes challenging.
The problem is with the member Iteration::meshes (made even worse by the fact that it's not a getter method, but a data member). Should it point to /data/%T/meshes? To a union of both? What about writing?

Imo, the best solution is to consider Iteration::meshes a shorthand API that should not be used in more complex setups. Rather, since /data/%T/meshes is now just another normal path in the custom Iteration hierarchy, one should access iteration["meshes"].asContainerOf<Mesh>() for clarity.

Iteration::meshes will point to the first user-specified meshes path that takes the form of a shorthand notation. E.g., after series.setMeshesPath({"fields/"}), the call iteration.meshes will be the same as iteration["fields"].asContainerOf<Mesh>(). This ensures backwards compatibility.

(Note: Since Iteration::meshes is unfortunately a member and not a method, this means that the meshes path must be set before creating or opening any Iteration. And it was enough fighting with pointers to get things to that state.)

2. There might be custom data inside /data/%T/meshes
This is not really a problem, but could be unexpected. When setting series.setMeshesPath({"/meshes/E"}), you state that only the E field is a mesh. Since /data/%T/meshes is otherwise "just a regular group" with no special meaning, there might be other data in there, too, e.g. /data/%T/meshes/custom/hierarchy. It's the job of the user to create a meaningful data layout here.

With the more restricted definition of meshesPath and particlesPath, this is no longer supported.
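
The rule from point 1 for choosing the group behind Iteration::meshes can be sketched in C++17 as follows. `defaultMeshesGroup` is a hypothetical helper, not openPMD-api code. Returning an empty optional when no shorthand entry exists is an assumption of this sketch, since the comment above does not say what happens in that case.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: picks the group name that the shorthand member
// Iteration::meshes would refer to, i.e. the first user-specified meshes
// path in shorthand notation (no leading slash).
std::optional<std::string>
defaultMeshesGroup(std::vector<std::string> const &meshesPath)
{
    for (std::string const &entry : meshesPath)
    {
        if (entry.empty() || entry.front() == '/')
            continue; // full paths are not candidates
        // Strip the trailing slash to obtain a group name,
        // e.g. "fields/" -> "fields".
        std::string name = entry;
        if (name.back() == '/')
            name.pop_back();
        return name;
    }
    return std::nullopt; // assumption: no shorthand, no default group
}
```

For the meshes path `{"fields/"}` this returns `fields`, matching the statement that iteration.meshes then equals iteration["fields"].asContainerOf<Mesh>().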

src/CustomHierarchy.cpp Fixed
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 2 times, most recently from 53f968c to ba10099 Compare August 1, 2023 13:37
}
}

TEST_CASE("custom_hierarchies_no_rw", "[core]")

Check notice

Code scanning / CodeQL

Unused static function Note test

Static function C_A_T_C_H_T_E_S_T_6 is unreachable (
autoRegistrar7
must be removed at the same time)
@franzpoeschel franzpoeschel mentioned this pull request Aug 1, 2023
1 task
@franzpoeschel franzpoeschel force-pushed the topic-custom-hierarchies branch 4 times, most recently from 31c7a25 to 1d47d17 Compare August 3, 2023 09:25
franzpoeschel and others added 25 commits November 15, 2024 15:15
Introduction of iteration["meshes"].asContainerOf<Mesh>() as a more
explicit variant for iteration.meshes.
TODO: Since meshes/particles can no longer be directly addressed with
this, maybe adapt the class hierarchy to disallow mixed groups that
contain meshes, particles, groups and datasets at the same time.

Only maybe though..
They have their own meaning now and are no longer just carefully maintained
for backwards compatibility.
Instead, they are supposed to serve as a shortcut to all openPMD data
found further down the hierarchy.
std::string version = s.openPMD();
bool hasMeshes = false;
bool hasParticles = false;
Contributor Author


TODO: unused

2 participants