Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove phys prop #164

Draft
wants to merge 27 commits into
base: develop
Choose a base branch
from
Draft

Remove phys prop #164

wants to merge 27 commits into from

Conversation

ndaelman-hu
Copy link
Collaborator

Make data solely reliant on sections and quantities, reserving the use of Dataframe (new PhysicalProperty) for results.
Re-group properties to with a focus on shared normalized quantities, as well as visualization.

Closes #143 #85 #80

@ndaelman-hu ndaelman-hu added the improvement/fix Improvement or fix of a previous feature label Feb 4, 2025
@ndaelman-hu ndaelman-hu self-assigned this Feb 4, 2025
@ndaelman-hu ndaelman-hu marked this pull request as draft February 4, 2025 09:44
@ndaelman-hu
Copy link
Collaborator Author

ndaelman-hu commented Feb 4, 2025

Regarding re-grouping, I'm undecided on the final interface. Using electronic properties (e.g. eigenvalues, DOS, band structure), I'll illustrate the 2 options I see.

Generically, any of these electronic property can be identified using any of the following attributes: particle kind*, semantic group**, k-point, spin. This list is not necessarily exhaustive.
*: different kinds of particle models computed by the various diagonalization routines, e.g. Kohn-Sham, self-energy exchange or correlation.
**: groupings typically used to project / decompose the electronic states in. These are quite varied, but examples include: element, ion, (atomic) orbital like Ce(4f).

Overall, we'd like to present any user with an as similar as possible interface (similar idea in the old PhysicalProperty, just differently executed). As you can see in the hierarchies below, these setups mostly follow the same ordering, though they may miss attributes is these are nonsensical:

  • eigenvalues: particle kind -> (semantic group?) -> k-point -> spin
  • DOS: semantic group -> spin
  • band structure: semantic group -> spin -> particle life time (k-point is on the x-axis)
  • Fermi surface: (semantic group?) -> band -> spin (mapped into k-space)

(note: will add example figures later on)

Hierarchies like these are easy to express in JSON key-value pairs, but mean deeper and different paths to traverse.
total would then stand out as a privileged group with a short-cut. This is most similar in setup to the current state of the schema (with the old PhysicalProperty).

{
  'band_structure': {
    'kpoints': [{...}],
    'total': [
       {
        'spin_channel': [
          {
            'spin': {...},
            'energies': [...],
            'occupations': [...]
          }
    ],
    'groups': [
      {
        'group_label': {...},
        'spin_channel': [
          {
            'spin': {...},
            'energies': [...],
            'occupations': [...]
          }
        ]
      }
    ]
  }
}

Alternatively, we could have a container section with repeating subsections that annotate all identifying attributes together, at the same level. This means more homogeneity in structure. This also mimics the plots (which are just overlays) better.

It does make data traversal harder (especially without the metainfo browser to show name), as you have to scan identifiers to know what you're dealing with. This can be mitigated if we ensure that groupings is properly sorted by the end of the normalization.

{
  'band_structure': {
    'kpoints': [{...}],
    'groupings': [
      {
        'identifiers': {...},
        'energies': [...],
        'occupations': [...]
      }
    ]
  }
}

@JosePizarro3
Copy link
Collaborator

Regarding re-grouping, I'm undecided on the final interface. Using electronic properties (e.g. eigenvalues, DOS, band structure), I'll illustrate the 2 options I see.

Generically, any of these electronic property can be identified using any of the following attributes: particle kind*, semantic group**, k-point, spin. This list is not necessarily exhaustive. *: different kinds of particle models computed by the various diagonalization routines, e.g. Kohn-Sham, self-energy exchange or correlation. **: groupings typically used to project / decompose the electronic states in. These are quite varied, but examples include: element, ion, (atomic) orbital like Ce(4f).

Overall, we'd like to present any user with an as similar as possible interface (similar idea in the old PhysicalProperty, just differently executed). As you can see in the hierarchies below, these setups mostly follow the same ordering, though they may miss attributes is these are nonsensical:

  • eigenvalues: particle kind -> (semantic group?) -> k-point -> spin
  • DOS: semantic group -> spin
  • band structure: semantic group -> spin -> particle life time (k-point is on the x-axis)
  • Fermi surface: (semantic group?) -> band -> spin (mapped into k-space)

I am not sure I understand what "particle kind" and "semantic group" mean. I think particle kind might be info stored in ModelMethod, is that so? Can you put an example?

And the same goes with semantic group: is this the degrees of freedom? Only corresponds to some index for orbital or plane-wave index depending on the basis?

Furthermore, I don't follow your hierarchy for the different properties. How I consider this, from purely the perspective of the property per se is that we have: $E_{k \sigma m}$ for the eigenvalues, where k is the k-points dof, $\sigma$ is the spin dof, and $m$ is the orbital/planewave dof. All the properties can then be derived from this information:

  1. The DOS in the integral of that over $k$
  2. The band structure is that over a specific $k$ path
  3. The Fermi surface is those eigenvlaues close to $E_F$
  4. The spectral function (which you didn't add, though you pointed to something with the life time) is an intensity for each of the eigenvalues $I(E_{k \sigma m})$

The first JSON looks good.


Now, also for @JFRudzinski: is the Variables idea of PhysicalProperty deprecated?

@EBB2675
Copy link
Collaborator

EBB2675 commented Feb 4, 2025

I find it more intuitive and easier to follow when there is a structured tree where i can conceptually drill down

@ndaelman-hu
Copy link
Collaborator Author

ndaelman-hu commented Feb 4, 2025

I find it more intuitive and easier to follow when there is a structured tree where i can conceptually drill down

Sure, but imagine now having 5 of these trees side-by-side that have mostly similar, but not identical structures.
The advantage of the 2nd approach is that the core structure remains identical, making some common normalization / plotting easier.
But I admit, both come with issues, hence why I'm collecting opinions.

@ndaelman-hu
Copy link
Collaborator Author

ndaelman-hu commented Feb 4, 2025

Regarding re-grouping, I'm undecided on the final interface. Using electronic properties (e.g. eigenvalues, DOS, band structure), I'll illustrate the 2 options I see.
Generically, any of these electronic property can be identified using any of the following attributes: particle kind*, semantic group**, k-point, spin. This list is not necessarily exhaustive. *: different kinds of particle models computed by the various diagonalization routines, e.g. Kohn-Sham, self-energy exchange or correlation. **: groupings typically used to project / decompose the electronic states in. These are quite varied, but examples include: element, ion, (atomic) orbital like Ce(4f).
Overall, we'd like to present any user with an as similar as possible interface (similar idea in the old PhysicalProperty, just differently executed). As you can see in the hierarchies below, these setups mostly follow the same ordering, though they may miss attributes is these are nonsensical:

  • eigenvalues: particle kind -> (semantic group?) -> k-point -> spin
  • DOS: semantic group -> spin
  • band structure: semantic group -> spin -> particle life time (k-point is on the x-axis)
  • Fermi surface: (semantic group?) -> band -> spin (mapped into k-space)

I am not sure I understand what "particle kind" and "semantic group" mean. I think particle kind might be info stored in ModelMethod, is that so? Can you put an example?

And the same goes with semantic group: is this the degrees of freedom? Only corresponds to some index for orbital or plane-wave index depending on the basis?

Furthermore, I don't follow your hierarchy for the different properties. How I consider this, from purely the perspective of the property per se is that we have: E k σ m for the eigenvalues, where k is the k-points dof, σ is the spin dof, and m is the orbital/planewave dof. All the properties can then be derived from this information:

1. The DOS in the integral of that over 
     k

2. The band structure is that over a specific 
     k
    path

3. The Fermi surface is those eigenvlaues close to 
     
       E
       F

4. The spectral function (which you didn't add, though you pointed to something with the life time) is an intensity for each of the eigenvalues 
     I
     (
     
       E
       
         k
         σ
         m
       
     
     )

The first JSON looks good.

Now, also for @JFRudzinski: is the Variables idea of PhysicalProperty deprecated?

You're pointing out the same observations as I did:

  • E k σ m: there are several indices / identifiers / variables for a single property. Note that both spin and orbital (or rather, band) can be much more complex structures. Depending on the context (you know which, not going to iterate them here), they may become sections in their own right.
  • The relevant indices will change between properties, as you (and I) showed in our listing.

The 1st approach puts up a preferential structure / order to run over these indices. The 2nd approach groups them all together at the same level. This helps consistency and visualization. That's their main difference.
Approach 2 is in that respect similar to the PhysicalProperty concept, but does not allow for open-ended contexts (such as potentially additional variables). Context is relevant both for normalization and visualization. Here, it is extended via inheritance.

PhysicalProperty is being reworked by Area D with feedback from us. For both technical reasons and again, context, it will be used in results and worklfow2, rather than data.

ndaelman added 2 commits February 4, 2025 14:32
- Apply standard template to `DensityOfStates`
- Add naming convention to `OrbitalsState`
- Add few comments + correct typos
@JFRudzinski
Copy link
Collaborator

@JosePizarro3 Thank you for keeping an eye out and giving feedback.

Sorry for the delay in updating you about PhysicalProperty, I hope this MR did not come as too much of a shock. There has been much movement in recent weeks and we are trying to make quick movement now in terms of schema dev and parser migration.

I want to just slightly expand upon what Nathan shared: Starting from your prototype, Markus made his own implementation of PhysicalProperty, a bit more in a dataframe-like style. If you are interested we will be able to share more information in the near future and there will also be a cafe about it.

Markus, Laurie, Hampus, and Nathan have been working to test and improve this implementation, and it is pretty close to usable now. Nathan did many tests of the new implementation into our schema. In the end, we came to the conclusion that applying this structure to every single property in our detailed schema was both tedious and probably not practical in the long run.

Instead we focus for now on applying this new structure/tool to try to improve interoperability at a higher level, hence Nathan's mention of results and workflow2. In principle, it could also be used in data if an appropriate use case comes along. It's just that we don't implement it everywhere as default. This also helps us to make more progress on our schema immediately, which is our top priority.

I'm happy to discuss further with you, also to see if there are specific use cases you have in mind that could be useful for further testing.

Otherwise, we plan to tag you on any relevant MRs for potential input as we go through here. We appreciate your continued input!

@JosePizarro3
Copy link
Collaborator

Sure, I just asked if PhysicalProperty was being reworked, and Variables deprecated. Whatever you and the others decide is ok for me.

So the idea is to just do whatever in data in order to push forward the schema development? And then it is results and workflow2.results the ones responsible of interoperability? This sounds good for me anyways, just making sure I understand.

- Correct eigenvalues and DOS
@JFRudzinski
Copy link
Collaborator

Sure, I just asked if PhysicalProperty was being reworked, and Variables deprecated. Whatever you and the others decide is ok for me.

So the idea is to just do whatever in data in order to push forward the schema development? And then it is results and workflow2.results the ones responsible of interoperability? This sounds good for me anyways, just making sure I understand.

Of course, I want to keep you updated 👍

I wouldn't exactly say that the idea is to do "whatever" in data. We are trying to develop some standardization within the schema and also some quasi templates for people to easily use when extending later. You will see here that Nathan kept several aspects of your PP implementation. It's just that we don't allow for the flexibility in data, at least not by default for all properties. If a use cases arises, PP can also be used in data.

And yes, exactly, results and workflow2.results are responsible for interoperability. There are still some decisions exactly what is done with the results section.

@JFRudzinski
Copy link
Collaborator

From my side, the second structure that Nathan suggested has some advantages for standardization and plotting. I think this will become more clear with some concrete examples...

@coveralls
Copy link

coveralls commented Feb 5, 2025

Pull Request Test Coverage Report for Build 13209227390

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-81.0%) to 0.0%

Totals Coverage Status
Change from base Build 13008515978: -81.0%
Covered Lines: 0
Relevant Lines: 0

💛 - Coveralls

@ndaelman-hu
Copy link
Collaborator Author

Okay, feel free to give some preliminary feedback.
The main guiding approach is producing accompanying plots. This means grouping together the necessary data as well as context (i.e. metadata) to understand / retrieve the plot. For example:

  • overlaying some kind of labelled plot several times is very common, and the base template is in SemanticGroup and SemanticGroupContainer.
  • most ground state electronic structures require a common alignment, so they are grouped together. For testing purposes, I currently directly use this section under outputs. I also still have to fine-tune the normalization order there (actually, via deactivation + explicit calling, rather than levels).

The most fleshed out example is the DensityOfStates under properties/solid_state_electronics.py, you can trace the other changes back from there. There is a script for generating an example under tests/properties/visualize_electronic.py. If I upload it to a NOMAD server, I get the following result below. Some corrections are still in order here (toggle legend, single plot, show on overview), but the skeleton stands.

Screenshot from 2025-02-07 23-50-03

from nomad.metainfo import Quantity, Reference


class ModelBaseSection(ArchiveSection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a small note: the terminology "model" here is a bit confusing to me considering the other uses of model within our schema

@JFRudzinski
Copy link
Collaborator

I'm starting to set this up for testing but already having some import problems.

../nomad-simulations/src/nomad_simulations/schema_packages/properties/decomposable.py:1: in <module>
    from nomad.metainfo import placeholder, Quantity, SubSection, Reference

I can only find placeholder defined in the javascript part of nomad-FAIR. Perhaps it's due to the older mapping annotation branch. @ladinesa could you let me know when you rebase your nomad-FAIR branch, I tried myself, but the conflicts are too complicated/unknown to me...also I guess if you have any alternative insight into the placeholder import in general 🙏

Copy link
Collaborator

@JFRudzinski JFRudzinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left some comments throughout. I tried to test some things out but there are too many uncertainties for me at this preliminary stage to really be able to implement a working example.

However, here are some thoughts about the overall construction and potential approaches as I see them:

Outputs section structure

Until told otherwise, I will assume that we are sticking to the following structure of Outputs() which is a repeating subsection of simulation:

class Outputs(ArchiveSection):

    model_system_ref = Quantity(
...
    )

    model_method_ref = Quantity(
...
    )

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # List of properties
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

    property_quantity = Quantity()
    property_section = Subsection()

So far this is consistent with the approach of runschema.

With the new support for system hierarchies, this also allows us to point to any branch of a system hierarchy, and so, more generally define properties that only apply to a portion of the system.

Simple property quantities

There are a number properties that can be simply be defined directly under Outputs as quantities. Nathan already started collecting some of these in common_properties.py. This probably includes most basic thermodynamic properties, temperature, pressure, etc, as we also did in runschema. I think we would opt to flatten the outputs structure compared a bit with runschema, which subgrouped semantically related "simple" properties (e.g., in the thermodynamic subsection).

I presume there is no controversy here, but please share if you think otherwise.

Properties with metadata

These are properties that have more than just a value, they also have some surrounding quantities that provide additional context to that value (note: I treat the case with multiple value types separately below). These properties are stored as subsections in output.

Again, this was also done in runschema, e.g., GreensFunctions.

There is probably some questions concerning subsection and "main value" naming conventions and such, which would should nail down, but otherwise it is clear we will need this structure.

One question is whether we need or want some sort of base section / template for such a class.

Properties with metadata and types

These are properties that contain multiple value types depending on the context.

In runschema, there seems to be a lack of consistency with how these were dealt with:

  • types of dos are defined as unique subsections under Calculation, but are defined with respect to an identical base class.

  • types of greens functions are stored within the same repeating subsection under outputs, with a metadata quantity type to distinguish them.

  • types of eigenvalues/BandEnergies are defined with respect to a single class which lists out different types of values as distinct quantities: values_correlation, values_exchange, etc. This is somewhat similar to how energy and force contributions are treated.

In this MR, Nathan is proposing some template classes for treating this class of properties. The underlying base class for this template comes with a set of plotting functionalities. However, at its core, it is suggesting that we group repeating value types in a repeating subsection of the property subsection of outputs, i.e.: outputs.property_w_types[].groups[].value, with shared metadata being stored directly under property_w_types[]

Decomposable properties with metadata and types

This is a special class of property with types that has a type total which presumably the sum of the other types. Although I am not sure if it was still intended in this PR, it is possible that we decide to treat these in a special way compared with the previous section, by lifting the total quantity above the groups or contributions.

There is something toward this implementation in decomposable.py, although again perhaps Nathan was thinking of treating this also with the SemanticGroupContainer()

Interoperable Physical Properties

This refers to Markus' PP implementation, which is akin to a data frame structure. One major consideration in this implementation is the flexibility of the dimensions to enable interoperability of properties defined in quite different contexts.

SUMMARY

In hope this helps to clarify a bit the situation. I think what remains to decide is 1. what kind of structure do we want to treat properties with metadata and types, 2. do we want to treat decomposable properties the same or different than 1.

The associated question then is if Nathan's template fits our desires or what amendments do we need to achieve our wish list.

To address Alvin's already voiced concern we need to clarify the fundamental difference to Markus' PP approach, both technically and in the desired support. Here the main difference I see is the flexibility. Also, at least at the moment I am viewing Nathan's template as a way to treat values of differing types within the same property, not to group distinct properties together, which is the secondary purpose of the data frame approach.

I look forward to hearing some more thoughts.

from structlog.stdlib import BoundLogger

from nomad_simulations.schema_packages.utils import RussellSaundersState


class OrbitalsState(Entity):
class OrbitalsState(ArchiveSection):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the +/- 's of using Entity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entity is an ArchiveSection with a specific semantic meaning, as defined in OBO. Can link the specifics later.

@@ -206,7 +187,7 @@ class Simulation(BaseSimulation, Schema):

model_method = SubSection(sub_section=ModelMethod.m_def, repeats=True)

outputs = SubSection(sub_section=Outputs.m_def, repeats=True)
outputs = SubSection(sub_section=KResolvedElectronicProperties.m_def, repeats=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is just for testing, but what would be your plan generally? Same structure as before? I think you mentioned not listing out the properties under Outputs()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I prefer the flat structure under outputs, though we need the same approach for handling browsing (in the GUI, and when downloaded).
Some properties (like most of the solid state electronics) have shared normalization logic, and might be better off grouped together.

# ! extract name from value


class TotalEnergy(DecomposableProperty):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still how you want to treat total energy? Or was this a precursor to your SemanticGroup?

super().__init__(*args, **kwargs)
self.name = 'Total Energy' # !

class EnergyContribution(PropertyContribution):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation caused some difficulties for because it prevents me from directly importing EnergyContribution into my parser schema for potential extensions. This is a special case, but is it necessary or advantageous to define it within the other class?

)

normalize_all(archive)
with open('test.archive.json', 'w') as f:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately this archive did not get processed correctly for me, but I would like to browse this example within the gui at some point

@ndaelman-hu
Copy link
Collaborator Author

I'm starting to set this up for testing but already having some import problems.

../nomad-simulations/src/nomad_simulations/schema_packages/properties/decomposable.py:1: in <module>
    from nomad.metainfo import placeholder, Quantity, SubSection, Reference

I can only find placeholder defined in the javascript part of nomad-FAIR. Perhaps it's due to the older mapping annotation branch. @ladinesa could you let me know when you rebase your nomad-FAIR branch, I tried myself, but the conflicts are too complicated/unknown to me...also I guess if you have any alternative insight into the placeholder import in general 🙏

Feel free to remove placeholder with None (that is its current definition). It's slated for removal anyhow, and else I do it myself. My apologies that I part remained in.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Molecular electronics is a subfield in nanotechnology, and is used for electronic devices usually.

Could we update the nomenclature as molecular_eigenstates.py? This way eigenstate - quantum state - physical property connection is clear.

@EBB2675
Copy link
Collaborator

EBB2675 commented Feb 13, 2025

Properties with metadata and types

These are properties that contain multiple value types depending on the context.

In runschema, there seems to be a lack of consistency with how these were dealt with:

* [types of dos](https://github.com/nomad-coe/nomad-schema-plugin-run/blob/develop/runschema/calculation.py#L2363) are defined as unique subsections under Calculation, but are defined with respect to an identical base class.

* [types of greens functions](https://github.com/nomad-coe/nomad-schema-plugin-run/blob/develop/runschema/calculation.py#L1801) are stored within the same repeating subsection under outputs, with a metadata quantity `type` to distinguish them.

* [types of eigenvalues/BandEnergies](https://github.com/nomad-coe/nomad-schema-plugin-run/blob/develop/runschema/calculation.py#L943C7-L943C19) are defined with respect to a single class which lists out different types of values as distinct quantities: `values_correlation`, `values_exchange`, etc. This is somewhat similar to how energy and force contributions are treated.

Although I have no strong opinions on this, storing them within the same subsection makes the archive look more organized and clean, just in case of greens functions with a type quantity to distinguish them.

@JFRudzinski
Copy link
Collaborator

I am posting here a short summary of our meeting with Alvin this week, and some adjustments to the perspective that I posted above:

Alvin proposes that instead of removing Chema's PP class, we simply remove the current validation checks, and amend this into a core base class for our properties. Beyond the benefit of less code changes, this implementation is already compatible with the mapping parser. He believes that the current variables implementation can be useful for a sort of normalization of quantities, when we have different types to define within a single property subsection. Note that this is no longer intended to be for flexibility of rank, the dataframe PP class from Markus should be used for such use-cases. For decomposable properties, we would opt for keeping the total outside of these groupings. Plotting functionalities can then be developed similar to this PR.

@ndaelman-hu @ladinesa and myself should meet early next week to discuss this and to come to a consensus about how to move forward. My current perspective is that the proposals by Nathan and Alvin are not incompatible, rather with some adjustments we can accommodate the main features of each into one cohesive approach...at least, this is the goal.

The new overall structure would be:

Outputs section structure

class Outputs(ArchiveSection):

    model_system_ref = Quantity(
...
    )

    model_method_ref = Quantity(
...
    )

    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
    # List of properties
    # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

    property_section = Subsection(PP)

i.e., all properties should use the PP base class (we should change the name to not confuse with the dataframe class).

Simple property quantities

No simple properties as quantities under outputs

Properties with metadata

use the PP base class

Properties with metadata and types

use the PP base class with a standard groupings subsection

Decomposable properties with metadata and types

use the PP base class with a total value at the top level and standard groupings subsection

Interoperable Physical Properties

properties that require flexibility use the data frame PP approach

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement/fix Improvement or fix of a previous feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PhysicalProperty rank checks broken
5 participants