
Add CLI command to dump inputs/outputs of CalcJob/WorkChain #6276

Merged: 30 commits, May 27, 2024

Conversation

qiaojunfeng
Contributor

Currently, we have verdi calcjob inputcat/outputcat for CalcJob, but there is no straightforward way to dump the inputs/outputs of a WorkChain to files. Use cases for this include:

  • easier debugging of the underlying code (e.g. QE pw.x) using the exact same inputs generated by AiiDA, without working on the Python side
  • the files can easily be sent to someone who is not familiar with AiiDA, to reproduce/analyze results

Here I added an example CLI command, verdi workchain inputsave, that recursively iterates through the called descendants of a workchain and saves the input file of each called CalcJob into a folder. For example, the structure of the saved folder for a Wannier90BandsWorkChain is

├── 1-scf
│   └── 1-iteration_01
│       └── aiida.in
├── 2-nscf
│   └── 1-iteration_01
│       └── aiida.in
├── 3-wannier90_pp
│   └── 1-iteration_01
│       └── aiida.win
├── 4-pw2wannier90
│   └── 1-iteration_01
│       └── aiida.in
└── 5-wannier90
    └── 1-iteration_01
        └── aiida.win
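
For reference, a minimal sketch of the underlying idea in Python, assuming every CalcJob defines the `input_filename` option (as QE-style codes do); the folder naming here is illustrative and not what the PR implements:

```python
# Minimal sketch, not the PR's actual implementation: save the main input file of
# every CalcJob called (directly or indirectly) by a work chain. It assumes the
# plugin defines the `input_filename` option (true for QE-style codes).
from pathlib import Path

from aiida import orm


def save_main_inputs(workchain: orm.WorkChainNode, base_path: Path) -> None:
    for node in workchain.called_descendants:
        if not isinstance(node, orm.CalcJobNode):
            continue
        input_filename = node.get_option('input_filename')
        if input_filename is None:
            continue  # the plugin does not define a main input file
        target = base_path / f'{node.pk}-{node.process_label}'
        target.mkdir(parents=True, exist_ok=True)
        content = node.base.repository.get_object_content(input_filename)
        (target / input_filename).write_text(content)
```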

The current PR is working, but there are still plenty of tasks to be done to make it generic enough.

TODO

  • decide which namespace to use for the command: should it live under verdi calcjob, a new verdi workchain, or something else that combines both (since the command now works for both CalcJob and WorkChain)?
  • the folder structure: should the UUID be included in the folder names? Currently I am using link labels

comments from Giovanni

  • add a README, metadata.txt, or similar in each folder, with info e.g. on time, UUID, ... (also to help AiiDA users find the nodes again)
  • have a way to also dump any file from input links (e.g. have subfolders inputnode/pseudo_Ba, inputnode/pseudo_Ti, ...)

Further optional TODO

In addition, there are several other convenience CLI commands I often use; they are here:
https://github.com/aiidateam/aiida-wannier90-workflows/blob/main/src/aiida_wannier90_workflows/cli/node.py

  • inputcat/outputcat for a workchain, which by default shows the last calcjob and allows the user to specify a link_label to go to the corresponding calcjob
  • cleanworkdir for a workchain, which cleans up all the nested calcjobs of a workchain
  • gotocomputer for a workchain / RemoteData / RemoteStashFolderData

It would be great if someone versed in aiida-core could use these as inspiration and write some generic CLI commands that cover these use cases. Thanks!

@qiaojunfeng qiaojunfeng marked this pull request as draft February 6, 2024 18:52
@giovannipizzi
Member

giovannipizzi commented Feb 7, 2024

As a note, I assigned this to Ali and Julian (they are not yet in the project so I cannot tag them), and @eimrek was involved in the discussions with @qiaojunfeng

@sphuber
Contributor

sphuber commented Feb 8, 2024

Thanks @qiaojunfeng. Interesting idea, but I have some questions about the design. How does the output structure of this command make it easier to rerun the calculations manually? If I understand correctly, the command only dumps the main input file, but that is typically not sufficient as there can be multiple input files for a CalcJob. I feel that the implementation may be a bit tied to your use case of QE, where there is a single principal input file. If we were to accept this in aiida-core, we need to make sure there are no assumptions that are based on a specific code; it should be generally applicable.

That being said, even for QE I am not sure if this is very useful. As is, the generated output structure cannot be directly used to launch the calculations, can it? For example, the pseudopotentials are not written. So a user will have to manually add those (for each subfolder). And what about restart files? For the NSCF, the restart files won't be found and the user will have to manually add those.

What I think the command should really do is copy all input files such that it can be relaunched straight away. The implementation will be a bit more involved though. The total set of input files is only directly available from the remote working directory, but that is not stored permanently by AiiDA and can disappear, so we cannot rely on that. Some input files are copied to the repo of the CalcJobNode, which can easily be dumped using verdi node repo dump, but that also doesn't contain all files. Some of the written input files can be excluded by the CalcJob plugin through the provenance_exclude_list, and other files are written only to the remote working directory by the daemon, for example files specified through FolderData, RemoteData and SinglefileData nodes. So to reconstruct the complete set of input files to an arbitrary path, the command should really call CalcJob.presubmit followed by aiida.engine.daemon.execmanager.upload_calculation. This would therefore require some refactoring in aiida-core to make this nicely doable.

That all being said, in some cases we may not want to copy all input files. Taking QE as an example, if you want to dump the inputs for an NSCF calculation, the complete dump would include the restart files, but these can get very big and so in some cases the user may not want to dump these. So you would have to add options to exclude these.

I can see how this command is very useful for workchains of QE calculations, but in order to justify adding this to aiida-core, we have to think hard how this can be useful for the generic use case as well. Otherwise it might just be better to keep it in the code specific plugin package.

@giovannipizzi
Member

Hi @sphuber - indeed we have to make it generic.
As you can see in one of my comments that @qiaojunfeng reported, I had thought about this. One idea was to dump the raw files of any direct input and link them with the correct input link name. So, e.g., for QE you would have a subfolder per calcjob, raw_inputs/, and inside it, for instance, pseudo_Ba/Ba.UPF, pseudo_Ti/Ti.UPF, ...
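
A rough, hedged sketch of that idea (only file-like input nodes are handled; the `raw_inputs` folder name is just an example):

```python
# Hedged sketch: dump the raw files of the direct file-like inputs of a CalcJobNode,
# one subfolder per (flat) input link label, e.g. raw_inputs/pseudo_Ba/Ba.UPF.
from pathlib import Path

from aiida import orm
from aiida.common.links import LinkType


def dump_raw_inputs(calcjob: orm.CalcJobNode, base_path: Path) -> None:
    for link in calcjob.base.links.get_incoming(link_type=LinkType.INPUT_CALC).all():
        if isinstance(link.node, (orm.SinglefileData, orm.FolderData)):
            target = base_path / 'raw_inputs' / link.link_label
            target.mkdir(parents=True, exist_ok=True)
            link.node.base.repository.copy_tree(target)
```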

Then we also considered that it might instead be better to just run the presubmit, essentially running a dry run; this should already do essentially everything (also putting the pseudos in place, and leaving information on which files should be copied directly on the remote for restarts).

Note that the goal here is not that one should necessarily be able to just run again in the same folder and everything works. (Of course, if we achieve this, this is much better). The goal is that an experienced user (not of AiiDA, but of the underlying code - in our case, this was triggered by a developer of CP2K not interested in learning AiiDA) can inspect the raw inputs and outputs, to see what was run.
It's OK in my opinion to give minimal "human-readable" instructions to them (possibly plugin specific) on what to do if they really want to rerun (things like: "for each folder, move the pseudos from folder XX to folder YY and rename it accordingly; if it's a restart, you first have to copy this folder from the parent to the restart folder, where "this folder" is written in e.g. a 'calcinfo.json' or similar file; etc.). As long as these are short and minimal, a code developer will understand what to do (possibly even without).

Again, if we can really make this "just runnable" (at least without restarts) that's OK, but I don't think it's crucial for the main goal of exposing the raw files to interested people (who don't want to dig into how AiiDA works).

The PR itself needs some cleanup indeed. The suggestion is that @khsrali and @GeigerJ2 work on this in the next couple of weeks, and then we discuss if this is general enough. But of course further comments are welcome also to help them fix this PR.

@GeigerJ2
Contributor

Hi @qiaojunfeng, thank you for this contribution! I think this is a very useful feature for people that are starting with AiiDA as it eases the transition from the directory tree to a more AiiDA-centric approach. I made some changes to the code locally. Is it OK for you if I commit them and continue working directly on top of this PR?

During development, I used a simple, exemplary PwBandsWorkChain (I'll also look at some aiida-vasp workchains in the future). Just to mention the three main points that came up:

  • The call to get_option():
    path = calcjob.get_option('input_filename')

    failed for the create_kpoints_from_distance step, as the resulting CalcFunctionNode does not implement that method. Thus, I filtered the list of links via:
    links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)]
    to remove those. While this could also be resolved by checking the node type of the links in the _get_input_filename function, in the current implementation a subdirectory is still created for these CalcFunctionNodes, even though there are no SinglefileData objects to be written to disk (though, as we'll continue working on it, this point will probably not be relevant).
  • For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:
      wc-387
      ├── 1-relax
      │   ├── 1-PwBaseWorkChain
      │   │   └── 1-PwCalculation
      │   └── 2-PwBaseWorkChain
      │       └── 1-PwCalculation
      ├── 2-scf
      │   └── 1-PwCalculation
      └── 3-bands
          └── 1-PwCalculation
  • I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it would be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).

Further, I think dumping the main input/output files for easy inspection vs. dumping everything to be able to fully restart the simulations are two different use cases, and it would already be valuable to have the former. In general, I also think the user should have the option to manually include/exclude additional files.

Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?

And lastly, regarding the namespace, it seems natural to assume that something like verdi workchain could also exist, and probably more functionalities that fit into cmd_workchain will come up over time. Though, in this case, I'd only keep the recursive logic here and make the dumping of input/output files accessible for the CalcJobs themselves. Curious to hear your thoughts!

@sphuber
Contributor

sphuber commented Feb 13, 2024

Thanks for your comments @GeigerJ2 . Just some quick pointers I noticed when reading them:

Thus, I filtered the list of links via:
links = [link for link in links if not isinstance(link.node, orm.CalcFunctionNode)]
to remove those.

This is indeed a solution, but there might be a better one. Instead of using a "black list" of sorts with known exceptions, you might want to just catch the AttributeError that is thrown when get_option is not defined. This will make the code more robust, as there might be other cases you didn't think of. I know for sure that this one will fail, for example, if there is a WorkFunctionNode.
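
For instance, something along these lines (a sketch, not the final code):

```python
# Sketch: rely on catching AttributeError instead of filtering known node classes.
from typing import Optional

from aiida import orm


def get_input_filename(node: orm.ProcessNode) -> Optional[str]:
    try:
        return node.get_option('input_filename')
    except AttributeError:
        # e.g. CalcFunctionNode or WorkFunctionNode: these classes define no options
        return None
```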

For the steps for which the link_labels were not specifically set in the plugin, I'm using the process_labels to get a more descriptive directory tree, e.g.:

Maybe a combination of the two would be even better? If you have multiple subsequent calls of the same subprocess, having just the same name prefixed with an integer is also not so informative. If a call_link_label is defined, it means the work chain developer put it there intentionally.

I'm dumping the outputs using the copy_tree() method via .outputs.retrieved from the respective CalcJobNode. I was expecting it would be similarly straightforward for multiple input files, but haven't found a way so far (I'll try the suggestion given by @giovannipizzi).

Input files (at least some of them) of a CalcJobNode are stored in its own repository. So you can get them with the same API, e.g., CalcJobNode.base.repository.copy_tree(). As I mentioned in my previous post, however, this does not include all inputs. Files that were copied using the local_copy_list, or remote_copy_list constructs will not be there. Similarly, even files that were copied to the sandbox folder by the plugin, but that were listed in the provenance_exclude_list, will not be present. For QE, typically this will include pseudos, so you won't find them in the calcjobnode's repository. And this brings me back to my earlier point...
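
Concretely, that API usage looks like this (hypothetical pk and target paths):

```python
# Illustrative usage, hypothetical pk and paths: dump what AiiDA itself stores.
import pathlib

from aiida import orm

calcjob = orm.load_node(1234)  # a CalcJobNode
calcjob.base.repository.copy_tree(pathlib.Path('dump/raw_inputs'))     # inputs written by the plugin (minus exclusions)
calcjob.outputs.retrieved.copy_tree(pathlib.Path('dump/raw_outputs'))  # files retrieved after the run
```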

Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something?

It might "work" as in there won't be any exceptions, but I still have my doubts whether it is really all that useful for the general case. Defining a default input file is not required and maybe not all that common. The current implementation will even raise an exception in that case. It says to specify a path, but that is not implemented and wouldn't even work if you have different CalcJobs in a tree that have different input paths. So I think in most cases you will get only a fraction of input files, if any at all.

If this command is to be useful, we have to go about it slightly differently. I would suggest we have to go through the same logic as an actual CalcJob goes through in writing its input based on the input nodes that are stored in the database. This functionality is essentially in the upload_calculation function of aiida.engine.daemon.execmanager. Instead of the current implementation, we should try to reuse that code. This will take care of including files that were copied using local_copy_list, and we could refactor it to ignore the provenance_exclude_list. We would have to refactor the code a bit to allow the remote_copy_list to actually copy from the remote to the localhost. @mbercx has an open PR #6196 that already proposes making this logic a bit more permissive. We could think of allowing it also to copy to the localhost.

This solution would require quite some refactoring, but I think this might anyway not be such a bad idea. The code there can be a bit dense. But if we think this might be an interesting approach, I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.

@giovannipizzi
Member

Hi all, great comments! I also thought of using the same logic that is used in the presubmit step. However, I realize now that this would require people who want to get the files to install all the plugins of the work chains they want to traverse. Think five years ahead: they might be using a new AiiDA, and the plugins might no longer be installable with that version, or might be in a new, non-backward-compatible version, etc.

We might decide to give both options, but I think the current approach is general enough to start being useful. You can still dump a YAML version of the info in the local_copy_list, remote_copy_list, etc. (we already dump a JSON of those anyway), and explain to people what it means. And in any case, also the presubmit approach can, for the remote_copy_list, only explain what needs to be done, not do it.

The design idea we had is to optionally give a plugin the possibility to provide its own custom way to dump files. So if the user has the plugin installed and the plugin supports it, the dump will be done in a more plugin-specific way. But the currently suggested approach is, to me, a good default when such a plugin-specific function does not exist.

In practice, I think the best approach is to run this function against a few common use cases (QE, Wannier, VASP, FLEUR, etc.) and see what we get and what we miss. Concretely, I would do it for the various archives of the recent common workflows verification paper (that is the actual use case that triggered this PR), and then ask the respective code developers whether the resulting folder is clear enough and contains all the info they might need.

@sphuber
Contributor

sphuber commented Feb 13, 2024

However, I realize now that this would require people who want to get the files to install all the plugins of the work chains they want to traverse. Think five years ahead: they might be using a new AiiDA, and the plugins might no longer be installable with that version, or might be in a new, non-backward-compatible version, etc.

Good point. It would require the plugin to be installed to do what I am proposing. I can see how that is limiting.

And in any case, also the presubmit approach can, for the remote_copy_list, only explain what needs to be done, not do it.

What I was suggesting was not just calling the presubmit, but also aiida.engine.daemon.execmanager.upload_calculation, which reads those instructions generated by the plugin and copies local and remote files. This would just require some refactoring, but in principle it is perfectly possible. But this still requires the actual plugin to be installed, as CalcJob.prepare_for_submission is called before that.

In practice, I think the best approach is to run this function against a few common use cases (QE, Wannier, VASP, FLEUR, etc.) and see what we get and what we miss. Concretely, I would do it for the various archives of the recent common workflows paper on

Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case. If this is really a use case that stems from these plugins, and especially from the common workflow project, might it not also be an idea to consider adding it to the aiida-common-workflows package?

@GeigerJ2
Contributor

Thank you for your helpful and valuable feedback, @sphuber!

you might want to just catch the AttributeError that is thrown when get_option is not defined.

This was my original solution, but in the initial implementation of @qiaojunfeng, it led to the creation of a directory also for these CalcFunction steps of the workchain that did not use/create input/output-files.

This will make the code more robust as there might be other cases you didn't think off. I know for sure that this one will fail for example if there is a WorkFunctionNode.

Yes, good point!

Maybe a combination of the two would be even better?

Agreed!

Input files (at least some of them) of a CalcJobNode are stored in its own repository. So...

Just learned something new about the internals :)

I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring.

If you have the time, that would be great!

In practice, I would do it for the various archives of the recent common workflows paper

Thanks, @giovannipizzi for this pointer, that should give me plenty of workchains to try out when working on this.

Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessearily represent the general case.

The general use case should be WorkChains that call CalcJobs that have file I/O on HPC (or locally), no? While the codes given by @giovannipizzi are from the field of DFT, I don't think they are otherwise too specific in that regard? Once we have use cases from the weather and climate community in the SwissTwins project, we can also see how the functionality could be of use there.

GeigerJ2 added a commit to GeigerJ2/aiida-core that referenced this pull request Feb 19, 2024
This commit builds on the [pull request](aiidateam#6276) by @qiaojunfeng. It implements the following changes:
- `_get_input_filename` function is removed
- `workchain inputsave` is split up and modified, with the recursive logic to traverse the `ProcessNodes` of the workchain moved to `_recursive_get_node_path`, the directory creation moved to `_workchain_maketree`, and the file dumping moved to `workchain_filedump`
  - `_recursive_get_node_path` builds up a list of tuples of "paths" and `CalcJobNodes`. The "paths" are based on the layout of the workchain, and utilize the `link_labels`, `process_labels`, as well as the "iteration" counter in the `link_labels`
    -> This might not be the best data structure here, but allows for extending the return value during recursion
    -> Rather than using the `ProcessNodes` directly, one could also only use the `pks` and load the nodes when needed
  - In the `PwBandsWorkChain` used for development, the "top-level" processes had the `link_labels` set, so they were missing any numbering. Thus, I added it via `_number_path_elements`. Right now, this is just a quick fix, as it only works for the top level, though such a function could possibly take care of the numbering of all levels. Ideally, one would extract it directly from the data contained in the `WorkChain`, but I think that's difficult if some steps might be missing the iteration counter in their label.
  - Eventually I think it would be nice to be able to just create the empty directory tree, without dumping input/output files, so the `_workchain_maketree` is somewhat of a placeholder for that
- `calcjob_inputdump` and `calcjob_outputdump` added to `cmd_calcjob`

So far, there's not really any error handling, and the code probably contains quite some issues (for example, the "path" naming breaks in complex cases like the `SelfConsistentHubbardWorkChain`). Though, I wanted to get some feedback and ensure I'm on a reasonable trajectory before generalizing and improving things. Regarding our discussion in PR aiidateam#6276 about an implementation of a *complete* version that makes the steps fully re-submittable: that might be an additional, future step, for which @sphuber could hopefully provide me some pointers (for now, I added a warning about that). The current commands don't require any plugin, only `core` and the data.

The result of `verdi workchain filedump <wc_pk> --path ./wc-<wc_pk>` from an exemplary `PwBandsWorkChain`:

```shell
Warning: Caution: No provenance. The retrieved input/output files are not guaranteed to be complete for a full restart of the given workchain. Instead, this utility is intended for easy inspection of the files that were involved in its execution. For restarting workchains, see the `get_builder_restart` method instead.
./wc-3057/
├── 01-relax
│   ├── 01-PwBaseWC
│   │   └── 01-PwCalc
│   │       ├── aiida.in
│   │       ├── aiida.out
│   │       ├── _aiidasubmit.sh
│   │       ├── data-file-schema.xml
│   │       ├── _scheduler-stderr.txt
│   │       └── _scheduler-stdout.txt
│   └── 02-PwBaseWC
│       └── 01-PwCalc
│           ├── aiida.in
│           ├── aiida.out
│           ├── _aiidasubmit.sh
│           ├── data-file-schema.xml
│           ├── _scheduler-stderr.txt
│           └── _scheduler-stdout.txt
├── 02-scf
│   └── 01-PwCalc
│       ├── aiida.in
│       ├── aiida.out
│       ├── _aiidasubmit.sh
│       ├── data-file-schema.xml
│       ├── _scheduler-stderr.txt
│       └── _scheduler-stdout.txt
└── 03-bands
    └── 01-PwCalc
        ├── aiida.in
        ├── aiida.out
        ├── _aiidasubmit.sh
        ├── data-file-schema.xml
        ├── _scheduler-stderr.txt
        └── _scheduler-stdout.txt

9 directories, 24 files
```
@GeigerJ2
Contributor

Hello everybody,
I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette - if not, please let me know :)
I did some changes and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!

@sphuber
Contributor

sphuber commented Feb 20, 2024

Thanks a lot @GeigerJ2

I took the PR and created a branch from it on my local fork of core to be able to work freely on it, without messing anything else up. I hope this follows proper GitHub etiquette

That is perfectly fine because, as you say, the original remains untouched. There is no set rule here really and it depends on the situation. It is possible for you to, when you are done, make a PR to the original branch, which then automatically updates this PR. But if the original author is fine with it, they can also give you direct write access to their branch and update there directly. Or just close the PR and you open a new one. Fully up to you and @qiaojunfeng in this case.

I did some changes and a first proof-of-concept for dumping the main WorkChain files works. Though, I wanted to run it by you before continuing with it. A more in-depth explanation is given in the commit message. Thank you!

As requested, I will first give some high-level feedback on the design and some common principles we tend to keep in aiida-core.

  1. For any CLI functionality that we provide, it is good practice to first make this available just in the Python API, and then have the CLI command be a very thin wrapper around it. This makes it possible to use the functionality not just from verdi but also directly from Python. The functionality you are adding here could typically be placed somewhere in aiida.tools, which contains more functions of this type that perform some post-processing of provenance data.
  2. Do we really need explicit CLI/API endpoints for the calcjob inputdump and outputdump? These are essentially just a single method call CalcJobNode.base.repository.copy_tree and CalcJobNode.outputs.retrieved.copy_tree. I don't think we need specific functions for this in the API and the CLI already has verdi node repo dump that provides this for both cases.
  3. Naming: I see there is quite a bit of logic to determine the naming of the directory structure. Would it be useful to keep the naming as close as possible to what other tools would give? I am thinking of verdi node graph generate and verdi process status, which essentially also give an overview of the call stack. It would be useful, I think, if the naming matches between these tools, otherwise it may be difficult to match the various outputs. Would the following format work perhaps: {index}-{call_link_label}-{process_label}, where call_link_label is omitted if it is not defined or is just the default CALL?
  4. What is the reason you prefer constructing the folder structure up front, before copying the files? You could make a single recursive function that goes through the call stack and, for each CalcJobNode, just copies inputs and outputs, creating the output directory if it doesn't already exist. Something like the following (not tested):
import pathlib
from aiida.common.links import LinkType
from aiida.orm import CalcJobNode, WorkChainNode


def workchain_dump(workchain: WorkChainNode, filepath: pathlib.Path) -> None:
    called_links = workchain.base.links.get_outgoing(link_type=(LinkType.CALL_CALC, LinkType.CALL_WORK)).all()

    for index, link_triple in enumerate(sorted(called_links, key=lambda link_triple: link_triple.node.ctime)):
        node = link_triple.node

        if link_triple.link_label != 'CALL':
            label = f'{index:02d}-{link_triple.link_label}-{node.process_label}'
        else:
            label = f'{index:02d}-{node.process_label}'

        if isinstance(node, WorkChainNode):
            workchain_dump(node, filepath / label)
        elif isinstance(node, CalcJobNode):
            node.base.repository.copy_tree(filepath / label)
            node.outputs.retrieved.copy_tree(filepath / label)
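
Using such a function would then be a one-liner (hypothetical pk and target directory):

```python
# Hypothetical usage of the sketch above on a finished work chain.
import pathlib

from aiida import orm

workchain_dump(orm.load_node(1234), pathlib.Path('dump-1234'))
```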

Let me know what you think.

@GeigerJ2
Contributor

Hi @sphuber, thanks a lot for your valuable feedback, it is highly appreciated!

That is perfectly fine

Happy to hear that :)

But if the original author is fine with it, they can also give you direct write access to their branch and update there directly.

In principle, that should be the default anyway if the author hasn't unset "Allow edits by maintainers" when making the PR, no? Though, better to ask, of course.

Regarding your feedback:

  1. Yes, fully agreed! It felt a bit weird to have everything in the cmd_ file, but I wasn't sure where the code would belong otherwise.

  2. Ah, I wasn't actually aware of verdi node repo dump... maybe this is also something we could highlight better (see #6290), as it seems like a very useful command to know, especially for beginners? Though, at least for a PwCalculation it only dumps the input and submission file, and .aiida folder*, so one would still be missing the retrieved files. Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()?

    I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users might expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).

  3. For the naming, I'm fine with {index}-{call_link_label}-{process_label}, and I agree it should be consistent. I abbreviated the AiiDA classes, but maybe better to keep the full names to avoid any confusion. We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, that they are properly numbered, but it seems like this is already taken care of with the link_label 👍

  4. I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to only dump either the input files or the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version where, given that the necessary plugin is installed, more (or all) input files are written (e.g. the pseudos for QE, ideally making the calculation fully restartable); then the question would again be where to implement this logic.

(*I assume this is intentional, as the .aiida/calcinfo.json contains the references to the other input files, via local_copy_list, remote_copy_list and the output files via retrieve_list, so provenance is recorded, and the full file structure could be reconstructed?)

@sphuber
Contributor

sphuber commented Feb 20, 2024

maybe this is also something we could highlight better (see #6290), as it seems like a very useful command to know, especially for beginners?

Sure. The problem is though that there are so many potentially useful things that it is difficult to organize. Of course you could start adding them to a single page, but this would quickly get very chaotic and disorganized. Then you start organizing it and then you end up with the same problem as in the beginning where you have to know where to look to find something. Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.

Though, at least for a PwCalculation it only dumps the input and submission file, and .aiida folder*, so one would still be missing the retrieved files.

Sure, you would have to call it twice, once for the node itself, and once for its retrieved output. Just as your code is doing: you have inputdump and outputdump.

Apart from that, is there a particular reason why verdi node repo dump implements another _copy_tree, rather than using node.base.repository.copy_tree()?

Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍

I don't have a strong opinion about the API endpoints, though dumping multiple input/output files might be a functionality new users might expect, e.g. for debugging their submissions without having to use the AiiDA API (even though they should :D).

Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?

We just need to make sure that, if multiple lower-level WorkChains are called in a higher-level WorkChain, or multiple CalcJobs in one WorkChain, that they are properly numbered, but it seems like this is already taken care of with the link_label 👍

That is what the {index} is doing right? I am not relying on the link_label for this.

I split it into multiple functions to get somewhat of a separation of concerns, assuming that in the future, we might want to add other WorkChain commands, where the conversion of the WorkChain to a directory tree construct could be useful. For example, if one wants to only dump either the input files, or the output files, this logic would need to be set within the recursive function, no? Additionally, as mentioned before, we might also want to implement a version that, given the necessary plugin being installed, more (all) input files are written (e.g. the pseudos for QE, ideally, making the calculation fully restartable), then the question would be again where to implement this logic.

This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant, and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.

@GeigerJ2
Contributor

Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.

Yeah, it would definitely require putting some thought into it. Though, there are a few examples that seem to pop up frequently and that one doesn't come across when just going through the tutorials, e.g. gotocomputer.

Probably just because that command got added before I implemented copy_tree on the repository (which was quite recent) 😅 It'd be great if you want to open a PR to simplify that 👍

Alright, I see. Sure, will do!

Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?

I mean that having verdi calcjob inputdump and verdi calcjob outputdump in the CLI would be nice, but yes, I agree that it's best to have the actual (simple) implementation as part of the normal API, and just have wrappers for it in cmd_calcjob. So I think I just misunderstood your comment.

This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant, and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.

Good point there! I'll revert it back to an implementation with just the one recursive function, based on your snippet, in the appropriate location, making sure it works, and

That is what the {index} is doing right? I am not relying on the link_label for this.

seeing if the indexing checks out for some examples. Thanks again for your assistance with this, it's very insightful! :)

@GeigerJ2
Contributor

(Copying the commit message here, which I thought would appear here that way anyway; sorry for that... 🙃)

Hey all,

I committed the changes from my local fork of the PR now here directly, as I think that will make working on it more straightforward. Hope that's OK for everybody.

Based on @qiaojunfeng's original implementation, there's a recursive function that traverses the WorkChain hierarchy. Now, as actual file I/O is only done for the CalcJobNodes involved in the workchain, I factored that out into its own function _calcjob_dump. I also moved these implementations into tools/dumping/processes.py so that it's available via the normal Python API. In cmd_workchain and cmd_calcjob the commands verdi workchain dump and verdi calcjob dump are only wrappers around the respective functions.

_calcjob_dump, by default, dumps the node.base.repository and node.outputs.retrieved using copy_tree, as well as the input nodes of the CalcJobNode if they are of type SinglefileData or FolderData. I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.

For each WorkChainNode and CalcJobNode, a selected set of the node properties (those exposed via @property) are dumped to YAML files. extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I'm just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node YAML file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?
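
For illustration, a minimal hedged sketch of what such a metadata dump could look like; this is not the actual ProcessNodeYamlDumper, and the selected properties are an assumption (only the aiida_node_metadata.yaml file name is taken from the tree below):

```python
# Hedged sketch, not the PR's ProcessNodeYamlDumper: dump a few selected
# ProcessNode properties to a YAML file in the node's dump directory.
from pathlib import Path

import yaml

from aiida import orm


def dump_node_metadata(node: orm.ProcessNode, output_path: Path) -> None:
    metadata = {
        'uuid': node.uuid,
        'ctime': str(node.ctime),
        'node_type': node.node_type,
        'process_label': node.process_label,
        'label': node.label,
        'description': node.description,
    }
    with (output_path / 'aiida_node_metadata.yaml').open('w') as handle:
        yaml.safe_dump(metadata, handle)
```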

The other commands initially mentioned by @qiaojunfeng also seem very interesting and could probably easily be added based on his original implementation, though we should agree on the overall API first.

A few more notes:

  • The CLI options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging them into one, e.g. under verdi node dump; however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive. Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related: do we also want to implement (or change it to) just verdi node dump?)
  • Regarding the yaml dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 to allow for dumping of computers/codes to yaml, or keep these things separate?
  • Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X>, but I'm not sure if the double underscore is based on general AiiDA name mangling, or specific to aiida-quantumespresso.

Lastly, examples of the (default) outputs obtained from verdi workchain dump:

dump-462
├── 01-relax-PwRelaxWorkChain
│  ├── 01-PwBaseWorkChain
│  │  ├── 01-PwCalculation
│  │  │  ├── aiida_node_metadata.yaml
│  │  │  ├── node_inputs
│  │  │  │  └── pseudos__Si
│  │  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  │  ├── raw_inputs
│  │  │  │  ├── .aiida
│  │  │  │  │  ├── calcinfo.json
│  │  │  │  │  └── job_tmpl.json
│  │  │  │  ├── _aiidasubmit.sh
│  │  │  │  └── aiida.in
│  │  │  └── raw_outputs
│  │  │     ├── _scheduler-stderr.txt
│  │  │     ├── _scheduler-stdout.txt
│  │  │     ├── aiida.out
│  │  │     └── data-file-schema.xml
│  │  └── aiida_node_metadata.yaml
│  ├── 02-PwBaseWorkChain
│  │  ├── 01-PwCalculation
│  │  │  ├── aiida_node_metadata.yaml
│  │  │  ├── node_inputs
│  │  │  │  └── pseudos__Si
│  │  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  │  ├── raw_inputs
│  │  │  │  ├── .aiida
│  │  │  │  │  ├── calcinfo.json
│  │  │  │  │  └── job_tmpl.json
│  │  │  │  ├── _aiidasubmit.sh
│  │  │  │  └── aiida.in
│  │  │  └── raw_outputs
│  │  │     ├── _scheduler-stderr.txt
│  │  │     ├── _scheduler-stdout.txt
│  │  │     ├── aiida.out
│  │  │     └── data-file-schema.xml
│  │  └── aiida_node_metadata.yaml
│  └── aiida_node_metadata.yaml
├── 02-scf-PwBaseWorkChain
│  ├── 01-PwCalculation
│  │  ├── aiida_node_metadata.yaml
│  │  ├── node_inputs
│  │  │  └── pseudos__Si
│  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  ├── raw_inputs
│  │  │  ├── .aiida
│  │  │  │  ├── calcinfo.json
│  │  │  │  └── job_tmpl.json
│  │  │  ├── _aiidasubmit.sh
│  │  │  └── aiida.in
│  │  └── raw_outputs
│  │     ├── _scheduler-stderr.txt
│  │     ├── _scheduler-stdout.txt
│  │     ├── aiida.out
│  │     └── data-file-schema.xml
│  └── aiida_node_metadata.yaml
├── 03-bands-PwBaseWorkChain
│  ├── 01-PwCalculation
│  │  ├── aiida_node_metadata.yaml
│  │  ├── node_inputs
│  │  │  └── pseudos__Si
│  │  │     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│  │  ├── raw_inputs
│  │  │  ├── .aiida
│  │  │  │  ├── calcinfo.json
│  │  │  │  └── job_tmpl.json
│  │  │  ├── _aiidasubmit.sh
│  │  │  └── aiida.in
│  │  └── raw_outputs
│  │     ├── _scheduler-stderr.txt
│  │     ├── _scheduler-stdout.txt
│  │     ├── aiida.out
│  │     └── data-file-schema.xml
│  └── aiida_node_metadata.yaml
└── aiida_node_metadata.yaml

and verdi calcjob dump:

dump-530
├── aiida_node_metadata.yaml
├── node_inputs
│  └── pseudos__Si
│     └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
├── raw_inputs
│  ├── .aiida
│  │  ├── calcinfo.json
│  │  └── job_tmpl.json
│  ├── _aiidasubmit.sh
│  └── aiida.in
└── raw_outputs
   ├── _scheduler-stderr.txt
   ├── _scheduler-stdout.txt
   ├── aiida.out
   └── data-file-schema.xml

for a PwBandsWorkChain and one of its involved CalcJobNodes.

@sphuber
Contributor

sphuber commented Feb 28, 2024

Thanks a lot @GeigerJ2 , great progress 👍

I started working on the --use-prepare-for-submission option, but as previously indicated by @sphuber (also see here), it's not straightforward to make it work as intended, so I put that on hold for now and added a warning.

The current branch seems to implement it though. Does it not work in its current state?

For each WorkChainNode and CalcJobNode, a selected set of the node properties (those exposed via @property) are dumped to YAML files. extras and attributes can also be included via the CLI. I initially had a function for this, but realized that I'm just passing these arguments all the way through, so I encapsulated that in the ProcessNodeYamlDumper class. Now, an instance is created when the respective CLI command is run, and the arguments are set as instance variables, with the instance being passed through rather than the original arguments. The dump_yaml method is then called at the relevant positions with the process_node and output_path arguments. Regarding relevant positions: to get the node YAML file for every ProcessNode involved, it's called in cmd_workchain and cmd_calcjob for the parent node, and subsequently for the outgoing links. Maybe there's a better way to handle that?

I think I would include the dump_yaml call inside recursive_dump (only when the node is a WorkChainNode) and inside _calcjob_dump itself. This way, the user doesn't have to remember to call it manually for the parent node. On that note, why not call them workchain_dump and calcjob_dump? The fact that workchain_dump will be recursive can be made clear in the docstring, and for calcjob_dump it doesn't make sense to be recursive, as a CalcJob cannot call other processes.

The cli options for verdi workchain dump and verdi calcjob dump are basically the same and the code is duplicated. We could avoid that by merging it into one, e.g. under verdi node dump, however, as a user, I would find verdi workchain dump and verdi calcjob dump more intuitive.

You can also reduce the code duplication differently. If you were to go for my previous suggestion, then the only code besides the recursive_dump function call is just setting the default for the output path and validation. That can simply be included in the option declaration itself, which can provide the default and do the validation. Then you can define the options once using the OverridableOption class that we use (see aiida.cmdline.params.options.main) and then you can easily reuse the option in both commands.
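
A hedged sketch of what that could look like; the option name, help text, and command names are illustrative, not the PR's actual definitions:

```python
# Hedged sketch: define the option once and reuse it in both commands.
import click

from aiida.cmdline.params.options.overridable import OverridableOption

DUMP_PATH = OverridableOption(
    '-p', '--path', type=click.Path(), help='Base directory for the dumped files.'
)


@click.command('dump')
@DUMP_PATH()
def workchain_dump_cmd(path):
    """Thin wrapper around the Python API (illustrative)."""


@click.command('dump')
@DUMP_PATH(help='Overridden help text for the calcjob variant.')
def calcjob_dump_cmd(path):
    """Thin wrapper around the Python API (illustrative)."""
```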

Also, verdi node repo dump uses a former implementation of copy_tree, so I'd go ahead and update that, but in a separate PR (related, do we want to also implement (or change it to) just verdi node dump?)

I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.

Regarding the yaml dumping, the ProcessNodeYamlDumper class is quite specific and just a little helper to get the job done here. Should we generalize such functionality, e.g. to resolve issue #3521 to allow for dumping of computers/codes to yaml, or keep these things separate?

Yes, I would be strongly in favor of this as opposed to one-off solutions. This PR would actually essentially provide this functionality: #6255
It specifies a pydantic model for each ORM entity class (so including ProcessNode), and that allows pydantic to automatically dump the content to JSON (having it do YAML should also be trivial).

Currently, the pseudo directories are called pseudos__<X>, and I thought about splitting on the double underscore to just have one pseudos directory with a subdirectory for each element. Personally, I'd find that nicer than having a bunch of pseudos__<X>, but I'm not sure if the double underscore is based on general AiiDA name mangling, or specific to aiida-quantumespresso.

You could indeed split on the double underscore and nest again, but then you would get paths like pseudos/Si/Si.pbe.blabla.upf. I guess it comes down to preference whether you prefer another level of nesting or keeping the underscores in the folder name.

The double underscores indeed come from some name mangling. Inputs in AiiDA can be nested, e.g. pseudos for the PwCalculation are passed as

inputs = {
    'pseudos': {
        'Si': UpfData()
    }
}

The link labels are flat though, so nested namespaces are represented by double underscores.

In a sense then, splitting on the double underscores and making that a subdirectory would map quite nicely on nested input namespaces. But for me, either solution would be fine.
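
Just to make the mapping concrete (illustrative label):

```python
# Illustrative only: a flat, mangled link label maps back to a nested path.
link_label = 'pseudos__Si'
nested_path = link_label.replace('__', '/')  # 'pseudos/Si'
```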

@GeigerJ2
Contributor

GeigerJ2 commented Mar 4, 2024

Hi @sphuber, sorry for just getting back to you now. Thanks, happy to hear that :)

The current branch seems to implement it though. Does it not work in its current state?

Currently, it's giving this warning:

Warning: CalcJobNode<pk> already has a `remote_folder` output: skipping upload

and the pseudo directory is empty. Though, I have to admit that I haven't looked much further into it until now.

I have just implemented your proposed changes. Thank you for your inputs and pointers. The CLI options are now OverridableOptions, and they are passed to both dumping functions. Though, I was not sure how to set the default for the path at this stage, as it depends on the pk of the process, so instead I added that as a function to cmdline.utils.defaults.

I would not change the name, because it currently only dumps the content of the repo, so in my mind verdi node repo dump is accurate.

I meant also extending the dumping so that not only the repo of the node is dumped. In that case verdi node dump could dump the repo and, e.g., the retrieved outputs. But that is effectively what verdi calcjob dump and verdi workchain dump are doing now, so no need to change or rename anything.

Yes, I would be strongly in favor of this as opposed to one of solutions. This PR would actually essentially provide this functionality: #6255

Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or would it be better to hold off until PR #6255 is merged? I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!

The double underscores indeed come from some name mangling.

Good, so my intuition was correct :)

But for me, either solution would be fine.

Same here, don't have a strong opinion on that.

@sphuber
Contributor

sphuber commented Mar 4, 2024

Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would merging here with a temporary solution be reasonable, or would it be better to hold off until PR #6255 is merged? I'm riding along on the pydantic hype train, so if I can be of any help there, please let me know!

There is no fixed timeline, but it is ready for review and has been for a while. @mbercx and @edan-bainglass have both expressed interest in taking a look at this PR in the past, as well as its predecessor #6190. I would be in favor of having this in soon. Since it is backwards compatible, I don't see major blockers, really.

@cpignedoli

cpignedoli commented Mar 13, 2024

Dear all, great work, let me try to be the "bad guy" as much as possible.

I use the case of the data for the CP2K oxide calculations in materialscloud:2023.81.

Suppose we are back 10 years ago: someone like me, "A", was doing the calculations and for some reason publishing them openly (OA).
Suppose also that someone like me, "B", was upset with some CP2K-related results and was willing to check what went wrong.

A would have had in his working space several directories such as HfO2, SiO2, Ti2O3, ... After all calculations had been completed and were ready to be published, he would have run tar -cvzf cp2k_oxides.tgz oxides_dir and put the file online.
The tough point here was... do I really have everything well organized? This problem does not exist anymore with AiiDA.

B downloads the tar file, executes tar -xvf cp2k_oxides.tgz expecting to see something like HfO2 somewhere, enters the HfO2 directory, and inspects what went wrong.

What I did now with AiiDA was:

  • find an available AiiDA environment and bring the cp2k_oxides.aiida archive file there
  • execute verdi archive import cp2k_oxides.aiida

The two steps above will be replaced (this is a BIG PLUS) by a link to Renku or an equivalent solution.

To inspect the content of the archive:

  • I executed verdi process list -a | grep -y work
  • visually identified the "main workchains" (in this case Cp2kCommonRelaxWorkChain)

Here comes the big disadvantage now compared to 10 years ago:
(screenshot)

Path to solution:
  • assuming A assigned appropriate descriptions to each workchain: either we create a verdi command (or provide a script.py to be executed via verdi) that returns the pk and description of, e.g., all "main workchains", or the instance activated, e.g., in Renku already shows such a list (by automatically running a QueryBuilder script). The command verdi workchain dump <pk> could then use part of the available description when creating the dump directories.
  • assuming A was lazy and left no descriptions in the workchains: we could still imagine an automated script in Renku (or a verdi run script.py) that, based on the first input StructureData encountered in each "main workchain", provides the user with a guessed description. The same label could then be used by verdi workchain dump.
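
A hedged sketch of such a query script, assuming the process_label of the "main workchains" is known (here Cp2kCommonRelaxWorkChain, taken from the example above):

```python
# Hedged sketch: print pk and description of the "main workchains" in the profile,
# assuming their process_label is known.
from aiida import orm

query = orm.QueryBuilder()
query.append(
    orm.WorkChainNode,
    filters={'attributes.process_label': 'Cp2kCommonRelaxWorkChain'},
    project=['id', 'description'],
)
for pk, description in query.all():
    print(pk, description)
```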

@sphuber
Contributor

sphuber commented Mar 13, 2024

Thanks @cpignedoli for the writeup. I agree that it can be a bit daunting for a user to start to "browse" the contents of an imported archive. Without a high-level human-readable description that accompanies the archive, it can be difficult for a user to know how to analyse the data, especially if they are not familiar with the plugins used and so don't necessarily know or understand the workchain layouts. I fully agree that it would be great if we could improve this intelligibility automatically somehow.

That being said, I want to caution about implementing any solutions that are going to be too prejudiced by our own typical use cases and that build in assumptions about the structure of archives. For example, in your suggestion, you implicitly assume that all archives are a simple collection of a number of top-level workchains of the same type and that their main input is a StructureData. This is not really the case in the general sense though. You could have an archive with many different types of top-level workchains, and their main input may not be structures.

What I imagine would really help here is the following:

Have a dead easy way to "mount" an AiiDA archive that allows to inspect it

The Renku functionality is a great example of solving this. We have recently made a lot of progress (and continue to do so) to also make this available on a normal computer locally. It is already possible to create a profile from an archive that doesn't require PostgreSQL. I also have functionality (see #6303) that makes RabbitMQ completely optional. With these two pieces in place, users can start inspecting an archive in two steps: pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip.

It should be easier to browse the provenance graph

Currently, there are a few ways to browse the contents of an archive. Use the various verdi commands, such as verdi group list and verdi process list. Alternatively, the QueryBuilder API can be used to query anything, but this requires coding knowledge and some understanding of the layout of the provenance graph to find what one is looking for. Finally, there is the graphical browser provided by the Materials Cloud. However, the current functionality is quite limited, in that it only ever shows a single central node with just the first layer of inputs/outputs. It would be great if we could have a representation similar to that produced by verdi node graph generate but have it be interactive, allowing one to zoom in/out, click on particular nodes to show more info, etc.

As mentioned, I think we are very close to satisfactorily addressing the first part. It will be trivial to install AiiDA and mount an archive. The biggest step, I think, is to build a fantastic visual browser that can be run locally. I think that there will always remain a need for a top-level human-readable description with archives to make them understandable to others (just like with any data publication online). One thing we could consider is to allow an archive creator to include this metadata, like a description, in the archive itself.

@cpignedoli

Hi @sphuber, I fully agree that my oversimplification can be too misleading; what you suggest/anticipate goes perfectly in the right direction, and "easy browsing" for a general case would of course be a great success. I will test the pip install aiida-core && verdi profile setup core.sqlite_zip path/to/archive.zip; I was not aware of it, this is great.

@giovannipizzi
Member

Great comments, I agree with what has been said. I think indeed that there is some responsibility on whoever puts the archive online to provide minimal guidance to users. This could just be, e.g., 1. a CSV file attached to the published entry (with UUIDs and any other relevant info to explain what is "top level"). For instance, in the case of the verification paper, it could have columns with code name, element name, and configuration, followed by the UUID. This is enough for a human to quickly find the relevant workchain and pass it to the dump command just by looking at the CSV.
Or 2. provide a short script that runs a query, loops over the results, and prints such a CSV or an equivalent file/printout. This has the advantage of also providing a bit more context on how the archive is organised, e.g. whether things are in groups, whether extras are used to label nodes, etc. (and encourages people to put all this info in AiiDA and not just in an external text file). We have good examples of such scripts currently in the README of some MC Archive entries (I think MC2D, automated wannierisation, the verification paper, etc.).
The additional thing we can do is to have an automated way for these scripts to autopopulate the first Jupyter notebook cells that appear when loading the archive in, e.g., Renkulab. So people just need to click a button and execute the cells to start browsing.
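
Such a script could be as simple as the following sketch; the group label and extra key are placeholders that would depend on how the archive is organised:

```python
# Hedged sketch: print a small CSV with UUID plus context, assuming the archive's
# "main workchains" are collected in a group and labelled via an extra.
from aiida import orm

group = orm.load_group('main_workchains')        # placeholder group label
print('uuid,system,description')
for node in group.nodes:
    system = node.base.extras.get('system', '')  # placeholder extra key
    print(f'{node.uuid},{system},{node.description}')
```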

If we agree on this and implement it, we can then update the MC Archive entries of a few of our own repos with such scripts (in the correct format to be autoloaded), as good examples to follow. And then invest a bit in "training/communication": e.g. write a blog post explaining the philosophy behind them and linking to those examples, make looking into other people's data one of the first steps of the next tutorial we organise, etc.

We can always come up with more general approaches in the future, but trying to find a general solution that works automatically for any archive and for people who don't know AiiDA is very challenging (if at all possible), while I think the approach I suggest is quite practical to implement: the scripts are easy to provide for data owners and easy to use for people who just want to look into the data. (Again, in the template for the Jupyter notebook you can put both the short querying script that prints out UUIDs and some minimal text to guide people on what to do, e.g. "get the UUID from the table above for the system you are interested in, and run this command verdi workchain dump UUID to get...".)
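
A minimal sketch of the kind of query script suggested in point 2 above; the group label and extras keys used here are hypothetical placeholders that a data owner would adapt to how their archive is actually organised:

```python
import csv
import sys

from aiida import load_profile, orm

load_profile()

qb = orm.QueryBuilder()
# Hypothetical layout: the 'top level' workchains sit in a dedicated group
# and carry extras identifying the code, element and configuration.
qb.append(orm.Group, filters={'label': 'acwf/workchains'}, tag='group')
qb.append(
    orm.WorkChainNode,
    with_group='group',
    project=['uuid', 'extras.code_name', 'extras.element', 'extras.configuration'],
)

writer = csv.writer(sys.stdout)
writer.writerow(['uuid', 'code_name', 'element', 'configuration'])
for row in qb.all():
    writer.writerow(row)
```

The UUIDs printed this way can then be passed directly to the dump command discussed in this PR.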

@sphuber
Copy link
Contributor

sphuber commented Apr 10, 2024

Thanks @GeigerJ2 could you please update your branch, address the conflicts and make sure the tests run?

@GeigerJ2 GeigerJ2 force-pushed the cli_workchain branch 3 times, most recently from 9ea3e16 to 4daa2fc Compare April 10, 2024 14:36
GeigerJ2 and others added 20 commits May 27, 2024 12:16
To be consistent with other commands in `cmd_process`. And, with that, rename `process_dump` and `calcjob_dump` in `processes.py` to `process_node_dump` and `calcjob_node_dump`.
Now in function `generate_calcjob_io_dump_paths`. Takes care of handling
the `flat` argument and the naming of the `raw_inputs`, `raw_outputs`,
and `node_inputs` subdirectories.
Changed `--flat` option to still create subdirectories for the
individual steps of the WorkChain. Instead, just the subdirectories
per CalcJob are removed.

Generalized the dumping of outputs so that it doesn't only dump `retrieved`. With this, it dumps a whole range of AiiDA nodes, basically all the parsed outputs, which are mainly numpy arrays dumped as `.npy` files. Added an option to enable this, as it might not be necessary to dump all of those. Currently, I just defined a global variable in the file, but this will eventually become a class attribute of the `ProcessDumper` class.
To avoid having to pass all the arguments
`include_node_inputs`, `include_attributes/extras`, `overwrite`, `flat`,
`all_aiida_nodes` through the different functions, everything related to
the dumping is now compiled in the `ProcessDumper` class, which defines
the main entry-point method `dump`. For nested workflows, this is
recursively called (as before). Once `CalculationFunction` nodes are
reached, their content is dumped via `dump_calculation_node`. The
helper functions to create and validate labels and paths of nested
subdirectories are also methods of the `ProcessDumper`.

Introduced the `parent_process` class attribute, which is dynamically generated from the `parent_node` and used to generate the main README; the README is only created when the dumping is done via the `verdi` CLI. For the other functions, this concept does not make sense due to the recursion, so the respective `process_node`s (which change during the recursion) are always passed as arguments.

Next steps:
- Update tests to actually test the new implementations
- Update docstrings
- Add section to `How to work with data` section of the docs
- If the `OverridableOptions` are only used here, they can also just be
  defined as normal `click` options (however, we can also start thinking
  about the `verdi archive dump` functionality that we should start
  implementing soon)
Moved the recursive logic out of the top-level `dump` function and into `_workflow_dump`. In addition, moved the default path creation and validation into the top-level `dump` function and out of the `cmd_process.py` file.

The following entities are now dumped for each child `CalculationNode`
reached during the dumping:
- `CalculationNode` repository -> `inputs`
- `CalculationNode` retrieved output -> `outputs`
- `CalculationNode` input nodes -> `node_inputs`
- `CalculationNode` output nodes (apart from `retrieved`)
  -> `node_outputs`
By default, everything apart from the `node_outputs` is dumped, so as to avoid too many non-`SinglefileData` or `FolderData` nodes being written to disk. The `--all-aiida-nodes` option is removed instead. The number of files might still grow large for complex workchains, e.g. `SelfConsistentHubbardWorkchain` or `EquationOfStateWorkChain`.

In addition, set `_generate_default_dump_path`, `_generate_readme`, and `_generate_child_node_label` as `staticmethod`s, as they logically belong to the class but don't access any of its attributes. The former two are, furthermore, only called in the top-level `dump` method. Other methods like `_validate_make_dump_path` still access class attributes like `overwrite` or `flat`, so they remain normal class methods.
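
To make the recursion described in these commit messages concrete, the traversal they refer to can be pictured with the public node API as follows (this only illustrates the walk over called children, not the dumper code itself):

```python
from aiida import load_profile, orm

load_profile()


def walk_called(node: orm.ProcessNode, depth: int = 0) -> None:
    """Print the tree of processes called by ``node``, in order of creation."""
    for index, child in enumerate(sorted(node.called, key=lambda n: n.ctime), start=1):
        label = child.process_label or child.__class__.__name__
        print('    ' * depth + f'{index:02d}-{label}')
        if isinstance(child, orm.WorkflowNode):
            # Sub-workflows are descended into recursively ...
            walk_called(child, depth + 1)
        # ... while calculation nodes are the leaves whose repository contents
        # and linked inputs/outputs the dumper writes to disk.


walk_called(orm.load_node('<workchain-identifier>'))  # placeholder identifier
```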
Copy link

codecov bot commented May 27, 2024

Codecov Report

Attention: Patch coverage is 94.47236% with 11 lines in your changes missing coverage. Please review.

Project coverage is 77.66%. Comparing base (ef60b66) to head (497b14c).
Report is 112 commits behind head on main.

Files Patch % Lines
src/aiida/tools/dumping/processes.py 93.65% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6276      +/-   ##
==========================================
+ Coverage   77.51%   77.66%   +0.16%     
==========================================
  Files         560      562       +2     
  Lines       41444    41652     +208     
==========================================
+ Hits        32120    32344     +224     
+ Misses       9324     9308      -16     


@sphuber sphuber self-requested a review May 27, 2024 11:02
Copy link
Contributor

@sphuber sphuber left a comment


Thanks a lot @qiaojunfeng for the original idea and @GeigerJ2 for the hard work on the implementation. Let's get this show on the road

@sphuber sphuber merged commit c1cc2b0 into aiidateam:main May 27, 2024
22 checks passed
sphuber pushed a commit that referenced this pull request May 27, 2024
This commit adds functionality to write all files involved in the
execution of a workflow to disk. This is achieved via the new
`ProcessDumper` class, which exposes the top-level `dump` method, while
`verdi process dump` provides a wrapper for access via the CLI.

The `ProcessDumper` class is instantiated with the desired
options for the dumping. These are the `-o/--overwrite` option, the
`--io-dump-paths` option which can be used to provide custom
subdirectories for the folders created for each `CalculationNode`
(the dumped data being the `CalculationNode` repository, its `retrieved`
outputs, as well as the linked node inputs and outputs), the `-f/--flat`
option that disables the creation of these subdirectories, thus creating
all files in a flat hierarchy (for each step of the workflow), and the
`--include-inputs/--exclude-inputs` (`--include-outputs/--exclude-outputs`)
options to enable/disable the dumping of linked inputs (outputs) for each
`CalculationNode`. In addition, a `README` is created in the parent
dumping directory, as well as `.aiida_node_metadata.yaml` files with the
`Node`, `User`, and `Computer` information in the subdirectories created
for each `ProcessNode`.

Nested workchains with considerable file I/O were needed for meaningful
testing of this feature, so it was required to extend the
`generate_calculation_node` fixture of `conftest.py`.  Moreover, the
`generate_calculation_node_add` and `generate_workchain_multiply_add`
fixtures that actually run the `ArithmeticAddCalculation` and
`MultiplyAddWorkchain` were also added. These could in the future
possibly be used to reduce code duplication where the objects are being
constructed in other parts of the test suite (benchmarking of manually
constructing the `ProcessNode`s vs. running the `Process` will still
have to be conducted). Lastly, the `generate_calculation_node_io` and
`generate_workchain_node_io` were added in `test_processes.py`, which
actually create the `CalculationNode`s and `WorkflowNode`s that are used
for the tests of the dumping functionality.

Co-Authored-By: Junfeng Qiao <[email protected]>
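
As a usage illustration of what the commit message above describes, from the Python API side (the import path follows the file touched in this PR, `src/aiida/tools/dumping/processes.py`; the exact keyword names are assumed here for illustration and may differ in detail):

```python
from pathlib import Path

from aiida import load_profile, orm
from aiida.tools.dumping.processes import ProcessDumper

load_profile()

# Any finished workchain in the profile (placeholder identifier).
workchain = orm.load_node('<workchain-uuid>')

# Options mirror the CLI flags described above: overwrite an existing dump
# directory, keep per-calculation subdirectories, dump linked input nodes but
# not the parsed output nodes.
dumper = ProcessDumper(include_inputs=True, include_outputs=False, overwrite=True, flat=False)
dumper.dump(process_node=workchain, output_path=Path('dump-of-my-workchain'))
```

From the command line, the same functionality is exposed through the `verdi process dump` wrapper mentioned above.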
@sphuber
Copy link
Contributor

sphuber commented May 27, 2024

Just FYI, since the original PR was made by @qiaojunfeng , Github automatically set him as the author. However, since @GeigerJ2 has been the main author really, I have changed that and have added @qiaojunfeng in the Co-Authored-By attribution. I hope that is ok with the both of you

@GeigerJ2
Copy link
Contributor

Fine for me, thanks a lot @sphuber! 🙏

@qiaojunfeng
Copy link
Contributor Author

Thanks @sphuber, good with me! Very happy to see this come true after all the hard work done by @GeigerJ2!

mikibonacci pushed a commit to mikibonacci/aiida-core that referenced this pull request Sep 3, 2024