Add CLI command to dump inputs/outputs of CalcJob/WorkChain (#6276)
Conversation
As a note, I assigned this to Ali and Julian (they are not yet in the project so I cannot tag them), and @eimrek was involved in the discussions with @qiaojunfeng.
Thanks @qiaojunfeng. Interesting idea, but I have some questions about the design. How does the output structure of this command make it easier to rerun the calculations manually? If I understand correctly, the command only dumps the main input file, but that is typically not sufficient, as there can be multiple input files for a `CalcJob`.

That being said, even for QE I am not sure if this is very useful. As is, the generated output structure cannot be directly used to launch the calculations, can it? For example, the pseudopotentials are not written, so a user will have to manually add those (for each subfolder). And what about restart files? For the NSCF, the restart files won't be found and the user will have to manually add those.

What I think the command should really do is copy all input files such that it can be relaunched straight away. The implementation will be a bit more involved though. The total set of input files is only directly available from the remote working directory, but that is not stored permanently by AiiDA and can disappear, so we cannot rely on that. Some input files are copied to the repo of the `CalcJobNode`.

That all being said, in some cases we may not want to copy all input files. Taking QE as an example, if you want to dump the inputs for an NSCF calculation, the complete dump would include the restart files, but these can get very big, so in some cases the user may not want to dump these. So you would have to add options to exclude these.

I can see how this command is very useful for workchains of QE calculations, but in order to justify adding this to `aiida-core`, it would have to be more generic.
Hi @sphuber - indeed we have to make it generic. We also considered that it might instead be better to just run the presubmit, essentially doing a dry run; this should already do essentially everything (putting also the pseudos, and leaving information on which files should be copied directly on the remote for restarts).

Note that the goal here is not that one should necessarily be able to just run again in the same folder and have everything work (of course, if we achieve this, that is much better). The goal is that an experienced user (not of AiiDA, but of the underlying code - in our case, this was triggered by a developer of CP2K not interested in learning AiiDA) can inspect the raw inputs and outputs, to see what was run. Again, if we can really make this "just runnable" (at least without restarts) that's OK, but I don't think it's crucial for the main goal of exposing the raw files to interested people (who don't want to dig into how AiiDA works).

The PR itself needs some cleanup indeed. The suggestion is that @khsrali and @GeigerJ2 work on this in the next couple of weeks, and then we discuss whether it is general enough. But of course further comments are welcome to help them fix this PR.
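For reference, a minimal sketch of the dry-run idea mentioned above, assuming the relevant calculation plugin is installed (the pk is a placeholder). Setting `dry_run` makes the engine run the presubmit step and write the input files to a local `submit_test/` folder without actually submitting anything:

```python
from aiida import load_profile, orm
from aiida.engine import run

load_profile()

node = orm.load_node(1234)  # placeholder pk of an existing CalcJobNode
builder = node.get_builder_restart()  # rebuild the inputs of the original calculation
builder.metadata.dry_run = True  # write the input files to a local `submit_test/` folder
builder.metadata.store_provenance = False  # do not add a test node to the provenance graph

run(builder)
```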
Hi @qiaojunfeng, thank you for this contribution! I think this is a very useful feature for people that are starting with AiiDA, as it eases the transition from the directory tree to a more AiiDA-centric approach. I made some changes to the code locally. Is it OK for you if I commit them and continue working directly on top of this PR? During development, I used a simple, exemplary `PwBandsWorkChain`.
Further, I think dumping the main input/output files for easy inspection vs. dumping everything to be able to fully restart the simulations are two different use cases, and it would already be valuable to have the former. In general, I also think the user should have the option to manually include/exclude additional files. Apart from the QE-specific assumption of a single input file, the function seems conceptually generic, and I would expect it to work similarly for other workchains, or am I missing something? And lastly, regarding the namespace, it seems natural to assume that something like
Thanks for your comments @GeigerJ2. Just some quick pointers I noticed when reading them:
This is indeed a solution, but there might be a better one. Instead of using a "black list" of sorts with known exceptions, you might want to just catch the
Maybe a combination of the two would be even better? If you have multiple subsequent calls of the same subprocess, having just the same name prefixed with an integer is also not so informative. If a
Input files (at least some of them) of a
It might "work" as in there won't be any exceptions, but I still have my doubts whether it is really all that useful for the general case. Defining a default input file is not required and maybe not all that common. The current implementation will even raise an exception in that case. It says to specify a path, but that is not implemented and wouldn't even work if you have different If this command is to be useful, we have to go about it slightly differently. I would suggest we have to go through the same logic as an actual This solution would require quite some refactoring, but I think this might anyway not be such a bad idea. The code there can be a bit dense. But if we think this might be an interesting approach, I'd be happy to go through it during a call some time and give some pointers and guide you in performing the refactoring. |
Hi all, great comments! I also thought of using the same logic that is used in the presubmit step. However, I realize now that this will require that the people who want to get the files will need to install all the plugins of the work chains they want to traverse. Think in 5 years' time: they might be using a new AiiDA and the plugins will not be installable anymore with that version, might be in a new non-backward-compatible version, etc.

We might decide to give both options, but I think the current approach is general enough to start to be useful. You can still dump a yaml version of the info in the local copy list, remote copy list, etc. (we anyway already dump a json of those) and explain to people what it means. And in any case, also the presubmit approach, for the remote copy list, can only explain what needs to be done and not do it.

The design idea we had is to have, optionally, the possibility for a plugin to provide its custom way to dump files. So if the user has the plugin installed and the plugin supports it, the dump will be done in a more plugin-specific way. But the current suggested approach, to me, is a good default when such a plugin-specific function does not exist.

In practice, I think the best is to try to run this function against a few common use cases (QE, Wannier90, VASP, FLEUR, etc.) and see what we get and what we miss. I would do it for the various archives of the recent common workflows paper on verification (that is the actual use case that triggered this PR), and then ask the respective code developers if the resulting folder is clear enough and contains all the info they might need.
Good point. It would require the plugin to be installed to do what I am proposing. I can see how that is limiting.
What I was suggesting was not just calling the presubmit, but also the `upload_calculation` step.
Note that these are still just use cases that all come from a single domain, namely the original DFT users, and don't necessarily represent the general case. If this is really a use case that stems from these plugins, and especially from the common workflow project, is it maybe not also an idea to consider adding it to the
Thank you for your helpful and valuable feedback, @sphuber!
This was my original solution, but in the initial implementation of @qiaojunfeng, it led to the creation of a directory also for these
Yes, good point!
Agreed!
Just learned something new about the internals :)
If you have the time, that would be great!
Thanks, @giovannipizzi for this pointer, that should give me plenty of workchains to try out when working on this.
The general use case should be
This commit builds on the [pull request](aiidateam#6276) by @qiaojunfeng. It implements the following changes:

- `_get_input_filename` function is removed
- `workchain inputsave` is split up and modified, with the recursive logic to traverse the `ProcessNodes` of the workchain moved to `_recursive_get_node_path`, the directory creation moved to `_workchain_maketree`, and the file dumping moved to `workchain_filedump`
- `_recursive_get_node_path` builds up a list of tuples of "paths" and `CalcJobNodes`. The "paths" are based on the layout of the workchain and utilize the `link_labels`, `process_labels`, as well as the "iteration" counter in the `link_labels`
  -> This might not be the best data structure here, but it allows for extending the return value during recursion
  -> Rather than using the `ProcessNodes` directly, one could also use only the `pks` and load the nodes when needed
- In the `PwBandsWorkChain` used for development, the "top-level" processes had the `link_labels` set, so they were missing any numbering. Thus, I added it via `_number_path_elements`. Right now, this is just a quick fix, as it only works for the top level; such a function could possibly take care of the numbering of all levels. Ideally, one would extract it directly from the data contained in the `WorkChain`, but I think that's difficult if some steps are missing the iteration counter in their label.
- Eventually, I think it would be nice to be able to just create the empty directory tree without dumping input/output files, so `_workchain_maketree` is somewhat of a placeholder for that
- `calcjob_inputdump` and `calcjob_outputdump` added to `cmd_calcjob`

So far, there's not really any error handling, and the code probably contains quite some issues (for example, the "path" naming breaks in complex cases like the `SelfConsistentHubbardWorkChain`), though I wanted to get some feedback and ensure I'm on a reasonable trajectory before generalizing and improving things. Regarding our discussion in PR aiidateam#6276: working on an implementation of a *complete* version that makes the steps fully re-submittable might be an additional, future step, in which @sphuber could hopefully provide me some pointers (for now, I added a warning about that). The current commands don't require any plugin, only `core` and the data.

The result of `verdi workchain filedump <wc_pk> --path ./wc-<wc_pk>` from an exemplary `PwBandsWorkChain`:

```shell
Warning: Caution: No provenance. The retrieved input/output files are not guaranteed to be complete
for a full restart of the given workchain. Instead, this utility is intended for easy inspection of
the files that were involved in its execution. For restarting workchains, see the `get_builder_restart`
method instead.
./wc-3057/
├── 01-relax
│   ├── 01-PwBaseWC
│   │   └── 01-PwCalc
│   │       ├── aiida.in
│   │       ├── aiida.out
│   │       ├── _aiidasubmit.sh
│   │       ├── data-file-schema.xml
│   │       ├── _scheduler-stderr.txt
│   │       └── _scheduler-stdout.txt
│   └── 02-PwBaseWC
│       └── 01-PwCalc
│           ├── aiida.in
│           ├── aiida.out
│           ├── _aiidasubmit.sh
│           ├── data-file-schema.xml
│           ├── _scheduler-stderr.txt
│           └── _scheduler-stdout.txt
├── 02-scf
│   └── 01-PwCalc
│       ├── aiida.in
│       ├── aiida.out
│       ├── _aiidasubmit.sh
│       ├── data-file-schema.xml
│       ├── _scheduler-stderr.txt
│       └── _scheduler-stdout.txt
└── 03-bands
    └── 01-PwCalc
        ├── aiida.in
        ├── aiida.out
        ├── _aiidasubmit.sh
        ├── data-file-schema.xml
        ├── _scheduler-stderr.txt
        └── _scheduler-stdout.txt

9 directories, 24 files
```
Hello everybody,
Thanks a lot @GeigerJ2
That is perfectly fine because, as you say, the original remains untouched. There is no set rule here really and it depends on the situation. It is possible for you to, when you are done, make a PR to the original branch, which then automatically updates this PR. But if the original author is fine with it, they can also give you direct write access to their branch and update there directly. Or just close the PR and you open a new one. Fully up to you and @qiaojunfeng in this case.
As requested, I will first give some high-level feedback on the design and some common principles we tend to keep in `aiida-core`. The following snippet sketches how I would approach the recursive dumping:
```python
import pathlib

from aiida.common.links import LinkType
from aiida.orm import CalcJobNode, WorkChainNode


def workchain_dump(workchain: WorkChainNode, filepath: pathlib.Path) -> None:
    """Recursively dump the repository and retrieved files of all calculations called by a workchain."""
    called_links = workchain.base.links.get_outgoing(link_type=(LinkType.CALL_CALC, LinkType.CALL_WORK)).all()

    for index, link_triple in enumerate(sorted(called_links, key=lambda link_triple: link_triple.node.ctime)):
        node = link_triple.node

        if link_triple.link_label != 'CALL':
            label = f'{index:02d}-{link_triple.link_label}-{node.process_label}'
        else:
            label = f'{index:02d}-{node.process_label}'

        if isinstance(node, WorkChainNode):
            workchain_dump(node, filepath / label)
        elif isinstance(node, CalcJobNode):
            node.base.repository.copy_tree(filepath / label)
            node.outputs.retrieved.base.repository.copy_tree(filepath / label)
```

Let me know what you think.
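For context, a hypothetical invocation of the sketch above (the pk and target directory are placeholders):

```python
from aiida import load_profile
from aiida.orm import load_node

load_profile()
workchain_dump(load_node(1234), pathlib.Path('wc-dump'))  # 1234 is a placeholder workchain pk
```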
Hi @sphuber, thanks a lot for your valuable feedback, it is highly appreciated!
Happy to hear that :)
In principle, that should be the default anyway if the author hasn't unset "Allowing edits by maintainers" when making the PR, no? Though, better to ask ofc. Regarding your feedback:
(*I assume this is intentional, as the
Sure. The problem is though that there are so many potentially useful things that it is difficult to organize. Of course you could start adding them to a single page, but this would quickly get very chaotic and disorganized. Then you start organizing it and then you end up with the same problem as in the beginning where you have to know where to look to find something. Not saying we shouldn't try to solve this, just that it might not be as trivial as it initially seems.
Sure, you would have to call it twice, once for the node itself, and once for its `retrieved` output.
Probably just because that command got added before I implemented
Not quite sure what you mean here. But I am not saying that we shouldn't have the CLI end points. I am just saying that the functionality should be mostly implemented as normal API, such that people can also use it from the shell or a notebook. The CLI commands can then be minimal wrappers. But maybe I misunderstood your comment here?
That is what the
This really feels like a premature abstraction, and I don't think it would even necessarily be a good one. I think the difference in code length and complexity is significant, and I don't think the separate version gives a real benefit, but it is significantly more difficult to understand and maintain. Unless there are concrete benefits now, I would not overcomplicate things.
Yeah, it would definitely require putting some thought into it. Though, there are a few examples that seem to pop up frequently, and that one doesn't come across when just going through the tutorials, i.e.
Alright, I see. Sure, will do!
I mean that having
Good point there! I'll revert it back to an implementation with just the one recursive function, based on your snippet, in the appropriate location, making sure it works, and seeing if the indexing checks out for some examples. Thanks again for your assistance with this, it's very much appreciated!
(Copying the commit message here, which I thought would appear here in that way, sorry for that... 🙃) Hey all, I committed the changes from my local fork of the PR directly here now, as I think that will make working on it more straightforward. Hope that's OK for everybody. Based on @qiaojunfeng's original implementation, there's a recursive function that traverses the
For each

The other commands initially mentioned by @qiaojunfeng also seem very interesting and could probably be easily added based on his original implementation, though we should agree on the overall API first. A few more notes:
Lastly, examples of the (default) outputs obtained from

```
dump-462
├── 01-relax-PwRelaxWorkChain
│ ├── 01-PwBaseWorkChain
│ │ ├── 01-PwCalculation
│ │ │ ├── aiida_node_metadata.yaml
│ │ │ ├── node_inputs
│ │ │ │ └── pseudos__Si
│ │ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ │ ├── raw_inputs
│ │ │ │ ├── .aiida
│ │ │ │ │ ├── calcinfo.json
│ │ │ │ │ └── job_tmpl.json
│ │ │ │ ├── _aiidasubmit.sh
│ │ │ │ └── aiida.in
│ │ │ └── raw_outputs
│ │ │ ├── _scheduler-stderr.txt
│ │ │ ├── _scheduler-stdout.txt
│ │ │ ├── aiida.out
│ │ │ └── data-file-schema.xml
│ │ └── aiida_node_metadata.yaml
│ ├── 02-PwBaseWorkChain
│ │ ├── 01-PwCalculation
│ │ │ ├── aiida_node_metadata.yaml
│ │ │ ├── node_inputs
│ │ │ │ └── pseudos__Si
│ │ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ │ ├── raw_inputs
│ │ │ │ ├── .aiida
│ │ │ │ │ ├── calcinfo.json
│ │ │ │ │ └── job_tmpl.json
│ │ │ │ ├── _aiidasubmit.sh
│ │ │ │ └── aiida.in
│ │ │ └── raw_outputs
│ │ │ ├── _scheduler-stderr.txt
│ │ │ ├── _scheduler-stdout.txt
│ │ │ ├── aiida.out
│ │ │ └── data-file-schema.xml
│ │ └── aiida_node_metadata.yaml
│ └── aiida_node_metadata.yaml
├── 02-scf-PwBaseWorkChain
│ ├── 01-PwCalculation
│ │ ├── aiida_node_metadata.yaml
│ │ ├── node_inputs
│ │ │ └── pseudos__Si
│ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ ├── raw_inputs
│ │ │ ├── .aiida
│ │ │ │ ├── calcinfo.json
│ │ │ │ └── job_tmpl.json
│ │ │ ├── _aiidasubmit.sh
│ │ │ └── aiida.in
│ │ └── raw_outputs
│ │ ├── _scheduler-stderr.txt
│ │ ├── _scheduler-stdout.txt
│ │ ├── aiida.out
│ │ └── data-file-schema.xml
│ └── aiida_node_metadata.yaml
├── 03-bands-PwBaseWorkChain
│ ├── 01-PwCalculation
│ │ ├── aiida_node_metadata.yaml
│ │ ├── node_inputs
│ │ │ └── pseudos__Si
│ │ │ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
│ │ ├── raw_inputs
│ │ │ ├── .aiida
│ │ │ │ ├── calcinfo.json
│ │ │ │ └── job_tmpl.json
│ │ │ ├── _aiidasubmit.sh
│ │ │ └── aiida.in
│ │ └── raw_outputs
│ │ ├── _scheduler-stderr.txt
│ │ ├── _scheduler-stdout.txt
│ │ ├── aiida.out
│ │ └── data-file-schema.xml
│ └── aiida_node_metadata.yaml
└── aiida_node_metadata.yaml
```

and

```
dump-530
├── aiida_node_metadata.yaml
├── node_inputs
│ └── pseudos__Si
│ └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF
├── raw_inputs
│ ├── .aiida
│ │ ├── calcinfo.json
│ │ └── job_tmpl.json
│ ├── _aiidasubmit.sh
│ └── aiida.in
└── raw_outputs
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.out
└── data-file-schema.xml
```

for a `PwBandsWorkChain` and a single `PwCalculation`, respectively.
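As an illustration of how the `node_inputs` entries in the trees above could be produced, here is a minimal sketch (not necessarily the PR's actual implementation) that copies the file repositories of a `CalcJobNode`'s input nodes into subdirectories named after their link labels:

```python
from pathlib import Path

from aiida.common.links import LinkType
from aiida.orm import CalcJobNode


def dump_node_inputs(calcjob: CalcJobNode, target: Path) -> None:
    """Copy the file repositories of all input nodes of a ``CalcJobNode`` below ``target/node_inputs``."""
    for entry in calcjob.base.links.get_incoming(link_type=LinkType.INPUT_CALC).all():
        # Skip input nodes without repository files, e.g. plain ``Dict`` or ``KpointsData`` nodes.
        if not entry.node.base.repository.list_object_names():
            continue
        # Nested input namespaces appear as flattened link labels, e.g. ``pseudos__Si``.
        subdir = target / 'node_inputs' / entry.link_label
        subdir.mkdir(parents=True, exist_ok=True)
        entry.node.base.repository.copy_tree(subdir)
```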
Thanks a lot @GeigerJ2 , great progress 👍
The current branch seems to implement it though. Does it not work in its current state?
I think I would include the
You can also reduce the code duplication differently. If you were to go for my previous suggestion, then the only code besides the
I would not change the name, because it currently only dumps the content of the repo, so in my mind
Yes, I would be strongly in favor of this as opposed to one of the solutions. This PR would actually essentially provide this functionality: #6255
You could indeed split on the double underscore and nest again, but then you would get paths

The double underscores indeed come from some name mangling. Inputs in AiiDA can be nested, e.g. pseudos for the
The link labels are flat though, so nested namespaces are represented by double underscores. In a sense then, splitting on the double underscores and making that a subdirectory would map quite nicely onto nested input namespaces. But for me, either solution would be fine.
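For illustration, a two-line sketch of the splitting idea discussed here, using the `pseudos__Si` label from the trees above:

```python
from pathlib import Path

link_label = 'pseudos__Si'  # flat link label; `__` marks the nested `pseudos.Si` input namespace
nested_path = Path('node_inputs').joinpath(*link_label.split('__'))  # -> node_inputs/pseudos/Si
```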
Hi @sphuber, sorry for just getting back to you now. Thanks, happy to hear that :)
Currently, it's giving this warning: `Warning: CalcJobNode<pk> already has a remote_folder output: skipping upload` and the

I have just implemented your proposed changes. Thank you for your inputs and pointers. The
I meant also extending the dumping so that not only the
Fully agree! The PR looks like quite the extensive effort. Is there any estimated timeline for it? That is to say, would
Good, so my intuition was correct :)
Same here, don't have a strong opinion on that.
There is no fixed timeline, but it is ready for review and has been for a while. @mbercx and @edan-bainglass have both expressed interest in taking a look at this PR in the past, as well as its predecessor #6190. I would be in favor of having this in soon. Since it is backwards compatible, I don't see major blockers really.
Dear all, great work. Let me try to be the "bad guy" as much as possible. I use the case of the data for the CP2K oxides calculations in materialscloud:2023.81.

Suppose we are back 10 years ago: someone like me, "A", was doing the calculations and for some reason making them openly available. A would have had in his working space several directories such as

B downloads the tar file, executes

What I did now with AiiDA was:
The two lines above will be replaced (this is a BIG PLUS) by a link to Renku or an equivalent solution.

To inspect the content of the archive:
Here comes the big disadvantage now compared to 10 years ago:

Path to solution:
Thanks @cpignedoli for the writeup. I agree that it can be a bit daunting for a user to start to "browse" the contents of an imported archive. Without a high-level human-readable description that accompanies the archive, it can be difficult for a user to know how to analyse the data, especially if they are not familiar with the plugins used, and so they don't necessarily know or understand the workchain layouts. I fully agree that it would be great if we could improve this intelligibility automatically somehow.

That being said, I want to caution about implementing any solutions that are going to be too prejudiced by our own typical use cases and that build in assumptions about the structure of archives. For example, in your suggestion, you implicitly assume that all archives are a simple collection of a number of top-level workchains of the same type and that their main input is a

What I imagine would really help here is the following:

**Have a dead easy way to "mount" an AiiDA archive that allows to inspect it**

The Renku functionality is a great example of solving this. We have recently made a lot of progress (and continue to do so) to also make this available on a normal computer locally. It is already possible to create a profile from an archive that doesn't require PostgreSQL. I also have functionality (see #6303) that makes RabbitMQ completely optional. With these two steps, users can start inspecting an archive in two steps:

**It should be easier to browse the provenance graph**

Currently, there are a few ways to browse the contents of an archive. Use the various

As mentioned, I think we are very close to satisfactorily addressing the first part. It will be trivial to install AiiDA and mount an archive. The biggest step, I think, is to build a fantastic visual browser that can be run locally.

I think that there will always remain a need for a top-level human-readable description with archives to make them understandable to others (just like with any data publications online). One thing we could consider is to allow an archive creator to include this metadata, like a description, in the archive itself.
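As a concrete illustration of the "mount an archive" idea, a sketch assuming a recent aiida-core version where the `core.sqlite_zip` storage backend can read archive files directly (the archive filename is a placeholder):

```python
from aiida import load_profile, orm
from aiida.storage.sqlite_zip.backend import SqliteZipBackend

# Create a temporary, read-only profile backed directly by the archive file.
profile = SqliteZipBackend.create_profile('archive.aiida')
load_profile(profile)

# The archive contents can now be queried like any other profile.
print(orm.QueryBuilder().append(orm.WorkChainNode).count())
```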
Hi @sphuber, I fully agree that my oversimplification can be too misleading. What you suggest/anticipate goes perfectly in the right direction, and "easy browsing" for a general case would of course be a great success. I will test the
Great comments, I agree with what has been said. I think indeed that there is some responsibility with whoever puts the archive online to provide minimal guidance to users. This could just be, e.g.:

1. A CSV file attached to the published entry (with UUIDs and any other relevant info to explain what is "top level"). For instance, in the case of the verification paper, it could be columns with code name, element name, and configuration, followed by the UUID. This is enough for a human to quickly find the relevant workchain and pass it to the dump command by just looking at the CSV.

If we agree on this and implement this, we can then update the Materials Cloud Archive entries of a few of our own repos with such scripts (in the correct format to be autoloaded), as good examples to follow. And then invest a bit in "training/communication": e.g. make a blog post explaining the philosophy behind them and linking to those examples, having "looking into the data of others" as one of the first steps of the next tutorial we will organise, etc.

We can always come up in the future with more general ways to do this, but trying to find a general solution that works automatically for any archive and for people who don't know AiiDA is very challenging (if at all possible), while I think the approach I suggest is quite practical to implement: the scripts are easy to provide for data owners and easy to use for people who just want to look into the data (again, in the template for the Jupyter notebook you can both put the short querying script that prints out UUIDs, and also some minimal text to guide people on what to do, e.g., "get the UUID from the table above for the system you are interested in, and run this command
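A minimal sketch of such a "querying script" that a data owner could ship alongside an archive (written against the current aiida-core API; the printed `verdi process dump` command assumes the feature added in this PR):

```python
from aiida import load_profile, orm

load_profile()  # e.g. a profile created from the published archive

# Print the UUIDs of all top-level workchains, i.e. those not called by another workflow.
query = orm.QueryBuilder().append(orm.WorkChainNode, project=['uuid', 'attributes.process_label'])

for uuid, process_label in query.all():
    node = orm.load_node(uuid)
    if node.caller is None:
        print(f'{process_label:<40} {uuid}   ->   verdi process dump {uuid}')
```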
Thanks @GeigerJ2, could you please update your branch, address the conflicts and make sure the tests run?
To be consistent with other commands in `cmd_process`. And with that, renamed `process_dump` and `calcjob_dump` in `processes.py` to `process_node_dump` and `calcjob_node_dump`.
Now in function `generate_calcjob_io_dump_paths`. Takes care of handling the `flat` argument and the naming of the `raw_inputs`, `raw_outputs`, and `node_inputs` subdirectories.
Changed the `--flat` option to still create subdirectories for the individual steps of the WorkChain; instead, just the subdirectories per CalcJob are removed. Generalized the dumping of outputs so that it doesn't only dump `retrieved` -> with this, it's dumping a whole range of AiiDA nodes, basically all the parsed outputs, which are mainly numpy arrays dumped as `.npy` files. Added an option to enable this, as it might not be necessary to dump all of those. Currently, I just defined a global variable in the file, but this will eventually become a class attribute of the ProcessDumper class.
To avoid having to pass all the arguments `include_node_inputs`, `include_attributes/extras`, `overwrite`, `flat`, `all_aiida_nodes` through the different functions, everything related to the dumping is now compiled in the `ProcessDumper` class, which defines the main entry-point method `dump`. For nested workflows, this is recursively called (as before). Once `CalculationFunction` nodes are reached, their content is dumped via `dump_calculation_node`. The helper functions to create and validate labels and paths of nested subdirectories are also methods of the `ProcessDumper`. Introduced the `parent_process` class attribute, which is dynamically generated from the parent node and which is used to generate the main README; the README is only created when the dumping is done via the `verdi` CLI. For the other functions, this concept does not make sense due to the recursion, so the respective `process_node`s (which change during the recursion) are always passed as arguments.

Next steps:

- Update tests to actually test the new implementations
- Update docstrings
- Add section to `How to work with data` section of the docs
- If the `OverridableOptions` are only used here, they can also just be defined as normal `click` options (however, we can also start thinking about the `verdi archive dump` functionality that we should start implementing soon)
Moved the recursive logic out of the top-level `dump` function and into `_workflow_dump`. In addition, moved the default path creation and validation into the top-level `dump` function and out of the `cmd_process.py` file. The following entities are now dumped for each child `CalculationNode` reached during the dumping:

- `CalculationNode` repository -> `inputs`
- `CalculationNode` retrieved output -> `outputs`
- `CalculationNode` input nodes -> `node_inputs`
- `CalculationNode` output nodes (apart from `retrieved`) -> `node_outputs`

By default, everything apart from the `node_outputs` is dumped, so as to avoid too many non-`SinglefileData` or `FolderData` nodes being written to disk. The `--all-aiida-nodes` option is instead removed. The number of files might still grow large for complex workchains, e.g. `SelfConsistentHubbardWorkChain` or `EquationOfStateWorkChain`.

In addition, set `_generate_default_dump_path`, `_generate_readme`, and `_generate_child_node_label` as `staticmethod`s, as they logically belong to the class but don't access any of its attributes. The former two are further only called in the top-level `dump` method. Other methods like `_validate_make_dump_path` still access class attributes like `overwrite` or `flat`, so they remain normal class methods.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #6276      +/-   ##
==========================================
+ Coverage   77.51%   77.66%   +0.16%
==========================================
  Files         560      562       +2
  Lines       41444    41652     +208
==========================================
+ Hits        32120    32344     +224
+ Misses       9324     9308      -16
```

☔ View full report in Codecov by Sentry.
Thanks a lot @qiaojunfeng for the original idea and @GeigerJ2 for the hard work on the implementation. Let's get this show on the road
This commit adds functionality to write all files involved in the execution of a workflow to disk. This is achieved via the new `ProcessDumper` class, which exposes the top-level `dump` method, while `verdi process dump` provides a wrapper for access via the CLI.

Instantiating the `ProcessDumper` class is used to set the available options for the dumping. These are the `-o/--overwrite` option, the `--io-dump-paths` option which can be used to provide custom subdirectories for the folders created for each `CalculationNode` (the dumped data being the `CalculationNode` repository, its `retrieved` outputs, as well as the linked node inputs and outputs), the `-f/--flat` option that disables the creation of these subdirectories, thus creating all files in a flat hierarchy (for each step of the workflow), and the `--include-inputs/--exclude-inputs` (`--include-outputs/--exclude-outputs`) options to enable/disable the dumping of linked inputs (outputs) for each `CalculationNode`. In addition, a `README` is created in the parent dumping directory, as well as `.aiida_node_metadata.yaml` files with the `Node`, `User`, and `Computer` information in the subdirectories created for each `ProcessNode`.

Nested workchains with considerable file I/O were needed for meaningful testing of this feature, so it was required to extend the `generate_calculation_node` fixture of `conftest.py`. Moreover, the `generate_calculation_node_add` and `generate_workchain_multiply_add` fixtures that actually run the `ArithmeticAddCalculation` and `MultiplyAddWorkChain` were also added. These could in the future possibly be used to reduce code duplication where the objects are being constructed in other parts of the test suite (benchmarking of manually constructing the `ProcessNode`s vs. running the `Process` will still have to be conducted). Lastly, the `generate_calculation_node_io` and `generate_workchain_node_io` were added in `test_processes.py`, which actually create the `CalculationNode`s and `WorkflowNode`s that are used for the tests of the dumping functionality.

Co-Authored-By: Junfeng Qiao <[email protected]>
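For orientation, a hypothetical invocation using some of the options named in the commit message above (the pk is a placeholder; the flag spellings are those listed there):

```shell
# Dump the workchain with pk 462, overwriting a previous dump and skipping
# the linked input nodes of each CalculationNode.
verdi process dump 462 --overwrite --exclude-inputs

# Same, but write all files of each workflow step into a single flat directory.
verdi process dump 462 -o -f
```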
Just FYI, since the original PR was made by @qiaojunfeng, GitHub automatically set him as the author. However, since @GeigerJ2 has really been the main author, I have changed that and have added @qiaojunfeng in the Co-Authored-By attribution. I hope that is OK with both of you.
Fine for me, thanks a lot @sphuber! 🙏
Currently, we have `verdi calcjob inputcat/outputcat` for `CalcJob`, but there is no straightforward way to dump the inputs/outputs of a `WorkChain` to some files. The use case of this can be:

- rerunning the calculations manually (e.g. `pw.x`) using the exact same inputs generated by AiiDA, without working on the Python side

Here I added an example CLI command, `verdi workchain inputsave`, to recursively iterate through the called descendants of a workchain and save the input file of the called `CalcJob` into a folder. For example, the structure of the saved folder for a `Wannier90BandsWorkChain` is:

The current PR is working, but still there are plenty of tasks to be done to make it generic enough.

TODO

- Should this be part of `verdi calcjob`, a new `verdi workchain`, or sth else that combines both `calcjob` and `workchain` (since now the command works for both `CalcJob` and `WorkChain`)?
- Should `uuid` be included in the folder names? Currently I am using link labels
- comments from Giovanni
Further optional TODO
In addition, there are several other convenience CLI commands I often use; they are here:
https://github.com/aiidateam/aiida-wannier90-workflows/blob/main/src/aiida_wannier90_workflows/cli/node.py
It would be great if someone versed in aiida-core could use these as inspiration and write some generic CLI commands that cover these usages. Thanks!