More thorough software tracking #17
Comments
We encode our workflow as a graph in JSON... It gets big, but it works, and we put it in the header. The workflow is then reloadable from the single .ort file.
@bmaranville I'd be interested to know more about how you do this. Do you have an example file? How do you store the graph? Maybe we can use the same format. But this doesn't solve my original problem: we would still need to encode multiple pieces of software and their versions.
The format is home-made, with graph nodes defined in an ordered list of "modules" and connections in a list of "wires". Since a "module" can have multiple inputs and outputs, the schema for a wire source or target is [not captured in this transcript]. As you can see, we indicate the version of the (single) software package used with a git hash at the end. Here is an example header: [not captured in this transcript]
Thanks! It is good to see that other people are having similar ideas :) So you just put the graph into the header? In case you are interested, here is an example of what our graphs look like (WIP). This was generated using Sciline:

```json
{
  "directed": true,
  "multigraph": false,
  "nodes": [
    {
      "id": "2",
      "kind": "function",
      "label": "load",
      "function": "__main__.load",
      "args": ["0"],
      "kwargs": {}
    },
    {
      "id": "4",
      "kind": "data",
      "label": "RawData",
      "type": "__main__.RawData"
    },
    {
      "id": "1",
      "kind": "data",
      "label": "Filename",
      "type": "__main__.Filename"
    },
    {
      "id": "6",
      "kind": "function",
      "label": "normalize",
      "function": "__main__.normalize",
      "args": ["5", "7"],
      "kwargs": {}
    },
    {
      "id": "10",
      "kind": "data",
      "label": "NormalizedData",
      "type": "__main__.NormalizedData"
    },
    {
      "id": "8",
      "kind": "data",
      "label": "NormalizationFactor",
      "type": "__main__.NormalizationFactor"
    }
  ],
  "edges": [
    {"id": "0", "source": "1", "target": "2"},
    {"id": "3", "source": "2", "target": "4"},
    {"id": "5", "source": "4", "target": "6"},
    {"id": "7", "source": "8", "target": "6"},
    {"id": "9", "source": "6", "target": "10"}
  ]
}
```
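A graph in this shape can be replayed by visiting nodes in dependency order. The sketch below is a minimal illustration, not part of any ORSO tooling: it embeds a trimmed copy of the example graph and orders its nodes with a standard topological sort (Kahn's algorithm).

```python
# Hypothetical sketch: reload a workflow graph (trimmed from the WIP
# example above) and compute an execution order for its nodes.
import json
from collections import deque

graph_json = """
{
  "directed": true,
  "nodes": [
    {"id": "1", "kind": "data", "label": "Filename"},
    {"id": "2", "kind": "function", "label": "load"},
    {"id": "4", "kind": "data", "label": "RawData"}
  ],
  "edges": [
    {"id": "0", "source": "1", "target": "2"},
    {"id": "3", "source": "2", "target": "4"}
  ]
}
"""

def execution_order(graph):
    """Return node ids in dependency order (Kahn's algorithm)."""
    nodes = [n["id"] for n in graph["nodes"]]
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for edge in graph["edges"]:
        successors[edge["source"]].append(edge["target"])
        indegree[edge["target"]] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("workflow graph contains a cycle")
    return order

graph = json.loads(graph_json)
print(execution_order(graph))  # Filename before load before RawData
```

Because both formats boil down to node and edge lists, a converter between the two home-made schemas would only need to map field names.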
Context
Currently, we can use `reduction.software` to specify a single program and version that was used to produce the data. This is enough in cases where the program is fully self-contained. But this is not always the case. For example, reduction software may be published as a Python package which depends on a number of other packages.

For reproducibility, we need to track (some) dependencies as well as the 'main' package. In the Python example, the most complete solution would be to store the output of `pip freeze` or `conda list`. However, most of those packages are not relevant for reproducing data (barring possible bugs in those packages). More importantly, we need to track pieces of software that provide algorithms which may change in the future and impact the result.

In our concrete case at ESS, we have ESSreflectometry as the highest-level package. It uses algorithms from ESSreduce and ScippNeutron. All three of those packages need to be listed with their versions if we hope to reproduce reduced data in the future.
For full provenance tracking, we need more than what can be reasonably encoded in YAML, e.g., a full list of packages (`pip freeze`) and a description of the concrete workflow beyond a short list of `corrections`. The latter would likely take the form of a graph. This information can be saved in separate files alongside an `.ort` file.

Proposed solution
Allow `reduction.software` to be an array. This way we can track all pieces of software we deem relevant.