Description
Context
Currently, we can use reduction.software to specify a single program and version that was used to produce the data. This is enough in cases where the program is fully self-contained. But this is not always the case. For example, reduction software may be published as a Python package which depends on a number of other packages.
For reproducibility, we need to track (some) dependencies as well as the 'main' package. In the Python example, the most complete solution would be to store the output of pip freeze or conda list. However, most of those packages are not relevant for reproducing data (barring possible bugs in those packages). What matters more is tracking the pieces of software that provide algorithms which may change in the future and impact the result.
In our concrete case at ESS, we have ESSreflectometry as the highest level package. It uses algorithms from ESSreduce and ScippNeutron. All three of those packages need to be listed with their versions if we hope to reproduce reduced data in the future.
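A minimal sketch of how those versions could be collected at reduction time. It assumes the three packages are installed under the distribution names essreflectometry, essreduce, and scippneutron, which may differ from the actual names. The helper collect_software is hypothetical and not part of any existing package; it relies only on importlib.metadata from the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed distribution names; adjust to the names actually used on PyPI or conda.
RELEVANT_PACKAGES = ("essreflectometry", "essreduce", "scippneutron")


def collect_software(names=RELEVANT_PACKAGES):
    """Return one {'name', 'version'} entry per package (hypothetical helper)."""
    entries = []
    for name in names:
        try:
            found = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment; record that explicitly.
            found = None
        entries.append({"name": name, "version": found})
    return entries
```

Unlike pip freeze, this records only the packages we deem relevant for the result, which keeps the list short enough for a file header.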
Full provenance tracking needs more than can reasonably be encoded in YAML, e.g., a full list of packages (pip freeze) and a description of the concrete workflow beyond a short list of corrections. The latter would likely take the form of a graph. This information can be saved in separate files alongside an .ort file.
Proposed solution
Allow reduction.software to be an array. This way we can track all pieces of software we deem relevant.
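One way a reader could stay compatible with existing files is to accept both the current single entry and the proposed array. The sketch below assumes each entry is a mapping with name and version keys, mirroring the current single-entry form; normalize_software is a hypothetical helper and the version strings are placeholders, not real releases.

```python
def normalize_software(value):
    """Return reduction.software as a list of entries (hypothetical helper).

    Accepts the current form (one mapping) and the proposed form (an array).
    """
    if isinstance(value, dict):
        # Current form: a single program, wrapped so callers always get a list.
        return [value]
    if isinstance(value, (list, tuple)):
        # Proposed form: several programs, returned as a plain list.
        return list(value)
    raise TypeError(f"unsupported reduction.software value: {type(value).__name__}")


# Placeholder entries for illustration only.
single = {"name": "ESSreflectometry", "version": "x.y.z"}
several = [
    {"name": "ESSreflectometry", "version": "x.y.z"},
    {"name": "ESSreduce", "version": "x.y.z"},
    {"name": "ScippNeutron", "version": "x.y.z"},
]
```

With this, normalize_software(single) yields a one-element list and normalize_software(several) returns the three entries unchanged, so downstream code only has to handle the array form.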