More thorough software tracking #17

Open
jl-wynen opened this issue Jul 12, 2024 · 5 comments

Comments

@jl-wynen

Context

Currently, we can use reduction.software to specify a single program and version that was used to produce the data. This is enough in cases where the program is fully self-contained. But this is not always the case. For example, reduction software may be published as a Python package which depends on a number of other packages.

For reproducibility, we need to track (some) dependencies as well as the 'main' package. In the Python example, the best solution would be to store the output of pip freeze or conda list. However, most of those packages are not relevant for reproducing data (barring possible bugs in those packages). More importantly, we need to track pieces of software that provide algorithms which may change in the future and impact the result.

In our concrete case at ESS, we have ESSreflectometry as the highest level package. It uses algorithms from ESSreduce and ScippNeutron. All three of those packages need to be listed with their versions if we hope to reproduce reduced data in the future.

For full provenance tracking, we need more than what can reasonably be encoded in YAML, e.g., a full list of packages (pip freeze) and a description of the concrete workflow beyond a short list of corrections. The latter would likely take the form of a graph. This information can be saved in separate files alongside an .ort file.
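As a rough illustration of saving the full package list next to an .ort file, here is a minimal sketch that uses only the standard library. The function name `write_package_sidecar` and the `.packages.txt` naming convention are hypothetical, not something ORSO defines.

```python
from importlib import metadata
from pathlib import Path


def write_package_sidecar(ort_path: str) -> Path:
    """Write 'name==version' for each installed distribution next to the .ort file."""
    ort_file = Path(ort_path)
    # Hypothetical naming convention: data.ort -> data.packages.txt
    sidecar = ort_file.with_suffix(".packages.txt")
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    # Sort case-insensitively so the file is stable between runs.
    sidecar.write_text("\n".join(sorted(lines, key=str.lower)) + "\n")
    return sidecar
```

This is similar in spirit to `pip freeze`, but it lists everything visible to the running interpreter and does not record how each package was installed.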

Proposed solution

Allow reduction.software to be an array. This way we can track all pieces of software we deem relevant.
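A minimal sketch of how such an array could be filled in from Python, assuming the relevant packages are installed under the distribution names used below (those names are guesses based on the package names mentioned above). `collect_software` is a hypothetical helper, and the list form of `reduction.software` is the proposal here, not something the current format allows.

```python
from importlib import metadata


def collect_software(package_names):
    """Return one {'name', 'version'} entry per package, in the given order."""
    entries = []
    for name in package_names:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = None  # not installed in this environment
        entries.append({"name": name, "version": version})
    return entries


# Distribution names are assumptions based on the package names in this issue.
reduction = {
    "software": collect_software(["essreflectometry", "essreduce", "scippneutron"])
}
```

Serialized to YAML, this would give a sequence of name/version mappings under `reduction.software`, one per tracked package.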

@bmaranville
Contributor

We encode our workflow as a graph in JSON. It gets big, but it works, and we put it in the header. The workflow is then reloadable from the single .ort file alone.

@jl-wynen
Author

@bmaranville I'd be interested to know more about how you do this. Do you have an example file? How do you store the graph? Maybe we can use the same format.

But this doesn't solve my original problem. We would still need to encode multiple pieces of software and their versions.

@bmaranville
Contributor

The format is home-made, with graph nodes defined in an ordered list of "modules", and connections in a list of "wires". We have the possibility of multiple inputs and outputs per "module", so the schema for a wire source or target is [<module index (integer)>, <input/output id (string)>] and when we draw the graph we render the multiple inputs and outputs for the wires to connect to.

As you can see, we indicate the version of the (single) software package used with a git hash at the end.

Here is an example header:

# # ORSO reflectivity data file | 1.1 standard | YAML encoding | https://www.reflectometry.org/
# data_source:
#   owner:
#     name: null
#     affiliation: null
#   experiment:
#     title: null
#     instrument: NCNR NG7Refl
#     start_date: null
#     probe: neutron
#   sample:
#     name: null
#   measurement:
#     instrument_settings:
#       incident_angle: {min: 0.17400102317333221, max: 5.456196308135986, unit: degrees}
#       wavelength: {magnitude: 4.0, unit: angstrom, error: {error_value: 0.025479654008640572,
#           error_type: resolution, value_is: sigma, distribution: gaussian}}
#       polarization: unpolarized
#     data_files: []
# reduction:
#   software: {name: reductus, template_data: {template: {modules: [{title: load spec,
#             module: ncnr.refl.super_load, x: 30, y: 10, config: {intent: specular,
#               filelist: [{source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/cr002163.nxs.ng7,
#                   mtime: 1444201042, entries: [unpolarized]}, {source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/cr002166.nxs.ng7,
#                   mtime: 1444209492, entries: [unpolarized]}]}, text_width: 80, version: '2022-04-27'},
#           {y: 10, x: 170, module: ncnr.refl.mask_points, title: mask, text_width: 51,
#             version: '2019-07-02'}, {y: 10, x: 310, module: ncnr.refl.join, title: join,
#             text_width: 38, version: '2020-12-15', config: {dQ_tolerance: 0.01}},
#           {title: load bg+, module: ncnr.refl.super_load, x: 30, y: 50, config: {
#               intent: background+, filelist: [{source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/cr002164.nxs.ng7,
#                   mtime: 1444202261, entries: [unpolarized]}]}, text_width: 73, version: '2022-04-27'},
#           {y: 50, x: 170, module: ncnr.refl.mask_points, title: mask, text_width: 51,
#             version: '2019-07-02'}, {title: load bg-, module: ncnr.refl.super_load,
#             x: 30, y: 90, config: {intent: background-, filelist: [{source: ncnr,
#                   path: ncnrdata/ng7/201509/21407_tiau_oct15/data/cr002165.nxs.ng7,
#                   mtime: 1444203481, entries: [unpolarized]}]}, text_width: 69, version: '2022-04-27'},
#           {y: 90, x: 170, module: ncnr.refl.mask_points, title: mask, text_width: 51,
#             version: '2019-07-02'}, {y: 90, x: 310, module: ncnr.refl.join, title: join,
#             text_width: 38, version: '2020-12-15'}, {y: 50, x: 310, module: ncnr.refl.join,
#             title: join, text_width: 38, version: '2020-12-15'}, {title: load slit,
#             module: ncnr.refl.super_load, x: 30, y: 130, config: {intent: intensity,
#               filelist: [{source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/sl00164.nxs.ng7,
#                   mtime: 1444073824, entries: [unpolarized]}, {source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/sl00165.nxs.ng7,
#                   mtime: 1444074435, entries: [unpolarized]}, {source: ncnr, path: ncnrdata/ng7/201509/21407_tiau_oct15/data/sl00166.nxs.ng7,
#                   mtime: 1444075076, entries: [unpolarized]}], Qz_basis: detector},
#             text_width: 67, version: '2022-04-27'}, {y: 130, x: 170, module: ncnr.refl.mask_points,
#             title: mask, text_width: 51, version: '2019-07-02'}, {y: 10, x: 455, module: ncnr.refl.subtract_background,
#             title: sub bg, text_width: 59, version: '2016-03-23'}, {y: 130, x: 310,
#             module: ncnr.refl.rescale, title: rescale, text_width: 63, config: {scale: [
#                 1, 10.373871189574494, 205.43280520566898]}, version: '2015-12-17'},
#           {y: 130, x: 455, module: ncnr.refl.join, title: join, text_width: 38, version: '2020-12-15'},
#           {y: 55, x: 600, module: ncnr.refl.divide_intensity, title: divide, text_width: 53,
#             version: '2020-07-23'}], wires: [{source: [0, output], target: [1, data]},
#           {source: [1, output], target: [2, data]}, {source: [2, output], target: [
#               11, data]}, {source: [3, output], target: [4, data]}, {source: [4, output],
#             target: [8, data]}, {source: [8, output], target: [11, backp]}, {source: [
#               5, output], target: [6, data]}, {source: [6, output], target: [7, data]},
#           {source: [7, output], target: [11, backm]}, {source: [9, output], target: [
#               10, data]}, {source: [10, output], target: [12, data]}, {source: [12,
#               output], target: [13, data]}, {source: [13, output], target: [14, base]},
#           {source: [11, output], target: [14, data]}]}, config: {}, node: 14, terminal: output,
#       server_git_hash: d9a3af28b1c634cfead07c6e0784dbfc2b22bdca, export_type: ORSO_text}}
# data_set: cr002163:unpolarized
# columns:
# - {name: Qz, unit: 1/angstrom}
# - {name: R}
# - {error_of: R, error_type: uncertainty, value_is: sigma, distribution: gaussian}
# - {error_of: Qz, error_type: resolution, value_is: sigma, distribution: gaussian}
# - {name: incident_angle, unit: degrees, physical_quantity: incident_angle}
# - {error_of: incident_angle, error_type: uncertainty, value_is: sigma, distribution: gaussian}
# # Qz (1/angstrom)    R                      sR                     sQz                    incident_angle (degrees)sincident_angle       
9.5406590243449271e-03 3.2574058449359672e-01 7.0630371492072406e-03 2.6095631125376865e-04 1.7400102317333221e-01 4.6284325652104098e-03
1.0122778718735669e-02 3.0767090459365937e-01 6.1598304395443096e-03 2.7790914358879505e-04 1.8461766534052548e-01 4.9301643680911000e-03
1.0729492877846396e-02 2.9135374780414597e-01 5.6923149104897144e-03 2.9382444683640824e-04 1.9568286570627932e-01 5.2117591268719095e-03
1.1318540956934043e-02 2.5906232920597172e-01 5.0734309129009982e-03 3.1078622642878194e-04 2.0642588061121925e-01 5.5134668946266174e-03
1.1921291609349320e-02 2.5057936581442553e-01 4.7493786367579218e-03 3.2647067432808755e-04 2.1741880954136492e-01 5.7908440940082073e-03
...

And here is the graph rendered from that header:
[image: workflow graph rendered from the header above]
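To make the modules/wires layout concrete, here is a minimal sketch of reading such a template and deriving an execution order from the wires. The template below is a reduced, hand-written example in the shape described above, not the full header, and `execution_order` is a hypothetical helper rather than part of reductus. It uses `graphlib`, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Reduced example: three modules connected in a chain.
template = {
    "modules": [
        {"module": "ncnr.refl.super_load", "title": "load spec"},
        {"module": "ncnr.refl.mask_points", "title": "mask"},
        {"module": "ncnr.refl.join", "title": "join"},
    ],
    "wires": [
        {"source": [0, "output"], "target": [1, "data"]},
        {"source": [1, "output"], "target": [2, "data"]},
    ],
}


def execution_order(template):
    """Return module titles in an order that respects all wires."""
    modules = template["modules"]
    # Map each module index to the set of module indices it depends on.
    deps = {i: set() for i in range(len(modules))}
    for wire in template["wires"]:
        # Each wire end is [<module index>, <input/output id>].
        (src, _src_terminal), (tgt, _tgt_terminal) = wire["source"], wire["target"]
        if not (0 <= src < len(modules) and 0 <= tgt < len(modules)):
            raise ValueError(f"wire references unknown module: {wire}")
        deps[tgt].add(src)
    return [modules[i]["title"] for i in TopologicalSorter(deps).static_order()]


print(execution_order(template))
```

Because wires refer to modules by list position, the order of the "modules" list has to be preserved when the header is rewritten.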

@jl-wynen
Author

Thanks! It is good to see that other people are having similar ideas :)

So you just put the graph into reduction.software. Is this technically allowed by the file format?

In case you are interested, here is an example of what our graphs look like (WIP). This was generated using Sciline.

{
  "directed": true,
  "multigraph": false,
  "nodes": [
    {
      "id": "2",
      "kind": "function",
      "label": "load",
      "function": "__main__.load",
      "args": [
        "0"
      ],
      "kwargs": {}
    },
    {
      "id": "4",
      "kind": "data",
      "label": "RawData",
      "type": "__main__.RawData"
    },
    {
      "id": "1",
      "kind": "data",
      "label": "Filename",
      "type": "__main__.Filename"
    },
    {
      "id": "6",
      "kind": "function",
      "label": "normalize",
      "function": "__main__.normalize",
      "args": [
        "5",
        "7"
      ],
      "kwargs": {}
    },
    {
      "id": "10",
      "kind": "data",
      "label": "NormalizedData",
      "type": "__main__.NormalizedData"
    },
    {
      "id": "8",
      "kind": "data",
      "label": "NormalizationFactor",
      "type": "__main__.NormalizationFactor"
    }
  ],
  "edges": [
    {
      "id": "0",
      "source": "1",
      "target": "2"
    },
    {
      "id": "3",
      "source": "2",
      "target": "4"
    },
    {
      "id": "5",
      "source": "4",
      "target": "6"
    },
    {
      "id": "7",
      "source": "8",
      "target": "6"
    },
    {
      "id": "9",
      "source": "6",
      "target": "10"
    }
  ]
}
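For readers who want to inspect a graph like the one above, here is a minimal sketch that resolves each function node to its input and output data labels. It assumes that the entries in "args" are edge ids and that each function node has exactly one outgoing edge pointing at its result; both are inferences from this one example, not documented Sciline behaviour. The JSON is a trimmed copy of the graph above, and `describe_functions` is a hypothetical helper.

```python
import json

graph = json.loads("""
{
  "nodes": [
    {"id": "1", "kind": "data", "label": "Filename"},
    {"id": "2", "kind": "function", "label": "load", "args": ["0"]},
    {"id": "4", "kind": "data", "label": "RawData"},
    {"id": "8", "kind": "data", "label": "NormalizationFactor"},
    {"id": "6", "kind": "function", "label": "normalize", "args": ["5", "7"]},
    {"id": "10", "kind": "data", "label": "NormalizedData"}
  ],
  "edges": [
    {"id": "0", "source": "1", "target": "2"},
    {"id": "3", "source": "2", "target": "4"},
    {"id": "5", "source": "4", "target": "6"},
    {"id": "7", "source": "8", "target": "6"},
    {"id": "9", "source": "6", "target": "10"}
  ]
}
""")


def describe_functions(graph):
    """Map each function label to (input data labels, output data label)."""
    labels = {node["id"]: node["label"] for node in graph["nodes"]}
    edges = {edge["id"]: edge for edge in graph["edges"]}
    # Assumed: a function has exactly one outgoing edge, pointing at its result.
    result_of = {edge["source"]: edge["target"] for edge in graph["edges"]}
    described = {}
    for node in graph["nodes"]:
        if node["kind"] != "function":
            continue
        # Assumed: each entry in "args" is the id of an incoming edge.
        inputs = [labels[edges[edge_id]["source"]] for edge_id in node["args"]]
        described[node["label"]] = (inputs, labels[result_of[node["id"]]])
    return described


for name, (inputs, output) in describe_functions(graph).items():
    print(f"{name}({', '.join(inputs)}) -> {output}")
```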

@bmaranville
Contributor

The name attribute is required (but can be null) in the Software schema, and the implementation also allows adding arbitrary attributes that can be serialized to JSON/YAML (in this case the template graph). I think there should be a predefined, flexible place to put the extended reduction information, though.
