Datalad integration concept #75
Labels: concept, documentation, enhancement, question
This is a write-up of what @FeHoff and I figured so far.
The core of it is of course to wrap a `junifer run` call in a `datalad run` call, capturing the results in a dataset. That would produce a re-executable (via `datalad rerun`) record in the commit message. The YAML file would be saved to the result dataset before that execution; hence, the record refers to a specific version of that YAML file.
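Roughly, such a run record would look like this (a sketch; the exact command and the JSON fields depend on the datalad version and the actual call):

```
[DATALAD RUNCMD] junifer run config.yaml

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "junifer run config.yaml",
 "dsid": "<dataset-uuid>",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
```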
The input data would be added as a subdataset (also before execution, obviously).
For any input dataset there could then be a "central" superdataset, that would collect such result datasets as subdatasets (could be done via pull request), hence enabling discovery of what is already available. That would be an optional step, though.
Such a superdataset could also become the actual entry point for junifer users: clone that superdataset, add a new subdataset for the results, put the YAML in it, and clone the input dataset into it. Then `datalad run "junifer run"` from within the results dataset and finally `datalad save` in the superdataset. This would allow for two things: discovery of what is already available from the superdataset is a local operation, and the final state of things can directly be turned into a PR to update the superdataset with the new results.
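A rough sketch of that workflow with datalad's Python API (the URLs, dataset names and the YAML file name are placeholders):

```python
import datalad.api as dl

# clone the superdataset; discovery of existing results is then local
super_ds = dl.clone(source="https://example.org/super", path="super")

# create a new result dataset, registered as a subdataset of super
result = super_ds.create(path="some_result")

# put the YAML into the result dataset (e.g. copy config.yaml into
# super/some_result/) and save it before execution
result.save(message="Add junifer config")

# clone the input data as a subdataset of the result dataset
result.clone(source="https://example.org/input-dataset", path="input")

# execute junifer within the result dataset, producing the run record
result.run("junifer run config.yaml")

# record the new state of the result subdataset in the superdataset
super_ds.save(path="some_result", message="Add new junifer results")
```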
The `datalad run` execution can be hidden in a dedicated junifer command (or an option to `junifer run`). Say, `junifer run --datalad` would then internally call `datalad run "junifer run"`.
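For illustration, that wrapping could internally be something as simple as this (the function name and arguments are made up):

```python
import datalad.api as dl

def run_with_datalad(result_dataset: str, yaml_file: str) -> None:
    """Run `junifer run <yaml_file>` wrapped in a datalad run record."""
    ds = dl.Dataset(result_dataset)
    # datalad records the command and the resulting changes in a commit
    # that `datalad rerun` can replay later
    ds.run(f"junifer run {yaml_file}", message="junifer run")
```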
For `junifer queue`, a RIA store would need to be given. It would then push the prepared dataset hierarchy into that store and submit a job that clones the entire thing from the store, switches to a job-specific branch, runs `junifer run --datalad`, and pushes back to the store afterwards. A merge of the job branches would be needed once all those branches are back in the store (see also the FAIRly big workflow).
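A rough sketch of that queue flow with the Python API (the store URL, sibling name and branch name are placeholders, and `new_store_ok` needs a recent datalad):

```python
import datalad.api as dl

store_url = "ria+ssh://example.org/store"

# prepare: push the dataset hierarchy into the RIA store
ds = dl.Dataset("path/to/result/dataset")
ds.create_sibling_ria(store_url, name="store", new_store_ok=True)
ds.push(to="store", recursive=True)

# inside a compute job: clone the whole thing back from the store ...
job_ds = dl.clone(source=f"{store_url}#{ds.id}", path="job_workdir")
# ... switch to a job-specific branch ...
job_ds.repo.call_git(["checkout", "-b", "job-001"])
# ... run `junifer run --datalad` in there, then push the job branch
# back into the store (the branches get merged once all jobs are done)
job_ds.push(to="origin")
```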
This would imply minor changes to the DataladDataGrabber: `__enter__` would need to clone the input dataset into the result dataset (`datalad clone -d path/to/result/dataset URL DEST` or `datalad.api.Dataset("path/to/result/dataset").clone(URL, DEST)`), while `__exit__` would then just `uninstall` the input dataset (we don't want to change/delete the reference, but simply not have it locally anymore). So, the entire structure would look something like this:
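(A sketch of the layout; only the datasets named below are real, the YAML and storage file names are just illustrative.)

```
super/
├── some_result/            # result dataset (subdataset of super)
│   ├── config.yaml
│   ├── input/              # input dataset (subdataset of some_result)
│   └── storage/            # junifer storage output
└── other_result/           # another result dataset
    ├── config.yaml
    ├── input/              # the same input dataset, referenced again
    └── storage/
```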
Where `super`, `some_result`, `other_result` and the `input`s are datasets. Note that the inputs are supposed to be the same dataset. Reminder that this is not a duplication; a subdataset is per se only a reference. But this also means that every result dataset is self-contained; `super` is only there for discovery. There's a bunch of questions to decide upon, though:
- Should `junifer run --datalad` expect the result dataset (and possibly its super) to be there already, or should it take care of creating it itself?
- `workdir`, `datadir`, the storage's `uri` etc. need to point into that result dataset. Which piece of code (if any) would be best placed to make sure of that? To what extent would junifer want to enforce such things?
- One thing that breaks re-execution (via `datalad rerun`) is absolute paths in the YAML. I suppose the defined `workdir` is largely there to determine the working directory of a compute job in `junifer queue`. It seems to me that it should rather be an option to `junifer queue` than a (committed) entry in the YAML.
- `datadir` and the storage `uri` need to be relative to the result dataset's root (see the sketch below).
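For illustration, the dataset-relative part of such a YAML could then look roughly like this (the keys follow junifer's config layout only loosely; treat the names as placeholders):

```yaml
datagrabber:
  kind: DataladDataGrabber
  uri: https://example.org/input-dataset   # input dataset to clone
  datadir: input                           # relative to the result dataset root
storage:
  kind: SQLiteFeatureStorage
  uri: storage/results.sqlite              # relative to the result dataset root
# workdir intentionally omitted: arguably better passed as an option to `junifer queue`
```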