Come up with a design for how to separate IO #44
Some general thoughts. First, I'm trying to focus on what exactly we are trying to solve with this.

I've crossed out the middle requirement because, at least among the stages currently listed in #3, the HD_BET step is the only one that requires writing to disk, and that is just a constraint of its API. If we are tweaking the API, we can have it read a numpy array directly, and these two points are moot. I don't think it's good to architect the API around optimizing for this one pathological step out of the many that don't have the problem.

The pipeline has a very simple structure: there are no (*) dependencies between cases, and the dependencies are linear; each step depends only on the steps before it. I suggest encoding these dependencies with the method-chaining strategy listed here, with usage like:

```python
data = obtain_one_case()
data.as_nrrd().gen_mask().fit_dti().estimate_fods().compute_degree_powers()
```

Output files can be manually written to a particular location, or cached in a `cache_dir`:

```python
data = obtain_one_case(cache_dir=...)
nrrds = data.as_nrrd()
nrrds.save_to(...)
masks = nrrds.gen_mask()
# masks.save_to(...)
powers = masks.fit_dti().estimate_fods().compute_degree_powers()
powers.save_to()
```

(*) The population template step does have cross-case dependencies, but they are very simple. A single case is chosen as the template, and this step for all other cases depends on that template. The step (probably) is not run on the template itself. Rather than allow a single case (the template) to perform a single task (compute_degree_powers) sooner than the rest, I think it is better not to optimize for this. We scale better across case count anyway. A representation like:

```python
from multiprocessing import Pool

def pipeline(case):
    dti = case.as_nrrd().gen_mask().fit_dti()
    dti.save(...)  # saving can be arbitrary
    return dti.estimate_fods()

with Pool() as pool:
    cases = obtain_many_cases(...)
    powers = pool.imap_unordered(pipeline, cases)
    template = next(powers)
    templates = pool.map(template.register, powers)
```

Note the usage of `imap_unordered` here: the first case to finish becomes the template. Otherwise, explicitly choosing a template would look something like this, but it does not allow maximal parallelism:

```python
powers = pool.map(pipeline, cases)
template = powers.pop(42)  # or search for the correct index by case name, etc.
templates = pool.map(template.register, powers)
```

For serialization formats, I think this is where a simple adapter belongs. We can have a simple mapping of "known" file types to adapters. For the CLI, we would insert explicit `save_to` calls.
Linking brain-microstructure-exploration-tools/abcd-noddi-tbss-experiments#21 since that's the one step that I can tell has cross-case dependencies.
A phrase I really like from the Scientific Python development principles is: "Complexity is always conserved." If we can avoid architecting the entire thing around just the one or two steps that require writing to disk, I think we'll make our lives much easier. I don't predict much of a performance impact in the meantime, before we adjust the HD-BET API.
Problem:

```python
def pipeline(case):
    ...
    return dti.estimate_fods().save_to(...)
```

If the …
This is great! And this is seriously a hard problem, so I really appreciate the thoughtful design. Some replies and questions:
- Indeed we will not have the mrtrix dependency.
- Totally agree, and I like the …
- This looks amazing. It becomes clear what plugs into what, and we are using the Python language itself to enforce it. Reminds me of the approach I see functional programmers take, where they make their type system speak for itself.
- What is meant by …?
The template is not going to be one of the cases or even a case at all. It's more like just the abstract concept of a common coordinate space. What matters is not what template image lives in that space, but rather the transformation from any given subject image into that space. So the population template construction step has as input the entire collection of subject images, and the output is an entire collection of transformations, one transformation for each subject image. We still don't know what's the best file format for representing transformations.
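To make the shape of that step concrete, here is a minimal sketch of the signature it implies; `build_population_template` and `Transform` are invented names, and the fields of `Transform` are only one possible in-memory representation, since the file format question is still open:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class Transform:
    """Placeholder for a subject-to-common-space transformation."""
    affine: np.ndarray                               # e.g. a 4x4 matrix
    displacement_field: Optional[np.ndarray] = None  # optional nonlinear part


def build_population_template(subject_images: Sequence[np.ndarray]) -> list[Transform]:
    """Inter-case step: consumes all subject images, returns one transform per subject."""
    raise NotImplementedError  # the actual registration is out of scope for this sketch
```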
Besides the population template, another inter-case operator is the FOD estimation step. This is because it determines a response function before fitting the FODs, and that is done by averaging response functions across many subjects. And there are other steps in downstream analysis that would combine cases -- things that we haven't gotten to yet in abcd-noddi-tbss-experiments.

A further complication particular to ABCD is that there are many study sites, and we need to keep track of which images come from which study site. This comes from a table. There can also be further steps down the line that rely on tabular data associated to the images, once we get into the statistical analysis.

At this point it may help to lay out all the pipeline steps, so here is a flowchart:

```mermaid
flowchart LR
dl(ABCD download)
tbl(Table mapping to sites, scanners, variables)
img(DW images)
dl-->ext{Extract}-.->img
tbl-->ext
img --denoise-->img
img --generate nrrd headers-->img
img--hdbet-->masks(masks)
img-->dtifit{DTI fit}
masks-->dtifit
dtifit-->dti(DTI)
dtifit-->fa(FA)
dtifit-->md(MD)
img-->noddi{NODDI fit}
masks-->noddi
noddi-->ndi{NDI}
noddi-->odi{ODI}
csd{CSD}
rf{response function<br>estimation}
img-->csd
img-->rf
masks-->csd
masks-->rf
tbl-->rf
rf-.->rfv(response functions)
rfv-->csd
csd-->fod(FODs)
fod--compute_degree_powers-->dp(Degree powers)
tp{Population template<br>construction}
tbl-->tp
dp-->tp
tp-.->wrp(Warps)
vba{voxel-based<br>analysis}
wrp-->vba
fa-->vba
md-->vba
ndi-->vba
odi-->vba
tbl-->vba
vba-.->vbaR(result)
tbss{TBSS}
fa-->tbss
wrp-->tbss
tbss-.->tbssproj(Projections to<br>WM skeleton)
tba{TBSS<br>analysis}
tbssproj-->tba
wrp-->tba
fa-->tba
md-->tba
ndi-->tba
odi-->tba
tbl-->tba
tba-.->tbaR(result)
```

Solid-line arrows indicate data flow that is purely per-case, and dotted-line arrows indicate the possibility of inter-case interactions.

Overall, I like this plan. And if pesky steps like population template construction make it difficult to parallelize, then that's fine; we can just not parallelize those steps (or rather, have them parallelize internally if they want to). Now, before we close this, I think we also need some actionable details: what are we adding, and where are we adding it? I'm also curious whether seeing the monstrous pipeline flowchart brings to mind any issues.
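On the question of what to add and where, one hedged way to accommodate the dotted-arrow (inter-case) steps without giving up per-case parallelism is to treat each of them as a single reduction between two parallel per-case phases. A generic sketch, with invented names and no commitment to the real API:

```python
from multiprocessing import Pool
from typing import Callable, Sequence, TypeVar

Case = TypeVar("Case")
Shared = TypeVar("Shared")
Result = TypeVar("Result")


def run_with_one_reduction(
    cases: Sequence[Case],
    per_case_pre: Callable[[Case], Shared],            # e.g. per-case response function estimation
    reduce_step: Callable[[Sequence[Shared]], Shared],  # e.g. average the response functions
    per_case_post: Callable[[Case, Shared], Result],    # e.g. CSD using the shared response
) -> list[Result]:
    """Run per-case work in parallel, do one inter-case reduction, then go per-case again."""
    with Pool() as pool:
        intermediates = pool.map(per_case_pre, cases)
        shared = reduce_step(intermediates)
        return pool.starmap(per_case_post, [(case, shared) for case in cases])
```

Population template construction could fit the same mold, with the reduction producing the common space and the second per-case phase producing each subject's warp.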
Here is an updated proposal. Will think on this a bit before closing the issue.

The objects to operate on are "abcd events", dMRI scans, etc. The starting point should be a download folder of ABCD data, containing tabular data and images. The starting object is an `AbcdEvent`. Here is how the individual pipeline components can be conceived, using a functional notation that maps input objects to output objects (e.g. `extract_dwi: AbcdEvent -> Dwi`).

So how would the IO work within these objects, like a `Dwi`? Internally, a `Dwi` holds its data in `Resource` objects.
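As one possible concretization (field names are guesses for illustration, not settled API), a `Dwi` could be a thin container whose fields are the `Resource` types sketched in the hierarchy below:

```python
from dataclasses import dataclass


@dataclass
class Dwi:
    """A diffusion-weighted image: voxel data plus its gradient table.

    Because each field is a Resource, the same Dwi type is used whether the
    underlying data currently lives in memory or on disk.
    """
    volume: "ArrayResource"  # the image data
    bvals: "BvalResource"    # b-values
    # ... further fields would follow the same pattern for the other Resource types
```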
One concrete implementation hierarchy for these resources could look like:

```mermaid
graph LR
Resource --> ArrayResource
Resource --> BvalResource
Resource --> ...
ArrayResource --> NiftiArrayResource
ArrayResource --> SlicerArrayResource
ArrayResource --> BasicArrayResource
BvalResource --> FslBvalResource
BvalResource --> SlicerBvalResource
BvalResource --> BasicBvalResource
The "basic" things like
There can be any number of conversion layers to make things interlock cleanly. A conversion function that turns an in-memory `Dwi` into one whose data lives on disk could be used like this:

```python
from abcdmicro.io import dwi_to_nifti

def extract_and_denoise(abcd_event: AbcdEvent, output_path: Path):
    dwi: Dwi = extract_dwi(abcd_event)  # this DWI has its data in-memory
    dwi: Dwi = denoise_dwi(dwi)  # still data is in-memory
    dwi: Dwi = dwi_to_nifti(dwi, output_path)  # now the data is on disk. It's still a Dwi object, but the ArrayResource inside it is a NiftiArrayResource
```

These could be method-chained:

```python
def extract_and_denoise(abcd_event: AbcdEvent, output_path: Path):
    return abcd_event.extract_dwi().denoise().to_nifti(output_path)
```

Note: just realized that if I want to rewrite to nifti later then I should preserve the metadata, so wherever the array data is held, the associated metadata should come along with it.
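One way to act on that metadata note (a sketch only; the names are invented) is to keep the spatial metadata inside the in-memory resource itself, so a later `to_nifti` has everything it needs to write a faithful file:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class BasicVolumeResource:
    """In-memory volume that carries its spatial metadata along with the voxels.

    Keeping the affine and header here means converting back to NIfTI (or NRRD)
    later does not lose information acquired at load time.
    """
    array: np.ndarray
    affine: np.ndarray = field(default_factory=lambda: np.eye(4))
    header: dict = field(default_factory=dict)
```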
I think it works; a few additional observations:
and a few things deliberately being put off to design later:
These are left out of the original design because I suspect that once things are put together without the caching or generic machinery, it will be clearer whether those pieces are actually needed.
Below we can have a detailed proposal for how to approach #35. We know there's going to have to be some canonical representation of objects as either in-memory or on-disk, and "in-memory" seems preferable. And we know there will have to be some kind of adapter to be able to switch between those. But there are a lot of challenging details to think through here, and an approach for these will be discussed below.