Come up with a design for how to separate IO #44

ebrahimebrahim · 2024-05-08T20:02:32Z

Below we can have a detailed proposal for how to approach #35. We know there's going to have to be some canonical representation of objects as either in-memory or on-disk, and "in-memory" seems preferable. And we know there will have to be some kind of adapter to be able to switch between those. But there are a lot of challenging details to think through here, and an approach for these will be discussed below.

allemangD · 2024-05-13T17:06:44Z

Some general thoughts. First, trying to focus on what exactly we are trying to solve with this.

Functional python API is first-class.
Do not require loading the full dataset into memory together.
~~Support steps like HD_BET which require storing data to disk before compute.~~
- ~~Have the tool write its output directly to the desired location to avoid extraneous deserialize/reserialize steps.~~
Support additional serialization formats in the future.

I've crossed out the middle requirement because, at least of the stages currently listed in #3, the HD_BET step is the only one that requires writing to disk, and it is just a constraint of the API. If we are tweaking the API, we can have it read a numpy array directly and these two points are moot. I don't think it's good to architect the API around optimizing for this one pathological step out of many that don't have the problem.

There is also generate_population_template which depends on mrtrix3, but there is a plan to remove that dependency entirely.

For ~~those pathological steps~~ that pathological step, we can just write and read to a tempfile.NamedTemporaryFile as an implementation detail, until we fix the HD-BET api.

The pipeline has very simple structure; there are no (*) dependencies between cases, and the dependencies are linear - each step depends only on the steps before it.

I suggest encoding these dependencies with the method chaining strategy listed here, with usage like:

data = obtain_one_case()
data.as_nrrd().gen_mask().fit_dti().estimate_fods().compute_degree_powers()

Output files can be manually written to a particular location, or cached in $HOME/.cache or $PWD/.cache or similar. (see https://github.com/ActiveState/appdirs) Perhaps something like:

data = obtain_one_case(cache_dir=...)

nrrds = data.as_nrrd()
nrrds.save_to(...)

masks = nrrds.gen_mask()
# masks.save_to(...)

powers = masks.fit_dti().estimate_fods().compute_degree_powers()
powers.save_to()

(*) The population template step does have cross-case dependencies but they are very simple. A single case is chosen as template, and this step of all other cases depends on that template. The step (probably) is not run on the template itself. Rather than allow a single case (the template) to perform a single task (compute_degree_powers) sooner than the rest, I think better not to optimize for this. We scale better across case count anyway. A representation like:

def pipeline(case):
    dti = case.as_nrrd().gen_mask().fit_dti()
    dti.save(...)  # saving can be arbitrary
    return dti.estimate_fods()

with Pool() as pool:
    cases = obtain_many_cases(...)
    powers = pool.imap_unordered(pipeline, cases)
    template = next(powers)
    templates = pool.map(template.register, powers)

Note the usage of imap_unordered and next, combined with method binding template.register might allow full parallelism with the multiprocessing pool. The main process would submit those to the pool job queue as soon as available, while the pool continues running the pipeline on the rest of the cases.

Otherwise, explicitly choosing a template would look something like this, but does not allow maximal parallelism.

powers = pool.map(pipeline, cases)
template = powers.pop(42)  # or search for correct index by case name, etc.
templates = pool.map(template.register, powers)

Introducing a pipeline function means that each process doesn't need to serialize or deserialize any content between steps; the data can be loaded once at the top and arbitrarily serialized with explicit .save().

For serialization formats, I think this is where a simple adapter belongs. We can have a simple mapping of "known" file types to adapters, keyed on pathlib.Path.suffixes. Packages nibabel and pynrrd are probably sufficient for most of our work, just to get the data into numpy arrays.

For CLI, we would insert explicit .save at our chosen "output" data, enabled by certain CLI flags. For the Slicer orchestrator module we'd do similar, writing back to MRML nodes instead.

allemangD · 2024-05-13T17:08:14Z

Linking brain-microstructure-exploration-tools/abcd-noddi-tbss-experiments#21 since that's the one step that I can tell has cross-case dependencies.

allemangD · 2024-05-13T17:11:43Z

A phrase I really like from that Scientific Python development principles is: Complexity is always conserved. If we can avoid architecting the entire thing about just one or two steps that require writing to disk, I think we'll make our lives much easier. I don't predict it will be too much performance impact in the meantime before we adjust HD-BET API.

allemangD · 2024-05-13T17:18:33Z

Problem: return dti.estimate_fods() implies that the main process will collect all FODs in memory. We wanted to avoid this. So perhaps we'll need save to return a sort of pointer-to-files dataclass that can be returned instead.

def pipeline(case):
    ...
    return dti.estimate_fods().save_to(...)

If the estimate_fods() and estimate_fods().save_to() types both implement the same protocol, where the save_to type's implementation first deseralizes the data, then we avoid the issue.

ebrahimebrahim · 2024-05-13T20:06:13Z

This is great! And this is seriously a hard problem, so I really appreciate the thoughtful design. Some replies and questions:

There is also generate_population_template which depends on mrtrix3, but there is a plan to remove that dependency entirely.

Indeed we will not have the mrtrix dependency

I don't think it's good to architect the API around optimizing for this one pathological step out of many that don't have the problem.

Totally agree, and I like the tempfile.NamedTemporaryFile suggestion as a holdover until we fix the api.

method chaining strategy listed here

This looks amazing. It becomes clear what plugs into what, and we are using the python language itself to enforce it. Reminds me of the approach I see functional programmers take, where they make their type system speak for itself.

data.as_nrrd()

What is meant by as_nrrd()? Would it be returning a numpy array or a file pointer?

Otherwise, explicitly choosing a template would look something like this, but does not allow maximal parallelism.
powers = pool.map(pipeline, cases)
template = powers.pop(42)  # or search for correct index by case name, etc.
templates = pool.map(template.register, powers)

The template is not going to be one of the cases or even a case at all. It's more like just the abstract concept of a common coordinate space. What matters is not what template image lives in that space, but rather the transformation from any given subject image into that space. So the population template construction step has as input the entire collection of subject images, and the output is an entire collection of transformations, one transformation for each subject image. We still don't know what's the best file format for representing transformations.

there are no (*) dependencies between cases

Besides the population template, another inter-case operator is the fod estimation step. This is because it determines a response function before fitting the fods, and that is done by averaging response functions across many subjects.

And there are other steps in downstream analysis that would combine cases -- things that we haven't gotten to yet it abcd-noddi-tbss-experiments.

A further complication that is particular to ABCD is the fact that there are many study sites, and we need to keep track of which images are coming from which study site. This comes from a table. There can also be further steps down the line that reply on tabular data associated to the images, once we get into the statistical analysis.

At this point it may help to lay out all the pipeline steps, so here is a flowchart:

flowchart LR
dl(ABCD download)
tbl(Table mapping to sites, scanners, variables)
img(DW images)
dl-->ext{Extract}-.->img
tbl-->ext
img --denoise-->img
img --generate nrrd headers-->img
img--hdbet-->masks(masks)
img-->dtifit{DTI fit}
masks-->dtifit
dtifit-->dti(DTI)
dtifit-->fa(FA)
dtifit-->md(MD)
img-->noddi{NODDI fit}
masks-->noddi
noddi-->ndi{NDI}
noddi-->odi{ODI}
csd{CSD}
rf{response function<br>estimation}
img-->csd
img-->rf
masks-->csd
masks-->rf
tbl-->rf
rf-.->rfv(response functions)
rfv-->csd
csd-->fod(FODs)
fod--compute_degree_powers-->dp(Degree powers)
tp{Population template<br>construction}
tbl-->tp
dp-->tp
tp-.->wrp(Warps)
vba{voxel-based<br>analysis}
wrp-->vba
fa-->vba
md-->vba
ndi-->vba
odi-->vba
tbl-->vba
vba-.->vbaR(result)
tbss{TBSS}
fa-->tbss
wrp-->tbss
tbss-.->tbssproj(Projections to<br>WM skeleton)
tba{TBSS<br>analysis}
tbssproj-->tba
wrp-->tba
fa-->tba
md-->tba
ndi-->tba
odi-->tba
tbl-->tba
tba-.->tbaR(result)

Solid line arrows indicate data flow that is purely per-case, and dotted line arrows indicate the possibility of inter-case interactions.

Overall, I like this plan. And if pesky steps like population template construction make it difficult to parallelize, then that's fine we can just not parallelize those steps (or rather, make them internally parallelize if they want to).

Now I think before we close this we also need some actionable details.

What are we adding and where are we adding it?

I'm also curious if seeing the monstrous pipeline flowchart brings to mind any issues.

ebrahimebrahim · 2024-06-28T18:22:51Z

Here is an updated proposal. Will think on this a bit before closing the issue.

The objects to operate on are "abcd events", dmri scans, etc.

The starting point should be a download folder of ABCD data, containing tabular data and images.

The starting object is an AbcdEvent, which consists of a subject id and abcd event name (baseline, 2-year follow up, etc.), as well as a root download folder into which the raw abcd data was originally downloaded. This allows for links into the tabular data and into neuroimaging. So an AbcdEvent can have methods that allow for access to tabular info for the current subject and event, and that can provide paths to the neuroimaging data (including knowing to provide a list of zips rather than a single zip in the case of philips).

Here is how the individual pipeline components can be conceived, using a functional notation:

extract_dwi : AbcdEvent -> Dwi # extracts and concatenates when needed
denoise_dwi : Dwi -> Dwi
estimate_response_function : Dwi -> ResponseFunction
aggregate_response_functions : List[ResponseFunction] -> ResponseFunction
aggregate_within_sites : (List[T] -> T) -> List[T] -> Dict[SiteID, T]
aggregate_response_functions_within_site = aggregate_within_sites(aggregate_response_functions)
aggregate_response_functions_within_site : List[ResponseFunction] -> Dict[SiteID, ResponseFunction]
estimate_fod : ResponseFunction -> Dwi -> Fod
generate_warps_to_template_space : List[Fod] -> List[Warp]

So how would the IO work within these objects like Dwi, Fod, etc.?

A Dwi would have a method like get_array() that returns a 4D numpy array, similarly for Fod, etc.

Internally, a Dwi does this by calling the get() method of its ArrayResource. When you construct a Dwi you need to provide some kind of ArrayResource, BvalResource, and BvecResource.

ArrayResource is an abstract base class that has a get method that returns an NDArray. BValResource is an abstract base class that has a get method that returns a linear NDArray or perhaps list of bvals.

One concrete implementation of ArrayResource could be NiftiArrayResource which just has a path to file and uses nibabel or something like that to do the reading. Resources can be in an inheritance tree kind of like this

graph LR
Resource --> ArrayResource
Resource --> BvalResource
Resource --> ...
ArrayResource --> NiftiArrayResource
ArrayResource --> SlicerArrayResource
ArrayResource --> BasicArrayResource
BvalResource --> FslBvalResource
BvalResource --> SlicerBvalResource
BvalResource --> BasicBvalResource

The "basic" things like BasicArrayResource and BasicBvalResource that I have in mind would just be in-memory versions of the resources, where the get function would simply pass through the underlying in-memory object.

resource.py would contain all the resource abstract base classes
concrete implementations that handle disk io such as NiftiArrayResource would live in io.py, cordoned off.

There can be any number of conversion layers to make things interlock cleanly.

For example extract_dwi could turn an AbcdEvent into a Dwi that has an underlying NiftiArrayResource. But maybe we want to use denoise_dwi : Dwi -> Dwi immediately after and only then do we want to save to disk. Then we'd want extract_dwi to return a Dwi that has an underlying BasicArrayResource. To have the most flexibility, we should have extract_dwi to return a Dwi that has an underlying BasicArrayResource, BasicBvalResource, etc., and then convert these into NiftiArrayResource, FslBvalResource, etc. only when we are ready to save them.

So all those pipeline components listed above, extract_dwi, denoise_dwi, etc. should output the "basic" (i.e. in-memory) versions of things.

A conversion function that turns a BasicArrayResource into a NiftiArrayResource would be the way to write to disk. Here's an example bit of pipleine being put together

from abcdmicro.io import dwi_to_nifti

def extract_and_denoise(abcd_event: AbcdEvent, output_path: Path):
    dwi : Dwi = extract_dwi(abcd_event) # this DWI has its data in-memory
    dwi : Dwi = denoise_dwi(dwi) # still data is in-memory
    dwi : Dwi = dwi_to_nifti(dwi, output_path) # now the data is on disk. It's still a Dwi object, but the ArrayResource inside it is a NiftiArrayResource

These could be method-chained:

def extract_and_denoise(abcd_event: AbcdEvent, output_path: Path):
    return abcd_event.extract_dwi().denoise().to_nifti(output_path)

Note: Just realized that if I want to rewrite to nifti later than I should preserve the metadata, so wherever I say ArrayResource perhaps a better name is VolumeResource for medical image volumes. The basic version would perhaps just have an array and a dict of header info.

ebrahimebrahim · 2024-07-05T17:00:02Z

I think it works; a few additional observations:

A more descriptive name than Basic is InMemory. So InMemoryVolumeResource, etc.
AbcdEvent can have class level ("static") loaded dataframes of the abcd tables it uses, so the tables are cached in memory as they get loaded. This avoids doing a new diskread per row when iterating through AbcdEvents
Objects like Dwi and ResponseFunction will need to remember their subject ID and such in order to get related to study site when needed. I think the easiest way to do this is to have these objects maintain a reference to their original AbcdEvent. So when you construct these objects you also need to pass an an AbcdEvent.

and a few things deliberately being put off to design later:

To make reading data from disk less painful, classes like Dwi etc. should have a class level method from_file. So you can do Dwi.from_file(some path) to read a Dwi object from disk using the file extension to choose an appropriate reader. io.py can have this generic reader. But we need to answer the question of how best to serialize other data inside these objects like AbcdEvents. We can do this later. For now it works to use the constructors of the individual things.
Caching has been left out here and would have to be handled manually when putting together the pipeline. I.e. "output path is such-and-such. if that file exists, then read from disk. otherwise run this pipeline function on the input object and save the result to the output path."

These are left out of the original design because I suspect that once things are put together without the caching or generic from_file reading it will start to become clear how they ought be designed.

ebrahimebrahim mentioned this issue May 8, 2024

Cordon off IO operations #35

Open

2 tasks

ebrahimebrahim assigned allemangD May 8, 2024

ebrahimebrahim self-assigned this May 29, 2024

ebrahimebrahim mentioned this issue May 29, 2024

Add caching to pipelines #17

Closed

ebrahimebrahim unassigned allemangD Jun 3, 2024

ebrahimebrahim added this to ABCD Y2Q1 Jun 11, 2024

ebrahimebrahim closed this as completed Jul 5, 2024

github-project-automation bot moved this to Done in ABCD Y2Q1 Jul 5, 2024

ebrahimebrahim mentioned this issue Jul 31, 2024

Start putting together pipeline architecture #54

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Come up with a design for how to separate IO #44

Come up with a design for how to separate IO #44

ebrahimebrahim commented May 8, 2024 •

edited

Loading

allemangD commented May 13, 2024 •

edited

Loading

allemangD commented May 13, 2024

allemangD commented May 13, 2024 •

edited

Loading

allemangD commented May 13, 2024 •

edited

Loading

ebrahimebrahim commented May 13, 2024

ebrahimebrahim commented Jun 28, 2024

ebrahimebrahim commented Jul 5, 2024 •

edited

Loading

Come up with a design for how to separate IO #44

Come up with a design for how to separate IO #44

Comments

ebrahimebrahim commented May 8, 2024 • edited Loading

allemangD commented May 13, 2024 • edited Loading

allemangD commented May 13, 2024

allemangD commented May 13, 2024 • edited Loading

allemangD commented May 13, 2024 • edited Loading

ebrahimebrahim commented May 13, 2024

ebrahimebrahim commented Jun 28, 2024

ebrahimebrahim commented Jul 5, 2024 • edited Loading

ebrahimebrahim commented May 8, 2024 •

edited

Loading

allemangD commented May 13, 2024 •

edited

Loading

allemangD commented May 13, 2024 •

edited

Loading

allemangD commented May 13, 2024 •

edited

Loading

ebrahimebrahim commented Jul 5, 2024 •

edited

Loading