For DAG definition, looking at: https://github.com/dagworks-inc/hamilton/
HydroMT is a framework for building models that consists of two main parts:
- the `Model` part, with components and a workflow language that specify what your model looks like and how to build it;
- the `DataCatalog` part, with a declarative yaml interface that defines how to extract your data from multiple data sources.

In v1, the interface of `Model` was "remodelled" (pun intended) to be more flexible in terms of what a model is. The `DataCatalog` model was split out into a few components that are now extendable: a `URIResolver` finds data from a single uri and its metadata, a `Driver` reads and merges data, and a `DataAdapter` harmonizes data. This was done under the assumption that data flowing into HydroMT always follows the `resolve` -> `read` -> `transform` flow.

Working with the codebase longer, I found out that this is not the case. A few examples:
- `GeoDataFrame` and `GeoDataset` do not allow for more than 1 uri, so there is no merging at all.
- For `.vrt` files, there is a caching step involved during the resolving step.

These processes are not modelled correctly, resulting in a codebase that is sometimes hard to test and hard to generalize, with many "exceptions to the rule". Because HydroMT is a framework, it should properly support these kinds of features, which apparently are so key to HydroMT that they live not in a plugin but in the main HydroMT library.
What we are doing in the code is creating a DAG. The same goes for the model building mentioned earlier, which currently supports only serial execution of the different steps and is therefore hard to scale up.
My suggestion is to properly model the process and generate a DAG from the different HydroMT classes. Different processes in a DAG can be modelled as one of:
- a fan-out node (1 to many),
- a mapper (1 to 1),
- a reducer (many to 1).

Each of the HydroMT classes can then be one, or a combination, of these processes. A `URIResolver` is clearly a fan-out node (1 to many), a driver or adapter is a mapper (1 to 1), and the merging part of the driver should be modelled separately as a reducer.
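The three node kinds could be sketched as plain callables; the names `fan_out`, `mapper` and `reducer` below are illustrative, not HydroMT API, and the resolve/read/merge functions stand in for what a `URIResolver`, `Driver` and the driver's merging step would do:

```python
from functools import reduce
from typing import Any, Callable, Iterable


def fan_out(resolve: Callable[[str], Iterable[str]]) -> Callable[[str], list[str]]:
    """1 -> many: e.g. a URIResolver expanding one uri into many sources."""
    return lambda uri: list(resolve(uri))


def mapper(read: Callable[[Any], Any]) -> Callable[[list[Any]], list[Any]]:
    """1 -> 1, applied element-wise: e.g. a Driver reading each source."""
    return lambda items: [read(item) for item in items]


def reducer(merge: Callable[[Any, Any], Any]) -> Callable[[list[Any]], Any]:
    """many -> 1: e.g. the merging part of a Driver."""
    return lambda items: reduce(merge, items)
```

A resolve -> read -> merge pipeline is then just function composition: `reducer(merge_fn)(mapper(read_fn)(fan_out(resolve_fn)(uri)))`, and each stage is a separately testable node.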
If we want to "shield" beginner users from this logic, we can create high-level classes or methods that combine these nodes into a single component.
These steps should be interoperable with a workflow orchestrator (snakemake, argo-workflows, airflow, etc.). Some projects and products already exist that map these kinds of steps onto a multi-process or multi-server workflow manager, such as snakemake or streamflow; examples of such projects are HydroFlows, the Climate Stresstesting Toolbox and FloodAdapt. Luckily a DAG is a very simple network, and can easily be represented in plain python (even as a dictionary).
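As a minimal sketch of the "plain dictionary" representation: each step maps to the set of steps it depends on, and the standard library can already order it for execution. The step names are illustrative, not actual HydroMT or orchestrator identifiers:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# The resolve -> read -> merge -> transform flow as a dependency dict:
# each key names a step, each value is the set of steps it depends on.
dag = {
    "resolve": set(),
    "read": {"resolve"},
    "merge": {"read"},
    "transform": {"merge"},
}

# A serial schedule that respects the dependencies.
order = list(TopologicalSorter(dag).static_order())
```

Any orchestrator that accepts explicit dependencies between tasks can consume the same structure, which is what makes the dictionary form a convenient interchange point.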
There are many projects trying to be "the DAG definition language". I would make sure that at least all HydroMT classes translate cleanly to different DAG steps, so that using HydroMT as a library in these projects is easier.
The benefits of this refactor / extra functionality are clear to me:
Cons: