For DAG definition, looking at: https://github.com/dagworks-inc/hamilton/
HydroMT is a framework for building models that consists of two main parts:
- the `Model` part, with components and a workflow language that specify what your model looks like and how to build it;
- the `DataCatalog` part, with a declarative yaml interface that defines how to extract your data from multiple data sources.

In v1, the interface of `Model` was "remodelled" (pun intended) to be more flexible in terms of what a model is. The `DataCatalog` model was split out into a few components that are now extendable: a `URIResolver` finds data from a single uri and its metadata, a `Driver` reads and merges data, and a `DataAdapter` harmonizes data. This was done under the assumption that data flowing into HydroMT always follows the `resolve` -> `read` -> `transform` flow.

Working with the codebase longer, I found out that this is not the case. A few examples:
- `GeoDataFrame` and `GeoDataset` do not allow for more than 1 uri, so there is no merging at all.
- For `.vrt` files, there is a caching step involved during the resolving step.

These processes are not modelled correctly, resulting in a codebase that is sometimes hard to test and hard to generalize, with many "exceptions to the rule". Because HydroMT is a framework, it should properly support these kinds of features, which apparently are so key to HydroMT that they live not in a plugin but in the main HydroMT library.
What we are doing in the code is creating a DAG. The same goes for the model building mentioned earlier, which currently supports only serial execution of the different steps and is therefore hard to scale up.
My suggestion is to properly model the process and generate a DAG from the different HydroMT classes. Different processes in a DAG can be modelled as one of:
- a fan-out node (1 to many),
- a mapper (1 to 1),
- a reducer (many to 1).

Each of the HydroMT classes can then be one, or a combination, of these processes. A `URIResolver` is clearly a fan-out node (1 to many), a driver or adapter is a mapper (1 to 1), and the merging part of the driver should be modelled separately as a reducer.
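The three node kinds could be sketched as plain callables; the names `fan_out`, `mapper` and `reducer` below are illustrative, not HydroMT API, and the resolve/read/merge functions stand in for what a `URIResolver`, `Driver` and the driver's merging step would do:

```python
from functools import reduce
from typing import Any, Callable, Iterable


def fan_out(resolve: Callable[[str], Iterable[str]]) -> Callable[[str], list[str]]:
    """1 -> many: e.g. a URIResolver expanding one uri into many sources."""
    return lambda uri: list(resolve(uri))


def mapper(read: Callable[[Any], Any]) -> Callable[[list[Any]], list[Any]]:
    """1 -> 1, applied element-wise: e.g. a Driver reading each source."""
    return lambda items: [read(item) for item in items]


def reducer(merge: Callable[[Any, Any], Any]) -> Callable[[list[Any]], Any]:
    """many -> 1: e.g. the merging part of a Driver."""
    return lambda items: reduce(merge, items)
```

A resolve -> read -> merge pipeline is then just function composition: `reducer(merge_fn)(mapper(read_fn)(fan_out(resolve_fn)(uri)))`, and each stage is a separately testable node.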
If we want to "shield" beginner users from this logic, we can create high-level classes or methods that combine these nodes into a single component.
These steps should be interoperable with a workflow orchestrator (snakemake, argo-workflows, airflow, etc.). Some projects and products already exist that map these kinds of steps onto a multi-process or multi-server workflow manager, such as snakemake or streamflow; examples of such projects are HydroFlows, the Climate Stresstesting Toolbox and FloodAdapt. Luckily a DAG is a very simple network, and can easily be represented in plain python (even as a dictionary).
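As a minimal sketch of the "plain dictionary" representation: each step maps to the set of steps it depends on, and the standard library can already order it for execution. The step names are illustrative, not actual HydroMT or orchestrator identifiers:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# The resolve -> read -> merge -> transform flow as a dependency dict:
# each key names a step, each value is the set of steps it depends on.
dag = {
    "resolve": set(),
    "read": {"resolve"},
    "merge": {"read"},
    "transform": {"merge"},
}

# A serial schedule that respects the dependencies.
order = list(TopologicalSorter(dag).static_order())
```

Any orchestrator that accepts explicit dependencies between tasks can consume the same structure, which is what makes the dictionary form a convenient interchange point.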
There are many projects trying to be "the DAG definition language". I would make sure that at least all HydroMT classes translate cleanly to different DAG steps, so that using HydroMT as a library in these projects is easier.
The benefits of this refactor / extra functionality are clear to me:
Cons: