Types of Tasks #43
I think this is an anti-pattern ... I'd prefer if all input assets be contained in a single object, and any "global" configuration passed in as a task parameter. E.g.

```json
{
    "features": [
        {
            "data": "http://example.com/data.tif",
            "metadata": "http://example.com/metadata.xml"
        }
    ],
    "process": {
        "tasks": {
            "my-task": {
                "global-config": "http://example.com/dataset-config.json"
            }
        }
    }
}
```
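To make the "configuration as a task parameter" idea concrete, here is a minimal stdlib-only sketch of how a task runner might look up its per-task parameters from a payload shaped like the example above. The helper name `get_task_parameters` is invented for illustration and is not part of any published stac-task API.

```python
# Hypothetical helper: given a payload shaped like the example above,
# return the parameter dict configured for a named task.
def get_task_parameters(payload: dict, task_name: str) -> dict:
    """Return the per-task parameters, or {} if none are configured."""
    return payload.get("process", {}).get("tasks", {}).get(task_name, {})


payload = {
    "features": [
        {
            "data": "http://example.com/data.tif",
            "metadata": "http://example.com/metadata.xml",
        }
    ],
    "process": {
        "tasks": {
            "my-task": {"global-config": "http://example.com/dataset-config.json"}
        }
    },
}

params = get_task_parameters(payload, "my-task")
# params now holds the "global-config" entry for "my-task"
```

A task that needs dataset-wide configuration would read it from `params` rather than from the feature list, keeping the input assets and the global configuration cleanly separated.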
This is supported in the currently proposed model, e.g. you could define:

```python
class OneToManyTask(Task[Item, Item]):
    def process(self, item: Item) -> List[Item]:
        """Explode a netCDF into a bunch of COGs, or whatever."""
```

One thing that isn't demonstrated in the current PR (yet) is the ability to put constraints on your input model with pydantic:

```python
from typing import List

from pydantic import BaseModel

from stac_task import Task
from stac_task.models import Asset, Item


class Assets(BaseModel):
    data: Asset
    metadata: Asset


class AssetsTask(Task[Assets, Item]):
    def process(self, input: Assets) -> List[Item]:
        """Creates an item from input assets."""
        item = do_the_thing(input)
        return [item]


output_dict = AssetsTask().process_dict({
    "data": {"href": "an/href.tif"},
    "metadata": {"href": "the/metadata.xml"},
})
```
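For readers unfamiliar with the `Task[Input, Output]` notation used above, here is a simplified, stdlib-only sketch of the generic-task pattern. This is a stand-in for illustration only; stac_task's actual `Task` class has more machinery (serialization, `process_dict`, etc.) and its signatures may differ.

```python
# Simplified stand-in for the generic Task[Input, Output] pattern.
# Not the real stac_task API; illustrative only.
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

Input = TypeVar("Input")
Output = TypeVar("Output")


class Task(ABC, Generic[Input, Output]):
    @abstractmethod
    def process(self, input: Input) -> List[Output]:
        """Transform one input into zero or more outputs."""


class ExplodeTask(Task[dict, dict]):
    """One-to-many: split one input into several outputs."""

    def process(self, input: dict) -> List[dict]:
        return [{"band": band} for band in input.get("bands", [])]


outputs = ExplodeTask().process({"bands": ["B01", "B02"]})
```

The generic parameters pin down what a task accepts and returns, which is what makes per-task input schemas (and JSON Schema generation via pydantic) possible.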
The idea of a STAC-based workflow is that you are working with existing STAC Items; you are not creating a new STAC Item as input to a process. This is why it's 1 or more STAC Items, and it's important to preserve that for data provenance. For example, I want to take in a single Landsat scene and a DEM (or more, to cover the area) for doing orthorectification. That set of input STAC Items have self hrefs that point back to the source, so they can be added in the derived_from field of the orthorectified output. I'm not sure if that was what you meant above, or if you only meant it for when the input was strings and not STAC.

I'm not sure I like the ability to just define any arbitrary input or output here. Maybe this is better, but the original goal was to have a strict convention that supports STAC workflows, not a generalized process-anything task. But I'll have to think on that some.
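The provenance idea described above can be sketched in a few lines: collect each input Item's self href into `derived_from` links on the output Item. Plain dicts stand in for STAC Items here, and the helper names are invented for illustration.

```python
# Sketch of the provenance pattern: input Items' self hrefs become
# derived_from links on the derived output Item. Dicts stand in for
# real STAC Item objects.
def self_href(item: dict) -> str:
    """Return the href of the Item's rel="self" link."""
    return next(link["href"] for link in item["links"] if link["rel"] == "self")


def add_derived_from(output: dict, inputs: list) -> dict:
    """Append a derived_from link for every input Item."""
    links = output.setdefault("links", [])
    for item in inputs:
        links.append({"rel": "derived_from", "href": self_href(item)})
    return output


landsat = {"id": "landsat-scene", "links": [{"rel": "self", "href": "http://landsat.stac/item.json"}]}
dem = {"id": "dem-tile", "links": [{"rel": "self", "href": "http://dem.stac/tile.json"}]}
ortho = add_derived_from({"id": "ortho"}, [landsat, dem])
```

This only works if the inputs really are STAC Items with self links, which is the provenance argument for keeping Items (rather than arbitrary objects) as task input.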
As laid out in #42 (comment), I still believe many-items-in-many-items-out is an anti-pattern that makes it harder to build scope-limited, easily-testable tasks. In your example, if you want to maintain derived-from links, you can include them in the "to-process" item:

```json
{
    "id": "item-to-process",
    "links": [
        { "href": "http://landsat.stac/item-0.json", "rel": "derived_from" },
        { "href": "http://sentinel2.stac/item-0.json", "rel": "derived_from" }
    ],
    "assets": {
        "landsat_B01": { "href": "http://landsat.stac/B01.tif" },
        "sentinel2_B01": { "href": "http://sentinel2.stac/B01.jp2" }
    }
}
```

This way, you can define a schema of what the inputs should look like (e.g. w/ pydantic):

```python
class Assets(BaseModel):
    landsat_B01: Asset
    sentinel2_B01: Asset


class Input(BaseModel):
    id: str
    links: List[Link]
    assets: Assets
```

There are a couple of benefits (that I see) to my proposal:
Okay, after some rework, here are the core generic tasks in the library.

To make a task, you pick the one that best fits what you're trying to do, and implement it. Cirrus would want a
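The "pick the task type that fits, then implement" idea can be sketched as a small class hierarchy. The class names below are taken from this discussion, but the implementation is a stdlib-only illustration; the real library's base classes and signatures may differ.

```python
# Illustrative hierarchy of generic task types from the discussion.
# Not the real library; a sketch of the "pick one and implement" idea.
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class StacOutputTask(ABC):
    """Anything in, STAC Items (as dicts here) out."""

    @abstractmethod
    def process(self, input: Any) -> List[Dict]:
        ...


class HrefTask(StacOutputTask):
    """Href in, STAC Items out: narrows the input type."""

    @abstractmethod
    def process(self, href: str) -> List[Dict]:
        ...


class CogTask(HrefTask):
    """Concrete task: wrap an href in a minimal Item-like dict."""

    def process(self, href: str) -> List[Dict]:
        # A real task would open the href and build a complete STAC Item.
        return [{"id": "cog-item", "assets": {"data": {"href": href}}}]


items = CogTask().process("http://example.com/data.tif")
```

A task author only implements the narrowest class that matches their use case; the base classes document (and can validate) the input and output contracts.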
In @gadomski's PR #42 several types of tasks are defined.
I really like this way to define the input and output for different types of tasks, especially if it gives us JSON Schema!
Want to review these two Tasks:
StacOutputTask - Anything in, STAC out task.
HrefTask - Href in, STAC Out task
These tasks capture the need to create STAC Items from scratch. In the current payload structure you pass parameters to the task in the process definition; you don't hand them in as part of the task input (which would normally be a FeatureCollection). So the `href` (or multiple hrefs), along with other parameters, would be provided in the `process.tasks.taskname.parameter` field. I think that should be the preferred model, and Input/Output is always going to be STAC Items, or nothing.

Next is the ItemTask, which defines a single Item, but stac-tasks currently are ItemCollections. A STAC task can take in 1 or more STAC Items as input, and returns 1 or more STAC Items. Note that this is not 1:1; a task doesn't process each item independently to create an array of output items (although you could write a task to do that). A task might take in one Item and create two derived Items from it, or it might take in an Item of data and a few other Items of auxiliary data used in the processing to create a single output Item.
Each task would have requirements on the number of input Items.
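The per-task requirement on the number of input Items could be expressed as a simple count check. This is a hedged stdlib sketch; the attribute and method names (`min_input_items`, `validate_input`, etc.) are invented for illustration, not proposed API.

```python
# Sketch of per-task input-count requirements. Names are illustrative.
from typing import Dict, List, Optional

class ItemCollectionTask:
    min_input_items: int = 1
    max_input_items: Optional[int] = None  # None means unbounded

    def validate_input(self, items: List[Dict]) -> None:
        n = len(items)
        if n < self.min_input_items:
            raise ValueError(f"expected at least {self.min_input_items} items, got {n}")
        if self.max_input_items is not None and n > self.max_input_items:
            raise ValueError(f"expected at most {self.max_input_items} items, got {n}")


class OrthoTask(ItemCollectionTask):
    """Orthorectification needs one scene plus at least one DEM Item."""
    min_input_items = 2


OrthoTask().validate_input([{"id": "scene"}, {"id": "dem"}])  # passes
```

Declaring the requirement on the class makes it checkable before the task runs, so a mis-shaped payload fails fast instead of mid-processing.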
So I'd propose
StacOutputTask - Nothing in, STAC out task
ItemCollectionTask - ItemCollection in, ItemCollection out
I suppose we could also have an ItemTask for single Item input and output (the most common scenario), but I'm not sure I see the advantage over using ItemCollection with 1 Item.
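The point that a single-Item task is just the ItemCollection case with one feature can be made concrete: ItemTask reduces to a thin convenience wrapper over ItemCollectionTask. This is a stdlib-only sketch with invented names (`process_item`, dicts standing in for STAC Items), not the library's actual design.

```python
# Sketch: ItemTask as a convenience wrapper over ItemCollectionTask.
# Names and structure are illustrative only.
from typing import Dict, List

class ItemCollectionTask:
    def process(self, items: List[Dict]) -> List[Dict]:
        raise NotImplementedError


class ItemTask(ItemCollectionTask):
    """Subclasses implement process_item; collection plumbing is shared."""

    def process(self, items: List[Dict]) -> List[Dict]:
        if len(items) != 1:
            raise ValueError("ItemTask expects exactly one Item")
        return [self.process_item(items[0])]

    def process_item(self, item: Dict) -> Dict:
        raise NotImplementedError


class AddTitle(ItemTask):
    """Example single-Item task: derive a title from the id."""

    def process_item(self, item: Dict) -> Dict:
        return {**item, "title": item["id"]}


out = AddTitle().process([{"id": "scene-1"}])
```

Since the wrapper is this thin, the question in the thread stands: ItemTask buys a slightly nicer signature but no new capability over an ItemCollection of length 1.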