
Proposal: OGC API - Processes - Part 4: Job Management #437

Merged (31 commits, Oct 18, 2024)

Conversation

@gfenoy (Contributor) commented Sep 23, 2024

Following the discussions during today's GDC / OGC API - Processes SWG meeting, I created this PR.

It contains a proposal for an additional part to the OGC API - Processes family: "OGC API - Processes - Part 4: Job Management" extension.

This extension was initially discussed here:

The document identifier 24-051 was registered.

@gfenoy added the "Part 4 (Job Management)" label on Sep 23, 2024
Review threads (outdated, resolved) on:
extensions/job_management/.DS_Store
extensions/job_management/README.md
extensions/job_management/standard/24-051.adoc
Comment on lines 1 to 3
type: object
additionalProperties:
  $ref: "input.yaml"
Contributor:

I suggest using an alternate representation:

inputs:
  type: object
  additionalProperties:
    $ref: "input.yaml"
outputs:
  type: object
  additionalProperties:
    $ref: "output.yaml"

Reasons:

  1. Although "outputs" are shown, those represent the requested outputs (i.e.: transmission mode, format, or any other alternate content negotiation parameters) submitted during the job creation request, in order to eventually produce the desired results. Often, the requested outputs depend on whichever inputs were submitted. Therefore, viewing them separately on different endpoints is not really convenient or useful.

  2. The /jobs/{jobId}/outputs endpoint can easily be confused with /jobs/{jobId}/results. The "request outputs" in this case are "parameter descriptions of eventual outputs", which are provided upstream of the execution workflow. In a way, those are parametrization "inputs" of the processing pipeline.

  3. Because OGC API - Processes core defines specific response combinations and requirements for /jobs/{jobId}/results, /jobs/{jobId}/outputs is a convenient and natural endpoint name that an API can use to provide alternate response handling and representations that would conflict with the OGC API - Processes definitions if offered under /results. CRIM's implementation does exactly that. I would appreciate keeping that option available.

  4. As a matter of fact, older OGC API - Processes implementations (from the early ADES/EMS days) actually used /jobs/{jobId}/outputs instead of /jobs/{jobId}/results. Adding /jobs/{jobId}/outputs with a different meaning would break those implementations.

  5. Having inputs and outputs nested under those fields (rather than at the root) leaves room for further content alongside the inputs/outputs, for example additional links, metadata, or definitions describing those parameters (see the sketch below).
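For illustration only, a hypothetical sibling member that the nested layout above would leave room for (the member name and $ref path are assumptions, not part of the suggestion itself):

# Hypothetical addition alongside the nested "inputs"/"outputs" members shown above
links:
  type: array
  items:
    $ref: "link.yaml"   # assumed relative path to a common Link schema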

Contributor:

see other comment about inputs

Contributor:

Should we use a $ref to openEO's definition, to avoid maintaining duplicate definitions?
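A rough sketch of what such a reference might look like (the URL and schema name are assumptions and would need to be checked against the published openEO API definition):

# Hypothetical $ref into the openEO API definition instead of a duplicated local schema
$ref: "https://raw.githubusercontent.com/Open-EO/openeo-api/master/openapi.yaml#/components/schemas/process_graph"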

Comment on lines +3 to +5
enum:
  - process
  - openeo
Contributor:

Should this be made a simple string with examples?
Do we want to create the same situation as statusCode needing an override because of the new created status?

If the long-term use is to have job management available for an OGC API, maybe it would be better to define a requirements class that says, for openEO, type: openeo MUST be used, and process for OGC API - Processes. A "long" Coverage processing could then easily define its own requirements class with type: coverage, without causing invalid JSON Schema definitions.
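For instance, a minimal sketch of the open-ended alternative mentioned above (illustrative only):

# Hypothetical: open string with examples instead of a closed enum
type: string
examples:
  - process
  - openeo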

Comment on lines 7 to 10
id:
  type: string
processID:
  type: string
Contributor:

The requirements need to be revised. They use the alternate name jobID when referring to the GET /jobs responses.

Similarly, process was mentioned during the meeting.
I'm not sure if processID remains relevant however, because process would be the URI, not just the processID from GET /processes/{processID}.
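Something along these lines, as a sketch only (whether process should be a full URI is exactly the open question above):

# Hypothetical revision of the fields under discussion
jobID:
  type: string
process:
  type: string
  format: uri   # assumed: the process URI rather than the bare processID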

====
[%metadata]
label:: /req/job-management/start-response
part:: A successful execution of the operation SHALL be reported as a response with a HTTP status code '200'.
@fmigneault (Contributor) commented Oct 11, 2024

Using POST /jobs/{jobId}/results, the server should probably still have the option to negotiate the execution mode. In other words, sync and 200 would be used by default, but a Prefer: respond-async would allow the server to trigger this job, although it would not respond with results right away. If async is selected this way, the response would instead be the usual Job Status with monitoring. Also, a 202 would have to be used, since no job is "created" in that case. Once completed, the results can be retrieved from GET /jobs/{jobId}/results.
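As a rough illustration of that negotiation, a hypothetical OpenAPI fragment for POST /jobs/{jobId}/results could look like this (the summary, descriptions, and default are assumptions, not text from the draft):

post:
  summary: Trigger execution of a previously created job
  parameters:
    - name: Prefer
      in: header
      schema:
        type: string
      description: "'wait=X' (default) requests sync execution, 'respond-async' requests async"
  responses:
    "200":
      description: Synchronous execution; the results are returned inline.
    "202":
      description: >-
        Asynchronous execution accepted; the usual Job Status document is returned
        and the results are later available from GET /jobs/{jobId}/results.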

Comment:

Would you not expect the sync result already from POST /jobs?

Contributor:

You could do any of the following:

  1. POST /jobs + Prefer: wait=X to return the results inline (no need for another request)

  2. POST /jobs + Prefer: respond-async responds with a Job Status

     i. starts the job whenever resources are ready
     ii. GET /jobs/{jobId}/results to retrieve the results once status: succeeded

  3. POST /jobs + Prefer: wait=X + status: create

     i. places the job in the created state, until triggered later on
     ii. POST /jobs/{jobId}/results to trigger the job and return the results inline

  4. POST /jobs + Prefer: respond-async + status: create

     i. places the job in the created state, until triggered later on
     ii. POST /jobs/{jobId}/results to trigger the job and return with a Job Status
     iii. GET /jobs/{jobId}/results to retrieve the results once status: succeeded

Here, 202 would make sense for 4.ii, whereas 200 would apply to 3.ii.
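As a concrete illustration of option 4 above, the initial request might look like this (the process URI, input name, and the status: create member are illustrative assumptions, not normative Part 4 syntax):

# Hypothetical POST /jobs payload (option 4), sent with header "Prefer: respond-async"
process: "https://example.org/ogcapi/processes/echo"   # assumed process URI
inputs:
  message: "hello"
status: create   # proposed member to keep the job in the 'created' state until triggered
# A later POST /jobs/{jobId}/results would trigger execution (202 + Job Status),
# and GET /jobs/{jobId}/results would return the results once status: succeeded.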

@m-mohr commented Oct 11, 2024

I feel like you have too many options in OAP. Honestly, this feels overengineered. You are implementing things server-side that a client would usually handle; if the client handled them, the server (and client) implementations would be much simpler. The openEO API is much simpler, but its clients can still easily do everything you do here.

Contributor:

Well, we've had users expecting all these combinations so far, so we must support them to accommodate everyone. If Part 4 only allows working with openEO style jobs, it's not much of an extension for OAP.

@m-mohr commented Oct 12, 2024

Users often don't really know what they need or that there are alternatives... If I added everything to openEO that users ask for, it would be a mess. Often it is possible to show them alternatives that work equally well.

Anyway, having multiple possible ways of doing the same thing should IMHO be avoided and I still think a client can simplify API and server development.

The way I understood our discussions initially, the OAP version of the openEO jobs would be slightly different/extended, but this is a completely different spec with only a small subset being openEO. If that's what you are looking for, I give up. I don't think it is useful the way it is. Then it's better not to align at all, which makes it less confusing for users because things are clearly separated.

Contributor:

The first 2 are a direct mapping of what POST /processes/{processId}/execution defines. Those have to be mapped as-is onto POST /jobs to allow a server/client to work only with the /jobs endpoint and to integrate workflows.

The other 2 are for openEO alignment, allowing the creation of a job first and then triggering its execution with a second request. At the moment, OAP has no concept of creating a job without queuing it right away, so the first 2 points need to remain available for servers that do not intend to create jobs this way but still want to define workflows.

What you're asking for is to consider only openEO's perspective, which of course is easier, because you're thinking only about your use cases.

And to clarify, the "users" here that I was referring to are the participants of TB20-GDC. We have had discussions trying to show others (including you) alternatives, but nobody wants to concede anything. So, we are in this situation where all 3 modes of sync, async, and create/trigger execution are supported by POST /jobs as alternatives of running processes by different users.

The best simplification that can be done is by merging (3) and (4), basically not allowing switching Prefer when doing POST /jobs/{jobId}/results. It would have to be configured from the get go when doing POST /jobs or by follow-up PATCH /jobs/{jobId} to modify it.

Comment:

The first 2 are a direct mapping of what POST /processes/{processId}/execution defines. Those have to be mapped as-is onto POST /jobs to allow a server/client to work only with the /jobs endpoint and to integrate workflows.

Why if there's already an alternative?

The other 2 are for openEO alignment, allowing the creation of a job first and then triggering its execution with a second request. At the moment, OAP has no concept of creating a job without queuing it right away, so the first 2 points need to remain available for servers that do not intend to create jobs this way but still want to define workflows.

3 and 4 are not available in openEO right now?! 4 would be if you removed the "and return with Job Status".
I still think not all of this needs to be available; clients could make this happen in one call if you want.
But I understand that historically OGC was about servers, not clients, which I think is a massive mistake.

What you're asking for is to consider only openEO's perspective, which of course is easier, because you're thinking only about your use cases.

No.

And to clarify, the "users" here that I was referring to are the participants of TB20-GDC. We have had discussions trying to show others (including you) alternatives, but nobody wants to concede anything. So, we are in this situation where all 3 modes of sync, async, and create/trigger execution are supported by POST /jobs as alternatives of running processes by different users.

No. We had discussions where each side was making compromises and we were on a good path, but it seems we are not following the first meeting, where I was still available. I guess we should listen to the first meeting recording again. The thing is, if no one makes compromises we should NOT align, because then we end up with an overengineered, complex API. Then it's easier to have two separate ones. I was under the impression that we were on a good path, but this PR does not reflect that as far as I can tell...

The best simplification that can be done is by merging (3) and (4), basically not allowing switching Prefer when doing POST /jobs/{jobId}/results. It would have to be configured from the get go when doing POST /jobs or by follow-up PATCH /jobs/{jobId} to modify it.

I think (2) and (3) can easily be removed. You can simply solve 2 with two subsequent requests, and why should there be two ways to process synchronously? We need to simplify, not allow everything to be solved in 100 ways.

Contributor:

Why if there's already an alternative?

It is not a direct alternative, because /processes/{processId}/execution does not allow all capabilities that /jobs offers. Notably, defining workflow/graphs that do not have only one top-level process, and providing additional job endpoints for metadata.
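For reference, a rough sketch of the kind of payload meant here, i.e. a Part 3 style workflow with a nested process submitted directly to POST /jobs (all URIs, process ids, and input names are invented for illustration):

# Hypothetical workflow execution request for POST /jobs
process: "https://example.org/ogcapi/processes/mosaic"        # assumed top-level process URI
inputs:
  data:
    process: "https://other-server.example/processes/ndvi"    # nested process (Part 3 style)
    inputs:
      imagery:
        href: "https://example.org/data/scene.tif"            # plain Part 1 href input
outputs:
  result:
    transmissionMode: reference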

3 and 4 are not available in openEO right now?!

You misunderstood. It's the other way around. It is not available in OAP!
Only openEO defines such a capability to create a job without putting it in the queue. In OAP, when you POST the job, it is either executed in sync or async, but it starts right away. OAP does not allow creating a job, modifying it, and submitting it to the queue afterwards. This is what Part 4 adds.

I think (2) and (3) can easily be removed. You can simply solve 2 with two subsequent requests, and why should there be two ways to process synchronously?

I agree about removing (3), and keeping only (4) for the openEO-style execution.

However, (2) should remain. It is a completely different use case from (3)/(4), because it locks the job right away and places it in the queue (just as if it were executed in sync right away). Why force the client to do a subsequent request to trigger the job, when this hint can be supplied immediately via the Prefer header? In most situations, clients that submit a job intend for it to run right away, because their goal is to obtain the result from it. Avoiding an extra request every single time a job is submitted effectively cuts the network traffic in half. If the client intends to do more with the job (modify it, review it, whatever) before executing it, it makes more sense IMO that this detail be specified explicitly (i.e., status: create). Then, the client/server agree explicitly that a subsequent request is needed for the actual execution trigger.

Contributor:

So this clause says that there is no encoding for a specific job definition. Good!

It then indicates that this Standard includes two conformance classes ... one for "OGC API - Processes - Workflow Execute Request" and one for "OpenEO Process Graph/UDP". Also good.

However, what about an execute request from Part 1 that executes a single process? Shouldn't I be able to post an execute request to create a job that executes a single deployed process? ... without all the workflow dressing?

Contributor:

Yes, you should be able to execute a single process as well, since it would look similar to an "OGC API - Processes - Workflow Execute Request", just without any nested process (see the sketch below). I think the same schema can be used directly for validation, but an explicit mention of single-process execution could be added to clarify.
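For illustration, a minimal single-process payload of that kind might look like this (the process URI and input name are invented; the same workflow schema would be expected to validate it):

# Hypothetical single-process job creation via POST /jobs (no nesting)
process: "https://example.org/ogcapi/processes/echo"
inputs:
  message: "hello"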

Contributor:

@gfenoy @fmigneault I think Part 4 needs to be reorganized so that the "core" is agnostic about what payload creates the job. I think this is already the case, but maybe it's not as clear as it could be. There should then be three conformance classes defined:

  1. A conformance class that binds Part 4 to Part 1 (i.e. creating/managing a job by executing a process),
  2. A conformance class that binds Part 4 to Part 3 (i.e. creating/managing a job by executing a Part 3 OGC workflow/chain),
  3. A conformance class that binds Part 4 to OpenEO (i.e. creating/managing a job by executing an OpenEO UDP)

Other standards can then add their own conformance classes that bind Part 4 to their particular requirements (e.g. coverages).

The fact that a Part 3 workflow, in its simplest form, decomposes to a Part 1 execute request is neither here nor there. Someone who only implements Part 1 and Part 4 should not need to rely on Part 3 in any way.

Now that I've more-or-less finished working on Records I'll switch gears to Processes and make suggestions to Part 4.

@jerstlouis (Member) commented Oct 14, 2024

Shouldn't Part 3 contain the details of both the Part 1 execution-request-style workflow definitions (whether extended with nested processes or not) as one conformance class, and the OpenEO stuff as a separate requirements class (there is already an OpenEO conformance class in there, as well as a CWL one)?

Then Part 4 could be this "Common" job management thing that is not specific to Processes at all, as we had discussed.

The "Collection" input / output stuff could be moved to another Part 5 if that is causing too much confusion to have it also in Part 3.

Contributor:

@jerstlouis I think someone should be able to implement Part 1 and the advanced job management as per Part 4 without having to reference Part 3 (which deals with workflows) at all. The common or core part of Part 4 would eventually move to Common, and Part 4 would simply be the bits binding the Common job management stuff to the Processes-specific payloads, as I described above.

@jerstlouis (Member) commented Oct 15, 2024

@fmigneault

I think that the fact the process/workflow employs a "Collection Output" should be encoded in the submitted job.

I very strongly disagree with that.
The whole idea of Collection Output is that it is an execution mechanism (as a reminder, despite the name, this is the concept of a "Virtual Collection" with on-demand AoI/ToI/RoI processing -- not the ability of a particular process or workflow to "output a collection", i.e., NOT uploading the results somewhere after a batch process is complete), implemented by the processing engine, irrespective of any particular process or workflow.

It is not something that the individual process or workflow output needs to worry about at all.

would cause conflicts with the multiple APIs that already use this for creating collections without any involvement from OAP.

This is resolved easily by defining a media type or content-schema / profile for the particular type of workflow definition, such as OGC API - Processes execution requests (Content-Type:, Content-Schema:, Content-Profile: header or ?profile= parameter for the POST /collections request).

Not sure to understand what you have in mind. Is this related to the ?response=collection query parameter? If

Yes, this is currently defined in the "Collection Output" requirements class of Part 3.

If so, why not just POST /jobs/{jobId}/results?response=collection, or even better, reuse response body parameter that was already available in OAP v1 (https://docs.ogc.org/is/18-062r2/18-062r2.html#7-11-2-4-%C2%A0-response-type) when submitting the job?

I thought we agreed in one of the recent meetings that this Collection Output (Virtual Collections) would NOT use Part 4 jobs at all, leaving /jobs for sync and async execution. I am focusing on this POST to /collections approach, potentially eventually implementing a POST to /jobs for the "sync" output, but I have no plan to implement /jobs/{jobId} or anything after that path anytime soon.

Contributor:

The whole idea of Collection Output is that it is an execution mechanism [...]

Then, I STRONGLY recommend renaming it and moving it out of Part 3, because that is both confusing and misleading, especially when the document also presents "Collection Input" as an actual process input.

If they are "On-Demand Virtual Collections", then just name them like this...

[about ?response=collection parameter]
Yes, this is currently defined in the "Collection Output" requirements class of Part 3.

Then, I'm even more confused about why you consider this an "execution mechanism". It looks to me like it is only a specific way to return the output with a guarantee it will be a collection endpoint where you will have a certain set of collection querying/filtering capabilities. I don't see how that is any different from any process that already returns a URI to a STAC/Features/etc. collection as output. Whether that collection URI is a static or virtual collection should not matter at all, and whether any "on-demand" processing must occur to accomplish the query/filtering shouldn't matter either.

I thought we agreed in one of the recent meetings that this Collection Output (Virtual Collections) would NOT use Part 4 jobs at all

Agreed, because "Virtual Collections" have nothing to do with Job definitions.

On-demand processing could trigger one or many job executions to monitor a virtual collection querying/filtering operation, but no "Job" would be defined to contain the "on-demand" trigger condition itself. Jobs are not pub/sub channel definitions. They are the instantiation of a certain trigger being realized.

"Virtual Collection" with on-demand

I'm not quite sure what that actually changes in the context of OAP (whether Part 3, 4 or whatever).

If some processing is triggered by an input directive when querying the collection, that trigger should perform any relevant workflow processing as if it had been submitted on POST /jobs, and publish the result to update the "Virtual Collection" wherever that is located. Which "processing workflow" should be triggered in such cases should be a property defined under the Collection definition itself. However, the OAP /jobs would not contain any entry about this collection until an access query is actually performed. The POST /jobs gets called "on-demand" when the GET /collections/... operations that need the workflow processing happen.

@jerstlouis (Member) commented Oct 15, 2024

especially when the document also presents "Collection Input" as an actual process input.

It is exactly the same thing for Collection Input: it is not a particular input to a process, but an additional way in which content can be provided to a process, irrespective of the particular process (an alternative to the inline "value" and "href" mechanisms to provide input to a process). The actual processes invoked by the processing engine (e.g. the Part 2-deployed Docker containers) would never see the collection URI -- they would only see the blob of data coming in, and go through exactly the same code whether that came in as an "href", as a "value", or as a "collection" in the input.

I STRONGLY recommend renaming it and moving it out of Part 3

We could potentially move the Collection Input / Output requirements classes, and the associated input/output field modifiers, to a Part 5 if this helps.
They were in Part 3 because they are part of the syntax of the execution-request workflow definition language (the extended Part 1 execution request).
But they could be considered an extension defined in a separate Part 5 if that helps. They were always separate requirements classes from the "Nested Process" / "Remote Processes" requirements classes.

Then, I'm even more confused about why you consider this an "execution mechanism".

Because the actual way to "execute" the process is to:

  • A) Instantiate the virtual collection by POSTing the execution request to /collections
  • B) Perform an OGC API data request using one of the available access mechanisms on that virtual collection (e.g., OGC API - Coverages, Tiles, DGGS, Features, EDR...)

The client never needs to POST to /jobs or to /processes/{processId}/execution, and suddenly all data visualization clients magically become processing-enabled.
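A rough sketch of that two-step flow, in the spirit of Part 3's "Collection Output" (the collection id, process URI, and query are invented for illustration):

# Step A (hypothetical): POST an execution request to /collections
# to instantiate the virtual collection.
process: "https://example.org/ogcapi/processes/ndvi"                  # assumed process URI
inputs:
  data:
    collection: "https://example.org/ogcapi/collections/sentinel-2"   # Part 3 collection input
# The response would point at the new virtual collection, e.g.
#   /collections/ndvi-on-demand
#
# Step B (hypothetical): any OGC API data access request on that collection
# triggers processing for just the requested AoI/ToI, e.g.
#   GET /collections/ndvi-on-demand/coverage?bbox=5.9,45.8,10.5,47.8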

I don't see how that is any different from any process that already returns a URI to a STAC/Features/etc. collection as output.

It is a completely different paradigm. It's not something that is implemented in a per-process way; it's the processing engine that supports executing processes for specific AoI/ToI/RoI, and will itself execute the processes on the data subset as needed. Rather than the input data having to already exist, then the workflow being executed on it, then the processing engine uploading the results somewhere, then the client getting a notification and retrieving the data, it's a "pull" mechanism where the client sets up the pipeline, then pulls on the output, and everything flows magically. It's similar to the piping mechanism in UNIX. This paradigm avoids all of the problems with batch processing, and allows for instant on-demand processing, minimizing latency. It also easily works in a distributed manner because once this is implemented:

  • any OGC API collection anywhere can be used as an input to a process,
  • any OGC API process or workflow anywhere can be used as an input to another process (because it can produce a virtual collection),
  • any OGC API client (e.g., GDAL) is able to easily access data from a process or workflow by simply doing a single POST operation to create the virtual collection

Contributor:

It is exactly the same thing for Collection Input: it is not a particular input to a process, but an additional way in which content can be provided to a process, irrespective of the particular process (an alternative to the inline "value" and "href" mechanisms to provide input to a process)

I read this, and I only see contradictory messages. It's supposed to not be a particular input, but at the same time is an alternative to value/href, which are particular inputs. What?

irrespective of the particular process

This is the case for every input. Not sure what the message is here either.

processes invoked [...] would never see the collection URI -- they would only see the blob of data coming in

This is the exact same procedure for an href that would be downloaded if the process needs the data referred by the URI to do its job. The only thing collection allows on top of a usual URI is to auto-resolve some access mechanism and apply additional queries/filters. But, in the end, it is exactly the same. Data gets downloaded after negotiation, and what the process actually sees is a blob of data. So again, no difference. Therefore, no, "Collection Input" is not the same as "Collection Output" if you consider "Collection Output" to be an execution mechanism. It doesn't make any sense to mix them up.

Because the actual way to "execute" the process is to [...]

I agree with all that. Hence my point. What does it have to do with OAP? There is no /collections in OAP. If OAP calls a collection as input/output, why should it care at all whether that collection needs to do any processing behind the scenes? From the OAP point of view, trying to resolve a workflow chain, it should not care at all what that server does in its backend to serve the requested data from the collection reference.

It is a completely different paradigm [...]
Rather than the input data having to already exist [...]

I fail to see how it is any different. A process returning a STAC collection URI can do all of this as well. Why is it assumed that a URI returned by such a process would be an existing collection? Why do you assume that URI could not also trigger on-demand processing when its data is requested from it? It can already be acting like a "Virtual Collection" conceptually, without any conformance classes trying to describe it as something else that is "special". Everything mentioned is already achievable.

The only thing relevant IMO is the collection as input, to indicate it is not a "simple" download request of the URI. What each node/hop in the chain does afterward when the relevant data is identified from the collection is irrelevant for the "local" processing engine. It is the "problem" of that remote collection to serve the data as requested. If that triggers further processing under the hood for that remote server, it is also up to it to execute what it needs to do, however it wants. The local processing engine just needs to worry about what it receives in return, regardless how it was generated, and pass it along to the next step in the chain.

Anyway. This is getting off track for this PR, so I will stop here.
I've already mentioned all of this many times in #426, and it still makes no sense to me.

@jerstlouis (Member) commented Oct 16, 2024

This is getting off track for this PR, so I will stop here.

This "PR" is a whole Part. Normally, this would be a separate Project with multiple issues for discussing multiple aspects.

Part 4 is tightly coupled with Part 3, because it proposes /jobs as the new end-point for workflow execution, which Part 3 was already doing using /processes/{processId}/execution, which we would probably drop in favor of this Part 4 and the POST /collections for Virtual Collection execution. So we also need to discuss these other things in Part 3 (Collection Input/Output) to try to disentangle all that and see whether they should end up in a separate Part 5.

It's supposed to not be a particular input, but at the same time is an alternative to value/href, which are particular inputs. What?

By "particular input", I meant an input defined as a String URI type in a particular process, which the process will understand to be an OGC API collection. I meant that "collection" is a first-class input type, like "href" and "value", which the implementation of processes do not have to handle themselves since they're taken care of by the processing engine.

This is the exact same procedure for an href

There is a lot of similarity, yes. Good that we are on the same page on that.
However, an href cannot imply an arbitrary format, AoI/ToI/UoI, or access mechanism, because in an href all that needs to be hardcoded. Therefore a "collection" input only implies a particular collection, but represents "the entire collection", as opposed to a particular subset.

What does it have to do with OAP? There is no /collections in OAP.

We talked about moving this "Collection Output" requirements class currently in Part 3 which defines the POST to /processes/{processId}/execution to an "OGC API - Processes - Part 5: Virtual collection output and remote input collections" and changing this to a POST /collections instead, where the payload is an OGC API - Processes execution request (which can contain a workflow as defined in Part 3 "Nested Processes").

it should not care at all what that server does in its backend to serve the requested data from the collection reference.

The processing engine receiving the POST (execution request) to /collections does more than that:

  • It first validates the workflow definition and sets up the virtual collection
  • It can connect data access requests (for the virtual collection output) to trigger processing (whether remote or local)
  • It has a concept of implied AoI/ToI/RoI parameters, which it may map automatically to e.g. a "bbox" parameter when Part 1-style processes require it explicitly as a parameter
  • It can be a Processes - Part 1 client to chain remote processes (the current "Remote Core Processes" requirements class in Part 3)
  • It can retrieve the relevant subset using OGC APIs from input collections

A process returning a STAC collection URI can do all of this as well. Why is it assumed that a URI returned by such a process would be an existing collection? Why do you assume that URI could not also trigger on-demand processing when its data is requested from it?

The assets of a STAC collection where the actual data is located are different URIs.
While it is technically possible to have separate items/assets for separate RoI/ToI/AoI/bands/formats, e.g. implementing a resource tree like OGC API - Tiles (/tiles/{tmsId}/{level}{row}{col}) and OGC API - DGGS (/dggs/{dggrsId}/zones/{zoneId}/data), implementing this as a virtual STAC collection in practice would require to list each of these as separate items/assets, and identifiying relevant items/assets would be a pain and inefficient without the STAC API. If the STAC API is available (and works as expected -- multiple STAC API implementations still have issues with their implementation of Features & Filtering/CQL2 even though they claim conformance), then yes that could work. So then the STAC API + datacess would make it a valid OGC API access mechanism for that virtual collection.

I fail to see how it is any different.

The big difference with "Virtual Collection Output" is that a data access client can simply do one POST (execution request) to /collections and get a virtual collection URI back (which did not exist beforehand -- the client just created it when setting up the workflow) and then proceed as usual, as if the collection were a regular OGC API collection (whether STAC, or Coverage, or Tiles...). That makes it super easy to integrate this capability in OGC API data access clients like GDAL / QGIS.

It is the "problem" of that remote collection to serve the data as requested.

In full agreement there.

The local processing engine just needs to worry about what it receives in return, regardless how it was generated, and pass it along to the next step in the chain.

It also needs to know what subset to request from the input collection / remote processes to fulfill its own requests (since a virtual collection is not restricted/hardcoded to a particular AoI/ToI/RoI/...) -- that's how the processing chain flows down from the client pulling on the output to the source data.

I've already mentioned all of this many times in #426, and it still makes no sense to me.

We've discussed all this at length for 4 years in what will eventually be over 100 GitHub/GitLab issues :) Sometimes I feel like we understand each other and are in strong agreement. I'm not sure which part of what I'm saying makes no sense to you, but hopefully we can continue discussing to address the remaining or new misunderstandings, and get back on the same page :)

fmigneault added a commit to crim-ca/weaver that referenced this pull request Oct 15, 2024
@gfenoy (Contributor, Author) commented Oct 17, 2024

@gfenoy I've been working on implementing job management, and while looking at the PR to apply a comment, I noticed there is no openapi/paths/pJobs file or similar defining the POST /jobs, or any of the other endpoints added under /jobs/{jobId}/....

Thanks a lot for pointing this out. I have drafted some of these missing files and they are now available here: https://github.com/GeoLabs/ogcapi-processes/tree/proposal-part4-initial/openapi/paths/processes-job-management.

I would like to stress that I only included the newly added methods and did not include the pre-existing ones, which should still be added.

Maybe we can update the update.sh script later on, to concatenate the files with the same name in the original processes-core directory.

@m-mohr commented Oct 18, 2024

@gfenoy What do we do with the unresolved comments in this PR?

@gfenoy (Contributor, Author) commented Oct 18, 2024

@m-mohr For unfinished discussions, please use the issue system and tag your issue with Part 4 (Job Management).

We think this will make it easier to organize the work and discussions.

@fmigneault (Contributor):

@gfenoy @pvretano @ghobona
I don't mind going through the PR and creating the issues, but I do not have enough permissions to apply the labels. Can someone grant me this access?

@gfenoy (Contributor, Author) commented Oct 18, 2024

I cannot grant privileges to contributors of this repository.

@m-mohr commented Oct 19, 2024

Hmm, okay. I had hoped that if we review a PR, at least some of the review comments would be considered and the PR updated. Not even typos with actual committable suggestions were merged. The procedure seems suboptimal. Now I need to spam issues, copying text that may lack context. But anyway, I opened the issues and added a prefix to the titles since I can't assign labels.

@ghobona (Contributor) commented Oct 22, 2024

The labels appear to have been applied.

Labels: Part 4 (Job Management) - OGC API - Processes - Part 4: Job Management
7 participants