Feature request: add started, ... to job metadata #542

Open
soxofaan opened this issue Jul 23, 2024 · 12 comments · May be fixed by #556

@soxofaan
Member

job metadata at GET /jobs/{job_id} currently lists these timestamps:

  • created (required): Date and time of creation
  • updated (optional): Date and time of the last status change

This is a feature request to add

  • started (optional): date and time when the job was started (POST /jobs/{job_id}/results)
  • stopped (optional): date and time when job stopped running (because of reaching status finished/error/canceled)

Context: we are handling some larger openEO use cases where a significant number of jobs has to be managed. We noticed that the "created" timestamp is not always very informative, while a "started" timestamp would be more relevant. For example, the jobs may be created in bulk in advance but started over a longer period, possibly hours or days after creation.

@soxofaan
Member Author

FYI: I'm willing to create a PR for this (should be pretty straightforward I guess). Unless there are objections to the idea in general

@soxofaan
Member Author

cc @HansVRP

@HansVRP

HansVRP commented Jul 23, 2024

sounds excellent. Is 'started' then called when running?

@soxofaan
Member Author

After discussing this some more, it might be more useful and scalable not to add top-level timestamps, but a "timeline" construct to keep track of the various lifetime events of a batch job, e.g. (added comments are for illustration)

  "timeline": [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],   # user started job 4 days after creation
    ["queued", "2017-01-05T12:35:01Z"],  # reached status "queued" 5s later
    ["running", "2017-01-05T12:39:10Z"],  # reached status "running" after 4 minutes
  ],

Note that I did not define the timeline here as a mapping object, but as an array/list of tuples: it has an explicit order, and it supports repeating an event if that is necessary (e.g. restarting a job).
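
As a rough illustration of how a client could consume such a list, here is a minimal Python sketch that computes the time between consecutive events. The `timeline` structure is the one proposed above and is not part of the openEO API; `parse_ts` and `durations` are hypothetical helper names used only for this example.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    # Replace the trailing "Z" with an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def durations(timeline):
    """Return (event, next_event, seconds) for each pair of consecutive events."""
    parsed = [(event, parse_ts(ts)) for event, ts in timeline]
    return [
        (a, b, (tb - ta).total_seconds())
        for (a, ta), (b, tb) in zip(parsed, parsed[1:])
    ]


# Hypothetical "timeline" value, copied from the proposal above.
timeline = [
    ["created", "2017-01-01T09:32:12Z"],
    ["started", "2017-01-05T12:34:56Z"],
    ["queued", "2017-01-05T12:35:01Z"],
    ["running", "2017-01-05T12:39:10Z"],
]

for event, next_event, seconds in durations(timeline):
    print(f"{event} -> {next_event}: {seconds:.0f}s")
```

Because the list is ordered and allows repeated events, the same loop would also work for a job that was restarted and therefore has several "started" entries.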

@m-mohr
Member

m-mohr commented Sep 12, 2024

This sounds like a simplified version of the logs to me, so I'm a bit sceptical.
You can already express that in a human-readable way in the logs through the log timestamps and corresponding messages.

@soxofaan
Member Author

My proposal at #542 (comment) is a lot more primitive than logs. It's just a list of event-timestamp pairs (the events could be a predefined enum). It's small data, so it can easily be included directly in the job metadata, with no need for an extra endpoint as with logs.

But it doesn't have to be that listing. The initial question is about how to include the actual start and stop time of jobs (in addition to the creation time and the "last status change" time).

@m-mohr
Member

m-mohr commented Oct 12, 2024

What's the use case for having start and stop time? Or is it actually the effective runtime (stop - start) that you want to get? Usually updated should be the stop time (after execution has finished), although that may differ if you make changes to the metadata of the job afterwards.

@soxofaan
Member Author

From our end, there are multiple use cases:

  • execution benchmarking and profiling in the context of algorithm hosting (e.g. APEx and related use cases). Here you want to build insights/stats on how long jobs are queued before running, how long they run until failure or success, ...
  • large-scale client-side batch job management. E.g. as a user I want to run hundreds/thousands of jobs, but at most a handful in parallel. To manage my resources/credits, I also want to be able to kill runaway jobs.

One could get this info by actively polling the job status and checking for status transitions, but if you want decent time resolution you would be forced to spam the back-end with status polling requests. However, the back-end probably has a full, exact view of the lifecycle of a batch job anyway, so it feels like a waste to try to guess all this from the client side.

> Usually updated should be the stop time

The problem with updated is that it only gives the time of the last status change, so if you didn't poll in time, you might have missed the info you're after. Put differently, it forces the user to spam the backend with status requests if they want more precise insights.
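
To make the resolution problem concrete, here is a small self-contained Python sketch (no real back-end involved). It takes a list of hypothetical client-side poll observations and derives, for each observed status change, the time window in which the transition must have happened. `infer_transitions` and the sample data are made up for illustration.

```python
def infer_transitions(observations):
    """Infer status transitions from polling observations.

    observations: ordered list of (poll_time_in_seconds, status).
    Returns a list of (status, earliest, latest): the status was reached at
    some unknown moment after `earliest` and no later than `latest`.
    """
    transitions = []
    prev_time, prev_status = observations[0]
    for time, status in observations[1:]:
        if status != prev_status:
            transitions.append((status, prev_time, time))
        prev_time, prev_status = time, status
    return transitions


# Polling once per minute: each transition is only known to within 60 seconds.
polls = [(0, "queued"), (60, "queued"), (120, "running"), (180, "finished")]
print(infer_transitions(polls))

# Polling every 5 minutes: the "running" phase is never observed at all.
sparse_polls = [(0, "queued"), (300, "finished")]
print(infer_transitions(sparse_polls))
```

The uncertainty of each inferred timestamp equals the polling interval, and short-lived states can be missed entirely, which is why back-end provided timestamps would be more reliable than client-side guessing.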

@m-mohr
Member

m-mohr commented Oct 14, 2024

Just trying to understand things better right now, to get to a good solution...

First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

Second use case: That's what budget was meant for, but it's specified in the currency of the backend, not in time (unless the currency is time). Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here? The plain time doesn't necessarily have any relation to the credits.

@soxofaan
Member Author

> First use case: Does this need to be exposed publicly though? It seems that this can be done internally.

We'd prefer to decouple the benchmarking system from the particular backends-under-test here, and use standardized metadata/reporting instead of having to invent, implement and maintain some reporting backchannel for each possible backend-under-test.

> Isn't the actual number of consumed resources (as reported in usage - is that "live"?) more meaningful here?

Credit/cost consumption is indeed important to users, but so is time consumption.
Both are relevant, and they are relevant in different contexts: credit consumption is for the long-term, big-picture view ("how much will my application cost each month?"), while time consumption matters right now ("it feels like my jobs are slow at the moment").

> The plain time doesn't necessarily have any relation to the credits.

Indeed, that's the point of this feature request: not to replace credits/budget, but to add insights about the timing of the job.

@m-mohr
Member

m-mohr commented Nov 15, 2024

What would a reasonable set of properties be?

  1. created (exists) -> time when POST /jobs was executed (the only property that never changes)
  2. queued -> time when POST /jobs/:id/results was executed
  3. started -> time when the job switched from queued to running
  4. updated (exists) -> last update (includes when the job errored/finished/was cancelled unless updated afterwards)
  5. (ended -> job errored/finished/was cancelled - not sure about this property, it's usually the same as updated)
  6. expires / expired -> Whenever the results expire (i.e. results are deleted after this timestamp)

created and updated can't be set to null, but the others must be nullable, right?

So ended - started => runtime.
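
A minimal Python sketch of that calculation, assuming the property names from the list above (`queued`, `started`, `ended`), which are a proposal in this thread and not a finalized part of the API. The helpers `parse_ts` and `elapsed` and the sample `job` dict are hypothetical. Since the new properties are nullable, the result is `None` whenever one of the two timestamps is missing.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(value: str) -> datetime:
    # Replace the trailing "Z" with an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed(job: dict, start_key: str, end_key: str) -> Optional[timedelta]:
    """Time between two job timestamps, or None if either is missing or null."""
    start = job.get(start_key)
    end = job.get(end_key)
    if start is None or end is None:
        return None
    return parse_ts(end) - parse_ts(start)


# Hypothetical job metadata using the proposed property names.
job = {
    "created": "2017-01-01T09:32:12Z",
    "queued": "2017-01-05T12:35:01Z",
    "started": "2017-01-05T12:39:10Z",
    "ended": "2017-01-05T13:09:10Z",
}

print("queue wait:", elapsed(job, "queued", "started"))
print("runtime:", elapsed(job, "started", "ended"))
```

For a job that was created but never started, both calls return `None` instead of raising an error.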

@m-mohr
Member

m-mohr commented Jan 12, 2025

@soxofaan PR is available at #556
