use pod logs as feedback mechanism from flytekit to flytepropeller #3838

hamersaw · 2023-07-06T00:12:53Z

hamersaw
Jul 6, 2023
Maintainer

Motivation

As Flyte scales, users are requesting that more information flow from flytekit to flyteconsole. This runtime information may include additional configuration, execution metadata, etc. The current approach is to use a blobstore to write another file or append to an existing one. This has many issues, foremost of which are:
(1) small writes, especially file appends, are an anti-pattern.
(2) we are essentially piggybacking on a storage framework for inter-process communication to break out of k8s.

Proposal

This proposal is to use k8s Pod logs to encode flytekit "reports", which may then be parsed by flytepropeller and included in TaskExecutionEvents. Each report is emitted in the container logs in a regex-parsable form, with a unique regex for each category of information. For example, adding runtime-configurable log links could be encoded like:

"flyte:log id=<> url=<>"

# or more succinctly (single characters)
"f:l:i=<ID>:u=<URL>"

# maybe http-like
"f:l?i=<ID>&u=<URL>"

Flytepropeller will check Pod logs at most once every N (configurable) seconds rather than every time the task is evaluated, to reduce stress on the k8s apiserver. Check frequency versus near-real-time reporting is an obvious tradeoff. Additionally, flytepropeller will always check on Pod startup and completion to ensure correctness regardless of the throttling. At the risk of over-simplifying, flytepropeller uses a regex to parse each category of information and uses the result to modify the current TaskExecutionEvent. This can be further optimized by including a lastCheckTimestamp to reduce duplicate reporting.
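
To make the check-and-parse flow concrete, here is a minimal sketch in Python (flytepropeller itself is written in Go, so this only illustrates the logic). The `LogPoller` class, its method names, and the exact regex are assumptions made for illustration and are not part of the proposal. It also assumes each log line arrives with a timestamp.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical pattern for the first encoding above: flyte:log id=<ID> url=<URL>
LOG_LINK_RE = re.compile(r"flyte:log id=(?P<id>\S+) url=(?P<url>\S+)")


@dataclass
class LogPoller:
    """Illustrative stand-in for the propeller-side check; not flytepropeller code."""

    min_interval_seconds: float  # the configurable N
    last_check_timestamp: Optional[float] = None

    def should_check(self, now: float, phase_changed: bool = False) -> bool:
        # Pod startup and completion always trigger a check, bypassing the throttle.
        if phase_changed or self.last_check_timestamp is None:
            return True
        return now - self.last_check_timestamp >= self.min_interval_seconds

    def poll(
        self,
        now: float,
        lines: Iterable[Tuple[float, str]],
        phase_changed: bool = False,
    ) -> List[Dict[str, str]]:
        """Return log links found in (timestamp, line) pairs newer than the last check."""
        if not self.should_check(now, phase_changed):
            return []
        cutoff = self.last_check_timestamp
        links = []
        for timestamp, line in lines:
            if cutoff is not None and timestamp <= cutoff:
                continue  # already reported on an earlier check
            match = LOG_LINK_RE.search(line)
            if match:
                links.append(match.groupdict())
        self.last_check_timestamp = now
        return links
```

A caller would pass `phase_changed=True` on Pod startup and completion. A real implementation would also need to handle lines whose timestamps fall at or before the cutoff but which only became visible after the previous check, which this sketch ignores.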

Use Cases

There are a number of current efforts being blocked by a lack of flytekit to flytepropeller feedback. Currently in consideration (with a brief explanation of how this proposal is a solution) are:

configurable log links: the user generates a log link (e.g. wandb) and flytekit writes a log line with the link ID and URL, which is read by propeller and reported.
reporting existence of flytedeck immediately: once a flytedeck is created, a log line is written that propeller will see on the next check.
realtime flytekit runtime metrics: span start and end times can be emitted as single lines, which are tracked and reported by propeller in near real-time.

Considerations

The log information must be a single line and very small (128 / 256 character max?), so logs are not unnecessarily bloated. The increase in verbosity will affect the performance of each log check and parse from flytepropeller. I do not think there is a Pod log k8s request field to get all the logs after timestamp N, so every check will retrieve the entirety of logs.
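
On the flytekit side, the single-line and size constraints could be enforced before anything is written. The sketch below is only an illustration: `encode_log_link` and `MAX_REPORT_CHARS` are hypothetical names, and the cap of 256 is one of the two values floated above, not a decided limit.

```python
MAX_REPORT_CHARS = 256  # assumed cap; the post floats 128 or 256


def encode_log_link(link_id: str, url: str, max_chars: int = MAX_REPORT_CHARS) -> str:
    """Format a log-link report as one short line, or raise if it cannot be."""
    # Whitespace (including newlines) would break the single-line, regex-parsable form.
    if any(ch.isspace() for ch in link_id + url):
        raise ValueError("id and url must not contain whitespace")
    line = f"flyte:log id={link_id} url={url}"
    if len(line) > max_chars:
        raise ValueError(f"report is {len(line)} characters, limit is {max_chars}")
    return line
```

Raising an error is one option. Silently dropping oversized reports would be another, at the cost of links that never show up without explanation.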

Ideas

Could users set the flytekit log level to control which information is reported from flytekit to propeller? For example, runtime metrics are TRACE level, configurable log links are INFO level, etc.
This is a precursor to exposing logs in the UI, unless we are hoping for a streamable, non-persistent solution here.

kumare3 · 2023-07-06T05:38:42Z

kumare3
Jul 6, 2023
Maintainer

Checking logs at scale can be extremely expensive as you will invariably stream logs to propeller memory?

2 replies

fg91 Jul 6, 2023
Collaborator

I have seen situations where ML engineers, in an effort to debug their training jobs, set the log level (or worse, when doing distributed training, the nccl log level) to debug, thereby producing 10s of millions of log lines in less than an hour, to the point where there were small spikes in billing for the logging service. I'm worried that such a situation would bring down propeller.

I agree though that piggy-backing on blob storage to communicate real-time updates is a pretty bad solution.

fg91 Jul 6, 2023
Collaborator

I feel that to solve this in a scalable way we would need to use something like pub/sub, while being fully aware that this would add to the complexity of the deployment, with which users already sometimes struggle today.

hamersaw · 2023-07-07T21:17:44Z

hamersaw
Jul 7, 2023
Maintainer Author

Going to keep riffing on this with myself. Many logging frameworks deploy a sidecar to handle seamless integration. We could use a lightweight sidecar (container B) to accept feedback requests from the flytekit process (container A), using a localhost connection or something, and then dump immediately to stdout. Propeller can retrieve the logs from the sidecar using the k8s api server as a feedback mechanism. There would be very little traffic, only as much as we write (hopefully 10s of lines) and would give all of the same benefits of parsing logs directly on the flytekit container.
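
A minimal sketch of what such a sidecar could look like, in Python using only the standard library. The port number, the newline-delimited wire format, and the function names are assumptions for illustration. Nothing here is an existing Flyte component.

```python
import socket
import sys
from typing import TextIO


def relay_reports(conn: socket.socket, out: TextIO = sys.stdout) -> int:
    """Copy newline-delimited reports from one connection to `out`; return the count."""
    count = 0
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        for line in reader:
            out.write(line.rstrip("\n") + "\n")
            out.flush()  # dump immediately so the line reaches the container log
            count += 1
    return count


def serve(host: str = "127.0.0.1", port: int = 8765) -> None:
    """Accept connections from the flytekit container, one at a time (hypothetical port)."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _ = server.accept()
            relay_reports(conn)
```

Containers in the same Pod share a network namespace, so the flytekit container could reach this listener on localhost. Handling one connection at a time keeps the sketch short. A real sidecar would likely want concurrency and a cap on accepted lines.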

3 replies

fg91 Jul 8, 2023
Collaborator

We are all aware that a pub/sub or message queue system would be the cleanest solution, but we explicitly try to avoid it in order not to raise the entry barrier for deployment. In that spirit, I like the sidecar suggestion the most so far!

It doesn't misuse blob storage
The additional stress on propeller would be far lower compared to parsing all primary container logs. (The sidecar could even enforce an upper limit on live updates to protect propeller.)
We don't have restrictions such as "only the first and last x lines of container logs are parsed" (which could have led to situations where increasing the log level makes log links in the UI disappear)

So 👍

About the details of implementation:
Do you know whether there is a way for propeller to only parse new sidecar log lines on every iteration of the reconciliation loop? Or would they be parsed from the beginning every time?

fg91 Jul 8, 2023
Collaborator

@fellhorn for visibility

fellhorn Jul 17, 2023

I gave this topic some more thought too. My main concern was the log throughput and the overhead of parsing all logs. The sidecar could remove this burden, and we would at least have a direct communication channel between the main container and the sidecar (mounted file or socket). From a performance standpoint I don't see issues here anymore, though I am still not a big fan of using logs as a communication channel due to the missing guarantees (delays, data lost if the node disappears, ...).

I understand we are limited in our choices if we need to avoid a direct channel from the workload to flytepropeller.

As an alternative idea, I was thinking about using k8s events, given that we need to stay within a very restricted size range anyway:

"The log information must be a single line and very small (128 / 256 character max?)"

The Event note field currently allows 1kB of data. It might give us more persistence guarantees, but it trades off against the number of events we can handle without overwhelming etcd and the k8s API.

@fg91 The Kubernetes log API offers a since-time selector, which should allow us to request only the logs since the last iteration.
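
As a rough illustration of that selector: the core/v1 Pod log endpoint accepts a `sinceTime` query parameter (an RFC 3339 timestamp) alongside `container` and `timestamps`. The sketch below only builds the request path with the Python standard library. The helper name and the namespace, pod, and container values in the usage are made up.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def pod_log_path(namespace: str, pod: str, container: str, since: datetime) -> str:
    """Build the core/v1 Pod log request path, limited to lines newer than `since`."""
    query = urlencode(
        {
            "container": container,
            # RFC 3339 in UTC; a naive datetime would be interpreted as local time.
            "sinceTime": since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "timestamps": "true",  # prefix each line with its timestamp
        }
    )
    return f"/api/v1/namespaces/{namespace}/pods/{pod}/log?{query}"
```

For example, `pod_log_path("flyte", "task-pod", "sidecar", last_check)` would yield a path that can be appended to the apiserver address, with `last_check` being the timestamp recorded at the previous iteration.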

davidmirror-ops · 2023-11-14T19:46:23Z

davidmirror-ops
Nov 14, 2023
Maintainer

2023-11-09 Contributor's meetup notes: with support from @hamersaw, @fg91 volunteers to champion an RFC for this idea.

0 replies

davidmirror-ops · 2024-11-07T23:45:48Z

davidmirror-ops
Nov 7, 2024
Maintainer

@fg91 do you still plan to shepherd an RFC for this?

1 reply

fg91 Nov 12, 2024
Collaborator

Unfortunately, I don't have the capacity for it at the moment. Since https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-wandb now provides an alternative way to configure log links to experiment tracking stores like wandb in the flyte UI, this is also not as pressing anymore :)
