-
Notifications
You must be signed in to change notification settings - Fork 683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Allow controlling in which task phases log links are shown #4726
Conversation
634786f
to
f0946f6
Compare
4bbdd6d
to
66c6b29
Compare
@@ -236,15 +236,9 @@ func (plugin) GetTaskPhaseWithLogs(ctx context.Context, pluginContext k8s.Plugin | |||
|
|||
if err != nil { | |||
return pluginsCore.PhaseInfoUndefined, err | |||
} else if phaseInfo.Phase() != pluginsCore.PhaseRunning && phaseInfo.Phase() == pluginState.Phase && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic moved without change to k8s.MaybeUpdatePhaseVersion(
so that it can be reused in other plugins.
@@ -172,7 +172,7 @@ func (plugin) GetTaskPhaseWithLogs(ctx context.Context, pluginContext k8s.Plugin | |||
} | |||
|
|||
taskExecID := pluginContext.TaskExecutionMetadata().GetTaskExecutionID() | |||
if pod.Status.Phase != v1.PodPending && pod.Status.Phase != v1.PodUnknown { | |||
if pod.Status.Phase != v1.PodUnknown { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we do want to generate task logs while we are pending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not a big problem but I think this change might be the reason that we get quite a lot of spam in the flytepropeller logs like
{"json":{"exec_id":"fybmhjrhjorydc","node":"n0/dn1","ns":"<namespace>","res_ver":"3487679492","routine":"worker-154","tasktype":"python-task","wf":"<workflow>"},"level":"error","msg":"containerStatus IndexOutOfBound, requested [0], but total containerStatuses [0] in pod phase [Pending]","ts":"2024-10-28T13:56:14Z"}
This comes from this line, so I wonder if we should adjust the logging a bit in response to this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing it to my attention that this change causes excessive logs!
Before this PR far, GetLogsForContainerInPod
wasn't called if pod.Status.Phase == v1.PodPending
. If in a later pod phase there still weren't any container statuses, this was unexpected, hence error log level.
Since we now optionally generate task logs already in the pending phase, if I'm not misunderstanding something, it's not unexpected that there might not be any container statuses at this point 🤔
I propose to here set the log level to debug if the pod is in pending phase and keep it at error level otherwise.
Does this make sense to you @Tom-Newton ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a PR 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phaseVersionUpdateErr := k8s.MaybeUpdatePhaseVersionFromPluginContext(&phaseInfo, &pluginContext) | ||
if phaseVersionUpdateErr != nil { | ||
return phaseInfo, phaseVersionUpdateErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, none of the k8s plugins apart from pod have checked whether the reason changed within the same phase. This means that per task phase only a single event was sent.
For the "queued" phase, this is the default "task submitted to K8s"
event emitted here.
We need to send subsequent events to admin once the reason changes as the first event doesn't contain the log links.
As a nice side effect, the status updates at the top right of flyte console now show a lot more updates within a single phase.
At the example of the tensorflow plugin/queueing phase:
Master:
4/7/2024 5:19:38 PM UTC task submitted to K8s
Branch:
4/7/2024 5:30:12 PM UTC task submitted to K8s
4/7/2024 5:30:13 PM UTC JobCreated
4/7/2024 5:30:12 PM UTC Restart policy in pod template will be overwritten by restart policy in replica spec
4/7/2024 5:30:13 PM UTC Created pod: f76767053e7b04e95aef-n0-0-chief-0
4/7/2024 5:30:13 PM UTC Created service: f76767053e7b04e95aef-n0-0-chief-0
4/7/2024 5:30:13 PM UTC Created pod: f76767053e7b04e95aef-n0-0-ps-0
4/7/2024 5:30:13 PM UTC Created service: f76767053e7b04e95aef-n0-0-ps-0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sparkConfig := GetSparkConfig() | ||
taskLogs := make([]*core.TaskLog, 0, 3) | ||
taskExecID := pluginContext.TaskExecutionMetadata().GetTaskExecutionID() | ||
|
||
if !isQueued { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff view is unfortunate here. The only change until line 409 is that if !isQueued {
is removed because we need to send all logs earlier.
The diff view suggests that "Driver logs" config was removed, the "User logs" config then turned into the "Drivers log" config, and so on.
if isPhaseChange { | ||
filterExternalResourceLogsByPhase(taskExecutionClosure.Metadata.ExternalResources, request.Event.Phase) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required for map tasks which treat log links differently for performance/scalability reasons.
Master:
This branch:
Note that the task execution view with the log links only becomes available in the running phase for map tasks, meaning that there is no use case for showing log links already in the pending phase. However, log links can be removed once finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: array node map tasks replaced map tasks in flytekit 1.12.0 and this distinction is no longer true, i.e. array node map tasks log links no longer use the separate map tasks log links.
} | ||
return pluginsCore.PhaseInfoRunning(pluginsCore.DefaultPhaseVersion, &info), nil | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all the plugins in which I add k8s.MaybeUpdatePhaseVersionFromPluginContext(
- kubeflow, dask, ray, spark - I add tests that the phase version is in fact bumped if the reason changes in this separate PR in order to not overload this one further:
#5200
d4b547d
to
dbc36a0
Compare
@@ -412,9 +406,13 @@ func getEventInfoForSpark(pluginContext k8s.PluginContext, sj *sparkOp.SparkAppl | |||
return nil, err | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is new:
for _, log := range o.TaskLogs { | ||
log.ShowWhilePending = true | ||
} | ||
|
||
taskLogs = append(taskLogs, o.TaskLogs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If one takes a look at how the log links are currently configured for the spark plugin, one sees that "All User Logs" are sent already when queued as the log plugin's init is not guarded by if !isQueued {
.
This has the effect that the "All User" logs are shown already in the init phase (current master):
(Fails because I don't have a proper spark image at hand.)The log link is not shown in the queued phase currently even though the log link is added to info
in the queued phase because the plugin doesn't currently bump the phase version when the reason changes meaning that only a single task submitted to k8s
event is sent to admin in the queued phase. When bumping the phase version, the log link is shown already in the queued phase:
Given if !isQueued {
I feel this might have been the original intention. In practice this makes little difference as in the init phase, the driver pod is still pending, meaning that the transition from queued to initializing phase is almost instantaneous even if the actual scheduling of the pods takes longer.
With the removal of if !isQueued {
and instead setting "All User" logs to ShowWhilePending
, this is how the log links behave:
dbc36a0
to
50e76c2
Compare
…e while filtering log links Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
* Add tests to ensure the phase version is bumped in kubeflow plugin if reason changes within the same phase Signed-off-by: Fabio Graetz <[email protected]> * Test that ray and dask plugins bump phase version in GetTaskPhase Signed-off-by: Fabio Graetz <[email protected]> * Test phase version increase when reason changes for spark plugin Signed-off-by: Fabio Graetz <[email protected]> * Fix ray tests after rebase Signed-off-by: Fabio Graetz <[email protected]> * Make lint pass Signed-off-by: Fabio Graetz <[email protected]> --------- Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Fabio Graetz <[email protected]>
07c00ba
to
683a6a5
Compare
@eapolinario, I merged #5200 and rebased onto master 🙏 |
Signed-off-by: Fabio Graetz <[email protected]>
Signed-off-by: Eduardo Apolinario <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing. Thank you for all the work on this!
…teorg#4726) * Add ShowWhilePending arg to TaskLog flyteidl message Signed-off-by: Fabio Graetz <[email protected]> * Allow showing specific logs already during queued phase Signed-off-by: Fabio Graetz <[email protected]> * Use core.PhaseInfoQueuedWithTaskInfo instead of core.PhaseInfoQueued in plugins so log links are available Signed-off-by: Fabio Graetz <[email protected]> * Bump phase version in pytorch plugin Signed-off-by: Fabio Graetz <[email protected]> * Fix nil containerId in pending phase Signed-off-by: Fabio Graetz <[email protected]> * Undo changes from rebase in ray.go Signed-off-by: Fabio Graetz <[email protected]> * Regenerate protos Signed-off-by: Fabio Graetz <[email protected]> * Fix after rebasing Signed-off-by: Fabio Graetz <[email protected]> * Add HideOnceFinished option to TaskLog proto message Signed-off-by: Fabio Graetz <[email protected]> * Hide certain logs once finished Signed-off-by: Fabio Graetz <[email protected]> * Move log link filtering (by phase) from propeller to admin Signed-off-by: Fabio Graetz <[email protected]> * Move bumping of plugin state phase version into function Signed-off-by: Fabio Graetz <[email protected]> * Move helper function which bumps phase version to k8s plugin package Signed-off-by: Fabio Graetz <[email protected]> * Consistently bump phase version when reason changes in pod, pytorch, tensorflow, and mpi plugins Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with dask plugin Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with ray plugin Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with spark plugin Signed-off-by: Fabio Graetz <[email protected]> * Don't return pluginsCore.PhaseInfoUndefined but already known phaseInfo if we fail to update the phase version Signed-off-by: Fabio Graetz <[email protected]> * Remove now obsolete logic to check whether dask job is queued Signed-off-by: Fabio Graetz <[email protected]> * Adapt docstring explaining why we treat queued and init phase the same while filtering log links Signed-off-by: Fabio Graetz <[email protected]> * Make propeller tests pass Signed-off-by: Fabio Graetz <[email protected]> * Make pluginmachinery/flytek8s tests pass Signed-off-by: Fabio Graetz <[email protected]> * Fix dask, pytorch, tensorflow, and mpi tests Signed-off-by: Fabio Graetz <[email protected]> * Make log link filtering by phase work for map tasks Signed-off-by: Fabio Graetz <[email protected]> * Add tests for filtering log links when updating task execution Signed-off-by: Fabio Graetz <[email protected]> * Show All user logs while queueing phase as before Signed-off-by: Fabio Graetz <[email protected]> * Fix spark tests Signed-off-by: Fabio Graetz <[email protected]> * Fix after rebase Signed-off-by: Fabio Graetz <[email protected]> * Fix flyteidl go.mod Signed-off-by: Fabio Graetz <[email protected]> * Fix mpi test Signed-off-by: Fabio Graetz <[email protected]> * Add tests for PR flyteorg#4726 (flyteorg#5200) * Add tests to ensure the phase version is bumped in kubeflow plugin if reason changes within the same phase Signed-off-by: Fabio Graetz <[email protected]> * Test that ray and dask plugins bump phase version in GetTaskPhase Signed-off-by: Fabio Graetz <[email protected]> * Test phase version increase when reason changes for spark plugin Signed-off-by: Fabio Graetz <[email protected]> * Fix ray tests after rebase Signed-off-by: Fabio Graetz <[email protected]> * Make lint pass Signed-off-by: Fabio Graetz <[email protected]> --------- Signed-off-by: Fabio Graetz <[email protected]> * Update flyteplugins/go/tasks/logs/logging_utils.go Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Fabio Graetz <[email protected]> * Update go.mod after flyteidl make generate Signed-off-by: Fabio Graetz <[email protected]> * Restrict numpy version in single binary e2e tests Signed-off-by: Fabio Graetz <[email protected]> --------- Signed-off-by: Fabio Graetz <[email protected]> Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Fabio Grätz <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]>
…teorg#4726) * Add ShowWhilePending arg to TaskLog flyteidl message Signed-off-by: Fabio Graetz <[email protected]> * Allow showing specific logs already during queued phase Signed-off-by: Fabio Graetz <[email protected]> * Use core.PhaseInfoQueuedWithTaskInfo instead of core.PhaseInfoQueued in plugins so log links are available Signed-off-by: Fabio Graetz <[email protected]> * Bump phase version in pytorch plugin Signed-off-by: Fabio Graetz <[email protected]> * Fix nil containerId in pending phase Signed-off-by: Fabio Graetz <[email protected]> * Undo changes from rebase in ray.go Signed-off-by: Fabio Graetz <[email protected]> * Regenerate protos Signed-off-by: Fabio Graetz <[email protected]> * Fix after rebasing Signed-off-by: Fabio Graetz <[email protected]> * Add HideOnceFinished option to TaskLog proto message Signed-off-by: Fabio Graetz <[email protected]> * Hide certain logs once finished Signed-off-by: Fabio Graetz <[email protected]> * Move log link filtering (by phase) from propeller to admin Signed-off-by: Fabio Graetz <[email protected]> * Move bumping of plugin state phase version into function Signed-off-by: Fabio Graetz <[email protected]> * Move helper function which bumps phase version to k8s plugin package Signed-off-by: Fabio Graetz <[email protected]> * Consistently bump phase version when reason changes in pod, pytorch, tensorflow, and mpi plugins Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with dask plugin Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with ray plugin Signed-off-by: Fabio Graetz <[email protected]> * Make controlling lifetime of log links work with spark plugin Signed-off-by: Fabio Graetz <[email protected]> * Don't return pluginsCore.PhaseInfoUndefined but already known phaseInfo if we fail to update the phase version Signed-off-by: Fabio Graetz <[email protected]> * Remove now obsolete logic to check whether dask job is queued Signed-off-by: Fabio Graetz <[email protected]> * Adapt docstring explaining why we treat queued and init phase the same while filtering log links Signed-off-by: Fabio Graetz <[email protected]> * Make propeller tests pass Signed-off-by: Fabio Graetz <[email protected]> * Make pluginmachinery/flytek8s tests pass Signed-off-by: Fabio Graetz <[email protected]> * Fix dask, pytorch, tensorflow, and mpi tests Signed-off-by: Fabio Graetz <[email protected]> * Make log link filtering by phase work for map tasks Signed-off-by: Fabio Graetz <[email protected]> * Add tests for filtering log links when updating task execution Signed-off-by: Fabio Graetz <[email protected]> * Show All user logs while queueing phase as before Signed-off-by: Fabio Graetz <[email protected]> * Fix spark tests Signed-off-by: Fabio Graetz <[email protected]> * Fix after rebase Signed-off-by: Fabio Graetz <[email protected]> * Fix flyteidl go.mod Signed-off-by: Fabio Graetz <[email protected]> * Fix mpi test Signed-off-by: Fabio Graetz <[email protected]> * Add tests for PR flyteorg#4726 (flyteorg#5200) * Add tests to ensure the phase version is bumped in kubeflow plugin if reason changes within the same phase Signed-off-by: Fabio Graetz <[email protected]> * Test that ray and dask plugins bump phase version in GetTaskPhase Signed-off-by: Fabio Graetz <[email protected]> * Test phase version increase when reason changes for spark plugin Signed-off-by: Fabio Graetz <[email protected]> * Fix ray tests after rebase Signed-off-by: Fabio Graetz <[email protected]> * Make lint pass Signed-off-by: Fabio Graetz <[email protected]> --------- Signed-off-by: Fabio Graetz <[email protected]> * Update flyteplugins/go/tasks/logs/logging_utils.go Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Fabio Graetz <[email protected]> * Update go.mod after flyteidl make generate Signed-off-by: Fabio Graetz <[email protected]> * Restrict numpy version in single binary e2e tests Signed-off-by: Fabio Graetz <[email protected]> --------- Signed-off-by: Fabio Graetz <[email protected]> Signed-off-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Eduardo Apolinario <[email protected]> Co-authored-by: Fabio Grätz <[email protected]> Co-authored-by: Eduardo Apolinario <[email protected]> Signed-off-by: Vladyslav Libov <[email protected]>
Why are the changes needed?
Currently, all task log links appear in the Flyte UI once the respective pod(s) are running and they are never removed.
In this PR I add config options to log links to allow showing certain log links during the queueing/initializing phase or to hide them once the task terminates.
This enables for instance the following user stories:
“FailedTriggerScaleUp”
).What changes were proposed in this pull request?
I propose to add two new options to task logs
flyteidl/protos/flyteidl/core/execution.proto
:This example …
… results in the following UX:
How was this patch tested?
I ran propeller and admin with the proposed changes. I ensured that the new behaviour works for all K8s plugins (pod, kubeflow, spark, dask, ray) as well as map tasks. For each plugin, screenshots are included in comments further below.
Check all the applicable boxes
Related PRs
I added further tests in #5200 which ensure all plugins send new events to admin in the queuing phase if the “reason” changes. Currently this is implemented properly only for the pod plugin and had to be fixed for this PR.
Docs link