flyte_binary not exposing all of propeller metrics #3758

Open
rxnandakumar opened this issue Jun 8, 2023 · 6 comments

@rxnandakumar

We have the Datadog agent scraping Prometheus metrics on k8s for annotated pods. After exposing the metrics port and adding the annotations, we can see the metrics in Datadog.

When trying to set up a dashboard based on the grafana propeller dashboard we found that the following metrics referenced in the grafana JSON are not exposed by flyte:

  • flyte:propeller:all:free_workers_count
  • flyte:propeller:all:round:abort_error[5m]
  • flyte:propeller:all:round:system_error_unlabeled[5m]
  • flyte:propeller:all:node:plugin:.*_failure_unlabeled
  • flyte:propeller:all:node:plugin:.*_success_unlabeled
  • flyte:propeller:all:round:raw_unlabeled_ms[5m]
  • flyte:propeller:all:round:raw_ms[5m]
  • flyte:propeller:all:round:panic_unlabeled[5m]
  • flyte:propeller:all:collector:flyteworkflow
  • flyte:propeller:all:metastore:cache_hit
  • flyte:propeller:all:metastore:cache_miss
  • flyte:propeller:all:metastore:head_failure_unlabeled

We can only see the following in Datadog when we search for 'propeller':

  • flyte_admin_admin_builder_flytepropeller_build_failures.count
  • flyte_admin_admin_builder_flytepropeller_build_successes.count
  • flyte_admin_admin_execution_manager_propeller_failures.count

These seem to be flyte admin metrics, not propeller metrics.

Expected result: All flyte propeller metrics should be exposed via the metrics port.
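One way to confirm which names are missing, independent of Datadog, is to read the raw text from the metrics port and compare it against the names in the dashboard JSON. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The helper names and the `SAMPLE` payload (including its values and HELP text) are made up for illustration and are not part of Flyte. Only a few of the names listed above are included.

```python
import re

# A few of the names from this issue; entries containing ".*" are regexes.
EXPECTED = [
    r"flyte:propeller:all:free_workers_count",
    r"flyte:propeller:all:round:abort_error",
    r"flyte:propeller:all:node:plugin:.*_failure_unlabeled",
    r"flyte:propeller:all:metastore:cache_hit",
]

# Made-up payload for illustration; a real one would be the body of the
# response from the pod's metrics endpoint.
SAMPLE = """\
# HELP flyte:propeller:all:free_workers_count illustrative help text
# TYPE flyte:propeller:all:free_workers_count gauge
flyte:propeller:all:free_workers_count 4
flyte:propeller:all:round:abort_error_unlabeled 0
flyte:propeller:all:metastore:cache_hit{domain="development"} 12
"""


def exposed_metric_names(metrics_text):
    """Collect the metric names found in Prometheus text exposition output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name runs up to the first "{" or whitespace.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(metrics_text, expected=EXPECTED):
    """Return the expected patterns that match none of the exposed names."""
    names = exposed_metric_names(metrics_text)
    return [p for p in expected if not any(re.fullmatch(p, n) for n in names)]


print(missing_metrics(SAMPLE))
```

With the sample payload, the `abort_error` and `node:plugin` patterns are reported as missing, because only an `_unlabeled` variant of the former is present and nothing under the plugin prefix is exposed.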

@welcome

welcome bot commented Jun 8, 2023

Thank you for opening your first issue here! 🛠

@davidmirror-ops davidmirror-ops self-assigned this Oct 4, 2023
@Sennuno
Contributor

Sennuno commented Dec 13, 2023

Hey there, is there any update on this? Recently updated to latest Flyte-Binary Chart and Image and still no metrics show up.
@davidmirror-ops

@davidmirror-ops
Contributor

@Sennuno no updates yet. I'll be working on this and will let you know once there's progress

@wild-endeavor
Contributor

Is the issue still valid? I just took a look at the metrics from just the sandbox and I am seeing some stuff
image

@cjidboon94
Contributor

@wild-endeavor Some are still missing for me in flyte-binary:

  • flyte:propeller:all:round:abort_error (there does exist flyte:propeller:all:round:abort_error_unlabeled)
  • flyte:propeller:all:node:plugin:.*_failure_unlabeled (or anything with prefix flyte:propeller:all:node:plugin)
  • flyte:propeller:all:node:user_error_duration_ms_count
  • flyte:propeller:all:node:system_error_duration_ms_count

In addition, the user dashboard lists no workflows, because the metric queried for the labels, "label_values(flyte:propeller:all:collector:flyteworkflow, wf)", does not always have a wf key (only domain, endpoint, instance, job=, namespace, pod, project, service). This breaks the dashboard during downtime, since there are then no workflows to select.
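The missing wf key can be checked directly against the metrics output. This is a minimal sketch, assuming the Prometheus text exposition format. The helper names are hypothetical, and the sample lines and label values are made up for illustration.

```python
import re

# Matches key="value" pairs inside the braces of a sample line.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Made-up samples for illustration only.
SAMPLE = """\
flyte:propeller:all:collector:flyteworkflow{domain="development",project="flytesnacks",wf="my_wf"} 1
flyte:propeller:all:collector:flyteworkflow{domain="development",project="flytesnacks"} 0
"""


def label_keys_per_sample(metrics_text, metric_name):
    """Return one set of label keys for every sample of metric_name."""
    result = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if line.startswith("#") or not line.startswith(metric_name):
            continue
        rest = line[len(metric_name):]
        if rest.startswith("{"):
            body = rest[1:rest.rfind("}")]
            result.append({key for key, _ in LABEL_RE.findall(body)})
        elif rest[:1].isspace():
            result.append(set())  # sample without any labels
        # Anything else is a longer metric name sharing the prefix; skip it.
    return result


def samples_missing_label(metrics_text, metric_name, label):
    """Return the label-key sets of samples that lack the given label."""
    return [keys for keys in label_keys_per_sample(metrics_text, metric_name)
            if label not in keys]


missing = samples_missing_label(
    SAMPLE, "flyte:propeller:all:collector:flyteworkflow", "wf")
print(len(missing), "sample(s) without a wf label")
```

If every sample is returned, the `label_values(..., wf)` query has nothing to populate the workflow selector with.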

@davidmirror-ops
Contributor

A number of updates were applied to the dashboards here, and the new versions were published in the Marketplace.

They should address the problems described in this issue.

@rxnandakumar @cjidboon94 please let us know if it works for you.
