add namespace label/tag to non-deprecated throttle metrics
When implementing #6744 for #6631, we failed to realize that, because k8s quota
policies are namespace scoped, knowing which namespace the throttled items were
in would have diagnostic value.

Now that the metric added there has been in use for a while, that gap has become
very apparent.

This change introduces the namespace tag. Also, since this code was last touched,
the original metric was deprecated and a new one with a shorter name was added;
only the non-deprecated metric gets the new label.

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
gabemontero committed Apr 11, 2024
1 parent 695f622 commit 1f01335
Showing 3 changed files with 57 additions and 25 deletions.
32 changes: 16 additions & 16 deletions docs/metrics.md
@@ -11,24 +11,24 @@ The following pipeline metrics are available at `controller-service` on port `90

We expose several kinds of exporters, including Prometheus, Google Stackdriver, and many others. You can set them up using [observability configuration](../config/config-observability.yaml).

-| Name | Type | Labels/Tags | Status |
-|-----------------------------------------------------------------------------------------| ----------- | ----------- | ----------- |
+| Name | Type | Labels/Tags | Status |
+|-----------------------------------------------------------------------------------------| ----------- |-------------------------------------------------| ----------- |
| `tekton_pipelines_controller_pipelinerun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
-| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt;| experimental |
-| `tekton_pipelines_controller_pipelinerun_count` | Counter | `status`=&lt;status&gt; | deprecate |
-| `tekton_pipelines_controller_pipelinerun_total` | Counter | `status`=&lt;status&gt; | experimental |
-| `tekton_pipelines_controller_running_pipelineruns_count` | Gauge | | deprecate |
-| `tekton_pipelines_controller_running_pipelineruns` | Gauge | | experimental |
+| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt; | experimental |
+| `tekton_pipelines_controller_pipelinerun_count` | Counter | `status`=&lt;status&gt; | deprecate |
+| `tekton_pipelines_controller_pipelinerun_total` | Counter | `status`=&lt;status&gt; | experimental |
+| `tekton_pipelines_controller_running_pipelineruns_count` | Gauge | | deprecate |
+| `tekton_pipelines_controller_running_pipelineruns` | Gauge | | experimental |
| `tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt; | experimental |
-| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=&lt;status&gt; | deprecate |
-| `tekton_pipelines_controller_taskrun_total` | Counter | `status`=&lt;status&gt; | experimental |
-| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | deprecate |
-| `tekton_pipelines_controller_running_taskruns` | Gauge | | experimental |
-| `tekton_pipelines_controller_running_taskruns_throttled_by_quota_count` | Gauge | | deprecate |
-| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | | deprecate |
-| `tekton_pipelines_controller_running_taskruns_throttled_by_quota` | Gauge | | experimental |
-| `tekton_pipelines_controller_running_taskruns_throttled_by_node` | Gauge | | experimental |
-| `tekton_pipelines_controller_client_latency_[bucket, sum, count]` | Histogram | | experimental |
+| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=&lt;status&gt; | deprecate |
+| `tekton_pipelines_controller_taskrun_total` | Counter | `status`=&lt;status&gt; | experimental |
+| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | deprecate |
+| `tekton_pipelines_controller_running_taskruns` | Gauge | | experimental |
+| `tekton_pipelines_controller_running_taskruns_throttled_by_quota_count` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | deprecate |
+| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | deprecate |
+| `tekton_pipelines_controller_running_taskruns_throttled_by_quota` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
+| `tekton_pipelines_controller_running_taskruns_throttled_by_node` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
+| `tekton_pipelines_controller_client_latency_[bucket, sum, count]` | Histogram | | experimental |

The Labels/Tag marked as "*" are optional. And there's a choice between Histogram and LastValue(Gauge) for pipelinerun and taskrun duration metrics.

46 changes: 38 additions & 8 deletions pkg/taskrunmetrics/metrics.go
@@ -272,11 +272,13 @@ func viewRegister(cfg *config.Metrics) error {
Description: runningTRsThrottledByQuota.Description(),
Measure: runningTRsThrottledByQuota,
Aggregation: view.LastValue(),
+ TagKeys: []tag.Key{namespaceTag},
}
runningTRsThrottledByNodeView = &view.View{
Description: runningTRsThrottledByNode.Description(),
Measure: runningTRsThrottledByNode,
Aggregation: view.LastValue(),
+ TagKeys: []tag.Key{namespaceTag},
}
podLatencyView = &view.View{
Description: podLatency.Description(),
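
For background on the `TagKeys` additions above: with OpenCensus (which Tekton's metrics plumbing uses via knative.dev/pkg), a view aggregates recorded values into one row per distinct combination of its tag keys, so adding `namespaceTag` to `TagKeys` turns each LastValue gauge into a per-namespace family of gauges. A minimal standalone sketch of that mechanism, using the metric name from the docs above but otherwise illustrative wiring rather than the controller's own:

```go
package main

import (
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// namespaceKey plays the role of the controller's namespaceTag.
	namespaceKey = tag.MustNewKey("namespace")

	throttledByQuota = stats.Float64(
		"running_taskruns_throttled_by_quota",
		"Number of running TaskRuns throttled by defined quota",
		stats.UnitDimensionless)
)

func main() {
	// Listing the key under TagKeys is what splits this single LastValue gauge
	// into one time series per distinct namespace value seen at record time.
	if err := view.Register(&view.View{
		Description: throttledByQuota.Description(),
		Measure:     throttledByQuota,
		Aggregation: view.LastValue(),
		TagKeys:     []tag.Key{namespaceKey},
	}); err != nil {
		log.Fatal(err)
	}
}
```

Tags recorded on the context but not listed in the view's `TagKeys` are dropped at aggregation time, which is why the label only shows up once the view itself declares the key.
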
@@ -428,21 +430,40 @@ func (r *Recorder) RunningTaskRuns(ctx context.Context, lister listers.TaskRunLi
}

var runningTrs int
- var trsThrottledByQuota int
- var trsThrottledByNode int
+ trsThrottledByQuota := map[string]int{}
+ trsThrottledByQuotaCount := 0
+ trsThrottledByNode := map[string]int{}
+ trsThrottledByNodeCount := 0
var trsWaitResolvingTaskRef int
for _, pr := range trs {
+ // initialize metrics with namespace tag to zero if unset; will then update as needed below
+ cnt, ok := trsThrottledByQuota[pr.Namespace]
+ if !ok {
+ trsThrottledByQuota[pr.Namespace] = 0
+ }
+ cnt, ok = trsThrottledByNode[pr.Namespace]
+ if !ok {
+ trsThrottledByNode[pr.Namespace] = 0
+ }

if pr.IsDone() {
continue
}
runningTrs++

succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
switch succeedCondition.Reason {
case pod.ReasonExceededResourceQuota:
- trsThrottledByQuota++
+ trsThrottledByQuotaCount++
+ cnt, _ = trsThrottledByQuota[pr.Namespace]
+ cnt++
+ trsThrottledByQuota[pr.Namespace] = cnt
case pod.ReasonExceededNodeResources:
- trsThrottledByNode++
+ trsThrottledByNodeCount++
+ cnt, _ = trsThrottledByNode[pr.Namespace]
+ cnt++
+ trsThrottledByNode[pr.Namespace] = cnt
case v1.TaskRunReasonResolvingTaskRef:
trsWaitResolvingTaskRef++
}
@@ -455,11 +476,20 @@
}
metrics.Record(ctx, runningTRsCount.M(float64(runningTrs)))
metrics.Record(ctx, runningTRs.M(float64(runningTrs)))
- metrics.Record(ctx, runningTRsThrottledByNodeCount.M(float64(trsThrottledByNode)))
- metrics.Record(ctx, runningTRsThrottledByQuotaCount.M(float64(trsThrottledByQuota)))
metrics.Record(ctx, runningTRsWaitingOnTaskResolutionCount.M(float64(trsWaitResolvingTaskRef)))
- metrics.Record(ctx, runningTRsThrottledByNode.M(float64(trsThrottledByNode)))
- metrics.Record(ctx, runningTRsThrottledByQuota.M(float64(trsThrottledByQuota)))
+ metrics.Record(ctx, runningTRsThrottledByQuotaCount.M(float64(trsThrottledByQuotaCount)))
+ metrics.Record(ctx, runningTRsThrottledByNodeCount.M(float64(trsThrottledByNodeCount)))
+
+ for ns, cnt := range trsThrottledByQuota {
+ ctx, err = tag.New(ctx, []tag.Mutator{tag.Insert(namespaceTag, ns)}...)
+ metrics.Record(ctx, runningTRsThrottledByQuota.M(float64(cnt)))
+ }
+ for ns, cnt := range trsThrottledByNode {
+ ctx, err = tag.New(ctx, []tag.Mutator{tag.Insert(namespaceTag, ns)}...)
+ metrics.Record(ctx, runningTRsThrottledByNode.M(float64(cnt)))
+ }

return nil
}
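
In plain terms, the new loop replaces the two scalar counters with per-namespace maps and then records one gauge value per namespace, each under a context tagged with that namespace. A condensed sketch of the same pattern, with a toy `run` type standing in for `v1.TaskRun`, plain `stats.Record` standing in for knative's `metrics.Record`, and view registration (shown in the previous sketch) omitted:

```go
package main

import (
	"context"
	"fmt"

	"go.opencensus.io/stats"
	"go.opencensus.io/tag"
)

// run is a toy stand-in for a TaskRun, carrying only what the counting loop needs.
type run struct {
	namespace        string
	throttledByQuota bool
}

var (
	namespaceKey     = tag.MustNewKey("namespace")
	throttledByQuota = stats.Float64("running_taskruns_throttled_by_quota",
		"Number of running TaskRuns throttled by defined quota", stats.UnitDimensionless)
)

func recordThrottled(ctx context.Context, runs []run) error {
	// Bucket throttle counts per namespace, seeding every namespace with zero so
	// that a namespace with runs but no throttling still reports an explicit 0.
	perNamespace := map[string]int{}
	for _, r := range runs {
		if _, ok := perNamespace[r.namespace]; !ok {
			perNamespace[r.namespace] = 0
		}
		if r.throttledByQuota {
			perNamespace[r.namespace]++
		}
	}
	// Record one value per namespace; the namespace travels as a tag on the context.
	for ns, cnt := range perNamespace {
		tagged, err := tag.New(ctx, tag.Insert(namespaceKey, ns))
		if err != nil {
			return err
		}
		stats.Record(tagged, throttledByQuota.M(float64(cnt)))
	}
	return nil
}

func main() {
	if err := recordThrottled(context.Background(), []run{
		{namespace: "team-a", throttledByQuota: true},
		{namespace: "team-b"},
	}); err != nil {
		fmt.Println("record failed:", err)
	}
}
```

Seeding every namespace with zero before counting mirrors the controller change: a namespace whose TaskRuns are no longer throttled gets its gauge driven back to 0 on the next reconcile instead of holding a stale non-zero value.
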
4 changes: 3 additions & 1 deletion pkg/taskrunmetrics/metrics_test.go
@@ -537,7 +537,7 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
informer := faketaskruninformer.Get(ctx)
for i := 0; i < multiplier; i++ {
tr := &v1.TaskRun{
- ObjectMeta: metav1.ObjectMeta{Name: names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("taskrun-")},
+ ObjectMeta: metav1.ObjectMeta{Name: names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("taskrun-"), Namespace: "test"},
Status: v1.TaskRunStatus{
Status: duckv1.Status{
Conditions: duckv1.Conditions{{
@@ -563,7 +563,9 @@
t.Errorf("RunningTaskRuns: %v", err)
}
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_quota_count", map[string]string{}, tc.quotaCount)
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_quota", map[string]string{namespaceTag.Name(): "test"}, tc.quotaCount)
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_node_count", map[string]string{}, tc.nodeCount)
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_node", map[string]string{namespaceTag.Name(): "test"}, tc.nodeCount)
metricstest.CheckLastValueData(t, "running_taskruns_waiting_on_task_resolution_count", map[string]string{}, tc.waitCount)
}
}
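
The two new assertions pass the expected tag map straight to knative's `metricstest` helper. A self-contained sketch of that assertion style, under the assumption that `knative.dev/pkg/metrics/metricstest` reads rows from the default OpenCensus view worker (the metric and key names here are illustrative, not Tekton's):

```go
package taskrunmetrics_test

import (
	"context"
	"testing"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
	"knative.dev/pkg/metrics/metricstest"
)

func TestTaggedLastValueAssertion(t *testing.T) {
	nsKey := tag.MustNewKey("namespace")
	m := stats.Float64("example_throttled_by_quota", "example gauge", stats.UnitDimensionless)

	// The view name defaults to the measure name; listing nsKey keeps one row per namespace.
	v := &view.View{
		Measure:     m,
		Aggregation: view.LastValue(),
		TagKeys:     []tag.Key{nsKey},
	}
	if err := view.Register(v); err != nil {
		t.Fatal(err)
	}
	defer view.Unregister(v)

	// Record a value tagged with namespace=test ...
	if err := stats.RecordWithTags(context.Background(),
		[]tag.Mutator{tag.Insert(nsKey, "test")}, m.M(3)); err != nil {
		t.Fatal(err)
	}

	// ... and assert the last value for exactly that tag set, as the diff above does
	// for running_taskruns_throttled_by_quota / _by_node.
	metricstest.CheckLastValueData(t, "example_throttled_by_quota",
		map[string]string{nsKey.Name(): "test"}, 3)
}
```
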
