Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and fix metrics export tests. #1957

Merged
merged 7 commits into from
Dec 18, 2020

Conversation

evankanderson
Copy link
Member

Changes

  • 🐛 Fix long-standing in e2e tests for OpenCensus. Ran 200 times without error; previously would see at least 1 error in 60-80 runs.
    • It turns out the "that could work" part of "do the simplest thing that could work" is important... there was a race collecting results off the channel, and we didn't read all channels to completion. /sigh
  • 🧹 Move e2e tests to a separate file from more unit-testy resource_view_test files.
  • 🐛 Moving tests between files exposed a bug in TestFlushExporter. Fixed that, too.
  • 🎁 Made the "none" exporter a real exporter to allow disabling all metric exports.

/kind bug
/kind cleanup

Fixes #1672

/assign @vagababov

It took embarassingly long to find this... I had to walk away twice for at least a week after bashing my head on the export code to figure out what was really going on. (The bug was inside the test all along!)

@knative-prow-robot knative-prow-robot added kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Dec 11, 2020
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Dec 11, 2020
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 11, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2020
@evankanderson
Copy link
Member Author

If splitting resource_view_test is too annoying in this PR, I can merge things back into a single file. Looking at 3e094f8 should show a reasonable diff.

@codecov
Copy link

codecov bot commented Dec 11, 2020

Codecov Report

Merging #1957 (5725836) into master (ba2137c) will increase coverage by 0.12%.
The diff coverage is 60.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1957      +/-   ##
==========================================
+ Coverage   68.96%   69.08%   +0.12%     
==========================================
  Files         209      209              
  Lines        8790     8786       -4     
==========================================
+ Hits         6062     6070       +8     
+ Misses       2453     2447       -6     
+ Partials      275      269       -6     
Impacted Files Coverage Δ
metrics/exporter.go 78.26% <50.00%> (ø)
metrics/config.go 83.20% <100.00%> (ø)
test/gcs/mock/mock.go 91.39% <0.00%> (ø)
websocket/connection.go 92.91% <0.00%> (ø)
controller/controller.go 89.02% <0.00%> (+0.14%) ⬆️
tracing/zipkin.go 71.42% <0.00%> (+3.00%) ⬆️
test/logstream/v2/stream.go 95.87% <0.00%> (+8.91%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba2137c...5725836. Read the comment docs.

Copy link
Contributor

@yanweiguo yanweiguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the failing downstream test relative?

// We unregister the views because this is one of two ways to flush
// the internal aggregation buffers; the other is to have the
// internal reporting period duration tick, which is at least
// [new duration] in the future.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the comments here still correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, no. We have a function for that now.

Value: m.Timeseries[0].Points[0].GetInt64Value(),
}
records = append(records, metric)
keys[metric.Key()] = struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment that why using a set here fixes the problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I was convinced for a long time that the RPCs weren't actually going to the right place, but I finally figured out that we simply weren't reading enough off the channel to find them.

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG in general
Left some stylistic comments.

return fmt.Sprintf("%s:%d", m.Key(), m.Value)
}

func initSdFake(sdFake *stackDriverFake) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess?

Suggested change
func initSdFake(sdFake *stackDriverFake) error {
func initSDFake(sdFake *stackDriverFake) error {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have no idea, the product name is "Stackdriver", so I just spelled it out the 3 places it is used.

Comment on lines 144 to 159
resources := []*resource.Resource{
{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r1",
},
},
{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r2",
},
},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
resources := []*resource.Resource{
{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r1",
},
},
{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r2",
},
},
}
resources := []*resource.Resource{{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r1",
},
},{
Type: "revision",
Labels: map[string]string{
"project": "p1",
"revision": "r2",
},
}}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if err != nil {
t.Fatalf("failed to read prometheus response: %+v", err)
}
want := `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
want := `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.
const want = `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 389 to 406
expected: []metricExtract{
{
"knative.dev/serving/autoscaler/actual_pods",
label1,
1,
},
{
"knative.dev/serving/autoscaler/desired_pods",
label2,
2,
},
{
"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",
batchLabels,
3,
},
},
}, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expected: []metricExtract{
{
"knative.dev/serving/autoscaler/actual_pods",
label1,
1,
},
{
"knative.dev/serving/autoscaler/desired_pods",
label2,
2,
},
{
"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",
batchLabels,
3,
},
},
}, {
expected: []metricExtract{{
"knative.dev/serving/autoscaler/actual_pods",
label1,
1,
},{
"knative.dev/serving/autoscaler/desired_pods",
label2,
2,
},{
"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",
batchLabels,
3,
}},
}, {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}, {
name: "Don't allow custom metrics",
allowCustomMetrics: "false",
expected: []metricExtract{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@yanweiguo
Copy link
Contributor

LG but the failed unit test indicates there may be another problem.

@evankanderson
Copy link
Member Author

LG but the failed unit test indicates there may be another problem.

I've seen prometheus fail to be able to listen on the port; I can choose a random port to see if that helps. Probably the other fix would be to add a SO_REUSEADDR to the prometheus server setup; let me see if that helps.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020
@evankanderson
Copy link
Member Author

I'm not sure that SO_REUSEADDR helps; I added that code but was still able to get a timeout in the prometheus poll-for-ready loop.

@evankanderson
Copy link
Member Author

Regardless, the Prometheus code has not been changed, so I don't think that should be a barrier on this PR. (Though I'd love to figure out why the server sometimes won't respond for > 10s).

@@ -161,13 +162,13 @@ func UpdateExporter(ctx context.Context, ops ExporterOptions, logger *zap.Sugare
flushGivenExporter(curMetricsExporter)
e, f, err := newMetricsExporter(newConfig, logger)
if err != nil {
logger.Errorw("Failed to update a new metrics exporter based on metric config", newConfig, zap.Error(err))
logger.Errorw("Failed to update a new metrics exporter based on metric config", "config", newConfig, "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zap.String("config", newConfig), zap.Error(err)? Same below?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh... I want to replace this with a non-sugared logger. It looks like With takes both Field objects and key-value pairs. Unfortunately, this ended up with the key-value format where the value was a zap.Field.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you do use sugared.
You don't need with, just logger.Error("...", zap.Error(err)) on the desugared one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched this to put all the Field objects first so it's not possible to log things as "string" => Field, which was what the old code was doing.

Using logger.Error() means taking the error out of a separate JSON field, which makes it harder (for example) to filter all the logs for just the ones with errors in them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know. That's why when I joined I was surprised we;re using this crap (zap that is) :)
But that's the zap philosophy — use jq to parse :)

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2020
@knative-prow-robot knative-prow-robot merged commit e41409a into knative:master Dec 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[flaky] TestMetricsExport is super flaky
4 participants