Refactor and fix metrics export tests. #1957

evankanderson · 2020-12-11T03:59:12Z

Changes

🐛 Fix long-standing in e2e tests for OpenCensus. Ran 200 times without error; previously would see at least 1 error in 60-80 runs.
- It turns out the "that could work" part of "do the simplest thing that could work" is important... there was a race collecting results off the channel, and we didn't read all channels to completion. /sigh
🧹 Move e2e tests to a separate file from more unit-testy resource_view_test files.
🐛 Moving tests between files exposed a bug in TestFlushExporter. Fixed that, too.
🎁 Made the "none" exporter a real exporter to allow disabling all metric exports.

/kind bug
/kind cleanup

It took embarassingly long to find this... I had to walk away twice for at least a week after bashing my head on the export code to figure out what was really going on. (The bug was inside the test all along!)

knative-prow-robot · 2020-12-11T03:59:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~metrics/OWNERS~~ [evankanderson]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

evankanderson · 2020-12-11T04:00:28Z

If splitting resource_view_test is too annoying in this PR, I can merge things back into a single file. Looking at 3e094f8 should show a reasonable diff.

codecov · 2020-12-11T04:04:29Z

Codecov Report

Merging #1957 (5725836) into master (ba2137c) will increase coverage by 0.12%.
The diff coverage is 60.00%.

@@            Coverage Diff             @@
##           master    #1957      +/-   ##
==========================================
+ Coverage   68.96%   69.08%   +0.12%     
==========================================
  Files         209      209              
  Lines        8790     8786       -4     
==========================================
+ Hits         6062     6070       +8     
+ Misses       2453     2447       -6     
+ Partials      275      269       -6

Impacted Files	Coverage Δ
metrics/exporter.go	`78.26% <50.00%> (ø)`
metrics/config.go	`83.20% <100.00%> (ø)`
test/gcs/mock/mock.go	`91.39% <0.00%> (ø)`
websocket/connection.go	`92.91% <0.00%> (ø)`
controller/controller.go	`89.02% <0.00%> (+0.14%)`	⬆️
tracing/zipkin.go	`71.42% <0.00%> (+3.00%)`	⬆️
test/logstream/v2/stream.go	`95.87% <0.00%> (+8.91%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba2137c...5725836. Read the comment docs.

yanweiguo

Is the failing downstream test relative?

yanweiguo · 2020-12-11T05:24:08Z

metrics/e2e_test.go

+			// We unregister the views because this is one of two ways to flush
+			// the internal aggregation buffers; the other is to have the
+			// internal reporting period duration tick, which is at least
+			// [new duration] in the future.


Are the comments here still correct?

Whoops, no. We have a function for that now.

yanweiguo · 2020-12-11T05:24:36Z

metrics/e2e_test.go

+								Value:  m.Timeseries[0].Points[0].GetInt64Value(),
+							}
+							records = append(records, metric)
+							keys[metric.Key()] = struct{}{}


Could you add a comment that why using a set here fixes the problem?

Done. I was convinced for a long time that the RPCs weren't actually going to the right place, but I finally figured out that we simply weren't reading enough off the channel to find them.

vagababov

LG in general
Left some stylistic comments.

vagababov · 2020-12-11T18:18:43Z

metrics/e2e_test.go

+	return fmt.Sprintf("%s:%d", m.Key(), m.Value)
+}
+
+func initSdFake(sdFake *stackDriverFake) error {


I guess?

Suggested change

func initSdFake(sdFake *stackDriverFake) error {

func initSDFake(sdFake *stackDriverFake) error {

Done. I have no idea, the product name is "Stackdriver", so I just spelled it out the 3 places it is used.

vagababov · 2020-12-11T18:20:14Z

metrics/e2e_test.go

+	resources := []*resource.Resource{
+		{
+			Type: "revision",
+			Labels: map[string]string{
+				"project":  "p1",
+				"revision": "r1",
+			},
+		},
+		{
+			Type: "revision",
+			Labels: map[string]string{
+				"project":  "p1",
+				"revision": "r2",
+			},
+		},
+	}


Suggested change

resources := []*resource.Resource{

{

Type: "revision",

Labels: map[string]string{

"project": "p1",

"revision": "r1",

},

},

{

Type: "revision",

Labels: map[string]string{

"project": "p1",

"revision": "r2",

},

},

}

resources := []*resource.Resource{{

Type: "revision",

Labels: map[string]string{

"project": "p1",

"revision": "r1",

},

},{

Type: "revision",

Labels: map[string]string{

"project": "p1",

"revision": "r2",

},

}}

vagababov · 2020-12-11T18:20:39Z

metrics/e2e_test.go

+			if err != nil {
+				t.Fatalf("failed to read prometheus response: %+v", err)
+			}
+			want := `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.


Suggested change

want := `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.

const want = `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.

vagababov · 2020-12-11T18:21:26Z

metrics/e2e_test.go

+		expected: []metricExtract{
+			{
+				"knative.dev/serving/autoscaler/actual_pods",
+				label1,
+				1,
+			},
+			{
+				"knative.dev/serving/autoscaler/desired_pods",
+				label2,
+				2,
+			},
+			{
+				"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",
+				batchLabels,
+				3,
+			},
+		},
+	}, {


Suggested change

expected: []metricExtract{

{

"knative.dev/serving/autoscaler/actual_pods",

label1,

1,

},

{

"knative.dev/serving/autoscaler/desired_pods",

label2,

2,

},

{

"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",

batchLabels,

3,

},

},

}, {

expected: []metricExtract{{

"knative.dev/serving/autoscaler/actual_pods",

label1,

1,

},{

"knative.dev/serving/autoscaler/desired_pods",

label2,

2,

},{

"custom.googleapis.com/knative.dev/autoscaler/not_ready_pods",

batchLabels,

3,

}},

}, {

vagababov · 2020-12-11T18:21:35Z

metrics/e2e_test.go

+	}, {
+		name:               "Don't allow custom metrics",
+		allowCustomMetrics: "false",
+		expected: []metricExtract{


yanweiguo · 2020-12-15T00:07:52Z

LG but the failed unit test indicates there may be another problem.

evankanderson · 2020-12-15T18:06:27Z

LG but the failed unit test indicates there may be another problem.

I've seen prometheus fail to be able to listen on the port; I can choose a random port to see if that helps. Probably the other fix would be to add a SO_REUSEADDR to the prometheus server setup; let me see if that helps.

evankanderson · 2020-12-15T18:50:54Z

I'm not sure that SO_REUSEADDR helps; I added that code but was still able to get a timeout in the prometheus poll-for-ready loop.

evankanderson · 2020-12-15T19:18:23Z

Regardless, the Prometheus code has not been changed, so I don't think that should be a barrier on this PR. (Though I'd love to figure out why the server sometimes won't respond for > 10s).

vagababov · 2020-12-15T20:24:06Z

metrics/exporter.go

@@ -161,13 +162,13 @@ func UpdateExporter(ctx context.Context, ops ExporterOptions, logger *zap.Sugare
 		flushGivenExporter(curMetricsExporter)
 		e, f, err := newMetricsExporter(newConfig, logger)
 		if err != nil {
-			logger.Errorw("Failed to update a new metrics exporter based on metric config", newConfig, zap.Error(err))
+			logger.Errorw("Failed to update a new metrics exporter based on metric config", "config", newConfig, "error", err)


zap.String("config", newConfig), zap.Error(err)? Same below?

Ugh... I want to replace this with a non-sugared logger. It looks like With takes both Field objects and key-value pairs. Unfortunately, this ended up with the key-value format where the value was a zap.Field.

WDYT?

Well you do use sugared.
You don't need with, just logger.Error("...", zap.Error(err)) on the desugared one

I switched this to put all the Field objects first so it's not possible to log things as "string" => Field, which was what the old code was doing.

Using logger.Error() means taking the error out of a separate JSON field, which makes it harder (for example) to filter all the logs for just the ones with errors in them.

Yeah, I know. That's why when I joined I was surprised we;re using this crap (zap that is) :)
But that's the zap philosophy — use jq to parse :)

vagababov

/lgtm

evankanderson added 3 commits December 10, 2020 19:50

Copy resource_view_test to e2e_test

14a8dae

Restore metrics/resource_view_test.go

2f6ecdb

Refactor metrics tests, fix e2e_test

3e094f8

knative-prow-robot assigned vagababov Dec 11, 2020

knative-prow-robot added kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Dec 11, 2020

google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Dec 11, 2020

knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 11, 2020

knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2020

knative-prow-robot requested review from mdemirhan and yanweiguo December 11, 2020 03:59

yanweiguo reviewed Dec 11, 2020

View reviewed changes

Address yanweiguo feedback

d12a943

vagababov reviewed Dec 11, 2020

View reviewed changes

Cleanup per vagababov

781822d

knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020

Merge remote-tracking branch 'upstream/master' into opencensus-fix

df61543

knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 15, 2020

vagababov reviewed Dec 15, 2020

View reviewed changes

Switch to zap.Error but avoid the bug the code previously had.

5725836

vagababov reviewed Dec 18, 2020

View reviewed changes

knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 18, 2020

knative-prow-robot merged commit e41409a into knative:master Dec 18, 2020

evankanderson mentioned this pull request Jan 12, 2021

Move e2e tests for metrics export to integration status to reduce crosstalk flakiness #1782

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor and fix metrics export tests. #1957

Refactor and fix metrics export tests. #1957

evankanderson commented Dec 11, 2020

knative-prow-robot commented Dec 11, 2020

evankanderson commented Dec 11, 2020

codecov bot commented Dec 11, 2020 •

edited

Loading

yanweiguo left a comment

yanweiguo Dec 11, 2020

evankanderson Dec 11, 2020

yanweiguo Dec 11, 2020

evankanderson Dec 11, 2020

vagababov left a comment

vagababov Dec 11, 2020

evankanderson Dec 14, 2020

vagababov Dec 11, 2020

evankanderson Dec 14, 2020

vagababov Dec 11, 2020

evankanderson Dec 14, 2020

vagababov Dec 11, 2020

evankanderson Dec 14, 2020

vagababov Dec 11, 2020

evankanderson Dec 14, 2020

yanweiguo commented Dec 15, 2020

evankanderson commented Dec 15, 2020

evankanderson commented Dec 15, 2020

evankanderson commented Dec 15, 2020

vagababov Dec 15, 2020

evankanderson Dec 15, 2020

vagababov Dec 16, 2020

evankanderson Dec 18, 2020

vagababov Dec 18, 2020

vagababov left a comment

	func initSdFake(sdFake *stackDriverFake) error {
	func initSDFake(sdFake *stackDriverFake) error {

	want := `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.
	const want = `# HELP testComponent_global_export_counts Count of exports via standard OpenCensus view.

Refactor and fix metrics export tests. #1957

Refactor and fix metrics export tests. #1957

Conversation

evankanderson commented Dec 11, 2020

Changes

knative-prow-robot commented Dec 11, 2020

evankanderson commented Dec 11, 2020

codecov bot commented Dec 11, 2020 • edited Loading

Codecov Report

yanweiguo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vagababov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yanweiguo commented Dec 15, 2020

evankanderson commented Dec 15, 2020

evankanderson commented Dec 15, 2020

evankanderson commented Dec 15, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vagababov left a comment

Choose a reason for hiding this comment

codecov bot commented Dec 11, 2020 •

edited

Loading