Daisy: Consolidate reading of serial port output. #1243

EricEdens · 2020-06-09T17:33:08Z

Daisy reads serial port output (serial output) in two places: once in createInstances for log archival, and once in waitForInstancesSignal for control flow. These readers are independent: they both make GCP API calls, manage their own errors, and perform their own retries.
Uncoordinated access causes two errors:

Failure to archive logs: Since each reader has its own event loop, createInstances races to archive logs before waitForInstancesSignal kills the instance (#1160).
Subtle bugs: Over time, each set of reader code has been improved to handle transient failures and instance reboots. Since the code has evolved separately, we find subtle, difficult-to-debug errors.

This change introduces a pubsub mechanism to allow a single serial port reader to broadcast updates to multiple consumers.
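To illustrate the idea of one reader feeding several consumers, here is a minimal sketch. It is not the code from this PR: the `broadcaster` type, its `subscribe` and `run` methods, and the `collect` helper are hypothetical names, and a string slice stands in for the serial output that the real reader fetches through GCP API calls.

```go
package main

import (
	"fmt"
	"sync"
)

// broadcaster is a hypothetical stand-in for a single serial port reader
// that fans each chunk of output out to every registered subscriber.
type broadcaster struct {
	mu          sync.Mutex
	subscribers []chan string
}

// subscribe registers a buffered channel that will receive every update.
func (b *broadcaster) subscribe(buffer int) <-chan string {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan string, buffer)
	b.subscribers = append(b.subscribers, ch)
	return ch
}

// run forwards each chunk from the single source to all subscribers,
// then closes their channels to signal the end of output.
func (b *broadcaster) run(source []string) {
	b.mu.Lock()
	subs := append([]chan string(nil), b.subscribers...)
	b.mu.Unlock()
	for _, chunk := range source {
		for _, ch := range subs {
			ch <- chunk
		}
	}
	for _, ch := range subs {
		close(ch)
	}
}

// collect drains a subscription into a slice.
func collect(ch <-chan string) []string {
	var out []string
	for s := range ch {
		out = append(out, s)
	}
	return out
}

func main() {
	b := &broadcaster{}
	archive := b.subscribe(8) // consumer that archives logs
	signal := b.subscribe(8)  // consumer that waits for a signal
	b.run([]string{"boot", "BuildSuccess"})
	fmt.Println(collect(archive), collect(signal))
}
```

Because both consumers read from the same producer, they see the same chunks in the same order, and retries and error handling only need to live in one place.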

Fixes #1160

google-oss-robot · 2020-06-09T17:33:10Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

google-oss-robot · 2020-06-09T17:33:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EricEdens

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [EricEdens]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zoran15 · 2020-06-10T04:01:01Z

daisy/serial_output_watcher.go

+		for _, cb := range callbacks {
+			subscribers = append(subscribers, cb.channel)
+			if cb.pollingInterval < pollingFrequency {
+				cb.pollingInterval = pollingFrequency


Related to the above comment: what if a watcher is registered after start? Should the polling interval be updated in Watch() as well?

My intention was to disallow late subscribers, specifically by not sending them updates if they call subscribe after watch, as that would be a clearer failure compared to sending them partial logs.
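One way to make a late subscription fail visibly can be sketched as follows. This is an assumption-laden variant, not the PR's behavior: the reply above describes late subscribers simply not receiving updates, whereas this sketch returns an error instead. The `registry` type, its `watch` and `start` methods, and `errAlreadyStarted` are all hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyStarted is a hypothetical error for subscriptions that arrive too late.
var errAlreadyStarted = errors.New("watcher already started; late subscription refused")

// registry is a hypothetical subscriber list that is frozen once polling starts.
type registry struct {
	mu       sync.Mutex
	started  bool
	channels []chan string
}

// watch registers a subscriber, or refuses if polling has already begun.
func (r *registry) watch() (<-chan string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.started {
		return nil, errAlreadyStarted
	}
	ch := make(chan string, 1)
	r.channels = append(r.channels, ch)
	return ch, nil
}

// start freezes the subscriber list; real code would begin polling here.
func (r *registry) start() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.started = true
}

func main() {
	r := &registry{}
	_, err := r.watch()
	fmt.Println("early:", err)
	r.start()
	_, err = r.watch()
	fmt.Println("late:", err)
}
```

Returning an error surfaces the mistake at the call site, which is one possible reading of "a clearer failure" than delivering partial logs.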

EricEdens · 2020-06-10T19:04:40Z

/hold

zoran15 · 2020-06-11T02:18:03Z

daisy/step_create_instances.go

+	(*watcher).Watch(name, serialPortToArchive, c, serialPortPollInterval)
+	(*watcher).start(name)


StepCreateInstance is starting the watcher, while StepWaitForInstance is registering to watch the same instance. Does that mean this code relies on CreateInstance.logSerialOutput always being called after WaitForInstance.validateForWaitForInstancesSignal()? Could this call order change in the future and lead to a panic()?

Can we have all watcher.start() run in a single place, after we're sure steps have registered themselves as watchers?

I'm glad you brought this up, as the current approach feels fragile to me as well. Any ideas on where that central place might be?

Since serial port output can start as soon as an instance is created, it makes sense to have start() in step_create_instance.go. I'm not sure we can have start() somewhere else and ensure no serial output is lost.

Because of this, in the future, we might have subscribers call Watch() after start(). If we do allow this (currently it panics), we can handle it in a couple of ways:

buffer output on the watcher side (up to a limit, so that some late subscribers are allowed without storing the whole output for the duration of the execution), or

allow subscribers to find out whether output has already started for a given port/instance name before subscribing (and whether any of it will be missed), and let them decide whether they can live with that (by panicking on their own). If there was no output before a late subscriber calls Watch(), then nothing is lost and things should work as expected.

There are probably other possible solutions.
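The two options above can be combined in a small sketch: a bounded backlog that replays recent output to a late subscriber and also reports whether anything was lost. Nothing here comes from the PR; the `backlog` type and its `add` and `replay` methods are hypothetical, and the limit is counted in chunks purely for illustration.

```go
package main

import "fmt"

// backlog is a hypothetical bounded buffer of the most recent output chunks.
type backlog struct {
	limit   int      // maximum number of chunks to retain
	chunks  []string // retained chunks, oldest first
	dropped int      // number of chunks discarded because of the limit
}

// add records a chunk and discards the oldest one when over the limit.
func (b *backlog) add(chunk string) {
	b.chunks = append(b.chunks, chunk)
	if len(b.chunks) > b.limit {
		b.chunks = b.chunks[1:]
		b.dropped++
	}
}

// replay returns a copy of the retained chunks and reports whether a late
// subscriber would see the complete output (true) or has missed some (false).
func (b *backlog) replay() (chunks []string, lossless bool) {
	return append([]string(nil), b.chunks...), b.dropped == 0
}

func main() {
	b := &backlog{limit: 2}
	b.add("a")
	b.add("b")
	fmt.Println(b.replay()) // nothing dropped yet
	b.add("c")
	fmt.Println(b.replay()) // "a" has been dropped
}
```

A late subscriber could inspect the second return value and decide for itself whether missing output is acceptable.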

I've designed this system to specifically disallow late subscribers for two reasons:

We don't need it now.

We don't have an upcoming use case for it.

Am I missing something here? Is there a use case that you're aware of?

Regarding how Watch and start are wired into the current steps, I've addressed potential fragility in three ways:

I've documented the behavior: https://github.com/GoogleCloudPlatform/compute-image-tools/pull/1243/files#diff-364879f77dafcc960afbb3cc4ded0dabR35

If someone inadvertently performs a late subscription, they won't receive updates, so it will fail loudly.

If someone changes step_wait_for_instances_signal.go and moves Watch to an incorrect location, then tests will fail (unit and integration).

Regarding your two potential solutions (buffer or indicate that polling has already started), I think those are fantastic options if we need to support late subscribers. Do you have a use case in mind, though?

hopkiw · 2020-06-11T17:52:23Z

/retest

dntczdx · 2020-06-11T19:35:37Z

daisy/step_create_instances.go

@@ -51,73 +57,46 @@ func (ci *CreateInstances) UnmarshalJSON(b []byte) error {
 	return nil
 }

-func logSerialOutput(ctx context.Context, s *Step, ii InstanceInterface, ib *InstanceBase, port int64, interval time.Duration) {
+func logSerialOutput(s *Step, name string, watcher *SerialOutputWatcher, wcProvider func() io.WriteCloser) {


It's not clear what 'name' represents on a first read.

dntczdx · 2020-06-11T19:38:20Z

daisy/workflow.go

+
+// SerialOutputWatcher returns a SerialOutputWatcher that can be used to subscribe to the
+// serial output of instances managed by Daisy.
+func (w *Workflow) SerialOutputWatcher() *SerialOutputWatcher {


Is it thread-safe?

dntczdx · 2020-06-11T20:14:08Z

daisy/step_create_instances.go

+	(*watcher).Watch(name, serialPortToArchive, c, serialPortPollInterval)
+	(*watcher).start(name)


In my opinion, "Watch" and "start" should be bound together. Whether or not "start" has already run, calling it again blindly should have no side effects.
This is also a reason to allow late subscription: the watcher may already have been started.

It's not a good idea to always rely on create_instance to start the watcher, because we definitely should be able to wait for a signal from an instance that was not created in the current Daisy workflow.
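The suggestion that "start" be safe to call repeatedly can be sketched as an idempotent start keyed by instance name. This is only an illustration of the idea, not the PR's implementation: the `pollers` type and the boolean return value of `start` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// pollers is a hypothetical record of which instances already have a poller.
type pollers struct {
	mu      sync.Mutex
	running map[string]bool
}

// start begins polling for name at most once. It reports true only for the
// call that actually starts polling; repeated calls are harmless no-ops.
func (p *pollers) start(name string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.running[name] { // reading a nil map is safe in Go
		return false
	}
	if p.running == nil {
		p.running = make(map[string]bool)
	}
	p.running[name] = true
	// Real code would launch the polling goroutine here.
	return true
}

func main() {
	p := &pollers{}
	fmt.Println(p.start("inst-1")) // first call starts polling
	fmt.Println(p.start("inst-1")) // second call does nothing
}
```

With a start like this, any step could call it after subscribing without needing to know whether another step got there first.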

Daisy: Consolidate reading of serial port output.

5d796f2

google-oss-robot requested review from dntczdx and zmarano June 9, 2020 17:33

google-oss-robot added approved size/XXL labels Jun 9, 2020

EricEdens requested review from zoran15 and adjackura and removed request for zmarano June 9, 2020 17:33

EricEdens marked this pull request as ready for review June 9, 2020 17:34

zoran15 reviewed Jun 10, 2020

google-oss-robot added the do-not-merge/hold label Jun 10, 2020

EricEdens force-pushed the serials branch from 116e662 to 44c5cbe Compare June 10, 2020 19:05

zoran's feedback

b9d7fa8

EricEdens force-pushed the serials branch from 44c5cbe to b9d7fa8 Compare June 10, 2020 19:07

gofmt

de69985

zoran15 reviewed Jun 11, 2020

dntczdx reviewed Jun 11, 2020

EricEdens closed this Jun 11, 2020
