-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add task polling data to task responses #447
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -257,4 +257,19 @@ message TimestampedBuildIdAssignmentRule { | |
message TimestampedCompatibleBuildIdRedirectRule { | ||
CompatibleBuildIdRedirectRule rule = 1; | ||
google.protobuf.Timestamp create_time = 2; | ||
} | ||
|
||
// Data attached to successful poll responses (for any task type) which the SDK can use to optimize its polling | ||
// strategy. | ||
message TaskResponsePollingData { | ||
// Set to true if the returned task was a result of a sync match. | ||
bool was_sync_match = 1; | ||
// Set to the amount of time a poll call waited for a task to be available. This value is allowed to be unset | ||
// if the match was not sync. It is useful to help the SDK understand if it may reduce the number of pollers, since | ||
// any substantial wait time here means the backlog is drained, and poll calls are idling waiting for tasks. | ||
google.protobuf.Duration time_poller_waited = 2; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd call it |
||
// See `TaskQueueStats`. This field is only set if the poll hit the root partition, where the data is readily | ||
// available. SDKs should cache this data, and explicitly refresh it with a call to `DescribeTaskQueue` if some | ||
// TTL has expired. | ||
TaskQueueStats stats = 3; | ||
Comment on lines
+271
to
+274
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hrmm, I wonder a bit about this. I wasn't expecting SDKs to call describe task queue ever, only users. Can I get clarity on when a worker may implicitly call describe task queue? Is there information in here to drive decisions that we need? If not, I wonder if we should remove this field and ask users that want this info to call describe themselves. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's going to give us this same stats information, which is used to drive poller scaling decisions. Users aren't going to interact with any of this directly. So it will periodically call it if, by chance, no pollers are hitting the root partition in the last unit of time, which should be relatively rare. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hrmm, can the poller scaling decisions be done without changing workers to make new describe task queue calls? Or is it a strong requirement at this point? I saw the first version of this PR didn't need these stats. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the concern here? Nothing is, strictly speaking, required - we could not do poller scaling at all, but the information is helpful in making those decisions. I already had some discussion with server folks about the load making this call would add and that's all fine. I would've liked it to be in the task response all the time, ideally, but that is more expensive than this option turns out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The concern I think is adding the confusing results to the poll (we already got bit by this with backlog hint where it was confusing for everyone what it meant and when the data was accurate). Also there is the concern of adding logic burden on the workers to now start making additional, separate out of band calls beyond polling to support polling. I was hoping the poller scaling could work with the calls the worker already makes and if it needs more info, put the burden on the centralized part (i.e. server) to provide it as part of the calls already being made or acknowledge that we may need to work with limited information. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we need to talk about calling DescribeTaskQueue, or call it. If we're polling fairly frequently, then we'll hit the root and get the data. If we're not, then there's no tasks and no reason to scale pollers. Let's just say the SDK can make use of this field when it's provided, and do without it if it's missing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That workers won't be expected to call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've been writing some pseudocode for the algorithm that's nearly complete. I'm quite sure I want this data, and delivering it in the field is cheap. It's likely to be present often enough to be useful. I don't really see the cost of having it here as problematic at all. More specifically, it's very useful in knowing when to scale down pollers. Which fits perfectly with this bit that is mentioned earlier but David writes clearly:
So we're going to get that data when we most need it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 So the docs make clear that it may not appear, but per what @dnr put a few messages up in this thread, can we remove the part about expecting SDKs to call describe task queue to get this info? @ShahabT - can you expand on what you mean by aggregate-and-cache? How out of date might this value be compared to a regular describe call? Is there a design step needed to confirm how this aggregate-and-cache might work? Or any fears we may learn something during impl that would affect these protos? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For the stats to reflect the whole task queue, and not only the root partition itself, we need to fan-out to all partitions and get their stats (via internal Now, if we're to send such stats in the root partition's poll response, this fan-out and calling of To manage the load, server would have to add caching in the root partition to limit the fan-out frequency (say to once per second). It's not something that we cannot manage, but it is not free either. |
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe to head off potential confusion, add a comment like "Don't assume the accuracy of this data for any purpose other than poller scaling."