feat: add latency telemetry #577
base: mainline
Conversation
Force-pushed the branch from 672d9cf to 4b24596
```python
latency = end_t - start_t

event_name = decorator_kwargs.get("metric_name", function.__name__)
get_deadline_cloud_library_telemetry_client().record_event(
```
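For reviewers skimming the thread, here is a minimal, self-contained sketch of the shape of decorator being discussed, assuming timing with time.perf_counter_ns() as described later in this conversation. Only the latency calculation, the metric_name lookup, and the record_event() call come from the diff; the decorator name, the usage example, and the print() stand-in for the telemetry client are illustrative assumptions.

```python
import time
from functools import wraps


def record_latency(**decorator_kwargs):
    """Illustrative sketch only; not the PR's actual implementation."""

    def decorator(function):
        @wraps(function)
        def wrapper(*args, **kwargs):
            start_t = time.perf_counter_ns()
            result = function(*args, **kwargs)  # an exception here skips the record below
            end_t = time.perf_counter_ns()

            latency = end_t - start_t
            event_name = decorator_kwargs.get("metric_name", function.__name__)
            # In the PR this is sent via
            # get_deadline_cloud_library_telemetry_client().record_event(...);
            # the exact arguments are not visible in the excerpt, so print()
            # stands in for the real call here.
            print(f"{event_name}: latency_ns={latency}")
            return result

        return wrapper

    return decorator


@record_latency(metric_name="list_queues_latency")  # hypothetical usage
def list_queues():
    time.sleep(0.01)
    return []
```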
Do we want to time the function even when it raises an exception, or only on successful executions?
That's a good question. I feel like if there is an exception we don't care to capture the performance. In this case we are just worried about E2E calls that are successful, so we can always make an apples-to-apples comparison.
I think it would be difficult to get meaningful information from unsuccessful calls without a lot more context.
Maybe we could use this as some metric for impact? We definitely shouldn't rely on it to notify us of a problem or anything like that.
Impact would be tough to determine without other information and we already have telemetry around call success/failure. This is more to see what the customer experience is like when people are using our client.
Or what if we wrap the function call in a try-except, e.g. on a 404? Then we could also log a different metric for the failed cases.
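A rough sketch of what that suggestion could look like, purely for discussion; the helper name, the `_failed` suffix, and the print() stand-in for the telemetry call are assumptions, not part of this PR.

```python
import time


def call_with_latency(function, *args, **kwargs):
    """Sketch of the alternative floated above: time the call either way and
    emit a differently named metric when it raises. Not part of this PR."""
    start_t = time.perf_counter_ns()
    try:
        result = function(*args, **kwargs)
    except Exception:
        latency = time.perf_counter_ns() - start_t
        # Hypothetical failure metric; the PR intentionally does not emit this.
        print(f"{function.__name__}_failed: latency_ns={latency}")
        raise
    latency = time.perf_counter_ns() - start_t
    print(f"{function.__name__}: latency_ns={latency}")
    return result
```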
We already have success/failure metrics that do exactly this so I'm not sure what additional information we would get from logging the failure.
How does this perform if:
- The CLI command is fast, and the process ends almost immediately. Does the telemetry event still get sent out?
- The telemetry endpoint is slow, or someone is geographically far from it. Does the CLI get hung up on sending the telemetry event?
I saw that AWS SAM has a flag for its telemetry-sending function to not wait for a response. I think it's used when the CLI process is ending so that telemetry issues don't slow down the CLI: https://github.com/aws/aws-sam-cli/blob/70ad4f78f64bd5a2906af1d7e90fef65026ec50b/samcli/lib/telemetry/telemetry.py#L69
The telemetry is sent the same way our other telemetry events are. No matter how fast the function is, we will get metrics for it since we are using Python's time.perf_counter_ns() to measure the time taken. The telemetry function is blocking, not asynchronous, so it will always go out.
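For comparison, a hedged sketch of the non-blocking pattern referenced from AWS SAM: handing the send to a daemon thread so a slow endpoint cannot hold up the CLI, at the cost of possibly dropping the event if the process exits first. This only illustrates the alternative; per the reply above, the PR keeps the send blocking.

```python
import threading


def send_event_without_waiting(send_fn, *args, **kwargs):
    """Fire-and-forget telemetry send on a daemon thread. A daemon thread does
    not keep the process alive, so the event may be dropped if the CLI exits
    before the request completes; that is the trade-off discussed above."""
    thread = threading.Thread(target=send_fn, args=args, kwargs=kwargs, daemon=True)
    thread.start()
    return thread
```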
Signed-off-by: Justin Sawatzky <[email protected]>
Force-pushed the branch from e5d82c9 to 6323572
What was the problem/requirement? (What/Why)
We don't have any telemetry around how long our calls to the service take on the local client. This information is useful for understanding our local performance and whether any regressions have been introduced by other changes.
What was the solution? (How)
Add telemetry to our calls to the main service.
What is the impact of this change?
Increased telemetry coverage.
How was this change tested?
Were changes made to the download or asset_sync modules? If so, then it is highly recommended that you ensure that the docker-based unit tests pass.
Was this change documented?
Yes.
Does this PR introduce new dependencies?
Is this a breaking change?
No.
Does this change impact security?
No.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.