
feat: add latency telemetry #577

Open

justinsaws wants to merge 1 commit into mainline from justinsaws/add_latency_telemetry

Conversation

justinsaws
Contributor

What was the problem/requirement? (What/Why)

We don't have any telemetry around how long our calls to the service take on the local client. This information is useful for understanding our local performance and whether any regressions have been introduced by other changes.

What was the solution? (How)

Add telemetry to our calls to the main service.

What is the impact of this change?

Increased telemetry coverage.

How was this change tested?

  • Have you run the unit tests?
    • Yes.
  • Have you run the integration tests?
    • Yes.
  • Have you made changes to the download or asset_sync modules? If so, then it is highly recommended
    that you ensure that the docker-based unit tests pass.
    • No.

Was this change documented?

Yes.

Does this PR introduce new dependencies?

  • This PR adds one or more new dependency Python packages. I acknowledge I have reviewed the considerations for adding dependencies in DEVELOPMENT.md.
  • This PR does not add any new dependencies.

Is this a breaking change?

No.

Does this change impact security?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@justinsaws justinsaws requested a review from a team as a code owner January 28, 2025 21:43
@justinsaws justinsaws force-pushed the justinsaws/add_latency_telemetry branch from 672d9cf to 4b24596 on January 28, 2025 21:49
latency = end_t - start_t

event_name = decorator_kwargs.get("metric_name", function.__name__)
get_deadline_cloud_library_telemetry_client().record_event(
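
The diff context above is cut off mid-call. For readers outside the diff, a minimal sketch of how a latency decorator along these lines could be wired up is below; `record_latency_event` is a hypothetical stand-in for the library's telemetry client, and the real `record_event` signature may differ.

```python
import functools
import time


def record_latency_event(event_name: str, latency_ns: int) -> None:
    # Hypothetical placeholder for
    # get_deadline_cloud_library_telemetry_client().record_event(...)
    print(f"{event_name}: {latency_ns / 1e6:.2f} ms")


def latency_telemetry(**decorator_kwargs):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            start_t = time.perf_counter_ns()
            # If the wrapped call raises, execution never reaches the
            # recording below, so only successful executions are timed.
            result = function(*args, **kwargs)
            end_t = time.perf_counter_ns()
            latency = end_t - start_t
            event_name = decorator_kwargs.get("metric_name", function.__name__)
            record_latency_event(event_name, latency)
            return result
        return wrapper
    return decorator


@latency_telemetry(metric_name="create_job_latency")  # hypothetical usage
def create_job():
    ...
```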
Contributor

Do we want to time the function even when it raises an exception, or only on successful executions?

Contributor Author

That's a good question. I feel like if there is an exception we don't care to capture the performance. In this case, we only care about E2E calls that are successful, so we can always make an apples-to-apples comparison.

I think it would be difficult to get meaningful information from unsuccessful calls without a lot more context.

Contributor

Maybe we could use this as some metric for impact? We definitely shouldn't rely on it to notify us of a problem or anything like that.

Contributor Author

Impact would be tough to determine without other information, and we already have telemetry around call success/failure. This is more to see what the customer experience is like when people are using our client.

Contributor

Or what if we try/except the function call on a 404, and then log a different metric for the failed cases as well? (Roughly along the lines of the sketch below.)
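
For illustration only, the suggestion above might look something like this; `record_event` is a hypothetical callable, and it assumes the underlying service call raises botocore's ClientError on a 404. This is not the approach the PR takes.

```python
import time

from botocore.exceptions import ClientError


def timed_call(function, record_event, *args, **kwargs):
    start_t = time.perf_counter_ns()
    try:
        result = function(*args, **kwargs)
    except ClientError:
        # A failed call (e.g. a 404 from the service) gets its own metric name.
        latency = time.perf_counter_ns() - start_t
        record_event(f"{function.__name__}_failed", latency)
        raise
    latency = time.perf_counter_ns() - start_t
    record_event(function.__name__, latency)
    return result
```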

Contributor Author

We already have success/failure metrics that do exactly this, so I'm not sure what additional information we would get from logging the failure.

Contributor
@crowecawcaw left a comment

How does this perform if:

  1. The CLI command is fast, and the process ends almost immediately. Does the telemetry event still get sent out?
  2. The telemetry endpoint is slow, or someone is geographically far from it. Does the CLI get hung up on sending the telemetry event?

I saw that AWS SAM has a flag for its telemetry-sending function to not wait for a response. I think it's used when the CLI process is ending so that telemetry issues don't slow down the CLI: https://github.com/aws/aws-sam-cli/blob/70ad4f78f64bd5a2906af1d7e90fef65026ec50b/samcli/lib/telemetry/telemetry.py#L69

@justinsaws
Contributor Author

The telemetry is sent the same way our other telemetry events are. No matter how fast the function is, we will get some metrics for it, since we are using Python's time.perf_counter_ns() to measure the time taken. The telemetry function is blocking, not asynchronous, so it will always go out.
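
For reference, the fire-and-forget pattern in the linked SAM code looks roughly like the sketch below. This is only an illustration of that pattern (with a hypothetical `post_event` function), not of how this client sends telemetry, which, as noted above, is a blocking call.

```python
import threading


def post_event(payload: dict) -> None:
    """Hypothetical stand-in for the HTTP POST to the telemetry endpoint."""
    ...


def send_event(payload: dict, wait_for_response: bool = True) -> None:
    if wait_for_response:
        # Blocking: the CLI waits for the telemetry request to finish.
        post_event(payload)
    else:
        # Fire-and-forget: a daemon thread lets the CLI exit without waiting,
        # at the cost of possibly dropping the event if the process ends first.
        threading.Thread(target=post_event, args=(payload,), daemon=True).start()
```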

Signed-off-by: Justin Sawatzky <[email protected]>
@justinsaws justinsaws force-pushed the justinsaws/add_latency_telemetry branch from e5d82c9 to 6323572 on January 30, 2025 23:27