
feat: add latency telemetry #577

Open

justinsaws wants to merge 1 commit into mainline from justinsaws/add_latency_telemetry

Conversation

justinsaws
Contributor

What was the problem/requirement? (What/Why)

We don't have any telemetry around how long our calls to the service take on the local client. This information is useful for understanding our local performance and whether any regressions have been introduced by other changes.

What was the solution? (How)

Add telemetry to our calls to the main service.

What is the impact of this change?

Increased telemetry coverage.

How was this change tested?

  • Have you run the unit tests?
    • Yes.
  • Have you run the integration tests?
    • Yes.
  • Have you made changes to the download or asset_sync modules? If so, then it is highly recommended
    that you ensure that the docker-based unit tests pass.
    • No.

Was this change documented?

Yes.

Does this PR introduce new dependencies?

  • This PR adds one or more new dependency Python packages. I acknowledge I have reviewed the considerations for adding dependencies in DEVELOPMENT.md.
  • This PR does not add any new dependencies.

Is this a breaking change?

No.

Does this change impact security?

No.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@justinsaws justinsaws requested a review from a team as a code owner January 28, 2025 21:43
@justinsaws justinsaws force-pushed the justinsaws/add_latency_telemetry branch from 672d9cf to 4b24596 on January 28, 2025 21:49
latency = end_t - start_t

event_name = decorator_kwargs.get("metric_name", function.__name__)
get_deadline_cloud_library_telemetry_client().record_event(
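
The diff context above is cut off mid-call. For readers outside the diff, a minimal sketch of how a latency decorator along these lines could be wired up is below; `record_latency_event` is a hypothetical stand-in for the library's telemetry client, and the real `record_event` signature may differ.

```python
import functools
import time


def record_latency_event(event_name: str, latency_ns: int) -> None:
    # Hypothetical placeholder for
    # get_deadline_cloud_library_telemetry_client().record_event(...)
    print(f"{event_name}: {latency_ns / 1e6:.2f} ms")


def latency_telemetry(**decorator_kwargs):
    def decorator(function):
        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            start_t = time.perf_counter_ns()
            # If the wrapped call raises, execution never reaches the
            # recording below, so only successful executions are timed.
            result = function(*args, **kwargs)
            end_t = time.perf_counter_ns()
            latency = end_t - start_t
            event_name = decorator_kwargs.get("metric_name", function.__name__)
            record_latency_event(event_name, latency)
            return result
        return wrapper
    return decorator


@latency_telemetry(metric_name="create_job_latency")  # hypothetical usage
def create_job():
    ...
```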
Contributor

Do we want to time the function even when it raises an exception, or only on successful executions?

Contributor Author

That's a good question. I feel like if there is an exception we don't care to capture the performance. In this case, we only care about E2E calls that are successful, so we can always make an apples-to-apples comparison.

I think it would be difficult to get meaningful information from unsuccessful calls without a lot more context.

Contributor

Maybe we could use this as some metric for impact? We definitely shouldn't rely on it to notify us of a problem or anything like that.

Contributor Author

Impact would be tough to determine without other information, and we already have telemetry around call success/failure. This is more to see what the customer experience is like when people are using our client.

Contributor

Or what if we try/except the function call on a 404, and then log a different metric for the failed cases as well? (Roughly along the lines of the sketch below.)
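
For illustration only, the suggestion above might look something like this; `record_event` is a hypothetical callable, and it assumes the underlying service call raises botocore's ClientError on a 404. This is not the approach the PR takes.

```python
import time

from botocore.exceptions import ClientError


def timed_call(function, record_event, *args, **kwargs):
    start_t = time.perf_counter_ns()
    try:
        result = function(*args, **kwargs)
    except ClientError:
        # A failed call (e.g. a 404 from the service) gets its own metric name.
        latency = time.perf_counter_ns() - start_t
        record_event(f"{function.__name__}_failed", latency)
        raise
    latency = time.perf_counter_ns() - start_t
    record_event(function.__name__, latency)
    return result
```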

Contributor Author

We already have success/failure metrics that do exactly this, so I'm not sure what additional information we would get from logging the failure.

Contributor
@crowecawcaw left a comment

How does this perform if:

  1. The CLI command is fast, and the process ends almost immediately. Does the telemetry event still get sent out?
  2. The telemetry endpoint is slow, or someone is geographically far from it. Does the CLI get hung up on sending the telemetry event?

I saw that AWS SAM has a flag for its telemetry-sending function to not wait for a response. I think it's used when the CLI process is ending so that telemetry issues don't slow down the CLI: https://github.com/aws/aws-sam-cli/blob/70ad4f78f64bd5a2906af1d7e90fef65026ec50b/samcli/lib/telemetry/telemetry.py#L69

@justinsaws
Contributor Author

The telemetry is sent the same way our other telemetry events are. No matter how fast the function is, we will get some metrics for it, since we are using Python's time.perf_counter_ns() to measure the time taken. The telemetry function is blocking, not asynchronous, so it will always go out.
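
For reference, the fire-and-forget pattern in the linked SAM code looks roughly like the sketch below. This is only an illustration of that pattern (with a hypothetical `post_event` function), not of how this client sends telemetry, which, as noted above, is a blocking call.

```python
import threading


def post_event(payload: dict) -> None:
    """Hypothetical stand-in for the HTTP POST to the telemetry endpoint."""
    ...


def send_event(payload: dict, wait_for_response: bool = True) -> None:
    if wait_for_response:
        # Blocking: the CLI waits for the telemetry request to finish.
        post_event(payload)
    else:
        # Fire-and-forget: a daemon thread lets the CLI exit without waiting,
        # at the cost of possibly dropping the event if the process ends first.
        threading.Thread(target=post_event, args=(payload,), daemon=True).start()
```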

Signed-off-by: Justin Sawatzky <[email protected]>
@justinsaws justinsaws force-pushed the justinsaws/add_latency_telemetry branch from e5d82c9 to 6323572 on January 30, 2025 23:27