
[Experiment] Fetch failed error when use fetch_v2 #48

Closed
raulgzm opened this issue Jul 15, 2024 · 9 comments

Comments

@raulgzm

raulgzm commented Jul 15, 2024

We are receiving different kinds of errors when we fetch experiments:

  • [Experiment] Fetch failed: Remote end closed connection without response
  • [Experiment] Fetch failed: EOF occurred in violation of protocol (_ssl.c:2427)
  • [Experiment] Fetch failed: Request-sent
  • [Experiment] Fetch failed: The read operation timed out
  • [Experiment] Fetch failed: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1006)

What do those errors mean?

Expected Behavior

The fetch_v2() method works reliably.

Current Behavior

We receive different kinds of errors, but not on every request.

Possible Solution

We don't know the root cause.

Steps to Reproduce

We don't know; the failures are intermittent.

I can share Sentry traces and error information with you if you prefer.

Environment

  • SDK Version: amplitude-experiment = "==1.3.1"
  • Language Version: 3.11
@zhukaihan
Collaborator

Hey Raul,

Thanks for submitting the issue.

These all look like issues related to network calls.
When did this start to happen, and does it continue to happen?
Are there any special configurations of server_url (e.g. proxies or the EU datacenter)?

Thanks.

@raulgzm
Author

raulgzm commented Jul 18, 2024 via email

@zhukaihan
Collaborator

zhukaihan commented Jul 18, 2024

Hi Raul,

Thanks for the response and confirming that it's a persistent error.

We have just identified an issue with one of our CDN vendors, which should be the root cause of the above errors, and we have stopped routing requests to that CDN. The 502 is a new one; it's possible that it's also related to the CDN. Please let us know if you still see any errors.

May I get a bit more info:
When (date/time) did you start using the library, and when did the errors start?
And what percentage of the request volume resulted in one of the above errors?

Thanks!

@raulgzm
Author

raulgzm commented Jul 22, 2024

Hi Peter,

Thank you for your support. Do you know why we saw the most recent of these errors 7 hours ago? Maybe the change has not been applied yet?

"[Experiment] Fetch failed: Remote end closed connection without response"

We started to use the library 2 months ago.
The first error we saw was on 4th June 2024.
What percentage of the request volume resulted in one of the above errors? I can't give an exact number, but we have seen 80 events in Sentry against a high number of requests in the service. So I would say the percentage is probably low, but the impact is high because we can't use that feature flag in our service properly.

@zhukaihan
Collaborator

The CDN change was applied within minutes.

To ensure these evaluations are served properly, I would suggest tweaking the retry parameters so that failed requests are retried.
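
Something like the following could be a starting point. This is only a rough sketch: the config class and parameter names are assumed from the SDK's remote evaluation configuration, so please verify them against the 1.3.1 release you have installed.

```python
# Sketch only: parameter names are assumed from the remote evaluation config
# and should be verified against amplitude-experiment 1.3.1.
from amplitude_experiment import Experiment, RemoteEvaluationConfig

experiment = Experiment.initialize_remote(
    'server-deployment-key',                   # your deployment's server key
    RemoteEvaluationConfig(
        fetch_timeout_millis=1000,             # per-attempt timeout
        fetch_retries=3,                       # retry transient network failures
        fetch_retry_backoff_min_millis=500,    # initial retry delay
        fetch_retry_backoff_max_millis=5000,   # cap on the retry delay
        fetch_retry_backoff_scalar=2.0,        # exponential backoff factor
        fetch_retry_timeout_millis=1000,       # timeout for each retry attempt
    ),
)
```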

It's quite unusual that the endpoint simply closes the connection without any status code or triggering a timeout. The following questions can help us understand patterns and potential causes.
How frequently does this error occur?
Does the error happen in bursts, or at a consistent rate over the course of a day?
How long does it take for requests to fail, i.e. what is the average latency of the failures?

Thanks.

@raulgzm
Author

raulgzm commented Jul 23, 2024

Hi Peter,

Let me try to answer those questions with helpful information:

How frequently does this error occur?
If you mean the last error, "Remote end closed connection without response", we have had 157 errors in 2 months. The last one was on Jul 23, 2:04 AM, and the first one was on Jun 4, 12:21 PM. The error seems to happen every day:

[Sentry screenshot: daily error counts between Jun 4 and Jul 23]

Does the error happen in bursts, or at a consistent rate over the course of a day?
At a consistent rate over the course of a day, as you can see in the previous image, and the hour when the error occurs is always different. It does not seem to follow any pattern.

How long does it take for requests to fail, the average latency of failures?
Looking at the traces, it seems the requests fail immediately. Or, for some reason, Sentry does not show us the total duration of each trace, so we cannot know the average latency.

How else can I help out?

@zhukaihan
Collaborator

Thanks for the info!
Looks like there are around 2 errors per day at different times, and more on some days.
What would be the timezone of the time you mentioned? I'm trying to match numbers up with any of our metrics.
Thanks!

@raulgzm
Author

raulgzm commented Jul 24, 2024 via email

@zhukaihan
Collaborator

The errors happened after we received a huge spike in traffic, so the timing is indeed unpredictable. We have been continuously improving the performance of our service during large spikes, and we have plans to keep doing so. On the SDK side, my suggestion would be to configure retries; some tweaking may be needed to achieve the optimal results.
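
If retries alone don't cover every failure, a small fallback wrapper around fetch_v2 can keep the feature flag usable when a fetch ultimately fails. This is only a sketch: the helper name and default value are made up for illustration, and depending on the SDK version an exhausted fetch may raise or may just log "[Experiment] Fetch failed: ..." and return an empty result, so both paths fall back to the default here.

```python
# Sketch of an application-level fallback around fetch_v2; the helper name and
# default value are hypothetical and only illustrate the pattern.
from amplitude_experiment import User

DEFAULT_VARIANT_VALUE = "control"  # hypothetical default used for illustration

def get_variant_value(experiment, user_id: str, flag_key: str) -> str:
    try:
        # SDK-level retries (if configured) run inside fetch_v2; this except
        # only fires if the SDK raises after exhausting them.
        variants = experiment.fetch_v2(User(user_id=user_id))
    except Exception:
        variants = {}
    variant = variants.get(flag_key)
    # Fall back to the default when the fetch failed or the flag is missing,
    # so the service keeps working during network blips.
    return variant.value if variant and variant.value else DEFAULT_VARIANT_VALUE
```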
Thanks for raising this to us!
