Worker crashes when using S3 / sites with very short DNS TTL #30

eriede · 2018-07-17T16:44:43Z

When a worker process connection times out after 2 * the DNS TTL period has passed, the worker process may crash (nginx 1.12). AWS S3 has a very short DNS TTL and exhibits the issue.

The issue is that the peer data structure is accessed after the dynamic server releases the peer.

I have created a patch which uses reference counting on active requests instead of the interleaving scheme currently used. This will ensure that the peer's memory remains valid during outstanding requests.

Use a reference counting scheme on open requests to keep the dynamic server memory alive while a request could still have a reference to an old peer.
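To make the idea concrete, here is a minimal standalone sketch of reference counting on peer generations. It is not the patch itself and does not use nginx types: `peer_gen_t`, `gen_create`, `gen_acquire`, `gen_release`, `gen_swap` and `live_generations` are all hypothetical names, and the plain `int` counter assumes a single-threaded event loop like an nginx worker.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one DNS result ("generation") of a peer. */
typedef struct {
    int  refs;      /* 1 owner reference + 1 per in-flight request */
    char addr[46];  /* textual address, for illustration only */
} peer_gen_t;

/* Counts generations whose memory is still allocated (for the demo). */
static int live_generations = 0;

static peer_gen_t *gen_create(const char *addr)
{
    peer_gen_t *g = calloc(1, sizeof(*g));
    if (g == NULL) {
        return NULL;
    }
    strncpy(g->addr, addr, sizeof(g->addr) - 1);
    g->refs = 1;  /* owner reference held by the dynamic server */
    live_generations++;
    return g;
}

/* A request takes a reference before it starts using the peer. */
static peer_gen_t *gen_acquire(peer_gen_t *g)
{
    g->refs++;
    return g;
}

/* Drop one reference; the memory is freed only when nobody uses it. */
static void gen_release(peer_gen_t *g)
{
    assert(g->refs > 0);
    if (--g->refs == 0) {
        live_generations--;
        free(g);
    }
}

/* DNS refresh: publish the new generation, then drop the owner
 * reference on the old one. Requests still holding the old generation
 * keep it alive until they call gen_release(). */
static peer_gen_t *gen_swap(peer_gen_t **current, const char *addr)
{
    peer_gen_t *next = gen_create(addr);
    if (next == NULL) {
        return NULL;  /* keep serving the old generation */
    }
    peer_gen_t *old = *current;
    *current = next;
    if (old != NULL) {
        gen_release(old);
    }
    return next;
}
```

In this sketch, a request that acquired a generation before two consecutive refreshes can still read it safely, which is the 2 * TTL case described above. The cost is that a request which never finishes pins its generation indefinitely.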

zhaofeng0019 · 2020-06-21T04:46:53Z

I think your solution may cause a memory leak when there are requests that can't finish within a TTL: the previous pool can't be freed. It also may not solve the memory crash when using upstream keepalive.
I think I may have solved the problem:
https://github.com/zhaofeng0019/nginx-upstream-dynamic-resolve-servers
This solution borrows some ideas from yours.
Would you like to check it out? Thanks a lot.

eriede · 2020-06-22T04:30:55Z

It's been a long time since I last looked at this. You are correct that it does not address the keepalive problem. I remember doing some updates to add keepalive features that I haven't updated. I think it used the connection's pool instead of the request's to extend the lifetime. I requested permission from the open source committee for that change too. I think I did get permission, but I had moved on to a different project and promptly forgot about this. It's been stable for over 2 years, with no crashes or cores, normally running on a 9-server cluster with 2 cores each, but sometimes we size it up to 20 or more servers during DDoS attempts or Black Friday. I'll see about providing the update tomorrow morning (PST). It might take a little time to get approval again. Glad to see some interest in this... It seems like the original maintainer might not be maintaining the project anymore.

zhaofeng0019 · 2020-06-22T07:13:40Z

Or would you please look at my pull request for this project? It solves all the memory problems and works well.

eriede · 2020-06-22T07:40:59Z

The solution that you're proposing is similar, using the cleanup hooks on the connection pools as a connection-closed callback. It will probably work similarly. The drawback I found with this strategy is that it only worked with the round robin lb option and caused crashes with the other lb keepalive strategies, at least on 1.12. Were you able to get the other lb options to work? If so, which version of nginx are you using?
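As an illustration of the cleanup-hook strategy discussed here, the following is a standalone sketch that imitates a pool with a cleanup list. It does not use nginx headers; `pool_t`, `pool_cleanup_add`, `pool_destroy`, `peer_set_t`, `peer_set_unref` and `connection_pin_peers` are hypothetical names chosen for the example, and neither repository's actual code is reproduced.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*cleanup_fn)(void *data);

/* One registered cleanup handler; handlers run when the pool dies. */
typedef struct cleanup {
    cleanup_fn      handler;
    void           *data;
    struct cleanup *next;
} cleanup_t;

/* Hypothetical stand-in for a per-connection memory pool. */
typedef struct {
    cleanup_t *cleanups;
} pool_t;

static pool_t *pool_create(void)
{
    return calloc(1, sizeof(pool_t));
}

static int pool_cleanup_add(pool_t *p, cleanup_fn handler, void *data)
{
    cleanup_t *c = malloc(sizeof(*c));
    if (c == NULL) {
        return -1;
    }
    c->handler = handler;
    c->data = data;
    c->next = p->cleanups;
    p->cleanups = c;
    return 0;
}

/* Destroying the pool acts as the "connection closed" callback. */
static void pool_destroy(pool_t *p)
{
    cleanup_t *c = p->cleanups;
    while (c != NULL) {
        cleanup_t *next = c->next;
        c->handler(c->data);
        free(c);
        c = next;
    }
    free(p);
}

/* Hypothetical set of resolved peers shared by several connections. */
typedef struct {
    int refs;      /* owner reference + one per pinned connection */
    int released;  /* set once the last reference is dropped */
} peer_set_t;

static void peer_set_unref(void *data)
{
    peer_set_t *ps = data;
    assert(ps->refs > 0);
    if (--ps->refs == 0) {
        ps->released = 1;  /* real code would free the peer memory here */
    }
}

/* Pin the peer set to a connection: it stays alive until the
 * connection's pool is destroyed, even across keepalive reuse. */
static int connection_pin_peers(pool_t *conn_pool, peer_set_t *ps)
{
    if (pool_cleanup_add(conn_pool, peer_set_unref, ps) != 0) {
        return -1;
    }
    ps->refs++;
    return 0;
}
```

Tying the reference to the connection's pool rather than to the request is what lets a cached keepalive connection keep its peer data valid after the request that opened it has finished.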

zhaofeng0019 · 2020-06-22T13:14:38Z

You can see that in the native nginx code, all the lb options end up using round robin, including keepalive.
So in my code, I only call the ngx_http_upstream_init_round_robin function when the DNS result changes; you can just save the function pointer of the other lb option in the init_process function.

zhaofeng0019 · 2020-06-22T13:16:02Z

Done this way, it supports all the native nginx lb options with no crashes; I tested it.
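One possible reading of this approach, sketched as a standalone simulation rather than real nginx code: the round robin initializer installs its own per-request hook, so a refresh that reruns it must put back the hook of the configured balancer. All names here (`upstream_t`, `rr_init_upstream`, `hash_init_upstream`, `remember_balancer`, `refresh_peers`) are hypothetical, and the linked repository may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

typedef struct upstream upstream_t;
typedef int (*peer_init_fn)(upstream_t *us);

struct upstream {
    peer_init_fn init;        /* per-request hook of the active balancer */
    peer_init_fn saved_init;  /* copy taken once at worker start */
    int          rr_builds;   /* how often the peer list was rebuilt */
};

/* Per-request hooks; the return values only identify them in the demo. */
static int rr_peer_init(upstream_t *us)   { (void) us; return 1; }
static int hash_peer_init(upstream_t *us) { (void) us; return 2; }

/* Round robin setup: builds the peer list and installs its own hook. */
static int rr_init_upstream(upstream_t *us)
{
    us->rr_builds++;
    us->init = rr_peer_init;
    return 0;
}

/* Another balancer layered on top of round robin: it reuses the
 * round robin peer list and then overrides the per-request hook. */
static int hash_init_upstream(upstream_t *us)
{
    if (rr_init_upstream(us) != 0) {
        return -1;
    }
    us->init = hash_peer_init;
    return 0;
}

/* Stand-in for the init_process step: remember the configured hook. */
static void remember_balancer(upstream_t *us)
{
    us->saved_init = us->init;
}

/* DNS result changed: rebuild only the round robin peer list, then
 * restore the hook that the rebuild overwrote. */
static int refresh_peers(upstream_t *us)
{
    if (rr_init_upstream(us) != 0) {
        return -1;
    }
    us->init = us->saved_init;
    return 0;
}
```

After `refresh_peers`, requests still go through the originally configured balancer, while the underlying peer list reflects the new DNS answer.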

eriede · 2020-06-22T17:04:18Z

I uploaded our keepalive solution, which only works for the round robin lb with 1.12, to get it out in the public domain: https://github.com/eriede/nginx-upstream-dynamic-servers/tree/round-robin-keepalive. I'm happy to do a code review of your code if you would like, but I don't have the time to test on the various nginx versions, so I can't accept pull requests.

abadcafe mentioned this issue Dec 19, 2019

memory leak when more request is received abadcafe/nginx-upstream-serverlist#4

Open
