Memory leak in aiohttp integration since v2.2.0 #8108

Open
agronholm opened this issue Jan 15, 2024 · 12 comments

@agronholm

Summary of problem

We noticed a huge memory leak after upgrading the ddtrace dependency in our application from v1.9.0 to v2.4.0. With the Datadog tracing middleware enabled (trace_app(...)), the memory footprint of the Python process keeps increasing as requests come in, and it keeps growing until the process runs out of memory.

Which version of dd-trace-py are you using?

Any version since v2.2.0 seems to exhibit the issue, and v2.1.9 does not.

Which version of pip are you using?

23.2.1

Which libraries and their versions are you using?

Output of `pip freeze`:

aiohttp==3.9.1
aiosignal==1.3.1
attrs==23.2.0
bytecode==0.15.1
cattrs==23.2.3
ddsketch==2.0.4
ddtrace==2.4.0
Deprecated==1.2.14
envier==0.5.0
frozenlist==1.4.1
idna==3.6
importlib-metadata==6.11.0
multidict==6.0.4
opentelemetry-api==1.22.0
protobuf==4.25.2
setuptools==69.0.3
six==1.16.0
typing_extensions==4.9.0
wrapt==1.16.0
xmltodict==0.13.0
yarl==1.9.4
zipp==3.17.0

How can we reproduce your problem?

Install aiohttp (v3.9.1) and ddtrace, and then start the following trivial app:

from aiohttp import web
from ddtrace import tracer
from ddtrace.contrib.aiohttp import trace_app


async def hello(request):
    return web.Response(text="Hello, world")

app = web.Application()
app.add_routes([web.get('/', hello)])
tracer.configure(enabled=False)
trace_app(app, tracer, service="aiohttp-web")
web.run_app(app, host="0.0.0.0", port=8800)

Then keep hitting it with a lot of requests and watch the memory footprint grow linearly without bounds. On v2.1.9 and earlier, I've observed the memory footprint plateauing at around 44 MB.
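
For illustration, a load loop like the one below is enough to drive the endpoint while watching the server's memory. It is not part of the original report; the client choice, URL, and request count are arbitrary.

# Hypothetical load generator -- the report doesn't specify a client; this is
# just one way to keep hitting the endpoint while watching the server's RSS.
import asyncio

import aiohttp


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        for i in range(200_000):
            async with session.get("http://127.0.0.1:8800/") as resp:
                await resp.text()
            if i % 10_000 == 0:
                print(f"{i} requests sent")


asyncio.run(main())

While it runs, the server's resident memory can be sampled periodically (for example with ps -o rss= -p <server pid>) to see whether it plateaus or keeps climbing.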

What is the result that you get?

The memory footprint of the process grows without bounds.

What is the result that you expected?

The memory footprint plateaus after a while.

@emmettbutler
Collaborator

Thank you for reporting this issue, @agronholm. We'll look into it.

@nicolashardy

Same problem here. Huge impact: the latest dd-trace version is unusable for us.

@ZStriker19
Contributor

Seems likely that the culprit PR is this one: "This introduces basic tracing of streaming responses that stay open long after the on_prepare signal has been sent." It's the only aiohttp integration change in 2.2.0. Waiting to end spans sounds like it could easily end up causing a memory leak: perhaps our callback is never getting called, or the request_span isn't in the request by the time we call the callback, and then we just return. I'll test this and update here.

@agronholm
Author

Then why is it that the trivial app I provided also exhibits this?

@mklokocka

I just noticed this issue. This makes newer versions of ddtrace completely unusable for aiohttp apps. I tried to look into it but could not find a clear cause so far. @ZStriker19, did you make any progress?

From my investigation I would say that:
a) Distinguishing between StreamResponse and other types seems superfluous; if the add_done_callback actually worked correctly, it should be enough to use that alone.
b) However, it is not clear to me that request.task should be used for the done callback. It refers to the asyncio task running this method. In theory I think it should work, but it looks like that's not always the case.
c) If the done callback is the problem, then I think this type check is why it also applies to standard, non-streaming responses: Response is a subclass of StreamResponse (as noted here as well). A sketch of the pattern in question follows after this list.
d) Unfortunately, aiohttp seems to expose only the on_response_prepare signal, so there is no easier hook built into the library itself.
e) From my testing it does not look like the spans remain unfinished. I tested both Response and StreamResponse, and ddtrace debug logging seemingly always produced a message about finishing a span (though I only tested this with a few requests). This leads me to believe that references to something are being kept alive either in ddtrace or in aiohttp itself. However, I have no idea what it could be, as the only change in 2.2.0 I consider suspicious is the add_done_callback.
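
For readers who haven't looked at the integration, the pattern discussed in points (b) and (c) is roughly the following. This is an illustrative sketch, not the actual ddtrace code; the "dd_request_span" key is hypothetical, and the span is assumed to have been stored on the request by an earlier handler.

# Simplified sketch of the suspected pattern -- NOT the real ddtrace
# integration code. "dd_request_span" is a hypothetical key; the span is
# assumed to be created and stored on the request by an earlier handler.
import asyncio

from aiohttp import web


async def on_prepare(request: web.Request, response: web.StreamResponse) -> None:
    request_span = request.get("dd_request_span")
    if request_span is None:
        return  # nothing to finish if the span never made it into the request

    if isinstance(response, web.StreamResponse):
        # web.Response is a subclass of web.StreamResponse, so this branch
        # matches every response, not just streaming ones (point c above).
        def finish_request_span(_task: asyncio.Task) -> None:
            request_span.finish()

        # The closure keeps whatever it captures alive until the Task serving
        # the request is released (point b above).
        request.task.add_done_callback(finish_request_span)
    else:
        request_span.finish()

If the Task, or anything holding the callback, lives longer than the request, everything reachable from that closure stays in memory with it.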

github-actions bot added the stale label May 7, 2024
Contributor

github-actions bot commented Aug 5, 2024

This issue has been automatically closed after a period of inactivity. If it's a
feature request, it has been added to the maintainers' internal backlog and will be
included in an upcoming round of feature prioritization. Please comment or reopen
if you think this issue was closed in error.

github-actions bot closed this as not planned Aug 5, 2024
sanchda removed the stale label Aug 5, 2024
@chamini2

chamini2 commented Nov 7, 2024

Has this been tackled at all? This got auto-closed.

seandstewart pushed a commit to seandstewart/dd-trace-py that referenced this issue Nov 23, 2024
@seandstewart
Contributor

Hello - I've submitted #11518, which should remove the strong reference to the response object in the task callback, as well as fix the type check so we only add the callback when we have an actual stream response.

seandstewart pushed a commit to seandstewart/dd-trace-py that referenced this issue Nov 24, 2024
…g responses(DataDog#8108)

Without this change, the request and response objects are not freed from memory until the asyncio Task is freed, which can create a memory leak. This change leverages a contextvar to accomplish the same result as the previous version, while ensuring any memory is freed once the current async context is exited.
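
For context, here is a rough sketch of those two ideas (a strict streaming check, and a callback that captures only the span rather than the request or response). It is not the code from the referenced PR; the contextvar name is made up, and the span is assumed to be set by an earlier handler running in the same async context.

# Rough illustration of the approach described above -- NOT the code from the
# referenced PR. The contextvar name is hypothetical; the span is assumed to
# be set by an earlier handler running in the same async context.
import contextvars

from aiohttp import web

_request_span: contextvars.ContextVar = contextvars.ContextVar("dd_request_span", default=None)


async def on_prepare(request: web.Request, response: web.StreamResponse) -> None:
    span = _request_span.get()
    if span is None:
        return

    # Only defer for genuine streaming responses: web.Response subclasses
    # web.StreamResponse, so a bare isinstance() check would match everything.
    if isinstance(response, web.StreamResponse) and not isinstance(response, web.Response):
        # The callback closes over the span only -- not the request or the
        # response -- so those can be freed as soon as the handler is done.
        request.task.add_done_callback(lambda _task: span.finish())
    else:
        span.finish()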
wantsui reopened this Dec 11, 2024
@wantsui
Collaborator

wantsui commented Dec 11, 2024

Hello! I'm reopening this issue given #11518 since I'm trying to see how we can test the change.

Regarding #8108 (comment)

Then keep hitting it with a lot of requests and watch the memory footprint grow in a linear fashion without bounds.

@agronholm, if this is still an issue, can you share some details about:

  • How long did you keep the application running before this behavior appeared?
  • Roughly how many requests does "a lot" amount to? (a rough range of volume is enough)
  • Did you run this inside a Docker container? If so, what version of Python did you target?

And for others in this thread running into this, can you provide examples of the sample application (or code snippets), along with answers to the questions above, so we can understand the conditions under which this memory leak happens and test for it?

If there are private details specific to your application that you don't want to share on GitHub, please open a support ticket via https://docs.datadoghq.com/help/ referencing this issue, and we'll investigate in parallel.

Thank you very much in advance!

@tabgok
Contributor

tabgok commented Dec 19, 2024

Help us replicate the issue!

We took the sample code in this issue and tried to replicate the behavior. The below pictures are from DD_VERSION=good and DD_VERSION=bad.

Good version: dd-trace-py 2.1.9
Bad version: dd-trace-py 2.4.0 (with all other versions/dependencies from the pip freeze above)

We used the Apache load generator to send a bunch of requests and saw no uptick in memory.

[Screenshots: memory usage graphs for the good and bad versions]

@seandstewart
Contributor

Hey folks -

Apologies for being a bit AWOL - it's launch season at work and I've been swamped.

I've extracted what I think are the relevant pieces of our applications and posted them here: https://github.com/seandstewart/memleak-example

Important bits:

  1. We use uvloop, structlog, and orjson (added here); a minimal uvloop setup sketch follows after this list.
  2. We deploy containers via k8s (Google Kubernetes Engine).
  3. We use the standard python slim images as the base.
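
If it helps with reproduction, the uvloop part of that setup can be mirrored with a few lines on top of the repro app from this issue. This is only a sketch assuming uvloop is installed; structlog and orjson are omitted since they don't affect the event loop, and the full setup is in the linked repository.

# Minimal sketch: the original repro app running on uvloop, as in the setup above.
import uvloop
from aiohttp import web
from ddtrace import tracer
from ddtrace.contrib.aiohttp import trace_app


async def hello(request: web.Request) -> web.Response:
    return web.Response(text="Hello, world")


uvloop.install()  # make asyncio use uvloop's event loop policy
app = web.Application()
app.add_routes([web.get("/", hello)])
trace_app(app, tracer, service="aiohttp-web")
web.run_app(app, host="0.0.0.0", port=8800)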

@wantsui
Collaborator

wantsui commented Dec 20, 2024

Thank you very much, @seandstewart! We'll review this with the team!
