Significant CPU consumption when agent is on #1177

Closed
maybephilipp opened this issue Jul 14, 2024 · 8 comments
Comments

@maybephilipp

The Python agent makes the Python process consume a lot of CPU while idle.

Description
When I turn the Python agent on, then after server start, once the first request comes in and some queries run, the Python process starts to eat 40-100% of CPU. If I disable the Python agent and restart the server, CPU usage is 0-3%. The high CPU usage with the Python agent stays the same even while idle (I'm 100% sure no requests are coming in).

Stack: Python, Flask, Gevent (tried both the plain PyWSGI server and gunicorn – the situation is exactly the same), SocketIO (with simple-websocket), Nginx.

Debugging the app showed that most of the CPU time is spent looping green threads. strace showed a large number of epoll_wait calls when newrelic is up.
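
One way to see which greenlets are blocking the loop is gevent's built-in monitor thread; a minimal sketch, assuming it is enabled before the hub starts (the 100 ms threshold is an arbitrary example):

import gevent

# Enable gevent's monitor thread (equivalent to GEVENT_MONITOR_THREAD_ENABLE=1);
# it periodically checks the hub and reports greenlets that block the loop.
gevent.config.monitor_thread = True
# Report any greenlet blocking the event loop for more than 100 ms.
gevent.config.max_blocking_time = 0.1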

Expected Behavior
No impact or low impact.

Troubleshooting or NR Diag results

NR Diag (nrdiag/linux/nrdiag_arm64 --suites python):

Executing following diagnostic task suites: Python Agent

Check Results
-------------------------------------------------
Info     Base/Env/CollectEnvVars [Gathered Environment variables of current shell.]
Success  Base/Config/Collect
Success  Base/Config/Validate
Success  Base/Log/Copy
Success  Base/Config/LicenseKey
Success  Base/Config/ValidateLicenseKey
Info     Base/Config/RegionDetect [1 unique New Relic region(s) detected from config.]
Info     Base/Env/HostInfo [Collected host information]
Success  Base/Env/DetectAWS
Success  Base/Env/RootUser
Success  Base/Log/ReportingTo
Success  Python/Config/Agent
Info     Python/Agent/Version [9.12.0]
Success  Base/Agent/EOL
Success  Base/Config/AppName
Success  Base/Collector/ConnectEU
Info     Base/Config/ValidateHSM [Local High Security Mode setting (false) for confi...]
Success  Python/Env/Dependencies
Success  Python/Env/Version
Success  Python/Requirements/PythonVersion
8 results not shown: 8 None
Local High Security Mode setting (false) for configuration:

	/app/newrelic.ini

To upload results and validate High Security Mode settings, run the Diagnostics CLI with the -a or -api-key flag.

See nrdiag-output.json for full results.

No Issues Found

Creating nrdiag-output.zip

Let me know how I can share the nrdiag-output.zip, as I assume there is some sensitive info in it.

py-spy

(time spent in hub.py just keeps growing over time)

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   77.81s    77.85s   run (gevent/hub.py:647)
  0.00%   0.00%   0.780s    0.780s   acquire_with_timeout (gevent/_threading.py:36)
  0.00%   0.00%   0.160s    0.160s   load_default_certs (ssl.py:597)
  0.00%   0.00%   0.040s    0.040s   _on_expiration (gevent/timeout.py:242)
  0.00%   0.00%   0.010s    0.010s   send (gevent/_socketcommon.py:725)
  0.00%   0.00%   0.010s    0.010s   __run_task (gevent/threadpool.py:173)
  0.00%   0.00%   0.000s    0.160s   request_encode_url (newrelic/packages/urllib3/request.py:99)
  0.00%   0.00%   0.000s    0.160s   harvest (newrelic/core/application.py:1514)
  0.00%   0.00%   0.000s    0.010s   send (engineio/async_drivers/_websocket_wsgi.py:26)
  0.00%   0.00%   0.000s    0.160s   run (sched.py:151)
  0.00%   0.00%   0.000s    0.160s   _harvest_default (newrelic/core/agent.py:648)
  0.00%   0.00%   0.000s    0.780s   get (gevent/_threading.py:207)
  0.00%   0.00%   0.000s    0.160s   _bootstrap_inner (threading.py:1038)
  0.00%   0.00%   0.000s    0.160s   send_request (newrelic/common/agent_http.py:446)
  0.00%   0.00%   0.000s    0.160s   urlopen (newrelic/packages/urllib3/connectionpool.py:715)
  0.00%   0.00%   0.000s    0.010s   run (gevent/threadpool.py:206)
  0.00%   0.00%   0.000s    0.160s   _validate_conn (newrelic/packages/urllib3/connectionpool.py:1058)

austin-tui

main process:

  OWN    TOTAL    %OWN   %TOTAL  FUNCTION
  35"     35"    100.0%  100.0%  Hub.run (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/hub.py:647)

any other thread:

  OWN    TOTAL    %OWN   %TOTAL  FUNCTION
  00"     05"      0.0%    6.2%  _WorkerGreenlet.run (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/threadpool.py:195)                                                                        │
  00"     05"      0.0%    6.2%  Queue.get (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:207)                                                                                  │
  00"     05"      0.0%    6.2%  _Condition.wait (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:86)                                                                             │
  05"     05"      6.2%    6.2%  acquire_with_timeout (/root/.cache/pypoetry/virtualenvs/fin-back-dr-9TtSrW0h-py3.11/lib/python3.11/site-packages/gevent/_threading.py:36)

cProfile snakeviz

[cProfile snakeviz screenshot attached in the original issue]

The most interesting part: strace

After a lot of investigation I found a Stack Overflow question (https://stackoverflow.com/questions/58973698/uwsgi-with-gevent-takes-up-100-cpu), which led me to the idea of monitoring syscalls. It turned out that something in my Python app was hammering the server with epoll_wait and getpid calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.62  117.284295      132975       882       582 futex
  2.98    3.730401           3   1183701           epoll_pwait
  2.87    3.601322           3   1183853           getpid
  0.51    0.644368      107394         6         4 restart_syscall
  0.00    0.002443          29        84           sendto
  0.00    0.002401           9       260        15 recvfrom
  0.00    0.001816           7       255        11 newfstatat
  0.00    0.001388          10       133           write
  0.00    0.000871           2       318        16 read
  0.00    0.000658           5       130       108 epoll_ctl
  0.00    0.000632           2       211           lseek
  0.00    0.000591          19        30           mmap
  0.00    0.000463          17        26           munmap
  0.00    0.000427          10        42           close
  0.00    0.000408          18        22         2 openat
  0.00    0.000346          31        11           ppoll
  0.00    0.000307          18        17         5 connect
  0.00    0.000293          29        10           getdents64
  0.00    0.000240          20        12           socket
  0.00    0.000211           5        37           getsockopt
  0.00    0.000205          17        12         6 accept4
  0.00    0.000175          43         4           clone
  0.00    0.000139           5        26         4 ioctl
  0.00    0.000127           5        22           times
  0.00    0.000119           7        15           setsockopt
  0.00    0.000069           4        16           getsockname
  0.00    0.000063           4        15           rt_sigprocmask
  0.00    0.000058           5        10           getrandom
  0.00    0.000046           5         8           gettid
  0.00    0.000020           6         3           madvise
  0.00    0.000012           6         2           fcntl
  0.00    0.000012           3         4           rseq
  0.00    0.000007           3         2           epoll_create1
  0.00    0.000004           1         4           set_robust_list
  0.00    0.000003           0         4           getpeername
  0.00    0.000000           0         2           getuid
  0.00    0.000000           0         2           geteuid
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         2           recvmsg
  0.00    0.000000           0         1           sendmmsg
------ ----------- ----------- --------- --------- ----------------
100.00  125.274940          52   2370195       753 total

Green threads dump:

https://pastebin.com/mVrqs4VY

Done with:

import signal
import gevent.util

def report_greens(*args, **kwargs):
    # Dump all greenlet/thread stacks to stderr
    gevent.util.print_run_info()

signal.signal(signal.SIGUSR1, report_greens)
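
(The dump is then triggered by sending SIGUSR1 to the worker process, e.g. kill -USR1 <pid>.)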

Steps to Reproduce
Will try to create one.

Your Environment

  • gevent version: 24.2.1 (pypi via poetry)
  • Python version: Python 3.11.2 installed as base docker image (https://hub.docker.com/_/python)
  • Operating System: AWS EC2 (Ubuntu 22.04.3 LTS aarch64) (uname -r: 6.5.0-1020-aws)
  • Docker container OS: Debian GNU/Linux 12 (bookworm)

Additional context
nothing

@maybephilipp changed the title from "Python: Significant CPU consumption when agent is on" to "Significant CPU consumption when agent is on" on Jul 14, 2024
@maybephilipp
Author

I tried to create a repro repository mimicking what is happening in my actual code and... maybe I'm missing something, but the repro code doesn't produce the same issue. I'm kinda tired and bleary-eyed already, as I've been investigating this for 3 days...

Could someone from newrelic point me to other ways of debugging newrelic? Maybe some extra debug logs that could show, for example, very frequent execution of some newrelic functionality? My guess is that newrelic's green thread is ticking very frequently... Disabling newrelic on my prod for now... No idea how to solve this.

@ianfitzpatrick

ianfitzpatrick commented Aug 6, 2024

Let me know if you'd find a repro useful and I can try to produce one, but I'm also seeing this. It is most pronounced in CPU-constrained environments: I didn't notice it locally, but once I deployed to Google Cloud Run it quickly maxed out the CPU.

  • Python New Relic Agent 9.11.0
  • Python 3.12.4
  • Django 4.2
  • Gunicorn 22.0.0
  • Gevent 24.2.1

Invoking Gunicorn like:

gunicorn --pythonpath /my_django_app --chdir /my_django_app --worker-tmp-dir /dev/shm --worker-class gevent --workers=5 --max-requests=1000 --access-logfile '-' --error-logfile '-' --bind 0.0.0.0:8000 wsgi:application

Starting New Relic in my wsgi.py file by simply doing:

import newrelic.agent
newrelic.agent.initialize()
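
A fuller wsgi.py along these lines might look like the sketch below; the config-file path and the explicit WSGIApplicationWrapper call are illustrative assumptions rather than details from this report.

import newrelic.agent

# Example path only; point this at your real newrelic.ini
newrelic.agent.initialize('/my_django_app/newrelic.ini')

from django.core.wsgi import get_wsgi_application

application = get_wsgi_application()
# Explicitly wrap the WSGI app; starting the process via
# `newrelic-admin run-program gunicorn ...` provides equivalent instrumentation.
application = newrelic.agent.WSGIApplicationWrapper(application)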

Hope that helps. A single request will trigger it in the case of Cloud Run.

I have New Relic turned off for my service that uses gevent worker model until this is resolved, so would be great if it gets figured out!

@lrafeei
Contributor

lrafeei commented Aug 6, 2024

Thanks for the information to help with the reproduction! Looking into this.

@ianfitzpatrick

ianfitzpatrick commented Sep 2, 2024

Hi there -- Any luck looking into this? Would love to get back to instrumenting our gevent-worker gunicorn processes again.

@hmstepanek
Contributor

Hey, I'm picking this ticket up from Lalleh. That getpid and epoll info is intriguing. I see there is this patch https://github.com/newrelic/newrelic-python-agent/blob/main/newrelic/hooks/coroutines_gevent.py#L17-L29 and I wonder if it could be causing the issue. Could you try running with this branch and see if that fixes the issue?

@hmstepanek
Contributor

Besides that patch I mentioned, which does seem suspicious (I also see calls to sleep in the green threads dump), I've been trying to set up a repro of this issue and haven't been able to. Along the same lines as maybephilipp was thinking, it looks like the process is spending a lot of time waiting for the CPU to become available again (i.e. waiting on other threads). We do have a harvest thread that runs every 5s to push certain data up to New Relic. If this harvesting is causing issues, you could limit the data that is being harvested and see if that helps. The following configuration settings turn off harvesting of various data, in case the data harvest is heavy enough to cause a bottleneck on the threads:

  • distributed_tracing.enabled = false
  • application_logging.enabled = false
  • event_harvest_config.harvest_limits.analytic_event_data = 0
  • event_harvest_config.harvest_limits.custom_event_data = 0
  • event_harvest_config.harvest_limits.ml_event_data = 0
  • event_harvest_config.harvest_limits.span_event_data = 0
  • browser_monitoring.enabled = false

What are the CPU constraints of the system where you are seeing this issue? Along those same lines: how long does each request take to complete (i.e. do you have long-running requests, >0.5s)? What kinds of things does the application you are monitoring do? Does it talk to SQL, do lots of logging, create custom events, talk to other servers, or do anything CPU-, memory-, or network-intensive?

I wonder if this is just a case where the additional overhead of running our monitoring is pushing your system over the edge, especially since you say you are only able to reproduce it on CPU-constrained systems. I am able to get the CPU usage fairly high on a CPU-constrained system as well, but it doesn't stay there the way you describe, which is why I'm wondering whether there's something specific inside your application that is contributing to this.

It would really help to have a reproduction. If you could post a repro repo with a Docker container and the CPU constraints you are running with, I think it would go a long way toward getting to the bottom of this.

@workato-integration

All attempts at reproducing this issue failed, or not enough information was available to reproduce the issue. Reading the code produces no clues as to why this behavior would occur. If more information appears later, please reopen the issue.
