
Memory/retention problem with thread tracing and very-long-lived threads #2418

Closed
bensheldon opened this issue Jan 27, 2024 · 4 comments
Labels: bug, community

Comments


bensheldon commented Jan 27, 2024

Description

I'm debugging a memory retention problem in a Rails application. This is half a question and half a hypothesis: it seems like instrumentation.thread.tracing can lead to memory growth when it is inserted into a very long-lived thread; does that seem possible?

This was debugged on v9.7.0 of this gem.

Expected Behavior

I imagine that NewRelic::Agent::Transaction should have some mechanism to purge/flush/remove old NewRelic::Agent::Transaction::DatastoreSegment objects on a regular basis if the transaction is held open indefinitely while work happens within it.

Troubleshooting

I used the Sheap gem to explore the application's heap. Diffing two heap dumps, these are the top 10 sources of retained objects:

irb(main):028> diff.retained.group_by { |o| "#{o.file}:#{o.line}" }.transform_values { |objects| objects.map(&:memsize).sum }.sort_by { |k,v| v }.reverse.first(10).to_h
=>
{"/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/tracer.rb:300"=>11232320,
 "/bundle/ruby/3.2.0/gems/activesupport-7.1.3/lib/active_support/core_ext/object/deep_dup.rb:16"=>10250720,
 "/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/arel/collectors/plain_string.rb:7"=>9777736,
 "/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/querying.rb:69"=>9243520,
 "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/transaction_metrics.rb:13"=>6404000,
 "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/transaction/abstract_segment.rb:100"=>6187984,
 "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/stats.rb:18"=>5808168,
 "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/transaction/datastore_segment.rb:67"=>5609760,
 "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/datastores/metric_helper.rb:26"=>5580000,
 "/bundle/ruby/3.2.0/gems/activerecord-7.1.3/lib/active_record/result.rb:177"=>4747744 }

That top source is NewRelic::Agent::Transaction::DatastoreSegment. Inspecting one of those objects below, the GC has marked it old, and thus it is more likely to be retained over time:

irb(main):029> diff.retained.select { |o| o.file == "/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/tracer.rb" && o.line == 300 }.first
=> <OBJECT 0x7f183fb60018 NewRelic::Agent::Transaction::DatastoreSegment (15 refs)>
irb(main):030> _.data
=>
{"address"=>"0x7f183fb60018",
 "type"=>"OBJECT",
 "shape_id"=>5105,
 "slot_size"=>320,
 "class"=>"0x7f183b6674e8",
 "ivars"=>35,
 "references"=>
  ["0x7f183b0a9cb8",
   "0x7f183ebbd8b8",
   "0x7f183f8370a8",
   "0x7f1836e17c60",
   "0x7f183acc4e40",
   "0x7f183966f410",
   "0x7f183ab80318",
   "0x7f182ea52e98",
   "0x7f1836f19668",
   "0x7f183fb60a18",
   "0x7f1836f19578",
   "0x7f183fb60158",
   "0x7f183966a690",
   "0x7f183966f398",
   "0x7f1838d949f8"],
 "file"=>"/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/tracer.rb",
 "line"=>300,
 "method"=>"new",
 "generation"=>147,
 "memsize"=>320,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

This is the path back to the root of the heap showing why it's being retained:

irb(main):003> diff.after.find_path diff.after.at("0x7f183fb60018")
=>
[<ROOT end_proc (3459 refs)>,
 <DATA 0x7f18398daec0 proc (3 refs)>,
 <OBJECT 0x7f182ea53708 Segment::Analytics::Client (5 refs)>,
 <DATA 0x7f18398d9e80 VM/thread (9 refs)>,
 <DATA 0x7f18398d9ac0 proc (3 refs)>,
 <IMEMO 0x7f18398d9ae8 env (6 refs)>,
 <OBJECT 0x7f183fb60a18 NewRelic::Agent::Transaction (16 refs)>,
 <ARRAY 0x7f1839626d78 (13 refs)>,
 <OBJECT 0x7f183fb60018 NewRelic::Agent::Transaction::DatastoreSegment (15 refs)>]
Details of all those objects:
=> <DATA 0x7f18398daec0 proc (3 refs)>
irb(main):005> diff.after.at("0x7f18398daec0").data
=>
{"address"=>"0x7f18398daec0",
 "type"=>"DATA",
 "shape_id"=>0,
 "slot_size"=>40,
 "class"=>"0x7f1840fdd5e0",
 "struct"=>"proc",
 "references"=>["0x7f182ea53708", "0x7f183b0ab338", "0x7f18398daee8"],
 "file"=>"/bundle/ruby/3.2.0/gems/analytics-ruby-2.4.0/lib/segment/analytics/client.rb",
 "line"=>33,
 "method"=>"at_exit",
 "generation"=>147,
 "memsize"=>80,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb(main):006> diff.after.at("0x7f182ea53708").data
=>
{"address"=>"0x7f182ea53708",
 "type"=>"OBJECT",
 "shape_id"=>6351,
 "slot_size"=>80,
 "class"=>"0x7f183b0ba1f8",
 "ivars"=>7,
 "references"=>["0x7f18398db2d0", "0x7f183af49620", "0x7f18398db258", "0x7f182ea536b8", "0x7f18398d9e80"],
 "file"=>"/bundle/ruby/3.2.0/gems/analytics-ruby-2.4.0/lib/segment/analytics.rb",
 "line"=>23,
 "method"=>"new",
 "generation"=>147,
 "memsize"=>80,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb(main):007> diff.after.at("0x7f18398d9e80").data
=>
{"address"=>"0x7f18398d9e80",
 "type"=>"DATA",
 "shape_id"=>5840,
 "slot_size"=>40,
 "class"=>"0x7f1840fdad40",
 "struct"=>"VM/thread",
 "references"=>
  ["0x7f183d768b10",
   "0x7f183dfd30c0",
   "0x7f18398d9ac0",
   "0x7f184409a890",
   "0x7f1844099080",
   "0x7f18398d9a48",
   "0x7f18398d99a8",
   "0x7f184407ccf0",
   "0x7f183dfd30c0"],
 "file"=>"/bundle/ruby/3.2.0/gems/analytics-ruby-2.4.0/lib/segment/analytics/client.rb",
 "line"=>190,
 "method"=>"new",
 "generation"=>147,
 "memsize"=>472,
 "flags"=>{"uncollectible"=>true, "marked"=>true}}
irb(main):008> diff.after.at("0x7f18398d9ac0").data
=>
{"address"=>"0x7f18398d9ac0",
 "type"=>"DATA",
 "shape_id"=>0,
 "slot_size"=>40,
 "class"=>"0x7f1840fdd5e0",
 "struct"=>"proc",
 "references"=>["0x7f183b666228", "0x7f183b65fef0", "0x7f18398d9ae8"],
 "file"=>"/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/tracer.rb",
 "line"=>424,
 "method"=>"proc",
 "generation"=>147,
 "memsize"=>80,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}
irb(main):009> diff.after.at("0x7f18398d9ae8").data
=>
{"address"=>"0x7f18398d9ae8",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"env",
 "references"=>["0x7f183fb60658", "0x7f183fb60a18", "0x7f183b62ed78", "0x7f18398d9c00", "0x7f18398d9ae8", "0x7f183b65ff68"],
 "file"=>"/bundle/ruby/3.2.0/gems/newrelic_rpm-9.7.0/lib/new_relic/agent/tracer.rb",
 "line"=>424,
 "method"=>"proc",
 "generation"=>147,
 "memsize"=>112,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}


Additional context

It looks like there was a possibly related leak reported and fixed earlier: #1936

I'm sorry about all the heap dump verbosity ❤️


@github-actions github-actions bot added the community To tag external issues and PRs submitted by the community label Jan 27, 2024
@hannahramadan (Contributor)

Hi @bensheldon! Thank you for sharing such a thorough report with your findings and thoughts. It really helps us get started! First a few questions:

  1. Are you seeing this behavior in a long-running transaction that eventually finishes, or a transaction that is truly 'held open indefinitely'? The agent isn't a good fit for indefinite transactions.
  2. We expect a long-running transaction or thread to use more memory, but are you seeing memory usage level off, or is it always increasing?
  3. You noted that the Segment gem has an effect on threads. Do you still see this problem if the Segment gem is removed?
  4. Does this problem go away if you turn off thread instrumentation? That config is instrumentation.thread.tracing: false.
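For question 4, the setting in newrelic.yml would look something like this (a sketch assuming the agent's usual nested YAML layout):

```yaml
# newrelic.yml — turning off thread instrumentation (question 4).
common: &default_settings
  instrumentation:
    thread:
      tracing: false
```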

You mentioned seeing a lot of NewRelic::Agent::Transaction::DatastoreSegment objects being retained. We think this is because these are the specific objects being created inside the long-running thread. The agent doesn't currently have a way to purge objects while a transaction is open, but this is an interesting idea and we're open to exploring it if it's something the community would like to see!

Based on the thought that the agent is continuously adding datastore segments in your case while the transaction is open, here are some options to manage memory:

  1. Wrap the call that kicks off long-running threads in NewRelic::Agent.disable_all_tracing.
  2. Lower the number of transaction trace nodes recorded in a single transaction trace using transaction_tracer.limit_segments (the default is 4000). If you enable debug-level logging for New Relic and see DEBUG: Segment limit of [segment_limit] reached, ceasing collection., you know you're hitting this limit.
  3. Lower the number of span events reported from a single harvest using span_events.max_samples_stored (the default is 2000).
  4. Disable middleware instrumentation to decrease the number of spans. That config is disable_middleware_instrumentation: true.
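As a sketch of options 2–4 in newrelic.yml (the lowered values are illustrative only, not recommendations; the defaults are 4000 and 2000 as noted above):

```yaml
# newrelic.yml — memory-limiting settings from options 2–4.
common: &default_settings
  transaction_tracer:
    limit_segments: 1000        # default 4000
  span_events:
    max_samples_stored: 500     # default 2000
  disable_middleware_instrumentation: true
```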

Additional note:

  • The referenced bug #1936 wasn't necessarily a memory leak. Originally, finished transactions weren't being seen on all threads, so the process didn't know to stop. While this has been fixed, the agent would still behave much as before in the case of an indefinite transaction.

Whether in documentation or code, there seems to be an opportunity to improve how New Relic handles long-running threads. So thank you again for bringing this to our attention!

@benoittgt

Hello @hannahramadan

Thanks a lot for your detailed answer. I was with @bensheldon when we looked at this issue on one of our apps.

We disabled thread tracing on the app and saw no improvement. Then we disabled New Relic for the specific impacted containers and still saw no change. I think the issue can be closed for the moment.

I will re-download some heap dumps without New Relic and Segment and check again what we have.

I would love to answer all of your questions, but I think the issue is somewhere else. :)

@hannahramadan (Contributor)

Thanks for the follow-up @benoittgt! I'll close this for now, but please reopen should this turn out to be relevant. Best of luck on the issue search ◡̈

@github-project-automation github-project-automation bot moved this from Triage to Code Complete/Done in Ruby Engineering Board Jan 31, 2024