Possible Memory Issue #3880

bmalinconico · 2024-08-30T15:14:08Z

Current behaviour
I have found that enabling auto-compaction in the Ruby GC is causing what appears to be random memory related bugs. This bug manifests in the following way:

Manifestation 1

TypeError: wrong argument type XXX (expected PG::Connection)

usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `exec': wrong argument type Set (expected PG::Connection) (TypeError)
	
from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:27:in `block in exec'
	
from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/contrib/pg/instrumentation.rb:145:in `block in trace'
	
from /usr/src/app/vendor/bundle/ruby/3.3.0/gems/datadog-2.3.0/lib/datadog/tracing/trace_operation.rb:206:in `block in measure'

Where XXX is any random built-in or app specific class.

This is raised on a call to exec or exec_params in the DD Postgres instrumentation.

Manifestation 2
This error is produced by the reproduction steps I will provide later.

PG::InvalidDatetimeFormat: ERROR:  invalid input syntax for type date: ""
CONTEXT:  unnamed portal parameter $1 = ''

I patched the DD PG instrumentation for exec_params and rescued the error with a pry session. The params array contained no empty values and a retry of the block succeeded

Manifestation 3
Occasional segfaults.

All of these errors feel like something is holding a memory reference that is being moved, resulting in random garbage getting passed down the stack and occasionally referencing a freed memory location.

Expected behaviour
Not an error!

Steps to reproduce
I was unable to reproduce on my local machine but a local containerized env may be able to reproduce it. I was only able to reproduce this in a container running on EC2, that machine is Linux x86_64.

Dockerfile to reproduce this image

FROM ruby:3.3.4
RUN apt-get update && apt-get install libjemalloc2 && rm -rf /var/lib/apt/lists/*
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

My running application is able to reproduce this error easily when I have the compacting garbage collector enabled due to the volume of activity. Reproducing this in a shell is much more time consuming as you need to (presumably) wait for some compaction.

I'll also acknowledge this may not be datadog, but I've tried to narrow it down as much as I can.

ENV['DATABASE_URL'] = 'setme'
require 'pg'
require 'datadog'
GC.auto_compact = true

Datadog.configure do |c|
  c.tracing.instrument :pg
end

conn = PG.connect(ENV.fetch('DATABASE_URL', nil))
loop do
  conn.exec_params("SELECT #{1_664.times.map { |i| "$#{i + 1}::date as f_#{i}" }.join(',')}", 1_664.times.map { Date.today })
  print '.'
end

I'm going to reiterate that reproducing this is annoying, since there is no small amount of luck trying to get a compacting GC run to trigger at the right time. Doing the above in concurrent fibers increased the odds of it happening (probably due to increased memory churn) however I am providing the smallest repo I can.

Environment

datadog version: Currently 2.3.0 but I was upgrading to 1.2.3 and enabling the profiler. When I downgraded to 1.2.3 the issue was still present if I turned on auto_compact
Configuration block (Datadog.configure ...):
Ruby version: ruby 3.3.4 (2024-07-09 revision be1089c8ec) +YJIT [x86_64-linux]
Operating system: Linux
Relevant library versions: PG - 1.5.6

The text was updated successfully, but these errors were encountered:

ivoanjo · 2024-08-30T17:45:16Z

Hey @bmalinconico thanks for reaching out and sorry you ran into this.

I was able to see the PG::InvalidDatetimeFormat showing up with your reproducer script 👀 on version 2.3.0 of the gem a few times (not very often). I wasn't able to trigger the other situations.

I did do a few experiments with this reproducer and the instrumentation does suspiciously seem to be the "straw that breaks the camel's back"; e.g. this does somewhat look to be a bug in pg that the extra memory pressure of tracing seems to trigger... But I don't have convincing evidence either way yet.

Just to confirm, you mentioned:

Currently 2.3.0 but I was upgrading to 1.2.3 and enabling the profiler. When I downgraded to 1.2.3 the issue was still present if I turned on auto_compact

Did you mean that you were upgrading from 1.23 and then downgraded again?

Occasional segfaults.

Can you share the output from the Ruby VM crash? It may help in tracking the issue down.

Or, even better, if you're able to get a core dump, it would be a really useful tool to help track this down.

bmalinconico · 2024-08-30T17:55:00Z

@ivoanjo thanks for confirming. The PG::InvalidDatetimeFormat is the only manifestation I've been able to reproduce on demand. All of these manifestations were collected from production failures, and correlated with slow a/b testing of settings.

Yes I first encounter this when I upgraded to 2.3.0 from 1.2.3 (among other changes). In order to isolate the issue I rolled everything back and started piecing it back together. GC compaction was triggering this even on 1.2.3 (same PG version).

Our tests suites have occasionally triggered the seg fault when auto compaction was on. I'll see if I can pull the crash log from that.

ivoanjo · 2024-09-02T08:23:24Z

Yes I first encounter this when I upgraded to 2.3.0 from 1.2.3 (among other changes). In order to isolate the issue I rolled everything back and started piecing it back together. GC compaction was triggering this even on 1.2.3 (same PG version).

I'm a bit confused about this part -- you've mentioned version 1.2.3 in a few places above. Is this version 1.23.something? 🤔 1.2.0 is quite old (July 2022) and there were no other point releases in that series.

Our tests suites have occasionally triggered the seg fault when auto compaction was on. I'll see if I can pull the crash log from that.

That would be great! 👀

bmalinconico · 2024-09-03T14:22:23Z

Sorry I was typing it out from memory. This was encountered when upgrading from 1.22.0 -> 2.3.0, when we encountered the error we rolled back to 1.22 and the issue was still present.

-    ddtrace (1.22.0)
-      datadog-ci (~> 0.8.1)
+    datadog (2.3.0)
       debase-ruby_core_source (= 3.3.1)
-      libdatadog (~> 7.0.0.1.0)
+      libdatadog (~> 11.0.0.1.0)
       libddwaf (~> 1.14.0.0.0)
       msgpack

I was not able to find a CI run with the output in it. I've got a branch with auto_compact enabled and I'm trying to get the failure to occur.

ivoanjo · 2024-09-04T08:29:18Z

I've got a branch with auto_compact enabled and I'm trying to get the failure to occur.

Thanks! Hopefully that will help shine some light on this.

bmalinconico added bug Involves a bug community Was opened by a community member labels Aug 30, 2024

bmalinconico changed the title ~~Possible Memory Corruption Issue~~ Possible Memory Issue Aug 30, 2024

bmalinconico closed this as completed Aug 30, 2024

bmalinconico reopened this Aug 30, 2024

ivoanjo self-assigned this Aug 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible Memory Issue #3880

Possible Memory Issue #3880

bmalinconico commented Aug 30, 2024 •

edited

Loading

ivoanjo commented Aug 30, 2024

bmalinconico commented Aug 30, 2024

ivoanjo commented Sep 2, 2024

bmalinconico commented Sep 3, 2024

ivoanjo commented Sep 4, 2024

Possible Memory Issue #3880

Possible Memory Issue #3880

Comments

bmalinconico commented Aug 30, 2024 • edited Loading

ivoanjo commented Aug 30, 2024

bmalinconico commented Aug 30, 2024

ivoanjo commented Sep 2, 2024

bmalinconico commented Sep 3, 2024

ivoanjo commented Sep 4, 2024

bmalinconico commented Aug 30, 2024 •

edited

Loading