Sentry Integration causes OOM kill #540

Open
MrSerth opened this issue Jan 24, 2024 · 5 comments
Labels: bug (Something isn't working), deployment (Everything related to our production environment), pending

Comments

@MrSerth (Member) commented Jan 24, 2024

Today, we had another occurrence of Poseidon being OOM killed.

Stack Trace

Jan 24 08:27:45 poseidon-terraform poseidon[624]: heap profile: 194: 751991664 [9538813: 1723733127104] @ heap/1048576
Jan 24 08:27:45 poseidon-terraform poseidon[624]: 1: 83886080 [41: 3439329280] @ 0x4a9d4e 0x4a977d 0x4a9959 0x719b1d 0x71aade 0x71c3b6 0x71beef 0x71aade 0x71c70f 0x71aade 0x71c70f 0x717fd3 0x717b99 0x717490 0x78e7d9 0x78ea3c 0x78f889 0x7784d2 0x77bb0d 0x77baf7 0x78b2fd 0x47b8df 0x78a5a5 0x78a56f 0xe46015 0x6e6a69 0x6e974e 0x6e5634 0x470781
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9d4d#011bytes.growSlice+0x8d#011#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:249
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a977c#011bytes.(*Buffer).grow+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:151
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9958#011bytes.(*Buffer).Write+0x58#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:179
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x719b1c#011encoding/json.uintEncoder+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:525
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c3b5#011encoding/json.arrayEncoder.encode+0xd5#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:849
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71beee#011encoding/json.sliceEncoder.encode+0x30e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:822
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717fd2#011encoding/json.(*encodeState).reflectValue+0x72#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:323
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717b98#011encoding/json.(*encodeState).marshal+0xb8#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:295
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71748f#011encoding/json.Marshal+0xcf#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:162
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78e7d8#011github.com/getsentry/sentry-go.envelopeFromBody+0x518#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/transport.go:196
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78ea3b#011github.com/getsentry/sentry-go.getRequestFromEvent+0xfb#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/transport.go:232
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78f888#011github.com/getsentry/sentry-go.(*HTTPTransport).SendEvent+0xc8#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/transport.go:359
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x7784d1#011github.com/getsentry/sentry-go.(*Client).processEvent+0x2b1#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/client.go:637
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77bb0c#011github.com/getsentry/sentry-go.(*Client).CaptureEvent+0x6c#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/client.go:447
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77baf6#011github.com/getsentry/sentry-go.(*Hub).CaptureEvent+0x56#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/hub.go:224
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78b2fc#011github.com/getsentry/sentry-go.(*Span).doFinish+0x15c#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/tracing.go:356
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x47b8de#011sync.(*Once).doSlow+0xbe#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:74
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a5a4#011sync.(*Once).Do+0x44#011#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:65
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a56e#011github.com/getsentry/sentry-go.(*Span).Finish+0xe#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/tracing.go:205
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110xe46014#011main.initServer.(*Handler).Handle.(*Handler).handle.func1+0x314#011/home/poseidon/go/pkg/mod/github.com/getsentry/[email protected]/http/sentryhttp.go:119
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e6a68#011net/http.HandlerFunc.ServeHTTP+0x28#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2136
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e974d#011net/http.serverHandler.ServeHTTP+0x8d#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2938
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e5633#011net/http.(*conn).serve+0x5f3#011#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2009


At the same time, Influx also showed erroneous behaviour:

Stack Trace

Jan 24 08:27:25 poseidon-terraform poseidon[624]: 2024/01/24 08:27:25 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.603586Z" level=debug code=200 duration=15.765001853s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/files user_agent="Faraday v2.9.0"
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.604447Z" level=debug code=200 duration=16.191098103s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/websocket user_agent=
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [outputs.influxdb_v2] When writing to [http://monitoring-terraform.compute.internal:8086]: Post "http://monitoring-terraform.compute.internal:8086/api/v2/write?bucket=telegraf&org=codeocean": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
Jan 24 08:27:38 poseidon-terraform poseidon[624]: 2024/01/24 08:27:38 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying

MrSerth added the bug and deployment labels on Jan 24, 2024
@MrSerth (Member, Author) commented Feb 7, 2024

We decided to focus on the Sentry part first (postponing InfluxDB). Maybe the OOM kill is related to the buffer of Sentry events awaiting transmission to the relay. If so, it could well be connected to the high memory usage we experienced during network outages.
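
For context, a minimal sketch of how the pending-event buffer on the Poseidon side could be bounded via the sentry-go HTTP transport. This is not Poseidon's actual configuration; the DSN, buffer size, and timeout below are placeholders:

package main

import (
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// The SDK's HTTP transport keeps events that have not yet been accepted
	// by the relay in an in-memory buffer. Bounding the buffer and the request
	// timeout limits how much memory such a backlog can occupy.
	transport := sentry.NewHTTPTransport()
	transport.BufferSize = 30           // placeholder; tune to the expected event rate
	transport.Timeout = 5 * time.Second // give up on slow relay responses

	if err := sentry.Init(sentry.ClientOptions{
		Dsn:       "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
		Transport: transport,
	}); err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	defer sentry.Flush(2 * time.Second)
}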

@mpass99 (Contributor) commented Mar 27, 2024

WIP Questions to investigate:

  • Are we using the wrong metric in watchMemoryAndAlert? HeapAlloc vs. HeapInuse. [1], [2] (See the sketch below.)
  • Only 751991664 bytes (~752 MB) were in use by Poseidon (HeapInuse). Why was it OOM killed?
  • Is the Sentry issue unrelated to the OOM kill and the Influx connection issues, given that Sentry is just marshaling some JSON with ok-ish heap usage?
  • If not the heap usage, what else led Poseidon to reach its 4 GB of RAM? Did it actually, or is this issue just based on the Sentry warning? (No data due to the retention policy.)
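
As a reference for the first question, a minimal sketch contrasting the two metrics, using a hypothetical watchHeap loosely modelled on watchMemoryAndAlert (threshold and interval are made-up values): HeapAlloc counts bytes in live heap objects, whereas HeapInuse counts bytes in in-use heap spans, so HeapInuse >= HeapAlloc and includes fragmentation; neither covers stacks or OS-level overhead.

package main

import (
	"log"
	"runtime"
	"time"
)

// watchHeap is a hypothetical stand-in for Poseidon's watchMemoryAndAlert.
// It logs both metrics so their divergence can be observed.
func watchHeap(threshold uint64, interval time.Duration) {
	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapAlloc: bytes of allocated, still-reachable heap objects.
		// HeapInuse: bytes in in-use heap spans (>= HeapAlloc).
		log.Printf("HeapAlloc=%d HeapInuse=%d Sys=%d", m.HeapAlloc, m.HeapInuse, m.Sys)
		if m.HeapInuse > threshold {
			log.Printf("heap usage above threshold (%d bytes)", threshold)
		}
	}
}

func main() {
	go watchHeap(2<<30, 30*time.Second) // e.g. alert above 2 GiB, every 30 s
	select {}
}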

We agreed to postpone this issue until the next occurrence.

@mpass99 (Contributor) commented Oct 11, 2024

According to POSEIDON-7A, the issue occurred again.
This time, Poseidon was not OOM killed. However, the heap profile again indicates high memory usage by the Sentry integration.

@MrSerth (Member, Author) commented Oct 13, 2024

You're right, the issue occurred once again. I checked the Sentry Relay and noticed that it was complaining about too little RAM (3899297792 / 4098719744 bytes, i.e. 95.13% >= 95.00%). I've fixed at least this part by increasing the instance's RAM from 4 GB to 8 GB and will keep observing the situation. Hence, it seems the relay wasn't able to receive further events, which in turn caused issues with Poseidon (where the collected data was still held in a queue).

@MrSerth (Member, Author) commented Nov 3, 2024

During the remaining duration of the recent Python course, we noticed only one other occurrence of the issue. For this second occurrence, on October 16th, I was able to gather additional insights: Sentry had a (major) outage, causing issues with the delivery of events upstream. Hence, our Sentry Relay collected all events that couldn't be sent. After some time, when the Sentry Relay hit its memory limit once again, Poseidon also reported excessive memory usage (but no OOM kill).

My suspicion: we only see high memory usage of the Sentry integration when events aren't forwarded correctly (a backlog building up either on the Poseidon host or on the Sentry Relay host).
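
A possible mitigation matching this suspicion, purely as a hedged sketch and not something we have implemented: drop transaction events client-side once the local heap is already large, so a relay backlog cannot grow the in-process queue indefinitely. This assumes the installed sentry-go version exposes BeforeSendTransaction; the DSN, sample rate, and 1 GiB guard are arbitrary placeholders.

package main

import (
	"log"
	"runtime"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	err := sentry.Init(sentry.ClientOptions{
		Dsn:              "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder DSN
		TracesSampleRate: 0.2,                                              // arbitrary example rate
		// Drop transaction events while the local heap is already large,
		// e.g. because the relay is unreachable and the transport buffer
		// cannot drain.
		BeforeSendTransaction: func(event *sentry.Event, hint *sentry.EventHint) *sentry.Event {
			var m runtime.MemStats
			runtime.ReadMemStats(&m) // note: briefly stops the world; fine for a sketch
			if m.HeapInuse > 1<<30 { // arbitrary 1 GiB guard
				return nil // discard instead of queueing
			}
			return event
		},
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	defer sentry.Flush(2 * time.Second)
}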
