
[BUG] server crash (msg-gateway process) #2483

Open
ting-xu opened this issue Aug 6, 2024 · 4 comments
Labels: bug, planning
ting-xu commented Aug 6, 2024

OpenIM Server Version

3.7.0

Operating System and CPU Architecture

Linux (AMD)

Deployment Method

Docker Deployment

Bug Description and Steps to Reproduce

We have two machines deployed, using etcd for service discovery.

In our online production environment, the server crashed twice at different times; both crashes showed the same Prometheus metric changes before they happened:

  • QPS: msggw RPC dropped sharply (from ~500 to zero in about 1 minute)
  • Goroutines: msggw grew rapidly (from ~5K to 95K in 10 minutes)
  • Process memory: msggw grew rapidly (from ~100MB to 1.6GB in 10 minutes)

I spent some time digging into the code, and I suspect the root cause is in the implementation of
func (ws *WsServer) Run(done chan error) error
This function starts the websocket server and a SINGLE goroutine that processes all messages from three channels, one by one, sequentially. (Our version is 3.7.0, but the code is the same in 3.8.0.) A simplified sketch of this pattern is shown below.
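
For illustration, here is a minimal sketch of that pattern, assuming simplified channel and handler names (not the exact OpenIM identifiers): one goroutine drains three buffered channels in a select loop, so any handler that blocks stalls everything behind it.

```go
package main

import (
	"log"
	"time"
)

type Client struct{ ID string }

// runProcessor mirrors the structure described above: a single goroutine
// drains three buffered channels sequentially, so a handler that blocks
// (e.g. on a hanging RPC) backs up all three channels at once.
func runProcessor(register, unregister, kick chan *Client) {
	go func() {
		for {
			select {
			case c := <-register:
				log.Println("register", c.ID) // a blocked call here stalls the whole loop
			case c := <-unregister:
				log.Println("unregister", c.ID)
			case c := <-kick:
				log.Println("kick", c.ID)
			}
		}
	}()
}

func main() {
	register := make(chan *Client, 1000) // same buffer size the report mentions
	unregister := make(chan *Client, 1000)
	kick := make(chan *Client, 1000)
	runProcessor(register, unregister, kick)

	register <- &Client{ID: "demo"}
	time.Sleep(100 * time.Millisecond) // give the goroutine time to print
}
```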

Although I have no conclusive evidence, I believe something happened that blocked this goroutine; the register channel then quickly filled up with new messages (the channel buffer size is 1000). After that, each new incoming websocket request forced the server to create more and more goroutines that block on the channel write, memory kept growing, and finally the process crashed.

In our cloud environment, network jitter is not uncommon due to cloud-provider factors. So one possible scenario is: network jitter causes packet loss, and a gRPC request to the other machine's instance never gets a response. According to the code, the gRPC call has no timeout (the context is a private implementation with no deadline), so the call waits forever and blocks all subsequent channel messages. At the application layer, the gRPC dial does not set any keepalive option, so the receiver never learns that the TCP connection is gone; it just waits.
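
To make the failure mode concrete, here is a minimal sketch (the helper and call names are hypothetical, not the actual OpenIM code) of how giving each cross-node call a deadline would turn a silently lost peer into an error instead of an indefinite hang:

```go
package main

import (
	"context"
	"log"
	"time"
)

// withDeadline is a hypothetical helper: it bounds a single cross-node call,
// so a peer that stops answering surfaces as context.DeadlineExceeded instead
// of hanging the goroutine that drains the register channel.
func withDeadline(call func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return call(ctx)
}

func main() {
	err := withDeadline(func(ctx context.Context) error {
		// Stand-in for the real gRPC call; simulate a peer that never answers.
		<-ctx.Done()
		return ctx.Err()
	})
	log.Println("call returned:", err) // prints context.DeadlineExceeded after 5s
}
```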

As a quick fix, I suggest setting keepalive in the gRPC dial options (see the sketch below); a better improvement would be to change this single-goroutine processing model.
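
A sketch of the suggested dial-option fix, assuming gRPC's client-side keepalive (HTTP/2 pings) is what is meant by TCP keepalive above; the target and interval values are illustrative, not the project's actual configuration:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive enables client-side keepalive pings so a connection to a
// peer that silently disappeared is detected and closed, instead of leaving
// the in-flight call waiting forever.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // replace with real credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // send a ping after 10s of inactivity
			Timeout:             3 * time.Second,  // treat the connection as dead if no ack within 3s
			PermitWithoutStream: true,             // ping even when there are no active RPCs
		}),
	)
}
```

Note that the server side may need to allow these pings (via keepalive.EnforcementPolicy with PermitWithoutStream), otherwise it can close the connection with a "too_many_pings" GOAWAY.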

Screenshots Link

No response

ting-xu added the bug label on Aug 6, 2024
kubbot (Contributor) commented Aug 6, 2024

Hello! Thank you for filing an issue.

If this is a bug report, please include relevant logs to help us debug the problem.

Join slack 🤖 to connect and communicate with our developers.

skiffer-git (Member) commented:

[image attachment]

ting-xu (Author) commented Aug 7, 2024

Another important clue:
When the problem happened, both machines' msg-gateway processes started to behave abnormally at the same time.
Each showed the same Prometheus metric changes (QPS dropped to 0, goroutines and memory grew).

So the whole service went down, since the two machines were not independent of each other.

ting-xu (Author) commented Aug 28, 2024

Is there any plan or schedule for addressing this problem?
Currently, to mitigate the service instability it causes, we run only one msg-gw process instance on one of the two machines, while keeping the other processes running on both.
