
[BUG] server crash (msg-gateway process) #2483

Open
ting-xu opened this issue Aug 6, 2024 · 4 comments
Labels: bug, planning
ting-xu commented Aug 6, 2024

OpenIM Server Version

3.7.0

Operating System and CPU Architecture

Linux (AMD)

Deployment Method

Docker Deployment

Bug Description and Steps to Reproduce

We have two machines deployed, using etcd for service discovery.

In our online production environment, the server crashed twice at different times; both crashes showed the same Prometheus metric changes before they happened:

  • QPS: msggw RPC dropped sharply (from ~500 to zero in about 1 minute)
  • Goroutines: msggw grew rapidly (from ~5K to 95K in 10 minutes)
  • Process memory: msggw grew rapidly (from ~100MB to 1.6GB in 10 minutes)

I spent some time digging into the code, and I suspect the root cause is in the implementation of
func (ws *WsServer) Run(done chan error) error
This function starts the websocket server and a SINGLE goroutine that processes all messages from three channels, one by one, sequentially. (Our version is 3.7.0, but the code is the same in 3.8.0.) A simplified sketch of this pattern is shown below.
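
For illustration, here is a minimal sketch of that pattern, assuming simplified channel and handler names (not the exact OpenIM identifiers): one goroutine drains three buffered channels in a select loop, so any handler that blocks stalls everything behind it.

```go
package main

import (
	"log"
	"time"
)

type Client struct{ ID string }

// runProcessor mirrors the structure described above: a single goroutine
// drains three buffered channels sequentially, so a handler that blocks
// (e.g. on a hanging RPC) backs up all three channels at once.
func runProcessor(register, unregister, kick chan *Client) {
	go func() {
		for {
			select {
			case c := <-register:
				log.Println("register", c.ID) // a blocked call here stalls the whole loop
			case c := <-unregister:
				log.Println("unregister", c.ID)
			case c := <-kick:
				log.Println("kick", c.ID)
			}
		}
	}()
}

func main() {
	register := make(chan *Client, 1000) // same buffer size the report mentions
	unregister := make(chan *Client, 1000)
	kick := make(chan *Client, 1000)
	runProcessor(register, unregister, kick)

	register <- &Client{ID: "demo"}
	time.Sleep(100 * time.Millisecond) // give the goroutine time to print
}
```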

Although I have no conclusive evidence, I believe something happened that blocked this goroutine; the register channel then quickly filled up with new messages (the channel buffer size is 1000). After that, each new incoming websocket request forced the server to create more and more goroutines that block on the channel write, memory kept growing, and finally the process crashed.

In our cloud environment, network jitter is not uncommon due to cloud-provider factors. So one possible scenario is: network jitter causes packet loss, and a gRPC request to the other machine's instance never gets a response. According to the code, the gRPC call has no timeout (the context is a private implementation with no deadline), so the call waits forever and blocks all subsequent channel messages. At the application layer, the gRPC dial does not set any keepalive option, so the receiver never learns that the TCP connection is gone; it just waits.
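
To make the failure mode concrete, here is a minimal sketch (the helper and call names are hypothetical, not the actual OpenIM code) of how giving each cross-node call a deadline would turn a silently lost peer into an error instead of an indefinite hang:

```go
package main

import (
	"context"
	"log"
	"time"
)

// withDeadline is a hypothetical helper: it bounds a single cross-node call,
// so a peer that stops answering surfaces as context.DeadlineExceeded instead
// of hanging the goroutine that drains the register channel.
func withDeadline(call func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return call(ctx)
}

func main() {
	err := withDeadline(func(ctx context.Context) error {
		// Stand-in for the real gRPC call; simulate a peer that never answers.
		<-ctx.Done()
		return ctx.Err()
	})
	log.Println("call returned:", err) // prints context.DeadlineExceeded after 5s
}
```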

As a quick fix, I suggest setting keepalive in the gRPC dial options (see the sketch below); a better improvement would be to change this single-goroutine processing model.
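
A sketch of the suggested dial-option fix, assuming gRPC's client-side keepalive (HTTP/2 pings) is what is meant by TCP keepalive above; the target and interval values are illustrative, not the project's actual configuration:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// dialWithKeepalive enables client-side keepalive pings so a connection to a
// peer that silently disappeared is detected and closed, instead of leaving
// the in-flight call waiting forever.
func dialWithKeepalive(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // replace with real credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // send a ping after 10s of inactivity
			Timeout:             3 * time.Second,  // treat the connection as dead if no ack within 3s
			PermitWithoutStream: true,             // ping even when there are no active RPCs
		}),
	)
}
```

Note that the server side may need to allow these pings (via keepalive.EnforcementPolicy with PermitWithoutStream), otherwise it can close the connection with a "too_many_pings" GOAWAY.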

Screenshots Link

No response

ting-xu added the bug label on Aug 6, 2024
kubbot (Contributor) commented Aug 6, 2024

Hello! Thank you for filing an issue.

If this is a bug report, please include relevant logs to help us debug the problem.

Join slack 🤖 to connect and communicate with our developers.

skiffer-git (Member) commented:

[image attachment]

ting-xu (Author) commented Aug 7, 2024

Another important clue:
When the problem happened, both machines' msg-gateway processes started to behave abnormally at the same time.
Each showed the same Prometheus metric changes (QPS dropped to 0, goroutines and memory grew).

So the whole service went down, since the two machines were not independent of each other.

ting-xu (Author) commented Aug 28, 2024

Is there any plan or schedule for addressing this problem?
Currently, to mitigate the service instability it causes, we run only one msg-gw process instance on one of the two machines, while keeping the other processes running on both.
