TiFlash panics with Too many open files in the cloud GCP env #9663

Open
solotzg opened this issue Nov 21, 2024 · 4 comments

solotzg commented Nov 21, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  • Topology: 3-tidb-16C32G / 9-tikv-16C64G-2000G / 3-pd-4C15G-50G / 2-tiflash-16C128G-500G
  • Platform: GCP
  • Unknown workloads

2. What did you expect to see? (Required)

  • No panic

3. What did you see instead (Required)

The TiFlash process holds too many file descriptors (FDs). The FD count keeps growing, which causes queries to fail and eventually makes TiFlash panic. Most of the FDs are sockets, and a large number of those sockets are still open but cannot be found in /proc/net.

sh-5.1# ls -l /proc/1/fd/ | wc -l
295268
sh-5.1# ls -l /proc/1/fd/ | grep "eventfd" | wc -l
98368
sh-5.1# ls -l /proc/1/fd/ | grep "eventpoll" | wc -l
98391
sh-5.1# ls -l /proc/1/fd/ | grep "socket" | wc -l
98492
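
The counts above come from three separate `ls | grep` passes. As a rough sketch only, the same breakdown can be collected in one pass, together with the cross-check against /proc/net mentioned in the description. The function names below are hypothetical, the column positions for the /proc/net tables are assumptions based on the usual Linux layout, and only the tcp, udp, and unix tables are consulted, so sockets of other families (for example netlink) would also show up as "missing".

```python
import os
from collections import Counter

# Assumed position of the inode column in each /proc/<pid>/net table.
NET_TABLES = {"tcp": 9, "tcp6": 9, "udp": 9, "udp6": 9, "unix": 6}


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("anon_inode:"):
        # e.g. "anon_inode:[eventfd]" or "anon_inode:[eventpoll]"
        return target[len("anon_inode:"):].strip("[]")
    if target.startswith("pipe:["):
        return "pipe"
    return "file"


def socket_inode(target):
    """Return the inode from a "socket:[N]" target, or None otherwise."""
    if target.startswith("socket:[") and target.endswith("]"):
        return int(target[len("socket:["):-1])
    return None


def parse_net_inodes(text, column):
    """Collect the inode column from the text of one /proc/net table."""
    inodes = set()
    for line in text.splitlines()[1:]:  # the first line is the header
        fields = line.split()
        if len(fields) > column and fields[column].isdigit():
            inodes.add(int(fields[column]))
    return inodes


def inspect_pid(pid):
    """Return (counts per FD category, socket inodes absent from the tables)."""
    fd_dir = f"/proc/{pid}/fd"
    counts, sockets = Counter(), set()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the FD was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
        inode = socket_inode(target)
        if inode is not None:
            sockets.add(inode)
    known = set()
    for table, column in NET_TABLES.items():
        try:
            with open(f"/proc/{pid}/net/{table}") as f:
                known |= parse_net_inodes(f.read(), column)
        except OSError:
            continue  # table not available on this host
    return counts, sockets - known
```

Calling `inspect_pid(1)` inside the TiFlash container would correspond to the `/proc/1/fd/` listing above. Reading `/proc/<pid>/net/` rather than `/proc/net/` is a deliberate choice so that the tables come from the target process's network namespace.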

Other

So far, this problem has not been observed in AWS environments.

4. What is your TiFlash version? (Required)

v7.5.3

solotzg commented Nov 25, 2024

After disabling MPP, the number of socket FDs stops growing. This suggests a potential bug in the MPP implementation.

set global tidb_allow_fallback_to_tikv = "tiflash";
set global tidb_allow_mpp = 0;
set global tidb_allow_tiflash_cop = 1;

solotzg added the affects-7.5 label (This bug affects the 7.5.x (LTS) versions.) and removed the may-affects-7.5 label on Dec 6, 2024

solotzg commented Dec 23, 2024

grpc/grpc#32538 (comment)

For example, earlier versions of gRPC had an fd leak with the epollex poller, which was removed in v1.46. It may be too late to diagnose some of those years-old issues.

solotzg commented Dec 23, 2024

grpc/grpc#20418 (comment)

The details were posted to the mailing list: Patch Releases for CVE-2023-4785, covering gRPC Core, C++, Python, and Ruby.

Lack of error handling in the TCP server in Google's gRPC starting version 1.23 on posix-compatible platforms (ex. Linux) allows an attacker to cause a denial of service by initiating a significant number of connections with the server. Note that gRPC C++ Python, and Ruby are affected, but gRPC Java, and Go are NOT affected.

The following set of releases contain the fix:

solotzg commented Dec 31, 2024

In client-c (conn.h#L23), the channel argument grpc.max_reconnect_backoff_ms is set to 3 s by default. In certain edge cases, gRPC relies on backup pollers that poll FDs every 5 seconds (DEFAULT_POLL_INTERVAL_MS). If GRPC_ARG_MIN_RECONNECT_BACKOFF_MS is less than the poll interval, the channel will not reconnect after a network isolation event.
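
The condition described above can be expressed as a small check on the channel arguments. This is only an illustrative sketch in Python, not the client-c code: the helper names are hypothetical, the 5000 ms constant is taken from the comment above, and the option keys are the string forms of the gRPC channel arguments. Whether raising these values actually avoids the leak is not confirmed anywhere in this issue.

```python
# Backup poll interval in milliseconds, as quoted in the comment above.
BACKUP_POLL_INTERVAL_MS = 5000


def reconnect_options(min_backoff_ms, max_backoff_ms):
    """Build channel options for the reconnect backoff bounds."""
    if min_backoff_ms > max_backoff_ms:
        raise ValueError("min backoff must not exceed max backoff")
    return [
        ("grpc.min_reconnect_backoff_ms", min_backoff_ms),
        ("grpc.max_reconnect_backoff_ms", max_backoff_ms),
    ]


def options_below_poll_interval(options, poll_interval_ms=BACKUP_POLL_INTERVAL_MS):
    """Return the option keys whose value is shorter than the poll interval."""
    return [key for key, value in options if value < poll_interval_ms]


# A 3 s cap, as in the default described above, is flagged by the check.
current = reconnect_options(1000, 3000)
print(options_below_poll_interval(current))
```

In gRPC Python, a list of this shape is what the `options` parameter of `grpc.insecure_channel` accepts; in C++ the same keys correspond to the `GRPC_ARG_MIN_RECONNECT_BACKOFF_MS` and `GRPC_ARG_MAX_RECONNECT_BACKOFF_MS` macros.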
