Log streaming is affected when 1 of the multiple hosts is unreachable #16
As to an output becoming backlogged, that's what I suspect is happening internally (to fluentd). syslog_tls has a fixed 10s connect timeout when trying to establish a new connection. That matches the 10s interval between the 'Socket timeout' messages in your log snippet. If a syslog_tls output doesn't already have a viable connection, it will attempt to reconnect on every log message--good for avoiding lost messages; potentially bad for prolonged failures to connect to the remote host. There isn't any kind of circuit breaker or 'stop attempting for NN seconds' mechanism, so a failing output can easily end up with a backlog of log messages. Does host1's rsyslog continue to receive messages more often than at 10s intervals when host2 is unreachable? Or does that rsyslog experience matching ~10s pauses between batches of logs?
@zarqman Thanks for the quick response as always!
@brucelee138, yes, I suspect the 10s connect timeout is related to the 10s rsyslog pauses. I just took a look at the code for the send loop.
This loop runs in a single thread, so a delay in sending to one host will cause the other host to be slowed down as well. I'm somewhat guessing your goal is for host1 and host2 to operate redundantly, so that if one goes down, logs still make it to the other. For that to happen, the logs need to somehow be copied first and only then sent to the two outputs running on separate threads. I don't know enough about fluent's internals to know if that's possible, nor how to configure it. I can't find any evidence of official plugins using threads on their own, so at this point it doesn't seem appropriate to try to handle that kind of thing inside syslog_tls itself. That leads me to hope that fluent itself offers a solution. If you find one, I'd love to hear about it.
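As a rough illustration of the "copy first, then separate outputs" idea, here is a sketch of a fluentd config using the standard copy plugin with a separate buffer per store. Hosts, ports, match pattern, and buffer parameters are all illustrative, and whether syslog_tls cooperates with fluentd's buffering in a way that isolates the two destinations is exactly the open question in this thread:

```
<match app.**>
  @type copy
  # One <store> per destination; each gets its own buffer so (in theory)
  # a backlog toward host2 doesn't stall flushes toward host1.
  <store>
    @type syslog_tls
    host host1.example.com
    port 6514
    <buffer>
      flush_interval 1s
      flush_thread_count 2
      retry_max_interval 30s
    </buffer>
  </store>
  <store>
    @type syslog_tls
    host host2.example.com
    port 6514
    <buffer>
      flush_interval 1s
      flush_thread_count 2
      retry_max_interval 30s
    </buffer>
  </store>
</match>
```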
@zarqman Thanks for the detailed response.
@zarqman I have started a thread regarding this in the fluentd forum. I also wanted to understand the feasibility/effects in case we don't set a timeout and instead fail immediately when the host is unreachable (maybe by handling socket connects differently, etc.). Would it cause any side effects?
@brucelee138 Setting no timeout would cause things to hang indefinitely. To make things fail over quickly, a very low timeout can be used, as long as it's still high enough for connections to be properly established when everything is online. Intra-datacenter, that can be pretty low; across the internet, it'll need to be a bit higher.
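To make the low-connect-timeout idea concrete, here is a hypothetical Ruby helper (not part of syslog_tls; the name, host list, and timeout value are illustrative) that tries each host with a short connect timeout so an unreachable host fails fast instead of stalling for the full 10s:

```ruby
require "socket"

# Try hosts in order; return [host, socket] for the first one that accepts
# a TCP connection within `timeout` seconds, or nil if none do.
def first_reachable(hosts, port, timeout: 0.5)
  hosts.each do |host|
    begin
      # Socket.tcp (Ruby stdlib) accepts a connect_timeout keyword.
      return [host, Socket.tcp(host, port, connect_timeout: timeout)]
    rescue SystemCallError, SocketError, IOError
      next # refused, unreachable, or timed out: move on to the next host
    end
  end
  nil
end
```

The same tradeoff zarqman describes applies here: the timeout must stay above the real round-trip connect time of the healthy hosts, or the helper will skip hosts that are actually fine.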
@zarqman Due to a constraint on the IO.select call used, we were only able to reduce the timeout to a minimum of 1 second (which is still a large delay). Could you please suggest other alternatives?
@brucelee138 A per-host connection backoff might make sense when 2+ hosts are used (as in your case). One challenge is what to do internally during the backoff: are messages queued up, risking memory exhaustion, or discarded? Neither of those is a good option for a single host (which is the most common configuration), which is why the current behavior is just to keep trying. Splitting the connections across threads seems like the preferred way to address things, but my understanding of fluent's architecture suggests that belongs higher up in the stack, perhaps in a threading-aware copy plugin. Another alternative might be to spin up 3 copies of fluentd on localhost. The first accepts inbound connections (like now), but then forwards to two local copies of fluentd. Since they're localhost, there shouldn't ever be any network issues. Then set up the two added local fluentds to forward to the remote destinations: localA -> remoteA and localB -> remoteB. This should achieve the same result as internal threads.
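For reference, the per-host backoff idea could be sketched like this in Ruby. This is a hypothetical standalone class, not code from syslog_tls; the class name, base/max intervals, and doubling policy are all illustrative. Note it only decides *whether* to attempt a connect; the queue-vs-discard question raised above is deliberately left open:

```ruby
# Tracks connect failures per host and skips a host for an exponentially
# growing interval instead of retrying it on every log message.
class HostBackoff
  def initialize(base: 1.0, max: 60.0,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @base  = base   # first backoff interval, seconds
    @max   = max    # cap on the backoff interval, seconds
    @clock = clock  # injectable clock, handy for testing
    @state = Hash.new { |h, k| h[k] = { failures: 0, retry_at: 0.0 } }
  end

  # Should we attempt a connection to this host right now?
  def attempt?(host)
    @clock.call >= @state[host][:retry_at]
  end

  # Record a failed connect: double the wait, capped at @max.
  def failure(host)
    s = @state[host]
    s[:failures] += 1
    delay = [@base * (2**(s[:failures] - 1)), @max].min
    s[:retry_at] = @clock.call + delay
  end

  # Record a successful connect: reset the backoff for this host.
  def success(host)
    @state.delete(host)
  end
end
```

Because state is tracked per host, a down host2 would be skipped during its backoff window while host1 keeps getting connection attempts on every message, which is the isolation the thread is after.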
@zarqman Thanks for your valuable suggestion! |
@zarqman We were trying to adopt the solution you suggested (spinning up a copy of fluentd for each receiver), but we had concerns about the increase in footprint. Regarding the "per-host connection backoff" approach, wouldn't memory exhaustion/discarding already be happening even with the current "retry-always" logic when one of the receivers is down? Also, if possible, we could apply the backoff only when the number of hosts is greater than 1.
What kind of increase in footprint are you seeing? A quick check of a running fluentd here shows an RSS of 80MB, so 2 extra fluentds would be 160MB, maybe up to 2x 100MB depending on configuration. If high availability of logs is critical in your environment, I wouldn't think ~200MB is too much, although I obviously don't know the specifics of your setup.
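The three-fluentd relay layout described earlier could be sketched roughly as below. Each block is a separate fluentd instance's config; ports, hosts, and match patterns are illustrative:

```
# Front fluentd: accepts app logs, fans out over localhost.
<match app.**>
  @type copy
  <store>
    @type forward
    <server>
      host 127.0.0.1
      port 24225        # relay A
    </server>
  </store>
  <store>
    @type forward
    <server>
      host 127.0.0.1
      port 24226        # relay B
    </server>
  </store>
</match>

# Relay A fluentd: owns the connection to host1 only, so host2's outages
# can never block it.
<source>
  @type forward
  port 24225
</source>
<match app.**>
  @type syslog_tls
  host host1.example.com
  port 6514
</match>

# Relay B is identical, listening on 24226 and pointing at host2.
```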
We are using the fluent-plugin-syslog-tls output plugin to stream container logs in a Kubernetes cluster from fluentd to rsyslog with the following output config.
Log streaming from fluentd to the syslog (rsyslogd) hosts (host1 & host2) works fine with this config.
But as soon as host2 becomes unreachable, log streaming to host1 (still reachable) is drastically affected too. Logs still reach host1, but at a drastically reduced rate.
We enabled trace logs, and the only errors we see repeating in fluentd are the following.
We need help identifying why log streaming to a healthy host is affected when the other host goes down. We can provide more logs or run more tests.
fluentd version - 1.16.2
fluent-plugin-syslog-tls version - 2.1.0
Thanks in advance!