
Log streaming is affected when 1 of the multiple hosts is unreachable #16

brucelee138 opened this issue Oct 23, 2023 · 11 comments

We are using the fluent-plugin-syslog-tls output plugin to stream container logs in a Kubernetes cluster from fluentd to rsyslog, with the following output config:


    <match {node-audit.var.log.audit.audit.log}>
      @type copy

      <store>
        @type syslog_tls
        host <host1>
        port 514
        ca_cert /fluentd-out/ssl/ca.crt
        client_cert /fluentd-out/ssl/tls.crt
        client_key /fluentd-out/ssl/tls.key
        verify_cert_name true
        hostname_key nodename
        facility AUTH
        severity_key severity
        format json
      </store>
      <store>
        @type syslog_tls
        host <host2>
        port 514
        ca_cert /fluentd-out/ssl/ca.crt
        client_cert /fluentd-out/ssl/tls.crt
        client_key /fluentd-out/ssl/tls.key
        verify_cert_name true
        hostname_key nodename
        facility AUTH
        severity_key severity
        format json
      </store>
    </match>

Log streaming from fluentd to the syslog (rsyslogd) hosts (host1 and host2) works fine with this config.
But as soon as host2 becomes unreachable, log streaming to host1 (still reachable) is drastically affected as well: logs still reach host1, but at a much lower rate.

We enabled trace logs, and the only errors we see in fluentd are the following, repeating:

      | Incoming connnection from #<Addrinfo: [::ffff:10.0.10.4]:52184 TCP> to Async::HTTP::Protocol::HTTP1
2023-10-23 17:24:08 +0000 [debug]: #0 [in_prometheus]    18m: Async::HTTP::Server
      | Incoming connnection from #<Addrinfo: [::ffff:10.0.10.4]:52182 TCP> to Async::HTTP::Protocol::HTTP1
2023-10-23 17:24:14 +0000 [error]: #0 fluent/log.rb:404:error: Socket timeout during connect_write
2023-10-23 17:24:18 +0000 [debug]: #0 [in_prometheus]    19m: Async::HTTP::Server
      | Incoming connnection from #<Addrinfo: [::ffff:10.0.10.4]:53334 TCP> to Async::HTTP::Protocol::HTTP1
2023-10-23 17:24:18 +0000 [debug]: #0 [in_prometheus]    19m: Async::HTTP::Server
      | Incoming connnection from #<Addrinfo: [::ffff:10.0.10.4]:53350 TCP> to Async::HTTP::Protocol::HTTP1
2023-10-23 17:24:24 +0000 [error]: #0 fluent/log.rb:404:error: Socket timeout during connect_write

We need help identifying why log streaming to a healthy host is affected when the other host goes down. We can provide more logs or run more tests.

fluentd version - 1.16.2
fluent-plugin-syslog-tls version - 2.1.0

Thanks in advance!

zarqman (Owner) commented Oct 24, 2023

Each syslog_tls output stores its own data, including connections, so they should be operating without one affecting the other, as far as the internal data state is concerned. However, I don't know offhand how fluentd and/or the copy plugin handle things when just a single output is backlogged.

As to an output becoming backlogged, that's what I suspect is happening internally (to fluentd).

syslog_tls has a fixed 10s connect timeout when trying to establish a new connection. That matches the 10s interval between the 'Socket timeout' messages in your log snippet. If a syslog_tls output doesn't already have a viable connection, it will attempt to reconnect on every log message--good for avoiding lost messages; potentially bad for prolonged failures to connect to the remote host. There isn't any kind of circuit-breaker or 'stop attempting for NN seconds' type of thing. So, that failing output can easily end up with a backlog of log messages.
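
To illustrate, here's a minimal Ruby sketch of the reconnect-per-message behavior described above (not the plugin's actual code; `deliver` and its error handling are hypothetical stand-ins):

    require 'socket'

    CONNECT_TIMEOUT = 10 # fixed 10s timeout, matching the interval in the log snippet

    # With no circuit breaker, every message triggers a fresh connect attempt
    # while the remote host is down, blocking up to CONNECT_TIMEOUT each time.
    def deliver(host, port, msg)
      @sock ||= Socket.tcp(host, port, connect_timeout: CONNECT_TIMEOUT)
      @sock.write(msg)
    rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED, IOError
      @sock = nil # drop the dead connection; the next message retries the connect
      raise
    end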

Moreover, if fluentd and/or copy are executing any aspect of the multiple outputs in serial (vs. parallel), that could explain why host1 also experiences delays in logging when host2 is unreachable.

Does host1's rsyslog continue to receive messages more often than at 10s intervals when host2 is unreachable? Or does that rsyslog experience matching ~10s pauses between batches of logs?

brucelee138 (Author) commented Oct 25, 2023

@zarqman Thanks for the quick response as always!
You are right, rsyslog is indeed experiencing 10s pauses between batches of logs. Could this be related to the syslog_tls connect timeout of 10s (though that should only apply to the unreachable host, right)?

zarqman (Owner) commented Oct 28, 2023

@brucelee138, yes, I suspect the 10s connect timeout is related to the 10s rsyslog pauses.

I just took a look at the code for the copy plugin. It appears to be doing a simple loop on messages, going through each host in turn. Some pseudocode:

while msg = receive_message do
  send_message msg, to: host1
  send_message msg, to: host2
end

This loop runs in a single thread, so a delay in sending to one host will cause the other host to also be slowed down.

I'm somewhat guessing your goal is for host1 and host2 to operate redundantly, so that if one goes down, logs still make it to the other. For that to happen, the logs need to be copied first and only then handed to the two outputs running on separate threads (sketched below). I don't know enough about fluent's internals to know whether that's possible, nor how to configure it.
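
A minimal sketch of that thread-per-output idea in plain Ruby (not fluentd's API; receive_message and send_message are the same hypothetical helpers as in the pseudocode above):

    hosts  = %w[host1 host2]
    queues = hosts.to_h { |h| [h, Queue.new] }

    # One worker thread per destination: a stalled connect to host2 only
    # blocks host2's own queue, while host1's thread keeps draining its queue.
    workers = hosts.map do |host|
      Thread.new do
        while msg = queues[host].pop
          send_message msg, to: host
        end
      end
    end

    # The receive loop only copies each message into the per-host queues,
    # so it never blocks on a slow or unreachable destination.
    while msg = receive_message
      queues.each_value { |q| q << msg }
    end

Note the per-host queues here are unbounded; a real implementation would need to cap them, which is exactly the queue-vs-discard tradeoff that comes up later in this thread.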

I can't find any evidence of official plugins using threads on their own, so at this point it doesn't seem appropriate to try to handle that kind of thing inside syslog_tls itself. That leads me to hope that fluent itself offers a solution.

If you find a solution, I'd love to hear about it.

brucelee138 (Author) commented:

@zarqman Thanks for the detailed response.
Yes, we require 2 hosts for high availability. I will also explore the copy plugin to understand how multiple hosts are handled, and will update once we make a breakthrough.

brucelee138 (Author) commented:

@zarqman I have started a thread about this in the fluentd forum. I also wanted to understand the feasibility and effects of not setting a timeout and instead failing immediately when the host is unreachable (perhaps by handling socket connects differently). Would that cause any side effects?

zarqman (Owner) commented Dec 13, 2023

@brucelee138 Setting no timeout would cause things to hang indefinitely. To make things fail over quickly, a very low timeout can be used, as long as it's still high enough for connections to be established when everything is online. Intra-datacenter, that can be pretty low; across the internet, it will need to be a bit higher.
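
In plain Ruby terms, the technique looks like this (a sketch; syslog_tls itself currently hard-codes its timeout, and the 250ms value is purely illustrative):

    require 'socket'

    # A low connect timeout fails fast on an unreachable host, but must stay
    # above the normal TCP handshake time on a healthy network. A few hundred
    # ms is usually plenty within a datacenter; 2-3s is safer across the internet.
    sock = Socket.tcp('host2.example.com', 514, connect_timeout: 0.25)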

brucelee138 (Author) commented:

@zarqman Due to a constraint on the IO.select call used, we were only able to reduce the timeout to a 1-second minimum (which is still a large delay). Could you please suggest other alternatives?
Would a per-host connection backoff on unsuccessful connects be a good solution for this?

zarqman (Owner) commented Jan 16, 2024

@brucelee138 A per-host connection backoff might make sense when 2+ hosts are used (as in your case).

One challenge is what to do internally during the backoff: are messages queued up, risking memory exhaustion, or discarded? Neither is a good option for a single host (the most common configuration), which is why the current behavior is simply to keep trying.
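
For reference, per-host backoff state would look roughly like this (a sketch of the general technique, not a proposed syslog_tls implementation):

    # Per-host exponential backoff: after a failed connect, skip further
    # attempts to that host until its deadline passes. Messages arriving
    # during the backoff window must be either queued (bounded, to avoid
    # memory exhaustion) or dropped (the tradeoff described above).
    class HostBackoff
      def initialize(base: 1, max: 60)
        @base, @max = base, max
        @failures = Hash.new(0)
        @retry_at = Hash.new(Time.at(0))
      end

      def attempt?(host)
        Time.now >= @retry_at[host]
      end

      def record_failure(host)
        @failures[host] += 1
        @retry_at[host] = Time.now + [@base * 2**(@failures[host] - 1), @max].min
      end

      def record_success(host)
        @failures.delete(host)
        @retry_at.delete(host)
      end
    end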

Splitting the connections across threads seems like the preferred way to address things, but my understanding of fluent's architecture suggests that belongs higher up in the stack, perhaps in a threading-aware copy plugin.

Another alternative might be to spin up 3 copies of fluent on localhost. The first accepts inbound connections (as now), but then forwards to 2 additional local copies of fluent. Since they're on localhost, there shouldn't ever be any network issues between them. Then set up the 2 added local fluents to forward to the remote destinations: localA -> remoteA and localB -> remoteB. This should achieve the same result as internal threads.
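
In fluentd config terms, that topology might look roughly like this (a sketch using the core forward plugin; the ports and match patterns are illustrative):

    # Front fluentd: copy to two local relays over the forward protocol.
    # Both destinations are localhost, so neither store can stall on the network.
    <match node-audit.**>
      @type copy
      <store>
        @type forward
        <server>
          host 127.0.0.1
          port 24225
        </server>
      </store>
      <store>
        @type forward
        <server>
          host 127.0.0.1
          port 24226
        </server>
      </store>
    </match>

    # Relay A (a separate fluentd process): listens on 24225 and owns the only
    # connection to host1. Relay B mirrors this on 24226 for host2.
    <source>
      @type forward
      port 24225
    </source>
    <match node-audit.**>
      @type syslog_tls
      host <host1>
      port 514
      # same TLS/format options as in the original config above
    </match>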

brucelee138 (Author) commented:

@zarqman Thanks for your valuable suggestion!

brucelee138 (Author) commented:

@zarqman We were trying to adopt the solution you suggested (spinning up a copy of fluentd for each receiver), but we have concerns about the increase in footprint.
I had opened a ticket with fluentd pushing for a threading-aware copy plugin, but it was shot down with the suggestion to use a buffered plugin (I see there was already a request for buffer support in #8; the standard buffer knobs are sketched below).
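
For context, fluentd's buffered outputs expose retry/backoff behavior through a <buffer> section (a sketch of the standard core parameters; syslog_tls does not support buffering yet, which is what #8 asks for):

    <buffer>
      @type file
      path /fluentd-out/buffer/host1
      flush_interval 5s
      retry_type exponential_backoff
      retry_wait 1s
      retry_max_interval 60s
    </buffer>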

Regarding "per-host connection backoff" approach, would not the memory exhaustion/discard be already happening even with the current "retry-always" logic when 1 of the receiver is down? Also, if possible, we could consider this only when no. of hosts > 1.

zarqman (Owner) commented Mar 9, 2024

What kind of increase in footprint are you seeing? A quick check of a running fluentd here shows an RSS of 80MB, so 2 extra fluentd instances would be about 160MB, maybe up to 2x 100MB depending on configuration.

If high availability of logs is critical in your environment, I wouldn't think ~200MB is too much, although I obviously don't know the specifics of your setup.
