
[DISCUSS] Make otel as a default logsEngine #614

Closed
wants to merge 3 commits

Conversation

@hvaghani221
Contributor

The filelog receiver was recently promoted to beta (open-telemetry/opentelemetry-collector-contrib#15355).
Also, I have observed an increasing number of issues with the fluentd in_tail plugin: it simply stops collecting data and only resumes when the process is restarted manually. The root cause is still unknown. Some of the issues:

So, it may be a good time to make otel the default logsEngine.
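For anyone who wants to try this ahead of a default change, here is a minimal values.yaml sketch, assuming the chart's existing `logsEngine` value (the `splunkPlatform` endpoint and token below are placeholders, not real settings):

```yaml
# values.yaml (sketch): opt in to native OTel log collection instead of fluentd.
logsEngine: otel

# Illustrative destination settings; replace with your own.
splunkPlatform:
  endpoint: https://hec.example.com:8088/services/collector
  token: "00000000-0000-0000-0000-000000000000"
  index: main
```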

@hvaghani221 hvaghani221 requested review from a team as code owners December 21, 2022 06:28
@matthewmodestino

Yess!!! Please make this happen! It's time for us to push the filelog receiver to the forefront! We already use it with any Splunk Enterprise or Splunk Cloud customer looking to monitor Kubernetes!

@atoulme
Contributor

atoulme commented Jan 18, 2023

I'd like to see open-telemetry/opentelemetry-collector-contrib#17846 resolved first. To a lesser extent, open-telemetry/opentelemetry-collector-contrib#17308 should be fixed as well.

@hvaghani221
Contributor Author

Makes sense! I'll close the PR for now.

@hvaghani221 hvaghani221 deleted the otel_default_logs_engine branch January 19, 2023 05:48
@matthewmodestino

I really think we should be careful about taking on that first issue about filelog rotation as our problem. We need to research with the cloud providers about raising that file rotation limit, as the current setting is silly.

I would also say that making a file-based buffer for the persistent queue an option is needed as well.

https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue
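For reference, a rough sketch of what wiring up the persistent queue could look like in the collector config, based on that README (the `file_storage` directory and the `splunk_hec` settings below are illustrative, not the chart's actual rendered config):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: https://hec.example.com:8088/services/collector
    sending_queue:
      enabled: true
      # Point the queue at the storage extension so queued batches
      # survive a collector restart.
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [splunk_hec]
```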

@hvaghani221
Contributor Author

I really think we should be careful about taking on that first issue about filelog rotation as our problem. We need to research with the cloud providers about raising that file rotation limit, as the current setting is silly.

Actually, increasing the log's max size can make the problem I described in that issue worse.
For example, assume the max log size is 1GB and one pod is producing logs faster than the agent can handle (since the agent needs to process individual log lines). In this scenario, it will take much longer for the current file to be consumed, which delays consumption of the fresh batch.

It will be more visible on a fresh installation, since none of the log files would have been consumed yet.

I would also say that making a file-based buffer for the persistent queue an option is needed as well.

I am already working on that one (https://github.com/harshit-splunk/splunk-otel-collector-chart/tree/persistent_queue).

@matthewmodestino

matthewmodestino commented Jan 19, 2023

Actually, increasing the log's max size can make the problem I described in that issue worse.
For example, assume the max log size is 1GB and one pod is producing logs faster than the agent can handle (since the agent needs to process individual log lines). In this scenario, it will take much longer for the current file to be consumed, which delays consumption of the fresh batch.

Interesting, so you think leaving smaller chunks and just finding them faster is all we need?

If we can't keep up, could one not just increase resources on the collector?

Would the answer be somewhere in the middle, where rotation maybe doesn't go to 1Gi, but gets to a place where rotation isn't happening in a matter of seconds? I wonder if there's an impact to the node/kubelet of having to handle those rotation jobs at the speed of light... the rotation happening that fast also risks those files rotating off the node completely, which may undermine the bit of persistence we can work with in case of issues or failure...

Let me know if you think I should reach out to the cloud vendors to campaign for some control for users.

@hvaghani221
Contributor Author

Interesting, so you think leaving smaller chunks and just finding them faster is all we need?

Yup. Once any file is consumed, go ahead and consume the next available file. The list of available files will be refreshed at each poll_interval duration.

The example I gave earlier will not happen frequently; it is bursty in nature, so the agent will catch up eventually.
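For anyone following along, the relevant knob here is the filelog receiver's `poll_interval`; a minimal sketch (paths and values are illustrative, and 200ms is, as far as I know, the receiver's default):

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]
    start_at: beginning
    # How often the receiver re-globs `include` and picks up newly
    # rotated files; lower values find fresh files sooner.
    poll_interval: 200ms
```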

Would the answer be somewhere in the middle, where rotation maybe doesn't go to 1Gi, but gets to a place where rotation isn't happening in a matter of seconds?

Exactly, the max size should be large enough to give the agent some breathing room to catch up.
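For reference, on clusters where the kubelet configuration is under the user's control, container log rotation is governed by `containerLogMaxSize` and `containerLogMaxFiles`; a sketch with illustrative values (many managed providers don't expose these, which is why the cloud-vendor conversation matters):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Illustrative middle ground: big enough that rotation isn't happening
# every few seconds under bursty pods, well short of the 1Gi extreme.
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
```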

I wonder if there's an impact to the node/kubelet of having to handle those rotation jobs at the speed of light... the rotation happening that fast also risks those files rotating off the node completely, which may undermine the bit of persistence we can work with in case of issues or failure...

I don't have the expertise to answer this. But from my observation using the lsof command, the kubelet renames the old file and creates a new one with the same name. So, it shouldn't put too much pressure on the kubelet when we are talking about a frequency of hundreds of milliseconds.

@SriramDuvvuri

@harshit-splunk does this fix address issue fluent/fluentd#3882?

If yes, could you please let me know how I can test this and what parameters need to be set?
