
[DISCUSS] Make otel as a default logsEngine #614

Closed
wants to merge 3 commits

Conversation

@hvaghani221
Contributor

The filelog receiver was recently promoted to beta (open-telemetry/opentelemetry-collector-contrib#15355).
Also, I have observed an increasing number of issues with the fluentd in_tail plugin: it simply stops collecting data and only resumes when the process is restarted manually. The root cause is still unknown. Some of the issues:

So, it may be a good time to make otel the default logsEngine.
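For anyone who wants to try this ahead of a default change, here is a minimal values.yaml sketch, assuming the chart's existing `logsEngine` value (the `splunkPlatform` endpoint and token below are placeholders, not real settings):

```yaml
# values.yaml (sketch): opt in to native OTel log collection instead of fluentd.
logsEngine: otel

# Illustrative destination settings; replace with your own.
splunkPlatform:
  endpoint: https://hec.example.com:8088/services/collector
  token: "00000000-0000-0000-0000-000000000000"
  index: main
```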

@hvaghani221 hvaghani221 requested review from a team as code owners December 21, 2022 06:28
@matthewmodestino

Yess!!! Please make this happen! It's time for us to push the filelog receiver to the forefront! We already use it with any Splunk Enterprise or Splunk Cloud customer looking to monitor Kubernetes!

@atoulme
Contributor

atoulme commented Jan 18, 2023

I'd like to see open-telemetry/opentelemetry-collector-contrib#17846 resolved first. To a lesser extent, open-telemetry/opentelemetry-collector-contrib#17308 should be fixed as well.

@hvaghani221
Contributor Author

Makes sense! I'll close the PR for now.

@hvaghani221 hvaghani221 deleted the otel_default_logs_engine branch January 19, 2023 05:48
@matthewmodestino

I really think we should be careful about taking on that first issue about filelog rotation as our problem. We need to research with the cloud providers about raising that file rotation limit, as the current setting is silly.

I would also say that making a file-based buffer for the persistent queue an option is needed as well.

https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md#persistent-queue
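For reference, a rough sketch of what wiring up the persistent queue could look like in the collector config, based on that README (the `file_storage` directory and the `splunk_hec` settings below are illustrative, not the chart's actual rendered config):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  splunk_hec:
    token: "${SPLUNK_HEC_TOKEN}"
    endpoint: https://hec.example.com:8088/services/collector
    sending_queue:
      enabled: true
      # Point the queue at the storage extension so queued batches
      # survive a collector restart.
      storage: file_storage

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [splunk_hec]
```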

@hvaghani221
Contributor Author

I really think we should be careful about taking on that first issue about filelog rotation as our problem. We need to research with the cloud providers about raising that file rotation limit, as the current setting is silly.

Actually, increasing the log's max size can make the problem I described in that issue worse.
For example, assume the max log size is 1GB and one pod is producing logs faster than the agent can handle (since the agent needs to process individual log lines). In this scenario, it will take much longer for the current file to be consumed, which delays consumption of the fresh batch.

It will be more visible on a fresh installation, since none of the log files would have been consumed yet.

I would also say that making a file-based buffer for the persistent queue an option is needed as well.

I am already working on that one (https://github.com/harshit-splunk/splunk-otel-collector-chart/tree/persistent_queue).

@matthewmodestino

matthewmodestino commented Jan 19, 2023

Actually, increasing the log's max size can make the problem I described in that issue worse.
For example, assume the max log size is 1GB and one pod is producing logs faster than the agent can handle (since the agent needs to process individual log lines). In this scenario, it will take much longer for the current file to be consumed, which delays consumption of the fresh batch.

Interesting, so you think leaving smaller chunks and just finding them faster is all we need?

If we can't keep up, could one not just increase resources on the collector?

Would the answer be somewhere in the middle, where rotation maybe doesn't go to 1Gi, but gets to a place where rotation isn't happening in a matter of seconds? I wonder if there's an impact to the node/kubelet of having to handle those rotation jobs at the speed of light... the rotation happening that fast also risks those files rotating off the node completely, which may undermine the bit of persistence we can work with in case of issues or failure...

Let me know if you think I should reach out to the cloud vendors to campaign for some control for users.

@hvaghani221
Contributor Author

Interesting, so you think leaving smaller chunks and just finding them faster is all we need?

Yup. Once any file is consumed, go ahead and consume the next available file. The list of available files will be refreshed at each poll_interval duration.

The example I gave earlier will not happen frequently; it is bursty in nature, so the agent will catch up eventually.
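For anyone following along, the relevant knob here is the filelog receiver's `poll_interval`; a minimal sketch (paths and values are illustrative, and 200ms is, as far as I know, the receiver's default):

```yaml
receivers:
  filelog:
    include: [/var/log/pods/*/*/*.log]
    start_at: beginning
    # How often the receiver re-globs `include` and picks up newly
    # rotated files; lower values find fresh files sooner.
    poll_interval: 200ms
```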

Would the answer be somewhere in the middle, where rotation maybe doesn't go to 1Gi, but gets to a place where rotation isn't happening in a matter of seconds?

Exactly, the max size should be large enough to give the agent some breathing room to catch up.
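For reference, on clusters where the kubelet configuration is under the user's control, container log rotation is governed by `containerLogMaxSize` and `containerLogMaxFiles`; a sketch with illustrative values (many managed providers don't expose these, which is why the cloud-vendor conversation matters):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Illustrative middle ground: big enough that rotation isn't happening
# every few seconds under bursty pods, well short of the 1Gi extreme.
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
```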

I wonder if there's an impact to the node/kubelet of having to handle those rotation jobs at the speed of light... the rotation happening that fast also risks those files rotating off the node completely, which may undermine the bit of persistence we can work with in case of issues or failure...

I don't have the expertise to answer this. But from my observation using the lsof command, the kubelet renames the old file and creates a new one with the same name. So, it shouldn't put too much pressure on the kubelet when we are talking about a frequency of hundreds of milliseconds.

@SriramDuvvuri

@harshit-splunk does this fix address issue fluent/fluentd#3882?

If yes, could you please let me know how I can test this and what parameters need to be set?
