Fluentd in_tail "unreadable" file causes "following tail of <file>" to stop and no logs pushed #3614
Comments
We have encountered the same problem twice. Last month we upgraded fluentd from 1.13.3 (td-agent 4.2.0) to 1.14.3 (td-agent 4.3.0), and since then we have seen the problem. In our case, we are tracking more than 3 log files with the tail plugin, and only the file with the highest log flow has this problem. According to the metrics produced by the prometheus plugin, the file position always changes by at least 400000/sec and rotation happens at least once every 10 minutes. The input source is set as follows:
I will also share the record of the investigation when the problem occurs. First, fluentd suddenly, without warning, stops tracking new rotated files and only logs the following in the stdout log.
Normally, it would have been followed by an announcement that a new file was being tracked, as follows
However, we have not seen this announcement at all after the problem occurs. Also, when I view the file descriptors that fluentd is tracking, I see the following:
This is the rotated file, and we see that fluentd is still tracking the old file, not the new one. Also, looking at the position file, we see the following:
As far as the situation is concerned, it seems that something triggers fluentd to stop tracking new files altogether during rotate. We have about 300 servers running 24/7 and have only had two problems in the past month, so it seems to be pretty rare for problems to occur. However, since we haven't had any problems with fluentd 1.13.3 for 6 months, it seems likely that there was some kind of regression in fluentd 1.13.3 -> 1.14.3. We are planning to downgrade fluentd. |
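The comparison described in this comment (which file the tail is still attached to versus what the position file records) can be approximated with a few lines of Ruby. This is a minimal sketch, assuming the pos file line layout `path<TAB>hex offset<TAB>hex inode` that the migration script later in this thread also parses. `PosLine`, `parse_pos_line`, and `tail_status` are hypothetical helper names for illustration, not part of fluentd.

```ruby
# Hypothetical helpers for inspecting an in_tail pos file by hand.
PosLine = Struct.new(:path, :offset, :inode)

# Parse one pos file line: "<path>\t<hex offset>\t<hex inode>".
# Returns nil for lines that do not have all three fields.
def parse_pos_line(line)
  path, offset_hex, inode_hex = line.chomp.split("\t", 3)
  return nil if path.nil? || offset_hex.nil? || inode_hex.nil?
  PosLine.new(path, offset_hex.to_i(16), inode_hex.to_i(16))
end

# Compare what the pos file recorded with the file currently at that path.
#   :inode_mismatch -> the path now points at a different file (e.g. after rotation)
#   :behind         -> same file, but the recorded offset is smaller than the file size
#   :up_to_date     -> same file, recorded offset is at (or past) the end
#   :missing        -> nothing exists at the path (or the symlink is broken)
def tail_status(entry)
  stat = File.stat(entry.path)
  return :inode_mismatch if stat.ino != entry.inode
  return :behind if entry.offset < stat.size
  :up_to_date
rescue Errno::ENOENT
  :missing
end

# Example use (the pos file path is an assumption):
#   File.foreach("/var/log/td-agent/tail.pos") do |line|
#     entry = parse_pos_line(line)
#     puts "#{entry.path}: #{tail_status(entry)}" if entry
#   end
```

An entry that stays at `:inode_mismatch` long after a rotation would match the symptom reported here, where fluentd keeps following the old file.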
We downgraded td-agent 4.3.0 (fluentd v1.14.3) to td-agent 4.2.0 (fluentd v1.13.3), and still have problems. It seemed that some of our log files were affected and some were not. The differences are as follows.
We are planning to change the rotation policy and retry upgrading td-agent. We will post an update if we learn anything new. |
We enabled debug logging with fluentd 1.13.3 and encountered this issue. The state was the same:
And fluentd reported
We searched for other debug logs related to this issue but found none. fluentd reported no other debug logs, only pairs of |
Thanks for your report!
To tell the truth, I suspected it, but I couldn't confirm it because I can't reproduce it yet. Although I'm not yet sure of the mechanism behind this issue, you might be able to avoid it by disabling stat watcher ( |
Similar reports:
|
From #3586
|
There are several similar reports and most of them said: When tailing is stopped,
was observed but
wasn't observed. In addition, a user reports that the following debug message is observed when the log level is debug:
So it seems to be caused by the following code: fluentd/lib/fluent/plugin/in_tail.rb Lines 489 to 498 in 438a82a
|
@ashie Is this one fixed? |
In #3614 some users reported that in_tail occasionally stops tailing. It seems to happen when update_watcher is skipped due to unexpected duplicate pos_file entries. The root cause is not yet known. To investigate it, show more information when it happens. Ideally it shouldn't happen, so log level "warn" is appropriate for it. Signed-off-by: Takuro Ashie <[email protected]>
At v1.15.1 we added a log message like the following to investigate this issue:
If you see such a log message, please report it. |
Hello, @ashie! We have something similar to: "no more log pushed" and found duplication in
Maybe this can help you. |
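For anyone who wants to check their own position file for the kind of duplication mentioned above, here is a minimal Ruby sketch. It assumes the tab-separated pos file layout (`path`, hex offset, hex inode) and assumes that `follow_inodes` is off, so that one live entry per path is expected. The sentinel `UNWATCHED_OFFSET` is my understanding of the offset value fluentd writes for entries it no longer watches; treat it as an assumption. `duplicate_pos_paths` is a hypothetical helper name, not a fluentd API.

```ruby
# Assumption: offset value used to mark an entry as no longer watched.
UNWATCHED_OFFSET = 0xffffffffffffffff

# Return the paths that appear in more than one live (not unwatched)
# entry of the given pos file content.
def duplicate_pos_paths(pos_text)
  counts = Hash.new(0)
  pos_text.each_line do |line|
    path, offset_hex, _inode_hex = line.chomp.split("\t", 3)
    next if path.nil? || offset_hex.nil?              # skip malformed lines
    next if offset_hex.to_i(16) == UNWATCHED_OFFSET   # skip unwatched entries
    counts[path] += 1
  end
  counts.select { |_path, n| n > 1 }.keys
end

# Example use (the pos file path is an assumption):
#   puts duplicate_pos_paths(File.read("/var/log/fluentd-containers.log.pos"))
```

With `follow_inodes` enabled, several entries for the same path with different inodes may be normal, so this check would need to group by inode instead.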
@vparfonov Could you provide your in_tail config? Do you use |
@ashie, looks like not, we use default setting for
|
I looked a bit at this now. What seems to happen is that in the
EDIT: This should be reproducible in a local k8s environment by creating a tiny app that outputs a log line every 10 to 30 seconds and running fluentd in_tail against the log files there. |
It is really weird, however, that the re-examination done on the next refresh does not always notice that the symlinks are no longer broken and that the file should now be readable. |
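To make the "broken symlink, then readable again" states concrete, here is a small Ruby sketch that classifies a path the way one might when checking this by hand. `path_state` is a hypothetical helper, not fluentd's own check. It relies only on standard `File` methods: `File.symlink?` looks at the link itself, while `File.exist?` and `File.readable?` follow the link to its target.

```ruby
# Classify a path as seen by a process that wants to tail it.
#   :broken_symlink -> the link exists but its target does not
#   :missing        -> nothing exists at the path
#   :readable       -> the (resolved) file can be opened for reading
#   :unreadable     -> the file exists but cannot be read (e.g. permissions)
def path_state(path)
  if File.symlink?(path) && !File.exist?(path)
    :broken_symlink
  elsif !File.exist?(path)
    :missing
  elsif File.readable?(path)
    :readable
  else
    :unreadable
  end
end
```

Polling this for a container log symlink around a rotation should show whether the link is briefly broken, and whether it has become `:readable` again by the time of the next refresh.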
I created a test case in
sub_test_case "multiple log rotations" do
  data(
    "128" => ["128-0.log", 128, "msg"],
    "256" => ["256-0.log", 256, "msg"],
    "512" => ["512-0.log", 512, "msg"],
    "1024" => ["1024-0.log", 1024, "msg"],
    "2048" => ["2048-0.log", 2048, "msg"],
    "4096" => ["4096-0.log", 4096, "msg"],
    "8192" => ["8192-0.log", 8192, "msg"],
  )
  def test_reproduce_err_after_rotations(data)
    file, num_lines, msg = data

    # Write the initial log file that the symlink will point at.
    File.open("#{@tmp_dir}/#{file}", 'wb') do |f|
      num_lines.times do
        f.puts msg
      end
    end

    conf = config_element("", "", {
      "path" => "#{@tmp_dir}/*.log.link",
      "pos_file" => "#{@tmp_dir}/tail.pos",
      "refresh_interval" => "1s",
      "read_from_head" => "true",
      "format" => "none",
      "rotate_wait" => "1s",
      "follow_inodes" => "true",
      "tag" => "t1",
    })
    link_name = "#{@tmp_dir}/#{num_lines}.log.link"
    File.symlink(file, link_name)

    # Send debug-level plugin logs to stdout so watcher activity is visible.
    dl_opts = {log_level: Fluent::Log::LEVEL_DEBUG}
    logdev = $stdout
    logger = ServerEngine::DaemonLogger.new(logdev, dl_opts)
    log_instance = Fluent::Log.new(logger)

    rotations = 5
    rot_now = 1
    d = create_driver(conf, false)
    d.instance.log = log_instance
    d.run(timeout: 30) do
      sleep 1
      assert_equal(num_lines, d.record_count)
      # rotate logs: repoint the symlink first, then create the new target
      while rot_now <= rotations do
        sleep 2
        puts "unlink #{link_name}"
        File.unlink(link_name)
        puts "symlink #{num_lines}-#{rot_now}.log #{link_name}"
        File.symlink("#{num_lines}-#{rot_now}.log", link_name)
        sleep 1
        File.open("#{@tmp_dir}/#{num_lines}-#{rot_now}.log", 'wb') do |f|
          num_lines.times do
            f.puts msg
          end
        end
        assert_equal(num_lines*rot_now, d.record_count)
        rot_now += 1
      end
    end
  end
end
In this case, it seems to be working properly. But maybe we can help each other in reproducing the error? UPDATE: Changed to multiple files of different sizes, and changed log rotation to how |
Hi all, we are facing a similar issue where the td-agent.log and .pos files are not updating properly. Following some previous comments, I tried modifying the config with the parameters below, but nothing changed:
enable_stat_watcher false
Because our systems run RHEL 7/6, we are using fluentd 1.11.5 with td-agent 3.8.1/3.8.0 in our environment.
Config:
PS. The same config is working fine on a few other instances. I am not sure what the problem is here. I appreciate your help in advance. |
We just ran into the same problem again. Without the log line
Is it wrong that we are running this in a k8s DaemonSet? Should we install FluentD/FluentBit directly on the VMs instead? |
Here's the pos file migration script I wrote. (Seems I forgot that I just wrote it in bash, and not Golang) It requires
#!/usr/bin/env bash

# Print a timestamped INFO message.
info() {
    printf "[%s] INFO - %s\n" "$(date --iso-8601=seconds)" "$@"
}

readonly DB='/opt/fluent-bit-db/log-tracking.db'
readonly FLUENTD_LOG_POS="/var/log/fluentd-containers.log.pos"

# Nothing to do if there is no FluentD pos file.
if [[ ! -f "$FLUENTD_LOG_POS" ]]; then
    info "No FluentD log tracking file to migrate from"
    exit
fi

# Only migrate into a fresh database; never touch an existing one.
if [[ ! -f "$DB" ]]; then
    sqlite3 "$DB" "CREATE TABLE main.in_tail_files (id INTEGER PRIMARY KEY, name TEXT, offset INTEGER, inode INTEGER, created INTEGER, rotated INTEGER);"
else
    info "fluent-bit database already exists, will not do migration"
    exit
fi

# Each pos file line is tab-separated: path, hex offset, hex inode.
while read -r line; do
    IFS=$'\t' read -r -a parts <<< "$line"
    filename="${parts[0]}"
    offset="$((16#${parts[1]}))"
    inode="$((16#${parts[2]}))"
    now="$(date +%s)"
    sqlite3 "$DB" "INSERT INTO in_tail_files (name, offset, inode, created, rotated) VALUES ('$filename', $offset, $inode, $now, 0)"
done < <(sort "$FLUENTD_LOG_POS")
There is no security hardening, such as escaping values for |
Fix #3614 Although known stall issues of in_tail in the `follow_inodes` case are fixed in v1.16.2, a similar problem remains in the `!follow_inodes` case. In this case, a tail watcher can mark the position entry as `unwatched` if it has transitioned to the `rotate_wait` state via `refresh_watcher`, even if another, newer tail watcher is managing that entry. This rarely occurs in practice because `stat_watcher` is called immediately after the file changes, while `refresh_watcher` is called every 60 seconds by default. However, there is a rare possibility that this order is swapped, especially if in_tail is busy processing a large amount of logs. Because in_tail is single-threaded, event queues such as timers or inotify get stuck in this case. There is no such problem in the `follow_inodes` case because position entries are always marked as `unwatched` before entering the `rotate_wait` state. Signed-off-by: Takuro Ashie <[email protected]> Co-authored-by: Daijiro Fukuda <[email protected]>
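The ordering problem in this commit message can be illustrated with a toy model. This is a sketch of the described race only, not fluentd's code and not the actual patch: `ToyPosEntry`, `ToyWatcher`, `TOY_UNWATCHED`, and the `guarded:` flag are all hypothetical names. It assumes one position entry per path shared by the old and new watcher, and shows one way an ownership check could keep a late-closing old watcher from overwriting the entry.

```ruby
# Assumption: sentinel offset meaning "this entry is no longer watched".
TOY_UNWATCHED = 0xffffffffffffffff

# One position entry per path, shared by every watcher of that path.
class ToyPosEntry
  attr_accessor :offset, :owner

  def initialize
    @offset = 0
    @owner = nil
  end
end

class ToyWatcher
  def initialize(entry)
    @entry = entry
    @entry.owner = self   # the newest watcher for the path takes ownership
    @entry.offset = 0     # and starts reading the new file from its head
  end

  def read(bytes)
    @entry.offset += bytes
  end

  # Called when this watcher's rotate_wait period is over.
  # Unguarded, it always marks the shared entry as unwatched.
  # Guarded, it only does so if no newer watcher has taken over the entry.
  def close(guarded:)
    return if guarded && !@entry.owner.equal?(self)
    @entry.offset = TOY_UNWATCHED
  end
end

# Late close of the old watcher, without a guard: the newer watcher's
# progress is overwritten and the entry looks unwatched.
entry = ToyPosEntry.new
old_watcher = ToyWatcher.new(entry)
new_watcher = ToyWatcher.new(entry)
new_watcher.read(100)
old_watcher.close(guarded: false)
puts entry.offset == TOY_UNWATCHED
```

With `close(guarded: true)` in the same sequence, the entry would keep the newer watcher's offset of 100.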
Thank you all very much! The problem that |
No, it is not fixed. |
We are still seeing this issue as well. We are using the image v1.16.3-debian-forward-1.0. We were seeing the following message
We also noticed a pattern of memory leaking and a gradual increase in CPU usage until a restart occurs. We are using fluentd as a DaemonSet on a Kubernetes cluster. Here is our
|
In an environment where we have high volatility (we constantly deploy new code -> deployments are restarted very frequently -> pods are created -> lots of files to tail) we see a clear leak pattern.
Any suggestion to mitigate this? @ashie @daipom |
BTW, any good alternatives for Fluentd? |
I've observed this issue with v1.16.3. I suspect that in_tail handling of file rotations is unlikely to ever reach a satisfactory level of reliability, and something like the docker fluentd logging driver (which unfortunately breaks I suspect that my |
Setting |
Thanks for reporting.
So, there are still problems with both |
BTW it would be better to open a new issue to track the remaining problems. |
@ashie @daipom Thanks for the responses and attention. Should I open a new issue for that? If you have any suggestion I'll be more than happy to test different configs.
|
@uristernik Thanks. Could you please open a new issue? |
@daipom Done, hopefully I described the issue clearly enough. Please correct me if I wasn't accurate enough. |
@uristernik Thanks! I will check it! |
If you all still have a similar issue, please report it on the new issue! |
Describe the bug
After a warning about an "unreadable" file (likely due to rotation), no more logs were pushed (in_tail + pos_file).
Reloading the config or restarting fluentd resolves the issue.
All other existing files being tracked continued to work as expected.
To Reproduce
Not able to reproduce at will.
Expected behavior
Logs to be pushed as usual after file rotation as fluentd recovers from the temporary "unreadable" file.
Your Environment
Your Configuration
Your Error Log
Additional context
This issue seems to be related to #3586, but unfortunately I didn't check the pos file while the issue was happening, so I can't tell if it presented unexpected values for the failing file.