exec stdout w/o newlines breaks health checks by logging on closed pipe #423
Thanks for the detailed report, @ochanism. I've dug into this a bit but don't have a conclusive answer yet. At one point we had a problem where we were dropping logs (this was at the root of #267, #294, #295, and #296). We fixed that in #297 by making it so that we didn't close the logger for the health check until we stop polling entirely. That being said, the error message definitely seems to be bubbling up from
Did you have more than one failed health check with this error? With a 10 sec poll and a 30 sec TTL it should take more than a single failed check.
I've broken this down a bit to confirm the error message is as expected, at least.

```go
package main

import (
	"os/exec"

	log "github.com/Sirupsen/logrus"
)

func main() {
	logger := log.StandardLogger().Writer()
	cmd := exec.Command("sh", "-c", "echo stdout; echo 1>&2 stderr")
	cmd.Stdout = logger
	cmd.Stderr = logger
	cmd.Start()
	// logger.Close() // uncomment this line to get expected err
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```

With the `logger.Close()` line commented out there's no error. Whereas with that line uncommented we get:
So I think the problem is now figuring out how or why we're closing the logger. With it only happening after a few days, getting a reliable repro will be key.
I think I've narrowed this down further with the following:

check.sh:

```bash
#!/bin/bash
while true; do
    echo 'check'
    sleep 1
done
```

logbug.go:

```go
package main

import (
	"os/exec"
	"time"

	log "github.com/Sirupsen/logrus"
)

func main() {
	logger := log.StandardLogger().Writer()
	cmd := exec.Command("./check.sh")
	cmd.Stdout = logger
	cmd.Stderr = logger
	err := cmd.Start()
	if err != nil {
		log.Fatalf("failed to start: %v", err)
	}
	go func() {
		timer := time.NewTimer(time.Second * 2)
		<-timer.C
		logger.Close()
	}()
	if err := cmd.Wait(); err != nil {
		log.Fatalf("failed: %v", err)
	}
}
```

output:

Which means that in a case where the exec is running and we've had a chance to properly set up the loggers, we won't get the error.

@ochanism, do you have any other debug logs from the health check around this error? It might help to see if we got other messages.
I spent a bit of time trying to investigate whether it had to do with the timeout, which is the same as the poll interval, under the theory that subsequent runs were interfering with each other. I wrote a config that would exercise this area with a very, very tight interval and let it run for a good long while. But I still can't replicate the closing of the logger except artificially; there's no code path that hits that. I'm beginning to wonder whether the logger is a red herring, and whether it's another io.WriteCloser in the path that's getting closed when it shouldn't be.
I think I've figured out a way to bypass the issue entirely, as well as solidify the fix in #297 and fix a regression in log formatting which I was planning on fixing for #393. Instead of setting up individual loggers, I'm going to update Logrus to 1.0 and then use a context logger and attach the new
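As a rough sketch of what that context-logger approach could look like with logrus 1.0 — the field name, the lowercase import path, and the `./check.sh` command below are illustrative assumptions, not the actual ContainerPilot change:

```go
package main

import (
	"os/exec"

	log "github.com/sirupsen/logrus" // logrus 1.0 moved to the lowercase import path
)

func main() {
	// A context logger carries structured fields on every line it emits.
	contextLogger := log.WithFields(log.Fields{"check": "zookeeper-health"})

	// ./check.sh stands in for whatever health check the job runs.
	cmd := exec.Command("./check.sh")
	cmd.Stdout = contextLogger.Writer()                    // stdout logged at Info level
	cmd.Stderr = contextLogger.WriterLevel(log.ErrorLevel) // stderr logged at Error level
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start: %v", err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatalf("failed: %v", err)
	}
	// Writers left open for brevity in this sketch.
}
```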
@tgross, yes, I got the error every 10 sec.
@tgross, I launched it with the debug log level yesterday and I've been waiting for the error. Thanks for your fast support.
I got the debug logs concerning the error. It occurred 26 hours after the service started.
I found that Logrus 1.0 also has the same issue related to token size here. I don't think Logrus 1.0 would be safe from this error.
I think I figured out what caused it. The output string of the health check script did not contain a newline character at the end, so the output may have accumulated in the buffer. The default value of bufio.MaxScanTokenSize is 64KB. In my situation, the length of the output string was 7 bytes, the health check interval was 10 sec, and the service was in-service for 26 hours. I calculated the total length of the output over 26 hours like this: ((26*3600)/10)*7 = 65520, which is very close to 64KB. I added a newline character to the health check script to verify my hypothesis and I've been keeping an eye on it.
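To make that concrete, here is a minimal standalone sketch of the failure mode (plain Go against bufio, not ContainerPilot or logrus code): a scanner fed a stream that never contains a newline errors out as soon as the accumulated token exceeds bufio.MaxScanTokenSize (64KB).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// Roughly what ~26 hours of 7-byte, newline-free health-check output adds
	// up to: just over the 64KB default limit.
	data := strings.Repeat("checked", bufio.MaxScanTokenSize/7+1)

	scanner := bufio.NewScanner(strings.NewReader(data))
	for scanner.Scan() {
		fmt.Println(scanner.Text()) // never reached: no newline ever arrives
	}
	fmt.Println(scanner.Err()) // bufio.Scanner: token too long
}
```

Any newline-free stream eventually hits this limit; the output size and poll interval only change how long it takes.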
I confirmed that there was no error with a new line character in the health check script.
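For illustration, a hypothetical health check showing the workaround (not the script from this report); the only important detail is that the output ends with a newline:

```bash
#!/bin/bash
# Hypothetical health check illustrating the workaround, not the reporter's
# actual script. The check's output must end with a newline so the log
# scanner can terminate each token.

# Problematic: printf with no trailing newline accumulates in the 64KB
# scanner buffer across runs until it eventually overflows.
# printf 'ok'

# Safe: echo appends the trailing newline.
echo 'ok'
```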
That's some nice detective work there, @ochanism! I'm glad you were able to find a workaround for now, but we should probably try to come up with something more robust against this failure. Next steps:
With respect to trying to handle the error in ContainerPilot, it looks like neither of the two approaches above (injecting a newline, or capturing the error and handling it gracefully) is going to work. From the docs of the underlying stdlib interface (https://golang.org/pkg/bufio/#Scanner):
Once we've hit the error the logger is already broken. If it were possible to inject a newline we would reduce the risk of hitting the error, but we'll still hit it in the case where a single exec's output exceeds the scanner's buffer on its own.
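A small sketch of why the logger can't be rescued after the fact, assuming logrus's Writer behavior (a background scanner goroutine that closes its end of the pipe once it errors); the exact error text may vary by version:

```go
package main

import (
	"fmt"
	"strings"

	log "github.com/sirupsen/logrus"
)

func main() {
	w := log.StandardLogger().Writer()

	// One oversized, newline-free write trips the 64KB scanner limit inside
	// logrus; its reader goroutine logs the error and closes the pipe.
	_, _ = w.Write([]byte(strings.Repeat("x", 70000)))

	// Injecting a newline now is too late: the pipe is already closed.
	if _, err := w.Write([]byte("still here\n")); err != nil {
		fmt.Println(err) // e.g. "io: read/write on closed pipe"
	}
}
```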
I've pushed 3408f95 to update the documentation to warn about this issue. I'm going to cut a release for 2.7.6 and 3.1.2 that gives us the logrus 1.0 update and closes any possibility of a race around log closing. And then I'll circle back to look at sirupsen/logrus#564, in particular what's happening there around the
I don't have the bandwidth right now to submit a fix to sirupsen/logrus. Ultimately this is an end-user issue that we've documented a warning for, and fixing the upstream problem doesn't have a ton of urgency. I'm going to mark this as
Helping to mitigate TritonDataCenter/containerpilot#423
Recently, I updated the containerpilot version from 2.7.3 to 2.7.4 and 2.7.5 and I got the error as described below.
Actually, Zookeeper's health was fine and the health check script exited with code 0, but Zookeeper's health on Consul was out-of-service. When starting the service it looked fine, but the problem occurred after a few days. I looked into the containerpilot code to figure out what caused it. The error came from here, and it might have been caused by closing the logger here. I couldn't figure out why the logger was closed or who closed it. When I sent SIGHUP to the containerpilot process, the logger pipe was recreated and the service health changed back to in-service.
Anyway, in this situation I think the service health should remain in-service, even if the logger is closed.
I've attached some scripts below to help you debug this issue.
health check script
containerpilot configuration
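For reference, a hypothetical ContainerPilot 2.x configuration of the general shape described in this report — the service name, port, and script path are placeholders rather than the actual attachment — using the 10 second poll and 30 second TTL mentioned earlier in the thread:

```json
{
  "consul": "localhost:8500",
  "services": [
    {
      "name": "zookeeper",
      "port": 2181,
      "health": "/opt/health-check.sh",
      "poll": 10,
      "ttl": 30
    }
  ]
}
```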