Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

documentation on stopTimeout field is confusing #479

Open
JeromeGe opened this issue Aug 16, 2017 · 12 comments
Open

documentation on stopTimeout field is confusing #479

JeromeGe opened this issue Aug 16, 2017 · 12 comments

Comments

@JeromeGe
Copy link
Contributor

JeromeGe commented Aug 16, 2017

containerpilot --version
Version: 3.3.4
GitHash: f7d6212

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

uname -a
Linux 0743e8140205 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

{
  consul: "127.0.0.1:8500",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "test-stopTimeout",
      exec: "/root/test",
      stopTimeout: "120s",
      restarts: "unlimited",
    }
  ],
  control: {
    socket: "/var/run/containerpilot.socket"
  }
}

test.go

package main

import "fmt"
import "os"
import "os/signal"
import "syscall"
import "time"

func main() {
        sigs := make(chan os.Signal, 1)
        done := make(chan bool, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
        go func() {
                fmt.Println("start")
                sig := <-sigs
                fmt.Println(sig)
                for i := 0; i < 30; i++ {
                        time.Sleep(1 * time.Second)
                        fmt.Println(i)
                }
                done <- true
        }()
        <-done
        fmt.Println("exiting")
}
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Shutdown global}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopping control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: completed graceful shutdown of control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopped serving at /var/run/containerpilot.socket" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopping test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopped test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="killing all processes in 5 seconds" 
time="2017-08-16T18:02:03+08:00" level=debug msg=test-stopTimeout.term 
time="2017-08-16T18:02:03+08:00" level=debug msg="terminating command 'test-stopTimeout' at pid: 4746" 
time="2017-08-16T18:02:03+08:00" level=info msg=terminated job=test-stopTimeout 
time="2017-08-16T18:02:04+08:00" level=info msg=0 job=test-stopTimeout 
time="2017-08-16T18:02:05+08:00" level=info msg=1 job=test-stopTimeout 
time="2017-08-16T18:02:06+08:00" level=info msg=2 job=test-stopTimeout 
time="2017-08-16T18:02:07+08:00" level=info msg=3 job=test-stopTimeout 
time="2017-08-16T18:02:08+08:00" level=info msg="killing processes for job "test-stopTimeout"" 
time="2017-08-16T18:02:08+08:00" level=debug msg=test-stopTimeout.kill 
time="2017-08-16T18:02:08+08:00" level=debug msg="killing command 'test-stopTimeout' at pid: 4746"

The stopTimeout field is set to 120s, but the test-stopTimeout program is killed after 5 seconds
test code: https://github.com/JeromeGe/test-containerpilot-stopTimeout

@jwreagor
Copy link
Contributor

jwreagor commented Aug 16, 2017

I think there's a little work required around documenting use cases of stopTimeout.

    // these fields interact with 'when' behaviors (see below)
    timeout: "300s",
    stopTimeout: "10s",
    restarts: "unlimited",

@tgross should add to this but stopTimeout really matters when you want a grace period before a job throws it's stopping event (before a watcher job begins), not the time until it is stopped. The documentation only notes this with the following when example...

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

I'll try to get a better use case/example into the docs.

@tgross
Copy link
Contributor

tgross commented Aug 16, 2017

Another way of putting this is that stopTimeout is the maximum amount of time a stopping job will wait for another job that might be watching for the stopping event.

But the job that's watching for the stopping event can take however long it wants to do that work.

So if you want to make sure that other job is going to timeout too, then you need to add the timeout field to that job like this:

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    timeout: "5s",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

@tgross tgross changed the title stopTimeout field does not work documentation on stopTimeout field is confusing Aug 17, 2017
@JeromeGe
Copy link
Contributor Author

JeromeGe commented Aug 18, 2017

@tgross I need to set the job graceful shutdown time, 5 seconds is too small, is it supported?

@inetfuture
Copy link

We need something like docker stop --time or kubectkubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

@inetfuture
Copy link

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

If user want to guarantee some downstream job gets executed, they should set the kill timeout of upstream job explicitly.

@tgross
Copy link
Contributor

tgross commented Aug 18, 2017

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

That will be problematic in a non-Docker environment though.

@tgross
Copy link
Contributor

tgross commented Aug 18, 2017

@tgross tgross closed this as completed Aug 18, 2017
@jwreagor
Copy link
Contributor

We need something like docker stop --time or kubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

@inetfuture Some time next week I'll take a look and see what this type of feature would require.

@jwreagor
Copy link
Contributor

@inetfuture @JeromeGe I had a chance to check out the behavior around the default "5s" timeout and found that we already have a configuration option to change this. It appears ContainerPilot has a global stopTimeout setting across all jobs. I could be wrong but let me know if the following configuration achieves the functionality you're looking for.

Note the second line in the config below...

{
  consul: "localhost:8500",
  stopTimeout: "15",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "main-process",
      exec: "/ubuntu/test-daemon -p 5 -t 30",
      restarts: "unlimited",
    },
    {
      name: "main-process2",
      exec: "/ubuntu/test-daemon -p 2 -t 20",
      restarts: "unlimited",
    },
  ],
  control: {
    socket: "/tmp/cp-simple.socket",
  },
}

Using a similar test program as provided by this issue's original author, I've set my job exec processes to pause for shutdown after 30s and another after 20s. A global stopTimeout of "15" forces all job processes to terminate after 15s.

Here's the log output of this configuration's behavior.
DEBU[0000] loaded config: {"Discovery":{},"LogConfig":{"level":"DEBUG","format":"text","output":"stdout"},"StopTimeout":15,"Jobs":[{"Name":"main-process","Exec":"/ubuntu/test-daemon -p 5 -t 30","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited",
"StopTimeout":"","When":{"Frequency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null},{"Name":"main-process2","Exec":"/ubuntu/test-daemon -p 2 -t 20","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited","StopTimeout":"","When":{"Fr
equency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null}],"Watches":null,"Telemetry":null,"Control":{"SocketPath":"/tmp/cp-simple.socket"}}
DEBU[0000] control: initialized router for control server
DEBU[0000] control: listening to /tmp/cp-simple.socket
DEBU[0000] event: {Startup global}
DEBU[0000] main-process2.Run start
INFO[0000] control: serving at /tmp/cp-simple.socket
DEBU[0000] main-process.Run start
INFO[0002] 01                                            job=main-process2
INFO[0002] start w/ timeout 20                           job=main-process2
INFO[0005] 01234                                         job=main-process
INFO[0005] start w/ timeout 30                           job=main-process
^CDEBU[0007] event: {Shutdown global}
DEBU[0007] event: {Stopping main-process2}
DEBU[0007] event: {Stopped main-process2}
DEBU[0007] main-process2.term
DEBU[0007] terminating command 'main-process2' at pid: 60399
DEBU[0007] control: stopping control server
DEBU[0007] event: {Stopping main-process}
DEBU[0007] event: {Stopped main-process}
DEBU[0007] killing all processes in 15 seconds
DEBU[0007] main-process.term
DEBU[0007] terminating command 'main-process' at pid: 60400
DEBU[0007] control: completed graceful shutdown of control server
INFO[0007] terminated                                    job=main-process2
DEBU[0007] control: stopped serving at /tmp/cp-simple.socket
INFO[0007] terminated                                    job=main-process
INFO[0008] 0                                             job=main-process2
INFO[0008] 0                                             job=main-process
INFO[0009] 1                                             job=main-process
INFO[0009] 1                                             job=main-process2
INFO[0010] 2                                             job=main-process
INFO[0010] 2                                             job=main-process2
INFO[0011] 3                                             job=main-process2
INFO[0011] 3                                             job=main-process
INFO[0012] 4                                             job=main-process2
INFO[0012] 4                                             job=main-process
INFO[0013] 5                                             job=main-process2
INFO[0013] 5                                             job=main-process
INFO[0014] 6                                             job=main-process
INFO[0014] 6                                             job=main-process2
INFO[0015] 7                                             job=main-process
INFO[0015] 7                                             job=main-process2
INFO[0016] 8                                             job=main-process
INFO[0016] 8                                             job=main-process2
INFO[0017] 9                                             job=main-process
INFO[0017] 9                                             job=main-process2
INFO[0018] 10                                            job=main-process
INFO[0018] 10                                            job=main-process2
INFO[0019] 11                                            job=main-process
INFO[0019] 11                                            job=main-process2
INFO[0020] 12                                            job=main-process2
INFO[0020] 12                                            job=main-process
INFO[0021] 13                                            job=main-process
INFO[0021] 13                                            job=main-process2
INFO[0022] killing processes for job "main-process"
DEBU[0022] main-process.kill
DEBU[0022] killing command 'main-process' at pid: 60400
INFO[0022] killing processes for job "main-process2"
DEBU[0022] main-process2.kill
DEBU[0022] killing command 'main-process2' at pid: 60399

I'm re-opening this issue so I'll remember to update the documentation to reflect this use case at the global/application level.

@jwreagor jwreagor reopened this Aug 20, 2017
@inetfuture
Copy link

inetfuture commented Aug 21, 2017

By our test, if we set the global stopTimeout to 120s, containerpilot won't exit until the timeout, even when all jobs have exited before the timeout. This behavior is not ideal, the timeout should be insurance for slowly-terminated jobs, not make containerpilot termination always that long.

@tgross
Copy link
Contributor

tgross commented Aug 21, 2017

I agree that it's not ideal, but it's a bit of a tricky situation. At the point where the top-level stop timeout is hit, we're in the following condition:

  • all jobs have received the shutdown event
  • all job exec have received SIGTERM
  • all jobs have deregistered themselves from the event bus
  • the event bus has no more subscribers and so it says "ok I'm done"

So at this point, we should be able to exit immediately, but in practice SIGTERM can take an arbitrarily long amount of time to complete for any of the exec. (By way of example, if you're running a Consul agent with leaveOnTerminate set, then it actually has to complete a network request during shutdown!) There's no way at that point for the child processes to communicate back to us that SIGTERM has been completed.

I think we might be able to improve this slightly via #476.

@jwreagor
Copy link
Contributor

jwreagor commented Oct 5, 2017

This is issue is still open because #476 (and PR #512) might bring about the necessary plumbing to support this. When working on #512 I did uncover the area where this could be applied however I never did anything after that. No promises, but worth exploring.

If I have the time I'll open a new proposal issue since this issue is more relevant to how CP works today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants