documentation on stopTimeout field is confusing #479

JeromeGe · 2017-08-16T10:06:55Z

containerpilot --version
Version: 3.3.4
GitHash: f7d6212

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

uname -a
Linux 0743e8140205 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

{
  consul: "127.0.0.1:8500",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "test-stopTimeout",
      exec: "/root/test",
      stopTimeout: "120s",
      restarts: "unlimited",
    }
  ],
  control: {
    socket: "/var/run/containerpilot.socket"
  }
}

test.go

package main

import "fmt"
import "os"
import "os/signal"
import "syscall"
import "time"

func main() {
        sigs := make(chan os.Signal, 1)
        done := make(chan bool, 1)
        signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
        go func() {
                fmt.Println("start")
                sig := <-sigs
                fmt.Println(sig)
                for i := 0; i < 30; i++ {
                        time.Sleep(1 * time.Second)
                        fmt.Println(i)
                }
                done <- true
        }()
        <-done
        fmt.Println("exiting")
}

time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Shutdown global}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopping control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: completed graceful shutdown of control server" 
time="2017-08-16T18:02:03+08:00" level=debug msg="control: stopped serving at /var/run/containerpilot.socket" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopping test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="event: {Stopped test-stopTimeout}" 
time="2017-08-16T18:02:03+08:00" level=debug msg="killing all processes in 5 seconds" 
time="2017-08-16T18:02:03+08:00" level=debug msg=test-stopTimeout.term 
time="2017-08-16T18:02:03+08:00" level=debug msg="terminating command 'test-stopTimeout' at pid: 4746" 
time="2017-08-16T18:02:03+08:00" level=info msg=terminated job=test-stopTimeout 
time="2017-08-16T18:02:04+08:00" level=info msg=0 job=test-stopTimeout 
time="2017-08-16T18:02:05+08:00" level=info msg=1 job=test-stopTimeout 
time="2017-08-16T18:02:06+08:00" level=info msg=2 job=test-stopTimeout 
time="2017-08-16T18:02:07+08:00" level=info msg=3 job=test-stopTimeout 
time="2017-08-16T18:02:08+08:00" level=info msg="killing processes for job "test-stopTimeout"" 
time="2017-08-16T18:02:08+08:00" level=debug msg=test-stopTimeout.kill 
time="2017-08-16T18:02:08+08:00" level=debug msg="killing command 'test-stopTimeout' at pid: 4746"

The stopTimeout field is set to 120s, but the test-stopTimeout program is killed after 5 seconds
test code: https://github.com/JeromeGe/test-containerpilot-stopTimeout

The text was updated successfully, but these errors were encountered:

jwreagor · 2017-08-16T20:48:46Z

I think there's a little work required around documenting use cases of stopTimeout.

    // these fields interact with 'when' behaviors (see below)
    timeout: "300s",
    stopTimeout: "10s",
    restarts: "unlimited",

@tgross should add to this but stopTimeout really matters when you want a grace period before a job throws it's stopping event (before a watcher job begins), not the time until it is stopped. The documentation only notes this with the following when example...

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

I'll try to get a better use case/example into the docs.

tgross · 2017-08-16T20:57:26Z

Another way of putting this is that stopTimeout is the maximum amount of time a stopping job will wait for another job that might be watching for the stopping event.

But the job that's watching for the stopping event can take however long it wants to do that work.

So if you want to make sure that other job is going to timeout too, then you need to add the timeout field to that job like this:

jobs: [
  {
    name: "consul-agent",
    exec: "consul agent...",
    stopTimeout: "5s"
  },
  {
    name: "leave-consul",
    exec: "consul leave",
    timeout: "5s",
    when: {
      source: "consul-agent",
      once: "stopping"
    }
  }
]

JeromeGe · 2017-08-18T02:39:33Z

@tgross I need to set the job graceful shutdown time, 5 seconds is too small, is it supported?

inetfuture · 2017-08-18T02:49:09Z

We need something like docker stop --time or kubectkubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

inetfuture · 2017-08-18T02:56:54Z

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

If user want to guarantee some downstream job gets executed, they should set the kill timeout of upstream job explicitly.

tgross · 2017-08-18T12:38:20Z

And, IMHO, the default kill timeout should be infinity, so it equals container's timeout, whether managed by docker or kubernetes.

That will be problematic in a non-Docker environment though.

tgross · 2017-08-18T14:58:03Z

Released in https://github.com/joyent/containerpilot/releases/tag/3.4.0

jwreagor · 2017-08-19T01:33:49Z

We need something like docker stop --time or kubectl delete pod <pod-name> --grace-period, to control how much time to wait before force killing the job. By our test, current behavior is 5s, which is not enough for some job to cleanup itself.

@inetfuture Some time next week I'll take a look and see what this type of feature would require.

jwreagor · 2017-08-20T15:33:59Z

@inetfuture @JeromeGe I had a chance to check out the behavior around the default "5s" timeout and found that we already have a configuration option to change this. It appears ContainerPilot has a global stopTimeout setting across all jobs. I could be wrong but let me know if the following configuration achieves the functionality you're looking for.

Note the second line in the config below...

{
  consul: "localhost:8500",
  stopTimeout: "15",
  logging: {
    level: "DEBUG",
    format: "text",
  },
  jobs: [
    {
      name: "main-process",
      exec: "/ubuntu/test-daemon -p 5 -t 30",
      restarts: "unlimited",
    },
    {
      name: "main-process2",
      exec: "/ubuntu/test-daemon -p 2 -t 20",
      restarts: "unlimited",
    },
  ],
  control: {
    socket: "/tmp/cp-simple.socket",
  },
}

Using a similar test program as provided by this issue's original author, I've set my job exec processes to pause for shutdown after 30s and another after 20s. A global stopTimeout of "15" forces all job processes to terminate after 15s.

Here's the log output of this configuration's behavior.

DEBU[0000] loaded config: {"Discovery":{},"LogConfig":{"level":"DEBUG","format":"text","output":"stdout"},"StopTimeout":15,"Jobs":[{"Name":"main-process","Exec":"/ubuntu/test-daemon -p 5 -t 30","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited",
"StopTimeout":"","When":{"Frequency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null},{"Name":"main-process2","Exec":"/ubuntu/test-daemon -p 2 -t 20","Port":0,"Interfaces":null,"Tags":null,"ConsulExtras":null,"Health":null,"ExecTimeout":"","Restarts":"unlimited","StopTimeout":"","When":{"Fr
equency":"","Source":"","Once":"","Each":"","Timeout":""},"Logging":null}],"Watches":null,"Telemetry":null,"Control":{"SocketPath":"/tmp/cp-simple.socket"}}
DEBU[0000] control: initialized router for control server
DEBU[0000] control: listening to /tmp/cp-simple.socket
DEBU[0000] event: {Startup global}
DEBU[0000] main-process2.Run start
INFO[0000] control: serving at /tmp/cp-simple.socket
DEBU[0000] main-process.Run start
INFO[0002] 01                                            job=main-process2
INFO[0002] start w/ timeout 20                           job=main-process2
INFO[0005] 01234                                         job=main-process
INFO[0005] start w/ timeout 30                           job=main-process
^CDEBU[0007] event: {Shutdown global}
DEBU[0007] event: {Stopping main-process2}
DEBU[0007] event: {Stopped main-process2}
DEBU[0007] main-process2.term
DEBU[0007] terminating command 'main-process2' at pid: 60399
DEBU[0007] control: stopping control server
DEBU[0007] event: {Stopping main-process}
DEBU[0007] event: {Stopped main-process}
DEBU[0007] killing all processes in 15 seconds
DEBU[0007] main-process.term
DEBU[0007] terminating command 'main-process' at pid: 60400
DEBU[0007] control: completed graceful shutdown of control server
INFO[0007] terminated                                    job=main-process2
DEBU[0007] control: stopped serving at /tmp/cp-simple.socket
INFO[0007] terminated                                    job=main-process
INFO[0008] 0                                             job=main-process2
INFO[0008] 0                                             job=main-process
INFO[0009] 1                                             job=main-process
INFO[0009] 1                                             job=main-process2
INFO[0010] 2                                             job=main-process
INFO[0010] 2                                             job=main-process2
INFO[0011] 3                                             job=main-process2
INFO[0011] 3                                             job=main-process
INFO[0012] 4                                             job=main-process2
INFO[0012] 4                                             job=main-process
INFO[0013] 5                                             job=main-process2
INFO[0013] 5                                             job=main-process
INFO[0014] 6                                             job=main-process
INFO[0014] 6                                             job=main-process2
INFO[0015] 7                                             job=main-process
INFO[0015] 7                                             job=main-process2
INFO[0016] 8                                             job=main-process
INFO[0016] 8                                             job=main-process2
INFO[0017] 9                                             job=main-process
INFO[0017] 9                                             job=main-process2
INFO[0018] 10                                            job=main-process
INFO[0018] 10                                            job=main-process2
INFO[0019] 11                                            job=main-process
INFO[0019] 11                                            job=main-process2
INFO[0020] 12                                            job=main-process2
INFO[0020] 12                                            job=main-process
INFO[0021] 13                                            job=main-process
INFO[0021] 13                                            job=main-process2
INFO[0022] killing processes for job "main-process"
DEBU[0022] main-process.kill
DEBU[0022] killing command 'main-process' at pid: 60400
INFO[0022] killing processes for job "main-process2"
DEBU[0022] main-process2.kill
DEBU[0022] killing command 'main-process2' at pid: 60399

I'm re-opening this issue so I'll remember to update the documentation to reflect this use case at the global/application level.

inetfuture · 2017-08-21T06:34:34Z

By our test, if we set the global stopTimeout to 120s, containerpilot won't exit until the timeout, even when all jobs have exited before the timeout. This behavior is not ideal, the timeout should be insurance for slowly-terminated jobs, not make containerpilot termination always that long.

tgross · 2017-08-21T12:45:54Z

I agree that it's not ideal, but it's a bit of a tricky situation. At the point where the top-level stop timeout is hit, we're in the following condition:

all jobs have received the shutdown event
all job exec have received SIGTERM
all jobs have deregistered themselves from the event bus
the event bus has no more subscribers and so it says "ok I'm done"

So at this point, we should be able to exit immediately, but in practice SIGTERM can take an arbitrarily long amount of time to complete for any of the exec. (By way of example, if you're running a Consul agent with leaveOnTerminate set, then it actually has to complete a network request during shutdown!) There's no way at that point for the child processes to communicate back to us that SIGTERM has been completed.

I think we might be able to improve this slightly via #476.

jwreagor · 2017-10-05T15:08:51Z

This is issue is still open because #476 (and PR #512) might bring about the necessary plumbing to support this. When working on #512 I did uncover the area where this could be applied however I never did anything after that. No promises, but worth exploring.

If I have the time I'll open a new proposal issue since this issue is more relevant to how CP works today.

jwreagor added docs enhancement labels Aug 16, 2017

tgross changed the title ~~stopTimeout field does not work~~ documentation on stopTimeout field is confusing Aug 17, 2017

jwreagor mentioned this issue Aug 17, 2017

docs: Attempt to articulate better language around stopTimeout #485

Merged

jwreagor added open PR awaiting release and removed open PR labels Aug 17, 2017

tgross added released and removed awaiting release labels Aug 18, 2017

tgross closed this as completed Aug 18, 2017

jwreagor reopened this Aug 20, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

documentation on stopTimeout field is confusing #479

documentation on stopTimeout field is confusing #479

JeromeGe commented Aug 16, 2017 •

edited

Loading

jwreagor commented Aug 16, 2017 •

edited

Loading

tgross commented Aug 16, 2017

JeromeGe commented Aug 18, 2017 •

edited

Loading

inetfuture commented Aug 18, 2017

inetfuture commented Aug 18, 2017

tgross commented Aug 18, 2017

tgross commented Aug 18, 2017

jwreagor commented Aug 19, 2017

jwreagor commented Aug 20, 2017

inetfuture commented Aug 21, 2017 •

edited

Loading

tgross commented Aug 21, 2017

jwreagor commented Oct 5, 2017

documentation on stopTimeout field is confusing #479

documentation on stopTimeout field is confusing #479

Comments

JeromeGe commented Aug 16, 2017 • edited Loading

jwreagor commented Aug 16, 2017 • edited Loading

tgross commented Aug 16, 2017

JeromeGe commented Aug 18, 2017 • edited Loading

inetfuture commented Aug 18, 2017

inetfuture commented Aug 18, 2017

tgross commented Aug 18, 2017

tgross commented Aug 18, 2017

jwreagor commented Aug 19, 2017

jwreagor commented Aug 20, 2017

inetfuture commented Aug 21, 2017 • edited Loading

tgross commented Aug 21, 2017

jwreagor commented Oct 5, 2017

JeromeGe commented Aug 16, 2017 •

edited

Loading

jwreagor commented Aug 16, 2017 •

edited

Loading

JeromeGe commented Aug 18, 2017 •

edited

Loading

inetfuture commented Aug 21, 2017 •

edited

Loading