
Allow to start a job after n other jobs have started #435

Open · mterron opened this issue Jul 10, 2017 · 8 comments

mterron (Contributor) commented Jul 10, 2017

Sometimes we need to define job dependencies that are non-linear. Given jobs A, B & C, job C might depend on both A & B being healthy, while A doesn't depend on B, nor B on A.

At the moment, the only way I could find to express this dependency graph was to create an artificial dependency between A & B and then make C depend on B. This slows down startup.
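
For reference, the workaround looks roughly like this sketch, where the when clause on B is the artificial dependency, present purely to force ordering:

jobs: [
  {
    name: "A",
    exec: "A.sh",
  },
  {
    // artificial: B doesn't actually need A
    name: "B",
    exec: "B.sh",
    when: {
      source: "A",
      once: "healthy"
    }
  },
  {
    name: "C",
    exec: "C.sh",
    when: {
      source: "B",
      once: "healthy"
    }
  }
]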

I suggest that something like this could be implemented:

jobs: [
  {
    name: "A",
    exec: "A.sh",
  },
  {
    name: "B",
    exec: "B.sh",
  },
  {
    name: "C",
    exec: "C.sh",
    when: {
      source: ["A","B"],
      once: "healthy"
    }
  }
]
tgross (Contributor) commented Jul 11, 2017

The big-picture need for this seems sound. The details look complicated. I think we need to explore the edge cases, particularly around each vs. once and some of the non-health-related events. I also want to make sure that adding the flexibility doesn't make it much more difficult for an end user to understand what's going on. Here are three general cases that I have concerns about, but I'd love it if we can explore any further cases:


Case 1: multiple sources, once healthy

when: {
  source: ["A", "B"],
  once: "healthy"
}

This was your original example. Note that there's an implicit AND here: we're saying execute one time, after both A and B are healthy. One corner case: what might we expect to happen if A becomes healthy, then A becomes unhealthy, and then B becomes healthy? We respond to events, not state, so that implies each job will have to track not just its own state but the state of its triggering events as well.

It looks like the exitSuccess, exitFailed, and changed cases all have the same set of state behaviors.
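
To make that concrete, here's a hypothetical trace of the corner case, assuming C latches the last health-related event per source:

A healthy    -> latched {A: healthy}                  don't start C (nothing from B yet)
A unhealthy  -> latched {A: unhealthy}                don't start C
B healthy    -> latched {A: unhealthy, B: healthy}    start C if "was healthy at least once" counts; don't if both must be currently healthy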


Case 2: multiple sources, each healthy

when: {
  source: ["A", "B"],
  each: "healthy"
}

This case takes the previous case and complicates it. The language of "each" kind of implies that we're now OR'ing the health states rather than AND'ing them, but it explicitly means that we run the job on each healthy event.

Like case 1, the exitSuccess, exitFailed, and changed cases all look like they have the same set of state behaviors.
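
Under that reading, a hypothetical trace would run the job repeatedly:

A healthy  -> run C
B healthy  -> run C again
A healthy  -> run C again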


Case 3: multiple sources, once stopping

when: {
  source: ["A", "B"],
  once: "stopping"
}

We have state tracking again, as per case 1. In this case we're responding to an event, but that event signals that we've entered an implicit "stopping" state that exists until we receive the stopped event. So even if we track state as in cases 1 and 2 above, what would be the expected behavior if A fires stopping, A fires stopped, and then B fires stopping?
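
As a hypothetical trace of that question:

A stopping  -> A enters the implicit stopping state; B hasn't, so don't fire
A stopped   -> A leaves the stopping state (does its earlier stopping still count?)
B stopping  -> B is now stopping, but A no longer is: fire or not?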

jwreagor (Contributor) commented

Curious how the state tracking will take place. Isn't the event bus already holding this state, such that this type of job just needs to observe subsequent events in order to fire?

I'd have to dig, but I'm unsure whether the bus was designed that way. My hope would be that you could move the hard dependency tracking out of some sort of global state manager and into already-existing behavior.

tgross (Contributor) commented Jul 19, 2017

The bus is a dumb publisher. Each job tracks its own state (via things like restartsRemain or startEvent), which is why we did things like set the start event to NonEvent in #438.
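
To support multiple sources, each job would have to grow a per-source latch on top of that. Purely as a hypothetical sketch in Go (not the actual internals):

// Hypothetical sketch only; not ContainerPilot's actual internals.
package jobs

// multiSourceTrigger latches which sources have fired the awaited
// event, so that a job can start once all of them have.
type multiSourceTrigger struct {
	awaited string          // e.g. "healthy"
	pending map[string]bool // sources that haven't fired awaited yet
}

func newMultiSourceTrigger(awaited string, sources []string) *multiSourceTrigger {
	t := &multiSourceTrigger{awaited: awaited, pending: map[string]bool{}}
	for _, s := range sources {
		t.pending[s] = true
	}
	return t
}

// observe consumes one event from the bus and reports whether the job
// should now fire, i.e. whether every source has fired the awaited event.
func (t *multiSourceTrigger) observe(source, event string) bool {
	if event == t.awaited {
		delete(t.pending, source)
	}
	// Open question from case 1: should an opposing event (e.g.
	// "unhealthy" arriving after "healthy") re-add the source here?
	return len(t.pending) == 0
}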

jwreagor (Contributor) commented

Of course, right where it was yesterday. I consistently overthink the utility of that bus.

mterron (Contributor, Author) commented Jul 25, 2017

I see this is more complicated than I thought. Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.

As an MVP, would it be simpler if there were only support for once: healthy or once: exitSuccess? As in: "after these n things are healthy/started, launch", and then it's up to the app to react to events and other dependencies going down.
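
In config terms, that MVP would be just:

when: {
  source: ["A", "B"],
  once: "healthy"
}

(with each: and the other event types left out of multi-source support for now).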

tgross (Contributor) commented Jul 26, 2017

> Is there any other initiative to add state tracking to CP? I'm happy to keep using my "solution" if that's the way it is. I just thought it was a valid use case.

It does seem valid, for sure. But yeah, it's just complicated. We don't have any other initiative doing state tracking beyond the state of the job itself.

> As an MVP, would it be simpler if there were only support for once: healthy or once: exitSuccess? As in: "after these n things are healthy/started, launch", and then it's up to the app to react to events and other dependencies going down.

That might be plausible. I do worry that restricting multiple each or multiple stopping event handlers might seem arbitrary to users, but we have other places where we've had to say "we just don't support that, because supporting it would be even more confusing".

tgross (Contributor) commented Aug 3, 2017

Noting for myself that there's a lot of under-the-hood implementation overlap between the issues in #435, #416, and #396.

gbmeuk commented Jan 15, 2018

Hi,

We have a case that is related to this issue and also to #416 and #518, where we hit a race condition between an on-change job and a pre-start job. Given the following ContainerPilot jobs:

    {
      name: 'pre-start',
      exec: '/usr/local/bin/app-manage preStart',
      when: {
        source: 'watch.squid-gcp-proxy',
        once: 'healthy'
      }
    },
    {
      name: 'on-change-squid-gcp-proxy',
      exec: '/usr/local/bin/app-manage reload',
      when: {
        source: 'watch.squid-gcp-proxy',
        each: 'changed'
      }
    },
    {
      name: 'apache-fwdproxy',
      exec: '/usr/local/apache/bin/apachectl -Xf /etc/apache-fwdproxy/httpd.conf -k start -D APACHE-FWDPROXY',
      restarts: 3,
      port: '33000',
      health: {
        exec: '/usr/local/bin/app-manage health',
        interval: 10,
        ttl: 30,
        timeout: 3,
      },
      tags: [
        'apache',
        'googleproxy'
      ],
      consul: {
        enableTagOverride: true,
        deregisterCriticalServiceAfter: '10m'
      },
      when: {
        source: 'pre-start',
        once: 'exitSuccess'
      }
    }

...and the script's functions are as follows:

preStart() {
    _log "Configuring application"
    touch /usr/local/apache/htdocs/health
    configureApp
}


health() {
    msg=$(curl --fail -sS http://localhost:33000/health)
    status=$?
    if [ ${status} -ne 0 ]; then
        echo "${msg}"
        exit ${status}
    else
        return ${status}
    fi
}

reload() {
    _log "Configuring application"
    configureApp
    _log "reloading application"
    /usr/local/apache/bin/apachectl \
          -f /etc/apache-fwdproxy/httpd.conf \
          -k graceful \
          -D APACHE-FWDPROXY
}

Sometimes apache is started with graceful instead of start, and it then fails to run or reconfigure in a consistent and reliable fashion.
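
As far as we can tell, the sequence is:

watch.squid-gcp-proxy emits changed and healthy together
-> pre-start fires on once: healthy
-> on-change-squid-gcp-proxy fires on each: changed, racing pre-start
-> reload runs "apachectl -k graceful" before apache-fwdproxy has been started with "-k start"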

This issue was resolved by changing reload() to:

reload() {
    health
    if [ $? -eq 0 ]; then
        _log "Configuring application"
        configureApp
        _log "reloading application"
        /usr/local/apache/bin/apachectl \
            -f /etc/apache-fwdproxy/httpd.conf \
            -k graceful \
            -D APACHE-FWDPROXY
    else
        _log "WARNING: application not running. Can't reload"
    fi
}

I totally understand the design decision to emit both a changed and a healthy event, so it would be really nice to be able to handle this via better functionality in when, or at least via clearer documentation of the flow of event messages; in particular, how changed and healthy are both emitted together.
