jobs/build: don't start new build until previous build completes #823

dustymabe · 2023-02-23T22:31:34Z

Since 973bcf9 ("jobs/build: rerun build-arch if previous build is
incomplete"), there is a race possible where the build job rerun logic
could kick in before the release jobs initially triggered for that
build has finished. We don't want to queue builds in that case.

Let's just prevent starting up the build job at all if there are
remaining pieces from another build for this stream running.

We only actually run the jobs when `uploading` is set. So it seems wrong to update the build description. In fact, let's not even waste time looking for missing arches or reading build metadata. Patch best viewed with whitespace ignored.

dustymabe · 2023-02-23T22:39:10Z

did some trivial testing on this. It seems to work

Since 973bcf9 ("jobs/build: rerun `build-arch` if previous build is incomplete"), there is a race possible where the `build` job rerun logic could kick in before the `release` jobs initially triggered for that build has finished. We don't want to queue builds in that case. Let's just prevent starting up the build job at all if there are remaining pieces from another build for this stream running. Co-authored-by: Jonathan Lebon <[email protected]>

jlebon · 2023-02-23T22:49:34Z

Hmm, I thought we wanted to keep the property of being able to start a new build while the previous build is still finishing? If that's not the case, a simpler approach is to always wait for the multi-arch and release jobs. And that would come with benefits like centralizing Slack notifications and not having to support an extra parameter and diverging between FCOS and RHCOS. (Edit: actually, another huge benefit would be that it'd be completely race-free. We'd be able to simplify our lock logic by a lot.)

dustymabe · 2023-02-23T23:15:19Z

Hmm, I thought we wanted to keep the property of being able to start a new build while the previous build is still finishing?

ehh, I've never been too set on that as a requirement.

If that's not the case, a simpler approach is to always wait for the multi-arch and release jobs. And that would come with benefits like centralizing Slack notifications and not having to support an extra parameter and diverging between FCOS and RHCOS. (Edit: actually, another huge benefit would be that it'd be completely race-free. We'd be able to simplify our lock logic by a lot.)

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure. I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

jlebon · 2023-02-27T15:28:06Z

ehh, I've never been too set on that as a requirement.

We should probably think on this a bit more and look at some numbers. It means we significantly lower the rate at which we can build a stream and also increase the latency between requesting a build and having it.

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure.

Note though that's exactly what 4c02574 does to help the ART case. :)

I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Hmm yeah, I think we can get that by having the build job send its message when it's done and just waiting for the multi-arch and release jobs? We don't need to propagate failures from those if we don't want (that'll have to be a parameter that replaces WAIT_FOR_RELEASE_JOB). I think though there's more in general we could do to make notifications better.

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

Yeah, maybe. There's a natural tension it seems between RHCOS and FCOS wrt how we want the pipeline to behave. We might end up there (a separate job), though ideally we try to rework things so that they become more similar while still preserving what we value.

We could actually probably implement waiting for the build-arch and release jobs (race-free) without lots of locks and without giving up on pipelining. The milestone step is relevant here.

Anyway, I guess we need a fix soon for the race condition. I think #822 has a much smaller impact on pipeline behaviour; WDYT about merging that one for now while we discuss the above?

dustymabe · 2023-02-27T16:48:40Z

ehh, I've never been too set on that as a requirement.

We should probably think on this a bit more and look at some numbers. It means we significantly lower the rate at which we can build a stream and also increase the latency between requesting a build and having it.

I'm not so sure this is a significant difference than what we require today. I'd say it's not too common that we have multiple builds of the same stream in flight.

I think the thing I wanted to keep more was the fidelity between the job and failure reporting. i.e. if the release job fails I don't really want the x86_64 build job to report failure.

Note though that's exactly what 4c02574 does to help the ART case. :)

Indeed, which is why it's a parameter and not the default.

I'd also like notifications to come through when the x86_64 job completes (letting us know it succeeded) and this typically happens way before the other jobs.

Hmm yeah, I think we can get that by having the build job send its message when it's done and just waiting for the multi-arch and release jobs? We don't need to propagate failures from those if we don't want (that'll have to be a parameter that replaces WAIT_FOR_RELEASE_JOB). I think though there's more in general we could do to make notifications better.

👍

Maybe what we need is another toplevel job that's the entry point and will fail if any of the parts fail. Not sure how easy this would be to do because there is a lot of conditional logic baked in, but could be something to consider.

Yeah, maybe. There's a natural tension it seems between RHCOS and FCOS wrt how we want the pipeline to behave. We might end up there (a separate job), though ideally we try to rework things so that they become more similar while still preserving what we value.

We could actually probably implement waiting for the build-arch and release jobs (race-free) without lots of locks and without giving up on pipelining. The milestone step is relevant here.

yes, but that does imply that pipelining is something we heavily take advantage of today or want to in the future. I really don't think (in the current pipeline setup) it's something we really need.

Maybe if we had a slim version of the build that happened "all the time" and a heavy one that happened once a day or something. But right now we're running the heavy one all the time and that doesn't lend itself well to pipelining IMO. quickly invalidating all those images we just created and cloud uploads we just did is a bit of an anti-pattern.

Anyway, I guess we need a fix soon for the race condition. I think #822 has a much smaller impact on pipeline behaviour; WDYT about merging that one for now while we discuss the above?

I commented over there. I think I still prefer this (let's talk more!) but won't block that PR either.

jlebon · 2023-02-27T17:34:16Z

I commented over there. I think I still prefer this (let's talk more!) but won't block that PR either.

Agree, let's keep talking! I've updated that PR with your feedback.

I think you're probably right we don't heavily rely on pipelining today. I just didn't want to make this change quickly to fix the race without thinking it through since it has a wider impact. I think there's an opportunity here to converge on something simpler than the status quo. Maybe we can get together and discuss what we want the "ideal" to look like.

jobs/build: gate all rerun logic on uploading

c2ad35d

We only actually run the jobs when `uploading` is set. So it seems wrong to update the build description. In fact, let's not even waste time looking for missing arches or reading build metadata. Patch best viewed with whitespace ignored.

dustymabe mentioned this pull request Feb 23, 2023

jobs/build: don't try to complete ongoing builds #822

Merged

dustymabe force-pushed the dusty-lokcs branch 2 times, most recently from a3f861f to b9d3a04 Compare February 23, 2023 22:38

dustymabe force-pushed the dusty-lokcs branch from b9d3a04 to a084590 Compare February 23, 2023 22:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jobs/build: don't start new build until previous build completes #823

jobs/build: don't start new build until previous build completes #823

dustymabe commented Feb 23, 2023

dustymabe commented Feb 23, 2023

jlebon commented Feb 23, 2023 •

edited

Loading

dustymabe commented Feb 23, 2023

jlebon commented Feb 27, 2023

dustymabe commented Feb 27, 2023 •

edited

Loading

jlebon commented Feb 27, 2023

jobs/build: don't start new build until previous build completes #823

Are you sure you want to change the base?

jobs/build: don't start new build until previous build completes #823

Conversation

dustymabe commented Feb 23, 2023

dustymabe commented Feb 23, 2023

jlebon commented Feb 23, 2023 • edited Loading

dustymabe commented Feb 23, 2023

jlebon commented Feb 27, 2023

dustymabe commented Feb 27, 2023 • edited Loading

jlebon commented Feb 27, 2023

jlebon commented Feb 23, 2023 •

edited

Loading

dustymabe commented Feb 27, 2023 •

edited

Loading