
When repromoting a Freight into a Stage with a failed promotion, it fails with a working tree error #3675

Open
4 tasks done
tal-hason opened this issue Mar 19, 2025 · 12 comments

Comments

@tal-hason
Contributor

tal-hason commented Mar 19, 2025

Checklist

  • I've searched the issue queue to verify this is not a duplicate bug report.
  • I've included steps to reproduce the bug.
  • I've pasted the output of kargo version.
  • I've pasted logs, if applicable.

Description

I have a stage where a promotion failed for some reason. When I try to repromote the freight after the issue was fixed, I get:

step execution failed: step 0 met error threshold of 1: failed to run step "git-clone": error adding work tree ./nightly to repo https://gitlab.com/openshift-virtualization/fbc-payloads.git: error adding working tree at "/tmp/promotion-ff209559-788c-440d-aa2a-a292ca5be925/nightly": error executing cmd [/usr/bin/git worktree add /tmp/promotion-ff209559-788c-440d-aa2a-a292ca5be925/nightly 93a2d6c1e9693bdd985037ed80abb8f1109158af]: Preparing worktree (detached HEAD 93a2d6c) fatal: '/tmp/promotion-ff209559-788c-440d-aa2a-a292ca5be925/nightly' already exists

Screenshots

(screenshot)

Steps to Reproduce

1. Create a warehouse and two consecutive stages.
2. Create a freight in the warehouse and start promoting it.
3. Make the promotion fail in the second stage, e.g. via a config error or an HTTP request error.
4. Try to promote the freight again.

Version

1.3.1

Logs

time="2025-03-19T10:40:21Z" level=info msg="began promotion" freight=2a12077559b892dc79178aa3027ae401a97b9007 namespace=v4-12 promotion=nightly-prod.01jppzttkrtfbc8m9gs8hq6hex.2a12077 stage=nightly-prod
time="2025-03-19T10:40:44Z" level=error msg="error executing Promotion" error="step execution failed: step 0 met error threshold of 1: failed to run step \"git-clone\": error adding work tree ./nightly to repo https://gitlab.com/openshift-virtualization/fbc-payloads.git: error adding working tree at \"/tmp/promotion-35ea43b0-82c0-4612-9e11-cb5290c036be/nightly\": error executing cmd [/usr/bin/git worktree add /tmp/promotion-35ea43b0-82c0-4612-9e11-cb5290c036be/nightly 8a4abdeceadbe5fa834d9703cbd375173aa95312]: Preparing worktree (detached HEAD 8a4abde)\nfatal: '/tmp/promotion-35ea43b0-82c0-4612-9e11-cb5290c036be/nightly' already exists\n" freight=2a12077559b892dc79178aa3027ae401a97b9007 namespace=v4-12 promotion=nightly-prod.01jppzttkrtfbc8m9gs8hq6hex.2a12077 stage=nightly-prod
time="2025-03-19T10:40:44Z" level=info msg=promotion freight=2a12077559b892dc79178aa3027ae401a97b9007 namespace=v4-12 phase="\"Errored\"" promotion=nightly-prod.01jppzttkrtfbc8m9gs8hq6hex.2a12077 stage=nightly-prod
@krancour
Member

We've seen something like this before, but I'm having trouble finding the relevant issues and PRs.

The screenshot is a big clue as to what is happening.

Almost certainly, you got all the way through the trigger-pipeline http step and then encountered a panic on the second one, check-pipeline-status. A side effect of the panic is that the behind-the-scenes bookkeeping that marks the Promotion as having left off on that step didn't occur. This means that on the next reconciliation attempt, the steps started again from step 0, which is why git-clone has now failed.
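A simplified model of that failure mode (the names `reconcile`, `git_clone`, and `check_pipeline_status` are invented for illustration; this is not Kargo's actual executor, and it deliberately exaggerates the bookkeeping by persisting progress only on a clean pass):

```python
import os
import tempfile

def reconcile(steps, state):
    """Run promotion steps, persisting progress only on a clean pass."""
    step = state.get("currentStep", 0)
    while step < len(steps):
        steps[step]()               # an unhandled panic aborts the pass...
        step += 1
    state["currentStep"] = step     # ...before progress is ever persisted

workdir = os.path.join(tempfile.mkdtemp(), "nightly")

def git_clone():
    # Fails with FileExistsError on a re-run, just as
    # `git worktree add` fails when the path already exists.
    os.makedirs(workdir)

def check_pipeline_status():
    raise RuntimeError("panic while evaluating an expression")

state = {}
steps = [git_clone, check_pipeline_status]

try:
    reconcile(steps, state)         # first attempt: panics on step 1
except RuntimeError:
    pass                            # progress was never recorded

collision = False
try:
    reconcile(steps, state)         # next reconciliation restarts at step 0
except FileExistsError:
    collision = True                # ...and git-clone hits the stale worktree
```

Because the panic prevents `currentStep` from being recorded, the second pass replays step 0 against a working tree that already exists, which matches the `fatal: ... already exists` error above.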

To investigate further, seeing details of the check-pipeline-status http step would be helpful.

@tal-hason
Contributor Author

Here is the HTTP step from the Promotion:

        - uses: http
          as: check-pipeline-status
          retry:
            timeout: 5m0s
          config:
            method: GET
            url: "${{ vars.notificationURL }}/notifications/notification"
            timeout: 5s
            insecureSkipTLSVerify: true
            successExpression: response.body.type == "info"
            failureExpression: response.body.type == "failure"
            headers:
              - name: Content-Type
                value: application/json
            queryParams:
              - name: ID
                value: ${{ outputs['message'].pipelineID }}
            outputs:
              - name: status
                fromExpression: response.status
              - name: type
                fromExpression: response.body.type
              - name: id
                fromExpression: response.body.NotificationID

@tal-hason
Contributor Author

@krancour FYI, I'm only doing all of this with an HTTP step because I can't change the freight name from the stage 😕

@tal-hason
Contributor Author

Another thought: could this be related to the fact that I'm running 2 API Pods?

@krancour
Member

krancour commented Mar 19, 2025

Seeing the Promotion's status field would also help.

As of now, I would guess one of these is the source of the panic. I'd bet you're receiving some response in the 4xx or 5xx range and there's no type field in the body.

successExpression: response.body.type == "info"
failureExpression: response.body.type == "failure"

You should probably use logical operators and short-circuiting to avoid that. Something like:

successExpression: response.status == 200 && response.body.type == "info"
failureExpression: response.status != 200 || response.body.type == "failure"

We obviously do need to handle panics in expressions better than we currently do, but hopefully this gets you moving again.
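As an illustration of the principle (a Python stand-in, not Kargo's actual expression engine): with short-circuiting `and`/`or`, the body lookup is skipped whenever the status check alone decides the outcome, so a 4xx/5xx response with no `type` field can't blow up the expression.

```python
def evaluate(status, body):
    # Guarded expressions: for a non-200 response, the status check
    # short-circuits and body["type"] is never evaluated.
    success = status == 200 and body["type"] == "info"
    failure = status != 200 or body.get("type") == "failure"
    return success, failure

# A 404 with an empty body never touches body["type"]:
print(evaluate(404, {}))                  # (False, True)
print(evaluate(200, {"type": "info"}))    # (True, False)
```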

Even if this does resolve your problem, seeing the Promotion's status field would still be useful to us, if you don't mind. 🙏

@tal-hason
Contributor Author

here you go:

status:
  phase: Errored
  message: >
    step execution failed: step 0 met error threshold of 1: failed to run step
    "git-clone": error adding work tree ./nightly to repo
    https://gitlab.cee.redhat.com/openshift-virtualization/fbc-payloads.git:
    error adding working tree at
    "/tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly": error
    executing cmd [/usr/bin/git worktree add
    /tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly
    d6f29b4f92a70dbc626bb5de687d3102d81e0f38]: Preparing worktree (detached HEAD
    d6f29b4)

    fatal: '/tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly' already
    exists
  lastHandledRefresh: 2025-03-19T12:34:18Z
  freight:
    name: 27ca89d66118e0705bdf03a2d77cd8c10aec2266
    commits:
      - repoURL: https://gitlab.com/openshift-virtualization/fbc-payloads.git
        id: d6f29b4f92a70dbc626bb5de687d3102d81e0f38
        branch: main
        message: "moved to stage - Commit: 2bd377d6881686b41fd88404d5213f825b9e104b -
          Version: v4-..."
        author: kargo <[email protected]>
        committer: openshift-virt-promotion-manager <[email protected]>
    origin:
      kind: Warehouse
      name: nightly-gitlab
  finishedAt:
    seconds: "1742387667"
  freightCollection:
    items:
      Warehouse/nightly-gitlab:
        name: 27ca89d66118e0705bdf03a2d77cd8c10aec2266
        commits:
          - repoURL: https://gitlab.com/openshift-virtualization/fbc-payloads.git
            id: d6f29b4f92a70dbc626bb5de687d3102d81e0f38
            branch: main
            message: "moved to stage - Commit: 2bd377d6881686b41fd88404d5213f825b9e104b -
              Version: v4-..."
            author: kargo <[email protected]>
            committer: openshift-virt-promotion-manager <[email protected]>
        origin:
          kind: Warehouse
          name: nightly-gitlab
    id: c2c7a3475af7adc6fb20911f612919ee73be96d4
  currentStep: "8"
  state:
    message:
      channel: nightly
      commitSha: c4344695412d3640c1593e2c896e612a11567bd2
      pipelineID: nightly-prod.01jpq6b34p4r1dhkwbwh9s3yk3.27ca89d
      product_version: hco-bundle-registry-container-v4.12.17
      project: v4-12
      releasePlan: prod
      repoURL: https://gitlab.com/openshift-virtualization/fbc-payloads.git
      subject_identifier: hco-bundle-registry-container-v4.12.17-39
    payload:
      channel: nightly
      fbc_fragment: quay.io/redhat-user-workloads/cnv-fbc-tenant/cnv-fbc-v4-12/v412-cnv-fbc@sha256:1e59e3378bdf678fabffa26e17cf624f488e11bdf1e80c26854ed40eb8f0374e
      from_index: registry-proxy.engineering.redhat.com/rh-osbs/iib-pub-pending:v4.12
      hco_bundle_registry_by_sha: registry.redhat.io/container-native-virtualization/hco-bundle-registry@sha256:8cb5c235465ef0573c33ca7dfdf9852d0d50df1990660ac0ef866b2ec471ce1d
      hco_bundle_registry_by_tag: registry.redhat.io/container-native-virtualization/hco-bundle-registry:v4.12.17-39
      hco_bundle_version: v4.12.17-39
      index_image: registry-proxy.engineering.redhat.com/rh-osbs/iib:938201
      minor_version: v4.12.17
      snapshot_id: cnv-fbc-v4-12-rtmtn
    trigger-pipeline: {}
    update-channel:
      commitMessage: |-
        Updated ./nightly/v4-12/lanes/nightly/prod/payload.yaml

        - channel: "nightly"
        - releasePlan: "prod"
        - version: "v4-12"
    update-stage:
      branch: main
      commit: c4344695412d3640c1593e2c896e612a11567bd2
    update-the-stage:
      commit: 0706a507f82d7f809b222386b8199ecf92d0a915
  stepExecutionMetadata:
    - alias: step-0
      startedAt:
        seconds: "1742387659"
      finishedAt:
        seconds: "1742387667"
      errorCount: 1
      status: Errored
      message: >
        failed to run step "git-clone": error adding work tree ./nightly to repo
        https://gitlab.com/openshift-virtualization/fbc-payloads.git:
        error adding working tree at
        "/tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly": error
        executing cmd [/usr/bin/git worktree add
        /tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly
        d6f29b4f92a70dbc626bb5de687d3102d81e0f38]: Preparing worktree (detached
        HEAD d6f29b4)

        fatal: '/tmp/promotion-87d00c2d-023b-4948-b240-c4fad2892404/nightly'
        already exists

@tal-hason
Contributor Author

successExpression: response.status == 200 && response.body.type == "info"
failureExpression: response.status != 200 || response.body.type == "failure"

The body.type will be "failure" while the status is 200.

here is a failure example:

(screenshot)

and an info example:

(screenshot)

The only time it gets a 400 is while the pipeline is still running and hasn't sent the notification yet; in that case the endpoint responds with 400.
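Given that behavior, one option (a sketch, not tested against this setup) would be to require a 200 in both expressions, so the transient 400 matches neither and the step simply keeps polling within its 5m retry timeout:

```yaml
# Sketch: a 400 (notification not sent yet) matches neither expression,
# so the http step keeps retrying instead of failing outright.
successExpression: response.status == 200 && response.body.type == "info"
failureExpression: response.status == 200 && response.body.type == "failure"
```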

@krancour
Member

The only time it gets a 400 is while the pipeline is still running and hasn't sent the notification yet; in that case the endpoint responds with 400.

You're basing this on curling from your desktop? That really isn't much of a guarantee that the step isn't encountering a non-200 response due to misconfiguration, a firewall, a reverse proxy that's gone down, etc.

Thank you for the promo status. It's really strange that the current step is 8, but there's only step execution metadata for step 0. I can't even think of how that could have happened, but I'll keep looking as time permits.

I'm still pretty suspicious that you have an expression that's panicking and it's contributing to the problem in some way.

@tal-hason
Contributor Author

I think I saw the panic message in one of the pods somewhere.

Let me see if I can catch it.

@krancour
Member

@tal-hason it would be in the controller pod. And that would be super! Thanks!

@krancour
Member

@tal-hason did you ever find relevant logs?

@tal-hason
Contributor Author

@tal-hason did you ever find relevant logs?

I think it happened again today. I noticed it happens if I click the stage refresh before the promotion steps have updated.

I will try to find it tomorrow.
