Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: update the storage-version-migration script to complete execution in finite time #1682

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 15 additions & 11 deletions hack/storage-version-migration.sh
Original file line number Diff line number Diff line change
Expand Up @@ -65,19 +65,23 @@ spec:
ttlSecondsAfterFinished: 600
EOF

JOB_NAME=$(kubectl -n shipwright-build get job --selector app=storage-version-migration-shipwright -o jsonpath='{.items[0].metadata.name}')

while [ "$(kubectl -n shipwright-build get job "${JOB_NAME}" -o json | jq -r '.status.completionTime // ""')" == "" ]; do
echo "[INFO] Storage version migraton job is still running"
sleep 10
done
Comment on lines -70 to -73
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably where the bug is. Per the Job API documentation:

completionTime(Time):

Represents time when the job was completed. It is not guaranteed to be set in happens-before order across separate operations. It is represented in RFC3339 form and is in UTC. The completion time is set when the job finishes successfully, and only then. The value cannot be updated or removed. The value indicates the same or later point in time as the startTime field.

I suspect that in your case, the job is failing. This script doesn't see a completion time because the Job never succeeds, hence it just runs forever.

See https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/job-v1/#JobStatus

NAMESPACE="shipwright-build"
JOB_NAME=$(kubectl -n "${NAMESPACE}" get job --selector app=storage-version-migration-shipwright -o jsonpath='{.items[0].metadata.name}')

isFailed="$(kubectl -n shipwright-build get job "${JOB_NAME}" -o json | jq -r '.status.conditions[] | select(.type == "Failed") | .status')"
while true; do
jobStatus=$(kubectl get job "${JOB_NAME}" -n "${NAMESPACE}" -o=jsonpath='{.status.conditions[*].type}')

if [ "${isFailed}" == "True" ]; then
echo "[ERROR] Storage version migration failed"
kubectl -n shipwright-build logs "job/${JOB_NAME}"
exit 1
fi
if [[ "${jobStatus}" == *"Complete"* ]]; then
echo "Job ${JOB_NAME} has completed successfully!"
exit 0
Comment on lines +75 to +77
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not sufficient. condition.type tells you if the status condition is reporting the Complete or Failed . The Complete and Failed condition types should always be populated in the Conditions array.

You need to inspect the condition.status field, which can be True, False, or Unknown.

elif [[ "${jobStatus}" == *"Failed"* ]]; then
echo "[ERROR] Storage version migration failed"
exit 1
fi

echo "[INFO] Storage version migration job is still running"
sleep 10
done

echo "[DONE]"
Loading