
Add script to run and automatically retry tests #542


Merged: 2 commits merged into main from feat/auto-retry on Aug 5, 2025

Conversation

@lfrancke (Member)

This script can be used to run the full test suite and then automatically retry the failing tests a configurable number of times. It can optionally keep the failed namespaces, and it writes all logs to files.

I created this to run tests locally on my machine, but I hope we can also make it usable for CI builds.

I am not the biggest fan of the output, as it is very noisy, but on the other hand I don't want to miss crucial information during debugging.
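At a high level, the retry logic works roughly like this (a simplified sketch; run_test and the exact run-tests invocation are illustrative, not the actual implementation):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_test(test_id: str) -> bool:
    # Stand-in for the script's real test runner; the exact
    # run-tests command line here is an assumption.
    result = subprocess.run(["./scripts/run-tests", "--test", test_id])
    return result.returncode == 0

def retry_tests(failed, attempts_parallel, attempts_serial, max_parallel):
    # Phase 1: retry all failing tests in parallel, up to N attempts each.
    for _ in range(attempts_parallel):
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            results = list(pool.map(run_test, failed))
        failed = [t for t, ok in zip(failed, results) if not ok]
        if not failed:
            return []
    # Phase 2: retry the remaining tests one at a time, so they
    # don't compete with each other for cluster resources.
    return [
        t for t in failed
        if not any(run_test(t) for _ in range(attempts_serial))
    ]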

This will generate output like this:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 1 failed tests:
    1. orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 1 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 1 tests in parallel (max 1 at once)...
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/parallel)...
  ❌ Completed in 10.2m
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false FAILED (attempt 1)

1 tests still failing after parallel retries, starting serial retries...

=== Serial retries for orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false (up to 1 attempts) ===
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/serial)... ⏰ Estimated: 10.2m
  ❌ Completed in 10.0m
    📊 Average: 10.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

📊 Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-16 15:01:18
Ended: 2025-07-16 15:31:43
Total Duration: 0:30:25.015686

SUMMARY
----------------------------------------------
Total Tests: 1
Passed: 0
Flaky (eventually passed): 0
Failed: 1
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-liberal-horse
    Average runtime: 10.1m (from 2 runs)
    Last error: failed in step 1-install-zk...

NAMESPACE MANAGEMENT
----------------------------------------------
Namespaces kept for debugging: 1
  - kuttl-test-liberal-horse

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 2
Overall average runtime: 10.1m
Overall median runtime: 10.1m
Fastest test run: 10.0m
Slowest test run: 10.2m

Slowest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

Fastest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json
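For reference, the runtime statistics in the report (average, median, fastest, slowest) boil down to the standard statistics module; a minimal sketch (names are illustrative):

import statistics

def runtime_stats(durations_s: list[float]) -> dict:
    # Summarize the recorded test run durations (in seconds).
    return {
        "runs": len(durations_s),
        "average": statistics.mean(durations_s),
        "median": statistics.median(durations_s),
        "fastest": min(durations_s),
        "slowest": max(durations_s),
    }

# The two recorded runs above (10.0m and 10.2m, i.e. 600s and 612s)
# average out to 606s, which the report prints as 10.1m.
print(runtime_stats([600.0, 612.0]))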

@lfrancke lfrancke self-assigned this Jul 16, 2025
@lfrancke lfrancke moved this to Development: Waiting for Review in Stackable Engineering Jul 16, 2025
@sbernauer (Member)

I think I would prefer to have --keep-failed-namespaces enabled by default.
A namespace is quickly deleted, but you can wait for an hour, come back to the tests, and then have to re-run them because the namespace was deleted.
For CI we can disable it (although it also doesn't hurt there, as the replicated cluster is deleted afterwards).
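If the script uses argparse, flipping the default could be as simple as this (a sketch, assuming argparse and Python 3.9+; BooleanOptionalAction also gives CI a --no-keep-failed-namespaces escape hatch):

import argparse

parser = argparse.ArgumentParser()
# Keep failed namespaces by default; CI could opt out with the
# auto-generated --no-keep-failed-namespaces flag (Python 3.9+).
parser.add_argument(
    "--keep-failed-namespaces",
    action=argparse.BooleanOptionalAction,
    default=True,
)
args = parser.parse_args()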

@lfrancke (Member, Author)

I'm fine with that. I'll see if more comments are added and can then implement it.

@sbernauer (Member)

I modified a smoke test (so that it fails) and ran scripts/auto-retry-tests.py.
Some remarks:

  1. The --keep-failed-namespaces default mentioned above.
  2. It said "Namespace kept for debugging: kuttl-test-square-swift", but in fact it deleted the namespace.
  3. The smoke test failed with requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-opa-server-default-metricssssssssssssssssssss', port=8081): Max retries exceeded with url: /metrics (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f7839ad9af0>: Failed to resolve 'test-opa-server-default-metricssssssssssssssssssss' ([Errno -2] Name or service not known)")). I would have spotted the "bug" much faster if I could see the STDOUT of the test run. On the other hand, I guess you silenced that by choice?
  4. My smoke_opa-1.0.1_openshift-false_attempt_1_parallel.txt only has this content:
INFO:root:Expanding test case id [smoke_opa-1.0.1_openshift-false]
INFO:root:Expanding test case id [smoke_opa-1.4.2_openshift-false]
Traceback (most recent call last):
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/bin/.beku-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/main.py", line 97, in main
    return expand(
           ^^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 332, in expand
    test_case.expand(template_dir, output_dir, namespace)
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 149, in expand
    test_source.build_destination()
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 79, in build_destination
    with open(dest, encoding="utf8", mode="w") as stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'tests/_work/tests/smoke/smoke_opa-1.4.2_openshift-false/01-install-vector-aggregator-discovery-configmap.yaml'
ERROR:root:beku failed

Is this maybe because of multiple processes running in parallel?

@sbernauer sbernauer moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Jul 28, 2025
@lfrancke (Member, Author)

Can you give me the full command you ran and maybe a bit more output?
I applied what I guess is the same patch to the same kuttl test, and my result is:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 3 failed tests:
    1. keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false
    2. smoke_opa-1.4.2_openshift-false
    3. smoke_opa-1.0.1_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 3 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 3 tests in parallel (max 3 at once)...
Running keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_opens... (attempt 1/parallel)...
Running smoke_opa-1.4.2_openshift-false (attempt 1/parallel)...
Running smoke_opa-1.0.1_openshift-false (attempt 1/parallel)...
  ❌ Completed in 5.2m
  ✗ keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false FAILED (attempt 1)
  ❌ Completed in 6.0m
  ✗ smoke_opa-1.4.2_openshift-false FAILED (attempt 1)
  ❌ Completed in 6.0m
  ✗ smoke_opa-1.0.1_openshift-false FAILED (attempt 1)

3 tests still failing after parallel retries, starting serial retries...

=== Serial retries for keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false (up to 1 attempts) - Test 1/3 failing ===
Running keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_opens... (attempt 1/serial)... ⏰ Estimated: 5.2m
  ❌ Completed in 5.0m
    📊 Average: 5.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

=== Serial retries for smoke_opa-1.4.2_openshift-false (up to 1 attempts) - Test 2/3 failing ===
Running smoke_opa-1.4.2_openshift-false (attempt 1/serial)... ⏰ Estimated: 6.0m
  ❌ Completed in 5.4m
    📊 Average: 5.7m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

=== Serial retries for smoke_opa-1.0.1_openshift-false (up to 1 attempts) - Test 3/3 failing ===
Running smoke_opa-1.0.1_openshift-false (attempt 1/serial)... ⏰ Estimated: 6.0m
  ❌ Completed in 5.3m
    📊 Average: 5.6m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-30 11:49:43
Ended: 2025-07-30 12:18:46
Total Duration: 0:29:02.730052

SUMMARY
----------------------------------------------
Total Tests: 3
Passed: 0
Flaky (eventually passed): 0
Failed: 3
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-feasible-dory
    Average runtime: 5.1m (from 2 runs)
    Last error: failed in step 4-install-keycloak...

  ✗ smoke_opa-1.4.2_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-top-jaguar
    Average runtime: 5.7m (from 2 runs)
    Last error: failed in step 31-...

  ✗ smoke_opa-1.0.1_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-trusted-colt
    Average runtime: 5.6m (from 2 runs)
    Last error: failed in step 31-...

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 6
Overall average runtime: 5.5m
Overall median runtime: 5.3m
Fastest test run: 5.0m
Slowest test run: 6.0m

Slowest tests (by average):
  smoke_opa-1.4.2_openshift-false: 5.7m
  smoke_opa-1.0.1_openshift-false: 5.6m
  keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false: 5.1m

Fastest tests (by average):
  smoke_opa-1.4.2_openshift-false: 5.7m
  smoke_opa-1.0.1_openshift-false: 5.6m
  keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false: 5.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json

And the three namespaces it mentions are still there.

@sbernauer (Member)

I currently don't have any output, but at least I can provide the full command :) scripts/auto-retry-tests.py

> And the three namespaces it mentions are still there.

Yeah, that's because you specified --keep-failed-namespaces. I didn't, so there should be no "Namespace kept for debugging" message.

@lfrancke (Member, Author)

Ahhhh! Now I understand. Thank you. I'll fix that and try the run again.

@lfrancke (Member, Author) commented Aug 1, 2025

I also found the issue (well, you already hinted at it) with the message about the missing directories.
I never saw this because I never look at the "intermediate" log files, only the last ones.

https://github.com/stackabletech/beku.py/pull/7/files
This deletes the generated tests, so it is indeed caused by the parallel runs.

I see two possibilities, and I'm fine with both (or another option I haven't thought of):

  1. Extend the run-tests script to take an output directory as a parameter (currently hardcoded to tests/_work) and then provide it with unique directory names (see the sketch below)
  2. Just remove the parallel runs

Any preferences/ideas?
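For option 1, a minimal sketch of what per-run work directories could look like (--work-dir is a hypothetical parameter name; run-tests would need to grow something like it):

import subprocess
import tempfile
from pathlib import Path

def run_test_isolated(test_id: str, base_dir: str = "tests/_work") -> bool:
    # Give each run its own work directory so parallel beku/kuttl runs
    # can't delete each other's generated test files.
    Path(base_dir).mkdir(parents=True, exist_ok=True)
    work_dir = tempfile.mkdtemp(prefix=f"{test_id}-", dir=base_dir)
    # --work-dir is an assumed parameter; today tests/_work is hardcoded.
    result = subprocess.run(
        ["./scripts/run-tests", "--test", test_id, "--work-dir", work_dir]
    )
    return result.returncode == 0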

@sbernauer (Member)

Option 1 sounds a bit nicer, but I can't judge whether it's worth the effort.

@lfrancke (Member, Author) commented Aug 5, 2025

I have implemented option 1 now, along with other improvements:

3e7c4fb (#542)

- keep-failed-namespaces has been changed to delete-failed-namespaces
  (default false)
- A unique work directory is created for each test run to avoid
  interference
- The logs now contain the exact command that was used to run the tests
- Fixed: the script tried to delete already-deleted namespaces
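For the last point, a tolerant cleanup can let kubectl skip missing namespaces (a sketch; the actual implementation may differ):

import subprocess

def delete_namespace(namespace: str) -> None:
    # --ignore-not-found makes the delete a no-op if a previous attempt
    # (or kuttl itself) already removed the namespace.
    cmd = ["kubectl", "delete", "namespace", namespace, "--ignore-not-found"]
    # Log the exact command so it ends up in the per-test log files.
    print(f"Running: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)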
@sbernauer (Member) left a review comment:

I guess we can just give this a try in CI and see how it goes :)

@lfrancke lfrancke added this pull request to the merge queue Aug 5, 2025
@lfrancke lfrancke moved this from Development: In Review to Development: Done in Stackable Engineering Aug 5, 2025
Merged via the queue into main with commit 816ec99 Aug 5, 2025
2 checks passed
@lfrancke lfrancke deleted the feat/auto-retry branch August 5, 2025 14:27
@lfrancke lfrancke moved this from Development: Done to Done in Stackable Engineering Aug 11, 2025