
Add script to run and automatically retry tests #542


Merged: 2 commits merged into main from feat/auto-retry on Aug 5, 2025

Conversation

@lfrancke (Member)

This script can be used to run the full test suite and then automatically retry the failing tests a configurable number of times. It can optionally keep the failed namespaces, and it writes all logs to files.

I created this to run tests locally on my machine, but I hope we can also make it usable for CI builds.

I am not the biggest fan of the output, as it is very noisy, but on the other hand I don't want to miss crucial information during debugging.
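At a high level, the retry logic works roughly like this (a simplified sketch; run_test and the exact run-tests invocation are illustrative, not the actual implementation):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_test(test_id: str) -> bool:
    # Stand-in for the script's real test runner; the exact
    # run-tests command line here is an assumption.
    result = subprocess.run(["./scripts/run-tests", "--test", test_id])
    return result.returncode == 0

def retry_tests(failed, attempts_parallel, attempts_serial, max_parallel):
    # Phase 1: retry all failing tests in parallel, up to N attempts each.
    for _ in range(attempts_parallel):
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            results = list(pool.map(run_test, failed))
        failed = [t for t, ok in zip(failed, results) if not ok]
        if not failed:
            return []
    # Phase 2: retry the remaining tests one at a time, so they
    # don't compete with each other for cluster resources.
    return [
        t for t in failed
        if not any(run_test(t) for _ in range(attempts_serial))
    ]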

This will generate output like this:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 1 failed tests:
    1. orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 1 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 1 tests in parallel (max 1 at once)...
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/parallel)...
  ❌ Completed in 10.2m
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false FAILED (attempt 1)

1 tests still failing after parallel retries, starting serial retries...

=== Serial retries for orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false (up to 1 attempts) ===
Running orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars... (attempt 1/serial)... ⏰ Estimated: 10.2m
  ❌ Completed in 10.0m
    📊 Average: 10.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

📊 Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-16 15:01:18
Ended: 2025-07-16 15:31:43
Total Duration: 0:30:25.015686

SUMMARY
----------------------------------------------
Total Tests: 1
Passed: 0
Flaky (eventually passed): 0
Failed: 1
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-liberal-horse
    Average runtime: 10.1m (from 2 runs)
    Last error: failed in step 1-install-zk...

NAMESPACE MANAGEMENT
----------------------------------------------
Namespaces kept for debugging: 1
  - kuttl-test-liberal-horse

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 2
Overall average runtime: 10.1m
Overall median runtime: 10.1m
Fastest test run: 10.0m
Slowest test run: 10.2m

Slowest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

Fastest tests (by average):
  orphaned_resources_nifi-2.4.0,host.k3d.internal_5000_lars_nifi_2.4.0-stackable0.0.0-dev_zookeeper-latest-3.9.3_openshift-false: 10.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json
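For reference, the runtime statistics in the report (average, median, fastest, slowest) boil down to the standard statistics module; a minimal sketch (names are illustrative):

import statistics

def runtime_stats(durations_s: list[float]) -> dict:
    # Summarize the recorded test run durations (in seconds).
    return {
        "runs": len(durations_s),
        "average": statistics.mean(durations_s),
        "median": statistics.median(durations_s),
        "fastest": min(durations_s),
        "slowest": max(durations_s),
    }

# The two recorded runs above (10.0m and 10.2m, i.e. 600s and 612s)
# average out to 606s, which the report prints as 10.1m.
print(runtime_stats([600.0, 612.0]))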

@lfrancke lfrancke self-assigned this Jul 16, 2025
@lfrancke lfrancke moved this to Development: Waiting for Review in Stackable Engineering Jul 16, 2025
@sbernauer (Member)

I think I would prefer to have --keep-failed-namespaces enabled by default.
A namespace is quickly deleted, but you can wait for an hour, come back to the tests, and then have to re-run them because the namespace was deleted.
For CI we can disable it (although it also doesn't hurt there, as the replicated cluster is deleted afterwards).
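If the script uses argparse, flipping the default could be as simple as this (a sketch, assuming argparse and Python 3.9+; BooleanOptionalAction also gives CI a --no-keep-failed-namespaces escape hatch):

import argparse

parser = argparse.ArgumentParser()
# Keep failed namespaces by default; CI could opt out with the
# auto-generated --no-keep-failed-namespaces flag (Python 3.9+).
parser.add_argument(
    "--keep-failed-namespaces",
    action=argparse.BooleanOptionalAction,
    default=True,
)
args = parser.parse_args()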

@lfrancke (Member, Author)

I'm fine with that. I'll see if more comments are added and can then implement it.

@sbernauer (Member)

I modified a smoke test (so that it fails) and ran scripts/auto-retry-tests.py.
Some remarks:

  1. The --keep-failed-namespaces default mentioned above.
  2. It said "Namespace kept for debugging: kuttl-test-square-swift", but in fact it deleted the namespace.
  3. The smoke test failed with requests.exceptions.ConnectionError: HTTPConnectionPool(host='test-opa-server-default-metricssssssssssssssssssss', port=8081): Max retries exceeded with url: /metrics (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f7839ad9af0>: Failed to resolve 'test-opa-server-default-metricssssssssssssssssssss' ([Errno -2] Name or service not known)")). I would have spotted the "bug" much faster if I could see the STDOUT of the test run. On the other hand, I guess you silenced that by choice?
  4. My smoke_opa-1.0.1_openshift-false_attempt_1_parallel.txt only has this content:
INFO:root:Expanding test case id [smoke_opa-1.0.1_openshift-false]
INFO:root:Expanding test case id [smoke_opa-1.4.2_openshift-false]
Traceback (most recent call last):
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/bin/.beku-wrapped", line 9, in <module>
    sys.exit(main())
             ^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/main.py", line 97, in main
    return expand(
           ^^^^^^^
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 332, in expand
    test_case.expand(template_dir, output_dir, namespace)
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 149, in expand
    test_source.build_destination()
  File "/nix/store/dk3mknkr5xfkd6wm4zznhy4h8zjcxv1w-beku-stackabletech-0.0.10-/lib/python3.12/site-packages/beku/kuttl.py", line 79, in build_destination
    with open(dest, encoding="utf8", mode="w") as stream:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'tests/_work/tests/smoke/smoke_opa-1.4.2_openshift-false/01-install-vector-aggregator-discovery-configmap.yaml'
ERROR:root:beku failed

Is this maybe because of multiple processes running in parallel?

@sbernauer sbernauer moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Jul 28, 2025
@lfrancke (Member, Author)

Can you give me the full command you ran and maybe a bit more output?
I applied what I guess is the same patch to the same kuttl test, and my result is:

❯ python scripts/auto-retry-tests.py --parallel 4 --attempts-parallel 1 --attempts-serial 1 --venv venv --keep-failed-namespaces  --output-dir test-results
Starting Automated Test Suite with Retry Logic
==============================================

Step 1: Running initial full test suite...

Step 2: Parsing failed tests...
  Found 3 failed tests:
    1. keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false
    2. smoke_opa-1.4.2_openshift-false
    3. smoke_opa-1.0.1_openshift-false

Step 3: Retrying failed tests...

=== Parallel retries for 3 tests (up to 1 attempts each) ===

--- Parallel attempt 1 ---
Retrying 3 tests in parallel (max 3 at once)...
Running keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_opens... (attempt 1/parallel)...
Running smoke_opa-1.4.2_openshift-false (attempt 1/parallel)...
Running smoke_opa-1.0.1_openshift-false (attempt 1/parallel)...
  ❌ Completed in 5.2m
  ✗ keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false FAILED (attempt 1)
  ❌ Completed in 6.0m
  ✗ smoke_opa-1.4.2_openshift-false FAILED (attempt 1)
  ❌ Completed in 6.0m
  ✗ smoke_opa-1.0.1_openshift-false FAILED (attempt 1)

3 tests still failing after parallel retries, starting serial retries...

=== Serial retries for keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false (up to 1 attempts) - Test 1/3 failing ===
Running keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_opens... (attempt 1/serial)... ⏰ Estimated: 5.2m
  ❌ Completed in 5.0m
    📊 Average: 5.1m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

=== Serial retries for smoke_opa-1.4.2_openshift-false (up to 1 attempts) - Test 2/3 failing ===
Running smoke_opa-1.4.2_openshift-false (attempt 1/serial)... ⏰ Estimated: 6.0m
  ❌ Completed in 5.4m
    📊 Average: 5.7m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

=== Serial retries for smoke_opa-1.0.1_openshift-false (up to 1 attempts) - Test 3/3 failing ===
Running smoke_opa-1.0.1_openshift-false (attempt 1/serial)... ⏰ Estimated: 6.0m
  ❌ Completed in 5.3m
    📊 Average: 5.6m
📁 State saved to: test-results/test_run_state.json
  ✗ Serial attempt 1 FAILED

Step 4: Generating final report...

📊 Generating final report...
================================================================================
AUTOMATED TEST SUITE REPORT
==============================================
Started: 2025-07-30 11:49:43
Ended: 2025-07-30 12:18:46
Total Duration: 0:29:02.730052

SUMMARY
----------------------------------------------
Total Tests: 3
Passed: 0
Flaky (eventually passed): 0
Failed: 3
Success Rate: 0.0%

FAILED TESTS
----------------------------------------------
  ✗ keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-feasible-dory
    Average runtime: 5.1m (from 2 runs)
    Last error: failed in step 4-install-keycloak...

  ✗ smoke_opa-1.4.2_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-top-jaguar
    Average runtime: 5.7m (from 2 runs)
    Last error: failed in step 31-...

  ✗ smoke_opa-1.0.1_openshift-false
    Total attempts: 3
    Namespace kept for debugging: kuttl-test-trusted-colt
    Average runtime: 5.6m (from 2 runs)
    Last error: failed in step 31-...

RUNTIME STATISTICS
----------------------------------------------
Total test runs recorded: 6
Overall average runtime: 5.5m
Overall median runtime: 5.3m
Fastest test run: 5.0m
Slowest test run: 6.0m

Slowest tests (by average):
  smoke_opa-1.4.2_openshift-false: 5.7m
  smoke_opa-1.0.1_openshift-false: 5.6m
  keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false: 5.1m

Fastest tests (by average):
  smoke_opa-1.4.2_openshift-false: 5.7m
  smoke_opa-1.0.1_openshift-false: 5.6m
  keycloak-user-info_opa-latest-1.4.2_keycloak-23.0.1_openshift-false: 5.1m

CONFIGURATION
----------------------------------------------
Parallel: 4
Parallel retry attempts: 1
Serial retry attempts: 1
Keep failed namespaces: True
Virtualenv: venv

Text report saved to: test-results/test_report.txt
Detailed JSON report saved to: test-results/detailed_report.json

And the three namespaces it mentions are still there.

@sbernauer (Member)

I currently don't have any output, but at least I can provide the full command :) scripts/auto-retry-tests.py

> And the three namespaces it mentions are still there.

Yeah, that's because you specified --keep-failed-namespaces. I didn't, so there should be no "Namespace kept for debugging" message.

@lfrancke (Member, Author)

Ahhhh! Now I understand. Thank you. I'll fix that and try the run again.

@lfrancke (Member, Author) commented Aug 1, 2025

I also found the issue (well, you already hinted at it) with the message about the missing directories.
I never saw this because I never look at the "intermediate" log files, only the last ones.

https://github.com/stackabletech/beku.py/pull/7/files
This deletes the generated tests, so it is indeed caused by the parallel runs.

I see two possibilities, and I'm fine with both (or another option I haven't thought of):

  1. Extend the run-tests script to take an output directory as a parameter (currently hardcoded to tests/_work) and then provide it with unique directory names (see the sketch below)
  2. Just remove the parallel runs

Any preferences/ideas?
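For option 1, a minimal sketch of what per-run work directories could look like (--work-dir is a hypothetical parameter name; run-tests would need to grow something like it):

import subprocess
import tempfile
from pathlib import Path

def run_test_isolated(test_id: str, base_dir: str = "tests/_work") -> bool:
    # Give each run its own work directory so parallel beku/kuttl runs
    # can't delete each other's generated test files.
    Path(base_dir).mkdir(parents=True, exist_ok=True)
    work_dir = tempfile.mkdtemp(prefix=f"{test_id}-", dir=base_dir)
    # --work-dir is an assumed parameter; today tests/_work is hardcoded.
    result = subprocess.run(
        ["./scripts/run-tests", "--test", test_id, "--work-dir", work_dir]
    )
    return result.returncode == 0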

@sbernauer (Member)

Option 1 sounds a bit nicer, but I can't judge whether it's worth the effort.

@lfrancke (Member, Author) commented Aug 5, 2025

I have implemented option 1 now, along with other improvements:

3e7c4fb (#542)

- keep-failed-namespaces has been changed to delete-failed-namespaces
  (default false)
- A unique work directory is created for each test run to avoid
  interference
- The logs now contain the exact command that was used to run the tests
- Fixed: the script tried to delete already-deleted namespaces
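For the last point, a tolerant cleanup can let kubectl skip missing namespaces (a sketch; the actual implementation may differ):

import subprocess

def delete_namespace(namespace: str) -> None:
    # --ignore-not-found makes the delete a no-op if a previous attempt
    # (or kuttl itself) already removed the namespace.
    cmd = ["kubectl", "delete", "namespace", namespace, "--ignore-not-found"]
    # Log the exact command so it ends up in the per-test log files.
    print(f"Running: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)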
@sbernauer (Member) left a review comment:

I guess we can just give this a try in CI and see how it goes :)

@lfrancke lfrancke added this pull request to the merge queue Aug 5, 2025
@lfrancke lfrancke moved this from Development: In Review to Development: Done in Stackable Engineering Aug 5, 2025
Merged via the queue into main with commit 816ec99 Aug 5, 2025
2 checks passed
@lfrancke lfrancke deleted the feat/auto-retry branch August 5, 2025 14:27
@lfrancke lfrancke moved this from Development: Done to Done in Stackable Engineering Aug 11, 2025