WIP: system test parallelization: two-pass approach #23275

edsantiago · 2024-07-15T12:16:39Z

Split system tests into two: those that can be run in
parallel, and those that can't. Run tests in two passes.
This requires eliminating the per-test leak check and
teardown. I think that's okay.
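As a rough sketch of what "two passes" could look like on the command line (not necessarily how this PR wires it up), the function below prints one bats invocation for the parallel-safe tests and one for the rest. It assumes a bats-core version that supports `--jobs` and `--filter-tags`, and it uses the `ci:parallel` tag name that the discussion further down settles on. The function name `two_pass_commands`, the default directory, and the default job count are illustrative only.

```shell
# Print (do not run) the two bats invocations for a two-pass run.
# Assumptions: bats-core with --jobs and --filter-tags support;
# tag name, directory and job count are illustrative.
two_pass_commands() {
    _dir=${1:-test/system}
    _jobs=${2:-4}
    _neg='!ci:parallel'
    # Pass 1: only tests tagged ci:parallel, run concurrently.
    printf 'bats --jobs %s --filter-tags ci:parallel %s\n' "$_jobs" "$_dir"
    # Pass 2: every test without that tag, one at a time.
    printf "bats --filter-tags '%s' %s\n" "$_neg" "$_dir"
}
```

Printing the commands rather than executing them keeps the sketch safe to try; piping the output to `sh` would run both passes in order.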

Tests that can run in parallel:

- use unique container/pod/volume/network names
  - bonus: added a way to track names to their test,
    so the leak test at end can be useful
- do not run 'podman rm -a' or 'rmi -a'
- do not run 'podman ps/images' and expect precise output
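To illustrate the "track names to their test" idea, here is a minimal sketch of a naming scheme that embeds the test number, modeled on the shape of names that appear in the logs later in this thread (for example `m_t114-lgdlrt8i`). Both helper names, `unique_name` and `test_for_name`, are hypothetical and are not the PR's actual helpers. `BATS_SUITE_TEST_NUMBER` is the bats-provided index of the current test across the whole suite.

```shell
# Hypothetical helper: build "<kind>_t<testnum>-<random>" so that a
# leaked resource can be traced back to the test that created it.
unique_name() {
    _kind=${1:-c}
    _num=${2:-${BATS_SUITE_TEST_NUMBER:-0}}
    # 4 random bytes rendered as 8 hex characters
    _rand=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')
    printf '%s_t%s-%s\n' "$_kind" "$_num" "$_rand"
}

# Hypothetical helper: recover the test number from such a name.
# Assumes the kind prefix itself does not contain "_t".
test_for_name() {
    _rest=${1#*_t}                     # drop everything through "_t"
    [ "$_rest" != "$1" ] || return 1   # no "_t" marker found
    _n=${_rest%%-*}                    # keep the part before the first "-"
    case "$_n" in
        ''|*[!0-9]*) return 1 ;;       # not a plain number
    esac
    printf '%s\n' "$_n"
}
```

With names like these, a leak check that runs once at the end of the suite can report which test number left a container, pod, volume, or network behind.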

Signed-off-by: Ed Santiago [email protected]

None

openshift-ci · 2024-07-15T12:16:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [edsantiago]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

edsantiago · 2024-07-15T12:17:49Z

THIS IS NOT EVEN CLOSE TO DREAMING ABOUT MERGING!

@Luap99 I think this approach holds promise. I would like to spend some time pursuing it. Before I do so, WDYT?

Luap99 · 2024-07-15T13:52:53Z

> @Luap99 I think this approach holds promise. I would like to spend some time pursuing it. Before I do so, WDYT?

Just some quick thoughts, will add more once I am back.

Syntax-wise this seems better, as there were many files I could not run in parallel because of one or two tests. Assuming this now runs things in parallel across files as well, it should utilize the CPU better.

However I do not see how this addresses the functional issues from my PR.
How are we going to debug flakes? There is nothing in the logs for timings etc... It is practically impossible to correlate problematic test interactions (this can be done well in e2e tests as we have a full log with timings)
Reviewing tests for possible conflicts will be hard and we will fail from time to time causing extra flakes.

I think there are nice gains here, but honestly I am no longer sure that the ongoing maintenance will not cause too much work for all maintainers.

edsantiago · 2024-07-16T19:06:49Z

Okay..... I'm really favorably impressed with this approach. The two-pass requirement sucks, and debugging failures is really hard, but I think the benefits (so far) are outweighing those negatives. Running lots and lots of different tests in parallel, not just from one file, is finding a lot of bugs.

CI is likely to fail because of #23282. This is still very much a WIP. My plan is to break out much of the safename work, commit that separately in individual reviewable PRs, in order to minimize the changes in this one.

packit-as-a-service · 2024-07-16T19:33:34Z

Cockpit tests failed for commit e79fca479320156e76d577882013964f92c10282. @martinpitt, @jelly, @mvollmer please check.

Luap99 · 2024-07-17T16:20:20Z

@edsantiago Ok let's do this then. I will try to fix all the related podman bugs which you reported in the next days.

Luap99 · 2024-07-18T11:37:01Z

re tag name:
I would prefer parallel over para as this makes it more clear to readers. And I don't see a problem if the tag name is a bit longer.

edsantiago · 2024-07-18T11:46:01Z

Full name: my concern is typos. I know that we'll get occasional "parralel" or "parrallel" misspellings and those are hard to catch in review. I've been letting my brain think about this in the background and still haven't come up with any ideas.

The other consideration is a string that's easily greppable in source code and command-line history. ^Rpara (for rerunning tests) is pretty useless. Maybe ci:parallel and just try really hard to catch typos in review?

Luap99 · 2024-07-18T12:25:14Z

> Maybe ci:parallel and just try really hard to catch typos in review?

ci:parallel SGTM. Another reason for something like codespell to be part of the actual CI checks.
I am not too concerned about typos; it is not like they would break anything. Also, most people likely copy the thing from another test and would not really think about it too much anyway, I think.
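One way the typo concern could be handled mechanically, sketched here as an assumption rather than anything this PR implements: scan the bats tag comments and report any `ci:` tag that is not spelled exactly `ci:parallel`. The function name `find_bad_ci_tags` is hypothetical. The `# bats test_tags=` and `# bats file_tags=` comment forms are the bats-core tagging syntax.

```shell
# Hypothetical lint: read .bats source on stdin and print every "ci:"
# tag that is not exactly "ci:parallel", one per line.
# Exits nonzero when nothing suspicious is found (grep semantics).
find_bad_ci_tags() {
    grep -E '^# bats (test|file)_tags=' \
        | sed -e 's/^[^=]*=//' \
        | tr ',' '\n' \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep '^ci:' \
        | grep -vx 'ci:parallel'
}
```

A call such as `cat test/system/*.bats | find_bad_ci_tags` would then surface misspellings like `ci:parralel`. It only catches typos after the `ci:` prefix, so a mistyped prefix would still need review or a spell checker.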

Luap99 · 2024-07-18T12:27:35Z

In general it would be good to get some docs in test/system/README.md that describe how this parallel mode works and which tests can/cannot run in parallel (--all, --latest, output checks like podman ps empty output, etc.)
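Until such docs exist, a crude review aid along these lines might help spot the patterns named above. This is only a heuristic sketch: the function name `parallel_unsafe_lines` is hypothetical, and it will produce false positives (for instance `-a` used as an attach flag) as well as miss output-precision problems entirely.

```shell
# Hypothetical heuristic: read .bats source on stdin and print, with line
# numbers, podman invocations that use -a, --all or --latest.
# These act on every container/image/pod, not just the test's own.
parallel_unsafe_lines() {
    grep -nE 'podman.*[[:space:]](-a|--all|--latest)([[:space:]]|$)'
}
```

Each reported line is a prompt for a human to check whether the test can really share a host with other tests, not a verdict.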

Luap99 · 2024-07-18T12:31:29Z

Also another flake I saw locally.

   [14:18:02.492492892] $ /home/pholzing/go/src/github.com/containers/podman/bin/podman __completeNoDesc  system connection remove arg
   [14:18:02.522931040] m_t114-lgdlrt8i
   m_t114-lgdlrt8i-root
   :4
   Completion ended with directive: ShellCompDirectiveNoFileComp
   #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
   #|     FAIL: Unexpected non-Debug output line: m_t114-lgdlrt8i
   #| expected: \[Debug\]
   #|   actual: m_t114-

I know what is wrong with that and will do another PR to fix that.

Luap99 · 2024-07-18T12:51:40Z

Fix in #23326

In theory, when syslog is set, the cleanup process should log its errors to syslog (journald) so we can have a look at the errors in CI. Without it, podman container cleanup errors will never be logged anywhere. This is an attempt to debug containers#21569. Signed-off-by: Paul Holzinger <[email protected]>

openshift-ci bot added the release-note-none and do-not-merge/work-in-progress labels Jul 15, 2024

openshift-ci bot added the approved label Jul 15, 2024

edsantiago marked this pull request as draft July 15, 2024 12:16

edsantiago force-pushed the bats-parallel branch 3 times, most recently from f456c95 to e79fca4 Compare July 16, 2024 19:03

edsantiago force-pushed the bats-parallel branch from e79fca4 to 8cf6051 Compare July 16, 2024 23:01

Luap99 mentioned this pull request Jul 17, 2024

test/system: Add a test case for automount with multi images #23301

Merged

edsantiago mentioned this pull request Jul 17, 2024

completion: container diff: no such pod #23282

Closed

edsantiago force-pushed the bats-parallel branch from 8cf6051 to 0c5bb35 Compare July 17, 2024 16:01

edsantiago force-pushed the bats-parallel branch from 0c5bb35 to f6b6178 Compare July 17, 2024 18:04

edsantiago mentioned this pull request Jul 17, 2024

podman auto-update/system df: fix ErrNoSuchCtr/Volume race #23305

Merged

edsantiago force-pushed the bats-parallel branch 2 times, most recently from dc66bb4 to 477cbe0 Compare July 18, 2024 11:33

github-actions bot added the machine label Jul 18, 2024

edsantiago mentioned this pull request Jul 18, 2024

pkg/machine/compression: skip decompress bar for empty file #23323

Merged

edsantiago force-pushed the bats-parallel branch from 477cbe0 to b48e602 Compare July 18, 2024 16:01

edsantiago force-pushed the bats-parallel branch 7 times, most recently from f795830 to a04f8a9 Compare November 7, 2024 16:05

edsantiago mentioned this pull request Nov 7, 2024

kube SIGINT system test: fix race in timeout handling #24496

Merged

edsantiago force-pushed the bats-parallel branch 2 times, most recently from aee9d3c to d5c1a9a Compare November 7, 2024 21:39

edsantiago mentioned this pull request Nov 9, 2024

system tests: safer install_kube_template() #24515

Merged

edsantiago force-pushed the bats-parallel branch from d5c1a9a to f1a833c Compare November 11, 2024 12:46

edsantiago and others added 17 commits November 13, 2024 04:38

EXPERIMENTAL! In teardown, if we see leaks, ...

601768c

...try to trace them back to the culprit tests Signed-off-by: Ed Santiago <[email protected]>

(debug) for 21569: log play-kube command, show at end

467eac7

Signed-off-by: Ed Santiago <[email protected]>

DO NOT MERGE: skip unneeded CI tasks

c9d2f61

All we care about in this PR is system tests. Signed-off-by: Ed Santiago <[email protected]>

FIXME: update docs, teardown, ...

59586e3

Signed-off-by: Ed Santiago <[email protected]>

DO NOT MERGE: test tail logging fix from Luap99 fork

1364695

Luap99@df865c8 Signed-off-by: Ed Santiago <[email protected]>

FIXME-debug for k8s-file test

c9d3afe

Signed-off-by: Ed Santiago <[email protected]>

why is USEC test failing

dc50919

Signed-off-by: Ed Santiago <[email protected]>

try parallelizing the USEC test again

4c51c40

Signed-off-by: Ed Santiago <[email protected]>

220: add a FIXME comment about a race

c735206

Signed-off-by: Ed Santiago <[email protected]>

fixmeup-255 debugging

7164765

Signed-off-by: Ed Santiago <[email protected]>

unparallelize usec

5154b3d

Signed-off-by: Ed Santiago <[email protected]>

FIXME debugs for ns leak

b1d30e6

Signed-off-by: Ed Santiago <[email protected]>

FIXME: rmi pause. Is there a better place to do this?

535ceae

Signed-off-by: Ed Santiago <[email protected]>

Add lots more parallel-high-load FIXMEs

740f69c

Signed-off-by: Ed Santiago <[email protected]>

fixme, just a change to a skip msg

9c11490

Signed-off-by: Ed Santiago <[email protected]>

parallelize 010

a47a367

Signed-off-by: Ed Santiago <[email protected]>

edsantiago force-pushed the bats-parallel branch from f1a833c to a47a367 Compare November 13, 2024 11:39
