Skip to content

Conversation

@Abacn
Copy link
Contributor

@Abacn Abacn commented Oct 24, 2025

Dataflow Streaming Validates Runner Streaming tests setup by gathering all @ValidatesRunnerTest and add --streaming pipeline options. There is a Dataflow Streaming Validates Runner test suites also exercising this test.

Over the time the number of test grows so does execution time. We already increased timeout several times in the past. Here I try to dedup some test cases that

  • already run in batch test suites, and
  • is error handling, or test some streaming mode irrelevant feature, or have a base test case not excluded

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Abacn Abacn force-pushed the dedup-validate-runner branch from 89b0901 to 4325e2b Compare October 24, 2025 14:56
@github-actions github-actions bot added the java label Oct 24, 2025
@Abacn
Copy link
Contributor Author

Abacn commented Oct 24, 2025

cc: @Amar3tto @tvalentyn for #36577 (comment)

@tvalentyn
Copy link
Contributor

cc: @kennknowles

@kennknowles
Copy link
Member

I would prefer a semantic tag that justifies why this is OK. For example,

  • Maybe it doesn't validate the runner but just validates the SDK harness so we can remove the tag.
  • Maybe it is always the same whether it is batch or streaming, so it could be @Category(SameInBatchAndStreaming.class)

But the problem is that the existence of a batch vs streaming mode is actually quite under-the-hood. In fact it would be better for the execution mode to be chosen at finer granularity than "whole job". And if there really is a potentially different codepath in batch and streaming we can't safely disable it.

The real solution to the issue is to combine multiple ValidatesRunner pipelines into a single Dataflow job so that we can run 100 at a time or more. This is just sort of hard to do with JUnit. Or at least it requires hackery that I didn't figure out yet.

If it is a hack to make just Dataflow work better I would rather make it in the Dataflow build.gradle than tag on the test, since it isn't a property of the test.


@Test
@Category({ValidatesRunner.class, UsesSideInputs.class})
@Category({ValidatesRunner.class, UsesSideInputs.class, BatchOnly.class})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since streaming and batch have really different execution models, I would not opt this in to being batch only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, and the name is subject to change. Currently just use if for manual testing


@Test
@Category(ValidatesRunner.class)
@Category({ValidatesRunner.class, BatchOnly.class})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streaming and batch actually have somewhat different Create implementations I think

@kennknowles
Copy link
Member

Another question: How many Dataflow jobs are in parallel? I think at some point we single-threaded these to avoid quota issues, or some other build issue, but we could run with 10+ threads and just need to adjust our quota.

@Abacn
Copy link
Contributor Author

Abacn commented Oct 27, 2025

It's still timing out reducing 30+ tests (should save ~1h). It is hard to check if there was a stuck tests without any log. Currently running on #36634, if there are stuck tests, it will show in Dataflow console as long as its in running state. Therefore checking 8h later.

@Abacn
Copy link
Contributor Author

Abacn commented Oct 27, 2025

Another question: How many Dataflow jobs are in parallel? I think at some point we single-threaded these to avoid quota issues, or some other build issue, but we could run with 10+ threads and just need to adjust our quota.

It was set to default value

maxParallelForks Integer.MAX_VALUE

Checked in Dataflow console running it's effectively 4 jobs in parallel.

but we could run with 10+ threads and just need to adjust our quota.

We'd definitely need much larger quota if bump this value. Currently one of the quota most easily hit is in-use ipv4 addresses on us-central1. ipv4 quota is not easy to get approved. I wonder if we can run validates runner tests (and supported Dataflow test in general) on ipv6 only or disable external networks (could be follow up)

@kennknowles
Copy link
Member

Another question: How many Dataflow jobs are in parallel? I think at some point we single-threaded these to avoid quota issues, or some other build issue, but we could run with 10+ threads and just need to adjust our quota.

It was set to default value

maxParallelForks Integer.MAX_VALUE

Checked in Dataflow console running it's effectively 4 jobs in parallel.

but we could run with 10+ threads and just need to adjust our quota.

We'd definitely need much larger quota if bump this value. Currently one of the quota most easily hit is in-use ipv4 addresses on us-central1. ipv4 quota is not easy to get approved. I wonder if we can run validates runner tests (and supported Dataflow test in general) on ipv6 only or disable external networks (could be follow up)

This seems like a good idea. I'm not sure why we would need external networks. Certainly not for all tests.

Another possibility could be to shard the test suite for Dataflow. I think probably ParDoTest and CombineTest probably have a ton of pipelines and if you separate some "core ValidatesRunner suite" from "extended ValidatesRunner suite" it might reduce flakes and speed up reruns while reducing quota usage.

@Abacn
Copy link
Contributor Author

Abacn commented Oct 27, 2025

I understand what is going on now. The parallelism isn't even. As of now (after 8h+), a single test class ViewTest is running (persumably other tests have finished). There isn't a stuck pipeline. That explains this PR doesn't work in terms of decreasing test running time.

ViewTest has 42 validates runner tests. We need to split it as well. However the pipeline running time definitely elevated recently. Checked example pipeline log it appears pipeline teardown time is increased (it usually takes 4h to startup, then needs 2-5 min from "stopping worker pool" to pipeline finish. At "stopping worker pool" the PipelineResult is still running. I kind of remember previously this was 1.5-3 min-ish.

I think we're good for now in terms of validation (no SDK side regression found)

@Amar3tto
Copy link
Collaborator

СС @tvalentyn

@Abacn Abacn closed this Nov 4, 2025
@Abacn Abacn deleted the dedup-validate-runner branch November 4, 2025 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants