The PostCommit Python Examples Flink job is flaky #32794

Closed
github-actions bot opened this issue Oct 16, 2024 · 11 comments · Fixed by #33135 or #33138

Comments

@github-actions

The PostCommit Python Examples Flink job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Examples_Flink.yml?query=is%3Afailure+branch%3Amaster to see all failed workflow runs.
See also Grafana statistics: http://metrics.beam.apache.org/d/CTYdoxP4z/ga-post-commits-status?orgId=1&viewPanel=8&var-Workflow=PostCommit%20Python%20Examples%20Flink

@liferoad

INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 Caused by: java.io.IOException: Insufficient number of network buffers: required 16, but only 8 available. The total number of network buffers is currently set to 2048 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max'.
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalCreateBufferPool(NetworkBufferPool.java:495)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:468)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.partition.ResultPartitionFactory.lambda$createBufferPoolFactory$1(ResultPartitionFactory.java:379)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.io.network.partition.ResultPartition.setup(ResultPartition.java:158)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:969)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:658)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
INFO     apache_beam.utils.subprocess_server:subprocess_server.py:213 	at java.base/java.lang.Thread.run(Thread.java:829)
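
The error itself points at the Flink network-buffer settings. As a rough illustration (not the fix applied in this issue), one way to check whether the buffer pool is the bottleneck is to raise the keys named in the message in the cluster's flink-conf.yaml and re-submit a small pipeline through Beam's Python FlinkRunner. The buffer values and the localhost Flink master below are assumptions for the sketch, not what the Beam CI jobs use.

```python
# Minimal sketch, assuming a local Flink session cluster whose flink-conf.yaml
# has been given more network memory via the keys quoted in the error above
# (values here are illustrative only):
#
#   taskmanager.memory.network.fraction: 0.2
#   taskmanager.memory.network.min: 256mb
#   taskmanager.memory.network.max: 1gb
#
# With the cluster reconfigured, submit a tiny pipeline to confirm the
# buffer pool is now large enough.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # assumed local session cluster
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["tornado", "rain", "tornado"])
     | "Pair" >> beam.Map(lambda w: (w, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```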

@github-actions

Reopening since the workflow is still flaky

@liferoad

No useful logs from the failed workflows.

@liferoad

Looks like a memory issue:

The node was low on resource: memory. Threshold quantity: 100Mi, available: 74432Ki. Container runner was using 58152720Ki, request is 3Gi, has larger consumption of memory. Container docker was using 43668Ki, request is 0, has larger consumption of memory. 

@liferoad

This keeps failing now on:

2024-11-18T14:26:34.5015302Z apache_beam/examples/cookbook/bigquery_tornadoes_it_test.py::BigqueryTornadoesIT::test_bigquery_tornadoes_it
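
For reference, a hedged local reproduction sketch for the test named above (the path is taken verbatim from the failure; whatever runner and GCP flags the CI job passes are not shown in this thread, so they are omitted rather than guessed):

```python
# Hypothetical local repro; assumes a Beam source checkout with the Python SDK
# test dependencies installed. Run from sdks/python.
import pytest

pytest.main([
    "apache_beam/examples/cookbook/bigquery_tornadoes_it_test.py"
    "::BigqueryTornadoesIT::test_bigquery_tornadoes_it",
    "-v",
])
```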

@liferoad

Even with highmem:

The node was low on resource: memory. Threshold quantity: 100Mi, available: 3568Ki. Container docker was using 42360Ki, request is 0, has larger consumption of memory. Container runner was using 59928716Ki, request is 5Gi, has larger consumption of memory. 

@liferoad

https://github.com/apache/beam/actions/runs/11915340857/job/33205368335 looks good now after switching to the higher-memory machines.

github-actions bot reopened this Jan 12, 2025
@github-actions

Reopening since the workflow is still flaky

@Amar3tto

Reason: highmem-runner-22 has not been available since January 3.

@Amar3tto

Successful after manually deleting the node pool via the console and rerunning Terraform, which recreates it.

@Amar3tto

Job has been stable for 3 days. Closing as resolved.
