@gaborgsomogyi gaborgsomogyi commented Oct 9, 2025

What is the purpose of the change

The e2e tests are currently failing on Azure: https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=70100

I took a deeper look and found the following: the failures started because the Microsoft-hosted Ubuntu image the pipeline runs on changed overnight (Oct 3 → Oct 4). That update brings a newer toolchain (Ubuntu 22.04.5 + Docker 28.0.4, and a FIPS-hardened Go/kubectl stack). The kube step then crashes (kubectl exits with status 2) and the bash step dies with “STDIO streams did not close…”, so the Flink Kubernetes e2e test never really starts. It's an agent/image regression, not a Flink change.

Brief change log

Bump the vmImage to ubuntu-24.04 for compatibility, and use the latest kubectl.

Verifying this change

Azure run.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Oct 9, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@davidradl
Contributor

@gaborgsomogyi The CI is still failing. Were you expecting this change to fix the "[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x16596ae]"?

@gaborgsomogyi
Contributor Author

The new run appears to be making progress, so it probably won't blow up, but let's see...

@gaborgsomogyi
Contributor Author

There is a test failure but it's unrelated:

Oct 09 11:22:03 11:22:03.127 [ERROR] org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute[union with mixed channels, p = 5, timeout = 0] -- Time elapsed: 300.5 s <<< ERROR!
...
Oct 09 11:22:03 Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.

@1996fanrui any idea? I'm asking because we've just touched the unaligned checkpoint part.

@1996fanrui
Member

1996fanrui commented Oct 9, 2025

I analyzed the log from https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=70122&view=results

It seems the TaskManager crashed because of an NPE, which left insufficient resources for the job, so the failure is not caused by unaligned checkpoints (UC). I think it is a problem with the test itself.

Log:

11:16:57,483 [  Sink: sink (5/5)#3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - FATAL - exception in exception handler of task Sink: sink (5/5)#3 (ab4ac6968943e64acfb5047f3f2ed3fc_2e588ce1c86a9d46e2e85186773ce4fd_4_3).
java.lang.NullPointerException: Cannot read field "numOutput" because "this.state" is null
        at org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase$VerifyingSinkBase.close(UnalignedCheckpointTestBase.java:1050) ~[test-classes/:?]
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:121) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1197) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1101) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$2(Task.java:958) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:973) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$3(Task.java:958) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:794) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:569) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
11:16:57,483 [  Sink: sink (5/5)#3] ERROR org.apache.flink.runtime.minicluster.MiniCluster             [] - TaskManager #2 failed.
java.lang.NullPointerException: Cannot read field "numOutput" because "this.state" is null
        at org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase$VerifyingSinkBase.close(UnalignedCheckpointTestBase.java:1050) ~[test-classes/:?]
        at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:121) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1197) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:1101) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$2(Task.java:958) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:973) ~[flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$3(Task.java:958) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:257) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:83) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127) ~[flink-core-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:794) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:569) [flink-runtime-2.2-SNAPSHOT.jar:2.2-SNAPSHOT]
        at java.base/java.lang.Thread.run(Thread.java:833) [?:?]

@gaborgsomogyi
Contributor Author

Should we add a null check there? 🤔 Until the Azure pipeline is fixed, we can't test it...
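
For reference, a minimal sketch of what such a null check could look like, assuming the NPE happens because close() runs on the cleanup path before the sink's state was ever initialized. Only the names state and numOutput come from the log above; the classes, the other methods, and the reportedOutput field are hypothetical stand-ins, not the actual UnalignedCheckpointTestBase code.

```java
/** Hypothetical stand-in for the sink state; only numOutput is taken from the log. */
final class SinkState {
    long numOutput;
}

/** Simplified, hypothetical sink whose close() may run before the state exists. */
class NullSafeVerifyingSink implements AutoCloseable {
    // Stays null if the task fails or is cancelled before initialization.
    private SinkState state;

    // Hypothetical: the value reported on close, -1 if nothing was reported.
    private long reportedOutput = -1;

    void initializeState() {
        state = new SinkState();
    }

    void invoke(long value) {
        // Count every record that reaches the sink.
        state.numOutput++;
    }

    @Override
    public void close() {
        if (state == null) {
            // Nothing to report; avoid throwing from the cleanup path.
            return;
        }
        reportedOutput = state.numOutput;
    }

    long getReportedOutput() {
        return reportedOutput;
    }
}
```

With a guard like this, closing a sink that never got its state becomes a no-op instead of an exception in the task's cleanup path, which is where the log shows the TaskManager failing. Whether skipping silently is the right behaviour for the real test is a separate question.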

@gaborgsomogyi
Contributor Author

Never mind, now it's passing. Hope that the whole pipeline will be green now...

@gaborgsomogyi
Contributor Author

Now it's passing🎉

Contributor

@ruanhang1993 ruanhang1993 left a comment

Thanks for the fix. LGTM.
