
[Bug] [admin] When using a highly available K8s cluster, the jobId for the same task is the same every time it is executed. #4089

Open
jiangwwwei opened this issue Dec 24, 2024 · 2 comments

jiangwwwei commented Dec 24, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

The logic in Flink's source code: when high availability is enabled and PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID is not configured manually, the jobId is derived from the cluster id by default.

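For illustration, here is a minimal Java sketch of the behavior described above (not Flink's actual implementation; the hash used to fill the JobID is only illustrative): when HA is active and `$internal.pipeline.job-id` is not set, the job id is derived deterministically from the HA cluster id, so a fixed cluster id always yields the same JobID.

```java
// Minimal sketch, not Flink's actual code: derive the "fixed" job id
// from the HA cluster id when $internal.pipeline.job-id is not configured.
import org.apache.flink.api.common.JobID;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;
import org.apache.flink.configuration.PipelineOptionsInternal;

public class FixedJobIdSketch {

    static JobID effectiveJobId(Configuration conf) {
        String configured = conf.getString(PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID);
        if (configured != null) {
            // A manually configured job id wins.
            return JobID.fromHexString(configured);
        }
        // Illustrative derivation: a stable hash of the cluster id fills the
        // lower 64 bits, the upper 64 bits stay zero.
        String clusterId = conf.getString(HighAvailabilityOptions.HA_CLUSTER_ID);
        return new JobID(clusterId.hashCode(), 0L);
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString(HighAvailabilityOptions.HA_CLUSTER_ID.key(), "my-dinky-task"); // fixed task name
        System.out.println(effectiveJobId(conf)); // identical on every "submission"
        System.out.println(effectiveJobId(conf));
    }
}
```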

When Dinky submits a task to Kubernetes, the cluster id is the fixed task name, so the same task produces the same jobId on every execution.

When results are retrieved from the HistoryServer, jobIds that have already been fetched are not fetched again, so the results of newly submitted executions can never be retrieved.
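A simplified sketch of the "skip if already fetched" behavior described above; the class and method names here are illustrative, not Dinky's or Flink's actual code. When the fetch cache is keyed on jobId alone, a re-submission that reuses the old jobId is treated as already processed, and its new result is never picked up.

```java
// Illustrative only: a fetcher that caches by jobId ignores re-submissions
// that happen to reuse the same jobId.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class HistoryServerFetchSketch {
    private final Set<String> fetchedJobIds = new HashSet<>();
    private final Map<String, String> jobStatusById = new HashMap<>();

    /** Pretend to pull one archived job from the HistoryServer. */
    void fetchArchive(String jobId, String archivedStatus) {
        if (!fetchedJobIds.add(jobId)) {
            return; // jobId already cached -> the new execution's archive is skipped
        }
        jobStatusById.put(jobId, archivedStatus);
    }

    public static void main(String[] args) {
        HistoryServerFetchSketch fetcher = new HistoryServerFetchSketch();
        fetcher.fetchArchive("a1b2c3", "CANCELED"); // first execution, archived
        fetcher.fetchArchive("a1b2c3", "FINISHED"); // re-submission reuses the jobId -> ignored
        System.out.println(fetcher.jobStatusById);  // {a1b2c3=CANCELED}
    }
}
```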

In addition, Dinky's JobRefreshHandler overwrites the information of currently running tasks with the failed/canceled state stored in the HistoryServer under the same jobId.

What you expected to happen

The jobId should change for each new instance of the submitted task.
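One possible workaround sketch, based on the option already mentioned above: explicitly set PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID ($internal.pipeline.job-id) to a freshly generated JobID before every submission, so HA no longer derives it from the fixed cluster id. Where exactly Dinky would inject this into the effective configuration is an assumption, not a confirmed fix.

```java
// Sketch of the workaround: generate a new random JobID per submission and pin
// it via $internal.pipeline.job-id, overriding the cluster-id-based derivation.
import org.apache.flink.api.common.JobID;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.PipelineOptionsInternal;

public class PerSubmissionJobId {

    /** Assign a new, random job id for this submission (hypothetical hook). */
    static void assignFreshJobId(Configuration effectiveConfig) {
        effectiveConfig.setString(
                PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID, new JobID().toHexString());
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        assignFreshJobId(conf);
        // Prints a different hex string on every run, even with the same cluster id.
        System.out.println(conf.getString(PipelineOptionsInternal.PIPELINE_FIXED_JOB_ID));
    }
}
```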

How to reproduce

  1. Use a Flink Kubernetes cluster, enable high availability in the cluster configuration, and set jobmanager.archive.fs.dir to the archive directory that the HistoryServer reads from (see the configuration sketch at the end of this section).
  2. Launch the HistoryServer and verify that it is running normally.
  3. Submit the task in Kubernetes application mode and observe that the jobId for the same task remains the same on every submission.

Resulting behavior:

  1. The job information retrieved from the HistoryServer is not updated when each execution completes.
  2. In the Operations Center, all task statuses change to the failed or canceled state recorded in the HistoryServer, which triggers alerts even though the tasks are running normally, and refreshing does not help. Disabling the HistoryServer and refreshing again restores the correct statuses.
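For reference, a sketch of the kind of configuration meant in step 1, with placeholder directory values; the exact keys and values depend on the Flink version and on how Dinky passes cluster configuration.

```java
// Placeholder reproduction config: Kubernetes HA plus a shared archive directory
// that both the JobManager and the HistoryServer point at.
import org.apache.flink.configuration.Configuration;

public class ReproClusterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setString("high-availability", "kubernetes");          // enable Kubernetes HA
        conf.setString("kubernetes.cluster-id", "my-dinky-task");   // Dinky uses the fixed task name
        conf.setString("jobmanager.archive.fs.dir", "hdfs:///flink/completed-jobs");      // JobManager archives here
        conf.setString("historyserver.archive.fs.dirs", "hdfs:///flink/completed-jobs");  // HistoryServer reads here
        System.out.println(conf);
    }
}
```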

Anything else

No response

Version

1.2.0

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@jiangwwwei added the Bug (Something isn't working) and Waiting for reply labels on Dec 24, 2024

Hello @jiangwwwei, this issue is about K8S, so I assign it to @gaoyan1998 and @zackyoungh. If you have any questions, you can comment and reply.



Hello @jiangwwwei, this issue is about CDC/CDCSOURCE, so I assign it to @aiwenmo. If you have any questions, you can comment and reply.

