Some more errors: Could not complete snapshot 2 for operator TelemetryValidationFunction -> (TelemetryRouterFunction -> (ShareEventsFlattenerFunction -> (Sink: share-events-primary-route-sink, Sink: share-items-primary-route-sink), Sink: log-route-sink, Sink: error-route-sink, Sink: audit-route-sink, Sink: audit-events-primary-route-sink, Sink: primary-route-sink), Sink: invalid-events-sink, Sink: duplicate-events-sink) (1/1). Failure reason: Checkpoint was declined.
-
Hi Team, we are using Sunbird 3.1 telemetry (data pipeline only). Since 14th April 2023 the pipeline-preprocessor pod has been failing with the error below:
ERROR org.apache.flink.runtime.blob.BlobServerConnection - Error while executing BLOB connection.
java.io.IOException: Unknown operation 71
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:120)
2023-04-16 17:40:59,041 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/172.16.16.121:60164] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1195725860 - discarded
2023-04-16 17:41:55,671 ERROR org.apache.flink.runtime.blob.BlobServerConnection - Error while executing BLOB connection.
java.io.IOException: Unknown operation 71
at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:120)
2023-04-16 17:41:59,041 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/172.16.16.121:57602] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 1195725860 - discarded
2023-04-16 17:42:50,901 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline checkpoint 4 by task 65a895587513b604421b3b5f34602389 of job 0cbdb0c7eaf5aa80d9775ea6d8c4513e at c877dedc5ddc92ee611648b5fdaa205f @ pipeline-preprocessor-taskmanager-b5bb9df5d-pcr5r (dataPort=42855).
2023-04-16 17:42:50,902 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding checkpoint 4 of job 0cbdb0c7eaf5aa80d9775ea6d8c4513e.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 4 for operator TelemetryValidationFunction -> (TelemetryRouterFunction -> (ShareEventsFlattenerFunction -> (Sink: share-events-primary-route-sink, Sink: share-items-primary-route-sink), Sink: log-route-sink, Sink: error-route-sink, Sink: audit-route-sink, Sink: audit-events-primary-route-sink, Sink: primary-route-sink), Sink: invalid-events-sink, Sink: duplicate-events-sink) (1/1). Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:434)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1403)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1337)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:803)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:86)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:113)
at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:310)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Expiring 101 record(s) for dev.telemetry.audit-1:120000 ms has passed since batch creation
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1225)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.flush(FlinkKafkaProducer.java:974)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:893)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.preCommit(FlinkKafkaProducer.java:99)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.snapshotState(TwoPhaseCommitSinkFunction.java:317)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.snapshotState(FlinkKafkaProducer.java:979)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:402)
... 19 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 101 record(s) for dev.telemetry.audit-1:120000 ms has passed since batch creation
2023-04-16 17:42:50,903 INFO org.apache.flink.runtime.jobmaster.JobMaster - Trying to recover from a global failure.
org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:87)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpointDueToTaskFailure(CheckpointCoordinator.java:1467)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1377)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:719)
at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$5(SchedulerBase.java:819)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2023-04-16 17:42:50,904 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job PipelinePreprocessorJob (0cbdb0c7eaf5aa80d9775ea6d8c4513e) switched from state RUNNING to FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=3, backoffTimeMS=30000)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getGlobalFailureHandlingResult(ExecutionFailureHandler.java:87)
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleGlobalFailure(DefaultScheduler.java:201)
at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyGlobalFailure(UpdateSchedulerNgOnInternalFailuresListener.java:58)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobal(ExecutionGraph.java:1035)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.failGlobalIfExecutionIsStillRunning(ExecutionGraph.java:1015)
at org.apache.flink.runtime.executiongraph.ExecutionGraph$1.lambda$failJobDueToTaskFailure$1(ExecutionGraph.java:473)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:87)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.failPendingCheckpointDueToTaskFailure(CheckpointCoordinator.java:1467)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.discardCheckpoint(CheckpointCoordinator.java:1377)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:719)
at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$declineCheckpoint$5(SchedulerBase.java:819)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
We have rebuilt the Flink image and redeployed it, but we still see the same error as above when tailing the logs.
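As a reading aid for the trace above: the innermost "Caused by" is the Kafka producer expiring a batch, and that message names the topic partition that could not be written to. The sketch below only pulls those fields out of the message text. The regular expression and the function name `parse_expiry` are illustrative assumptions, not part of Flink or Kafka. The 120000 ms figure matches the Kafka producer's default `delivery.timeout.ms`, though the value this job actually uses has not been confirmed here.

```python
import re

# Message copied from the innermost "Caused by" line of the stack trace above.
MESSAGE = ("Expiring 101 record(s) for dev.telemetry.audit-1:"
           "120000 ms has passed since batch creation")

# Hypothetical pattern for the batch-expiry message. The partition is the
# number after the last hyphen of "<topic>-<partition>".
EXPIRY_RE = re.compile(
    r"Expiring (?P<records>\d+) record\(s\) for "
    r"(?P<topic>.+)-(?P<partition>\d+):\s*(?P<elapsed_ms>\d+) ms has passed"
)


def parse_expiry(message):
    """Return the fields of a batch-expiry message, or None if it does not match."""
    match = EXPIRY_RE.search(message)
    if match is None:
        return None
    return {
        "records": int(match["records"]),
        "topic": match["topic"],
        "partition": int(match["partition"]),
        "elapsed_ms": int(match["elapsed_ms"]),
    }


info = parse_expiry(MESSAGE)
print(info)
```

Read this way, the failing write is to partition 1 of `dev.telemetry.audit`, so the leader broker for that partition and its reachability from the taskmanager pod are the first things worth checking.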
root@10:/mydata_ebs_1# kubectl get po -n monitoring
NAME                                                 READY   STATUS    RESTARTS   AGE
grafana-68d45dc447-527ch                             1/1     Running   0          5d7h
grafana-test                                         0/1     Error     0          5d7h
node-exporter-6pzvq                                  1/1     Running   0          5d8h
node-exporter-frm6z                                  1/1     Running   0          5d8h
node-exporter-hfzfh                                  1/1     Running   0          5d8h
node-exporter-k4zhq                                  1/1     Running   0          5d8h
node-exporter-th2vx                                  1/1     Running   0          5d8h
node-exporter-w7vqf                                  1/1     Running   0          5d8h
node-exporter-wzhjb                                  1/1     Running   0          5d8h
node-exporter-x6vzk                                  1/1     Running   0          5d8h
prometheus-alertmanager-0                            1/1     Running   0          5d7h
prometheus-alertmanager-test-connection              0/1     Error     0          5d7h
prometheus-kube-state-metrics-79d6db77b6-j6jcc       1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-5gvpm            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-6ndlf            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-fcsqw            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-fqp2s            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-mh942            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-nst84            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-ps2vx            1/1     Running   0          5d7h
prometheus-prometheus-node-exporter-xsxjl            1/1     Running   0          5d7h
prometheus-prometheus-pushgateway-77dfd5d55c-vwwr7   1/1     Running   0          5d7h
prometheus-server-5cb99c5576-fj8xz                   2/2     Running   0          5d7h
root@10:/mydata_ebs_1# kubectl get po -n flink-live
NAME                                                READY   STATUS    RESTARTS        AGE
pipeline-preprocessor-jobmanager-ctvr2              1/1     Running   4 (5m11s ago)   37m
pipeline-preprocessor-taskmanager-b5bb9df5d-pcr5r   1/1     Running   0               37m
redis-master-0                                      1/1     Running   0               30d
redis-replicas-0                                    1/1     Running   0               30d
redis-replicas-1                                    1/1     Running   0               30d
telemetry-extractor-jobmanager-g2zhw                1/1     Running   1 (32m ago)     38m
telemetry-extractor-taskmanager-6f5d6fcb58-dmtdf    1/1     Running   0               38m
telemetry-extractor-taskmanager-6f5d6fcb58-rpwbm    1/1     Running   0               38m
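A side note on the two recurring messages at the top of the log, which look separate from the checkpoint failure: the numbers in "Unknown operation 71" and "Adjusted frame length exceeds 10485760: 1195725860" can be read as ASCII bytes. The sketch below decodes them. It assumes the frame decoder added its 4-byte length field to the value it reported, so 4 is subtracted first. The helper name `decode_length_prefix` is made up for illustration.

```python
import struct

# Values copied from the log lines above.
ADJUSTED_FRAME_LENGTH = 1195725860  # from the TooLongFrameException
UNKNOWN_OPERATION = 71              # from "Unknown operation 71"

# Assumption: the reported length includes the 4-byte length field itself,
# so subtracting 4 recovers the raw bytes received on the wire.
LENGTH_FIELD_BYTES = 4


def decode_length_prefix(adjusted_length, length_field_bytes=LENGTH_FIELD_BYTES):
    """Interpret the reported frame length as the first four bytes received."""
    raw = adjusted_length - length_field_bytes
    return struct.pack(">I", raw).decode("ascii", errors="replace")


prefix = decode_length_prefix(ADJUSTED_FRAME_LENGTH)
first_byte = chr(UNKNOWN_OPERATION)
print(repr(prefix), repr(first_byte))  # prints 'GET ' 'G'
```

If that reading is right, some client at 172.16.16.121 is sending plain HTTP GET requests to the jobmanager's RPC and BLOB ports once a minute, for example a health check or a metrics scraper pointed at the wrong port. Which client that is has not been confirmed, and these warnings may be unrelated to the Kafka timeout that fails the checkpoints.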