
[Bug] The status of batch job is ERROR even the job is executed successfully #5169

Closed
2 of 4 tasks
zhifanggao opened this issue Aug 15, 2023 · 7 comments
Labels
kind:bug This is a clearly a bug priority:major

Comments

@zhifanggao (Contributor)

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

test steps:

  1. Submit a batch job using the REST API:
curl -u "ocdp:112345" --location --request POST 'http://10.19.29.167:30099/api/v1/batches' --header 'Content-Type: application/json' --data '{ "batchType": "Spark", "resource": "hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar", "name": "kyuubi_batch_demo", "className": "org.apache.spark.examples.SparkPi", "conf": {"hive.server2.proxy.user":"ocdp"}}'
  2. Check the Kyuubi server pod:
kyuubi@kyuubi-deployment-example-7c7774d465-9f9xh:/opt/kyuubi$ ps -efl|grep driver
0 S kyuubi     584     1 87  80   0 - 755904 futex_ 19:03 ?       00:00:04 /opt/java/openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED org.apache.spark.deploy.SparkSubmit --conf spark.kyuubi.client.ipAddress=10.19.29.167 --conf spark.kyuubi.batch.resource.uploaded=false --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp --conf spark.executorEnv.SPARK_USER_NAME=ocdp --conf spark.hive.server2.proxy.user=ocdp --conf spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656 --conf spark.app.name=kyuubi_batch_demo --conf spark.kyuubi.session.real.user=ocdp --conf spark.kyuubi.server.ipAddress=0.0.0.0 --conf spark.kyuubi.session.connection.url=0.0.0.0:10099 --conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 --class org.apache.spark.examples.SparkPi --proxy-user ocdp hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
  3. Once the batch completes, check the status:
[root@host-10-19-37-166 ~]# curl -u "ocdp:112345" --location --request GET  'http://10.19.29.167:30099/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656'
{"id":"1342671d-ac56-44cb-862f-7cc4aa9b6656","user":"ocdp","batchType":"SPARK","name":"kyuubi_batch_demo","appStartTime":0,"appId":null,"appUrl":null,"appState":"NOT_FOUND","appDiagnostic":null,"kyuubiInstance":"0.0.0.0:10099","state":"ERROR","createTime":1692097410346,"endTime":1692097443926,"batchInfo":{}}
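For reference, a client would normally poll this endpoint until the batch reaches a terminal state. The sketch below shows that pattern, assuming the response layout above; `wait_for_batch`, the `fetch` callable, and the set of terminal states are illustrative assumptions, not part of the Kyuubi client API.

```python
TERMINAL_STATES = {"FINISHED", "ERROR", "CANCELED"}  # assumed terminal batch states

def wait_for_batch(fetch, max_polls=60):
    """Poll fetch() -- a callable returning the batch JSON as a dict,
    e.g. wrapping GET /api/v1/batches/<id> -- until the batch reaches
    a terminal state."""
    for _ in range(max_polls):
        batch = fetch()
        if batch["state"] in TERMINAL_STATES:
            return batch
    raise TimeoutError("batch did not reach a terminal state in time")

# Example using the terminal response observed in this report:
final = wait_for_batch(lambda: {"state": "ERROR", "appState": "NOT_FOUND"})
print(final["state"])  # ERROR
```

A real client would sleep between polls and issue the GET over HTTP; the point here is that `ERROR` is a terminal state, so polling stops even though the underlying job succeeded.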

The status of the batch job is ERROR, but in fact the batch job executed successfully.

Checking the Kyuubi server logs:

2023-08-15 11:03:30.358 INFO org.apache.kyuubi.session.KyuubiSessionManager: ocdp's session with SessionHandle [1342671d-ac56-44cb-862f-7cc4aa9b6656]/kyuubi_batch_demo is opened, current opening sessions 4
2023-08-15 11:03:30.359 INFO org.apache.kyuubi.operation.BatchJobSubmission: Submitting SPARK batch[1342671d-ac56-44cb-862f-7cc4aa9b6656] job:
/opt/spark/bin/spark-submit \
	--class org.apache.spark.examples.SparkPi \
	--conf spark.hive.server2.proxy.user=ocdp \
	--conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 \
	--conf spark.kyuubi.batch.resource.uploaded=false \
	--conf spark.kyuubi.client.ipAddress=10.19.29.167 \
	--conf spark.kyuubi.server.ipAddress=0.0.0.0 \
	--conf spark.kyuubi.session.connection.url=0.0.0.0:10099 \
	--conf spark.kyuubi.session.real.user=ocdp \
	--conf spark.app.name=kyuubi_batch_demo \
	--conf spark.kubernetes.driver.label.kyuubi-unique-tag=1342671d-ac56-44cb-862f-7cc4aa9b6656 \
	--conf spark.kubernetes.driverEnv.SPARK_USER_NAME=ocdp \
	--conf spark.executorEnv.SPARK_USER_NAME=ocdp \
	--proxy-user ocdp hdfs://host-10-19-29-137:8020/tmp/ocdp/spark-examples_2.12-3.3.2.jar
2023-08-15 11:03:30.361 INFO org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: user=ocdp(auth:BASIC)	ip=10.19.29.167	proxyIp=null	method=POST	uri=/api/v1/batches	params=null	protocol=HTTP/1.1	status=200
2023-08-15 11:03:30.364 INFO org.apache.kyuubi.engine.ProcBuilder: Logging to /opt/kyuubi/work/ocdp/kyuubi-spark-batch-submit.log.3
2023-08-15 11:03:30.372 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 0ms, return UNKNOWN status
2023-08-15 11:03:30.374 INFO org.apache.kyuubi.operation.BatchJobSubmission: Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, Some(ApplicationInfo(null,null,UNKNOWN,None,None))
2023-08-15 11:03:35.390 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 5018ms, return UNKNOWN status
2023-08-15 11:03:40.391 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 10019ms, return UNKNOWN status
2023-08-15 11:03:45.393 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 15021ms, return UNKNOWN status
2023-08-15 11:03:50.394 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 20022ms, return UNKNOWN status
2023-08-15 11:03:58.917 WARN org.apache.kyuubi.engine.KubernetesApplicationOperation: Wait for driver pod to be created, elapsed time: 28545ms, return UNKNOWN status
2023-08-15 11:04:03.920 ERROR org.apache.kyuubi.engine.KubernetesApplicationOperation: Can't find target driver pod by tag: 1342671d-ac56-44cb-862f-7cc4aa9b6656, elapsed time: 33548ms exceeds 30000ms.
2023-08-15 11:04:03.923 INFO org.apache.kyuubi.operation.BatchJobSubmission: Batch report for 1342671d-ac56-44cb-862f-7cc4aa9b6656, Some(ApplicationInfo(null,null,NOT_FOUND,None,None))
2023-08-15 11:04:03.926 INFO org.apache.kyuubi.operation.BatchJobSubmission: Processing ocdp's query[4adbdd40-5cec-42ca-b670-c25bc0d8bd19]: PENDING_STATE -> ERROR_STATE, time taken: 1.692097443926E9 seconds
2023-08-15 11:08:18.960 INFO org.apache.kyuubi.server.http.authentication.AuthenticationAuditLogger: user=ocdp(auth:BASIC)	ip=10.19.29.167	proxyIp=null	method=GET	uri=/api/v1/batches/1342671d-ac56-44cb-862f-7cc4aa9b6656	params=null	protocol=HTTP/1.1	status=200

The Kyuubi server looks up the driver pod tagged with the batch id; when the pod is not found within the timeout, it marks the batch job as ERROR.

But in fact, no driver pod was ever created.
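The log above shows the shape of that server-side decision: repeated UNKNOWN reports while waiting for the pod, then NOT_FOUND once 30000ms elapses. A minimal sketch of that logic, with `resolve_driver_state` and `lookup_pod` as hypothetical stand-ins for the actual K8s query by the `kyuubi-unique-tag` label:

```python
APP_LOOKUP_TIMEOUT_MS = 30_000  # mirrors the 30000ms threshold in the log above

def resolve_driver_state(lookup_pod, now_ms, start_ms):
    """While the driver pod tagged with the batch id is absent, report
    UNKNOWN; once the lookup timeout elapses, report NOT_FOUND, which
    the server then maps to an ERROR batch state."""
    pod = lookup_pod()
    if pod is not None:
        return "FOUND"
    elapsed = now_ms - start_ms
    if elapsed > APP_LOOKUP_TIMEOUT_MS:
        return "NOT_FOUND"
    return "UNKNOWN"
```

With the timings from the log (5018ms elapsed, then 33548ms), this returns UNKNOWN and then NOT_FOUND, matching the WARN/ERROR lines. To check the same condition manually, `kubectl get pods -l kyuubi-unique-tag=<batch-id>` lists any pod carrying the tag.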

Affects Version(s)

1.7.1

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

##################################################################################
#  Kyuubi Configurations
##################################################################################
#
kyuubi.frontend.protocols THRIFT_BINARY,REST
#kyuubi.frontend.rest.bind.host localhost
kyuubi.frontend.rest.bind.port 10099
kyuubi.authentication           NONE
#kyuubi.authentication KERBEROS
#kyuubi.kinit.principal hive/[email protected]
#kyuubi.kinit.keytab /opt/kyuubi/conf/hive.service.keytab
# kyuubi.frontend.bind.host       localhost
# kyuubi.frontend.bind.port       10009
kyuubi.session.engine.initialize.timeout 3000000000
# Set the engine share level to USER
kyuubi.engine.share.level USER
kyuubi.session.engine.idle.timeout PT10H
# Enable HA; here a ZooKeeper cluster external to K8s is used
kyuubi.ha.enabled true
kyuubi.ha.zookeeper.quorum 10.19.37.28:2181
kyuubi.ha.zookeeper.client.port 2181
kyuubi.ha.zookeeper.namespace kyuubi
# Location of the engine jar, accessed from shared S3 storage
kyuubi.session.engine.spark.main.resource local:///opt/spark/work-dir/kyuubi-spark-sql-engine_2.12-1.7.1.jar
# Disable hostname; in 1.5.1, leaving this enabled caused a problem resolving the nameservice (exact cause unknown)
kyuubi.engine.connection.url.use.hostname=false

##################################################################################
#  Spark Configurations
##################################################################################
spark.shuffle.file.buffer 2097151
spark.shuffle.io.backLog 8192
spark.shuffle.io.serverThreads 128
spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf
spark.shuffle.service.enabled false
spark.shuffle.unsafe.file.output.buffer 5m
spark.sql.autoBroadcastJoinThreshold 10214400000
spark.sql.hive.convertMetastoreOrc true
spark.sql.orc.filterPushdown true
spark.sql.orc.impl native
spark.sql.statistics.fallBackToHdfs true
spark.unsafe.sorter.spill.reader.buffer.size 1m
# must use the kryo serializer because the java serializer does not support relocation
spark.serializer org.apache.spark.serializer.KryoSerializer

# celeborn master

# options: hash, sort
# The hash shuffle writer uses (partition count) * (celeborn.push.buffer.size) * (spark.executor.cores) memory.
# The sort shuffle writer uses less memory than the hash shuffle writer; if your shuffle partition count is large, try the sort shuffle writer.

# We recommend setting spark.celeborn.push.replicate.enabled to true to enable server-side data replication.
# If you have only one worker, this setting must be false.

# Support for Spark AQE has only been tested under Spark 3.
# We recommend setting localShuffleReader to false for better Celeborn performance.
spark.sql.adaptive.localShuffleReader.enabled true

# We recommend enabling AQE support for better performance
spark.sql.adaptive.enabled true
spark.sql.adaptive.skewJoin.enabled true
# Hive Metastore configuration
spark.sql.hive.metastore.version 2.3.9
#spark.sql.hive.metastore.jars path
spark.sql.warehouse.dir /warehouse/tablespace/managed/hive
# Spark native K8s configuration
# Specify the master
spark.master=k8s://https://10.19.29.167:6443
# Use cluster deploy mode
spark.submit.deployMode=cluster
# Specify volcano scheduler and PodGroup template
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/opt/kyuubi/conf/podgrp.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.podNamePrefix=kyuubi-ssql
spark.kubernetes.driver.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
spark.kubernetes.executor.podTemplateFile=/opt/kyuubi/conf/hostalias.yaml
# Specify the K8s namespace
spark.kubernetes.namespace=bigdata
# Specify the serviceAccount to use
spark.kubernetes.authenticate.driver.serviceAccountName=kyuubi
# Set the Spark image, pulled automatically from Harbor
spark.kubernetes.container.image=10.19.37.28:8033/bigdata/sparkkyuubi171:v3.3
spark.kubernetes.container.image.pullPolicy=IfNotPresent

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
@zhifanggao zhifanggao added kind:bug This is a clearly a bug priority:major labels Aug 15, 2023
@github-actions

Hello @zhifanggao,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

@zhifanggao zhifanggao changed the title [Bug] The status of batch job is not correct [Bug] The status of batch job is ERROR even the job is executed successfully Aug 16, 2023
@pan3793 (Member) commented Feb 28, 2024

Sorry I missed this issue.

But in fact , no driver pod is created

Why? The configuration you provided indicates that you are going to run the Spark application in K8s cluster mode, so the Driver should be launched in a dedicated Pod.
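One quick sanity check is to inspect the logged spark-submit command: the command in the original report carries no `spark.master` or `spark.submit.deployMode` conf on the command line, so the effective master would have to come from spark-defaults.conf (if any). A sketch, assuming an abridged copy of that command; `extract_confs` is a hypothetical helper, not part of Kyuubi or Spark:

```python
import shlex

def extract_confs(cmd: str) -> dict:
    """Pull --conf key=value pairs out of a spark-submit command line."""
    tokens = shlex.split(cmd)
    confs = {}
    for i, tok in enumerate(tokens):
        if tok == "--conf" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            confs[key] = value
    return confs

# Abridged version of the submit command logged in the original report:
cmd = ("/opt/spark/bin/spark-submit "
      "--conf spark.kyuubi.batch.id=1342671d-ac56-44cb-862f-7cc4aa9b6656 "
      "--conf spark.app.name=kyuubi_batch_demo "
      "--class org.apache.spark.examples.SparkPi app.jar")
confs = extract_confs(cmd)
print("spark.master" in confs)  # False: master not set on the command line
```

If `spark.master` is absent both here and in spark-defaults.conf, the driver is not submitted to K8s at all, which is consistent with no driver pod ever appearing.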

@pan3793 (Member) commented Feb 28, 2024

I roughly remember we had an offline discussion on WeChat, and the root cause is that, for some reason (a configuration issue?), the Spark application ran in local mode rather than K8s cluster mode.

Let me close then, and feel free to re-open if you still have such an issue.

@pan3793 pan3793 closed this as completed Feb 28, 2024
@wardlican commented Feb 28, 2024

Cluster mode has been used, but the correct status of the task still cannot be obtained (screenshot attached).

Here is the spark-submit info:

/usr/local/service/spark/bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.hive.server2.proxy.user=xxx \
        --conf spark.kyuubi.batch.id=68a54f69-7531-4ab8-8841-ea8da161e70b \
        --conf spark.kyuubi.batch.resource.uploaded=false \
        --conf spark.kyuubi.client.ipAddress=xxx \
        --conf spark.kyuubi.client.version=1.7.1 \
        --conf spark.kyuubi.engine.engineLog.path=/usr/local/service/kyuubi/work/hadoop/kyuubi-spark-batch-submit.log.5 \
        --conf spark.kyuubi.server.ipAddress=xxx \
        --conf spark.kyuubi.session.connection.url=xxx:10099 \
        --conf spark.kyuubi.session.real.user=xxx \
        --conf spark.app.name=Kyuubi Spark Pi \
        --conf spark.driver.memory=4g \
        --conf spark.executor.cores=4 \
        --conf spark.executor.memory=8g \
        --conf spark.kubernetes.driver.label.kyuubi-unique-tag=68a54f69-7531-4ab8-8841-ea8da161e70b \
        --conf spark.kubernetes.driver.pod.name=kyuubi-spark-68a54f69-7531-4ab8-8841-ea8da161e70b-driver \
        --conf spark.kubernetes.executor.podNamePrefix=kyuubi-spark-68a54f69-7531-4ab8-8841-ea8da161e70b \
        --conf spark.master=k8s://https://kubernetes.default.svc \
        --conf spark.submit.deployMode=cluster \
        --conf spark.kubernetes.driverEnv.SPARK_USER_NAME=xxx \
        --conf spark.executorEnv.SPARK_USER_NAME=xxx \
        --proxy-user xxx hdfs://xxx/tmp/spark-examples_2.12-3.4.2-TBDS-5.3.1_2023p4-SNAPSHOT.jar 1000

@pan3793 (Member) commented Feb 28, 2024

@wardlican have you checked via kubectl to make sure that the Spark driver pod is launched?

@pan3793 (Member) commented Feb 28, 2024

@wardlican BTW, if possible, please try 1.8.1 or at least 1.7.3, we have made significant improvements for Kyuubi/Spark on K8s in recent versions.

@wardlican

> ...we have made significant improvements for Kyuubi/Spark on K8s in recent versions.

Yes, the pod is normal (screenshot attached).
