
[SPARK-25442][SQL][K8S] Support STS to run in k8s deployments with spark deployment mode as cluster. #22433


Closed
wants to merge 11 commits

Conversation

suryag10

What changes were proposed in this pull request?

The code is enhanced to allow the STS (Spark Thrift Server) to run in Kubernetes deployments with the Spark deploy mode set to cluster.

How was this patch tested?

Started the STS in cluster mode in a K8s deployment and was able to run some queries using the Beeline client.

@suryag10 suryag10 changed the title Support STS to run in k8s cluster mode Support STS to run in k8s deployment modes with spark deployment mode as cluster. Sep 16, 2018
@suryag10 suryag10 changed the title Support STS to run in k8s deployment modes with spark deployment mode as cluster. Support STS to run in k8s deployments with spark deployment mode as cluster. Sep 16, 2018
@ifilonenko
Contributor

test this please

@SparkQA

SparkQA commented Sep 16, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/3135/

@dongjoon-hyun
Member

Thank you for your first contribution, @suryag10 .

  • Could you file a SPARK JIRA issue since this is a code change?
  • Could you update the PR title like the other PRs? e.g. [SPARK-XXX][SQL][K8S] ...?

And, just out of curiosity, do we need this change?

- exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift JDBC/ODBC Server" "$@"
+ exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift-JDBC-ODBC-Server" "$@"

@suryag10
Author

Thank you for your first contribution, @suryag10 .

  • Could you file a SPARK JIRA issue since this is a code change?
    Sure.
  • Could you update the PR title like the other PRs? e.g. [SPARK-XXX][SQL][K8S] ...?
    Sure.

And, just out of curiosity, do we need this change?

- exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift JDBC/ODBC Server" "$@"
+ exec "${SPARK_HOME}"/sbin/spark-daemon.sh submit $CLASS 1 --name "Thrift-JDBC-ODBC-Server" "$@"

Without the above change, it fails to start the driver pod as well. Spaces and "/" are not allowed in a "name" in the Kubernetes world.
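For illustration, the kind of name translation being discussed could be sketched in shell. The function name and the exact substitution rules below are hypothetical, not Spark's actual code:

```shell
# Hypothetical sketch: fold an arbitrary Spark app name into a
# DNS-1123-friendly pod name. Lowercase everything, turn spaces and
# '/' into '-', drop other disallowed characters, trim the edges.
sanitize_k8s_name() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's#[ /]#-#g' \
          -e 's/[^a-z0-9.-]//g' \
          -e 's/^[-.]*//' \
          -e 's/[-.]*$//'
}

sanitize_k8s_name "Thrift JDBC/ODBC Server"   # -> thrift-jdbc-odbc-server
```

This is roughly what renaming "Thrift JDBC/ODBC Server" to "Thrift-JDBC-ODBC-Server" accomplishes by hand in the diff above.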

@suryag10 suryag10 changed the title Support STS to run in k8s deployments with spark deployment mode as cluster. [SPARK-25442][SQL][K8S] Support STS to run in k8s deployments with spark deployment mode as cluster. Sep 16, 2018
@mridulm
Contributor

mridulm commented Sep 16, 2018

Does it fail in k8s, or does the Spark k8s code error out?
If the former, why not fix "name" handling in k8s to replace unsupported characters?

@suryag10
Author

Does it fail in k8s, or does the Spark k8s code error out?
If the former, why not fix "name" handling in k8s to replace unsupported characters?

Following is the error seen without the fix:
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://k8s-apiserver.bcmt.cluster.local:8443/api/v1/namespaces/default/pods. Message: Pod "thrift jdbc/odbc server-1537079590890-driver" is invalid: metadata.name: Invalid value: "thrift jdbc/odbc server-1537079590890-driver": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'). Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, message=Invalid value: "thrift jdbc/odbc server-1537079590890-driver": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=thrift jdbc/odbc server-1537079590890-driver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "thrift jdbc/odbc server-1537079590890-driver" is invalid: metadata.name: Invalid value: "thrift jdbc/odbc server-1537079590890-driver": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

This is not specific to Kubernetes, but rather a generic DNS (DNS-1123) naming requirement.
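The rule quoted in the error can be checked directly. This grep-based helper is only an illustration of the DNS-1123 subdomain pattern from the exception message, not code from this PR:

```shell
# Illustrative check of the DNS-1123 subdomain rule quoted in the error:
# lowercase alphanumerics, '-' or '.', starting and ending with an
# alphanumeric character.
is_dns1123_subdomain() {
  printf '%s' "$1" \
    | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}

is_dns1123_subdomain "thrift jdbc/odbc server-1537079590890-driver" \
  || echo "invalid"   # spaces and '/' are rejected
is_dns1123_subdomain "thrift-jdbc-odbc-server-1537079590890-driver" \
  && echo "valid"
```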

@SparkQA

SparkQA commented Sep 16, 2018

Test build #96105 has finished for PR 22433 at commit 3a7fa57.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Sep 16, 2018

It is an implementation detail of the k8s integration that the application name is expected to be DNS compliant ... Spark does not have that requirement, and yarn/mesos/standalone/local work without this restriction.
The right fix in the k8s integration would be to sanitize the name specified by the user/application to be compliant with its requirements. This will help not just the thrift server, but any Spark application.

@suryag10
Author

It is an implementation detail of the k8s integration that the application name is expected to be DNS compliant ... Spark does not have that requirement, and yarn/mesos/standalone/local work without this restriction.
The right fix in the k8s integration would be to sanitize the name specified by the user/application to be compliant with its requirements. This will help not just the thrift server, but any Spark application.

As this script is the common starting point for all the resource managers (k8s/yarn/mesos/standalone/local), I guess changing it to fit all the cases adds value, instead of doing it at each resource manager level. Thoughts?

@erikerlandson
Contributor

I'm wondering, is there some reason this isn't supported in cluster mode for yarn & mesos? Or put another way, what is the rationale for k8s being added as an exception to this rule?

@jacobdr

jacobdr commented Sep 16, 2018

a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.'

Your changes to the name handling don't comply with this, so I agree with @mridulm that you should move this change elsewhere and more broadly support name validation/sanitization for submitted applications in Kubernetes.

@mridulm
Contributor

mridulm commented Sep 16, 2018

As this script is the common starting point for all the resource managers (k8s/yarn/mesos/standalone/local), I guess changing it to fit all the cases adds value, instead of doing it at each resource manager level. Thoughts?

Please note that I am specifically referring only to the need for changing the application name.
The stated rationale, that the name should be DNS compliant, is a restriction specific to k8s and not to Spark.
Instead of doing one-off renames, the right approach would be to handle this name translation so that it benefits not just STS, but any user application.

@suryag10
Author

I'm wondering, is there some reason this isn't supported in cluster mode for yarn & mesos? Or put another way, what is the rationale for k8s being added as an exception to this rule?

I don't know the specific reason why this was not supported in YARN and Mesos. The initial contributions to Spark on K8s started with cluster mode (with a restriction against client mode). So this PR enhances the code so that STS can run in k8s deployments in Spark cluster mode. (In the latest Spark code I have observed that client mode also works; I need to cross-verify this.)

@liyinan926
Contributor

Agreed with @mridulm that the naming restriction is specific to k8s and should be handled in a k8s specific way, e.g., somewhere around https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L208.

@suryag10
Author

Agreed with @mridulm that the naming restriction is specific to k8s and should be handled in a k8s specific way, e.g., somewhere around https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L208.

OK, will update the PR accordingly.

@suryag10
Author


Hi, handling of this conversion is already present at

https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L259

I have reverted the change in the start-thriftserver.sh file. Please review and merge.

@suryag10
Author

@mridulm @liyinan926 @jacobdr @ifilonenko
A code check for space and "/" handling is already present at https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L259

I have reverted the fix in start-thriftserver.sh. Please review and merge.

@nrchakradhar

This PR is now the same as PR-20272.
The conversation in PR-20272 has some useful information which could be included in the Spark documentation.
Also, it's good to mention that STS will not work with dynamicAllocation, as shuffle support is not yet available.

@suryag10
Author

Can somebody please review and merge?

@erikerlandson
Contributor

In the scenario of a cluster-mode submission, what is the command-line behavior? Does the thrift-server script "block" until the thrift server pod is shut down?

@erikerlandson
Contributor

If possible, there should be some basic integration testing. Run a thrift server command against the minishift cluster used by the other testing.

@suryag10
Author

@liyinan926

The script may be run from a client machine outside a k8s cluster. In this case, there's not even a pod. I would suggest separating the explanation of the user flow details by the deploy mode (client vs cluster).

STS is a server, and the best way to deploy it in a K8s cluster is either through a Helm chart or through a YAML file. (It can also be done through the method you suggested, but I guess that scenario would be a rare case, and there would be no HA for the STS server if it is triggered from outside.)

@suryag10 suryag10 closed this Oct 22, 2018
@suryag10
Author

In the scenario of a cluster-mode submission, what is the command-line behavior? Does the thrift-server script "block" until the thrift server pod is shut down?

By default the script returns, but it can be made to block by setting the environment variable SPARK_NO_DAEMONIZE. Once this is set, the script blocks until the thrift server pod is shut down.
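As a sketch of the invocation being described (the master URL and image name are placeholders, and the flags shown are the standard spark-submit/Spark-on-K8s ones, not something verified against this PR), a cluster-mode launch might look like:

```shell
# Sketch only: start STS against a k8s master in cluster mode.
# <k8s-apiserver> and <spark-image> are placeholders.
# With SPARK_NO_DAEMONIZE set, spark-daemon.sh runs the submit in the
# foreground, so the script blocks instead of returning immediately.
export SPARK_NO_DAEMONIZE=1
"${SPARK_HOME}/sbin/start-thriftserver.sh" \
  --master k8s://https://<k8s-apiserver>:8443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<spark-image>
```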

@suryag10
Author

If possible, there should be some basic integration testing. Run a thrift server command against the minishift cluster used by the other testing.

Will add this in a separate PR.

@suryag10
Author

Can somebody please merge this?

@suryag10 suryag10 reopened this Oct 22, 2018
@suryag10
Author

I am observing some weird behaviour when I try to respond to the comments, hence I am adding the responses below.
Following are the responses to the comments:

The script may be run from a client machine outside a k8s cluster. In this case, there's not even a pod. I would suggest separating the explanation of the user flow details by the deploy mode (client vs cluster).

STS is a server, and the best way to deploy it in a K8s cluster is either through a Helm chart or through a YAML file. (It can also be done through the method you suggested, but I guess that scenario would be a rare case, and there would be no HA for the STS server if it is triggered from outside.)

In the scenario of a cluster-mode submission, what is the command-line behavior? Does the thrift-server script "block" until the thrift server pod is shut down?

By default the script returns, but it can be made to block by setting the environment variable SPARK_NO_DAEMONIZE. Once this is set, the script blocks until the thrift server pod is shut down.

If possible, there should be some basic integration testing. Run a thrift server command against the minishift cluster used by the other testing.

Will add it as a separate PR.

Please merge this if you are OK with the responses.

@erikerlandson
Contributor

@suryag10 you were probably encountering github server problems from yesterday:
https://status.github.com/messages

@erikerlandson
Contributor

@suryag10, all things being equal, it is considered preferable to provide testing for new functionality in the same PR. Are there any logistical problems with adding testing here?

@vanzin
Contributor

vanzin commented Dec 21, 2018

The bug here should be SPARK-23078; no point in filing duplicate bugs.

Also, could anyone answer my question in the bug? Seems like we don't need this anymore.

@vanzin
Contributor

vanzin commented Jan 25, 2019

No updates on the bug so I assume what I wrote is correct. Closing.

10 participants