Description
Scenario: I use a remote node to submit to the target k8s cluster. On the remote node, I use pip install in a conda environment for bigdl and related packages. In this case, the Python environments on the driver (the remote node) and the executors (the Docker image) are different, and thus there is a bug in init_spark_on_k8s in our code.
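For context, a minimal sketch of the remote-client setup described above (the master URL and container image are placeholders, and the import path and parameter names are assumptions based on the branch-2.0 code referenced below):

```python
from bigdl.dllib.nncontext import init_spark_on_k8s

# Called from the conda environment on the remote node; the executors run
# inside the container image, whose Python environment may differ.
sc = init_spark_on_k8s(
    master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",  # placeholder
    container_image="intelanalytics/bigdl-k8s:latest",                 # illustrative image
    num_executors=2,
    executor_cores=2,
    executor_memory="10g",
)
```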
Issues:
- python_env for PYTHONHOME, ld_path and preload_so should be the executor's Python path: https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/utils/spark.py#L324. Fixed in #3589 (Fix init_spark_on_k8s).
- Users currently need to set the environment variable
export PYSPARK_PYTHON=/usr/local/envs/pytf1/bin/python
which is the Python interpreter inside the Docker image; see the first sketch after this list. Addressed in #3646 (support conda pack in init_spark_on_k8s).
- The jar paths for the executor extra classpath are paths on the driver, which are not correct and not available on the executor: https://github.com/intel-analytics/BigDL/blob/branch-2.0/python/dllib/src/bigdl/dllib/utils/spark.py#L348. They should be ${BIGDL_HOME}/jars/* in the Docker image. Addressed in #3646 (support conda pack in init_spark_on_k8s).
- spark.driver.host should be the IP of the remote node. A default value was added in #3589 (Fix init_spark_on_k8s) in case users set this value incorrectly. We may remove spark.driver.host from our documentation since users don't actually need to specify it; see the second sketch after this list.
- Added a default value for spark.driver.port in #3589 (Fix init_spark_on_k8s) so that users don't need to set this either.
- The NFS path should also be available and mounted to /bigdl2.0/data on the remote node, since Spark requires the driver to have access to the data as well.
- Add instructions for running k8s from a remote node in the documentation. Done in #3606 (Update k8s user guide for run from remote).
- Update the k8s user guide: remove driver host and port from the code and add a remark. Done in #3606 (Update k8s user guide for run from remote).
- RayOnSpark fails when running k8s from remote: #3605. This may be a bit complicated and is thus separated into a new issue.
- Add init_orca_context(cluster_mode="k8s-cluster") and init_spark_on_k8s_cluster; see the second sketch after this list.
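First sketch: as a workaround before #3646, PYSPARK_PYTHON has to be pointed at the interpreter inside the image before the context is created. A minimal sketch, reusing the interpreter path from the item above:

```python
import os

# Must be the Python interpreter inside the Docker image, NOT the conda
# environment on the remote node; executors resolve this path in the container.
os.environ["PYSPARK_PYTHON"] = "/usr/local/envs/pytf1/bin/python"
```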
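Second sketch: what the client-mode call looks like with the driver host/port settings that #3589 now defaults and #3606 removes from the docs. The cluster_mode value, master URL, image, and addresses are placeholders based on the k8s user guide, and the k8s-cluster line is the proposed addition, not an existing API:

```python
from bigdl.orca import init_orca_context

# Client mode: the driver runs on the remote node, so spark.driver.host must be
# that node's IP. #3589 adds defaults so users no longer have to set these two.
sc = init_orca_context(
    cluster_mode="k8s-client",
    master="k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>",  # placeholder
    container_image="intelanalytics/bigdl-k8s:latest",                 # illustrative
    num_nodes=2,
    cores=2,
    conf={
        "spark.driver.host": "<remote-node-ip>",  # defaulted by #3589, to be dropped from docs
        "spark.driver.port": "<port>",            # likewise defaulted by #3589
        # #3646 points the executor extra classpath at ${BIGDL_HOME}/jars/* in the image
    },
)

# Proposed: run the driver inside the cluster too, which sidesteps the
# driver host/port and local Python environment issues entirely.
# sc = init_orca_context(cluster_mode="k8s-cluster", ...)
```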