No module named 'cudf' while running spark-rapids with AWS EMR-7.3 #11668

Open
Basir-mahmood opened this issue Oct 28, 2024 · 3 comments
@Basir-mahmood

Thanks for the great work and the awesome library.
I am using spark-rapids on EMR-7.3 for deep learning model inference with predict_batch_udf.
I have been following the provided AWS EMR documentation. To enable GPU scheduling with pandas_udf, as described in the linked guide, I am passing --py-files ${SPARK_RAPIDS_PLUGIN_JAR} in the spark-submit command, and I have also added "spark.rapids.sql.python.gpu.enabled": "true" to the config.json file.
The instances I am using are m5.4xlarge (master) and g4dn.12xlarge (core).
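For reference, the relevant entry in the EMR configuration JSON might look like the following sketch (the "spark-defaults" classification is the standard EMR way to set Spark properties; the exact shape of my config.json beyond the one property quoted above is an assumption):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.rapids.sql.python.gpu.enabled": "true"
    }
  }
]
```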

However, the task fails with a "No module named 'cudf'" error.

-- spark-submit-command --
spark-submit --deploy-mode client --py-files /usr/lib/spark/jars/rapids-4-spark_2.12-24.06.1-amzn-0.jar s3://<my-bucket>/rapids-code.py

The following lines are from the EMR error log:

24/10/28 13:48:12 WARN RapidsPluginUtils: RAPIDS Accelerator 24.06.1-amzn-0 using cudf 24.06.0, private revision 755b4dd03c753cacb7d141f3b3c8ff9f83888b69

...
...
...

24/10/28 13:48:28 INFO PythonWorkerFactory: Python daemon module in PySpark is set to [rapids.daemon] in 'spark.python.daemon.module', using this to start the daemon up. Note that this configuration only has an effect when 'spark.python.use.daemon' is enabled and the platform is not Windows.
INFO: Process 34593 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
INFO: Process 34594 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
INFO: Process 34595 found CUDA visible device(s): 0
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/daemon.py", line 131, in manager
  File "/mnt/yarn/usercache/hadoop/appcache/application_1730122445157_0001/container_1730122445157_0001_01_000002/rapids-4-spark_2.12-24.06.1-amzn-0.jar/rapids/worker.py", line 37, in initialize_gpu_mem
    from cudf import rmm
ModuleNotFoundError: No module named 'cudf'
24/10/28 13:48:29 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 107)
java.io.EOFException: null
	at java.io.DataInputStream.readInt(DataInputStream.java:386) ~[?:?]
....
....
@gerashegalov
Collaborator

gerashegalov commented Oct 28, 2024

The rapids-4-spark jar only provides the minimal Java bindings needed to run SparkSQL/DataFrame API queries. For the Pandas-like Python module, cudf needs to be installed on all nodes using one of the EMR-recommended means for installing Python libraries. Among other distribution channels, cudf is available as a pip package.
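As a sketch, an EMR bootstrap action that installs cudf via pip on every node could look like the following. The cudf-cu12 package name and the NVIDIA pip index come from the RAPIDS pip install instructions; the version pin is an assumption and should match the cudf version the plugin reports (24.06.0 in the log above), as should the CUDA major version:

```shell
#!/bin/bash
# Hypothetical EMR bootstrap action: install the cudf Python package on all nodes.
# Pin the version to match the cudf version reported by the RAPIDS plugin.
set -euo pipefail

# cudf wheels are published on the NVIDIA pip index; cudf-cu12 targets CUDA 12.
sudo python3 -m pip install \
  --extra-index-url=https://pypi.nvidia.com \
  "cudf-cu12==24.6.*"
```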

@Basir-mahmood
Author

Basir-mahmood commented Oct 30, 2024

@gerashegalov Thanks for the guidance. I also want to ask about controlling the number of concurrent GPU tasks created by the UDF (I am using predict_batch_udf). I have tried spark.rapids.sql.concurrentGpuTasks, but it does not control the number of concurrent tasks on the GPU. Currently, the number of tasks on the GPU equals 1 / spark.task.resource.gpu.amount. Can you please help me with that?

@eordentlich
Contributor

eordentlich commented Oct 31, 2024

You can edit and add a version of this init script, run after the spark_rapids one, to install the cudf Python library: https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-bootstrap-action.sh
Note that it builds and installs Python 3.10, since cudf 24.10 and beyond have dropped support for 3.9. You will also need to configure Spark to use this non-default Python in the driver and executors (see https://github.com/NVIDIA/spark-rapids-ml/blob/branch-24.10/notebooks/aws-emr/init-configurations.json#L71-L72).
spark.rapids.sql.concurrentGpuTasks only applies to the core JVM part of spark-rapids.

The number of concurrent predict_batch_udf tasks is determined by the resource per task and resource per executor settings, as you say. Are you hoping to have different task concurrency per stage?
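To illustrate the arithmetic described above: with fractional GPU scheduling, up to 1 / spark.task.resource.gpu.amount tasks share one GPU, bounded by the executor's task slots. A quick sketch with illustrative values (the helper function is hypothetical, not part of any Spark API):

```python
def tasks_per_gpu(task_gpu_amount: float, executor_cores: int) -> int:
    """Concurrent tasks Spark will schedule onto one GPU.

    Each task claims `task_gpu_amount` of a GPU, so up to
    1 / task_gpu_amount tasks can share it, capped by the
    number of task slots (cores) on the executor.
    """
    return min(int(1 / task_gpu_amount), executor_cores)

# spark.task.resource.gpu.amount = 0.25 with 8 executor cores:
# four tasks run concurrently on each GPU.
print(tasks_per_gpu(0.25, 8))  # -> 4

# Raising the per-task amount to 1.0 serializes tasks on the GPU.
print(tasks_per_gpu(1.0, 8))   # -> 1
```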

The spark.rapids.python.concurrentPythonWorkers config described at https://nvidia.github.io/spark-rapids/docs/additional-functionality/rapids-udfs.html#other-configuration might also be applicable here.
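As a sketch, that limit could be set alongside the RAPIDS Python daemon module on spark-submit (property names are from the linked RAPIDS UDF docs; the value 2 is purely illustrative):

```
--conf spark.python.daemon.module=rapids.daemon \
--conf spark.rapids.python.concurrentPythonWorkers=2
```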
