UnsatisfiedLinkError in KMeansDAL during training with OneCCL in OAP MLlib #399

Open
madhushreeb39250 opened this issue Oct 21, 2024 · 14 comments


@madhushreeb39250

I encountered an error while running KMeans clustering with OAP MLlib, specifically when using the KMeansDAL implementation. The application fails with an UnsatisfiedLinkError, which points to an issue with loading the native CCL library in OneCCL$.c_init.

Here’s the full error trace:
Caused by: java.lang.UnsatisfiedLinkError: com.intel.oap.mllib.OneCCL$.c_init(IILjava/lang/String;Lcom/intel/oap/mllib/CCLParam;)I
at com.intel.oap.mllib.OneCCL$.c_init(Native Method)
at com.intel.oap.mllib.OneCCL$.init(OneCCL.scala:32)
at com.intel.oap.mllib.clustering.KMeansDALImpl.$anonfun$train$4(KMeansDALImpl.scala:71)
at com.intel.oap.mllib.clustering.KMeansDALImpl.$anonfun$train$4$adapted(KMeansDALImpl.scala:70)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Environment:
OAP MLlib version: 1.6.0
Spark version: 3.3.3
oneAPI CCL (oneCCL) version: 2021.8.0
Java version: OpenJDK 8

Additional Information:

The native libraries for CCL are installed under /opt/intel/oneapi/ccl/2021.8.0/lib/cpu/ and /opt/intel/oneapi/ccl/2021.8.0/lib/cpu_gpu_dpcpp/.
The Java environment variables appear to be configured correctly.

Could you please assist in resolving this issue? Thank you!
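
For reference, a minimal sanity check of the native library setup (a sketch only; the paths follow the install layout above, and the oneCCL shared library name is assumed to be libccl.so):

source /opt/intel/oneapi/setvars.sh                 # puts the oneCCL lib directories on LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep ccl    # confirm the ccl directories are visible to the JVM
ldd /opt/intel/oneapi/ccl/2021.8.0/lib/cpu/libccl.so | grep "not found"   # list any unresolved dependencies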

@minmingzhu
Collaborator

minmingzhu commented Oct 23, 2024

@madhushreeb39250 You can re-pull from master; I have merged the new code.

@madhushreeb39250
Author

Hi @minmingzhu, thank you for your response. I re-pulled and rebuilt OAP MLlib. The build was successful, but it still fails while running the KMeans algorithm.

I found the following error in the spark worker log:

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties

java:85813 terminated with signal 11 at PC=7179e101afeb SP=716e3c7fdea8.  Backtrace:
[0x7179e101afeb]

KMeans-1030_17_58_11.log

Could you please help me with this issue?

@minmingzhu
Collaborator

minmingzhu commented Nov 11, 2024

@madhushreeb39250 Can you provide the stdout and stderr of the failing worker? Could you also share your Spark configuration files, such as spark-env.sh and spark-defaults.conf?

@madhushreeb39250
Author

Thank you for the response @minmingzhu. Here are the log and conf files:
logs.zip

@minmingzhu
Collaborator


Hi @madhushreeb39250, could you let me know which Intel GPU model you are using?

@minmingzhu
Collaborator

Maybe you can try adding this to spark-defaults.conf:

spark.executor.cores                   $Num_CORES
spark.executor.instances             $EXECUTOR_NUM            #  you need to set  $EXECUTOR_NUM = $GPU_WORKER_AMOUNT

spark.worker.resourcesFile           $GPU_RESOURCE_FILE             # Path to resources file which is used to find various resources while worker starting up
spark.worker.resource.gpu.amount     $GPU_WORKER_AMOUNT   # total GPUs for each worker
spark.executor.resource.gpu.amount   1
spark.task.resource.gpu.amount       0.001

spark.executor.extraClassPath        $SPARKJOB_ADD_JARS
spark.driver.extraClassPath          $SPARKJOB_ADD_JARS

And add this to spark-env.sh:

export CCL_ATL_TRANSPORT=ofi
export ONEAPI_DEVICE_SELECTOR=level_zero:*

I assume you are using Intel GPUs. If you want to know how many GPUs a node has, just source the oneAPI environment and run sycl-ls (see the sketch below).

By the way, the spark.worker.resourcesFile format is as follows; you can also refer to the Spark docs (https://spark.apache.org/docs/latest/spark-standalone.html#resource-allocation-and-configuration-overview):
[{"id":{"componentName": "spark.worker","resourceName":"gpu"},"addresses":["0","1","2","3","4","5","6","7","8","9","10","11"]}]

@madhushreeb39250
Author

madhushreeb39250 commented Nov 11, 2024

Thank you for the suggestion, @minmingzhu. I am not trying to use a GPU here; there are no GPUs available on the machine.

@minmingzhu
Collaborator

Hi @madhushreeb39250,
You can re-pull from master; I have merged new code.
You can run on CPU and add this to spark-env.sh:

source ~/intel/oneapi/setvars.sh
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

And add this to spark-defaults.conf:

spark.executor.extraClassPath        $SPARKJOB_ADD_JARS
spark.driver.extraClassPath          $SPARKJOB_ADD_JARS
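
For completeness, a sketch of how these pieces fit together at submit time (the jar path matches the build output shown later in this thread; the master URL and application jar are placeholders):

export SPARKJOB_ADD_JARS=/home/madhushreeb/oap-mllib/mllib-dal/target/oap-mllib-1.6.0.jar

spark-submit \
  --master spark://<master-host>:7077 \
  --jars $SPARKJOB_ADD_JARS \
  --conf spark.executor.extraClassPath=$SPARKJOB_ADD_JARS \
  --conf spark.driver.extraClassPath=$SPARKJOB_ADD_JARS \
  <your-kmeans-app>.jar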

@madhushreeb39250
Author

Hi @minmingzhu, I tried re-pulling master and still face the same issue. I even tried with both Java 8 and 11, and built both with and without the CPU-only option. I am also using the recommended version of oneAPI, but the same error persists. Could you please suggest how to proceed from here?

@minmingzhu
Collaborator

minmingzhu commented Nov 25, 2024

Hi @madhushreeb39250, did you add the environment settings to spark-env.sh and spark-defaults.conf? You could also try this version of oneAPI:

wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh && ./l_BaseKit_p_2024.0.0.49564_offline.sh

@madhushreeb39250
Author

Hi @minmingzhu, thank you for the suggestion.
I have updated spark-env.sh and spark-defaults.conf as per the previous suggestions, and I tried the suggested oneAPI Base Kit. The error is still the same, but this time I could see the KMeans clusters printed in the trace.
KMeans-1126_12_15_33.log

The error in the worker log:

Spark Executor Command: "/usr/lib/jvm/java-11-openjdk-amd64/bin/java" "-cp" "/home/madhushreeb/oap-mllib/mllib-dal/target/oap-mllib-1.6.0.jar:/home/madhushreeb/spark-3.3.3-bin-hadoop3/conf/:/home/madhushreeb/spark-3.3.3-bin-hadoop3/jars/*" "-Xmx4096M" "-Dspark.driver.port=38637" "-Dspark.network.timeout=1200s" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" "-XX:+PrintGCApplicationStoppedTime" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@ip-172-32-11-30:38637" "--executor-id" "6" "--hostname" "172.32.11.30" "--cores" "2" "--app-id" "app-20241126121536-0000" "--worker-url" "spark://[email protected]:33417"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
[INFO] OneCCL (native): cleanup
[ip-172-32-11-30:25731:1:26859] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x79d7b23d008)
[ip-172-32-11-30:25731:0:26835] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x79d7b23d008)
==== backtrace (tid:  26835) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x79d46a43fc4]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x79d46a47fec]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x79d46a481aa]
 3  [0x79d5d263cd9]
=================================
==== backtrace (tid:  26859) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x79d46a43fc4]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x79d46a47fec]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x79d46a481aa]
 3  [0x79d5d3e4c84]
=================================

@minmingzhu
Collaborator

Hi @madhushreeb39250, I haven't run into this problem. Could you provide the stdout and stderr of the Spark worker?

@minmingzhu
Collaborator

Hi @madhushreeb39250, I can see KMeans printing its results in your Spark master log.
[screenshot: KMeans output in the Spark master log]

@madhushreeb39250
Author

Hi @minmingzhu, yes, I am able to get the centroids of the KMeans clusters. I just wasn't sure whether this is the correct behaviour of the algorithm, since I was also getting the error alongside the KMeans results. Here are the stderr and stdout files of the Spark worker that is failing:
worker_logs.zip
