UnsatisfiedLinkError in KMeansDAL during training with OneCCL in OAP MLlib #399

Open
madhushreeb39250 opened this issue Oct 21, 2024 · 14 comments


@madhushreeb39250

I encountered an error while running KMeans clustering with OAP MLlib, specifically when using the KMeansDAL implementation. The application fails with an UnsatisfiedLinkError, which points to an issue with loading the native CCL library in OneCCL$.c_init.

Here’s the full error trace:
Caused by: java.lang.UnsatisfiedLinkError: com.intel.oap.mllib.OneCCL$.c_init(IILjava/lang/String;Lcom/intel/oap/mllib/CCLParam;)I
at com.intel.oap.mllib.OneCCL$.c_init(Native Method)
at com.intel.oap.mllib.OneCCL$.init(OneCCL.scala:32)
at com.intel.oap.mllib.clustering.KMeansDALImpl.$anonfun$train$4(KMeansDALImpl.scala:71)
at com.intel.oap.mllib.clustering.KMeansDALImpl.$anonfun$train$4$adapted(KMeansDALImpl.scala:70)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:907)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:907)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

Environment:
OAP MLlib version: 1.6.0
Spark version: 3.3.3
oneAPI CCL (oneCCL) version: 2021.8.0
Java version: OpenJDK 8

Additional Information:

The native libraries for CCL are installed under /opt/intel/oneapi/ccl/2021.8.0/lib/cpu/ and /opt/intel/oneapi/ccl/2021.8.0/lib/cpu_gpu_dpcpp/.
The Java environment variables appear to be configured correctly.

Could you please assist in resolving this issue? Thank you!
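
For reference, a minimal sanity check of the native library setup (a sketch only; the paths follow the install layout above, and the oneCCL shared library name is assumed to be libccl.so):

source /opt/intel/oneapi/setvars.sh                 # puts the oneCCL lib directories on LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep ccl    # confirm the ccl directories are visible to the JVM
ldd /opt/intel/oneapi/ccl/2021.8.0/lib/cpu/libccl.so | grep "not found"   # list any unresolved dependencies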

@minmingzhu
Collaborator

minmingzhu commented Oct 23, 2024

@madhushreeb39250 You can re-pull from master; I have merged the new code.

@madhushreeb39250
Author

Hi @minmingzhu, thank you for your response. I re-pulled and rebuilt OAP MLlib. The build was successful, but it still fails while running the KMeans algorithm.

I found the following error in the spark worker log:

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties

java:85813 terminated with signal 11 at PC=7179e101afeb SP=716e3c7fdea8.  Backtrace:
[0x7179e101afeb]

KMeans-1030_17_58_11.log

Could you please help me with this issue?

@minmingzhu
Collaborator

minmingzhu commented Nov 11, 2024

@madhushreeb39250 Can you provide the stdout and stderr of the failing worker? Could you also share your Spark configuration files, such as spark-env.sh and spark-defaults.conf?

@madhushreeb39250
Author

Thank you for the response @minmingzhu. Here are the log and conf files:
logs.zip

@minmingzhu
Collaborator


Hi @madhushreeb39250, could you let me know which Intel GPU model you are using?

@minmingzhu
Collaborator

Maybe you can try adding this to spark-defaults.conf:

spark.executor.cores                   $Num_CORES
spark.executor.instances             $EXECUTOR_NUM            #  you need to set  $EXECUTOR_NUM = $GPU_WORKER_AMOUNT

spark.worker.resourcesFile           $GPU_RESOURCE_FILE             # Path to resources file which is used to find various resources while worker starting up
spark.worker.resource.gpu.amount     $GPU_WORKER_AMOUNT   # total GPUs for each worker
spark.executor.resource.gpu.amount   1
spark.task.resource.gpu.amount       0.001

spark.executor.extraClassPath        $SPARKJOB_ADD_JARS
spark.driver.extraClassPath          $SPARKJOB_ADD_JARS

And add this to spark-env.sh:

export CCL_ATL_TRANSPORT=ofi
export ONEAPI_DEVICE_SELECTOR=level_zero:*

I assume you are using Intel GPUs. If you want to know how many GPUs a node has, just source the oneAPI environment and run sycl-ls (see the sketch below).

By the way, the spark.worker.resourcesFile format is as follows; you can also refer to the Spark docs (https://spark.apache.org/docs/latest/spark-standalone.html#resource-allocation-and-configuration-overview):
[{"id":{"componentName": "spark.worker","resourceName":"gpu"},"addresses":["0","1","2","3","4","5","6","7","8","9","10","11"]}]

@madhushreeb39250
Author

madhushreeb39250 commented Nov 11, 2024

Thank you for the suggestion, @minmingzhu. I am not trying to use a GPU here; there are no GPUs available on the machine.

@minmingzhu
Collaborator

Hi @madhushreeb39250,
You can re-pull from master; I have merged new code.
You can run on CPU and add this to spark-env.sh:

source ~/intel/oneapi/setvars.sh
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

And add this to spark-defaults.conf:

spark.executor.extraClassPath        $SPARKJOB_ADD_JARS
spark.driver.extraClassPath          $SPARKJOB_ADD_JARS
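
For completeness, a sketch of how these pieces fit together at submit time (the jar path matches the build output shown later in this thread; the master URL and application jar are placeholders):

export SPARKJOB_ADD_JARS=/home/madhushreeb/oap-mllib/mllib-dal/target/oap-mllib-1.6.0.jar

spark-submit \
  --master spark://<master-host>:7077 \
  --jars $SPARKJOB_ADD_JARS \
  --conf spark.executor.extraClassPath=$SPARKJOB_ADD_JARS \
  --conf spark.driver.extraClassPath=$SPARKJOB_ADD_JARS \
  <your-kmeans-app>.jar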

@madhushreeb39250
Author

Hi @minmingzhu, I tried re-pulling master and still face the same issue. I even tried with both Java 8 and 11, and built both with and without the CPU-only option. I am also using the recommended version of oneAPI, but the same error persists. Could you please suggest how to proceed from here?

@minmingzhu
Collaborator

minmingzhu commented Nov 25, 2024

Hi @madhushreeb39250, did you add the environment settings to spark-env.sh and spark-defaults.conf? You could also try this version of oneAPI:

wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh && ./l_BaseKit_p_2024.0.0.49564_offline.sh

@madhushreeb39250
Author

Hi @minmingzhu, thank you for the suggestion.
I have updated spark-env.sh and spark-defaults.conf as per the previous suggestions, and I tried the suggested oneAPI Base Kit. The error is still the same, but this time I could see the KMeans clusters printed in the trace.
KMeans-1126_12_15_33.log

The error in the worker log:

Spark Executor Command: "/usr/lib/jvm/java-11-openjdk-amd64/bin/java" "-cp" "/home/madhushreeb/oap-mllib/mllib-dal/target/oap-mllib-1.6.0.jar:/home/madhushreeb/spark-3.3.3-bin-hadoop3/conf/:/home/madhushreeb/spark-3.3.3-bin-hadoop3/jars/*" "-Xmx4096M" "-Dspark.driver.port=38637" "-Dspark.network.timeout=1200s" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-XX:+PrintGCDetails" "-XX:+PrintGCDateStamps" "-XX:+PrintGCApplicationStoppedTime" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@ip-172-32-11-30:38637" "--executor-id" "6" "--hostname" "172.32.11.30" "--cores" "2" "--app-id" "app-20241126121536-0000" "--worker-url" "spark://[email protected]:33417"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
[INFO] OneCCL (native): cleanup
[ip-172-32-11-30:25731:1:26859] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x79d7b23d008)
[ip-172-32-11-30:25731:0:26835] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x79d7b23d008)
==== backtrace (tid:  26835) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x79d46a43fc4]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x79d46a47fec]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x79d46a481aa]
 3  [0x79d5d263cd9]
=================================
==== backtrace (tid:  26859) ====
 0  /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x79d46a43fc4]
 1  /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x79d46a47fec]
 2  /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x79d46a481aa]
 3  [0x79d5d3e4c84]
=================================

@minmingzhu
Collaborator

Hi @madhushreeb39250, I haven't run into this problem. Could you provide the stdout and stderr of the Spark worker?

@minmingzhu
Collaborator

Hi @madhushreeb39250, I can see KMeans printing its results in your Spark master log.
[screenshot: KMeans output in the Spark master log]

@madhushreeb39250
Author

Hi @minmingzhu, yes, I am able to get the centroids of the KMeans clusters. I just wasn't sure whether this is the correct behaviour of the algorithm, since I was also getting the error alongside the KMeans results. Here are the stderr and stdout files of the Spark worker that is failing:
worker_logs.zip
