Dependency issues with Spark's built-in commons-compress #93

jwooden1 · 2018-10-26T00:05:14Z

I can use the library when I run Spark on my local Windows machine and read Excel files on the same machine. However, when I upload the files to WASB on Azure and use an HDInsight cluster to run the Spark jobs (in either local or cluster mode), I get the following error:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  at org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
  at scala.Option.fold(Option.scala:158)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:64)
  at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:71)
  at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:70)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:264)
  at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:263)
  at scala.Option.getOrElse(Option.scala:121)
  at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:263)
  at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided


nightscape · 2018-10-26T09:01:12Z

I had the same problem a few days ago, but haven't found a proper solution.
The problem is that Spark comes bundled with a rather outdated version of commons-compress, and POI needs a newer version. In principle it should be possible to override the JARs bundled with Spark with user-provided ones, but I haven't managed to do so yet.
In case you find a solution, please post it here 👍
In the meantime, you could try older versions of spark-excel. Maybe the pre-0.10 versions work with the older version of commons-compress.

jornfranke · 2018-11-06T22:23:40Z

I had the same issue (not with spark-excel, but with other software). You need to shade the commons-compress dependency so that your Spark application uses the new version of commons-compress. You can do this in Java with the Maven Shade plugin, or in Scala with the sbt-assembly plugin (https://github.com/sbt/sbt-assembly). In your build.sbt you can then define a rule that shades commons-compress (https://github.com/sbt/sbt-assembly#shading).

If you want to use R or Python, then maybe @nightscape needs to shade it directly in the spark-excel module that is published on Maven.

The other way, "override the JARs bundled with Spark", is not possible in this case, because commons-compress is a core part of Spark. However, shading is not so bad here. I also recommend creating a JIRA issue with the Spark project asking it to update commons-compress (the old version is vulnerable to several attacks).
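
For readers who have not used sbt-assembly shading before, the rule described above could look roughly like the following build.sbt fragment. It is only a sketch: it assumes the sbt-assembly plugin is already enabled for the project, the target package name `shadeio.commons.compress` is an arbitrary choice, and the `assembly / ...` key syntax is the one used by recent sbt versions (older builds write `assemblyShadeRules in assembly`).

```scala
// build.sbt fragment (sketch). Assumes sbt-assembly is enabled in project/plugins.sbt.
// The rule rewrites every class under org.apache.commons.compress, and every
// reference to such a class, to a private package name so that the copy bundled
// in the fat JAR cannot collide with the older copy that ships with Spark.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("org.apache.commons.compress.**" -> "shadeio.commons.compress.@1")
    .inAll // apply the rename to the project's classes and to all bundled dependencies
)
```

A rename rule only affects classes that end up inside the assembled JAR, so commons-compress itself has to be bundled (not marked as provided) for the rule to have any effect.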

nightscape · 2018-11-09T22:12:51Z

I just released 0.10.1 and 0.11.0-beta2 which shade commons-compress and should hopefully fix this problem.
Can you give it a try and tell me if it worked?

hbenzineb · 2018-11-12T08:25:16Z

Hi @nightscape
I'm using 0.11.0-beta2 and I still get the same error.
When I add a dependency on commons-compress, I get this message:

diagnostics: User class threw exception: java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.

When I don't add the dependency, I get this:

diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/commons/compress/utils/InputStreamStatistics

As a reminder, I am trying to write the contents of several DataFrames to several sheets of the same Excel file.

jornfranke · 2018-11-12T22:26:31Z

@nightscape I think you don't include commons-compress explicitly in the resulting jar of the spark-excel module. In this case the shading rules will not apply. See fat jar: https://github.com/sbt/sbt-assembly.

nightscape · 2018-11-19T21:46:26Z

Just trying another approach. Can someone check 0.11.0-beta3?

hbenzineb · 2018-11-23T14:34:53Z

@nightscape : it's OK :)
Thanks

nightscape · 2018-11-23T15:53:15Z

Ok, then I'll backport this to 0.10 and release 0.11 from the beta version.

nightscape · 2018-11-24T11:14:21Z

Fixed in 0.10.2 and 0.11.0-beta3.

jwooden1 · 2018-11-26T18:53:41Z

The fix works in 0.10.2, but not in 0.11.0-beta3. I get this error in 0.11.0-beta3:
scala.MatchError: Map(treatemptyvaluesasnulls -> false, path -> /unique.xlsx, useheader -> true, endcolumn -> 8, inferschema -> true, startcolumn -> 0, sheetname -> input) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)
  at com.crealytics.spark.excel.DataLocator$.apply(DataLocator.scala:52)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:29)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
  ... 53 elided
Looking at the code, it seems to be caused by dataaddress having become a mandatory field? What is it anyway? I also think it creates a side effect: if I pass null when reading, the read raises no error, but it does not read the specified sheet. It looks like it just reads the first sheet.
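
On the "what is it anyway?" question, a minimal sketch may help. It rests on an assumption this thread does not confirm: that from 0.11 on a single `dataAddress` option (for example `'input'!A1`) replaces the `sheetName`, `startColumn` and `endColumn` options that appear in the failing map. The `ExcelReadOptions` object and its method names are hypothetical helpers, not part of spark-excel.

```scala
// Hypothetical helper, not part of spark-excel: builds reader options that use a
// dataAddress string instead of sheetName/startColumn/endColumn.
object ExcelReadOptions {

  /** Builds an address such as 'input'!A1 from a sheet name and a start cell.
    * Sheet names that themselves contain an apostrophe are not handled here. */
  def dataAddress(sheetName: String, startCell: String = "A1"): String =
    s"'$sheetName'!$startCell"

  /** Options map for a read; useHeader and inferSchema mirror the values in the report above. */
  def forSheet(sheetName: String, startCell: String = "A1"): Map[String, String] =
    Map(
      "dataAddress" -> dataAddress(sheetName, startCell),
      "useHeader"   -> "true",
      "inferSchema" -> "true"
    )
}

// Assumed usage with an existing SparkSession named `spark`:
//   spark.read
//     .format("com.crealytics.spark.excel")
//     .options(ExcelReadOptions.forSheet("input"))
//     .load("/unique.xlsx")
```

If the assumption holds, passing the sheet inside `dataAddress` rather than through `sheetName` would also explain why a read without it silently falls back to the first sheet.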

abhishek-bhatt3 · 2019-01-25T09:57:07Z

@jwooden1 reported above that the fix works in 0.10.2 but fails in 0.11.0-beta3 with a scala.MatchError thrown from DataLocator.

I am facing the same error in 0.11.0. Any update on this?

jagadeesh427 · 2019-05-01T18:02:39Z

Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> true, location -> hdfs://nameservice1/flatfiles/raw/500a_map_e.xlsx, useheader -> true, inferschema -> true, addcolorcolumns -> false, sheetname -> _500a_map_e) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap)

I am facing the above issue.

Dependency used: com.crealytics:spark-excel_2.10:0.8.3

Can anyone help?

jagadeesh427 · 2019-05-03T15:07:51Z

Solved the issue. I used --packages com.crealytics:spark-excel_2.11:0.10.2 and it worked fine.

nightscape · 2019-06-27T14:46:10Z

I can reproduce this locally now. The problem seems to be that despite shading org.apache.commons.compress this line seems to be calling the constructor of the unshaded ZipArchiveInputStream.
Trying to find out what's happening...

nightscape · 2019-06-27T16:30:58Z

Not understanding it...
The exception says the following:

java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
  org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
  org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
  org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
  org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  java.lang.reflect.Method.invoke(Method.java:498)
  org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
  org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
  org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
  com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:42)

On the other hand, when I download and unzip the spark-excel JAR and run

javap -verbose com/crealytics/spark-excel_2.12/0.11.2/org/apache/poi/openxml4j/opc/internal/ZipHelper.class

it clearly shows that the above method is using the shaded classes:

  public static org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream openZipStream(java.io.InputStream) throws java.io.IOException;
    descriptor: (Ljava/io/InputStream;)Lorg/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream;
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=5, locals=2, args_size=1
         0: aload_0
         1: invokestatic  #108                // Method org/apache/poi/poifs/filesystem/FileMagic.prepareToCheckMagic:(Ljava/io/InputStream;)Ljava/io/InputStream;
         4: astore_1
         5: aload_1
         6: invokestatic  #139                // Method verifyZipHeader:(Ljava/io/InputStream;)V
         9: new           #141                // class org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream
        12: dup
        13: new           #143                // class shadeio/commons/compress/archivers/zip/ZipArchiveInputStream
        16: dup
        17: aload_1
        18: invokespecial #145                // Method shadeio/commons/compress/archivers/zip/ZipArchiveInputStream."<init>":(Ljava/io/InputStream;)V
        21: invokespecial #146                // Method org/apache/poi/openxml4j/util/ZipArchiveThresholdInputStream."<init>":(Ljava/io/InputStream;)V
        24: areturn

jornfranke · 2019-06-27T16:49:40Z

Maybe some of your dependencies have POI as a dependency and then this dependency does not use the shaded commons-io
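
One way to test a hypothesis like this at runtime is to ask the JVM which JAR a class was actually loaded from. The sketch below uses only standard JDK reflection; the `ClassOrigin` object is a hypothetical helper, and the class names mentioned in the usage note are simply the ones that appear in the stack traces in this thread.

```scala
// Hypothetical diagnostic helper: reports where a class on the current classpath comes from.
object ClassOrigin {

  /** Returns the code source (usually a JAR URL) of the named class, or None if the
    * class cannot be loaded or was loaded by the bootstrap loader (no code source). */
  def locationOf(className: String): Option[String] =
    try {
      // initialize = false: load the class without running its static initializers
      val cls = Class.forName(className, false, Thread.currentThread().getContextClassLoader)
      Option(cls.getProtectionDomain.getCodeSource)
        .flatMap(cs => Option(cs.getLocation))
        .map(_.toString)
    } catch {
      case _: ClassNotFoundException | _: LinkageError => None
    }
}
```

Calling `ClassOrigin.locationOf("org.apache.poi.openxml4j.opc.internal.ZipHelper")` and `ClassOrigin.locationOf("org.apache.commons.compress.archivers.zip.ZipArchiveInputStream")` from a spark-shell session should show whether POI is being picked up from the spark-excel JAR or from a separate, unshaded POI JAR.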

nightscape · 2019-07-02T16:06:53Z

@jornfranke That was exactly the problem. spark-excel itself still adds POI as a dependency (see hammerlab/sbt-parent#32).
I'm now bundling and shading all dependencies that require commons-io.

I just released 0.12.0 with this fix (and Scala 2.12 compatibility), it should appear on Maven Central in the next few hours.
Please go ahead and try it.
I'll close this issue until there are reports of the problem occurring again.

jlscott3 · 2019-07-03T16:05:30Z

Confirmed 0.12.0 working in AWS Glue now - thanks for the quick response!

ecv-stan · 2019-07-25T10:05:35Z

@jlscott3 Hi, would you mind sharing how you got this to work in Glue?
Did you just add spark-excel_2.12-0.12.0.jar to the JAR lib path of the Glue job? Do you need to set anything else?
I tried spark-excel_2.12-0.12.0.jar, spark-excel_2.11-0.12.0.jar and spark-excel_2.11-0.11.1.jar, but all of them throw an error...
Thanks in advance.

Update:

Finally I got it working in AWS glue.

Below are the jars I used:
ooxml-schemas-1.4.jar
poi-4.0.0.jar
spark-excel_2.11-0.12.0.jar
xmlbeans-3.1.0.jar

Hope it helps.

nightscape · 2019-10-02T23:44:29Z

It turns out something went wrong while publishing spark-excel_2.12-0.12.0.jar, so that version actually still had this problem.
In case anyone wants to try with Scala 2.12 it should work with spark-excel 0.12.1.

tochandrashekhar · 2020-07-16T08:12:46Z

@ecv-stan wrote above that it finally worked in AWS Glue with these JARs: ooxml-schemas-1.4.jar, poi-4.0.0.jar, spark-excel_2.11-0.12.0.jar and xmlbeans-3.1.0.jar.

Do we need to import anything in the Spark code? Can you please provide some sample code?

xvinosh · 2020-08-14T14:38:23Z

Did anyone find a solution to this problem? I am facing the same problem with the latest version of spark-excel, 0.13.5.

scala> val file = new File("/Users/vinodsharma/Documents/Spark-Excel/People.xlsx")
file: java.io.File = /Users/vinodsharma/Documents/Spark-Excel/People.xlsx

scala> val fIP = new FileInputStream(file)
fIP: java.io.FileInputStream = java.io.FileInputStream@236ec69

scala> val wb = new XSSFWorkbook(fIP)
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:178)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
at org.apache.poi.ooxml.util.PackageHelper.open(PackageHelper.java:47)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:309)
... 51 elided

How do I go about changing the classpath for the commons-compress JAR? In my case, the version of the commons-compress JAR is org.apache.commons#commons-compress;1.20

nightscape · 2020-08-14T19:57:17Z

You might have to manually exclude commons-compress from the dependencies due to this problem which I don't yet know how to fix: hammerlab/sbt-parent#32
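
For an sbt build, a manual exclusion like the one suggested above could be written as in the following build.sbt fragment. This is a sketch under assumptions: the version number is just the one mentioned earlier in the thread, and whether excluding the transitive commons-compress actually resolves the conflict depends on which copy ends up first on the runtime classpath.

```scala
// build.sbt fragment (sketch): depend on spark-excel but drop its transitive
// commons-compress dependency, so that only one copy has to be managed explicitly.
libraryDependencies += ("com.crealytics" %% "spark-excel" % "0.13.5")
  .exclude("org.apache.commons", "commons-compress")
```

When the library is pulled in with spark-shell or spark-submit via --packages instead of a build file, the --exclude-packages option (a comma-separated list of groupId:artifactId pairs) plays the corresponding role.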

xvinosh · 2020-08-17T17:14:39Z

@nightscape :
In my case, I tried all the versions from 0.12.1 to 0.13.5, and none worked.
When I launched spark-shell with the --packages option, it reported commons-compress as downloaded, but it actually was not (I could not find it anywhere in the Maven repository directory). So I downloaded the latest version of commons-compress (1.20) manually.
Then I explicitly put the JAR on the driver's classpath, as shown below:
$ spark-shell --driver-class-path /home/xvinosh/.m2/repository/org/apache/commons/commons-compress/1.20/commons-compress-1.20.jar

This worked. Hope it helps others.

sjahongir · 2021-04-18T02:44:15Z

@nightscape hi
I tried version 0.9.0 with Spark 2.3.1 (local and cluster mode). It worked, but Spark cannot process a large Excel file with it.

Then I tried higher versions of your library, from 0.10 onward:

Spark can process the large file in local mode.
The following error occurs when I run Spark in cluster (standalone) mode:

Exception in thread "main" java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipFile$1 is not implementing InputStreamStatistics.
  at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:147)
  at org.apache.poi.openxml4j.util.ZipSecureFile.getInputStream(ZipSecureFile.java:34)
  at org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getInputStream(ZipFileZipEntrySource.java:66)
  at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:258)
  at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:725)
  at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:238)
  at etl.io.XlsxReader.open(XlsxReader.scala:135)
  at etl.io.XlsxReader.<init>(XlsxReader.scala:153)
  at etl.connectors.excel.ExcelConnector.readXlsx(ExcelConnector.scala:194)
  at etl.connectors.excel.ExcelConnector.read(ExcelConnector.scala:119)
  at etl.io.DatasetReader$.read(DatasetReader.scala:47)
  at etl.DatasetResolver$.byModel(DatasetResolver.scala:58)
  at etl.App$.processTask(App.scala:105)
  at etl.App$.main(App.scala:65)
  at etl.App.main(App.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

nightscape · 2021-04-23T09:31:43Z

@sjahongir can you try the recommendation from @xvinosh?

SwapnaRavi21 · 2021-10-25T19:43:10Z

@nightscape I still see issues with the spark-excel builds for Scala 2.12.
Using 0.13.4 I get java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:65)

Using 0.12.0 or 0.12.1 I get useHeader errors as well as the above. I also tried passing commons-compress-1.20.jar along with the other JARs in my spark-submit, without success.

We are currently migrating to Scala 2.12. Could you please suggest a spark-excel version for it that does not have these issues?

nightscape · 2021-10-26T06:49:06Z

Hi @SwapnaRavi21, I would recommend always using the latest version available for your Spark & Scala version.
@quanghgx and I will try to figure out a way to build against multiple versions of Spark.
Unfortunately I'm under quite some deadline pressure at the moment and will probably only get to this the second week of November.
If you have experience with SBT, we'd be happy for any contributions!

SwapnaRavi21 · 2021-10-26T11:40:05Z

@nightscape Yes, we are already on Scala 2.12. But this fix is only available for 2.11 and not for 2.12, right? Sure, thanks. In the meantime, is there an alternative to this dependency that we can use with 2.12 until the fix is available for it?

neontty · 2024-10-28T16:23:11Z

Currently seeing this behavior in Databricks on multiple runtime versions (14.3 LTS, 15.4 LTS), with Scala 2.12 and Spark 3.5.0.

version : com.crealytics:spark-excel_2.12:3.5.0_0.20.3

Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream.putArchiveEntry(Lorg/apache/commons/compress/archivers/zip/ZipArchiveEntry;)V
	at org.apache.poi.openxml4j.opc.internal.ZipContentTypeManager.saveImpl(ZipContentTypeManager.java:65)
	at org.apache.poi.openxml4j.opc.internal.ContentTypeManager.save(ContentTypeManager.java:450)
	at org.apache.poi.openxml4j.opc.ZipPackage.saveImpl(ZipPackage.java:608)
	at org.apache.poi.openxml4j.opc.OPCPackage.save(OPCPackage.java:1532)
	at org.apache.poi.ooxml.POIXMLDocument.write(POIXMLDocument.java:227)
	at com.crealytics.spark.excel.v2.ExcelGenerator.close(ExcelGenerator.scala:177)
	at com.crealytics.spark.excel.v2.ExcelOutputWriter.close(ExcelOutputWriter.scala:34)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseCurrentWriter(FileFormatDataWriter.scala:71)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:82)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.$anonfun$commit$2(FileFormatDataWriter.scala:141)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.enrichWriteError(FileFormatDataWriter.scala:97)
	at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:140)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:560)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1560)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:566)
	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:125)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:938)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:938)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:413)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:410)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:377)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:211)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:199)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:161)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
	at scala.util.Using$.resource(Using.scala:269)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:155)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:102)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$10(Executor.scala:1036)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:1039)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:926)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Excluding org.apache.commons:commons-compress when building our Spark application JAR did not help. Adding an explicit dependency on commons-compress did not help either.

Are there any recommendations for workarounds?

pjfanning · 2024-10-28T16:36:44Z

@neontty It looks like Spark defaults to an out-of-date, CVE-ridden version of commons-compress.

https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.13/3.5.3

POI uses a newer version of commons-compress and must rely on methods that were added or changed recently.

Can you try to upgrade the commons-compress jar that Spark uses? Maybe best to ask on Spark mailing lists or forums if you don't know how to do this.
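
A NoSuchMethodError like the one above can be narrowed down by listing which overloads of a method the loaded class really has. The following is a small sketch using plain JDK reflection; `MethodProbe` is a hypothetical helper, and the class and method names in the usage note are taken from the stack trace in this thread.

```scala
// Hypothetical diagnostic helper: lists the public overloads of a method as seen at runtime.
object MethodProbe {

  /** Returns one line per public method called `methodName` on `className`, in the form
    * "returnType name(paramType, ...)", or Nil if the class cannot be loaded. */
  def signatures(className: String, methodName: String): List[String] =
    try {
      Class.forName(className).getMethods.toList
        .filter(_.getName == methodName)
        .map { m =>
          val params = m.getParameterTypes.map(_.getName).mkString(", ")
          s"${m.getReturnType.getName} $methodName($params)"
        }
        .sorted
    } catch {
      case _: ClassNotFoundException | _: LinkageError => Nil
    }
}
```

Running `MethodProbe.signatures("org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream", "putArchiveEntry")` on the cluster would show whether the commons-compress version on the classpath offers the `putArchiveEntry(ZipArchiveEntry)` overload that POI is trying to call.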

neontty · 2024-11-11T21:29:35Z

hi @pjfanning , thanks for the quick response. I'm just looking into this a bit more and trying to understand why the shading rule isn't enough in build.sc:67

is it because of this discussion regarding shading in the mill build system? com-lihaoyi/mill#3815

nightscape · 2024-11-12T10:40:09Z

@neontty thanks for commenting over at Mill 👍
If you and/or your colleagues could pick that issue up, that would be great.
With the bounty on top, you could do a nice celebration with your colleagues 🍻 😄

nightscape mentioned this issue Nov 9, 2018

saving multiple datasets as different sheets in excel file #43

Closed

nightscape closed this as completed Nov 24, 2018

jlscott3 mentioned this issue Jun 26, 2019

Dependency issues in AWS Glue with spark-excel 11.1 and commons-compress 1.18 #128

Closed

nightscape changed the title ~~error in reading files in azure hdinsight cluster~~ Dependency issues with Spark's built-in commons-compress Jun 27, 2019

nightscape reopened this Jun 27, 2019

nightscape closed this as completed Jul 2, 2019
