Dependency issues with Spark's built-in commons-compress #93
Comments
I had the same problem a few days ago, but haven't found a proper solution. |
I had the same issue (though with other software, not spark-excel). You need to shade the commons-compress dependency so that your Spark application uses the new version of commons-compress. You can do this in Java with the Maven Shade plugin, or in Scala with SBT's assembly plugin (https://github.com/sbt/sbt-assembly). In your build.sbt you can then define a rule to shade commons-compress (https://github.com/sbt/sbt-assembly#shading). If you want to use R and Python, then @nightscape may need to shade it directly in the spark-excel module that is published on Maven. The other option, "override the jars bundled with Spark", is not possible in this case because commons-compress is a core part of Spark. However, shading is not so bad here. I also recommend creating a JIRA issue with the Spark project to update commons-compress (the old version is vulnerable to several attacks). |
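The shading advice above could look roughly like this with sbt-assembly. This is a minimal sketch and not the configuration spark-excel itself uses: the plugin version, the "shadedcompress" prefix, and the split across two files are assumptions for illustration.

```scala
// project/plugins.sbt
// The version number is an assumption; use a current sbt-assembly release.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
// Rename the commons-compress classes bundled into the fat jar so they
// cannot clash with the older copy that ships with Spark.
// "shadedcompress" is a hypothetical prefix; any unused package name works.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("org.apache.commons.compress.**" -> "shadedcompress.@1")
    .inAll
)
```

Running sbt assembly should then produce a jar in which both the renamed classes and the references to them point at the new package. The rule can only rewrite classes that are actually bundled into the assembled jar, so commons-compress must not be marked as provided.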
I just released |
Hi @nightscape
|
@nightscape I think you don't include commons-compress explicitly in the resulting jar of the spark-excel module. In this case the shading rules will not apply. See fat jar: https://github.com/sbt/sbt-assembly. |
Just trying another approach. Can someone check |
@nightscape : it's OK :) |
Ok, then I'll backport this to 0.10 and release 0.11 from the beta version. |
Fixed in 0.10.2 and 0.11.0-beta3. |
The fix works in 0.10.2, but not in 0.11.0-beta3, where I still get this error. |
I am facing the same error in 0.11.0. Any update on this? |
Exception in thread "main" scala.MatchError: Map(treatemptyvaluesasnulls -> true, location -> hdfs://nameservice1/flatfiles/raw/500a_map_e.xlsx, useheader -> true, inferschema -> true, addcolorcolumns -> false, sheetname -> _500a_map_e) (of class org.apache.spark.sql.catalyst.util.CaseInsensitiveMap) I am facing the above issue. Dependencies used . Can anyone help? |
Solved the issue: using --packages com.crealytics:spark-excel_2.11:0.10.2 worked fine. |
I can reproduce this locally now. The problem seems to be that despite shading |
Not understanding it...
On the other hand, when I download and unzip the spark-excel JAR and run javap -verbose com/crealytics/spark-excel_2.12/0.11.2/org/apache/poi/openxml4j/opc/internal/ZipHelper.class, it clearly shows that the above method is using the shaded classes:
|
Maybe some of your dependencies have POI as a dependency and then this dependency does not use the shaded commons-io |
@jornfranke That was exactly the problem. I just released |
Confirmed 0.12.0 working in AWS Glue now - thanks for the quick response! |
@jlscott3 Hi, would you mind sharing how you got this to work in Glue? Update: I finally got it working in AWS Glue. Below are the jars I used: Hope it helps. |
It turns out something went wrong while publishing |
Do we need to import anything in the Spark code? Can you please provide some sample code? |
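As a rough answer to the sample-code question: the data source is looked up by its format name, so no spark-excel import should be needed in the job itself. The sketch below reuses the option names that appear in the error message earlier in this thread, which match the 0.10.x-era releases; other releases may name these options differently, so check the README for your version. The file path and sheet name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ReadExcelExample {
  def main(args: Array[String]): Unit = {
    // Master and other settings are expected to come from spark-submit.
    val spark = SparkSession.builder()
      .appName("read-excel-example")
      .getOrCreate()

    // Hypothetical path, used only for illustration.
    val path = "/tmp/People.xlsx"

    val df = spark.read
      .format("com.crealytics.spark.excel") // resolved by name at runtime
      .option("sheetName", "Sheet1")        // hypothetical sheet name
      .option("useHeader", "true")
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns", "false")
      .load(path)

    df.printSchema()
    df.show(5)

    spark.stop()
  }
}
```

The spark-excel jar still has to be on the classpath, for example via --packages com.crealytics:spark-excel_2.11:0.10.2 as mentioned above.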
Did anyone find a solution to this problem? I am facing the same problem with the latest version of spark-excel, 0.13.5:
scala> val file = new File("/Users/vinodsharma/Documents/Spark-Excel/People.xlsx")
scala> val fIP = new FileInputStream(file)
scala> val wb = new XSSFWorkbook(fIP)
How do I go about changing the classpath for the commons-compress jar? In my case, the version of the compress jar is org.apache.commons#commons-compress;1.20 |
You might have to manually exclude commons-compress from the dependencies due to this problem which I don't yet know how to fix: hammerlab/sbt-parent#32 |
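In an sbt build, a manual exclusion could look like the fragment below. This is only a sketch under assumptions: the version is the one mentioned in the previous comment, and whether excluding the transitive dependency is enough depends on what else is on your classpath.

```scala
// build.sbt
// Pull in spark-excel but drop its transitive commons-compress dependency.
// The version is taken from the comment above and may not suit your setup.
libraryDependencies += ("com.crealytics" %% "spark-excel" % "0.13.5")
  .exclude("org.apache.commons", "commons-compress")
```

When launching with spark-shell or spark-submit and --packages instead of sbt, the --exclude-packages option (a comma-separated list of groupId:artifactId) plays a similar role.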
@nightscape : This worked. Hope it helps others. |
@nightscape hi Then tried higher versions of your library from 0.10:
|
@sjahongir can you try the recommendation from @xvinosh? |
@nightscape I still see issues with spark-excel on Scala 2.12. Using 0.12.0 or 0.12.1, I get useHeader errors as well as the error above. Nothing is working. I tried using commons-compress-1.20.jar along with other jars in my spark-submit, to no avail. We are currently migrating to Scala 2.12; could you please suggest a spark-excel version for it that does not have these issues? |
Hi @SwapnaRavi21, I would recommend always using the latest version available for your Spark & Scala version. |
@nightscape Yes, we are on the latest, Scala 2.12 only. But this fix is available only for 2.11 and not for 2.12, right? Sure, thanks. Meanwhile, is there any alternative to this dependency that we can use with 2.12 until the fix is available for this version? |
Currently seeing this behavior in Databricks on multiple runtime versions (14.3 LTS, 15.4 LTS); Scala 2.12, Spark 3.5.0, version: com.crealytics:spark-excel_2.12:3.5.0_0.20.3
excluding are there any recommendations for workarounds? |
@neontty It looks like Spark defaults to an out-of-date, CVE-ridden version of commons-compress. https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.13/3.5.3 POI uses a newer version of commons-compress and must rely on methods that were added or changed recently. Can you try to upgrade the commons-compress jar that Spark uses? It may be best to ask on the Spark mailing lists or forums if you don't know how to do this. |
hi @pjfanning , thanks for the quick response. I'm just looking into this a bit more and trying to understand why the shading rule isn't enough in is it because of this discussion regarding shading in the mill build system? com-lihaoyi/mill#3815 |
@neontty thanks for commenting over at Mill 👍 |
I can use the library when I run Spark on my local Windows machine and read Excel files on the same machine. However, when I upload the files to WASB on Azure and use an HDInsight cluster to run Spark jobs (in either local or cluster mode), I get the following error:
java.lang.IllegalArgumentException: InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.<init>(ZipArchiveThresholdInputStream.java:63)
    at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipStream(ZipHelper.java:180)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:104)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:298)
    at org.apache.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:129)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:314)
    at org.apache.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:296)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:214)
    at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:180)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2$$anonfun$apply$4.apply(ExcelRelation.scala:66)
    at scala.Option.fold(Option.scala:158)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$openWorkbook$2.apply(ExcelRelation.scala:66)
    at scala.Option.getOrElse(Option.scala:121)
    at com.crealytics.spark.excel.ExcelRelation.openWorkbook(ExcelRelation.scala:64)
    at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:71)
    at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:70)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:264)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:263)
    at scala.Option.getOrElse(Option.scala:121)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:263)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:91)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:39)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
    ... 53 elided
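To narrow down an error like this, it can help to check which jar the JVM actually loaded ZipArchiveInputStream from and whether that copy implements InputStreamStatistics. The following is a small diagnostic sketch, not part of spark-excel. The object and helper names are made up for illustration, and the fully qualified interface name (org.apache.commons.compress.utils.InputStreamStatistics) is an assumption, since the error message gives only the simple name.

```scala
import scala.util.Try

object CompressCheck {
  /** Location (usually a jar URL) a class was loaded from, if the JVM exposes it. */
  def loadedFrom(className: String): Option[String] =
    for {
      cls      <- Try(Class.forName(className)).toOption
      source   <- Option(cls.getProtectionDomain.getCodeSource)
      location <- Option(source.getLocation)
    } yield location.toString

  /** True only if both classes can be loaded and className is a subtype of interfaceName. */
  def implementsInterface(className: String, interfaceName: String): Boolean =
    (for {
      cls   <- Try(Class.forName(className))
      iface <- Try(Class.forName(interfaceName))
    } yield iface.isAssignableFrom(cls)).getOrElse(false)

  def main(args: Array[String]): Unit = {
    val zip   = "org.apache.commons.compress.archivers.zip.ZipArchiveInputStream"
    // Assumed package for the interface named in the error message.
    val stats = "org.apache.commons.compress.utils.InputStreamStatistics"

    println(s"$zip loaded from: ${loadedFrom(zip).getOrElse("<not found or unknown>")}")
    println(s"implements InputStreamStatistics: ${implementsInterface(zip, stats)}")
  }
}
```

If the reported location is a jar from the Spark distribution and the second line prints false, the older commons-compress bundled with the cluster is probably shadowing the newer one. The two helper methods can also be pasted into spark-shell and called directly.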