Closed
Description
I'm creating a development endpoint in AWS Glue with the following dependencies:
spark-excel_2.11-0.11.1.jar
poi-4.1.0.jar
poi-ooxml-4.1.0.jar
xmlbeans-3.1.0.jar
spoiwo_2.12-1.4.1.jar
commons-compress-1.18.jar
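A side note on the list above: the `_2.11` and `_2.12` suffixes in jar names are Scala binary versions, and this list mixes the two (Spark 2.2.x is normally built against Scala 2.11). That may well be unrelated to the error reported here, but it is cheap to check. A minimal sketch in plain Python, where the helper name `scala_versions` is made up for illustration:

```python
import re
from collections import defaultdict

# Matches names such as "spark-excel_2.11-0.11.1.jar":
# artifact name, Scala binary suffix, then the artifact version.
SCALA_JAR = re.compile(
    r"^(?P<artifact>.+)_(?P<scala>\d+\.\d+)-(?P<version>[\w.]+)\.jar$"
)


def scala_versions(jar_names):
    """Group Scala-suffixed jars by Scala binary version.

    Jars without a Scala suffix (plain Java libraries) are skipped.
    """
    by_scala = defaultdict(list)
    for name in jar_names:
        match = SCALA_JAR.match(name)
        if match:
            by_scala[match.group("scala")].append(match.group("artifact"))
    return dict(by_scala)


# The dependency list from this issue.
jars = [
    "spark-excel_2.11-0.11.1.jar",
    "poi-4.1.0.jar",
    "poi-ooxml-4.1.0.jar",
    "xmlbeans-3.1.0.jar",
    "spoiwo_2.12-1.4.1.jar",
    "commons-compress-1.18.jar",
]

found = scala_versions(jars)
if len(found) > 1:
    print("Mixed Scala binary versions:", found)
```

If more than one Scala version shows up, swapping the odd jar for the build that matches the cluster's Scala version removes one variable from the investigation.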
When I create a notebook against the dev endpoint and attempt to load an xlsx using the following command:
spark.read.format("com.crealytics.spark.excel").options(sheetName="sheet1").options(useHeader="true").load(s3_path)
I get:
u'InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1561581531952_0001/container_1561581531952_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 159, in load
return self._df(self._jreader.load(path))
File "/mnt/yarn/usercache/livy/appcache/application_1561581531952_0001/container_1561581531952_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt/yarn/usercache/livy/appcache/application_1561581531952_0001/container_1561581531952_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'InputStream of class class org.apache.commons.compress.archivers.zip.ZipArchiveInputStream is not implementing InputStreamStatistics.'
This appears to be similar to #93, but I've tried both 0.11.1 and 0.10.2 of spark-excel and get the same error.
I'm not sure where to go from here, and I don't think I have a way of shading dependencies in Glue. Glue uses Spark 2.2.1, FWIW.
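One way to narrow this down is to ask the driver JVM which jar `ZipArchiveInputStream` was actually loaded from. `InputStreamStatistics` was added in Commons Compress 1.17, so an older copy of commons-compress ahead of the 1.18 jar on the classpath would be consistent with this message. The following is only a sketch: it relies on the private `spark._jvm` py4j handle, assumes the context class loader sees the same jars as the reader, and the helper names are invented for illustration.

```python
import re

ZIP_CLASS = "org.apache.commons.compress.archivers.zip.ZipArchiveInputStream"


def loaded_from(spark, class_name=ZIP_CLASS):
    """Return the location the driver JVM loaded class_name from, or None.

    Assumption: spark._jvm (a private py4j handle) is available, as it is
    in a regular PySpark session.
    """
    jvm = spark._jvm
    loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
    source = loader.loadClass(class_name).getProtectionDomain().getCodeSource()
    return source.getLocation().toString() if source is not None else None


def commons_compress_version(jar_location):
    """Parse the version tuple from a path ending in commons-compress-<version>.jar."""
    match = re.search(r"commons-compress-(\d+(?:\.\d+)*)\.jar$", jar_location)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_input_stream_statistics(jar_location):
    """True if the jar name indicates Commons Compress 1.17 or newer."""
    version = commons_compress_version(jar_location)
    return version is not None and version >= (1, 17)
```

In the notebook this would be used as `has_input_stream_statistics(loaded_from(spark))`. A `False` result would suggest that a copy of commons-compress older than 1.17 is being picked up ahead of the 1.18 jar listed above.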