Home

Welcome to the spark-excel wiki!

There are pages with "Examples" prefix are examples, each one will try to highlight one (or some) main use case with given options in action. Basically, it "borrows" idea from #issues as the starting point. Preferred approach is documented by example, with actual data alongside with the reported issue (if any). Otherwise, examples use a pseudo ca_dataset dataset (you can find a copy in src/test/resources). Some of these examples can be found in the docs/spark_excel_example.ipynb as well.

This wiki will be built gradually. And it means to be open for edit, feel free to jump in and correct existing bugs and issues.

TBD:

Spark-excel with Azure, for example: https://github.com/crealytics/spark-excel/issues/282
Spark-excel with Databricks, for example: https://github.com/crealytics/spark-excel/issues/125

Setup Spark Excel

Dependencies

Spark Excel, as of August 2021, beside the bundled jars, need following dependencies

spark-excel_2.12 (or build with, for example sbt -Dspark.testVersion=3.1.2 assembly)
poi-ooxm (already bundled with spark-excel.jars)
poi-ooxml-schemas
xmlbeans
commons-collections4

Credit to #133 Apache commons dependency issue & @jakeatmsft and @fwani solution

Method I: Put into $SPARK_HOME/jars

Put a copy of spark-excel and its dependencies jar files into the $SPARK_HOME/jars folder. If this is a cluster setup, make sure spark driver and all executor nodes have their $SPARK_HOME/jars get updated as well.

Restart spark/spark-cluster to make these jars available.

Method II: Cannot put spark-excel into $SPARK_HOME/jars

For pyspark's bundled spark distribution or with existing SPARK_HOME that cannot add additional jar files. In this case, please make sure sparks-excel and its dependencies jar files are ready. Then pass these jars file to spark session builder as following:

from pyspark.sql import SparkSession

SPARK_EXCEL_LIBS="/path/to/spark-excel/jars" # Update this please

DEPENDENCIES_JARS = [
    f"{SPARK_EXCEL_LIBS}/spark-excel_2.12-<spark-version>_0.14.0.jar",
    f"{SPARK_EXCEL_LIBS}/poi-ooxml-schemas-4.1.2.jar",
    f"{SPARK_EXCEL_LIBS}/commons-collections4-4.4.jar",
    f"{SPARK_EXCEL_LIBS}/xmlbeans-3.1.0.jar"
]

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", ",".join(jars)) \
    .getOrCreate()

Restart Jupyter python kernel and rerun SparkSession.builder paragraph to make sure spark session can access given jar files.

Spark Excel Options

Options	Default	Reading	Writing	Description
columnNameOfCorruptRecord	None	Support		Name for column of corrupted records
columnNameOfRowNumber	None	Support		Additional column for excel row number
dataAddress	A1	Support	Support	Data address, default to everything
dateFormat		Support		String dateFormat
enforceSchema	False	Support		Forcibly apply the specified or inferred schema to data files. If the option is enabled, headers of Excel files will be ignored.
excerptSize	None (all)	Support		excerptSize
fileExtension	xlsx	Support	Support	Output excel file extension, default to xlsx
header	Yes	Support	Support	Have header line when reading and writing
ignoreAfterHeader	None (0)	Support		Number of rows to ignore after header. Only in reading
ignoreLeadingWhiteSpace	False	Support		ignoreLeadingWhiteSpace
ignoreTrailingWhiteSpace	False	Support		ignoreTrailingWhiteSpace
inferSchema	False	Support		inferSchema
keepUndefinedRows	False	Support		If true, keep undefined Excel rows
locale	US	Support	Support	Locale = A language tag in IETF BCP 47 format
nanValue	NaN	Support		nanValue
negativeInf	Inf	Support		negativeInf
nullValue	(empty)	Support		nullValue
parseMode	permissive	Support		Parsing mode, how to handle corrupted records. Default to permissive
positiveInf	Inf	Support		positiveInf
samplingRatio	1.0 (all)	Support		Defines fraction of file used for schema inferring. For default and invalid values, 1.0 will be used
timestampFormat		Support		Format to parse timestamp from text cell
useNullForErrorCells	Yes	Support		Use null value for error cells
usePlainNumberFormat	False		Support	If true, format the cells without rounding and scientific notations
workbookPassword	(empty)	Support	Support	Workbook password, optional
zoneId	(empty)	Support

Or you can checkout the actual ExcelOptions.scala

References:

Spark Session Configurations
poi-ooxm (already bundled with spark-excel.jars)
poi-ooxml-schemas
xmlbeans
commons-collections4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly