-
Notifications
You must be signed in to change notification settings - Fork 149
Home
Welcome to the spark-excel wiki!
There are pages with "Examples" prefix are examples, each one will try to highlight one (or some) main use case with given options in action. Basically, it "borrows" idea from #issues as the starting point. Preferred approach is documented by example, with actual data alongside with the reported issue (if any). Otherwise, examples use a pseudo ca_dataset dataset (you can find a copy in src/test/resources). Some of these examples can be found in the docs/spark_excel_example.ipynb as well.
This wiki will be built gradually. And it means to be open for edit, feel free to jump in and correct existing bugs and issues.
TBD:
- Spark-excel with Azure, for example: https://github.com/crealytics/spark-excel/issues/282
- Spark-excel with Databricks, for example: https://github.com/crealytics/spark-excel/issues/125
Spark Excel, as of August 2021, beside the bundled jars, need following dependencies
-
spark-excel_2.12 (or build with, for example
sbt -Dspark.testVersion=3.1.2 assembly
) - poi-ooxm (already bundled with spark-excel.jars)
- poi-ooxml-schemas
- xmlbeans
- commons-collections4
Credit to #133 Apache commons dependency issue & @jakeatmsft and @fwani solution
Put a copy of spark-excel and its dependencies jar files into the $SPARK_HOME/jars folder. If this is a cluster setup, make sure spark driver and all executor nodes have their $SPARK_HOME/jars get updated as well.
Restart spark/spark-cluster to make these jars available.
For pyspark's bundled spark distribution or with existing SPARK_HOME that cannot add additional jar files. In this case, please make sure sparks-excel and its dependencies jar files are ready. Then pass these jars file to spark session builder as following:
from pyspark.sql import SparkSession
SPARK_EXCEL_LIBS="/path/to/spark-excel/jars" # Update this please
DEPENDENCIES_JARS = [
f"{SPARK_EXCEL_LIBS}/spark-excel_2.12-<spark-version>_0.14.0.jar",
f"{SPARK_EXCEL_LIBS}/poi-ooxml-schemas-4.1.2.jar",
f"{SPARK_EXCEL_LIBS}/commons-collections4-4.4.jar",
f"{SPARK_EXCEL_LIBS}/xmlbeans-3.1.0.jar"
]
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.jars", ",".join(jars)) \
.getOrCreate()
Restart Jupyter python kernel and rerun SparkSession.builder paragraph to make sure spark session can access given jar files.
Options | Default | Reading | Writing | Description |
---|---|---|---|---|
columnNameOfCorruptRecord | None | Support | Name for column of corrupted records | |
columnNameOfRowNumber | None | Support | Additional column for excel row number | |
dataAddress | A1 | Support | Support | Data address, default to everything |
dateFormat | Support | String dateFormat | ||
enforceSchema | False | Support | Forcibly apply the specified or inferred schema to data files. If the option is enabled, headers of Excel files will be ignored. | |
excerptSize | None (all) | Support | excerptSize | |
fileExtension | xlsx | Support | Support | Output excel file extension, default to xlsx |
header | Yes | Support | Support | Have header line when reading and writing |
ignoreAfterHeader | None (0) | Support | Number of rows to ignore after header. Only in reading | |
ignoreLeadingWhiteSpace | False | Support | ignoreLeadingWhiteSpace | |
ignoreTrailingWhiteSpace | False | Support | ignoreTrailingWhiteSpace | |
inferSchema | False | Support | inferSchema | |
keepUndefinedRows | False | Support | If true, keep undefined Excel rows | |
locale | US | Support | Support | Locale = A language tag in IETF BCP 47 format |
nanValue | NaN | Support | nanValue | |
negativeInf | Inf | Support | negativeInf | |
nullValue | (empty) | Support | nullValue | |
parseMode | permissive | Support | Parsing mode, how to handle corrupted records. Default to permissive | |
positiveInf | Inf | Support | positiveInf | |
samplingRatio | 1.0 (all) | Support | Defines fraction of file used for schema inferring. For default and invalid values, 1.0 will be used | |
timestampFormat | Support | Format to parse timestamp from text cell | ||
useNullForErrorCells | Yes | Support | Use null value for error cells | |
usePlainNumberFormat | False | Support | If true, format the cells without rounding and scientific notations | |
workbookPassword | (empty) | Support | Support | Workbook password, optional |
zoneId | (empty) | Support |
Or you can checkout the actual ExcelOptions.scala
- Spark Session Configurations
- poi-ooxm (already bundled with spark-excel.jars)
- poi-ooxml-schemas
- xmlbeans
- commons-collections4