-
Notifications
You must be signed in to change notification settings - Fork 149
Examples: With Row Number & Undefined Rows
Quang Hoang Xuan edited this page Sep 4, 2021
·
4 revisions
Purpose: Load Excel files with row number and keeping undefined rows
Let use data file issue_285_bryce21.xlsx for this example
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
ROOT="/<please-change>/to/your-path"
schema = StructType([
StructField("RowID", IntegerType(), True),
StructField("1", StringType(), True),
StructField("2", StringType(), True),
StructField("3", StringType(), True)
])
df = spark.read.format("excel") \
.schema(schema) \
.option("header", False) \
.option("columnNameOfRowNumber", "RowID") \
.load(f"{ROOT}/src/test/resources/spreadsheets/issue_285_bryce21.xlsx")
df.show()
Got
+-----+---------+----+----+
|RowID| 1| 2| 3|
+-----+---------+----+----+
| 0|File info|null|null|
| 1| Info|Info|Info|
| 3| Metadata|null|null|
| 5| null| 1| 2|
| 6| A| 1| 2|
| 7| B| 5| 6|
| 8| C| 9| 10|
| 11| Metadata|null|null|
| 13| null| 1| 2|
| 14| A| 1| 2|
| 15| B| 4| 5|
| 16| C| 7| 8|
+-----+---------+----+----+
Notice: there is an extra RowID column, and value 2,4,9,10 are missing.
df = spark.read.format("excel") \
.schema(schema) \
.option("header", False) \
.option("keepUndefinedRows", True) \
.option("columnNameOfRowNumber", "RowID") \
.load(f"{ROOT}/src/test/resources/spreadsheets/issue_285_bryce21.xlsx")
df.show()
Got
+-----+---------+----+----+
|RowID| 1| 2| 3|
+-----+---------+----+----+
| 0|File info|null|null|
| 1| Info|Info|Info|
| null| null|null|null|
| 3| Metadata|null|null|
| null| null|null|null|
| 5| null| 1| 2|
| 6| A| 1| 2|
| 7| B| 5| 6|
| 8| C| 9| 10|
| null| null|null|null|
| null| null|null|null|
| 11| Metadata|null|null|
| null| null|null|null|
| 13| null| 1| 2|
| 14| A| 1| 2|
| 15| B| 4| 5|
| 16| C| 7| 8|
+-----+---------+----+----+
There are rows for undefined Excel rows now. For undefined Excel row, its row-number is also null.
- Apache POI about missing Excel rows
- #40 Allow reading only a subset of rows
- #59 Rows are returned in incorrect order on cluster
- #115 Can we get the excel row number column while reading Excel file to the dataframe?
- #285 Empty rows are automatically removed. Is it possible to not have that behavior?
- #352 When some data lines in the middle are empty, you cannot see these empty data when outputting data