-
Notifications
You must be signed in to change notification settings - Fork 6
Transformation
All the excel examples are located within the workbook called basic_examples.xlsx.
Now that we learned about extraction of the data from excel spreadsheets, let's look into how we can further consume and manipulate with this data.
In extraction examples all the parsed records were the instances of GenericRecord
. It holds the data in the denormalized way as a Map<String, Any>
. But what we need is a strict model with checks along the way.
E.g. let's look into into broken stats
spreadsheet within basic_examples
file.
You can see that Club Brugge misses the data about the total points.
However, this is a critical information for us and we want to check non-nullability of the data along the extraction of the data (with capturing what rows do not comply with our model validation).
Apart from this, we want to store the data in a normalised way as a data class with statically typed fields.
E.g. we know that points
could only be integers, if it's something else — it should not pass the validation checks!
To support both of the features refinery
allows you to define the data class with custom row parser to map the values to the fields of the data class.
By default refinery uses a GenericRowParser
that will simply convert all entries to a Map
.
Although useful defining your own parser will allow you to fit the data into an actual data class with typing information.
First, let's define the data class where we want to store the data:
data class TeamStat(
val team: String,
val group: String,
val plays: Int,
val wins: Int?,
val goalDiff: Int?,
val points: Int,
val date: LocalDate
: ParsedRecord()
Please note, that class should be inherited from ParsedRecord
.
Next, instead of putting all the HeaderCell
instances directly into the definition, we need to extract them as separate variables as we will need to refer to them when we define a custom row parser:
val team = StringHeaderCell("team")
val plays = StringHeaderCell("plays")
val wins = StringHeaderCell("wins")
val goalDiff = StringHeaderCell("goal diff")
val points = StringHeaderCell("points")
val date = StringHeaderCell("date")
Now, we need to define the custom parser.
This custom parser should be inherited from RowParser
with RowParserData
and ExceptionManager
passed as params
class StatParser(rowParserData: RowParserData, exceptionManager: ExceptionManager) :
RowParser(rowParserData, exceptionManager) {
override fun toRecord(row: Row): TeamStat {
return TeamStat(
team = parseRequiredFieldAsString(row, team),
group = rowParserData.metadata.getAnchor(),
plays = parseOptionalFieldAsInteger(row, plays)!!,
wins = parseOptionalFieldAsInteger(row, wins),
goalDiff = parseOptionalFieldAsInteger(row, goalDiff),
points = parseOptionalFieldAsInteger(row, points)!!, // todo: this should have a dedicated method called `parseRequiredFieldAsInteger`
date = parseOptionalFieldAsDateTime(row, date)!!.toLocalDate()
)
}
}
You can see that now we are explicitly saying what extracted data should be passed to which fields. On top of this we have validation of the extracted data (e.g. parseRequiredFieldAsString
will always return non-nullable String
and will fail to parse the row if the value is absent).
Finally, we need to combine of all that pieces of logic to the adjusted schema definition:
val headerColumns = setOf(team, plays, wins, goalDiff, points, date)
val definition = WorkbookParserDefinition(
spreadsheetParserDefinitions = listOf(
SheetParserDefinition(
sheetNameFilter = { it == "broken stats" },
tableDefinitions = listOf(
// 1st table
TableParserDefinition(
requiredColumns = headerColumns,
anchor = "Group A",
rowParserFactory = ::StatParser // referencing a class that applies a custom logic
),
// 2nd table
TableParserDefinition(
requiredColumns = headerColumns,
anchor = "Group B",
rowParserFactory = ::StatParser // referencing a class that applies a custom logic
)
)
)
)
)
Note, that we have a separate parameter in TableParserDefinition
called rowParserFactory
that accepts a lambda (reference to our custom row parser).
Refinery supports various different HeaderCell
types to help parse more complex headers. So far we've seen the StringHeaderCell
which uses a String
or Set
of those and works on case incensetive substrings. There are also:
SimpleHeaderCell
RegexHeaderCell
MergedHeaderCell
OrderedHeaderCell
You will find a set of methods to assist parsing the sheets into the correct datatypes e.g. parseOptionalFieldAsInteger
.
In case of a required field you can use the required counterpart of those methods e.g. parseRequiredFieldAsInteger
, which in case of failiure
or missing data will register a CellParserException
and continue with the next row
You will notice that you can pass an ExceptionManager
to the WorkbookParser
.
Any errors that occure during processing will be registered there, so as to not halt the progress of the parsing.
In most cases you will find the Sheet and Row referenced in the exception as well.
Details on what kind of exceptions are supported can be found here