The GenericDataSource framework is a utility framework that helps with configuring and reading DataFrames.
This framework supports reading from a custom data source, such as delta.
Keep in mind that custom data sources usually require custom library or package dependencies.
The framework is composed of two classes:
- GenericDataSource, which is created based on a GenericSourceConfiguration class and provides one main function: override def read(implicit spark: SparkSession): Try[DataFrame]
- GenericSourceConfiguration: the necessary configuration parameters
Sample code
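import org.apache.spark.sql.SparkSession
import org.tupol.spark.io._
implicit val sparkSession: SparkSession = ???
val sourceConfiguration: GenericSourceConfiguration = ???
// read returns a Try[DataFrame], so failures are captured rather than thrown
val dataframe = GenericDataSource(sourceConfiguration).read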
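Because read returns a Try[DataFrame], the caller decides how to handle a failed read. The following is a minimal sketch of one way to do that, not part of the framework: the TryHandling object and its describe helper are hypothetical names introduced here for illustration, and the helper is written generically so that it only depends on the Scala standard library.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of org.tupol.spark.io.
object TryHandling {
  // Converts a Try into an Either carrying a readable error message on failure.
  def describe[A](result: Try[A]): Either[String, A] = result match {
    case Success(value) => Right(value)
    case Failure(error) => Left(s"Read failed: ${error.getMessage}")
  }
}
```

With the sample code above in scope, TryHandling.describe(GenericDataSource(sourceConfiguration).read) would yield either the DataFrame or an error message, assuming the read signature shown earlier.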
Optionally, one can use the implicit decorator for the SparkSession, available by importing org.tupol.spark.io.implicits._.
Sample code
import org.apache.spark.sql.SparkSession
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._
val spark: SparkSession = ???
val sourceConfiguration: GenericSourceConfiguration = ???
val dataframe = spark.source(sourceConfiguration).read
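The schema and schema.path parameters described in the parameter list that follows both expect a JSON Apache Spark schema. As a rough sketch of how such JSON can be produced and checked, the example below builds a schema by hand and round-trips it through prettyJson and DataType.fromJson, both of which are standard Spark calls. The SchemaJsonExample object and the id and name fields are hypothetical and used only for illustration; for an existing DataFrame df, the same JSON would come from df.schema.prettyJson.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

// Hypothetical example object; the field names are made up for illustration.
object SchemaJsonExample {
  // A small schema built by hand; in practice this would usually be df.schema.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // JSON text of the schema, in the form expected for a JSON Apache Spark schema.
  val schemaJson: String = schema.prettyJson

  // Parsing the JSON back should give a StructType equal to the original.
  val parsed: StructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]
}
```

The resulting schemaJson string can be saved to a file for use with schema.path, or placed in the application.conf file for the schema parameter.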
format (Required)
- the type of the input file and the corresponding source / parser
- any value is acceptable, but it needs to be supported by Spark
schema.path (Optional)
- the local path or the class path to the JSON Apache Spark schema that should be enforced on the input data
- this schema can be easily obtained from a DataFrame by calling the prettyJson function on its schema
- if this parameter is found, the schema is loaded from the given file; otherwise, the schema parameter is tried
schema (Optional)
- the JSON Apache Spark schema that should be enforced on the input data
- this schema can be easily obtained from a DataFrame by calling the prettyJson function on its schema
- due to its complex structure, this parameter cannot be passed as a command line argument; it can only be passed through the application.conf file
options (Optional)
- due to its complex structure, this parameter cannot be passed as a command line argument; it can only be passed through the application.conf file
- due to its complex structure, this parameter cannot be passed as a command line argument,
but it can only be passed through the