The `DatasetUtils` object provides methods for flattening (unnesting) Spark's recursive `StructType` instances.
The partitioning system is designed to support extensible partitioning of RDF data.
The following entities are involved:

- The method for partitioning an `RDD[Triple]` is located in `RdfPartitionUtilsSpark`. It uses an `RdfPartitioner`, which maps a `Triple` to a single `RdfPartition` instance.
- `RdfPartition`, as the name suggests, represents a partition of the RDF data and defines two methods:
  - `matches(triple: Triple): Boolean`: tests whether a triple fits into the partition.
  - `layout: TripleLayout`: returns the `TripleLayout` associated with the partition, as explained below.
- Furthermore, `RdfPartition`s are expected to be serializable and to define `equals` and `hashCode`.
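To make the contract concrete, here is a minimal, self-contained sketch of an `RdfPartition` implementation. The `Triple` and `TripleLayout` types below are simplified stand-ins for the real Jena and framework classes, and `PredicatePartition` is a hypothetical partition keyed on the predicate; none of these exact definitions come from the library itself:

```scala
// Simplified stand-in for org.apache.jena.graph.Triple (illustration only).
case class Triple(s: String, p: String, o: String)

// Reduced TripleLayout contract; the real trait also declares a schema.
trait TripleLayout {
  def fromTriple(triple: Triple): Product
}

// The RdfPartition contract as described above: serializable, with
// matches and layout. Structural equals/hashCode come from case classes.
trait RdfPartition extends Serializable {
  def matches(triple: Triple): Boolean
  def layout: TripleLayout
}

// Hypothetical partition holding all triples with a fixed predicate.
case class PredicatePartition(predicate: String) extends RdfPartition {
  def matches(triple: Triple): Boolean = triple.p == predicate

  // The predicate is constant within the partition, so the tabular
  // representation only needs to keep subject and object.
  def layout: TripleLayout = new TripleLayout {
    def fromTriple(triple: Triple): Product = (triple.s, triple.o)
  }
}
```

Because `PredicatePartition` is a case class, two partitions for the same predicate compare equal and hash identically, which is exactly what the serializability and `equals`/`hashCode` requirement above is for: partitions computed on different Spark workers must collapse to one logical partition.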
- `TripleLayout` instances are used to obtain framework-agnostic, compact tabular representations of triples according to a partition. For this purpose, a `TripleLayout` defines two methods:
  - `fromTriple(triple: Triple): Product`: returns, for a given triple, its representation as a `Product` (the superclass of all Scala `Tuple`s).
  - `schema: Type`: returns the exact Scala type of the objects returned by `fromTriple`, such as `typeOf[Tuple2[String, Double]]`. Hence, layouts are expected to only yield instances of one specific type.
- See the available layouts for details.
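The relationship between `fromTriple` and `schema` can be sketched as follows. This is a hypothetical layout for illustration, not one of the library's shipped layouts; it assumes `scala-reflect` is on the classpath for `typeOf`:

```scala
import scala.reflect.runtime.universe._

// Illustrative layout: every triple is rendered as a (subject, object)
// pair of strings, so the declared schema is exactly Tuple2[String, String].
object SubjectObjectLayout {
  def fromTriple(s: String, p: String, o: String): Product = (s, o)

  // The single, exact type of all values produced by fromTriple.
  def schema: Type = typeOf[Tuple2[String, String]]
}

println(SubjectObjectLayout.schema)
println(SubjectObjectLayout.fromTriple("ex:alice", "foaf:name", "Alice"))
```

A downstream consumer can read `schema` once to build, say, a table definition, and then trust that every `Product` the layout emits fits that shape; this is why a layout must never mix tuple types.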