How to Compare Data within a DataBricks Environment
Ahmed Ibrahim edited this page Oct 6, 2018
·
7 revisions
Follow the simple steps below to compare two DataFrames and visualize the results.
The columns key1 and key2 together form the primary key of both tables.
val left = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
).toDF("key1", "key2", "value1", "value2")
val right = Seq(
  ("1", "1", null, null),
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
).toDF("key1", "key2", "value1", "value2")
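Before running the library comparison, it can help to preview which rows of the sample data differ. The sketch below is a rough sanity check that uses only plain Spark (Dataset.except), not MegaSparkDiff. It rebuilds the same sample data in a local session so that it runs on its own; in a Databricks notebook the existing spark session and the left and right DataFrames above could be used instead.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch only; a Databricks notebook already provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("except-preview").getOrCreate()
import spark.implicits._

// Same sample rows as above; the explicit element type keeps the null values typed as String.
val leftRows: Seq[(String, String, String, String)] = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("4", "4", "Chad", "Charly"),
  ("5", "5", "Joe", "Smith"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal")
)
val rightRows: Seq[(String, String, String, String)] = Seq(
  ("1", "1", null, null),
  ("3", "3", "Young", "Yan"),
  ("5", "5", "Joe", "Smith"),
  ("6", "6", "Edward", "Eddy"),
  ("7", "7", "normal", "normal"),
  (null, null, "null key", "null key")
)
val leftDf = leftRows.toDF("key1", "key2", "value1", "value2")
val rightDf = rightRows.toDF("key1", "key2", "value1", "value2")

// Distinct rows present on one side but not on the other.
val leftOnly = leftDf.except(rightDf)   // keys 1, 2 and 4
val rightOnly = rightDf.except(leftDf)  // keys 1, 3 and the null-key row

leftOnly.show()
rightOnly.show()
```

Because except is a set-based difference, the duplicated ("5", "5", "Joe", "Smith") row on the left does not show up here, so this preview cannot reveal differences in duplicate counts.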
SparkFactory.initializeDataBricks receives the Spark session (the notebook's spark object) and keeps a reference to it for subsequent Spark operations.
import org.finra.msd.sparkfactory.SparkFactory
SparkFactory.initializeDataBricks(spark)
Note that the val key holds the sequence of column names that make up the primary key.
Note that the parameter 100 specifies how many records to display as HTML.
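import org.finra.msd.sparkcompare.SparkCompare
import org.finra.msd.visualization.Visualizer
// Compare the two DataFrames; the result holds a left and a right DataFrame
val comparisonResult = SparkCompare.compareSchemaDataFrames(left, right)
// Primary key columns used to join the two sides
val key: Seq[String] = Seq("key1", "key2")
val joinedDf = SparkCompare.fullOuterJoinDataFrames(comparisonResult.getLeft, comparisonResult.getRight, key)
// Render up to 100 records as an HTML table and display it in the notebook
val html = Visualizer.renderHorizontalTable(joinedDf, 100)
displayHTML(html)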
The key columns appear in the middle of the table.
The differences between left and right are highlighted in yellow.
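To make the idea of a key-based full outer join more concrete, here is a minimal sketch in plain Spark. It is not the MegaSparkDiff implementation: the l_ and r_ column prefixes, the differs flag and the reduced sample data are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for this sketch only; a Databricks notebook already provides `spark`.
val spark = SparkSession.builder().master("local[1]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Reduced sample data; value columns are prefixed (hypothetical naming) to tell the sides apart.
val l = Seq(
  ("1", "1", "Adam", "Andreson"),
  ("2", "2", "Bob", "Branson"),
  ("6", "6", "Edward", "Eddy")
).toDF("key1", "key2", "l_value1", "l_value2")

val r = Seq[(String, String, String, String)](
  ("1", "1", null, null),
  ("3", "3", "Young", "Yan"),
  ("6", "6", "Edward", "Eddy")
).toDF("key1", "key2", "r_value1", "r_value2")

// Full outer join on the composite key keeps rows that exist on only one side.
val joined = l.join(r, Seq("key1", "key2"), "full_outer")

// Null-safe equality (<=>) treats two nulls as equal, so only real differences are flagged.
val flagged = joined
  .withColumn(
    "differs",
    !((col("l_value1") <=> col("r_value1")) && (col("l_value2") <=> col("r_value2")))
  )
  // Put the key columns in the middle, with left values before and right values after.
  .select("l_value1", "l_value2", "key1", "key2", "r_value1", "r_value2", "differs")

flagged.show()
```

In this sketch, rows with differs set to true correspond to the kind of rows a visualization would highlight: key 1 has different values on each side, while keys 2 and 3 exist on one side only.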