# TeraSort benchmark for Spark
This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that currently holds the record. That program is here.
Build the project with `mvn install`, then `cd` to your Spark install to run the steps below.

Generate the input data:

    ./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraGen \
        path/to/spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
        1g file://$HOME/data/terasort_in

Sort the data:

    ./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraSort \
        path/to/spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
        file://$HOME/data/terasort_in file://$HOME/data/terasort_out

Validate the sorted output:

    ./bin/spark-submit --class com.github.ehiggs.spark.terasort.TeraValidate \
        path/to/spark-terasort/target/spark-terasort-1.0-SNAPSHOT-jar-with-dependencies.jar \
        file://$HOME/data/terasort_out file://$HOME/data/terasort_validate

This TeraSort doesn't use the partitioning scheme that Hadoop's TeraSort uses, so performance suffers. I could copy the partitioning code from the Hadoop tree verbatim, but I thought it would be more appropriate to rewrite more of it in Scala.
I haven't pulled in the DaytonaPartitioner from the record-holding sort yet because it's tightly intertwined with the rest of that code and, AFAIK, it's not really idiomatic Spark.
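
For anyone picking this up, here is a rough sketch of the general idea behind a sampled range partitioner, which is the kind of scheme Hadoop's TeraSort relies on: sample some keys, pick split points from the sorted sample, and send each key to the range it falls in so that partition i only holds keys smaller than those in partition i+1. This is not the Hadoop code or the DaytonaPartitioner. The object and method names are made up for illustration, and I am assuming keys are compared as unsigned bytes. In Spark, the `getPartition` logic would live in a subclass of `org.apache.spark.Partitioner`.

```scala
// Hypothetical sketch: none of these names come from this repository.
object SampledRangePartitioning {

  // Compare two keys lexicographically, treating each byte as unsigned.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  // Pick numPartitions - 1 evenly spaced split points from a sample of keys.
  def splitPoints(sample: Seq[Array[Byte]], numPartitions: Int): Array[Array[Byte]] = {
    require(numPartitions >= 1 && sample.nonEmpty)
    val sorted = sample.sortWith((a, b) => compareKeys(a, b) < 0).toArray
    (1 until numPartitions).map(i => sorted(i * sorted.length / numPartitions)).toArray
  }

  // Binary search for the number of split points that are <= key.
  // That count is the partition index, so a key equal to a split point
  // goes to the partition on the right of that split.
  def getPartition(key: Array[Byte], splits: Array[Array[Byte]]): Int = {
    var lo = 0
    var hi = splits.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (compareKeys(key, splits(mid)) < 0) hi = mid else lo = mid + 1
    }
    lo
  }
}
```

With split points chosen this way, sorting each partition locally gives a globally sorted result, and the partitions stay roughly balanced as long as the sample is representative.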
TeraValidate can read the file parts in the wrong order on native file systems (e.g. if you run Spark on your laptop, or on Lustre, Panasas, etc.). HDFS apparently always returns the files in alphanumeric order, so most Hadoop users aren't affected. I thought I had fixed this in TeraInputFormat, but I have been able to reproduce it since migrating the code from my Spark terasort branch.
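
One possible workaround, sketched below, is to sort the part files by name explicitly instead of trusting the order the file system returns them in (`java.io.File.listFiles` makes no ordering guarantee). This is not the fix in TeraInputFormat. `PartFiles` is a hypothetical helper, and it assumes the output files use the usual zero-padded `part-` prefix, so that plain string order matches partition order.

```scala
import java.io.File

// Hypothetical helper: not part of this repository.
object PartFiles {

  // Keep only part files (skipping markers such as _SUCCESS) and order
  // them by name. Assumes zero-padded names like part-00000.
  def inNameOrder(names: Seq[String]): Seq[String] =
    names.filter(_.startsWith("part-")).sorted

  // List the part files of a local output directory in name order.
  // listFiles returns null for a missing directory, hence the Option.
  def list(dir: File): Seq[File] = {
    val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty[File])
    entries
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
  }
}
```

For example, `PartFiles.inNameOrder(Seq("part-00010", "_SUCCESS", "part-00002"))` yields `part-00002` followed by `part-00010`.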
PRs are very welcome!