An extensible toolset for Spark performance benchmarking.
Currently available Spark jobs (including dataset generators):
| Data Type | Algorithm |
|---|---|
| Vector | KMeans |
| Vector | LinearRegression |
| Vector | LogisticRegression |
| Tabular | GroupByCount |
| Tabular | Join |
| Tabular | SelectWhereOrderBy |
| Text | Grep |
| Text | Sort |
| Text | WordCount |
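For orientation, the sketch below shows what a text-type job such as WordCount typically looks like in Spark. It is an illustrative, self-contained example, not the repository's actual implementation; the object name, argument handling, and paths are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative WordCount-style job. Names and the argument layout are
// assumptions for this sketch, not the code shipped in this repository.
object WordCountExample {
  def main(args: Array[String]): Unit = {
    // Input and output paths are expected as command-line arguments.
    val Array(inputPath, outputPath) = args

    val spark = SparkSession.builder()
      .appName("WordCountExample")
      .getOrCreate()

    spark.sparkContext
      .textFile(inputPath)        // read text input
      .flatMap(_.split("\\s+"))   // split lines into words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum counts per word
      .saveAsTextFile(outputPath) // write (word, count) pairs

    spark.stop()
  }
}
```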
To compile the jobs to a jar file:

```
cd spark
sbt package
```

- Adjust `run_scripts/submit_local_job` to your local setup and execute it (an illustrative `spark-submit` call is sketched below).
- Later you can extend the script to submit jobs to a cluster that is available to you, be it in a public cloud or an on-premise setup.
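As a rough idea of what such a local submit script can contain, the snippet below runs `spark-submit` against a locally built jar. The jar path, main class, and job arguments are placeholders and assumptions; adapt them to whatever `run_scripts/submit_local_job` actually expects in your checkout.

```bash
#!/usr/bin/env bash
# Hypothetical local submission; the jar path, class name, and
# job arguments below are placeholders, not the script's real values.
spark-submit \
  --master "local[*]" \
  --class <fully.qualified.JobClass> \
  spark/target/scala-*/<benchmark-jar>.jar \
  <job-specific arguments>
```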