- Wikipedia: Compare different methods for finding the most common keywords in an RDD
- Data:
http://alaska.epfl.ch/~dockermoocs/bigdata/wikipedia.dat data/
- Run:
spark-shell --master local[*] -i wikipedia.scala
WikipediaAnalysis.Wikipedia.compareRankingMethods(sc)
- StackOverflow: KMeans clustering of StackOverflow questions & answers
- Data:
http://alaska.epfl.ch/~dockermoocs/bigdata/stackoverflow.csv
- Run:
spark-shell --master local[*] -i stackoverflow.scala
StackOverflowAnalysis.StackOverflow.clusterPostsUsingKMeans(sc)
- Record Linkage [In Progress]: Deduplication of records
- Data:
https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Run:
spark-shell -i linkage.scala
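To give a feel for what "comparing methods" can mean in the Wikipedia example, here is a minimal sketch of two ways to count keyword occurrences in an RDD. It is not taken from wikipedia.scala: the object name `RankingSketch`, the keyword list, and the tiny in-memory sample are all hypothetical, and the real project may use different methods and data types.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch; names and sample data are illustrative only.
object RankingSketch {
  val keywords: List[String] = List("Scala", "Java", "Python")

  // Method 1: one full pass over the RDD per keyword.
  def rankNaive(articles: RDD[String]): List[(String, Int)] =
    keywords
      .map(k => (k, articles.filter(_.split(" ").contains(k)).count().toInt))
      .sortBy(-_._2)

  // Method 2: a single pass that emits (keyword, 1) pairs, merged with reduceByKey.
  // Keywords that never occur are absent from this result.
  def rankReduceByKey(articles: RDD[String]): List[(String, Int)] = {
    val ks = keywords // local copy so the closure captures only the list
    articles
      .flatMap { text =>
        val words = text.split(" ").toSet
        ks.filter(words.contains).map(k => (k, 1))
      }
      .reduceByKey(_ + _)
      .collect()
      .toList
      .sortBy(-_._2)
  }

  // Runs a block, prints its wall-clock time, and returns its result.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e6} ms -> $result")
    result
  }

  def compare(sc: SparkContext): Unit = {
    // Assumed sample input; the real example reads data/wikipedia.dat.
    val articles = sc.parallelize(Seq(
      "Scala and Java", "Scala rocks", "Python and Scala", "Java only")).cache()
    timed("naive")(rankNaive(articles))
    timed("reduceByKey")(rankReduceByKey(articles))
  }
}
```

In spark-shell, the block can be entered with `:paste` and run as `RankingSketch.compare(sc)`. On a sample this small the timings say little; the point is only that the first method scans the data once per keyword while the second scans it once in total.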