Commit statistics and anomalous-day detection for public GitHub repositories (using Spark/Scala).
- Spark 2.2.0 or above
- sbt (the Scala build tool)
- git clone [this repository]
- sbt clean package
- spark-submit --files [commits.csv] [sbt-fullpath-output.jar]
Although this assignment could be solved more concisely, I chose the type-safe approach with Scala Datasets and case classes to demonstrate a type-safe workflow.
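A minimal sketch of that type-safe style: a case class gives the commit rows a compile-time schema, and aggregations run on `Dataset[Commit]` rather than an untyped DataFrame. The field names, app name, and file path below are illustrative assumptions, not the repository's actual schema.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema for a row of commits.csv; the real columns may differ.
case class Commit(repo: String, author: String, date: String, commits: Long)

object CommitStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("commit-stats")
      .getOrCreate()
    import spark.implicits._

    // .as[Commit] turns the untyped DataFrame into a typed Dataset,
    // so column access is checked at compile time.
    val commits = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("commits.csv")
      .as[Commit]

    // Example typed aggregation: total commits per day.
    val perDay = commits
      .groupByKey(_.date)
      .mapValues(_.commits)
      .reduceGroups(_ + _)

    perDay.show()
    spark.stop()
  }
}
```

The same aggregation could be written with untyped `groupBy("date")` calls, but a typo in a column name would then surface only at runtime.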
In a real cluster environment (standalone, Mesos, or YARN), the input file(s) should be placed on a distributed file system (HDFS, S3, etc.) to improve performance.
The raw data is available at: https://drive.google.com/open?id=1dsLhVGFA1n-_Yl5xd-NjhZyGB_MqIxwZ