octopus

A framework for job distribution and data replication on large clusters of machines, built on Apache Spark and taking advantage of Spark's resilience and simple model for task distribution. An OctopusContext is obtained from an existing SparkContext:

import org.apache.spark.SparkContext
import OctopusContext._

val sc: SparkContext = ??? // an existing SparkContext
val oc = sc.getOctopusContext()
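If you do not already have a SparkContext at hand, here is a minimal sketch of the standard Spark setup (the app name and master URL below are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("octopus-example")  // placeholder app name
  .setMaster("spark://host:7077") // placeholder master URL
val sc = new SparkContext(conf)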

Octopus' abstraction for data replication is the DataSet class, which represents data replicated on every machine in the cluster. Unlike Spark RDDs, DataSets do not partition their data across the cluster; they simply create a full copy on each machine. DataSets support lazy transformations (map, filter, flatMap, reduceByKey, groupBy, drop, take, ...), which are not executed until the data is needed by a job.

val data: DataSet[Int] = oc.deploy(1 to 1000)
val transformed: DataSet[(Int, Int)] = data.map(_ + 1).filter(_ % 2 == 0)
  .flatMap(i => 1 to i) // this increases the data size a lot; better done on the workers
  .map(i => (i, i + 1))
  .reduceByKey(_ * _) // transformations not computed yet - the data is not even deployed yet

Jobs are functions from the replicated data to a result; calling execute triggers the deployment and any pending transformations, then distributes the jobs across the cluster:

def job(i: Int)(data: Iterable[(Int, Int)]) = data.map(_._2 + i).sum

val jobs = (1 to 10).map(i => job(i) _).toSeq
val results: Seq[Int] = transformed.execute(jobs)

Octopus supports caching of the replicated data for later reuse.

val cached = data.cache() // the data will be stored when a job is next performed on it
cached.unpersist() // the data will be removed from the cache when any job is next performed
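As a sketch of why caching helps (assuming, per the comments above, that the cache is materialized the first time a job runs), a cached DataSet can serve several rounds of jobs without being redeployed:

val cachedData = transformed.cache()

// first round: deploys the data and stores it on each machine
val first: Seq[Int] = cachedData.execute(jobs)

// second round: reuses the cached copies instead of redeploying
val second: Seq[Int] = cachedData.execute(jobs)

cachedData.unpersist() // free the cached copies once done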

Jobs may also be launched without a DataSet by using the OctopusContext:

def job(i: Int)(): Int = (1 to 1000).map(_ + i).sum

val jobs = (1 to 1000).map(i => job(i) _).toList
val results: List[Int] = oc.runJobs(jobs)

It doesn't matter which kind of collection you provide for your jobs (List, Seq, Vector, ...) as long as it is a GenSeqLike; Octopus will return the results in the same collection type.
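For instance (a sketch assuming the runJobs signature shown above), a Vector of jobs yields a Vector of results:

val vectorJobs = (1 to 10).map(i => job(i) _).toVector
val vectorResults: Vector[Int] = oc.runJobs(vectorJobs) // same collection type back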

You can also launch jobs asynchronously! The results will be wrapped in a Future:

val results: Future[Seq[Int]] = transformed.submit(jobs)
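A minimal sketch of consuming that Future with the standard library (submit is from the example above; the rest is plain scala.concurrent):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val async: Future[Seq[Int]] = transformed.submit(jobs)

async.foreach(rs => println(s"job results: $rs")) // react when the jobs finish
val rs: Seq[Int] = Await.result(async, 10.minutes) // or block until they are done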
