This repo contains mapreduce implementation in Go.
This is first lab in MIT's course on distributed systems 6.824. Please see the lab instruction here. Note that I didn't submit the lab, since I wasn't enrolled.
We have a master and multiple workers.
- User starts a master by specifying number of reducers (nR) and a list of files to run mapReduce on. User also starts multiple workers with custom mapper/reducer funtions
- Master stores the list of files the users provided and other state in a master Data structure.
- When workers are started, they ping master for task every 5 sec (its synchronous, so worker finishes the task at hand before pinging the master again )
- During map phase:
- master gives worker one file to run map
- worker runs mapper on each file. for each map task (i-th file), the worker produces bunch of key, values. Each key,value is stored as a json entry in a file called
mr-{{i}}-{{r}}.txt
wherei
is the file index andr = hash(key) mod nR
. So each map tasks produces at mostrn
files. So overall after all map tasks are done there will benumFiles * nR
files.
- During Reduce phase:
- there will be nR reduce tasks. For each r-th task, master gives worker a reduce task with ID =
r
and a list of all files which have key, values such thatr = hash(key) mod nR
namely the filesmr-*-{{r}}.txt
. - the output of each reduce task will be in a file name
mr-out-r
- there will be nR reduce tasks. For each r-th task, master gives worker a reduce task with ID =
The master will start a goroutine to monitor state of tasks. It checks if any tasks are not marked DONE within 10 secs. If so, they are marked FAILED and eligible for reassignment.
One of the map reduce apps is a crash.go
which crashes or runs slow with certain probabilities. The test file src/main/test-mr.sh
produces scenario to run the crash scenario.
- Start master with a list of files as argument
cd src/main
go run mrmaster.go pg-being_ernest.txt pg-grimm.txt
- Start some workers with a mapreduce plugin (eg wordcounter)
cd src/main
# compile plugin
go build -buildmode=plugin ../mrapps/wc.go
# run worker
go run mrworker.go wc.so
cd src/main
sh test-mr.sh
This project is not ready for distributed envronment. All the mappers produce the intermediate files in the same directory which reducers can access. In distributed environment, the reducers will either have to connect to mapper via RPC to access the files or the mappers will have to store files in a distributed file system (eg HDFS)