Skip to main content

Map-Reduce

In this assignment you need to implement the master and the workers of Map-Reduce. The worker process performs the Map and Reduce tasks of the application and performs the read/write operations to the file. The master process is the controller which assigns different tasks to different workers. The master also needs to take care of failures.

Check out the preliminaries first!

Support code

The support code can be found in the Google classroom assignment. You can use the support code as your starting code. The support code has a sequential implementation of the Map-Reduce, i.e., the map is executed first and then the reduce is execute in a single process. The support code also has the implementation of word-count and text indexer applications.

Map-Reduce sequential: src/main/mrsequential.go
Word-count: src/mrapps/wc.go
Indexer: src/mrapps/indexer.go
Input file: src/mrapps/pg*.txt
Output file: mr-out-0

Your answers*

Your Master: mr/master.go
Your Worker: mr/worker.go
Your RPC: mr/RPC.go

Steps to execute sequential Map-Reduce:

    $ cd src/main
	$ go build -buildmode=plugin ../mrapps/wc.go
    $ rm mr-out*
    $ go run mrsequential.go wc.so pg*.txt
    $ more mr-out-0

Implementation

You are required to implement the master and the worker.

Master requirements There is a single master program that is responsible to assign different Map-Reduce tasks to the workers. The master needs to track whether a worker has completed the given task within a reasonable time (say 10 seconds). When a worker fails to complete the task, the task is re-assigned to a different worker.

Worker requirements One or more workers execute in parallel, ideally in multiple systems. Each worker is assigned some Map and Reduce tasks by the master. The worker reads the input corresponding to the assigned task, executes the task, and write the output of the assigned task to the corresponding output file. The worker and the master communicate using RPCs.

Output formats

  • A worker may multiple reduce tasks. The output for the nth reduce task should be written to mr-out-n
  • The output file should contain only key value pairs (one record per line). Check main/mrsequential.go for details.
  • The output of the Map tasks have to be written to file as well. This output is the input for the Reduce phase.

Running the code

  • Build a fresh word count plugin:
$ go build -buildmode=plugin ../mrapps/wc.go
  • Run the master in the main directory (pg*txt are the input files):
$ rm mr-out*
$ go run mrmaster.go pg-*.txt
  • Run a few worker processes:
$ go run mrworker.go wc.so

— Once the Map-Reduce has finished, the output can be found in mr-out-*. For the case of word-count, the sorted union of the workers’ output is the final output.

$ cat mr-out-* | sort | more

Things to take care

  • You solution should modify mr/worker.go, mr/master.go, and mr/rpc.go only. Changes to any other files can be done only for debugging purposes (and they need to be reverted back before submission).
  • mr/master.go is required to implement a Done() method that returns true when the Map-Reduce task is completed. This method notifies main/mrmaster.go about the completion of the task.
  • The worker process should exit after completion. When the worker is slow and by the time the worker completes, the master might have completed the task and exited. Even in this scenario, the worker needs to exit.

A few things to know

  • Application plugin: The Map and Reduce methods are loaded (using Go plugin package) at run-time from .so files. When any changes are made to the mr folder, the the Map-Reduce plugin needs to be rebuilt (go build -buildmode=plugin ../mrapps/wc.go).

  • Shared storage: Ideally, Map-Reduce works on multiple machines. However, such a system requires a shared file system like GFS. Therefore, for this assignment we will be running the Map-Reduce in a single machine itself.

  • Concurrency: The RPC server (at the master) needs to be a concurrent one.

  • Waiting: As different workers might have different speeds, some of the workers have to wait (before they can start the reduce phase). In such cases try to avoid busy waiting conditions.

  • Timeout: The master does not know the reason for the delay of a worker. Therefore, the master has a timeout of 10 seconds after which the task needs to be reassigned.

  • Crash testing: The mrapps/crash.go can be used as a plugin to test the master for crash recovery. The plugin randomly terminates the map-reduce processes.

  • Partial writes: To avoid the occurences of partial writes due to crashes, the workers can write the output to a temporary file and then atomically rename to file according to the convention. The ioutil.TempFile can be used to create a temporary file and os.Rename can be used to atomically rename the file.

Performance analysis

The performance of our Map-Reduce depends on the number of map workers and reduce workers.

  1. Number of map workers: Run the word-count application with increasing number of map workers while the number of reduce workers is constant. Keep the number of reduce workers as 5 and vary the number of map workers as 5, 10, 15, 20, 50, and 100. State your observation and reason out why does the system behave in the manner observed.

  2. Number of reduce workers: Repeat the question, this time keep the number of map workers constant (set it to 10) and vary the number of reduce workers as 5, 10, 15, 20, 50, and 100. State your observation and reason out why does the system behave in the manner observed.

  3. Various configurations: In this question, you need to vary the number of map and reduce workers together. Let (m,r) represent the number of map and reduce workers respectively. Run the code for the following configurations: (1,1), (5,1), (1,5), (5,5), (20,5), (5, 20), (20, 20), (50, 20), (50, 50), (100, 20), and (100, 50). State and reason out your observations. Which configuration gives the maximum performance and why?

  4. Repeat the third question for the indexer application.

Submission

You need to submit your code to the Google classroom assignment. You need to zip together your mr/master.go, mr/worker.go, mr/RPC.go , and a report. The report should contain the answers to the questions mentioned in performance analysis section. You can use graphs and charts to illustrate your findings. Support you findings with appropriate reasoning.