Topics
- MapReduce
- Dryad
- Hadoop
- Pig Latin
References
- MapReduce: Simplified Data Processing on Large Clusters
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- Pig Latin: A Not-So-Foreign Language for Data Processing
- Hadoop
Workshop: MapReduce
Description
Implement well-known applications on top of MapReduce (sorting, counting, etc.).
Specification
Using Azure HDInsight, you will implement the applications listed below with the Python MapReduce Azure tool.
Word Counter
Implement the mapper and reducer to obtain the histogram of the words in a file.
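One possible shape for the word counter is sketched here, assuming a streaming-style job in which the framework sorts mapper output by key before it reaches the reducer. The names wc_mapper, wc_reducer and run_local_wordcount are hypothetical, as is the choice to lowercase words and to treat runs of letters, digits and apostrophes as words. The sketch simulates the shuffle locally with sorted() and does not use any Azure-specific API.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumption: a "word" is a run of letters, digits or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def wc_mapper(lines):
    """Emit (word, 1) for every word found in the input lines."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield word, 1


def wc_reducer(sorted_pairs):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local_wordcount(lines):
    """Simulate the shuffle phase locally by sorting the mapper output."""
    return dict(wc_reducer(sorted(wc_mapper(lines))))
```

In a real job the mapper and reducer would typically be two separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; the local runner is only for checking the logic on a small file.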
Grep
Implement the mapper and reducer to obtain the number of times the words “gatech” and “burdell” appear in a file.
https://www.gatechdining.com/images/Spring%20Break%202016_tcm251-103643.pdf
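A minimal sketch of the grep counter follows. It assumes case-insensitive substring matching (so "GaTech" and "gatech.edu" both count), which the specification does not state; the names TERMS, grep_mapper and grep_reducer are hypothetical. The mapper emits one partial count per term per line, and the reducer adds the partial counts together.

```python
from collections import defaultdict

# The two search terms named in the assignment.
TERMS = ("gatech", "burdell")


def grep_mapper(lines, terms=TERMS):
    """Emit (term, hits) for each term that occurs in a line."""
    for line in lines:
        lowered = line.lower()  # assumption: matching ignores case
        for term in terms:
            hits = lowered.count(term)  # non-overlapping substring matches
            if hits:
                yield term, hits


def grep_reducer(pairs):
    """Add up the partial counts emitted by the mappers."""
    totals = defaultdict(int)
    for term, hits in pairs:
        totals[term] += hits
    return dict(totals)
```

Because there are only two keys, at most two reducers receive any data, so most of the parallelism in this application comes from the map phase.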
Sort
Implement the mapper and reducer to sort a list of numbers in a file. Each line of the file has a different number.
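The sort can lean on the framework's own ordering of keys within each reducer. The sketch below assumes the numbers are integers and that a range partitioner sends smaller keys to lower-numbered reducers, so that concatenating the reducer outputs gives a total order. The names sort_mapper, range_partition and run_local_sort, and the boundaries argument, are hypothetical.

```python
import bisect


def sort_mapper(lines):
    """Emit each number as a key; blank lines are skipped."""
    for line in lines:
        text = line.strip()
        if text:
            yield int(text)  # assumption: the file holds integers


def range_partition(key, boundaries):
    """Return the reducer index for key; boundaries must be ascending."""
    return bisect.bisect_right(boundaries, key)


def run_local_sort(lines, boundaries):
    """Simulate range partitioning plus the per-reducer sort."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key in sort_mapper(lines):
        buckets[range_partition(key, boundaries)].append(key)
    result = []
    for bucket in buckets:
        # Stands in for the framework sorting keys inside one reducer.
        result.extend(sorted(bucket))
    return result
```

For example, run_local_sort(lines, [4, 8]) uses three simulated reducers holding keys below 4, keys from 4 up to but not including 8, and keys of 8 or more. With a plain hash partitioner each reducer's output would be sorted, but the concatenation would not be.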
Reverse Web-Link Graph
For a given list of web pages, the system should obtain, for each page, the list of pages that link to it. The final output has the form <target, list(sources)>, where target is the URL to which a hyperlink points and sources are the pages where that hyperlink was found. Implement the mapper and reducer required to execute this task.
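The reverse web-link graph can be sketched as follows, assuming the input has already been reduced to one link per line in the form "source target" separated by whitespace. That input format is an assumption, since the specification does not define it, and the names link_mapper, link_reducer and run_local_links are hypothetical. The mapper swaps each pair so the target becomes the key, and the reducer collects the distinct sources for each target.

```python
from itertools import groupby
from operator import itemgetter


def link_mapper(lines):
    """Turn each 'source target' line into a (target, source) pair."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # silently skip malformed lines
            source, target = parts
            yield target, source


def link_reducer(sorted_pairs):
    """Group by target; the input must already be sorted by target."""
    for target, group in groupby(sorted_pairs, key=itemgetter(0)):
        # A set removes duplicate links from the same source page.
        yield target, sorted({source for _, source in group})


def run_local_links(lines):
    """Simulate the shuffle phase locally by sorting the mapper output."""
    return dict(link_reducer(sorted(link_mapper(lines))))
```

If the input is raw HTML rather than link pairs, the mapper would also need to extract the hyperlinks from each page before emitting them.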
Additionally, for the previous application, first test it using the default configuration, then create a better sharding algorithm based on the application. Test iteratively until performance improves.
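When comparing sharding schemes, it helps to measure how evenly each one spreads keys across reducers before running a full job. The sketch below is one way to do that under stated assumptions: hash_partition stands in for a default hash-based scheme, host_partition is a purely illustrative alternative that groups target URLs by host, and imbalance is a hypothetical metric (largest reducer load divided by the mean load). None of these names come from the Azure tooling, and neither scheme is claimed to be the better one.

```python
import zlib
from collections import Counter
from urllib.parse import urlsplit


def hash_partition(key, num_reducers):
    """Stand-in for a default scheme: hash the whole key."""
    # crc32 is used because Python's built-in hash() of strings
    # changes between processes.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def host_partition(key, num_reducers):
    """Illustrative alternative: keep all URLs of one host together."""
    host = urlsplit(key).netloc or key
    return zlib.crc32(host.encode("utf-8")) % num_reducers


def imbalance(keys, partition, num_reducers):
    """Largest reducer load divided by the mean load (1.0 is ideal)."""
    loads = Counter(partition(key, num_reducers) for key in keys)
    mean = len(keys) / num_reducers
    return max(loads.values()) / mean
```

Running imbalance over the keys of a sample input for each candidate scheme gives a quick number to track between iterations, alongside the measured job time.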
Homework
Questions to help students become familiar with the MapReduce paper in preparation for the project. Examples of possible questions:
- What main functions must the master support in order to distribute work among the workers?
- What happens when there are slow machines in the system?
- More to come when the system implementation is defined.
Intention: clarify the first-week questions that are likely to arise when programming the assignment.