Project 3

Project Description

This project is going to be completed in teams of 2-3 persons.

First, you are going to select a project that you find interesting and that complies with the following requirements:

  • Use at least two different Azure services.
  • Have a load that requires the use of more than two servers.
  • Elastically grow depending on the load.

Grading

Your project grade will be based on the quality of your report, on the usefulness of the system you’ve built, on the extent to which your design is a good fit for the problem you’re solving, and on the quality of the code submitted.

Specification

The students will define the project during week 11. Each group needs to choose a slot to discuss the topic with the TAs and the professor. At the beginning of week 12, the students need to send a project description with the milestones to be accomplished.

During Dead Week, demos of the project will be presented.

Week by Week

 

Week 13: IoT and Stream Processing

Topics to be covered

  • Internet of Things (IoT)
  • Stream Processing

References

Workshop: Apache Storm

The students are going to use Apache Storm on Azure.

The students are going to:

  • Implement the topology
  • Implement the spout and the bolts
  • Test it with different inputs.

This is done for two applications (a minimal topology sketch follows the list):

  • Threshold calculation (Michael G. Noll)
    • Instant threshold checks if the value of a field has exceeded the threshold value.
    • Time series threshold checks if the value of a field has exceeded the threshold value for a given time window.
  • Trending topic
    • Performing rolling counts of incoming data points (sliding window analysis). A typical use case for rolling counts is identifying trending topics in a user community. (DrDobbs)
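
To make the expected structure concrete, here is a minimal sketch of an instant-threshold topology in Java. The class names (SensorSpout, InstantThresholdBolt), the field names, and the threshold value are illustrative assumptions, not required interfaces; the time-series threshold and trending-topic variants would replace the bolt logic with per-window state instead of a single-value check.

    // Illustrative sketch only (org.apache.storm package names, Storm 1.x+; older HDInsight
    // Storm clusters use backtype.storm instead). SensorSpout is a toy stand-in for a real
    // data source; the actual spout and bolt implementations are up to each group.
    import java.util.Map;
    import java.util.Random;
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class ThresholdTopology {

        // Toy spout: emits random readings with fields ("sensorId", "value").
        public static class SensorSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final Random random = new Random();

            @Override
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values("sensor-1", random.nextDouble() * 200.0));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sensorId", "value"));
            }
        }

        // Instant threshold: forwards a reading only when it exceeds the configured threshold.
        public static class InstantThresholdBolt extends BaseBasicBolt {
            private final double threshold;

            public InstantThresholdBolt(double threshold) {
                this.threshold = threshold;
            }

            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                double value = input.getDoubleByField("value");
                if (value > threshold) {
                    collector.emit(new Values(input.getStringByField("sensorId"), value));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sensorId", "value"));
            }
        }

        public static void main(String[] args) throws Exception {
            // Topology wiring: one spout feeding two instances of the threshold bolt.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sensors", new SensorSpout(), 1);
            builder.setBolt("instant-threshold", new InstantThresholdBolt(100.0), 2)
                   .shuffleGrouping("sensors");

            Config conf = new Config();
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology("threshold-topology", conf, builder.createTopology());
        }
    }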

References for the workshop

Week 12: Resiliency and Failure Tolerance

Topics to be covered

  • Resiliency
  • Failure tolerance
  • Incremental Deployment
  • Software Upgrades

This could be a class suitable for an Azure developer talk.

References

Workshop: Recovery and Replication

Intention: learn about recovery and replication in real-world deployments.

Description

Have the students set up the recovery and replication systems for their project, using the disaster recovery service from Azure (Azure Site Recovery).

Monitor the system and force a virtual machine down to observe the effect on the service and how quickly recovery can be performed.

Week 11: Cloud Computing Use Cases

Topics to be covered

  • Use cases for large-scale Cloud Computing
    • Discuss public cases of companies using Azure services

References

Workshop

Implement simple applications on top of Azure services.

Ask the students to build representative applications by following the Azure tutorials.

Additionally, the students are going to implement the same applications as in week 6, but using Apache Spark instead of the MapReduce infrastructure, and briefly discuss the differences between Spark and MapReduce.
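
As an illustration, if one of the week 6 applications was a word count (used here only as an assumed example), a Spark version in Java could look roughly like the sketch below; the application name and the input/output paths are placeholders. Note how the map, shuffle, and reduce steps that MapReduce spreads across separate mapper and reducer classes collapse into a few RDD transformations.

    // Illustrative Spark word count in Java; paths and app name are placeholders.
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("workshop-wordcount");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile(args[0]);   // e.g. a wasb:// path on HDInsight
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.toLowerCase().split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);           // this one call covers MapReduce's shuffle + reduce
            counts.saveAsTextFile(args[1]);

            sc.stop();
        }
    }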

Intention: give the students an idea of what services they can use for their projects, so they can imagine better and more interesting applications of the system.

 

Project 2

Project Description

This project has an individual component and a component completed in teams of 2-3 persons.

For this project, you will design, implement, and thoroughly test the runtime system for Google MapReduce.

The project will be done partly in the class workshops and partly outside of class.

Grading

Your project grade will be based on the quality of your report, on the usefulness of the system you’ve built, on the extent to which your design is a good fit for the problem you’re solving, and the quality of the code submitted.

Specification

This project consists of the following modules:

  • Homework – Individual
  • Applications on top of MapReduce
  • MapReduce Distributed File System on top of Azure Services
  • MapReduce Master
  • Communication Patterns
  • Worker Creation
  • MapReduce Workers
  • Completed MapReduce Runtime

Libraries and tools that may be handy

Azure HDInsight, Azure Blob storage, ZMQ, Hadoop MapReduce, ssh, wget, scp, Google Protobuf

Week by Week

References

Week 10: Scalability, performance characterization and benchmarking

Topics

  • Performance debugging
  • Scalability
  • Performance Characterization
  • Benchmarking

References

Workshop: Integration

Help students with issues in the assignment. The students will integrate all the parts implemented in the previous weeks.

Intention: discuss with the students how the assignment is going, and expose different techniques for debugging distributed systems and measuring performance.

The workshop is going to be mostly driven by the students.

Week 9: Resource Management

Topics

  • Automated provisioning
  • Load balancing
  • Scheduling
  • Elastic systems

References

Workshop: Map and Reduce functionality

Definition

Implement the functionality required for distributing the computation, running the handlers, and storing the results.

Intention: create the base code for the map and reduce functionality. The students should learn how to handle resources in the cloud.

Specification

Implement both the mapper and reducer code, using the DFS base code created in week 8. Your MapReduce implementation should be able to do the following (a minimal sketch of the map-side logic follows the list):
  • Have the master distribute the binaries for both the Map and Reduce phases.
  • Execute both the mapper and the reducer code on any worker.
  • After executing the map phase, sort the mapper output in place and store it locally (assume the data fits in RAM).
  • Store in the master the information needed to fetch the mapper output for the Reduce phase.
  • Store the final results in Azure Blob storage; you should be able to use this data as input for a pipelined MapReduce computation.
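
A minimal sketch of the map-side logic described above, in Java. MapFunction, the string key/value types, and the tab-separated region files are assumptions for illustration only, not a required interface; how binaries are shipped and how region locations are reported to the master is up to each group.

    // Sketch only: applies the map function, sorts in memory (data assumed to fit in RAM),
    // partitions into R regions by hashing the key, and writes each region locally.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class MapWorker {

        /** User-supplied map function: one input record -> zero or more (key, value) pairs. */
        public interface MapFunction {
            List<Map.Entry<String, String>> map(String record);
        }

        /**
         * Runs the map phase on one input split and returns the local paths of the
         * R intermediate region files, to be reported back to the master.
         */
        public static List<Path> runMapTask(List<String> inputSplit, MapFunction mapFn,
                                            int numReducers, Path workDir) throws IOException {
            // 1. Apply the map function and collect all intermediate pairs in memory.
            List<Map.Entry<String, String>> pairs = new ArrayList<>();
            for (String record : inputSplit) {
                pairs.addAll(mapFn.map(record));
            }

            // 2. Sort in place by key.
            pairs.sort(Map.Entry.comparingByKey());

            // 3. Partition into R regions by hash(key) mod R and write each region locally.
            List<List<Map.Entry<String, String>>> regions = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) regions.add(new ArrayList<>());
            for (Map.Entry<String, String> pair : pairs) {
                int r = Math.floorMod(pair.getKey().hashCode(), numReducers);
                regions.get(r).add(pair);
            }

            List<Path> regionFiles = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) {
                Path file = workDir.resolve("region-" + r + ".txt");
                List<String> lines = new ArrayList<>();
                for (Map.Entry<String, String> pair : regions.get(r)) {
                    lines.add(pair.getKey() + "\t" + pair.getValue());
                }
                Files.write(file, lines, StandardCharsets.UTF_8);
                regionFiles.add(file);
            }
            return regionFiles;  // locations (and sizes) go to the master for the reduce phase
        }
    }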

Week 8: Filesystems and Data Storage

Topics

  • Distributed File Systems (Dynamo, Haystack, BigTable)
  • NoSQL
  • Azure Blob Storage (involve an Azure developer with a guest lecture)

References

Workshop: Master Functionality

Intention: teach how to use the distributed filesystems in Azure, and make the students think about the requirements of the framework. Develop the base code for the Master implementation; create the handlers, interfaces, and scoreboard required for the Master.

Description

  • Design the base interface and functionality for the MapReduce DFS (Distributed File System) for moving and copying data between the map and reduce phases and to the final result.
  • Implement the function that shards the input data and distributes it among the M available resources (a sketch follows this list).
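
One possible shape for the sharding function, as a sketch under the simplifying assumption that the input is a list of records that fits in memory; how each shard is then shipped to its worker (Azure Blob storage, scp, etc.) is part of the design.

    // Sketch: splits the input records into M contiguous shards of near-equal size;
    // shard i is intended for worker/mapper i.
    import java.util.ArrayList;
    import java.util.List;

    public class InputSharder {

        public static <T> List<List<T>> shard(List<T> records, int m) {
            List<List<T>> shards = new ArrayList<>();
            int n = records.size();
            for (int i = 0; i < m; i++) {
                // Integer arithmetic spreads any remainder evenly across the shards.
                int from = (int) ((long) n * i / m);
                int to = (int) ((long) n * (i + 1) / m);
                shards.add(new ArrayList<>(records.subList(from, to)));
            }
            return shards;
        }
    }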

Specification

Using Azure Blob storage and HDInsight, implement the required interface and functionality for the MapReduce runtime.

The system should be able to:

  • Distribute and give access to the input files to the mappers.
  • Distribute the <key,value> pairs to the corresponding reducer.

Using the functionality implemented in the previous week, distribute the data from the master to the workers. Additionally, test sending the temporary data between two workers, similar to what will happen between the map and the reduce phase.

Extra

Additionally, you should measure the difference between accessing the information locally and accessing it through the Azure Blob interface, using C++.
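
The measurement can be as simple as timing both reads. The sketch below uses the Azure Storage Blob SDK for Java (azure-storage-blob v12) rather than C++ purely to keep these notes in one language; the connection string, container, blob, and local path are placeholders, and downloadContent() requires a reasonably recent SDK version. The C++ version follows the same idea.

    // Rough timing sketch: read the same data from the local disk and from Azure Blob storage.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import com.azure.storage.blob.BlobClient;
    import com.azure.storage.blob.BlobClientBuilder;

    public class LocalVsBlobTiming {
        public static void main(String[] args) throws Exception {
            // 1. Time a local read.
            long t0 = System.nanoTime();
            byte[] local = Files.readAllBytes(Paths.get("/data/input-shard-0.txt"));
            long localNanos = System.nanoTime() - t0;

            // 2. Time the same data read through the Azure Blob interface.
            BlobClient blob = new BlobClientBuilder()
                    .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                    .containerName("mapreduce-input")
                    .blobName("input-shard-0.txt")
                    .buildClient();
            long t1 = System.nanoTime();
            byte[] remote = blob.downloadContent().toBytes();
            long remoteNanos = System.nanoTime() - t1;

            System.out.printf("local: %.2f ms (%d bytes), blob: %.2f ms (%d bytes)%n",
                    localNanos / 1e6, local.length, remoteNanos / 1e6, remote.length);
        }
    }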

Week 7: Virtualization basics

Topics

  • Virtualization, hypervisor
  • VM management example
  • RPC
  • Functional Debugging in distributed systems

References

References for the workshop

Workshop and assignment

Description

Design and implement the MapReduce master in Azure. Develop the base code for the Master implementation. Create the handlers, interfaces, and scoreboard required for the Master.

Intention: familiarize the students with the IaaS services provided by Azure and set up the environment for the project’s coding sections. Familiarize the students with the library and with how processes are started remotely in distributed systems.

Specification

Using Azure Linux Virtual Machines, you are going to implement the Master node of the MapReduce runtime.

First, you need to create a pool of resources, using either the resource manager or the CLI. One of the virtual machines is going to run the master code and the others are going to run the worker code (explained afterward).

Second, create a Virtual Network that is going to connect all the Virtual Machines in your system. Then install the required libraries on the virtual machines in the available pool of resources.

Implement the required data structures for the Master, as described in the MapReduce paper:

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks). The master is the conduit through which the location of intermediate file regions is propagated from map tasks to reduce tasks. Therefore, for each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.

Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” (2004)
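
A fairly direct translation of that passage into Java might look like the sketch below; the class and field names are suggestions only, not a prescribed interface.

    // Sketch of the master's bookkeeping, following the MapReduce paper: per-task state,
    // the worker assigned to each non-idle task, and, for completed map tasks, the
    // locations and sizes of the R intermediate file regions.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MasterState {

        public enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

        public static class FileRegion {
            public final String location;  // e.g. worker address + local path, or a blob URI
            public final long sizeBytes;
            public FileRegion(String location, long sizeBytes) {
                this.location = location;
                this.sizeBytes = sizeBytes;
            }
        }

        public static class Task {
            public TaskState state = TaskState.IDLE;
            public String workerId;                      // set while the task is non-idle
            // For completed map tasks only: the R intermediate regions it produced.
            public final List<FileRegion> intermediateRegions = new ArrayList<>();
        }

        // Scoreboard: taskId -> task record, kept separately for map and reduce tasks.
        public final Map<Integer, Task> mapTasks = new HashMap<>();
        public final Map<Integer, Task> reduceTasks = new HashMap<>();

        /** Called as map tasks complete; this information is later pushed to reducers. */
        public void completeMapTask(int mapTaskId, List<FileRegion> regions) {
            Task task = mapTasks.get(mapTaskId);
            task.state = TaskState.COMPLETED;
            task.intermediateRegions.addAll(regions);
        }
    }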

Once the data structures and handlers are implemented, exercise an empty handler in a worker that writes “Hello gatech” into a log file and responds back to the master, using RPC (remote procedure call). For this workshop, this is the entire implementation of the worker.
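
Since ZMQ is already in the suggested tool list, one possible (not prescribed) way to implement that round trip is a REQ/REP socket pair using the JeroMQ bindings; the port, log file name, and message format below are placeholders for whatever RPC mechanism you actually choose.

    // Worker-side sketch using JeroMQ (org.zeromq:jeromq). On a request from the master,
    // the worker appends "Hello gatech" to a log file and replies with an acknowledgement.
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    public class HelloWorker {
        public static void main(String[] args) throws Exception {
            try (ZContext context = new ZContext()) {
                ZMQ.Socket socket = context.createSocket(SocketType.REP);
                socket.bind("tcp://*:5555");                     // port is arbitrary

                while (!Thread.currentThread().isInterrupted()) {
                    String request = socket.recvStr();           // blocks until the master calls
                    Files.write(Paths.get("worker.log"),
                            "Hello gatech\n".getBytes(StandardCharsets.UTF_8),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    socket.send("ack: " + request);              // respond back to the master
                }
            }
        }
    }

The master side would create a matching REQ socket, connect it to tcp://<worker-ip>:5555, send a request, and block on the reply.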

Implementation Details

  • Each key is stored in a blob, with one value per key (see the sketch below).
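
Concretely, with the Azure Storage Blob SDK for Java that convention could be implemented as in the sketch below; the container name and the UTF-8 string encoding of values are assumptions, not requirements.

    // Sketch: store one (key, value) pair as a single blob whose name is the key.
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import com.azure.storage.blob.BlobContainerClient;
    import com.azure.storage.blob.BlobContainerClientBuilder;

    public class KeyValueBlobStore {
        private final BlobContainerClient container;

        public KeyValueBlobStore(String connectionString, String containerName) {
            this.container = new BlobContainerClientBuilder()
                    .connectionString(connectionString)
                    .containerName(containerName)
                    .buildClient();
        }

        /** Writes (or overwrites) the blob named after the key with the given value. */
        public void put(String key, String value) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            container.getBlobClient(key)
                     .upload(new ByteArrayInputStream(bytes), bytes.length, true);
        }
    }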