Task 1: Analyzing a Graph with Hadoop/Java
Writing your first simple Hadoop program
Imagine that your boss gives you a large dataset which contains an entire email communication network from a popular social network site. The network is organized as a directed graph where each node represents an email address and the edge between two nodes (e.g., Address A and Address B) has a weight stating how many times A wrote to B. The boss is very interested in finding out the people most frequently contacted by others. Your task is to write a MapReduce program in Java to report the largest weight among all the weighted inbound edges for each node in the graph.
First, go over the Hadoop word count tutorial to get familiar with Hadoop and some Java basics. You will be able to complete this task with only some knowledge about Java. You should have already loaded two graph files into HDFS and loaded into your HDFS file system in your vm. Each file stores a list of edges as tab-separated-values. Each line represents a single edge consisting of three columns: (source node ID, target node ID, edge weight), each of which is separated by a tab (\t). Node IDs are positive integers, and weights are also positive integers. Edges are ordered randomly.
| 1 | src tgt weight | 
Your code should accept two arguments upon running. The first argument (args[0]) will be a path for the input graph file on HDFS (e.g., /user/cse6242/graph1.tsv), and the second argument (args[1]) will be a path for output directory on HDFS (e.g., /user/cse6242/task1output1). The default output mechanism of Hadoop will create multiple files on the output directory such as part-00000, part-00001, which will be merged and downloaded to a local directory by the supplied run script. Please use the run scripts for your convenience.
The format of the output should be such that each line represents a node ID and the largest weight among all its inbound edges. The ID and the largest weight must be separated by a tab (\t). Lines do not need be sorted. The following example result is computed based on the toy graph above. Please exclude nodes that do not have incoming edges (e.g., those email addresses that never get contacted by anybody).
For the toy graph above, the output is as follows.1
2
351   3
151  79
130  10
Test your program on graph1.tsv and graph2.tsv. To demonstrate how your MapReduce procedure works, use the inline example above, trace the input and output of your map and reduce functions. That is, given the above graph as the input, describe the input and output of your map and reduce function(s) and how the functions transform/process the data (provide examples whenever appropriate). Write down your answers in description.pdf. You are welcome to explain your answers using a combination of text and images.
Designing a MapReduce algorithm (and thinking in MapReduce)
Design a MapReduce algorithm that accomplishes the following task: for each node i in a directed graph G, find that node’s in neighbors’ in neighbors. Node u is considered to be an in neighbor of node v if there is a directed edge pointing from node u to node v. In other words, your task is find every “2-hop” neighbor of every node i in the graph G, where such a neighbor is connected by at least one directed path of length 2 that reaches node i.
NOTE: You only need to submit pseudo code, a brief explanation of your algorithm, and trace of input and output of your map and reduce functions for the graph given below. No coding is required.
Task 2: Analyzing a Large Graph with Spark/Scala
Please go over this Spark word count tutorial to get more background about Spark/Scala.
Goal
Your task is to calculate the gross accumulated node weights for each node in graph1.tsv and graph2.tsv from edge weights using Spark and Scala. Assume the graph to be a representation of a network flow where each edge represents the number of items flowing from source to target. The gross accumulated node weight for a node is now defined as the number of items produced/consumed by the node.
When loading the edges, parse the edge weights using the toInt method and filter out (ignore) all edges whose edge weights equal 1 i.e., only consider edges whose edge weights do not equal 1.
Your Scala program should handle the same two arguments as in Task 1 for input and output from the console, and should generate the same formatted output file on the supplied output directory (tab-separated-file). Please note that the default Spark saveastextfile method uses a saving format that is different from Hadoop’s, so you need to format the result before saving to file (Tip: use map and mkString). The result doesn’t need to be sorted.
Task 3: Analyzing Large Amount of Data with Pig on AWS
You will try out PIG (http://pig.apache.org) for processing n-gram data on Amazon Web Services (AWS). This is a fairly simple task, and in practice you may be able to tackle this using commodity computers (e.g., consumer-grade laptops or desktops). However, we would like you to use this exercise to learn and solve it using distributed computing on Amazon EC2, and gain experience (very helpful for your future career in research or industry), so you are prepared to tackle more complex problems.
The services you will primarily be using are Amazon S3 storage, Amazon Elastic Cloud Computing (EC2) virtual servers in the cloud, and Amazon Elastic MapReduce (EMR) managed Hadoop framework.
This task will ideally use up only a very small fraction of your $100 credit. AWS allows you to use up to 20 instances in total (that means 1 master instance and up to 19 core instances) without filling out a “limit request form”. For this request, you should not exceed this quota of 20 instances. You can learn about these instance types, their specs, and pricing at Instance Types.
Please read the AWS Setup Guidelines provided to set up your AWS account. In this task, you will use subsets of the Google books n-grams dataset (full dataset for reference), on which you will perform some analysis. An ‘n -gram’ is a phrase with n words; the full n-gram dataset lists n-grams present in the books on books.google.com along with some statistics.
You will perform your analysis on two custom datasets, extracted from the Google books bigrams (2-grams), that we have prepared for you: a small one and a large one. To help you evaluate the correctness of your output, we have uploaded the output for the small dataset on T-Square (the link is here ).
VERY IMPORTANT : Both these datasets are in the US-Standard (US-East) region. Using machines in other regions for computation would incur data transfer charges. Hence, set your region to US East (N. Virginia) in the beginning (not Oregon which is the default). This is extremely important otherwise your code may not work and you may be charged extra.
Goal
For each unique bigram, compute its average number of appearances per book, with at least 50 occurrences for each recorded year. For the above example, the results will be:1
2I am       (342 + 211) / (90 + 10) = 5.53
very cool  (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628
Output the 10 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab-separated format, sorted in descending order, with at least 50 occurrences for each recorded year. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be:1
2I am       5.53
very cool  3.40049628
You will solve this problem by writing a PIG script on Amazon EC2 and save the output.
You can use the interactive PIG shell provided by EMR to perform this task from the command line (grunt). In this case, you can copy the commands you used for this task into a single file to have the PIG script and the output from the command line into a separate file. Please see this for how to use PIG shell. Also, you can upload the script and create a task on your cluster.
Task 4: Analyzing a Large Graph using Hadoop service onMicrosoft Azure
Goal
Your task is to write a MapReduce program to calculate the degree distribution of a graph. Note that this task shares some similarities with Task 1 (e.g., both are analyzing graphs). Task 1 can be completed using your own computer. This task is to be completed using Azure. We recommend that you first complete Task 1.
You will use data files small.tsv(~75MB) and large.tsv(~3GB), for this question. Each file stores a list of edges as tab-separated-values. Each line represents a single edge consisting of two columns: (Node A, Node B), each of which is separated by a tab. Node IDs are positive integers and the rows are already sorted by Node A.1
2
3
4
5
6
7src  tgt
51   130
51   194
51   299
130  200
151  230
151  194
Your code should accept two arguments upon running. The first argument (args[0]) will be a path for the input graph file, and the second argument (args[1]) will be a path for output directory. The default output mechanism of Hadoop will create multiple files on the output directory such as part-00000, part-00001, which will have to be merged and downloaded to a local directory.
The format of the output should be as follows. Each line represents the degree and its frequency. The degree and the frequency of the degree must be separated by a tab(\t), and lines don’t have to be sorted. The following example result is computed based on the toy graph above.
Hint: One way of doing it is using mapreduce procedure twice. First for finding the degree of each node and second for calculating the frequency of each degree. You will have to make appropriate changes in the skeleton code for this.