Hadoop Assignment: CA675-TF-IDF

Requirement

Tasks:

Using MapReduce, carry out the following tasks:

  1. Acquire the top 250,000 posts by viewcount (see notes)
  2. Using Pig or MapReduce, extract, transform and load the data as applicable
  3. Using MapReduce, calculate the per-user TF-IDF (just submit the top 10 terms for each user)
  4. Bonus: use Elastic MapReduce to execute one or more of these tasks (if so, provide logs / screenshots)
  5. Using Hive and/or MapReduce, get:

    1. The top 10 posts by score
    2. The top 10 users by post score
    3. The number of distinct users who used the word ‘java’ in one of their posts
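
To make the three queries concrete, here is a minimal in-memory sketch in Python. It is not the Hive or MapReduce solution the assignment asks for, only an illustration of what each query should return. The record fields (`id`, `user_id`, `score`, `body`) and the sample posts are hypothetical, and "top users by post score" is assumed to mean the sum of a user's post scores.

```python
import re
from collections import defaultdict

# Hypothetical post records; the field names are assumptions for illustration.
posts = [
    {"id": 1, "user_id": 10, "score": 50, "body": "How do I sort a list in Java?"},
    {"id": 2, "user_id": 11, "score": 70, "body": "Python list comprehension basics"},
    {"id": 3, "user_id": 10, "score": 25, "body": "JavaScript closures explained"},
    {"id": 4, "user_id": 12, "score": 90, "body": "java generics and wildcards"},
]

def top_posts_by_score(posts, n=10):
    """Return the n highest-scoring posts."""
    return sorted(posts, key=lambda p: p["score"], reverse=True)[:n]

def top_users_by_post_score(posts, n=10):
    """Return (user_id, total_score) pairs for the n users with the highest summed score."""
    totals = defaultdict(int)
    for p in posts:
        totals[p["user_id"]] += p["score"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

def distinct_users_mentioning(posts, word):
    """Count distinct users with at least one post containing `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len({p["user_id"] for p in posts if pattern.search(p["body"])})
```

The whole-word, case-insensitive match is a design choice: it counts "Java" but not "JavaScript". Whether the assignment intends that is not stated, so a plain substring match would be an equally defensible reading.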

TF-IDF

The TF-IDF algorithm is used to calculate the relative frequency of a word in a document, as compared to the overall frequency of that word in a collection of documents. This allows you to discover the distinctive words for a particular user or document.

The formulas are:

TF(t) = Number of times t appears in the document / Number of words in the document

IDF(t) = log_e(Total number of documents / Number of documents containing t)

The TF-IDF score of the term t is the product of the two: TFIDF(t) = TF(t) × IDF(t).
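
The formulas translate almost directly into code. The sketch below is a plain Python illustration, not a MapReduce job. It assumes that each user's posts have already been concatenated and tokenized into one "document" per user, and the user names and tokens are made up.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of term in the document divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs):
    """Natural log of (number of documents / number of documents containing term).

    Only call this for terms that occur somewhere in docs, otherwise it divides by zero.
    """
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing)

def top_terms(doc_id, docs, k=10):
    """Return the k (term, score) pairs with the highest TF-IDF for one document."""
    tokens = docs[doc_id]
    scores = {t: tf(t, tokens) * idf(t, docs) for t in set(tokens)}
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical corpus: one token list per user.
docs = {
    "alice": "java hadoop java mapreduce".split(),
    "bob": "python hadoop spark".split(),
    "carol": "hadoop hive pig hive".split(),
}
```

With this corpus, `top_terms("alice", docs, 2)` ranks "java" first and "mapreduce" second, while "hadoop" scores 0 because it appears in every document and log_e(3/3) = 0.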

Summary

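Computing TF-IDF with Hadoop is fairly expensive in running time: a lot of intermediate data has to be written to disk, and a single Hadoop job is not enough to solve the problem, so several jobs have to be chained. Switching to Spark should be considerably more efficient.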
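
To illustrate why one job is not enough, here is a rough Python simulation of a three-job chain. It is a sketch under stated assumptions, not the submitted Hadoop code. `run_job` is a hypothetical stand-in for a MapReduce job (map, shuffle by key, reduce), and the input is assumed to be `(user, text)` pairs. In real Hadoop, the output of each job would be written to HDFS and read back by the next one, which is the overhead mentioned above.

```python
import math
from itertools import groupby

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, shuffle (sort and group by key), reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=lambda kv: kv[0])
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job 1: count occurrences of each term per user.
def count_mapper(rec):
    user, text = rec
    for term in text.lower().split():
        yield (user, term), 1

def count_reducer(key, values):
    yield key, sum(values)

# Job 2: attach the total number of words per user.
def total_mapper(rec):
    (user, term), n = rec
    yield user, (term, n)

def total_reducer(user, values):
    total = sum(n for _, n in values)
    for term, n in values:
        yield (user, term), (n, total)

# Job 3: count the users per term (document frequency) and compute TF-IDF.
def df_mapper(rec):
    (user, term), (n, total) = rec
    yield term, (user, n, total)

def make_tfidf_reducer(num_users):
    def reducer(term, values):
        df = len(values)  # one value per user who used the term
        for user, n, total in values:
            yield (user, term), (n / total) * math.log(num_users / df)
    return reducer

def per_user_tfidf(posts):
    """Chain the three simulated jobs and return {(user, term): score}."""
    num_users = len({user for user, _ in posts})
    counts = run_job(posts, count_mapper, count_reducer)
    with_totals = run_job(counts, total_mapper, total_reducer)
    return dict(run_job(with_totals, df_mapper, make_tfidf_reducer(num_users)))

# Hypothetical input: two users, three posts.
sample_posts = [
    ("alice", "java hadoop"),
    ("alice", "java mapreduce"),
    ("bob", "python hadoop spark"),
]
scores = per_user_tfidf(sample_posts)
```

Each job needs a different grouping key (user and term, then user, then term), which is why the work cannot be folded into a single map and reduce pass. Selecting the top 10 terms per user would be one more grouping by user on top of this.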