Using MapReduce to calculate Wikipedia page rank; preventing dead-ends and spider-traps


Project Overview

In this project, the Hadoop MapReduce framework is used to calculate the PageRank of each Wikipedia page and produce an ordered list of page titles along with their ranks. The project also prevents dead-end and spider-trap situations by incorporating the teleport factor β into the matrix calculation, with β set to 0.2 here.

PageRank Output Visualization

After three iterations, Page ID = 6 was selected as an example (the black point). The plot shows the PageRank value of the selected page relative to the other pages.

Dataset Description:

  • transition.txt has the following format:
from1: to11 to12 to13 ...
from2: to21 to22 to23 ...
...

where from1 is an integer labelling a page with outgoing links and to11, to12, ... are integers labelling the pages it links to, separated by '\t'.

  • pr.txt has the following format: page_id prob, where page_id is the ID of the page and prob is the probability (current PageRank value) of that page, separated by '\t'. A small, hypothetical example of both files is shown below.
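For illustration only, here is a hypothetical four-page graph in the two formats above (the page IDs and probabilities are invented, not taken from the dataset; fields are tab-separated as described):

transition.txt:
1: 2 4
2: 1 3
3: 1
4: 3

pr.txt (uniform initial probability, 1/4 per page):
1 0.25
2 0.25
3 0.25
4 0.25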

PageRank Algorithm

PageRank is a calculation that evaluates the quality and quantity of links to a webpage to determine a relative score of that page's importance and authority. This project proposes a generalization of the PageRank algorithm based on both out-links and in-links, using an iterative algorithm as an alternative interpretation of matrix-based techniques. The algorithm can also specify, at any step, the probability that a person will continue clicking outgoing links.
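To make the iteration concrete before distributing it, here is a minimal in-memory Java sketch; the class name, matrix, and graph are invented for illustration and are not part of this repository. It applies the teleport formula introduced in the Dead-ends & Spider-traps section below to the small four-page example graph:

// In-memory sketch of the PageRank power iteration (illustrative only).
// m is a column-stochastic transition matrix: m[to][from] is the share of
// page `from`'s rank that flows to page `to`.
public class PageRankSketch {

    static double[] iterate(double[][] m, double[] pr, double beta) {
        int n = pr.length;
        double[] next = new double[n];
        for (int to = 0; to < n; to++) {
            double sum = 0.0;
            for (int from = 0; from < n; from++) {
                sum += m[to][from] * pr[from];           // matrix-vector product
            }
            next[to] = (1 - beta) * sum + beta * pr[to]; // teleport mix
        }
        return next;
    }

    public static void main(String[] args) {
        // Four-page example: 1 -> {2,4}, 2 -> {1,3}, 3 -> {1}, 4 -> {3};
        // each page splits its rank evenly over its out-links.
        double[][] m = {
            {0.0, 0.5, 1.0, 0.0},
            {0.5, 0.0, 0.0, 0.0},
            {0.0, 0.5, 0.0, 1.0},
            {0.5, 0.0, 0.0, 0.0},
        };
        double[] pr = {0.25, 0.25, 0.25, 0.25};
        for (int i = 0; i < 3; i++) {                    // three iterations
            pr = iterate(m, pr, 0.2);
        }
        for (int p = 0; p < pr.length; p++) {
            System.out.println("page " + (p + 1) + "\t" + pr[p]);
        }
    }
}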

Architecture Overview Diagram


Dead-ends & Spider-traps Explained

  • A group of pages is a spider-trap if there are no links from within the group to outside the group. The transition matrix remains stochastic, but over repeated iterations the trapped group absorbs all of the PageRank.

  • Pages with no outlinks are dead-ends for the random surfer. If dead ends are allowed, the transition matrix of the Web is no longer stochastic: some of the columns will sum to 0 rather than 1.

To avoid these problems, modify the calculation of PageRank by mixing a teleporting parameter into the transition step:

PR(N) = (1-β) * PR(N-1) * Transition Matrix + β * PR(N-1)
β is the teleport parameter, usually ranging from 0.1 to 0.2
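For example (numbers invented for illustration): with β = 0.2, a page that receives 0.5 of the rank mass through the transition matrix and held a rank of 0.25 in the previous iteration gets

PR(N) = (1 - 0.2) * 0.5 + 0.2 * 0.25 = 0.4 + 0.05 = 0.45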

Main Files:

  • UnitMultiplication.java: the mapper class. It takes an input file holding the graph data as nodes with their adjacency lists, generates records containing nodes, their PageRanks, and their adjacency lists, and finally emits key-value pairs of (page-id, prob). A condensed sketch of the mapper and reducer logic follows this list.

  • UnitSum.java: the reducer class. It receives a key (a page-id) and a list of its values (probabilities). For each value in the list: if it is a node, the node is initialized with it; if it is a PageRank contribution, it is added to the running sum, which is written out as the page's converged PageRank value.

  • Driver.java: configures and chains the jobs for the requested number of iterations.
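The following is a condensed, hypothetical sketch of the mapper/reducer logic described above, assuming Hadoop's org.apache.hadoop.mapreduce API. The class names and the pre-joined input format ("from<TAB>to1,to2,...<TAB>prob") are illustrative, not the repository's exact code. The mapper spreads (1-β) of a page's rank over its out-links and keeps β for the page itself; the reducer sums the contributions, so together they implement the formula above:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankSketchJob {

    private static final double BETA = 0.2;  // teleport parameter

    // Mapper sketch: for each out-link, emit (to-page, weighted share);
    // also emit the teleport term (from-page, BETA * prob).
    public static class MultiplicationMapper
            extends Mapper<Object, Text, Text, DoubleWritable> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // line: "from \t to1,to2,... \t prob" (hypothetical joined input)
            String[] parts = value.toString().split("\t");
            String from = parts[0];
            String[] toPages = parts[1].split(",");
            double prob = Double.parseDouble(parts[2]);
            double share = (1 - BETA) * prob / toPages.length;
            for (String to : toPages) {
                context.write(new Text(to), new DoubleWritable(share));
            }
            context.write(new Text(from), new DoubleWritable(BETA * prob));
        }
    }

    // Reducer sketch: the new PageRank of a page is the sum of all
    // contributions routed to it, i.e.
    // PR(N) = (1-BETA) * PR(N-1) * M + BETA * PR(N-1).
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text page, Iterable<DoubleWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            context.write(page, new DoubleWritable(sum));
        }
    }
}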

Compiling and Running

hadoop com.sun.tools.javac.Main *.java  
jar cf pr.jar *.class 
hadoop jar pr.jar Driver /transition /pagerank /output 3 #3: iterate three times

Output

#args0: dir of transition.txt
#args1: dir of pr.txt
#args2: dir of the unitMultiplication result
#args3: number of iterations (make sure the code runs successfully with args3=1, then test with args3=40)

