PageRank in Hadoop

How PageRank works:

For a node A, its PageRank score is computed as:


Pr(A) = (1-d)/N + d * (Pr(A1)/L(A1) + Pr(A2)/L(A2) + ...)

A1, A2, … are node A's incoming neighbors (the nodes that link to A). L(A1) is the number of A1's outgoing links, N is the total number of nodes, and d is the damping factor (commonly 0.85).
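As a sanity check, the formula can be evaluated by hand on a tiny hypothetical graph (three nodes, d = 0.85 assumed, all scores starting at 1/N):

```python
# One application of the PageRank formula on a hypothetical 3-node graph:
# B -> A and C -> A; B has 2 outgoing links, C has 1.
d = 0.85            # damping factor (common default; an assumption here)
N = 3               # total number of nodes
pr_B, pr_C = 1/3, 1/3   # initial scores: 1/N each
L_B, L_C = 2, 1         # outgoing-link counts of A's incoming neighbors

pr_A = (1 - d) / N + d * (pr_B / L_B + pr_C / L_C)
print(round(pr_A, 4))
```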

Assume that in the input graph each node is stored with its outgoing edges (this is how I assume the LinkedIn data is laid out; I will try both representations and see the difference). What we need to know before the computation:

  1. N, the number of all nodes
  2. Pr(A), each node's initial score
  3. each node's outgoing edges (nodes)

Input:

Node_A Pr(A) A1 A2 ...

Mapper:

output:

A1 p Pr(A)/L(A)
A2 p Pr(A)/L(A)
...
A d A1
A d A2
...

'p' marks a PageRank contribution: Pr(A)/L(A) is the share of A's score sent to that neighbor (A1, A2, …)

'd' marks a graph-structure record: there is an edge from A to A1 (these records let the reducer rebuild the adjacency list)
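The mapper step above can be sketched as a Hadoop Streaming-style function (plain Python rather than the Java API, as an illustration; the 'p'/'d' tags follow the scheme above):

```python
def mapper(line):
    """Input: one record 'Node Pr out1 out2 ...'.
    Emits (key, tag, value) tuples: one 'p' record per outgoing edge
    carrying the share Pr(A)/L(A), and one 'd' record per edge to
    preserve the graph structure for the reducer."""
    parts = line.split()
    node, pr, outs = parts[0], float(parts[1]), parts[2:]
    emitted = []
    if outs:
        share = pr / len(outs)                    # Pr(A) / L(A)
        for target in outs:
            emitted.append((target, "p", share))  # score contribution
    for target in outs:
        emitted.append((node, "d", target))       # edge A -> target
    return emitted

out = mapper("A 0.25 B C")
```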

Reducer:

Since all records keyed by the same node are grouped at the same reducer, the reducer can update that node's PageRank score and re-emit its graph structure:


if 'p': sum += contribution          # accumulate incoming score shares
if 'd': add the target node to A's outgoing edges
Pr(A) = (1-d)/N + d * sum

and output:

Node_A Pr(A) A1 A2 ...
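Likewise, the reducer can be sketched in the same streaming style (d and N are assumed to be passed in, e.g. through the job configuration):

```python
def reducer(node, values, d=0.85, N=3):
    """values: ('p', contribution) and ('d', out_neighbor) records
    grouped under the same node key. Applies the damped PageRank
    update and re-emits the record 'Node Pr out1 out2 ...'."""
    total, outs = 0.0, []
    for tag, val in values:
        if tag == "p":
            total += val          # sum incoming score shares
        else:                     # 'd': rebuild the outgoing edge list
            outs.append(val)
    pr = (1 - d) / N + d * total  # PageRank update formula
    return " ".join([node, str(pr)] + outs)

line = reducer("A", [("p", 0.1), ("p", 0.2), ("d", "B"), ("d", "C")])
```

Feeding this output back in as the next iteration's input repeats the map/reduce cycle until the scores converge.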