How Pagerank works:
for node(A), its pagerank score is computed as:
pr(A) = (1-d)/N + d (pr(A1)/L(A1)+pr(A2)/L(A2)+...)
A1,A2,… are node A’s neighbors (incoming nodes). L(A1) is number of A1’s neighbors. N is number of nodes.
Assume in input graph for each node you have its outgoing edges (this is the way I assume Linkedin data. I will try both and see the difference). What we have to know before the computation:
- N, the number of all nodes
- Pr(A), each node’s initial score
- each Node’s outgoing edges(nodes)
Input:
Node_A Pr(A) A1 A2 ...
Mapper:
output:
A1 p Pr(A)/L(A)
A2 p Pr(A)/L(A)
..
A d A1
A d A2
...
‘p’ means pagerank score that goes to node A2
‘d’ means there’s an edge from A to A1
Reducer:
since a node’s all incoming nodes are grouped in the same reducer, updates its pr score and graph structure:
if 'p': pr(A1) += score (A1)
if 'd': add A1 to A's outgoing edges
and output:
Node_A Pr(A) A1 A2 ...