Python urllib2.urlopen() bug

urllib2.urlopen() complains about "temporary failure in name resolution". It seems to be the bug described in this link:

http://bugs.python.org/issue1208304

I tried the quick fix described in this link:

http://helpful.nfshost.com/index.php/Python_usage_notes/Networking_stuff

It didn't work for me. Maybe it's my Python (2.6.5) or Ubuntu (10.04 LTS). Anyway, on another server (Ubuntu 11.10 / Python 2.7) the problem seems to be gone. FYI.
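If the failure is transient, simply retrying can sometimes paper over it. Below is a minimal retry sketch of my own — it is not the fix from the link above, and the `retry` name, attempt count, and delay are just illustrative:

```python
import time

def retry(func, attempts=3, delay=1.0):
    """Call func(); on failure, retry a few times with a short pause.

    A generic workaround sketch for transient errors such as
    'temporary failure in name resolution'. You could wrap
    urllib2.urlopen(url) in it.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except IOError as e:  # urllib2 raises URLError, an IOError subclass
            last_error = e
            time.sleep(delay)
    raise last_error

# Usage (Python 2):
#   html = retry(lambda: urllib2.urlopen("http://example.com").read())
```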

Pagerank in Hadoop

How Pagerank works:

For a node A, its PageRank score is computed as:


pr(A) = (1-d)/N + d * (pr(A1)/L(A1) + pr(A2)/L(A2) + ...)

A1, A2, … are the nodes linking to A (its incoming neighbors), L(A1) is the number of A1's outgoing links, N is the total number of nodes, and d is the damping factor (typically 0.85).
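To make the formula concrete, here is a toy iteration in plain Python on a 4-node graph. The node names, d = 0.85, and the fixed iteration count are just for illustration:

```python
# Toy PageRank iteration; the graph maps each node to its outgoing edges
# (the same representation assumed for the input below).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85
N = len(graph)
pr = {node: 1.0 / N for node in graph}  # initial score 1/N for every node

for _ in range(50):  # iterate until scores stabilize
    new_pr = {node: (1 - d) / N for node in graph}
    for node, out_edges in graph.items():
        share = pr[node] / len(out_edges)  # pr(A)/L(A)
        for neighbor in out_edges:
            new_pr[neighbor] += d * share
    pr = new_pr
```

Since every node here has outgoing edges, the scores keep summing to 1, and C (linked by A, B, and D) ends up with the highest score.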

Assume that in the input graph, each node is listed with its outgoing edges (this is how I assume the LinkedIn data is stored; I will try both representations and see the difference). What we have to know before the computation:

  1. N, the number of all nodes
  2. Pr(A), each node’s initial score
  3. each node's outgoing edges (nodes)

Input:

Node_A Pr(A) A1 A2 ...

Mapper:

output:

A1 p Pr(A)/L(A)
A2 p Pr(A)/L(A)
..
A d A1
A d A2
...

'p' marks a PageRank contribution sent to that node (e.g. Pr(A)/L(A) goes to A1)

'd' marks a graph-structure record: there's an edge from A to A1

Reducer:

Since all of a node's incoming contributions are grouped at the same reducer, the reducer updates the node's PageRank score and restores its graph structure:


if 'p': sum += value                    # accumulate incoming Pr(Ai)/L(Ai)
if 'd': add value to A's outgoing edges
finally: Pr(A) = (1-d)/N + d * sum

and output:

Node_A Pr(A) A1 A2 ...
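The reducer step can be sketched similarly. I assume here that d and N are known to the job (e.g. passed in via configuration), and that streaming delivers all records for one node contiguously after the sort phase:

```python
import sys
from itertools import groupby

D = 0.85  # damping factor (assumed)
N = 4     # total node count; assumed known, e.g. from a job parameter

def reduce_node(node, records):
    """records: iterable of ('p', value) / ('d', neighbor) pairs for one node."""
    total = 0.0
    edges = []
    for tag, value in records:
        if tag == "p":
            total += float(value)    # sum incoming Pr(Ai)/L(Ai)
        else:                        # tag == 'd'
            edges.append(value)      # rebuild the outgoing edge list
    score = (1 - D) / N + D * total  # full PageRank update
    return "%s %s %s" % (node, score, " ".join(edges))

if __name__ == "__main__":
    rows = (line.strip().split("\t") for line in sys.stdin if line.strip())
    for node, group in groupby(rows, key=lambda r: r[0]):
        print(reduce_node(node, ((tag, val) for _, tag, val in group)))
```

The output line has the same `Node_A Pr(A) A1 A2 ...` shape as the input, so the job can be run iteratively until the scores converge.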