Sorting in Linux

Background of the story: recently I needed to run a random walk on a search click graph. The idea is similar to PageRank, but the algorithm works on a bipartite graph: on one side are all the queries, and on the other all the URLs. The two sides are connected via user clicks, and the edges can be weighted or unweighted.
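For the unfamiliar, here is the shape of a single walk step on such a graph. This is a toy sketch with made-up data, not the production code:

import random

# Toy bipartite click graph: query -> {url: click count}, plus the reverse map.
q2u = {'dog show': {'wkc.org/bis': 5, 'akc.org': 2}}
u2q = {'wkc.org/bis': {'dog show': 5}, 'akc.org': {'dog show': 2}}

def step(edges, node):
  # One weighted step: pick a neighbor with probability proportional to clicks.
  nbrs, weights = zip(*edges[node].items())
  r = random.uniform(0, sum(weights))
  for nbr, w in zip(nbrs, weights):
    r -= w
    if r <= 0:
      return nbr
  return nbrs[-1]

# The walk alternates sides: query -> URL -> query -> ...
url = step(q2u, 'dog show')
query = step(u2q, url)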

The problem: I had no fancy infrastructure to throw at this. I was given data of (cookie, query, URL-clicked) and had to aggregate it myself (basically discard the cookies): 5000 files of about 50MB each.

My attempt: apparently this is a practical application of merge sort, the only approach I could think of that works across many large files. First I concatenated every 50 of the 50MB partitions into one 2.5GB file, producing 50 such files. Then I brilliantly sorted each 2.5GB file locally with ‘sort -k2,3 -t$'\t'’. Done overnight. Awesome. Then I wrote a small Python script to merge-sort all of them. The idea is simple: use a priority queue over all the open file handles and always read the next line from whichever handle currently holds the smallest key. If the key (query, URL) is the same as the previous one, increment the current count; otherwise output the key and count and reset both. Pretty straightforward. Well, it turned out not to give me a correctly merged result. After some debugging, the root cause is that Linux sort and Python sort compare strings differently (at least with the default comparison criteria).
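For reference, here is roughly what that merge script looks like. This is a minimal sketch, not my original code; it assumes tab-separated (cookie, query, url) lines and hypothetical file names like part-00.sorted:

import glob
import heapq

def keys(fh):
  # Each line is tab-separated (cookie, query, url); drop the cookie
  # and use (query, url) as the merge key.
  for line in fh:
    cookie, query, url = line.rstrip('\n').split('\t')
    yield query, url

files = [open(fn) for fn in glob.glob('part-*.sorted')]

prev, count = None, 0
# heapq.merge does the k-way merge, keeping one head element per file.
for key in heapq.merge(*[keys(fh) for fh in files]):
  if key == prev:
    count += 1
  else:
    if prev is not None:
      print('%s\t%s\t%d' % (prev[0], prev[1], count))
    prev, count = key, 1
if prev is not None:
  print('%s\t%s\t%d' % (prev[0], prev[1], count))

The catch: heapq.merge assumes every input file is sorted under Python's own string comparison, which is exactly the assumption that broke.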

E.g., the first few lines from the Linux sort output:

1) 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 http://www.westminsterkennelclub.org/history/bis/ad.html?refresh=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 5
2) 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 http://www.westminsterkennelclub.org/history/bis/ad.html?refresh=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 5
3) 00000000000000speed test jsperf.com/test-speed-for

And the first few lines from the Python sort:

1) ! arxi mytelevizia.com/pirveli-arxi.html 1
2) ! http://howtomakecandleseasy.com/freelesson.html http://www.misstutu.com/freelesson.html 1
3) ! http://www.roblox.com/marine-corps-base-hawaii-kaneohe-bay-place?id=132251748 http://www.roblox.com/usmc-marine-corps-air-station-kaneohe-bay-hawaii-place?id=129108403 1

Python compares strings character by character by their code-point values, which for this ASCII data amounts to a byte-wise lexicographic sort; that's why lines starting with ‘!’ (0x21) come before the digits.

Linux sort behaves differently because GNU sort honors the locale (LC_COLLATE). Under a typical UTF-8 locale such as en_US.UTF-8, collation largely ignores punctuation and whitespace at the primary level, which explains what I saw in the file: “000” followed by “$0.00” followed by another “000”. The sort man page's own advice applies here: set LC_ALL=C to get the traditional byte-value order, which matches Python's comparison and makes the merge work.
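You can reproduce the difference inside Python itself (this assumes the en_US.UTF-8 locale is installed on your system):

import locale

words = ['! arxi', '$0.00', '000 000 0000 phone calls', '00000001']

# Python's default: compare by code point, i.e. plain byte order for ASCII.
print(sorted(words))

# Locale-aware collation, the same rules GNU sort applies by default.
locale.setlocale(locale.LC_COLLATE, 'en_US.UTF-8')
print(sorted(words, key=locale.strxfrm))

The two printed orders differ, which is the whole bug in miniature.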

P.S. Try this: make a file of the following 3 lines:

000 000 0000 phone calls
$0 : 00000000, at : 416f0000, v0 : 416f0000, v1 : 42332e20
00000001

and run sort on it, with and without LC_ALL=C.

Multi-dimensional array slicing in Python

Python has nice list-slicing properties. See https://stackoverflow.com/questions/509211/pythons-slice-notation

But it doesn't work on a list of lists. For example, given

l = [[1,2,3],[4,5,6],[7,8,9]]

suppose we would like to grab the 1st and 3rd columns of l with the following:

l[:][0,2]

It won't work (TypeError: list indices must be integers, not tuple). Note that l[:] is just a shallow copy of l, so this is still indexing a plain list with a tuple.

Plain Python lists simply don't support slicing a multi-dimensional list this way. You can convert it to a NumPy array to do so:

np.array(l)[:,[0,2]]

Nice and easy. For more, see http://ilan.schnell-web.net/prog/slicing/
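If pulling in NumPy just for this feels heavy, a plain list comprehension does the same column grab. A quick sketch using the l from above:

l = [[1,2,3],[4,5,6],[7,8,9]]

# Grab columns 0 and 2 from every row, no NumPy needed.
cols = [[row[0], row[2]] for row in l]
print(cols)  # [[1, 3], [4, 6], [7, 9]]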

python sqlite3 vs. node-sqlite3: performance test

I have recently been learning Node.js and tried porting a small website from Flask/Python to Node.js. Due to memory constraints I ended up using SQLite as my production database, although everyone says never to use SQLite in production XD.

One thing I noticed is that node-sqlite3 (the most active module from the Node.js community for SQLite access) seems a little sluggish compared to Python, especially when doing pagination. Really? Everything in Node.js is async; shouldn't it be more responsive? Baffled and curious, I did the following comparison: run multiple SELECTs against a small database file (223 megabytes) and time them. Below is my code for each:

Node.js

var sqlite3 = require('sqlite3').verbose();
var util = require('util');
var dbfn = './db.sqlite';

var db = new sqlite3.Database(dbfn, sqlite3.OPEN_READONLY, function(err) {
  if (err) {
    console.log(err);
  } else {
    var perpage = 10, max = 500, table = 'data2012', kw = 'sa';
    for (var offset = 1; offset < max; offset += perpage) {
      // %% is util.format's escape for a literal %, so this produces
      // ... where xxx like 'sa%' ...
      var stmt = util.format("select * from %s where xxx like '%s%%' limit %d offset %d",
                             table, kw, perpage, offset);
      db.all(stmt);  // no callback: rows are fetched and discarded
    }
  }
});

Python:

import sqlite3
import os
import sys

DATABASE = './db.sqlite'
db = None
try:
  db = sqlite3.connect(DATABASE)
except sqlite3.Error:
  sys.stderr.write(os.getcwd())

table = 'data2011'
kw = ('sa%',)
perpage = 10
max = 500
for offset in xrange(1, max, perpage):
  # table/limit/offset are interpolated; the like pattern is bound as a parameter
  stmt = 'select * from %s where xxx like ? limit %d offset %d' % (table, perpage, offset)
  rs = db.execute(stmt, kw)
  rs.fetchall()
db.close()

Ignore the table/column names; the point is that the table has more than enough rows to complete the loop. Each version was run 3 times, and here are the averages:

  • Node-sqlite3: 2.945s
  • Python: 1.542s

It looks like the Python module is a lot faster. To eliminate any possible uncertainty I increased the loop bound (max) to 1000. Again I ran each 3 times for an average:

  • Node-sqlite3: 10.085s
  • Python: 6.040s

Python is still faster, but the gap is shrinking. How about 3000 iterations?

  • Node-sqlite3: 1m5s
  • Python: 56.154s

5000?

  • Node-sqlite3: 2m23s
  • Python: 2m28s

So the observation so far is that python-sqlite3 beats node-sqlite3 when the offset is small (remember, SQLite doesn't really support OFFSET; it reads all the rows up to the offset and discards the unneeded ones). As the offset gets large, both become equally slow. Maybe Node.js overall has higher throughput, but its sqlite3 module is less satisfying, at least as of today.
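That OFFSET behavior is also why the usual workaround for deep pagination is keyset pagination: remember the last rowid you saw and filter on it instead of skipping rows. A sketch reusing the table/column names from the benchmark above:

import sqlite3

db = sqlite3.connect('./db.sqlite')
last_rowid = 0
while True:
  # Seek directly past the previous page instead of making SQLite
  # scan and discard `offset` rows on every query.
  rows = db.execute(
      'select rowid, * from data2012 where rowid > ? '
      'and xxx like ? order by rowid limit 10',
      (last_rowid, 'sa%')).fetchall()
  if not rows:
    break
  last_rowid = rows[-1][0]
db.close()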

BTW, one trick to optimize SQLite access performance is to VACUUM the database. This made a huge difference on my database file.
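From Python that is just:

import sqlite3

# isolation_level=None puts the connection in autocommit mode;
# VACUUM cannot run inside a transaction.
db = sqlite3.connect('./db.sqlite', isolation_level=None)
db.execute('VACUUM')  # rebuilds the file, dropping free pages and defragmenting
db.close()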