“java.io.IOException: No locks available” error when using mvn

I tried to compile a Java library with mvn but got this error:

[WARNING] Failed to read tracking file /adminhome/yandong/.m2/repository/org/apache/maven/plugins/maven-resources-plugin/2.6/_remote.repositories
java.io.IOException: No locks available
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method)
at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90)
at sun.nio.ch.FileChannelImpl.lock(FileChannelImpl.java:1021)

 

It turned out to be a problem with the NFS lock service, since the directory is mounted over the network. The Linux distribution at work is customized, and there’s no way to install the service as some Google results suggested. Solution: point mvn’s local repository to a directory on local disk, and the problem is gone. To be specific, in the file apache-maven-${version}/conf/settings.xml, uncomment and update this line:

<localRepository>some_local_directory</localRepository>

and re-run mvn.

Sorting in Linux

Background of the story: recently I needed to run a random walk on a search click graph. The idea is similar to PageRank, but the algorithm works on a bipartite graph: on the left-hand side are all the queries, and on the other side are the URLs. The two sides are connected via user clicks, and the edges can be weighted or unweighted.

The problem: engineers are not very capable here. I was given data of (cookie, query, URL-clicked) and I had to aggregate the data myself (basically discard cookies). The data comes as 5000 files of 50M each.

My attempt: apparently it’s a practical application of merge sort, the only approach I can think of that works on multiple large files. So I first concatenated every 50 of the 50M partitions into one 2.5G file, and I generated 50 such files. Then I brilliantly sorted each 2.5G file locally using ‘sort -k2,3 -t\tab’. Done overnight. Awesome. Then I wrote a small Python script to merge-sort all of them. The idea is simple: use a priority queue to manage all open file handles, and always read the next line from the handle whose current line is smallest. If the key (query, URL) is the same, increment the current count; otherwise, output and reset the key and counter. Pretty straightforward. Well, it turned out it didn’t give me the merged result. After some debugging, the root cause was that Linux sort and Python sort order strings differently (at least under their default comparison rules).
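The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the original script: the function name merge_counts and the sample data are made up, and it assumes every input stream holds one "query<TAB>URL" key per line and is already sorted in Python's own string order.

```python
import heapq
from itertools import groupby


def merge_counts(sorted_streams):
    """Merge pre-sorted iterables of key lines and count repeated keys.

    Assumption: every stream is sorted in the same order Python uses
    to compare strings. If the inputs were sorted differently, equal
    keys may not end up adjacent and the counts come out split.
    """
    # heapq.merge keeps one pending line per stream in a heap and
    # always yields the smallest one next.
    merged = heapq.merge(*sorted_streams)
    # Equal keys are now adjacent, so groupby can count each run.
    for key, group in groupby(merged):
        yield key, sum(1 for _ in group)


# Hypothetical in-memory stand-ins for two sorted files.
a = ["cats\tcats.com", "dogs\tdogs.com"]
b = ["cats\tcats.com", "cats\tzoo.org"]
result = list(merge_counts([a, b]))
for key, count in result:
    print(key, count)
```

Open file handles can be passed in place of the lists, since a file object iterates over its lines. The sketch only works when the per-file sort order matches Python's comparison order, which is exactly the assumption that broke in my case.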

E.g. the first few lines from Linux sort:

1) 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 http://www.westminsterkennelclub.org/history/bis/ad.html?refresh=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 5
2) 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 http://www.westminsterkennelclub.org/history/bis/ad.html?refresh=00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 5
3) 00000000000000speed test jsperf.com/test-speed-for

The first few lines from Python sort:

1) ! arxi mytelevizia.com/pirveli-arxi.html 1
2) ! http://howtomakecandleseasy.com/freelesson.html http://www.misstutu.com/freelesson.html 1
3) ! http://www.roblox.com/marine-corps-base-hawaii-kaneohe-bay-place?id=132251748 http://www.roblox.com/usmc-marine-corps-air-station-kaneohe-bay-hawaii-place?id=129108403 1

Python compares strings character by character using their code point (ASCII) values, which is why lines starting with “!” come before lines starting with “0”.

I don’t know exactly how Linux sort works. In the sorted file I saw a line starting with “000”, followed by one starting with “$0.00”, followed by another “000”. It could be that certain characters are ignored. I’ll update this post when it’s clearer to me.

p.s. try this: make a file of following 3 lines:

000 000 0000 phone calls
$0 : 00000000, at : 416f0000, v0 : 416f0000, v1 : 42332e20
00000001

and run sort.
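For comparison, here is a small sketch of what Python does with the same three lines. Python orders strings by code point, so '$' (36) sorts before '0' (48) and a space (32) sorts before '0'. GNU sort, as far as I understand it, follows the collation rules of the current locale (LC_COLLATE), and many UTF-8 locales give punctuation and spaces little or no weight, which would explain the interleaving. The exact result depends on the locale in use; running LC_ALL=C sort should give plain byte order, the same as below.

```python
lines = [
    "000 000 0000 phone calls",
    "$0 : 00000000, at : 416f0000, v0 : 416f0000, v1 : 42332e20",
    "00000001",
]

# sorted() compares strings code point by code point, with no
# locale rules applied: '$' < '0', and ' ' < '0'.
python_order = sorted(lines)
for line in python_order:
    print(line)
```

Here the “$0” line comes first, then “000 000 0000 phone calls”, then “00000001”.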

The ultimate failure of a once Internet giant

My first-hand experience so far (been here exactly 1 year):

1. Slow in response (email, in person)

2. Slow in action (e.g. it took 3 days to change a directory permission so I could read it)

3. Incapable managers

4. Never know how to read emails (an immediate reply asks questions that were clearly answered in my last email)

5. Global distribution of teams (no accountability: they can totally disappear for days without making any progress, and when they come back, no. 4 happens)

Multi-dimensional array slicing in Python

Python has a nice slice notation for lists. See https://stackoverflow.com/questions/509211/pythons-slice-notation

But it won’t work on a list of lists. For example:

l = [[1,2,3],[4,5,6],[7,8,9]]

and we would like to grab the 1st and 3rd columns of l with the following:

l[:][0,2]

It won’t work (TypeError: list indices must be integers, not tuple).

Python lists don’t support multi-dimensional slicing: l[:] just returns a copy of the outer list, and [0,2] then tries to index that list with the tuple (0, 2). You have to convert it to a NumPy array to do so:

import numpy as np
np.array(l)[:,[0,2]]

Nice and easy. For more, see http://ilan.schnell-web.net/prog/slicing/
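If pulling in NumPy is not an option, a nested list comprehension gives the same columns in plain Python. This is just a sketch; the names cols and picked are made up for the example.

```python
l = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Column indices to keep: 1st and 3rd columns.
cols = (0, 2)

# For every row, pick only the wanted columns.
picked = [[row[i] for i in cols] for row in l]
print(picked)  # [[1, 3], [4, 6], [7, 9]]
```

It is less compact than the NumPy version and returns nested lists rather than an array, but it has no dependencies.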

Immediately-Invoked Function Expression (IIFE)

Scope in JavaScript (for variables declared with var) is function-bound, meaning variables defined in a function are visible only inside that function, and each time a function is invoked it creates a new execution context. This is different from C, which confines the life span of variables to a block. Consider the following code:

var a = 1;

function f1() {
  var b = 2;
  console.log('inside f1 a:'+a);
}

function f2() {
  var c = 3;
  return function(d) {
    return d+c;
  }
}

{
var e = 5;
}

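f1();                  // prints "inside f1 a:1"

console.log(a);        // 1
// console.log(b);     // would throw ReferenceError: b is only visible inside f1
console.log(e);        // 5: a bare block does not create a new scope for var
console.log(f2()(4));  // 7: the returned closure still sees c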

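The variables hidden from the global context are the ones declared inside functions: b in f1, and c and d in f2. The variable e is still global even though it is declared inside a block. In C every block confines its variables, so JavaScript is looser in the sense that a block without a function declaration doesn’t really do much.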

Now on to the Immediately-Invoked Function Expression: the motivation is that you want to execute a piece of code without contaminating the global context while still being able to use the global variables. You do it by creating an anonymous function and immediately invoking it:

(function () {
 ...
})();

This is a very common pattern in JavaScript and is seen everywhere, especially for locking in values from the current execution context. More reading can be found at http://benalman.com/news/2010/11/immediately-invoked-function-expression/

How fast node responds to concurrent http requests

Disclaimer: people have done this a million times but I just wanted to see it myself.

One of the merits of node.js that people keep talking about is that it scales really well under pressure, when a large number of users are using the service at the same time. Let’s examine the validity of this claim today:

I’ve built a small website using Node.js + sqlite, with some in-house caching. The test request triggers 4 SQL SELECT queries. The database file is fairly lightweight, with fewer than 1 million records, and the queried columns are indexed as well. But in theory none of this should matter: once the same URL has been visited, all those DB results are cached and no more DB access is needed.

I used ‘ab’ for the concurrency tests, and below are the results:

[Figure: ab results for concurrency = 1 to 100]

[Figure: ab results for number of requests from 100 to 1000]

Apparently node.js handles multiple requests pretty well until concurrency exceeds 15 on my hardware.