
Been reading a bit about Gaussian processes and machine learning…

For a slightly related problem of matching buyers with sellers, or matching a large number of people on a dating site, the brute force method would make NxN comparisons.  You have N points [seller/buyer or date-seeker etc] with d attributes such as age, sex, weight, height, income, location, interests, preferences… or price, quantity, product, location, expiry etc – these define the dimensions of the space in which the N points live.

If your data is 1-dimensional, each item having only one attribute such as price or age, you'd simply sort on that attribute and do the match in a greedy fashion, with complexity roughly O(N log N) – a minimal sketch follows.
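For concreteness, here is a minimal sketch of that 1-D greedy match.  It's illustrative only – the 'price' attribute and the pair-off-in-order policy are assumptions, not the only possible greedy rule.

# Minimal sketch of a 1-D greedy match: sort both sides on the single
# attribute, then pair them off in order.  The 'price' attribute and the
# pair-in-order policy are illustrative assumptions.
def greedy_match_1d(buyers, sellers, key=lambda p: p["price"]):
    buyers = sorted(buyers, key=key)      # O(N log N)
    sellers = sorted(sellers, key=key)    # O(N log N)
    return list(zip(buyers, sellers))     # i-th cheapest buyer with i-th cheapest seller

# usage
buyers = [{"price": p} for p in (10, 22, 35)]
sellers = [{"price": p} for p in (11, 20, 40)]
print(greedy_match_1d(buyers, sellers))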

In many problems d is around 5 [in biotech micro-assay arrays, where there are thousands of gene probes, d could be as large as 60k], which means that the volume of the space grows incredibly large – if you partition the space into K parts over each dimension, that's K^d subvolumes to match against each other, which gets huge very quickly: even K=10 bins over d=5 dimensions already gives 10^5 = 100,000 cells.

There are normally continuity assumptions – if you take samples, you can get some predictive value from them.  A sample tells us something about nearby points, and this is really the basis of machine learning.  Another aspect of this is a theorem from compressed sensing and random matrices, proved by Terry Tao and others, which says roughly that in high dimensions, random samples will be very effective at exploring the space.  This could explain why evolution is so universally effective at finding ultra-optimised solutions and overcoming local maxima.

Back to the case of finding best matches of N points in a d-dimensional domain.  Let's assume we have some, probably nonlinear, function f = fitness(P1, P2) defined on any two points.  One approach would be to take a smaller random sample from the N points, small enough that we can brute-force compare each sampled point against every other.  Then we sort on f and pick the best pairs.  This gives us a lot of information about the space… because for any P1, P2 that match well, there will in all likelihood be surrounding points that also match well.

So using this geometric way of looking at the problem of finding best matches, I implemented a simple prototype in Python which does the following – a rough sketch follows the list:

  • Take S sample pairs P1, P2 and evaluate f12 = fitness(P1, P2)
  • Sort on f12, and take the best matches
  • Make a small volume V1 around P1, V2 around P2
  • Take the best matches from all points Pa in V1 and Pb in V2
  • Brute force any remaining points
  • Post-process by swapping P1 of one pair with P2 of another pair
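Here is a rough sketch of the first five steps – it is not the original prototype.  The toy fitness(), the sample size S, the neighbourhood radius r and the fraction of "best" pairs kept are all assumed, illustrative choices; the post-processing swap step is sketched further down.

# A rough sketch of the sampled-pairs matcher above -- not the original
# prototype.  Points are NumPy vectors; fitness() here is a toy stand-in for
# the real nonlinear function; S, r and keep_frac are illustrative parameters.
import numpy as np

def fitness(p1, p2):
    # toy fitness: higher for nearby points
    return -np.linalg.norm(p1 - p2)

def sampled_pair_match(points, S=2000, r=0.2, keep_frac=0.1, seed=0):
    n = len(points)
    unmatched = set(range(n))
    matches = []
    rng = np.random.default_rng(seed)

    # 1. sample S random pairs and evaluate f12 = fitness(P1, P2) for each
    pairs = [(int(a), int(b)) for a, b in rng.integers(0, n, size=(S, 2)) if a != b]
    # 2. sort on f12 and keep only the best-scoring sampled pairs
    pairs.sort(key=lambda ab: fitness(points[ab[0]], points[ab[1]]), reverse=True)

    for a, b in pairs[: int(len(pairs) * keep_frac)]:
        # 3. small volumes V1 around P1 and V2 around P2 (balls of radius r)
        v1 = [i for i in unmatched if np.linalg.norm(points[i] - points[a]) < r]
        v2 = [i for i in unmatched if np.linalg.norm(points[i] - points[b]) < r]
        # 4. greedily take the best matches between the two neighbourhoods
        for i in v1:
            if i not in unmatched:
                continue
            candidates = [j for j in v2 if j in unmatched and j != i]
            if not candidates:
                continue
            best = max(candidates, key=lambda j: fitness(points[i], points[j]))
            matches.append((i, best))
            unmatched.discard(i)
            unmatched.discard(best)

    # 5. brute force any remaining points
    rest = list(unmatched)
    while len(rest) >= 2:
        i = rest.pop()
        best = max(rest, key=lambda j: fitness(points[i], points[j]))
        rest.remove(best)
        matches.append((i, best))
    return matches

# usage: match 1000 random points in d = 5 dimensions
pts = np.random.default_rng(1).random((1000, 5))
print(len(sampled_pair_match(pts)))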

The results were not as good as I hoped, but there are lots of improvements to make.  I measured complexity as the total number of times the fitness() function is called, and found that for N below roughly 200, brute force is more effective.  Brute force on 1000 points takes a long time to compute.  The sampled-pairs method seems to use roughly O(N log N) fitness calls, and runs comfortably on samples 5 to 10k in size.  Compared against brute force, this initial implementation gave around 80% of the maximum possible global fitness score.

So it's promising and I'll play with it and see if it can be improved in practical ways.  One of the nice things about it is that the fitness function can be very nonlinear, so the hotspots of high-fitness volumes are not something you can hard-code a routine for… but random sampling finds these very well, and requires no special handling.

I used a naive approach to post-processing – improving the match by swapping randomly chosen pair elements, and keeping the swap if it results in a better match.  This is basically a simple form of genetic optimisation or simulated annealing, so it's pretty inefficient as a first implementation.
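A minimal sketch of that swap pass, assuming a fixed iteration budget and the same fitness() signature and points array as in the sketch above:

# Naive post-processing pass: pick two matched pairs at random, try swapping
# their second elements, and keep the swap only if the combined fitness
# improves.  The iteration budget is an arbitrary assumption.
import random

def improve_by_swaps(matches, points, fitness, iters=10000):
    matches = list(matches)
    for _ in range(iters):
        x, y = random.sample(range(len(matches)), 2)
        a1, b1 = matches[x]
        a2, b2 = matches[y]
        before = fitness(points[a1], points[b1]) + fitness(points[a2], points[b2])
        after = fitness(points[a1], points[b2]) + fitness(points[a2], points[b1])
        if after > before:
            matches[x] = (a1, b2)
            matches[y] = (a2, b1)
    return matches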

Interesting

The topterm util returns the top-ranking item from Google, from the command line.

Here's a sample session:

unix_host$ python topterm.py
Enter Search Term [or blank to quit]: justgord
TITLE : gordon anderson (justgord) on Twitter
URL   : http://twitter.com/justgord
Enter Search Term [or blank to quit]: terry tao
TITLE : What's new
URL   : http://terrytao.wordpress.com/
Enter Search Term [or blank to quit]: quantblog vfuncs
TITLE : vfuncs – functional coding with array 'verbs' in C « quantblog
URL   : https://quantblog.wordpress.com/2008/09/24/vfuncs-functional-coding-with-array-verbs-in-c/
Enter Search Term [or blank to quit]:

Verbose Description

Recently I was doing some testing of a machine learning algorithm, which looks for relevance patterns in a graph of text nodes.  I was doing a small amount of web crawling for this, and as a quick test of my output needed a tool that could get the top ranking web site given a search term – a command line google-it was needed!

I decided to hack this up in Python. One of the things that put me off the language initially is that whitespace is used to indicate nested program logic.  I guess I’m just an old-skool bracket lover at heart.   Putting that bias aside, Python is not a bad language at all, and I found myself writing surprisingly small programs in it, even if they don’t have the thought-as-code feel of Scheme or Clojure.

For this command line app, I had hoped to reuse a Google API: ideally I'd request a URL with curl or wget and get back some nice JSON output which could be trivially traversed.   Unfortunately, I could find no such API, so I had to resort to screen-scraping the output to get the first entry.  The curl utility does the heavy lifting, with BeautifulSoup to traverse the HTML.
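A rough reconstruction of the idea, not the posted source: shell out to curl for the results page and let BeautifulSoup pull out the first title and link.  The query URL, user-agent string and the h3/parent-anchor selectors are assumptions – Google's markup changes over time, so any scraper like this is fragile.

# Rough reconstruction of the topterm idea: fetch the results page with curl,
# then parse out the first title and link with BeautifulSoup.  The query URL,
# user-agent and the h3 / parent-<a> selectors are assumptions.
import subprocess
import urllib.parse
from bs4 import BeautifulSoup

def top_result(term):
    url = "https://www.google.com/search?q=" + urllib.parse.quote_plus(term)
    html = subprocess.run(
        ["curl", "-s", "-A", "Mozilla/5.0", url],
        capture_output=True, text=True).stdout
    soup = BeautifulSoup(html, "html.parser")
    for h3 in soup.find_all("h3"):            # result titles usually sit in <h3>
        link = h3.find_parent("a")
        if link and link.get("href", "").startswith("http"):
            return h3.get_text(strip=True), link["href"]
    return None, None

if __name__ == "__main__":
    while True:
        term = input("Enter Search Term [or blank to quit]: ").strip()
        if not term:
            break
        title, url = top_result(term)
        print("TITLE :", title)
        print("URL   :", url)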

So now I have a handy little program that asks for search terms and prints the URL and title of the first result according to Google.  It would be nice if something like this were an option to lynx.

I've posted the Python project source [less than a page] on Google Code – here

Great talk, superb intuitive demo of constrained nth-order Gaussian distributions… and a free-as-in-beer textbook by David MacKay of Cambridge.

Most enjoyable… I'm off to read the book – this will take a year off your PhD :]

Links –

Video Talk

Gaussian Processes for Machine Learning

also by MacKay – Information Theory, Inference, and Learning Algorithms