
Some reading links –

  • HN style forum of quant posts – quant.ly
  • High Frequency Trading articles are here
  • Herb Sutter's Effective Concurrency articles from Dr. Dobb's Journal are here
  • Herb Sutter's blog – Sutter's Mill
  • Fun talk on C++ – vid here [If the goal of C++ is ‘high-performance-abstraction’ has it succeeded? ]

Rhetorical question – given that only three people on the planet truly understand the subtleties of C++… is this a language anyone should be using to build reliable software?  C++0x is nearing completion, so the language continues to grow in size.

I’ve been thoroughly enjoying Paul Zeitz’s book “The Art and Craft of Problem Solving”.

One of the early problems in the book is from the 1994 Putnam math competition, but it is surprisingly easy once you see that it can be scaled down to a much simpler question.

The problem :

Find the positive value of m such that the area enclosed by the ellipse x^2/9 + y^2=1, the x-axis, and the line y=2x/3 is equal to the area in the first quadrant enclosed by the ellipse x^2/9 + y^2=1, the y-axis, and the line y=mx.

Here’s a drawing of the areas mentioned, made in GeoGebra, with each area in a separate quadrant for comparison and an approximate area shown.  The question asks for the slope of the line through point F [or the negative of the slope of the line through Q here].

Problem as stated, with areas colored

Well, it’s a bit hard to pick the point Q [the slope -m] so that these colored regions have the same area…

… but then you realize that the whole problem becomes easy if you simply scale the ellipse back to the unit circle.

To do this, x is scaled down by a factor of 3, so areas become 1/3 of what they were [y remains the same].  So the problem now looks like this:

“]

scaled back to the unit circle : areas are x 1/3, slopes x 3

So the line that gives the same area is y=x/2.  When this is scaled back to the original ellipse, the slope gets divided by 3;  so the line we want is y=x/6.
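Spelling out the scaling step: the substitution u = x/3 maps the ellipse to the unit circle, divides areas by 3 and multiplies slopes by 3 –

\displaystyle u=\tfrac{x}{3}:\qquad \frac{x^2}{9}+y^2=1 \;\longrightarrow\; u^2+y^2=1, \qquad y=mx \;\longrightarrow\; y=3mu

Reflecting the first quadrant of the circle across the line y = u swaps the u-axis with the y-axis and turns a line of slope s into one of slope 1/s, so the slope-2 image of y = 2x/3 cuts off the same area against the u-axis as a slope-1/2 line does against the y-axis.  Hence 3m = 1/2, i.e. m = 1/6.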

I was quite surprised to see this in a Putnam, but it does show a really common motif in math, namely:

Transform to a simpler domain, solve it there, then transform the solution back.

The astute reader will have noticed that I cheated: the areas of the colored regions in the top picture are of course ~1.65, i.e. 3 × the area of the regions in the second diagram.

The areas marked were calculated by rough approximation with a polygon and the GeoGebra area function. [  The ellipse itself was drawn to fit using foci – I wasn’t sure how to edit the formula in GeoGebra to make it exact. ]

Been reading a bit about Gaussian processes and machine learning…

For a slightly related problem of matching buyers with sellers, or matching a large number of people on a dating site, the brute force method would make N×N comparisons.  You have N points [seller/buyer or date-seeker etc] with d attributes such as age, sex, weight, height, income, location, interests, preferences… or price, quantity, product, location, expiry etc – these define the dimensions of the space the N points live in.

If your data were one-dimensional, each item having only one attribute such as price or age, you’d simply sort on that and do the match in a greedy fashion, with complexity roughly O(N log N).
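For the one-dimensional case, a minimal C++ sketch (with hypothetical buyers/sellers vectors, just to illustrate the sort-and-pair idea) –

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sort both sides on the single attribute and pair them off in order;
// the two sorts dominate, so the cost is roughly O(N log N).
std::vector<std::pair<double, double> >
greedy_match_1d(std::vector<double> buyers, std::vector<double> sellers)
{
    std::sort(buyers.begin(), buyers.end());
    std::sort(sellers.begin(), sellers.end());

    std::vector<std::pair<double, double> > pairs;
    const std::size_t n = std::min(buyers.size(), sellers.size());
    for (std::size_t i = 0; i < n; ++i)
        pairs.push_back(std::make_pair(buyers[i], sellers[i]));
    return pairs;
}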

In many problems d = 5 or so [in biotech microarray assays, where there are thousands of gene probes, d could be as large as 60k], which means that the volume of the space grows incredibly large – if you partition the space into K parts along each dimension, that’s K^d subvolumes to match against each other, which gets huge very quickly.

There are normally continuity assumptions – if you take samples, you can get some predictive value from them.  A sample tells us something about nearby points, and this is really the basis of machine learning.  Another aspect of this is a theorem from compressed sensing and random matrices, proved by Terry Tao and others, which says, roughly, that in high dimensions random samples are very effective at exploring the space.  This could explain why evolution is so universally effective at finding ultra-optimised solutions and overcoming local maxima.

Back to the case of finding the best matches of N points in a d-dimensional domain.  Let’s assume we have some, probably nonlinear, function f = fitness(P1, P2) for any two points.  One approach would be to take a smaller random sample from the N points, small enough that we can brute-force compare each sampled point against every other.  Then we sort on f and pick the best ones.  This gives us a lot of information about the space… because for any P1, P2 that match well, in all likelihood there will be surrounding points that also match well.

So using this geometric way of looking at the problem of finding best matches, I implemented a simple prototype in Python which does the following [a rough sketch is given after the list] –

  • Take S sample pairs P1, P2 and evaluate f12 = fitness(P1, P2)
  • Sort on f12, and take the best matches
  • Make a small volume V1 around P1, V2 around P2
  • Take the best matches from all points Pa in V1 and Pb in V2
  • Brute force any remaining points
  • Post-process by swapping P1 of one pair with P2 of another pair
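Roughly, the core of it looks like the stripped-down C++ sketch below (my illustration, not the Python prototype itself).  It covers the random sampling, the sort on fitness and the brute-forcing of leftovers, with a toy fitness() for concreteness; the V1/V2 neighbourhood step is omitted here, and the swap post-processing is sketched further down –

#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

typedef std::array<double, 5> Point;                 // d = 5 attributes

// Toy fitness: closer points match better (the real one can be very nonlinear).
double fitness(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) s -= std::fabs(a[k] - b[k]);
    return s;
}

struct Candidate { std::size_t i, j; double f; };

std::vector<std::pair<std::size_t, std::size_t> >
sample_match(const std::vector<Point>& left, const std::vector<Point>& right,
             std::size_t samples)
{
    // 1. Take S random candidate pairs and evaluate their fitness.
    std::vector<Candidate> cand(samples);
    for (std::size_t s = 0; s < samples; ++s) {
        cand[s].i = std::rand() % left.size();
        cand[s].j = std::rand() % right.size();
        cand[s].f = fitness(left[cand[s].i], right[cand[s].j]);
    }

    // 2. Sort on fitness and greedily accept pairs whose endpoints are unused.
    std::sort(cand.begin(), cand.end(),
              [](const Candidate& a, const Candidate& b) { return a.f > b.f; });
    std::vector<bool> usedL(left.size(), false), usedR(right.size(), false);
    std::vector<std::pair<std::size_t, std::size_t> > matches;
    for (std::size_t s = 0; s < samples; ++s) {
        if (usedL[cand[s].i] || usedR[cand[s].j]) continue;
        usedL[cand[s].i] = usedR[cand[s].j] = true;
        matches.push_back(std::make_pair(cand[s].i, cand[s].j));
    }

    // 3. Brute-force any remaining points: best unused partner for each leftover.
    for (std::size_t i = 0; i < left.size(); ++i) {
        if (usedL[i]) continue;
        std::size_t best = right.size();
        double bestf = 0.0;
        for (std::size_t j = 0; j < right.size(); ++j) {
            if (usedR[j]) continue;
            double f = fitness(left[i], right[j]);
            if (best == right.size() || f > bestf) { best = j; bestf = f; }
        }
        if (best < right.size()) {
            usedR[best] = true;
            matches.push_back(std::make_pair(i, best));
        }
    }
    return matches;
}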

The results were not as good as I hoped, but there are lots of improvements to make.  Measuring complexity as the total number of times the fitness() function is called, I found that for N < ~200 brute force is more effective, while brute force on 1000 points takes a long time to compute.  The sampling-pairs method runs on samples 5 to 10k in size and seems to get results in roughly N log N time.  Compared against brute force, this initial implementation reached around 80% of the maximum possible global fitness score.

So it’s promising and I’ll play with it and see if it can be improved in practical ways.  One of the nice things about it is that the fitness function can be very nonlinear, so the hotspots of high-fitness volumes are not something you can hard-code a routine for… but random sampling finds them very well, and requires no special handling.

I used a naive approach to post-processing – improving the match by swapping randomly chosen pair elements, and keeping the swap if it results in a better match.  This is basically a simple form of genetic optimisation or simulated annealing, so it’s pretty inefficient as a first implementation.
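The post-processing step amounts to something like this sketch (again my illustration; it reuses the Point type and fitness() from the sketch above) –

#include <cstdlib>
#include <utility>
#include <vector>

// Randomly pick two matched pairs and swap their second elements,
// keeping the swap only if the combined fitness improves.
void improve_by_swaps(const std::vector<Point>& left,
                      const std::vector<Point>& right,
                      std::vector<std::pair<std::size_t, std::size_t> >& matches,
                      std::size_t iterations)
{
    for (std::size_t it = 0; it < iterations; ++it) {
        std::size_t a = std::rand() % matches.size();
        std::size_t b = std::rand() % matches.size();
        if (a == b) continue;
        double before = fitness(left[matches[a].first], right[matches[a].second])
                      + fitness(left[matches[b].first], right[matches[b].second]);
        double after  = fitness(left[matches[a].first], right[matches[b].second])
                      + fitness(left[matches[b].first], right[matches[a].second]);
        if (after > before)
            std::swap(matches[a].second, matches[b].second);
    }
}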


I recently implemented a standalone FIX message parser in C++, as I wanted to answer a few questions I had about the design of the protocol.
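For context, the core tokenizing step is just splitting tag=value fields on the SOH (0x01) delimiter – a minimal sketch, not the project code itself (real parsing also needs the data dictionary, repeating groups, and BodyLength/CheckSum validation) –

#include <cstdlib>
#include <map>
#include <string>

// Split a raw FIX message into tag -> value (repeating groups collapse here,
// which is one reason a real parser needs the dictionary).
std::map<int, std::string> parse_fix(const std::string& msg)
{
    std::map<int, std::string> fields;
    std::string::size_type pos = 0;
    while (pos < msg.size()) {
        std::string::size_type eq  = msg.find('=', pos);      // end of tag
        std::string::size_type soh = msg.find('\x01', pos);   // end of field
        if (eq == std::string::npos || soh == std::string::npos || soh < eq) break;
        int tag = std::atoi(msg.substr(pos, eq - pos).c_str());
        fields[tag] = msg.substr(eq + 1, soh - eq - 1);
        pos = soh + 1;
    }
    return fields;
}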

To define the protocol [what messages, fields and values are allowed] I use the same XML format that QuickFIX and QuickFIX/J use.  This is a more compactly rendered subset of the information provided in the fixprotocol.org XML repository.

The spec itself is most easily browsed via fixprotocol.org’s ‘Fiximate’ site, or by downloading the full PDF specification.

I did notice that QuickFIX and QuickFIX/J seem to ship with only the FIX 4.0 through FIX 4.4 definitions… After some pain wrangling XML with perl, I created a FIX50SP2 definition which should match the most recent FIX specification.

You can find the full source and the FIX 5 XML definition on the Google Code project page.

Note: BSD licence, uses GNU C++ on Linux, makefile build, depends on libxml2 for SAX2 parsing of the XML.

enjoy, and happy gnu year!

gord.

My previous blog articles describe approaches to calculating kth moments in a single pass – see compute kth central moments in one pass and variance of a sequence in one pass.

I decided to make a rough performance comparison between my approach and the Boost Accumulators API.  A reasonable test is to procedurally generate 10^7 random numbers in the range [0.0..1.0] as the input data.  We compare 3 algorithms –

  • No moments accumulated – baseline, just generates the input
  • Accumulate all 12th order moments using Boost Accumulators
  • Accumulate all 12th order moments using my vfuncs cpp code

We run on Linux, using the time command to get rough timings; the raw results are –

  • Baseline   –   1.16s
  • Boost Acc – 23.16s
  • Vfunc Acc –  2.44s

If we subtract the baseline we get a rough comparison of Boost ~22s versus vfuncs ~1.3s once the cost of generating the input is factored out.

So the vfuncs implementation is roughly an order of magnitude faster than the current v1.37.0 Boost Accumulators statistics implementation (both are single pass).

I don’t think the Boost algorithm is logically different; it’s probably more a case of template code complexity having an impact – the ~10s compile times might indicate this too.  Executable size also reflects this, differing by an order of magnitude.

(Note: using gcc 4.2.4.  I haven’t tried this on other compilers, build settings etc – they could have a profound effect.  Let me know if you see different results on e.g. Intel or VC compilers.)

Download

Download the code and project – kth_moments.tgz – from the vfuncs Google Code downloads page.  [BSD license, feel free to use]

Continuing on the same topic as my previous post, it’s nice to be able to gather all the kth order moments in a single pass.

Last time I mentioned the boost/accumulators example, but you will have noticed two issues if you use it.  Firstly, the moment<k> tag gives you the kth simple moment relative to zero, whereas we often want the kth central moment of the sequence relative to the mean.  Secondly, although Boost’s accumulator is well written, it does seem to take a while to compile [~12 seconds for code using moment<12>].
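For reference, the basic Boost Accumulators usage looks roughly like this (a minimal sketch with a toy input loop) –

#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/moment.hpp>
#include <cstdlib>
#include <iostream>

int main()
{
    using namespace boost::accumulators;

    // moment<k> accumulates the kth simple moment about zero, not the central moment.
    accumulator_set<double, stats<tag::moment<2>, tag::moment<12> > > acc;

    for (int i = 0; i < 1000; ++i)
        acc(static_cast<double>(std::rand()) / RAND_MAX);   // feed each sample

    std::cout << "2nd simple moment:  " << moment<2>(acc)  << "\n"
              << "12th simple moment: " << moment<12>(acc) << "\n";
    return 0;
}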

After some playing around I’ve got a faster, simpler approach, where the inner loop accumulates the kth powers of each element.  After you’ve run the sequence through, you can then easily extract the variance and all the kth central moments.  So in adding the more general case of kth moments, I’ve made the particular variance case simpler.  That often seems to happen in programming and math!

algebra

First a bit of math and then the code.  We want to express the kth central moment in terms of the k basic moments.

First, let’s define the basic moment as –

\displaystyle M_{n}^{j}= \sum_{i=1}^n {x}_i^{j}

We rearrange the binomial expansion –

\displaystyle nv_{n}^{k}= \sum_{i=1}^n({x}_{i}-\mu_{n})^k

\displaystyle = \sum_{i=1}^n \sum_{j=0}^k \binom{k}{j} {x}_{i}^j(-\mu_{n})^{k-j}

\displaystyle = \sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} \sum_{i=1}^n {x}_{i}^j

So we have the kth central moment given as a weighted sum of the kth simple moments –

\displaystyle v_{n}^{k} = \frac{1}{n}\sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} M_{n}^{j}

which shows that all we need to accumulate as we walk across the sequence are the simple powers {x}_{i}^{j} for j = 0 \dots k.

Notice that the variance is now handled as the special case k=2.  Likewise k=0 corresponds to n, the element count, and k=1 is the sum of the elements.

c++ impl

Here’s a basic implementation of the above expression –

Read the rest of this entry »
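The full listing is behind the link; as a rough sketch of the same idea (my illustration, not necessarily identical to the vfuncs code), the accumulator only needs to keep the power sums M^j for j = 0..K –

#include <cmath>
#include <vector>

// One-pass accumulator of the simple power sums M^j = sum_i x_i^j, j = 0..K.
// Any central moment up to order K is then recovered from the weighted
// binomial sum derived above.  Assumes at least one sample has been fed in.
template <int K>
class MomentAccumulator {
public:
    MomentAccumulator() : sums_(K + 1, 0.0) {}

    void operator()(double x) {
        double p = 1.0;                  // x^0
        for (int j = 0; j <= K; ++j) {   // accumulate x^j for j = 0..K
            sums_[j] += p;
            p *= x;
        }
    }

    // kth central moment: v^k = (1/n) * sum_j C(k,j) * (-mu)^(k-j) * M^j
    double central_moment(int k) const {
        const double n  = sums_[0];
        const double mu = sums_[1] / n;
        double acc   = 0.0;
        double binom = 1.0;                          // C(k,0)
        for (int j = 0; j <= k; ++j) {
            acc   += binom * std::pow(-mu, k - j) * sums_[j];
            binom  = binom * (k - j) / (j + 1);      // -> C(k,j+1)
        }
        return acc / n;
    }

private:
    std::vector<double> sums_;           // sums_[j] holds M^j
};

Feeding every element through operator() and then calling central_moment(2) gives the population variance; central_moment(12) gives the 12th central moment from the same single pass.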

Rearranging

I was chatting with some other quant developers the other day, as you do, and the issue of calculating variances came up.  It’s a pretty common operation, and the naive approach does two passes: one to calculate the mean, \mu_{n} , and one to gather the squares of the differences from it… but someone asked if it could be done in one pass, and of course it can, quite easily.

Recall, the population variance of a sequence \left \{x_{i} \right \} is defined as –
v_{n} = (\sum_{i=1}^{n} (x_{i}-\mu_{n} )^2) / n

We can expand this and see how the variance after n terms differs from the variance after m=n+1 terms, viz. –

nv_{n}= \sum_{i=1}^{n} (x_{i}-\mu_{m} + \mu_{m} - \mu_{n})^2
\displaystyle = \sum_{i=1}^{n} (x_{i}-\mu_{m})^2 + \sum_{i=1}^{n}(\mu_{m} - \mu_{n})^2 + 2 \sum_{i=1}^{n}(x_{i}-\mu_{m}) (\mu_{m} - \mu_{n})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 + 2 (\mu_{m} - \mu_{n}) \sum_{i=1}^{n}(x_{i}-\mu_{m})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 - 2n(\mu_{m} - \mu_{n})^2
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 - n(\mu_{m} - \mu_{n})^2

and we have v_{m} in terms of v_{n} , viz. –

v_{m} = ( nv_{n} + (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 ) / m

In other words v_{m}= f(v_{n}, sum_{n}, n, x_{m}) , so there’s very little we need to store to accumulate the variance as we traverse the sequence.

C++ Implementation

Expressing this in C++ code we’d have a functor that maintains state and gets handed each element of the sequence, in simplest form –

Read the rest of this entry »
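The full code is behind the link; a minimal sketch of such a functor (my illustration, applying the recurrence above directly) might look like –

#include <cstddef>

// One-pass mean and population variance.  Each call applies
//   v_m = ( n*v_n + (x_m - mu_m)^2 + n*(mu_m - mu_n)^2 ) / m,  with m = n + 1.
class RunningVariance {
public:
    RunningVariance() : n_(0), mean_(0.0), var_(0.0) {}

    void operator()(double x) {
        const std::size_t m = n_ + 1;
        const double mean_m = mean_ + (x - mean_) / m;   // mu_m
        var_  = (n_ * var_
                 + (x - mean_m) * (x - mean_m)
                 + n_ * (mean_m - mean_) * (mean_m - mean_)) / m;
        mean_ = mean_m;
        n_    = m;
    }

    double mean() const     { return mean_; }
    double variance() const { return var_; }             // population variance

private:
    std::size_t n_;
    double mean_;
    double var_;
};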

I want to describe a simple experiment I’ve just done: a direct way to write code with medium-level verbs in a semi-functional style, in pure C.

All of this can be done in C++, and there’s certainly more syntactic sugar there, but I wanted to explore the idea in C… C is close to the metal [but not too close, like assembler], compilers generate fairly good machine code, and the language supports a minimalist way to define functions [no lambdas, but we can use function pointers and context pointers to get that, if not in a type-safe way].

Another approach would be to do it in C++ with operators and templates, reusing much of the STL and boost… yet another way would be to do it in ANSI C and use macros heavily… but my experiment is to write simple, readable C code that’s fairly quick.

In the K (terse) and Q (less terse) languages of KDB+, one can express something like this –

drawfrom:{[spec; vals; n]
mon: 0, sums nmlz spec; idx: mon bin n?1.0; vals[idx] }

Basically this reads –

function drawfrom(spec, vals, n)
mon = partial sums of spec (the cdf after normalizing to 1.0)
generate n random numbers uniformly in [0,1]
idx = array of indexes of each random sample into mon
return the values indexed by idx

So basically, this semi-functional zen koan simply generates n random samples from the spectrum supplied.  Think of spec as the weights that determine how often each of vals appears – spec is a histogram or discrete pdf.  Actually this is the readable version, closer to Q than K, as I’ve defined nmlz and used the verbose style – in K it can be much more ‘terse’ [proponents would say succinct].

At first this style of programming is annoying if you’re from a C++ background, but once you get used to it, you begin to think directly in terms of verbs working on vectors – in the same way that std::vector and std::accumulate let you think at a higher level and avoid many for() loops, or the foreach construct does in other languages…

So how does this look in C?  Try this –

Read the rest of this entry »
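The full listing is behind the link; as a rough C-style sketch of the drawfrom idea (my illustration, with a fixed-size cdf buffer just to keep it short) –

#include <stddef.h>   /* size_t */
#include <stdlib.h>   /* rand, RAND_MAX */

/* Smallest index i with u < cdf[i], i.e. which bucket of the cdf u falls in. */
static size_t cdf_bucket(const double *cdf, size_t k, double u)
{
    size_t lo = 0, hi = k;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (u < cdf[mid]) hi = mid; else lo = mid + 1;
    }
    return lo < k ? lo : k - 1;   /* clamp against fp rounding at the top end */
}

/* drawfrom(spec, vals, k, out, n): spec are the weights (discrete pdf) over
   the k values in vals; write n random draws into out. */
void drawfrom(const double *spec, const double *vals, size_t k,
              double *out, size_t n)
{
    double cdf[64];               /* sketch only: assumes k <= 64 */
    double total = 0.0;
    size_t i;

    for (i = 0; i < k; ++i) total += spec[i];
    for (i = 0; i < k; ++i)       /* nmlz then running sums -> the cdf */
        cdf[i] = (i ? cdf[i - 1] : 0.0) + spec[i] / total;

    for (i = 0; i < n; ++i) {
        double u = rand() / (RAND_MAX + 1.0);        /* uniform in [0,1) */
        out[i] = vals[cdf_bucket(cdf, k, u)];        /* mon bin, then index */
    }
}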

More on the Princeton Companion to Mathematics, aka ‘PCM’.

Quite a few of the individual Companion chapters are available from the authors’ own web sites… I gave links to Terry Tao’s and Tim Gowers’ blogs below.

Gordon Slade wrote the chapter ‘Probabilistic Models of Critical Phenomena’.

It’s a nice overview of what’s been happening in this area, with sections on critical phenomena, branching processes, random graphs, percolation, the Ising model and the random-cluster model, and it also introduces SLE, the Stochastic Loewner Evolution work of Oded Schramm, Wendelin Werner and others.

If you’ve never heard of this stuff, here’s a motivator of sorts, by yours truly –

Read the rest of this entry »