You are currently browsing the tag archive for the ‘code’ tag.

Setting the scene

  • the database is 30 Million rows and performance is getting slower
  • users want to search for stuff.. but don’t know what they want to search for
  • the web experience needs to be fast enough to feel “interactive
  • it needs to have an api which the mobile app developers can use.. so json

Questions / Observations

  • Do we even have ‘Big Data’ ?
    • 9GB data as CSV
    • 2.5GB when zipped
  • We could actually fit the data all into 16GB RAM ..
    • why doesn’t the database do that ?
  • What if we fully “invert” the data, so tag searches are fast :
    • data  : id -> tags   [ in named fields ]
    • index : tag -> ids
    • “inverted” when all tags are indexed
    • so, given a tag we can quickly get all the record ids it appears in

Stategy / Plan

  • RAM is fast, use plenty : enough for :
    • cached rows and
    • the full index
  • keep the index fully in RAM : all the time
  • vertical scaling might work

The pain, oh the pain

  • cassandra OK
    • but we don’t have a cluster.. + its 150MB java bloat/dependencies install
  • mysql innodb OK ..
    • but weak sql support + a large csv import died
  • redis OK .. but
    • hit 25GB RAM before dying
    • too much space to store long sets of integers
  • new respect for postgres :
    • great sql support
    • csv import “copy” : nice, simple, fast, reliable
    • up the buffers, and warm the cache with some queries
      • will bring in linux virtual ram, based on most recently used
      • directive to hold index or table in RAM would be nice
  • linux utils, ftw :
    • head tail sed split sort uniq wc
    • csvkit
    • perl -pe ‘s///g’ [ and regexes generally in vim, node.js, perl ]
    • bash shell

The epiphany

  • keep the fully inverted index in RAM, all the time
  • save space with a binary format
  • develop custom “index server”
    • custom binary file format
    • node.js is nice to work with : buffers, streams

Partial Enlightenment

  • SSDs are incredible
    • seek SSD ~0.1ms vs  HDD ~10ms  [ 100x ]
    • read SSD  150MB/s vs ~500MB/s [ 4x ]
    • readable comparison here
  • a middle way :
    • custom index held in RAM
    • data in a sturdy RDB [ postgres ]
    • trust SSD performance
  • json result set compresses well
    • 256 rows ~ 170k json ~ 25k gzipped [ 7x ! ]

Good Performance Numbers

  • running on a linode host with 2GB RAM
    • 1GB for postgres cache
    • 1GB for index tag table :
      • first 700 MB of the 2.3 GB binary index file
      • rest is blocks of ids [ per each tag ]
  • tag search on 30 Million records :
    • 25ms to get the first 1000 ids from the index table and index file
    • 150ms to fetch the records with those ids !
  • web front end :
    • search is fast enough to be “interactive
    • user sees 450ms roundtrip [ including 200ms ping ]
    • gzip the json brings 2.3s fetch down to 0.45s
    • feels nearly instantaneous

Moral of the story

  • SSDs are FAST.. but they are NEW and DB software will need time to catch up
  • NoSQL vs SQL just means more options, and more hybrid approaches
  • You can still scale vertical
    • if your data is big [100M rows | 50GB ]
    • but not BIG big [ 1Bn rows | 1TB ]
  • RAM is expensive on the server
    • because hosting companies can share it over many users
    • 64GB linode = $ 640/m vs 2GB linode = $20/mo
    • potentially save $600 per month
  • we will see more soft realtime web sites .. web with data will feel more interactive

I recently implemented a standalone FIX message parser in C++ as I wanted to answer a few questions I had myself about the design on the protocol.

To define the protocol [what messages, fields, values are allowed] I use the same XML format that QuickFix and QuickFix/J use.  This is a more compactly rendered subset of the information provided in the XML repository.

The spec itself is most easily browsed via the’s ‘Fiximate’ site, or download the full PDF specification.

I did notice that QuickFix and QuickFix/J seem to come only with FIX 4.0 through FIX 4.4 definitions… After some pain wrangling XML with perl I created FIX50SP2 which should match the most recent FIX specification.

You can find the full source, and the FIX 5 XML definition on the google code project page.

Note : BSD licence, uses gnu C++ on Linux, makefile build, depends on libXML2 for SAX2 parsing of XML

enjoy, and happy gnu year!


My previous blog articles describe approaches to calculating kth moments in a single pass – see compute kth central moments in one pass and variance of a sequence in one pass.

I decided to make a rough performance comparison between my approach and boost accumulators api.  A reasonable test is to procedurally generate 10^7 random numbers in range [0.0..1.0] as the input data.  We compare 3 algorithms –

  • No moments accumulated – baseline, just generates the input
  • Accumulate all 12th order moments using Boost Accumulators
  • Accumulate all 12th order moments using my vfuncs cpp code

We run on Linux using time command to get rough timings, raw results are –

  • Baseline   –   1.16s
  • Boost Acc – 23.16s
  • Vfunc Acc –  2.44s

If we subtract the baseline we get a rough comparison of Boost ~ 22s and Vfuncs cpp ~1.3s when the cost of generating the input is factored out.

So the vfuncs impl is roughly an order of magnitude faster then the current v1.37.0 boost accumulate stats implementation (both are single pass).

I don’t think the boost algorithm is logically different, its probably more a case of template code complexity having an impact – the 10s compile times might indicate this also.  Executable size also reflects this, differing by an order of magnitude.

(Note : Using gcc 4.2.4.  I haven’t tried this on other compilers, build settings etc – they could have a profound effect.  Let me know if you see different results on eg. Intel or VC compilers)


Download code and project – kth_moments.tgz – from vfuncs google code downloads page.  [BSD license feel free to use]

Continuing on the same topic as my previous post, its nice to be able to gather all the kth order moments in a single pass.

Last time I mentioned the boost/accumulators example, but you will have noticed two issues if you use that.  Firstly, moment<k> tag will give you the kth simple moment relative to zero, whereas we often want the kth central moment of a sequence relative to the mean.  Secondly, although boosts accumulator is well written it does seem to take a while to compile [~ 12 seconds for code using moment<12>].

After some playing around Ive got a faster simpler approach, where the inner loop accumulates kth powers of the element.  After you’ve run the sequence through, you can then easily extract variance, and all the kth central moments.  So in adding the more general case of kth moments, Ive made the particular variance case simpler.  That often seems to happen in programming and math!


First a bit of math and then the code.  We want to express the kth central moment in terms of the k basic moments.

First, lets define the basic moment as –

\displaystyle M_{n}^{j}= \sum_{i=1}^n {x}_i^{j}

We rearrange the binomial expansion –

\displaystyle nv_{n}^{k}= \sum_{i=1}^n({x}_{i}-\mu_{n})^k

\displaystyle = \sum_{i=1}^n \sum_{j=0}^k \binom{k}{j} {x}_{i}^j(-\mu_{n})^{k-j}

\displaystyle = \sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} \sum_{i=1}^n {x}_{i}^j

So we have the kth central moment given as a weighted sum of the kth simple moments –

\displaystyle v_{n}^{k} = 1/n(\sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} M_{n}^{j})

which shows that all we need to accumulate as we walk across the sequence is the kth simple powers ({x}_{i})^k .

Notice the variance is now handled as a special case where k=2.  Likewise k=0 corresponds to n, the element count and k=1 is the sum of elements.

c++ impl

Heres a basic impl of the above expression –

Read the rest of this entry »


I was chatting with some other quant developers the other day, as you do, and the issue came up of calculating variances.  Its a pretty common operation, and the naive approach does two passes, one to calculate the mean, \mu_{n} , and one to gather the squares of differences from that. … but someone asked if it could be done in one pass, and of course it can, quite easily.

Recall, the population variance of a sequence \left \{x_{i} \right \} is defined as –
v_{n} = (\sum_{i=1}^{n} (x_{i}-\mu_{n} )^2) / n

We can expand this and see how the variance after n terms differs from the variance after m=n+1 terms, vis –

nv_{n}= \sum_{i=1}^{n} (x_{i}-\mu_{m} + \mu_{m} - \mu_{n})^2
\displaystyle = \sum_{i=1}^{n} (x_{i}-\mu_{m})^2 + \sum_{i=1}^{n}(\mu_{m} - \mu_{n})^2 + 2 \sum_{i=1}^{n}(x_{i}-\mu_{m}) (\mu_{m} - \mu_{n})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 + 2 (\mu_{m} - \mu_{n}) \sum_{i=1}^{n}(x_{i}-\mu_{m})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 - 2n(\mu_{m} - \mu_{n})^2
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 - n(\mu_{m} - \mu_{n})^2

and we have v_{m} in terms of v_{n} vis –

v_{m} = ( nv_{n} + (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 ) / m

In other words v_{m}= f(v_{n}, sum_{n}, n, x_{m}) , so theres very little we need to store to accumulate the variance as we traverse the sequence.

C++ Implementation

Expressing this in C++ code we’d have a functor that maintains state and gets handed each element of the sequence, in simplest form –

Read the rest of this entry »

Just a note that Ive uploaded the initial version of vfuncs to google code. Ive released under a BSD license so you can use it in your commercial and noncommercial code easily.

Download from here [I’ll import to SVN sometime soon]. See my previous post for a description of vfuncs.

This version contains an example of a digital filter. This can be used to smooth the series data, or apply other signal processing operations. If your familiar with applying a blur filter in photoshop or gimp, using a gaussian filter kernel, this is exactly the same idea (except in one dimension).  Gaussian filter is basically just a moving average of the data.

Think of the algorithm as applying a sliding window across the data – the sliding window contains the filter weights, and at each position you apply the weighted average [dot product] of the filter weights against each data point in the window.

If the filter contains a single element of weight 1.0, then the result is just the input (the filter is just the Dirac delta function in that case). If the filter contains [0.25 0.50 0.25] its going to mix each element with its neigbours and take a weighted average, thus smoothing the data.

Read the rest of this entry »