Via HN I found and excellent developer manager blog by Rands here – www.RandsInRepose.com

Rands worked on the Paradox team at Borland, and recants his take on the Database vs. Spreadsheet space in his article - Keeping Track of Everything

great blog, Ill have to get the book “Managing Humans“…

Hacking Xilla

Ive been hacking on a proof-of-concept for Xilla, really to try out some of these ideas and see where they go.. at some stage Ill upload that to the projects home -  WebXilla.com

spreadheet limitations?

In the meantime, given my comments about how people use spreadsheets instead of databases and why, I want to keep a link to this Hacker News discussion – “What Cant you do in Excel?”

The discussion is fascinating, the guts of it boils down to -

  • its got to be on the web, in the browser [shareable, publishable, multiuser]
  • spreadsheets should be more like databases [data types, data relations, named attributes]
  • we want to use spreadsheets as a live interface GUI for databases, sometimes
  • need a better language for programming [eg. python, ruby, lisp - anything but VB!]
  • need a better language for querying [ SQL or an alternative]
  • need better naming of data ranges [names not C1:C37]
  • better import/export/conversion/integration

To this I add my own wishlist -

  • browse data links [relations] as web hyper-links
  • runtime editable schema
  • data-app prototyping environment, RAD tool
  • complete versioning, check-pointing and backup
  • fine grained user access controls & permissions

A spreadsheet that has all of these is no longer a spreadsheet, or a database but a new new thing – so lets stop asking how to improve spreadsheets and ask how to build this data-web-’Xilla’.  I’m calling this new thing Xilla, because it needs a name.

Its tempting to deny or address the opinions above by saying ‘if you need better data management, grow up and use a real database’ .. but the real point is that spreadsheets solve part of a problem, databases solve part of an overlapping problem – and where we are now is that neither solves the problem really well, or at least well enough.

The reason people stick with spreadsheets is because they partly work, and they don’t want to move to a database and lose what flexibility they have – living with the limitations is better than fixing a schema in stone and writing a web app that is tied completely to that schema [and losing all the nice computational tools and UI they currently have in the process].  In an organizational context, theyll have to give up their dato to the experts, and once they do that theyll never see it again, or at least have no rights to change and redefine the structure or adapt it to the problem they have as it evolves.

I think quite a few startups are basically porting the spreadsheet or traditional database onto the web.. but thats not really what is needed, or its only 10 to 15 percent of whats needed… This approach rules out the new kind of app that might come into existence if we really revisited the assumptions.

Independently of users data needs, theres a lot of movement towards a pure attribute based or key-hash style of database.  David Heinemier Hansen mentions that Rails uses the database as a kind of big hash table – data is given a key and thats used to get the data back, so most of the other DB features arent really used.  Amazons Cloud DB service seems to do a similar thing, and many many web apps are getting huge scalability and performance gains by going for the memcached api with a similar style.

Combine that with the ubuquity of XML, and the fact that treelike data doesn’t fit easily within a table database structure and it seems the whole “data on web” space is ready for an overhaul, and there are sneak peeks of how it might be done the right way.

There is a better way to manage data – its there in front of us, perfectly formed inside the block of marble, just waiting to be uncovered.

WebXilla – the web 2.0 mind-map for your data

Problem – You cant change your database to fit your business and you cant do it incrementally.

Spreadsheets only handle the very early phase of data growth.  Custom database web systems are risky and expensive to develop.  Off the shelf web solutions can’t be customized nor easily integrated together.  Wikis and web content management systems don’t respect data schemas.

Solution – Xilla dramatically lowers the cost of entry because you can start with where your data is now, and you dont have to develop software.  Changes to the schema are immediate, codeless, free and reversible.  Running and maintenance costs are cut to a bare minumum.

Xilla allows you to experiment safely and make incremental changes to your data design without writing SQL or hiring DBAs and programmers.  Xilla reduces missed opportunity costs, because the schema stays in step with your business, and you can prototype and experiment cheaply.

Xilla is -

  • a place to share important data privately or publicly
  • a RAD prototyping tool & web development environment
  • a customizable database with search engine
  • a new tool for browsing cascades of data links
  • a community of users, developers and data architects
  • a schema repository of reusable solutions
  • a securely hosted, high performance web service with automated backup
  • a concrete semantic web, a mind map for your data

The Xilla user interface is completely web based and the data management, backup and hosting is covered by the monthly subscription fee.  There is no app to roll out nor database to take down and restart when you make a schema change, nor does that require writing scripts or code.

You can reuse existing schema solutions from a library of Xilla domain models to get started, then customize to suit your needs on an ongoing basis.  There is no fixed delivery date when the much awaited over hyped web app arrives to fit an outdated snapshot of your business – both your business and your data projects grow in sync with Xilla.

In most databases you have to know what field or column to search for.  Xilla search is like web search – by default it searches everything and shows you where the search terms occur.

Enter a keyword to find a starting point, then browse through related items and follow the data relationship chain, from item to item to item.  While doing this you’ve constructed a custom page view with all the relevant information at hand – useful summary for a sales call or purchase decision, bookmarked for later use.

Xilla makes relationships in the data stand out clearly because you can see them directly – you click to drill down through the cascade and are reminded about the structure of the data as you use it.

This powerful new browsing feature, combined with the benefits of incremental improvement make Xilla a revolutionary new tool for managing and evolving your business data.

My previous blog articles describe approaches to calculating kth moments in a single pass – see compute kth central moments in one pass and variance of a sequence in one pass.

I decided to make a rough performance comparison between my approach and boost accumulators api.  A reasonable test is to procedurally generate 10^7 random numbers in range [0.0..1.0] as the input data.  We compare 3 algorithms -

  • No moments accumulated – baseline, just generates the input
  • Accumulate all 12th order moments using Boost Accumulators
  • Accumulate all 12th order moments using my vfuncs cpp code

We run on Linux using time command to get rough timings, raw results are -

  • Baseline   -   1.16s
  • Boost Acc – 23.16s
  • Vfunc Acc -  2.44s

If we subtract the baseline we get a rough comparison of Boost ~ 22s and Vfuncs cpp ~1.3s when the cost of generating the input is factored out.

So the vfuncs impl is roughly an order of magnitude faster then the current v1.37.0 boost accumulate stats implementation (both are single pass).

I don’t think the boost algorithm is logically different, its probably more a case of template code complexity having an impact – the 10s compile times might indicate this also.  Executable size also reflects this, differing by an order of magnitude.

(Note : Using gcc 4.2.4.  I haven’t tried this on other compilers, build settings etc – they could have a profound effect.  Let me know if you see different results on eg. Intel or VC compilers)

Download

Download code and project – kth_moments.tgz – from vfuncs google code downloads page.  [BSD license feel free to use]

Some links while I have them at hand…

Compressed sensing basically gives a better practical way of recovering a signal from its samples than the Shannon Sampling theorem suggests – given some structure on the data.

Recent methods using L1 norm, as a compromise between L2 and L0, mean this is much more computationally feasible. New work by Tao, Candes etc gives some proofs on why this sparse sampling works so well.

Useful compressed sensing links -

Continuing on the same topic as my previous post, its nice to be able to gather all the kth order moments in a single pass.

Last time I mentioned the boost/accumulators example, but you will have noticed two issues if you use that.  Firstly, moment<k> tag will give you the kth simple moment relative to zero, whereas we often want the kth central moment of a sequence relative to the mean.  Secondly, although boosts accumulator is well written it does seem to take a while to compile [~ 12 seconds for code using moment<12>].

After some playing around Ive got a faster simpler approach, where the inner loop accumulates kth powers of the element.  After you’ve run the sequence through, you can then easily extract variance, and all the kth central moments.  So in adding the more general case of kth moments, Ive made the particular variance case simpler.  That often seems to happen in programming and math!

algebra

First a bit of math and then the code.  We want to express the kth central moment in terms of the k basic moments.

First, lets define the basic moment as -

 \displaystyle M_{n}^{j}= \sum_{i=1}^n {x}_i^{j}

We rearrange the binomial expansion -

\displaystyle nv_{n}^{k}= \sum_{i=1}^n({x}_{i}-\mu_{n})^k

\displaystyle = \sum_{i=1}^n \sum_{j=0}^k \binom{k}{j} {x}_{i}^j(-\mu_{n})^{k-j}

\displaystyle = \sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} \sum_{i=1}^n {x}_{i}^j

So we have the kth central moment given as a weighted sum of the kth simple moments -

\displaystyle v_{n}^{k} = 1/n(\sum_{j=0}^k \binom{k}{j} (-\mu_{n})^{k-j} M_{n}^{j})

which shows that all we need to accumulate as we walk across the sequence is the kth simple powers ({x}_{i})^k .

Notice the variance is now handled as a special case where k=2.  Likewise k=0 corresponds to n, the element count and k=1 is the sum of elements.

c++ impl

Heres a basic impl of the above expression -

Read the rest of this entry »

Rearranging

I was chatting with some other quant developers the other day, as you do, and the issue came up of calculating variances.  Its a pretty common operation, and the naive approach does two passes, one to calculate the mean, \mu_{n} , and one to gather the squares of differences from that. … but someone asked if it could be done in one pass, and of course it can, quite easily.

Recall, the population variance of a sequence \left \{x_{i} \right \} is defined as -
v_{n} = (\sum_{i=1}^{n} (x_{i}-\mu_{n} )^2) / n

We can expand this and see how the variance after n terms differs from the variance after m=n+1 terms, vis -

nv_{n}= \sum_{i=1}^{n} (x_{i}-\mu_{m} + \mu_{m} - \mu_{n})^2
\displaystyle = \sum_{i=1}^{n} (x_{i}-\mu_{m})^2 + \sum_{i=1}^{n}(\mu_{m} - \mu_{n})^2 + 2 \sum_{i=1}^{n}(x_{i}-\mu_{m}) (\mu_{m} - \mu_{n})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 + 2 (\mu_{m} - \mu_{n}) \sum_{i=1}^{n}(x_{i}-\mu_{m})
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 - 2n(\mu_{m} - \mu_{n})^2
\displaystyle = mv_{m}- (x_{m}-\mu_{m})^2 - n(\mu_{m} - \mu_{n})^2

and we have v_{m} in terms of v_{n} vis -

v_{m} = ( nv_{n} + (x_{m}-\mu_{m})^2 + n(\mu_{m} - \mu_{n})^2 ) / m

In other words v_{m}= f(v_{n}, sum_{n}, n, x_{m}) , so theres very little we need to store to accumulate the variance as we traverse the sequence.

C++ Implementation

Expressing this in C++ code we’d have a functor that maintains state and gets handed each element of the sequence, in simplest form -

Read the rest of this entry »

While browsing LWN archives I found a thoughtful article introducing DTrace, written by filesystems developer Valerie Henson -

See also her book reviews in the Kernel Hackers Bookshelf series, such as  -

While we’re on the topic of filesystems, I’d recommend Practical File System Design with the Be File System [pdf download ] by Dominic Giampaolo. As well as being totally free to download, this book is a delight to read with practical examples explaining all the ideas. Id read it cover to cover before diving into the internals of any of the current fs incarnations – EXT3/4, XFS, ZFS or the in-progress btrfs.

Back to DTrace – I believe Linux _needs_ DTrace. Purely on stylistic grounds, in the same way that any respectable unix needs ls, grep or strace.

DTrace has great style, in the best un*x tradition.  If its unlikely that a _port_ of DTrace will get into the linux kernel due to legal reasons, then the only option is to embrace the ideas of DTrace, and implement something comparable [perhaps D language compatible].  I’ve read Bryan Cantrils heartfelt exhortations that a DTrace clone would either be a straight port or a feature-poor dead duck in the water [my paraphrase]… but you could say that about any great piece of unix code… yet the successful ones embrace the beauty, utility and ethos of the original and faithfully live up to it.  I hope when this happens the authors of the original DTrace will give the linux clone their blessing, as a homage to their effort.

enjoy,

gord.

The embarrassment of riches truly astounds in the domain of ahem ‘very small’ lisp implementations… and who am I to refuse to fill an obvious missing link in the series?  We got small covered…

Yes, I have in fact begun a super cut down lisp implementation for vfuncs, which I think will be a nice way to express such idioms as map [hey, lisp invented map, right? - maybe not, better check ]

I’ve uploaded the new files atto.c, atto.h, attotest.c to the vfuncs project page – grab v0.6.7.

Some of the earlier blog comments around vfuncs got me to wondering.. How would you implement closure and currying in straight C so it could be used in useful array verbs?  It seems what you need is to replace some verb params with values from the ‘environment’.  That way the verb can use both runtime args and design time args easily… I did hack up something, but it began to look more and more like I was writing an interpreter… so I decided the ‘honest, unenlightened hacker’ way to proceed was to actually _write_ an interpreter for as simple a language as I could think of…

The core guts of lisp is, well – calling functions with args, when any of those args could be a const, a variable or another function.  So, the atto-kernel of lisp is something like this grammar [a comment pasted from the C source] -

//      expr : left func args right // (func arg0 arg1 arg2 )
//      args : arg args | empty // args is any length 0..n
//      arg  : ident | const | expr // each arg can be an expr itself – nested calls

In the spirit of ’start small, get it working and add features’, this cut-down grammar allows you to parse expressions like the following -

//     “(sqrt (add (sqr 3.0) (sqr 4.0)) )”

This is the ‘lisp way’ – everything is in prefix notation ie. (* a b) not (a * b).  You get used to it, and it is very consistent [and the reason lisp macros are more part of the language than say C/C++ macros].

Read the rest of this entry »

After seeing some discussion about Kernel Tracing tools in the recent Linux Kernel Summit…I found this great tech-talk.

Dtrace is like ‘TiVo for the Kernel’ … its been included in FreeBSD and Solaris, but Linux is unlikely to embed it directly due to the incompatibility of the GPL with Dtrace’s licence.

Great Dtrace google talk video by one of the developers Bryan Cantrell [his blog here]…

Hilarious discussion of seeing into all the abstract layered levels that is software, nice demo tracing into both kernel and user space programs.

At the end he talks a while about how to view into scripting languages – normally you’d just see rbeval [for ruby] or the JVM implementation – but he decribes how to extend further to Dtrace into scripting Environments.   Superb!

He makes some frank comments about the political situation of porting Dtrace to Linux.

Perhaps solution is to make a clean room GPL  impl from the external API.. watch this space.

Whats immediately apparent from this talk is how useful Dtrace is for developers and sysadmins.

[ fyi, Im working on an 'attolisp' scripting implementation for vfuncs.. hmm.. now how to instrument it?  Can you believe nanolisp, picolisp and femtolisp project names were taken! what an embarrassment of riches! :]