Setting the scene
- the database is 30 Million rows and performance is getting slower
- users want to search for stuff.. but don’t know what they want to search for
- the web experience needs to be fast enough to feel “interactive”
- it needs to have an api which the mobile app developers can use.. so json
Questions / Observations
- Do we even have ‘Big Data’ ?
- 9GB data as CSV
- 2.5GB when zipped
- We could actually fit the data all into 16GB RAM ..
- why doesn’t the database do that ?
- What if we fully “invert” the data, so tag searches are fast :
- data : id -> tags [ in named fields ]
- index : tag -> ids
- “inverted” when all tags are indexed
- so, given a tag we can quickly get all the record ids it appears in
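As a minimal sketch of the inversion (the ids and tags here are made up, and the real records keep tags in named fields rather than a flat list):

```javascript
// Toy forward data: id -> tags
const records = new Map([
  [1, { tags: ['red', 'sale'] }],
  [2, { tags: ['red'] }],
  [3, { tags: ['sale', 'new'] }],
]);

// Build the inverted index: tag -> array of record ids
const index = new Map();
for (const [id, rec] of records) {
  for (const tag of rec.tags) {
    if (!index.has(tag)) index.set(tag, []);
    index.get(tag).push(id);
  }
}

// Given a tag, all matching record ids come back in one lookup
console.log(index.get('sale')); // [ 1, 3 ]
```

One hash lookup per tag, instead of scanning 30 million rows.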
Strategy / Plan
- RAM is fast, use plenty : enough for :
- cached rows and
- the full index
- keep the index fully in RAM : all the time
- vertical scaling might work
The pain, oh the pain
- cassandra OK
- but we don’t have a cluster.. + it’s a 150MB install of java bloat/dependencies
- mysql innodb OK ..
- but weak sql support + a large csv import died
- redis OK .. but
- hit 25GB RAM before dying
- used too much space to store long sets of integers
- new respect for postgres :
- great sql support
- csv import “copy” : nice, simple, fast, reliable
- up the buffers, and warm the cache with some queries
- relies on linux virtual memory, which keeps the most recently used pages in RAM
- directive to hold index or table in RAM would be nice
- linux utils, ftw :
- head tail sed split sort uniq wc
- perl -pe ‘s///g’ [ and regexes generally in vim, node.js, perl ]
- bash shell
- keep the fully inverted index in RAM, all the time
- save space with a binary format
- develop custom “index server”
- custom binary file format
- node.js is nice to work with : buffers, streams
- SSDs are incredible
- seek SSD ~0.1ms vs HDD ~10ms [ 100x ]
- read SSD ~500MB/s vs HDD ~150MB/s [ ~3x ]
- a middle way :
- custom index held in RAM
- data in a sturdy RDB [ postgres ]
- trust SSD performance
- json result set compresses well
- 256 rows ~ 170k json ~ 25k gzipped [ 7x ! ]
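The actual binary file format isn’t shown here, so the fixed-width 32-bit layout below is just an assumption — but it illustrates what the custom binary packing buys, using the node.js Buffer api the post mentions:

```javascript
// Hypothetical sketch: pack a tag's record ids as raw 32-bit little-endian
// integers -- a fixed 4 bytes per id, versus the much larger cost of
// storing them as JS arrays or decimal strings.
function packIds(ids) {
  const buf = Buffer.alloc(ids.length * 4);
  ids.forEach((id, i) => buf.writeUInt32LE(id, i * 4));
  return buf;
}

function unpackIds(buf) {
  const ids = [];
  for (let off = 0; off < buf.length; off += 4) ids.push(buf.readUInt32LE(off));
  return ids;
}

const packed = packIds([10, 20, 30000000]);
console.log(packed.length);     // 12 -- three ids in 12 bytes
console.log(unpackIds(packed)); // [ 10, 20, 30000000 ]
```

Blocks like this can be written straight to the index file and read back with streams, which is exactly where node.js buffers shine.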
Good Performance Numbers
- running on a linode host with 2GB RAM
- 1GB for postgres cache
- 1GB for index tag table :
- first 700 MB of the 2.3 GB binary index file
- rest is blocks of ids [ per each tag ]
- tag search on 30 Million records :
- 25ms to get the first 1000 ids from the index table and index file
- 150ms to fetch the records with those ids !
- web front end :
- search is fast enough to be “interactive”
- user sees 450ms roundtrip [ including 200ms ping ]
- gzipping the json brings the 2.3s fetch down to 0.45s
- feels nearly instantaneous
Moral of the story
- SSDs are FAST.. but they are NEW and DB software will need time to catch up
- NoSQL vs SQL just means more options, and more hybrid approaches
- You can still scale vertically …
- if your data is big [100M rows | 50GB ]
- but not BIG big [ 1Bn rows | 1TB ]
- RAM is expensive to rent on a server
- hosting companies make their money by sharing it across many users
- 64GB linode = $640/mo vs 2GB linode = $20/mo
- potentially save $600 per month
- we will see more soft realtime web sites .. web with data will feel more interactive
I’ve been hacking on a proof-of-concept for Xilla, really to try out some of these ideas and see where they go.. at some stage I’ll upload that to the project’s home – WebXilla.com
In the meantime, given my comments about how people use spreadsheets instead of databases and why, I want to keep a link to this Hacker News discussion – “What Can’t you do in Excel?”
The discussion is fascinating; the guts of it boil down to –
- it’s got to be on the web, in the browser [shareable, publishable, multiuser]
- spreadsheets should be more like databases [data types, data relations, named attributes]
- we want to use spreadsheets as a live interface GUI for databases, sometimes
- need a better language for programming [eg. python, ruby, lisp – anything but VB!]
- need a better language for querying [ SQL or an alternative]
- need better naming of data ranges [names not C1:C37]
- better import/export/conversion/integration
To this I add my own wishlist –
- browse data links [relations] as web hyper-links
- runtime editable schema
- data-app prototyping environment, RAD tool
- complete versioning, check-pointing and backup
- fine grained user access controls & permissions
A spreadsheet that has all of these is no longer a spreadsheet, or a database, but a new new thing – so let’s stop asking how to improve spreadsheets and ask how to build this data-web-‘Xilla’. I’m calling this new thing Xilla, because it needs a name.
It’s tempting to deny or address the opinions above by saying ‘if you need better data management, grow up and use a real database’ .. but the real point is that spreadsheets solve part of a problem, and databases solve part of an overlapping problem – where we are now, neither solves the problem really well, or at least well enough.
The reason people stick with spreadsheets is that they partly work, and they don’t want to move to a database and lose what flexibility they have – living with the limitations beats fixing a schema in stone and writing a web app tied completely to that schema [and losing all the nice computational tools and UI they currently have in the process]. In an organizational context, they’ll have to give up their data to the experts, and once they do that they’ll never see it again – or at least have no rights to change and redefine the structure, or to adapt it to the problem as it evolves.
I think quite a few startups are basically porting the spreadsheet or traditional database onto the web.. but that’s not really what is needed, or it’s only 10 to 15 percent of what’s needed… This approach rules out the new kind of app that might come into existence if we really revisited the assumptions.
Independently of users’ data needs, there’s a lot of movement towards a pure attribute-based or key-hash style of database. David Heinemeier Hansson mentions that Rails uses the database as a kind of big hash table – data is given a key and that’s used to get the data back, so most of the other DB features aren’t really used. Amazon’s cloud DB service seems to do a similar thing, and many web apps are getting huge scalability and performance gains by going for the memcached api with a similar style.
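The “big hash table” usage pattern boils down to put/get by key — a toy sketch (not Rails’ or Amazon’s actual API, and the key scheme below is invented for illustration):

```javascript
// A database used as a big hash table: values go in under a key,
// come back out by the same key, and nothing else is asked of it.
class KVStore {
  constructor() { this.map = new Map(); }
  put(key, value) { this.map.set(key, JSON.stringify(value)); }
  get(key) {
    const raw = this.map.get(key);
    return raw === undefined ? null : JSON.parse(raw);
  }
}

const db = new KVStore();
db.put('user:42', { name: 'Ada', plan: 'pro' });
console.log(db.get('user:42')); // { name: 'Ada', plan: 'pro' }
```

Swap the in-memory Map for memcached, a cloud key-value service, or a single indexed table, and the calling code barely changes — which is exactly why so much of the relational machinery goes unused.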
Combine that with the ubiquity of XML, and the fact that treelike data doesn’t fit easily within a table database structure, and it seems the whole “data on web” space is ready for an overhaul – there are sneak peeks of how it might be done the right way.
There is a better way to manage data – its there in front of us, perfectly formed inside the block of marble, just waiting to be uncovered.
Problem – You can’t change your database to fit your business, and you can’t do it incrementally.
Spreadsheets only handle the very early phase of data growth. Custom database web systems are risky and expensive to develop. Off the shelf web solutions can’t be customized nor easily integrated together. Wikis and web content management systems don’t respect data schemas.
Solution – Xilla dramatically lowers the cost of entry because you can start with where your data is now, and you don’t have to develop software. Changes to the schema are immediate, codeless, free and reversible. Running and maintenance costs are cut to a bare minimum.
Xilla allows you to experiment safely and make incremental changes to your data design without writing SQL or hiring DBAs and programmers. Xilla reduces missed opportunity costs, because the schema stays in step with your business, and you can prototype and experiment cheaply.
Xilla is –
- a place to share important data privately or publicly
- a RAD prototyping tool & web development environment
- a customizable database with search engine
- a new tool for browsing cascades of data links
- a community of users, developers and data architects
- a schema repository of reusable solutions
- a securely hosted, high performance web service with automated backup
- a concrete semantic web, a mind map for your data
The Xilla user interface is completely web based, and data management, backup and hosting are covered by the monthly subscription fee. There is no app to roll out and no database to take down and restart when you make a schema change – and schema changes require no scripts or code.
You can reuse existing schema solutions from a library of Xilla domain models to get started, then customize to suit your needs on an ongoing basis. There is no fixed delivery date when the much-awaited, over-hyped web app arrives to fit an outdated snapshot of your business – your business and your data projects grow in sync with Xilla.
In most databases you have to know what field or column to search for. Xilla search is like web search – by default it searches everything and shows you where the search terms occur.
Enter a keyword to find a starting point, then browse through related items and follow the data relationship chain, from item to item to item. While doing this you’ve constructed a custom page view with all the relevant information at hand – useful summary for a sales call or purchase decision, bookmarked for later use.
Xilla makes relationships in the data stand out clearly because you can see them directly – you click to drill down through the cascade and are reminded about the structure of the data as you use it.
This powerful new browsing feature, combined with the benefits of incremental improvement make Xilla a revolutionary new tool for managing and evolving your business data.
Some links while I have them at hand…
Compressed sensing basically gives a better practical way of recovering a signal from its samples than the Shannon Sampling theorem suggests – given some structure on the data.
Recent methods using the L1 norm, as a compromise between L2 and L0, make this much more computationally feasible. New work by Tao, Candès and others gives proofs of why this sparse sampling works so well.
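The L1-as-compromise idea is usually written as basis pursuit — my summary of the standard formulation, not a statement from the linked work: given measurements $b = Ax$ with an $m \times n$ sensing matrix $A$, $m \ll n$, the ideal but combinatorially intractable problem is

```latex
\min_x \|x\|_0 \quad \text{subject to} \quad Ax = b
```

and the convex relaxation actually solved is

```latex
\min_x \|x\|_1 \quad \text{subject to} \quad Ax = b
```

which is a linear program, and recovers a sufficiently sparse $x$ exactly under conditions (such as the restricted isometry property) established by Candès, Romberg and Tao.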
Useful compressed sensing links –