Setting the scene
- the database holds 30 Million rows and queries are getting slower
- users want to search for stuff .. but don’t know what they want to search for
- the web experience needs to be fast enough to feel “interactive”
- it needs an api the mobile app developers can use .. so json
Questions / Observations
- Do we even have ‘Big Data’ ?
- 9GB data as CSV
- 2.5GB when zipped
- We could actually fit all the data into 16GB of RAM ..
- why doesn’t the database do that ?
- What if we fully “invert” the data, so tag searches are fast [ see the sketch after this list ] :
- data : id -> tags [ in named fields ]
- index : tag -> ids
- “inverted” when every tag is indexed
- so, given a tag, we can quickly get all the record ids it appears in
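a minimal node.js sketch of the inversion [ the ids and tags are made up, the real fields aren’t shown here ] :

```js
// a minimal sketch of the inversion, with made-up ids and tags
var data = {
  1: { tags: ['red', 'sale'] },
  2: { tags: ['red'] },
  3: { tags: ['blue', 'sale'] }
};

// build the inverted index : tag -> ids
var index = {};
Object.keys(data).forEach(function (id) {
  data[id].tags.forEach(function (tag) {
    (index[tag] = index[tag] || []).push(Number(id));
  });
});

console.log(index['sale']); // -> [ 1, 3 ] : one lookup, no table scan
```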
Strategy / Plan
- RAM is fast, use plenty : enough for :
- cached rows and
- the full index
- keep the index fully in RAM : all the time
- vertical scaling might work
The pain, oh the pain
- cassandra OK ..
- but we don’t have a cluster .. plus it’s a 150MB java bloat/dependency install
- mysql innodb OK ..
- but weaker sql support, and a large csv import died
- redis OK .. but
- it hit 25GB of RAM before dying
- it takes too much space to store long sets of integers
- new respect for postgres :
- great sql support
- csv import via “copy” : nice, simple, fast, reliable [ see the sketch after this list ]
- up the buffers, and warm the cache with some queries
- it leans on linux virtual memory, which keeps the most recently used pages in RAM
- a directive to pin an index or table in RAM would be nice
- linux utils, ftw :
- head tail sed split sort uniq wc
- csvkit
- perl -pe 's///g' [ and regexes generally in vim, node.js, perl ]
- bash shell
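a sketch of the postgres import + warm-up, from node.js [ the table name, csv path and db name are placeholders, not the post’s actuals; the ‘pg’ module’s callback API is used ] :

```js
// sketch only : 'records', the csv path and the db name are placeholders
var pg = require('pg');
var client = new pg.Client({ database: 'tagsearch' });

client.connect(function (err) {
  if (err) throw err;

  // server-side copy : simple, fast, reliable
  // ( the file must be readable by the postgres server process )
  client.query(
    "COPY records FROM '/var/tmp/records.csv' WITH (FORMAT csv, HEADER true)",
    function (err) {
      if (err) throw err;

      // warm the cache : a representative query pulls hot pages into
      // shared_buffers and the linux page cache
      client.query('SELECT count(*) FROM records', function (err) {
        if (err) throw err;
        client.end();
      });
    }
  );
});
```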
The epiphany
- keep the fully inverted index in RAM, all the time
- save space with a binary format
- develop custom “index server”
- custom binary file format
- node.js is nice to work with : buffers, streams [ see the sketch below ]
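a toy version of such a format : one block of 32-bit ids per tag, with the in-RAM tag table mapping each tag to its block’s offset and count .. the real layout isn’t described in the post, so this is illustrative only :

```js
// toy binary layout : one block of 32-bit ids per tag
// the in-RAM tag table maps tag -> { offset, count } into this file
var fs = require('fs');

function writeIdBlock(ids) {
  var buf = new Buffer(ids.length * 4);          // 4 bytes per id
  ids.forEach(function (id, i) { buf.writeUInt32LE(id, i * 4); });
  return buf;
}

function readIdBlock(fd, offset, count) {
  var buf = new Buffer(count * 4);
  fs.readSync(fd, buf, 0, buf.length, offset);   // one positioned read
  var ids = [];
  for (var i = 0; i < count; i++) ids.push(buf.readUInt32LE(i * 4));
  return ids;
}

// 4 bytes per id beats 8+ bytes of decimal text, roughly halving the index
fs.writeFileSync('/tmp/index.bin', writeIdBlock([3, 17, 42]));
var fd = fs.openSync('/tmp/index.bin', 'r');
console.log(readIdBlock(fd, 0, 3)); // -> [ 3, 17, 42 ]
```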
Partial Enlightenment
- SSDs are incredible
- seek SSD ~0.1ms vs HDD ~10ms [ 100x ]
- read SSD ~500MB/s vs HDD ~150MB/s [ 3-4x ]
- a middle way [ see the sketch after this list ] :
- custom index held in RAM
- data in a sturdy RDB [ postgres ]
- trust SSD performance
- json result set compresses well
- 256 rows ~ 170k json ~ 25k gzipped [ 7x ! ]
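putting the middle way together in a sketch [ the names are placeholders; it assumes the in-RAM index from earlier and a connected ‘pg’ client ] :

```js
// sketch of the middle way : names are placeholders
var zlib = require('zlib');

function search(tag, index, client, cb) {
  // 1. in-RAM inverted index : tag -> ids, no disk touched
  var ids = (index[tag] || []).slice(0, 1000);

  // 2. fetch the matching rows by primary key; the random reads behind
  //    this are cheap on SSD even when the rows aren't cached
  client.query('SELECT * FROM records WHERE id = ANY($1)', [ids],
    function (err, result) {
      if (err) return cb(err);

      // 3. json compresses well : ~170k -> ~25k in the numbers above
      zlib.gzip(JSON.stringify(result.rows), cb);
    });
}
```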
Good Performance Numbers
- running on a linode host with 2GB RAM
- 1GB for postgres cache
- 1GB for the index tag table :
- the first 700 MB of the 2.3 GB binary index file
- the rest is blocks of ids [ one block per tag ]
- tag search on 30 Million records :
- 25ms to get the first 1000 ids from the index table and index file
- 150ms to fetch the records with those ids !
- web front end :
- search is fast enough to be “interactive”
- user sees 450ms roundtrip [ including 200ms ping ]
- gzipping the json brings the 2.3s fetch down to 0.45s [ see the sketch after this list ]
- feels nearly instantaneous
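one plausible way to get that gzip win on the front end : express plus the ‘compression’ middleware [ the post doesn’t name its web stack, so this setup is an assumption ] :

```js
var express = require('express');
var compression = require('compression');

var app = express();
app.use(compression()); // gzip responses when the client accepts it

app.get('/search/:tag', function (req, res) {
  // stand-in for the index + postgres lookup sketched earlier
  fetchRows(req.params.tag, function (err, rows) {
    if (err) return res.status(500).end();
    res.json(rows); // ~170k of json leaves the wire as ~25k gzipped
  });
});

function fetchRows(tag, cb) { cb(null, []); } // placeholder

app.listen(3000);
```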
Moral of the story
- SSDs are FAST.. but they are NEW and DB software will need time to catch up
- NoSQL vs SQL just means more options, and more hybrid approaches
- You can still scale vertically …
- if your data is big [100M rows | 50GB ]
- but not BIG big [ 1Bn rows | 1TB ]
- RAM is expensive on the server
- because, unlike cpu, hosting companies can’t share it across many users
- 64GB linode = $640/mo vs 2GB linode = $20/mo
- potentially save ~$600 per month
- we will see more soft realtime web sites .. data-heavy web apps will feel more interactive