
Setting the scene

  • the database is 30 Million rows and performance is getting slower
  • users want to search for stuff.. but don’t know what they want to search for
  • the web experience needs to be fast enough to feel “interactive”
  • it needs to have an api which the mobile app developers can use.. so json

Questions / Observations

  • Do we even have ‘Big Data’ ?
    • 9GB data as CSV
    • 2.5GB when zipped
  • We could actually fit all the data into 16GB of RAM ..
    • why doesn’t the database do that ?
  • What if we fully “invert” the data, so tag searches are fast :
    • data  : id -> tags   [ in named fields ]
    • index : tag -> ids
    • “inverted” when all tags are indexed
    • so, given a tag we can quickly get all the record ids it appears in
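The inversion step above can be sketched in a few lines of node.js (toy records and tags, not the real schema):

```javascript
// data  : id -> tags
const data = new Map([
  [1, ['red', 'large']],
  [2, ['red', 'small']],
  [3, ['blue', 'large']],
]);

// invert it into index : tag -> ids
const index = new Map();
for (const [id, tags] of data) {
  for (const tag of tags) {
    if (!index.has(tag)) index.set(tag, []);
    index.get(tag).push(id);
  }
}

// given a tag, all matching record ids come back in one lookup
console.log(index.get('red'));   // every record id tagged 'red'
console.log(index.get('large'));
```

With the index in this shape, a tag search is a single hash lookup instead of a scan over 30 million rows.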

Strategy / Plan

  • RAM is fast, use plenty of it, enough for :
    • cached rows and
    • the full index
  • keep the index fully in RAM : all the time
  • vertical scaling might work

The pain, oh the pain

  • cassandra OK
    • but we don’t have a cluster.. and it’s a 150MB install of java bloat/dependencies
  • mysql innodb OK ..
    • but weak sql support + a large csv import died
  • redis OK .. but
    • hit 25GB RAM before dying
    • too much space to store long sets of integers
  • new respect for postgres :
    • great sql support
    • csv import “copy” : nice, simple, fast, reliable
    • up the buffers, and warm the cache with some queries
      • warming queries pull pages into the linux page cache, which keeps the most recently used pages in RAM
      • directive to hold index or table in RAM would be nice
  • linux utils, ftw :
    • head tail sed split sort uniq wc
    • csvkit
    • perl -pe ‘s///g’ [ and regexes generally in vim, node.js, perl ]
    • bash shell
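The perl -pe ‘s///g’ idiom maps straight onto a global regex replace in node.js too (a made-up cleanup here, collapsing runs of commas in a csv line):

```javascript
// perl -pe 's/,{2,}/,/g' equivalent, node.js style
const line = 'foo,bar,,baz,,';
const cleaned = line.replace(/,{2,}/g, ','); // collapse repeated commas
console.log(cleaned); // → foo,bar,baz,
```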

The epiphany

  • keep the fully inverted index in RAM, all the time
  • save space with a binary format
  • develop custom “index server”
    • custom binary file format
    • node.js is nice to work with : buffers, streams
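A minimal sketch of the packing idea behind a custom binary format, using node.js Buffers. The layout here (4 bytes per id as an unsigned 32-bit int) is an assumption for illustration, not the actual file format:

```javascript
// a block of record ids for one tag, packed as fixed-width 32-bit ints
// (4 bytes per id, vs ~8+ bytes each as text or JSON)
const ids = [17, 42, 1000000, 29999999];

const buf = Buffer.alloc(ids.length * 4);
ids.forEach((id, i) => buf.writeUInt32LE(id, i * 4));

// reading the block back is just a walk over fixed-width slots
const decoded = [];
for (let off = 0; off < buf.length; off += 4) {
  decoded.push(buf.readUInt32LE(off));
}
console.log(buf.length, decoded); // 16 [ 17, 42, 1000000, 29999999 ]
```

Fixed-width slots also mean the server can seek straight to the nth id in a tag’s block without parsing anything.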

Partial Enlightenment

  • SSDs are incredible
    • seek SSD ~0.1ms vs  HDD ~10ms  [ 100x ]
    • read SSD ~500MB/s vs HDD ~150MB/s [ ~3x ]
  • a middle way :
    • custom index held in RAM
    • data in a sturdy RDB [ postgres ]
    • trust SSD performance
  • json result set compresses well
    • 256 rows ~ 170k json ~ 25k gzipped [ 7x ! ]

Good Performance Numbers

  • running on a linode host with 2GB RAM
    • 1GB for postgres cache
    • 1GB for index tag table :
      • first 700 MB of the 2.3 GB binary index file
      • rest is blocks of ids [ per each tag ]
  • tag search on 30 Million records :
    • 25ms to get the first 1000 ids from the index table and index file
    • 150ms to fetch the records with those ids !
  • web front end :
    • search is fast enough to be “interactive”
    • user sees 450ms roundtrip [ including 200ms ping ]
    • gzipping the json brings the 2.3s fetch down to 0.45s
    • feels nearly instantaneous

Moral of the story

  • SSDs are FAST.. but they are NEW and DB software will need time to catch up
  • NoSQL vs SQL just means more options, and more hybrid approaches
  • You can still scale vertically
    • if your data is big [100M rows | 50GB ]
    • but not BIG big [ 1Bn rows | 1TB ]
  • RAM is expensive on the server
    • because hosting companies can’t share it across many users the way they share cpu
    • 64GB linode = $640/mo vs 2GB linode = $20/mo
    • potentially saving ~$620 per month
  • we will see more soft-realtime web sites .. data-driven sites will feel more interactive