What follows is something like a live blog, based on comments from Matthew Wall and Simon Willison from The Guardian at the NoSQL EU conference in London today.
Wall kicked off the talk with a question about NoSQL: is it a good name for the phenomenon? He says not really, pointing out the absurdity of calling SQLite and MySQL “old world databases” as opposed to “new world” key-value stores.
[This point resonates strongly with RedMonk thinking. Both Stephen and I have been wary of reductionist approaches to defining NoSQL – we feel Hadoop-style Big Data, for example, should be thought of as a related trend.]
Where is The Guardian today? It’s a modern, information-driven web site driven by tags and feeds.
“It’s a traditional three-tier web app, with a large Oracle database at the center of the world. People might have thought we’re cooler than that, but we’re not.”
The Guardian took the decision to stick with the traditional relational model five years ago. The kind of tools we’re beginning to use weren’t as mature back then.
A key reason for sticking with Oracle was the maturity of the surrounding tools ecosystem (performance management and optimisation, backup) and the available skills.
SQL has worked well for the paper. “SQL is great. We can do cool stuff with that. At scale.”
“Searching one tag is ok, but what about two? What does it do to the database?”
“Related content” was 40% of the Guardian’s app load, so the team used a search engine instead. The search engine approach, using Apache Solr, worked well, but scale issues were still likely to become a problem.
Willison suggested the Guardian stick a massive memcached in front instead.
It worked. But what about throwing more resource at Oracle instead?
“We wanted to avoid Oracle RAC because it’s really expensive, but we want to scale out”.
[Oracle RAC is the database giant’s clustering technology.]
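The memcached suggestion above is commonly implemented as a cache-aside lookup: check the cache first, and only hit the database on a miss. The sketch below is a minimal illustration and not the Guardian's code. `FakeCache` is an in-memory stand-in for a memcached client, and `FakeDatabase`, `related_content` and the `related:` key format are hypothetical names invented for the example.

```python
import time


class FakeCache:
    """In-memory stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at < time.monotonic():
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


class FakeDatabase:
    """Stand-in for the expensive related-content query; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def related_content(self, article_id):
        self.calls += 1
        return [f"related-to-{article_id}-{n}" for n in range(3)]


def related_content(cache, db, article_id, ttl_seconds=60):
    """Cache-aside: return the cached result if present, otherwise query and store."""
    key = f"related:{article_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = db.related_content(article_id)
    cache.set(key, result, ttl_seconds)
    return result
```

With this shape, repeated requests for the same article within the expiry window cost one database query rather than one per page view.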
The Guardian’s business drivers: linked data and social networks. There is all sorts of information out there, and we need to engage with them. We can’t just broadcast the news…
The Guardian’s editor called for the organisation to Mutualise the News.
“We’re changing the platform because of the business change. New technologies: we have a real need to use them… blurring the line between journalists and readers.”
“Journalism is becoming the curation of all the world’s information”.
[Note: Google’s automated curation seems to be winning at this point… which explains why the Guardian is responding in the way it is.]
What happens with API access, which drives, for example, tag proliferation, which dramatically increases load on the database?
“Apache Solr is like a database, it works like one for us.”
Fields can be multi-valued: one piece of content with five tags can be stored in one field. Most important is that Solr offers the ability to facet the content and apply a facet *like* a tag…
For example, take an editor’s star rating. We can facet on that for free, and just jump to all the three-star albums. Facets can be combined much more quickly than in a relational database.
With Solr we can perform complex queries and filter by facets.
“On our data set, most queries are about the same cost. No transactions.”
With Solr, schema design is very important: schemas are more flexible and fuzzy than relational ones.
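As a rough illustration of combining tags and facets, the sketch below builds a Solr select URL using standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `wt`). The base URL and the field names `tag` and `star_rating` are assumptions made for the example. They are not taken from the Guardian's schema.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_query(tags, facet_fields,
                      base_url="http://localhost:8983/solr/select"):
    """Build a Solr query URL that filters by every tag and facets on the given fields.

    The base URL and field names are hypothetical. Tag values are assumed
    not to contain double quotes.
    """
    params = [("q", "*:*")]  # match everything, then narrow with filter queries
    for tag in tags:
        # One fq per tag: Solr intersects filter queries, so all tags must match.
        params.append(("fq", f'tag:"{tag}"'))
    params.append(("facet", "true"))
    for field in facet_fields:
        params.append(("facet.field", field))
    params.append(("wt", "json"))
    return f"{base_url}?{urlencode(params)}"


url = build_facet_query(["music", "reviews"], ["star_rating"])
print(url)
```

Each extra tag is just one more `fq` parameter, which is the "two tags" case that is awkward to express as relational joins.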
This is about getting data out of the system: powering the Guardian’s iPad app, site components and editors’ tools off the API, with far more to follow. But what about getting data in?
The Guardian has also built a simple REST/HTTP framework for apps that don’t affect the data store, for example for sucking in live football scores.
At this point the talk sped up dramatically. Willison talks a lot faster than Wall.
NoSQL for journalism
“I am working at the Guardian because I am interested in the opportunity to build rapid prototypes that go live: apps that live for two or three days. My interest is how NoSQL can help support journalism.”
Rapid prototyping needs things that scale down as well as up and handle massive spikes (if you’re on the front page). The quickest way to do lookups was to use Redis.
V1 of the MPs’ expenses app was not Redis-enabled.
The initial application generated 468k rows, randomised, every time someone hit the button.
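To show why randomising every row on each click is costly, here is a small sketch contrasting a full shuffle with picking a single random member. It uses plain Python lists as a stand-in. Redis offers a comparable single-member pick on sets via its `SRANDMEMBER` command, but whether the later version of the app used it is not stated here, so treat that link as an assumption. The function names are hypothetical.

```python
import random


def pick_by_full_shuffle(page_ids, rng):
    """Randomise every row, then take the first: work grows with the number of rows."""
    shuffled = list(page_ids)
    rng.shuffle(shuffled)
    return shuffled[0]


def pick_by_random_member(page_ids, rng):
    """Pick one random row directly: constant work regardless of the number of rows."""
    return page_ids[rng.randrange(len(page_ids))]


rng = random.Random(0)
pages = list(range(468_000))  # roughly the row count mentioned above
print(pick_by_full_shuffle(pages, rng) in pages)
print(pick_by_random_member(pages, rng) in pages)
```

Both functions return a valid random page, but the first touches every row on each request while the second touches one.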
Guardian Zeitgeist, meanwhile, doesn’t use Redis. The app attempts to highlight stories on the Guardian that are interesting, based on the amount of conversation about them on social networks. It looks for peaks, i.e. a page on the Guardian’s Environment section that gets more traffic than normal.
So it uses message queues and cron jobs: pull data, put it on a task queue, then calculate hotness. The results feed into Big Table, running on Google AppEngine, which is not great at complex queries, but good at simple select and sort.
“Using Big Table as a dumping ground for data you can sort by 1 or 2 columns when you need to”
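The pull, queue and score pipeline described above might look something like the sketch below. The talk does not give the actual hotness formula, so the ratio of current traffic to a page's own historical average is an assumption, chosen because it captures "more traffic than normal". A `deque` stands in for the task queue, and all names are hypothetical.

```python
from collections import deque
from statistics import mean


def hotness(current_hits, past_hits):
    """Score a page by how far current traffic exceeds its own normal level."""
    baseline = mean(past_hits) if past_hits else 0
    if baseline == 0:
        # No history: fall back to raw traffic so new pages can still rank.
        return float(current_hits)
    return current_hits / baseline


def rank_pages(task_queue, history):
    """Drain the queue of (url, current_hits) tasks and sort by hotness, hottest first."""
    scored = []
    while task_queue:
        url, current_hits = task_queue.popleft()
        scored.append((url, hotness(current_hits, history.get(url, []))))
    # A single-column sort is the kind of simple query the datastore handles well.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


queue = deque([("/environment/a", 300), ("/football/b", 1000)])
history = {
    "/environment/a": [100, 100, 100],
    "/football/b": [900, 1000, 1100],
}
ranked = rank_pages(queue, history)
print(ranked)
```

The environment page wins here despite having less absolute traffic, because it is running at three times its usual level while the football page is merely at its norm.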
Talking of dumping grounds… Guardian employees were effectively creating data sets that, if they didn’t make it into the paper as infographics, weren’t used. Raw numbers were being collected and cleaned up. Today the underlying data will be in a Google Docs spreadsheet, and made accessible on the Guardian website accordingly.
Guardian Datablog: a bunch of Google Docs spreadsheets. Retrieve data as CSV, XLS or JSON. Click “Make a copy” and run your own.
“We want to keep publishing arbitrary data sets, for example ‘output school league tables’ or ‘volcano information’. We want something schema free.”
Our first option is CouchDB: create a schema-free database, then index it in Solr.
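One way to picture the "schema-free store, then index in Solr" idea is to map each arbitrary document onto typed dynamic field names. The sketch below assumes suffix conventions like those in Solr's example schema (`_s` string, `_i` integer, `_f` float, `_b` boolean). The function name and the sample document are hypothetical, and nothing here reflects the Guardian's actual implementation.

```python
def to_solr_doc(couch_doc):
    """Flatten a CouchDB-style document into a Solr document with dynamic field suffixes."""
    solr_doc = {"id": couch_doc["_id"]}
    for key, value in couch_doc.items():
        if key.startswith("_"):
            # Skip CouchDB metadata such as _id and _rev.
            continue
        if isinstance(value, bool):
            # Check bool before int: in Python, bool is a subclass of int.
            suffix = "_b"
        elif isinstance(value, int):
            suffix = "_i"
        elif isinstance(value, float):
            suffix = "_f"
        else:
            suffix = "_s"
            value = str(value)
        solr_doc[key + suffix] = value
    return solr_doc


school = {
    "_id": "school-1",
    "_rev": "1-abc",
    "name": "Example School",
    "pupils": 420,
    "score": 87.5,
    "open": True,
}
print(to_solr_doc(school))
```

Because the field type is carried in the name, a new data set with different columns can be indexed without editing the Solr schema first.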
We have changed from the relational database being at the center of the world to a mix of datastores and models.
Disclosure: Oracle is not a client. VMware, which is, recently acquired Redis.