When production is your test lab – what is “canarying”?

 

If you’ve attended an infrastructure conference recently, particularly of the “cloud native” variety, you’ll likely have heard the term “canarying”. If you don’t know what that means you’ve probably nodded along sagely and thought, oh I better look that up. I’ll do it when I look up “smoke testing”.

So anyway – what is canarying?

A canary test reduces risk by pushing a change out to a small subset of users before it is deployed to your entire infrastructure and the rest of your users. Sounds simple, but obviously you need a pretty sophisticated development and operations infrastructure to do it properly. You start by deploying the change to a subset of your load-balanced infrastructure that is not currently serving traffic. Then you gradually route a small number of requests to the new service. These might be chosen at random, or come from employees, or from some other specific set of users. As you become more confident in the change you route more users to it, and then begin to deploy the change to more servers. You do want to keep running the older version in parallel for a while though, in case a problem emerges and you need to route users back to it.
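To make that loop concrete, here is a minimal sketch in Python. It is not how Netflix or any particular load balancer does it. The function names, the rollout steps (1, 5, 25, 50, 100 percent) and the 1% error threshold are all assumptions picked for illustration. It shows two ideas from the paragraph above: sending a stable slice of users to the new version, and either widening that slice or routing everyone back to the old version depending on how the canary behaves.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Return True if this user should be routed to the new version.

    Hashing the user id gives each user a fixed bucket from 0 to 9999,
    so the same user keeps landing on the same side, and anyone in the
    canary at 5% is still in it at 25%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def next_percent(current: float, error_rate: float,
                 max_error_rate: float = 0.01,
                 steps=(1, 5, 25, 50, 100)) -> float:
    """Decide the next canary traffic share (hypothetical policy).

    If the canary's error rate is above the threshold, return 0 so that
    all users go back to the old version. Otherwise move to the next
    larger step, stopping at 100.
    """
    if error_rate > max_error_rate:
        return 0.0  # roll back: route everyone to the old version
    for step in steps:
        if step > current:
            return float(step)
    return 100.0


if __name__ == "__main__":
    share = 0.0
    # Pretend error rates observed at each stage (made-up numbers).
    for observed in (0.001, 0.002, 0.004):
        share = next_percent(share, observed)
        routed = sum(in_canary(f"user-{i}", share) for i in range(2000))
        print(f"canary share {share}%: {routed} of 2000 users on new version")
```

In a real setup the routing decision would live in your load balancer or service mesh and the error rate would come from your monitoring. The point of the sketch is only that the old version stays available the whole time, so rolling back is just setting the share to zero.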

We used to call this kind of approach “phased rollout”, but “canarying” sounds better, and the practice has come a long way over the last couple of years in terms of technology and process: cloud, containers, microservices. For the state of the art in canarying at scale, check out what Netflix is doing. I was reading a recent post on the Netflix tech blog about the Chaos Automation Platform and at first I thought “gee whiz pass the salt”, but then I paused and took a step back to appreciate the breathtaking achievements of the engineering team. Of course we’re not all Netflix, and we shouldn’t try to emulate everything they do – premature optimisation is a killer. But that doesn’t mean we can’t all learn from the company, and luckily they are pretty happy to share their experiences.

Here is the Netflix testing approach over time.

 

Netflix asks profound questions

“The best experiments do not disturb the customer experience. In line with the advanced Principles of Chaos Engineering, we run our experiments in production. To do that, we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?”

One of the ways Netflix is an outlier is in how many changes it makes over a relatively short time frame. When changes are being deployed to microservices that you are not in control of, canarying gets more interesting.

“Any change to the production environment changes the resilience of the system. At Netflix, our production environment might see many hundreds of deploys every day. As a result, our confidence in an experimental result quickly diminishes with time.”

There is a reason Netflix calls it “Chaos Engineering”. As ever, you should read Thoughtworks on this stuff. Tomorrow we’ll talk about “smoke testing”.

(Read this and other great posts  @ RedMonk)

James, aka @Monkchips is co-founder of RedMonk, the open source analyst firm, which specialises in developer advocacy and analytics.