As part of the CXOTALK series of conversations with innovators, I recently interviewed Cameron Tuckerman-Lee, a site reliability engineer at Airbnb. I caught up with Cameron at New Relic’s FutureStack16 conference.
Site reliability is more technology-focused than the usual CXOTALK conversations on business and digital disruption. However, I thought it valuable to learn how a high-volume site like Airbnb thinks about reliability and user experience.
If the Airbnb site goes down or has trouble, not only does the company stop making money, but guests and hosts could also face great inconvenience. For example, an outage could cause travelers who are actively on their way to an Airbnb rental to lose a map or address. Given the size and scale of Airbnb, these things matter enormously.
You can watch our conversation in the video embedded below. An edited transcript of highlights follows and you can read a complete transcript on the CXOTALK video page.
What does a site reliability engineer do?
I think the role is very different depending on what company you’re at. At a lot of companies, SREs are your operators. You have developers in one part of your building who develop your applications and then throw them over the metaphorical wall to your operators, who make sure that they’re running in production.
At Airbnb, we don’t subscribe to that model; we are in the DevOps model that is becoming very popular lately. So, the same engineers that are building applications are also the ones that are running them, scaling them, and dealing with incidents. But because of that, there’s a new class of tools that are required to make sure that they’re doing that efficiently and using best practices. So, that’s what the SRE team does: it makes sure that the entire site is reliable and available, and we do that by supporting the other teams that own their applications.
What kind of tools do you use?
A lot of it is learning. When there are incidents, how do you make sure that there’s a good follow-up, that there’s learning from it? There is tooling, like post-mortems, and making sure that when incidents do occur you can get data on previous ones very quickly and understand it.
It’s also getting the right people in the room. So, how you do [that] with staggered escalations, how you deal with alerting; the site reliability team also owns those. You know, we’re also the ones that own and maintain the integrations with some of our monitoring tools, like StatsD and New Relic. These are how, when there are incidents, we’re able to quickly triangulate where the problem is and what the impact was.
There are lots of different good ways to go about incident response, but a not-great way would be to have everybody be doing it their way, and have no consistency. Having a team like SRE means that Airbnb has a consistent approach to incident response, so when there are problems that need to be escalated up the chain, they can get picked up and handled very quickly.
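To make the idea of staggered escalations concrete, here is a minimal sketch of how such a policy could be expressed. It is only an illustration and not Airbnb's tooling: the policy table, the delays, the role names, and the function `responders_to_page` are all hypothetical.

```python
from datetime import timedelta

# Hypothetical escalation policy: each tier is paged once the alert has gone
# unacknowledged for at least the given delay. Delays and roles are made up.
ESCALATION_POLICY = [
    (timedelta(minutes=0), "primary on-call"),
    (timedelta(minutes=10), "secondary on-call"),
    (timedelta(minutes=30), "engineering manager"),
]


def responders_to_page(elapsed, acknowledged, policy=ESCALATION_POLICY):
    """Return the tiers that should have been paged for an open alert.

    elapsed: time since the alert fired.
    acknowledged: True once someone has picked the alert up, which stops
    any further escalation.
    """
    if acknowledged:
        return []
    return [who for delay, who in policy if elapsed >= delay]


if __name__ == "__main__":
    print(responders_to_page(timedelta(minutes=12), acknowledged=False))
```

The point of keeping the policy in one shared table is the consistency described above: every team escalates in the same way, so an unacknowledged alert reaches the next person after a predictable delay.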
How does trust relate to site reliability?
Some might say that Airbnb is a hospitality company, but some might also argue that we’re selling trust: the trust that you’re going to be able to go to a stranger’s home, and feel welcome and have a good experience, and be able to experience that neighborhood like a local.
There’s the technology that goes into making sure that people are who they say they are, that you’re able to interact with your host and get to know each other beforehand, and that, when you’re searching for a listing, you find a place that fits the kind of neighborhood you’re looking for. I think these all contribute to making sure that when you go someplace, you trust that it’s going to be a good experience.
What kinds of data do you look at?
There are a couple of different parts of the data that my team cares about.
It’s everything from your traditional SRE metrics: mean time to resolve, mean time to acknowledge, you know, when [it is] incident response. My team is also starting to care about metrics around making sure that our on-call engineers are living healthy, productive lives; making sure that work-life balance is something that extends [to] when you’re on call at 2 AM. I think it’s something important for the industry to start looking at.
Lastly, there are the ones that are aligned with how our users see things; these are what a lot of companies would call “service-level objectives.” Making sure that our response time is up, our error rates low, that [it is] not just response time, sending out bytes to our CDN fast, but also making sure that when the browser does get that information, it also has fast load times.
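For readers less familiar with these metrics, the following is a minimal sketch of how mean time to acknowledge, mean time to resolve, and a simple error-rate objective could be computed. It is an illustration under stated assumptions, not a description of Airbnb's systems: the `Incident` record, the function names, and the 0.1% error-rate threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps we need."""
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_duration(durations):
    """Average a collection of timedeltas."""
    durations = list(durations)
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_acknowledge(incidents):
    # Time from the incident opening until someone acknowledges it.
    return mean_duration(i.acknowledged - i.opened for i in incidents)


def mean_time_to_resolve(incidents):
    # Time from the incident opening until it is resolved.
    return mean_duration(i.resolved - i.opened for i in incidents)


def meets_error_rate_slo(errors, requests, max_error_rate=0.001):
    """Check an assumed objective: at most 0.1% of requests may fail."""
    return requests > 0 and errors / requests <= max_error_rate


if __name__ == "__main__":
    start = datetime(2016, 11, 1, 2, 0)  # an arbitrary 2 AM page
    history = [
        Incident(start, start + timedelta(minutes=5), start + timedelta(minutes=30)),
        Incident(start, start + timedelta(minutes=15), start + timedelta(minutes=90)),
    ]
    print(mean_time_to_acknowledge(history))
    print(mean_time_to_resolve(history))
    print(meets_error_rate_slo(errors=5, requests=10_000))
```

Browser-side load time, the last item in the answer above, would need measurements collected from real users and is left out of this sketch.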
(Cross-posted @ ZDNet | Beyond IT Failure)