Do you really have a Site Reliability Engineering (SRE) Team?

There were desktop systems. Then there were developers. Then there were big, distributed systems, with many moving parts that kept behaving according to their own whims. Then came the NOC, the team that monitored production alerts. Soon, some good people at Google (Ben Treynor and others) realized that the workflow for production incidents, the methodology for managing them, and the approach to designing scalable architectures were essentially the same wherever the scale of operations and traffic was large. And so a new school of thought was born: “Site Reliability Engineering”.

The way software is developed in 2019 is in stark contrast to the way it was written in the early 2000s. Back then there was no Stack Overflow to refer to, no GitHub projects to fork, no CloudFormation template to click on. New-age developers are more like plumbers: they pick pieces from different sources, apply the relevant connections, and define sources and sinks to achieve an objective. The pieces may be event streams, event consumers, databases, caches, logging, API endpoints and so on.

New-age software development invariably consists of many building blocks that need to be plumbed together to create full-blown, working software.

In the early days, software was usually a desktop application. It could be written in an IDE and compiled into a binary (bundled with libraries) that was self-sufficient enough to run in its environment, e.g. Visual C++, the .NET platform, C compilers like Z80/Turbo and so on. The container for these applications was usually the operating system itself. Since business analysts gathered requirements on usage patterns and peak levels in the initial phases, managing concurrency was never a key question. Outages arose because a middle layer was down, a disk was full and so on. But these causes were localised to the server environment, so the developer usually took care of them. So the reliability vectors were the middle layer and the operating system.

Then came web applications. Web apps were one step more advanced in their execution runtime, in the sense that the code had to be hosted on a web server, and the web server itself ran on an operating system. Smaller web apps could handle low traffic easily. But with a surge in traffic, you needed a farm of web servers.

This gave rise to distributed web applications. They had load balancers with the intelligence to detect the health of downstream servers and distribute traffic accordingly. In addition, the emergence of many new paradigms, such as pub-sub (Redis/Memcache), programmable cloud (Lambda/Cloud Functions), queuing frameworks (Kafka/Redis/RabbitMQ/Kinesis/SNS), NoSQL (MongoDB, DocumentDB) and on-demand storage and compute, means that monitoring and programming for resiliency need a new approach. With so many moving parts, the plumber's task is more about sizing and cutting the right part. It also makes the developers' task a bit tougher: the code on multiple servers has to be in sync, and the deployment process has to take care of operations spanning multiple components. This gives rise to many non-development tasks, which consist of configuring and setting up a proper execution environment for the deployment. And for Scrum teams, it means the story points allocated to project tasks turn out to be less accurate. This is because of two reasons:

  1. The development team has no experience with installing and configuring the execution environment.

Such predictable unpredictability means that projects are doomed to go off the rails from the very beginning.
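To make the earlier point about load balancers concrete, here is a minimal Python sketch of a balancer that checks the health of downstream servers before handing out traffic. The class name `HealthAwareBalancer`, the server names and the `is_healthy` probe are all made up for illustration, and the health state is simulated with a dictionary. A real load balancer does far more (timeouts, retries, weights), so treat this as a sketch of the idea only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealthAwareBalancer:
    """Round-robin over backends, skipping any whose health check fails."""

    backends: List[str]
    is_healthy: Callable[[str], bool]  # hypothetical probe, e.g. a GET on a health endpoint
    _cursor: int = 0

    def pick(self) -> str:
        # Try each backend at most once, starting from the cursor.
        for offset in range(len(self.backends)):
            index = (self._cursor + offset) % len(self.backends)
            candidate = self.backends[index]
            if self.is_healthy(candidate):
                # Move past the chosen backend so the next call rotates.
                self._cursor = (index + 1) % len(self.backends)
                return candidate
        raise RuntimeError("no healthy backend available")


# Simulated health state; a real probe would make a network call.
health = {"web-1": True, "web-2": False, "web-3": True}
balancer = HealthAwareBalancer(list(health), lambda name: health[name])
picks = [balancer.pick() for _ in range(4)]
print(picks)  # web-2 never appears while it is marked unhealthy
```

The point of the sketch is that traffic routing and health detection live in one place, which is exactly the kind of moving part that someone has to own and monitor.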

This prompted the rise of DevOps. The DevOps school of thought provided for a dedicated team with a cross-functional mandate, ranging across building, testing, releasing, configuring, system administration and more. Since there has been no formal body or manifesto for DevOps, this team has meant different scopes and roles in different organizations.

But ensuring the performance and reliability of a system with so many moving parts could not be left to DevOps, since they had their operational tasks cut out for them. Also, the primary KRA of the development team could never be uptime and performance. So, who is going to bell the cat?

The increase in Internet penetration and rising consumerism have meant that the number of users engaging with an application, whether mobile or web, has grown to a massive scale. Engineering for scale requires an architecture that is decoupled and scalable. If an app is not engineered for uptime and performance, it will cause discontent among your users, leading to churn and even penalties (in the case of enterprise products).

SRE is a practice that starts from the design phase itself. Having replicas of all components, putting load balancers in front of them, and serving traffic from CDNs are all examples of the SRE philosophy. But the most common mistake is to club the DevOps and SRE teams together and put SRE on a higher pedestal than DevOps.
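Why do replicas matter so much at design time? A rough back-of-the-envelope calculation helps. The sketch below assumes that replicas fail independently of each other, which is optimistic in practice (shared networks, shared deployments and shared bugs all break that assumption), and the 99% figure is an illustrative input rather than a measurement. The function names are my own.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` independent copies is up."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must be down at the same time for the component to be down.
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(f"{n} replica(s): {a:.6f} available, "
          f"about {downtime_minutes_per_year(a):.1f} minutes down per year")
```

Under these assumptions, each extra replica of a 99% component cuts the expected downtime by a factor of a hundred, which is why redundancy is decided on the drawing board and not after the first outage.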

A common mistake companies make is to promote DevOps people into SRE. In that case, you are left with a glorified DevOps team that creates build pipelines and does automation in the name of SRE. The problem is that SRE is a mindset, a thought process that needs to be there irrespective of which team you are on or which role you play. The accountability for driving it, and for uptime, lies with the SRE team, but the onus of reliability starts with the product manager, the developer and the architect, and goes on all the way to QA. In fact, even non-tech teams such as Finance, Revenue Assurance and Cost Control are key stakeholders: they are often only loosely aligned with its importance, yet their KPIs are directly impacted whenever any reliability metric takes a hit.

SRE needs to take care of both the proactive and the reactive approach.

Proactive Approach with Performance & Uptime Engineering (PUE). ALL the teams HAVE to realize that downtime and poor performance are not an option. Once this alignment is in place, go through the following and check whether you have done them all:

  1. Setting the Theme: It all needs to start by acknowledging that you will be dishing out new features all the time, and that there still needs to be maximum uptime and performance. Innovation and reliability don’t come at the cost of each other. At the same time, accept that mistakes and downtime will be treated as learning, along with the principle of one-mistake-only-once (if you dropped a table once and caused an outage, you learnt from it and started keeping snapshots, removed admin privileges, etc.). Everyone needs to come out of the fear psychosis of being the one who caused an outage or poor application performance. More outages happen because not enough checks, balances and redundancies were in place than because of human mistakes. A well-designed, resilient architecture is one that can withstand the vagaries of human mistakes and emotions :). The essence of SRE is to foresee all the potential vectors that can cause downtime and mitigate them. SRE needs to be the last safety net, the one whose miss will cause an outage.
    If a developer was able to drop a production database, then it’s not the developer’s mistake, it’s the SRE’s mistake.
    Companies often make the grave mistake of not aligning developers with the cost of mistakes and poor performance (https://www.gigaspaces.com/blog/amazon-found-every-100ms-of-latency-cost-them-1-in-sales).
    If you still find an engineer who says “I am here just to code, not to ensure reliability”, then set up a coffee meeting and politely get across that an application working for 9 out of 10 customers has failed 100% for someone sitting somewhere across the internet.
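The dropped-table example above can be turned into a guardrail. What follows is a hypothetical Python sketch of a check that sits in front of production and refuses destructive SQL unless a recent snapshot exists. The function `check_statement`, the one-hour snapshot policy and the list of "destructive" keywords are all assumptions made for illustration; in a real setup you would rely on database permissions first and use a check like this only as an extra layer.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Statements treated as destructive in this sketch; illustrative, not complete.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE)\b", re.IGNORECASE)
MAX_SNAPSHOT_AGE = timedelta(hours=1)  # assumed policy, pick your own


def check_statement(
    sql: str,
    environment: str,
    last_snapshot: Optional[datetime],
    now: Optional[datetime] = None,
) -> None:
    """Raise PermissionError for destructive SQL on production without a fresh snapshot."""
    if environment != "production" or not DESTRUCTIVE.match(sql):
        return  # nothing to guard against
    now = now or datetime.now(timezone.utc)
    if last_snapshot is None or now - last_snapshot > MAX_SNAPSHOT_AGE:
        raise PermissionError("destructive statement blocked: no recent snapshot")


# Example: the same statement is allowed on staging but blocked on production.
check_statement("DROP TABLE users", "staging", last_snapshot=None)
try:
    check_statement("DROP TABLE users", "production", last_snapshot=None)
except PermissionError as error:
    print(error)
```

The design choice worth noting is that the check blames nobody: it simply makes the one-mistake-only-once lesson part of the system, so the next person cannot repeat it.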

The above is not exhaustive, and cases may differ by company and business, but these common themes run alike wherever an effective SRE team exists. You can measure your own team against them too, if you have one!

We will cover reactive SRE processes in the next post.

Feel free to share how you do SRE at your company!

Meanwhile, you can go through my previous SRE article

Reliance Jio
