First there were desktop systems. Then came developers, and then big, distributed systems. These systems had many moving parts and often misbehaved in unpredictable ways. Then came the NOC, the team that monitored production alerts. Eventually, some good people at Google (Ben Treynor and others) observed that the workflow for production incidents, the methodology for managing them, and the approach for designing scalable architectures were essentially the same almost everywhere that the scale of operations and traffic was large. And thus was born a new school of thought called “Site Reliability Engineering”.
The way software development takes place in 2019 is in stark contrast to the way it was done in the early 2000s. There was no Stack Overflow to refer to, no GitHub projects to fork, no CloudFormation templates to click on. New-age developers are more like plumbers: they pick pieces from different sources, apply the relevant connections, and define sources and sinks to achieve an objective. The pieces may be an event stream, event consumers, databases, caching, logging, API endpoints and so on.
Software in the early days, usually a desktop application, could be written in an IDE and compiled into a binary (bundled with its libraries) that was self-sufficient enough to run in its environment; think Visual C++, the .NET platform, or C compilers like Turbo C. The container for these applications was usually the operating system itself. Since business analysts gathered requirements about usage patterns and peak load in the initial phases, managing concurrency was never a key question. Outages arose because a middle layer was down, a disk got full, and so on. But these causes were localized to the server environment, so the developer usually took care of them. The reliability vectors, then, were the middle layer and the operating system.
Then came web applications. Web apps were one step more advanced in execution runtime, in the sense that the code written for a web application needed to be hosted on a web server, and the web server itself ran on an operating system. Smaller web apps could handle low traffic easily, but with a surge in traffic you needed a farm of web servers.
This gave rise to distributed web applications, fronted by load balancers intelligent enough to detect the health of downstream servers and distribute traffic accordingly. In addition, the emergence of many new paradigms such as Pub/Sub (Redis), caching (Memcached), programmable cloud (Lambda/Cloud Functions), queuing frameworks (Kafka/RabbitMQ/Kinesis/SNS), NoSQL (MongoDB/DocumentDB) and on-demand storage and compute means that monitoring and programming for resiliency need a new approach. With so many moving parts, the plumber’s task is mostly sizing and cutting the right part. Developers’ work also gets tougher: the code on multiple servers has to stay in sync, and the deployment process must span multiple components. This gives rise to many non-development tasks involving configuration and setup of a proper execution environment for deployment. It also means that Scrum teams’ story-point estimates become more inaccurate, for two reasons:
- The development team has no experience in installing and configuring the execution environment.
- The development team can sink an unpredictable amount of time and effort into getting the environment right. Unmet dependencies, the wrong OS version, wrong values in my.cnf... anyone?
Such predictable unpredictability means projects are doomed to go off the rails from the very beginning.
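The health-aware load balancing described above can be sketched in a few lines. This is a minimal, hypothetical illustration (class and server names are made up, not a real product API): servers failing their health probes are simply excluded from the routing pool.

```python
import random

# Minimal sketch of health-aware load balancing (illustrative only):
# traffic is routed solely to downstream servers whose last probe passed.
class LoadBalancer:
    def __init__(self, servers):
        # map of server address -> healthy flag (assume healthy at start)
        self.servers = {s: True for s in servers}

    def record_health_check(self, server, healthy):
        """Update a server's status from a periodic health probe."""
        self.servers[server] = healthy

    def pick_server(self):
        """Pick a random healthy downstream server, if any remain."""
        healthy = [s for s, ok in self.servers.items() if ok]
        if not healthy:
            raise RuntimeError("no healthy servers available")
        return random.choice(healthy)

lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.record_health_check("10.0.0.2", False)  # probe failed
server = lb.pick_server()                  # never returns 10.0.0.2
```

Real load balancers layer retries, connection draining and weighted routing on top, but the core contract is the same: unhealthy members receive no traffic.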
This prompted the rise of DevOps, a school of thought that provisioned a dedicated team with a cross-functional mandate spanning building, testing, releasing, configuring, system administration and more. Since there has been no formal body or manifesto for DevOps, this team has meant different scopes and roles in different organizations.
But ensuring the performance and reliability of a system with so many moving parts could not be left to DevOps, since they had their operational tasks cut out for them. And the primary KRA of a development team could never be uptime and performance. So who is going to bell the cat?
Increasing Internet penetration and consumerism mean that the number of users engaging with an application, whether mobile or web, has grown to a massive scale. Engineering for scale requires an architecture that is decoupled and scalable. If an app is not engineered for uptime and performance, it will cause discontent among your users, leading to dissatisfaction, churn and even penalties (in the case of enterprise products).
SRE is a practice that starts at the design phase itself. Having replicas of all components, using load balancers, and serving traffic from CDNs are all examples of the SRE philosophy. But the most common mistake is to lump the DevOps and SRE teams together while putting SRE on a higher pedestal than DevOps.
A common mistake companies make is to promote DevOps people into SRE. You are then left with a glorified DevOps team that builds pipelines and does automation in the name of SRE. The problem is that SRE is a mindset, a thought process that needs to exist irrespective of which team you are on or which role you play. Accountability for driving uptime lies with the SRE team, but the onus of reliability starts with the product manager, the developer and the architect, and goes all the way to QA. In fact, even non-tech teams such as Finance, Revenue Assurance and Cost Control are key stakeholders: they are often only loosely aligned with its importance, yet their KPIs are directly impacted whenever any reliability metric takes a hit.
SRE needs to take both a proactive and a reactive approach.
The proactive approach is Performance & Uptime Engineering (PUE). All teams have to realize that downtime and poor performance figures are not an option. Once this alignment is made, go through the following and check whether you have done them all:
- Setting the Theme — It all starts by acknowledging that you will be dishing out new features all the time, and that you still need maximum uptime and performance. Innovation and reliability do not live at each other’s expense. At the same time, accept that mistakes and downtime will be treated as learning, alongside the principle of one-mistake-only-once (if you dropped a table once and caused an outage, you learned from it and started keeping snapshots, removed admin privileges, and so on). Everyone needs to move past the fear that they will be the one who caused an outage or poor application performance. More outages happen because not enough checks, balances and redundancies were in place than because of human mistakes. A well-designed, resilient architecture is one that can withstand the vagaries of human mistakes and emotions :) . The essence of SRE is to foresee all potential vectors that can cause downtime and mitigate them. SRE needs to be the last safety net, whose miss causes an outage.
If a developer was able to drop a production database, it is not the developer’s mistake; it is the SRE’s mistake.
Companies often make the grave mistake of not aligning developers with the cost of mistakes and poor performance (https://www.gigaspaces.com/blog/amazon-found-every-100ms-of-latency-cost-them-1-in-sales).
If you still find an engineer saying “I am here just to code, not to ensure reliability”, set up a coffee meeting and politely explain that an application which works for 9 out of 10 customers has failed 100% for someone sitting somewhere across the internet.
- PRINCIPLE ALIGNMENT: Do all teams understand the role of the SRE team as the custodian, solely accountable for the uptime of the application? One thing is very important to understand here: the SRE team has little or no implementation work of its own. The guidelines and architecture for scalability and reliability are discussed in a room with the Product Owner, the dev teams and QA, and all discussions driven by SRE tend to converge on PUE. Since SREs are typically veterans who have designed software and managed operations, they have the most experience in guiding the team’s choice of architecture, framework and toolset. But the ultimate goal is for developers to adopt the SRE mindset themselves in the long run. Ideally, Finance and Revenue Assurance should also be aligned with the role of SRE, since a deferral or rejection of SRE requests for resource provisioning will eventually show up as bad numbers in the books. The finance and business teams should also understand that tighter, higher-performing code produced by SRE-driven optimization runs on less infrastructure and results in cost savings as well!
One common friction arises between INFOSEC teams and SRE teams: INFOSEC often feels that SRE’s remit overreaches into their territory. The thing to understand is that loosely implemented security policies (e.g. credentials in code, wrongly implemented IAM roles, incorrect log permissions) can themselves induce downtime, so both teams need to recognize their shared goals of high uptime and performance.
- EMPOWERMENT OF THE SRE TEAM — Many companies make the mistake of having an SRE team without teeth. If you have an SRE team but adopting its guidelines and principles is left as “choose if you want” for key stakeholders, you are better off without an SRE team.
Have you made clear to the SRE team and to all other teams that the team’s KRA is high uptime and performance, and that all other teams jointly contribute to it? Since the SRE’s sole task is to find every chink in the armor and get the teams to fix it, adoption in spirit by all teams is necessary. This mandate needs to be declared by top management with all stakeholders in the room. The SRE team should not be left to play victim; it must be empowered. The disagree-and-commit approach should be bought into by all leaders. It is a tough approach for inexperienced leaders to adopt, but the benefits are many!
There is no greater joy than when the push for the SRE mindset comes from top management. Normally, the CTO should be the one championing the cause of SRE. From the CTO it should be distilled down to the VPs and Directors of Engineering, then the team leads, and finally the developers.
- ARCHITECTURE: Have you agreed on and implemented redundancy and high availability ACROSS THE BOARD? A single instance of a single point of failure will undo your production. One point often overlooked in the eagerness to build a strong architecture is the use of costly components, or components deployed without planning. Cost is an important pillar of good architecture. For example, using DynamoDB with wrongly configured, high RCUs when your anticipated reads are going to be low is bad architecture.
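The DynamoDB example above is easy to quantify with back-of-the-envelope arithmetic. The sketch below uses a made-up per-RCU price and made-up traffic numbers purely to illustrate the shape of the calculation; check your provider’s actual pricing and read-unit rules before relying on any figure.

```python
# Illustrative cost sanity check. The price below is an ASSUMPTION for
# demonstration, not real AWS pricing; traffic figures are also made up.
PRICE_PER_RCU_HOUR = 0.00013   # hypothetical hourly price per RCU

provisioned_rcu = 5000          # what someone configured "to be safe"
actual_peak_reads_per_sec = 50  # measured peak of reads per second

# Assume, for simplicity, one RCU covers one such read per second.
needed_rcu = actual_peak_reads_per_sec
monthly_hours = 30 * 24

wasted = (provisioned_rcu - needed_rcu) * PRICE_PER_RCU_HOUR * monthly_hours
print(f"over-provisioned cost ~ ${wasted:.2f}/month")
```

Even at a tiny unit price, a 100x over-provisioning gap compounds into a visible monthly bill, which is exactly why cost belongs in architecture reviews.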
Is your architecture resilient? You need to accept that somewhere, sometime, a datacenter or a server rack may lose power or internet access. Are you ready to tolerate that downtime, or are you building your architecture to support these scenarios too? Are you deploying your database in multiple zones with global replication as failover? Are you applying the principle of least privilege across the system to contain potential security holes?
- PERFORMANCE CONTRACT ALIGNMENT: Have you agreed on acceptable latency figures for all services and contracts? Your application may include a mix of services, some of which are not critical; identify them and adjust their weighting. Since infrastructure capacity provisioning depends on performance benchmarks, the dev teams need to make benchmarking a periodic exercise. This matters because the introduction of a service taking just 50 milliseconds more may distort previously published benchmarks significantly, resulting in the provisioning of more resources. While the onus of detecting such changes lies with SRE (by surfacing them in release automation reports), the dev teams should proactively flag such major changes.
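A periodic benchmark only needs a handful of lines to turn raw latency samples into comparable numbers. This is a minimal, self-contained sketch (the sample data is invented) using a nearest-rank percentile, which also previews why percentiles beat averages: one slow outlier moves the p99 but barely touches the median.

```python
# Illustrative benchmark report: p50/p90/p99 from sampled request
# latencies (milliseconds). Sample values below are made up.
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[k]

latencies_ms = [42, 45, 44, 47, 51, 48, 43, 120, 46, 49]
report = {p: percentile(latencies_ms, p) for p in (50, 90, 99)}
print(report)  # the lone 120 ms outlier dominates p99, not p50
```

Publishing such a report on every release makes a 50 ms regression visible as a diff against the previous numbers instead of an anecdote.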
- SLO/SLI ALIGNMENT — Service Level Objectives should be defined over metrics that the service’s customers actually experience. To have SLOs, you first need to define Service Level Indicators. Arriving at the correct SLIs is science; arriving at the correct SLOs takes experience and an understanding of the customer.
A common mistake while designing SLOs is picking the wrong set of SLIs. For example, a payment API may have error rate and latency as SLIs, but the CDN hit/miss ratio may not be relevant when designing the SLO for your customers. Likewise, basing an SLA on premature SLOs, or on too few sampled SLIs, is an invitation to regular customer credit notes; finance won’t like it. The order of setting should be SLI -> SLO -> SLA. Don’t be in a hurry to define SLAs until you have had a week or two of production running, because SLI values derived from internal benchmarking are usually produced under “ideal conditions”, immune to real-world impairments.
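The SLI -> SLO relationship is simple arithmetic once the indicator is chosen. Here is a minimal sketch with invented request counts: the SLI is the measured success ratio over a window, and the SLO is the target it is compared against.

```python
# Sketch of an availability SLI checked against an SLO target.
# The request counts are made-up sample data for illustration.
def availability_sli(success_count, total_count):
    """SLI: fraction of requests served successfully in the window."""
    return success_count / total_count

SLO_TARGET = 0.999  # SLO: 99.9% of requests succeed

sli = availability_sli(success_count=998_700, total_count=1_000_000)
slo_met = sli >= SLO_TARGET
print(f"SLI={sli:.4f}, SLO met: {slo_met}")
```

Only after such measured SLIs have survived a couple of weeks of real traffic does it make sense to freeze them into a contractual SLA.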
- OBSERVABILITY: How will you know that the majority of your users are dissatisfied right now? Have you instrumented your application with telemetry that keeps pushing all metrics into a time-series database? This is important because your basic system metrics may be missing business metrics you need to monitor. Have you built a dashboard accessible to business stakeholders as well as technical teams? Note that the level of granularity different stakeholders need is not the same: business teams care only about successes and failures on a checkout page, while the technical team may want a breakdown of all 4xx and 5xx responses.
A major fallacy, seen almost everywhere, is using ETL jobs to consume transaction data and present separate dashboards to different teams. This creates the problem of “different teams speaking different numbers”. By all means implement ETL, but have a single source of truth. The golden rule is to render each metric in only one dashboard and embed it elsewhere as needed.
Many companies confuse observability with Business Intelligence (BI). BI is simply rendering your business data for non-technical teams to analyze. Observability, on the other hand, is the capability of your application to emit relevant metrics to a data sink that can plot and report on them for production health analysis.
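The emit-to-a-sink idea above can be shown in miniature. This is a hypothetical sketch (the class, metric names and tags are all invented): the application emits timestamped, tagged metric points, and a collector would ship them to a time-series database. Note how one business metric and one technical metric flow through the same pipe, satisfying both audiences from a single source of truth.

```python
import time

# Hypothetical telemetry sketch: the application emits timestamped
# metric points; in production a collector would forward these to a
# time-series database instead of an in-memory list.
class MetricSink:
    def __init__(self):
        self.points = []  # stand-in for the time-series database

    def emit(self, name, value, tags=None):
        """Record one metric point with a timestamp and optional tags."""
        self.points.append({
            "ts": time.time(),
            "metric": name,
            "value": value,
            "tags": tags or {},
        })

sink = MetricSink()
# Business metric (checkout outcome) and technical metric (status code)
sink.emit("checkout.success", 1, tags={"page": "checkout"})
sink.emit("http.responses", 1, tags={"status": "502"})
print(f"{len(sink.points)} points emitted")
```

Dashboards for business and technical teams then become different queries over the same stored points, not different pipelines.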
- ERROR BUDGET AGREEMENT — Any software system will have releases. Releases may have bugs, and bugs will cause performance deterioration. But innovation cannot stop for the sake of reliability. Although the SRE team’s performance is always measured in terms of reliability, that doesn’t mean SRE can ask dev teams to hold back a change. The lack of a transparent performance contract creates friction between your SRE and dev teams, so an error budget needs to be arrived at and agreed upon. This metric removes the endless discussions and debates over whether and when to release, since both sides can refer to the allotted error budget and its current status (violated or within limits). It enforces general discipline and high-quality code and deployments, and removes any room for politics.
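The error-budget arithmetic itself is straightforward, which is part of its appeal as a neutral referee. A minimal worked example, assuming a 99.9% monthly availability SLO (the downtime figure is invented):

```python
# Error budget arithmetic for a 99.9% monthly availability SLO.
# The 0.1% the SLO leaves unpromised is the budget releases spend.
MINUTES_IN_MONTH = 30 * 24 * 60          # 43,200 minutes
SLO = 0.999                               # 99.9% availability target

budget_minutes = MINUTES_IN_MONTH * (1 - SLO)

downtime_so_far = 25                      # minutes of downtime this month
remaining = budget_minutes - downtime_so_far
can_release = remaining > 0               # gate risky releases on budget
print(f"budget={budget_minutes:.1f} min, remaining={remaining:.1f} min")
```

When the remaining budget hits zero, the pre-agreed rule (not a debate) is what pauses feature releases until reliability recovers.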
- MONITORING —
Are you monitoring enough?
Are all components, and the sub-components that support them, being monitored?
Are they monitored across all dimensions: availability, latency, request rates, error rates?
Are they monitored at the correct granularity (not per minute when the rate is 100 transactions per second)?
Are you monitoring the right statistic (not the average when you should have been monitoring the P90)?
The rule of thumb is to make a list of ALL resources your code accesses (yes, in the exception blocks too): the file system, Redis, databases, endpoints, web servers, cloud resources and everything else involved. Monitor each of them for accessibility (uptime) and health (latency, error rates). Don’t be paranoid about which values to set: pick a recent day when everything was running normally, take the values from the logs, and set them as your warning and critical thresholds. Refinements can always come later; thresholds are an evolving thing. But for god’s sake, set the monitoring, since the cost of not having monitoring outweighs the cost of setting incorrect thresholds. Do remember to tune the values as you collect more data; failing to do so induces alert fatigue in teams, which leads to misses. You don’t want that.
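The seed-thresholds-from-a-normal-day advice translates directly into code. This is an illustrative sketch (the baseline samples and the 1.5x/2x multipliers are assumptions you would tune, not a standard): derive warn/critical levels from a known-good day’s latencies, then classify live samples against them.

```python
# Hypothetical sketch: seed warning/critical thresholds from a
# known-good day's latency logs, then evaluate live samples.
def thresholds_from_baseline(samples_ms):
    """Derive warn/critical levels from a normal day's latencies.
    The 1.5x and 2x multipliers are illustrative starting points."""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * len(ordered)) - 1]
    return {"warn": p95 * 1.5, "critical": p95 * 2.0}

baseline = [40, 42, 41, 45, 43, 44, 46, 47, 42, 48]  # a "normal" day
levels = thresholds_from_baseline(baseline)

def alert_level(latency_ms, levels):
    """Classify one live latency sample against the thresholds."""
    if latency_ms >= levels["critical"]:
        return "critical"
    if latency_ms >= levels["warn"]:
        return "warning"
    return "ok"

print(alert_level(100, levels))
```

Re-running the baseline step as more data accumulates is precisely the periodic tuning that keeps alert fatigue at bay.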
- TEAM SELECTION — If your only criterion for moving someone from the DevOps team to the SRE team was “a better DevOps guy”, then god save you. As discussed earlier, everyone on the team needs an owner mindset. You don’t expect the CTO to have your back when things fail; it’s the SRE team that needs to get things done! When it comes to keeping the system up and running within the given constraints, the buck stops at SRE. Nothing against the less experienced, but the folks who have been to war know many what-not-to-dos as well as what-to-dos, and they command breadth (most important) as well as depth. A good SRE also needs a flair for converting uptime into dollar value. They carry the whole model of the infrastructure in their head (literally) and can, in minimum time, point incident management teams at the right log file during an outage.
The above is not exhaustive, and specifics differ by company and business, but the common themes run alike wherever an effective SRE team exists. Measure your own organization against them!
We will cover reactive SRE processes in the next post.
Feel free to share how you do SRE at your company!
Meanwhile, you can go through my previous SRE article.