Ops Excellence: Best Practices on Managing Technical Outages from an Operations Point-of-View
If you're properly prepared, you can bring the calm to the chaos of outage management.
The goal of any tech company is 100% uptime and zero outages. However, when engineering teams deploy code multiple times per day or week, there are more opportunities for failure, and service can be unintentionally impacted.
During my time at Uber and Fast, I helped manage and triage outages from the Operations point-of-view. I worked within the payments space, where 24/7 reliability is a top priority for your customers.
Here, I will walk through some lessons learned while managing payments outages at scale for high growth companies.
As I experienced first-hand, when customers are upset about large-scale outages, they will go to the press. Sound outage management processes can help you resolve outages quickly and seamlessly.
Outage Definition
Each company may have their own definition of an outage. Here, I’m referring to technical issues that prevent the product from working as it should in a large-scale way.
Outage Leveling
There are different levels of outages. At Uber, we would number outages from Level 5 to Level 1. An L5 Outage was the highest level, which meant that a major system was down and trips couldn’t be completed or a major functionality was not working for some area in the world. L1 would be the lowest level outage and could be akin to a bug with a low impact.
Each company will have their own definitions for outage levels, naming, and downstream processes.
For Startups: When your company is just starting out, I would keep outage processes light or non-existent. All team members should be aware of your normal volume and be able to quickly identify if something is not working.
For Scale-ups: It can be helpful to have written definitions of what outage leveling you want to follow so that the whole company understands the terminology.
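As a minimal sketch of what those written definitions might look like, the levels can live in a small shared config that every team references. The labels, descriptions, and cadences below are illustrative placeholders, not Uber's actual definitions:

```python
# Hypothetical outage-level definitions, following the L5 (most severe)
# to L1 (least severe) scheme described above. All values are
# illustrative placeholders for your own company's definitions.
OUTAGE_LEVELS = {
    5: {"label": "L5", "description": "Major system down; core flows broken in some region",
        "update_cadence_minutes": 10, "notify_all_company": True},
    4: {"label": "L4", "description": "Major functionality degraded for many customers",
        "update_cadence_minutes": 30, "notify_all_company": True},
    3: {"label": "L3", "description": "Noticeable impact; workaround exists",
        "update_cadence_minutes": 60, "notify_all_company": False},
    2: {"label": "L2", "description": "Minor impact to a small customer segment",
        "update_cadence_minutes": 240, "notify_all_company": False},
    1: {"label": "L1", "description": "Low-impact bug; handled via normal triage",
        "update_cadence_minutes": None, "notify_all_company": False},
}

def response_for(level: int) -> dict:
    """Look up the response expectations for a given outage level."""
    return OUTAGE_LEVELS[level]
```

Keeping this in one place means that when someone declares "this is an L4," everyone already knows the expected update cadence and who gets notified.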
Preventing Outages
Outages can happen for many reasons. From doing post-mortems and managing outages at scale, I’ve found that the vast majority of outages happen during code deployments. Engineers try their best to properly test code before it’s deployed; however, things happen, and sometimes new code will break existing functionality.
Dedicated Slack Channel for Code Deployments
One practice we have used at companies is to have a dedicated Slack channel for code deployments. In either an automated or manual way, the channel is notified when a code deployment is starting, when it’s in progress, and when it’s complete.
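As a minimal sketch of the automated version, a deploy script or CI hook could post these status updates via a Slack incoming webhook. The webhook URL and service name below are placeholders:

```python
import json
import urllib.request

# Placeholder: a real Slack incoming-webhook URL for your deploys channel.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXXXXXX"

def deploy_message(service: str, status: str) -> dict:
    """Build the Slack payload announcing a deploy status change."""
    allowed = {"starting", "in progress", "complete", "rolled back"}
    if status not in allowed:
        raise ValueError(f"unknown status: {status}")
    return {"text": f"Deploy of `{service}` is {status}."}

def notify_deploy(service: str, status: str) -> None:
    """Post the status message to the dedicated deploys channel."""
    payload = json.dumps(deploy_message(service, status)).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

A CI pipeline would call `notify_deploy("payments-api", "starting")` before rollout and again on completion or rollback, so the channel becomes a timeline anyone can scan when something looks off.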
If anyone on the business side notices that something is breaking after the deployment, they can ping the Slack channel, and if the service issue is confirmed by Engineering, the code deploy can generally be rolled back quickly.
At a massive scale, companies might have 24/7 Command Centers where they are monitoring metrics and volume in real time for any anomalies.
Cloud Configuration Settings
Another reason that I have seen outages occur is due to Cloud Configuration Settings. Settings on cloud platforms, like AWS, can be quite complex even for experienced SRE teams. If an outage occurs and the likely culprit is not a code deploy, the SRE team is the first one I would contact to investigate.
Cloudflare / Firewall
A third issue that can come up is Cloudflare. There have been instances where Cloudflare was misconfigured, or they had an issue that temporarily blocked functionality.
Third Party Providers
A fourth potential cause of outages is third party providers. As companies scale, it is best practice to have redundancy so that core functionality is not dependent on one vendor. For early stage startups, this can be difficult. Outages at the large cloud providers generally impact a broad swath of the internet, so affected companies will typically address them on social media.
For smaller vendors that are core to your operations, I would ensure you have documented escalation pathways in place so you know exactly who to contact if you suspect an issue. I would also recommend using more stable, established vendors and/or vetting all vendors to ensure they have sufficient runway and financial resources to properly support your business over time.
Outage Processes
When a company starts out, you do not want to add too much process. However, as you scale, some simple process clarifications can go a long way in helping the company respond rapidly to any issues. If you document the following workflows, you will be well on your way to a streamlined outage management process.
Outage Leveling
As discussed above, documented outage leveling and light processes can be helpful for keeping the whole company on the same page and communicating in the same language. These can be stored in Notion or a similar company repository.
Outage Communications
Having strong communications both internally and externally is paramount for managing outages in a professional manner.
Internal Communications
Reporting a potential outage: There should be a dedicated channel where any employee at the company can report a potential outage and the justification for what they’re seeing. This is often an Engineering Slack channel.
Communicating with the company once an outage is confirmed
Once the outage has been confirmed (or unconfirmed) by the Engineering team, an Outage Commander should be assigned to manage the process end-to-end. I have most commonly seen the Outage Commander be either the Engineering lead or Product lead for the system likely at fault.
For Level 5 (most severe outages), you want to communicate with the company very frequently. I recommend posting updates in an All-Company Slack channel every ~10 minutes or on some consistent cadence. Uber would send email communications to the entire company for Level 4 and Level 5 outages.
Setting up a Zoom - The Outage Commander should very quickly assemble a Zoom call with the Engineers who are investigating, a Customer Support leader, and anyone else who might be able to add immediate value.
External Communications
Social Media & Website Banner: If you have a product that many other businesses depend on, like Stripe, Square, or Shopify, and you have a large-scale outage, you will want to inform customers on your social media channels and a banner on your website as soon as possible. Similar to internal company communications, you want to proactively keep your customers updated so that they do not slam your customer support ticketing system.
Emails: For outages that might last a while, at Uber we would often send emails to impacted customers letting them know the status and estimated resolution time.
Internal Reviews: All communications should be reviewed by your legal POC, your marketing POC, your Customer Support POC, and your executive leadership before being sent out. Generally the Marketing team would own this workstream.
Remediation Policies
If you are a payments or services company and your systems are down for a period of time, customers will often ask for financial remediation for any lost earnings. I would recommend consulting your legal counsel and potentially including specific language in your Terms & Conditions about this topic rather than improvising it after an outage occurs.
Outage Resolution
Post-Mortem Document: Once an outage is resolved, it is time to more deeply determine what happened. The Engineering Lead on the team where the outage occurred will often be responsible for writing up a post-mortem with the Cause, Timeline of Events, Business Impact (e.g. lost revenue, number of customers impacted), and Remediation Steps. This document should be shared with the company and presented to leadership. Some companies also post a public post-mortem.
Documented Remediation Steps: There should be a clear list of action items to hopefully prevent the same type of outage from occurring again. This could be a prioritized set of tasks that can be added to the Engineering Roadmap. The team will need to determine the urgency and timeline for each item.
Playbook Approved Communications: Store your learnings and approved communications in a central place for easy access in case other incidents occur and for other team members to learn from.
Super Complex Outages
I have been a part of some outages that are incredibly complex. These often impact multiple teams and cannot be managed with a documented playbook. In these cases, you will need to use your best business judgement, leadership input, and company resources to make sound decisions. Something will always happen that you have not planned for. When it does, it is best to keep a level head and make decisions using both head and heart, with the customer at the top of your mind.
Tooling
There is now tooling available to help companies manage outages, such as Incident.io. I have not used any similar software at scale. Generally, tools like this will not solve all of your problems; you will still need to proactively manage both the tool and the process. These tools also will not identify outages for you.
If you have any best practices, learnings, or tips I might have missed, feel free to comment below.