(and What To Do About Them)
As much as we prefer to stay on the sunny side of the street, occasionally we perform a reality check. As a managed cloud services provider, we know how difficult it can be to cover all the bases (especially with shrinking SaaS & IT budgets), but shortcuts and rush jobs can lead to bad places. For example, a little over a year ago, Atlassian reported problems with its biggest development tools (Jira, Confluence) when a maintenance script error led to a days-long outage. When SaaS providers make such mistakes, clients suffer, and they may ‘vote with their feet’ as a result. We’ll explore a few SaaS disasters and offer prescriptions for preventing them and for remedying them when they occur.
Notable SaaS Disasters
In addition to our Atlassian example, here are a few more illustrative SaaS disasters.
- Salesforce Outage: In May 2022, a database failure caused a global outage impacting Salesforce services for thousands of customers for over a day.
- Twilio SMS/Voice Outage: A software bug disrupted Twilio’s communication services for several hours in September 2022.
- Oracle WebLogic Server Bug: A severe bug in March 2022 allowed remote takeover of server clusters.
- Feedly DDoS: Users were unable to access Feedly for almost a full day in January 2022 due to a distributed denial-of-service attack.
- Cloudflare R2 Storage Incident: Improper data isolation settings exposed sensitive customer data in August 2022.
- Slack Outage: In 2021, a configuration change caused Slack to go down worldwide for several hours, disrupting remote work for millions.
- Zoom Security: We probably all remember 2020’s “Zoom bombing,” when various privacy/security flaws emerged as usage spiked during the pandemic.
These examples demonstrate the importance of availability, security, backups, patching, and customer trust when issues occur. Disasters can significantly damage SaaS companies, but most recover if problems are addressed quickly and transparently.
How to Avoid SaaS Disasters
Here are some ways SaaS companies can help prevent major service disruptions.
- Implement Strong Backup Systems: The first line of defense is to perform regular backups and store copies offsite or offline where possible.
- Audit Access Controls: Limit employee access. Aim for zero trust.
- Deliver Frequent Small Releases: Reduce risk of large-scale failures using CI/CD.
- Architect for High Availability: Build self-healing systems. Decouple components, and distribute workloads. Replicate across data centers and regions.
- Add Monitoring and Alerting: All systems should attempt to detect anomalies early.
- Test Disaster Recovery Plans: If you don’t have one, make one, and when you do, simulate different failure scenarios.
- Perform Regular Penetration Testing: Identify and patch vulnerabilities.
- Chaos Engineering: Intentionally inject failures into systems to test resilience [see Chaos Monkey!].
- Get Insurance: Time for an insurance check-up! Protect your organization against financial losses.
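To make the high-availability and chaos engineering items a little more concrete, here is a minimal Python sketch. It is not how Chaos Monkey or any particular tool works; the replica names, the `ReplicaDown` exception, and the helper functions are all hypothetical, and the "replicas" are plain in-process functions rather than real services. It only illustrates the idea: inject failures on purpose, then measure whether failover across redundant copies keeps requests succeeding.

```python
import random
from typing import Callable, Sequence


class ReplicaDown(Exception):
    """Raised when a simulated replica fails to answer (hypothetical)."""


def make_replica(name: str, failure_rate: float,
                 rng: random.Random) -> Callable[[str], str]:
    """Return a fake replica that fails with probability `failure_rate`."""
    def handle(request: str) -> str:
        if rng.random() < failure_rate:  # the injected fault
            raise ReplicaDown(name)
        return f"{name} served {request}"
    return handle


def call_with_failover(replicas: Sequence[Callable[[str], str]],
                       request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            errors.append(str(exc))
    raise RuntimeError(f"all replicas failed: {errors}")


def run_experiment(failure_rate: float, trials: int = 1000,
                   seed: int = 42) -> float:
    """Return the fraction of requests that succeed under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    replicas = [make_replica(f"region-{i}", failure_rate, rng)
                for i in range(3)]
    succeeded = 0
    for n in range(trials):
        try:
            call_with_failover(replicas, f"req-{n}")
            succeeded += 1
        except RuntimeError:
            pass  # every replica was down for this request
    return succeeded / trials


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.6):
        print(f"failure rate {rate:.0%}: success {run_experiment(rate):.1%}")
```

With three independent replicas that each fail 30% of the time, a request is lost only when all three fail at once, so the success rate should land near 1 - 0.3³, or about 97%. The same kind of experiment, run against real infrastructure under controlled conditions, is what tells you whether your redundancy actually works.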
How to Respond to SaaS Disasters
No SaaS provider can achieve 100% uptime and total prevention of disasters. If you find your organization in the grip of one of these disastrous scenarios, address it head-on.
- System Status Communication: It’s mandatory that you provide transparent updates during incidents. Own it ASAP. Immediately inform customers of issues through a status page and other notifications. Provide regular updates. Also, keep communication open internally – reassure employees and outline your response plan.
- Write Detailed Postmortems: For any incidents, get a full assessment from engineering on root cause, scope and restoration timeline. Work to understand the causes so you can improve your processes. If caused by a vulnerability or attack, engage security consultants to assess and improve defenses.
- Rebuild Customer Trust: Be cautious about over-promising on resolution timeframes. Under-promise, over-deliver. When the incident is resolved, explain your postmortem in user terms. Ensure all client data is restored and intact, and verify this with customers.
- Fulfill Customer SLAs: Consider customer reimbursement, credits, or SLA-based penalties if appropriate.
- Consider Public Relations: Evaluate a professional PR response if the incident gains media coverage.
After a SaaS service disruption, being transparent, providing frequent updates, apologizing sincerely, and learning from mistakes can help rebuild client confidence.
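For the SLA item above, the arithmetic is simple enough to sketch. The Python below turns downtime into a monthly availability percentage and looks up a service credit. The credit tiers are invented purely for illustration; real thresholds and credit amounts come from your own customer contracts.

```python
from datetime import timedelta

# Hypothetical tiers: (minimum availability %, credit as % of monthly fee).
# Ordered from best availability to worst; not taken from any real SLA.
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25), (0.0, 50)]


def availability_pct(downtime: timedelta, period: timedelta) -> float:
    """Return availability over `period` as a percentage."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    return 100.0 * (1 - downtime / period)


def credit_pct(availability: float) -> int:
    """Return the credit for the first tier whose floor is met."""
    for floor, credit in CREDIT_TIERS:
        if availability >= floor:
            return credit
    return CREDIT_TIERS[-1][1]


if __name__ == "__main__":
    month = timedelta(days=30)
    for outage in (timedelta(hours=1), timedelta(hours=26)):
        pct = availability_pct(outage, month)
        print(f"{outage} down: {pct:.2f}% available, "
              f"{credit_pct(pct)}% credit")
```

Under these assumed tiers, a single hour of downtime in a 30-day month already drops availability below 99.9%, and an outage lasting just over a day falls below 97%. Running the numbers before an incident helps you know what you are committing to.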
Handling SaaS Disasters
Major outages and security incidents can significantly damage SaaS companies. While you can’t prevent all disasters, you can minimize risks through redundancy, good processes, and transparency. Prevention is ideal. Architect for high availability, use controlled releases, implement monitoring, schedule regular backups, and conduct penetration tests. Learn to expect the unexpected. If disaster strikes, get a full assessment of root cause and restoration timeline. Communicate openly internally and externally. Messaging is critical. Thank customers for their loyalty and patience.