Prepare for the Next Internet Outage
Last Thursday, the Internet broke. Again. Yes, the media turned a two-hour outage into a clickbait-friendly global crisis.
What made this incident significant was not just the disruption of Google Cloud but the hundreds of websites and applications that went down at the same time. These included some major ones like Cloudflare, which uses GCP for some of its services. Because Cloudflare is a widespread CDN, cache and proxy, its failure created a domino effect and broke, in turn, countless websites.
It reminds us of the fragile interconnectedness of our digital world. I don’t want to point fingers, but rather learn lessons from this incident. This wasn’t just a random hiccup; it highlighted fundamental principles that, in the age of “everything as a service,” we might have inadvertently overlooked.
Here are my key takeaways.
Do Not Put All Your Eggs in the Same Vendor Basket
The cloud means infinite scaling, infinite storage, infinite compute power, infinite flexibility. It is built on the promise of reducing costs (which can be true when used correctly). However, this promise hides the cloud's biggest and most overlooked risk: single vendor dependency. The recent outage showed how an outage at a single vendor, or even in one component of its infrastructure, can have a cascading effect on most services.
Now, let’s add to the mix that AWS, Azure and Google Cloud Platform have a combined market share of 63% in value. Even if your business does not use these infrastructure providers directly, chances are that you use vendors who rely on them, or vendors whose own vendors rely on them. Yes, chances are that your SaaS application is dependent on at least one of these providers.
What you can do:
- Map Your Dependencies: Do you truly know all the services your core product relies on, directly and indirectly? Which IaaS, PaaS, APIs, CDN, and so on are you using? What are they, in turn, using? Do you rely on npm to build your product? Is your app deployed with a GitHub Action? The more you know, the better prepared you are.
- Vendor Due Diligence: Uptime guarantees (3? 4? 5 nines?) are just marketing. Take them as such. What is your vendor’s architecture? Its continuity plan? Its transparency on incidents? Those are far more important criteria.
- Consider Multi-Cloud Strategies: You would not put all your servers in the same datacenter, would you? Then do not put all your infrastructure with the same IaaS provider! (If you would, you should do something about it!)
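To make the dependency mapping concrete, here is a minimal sketch in Python. Every name in it (my-app, auth-vendor, cloud-a and so on) is a hypothetical placeholder, not a real vendor relationship. It assumes you have already written down who depends on whom, and it simply walks that map to list indirect dependencies and to answer "what breaks if this provider goes down?".

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the listed items.
DEPENDENCIES = {
    "my-app": ["auth-vendor", "cdn-vendor", "ci-pipeline"],
    "auth-vendor": ["cloud-a"],
    "cdn-vendor": ["cloud-b"],
    "ci-pipeline": ["package-registry", "cloud-a"],
    "package-registry": ["cloud-b"],
}


def transitive_dependencies(graph, root):
    """Return every direct and indirect dependency of `root` (breadth-first)."""
    seen = set()
    queue = deque(graph.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited, which also protects against cycles
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


def impacted_by(graph, failed):
    """Return the services that depend, directly or indirectly, on `failed`."""
    return {name for name in graph if failed in transitive_dependencies(graph, name)}


print(sorted(transitive_dependencies(DEPENDENCIES, "my-app")))
print(sorted(impacted_by(DEPENDENCIES, "cloud-b")))
```

In this made-up map, the application never talks to cloud-b directly, yet an outage there reaches it through both the CDN and the build pipeline. That is the kind of surprise the mapping exercise is meant to surface.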
Own Your Data, Own Your Business
The cloud and API world we live in is great. It allows us to build fast, iterate quickly, test things and improve our solutions. Need authentication? Use Supabase or Auth0. Online payment? There is Stripe or PayPal. Transactional emails? SendGrid and Mailchimp. Search? Algolia. The list goes on, and it frees you to work on creating value.
Yet, as the outage showed, if these services become unavailable, your users might be locked out, or your application might cease to function, regardless of your own infrastructure’s health. This can lead to a significant loss of control over core business operations and data access. Third-party services ARE single points of failure!
What you can do:
- Fallback Mechanisms for Core Services: If a service becomes unavailable, how do you replace it? Can you develop an alternative to fall back on?
- Robust Data Mirroring: Ensure you have regular, accessible backups of your critical data, even if it primarily resides with a third party. Can you restore it quickly to a different environment if needed?
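A fallback for a core service can be as simple as trying providers in order. The sketch below is one possible shape for it, not a prescription. The two provider functions are hypothetical stand-ins (one simulates a vendor that is down, the other writes to an in-memory outbox), and a real integration would call each vendor's own client library and catch its specific error types.

```python
def send_with_fallback(message, providers):
    """Try each (name, send) pair in order; return the name that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical providers, for illustration only.
outbox = []


def primary_down(message):
    raise ConnectionError("primary provider unreachable")


def secondary_ok(message):
    outbox.append(message)


used = send_with_fallback(
    "Welcome!",
    [("primary", primary_down), ("secondary", secondary_ok)],
)
print(used)  # prints "secondary"
```

The order of the list is the policy: the preferred vendor comes first, and the alternative is only used when the first one raises an error. If every provider fails, the caller gets a single error that lists what went wrong with each.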
Build for Resilience
Resilience has always been a consequence of redundancy. You should always have a backup system that can keep the service running while your main system is down.
But redundancy alone is not enough. Your application should also be designed to be fault tolerant and to use all or part of the backup system when needed. At the very least, it should keep the impact on your users to a minimum: failing to send an email should never block your whole application.
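The email example can be sketched in a few lines of Python. This is an illustration under assumptions: the order flow, the function names and the in-memory retry queue are all hypothetical, and a production system would use a durable queue and catch the mail provider's specific errors. The point is only that the core step succeeds whether or not the secondary one does.

```python
from collections import deque


def notify(send_email, retry_queue, recipient, body):
    """Try to send an email; on failure, queue it for later instead of raising."""
    try:
        send_email(recipient, body)
        return True
    except Exception:  # a real system would catch the provider's own error types
        retry_queue.append((recipient, body))
        return False


def place_order(order_id, recipient, send_email, retry_queue):
    """Confirm the order even when the confirmation email cannot be sent."""
    order = {"id": order_id, "status": "confirmed"}  # the core step, stubbed here
    order["email_sent"] = notify(
        send_email, retry_queue, recipient, f"Order {order_id} confirmed"
    )
    return order


def broken_mailer(recipient, body):
    """Hypothetical mail provider that is currently down."""
    raise ConnectionError("mail provider unreachable")


pending = deque()
order = place_order(42, "user@example.com", broken_mailer, pending)
print(order["status"], order["email_sent"], len(pending))  # confirmed False 1
```

The user's order goes through, and the email waits in the queue until the provider is back.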
What you can do:
- Distributed Architectures: Design your systems with principles like microservices. Deploy your services on several IaaS providers. Replicate critical data across several providers. The goal is to limit the impact of any single component failure.
- Self-Healing Systems: Implement mechanisms that can automatically detect failures, reroute traffic, or restart services without human intervention. The quicker your system can react, the less impact an outage will have.
- Design for Failure: Don’t wait for an external event to expose your weaknesses. By then, it is too late. Add automated failure tests to your CI pipeline: what if the client has a 5-second latency with your server? What if the database is unavailable? What if a payment cannot be processed right away? What is the user experience like when something goes wrong? Those issues WILL happen.
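Automated failure tests do not have to be elaborate. As a rough sketch, the Python below injects two of the failures mentioned above (an unavailable database and a backend that times out) into a hypothetical load_dashboard function and checks that the user still gets a degraded page instead of an error. All names are made up for illustration. The test functions use plain assert statements and the test_ prefix, so a runner such as pytest could pick them up in CI, assuming that is what your pipeline uses.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical data layer when the database is unreachable."""


def load_dashboard(fetch_stats):
    """Return live stats, or a degraded page when the backend fails."""
    try:
        return {"status": "ok", "stats": fetch_stats()}
    except (DatabaseUnavailable, TimeoutError):
        # Degrade gracefully: the page still renders, without live numbers.
        return {"status": "degraded", "stats": None}


# Injected failures: stand-ins for a real outage or a slow network.
def failing_db():
    raise DatabaseUnavailable("connection refused")


def slow_backend():
    raise TimeoutError("no answer within 5 seconds")


def healthy_backend():
    return {"visits": 10}


def test_dashboard_survives_database_outage():
    assert load_dashboard(failing_db) == {"status": "degraded", "stats": None}


def test_dashboard_survives_slow_backend():
    assert load_dashboard(slow_backend)["status"] == "degraded"


def test_dashboard_happy_path():
    assert load_dashboard(healthy_backend) == {"status": "ok", "stats": {"visits": 10}}
```

Each test answers one of the "what if" questions before a real incident asks it for you.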
Conclusion
The next outage will come. That’s for sure. It may not be as big, but sooner or later one will affect your business.
Be prepared:
- Know your infrastructure, your vendors, their vendors, etc.
- Assess risks on a regular basis. Your app evolves, your vendors too. What is true at one moment is not at the next.
- Plan for the worst case. Incidents will happen. Your job is to make it so that the user experience is not impacted.