Disaster Recovery Plan

This document outlines the procedures to follow in the event of a major service disruption or disaster, such as a full production outage. The goal is to restore service as quickly and safely as possible.

For incidents related to a security breach, please refer to the Security Incident Response Plan.

Our architecture relies heavily on Cloudflare’s resilient infrastructure (Pages, Workers, R2, D1), which gives us a strong baseline of availability. However, disasters can still occur due to bad code deployments, upstream service failures, or security incidents.

Tiers of Incidents

  • Severity 1 (Critical): Complete service outage. The application is down for all users.
  • Severity 2 (Major): A core feature is non-functional, or a significant portion of users are affected.
  • Severity 3 (Minor): A non-critical feature is broken, or performance is degraded.

This plan primarily focuses on Severity 1 and Severity 2 incidents.

Incident Response Team

  • Incident Commander: The person responsible for coordinating the response effort.
  • Technical Lead: The senior engineer responsible for diagnosing and resolving the technical issue.
  • Communications Lead: The person responsible for communicating with internal and external stakeholders.

In a small team, these roles may be filled by the same person.

The “War Room”

In the event of a Severity 1 or 2 incident, a virtual “war room” will be established: a dedicated Slack channel plus a video call where the incident response team communicates in real time.

The Recovery Process

1. Detection and Alerting

  • Automated Alerts: An incident is typically declared when an automated alert fires (e.g., from Cloudflare monitoring, Sentry, or our logging service); a minimal health-check alert is sketched below.
  • Manual Declaration: Any team member can declare an incident if they discover a critical issue.
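
As a concrete example of the automated path, the sketch below shows a scheduled Cloudflare Worker that polls the production URL and posts to a Slack incoming webhook when the check fails. The `APP_URL` and `SLACK_WEBHOOK_URL` bindings, the cron schedule, and the Slack webhook itself are illustrative assumptions, not part of our current setup.

```typescript
// Minimal health-check Worker (sketch). Assumes @cloudflare/workers-types,
// a cron trigger in wrangler.toml (e.g. "*/5 * * * *"), and two hypothetical
// bindings: APP_URL (the production URL) and SLACK_WEBHOOK_URL (an incoming webhook).
export interface Env {
  APP_URL: string;
  SLACK_WEBHOOK_URL: string;
}

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext): Promise<void> {
    let healthy = false;
    try {
      // Any non-2xx response (or a network error) counts as unhealthy.
      const res = await fetch(env.APP_URL, { method: "GET" });
      healthy = res.ok;
    } catch {
      healthy = false;
    }

    if (!healthy) {
      // Alert the incident channel so a human can assess severity and declare an incident.
      await fetch(env.SLACK_WEBHOOK_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          text: `:rotating_light: Health check failed for ${env.APP_URL}. Possible Severity 1/2 incident.`,
        }),
      });
    }
  },
};
```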

2. Assessment and Communication

  • Initial Assessment (First 5-10 minutes):
    • The Incident Commander and Technical Lead work to understand the impact and scope of the outage.
    • The Communications Lead posts an initial message in the company-wide Slack channel acknowledging the issue (see the sketch after this list).
  • Status Page: For client-facing incidents, the Communications Lead will update our status page (if we have one).
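
To make the acknowledgement step concrete, here is a small sketch of the kind of message the Communications Lead might post, reusing the same hypothetical Slack incoming webhook as the alerting sketch above. The wording, the 30-minute update cadence, and the helper name are illustrative assumptions.

```typescript
// Sketch of the initial incident acknowledgement (assumed Slack incoming webhook).
type Severity = 1 | 2 | 3;

export async function postInitialAcknowledgement(
  webhookUrl: string,
  severity: Severity,
  summary: string,
): Promise<void> {
  const message = [
    `:warning: We are investigating a Severity ${severity} incident.`,
    `Impact: ${summary}`,
    `A war room has been opened. Next update in 30 minutes.`,
  ].join("\n");

  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
  });
  if (!res.ok) {
    throw new Error(`Slack webhook returned ${res.status}`);
  }
}
```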

3. Triage and Resolution

  • Identify the Cause: The technical team’s immediate goal is to identify the most likely cause (deep root-cause analysis can wait for the post-mortem). Common causes include:

    • A recent deployment: Was a change recently pushed to production?
    • Upstream provider outage: Are any of our critical dependencies (e.g., a database provider, an external API) down? Check their status pages.
    • Cloudflare outage: Is Cloudflare experiencing a system-wide issue? Check the Cloudflare status page (a status-check sketch follows this list).
  • The Quickest Path to Recovery: Rollback

    • In most cases, the fastest way to resolve an incident caused by a deployment is to roll back to a previous, known-good version.
    • How to Roll Back a Cloudflare Pages Deployment (a scripted alternative is sketched after this list):
      1. Navigate to your Pages project in the Cloudflare dashboard.
      2. Go to the Deployments tab.
      3. Find the last successful deployment.
      4. Click the “Rollback to this deployment” button.
    • Do not try to “fix forward” by pushing a new commit unless a rollback is not possible or would be more dangerous. The priority is to restore service, not to debug the problem in real time.
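
For the upstream and Cloudflare checks above, the status lookup can also be scripted. The sketch below assumes Cloudflare’s status page exposes the standard Statuspage JSON endpoint (api/v2/status.json); verify the URL and response shape before relying on it in automation.

```typescript
// Sketch: check whether Cloudflare is reporting a platform-wide issue.
// Assumption: www.cloudflarestatus.com serves the standard Statuspage JSON API.
interface StatuspageSummary {
  status: {
    indicator: "none" | "minor" | "major" | "critical";
    description: string;
  };
}

export async function cloudflareStatus(): Promise<StatuspageSummary["status"]> {
  const res = await fetch("https://www.cloudflarestatus.com/api/v2/status.json");
  if (!res.ok) {
    throw new Error(`Status page request failed: ${res.status}`);
  }
  const body = (await res.json()) as StatuspageSummary;
  // e.g. { indicator: "major", description: "Partial System Outage" }
  return body.status;
}
```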
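
If the dashboard is unavailable or the rollback needs to be scripted, the same rollback can be triggered through Cloudflare’s Pages API. The endpoint path, the token permission, and the helper name below are assumptions taken from the public API reference; confirm them before depending on this during an emergency.

```typescript
// Sketch: roll back a Pages deployment via the Cloudflare API instead of the dashboard.
// Assumptions: an API token with permission to edit Pages, and the documented
// .../deployments/{id}/rollback endpoint; verify against the current API reference.
const CF_API = "https://api.cloudflare.com/client/v4";

export async function rollbackPagesDeployment(
  accountId: string,
  projectName: string,
  deploymentId: string, // ID of the last known-good deployment
  apiToken: string,
): Promise<void> {
  const res = await fetch(
    `${CF_API}/accounts/${accountId}/pages/projects/${projectName}/deployments/${deploymentId}/rollback`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
    },
  );
  if (!res.ok) {
    throw new Error(`Rollback failed: ${res.status} ${await res.text()}`);
  }
}
```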

4. Post-Resolution

  • Verification: The team verifies that the service is fully restored.
  • Communication: The Communications Lead posts a final update announcing the resolution of the incident.
  • Post-Mortem:
    • A post-mortem meeting will be scheduled within 48 hours of the incident.
    • The goal of the post-mortem is not to assign blame, but to understand the root cause and identify preventative measures. We foster a blameless culture to encourage open and honest discussion.
    • The meeting will produce a written post-mortem document that includes:
      • A timeline of the incident.
      • The root cause(s).
      • What went well during the response.
      • What could be improved.
      • Action items to prevent the issue from happening again.