Migration Rollback Playbooks

Rollback runs as a gated sequence: a breached threshold forces an explicit go/no-go before parallel DNS and redirect reversals, then verification.

Executive Summary

A rollback is a pre-authorised, rehearsed reversal of a migration to a known-good state, not an improvised scramble. This runbook gives the release owner, on-call SRE, and SEO lead a single source of truth: when to abort, who decides, and the exact order in which DNS, redirects, and server configuration revert. The outcome is a bounded recovery window — typically under 15 minutes for redirect-layer faults and under one TTL cycle for DNS faults — with link equity and crawl signals preserved. Treat every cutover as reversible until the post-launch soak window closes.

Prerequisites

A pre-migration snapshot of every authoritative zone file, redirect map, and server config under version control
Lowered DNS TTLs in place before cutover (see TTL planning below)
Access to DNS provider APIs (Cloudflare API token, aws CLI with Route 53 permissions)
Server reload rights (nginx -t, systemctl reload) on every origin and edge node
Baseline metrics for error rate, p95 latency, conversion, and crawl errors captured before launch
A named decision-maker with explicit authority to call the rollback

Step-by-Step Execution

1. Define Objective Rollback Triggers

Codify the numeric signals that force a reversal before launch, never during the incident. Set ceilings for 5xx/4xx rate, p95 latency, traffic drop, conversion drop, and crawl-error spikes, each tied to a monitoring window. Document these in Rollback Trigger Thresholds so on-call staff act on data, not intuition.
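
The trigger logic can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring integration: the `Trigger` class, the metric names, and the ceiling values are all hypothetical, and a window is modelled as a count of consecutive samples rather than a time span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """A numeric ceiling that must be exceeded for a full window to fire."""
    metric: str
    ceiling: float        # hypothetical limit, in the same unit as the samples
    window_samples: int   # consecutive samples that must all exceed the ceiling


def breached(trigger: Trigger, samples: list[float]) -> bool:
    """True only if the most recent `window_samples` readings all exceed the ceiling."""
    if len(samples) < trigger.window_samples:
        return False  # not enough data to cover the window yet
    recent = samples[-trigger.window_samples:]
    return all(value > trigger.ceiling for value in recent)


def fired_triggers(triggers: list[Trigger], readings: dict[str, list[float]]) -> list[str]:
    """Names of every metric whose trigger is currently breached."""
    return [t.metric for t in triggers if breached(t, readings.get(t.metric, []))]


# Illustrative values only; real ceilings belong in Rollback Trigger Thresholds.
TRIGGERS = [
    Trigger("5xx_rate_pct", ceiling=1.0, window_samples=5),
    Trigger("p95_latency_ms", ceiling=800.0, window_samples=5),
]

if __name__ == "__main__":
    readings = {
        "5xx_rate_pct": [0.2, 1.4, 1.6, 1.9, 2.2, 2.5],    # sustained breach
        "p95_latency_ms": [410, 420, 900, 430, 415, 405],  # single spike only
    }
    print(fired_triggers(TRIGGERS, readings))  # ['5xx_rate_pct']
```

Requiring every sample in the window to exceed the ceiling keeps a single spike from forcing an abort, which is why the latency series above does not fire.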

2. Assign Decision Authority

Name one accountable owner per phase with the power to abort, plus a clear escalation tier if recovery stalls. Ambiguity here is a common cause of prolonged outages during failed migrations. Pre-agree that the owner’s call is final within the soak window to avoid debate while error rates climb.

3. Prepare the DNS Rollback Path

Keep the legacy A/AAAA/CNAME values staged as a ready-to-apply change set so reversion is a single API call. The speed of this path depends entirely on the TTL you set beforehand via TTL Optimization Strategies. Detailed reversion steps live in DNS Rollback Procedures.

4. Prepare the Redirect Rollback Path

Stage the previous redirect map and server config as an atomic swap so the routing layer reverts without downtime. Coordinate this with your live URL Mapping & Redirect Architecture so the restored rules match the last verified state. Recovery specifics — config restore, cache purge — are in Redirect Rollback & Recovery.
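
One way to make the swap atomic is to build the new symlink under a temporary name and rename it over the active path. The sketch below assumes a POSIX filesystem; `swap_symlink` and `redirects.new.conf` are hypothetical names, and the demo runs in a throwaway directory rather than under /etc/nginx.

```python
import os
import tempfile
from pathlib import Path


def swap_symlink(active: Path, target: Path) -> None:
    """Point `active` at `target` atomically.

    The new link is created under a temporary name in the same directory and
    then renamed over `active`. On POSIX the rename is atomic, so a reader
    sees either the old link or the new one, never a missing file.
    """
    if not target.exists():
        raise FileNotFoundError(f"rollback target missing: {target}")
    tmp = active.with_name(active.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear leftovers from an interrupted swap
    os.symlink(target, tmp)
    os.replace(tmp, active)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        maps = Path(workdir)
        prev = maps / "redirects.prev.conf"
        new = maps / "redirects.new.conf"        # hypothetical post-cutover map
        active = maps / "redirects.active.conf"
        prev.write_text("/old-path /legacy-target;\n")
        new.write_text("/old-path /new-target;\n")
        os.symlink(new, active)      # state after cutover
        swap_symlink(active, prev)   # rollback
        print(os.readlink(active))   # path ending in redirects.prev.conf
```

The swap only changes what the file path resolves to; the server still needs a validated reload before the restored rules take effect.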

5. Rehearse the Reversal in Staging

Run a full rollback drill against staging and time each step end to end. A rollback that has never been executed is an assumption, not a plan. Capture the wall-clock duration of DNS reversion plus one TTL cycle to set realistic recovery expectations with stakeholders.
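
A drill timer can be as simple as the sketch below. The step names and the `time.sleep` placeholders are hypothetical stand-ins for the real reversion commands, and the 60-second TTL is an assumed value; the point is only to record each step's wall-clock duration and add one TTL cycle on top.

```python
import time
from typing import Callable


def time_drill(steps: list[tuple[str, Callable[[], None]]]) -> dict[str, float]:
    """Run each drill step in order and return its wall-clock duration in seconds."""
    durations: dict[str, float] = {}
    for name, action in steps:
        start = time.perf_counter()
        action()
        durations[name] = time.perf_counter() - start
    return durations


def worst_case_recovery_seconds(durations: dict[str, float], dns_ttl_seconds: int) -> float:
    """Hands-on time (steps run back to back) plus one TTL cycle for caches to expire."""
    return sum(durations.values()) + dns_ttl_seconds


if __name__ == "__main__":
    # Placeholder steps; a real drill would call the staged rollback commands.
    steps = [
        ("dns_reversion", lambda: time.sleep(0.2)),
        ("redirect_swap", lambda: time.sleep(0.1)),
        ("verification", lambda: time.sleep(0.1)),
    ]
    measured = time_drill(steps)
    for name, seconds in measured.items():
        print(f"{name}: {seconds:.2f}s")
    print(f"worst case: {worst_case_recovery_seconds(measured, dns_ttl_seconds=60):.0f}s")
```

Summing the steps gives a conservative figure; running DNS and redirect reversal in parallel during a real incident should come in under it.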

6. Execute and Verify Recovery

On a triggered abort, run DNS and redirect rollbacks in parallel, then verify with dig and curl against the legacy targets. Confirm error rate and latency return to baseline before declaring recovery. Reconcile crawl and index signals afterward through Search Console Handover to catch residual indexing drift.
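
Running the two reversals in parallel might look like the following sketch. `revert_dns` and `revert_redirects` are hypothetical placeholders for the real API call and config swap; the helper starts both at once and records a failure in one layer without discarding the result from the other.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_in_parallel(tasks: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Start every rollback task at once and collect each result or error."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # keep the other layer's outcome visible
                results[name] = f"FAILED: {exc}"
    return results


def revert_dns() -> str:
    """Placeholder: a real version would apply the staged DNS change set."""
    return "dns reverted"


def revert_redirects() -> str:
    """Placeholder: a real version would swap the redirect map and reload."""
    return "redirects reverted"


if __name__ == "__main__":
    outcome = run_in_parallel({"dns": revert_dns, "redirects": revert_redirects})
    for layer, status in outcome.items():
        print(f"{layer}: {status}")
```

Verification should start only after both entries report success; a `FAILED` entry is a cue to escalate, not to retry blindly.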

7. Communicate Status Continuously

Post a single authoritative status to stakeholders at trigger, mid-rollback, and recovery confirmation. Tie this to the broader plan in Pre-Migration Auditing & Risk Assessment so the rollback record feeds the post-mortem. Silence during a rollback erodes trust faster than the outage itself.

Technical Configs

AWS Route 53 — revert an A record to the legacy origin via the CLI:

# UPSERT overwrites the live record with the staged legacy value
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456ABCDEFG \
  --change-batch file://rollback-a-record.json
# rollback-a-record.json contains action UPSERT, name www.example.com, value 203.0.113.10
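
The change batch itself can be generated ahead of time so it sits in version control with the rest of the snapshot. The sketch below builds the JSON structure that `change-resource-record-sets` expects; the helper name is hypothetical, and the 60-second TTL is an assumption to be replaced with the value chosen during TTL planning.

```python
import json


def build_rollback_batch(name: str, legacy_ip: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that UPSERTs an A record to the legacy IP."""
    return {
        "Comment": "Rollback: restore legacy A record",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "TTL": ttl,  # assumed value; match your pre-cutover TTL plan
                    "ResourceRecords": [{"Value": legacy_ip}],
                },
            }
        ],
    }


if __name__ == "__main__":
    batch = build_rollback_batch("www.example.com", "203.0.113.10")
    # Redirect this output to rollback-a-record.json before the cutover.
    print(json.dumps(batch, indent=2))
```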

Cloudflare API — patch a DNS record back to the previous IP:

# Restore the pre-cutover A record in a single call
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE/dns_records/$REC" \
  -H "Authorization: Bearer $CF_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"www","content":"203.0.113.10","ttl":60}'

Nginx — atomic redirect config swap with validation gate:

# Symlink the previous known-good map, then validate before reload
# ln -sfn /etc/nginx/maps/redirects.prev.conf /etc/nginx/maps/redirects.active.conf
map $request_uri $rollback_target {
    include /etc/nginx/maps/redirects.active.conf;
}
# nginx -t && systemctl reload nginx   (reload is zero-downtime)

Apache — restore prior rewrite rules and test config:

# Swap in the archived rules file, then validate syntax before graceful restart
# cp /etc/apache2/conf-available/redirects.prev.conf /etc/apache2/conf-available/redirects.conf
RewriteEngine On
RewriteMap legacy "txt:/etc/apache2/maps/legacy-map.txt"
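# Redirect only when the map has an entry; an unmatched key expands to an empty string
RewriteCond ${legacy:$1} ^(.+)$
RewriteRule ^(.*)$ %1 [R=301,L]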
# apachectl configtest && apachectl graceful

Validation & Rollback

Confirm the migration is healthy before closing the soak window, and verify the legacy state is fully restored after any reversal.

Post-Rollback Validation Checklist:

Confirm dig +short www.example.com returns the legacy IP from at least three public resolvers (for example @1.1.1.1, @8.8.8.8, and @9.9.9.9)
Verify curl -sI https://www.example.com | grep -i location shows pre-migration redirect targets
Confirm 5xx rate returns to or below the pre-launch baseline in the monitoring dashboard
Validate p95 latency is within the pre-launch ceiling
Purge CDN edge cache and confirm fresh objects via cf-cache-status: MISS then HIT
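
The first two checks lend themselves to a small script. The sketch below only parses output that has already been captured from `dig +short` and `curl -sI`; the helper names are hypothetical, and the sample strings (including the 198.51.100.20 address standing in for the new origin) are illustrative rather than real command output.

```python
from typing import Optional


def stale_resolvers(answers: dict[str, str], legacy_ip: str) -> list[str]:
    """Return the resolvers whose `dig +short` output is not exactly the legacy IP."""
    stale = []
    for resolver, output in answers.items():
        ips = [line.strip() for line in output.splitlines() if line.strip()]
        if ips != [legacy_ip]:
            stale.append(resolver)
    return stale


def location_header(raw_headers: str) -> Optional[str]:
    """Extract the Location value from `curl -sI` output, or None if absent."""
    for line in raw_headers.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "location":
            return value.strip()
    return None


if __name__ == "__main__":
    # Illustrative captures; in practice these come from the commands above.
    answers = {
        "1.1.1.1": "203.0.113.10\n",
        "8.8.8.8": "203.0.113.10\n",
        "9.9.9.9": "198.51.100.20\n",  # still serving the post-cutover record
    }
    print(stale_resolvers(answers, "203.0.113.10"))  # ['9.9.9.9']

    headers = "HTTP/1.1 301 Moved Permanently\r\nLocation: https://www.example.com/legacy\r\n\r\n"
    print(location_header(headers))  # https://www.example.com/legacy
```

An empty list from `stale_resolvers` means every sampled resolver has picked up the reverted record; anything else means the TTL countdown is still running for at least one cache.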

Common Pitfalls:

Setting triggers as vague “if it looks bad” criteria instead of numeric ceilings with windows
Forgetting that DNS rollback speed is bounded by the TTL set before cutover, not by how quickly the reversion is applied
No single named owner, causing decision paralysis while errors compound
Reverting redirects without purging CDN cache, so the edge keeps serving cached redirect responses from the failed cutover
Treating the rollback as untested theory rather than a rehearsed drill

Rollback Protocol:

Trigger fires when any threshold in Rollback Trigger Thresholds is breached for its full window
Named owner issues the abort and broadcasts the first status update
Apply the staged DNS reversion via Route 53 / Cloudflare API and start the TTL countdown
Swap redirect config atomically (nginx -t && systemctl reload nginx) and purge CDN cache
Verify legacy responses with dig and curl, confirm metrics return to baseline, then broadcast recovery

FAQ

How fast can a DNS rollback actually take effect? No faster than the TTL set before cutover allows. If you lowered the TTL to 60 seconds beforehand, most recursive resolvers adopt the reverted record within a few minutes; if the TTL was still 86400 seconds, some resolvers cache the broken record for up to a day. Lower the TTL at least one full period of the old TTL before migrating, so caches holding the long-lived record have expired by cutover; lowering it after the fault appears does not help.

Who should have authority to call a rollback? One named release owner per phase, with a documented escalation tier. The owner acts on the pre-agreed numeric triggers, not on opinion, and their call is final within the soak window. This removes the debate that otherwise lets error rates climb while stakeholders argue.

Should DNS and redirect rollbacks happen in sequence or in parallel? In parallel. They affect independent layers (name resolution versus HTTP routing), and serialising them makes the hands-on recovery time the sum of both steps rather than the longer of the two. Run the DNS API change and the atomic redirect swap simultaneously, then verify both.

How long should the post-launch soak window stay open? Keep rollback fully armed for at least 72 hours, the period during which the majority of crawl re-indexing and traffic-pattern anomalies surface. High-traffic enterprise moves often hold the window for a full week before declaring the migration irreversible.

What is the difference between a rollback and a hotfix? A rollback returns the system to the previously known-good state wholesale; a hotfix patches forward. During the soak window, prefer rollback — it is rehearsed, bounded, and reversible — and reserve forward fixes for after the migration is declared stable.
