Implementing Blue-Green Deployments for Site Migrations

Problem Statement

A single-environment migration forces you to break the old site to build the new one, so any defect surfaces in production with no clean way back: your only option is a forward fix under pressure while users see errors. DNS propagation delays split traffic across environments, stale CDN caches serve broken assets, and database writes can land in both places during the overlap. A blue-green deployment removes the all-or-nothing risk: you stand up the new environment (green) alongside the live one (blue), validate strict parity, then shift traffic gradually with an instant revert path back to blue. The discipline that makes it work is twofold: green must be a genuine clone (config, data, headers, session keys), and the rollback must be automated against numeric thresholds so the decision to revert is not a panicked judgement call at 2am. This page sits under Zero-Downtime Cutover Plans and covers building the two environments to strict parity, shifting traffic with a canary stage, and wiring rollback triggers that fire automatically.

[Figure: Blue-Green Traffic Shift. A DNS / LB router (weighted routing) sits in front of Blue (current, legacy production) and Green (new, migration target); rollback on 5xx > 2%.]
The router weight-shifts traffic from blue to green; breaching an error or latency threshold returns it to blue instantly.

When to Use This Approach

  • Downtime is unacceptable and you need an instant revert path rather than a forward-fix.
  • You can afford to run two parallel environments for the duration of the cutover.
  • Your routing layer (DNS, load balancer, or CDN) supports weighted traffic shifting.
  • Database and session state can be kept consistent across both environments during overlap.
  • You want automated, threshold-driven rollback instead of a manual judgement call under pressure.
  • You have observability (APM or log analysis) able to compute 5xx rate and p95 latency in near real time, since those numbers drive the rollback decision.

Step-by-Step Instructions

1. Establish Infrastructure Parity

Green must match blue before any traffic moves. Define both with infrastructure-as-code and diff the rendered config to catch drift in server blocks, headers, and routing.

# Fail fast on any difference between live and target server config
diff -rq /etc/nginx/sites-available/ /staging/etc/nginx/sites-available/
# Tag responses so you can see which environment served a request.
# add_header sets a response header; proxy_set_header would only change the request sent upstream.
# (in the green server block)  add_header X-Environment green always;
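
Config files can match while the responses they produce still differ, so it may help to diff the live response headers as well. The following is a minimal sketch, not a definitive implementation: normalize_headers is a hypothetical helper, and the list of headers it ignores (date, age, etag, set-cookie, x-environment) is an assumption you would tune for your stack.

```shell
#!/usr/bin/env bash
# Normalise a raw header dump (as printed by `curl -sI`) so two environments can be diffed.
# Header names are lowercased, volatile headers are dropped, and the result is sorted.
normalize_headers() {
  tr -d '\r' |
    awk -F': *' 'NF > 1 {                      # skips the status line and blank lines
      name = tolower($1)
      # Assumed list of headers that legitimately differ between environments
      if (name ~ /^(date|age|etag|set-cookie|x-environment)$/) next
      sub(/^[^:]*: */, "")                     # strip "Name: " and keep the full value
      print name ": " $0
    }' | sort
}

# Usage (BLUE_URL and GREEN_URL are placeholders for direct-to-origin URLs):
#   diff <(curl -sI "$BLUE_URL" | normalize_headers) \
#        <(curl -sI "$GREEN_URL" | normalize_headers)
```

An empty diff suggests header parity; any line that appears on one side only, such as a missing x-robots-tag, is worth resolving before traffic moves.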

2. Synchronise Data and Files

Mirror the database and file system from blue to green, then verify integrity before the switch. For the database integrity gate, follow Syncing Staging Databases Before Production Switch.

# Files: checksum-verified mirror, dry-run first to validate the delta
rsync -avz --checksum --delete --dry-run /var/www/html/ /mnt/green/html/
rsync -avz --checksum --delete         /var/www/html/ /mnt/green/html/
# Confirm identical signing keys / session store so logins survive the switch
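
To verify the mirror independently of rsync, one option is to generate a checksum manifest on blue and check it on green. This is a sketch under stated assumptions: make_manifest is a hypothetical helper, it relies on GNU find, sort, xargs, and sha256sum, and the paths in the usage comment are the ones from the rsync commands above.

```shell
#!/usr/bin/env bash
# Print a sorted SHA-256 manifest for every regular file under a directory.
# Paths are recorded relative to that directory so the manifest can be
# checked from a different mount point on green.
make_manifest() {
  local root=$1
  (
    cd "$root" &&
      find . -type f ! -name manifest.sha256 -print0 |
      sort -z |
      xargs -0 -r sha256sum
  )
}

# Usage:
#   make_manifest /var/www/html > /tmp/manifest.sha256
#   (cd /mnt/green/html && sha256sum -c --quiet /tmp/manifest.sha256)
```

With --quiet, sha256sum prints only the files that fail, so silent output with a zero exit status indicates that every listed file matches.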

3. Pre-Stage DNS and Shift Traffic

Lower TTL ahead of time so the shift, and any revert, propagates in minutes, then move weight from blue to green in stages rather than all at once. Start with a small canary (for example 10%), hold it long enough to observe real traffic against your error and latency budget, then ramp to 50% and 100%. Track adoption with Monitoring Global DNS Propagation During Cutover and hold full cutover until convergence.

# Execute the weighted record change, then watch propagation
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE" --change-batch file://cutover.json   # blue -> green
watch -n 10 'dig +noall +answer example.com @8.8.8.8'           # TTL column should read 60 or less
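
The change batch itself is not shown above, so here is one possible shape for it. It is a sketch, assuming a pair of Route 53 weighted A records distinguished by SetIdentifier; make_change_batch is a hypothetical helper, and the record name and IP addresses are the illustrative values from the worked example below.

```shell
#!/usr/bin/env bash
# Print a Route 53 change batch that sets the blue/green weights for one stage.
# Route 53 weights are integers from 0 to 255; each record receives
# weight / (sum of weights) of the DNS responses.
make_change_batch() {
  local blue_weight=$1 green_weight=$2
  cat <<EOF
{
  "Comment": "blue=${blue_weight} green=${green_weight}",
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "example.com.", "Type": "A", "TTL": 60,
        "SetIdentifier": "blue", "Weight": ${blue_weight},
        "ResourceRecords": [ { "Value": "198.51.100.5" } ] } },
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "example.com.", "Type": "A", "TTL": 60,
        "SetIdentifier": "green", "Weight": ${green_weight},
        "ResourceRecords": [ { "Value": "203.0.113.20" } ] } }
  ]
}
EOF
}

# Usage: one file per stage, plus the revert payload authored in advance
#   make_change_batch 90 10 > cutover.json
#   make_change_batch 100 0 > rollback.json
```

Generating every stage from one template keeps the canary, ramp, and rollback payloads structurally identical, so a revert that was tested at one weight should behave the same at any other.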

4. Purge Caches and Validate Live

Purge the CDN after the flip so the edge re-fetches from green, then validate SEO-critical headers and asset integrity in real time.

# Purge edge, then confirm canonical/robots headers and asset hashes
# ($CF_ZONE_ID is the Cloudflare zone ID, which is not the Route 53 hosted zone ID in $ZONE)
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/purge_cache" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"purge_everything": true}'   # full purge
curl -sI https://example.com | grep -iE 'canonical|x-robots-tag'
sha256sum -c manifest.sha256   # every asset matches the green manifest

Worked Example

An agency migrates example.com from a legacy VM (blue, 198.51.100.5) to a containerised stack (green, 203.0.113.20). Parity diffing in step 1 catches a missing X-Robots-Tag header on green that would have deindexed paginated pages; they add it before proceeding. After an rsync --checksum mirror and a database checksum match, they lower TTL to 60 s and push a Route 53 weighted record at 90% blue / 10% green.

The 10% canary holds for 15 minutes at a 0.3% 5xx rate and 240 ms p95, so they shift to 50/50, then 100% green. They purge the CDN and confirm the edge serves green:

curl -sI https://example.com/app.js | grep -i 'x-cache-status'
# x-cache-status: MISS   <- edge re-fetched from green after purge

Twenty minutes later an APM alert shows green’s checkout endpoint hitting 3.1% 5xx, above the 2% trigger. The pre-authored rollback.json re-points the record to blue, the CDN is purged again, and error rates fall to baseline within 4 minutes because the TTL was already at 60 s. The migration is retried the next night after fixing the checkout regression.

Verification

Confirm propagation, then confirm green is healthy before trusting it with full traffic.

# 1. All major resolvers agree on the green IP
for r in 8.8.8.8 1.1.1.1 208.67.222.222; do dig @"$r" example.com A +short; done
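# 2. Error and latency budget within thresholds (from APM/log analysis)
#    Assumes the combined log format, where the status code is field 9
awk '$9 ~ /^5/ {c++} END {printf "5xx: %d of %d (%.2f%%)\n", c, NR, (NR ? 100*c/NR : 0)}' access.log   # expect well under 2%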
# 3. SEO headers and robots parity match blue
curl -sI https://example.com/robots.txt | head -n 1   # expect 200

If thresholds are breached, revert per Rollback Trigger Thresholds rather than attempting a forward fix mid-cutover.
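
If no APM is available for the latency half of the budget, p95 can be approximated from logged response times. The following is a minimal sketch using the nearest-rank method; p95 is a hypothetical helper, and the usage line assumes your log format appends nginx's $request_time (in seconds) as the last field, which the default format does not do.

```shell
#!/usr/bin/env bash
# Nearest-rank 95th percentile of numbers read one per line from stdin.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # no samples: signal failure
      idx = int((NR * 95 + 99) / 100)      # ceil(0.95 * NR) without float rounding
      print v[idx]
    }'
}

# Usage (assumes $request_time in seconds is the last log field):
#   awk '{ print $NF * 1000 }' access.log | p95    # p95 in milliseconds
```

Nearest-rank always returns an observed sample rather than an interpolated value, which keeps the result easy to trace back to a real request when a threshold is breached.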

FAQ

How do I verify DNS propagation before committing the blue-green switch? Run dig +noall +answer example.com against several global resolvers (8.8.8.8, 1.1.1.1, 208.67.222.222) and check that the reported TTL is at or below the pre-configured 60 s, which shows the older long-lived record has expired from that resolver's cache. When every sampled resolver returns the green IP, propagation is effectively complete and you can commit the full shift.

What is the safest method to sync large media directories without downtime? Use rsync -avz --checksum --delete --bwlimit=5000 (limit in KiB per second) and run a --dry-run first to validate the delta calculation before the real transfer. Verify integrity with sha256sum -c manifest.sha256 against a manifest generated on blue before you cut over.

What triggers an automatic rollback in a blue-green migration? Trigger rollback on a 5xx error rate above 2% over a 5-minute window, p95 latency above 800 ms, or failed /healthz checks on three consecutive polls. The rollback script must run the DNS revert and the CDN purge as a single unattended step, so neither can be skipped or left half-done, and the rollback.json payload must be pre-authored and tested before the cutover window opens.
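
The three triggers above reduce to a single yes/no decision that a monitoring loop can evaluate. This is a sketch only: should_rollback is a hypothetical helper, and it assumes your APM or log pipeline supplies the 5-minute 5xx percentage, the p95 latency in milliseconds, and the count of consecutive failed /healthz polls.

```shell
#!/usr/bin/env bash
# Exit 0 (roll back) when any threshold is breached, exit 1 otherwise.
#   $1  5xx error rate over the last 5 minutes, in percent
#   $2  p95 latency in milliseconds
#   $3  consecutive failed /healthz polls
should_rollback() {
  local err_pct=$1 p95_ms=$2 health_fails=$3
  # awk handles the decimal comparison that plain shell arithmetic cannot
  awk -v e="$err_pct" -v p="$p95_ms" -v h="$health_fails" \
    'BEGIN { exit !(e > 2 || p > 800 || h >= 3) }'
}

# Usage, with the revert payload authored ahead of the cutover window:
#   if should_rollback "$ERR_PCT" "$P95_MS" "$HEALTH_FAILS"; then
#     aws route53 change-resource-record-sets \
#       --hosted-zone-id "$ZONE" --change-batch file://rollback.json
#   fi
```

With the worked example's figures, should_rollback 3.1 240 0 succeeds and triggers the revert, while the canary's 0.3 240 0 does not.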
