Defining Error-Rate Thresholds That Trigger Rollback

Problem Statement

“Roll back if errors get bad” is not a plan — under pressure it produces hesitation while error rates climb. You need concrete error-rate numbers, evaluated over fixed windows, that page the right person or auto-fire the reversal. This is the detailed companion to Rollback Trigger Thresholds, covering how to set 5xx and 4xx ceilings, choose windows, and wire alerting that turns a metric into an action.

Three pre-agreed zones turn a live error-rate metric into an unambiguous action: continue, escalate, or revert.

When to Use This Approach

You are in the soak window after a cutover and need unambiguous abort criteria.
Your monitoring exposes per-status-code request rates at edge and origin.
You want certain breaches to auto-fire and others to require owner confirmation.
You need to distinguish origin failures (5xx) from broken routing (4xx).
Stakeholders demand a defensible, pre-agreed basis for any rollback decision.

Step-by-Step Instructions

1. Separate the 5xx and 4xx Ceilings

Treat the two error classes distinctly: 5xx signals origin/application failure, 4xx signals broken redirects or missing resources. Set a 5xx ratio ceiling (e.g. 2% over 5 minutes) and a separate 4xx ceiling relative to baseline.

# 5xx ratio across all requests, evaluated for a 5-minute window
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.02

2. Choose Windows That Filter Noise but Bound Damage

Pick a window long enough to ignore a cache-warm blip yet short enough to cap exposure. Five minutes suits 5xx; ten minutes suits 4xx, which is noisier from crawlers.

# Prometheus: require the breach to persist for the full window before firing
- alert: Rollback5xxBreach
  expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
  for: 5m
  labels: {severity: page, action: rollback}

3. Add a Fast Auto-Fire Tier for Severe Breaches

Define a second, stricter tier that auto-triggers without confirmation when the breach is unambiguous, so catastrophic failures do not wait on a human.

# Severe tier: 5xx above 5% for 2 minutes auto-triggers the reversal
- alert: Rollback5xxSevere
  expr: sum(rate(http_requests_total{code=~"5.."}[2m])) / sum(rate(http_requests_total[2m])) > 0.05
  for: 2m
  labels: {severity: critical, action: auto-rollback}

4. Wire Alerting to a Pager, Not a Dashboard

Route the breach to the named rollback owner via pager and a synthetic backstop, so detection never depends on someone watching a screen.

# Synthetic backstop: page if the critical path returns 5xx three times in a row
fails=0
while true; do
  c=$(curl -s -o /dev/null -w '%{http_code}' https://www.example.com/checkout)
  [ "${c:0:1}" = "5" ] && fails=$((fails+1)) || fails=0
  [ "$fails" -ge 3 ] && echo "PAGE: sustained 5xx ($c)" && break
  sleep 20
done

Worked Example

A retailer cut over its checkout flow at 09:00. Baseline 5xx was 0.1% and 4xx was 1.2%. Thresholds were set at 5xx > 2% for 5 minutes (page), 5xx > 5% for 2 minutes (auto-rollback), and 4xx > baseline + 10 points for 10 minutes (page).

At 09:06 the new origin’s connection pool saturated; 5xx climbed to 3.4% and held. The Rollback5xxBreach alert fired at 09:11 after the full 5-minute window, paging the release owner, who confirmed the abort. Separately, a broken redirect map pushed 4xx from 1.2% to 13%, breaching the 4xx ceiling at 09:16 — this pointed at routing rather than origin, so recovery used Redirect Rollback & Recovery rather than a DNS revert. Had 5xx instead spiked to 6%, Rollback5xxSevere would have auto-fired at 09:08, skipping the confirmation step entirely. The two distinct error classes drove two distinct recovery paths, exactly as the threshold design intended.

Verification

Inject a controlled 5xx burst in staging and confirm Rollback5xxBreach fires only after the full 5-minute window, not on a transient spike.
Confirm Rollback5xxSevere auto-fires within ~2 minutes on a >5% burst, and that the page reaches the owner’s device.
Drive a test path to 404 and confirm the 4xx alert fires while 5xx stays quiet, proving the classes are independent.

FAQ

What is a sensible default 5xx threshold for a migration soak? A ratio of 2% sustained over 5 minutes for the paging tier, plus a 5% over 2 minutes auto-fire tier. These work as absolute floors even without a clean baseline; tighten them toward baseline + a small margin once you have stable pre-migration numbers.

Why evaluate 4xx separately from 5xx? They indicate different failures. A 5xx surge means the origin or application is failing and a DNS or origin rollback is likely; a 4xx surge usually means redirects or resources are broken and the fix lives in the routing layer. Conflating them sends you down the wrong recovery path.

Should every error-rate breach automatically roll back? No. Reserve auto-fire for severe, unambiguous breaches. Moderate breaches should page the named owner for a confirmation call, because softer rises can stem from benign causes like a crawler hitting deprecated paths or analytics tagging breaking during the move.

← Back to Rollback Trigger Thresholds