Defining Error-Rate Thresholds That Trigger Rollback

Problem Statement

β€œRoll back if errors get bad” is not a plan β€” under pressure it produces hesitation while error rates climb. You need concrete error-rate numbers, evaluated over fixed windows, that page the right person or auto-fire the reversal. This is the detailed companion to Rollback Trigger Thresholds, covering how to set 5xx and 4xx ceilings, choose windows, and wire alerting that turns a metric into an action.

Error-rate threshold zones A 5xx error-rate scale divided into three zones: healthy below 0.5 percent, warn between 0.5 and 2 percent, and auto-rollback above 2 percent over a five-minute window. 5xx Error-Rate Decision Zones evaluated over a rolling 5-minute window at the edge HEALTHY WARN β€” page owner AUTO-ROLLBACK < 0.5% 0.5% – 2% > 2% 0% 0.5% 2% 5%+ Baseline 5xx before cutover is typically < 0.1% β€” set ceilings against measured baseline, not zero.
Three pre-agreed zones turn a live error-rate metric into an unambiguous action: continue, escalate, or revert.

When to Use This Approach

  • You are in the soak window after a cutover and need unambiguous abort criteria.
  • Your monitoring exposes per-status-code request rates at edge and origin.
  • You want certain breaches to auto-fire and others to require owner confirmation.
  • You need to distinguish origin failures (5xx) from broken routing (4xx).
  • Stakeholders demand a defensible, pre-agreed basis for any rollback decision.

Step-by-Step Instructions

1. Separate the 5xx and 4xx Ceilings

Treat the two error classes distinctly: 5xx signals origin/application failure, 4xx signals broken redirects or missing resources. Set a 5xx ratio ceiling (e.g. 2% over 5 minutes) and a separate 4xx ceiling relative to baseline.

# 5xx ratio across all requests, evaluated for a 5-minute window
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) > 0.02

2. Choose Windows That Filter Noise but Bound Damage

Pick a window long enough to ignore a cache-warm blip yet short enough to cap exposure. Five minutes suits 5xx; ten minutes suits 4xx, which is noisier from crawlers.

# Prometheus: require the breach to persist for the full window before firing
- alert: Rollback5xxBreach
  expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
  for: 5m
  labels: {severity: page, action: rollback}

3. Add a Fast Auto-Fire Tier for Severe Breaches

Define a second, stricter tier that auto-triggers without confirmation when the breach is unambiguous, so catastrophic failures do not wait on a human.

# Severe tier: 5xx above 5% for 2 minutes auto-triggers the reversal
- alert: Rollback5xxSevere
  expr: sum(rate(http_requests_total{code=~"5.."}[2m])) / sum(rate(http_requests_total[2m])) > 0.05
  for: 2m
  labels: {severity: critical, action: auto-rollback}

4. Wire Alerting to a Pager, Not a Dashboard

Route the breach to the named rollback owner via pager and a synthetic backstop, so detection never depends on someone watching a screen.

# Synthetic backstop: page if the critical path returns 5xx three times in a row
fails=0
while true; do
  c=$(curl -s -o /dev/null -w '%{http_code}' https://www.example.com/checkout)
  [ "${c:0:1}" = "5" ] && fails=$((fails+1)) || fails=0
  [ "$fails" -ge 3 ] && echo "PAGE: sustained 5xx ($c)" && break
  sleep 20
done

Worked Example

A retailer cut over its checkout flow at 09:00. Baseline 5xx was 0.1% and 4xx was 1.2%. Thresholds were set at 5xx > 2% for 5 minutes (page), 5xx > 5% for 2 minutes (auto-rollback), and 4xx > baseline + 10 points for 10 minutes (page).

At 09:06 the new origin’s connection pool saturated; 5xx climbed to 3.4% and held. The Rollback5xxBreach alert fired at 09:11 after the full 5-minute window, paging the release owner, who confirmed the abort. Separately, a broken redirect map pushed 4xx from 1.2% to 13%, breaching the 4xx ceiling at 09:16 β€” this pointed at routing rather than origin, so recovery used Redirect Rollback & Recovery rather than a DNS revert. Had 5xx instead spiked to 6%, Rollback5xxSevere would have auto-fired at 09:08, skipping the confirmation step entirely. The two distinct error classes drove two distinct recovery paths, exactly as the threshold design intended.

Verification

  • Inject a controlled 5xx burst in staging and confirm Rollback5xxBreach fires only after the full 5-minute window, not on a transient spike.
  • Confirm Rollback5xxSevere auto-fires within ~2 minutes on a >5% burst, and that the page reaches the owner’s device.
  • Drive a test path to 404 and confirm the 4xx alert fires while 5xx stays quiet, proving the classes are independent.

FAQ

What is a sensible default 5xx threshold for a migration soak? A ratio of 2% sustained over 5 minutes for the paging tier, plus a 5% over 2 minutes auto-fire tier. These work as absolute floors even without a clean baseline; tighten them toward baseline + a small margin once you have stable pre-migration numbers.

Why evaluate 4xx separately from 5xx? They indicate different failures. A 5xx surge means the origin or application is failing and a DNS or origin rollback is likely; a 4xx surge usually means redirects or resources are broken and the fix lives in the routing layer. Conflating them sends you down the wrong recovery path.

Should every error-rate breach automatically roll back? No. Reserve auto-fire for severe, unambiguous breaches. Moderate breaches should page the named owner for a confirmation call, because softer rises can stem from benign causes like a crawler hitting deprecated paths or analytics tagging breaking during the move.

Related

← Back to Rollback Trigger Thresholds