Rollback Trigger Thresholds

Context

A migration aborts on data, not on nerves. The hardest decision during a cutover is whether a spike in errors is transient noise or a genuine regression, and that decision must be settled in advance with numeric ceilings and fixed observation windows. This guide defines the trigger set — error rate, latency, traffic, conversion, and crawl errors — that on-call staff watch during the soak window. It sits under Migration Rollback Playbooks; the playbook owns the reversal mechanics, while this page owns the firing conditions.

Trigger evaluation windows Five metric ceilings each evaluated over a fixed window; a sustained breach fires the rollback, a transient spike does not. Trigger Evaluation Windows 5xx rate p95 latency Traffic drop Conversion Crawl errors > 2% / 5 min > 1500 ms / 5 min > 20% / 15 min > 15% / 30 min > 3x / 60 min Sustained breach across the window → fire rollback
Each metric has its own ceiling and window; only a breach sustained across the full window fires the rollback.

Pre-flight Checks

Establish baselines and alerting before launch so thresholds compare against real pre-migration behaviour, not guesses.

  • Capture 7-day baselines for 5xx rate, p95 latency, sessions, and conversion from your APM and analytics.
  • Confirm monitoring scrapes both edge and origin so a CDN-masked origin fault is still visible.
  • Verify alert routing reaches the named rollback owner via pager, not just a dashboard.
  • Validate clock sync across log sources so windowed evaluations align.

Threshold Readiness Checklist:

  • curl) running against critical paths every 60 s

Execution Steps

Set each threshold relative to baseline, with a window long enough to filter noise but short enough to bound damage.

1. Set the Error-Rate Ceiling

Define the 5xx and 4xx percentage that fires a rollback over a fixed window — a common default is 5xx above 2% sustained for 5 minutes. Separate the 4xx ceiling, which signals broken redirects rather than origin failure. The full derivation, alerting rules, and edge cases live in Defining Error-Rate Thresholds That Trigger Rollback.

2. Set the Latency Ceiling

Pin a p95 latency ceiling — typically baseline plus 50% or a hard 1500 ms — over a 5-minute window. Latency regressions often precede error spikes as origin connection pools saturate. Watch p95 and p99 separately; a p99 blowout with a stable p95 usually points to a single slow path, not a systemic fault.

3. Set Traffic and Conversion Drop Triggers

Define a traffic-drop ceiling, such as organic sessions falling more than 20% versus baseline over 15 minutes, and a conversion-drop ceiling such as 15% over 30 minutes. These catch silent failures where pages return 200 but render broken. Cross-reference high-value paths from Traffic & Conversion Mapping so revenue-critical routes weight the decision.

4. Set the Crawl-Error Spike Trigger

Define a crawl-error multiplier — for example, more than 3x the baseline crawl errors within 60 minutes — to catch redirect and indexability regressions search engines see before users do. Feed this from Search Console Handover coverage reports. A crawl-error spike alone rarely justifies an immediate abort, but combined with a 4xx rise it confirms a routing fault.

5. Define Composite and Time-Boxed Rules

Combine signals so two correlated breaches escalate faster than one in isolation, and define a hard time box — keep all triggers armed for the full 72-hour soak. Document which single triggers auto-fire versus which require the owner’s confirmation, referencing the reversal steps in Migration Rollback Playbooks.

Configs / Commands

Prometheus alert rules — error-rate and latency ceilings:

# Fire when 5xx ratio exceeds 2% sustained for 5 minutes
groups:
  - name: migration-rollback
    rules:
      - alert: High5xxRate
        expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m
      - alert: HighP95Latency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m

Synthetic check — windowed status and latency sampling with curl:

# Sample status code and total time every 60s; log breaches for the on-call owner
while true; do
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' https://www.example.com/checkout
  sleep 60
done | awk '$1 ~ /^5/ || $2 > 1.5 {print strftime(), "BREACH", $0}'

Validation

Confirm each trigger fires on synthetic faults before relying on it during a live cutover.

  • Inject a controlled 5xx burst in staging and confirm the High5xxRate alert fires within its window.
  • Throttle a backend and confirm the p95 latency alert pages the owner, not just a dashboard.
  • Drop a test path to 404 and confirm the crawl-error and 4xx signals rise as expected.
  • Verify dashboards show the metric against its baseline line, so the on-call call is unambiguous.

Rollback Triggers

Fire a rollback when any of these numeric ceilings is breached for its full window during the soak.

  • Error rate: 5xx above 2% sustained 5 minutes, or 4xx above baseline + 10 points sustained 10 minutes.
  • Latency: p95 above 1500 ms (or baseline + 50%) sustained 5 minutes.
  • Traffic: organic/total sessions down more than 20% versus baseline over 15 minutes with no external cause.
  • Conversion: key conversion rate down more than 15% over 30 minutes.
  • Crawl errors: crawl/indexing errors exceeding 3x baseline within 60 minutes, confirmed alongside a 4xx rise.

FAQ

Why attach a time window to every threshold instead of firing instantly? A single 5-second spike during a cache warm-up is noise; a 5-minute sustained breach is a regression. Windows filter transient blips so you do not roll back a healthy migration over a momentary anomaly, while still bounding how long a real fault runs.

Should any single metric auto-trigger a rollback without human confirmation? A severe error-rate breach (for example 5xx above 5% for 2 minutes) is a reasonable auto-trigger because it is unambiguous. Softer signals like conversion drop should require the named owner to confirm, since they have more benign causes such as analytics tagging breaking during the move.

How do I set thresholds for a site with no clean baseline? Use a representative staging load test plus industry-standard absolutes — 5xx above 2%, p95 above 1500 ms — as a floor, then tighten after the first stable week. Absolute ceilings protect you even when relative baselines are missing.

Do these thresholds change as the soak window progresses? Yes. Keep them tight for the first few hours when regressions are most likely, then you may relax traffic/conversion windows once daily patterns are confirmed. Error-rate and latency ceilings should stay strict for the full 72 hours.

Related

← Back to Migration Rollback Playbooks

Explore Sub-topics