Monitoring Global DNS Propagation During Cutover
Problem Statement
During a live DNS cutover, an authoritative record swap does not take effect everywhere at once. Recursive resolvers across global networks cache the old IP until the previous TTL expires, and many enforce minimum cache floors or ignore short TTLs entirely. The result is split-brain routing: some users reach the new origin, others the legacy one, and you have no single dashboard telling you which. Unmonitored cutovers risk NXDOMAIN blackholes, HTTP 5xx spikes, and CDN origin mismatches that surface as intermittent, hard-to-reproduce downtime. The deeper trap is that the average looks healthy long before the slowest tail converges: if 80% of resolvers have adopted the new IP, a status page can read green while the users behind the remaining fifth still land on a decommissioned origin. This page sits under DNS Propagation Tracking and covers how to instrument convergence across a wide resolver panel, gate downstream actions on a hard adoption threshold, and verify the result in real time rather than guessing from a single dig.
When to Use This Approach
- You are performing a live IP or hostname swap on a production domain and cannot tolerate silent split-brain routing.
- Your authoritative records carry, or have recently carried, long TTLs that may leave caches stale for hours.
- You serve traffic through a CDN where edge origin shields must be verified against the new IP.
- You need an auditable convergence record for stakeholders or a go/no-go decision gate.
- You require a defined threshold (for example 85% global adoption) to trigger downstream steps such as cache purges.
Step-by-Step Instructions
1. Pre-Condition Authoritative Zones
Lower TTLs so resolver caches expire quickly once you publish, which is the single biggest lever on how fast the slow tail converges. Do this at least one full old-TTL period before the swap (48 hours comfortably covers an 86400 s TTL); the mechanics are detailed in How to Lower DNS TTL Before Domain Migration. Document current NS records, registrar lock status, and DNSSEC key state so rollback is deterministic: if DNSSEC is active, a swap that omits a re-signed RRSIG produces SERVFAIL on validating resolvers rather than the old answer, which your sampler must distinguish from a normal miss.
# Confirm the effective TTL the resolver is actually serving (not just the zone value)
dig @8.8.8.8 example.com A +noall +answer
# example.com. 300 IN A 203.0.113.10 <- TTL field must read 300 or less (resolvers count it down), not a remnant of 86400
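The TTL check above can also be scripted so it fails loudly instead of relying on a visual read. The following is a minimal sketch, assuming dig's usual `+noall +answer` column layout (name, TTL, class, type, data); the helper names `max_served_ttl` and `ttl_within` are hypothetical, not part of any tool.

```bash
# Hypothetical helper: read `dig +noall +answer` output on stdin and print the
# highest TTL among the A records (0 when no A record is present).
max_served_ttl() {
  awk '$4 == "A" && $2 + 0 > max + 0 { max = $2 } END { print max + 0 }'
}

# Hypothetical helper: succeed only if an A record was returned and its TTL
# is at or below the ceiling given as the first argument.
ttl_within() {
  ceiling=$1
  ttl=$(max_served_ttl)
  [ "$ttl" -gt 0 ] && [ "$ttl" -le "$ceiling" ]
}
```

A possible invocation is `dig @8.8.8.8 example.com A +noall +answer | ttl_within 300 || echo "TTL still too high"`. Because a recursive resolver reports the remaining cache lifetime, any value from 1 to 300 passes.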
2. Establish a Baseline Snapshot
Capture the current globally-resolved IP set before touching the record, so you can distinguish “not yet propagated” from “wrong record published”. Query a fixed panel of public resolvers and store the result.
# Sample a fixed resolver panel and record the answer per resolver
for r in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
  # printf (not the non-portable echo -n) always ends the line, even on an empty answer
  printf '%s -> %s\n' "$r" "$(dig @"$r" example.com A +short)" # baseline IP per resolver
done | tee baseline.txt
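Once the swap is live, the baseline is what lets a sampler separate "still old" from "something else entirely". The sketch below shows one way to label answers; it assumes input lines of the form `<resolver-ip> <answer-ip>` (the results.txt layout used in the next step, not the `->` layout of baseline.txt), and `classify_answers` is a hypothetical name.

```bash
# Hypothetical helper: label each "<resolver> <answer>" line read from stdin.
#   new        -> resolver has adopted the new record
#   old        -> still serving the baseline answer (not yet propagated)
#   empty      -> no answer at all (timeout or an error rcode)
#   unexpected -> neither old nor new: suspect a wrong record was published
# Usage: classify_answers OLD_IP NEW_IP < results.txt
classify_answers() {
  awk -v old="$1" -v new="$2" '{
    if (NF < 2)          label = "empty"
    else if ($2 == new)  label = "new"
    else if ($2 == old)  label = "old"
    else                 label = "unexpected"
    print $1, label
  }'
}
```

Any `unexpected` line is worth investigating immediately, since waiting for caches to expire will not correct a mispublished record.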
3. Deploy Distributed Query Sampling
Run parallel dig queries against 50+ resolvers on a short interval and aggregate response codes (NOERROR, NXDOMAIN, SERVFAIL) plus the returned IP into a time series. Drive it from a CSV panel so the resolver list is version-controlled.
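# Extract resolver IPs from CSV (columns: resolver,region) and sample them in parallel.
# The resolver is passed as $1 rather than spliced into the sh -c string; -P 16 runs 16 queries at once.
awk -F',' 'NR>1 {print $1}' resolvers.csv \
| xargs -P 16 -I{} sh -c 'printf "%s %s\n" "$1" "$(dig @"$1" example.com A +short +time=2 +tries=1 | head -n1)"' _ {} \
> results.txt # one line per resolver: <resolver-ip> <answer-ip> (first answer line only)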
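`+short` prints only the answer data, so it cannot tell NXDOMAIN from SERVFAIL or a timeout. One way to capture the response code as well is to parse dig's header comments. This is a sketch that assumes the `status:` field dig prints in its `->>HEADER<<-` line when run with `+noall +comments +answer`; the helper name `rcode_and_answer` is hypothetical.

```bash
# Hypothetical helper: read `dig +noall +comments +answer` output on stdin and
# print "<rcode> <first A answer>". Prints TIMEOUT when no header was received
# and "-" when the response carried no A record.
rcode_and_answer() {
  awk '
    /status: / { s = $0; sub(/.*status: /, "", s); sub(/,.*/, "", s); rcode = s }
    !/^;/ && $4 == "A" && ip == "" { ip = $5 }
    END {
      if (rcode == "") rcode = "TIMEOUT"
      if (ip == "") ip = "-"
      print rcode, ip
    }
  '
}
```

A per-resolver call might look like `dig @"$r" example.com A +noall +comments +answer +time=2 +tries=1 | rcode_and_answer`, giving a line such as `NOERROR 203.0.113.20` that can be appended to the time series next to the resolver address.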
4. Gate Downstream Actions on a Threshold
Compute the percentage of sampled resolvers returning the new IP and treat that number, not a single spot-check, as your source of truth. Hold CDN cache purges and traffic finalisation until adoption reaches 85%, so that edges still resolving the legacy origin do not re-populate their caches from it. Equally important is a floor on the slow tail: if adoption stalls below 70% after two hours, stop waiting and investigate resolver-enforced minimums or a registrar TTL override rather than assuming time alone will fix it.
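# Count resolvers returning the new IP and emit an adoption percentage
NEW_IP=203.0.113.20
total=$(wc -l < results.txt)
# Exact string match on the answer column (a grep pattern would treat the dots as wildcards)
hits=$(awk -v ip="$NEW_IP" '$2 == ip {n++} END {print n+0}' results.txt)
# Guard against an empty sample so the division cannot fail
awk -v h="$hits" -v t="$total" 'BEGIN{ if (t == 0) { print "adoption: n/a (no samples)"; exit 1 } printf "adoption: %.0f%%\n", (h/t)*100 }'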
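The two thresholds described above can be folded into a single decision that a runbook branches on. A minimal sketch follows, assuming the adoption figure is passed as an integer percentage and elapsed time as seconds since publication; `gate_decision` and its three output words are hypothetical names.

```bash
# Hypothetical gate: print PROCEED, HOLD or STALLED.
#   $1 = adoption percentage (integer), $2 = seconds since the record was published
# Thresholds mirror the text: purge at >= 85%, stall if below 70% after 2 hours.
gate_decision() {
  pct=$1
  elapsed=$2
  if [ "$pct" -ge 85 ]; then
    echo PROCEED
  elif [ "$pct" -lt 70 ] && [ "$elapsed" -ge 7200 ]; then
    echo STALLED
  else
    echo HOLD
  fi
}
```

For example, `gate_decision 71 900` prints `HOLD`, while `gate_decision 65 7200` prints `STALLED` and signals that someone should look at resolver floors instead of waiting longer.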
Worked Example
A retailer moves shop.example.com from a legacy host at 198.51.100.5 to a new origin at 203.0.113.20. The team lowered the A-record TTL from 86400 s to 300 s two days prior. Just before publishing the new record at T+0, baseline.txt confirms every panel resolver returns 198.51.100.5.
By T+15 min the sampler shows EU and US resolvers returning 203.0.113.20, but dig @216.146.35.35 shop.example.com A +noall +answer still serves the legacy IP with a TTL of 600 s — a resolver-enforced floor above the published 300 s. Adoption sits at 71%, below the gate, so the automated runbook holds: no purge, no traffic finalisation. The team resists the temptation to call it done from the green-looking US figures, because the synthetic checkout monitor in APAC is still hitting the old origin and would serve a stale cart if the edge were purged now. At T+50 min the laggards expire and the adoption script prints adoption: 89%. Only then does the team fire the CDN purge and verify the edge:
curl -sI https://shop.example.com/app.js | grep -i 'x-cache-status'
# x-cache-status: MISS <- edge re-fetched from the new origin after purge
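The example turns on one region lagging while others look finished, which a single global percentage can hide. The sketch below breaks adoption down by the region column of the resolver panel; it assumes a CSV with a header row and `resolver,region` columns plus a results file of `<resolver-ip> <answer-ip>` lines, and `adoption_by_region` is a hypothetical name.

```bash
# Hypothetical helper: print "<region> <pct>%" per region, sorted by region.
# Usage: adoption_by_region resolvers.csv results.txt NEW_IP
adoption_by_region() {
  csv=$1
  results=$2
  new_ip=$3
  awk -F',' -v ip="$new_ip" '
    NR == FNR && FNR == 1 { next }                       # skip the CSV header
    NR == FNR { sub(/\r$/, "", $2); region[$1] = $2; next }
    {
      split($0, f, " ")                                   # results are space-separated
      if (!(f[1] in region)) next                         # resolver not in the panel
      r = region[f[1]]
      total[r]++
      if (f[2] == ip) hits[r]++
    }
    END { for (r in total) printf "%s %d%%\n", r, (hits[r] / total[r]) * 100 }
  ' "$csv" "$results" | sort
}
```

Holding the purge until every region, not just the global figure, clears the gate is a stricter variant of the threshold used in this example.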
Verification
Confirm the source of truth, then confirm global agreement.
# 1. The record is correct at the authoritative source
dig @ns1.example-dns.com shop.example.com A +noall +answer
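# 2. Flag any resolver whose answer differs from the expected new IP
#    (expected.txt: one "<resolver-ip> <new-ip>" line per panel resolver; actual.txt: the latest sample in the same format)
comm -23 <(sort expected.txt) <(sort actual.txt) # non-empty = unconverged nodes; <( ) requires bash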
# 3. End-to-end status across regional endpoints
curl -s -o /dev/null -w '%{http_code}\n' https://shop.example.com # expect 200
For automated resolver mapping, geographic distribution logic, and threshold-based alerting, integrate the broader DNS Propagation Tracking workflow rather than running ad-hoc samples by hand.
FAQ
How do I bypass local OS DNS cache to verify true propagation status?
Flush local caches with sudo resolvectl flush-caches (Linux; the older systemd-resolve --flush-caches is deprecated) or ipconfig /flushdns (Windows), then query public resolvers directly with dig @8.8.8.8 domain.com A +noall +answer, which bypasses the local stub resolver entirely.
Why do some regions still return old IPs after the TTL expires?
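Recursive resolvers often enforce minimum cache floors or ignore TTLs below 300 s. If the name briefly returned NXDOMAIN or an empty answer during the swap, that negative response is cached separately, for the lesser of the SOA record's own TTL and its MINIMUM field (RFC 2308), and zones commonly set that field anywhere from 300 s to 86400 s. Query the authoritative nameservers directly with dig @ns1.yourdomain.com domain.com to confirm the record is correct at the source.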
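To see how long a negative answer can linger, the SOA record can be read directly. Under RFC 2308 the negative-cache lifetime is the lesser of the SOA record's TTL and its MINIMUM field. The sketch below assumes dig's standard answer layout, in which MINIMUM is the last of the seven SOA data fields; `negative_ttl` is a hypothetical name.

```bash
# Hypothetical helper: read a `dig SOA +noall +answer` line on stdin and print
# the negative-cache TTL in seconds: min(SOA TTL, SOA MINIMUM).
# Columns: name ttl class SOA mname rname serial refresh retry expire minimum
negative_ttl() {
  awk '$4 == "SOA" { t = ($2 + 0 < $11 + 0) ? $2 : $11; print t }'
}
```

Querying an authoritative server, for instance `dig @ns1.yourdomain.com domain.com SOA +noall +answer | negative_ttl`, avoids reading a TTL that a recursive resolver has already counted down.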
What adoption level should gate a CDN purge during cutover?
Wait until at least 85% of your sampled resolver panel returns the new IP before purging edge caches, so you avoid re-populating caches that still resolve to the legacy origin. Below 70% after two hours, treat it as a stall and review resolver floors or registrar TTL overrides.
Related
- DNS Propagation Tracking
- How to Lower DNS TTL Before Domain Migration
- Zero-Downtime Cutover Plans
- DNS Rollback Procedures