Segmenting High-Value URLs by Revenue Before Migration

Problem Statement

You cannot manually QA 80,000 redirects, and you should not try. A handful of URLs drive the bulk of revenue, and if those break on launch day the business notices in hours; if a forgotten archive page breaks, almost nobody does. Treating every URL as equally important spreads scarce QA time so thin that the pages that actually matter get the same cursory check as a decade-old press release. Segmenting URLs by revenue and organic traffic before migration tells you exactly which redirects to test by hand, which to sample, and which can safely ride on a regex rule with automated spot checks. The segmentation also becomes the input to risk forecasting and to the QA worklist your team executes on cutover night, so it pays for itself several times over. This page is part of Traffic & Conversion Mapping; start there for the wider mapping workflow.

Figure: URLs joined with revenue and session data are sorted, then split into three QA tiers by contribution.

When to Use This Approach

The URL set is too large to test every redirect by hand within the freeze window.
The site monetises through measurable conversions (orders, leads, ad revenue) attributable to URLs.
You have analytics revenue or conversion data you can join to URLs by path.
You need to defend a QA prioritisation decision to stakeholders with numbers.
Redirect QA capacity is limited and you must spend it where a failure costs the most.

Step-by-Step Instructions

1. Assemble Revenue and Traffic per URL

Pull organic sessions and attributed revenue per landing-page path from analytics, then join them to the URL inventory from your crawl baseline. Path is the join key, so normalise trailing slashes and casing first — a single mismatched slash will silently drop a high-value page out of the join and into the unprioritised remainder. Use a window of at least three months so a one-off campaign spike does not crown an otherwise minor URL, and prefer last-non-direct or data-driven attribution over last-click so assisting pages keep some of their credit.

# Export landing-page sessions + revenue from GA4 via BigQuery (or the GA4 API)
# --format=csv emits CSV instead of the default pretty-printed table;
# --max_rows lifts the default 100-row cap on printed results
bq --format=csv query --use_legacy_sql=false --max_rows=100000 \
'SELECT page_path, SUM(sessions) AS sessions, SUM(revenue) AS revenue
 FROM analytics.landing_pages
 WHERE channel = "Organic Search" AND date >= "2026-03-01"
 GROUP BY page_path ORDER BY revenue DESC' > url_revenue.csv   # value per path
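
The normalise-then-join step described above can be sketched in pandas as follows. This is a minimal illustration, not a definitive implementation: the inventory column name `path`, the helper names `normalise_path` and `join_revenue_to_inventory`, and the decision to lower-case paths are all assumptions (skip the lower-casing if your server treats paths as case-sensitive). An outer join with `indicator=True` is used so that rows found on only one side are labelled rather than silently lost.

```python
import pandas as pd


def normalise_path(path: str) -> str:
    """Lower-case a path and strip trailing slashes (the root "/" is kept)."""
    path = str(path).strip().lower()
    if len(path) > 1:
        path = path.rstrip("/")
    return path or "/"


def join_revenue_to_inventory(revenue: pd.DataFrame,
                              inventory: pd.DataFrame) -> pd.DataFrame:
    """Outer-join analytics rows to the crawl inventory on a normalised key.

    Assumed columns: revenue has page_path, sessions, revenue;
    inventory has path (a hypothetical name for the crawled URL path).
    """
    rev = revenue.assign(path_key=revenue["page_path"].map(normalise_path))
    # Variants such as /Shoes/ and /shoes collapse to one key, so sum them.
    rev = rev.groupby("path_key", as_index=False)[["sessions", "revenue"]].sum()

    inv = inventory.assign(path_key=inventory["path"].map(normalise_path))
    inv = inv.drop_duplicates("path_key")

    # _merge is "both", "left_only" (crawled, no analytics row)
    # or "right_only" (analytics row with no crawled URL).
    joined = inv.merge(rev, on="path_key", how="outer", indicator=True)
    joined[["sessions", "revenue"]] = joined[["sessions", "revenue"]].fillna(0)
    return joined
```

Rows marked `right_only` carry analytics revenue but have no match in the crawl inventory, which is the usual symptom of a path mismatch; review them in descending revenue order before ranking.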

2. Compute Each URL’s Share of the Total

Convert raw revenue into cumulative share so the Pareto curve is explicit. Sorting descending and running a cumulative sum shows how many URLs it takes to reach 80% of revenue, and on most sites that curve is steep: a few hundred URLs out of tens of thousands make up the head. The shape of the curve also tells you whether tiering will help. A steep curve means manual QA on a small head protects most of the value, while a shallow curve means revenue is spread thin and you should lean harder on automated checks across the board.

# Add revenue share and cumulative share to rank URLs
import pandas as pd
df = pd.read_csv('url_revenue.csv').sort_values('revenue', ascending=False)
df['rev_share'] = df.revenue / df.revenue.sum()
df['rev_cum'] = df.rev_share.cumsum()           # cumulative revenue contribution
df.to_csv('url_revenue_ranked.csv', index=False)
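
One way to put a number on the steepness of the curve is to count how many URLs are needed to reach a given share of revenue. The sketch below assumes a hypothetical helper name, `head_size`, and counts the URL that crosses the threshold as part of the head; what counts as "steep" remains a judgement call for your site.

```python
import pandas as pd


def head_size(revenue: pd.Series, threshold: float = 0.80) -> tuple[int, float]:
    """Return (count, fraction) of URLs needed to reach `threshold` of revenue."""
    total = revenue.sum()
    if len(revenue) == 0 or total <= 0:
        raise ValueError("revenue must contain a positive total")
    ranked = revenue.sort_values(ascending=False).reset_index(drop=True)
    cum_share = ranked.cumsum() / total
    # URLs still below the threshold, plus the one that crosses it.
    count = min(int((cum_share < threshold).sum()) + 1, len(ranked))
    return count, count / len(ranked)
```

Calling `head_size(df.revenue)` on the ranked frame gives the size of the head both as a URL count and as a fraction of the whole set; a small fraction indicates a steep curve.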

3. Assign QA Tiers from Thresholds

Cut the ranked list into tiers with explicit thresholds so the assignment is reproducible. Tier 1 gets manual redirect QA, Tier 2 gets sampled QA, and the low-value remainder gets automated spot checks.

# Tier URLs: T1 = top of cumulative revenue, T2 = next band, T3 = remainder
# Tier on the share accumulated *before* each row, so the URL that crosses a
# threshold stays in the higher tier (a single dominant URL still lands in T1)
def tier(row):
    before = row.rev_cum - row.rev_share
    if before < 0.80: return 'T1_manual'     # ~top 80% of revenue
    if before < 0.95: return 'T2_sampled'    # next 15%
    return 'T3_spot'                         # final 5% / low-value remainder
df['qa_tier'] = df.apply(tier, axis=1)
df.to_csv('url_qa_tiers.csv', index=False)         # drives redirect QA plan
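
For the sampled tier, a reproducible random draw keeps the QA set stable between runs. The sketch below is illustrative only: the helper name `sample_tier` and the seed are assumptions, and the 10% default is simply the rate used in the worked example further down, not a recommendation.

```python
import pandas as pd


def sample_tier(tiers: pd.DataFrame, tier: str = "T2_sampled",
                frac: float = 0.10, seed: int = 42) -> pd.DataFrame:
    """Draw a reproducible random sample from one QA tier."""
    band = tiers[tiers["qa_tier"] == tier]
    # A fixed random_state returns the same rows on every run.
    return band.sample(frac=frac, random_state=seed)
```

Fixing the seed means a reviewer who re-runs the script after a late mapping change checks the same URLs again, so results stay comparable.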

4. Generate the Tier 1 QA Worklist

Turn Tier 1 into a concrete checklist of old-to-new URL pairs that a human will curl and eyeball. This is the list that protects the revenue you cannot afford to lose. Join each Tier 1 source URL to its mapped destination from the redirect map so the worklist shows both sides, letting the reviewer confirm not just that a redirect fires but that it lands on the right page. Size the worklist deliberately: if it is too large to clear within the cutover window, shrink Tier 1 by lowering the cumulative-revenue cutoff (for example from 0.80 to 0.70) or split it across reviewers, because a worklist nobody finishes provides false assurance rather than coverage.

# Build the manual QA worklist of source URLs from Tier 1
awk -F',' '$NF=="T1_manual" {print $1}' url_qa_tiers.csv > t1_qa_worklist.txt
wc -l t1_qa_worklist.txt   # how many redirects need human eyes
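
The awk command lists source paths only. Pairing each one with its destination, as described above, might look like the following sketch. The redirect-map column names `source` and `destination` and the helper name `build_worklist` are assumptions; adjust them to whatever your redirect map actually uses.

```python
import pandas as pd


def build_worklist(tiers: pd.DataFrame, redirect_map: pd.DataFrame) -> pd.DataFrame:
    """Pair each Tier 1 source path with its mapped destination.

    Assumed columns: tiers has page_path, revenue, qa_tier;
    redirect_map has source and destination (hypothetical names).
    """
    t1 = tiers.loc[tiers["qa_tier"] == "T1_manual", ["page_path", "revenue"]]
    worklist = t1.merge(redirect_map, left_on="page_path",
                        right_on="source", how="left")
    # A missing destination means the redirect map has no row for this URL.
    worklist["unmapped"] = worklist["destination"].isna()
    return worklist[["page_path", "destination", "revenue", "unmapped"]]
```

Any row with `unmapped` set to true is a Tier 1 URL with no redirect at all, which is worth resolving before cutover rather than discovering during QA. Note that duplicate `source` rows in the redirect map would duplicate worklist rows, so de-duplicate the map first.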

Worked Example

A retailer migrating oldshop.example.com exports 62,000 organic landing-page URLs. After joining revenue, the ranked file shows that 480 URLs (0.8% of the set) carry 80% of attributed revenue — almost entirely product and category pages. Those 480 become Tier 1 and are added to t1_qa_worklist.txt for manual redirect QA.

The next 7,300 URLs (to 95% cumulative revenue) become Tier 2 and are sampled at 10%, while the remaining 54,220 or so ride on regex rules with automated spot checks. On launch night the team manually confirms all 480 Tier 1 redirects in under two hours, catching three that pointed at the wrong category because of a stale mapping row. On those exact pages, such failures would have cost real orders within the hour.

The tiering also reframes the QA conversation with leadership: instead of “we tested as much as we could”, the team can say “100% of revenue-critical redirects verified by hand, 10% of the next band sampled with zero failures”. That is a defensible coverage statement backed by numbers. The ranked file also feeds mapping legacy traffic to new URL structures and the traffic-loss forecast, and rolls up into the Pre-Migration Auditing & Risk Assessment record.

Verification

Confirm the tiering is sound and the worklist is actionable.

# Tier 1 should be a small fraction of URLs but the majority of revenue
awk -F',' 'NR>1 {t[$NF]++} END {for (k in t) print k, t[k]}' url_qa_tiers.csv

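# Confirm cumulative share reaches ~1.0 (file is complete and correctly ranked)
# Note: shares are normalised to the file's own total, so this cannot reveal rows
# lost in the join; compare the printed revenue sum with the analytics total for that
python -c "import pandas as pd;d=pd.read_csv('url_qa_tiers.csv');print(d.rev_cum.max(), d.revenue.sum())"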

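# Spot-check that a Tier 1 redirect chain ends in a 200 on the new domain
# (page_path values are assumed to start with "/", so none is added after the host;
#  -i matches the lowercase "location:" header that HTTP/2 responses return)
curl -sIL "https://oldshop.example.com$(head -n 1 t1_qa_worklist.txt)" | grep -iE '^(HTTP|location)'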

Watch for these failures: joining on un-normalised paths so high-value URLs silently drop out; ranking by sessions alone when a low-traffic, high-AOV page actually drives revenue; and treating the low-value remainder as zero-risk rather than spot-checking it. Re-run the segmentation if the redirect map changes late, because a remapped destination can move a URL between tiers. And remember that seasonality cuts both ways: a page that earns nothing in the off-season may be a top earner during a sale, so cross-check the ranking against the same period last year before freezing the tiers.
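
The seasonality cross-check can be sketched as a comparison of two revenue heads: the current window and the same period last year. The helper name `seasonal_gaps` and the idea of exporting a second file for last year are assumptions; the head is defined here as every URL up to and including the one that crosses the threshold.

```python
import pandas as pd


def seasonal_gaps(current: pd.DataFrame, last_year: pd.DataFrame,
                  threshold: float = 0.80) -> pd.DataFrame:
    """Return URLs in last year's revenue head that are absent from this year's.

    Both frames are assumed to have page_path and revenue columns.
    """
    def head(frame: pd.DataFrame) -> set:
        ranked = frame.sort_values("revenue", ascending=False)
        share = ranked["revenue"] / ranked["revenue"].sum()
        # Share accumulated before each row; the crossing URL stays in the head.
        before = share.cumsum().shift(fill_value=0)
        return set(ranked.loc[before < threshold, "page_path"])

    missing = head(last_year) - head(current)
    gaps = last_year[last_year["page_path"].isin(missing)]
    return gaps.sort_values("revenue", ascending=False)
```

URLs returned here earned enough last year to sit in the head but do not in the current window, so they are candidates for manual promotion into Tier 1 before the tiers are frozen.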

FAQ

Should I rank by revenue or by traffic? Rank by revenue where you can attribute it, because a low-traffic, high-value page (an enterprise pricing page, a flagship product) can outweigh thousands of informational visits. Use organic sessions as the tiebreaker and as the primary signal only when revenue attribution is unavailable.

What if revenue is not attributable to specific URLs? Fall back to a composite of organic sessions, conversions, and assisted-conversion value. Even a coarse proxy (sessions times conversion rate by template) is far better than treating every URL as equal, and it still concentrates manual QA on the pages that matter.
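
As a rough illustration of the coarse proxy mentioned above (sessions times conversion rate by template), the sketch below assumes each URL has already been labelled with a `template` column and that you can supply a conversion rate per template; both are hypothetical inputs, as is the helper name `proxy_value`.

```python
import pandas as pd


def proxy_value(urls: pd.DataFrame, template_rates: dict[str, float]) -> pd.Series:
    """Estimate per-URL value as organic sessions times its template's conversion rate.

    Assumed columns: urls has sessions and template (hypothetical names).
    Templates without a known rate are treated as contributing zero.
    """
    rates = urls["template"].map(template_rates).fillna(0.0)
    return urls["sessions"] * rates
```

The resulting series can stand in for the revenue column when ranking, so the cumulative-share and tiering steps work unchanged.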

How small should Tier 1 be? Small enough to QA by hand in the freeze window, which for most sites means the URLs covering roughly the top 80% of revenue. That is usually a few hundred to a couple of thousand pages; if it is larger, lower the cumulative-revenue cutoff or sample within the tier.
