How to Export Full Crawl Data Before Migration

Problem Statement

Migrating without a complete URL inventory guarantees broken redirects and lost link equity. Incomplete datasets trigger immediate 404 spikes and organic traffic collapse the moment DNS flips. You must capture every HTTP status, redirect chain, and canonical tag before touching production. This page is part of Crawl Baseline Generation; start there if you have not yet defined the scope of your legacy environment.

Figure: Crawl Baseline Export Flow. 1. Configure (depth, JS, limits); 2. Crawl (export raw CSV); 3. Normalise (map columns); 4. Commit (SHA-256 + Git). Re-run on schema or coverage failure.
The export pipeline runs left to right; a coverage or schema failure sends you back to reconfigure the crawl.

When to Use This Approach

A full crawl export is the single source of truth every later migration phase depends on. Redirect maps, risk scoring, and post-launch diffing all read from it, so the export must be exhaustive rather than representative. Reach for this approach when:

  • You are within the freeze window before a domain change, replatform, or large-scale URL restructure.
  • The site exceeds a few hundred URLs and manual inventory is no longer reliable.
  • You need an immutable, version-controlled snapshot to compare against post-launch.
  • The site renders meaningful content client-side (SPA/CSR routes) and a plain HTML crawl would miss it.
  • You must feed downstream redirect mapping and risk scoring with authoritative source data.

If any of these hold, treat the export as a release gate: do not begin DNS or redirect work until the baseline is committed and verified. The cost of re-deriving a lost inventory mid-cutover is measured in hours of unplanned downtime, whereas capturing it cleanly up front is a one-time job of a few minutes plus crawl time.

Step-by-Step Instructions

1. Define Crawl Boundaries

Set strict limits before extraction so you capture the full architecture without overloading the origin. For baseline capture you typically want everything, so use the crawler’s “ignore robots.txt” option in a controlled, authenticated context — never deploy a non-compliant bot against third-party sites. Set depth to unlimited and cap pages at the sitemap count plus a 15% buffer. The buffer matters because real sites almost always expose more URLs than the sitemap declares: faceted navigation, paginated archives, and parameterised filters routinely add 5-15% on top of the canonical set, and those are exactly the paths that break first when left unmapped. Apply a conservative crawl delay so the extraction does not trip rate limiting or a WAF, which would silently truncate the dataset.

# pre_migration.seospiderconfig: key crawl settings
# (setting names are illustrative; the config file itself is saved from the Screaming Frog UI)
max_depth=0              ; unlimited depth (capture deep archives)
respect_robots=false     ; controlled internal crawl only
rendering=javascript     ; capture SPA/CSR routes
crawl_delay=2            ; 2s delay to avoid 429/503
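
As a rough illustration of the page cap described above, the sketch below counts the <loc> entries in a sitemap and adds the 15% buffer. The function names count_sitemap_urls and page_cap are hypothetical, and the code assumes a flat urlset sitemap rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace; "sm" is just a local prefix for the lookup below.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def count_sitemap_urls(sitemap_xml: str) -> int:
    """Count <url><loc> entries in a flat urlset sitemap (not a sitemap index)."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url/sm:loc", NS))


def page_cap(url_count: int, buffer_percent: int = 15) -> int:
    """Return the crawl page limit: sitemap count plus a rounded-up buffer."""
    # Integer ceiling division avoids floating-point rounding surprises.
    extra = -(-url_count * buffer_percent // 100)
    return url_count + extra
```

For the 8,200-URL sitemap used later in this guide, page_cap(8200) returns 9430.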

2. Run the Full Crawl and Export Raw Data

Run the crawler with optimised concurrency, then export raw datasets immediately for downstream processing. Keep internal and external link tables separate so migration-critical paths are easy to isolate.

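# Screaming Frog CLI: full crawl with JS rendering (requires SF SEO Spider license)
# The launcher name varies by OS and export names can differ between versions; check --help.
ScreamingFrogSEOSpider \
  --crawl https://production-site.com \
  --headless \
  --save-crawl \
  --export-tabs "Internal:All" \
  --bulk-export "All Inlinks" \
  --output-folder /tmp/baseline/ \
  --config pre_migration.seospiderconfig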

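# wget fallback: fetch every <loc> URL listed in the sitemap for offline inspection
# (wget does not follow links inside XML, so extract the URLs first)
curl -s https://production-site.com/sitemap.xml \
  | grep -o '<loc>[^<]*' | sed 's/<loc>//' \
  | wget -i - --wait=2 -P /tmp/mirror -o /tmp/crawl.log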

3. Normalise Columns into a Mapping Table

Transform raw exports into migration-ready tables with strict column mapping and regex isolation of legacy paths. This output feeds your redirect map and risk scoring directly.

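# Convert and normalise crawl export (Screaming Frog field names)
import pandas as pd

# 'Address', 'Status Code' and 'Canonical Link Element 1' are columns of Screaming Frog's
# Internal tab export; if this file lacks them, point read_csv at that export instead.
df = pd.read_csv('/tmp/baseline/all_inlinks.csv')
df = df.rename(columns={
    'Address': 'Source_URL',
    'Status Code': 'HTTP_Status',
    'Canonical Link Element 1': 'Target_Canonical',
})
df = df.drop_duplicates(subset=['Source_URL'], keep='last')  # collapse dupes, keep latest row
# Parquet output needs pyarrow or fastparquet installed
df.to_parquet('/tmp/baseline/baseline.parquet', index=False)  # fast downstream I/O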

4. Commit an Immutable Versioned Baseline

Checksum and version-control the raw export so you always have a known-good snapshot to diff against and roll back to. Storing the baseline in Git rather than a shared drive gives you a tamper-evident history: the SHA-256 fingerprint proves the file has not changed between capture and cutover, and the commit timestamp anchors the snapshot to a specific point in the freeze window. If a stakeholder later asks “what did the site look like before we touched it”, the answer is a single git show.

# Fingerprint and commit the baseline so it cannot be silently overwritten
# (assumes /tmp/baseline sits inside the Git working tree you run these commands from)
sha256sum /tmp/baseline/all_inlinks.csv > baseline_raw.sha256
git add /tmp/baseline/all_inlinks.csv baseline_raw.sha256
git commit -m "Pre-migration crawl baseline $(date +%Y-%m-%d)"

Worked Example

A retailer on oldshop.example.com is replatforming to shop.example.com. The sitemap reports 8,200 URLs, so max_pages is set to 9,430 (sitemap + 15%). The Screaming Frog crawl with JavaScript rendering discovers 9,118 URLs — 918 more than the sitemap, exposing orphaned filter and pagination routes that a sitemap-only export would have lost.

Normalisation isolates legacy paths with ^https?://(?:www\.)?oldshop\.example\.com(/.*)$ and surfaces three redirect loops where Source_URL == Target_Canonical. Those rows are corrected before they ever reach the redirect map. The committed baseline.parquet becomes the reference dataset for mapping legacy traffic to new URL structures and for the broader Pre-Migration Auditing & Risk Assessment workflow.
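
The path isolation in this example could be sketched with pandas as follows. This is a minimal illustration, assuming the normalised column names Source_URL and Target_Canonical from step 3; the function name add_legacy_path and the output columns Legacy_Path and Self_Referential are hypothetical.

```python
import pandas as pd

# The legacy-host pattern quoted in the worked example; group 1 is the path.
LEGACY_PATTERN = r"^https?://(?:www\.)?oldshop\.example\.com(/.*)$"


def add_legacy_path(df: pd.DataFrame) -> pd.DataFrame:
    """Add the captured legacy path and flag rows where source equals canonical."""
    out = df.copy()
    # Rows that do not match the legacy host get NaN in Legacy_Path.
    out["Legacy_Path"] = out["Source_URL"].str.extract(LEGACY_PATTERN, expand=False)
    # The condition the example describes: Source_URL == Target_Canonical.
    out["Self_Referential"] = out["Source_URL"] == out["Target_Canonical"]
    return out
```

Rows with a missing Legacy_Path fall outside the legacy host and can be filtered out before the redirect map is built.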

Verification

Confirm the export is complete and self-consistent before treating it as a baseline.

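# Data rows (header excluded) should be close to the discovered URL count
# and must not exceed the page cap (sitemap + buffer)
tail -n +2 /tmp/baseline/all_inlinks.csv | wc -l

# Detect rows where column 1 equals column 3 (self-referential entries).
# Assumes the URL and canonical sit in columns 1 and 3 and no field contains a quoted comma.
awk -F',' 'NR>1 && $1==$3 {print "loop:", $1}' /tmp/baseline/all_inlinks.csv

# Re-verify the committed snapshot integrity
sha256sum -c baseline_raw.sha256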
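
Splitting on commas breaks when a URL or title contains a quoted comma, so the same three checks can also be sketched with Python's csv module. This is an illustrative sketch, not the guide's own tooling: the function name verify_baseline is hypothetical, and the default column names assume the normalised headers from step 3.

```python
import csv
import hashlib


def verify_baseline(csv_path, expected_sha256,
                    url_col="Source_URL", canonical_col="Target_Canonical"):
    """Return row count, checksum status, and self-referential URLs for a CSV."""
    # Hash the file in blocks so large exports do not need to fit in memory.
    digest = hashlib.sha256()
    with open(csv_path, "rb") as raw:
        for block in iter(lambda: raw.read(1 << 20), b""):
            digest.update(block)

    rows = 0
    self_referential = []
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            rows += 1
            url = row.get(url_col)
            if url and url == row.get(canonical_col):
                self_referential.append(url)

    return {
        "rows": rows,
        "checksum_ok": digest.hexdigest() == expected_sha256,
        "self_referential": self_referential,
    }
```

The expected digest is the first field of the line stored in baseline_raw.sha256.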

Watch for these failures: ignoring JavaScript-rendered routes loses SPA content; exporting only top-level URLs misses deep pagination; dropping hreflang/x-default annotations corrupts international targeting; and overwriting baseline files without versioning eliminates rollback capability.

FAQ

What is the optimal crawl depth for capturing migration-critical URLs? Set depth to unlimited (0) with a hard page limit matching the sitemap URL count plus 15%. Use max_depth=0 in your crawler config to prevent arbitrary truncation of deep archive or parameterised URLs.

How do I handle 302 redirects in the baseline export? Flag all 302s for manual review. Convert them to 301s pre-migration if the destination is permanent, or preserve a redirect_type column in your mapping CSV so redirect intent stays explicit during the cutover.
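
A minimal sketch of that redirect_type column, assuming the normalised HTTP_Status column from step 3. The function name tag_redirect_types, the needs_review column, and the treatment of 307 and 308 are assumptions added for illustration.

```python
import pandas as pd

# Status-to-intent mapping; including 307/308 is an assumption beyond the 301/302 case.
REDIRECT_TYPES = {301: "permanent", 308: "permanent", 302: "temporary", 307: "temporary"}


def tag_redirect_types(df: pd.DataFrame) -> pd.DataFrame:
    """Add redirect_type and flag 302 rows for manual review."""
    out = df.copy()
    # Non-redirect statuses have no mapping and end up as NaN.
    out["redirect_type"] = out["HTTP_Status"].map(REDIRECT_TYPES)
    out["needs_review"] = out["HTTP_Status"] == 302
    return out
```

Filtering on needs_review gives the list of 302s to check by hand before the cutover.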

Can I automate CSV normalisation for 500k+ URLs? Yes. Use pandas with chunked reading (chunksize=50000) or DuckDB for out-of-core processing, apply vectorised regex replacements, and export to Parquet for faster I/O during mapping operations.
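
The chunked approach might look like the sketch below. It assumes the Screaming Frog column names from step 3; the function name normalise_in_chunks is hypothetical. Only the three mapped columns are loaded, which keeps memory use down, but the final de-duplication still holds the combined result in memory, so a truly out-of-core run would need DuckDB or per-chunk writes instead.

```python
import pandas as pd

# Same column mapping as the normalisation step.
COLUMN_MAP = {
    "Address": "Source_URL",
    "Status Code": "HTTP_Status",
    "Canonical Link Element 1": "Target_Canonical",
}


def normalise_in_chunks(source, chunksize: int = 50_000) -> pd.DataFrame:
    """Read a large export in chunks, rename columns, and collapse duplicate URLs."""
    parts = []
    # usecols drops every unmapped column at parse time.
    for chunk in pd.read_csv(source, usecols=list(COLUMN_MAP), chunksize=chunksize):
        parts.append(chunk.rename(columns=COLUMN_MAP))
    combined = pd.concat(parts, ignore_index=True)
    # De-duplicate after concatenation so duplicates spanning chunks are caught.
    deduped = combined.drop_duplicates(subset=["Source_URL"], keep="last")
    return deduped.reset_index(drop=True)
```

The source argument can be a file path or any file-like object that pandas.read_csv accepts.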
