Using Python to Generate CSV Redirect Maps

Problem Statement

Large-scale migrations break when redirect maps are hand-built: legacy CMS exports inject tracking parameters and trailing-slash mismatches, byte-order marks corrupt RewriteMap ingestion, and unescaped regex produces chained or looping rules. At tens of thousands of rows, manual mapping wastes crawl budget, dilutes link equity, and throws 500 errors when the server parses the map. Unpinned Python environments add UnicodeDecodeError and out-of-memory failures on top. This page sits under CSV Mapping Workflows and turns a raw legacy export into a deterministic, server-ready redirect map.

Each row flows ingest to export; a failed target HEAD check loops back for re-mapping before the file is deployed.

When to Use This Approach

You have more than a few hundred URL changes and manual mapping is impractical or error-prone.
Legacy exports carry tracking parameters, mixed encodings, or inconsistent trailing slashes that need normalising before rules are generated.
You want one version-controlled CSV that feeds both Apache RewriteMap and Nginx map directives.
You need to programmatically verify that every redirect target actually resolves before deployment.
The migration spans repeated dry-runs and you want a repeatable, scriptable build rather than a one-off spreadsheet edit.

Step-by-Step Instructions

1. Pin an Isolated Python Environment

Unpinned dependencies are the most common cause of encoding and memory failures on bulk transforms. Create a virtual environment and pin exact versions before touching legacy data.

# Isolate and pin to avoid UnicodeDecodeError / OOM on large maps
python -m venv redirect_env && source redirect_env/bin/activate
pip install pandas==2.2.3 regex==2024.11.6 requests==2.32.3
export PYTHONIOENCODING=utf-8

2. Ingest and Normalise the Legacy Export

Read the export with utf-8-sig so a leading BOM is consumed, then strip tracking parameters and normalise trailing slashes so source paths are canonical.

import re
import pandas as pd
from urllib.parse import urlparse, urlunparse

def clean_url(raw: str) -> str:
    """Strip tracking params and normalise the path."""
    url = urlparse(raw)
    path = re.sub(r'[?&](?:sid|session|utm_[a-z_]+)=[^&]*', '', url.path)
    return urlunparse(('', '', path.rstrip('/'), '', '', ''))

df = pd.read_csv('legacy_export.csv', encoding='utf-8-sig')  # utf-8-sig eats the BOM
df['clean_source'] = df['url'].apply(clean_url)

3. Apply Vectorised Regex Transforms

Build PCRE2-compatible capture groups and apply them with vectorised pandas replacement, then drop rows where no transform fired and assign the status code.

# Map /legacy/<slug>/article/<id> -> /new/category/<slug>/post/<id>
df['target'] = df['clean_source'].str.replace(
    r'^/legacy/([a-z0-9-]+)/article/(\d+)$',
    r'/new/category/\1/post/\2',
    regex=True
)
df = df[df['target'] != df['clean_source']].copy()  # keep only transformed rows
df['status_code'] = 301

4. Export a Server-Ready Map

Write UTF-8 with BOM so Excel opens it cleanly while Apache and Nginx strip the BOM on ingestion. Quote non-numeric fields to keep commas in paths intact.

import csv

df[['clean_source', 'target', 'status_code']].rename(
    columns={'clean_source': 'old_url', 'target': 'new_url'}
).to_csv(
    'redirect_map.csv',
    index=False,
    encoding='utf-8-sig',
    quoting=csv.QUOTE_NONNUMERIC   # protect commas inside path values
)

Load the resulting file as an O(1) hash lookup on the server. Tune Nginx hash sizing for large maps and reference the Apache RewriteMap directly:

# Nginx — size the hash tables for a large map file
map_hash_max_size 10000;
map_hash_bucket_size 128;

# Apache — load the redirect map as an O(1) hash lookup
RewriteMap redirect txt:/etc/apache2/redirects.map
RewriteCond ${redirect:$1|NOTFOUND} !NOTFOUND
RewriteRule ^(.*)$ ${redirect:$1} [R=301,L]

Worked Example

An enterprise news site exports 48,000 legacy article URLs. A row reads https://old.example.com/legacy/world-news/article/8821?utm_source=newsletter. After step 2 the source normalises to /legacy/world-news/article/8821; after step 3 the target becomes /new/category/world-news/post/8821 with status 301. The exported row is:

"old_url","new_url",301
"/legacy/world-news/article/8821","/new/category/world-news/post/8821",301

A request to the legacy path now resolves in a single hop:

GET /legacy/world-news/article/8821 HTTP/1.1
Host: example.com

HTTP/1.1 301 Moved Permanently
Location: https://example.com/new/category/world-news/post/8821

Run chain detection across the generated set with Redirect Chain Elimination so no transformed row points at another redirected source.

Verification

HEAD-check every target before deployment so chained or missing destinations are caught while still in CSV form. Keep allow_redirects=False so a chained target does not report a false 200.

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'RedirectValidator/1.0'})

results = []
for _, row in df.iterrows():
    try:
        r = session.head(row['new_url'], allow_redirects=False, timeout=5)
        results.append({'url': row['new_url'], 'status': r.status_code})
    except requests.RequestException as e:
        results.append({'url': row['new_url'], 'status': 'ERROR', 'error': str(e)})

pd.DataFrame(results).to_csv('validation_results.csv', index=False)

Deploy the map with an atomic swap and keep a timestamped backup so a 5xx spike can be reverted in seconds:

# Backup, validate config, reload atomically
cp /etc/nginx/redirects.map /etc/nginx/redirects.map.bak.$(date +%s)
rsync --backup redirect_map.csv /etc/nginx/redirects.map
nginx -t && systemctl reload nginx          # rollback: copy the .bak.* file back, then reload

Validation checklist:

allow_redirects=False confirmed so chained targets do not report a false 200 OK.
Trailing-slash parity verified between source and target to block duplicate-content loops.
BOM consumed on ingest and absent from the deployed .map file.
Timestamped backup stored on a separate volume before reload.

FAQ

How do I process 50,000+ URL rows without a Python memory overflow? Iterate with pandas.read_csv('source.csv', chunksize=10000) to process memory-safe blocks, or switch to polars for lazy, out-of-core evaluation on larger datasets.

What prevents catastrophic regex backtracking on malformed legacy URLs? Use the regex module rather than the built-in re and apply possessive quantifiers (*+, ++) or atomic groups (?>...) on ambiguous path segments. Test against adversarial inputs — very long strings and deeply nested paths — before production.

What is the exact emergency rollback if load spikes after deployment? Restore the timestamped backup: cp /etc/nginx/redirects.map.bak.TIMESTAMP /etc/nginx/redirects.map && nginx -t && systemctl reload nginx, which restores baseline routing in seconds. If nginx -t fails, the backup itself is corrupt — restore from Git instead.

← Back to CSV Mapping Workflows