Data Normalization

Data normalization standardizes scraped values into a consistent format: stripping currency symbols, trimming whitespace, converting dates, and unifying units.

Raw scraped text is inconsistent: prices appear as `$1,299.00`, `1299`, `USD 1,299`, or `1.299,00` depending on locale. Dates appear as `June 25, 2026`, `2026-06-25`, `25/06/26`, or relative strings like `2 hours ago`. Normalization converts these varied representations into a canonical form — a float for the price, an ISO 8601 string for the date — so that records can be compared, sorted, and aggregated.

Normalization steps typically include: stripping whitespace and control characters, decoding HTML entities, converting character encodings to UTF-8, parsing numbers by removing thousands separators and detecting locale-specific decimal notation, parsing dates with libraries like dateparser that handle multiple formats, and resolving relative URLs to absolute ones.
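The text-level steps in that list can be sketched with the Python standard library alone. This is a minimal illustration, not a complete cleaner: the function names (`to_utf8`, `clean_text`, `absolutize`) are hypothetical, and the sketch assumes the page's declared encoding is already known.

```python
import html
import re
import unicodedata
from urllib.parse import urljoin


def to_utf8(payload: bytes, declared_encoding: str) -> bytes:
    # Decode with the encoding the page declared, then re-encode as UTF-8.
    # A wrong declaration raises UnicodeDecodeError instead of passing silently.
    return payload.decode(declared_encoding).encode("utf-8")


def clean_text(raw: str) -> str:
    # Decode HTML entities such as "&amp;" and "&nbsp;".
    text = html.unescape(raw)
    # Drop control and format characters (Unicode categories starting with "C"),
    # but keep anything Python considers whitespace so words stay separated.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # Collapse whitespace runs (including non-breaking spaces) and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def absolutize(page_url: str, link: str) -> str:
    # Resolve a relative link against the URL of the page it was found on.
    return urljoin(page_url, link)
```

For example, `clean_text("  Caf&eacute;&nbsp;&amp; Bar\x00\n")` yields `"Café & Bar"`, and `absolutize("https://example.com/shop/page.html", "../img/a.png")` yields `"https://example.com/img/a.png"`.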

Normalizing data at the extract-transform boundary (rather than leaving it to downstream consumers) simplifies downstream logic and ensures that any anomaly, such as an unexpected format that the normalizer cannot parse, is surfaced immediately rather than silently producing wrong values.
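One way to surface anomalies at that boundary is to have each field normalizer raise on input it cannot parse, and to route the failing record to a reject list instead of passing it on. The sketch below is an assumption about how such a step might look; `NormalizationError`, `parse_quantity`, and `normalize_records` are hypothetical names, not part of any library.

```python
from typing import Any, Callable


class NormalizationError(ValueError):
    """Raised when a raw value does not match any expected format."""


def parse_quantity(raw: str) -> int:
    # Accept only plain digit strings; anything else is an anomaly.
    text = raw.strip()
    if not text.isdigit():
        raise NormalizationError(f"unparseable quantity: {raw!r}")
    return int(text)


def normalize_records(
    records: list[dict[str, Any]],
    normalizers: dict[str, Callable[[Any], Any]],
) -> tuple[list[dict[str, Any]], list[tuple[dict[str, Any], str]]]:
    clean, rejected = [], []
    for record in records:
        try:
            # Apply the matching normalizer per field; pass other fields through.
            clean.append({
                key: normalizers[key](value) if key in normalizers else value
                for key, value in record.items()
            })
        except NormalizationError as exc:
            # Keep the original record and the reason so it can be inspected.
            rejected.append((record, str(exc)))
    return clean, rejected


clean, rejected = normalize_records(
    [{"sku": "A1", "qty": " 3 "}, {"sku": "B2", "qty": "three"}],
    {"qty": parse_quantity},
)
```

Here `clean` holds the first record with `qty` as the integer 3, while the second record lands in `rejected` together with its error message, so the unexpected format is visible at the point of extraction.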

Examples

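import re
from dateparser import parse as parse_date

def normalize_price(raw: str) -> float:
    # Keep only digits and separators: "USD 1,299" -> "1,299"
    cleaned = re.sub(r"[^\d.,]", "", raw)
    # A separator followed by one or two final digits is the decimal mark
    # ("$1,299.00", "1.299,00"); any other separator is a thousands grouping.
    # Assumes at most two decimal places, so "1.299" is read as 1299.
    match = re.fullmatch(r"([\d.,]*?)(?:[.,](\d{1,2}))?", cleaned)
    whole = re.sub(r"[.,]", "", match.group(1))
    fraction = match.group(2)
    if not whole and not fraction:
        raise ValueError(f"Cannot parse price: {raw!r}")
    return float(f"{whole or 0}.{fraction or 0}")

def normalize_date(raw: str) -> str:
    # dateparser returns None when it cannot interpret the string
    dt = parse_date(raw)
    if dt is None:
        raise ValueError(f"Cannot parse date: {raw!r}")
    return dt.date().isoformat()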
