Raw scraped text is inconsistent: prices appear as `$1,299.00`, `1299`, `USD 1,299`, or `1.299,00` depending on locale. Dates appear as `June 25, 2026`, `2026-06-25`, `25/06/26`, or relative strings like `2 hours ago`. Normalization converts these varied representations into a canonical form — a float for the price, an ISO 8601 string for the date — so that records can be compared, sorted, and aggregated.

Normalization steps typically include stripping whitespace and control characters, decoding HTML entities (for example `&amp;` to `&`), converting character encodings to UTF-8, parsing numbers by removing thousands separators and detecting locale-specific decimal notation, parsing dates with libraries such as `dateparser` that handle multiple formats, and resolving relative URLs to absolute ones.
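
The text-level steps in this list can be sketched with the Python standard library alone. This is a minimal illustration rather than a complete cleaner: the helper names `to_text`, `clean_text`, and `absolutize`, the sample strings, and the `https://example.com` base URL are all assumptions made for the example.

```python
import html
import unicodedata
from urllib.parse import urljoin


def to_text(data: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw bytes to str; raises UnicodeDecodeError on a mismatch."""
    return data.decode(declared_encoding)


def clean_text(raw: str) -> str:
    """Decode HTML entities, drop control characters, collapse whitespace."""
    text = html.unescape(raw)  # e.g. "&amp;" becomes "&"
    chars = []
    for ch in text:
        if ch.isspace():
            # Tabs, newlines, and non-breaking spaces all become plain spaces.
            chars.append(" ")
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            # Remaining control and format characters are dropped.
            continue
        else:
            chars.append(ch)
    # split() + join collapses runs of spaces and trims both ends.
    return " ".join("".join(chars).split())


def absolutize(page_url: str, href: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(page_url, href)


print(clean_text("  Caf&eacute;&nbsp;&amp; Bar\x00\n"))
print(absolutize("https://example.com/shop/list?page=2", "../item/42"))
```

Once decoded, Python strings are Unicode, so "converting to UTF-8" in this sketch amounts to decoding the scraped bytes with the encoding the page declares and encoding as UTF-8 when writing the result out.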

Normalizing data at the extract-transform boundary, rather than leaving it to downstream consumers, simplifies downstream logic. It also means that an anomaly, such as an unexpected format the normalizer cannot parse, can be surfaced immediately instead of silently producing wrong values, provided the normalizer raises or logs on failure rather than passing the raw value through.
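
One way to surface such anomalies at the boundary is to run every field through its parser and fail loudly if any of them cannot be handled. The sketch below is hypothetical: `NormalizationError`, `normalize_record`, `parse_plain_price`, and the `price` field name are assumptions for illustration, and the price parser is deliberately simplistic.

```python
from typing import Callable


class NormalizationError(ValueError):
    """Raised when one or more fields of a record cannot be normalized."""


def parse_plain_price(raw: str) -> float:
    # Deliberately simple: handles forms like "1299" or "1299.00" only.
    return float(raw.strip())


def normalize_record(
    record: dict[str, str],
    parsers: dict[str, Callable[[str], object]],
) -> dict[str, object]:
    """Apply one parser per field; raise if any field is missing or unparseable."""
    out: dict[str, object] = {}
    errors: list[str] = []
    for field, parser in parsers.items():
        try:
            out[field] = parser(record[field])
        except (KeyError, ValueError) as exc:
            # Collect every problem so one report covers the whole record.
            errors.append(f"{field}: {exc!r}")
    if errors:
        raise NormalizationError("; ".join(errors))
    return out


parsers = {"price": parse_plain_price}
print(normalize_record({"price": "1299.00"}, parsers))

try:
    normalize_record({"price": "n/a"}, parsers)
except NormalizationError as err:
    print("rejected:", err)
```

Collecting the errors per record, rather than stopping at the first one, is a design choice: it lets a pipeline log or quarantine the whole record with a complete list of what failed.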

Examples

import re
from dateparser import parse as parse_date

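def normalize_price(raw: str) -> float:
    # Keep only digits and separators; currency symbols and codes are dropped.
    cleaned = re.sub(r"[^\d.,]", "", raw)
    last_comma, last_dot = cleaned.rfind(","), cleaned.rfind(".")
    tail = cleaned[last_comma + 1:]
    # A lone comma group followed by exactly three digits ("1,299") is ambiguous;
    # this heuristic treats it as a thousands separator.
    if last_comma > last_dot and not (last_dot == -1 and len(tail) == 3):
        # Decimal comma ("1.299,00", "12,50"): dots are thousands separators.
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        # Decimal point or no decimals ("1,299.00", "1299"): drop commas.
        cleaned = cleaned.replace(",", "")
    return float(cleaned)  # raises ValueError if nothing parseable remains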

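def normalize_date(raw: str) -> str:
    dt = parse_date(raw)  # returns None when no format matches
    if dt is None:
        # Fail loudly instead of passing the raw string through as a "date".
        raise ValueError(f"unparseable date: {raw!r}")
    return dt.date().isoformat()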
