Raw scraped text is inconsistent: prices appear as `$1,299.00`, `1299`, `USD 1,299`, or `1.299,00` depending on locale. Dates appear as `June 25, 2026`, `2026-06-25`, `25/06/26`, or relative strings like `2 hours ago`. Normalization converts these varied representations into a canonical form — a float for the price, an ISO 8601 string for the date — so that records can be compared, sorted, and aggregated.
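As a rough illustration of the price case, the sketch below reduces the four example strings to one float. The function name `parse_price` and the separator heuristic are assumptions made for this example, not a standard API. A lone separator followed by exactly three digits is treated as a thousands separator, so an input like `1.299` is read as 1299, which will be wrong for some data.

```python
import re


def parse_price(raw: str) -> float:
    """Heuristically convert a scraped price string to a float."""
    # Drop currency symbols, codes and spaces; keep digits and separators.
    cleaned = re.sub(r"[^0-9.,]", "", raw)
    if not re.search(r"[0-9]", cleaned):
        raise ValueError(f"no digits in price: {raw!r}")

    last_dot, last_comma = cleaned.rfind("."), cleaned.rfind(",")
    if last_dot >= 0 and last_comma >= 0:
        # Both separators present: the one that occurs last is the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_dot >= 0 or last_comma >= 0:
        sep = "." if last_dot >= 0 else ","
        tail = cleaned.rsplit(sep, 1)[1]
        # Assumption: a repeated separator, or one followed by exactly
        # three digits, is a thousands separator rather than a decimal mark.
        is_grouping = cleaned.count(sep) > 1 or len(tail) == 3
        decimal_sep = None if is_grouping else sep
    else:
        decimal_sep = None

    if decimal_sep is None:
        integer_part, fraction = cleaned, ""
    else:
        integer_part, fraction = cleaned.rsplit(decimal_sep, 1)
    integer_part = re.sub(r"[.,]", "", integer_part)
    return float(f"{integer_part or '0'}.{fraction or '0'}")
```

Where exact cents matter, the same logic could return `decimal.Decimal` instead of a float. Passing the source locale explicitly, when it is known, is more reliable than guessing from the string.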
Normalization steps typically include: stripping whitespace and control characters, decoding HTML entities (for example `&amp;` to `&`), converting character encodings to UTF-8, parsing numbers by removing thousands separators and detecting locale-specific decimal notation, parsing dates with libraries like `dateparser` that handle multiple formats, and resolving relative URLs to absolute ones.
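The text and URL steps in this list can be sketched with the Python standard library alone. The helper names `decode_payload`, `clean_text` and `absolute_url` are hypothetical, and the example assumes the page's declared encoding is known to the caller.

```python
import html
import re
import unicodedata
from urllib.parse import urljoin


def decode_payload(payload: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode raw bytes to text; encode as UTF-8 again when writing out."""
    # errors="replace" keeps going on bad bytes; use "strict" to fail instead.
    return payload.decode(declared_encoding, errors="replace")


def clean_text(raw: str) -> str:
    """Decode entities, drop control characters, collapse whitespace."""
    text = html.unescape(raw)  # "&amp;" -> "&", "&nbsp;" -> U+00A0
    # Remove control characters (category Cc) that are not whitespace.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse every whitespace run, including non-breaking spaces.
    return re.sub(r"\s+", " ", text).strip()


def absolute_url(page_url: str, href: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(page_url, href)
```

For instance, `clean_text("  Blue&nbsp;Widget &amp; Case\t\n")` yields `Blue Widget & Case`, and `absolute_url("https://example.com/shop/list?page=2", "../item/42")` yields `https://example.com/item/42` (the `example.com` URLs are placeholders).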
Normalizing data at the extract-transform boundary, rather than leaving it to downstream consumers, simplifies downstream logic. It also means that any anomaly, such as an unexpected format that the normalizer cannot parse, is surfaced immediately rather than silently producing wrong values, provided the normalizer rejects such input instead of guessing.
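One way to make anomalies surface at the boundary is for the normalizer to raise on anything it does not recognize and for the pipeline to collect those rejects for inspection. The sketch below assumes a fixed list of accepted date formats and uses only the standard library; `NormalizationError`, `normalize_date` and `normalize_records` are names invented for this example, and the `%B` format relies on an English locale. A library such as `dateparser` could take the place of the format loop.

```python
from datetime import datetime

# Assumed whitelist of formats this scraper expects to encounter.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d/%m/%y")


class NormalizationError(ValueError):
    """Raised when a value matches none of the expected formats."""


def normalize_date(raw: str) -> str:
    """Return an ISO 8601 date string, or raise if the format is unknown."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    raise NormalizationError(f"unrecognized date format: {raw!r}")


def normalize_records(records):
    """Split records into normalized ones and (record, reason) rejects."""
    clean, rejected = [], []
    for record in records:
        try:
            clean.append({**record, "date": normalize_date(record["date"])})
        except NormalizationError as exc:
            rejected.append((record, str(exc)))
    return clean, rejected
```

In this sketch a relative string such as `2 hours ago` lands in `rejected` rather than being coerced to a guessed date, so the unexpected format is visible at extraction time and a handler for it can be added deliberately.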