Schema Validation

Schema validation checks that extracted data conforms to an expected structure and to its type constraints before it is written to a database or downstream system.

Web pages change without notice, and scraping pipelines can silently produce invalid data when a site restructures its HTML or changes field formats. Schema validation adds a checkpoint after the extraction step: each extracted record is checked against a defined schema (field names, data types, required fields, value ranges) and records that fail validation are quarantined for review rather than written to the destination.
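The checkpoint described above can be sketched in plain Python without committing to a particular library. This is a minimal illustration under stated assumptions: the `SCHEMA` layout, the `validate_record` and `partition` helpers, and the sample records are hypothetical names introduced here, and the type check is strict (no coercion), which a real pipeline may not want.

```python
from typing import Any

# Hypothetical schema: field name -> (expected type, required?, optional range check)
SCHEMA = {
    "name": (str, True, None),
    "price": (float, True, lambda v: v >= 0),
    "sku": (str, False, None),
}

def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, (expected_type, required, check) in SCHEMA.items():
        if field not in record or record[field] is None:
            if required:
                problems.append(f"{field}: missing required field")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        elif check is not None and not check(value):
            problems.append(f"{field}: value {value!r} out of range")
    # Field names that the schema does not know about are also reported
    for field in sorted(record.keys() - SCHEMA.keys()):
        problems.append(f"{field}: unexpected field")
    return problems

def partition(records):
    """Split records into those safe to write and those held for review."""
    valid, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            valid.append(record)
    return valid, quarantined

# Illustrative extraction output (made-up data)
records = [
    {"name": "Widget", "price": 19.99, "sku": "W-1"},
    {"name": "Gadget", "price": "N/A"},   # price format changed on the site
    {"price": -5.0},                      # name selector broke, price out of range
]
valid, quarantined = partition(records)
```

Only `valid` would be passed to the destination writer; `quarantined` keeps each rejected record together with the reasons it failed, so a reviewer can tell a selector problem from a format change.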

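Common schema validation tools for Python include Pydantic (declarative model validation with type coercion), Cerberus (schemas defined as plain dictionaries), and jsonschema (an implementation of the JSON Schema specification). For TypeScript pipelines, Zod provides runtime schema validation from which the matching static TypeScript types can be inferred.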

Schema validation failures are early-warning signals that the target site's structure has changed, allowing the scraping team to update selectors and field mappings before large volumes of bad data accumulate in the destination. Validation reports are a key component of data quality monitoring in production scraping systems.
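One way to turn validation failures into an early-warning signal is to summarize them per field and compare the failure rate with a threshold. The sketch below assumes each quarantined entry carries a list of problem strings prefixed with the field name; that format, the `summarize_failures` helper, and the 10% threshold are illustrative assumptions, not a standard.

```python
from collections import Counter

def summarize_failures(quarantined, total_records, alert_threshold=0.1):
    """Build a small validation report and flag a likely site change."""
    field_counts = Counter()
    for entry in quarantined:
        # Count each failing field once per record
        fields = {problem.split(":", 1)[0] for problem in entry["problems"]}
        field_counts.update(fields)
    failure_rate = len(quarantined) / total_records if total_records else 0.0
    return {
        "total": total_records,
        "failed": len(quarantined),
        "failure_rate": failure_rate,
        "by_field": dict(field_counts),
        # Assumed threshold; tune it to the pipeline's normal failure rate
        "alert": failure_rate >= alert_threshold,
    }

# Illustrative quarantine contents from one scraping run (made-up data)
quarantined = [
    {"record": {"name": "Gadget", "price": "N/A"},
     "problems": ["price: expected float, got str"]},
    {"record": {"price": "N/A"},
     "problems": ["name: missing required field", "price: expected float, got str"]},
]
report = summarize_failures(quarantined, total_records=10)
```

A sudden concentration of failures on a single field, such as `price` here, usually points at one broken selector or one changed format rather than at random noise.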

Examples

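from typing import Optional

from pydantic import BaseModel, HttpUrl, ValidationError

class ProductRecord(BaseModel):
    name: str
    price: float
    sku: Optional[str] = None  # optional; defaults to None when the field is absent
    url: HttpUrl

# Illustrative scraper output; in a real pipeline this comes from the extraction step
extracted_fields = {"name": "Widget", "price": "19.99", "url": "https://example.com/widget"}

try:
    # Validate extracted data; raises ValidationError if invalid
    record = ProductRecord(**extracted_fields)
except ValidationError as exc:
    # Report the failure instead of writing the record to the destination
    print(exc)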