Deduplication

Deduplication is the process of identifying and removing or merging duplicate records in a scraped dataset, ensuring each real-world entity appears exactly once.

Web scraping frequently produces duplicate records: the same product listed on multiple category pages, the same news article linked from several sources, or the same URL scraped twice because the site's link graph contains cycles. Deduplication removes or collapses these duplicates to produce a clean dataset.

Exact deduplication compares records by a unique key (URL, product SKU, canonical identifier) and keeps only one record per key. Fuzzy deduplication detects near-duplicate records, where small differences in text or formatting produce distinct strings that represent the same entity; for example, the same product name with different punctuation on two marketplace pages. Fuzzy matching uses techniques like MinHash, SimHash, or embedding similarity to identify near-duplicates.
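
As a rough illustration of the two approaches, the sketch below pairs key-based exact deduplication with a small MinHash implementation built only on the standard library. The function names, the character 3-gram shingling, the 64-hash signature, and the 0.8 threshold are all assumptions chosen for the example, not settings taken from any particular tool.

```python
import hashlib
import re


def dedupe_exact(records, key):
    """Keep the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


def shingles(text, n=3):
    """Character n-grams of the text, lowercased with punctuation collapsed."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return {normalized[i:i + n] for i in range(max(1, len(normalized) - n + 1))}


def minhash_signature(text, num_hashes=64):
    """For each salted hash function, record the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def dedupe_fuzzy(records, field, threshold=0.8):
    """Greedy pass: drop a record if it is too similar to one already kept."""
    kept = []  # pairs of (record, signature)
    for record in records:
        sig = minhash_signature(record[field])
        if all(estimated_similarity(sig, other) < threshold for _, other in kept):
            kept.append((record, sig))
    return [record for record, _ in kept]
```

Calling `dedupe_exact(records, "sku")` and then `dedupe_fuzzy(result, "name")` on a list of dicts collapses both repeated SKUs and names that differ only in case or punctuation. The greedy comparison is quadratic in the number of kept records, so a larger dataset would likely need an index such as locality-sensitive hashing on top of the signatures.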

At the crawl level, URL deduplication prevents revisiting already-scraped pages by storing a set of seen URLs (or a Bloom filter, which saves memory at the cost of a small false-positive rate) and skipping any URL already in the set.
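
To make the Bloom filter option concrete, here is a minimal sketch of one that could stand in for the seen-URL set. The class name, the default sizes, and the double-hashing scheme derived from a single SHA-256 digest are assumptions for illustration; a production crawler would size the filter from its expected URL count and acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array that answers 'possibly seen' or 'definitely not seen'."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Because it supports `add` and `in`, an instance can replace a plain `set` of seen URLs. A Bloom filter never reports an added URL as unseen, but it can occasionally report an unseen URL as seen, so a crawler using it may skip a small number of pages it has not actually visited.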

Examples

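# Simple URL deduplication with a set.
# fetch() and extract_links() stand in for the crawler's own download and
# link-extraction helpers; they are not defined here.
seen = set()                     # URLs already scraped
queue = ["https://example.com"]  # frontier of URLs still to visit

while queue:
    url = queue.pop()
    if url in seen:
        continue                 # duplicate: already scraped, skip it
    seen.add(url)
    html = fetch(url)
    new_links = extract_links(html, base=url)
    # Enqueue only unseen links so the frontier does not fill with duplicates.
    queue.extend(link for link in new_links if link not in seen)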