Without explicit domain and path filters, a web crawler will follow any link it encounters, including links to external sites, CDN subdomains, login portals, and auto-generated parameter variants that can produce millions of near-duplicate pages. The crawl then expands far beyond its intended scope, consuming bandwidth and storage while collecting irrelevant data.
Prevention requires explicit allow-list rules: follow only links that match the target domain and path prefix, reject links that use other URL schemes (such as `mailto:` or `ftp:`) or point to other subdomains unless those are explicitly included, and canonicalise query parameters (drop irrelevant ones and sort the rest) so that faceted search filters do not cause a combinatorial explosion of URLs.
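The allow-list and canonicalisation rules described here can be sketched with Python's standard `urllib.parse`. This is a minimal illustration, not a complete crawler policy: the host `example.com`, the `/docs/` prefix, and the names `in_scope`, `canonicalise`, and `IGNORED_PARAMS` are all assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical crawl scope; adjust to the real target site.
ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"example.com"}
ALLOWED_PATH_PREFIX = "/docs/"

# Hypothetical parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def canonicalise(url: str) -> str:
    """Drop ignored parameters and the fragment, then sort what remains."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lowercasing netloc also lowercases any userinfo; acceptable for a sketch.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )


def in_scope(url: str) -> bool:
    """Accept only URLs that match the scheme, exact host, and path prefix."""
    parts = urlsplit(url)
    return (
        parts.scheme.lower() in ALLOWED_SCHEMES
        and (parts.hostname or "") in ALLOWED_HOSTS
        and parts.path.startswith(ALLOWED_PATH_PREFIX)
    )
```

A crawler would typically canonicalise each discovered link first, test it with `in_scope`, and only then check it against the set of already-seen URLs, so that parameter variants of the same page collapse to a single entry.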
Common sources of scope creep include tracking parameters appended by analytics scripts (`?utm_source=...`), session IDs in URLs, calendar pages with infinite future dates, printer-friendly version URLs, and language variants that duplicate content. `robots.txt` Disallow rules indicate which URL patterns the scraper should exclude, while the URL sets listed in sitemaps indicate which pages the site intends to be crawled.
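As a rough sketch of using `robots.txt` Disallow rules as an exclusion filter, the standard-library `urllib.robotparser` can parse the rules and answer per-URL queries. The rules below, the user agent name `example-crawler`, and the helper names are hypothetical; a real crawler would fetch the site's actual `robots.txt` instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /calendar/
"""


def build_robots_filter(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch queries."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


robots = build_robots_filter(ROBOTS_TXT)


def allowed_by_robots(url: str, user_agent: str = "example-crawler") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)
```

With these assumed rules, printer-friendly pages under `/print/` and the unbounded `/calendar/` section are rejected before they are ever queued, while other paths remain eligible for the crawler's own scope checks.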