An XML sitemap follows the Sitemap Protocol (sitemaps.org) and is typically served at `/sitemap.xml` or advertised through a `Sitemap:` directive in `robots.txt`. Each `<url>` entry contains a required `<loc>` (the URL) and, optionally, `<lastmod>` (last modification date), `<changefreq>` (expected change frequency), and `<priority>` (relative importance from 0.0 to 1.0, defaulting to 0.5). Because a single sitemap file is limited to 50,000 URLs and 50 MB uncompressed, large sites use sitemap index files that reference multiple child sitemaps.
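To make the entry structure concrete, here is a minimal sketch that parses a `<urlset>` document with Python's standard-library `xml.etree.ElementTree`. The sample XML, the `example.com` URLs, and the function name `parse_urlset` are illustrative assumptions, not part of any particular scraper.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemap Protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_urlset(xml_text):
    """Return one dict per <url> entry; optional fields are None when absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # <loc> is required; skip malformed entries
        entries.append({
            "loc": loc,
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


entries = parse_urlset(SAMPLE)
```

Note that every element lives in the sitemap namespace, so lookups without the namespace mapping would find nothing.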
For scrapers, the sitemap is often the most efficient starting point for a comprehensive crawl: rather than discovering URLs by following links, the scraper downloads the sitemap and queues the listed URLs directly. This avoids the overhead of traversing the link graph and can surface pages that are not reachable through internal links. Coverage is only as complete as the sitemap itself, since the publisher decides which URLs to list.
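The queueing step might look like the sketch below, which walks a sitemap index, expands each child sitemap, and collects page URLs into a crawl queue. The `fetch` parameter is an assumed caller-supplied function that maps a URL to its XML text; here it is backed by an in-memory dictionary (`FAKE_SITE`, with hypothetical URLs) so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from collections import deque

SM_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SM_URI}


def collect_urls(sitemap_url, fetch):
    """Expand a sitemap (or sitemap index) into a flat list of page URLs.

    `fetch` is a caller-supplied function: URL -> XML as str or bytes.
    """
    pending = deque([sitemap_url])  # sitemap files still to download
    seen_sitemaps = set()           # guards against index loops
    page_urls = []
    while pending:
        current = pending.popleft()
        if current in seen_sitemaps:
            continue
        seen_sitemaps.add(current)
        root = ET.fromstring(fetch(current))
        if root.tag == "{%s}sitemapindex" % SM_URI:
            # Index file: each <sitemap><loc> points at a child sitemap.
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                if loc.text:
                    pending.append(loc.text.strip())
        else:
            # Regular sitemap: each <url><loc> is a page to crawl.
            for loc in root.findall("sm:url/sm:loc", NS):
                if loc.text:
                    page_urls.append(loc.text.strip())
    return page_urls


# In-memory stand-in for HTTP responses (hypothetical URLs).
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/post-1</loc></url>"
        "<url><loc>https://example.com/post-2</loc></url>"
        "</urlset>"
    ),
    "https://example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
}

queue = collect_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
```

In a real crawler, `fetch` would wrap an HTTP client and would likely also need to handle gzip-compressed sitemaps, timeouts, and rate limits, none of which are shown here.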
Sitemaps also reveal information about site structure: the `<lastmod>` field enables incremental scraping (fetching only pages changed since the last run), and the `<priority>` field suggests which pages the publisher considers most important.
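As a rough illustration of incremental scraping, the sketch below keeps only entries whose `<lastmod>` is later than the previous run and orders them by `<priority>`. The entry dictionaries, the `select_for_refetch` name, and the policy choices (refetch when `<lastmod>` is missing or unparseable, treat timezone-less values as UTC) are assumptions made for this example.

```python
from datetime import datetime, timezone

DEFAULT_PRIORITY = 0.5  # protocol default when <priority> is omitted


def parse_lastmod(value):
    """Parse a <lastmod> string into an aware datetime, or None on failure."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: date-only or naive values are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def priority_of(entry):
    """Return the entry's priority as a float, falling back to the default."""
    try:
        return float(entry.get("priority"))
    except (TypeError, ValueError):
        return DEFAULT_PRIORITY


def select_for_refetch(entries, last_run):
    """Keep entries changed after `last_run`, highest priority first."""
    selected = []
    for entry in entries:
        modified = parse_lastmod(entry.get("lastmod"))
        # Without a usable <lastmod>, refetch to stay on the safe side.
        if modified is None or modified > last_run:
            selected.append(entry)
    selected.sort(key=priority_of, reverse=True)
    return selected


# Hypothetical entries, shaped like the output of a sitemap parser.
EXAMPLE = [
    {"loc": "https://example.com/a", "lastmod": "2024-05-01", "priority": "0.8"},
    {"loc": "https://example.com/b", "lastmod": "2024-06-15T10:00:00Z", "priority": "0.3"},
    {"loc": "https://example.com/c", "lastmod": None, "priority": None},
    {"loc": "https://example.com/d", "lastmod": "2024-07-01", "priority": "1.0"},
]

to_fetch = select_for_refetch(EXAMPLE, datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Because date-only values are read as midnight UTC, a page changed later on the same day as the previous run could be skipped; a more cautious scheduler might compare calendar dates instead. Both fields are supplied by the publisher, so they are best treated as hints.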