XML Sitemap

An XML sitemap is a structured file that lists a website's important URLs, with optional metadata, so that search engine crawlers and custom scrapers can discover its public pages.

An XML sitemap follows the Sitemap Protocol specification and is typically served at `/sitemap.xml` or referenced from `robots.txt` with a `Sitemap:` line. Each `<url>` entry specifies a `<loc>` (the URL) and, optionally, `<lastmod>` (last modified date), `<changefreq>` (expected change frequency), and `<priority>` (relative importance, 0.0–1.0). A single sitemap file is limited to 50,000 URLs and 50 MB uncompressed, so large sites use sitemap index files that reference multiple child sitemaps.
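The entry structure described above can be read with the standard library alone. The following is a minimal sketch, not a complete implementation: the function name `parse_sitemap`, the returned dictionary keys, and the sample document are assumptions made for illustration.

```python
from xml.etree import ElementTree as ET

# Namespace used by the Sitemap Protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Hypothetical helper: return ("index", [child sitemap URLs])
    or ("urlset", [one dict per <url> entry])."""
    root = ET.fromstring(xml_text)

    # A sitemap index lists child sitemaps instead of pages
    if root.tag.endswith("}sitemapindex"):
        locs = [
            (loc.text or "").strip()
            for loc in root.findall("sm:sitemap/sm:loc", NS)
        ]
        return "index", locs

    entries = []
    for url in root.findall("sm:url", NS):
        priority = url.findtext("sm:priority", namespaces=NS)
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            # Optional fields are None when the publisher omits them
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": float(priority) if priority else None,
        })
    return "urlset", entries


# Illustrative sample data, not taken from a real site
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

kind, entries = parse_sitemap(SAMPLE_URLSET)
index_kind, children = parse_sitemap(SAMPLE_INDEX)
```

Returning the kind alongside the data lets the caller decide whether to fetch child sitemaps or to queue pages directly.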

For scrapers, the sitemap is often the most efficient starting point for a comprehensive crawl: rather than discovering URLs by following links, the scraper downloads the sitemap and queues the listed URLs directly. This avoids the overhead of traversing the link graph and covers pages that may not be reachable through internal links, provided the publisher has listed them.
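The queueing step can be sketched as follows, assuming the site may use a sitemap index. The function `collect_page_urls` and its `fetch` parameter (a callable that takes a URL and returns the XML text) are hypothetical; injecting `fetch` keeps the sketch independent of any particular HTTP client. The in-memory `FAKE_SITE` stands in for real responses.

```python
from collections import deque
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_page_urls(start_url, fetch):
    """Hypothetical helper: walk a sitemap (or sitemap index) and
    return the listed page URLs in order, without duplicates."""
    pending = deque([start_url])   # sitemaps still to download
    seen_sitemaps = set()
    pages, seen_pages = [], set()

    while pending:
        sitemap_url = pending.popleft()
        if sitemap_url in seen_sitemaps:
            continue               # guard against index loops
        seen_sitemaps.add(sitemap_url)

        root = ET.fromstring(fetch(sitemap_url))

        # Child sitemaps (present only in an index file)
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                pending.append(loc.text.strip())

        # Page URLs (present only in a regular sitemap)
        for loc in root.findall("sm:url/sm:loc", NS):
            page = (loc.text or "").strip()
            if page and page not in seen_pages:
                seen_pages.add(page)
                pages.append(page)

    return pages


# Illustrative in-memory "site" used instead of real HTTP requests
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/a.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/b.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/a.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/1</loc></url>"
        "<url><loc>https://example.com/2</loc></url>"
        "</urlset>"
    ),
    "https://example.com/b.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/2</loc></url>"
        "<url><loc>https://example.com/3</loc></url>"
        "</urlset>"
    ),
}

crawl_queue = collect_page_urls(
    "https://example.com/sitemap.xml", FAKE_SITE.__getitem__
)
```

In a real scraper, `fetch` could wrap an HTTP client call with a timeout and error handling; the resulting list can then be handed to whatever download queue the scraper already uses.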

Sitemaps also expose useful signals about the site: the `<lastmod>` field enables incremental scraping (fetching only pages changed since the last run), and the `<priority>` field suggests which pages the publisher considers most important. Both values are supplied by the publisher, so they are hints rather than guarantees.
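Incremental scraping with `<lastmod>` can be sketched as a filter over parsed entries. The helper names `parse_lastmod` and `changed_since`, and the choice to treat dates without a time zone as UTC, are assumptions for this example. Entries with a missing or unparseable `<lastmod>` are kept, on the assumption that refetching is safer than skipping.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Hypothetical helper: parse a <lastmod> string into an aware
    datetime, or return None when it is missing or unparseable."""
    if not value:
        return None
    try:
        # Older Python versions reject a trailing "Z", so normalize it
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: treat dates without an offset as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changed_since(entries, last_run):
    """Return the URLs whose lastmod is later than last_run.

    entries is an iterable of (loc, lastmod string or None) pairs.
    """
    selected = []
    for loc, lastmod in entries:
        modified = parse_lastmod(lastmod)
        if modified is None or modified > last_run:
            selected.append(loc)
    return selected


# Illustrative data, not taken from a real sitemap
last_run = datetime(2024, 1, 10, tzinfo=timezone.utc)
entries = [
    ("https://example.com/new", "2024-01-15"),
    ("https://example.com/old", "2024-01-01T00:00:00Z"),
    ("https://example.com/unknown", None),
    ("https://example.com/garbled", "not-a-date"),
]
to_fetch = changed_since(entries, last_run)
```

Because `<lastmod>` is only as accurate as the publisher makes it, a scraper that relies on this filter may still want an occasional full recrawl.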

Examples

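# Parse a sitemap.xml with Python
import requests
from xml.etree import ElementTree as ET

# Fetch the sitemap; fail fast on network stalls and HTTP errors
response = requests.get("https://example.com/sitemap.xml", timeout=10)
response.raise_for_status()

# Sitemap elements live in this XML namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(response.content)

# Collect every <loc>; in a sitemap index these are child sitemap URLs
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]
print(f"Found {len(urls)} URLs in sitemap")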