XML Sitemap

An XML sitemap is a structured file that lists a website's important URLs with optional metadata, helping search engine crawlers and custom scrapers discover the pages the publisher wants found.

An XML sitemap follows the Sitemaps protocol (sitemaps.org) and is typically served at `/sitemap.xml` or referenced from a `Sitemap:` line in `robots.txt`. Each `<url>` entry has a required `<loc>` (the URL) and may also carry `<lastmod>` (last modified date), `<changefreq>` (expected change frequency), and `<priority>` (relative importance, 0.0 to 1.0). A single sitemap file is limited to 50,000 URLs and 50 MB uncompressed, so large sites use sitemap index files that reference multiple child sitemaps.
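To make the entry structure concrete, here is a minimal sketch that reads each `<url>` entry and its optional fields. The sample XML and the `parse_entries` helper are made up for illustration; they are not taken from any real site or library.

```python
from xml.etree import ElementTree as ET

# Namespace used by documents that follow the Sitemaps protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, invented for this example
SAMPLE_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_entries(xml_text):
    """Return one dict per <url> entry; missing optional fields become None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": url.findtext("sm:priority", default=None, namespaces=NS),
        })
    return entries


entries = parse_entries(SAMPLE_XML)
for entry in entries:
    print(entry)
```

All values come back as strings, so a caller that wants numeric priorities or real dates has to convert them itself.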

For scrapers, the sitemap is often the most efficient starting point for a comprehensive crawl: rather than discovering URLs through link following, the scraper downloads the sitemap and directly queues all listed URLs. This avoids link-graph traversal overhead and can reach pages that are listed but not linked internally. Coverage is only as complete as the sitemap itself, since publishers choose which URLs to include.
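The queueing step can be sketched as follows, including the case where the starting file is a sitemap index. The `collect_urls` function, its `fetch` parameter, and the `max_sitemaps` cap are assumptions made for this example; `fetch` stands in for whatever HTTP client the scraper uses, which keeps the sketch runnable offline.

```python
from collections import deque
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def collect_urls(start_url, fetch, max_sitemaps=100):
    """Walk a sitemap (or sitemap index) and return every listed page URL.

    fetch is a caller-supplied function (hypothetical here) that takes a
    URL and returns the XML body as str or bytes.
    """
    pending = deque([start_url])  # sitemap files still to download
    seen = set()                  # guards against index files that loop
    page_urls = []
    while pending and len(seen) < max_sitemaps:
        sitemap_url = pending.popleft()
        if sitemap_url in seen:
            continue
        seen.add(sitemap_url)
        root = ET.fromstring(fetch(sitemap_url))
        if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
            # Index file: queue each child sitemap for a later pass
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                if loc.text:
                    pending.append(loc.text.strip())
        else:
            # Regular sitemap: collect the page URLs
            for loc in root.findall("sm:url/sm:loc", NS):
                if loc.text:
                    page_urls.append(loc.text.strip())
    return page_urls


# Offline usage with made-up documents in place of real HTTP responses
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f'<sitemapindex xmlns="{SITEMAP_NS}">'
        "<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-posts.xml": (
        f'<urlset xmlns="{SITEMAP_NS}">'
        "<url><loc>https://example.com/post-1</loc></url>"
        "<url><loc>https://example.com/post-2</loc></url>"
        "</urlset>"
    ),
}

queue = collect_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(queue)
```

In a real crawl, `fetch` could wrap `requests.get(url, timeout=10).content`; passing it in as a parameter is a design choice for testability, not a requirement of the protocol.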

Sitemaps also reveal information about site structure: the `<lastmod>` field enables incremental scraping (fetch only pages changed since the last run), and the `<priority>` field suggests which pages the publisher considers most important. Both values are publisher-supplied hints and may be stale or inaccurate.
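A minimal sketch of the incremental idea, assuming entries have already been extracted as `(loc, lastmod)` pairs. The helper names and sample values are hypothetical, and treating a missing or unparseable `<lastmod>` as "changed" is a conservative choice made for this example.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a <lastmod> string into an aware datetime, or None if unusable."""
    if not value:
        return None
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat
        text = text[:-1] + "+00:00"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: date-only or zone-less values are treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def changed_since(entries, last_run):
    """Return URLs modified after last_run, plus those with no usable date."""
    to_fetch = []
    for loc, lastmod in entries:
        modified = parse_lastmod(lastmod)
        if modified is None or modified > last_run:
            to_fetch.append(loc)
    return to_fetch


# Made-up entries for illustration
sample_entries = [
    ("https://example.com/a", "2024-05-01"),
    ("https://example.com/b", "2024-06-10T08:00:00Z"),
    ("https://example.com/c", None),
    ("https://example.com/d", "2024-01-01"),
]
last_run = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(changed_since(sample_entries, last_run))
```

Because `<lastmod>` is only a hint, a scraper that relies on it may still want an occasional full refresh.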

Examples

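# Parse a sitemap.xml with Python
import requests
from xml.etree import ElementTree as ET

# Fetch the sitemap; fail fast on network stalls and HTTP errors
response = requests.get("https://example.com/sitemap.xml", timeout=10)
response.raise_for_status()

# Sitemap elements live in this XML namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(response.content)

# Collect every <loc> value (page URLs, or child sitemaps in an index file)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", ns) if loc.text]
print(f"Found {len(urls)} URLs in sitemap")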
