
Sitemap

An XML file listing a website's URLs with optional metadata such as last modification date, helping crawlers discover and prioritise pages.

A sitemap is an XML file, following the Sitemap Protocol standard at sitemaps.org, that a website publishes at a conventional URL (typically `/sitemap.xml`) to list the pages it wants crawled along with optional metadata: when each page was last modified (`lastmod`), how frequently it changes (`changefreq`), and its priority relative to other pages on the same site (`priority`). The file tells search engine crawlers which URLs the site wants discovered. It is a hint rather than a guarantee that every listed page will be indexed.
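As a rough illustration of that structure, the sketch below parses a small sitemap with Python's standard library. The sample document, its URLs and values, and the `parse_urlset` helper are made up for this example; only the namespace and element names come from the Sitemap Protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap Protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample document; URLs and values are invented for illustration.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_urlset(xml_text):
    """Return one dict per <url> entry; optional fields are None when absent."""
    # Encode to bytes so the parser honours the XML encoding declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
            "priority": url.findtext("sm:priority", namespaces=NS),
        })
    return entries


entries = parse_urlset(SAMPLE_SITEMAP)
for entry in entries:
    print(entry["loc"], entry["lastmod"])
```

Only `loc` is required for each entry, so a parser should expect the other three fields to be missing on many real sitemaps.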

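For web scraping, a site's sitemap is a valuable discovery resource. Rather than recursively following every internal link from the homepage, which is slow, may miss orphaned pages, and risks falling into pagination traps, parsing the sitemap directly gives you the site's declared URL inventory in a structured, machine-readable format. That inventory is only as complete as the site makes it, so pages left out of the sitemap still have to be found by following links. Large sites split their URLs across several files and publish a sitemap index, a file that lists other sitemaps, which creates a tree structure. The protocol caps each file at 50,000 URLs and 50 MB uncompressed.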
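One way to walk that tree is sketched below, using only the standard library. The function names `fetch_bytes` and `collect_urls` are hypothetical, and the default fetcher assumes the server returns uncompressed XML (gzipped `.xml.gz` sitemaps would need an extra decompression step). The `fetch` parameter exists so the traversal can be exercised without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap Protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_bytes(url):
    """Download a URL and return the raw body (assumes uncompressed XML)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def collect_urls(sitemap_url, fetch=fetch_bytes, seen=None):
    """Return page URLs from a sitemap, following sitemap indexes recursively."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    if root.tag.endswith("sitemapindex"):
        # An index lists child sitemaps; descend into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, seen))
    else:
        # A regular sitemap (<urlset>) lists page URLs directly.
        for loc in root.findall("sm:url/sm:loc", NS):
            if loc.text:
                urls.append(loc.text.strip())
    return urls
```

Calling `collect_urls("https://example.com/sitemap.xml")` would fetch over the network; passing a custom `fetch` callable (for example, a dictionary lookup of canned responses) keeps the traversal testable offline.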

Sites with many dynamic pages often generate sitemaps programmatically from database records. E-commerce sites commonly publish separate product, category, and blog sitemaps split across multiple files. When scraping at scale, checking for a sitemap first (at `/sitemap.xml` or through a `Sitemap:` line in the site's `robots.txt`) can save significant crawling time and infrastructure cost.
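Sites often advertise their sitemap locations through `Sitemap:` lines in `robots.txt`, so a quick check there can reveal split sitemaps that do not live at `/sitemap.xml`. The sketch below extracts those lines from a robots.txt body that has already been downloaded. The helper name `sitemap_urls_from_robots` and the sample file contents are assumptions made for this example.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named in `Sitemap:` lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


# Hypothetical robots.txt contents, invented for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /cart
# Sitemaps are split by content type
Sitemap: https://example.com/sitemap-products.xml
sitemap: https://example.com/sitemap-blog.xml
"""

print(sitemap_urls_from_robots(SAMPLE_ROBOTS))
```

If the list comes back empty, falling back to `/sitemap.xml` and then to ordinary link crawling is a reasonable order of attempts.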

Examples

# Fetch and parse a sitemap
{
  "url": "https://example.com/sitemap.xml",
  "extract_schema": {
    "type": "object",
    "properties": {
      "urls": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}
