The Sitemap Protocol limits a single sitemap to 50,000 URLs and 50 MB uncompressed. Large websites with hundreds of thousands or millions of pages therefore use a sitemap index file: a sitemap whose entries are other sitemap files rather than page URLs. The index file is typically served at `/sitemap_index.xml` or referenced from `/sitemap.xml`.
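
To make the structure concrete, here is a minimal sketch that parses a small index document held in a string. The `example.com` URLs, the `SAMPLE_INDEX` constant, and the `child_sitemaps` helper are illustrative assumptions, not part of any real site or library.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the Sitemap Protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical index document; the example.com URLs are placeholders.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
"""

def child_sitemaps(xml_text):
    """Return the child sitemap URLs listed in a sitemap index document."""
    # Encode to bytes so the XML declaration's encoding is honoured by the parser.
    root = ET.fromstring(xml_text.encode("utf-8"))
    # In an index, each <sitemap> entry holds one <loc> pointing at a child file.
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS) if loc.text]
```

Calling `child_sitemaps(SAMPLE_INDEX)` yields the two child sitemap URLs in document order. A regular sitemap uses `<urlset>` and `<url>` in place of `<sitemapindex>` and `<sitemap>`.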

For scrapers, parsing a sitemap index requires a two-step process: first fetch and parse the index to discover the child sitemap URLs, then fetch and parse each child sitemap to collect the actual page URLs. Child sitemaps are often segmented by content type (products, blog posts, categories), date range, or alphabetically by URL.
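
Because child sitemaps are often segmented, a scraper may only need some of them. The following is a rough sketch of the two-step process with an optional filter on child sitemap URLs. The names `locs`, `collect_page_urls`, `fetch`, and `keep` are hypothetical, and `fetch` is assumed to be any callable that takes a URL and returns the response body as bytes.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the Sitemap Protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def locs(xml_bytes, parent):
    """Return the <loc> values found under each <parent> element ("sitemap" or "url")."""
    root = ET.fromstring(xml_bytes)
    return [el.text.strip() for el in root.findall(f"sm:{parent}/sm:loc", NS) if el.text]

def collect_page_urls(index_url, fetch, keep=lambda child_url: True):
    """Step 1: read the index. Step 2: read each child sitemap that `keep` accepts."""
    pages = []
    for child_url in locs(fetch(index_url), "sitemap"):
        # Skip segments the crawl does not need (e.g. anything but products).
        if keep(child_url):
            pages.extend(locs(fetch(child_url), "url"))
    return pages
```

With `requests`, `fetch` could be something like `lambda u: requests.get(u, timeout=10).content`. Passing `keep=lambda u: "products" in u` would restrict the crawl to child sitemaps whose URL contains "products", assuming the site names its segments that way.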

Google Search Console validates and displays sitemap index submission status, showing which child sitemaps have been processed and how many URLs were discovered. Scrapers building comprehensive crawls of large sites should always check `/sitemap.xml` first and follow the index structure if present.
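
As a small illustration of the "check `/sitemap.xml` first" advice, this sketch builds the list of locations to probe for a given site. `CANDIDATE_PATHS` and `candidate_sitemap_urls` are hypothetical names, and the list only covers the two paths mentioned in this entry; real sites may publish their sitemaps elsewhere.

```python
from urllib.parse import urljoin

# Paths mentioned in this entry, in the order they should be tried.
CANDIDATE_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]

def candidate_sitemap_urls(site_url):
    """Return absolute sitemap URLs to probe for a site, /sitemap.xml first."""
    # urljoin with a leading-slash path resolves against the site root,
    # so any page URL on the site can be passed in.
    return [urljoin(site_url, path) for path in CANDIDATE_PATHS]
```

A crawler could request each candidate in turn and hand the first one that returns valid XML to its sitemap parser.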

Examples

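import requests
from xml.etree import ElementTree as ET

# Namespace defined by the Sitemap Protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_tree(url):
    """Return all page URLs reachable from a sitemap or sitemap index."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    root = ET.fromstring(r.content)
    # A sitemap index lists its child sitemaps in <sitemap><loc> elements
    sitemaps = root.findall("sm:sitemap/sm:loc", NS)
    if sitemaps:
        urls = []
        for sm_url in sitemaps:
            # Recurse into each child sitemap and collect its page URLs
            urls.extend(parse_sitemap_tree(sm_url.text.strip()))
        return urls
    # Regular sitemap: page URLs live in <url><loc> elements
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]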
