Sitemap Index File

A sitemap index file is a sitemap that lists other sitemap files rather than individual URLs, enabling large sites to split their sitemap into manageable chunks.

The Sitemap Protocol limits a single sitemap to 50,000 URLs and 50 MB uncompressed. Websites with hundreds of thousands or millions of pages therefore split their URLs across several sitemaps and list those files in a sitemap index, whose entries are `<sitemap>` elements pointing at other sitemap files rather than `<url>` elements pointing at pages. The index file is often served at `/sitemap_index.xml` or referenced from `/sitemap.xml`.
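To make the 50,000-URL limit concrete, here is a minimal sketch of how a site might split its pages into chunks and list the resulting files in an index. The `example.com` URLs, the `sitemap-N.xml` naming scheme, and the helper names `chunk_urls` and `build_sitemap_index` are assumptions for illustration, not part of the protocol.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit from the Sitemap Protocol


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of page URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls):
    """Return sitemap index XML (bytes) listing the given child sitemap URLs."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Hypothetical site with 120,000 pages (illustrative URLs only)
pages = [f"https://example.com/page/{n}" for n in range(120_000)]
chunks = chunk_urls(pages)
# One child sitemap per chunk; the naming scheme is an assumption
children = [f"https://example.com/sitemap-{i + 1}.xml" for i in range(len(chunks))]
index_xml = build_sitemap_index(children)
```

With 120,000 pages this yields three child sitemaps (two full ones and one with the remaining 20,000 URLs), and the index lists one `<sitemap>` entry per child.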

For scrapers, parsing a sitemap index requires a two-step process: first fetch and parse the index to discover the child sitemap URLs, then fetch and parse each child sitemap to collect the actual page URLs. Child sitemaps are often segmented by content type (products, blog posts, categories), date range, or alphabetically by URL.
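The first step can be tried offline. The sketch below parses an index document held in a string and optionally keeps only the child sitemaps whose URL contains a keyword, which is useful when a crawl needs just one segment such as products. The index content, the file names, and the `child_sitemaps` helper are hypothetical; real sites name their segments however they like.

```python
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical index document; the file names are illustrative only
INDEX_XML = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products-2.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>"""


def child_sitemaps(index_xml, keyword=None):
    """Return child sitemap URLs from an index, optionally filtered by substring."""
    root = ET.fromstring(index_xml.encode("utf-8"))
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    if keyword is None:
        return locs
    return [loc for loc in locs if keyword in loc]


product_sitemaps = child_sitemaps(INDEX_XML, keyword="products")
```

Filtering on the URL is only a heuristic: it works when the site encodes the segment in the file name, which is common but not guaranteed.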

Google Search Console reports the status of a submitted sitemap index, showing which child sitemaps have been processed and how many URLs were discovered. Scrapers building comprehensive crawls of large sites should check `/sitemap.xml` first and follow the index structure if present.
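As a rough sketch of that check, a scraper can build the conventional sitemap locations for a site and then decide, from the root element of whatever it downloads, whether it is looking at an index or a regular sitemap. The helper names `candidate_sitemap_urls` and `sitemap_kind` are assumptions for illustration, and the two paths are only the conventional ones mentioned above.

```python
from urllib.parse import urljoin
from xml.etree import ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def candidate_sitemap_urls(site_url):
    """Conventional sitemap locations to try, in order, for a site."""
    return [urljoin(site_url, path) for path in ("/sitemap.xml", "/sitemap_index.xml")]


def sitemap_kind(xml_bytes):
    """Return 'index', 'urlset', or 'unknown' based on the document's root element."""
    root = ET.fromstring(xml_bytes)
    if root.tag == SITEMAP_NS + "sitemapindex":
        return "index"
    if root.tag == SITEMAP_NS + "urlset":
        return "urlset"
    return "unknown"
```

A crawler would fetch each candidate in turn, call `sitemap_kind` on the response body, and follow the child sitemaps only when the result is `"index"`.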

Examples

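import gzip

import requests
from xml.etree import ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_tree(url):
    """Return all page URLs reachable from a sitemap or sitemap index."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    data = r.content
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes, e.g. a .xml.gz sitemap
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    # A sitemap index lists child sitemaps in <sitemap><loc> elements
    sitemaps = root.findall("sm:sitemap/sm:loc", NS)
    if sitemaps:
        urls = []
        for sm_loc in sitemaps:
            if sm_loc.text:
                # Fetch and parse each child sitemap in turn
                urls.extend(parse_sitemap_tree(sm_loc.text.strip()))
        return urls
    # A regular sitemap lists pages in <url><loc> elements
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]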
