The Sitemap Protocol limits a single sitemap to at most 50,000 URLs and 50 MB uncompressed. Large websites with hundreds of thousands or millions of pages therefore use a sitemap index file: a sitemap whose entries are other sitemap files rather than page URLs. The index is typically served at `/sitemap_index.xml`, or `/sitemap.xml` may itself be the index.
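A scraper can tell an index from an ordinary sitemap by the root element: the protocol uses `<sitemapindex>` for an index and `<urlset>` for a list of pages. The following is a minimal sketch of that check; the function name `sitemap_kind` and the `example.com` URLs are illustrative assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET


def sitemap_kind(xml_text: str) -> str:
    """Return 'index', 'urlset', or 'unknown' based on the root element.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as '{namespace}localname',
    # so keep only the part after the closing brace.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        return "index"
    if local_name == "urlset":
        return "urlset"
    return "unknown"


# Made-up sample documents using the standard sitemap namespace.
INDEX_EXAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>"""

URLSET_EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
</urlset>"""

print(sitemap_kind(INDEX_EXAMPLE))
print(sitemap_kind(URLSET_EXAMPLE))
```

Checking the root tag rather than the file name matters because the index is not guaranteed to live at any particular path.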
For scrapers, parsing a sitemap index requires a two-step process: first fetch and parse the index to discover the child sitemap URLs, then fetch and parse each child sitemap to collect the actual page URLs. Child sitemaps are often segmented by content type (products, blog posts, categories), date range, or alphabetically by URL.
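The two-step process can be sketched roughly as follows, using only the Python standard library. The function names (`http_fetch`, `child_sitemaps`, `page_urls`, `crawl_index`) are hypothetical, and the gzip handling assumes that some child sitemaps are served compressed, which is common but not guaranteed. A production crawler would also want retries, rate limiting, and error handling that this sketch omits.

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List

# Namespace used by standard sitemap and sitemap index documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def http_fetch(url: str) -> bytes:
    """Download a sitemap, decompressing it if the body is gzip data."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    # Gzip streams begin with the magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data


def child_sitemaps(index_xml: bytes) -> List[str]:
    """Step 1: collect the <loc> of every <sitemap> entry in an index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:sitemap/sm:loc", NS) if loc.text]


def page_urls(sitemap_xml: bytes) -> List[str]:
    """Step 2: collect the <loc> of every <url> entry in a child sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


def crawl_index(index_url: str,
                fetch: Callable[[str], bytes] = http_fetch) -> Iterator[str]:
    """Yield page URLs from all child sitemaps listed in the index."""
    for child_url in child_sitemaps(fetch(index_url)):
        yield from page_urls(fetch(child_url))
```

Passing the fetcher as a parameter keeps the parsing logic testable without network access. For example, `crawl_index(url, fetch=fake_pages.__getitem__)` works with a dictionary `fake_pages` that maps URLs to XML bytes.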
Google Search Console validates and displays sitemap index submission status, showing which child sitemaps have been processed and how many URLs were discovered. Scrapers building comprehensive crawls of large sites should always check `/sitemap.xml` first and follow the index structure if present.