A page can be excluded from search engine indexes via two mechanisms: a `<meta name='robots' content='noindex'>` tag in the HTML `<head>`, or an `X-Robots-Tag: noindex` HTTP response header. Both tell compliant crawlers (Googlebot, Bingbot) that the page may be fetched but must not be indexed, so it will not appear in search results. Because the directive is delivered with the page itself, a crawler has to be able to fetch the page in order to see it.
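The two mechanisms above can be checked together with a few lines of Python. The following is a minimal sketch using only the standard library; the names `is_noindex` and `_RobotsMetaParser` are hypothetical, and the check is deliberately simplified: it looks only for the generic `robots` meta name and a plain `noindex` token, and ignores bot-specific variants such as a `googlebot` meta name or a user-agent prefix in the header value.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            self.directives.add(token.strip().lower())


def is_noindex(html, headers):
    """Return True if the HTML or the response headers carry `noindex`.

    `html` is the decoded page body; `headers` is a mapping of response
    header names to values (hypothetical inputs for this sketch).
    """
    parser = _RobotsMetaParser()
    parser.feed(html)
    if "noindex" in parser.directives:
        return True

    # Header names are case-insensitive, so compare them lowercased.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            tokens = {t.strip().lower() for t in value.split(",")}
            if "noindex" in tokens:
                return True
    return False
```

For example, `is_noindex(page_html, {"X-Robots-Tag": "noindex, nofollow"})` returns `True` regardless of what the HTML contains, because either signal on its own is enough.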
For web scrapers, `noindex` pages are interesting because they often hold content the site operator wants to keep out of search results while leaving it reachable for users: admin panels, cart pages, staging previews, or thin content variants. The directive is a crawler convention, not an access control, so nothing technically prevents a scraper from reading such a page. Responsible scrapers nonetheless respect it in the same spirit as robots.txt.
From an SEO standpoint, noindex pages should not be counted in a site's canonical content inventory. When auditing a site's crawlability, separating indexed from noindexed pages helps surface technical SEO issues such as canonicalisation conflicts, for example a noindexed page whose canonical link points at a different URL.
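An audit along these lines can be sketched as a simple partition of crawled pages. The `Page` record and `audit` function below are hypothetical, and the conflict rule is an assumption made for illustration: it flags a noindexed page whose `rel="canonical"` target differs from its own URL, which is one commonly cited case of mixed signals rather than a complete definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    noindex: bool
    canonical: Optional[str] = None  # rel="canonical" target, if any


def audit(pages):
    """Split pages into indexable and noindexed, and flag mixed signals."""
    report = {"indexable": [], "noindexed": [], "conflicts": []}
    for page in pages:
        if not page.noindex:
            # Only indexable pages belong in the content inventory.
            report["indexable"].append(page.url)
            continue
        report["noindexed"].append(page.url)
        # Assumed conflict rule: noindex plus a canonical pointing elsewhere.
        if page.canonical and page.canonical != page.url:
            report["conflicts"].append(page.url)
    return report
```

The `noindex` flag would come from whatever detection step the crawl already performs. Keeping the audit as a pure function over records makes it straightforward to run against a saved crawl without refetching anything.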