robots.txt

A plain-text file at a website's root that specifies which URL paths crawlers may or may not access.

The `robots.txt` file is a plain-text convention that websites place at their root URL (e.g. `https://example.com/robots.txt`) to communicate crawling directives to web robots. The file uses `User-agent` directives to target specific crawlers (or `*` for all crawlers) and `Disallow` and `Allow` rules to specify which URL paths the crawler should or should not access.
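
As a minimal sketch of how `User-agent` groups and path rules combine, the snippet below feeds a small rule set to Python's standard-library `urllib.robotparser`. The crawler names `ExampleBot` and `OtherBot`, the paths, and the helper `load_rules` are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for a named crawler, one for everyone else.
RULES = """\
User-agent: ExampleBot
Disallow: /drafts/

User-agent: *
Disallow: /internal/
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(RULES)

# ExampleBot matches its own group, so only /drafts/ is off limits for it.
print(rules.can_fetch("ExampleBot", "https://example.com/drafts/post"))
print(rules.can_fetch("ExampleBot", "https://example.com/internal/report"))

# Any other crawler falls back to the `*` group.
print(rules.can_fetch("OtherBot", "https://example.com/internal/report"))
print(rules.can_fetch("OtherBot", "https://example.com/drafts/post"))
```

With this parser, a crawler that matches a named group is judged by that group alone, so `ExampleBot` may fetch `/internal/` even though the `*` group disallows it.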

Critically, robots.txt is a convention, not a technical enforcement mechanism. Web servers do not automatically block requests to disallowed paths; the file is simply a published preference that well-behaved crawlers are expected to honour. Major search engine crawlers such as Googlebot and Bingbot follow its `Allow` and `Disallow` rules, but nothing technically prevents third-party scraping tools from requesting disallowed paths.
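
Because compliance happens on the crawler's side, a well-behaved client has to perform the check itself. The sketch below shows one way this might look. The function `fetch_if_allowed`, the stub `fake_fetch` (a stand-in for a real HTTP client so the example runs offline), and the `ExampleBot` name are all assumptions made for illustration.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def fetch_if_allowed(
    url: str,
    user_agent: str,
    rules: RobotFileParser,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Call `fetch` only when the parsed robots.txt rules permit the URL."""
    if not rules.can_fetch(user_agent, url):
        return None  # the crawler declines on its own; the server did nothing
    return fetch(url)


# Hypothetical stand-in for an HTTP client; it never touches the network.
def fake_fetch(url: str) -> str:
    return f"<contents of {url}>"


site_rules = RobotFileParser()
site_rules.parse(["User-agent: *", "Disallow: /admin/"])

print(fetch_if_allowed("https://example.com/about", "ExampleBot", site_rules, fake_fetch))
print(fetch_if_allowed("https://example.com/admin/users", "ExampleBot", site_rules, fake_fetch))

# Skipping the check still "works": robots.txt itself does not stop the request.
print(fake_fetch("https://example.com/admin/users"))
```

The last call illustrates the point of the paragraph above: the only thing standing between the client and a disallowed path is the client's own decision to check.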

The legal weight of robots.txt in scraping disputes has been debated in court. Accessing paths that robots.txt disallows may be treated as a violation of a site's terms of service, which can carry legal consequences depending on jurisdiction and use case. Reviewing robots.txt before scraping at scale is therefore a best practice for managing ethical and legal risk.

Examples

# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
# Crawl-delay is a non-standard extension; some crawlers (including Googlebot) ignore it
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
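
To see how a client might interpret the example above, the following sketch parses the same text with Python's standard-library `urllib.robotparser`. The crawler name `ExampleBot` and the sample paths are assumptions; other parsers can resolve overlapping `Allow` and `Disallow` rules differently from this one.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

example_rules = RobotFileParser()
example_rules.parse(EXAMPLE.splitlines())

# Check a few hypothetical paths against the parsed rules.
for path in ("/admin/users", "/private/notes", "/public/index.html", "/blog/"):
    allowed = example_rules.can_fetch("ExampleBot", "https://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")

print(example_rules.crawl_delay("ExampleBot"))  # Crawl-delay value for this agent
print(example_rules.site_maps())                # Sitemap URLs (Python 3.8+)
```

Paths that match no rule, such as `/blog/` here, are treated as allowed by default.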
