The `robots.txt` file is a plain-text convention that websites place at their root URL (e.g. `https://example.com/robots.txt`) to communicate crawling directives to web robots. The file uses `User-agent` directives to target specific crawlers (or `*` for all crawlers) and `Disallow` and `Allow` rules to specify which URL paths the crawler should or should not access.
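As a minimal sketch of how these directives are evaluated, the snippet below feeds a made-up rule set to Python's standard-library `urllib.robotparser`. The paths, the `ExampleBot` and `MyCrawler` names, and the `is_allowed` helper are assumptions for illustration only. The `Allow` line is deliberately placed before the broader `Disallow` line, because Python's parser applies the first matching rule while some crawlers apply the most specific one; this ordering gives the same answer under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# A crawler with no dedicated group falls back to the "*" group.
print(is_allowed("MyCrawler", "https://example.com/index.html"))
print(is_allowed("MyCrawler", "https://example.com/private/data.html"))
print(is_allowed("MyCrawler", "https://example.com/private/press/release.html"))
# ExampleBot has its own group, which disallows everything.
print(is_allowed("ExampleBot", "https://example.com/index.html"))
```

With these rules, `MyCrawler` may fetch the home page and the press release but not `/private/data.html`, while `ExampleBot` is excluded from the whole site.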
Critically, robots.txt is a convention, not a technical enforcement mechanism. Web servers do not automatically block requests to disallowed paths; the file is a published preference that well-behaved crawlers are expected to honour. Major search engine crawlers such as Googlebot and Bingbot follow its rules, but nothing technically prevents a third-party scraping tool from requesting a disallowed path.
The legal weight of robots.txt in the context of scraping has been debated in court. Accessing paths that robots.txt disallows may be treated as a violation of a site's terms of service, which can carry legal consequences depending on jurisdiction and use case. Reviewing robots.txt before scraping at scale is therefore a best practice for managing both ethical and legal risk.
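One way such a review might look in practice is to split a planned URL list into allowed and disallowed sets before any request is sent. The sketch below is illustrative only: `audit_urls`, the `research-crawler` user-agent string, and the sample rules and URLs are all hypothetical, and the rules are supplied inline so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def audit_urls(parser, user_agent, urls):
    """Split planned URLs into (allowed, disallowed) under the parsed rules."""
    allowed, disallowed = [], []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)
        else:
            disallowed.append(url)
    return allowed, disallowed


# Hypothetical rules, parsed from a list of lines instead of a live site.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /admin/"])

planned = [
    "https://example.com/products/1",
    "https://example.com/admin/users",
]
allowed, disallowed = audit_urls(parser, "research-crawler", planned)
print("allowed:", allowed)
print("disallowed:", disallowed)
```

Against a live site, the `parse(...)` call would be replaced by `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, which downloads and parses the file. Passing this check says only what the site has published as a preference; it does not settle the terms-of-service or legal questions discussed above.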