A plain-text file at a website's root that specifies which URL paths crawlers may or may not access.

robots.txt — Web Scraping Glossary

The `robots.txt` file is a plain-text convention that websites place at their root URL (e.g. `https://example.com/robots.txt`) to communicate crawling directives to web robots. The file uses `User-agent` directives to target specific crawlers (or `*` for all crawlers) and `Disallow` and `Allow` rules to specify which URL paths the crawler should or should not access.
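As a rough illustration of how these directives are read, the sketch below parses a small, made-up robots.txt with Python's standard-library `urllib.robotparser`. The file contents, the `ExampleBot` and `OtherBot` agent names, and the `is_allowed` helper are assumptions for the example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Allow: /private/reports/
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# ExampleBot has its own group; every other crawler falls back to "*".
print(is_allowed("ExampleBot", "https://example.com/private/reports/q1.html"))
print(is_allowed("ExampleBot", "https://example.com/private/notes.html"))
print(is_allowed("OtherBot", "https://example.com/private/reports/q1.html"))
print(is_allowed("OtherBot", "https://example.com/blog/"))
```

The `Allow` line is listed before the broader `Disallow` on purpose: Python's parser applies the first matching rule in file order, while some crawlers prefer the most specific match, and this ordering gives the same answer under both interpretations.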

Critically, robots.txt is a convention, not a technical enforcement mechanism. Web servers do not automatically block requests to disallowed paths; the file is simply a published preference that well-behaved crawlers are expected to honour. Major search engine crawlers such as Googlebot and Bingbot respect robots.txt rigorously. Third-party scraping tools, by contrast, are not technically prevented from accessing disallowed paths; whether they comply is up to whoever operates them.
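Because nothing on the server enforces the file, compliance has to be built into the crawler itself. The following is a minimal sketch of such a client-side gate, assuming the robots.txt text has already been downloaded. `load_rules`, `polite_fetch`, `fake_fetch`, and the `ExampleBot` agent name are hypothetical names for illustration, and the stub fetcher stands in for a real HTTP call.

```python
from typing import Callable, List, Optional
from urllib.robotparser import RobotFileParser

# Hypothetical rules, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_fetch(
    url: str,
    user_agent: str,
    rules: RobotFileParser,
    fetch: Callable[[str], str],
) -> Optional[str]:
    """Call fetch(url) only if the rules permit it; otherwise return None."""
    if not rules.can_fetch(user_agent, url):
        return None  # The crawler chooses to skip; the server would not stop it.
    return fetch(url)


rules = load_rules(ROBOTS_TXT)
requested: List[str] = []


def fake_fetch(url: str) -> str:
    """Stub standing in for a real HTTP request; records what was asked for."""
    requested.append(url)
    return f"<html>stub for {url}</html>"


allowed = polite_fetch("https://example.com/pricing", "ExampleBot", rules, fake_fetch)
blocked = polite_fetch("https://example.com/admin/users", "ExampleBot", rules, fake_fetch)
```

Removing the `can_fetch` check would let both requests through, which is exactly the sense in which robots.txt is a preference and not a barrier.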

The legal weight of robots.txt in the context of scraping has been debated in court and remains unsettled. Accessing paths that robots.txt disallows may be considered a violation of a site's terms of service, which can have legal implications depending on jurisdiction and use case. Reviewing robots.txt before scraping at scale is a best practice for ethical and legal risk management.

