
Web Crawler

A program that automatically navigates websites by following hyperlinks, systematically indexing pages across an entire domain.

A web crawler (also called a spider or bot) is a program that discovers and visits web pages by following hyperlinks from one page to the next. Unlike a scraper that fetches a single URL, a crawler maps and traverses an entire site, or even the whole web, by queuing the links it encounters and processing each queued URL in turn.
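The queue-and-visit loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: `crawl`, `LinkExtractor`, and the in-memory `SITE` dictionary are hypothetical names introduced here, and `fetch` is a caller-supplied function so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(root_url, fetch, max_depth=2):
    """Breadth-first crawl from root_url, staying on the root's host.

    fetch maps a URL to its HTML text, or None if the page is unavailable.
    Returns the visited URLs in the order they were processed.
    """
    root_host = urlparse(root_url).netloc
    queue = deque([(root_url, 0)])  # (url, depth) pairs waiting to be visited
    seen = {root_url}               # every URL ever queued, for deduplication
    visited = []

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != root_host:
                continue  # stay on the same host
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="https://other.test/">ext</a>',
    "https://example.com/b": '<a href="/">home</a> <a href="/a#top">A</a>',
}

pages = crawl("https://example.com/", SITE.get)
print(pages)
```

In a real crawler, `fetch` would wrap an HTTP client. Keeping it as a parameter makes the traversal logic easy to test on its own.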

Crawlers need to handle URL deduplication to avoid revisiting the same page, respect robots.txt exclusion rules, manage politeness delays to avoid overwhelming servers, handle redirects, and deal with session cookies and authentication. Large-scale crawlers distribute work across multiple machines and maintain persistent queues for fault tolerance.
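Three of these concerns (deduplication, robots.txt, and politeness delays) can be illustrated with Python's standard library. The sketch below is one possible approach under stated assumptions: `normalize`, `load_rules`, `Politeness`, the `ExampleBot` user agent, and the sample robots.txt text are all hypothetical, and the normalization rules shown are a simple subset of what a real crawler might apply.

```python
import time
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize(url):
    """Canonicalize a URL so trivially different spellings share one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: drops any user:password@ part
    # Keep the port only when it differs from the scheme's default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class Politeness:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, host):
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        if ready_at > now:
            self._sleep(ready_at - now)
            now = ready_at
        self._next_allowed[host] = now + self._delay


# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = load_rules(ROBOTS_TXT)
seen = set()
for candidate in ["https://Example.com/docs#intro", "https://example.com:443/docs"]:
    seen.add(normalize(candidate))  # both spellings collapse to one entry

print(seen)
print(rules.can_fetch("ExampleBot", "https://example.com/private/report"))
print(rules.crawl_delay("ExampleBot"))
```

A crawler would typically call `Politeness.wait(host)` just before each request, using the site's `Crawl-delay` value when one is published and a conservative default otherwise.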

Search engines like Google and Bing run massive crawlers to index the web. Developers use site-specific crawlers to map content structure, build datasets across hundreds of pages, or discover new URLs before scraping their content. For shallow crawling tasks, AlterLab's `/crawl` endpoint accepts a root URL and returns all discovered links within a domain.

Examples

# Discover same-domain links up to 2 levels deep from the root URL
{
  "url": "https://example.com",
  "depth": 2,
  "follow_links": true,
  "same_domain": true
}
