Web Crawler

A program that automatically navigates websites by following hyperlinks, systematically indexing pages across an entire domain.

A web crawler (also called a spider or bot) is a program that discovers and visits web pages by following hyperlinks from one page to the next. Where a scraper fetches a single URL, a crawler maps and traverses an entire site, or the whole web, by queuing every link it encounters and processing each URL in turn.
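The queue-and-follow loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `crawl` function, the `LinkExtractor` class, and the in-memory `SITE` dictionary are all hypothetical names, and `SITE.get` stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    queue = deque([start_url])   # frontier of URLs waiting to be processed
    seen = {start_url}           # every URL ever queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:         # treat a missing page as a dead link
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "https://example.com/b": '<a href="/missing">gone</a>',
}

order = crawl("https://example.com/", SITE.get)
print(order)
```

Using a first-in, first-out queue gives a breadth-first traversal, so pages closest to the root are visited first. Swapping `fetch` for a function that performs a real HTTP GET would turn the sketch into a working, if very basic, crawler.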

Crawlers need to handle URL deduplication to avoid revisiting the same page, respect robots.txt exclusion rules, manage politeness delays to avoid overwhelming servers, handle redirects, and deal with session cookies and authentication. Large-scale crawlers distribute work across multiple machines and maintain persistent queues for fault tolerance.
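Three of these concerns (deduplication, robots.txt, and politeness delays) can be illustrated with the Python standard library alone. The sketch below is one possible approach under stated assumptions: `normalize`, `PolitenessGate`, the `example-crawler` user agent, and the sample robots.txt rules are all invented for illustration, and the normalisation is deliberately simple (it ignores default ports, trailing slashes on deeper paths, and query-parameter order).

```python
import time
from urllib.parse import urldefrag, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string


def normalize(url):
    """Canonical form used as the deduplication key."""
    url, _fragment = urldefrag(url)      # "#section" never changes the page
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),            # host names are case-insensitive
        parts.path or "/",               # treat "" and "/" as the same path
        parts.query,
        "",
    ))


class PolitenessGate:
    """Enforces a minimum gap between consecutive requests to one host."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = {}           # host -> time of previous request

    def wait(self, host):
        now = time.monotonic()
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)
                now = time.monotonic()
        self.last_request[host] = now


# Sample rules parsed from a string; a real crawler would download
# /robots.txt from each host before requesting anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

candidates = [
    "https://example.com",
    "https://EXAMPLE.com/#top",          # duplicate of the first URL
    "https://example.com/docs",
    "https://example.com/docs#intro",    # duplicate of the previous URL
    "https://example.com/private/data",  # blocked by robots.txt
]

gate = PolitenessGate(delay_seconds=0.05)
seen = set()
allowed = []
for raw in candidates:
    url = normalize(raw)
    if url in seen:
        continue                         # already queued or visited
    seen.add(url)
    if not robots.can_fetch(USER_AGENT, url):
        continue                         # excluded by the site's rules
    gate.wait(urlsplit(url).netloc)      # pause before hitting the host again
    allowed.append(url)                  # a real crawler would fetch here

print(allowed)
```

Keying the delay by host rather than globally lets a crawler that covers several sites stay polite to each one without slowing down the whole run.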

Search engines like Google and Bing run massive crawlers to index the web. Developers use site-specific crawlers to map content structure, build datasets across hundreds of pages, or discover new URLs before scraping their content. For shallow crawling tasks, AlterLab's `/crawl` endpoint accepts a root URL and returns all discovered links within a domain.

Examples

# Discover all links from a domain
{
  "url": "https://example.com",
  "depth": 2,
  "follow_links": true,
  "same_domain": true
}
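
As a rough sketch, the request body above could be sent from Python as follows. Several details here are assumptions rather than documented behaviour: the base URL and `X-API-Key` header are borrowed from the scrape example further down this page, the `/v1/crawl` path is a guess at where the `/crawl` endpoint lives, and the response format is not described here. The request is built but not sent, so the snippet runs without a key.

```python
import json
import urllib.request

# ASSUMPTION: base URL and header are taken from the scrape example on this
# page; the "/crawl" path under the same base is a guess, not a confirmed route.
CRAWL_URL = "https://api.alterlab.io/v1/crawl"

payload = {
    "url": "https://example.com",
    "depth": 2,
    "follow_links": True,
    "same_domain": True,
}

request = urllib.request.Request(
    CRAWL_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"X-API-Key": "YOUR_KEY", "Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending is left commented out: it needs a real key, and the shape of the
# response is not documented here.
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
```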

Related Terms

Web Scraping

Automated extraction of data from websites using software tools that parse HTML and collect structured information at scale.

robots.txt

A plain-text file at a website's root specifying which URL paths crawlers may or may not access.

Sitemap

An XML file listing all URLs on a website with metadata like last modification date, helping crawlers discover and prioritise pages.

Rate Limiting

A server-side control that caps the number of requests accepted from a single IP or session within a time window, returning HTTP 429 when exceeded.

Session Management

The handling of cookies, authentication tokens, and browser state across multiple requests to maintain a continuous browsing session.

Crawl and extract data from any website

AlterLab returns clean, structured data from any public URL — no scraper infrastructure needed. Start free, no credit card required.

Your first scrape.
Sixty seconds.

$1 free credit — up to 5,000 scrapes. No credit card.
Just a POST request.

terminal

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'

No credit card required · $1 free credit, up to 5,000 scrapes · Balance never expires