
Concurrent Crawling

Concurrent crawling runs multiple URL fetches in parallel, balancing throughput against politeness by limiting simultaneous connections to any single domain.

A sequential crawler, which fetches one URL at a time, spends most of its time waiting on network responses and leaves bandwidth and CPU largely idle. Concurrent crawling keeps many requests in flight at once, which dramatically increases throughput. The key constraint is per-domain politeness: 50 concurrent requests are fine when they are spread across 50 different domains, but 50 concurrent requests to a single small website are an aggressive load that is likely to trigger rate limiting or anti-bot defences.
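The throughput gap can be illustrated with a small simulation. This is a minimal sketch, not a real crawler: `fake_fetch` is a hypothetical stand-in that sleeps for a fixed latency instead of making a network request, and the `site{i}.example` URLs are placeholders.

```python
import asyncio
import time


async def fake_fetch(url: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a network request: sleeps instead of doing I/O."""
    await asyncio.sleep(latency)
    return url


async def crawl_sequential(urls):
    # One request at a time: total time is roughly len(urls) * latency.
    return [await fake_fetch(u) for u in urls]


async def crawl_concurrent(urls):
    # All requests in flight together: total time is roughly one latency.
    return list(await asyncio.gather(*(fake_fetch(u) for u in urls)))


def timed(crawl, urls):
    """Run a crawl coroutine function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(crawl(urls))
    return results, time.perf_counter() - start


# Ten different (placeholder) domains, so no single host is hit more than once.
URLS = [f"https://site{i}.example/" for i in range(10)]

seq_results, seq_time = timed(crawl_sequential, URLS)
con_results, con_time = timed(crawl_concurrent, URLS)
print(f"sequential: {seq_time:.2f}s  concurrent: {con_time:.2f}s")
```

With real requests the gain depends on server latency and bandwidth, but the shape is the same: sequential time grows with the number of URLs, while concurrent time is bounded by the slowest response.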

Well-designed crawlers limit concurrency at two levels: a global semaphore caps the total number of concurrent fetches (say, 100), while per-domain semaphores cap simultaneous requests to any single host (say, 2–5). Scrapy exposes these controls through its `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` settings, with `DOWNLOAD_DELAY` adding a pause between consecutive requests to the same site.
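The global and per-domain semaphores described above could be combined roughly as follows. This is a sketch under stated assumptions rather than how Scrapy or any particular crawler does it: `PoliteFetcher` and `fake_fetch` are hypothetical names, the fetch is simulated with a short sleep, and the `in_flight`/`peak` counters exist only to make the limits observable.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

GLOBAL_LIMIT = 100     # total in-flight fetches (example figure from the text)
PER_DOMAIN_LIMIT = 2   # in-flight fetches per host (low end of the 2-5 range)


class PoliteFetcher:
    """Hypothetical wrapper that applies a global and a per-host concurrency cap."""

    def __init__(self, fetch, global_limit=GLOBAL_LIMIT,
                 per_domain_limit=PER_DOMAIN_LIMIT):
        self._fetch = fetch
        self._global = asyncio.Semaphore(global_limit)
        # One semaphore per host, created lazily the first time the host is seen.
        self._per_domain = defaultdict(
            lambda: asyncio.Semaphore(per_domain_limit))
        self.in_flight = defaultdict(int)  # current fetches per host
        self.peak = defaultdict(int)       # highest concurrency seen per host

    async def get(self, url):
        host = urlsplit(url).hostname or ""
        # Take the host slot first so that requests queued behind a busy
        # host do not occupy global slots while they wait.
        async with self._per_domain[host], self._global:
            self.in_flight[host] += 1
            self.peak[host] = max(self.peak[host], self.in_flight[host])
            try:
                return await self._fetch(url)
            finally:
                self.in_flight[host] -= 1


async def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def main():
    # Build the fetcher inside the running event loop.
    fetcher = PoliteFetcher(fake_fetch)
    urls = [f"https://{host}/page/{i}"
            for host in ("a.example", "b.example") for i in range(6)]
    pages = await asyncio.gather(*(fetcher.get(u) for u in urls))
    return pages, dict(fetcher.peak)


pages, peak = asyncio.run(main())
print(peak)
```

All twelve requests are submitted at once, yet neither host ever sees more than `PER_DOMAIN_LIMIT` of them in flight. In a real crawler the `fetch` callable would wrap an HTTP client, and the limits would be tuned per target.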

Low per-domain concurrency is often essential for compatibility with the target site: many sites issue anti-bot challenges once a session exceeds some number of requests per second. Staying below that threshold, even at the cost of lower throughput, is preferable to being blocked.
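A concurrency cap alone does not bound the request rate, because fast responses free slots quickly. One way to stay under a per-second threshold is to space out request starts per host. The sketch below assumes a hypothetical `DomainThrottle` helper and an arbitrary limit of 20 requests per second; the real threshold for a given site is usually unknown and has to be chosen conservatively.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class DomainThrottle:
    """Hypothetical helper that spaces request starts per host.

    Each host gets its own schedule of start slots, one every
    1 / max_per_second seconds.
    """

    def __init__(self, max_per_second: float):
        self._interval = 1.0 / max_per_second
        self._next_slot = defaultdict(float)  # host -> earliest next start time

    async def wait(self, url: str) -> None:
        host = urlsplit(url).hostname or ""
        now = time.monotonic()
        # Reserve the next free slot. There is no await between the read and
        # the write, so concurrent tasks cannot claim the same slot.
        slot = max(now, self._next_slot[host])
        self._next_slot[host] = slot + self._interval
        await asyncio.sleep(slot - now)


async def main():
    throttle = DomainThrottle(max_per_second=20)  # assumed threshold
    starts = []

    async def fetch(url):
        await throttle.wait(url)
        starts.append(time.monotonic())  # a real fetch would begin here

    # Five requests to the same placeholder host, submitted at once.
    await asyncio.gather(*(fetch("https://a.example/p") for _ in range(5)))
    return sorted(starts)


starts = asyncio.run(main())
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
print([round(g, 2) for g in gaps])
```

The five requests are released roughly 0.05 seconds apart instead of all at once. In practice this throttle would be combined with the per-domain concurrency cap, since the two limit different things: simultaneous connections versus requests per unit of time.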

