infrastructure

Concurrency

Concurrency in scraping refers to the number of requests or browser sessions running in parallel, which directly controls throughput and resource consumption.

Concurrency determines how many scraping operations execute simultaneously. Higher concurrency means more data per unit time, but also more memory, CPU, and network bandwidth consumed, and a greater likelihood of triggering rate limits or anti-bot detections on the target site.

Scraping frameworks implement concurrency using threads (Python threading), async I/O (asyncio, Twisted), or process pools. Async I/O is especially efficient for I/O-bound scraping because thousands of in-flight HTTP requests can be managed with minimal thread overhead. Browser-based scraping is limited by the memory footprint of each browser instance (typically 100–300 MB per Chrome process).

Scraping platforms like AlterLab enforce per-account concurrency limits based on the billing tier. Requests above the concurrency cap are queued rather than rejected, ensuring predictable throughput without overloading the worker pool.

Examples

# asyncio: scrape 10 URLs concurrently
import asyncio, httpx

async def fetch(client, url):
    r = await client.get(url)
    return {"url": url, "status": r.status_code}

async def main(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[fetch(client, u) for u in urls])

results = asyncio.run(main(urls[:10]))