
How to Rotate Proxies and Solve CAPTCHAs at Scale in 2026
Learn how to rotate proxies and solve CAPTCHAs at scale without getting blocked. Practical patterns for building resilient scraping pipelines in 2026.
April 6, 2026
The Problem With Building Your Own Proxy Infrastructure
You spin up a scraper. It works for 200 requests. Then the target site starts returning 403s. Then CAPTCHAs. Then your IP gets banned entirely.
The standard response: buy a proxy list, wire up rotation logic, add CAPTCHA solving, implement retry logic with exponential backoff, handle fingerprinting, manage session cookies, and deal with geographic targeting. That is a full-time infrastructure project before you scrape a single page of useful data.
Most teams do not need to build this. They need data. This guide covers the patterns that work for production scraping at scale, with code you can deploy today.
How Bot Detection Works in 2026
Modern anti-bot systems layer multiple signals. Understanding what they check tells you what your scraper needs to handle.
IP reputation. Datacenter IP ranges are flagged. AWS, GCP, and Azure subnets are well-known. Sites maintain blocklists updated in real time from threat intelligence feeds.
TLS fingerprinting. The JA3/JA4 hash of your TLS handshake reveals your HTTP client. Python requests, Go net/http, and raw curl each produce distinct fingerprints. Headless browsers have their own signatures.
Browser fingerprinting. Canvas rendering, WebGL vendor strings, font enumeration, audio context, and screen resolution combine into a fingerprint that distinguishes automated browsers from real users.
Behavioral analysis. Mouse movement patterns, scroll velocity, click timing, and navigation sequences are tracked. Bots move in straight lines. Humans do not.
CAPTCHA challenges. reCAPTCHA v3 runs silently and scores each visitor. hCaptcha and Cloudflare Turnstile present visual puzzles when your score drops below threshold.
Rate limiting. Request frequency per IP, per session, and per account is tracked. Burst patterns trigger blocks faster than steady traffic.
A scraper that handles only one of these signals will fail. You need a system that addresses all of them simultaneously.
Proxy Rotation: The Right Way
Why Single-IP Scraping Fails
Every request from the same IP to the same domain creates a pattern. After a threshold, the target site flags the IP. The threshold varies: some sites block after 50 requests per minute, others after 500 per hour. E-commerce sites during product launches are the most aggressive.
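Even before proxies enter the picture, you can avoid tripping per-IP thresholds by enforcing a client-side sliding window. A minimal sketch, using the 50-requests-per-minute figure above as an example threshold:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Blocks until a request slot is free within the rolling window."""

    def __init__(self, max_requests: int = 50, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def acquire(self):
        while True:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            while self.timestamps and now - self.timestamps[0] >= self.window:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(now)
                return
            # Sleep until the oldest request leaves the window, then re-check
            time.sleep(self.window - (now - self.timestamps[0]))

limiter = SlidingWindowLimiter(max_requests=50, window_seconds=60.0)
# Call limiter.acquire() before each request to the target domain.
```

Tune the numbers per target; the point is that your scraper, not the site, decides when to slow down.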
Residential vs Datacenter vs Mobile Proxies
Datacenter proxies are cheap and fast but get flagged quickly. Residential proxies route through real ISP connections and blend in with normal traffic. Mobile proxies use carrier networks and have the highest success rate but cost more and add latency.
The practical approach: start with residential proxies, escalate to mobile only when the target requires it.
Implementing Proxy Rotation
If you manage your own proxy pool, rotation logic looks like this:
```python
import requests
from collections import deque

class ProxyRotator:
    def __init__(self, proxy_list: list[str]):
        self.proxies = deque(proxy_list)

    def rotate(self) -> dict:
        self.proxies.rotate(1)
        proxy = self.proxies[0]
        return {"http": proxy, "https": proxy}

    def remove_bad(self, proxy: str):
        if proxy in self.proxies:
            self.proxies.remove(proxy)

rotator = ProxyRotator([
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
])

for url in urls_to_scrape:
    proxy = rotator.rotate()
    try:
        response = requests.get(url, proxies=proxy, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException:
        rotator.remove_bad(proxy["http"])
```

This works for simple cases. It does not handle TLS fingerprinting, browser fingerprinting, CAPTCHAs, or behavioral detection. For those, you need a higher-level solution.
CAPTCHA Solving at Scale
Types of CAPTCHAs You Will Encounter
reCAPTCHA v2. The classic "I'm not a robot" checkbox. Requires solving image puzzles or selecting objects. Solvable via third-party services with 15-30 second latency.
reCAPTCHA v3. Runs silently. Returns a score from 0.0 to 1.0. Sites set a threshold, usually 0.5. Below that, you get blocked or redirected. No puzzle to solve, just a score to maintain.
hCaptcha. Similar to reCAPTCHA v2 but with different image sets. Popular among sites that want an alternative to Google's ecosystem.
Cloudflare Turnstile. Newer, privacy-focused. Presents challenges based on browser behavior rather than image puzzles. Harder to bypass without a real browser environment.
Custom CAPTCHAs. Some sites build their own. Math problems, text recognition, or logic puzzles. These require custom solving logic.
The CAPTCHA Solving Pipeline
A production CAPTCHA solver works in four stages: detect that a challenge was served, extract its parameters (site key and page URL), submit them to a solving service and wait for a token, then inject the token and resubmit the request.
Building this pipeline yourself means integrating with solving services like 2Captcha or Anti-Captcha, handling their APIs, managing solve latency, and retrying on failures. Each CAPTCHA type requires different extraction and submission logic.
Using a Managed Anti-Bot Bypass
The alternative is using a service that handles all of this automatically. The anti-bot bypass API detects CAPTCHAs, solves them, and returns the page content without your code needing to know a CAPTCHA existed.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://example.com/protected-page")
print(response.text)
```

The request goes through a headless browser environment with rotating residential IPs, realistic TLS fingerprints, and automatic CAPTCHA solving. You get back the rendered HTML or clean JSON. No proxy management, no CAPTCHA integration, no fingerprint handling.
Production Scraping Patterns
Exponential Backoff with Jitter
Blind retries make blocking worse. Use exponential backoff with random jitter to avoid thundering herd problems.
```python
import time
import random
import requests

def scrape_with_backoff(url: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException:
            if attempt == max_retries - 1:
                raise
            wait = min(2 ** attempt + random.uniform(0, 1), 60)
            time.sleep(wait)
```

The jitter prevents multiple scraper instances from retrying simultaneously. The cap at 60 seconds prevents unbounded waits.
Session Management
Maintaining sessions reduces detection risk. Reusing cookies and connection pools looks more like a real browser than opening fresh connections for every request.
```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
})

# First request establishes session
response = session.get("https://example.com/login")
csrf_token = extract_csrf(response.text)

# Second request reuses cookies and connection
response = session.post("https://example.com/login", data={
    "csrf_token": csrf_token,
    "username": "user",
    "password": "pass",
})
```

Parallel Scraping with Rate Limits
Scraping one page at a time is slow. Scraping 100 pages at once gets you blocked. The middle ground: controlled concurrency with per-domain rate limiting.
```python
import asyncio
import aiohttp
from asyncio import Semaphore

async def scrape_with_limit(urls: list[str], max_concurrent: int = 5):
    semaphore = Semaphore(max_concurrent)

    async def scrape_one(session: aiohttp.ClientSession, url: str):
        async with semaphore:
            try:
                timeout = aiohttp.ClientTimeout(total=15)
                async with session.get(url, timeout=timeout) as resp:
                    return await resp.text()
            except Exception:
                return None

    async with aiohttp.ClientSession() as session:
        tasks = [scrape_one(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
```

The semaphore caps concurrent requests. Adjust max_concurrent based on the target site's tolerance. Start at 3, monitor response codes, increase gradually.
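The semaphore limits total concurrency but treats every domain the same. To also enforce a minimum spacing between requests to any single domain, a small per-domain limiter can sit alongside it (a sketch; the one-second interval is illustrative):

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforces a minimum delay between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request: dict[str, float] = {}
        self.locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str):
        domain = urlparse(url).netloc
        lock = self.locks.setdefault(domain, asyncio.Lock())
        async with lock:
            elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self.last_request[domain] = time.monotonic()
```

Call `await limiter.wait(url)` just before each request; requests to different domains proceed without blocking each other.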
When to Use Headless Browsers
Some sites require JavaScript execution to render content. Static HTTP requests return empty pages or login walls. Headless browsers handle this but add complexity.
Use headless browsers when:
- Content renders client-side via React, Vue, or Angular
- The site requires JavaScript to set authentication cookies
- You need to interact with the page (click, scroll, fill forms)
- The site uses WebSocket connections for data
Skip headless browsers when:
- The page returns complete HTML on initial load
- You only need API response data (intercept network calls instead)
- Speed is critical and the target serves server-rendered HTML
Running headless browsers at scale means managing browser instances, memory limits, and crash recovery yourself. A managed web scraping API handles that lifecycle automatically.
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://example.com/js-heavy-page",
    "render_js": true,
    "wait_for": ".content-loaded",
    "formats": ["json"]
  }'
```

The render_js parameter spins up a headless browser. wait_for pauses until a CSS selector appears. formats returns structured data instead of raw HTML.
Monitoring and Alerting
Scraping pipelines fail silently. A site changes its layout, adds a new CAPTCHA type, or starts blocking your proxy range. Your scraper returns empty results and nobody notices for weeks.
Set up monitoring at three levels:
Response codes. Track 200 vs 403 vs 429 ratios. A spike in 403s means your proxies are burning. A spike in 429s means you are hitting rate limits.
Content validation. Check that responses contain expected data. If you scrape product pages, verify each response has a price, title, and description. Empty fields mean the page structure changed.
Latency tracking. Sudden latency increases often mean the site added new anti-bot checks or your proxies are routing through congested nodes.
```python
import requests
import logging

logger = logging.getLogger("scraper.monitor")

def scrape_and_validate(url: str) -> dict:
    response = requests.get(url, timeout=15)
    result = {
        "url": url,
        "status_code": response.status_code,
        "has_price": "price" in response.text,
        "has_title": "<title>" in response.text,
        "content_length": len(response.text),
        "latency_ms": response.elapsed.total_seconds() * 1000,
    }
    if not result["has_price"] or not result["has_title"]:
        logger.warning("Content validation failed: %s", url)
    return result
```

Log these metrics to your monitoring system. Set alerts on threshold breaches. Catching a block within minutes saves hours of missed data.
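To turn the status-code signal into an alert, keep a rolling window of recent responses and flag when the block ratio crosses a threshold. A sketch; the 20% threshold and 100-request window are starting points, not magic numbers:

```python
from collections import deque

class BlockRateMonitor:
    """Tracks recent status codes and flags a rising block rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2):
        self.codes: deque[int] = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code: int):
        self.codes.append(status_code)

    def block_ratio(self) -> float:
        if not self.codes:
            return 0.0
        blocked = sum(1 for c in self.codes if c in (403, 429))
        return blocked / len(self.codes)

    def should_alert(self) -> bool:
        # Wait for a reasonably full window before alerting
        return len(self.codes) >= 20 and self.block_ratio() > self.threshold

monitor = BlockRateMonitor()
# Call monitor.record(response.status_code) after each scrape;
# if monitor.should_alert(): page the on-call or pause the crawl.
```

Distinguishing 403s (burned proxies) from 429s (rate limits) in the alert message tells you whether to rotate harder or slow down.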
Cost Considerations
Running your own scraping infrastructure has hidden costs:
- Proxy subscriptions: $50-500/month depending on pool size and type
- CAPTCHA solving: $2-5 per 1000 solves
- Server costs for headless browser instances
- Engineering time for maintenance and debugging
- Data loss from undetected failures
A managed API converts these variable costs into a predictable per-request cost. You pay for successful scrapes, not for infrastructure that might fail. Check pricing to model costs against your expected volume.
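To make the comparison concrete, here is a back-of-the-envelope model using the ranges above. Every default value is illustrative, including the managed per-request price; plug in your own quotes:

```python
def diy_monthly_cost(requests_per_month: int,
                     proxy_subscription: float = 200.0,  # mid-range residential plan
                     captcha_rate: float = 0.05,         # fraction of requests hitting a CAPTCHA
                     solve_cost_per_1000: float = 3.0,
                     server_cost: float = 100.0,         # headless browser instances
                     eng_hours: float = 10.0,            # monthly maintenance time
                     hourly_rate: float = 75.0) -> float:
    captcha_cost = requests_per_month * captcha_rate * solve_cost_per_1000 / 1000
    return proxy_subscription + captcha_cost + server_cost + eng_hours * hourly_rate

def managed_monthly_cost(requests_per_month: int,
                         cost_per_request: float = 0.002) -> float:
    return requests_per_month * cost_per_request

# Example: at 500k requests/month, DIY lands around $1,125 with these
# defaults, dominated by engineering time rather than proxy spend.
```

The crossover point depends heavily on volume and CAPTCHA rate, which is exactly why it is worth modeling rather than guessing.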
Takeaway
Building resilient scraping infrastructure requires handling proxy rotation, CAPTCHA solving, TLS fingerprinting, browser fingerprinting, rate limiting, and session management. Each layer adds complexity and failure modes.
The practical path: use a managed API that handles anti-bot detection, proxy rotation, and CAPTCHA solving automatically. Focus your engineering time on data processing and pipeline reliability, not on staying one step ahead of bot detection systems.
Start with the quickstart guide to get an API key and run your first scrape. The documentation covers advanced parameters for JavaScript rendering, scheduling, and structured data extraction.