
How to Rotate Proxies and Solve CAPTCHAs at Scale in 2026
Learn how to rotate proxies and solve CAPTCHAs at scale without getting blocked. Practical patterns for building resilient scraping pipelines in 2026.
April 6, 2026
The Problem With Building Your Own Proxy Infrastructure
You spin up a scraper. It works for 200 requests. Then the target site starts returning 403s. Then CAPTCHAs. Then your IP gets banned entirely.
The standard response: buy a proxy list, wire up rotation logic, add CAPTCHA solving, implement retry logic with exponential backoff, handle fingerprinting, manage session cookies, and deal with geographic targeting. That is a full-time infrastructure project before you scrape a single page of useful data.
Most teams do not need to build this. They need data. This guide covers the patterns that work for production scraping at scale, with code you can deploy today.
How Bot Detection Works in 2026
Modern anti-bot systems layer multiple signals. Understanding what they check tells you what your scraper needs to handle.
IP reputation. Datacenter IP ranges are flagged. AWS, GCP, and Azure subnets are well-known. Sites maintain blocklists updated in real time from threat intelligence feeds.
TLS fingerprinting. The JA3/JA4 hash of your TLS handshake reveals your HTTP client. Python requests, Go net/http, and raw curl each produce distinct fingerprints. Headless browsers have their own signatures.
Browser fingerprinting. Canvas rendering, WebGL vendor strings, font enumeration, audio context, and screen resolution combine into a fingerprint that distinguishes automated browsers from real users.
Behavioral analysis. Mouse movement patterns, scroll velocity, click timing, and navigation sequences are tracked. Bots move in straight lines. Humans do not.
CAPTCHA challenges. reCAPTCHA v3 runs silently and scores each visitor. hCaptcha and Cloudflare Turnstile present visual puzzles when your score drops below threshold.
Rate limiting. Request frequency per IP, per session, and per account is tracked. Burst patterns trigger blocks faster than steady traffic.
A scraper that handles only one of these signals will fail. You need a system that addresses all of them simultaneously.
Proxy Rotation: The Right Way
Why Single-IP Scraping Fails
Every request from the same IP to the same domain creates a pattern. After a threshold, the target site flags the IP. The threshold varies: some sites block after 50 requests per minute, others after 500 per hour. E-commerce sites during product launches are the most aggressive.
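Even before proxies enter the picture, you can avoid tripping per-IP thresholds by enforcing a client-side sliding window. A minimal sketch, using the 50-requests-per-minute figure above as an example threshold:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Blocks until a request slot is free within the rolling window."""

    def __init__(self, max_requests: int = 50, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps: deque[float] = deque()

    def acquire(self):
        while True:
            now = time.monotonic()
            # Drop timestamps that have aged out of the window
            while self.timestamps and now - self.timestamps[0] >= self.window:
                self.timestamps.popleft()
            if len(self.timestamps) < self.max_requests:
                self.timestamps.append(now)
                return
            # Sleep until the oldest request leaves the window, then re-check
            time.sleep(self.window - (now - self.timestamps[0]))

limiter = SlidingWindowLimiter(max_requests=50, window_seconds=60.0)
# Call limiter.acquire() before each request to the target domain.
```

Tune the numbers per target; the point is that your scraper, not the site, decides when to slow down.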
Residential vs Datacenter vs Mobile Proxies
Datacenter proxies are cheap and fast but get flagged quickly. Residential proxies route through real ISP connections and blend in with normal traffic. Mobile proxies use carrier networks and have the highest success rate but cost more and add latency.
The practical approach: start with residential proxies, escalate to mobile only when the target requires it.
Implementing Proxy Rotation
If you manage your own proxy pool, rotation logic looks like this:
```python
import requests
from collections import deque

class ProxyRotator:
    def __init__(self, proxy_list: list[str]):
        self.proxies = deque(proxy_list)

    def rotate(self) -> dict:
        self.proxies.rotate(1)
        proxy = self.proxies[0]
        return {"http": proxy, "https": proxy}

    def remove_bad(self, proxy: str):
        if proxy in self.proxies:
            self.proxies.remove(proxy)

rotator = ProxyRotator([
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
])

for url in urls_to_scrape:
    proxy = rotator.rotate()
    try:
        response = requests.get(url, proxies=proxy, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException:
        rotator.remove_bad(proxy["http"])
```

This works for simple cases. It does not handle TLS fingerprinting, browser fingerprinting, CAPTCHAs, or behavioral detection. For those, you need a higher-level solution.
CAPTCHA Solving at Scale
Types of CAPTCHAs You Will Encounter
reCAPTCHA v2. The classic "I'm not a robot" checkbox. Requires solving image puzzles or selecting objects. Solvable via third-party services with 15-30 second latency.
reCAPTCHA v3. Runs silently. Returns a score from 0.0 to 1.0. Sites set a threshold, usually 0.5. Below that, you get blocked or redirected. No puzzle to solve, just a score to maintain.
hCaptcha. Similar to reCAPTCHA v2 but with different image sets. Popular among sites that want an alternative to Google's ecosystem.
Cloudflare Turnstile. Newer, privacy-focused. Presents challenges based on browser behavior rather than image puzzles. Harder to bypass without a real browser environment.
Custom CAPTCHAs. Some sites build their own. Math problems, text recognition, or logic puzzles. These require custom solving logic.
The CAPTCHA Solving Pipeline
A production CAPTCHA solver works in four stages: detect that a challenge was served, extract its parameters (site key and page URL), submit them to a solving service and wait for a token, then inject the token and resubmit the request.
Building this pipeline yourself means integrating with solving services like 2Captcha or Anti-Captcha, handling their APIs, managing solve latency, and retrying on failures. Each CAPTCHA type requires different extraction and submission logic.
Using a Managed Anti-Bot Bypass
The alternative is using a service that handles all of this automatically. The anti-bot bypass API detects CAPTCHAs, solves them, and returns the page content without your code needing to know a CAPTCHA existed.
```python
import alterlab

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape("https://example.com/protected-page")
print(response.text)
```

The request goes through a headless browser environment with rotating residential IPs, realistic TLS fingerprints, and automatic CAPTCHA solving. You get back the rendered HTML or clean JSON. No proxy management, no CAPTCHA integration, no fingerprint handling.
Production Scraping Patterns
Exponential Backoff with Jitter
Blind retries make blocking worse. Use exponential backoff with random jitter to avoid thundering herd problems.
```python
import time
import random
import requests

def scrape_with_backoff(url: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException:
            if attempt == max_retries - 1:
                raise
            wait = min(2 ** attempt + random.uniform(0, 1), 60)
            time.sleep(wait)
```

The jitter prevents multiple scraper instances from retrying simultaneously. The cap at 60 seconds prevents unbounded waits.
Session Management
Maintaining sessions reduces detection risk. Reusing cookies and connection pools looks more like a real browser than opening fresh connections for every request.
```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
})

# First request establishes session
response = session.get("https://example.com/login")
csrf_token = extract_csrf(response.text)

# Second request reuses cookies and connection
response = session.post("https://example.com/login", data={
    "csrf_token": csrf_token,
    "username": "user",
    "password": "pass",
})
```

Parallel Scraping with Rate Limits
Scraping one page at a time is slow. Scraping 100 pages at once gets you blocked. The middle ground: controlled concurrency with per-domain rate limiting.
```python
import asyncio
import aiohttp
from asyncio import Semaphore

async def scrape_with_limit(urls: list[str], max_concurrent: int = 5):
    semaphore = Semaphore(max_concurrent)

    async def scrape_one(session: aiohttp.ClientSession, url: str):
        async with semaphore:
            try:
                timeout = aiohttp.ClientTimeout(total=15)
                async with session.get(url, timeout=timeout) as resp:
                    return await resp.text()
            except Exception:
                return None

    async with aiohttp.ClientSession() as session:
        tasks = [scrape_one(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
```

The semaphore caps concurrent requests. Adjust max_concurrent based on the target site's tolerance. Start at 3, monitor response codes, increase gradually.
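The semaphore limits total concurrency but treats every domain the same. To also enforce a minimum spacing between requests to any single domain, a small per-domain limiter can sit alongside it (a sketch; the one-second interval is illustrative):

```python
import asyncio
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforces a minimum delay between requests to the same domain."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_request: dict[str, float] = {}
        self.locks: dict[str, asyncio.Lock] = {}

    async def wait(self, url: str):
        domain = urlparse(url).netloc
        lock = self.locks.setdefault(domain, asyncio.Lock())
        async with lock:
            elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self.last_request[domain] = time.monotonic()
```

Call `await limiter.wait(url)` just before each request; requests to different domains proceed without blocking each other.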
When to Use Headless Browsers
Some sites require JavaScript execution to render content. Static HTTP requests return empty pages or login walls. Headless browsers handle this but add complexity.
Use headless browsers when:
- Content renders client-side via React, Vue, or Angular
- The site requires JavaScript to set authentication cookies
- You need to interact with the page (click, scroll, fill forms)
- The site uses WebSocket connections for data
Skip headless browsers when:
- The page returns complete HTML on initial load
- You only need API response data (intercept network calls instead)
- Speed is critical and the target serves server-rendered HTML
Running headless browsers at scale means managing browser instances, memory limits, and crash recovery yourself. A managed web scraping API handles that lifecycle automatically.
```bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -d '{
    "url": "https://example.com/js-heavy-page",
    "render_js": true,
    "wait_for": ".content-loaded",
    "formats": ["json"]
  }'
```

The render_js parameter spins up a headless browser. wait_for pauses until a CSS selector appears. formats returns structured data instead of raw HTML.
Monitoring and Alerting
Scraping pipelines fail silently. A site changes its layout, adds a new CAPTCHA type, or starts blocking your proxy range. Your scraper returns empty results and nobody notices for weeks.
Set up monitoring at three levels:
Response codes. Track 200 vs 403 vs 429 ratios. A spike in 403s means your proxies are burning. A spike in 429s means you are hitting rate limits.
Content validation. Check that responses contain expected data. If you scrape product pages, verify each response has a price, title, and description. Empty fields mean the page structure changed.
Latency tracking. Sudden latency increases often mean the site added new anti-bot checks or your proxies are routing through congested nodes.
```python
import requests
import logging

logger = logging.getLogger("scraper.monitor")

def scrape_and_validate(url: str) -> dict:
    response = requests.get(url, timeout=15)
    result = {
        "url": url,
        "status_code": response.status_code,
        "has_price": "price" in response.text,
        "has_title": "<title>" in response.text,
        "content_length": len(response.text),
        "latency_ms": response.elapsed.total_seconds() * 1000,
    }
    if not result["has_price"] or not result["has_title"]:
        logger.warning("Content validation failed: %s", url)
    return result
```

Log these metrics to your monitoring system. Set alerts on threshold breaches. Catching a block within minutes saves hours of missed data.
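To turn the status-code signal into an alert, keep a rolling window of recent responses and flag when the block ratio crosses a threshold. A sketch; the 20% threshold and 100-request window are starting points, not magic numbers:

```python
from collections import deque

class BlockRateMonitor:
    """Tracks recent status codes and flags a rising block rate."""

    def __init__(self, window_size: int = 100, threshold: float = 0.2):
        self.codes: deque[int] = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code: int):
        self.codes.append(status_code)

    def block_ratio(self) -> float:
        if not self.codes:
            return 0.0
        blocked = sum(1 for c in self.codes if c in (403, 429))
        return blocked / len(self.codes)

    def should_alert(self) -> bool:
        # Wait for a reasonably full window before alerting
        return len(self.codes) >= 20 and self.block_ratio() > self.threshold

monitor = BlockRateMonitor()
# Call monitor.record(response.status_code) after each scrape;
# if monitor.should_alert(): page the on-call or pause the crawl.
```

Distinguishing 403s (burned proxies) from 429s (rate limits) in the alert message tells you whether to rotate harder or slow down.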
Cost Considerations
Running your own scraping infrastructure has hidden costs:
- Proxy subscriptions: $50-500/month depending on pool size and type
- CAPTCHA solving: $2-5 per 1000 solves
- Server costs for headless browser instances
- Engineering time for maintenance and debugging
- Data loss from undetected failures
A managed API converts these variable costs into a predictable per-request cost. You pay for successful scrapes, not for infrastructure that might fail. Check pricing to model costs against your expected volume.
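To make the comparison concrete, here is a back-of-the-envelope model using the ranges above. Every default value is illustrative, including the managed per-request price; plug in your own quotes:

```python
def diy_monthly_cost(requests_per_month: int,
                     proxy_subscription: float = 200.0,  # mid-range residential plan
                     captcha_rate: float = 0.05,         # fraction of requests hitting a CAPTCHA
                     solve_cost_per_1000: float = 3.0,
                     server_cost: float = 100.0,         # headless browser instances
                     eng_hours: float = 10.0,            # monthly maintenance time
                     hourly_rate: float = 75.0) -> float:
    captcha_cost = requests_per_month * captcha_rate * solve_cost_per_1000 / 1000
    return proxy_subscription + captcha_cost + server_cost + eng_hours * hourly_rate

def managed_monthly_cost(requests_per_month: int,
                         cost_per_request: float = 0.002) -> float:
    return requests_per_month * cost_per_request

# Example: at 500k requests/month, DIY lands around $1,125 with these
# defaults, dominated by engineering time rather than proxy spend.
```

The crossover point depends heavily on volume and CAPTCHA rate, which is exactly why it is worth modeling rather than guessing.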
Takeaway
Building resilient scraping infrastructure requires handling proxy rotation, CAPTCHA solving, TLS fingerprinting, browser fingerprinting, rate limiting, and session management. Each layer adds complexity and failure modes.
The practical path: use a managed API that handles anti-bot detection, proxy rotation, and CAPTCHA solving automatically. Focus your engineering time on data processing and pipeline reliability, not on staying one step ahead of bot detection systems.
Start with the quickstart guide to get an API key and run your first scrape. The documentation covers advanced parameters for JavaScript rendering, scheduling, and structured data extraction.