
Playwright Anti-Bot Detection: What Actually Works in 2026


Yash Dubey

February 19, 2026

You picked Playwright because it is fast, has a clean API, and supports all major browsers. But within five minutes of scraping a real target, you hit a Cloudflare challenge page. Your headless browser is getting detected, and no amount of page.wait_for_load_state("networkidle") is going to fix it.

This guide covers the specific detection vectors that flag Playwright, what stealth techniques actually work in 2026, and how to handle the major anti-bot providers (Cloudflare, DataDome, PerimeterX) with working Python code.

  • 87% -- sites using at least one anti-bot service
  • <2 sec -- average bot detection time
  • 12+ -- fingerprint vectors checked
  • 3x -- detection rate increase since 2024

Why Playwright Gets Detected

Playwright is not invisible. Out of the box, it leaves a trail of signals that anti-bot systems check in milliseconds. Understanding these signals is the first step to avoiding them.

The navigator.webdriver Flag

Every Playwright browser instance sets navigator.webdriver to true. The property is defined by the W3C WebDriver spec, and Chromium sets it whenever it runs under automation. Anti-bot scripts check this property first because it is the cheapest detection method available.

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    # This returns True in default Playwright — instant detection
    is_bot = page.evaluate("() => navigator.webdriver")
    print(f"webdriver flag: {is_bot}")  # True
    browser.close()

Headless Browser Tells

Headless Chromium differs from headed Chrome in dozens of subtle ways. Anti-bot systems check for:

  • Missing plugins: navigator.plugins is empty in headless mode. Real Chrome reports PDF Viewer, Chrome PDF Plugin, etc.
  • Missing WebGL renderer: Headless Chrome uses SwiftShader as its GPU renderer. Real browsers report actual GPU hardware like "ANGLE (NVIDIA GeForce RTX 3080)".
  • Screen dimensions: Headless defaults to 800x600 with 0 values for screen.availHeight and screen.availWidth.
  • Missing permissions API behavior: Notification.permission returns unexpected values in headless mode.
  • Chrome runtime objects: Real Chrome injects window.chrome with runtime, loadTimes, and csi objects. Headless Chrome is missing or has incomplete versions of these.
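A quick way to audit these tells is to evaluate a probe inside the page and inspect the results from Python. This is a sketch: the `HEADLESS_PROBE` script and the `looks_headless` helper are illustrative names, not a standard API, and the checks mirror only the signals listed above.

```python
HEADLESS_PROBE = """() => ({
    plugins: navigator.plugins.length,
    webgl: (() => {
        const canvas = document.createElement('canvas');
        const gl = canvas.getContext('webgl');
        if (!gl) return null;
        const ext = gl.getExtension('WEBGL_debug_renderer_info');
        return gl.getParameter(ext ? ext.UNMASKED_RENDERER_WEBGL
                                   : gl.RENDERER);
    })(),
    availWidth: screen.availWidth,
    availHeight: screen.availHeight,
})"""

def looks_headless(probe: dict) -> list:
    """Return the headless tells present in a probe result."""
    tells = []
    if probe.get("plugins", 0) == 0:
        tells.append("no plugins")
    if "SwiftShader" in str(probe.get("webgl") or ""):
        tells.append("software WebGL renderer")
    if not probe.get("availWidth") or not probe.get("availHeight"):
        tells.append("zeroed screen dimensions")
    return tells

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(looks_headless(page.evaluate(HEADLESS_PROBE)))
        browser.close()
```

Run it against your own configuration before a target does: an empty list means these particular tells are patched, not that you are undetectable.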

JavaScript Fingerprinting

Modern anti-bot systems build a fingerprint from 50+ browser properties and compare it against known profiles. Playwright's fingerprint is distinctive:

python
# What anti-bot scripts collect (simplified)
fingerprint_checks = """
() => ({
    webdriver: navigator.webdriver,
    plugins: navigator.plugins.length,
    languages: navigator.languages,
    platform: navigator.platform,
    hardwareConcurrency: navigator.hardwareConcurrency,
    deviceMemory: navigator.deviceMemory,
    webgl: (() => {
        const canvas = document.createElement('canvas');
        const gl = canvas.getContext('webgl');
        return gl ? gl.getParameter(gl.RENDERER) : null;
    })(),
    chrome: !!window.chrome,
    permissions: typeof navigator.permissions,
})
"""

If any of these values look like a default automation tool, you are flagged before the page even finishes loading.

Detection Vectors Specific to Playwright

Beyond generic headless detection, Playwright has its own unique fingerprint that anti-bot vendors specifically target.

Feature | Default Playwright | Real Browser
navigator.webdriver | true | false
navigator.plugins.length | 0 | 5+
WebGL renderer | SwiftShader | GPU hardware
window.chrome.runtime | missing | present
Notification.permission | denied | prompt
CDP detection | possible | no
Consistent TLS fingerprint | no | yes

CDP Protocol Leak

Playwright communicates with the browser through Chrome DevTools Protocol (CDP). Some anti-bot scripts detect this by checking for the presence of CDP-related runtime objects or by measuring timing anomalies introduced by the protocol layer.
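One widely circulated check exploits the fact that a CDP client with the Runtime domain enabled serializes objects passed to console calls, which fires property getters that an unattended browser would not touch. The sketch below runs that check from Playwright; treat it as illustrative, since its reliability varies across Chrome and Playwright versions.

```python
CDP_CHECK = """() => {
    let touched = false;
    const err = new Error();
    Object.defineProperty(err, 'stack', {
        get() { touched = true; return ''; },
    });
    // If a CDP client is attached with Runtime enabled, serializing
    // the error for the console preview reads .stack and fires the
    // getter; in a plain browser nothing inspects it.
    console.debug(err);
    return touched;
}"""

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("about:blank")
        print("CDP visible:", page.evaluate(CDP_CHECK))
        browser.close()
```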

TLS Fingerprinting (JA3/JA4)

This is the hardest to fix. When your browser makes an HTTPS connection, the TLS handshake includes a unique ordering of cipher suites, extensions, and supported curves. Anti-bot services like Cloudflare fingerprint this handshake (JA3/JA4 hash) and compare it against known browser signatures.

Playwright's Chromium binary has a JA3 fingerprint that does not match any real Chrome release. This alone can get you blocked before any JavaScript even runs.
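You cannot patch the handshake from Python, but you can avoid the bundled Chromium entirely: Playwright's `channel` launch option drives an installed Google Chrome binary, whose TLS stack is by definition the one real Chrome ships. A sketch, assuming Chrome is installed on the machine:

```python
# Launch options that keep the network stack closest to real Chrome.
# channel="chrome" uses the installed Google Chrome binary instead of
# the bundled Chromium build and its mismatched JA3 fingerprint.
LAUNCH_OPTIONS = {
    "channel": "chrome",
    "headless": False,  # headed mode also removes headless-only tells
}

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(**LAUNCH_OPTIONS)
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()
```

This narrows the TLS gap but does not remove the other detection layers Cloudflare stacks on top, such as JavaScript challenges.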

Stealth Techniques That Work

playwright-stealth

The playwright-stealth package patches the most common detection vectors. It is not a silver bullet, but it is the minimum viable starting point.

python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-first-run",
            "--no-default-browser-check",
        ],
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = context.new_page()
    stealth_sync(page)

    page.goto("https://bot.sannysoft.com")
    page.screenshot(path="stealth_test.png")
    browser.close()

What playwright-stealth patches:

  • Sets navigator.webdriver to false
  • Fakes navigator.plugins and navigator.mimeTypes
  • Patches chrome.runtime to look like a real Chrome extension environment
  • Fixes Notification.permission behavior
  • Overrides navigator.permissions.query responses

What it does not fix: WebGL fingerprint, TLS fingerprint, CDP detection, or behavioral analysis.
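The WebGL gap can be narrowed with an init script that intercepts `getParameter` for the unmasked vendor and renderer queries. A sketch: the spoofed strings are illustrative, and they must stay consistent with the rest of your claimed fingerprint (user agent, platform, screen) or the mismatch becomes a tell of its own.

```python
# 0x9245 / 0x9246 are UNMASKED_VENDOR_WEBGL / UNMASKED_RENDERER_WEBGL
# from the WEBGL_debug_renderer_info extension.
WEBGL_SPOOF = """
(() => {
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function (param) {
        if (param === 0x9245) return 'Google Inc. (NVIDIA)';
        if (param === 0x9246)
            return 'ANGLE (NVIDIA, NVIDIA GeForce RTX 3080 Direct3D11)';
        return getParameter.call(this, param);
    };
})();
"""

if __name__ == "__main__":
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Init scripts run before any page script on every navigation.
        page.add_init_script(WEBGL_SPOOF)
        page.goto("https://example.com")
        browser.close()
```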

Browser Context Hardening

Beyond stealth patches, your browser context configuration matters. Here is a hardened context that covers most fingerprint vectors:

python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def create_stealth_context(playwright):
    browser = playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--disable-dev-shm-usage",
            "--disable-accelerated-2d-canvas",
            "--disable-gpu-sandbox",
            "--no-first-run",
            "--no-zygote",
        ],
    )

    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        screen={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
        color_scheme="light",
        has_touch=False,
        is_mobile=False,
        java_script_enabled=True,
        extra_http_headers={
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "sec-ch-ua": (
                '"Chromium";v="124", '
                '"Google Chrome";v="124", '
                '"Not-A.Brand";v="99"'
            ),
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"Windows"',
        },
    )
    return browser, context

Cookie and Session Persistence

Anti-bot systems track whether your browser maintains cookies between requests. A real user arrives with cookies from previous visits; a bot starts fresh every time.

python
import json
from pathlib import Path
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

COOKIE_FILE = Path("cookies.json")

def load_cookies(context):
    """Load cookies saved by a previous session."""
    if COOKIE_FILE.exists():
        cookies = json.loads(COOKIE_FILE.read_text())
        context.add_cookies(cookies)

def save_cookies(context):
    """Persist cookies for future sessions."""
    cookies = context.cookies()
    COOKIE_FILE.write_text(json.dumps(cookies, indent=2))

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    stealth_sync(page)

    # Load cookies from a previous scraping session
    load_cookies(context)

    page.goto("https://target-site.com")
    # ... do your scraping ...

    # Save cookies for the next run
    save_cookies(context)
    browser.close()

For persistent browser profiles that survive between runs (including localStorage, IndexedDB, and service workers):

python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    # Use persistent context to maintain full browser state
    context = p.chromium.launch_persistent_context(
        user_data_dir="./browser_profile",
        headless=True,
        viewport={"width": 1920, "height": 1080},
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        args=["--disable-blink-features=AutomationControlled"],
    )
    page = context.pages[0] if context.pages else context.new_page()
    stealth_sync(page)

    page.goto("https://target-site.com")
    # Full browser state persists between runs
    context.close()

Request Interception

Intercept and modify requests to strip automation headers and add missing ones:

python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def handle_route(route):
    headers = route.request.headers.copy()

    # Remove headers that leak automation
    headers.pop("x-playwright", None)
    headers.pop("x-devtools", None)

    # Ensure consistent Accept header
    if route.request.resource_type == "document":
        headers["Accept"] = (
            "text/html,application/xhtml+xml,"
            "application/xml;q=0.9,image/avif,"
            "image/webp,image/apng,*/*;q=0.8"
        )
        headers["Upgrade-Insecure-Requests"] = "1"

    route.continue_(headers=headers)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    stealth_sync(page)

    # Intercept all requests to fix headers
    page.route("**/*", handle_route)

    # Block tracking scripts that report automation
    page.route(
        "**/{datadome,px,perimeterx,kasada}*.js",
        lambda route: route.abort(),
    )

    page.goto("https://target-site.com")
    browser.close()

Warning: Blocking anti-bot scripts outright can be counterproductive. Some sites check if their detection scripts ran and block you if they did not load. Use this selectively.

Handling Cloudflare, DataDome, and PerimeterX

Each anti-bot provider has different detection strategies. A technique that works against Cloudflare may fail against DataDome.

1. Detect the provider. Check response headers and page source: Cloudflare returns cf-ray headers, DataDome sets datadome cookies, and PerimeterX uses _px cookies and loads px scripts.

2. Apply provider-specific patches. Each system checks different fingerprint vectors: Cloudflare focuses on TLS and JS challenges, DataDome on behavioral patterns, PerimeterX on deep browser fingerprinting.

3. Handle challenges. Wait for challenge pages to resolve. Some require JavaScript execution time, others need CAPTCHA solving, and some need specific cookie values from previous visits.

4. Validate success. Check that the response contains actual content, not a challenge page. Verify status codes and look for challenge-page markers in the HTML.
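The detection step can be automated with a small helper built from the markers just described: a cf-ray header for Cloudflare, a datadome cookie for DataDome, _px cookies for PerimeterX. The function name and return values are illustrative.

```python
def detect_antibot(headers: dict, cookies: list) -> str:
    """Guess the anti-bot vendor from response headers and cookies.

    headers: response headers (e.g. response.headers in Playwright)
    cookies: context.cookies() style list of {"name": ...} dicts
    """
    header_names = {k.lower() for k in headers}
    cookie_names = {c["name"].lower() for c in cookies}
    if "cf-ray" in header_names:
        return "cloudflare"
    if "datadome" in cookie_names:
        return "datadome"
    if any(name.startswith("_px") for name in cookie_names):
        return "perimeterx"
    return "unknown"
```

In a scrape, call it right after navigation: `detect_antibot(response.headers, context.cookies())`, where `response` is the return value of `page.goto(...)`.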

Cloudflare

Cloudflare is the most common anti-bot system. Their detection layers include TLS fingerprinting, JavaScript challenges (Turnstile), and behavioral analysis.

python
import time
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def scrape_cloudflare_site(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        stealth_sync(page)

        page.goto(url, wait_until="domcontentloaded")

        # Cloudflare challenge pages take 3-8 seconds to resolve
        # Wait for the challenge to complete
        for attempt in range(15):
            title = page.title()
            content = page.content()

            # Check if we are past the challenge
            if "just a moment" not in title.lower() and \
               "checking your browser" not in content.lower() and \
               "cf-challenge" not in content.lower():
                break

            time.sleep(1)

        # Verify we got real content
        if "just a moment" in page.title().lower():
            print("Failed to bypass Cloudflare challenge")
            browser.close()
            return None

        html = page.content()
        browser.close()
        return html

Cloudflare Turnstile is their CAPTCHA replacement. It runs in the background and does not always require user interaction. But when it does, you need a CAPTCHA solving service or a real user session. More on this in the CAPTCHA section below.

DataDome

DataDome is behavioral-heavy. It watches mouse movements, scroll patterns, and typing cadence. A browser that navigates directly to a URL without any human-like interaction gets flagged.

python
import random
import time
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def human_like_mouse(page):
    """Simulate realistic mouse movement."""
    width = page.viewport_size["width"]
    height = page.viewport_size["height"]

    # Move to 3-5 random positions with realistic timing
    for _ in range(random.randint(3, 5)):
        x = random.randint(100, width - 100)
        y = random.randint(100, height - 100)
        page.mouse.move(x, y, steps=random.randint(10, 25))
        time.sleep(random.uniform(0.1, 0.4))

def human_like_scroll(page):
    """Scroll down in chunks like a human reader."""
    total_scroll = random.randint(500, 1500)
    scrolled = 0
    while scrolled < total_scroll:
        delta = random.randint(80, 200)
        page.mouse.wheel(0, delta)
        scrolled += delta
        time.sleep(random.uniform(0.1, 0.3))

def scrape_datadome_site(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        stealth_sync(page)

        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_timeout(2000)

        # DataDome checks for human-like behavior
        human_like_mouse(page)
        human_like_scroll(page)

        # Wait for any DataDome challenge to resolve
        page.wait_for_timeout(3000)

        # Check for DataDome block page
        content = page.content()
        if "datadome" in content.lower() and "blocked" in content.lower():
            print("DataDome blocked the request")
            browser.close()
            return None

        html = page.content()
        browser.close()
        return html

PerimeterX (now HUMAN Security)

PerimeterX runs deep fingerprinting via their _px scripts. They check canvas fingerprints, AudioContext fingerprints, WebGL parameters, and font enumeration.

python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def patch_fingerprint(page):
    """Inject scripts to override fingerprint vectors."""
    page.add_init_script("""
        // Override canvas fingerprint
        const originalToDataURL = HTMLCanvasElement.prototype.toDataURL;
        HTMLCanvasElement.prototype.toDataURL = function(type) {
            if (type === 'image/png') {
                const ctx = this.getContext('2d');
                if (ctx) {
                    const imageData = ctx.getImageData(
                        0, 0, this.width, this.height
                    );
                    for (let i = 0; i < imageData.data.length; i += 4) {
                        imageData.data[i] ^= 1;
                    }
                    ctx.putImageData(imageData, 0, 0);
                }
            }
            return originalToDataURL.apply(this, arguments);
        };

        // Override AudioContext fingerprint
        const originalGetFloatFrequencyData =
            AnalyserNode.prototype.getFloatFrequencyData;
        AnalyserNode.prototype.getFloatFrequencyData = function(array) {
            originalGetFloatFrequencyData.call(this, array);
            for (let i = 0; i < array.length; i++) {
                array[i] += Math.random() * 0.0001;
            }
        };
    """)

def scrape_perimeterx_site(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-web-security",
            ],
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        page = context.new_page()
        stealth_sync(page)
        patch_fingerprint(page)

        page.goto(url, wait_until="networkidle")

        # Check for PerimeterX block
        if page.query_selector("[data-testid='px-captcha']"):
            print("PerimeterX CAPTCHA detected")
            browser.close()
            return None

        html = page.content()
        browser.close()
        return html

CAPTCHA Handling in Playwright

When stealth techniques fail, you hit CAPTCHAs. Automating CAPTCHA solving requires integrating with a solving service. Here is how to handle the two most common types.

Cloudflare Turnstile

Turnstile works differently from traditional CAPTCHAs. It runs a background challenge and injects a token into a hidden form field. You can extract the site key and send it to a solving service.

python
import time
import urllib.request
import json
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

CAPSOLVER_API_KEY = "your_capsolver_key"

def solve_turnstile(site_key, page_url):
    """Send Turnstile challenge to a solving service."""
    payload = json.dumps({
        "clientKey": CAPSOLVER_API_KEY,
        "task": {
            "type": "AntiTurnstileTaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    }).encode()

    req = urllib.request.Request(
        "https://api.capsolver.com/createTask",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    resp = json.loads(urllib.request.urlopen(req).read())
    task_id = resp["taskId"]

    # Poll for solution
    for _ in range(30):
        time.sleep(2)
        check = json.dumps({
            "clientKey": CAPSOLVER_API_KEY,
            "taskId": task_id,
        }).encode()
        req = urllib.request.Request(
            "https://api.capsolver.com/getTaskResult",
            data=check,
            headers={"Content-Type": "application/json"},
        )
        result = json.loads(urllib.request.urlopen(req).read())
        if result["status"] == "ready":
            return result["solution"]["token"]

    return None

def scrape_with_turnstile(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)

        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_timeout(3000)

        # Find Turnstile widget and extract site key
        turnstile_frame = page.query_selector(
            "iframe[src*='challenges.cloudflare.com']"
        )
        if turnstile_frame:
            site_key = page.evaluate("""
                () => {
                    const widget = document.querySelector(
                        '[data-sitekey]'
                    );
                    return widget
                        ? widget.getAttribute('data-sitekey')
                        : null;
                }
            """)

            if site_key:
                token = solve_turnstile(site_key, url)
                if token:
                    page.evaluate(
                        """(token) => {
                            const input = document.querySelector(
                                '[name="cf-turnstile-response"]'
                            );
                            if (input) input.value = token;

                            const callback = window.turnstileCallback
                                || window._cf_chl_opt?.clCb;
                            if (callback) callback(token);
                        }""",
                        token,
                    )
                    page.wait_for_timeout(2000)

        html = page.content()
        browser.close()
        return html

hCaptcha

hCaptcha is used by Cloudflare on some domains and by many other sites directly.

python
import json
import time
import urllib.request

CAPSOLVER_API_KEY = "your_capsolver_key"

def solve_hcaptcha(site_key, page_url):
    """Send hCaptcha to a solving service."""
    payload = json.dumps({
        "clientKey": CAPSOLVER_API_KEY,
        "task": {
            "type": "HCaptchaTaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    }).encode()

    req = urllib.request.Request(
        "https://api.capsolver.com/createTask",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    resp = json.loads(urllib.request.urlopen(req).read())
    task_id = resp["taskId"]

    for _ in range(60):
        time.sleep(2)
        check = json.dumps({
            "clientKey": CAPSOLVER_API_KEY,
            "taskId": task_id,
        }).encode()
        req = urllib.request.Request(
            "https://api.capsolver.com/getTaskResult",
            data=check,
            headers={"Content-Type": "application/json"},
        )
        result = json.loads(urllib.request.urlopen(req).read())
        if result["status"] == "ready":
            return result["solution"]["gRecaptchaResponse"]

    return None

The pattern is the same regardless of the CAPTCHA type: extract the site key from the page, send it to a solving service, inject the response token back into the page. Budget $2-5 per thousand solves depending on the provider and CAPTCHA type.
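The injection step generalizes across CAPTCHA types. The hidden-field names below are the conventional ones each widget uses; some sites additionally require invoking a success callback, as in the Turnstile example above, so treat this helper as a starting sketch.

```python
# Conventional hidden-field selectors for solved-token injection.
TOKEN_FIELDS = {
    "turnstile": '[name="cf-turnstile-response"]',
    "hcaptcha": '[name="h-captcha-response"]',
    "recaptcha": '[name="g-recaptcha-response"]',
}

def inject_token(page, captcha_type: str, token: str) -> None:
    """Write a solved CAPTCHA token into the widget's response field."""
    selector = TOKEN_FIELDS[captcha_type]
    page.evaluate(
        """([selector, token]) => {
            const field = document.querySelector(selector);
            if (field) field.value = token;
        }""",
        [selector, token],
    )
```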

Wait Strategies That Prevent Detection

Naive waits are one of the most common reasons scrapes fail. Using the wrong wait strategy either triggers anti-bot detection (too fast) or wastes time (too slow).

networkidle vs domcontentloaded vs Custom Waits

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Option 1: domcontentloaded — fires when HTML is parsed
    # Fast, but JS-rendered content is not ready yet
    page.goto("https://example.com", wait_until="domcontentloaded")

    # Option 2: networkidle — waits until no network requests
    # for 500ms. Works for most sites but can hang on sites
    # with persistent WebSocket connections or polling.
    page.goto("https://example.com", wait_until="networkidle")

    # Option 3: Custom wait — the most reliable approach
    # Wait for a specific element that indicates content loaded
    page.goto("https://example.com", wait_until="domcontentloaded")
    page.wait_for_selector("div.product-list", timeout=10000)

    browser.close()

When to use each:

  • domcontentloaded -- Static sites, server-rendered pages. Fast and reliable.
  • networkidle -- Sites with standard JS rendering (React, Vue, Angular). Good default for dynamic rendering with headless browsers, but set a timeout.
  • Custom selector waits -- Best for known targets. Wait for the exact element you need to scrape.

Waiting a Fixed Duration

Sometimes you just need to wait for a specific duration. This is common when dealing with anti-bot challenges that need time to resolve.

python
# Playwright's built-in timeout (non-blocking, better than time.sleep)
page.wait_for_timeout(5000)  # Wait for 5 seconds

# Conditional wait with timeout
try:
    page.wait_for_selector(
        "div.content",
        state="visible",
        timeout=5000,
    )
except Exception:
    # Element did not appear in 5 seconds — take a screenshot
    page.screenshot(path="debug_timeout.png")

Waiting for Dynamic Content

For SPAs and sites that load content via XHR/fetch after the initial page load:

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Wait for a specific API response before scraping
    with page.expect_response(
        lambda resp: "/api/products" in resp.url
        and resp.status == 200,
        timeout=15000,
    ) as response_info:
        page.goto(
            "https://spa-site.com/products",
            wait_until="domcontentloaded",
        )

    api_response = response_info.value
    data = api_response.json()
    print(f"Got {len(data['items'])} products from API")

    browser.close()

Screenshot Debugging for Anti-Bot Issues

When a scrape fails, a screenshot tells you more than any log message. Build screenshot debugging into your scraping pipeline from the start.

python
import os
from datetime import datetime
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

DEBUG_DIR = "debug_screenshots"
os.makedirs(DEBUG_DIR, exist_ok=True)

def scrape_with_debug(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            viewport={"width": 1920, "height": 1080}
        )
        stealth_sync(page)

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        domain = (
            url.split("//")[-1].split("/")[0].replace(".", "_")
        )

        try:
            response = page.goto(
                url, wait_until="networkidle", timeout=30000
            )

            # Screenshot on non-200 status
            if response and response.status != 200:
                path = (
                    f"{DEBUG_DIR}/{domain}_{timestamp}"
                    f"_status{response.status}.png"
                )
                page.screenshot(path=path, full_page=True)
                print(
                    f"Non-200 status ({response.status}). "
                    f"Screenshot: {path}"
                )

            # Screenshot if challenge page detected
            content = page.content()
            challenge_markers = [
                "just a moment",
                "checking your browser",
                "access denied",
                "blocked",
                "captcha",
                "cf-challenge",
                "datadome",
            ]
            for marker in challenge_markers:
                if marker in content.lower():
                    slug = marker.replace(" ", "_")
                    path = (
                        f"{DEBUG_DIR}/{domain}"
                        f"_{timestamp}_{slug}.png"
                    )
                    page.screenshot(
                        path=path, full_page=True
                    )
                    print(
                        f"Challenge detected: {marker}. "
                        f"Screenshot: {path}"
                    )
                    break

            return content

        except Exception as e:
            path = (
                f"{DEBUG_DIR}/{domain}_{timestamp}_error.png"
            )
            try:
                page.screenshot(
                    path=path, full_page=True
                )
                print(f"Error: {e}. Screenshot: {path}")
            except Exception:
                print(
                    f"Error: {e}. "
                    f"Could not take screenshot."
                )
            return None

        finally:
            browser.close()

This pattern -- screenshot on failure -- saves hours of debugging. When you see the actual challenge page or error screen, you know exactly which anti-bot system blocked you and at what stage.

When DIY Fails: Knowing the Limits

There is a ceiling to what stealth patches and browser configuration can achieve. The arms race between anti-bot systems and automation tools is constant, and the detection side has structural advantages:

  • TLS fingerprinting cannot be fixed from JavaScript. You would need to patch Chromium's TLS stack at the C++ level and rebuild the binary.
  • Behavioral analysis gets more sophisticated every month. Anti-bot vendors now use ML models trained on billions of real user sessions.
  • Maintaining stealth is a full-time job. Patches that work today break with the next Chrome update or anti-bot vendor release.
  • Scale amplifies problems. A technique that works for 100 pages per day fails at 10,000 because rate limiting and IP reputation compound.

If you are spending more time maintaining your stealth setup than building your actual product, it is time to consider a managed solution. Services like AlterLab handle the anti-bot layer at the infrastructure level -- TLS fingerprint rotation, browser profile management, residential proxy networks, and automatic challenge solving -- so your code stays a simple API call:

python
import requests

response = requests.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": "your_api_key"},
    json={
        "url": "https://cloudflare-protected-site.com",
        "formats": ["html", "markdown"],
    },
)
data = response.json()
print(data["content"])

No stealth patches. No fingerprint maintenance. No CAPTCHA integration. The anti-bot bypass happens at the network level, which is fundamentally harder to detect than anything you can do from within a browser.

Quick Reference

Technique | Cloudflare | DataDome | PerimeterX
playwright-stealth | partial | partial | partial
Browser context hardening | partial | partial | partial
Cookie persistence | helps | helps | helps
Request interception | helps | helps | helps
Human-like behavior | helps | essential | helps
CAPTCHA solving service | yes | yes | yes
Managed API (AlterLab) | yes | yes | yes

The honest summary: playwright-stealth plus browser context hardening will get you past basic bot detection and work on sites without dedicated anti-bot services. For Cloudflare, DataDome, and PerimeterX, you need a combination of stealth, behavioral simulation, and CAPTCHA solving. For reliable production-scale scraping against these systems, the cost-benefit math usually points toward a managed service.

Start with stealth patches. Add behavioral simulation when you hit challenges. Integrate CAPTCHA solving when you hit CAPTCHAs. And when the maintenance overhead crosses your tolerance threshold, switch to an API.
