Agentic Web Browsing Workflows with Python and Playwright

Build robust agentic web scraping pipelines combining Python, Playwright, and LLMs for real-time structured data extraction from dynamic web applications.


TL;DR

Agentic web browsing combines Playwright's headless browser automation with large language models to extract data from dynamic sites without relying on hardcoded CSS selectors. By passing a sanitized version of the rendered DOM to an LLM, the model can navigate pages, interact with elements, and return structured JSON in real time.

The Core Challenge of Dynamic Data

Many modern web applications do not serve their content as static HTML. Data is fetched asynchronously via API calls, rendered on the client, and wrapped in markup whose class names are generated at build time (for example, by CSS Modules). Traditional web scraping identifies specific DOM elements using XPath or CSS selectors, so when a site deploys a new build and its class names change, those scrapers break.

LLMs change this paradigm. Instead of defining exactly where data lives, developers can define what data they want. The LLM acts as the routing layer, analyzing the current state of the page and deciding how to extract the target information. This shifts scraping from a brittle, rule-based approach to an adaptable, semantic model.

Implementing this requires a bridge between the LLM's reasoning engine and the actual web page. Playwright provides the execution environment. Python orchestrates the logic.

Designing the Agentic Loop

An agentic scraper operates in a continuous loop. It observes the environment, plans an action, executes that action, and repeats until the objective is complete.

The observation phase is critical. LLMs have finite context windows, and the raw HTML of a modern single-page application can exceed them. Even when it fits, the extra markup increases cost and latency and makes the model more likely to miss or invent details. The DOM must be minimized first.

The planning phase utilizes the LLM's function-calling capabilities. You define a set of available tools, such as click_element(id), type_text(id, text), and extract_data(json_schema). The model reviews the sanitized DOM and selects the appropriate tool.

The execution phase runs the selected tool within the Playwright context. If the model chooses to click a button, Python triggers the Playwright click event, waits for the DOM to settle, and restarts the loop.
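The observe, plan, execute cycle can be sketched without a browser or a model. In the minimal example below, FakePage and stub_planner are hypothetical stand-ins (not part of Playwright or any LLM SDK) that exist only to make the control flow visible: the loop keeps acting until the planner chooses extract_data or the step budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page with one 'Load more' button."""
    items: list = field(default_factory=lambda: ["A"])
    has_more: bool = True

    def observe(self) -> str:
        # A tiny "sanitized DOM": the visible items plus any actionable element
        button = " [el_0: Load more]" if self.has_more else ""
        return ", ".join(self.items) + button

    def click(self, element_id: str) -> None:
        if element_id == "el_0" and self.has_more:
            self.items.append("B")
            self.has_more = False


def stub_planner(observation: str) -> dict:
    """Stands in for the LLM: click while 'Load more' is visible, then extract."""
    if "Load more" in observation:
        return {"tool": "click_element", "args": {"element_id": "el_0"}}
    return {"tool": "extract_data", "args": {"items": observation.split(", ")}}


def run_loop(page: FakePage, planner: Callable[[str], dict],
             max_steps: int = 5) -> Optional[list]:
    for _ in range(max_steps):
        observation = page.observe()                     # observe
        decision = planner(observation)                  # plan
        if decision["tool"] == "extract_data":
            return decision["args"]["items"]             # objective complete
        if decision["tool"] == "click_element":
            page.click(decision["args"]["element_id"])   # execute, then loop
    return None  # step budget exhausted without an extraction
```

The max_steps cap matters in practice: without it, a planner that keeps choosing click_element would loop (and bill tokens) indefinitely.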

Building the Playwright Controller

The first component is the browser controller. Playwright needs to be configured to handle dynamic content, manage timeouts, and intercept unnecessary network requests to save bandwidth.

Python
import asyncio
from playwright.async_api import async_playwright

async def setup_browser():
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(headless=True)
    
    context = await browser.new_context(
        viewport={'width': 1280, 'height': 800},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    )
    
    # Block images, media, and fonts to speed up page loads (tracking scripts are not blocked here)
    await context.route("**/*", lambda route: 
        route.abort() if route.request.resource_type in ["image", "media", "font"] 
        else route.continue_()
    )
    
    page = await context.new_page()
    return playwright, browser, page

async def fetch_page(page, url):
    await page.goto(url, wait_until="networkidle")
    return await page.content()

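This controller sets up a clean environment. Blocking images, media, and fonts reduces page load times, which matters for real-time extraction tasks. The networkidle state waits until there have been no network connections for at least 500 ms. That is a reasonable heuristic that asynchronous content has loaded, but not a guarantee: pages that poll or stream may never become idle, and rendering can continue after the network goes quiet.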
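Because networkidle is a heuristic rather than a readiness guarantee, one alternative is to wait for a specific element that signals the data has rendered. The sketch below assumes page is a Playwright async Page such as the one returned by setup_browser; the function name fetch_when_ready and the ready_selector parameter are illustrative, and you would need to know a selector that reliably appears on the target page.

```python
async def fetch_when_ready(page, url, ready_selector, timeout_ms=10_000):
    """Load url and return the HTML once ready_selector has appeared.

    `page` is expected to be a Playwright async Page. `ready_selector` is a
    CSS selector chosen by the caller (hypothetical here) for the content
    being waited on.
    """
    # Return from goto as soon as the document has been parsed...
    await page.goto(url, wait_until="domcontentloaded")
    # ...then wait for the element itself. By default wait_for_selector
    # waits until the element is visible, and raises on timeout.
    await page.wait_for_selector(ready_selector, timeout=timeout_ms)
    return await page.content()
```

This trades generality for reliability: it needs one known selector per site, but it does not hang on pages that keep a connection open.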

DOM Sanitization for Context Windows

Raw HTML from a rendered page can run to hundreds of kilobytes or more, most of it irrelevant to data extraction. Inline styles, SVG paths, tracking scripts, and deeply nested divs all add token overhead.

We use Python libraries like BeautifulSoup to strip out noise before sending the content to the LLM. Furthermore, we must map actionable elements to unique IDs so the LLM can reference them in its function calls.

Python
from bs4 import BeautifulSoup
import re

def sanitize_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    
    # Remove non-content tags
    for tag in soup(["script", "style", "noscript", "svg", "img", "video"]):
        tag.decompose()
        
    # Remove all attributes except href, and assign interactive IDs
    element_counter = 0
    interactive_tags = ['a', 'button', 'input', 'select']
    
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ['href']}
        
        if tag.name in interactive_tags:
            tag_id = f"el_{element_counter}"
            tag['data-interact-id'] = tag_id
            element_counter += 1
            
    # Serialize and collapse blank lines (empty tags are left in place)
    text_content = str(soup)
    text_content = re.sub(r'\n\s*\n', '\n', text_content)
    
    return text_content

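This sanitization typically cuts the token count substantially, since scripts, styles, and most attributes are gone. The data-interact-id attributes give the LLM a stable handle for each link, button, and form field. Note that sanitize_html adds these attributes only to the parsed copy, so the live page must carry matching IDs before Playwright can resolve them.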
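To check how much a given page actually shrinks, a rough estimate is usually enough. The sketch below assumes about four characters per token, a common rule of thumb for English text that can be noticeably off for HTML. For exact counts you would use the tokenizer that matches your model. The function names and the default budget are illustrative.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def reduction_report(raw_html: str, clean_html: str,
                     budget_tokens: int = 100_000) -> dict:
    """Compare estimated token counts before and after sanitization."""
    raw = estimate_tokens(raw_html)
    clean = estimate_tokens(clean_html)
    return {
        "raw_tokens": raw,
        "clean_tokens": clean,
        # Percentage of estimated tokens removed by sanitization
        "saved_pct": round(100 * (1 - clean / raw), 1) if raw else 0.0,
        # Whether the sanitized page fits the (assumed) context budget
        "fits_budget": clean <= budget_tokens,
    }
```

Logging this report for each page makes it easy to spot sites where the sanitized DOM is still too large and needs further pruning.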

LLM Function Calling Integration

The LLM needs a strict schema to interact with our Playwright script. Using OpenAI's API or open-source equivalents, we define the tools available to the model.

Python
import json
import openai

client = openai.AsyncOpenAI(api_key="YOUR_KEY")

async def get_agent_decision(sanitized_html, objective):
    tools = [
        {
            "type": "function",
            "function": {
                "name": "extract_data",
                "description": "Extract structured data when the objective is met",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "items": {
                            "type": "array",
                            "items": {"type": "object"}
                        }
                    },
                    "required": ["items"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "click_element",
                "description": "Click an element to load more data or navigate",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "element_id": {"type": "string"}
                    },
                    "required": ["element_id"]
                }
            }
        }
    ]

    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a web automation agent. Analyze the HTML and decide the next action."},
            {"role": "user", "content": f"Objective: {objective}\n\nHTML:\n{sanitized_html}"}
        ],
        tools=tools,
        tool_choice="auto"
    )
    
    return response.choices[0].message

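The request sends the model the objective and the sanitized HTML. With tool_choice set to "auto", the model normally responds with a function call: click_element to interact with the page, or extract_data, whose arguments carry the extracted records as JSON. It may also reply with plain text and no tool call, so the caller must handle that case.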
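Function-call arguments arrive as a JSON string produced by the model, so it is worth validating them before they reach Playwright. The helper below is a minimal sketch (parse_click_arguments is a hypothetical name) that accepts only IDs in the el_N form generated by sanitize_html, which also prevents a malformed value from being interpolated into a CSS selector.

```python
import json
import re

# IDs produced by sanitize_html look like "el_0", "el_1", ...
ELEMENT_ID_PATTERN = re.compile(r"el_\d+")


def parse_click_arguments(raw_arguments: str) -> str:
    """Return a validated element ID from a click_element call.

    Raises ValueError if the arguments are not valid JSON or the ID does
    not match the expected el_N form.
    """
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc

    element_id = args.get("element_id") if isinstance(args, dict) else None
    if not isinstance(element_id, str) or not ELEMENT_ID_PATTERN.fullmatch(element_id):
        raise ValueError(f"unexpected element_id: {element_id!r}")
    return element_id
```

On a ValueError, a reasonable policy is to skip the action and let the loop ask the model again rather than abort the whole run.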

Executing the Agentic Loop

With the components built, we tie them together into the main loop. The Python script evaluates the LLM's response, maps the function call back to a Playwright action, and executes it.

Python
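import asyncio
import json
import re

# sanitize_html adds data-interact-id only to its parsed copy, so the same IDs
# must be written into the live page. This script numbers elements in the same
# document order, skipping subtrees that the sanitizer removes.
TAG_INTERACTIVE_JS = """() => {
    let counter = 0;
    for (const el of document.querySelectorAll('a, button, input, select')) {
        if (el.closest('svg, video, noscript')) continue;
        el.setAttribute('data-interact-id', `el_${counter++}`);
    }
}"""

async def run_agent(url, objective, max_steps=5):
    playwright, browser, page = await setup_browser()
    try:
        await page.goto(url, wait_until="networkidle")

        for step in range(max_steps):
            # Tag the live DOM so the IDs the model sees also exist in the browser
            await page.evaluate(TAG_INTERACTIVE_JS)
            raw_html = await page.content()
            clean_html = sanitize_html(raw_html)

            message = await get_agent_decision(clean_html, objective)

            if not message.tool_calls:
                print("Agent returned no tool call.")
                return None

            # Only the first tool call is handled in each step
            tool_call = message.tool_calls[0]
            args = json.loads(tool_call.function.arguments)

            if tool_call.function.name == "extract_data":
                print("Extraction complete:", json.dumps(args, indent=2))
                return args

            if tool_call.function.name == "click_element":
                element_id = args["element_id"]
                # Reject anything that is not one of our generated IDs
                if not re.fullmatch(r"el_\d+", element_id):
                    print(f"Ignoring unexpected element_id: {element_id!r}")
                    continue

                await page.click(f"[data-interact-id='{element_id}']")
                await page.wait_for_load_state("networkidle")

        print("Step limit reached without extraction.")
        return None
    finally:
        # Always release the browser, even if a step raises
        await browser.close()
        await playwright.stop()

# asyncio.run(run_agent("https://example.com/catalog", "Extract product names and prices"))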

This architecture handles multi-step scenarios within its step budget. If data is hidden behind a "Load More" button or a collapsed section, the agent can parse the layout, click the specific element, wait for the new HTML to render, and proceed with extraction.

Managing Headless Infrastructure

Running a local Playwright script works for small tasks. Scaling agentic web browsing presents significant infrastructure challenges.

E-commerce sites, travel aggregators, and social platforms deploy aggressive fingerprinting and behavioral analysis to detect automated browsers. Running unmodified Playwright instances from cloud servers often leads to IP blocks and CAPTCHA challenges.

Instead of managing proxy rotations, header spoofing, and browser fingerprints manually, developers route traffic through managed infrastructure. AlterLab handles the complexity of headless browser execution at scale.

By passing requests through a smart rendering API, the anti-bot bypass logic is abstracted away. The API handles the browser lifecycle, solves required challenges, and returns the clean HTML payload for your LLM pipeline.

Integration Examples

Here is how you execute a request using the Python SDK.

Python
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# AlterLab handles the headless browser rendering automatically
response = client.scrape(
    "https://example.com/catalog",
    render_js=True,
    wait_for="networkidle"
)

# Pass response.text to the sanitize_html function defined earlier
print(response.text)

The equivalent operation using cURL is straightforward. This is useful for testing or integrating into non-Python environments.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/catalog",
    "render_js": true,
    "wait_for": "networkidle"
  }'

Both examples return the fully rendered HTML payload, ready for processing by your agentic pipeline. For deeper integration patterns, consult the API docs.

Advanced Patterns: Streaming and State Management

As your pipelines grow more sophisticated, maintaining state across the agentic loop becomes vital. The standard loop processes single pages. Complex extraction might require logging into a portal, navigating through a multi-step form, and polling for asynchronous job completions.

To manage this, persist the Playwright browser context between runs. Store cookies and local storage tokens locally. When the agent restarts, inject the stored state to bypass login walls.
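Playwright can save and restore cookies and local storage through a context's storage state. The sketch below assumes browser and context are Playwright async Browser and BrowserContext objects. The file name auth_state.json and the helper names are illustrative choices, not part of the pipeline above.

```python
from pathlib import Path

STATE_FILE = Path("auth_state.json")  # hypothetical location for saved state


def context_options(state_file: Path = STATE_FILE) -> dict:
    """Build new_context() keyword arguments, reusing saved state if present."""
    options = {"viewport": {"width": 1280, "height": 800}}
    if state_file.exists():
        # Restores cookies and localStorage captured by a previous run
        options["storage_state"] = str(state_file)
    return options


async def open_context(browser, state_file: Path = STATE_FILE):
    """Create a context from a Playwright Browser, restoring prior state if any."""
    return await browser.new_context(**context_options(state_file))


async def save_context(context, state_file: Path = STATE_FILE) -> None:
    """Write the context's cookies and localStorage to disk for the next run."""
    await context.storage_state(path=str(state_file))
```

Call save_context after a successful login and open_context at the start of later runs. The state file contains session credentials, so treat it like a secret and keep it out of version control.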

Streaming the LLM response can also reduce latency. Instead of waiting for the entire response to generate, stream the tokens, assemble each function call as its fragments arrive, and start the Playwright action as soon as that call's arguments form complete JSON. The saving is largest when a response carries long arguments or several calls, as in deeply nested scraping tasks.
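When a response is streamed, a function call arrives in pieces: the name first, then the arguments as partial JSON strings. The sketch below shows one way to reassemble them. The plain-dict delta shape used here is a simplified assumption for illustration; a real SDK yields its own chunk objects, which you would map onto this form.

```python
import json


def accumulate_tool_calls(deltas):
    """Merge streamed fragments into complete tool calls, ordered by call index.

    Each delta is a plain dict with "index", an optional "name", and an
    "arguments" fragment. This is a simplified stand-in for SDK chunk objects.
    """
    calls = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"name": None, "arguments": ""})
        if delta.get("name"):
            call["name"] = delta["name"]
        # Argument fragments are concatenated in arrival order
        call["arguments"] += delta.get("arguments") or ""

    # Arguments are only valid JSON once every fragment has arrived
    return [
        {"name": call["name"], "arguments": json.loads(call["arguments"])}
        for _, call in sorted(calls.items())
    ]
```

This version parses only after the stream ends. To act earlier, you could attempt json.loads on a call's buffer whenever a fragment arrives and dispatch the call the first time parsing succeeds.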

Takeaway

Agentic web scraping replaces brittle CSS selectors with semantic, resilient data extraction. By pairing Playwright's browser automation with Python and function-calling LLMs, engineers can build pipelines that adapt to UI changes automatically. While scaling these systems requires managing complex browser fingerprints, offloading infrastructure concerns allows teams to focus entirely on writing robust agent logic and maximizing data quality.

Frequently Asked Questions

Agentic web browsing involves using LLMs to autonomously navigate websites, interact with elements, and extract unstructured data into predictable formats. Instead of hardcoded selectors, the agent analyzes the DOM to determine its next action.
Playwright renders JavaScript and handles dynamic single-page applications. This allows the LLM to interact with fully hydrated pages, click buttons, and wait for network responses.
Managing headless browsers at scale requires residential proxies, browser fingerprint rotation, and CAPTCHA handling. Offloading this to a smart rendering API prevents blocks while letting the agent focus on data extraction.