
Agentic Web Browsing Workflows with Python and Playwright
Build robust agentic web scraping pipelines combining Python, Playwright, and LLMs for real-time structured data extraction from dynamic web applications.
May 29, 2026
TL;DR
Agentic web browsing combines Playwright's headless browser automation with large language models to extract data from dynamic sites without relying on hardcoded CSS selectors. By passing a sanitized version of the rendered DOM to an LLM, the model can navigate pages, interact with elements, and return structured JSON in real time.
The Core Challenge of Dynamic Data
Many modern web applications no longer serve static HTML. Content is fetched asynchronously via API calls, rendered on the client side, and styled with machine-generated class names from CSS modules. Traditional web scraping relies on identifying specific DOM elements using XPath or CSS selectors. When a site deploys a new build, those class names can change, and selector-based scrapers break.
LLMs change this paradigm. Instead of defining exactly where data lives, developers can define what data they want. The LLM acts as the routing layer, analyzing the current state of the page and deciding how to extract the target information. This shifts scraping from a brittle, rule-based approach to an adaptable, semantic model.
Implementing this requires a bridge between the LLM's reasoning engine and the actual web page. Playwright provides the execution environment. Python orchestrates the logic.
Designing the Agentic Loop
An agentic scraper operates in a continuous loop. It observes the environment, plans an action, executes that action, and repeats until the objective is complete.
The observation phase is critical. LLMs have finite context windows. Feeding raw HTML from a modern single-page application into an LLM can exceed that limit, and even when it fits, the noise raises cost and makes hallucinated output more likely. The DOM must be minimized first.
The planning phase utilizes the LLM's function-calling capabilities. You define a set of available tools, such as click_element(id), type_text(id, text), and extract_data(json_schema). The model reviews the sanitized DOM and selects the appropriate tool.
The execution phase runs the selected tool within the Playwright context. If the model chooses to click a button, Python triggers the Playwright click event, waits for the DOM to settle, and restarts the loop.
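The observe, plan, act cycle described above can be sketched without any browser or model attached. In this minimal sketch, `observe`, `plan`, and `act` are hypothetical callables supplied by the caller, and the `Action` record is an assumed shape used only for illustration; neither comes from Playwright or an LLM SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Action:
    """Hypothetical action record: a tool name plus its arguments."""
    name: str
    args: dict = field(default_factory=dict)


def run_loop(
    observe: Callable[[], str],
    plan: Callable[[str], Action],
    act: Callable[[Action], None],
    max_steps: int = 5,
) -> Optional[Any]:
    """Repeat observe -> plan -> act until the planner extracts data.

    Returns the extracted payload, or None if the step budget runs out.
    """
    for _ in range(max_steps):
        state = observe()      # e.g. the sanitized DOM
        action = plan(state)   # e.g. an LLM tool call
        if action.name == "extract_data":
            return action.args.get("items")
        act(action)            # e.g. a Playwright click
    return None


# Toy usage with stubs: the "page" reveals its data after one click.
page_state = {"expanded": False}


def observe() -> str:
    return "full list" if page_state["expanded"] else "load more"


def plan(state: str) -> Action:
    if state == "full list":
        return Action("extract_data", {"items": [{"name": "Widget"}]})
    return Action("click_element", {"element_id": "el_0"})


def act(action: Action) -> None:
    page_state["expanded"] = True


result = run_loop(observe, plan, act)
```

The step budget matters as much as the loop itself: without `max_steps`, a planner that keeps choosing to click would never terminate.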
Building the Playwright Controller
The first component is the browser controller. Playwright needs to be configured to handle dynamic content, manage timeouts, and intercept unnecessary network requests to save bandwidth.

import asyncio
from playwright.async_api import async_playwright

BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}

async def block_heavy_resources(route):
    # Abort requests for images, media, and fonts; let everything else through
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()

async def setup_browser():
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context(
        viewport={'width': 1280, 'height': 800},
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    )
    # Block images, media, and fonts to speed up rendering
    await context.route("**/*", block_heavy_resources)
    page = await context.new_page()
    return playwright, browser, page

async def fetch_page(page, url):
    # Wait for the network to go idle before reading the rendered HTML
    await page.goto(url, wait_until="networkidle")
    return await page.content()

This controller sets up a clean environment. Blocking images, media, and fonts accelerates page loads, which matters for real-time extraction tasks. The networkidle state waits until there have been no network connections for at least 500 ms. That is a useful signal that asynchronous JavaScript has finished fetching data, though not a guarantee on pages that poll or stream continuously.
DOM Sanitization for Context Windows
Raw HTML contains megabytes of data irrelevant to data extraction. Inline styles, SVG paths, tracking scripts, and deep nested divs add token overhead.
We use Python libraries like BeautifulSoup to strip out noise before sending the content to the LLM. Furthermore, we must map actionable elements to unique IDs so the LLM can reference them in its function calls.

from bs4 import BeautifulSoup
import re

def sanitize_html(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove non-content tags
    for tag in soup(["script", "style", "noscript", "svg", "img", "video"]):
        tag.decompose()
    # Remove all attributes except href, and assign interactive IDs
    element_counter = 0
    interactive_tags = ['a', 'button', 'input', 'select']
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ['href']}
        if tag.name in interactive_tags:
            tag_id = f"el_{element_counter}"
            tag['data-interact-id'] = tag_id
            element_counter += 1
    # Collapse runs of blank lines left behind by the removed tags
    text_content = str(soup)
    text_content = re.sub(r'\n\s*\n', '\n', text_content)
    return text_content

This sanitization dramatically reduces token count. By injecting data-interact-id attributes into buttons and links, we give the LLM a precise coordinate system for interacting with the page. Note that these IDs exist only in the parsed copy of the HTML, so the live page must carry matching attributes before Playwright can select elements by them.
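To check whether sanitization is paying off, it helps to compare rough token estimates before and after. The sketch below assumes a heuristic of about four characters per token, which is only an approximation for English-like text; for exact counts, use the tokenizer that matches your model. The function names are hypothetical, and `clean` is a hand-written stand-in for sanitized output rather than a captured result.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate.

    The 4-characters-per-token ratio is an assumed heuristic,
    not an exact tokenizer count.
    """
    return int(len(text) / chars_per_token) + 1


def within_budget(text: str, max_tokens: int) -> bool:
    """Return True if the estimated size fits the given token budget."""
    return estimate_tokens(text) <= max_tokens


# A small raw fragment and a hand-written stand-in for its sanitized form.
raw = (
    '<div class="a b c" style="color:red">'
    "<script>track()</script>"
    '<a href="/p/1">Widget</a></div>'
)
clean = '<div><a href="/p/1" data-interact-id="el_0">Widget</a></div>'

# Estimated tokens saved by stripping attributes and scripts.
saved = estimate_tokens(raw) - estimate_tokens(clean)
```

A check like `within_budget(clean_html, max_tokens)` before each model call gives the loop a chance to trim or abort instead of sending an oversized prompt.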
LLM Function Calling Integration
The LLM needs a strict schema to interact with our Playwright script. Using OpenAI's API or open-source equivalents, we define the tools available to the model.

import json
import openai

client = openai.AsyncOpenAI(api_key="YOUR_KEY")

async def get_agent_decision(sanitized_html, objective):
    tools = [
        {
            "type": "function",
            "function": {
                "name": "extract_data",
                "description": "Extract structured data when the objective is met",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "items": {
                            "type": "array",
                            "items": {"type": "object"}
                        }
                    },
                    "required": ["items"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "click_element",
                "description": "Click an element to load more data or navigate",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "element_id": {"type": "string"}
                    },
                    "required": ["element_id"]
                }
            }
        }
    ]
    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a web automation agent. Analyze the HTML and decide the next action."},
            {"role": "user", "content": f"Objective: {objective}\n\nHTML:\n{sanitized_html}"}
        ],
        tools=tools,
        tool_choice="auto"
    )
    return response.choices[0].message

The system prompts the model with the objective and the sanitized HTML. The model responds with a tool call: either click_element to interact with the page, or extract_data carrying the extracted records as JSON arguments.
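Model output is untrusted input, so it is worth validating a tool call before acting on it. The following is a minimal sketch, assuming the two tools defined above; `parse_tool_call` and `REQUIRED_ARGS` are hypothetical names introduced here, not part of the OpenAI SDK.

```python
import json

# Required argument names per tool, mirroring the schemas defined above.
REQUIRED_ARGS = {
    "extract_data": ["items"],
    "click_element": ["element_id"],
}


def parse_tool_call(name: str, arguments_json: str) -> dict:
    """Validate a tool call's name and JSON arguments before acting on it.

    Raises ValueError for unknown tools, malformed JSON, or missing fields.
    """
    if name not in REQUIRED_ARGS:
        raise ValueError(f"Unknown tool: {name}")
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Malformed arguments for {name}: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError(f"Arguments for {name} must be a JSON object")
    missing = [key for key in REQUIRED_ARGS[name] if key not in args]
    if missing:
        raise ValueError(f"Missing arguments for {name}: {missing}")
    return args


# Example: a well-formed click request passes validation.
click_args = parse_tool_call("click_element", '{"element_id": "el_4"}')
```

Raising a single exception type keeps the calling loop simple: it can catch `ValueError`, report the problem back to the model, and ask for another decision.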
Executing the Agentic Loop
With the components built, we tie them together into the main loop. The Python script evaluates the LLM's response, maps the function call back to a Playwright action, and executes it.

import asyncio
import json

# Mirror sanitize_html's numbering on the live page so the IDs the model
# sees resolve to real elements. Elements inside <svg> or <video> are
# skipped because the sanitizer removes those subtrees before counting.
TAG_INTERACTIVE_JS = """() => {
    let counter = 0;
    for (const el of document.querySelectorAll('a, button, input, select')) {
        if (el.closest('svg, video')) continue;
        el.setAttribute('data-interact-id', `el_${counter++}`);
    }
}"""

async def run_agent(url, objective):
    playwright, browser, page = await setup_browser()
    try:
        await page.goto(url, wait_until="networkidle")
        max_steps = 5
        for step in range(max_steps):
            # Tag the live DOM before snapshotting it
            await page.evaluate(TAG_INTERACTIVE_JS)
            raw_html = await page.content()
            clean_html = sanitize_html(raw_html)
            message = await get_agent_decision(clean_html, objective)
            if not message.tool_calls:
                print("Agent failed to decide.")
                break
            tool_call = message.tool_calls[0]
            if tool_call.function.name == "extract_data":
                data = json.loads(tool_call.function.arguments)
                print("Extraction complete:", json.dumps(data, indent=2))
                break
            elif tool_call.function.name == "click_element":
                args = json.loads(tool_call.function.arguments)
                element_id = args["element_id"]
                # Find the element by our injected ID and click it
                selector = f"[data-interact-id='{element_id}']"
                await page.click(selector)
                await page.wait_for_load_state("networkidle")
    finally:
        # Release the browser even if a step raises
        await browser.close()
        await playwright.stop()

# asyncio.run(run_agent("https://example.com/catalog", "Extract product names and prices"))

This architecture handles complex scenarios. If data is hidden behind a "Load More" button or requires expanding a dropdown, the agent can parse the layout, click the specific element, wait for the new HTML to render, and proceed with extraction.
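As more tools are added, the if/elif chain in the loop grows with them. One alternative is a dispatch table that maps tool names to handlers. The sketch below is an assumption-laden illustration: the `type_text` handler matches the tool named earlier in the article but has no schema defined there, `dispatch` and `selector_for` are hypothetical helpers, and `FakePage` is a stand-in that records calls so the example runs without a browser. The `click` and `fill` methods it imitates are the corresponding Playwright page methods.

```python
import asyncio


def selector_for(element_id: str) -> str:
    """Build the attribute selector for an injected interaction ID."""
    return f"[data-interact-id='{element_id}']"


async def click_element(page, element_id: str) -> None:
    await page.click(selector_for(element_id))


async def type_text(page, element_id: str, text: str) -> None:
    # fill() replaces the field's current value with the given text.
    await page.fill(selector_for(element_id), text)


# Tool name -> handler coroutine.
HANDLERS = {"click_element": click_element, "type_text": type_text}


async def dispatch(page, name: str, args: dict) -> None:
    """Run the handler registered for a tool name."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"No handler for tool: {name}")
    await handler(page, **args)


class FakePage:
    """Stand-in for a Playwright page that records calls for a dry run."""

    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def fill(self, selector, value):
        self.calls.append(("fill", selector, value))


fake = FakePage()
asyncio.run(dispatch(fake, "type_text", {"element_id": "el_3", "text": "laptop"}))
```

With this shape, supporting a new tool means adding one handler and one table entry, and the same fake page can back unit tests for the agent logic.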
Managing Headless Infrastructure
Running a local Playwright script works for small tasks. Scaling agentic web browsing presents significant infrastructure challenges.
E-commerce sites, travel aggregators, and social platforms deploy aggressive fingerprinting and behavioral analysis to detect automated browsers. Running raw Playwright instances from cloud servers often leads to quick IP bans and CAPTCHA challenges.
Instead of managing proxy rotations, header spoofing, and browser fingerprints manually, developers route traffic through managed infrastructure. AlterLab handles the complexity of headless browser execution at scale.
By passing requests through a smart rendering API, the anti-bot bypass logic is abstracted away. The API handles the browser lifecycle, solves required challenges, and returns the clean HTML payload for your LLM pipeline.
Integration Examples
Here is how you execute a request using the Python SDK.

import alterlab

client = alterlab.Client("YOUR_API_KEY")

# AlterLab handles the headless browser rendering automatically
response = client.scrape(
    "https://example.com/catalog",
    render_js=True,
    wait_for="networkidle"
)

# Pass response.text to the sanitize_html function
print(response.text)

The equivalent operation using cURL is straightforward. This is useful for testing or integrating into non-Python environments.

curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/catalog",
    "render_js": true,
    "wait_for": "networkidle"
  }'

Both examples return the fully rendered HTML payload, ready for processing by your agentic pipeline. For deeper integration patterns, consult the API docs.
Advanced Patterns: Streaming and State Management
As your pipelines grow more sophisticated, maintaining state across the agentic loop becomes vital. The standard loop processes single pages. Complex extraction might require logging into a portal, navigating through a multi-step form, and polling for asynchronous job completions.
To manage this, persist the Playwright browser context between runs. Store cookies and local storage tokens locally. When the agent restarts, inject the stored state to bypass login walls.
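Playwright exposes this through a context's `storage_state`, which captures cookies and local storage and can be passed back in when creating a new context. Here is a minimal sketch of saving and restoring it; the file name `auth_state.json` and both helper functions are hypothetical names chosen for illustration.

```python
import asyncio
from pathlib import Path

STATE_PATH = Path("auth_state.json")  # hypothetical file name


async def save_state(context, path: Path = STATE_PATH) -> None:
    """Write the context's cookies and local storage to disk."""
    await context.storage_state(path=str(path))


async def new_context_with_state(browser, path: Path = STATE_PATH):
    """Create a browser context, reusing saved state when the file exists."""
    if path.exists():
        return await browser.new_context(storage_state=str(path))
    return await browser.new_context()
```

A typical flow would call `save_state(context)` once after a successful login and `new_context_with_state(browser)` at the start of later runs. Saved state files contain session credentials, so they should be stored and handled like secrets.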
Furthermore, streaming the LLM responses can reduce latency. Instead of waiting for the entire response to finish generating, stream the tokens, assemble each function call as its fragments arrive, and begin executing the Playwright action as soon as that call's arguments form complete JSON. This optimization can cut total execution time noticeably for scraping tasks that take many steps.
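When streaming is enabled, a tool call arrives as fragments: the function name first, then the arguments string in pieces. The sketch below merges such fragments by index. It assumes fragments shaped like the entries of `chunk.choices[0].delta.tool_calls` in the OpenAI streaming API; `accumulate_tool_calls` and the `frag` helper are hypothetical names, and the fake stream stands in for a real network response.

```python
import json
from types import SimpleNamespace


def accumulate_tool_calls(deltas):
    """Merge streamed tool-call fragments into complete calls.

    Each fragment has an `index` and a `function` whose `name` and
    `arguments` may be None or partial strings.
    """
    calls = {}
    for delta in deltas:
        call = calls.setdefault(delta.index, {"name": "", "arguments": ""})
        if delta.function.name:
            call["name"] += delta.function.name
        if delta.function.arguments:
            call["arguments"] += delta.function.arguments
    return [calls[i] for i in sorted(calls)]


def frag(index, name=None, arguments=None):
    """Build a fake fragment for a dry run without network access."""
    return SimpleNamespace(
        index=index,
        function=SimpleNamespace(name=name, arguments=arguments),
    )


# Simulated stream: the name arrives first, then the arguments in pieces.
stream = [
    frag(0, name="click_element", arguments=""),
    frag(0, arguments='{"element'),
    frag(0, arguments='_id": "el_7"}'),
]
calls = accumulate_tool_calls(stream)
args = json.loads(calls[0]["arguments"])
```

The arguments only parse once the final fragment has arrived, so the practical saving comes from acting on a finished tool call without waiting for the rest of the response.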
Takeaway
Agentic web scraping replaces brittle CSS selectors with semantic, resilient data extraction. By pairing Playwright's browser automation with Python and function-calling LLMs, engineers can build pipelines that adapt to UI changes automatically. While scaling these systems requires managing complex browser fingerprints, offloading infrastructure concerns allows teams to focus entirely on writing robust agent logic and maximizing data quality.