Handling Infinite Scroll & Pagination in Headless Browsers

Learn how to reliably handle infinite scroll, cursor-based pagination, and dynamic rendering for autonomous AI web scraping agents using headless browsers.

TL;DR

To handle infinite scroll and pagination in headless browsers, you must synchronize programmatic scrolling or button clicking with network idle events and DOM updates to ensure complete data extraction. Intercepting the underlying XHR/Fetch API requests is the most robust approach, but when that fails, carefully timed JavaScript execution simulating user scrolling combined with smart rendering provides a reliable fallback for autonomous AI agents.

The Challenge of Dynamic Content Loading

Modern web applications rarely load all content simultaneously. Instead, they rely on single-page application (SPA) architectures, infinite scrolling, or "Load More" buttons to fetch data asynchronously. For autonomous AI agents tasked with reading public data feeds, e-commerce product grids, or article archives, standard static HTML fetching fails because the required content is trapped behind client-side JavaScript execution.

Handling this correctly requires a headless browser capable of executing JavaScript, intercepting network requests, and managing state across multiple asynchronous loads. The two primary strategies for extracting this data are intercepting API requests and simulating user interactions.

Strategy 1: Intercepting XHR and Fetch Requests (The API Route)

The cleanest and most resource-efficient way to handle pagination is to bypass the UI rendering layer entirely. When a user scrolls down an infinite scroll page, the frontend application fires an HTTP request (usually XHR or Fetch) to a backend API to retrieve the next batch of items.

By observing the network tab in your browser's developer tools, you can often identify these requests. They typically return JSON data and include pagination parameters like offset, limit, page, or an opaque cursor.

If the API is publicly accessible without complex cryptographic signatures, your agent can recreate these requests in a loop, paginating through the dataset directly until an empty response or a has_next: false flag is returned.
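The loop described above can be sketched as a small generator. This is a minimal illustration, not a client for any particular site: the `items` and `next_cursor` field names are assumptions, `has_next` follows the flag mentioned in the text, and `fetch_page` is a hypothetical callable you would back with a real HTTP request.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def paginate(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 1000,
) -> Iterator[Any]:
    """Yield items from a cursor-paginated API until it signals the end."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard cap in case the server never terminates
        page = fetch_page(cursor)
        items = page.get("items", [])  # "items" is an assumed field name
        if not items:
            return  # empty response: end of the dataset
        yield from items
        if page.get("has_next") is False:
            return  # explicit end-of-data flag
        cursor = page.get("next_cursor")  # "next_cursor" is an assumed field name
        if cursor is None:
            return  # no cursor to follow


# Usage with an in-memory stand-in for the real endpoint.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"items": [1, 2], "has_next": True, "next_cursor": "a"},
    "a": {"items": [3], "has_next": False},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_PAGES[cursor]


print(list(paginate(fake_fetch)))  # prints [1, 2, 3]
```

Passing the fetch function in as a parameter keeps the pagination logic separate from transport details such as headers, retries, and rate limiting.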

However, many modern sites implement strict bot mitigation that blocks direct API access. In these cases, you must rely on a browser environment to execute the requests natively, allowing the site's own JavaScript to handle token generation and request signing. Using an anti-bot solution helps maintain session validity while extracting this data.
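When the browser has to make the requests itself, one option is to listen for the responses it receives and keep the JSON payloads. The sketch below assumes a driver whose response objects expose `url`, `headers`, and `json()`, as Playwright's Python API does; the collector class and the `/api/products` URL fragment are hypothetical.

```python
from typing import Any, List


class JsonResponseCollector:
    """Collects JSON bodies from responses whose URL contains a given fragment."""

    def __init__(self, url_fragment: str) -> None:
        self.url_fragment = url_fragment
        self.payloads: List[Any] = []

    def handle(self, response: Any) -> None:
        # Ignore images, scripts, and unrelated API calls.
        if self.url_fragment not in response.url:
            return
        if "json" not in response.headers.get("content-type", ""):
            return
        try:
            self.payloads.append(response.json())
        except Exception:
            # The body may be unavailable or malformed; skip it rather than abort.
            pass


collector = JsonResponseCollector("/api/products")

# With Playwright's sync API, the handler would be registered before scrolling:
#     page.on("response", collector.handle)
# After the scroll loop finishes, collector.payloads holds every captured batch.
```

Because the site's own JavaScript issues the requests, any tokens or signatures are generated as usual, and the agent only reads the results.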

Strategy 2: Programmatic Scrolling and DOM Extraction

When API interception is impossible, your AI agent must drive the headless browser to behave like a user. This means scrolling the viewport, waiting for loading indicators to disappear, and verifying that new DOM elements have been injected before extracting the data.

The Mechanics of Programmatic Scrolling

Scrolling a headless browser reliably requires more than simply calling window.scrollTo(0, document.body.scrollHeight). Modern infinite scroll implementations often employ techniques like virtualization, where DOM elements that scroll out of view are removed to save memory, so a single extraction at the end of the scroll captures only the items currently rendered.

To scrape all items, you must extract data incrementally during the scroll process, keeping a running hash set of unique identifiers (like product IDs or URLs) to deduplicate items.
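A minimal sketch of that deduplication step, assuming each extracted item is a dictionary and that a `url` field serves as its unique identifier (both are illustrative choices, not something the page guarantees):

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set

Item = Dict[str, str]


def collect_new(
    seen: Set[Hashable],
    batch: Iterable[Item],
    key: Callable[[Item], Hashable],
) -> List[Item]:
    """Return only the items in `batch` not seen before, updating `seen` in place."""
    fresh: List[Item] = []
    for item in batch:
        identifier = key(item)
        if identifier in seen:
            continue  # already captured during an earlier scroll step
        seen.add(identifier)
        fresh.append(item)
    return fresh


# Two overlapping extractions, as happens when a virtualized list is re-read.
seen: Set[Hashable] = set()
results: List[Item] = []
first_pass = [{"url": "/p/1"}, {"url": "/p/2"}]
second_pass = [{"url": "/p/2"}, {"url": "/p/3"}]
for batch in (first_pass, second_pass):
    results.extend(collect_new(seen, batch, key=lambda item: item["url"]))

print([item["url"] for item in results])  # prints ['/p/1', '/p/2', '/p/3']
```

Calling this after every scroll step means items removed from the DOM by virtualization have already been recorded by the time they disappear.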

Implementing the Scroll Loop

A robust scroll loop requires three components:

  1. Scroll Action: Executing JavaScript to move the viewport down.
  2. Wait Condition: Pausing execution until the network is idle or a specific DOM element appears/disappears (e.g., waiting for a spinner to vanish).
  3. Termination Condition: Determining when the end of the list is reached. This is typically detected when the page height stops increasing after consecutive scroll attempts.
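The three components above can be sketched independently of any particular browser driver. In this illustration the scroll action, wait condition, and height probe are passed in as hypothetical callables, and the `patience` and `max_scrolls` parameters are assumptions added to make termination explicit.

```python
from typing import Callable, List


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    wait_for_content: Callable[[], None],
    patience: int = 2,
    max_scrolls: int = 200,
) -> int:
    """Scroll until the height is unchanged `patience` times in a row.

    Returns the number of scroll actions performed.
    """
    last_height = get_height()
    unchanged = 0
    scrolls = 0
    while scrolls < max_scrolls and unchanged < patience:
        scroll_to_bottom()    # 1. scroll action
        wait_for_content()    # 2. wait condition
        scrolls += 1
        new_height = get_height()
        if new_height == last_height:
            unchanged += 1    # 3. termination: count consecutive stalls
        else:
            unchanged = 0
            last_height = new_height
    return scrolls


class FakePage:
    """Stand-in page whose height grows once per scroll until the list ends."""

    def __init__(self, heights: List[int]) -> None:
        self.heights = heights
        self.index = 0

    def height(self) -> int:
        return self.heights[self.index]

    def scroll(self) -> None:
        self.index = min(self.index + 1, len(self.heights) - 1)


page = FakePage([1000, 2000, 3000])
print(scroll_until_stable(page.height, page.scroll, lambda: None))  # prints 4
```

Requiring more than one unchanged reading guards against stopping early when a single batch is slow to arrive, and the scroll cap prevents an endless loop on feeds that never run out.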

Here is how you can implement this logic using AlterLab.

Python SDK Implementation

We provide a robust Python SDK that allows you to easily dispatch scraping jobs with custom JavaScript execution. The following example demonstrates how to inject a script that handles infinite scrolling before the HTML is returned.

Python
import alterlab

# Initialize the client
client = alterlab.Client("YOUR_API_KEY")

# JavaScript executed in the browser: scroll until the page height stops growing
scroll_script = """
async () => {
    const MAX_SCROLLS = 50; // safety cap so an endless feed cannot loop forever
    let lastHeight = document.body.scrollHeight;
    for (let i = 0; i < MAX_SCROLLS; i++) {
        window.scrollTo(0, document.body.scrollHeight);
        // Wait for new content to load
        await new Promise(resolve => setTimeout(resolve, 2000));

        const newHeight = document.body.scrollHeight;
        if (newHeight === lastHeight) {
            break; // No new content loaded, exit loop
        }
        lastHeight = newHeight;
    }
}
"""

print("Starting extraction...")
response = client.scrape(
    url="https://example-ecommerce-site.com/products",
    js_scenario={"evaluate": scroll_script},
    wait_for={"network_idle": True}
)

print(f"Extraction complete. HTML length: {len(response.text)}")

cURL Implementation

The same behavior can be achieved via direct API calls using standard tools. This is useful for integrating into diverse environments or lightweight AI agents.

Bash
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-ecommerce-site.com/products",
    "js_scenario": {
      "evaluate": "async () => { let h = document.body.scrollHeight; for (let i = 0; i < 50; i++) { window.scrollTo(0, document.body.scrollHeight); await new Promise(r => setTimeout(r, 2000)); const newH = document.body.scrollHeight; if (newH === h) break; h = newH; } }"
    },
    "wait_for": {"network_idle": true}
  }'

Handling Pagination Buttons

Infinite scrolling is common, but traditional "Next Page" buttons are still prevalent, particularly in enterprise directories or search results.

Navigating traditional pagination requires identifying the "Next" button via CSS selectors or XPath, clicking it, and waiting for the new results to render. The challenge is that single-page applications often update the DOM without triggering a full page reload, so wait conditions tied to page-load events never fire.

Your agent must locate the container holding the results, store a reference to the current items, click "Next", and explicitly wait for the items inside the container to change. If the "Next" button becomes disabled or is removed from the DOM, the pagination sequence is complete.
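A sketch of that click-and-wait sequence follows. It assumes a `page` object with Playwright-style methods (`locator`, `all_inner_texts`, `count`, `is_disabled`, `click`, `wait_for_function`); the function name and selectors are hypothetical, and comparing the first item's text is just one possible way to detect that the results changed.

```python
from typing import Any, List

# Browser-side check: has the first matching item changed since the click?
ITEMS_CHANGED_JS = (
    "([sel, prev]) => {"
    " const el = document.querySelector(sel);"
    " return el !== null && el.innerText !== prev;"
    " }"
)


def paginate_by_button(
    page: Any,
    item_selector: str,
    next_selector: str,
    max_pages: int = 100,
) -> List[str]:
    """Collect item text from every page reachable through the "Next" button."""
    collected: List[str] = []
    for _ in range(max_pages):
        texts = page.locator(item_selector).all_inner_texts()
        collected.extend(texts)

        next_button = page.locator(next_selector)
        # A missing or disabled button marks the end of the sequence.
        if next_button.count() == 0 or next_button.first.is_disabled():
            break

        first_before = texts[0] if texts else ""
        next_button.first.click()
        # Wait for the DOM to change rather than for a page-load event.
        page.wait_for_function(ITEMS_CHANGED_JS, arg=[item_selector, first_before])
    return collected
```

With Playwright's sync API this might be called as `paginate_by_button(page, ".result-row", "button.next")`. Sites that hide the button or mark it with `aria-disabled` instead of the `disabled` attribute would need a different end-of-sequence check.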

Structuring Output for AI Agents

Autonomous AI agents operate best on structured data, not raw HTML. When extracting data across multiple pages or scroll events, parse the DOM into JSON inside the browser context instead of shipping large volumes of raw HTML back to your agent for post-processing.

You can modify the JavaScript scenario to execute document.querySelectorAll, map the elements to a JSON array, and return that structured object directly. AlterLab handles the underlying browser infrastructure and proxy rotation automatically, letting you focus entirely on the extraction logic and data quality. For a full breakdown of request parameters, consult our API docs.
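One way to produce such a scenario is to generate the JavaScript from a mapping of field names to CSS selectors. The helper below is a hypothetical sketch; the selectors are placeholders, and whether and how the return value of the script is surfaced in the response is an assumption to verify against the API docs.

```python
import json
from typing import Dict


def build_extraction_script(item_selector: str, fields: Dict[str, str]) -> str:
    """Return JavaScript that maps each matching element to a plain object.

    json.dumps embeds the selectors as properly escaped JavaScript literals.
    """
    return f"""
() => {{
    const fields = {json.dumps(fields)};
    const nodes = document.querySelectorAll({json.dumps(item_selector)});
    return Array.from(nodes).map(el => {{
        const row = {{}};
        for (const [name, selector] of Object.entries(fields)) {{
            const node = el.querySelector(selector);
            row[name] = node ? node.textContent.trim() : null;
        }}
        return row;
    }});
}}
"""


script = build_extraction_script(
    ".product-card",                      # placeholder item selector
    {"title": "h2", "price": ".price"},   # placeholder field selectors
)
print(script)
```

Missing sub-elements become `null` rather than raising an error, so one malformed card does not discard the rest of the batch.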

Takeaway

Successfully extracting data from dynamically loaded interfaces requires careful management of browser state. Whether you are reverse-engineering undocumented pagination APIs or simulating complex user scroll behaviors, rely on robust wait conditions targeting network activity and DOM mutations. Intercepting XHR requests is usually the preferred method for performance and reliability, but programmatic scrolling in a headless browser serves as an essential fallback when APIs are inaccessible.

Frequently Asked Questions

You can scrape infinite scroll pages by programmatically scrolling down the DOM using JavaScript execution in a headless browser, while implementing a wait condition to ensure new data loads before extracting. Alternatively, you can intercept the underlying XHR requests to directly fetch the paginated JSON data.
The most reliable method is intercepting backend API calls (XHR/Fetch) and paginating through them directly using cursors or offsets. If the API is protected, using a headless browser to simulate clicks on "Next" buttons while waiting for DOM changes is the standard fallback.
AI agents typically use a combination of headless browser automation and DOM analysis to identify loading states. They execute scroll scripts, wait for network idle states, and evaluate when the bottom of the page has been reached.