Web Scraping with Python — Complete Guide

From your first request to production-scale data extraction. Everything you need to scrape websites reliably with Python.

Python is among the most popular languages for web scraping, and for good reason. The ecosystem is mature, the libraries are well documented, and the path from prototype to production is short. This guide covers the full journey: fetching pages, parsing HTML, handling JavaScript-rendered content, and scaling to thousands of URLs.

Setting Up Your Python Scraping Environment

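You need two libraries to get started: requests for HTTP and BeautifulSoup for HTML parsing. The lxml package is optional, but it gives BeautifulSoup a faster parser and the examples in this guide use it. Install all three with pip: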

pip install requests beautifulsoup4 lxml

For async scraping (faster when collecting many URLs), also install httpx:

pip install httpx

For structured data extraction and saving to CSV or DataFrame, add pandas:

pip install pandas

Or install everything in one command:

pip install requests beautifulsoup4 lxml httpx pandas
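A quick way to confirm the installation worked is to check that each package can be imported. The sketch below is an illustration rather than part of the original setup: the helper name missing_packages and the required/optional split are assumptions made here. It uses only the standard library, so it runs even when nothing has been installed yet.

```python
from importlib.util import find_spec

# pip names mapped to import names (beautifulsoup4 is imported as bs4).
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4", "lxml": "lxml"}
OPTIONAL = {"httpx": "httpx", "pandas": "pandas"}


def missing_packages(packages: dict[str, str]) -> list[str]:
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    for label, group in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_packages(group)
        status = "all installed" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```

If anything is reported missing, rerun the matching pip install command in the same virtual environment that runs your scraper.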

Fetching a Web Page with Python

The simplest scrape: fetch a URL and read its HTML. Use requests.get() for synchronous fetching. Always check the status code before parsing: 200 means success, 403 means the server refused the request (often because of missing or bot-like request headers), and 429 means you are being rate-limited.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

response = requests.get("https://example.com/products", headers=headers, timeout=10)
response.raise_for_status()  # raises on 4xx/5xx
html = response.text
print(f"Fetched {len(html)} characters")  # response.text is decoded text, not raw bytes
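The status codes described above suggest different reactions: a 429 or a transient 5xx is usually worth retrying after a pause, while a 403 is not. Here is a minimal sketch of that idea, assuming exponential backoff and a numeric Retry-After header. The names backoff_delay, get_with_retries, and RETRYABLE_STATUSES are hypothetical helpers introduced for this illustration, not part of requests. The function accepts any object with a requests-style .get() method, such as a requests.Session.

```python
import time
from typing import Optional

# Statuses treated as temporary in this sketch (an assumption, tune per site).
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before the next try; attempt is 0-based."""
    # Honor a numeric Retry-After header; date-formatted values are ignored here.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), cap)
    return min(base * (2 ** attempt), cap)


def get_with_retries(session, url: str, max_attempts: int = 4,
                     sleep=time.sleep, **kwargs):
    """Call session.get(), retrying on rate limits and transient server errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = session.get(url, **kwargs)
        is_last = attempt == max_attempts - 1
        if response.status_code not in RETRYABLE_STATUSES or is_last:
            return response  # caller still decides what to do with 403, 404, ...
        sleep(backoff_delay(attempt, response.headers.get("Retry-After")))
```

With requests this could be called as get_with_retries(requests.Session(), url, headers=headers, timeout=10), followed by response.raise_for_status() as in the example above.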

Parsing HTML with BeautifulSoup

BeautifulSoup converts raw HTML into a navigable tree structure. Use .find() for the first match or .find_all() for a list. CSS selectors via .select() are often the clearest approach for modern HTML.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# CSS selector approach — often the clearest
products = soup.select("div.product-card")
for product in products:
    title = product.select_one("h2.product-title")
    price = product.select_one("span.price")
    if title is None or price is None:
        continue  # select_one() returns None when nothing matches
    print(f"{title.get_text(strip=True)}: {price.get_text(strip=True)}")
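To make the difference between .find(), .find_all(), and .select() concrete, the following sketch runs all three against a small inline document. The markup is hypothetical and exists only for this illustration. It uses the built-in "html.parser" backend so it works without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
SAMPLE_HTML = """
<div class="product-card">
  <h2 class="product-title">Kettle</h2><span class="price">$25</span>
</div>
<div class="product-card">
  <h2 class="product-title">Toaster</h2><span class="price">$40</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .find() returns the first matching tag, or None when nothing matches.
first_title = soup.find("h2", class_="product-title")
missing = soup.find("h3")

# .find_all() returns a list of every matching tag.
all_titles = soup.find_all("h2", class_="product-title")

# .select() takes a CSS selector and also returns a list.
selected_titles = soup.select("div.product-card h2.product-title")

find_all_texts = [tag.get_text(strip=True) for tag in all_titles]
select_texts = [tag.get_text(strip=True) for tag in selected_titles]
print(first_title.get_text(strip=True), find_all_texts, select_texts, missing)
```

Both list-returning calls find the same two headings here; the CSS form tends to stay readable as the nesting gets deeper.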

Extracting Structured Data

Most scraping projects need clean structured output — not raw HTML. Build a list of dictionaries, then save to CSV, JSON, or a database. Use .get_text(strip=True) to remove whitespace and use .get('href') to extract link attributes.

import csv
from bs4 import BeautifulSoup

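def text_or_empty(node) -> str:
    """Return a tag's stripped text, or "" when the selector matched nothing."""
    return node.get_text(strip=True) if node is not None else ""

def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    results = []
    for card in soup.select("div.product-card"):
        link = card.select_one("a")
        results.append({
            "title": text_or_empty(card.select_one("h2")),
            "price": text_or_empty(card.select_one(".price")),
            # .get() avoids a KeyError when the <a> tag has no href
            "url": link.get("href", "") if link is not None else "",
        })
    return results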

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
    writer.writeheader()
    writer.writerows(extract_products(html))
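The CSV example above covers one of the three outputs mentioned. A rough sketch of the other two, JSON and a database, might look like the following. It assumes SQLite via the standard library and a hypothetical products table keyed on url; the sample rows and the function names save_json and save_sqlite are made up for illustration.

```python
import json
import sqlite3

# Hypothetical records in the same shape extract_products() returns.
rows = [
    {"title": "Kettle", "price": "$25", "url": "/p/kettle"},
    {"title": "Toaster", "price": "$40", "url": "/p/toaster"},
]


def save_json(records: list[dict], path: str) -> None:
    """Write records as a UTF-8 JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_sqlite(records: list[dict], path: str) -> int:
    """Insert or replace records keyed on url; return the table's row count."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(url TEXT PRIMARY KEY, title TEXT, price TEXT)"
        )
        # Named placeholders pull values straight from each dict.
        conn.executemany(
            "INSERT OR REPLACE INTO products (url, title, price) "
            "VALUES (:url, :title, :price)",
            records,
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    finally:
        conn.close()
```

Keying the table on url means rerunning the scraper updates existing rows instead of duplicating them, which is usually what you want for repeated crawls.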

Handling JavaScript-Rendered Pages

Many modern websites — product pages, dashboards, search results — load content dynamically with JavaScript. A plain requests.get() returns an empty template with none of the data. You need a browser that executes JavaScript and waits for the page to fully render before returning HTML.

The two main approaches: run a headless browser locally (Playwright, Selenium), or use a cloud rendering API. Running browsers locally adds significant complexity — you need to manage Chrome binaries, handle browser crashes, configure viewport and language settings, and solve detection challenges.

AlterLab's API handles rendering server-side. You send one POST request with render_js: true and get back fully rendered HTML — no browser management required.
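Before reaching for a rendering solution, it can help to check whether a plain fetch really did return an empty shell. The heuristic below is only a sketch: it assumes a page is JavaScript-rendered when it contains script tags but very little visible text. The 200-character threshold and the name looks_js_rendered are arbitrary choices made for this illustration and will need tuning per site.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in a document."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # above 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html: str, min_visible_chars: int = 200) -> bool:
    """Heuristic: scripts are present but almost no text reached the HTML."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

A True result is a hint, not proof: some static pages are simply short, and some JavaScript-heavy pages ship a lot of placeholder text.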

Scraping at Scale — Async and Concurrency

For large datasets (thousands of URLs), sequential fetching is too slow. Use asyncio with httpx or aiohttp to send concurrent requests. Throughput depends on network latency and on the rate limits of the target server or API; under favorable conditions a well-tuned async scraper can reach roughly 50–100 URLs per second.

import asyncio
import httpx

API_KEY = "YOUR_API_KEY"

async def scrape_url(client: httpx.AsyncClient, url: str) -> dict:
    response = await client.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"url": url, "render_js": True},
        timeout=30,
    )
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
    return {"url": url, "html": response.json().get("html", "")}

async def scrape_many(urls: list[str]) -> list[dict]:
    async with httpx.AsyncClient() as client:
        tasks = [scrape_url(client, url) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(scrape_many([
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]))
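The example above starts every request at once, which is fine for three URLs but risky for thousands. One common way to cap the load is an asyncio.Semaphore. The sketch below is an assumption-laden illustration: gather_bounded is a hypothetical helper, the default limit of 10 is arbitrary, and the fetch callable stands in for something like scrape_url bound to a client.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    urls: list[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 10,
) -> list[dict]:
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with semaphore:  # waits here once the limit is reached
            try:
                return await fetch(url)
            except Exception as exc:
                # Record the failure so one bad URL does not abort the batch.
                return {"url": url, "error": repr(exc)}

    # gather() preserves input order, so results line up with urls.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

Inside an httpx.AsyncClient context this could be called as await gather_bounded(urls, lambda u: scrape_url(client, u), max_concurrency=20). Results that contain an "error" key can be logged and retried separately.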

Common Pitfalls and How to Avoid Them

The most common issues in Python web scraping:

**1. No timeout set**: requests.get() without a timeout can hang indefinitely. Always pass timeout=10 or timeout=(connect, read).

**2. No error handling**: Sites return 403, 429, and 503. Use response.raise_for_status() and wrap the call in try/except.

**3. Scraping JavaScript-only content**: If select_one() returns None and select() returns an empty list for selectors that match in your browser's developer tools, the page is probably rendered with JavaScript. Switch to a rendering solution.

**4. IP-based rate limiting**: Sending requests too fast gets your IP blocked. Add delays, or use a service that manages IP rotation.

**5. Brittle CSS selectors**: Sites change their HTML structure. Use the most specific stable selectors (data-* attributes, IDs) and add assertions to detect when extraction breaks.
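The assertions mentioned in the last pitfall can be as simple as checking how many records came back and how often each field is filled. A minimal sketch follows. The function names and the 90% threshold are assumptions made for this illustration, and the right limits depend on the site.

```python
def field_fill_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records with a non-empty value for each field."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if str(r.get(field) or "").strip()) / len(records)
        for field in fields
    }


def check_extraction(records: list[dict], fields: list[str],
                     min_records: int = 1, min_fill: float = 0.9) -> list[str]:
    """Return a list of problems; an empty list means the batch looks healthy."""
    problems = []
    if len(records) < min_records:
        problems.append(f"expected at least {min_records} records, got {len(records)}")
    for field, rate in field_fill_rates(records, fields).items():
        if rate < min_fill:
            problems.append(f"field '{field}' filled in only {rate:.0%} of records")
    return problems
```

Running check_extraction(results, ["title", "price"]) after each page and logging or raising on a non-empty result turns a silent selector break into a visible failure.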

Complete Python Scraper — Static Site

A complete working scraper with error handling, session management, and CSV output.

import requests
from bs4 import BeautifulSoup
import csv
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_page(url: str, session: requests.Session) -> str:
    response = session.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text

def parse_articles(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    articles = []
    for article in soup.select("article.post"):
        # Each selector may match nothing, so guard every lookup against None.
        heading = article.select_one("h2")
        link = article.select_one("a")
        time_tag = article.select_one("time")
        articles.append({
            "title": heading.get_text(strip=True) if heading is not None else "",
            "url": link.get("href", "") if link is not None else "",
            "date": time_tag.get("datetime", "") if time_tag is not None else "",
        })
    return articles

def scrape_site(base_url: str, max_pages: int = 10) -> list[dict]:
    all_results = []
    with requests.Session() as session:
        for page in range(1, max_pages + 1):
            url = f"{base_url}?page={page}"
            logger.info(f"Scraping page {page}: {url}")
            try:
                html = fetch_page(url, session)
                results = parse_articles(html)
                if not results:
                    logger.info("No more results — stopping")
                    break
                all_results.extend(results)
                time.sleep(1)  # polite crawl delay
            except requests.RequestException as e:  # HTTP errors, timeouts, connection failures
                logger.error(f"Request failed on page {page}: {e}")
                break
    return all_results

results = scrape_site("https://example.com/articles")
with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url", "date"])
    writer.writeheader()
    writer.writerows(results)
logger.info(f"Saved {len(results)} articles")
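One detail the scraper above leaves open is that href values are often relative (for example /articles/a), and the same article can appear on more than one page. A small post-processing step, sketched here with the standard library, resolves links against the page URL and drops repeats. The helper name absolutize and the sample values are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def absolutize(records: list[dict], page_url: str, key: str = "url") -> list[dict]:
    """Resolve relative links against page_url and drop duplicate URLs."""
    seen = set()
    cleaned = []
    for record in records:
        href = record.get(key, "")
        if not href:
            cleaned.append(record)  # keep records that have no link at all
            continue
        # urljoin handles relative paths; urldefrag strips "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute in seen:
            continue
        seen.add(absolute)
        cleaned.append({**record, key: absolute})
    return cleaned
```

In scrape_site this could be applied as results = absolutize(parse_articles(html), url) before extending all_results. Deduplicating across pages would need the seen set to live outside the function.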

Or Skip the Complexity — Use AlterLab

AlterLab handles JavaScript rendering, IP rotation, and website compatibility automatically. One POST request returns rendered HTML, with no browser management or proxy configuration. Pricing starts at $0.0002 per request, with 5,000 free requests to start.

import requests
from bs4 import BeautifulSoup
import csv

API_KEY = "YOUR_API_KEY"  # Get free at alterlab.io

def scrape_with_alterlab(url: str, render_js: bool = False) -> str:
    """Fetch a URL through AlterLab — handles rendering and website compatibility."""
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        json={"url": url, "render_js": render_js},
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("html", "")

# Works on JavaScript-heavy pages and sites with compatibility layers
html = scrape_with_alterlab(
    "https://example.com/products",
    render_js=True,  # Enable for SPAs and dynamic content
)

soup = BeautifulSoup(html, "lxml")
products = [
    {
        "title": el.select_one("h2").get_text(strip=True),
        "price": el.select_one(".price").get_text(strip=True),
    }
    for el in soup.select("div.product-card")
    if el.select_one("h2") and el.select_one(".price")
]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(products)
print(f"Extracted {len(products)} products")

Choosing Your Approach

requests + BeautifulSoup

Pros

  • Simple and fast for static pages
  • Low resource usage
  • Full Python control

Cons

  • No JavaScript execution
  • Manual IP rotation needed
  • Breaks on challenge pages

Playwright / Selenium

Pros

  • Full browser: executes any JavaScript
  • Handles complex interactions

Cons

  • High memory usage (1+ GB per browser)
  • Slow: 5–15 seconds per page
  • Browser detection is common
  • Complex setup and maintenance

AlterLab API

Pros

  • Handles static, JavaScript, and challenge pages
  • No browser management
  • Automatic IP rotation
  • 5-tier auto-escalation
  • From $0.0002/request

Cons

  • Per-request cost
  • Requires network access

