Web Scraping with Python — Complete Guide
From your first request to production-scale data extraction. Everything you need to scrape websites reliably with Python.
Python is one of the most popular languages for web scraping, and for good reason. The ecosystem is mature, the libraries are well documented, and the path from prototype to production is short. This guide covers the full journey: fetching pages, parsing HTML, handling JavaScript-rendered content, and scaling to thousands of URLs.
Setting Up Your Python Scraping Environment
You need two libraries to get started: requests for HTTP and BeautifulSoup for HTML parsing. Install them with pip:

    pip install requests beautifulsoup4 lxml

For async scraping (faster when collecting many URLs), also install httpx:

    pip install httpx

For structured data extraction and saving to CSV or DataFrame, add pandas:

    pip install pandas

Or install everything in one command:

    pip install requests beautifulsoup4 lxml httpx pandas

Fetching a Web Page with Python
The simplest scrape: fetch a URL and read its HTML. Use requests.get() for synchronous fetching. Always check the status code before parsing: a 200 means success, a 403 means the server refused the request (often because of bot detection or missing headers), and a 429 means you are being rate-limited.

    import requests

    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    response = requests.get("https://example.com/products", headers=headers, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    html = response.text  # decoded text, used for parsing
    print(f"Fetched {len(response.content)} bytes")  # response.content is the raw body

Parsing HTML with BeautifulSoup
BeautifulSoup converts raw HTML into a navigable tree structure. Use .find() for the first match or .find_all() for a list. CSS selectors via .select() are often the clearest approach for modern HTML.

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")

    # CSS selectors are often the clearest way to target elements
    products = soup.select("div.product-card")
    for product in products:
        title = product.select_one("h2.product-title")
        price = product.select_one("span.price")
        if title is None or price is None:
            continue  # skip cards that are missing either field
        print(f"{title.get_text(strip=True)}: {price.get_text(strip=True)}")

Extracting Structured Data
Most scraping projects need clean, structured output rather than raw HTML. Build a list of dictionaries, then save it to CSV, JSON, or a database. Use .get_text(strip=True) to trim surrounding whitespace and .get('href') to read link attributes.
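The list-of-dictionaries shape maps directly onto the JSON and database targets. As a minimal sketch using only the standard library, the following writes records to a JSON file and to SQLite. The sample rows, the function names save_json and save_sqlite, and the products table layout are illustrative assumptions, not data from any real site.

```python
import json
import sqlite3

# Illustrative rows standing in for scraper output (not real results).
rows = [
    {"title": "Widget", "price": "$9.99", "url": "/products/widget"},
    {"title": "Gadget", "price": "$19.50", "url": "/products/gadget"},
]


def save_json(records: list[dict], path: str) -> None:
    """Write the records to path as a pretty-printed JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def save_sqlite(records: list[dict], path: str) -> int:
    """Insert records into a 'products' table and return the table's row count."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back if an error is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS products (title TEXT, price TEXT, url TEXT)"
            )
            conn.executemany(
                "INSERT INTO products (title, price, url) VALUES (:title, :price, :url)",
                records,
            )
        return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    finally:
        conn.close()
```

Calling save_json(rows, "products.json") or save_sqlite(rows, "products.db") would persist the sample rows. The CSV route, using the same record shape with parsed data, follows.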

    import csv
    from bs4 import BeautifulSoup


    def text_of(node) -> str:
        """Return the element's stripped text, or "" when it was not found."""
        return node.get_text(strip=True) if node is not None else ""


    def extract_products(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")
        results = []
        for card in soup.select("div.product-card"):
            link = card.select_one("a")
            results.append({
                "title": text_of(card.select_one("h2")),
                "price": text_of(card.select_one(".price")),
                "url": link.get("href", "") if link is not None else "",
            })
        return results


    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "url"])
        writer.writeheader()
        writer.writerows(extract_products(html))

Handling JavaScript-Rendered Pages
Many modern websites (product pages, dashboards, search results) load content dynamically with JavaScript. A plain requests.get() returns only the initial HTML, which is often an empty template with none of the data. You need a browser that executes JavaScript and waits for the page to render before returning the HTML.
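One rough way to tell whether a page falls into this category is to measure how much visible text the raw HTML contains. The sketch below is a heuristic, not a reliable detector: it uses the standard library's html.parser, and the function names and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, ignoring <script>, <style>, and <noscript> content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped tags.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of characters of visible text in the raw HTML."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def probably_needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests a JavaScript-filled shell."""
    return visible_text_length(html) < min_chars
```

If probably_needs_rendering(response.text) returns True for a page that shows plenty of content in a browser, the data is most likely injected by JavaScript after load.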
The two main approaches: run a headless browser locally (Playwright, Selenium), or use a cloud rendering API. Running browsers locally adds significant complexity — you need to manage Chrome binaries, handle browser crashes, configure viewport and language settings, and solve detection challenges.
AlterLab's API handles rendering server-side. You send one POST request with render_js: true and get back fully rendered HTML — no browser management required.
Scraping at Scale — Async and Concurrency
For large datasets (thousands of URLs), sequential fetching is too slow. Use asyncio with httpx or aiohttp to send concurrent requests. A well-tuned async scraper can process on the order of 50–100 URLs per second; actual throughput depends on network latency and the target server's rate limits.
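When fanning out requests, it usually helps to cap how many are in flight at once so the target server is not flooded. A minimal sketch with asyncio.Semaphore follows; fake_fetch is a hypothetical stand-in for a real HTTP call, and the default limit of 5 is an arbitrary assumption to tune per site.

```python
import asyncio
from typing import Awaitable, Callable


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html>{url}</html>"


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]] = fake_fetch,
    max_concurrency: int = 5,
) -> list[str]:
    """Run fetch for every URL with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:  # waits here once the limit is reached
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
    pages = asyncio.run(fetch_all(urls))
    print(f"Fetched {len(pages)} pages")
```

In a real scraper, a coroutine that performs the HTTP request would be passed as fetch. The example that follows sends real requests with httpx and could be wrapped the same way.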

    import asyncio
    import httpx

    API_KEY = "YOUR_API_KEY"


    async def scrape_url(client: httpx.AsyncClient, url: str) -> dict:
        response = await client.post(
            "https://api.alterlab.io/api/v1/scrape",
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url, "render_js": True},
            timeout=30,
        )
        response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
        return {"url": url, "html": response.json().get("html", "")}


    async def scrape_many(urls: list[str]) -> list[dict]:
        async with httpx.AsyncClient() as client:
            tasks = [scrape_url(client, url) for url in urls]
            return await asyncio.gather(*tasks)


    results = asyncio.run(scrape_many([
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ]))

Common Pitfalls and How to Avoid Them
The most common issues in Python web scraping:
**1. No timeout set.** requests.get() without a timeout can hang indefinitely. Always pass a timeout such as timeout=10, or a (connect, read) tuple such as timeout=(3, 10).
**2. No error handling.** Sites return 403, 429, and 503 responses. Call response.raise_for_status() and wrap the request in try/except.
**3. Scraping JavaScript-only content.** If select_one() returns None and select() returns an empty list for every selector, the page probably renders its content with JavaScript. Switch to a rendering solution.
**4. IP-based rate limiting.** Sending requests too fast can get your IP blocked. Add delays, or use a service that manages IP rotation.
**5. Brittle CSS selectors.** Sites change their HTML structure. Prefer stable, specific selectors (data-* attributes, IDs) and add assertions to detect when extraction breaks.
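Pitfalls 2 and 4 are often handled together with a retry loop that waits longer after each failure. The sketch below shows exponential backoff with jitter using only the standard library. RetryableError and retry_with_backoff are hypothetical names, and the operation passed in is assumed to raise RetryableError for responses such as 429 or 503.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 429 or 503)."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    Between tries, waits base_delay * 2**(attempt - 1) seconds plus up to
    10% random jitter so that parallel workers do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

With requests, the operation could be a small function that fetches a URL and raises RetryableError when response.status_code is 429 or 503, letting every other error propagate immediately.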
Complete Python Scraper — Static Site
Complete working scraper with error handling, session management, and CSV output.

    import csv
    import logging
    import time

    import requests
    from bs4 import BeautifulSoup

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; DataBot/1.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


    def fetch_page(url: str, session: requests.Session) -> str:
        response = session.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        return response.text


    def parse_articles(html: str) -> list[dict]:
        soup = BeautifulSoup(html, "lxml")
        articles = []
        for article in soup.select("article.post"):
            heading = article.select_one("h2")
            link = article.select_one("a")
            published = article.select_one("time")
            # Fall back to "" when an element or attribute is missing
            articles.append({
                "title": heading.get_text(strip=True) if heading is not None else "",
                "url": link.get("href", "") if link is not None else "",
                "date": published.get("datetime", "") if published is not None else "",
            })
        return articles


    def scrape_site(base_url: str, max_pages: int = 10) -> list[dict]:
        all_results = []
        with requests.Session() as session:
            for page in range(1, max_pages + 1):
                url = f"{base_url}?page={page}"
                logger.info(f"Scraping page {page}: {url}")
                try:
                    html = fetch_page(url, session)
                except requests.RequestException as e:
                    # Covers HTTP errors, timeouts, and connection failures
                    logger.error(f"Request failed on page {page}: {e}")
                    break
                results = parse_articles(html)
                if not results:
                    logger.info("No more results, stopping")
                    break
                all_results.extend(results)
                time.sleep(1)  # polite crawl delay
        return all_results


    results = scrape_site("https://example.com/articles")
    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url", "date"])
        writer.writeheader()
        writer.writerows(results)
    logger.info(f"Saved {len(results)} articles")

Or Skip the Complexity: Use AlterLab
AlterLab handles JavaScript rendering, IP rotation, and website compatibility automatically. One POST request returns rendered HTML — no browser management, no proxy configuration. Starts at $0.0002/request with 5,000 free to start.

    import csv

    import requests
    from bs4 import BeautifulSoup

    API_KEY = "YOUR_API_KEY"  # Get free at alterlab.io


    def scrape_with_alterlab(url: str, render_js: bool = False) -> str:
        """Fetch a URL through AlterLab, which handles rendering and website compatibility."""
        response = requests.post(
            "https://api.alterlab.io/api/v1/scrape",
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url, "render_js": render_js},
            timeout=30,
        )
        response.raise_for_status()
        return response.json().get("html", "")


    # Works on JavaScript-heavy pages and sites with compatibility layers
    html = scrape_with_alterlab(
        "https://example.com/products",
        render_js=True,  # Enable for SPAs and dynamic content
    )

    soup = BeautifulSoup(html, "lxml")
    products = [
        {
            "title": el.select_one("h2").get_text(strip=True),
            "price": el.select_one(".price").get_text(strip=True),
        }
        for el in soup.select("div.product-card")
        if el.select_one("h2") and el.select_one(".price")
    ]

    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(products)
    print(f"Extracted {len(products)} products")

Choosing Your Approach
requests + BeautifulSoup
Pros
- Simple and fast for static pages
- Low resource usage
- Full Python control
Cons
- No JavaScript execution
- Manual IP rotation needed
- Breaks on challenge pages

Playwright / Selenium
Pros
- Full browser that executes any JavaScript
- Handles complex interactions
Cons
- High memory usage (1+ GB per browser)
- Slow: 5–15 seconds per page
- Browser detection is common
- Complex setup and maintenance

AlterLab API
Pros
- Handles static, JavaScript, and challenge pages
- No browser management
- Automatic IP rotation
- 5-tier auto-escalation
- From $0.0002/request
Cons
- Per-request cost
- Requires network access