protocol

HTTP

HTTP (Hypertext Transfer Protocol) is the application-layer protocol that web browsers and scrapers use to request and receive resources from web servers.

HTTP defines how clients (browsers, scrapers) formulate requests and how servers respond. A request specifies a method (such as GET, POST, PUT, or DELETE), a URL, and optional headers and body. The server responds with a status code, response headers, and an optional body containing the requested resource (HTML, JSON, image, etc.).
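
To make the request and response anatomy concrete, the following sketch serializes an HTTP/1.1 request by hand and splits a canned response into status code, headers, and body. The helper names `build_request` and `parse_response` are hypothetical and exist only for illustration. The parser is deliberately simplified (it ignores chunked encoding and repeated headers), so a real scraper should rely on a client library such as httpx instead.

```python
def build_request(method, host, path="/", headers=None, body=b""):
    """Serialize an HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    all_headers = dict(headers or {})
    if body:
        # A body needs a length so the server knows where it ends.
        all_headers["Content-Length"] = str(len(body))
    lines += [f"{name}: {value}" for name, value in all_headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


def parse_response(raw):
    """Split a raw HTTP/1.1 response into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, *_reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), headers, body


request = build_request("GET", "example.com", "/", {"Accept": "text/html"})
print(request.decode("ascii"))

# A canned response standing in for what a server would send back.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
print(parse_response(raw_response))
```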

Key HTTP status codes for scrapers: 200 OK (success), 301/302 (redirect; follow the URL in the Location header), 403 Forbidden (access denied), 404 Not Found, 429 Too Many Requests (rate limited; wait for the period given in Retry-After when the header is present), 503 Service Unavailable (server temporarily unable to handle the request; it may also send Retry-After). Understanding these codes is essential for writing robust retry and error-handling logic.
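
The retry and error-handling logic these codes call for might be sketched as below. This is one possible policy, not a standard one: the function names `retry_after_seconds` and `plan`, the attempt limit, and the exponential backoff fallback are all assumptions made for illustration, and the `headers` argument is assumed to be a plain dict with lowercase keys. The redirect branch also covers 303, 307, and 308, which carry a Location header in the same way as 301 and 302.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Parse a Retry-After value, which is either delay-seconds or an HTTP-date."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller fall back
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def plan(status, headers, attempt, max_attempts=4, base_delay=1.0):
    """Return (action, detail), where action is 'ok', 'redirect', 'retry', or 'fail'."""
    if 200 <= status < 300:
        return ("ok", None)
    if status in (301, 302, 303, 307, 308):
        return ("redirect", headers.get("location"))
    if status in (429, 503) and attempt < max_attempts:
        delay = retry_after_seconds(headers.get("retry-after"))
        if delay is None:
            delay = base_delay * 2 ** attempt  # assumed fallback: exponential backoff
        return ("retry", delay)
    # 403, 404, and exhausted retries are treated as permanent failures here.
    return ("fail", status)
```

A caller would sleep for the returned delay on `retry`, re-request the returned URL on `redirect`, and log or skip the URL on `fail`. Clients such as httpx can follow redirects themselves, in which case only the other branches matter.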

HTTP/1.1 uses text-based headers and persistent connections. HTTP/2 uses binary framing and multiplexes many requests over a single connection. HTTP/3 runs over QUIC (UDP-based). Some anti-bot systems fingerprint the HTTP version and protocol features used by the client, so a scraper should use the same HTTP version that a real browser would negotiate with the target site.
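
One way to see which version a server would negotiate is to inspect the TLS ALPN result, since browsers select HTTP/2 over HTTPS through ALPN. The sketch below uses only the Python standard library. The function names `make_alpn_context` and `negotiated_protocol` are hypothetical, and the check covers TCP-based HTTP/1.1 and HTTP/2 only, because HTTP/3 runs over QUIC and cannot be probed this way.

```python
import socket
import ssl


def make_alpn_context(protocols=("h2", "http/1.1")):
    """Build a TLS context that offers HTTP/2 first, then HTTP/1.1, via ALPN."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(list(protocols))
    return context


def negotiated_protocol(host, port=443, timeout=10.0):
    """Open a TLS connection and return the ALPN protocol the server selected."""
    context = make_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # None means the server did not select any offered protocol.
            return tls.selected_alpn_protocol()
```

Calling `negotiated_protocol("example.com")` returns a string such as `h2` or `http/1.1`. If the server selects `h2`, a scraper built on httpx can match it by creating the client with `http2=True`, which requires the optional `h2` dependency.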

Examples

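# Minimal HTTP GET with Python httpx
import httpx

response = httpx.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 ..."},  # placeholder; send a full browser UA string
    follow_redirects=True,  # httpx does not follow redirects by default
    timeout=30,  # seconds
)
# Print the final status code and the length of the decoded body
print(response.status_code, len(response.text))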