Web Scraping Glossary

Definitions of the key terms in web scraping, proxy infrastructure, and anti-bot protection — the reference for developers building data pipelines.

Web Scraping→

Automated extraction of data from websites using software tools that parse HTML and collect structured information at scale.

Full definition →

Data Extraction→

The process of identifying and pulling specific fields from raw web content, converting unstructured HTML into typed JSON records.

Full definition →

Web Crawler→

A program that automatically navigates websites by following hyperlinks, systematically discovering and fetching pages across a site or the wider web.

Full definition →

Residential Proxy→

An IP address assigned by an ISP to a real residential device, used to route scraping requests through genuine consumer IPs.

Full definition →

Datacenter Proxy→

An IP address hosted in a commercial data centre rather than assigned by a residential ISP — faster but more easily detected by anti-bot systems.

Full definition →

Rotating Proxy→

A proxy configuration that automatically assigns a different IP address for each request or at a set interval to prevent IP-based rate limiting.

Full definition →

IP Rotation→

The practice of cycling through multiple IP addresses across requests so no single IP triggers rate limits or bans.
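
As a rough illustration of the cycling described above, the sketch below pairs each outgoing URL with the next proxy in round-robin order. The proxy addresses and the `assign_proxies` helper are hypothetical and exist only for this example; no requests are sent.

```python
from itertools import cycle

# Hypothetical proxy endpoints; a real pool would come from a provider.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]

urls = [f"https://example.com/page/{n}" for n in range(1, 5)]
plan = assign_proxies(urls, PROXIES)
```

With three proxies and four URLs, the fourth request wraps around to the first proxy again.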

Full definition →

Headless Browser→

A web browser that runs without a graphical user interface, executing JavaScript and rendering pages with the same engine as its headed counterpart.

Full definition →

JavaScript Rendering→

The process of executing a page's JavaScript in a browser engine before extracting content, required for client-rendered SPAs built with frameworks such as React, Vue, and Angular.

Full definition →

Browser Fingerprinting→

A technique anti-bot systems use to identify bots by collecting dozens of browser attributes including Canvas, WebGL, and timing signatures.

Full definition →

TLS Fingerprint→

A signature derived from TLS handshake parameters a client sends when opening an HTTPS connection, used by anti-bot systems to classify traffic.

Full definition →

HTTP/2 Fingerprint→

A network-level identifier derived from a client's HTTP/2 connection parameters, including SETTINGS frame values and pseudo-header ordering, used alongside TLS fingerprinting.

Full definition →

Anti-Bot Protection→

Systems deployed by websites to detect and block automated traffic using TLS fingerprinting, CAPTCHA challenges, and behavioural analysis.

Full definition →

CAPTCHA→

Completely Automated Public Turing test to tell Computers and Humans Apart: a challenge designed to be easy for humans and hard for bots, used to gate access to content.

Full definition →

Cloudflare Bot Management→

Cloudflare's enterprise bot detection suite combining machine learning, browser fingerprinting, TLS fingerprinting, and JavaScript challenges.

Full definition →

DataDome→

A real-time bot detection platform that analyses every HTTP request using device fingerprinting, behavioural signals, and IP reputation.

Full definition →

Akamai Bot Manager→

Akamai's enterprise bot detection solution deployed at the CDN edge using sensor data, TLS fingerprinting, and IP reputation.

Full definition →

Rate Limiting→

A server-side control that caps the number of requests accepted from a single IP or session within a time window, typically returning HTTP 429 when the cap is exceeded.

Full definition →

Session Management→

The handling of cookies, authentication tokens, and browser state across multiple requests to maintain a continuous browsing session.

Full definition →

Dynamic Content→

Page content generated client-side by JavaScript after the initial HTML is served, requiring a browser engine to render.

Full definition →

Stealth Mode→

Browser automation patches that normalise headless browser signals to match genuine user browser profiles and pass browser environment checks.

Full definition →

User Agent→

An HTTP header string identifying the browser, version, and operating system to the web server, checked by anti-bot systems against other request properties.

Full definition →

Proxy Pool→

A managed collection of proxy IP addresses used to distribute outbound requests across multiple origins and stay below per-IP rate limits.

Full definition →

Structured Data Extraction→

Converting free-form HTML into typed JSON records using explicit schemas, producing clean structured output instead of raw markup.

Full definition →

Webhook→

An HTTP callback that delivers data to a specified URL when an async scrape job completes.

Full definition →

REST API→

A web service that exposes data and actions through standard HTTP methods and resource-oriented URLs, returning structured JSON responses.

Full definition →

Playwright→

An open-source browser automation library from Microsoft that controls Chromium, Firefox, and WebKit through a unified async API.

Full definition →

Puppeteer→

A Node.js library providing a high-level API to control Chrome or Chromium over the DevTools Protocol, commonly used for scraping JavaScript-heavy sites.

Full definition →

robots.txt→

A plain-text file at a website's root specifying which URL paths crawlers are allowed or disallowed to access.
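
One way to check a path against such rules is Python's standard `urllib.robotparser`. The sketch below parses an inline example file rather than fetching one; the rules, the `example-bot` user agent, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers: may this user agent request this URL?
allowed = parser.can_fetch("example-bot", "https://example.com/products/1")
blocked = parser.can_fetch("example-bot", "https://example.com/private/report")
```

Here `allowed` is true and `blocked` is false, since only paths under `/private/` are disallowed.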

Full definition →

SERP→

Search Engine Results Page — the page returned by a search engine for a query, containing organic results, ads, and rich features.

Full definition →

Sitemap→

An XML file listing the URLs a website wants crawled, with metadata like last modification date, helping crawlers discover and prioritise pages.

Full definition →

CSS Selector→

A pattern used to match HTML elements by tag, class, ID, or attribute, used in web scraping to extract specific nodes from parsed HTML.

Full definition →

XPath→

XML Path Language — a query syntax for selecting nodes in HTML or XML using path expressions, more expressive than CSS selectors for complex traversals.
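
A small sketch of a path expression with attribute predicates follows. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; real-world HTML usually calls for a dedicated HTML parser. The markup and class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; class names are illustrative only.
MARKUP = """
<ul>
  <li class="item"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="item"><span class="name">Gadget</span><span class="price">24.50</span></li>
  <li class="ad"><span class="name">Sponsored</span></li>
</ul>
"""

root = ET.fromstring(MARKUP)

# Select price spans, but only inside list items whose class is "item".
path = ".//li[@class='item']/span[@class='price']"
prices = [el.text for el in root.findall(path)]
```

The predicate on the parent `li` is what filters out the sponsored row.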

Full definition →

BeautifulSoup→

A Python library for parsing HTML and XML and navigating the parse tree using CSS selectors or tag methods.

Full definition →

Scrapy→

An open-source Python framework for large-scale web crawling and scraping with async request scheduling, deduplication, and output pipelines.

Full definition →

HTTP Headers→

Key-value metadata fields sent with HTTP requests and responses that communicate authentication, content type, caching, and other directives.

Full definition →

Geolocation Targeting→

Routing scraping requests through proxy IPs in a specific country or region to access geo-restricted content or localised pricing.

Full definition →

Parse Tree→

The hierarchical tree structure created by an HTML parser that represents the DOM elements of a web page for programmatic traversal.

Full definition →

Pagination→

The division of large data sets across multiple pages, requiring scrapers to follow next-page links or increment page parameters to retrieve complete data.
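
The page-parameter approach can be sketched as a loop that stops at the first empty page. `fetch_page` below is a hypothetical stand-in for an HTTP call and returns canned data, and the `max_pages` guard is an assumption added to avoid looping forever on a misbehaving site.

```python
def fetch_page(page):
    """Stand-in for an HTTP request: returns the records on one page."""
    data = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
    return data.get(page, [])

def scrape_all(fetch, max_pages=100):
    """Increment the page number until a page comes back empty."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:  # an empty page signals the end of the data set
            break
        records.extend(batch)
    return records

all_records = scrape_all(fetch_page)
```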

Full definition →

Content Delivery Network (CDN)→

A distributed network of servers that delivers web content from locations geographically close to the user, often where anti-bot protection is deployed.

Full definition →

Honeypot→

A hidden link or element that is invisible to real users but present in the HTML scrapers parse, used to identify and ban automated crawlers that interact with it.

Full definition →

Scraping API→

A managed web scraping service that abstracts proxy rotation, JavaScript rendering, and automatic website compatibility into a single HTTP endpoint.

Full definition →

DOM (Document Object Model)→

The tree-structured programmatic representation of a web page that JavaScript manipulates to create dynamic content.

Full definition →

Caching→

Storing copies of web responses to serve repeat requests faster, which can cause scrapers to receive stale data instead of live content.

Full definition →

JSON→

JavaScript Object Notation, a lightweight and widely used text format for structured data exchange between APIs and web services.

Full definition →

API Authentication→

The mechanisms used to verify the identity of callers making requests to an API, typically via API keys, OAuth tokens, or JWTs.

Full definition →

Mobile Proxy→

An IP address from a mobile carrier network (4G/5G), typically treated as high-trust by anti-bot systems because each carrier IP is shared by many genuine mobile devices.

Full definition →

Network Interception→

Capturing XHR/fetch API responses made by a page's JavaScript during browser rendering to extract structured data directly from network calls.

Full definition →

GraphQL→

A query language for APIs that allows clients to request exactly the fields they need, commonly used by modern SPAs and often a more efficient scraping target than REST.

Full definition →

Retry Logic→

Automated re-sending of failed requests with backoff strategies, essential for handling transient errors, rate limits, and flaky anti-bot challenges.

Full definition →

WAF (Web Application Firewall)→

A WAF is a security layer that inspects and filters HTTP traffic to block malicious requests including automated scrapers.

Full definition →

JA3 Fingerprint→

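JA3 is a method of fingerprinting a TLS ClientHello by hashing its version, cipher suites, extensions, elliptic curves, and point formats into an MD5 digest, identifying the TLS library (and therefore often the application) making a connection.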
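
To make the hashing step concrete, the sketch below builds a JA3-style string (five comma-separated fields, with the values inside each field joined by dashes) and takes its MD5 digest. The numeric values are illustrative and not captured from a real client, and the helper names are assumptions for this example.

```python
import hashlib

def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields: commas between fields, dashes within."""
    fields = [ciphers, extensions, curves, point_formats]
    joined = ["-".join(str(value) for value in field) for field in fields]
    return ",".join([str(version)] + joined)

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()

# Illustrative values only.
example = ja3_string(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
digest = ja3_hash(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
```

Because the digest depends on the order of ciphers and extensions, two clients offering the same set in a different order produce different fingerprints.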

Full definition →

Bot Score→

A bot score is a numeric value (commonly on a 0–100 scale) assigned by an anti-bot system to estimate how likely a request is to come from automated software rather than a human; the scale and its direction vary by vendor.

Full definition →

Behavioral Analysis→

Behavioral analysis is an anti-bot technique that tracks mouse movements, keystrokes, scroll patterns, and timing to distinguish humans from bots.

Full definition →

Challenge Page→

A challenge page is an interstitial served by an anti-bot system that asks a visitor to prove they are human before accessing the target content.

Full definition →

Cloudflare Turnstile→

Cloudflare Turnstile is a privacy-friendly CAPTCHA alternative that uses browser signals and proofs-of-work to verify humans without visible puzzles.

Full definition →

Proof-of-Work Challenge→

A proof-of-work challenge requires a client to perform a computationally expensive calculation before accessing a resource, making large-scale automated requests economically costly.

Full definition →

Infinite Scroll→

Infinite scroll is a UX pattern where additional content loads automatically as the user scrolls toward the bottom of the page, replacing traditional pagination.

Full definition →

Form Submission→

Form submission scraping involves programmatically filling and submitting HTML forms to navigate sites that gate content behind search fields, logins, or multi-step workflows.

Full definition →

Screenshot Capture→

Screenshot capture renders a web page in a headless browser and saves the visual output as an image, useful for visual monitoring, archiving, or content that is difficult to parse as HTML.

Full definition →

Iframe Scraping→

Iframe scraping involves accessing content embedded inside HTML iframe elements, which load content from a separate URL and have their own isolated DOM.

Full definition →

SPA Scraping→

SPA scraping targets Single-Page Applications, where the HTML shell is static but all content is rendered by JavaScript after the initial page load.

Full definition →

Link Following→

Link following (or crawling) is the process of discovering and visiting URLs by extracting hyperlinks from already-scraped pages, enabling systematic traversal of a website.

Full definition →

PDF Extraction→

PDF extraction involves parsing PDF files to retrieve structured text, tables, and metadata from documents served as downloadable files rather than HTML pages.

Full definition →

JSON-LD→

JSON-LD is a W3C standard for expressing linked data in JSON; on web pages it is usually embedded in `<script type="application/ld+json">` tags and uses the schema.org vocabulary.
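
For scrapers, JSON-LD blocks are convenient because they can be read without any selectors. The following is a minimal sketch using only the standard library; the `JsonLdExtractor` class and the sample page are hypothetical, and real pages may contain malformed JSON that needs error handling.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False

PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget"}
</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(PAGE)
```

After `feed`, `extractor.items` holds one dictionary per JSON-LD block found on the page.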

Full definition →

Schema.org→

Schema.org is a collaborative vocabulary of structured data types used to annotate web content so search engines and automated tools can understand its meaning.

Full definition →

Open Graph Protocol→

Open Graph is a protocol that uses meta tags in an HTML page's head to define how a URL is displayed when shared on social media platforms.

Full definition →

Table Parsing→

Table parsing extracts structured rows and columns of data from HTML `<table>` elements, converting them into arrays, dataframes, or other tabular formats.

Full definition →

Regex Extraction→

Regex extraction uses regular expressions to match and capture specific text patterns — such as prices, dates, or identifiers — from raw HTML or text content.
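
A short sketch of capturing prices with a regular expression is shown below. The pattern assumes US-style dollar amounts with optional thousands separators and cents; other locales would need a different pattern.

```python
import re

# A dollar sign, then digits with optional ",ddd" groups and optional cents.
PRICE = re.compile(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

text = "<p>Was $1,299.00, now $999.50. Ships in 3 days.</p>"

# With a single capture group, findall returns just the captured amounts.
prices = PRICE.findall(text)
```

The bare "3" in "3 days" is ignored because the pattern requires a leading dollar sign.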

Full definition →

Microdata→

Microdata is an HTML specification for embedding machine-readable structured data directly in HTML elements using `itemscope`, `itemtype`, and `itemprop` attributes.

Full definition →

RAG (Retrieval-Augmented Generation)→

RAG is an AI architecture that grounds LLM responses in retrieved documents, combining a retrieval step (often vector search) with language model generation to reduce hallucinations.

Full definition →

Embeddings→

Embeddings are dense vector representations of text (or other data) produced by a neural network, where semantically similar content maps to nearby points in vector space.

Full definition →

Vector Store→

A vector store is a database optimised for storing and querying high-dimensional embedding vectors by semantic similarity, enabling fast nearest-neighbour search at scale.

Full definition →

MCP (Model Context Protocol)→

MCP is an open protocol introduced by Anthropic that standardises how AI models connect to external tools, APIs, and data sources through a uniform client-server interface.

Full definition →

Tool Use→

Tool use is the ability of an AI model to call external functions or APIs during generation, enabling it to retrieve live data, execute code, and interact with the world.

Full definition →

LLM (Large Language Model)→

A Large Language Model is a neural network trained on vast text corpora that can generate, summarise, translate, and reason over natural language.

Full definition →

Agent Orchestration→

Agent orchestration coordinates multiple AI agents or tool calls in a structured workflow to complete complex, multi-step tasks that exceed a single LLM call.

Full definition →

Exponential Backoff→

Exponential backoff is a retry strategy where the wait time between successive retries increases exponentially, reducing load on rate-limited or temporarily unavailable endpoints.
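
The schedule can be sketched as follows: the delay is multiplied by a fixed factor on each attempt and capped at a maximum. The default values and function names here are assumptions for illustration, and the jitter helper shows one common variant (a random wait between zero and the computed delay), not the only one.

```python
import random

def backoff_delays(retries, base=1.0, factor=2.0, cap=30.0):
    """Delay before each retry: base * factor**attempt, never above cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]

def with_jitter(delay, rng=random):
    """Full jitter: wait a random time between 0 and the computed delay."""
    return rng.uniform(0, delay)

schedule = backoff_delays(6)
```

With these defaults the sixth delay would be 32 seconds uncapped, so it is clipped to 30. Jitter spreads retries from many clients so they do not all hit the endpoint at the same instant.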

Full definition →

Circuit Breaker→

A circuit breaker is a fault-tolerance pattern that stops making requests to a failing endpoint after a threshold of errors, allowing the downstream service time to recover.
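
A minimal sketch of the pattern follows, assuming a consecutive-failure threshold and a fixed cool-down after which a trial request is allowed. The class, its parameters, and the injectable clock are all illustrative choices rather than any particular library's API.

```python
import time

class CircuitBreaker:
    """Stop calling an endpoint after repeated failures, then retry later."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each fetch and report the outcome with `record_success()` or `record_failure()`; a failed trial request reopens the circuit for another cool-down.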

Full definition →

Job Queue→

A job queue is a buffer that decouples producers (tasks submitted by clients) from consumers (workers that execute the tasks), enabling asynchronous, scalable processing.

Full definition →

Idempotency→

Idempotency means that making the same API request multiple times produces the same result as making it once, enabling safe retries without duplicate side effects.

Full definition →

Concurrency→

Concurrency in scraping refers to the number of requests or browser sessions running in parallel, which directly controls throughput and resource consumption.

Full definition →

Connection Pooling→

Connection pooling reuses open TCP/TLS connections across multiple HTTP requests to reduce handshake overhead and improve scraping throughput.

Full definition →

Data Pipeline→

A data pipeline is an automated sequence of steps that ingests raw data from a source, transforms it, and delivers it to a destination such as a database, data warehouse, or analytics system.

Full definition →

HTML→

HTML (HyperText Markup Language) is the standard markup language for structuring web page content, defining elements like headings, paragraphs, links, images, and tables.

Full definition →

HTTP→

HTTP (HyperText Transfer Protocol) is the application-layer protocol used by web browsers and scrapers to request and receive resources from web servers.

Full definition →

URL→

A URL (Uniform Resource Locator) is the address of a resource on the web, comprising a scheme, host, optional port, path, query string, and fragment.
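
The components named above can be pulled apart with Python's standard `urllib.parse`. The URL below is a made-up example.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit("https://example.com:8443/products/list?page=2&sort=price#reviews")

components = {
    "scheme": parts.scheme,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parse_qs(parts.query),  # query string as a dict of lists
    "fragment": parts.fragment,
}
```

Note that the fragment is handled client-side and is not sent to the server, so two URLs that differ only in fragment fetch the same resource.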

Full definition →

DNS (Domain Name System)→

DNS translates human-readable domain names like `example.com` into IP addresses that routers use to deliver network packets.

Full definition →

Canonical URL→

A canonical URL is the preferred URL for a piece of content when multiple URLs serve the same or similar content, indicated by a `<link rel='canonical'>` tag.

Full definition →

Redirect Chain→

A redirect chain is a sequence of HTTP redirects (301, 302, etc.) from an initial URL to a final destination, which scrapers must follow to reach the target content.

Full definition →

SDK (Software Development Kit)→

An SDK is a set of libraries, code samples, and documentation that simplifies integrating a platform's API into a specific programming language or framework.

Full definition →

API Gateway→

An API gateway is a reverse proxy that sits in front of backend services, handling authentication, rate limiting, routing, and protocol translation for incoming API requests.

Full definition →

Deduplication→

Deduplication is the process of identifying and removing or merging duplicate records in a scraped dataset, ensuring each real-world entity appears exactly once.

Full definition →

Data Enrichment→

Data enrichment augments scraped records with additional information from secondary sources — geocoding addresses, looking up company data, or appending social profiles.

Full definition →

Crawl Budget→

Crawl budget is the number of pages a search engine crawler (or a custom crawler) will fetch from a site within a given timeframe, influenced by server capacity and site size.

Full definition →

ETL (Extract, Transform, Load)→

ETL is a data integration pattern where raw data is Extracted from a source, Transformed into the desired format, and Loaded into a destination system.

Full definition →

Schema Validation→

Schema validation checks that extracted data conforms to an expected structure and type constraints before it is written to a database or downstream system.

Full definition →

Data Normalization→

Data normalization standardises scraped values into a consistent format — stripping currency symbols, trimming whitespace, converting dates, and unifying units.

Full definition →

Headless Chrome→

Headless Chrome runs the Chrome browser without a graphical user interface, enabling automated page rendering, JavaScript execution, and screenshot capture from a server.

Full definition →

Load Balancer→

A load balancer distributes incoming requests across multiple servers to prevent any single server from becoming a bottleneck and to provide high availability.

Full definition →

HTTP Status Code→

HTTP status codes are three-digit numbers in server responses that indicate the outcome of a request — success, redirection, client error, or server error.

Full definition →

Middleware→

Middleware is software that sits between a scraping framework's core request/response cycle and user-defined handlers, adding cross-cutting behaviour like logging, retry, or rate limiting.

Full definition →

IP Ban→

An IP ban blocks all requests from a specific IP address or range, typically issued by a site or CDN after detecting automated or abusive traffic.

Full definition →

Browser Context→

A browser context is an isolated browser session within a single browser process, with separate cookies, cache, and storage — enabling parallel scraping without session cross-contamination.

Full definition →

Proxy Authentication→

Proxy authentication is the process of presenting credentials (username and password) to a proxy server to authorise its use for forwarding scraping requests.

Full definition →

SOCKS Proxy→

A SOCKS proxy operates below the application layer (at the session layer), tunnelling any TCP connection, not just HTTP, which makes it more versatile than HTTP proxies for scraping diverse protocols.

Full definition →

Data Lake→

A data lake is a centralised storage repository that holds raw scraped data in its native format at any scale, deferring schema definition and transformation to query time.

Full definition →

WebSocket→

WebSocket is a full-duplex communication protocol over a single TCP connection, used by websites to push real-time data such as live prices, chat messages, or event feeds.

Full definition →

User-Agent Rotation→

User-agent rotation changes the User-Agent header sent with each request or session to mimic different browsers, reducing the likelihood that a scraper is fingerprinted by its UA string.

Full definition →

Crawl Depth→

Crawl depth is the maximum number of link hops a crawler will follow from its seed URLs, limiting the scope of the crawl to pages within N clicks of the starting point.
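
A depth limit is straightforward to enforce in a breadth-first crawl by storing each URL's hop count alongside it. The sketch below walks a small in-memory link map instead of the network; `LINKS` and the `crawl` function are hypothetical.

```python
from collections import deque

# Toy site structure: page -> links found on that page.
LINKS = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": ["/a/1/x"]}

def crawl(seed, get_links, max_depth):
    """Breadth-first crawl that never goes more than max_depth hops from seed."""
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # visit this page but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

pages = crawl("/", lambda url: LINKS.get(url, []), max_depth=2)
```

With `max_depth=2`, the page `/a/1/x` is never reached because it sits three hops from the seed.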

Full definition →

Headless Browser Detection→

Headless browser detection is the set of JavaScript checks anti-bot systems use to identify browsers running without a graphical display, a strong indicator of automation.

Full definition →

XML Sitemap→

An XML sitemap is a structured file listing a website's important URLs with optional metadata, helping search engine crawlers and custom scrapers discover all public pages.
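
Reading URLs out of a sitemap mostly comes down to handling the XML namespace. The sketch below parses an inline sample with the standard library; the URLs and dates are invented.

```python
import xml.etree.ElementTree as ET

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

# Sitemap elements live in this namespace, so lookups must be qualified.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)
entries = [
    (url.findtext("sm:loc", namespaces=NS), url.findtext("sm:lastmod", namespaces=NS))
    for url in root.findall("sm:url", NS)
]
```

Optional metadata such as `lastmod` comes back as `None` when an entry omits it.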

Full definition →

Content Negotiation→

Content negotiation is an HTTP mechanism where the client specifies preferred response formats (JSON, HTML, XML) via Accept headers and the server responds in the best matching format.

Full definition →

OCR (Optical Character Recognition)→

OCR converts images containing text into machine-readable text, enabling scrapers to extract data from images, scanned PDFs, and canvas-rendered content.

Full definition →

Browser Pool→

A browser pool manages a fixed set of pre-warmed browser instances that are reused across multiple scraping tasks, reducing startup overhead and controlling memory consumption.

Full definition →

Anti-Scraping Measures→

Anti-scraping measures are the collection of technical and legal techniques websites use to detect and prevent automated data extraction.

Full definition →

Async Scraping→

Async scraping uses asynchronous I/O to issue multiple HTTP requests concurrently without blocking, dramatically increasing throughput compared to sequential scraping.
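
The idea can be sketched with `asyncio`, using a semaphore to cap how many requests are in flight at once. The `fetch` coroutine below only sleeps to simulate network latency; a real scraper would call an async HTTP client there, and the concurrency limit of 5 is an arbitrary choice.

```python
import asyncio

async def fetch(url, limiter):
    """Simulated request; the semaphore bounds concurrent fetches."""
    async with limiter:
        await asyncio.sleep(0.01)  # stands in for network latency
        return f"body of {url}"

async def scrape(urls, max_concurrency=5):
    limiter = asyncio.Semaphore(max_concurrency)
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fetch(url, limiter) for url in urls))

urls = [f"https://example.com/item/{n}" for n in range(20)]
pages = asyncio.run(scrape(urls))
```

Twenty sequential fetches at 10 ms each would take about 200 ms; with five in flight at a time the same work finishes in roughly a fifth of that.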

Full definition →

Content-Type→

The Content-Type HTTP header describes the media type and encoding of a request or response body, telling the recipient how to parse the payload.

Full definition →

Anti-Fingerprinting→

Anti-fingerprinting techniques modify or spoof browser attributes to prevent websites from uniquely identifying a scraping client across sessions.

Full definition →

Scraper Framework→

A scraper framework is a structured software library that provides components for HTTP fetching, HTML parsing, link following, concurrency management, and data export in a reusable architecture.

Full definition →

Rate Limit Headers→

Rate limit headers are HTTP response headers that communicate how many requests remain in the current window and when the limit resets, allowing clients to self-throttle.
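
A self-throttling client might read these headers roughly as sketched below. `Retry-After` is a standard HTTP header, but `X-RateLimit-Remaining` is a common convention rather than a standard, so its name is an assumption here; the sketch also ignores the HTTP-date form of `Retry-After`.

```python
def seconds_to_wait(headers, default=1.0):
    """Decide how long to pause before the next request."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)  # server gave an explicit delay in seconds
    if headers.get("X-RateLimit-Remaining") == "0":
        return default  # window exhausted, back off by a default amount
    return 0.0  # no sign of throttling

pause = seconds_to_wait({"Retry-After": "12"})
```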

Full definition →

Shadow DOM→

Shadow DOM is a browser API that encapsulates a component's internal HTML and CSS in an isolated subtree, separate from the main document DOM.

Full definition →

Lazy Loading→

Lazy loading defers fetching images or content until the user scrolls them into view, reducing initial page load time but requiring scrapers to trigger scroll events to access deferred content.

Full definition →

Automatic Website Compatibility→

Automatic website compatibility refers to the techniques and tools used to make scraping requests appear indistinguishable from legitimate human browser traffic, navigating automated bot detection systems.

Full definition →

Event-Driven Scraping→

Event-driven scraping triggers scrape jobs in response to external events — webhooks, schedule triggers, or message queue messages — rather than running on a fixed polling interval.

Full definition →

Link Graph→

A link graph is a directed graph where nodes are web pages and edges are hyperlinks, used for crawl planning, PageRank analysis, and understanding site structure.

Full definition →

TLS Certificate→

A TLS certificate authenticates a server's identity and enables encrypted HTTPS communication between a browser or scraper and the target web server.

Full definition →

HTTPS→

HTTPS is HTTP secured with TLS encryption, ensuring that request and response data cannot be read or modified by intermediaries between the client and the server.

Full definition →

Playwright Network Interception→

Playwright's route API intercepts, inspects, and modifies outgoing browser network requests, enabling scrapers to block ads, capture API responses, or mock data during tests.

Full definition →

Data Warehouse→

A data warehouse is an analytical database optimised for querying and reporting on large volumes of structured historical data aggregated from multiple sources including web scraping.

Full definition →

Observability→

Observability in scraping systems refers to the ability to understand system behaviour from its external outputs — metrics, logs, and traces — without modifying the code.

Full definition →

Browser Extension for Scraping→

A browser extension scraper runs directly inside the user's browser, bypassing anti-bot checks by operating in the same environment as a human user, at the cost of requiring manual interaction.

Full definition →

Structured Output→

Structured output is data extracted from web pages and returned in a machine-readable format such as JSON or CSV, with well-defined fields rather than raw HTML text.

Full definition →

Grounding (AI)→

Grounding connects an AI model's output to verifiable, real-world data sources, reducing hallucinations by anchoring generated text in retrieved facts.

Full definition →

HTTP Cache→

HTTP caching stores copies of responses at the client or an intermediate proxy, allowing subsequent requests for the same resource to be served without a full server round-trip.

Full definition →

Concurrent Crawling→

Concurrent crawling runs multiple URL fetches in parallel, balancing throughput against politeness by limiting simultaneous connections to any single domain.

Full definition →

Request Throttling→

Request throttling deliberately limits the rate at which a scraper sends requests to a target site, preventing detection and avoiding overloading the server.

Full definition →

Selenium→

Selenium is a browser automation framework originally designed for testing that is widely used for web scraping when JavaScript rendering and browser interaction are required.

Full definition →

Proxy Provider→

A proxy provider is a commercial service that sells access to a managed pool of IP addresses, handling IP sourcing, rotation, and geo-targeting on behalf of scraping clients.

Full definition →

Gzip Compression→

Gzip is a lossless compression format based on the DEFLATE algorithm, used to shrink HTTP response bodies; text responses such as HTML and JSON often compress by 60–90%, which shortens transfer time.
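
The effect is easy to observe with Python's standard `gzip` module. The sample below is deliberately repetitive markup, so it compresses far better than a typical page would; treat the ratio as illustrative only.

```python
import gzip

# Highly repetitive sample markup; real pages compress less dramatically.
html = ("<li class='item'><span>Widget</span></li>\n" * 500).encode("utf-8")

compressed = gzip.compress(html)
restored = gzip.decompress(compressed)

# Fraction of bytes saved by compression.
saving = 1 - len(compressed) / len(html)
```

Decompression returns the original bytes exactly, which is what "lossless" means in practice.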

Full definition →

Crawl Scope Creep→

Crawl scope creep occurs when a crawler unintentionally follows links beyond the intended domain or path boundaries, wasting resources on irrelevant content.

Full definition →

Data Freshness→

Data freshness measures how recently scraped data was collected relative to the actual source, a critical metric for price monitoring, news aggregation, and inventory tracking.

Full definition →

API Discovery→

API discovery is the process of identifying undocumented JSON or GraphQL endpoints used by a website's frontend that can be called directly for cleaner data than HTML scraping.

Full definition →

Cache Busting→

Cache busting forces a fresh server request by adding a unique query parameter or modifying the URL so that cached responses are bypassed and current content is fetched.

Full definition →

Server-Side Rendering→

Server-side rendering (SSR) generates HTML on the server for each request, delivering fully populated markup to the client without requiring JavaScript execution — making content directly accessible to HTTP scrapers.

Full definition →

Noindex→

The `noindex` directive, set in a robots meta tag or an `X-Robots-Tag` response header, instructs search engine crawlers not to include a page in their index; it is often used for private, duplicate, or low-value pages.

Full definition →

Character Encoding→

Character encoding defines how text characters are represented as bytes; mismatched encoding between server and parser causes garbled (mojibake) text extraction.
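
The mismatch can be reproduced in a few lines: bytes written as UTF-8 but decoded as Latin-1 turn each accented character into two wrong ones.

```python
original = "café"

# The server sends UTF-8 bytes...
raw = original.encode("utf-8")

# ...but the parser wrongly assumes Latin-1, producing mojibake.
garbled = raw.decode("latin-1")

# Decoding with the encoding the server actually used restores the text.
correct = raw.decode("utf-8")
```

Here `garbled` is `"cafÃ©"`, because the two bytes that encode "é" in UTF-8 are read as two separate Latin-1 characters.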

Full definition →

Sitemap Index File→

A sitemap index file is a sitemap that lists other sitemap files rather than individual URLs, enabling large sites to split their sitemap into manageable chunks.

Full definition →

Output Format→

Output format refers to the structured file or data format in which scraped results are delivered — JSON, CSV, Parquet, NDJSON, or database rows — each suited to different downstream consumers.

Full definition →
