Web Scraping Glossary
Definitions of the key terms in web scraping, proxy infrastructure, and anti-bot protection — the reference for developers building data pipelines.
Automated extraction of data from websites using software tools. Web scraping parses HTML, follows links, and collects structured information such as prices, product listings, or articles. Modern scraping must handle JavaScript rendering, anti-bot systems, and rate limiting. AlterLab's API abstracts these challenges into a single POST request.
A program that automatically navigates websites by following hyperlinks, indexing pages as it goes. Crawlers discover URLs systematically rather than scraping individual pages. Search engines use crawlers to index the web; developers use them to map site structure, discover content, or build datasets across hundreds of pages.
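The link-following behaviour described above can be sketched as a breadth-first traversal. This is a minimal illustration using only the standard library; the `crawl` function, the `LinkExtractor` class, and the in-memory `SITE` mapping are hypothetical stand-ins so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` maps a URL to HTML text, or None if unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # skip URLs already queued
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}
pages = crawl("https://example.com/", SITE.get)
```

A production crawler would usually also restrict the queue to the target host and honour robots.txt; both are left out here to keep the sketch short.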
An IP address assigned by an ISP to a real residential device, used in web scraping to route requests through genuine consumer IPs. Sites that challenge or block datacenter IP ranges are far less likely to flag residential IPs. AlterLab's Tier 3 and above use rotating residential proxies across 195+ countries to maximise request success rates.
An IP address hosted in a commercial data centre rather than assigned by a residential ISP. Datacenter proxies are faster and cheaper than residential proxies but are more easily fingerprinted by anti-bot systems because their IP ranges are publicly known. Suitable for sites without aggressive bot protection.
A proxy configuration that automatically assigns a different IP address for each request or at a set interval. Rotating proxies prevent sites from correlating multiple requests to a single origin, reducing the chance of rate limiting or IP bans. AlterLab rotates IPs automatically on every request — no manual pool management required.
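Per-request rotation can be sketched with a simple round-robin generator. The proxy addresses below are placeholders from a documentation-only IP range, and `rotating_proxies` is a hypothetical helper; it only illustrates the idea and does not reflect how any particular provider assigns IPs.

```python
from itertools import cycle

# Placeholder endpoints (TEST-NET-3 range); these are not real proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def rotating_proxies(proxies):
    """Yield a proxies mapping in the shape the requests library accepts, one proxy per request."""
    for proxy in cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = rotating_proxies(PROXIES)
assigned = [next(rotation)["http"] for _ in range(4)]
```

Each call to `next(rotation)` returns the mapping for the next outgoing request, wrapping back to the first proxy once the list is exhausted.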
The practice of cycling through multiple IP addresses across requests so no single IP accumulates enough traffic to trigger rate limits or bans. Effective IP rotation uses diverse IP ranges (residential, mobile, datacenter) and mimics realistic request patterns. AlterLab handles IP rotation transparently at the infrastructure layer.
A web browser that runs without a graphical user interface. Headless browsers execute JavaScript, handle dynamic content, and interact with DOM elements exactly as a real browser would — but from the command line or API. Chromium and Firefox both support headless mode. AlterLab uses headless Chromium (Playwright) in Tier 4 for full JavaScript rendering.
The process of executing a page's JavaScript before extracting content. Many modern sites build their DOM client-side via React, Vue, or Angular — fetching raw HTML returns an empty shell. JavaScript rendering launches a real browser engine, waits for scripts to complete, and returns the final populated HTML. AlterLab enables this with a single render_js: true flag.
A technique anti-bot systems use to identify and classify visitors by collecting dozens of browser attributes — user agent, screen resolution, installed fonts, Canvas rendering, WebGL parameters, and timing signatures. Headless browsers have distinctive fingerprint patterns. AlterLab's rendering layer normalises these signals to match genuine user browser profiles.
A signature derived from the TLS handshake parameters a client sends when opening an HTTPS connection — cipher suite order, extension types, elliptic curve preferences, and more. Different HTTP libraries (Python requests, curl, Chromium) produce different TLS fingerprints. Anti-bot systems like Akamai Bot Manager classify clients at the network layer before any HTML is served. AlterLab's Tier 2 uses Chromium-compatible TLS to pass this classification layer.
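One widely used way to turn handshake parameters into a signature is the JA3 scheme, which joins five fields from the ClientHello and hashes them with MD5. The text above does not name a specific scheme, so treat JA3 here as an assumed example; the numeric values are illustrative and not captured from a real client.

```python
import hashlib


def ja3_style_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string (fields joined by ',', values by '-') and MD5 it."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Illustrative values only: same cipher suites, offered in a different order.
client_a = ja3_style_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
client_b = ja3_style_hash(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])
```

Because ordering is part of the input, two clients that support the same cipher suites but list them differently produce different signatures, which is why HTTP libraries are distinguishable from browsers at this layer.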
A network-level identifier derived from HTTP/2 connection parameters, including SETTINGS frame values, WINDOW_UPDATE sizes, and pseudo-header ordering. HTTP/2 fingerprinting is used alongside TLS fingerprinting by advanced bot detection systems. Clients that fall back to HTTP/1.1 or send non-browser HTTP/2 settings can be flagged before any application logic runs.
Systems deployed by websites to detect and block automated traffic. Modern anti-bot protection layers include TLS fingerprinting, browser fingerprinting, CAPTCHA challenges, behavioural analysis, IP reputation scoring, and device attestation. Leading providers include Cloudflare Bot Management, Akamai Bot Manager, and DataDome. AlterLab's 5-tier escalation automatically selects the right rendering strategy for each protection layer.
Completely Automated Public Turing test to tell Computers and Humans Apart. CAPTCHAs challenge visitors to complete tasks that are easy for humans but hard for bots, such as clicking images, solving puzzles, or typing distorted text. Common implementations include Google reCAPTCHA v2, hCaptcha, and Cloudflare Turnstile; reCAPTCHA v3 scores visitors in the background without showing a visible task. AlterLab's Tier 5 integrates automated challenge resolution for gated content.
Cloudflare's enterprise-grade bot detection suite that combines machine learning, browser fingerprinting, TLS fingerprinting, and JavaScript challenges to classify traffic. Sites protected by Cloudflare may issue a browser integrity check or Turnstile challenge before serving content. AlterLab handles standard Cloudflare protections in Tier 2 and Tier 3.
A real-time bot detection platform that analyses every HTTP request using device fingerprinting, behavioural signals, and IP reputation. DataDome is commonly deployed by e-commerce and media sites. It serves as an inline proxy that can block, CAPTCHA-gate, or silently allow traffic. Handling DataDome-protected sites requires consistent browser-grade fingerprinting across all request layers.
Akamai's enterprise bot detection solution deployed at the CDN edge. It uses sensor data (JavaScript-collected device signals), TLS fingerprinting, and IP reputation to classify bots in real time. Akamai protection is common on large retailer and financial sites. Successfully accessing Akamai-protected sites requires consistent matching across both network-layer and browser-layer signals.
A server-side control that caps the number of requests accepted from a single IP or session within a time window. Exceeding the limit returns HTTP 429 Too Many Requests or silently degrades response quality. Effective scraping respects rate limits by distributing requests across IPs, adding delays, and using exponential backoff on 429 responses.
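Exponential backoff on 429 responses can be sketched as follows. `fetch_with_backoff` is a hypothetical helper: it takes any `fetch(url)` callable that returns a `(status, body)` pair, and the `sleep` parameter is injectable so the behaviour can be checked without actually waiting.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 429, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        delay = base_delay * (2 ** attempt)
        # Up to 10% jitter so parallel workers do not retry in lockstep.
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")


# Simulated server: two 429 responses, then success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
delays = []
result = fetch_with_backoff(
    lambda url: next(responses), "https://example.com/", sleep=delays.append
)
```

If the server sends a `Retry-After` header, preferring that value over the computed delay is usually the more polite choice; the sketch omits it for brevity.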
The handling of cookies, authentication tokens, and browser state across multiple requests to maintain a continuous browsing session. Many sites require an active session cookie to access protected pages. Scraping tools must persist and replay cookies between requests to avoid repeated login challenges or fingerprint resets.
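Persisting and replaying cookies can be illustrated with the standard library's cookie parser. `CookieSession` is a hypothetical, deliberately simplified class: it ignores domain, path, and expiry rules, which a real client (for example `http.cookiejar` or a `requests.Session`) would enforce.

```python
from http.cookies import SimpleCookie


class CookieSession:
    """Stores cookie values from Set-Cookie headers and replays them on later requests."""

    def __init__(self):
        self.cookies = {}

    def store(self, set_cookie_headers):
        """Record name/value pairs from a list of Set-Cookie header values."""
        for header in set_cookie_headers:
            parsed = SimpleCookie()
            parsed.load(header)
            for name, morsel in parsed.items():
                self.cookies[name] = morsel.value

    def cookie_header(self):
        """Build the Cookie header value to send with the next request."""
        return "; ".join(f"{name}={value}" for name, value in self.cookies.items())


session = CookieSession()
session.store(["sessionid=abc123; Path=/; HttpOnly", "csrftoken=xyz; Path=/"])
header = session.cookie_header()
```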
Page content generated client-side by JavaScript after the initial HTML is served. SPAs (single-page applications) built with React, Vue, or Angular render their primary content dynamically, so a raw HTML fetch often returns little more than an empty container div. Extracting dynamic content requires a JavaScript rendering engine that waits for the DOM to stabilise after script execution.
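A rough way to tell whether a raw fetch returned only a shell is to count the visible text in the response. The heuristic below is an assumption for illustration, not an established rule: `looks_like_empty_shell` and its 50-character threshold are hypothetical and would need tuning per site.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters outside script, style, noscript, template, and title tags."""

    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chars += len(data.strip())


def looks_like_empty_shell(html, min_chars=50):
    """Return True when the page carries almost no visible text before JavaScript runs."""
    counter = VisibleTextCounter()
    counter.feed(html)
    return counter.chars < min_chars


SPA_SHELL = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script src="/bundle.js"></script></body></html>'
)
STATIC_PAGE = (
    "<html><body><h1>Blue Widget</h1>"
    "<p>A sturdy aluminium widget, in stock and ready to ship today.</p></body></html>"
)
needs_rendering = looks_like_empty_shell(SPA_SHELL)
```

When the check returns True, the request is a candidate for retrying through a JavaScript rendering path.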
A set of browser automation patches that normalise the signals emitted during headless browser execution to match genuine user browser profiles. Stealth mode overrides properties like navigator.webdriver, patches the Chrome runtime object, aligns Plugin arrays, randomises Canvas fingerprints, and injects realistic timing. AlterLab's Tier 3 applies these patches to Playwright for challenge-protected sites.
A string sent in the HTTP User-Agent header that identifies the browser, version, operating system, and rendering engine to the web server. Anti-bot systems cross-reference User-Agent strings against other request properties. The default user agent sent by Python's requests library (python-requests/<version>) is easy to flag; setting a realistic Chrome-on-Windows user agent is the baseline for any scraping operation.
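The difference between a library default and an overridden user agent can be seen with the standard library's `urllib`, which announces itself as `Python-urllib/<version>` unless told otherwise. The Chrome string below is illustrative; its version number is an assumption and will date quickly. `build_request` is a hypothetical helper, and nothing is sent over the network.

```python
import urllib.request

# Illustrative Chrome-on-Windows string; the version number is an assumption.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=CHROME_UA):
    """Create a request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Accept-Language": "en-US,en;q=0.9"},
    )


# The header urllib would send if no User-Agent were set explicitly.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
request = build_request("https://example.com/")
```

Setting the header alone is only a baseline: as noted above, detection systems compare it against other request properties, so the remaining headers and network-level signals need to be consistent with the browser being claimed.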
A managed collection of proxy IP addresses used to distribute outbound requests across multiple origins. Proxy pools are sized and rotated to stay below per-IP rate limits, replace banned IPs automatically, and maintain geographic diversity. AlterLab manages its own proxy pool infrastructure, eliminating the need for developers to source or maintain their own IPs.
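The pool behaviour described above, choosing from active IPs and replacing banned ones, can be sketched as a small class. `ProxyPool` and its method names are hypothetical, and the addresses are placeholders from a documentation-only range.

```python
import random


class ProxyPool:
    """Keeps a list of active proxies and swaps in a reserve proxy when one is banned."""

    def __init__(self, proxies, reserve=()):
        self.active = list(proxies)
        self.reserve = list(reserve)
        self.banned = set()

    def pick(self):
        """Return a random active proxy so traffic spreads across the pool."""
        if not self.active:
            raise RuntimeError("proxy pool exhausted")
        return random.choice(self.active)

    def report_ban(self, proxy):
        """Retire a banned proxy and promote a reserve one, if any remain."""
        if proxy in self.active:
            self.active.remove(proxy)
            self.banned.add(proxy)
            if self.reserve:
                self.active.append(self.reserve.pop(0))


pool = ProxyPool(
    ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    reserve=["http://203.0.113.12:8080"],
)
pool.report_ban("http://203.0.113.10:8080")
```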
An HTTP callback that delivers data to a specified URL when an event completes. In web scraping, webhooks allow async job completion notifications — you submit a scrape job, and AlterLab POSTs the result to your endpoint when the page is ready. Useful for long-running JavaScript rendering or challenge resolution that exceeds typical request timeouts.
A web service that exposes data and actions through standard HTTP methods (GET, POST, PUT, DELETE) and resource-oriented URLs. REST APIs return structured data (typically JSON) and are stateless — each request contains all information needed to complete it. AlterLab's scraping service is a REST API: POST a URL, receive structured content.
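A stateless POST to a scraping-style REST API might be built as shown below. This is a generic sketch, not AlterLab's documented interface: the endpoint URL, the bearer-token header, and the `url` payload field are assumptions, and only the `render_js` flag is taken from this glossary. The request is constructed but not sent.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/scrape"  # hypothetical endpoint


def build_scrape_request(target_url, api_key, render_js=False):
    """Build a self-contained POST request: credentials and parameters travel with it."""
    payload = {"url": target_url, "render_js": render_js}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_scrape_request("https://example.com/product/1", "YOUR_API_KEY", render_js=True)
# Sending it would look like: urllib.request.urlopen(request, timeout=30)
```

Because every request carries its own credentials and parameters, any server instance can handle it without shared session state, which is what statelessness means in practice.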
An open-source browser automation library from Microsoft that controls Chromium, Firefox, and WebKit through a unified async API. Playwright was started by engineers who had previously worked on Puppeteer and extends the same approach to multiple browser engines. It handles navigation, form filling, clicking, screenshot capture, and network interception. AlterLab uses Playwright in Tier 4 and Tier 5 for full browser rendering.
A Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer is commonly used for scraping JavaScript-heavy sites, generating PDFs, and taking screenshots. It predates Playwright and remains widely used. AlterLab abstracts both Puppeteer and Playwright behind its unified scraping API.
A plain-text file at the root of a website (e.g. https://example.com/robots.txt) that specifies which URL paths crawlers may or may not access. robots.txt is a convention that servers do not enforce, but respecting it is a best practice for ethical scraping and reduces legal risk. Review robots.txt before scraping any site at scale.
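Checking a URL against robots.txt rules can be done with Python's built-in `urllib.robotparser`. The rules and the `my-crawler` user agent below are made up for illustration, and the file is parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("my-crawler", "https://example.com/products/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/admin/users")
delay = parser.crawl_delay("my-crawler")
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches the file instead of parsing a string.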
Search Engine Results Page — the page displayed by a search engine in response to a query, containing organic results, ads, featured snippets, and rich cards. SERP scraping is used for keyword tracking, competitor monitoring, and ad intelligence. AlterLab can retrieve and parse SERPs via the standard scraping API.
An XML file that lists the URLs a site owner wants crawled, along with optional metadata such as last modification date and change frequency. Sitemaps help crawlers discover and prioritise pages without following every internal link. For scraping at scale, parsing a site's sitemap is more efficient than recursive link following and avoids duplicate-URL traps.
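Reading URLs out of a sitemap takes only a few lines with the standard library, provided the sitemap namespace is handled. The XML below is a made-up two-entry example, and `parse_sitemap` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol; element lookups must include it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/products</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its location and optional lastmod."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append(
            {
                "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
                "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            }
        )
    return entries


entries = parse_sitemap(SITEMAP_XML)
```

Large sites often publish a sitemap index that points to several child sitemaps; handling that case means repeating the same parse on each child file.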
A pattern used to match HTML elements by tag, class, ID, attribute, or hierarchical relationship. In web scraping, CSS selectors extract specific nodes from a parsed HTML document. Examples: `div.price` selects all divs with class price; `#product-title` selects the element with ID product-title. CSS selectors are typically used with parsing libraries such as BeautifulSoup or Cheerio.
XML Path Language — a query syntax for selecting nodes in an XML or HTML document using path expressions. XPath is more expressive than CSS selectors for complex traversals (e.g. selecting a node based on its sibling's text). Common in lxml, Scrapy, and browser DevTools. Example: `//div[@class='price']/span/text()` extracts text from a span inside a matching div.
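Python's standard library supports a limited XPath subset through `xml.etree.ElementTree`, which is enough to show the idea on a well-formed fragment. It does not support `text()`, so the sketch selects the span elements and reads their `.text` attribute; the full expression quoted above would need a library such as lxml. The markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment; ElementTree is an XML parser and will reject sloppy HTML.
FRAGMENT = """
<html><body>
  <div class="price"><span>19.99</span></div>
  <div class="title"><span>Blue Widget</span></div>
  <div class="price"><span>4.50</span></div>
</body></html>
"""

root = ET.fromstring(FRAGMENT)
# Select spans whose parent is a div with class="price", anywhere in the tree.
prices = [span.text for span in root.findall(".//div[@class='price']/span")]
```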
A Python library for parsing HTML and XML and navigating the parse tree using CSS selectors or tag methods. BeautifulSoup is the standard choice for simple scraping tasks that don't require JavaScript rendering. It works with the built-in html.parser or faster third-party parsers like lxml. For JS-heavy pages, pair it with a headless browser or AlterLab's render_js flag.
An open-source Python framework for large-scale web crawling and scraping. Scrapy handles request scheduling, deduplication, middleware pipelines, and output to databases or files. It runs asynchronously using Twisted and scales to millions of pages. For sites requiring browser rendering or challenge resolution, Scrapy integrates with AlterLab's API as a drop-in HTTP backend.