Scrapy is an open-source Python web crawling and scraping framework designed for large-scale data extraction. It provides a complete, production-ready architecture: asynchronous request scheduling via Twisted, URL deduplication to avoid revisiting pages, downloader and spider middleware for processing requests and responses, item pipelines for cleaning and storing extracted data, and built-in feed exports to JSON, JSON Lines, CSV, and XML. Writing to a database is not a built-in export format; it is typically handled by a custom item pipeline.
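As a rough illustration of how these components are configured, the following `settings.py` fragment is a minimal sketch. The setting names are Scrapy's own, but the output paths and the `myproject.pipelines.CleaningPipeline` path are placeholders assumed for the example.

```python
# settings.py (fragment): a sketch, not a complete project configuration.

# Feed exports: write scraped items to one or more files in built-in formats.
FEEDS = {
    "output/items.json": {"format": "json", "encoding": "utf8"},
    "output/items.csv": {"format": "csv"},
}

# URL deduplication: this is Scrapy's default request fingerprint filter,
# listed explicitly here only to show where it is configured.
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

# Upper bound on requests the downloader performs concurrently.
CONCURRENT_REQUESTS = 16

# Item pipelines run in ascending order of their integer priority.
# The dotted path below is hypothetical; point it at your own class.
ITEM_PIPELINES = {
    "myproject.pipelines.CleaningPipeline": 300,
}
```

Storing items in a database would go in a pipeline class like the placeholder above rather than in `FEEDS`.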
Scrapy's architecture is built around Spiders: Python classes that define how to crawl a site (start URLs, link-following rules) and how to extract data (CSS or XPath selectors). Spiders yield Item objects that flow through configurable item pipelines for data cleaning, validation, deduplication, and storage. The framework handles the async I/O, request scheduling, and retry logic automatically.
For JavaScript-rendered pages, Scrapy alone is insufficient: it sends plain HTTP requests and does not execute JavaScript. A common integration pattern is to keep Scrapy as the crawling and pipeline framework while routing requests through AlterLab's API, which can serve as a transparent HTTP rendering proxy for Scrapy spiders. The `scrapy-playwright` and `scrapy-splash` integrations provide alternative browser rendering backends.
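One way to route requests through a rendering proxy is a small downloader middleware that sets `request.meta["proxy"]`, which Scrapy's built-in `HttpProxyMiddleware` honors. The sketch below is generic and not AlterLab-specific: the proxy URL, the `RENDER_PROXY_URL` setting, the `skip_render` meta key, and the `myproject.middlewares` path are all hypothetical, so consult the provider's documentation for the real endpoint and authentication scheme.

```python
class RenderingProxyMiddleware:
    """Downloader middleware sketch: send requests via a rendering proxy."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # RENDER_PROXY_URL is a hypothetical custom setting, not a Scrapy built-in.
        return cls(crawler.settings.get("RENDER_PROXY_URL"))

    def process_request(self, request, spider):
        # Do nothing if no proxy is configured, if the request opted out,
        # or if the request already names its own proxy.
        if not self.proxy_url:
            return None
        if request.meta.get("skip_render") or "proxy" in request.meta:
            return None
        request.meta["proxy"] = self.proxy_url
        # Returning None tells Scrapy to keep processing this request.
        return None


# settings.py (fragment): enable the middleware. A priority below 750 makes it
# run before the built-in HttpProxyMiddleware, which reads meta["proxy"].
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RenderingProxyMiddleware": 740,
}
# Placeholder endpoint, assumed for illustration only.
RENDER_PROXY_URL = "http://render-proxy.example.invalid:8080"
```

Spider code stays unchanged with this approach, and individual requests can bypass rendering by setting `meta={"skip_render": True}`.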