tool

Scrapy

An open-source Python framework for large-scale web crawling and scraping with async request scheduling, deduplication, and output pipelines.

Scrapy is an open-source Python web crawling and scraping framework designed for large-scale data extraction. It provides a complete, production-ready architecture: asynchronous request scheduling built on Twisted, request deduplication to avoid revisiting pages, downloader and spider middlewares for processing requests and responses, item pipelines for cleaning and storing extracted data, and built-in feed exports to JSON, JSON Lines, CSV, and XML. Writing to a database is typically handled in a custom item pipeline.
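Most of these components are configured through a project's settings module. The fragment below is a minimal sketch of how exports, pipelines, and scheduling might be configured; the setting names are standard Scrapy settings, while the file paths, the `myproject` package, and the pipeline class names are hypothetical placeholders.

```python
# settings.py fragment (paths, project name, and class names are hypothetical)

# Feed exports: write scraped items to files in built-in formats.
FEEDS = {
    "output/products.json": {"format": "json", "encoding": "utf8"},
    "output/products.csv": {"format": "csv"},
}

# Item pipelines run in ascending order of their integer priority.
ITEM_PIPELINES = {
    "myproject.pipelines.CleaningPipeline": 300,
    "myproject.pipelines.DatabasePipeline": 800,
}

# Scheduling and retry behavior handled by the framework.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
RETRY_ENABLED = True
RETRY_TIMES = 2
```

Lower pipeline numbers run first, so in this sketch items would be cleaned before the hypothetical database pipeline stores them.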

Scrapy's architecture is built around Spiders: Python classes that define how to crawl a site (start URLs and rules for following links) and how to extract data (CSS or XPath selectors). Spiders yield Item objects that flow through configurable item pipelines for data cleaning, validation, deduplication, and storage. The framework handles the async I/O, request scheduling, and retry logic automatically.
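An item pipeline is a plain Python class with a `process_item` method that receives each item a spider yields. The sketch below illustrates the cleaning stage only; the class name and the `name` and `price` fields are assumptions made for illustration, and the final lines call the pipeline directly rather than through a running crawl.

```python
import re


class PriceCleaningPipeline:
    """Hypothetical item pipeline: trims the name and parses the price."""

    def process_item(self, item, spider):
        # Strip surrounding whitespace from the product name.
        item["name"] = item["name"].strip()
        # Keep only digits and the decimal point before converting to float.
        item["price"] = float(re.sub(r"[^\d.]", "", item["price"]))
        # Returning the item passes it on to the next pipeline stage.
        return item


# Direct call for illustration; in a crawl, Scrapy invokes process_item itself.
cleaned = PriceCleaningPipeline().process_item(
    {"name": "  Widget ", "price": "$1,299.00"}, spider=None
)
print(cleaned)
```

In a real project the class would be enabled through the `ITEM_PIPELINES` setting, and a pipeline that needs to discard an item would raise `scrapy.exceptions.DropItem` instead of returning it.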

For JavaScript-rendered pages, Scrapy alone is insufficient because it sends plain HTTP requests and does not execute JavaScript. A common integration pattern is to keep Scrapy as the crawling and pipeline framework while routing requests through AlterLab's API, which can serve as a transparent HTTP rendering proxy for Scrapy spiders. The `scrapy-playwright` and `scrapy-splash` integrations provide alternative browser rendering backends.
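For comparison, the `scrapy-playwright` backend is enabled through settings plus a per-request flag. This is a rough sketch based on that plugin's documented setup, assuming the package and its browsers are installed; `RENDERED_META` is a hypothetical helper name, not part of the plugin.

```python
# settings.py fragment: hand HTTP and HTTPS downloads to scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Per-request opt-in (hypothetical helper name): only requests carrying this
# meta key are rendered in a browser, for example
#   yield scrapy.Request(url, meta=RENDERED_META)
RENDERED_META = {"playwright": True}
```

Requests without the `playwright` meta key still go through Scrapy's regular downloader, so rendering can be limited to the pages that need it.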

Examples

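import json

import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        # Route the target page through AlterLab for JavaScript rendering
        yield scrapy.Request(
            'https://api.alterlab.io/v1/scrape',
            method='POST',
            # Build the JSON payload with json.dumps rather than by hand
            body=json.dumps({'url': 'https://example.com', 'render_js': True}),
            headers={
                'Content-Type': 'application/json',
                'X-API-Key': 'sk_live_...',
            },
            callback=self.parse,
        )

    def parse(self, response):
        # The response schema is not described here, so only log its size
        self.logger.info('Received %d bytes from AlterLab', len(response.body))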
