
    Site Crawling Tutorial

    Learn to crawl entire websites step by step. Start with a simple 5-page crawl and build up to production-grade crawling with extraction, webhooks, and cost controls.

    Time to Complete

    This tutorial takes about 15 minutes. You will need an AlterLab API key and about 50 credits.

    Prerequisites

    • An AlterLab API key (get one here)
    • At least 50 credits in your account
    • Python 3.8+ with the alterlab package, or Node.js 18+ with @alterlab/sdk, or cURL (SDK install commands below)
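
    If you plan to follow along with one of the SDKs, install it first. The package names are the ones listed in the prerequisites; the commands below assume standard pip and npm tooling.

    Bash
    # Python SDK
    pip install alterlab
    
    # Node.js SDK
    npm install @alterlab/sdk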

    Step 1: Basic Crawl

    Start with a small crawl to see how it works. We will crawl 5 pages from a documentation site.

    Bash
    # Start a small crawl
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com",
        "max_pages": 5,
        "max_depth": 1
      }'
    
    # Save the crawl_id from the response
    # {"crawl_id": "abc123", "status": "discovering", ...}
    
    # Poll for status
    curl https://api.alterlab.io/api/v1/crawl/abc123 \
      -H "X-API-Key: YOUR_API_KEY"
    
    # When status is "completed", get results
    curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
      -H "X-API-Key: YOUR_API_KEY"

    You should see 5 pages discovered and scraped. The response includes each page's URL, status, and scraped content.

    Step 2: Add URL Filtering

    Real websites have thousands of pages. Use URL filtering to crawl only the sections you need. Let us target product pages on an e-commerce site.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://shop.example.com",
        "max_pages": 30,
        "max_depth": 2,
        "include_patterns": ["/products/*", "/categories/*"],
        "exclude_patterns": ["/blog/*", "/account/*", "/cart"]
      }'

    Tip: Start Narrow

    It is cheaper to run a small filtered crawl first to verify you are getting the right pages, then widen the scope for production.

    Step 3: Add Structured Extraction

    Instead of raw HTML, extract structured product data from every page using an extraction profile or custom schema.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://shop.example.com",
        "max_pages": 30,
        "include_patterns": ["/products/*"],
        "extraction_schema": {
          "name": "string",
          "price": "number",
          "currency": "string",
          "description": "string",
          "in_stock": "boolean",
          "image_url": "string"
        }
      }'
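    
    # Each page's result should then include an "extracted" object keyed by the
    # schema fields (values below are illustrative):
    # "extracted": {
    #   "name": "Example Widget",
    #   "price": 19.99,
    #   "currency": "USD",
    #   "description": "A sample product description.",
    #   "in_stock": true,
    #   "image_url": "https://shop.example.com/images/widget.jpg"
    # }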

    Step 4: Set Up Webhooks

    Instead of polling, receive a notification when the crawl completes. This is the recommended approach for production use.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://shop.example.com",
        "max_pages": 100,
        "include_patterns": ["/products/*"],
        "extraction_profile": "product",
        "webhook_url": "https://your-server.com/webhooks/crawl-done"
      }'
    
    # Your webhook endpoint will receive:
    # POST /webhooks/crawl-done
    # {
    #   "event": "crawl.completed",
    #   "crawl_id": "abc123",
    #   "status": "completed",
    #   "total": 87,
    #   "completed": 85,
    #   "failed": 2,
    #   "credits_used": 142
    # }
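
    To handle that notification you need a small HTTP endpoint on your server. Below is a minimal sketch using Flask (the framework and port are assumptions, not requirements); signature verification and retry handling are covered in the Webhooks Guide.

    Python
    """Minimal webhook receiver sketch -- assumes Flask; adapt to your stack."""
    from flask import Flask, request
    
    app = Flask(__name__)
    
    @app.route("/webhooks/crawl-done", methods=["POST"])
    def crawl_done():
        event = request.get_json(force=True) or {}
        if event.get("event") == "crawl.completed":
            # The crawl is finished -- fetch full results here, for example with
            # client.get_crawl(event["crawl_id"], include_results=True)
            print(f"Crawl {event.get('crawl_id')} done: "
                  f"{event.get('completed')}/{event.get('total')} pages, "
                  f"{event.get('credits_used')} credits used")
        # Acknowledge the delivery
        return "", 200
    
    if __name__ == "__main__":
        app.run(port=8000)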

    Step 5: Scale Up with Cost Controls

    For production crawls, add cost controls to prevent unexpected spend. Combine max_pages, max_credits, and max_tier for hard limits, and use the per-pattern budget map to cap how many pages each section of the site may consume.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://shop.example.com",
        "max_pages": 500,
        "max_depth": 3,
        "include_patterns": ["/products/*"],
        "exclude_patterns": ["/products/*/reviews", "/products/*/questions"],
        "extraction_profile": "product",
        "webhook_url": "https://your-server.com/webhooks/crawl-done",
        "cost_controls": {
          "max_credits": 1000,
          "max_tier": "3"
        },
        "budget": {
          "/products/*": 400,
          "/categories/*": 50,
          "*": 10
        }
      }'

    Complete Working Example

    Here is a complete production-ready script that crawls a site, monitors progress, and saves extracted data to a JSON file.

    Python
    """Complete crawl example — crawl products and save to JSON."""
    import json
    import time
    from alterlab import AlterLab
    
    API_KEY = "YOUR_API_KEY"
    client = AlterLab(api_key=API_KEY)
    
    # Configure the crawl
    crawl = client.crawl(
        url="https://shop.example.com",
        max_pages=200,
        max_depth=3,
        include_patterns=["/products/*"],
        exclude_patterns=["/blog/*", "/account/*"],
        extraction_schema={
            "name": "string",
            "price": "number",
            "currency": "string",
            "in_stock": "boolean",
        },
        cost_controls={"max_credits": 500},
    )
    
    print(f"Crawl {crawl.crawl_id} started")
    print(f"Estimated: {crawl.estimated_pages} pages, {crawl.estimated_credits} credits")
    
    # Poll with exponential backoff
    interval = 3
    while True:
        status = client.get_crawl(crawl.crawl_id)
        pct = (status.completed / max(status.total, 1)) * 100
        print(f"[{status.status}] {status.completed}/{status.total} ({pct:.0f}%)")
    
        if status.status in ("completed", "partial", "failed"):
            break
    
        time.sleep(interval)
        interval = min(interval * 1.5, 15)  # max 15s
    
    # Fetch results
    result = client.get_crawl(crawl.crawl_id, include_results=True)
    
    # Extract and save data
    products = []
    for page in result.pages or []:
        if page.status == "succeeded" and page.result:
            extracted = page.result.get("extracted", {})
            if extracted.get("name"):
                products.append({
                    "url": page.url,
                    **extracted,
                })
    
    with open("products.json", "w") as f:
        json.dump(products, f, indent=2)
    
    print(f"Saved {len(products)} products to products.json")
    if result.billing:
        print(f"Credits used: {result.billing.credits_used}")
        print(f"Credits refunded: {result.billing.credits_refunded}")

    Next Steps

    Crawl API Reference

    Full API reference with all parameters, schemas, and error codes.

    Crawling Guide

    Deep dive into how crawling works, priority scoring, and advanced patterns.

    Structured Extraction

    Learn about extraction profiles and custom schemas for pulling structured data.

    Webhooks Guide

    Set up webhook endpoints, handle signatures, and configure retries.

    Last updated: March 2026
