
    Web Crawling

    Crawl entire websites from a single URL. AlterLab discovers pages via sitemaps and link extraction, respects robots.txt, and scrapes each page with automatic tier escalation.

    Async by Design

    Crawl requests return immediately with a crawl_id. Use it to poll status or configure a webhook for delivery.

    How It Works

1. Discover

    AlterLab fetches sitemaps and extracts links from the start page. URLs are deduplicated and filtered against your include/exclude patterns and robots.txt rules.

2. Scrape

    Each discovered URL becomes an individual scrape job. Jobs run in parallel with the same tier escalation and anti-bot bypass logic as single scrapes.

3. Deepen

    When max_depth is greater than 0, completed pages are scanned for new internal links. New URLs are deduplicated and enqueued until the depth limit or page cap is reached.

4. Collect

    Poll GET /api/v1/crawl/{crawl_id} for progress and per-page results, or receive a crawl.completed webhook when all pages finish.

    Start a Crawl

POST /api/v1/crawl

    Submit a start URL to discover and scrape an entire website. Returns 202 Accepted with a crawl ID for polling.

    Bash
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: your_api_key" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com",
        "max_pages": 100,
        "max_depth": 3,
        "include_patterns": ["/blog/*", "/docs/*"],
        "exclude_patterns": ["/admin/*", "*.pdf"],
        "formats": ["text", "markdown"],
        "webhook_url": "https://your-server.com/webhook"
      }'

    Response (202 Accepted):

    JSON
    {
      "crawl_id": "c9a1b2d3-4e5f-6789-abcd-ef0123456789",
      "status": "discovering",
      "estimated_pages": 47,
      "total_enqueued": 47,
      "estimated_credits": 47000,
  "message": "Crawl started. Poll GET /api/v1/crawl/{crawl_id} for progress."
    }

    Request Body

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | -- | Start URL for the crawl (required, http/https) |
| max_pages | integer | 50 | Maximum number of pages to scrape (1--100,000) |
| max_depth | integer | 3 | Maximum link-following depth from the start URL (0 = start page only, max 50) |
| include_patterns | string[] | null | Glob patterns -- only scrape URLs whose path matches at least one |
| exclude_patterns | string[] | null | Glob patterns -- skip URLs whose path matches any |
| formats | string[] | null | Output formats per page: text, json, json_v2, html, markdown |
| extraction_schema | object | null | JSON schema for structured extraction on each page |
| extraction_profile | string | null | Pre-defined profile: auto, product, article, job_posting, faq, recipe, event |
| webhook_url | string | null | URL to receive a crawl.completed webhook when all pages finish |
| respect_robots | boolean | true | Respect robots.txt rules for the target domain |
| include_subdomains | boolean | false | Include links to subdomains during discovery |
| advanced | object | null | Advanced scraping options applied to every page (see below) |
| cost_controls | object | null | Cost controls for the entire crawl (see below) |

    Advanced Options

    Pass an advanced object to configure scraping behaviour applied to every page in the crawl:

| Field | Default | Description |
|---|---|---|
| render_js | false | Render JavaScript on every page (forces Tier 4) |
| screenshot | false | Capture a screenshot of every crawled page |
| use_proxy | false | Route all crawl requests through premium proxy |
| wait_for | null | CSS selector to wait for on each page before extracting |
| timeout | 90 | Timeout per page in seconds (1--300) |

    Cost Controls

    Pass a cost_controls object to cap spending:

| Field | Description |
|---|---|
| max_credits | Maximum total cost to spend on this crawl |
| max_tier | Maximum tier to use for page scrapes (1, 2, 3, 3.5, or 4) |
| force_tier | Force a specific tier for all pages (1, 2, 3, 3.5, or 4) |

    URL Discovery

    When you start a crawl, AlterLab runs a fast discovery phase before scraping begins:

    • Sitemap parsing -- fetches robots.txt to find sitemap URLs, then recursively parses sitemaps (including sitemap indexes) to collect all listed URLs.
    • Link extraction -- fetches the start page HTML and extracts internal links, giving you pages that may not appear in sitemaps.
    • Deduplication -- all discovered URLs are normalised and deduplicated before scraping begins.
    • robots.txt compliance -- when respect_robots is true (the default), URLs disallowed by robots.txt are automatically excluded.
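The discovery rules above can be approximated locally with the standard library. A minimal sketch, assuming simple normalisation rules (lowercased host, dropped fragment, stripped trailing slash) -- the platform's actual normaliser runs server-side and its exact rules are not documented here:

```python
from urllib import robotparser
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Normalise a URL for deduplication: lowercase the host, drop the
    fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def filter_urls(urls, robots_txt: str):
    """Deduplicate discovered URLs and drop those disallowed by robots.txt."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    seen, kept = set(), []
    for url in urls:
        norm = normalize(url)
        if norm in seen or not rp.can_fetch("*", norm):
            continue
        seen.add(norm)
        kept.append(norm)
    return kept

robots = "User-agent: *\nDisallow: /admin/"
discovered = [
    "https://example.com/blog/post-1",
    "https://Example.com/blog/post-1/",   # duplicate after normalisation
    "https://example.com/admin/login",    # disallowed by robots.txt
]
print(filter_urls(discovered, robots))
# ['https://example.com/blog/post-1']
```

Only the first URL survives: the second collapses to the same normalised form, and the third is blocked by the Disallow rule.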

    Domain Scoping

    By default, only URLs on the same domain as the start URL are included. Set include_subdomains to true to also follow links to subdomains (e.g., blog.example.com from example.com).

    Depth Control

    The max_depth parameter controls how many levels of links the crawler follows from the start URL:

| max_depth | Behaviour |
|---|---|
| 0 | Scrape only the start page (plus any pages found in sitemaps) |
| 1 | Scrape the start page and all pages linked from it |
| 2 | Follow links two levels deep from the start page |
| 3 (default) | Three levels deep -- covers most site structures |

    Depth + Pages

    Higher depth values can discover many more pages. Always set max_pages to cap total pages and control costs. The crawler stops discovering new pages once the limit is reached.
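The interplay between max_depth and max_pages can be pictured as a breadth-first traversal with two stopping conditions. The toy link graph and function below are illustrative, not part of the SDK:

```python
from collections import deque

def crawl_order(links: dict, start: str, max_depth: int, max_pages: int):
    """Breadth-first crawl honouring both a depth limit and a page cap."""
    queue = deque([(start, 0)])
    seen, order = {start}, []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

links = {
    "/": ["/blog", "/docs"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/docs": ["/docs/api"],
}
print(crawl_order(links, "/", max_depth=1, max_pages=10))
# ['/', '/blog', '/docs'] -- start page plus pages linked from it
print(crawl_order(links, "/", max_depth=2, max_pages=4))
# ['/', '/blog', '/docs', '/blog/post-1'] -- page cap cuts discovery short
```

The second call shows why the page cap matters: depth 2 would reach all six pages, but max_pages=4 stops the crawl after four.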

    URL Filtering

    Use glob patterns to precisely control which pages are included or excluded:

    JSON
    {
      "url": "https://example.com",
      "include_patterns": ["/blog/*", "/products/*"],
      "exclude_patterns": ["/admin/*", "/internal/*", "*.pdf"]
    }
    • include_patterns -- if provided, only URLs whose path matches at least one pattern are scraped.
    • exclude_patterns -- URLs whose path matches any pattern are always skipped, even if they match an include pattern.
    • Patterns use standard glob syntax: * matches any characters, ? matches a single character.
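Under these rules, the filtering logic can be sketched with Python's fnmatch. This is an approximation of the server-side matcher, whose exact glob semantics may differ at the edges:

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

def should_scrape(url: str, include=None, exclude=None) -> bool:
    """Exclude patterns always win; include patterns, when given,
    must match the URL path at least once."""
    path = urlsplit(url).path
    if exclude and any(fnmatch(path, pat) for pat in exclude):
        return False
    if include:
        return any(fnmatch(path, pat) for pat in include)
    return True

include = ["/blog/*", "/products/*"]
exclude = ["/admin/*", "*.pdf"]
print(should_scrape("https://example.com/blog/post-1", include, exclude))    # True
print(should_scrape("https://example.com/admin/users", include, exclude))    # False
print(should_scrape("https://example.com/blog/report.pdf", include, exclude))  # False: exclude wins
```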

    Poll Crawl Status

GET /api/v1/crawl/{crawl_id}

    Returns crawl progress, billing breakdown, and optionally per-page results. Pass ?include_results=true to include individual page data.

    Bash
curl "https://api.alterlab.io/api/v1/crawl/c9a1b2d3-...?include_results=true" \
  -H "X-API-Key: your_api_key"

    Status Response

    JSON
    {
      "crawl_id": "c9a1b2d3-...",
      "status": "scraping",
      "total": 47,
      "completed": 32,
      "failed": 2,
      "in_progress": 13,
      "credits_debited": 47000,
      "created_at": "2026-03-24T10:00:00Z",
      "current_depth": 2,
      "max_depth": 3,
      "total_discovered": 47,
      "billing": {
        "credits_debited": 47000,
        "credits_used": 34000,
        "credits_refunded": 0,
        "estimated_cost_usd": "$0.47",
        "actual_cost_usd": "$0.34",
        "is_byop": false,
        "tier_breakdown": { "1": 28, "3": 4 }
      },
      "pages": [
        {
          "job_id": "a1b2c3d4-...",
          "url": "https://example.com/blog/post-1",
          "status": "succeeded",
          "result": { "text": "...", "metadata": { ... } }
        }
      ]
    }

    Crawl status values:

| Status | Meaning |
|---|---|
| discovering | Fetching sitemaps and extracting links from the start page |
| scraping | Scrape jobs are running (may still discover deeper pages) |
| completed | All pages scraped successfully |
| partial | All jobs done, some pages failed |
| failed | All pages failed |
| cancelled | Crawl was cancelled via DELETE endpoint |
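When polling, treat discovering and scraping as in-flight and every other status as terminal. A tiny helper (the name is illustrative, not part of the SDK):

```python
IN_FLIGHT = {"discovering", "scraping"}

def is_terminal(status: str) -> bool:
    """True once the crawl has reached a final state and polling can stop."""
    return status not in IN_FLIGHT

print(is_terminal("scraping"))  # False -- keep polling
print(is_terminal("partial"))   # True -- done, but inspect failed pages
```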

    Cancel a Crawl

DELETE /api/v1/crawl/{crawl_id}

    Cancel a running crawl. Jobs already completed are not affected. Pending and queued jobs are cancelled, and their credits are automatically refunded.

    Bash
    curl -X DELETE https://api.alterlab.io/api/v1/crawl/c9a1b2d3-... \
      -H "X-API-Key: your_api_key"

    Response:

    JSON
    {
      "crawl_id": "c9a1b2d3-...",
      "status": "cancelled",
      "cancelled_jobs": 15,
      "credits_refunded": 15000,
      "credits_used": 32000,
      "message": "Crawl cancelled. Unprocessed credits have been refunded."
    }

    Already Completed

    You cannot cancel a crawl that has already completed or was previously cancelled. The API returns 409 Conflict in those cases.

    Billing & Credits

    • Credits are pre-debited when the crawl starts, based on the estimated cost per discovered URL.
    • As depth crawling discovers new pages, additional credits are debited for each new wave of URLs.
    • If a page fails, credits for that page are automatically refunded.
    • On completion, any overpayment (estimated minus actual) is automatically refunded.
    • Cancelling a crawl refunds all credits for unprocessed pages.
    • Use cost_controls.max_credits to set an absolute spending cap.
    • BYOP (Bring Your Own Proxy) discounts apply per-page if you have an active proxy integration.

    The billing summary is included in the status response under billing, which shows debited, used, and refunded credits alongside a per-tier breakdown.
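The refund model reduces to simple arithmetic: credits still held equal credits debited minus credits used minus credits already refunded, and that remainder is returned on completion or cancellation. An illustrative check against the sample responses above (the helper name is ours, not an SDK function):

```python
def outstanding_credits(billing: dict) -> int:
    """Credits pre-debited but not yet consumed or refunded."""
    return (billing["credits_debited"]
            - billing["credits_used"]
            - billing["credits_refunded"])

# Mid-crawl, from the sample status response: 13,000 credits still held
print(outstanding_credits(
    {"credits_debited": 47000, "credits_used": 34000, "credits_refunded": 0}))
# 13000

# After the sample cancellation: 47,000 debited, 32,000 used, 15,000 refunded
print(outstanding_credits(
    {"credits_debited": 47000, "credits_used": 32000, "credits_refunded": 15000}))
# 0
```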

    Limits

    Crawl Limits

    • Maximum 3 concurrent crawls per user
    • Maximum 100,000 pages per crawl (max_pages)
    • Maximum depth of 50 (max_depth)
    • Discovery phase timeout: 30 seconds
    • Crawl metadata expires after 24 hours -- poll or use webhooks before then
    • Each page counts as a separate scrape credit-wise

    Python Example

    Python
    import alterlab
    import time
    
    client = alterlab.AlterLab(api_key="your_api_key")
    
    # Start a crawl
    crawl = client.crawl(
        url="https://example.com",
        max_pages=100,
        max_depth=3,
        include_patterns=["/blog/*", "/docs/*"],
        exclude_patterns=["/admin/*"],
        formats=["text", "markdown"],
        webhook_url="https://your-server.com/webhook",
    )
    
    print(f"Crawl ID: {crawl['crawl_id']}")
    print(f"Pages discovered: {crawl['estimated_pages']}")
    print(f"Estimated credits: {crawl['estimated_credits']}")
    
    # Poll until complete
    while True:
        status = client.get_crawl_status(crawl["crawl_id"], include_results=True)
        print(f"Status: {status['status']} -- depth {status['current_depth']}/{status['max_depth']}")
        print(f"  {status['completed']}/{status['total']} pages done, {status['failed']} failed")
        if status["status"] not in ("discovering", "scraping"):
            break
        time.sleep(3)
    
    # Process results
    for page in status.get("pages", []):
        if page["status"] == "succeeded":
            text = page["result"].get("text", "")
            print(f"  {page['url']}: {len(text)} chars")
        else:
            print(f"  {page['url']}: FAILED -- {page.get('error', 'unknown')}")
    
    # Check billing
    billing = status.get("billing", {})
    print(f"Cost: {billing.get('actual_cost_usd', 'N/A')} (estimated {billing.get('estimated_cost_usd', 'N/A')})")
    print(f"Refunded: {billing.get('credits_refunded', 0)} credits")

    Node.js Example

    TypeScript
    import AlterLab from "@alterlab/sdk";
    
    const client = new AlterLab({ apiKey: "your_api_key" });
    
    // Start a crawl
    const crawl = await client.crawl({
      url: "https://example.com",
      maxPages: 100,
      maxDepth: 3,
      includePatterns: ["/blog/*", "/docs/*"],
      excludePatterns: ["/admin/*"],
      formats: ["text", "markdown"],
      webhookUrl: "https://your-server.com/webhook",
    });
    
    console.log(`Crawl ID: ${crawl.crawlId}`);
    console.log(`Pages discovered: ${crawl.estimatedPages}`);
    
    // Poll until complete
    let status;
    do {
      await new Promise((r) => setTimeout(r, 3000));
      status = await client.getCrawlStatus(crawl.crawlId, { includeResults: true });
      console.log(`${status.completed}/${status.total} done (depth ${status.currentDepth}/${status.maxDepth})`);
    } while (["discovering", "scraping"].includes(status.status));
    
    // Process results
    for (const page of status.pages ?? []) {
      if (page.status === "succeeded") {
        console.log(`  ${page.url}: ${page.result?.text?.length ?? 0} chars`);
      } else {
        console.log(`  ${page.url}: FAILED -- ${page.error}`);
      }
    }
    
    // Cancel a crawl (if needed)
    // const cancelled = await client.cancelCrawl(crawl.crawlId);
    // console.log(`Refunded: ${cancelled.creditsRefunded} credits`);

    Related Guides

    • Batch Scraping -- scrape a fixed list of URLs instead of discovering them
    • Webhooks -- learn more about webhook delivery and payload formats
    • Pricing -- credit costs per tier and BYOP discounts
    Last updated: March 2026
