
    Crawl API

    Discover and scrape entire websites with a single API call. Supports depth crawling, URL filtering, priority scoring, cost controls, and structured extraction on every page.

    Base URL

    https://api.alterlab.io/api/v1

    Async by Design

    Crawl requests return immediately with a crawl_id. Poll GET /v1/crawl/{id} for progress, or configure a webhook_url to receive a notification when the crawl completes.

    Quick Start

    Crawl a website in two steps: start the crawl, then poll for results.

    Bash
    # Step 1: Start crawl
    curl -X POST https://api.alterlab.io/api/v1/crawl \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com",
        "max_pages": 20,
        "max_depth": 2
      }'
    
    # Response: {"crawl_id": "abc123", "status": "discovering", ...}
    
    # Step 2: Poll for results
    curl https://api.alterlab.io/api/v1/crawl/abc123?include_results=true \
      -H "X-API-Key: YOUR_API_KEY"

    Start a Crawl

    POST
    /v1/crawl

    Discover pages on a website and scrape them. Returns immediately with a crawl_id for polling.

    Request Body

    Parameter | Type | Default | Description
    url | string | required | Start URL for the crawl. Must be http or https.
    max_pages | integer | 50 | Maximum pages to scrape (1-100,000).
    max_depth | integer | 3 | Maximum link-following depth from the start URL (0-50). 0 = start page only.
    include_patterns | string[] | null | Glob patterns; only scrape URLs whose path matches at least one. Example: ["/products/*", "/docs/*"]
    exclude_patterns | string[] | null | Glob patterns; skip URLs whose path matches any. Example: ["/blog/tag/*", "/author/*"]
    formats | string[] | null | Output formats for each page: text, json, json_v2, html, markdown
    extraction_schema | object | null | JSON schema for structured extraction on each page.
    extraction_profile | string | null | Pre-defined extraction profile: auto, product, article, job_posting, faq, recipe, event
    webhook_url | string | null | URL to receive a crawl.completed webhook when all pages are done.
    respect_robots | boolean | true | Respect robots.txt rules and crawl-delay for the target domain.
    include_subdomains | boolean | false | Include links to subdomains during discovery.
    advanced | object | null | Advanced scraping options applied to every page. See below.
    cost_controls | object | null | Cost controls for the entire crawl. See below.
    priority | object | null | Content-aware URL prioritization. See below.
    budget | object | null | Per-path page budget. See below.
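
    For example, a crawl limited to documentation pages that returns Markdown for each page (the paths and the article profile below are illustrative values, not required settings):

    JSON
    // Example: filtered crawl with Markdown output
    {
      "url": "https://example.com",
      "max_pages": 100,
      "max_depth": 3,
      "include_patterns": ["/docs/*"],
      "exclude_patterns": ["/docs/archive/*"],
      "formats": ["markdown"],
      "extraction_profile": "article"
    }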

    Advanced Options

    The advanced object controls per-page scraping behavior:

    Field | Type | Default | Description
    render_js | boolean | false | Render JavaScript on every page (forces Tier 4).
    screenshot | boolean | false | Capture a screenshot of every crawled page.
    use_proxy | boolean | false | Route all requests through a premium proxy.
    wait_for | string | null | CSS selector to wait for on each page before extraction.
    timeout | integer | 90 | Timeout per page in seconds (1-300).
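
    For example, to render JavaScript and wait for a specific element on every page (the selector is illustrative):

    JSON
    // Example: crawl a JavaScript-heavy site
    {
      "url": "https://app.example.com",
      "max_pages": 50,
      "advanced": {
        "render_js": true,
        "wait_for": "#main-content",
        "timeout": 120
      }
    }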

    Cost Controls

    The cost_controls object limits total crawl spend:

    Field | Type | Description
    max_credits | integer | Maximum total credits to spend on this crawl.
    max_tier | string | Maximum tier to use for page scrapes. Values: 1, 2, 3, 3.5, 4
    force_tier | string | Force a specific tier for all pages. Cannot exceed max_tier.
    JSON
    // Example: crawl with cost controls
    {
      "url": "https://shop.example.com",
      "max_pages": 200,
      "cost_controls": {
        "max_credits": 500,
        "max_tier": "3"
      }
    }

    Priority Scoring

    The priority object enables content-aware URL ordering. Discovered URLs are scored 0.0-1.0 and scraped in priority order. When max_pages is reached, low-scoring URLs are dropped first.

    Field | Type | Description
    keywords | string[] | Keywords matched against URL path segments.
    boost_patterns | string[] | Glob patterns; matching URLs receive a score boost.
    demote_patterns | string[] | Glob patterns; matching URLs receive a score reduction.
    JSON
    // Example: prioritize product pages, deprioritize blog tags
    {
      "url": "https://shop.example.com",
      "max_pages": 100,
      "priority": {
        "keywords": ["pricing", "product", "buy"],
        "boost_patterns": ["/products/*", "/pricing*"],
        "demote_patterns": ["/blog/tag/*", "/author/*"]
      }
    }

    Budget Allocation

    The budget object allocates page quotas by URL path pattern. Keys are glob patterns, values are max page counts. Use * as the catch-all default. Most-specific pattern wins (longest match).

    JSON
    // Example: allocate pages by section
    {
      "url": "https://shop.example.com",
      "max_pages": 200,
      "budget": {
        "/products/*": 100,
        "/blog/*": 50,
        "/docs/*": 30,
        "*": 5
      }
    }

    Sitemap Modes

    The sitemap field controls how sitemaps are used during URL discovery:

    Mode | Behavior
    include | Default. Discover the sitemap from robots.txt and parse it; also extract links from pages.
    skip | Skip sitemaps entirely. Only discover via link extraction from the start URL.
    only | Crawl exclusively from sitemap URLs. No link extraction, no depth crawling.

    Use sitemap_path to specify a custom sitemap location (e.g., /product-sitemap.xml).

    JSON
    // Crawl only from a custom sitemap
    {
      "url": "https://shop.example.com",
      "sitemap": "only",
      "sitemap_path": "/product-sitemap.xml",
      "max_pages": 1000
    }

    Cache Control

    Enable per-page response caching to avoid re-scraping unchanged pages. Cached page hits cost 0 credits.

    Field | Type | Default | Description
    cache | boolean | false | Enable per-page response caching.
    cache_ttl | integer | 900 | Cache TTL in seconds (60-86,400). Defaults to 15 minutes.
    force_refresh | boolean | false | Bypass the cache for every page and re-scrape.
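
    For example, a crawl that reuses cached pages for one hour. This sketch assumes the cache fields sit at the top level of the request body, like the other options above:

    JSON
    // Example: cache pages for one hour
    {
      "url": "https://example.com",
      "max_pages": 100,
      "cache": true,
      "cache_ttl": 3600
    }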

    Custom Headers

    Inject custom HTTP headers into every page request during the crawl. Useful for authentication tokens, cookies, or custom User-Agent strings.

    JSON
    // Crawl an authenticated site
    {
      "url": "https://internal.example.com",
      "max_pages": 50,
      "headers": {
        "Authorization": "Bearer eyJhbGciOiJIUzI1NiIs...",
        "Cookie": "session=abc123"
      }
    }

    Header Restrictions

    Blocked headers: Host, Content-Length, Transfer-Encoding, Connection, Upgrade, Proxy-Authorization. Maximum 50 custom headers. Values must not contain control characters.

    Storage Backends

    Control where crawl results are stored via the storage field:

    Backend | Behavior | Best For
    redis | Results stored in Redis with a 24h TTL, returned inline via polling. | Small to medium crawls (under 1,000 pages)
    s3 | Results uploaded to S3 as JSONL. The status endpoint returns a presigned download URL. | Large crawls (1,000+ pages), 30-day retention
    r2 | Same as S3 but stored in Cloudflare R2. Presigned download URL on completion. | Large crawls with global edge access
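
    For example, sending a large crawl's results to S3. This sketch assumes storage takes the backend name as a string:

    JSON
    // Example: store results in S3 and download them when the crawl completes
    {
      "url": "https://example.com",
      "max_pages": 5000,
      "storage": "s3"
    }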

    Webhooks

    Configure real-time event notifications for your crawl. The webhook object provides full control over events, HMAC signing, custom headers, and metadata passthrough.

    Field | Type | Description
    url | string | URL to receive webhook events (required).
    events | string[] | Event types to subscribe to. Default: all events. Options: crawl.started, crawl.page.completed, crawl.page.failed, crawl.completed, crawl.failed
    secret | string | Shared secret for HMAC-SHA256 signing. When set, each delivery includes X-Webhook-Signature and X-Webhook-Timestamp headers.
    headers | object | Custom HTTP headers to include with every webhook delivery.
    metadata | object | Custom JSON metadata passed through in every event payload.
    JSON
    // Full webhook configuration with HMAC signing
    {
      "url": "https://example.com",
      "max_pages": 100,
      "webhook": {
        "url": "https://hooks.example.com/crawl",
        "events": ["crawl.page.completed", "crawl.completed"],
        "secret": "whsec_my_signing_secret",
        "headers": { "X-Project-Id": "proj_123" },
        "metadata": { "source": "nightly-crawl" }
      }
    }
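
    To verify signed deliveries, recompute the HMAC on your endpoint and compare it to the X-Webhook-Signature header. A minimal Bash sketch, assuming the signature is a hex-encoded HMAC-SHA256 of the timestamp and raw body joined by a period (the exact signed payload is an assumption; confirm it against the Webhooks guide):

    Bash
    # Hypothetical verification sketch; the "<timestamp>.<body>" signed payload is an assumption.
    # TIMESTAMP and SIGNATURE come from the X-Webhook-Timestamp and X-Webhook-Signature
    # headers; RAW_BODY is the unmodified request body; the key matches the webhook secret.
    expected=$(printf '%s.%s' "$TIMESTAMP" "$RAW_BODY" \
      | openssl dgst -sha256 -hmac "whsec_my_signing_secret" | awk '{print $NF}')

    if [ "$expected" = "$SIGNATURE" ]; then
      echo "signature valid"
    else
      echo "signature mismatch" >&2
    fi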

    Legacy webhook_url

    The legacy webhook_url field still works for simple use cases: it fires crawl.completed only, with no signing. Use the full webhook object for per-event filtering and HMAC verification. The two options are mutually exclusive.

    Change Tracking

    Enable change detection to compare crawl results against previous runs. Detects new, changed, unchanged, and removed pages.

    Field | Type | Default | Description
    enabled | boolean | false | Enable change detection against the previous crawl.
    tracking_id | string | auto (domain) | Identifier for this tracking history. Allows parallel tracks for the same domain.
    mode | string | content | content: SHA-256 hash comparison. structured: JSON field-level diff on extraction results.
    include_diff | boolean | false | Include actual diff output. Adds +1 credit/page (content mode) or +2 (structured mode).
    JSON
    // Track changes across daily crawls
    {
      "url": "https://shop.example.com",
      "max_pages": 200,
      "change_tracking": {
        "enabled": true,
        "tracking_id": "daily-price-check",
        "mode": "content",
        "include_diff": true
      }
    }

    Output Connectors

    Stream crawl results directly to cloud storage or a webhook endpoint as pages complete. No polling needed.

    Type | Required Fields | Description
    s3 | bucket, credentials (access_key_id, secret_access_key) | Export to AWS S3 as JSONL, JSON per page, or CSV.
    gcs | bucket, credentials (service_account_json) | Export to Google Cloud Storage.
    webhook_stream | url | POST each page result to a webhook endpoint as it completes. Optional HMAC signing.
    JSON
    // Stream results to S3
    {
      "url": "https://example.com",
      "max_pages": 500,
      "output": {
        "type": "s3",
        "bucket": "my-crawl-data",
        "prefix": "crawls/2026-04/",
        "region": "us-east-1",
        "credentials": {
          "access_key_id": "AKIA...",
          "secret_access_key": "..."
        },
        "format": "jsonl"
      }
    }

    Response (202 Accepted)

    JSON
    {
      "crawl_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
      "status": "discovering",
      "estimated_pages": 47,
      "total_enqueued": 47,
      "estimated_credits": 94,
      "message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
    }

    Poll Crawl Status

    GET
    /v1/crawl/{crawl_id}

    Get the current status, progress, billing breakdown, and optionally per-page results of a crawl.

    Parameters

    Name | Type | Required | Description
    include_results | boolean | Optional | Include per-page scrape results in the response. Default: false.

    Status Values

    Status | Description
    discovering | Fetching the sitemap and/or extracting links from the start page.
    scraping | Pages are being scraped. Check completed / total for progress.
    completed | All pages scraped successfully.
    partial | Crawl finished but some pages failed. Results are available for the successful pages.
    failed | All pages failed or a critical error occurred.
    cancelled | Crawl was cancelled via the DELETE endpoint.

    Billing Breakdown

    The billing object in the status response provides a full cost accounting:

    JSON
    {
      "crawl_id": "a1b2c3d4-...",
      "status": "completed",
      "total": 47,
      "completed": 45,
      "failed": 2,
      "in_progress": 0,
      "credits_debited": 94,
      "current_depth": 2,
      "max_depth": 3,
      "total_discovered": 52,
      "billing": {
        "credits_debited": 94,
        "credits_used": 78,
        "credits_refunded": 16,
        "estimated_cost_usd": "$0.094",
        "actual_cost_usd": "$0.078",
        "is_byop": false,
        "tier_breakdown": { "1": 30, "3": 12, "4": 3 }
      },
      "budget_status": {
        "budget": { "/products/*": 100, "*": 5 },
        "counts": { "/products/*": 42, "*": 3 },
        "exhausted": []
      }
    }

    Cancel a Crawl

    DELETE
    /v1/crawl/{crawl_id}

    Cancel an active crawl. Pending jobs are cancelled and unused credits are refunded.
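
    For example, cancelling the crawl from the quick start:

    Bash
    curl -X DELETE https://api.alterlab.io/api/v1/crawl/abc123 \
      -H "X-API-Key: YOUR_API_KEY"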

    JSON
    // Cancel response
    {
      "crawl_id": "a1b2c3d4-...",
      "status": "cancelled",
      "cancelled_jobs": 12,
      "credits_refunded": 24,
      "credits_used": 70,
      "message": "Crawl cancelled. Unprocessed credits have been refunded."
    }

    Partial Results Available

    After cancellation, pages that were already scraped are still accessible via GET /v1/crawl/{id}?include_results=true. You only pay for completed pages.

    Credit Model

    Crawl credits work on a pre-debit and refund model:

    1. Pre-debit

    When you start a crawl, credits are estimated and debited upfront based on the number of discovered pages and expected tier costs.

    2. Actual consumption

    Each page consumes credits based on the actual tier used (1-15 credits per page). Failed pages do not consume credits.

    3. Refund

    On completion or cancellation, the difference between pre-debited and actual credits is automatically refunded.

    Error Codes

    Code | Description | Resolution
    402 | Insufficient credits for the estimated crawl cost. | Add credits or reduce max_pages.
    429 | Concurrent crawl limit reached (max 3 per account). | Wait for an active crawl to finish or cancel one.
    404 | Crawl ID not found or expired (24h TTL). | Crawl results expire after 24 hours; start a new crawl.
    503 | Queue is full; too many jobs are queued system-wide. | Retry after a short delay.

    Polling Best Practices

    Use exponential backoff

    Start at 3-5 second intervals. Increase to 10-15 seconds for large crawls (100+ pages) to reduce API calls. A worked loop appears at the end of this section.

    Only include results when needed

    Use include_results=false for progress checks. Only set include_results=true on the final poll to reduce payload size.

    Prefer webhooks for large crawls

    Set webhook_url to receive a crawl.completed notification instead of polling.
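
    A minimal Bash polling loop that applies the first two practices (the crawl_id is the one from the quick start; jq is assumed to be available for reading the status field):

    Bash
    # Poll with backoff, then fetch per-page results only once the crawl reaches a final state.
    CRAWL_ID="abc123"
    DELAY=5

    while true; do
      STATUS=$(curl -s "https://api.alterlab.io/api/v1/crawl/$CRAWL_ID" \
        -H "X-API-Key: YOUR_API_KEY" | jq -r '.status')

      case "$STATUS" in
        completed|partial|failed|cancelled)
          curl -s "https://api.alterlab.io/api/v1/crawl/$CRAWL_ID?include_results=true" \
            -H "X-API-Key: YOUR_API_KEY" > results.json
          break
          ;;
      esac

      sleep "$DELAY"
      [ "$DELAY" -lt 15 ] && DELAY=$((DELAY + 5))   # back off up to 15 seconds between polls
    done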

    Last updated: March 2026
