Crawl API
Discover and scrape entire websites with a single API call. Supports depth crawling, URL filtering, priority scoring, cost controls, and structured extraction on every page.
Base URL
https://api.alterlab.io/api/v1
Async by Design
Crawls run asynchronously: POST /v1/crawl returns immediately with a crawl_id. Poll GET /v1/crawl/{id} for progress, or configure a webhook_url to receive a notification when the crawl completes.
Quick Start
Crawl a website in two steps: start the crawl, then poll for results.
# Step 1: Start crawl
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"max_pages": 20,
"max_depth": 2
}'
# Response: {"crawl_id": "abc123", "status": "discovering", ...}
# Step 2: Poll for results
curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
  -H "X-API-Key: YOUR_API_KEY"
Start a Crawl
POST /v1/crawl
Discover pages on a website and scrape them. Returns immediately with a crawl_id for polling.
Request Body
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | Required | Start URL for the crawl. Must be http or https. |
| max_pages | integer | 50 | Maximum pages to scrape (1 - 100,000). |
| max_depth | integer | 3 | Maximum link-following depth from start URL (0 - 50). 0 = start page only. |
| include_patterns | string[] | null | Glob patterns — only scrape URLs whose path matches at least one. Example: ["/products/*", "/docs/*"] |
| exclude_patterns | string[] | null | Glob patterns — skip URLs whose path matches any. Example: ["/blog/tag/*", "/author/*"] |
| formats | string[] | null | Output formats for each page: text, json, json_v2, html, markdown |
| extraction_schema | object | null | JSON schema for structured extraction on each page. |
| extraction_profile | string | null | Pre-defined extraction profile: auto, product, article, job_posting, faq, recipe, event |
| webhook_url | string | null | URL to receive a crawl.completed webhook when all pages are done. |
| respect_robots | boolean | true | Respect robots.txt rules and crawl-delay for the target domain. |
| include_subdomains | boolean | false | Include links to subdomains during discovery. |
| advanced | object | null | Advanced scraping options applied to every page. See below. |
| cost_controls | object | null | Cost controls for the entire crawl. See below. |
| priority | object | null | Content-aware URL prioritization. See below. |
| budget | object | null | Per-path page budget. See below. |
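For orientation, the same request can be made from any HTTP client. Below is a minimal Python sketch using the requests library; the API key, target site, and parameter values are placeholders.
# Sketch: start a crawl with URL filters and output formats (placeholder values).
import requests

payload = {
    "url": "https://example.com",
    "max_pages": 100,
    "max_depth": 3,
    "include_patterns": ["/docs/*", "/products/*"],
    "exclude_patterns": ["/blog/tag/*"],
    "formats": ["markdown", "json"],
    "respect_robots": True,
}

resp = requests.post(
    "https://api.alterlab.io/api/v1/crawl",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
crawl_id = resp.json()["crawl_id"]  # returned in the 202 response
print("Started crawl:", crawl_id)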
Advanced Options
The advanced object controls per-page scraping behavior:
| Field | Type | Default | Description |
|---|---|---|---|
| render_js | boolean | false | Render JavaScript on every page (forces Tier 4). |
| screenshot | boolean | false | Capture a screenshot of every crawled page. |
| use_proxy | boolean | false | Route all requests through premium proxy. |
| wait_for | string | null | CSS selector to wait for on each page before extraction. |
| timeout | integer | 90 | Timeout per page in seconds (1 - 300). |
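A request body that uses the advanced block might look like the sketch below; the target URL and CSS selector are placeholders.
# Sketch: render JavaScript on every page and wait for a (placeholder) selector.
payload = {
    "url": "https://spa.example.com",
    "max_pages": 30,
    "advanced": {
        "render_js": True,          # forces Tier 4 on every page
        "wait_for": "#product-grid",  # placeholder selector
        "timeout": 120,
    },
}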
Cost Controls
The cost_controls object limits total crawl spend:
| Field | Type | Description |
|---|---|---|
| max_credits | integer | Maximum total credits to spend on this crawl. |
| max_tier | string | Maximum tier to use for page scrapes. Values: 1, 2, 3, 3.5, 4 |
| force_tier | string | Force a specific tier for all pages. Cannot exceed max_tier. |
// Example: crawl with cost controls
{
"url": "https://shop.example.com",
"max_pages": 200,
"cost_controls": {
"max_credits": 500,
"max_tier": "3"
}
}
Priority Scoring
The priority object enables content-aware URL ordering. Discovered URLs are scored 0.0-1.0 and scraped in priority order. When max_pages is reached, low-scoring URLs are dropped first.
| Field | Type | Description |
|---|---|---|
| keywords | string[] | Keywords matched against URL path segments. |
| boost_patterns | string[] | Glob patterns — matching URLs receive a score boost. |
| demote_patterns | string[] | Glob patterns — matching URLs receive a score reduction. |
// Example: prioritize product pages, deprioritize blog tags
{
"url": "https://shop.example.com",
"max_pages": 100,
"priority": {
"keywords": ["pricing", "product", "buy"],
"boost_patterns": ["/products/*", "/pricing*"],
"demote_patterns": ["/blog/tag/*", "/author/*"]
}
}
Budget Allocation
The budget object allocates page quotas by URL path pattern. Keys are glob patterns, values are max page counts. Use * as the catch-all default. Most-specific pattern wins (longest match).
// Example: allocate pages by section
{
"url": "https://shop.example.com",
"max_pages": 200,
"budget": {
"/products/*": 100,
"/blog/*": 50,
"/docs/*": 30,
"*": 5
}
}
Sitemap Modes
The sitemap field controls how sitemaps are used during URL discovery:
| Mode | Behavior |
|---|---|
| include | Default. Discover sitemap from robots.txt and parse it, also extract links from pages. |
| skip | Skip sitemaps entirely. Only discover via link extraction from the start URL. |
| only | Crawl exclusively from sitemap URLs. No link extraction, no depth crawling. |
Use sitemap_path to specify a custom sitemap location (e.g., /product-sitemap.xml).
// Crawl only from a custom sitemap
{
"url": "https://shop.example.com",
"sitemap": "only",
"sitemap_path": "/product-sitemap.xml",
"max_pages": 1000
}
Cache Control
Enable per-page response caching to avoid re-scraping unchanged pages. Cached page hits cost 0 credits.
| Field | Type | Default | Description |
|---|---|---|---|
| cache | boolean | false | Enable per-page response caching. |
| cache_ttl | integer | 900 | Cache TTL in seconds (60 - 86,400). Defaults to 15 minutes. |
| force_refresh | boolean | false | Bypass cache for every page and re-scrape. |
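For example, a recurring crawl that reuses cached pages for an hour could send something like the sketch below; it assumes cache and cache_ttl sit at the top level of the request body, alongside url and max_pages.
# Sketch: enable per-page caching with a 1-hour TTL (assumed top-level fields).
payload = {
    "url": "https://example.com",
    "max_pages": 50,
    "cache": True,
    "cache_ttl": 3600,
}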
Custom Headers
Inject custom HTTP headers into every page request during the crawl. Useful for authentication tokens, cookies, or custom User-Agent strings.
// Crawl an authenticated site
{
"url": "https://internal.example.com",
"max_pages": 50,
"headers": {
"Authorization": "Bearer eyJhbGciOiJIUzI1NiIs...",
"Cookie": "session=abc123"
}
}
Storage Backends
Control where crawl results are stored via the storage field:
| Backend | Behavior | Best For |
|---|---|---|
| redis | Results stored in Redis with 24h TTL, returned inline via polling. | Small to medium crawls (<1,000 pages) |
| s3 | Results uploaded to S3 as JSONL. Status endpoint returns a presigned download URL. | Large crawls (1,000+ pages), 30-day retention |
| r2 | Same as S3 but stored in Cloudflare R2. Presigned download URL on completion. | Large crawls with global edge access |
Webhooks
Configure real-time event notifications for your crawl. The webhook object provides full control over events, HMAC signing, custom headers, and metadata passthrough.
| Field | Type | Description |
|---|---|---|
| url | string | URL to receive webhook events (required). |
| events | string[] | Event types to subscribe to. Default: all events. Options: crawl.started, crawl.page.completed, crawl.page.failed, crawl.completed, crawl.failed |
| secret | string | Shared secret for HMAC-SHA256 signing. When set, each delivery includes X-Webhook-Signature and X-Webhook-Timestamp headers. |
| headers | object | Custom HTTP headers to include with every webhook delivery. |
| metadata | object | Custom JSON metadata passed through in every event payload. |
// Full webhook configuration with HMAC signing
{
"url": "https://example.com",
"max_pages": 100,
"webhook": {
"url": "https://hooks.example.com/crawl",
"events": ["crawl.page.completed", "crawl.completed"],
"secret": "whsec_my_signing_secret",
"headers": { "X-Project-Id": "proj_123" },
"metadata": { "source": "nightly-crawl" }
}
}
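When webhook.secret is set, you can verify each delivery before trusting it. The exact signed payload is not specified above, so the sketch below assumes a common scheme: a hex-encoded HMAC-SHA256 over the timestamp and raw request body joined by a dot. Confirm the actual scheme before relying on this.
# Sketch of webhook verification. ASSUMPTION: signature = hex HMAC-SHA256 over
# "<timestamp>.<raw body>" -- verify the real scheme before using in production.
import hashlib
import hmac

def verify_webhook(raw_body: bytes, timestamp: str, signature: str, secret: str) -> bool:
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature)

# Usage inside your webhook handler (framework-specific request object):
# ok = verify_webhook(request.body, request.headers["X-Webhook-Timestamp"],
#                     request.headers["X-Webhook-Signature"], "whsec_my_signing_secret")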
Legacy webhook_url
The top-level webhook_url field still works for simple use cases: it fires crawl.completed only, with no signing. Use the full webhook object for per-event filtering and HMAC verification. The two are mutually exclusive.
Change Tracking
Enable change detection to compare crawl results against previous runs. Detects new, changed, unchanged, and removed pages.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable change detection against the previous crawl. |
| tracking_id | string | auto (domain) | Identifier for this tracking history. Allows parallel tracks for the same domain. |
| mode | string | content | content: SHA-256 hash comparison. structured: JSON field-level diff on extraction results. |
| include_diff | boolean | false | Include actual diff output. Adds +1 credit/page (content mode) or +2 (structured mode). |
// Track changes across daily crawls
{
"url": "https://shop.example.com",
"max_pages": 200,
"change_tracking": {
"enabled": true,
"tracking_id": "daily-price-check",
"mode": "content",
"include_diff": true
}
}
Output Connectors
Stream crawl results directly to cloud storage or a webhook endpoint as pages complete. No polling needed.
| Type | Required Fields | Description |
|---|---|---|
| s3 | bucket, credentials (access_key_id, secret_access_key) | Export to AWS S3 as JSONL, JSON per page, or CSV. |
| gcs | bucket, credentials (service_account_json) | Export to Google Cloud Storage. |
| webhook_stream | url | POST each page result to a webhook endpoint as it completes. Optional HMAC signing. |
// Stream results to S3
{
"url": "https://example.com",
"max_pages": 500,
"output": {
"type": "s3",
"bucket": "my-crawl-data",
"prefix": "crawls/2026-04/",
"region": "us-east-1",
"credentials": {
"access_key_id": "AKIA...",
"secret_access_key": "..."
},
"format": "jsonl"
}
}
Response (202 Accepted)
{
"crawl_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "discovering",
"estimated_pages": 47,
"total_enqueued": 47,
"estimated_credits": 94,
"message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
}
Poll Crawl Status
GET /v1/crawl/{crawl_id}
Get the current status, progress, billing breakdown, and optionally per-page results of a crawl.
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| include_results | boolean | Optional | Include per-page scrape results in the response. Default: false. |
Status Values
| Status | Description |
|---|---|
| discovering | Fetching sitemap and/or extracting links from the start page. |
| scraping | Pages are being scraped. Check completed / total for progress. |
| completed | All pages scraped successfully. |
| partial | Crawl finished but some pages failed. Results are available for successful pages. |
| failed | All pages failed or a critical error occurred. |
| cancelled | Crawl was cancelled via DELETE endpoint. |
Billing Breakdown
The billing object in the status response provides a full cost accounting:
{
"crawl_id": "a1b2c3d4-...",
"status": "completed",
"total": 47,
"completed": 45,
"failed": 2,
"in_progress": 0,
"credits_debited": 94,
"current_depth": 2,
"max_depth": 3,
"total_discovered": 52,
"billing": {
"credits_debited": 94,
"credits_used": 78,
"credits_refunded": 16,
"estimated_cost_usd": "$0.094",
"actual_cost_usd": "$0.078",
"is_byop": false,
"tier_breakdown": { "1": 30, "3": 12, "4": 3 }
},
"budget_status": {
"budget": { "/products/*": 100, "*": 5 },
"counts": { "/products/*": 42, "*": 3 },
"exhausted": []
}
}
Cancel a Crawl
DELETE /v1/crawl/{crawl_id}
Cancel an active crawl. Pending jobs are cancelled and unused credits are refunded.
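Cancellation is a single DELETE request; here is a minimal Python sketch (crawl ID and API key are placeholders).
# Sketch: cancel an active crawl and inspect the refund.
import requests

resp = requests.delete(
    "https://api.alterlab.io/api/v1/crawl/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    headers={"X-API-Key": "YOUR_API_KEY"},
    timeout=30,
)
print(resp.json())  # includes cancelled_jobs, credits_refunded, credits_used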
// Cancel response
{
"crawl_id": "a1b2c3d4-...",
"status": "cancelled",
"cancelled_jobs": 12,
"credits_refunded": 24,
"credits_used": 70,
"message": "Crawl cancelled. Unprocessed credits have been refunded."
}
Partial Results Available
Pages scraped before cancellation remain available via GET /v1/crawl/{id}?include_results=true. You only pay for completed pages.
Credit Model
Crawl credits work on a pre-debit and refund model:
Pre-debit
When you start a crawl, credits are estimated and debited upfront based on the number of discovered pages and expected tier costs.
Actual Consumption
Each page consumes credits based on the actual tier used (1-15 credits per page). Failed pages do not consume credits.
Refund
On completion or cancellation, the difference between pre-debited and actual credits is automatically refunded.
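In the billing example above, for instance, 94 credits were pre-debited at start, 78 were actually consumed, and the 16-credit difference was refunded on completion.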
Error Codes
| Code | Description | Resolution |
|---|---|---|
| 402 | Insufficient credits for estimated crawl cost. | Add credits or reduce max_pages. |
| 429 | Concurrent crawl limit reached (max 3 per account). | Wait for an active crawl to finish or cancel one. |
| 404 | Crawl ID not found or expired (24h TTL). | Crawl results expire after 24 hours. Start a new crawl. |
| 503 | Queue is full. Too many jobs queued system-wide. | Retry after a short delay. |
Polling Best Practices
Use exponential backoff
Start at 3-5 second intervals. Increase to 10-15 seconds for large crawls (100+ pages) to reduce API calls.
Only include results when needed
Use include_results=false for progress checks. Only set include_results=true on the final poll to reduce payload size.
Prefer webhooks for large crawls
Set webhook_url to receive a crawl.completed notification instead of polling.
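Putting these practices together, a polling loop might look like the following sketch; the API key is a placeholder, the intervals are illustrative, and results are only requested on the final call.
# Sketch: poll a crawl with exponential backoff; fetch results only once it finishes.
import time
import requests

API = "https://api.alterlab.io/api/v1"
HEADERS = {"X-API-Key": "YOUR_API_KEY"}
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for_crawl(crawl_id: str, delay: float = 3.0, max_delay: float = 15.0) -> dict:
    while True:
        status = requests.get(f"{API}/crawl/{crawl_id}", headers=HEADERS, timeout=30).json()
        if status["status"] in TERMINAL:
            # final poll: include the per-page results
            return requests.get(
                f"{API}/crawl/{crawl_id}",
                params={"include_results": "true"},
                headers=HEADERS,
                timeout=60,
            ).json()
        print(f'{status.get("completed", 0)}/{status.get("total", "?")} pages done')
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped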