Crawl API
Discover and scrape entire websites with a single API call. Supports depth crawling, URL filtering, priority scoring, cost controls, and structured extraction on every page.
Base URL
https://api.alterlab.io/api/v1
Async by Design
Crawls run asynchronously: POST /v1/crawl returns immediately with a crawl_id. Poll GET /v1/crawl/{id} for progress, or configure a webhook_url to receive a notification when the crawl completes.
Quick Start
Crawl a website in two steps: start the crawl, then poll for results.
# Step 1: Start crawl
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"max_pages": 20,
"max_depth": 2
}'
# Response: {"crawl_id": "abc123", "status": "discovering", ...}
# Step 2: Poll for results
curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
  -H "X-API-Key: YOUR_API_KEY"
Start a Crawl
POST /v1/crawl
Discover pages on a website and scrape them. Returns immediately with a crawl_id for polling.
Request Body
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | Required | Start URL for the crawl. Must be http or https. |
| max_pages | integer | 50 | Maximum pages to scrape (1 - 100,000). |
| max_depth | integer | 3 | Maximum link-following depth from start URL (0 - 50). 0 = start page only. |
| include_patterns | string[] | null | Glob patterns — only scrape URLs whose path matches at least one. Example: ["/products/*", "/docs/*"] |
| exclude_patterns | string[] | null | Glob patterns — skip URLs whose path matches any. Example: ["/blog/tag/*", "/author/*"] |
| formats | string[] | null | Output formats for each page: text, json, json_v2, html, markdown |
| extraction_schema | object | null | JSON schema for structured extraction on each page. |
| extraction_profile | string | null | Pre-defined extraction profile: auto, product, article, job_posting, faq, recipe, event |
| webhook_url | string | null | URL to receive a crawl.completed webhook when all pages are done. |
| respect_robots | boolean | true | Respect robots.txt rules and crawl-delay for the target domain. |
| include_subdomains | boolean | false | Include links to subdomains during discovery. |
| advanced | object | null | Advanced scraping options applied to every page. See below. |
| cost_controls | object | null | Cost controls for the entire crawl. See below. |
| priority | object | null | Content-aware URL prioritization. See below. |
| budget | object | null | Per-path page budget. See below. |
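For orientation, the same request can be made from any HTTP client. Below is a minimal Python sketch using the requests library; the API key, target site, and parameter values are placeholders.
# Sketch: start a crawl with URL filters and output formats (placeholder values).
import requests

payload = {
    "url": "https://example.com",
    "max_pages": 100,
    "max_depth": 3,
    "include_patterns": ["/docs/*", "/products/*"],
    "exclude_patterns": ["/blog/tag/*"],
    "formats": ["markdown", "json"],
    "respect_robots": True,
}

resp = requests.post(
    "https://api.alterlab.io/api/v1/crawl",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
crawl_id = resp.json()["crawl_id"]  # returned in the 202 response
print("Started crawl:", crawl_id)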
Advanced Options
The advanced object controls per-page scraping behavior:
| Field | Type | Default | Description |
|---|---|---|---|
| render_js | boolean | false | Render JavaScript on every page (forces Tier 4). |
| screenshot | boolean | false | Capture a screenshot of every crawled page. |
| use_proxy | boolean | false | Route all requests through premium proxy. |
| wait_for | string | null | CSS selector to wait for on each page before extraction. |
| timeout | integer | 90 | Timeout per page in seconds (1 - 300). |
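A request body that uses the advanced block might look like the sketch below; the target URL and CSS selector are placeholders.
# Sketch: render JavaScript on every page and wait for a (placeholder) selector.
payload = {
    "url": "https://spa.example.com",
    "max_pages": 30,
    "advanced": {
        "render_js": True,          # forces Tier 4 on every page
        "wait_for": "#product-grid",  # placeholder selector
        "timeout": 120,
    },
}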
Cost Controls
The cost_controls object limits total crawl spend:
| Field | Type | Description |
|---|---|---|
| max_credits | integer | Maximum total credits to spend on this crawl. |
| max_tier | string | Maximum tier to use for page scrapes. Values: 1, 2, 3, 3.5, 4 |
| force_tier | string | Force a specific tier for all pages. Cannot exceed max_tier. |
// Example: crawl with cost controls
{
"url": "https://shop.example.com",
"max_pages": 200,
"cost_controls": {
"max_credits": 500,
"max_tier": "3"
}
}
Priority Scoring
The priority object enables content-aware URL ordering. Discovered URLs are scored 0.0-1.0 and scraped in priority order. When max_pages is reached, low-scoring URLs are dropped first.
| Field | Type | Description |
|---|---|---|
| keywords | string[] | Keywords matched against URL path segments. |
| boost_patterns | string[] | Glob patterns — matching URLs receive a score boost. |
| demote_patterns | string[] | Glob patterns — matching URLs receive a score reduction. |
// Example: prioritize product pages, deprioritize blog tags
{
"url": "https://shop.example.com",
"max_pages": 100,
"priority": {
"keywords": ["pricing", "product", "buy"],
"boost_patterns": ["/products/*", "/pricing*"],
"demote_patterns": ["/blog/tag/*", "/author/*"]
}
}
Budget Allocation
The budget object allocates page quotas by URL path pattern. Keys are glob patterns, values are max page counts. Use * as the catch-all default. Most-specific pattern wins (longest match).
// Example: allocate pages by section
{
"url": "https://shop.example.com",
"max_pages": 200,
"budget": {
"/products/*": 100,
"/blog/*": 50,
"/docs/*": 30,
"*": 5
}
}
Sitemap Modes
The sitemap field controls how sitemaps are used during URL discovery:
| Mode | Behavior |
|---|---|
| include | Default. Discover sitemap from robots.txt and parse it, also extract links from pages. |
| skip | Skip sitemaps entirely. Only discover via link extraction from the start URL. |
| only | Crawl exclusively from sitemap URLs. No link extraction, no depth crawling. |
Use sitemap_path to specify a custom sitemap location (e.g., /product-sitemap.xml).
// Crawl only from a custom sitemap
{
"url": "https://shop.example.com",
"sitemap": "only",
"sitemap_path": "/product-sitemap.xml",
"max_pages": 1000
}
Cache Control
Enable per-page response caching to avoid re-scraping unchanged pages. Cached page hits cost 0 credits.
| Field | Type | Default | Description |
|---|---|---|---|
| cache | boolean | false | Enable per-page response caching. |
| cache_ttl | integer | 900 | Cache TTL in seconds (60 - 86,400). Defaults to 15 minutes. |
| force_refresh | boolean | false | Bypass cache for every page and re-scrape. |
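For example, a recurring crawl that reuses cached pages for an hour could send something like the sketch below; it assumes cache and cache_ttl sit at the top level of the request body, alongside url and max_pages.
# Sketch: enable per-page caching with a 1-hour TTL (assumed top-level fields).
payload = {
    "url": "https://example.com",
    "max_pages": 50,
    "cache": True,
    "cache_ttl": 3600,
}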
Custom Headers
Inject custom HTTP headers into every page request during the crawl. Useful for authentication tokens, cookies, or custom User-Agent strings.
// Crawl an authenticated site
{
"url": "https://internal.example.com",
"max_pages": 50,
"headers": {
"Authorization": "Bearer eyJhbGciOiJIUzI1NiIs...",
"Cookie": "session=abc123"
}
}
Storage Backends
Control where crawl results are stored via the storage field:
| Backend | Behavior | Best For |
|---|---|---|
| redis | Results stored in Redis with 24h TTL, returned inline via polling. | Small to medium crawls (<1,000 pages) |
| s3 | Results uploaded to S3 as JSONL. Status endpoint returns a presigned download URL. | Large crawls (1,000+ pages), 30-day retention |
| r2 | Same as S3 but stored in Cloudflare R2. Presigned download URL on completion. | Large crawls with global edge access |
Webhooks
Configure real-time event notifications for your crawl. The webhook object provides full control over events, HMAC signing, custom headers, and metadata passthrough.
| Field | Type | Description |
|---|---|---|
| url | string | URL to receive webhook events (required). |
| events | string[] | Event types to subscribe to. Default: all events. Options: crawl.started, crawl.page.completed, crawl.page.failed, crawl.completed, crawl.failed |
| secret | string | Shared secret for HMAC-SHA256 signing. When set, each delivery includes X-Webhook-Signature and X-Webhook-Timestamp headers. |
| headers | object | Custom HTTP headers to include with every webhook delivery. |
| metadata | object | Custom JSON metadata passed through in every event payload. |
// Full webhook configuration with HMAC signing
{
"url": "https://example.com",
"max_pages": 100,
"webhook": {
"url": "https://hooks.example.com/crawl",
"events": ["crawl.page.completed", "crawl.completed"],
"secret": "whsec_my_signing_secret",
"headers": { "X-Project-Id": "proj_123" },
"metadata": { "source": "nightly-crawl" }
}
}
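When webhook.secret is set, you can verify each delivery before trusting it. The exact signed payload is not specified above, so the sketch below assumes a common scheme: a hex-encoded HMAC-SHA256 over the timestamp and raw request body joined by a dot. Confirm the actual scheme before relying on this.
# Sketch of webhook verification. ASSUMPTION: signature = hex HMAC-SHA256 over
# "<timestamp>.<raw body>" -- verify the real scheme before using in production.
import hashlib
import hmac

def verify_webhook(raw_body: bytes, timestamp: str, signature: str, secret: str) -> bool:
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature)

# Usage inside your webhook handler (framework-specific request object):
# ok = verify_webhook(request.body, request.headers["X-Webhook-Timestamp"],
#                     request.headers["X-Webhook-Signature"], "whsec_my_signing_secret")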
Legacy webhook_url
The top-level webhook_url field still works for simple use cases: it fires crawl.completed only, with no signing. Use the full webhook object for per-event filtering and HMAC verification. The two are mutually exclusive.
Change Tracking
Enable change detection to compare crawl results against previous runs. Detects new, changed, unchanged, and removed pages.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable change detection against the previous crawl. |
| tracking_id | string | auto (domain) | Identifier for this tracking history. Allows parallel tracks for the same domain. |
| mode | string | content | content: SHA-256 hash comparison. structured: JSON field-level diff on extraction results. |
| include_diff | boolean | false | Include actual diff output. Adds +1 credit/page (content mode) or +2 (structured mode). |
// Track changes across daily crawls
{
"url": "https://shop.example.com",
"max_pages": 200,
"change_tracking": {
"enabled": true,
"tracking_id": "daily-price-check",
"mode": "content",
"include_diff": true
}
}
Output Connectors
Stream crawl results directly to cloud storage or a webhook endpoint as pages complete. No polling needed.
| Type | Required Fields | Description |
|---|---|---|
| s3 | bucket, credentials (access_key_id, secret_access_key) | Export to AWS S3 as JSONL, JSON per page, or CSV. |
| gcs | bucket, credentials (service_account_json) | Export to Google Cloud Storage. |
| webhook_stream | url | POST each page result to a webhook endpoint as it completes. Optional HMAC signing. |
// Stream results to S3
{
"url": "https://example.com",
"max_pages": 500,
"output": {
"type": "s3",
"bucket": "my-crawl-data",
"prefix": "crawls/2026-04/",
"region": "us-east-1",
"credentials": {
"access_key_id": "AKIA...",
"secret_access_key": "..."
},
"format": "jsonl"
}
}
Response (202 Accepted)
{
"crawl_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"status": "discovering",
"estimated_pages": 47,
"total_enqueued": 47,
"estimated_credits": 94,
"message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
}
Poll Crawl Status
GET /v1/crawl/{crawl_id}
Get the current status, progress, billing breakdown, and optionally per-page results of a crawl.
Parameters
| Name | Type | Required | Description |
|---|---|---|---|
| include_results | boolean | Optional | Include per-page scrape results in the response. Default: false. |
Status Values
| Status | Description |
|---|---|
| discovering | Fetching sitemap and/or extracting links from the start page. |
| scraping | Pages are being scraped. Check completed / total for progress. |
| completed | All pages scraped successfully. |
| partial | Crawl finished but some pages failed. Results are available for successful pages. |
| failed | All pages failed or a critical error occurred. |
| cancelled | Crawl was cancelled via DELETE endpoint. |
Billing Breakdown
The billing object in the status response provides a full cost accounting:
{
"crawl_id": "a1b2c3d4-...",
"status": "completed",
"total": 47,
"completed": 45,
"failed": 2,
"in_progress": 0,
"credits_debited": 94,
"current_depth": 2,
"max_depth": 3,
"total_discovered": 52,
"billing": {
"credits_debited": 94,
"credits_used": 78,
"credits_refunded": 16,
"estimated_cost_usd": "$0.094",
"actual_cost_usd": "$0.078",
"is_byop": false,
"tier_breakdown": { "1": 30, "3": 12, "4": 3 }
},
"budget_status": {
"budget": { "/products/*": 100, "*": 5 },
"counts": { "/products/*": 42, "*": 3 },
"exhausted": []
}
}
Cancel a Crawl
DELETE /v1/crawl/{crawl_id}
Cancel an active crawl. Pending jobs are cancelled and unused credits are refunded.
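Cancellation is a single DELETE request; here is a minimal Python sketch (crawl ID and API key are placeholders).
# Sketch: cancel an active crawl and inspect the refund.
import requests

resp = requests.delete(
    "https://api.alterlab.io/api/v1/crawl/a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    headers={"X-API-Key": "YOUR_API_KEY"},
    timeout=30,
)
print(resp.json())  # includes cancelled_jobs, credits_refunded, credits_used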
// Cancel response
{
"crawl_id": "a1b2c3d4-...",
"status": "cancelled",
"cancelled_jobs": 12,
"credits_refunded": 24,
"credits_used": 70,
"message": "Crawl cancelled. Unprocessed credits have been refunded."
}
Partial Results Available
Pages scraped before cancellation remain available via GET /v1/crawl/{id}?include_results=true. You only pay for completed pages.
Credit Model
Crawl credits work on a pre-debit and refund model:
Pre-debit
When you start a crawl, credits are estimated and debited upfront based on the number of discovered pages and expected tier costs.
Actual Consumption
Each page consumes credits based on the actual tier used (1-15 credits per page). Failed pages do not consume credits.
Refund
On completion or cancellation, the difference between pre-debited and actual credits is automatically refunded.
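In the billing example above, for instance, 94 credits were pre-debited at start, 78 were actually consumed, and the 16-credit difference was refunded on completion.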
Error Codes
| Code | Description | Resolution |
|---|---|---|
| 402 | Insufficient credits for estimated crawl cost. | Add credits or reduce max_pages. |
| 429 | Concurrent crawl limit reached (max 3 per account). | Wait for an active crawl to finish or cancel one. |
| 404 | Crawl ID not found or expired (24h TTL). | Crawl results expire after 24 hours. Start a new crawl. |
| 503 | Queue is full. Too many jobs queued system-wide. | Retry after a short delay. |
Polling Best Practices
Use exponential backoff
Start at 3-5 second intervals. Increase to 10-15 seconds for large crawls (100+ pages) to reduce API calls.
Only include results when needed
Use include_results=false for progress checks. Only set include_results=true on the final poll to reduce payload size.
Prefer webhooks for large crawls
Set webhook_url to receive a crawl.completed notification instead of polling.
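Putting these practices together, a polling loop might look like the following sketch; the API key is a placeholder, the intervals are illustrative, and results are only requested on the final call.
# Sketch: poll a crawl with exponential backoff; fetch results only once it finishes.
import time
import requests

API = "https://api.alterlab.io/api/v1"
HEADERS = {"X-API-Key": "YOUR_API_KEY"}
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def wait_for_crawl(crawl_id: str, delay: float = 3.0, max_delay: float = 15.0) -> dict:
    while True:
        status = requests.get(f"{API}/crawl/{crawl_id}", headers=HEADERS, timeout=30).json()
        if status["status"] in TERMINAL:
            # final poll: include the per-page results
            return requests.get(
                f"{API}/crawl/{crawl_id}",
                params={"include_results": "true"},
                headers=HEADERS,
                timeout=60,
            ).json()
        print(f'{status.get("completed", 0)}/{status.get("total", "?")} pages done')
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped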