Web Crawling
Crawl entire websites from a single URL. AlterLab discovers pages via sitemaps and link extraction, respects robots.txt, and scrapes each page with automatic tier escalation.
Async by Design
Crawls run asynchronously. Starting a crawl returns 202 Accepted immediately with a crawl_id. Use it to poll status or configure a webhook for delivery.

How It Works
Discover
AlterLab fetches sitemaps and extracts links from the start page. URLs are deduplicated and filtered against your include/exclude patterns and robots.txt rules.
Scrape
Each discovered URL becomes an individual scrape job. Jobs run in parallel with the same tier escalation and anti-bot bypass logic as single scrapes.
Deepen
When max_depth is greater than 0, completed pages are scanned for new internal links. New URLs are deduplicated and enqueued until the depth limit or page cap is reached.
Collect
Poll GET /api/v1/crawl/{crawl_id} for progress and per-page results, or receive a crawl.completed webhook when all pages finish.
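The four phases above amount to a breadth-first traversal with deduplication, a depth limit, and a page cap. The sketch below is an illustrative in-memory simulation of that loop over a toy link graph, not AlterLab's actual implementation:

```python
from collections import deque

def simulate_crawl(start_url, link_graph, max_depth=3, max_pages=50):
    """Toy breadth-first model of discover -> scrape -> deepen -> collect.

    link_graph maps each URL to the internal links found on that page.
    """
    seen = {start_url}                 # deduplication set
    queue = deque([(start_url, 0)])    # (url, depth) pairs awaiting scraping
    scraped = []
    while queue and len(scraped) < max_pages:
        url, depth = queue.popleft()
        scraped.append(url)            # "scrape" the page
        if depth >= max_depth:
            continue                   # depth limit: stop following links here
        for link in link_graph.get(url, []):
            if link not in seen:       # enqueue only URLs not seen before
                seen.add(link)
                queue.append((link, depth + 1))
    return scraped

graph = {
    "https://example.com": ["https://example.com/blog", "https://example.com/docs"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
}
# With max_depth=1, only the start page and its direct links are scraped
pages = simulate_crawl("https://example.com", graph, max_depth=1, max_pages=10)
```

Raising `max_depth` to 2 in this toy graph would also pick up `/blog/post-1`, mirroring the "Deepen" phase described above.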
Start a Crawl
POST /api/v1/crawl

Submit a start URL to discover and scrape an entire website. Returns 202 Accepted with a crawl ID for polling.
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"max_pages": 100,
"max_depth": 3,
"include_patterns": ["/blog/*", "/docs/*"],
"exclude_patterns": ["/admin/*", "*.pdf"],
"formats": ["text", "markdown"],
"webhook_url": "https://your-server.com/webhook"
}'

Response (202 Accepted):
{
"crawl_id": "c9a1b2d3-4e5f-6789-abcd-ef0123456789",
"status": "discovering",
"estimated_pages": 47,
"total_enqueued": 47,
"estimated_credits": 47000,
"message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
}

Request Body
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | -- | Start URL for the crawl (required, http/https) |
| max_pages | integer | 50 | Maximum number of pages to scrape (1--100,000) |
| max_depth | integer | 3 | Maximum link-following depth from start URL (0 = start page only, max 50) |
| include_patterns | string[] | null | Glob patterns -- only scrape URLs whose path matches at least one |
| exclude_patterns | string[] | null | Glob patterns -- skip URLs whose path matches any |
| formats | string[] | null | Output formats per page: text, json, json_v2, html, markdown |
| extraction_schema | object | null | JSON schema for structured extraction on each page |
| extraction_profile | string | null | Pre-defined profile: auto, product, article, job_posting, faq, recipe, event |
| webhook_url | string | null | URL to receive a crawl.completed webhook when all pages finish |
| respect_robots | boolean | true | Respect robots.txt rules for the target domain |
| include_subdomains | boolean | false | Include links to subdomains during discovery |
| advanced | object | null | Advanced scraping options applied to every page (see below) |
| cost_controls | object | null | Cost controls for the entire crawl (see below) |
Advanced Options
Pass an advanced object to configure scraping behaviour applied to every page in the crawl:
| Field | Default | Description |
|---|---|---|
| render_js | false | Render JavaScript on every page (forces Tier 4) |
| screenshot | false | Capture a screenshot of every crawled page |
| use_proxy | false | Route all crawl requests through premium proxy |
| wait_for | null | CSS selector to wait for on each page before extracting |
| timeout | 90 | Timeout per page in seconds (1--300) |
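For example, the `advanced` object slots directly into the request body alongside the top-level parameters. The snippet below builds such a payload (field names taken from the table above; the selector value is illustrative):

```python
# Crawl request body with per-page advanced options applied to every page.
payload = {
    "url": "https://example.com",
    "max_pages": 20,
    "advanced": {
        "render_js": True,       # render JavaScript (forces Tier 4 on every page)
        "wait_for": "#content",  # hypothetical selector to wait for before extracting
        "timeout": 120,          # per-page timeout in seconds (valid range 1-300)
    },
}
```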
Cost Controls
Pass a cost_controls object to cap spending:
| Field | Description |
|---|---|
| max_credits | Maximum total cost to spend on this crawl |
| max_tier | Maximum tier to use for page scrapes (1, 2, 3, 3.5, or 4) |
| force_tier | Force a specific tier for all pages (1, 2, 3, 3.5, or 4) |
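Conceptually, `max_credits` acts as a hard stop: once the running total would exceed the cap, remaining pages are not scraped. The sketch below models that behaviour with hypothetical per-page credit costs (actual per-tier costs are on the Pricing page):

```python
def pages_within_budget(page_costs, max_credits):
    """Return how many pages can run before a max_credits cap is hit.

    page_costs are hypothetical per-page credit costs; real costs depend
    on the tier each page escalates to.
    """
    spent, allowed = 0, 0
    for cost in page_costs:
        if spent + cost > max_credits:
            break                      # cap reached: remaining pages are skipped
        spent += cost
        allowed += 1
    return allowed, spent

# Illustrative numbers: cheap pages plus one that escalates to a pricier tier
allowed, spent = pages_within_budget([1000, 1000, 5000, 1000], max_credits=7000)
```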
URL Discovery
When you start a crawl, AlterLab runs a fast discovery phase before scraping begins:
- Sitemap parsing -- fetches robots.txt to find sitemap URLs, then recursively parses sitemaps (including sitemap indexes) to collect all listed URLs.
- Link extraction -- fetches the start page HTML and extracts internal links, giving you pages that may not appear in sitemaps.
- Deduplication -- all discovered URLs are normalised and deduplicated before scraping begins.
- robots.txt compliance -- when respect_robots is true (the default), URLs disallowed by robots.txt are automatically excluded.
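The normalisation, deduplication, and robots.txt steps can be sketched with the Python standard library alone. This is an illustration of the concepts, not AlterLab's internals (the normalisation rules shown are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def normalise(url):
    """Drop fragments and trailing slashes so duplicate URLs collapse."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

# Parse robots.txt rules (inline string here, purely for illustration)
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /admin/"])

discovered = [
    "https://example.com/blog/",
    "https://example.com/blog#comments",   # duplicate after normalisation
    "https://example.com/admin/login",     # disallowed by robots.txt
]
unique = {normalise(u) for u in discovered}
allowed = sorted(u for u in unique if robots.can_fetch("*", u))
```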
Domain Scoping
By default, the crawler only follows links on the start URL's domain. Set include_subdomains to true to also follow links to subdomains (e.g., blog.example.com from example.com).

Depth Control
The max_depth parameter controls how many levels of links the crawler follows from the start URL:
| max_depth | Behaviour |
|---|---|
| 0 | Scrape only the start page (plus any pages found in sitemaps) |
| 1 | Scrape the start page and all pages linked from it |
| 2 | Follow links two levels deep from the start page |
| 3 (default) | Three levels deep -- covers most site structures |
Depth + Pages
Combine max_depth with max_pages to cap total pages and control costs. The crawler stops discovering new pages once the limit is reached.

URL Filtering
Use glob patterns to precisely control which pages are included or excluded:
{
"url": "https://example.com",
"include_patterns": ["/blog/*", "/products/*"],
"exclude_patterns": ["/admin/*", "/internal/*", "*.pdf"]
}

- include_patterns -- if provided, only URLs whose path matches at least one pattern are scraped.
- exclude_patterns -- URLs whose path matches any pattern are always skipped, even if they match an include pattern.
- Patterns use standard glob syntax: * matches any characters, ? matches a single character.
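This filtering logic can be reproduced with Python's `fnmatch` module. A sketch of the documented semantics (patterns match the URL path; excludes always win over includes):

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

def should_scrape(url, include=None, exclude=None):
    """Apply include/exclude glob patterns to a URL's path.

    Exclude patterns always win, even when an include pattern also matches.
    """
    path = urlsplit(url).path
    if exclude and any(fnmatch(path, p) for p in exclude):
        return False
    if include:
        return any(fnmatch(path, p) for p in include)
    return True  # no include list: everything not excluded is scraped

inc = ["/blog/*", "/products/*"]
exc = ["/admin/*", "*.pdf"]
keep = should_scrape("https://example.com/blog/post-1", inc, exc)          # True
drop = should_scrape("https://example.com/blog/whitepaper.pdf", inc, exc)  # False
```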
Poll Crawl Status
GET /api/v1/crawl/{crawl_id}

Returns crawl progress, billing breakdown, and optionally per-page results. Pass ?include_results=true to include individual page data.
curl https://api.alterlab.io/api/v1/crawl/c9a1b2d3-...?include_results=true \
-H "X-API-Key: your_api_key"Status Response
{
"crawl_id": "c9a1b2d3-...",
"status": "scraping",
"total": 47,
"completed": 32,
"failed": 2,
"in_progress": 13,
"credits_debited": 47000,
"created_at": "2026-03-24T10:00:00Z",
"current_depth": 2,
"max_depth": 3,
"total_discovered": 47,
"billing": {
"credits_debited": 47000,
"credits_used": 34000,
"credits_refunded": 0,
"estimated_cost_usd": "$0.47",
"actual_cost_usd": "$0.34",
"is_byop": false,
"tier_breakdown": { "1": 28, "3": 4 }
},
"pages": [
{
"job_id": "a1b2c3d4-...",
"url": "https://example.com/blog/post-1",
"status": "succeeded",
"result": { "text": "...", "metadata": { ... } }
}
]
}

Crawl status values:
| Status | Meaning |
|---|---|
| discovering | Fetching sitemaps and extracting links from the start page |
| scraping | Scrape jobs are running (may still discover deeper pages) |
| completed | All pages scraped successfully |
| partial | All jobs done, some pages failed |
| failed | All pages failed |
| cancelled | Crawl was cancelled via DELETE endpoint |
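When polling, only discovering and scraping are in-flight states; the other four are terminal and mean polling can stop. A tiny helper to encode that (illustrative, derived from the table above):

```python
# Terminal crawl states from the status table: no further changes will occur.
TERMINAL = {"completed", "partial", "failed", "cancelled"}

def is_done(status):
    """True once a crawl has reached a terminal state."""
    return status in TERMINAL

finished = [s for s in ("discovering", "scraping", "partial") if is_done(s)]
```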
Cancel a Crawl
DELETE /api/v1/crawl/{crawl_id}

Cancel a running crawl. Jobs already completed are not affected. Pending and queued jobs are cancelled, and their credits are automatically refunded.
curl -X DELETE https://api.alterlab.io/api/v1/crawl/c9a1b2d3-... \
-H "X-API-Key: your_api_key"Response:
{
"crawl_id": "c9a1b2d3-...",
"status": "cancelled",
"cancelled_jobs": 15,
"credits_refunded": 15000,
"credits_used": 32000,
"message": "Crawl cancelled. Unprocessed credits have been refunded."
}

Already Completed
Cancelling does not affect pages that have already finished: their results remain available and their credits are not refunded. Only pending and queued pages are refunded.

Billing & Credits
- Credits are pre-debited when the crawl starts, based on the estimated cost per discovered URL.
- As depth crawling discovers new pages, additional credits are debited for each new wave of URLs.
- If a page fails, credits for that page are automatically refunded.
- On completion, any overpayment (estimated minus actual) is automatically refunded.
- Cancelling a crawl refunds all credits for unprocessed pages.
- Use cost_controls.max_credits to set an absolute spending cap.
- BYOP (Bring Your Own Proxy) discounts apply per-page if you have an active proxy integration.
The billing summary is included in the status response under billing, which shows debited, used, and refunded credits alongside a per-tier breakdown.
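These three billing fields reconcile against each other: on completion, the refund equals what was pre-debited minus what was actually used. Using the sample numbers from the status response above:

```python
# Reconcile the billing block from the sample status response above.
billing = {"credits_debited": 47000, "credits_used": 34000, "credits_refunded": 0}

# On completion, the overpayment (debited minus actually used) is refunded.
expected_refund = billing["credits_debited"] - billing["credits_used"]
```

Here the crawl was still in the scraping state, so credits_refunded is still 0; once it completes, the 13,000-credit difference would be returned.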
Limits
Crawl Limits
- Maximum 3 concurrent crawls per user
- Maximum 100,000 pages per crawl (max_pages)
- Maximum depth of 50 (max_depth)
- Discovery phase timeout: 30 seconds
- Crawl metadata expires after 24 hours -- poll or use webhooks before then
- Each page counts as a separate scrape credit-wise
Python Example
import alterlab
import time
client = alterlab.AlterLab(api_key="your_api_key")
# Start a crawl
crawl = client.crawl(
url="https://example.com",
max_pages=100,
max_depth=3,
include_patterns=["/blog/*", "/docs/*"],
exclude_patterns=["/admin/*"],
formats=["text", "markdown"],
webhook_url="https://your-server.com/webhook",
)
print(f"Crawl ID: {crawl['crawl_id']}")
print(f"Pages discovered: {crawl['estimated_pages']}")
print(f"Estimated credits: {crawl['estimated_credits']}")
# Poll until complete
while True:
status = client.get_crawl_status(crawl["crawl_id"], include_results=True)
print(f"Status: {status['status']} -- depth {status['current_depth']}/{status['max_depth']}")
print(f" {status['completed']}/{status['total']} pages done, {status['failed']} failed")
if status["status"] not in ("discovering", "scraping"):
break
time.sleep(3)
# Process results
for page in status.get("pages", []):
if page["status"] == "succeeded":
text = page["result"].get("text", "")
print(f" {page['url']}: {len(text)} chars")
else:
print(f" {page['url']}: FAILED -- {page.get('error', 'unknown')}")
# Check billing
billing = status.get("billing", {})
print(f"Cost: {billing.get('actual_cost_usd', 'N/A')} (estimated {billing.get('estimated_cost_usd', 'N/A')})")
print(f"Refunded: {billing.get('credits_refunded', 0)} credits")Node.js Example
import AlterLab from "@alterlab/sdk";
const client = new AlterLab({ apiKey: "your_api_key" });
// Start a crawl
const crawl = await client.crawl({
url: "https://example.com",
maxPages: 100,
maxDepth: 3,
includePatterns: ["/blog/*", "/docs/*"],
excludePatterns: ["/admin/*"],
formats: ["text", "markdown"],
webhookUrl: "https://your-server.com/webhook",
});
console.log(`Crawl ID: ${crawl.crawlId}`);
console.log(`Pages discovered: ${crawl.estimatedPages}`);
// Poll until complete
let status;
do {
await new Promise((r) => setTimeout(r, 3000));
status = await client.getCrawlStatus(crawl.crawlId, { includeResults: true });
console.log(`${status.completed}/${status.total} done (depth ${status.currentDepth}/${status.maxDepth})`);
} while (["discovering", "scraping"].includes(status.status));
// Process results
for (const page of status.pages ?? []) {
if (page.status === "succeeded") {
console.log(` ${page.url}: ${page.result?.text?.length ?? 0} chars`);
} else {
console.log(` ${page.url}: FAILED -- ${page.error}`);
}
}
// Cancel a crawl (if needed)
// const cancelled = await client.cancelCrawl(crawl.crawlId);
// console.log(`Refunded: ${cancelled.creditsRefunded} credits`);

Related Guides
- Batch Scraping -- scrape a fixed list of URLs instead of discovering them
- Webhooks -- learn more about webhook delivery and payload formats
- Pricing -- credit costs per tier and BYOP discounts