Guide

Web Crawling

Crawl entire websites from a single URL. AlterLab discovers pages via sitemaps and link extraction, respects robots.txt, and scrapes each page with automatic tier escalation.

Async by Design

Crawl requests return immediately with a crawl_id. Use it to poll status or configure a webhook for delivery.

How It Works

Discover

AlterLab fetches sitemaps and extracts links from the start page. URLs are deduplicated and filtered against your include/exclude patterns and robots.txt rules.

Scrape

Each discovered URL becomes an individual scrape job. Jobs run in parallel with the same tier escalation and anti-bot bypass logic as single scrapes.

Deepen

When max_depth is greater than 0, completed pages are scanned for new internal links. New URLs are deduplicated and enqueued until the depth limit or page cap is reached.

Collect

Poll GET /api/v1/crawl/{crawl_id} for progress and per-page results, or receive a crawl.completed webhook when all pages finish.

Start a Crawl

POST

/api/v1/crawl

Submit a start URL to discover and scrape an entire website. Returns 202 Accepted with a crawl ID for polling.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 100,
    "max_depth": 3,
    "include_patterns": ["/blog/*", "/docs/*"],
    "exclude_patterns": ["/admin/*", "*.pdf"],
    "formats": ["text", "markdown"],
    "webhook_url": "https://your-server.com/webhook"
  }'

Response (202 Accepted):

JSON

{
  "crawl_id": "c9a1b2d3-4e5f-6789-abcd-ef0123456789",
  "status": "discovering",
  "estimated_pages": 47,
  "total_enqueued": 47,
  "estimated_credits": 47000,
  "message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
}

Request Body

Parameter	Type	Default	Description
url	string	--	Start URL for the crawl (required, http/https)
max_pages	integer	50	Maximum number of pages to scrape (1--100,000)
max_depth	integer	3	Maximum link-following depth from start URL (0 = start page only, max 50)
include_patterns	string[]	null	Glob patterns -- only scrape URLs whose path matches at least one
exclude_patterns	string[]	null	Glob patterns -- skip URLs whose path matches any
formats	string[]	null	Output formats per page: text, json, json_v2, html, markdown
extraction_schema	object	null	JSON schema for structured extraction on each page
extraction_profile	string	null	Pre-defined profile: auto, product, article, job_posting, faq, recipe, event
webhook_url	string	null	URL to receive a crawl.completed webhook when all pages finish
respect_robots	boolean	true	Respect robots.txt rules for the target domain
include_subdomains	boolean	false	Include links to subdomains during discovery
advanced	object	null	Advanced scraping options applied to every page (see below)
cost_controls	object	null	Cost controls for the entire crawl (see below)

Advanced Options

Pass an advanced object to configure scraping behaviour applied to every page in the crawl:

Field	Default	Description
render_js	false	Render JavaScript on every page (forces Tier 4)
screenshot	false	Capture a screenshot of every crawled page
use_proxy	false	Route all crawl requests through premium proxy
wait_for	null	CSS selector to wait for on each page before extracting
timeout	90	Timeout per page in seconds (1--300)

Cost Controls

Pass a cost_controls object to cap spending:

Field	Description
max_credits	Maximum total cost to spend on this crawl
max_tier	Maximum tier to use for page scrapes (1, 2, 3, 3.5, or 4)
force_tier	Force a specific tier for all pages (1, 2, 3, 3.5, or 4)

URL Discovery

When you start a crawl, AlterLab runs a fast discovery phase before scraping begins:

Sitemap parsing -- fetches robots.txt to find sitemap URLs, then recursively parses sitemaps (including sitemap indexes) to collect all listed URLs.
Link extraction -- fetches the start page HTML and extracts internal links, giving you pages that may not appear in sitemaps.
Deduplication -- all discovered URLs are normalised and deduplicated before scraping begins.
robots.txt compliance -- when respect_robots is true (the default), URLs disallowed by robots.txt are automatically excluded.

Domain Scoping

By default, only URLs on the same domain as the start URL are included. Set include_subdomains to true to also follow links to subdomains (e.g., blog.example.com from example.com).

Depth Control

The max_depth parameter controls how many levels of links the crawler follows from the start URL:

max_depth	Behaviour
0	Scrape only the start page (plus any pages found in sitemaps)
1	Scrape the start page and all pages linked from it
2	Follow links two levels deep from the start page
3 (default)	Three levels deep -- covers most site structures

Depth + Pages

Higher depth values can discover many more pages. Always set max_pages to cap total pages and control costs. The crawler stops discovering new pages once the limit is reached.

URL Filtering

Use glob patterns to precisely control which pages are included or excluded:

JSON

{
  "url": "https://example.com",
  "include_patterns": ["/blog/*", "/products/*"],
  "exclude_patterns": ["/admin/*", "/internal/*", "*.pdf"]
}

include_patterns -- if provided, only URLs whose path matches at least one pattern are scraped.
exclude_patterns -- URLs whose path matches any pattern are always skipped, even if they match an include pattern.
Patterns use standard glob syntax: * matches any characters, ? matches a single character.

Poll Crawl Status

GET

/api/v1/crawl/{crawl_id}

Returns crawl progress, billing breakdown, and optionally per-page results. Pass ?include_results=true to include individual page data.

Bash

curl https://api.alterlab.io/api/v1/crawl/c9a1b2d3-...?include_results=true \
  -H "X-API-Key: your_api_key"

Status Response

JSON

{
  "crawl_id": "c9a1b2d3-...",
  "status": "scraping",
  "total": 47,
  "completed": 32,
  "failed": 2,
  "in_progress": 13,
  "credits_debited": 47000,
  "created_at": "2026-03-24T10:00:00Z",
  "current_depth": 2,
  "max_depth": 3,
  "total_discovered": 47,
  "billing": {
    "credits_debited": 47000,
    "credits_used": 34000,
    "credits_refunded": 0,
    "estimated_cost_usd": "$0.47",
    "actual_cost_usd": "$0.34",
    "is_byop": false,
    "tier_breakdown": { "1": 28, "3": 4 }
  },
  "pages": [
    {
      "job_id": "a1b2c3d4-...",
      "url": "https://example.com/blog/post-1",
      "status": "succeeded",
      "result": { "text": "...", "metadata": { ... } }
    }
  ]
}

Crawl status values:

Status	Meaning
discovering	Fetching sitemaps and extracting links from the start page
scraping	Scrape jobs are running (may still discover deeper pages)
completed	All pages scraped successfully
partial	All jobs done, some pages failed
failed	All pages failed
cancelled	Crawl was cancelled via DELETE endpoint

Cancel a Crawl

DELETE

/api/v1/crawl/{crawl_id}

Cancel a running crawl. Jobs already completed are not affected. Pending and queued jobs are cancelled, and their credits are automatically refunded.

Bash

curl -X DELETE https://api.alterlab.io/api/v1/crawl/c9a1b2d3-... \
  -H "X-API-Key: your_api_key"

Response:

JSON

{
  "crawl_id": "c9a1b2d3-...",
  "status": "cancelled",
  "cancelled_jobs": 15,
  "credits_refunded": 15000,
  "credits_used": 32000,
  "message": "Crawl cancelled. Unprocessed credits have been refunded."
}

Already Completed

You cannot cancel a crawl that has already completed or was previously cancelled. The API returns 409 Conflict in those cases.

Billing & Credits

Credits are pre-debited when the crawl starts, based on the estimated cost per discovered URL.
As depth crawling discovers new pages, additional credits are debited for each new wave of URLs.
If a page fails, credits for that page are automatically refunded.
On completion, any overpayment (estimated minus actual) is automatically refunded.
Cancelling a crawl refunds all credits for unprocessed pages.
Use cost_controls.max_credits to set an absolute spending cap.
BYOP (Bring Your Own Proxy) discounts apply per-page if you have an active proxy integration.

The billing summary is included in the status response under billing, which shows debited, used, and refunded credits alongside a per-tier breakdown.

Limits

Crawl Limits

Maximum 3 concurrent crawls per user
Maximum 100,000 pages per crawl (max_pages)
Maximum depth of 50 (max_depth)
Discovery phase timeout: 30 seconds
Crawl metadata expires after 2 hours -- poll or use webhooks before then
Each page counts as a separate scrape credit-wise

Python Example

Python

import alterlab
import time

client = alterlab.AlterLab(api_key="your_api_key")

# Start a crawl
crawl = client.crawl(
    url="https://example.com",
    max_pages=100,
    max_depth=3,
    include_patterns=["/blog/*", "/docs/*"],
    exclude_patterns=["/admin/*"],
    formats=["text", "markdown"],
    webhook_url="https://your-server.com/webhook",
)

print(f"Crawl ID: {crawl['crawl_id']}")
print(f"Pages discovered: {crawl['estimated_pages']}")
print(f"Estimated credits: {crawl['estimated_credits']}")

# Poll until complete
while True:
    status = client.get_crawl_status(crawl["crawl_id"], include_results=True)
    print(f"Status: {status['status']} -- depth {status['current_depth']}/{status['max_depth']}")
    print(f"  {status['completed']}/{status['total']} pages done, {status['failed']} failed")
    if status["status"] not in ("discovering", "scraping"):
        break
    time.sleep(3)

# Process results
for page in status.get("pages", []):
    if page["status"] == "succeeded":
        text = page["result"].get("text", "")
        print(f"  {page['url']}: {len(text)} chars")
    else:
        print(f"  {page['url']}: FAILED -- {page.get('error', 'unknown')}")

# Check billing
billing = status.get("billing", {})
print(f"Cost: {billing.get('actual_cost_usd', 'N/A')} (estimated {billing.get('estimated_cost_usd', 'N/A')})")
print(f"Refunded: {billing.get('credits_refunded', 0)} credits")

Node.js Example

TYPESCRIPT

import AlterLab from "@alterlab/sdk";

const client = new AlterLab({ apiKey: "your_api_key" });

// Start a crawl
const crawl = await client.crawl({
  url: "https://example.com",
  maxPages: 100,
  maxDepth: 3,
  includePatterns: ["/blog/*", "/docs/*"],
  excludePatterns: ["/admin/*"],
  formats: ["text", "markdown"],
  webhookUrl: "https://your-server.com/webhook",
});

console.log(`Crawl ID: ${crawl.crawlId}`);
console.log(`Pages discovered: ${crawl.estimatedPages}`);

// Poll until complete
let status;
do {
  await new Promise((r) => setTimeout(r, 3000));
  status = await client.getCrawlStatus(crawl.crawlId, { includeResults: true });
  console.log(`${status.completed}/${status.total} done (depth ${status.currentDepth}/${status.maxDepth})`);
} while (["discovering", "scraping"].includes(status.status));

// Process results
for (const page of status.pages ?? []) {
  if (page.status === "succeeded") {
    console.log(`  ${page.url}: ${page.result?.text?.length ?? 0} chars`);
  } else {
    console.log(`  ${page.url}: FAILED -- ${page.error}`);
  }
}

// Cancel a crawl (if needed)
// const cancelled = await client.cancelCrawl(crawl.crawlId);
// console.log(`Refunded: ${cancelled.creditsRefunded} credits`);

Related Guides

Batch Scraping -- scrape a fixed list of URLs instead of discovering them
Webhooks -- learn more about webhook delivery and payload formats
Pricing -- credit costs per tier and BYOP discounts

Site Mapping Batch Scraping

Last updated: June 2026

Guide

Web Crawling

Crawl entire websites from a single URL. AlterLab discovers pages via sitemaps and link extraction, respects robots.txt, and scrapes each page with automatic tier escalation.

Async by Design

Crawl requests return immediately with a crawl_id. Use it to poll status or configure a webhook for delivery.

How It Works

Discover

AlterLab fetches sitemaps and extracts links from the start page. URLs are deduplicated and filtered against your include/exclude patterns and robots.txt rules.

Scrape

Each discovered URL becomes an individual scrape job. Jobs run in parallel with the same tier escalation and anti-bot bypass logic as single scrapes.

Deepen

When max_depth is greater than 0, completed pages are scanned for new internal links. New URLs are deduplicated and enqueued until the depth limit or page cap is reached.

Collect

Poll GET /api/v1/crawl/{crawl_id} for progress and per-page results, or receive a crawl.completed webhook when all pages finish.

Start a Crawl

POST

/api/v1/crawl

Submit a start URL to discover and scrape an entire website. Returns 202 Accepted with a crawl ID for polling.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 100,
    "max_depth": 3,
    "include_patterns": ["/blog/*", "/docs/*"],
    "exclude_patterns": ["/admin/*", "*.pdf"],
    "formats": ["text", "markdown"],
    "webhook_url": "https://your-server.com/webhook"
  }'

Response (202 Accepted):

JSON

{
  "crawl_id": "c9a1b2d3-4e5f-6789-abcd-ef0123456789",
  "status": "discovering",
  "estimated_pages": 47,
  "total_enqueued": 47,
  "estimated_credits": 47000,
  "message": "Crawl started. Poll GET /v1/crawl/{crawl_id} for progress."
}

Request Body

Parameter	Type	Default	Description
url	string	--	Start URL for the crawl (required, http/https)
max_pages	integer	50	Maximum number of pages to scrape (1--100,000)
max_depth	integer	3	Maximum link-following depth from start URL (0 = start page only, max 50)
include_patterns	string[]	null	Glob patterns -- only scrape URLs whose path matches at least one
exclude_patterns	string[]	null	Glob patterns -- skip URLs whose path matches any
formats	string[]	null	Output formats per page: text, json, json_v2, html, markdown
extraction_schema	object	null	JSON schema for structured extraction on each page
extraction_profile	string	null	Pre-defined profile: auto, product, article, job_posting, faq, recipe, event
webhook_url	string	null	URL to receive a crawl.completed webhook when all pages finish
respect_robots	boolean	true	Respect robots.txt rules for the target domain
include_subdomains	boolean	false	Include links to subdomains during discovery
advanced	object	null	Advanced scraping options applied to every page (see below)
cost_controls	object	null	Cost controls for the entire crawl (see below)

Advanced Options

Pass an advanced object to configure scraping behaviour applied to every page in the crawl:

Field	Default	Description
render_js	false	Render JavaScript on every page (forces Tier 4)
screenshot	false	Capture a screenshot of every crawled page
use_proxy	false	Route all crawl requests through premium proxy
wait_for	null	CSS selector to wait for on each page before extracting
timeout	90	Timeout per page in seconds (1--300)

Cost Controls

Pass a cost_controls object to cap spending:

Field	Description
max_credits	Maximum total cost to spend on this crawl
max_tier	Maximum tier to use for page scrapes (1, 2, 3, 3.5, or 4)
force_tier	Force a specific tier for all pages (1, 2, 3, 3.5, or 4)

URL Discovery

When you start a crawl, AlterLab runs a fast discovery phase before scraping begins:

Sitemap parsing -- fetches robots.txt to find sitemap URLs, then recursively parses sitemaps (including sitemap indexes) to collect all listed URLs.
Link extraction -- fetches the start page HTML and extracts internal links, giving you pages that may not appear in sitemaps.
Deduplication -- all discovered URLs are normalised and deduplicated before scraping begins.
robots.txt compliance -- when respect_robots is true (the default), URLs disallowed by robots.txt are automatically excluded.

Domain Scoping

By default, only URLs on the same domain as the start URL are included. Set include_subdomains to true to also follow links to subdomains (e.g., blog.example.com from example.com).

Depth Control

The max_depth parameter controls how many levels of links the crawler follows from the start URL:

max_depth	Behaviour
0	Scrape only the start page (plus any pages found in sitemaps)
1	Scrape the start page and all pages linked from it
2	Follow links two levels deep from the start page
3 (default)	Three levels deep -- covers most site structures

Depth + Pages

Higher depth values can discover many more pages. Always set max_pages to cap total pages and control costs. The crawler stops discovering new pages once the limit is reached.

URL Filtering

Use glob patterns to precisely control which pages are included or excluded:

JSON

{
  "url": "https://example.com",
  "include_patterns": ["/blog/*", "/products/*"],
  "exclude_patterns": ["/admin/*", "/internal/*", "*.pdf"]
}

include_patterns -- if provided, only URLs whose path matches at least one pattern are scraped.
exclude_patterns -- URLs whose path matches any pattern are always skipped, even if they match an include pattern.
Patterns use standard glob syntax: * matches any characters, ? matches a single character.

Poll Crawl Status

GET

/api/v1/crawl/{crawl_id}

Returns crawl progress, billing breakdown, and optionally per-page results. Pass ?include_results=true to include individual page data.

Bash

curl https://api.alterlab.io/api/v1/crawl/c9a1b2d3-...?include_results=true \
  -H "X-API-Key: your_api_key"

Status Response

JSON

{
  "crawl_id": "c9a1b2d3-...",
  "status": "scraping",
  "total": 47,
  "completed": 32,
  "failed": 2,
  "in_progress": 13,
  "credits_debited": 47000,
  "created_at": "2026-03-24T10:00:00Z",
  "current_depth": 2,
  "max_depth": 3,
  "total_discovered": 47,
  "billing": {
    "credits_debited": 47000,
    "credits_used": 34000,
    "credits_refunded": 0,
    "estimated_cost_usd": "$0.47",
    "actual_cost_usd": "$0.34",
    "is_byop": false,
    "tier_breakdown": { "1": 28, "3": 4 }
  },
  "pages": [
    {
      "job_id": "a1b2c3d4-...",
      "url": "https://example.com/blog/post-1",
      "status": "succeeded",
      "result": { "text": "...", "metadata": { ... } }
    }
  ]
}

Crawl status values:

Status	Meaning
discovering	Fetching sitemaps and extracting links from the start page
scraping	Scrape jobs are running (may still discover deeper pages)
completed	All pages scraped successfully
partial	All jobs done, some pages failed
failed	All pages failed
cancelled	Crawl was cancelled via DELETE endpoint

Cancel a Crawl

DELETE

/api/v1/crawl/{crawl_id}

Cancel a running crawl. Jobs already completed are not affected. Pending and queued jobs are cancelled, and their credits are automatically refunded.

Bash

curl -X DELETE https://api.alterlab.io/api/v1/crawl/c9a1b2d3-... \
  -H "X-API-Key: your_api_key"

Response:

JSON

{
  "crawl_id": "c9a1b2d3-...",
  "status": "cancelled",
  "cancelled_jobs": 15,
  "credits_refunded": 15000,
  "credits_used": 32000,
  "message": "Crawl cancelled. Unprocessed credits have been refunded."
}

Already Completed

You cannot cancel a crawl that has already completed or was previously cancelled. The API returns 409 Conflict in those cases.

Billing & Credits

Credits are pre-debited when the crawl starts, based on the estimated cost per discovered URL.
As depth crawling discovers new pages, additional credits are debited for each new wave of URLs.
If a page fails, credits for that page are automatically refunded.
On completion, any overpayment (estimated minus actual) is automatically refunded.
Cancelling a crawl refunds all credits for unprocessed pages.
Use cost_controls.max_credits to set an absolute spending cap.
BYOP (Bring Your Own Proxy) discounts apply per-page if you have an active proxy integration.

The billing summary is included in the status response under billing, which shows debited, used, and refunded credits alongside a per-tier breakdown.

Limits

Crawl Limits

Maximum 3 concurrent crawls per user
Maximum 100,000 pages per crawl (max_pages)
Maximum depth of 50 (max_depth)
Discovery phase timeout: 30 seconds
Crawl metadata expires after 2 hours -- poll or use webhooks before then
Each page counts as a separate scrape credit-wise

Python Example

Python

import alterlab
import time

client = alterlab.AlterLab(api_key="your_api_key")

# Start a crawl
crawl = client.crawl(
    url="https://example.com",
    max_pages=100,
    max_depth=3,
    include_patterns=["/blog/*", "/docs/*"],
    exclude_patterns=["/admin/*"],
    formats=["text", "markdown"],
    webhook_url="https://your-server.com/webhook",
)

print(f"Crawl ID: {crawl['crawl_id']}")
print(f"Pages discovered: {crawl['estimated_pages']}")
print(f"Estimated credits: {crawl['estimated_credits']}")

# Poll until complete
while True:
    status = client.get_crawl_status(crawl["crawl_id"], include_results=True)
    print(f"Status: {status['status']} -- depth {status['current_depth']}/{status['max_depth']}")
    print(f"  {status['completed']}/{status['total']} pages done, {status['failed']} failed")
    if status["status"] not in ("discovering", "scraping"):
        break
    time.sleep(3)

# Process results
for page in status.get("pages", []):
    if page["status"] == "succeeded":
        text = page["result"].get("text", "")
        print(f"  {page['url']}: {len(text)} chars")
    else:
        print(f"  {page['url']}: FAILED -- {page.get('error', 'unknown')}")

# Check billing
billing = status.get("billing", {})
print(f"Cost: {billing.get('actual_cost_usd', 'N/A')} (estimated {billing.get('estimated_cost_usd', 'N/A')})")
print(f"Refunded: {billing.get('credits_refunded', 0)} credits")

Node.js Example

TYPESCRIPT

import AlterLab from "@alterlab/sdk";

const client = new AlterLab({ apiKey: "your_api_key" });

// Start a crawl
const crawl = await client.crawl({
  url: "https://example.com",
  maxPages: 100,
  maxDepth: 3,
  includePatterns: ["/blog/*", "/docs/*"],
  excludePatterns: ["/admin/*"],
  formats: ["text", "markdown"],
  webhookUrl: "https://your-server.com/webhook",
});

console.log(`Crawl ID: ${crawl.crawlId}`);
console.log(`Pages discovered: ${crawl.estimatedPages}`);

// Poll until complete
let status;
do {
  await new Promise((r) => setTimeout(r, 3000));
  status = await client.getCrawlStatus(crawl.crawlId, { includeResults: true });
  console.log(`${status.completed}/${status.total} done (depth ${status.currentDepth}/${status.maxDepth})`);
} while (["discovering", "scraping"].includes(status.status));

// Process results
for (const page of status.pages ?? []) {
  if (page.status === "succeeded") {
    console.log(`  ${page.url}: ${page.result?.text?.length ?? 0} chars`);
  } else {
    console.log(`  ${page.url}: FAILED -- ${page.error}`);
  }
}

// Cancel a crawl (if needed)
// const cancelled = await client.cancelCrawl(crawl.crawlId);
// console.log(`Refunded: ${cancelled.creditsRefunded} credits`);

Related Guides

Batch Scraping -- scrape a fixed list of URLs instead of discovering them
Webhooks -- learn more about webhook delivery and payload formats
Pricing -- credit costs per tier and BYOP discounts

Site Mapping Batch Scraping

Last updated: June 2026