Tutorial

New

Site Crawling Tutorial

Learn to crawl entire websites step by step. Start with a simple 3-page crawl and build up to production-grade crawling with extraction, webhooks, and cost controls.

Time to Complete

This tutorial takes about 15 minutes. You will need an AlterLab API key and about 50 credits.

Prerequisites

An AlterLab API key (get one here)
At least 50 credits in your account
Python 3.8+ with the alterlab package, or Node.js 18+ with @alterlab/sdk, or cURL

Step 1: Basic Crawl

Start with a small crawl to see how it works. We will crawl 5 pages from a documentation site.

Bash

# Start a small crawl
curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 5,
    "max_depth": 1
  }'

# Save the crawl_id from the response
# {"crawl_id": "abc123", "status": "discovering", ...}

# Poll for status
curl https://api.alterlab.io/api/v1/crawl/abc123 \
  -H "X-API-Key: YOUR_API_KEY"

# When status is "completed", get results
curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
  -H "X-API-Key: YOUR_API_KEY"

You should see 5 pages discovered and scraped. The response includes each page's URL, status, and scraped content.

Step 2: Add URL Filtering

Real websites have thousands of pages. Use URL filtering to crawl only the sections you need. Let us target product pages on an e-commerce site.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 30,
    "max_depth": 2,
    "include_patterns": ["/products/*", "/categories/*"],
    "exclude_patterns": ["/blog/*", "/account/*", "/cart"]
  }'

Tip: Start Narrow

It is cheaper to run a small filtered crawl first to verify you are getting the right pages, then widen the scope for production.

Step 3: Add Structured Extraction

Instead of raw HTML, extract structured product data from every page using an extraction profile or custom schema.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 30,
    "include_patterns": ["/products/*"],
    "extraction_schema": {
      "name": "string",
      "price": "number",
      "currency": "string",
      "description": "string",
      "in_stock": "boolean",
      "image_url": "string"
    }
  }'

Step 4: Set Up Webhooks

Instead of polling, receive a notification when the crawl completes. This is the recommended approach for production use.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 100,
    "include_patterns": ["/products/*"],
    "extraction_profile": "product",
    "webhook_url": "https://your-server.com/webhooks/crawl-done"
  }'

# Your webhook endpoint will receive:
# POST /webhooks/crawl-done
# {
#   "event": "crawl.completed",
#   "crawl_id": "abc123",
#   "status": "completed",
#   "total": 87,
#   "completed": 85,
#   "failed": 2,
#   "credits_used": 142
# }

Step 5: Scale Up with Cost Controls

For production crawls, add cost controls to prevent unexpected spend. Combine max_pages, max_credits, and max_tier for full control.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 500,
    "max_depth": 3,
    "include_patterns": ["/products/*"],
    "exclude_patterns": ["/products/*/reviews", "/products/*/questions"],
    "extraction_profile": "product",
    "webhook_url": "https://your-server.com/webhooks/crawl-done",
    "cost_controls": {
      "max_credits": 1000,
      "max_tier": "3"
    },
    "budget": {
      "/products/*": 400,
      "/categories/*": 50,
      "*": 10
    }
  }'

Complete Working Example

Here is a complete production-ready script that crawls a site, monitors progress, and saves extracted data to a JSON file.

Python

"""Complete crawl example — crawl products and save to JSON."""
import json
import time
from alterlab import AlterLab

API_KEY = "YOUR_API_KEY"
client = AlterLab(api_key=API_KEY)

# Configure the crawl
crawl = client.crawl(
    url="https://shop.example.com",
    max_pages=200,
    max_depth=3,
    include_patterns=["/products/*"],
    exclude_patterns=["/blog/*", "/account/*"],
    extraction_schema={
        "name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
    },
    cost_controls={"max_credits": 500},
)

print(f"Crawl {crawl.crawl_id} started")
print(f"Estimated: {crawl.estimated_pages} pages, {crawl.estimated_credits} credits")

# Poll with exponential backoff
interval = 3
while True:
    status = client.get_crawl(crawl.crawl_id)
    pct = (status.completed / max(status.total, 1)) * 100
    print(f"[{status.status}] {status.completed}/{status.total} ({pct:.0f}%)")

    if status.status in ("completed", "partial", "failed"):
        break

    time.sleep(interval)
    interval = min(interval * 1.5, 15)  # max 15s

# Fetch results
result = client.get_crawl(crawl.crawl_id, include_results=True)

# Extract and save data
products = []
for page in result.pages or []:
    if page.status == "succeeded" and page.result:
        extracted = page.result.get("extracted", {})
        if extracted.get("name"):
            products.append({
                "url": page.url,
                **extracted,
            })

with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

print(f"Saved {len(products)} products to products.json")
if result.billing:
    print(f"Credits used: {result.billing.credits_used}")
    print(f"Credits refunded: {result.billing.credits_refunded}")

Next Steps

Crawl API Reference

Full API reference with all parameters, schemas, and error codes.

Crawling Guide

Deep dive into how crawling works, priority scoring, and advanced patterns.

Structured Extraction

Learn about extraction profiles and custom schemas for pulling structured data.

Webhooks Guide

Set up webhook endpoints, handle signatures, and configure retries.

Last updated: March 2026

Tutorial

New

Site Crawling Tutorial

Learn to crawl entire websites step by step. Start with a simple 3-page crawl and build up to production-grade crawling with extraction, webhooks, and cost controls.

Time to Complete

This tutorial takes about 15 minutes. You will need an AlterLab API key and about 50 credits.

Prerequisites

An AlterLab API key (get one here)
At least 50 credits in your account
Python 3.8+ with the alterlab package, or Node.js 18+ with @alterlab/sdk, or cURL

Step 1: Basic Crawl

Start with a small crawl to see how it works. We will crawl 5 pages from a documentation site.

Bash

# Start a small crawl
curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 5,
    "max_depth": 1
  }'

# Save the crawl_id from the response
# {"crawl_id": "abc123", "status": "discovering", ...}

# Poll for status
curl https://api.alterlab.io/api/v1/crawl/abc123 \
  -H "X-API-Key: YOUR_API_KEY"

# When status is "completed", get results
curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
  -H "X-API-Key: YOUR_API_KEY"

You should see 5 pages discovered and scraped. The response includes each page's URL, status, and scraped content.

Step 2: Add URL Filtering

Real websites have thousands of pages. Use URL filtering to crawl only the sections you need. Let us target product pages on an e-commerce site.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 30,
    "max_depth": 2,
    "include_patterns": ["/products/*", "/categories/*"],
    "exclude_patterns": ["/blog/*", "/account/*", "/cart"]
  }'

Tip: Start Narrow

It is cheaper to run a small filtered crawl first to verify you are getting the right pages, then widen the scope for production.

Step 3: Add Structured Extraction

Instead of raw HTML, extract structured product data from every page using an extraction profile or custom schema.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 30,
    "include_patterns": ["/products/*"],
    "extraction_schema": {
      "name": "string",
      "price": "number",
      "currency": "string",
      "description": "string",
      "in_stock": "boolean",
      "image_url": "string"
    }
  }'

Step 4: Set Up Webhooks

Instead of polling, receive a notification when the crawl completes. This is the recommended approach for production use.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 100,
    "include_patterns": ["/products/*"],
    "extraction_profile": "product",
    "webhook_url": "https://your-server.com/webhooks/crawl-done"
  }'

# Your webhook endpoint will receive:
# POST /webhooks/crawl-done
# {
#   "event": "crawl.completed",
#   "crawl_id": "abc123",
#   "status": "completed",
#   "total": 87,
#   "completed": 85,
#   "failed": 2,
#   "credits_used": 142
# }

Step 5: Scale Up with Cost Controls

For production crawls, add cost controls to prevent unexpected spend. Combine max_pages, max_credits, and max_tier for full control.

Bash

curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com",
    "max_pages": 500,
    "max_depth": 3,
    "include_patterns": ["/products/*"],
    "exclude_patterns": ["/products/*/reviews", "/products/*/questions"],
    "extraction_profile": "product",
    "webhook_url": "https://your-server.com/webhooks/crawl-done",
    "cost_controls": {
      "max_credits": 1000,
      "max_tier": "3"
    },
    "budget": {
      "/products/*": 400,
      "/categories/*": 50,
      "*": 10
    }
  }'

Complete Working Example

Here is a complete production-ready script that crawls a site, monitors progress, and saves extracted data to a JSON file.

Python

"""Complete crawl example — crawl products and save to JSON."""
import json
import time
from alterlab import AlterLab

API_KEY = "YOUR_API_KEY"
client = AlterLab(api_key=API_KEY)

# Configure the crawl
crawl = client.crawl(
    url="https://shop.example.com",
    max_pages=200,
    max_depth=3,
    include_patterns=["/products/*"],
    exclude_patterns=["/blog/*", "/account/*"],
    extraction_schema={
        "name": "string",
        "price": "number",
        "currency": "string",
        "in_stock": "boolean",
    },
    cost_controls={"max_credits": 500},
)

print(f"Crawl {crawl.crawl_id} started")
print(f"Estimated: {crawl.estimated_pages} pages, {crawl.estimated_credits} credits")

# Poll with exponential backoff
interval = 3
while True:
    status = client.get_crawl(crawl.crawl_id)
    pct = (status.completed / max(status.total, 1)) * 100
    print(f"[{status.status}] {status.completed}/{status.total} ({pct:.0f}%)")

    if status.status in ("completed", "partial", "failed"):
        break

    time.sleep(interval)
    interval = min(interval * 1.5, 15)  # max 15s

# Fetch results
result = client.get_crawl(crawl.crawl_id, include_results=True)

# Extract and save data
products = []
for page in result.pages or []:
    if page.status == "succeeded" and page.result:
        extracted = page.result.get("extracted", {})
        if extracted.get("name"):
            products.append({
                "url": page.url,
                **extracted,
            })

with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

print(f"Saved {len(products)} products to products.json")
if result.billing:
    print(f"Credits used: {result.billing.credits_used}")
    print(f"Credits refunded: {result.billing.credits_refunded}")