Site Crawling Tutorial
Learn to crawl entire websites step by step. Start with a simple 5-page crawl and build up to production-grade crawling with extraction, webhooks, and cost controls.
Prerequisites
- An AlterLab API key (get one here)
- At least 50 credits in your account
- Python 3.8+ with the alterlab package, or Node.js 18+ with @alterlab/sdk, or cURL
Step 1: Basic Crawl
Start with a small crawl to see how it works. We will crawl 5 pages from a documentation site.
# Start a small crawl
curl -X POST https://api.alterlab.io/api/v1/crawl \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "max_pages": 5,
    "max_depth": 1
  }'
# Save the crawl_id from the response
# {"crawl_id": "abc123", "status": "discovering", ...}
# Poll for status
curl https://api.alterlab.io/api/v1/crawl/abc123 \
  -H "X-API-Key: YOUR_API_KEY"
# When status is "completed", get results
curl "https://api.alterlab.io/api/v1/crawl/abc123?include_results=true" \
-H "X-API-Key: YOUR_API_KEY"You should see 5 pages discovered and scraped. The response includes each page's URL, status, and scraped content.
Step 2: Add URL Filtering
Real websites have thousands of pages. Use URL filtering to crawl only the sections you need. Let us target product pages on an e-commerce site.
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://shop.example.com",
"max_pages": 30,
"max_depth": 2,
"include_patterns": ["/products/*", "/categories/*"],
"exclude_patterns": ["/blog/*", "/account/*", "/cart"]
  }'

Tip: Start Narrow
Begin with tight include patterns and a low max_pages, confirm the crawl is picking up the pages you expect, then widen the patterns and raise the limits.
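If you want to sanity-check your patterns before spending credits, you can approximate the filtering locally. The sketch below assumes the patterns are shell-style globs matched against the URL path, which is how the examples in this tutorial read; the API's exact matching rules are in the Crawl API Reference.

```python
"""Preview which URLs a pattern set keeps, assuming glob matching on the URL path."""
from fnmatch import fnmatch
from urllib.parse import urlparse

include_patterns = ["/products/*", "/categories/*"]
exclude_patterns = ["/blog/*", "/account/*", "/cart"]

def would_crawl(url: str) -> bool:
    """Local approximation: excludes take precedence, then the path must match an include."""
    path = urlparse(url).path
    if any(fnmatch(path, pattern) for pattern in exclude_patterns):
        return False
    return any(fnmatch(path, pattern) for pattern in include_patterns)

for url in [
    "https://shop.example.com/products/blue-widget",
    "https://shop.example.com/blog/announcing-widgets",
    "https://shop.example.com/cart",
]:
    print(f"{url} -> {'crawl' if would_crawl(url) else 'skip'}")
```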
Step 3: Add Structured Extraction
Instead of raw HTML, extract structured product data from every page using an extraction profile or custom schema.
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://shop.example.com",
"max_pages": 30,
"include_patterns": ["/products/*"],
"extraction_schema": {
"name": "string",
"price": "number",
"currency": "string",
"description": "string",
"in_stock": "boolean",
"image_url": "string"
}
  }'
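When the crawl finishes, each successful page carries the extracted record alongside its metadata. Here is a minimal sketch of collecting those records with the Python SDK; the result shape (page.result["extracted"]) follows the complete example at the end of this tutorial.

```python
"""Collect extracted product records from a finished crawl."""
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")

# Use the crawl_id returned when you started the crawl (abc123 in the examples above)
result = client.get_crawl("abc123", include_results=True)

products = []
for page in result.pages or []:
    if page.status != "succeeded" or not page.result:
        continue
    # Fields follow the extraction_schema: name, price, currency, description, in_stock, image_url
    extracted = page.result.get("extracted", {})
    products.append({"url": page.url, **extracted})

print(f"Collected {len(products)} products")
```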
Step 4: Set Up Webhooks
Instead of polling, receive a notification when the crawl completes. This is the recommended approach for production use.
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://shop.example.com",
"max_pages": 100,
"include_patterns": ["/products/*"],
"extraction_profile": "product",
"webhook_url": "https://your-server.com/webhooks/crawl-done"
}'
# Your webhook endpoint will receive:
# POST /webhooks/crawl-done
# {
# "event": "crawl.completed",
# "crawl_id": "abc123",
# "status": "completed",
# "total": 87,
# "completed": 85,
# "failed": 2,
# "credits_used": 142
# }Step 5: Scale Up with Cost Controls
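On your side, the endpoint only needs to accept a POST with that JSON body and respond with a 2xx. Below is a minimal sketch using Flask, chosen purely for brevity; it logs the event and skips signature verification, which is covered in the Webhooks Guide.

```python
"""Minimal receiver for the crawl.completed webhook (Flask used for illustration)."""
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/crawl-done", methods=["POST"])
def crawl_done():
    payload = request.get_json(force=True)
    if payload.get("event") == "crawl.completed":
        print(
            f"Crawl {payload['crawl_id']} finished: "
            f"{payload['completed']}/{payload['total']} pages, "
            f"{payload['failed']} failed, {payload['credits_used']} credits used"
        )
        # Kick off result fetching here, e.g. GET /api/v1/crawl/{crawl_id}?include_results=true

    # Acknowledge quickly with a 2xx; do heavy processing outside the request handler
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)
```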
Step 5: Scale Up with Cost Controls
For production crawls, add cost controls to prevent unexpected spend. Combine max_pages, max_credits, and max_tier for full control.
curl -X POST https://api.alterlab.io/api/v1/crawl \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://shop.example.com",
"max_pages": 500,
"max_depth": 3,
"include_patterns": ["/products/*"],
"exclude_patterns": ["/products/*/reviews", "/products/*/questions"],
"extraction_profile": "product",
"webhook_url": "https://your-server.com/webhooks/crawl-done",
"cost_controls": {
"max_credits": 1000,
"max_tier": "3"
},
"budget": {
"/products/*": 400,
"/categories/*": 50,
"*": 10
}
}'Complete Working Example
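The same request through the Python SDK, with a quick check of the estimate returned when the crawl is created. This is a sketch: cost_controls and the estimate attributes appear in the complete example below, and the budget argument is assumed to be passed the same way as in the REST call above.

```python
"""Start a cost-controlled crawl and sanity-check the estimate against your cap."""
from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
MAX_CREDITS = 1000

crawl = client.crawl(
    url="https://shop.example.com",
    max_pages=500,
    max_depth=3,
    include_patterns=["/products/*"],
    exclude_patterns=["/products/*/reviews", "/products/*/questions"],
    extraction_profile="product",
    webhook_url="https://your-server.com/webhooks/crawl-done",
    cost_controls={"max_credits": MAX_CREDITS, "max_tier": "3"},
    budget={"/products/*": 400, "/categories/*": 50, "*": 10},
)

print(f"Estimated: {crawl.estimated_pages} pages, ~{crawl.estimated_credits} credits")
if crawl.estimated_credits > MAX_CREDITS:
    # max_credits caps actual spend; a high estimate is a hint to narrow the crawl instead
    print("Warning: estimate exceeds max_credits; consider tightening patterns or max_pages")
```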
Complete Working Example
Here is a complete production-ready script that crawls a site, monitors progress, and saves extracted data to a JSON file.
"""Complete crawl example — crawl products and save to JSON."""
import json
import time
from alterlab import AlterLab
API_KEY = "YOUR_API_KEY"
client = AlterLab(api_key=API_KEY)
# Configure the crawl
crawl = client.crawl(
url="https://shop.example.com",
max_pages=200,
max_depth=3,
include_patterns=["/products/*"],
exclude_patterns=["/blog/*", "/account/*"],
extraction_schema={
"name": "string",
"price": "number",
"currency": "string",
"in_stock": "boolean",
},
cost_controls={"max_credits": 500},
)
print(f"Crawl {crawl.crawl_id} started")
print(f"Estimated: {crawl.estimated_pages} pages, {crawl.estimated_credits} credits")
# Poll with exponential backoff
interval = 3
while True:
status = client.get_crawl(crawl.crawl_id)
pct = (status.completed / max(status.total, 1)) * 100
print(f"[{status.status}] {status.completed}/{status.total} ({pct:.0f}%)")
if status.status in ("completed", "partial", "failed"):
break
time.sleep(interval)
interval = min(interval * 1.5, 15) # max 15s
# Fetch results
result = client.get_crawl(crawl.crawl_id, include_results=True)
# Extract and save data
products = []
for page in result.pages or []:
if page.status == "succeeded" and page.result:
extracted = page.result.get("extracted", {})
if extracted.get("name"):
products.append({
"url": page.url,
**extracted,
})
with open("products.json", "w") as f:
json.dump(products, f, indent=2)
print(f"Saved {len(products)} products to products.json")
if result.billing:
print(f"Credits used: {result.billing.credits_used}")
print(f"Credits refunded: {result.billing.credits_refunded}")Next Steps
- Crawl API Reference: full API reference with all parameters, schemas, and error codes.
- Crawling Guide: a deep dive into how crawling works, priority scoring, and advanced patterns.
- Structured Extraction: extraction profiles and custom schemas for pulling structured data.
- Webhooks Guide: how to set up webhook endpoints, handle signatures, and configure retries.