Planning with Map
Use the Map endpoint to discover a site's structure before committing credits to scraping. Map costs just 1 credit per call and returns every URL it finds.
Why Map First?
Save Credits
Instead of blindly scraping an entire site, map it first to identify exactly which pages you need. Scrape only what matters.
Understand Structure
See how pages are organized, what URL patterns exist, and how deep the site goes — before writing any scraping logic.
Find Hidden Pages
Compare link discovery with sitemap parsing. Some pages only appear in one source — map catches what manual exploration misses.
Plan Batch Jobs
Filter map results by URL pattern, then pipe the filtered list directly into a batch scrape for efficient processing.
Discover Site Structure
Start by mapping a site with default settings. This crawls links up to 3 levels deep and returns up to 100 URLs:
curl -X POST https://api.alterlab.io/api/v1/map \
-H "X-API-Key: your_api_key" \
-H "Content-Type: application/json" \
-d '{"url": "https://store.com"}'

For larger sites, increase max_pages and max_depth:
{
"url": "https://store.com",
"max_pages": 1000,
"max_depth": 5,
"include_metadata": true
}

The response includes a depth field for each URL, showing how many clicks it is from the starting page. Use this to understand the site hierarchy.
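As a quick illustration, you can tally the discovered URLs by depth to see how the hierarchy fans out. This is a sketch, assuming the Python SDK (shown in the full workflow below) returns the same urls list as the HTTP API, with a depth field on each URL object, and that max_depth is accepted as a keyword argument like max_pages:

import alterlab
from collections import Counter

client = alterlab.AlterLab(api_key="your_api_key")
site_map = client.map("https://store.com", max_pages=1000, max_depth=5)  # max_depth kwarg assumed

# Count how many URLs sit at each click-depth from the starting page.
depth_counts = Counter(u["depth"] for u in site_map["urls"])
for depth, count in sorted(depth_counts.items()):
    print(f"depth {depth}: {count} URLs")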
Filter Map Results
Use include_patterns and exclude_patterns to narrow results before they are returned:
{
"url": "https://store.com",
"max_pages": 500,
"include_patterns": ["/products/*", "/categories/*"],
"exclude_patterns": ["/products/discontinued/*", "*.pdf"]
}

Filter Server-Side
Pattern filters are applied during the crawl, so excluded pages never reach your response. Filtering client-side still works, but it transfers more data.
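Through the Python SDK, the same filtered map might look like the sketch below. include_patterns appears as a keyword argument in the workflow later on this page; exclude_patterns is assumed to be accepted the same way:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
site_map = client.map(
    "https://store.com",
    max_pages=500,
    include_patterns=["/products/*", "/categories/*"],
    exclude_patterns=["/products/discontinued/*", "*.pdf"],  # assumed kwarg, mirrors the JSON body
)
print(f"{site_map['total_urls']} URLs matched the patterns")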
Search for Specific Pages
The search parameter lets you find pages by keyword relevance. Results include a relevance_score (0 to 1) and are sorted by relevance:
# Find pricing-related pages on a docs site
curl -X POST https://api.alterlab.io/api/v1/map \
-H "X-API-Key: your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"search": "pricing enterprise plan",
"max_pages": 20,
"include_metadata": true
}'

Search is useful when you need a specific page but do not know the exact URL. It is faster and cheaper than scraping every page and searching the content.
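A Python equivalent might look like this sketch, assuming the SDK passes search through as a keyword argument; the relevance_score field on each URL is part of the response described above:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
results = client.map(
    "https://docs.example.com",
    search="pricing enterprise plan",  # assumed kwarg, mirrors the JSON body
    max_pages=20,
    include_metadata=True,
)

# Results arrive sorted by relevance; inspect the top matches.
for u in results["urls"][:5]:
    print(f"{u['relevance_score']:.2f}  {u['url']}")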
Sitemap vs Link Discovery
| Aspect | Link Discovery (default) | Sitemap Mode |
|---|---|---|
| Speed | Slower — must fetch and parse each page | Faster — parses a single XML file |
| Coverage | Finds pages reachable via navigation | Finds pages the owner lists as canonical |
| Orphan Pages | Misses pages with no inbound links | Catches orphans if they are in the sitemap |
| Dynamic Sites | Better for SPAs and JS-rendered navigation | May miss pages not in static sitemap |
| Best For | E-commerce, forums, user-generated content | Blogs, documentation, news sites |
For best coverage, run map twice — once with link discovery and once with sitemap mode — then merge the results. The source field in each URL object tells you how it was discovered.
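Here is a sketch of that two-pass merge in Python (two calls, so 2 credits). It assumes the sitemap: true option from the tips below is exposed as a sitemap keyword argument:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")

# Pass 1: link discovery (the default). Pass 2: sitemap mode.
link_map = client.map("https://store.com", max_pages=1000)
sitemap_map = client.map("https://store.com", max_pages=1000, sitemap=True)  # assumed kwarg

link_urls = {u["url"] for u in link_map["urls"]}
sitemap_urls = {u["url"] for u in sitemap_map["urls"]}

print(f"Total unique URLs: {len(link_urls | sitemap_urls)}")
print(f"Only via links:    {len(link_urls - sitemap_urls)}")
print(f"Only via sitemap:  {len(sitemap_urls - link_urls)}")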
Map + Batch Scrape Workflow
The most common pattern is to map a site, filter the URLs, and then send them to batch scrape:
1. Map the site. Call POST /api/v1/map to discover all URLs. Cost: 1 credit.
2. Filter client-side. Apply your own logic to select which URLs to scrape. Filter by URL pattern, depth, metadata, or relevance score.
3. Batch scrape. Send the filtered URL list to POST /api/v1/batch for parallel processing. You pay per URL scraped, not per URL discovered.
Full Python Workflow
import time
from datetime import datetime, timedelta, timezone

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
# Step 1: Map the site (1 credit)
site_map = client.map(
"https://store.com",
max_pages=500,
include_patterns=["/products/*"],
include_metadata=True,
)
print(f"Discovered {site_map['total_urls']} product URLs")
# Step 2: Filter client-side
# Only scrape products updated in the last 30 days
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
recent_urls = []
for u in site_map["urls"]:
modified = u.get("metadata", {}).get("last_modified")
if modified:
mod_date = datetime.fromisoformat(modified.replace("Z", "+00:00"))
if mod_date > cutoff:
recent_urls.append(u["url"])
print(f"Filtered to {len(recent_urls)} recently updated products")
# Step 3: Batch scrape the filtered URLs
if recent_urls:
batch = client.batch_scrape(
urls=[
{"url": u, "formats": ["json"], "extraction_schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"},
},
}}
for u in recent_urls[:100] # Max 100 per batch
],
)
print(f"Batch {batch['batch_id']} submitted")
# Poll for results
while True:
status = client.get_batch_status(batch["batch_id"])
if status["status"] != "processing":
break
time.sleep(2)
for item in status["items"]:
if item["status"] == "succeeded":
data = item["result"].get("json", {})
        print(f" {data.get('name')}: ${data.get('price')}")

Full Node.js Workflow
import AlterLab from "@alterlab/sdk";
const client = new AlterLab({ apiKey: "your_api_key" });
// Step 1: Map the site (1 credit)
const siteMap = await client.map("https://store.com", {
maxPages: 500,
includePatterns: ["/products/*"],
includeMetadata: true,
});
console.log(`Discovered ${siteMap.totalUrls} product URLs`);
// Step 2: Filter client-side
const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
const recentUrls = siteMap.urls
.filter((u) => {
const modified = u.metadata?.lastModified;
return modified && new Date(modified) > cutoff;
})
.map((u) => u.url);
console.log(`Filtered to ${recentUrls.length} recently updated products`);
// Step 3: Batch scrape the filtered URLs
if (recentUrls.length > 0) {
const batch = await client.batchScrape({
urls: recentUrls.slice(0, 100).map((url) => ({
url,
formats: ["json"],
extractionSchema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
inStock: { type: "boolean" },
},
},
})),
});
console.log(`Batch ${batch.batchId} submitted`);
// Poll for results
let status;
do {
await new Promise((r) => setTimeout(r, 2000));
status = await client.getBatchStatus(batch.batchId);
} while (status.status === "processing");
for (const item of status.items) {
if (item.status === "succeeded") {
const data = item.result?.json ?? {};
console.log(` ${data.name}: $${data.price}`);
}
}
}

Tips & Best Practices
- Start small. Use max_pages: 50 to quickly check URL patterns before running a full map.
- Use include_patterns server-side. Filtering during the crawl keeps your response small. Client-side filtering still works but transfers more data.
- Enable metadata selectively. include_metadata increases response time because the crawler must fetch each page's HTML. Skip it for initial exploration.
- Compare discovery methods. Run map once with link discovery and once with sitemap: true to find pages that appear in only one source.
- Respect robots.txt. The default is respect_robots: true. Only disable it if you have explicit permission from the site owner.
- Cache map results. Site structures do not change often. Map once, save the URL list, and re-scrape from that list on a schedule; a sketch follows this list.
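A minimal caching sketch, using only the Python standard library. The cache file path and the 24-hour freshness window are arbitrary choices for illustration:

import json
import time
from pathlib import Path

import alterlab

CACHE = Path("site_map_cache.json")  # illustrative location
MAX_AGE = 24 * 60 * 60  # re-map at most once a day (arbitrary window)

client = alterlab.AlterLab(api_key="your_api_key")

if CACHE.exists() and time.time() - CACHE.stat().st_mtime < MAX_AGE:
    urls = json.loads(CACHE.read_text())  # reuse the cached list, 0 credits
else:
    site_map = client.map("https://store.com", max_pages=500)
    urls = [u["url"] for u in site_map["urls"]]
    CACHE.write_text(json.dumps(urls))  # 1 credit spent, cached for next run

print(f"{len(urls)} URLs ready to re-scrape")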