Planning with Map
Use the Map endpoint to discover a site's structure before committing credits to scraping. Map costs just 1 credit per call and returns every URL it finds.
Why Map First?
Save Credits
Instead of blindly scraping an entire site, map it first to identify exactly which pages you need. Scrape only what matters.
Understand Structure
See how pages are organized, what URL patterns exist, and how deep the site goes — before writing any scraping logic.
Find Hidden Pages
Compare link discovery with sitemap parsing. Some pages only appear in one source — map catches what manual exploration misses.
Plan Batch Jobs
Filter map results by URL pattern, then pipe the filtered list directly into a batch scrape for efficient processing.
Discover Site Structure
Start by mapping a site with default settings. This crawls links up to 3 levels deep and returns up to 100 URLs:
curl -X POST https://api.alterlab.io/api/v1/map \
-H "X-API-Key: your_api_key" \
-H "Content-Type: application/json" \
-d '{"url": "https://store.com"}'

For larger sites, increase max_pages and max_depth:
{
"url": "https://store.com",
"max_pages": 1000,
"max_depth": 5,
"include_metadata": true
}

The response includes a depth field for each URL, showing how many clicks it is from the starting page. Use this to understand the site hierarchy.
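As a quick illustration, you can tally the discovered URLs by depth to see how the hierarchy fans out. This is a sketch, assuming the Python SDK (shown in the full workflow below) returns the same urls list as the HTTP API, with a depth field on each URL object, and that max_depth is accepted as a keyword argument like max_pages:

import alterlab
from collections import Counter

client = alterlab.AlterLab(api_key="your_api_key")
site_map = client.map("https://store.com", max_pages=1000, max_depth=5)  # max_depth kwarg assumed

# Count how many URLs sit at each click-depth from the starting page.
depth_counts = Counter(u["depth"] for u in site_map["urls"])
for depth, count in sorted(depth_counts.items()):
    print(f"depth {depth}: {count} URLs")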
Filter Map Results
Use include_patterns and exclude_patterns to narrow results before they are returned:
{
"url": "https://store.com",
"max_pages": 500,
"include_patterns": ["/products/*", "/categories/*"],
"exclude_patterns": ["/products/discontinued/*", "*.pdf"]
}

Filter Server-Side
Pattern filters are applied during the crawl, so excluded pages never reach your response. Filtering client-side still works, but it transfers more data.
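Through the Python SDK, the same filtered map might look like the sketch below. include_patterns appears as a keyword argument in the workflow later on this page; exclude_patterns is assumed to be accepted the same way:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
site_map = client.map(
    "https://store.com",
    max_pages=500,
    include_patterns=["/products/*", "/categories/*"],
    exclude_patterns=["/products/discontinued/*", "*.pdf"],  # assumed kwarg, mirrors the JSON body
)
print(f"{site_map['total_urls']} URLs matched the patterns")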
Search for Specific Pages
The search parameter lets you find pages by keyword relevance. Results include a relevance_score (0 to 1) and are sorted by relevance:
# Find pricing-related pages on a docs site
curl -X POST https://api.alterlab.io/api/v1/map \
-H "X-API-Key: your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.example.com",
"search": "pricing enterprise plan",
"max_pages": 20,
"include_metadata": true
}'

Search is useful when you need a specific page but do not know the exact URL. It is faster and cheaper than scraping every page and searching the content.
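A Python equivalent might look like this sketch, assuming the SDK passes search through as a keyword argument; the relevance_score field on each URL is part of the response described above:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
results = client.map(
    "https://docs.example.com",
    search="pricing enterprise plan",  # assumed kwarg, mirrors the JSON body
    max_pages=20,
    include_metadata=True,
)

# Results arrive sorted by relevance; inspect the top matches.
for u in results["urls"][:5]:
    print(f"{u['relevance_score']:.2f}  {u['url']}")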
Sitemap vs Link Discovery
| Aspect | Link Discovery (default) | Sitemap Mode |
|---|---|---|
| Speed | Slower — must fetch and parse each page | Faster — parses a single XML file |
| Coverage | Finds pages reachable via navigation | Finds pages the owner lists as canonical |
| Orphan Pages | Misses pages with no inbound links | Catches orphans if they are in the sitemap |
| Dynamic Sites | Better for SPAs and JS-rendered navigation | May miss pages not in static sitemap |
| Best For | E-commerce, forums, user-generated content | Blogs, documentation, news sites |
For best coverage, run map twice — once with link discovery and once with sitemap mode — then merge the results. The source field in each URL object tells you how it was discovered.
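Here is a sketch of that two-pass merge in Python (two calls, so 2 credits). It assumes the sitemap: true option from the tips below is exposed as a sitemap keyword argument:

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")

# Pass 1: link discovery (the default). Pass 2: sitemap mode.
link_map = client.map("https://store.com", max_pages=1000)
sitemap_map = client.map("https://store.com", max_pages=1000, sitemap=True)  # assumed kwarg

link_urls = {u["url"] for u in link_map["urls"]}
sitemap_urls = {u["url"] for u in sitemap_map["urls"]}

print(f"Total unique URLs: {len(link_urls | sitemap_urls)}")
print(f"Only via links:    {len(link_urls - sitemap_urls)}")
print(f"Only via sitemap:  {len(sitemap_urls - link_urls)}")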
Map + Batch Scrape Workflow
The most common pattern is to map a site, filter the URLs, and then send them to batch scrape:
1. Map the site. Call POST /api/v1/map to discover all URLs. Cost: 1 credit.
2. Filter client-side. Apply your own logic to select which URLs to scrape. Filter by URL pattern, depth, metadata, or relevance score.
3. Batch scrape. Send the filtered URL list to POST /api/v1/batch for parallel processing. You pay per URL scraped, not per URL discovered.
Full Python Workflow
import time
from datetime import datetime, timedelta, timezone

import alterlab

client = alterlab.AlterLab(api_key="your_api_key")
# Step 1: Map the site (1 credit)
site_map = client.map(
"https://store.com",
max_pages=500,
include_patterns=["/products/*"],
include_metadata=True,
)
print(f"Discovered {site_map['total_urls']} product URLs")
# Step 2: Filter client-side
# Only scrape products updated in the last 30 days
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
recent_urls = []
for u in site_map["urls"]:
modified = u.get("metadata", {}).get("last_modified")
if modified:
mod_date = datetime.fromisoformat(modified.replace("Z", "+00:00"))
if mod_date > cutoff:
recent_urls.append(u["url"])
print(f"Filtered to {len(recent_urls)} recently updated products")
# Step 3: Batch scrape the filtered URLs
if recent_urls:
batch = client.batch_scrape(
urls=[
{"url": u, "formats": ["json"], "extraction_schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"in_stock": {"type": "boolean"},
},
}}
for u in recent_urls[:100] # Max 100 per batch
],
)
print(f"Batch {batch['batch_id']} submitted")
# Poll for results
while True:
status = client.get_batch_status(batch["batch_id"])
if status["status"] != "processing":
break
time.sleep(2)
for item in status["items"]:
if item["status"] == "succeeded":
data = item["result"].get("json", {})
        print(f" {data.get('name')}: ${data.get('price')}")

Full Node.js Workflow
import AlterLab from "@alterlab/sdk";
const client = new AlterLab({ apiKey: "your_api_key" });
// Step 1: Map the site (1 credit)
const siteMap = await client.map("https://store.com", {
maxPages: 500,
includePatterns: ["/products/*"],
includeMetadata: true,
});
console.log(`Discovered ${siteMap.totalUrls} product URLs`);
// Step 2: Filter client-side
const cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
const recentUrls = siteMap.urls
.filter((u) => {
const modified = u.metadata?.lastModified;
return modified && new Date(modified) > cutoff;
})
.map((u) => u.url);
console.log(`Filtered to ${recentUrls.length} recently updated products`);
// Step 3: Batch scrape the filtered URLs
if (recentUrls.length > 0) {
const batch = await client.batchScrape({
urls: recentUrls.slice(0, 100).map((url) => ({
url,
formats: ["json"],
extractionSchema: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
inStock: { type: "boolean" },
},
},
})),
});
console.log(`Batch ${batch.batchId} submitted`);
// Poll for results
let status;
do {
await new Promise((r) => setTimeout(r, 2000));
status = await client.getBatchStatus(batch.batchId);
} while (status.status === "processing");
for (const item of status.items) {
if (item.status === "succeeded") {
const data = item.result?.json ?? {};
console.log(` ${data.name}: $${data.price}`);
}
}
}

Tips & Best Practices
- Start small. Use max_pages: 50 to quickly check URL patterns before running a full map.
- Use include_patterns server-side. Filtering during the crawl keeps your response small. Client-side filtering still works but transfers more data.
- Enable metadata selectively. include_metadata increases response time because the crawler must fetch each page's HTML. Skip it for initial exploration.
- Compare discovery methods. Run map once with link discovery and once with sitemap: true to find pages that appear in only one source.
- Respect robots.txt. The default is respect_robots: true. Only disable it if you have explicit permission from the site owner.
- Cache map results. Site structures do not change often. Map once, save the URL list, and re-scrape from that list on a schedule; a sketch follows this list.
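A minimal caching sketch, using only the Python standard library. The cache file path and the 24-hour freshness window are arbitrary choices for illustration:

import json
import time
from pathlib import Path

import alterlab

CACHE = Path("site_map_cache.json")  # illustrative location
MAX_AGE = 24 * 60 * 60  # re-map at most once a day (arbitrary window)

client = alterlab.AlterLab(api_key="your_api_key")

if CACHE.exists() and time.time() - CACHE.stat().st_mtime < MAX_AGE:
    urls = json.loads(CACHE.read_text())  # reuse the cached list, 0 credits
else:
    site_map = client.map("https://store.com", max_pages=500)
    urls = [u["url"] for u in site_map["urls"]]
    CACHE.write_text(json.dumps(urls))  # 1 credit spent, cached for next run

print(f"{len(urls)} URLs ready to re-scrape")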