Structured Extraction

Extract structured data from web pages using pre-built profiles or custom JSON schemas. Turn messy HTML into clean, predictable JSON.

How It Works

AlterLab extracts data according to your schema specifications. Pre-built profiles work out of the box for common page types, or define custom schemas for exact control over field names and types.

Extraction Methods

  • Pre-built Profiles (Easiest): ready-to-use templates for common data types
  • JSON Schema (Most Control): define exact structure with types and validation
  • Natural Language (Most Flexible, coming soon): describe what you want in plain English

Pre-built Profiles

Use pre-defined extraction profiles for common page types. These are optimized schemas that work out of the box.

| Profile | Extracted Fields | Best For |
|---|---|---|
| product | name, price, description, images, ratings, availability | E-commerce product pages |
| article | title, author, date, content, summary | News, blogs, documentation |
| job_posting | title, company, location, salary, requirements | Job boards, career pages |
| faq | questions, answers, categories | FAQ pages, help centers |
| recipe | name, ingredients, instructions, time, servings | Recipe websites |
| event | name, date, location, description, organizer | Event pages, calendars |
| auto | Automatically detected | Unknown page types |
import requests

# Extract product data using the product profile
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product"
    }
)

data = response.json()
product = data["extracted"]

print(f"Name: {product['name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product['rating']}")

Custom JSON Schema

Define exactly what data you want using JSON Schema. This gives you full control over field names, types, and structure.

import requests

# Define a custom schema for competitor pricing
schema = {
    "type": "object",
    "properties": {
        "product_name": {
            "type": "string",
            "description": "The full product name"
        },
        "current_price": {
            "type": "number",
            "description": "Current price in USD"
        },
        "original_price": {
            "type": "number",
            "description": "Original price before discount, if any"
        },
        "discount_percent": {
            "type": "number",
            "description": "Discount percentage if on sale"
        },
        "in_stock": {
            "type": "boolean",
            "description": "Whether the product is currently in stock"
        },
        "shipping_info": {
            "type": "string",
            "description": "Shipping time or availability"
        }
    },
    "required": ["product_name", "current_price", "in_stock"]
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://competitor.com/product/xyz",
        "extraction_schema": schema
    }
)

data = response.json()
pricing = data["extracted"]

print(f"Product: {pricing['product_name']}")
print(f"Price: ${pricing['current_price']}")
print(f"In Stock: {pricing['in_stock']}")

Schema Tips

  • Use description fields to guide the AI
  • Mark important fields as required
  • Use specific types (number vs string) for proper formatting
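Before sending a schema, it can help to sanity-check extraction results against it locally. A minimal sketch (not part of the AlterLab API, and not a full JSON Schema validator) that verifies required fields and basic types:

```python
def check_against_schema(extracted, schema):
    """Lightweight local check: required fields present, types match.
    Illustrative only; use a real JSON Schema validator for production."""
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "array": list, "object": dict}
    errors = []
    for field in schema.get("required", []):
        if field not in extracted:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in extracted and extracted[field] is not None:
            if not isinstance(extracted[field], type_map[spec["type"]]):
                errors.append(f"{field}: expected {spec['type']}")
    return errors

schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "current_price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["product_name", "current_price", "in_stock"],
}

result = {"product_name": "Widget", "current_price": "19.99"}  # price wrongly a string
print(check_against_schema(result, schema))
```

A check like this catches type drift early (here, a price extracted as a string) before it propagates into downstream pipelines.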

Natural Language Prompts

Coming Soon

Natural language extraction is currently in development and will be available soon. Until then, use pre-built profiles or a custom JSON schema for structured extraction.

Just describe what you want in plain English. Great for quick extraction or when you're not sure of the exact structure.

import requests

# Simple natural language extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://techblog.example.com/article/123",
        "extraction_prompt": "Extract the article title, author name, publication date, and a 2-sentence summary of the main points"
    }
)

data = response.json()
print(data["extracted"])

# More complex extraction
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/category/laptops",
        "extraction_prompt": """
        Extract all laptops listed on this page. For each laptop, get:
        - Product name
        - Price (as a number)
        - Key specs (RAM, storage, processor)
        - Whether it's on sale
        Return as a list of objects.
        """
    }
)

Combining Methods

Coming Soon

Use a schema for structure and a prompt for additional guidance:

import requests

schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "headquarters": {"type": "string"},
        "founded_year": {"type": "integer"},
        "key_products": {
            "type": "array",
            "items": {"type": "string"}
        },
        "leadership": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "title": {"type": "string"}
                }
            }
        }
    }
}

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://company.example.com/about",
        "extraction_schema": schema,
        "extraction_prompt": "Focus on the 'About Us' and 'Leadership' sections. Only include C-level executives in the leadership array."
    }
)

Evidence Mode

Enable evidence mode to see exactly where each extracted value came from in the source HTML:

response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://shop.example.com/product/123",
        "extraction_profile": "product",
        "evidence": True  # Enable evidence tracking
    }
)

data = response.json()

# Each field now includes provenance
for field, value in data["extracted"].items():
    if isinstance(value, dict) and "evidence" in value:
        print(f"{field}: {value['value']}")
        print(f"  Source: {value['evidence'][:100]}...")
    else:
        print(f"{field}: {value}")

Use Cases for Evidence

  • Debugging extraction issues
  • Verifying data accuracy
  • Audit trails for compliance
  • Building training data for ML models
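For the audit-trail use case, evidence-mode results can be flattened into log lines. A hedged sketch, assuming each evidence-tracked field has the `{'value': ..., 'evidence': ...}` shape shown above:

```python
import json

def audit_records(url, extracted):
    """Yield one JSON audit line per extracted field.
    Assumes evidence-tracked fields look like {'value': ..., 'evidence': ...}."""
    for field, value in extracted.items():
        if isinstance(value, dict) and "evidence" in value:
            yield json.dumps({"url": url, "field": field,
                              "value": value["value"],
                              "evidence": value["evidence"][:80]})
        else:
            yield json.dumps({"url": url, "field": field, "value": value})

extracted = {"name": {"value": "Widget", "evidence": "<h1>Widget</h1>"},
             "price": 19.99}
for line in audit_records("https://shop.example.com/product/123", extracted):
    print(line)
```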

Real-World Examples

E-commerce Price Monitoring

{
  "url": "https://amazon.com/dp/B0123456789",
  "extraction_schema": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "price": {"type": "number"},
      "rating": {"type": "number"},
      "review_count": {"type": "integer"},
      "availability": {"type": "string"},
      "seller": {"type": "string"}
    }
  }
}
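A monitoring job built on this payload would typically compare each run's `extracted` object against the last known price. A sketch of that comparison step (the 5% alert threshold and the response shape are assumptions based on the schema above):

```python
def price_change(previous, extracted, threshold_pct=5.0):
    """Return an alert string if the price moved more than threshold_pct
    since the last run, else None. `extracted` is the `extracted` object
    from a scrape response using the schema above."""
    current = extracted.get("price")
    if current is None or previous is None:
        return None
    change = (current - previous) / previous * 100
    if abs(change) >= threshold_pct:
        direction = "up" if change > 0 else "down"
        return (f"{extracted.get('title', 'item')}: price {direction} "
                f"{abs(change):.1f}% (${previous} -> ${current})")
    return None

print(price_change(100.0, {"title": "Widget", "price": 89.0}))
# Widget: price down 11.0% ($100.0 -> $89.0)
```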

News Article Analysis

Coming Soon
{
  "url": "https://news.example.com/article/123",
  "extraction_prompt": "Extract the headline, author, publication date, main topics covered, sentiment (positive/negative/neutral), and a 100-word summary"
}

Job Board Scraping

Coming Soon
{
  "url": "https://jobs.example.com/listing/456",
  "extraction_profile": "job_posting",
  "extraction_prompt": "Also extract: required years of experience, remote work policy, and tech stack mentioned"
}

Best Practices

1. Start with Profiles

Pre-built profiles are optimized and tested. Use them when they match your use case, then customize with prompts if needed.

2. Be Specific in Prompts

"Extract the price" is less effective than "Extract the current sale price in USD as a number, ignoring shipping costs."

3. Use Schemas for Consistency

When scraping many pages, schemas ensure consistent field names and types across all results.
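Because every result shares the schema's field names, results drop straight into tabular storage. A sketch using sample results in place of real scrape responses, with column order taken from the schema:

```python
import csv, io

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
}

# Stand-ins for data["extracted"] from several scrape responses
results = [
    {"title": "Laptop A", "price": 999.0},
    {"title": "Laptop B"},  # price missing on this page
]

fields = list(schema["properties"])  # column order comes from the schema
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
for row in results:
    writer.writerow({f: row.get(f) for f in fields})
print(buf.getvalue())
```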

4. Handle Missing Data

Not all pages have all fields. Check for null values and handle gracefully.
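A small accessor makes the null-handling explicit; a sketch using the pricing fields from the custom-schema example:

```python
def safe_field(extracted, field, default=None):
    """Read a field from an extracted object, tolerating absent or null values."""
    value = extracted.get(field)
    return default if value is None else value

pricing = {"product_name": "Widget", "current_price": 19.99, "original_price": None}
print(safe_field(pricing, "original_price", default=19.99))   # falls back: value is null
print(safe_field(pricing, "shipping_info", default="unknown"))  # falls back: field absent
```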

5. Test on Sample Pages First

Before running large batches, test your extraction on a few pages to verify the output matches expectations.
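One way to structure that test run is to scrape a handful of sample pages and measure how often each expected field comes back non-null. A sketch of the reporting step, with sample results standing in for real responses:

```python
def field_coverage(samples, expected_fields):
    """For a handful of sample results, return the fraction of pages
    that produced each expected field (non-null)."""
    coverage = {}
    for field in expected_fields:
        hits = sum(1 for s in samples if s.get(field) is not None)
        coverage[field] = hits / len(samples)
    return coverage

# Stand-ins for data["extracted"] from three sample pages
samples = [
    {"title": "A", "price": 10.0, "rating": 4.5},
    {"title": "B", "price": 12.0, "rating": None},
    {"title": "C", "price": None, "rating": 4.0},
]
print(field_coverage(samples, ["title", "price", "rating"]))
```

Low coverage on a field usually means the prompt or schema description needs sharpening before the full batch runs.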