Why is AlterLab's markdown output better for LLM training data?

AlterLab strips navigation, ads, footers, and boilerplate — returning only substantive content as clean, structured markdown. A typical 50KB web page becomes 3–8KB of content that's ready for tokenization without preprocessing. Consistent heading structure and code block formatting are preserved, making chunking strategies more effective for RAG.

Is AlterLab compatible with Firecrawl?

Yes. AlterLab's API is compatible with the Firecrawl request and response format. If you're currently using Firecrawl, you can switch to AlterLab by changing the base URL and API key — no code changes needed for basic scrape and batch operations. AlterLab typically offers lower per-page costs, especially for JavaScript-heavy sites.

How do I collect an entire domain for training data?

Use AlterLab's crawl endpoint: submit a root URL, and it discovers and scrapes all linked pages up to your specified depth. It returns a list of markdown documents suitable for dataset construction. Combine it with the batch endpoint to collect hundreds of thousands of pages in parallel.

What types of web content work best for LLM training data?

High-quality, domain-specific content performs best: technical documentation, academic papers, forum discussions (Stack Overflow, Reddit), news articles, product reviews, and industry blogs. Avoid thin pages, duplicate content, and pages behind authentication. AlterLab's markdown output preserves content hierarchy, which helps models learn document structure.
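The advice about thin and duplicate pages can be applied as a simple post-collection filter. The sketch below is one possible approach, not an AlterLab feature: the function names and the 200-word threshold are illustrative assumptions, and it only catches exact duplicates after whitespace and case normalization.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_documents(docs: list[str], min_words: int = 200) -> list[str]:
    """Drop thin pages and normalized exact duplicates, keeping first occurrences.

    min_words is an assumed threshold; tune it for your corpus.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if len(doc.split()) < min_words:
            continue  # thin page
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Near-duplicate detection (for example MinHash) would catch more, at the cost of extra dependencies.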

How do I handle JavaScript-rendered content for AI data collection?

AlterLab automatically detects JavaScript rendering requirements and escalates to the headless browser tier. Request `"formats": ["markdown"]` — the same API call works for both static and dynamic pages. You don't need to configure per-site scraping strategies.

What does large-scale AI data collection cost with AlterLab?

At $0.0002/request for static pages, collecting 100,000 documents costs $20. For JavaScript-rendered documentation sites at $0.004/request, the same volume costs $400. Most AI training datasets use a mix: static pages at Tier 1 ($0.0002) for news and forums, headless browser at Tier 4 ($0.004) for SPAs and docs sites.
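The arithmetic above extends to a mixed dataset. This is a minimal sketch using the two per-request prices quoted above; the function name and the single "headless share" parameter are illustrative assumptions, not part of AlterLab's API.

```python
STATIC_PRICE = 0.0002   # USD per request, Tier 1 (static pages)
HEADLESS_PRICE = 0.004  # USD per request, Tier 4 (headless browser)


def estimate_cost(total_pages: int, headless_share: float) -> float:
    """Estimate collection cost in USD for a mix of static and headless pages.

    headless_share is the fraction of pages (0.0 to 1.0) that need JavaScript rendering.
    """
    if not 0.0 <= headless_share <= 1.0:
        raise ValueError("headless_share must be between 0.0 and 1.0")
    headless_pages = round(total_pages * headless_share)
    static_pages = total_pages - headless_pages
    total = static_pages * STATIC_PRICE + headless_pages * HEADLESS_PRICE
    return round(total, 2)
```

For example, 100,000 pages with a quarter of them needing the headless tier works out to 75,000 × $0.0002 + 25,000 × $0.004 = $115.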

LLM training data

RAG pipeline data

Firecrawl alternative

AI Training Data API

Collect high-quality web data for LLM fine-tuning, RAG pipelines, and knowledge bases. AlterLab returns clean markdown — boilerplate stripped, structure preserved — ready for tokenization. Firecrawl-compatible. From $0.0002/request.

No credit card

SOC 2 aligned

99.9% uptime

Simple Pricing

One dollar: 5,000 requests

Pay as you go

No subscriptions

Never expires

2,847,653+ requests processed this week

Web Data → Clean Markdown → LLM Pipeline

One API call delivers tokenization-ready content. No preprocessing needed.

rag_pipeline.py

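import os

import requests

# Read the API key from the environment (the variable name is an assumption)
api_key = os.environ["ALTERLAB_API_KEY"]

# Collect docs for RAG knowledge base
response = requests.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": api_key},
    json={
        "url": "https://docs.example.com/guide",
        "formats": ["markdown"],
    },
    timeout=60,
)
response.raise_for_status()  # fail loudly on HTTP errors

markdown = response.json()["data"]["markdown"]
# Chunk by ## headings → embed → store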

crawl_domain.py

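import os

import requests

# Read the API key from the environment (the variable name is an assumption)
api_key = os.environ["ALTERLAB_API_KEY"]

# Crawl entire docs site
response = requests.post(
    "https://api.alterlab.io/v1/crawl",
    headers={"X-API-Key": api_key},
    json={
        "url": "https://docs.example.com",
        "formats": ["markdown"],
        "max_depth": 3,
        "limit": 500,
    },
)
response.raise_for_status()  # fail loudly on HTTP errors
# Returns list of markdown docs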
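
Once a crawl has returned its documents, a common next step is to persist them as JSONL for dataset construction. The sketch below assumes each document is a dict with `markdown` and `url` keys; that shape is a guess for illustration, since the crawl response schema is not shown here, and `write_jsonl` is a hypothetical helper.

```python
import json
from pathlib import Path


def write_jsonl(docs: list[dict], path: str) -> int:
    """Write one JSON record per non-empty document and return the count written.

    Assumes each doc looks like {"url": ..., "markdown": ...} (hypothetical shape).
    """
    written = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for doc in docs:
            text = (doc.get("markdown") or "").strip()
            if not text:
                continue  # skip pages that produced no content
            record = {"source": doc.get("url", ""), "text": text}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Keeping the source URL alongside the text makes later deduplication and attribution easier.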

Firecrawl-Compatible API

AlterLab uses the same request and response format as Firecrawl. If your AI pipeline already uses Firecrawl, switching to AlterLab requires only a base URL and API key change — no code modifications for standard scrape and batch operations. AlterLab typically offers lower per-page costs, particularly for JavaScript-rendered documentation sites.
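
If the switch really is just a base URL and key change, it helps to keep both values out of the pipeline code. This is a minimal sketch under that assumption: the environment variable names `SCRAPER_BASE_URL` and `SCRAPER_API_KEY` and both helper functions are hypothetical, and the default base URL is taken from the examples on this page.

```python
import os


def scraper_config(env=None) -> dict:
    """Load provider settings from the environment (variable names are assumptions)."""
    env = os.environ if env is None else env
    base_url = env.get("SCRAPER_BASE_URL", "https://api.alterlab.io/v1")
    return {"base_url": base_url.rstrip("/"), "api_key": env.get("SCRAPER_API_KEY", "")}


def endpoint(config: dict, path: str) -> str:
    """Join the configured base URL with an endpoint path such as 'scrape'."""
    return f"{config['base_url']}/{path.lstrip('/')}"
```

With this layout, changing provider is a deployment setting rather than a code change; authentication header details may still differ between providers and should be checked against each provider's documentation.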

Building an AI Data Pipeline with AlterLab

From raw URLs to tokenized training data or an embedded knowledge base.

Define Data Sources and Scope

Identify the domains relevant to your model's target domain: technical documentation, domain-specific forums, news archives, Wikipedia mirrors, or proprietary knowledge bases. For RAG, prioritize authoritative, frequently updated sources. For fine-tuning, prioritize diverse, high-quality writing with consistent structure.

Collect Clean Markdown at Scale

Use the crawl endpoint for entire domains (up to thousands of pages) or the batch endpoint for curated URL lists. Request markdown format — AlterLab strips navigation, ads, cookie banners, and boilerplate, preserving only the substantive content with heading structure intact. Anti-bot protection is handled automatically for documentation sites, CMS platforms, and content-heavy sites.
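
For curated URL lists, it is worth cleaning the list before submitting it. The sketch below removes fragments and duplicates, then splits the list into fixed-size groups; the helper name and the batch size of 100 are illustrative assumptions, since the batch endpoint's actual limits are not stated here.

```python
from urllib.parse import urldefrag


def prepare_batches(urls: list[str], batch_size: int = 100) -> list[list[str]]:
    """Strip fragments, drop blanks and duplicates, and group URLs into batches.

    batch_size is an assumed value, not a documented limit.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    seen: set[str] = set()
    unique: list[str] = []
    for raw in urls:
        url, _fragment = urldefrag(raw.strip())  # '#section' anchors point at the same page
        if not url or url in seen:
            continue
        seen.add(url)
        unique.append(url)
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

Deduplicating first avoids paying for the same page twice.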

Chunk and Process for Your Use Case

For RAG: split by h2/h3 headings for semantically coherent chunks, embed with your vector model, and store in a vector database (Pinecone, Weaviate, Chroma). For fine-tuning: convert to instruction-response pairs or chat format, apply quality filters, and format for your training framework (Axolotl, TRL, LLaMA-Factory).
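
The heading-based split for RAG can be sketched in a few lines. This is one possible approach, not AlterLab's own chunker: the function name and the output shape are assumptions, and it ignores `##` lines inside fenced code blocks so code samples are not cut in half.

```python
import re

HEADING = re.compile(r"^(#{2,3})\s+(.*)$")  # matches h2 and h3 only
FENCE = "`" * 3  # markdown code fence marker


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown into chunks at h2/h3 headings, skipping empty sections.

    Text before the first h2/h3 becomes a chunk with an empty heading.
    """
    chunks: list[dict] = []
    title = ""
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": title, "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading, which can be stored as metadata alongside the embedding. Very long sections may still need a secondary split by token count.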

Refresh Continuously

Web content changes — documentation updates, new articles, forum discussions. Schedule weekly or monthly crawls to keep your knowledge base current. AlterLab's consistent markdown output means your chunking and embedding pipeline can run unchanged on refreshed content.
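
A scheduled refresh does not have to re-embed everything. One way to limit the work, sketched below, is to store a content hash per URL and compare it on each crawl; the function names and the `url -> markdown` input shape are assumptions for illustration.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Stable fingerprint of a document's markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def diff_crawl(previous: dict[str, str], current_docs: dict[str, str]) -> tuple[dict, dict]:
    """Compare a fresh crawl against the hash index saved from the last run.

    previous maps url -> hash; current_docs maps url -> markdown.
    Returns (changes, new_index), where new_index should be saved for the next run.
    """
    new_index = {url: content_hash(md) for url, md in current_docs.items()}
    changes = {
        "added": sorted(u for u in new_index if u not in previous),
        "changed": sorted(u for u in new_index if u in previous and previous[u] != new_index[u]),
        "removed": sorted(u for u in previous if u not in new_index),
    }
    return changes, new_index
```

Only the `added` and `changed` URLs need re-chunking and re-embedding; `removed` URLs can be deleted from the vector store.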

AI Data Collection Use Cases

What AI engineers and researchers build with AlterLab.

Domain-Specific Fine-Tuning

Collect high-quality domain content (legal, medical, financial, technical) for fine-tuning LLMs on specialized tasks. AlterLab's markdown preserves structure critical for instruction-following training.

RAG Knowledge Bases

Build retrieval-augmented generation systems grounded in your product documentation, support articles, and knowledge base. Crawl entire docs sites in one API call.

Synthetic Data Generation

Collect diverse web content as seed material for generating synthetic training examples. AlterLab's clean markdown reduces noise in the seed corpus.

Benchmark Dataset Construction

Scrape question-answer pairs, forum discussions, and expert commentary for building evaluation benchmarks across specific domains or tasks.

Pre-Training Corpus Augmentation

Supplement Common Crawl datasets with targeted, high-quality content from specific domains, time periods, or source types.

Multimodal Data Collection

Collect text paired with images, code paired with documentation, or tables paired with explanations — all from the same page with consistent formatting.
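
As a small illustration of pairing images with their surrounding text, the sketch below pulls image references out of markdown and tags each with the heading of the section it appears in. It assumes images appear in standard `![alt](url)` syntax; whether a given page's output includes them is not confirmed here, and the function name is hypothetical.

```python
import re

IMAGE = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')  # ![alt](url "optional title")
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def image_text_pairs(markdown: str) -> list[dict]:
    """Return one record per image with its alt text, URL, and enclosing section heading."""
    pairs: list[dict] = []
    section = ""
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            section = heading.group(1).strip()
            continue
        for alt, url in IMAGE.findall(line):
            pairs.append({"section": section, "alt": alt, "url": url})
    return pairs
```

The same pattern extends to code blocks or tables by swapping the regular expression.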