LLM training dataRAG pipeline dataFirecrawl alternative

AI Training Data API

Collect high-quality web data for LLM fine-tuning, RAG pipelines, and knowledge bases. AlterLab returns clean markdown — boilerplate stripped, structure preserved — ready for tokenization. Firecrawl-compatible. From $0.0002/request.

No credit card
SOC 2 aligned
99.9% uptime
Simple Pricing
$1
One dollar
=
5,000
Requests
Pay as you go
No subscriptions
Never expires
2,847,653+
Requests processed this week

Web Data → Clean Markdown → LLM Pipeline

One API call delivers tokenization-ready content. No preprocessing needed.

rag_pipeline.py
import requests

# Collect docs for RAG knowledge base
response = requests.post(
    "https://api.alterlab.io/v1/scrape",
    headers={"X-API-Key": api_key},
    json={
        "url": "https://docs.example.com/guide",
        "formats": ["markdown"],
    }
)

markdown = response.json()["data"]["markdown"]
# Chunk by ## headings → embed → store
crawl_domain.py
# Crawl entire docs site
response = requests.post(
    "https://api.alterlab.io/v1/crawl",
    headers={"X-API-Key": api_key},
    json={
        "url": "https://docs.example.com",
        "formats": ["markdown"],
        "max_depth": 3,
        "limit": 500
    }
)
# Returns list of markdown docs

Firecrawl-Compatible API

AlterLab uses the same request and response format as Firecrawl. If your AI pipeline already uses Firecrawl, switching to AlterLab requires only a base URL and API key change — no code modifications for standard scrape and batch operations. AlterLab typically offers lower per-page costs, particularly for JavaScript-rendered documentation sites.

Building an AI Data Pipeline with AlterLab

From raw URLs to tokenized training data or embedded knowledge base.

1

Define Data Sources and Scope

Identify the domains relevant to your model's target domain: technical documentation, domain-specific forums, news archives, Wikipedia mirrors, or proprietary knowledge bases. For RAG, prioritize authoritative, frequently updated sources. For fine-tuning, prioritize diverse, high-quality writing with consistent structure.

2

Collect Clean Markdown at Scale

Use the crawl endpoint for entire domains (up to thousands of pages) or the batch endpoint for curated URL lists. Request markdown format — AlterLab strips navigation, ads, cookie banners, and boilerplate, preserving only the substantive content with heading structure intact. Anti-bot protection is handled automatically for documentation sites, CMS platforms, and content-heavy sites.

3

Chunk and Process for Your Use Case

For RAG: split by h2/h3 headings for semantically coherent chunks, embed with your vector model, and store in a vector database (Pinecone, Weaviate, Chroma). For fine-tuning: convert to instruction-response pairs or chat format, apply quality filters, and format for your training framework (Axolotl, TRL, LLaMA-Factory).

4

Refresh Continuously

Web content changes — documentation updates, new articles, forum discussions. Schedule weekly or monthly crawls to keep your knowledge base current. AlterLab's consistent markdown output means your chunking and embedding pipeline can run unchanged on refreshed content.

AI Data Collection Use Cases

What AI engineers and researchers build with AlterLab.

Domain-Specific Fine-Tuning

Collect high-quality domain content (legal, medical, financial, technical) for fine-tuning LLMs on specialized tasks. AlterLab's markdown preserves structure critical for instruction-following training.

RAG Knowledge Bases

Build retrieval-augmented generation systems grounded in your product documentation, support articles, and knowledge base. Crawl entire docs sites in one API call.

Synthetic Data Generation

Collect diverse web content as seed material for generating synthetic training examples. AlterLab's clean markdown reduces noise in the seed corpus.

Benchmark Dataset Construction

Scrape question-answer pairs, forum discussions, and expert commentary for building evaluation benchmarks across specific domains or tasks.

Pre-Training Corpus Augmentation

Supplement common crawl datasets with targeted, high-quality content from specific domains, time periods, or source types.

Multimodal Data Collection

Collect text paired with images, code paired with documentation, or tables paired with explanations — all from the same page with consistent formatting.

AI Training Data API — FAQ

Your first scrape.
Sixty seconds.

$1 free balance. No credit card. No SDK.Just a POST request.

terminal
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'

No credit card required · Up to 5,000 free scrapes · Balance never expire