AI Training Data API
Collect high-quality web data for LLM fine-tuning, RAG pipelines, and knowledge bases. AlterLab returns clean markdown — boilerplate stripped, structure preserved — ready for tokenization. Firecrawl-compatible. From $0.0002/request.
Web Data → Clean Markdown → LLM Pipeline
One API call delivers tokenization-ready content. No preprocessing needed.
import requests
# Collect docs for RAG knowledge base
response = requests.post(
"https://api.alterlab.io/v1/scrape",
headers={"X-API-Key": api_key},
json={
"url": "https://docs.example.com/guide",
"formats": ["markdown"],
}
)
markdown = response.json()["data"]["markdown"]
# Chunk by ## headings → embed → store# Crawl entire docs site
response = requests.post(
"https://api.alterlab.io/v1/crawl",
headers={"X-API-Key": api_key},
json={
"url": "https://docs.example.com",
"formats": ["markdown"],
"max_depth": 3,
"limit": 500
}
)
# Returns list of markdown docsFirecrawl-Compatible API
AlterLab uses the same request and response format as Firecrawl. If your AI pipeline already uses Firecrawl, switching to AlterLab requires only a base URL and API key change — no code modifications for standard scrape and batch operations. AlterLab typically offers lower per-page costs, particularly for JavaScript-rendered documentation sites.
Building an AI Data Pipeline with AlterLab
From raw URLs to tokenized training data or embedded knowledge base.
Define Data Sources and Scope
Identify the domains relevant to your model's target domain: technical documentation, domain-specific forums, news archives, Wikipedia mirrors, or proprietary knowledge bases. For RAG, prioritize authoritative, frequently updated sources. For fine-tuning, prioritize diverse, high-quality writing with consistent structure.
Collect Clean Markdown at Scale
Use the crawl endpoint for entire domains (up to thousands of pages) or the batch endpoint for curated URL lists. Request markdown format — AlterLab strips navigation, ads, cookie banners, and boilerplate, preserving only the substantive content with heading structure intact. Anti-bot protection is handled automatically for documentation sites, CMS platforms, and content-heavy sites.
Chunk and Process for Your Use Case
For RAG: split by h2/h3 headings for semantically coherent chunks, embed with your vector model, and store in a vector database (Pinecone, Weaviate, Chroma). For fine-tuning: convert to instruction-response pairs or chat format, apply quality filters, and format for your training framework (Axolotl, TRL, LLaMA-Factory).
Refresh Continuously
Web content changes — documentation updates, new articles, forum discussions. Schedule weekly or monthly crawls to keep your knowledge base current. AlterLab's consistent markdown output means your chunking and embedding pipeline can run unchanged on refreshed content.
AI Data Collection Use Cases
What AI engineers and researchers build with AlterLab.
Domain-Specific Fine-Tuning
Collect high-quality domain content (legal, medical, financial, technical) for fine-tuning LLMs on specialized tasks. AlterLab's markdown preserves structure critical for instruction-following training.
RAG Knowledge Bases
Build retrieval-augmented generation systems grounded in your product documentation, support articles, and knowledge base. Crawl entire docs sites in one API call.
Synthetic Data Generation
Collect diverse web content as seed material for generating synthetic training examples. AlterLab's clean markdown reduces noise in the seed corpus.
Benchmark Dataset Construction
Scrape question-answer pairs, forum discussions, and expert commentary for building evaluation benchmarks across specific domains or tasks.
Pre-Training Corpus Augmentation
Supplement common crawl datasets with targeted, high-quality content from specific domains, time periods, or source types.
Multimodal Data Collection
Collect text paired with images, code paired with documentation, or tables paired with explanations — all from the same page with consistent formatting.
AI Training Data API — FAQ
Related Resources
Web Scraping Pipelines for AI Agents
Optimize data pipelines for LLM consumption — reduce token waste and cost.
JavaScript Rendering API
Collect data from SPAs and JS-rendered documentation sites with headless Chromium.
Other Use Cases
Price monitoring, lead generation, market research — see all AlterLab use cases.
Pricing
From $0.0002/request. Firecrawl-compatible. Balance never expires.
Your first scrape.
Sixty seconds.
$1 free balance. No credit card. No SDK.
Just a POST request.
No credit card required · Up to 5,000 free scrapes · Balance never expire