PDF & OCR Extraction

Extract text from PDF documents and images using our document processing and OCR capabilities.

  • PDF Extraction: extract text from PDF documents (+$0.0006)
  • OCR Extraction: extract text from images (+$0.001)

PDF Extraction

Extract text content from PDF documents. Works with both text-based PDFs and scanned documents (using OCR fallback).

Output Formats

Format   | Description                           | Best For
text     | Plain text extraction                 | Data processing, search indexing
markdown | Formatted with headers, lists, tables | Documentation, readable output

PDF Examples

# Extract text from a PDF document
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "mode": "pdf"
  }'

Automatic Detection

When using mode: "auto", AlterLab automatically detects PDF files and processes them appropriately.
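
As a minimal sketch, the same kind of request can be prepared in Python with mode set to "auto". It uses only the standard library; the helper name build_scrape_request is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/api/v1/scrape"


def build_scrape_request(url, api_key, mode="auto"):
    """Build (but do not send) a scrape request. Hypothetical helper."""
    body = json.dumps({"url": url, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# With mode "auto", the service is expected to detect the PDF itself.
req = build_scrape_request("https://example.com/document.pdf", "YOUR_API_KEY")
# To send it: urllib.request.urlopen(req), then JSON-decode the response body.
```

Separating construction from sending makes it easy to inspect the payload before spending credits.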

OCR Extraction

Extract text from images using Optical Character Recognition. Supports multiple languages and image formats.

Supported Languages

Code    | Language
eng     | English
deu     | German
fra     | French
spa     | Spanish
ita     | Italian
por     | Portuguese
nld     | Dutch
rus     | Russian
jpn     | Japanese
kor     | Korean
chi_sim | Chinese (Simplified)
chi_tra | Chinese (Traditional)

OCR Examples

# Extract text from image
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/screenshot.png",
    "mode": "ocr"
  }'

Combined PDF + OCR

For scanned PDFs or PDFs with embedded images, you can enable OCR alongside PDF extraction:

import requests

# Extract text from scanned PDF using OCR
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/scanned-document.pdf",
        "mode": "pdf",
        "advanced": {
            "ocr": True  # Enable OCR for image-based pages
        }
    }
)

response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
data = response.json()
print(f"Credits used: {data['credits_used']}")  # PDF + OCR cost
print(data["content"])

Smart OCR

When OCR is enabled for a PDF, it runs only on pages that have no extractable text. If the PDF already has selectable text, OCR is skipped and the OCR cost is refunded.

Best Practices

1. Choose the Right Mode

  • PDF mode: For actual PDF documents (reports, papers, ebooks)
  • OCR mode: For images (screenshots, photos of text, infographics)
  • PDF + OCR: For scanned PDFs without selectable text
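
The three choices above can be sketched as a small payload builder. This is an illustration only: the function name choose_payload and the list of image extensions are assumptions, not part of the API; the mode values and the advanced.ocr flag are taken from the examples in this guide.

```python
from urllib.parse import urlparse

# Assumed set of image extensions; the guide does not list supported formats.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp")


def choose_payload(url, scanned=False):
    """Pick a request payload from the URL's file extension (hypothetical helper)."""
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        payload = {"url": url, "mode": "pdf"}
        if scanned:
            # Scanned PDFs have no selectable text, so enable OCR.
            payload["advanced"] = {"ocr": True}
        return payload
    if path.endswith(IMAGE_EXTENSIONS):
        return {"url": url, "mode": "ocr"}
    # Unknown type: fall back to automatic detection.
    return {"url": url, "mode": "auto"}


print(choose_payload("https://example.com/scanned-document.pdf", scanned=True))
```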

2. Optimize for Large Documents

  • Use sync: false for documents over 50 pages
  • Set an appropriate timeout (up to 300 seconds for large PDFs)
  • Consider webhooks to be notified when processing completes
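
A rough sketch of how these settings might be applied when building a request body. The 50-page threshold and 300-second ceiling come from the list above; placing "sync" and "timeout" at the top level of the payload, and expressing the timeout in seconds, are assumptions. Webhook configuration is left out because this guide does not name the field.

```python
MAX_TIMEOUT_SECONDS = 300   # Upper limit mentioned for large PDFs
ASYNC_PAGE_THRESHOLD = 50   # Above this page count, switch to sync: false


def build_pdf_payload(url, page_count, timeout_seconds=MAX_TIMEOUT_SECONDS):
    """Return a PDF request body tuned for document size (hypothetical helper)."""
    if not 1 <= timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError("timeout_seconds must be between 1 and 300")
    payload = {"url": url, "mode": "pdf"}
    if page_count > ASYNC_PAGE_THRESHOLD:
        # Assumed field placement: top-level "sync" and "timeout" keys.
        payload["sync"] = False
        payload["timeout"] = timeout_seconds
    return payload


print(build_pdf_payload("https://example.com/annual-report.pdf", page_count=120))
```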

3. Handle Multiple Languages

  • Specify the primary language for best accuracy
  • Use eng+fra syntax for multilingual documents
  • English is the default if no language is specified
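
To illustrate the eng+fra syntax, the sketch below validates codes against the supported list and joins them with "+". The function name language_spec is hypothetical, and the request field that carries this string is not named in this guide, so the example stops at building the string.

```python
# Language codes from the Supported Languages list above.
SUPPORTED_LANGUAGES = (
    "eng", "deu", "fra", "spa", "ita", "por",
    "nld", "rus", "jpn", "kor", "chi_sim", "chi_tra",
)


def language_spec(codes):
    """Join language codes as 'eng+fra'; fall back to English if none are given."""
    if not codes:
        return "eng"  # English is the documented default
    unknown = [code for code in codes if code not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    # dict.fromkeys removes duplicates while keeping the original order,
    # so the primary language stays first.
    return "+".join(dict.fromkeys(codes))


print(language_spec(["eng", "fra"]))
```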

4. Image Quality Matters

  • Higher resolution images yield better OCR results
  • Ensure good contrast between text and background
  • Avoid heavily compressed images (JPEG artifacts reduce accuracy)

Costs Summary

Operation           | Cost     | Notes
PDF Extraction      | +$0.0006 | Per document
OCR Extraction      | +$0.001  | Per image/page
PDF + OCR (scanned) | +$0.0016 | OCR only applied when needed
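
As a worked example of the arithmetic, the sketch below estimates add-on costs from the rates in the table. It treats each row as a flat rate per operation, which is a simplifying assumption: the table does not say whether the combined PDF + OCR rate scales with page count. The function name estimate_cost is hypothetical, and Decimal is used to avoid floating-point rounding on small amounts.

```python
from decimal import Decimal

# Add-on rates in USD, copied from the costs table.
RATES = {
    "pdf": Decimal("0.0006"),      # Per document
    "ocr": Decimal("0.001"),       # Per image/page
    "pdf_ocr": Decimal("0.0016"),  # Scanned PDF with OCR applied
}


def estimate_cost(pdf=0, ocr=0, pdf_ocr=0):
    """Estimate total add-on cost for a batch of operations (hypothetical helper)."""
    counts = {"pdf": pdf, "ocr": ocr, "pdf_ocr": pdf_ocr}
    if any(count < 0 for count in counts.values()):
        raise ValueError("operation counts must be non-negative")
    return sum((RATES[name] * count for name, count in counts.items()), Decimal("0"))


total = estimate_cost(pdf=10, ocr=5, pdf_ocr=2)
print(f"Estimated add-on cost: ${total}")
```

Note that the combined rate equals the sum of the two individual rates (0.0006 + 0.001 = 0.0016).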