PDF & OCR Extraction

Extract text from PDF documents and images using our document processing and OCR capabilities.

Feature	Description	Cost
PDF Extraction	Extract text from PDF documents	+$0.0006
OCR Extraction	Extract text from images	+$0.001

PDF Extraction

Extract text content from PDF documents. Works with both text-based PDFs and scanned documents (using OCR fallback).

Output Formats

Format	Description	Best For
`text`	Plain text extraction	Data processing, search indexing
`markdown`	Formatted with headers, lists, tables	Documentation, readable output

PDF Examples

Bash

# Extract text from a PDF document
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "mode": "pdf"
  }'

Automatic Detection

When using mode: "auto", AlterLab automatically detects PDF files and processes them appropriately.
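As a rough Python counterpart to the curl example, the sketch below builds the same POST request with `"mode": "auto"` so that detection is left to AlterLab. The endpoint, header name, and `url`/`mode` fields come from the examples in this guide; the helper name `build_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/api/v1/scrape"


def build_request(document_url: str, api_key: str, mode: str = "auto") -> urllib.request.Request:
    """Build (but do not send) a scrape request.

    build_request is a hypothetical helper; only the endpoint, the
    X-API-Key header, and the "url"/"mode" fields come from this guide.
    """
    body = json.dumps({"url": document_url, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_request("https://example.com/document.pdf", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually send it, pass `req` to `urllib.request.urlopen` with a real API key.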

OCR Extraction

Extract text from images using Optical Character Recognition. Supports multiple languages and image formats.

Supported Languages

Code	Language
`eng`	English
`deu`	German
`fra`	French
`spa`	Spanish
`ita`	Italian
`por`	Portuguese
`nld`	Dutch
`rus`	Russian
`jpn`	Japanese
`kor`	Korean
`chi_sim`	Chinese (Simplified)
`chi_tra`	Chinese (Traditional)

OCR Examples

Bash

# Extract text from image
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/screenshot.png",
    "mode": "ocr"
  }'

Combined PDF + OCR

For scanned PDFs or PDFs with embedded images, you can enable OCR alongside PDF extraction:

Python

import requests

# Extract text from scanned PDF using OCR
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/scanned-document.pdf",
        "mode": "pdf",
        "advanced": {
            "ocr": True  # Enable OCR for image-based pages
        }
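    },
    timeout=300,  # Client-side timeout in seconds; large scans can be slow
)
response.raise_for_status()  # Fail on HTTP errors before parsing the body

data = response.json()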
print(f"Cost: {data['credits_used']}")  # PDF + OCR cost
print(data["content"])

Smart OCR

When OCR is enabled for a PDF, it is applied only to pages that have no extractable text. If the PDF already has selectable text, OCR is skipped and the OCR cost is refunded.
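The per-page rule can be pictured with a small sketch. This illustrates the rule as described above and is not AlterLab's implementation; `pages_needing_ocr` and the sample page texts are hypothetical, and "no extractable text" is assumed to mean empty or whitespace-only.

```python
def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty.

    Hypothetical helper: treats whitespace-only text as "no extractable
    text", which is an assumption rather than documented behavior.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


# Pages 2 and 3 carry no text layer, so only they would go through OCR.
pages = ["Quarterly report", "", "   \n", "Appendix A"]
print(pages_needing_ocr(pages))  # [2, 3]
```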

Best Practices

1. Choose the Right Mode

PDF mode: For actual PDF documents (reports, papers, ebooks)
OCR mode: For images (screenshots, photos of text, infographics)
PDF + OCR: For scanned PDFs without selectable text
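The three cases above could be turned into a request body along these lines. This is a minimal sketch: `choose_request` is a hypothetical helper, the list of image extensions is an assumption (this guide does not enumerate supported image formats), and anything unrecognized falls back to `"auto"`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of image extensions; not taken from the AlterLab docs.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}


def choose_request(url: str, scanned: bool = False) -> dict:
    """Pick a request body based on the URL's file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        body = {"url": url, "mode": "pdf"}
        if scanned:
            # Scanned PDFs have no selectable text, so turn on OCR.
            body["advanced"] = {"ocr": True}
        return body
    if suffix in IMAGE_EXTENSIONS:
        return {"url": url, "mode": "ocr"}
    # Unknown type: let the service detect it.
    return {"url": url, "mode": "auto"}


print(choose_request("https://example.com/report.pdf"))
print(choose_request("https://example.com/scan.pdf", scanned=True))
print(choose_request("https://example.com/screenshot.png"))
```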

2. Optimize for Large Documents

Use sync: false for documents over 50 pages
Set an appropriate timeout (up to 300 seconds for large PDFs)
Consider webhooks for notification when processing completes
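One way to apply the first two points is a helper that derives request options from the page count. The `sync` and `timeout` names and the 50-page and 300-second figures come from the list above; placing them as top-level request fields and the helper name `large_document_options` are assumptions. Webhook configuration is left out because its parameters are not described here.

```python
MAX_TIMEOUT_SECONDS = 300   # upper bound stated in this guide
ASYNC_PAGE_THRESHOLD = 50   # "documents over 50 pages"


def large_document_options(page_count: int, timeout_seconds: int = MAX_TIMEOUT_SECONDS) -> dict:
    """Return extra request fields for a document of the given size.

    Hypothetical helper; assumes "sync" and "timeout" sit at the top
    level of the request body.
    """
    if not 0 < timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    options = {"timeout": timeout_seconds}
    if page_count > ASYNC_PAGE_THRESHOLD:
        options["sync"] = False  # process asynchronously
    return options


body = {"url": "https://example.com/big-report.pdf", "mode": "pdf"}
body.update(large_document_options(page_count=120))
print(body)
```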

3. Handle Multiple Languages

Specify the primary language for best accuracy
Use the eng+fra syntax to combine language codes for multilingual documents
English is the default if no language is specified
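A small helper can build the language string and catch unsupported codes before a request is made. This is a sketch: `ocr_language` is a hypothetical name, and because this guide does not state which request field carries the language value, the helper returns the string only.

```python
# Codes from the Supported Languages list in this guide.
SUPPORTED_LANGUAGES = {
    "eng": "English", "deu": "German", "fra": "French", "spa": "Spanish",
    "ita": "Italian", "por": "Portuguese", "nld": "Dutch", "rus": "Russian",
    "jpn": "Japanese", "kor": "Korean",
    "chi_sim": "Chinese (Simplified)", "chi_tra": "Chinese (Traditional)",
}


def ocr_language(*codes: str) -> str:
    """Join language codes with '+', primary language first.

    Falls back to "eng" when no code is given, mirroring the documented
    default. Duplicate codes are dropped while keeping their order.
    """
    if not codes:
        return "eng"
    unknown = [code for code in codes if code not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    return "+".join(dict.fromkeys(codes))


print(ocr_language())              # eng
print(ocr_language("eng", "fra"))  # eng+fra
```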

4. Image Quality Matters

Higher resolution images yield better OCR results
Ensure good contrast between text and background
Avoid heavily compressed images (JPEG artifacts reduce accuracy)

Costs Summary

Operation	Cost	Notes
PDF Extraction	+$0.0006	Per document
OCR Extraction	+$0.001	Per image/page
PDF + OCR (scanned)	+$0.0016	OCR only applied when needed
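The table can be turned into a quick estimate. The sketch below assumes the PDF charge applies once per document and the OCR charge once per image or page that actually goes through OCR, which is how the "Per document" and "Per image/page" notes read; whether a multi-page scan is billed this way is not confirmed here. `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point drift on small amounts.

```python
from decimal import Decimal

PDF_COST = Decimal("0.0006")  # per document
OCR_COST = Decimal("0.001")   # per image or OCR'd page


def estimate_cost(pdf_documents: int = 0, ocr_pages: int = 0) -> Decimal:
    """Estimate the added cost in dollars (assumed billing model)."""
    return PDF_COST * pdf_documents + OCR_COST * ocr_pages


print(estimate_cost(pdf_documents=1))               # 0.0006
print(estimate_cost(pdf_documents=1, ocr_pages=1))  # 0.0016
print(estimate_cost(ocr_pages=3))                   # 0.003
```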

Last updated: June 2026
