
    PDF & OCR Extraction

    Extract text from PDF documents and images using our document processing and OCR capabilities.

    • PDF Extraction: Extract text from PDF documents (+$0.0006)
    • OCR Extraction: Extract text from images (+$0.001)

    PDF Extraction

    Extract text content from PDF documents. Works with both text-based PDFs and scanned documents (using OCR fallback).

    Output Formats

    Format    | Description                            | Best For
    text      | Plain text extraction                  | Data processing, search indexing
    markdown  | Formatted with headers, lists, tables  | Documentation, readable output

    PDF Examples

    Bash
    # Extract text from a PDF document
    curl -X POST https://api.alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/document.pdf",
        "mode": "pdf"
      }'

    Automatic Detection

    With mode: "auto", AlterLab detects PDF files automatically and processes them with PDF extraction.

    OCR Extraction

    Extract text from images using Optical Character Recognition. Supports multiple languages and image formats.

    Supported Languages

    Code     | Language
    eng      | English
    deu      | German
    fra      | French
    spa      | Spanish
    ita      | Italian
    por      | Portuguese
    nld      | Dutch
    rus      | Russian
    jpn      | Japanese
    kor      | Korean
    chi_sim  | Chinese (Simplified)
    chi_tra  | Chinese (Traditional)

    OCR Examples

    Bash
    # Extract text from image
    curl -X POST https://api.alterlab.io/api/v1/scrape \
      -H "X-API-Key: YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/screenshot.png",
        "mode": "ocr"
      }'

    Combined PDF + OCR

    For scanned PDFs or PDFs with embedded images, you can enable OCR alongside PDF extraction:

    Python
    import requests
    
    # Extract text from scanned PDF using OCR
    response = requests.post(
        "https://api.alterlab.io/api/v1/scrape",
        headers={"X-API-Key": "YOUR_API_KEY"},
        json={
            "url": "https://example.com/scanned-document.pdf",
            "mode": "pdf",
            "advanced": {
                "ocr": True  # Enable OCR for image-based pages
            }
        }
    )
    
    response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error body
    data = response.json()
    print(f"Cost: {data['credits_used']}")  # PDF + OCR cost
    print(data["content"])

    Smart OCR

    When OCR is enabled for PDFs, we apply it only to pages that have no extractable text. If the PDF already has selectable text, OCR is skipped and its cost is refunded.
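
    The page-level rule described above can be modeled in a few lines. This is only a conceptual sketch, not AlterLab's implementation: the function name `pages_needing_ocr` and the idea of representing each page by its extracted text layer are assumptions made for illustration.

```python
def pages_needing_ocr(page_texts):
    """Return 0-based indices of pages whose text layer is empty.

    Hypothetical model: each entry is the text extracted from one page.
    A page with no text (or only whitespace) is treated as image-based.
    """
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Hypothetical 4-page PDF: the second and fourth pages are scans.
pages = ["Quarterly report", "", "Revenue table", "   "]
print(pages_needing_ocr(pages))  # [1, 3]
```

    A fully text-based PDF yields an empty list under this model, which corresponds to the case where OCR is skipped.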

    Best Practices

    1. Choose the Right Mode

    • PDF mode: For actual PDF documents (reports, papers, ebooks)
    • OCR mode: For images (screenshots, photos of text, infographics)
    • PDF + OCR: For scanned PDFs without selectable text
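
    The mode choice above can be sketched as a small helper that builds the request body from a URL. The helper name `build_payload`, the `scanned` flag, and the list of image extensions are assumptions for illustration; the `mode` values and the `advanced.ocr` option are the ones shown in the examples on this page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of image extensions; this page does not list the supported formats.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff", ".webp"}


def build_payload(url, scanned=False):
    """Pick a mode from the URL's file extension and return the request body."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        payload = {"url": url, "mode": "pdf"}
        if scanned:
            # Scanned PDFs have no selectable text, so enable OCR alongside PDF mode.
            payload["advanced"] = {"ocr": True}
        return payload
    if suffix in IMAGE_EXTENSIONS:
        return {"url": url, "mode": "ocr"}
    # Unknown extension: fall back to automatic detection.
    return {"url": url, "mode": "auto"}


print(build_payload("https://example.com/scanned-document.pdf", scanned=True))
```

    The returned dictionary can be passed as the `json` argument of `requests.post`, as in the Python example above.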

    2. Optimize for Large Documents

    • Use sync: false for documents over 50 pages
    • Set an appropriate timeout (up to 300 seconds for large PDFs)
    • Consider webhooks to get notified when processing completes
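
    As a rough sketch of the first two points, the helper below builds a request body for a large PDF. It assumes that `sync` and `timeout` are top-level fields of the request body and that `timeout` is given in seconds; neither detail is stated on this page, so check the REST API reference before relying on it. The name `build_large_pdf_payload` is hypothetical.

```python
MAX_TIMEOUT_SECONDS = 300   # upper bound mentioned above for large PDFs
ASYNC_PAGE_THRESHOLD = 50   # documents over 50 pages should use sync: false


def build_large_pdf_payload(url, page_count, timeout_seconds=MAX_TIMEOUT_SECONDS):
    """Build a PDF request body, switching to async mode for long documents."""
    if not 0 < timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    # Assumption: "timeout" and "sync" sit at the top level of the request body.
    payload = {"url": url, "mode": "pdf", "timeout": timeout_seconds}
    if page_count > ASYNC_PAGE_THRESHOLD:
        payload["sync"] = False
    return payload


print(build_large_pdf_payload("https://example.com/annual-report.pdf", page_count=120))
```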

    3. Handle Multiple Languages

    • Specify the primary language for best accuracy
    • Join language codes with + (for example, eng+fra) for multilingual documents
    • English is the default if no language is specified
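
    The language rules above can be captured in a small validator that produces the `eng+fra` style string. The helper name `build_language` is hypothetical, and this page does not say which request field carries the value, so the sketch only builds and validates the string itself.

```python
# Language codes listed under "Supported Languages" on this page.
SUPPORTED_LANGUAGES = {
    "eng", "deu", "fra", "spa", "ita", "por",
    "nld", "rus", "jpn", "kor", "chi_sim", "chi_tra",
}


def build_language(codes=None):
    """Join language codes with '+', primary language first; default to English."""
    if not codes:
        return "eng"
    unknown = [code for code in codes if code not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return "+".join(dict.fromkeys(codes))


print(build_language(["eng", "fra"]))  # eng+fra
```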

    4. Image Quality Matters

    • Higher-resolution images yield better OCR results
    • Ensure good contrast between text and background
    • Avoid heavily compressed images (JPEG artifacts reduce accuracy)

    Costs Summary

    Operation            | Cost     | Notes
    PDF Extraction       | +$0.0006 | Per document
    OCR Extraction       | +$0.001  | Per image/page
    PDF + OCR (scanned)  | +$0.0016 | OCR only applied when needed
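
    The combined figure is the sum of the two surcharges ($0.0006 + $0.001 = $0.0016). The sketch below reproduces that arithmetic and estimates the surcharge for a mixed batch. It is a simplified model: it covers only the add-on amounts in the table, not the base request price, and it uses the table's flat figure for scanned PDFs because per-page billing of multi-page scans is not spelled out here. The function name `estimate_surcharge` is hypothetical.

```python
from decimal import Decimal

PDF_SURCHARGE = Decimal("0.0006")  # per document
OCR_SURCHARGE = Decimal("0.001")   # per image/page
SCANNED_PDF_SURCHARGE = PDF_SURCHARGE + OCR_SURCHARGE

print(SCANNED_PDF_SURCHARGE)  # 0.0016


def estimate_surcharge(pdfs=0, images=0, scanned_pdfs=0):
    """Estimate the total add-on cost in USD for a batch (base price excluded)."""
    return (
        pdfs * PDF_SURCHARGE
        + images * OCR_SURCHARGE
        + scanned_pdfs * SCANNED_PDF_SURCHARGE
    )
```

    `Decimal` is used instead of `float` so that sub-cent amounts add up exactly.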
    Last updated: March 2026
