PDF & OCR Extraction
Extract text from PDF documents and images using our document processing and OCR capabilities.
- PDF Extraction: Extract text from PDF documents (+$0.0006)
- OCR Extraction: Extract text from images (+$0.001)
PDF Extraction
Extract text content from PDF documents. Works with both text-based PDFs and scanned documents (using OCR fallback).
Output Formats
| Format | Description | Best For |
|---|---|---|
| text | Plain text extraction | Data processing, search indexing |
| markdown | Formatted with headers, lists, tables | Documentation, readable output |
PDF Examples
# Extract PDF as markdown
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "mode": "pdf"
  }'
Automatic Detection
When using mode: "auto", AlterLab automatically detects PDF files and processes them appropriately.
OCR Extraction
Extract text from images using Optical Character Recognition. Supports multiple languages and image formats.
Supported Languages
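| Code | Language |
|---|---|
| eng | English |
| deu | German |
| fra | French |
| spa | Spanish |
| ita | Italian |
| por | Portuguese |
| nld | Dutch |
| rus | Russian |
| jpn | Japanese |
| kor | Korean |
| chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) |
OCR Examples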
# Extract text from image
curl -X POST https://api.alterlab.io/api/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/screenshot.png",
    "mode": "ocr"
  }'
Combined PDF + OCR
For scanned PDFs or PDFs with embedded images, you can enable OCR alongside PDF extraction:
import requests

# Extract text from scanned PDF using OCR
response = requests.post(
    "https://api.alterlab.io/api/v1/scrape",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "url": "https://example.com/scanned-document.pdf",
        "mode": "pdf",
        "advanced": {
            "ocr": True  # Enable OCR for image-based pages
        },
    },
)
data = response.json()
print(f"Credits used: {data['credits_used']}")  # PDF + OCR cost
print(data["content"])
Smart OCR
When OCR is enabled for a PDF, it is applied only to pages that have no extractable text. If the PDF already has selectable text, OCR is skipped and the OCR cost is refunded.
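The page-level rule above can be pictured with a short sketch. This illustrates the described behavior only and is not AlterLab's implementation; the helper name `pages_needing_ocr` and its per-page input list are hypothetical.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str]) -> List[int]:
    """Return 0-based indexes of pages that have no extractable text.

    Hypothetical helper: `page_texts` holds the text layer already
    extracted from each page ("" or whitespace for image-only pages).
    """
    return [i for i, text in enumerate(page_texts) if not text.strip()]


# Pages 0 and 2 carry a text layer; page 1 is a scanned image.
pages = ["Quarterly report", "   ", "Appendix A"]
print(pages_needing_ocr(pages))  # [1]
```

An empty result corresponds to the case where OCR is skipped entirely.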
Best Practices
1. Choose the Right Mode
- PDF mode: For actual PDF documents (reports, papers, ebooks)
- OCR mode: For images (screenshots, photos of text, infographics)
- PDF + OCR: For scanned PDFs without selectable text
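One way to apply the choices above in client code is to derive the mode from the URL's file extension. This is a minimal sketch: the helper names `choose_mode` and `build_payload` are hypothetical, and the image extension list is an assumption, since this guide does not enumerate the supported image formats. The `mode` values and the `advanced.ocr` flag are the ones shown in the examples on this page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extension list for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"}


def choose_mode(url: str) -> str:
    """Pick "pdf", "ocr", or "auto" from the URL's file extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "ocr"
    # Unknown extension: fall back to automatic detection.
    return "auto"


def build_payload(url: str, scanned: bool = False) -> dict:
    """Build a request body; enable OCR only for scanned PDFs."""
    payload = {"url": url, "mode": choose_mode(url)}
    if payload["mode"] == "pdf" and scanned:
        payload["advanced"] = {"ocr": True}
    return payload


print(build_payload("https://example.com/scanned-document.pdf", scanned=True))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the Combined PDF + OCR example.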
2. Optimize for Large Documents
- Use sync: false for documents over 50 pages
- Set an appropriate timeout (up to 300 seconds for large PDFs)
- Consider webhooks for notification when processing completes
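The first two recommendations could be encoded in a small payload builder like the one below. Treat it as a sketch under assumptions: `sync` is named in this guide, but its top-level placement, the `timeout` field name, and the helper `build_large_pdf_payload` are all hypothetical. Webhook configuration is omitted because its field name is not given here.

```python
MAX_TIMEOUT_SECONDS = 300   # upper bound mentioned in the list above
ASYNC_PAGE_THRESHOLD = 50   # documents over 50 pages should run async


def build_large_pdf_payload(url: str, page_count: int, timeout: int = 120) -> dict:
    """Build a PDF request body sized for the document's page count."""
    payload = {
        "url": url,
        "mode": "pdf",
        # Assumed field name and placement; capped at the documented maximum.
        "timeout": min(timeout, MAX_TIMEOUT_SECONDS),
    }
    if page_count > ASYNC_PAGE_THRESHOLD:
        # Assumed placement of the documented sync option.
        payload["sync"] = False
    return payload


print(build_large_pdf_payload("https://example.com/document.pdf", 120, timeout=600))
```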
3. Handle Multiple Languages
- Specify the primary language for best accuracy
- Use eng+fra syntax for multilingual docs
- English is the default if no language is specified
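A language string in the eng+fra form can be assembled and validated before sending a request. The sketch below uses the language codes listed under Supported Languages; the helper name `build_language_spec` is hypothetical, and the request field that carries this string is not named in this guide, so it is left out.

```python
from typing import Iterable, Optional

SUPPORTED_LANGUAGES = {
    "eng", "deu", "fra", "spa", "ita", "por",
    "nld", "rus", "jpn", "kor", "chi_sim", "chi_tra",
}


def build_language_spec(codes: Optional[Iterable[str]] = None) -> str:
    """Join language codes with "+", primary language first."""
    codes = list(codes or [])
    if not codes:
        return "eng"  # English is the documented default
    unknown = [code for code in codes if code not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"Unsupported language codes: {unknown}")
    # dict.fromkeys drops duplicates while preserving order.
    return "+".join(dict.fromkeys(codes))


print(build_language_spec(["eng", "fra"]))  # eng+fra
```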
4. Image Quality Matters
- Higher resolution images yield better OCR results
- Ensure good contrast between text and background
- Avoid heavily compressed images (JPEG artifacts reduce accuracy)
Costs Summary
| Operation | Cost | Notes |
|---|---|---|
| PDF Extraction | +$0.0006 | Per document |
| OCR Extraction | +$0.001 | Per image/page |
| PDF + OCR (scanned) | +$0.0016 | OCR only applied when needed |
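The surcharges in the table can be combined programmatically when budgeting a batch. This sketch rests on two assumptions: the "+" prefix is read as an add-on to a base request price that this page does not list, so only the extraction surcharge is estimated, and PDF + OCR is treated as the flat per-document figure from the table. The helper name `estimate_surcharge` is hypothetical.

```python
from decimal import Decimal

PDF_COST = Decimal("0.0006")  # per document
OCR_COST = Decimal("0.001")   # per image/page


def estimate_surcharge(mode: str, ocr: bool = False) -> Decimal:
    """Return the extraction surcharge in USD for one request."""
    if mode == "pdf":
        # OCR on a PDF adds the OCR surcharge to the PDF surcharge.
        return PDF_COST + (OCR_COST if ocr else Decimal("0"))
    if mode == "ocr":
        return OCR_COST
    raise ValueError(f"No surcharge defined for mode {mode!r}")


print(estimate_surcharge("pdf", ocr=True))  # 0.0016
```

`Decimal` is used instead of `float` so that sums of small prices stay exact.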