OCR (Optical Character Recognition)

OCR converts images containing text into machine-readable text, enabling scrapers to extract data from images, scanned PDFs, and canvas-rendered content.

Some web content deliberately obfuscates text by rendering it as an image, canvas element, or SVG path rather than HTML text. The technique is used to deter scraping or to publish contact details (phone numbers, email addresses) in a form that crawlers cannot read directly. OCR recovers this text by analysing the pixel patterns of the rendered image and recognising character shapes.

Tesseract is the most widely used open-source OCR engine, supporting over 100 languages. Cloud-based alternatives (Google Cloud Vision, AWS Textract, Azure Computer Vision) often offer higher accuracy, especially on degraded or handwritten text. Modern multimodal LLMs (GPT-4o, Claude 3.5 Sonnet) can also extract text from images passed as base64-encoded inputs.
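As a minimal sketch of the base64 step only, the helper below encodes raw image bytes into a data URL using the Python standard library. The function name `to_data_url` is hypothetical, and whether a given provider expects a data URL or a bare base64 string (and in which request field) is an assumption to check against that provider's API documentation.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # b64encode returns bytes; decode to ASCII so it can be embedded in JSON.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string can be placed in the image field of a multimodal request; some APIs accept only the part after the comma, so keep the raw base64 value available if needed.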

In scraping pipelines, OCR is best applied selectively: attempt to extract text directly from the DOM first, and fall back to OCR only for elements that yield no text content or that are detected as images. OCR is significantly slower and more resource-intensive than DOM-based extraction.
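The DOM-first strategy described above might be sketched as follows. The function `extract_text` and its parameters are hypothetical names for illustration; the OCR engine is passed in as a callable so the sketch does not assume any particular library.

```python
from typing import Callable, Optional, Tuple


def extract_text(
    dom_text: Optional[str],
    image_bytes: Optional[bytes],
    ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is "dom", "ocr", or "none"."""
    # Cheap path: use the text already present in the DOM, if any.
    text = (dom_text or "").strip()
    if text:
        return text, "dom"
    # Expensive path: run OCR only when the DOM yielded nothing
    # and an image of the element is available.
    if image_bytes:
        return ocr(image_bytes).strip(), "ocr"
    return "", "none"
```

Returning the source alongside the text lets a pipeline log how often the slower OCR path is taken and flag OCR-derived values for extra validation.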

Examples

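# Tesseract OCR on a screenshot of a page element.
# Requires the Tesseract binary plus the pytesseract and Pillow packages.
from PIL import Image
import pytesseract

# Assumes the element screenshot was already captured (e.g. with Playwright)
# and saved as element_screenshot.png.
img = Image.open("element_screenshot.png")
# --psm 6 tells Tesseract to treat the image as a single uniform block of text.
text = pytesseract.image_to_string(img, config="--psm 6")
print(text)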
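The capture step mentioned in the example can be combined with OCR in memory, without writing a file. This is a sketch under assumptions: Playwright (with a Chromium build), Pillow, pytesseract, and the Tesseract binary are installed, and the function names `screenshot_element`, `ocr_png`, and `ocr_element` are hypothetical. The third-party imports are deferred into the functions so the module loads even where those packages are missing.

```python
import io


def screenshot_element(url: str, selector: str) -> bytes:
    """Return a PNG screenshot of the first element matching selector."""
    from playwright.sync_api import sync_playwright  # deferred import

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # Locator.screenshot() returns the PNG image as bytes.
            return page.locator(selector).first.screenshot()
        finally:
            browser.close()


def ocr_png(png_bytes: bytes, psm: int = 6) -> str:
    """Run Tesseract on in-memory PNG bytes and return the recognised text."""
    from PIL import Image  # deferred imports
    import pytesseract

    img = Image.open(io.BytesIO(png_bytes))
    return pytesseract.image_to_string(img, config=f"--psm {psm}").strip()


def ocr_element(url, selector, capture=screenshot_element, recognise=ocr_png):
    """Capture an element and OCR it; both steps can be swapped out."""
    return recognise(capture(url, selector))
```

A call such as `ocr_element("https://example.com", "h1")` would launch a headless browser, capture the heading, and return its recognised text; the selector and URL here are placeholders.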
