extraction

OCR (Optical Character Recognition)

OCR converts images containing text into machine-readable text, enabling scrapers to extract data from images, scanned PDFs, and canvas-rendered content.

Some web content deliberately obfuscates text by rendering it as an image, canvas element, or SVG path rather than HTML text. Sites do this to hinder scraping, or to publish contact details (phone numbers, email addresses) in a form that crawlers cannot read. OCR recovers this text by analysing pixel patterns and recognising character shapes. Because it works on pixels, canvas and SVG content must first be rasterised, for example by taking a screenshot of the element.

Tesseract is the most widely used open-source OCR engine and supports over 100 languages. Cloud-based alternatives (Google Cloud Vision, Amazon Textract, Azure AI Vision, formerly Azure Computer Vision) often offer higher accuracy, especially for degraded or handwritten text. Modern multimodal LLMs (GPT-4o, Claude 3.5 Sonnet) can also extract text from images passed as base64-encoded inputs.
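
The base64 step mentioned above can be sketched as follows. This is a minimal illustration of the encoding only: the helper names image_to_base64 and image_to_data_url are hypothetical, and the request format a given LLM API expects (raw base64 string, data URL, or something else) varies by provider and is not shown.

```python
import base64


def image_to_base64(image_bytes: bytes) -> str:
    """Return the raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap the base64 payload in a data URL (hypothetical helper).

    Some APIs accept this form; others want only the bare base64 string
    plus a separate media-type field. Check the provider's documentation.
    """
    return f"data:{mime_type};base64,{image_to_base64(image_bytes)}"
```

Typical use would be to read a saved screenshot in binary mode (open(path, "rb").read()) and pass the bytes to one of these helpers before building the API request.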

In scraping pipelines, OCR is applied selectively: first attempt to extract text directly from the DOM; only fall back to OCR for elements that yield no text content or that are detected as images. OCR is significantly slower and more resource-intensive than DOM-based extraction.
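
A minimal sketch of this DOM-first, OCR-fallback logic is shown here. The function names extract_text and tesseract_ocr are hypothetical, and it is assumed that the caller has already read the element's text from the DOM and saved a screenshot of the element to disk. The OCR backend is passed in as a callable so that the cheap path does not depend on any OCR library.

```python
from typing import Callable, Optional, Tuple


def extract_text(
    dom_text: Optional[str],
    screenshot_path: str,
    ocr: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is "dom" or "ocr"."""
    # Cheap path: use the text the DOM already provides.
    if dom_text and dom_text.strip():
        return dom_text.strip(), "dom"
    # Expensive path: reached only when the element yielded no text.
    return ocr(screenshot_path).strip(), "ocr"


def tesseract_ocr(path: str) -> str:
    """One possible OCR backend, using pytesseract (imported lazily)."""
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)
```

A call such as extract_text(element_text, "element_screenshot.png", tesseract_ocr) would then run Tesseract only for elements whose DOM text is empty. Returning the source alongside the text makes it easy to log how often the slower OCR path is taken.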

Examples

# Tesseract OCR on a screenshot of a page element
from PIL import Image
import pytesseract

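# Assumes the element was already captured (for example with Playwright)
# and saved to element_screenshot.png; this snippet only performs the OCR step.
with Image.open("element_screenshot.png") as img:
    # --psm 6 tells Tesseract to treat the image as a single uniform block of text
    text = pytesseract.image_to_string(img, config="--psm 6")

# Strip the trailing whitespace that Tesseract usually appends
print(text.strip())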