Some web content deliberately obfuscates text by rendering it as an image, canvas element, or SVG path rather than as HTML text. The technique is used to deter scraping or to embed contact details (phone numbers, email addresses) in a form that text-based crawlers cannot read. Optical character recognition (OCR) recovers this text by analysing the rendered pixels and recognising character shapes; canvas and SVG content must first be rasterised to a bitmap before an OCR engine can process it.
Tesseract is the most widely used open-source OCR engine, supporting over 100 languages. Cloud-based alternatives (Google Cloud Vision, Amazon Textract, Azure AI Vision, formerly Azure Computer Vision) often offer higher accuracy, especially on degraded or handwritten text. Modern multimodal LLMs (GPT-4o, Claude 3.5 Sonnet) can also extract text from images passed as base64-encoded inputs.
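As a minimal sketch of the base64 step mentioned above, the helper below turns raw image bytes into a `data:` URL using only the standard library. The function name `image_to_data_url` and the default MIME type are assumptions made for illustration. Providers differ in how they expect the payload to be wrapped (some take a data URL, others take the bare base64 string plus a media type), so the request format should be checked against the relevant API reference.

```python
import base64


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL.

    `mime_type` is an assumption supplied by the caller; it is not
    detected from the bytes.
    """
    if not mime_type.startswith("image/"):
        raise ValueError(f"not an image MIME type: {mime_type!r}")
    # b64encode returns bytes; decode to ASCII so the result can be
    # embedded in a JSON request body.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The part after the comma is the bare base64 string, so APIs that want the encoded data and the media type as separate fields can be served by splitting the result or by calling `base64.b64encode` directly.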
In scraping pipelines, OCR is applied selectively: first attempt to extract text directly from the DOM, and fall back to OCR only for elements that yield no text content or that are detected as images. The reason is cost: OCR is significantly slower and more resource-intensive than DOM-based extraction.
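The DOM-first strategy described above can be sketched as follows. This is an illustrative outline rather than a complete pipeline: `extract_text`, `load_image`, and `tesseract_ocr` are hypothetical names, the HTML handling is limited to plain text nodes and `<img>` tags (it does not skip `<script>` or `<style>` content, and it ignores canvas and SVG), and the default OCR backend assumes the third-party `pytesseract` and `Pillow` packages plus a local Tesseract installation.

```python
from html.parser import HTMLParser
from typing import Callable


class _TextAndImages(HTMLParser):
    """Collects non-empty text nodes and <img> sources from an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.image_sources: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_sources.append(src)


def tesseract_ocr(image_bytes: bytes) -> str:
    """OCR backend; imports are lazy so the module loads without them."""
    import io

    import pytesseract  # third-party, assumed installed
    from PIL import Image  # third-party (Pillow), assumed installed

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes))).strip()


def extract_text(
    fragment: str,
    load_image: Callable[[str], bytes],
    ocr: Callable[[bytes], str] = tesseract_ocr,
) -> str:
    """Return DOM text if any exists; otherwise OCR the fragment's images."""
    parser = _TextAndImages()
    parser.feed(fragment)
    parser.close()
    if parser.text_parts:
        # Cheap path: the element already exposes text, so OCR is skipped.
        return " ".join(parser.text_parts)
    # Expensive path: no text nodes, so fetch and OCR each image.
    return " ".join(ocr(load_image(src)) for src in parser.image_sources)
```

Passing `load_image` and `ocr` as callables keeps the fallback decision separate from how images are fetched and recognised, so a cloud OCR service or a multimodal model could be swapped in without touching the extraction logic.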