Large Language Models (LLMs) are transformer-based neural networks with billions of parameters, trained primarily by predicting the next token in massive text datasets. The training process imparts broad world knowledge, language understanding, and reasoning capability. Leading models include GPT-4o (OpenAI), Claude 4 (Anthropic), Gemini 2.0 (Google), and open-weight models like Llama 3 and Mistral.
LLMs can be prompted to perform a wide range of tasks without fine-tuning: summarisation, translation, question answering, code generation, classification, and data extraction. In the context of web scraping, LLMs are used to extract structured fields from messy HTML, classify scraped content, summarise long articles, and generate queries for subsequent scraping steps.
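As a rough illustration of the field-extraction use case, the sketch below builds a prompt that asks for a JSON object and then parses the reply defensively. It is only a minimal sketch under stated assumptions: `complete` stands in for whatever function sends a prompt to your model and returns its text reply, and `FIELDS`, `build_prompt`, `extract_fields`, and `fake_complete` are hypothetical names invented for this example, not part of any provider's API.

```python
import json
from typing import Callable

# Hypothetical target schema, chosen only for this example.
FIELDS = ["title", "price", "currency"]


def build_prompt(html_snippet: str, fields: list[str]) -> str:
    """Ask the model for a single JSON object with exactly the given fields."""
    return (
        "Extract the following fields from the HTML below and reply with "
        "a single JSON object and nothing else. Use null for any field "
        "that is not present.\n"
        f"Fields: {', '.join(fields)}\n\n"
        f"HTML:\n{html_snippet}"
    )


def extract_fields(
    html_snippet: str,
    fields: list[str],
    complete: Callable[[str], str],  # assumption: prompt in, reply text out
) -> dict:
    """Run one extraction call and return a dict keyed by the requested fields."""
    raw = complete(build_prompt(html_snippet, fields))
    # Models sometimes wrap the JSON in prose or code fences, so keep only
    # the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("model reply contained no JSON object")
    data = json.loads(raw[start:end + 1])
    # Drop keys that were not requested; fields the model omitted become None.
    return {name: data.get(name) for name in fields}


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return (
        "Here you go:\n"
        '{"title": "Blue Mug", "price": 12.5, "currency": "EUR", "sku": "X1"}'
    )


snippet = '<div class="p"><h1>Blue Mug</h1><span>EUR 12.50</span></div>'
print(extract_fields(snippet, FIELDS, fake_complete))
```

Passing the model call in as a function keeps the parsing logic independent of any particular provider and makes it testable without network access. In practice the parsed values would still need validation (types, ranges, plausibility), since the model can return well-formed JSON that is factually wrong.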
Because LLMs have a finite context window (the maximum number of tokens, prompt and output combined, that they can process in one call), scraped documents that exceed it must be split into chunks before being passed to the model. Inference cost also scales with the number of tokens processed, so it pays to pre-filter content and send only the relevant sections rather than the full page HTML.
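The two steps described here, pre-filtering and chunking, might be sketched as follows using only the Python standard library. This is an illustrative sketch, not a production pipeline: the `SKIP_TAGS` set, the helper names, and the figure of roughly four characters per token are all assumptions made for the example. A real system would count tokens with the target model's own tokenizer and would likely use a more robust HTML parser.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents are rarely worth sending to the model.
SKIP_TAGS = {"script", "style", "noscript", "nav", "footer", "header", "svg"}


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Pre-filter: reduce a page to its text nodes, one per line."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def chunk_text(text: str, max_tokens: int = 1000, chars_per_token: int = 4) -> list[str]:
    """Greedily pack lines into chunks under an approximate token budget.

    The budget is estimated as max_tokens * chars_per_token characters,
    which is only a rough heuristic for English text.
    """
    budget = max_tokens * chars_per_token
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines():
        # Hard-split any single line that is longer than the whole budget.
        pieces = [line[i:i + budget] for i in range(0, len(line), budget)] or [""]
        for piece in pieces:
            if current and size + len(piece) + 1 > budget:
                chunks.append("\n".join(current))
                current, size = [], 0
            current.append(piece)
            size += len(piece) + 1  # +1 accounts for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks


page = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><nav>Home</nav><p>First para.</p><p>Second para.</p></body></html>"
)
text = html_to_text(page)
for i, chunk in enumerate(chunk_text(text, max_tokens=5)):
    print(i, repr(chunk))
```

Splitting on line boundaries keeps each chunk readable for the model, at the cost of slightly uneven chunk sizes. Note that a skipped tag left unclosed in malformed HTML would cause this simple parser to drop everything after it, which is one reason to treat the filter as a starting point only.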