Large Language Models (LLMs) are transformer-based neural networks with billions of parameters, trained by predicting the next token in massive text datasets. The training process imparts broad world knowledge, language understanding, and reasoning capability. Leading models include GPT-4o (OpenAI), Claude 4 (Anthropic), Gemini 2 (Google), and open-weight models like Llama 3 and Mistral.

LLMs can be prompted to perform a wide range of tasks without fine-tuning: summarisation, translation, question answering, code generation, classification, and data extraction. In the context of web scraping, LLMs are used to extract structured fields from messy HTML, classify scraped content, summarise long articles, and generate queries for subsequent scraping steps.

Because LLMs have a finite context window (the maximum amount of text, measured in tokens, that they can process in one call), large scraped documents must be chunked before being passed to the model. Inference is also priced per token, so it pays to pre-filter content and pass only the relevant sections rather than the entire page HTML.
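The pre-filtering and chunking steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the helper names visible_text and chunk_text are hypothetical, the default sizes are arbitrary, and character counts are used only as a rough stand-in for tokens, since the real token count depends on the model's tokenizer.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Hypothetical pre-filter: reduce page HTML to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_text(text, max_chars=8000, overlap=200):
    """Hypothetical chunker: split text into pieces of at most max_chars
    characters, each sharing `overlap` characters with the previous piece."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```

Each chunk can then be sent to the model in its own call. The overlap is there so that a field straddling a chunk boundary still appears whole in at least one chunk, at the price of sending a few characters twice.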

Examples

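# Use an LLM to extract structured data from scraped HTML
import json

import anthropic

# Example input; in practice this is a pre-filtered fragment of the scraped page
html_snippet = '<div class="product"><h1>Desk Lamp</h1><span class="price">$24.99</span><span class="sku">DL-1001</span></div>'

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Extract product name, price, and SKU from:\n{html_snippet}\nReturn JSON only."}]
)

# The reply is plain text; json.loads raises json.JSONDecodeError if it is not valid JSON
data = json.loads(response.content[0].text)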
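Parsing the reply directly assumes the model returned nothing but JSON. As a hedged sketch, the hypothetical helper below tolerates one common deviation, a reply wrapped in a Markdown code fence, and returns None instead of raising when the text still cannot be parsed. It is an illustration of defensive parsing, not part of the Anthropic SDK.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_json_reply(text):
    """Parse a model reply that should be JSON but may be wrapped in a fence.

    Returns the decoded object, or None if the reply is not valid JSON.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line and the closing fence, if present
        lines = cleaned.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        cleaned = "\n".join(lines)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

With this in place, the final line of the example could become data = parse_json_reply(response.content[0].text), followed by a check for None before the fields are used.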
