Regex Extraction

Regex extraction uses regular expressions to match and capture specific text patterns — such as prices, dates, or identifiers — from raw HTML or text content.

Regular expressions (regex) are pattern-matching rules that can locate and extract specific text sequences from a larger string. In web scraping, regex is used when the data follows a predictable format that can be described as a pattern: phone numbers, email addresses, product SKUs, prices, dates, or embedded JSON objects.
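
As a minimal sketch of pattern-based extraction, the snippet below pulls a price, a date, and a SKU out of a plain string with Python's built-in `re` module. The sample text, the SKU format, and all three patterns are illustrative assumptions; real pages need patterns tuned to their own formats.

```python
import re

# Hypothetical sample text; real input would come from a scraped page
text = "Widget Pro (SKU: WP-2041) now $1,299.99, offer ends 2024-03-15."

# Price: a dollar sign, digits with optional thousands separators, optional cents
PRICE_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")
# Date: four-digit year, two-digit month, two-digit day
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
# SKU: assumed format of two uppercase letters, a hyphen, and four digits
SKU_RE = re.compile(r"\b[A-Z]{2}-\d{4}\b")

# Strip the thousands separator before converting the captured price
price_match = PRICE_RE.search(text)
price = float(price_match.group(1).replace(",", "")) if price_match else None

# Convert the three captured date groups to integers
date_match = DATE_RE.search(text)
date_parts = tuple(int(p) for p in date_match.groups()) if date_match else None

# findall returns every non-overlapping match as a list of strings
skus = SKU_RE.findall(text)

print(price, date_parts, skus)
```

Checking each match for `None` before reading its groups keeps the code from raising when a page omits a field.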

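Regex is most useful as a secondary extraction tool: first parse the HTML with a proper parser like BeautifulSoup or an XPath engine to narrow down to the relevant section, then apply regex to the resulting text to extract the precise value. Applying regex directly to raw HTML is error-prone because HTML is nested and variable: a pattern that works on one page can break when whitespace, attribute order, or quoting changes, and it can match digits or words inside tag names and attributes that are not part of the data.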
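
The two-step approach can be sketched as follows. To keep the example dependency-free it swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup or XPath; the `PriceTextCollector` class, the `price` class name, and the sample markup are all hypothetical.

```python
import re
from html.parser import HTMLParser


class PriceTextCollector(HTMLParser):
    """Collects text inside elements whose class list includes 'price'.

    Simplification: assumes no void elements (such as <br>) inside the
    target element, since those have no end tag to balance the depth count.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0   # nesting depth inside a matching element
        self.chunks = []  # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "price" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


# Hypothetical product markup
html = """
<div class="product">
  <h2>Widget 2000</h2>
  <span class="price">Now only <b>$49.95</b></span>
  <p>Ships in 3 days.</p>
</div>
"""

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

# Step 1: narrow down to the text of the price element
collector = PriceTextCollector()
collector.feed(html)
price_text = "".join(collector.chunks)

# Step 2: apply the regex to the narrowed text only
narrowed = NUMBER_RE.findall(price_text)

# For comparison: the same regex on raw HTML also picks up unrelated digits
raw = NUMBER_RE.findall(html)

print(narrowed, len(raw))
```

On the raw markup the same pattern also matches the digit in the `h2` tag name, the model number, and the shipping time, which is why narrowing first lets the regex stay simple.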

Common regex extraction patterns in scraping include extracting the content of a JavaScript variable from a `<script>` block, capturing a number embedded in a CSS class name, or finding all URLs matching a specific path pattern in an href attribute.

Examples

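import json
import re

from bs4 import BeautifulSoup

# Assumed sample input; in practice `page` is the parsed HTML of the fetched page
html = '<script>window.__data__ = {"sku": "WP-2041", "price": 49.95};</script>'
page = BeautifulSoup(html, "html.parser")

# Extract a JSON object assigned to a JS variable
script = page.find("script", string=re.compile(r"window\.__data__"))
if script:
    # Non-greedy capture: stops at the first "};" after the assignment
    match = re.search(r"window\.__data__\s*=\s*(\{.*?\});", script.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))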
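
The other two patterns mentioned earlier, a number embedded in a CSS class name and URLs matching a path pattern, might look like the sketch below. It assumes the attribute values have already been pulled out by an HTML parser; the `stars-N` class convention, the `/products/<id>/<slug>` URL layout, and the sample values are hypothetical.

```python
import re

# Hypothetical attribute values, as a parser would return them
class_attr = "rating stars-4 highlighted"
hrefs = [
    "/products/1042/blue-widget",
    "/about",
    "https://example.com/products/77/red-widget?ref=home",
    "/products/new",
]

# Number embedded in a class name: "stars-" followed by digits,
# bounded by whitespace or the ends of the attribute value
STARS_RE = re.compile(r"(?:^|\s)stars-(\d+)(?=\s|$)")
stars_match = STARS_RE.search(class_attr)
stars = int(stars_match.group(1)) if stars_match else None

# URLs matching a path pattern: /products/<numeric id>/<slug>
PRODUCT_PATH_RE = re.compile(r"/products/(\d+)/[\w-]+")
product_ids = []
for href in hrefs:
    path_match = PRODUCT_PATH_RE.search(href)
    if path_match:
        product_ids.append(int(path_match.group(1)))

print(stars, product_ids)
```

Anchoring the class pattern on whitespace or the string ends avoids matching a longer class such as `superstars-4`.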