extraction

XPath

XML Path Language — a query syntax for selecting nodes in HTML or XML using path expressions, more expressive than CSS selectors for complex traversals.

XPath (XML Path Language) is a query language for selecting nodes in XML and HTML documents using path expressions. While CSS selectors match elements by tag name, class, ID, attribute values, and structural relationships, XPath can express more complex traversals: selecting nodes based on their text content, selecting parent or sibling nodes from a known child, counting elements, or applying arithmetic and comparison operations to attribute values.

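XPath expressions follow a path syntax: `//` selects matching nodes at any depth (anywhere in the document when it starts the expression), `/` steps to direct children, `@` accesses attributes, `[]` applies predicates (conditions or positions), and `text()` selects text nodes. For example, `//div[@class='price']/span[1]/text()` selects the text of the first span child of every div whose class attribute is exactly "price". Because `@class='price'` is an exact string comparison, a div with `class="price sale"` would not match; `contains(@class, 'price')` is the usual looser test.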
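The path syntax above can be tried out with lxml. The following is a minimal sketch, not a reference implementation: the HTML fragment and variable names are invented for illustration, and it assumes lxml is installed.

```python
from lxml import html

# Hypothetical HTML fragment, invented for this example.
SAMPLE = """
<html><body>
  <div class="price"><span>19.99</span><span>USD</span></div>
  <div class="price"><span>5.00</span><span>EUR</span></div>
  <div class="note"><span>ignored</span></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# // searches at any depth, [@class='price'] filters on an attribute,
# /span[1] takes the first span child, text() selects its text nodes.
first_span_text = tree.xpath("//div[@class='price']/span[1]/text()")

# A path ending in @class returns attribute values instead of elements.
classes = tree.xpath("//div/@class")

print(first_span_text)
print(classes)
```

With this sample, `first_span_text` holds one string per matching div, and `classes` holds the class value of every div in document order.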

XPath's greater expressiveness makes it valuable where CSS selectors cannot express the required relationship, such as selecting a node based on the text of an adjacent sibling or navigating up the DOM tree. It is the standard in XML processing ecosystems and is supported by lxml, Scrapy, and browser DevTools. AlterLab's structured extraction accepts both CSS selectors and XPath expressions in requests.
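As a rough illustration of those relationships, the sketch below uses lxml to select a cell by its text, step sideways to a sibling, step up to the parent row, and count elements. The table contents and variable names are hypothetical.

```python
from lxml import html

# Hypothetical table, invented for this example.
SAMPLE = """
<html><body>
  <table>
    <tr><th>Name</th><th>Price</th></tr>
    <tr><td>Widget</td><td>19.99</td></tr>
    <tr><td>Gadget</td><td>5.00</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Find the cell whose text is 'Gadget', then move to the next td sibling.
gadget_price = tree.xpath(
    "//td[text()='Gadget']/following-sibling::td[1]/text()"
)

# Navigate up: from the 'Widget' cell to its parent row.
widget_row = tree.xpath("//td[text()='Widget']/parent::tr")[0]
widget_cells = widget_row.xpath("td/text()")

# count() returns a number; lxml hands it back as a Python float.
row_count = tree.xpath("count(//tr)")

print(gadget_price, widget_cells, row_count)
```

Starting from known text and walking to a neighbouring or enclosing node is the pattern that tends to survive markup changes better than a long absolute path.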

Examples

# XPath examples
//div[@class='price']/span/text()    # text of span children of divs whose class is exactly 'price'
//a[contains(@href, 'product')]      # links whose href contains 'product'
//table//tr[position()>1]            # every row except the first in its parent (skips a header row)
//h2[following-sibling::p]           # h2 elements that have a later sibling p
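To see what the expressions above return, they can be run against a small document. This is a sketch under assumptions: the page content is made up, and lxml is used only because it is one of the libraries mentioned earlier.

```python
from lxml import html

# Hypothetical page, invented for this example.
SAMPLE = """
<html><body>
  <h2>Products</h2>
  <p>Intro paragraph.</p>
  <a href="/product/1">Widget</a>
  <a href="/about">About</a>
  <table>
    <tr><th>Name</th></tr>
    <tr><td>Widget</td></tr>
    <tr><td>Gadget</td></tr>
  </table>
  <h2>Footer</h2>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Links whose href contains 'product'; /@href returns the attribute values.
product_links = tree.xpath("//a[contains(@href, 'product')]/@href")

# Rows after the first one, reduced to their cell text.
data_rows = tree.xpath("//table//tr[position()>1]/td/text()")

# Only the h2 that has a p somewhere after it among its siblings.
headings = tree.xpath("//h2[following-sibling::p]/text()")

print(product_links, data_rows, headings)
```

The second h2 is excluded because its only p sibling comes before it, which is why the axis name matters more than the loose phrase "has a sibling".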
