HTML is the backbone of every web page. An HTML document describes a tree of elements. Most elements are delimited by an opening and a closing tag (e.g., `<p>text</p>`), while void elements such as `<br>` and `<img>` have no closing tag. Elements can carry attributes that modify their behaviour or appearance (e.g., `<a href='...'>link</a>`). Browsers parse HTML into the Document Object Model (DOM), an in-memory representation of that tree, and render it according to the associated CSS stylesheets.
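To make the tree structure concrete, here is a minimal sketch that prints an indented outline of a small document using Python's standard-library `html.parser`. The `OutlineParser` class, the `outline` helper, and the `SAMPLE` markup are illustrative names made up for this example, and the sketch assumes well-formed input.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they must not deepen the tree.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class OutlineParser(HTMLParser):
    """Records one indented line per element to show the tree structure."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_text = "".join(f" {name}={value!r}" for name, value in attrs)
        self.lines.append("  " * self.depth + tag + attr_text)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:
            self.depth -= 1


def outline(html):
    """Return the element outline of `html` as a list of indented strings."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical sample markup for illustration.
SAMPLE = "<div id='main'><p>Read the <a href='/docs'>docs</a>.</p><img src='logo.png'></div>"

for line in outline(SAMPLE):
    print(line)
# div id='main'
#   p
#     a href='/docs'
#   img src='logo.png'
```

Each level of indentation corresponds to one level of nesting in the tree, and the attributes appear next to the element that carries them.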
For web scrapers, HTML is the primary raw material. After fetching a page, the scraper parses the HTML, navigates the element tree, and extracts the target data. Libraries such as Python's Beautiful Soup and lxml and Node's Cheerio provide query APIs for locating elements by tag name, class, attribute, or structural position. Beautiful Soup and Cheerio support CSS selectors; lxml supports XPath natively and CSS selectors through the `cssselect` package.
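The parse, navigate, extract sequence can be sketched with the standard library alone. This example uses `xml.etree.ElementTree`, which is an XML parser with a limited XPath subset, so it only works because the hypothetical `PAGE` snippet is deliberately well-formed. On real pages an HTML-aware library such as lxml (whose API is largely ElementTree-compatible) or Beautiful Soup would take its place. The markup, class names, and the `extract_products` helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing standing in for a fetched page.
PAGE = """
<html>
  <body>
    <ul class="products">
      <li><a href="/p/1">Kettle</a><span class="price">24.99</span></li>
      <li><a href="/p/2">Toaster</a><span class="price">19.50</span></li>
    </ul>
  </body>
</html>
"""


def extract_products(markup):
    """Parse the markup and return one dict per product list item."""
    root = ET.fromstring(markup.strip())
    products = []
    # Locate by tag name and attribute: every <li> directly under the
    # <ul> whose class attribute is exactly "products".
    for item in root.findall(".//ul[@class='products']/li"):
        link = item.find("a")
        price = item.find("span[@class='price']")
        products.append({
            "name": link.text,
            "url": link.get("href"),
            "price": float(price.text),
        })
    return products


print(extract_products(PAGE))
# [{'name': 'Kettle', 'url': '/p/1', 'price': 24.99},
#  {'name': 'Toaster', 'url': '/p/2', 'price': 19.5}]

# Locate by structural position: the link inside the second <li>.
second_link = ET.fromstring(PAGE.strip()).find(".//li[2]/a")
print(second_link.text)  # Toaster
```

Note that `[@class='products']` is an exact string comparison on the attribute value, whereas a CSS class selector such as `.products` matches any element whose class list contains that token.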
HTML is fault-tolerant by design: browsers render malformed markup gracefully, and most HTML parsers likewise recover from missing closing tags, unquoted or unterminated attribute values, and other violations. The recovery is not uniform, however. Different parsers can repair the same broken markup into different trees, so a selector written against the DOM shown in a browser's developer tools may return unexpected results when run against a scraper's parsed tree.
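The effect of differing repair strategies can be illustrated with a toy tree builder on top of the standard-library `html.parser`. This is a sketch under stated assumptions, not how any particular library works: the `Node` and `TreeBuilder` classes and the `imply_li_end` flag are invented for this example. With the flag off, an unclosed `<li>` stays open and the next `<li>` nests inside it; with the flag on, a new `<li>` closes the previous one, which is how browsers treat list items.

```python
from html.parser import HTMLParser


class Node:
    """A bare-bones element node that only tracks tag name and children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a tree from tag events, with an optional <li> repair rule."""

    def __init__(self, imply_li_end):
        super().__init__()
        self.imply_li_end = imply_li_end
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Optional repair: a new <li> closes a still-open <li>.
        if self.imply_li_end and tag == "li" and self.current.tag == "li":
            self.current = self.current.parent
        self.current = Node(tag, self.current)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly
        # closing anything left open inside it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent
        # A stray end tag with no matching open element is ignored.


def direct_li_children(html, imply_li_end):
    """Count <li> elements that are direct children of the first element."""
    builder = TreeBuilder(imply_li_end)
    builder.feed(html)
    builder.close()
    first = builder.root.children[0]
    return sum(1 for child in first.children if child.tag == "li")


# Malformed input: none of the <li> elements is closed.
BROKEN = "<ul><li>one<li>two<li>three</ul>"

print(direct_li_children(BROKEN, imply_li_end=False))  # 1
print(direct_li_children(BROKEN, imply_li_end=True))   # 3
```

Neither run raises an error, yet a query equivalent to `ul > li` finds one element in the first tree and three in the second. When a selector that works in the browser returns too few or too many matches in a scraper, comparing the parsed tree against the browser's DOM is a reasonable first check.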