HTML

HTML (HyperText Markup Language) is the standard markup language for structuring web page content, defining elements like headings, paragraphs, links, images, and tables.

HTML is the backbone of every web page. An HTML document is written as nested elements. Most elements are delimited by opening and closing tags (e.g., `<p>text</p>`), while void elements such as `<img>` and `<br>` have no closing tag. Elements can carry attributes that modify their behaviour or appearance (e.g., `<a href='...'>link</a>`). Browsers parse this markup into a tree, the Document Object Model (DOM), and render it visually according to associated CSS stylesheets.
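The element tree described above can be illustrated with a small sketch. It uses only Python's standard-library `html.parser`. The `Node`, `TreeBuilder`, and `outline` names are hypothetical helpers invented for this example, and the builder assumes well-formed input rather than applying a browser's full parsing rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in the tree: a tag name, its attributes, and its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a simple element tree. Assumes well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def outline(node, depth=0):
    """Return one indented line per element or text chunk."""
    label = node.tag + (" " + str(node.attrs) if node.attrs else "")
    lines = ["  " * depth + label]
    for child in node.children:
        if isinstance(child, str):
            lines.append("  " * (depth + 1) + repr(child))
        else:
            lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed('<p>Read the <a href="/docs">docs</a><br>today.</p>')
builder.close()
tree_lines = outline(builder.root)
print("\n".join(tree_lines))
```

The printed outline shows the `<a>` and `<br>` elements nested under `<p>`, with the `href` attribute attached to the link.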

For web scrapers, HTML is the primary raw material. After fetching a page, the scraper parses the HTML to navigate the element tree and extract target data. Parsers such as Python's BeautifulSoup and lxml and Node's Cheerio expose query APIs for locating elements by tag name, class, attribute, or structural position. BeautifulSoup and Cheerio support CSS selectors, while lxml supports XPath natively and CSS selectors through the cssselect package.

HTML is a fault-tolerant format: browsers are designed to render malformed HTML gracefully, and parsers similarly recover from missing closing tags, unclosed attribute quotes, and other violations. However, parsers do not all repair malformed markup the same way, so a selector query can return unexpected results when the tree a scraper's parser builds differs from the tree the browser builds for the same page.
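A minimal sketch of how missing closing tags can change the parsed tree. Python's standard-library `html.parser` reports tags as events and does not apply a browser's repair rules, so a naive stack-based consumer nests each unclosed `<li>` inside the previous one, whereas a browser treats them as siblings. The `NaiveDepth` class and `li_depths` function are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class NaiveDepth(HTMLParser):
    """Records the nesting depth a naive stack-based builder gives each <li>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.li_depths = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depths.append(self.depth)
        self.depth += 1  # every start tag opens one more level

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)  # every end tag closes one level


def li_depths(html):
    parser = NaiveDepth()
    parser.feed(html)
    parser.close()
    return parser.li_depths


well_formed = "<ul><li>one</li><li>two</li><li>three</li></ul>"
missing_close = "<ul><li>one<li>two<li>three</ul>"

print(li_depths(well_formed))    # [1, 1, 1]: three siblings under <ul>
print(li_depths(missing_close))  # [1, 2, 3]: each <li> nested in the last
```

With the second tree, a child selector such as `ul > li` would match only the first item, even though a browser shows three list items. Checking selectors against the tree the chosen parser actually produces avoids this kind of surprise.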

Examples

<!-- Basic HTML structure a scraper navigates -->
<article class="product-card" data-sku="ABC123">
  <h2 class="product-name">Widget Pro</h2>
  <span class="price">$49.99</span>
  <a href="/products/widget-pro">Details</a>
</article>
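
As a rough sketch, the card above could be turned into a record with Python's standard-library `html.parser`. The `ProductCardParser` class and the record keys (`sku`, `name`, `price`, `url`) are assumptions made for this example. A real scraper would more likely use the CSS selector or XPath API of BeautifulSoup, lxml, or Cheerio.

```python
from html.parser import HTMLParser

CARD_HTML = """
<article class="product-card" data-sku="ABC123">
  <h2 class="product-name">Widget Pro</h2>
  <span class="price">$49.99</span>
  <a href="/products/widget-pro">Details</a>
</article>
"""


class ProductCardParser(HTMLParser):
    """Collects the SKU, name, price, and link from one product card."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None  # record key that the next text chunk should fill

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and "product-card" in classes:
            self.record["sku"] = attrs.get("data-sku")
        elif "product-name" in classes:
            self._field = "name"
        elif "price" in classes:
            self._field = "price"
        elif tag == "a" and "href" in attrs:
            self.record["url"] = attrs["href"]

    def handle_data(self, data):
        if self._field and data.strip():
            self.record[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        # Stop waiting for text once the element closes.
        self._field = None


parser = ProductCardParser()
parser.feed(CARD_HTML)
parser.close()
record = parser.record
print(record)
```

Matching on class names and the `data-sku` attribute, rather than on position in the document, tends to hold up better when the surrounding layout changes.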

    HTML — Web Scraping Glossary | AlterLab