tool

BeautifulSoup

A Python library for parsing HTML and XML and navigating the parse tree using CSS selectors or tag methods.

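BeautifulSoup (distributed as the `beautifulsoup4` package and imported from `bs4`) is a Python library for parsing HTML and XML documents and navigating the resulting parse tree. It accepts markup from any source, such as an HTTP response body, a file, or a string, and exposes the parsed document through a small search API: find the first element with a given tag name (`soup.find('div')`), all elements with a given class (`soup.find_all('p', class_='description')`), all elements matching a CSS selector (`soup.select('.price')`), or the first element that carries a given attribute (`soup.find('a', href=True)`). `find` returns a single element or `None`, while `find_all` and `select` return lists of elements.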
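The four lookups mentioned above can be tried side by side on a small document. This is a minimal sketch: the markup, class names, and link target are invented for illustration, and `html.parser` is used only so that nothing beyond `beautifulsoup4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div id="product">
  <p class="description">A sturdy mug.</p>
  <p class="description">Dishwasher safe.</p>
  <span class="price">$12.00</span>
  <a>No link here</a>
  <a href="/mugs/42">Details</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")                             # first <div>, or None if absent
descriptions = soup.find_all("p", class_="description")  # every <p class="description">
prices = soup.select(".price")                           # CSS selector, returns a list
first_link = soup.find("a", href=True)                   # first <a> that has an href attribute

print(first_div["id"])
print([p.get_text(strip=True) for p in descriptions])
print(prices[0].get_text(strip=True))
print(first_link["href"])
```

Note that `href=True` skips the first `<a>` in the sample because it has no `href` attribute at all.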

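BeautifulSoup works with multiple underlying parsers. The built-in `html.parser` ships with Python's standard library and handles most HTML correctly. For faster parsing of large documents, `lxml` provides a C-backed parser for both HTML and XML, and it is the parser BeautifulSoup relies on for XML documents. For the most lenient handling of malformed HTML, `html5lib` builds the tree the way a web browser would, however broken the markup is, at the cost of speed. Both `lxml` and `html5lib` are third-party packages that must be installed separately.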
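Because two of the three parsers are optional installs, a script may want to choose one at runtime. The sketch below is one possible approach rather than anything BeautifulSoup prescribes: `pick_parser` is a hypothetical helper, and the malformed list markup is invented for illustration.

```python
from importlib.util import find_spec

from bs4 import BeautifulSoup


def pick_parser(prefer_lenient=False):
    """Hypothetical helper: return the name of an installed parser.

    Prefers html5lib when leniency is requested, otherwise lxml for speed,
    and falls back to the standard-library html.parser.
    """
    if prefer_lenient and find_spec("html5lib") is not None:
        return "html5lib"
    if find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


# Malformed markup: the <li> elements are never closed.
broken = "<ul><li>One<li>Two</ul>"

parser = pick_parser()
soup = BeautifulSoup(broken, parser)
item_count = len(soup.find_all("li"))

print(parser, item_count)
```

All three parsers find both list items here, but they may nest or wrap the elements differently, so code that depends on exact tree shape is safer when the parser name is always passed explicitly.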

The key limitation of BeautifulSoup is that it operates on static HTML: it does not execute JavaScript, so any content that scripts inject after page load is absent from the parse tree. For JavaScript-rendered pages, the HTML must first be rendered by a browser (Playwright, Puppeteer, or a scraping API with JavaScript rendering enabled). AlterLab's API returns the post-render HTML, which can then be parsed with BeautifulSoup for structured extraction.
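The limitation is easy to observe directly. In this small sketch the page and its inline script are invented for illustration: a browser would fill the `#app` element with text, while BeautifulSoup sees only the empty placeholder and the script's source code.

```python
from bs4 import BeautifulSoup

# Hypothetical page whose visible content is written by JavaScript.
html = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById('app').textContent = 'Loaded';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The script is never run, so the placeholder is still empty.
app_text = soup.find(id="app").get_text()

# The script body is available only as plain text.
script_source = soup.find("script").string

print(repr(app_text))
print("Loaded" in script_source)
```

Passing already-rendered HTML to BeautifulSoup, from a browser automation tool or a rendering API, avoids this gap.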

Examples

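from bs4 import BeautifulSoup

# `response` is assumed to be the decoded body of an AlterLab API call,
# with the post-render HTML stored under the 'html' key.
# The 'lxml' parser needs the separate lxml package; 'html.parser' also works.
soup = BeautifulSoup(response['html'], 'lxml')

# Select every span.amount that is a direct child of div.price
prices = soup.select('div.price > span.amount')
for price in prices:
    print(price.text.strip())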
