extraction

Table Parsing

Table parsing extracts structured rows and columns of data from HTML `<table>` elements, converting them into arrays, dataframes, or other tabular formats.

HTML tables encode information in a nested structure of `<tr>` (row), `<th>` (header cell), and `<td>` (data cell) elements. Libraries like BeautifulSoup, pandas `read_html`, and Scrapy selectors can traverse this structure. pandas returns DataFrames directly, while with BeautifulSoup or Scrapy the scraper iterates over rows and cells to build a two-dimensional array itself.
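The row-and-cell traversal described above can be sketched without third-party packages. This is a minimal illustration that swaps in Python's standard-library `html.parser` for the libraries named in the text; the class name `TableRows` and the sample markup are made up for the example, and it assumes a single, non-nested table with explicit closing tags.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup, used only for illustration.
html = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apple</td><td>3</td></tr>
  <tr><td>Pear</td><td>5</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
print(parser.rows)
```

The result is a list of rows with the header cells as the first row. Real pages often omit closing tags or nest tables, which is where a lenient parser such as BeautifulSoup becomes the more practical choice.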

Complications arise from tables that use `colspan` and `rowspan` attributes to merge cells across multiple rows or columns. Parsing these correctly requires tracking span state across rows to align data with the right column header. Some pages also nest tables inside other tables for layout purposes, requiring the scraper to target the correct table by position, class, or caption text.
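One way to track span state across rows is to remember, per column, any cell from an earlier row that still has rows left to cover. The sketch below is an illustration under simplifying assumptions, not a full implementation of the HTML table model: the function name `expand_spans` and the `(text, colspan, rowspan)` input format are hypothetical, and it assumes well-formed tables in which spans never overlap.

```python
def expand_spans(rows):
    """Expand (text, colspan, rowspan) cells into a rectangular grid.

    The text of a merged cell is repeated in every position it covers,
    so each value lines up with the correct column.
    """
    grid = []
    pending = {}  # column index -> (text, number of further rows to fill)

    for cells in rows:
        out = []
        queue = list(cells)
        col = 0
        while queue or col in pending:
            if col in pending:
                # A cell from an earlier row still occupies this column.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
                col += 1
                continue
            text, colspan, rowspan = queue.pop(0)
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    # Reserve this column in the following rows.
                    pending[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid


# Hypothetical table: "Region" spans two rows, "Sales" spans two columns.
rows = [
    [("Region", 1, 2), ("Sales", 2, 1)],
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]

grid = expand_spans(rows)
for row in grid:
    print(row)
```

After expansion every row has three entries, so the two header rows can be combined into column labels and the data row aligns with them.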

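In pandas, `pd.read_html(url)` is a common shortcut: it downloads a page and returns every detected table as a list of DataFrames in a single call, expanding most `colspan` and `rowspan` cells automatically. It needs a parser backend such as lxml, or BeautifulSoup with html5lib. For more complex tables, manual traversal with BeautifulSoup gives finer control. JavaScript-rendered tables need browser-based extraction, because `read_html` only sees the HTML that the server returns.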
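The call can also be tried offline on a literal HTML string, which makes the span handling easy to inspect. The following is a small sketch, assuming pandas and a parser backend such as lxml are installed; the markup and the `id="prices"` attribute are invented for the example. The `attrs` parameter restricts the result to tables with matching attributes, and the `match` parameter can similarly filter tables by the text they contain.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup: two tables, one with a cell merged across two rows.
html = """
<table id="prices">
  <tr><th>Category</th><th>Item</th><th>Price</th></tr>
  <tr><td rowspan="2">Fruit</td><td>Apple</td><td>1.5</td></tr>
  <tr><td>Pear</td><td>2.0</td></tr>
</table>
<table id="other">
  <tr><th>A</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# Wrap the string in StringIO so pandas treats it as file-like input,
# and select only the table whose id attribute is "prices".
tables = pd.read_html(StringIO(html), attrs={"id": "prices"})
prices = tables[0]
print(prices)
```

The merged "Fruit" cell is repeated on both data rows, so each item keeps its category without any manual span tracking.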

Examples

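# pandas: extract all tables from a page
# (needs a parser backend such as lxml; the URL is a placeholder)
import pandas as pd

# read_html returns a list of DataFrames, one per <table> found,
# and raises ValueError if the page contains no tables
tables = pd.read_html("https://example.com/data")
first_table = tables[0]
print(first_table.head())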