general

Canonical URL

A canonical URL is the preferred URL for a piece of content when multiple URLs serve the same or similar content, indicated by a `<link rel='canonical'>` tag.

When the same content is accessible at multiple URLs — due to query parameters, trailing slashes, `www` vs. non-`www` subdomains, or session tokens — search engines and scrapers need to know which URL to treat as authoritative. The `rel=canonical` link tag in the HTML `<head>` signals the preferred URL.

For crawlers, following canonical URLs prevents indexing duplicate content and wasting crawl budget. A robust crawler reads the canonical tag and records the canonical URL alongside or instead of the requested URL. Deduplication logic can then match records by canonical URL rather than the exact URL that was fetched.

Canonical URLs also matter for link graph construction: inbound links to duplicate URLs should be consolidated and attributed to the canonical for accurate page-authority calculations.

Examples

<!-- Canonical tag in page <head> -->
<link rel="canonical" href="https://example.com/products/widget" />

<!-- Scraper extraction -->
from bs4 import BeautifulSoup
canonical = soup.find("link", rel="canonical")
canonical_url = canonical["href"] if canonical else response.url