Canonical URL

A canonical URL is the preferred URL for a piece of content when multiple URLs serve the same or similar content. It is usually declared with a `<link rel="canonical">` tag.

The same content is often reachable at several URLs because of query parameters, trailing slashes, `www` vs. non-`www` hostnames, or session tokens. Search engines and scrapers then need to know which URL to treat as authoritative. A `<link rel="canonical">` tag in the HTML `<head>` signals the preferred URL.

For crawlers, following canonical URLs prevents indexing duplicate content and wasting crawl budget. A robust crawler reads the canonical tag and records the canonical URL alongside or instead of the requested URL. Deduplication logic can then match records by canonical URL rather than the exact URL that was fetched.
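The deduplication step described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it uses only the Python standard library, and the names `CanonicalParser`, `canonical_url`, and `dedupe` are hypothetical helpers introduced here. It assumes the first canonical tag wins and that a relative `href` should be resolved against the fetched URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Hypothetical helper: keeps the href of the first <link rel="canonical">."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(html, fetched_url):
    """Return the declared canonical URL, or the fetched URL if none is declared."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return fetched_url
    # Resolve relative hrefs against the URL that was actually fetched
    return urljoin(fetched_url, parser.canonical)


def dedupe(pages):
    """Group (fetched_url, html) pairs into one record per canonical URL."""
    records = {}
    for fetched_url, html in pages:
        key = canonical_url(html, fetched_url)
        record = records.setdefault(key, {"canonical_url": key, "fetched_urls": []})
        record["fetched_urls"].append(fetched_url)
    return records
```

Keeping the fetched URLs next to the canonical one preserves the "alongside" option: the record is keyed by canonical URL, but the crawler can still see which variants led to it.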

Canonical URLs also matter for link graph construction: inbound links to duplicate URLs should be consolidated and attributed to the canonical for accurate page-authority calculations.
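One way to picture that consolidation is sketched below, under stated assumptions: `links` is a list of `(source_url, target_url)` pairs collected during a crawl, and `canonical_of` is a mapping from fetched URL to canonical URL built beforehand. Both names, and the function `consolidate_links`, are hypothetical. Counting distinct inbound edges is only a stand-in for whatever authority metric is actually used.

```python
from collections import Counter


def consolidate_links(links, canonical_of):
    """Count distinct inbound links per canonical URL.

    links: iterable of (source_url, target_url) pairs.
    canonical_of: dict mapping a fetched URL to its canonical URL;
                  URLs missing from the dict are treated as already canonical.
    """
    edges = set()
    for source, target in links:
        src = canonical_of.get(source, source)
        dst = canonical_of.get(target, target)
        # Drop self-links that only appear because duplicates were merged
        if src != dst:
            edges.add((src, dst))
    # Each distinct (source, target) edge counts once toward the target
    return Counter(dst for _, dst in edges)
```

Rewriting both ends of each edge before counting means two links from the same page to two duplicate URLs collapse into a single inbound link for the canonical.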

Examples

<!-- Canonical tag in page <head> -->
<link rel="canonical" href="https://example.com/products/widget" />

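# Scraper extraction (Python, requests + BeautifulSoup)
import requests
from bs4 import BeautifulSoup
response = requests.get("https://example.com/products/widget")
soup = BeautifulSoup(response.text, "html.parser")
canonical = soup.find("link", rel="canonical")
# Fall back to the fetched URL when the tag or its href is missing
canonical_url = canonical["href"] if canonical and canonical.get("href") else response.url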
