A URL uniquely identifies a web resource. Its components are: scheme (`https`), authority (host `example.com` and optional port `:8080`), path (`/products/widget`), query string (`?page=2&sort=price`), and fragment (`#reviews`). Fragments are client-side only and are not sent to the server; scrapers targeting fragment-routed SPAs must handle them via browser-based rendering.

URL normalisation is a common preprocessing step in crawlers: converting relative URLs to absolute, lowercasing the scheme and host, removing default ports (`:80` for http, `:443` for https), sorting query parameters, and stripping tracking parameters like `utm_source`. Normalised URLs improve deduplication accuracy by ensuring that `https://example.com/p?a=1&b=2` and `https://example.com/p?b=2&a=1` are recognised as the same resource.

URLs can encode arbitrary data in query parameters, making them a key input for systematic scraping of paginated or parameterised datasets. Iterating over known URL patterns (changing a product ID or page number) is a fast and reliable scraping strategy when the URL structure is predictable.

Examples

# URL parsing and normalisation
from urllib.parse import urlparse, urlencode, urlunparse
import urllib.parse

parsed = urlparse("https://example.com/search?q=scraping&page=1#results")
print(parsed.scheme)    # https
print(parsed.netloc)    # example.com
print(parsed.path)      # /search
print(parsed.query)     # q=scraping&page=1
# Fragment is not sent to server

URL

Examples

Related Terms

Extract URL data from any website

Your first scrape.
Sixty seconds.

Examples

Related Terms

Extract URL data from any website

Your first scrape. Sixty seconds.

Your first scrape.
Sixty seconds.