A URL identifies and locates a web resource, although several distinct URLs can point to the same one. Its components are: scheme (`https`), authority (host `example.com` and optional port `:8080`), path (`/products/widget`), query string (`?page=2&sort=price`), and fragment (`#reviews`). Fragments are handled entirely on the client and are not sent to the server in the HTTP request, so scrapers targeting fragment-routed SPAs generally need browser-based rendering to reach that content.
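As a minimal sketch of that breakdown, Python's standard `urllib.parse` module can split a URL into these components. The URL below is simply the example pieces from the text assembled into one string, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, urldefrag, urlsplit

# Example URL assembled from the components named in the text.
url = "https://example.com:8080/products/widget?page=2&sort=price#reviews"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.port)      # 8080
print(parts.path)      # /products/widget
print(parts.query)     # page=2&sort=price
print(parts.fragment)  # reviews

# The query string decodes into a mapping of parameter name -> list of values.
params = parse_qs(parts.query)
print(params)          # {'page': ['2'], 'sort': ['price']}

# The fragment never reaches the server, so the URL without it is
# what a plain HTTP client effectively requests.
requested = urldefrag(url).url
print(requested)       # https://example.com:8080/products/widget?page=2&sort=price
```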
URL normalisation is a common preprocessing step in crawlers: converting relative URLs to absolute, lowercasing the scheme and host, removing default ports (`:80` for http, `:443` for https), sorting query parameters, and stripping tracking parameters like `utm_source`. Only the scheme and host are case-insensitive; the path and query string are case-sensitive and must keep their original case. Normalised URLs improve deduplication accuracy by ensuring that `https://example.com/p?a=1&b=2` and `https://example.com/p?b=2&a=1` are recognised as the same resource.
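The steps listed above could be sketched roughly as follows, using only the standard library. The function name `normalise_url`, the `TRACKING_PREFIXES` tuple, and the decision to drop the fragment and any userinfo are assumptions made for illustration, not a prescribed implementation; real crawlers often need site-specific rules.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Assumption: any parameter starting with "utm_" is treated as tracking noise.
TRACKING_PREFIXES = ("utm_",)


def normalise_url(url, base=None):
    """Return a normalised form of `url` (hypothetical helper)."""
    # Resolve relative URLs against the page they were found on.
    if base is not None:
        url = urljoin(base, url)

    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # userinfo, if any, is dropped here
    if ":" in host:                        # IPv6 literals need their brackets back
        host = f"[{host}]"

    # Keep the port only when it is not the default for the scheme.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    # Drop tracking parameters, then sort the rest for a stable order.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))

    # Assumption: the fragment is discarded because it is not sent to the server.
    return urlunsplit((scheme, netloc, parts.path, query, ""))


a = normalise_url("https://example.com/p?a=1&b=2")
b = normalise_url("https://EXAMPLE.com:443/p?b=2&a=1&utm_source=news")
print(a == b)  # True
```

Sorting parameters assumes the server ignores their order, which holds for most sites but is worth checking for any endpoint that accepts repeated or positional parameters.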
URLs can encode arbitrary data in query parameters, making them a key input for systematic scraping of paginated or parameterised datasets. Iterating over known URL patterns (changing a product ID or page number) is a fast and reliable scraping strategy when the URL structure is predictable.
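To illustrate iterating over a predictable pattern, the sketch below generates one URL per page number. The endpoint `https://example.com/products`, the `page` and `sort` parameter names, and the helper `page_urls` are hypothetical; the code only builds URLs and does not fetch them.

```python
from urllib.parse import urlencode


def page_urls(base, first, last, **extra):
    """Yield one URL per page in [first, last] (hypothetical helper)."""
    for page in range(first, last + 1):
        # urlencode escapes values safely and keeps the parameter order given.
        yield f"{base}?{urlencode({'page': page, **extra})}"


urls = list(page_urls("https://example.com/products", 1, 3, sort="price"))
for u in urls:
    print(u)
# https://example.com/products?page=1&sort=price
# https://example.com/products?page=2&sort=price
# https://example.com/products?page=3&sort=price
```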