general

Link Following

Link following (or crawling) is the process of discovering and visiting URLs by extracting hyperlinks from already-scraped pages, enabling systematic traversal of a website.

A scraper that follows links autonomously becomes a crawler: it starts from one or more seed URLs, extracts all anchor hrefs from the fetched HTML, adds new URLs to a queue, and visits them in turn. The crawl frontier (the queue of URLs to visit) grows as new links are discovered. Visited URLs are tracked to prevent re-processing.
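
The loop described above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the `crawl` function, the `fetch` hook, and the in-memory `PAGES` dictionary are hypothetical names introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` is an assumed hook returning HTML or None."""
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so none is processed twice
    order = []                # pages actually fetched, in visiting order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve relative hrefs against the page URL
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Tiny in-memory "site" standing in for real HTTP responses (assumption).
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}

visited = crawl(["https://example.com/"], PAGES.get)
print(visited)
```

Using a deque as the frontier gives breadth-first order; popping from the other end would turn the same loop into a depth-first crawl.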

Effective link following requires normalising URLs (resolving relative paths, stripping fragment identifiers, canonicalising query parameters) so that the same page is not queued under several spellings, respecting robots.txt exclusions, enforcing domain or path scope filters to avoid leaving the target site, and capping crawl depth so that dynamically generated URLs (calendar pages, session IDs in query strings) cannot trap the crawler in an effectively endless chain of new links.
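
One possible way to express normalisation, scope filtering, and a depth cap is sketched below using `urllib.parse`. The function names (`normalise`, `in_scope`, `should_follow`) and the default depth of 3 are assumptions made for illustration, and the normalisation rules shown are a simplified subset that assumes URLs carry no embedded credentials.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit


def normalise(base_url, href):
    """Resolve href against base_url and return a canonical absolute URL."""
    parts = urlsplit(urljoin(base_url, href))
    # Sort query parameters so ?b=2&a=1 and ?a=1&b=2 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    # Lower-case scheme and host, and drop the fragment (last field left empty).
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))


def in_scope(url, allowed_host, path_prefix="/"):
    """True if url is HTTP(S), on the allowed host, and under the path prefix."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and parts.hostname == allowed_host
        and parts.path.startswith(path_prefix)
    )


def should_follow(url, depth, allowed_host, max_depth=3):
    """Combine the scope filter with a depth cap (depth 0 is a seed URL)."""
    return depth <= max_depth and in_scope(url, allowed_host)


canonical = normalise("https://Example.com/docs/index.html", "../about?b=2&a=1#team")
print(canonical)
print(should_follow(canonical, depth=1, allowed_host="example.com"))
```

To apply the depth cap in a crawl loop, the frontier would store `(url, depth)` pairs and enqueue each discovered link with `depth + 1`.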

Politeness policies (rate limiting, request spacing, and honouring the Crawl-delay directive, a non-standard extension that some robots.txt files include) are important both ethically and practically: aggressive crawling puts load on the target server and is more likely to trigger anti-bot defences.
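
As a rough sketch of these policies, the standard library's `urllib.robotparser` can answer both "may I fetch this URL?" and "what Crawl-delay applies?". The robots.txt content, the `example-bot` user agent, the 1-second fallback delay, and the `Throttle` class below are all assumptions for illustration; a real crawler would load the file from the target site with `set_url()` and `read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed from a string so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-bot"  # assumed name for this crawler

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class Throttle:
    """Enforces a minimum gap between consecutive requests."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        now = self.clock()
        if self.last_request is not None:
            remaining = self.delay - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now


# Use the site's Crawl-delay if it declares one, else fall back to 1 second.
delay = robots.crawl_delay(USER_AGENT) or 1.0
throttle = Throttle(delay)

candidates = ["https://example.com/public", "https://example.com/private/report"]
allowed = [url for url in candidates if robots.can_fetch(USER_AGENT, url)]
print(delay, allowed)
```

In a crawl loop, `throttle.wait()` would be called immediately before each request to an allowed URL.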

Examples

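# Scrapy spider following links
import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"
    # Scope filter: Scrapy drops requests to domains not listed here.
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit one item per page; .get() returns None if the page has no <h1>.
        yield {"url": response.url, "title": response.css("h1::text").get()}
        for href in response.css("a::attr(href)").getall():
            # Skip links that are not fetchable pages.
            if href.startswith(("mailto:", "tel:", "javascript:")):
                continue
            # follow() resolves relative hrefs; duplicate requests are filtered by default.
            yield response.follow(href, self.parse)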