Sub-URL Crawler Agent

The Sub-URL Crawler Agent systematically traverses a website by following links and sub-links from a starting URL. It enables deep crawling, link discovery, and content extraction from multiple connected pages.

Component Inputs

  • Starting URL (URL de départ): Starting point for the crawl

    The initial URL where crawling begins

  • Maximum Depth (Profondeur maximale): Maximum crawling depth (default: 3)

    Number of link levels to follow from the starting URL

  • Maximum URLs per Level (URLs maximales par niveau): Maximum number of URLs per level (default: 50)

    Limits the number of URLs processed at each depth level; how the three inputs interact is sketched below
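
The sketch below illustrates how these three inputs typically drive a breadth-first crawl: the starting URL seeds the first level, the depth limit bounds how many levels of links are followed, and the per-level cap bounds how many pages are fetched at each level. The function and parameter names (crawl, start_url, max_depth, max_urls_per_level) and the use of the requests and BeautifulSoup libraries are illustrative assumptions, not the component's internal implementation.

    from urllib.parse import urljoin, urldefrag
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_depth=3, max_urls_per_level=50):
        """Breadth-first crawl that returns the URLs kept within the depth and per-level limits."""
        seen = {start_url}
        collected = [start_url]
        frontier = [start_url]

        for _depth in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    response = requests.get(url, timeout=10)
                    response.raise_for_status()
                except requests.RequestException:
                    continue  # skip pages that cannot be fetched
                soup = BeautifulSoup(response.text, "html.parser")
                for anchor in soup.find_all("a", href=True):
                    # Resolve relative links and drop #fragments before deduplication.
                    link, _ = urldefrag(urljoin(url, anchor["href"]))
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            # Enforce the per-level cap before descending to the next depth.
            frontier = next_frontier[:max_urls_per_level]
            collected.extend(frontier)

        return collected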

Component Outputs

  • Collected URLs (Scrape URLs collectées): Collection of all URLs discovered during the crawl
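
As a hypothetical usage example, assuming the crawl sketch above, the output is a flat list of URLs that downstream steps can count, filter, or persist:

    # Hypothetical follow-up to the crawl() sketch; example.com is a placeholder domain.
    urls = crawl("https://example.com", max_depth=3, max_urls_per_level=50)
    print(f"Discovered {len(urls)} URLs")
    with open("collected_urls.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))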

Use Cases

  • Site Mapping: Create a comprehensive map of a website's structure
  • Content Discovery: Find all relevant content within a website
  • Data Collection: Gather information from multiple connected pages
  • SEO Analysis: Analyze site structure and link relationships
  • Archive Creation: Create comprehensive archives of websites

Best Practices

  • Respect robots.txt directives and site crawling policies (a minimal check is sketched after this list)
  • Implement reasonable crawl rate limits to avoid overloading servers
  • Use depth limits to prevent excessive crawling
  • Consider URL filtering to focus crawling on relevant content
  • Implement error handling for inaccessible pages
  • Be mindful of duplicate content and crawling loops
  • Store intermediate results to resume crawling if interrupted
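
The following sketch shows one way to apply the first two practices: checking robots.txt per host and pausing between requests. RobotFileParser comes from the Python standard library; the cache, the one-second delay, and the polite_fetch helper are assumptions made for illustration and are not part of the component.

    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    # Illustrative helpers; the component's actual crawling policy is not documented here.
    ROBOTS_CACHE = {}
    CRAWL_DELAY_SECONDS = 1.0  # assumed polite default, not a documented setting

    def allowed_by_robots(url, user_agent="*"):
        """Check robots.txt for the URL's host, caching one parser per host."""
        parsed = urlparse(url)
        host = f"{parsed.scheme}://{parsed.netloc}"
        if host not in ROBOTS_CACHE:
            parser = RobotFileParser(host + "/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None  # robots.txt unreachable; treat as allowed in this sketch
            ROBOTS_CACHE[host] = parser
        parser = ROBOTS_CACHE[host]
        return parser is None or parser.can_fetch(user_agent, url)

    def polite_fetch(url, fetcher):
        """Fetch a URL only if robots.txt allows it, then pause to limit crawl rate."""
        if not allowed_by_robots(url):
            return None
        result = fetcher(url)
        time.sleep(CRAWL_DELAY_SECONDS)  # crude rate limit between requests
        return result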