Sub-URL Crawler Agent
The Sub-URL Crawler Agent systematically traverses websites by following links and sub-links from a starting URL. It enables deep web crawling, link discovery, and content extraction from multiple connected pages.

Sub-URL Crawler Agent interface
Component Inputs
- Starting URL (URL de départ): The initial URL where crawling begins
- Maximum Depth (Profondeur maximale): Number of link levels to follow from the starting URL (default: 3)
- Maximum URLs per Level (URLs maximales par niveau): Caps the number of URLs processed at each depth level (default: 50)
Component Outputs
- Collected Scraped URLs (Scrape URLs collectées): Every URL discovered during the crawl (see the traversal sketch below)
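The component's exact traversal logic is internal, but a breadth-first loop driven by these three parameters captures the idea. The sketch below is illustrative Python, not the component's implementation; it assumes the requests and beautifulsoup4 packages, uses hypothetical parameter names mirroring the inputs above, and stays on the starting URL's domain.

```python
# Minimal sketch of a depth- and breadth-limited crawl, for illustration only.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, max_depth: int = 3, max_urls_per_level: int = 50) -> list[str]:
    """Breadth-first crawl returning every URL discovered within the limits."""
    visited = {start_url}
    collected = [start_url]
    current_level = [start_url]

    for _ in range(max_depth):
        next_level = []
        for url in current_level:
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.RequestException:
                continue  # skip pages that fail to load

            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                # Stay on the starting site and respect the per-level cap.
                if urlparse(link).netloc != urlparse(start_url).netloc:
                    continue
                if link not in visited and len(next_level) < max_urls_per_level:
                    visited.add(link)
                    next_level.append(link)
                    collected.append(link)
        if not next_level:
            break
        current_level = next_level

    return collected
```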
Use Cases
- Site Mapping: Create a comprehensive map of a website's structure
- Content Discovery: Find all relevant content within a website
- Data Collection: Gather information from multiple connected pages
- SEO Analysis: Analyze site structure and link relationships
- Archive Creation: Create comprehensive archives of websites
Best Practices
- Respect robots.txt directives and site crawling policies (a politeness sketch follows this list)
- Apply reasonable crawl delays so requests do not overload the target server
- Use depth limits to prevent excessive crawling
- Consider URL filtering to focus crawling on relevant content
- Implement error handling for inaccessible pages
- Be mindful of duplicate content and crawling loops (see the URL normalization sketch below)
- Store intermediate results so an interrupted crawl can be resumed (see the checkpoint sketch below)
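For the robots.txt and rate-limit points, a minimal politeness sketch: it checks robots.txt with the standard-library RobotFileParser and inserts a fixed delay between requests. The USER_AGENT string and the one-second delay are assumptions chosen for illustration, not values used by the component.

```python
# Politeness sketch: honor robots.txt and pace requests.
import time
from typing import Optional
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "SubURLCrawlerAgent"  # assumed identifier, for illustration only


def allowed_by_robots(url: str) -> bool:
    """Return True if robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # no readable robots.txt: assume allowed, but stay polite
    return parser.can_fetch(USER_AGENT, url)


def polite_get(url: str, delay_seconds: float = 1.0) -> Optional[requests.Response]:
    """Fetch a URL only if robots.txt allows it, then wait before the next request."""
    if not allowed_by_robots(url):
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay_seconds)  # simple fixed delay to avoid overloading the server
    return response
```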
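Duplicate content and crawling loops are usually handled by normalizing URLs before adding them to the visited set. The canonicalization rules below (drop fragments, lowercase the host, trim trailing slashes) are one reasonable choice, not the component's documented behavior.

```python
# URL normalization sketch to avoid re-crawling the same page under different spellings.
from urllib.parse import urldefrag, urlparse, urlunparse


def normalize(url: str) -> str:
    """Produce a canonical form: no fragment, lowercase scheme and host, no trailing slash."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path,
                       parts.params, parts.query, ""))


seen: set[str] = set()


def should_visit(url: str) -> bool:
    """Track normalized URLs so redirects and loops do not cause re-crawling."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```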
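To make an interrupted crawl resumable, the collected URLs and the remaining frontier can be checkpointed periodically. The checkpoint file name and JSON layout below are assumptions for illustration.

```python
# Checkpoint sketch: persist crawl state so it can be resumed after an interruption.
import json
from pathlib import Path

CHECKPOINT = Path("crawl_checkpoint.json")  # assumed file name


def save_checkpoint(collected: list[str], frontier: list[str]) -> None:
    """Persist the URLs gathered so far and the queue still to be crawled."""
    CHECKPOINT.write_text(json.dumps({"collected": collected, "frontier": frontier}))


def load_checkpoint() -> tuple[list[str], list[str]]:
    """Restore a previous crawl, or start fresh if no checkpoint exists."""
    if not CHECKPOINT.exists():
        return [], []
    state = json.loads(CHECKPOINT.read_text())
    return state["collected"], state["frontier"]
```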