Sub-URL Crawler Agent

The Sub-URL Crawler Agent systematically traverses a website by following links and sub-links from a starting URL. It enables deep crawling, link discovery, and content extraction from multiple connected pages.

Component Inputs

  • Starting URL (URL de départ): Starting point for the crawl

    The initial URL where crawling begins

  • Maximum Depth (Profondeur maximale): Maximum crawling depth (default: 3)

    Number of link levels to follow from the starting URL

  • Maximum URLs per Level (URLs maximales par niveau): Maximum number of URLs per level (default: 50)

    Limits the number of URLs processed at each depth level; how the three inputs interact is sketched below
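
The sketch below illustrates how these three inputs typically drive a breadth-first crawl: the starting URL seeds the first level, the depth limit bounds how many levels of links are followed, and the per-level cap bounds how many pages are fetched at each level. The function and parameter names (crawl, start_url, max_depth, max_urls_per_level) and the use of the requests and BeautifulSoup libraries are illustrative assumptions, not the component's internal implementation.

    from urllib.parse import urljoin, urldefrag
    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_depth=3, max_urls_per_level=50):
        """Breadth-first crawl that returns the URLs kept within the depth and per-level limits."""
        seen = {start_url}
        collected = [start_url]
        frontier = [start_url]

        for _depth in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    response = requests.get(url, timeout=10)
                    response.raise_for_status()
                except requests.RequestException:
                    continue  # skip pages that cannot be fetched
                soup = BeautifulSoup(response.text, "html.parser")
                for anchor in soup.find_all("a", href=True):
                    # Resolve relative links and drop #fragments before deduplication.
                    link, _ = urldefrag(urljoin(url, anchor["href"]))
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            # Enforce the per-level cap before descending to the next depth.
            frontier = next_frontier[:max_urls_per_level]
            collected.extend(frontier)

        return collected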

Component Outputs

  • Collected URLs (Scrape URLs collectées): Collection of all URLs discovered during the crawl
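
As a hypothetical usage example, assuming the crawl sketch above, the output is a flat list of URLs that downstream steps can count, filter, or persist:

    # Hypothetical follow-up to the crawl() sketch; example.com is a placeholder domain.
    urls = crawl("https://example.com", max_depth=3, max_urls_per_level=50)
    print(f"Discovered {len(urls)} URLs")
    with open("collected_urls.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))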

Use Cases

  • Site Mapping: Create a comprehensive map of a website's structure
  • Content Discovery: Find all relevant content within a website
  • Data Collection: Gather information from multiple connected pages
  • SEO Analysis: Analyze site structure and link relationships
  • Archive Creation: Create comprehensive archives of websites

Best Practices

  • Respect robots.txt directives and site crawling policies (a minimal check is sketched after this list)
  • Implement reasonable crawl rate limits to avoid overloading servers
  • Use depth limits to prevent excessive crawling
  • Consider URL filtering to focus crawling on relevant content
  • Implement error handling for inaccessible pages
  • Be mindful of duplicate content and crawling loops
  • Store intermediate results to resume crawling if interrupted
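
The following sketch shows one way to apply the first two practices: checking robots.txt per host and pausing between requests. RobotFileParser comes from the Python standard library; the cache, the one-second delay, and the polite_fetch helper are assumptions made for illustration and are not part of the component.

    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    # Illustrative helpers; the component's actual crawling policy is not documented here.
    ROBOTS_CACHE = {}
    CRAWL_DELAY_SECONDS = 1.0  # assumed polite default, not a documented setting

    def allowed_by_robots(url, user_agent="*"):
        """Check robots.txt for the URL's host, caching one parser per host."""
        parsed = urlparse(url)
        host = f"{parsed.scheme}://{parsed.netloc}"
        if host not in ROBOTS_CACHE:
            parser = RobotFileParser(host + "/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None  # robots.txt unreachable; treat as allowed in this sketch
            ROBOTS_CACHE[host] = parser
        parser = ROBOTS_CACHE[host]
        return parser is None or parser.can_fetch(user_agent, url)

    def polite_fetch(url, fetcher):
        """Fetch a URL only if robots.txt allows it, then pause to limit crawl rate."""
        if not allowed_by_robots(url):
            return None
        result = fetcher(url)
        time.sleep(CRAWL_DELAY_SECONDS)  # crude rate limit between requests
        return result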