## URL Seeding Workflows and Architecture

Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.

### URL Seeding vs Deep Crawling Strategy Comparison

```mermaid
graph TB
    subgraph "Deep Crawling Approach"
        A1[Start URL] --> A2[Load Page]
        A2 --> A3[Extract Links]
        A3 --> A4{More Links?}
        A4 -->|Yes| A5[Queue Next Page]
        A5 --> A2
        A4 -->|No| A6[Complete]

        A7[⏱️ Real-time Discovery]
        A8[🌐 Sequential Processing]
        A9[🔍 Limited by Page Structure]
        A10[💾 High Memory Usage]
    end

    subgraph "URL Seeding Approach"
        B1[Domain Input] --> B2[Query Sitemap]
        B1 --> B3[Query Common Crawl]
        B2 --> B4[Merge Results]
        B3 --> B4
        B4 --> B5[Apply Filters]
        B5 --> B6[Score Relevance]
        B6 --> B7[Rank Results]
        B7 --> B8[Select Top URLs]

        B9[⚡ Instant Discovery]
        B10[🚀 Parallel Processing]
        B11[🎯 Pattern-based Filtering]
        B12[💡 Smart Relevance Scoring]
    end

    style A1 fill:#ffecb3
    style B1 fill:#e8f5e8
    style A6 fill:#ffcdd2
    style B8 fill:#c8e6c9
```

### URL Discovery Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Seeder as AsyncUrlSeeder
    participant SM as Sitemap
    participant CC as Common Crawl
    participant Filter as URL Filter
    participant Scorer as BM25 Scorer

    User->>Seeder: urls("example.com", config)

    par Parallel Data Sources
        Seeder->>SM: Fetch sitemap.xml
        SM-->>Seeder: 500 URLs
    and
        Seeder->>CC: Query Common Crawl
        CC-->>Seeder: 2000 URLs
    end

    Seeder->>Seeder: Merge and deduplicate
    Note over Seeder: 2200 unique URLs

    Seeder->>Filter: Apply pattern filter
    Filter-->>Seeder: 800 matching URLs

    alt extract_head=True
        loop For each URL
            Seeder->>Seeder: Extract metadata
        end
        Note over Seeder: Title, description, keywords
    end

    alt query provided
        Seeder->>Scorer: Calculate relevance scores
        Scorer-->>Seeder: Scored URLs
        Seeder->>Seeder: Filter by score_threshold
        Note over Seeder: 200 relevant URLs
    end

    Seeder->>Seeder: Sort by relevance
    Seeder->>Seeder: Apply max_urls limit
    Seeder-->>User: Top 100 URLs ready for crawling
```

### SeedingConfig Decision Tree

```mermaid
flowchart TD
    A[SeedingConfig Setup] --> B{Data Source Strategy?}

    B -->|Fast & Official| C["source='sitemap'"]
    B -->|Comprehensive| D["source='cc'"]
    B -->|Maximum Coverage| E["source='sitemap+cc'"]

    C --> F{Need Filtering?}
    D --> F
    E --> F

    F -->|Yes| G[Set URL Pattern]
    F -->|No| H["pattern='*'"]

    G --> I{Pattern Examples}
    I --> I1["pattern='*/blog/*'"]
    I --> I2["pattern='*/docs/api/*'"]
    I --> I3["pattern='*.pdf'"]
    I --> I4["pattern='*/product/*'"]

    H --> J{Need Metadata?}
    I1 --> J
    I2 --> J
    I3 --> J
    I4 --> J

    J -->|Yes| K[extract_head=True]
    J -->|No| L[extract_head=False]

    K --> M{Need Validation?}
    L --> M

    M -->|Yes| N[live_check=True]
    M -->|No| O[live_check=False]

    N --> P{Need Relevance Scoring?}
    O --> P

    P -->|Yes| Q[Set Query + BM25]
    P -->|No| R[Skip Scoring]

    Q --> S["query='search terms'"]
    S --> T["scoring_method='bm25'"]
    T --> U[score_threshold=0.3]

    R --> V[Performance Tuning]
    U --> V

    V --> W[Set max_urls]
    W --> X[Set concurrency]
    X --> Y[Set hits_per_sec]
    Y --> Z[Configuration Complete]

    style A fill:#e3f2fd
    style Z fill:#c8e6c9
    style K fill:#fff3e0
    style N fill:#fff3e0
    style Q fill:#f3e5f5
```
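The decision tree above translates almost one-to-one into code. The sketch below follows the data-flow diagram, assuming the `AsyncUrlSeeder`/`SeedingConfig` API from the URL Seeding guide; the field names on the returned result dicts (`url`, `relevance_score`) are an assumption based on that guide rather than a verified contract.

```python
# A minimal discovery sketch: combined sources, early pattern filtering,
# head metadata extraction, and BM25 relevance scoring with a threshold.
import asyncio

from crawl4ai import AsyncUrlSeeder, SeedingConfig


async def discover() -> None:
    config = SeedingConfig(
        source="sitemap+cc",            # maximum coverage: sitemap + Common Crawl
        pattern="*/blog/*",             # pattern filtering happens before scoring
        extract_head=True,              # title/description/keywords feed the scorer
        query="python async tutorial",  # enables relevance scoring
        scoring_method="bm25",
        score_threshold=0.3,            # drop low-relevance URLs
        max_urls=100,                   # cap the final ranked list
    )

    seeder = AsyncUrlSeeder()
    results = await seeder.urls("example.com", config)

    # With a query set, results come back ranked by relevance, highest first.
    for item in results[:5]:
        print(item.get("relevance_score"), item["url"])  # assumed dict fields


asyncio.run(discover())
```

Leaving out `query`, `scoring_method`, and `score_threshold` skips the scoring branch of the tree entirely, which is the cheaper path when a pattern-filtered URL list is all you need.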
### BM25 Relevance Scoring Pipeline

```mermaid
graph TB
    subgraph "Text Corpus Preparation"
        A1[URL Collection] --> A2[Extract Metadata]
        A2 --> A3[Title + Description + Keywords]
        A3 --> A4[Tokenize Text]
        A4 --> A5[Remove Stop Words]
        A5 --> A6[Create Document Corpus]
    end

    subgraph "BM25 Algorithm"
        B1[Query Terms] --> B2[Term Frequency Calculation]
        A6 --> B2
        B2 --> B3[Inverse Document Frequency]
        B3 --> B4[BM25 Score Calculation]
        B4 --> B5["Score = Σ IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d| / avgdl))"]
    end

    subgraph "Scoring Results"
        B5 --> C1[URL Relevance Scores]
        C1 --> C2{Score ≥ Threshold?}
        C2 -->|Yes| C3[Include in Results]
        C2 -->|No| C4[Filter Out]
        C3 --> C5[Sort by Score DESC]
        C5 --> C6[Return Top URLs]
    end

    subgraph "Example Scores"
        D1["'python async tutorial' → 0.85"]
        D2["'python documentation' → 0.72"]
        D3["'javascript guide' → 0.23"]
        D4["'contact us page' → 0.05"]
    end

    style B5 fill:#e3f2fd
    style C6 fill:#c8e6c9
    style D1 fill:#c8e6c9
    style D2 fill:#c8e6c9
    style D3 fill:#ffecb3
    style D4 fill:#ffcdd2
```

### Multi-Domain Discovery Architecture

```mermaid
graph TB
    subgraph "Input Layer"
        A1[Domain List]
        A2[SeedingConfig]
        A3[Query Terms]
    end

    subgraph "Discovery Engine"
        B1[AsyncUrlSeeder]
        B2[Parallel Workers]
        B3[Rate Limiter]
        B4[Memory Manager]
    end

    subgraph "Data Sources"
        C1[Sitemap Fetcher]
        C2[Common Crawl API]
        C3[Live URL Checker]
        C4[Metadata Extractor]
    end

    subgraph "Processing Pipeline"
        D1[URL Deduplication]
        D2[Pattern Filtering]
        D3[Relevance Scoring]
        D4[Quality Assessment]
    end

    subgraph "Output Layer"
        E1[Scored URL Lists]
        E2[Domain Statistics]
        E3[Performance Metrics]
        E4[Cache Storage]
    end

    A1 --> B1
    A2 --> B1
    A3 --> B1

    B1 --> B2
    B2 --> B3
    B3 --> B4

    B2 --> C1
    B2 --> C2
    B2 --> C3
    B2 --> C4

    C1 --> D1
    C2 --> D1
    C3 --> D2
    C4 --> D3

    D1 --> D2
    D2 --> D3
    D3 --> D4

    D4 --> E1
    B4 --> E2
    B3 --> E3
    D1 --> E4

    style B1 fill:#e3f2fd
    style D3 fill:#f3e5f5
    style E1 fill:#c8e6c9
```

### Complete Discovery-to-Crawl Pipeline

```mermaid
stateDiagram-v2
    [*] --> Discovery

    Discovery --> SourceSelection: Configure data sources

    SourceSelection --> Sitemap: source="sitemap"
    SourceSelection --> CommonCrawl: source="cc"
    SourceSelection --> Both: source="sitemap+cc"

    Sitemap --> URLCollection
    CommonCrawl --> URLCollection
    Both --> URLCollection

    URLCollection --> Filtering: Apply patterns
    Filtering --> MetadataExtraction: extract_head=True
    Filtering --> LiveValidation: extract_head=False

    MetadataExtraction --> LiveValidation: live_check=True
    MetadataExtraction --> RelevanceScoring: live_check=False
    LiveValidation --> RelevanceScoring

    RelevanceScoring --> ResultRanking: query provided
    RelevanceScoring --> ResultLimiting: no query

    ResultRanking --> ResultLimiting: apply score_threshold
    ResultLimiting --> URLSelection: apply max_urls

    URLSelection --> CrawlPreparation: URLs ready
    CrawlPreparation --> CrawlExecution: AsyncWebCrawler

    CrawlExecution --> StreamProcessing: stream=True
    CrawlExecution --> BatchProcessing: stream=False

    StreamProcessing --> [*]
    BatchProcessing --> [*]

    note right of Discovery: 🔍 Smart URL Discovery
    note right of URLCollection: 📚 Merge & Deduplicate
    note right of RelevanceScoring: 🎯 BM25 Algorithm
    note right of CrawlExecution: 🕷️ High-Performance Crawling
```

### Performance Optimization Strategies

```mermaid
graph LR
    subgraph "Input Optimization"
        A1[Smart Source Selection] --> A2[Sitemap First]
        A2 --> A3[Add CC if Needed]
        A3 --> A4[Pattern Filtering Early]
    end

    subgraph "Processing Optimization"
        B1[Parallel Workers] --> B2[Bounded Queues]
        B2 --> B3[Rate Limiting]
        B3 --> B4[Memory Management]
        B4 --> B5[Lazy Evaluation]
    end

    subgraph "Output Optimization"
        C1[Relevance Threshold] --> C2[Max URL Limits]
        C2 --> C3[Caching Strategy]
        C3 --> C4[Streaming Results]
    end

    subgraph "Performance Metrics"
        D1["URLs/Second: 100-1000"]
        D2["Memory Usage: Bounded"]
        D3["Network Efficiency: 95%+"]
        D4["Cache Hit Rate: 80%+"]
    end

    A4 --> B1
    B5 --> C1
    C4 --> D1

    style A2 fill:#e8f5e8
    style B2 fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D3 fill:#c8e6c9
```

### URL Discovery vs Traditional Crawling Comparison

```mermaid
graph TB
    subgraph "Traditional Approach"
        T1[Start URL] --> T2[Crawl Page]
        T2 --> T3[Extract Links]
        T3 --> T4[Queue New URLs]
        T4 --> T2

        T5["❌ Time: Hours/Days"]
        T6[❌ Resource Heavy]
        T7[❌ Depth Limited]
        T8[❌ Discovery Bias]
    end

    subgraph "URL Seeding Approach"
        S1[Domain Input] --> S2[Query All Sources]
        S2 --> S3[Pattern Filter]
        S3 --> S4[Relevance Score]
        S4 --> S5[Select Best URLs]
        S5 --> S6[Ready to Crawl]

        S7["✅ Time: Seconds/Minutes"]
        S8[✅ Resource Efficient]
        S9[✅ Complete Coverage]
        S10[✅ Quality Focused]
    end

    subgraph "Use Case Decision Matrix"
        U1["Small Sites < 1000 pages"] --> U2[Use Deep Crawling]
        U3["Large Sites > 10000 pages"] --> U4[Use URL Seeding]
        U5[Unknown Structure] --> U6[Start with Seeding]
        U7[Real-time Discovery] --> U8[Use Deep Crawling]
        U9[Quality over Quantity] --> U10[Use URL Seeding]
    end

    style S6 fill:#c8e6c9
    style S7 fill:#c8e6c9
    style S8 fill:#c8e6c9
    style S9 fill:#c8e6c9
    style S10 fill:#c8e6c9
    style T5 fill:#ffcdd2
    style T6 fill:#ffcdd2
    style T7 fill:#ffcdd2
    style T8 fill:#ffcdd2
```

### Data Source Characteristics and Selection

```mermaid
graph TB
    subgraph "Sitemap Source"
        SM1[📋 Official URL List]
        SM2[⚡ Fast Response]
        SM3[📅 Recently Updated]
        SM4[🎯 High Quality URLs]
        SM5[❌ May Miss Some Pages]
    end

    subgraph "Common Crawl Source"
        CC1[🌍 Comprehensive Coverage]
        CC2[📚 Historical Data]
        CC3[🔍 Deep Discovery]
        CC4[⏳ Slower Response]
        CC5[🧹 May Include Noise]
    end

    subgraph "Combined Strategy"
        CB1[🚀 Best of Both]
        CB2[📊 Maximum Coverage]
        CB3[✨ Automatic Deduplication]
        CB4[⚖️ Balanced Performance]
    end

    subgraph "Selection Guidelines"
        G1[Speed Critical → Sitemap Only]
        G2[Coverage Critical → Common Crawl]
        G3[Best Quality → Combined]
        G4[Unknown Domain → Combined]
    end

    style SM2 fill:#c8e6c9
    style SM4 fill:#c8e6c9
    style CC1 fill:#e3f2fd
    style CC3 fill:#e3f2fd
    style CB1 fill:#f3e5f5
    style CB3 fill:#f3e5f5
```
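To make the scoring pipeline concrete, here is a self-contained, illustrative BM25 scorer that mirrors the formula in the pipeline diagram above. It is a teaching sketch over toy head-metadata strings, not crawl4ai's internal implementation; `k1 = 1.5` and `b = 0.75` are common defaults assumed here.

```python
# Illustrative BM25:
#   score = Σ IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × |d| / avgdl))
import math
from collections import Counter

K1, B = 1.5, 0.75  # assumed defaults; both are tunable


def bm25_scores(query: str, docs: list[str]) -> list[float]:
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)           # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            denom = tf[term] + K1 * (1 - B + B * len(tokens) / avgdl)
            score += idf * tf[term] * (K1 + 1) / denom
        scores.append(score)
    return scores


# Head metadata (title + description) stands in for each URL's document.
docs = [
    "python async tutorial for beginners",
    "python documentation index",
    "javascript guide",
    "contact us page",
]
for doc, score in zip(docs, bm25_scores("python async tutorial", docs)):
    print(f"{score:5.2f}  {doc}")
```

And the complete discovery-to-crawl state machine, sketched end to end: multi-domain seeding followed by streamed crawling. This assumes the `many_urls()` helper and the `CrawlerRunConfig(stream=True)` behavior described in the docs; the domains and query are placeholders.

```python
# End-to-end sketch: seed several domains, then stream-crawl the ranked URLs.
# Assumes many_urls() returns {domain: [result dicts]} and that arun_many()
# yields results incrementally when stream=True.
import asyncio

from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, CrawlerRunConfig, SeedingConfig


async def discover_and_crawl() -> None:
    config = SeedingConfig(
        source="sitemap+cc",
        extract_head=True,
        query="api reference",
        scoring_method="bm25",
        score_threshold=0.4,  # ranking stage: apply score_threshold
        max_urls=50,          # limiting stage: apply max_urls
        concurrency=20,       # parallel discovery workers
        hits_per_sec=10,      # polite rate limiting
    )

    seeder = AsyncUrlSeeder()
    per_domain = await seeder.many_urls(["example.com", "example.org"], config)
    urls = [item["url"] for items in per_domain.values() for item in items]

    # Crawl execution: stream=True processes results as they complete
    # instead of waiting for the whole batch.
    async with AsyncWebCrawler() as crawler:
        run_config = CrawlerRunConfig(stream=True)
        async for result in await crawler.arun_many(urls, config=run_config):
            print("ok" if result.success else "failed", result.url)


asyncio.run(discover_and_crawl())
```

Per the selection guidelines above, swapping in `source="sitemap"` trades coverage for speed, while `source="cc"` does the reverse; the combined `"sitemap+cc"` used here favors coverage at the cost of slower discovery.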
**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)