## URL Seeding Workflows and Architecture
Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.
### URL Seeding vs Deep Crawling Strategy Comparison
```mermaid
graph TB
subgraph "Deep Crawling Approach"
A1[Start URL] --> A2[Load Page]
A2 --> A3[Extract Links]
A3 --> A4{More Links?}
A4 -->|Yes| A5[Queue Next Page]
A5 --> A2
A4 -->|No| A6[Complete]
A7[⏱️ Real-time Discovery]
A8[🐌 Sequential Processing]
A9[🔍 Limited by Page Structure]
A10[💾 High Memory Usage]
end
subgraph "URL Seeding Approach"
B1[Domain Input] --> B2[Query Sitemap]
B1 --> B3[Query Common Crawl]
B2 --> B4[Merge Results]
B3 --> B4
B4 --> B5[Apply Filters]
B5 --> B6[Score Relevance]
B6 --> B7[Rank Results]
B7 --> B8[Select Top URLs]
B9[⚡ Instant Discovery]
B10[🚀 Parallel Processing]
B11[🎯 Pattern-based Filtering]
B12[💡 Smart Relevance Scoring]
end
style A1 fill:#ffecb3
style B1 fill:#e8f5e8
style A6 fill:#ffcdd2
style B8 fill:#c8e6c9
```
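For contrast in code, here is a minimal sketch of the seeding approach on the right, assuming the `AsyncUrlSeeder` and `SeedingConfig` API from the Crawl4AI URL Seeding guide (the async-context-manager usage is an assumption; adjust to your installed version):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    # Discover URLs from the sitemap up front -- no pages are loaded,
    # unlike the page-by-page loop of deep crawling.
    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", SeedingConfig(source="sitemap"))
        print(f"Discovered {len(urls)} URLs in one pass")

asyncio.run(main())
```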
### URL Discovery Data Flow
```mermaid
sequenceDiagram
participant User
participant Seeder as AsyncUrlSeeder
participant SM as Sitemap
participant CC as Common Crawl
participant Filter as URL Filter
participant Scorer as BM25 Scorer
User->>Seeder: urls("example.com", config)
par Parallel Data Sources
Seeder->>SM: Fetch sitemap.xml
SM-->>Seeder: 500 URLs
and
Seeder->>CC: Query Common Crawl
CC-->>Seeder: 2000 URLs
end
Seeder->>Seeder: Merge and deduplicate
Note over Seeder: 2200 unique URLs
Seeder->>Filter: Apply pattern filter
Filter-->>Seeder: 800 matching URLs
alt extract_head=True
loop For each URL
Seeder->>Seeder: Extract <head> metadata
end
Note over Seeder: Title, description, keywords
end
alt query provided
Seeder->>Scorer: Calculate relevance scores
Scorer-->>Seeder: Scored URLs
Seeder->>Seeder: Filter by score_threshold
Note over Seeder: 200 relevant URLs
end
Seeder->>Seeder: Sort by relevance
Seeder->>Seeder: Apply max_urls limit
Seeder-->>User: Top 100 URLs ready for crawling
```
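The sequence above maps onto a single `urls()` call; a hedged sketch follows (the result fields `url` and `relevance_score` are taken from the URL Seeding guide, but verify them against your version):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    config = SeedingConfig(
        source="sitemap+cc",         # both sources queried in parallel, then deduplicated
        pattern="*/blog/*",          # pattern filter applied to the merged URL set
        extract_head=True,           # title/description/keywords per URL
        query="async web crawling",  # triggers BM25 relevance scoring
        scoring_method="bm25",
        score_threshold=0.3,
        max_urls=100,
    )
    async with AsyncUrlSeeder() as seeder:
        results = await seeder.urls("example.com", config)
    for item in results[:5]:
        print(item["url"], item.get("relevance_score"))

asyncio.run(main())
```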
### SeedingConfig Decision Tree
```mermaid
flowchart TD
A[SeedingConfig Setup] --> B{Data Source Strategy?}
B -->|Fast & Official| C["source='sitemap'"]
B -->|Comprehensive| D["source='cc'"]
B -->|Maximum Coverage| E["source='sitemap+cc'"]
C --> F{Need Filtering?}
D --> F
E --> F
F -->|Yes| G[Set URL Pattern]
F -->|No| H["pattern='*'"]
G --> I{Pattern Examples}
I --> I1["pattern='*/blog/*'"]
I --> I2["pattern='*/docs/api/*'"]
I --> I3["pattern='*.pdf'"]
I --> I4["pattern='*/product/*'"]
H --> J{Need Metadata?}
I1 --> J
I2 --> J
I3 --> J
I4 --> J
J -->|Yes| K[extract_head=True]
J -->|No| L[extract_head=False]
K --> M{Need Validation?}
L --> M
M -->|Yes| N[live_check=True]
M -->|No| O[live_check=False]
N --> P{Need Relevance Scoring?}
O --> P
P -->|Yes| Q[Set Query + BM25]
P -->|No| R[Skip Scoring]
Q --> S["query='search terms'"]
S --> T["scoring_method='bm25'"]
T --> U[score_threshold=0.3]
R --> V[Performance Tuning]
U --> V
V --> W[Set max_urls]
W --> X[Set concurrency]
X --> Y[Set hits_per_sec]
Y --> Z[Configuration Complete]
style A fill:#e3f2fd
style Z fill:#c8e6c9
style K fill:#fff3e0
style N fill:#fff3e0
style Q fill:#f3e5f5
```
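One full path through this tree, expressed as a config (every parameter name below appears in the tree above):
```python
from crawl4ai import SeedingConfig

config = SeedingConfig(
    source="sitemap+cc",       # maximum coverage
    pattern="*/blog/*",        # filter early
    extract_head=True,         # metadata needed for scoring
    live_check=True,           # validate that URLs are reachable
    query="python tutorials",  # enables relevance scoring
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=500,              # performance tuning
    concurrency=20,
    hits_per_sec=10,
)
```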
### BM25 Relevance Scoring Pipeline
```mermaid
graph TB
subgraph "Text Corpus Preparation"
A1[URL Collection] --> A2[Extract Metadata]
A2 --> A3[Title + Description + Keywords]
A3 --> A4[Tokenize Text]
A4 --> A5[Remove Stop Words]
A5 --> A6[Create Document Corpus]
end
subgraph "BM25 Algorithm"
B1[Query Terms] --> B2[Term Frequency Calculation]
A6 --> B2
B2 --> B3[Inverse Document Frequency]
B3 --> B4[BM25 Score Calculation]
B4 --> B5[Score = Σ(IDF × TF × K1+1)/(TF + K1×(1-b+b×|d|/avgdl))]
end
subgraph "Scoring Results"
B5 --> C1[URL Relevance Scores]
C1 --> C2{Score ≥ Threshold?}
C2 -->|Yes| C3[Include in Results]
C2 -->|No| C4[Filter Out]
C3 --> C5[Sort by Score DESC]
C5 --> C6[Return Top URLs]
end
subgraph "Example Scores"
D1["'python async tutorial' → 0.85"]
D2["'python documentation' → 0.72"]
D3["'javascript guide' → 0.23"]
D4["'contact us page' → 0.05"]
end
style B5 fill:#e3f2fd
style C6 fill:#c8e6c9
style D1 fill:#c8e6c9
style D2 fill:#c8e6c9
style D3 fill:#ffecb3
style D4 fill:#ffcdd2
```
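A self-contained sketch of the textbook BM25 formula shown above (not Crawl4AI's internal implementation; tokenization and stop-word removal are deliberately simplified away):
```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document (e.g. a URL's title + description) against the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)   # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]                                  # term frequency
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores

docs = ["python async tutorial", "python documentation",
        "javascript guide", "contact us page"]
print(bm25_scores("python async", docs))  # highest score for the first doc
```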
### Multi-Domain Discovery Architecture
```mermaid
graph TB
subgraph "Input Layer"
A1[Domain List]
A2[SeedingConfig]
A3[Query Terms]
end
subgraph "Discovery Engine"
B1[AsyncUrlSeeder]
B2[Parallel Workers]
B3[Rate Limiter]
B4[Memory Manager]
end
subgraph "Data Sources"
C1[Sitemap Fetcher]
C2[Common Crawl API]
C3[Live URL Checker]
C4[Metadata Extractor]
end
subgraph "Processing Pipeline"
D1[URL Deduplication]
D2[Pattern Filtering]
D3[Relevance Scoring]
D4[Quality Assessment]
end
subgraph "Output Layer"
E1[Scored URL Lists]
E2[Domain Statistics]
E3[Performance Metrics]
E4[Cache Storage]
end
A1 --> B1
A2 --> B1
A3 --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B2 --> C1
B2 --> C2
B2 --> C3
B2 --> C4
C1 --> D1
C2 --> D1
C3 --> D2
C4 --> D3
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> E1
B4 --> E2
B3 --> E3
D1 --> E4
style B1 fill:#e3f2fd
style D3 fill:#f3e5f5
style E1 fill:#c8e6c9
```
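In code, multi-domain discovery is a single call; `many_urls()` and its dict-of-domains return shape follow the URL Seeding guide, so treat both as assumptions to verify:
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    config = SeedingConfig(source="sitemap", extract_head=True,
                           query="machine learning", scoring_method="bm25",
                           max_urls=50)
    async with AsyncUrlSeeder() as seeder:
        # One seeder, many domains -- parallel workers run under the hood.
        results = await seeder.many_urls(["site-a.com", "site-b.com"], config)
    for domain, urls in results.items():
        print(domain, len(urls))

asyncio.run(main())
```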
### Complete Discovery-to-Crawl Pipeline
```mermaid
stateDiagram-v2
[*] --> Discovery
Discovery --> SourceSelection: Configure data sources
SourceSelection --> Sitemap: source="sitemap"
SourceSelection --> CommonCrawl: source="cc"
SourceSelection --> Both: source="sitemap+cc"
Sitemap --> URLCollection
CommonCrawl --> URLCollection
Both --> URLCollection
URLCollection --> Filtering: Apply patterns
Filtering --> MetadataExtraction: extract_head=True
Filtering --> LiveValidation: extract_head=False
MetadataExtraction --> LiveValidation: live_check=True
MetadataExtraction --> RelevanceScoring: live_check=False
LiveValidation --> RelevanceScoring
RelevanceScoring --> ResultRanking: query provided
RelevanceScoring --> ResultLimiting: no query
ResultRanking --> ResultLimiting: apply score_threshold
ResultLimiting --> URLSelection: apply max_urls
URLSelection --> CrawlPreparation: URLs ready
CrawlPreparation --> CrawlExecution: AsyncWebCrawler
CrawlExecution --> StreamProcessing: stream=True
CrawlExecution --> BatchProcessing: stream=False
StreamProcessing --> [*]
BatchProcessing --> [*]
note right of Discovery : 🔍 Smart URL Discovery
note right of URLCollection : 📚 Merge & Deduplicate
note right of RelevanceScoring : 🎯 BM25 Algorithm
note right of CrawlExecution : 🕷️ High-Performance Crawling
```
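End to end, the state machine above reduces to discover-then-crawl; a sketch of the streaming path (`arun_many` with `stream=True`, per the Crawl4AI docs):
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

async def main():
    # Discovery: seconds, not hours.
    seed_cfg = SeedingConfig(source="sitemap", pattern="*/docs/*", max_urls=20)
    async with AsyncUrlSeeder() as seeder:
        discovered = await seeder.urls("example.com", seed_cfg)
    urls = [item["url"] for item in discovered]

    # Crawl execution: stream results as they complete.
    run_cfg = CrawlerRunConfig(stream=True)
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(urls, config=run_cfg):
            print(result.url, result.success)

asyncio.run(main())
```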
### Performance Optimization Strategies
```mermaid
graph LR
subgraph "Input Optimization"
A1[Smart Source Selection] --> A2[Sitemap First]
A2 --> A3[Add CC if Needed]
A3 --> A4[Pattern Filtering Early]
end
subgraph "Processing Optimization"
B1[Parallel Workers] --> B2[Bounded Queues]
B2 --> B3[Rate Limiting]
B3 --> B4[Memory Management]
B4 --> B5[Lazy Evaluation]
end
subgraph "Output Optimization"
C1[Relevance Threshold] --> C2[Max URL Limits]
C2 --> C3[Caching Strategy]
C3 --> C4[Streaming Results]
end
subgraph "Performance Metrics"
D1[URLs/Second: 100-1000]
D2[Memory Usage: Bounded]
D3[Network Efficiency: 95%+]
D4[Cache Hit Rate: 80%+]
end
A4 --> B1
B5 --> C1
C4 --> D1
style A2 fill:#e8f5e8
style B2 fill:#e3f2fd
style C3 fill:#f3e5f5
style D3 fill:#c8e6c9
```
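The input- and output-side tuning knobs map directly onto `SeedingConfig`; a sketch (the cache-bypass flag `force` is an assumption taken from the URL Seeding guide):
```python
from crawl4ai import SeedingConfig

tuned = SeedingConfig(
    source="sitemap",      # fastest source first; add "+cc" only if coverage falls short
    pattern="*/docs/*",    # filter early, before metadata extraction
    score_threshold=0.3,   # relevance cutoff keeps the result set small
    max_urls=1000,         # hard cap on output
    concurrency=50,        # parallel workers
    hits_per_sec=20,       # rate limit for politeness
    force=False,           # False reuses cached discovery results (assumed flag)
)
```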
### URL Discovery vs Traditional Crawling Comparison
```mermaid
graph TB
subgraph "Traditional Approach"
T1[Start URL] --> T2[Crawl Page]
T2 --> T3[Extract Links]
T3 --> T4[Queue New URLs]
T4 --> T2
T5[❌ Time: Hours/Days]
T6[❌ Resource Heavy]
T7[❌ Depth Limited]
T8[❌ Discovery Bias]
end
subgraph "URL Seeding Approach"
S1[Domain Input] --> S2[Query All Sources]
S2 --> S3[Pattern Filter]
S3 --> S4[Relevance Score]
S4 --> S5[Select Best URLs]
S5 --> S6[Ready to Crawl]
S7[✅ Time: Seconds/Minutes]
S8[✅ Resource Efficient]
S9[✅ Complete Coverage]
S10[✅ Quality Focused]
end
subgraph "Use Case Decision Matrix"
U1["Small Sites < 1000 pages"] --> U2[Use Deep Crawling]
U3["Large Sites > 10000 pages"] --> U4[Use URL Seeding]
U5[Unknown Structure] --> U6[Start with Seeding]
U7[Real-time Discovery] --> U8[Use Deep Crawling]
U9[Quality over Quantity] --> U10[Use URL Seeding]
end
style S6 fill:#c8e6c9
style S7 fill:#c8e6c9
style S8 fill:#c8e6c9
style S9 fill:#c8e6c9
style S10 fill:#c8e6c9
style T5 fill:#ffcdd2
style T6 fill:#ffcdd2
style T7 fill:#ffcdd2
style T8 fill:#ffcdd2
```
### Data Source Characteristics and Selection
```mermaid
graph TB
subgraph "Sitemap Source"
SM1[📋 Official URL List]
SM2[⚡ Fast Response]
SM3[📅 Recently Updated]
SM4[🎯 High Quality URLs]
SM5[❌ May Miss Some Pages]
end
subgraph "Common Crawl Source"
CC1[🌐 Comprehensive Coverage]
CC2[📚 Historical Data]
CC3[🔍 Deep Discovery]
CC4[⏳ Slower Response]
CC5[🧹 May Include Noise]
end
subgraph "Combined Strategy"
CB1[🚀 Best of Both]
CB2[📊 Maximum Coverage]
CB3[✨ Automatic Deduplication]
CB4[⚖️ Balanced Performance]
end
subgraph "Selection Guidelines"
G1[Speed Critical → Sitemap Only]
G2[Coverage Critical → Common Crawl]
G3[Best Quality → Combined]
G4[Unknown Domain → Combined]
end
style SM2 fill:#c8e6c9
style SM4 fill:#c8e6c9
style CC1 fill:#e3f2fd
style CC3 fill:#e3f2fd
style CB1 fill:#f3e5f5
style CB3 fill:#f3e5f5
```
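These guidelines can be encoded as a trivial helper; `pick_source` is purely illustrative and not part of Crawl4AI:
```python
def pick_source(speed_critical: bool = False, coverage_critical: bool = False) -> str:
    """Map the selection guidelines above onto a SeedingConfig source string."""
    if speed_critical:
        return "sitemap"    # official, fast, high quality
    if coverage_critical:
        return "cc"         # comprehensive, historical, slower
    return "sitemap+cc"     # best quality / unknown domain
```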
**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)