feat(scraper): Enhance URL filtering and scoring systems

Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.

- Add Quick Start guide
Author: UncleCode
Date: 2024-11-08 18:45:12 +08:00
Commit: bae4665949 (parent: d11c004fbb)
10 changed files with 1451 additions and 157 deletions


@@ -0,0 +1,342 @@
# URL Filters and Scorers
The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
```mermaid
flowchart TB
Start([URL Input]) --> Chain[Filter Chain]
subgraph Chain Process
Chain --> Pattern{URL Pattern\nFilter}
Pattern -->|Match| Content{Content Type\nFilter}
Pattern -->|No Match| Reject1[Reject URL]
Content -->|Allowed| Domain{Domain\nFilter}
Content -->|Not Allowed| Reject2[Reject URL]
Domain -->|Allowed| Accept[Accept URL]
Domain -->|Blocked| Reject3[Reject URL]
end
subgraph Statistics
Pattern --> UpdatePattern[Update Pattern Stats]
Content --> UpdateContent[Update Content Stats]
Domain --> UpdateDomain[Update Domain Stats]
Accept --> UpdateChain[Update Chain Stats]
Reject1 --> UpdateChain
Reject2 --> UpdateChain
Reject3 --> UpdateChain
end
Accept --> End([End])
Reject1 --> End
Reject2 --> End
Reject3 --> End
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End accept;
class Pattern,Content,Domain decision;
class Reject1,Reject2,Reject3 reject;
class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
```
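The flow above corresponds directly to how a filter chain and a scorer are handed to a crawl strategy. As orientation, here is a minimal sketch of that wiring; it follows the `BFSScraperStrategy`/`AsyncWebScraper` usage from the examples guide, and the import paths for the filter and scorer classes are an assumption (adjust them to your installed version).
```python
import asyncio

from crawl4ai import AsyncWebCrawler
# Import paths for filters/scorers are assumed; adjust to your version.
from crawl4ai.scraper import (
    AsyncWebScraper, BFSScraperStrategy, FilterChain,
    URLPatternFilter, ContentTypeFilter, KeywordRelevanceScorer,
)

async def main():
    # Keep only HTML pages on example.com and favor Python-related URLs
    filter_chain = FilterChain([
        URLPatternFilter(["*.example.com/*"]),
        ContentTypeFilter(["text/html"]),
    ])
    scorer = KeywordRelevanceScorer(["python"], weight=1.0)

    strategy = BFSScraperStrategy(
        max_depth=2,
        filter_chain=filter_chain,
        url_scorer=scorer,
        max_concurrent=3,
    )
    crawler = AsyncWebCrawler()
    scraper = AsyncWebScraper(crawler, strategy)
    result = await scraper.ascrape("https://example.com/")
    print(result)

asyncio.run(main())
```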
## URL Filters
URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
### Available Filters
1. **URL Pattern Filter**
```python
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
```
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
2. **Content Type Filter**
```python
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
```
- Filter by MIME types
- Extension checking
- Support for multiple content types
3. **Domain Filter**
```python
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
```
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
### Creating Filter Chains
```python
# Create and configure a filter chain
filter_chain = FilterChain([
URLPatternFilter(["*.example.com/*"]),
ContentTypeFilter(["text/html"]),
DomainFilter(blocked_domains=["ads.*"])
])
# Add more filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*"])
)
```
```mermaid
flowchart TB
Start([URL Input]) --> Composite[Composite Scorer]
subgraph Scoring Process
Composite --> Keywords[Keyword Relevance]
Composite --> Path[Path Depth]
Composite --> Content[Content Type]
Composite --> Fresh[Freshness]
Composite --> Domain[Domain Authority]
Keywords --> KeywordScore[Calculate Score]
Path --> PathScore[Calculate Score]
Content --> ContentScore[Calculate Score]
Fresh --> FreshScore[Calculate Score]
Domain --> DomainScore[Calculate Score]
KeywordScore --> Weight1[Apply Weight]
PathScore --> Weight2[Apply Weight]
ContentScore --> Weight3[Apply Weight]
FreshScore --> Weight4[Apply Weight]
DomainScore --> Weight5[Apply Weight]
end
Weight1 --> Combine[Combine Scores]
Weight2 --> Combine
Weight3 --> Combine
Weight4 --> Combine
Weight5 --> Combine
Combine --> Normalize{Normalize?}
Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
Normalize -->|No| FinalScore[Final Score]
NormalizeScore --> FinalScore
FinalScore --> Stats[Update Statistics]
Stats --> End([End])
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
class Start,End calc;
class Keywords,Path,Content,Fresh,Domain scorer;
class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
class Normalize decision;
```
## URL Scorers
URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
### Available Scorers
1. **Keyword Relevance Scorer**
```python
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
```
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
2. **Path Depth Scorer**
```python
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
```
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
3. **Content Type Scorer**
```python
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
```
- Score based on file types
- Configurable type weights
- Pattern matching support
4. **Freshness Scorer**
```python
freshness_scorer = FreshnessScorer(weight=0.9)
```
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting
5. **Domain Authority Scorer**
```python
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
```
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains
### Combining Scorers
```python
# Create a composite scorer
composite_scorer = CompositeScorer([
KeywordRelevanceScorer(["python"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8)
], normalize=True)
```
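In a crawl, the composite scorer is normally passed to the strategy as `url_scorer`, but it can also be exercised directly. A small sketch, assuming scorers expose a public `score(url)` method returning a float (only `_calculate_score` appears later in this guide, so the public name is an assumption):
```python
# Ranking candidate URLs with the composite scorer defined above.
# Assumption: URLScorer subclasses expose score(url) -> float.
candidate_urls = [
    "https://example.com/python/tutorial",
    "https://example.com/a/b/c/d/e/archive",
    "https://example.com/2024/11/python-release",
]

ranked = sorted(candidate_urls, key=composite_scorer.score, reverse=True)
for url in ranked:
    print(f"{composite_scorer.score(url):.2f}  {url}")
```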
## Best Practices
### Filter Configuration
1. **Start Restrictive**
```python
# Begin with strict filters
filter_chain = FilterChain([
DomainFilter(allowed_domains=["example.com"]),
ContentTypeFilter(["text/html"])
])
```
2. **Layer Filters**
```python
# Add more specific filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*", "*/blog/*"])
)
```
3. **Monitor Filter Statistics**
```python
# Check filter performance
for filter in filter_chain.filters:
print(f"{filter.name}: {filter.stats.rejected_urls} rejected")
```
### Scorer Configuration
1. **Balance Weights** (a sketch of a possible `create_balanced_scorer` follows this list)
```python
# Balanced scoring configuration
scorer = create_balanced_scorer()
```
2. **Customize for Content**
```python
# News site configuration
news_scorer = CompositeScorer([
KeywordRelevanceScorer(["news", "article"], weight=1.0),
FreshnessScorer(weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.5)
])
```
3. **Monitor Scoring Statistics**
```python
# Check scoring distribution
print(f"Average score: {scorer.stats.average_score}")
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
```
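Returning to item 1 above: `create_balanced_scorer()` is treated as a convenience factory. If your version does not provide it, a rough equivalent can be assembled from the scorers described earlier; the keyword list and weights below are illustrative assumptions, not recommended values.
```python
# Illustrative stand-in for create_balanced_scorer(); weights are assumptions.
def create_balanced_scorer() -> CompositeScorer:
    return CompositeScorer([
        KeywordRelevanceScorer(["guide", "tutorial", "docs"], weight=1.0),
        PathDepthScorer(optimal_depth=3, weight=0.7),
        FreshnessScorer(weight=0.8),
        ContentTypeScorer({r'\.html$': 1.0, r'\.pdf$': 0.6}),
    ], normalize=True)
```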
## Common Use Cases
### Blog Crawling
```python
blog_config = {
'filters': FilterChain([
URLPatternFilter(["*/blog/*", "*/post/*"]),
ContentTypeFilter(["text/html"])
]),
'scorer': CompositeScorer([
FreshnessScorer(weight=1.0),
KeywordRelevanceScorer(["blog", "article"], weight=0.8)
])
}
```
### Documentation Sites
```python
docs_config = {
'filters': FilterChain([
URLPatternFilter(["*/docs/*", "*/guide/*"]),
ContentTypeFilter(["text/html", "application/pdf"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=3, weight=1.0),
KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
])
}
```
### E-commerce Sites
```python
ecommerce_config = {
'filters': FilterChain([
URLPatternFilter(["*/product/*", "*/category/*"]),
DomainFilter(blocked_domains=["ads.*", "tracker.*"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=2, weight=1.0),
ContentTypeScorer({
r'/product/': 1.0,
r'/category/': 0.8
})
])
}
```
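These config dictionaries are only a packaging convention; the pieces are passed to the strategy individually. A sketch of wiring one in, using the `BFSScraperStrategy` parameters from the examples guide:
```python
# Hand the use-case config to a strategy (parameter names follow the examples guide).
strategy = BFSScraperStrategy(
    max_depth=3,
    filter_chain=ecommerce_config['filters'],
    url_scorer=ecommerce_config['scorer'],
    max_concurrent=5,
)
```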
## Advanced Topics
### Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
```
### Custom Scorers
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
```
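As a slightly more concrete illustration, the sketch below rejects URLs carrying common tracking query parameters and favors shorter URLs. Only the interface (`apply`, `_calculate_score`) is taken from the stubs above; everything else is illustrative.
```python
from urllib.parse import urlparse, parse_qs

class NoTrackingParamsFilter(URLFilter):
    """Reject URLs that carry common tracking query parameters."""

    TRACKING_KEYS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

    def apply(self, url: str) -> bool:
        params = parse_qs(urlparse(url).query)
        # Accept the URL only if none of the tracking keys are present
        return not (self.TRACKING_KEYS & params.keys())

class ShortUrlScorer(URLScorer):
    """Favor shorter URLs: score decays linearly with URL length."""

    def _calculate_score(self, url: str) -> float:
        return max(0.0, 1.0 - len(url) / 200)
```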
For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).

docs/scrapper/how_to_use.md (new file, 206 lines)

@@ -0,0 +1,206 @@
# Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
## Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai import AsyncWebCrawler
# Filter classes are assumed to be exported alongside FilterChain
from crawl4ai.scraper import (
    AsyncWebScraper,
    BFSScraperStrategy,
    FilterChain,
    URLPatternFilter,
    ContentTypeFilter,
)
# Create simple filter chain
filter_chain = FilterChain([
URLPatternFilter("*/blog/*"),
ContentTypeFilter(["text/html"])
])
# Initialize strategy
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=filter_chain,
url_scorer=None,
max_concurrent=3
)
# Create and run the scraper (run this from within an async function)
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
### Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
## Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
import re  # needed for the regex pattern below

# Create comprehensive filter chain (filter/scorer/strategy classes are
# imported as in the basic example above)
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*/article/*",
re.compile(r"\d{4}/\d{2}/.*")
]),
ContentTypeFilter(["text/html"])
])
# Create intelligent scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(
keywords=["news", "breaking"],
weight=1.0
),
PathDepthScorer(optimal_depth=3, weight=0.7),
FreshnessScorer(weight=0.9)
])
# Initialize advanced strategy
strategy = BFSScraperStrategy(
max_depth=4,
filter_chain=filter_chain,
url_scorer=scorer,
max_concurrent=5
)
```
### Features Demonstrated
1. **Advanced Filtering**
- Domain filtering
- Pattern matching
- Content type control
2. **Intelligent Scoring**
- Keyword relevance
- Path optimization
- Freshness priority
3. **Monitoring** (see the logging sketch after this list)
- Progress tracking
- Error handling
- Statistics collection
4. **Resource Management**
- Concurrent processing
- Rate limiting
- Cleanup handling
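A sketch of the monitoring side, assuming Python's standard `logging` module plus the `stats` attributes described in the filters and scorers guide:
```python
import logging

# Standard logging produces the INFO/DEBUG lines shown in the sample output below.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")

# After the crawl, inspect filter and scorer statistics
# (attribute names follow the filters/scorers guide).
for url_filter in filter_chain.filters:
    print(f"{url_filter.name}: {url_filter.stats.rejected_urls} rejected")

print(f"Average score: {scorer.stats.average_score}")
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
```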
## Running the Examples
```bash
# Basic usage
python basic_scraper_example.py
# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
## Example Output
### Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
### Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
## Customization
### Adding Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
filter_chain.add_filter(CustomFilter())
```
### Custom Scoring Logic
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
scorer = CompositeScorer([
CustomScorer(weight=1.0),
...
])
```
## Best Practices
1. **Start Simple**
- Begin with basic filtering
- Add features incrementally
- Test thoroughly at each step
2. **Monitor Performance**
- Watch memory usage
- Track processing times
- Adjust concurrency as needed
3. **Handle Errors**
- Implement proper error handling
- Log important events
- Track error statistics
4. **Optimize Resources**
- Set appropriate delays
- Limit concurrent requests
- Use streaming for large crawls
## Troubleshooting
Common issues and solutions:
1. **Too Many Requests**
```python
strategy = BFSScraperStrategy(
max_concurrent=3, # Reduce concurrent requests
min_crawl_delay=2 # Increase delay between requests
)
```
2. **Memory Issues**
```python
# Use streaming mode for large crawls
async for result in scraper.ascrape(url, stream=True):
process_result(result)
```
3. **Missing Content**
```python
# Check your filter chain
filter_chain = FilterChain([
URLPatternFilter("*"), # Broaden patterns
ContentTypeFilter(["*"]) # Accept all content
])
```
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).


@@ -0,0 +1,111 @@
import unittest
import os
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import RegexChunking, FixedLengthWordChunking, SlidingWindowChunking
from crawl4ai.extraction_strategy import CosineStrategy, LLMExtractionStrategy, TopicExtractionStrategy, NoExtractionStrategy
class TestWebCrawler(unittest.TestCase):
def setUp(self):
self.crawler = WebCrawler()
def test_warmup(self):
self.crawler.warmup()
self.assertTrue(self.crawler.ready, "WebCrawler failed to warm up")
def test_run_default_strategies(self):
result = self.crawler.run(
url='https://www.nbcnews.com/business',
word_count_threshold=5,
chunking_strategy=RegexChunking(),
extraction_strategy=CosineStrategy(), bypass_cache=True
)
self.assertTrue(result.success, "Failed to crawl and extract using default strategies")
def test_run_different_strategies(self):
url = 'https://www.nbcnews.com/business'
# Test with FixedLengthWordChunking and LLMExtractionStrategy
result = self.crawler.run(
url=url,
word_count_threshold=5,
chunking_strategy=FixedLengthWordChunking(chunk_size=100),
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-3.5-turbo", api_token=os.getenv('OPENAI_API_KEY')), bypass_cache=True
)
self.assertTrue(result.success, "Failed to crawl and extract with FixedLengthWordChunking and LLMExtractionStrategy")
# Test with SlidingWindowChunking and TopicExtractionStrategy
result = self.crawler.run(
url=url,
word_count_threshold=5,
chunking_strategy=SlidingWindowChunking(window_size=100, step=50),
extraction_strategy=TopicExtractionStrategy(num_keywords=5), bypass_cache=True
)
self.assertTrue(result.success, "Failed to crawl and extract with SlidingWindowChunking and TopicExtractionStrategy")
def test_invalid_url(self):
with self.assertRaises(Exception) as context:
self.crawler.run(url='invalid_url', bypass_cache=True)
self.assertIn("Invalid URL", str(context.exception))
def test_unsupported_extraction_strategy(self):
with self.assertRaises(Exception) as context:
self.crawler.run(url='https://www.nbcnews.com/business', extraction_strategy="UnsupportedStrategy", bypass_cache=True)
self.assertIn("Unsupported extraction strategy", str(context.exception))
def test_invalid_css_selector(self):
with self.assertRaises(ValueError) as context:
self.crawler.run(url='https://www.nbcnews.com/business', css_selector="invalid_selector", bypass_cache=True)
self.assertIn("Invalid CSS selector", str(context.exception))
def test_crawl_with_cache_and_bypass_cache(self):
url = 'https://www.nbcnews.com/business'
# First crawl with cache enabled
result = self.crawler.run(url=url, bypass_cache=False)
self.assertTrue(result.success, "Failed to crawl and cache the result")
# Second crawl with bypass_cache=True
result = self.crawler.run(url=url, bypass_cache=True)
self.assertTrue(result.success, "Failed to bypass cache and fetch fresh data")
def test_fetch_multiple_pages(self):
urls = [
'https://www.nbcnews.com/business',
'https://www.bbc.com/news'
]
results = []
for url in urls:
result = self.crawler.run(
url=url,
word_count_threshold=5,
chunking_strategy=RegexChunking(),
extraction_strategy=CosineStrategy(),
bypass_cache=True
)
results.append(result)
self.assertEqual(len(results), 2, "Failed to crawl and extract multiple pages")
for result in results:
self.assertTrue(result.success, "Failed to crawl and extract a page in the list")
def test_run_fixed_length_word_chunking_and_no_extraction(self):
result = self.crawler.run(
url='https://www.nbcnews.com/business',
word_count_threshold=5,
chunking_strategy=FixedLengthWordChunking(chunk_size=100),
extraction_strategy=NoExtractionStrategy(), bypass_cache=True
)
self.assertTrue(result.success, "Failed to crawl and extract with FixedLengthWordChunking and NoExtractionStrategy")
def test_run_sliding_window_and_no_extraction(self):
result = self.crawler.run(
url='https://www.nbcnews.com/business',
word_count_threshold=5,
chunking_strategy=SlidingWindowChunking(window_size=100, step=50),
extraction_strategy=NoExtractionStrategy(), bypass_cache=True
)
self.assertTrue(result.success, "Failed to crawl and extract with SlidingWindowChunking and NoExtractionStrategy")
if __name__ == '__main__':
unittest.main()