Refactor: Renamed scrape to traverse and deep_crawl in a few sections where it applies
@@ -84,4 +84,4 @@ SHOW_DEPRECATION_WARNINGS = True
 SCREENSHOT_HEIGHT_TRESHOLD = 10000
 PAGE_TIMEOUT = 60000
 DOWNLOAD_PAGE_TIMEOUT = 60000
-SCRAPER_BATCH_SIZE = 5
+DEEP_CRAWL_BATCH_SIZE = 5

@@ -8,7 +8,7 @@ from ..models import CrawlResult, TraversalStats
 from .filters import FilterChain
 from .scorers import URLScorer
 from .traversal_strategy import TraversalStrategy
-from ..config import SCRAPER_BATCH_SIZE
+from ..config import DEEP_CRAWL_BATCH_SIZE


 class BFSTraversalStrategy(TraversalStrategy):

@@ -139,7 +139,7 @@ class BFSTraversalStrategy(TraversalStrategy):
         """
         # Collect batch of URLs into active_crawls to process
         async with active_crawls_lock:
-            while len(active_crawls) < SCRAPER_BATCH_SIZE and not queue.empty():
+            while len(active_crawls) < DEEP_CRAWL_BATCH_SIZE and not queue.empty():
                 score, depth, url, parent_url = await queue.get()
                 active_crawls[url] = {
                     "depth": depth,

@@ -1,166 +0,0 @@
# AsyncWebScraper: Smart Web Crawling Made Easy

AsyncWebScraper is a powerful and flexible web scraping tool that makes it easy to collect data from websites efficiently. Whether you need to scrape a few pages or an entire website, AsyncWebScraper handles the complexity of web crawling while giving you fine-grained control over the process.

## How It Works

```mermaid
flowchart TB
    Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
    Init --> InputURL[Receive URL to scrape]
    InputURL --> Decision{Stream or\nCollect?}

    %% Streaming Path
    Decision -->|Stream| StreamInit[Initialize Streaming Mode]
    StreamInit --> StreamStrategy[Call Strategy.ascrape]
    StreamStrategy --> AsyncGen[Create Async Generator]
    AsyncGen --> ProcessURL[Process Next URL]
    ProcessURL --> FetchContent[Fetch Page Content]
    FetchContent --> Extract[Extract Data]
    Extract --> YieldResult[Yield CrawlResult]
    YieldResult --> CheckMore{More URLs?}
    CheckMore -->|Yes| ProcessURL
    CheckMore -->|No| StreamEnd([End Stream])

    %% Collecting Path
    Decision -->|Collect| CollectInit[Initialize Collection Mode]
    CollectInit --> CollectStrategy[Call Strategy.ascrape]
    CollectStrategy --> CollectGen[Create Async Generator]
    CollectGen --> ProcessURLColl[Process Next URL]
    ProcessURLColl --> FetchContentColl[Fetch Page Content]
    FetchContentColl --> ExtractColl[Extract Data]
    ExtractColl --> StoreColl[Store in Dictionary]
    StoreColl --> CheckMoreColl{More URLs?}
    CheckMoreColl -->|Yes| ProcessURLColl
    CheckMoreColl -->|No| CreateResult[Create ScraperResult]
    CreateResult --> ReturnResult([Return Result])

    %% Parallel Processing
    subgraph Parallel
        ProcessURL
        FetchContent
        Extract
        ProcessURLColl
        FetchContentColl
        ExtractColl
    end

    %% Error Handling
    FetchContent --> ErrorCheck{Error?}
    ErrorCheck -->|Yes| LogError[Log Error]
    LogError --> UpdateStats[Update Error Stats]
    UpdateStats --> CheckMore
    ErrorCheck -->|No| Extract

    FetchContentColl --> ErrorCheckColl{Error?}
    ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
    LogErrorColl --> UpdateStatsColl[Update Error Stats]
    UpdateStatsColl --> CheckMoreColl
    ErrorCheckColl -->|No| ExtractColl

    %% Style definitions
    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
    classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;

    class Start,StreamEnd,ReturnResult start;
    class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
    class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
    class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;
```

AsyncWebScraper uses an intelligent crawling system that can navigate through websites following your specified strategy. It supports two main modes of operation:

### 1. Streaming Mode
```python
async for result in scraper.ascrape(url, stream=True):
    print(f"Found data on {result.url}")
    process_data(result.data)
```
- Perfect for processing large websites
- Memory efficient - handles one page at a time
- Ideal for real-time data processing
- Great for monitoring or continuous scraping tasks

### 2. Collection Mode
```python
result = await scraper.ascrape(url)
print(f"Scraped {len(result.crawled_urls)} pages")
process_all_data(result.extracted_data)
```
- Collects all data before returning
- Best for when you need the complete dataset
- Easier to work with for batch processing
- Includes comprehensive statistics

## Key Features

- **Smart Crawling**: Automatically follows relevant links while avoiding duplicates
- **Parallel Processing**: Scrapes multiple pages simultaneously for better performance
- **Memory Efficient**: Choose between streaming and collecting based on your needs
- **Error Resilient**: Continues working even if some pages fail to load
- **Progress Tracking**: Monitor the scraping progress in real-time
- **Customizable**: Configure crawling strategy, filters, and scoring to match your needs

## Quick Start

```python
import asyncio

from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
from crawl4ai.async_webcrawler import AsyncWebCrawler

# Initialize the scraper
crawler = AsyncWebCrawler()
strategy = BFSStrategy(
    max_depth=2,  # How deep to crawl
    url_pattern="*.example.com/*"  # What URLs to follow
)
scraper = AsyncWebScraper(crawler, strategy)

# Start scraping
async def main():
    # Collect all results
    result = await scraper.ascrape("https://example.com")
    print(f"Found {len(result.extracted_data)} pages")

    # Or stream results
    async for page in scraper.ascrape("https://example.com", stream=True):
        print(f"Processing {page.url}")

asyncio.run(main())
```

## Best Practices

1. **Choose the Right Mode**
   - Use streaming for large websites or real-time processing
   - Use collecting for smaller sites or when you need the complete dataset

2. **Configure Depth**
   - Start with a small depth (2-3) and increase if needed
   - Higher depths mean exponentially more pages to crawl

3. **Set Appropriate Filters**
   - Use URL patterns to stay within relevant sections
   - Set content type filters to only process useful pages

4. **Handle Resources Responsibly**
   - Enable parallel processing for faster results
   - Consider the target website's capacity
   - Implement appropriate delays between requests (see the sketch after this list)
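As a rough illustration of these practices, here is a minimal sketch that combines a shallow crawl depth, a URL pattern that keeps the crawl on one site, and a client-side pause between processed pages. It reuses the `BFSStrategy` and `AsyncWebScraper` calls from the Quick Start above; the delay is plain `asyncio.sleep` in user code, not a built-in option, and because pages are fetched in parallel batches it only approximates request pacing.

```python
import asyncio

from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
from crawl4ai.async_webcrawler import AsyncWebCrawler

async def polite_crawl():
    crawler = AsyncWebCrawler()
    strategy = BFSStrategy(
        max_depth=2,  # keep depth small; page counts grow fast with depth
        url_pattern="*.example.com/*"  # stay within the relevant site section
    )
    scraper = AsyncWebScraper(crawler, strategy)

    # Stream results so memory stays flat, and pause briefly between pages
    # so the consumer does not race ahead of the target site's capacity.
    async for page in scraper.ascrape("https://example.com", stream=True):
        print(f"Processing {page.url}")
        await asyncio.sleep(1.0)  # illustrative client-side delay; tune per site

asyncio.run(polite_crawl())
```
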
## Common Use Cases

- **Content Aggregation**: Collect articles, blog posts, or news from multiple pages
- **Data Extraction**: Gather product information, prices, or specifications
- **Site Mapping**: Create a complete map of a website's structure
- **Content Monitoring**: Track changes or updates across multiple pages
- **Data Mining**: Extract and analyze patterns across web pages

## Advanced Features

- Custom scoring algorithms for prioritizing important pages (see the sketch after this list)
- URL filters for focusing on specific site sections
- Content type filtering for processing only relevant pages
- Progress tracking for monitoring long-running scrapes
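To give a feel for custom scoring, here is a minimal sketch of a scorer that prefers documentation-style URLs. A `URLScorer` base class exists in the scorers module (it is imported in the traversal strategy shown in the diff above), but its exact interface is not documented here, so the import path, the `score` method name, and the way the scorer is wired into a strategy are assumptions for illustration only.

```python
from crawl4ai.scraper.scorers import URLScorer  # assumed path, based on "from .scorers import URLScorer"

class DocsFirstScorer(URLScorer):
    """Toy scorer: rank documentation-looking URLs above everything else."""

    def score(self, url: str) -> float:  # assumed hook name; check the real URLScorer interface
        # Higher scores are assumed to be crawled first.
        if "/docs/" in url or "/guide/" in url:
            return 1.0
        return 0.1

# Assumed wiring: strategies are expected to accept a scorer for prioritization.
# strategy = BFSStrategy(max_depth=2, url_pattern="*.example.com/*", scorer=DocsFirstScorer())
```
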
Need more help? Check out our [examples repository](https://github.com/example/crawl4ai/examples) or join our [community Discord](https://discord.gg/example).