Squashed commit of the following:

commit 2def6524cdacb69c72760bf55a41089257c0bb07 Author: ntohidi <nasrin@kidocode.com> Date: Mon Aug 4 18:59:10 2025 +0800 refactor: consolidate WebScrapingStrategy to use LXML implementation only BREAKING CHANGE: None - full backward compatibility maintained This commit simplifies the content scraping architecture by removing the redundant BeautifulSoup-based WebScrapingStrategy implementation and making it an alias for LXMLWebScrapingStrategy. Changes: - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy - Maintain 100% backward compatibility - existing code continues to work Code changes: - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports - crawl4ai/__init__.py: Update imports to show alias relationship - crawl4ai/types.py: Update type definitions - crawl4ai/legacy/web_crawler.py: Update import to use alias - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy - docs/examples/scraping_strategies_performance.py: Update to use single strategy Documentation updates: - docs/md_v2/core/content-selection.md: Update scraping modes section - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide - CHANGELOG.md: Document the refactoring under [Unreleased] Benefits: - 10-20x faster HTML parsing for large documents - Reduced memory usage and simplified codebase - Consistent parsing behavior - No migration required for existing users All existing code using WebScrapingStrategy continues to work without modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
parent 307fe28b32
commit 7a6ad547f0
11 changed files with 175 additions and 921 deletions
--- a/docs/md_v2/migration/webscraping-strategy-migration.md
+++ b/docs/md_v2/migration/webscraping-strategy-migration.md
@@ -0,0 +1,92 @@
+# WebScrapingStrategy Migration Guide
+
+## Overview
+
+Crawl4AI has simplified its content scraping architecture. The BeautifulSoup-based `WebScrapingStrategy` has been deprecated in favor of the faster LXML-based implementation. However, **no action is required** - your existing code will continue to work.
+
+## What Changed?
+
+1. **`WebScrapingStrategy` is now an alias** for `LXMLWebScrapingStrategy`
+2. **The BeautifulSoup implementation has been removed** (~1000 lines of redundant code)
+3. **`LXMLWebScrapingStrategy` inherits directly** from `ContentScrapingStrategy`
+4. **Performance remains optimal** with LXML as the sole implementation
+
+## Backward Compatibility
+
+**Your existing code continues to work without any changes:**
+
+```python
+# This still works perfectly
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, WebScrapingStrategy
+
+config = CrawlerRunConfig(
+    scraping_strategy=WebScrapingStrategy()  # Works as before
+)
+```
+
+## Migration Options
+
+You have three options:
+
+### Option 1: Do Nothing (Recommended)
+Your code will continue to work. `WebScrapingStrategy` is permanently aliased to `LXMLWebScrapingStrategy`.
+
+### Option 2: Update Imports (Optional)
+For clarity, you can update your imports:
+
+```python
+# Old (still works)
+from crawl4ai import WebScrapingStrategy
+strategy = WebScrapingStrategy()
+
+# New (more explicit)
+from crawl4ai import LXMLWebScrapingStrategy
+strategy = LXMLWebScrapingStrategy()
+```
+
+### Option 3: Use Default Configuration
+Since `LXMLWebScrapingStrategy` is the default, you can omit the strategy parameter:
+
+```python
+# Simplest approach - uses LXMLWebScrapingStrategy by default
+config = CrawlerRunConfig()
+```
+
+## Type Hints
+
+If you use type hints, both work:
+
+```python
+from crawl4ai import WebScrapingStrategy, LXMLWebScrapingStrategy
+
+def process_with_strategy(strategy: WebScrapingStrategy) -> None:
+    # Works with both WebScrapingStrategy and LXMLWebScrapingStrategy
+    pass
+
+# Both are valid
+process_with_strategy(WebScrapingStrategy())
+process_with_strategy(LXMLWebScrapingStrategy())
+```
+
+## Subclassing
+
+If you've subclassed `WebScrapingStrategy`, it continues to work:
+
+```python
+class MyCustomStrategy(WebScrapingStrategy):
+    def __init__(self):
+        super().__init__()
+        # Your custom code
+```
+
+## Performance Benefits
+
+By consolidating to LXML:
+- **10-20x faster** HTML parsing for large documents
+- **Lower memory usage**
+- **Consistent behavior** across all use cases
+- **Simplified maintenance** and bug fixes
+
+## Summary
+
+This change simplifies Crawl4AI's internals while maintaining 100% backward compatibility. Your existing code continues to work, and you get better performance automatically.