feat: add v0.7.3 release notes, changelog updates, and documentation for new features
CHANGELOG.md (70 changed lines)
@@ -5,6 +5,76 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.7.3] - 2025-08-09

### Added

- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
  - `browser_adapter.py` with undetected Chrome integration
  - Bypasses sophisticated bot-detection systems (Cloudflare, Akamai, custom solutions)
  - Support for headless stealth mode with anti-detection techniques
  - Human-like behavior simulation with random mouse movements and scrolling
  - Comprehensive examples for anti-bot strategies and stealth crawling
  - Full documentation guide for undetected browser usage

- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
  - Different crawling strategies for different URL patterns in a single batch
  - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
  - Lambda-function matchers for complex URL logic
  - Mixed matchers combining strings and functions with AND/OR logic
  - First-match-wins configuration selection, with optional fallback configuration when no patterns match

- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
  - New `memory_utils.py` module for memory monitoring and optimization
  - Real-time memory usage tracking during crawl sessions
  - Memory-leak detection and reporting
  - Performance-optimization recommendations
  - Peak memory usage analysis and efficiency metrics
  - Automatic cleanup suggestions for memory-intensive operations

- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
  - Direct `result.tables` interface replacing the generic `result.media` approach
  - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
  - Enhanced table detection algorithms for better accuracy
  - Table metadata including source XPath and headers
  - Improved table structure preservation during extraction

- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
  - Supporter ($5/month): Community support + early feature previews
  - Professional ($25/month): Priority support + beta access
  - Business ($100/month): Direct consultation + custom integrations
  - Enterprise ($500/month): Dedicated support + feature development
  - Custom arrangement options for larger organizations

- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
  - `LLM_PROVIDER` environment variable support for dynamic provider switching
  - `.llm.env` file support for secure configuration management
  - Per-request provider override capabilities in API endpoints
  - Support for OpenAI, Groq, and other providers without rebuilding images
  - Enhanced Docker documentation with deployment examples

### Fixed

- **URL Matcher Fallback**: Resolved edge cases in URL pattern-matching logic
- **Memory Management**: Fixed memory leaks in long-running crawl sessions
- **Sitemap Processing**: Improved redirect handling in sitemap fetching
- **Table Extraction**: Enhanced table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures

### Changed

- **Architecture Refactoring**: Major cleanup and optimization
  - Moved 2,450+ lines from the main `async_crawler_strategy.py` to backup
  - Cleaner separation of concerns in the crawler architecture
  - Better maintainability and code organization
  - Preserved backward compatibility while improving performance

### Documentation

- **Comprehensive Examples**: Added real-world URLs and practical use cases
- **API Documentation**: Complete CrawlResult field documentation with all available fields
- **Migration Guides**: Updated table-extraction patterns from `result.media` to `result.tables`
- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
- **Multi-Config Examples**: Detailed examples for URL-specific configurations
- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration

## [0.7.x] - 2025-06-29

### Added
README.md (90 changed lines)
@@ -27,9 +27,9 @@
Crawl4AI turns the web into clean, LLM-ready Markdown for RAG, agents, and data pipelines. Fast, controllable, and battle-tested by a 50k+ star community.

[✨ Check out the latest update v0.7.3](#-recent-updates)

✨ New in v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, and GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -542,7 +542,89 @@ async def test_news_crawl():
## ✨ Recent Updates

<details>
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>

- **🕵️ Undetected Browser Support**: Bypass sophisticated bot-detection systems:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://protected-site.com")
    # Bypasses Cloudflare, Akamai, and custom bot detection
```

- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:

```python
from crawl4ai import CrawlerRunConfig, MatchMode

configs = [
    # Documentation sites - aggressive caching
    CrawlerRunConfig(
        url_matcher=["*docs*", "*documentation*"],
        cache_mode="write",
        markdown_generator_options={"include_links": True}
    ),

    # News/blog sites - fresh content
    CrawlerRunConfig(
        url_matcher=lambda url: 'blog' in url or 'news' in url,
        cache_mode="bypass"
    ),

    # Fallback for everything else
    CrawlerRunConfig()
]

results = await crawler.arun_many(urls, config=configs)
# Each URL gets the right configuration automatically
```

- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:

```python
from crawl4ai.memory_utils import MemoryMonitor

monitor = MemoryMonitor()
monitor.start_monitoring()

results = await crawler.arun_many(large_url_list)

report = monitor.get_report()
print(f"Peak memory: {report['peak_mb']:.1f} MB")
print(f"Efficiency: {report['efficiency']:.1f}%")
# Also surfaces optimization recommendations
```

- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:

```python
result = await crawler.arun("https://site-with-tables.com")

# New way - direct table access
if result.tables:
    import pandas as pd
    for table in result.tables:
        df = pd.DataFrame(table['data'])
        print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
```

- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables

[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

</details>

<details>
<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>

- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
@@ -607,6 +689,8 @@ async def test_news_crawl():
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

</details>

## Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
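For final releases, PEP 440 ordering is a simple numeric, component-wise comparison. A minimal illustration (final releases only; real PEP 440 parsing also handles pre-release, post-release, and dev segments):

```python
def release_tuple(version: str) -> tuple:
    # Minimal parse for final releases only, e.g. "0.7.3" -> (0, 7, 3)
    return tuple(int(part) for part in version.split("."))

# Components compare numerically, not lexicographically
assert release_tuple("0.7.3") > release_tuple("0.7.0")
assert release_tuple("0.10.0") > release_tuple("0.9.9")
```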
@@ -8,10 +8,14 @@ Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This
## 🎯 What's New at a Glance

- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
- **🔧 Bug Fixes**: Resolved several critical issues for better stability
- **📚 Documentation Updates**: Clearer examples and improved API documentation

## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
@@ -78,6 +82,182 @@ async with AsyncWebCrawler() as crawler:
- **Reduced Complexity**: No more if/else forests in your extraction code
- **Better Performance**: Each URL gets exactly the processing it needs
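Under the hood this is a first-match-wins lookup over the config list. A minimal, library-free sketch of that selection logic (the helper names here are illustrative, not Crawl4AI internals):

```python
from fnmatch import fnmatch

def url_matches(matcher, url: str) -> bool:
    # A matcher may be a glob-style string or a callable, as in the examples above
    if callable(matcher):
        return bool(matcher(url))
    return fnmatch(url, matcher)

def select_config(url: str, configs, fallback=None):
    # First-match-wins: the first config whose matcher accepts the URL is used
    for matcher, config in configs:
        if url_matches(matcher, url):
            return config
    return fallback  # used when no pattern matches

configs = [
    ("*/blog/*", "fresh-content"),
    (lambda u: u.endswith(".pdf"), "pdf-handling"),
]
select_config("https://site.com/blog/post", configs, fallback="default")   # → "fresh-content"
select_config("https://site.com/report.pdf", configs, fallback="default")  # → "pdf-handling"
select_config("https://site.com/about", configs, fallback="default")       # → "default"
```

Because the first matching entry wins, order configs from most to least specific and leave the catch-all fallback last.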
## 🕵️ Undetected Browser Support: Stealth Mode Activated

**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.

**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.

### Technical Implementation

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Enable undetected mode for stealth crawling
browser_config = BrowserConfig(
    browser_type="undetected",  # Use undetected Chrome
    headless=True,              # Can run headless with stealth
    extra_args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-web-security",
        "--disable-features=VizDisplayCompositor"
    ]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    # This will bypass most bot detection systems
    result = await crawler.arun("https://protected-site.com")

    if result.success:
        print("✅ Successfully bypassed bot detection!")
        print(f"Content length: {len(result.markdown)}")
```

**Advanced Anti-Bot Strategies:**

```python
# Combine multiple stealth techniques
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # Random user agents and headers
    headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1"
    },

    # Human-like behavior simulation
    js_code="""
    // Random mouse movements
    const simulateHuman = () => {
        const event = new MouseEvent('mousemove', {
            clientX: Math.random() * window.innerWidth,
            clientY: Math.random() * window.innerHeight
        });
        document.dispatchEvent(event);
    };
    setInterval(simulateHuman, 100 + Math.random() * 200);

    // Random scrolling
    const randomScroll = () => {
        const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
        window.scrollTo(0, scrollY);
    };
    setTimeout(randomScroll, 500 + Math.random() * 1000);
    """,

    # Delay to appear more human
    delay_before_return_html=2.0
)

result = await crawler.arun("https://bot-protected-site.com", config=config)
```

**Expected Real-World Impact:**

- **Enterprise Scraping**: Access previously blocked corporate sites and databases
- **Market Research**: Gather data from competitor sites with protection
- **Price Monitoring**: Track e-commerce sites that block automated access
- **Content Aggregation**: Collect news and social media despite anti-bot measures
- **Compliance Testing**: Verify your own site's bot protection effectiveness
## 🧠 Memory Monitoring & Optimization

**The Problem:** Long-running crawl sessions can consume excessive memory, especially when processing large batches or JavaScript-heavy sites.

**My Solution:** I built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.

### Memory Tracking Implementation

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info

# Monitor memory during crawling
monitor = MemoryMonitor()

async with AsyncWebCrawler() as crawler:
    # Start monitoring
    monitor.start_monitoring()

    # Perform memory-intensive operations
    results = await crawler.arun_many([
        "https://heavy-js-site.com",
        "https://large-images-site.com",
        "https://dynamic-content-site.com"
    ])

    # Get detailed memory report
    memory_report = monitor.get_report()
    print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
    print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")

    # Automatic cleanup suggestions
    if memory_report['peak_mb'] > 1000:  # > 1 GB
        print("💡 Consider batch size optimization")
        print("💡 Enable aggressive garbage collection")
```

**Expected Real-World Impact:**

- **Production Stability**: Prevent memory-related crashes in long-running services
- **Cost Optimization**: Right-size server resources based on actual usage
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
- **Scalability Planning**: Understand memory patterns for horizontal scaling
## 📊 Enhanced Table Extraction

**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.

**My Solution:** A dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.

### New Table Access Pattern

```python
# Old way (deprecated)
# tables_data = result.media.get('tables', [])

# New way (v0.7.3+)
result = await crawler.arun("https://site-with-tables.com")

# Direct table access
if result.tables:
    print(f"Found {len(result.tables)} tables")

    # Convert to pandas DataFrame instantly
    import pandas as pd

    for i, table in enumerate(result.tables):
        df = pd.DataFrame(table['data'])
        print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
        print(df.head())

        # Table metadata
        print(f"Source: {table.get('source_xpath', 'Unknown')}")
        print(f"Headers: {table.get('headers', [])}")
```

**Expected Real-World Impact:**

- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
- **ETL Pipelines**: Cleaner integration with data-processing workflows
- **Reporting**: Simplified table extraction for automated reporting systems
## 💰 Community Support: GitHub Sponsors

I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.

**Sponsorship Tiers:**

- **🌱 Supporter ($5/month)**: Community support + early feature previews
- **🚀 Professional ($25/month)**: Priority support + beta access
- **🏢 Business ($100/month)**: Direct consultation + custom integrations
- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development

**Why Sponsor?**

- Ensure continuous development and maintenance
- Get priority support and feature requests
- Access to premium documentation and examples
- Direct line to the development team

[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
## 🐳 Docker: Flexible LLM Provider Configuration

**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
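Based on the feature list above (`LLM_PROVIDER` variable and `.llm.env` file support), a hedged deployment sketch — the image name, port, and provider-string format are assumptions, not confirmed by this commit:

```shell
# .llm.env — keeps provider choice and secrets out of the image
# (file name from the release notes; variable values below are illustrative)
# LLM_PROVIDER=openai/gpt-4o
# OPENAI_API_KEY=sk-...

# Switch providers by editing the env file and restarting — no rebuild needed
docker run -d --env-file .llm.env -p 11235:11235 unclecode/crawl4ai
```

The release notes also mention per-request provider overrides in the API endpoints, so the env file would act as the default rather than a hard limit.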