docs: Update documentation for v0.7.0 release
- Update mkdocs.yml site name to v0.7.x
- Add v0.7.0 to blog index as latest release
- Move v0.6.0 to Previous Releases section
- Copy release notes to proper location in docs/md_v2/blog/releases/
# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

*January 28, 2025 • 10 min read*

---

Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.

## 🎯 What's New at a Glance

- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
- **Link Preview with 3-Layer Scoring**: Intelligent link analysis and prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
- **PDF Parsing**: Extract data from PDF documents
- **Performance Optimizations**: Significant speed and memory improvements

## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning

**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.

### Technical Deep-Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:

- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores

```python
from crawl4ai import AdaptiveCrawler, AdaptiveConfig, CrawlState, AsyncWebCrawler, CrawlerRunConfig

# Initialize with custom learning parameters
config = AdaptiveConfig(
    confidence_threshold=0.7,   # Min confidence to use learned patterns
    max_history=100,            # Remember last 100 crawls per domain
    learning_rate=0.2,          # How quickly to adapt to changes
    patterns_per_page=3,        # Patterns to learn per page type
    extraction_strategy='css'   # 'css' or 'xpath'
)

adaptive_crawler = AdaptiveCrawler(config)

# First crawl - crawler learns the structure
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://news.example.com/article/12345",
        config=CrawlerRunConfig(
            adaptive_config=config,
            extraction_hints={  # Optional hints to speed up learning
                "title": "article h1",
                "content": "article .body-content"
            }
        )
    )

    # Crawler identifies and stores patterns
    if result.success:
        state = adaptive_crawler.get_state("news.example.com")
        print(f"Learned {len(state.patterns)} patterns")
        print(f"Confidence: {state.avg_confidence:.2%}")

    # Subsequent crawls - uses learned patterns
    result2 = await crawler.arun(
        "https://news.example.com/article/67890",
        config=CrawlerRunConfig(adaptive_config=config)
    )
    # Automatically extracts using learned patterns!
```
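
If you're wondering what `learning_rate=0.2` actually means in practice, here's a rough sketch of the idea: blend the old confidence with each new extraction outcome using an exponential moving average. This is only an illustration of the concept, not the library's internals; the helper below is hypothetical.

```python
# Hypothetical illustration of a learning-rate-driven confidence update.
# Not Crawl4AI's internal code: just the exponential-moving-average idea
# behind a parameter like learning_rate=0.2.

def update_confidence(previous: float, observed: float, learning_rate: float = 0.2) -> float:
    """Blend the old confidence with the latest extraction outcome."""
    return (1 - learning_rate) * previous + learning_rate * observed

confidence = 0.9                       # learned from earlier crawls
for success in [1.0, 1.0, 0.0, 1.0]:   # 1.0 = pattern matched, 0.0 = it broke
    confidence = update_confidence(confidence, success)
    print(f"confidence is now {confidence:.2f}")

# Once confidence drops below confidence_threshold (0.7 above),
# the crawler would fall back to re-learning the page structure.
```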

**Expected Real-World Impact:**

- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates
- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance
- **Research Data Collection**: Build robust academic datasets that survive website redesigns
- **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites

## 🌊 Virtual Scroll: Complete Content Capture

**The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.

**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

### Implementation Details

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig, JsonCssExtractionStrategy

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,               # Number of scrolls
    scroll_by="container_height",  # Smart scrolling by container size
    wait_after_scroll=1.0,         # Let content load
    capture_method="incremental",  # Capture new content on each scroll
    deduplicate=True               # Remove duplicate elements
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                 # Fixed pixel scrolling
    wait_after_scroll=1.5,         # Images need time
    stop_on_no_change=True         # Smart stopping
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",       # Viewport-based scrolling
    wait_after_scroll=0.5,
    wait_for_selector=".article-card",  # Wait for specific elements
    timeout=30000                  # Max 30 seconds total
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )

    print(f"Captured {len(result.extracted_content['tweets'])} tweets")
```
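
To make `capture_method="incremental"` and `deduplicate=True` more concrete, here's a simplified, hypothetical sketch of the merging idea: hash each captured element and keep only chunks you haven't seen before. This illustrates the technique; it isn't Crawl4AI's internal code.

```python
import hashlib

def merge_incremental_capture(snapshots: list[list[str]]) -> list[str]:
    """Merge per-scroll element snapshots, dropping elements already captured.

    snapshots: for each scroll step, the outerHTML of the elements currently
    rendered in the virtualized container. Recycled/re-rendered duplicates
    are filtered out by hashing the normalized markup.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for snapshot in snapshots:
        for element_html in snapshot:
            key = hashlib.md5(" ".join(element_html.split()).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                merged.append(element_html)
    return merged

# Example: three scroll steps where the container recycles earlier tweets
steps = [
    ["<div>tweet 1</div>", "<div>tweet 2</div>"],
    ["<div>tweet 2</div>", "<div>tweet 3</div>"],
    ["<div>tweet 3</div>", "<div>tweet 4</div>"],
]
print(len(merge_incremental_capture(steps)))  # 4 unique tweets
```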

**Key Capabilities:**

- **DOM Recycling Awareness**: Detects and handles virtual DOM element recycling
- **Smart Scroll Physics**: Three modes - container height, page height, or fixed pixels
- **Content Preservation**: Captures content before it's destroyed
- **Intelligent Stopping**: Stops when no new content appears
- **Memory Efficient**: Streams content instead of holding everything in memory

**Expected Real-World Impact:**

- **Social Media Analysis**: Capture entire Twitter threads with hundreds of replies, not just the top 10
- **E-commerce Scraping**: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- **News Aggregation**: Get all articles from modern news sites, not just above-the-fold content
- **Research Applications**: Complete data extraction from academic databases using virtual pagination

## 🔗 Link Preview: Intelligent Link Analysis and Scoring

**The Problem:** You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.

### The Three-Layer Scoring System

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LinkPreviewConfig

# Configure intelligent link analysis
link_config = LinkPreviewConfig(
    # What to analyze
    include_internal=True,
    include_external=True,
    max_links=100,              # Analyze top 100 links

    # Relevance scoring
    query="machine learning tutorials",  # Your interest
    score_threshold=0.3,        # Minimum relevance score

    # Performance
    concurrent_requests=10,     # Parallel processing
    timeout_per_link=5000,      # 5s per link

    # Advanced scoring weights
    scoring_weights={
        "intrinsic": 0.3,       # Link quality indicators
        "contextual": 0.5,      # Relevance to query
        "popularity": 0.2       # Link prominence
    }
)

# Use in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://tech-blog.example.com",
        config=CrawlerRunConfig(
            link_preview_config=link_config,
            score_links=True
        )
    )

# Access scored and sorted links
for link in result.links["internal"][:10]:  # Top 10 internal links
    print(f"Score: {link['total_score']:.3f}")
    print(f"  Intrinsic: {link['intrinsic_score']:.1f}/10")   # Position, attributes
    print(f"  Contextual: {link['contextual_score']:.1f}/1")  # Relevance to query
    print(f"  URL: {link['href']}")
    print(f"  Title: {link['head_data']['title']}")
    print(f"  Description: {link['head_data']['meta']['description'][:100]}...")
```

**Scoring Components:**

1. **Intrinsic Score (0-10)**: Based on link quality indicators
   - Position on page (navigation, content, footer)
   - Link attributes (rel, title, class names)
   - Anchor text quality and length
   - URL structure and depth

2. **Contextual Score (0-1)**: Relevance to your query
   - Semantic similarity using embeddings
   - Keyword matching in link text and title
   - Meta description analysis
   - Content preview scoring

3. **Total Score**: Weighted combination for final ranking, sketched below
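
As a rough illustration of how those layers could combine into the final ranking, here's a hypothetical formula using the `scoring_weights` from the configuration above; the library's exact normalization may differ.

```python
# Hypothetical sketch of the weighted combination, not Crawl4AI's exact formula.
# Intrinsic is normalized from its 0-10 range; contextual and popularity are 0-1.

weights = {"intrinsic": 0.3, "contextual": 0.5, "popularity": 0.2}

def total_score(intrinsic: float, contextual: float, popularity: float) -> float:
    return (
        weights["intrinsic"] * (intrinsic / 10)
        + weights["contextual"] * contextual
        + weights["popularity"] * popularity
    )

print(f"{total_score(intrinsic=8.0, contextual=0.9, popularity=0.4):.3f}")  # 0.770
```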

**Expected Real-World Impact:**

- **Research Efficiency**: Find relevant papers 10x faster by following only high-score links
- **Competitive Analysis**: Automatically identify important pages on competitor sites
- **Content Discovery**: Build topic-focused crawlers that stay on track
- **SEO Audits**: Identify and prioritize high-value internal linking opportunities

## 🎣 Async URL Seeder: Automated URL Discovery at Scale

**The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.

**My Solution:** I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.

### Technical Architecture

```python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Basic discovery - find all product pages
seeder_config = SeedingConfig(
    # Discovery sources
    source="sitemap+cc",            # Sitemap + Common Crawl

    # Filtering
    pattern="*/product/*",          # URL pattern matching
    ignore_patterns=["*/reviews/*", "*/questions/*"],

    # Validation
    live_check=True,                # Verify URLs are alive
    max_urls=5000,                  # Stop at 5000 URLs

    # Performance
    concurrency=100,                # Parallel requests
    hits_per_sec=10                 # Rate limiting
)

seeder = AsyncUrlSeeder(seeder_config)
urls = await seeder.discover("https://shop.example.com")

# Advanced: Relevance-based discovery
research_config = SeedingConfig(
    source="crawl+sitemap",         # Deep crawl + sitemap
    pattern="*/blog/*",             # Blog posts only

    # Content relevance
    extract_head=True,              # Get meta tags
    query="quantum computing tutorials",
    scoring_method="bm25",          # Or "semantic" (coming soon)
    score_threshold=0.4,            # High relevance only

    # Smart filtering
    filter_nonsense_urls=True,      # Remove .xml, .txt, etc.
    min_content_length=500,         # Skip thin content

    force=True                      # Bypass cache
)

# Discover with progress tracking
discovered = []
async for batch in seeder.discover_iter("https://physics-blog.com", research_config):
    discovered.extend(batch)
    print(f"Found {len(discovered)} relevant URLs so far...")

# Results include scores and metadata
for url_data in discovered[:5]:
    print(f"URL: {url_data['url']}")
    print(f"Score: {url_data['score']:.3f}")
    print(f"Title: {url_data['title']}")
```

**Discovery Methods:**

- **Sitemap Mining**: Parses robots.txt and all linked sitemaps
- **Common Crawl**: Queries the Common Crawl index for historical URLs
- **Intelligent Crawling**: Follows links with smart depth control
- **Pattern Analysis**: Learns URL structures and generates variations
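
A quick note on the `pattern` and `ignore_patterns` options above: glob-style URL filtering behaves roughly like Python's `fnmatch`, as in this simplified sketch (not the seeder's actual code).

```python
from fnmatch import fnmatch

urls = [
    "https://shop.example.com/product/usb-c-hub",
    "https://shop.example.com/product/usb-c-hub/reviews/42",
    "https://shop.example.com/blog/holiday-sale",
]

pattern = "*/product/*"
ignore_patterns = ["*/reviews/*", "*/questions/*"]

kept = [
    u for u in urls
    if fnmatch(u, pattern) and not any(fnmatch(u, ig) for ig in ignore_patterns)
]
print(kept)  # only the plain product page survives
```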

**Expected Real-World Impact:**

- **Migration Projects**: Discover 10,000+ URLs from legacy sites in under 60 seconds
- **Market Research**: Map entire competitor ecosystems automatically
- **Academic Research**: Build comprehensive datasets without manual URL collection
- **SEO Audits**: Find every indexable page with content scoring
- **Content Archival**: Ensure no content is left behind during site migrations

## ⚡ Performance Optimizations

This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.

### What We Optimized

```python
# Before v0.7.0 (slow)
results = []
for url in urls:
    result = await crawler.arun(url)
    results.append(result)

# After v0.7.0 (fast)
# Automatic batching and connection pooling
results = await crawler.arun_batch(
    urls,
    config=CrawlerRunConfig(
        # New performance options
        batch_size=10,              # Process 10 URLs concurrently
        reuse_browser=True,         # Keep browser warm
        eager_loading=False,        # Load only what's needed
        streaming_extraction=True,  # Stream large extractions

        # Optimized defaults
        wait_until="domcontentloaded",     # Faster than networkidle
        exclude_external_resources=True,   # Skip third-party assets
        block_ads=True                     # Ad blocking built-in
    )
)

# Memory-efficient streaming for large crawls
async for result in crawler.arun_stream(large_url_list):
    # Process results as they complete
    await process_result(result)
    # Memory is freed after each iteration
```

**Performance Gains:**

- **Startup Time**: 70% faster browser initialization
- **Page Loading**: 40% reduction with smart resource blocking
- **Extraction**: 3x faster with compiled CSS selectors
- **Memory Usage**: 60% reduction with streaming processing
- **Concurrent Crawls**: Handle 5x more parallel requests

## 📄 PDF Support

PDF extraction is now natively supported in Crawl4AI.

```python
# Extract data from PDF documents
result = await crawler.arun(
    "https://example.com/report.pdf",
    config=CrawlerRunConfig(
        pdf_extraction=True,
        extraction_strategy=JsonCssExtractionStrategy({
            # Works on converted PDF structure
            "title": {"selector": "h1", "type": "text"},
            "sections": {"selector": "h2", "type": "list"}
        })
    )
)
```
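
Beyond structured extraction, the crawled PDF comes back through the usual result object. A short usage sketch, assuming the same `crawler` context as above:

```python
# Usage sketch: inspect what came back from the PDF crawl.
if result.success:
    # The converted document is exposed as markdown like any other page.
    print(str(result.markdown)[:300])
else:
    print(f"PDF crawl failed: {result.error_message}")
```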

## 🔧 Important Changes

### Breaking Changes

- `link_extractor` renamed to `link_preview` (better reflects functionality)
- Minimum Python version now 3.9
- `CrawlerConfig` split into `CrawlerRunConfig` and `BrowserConfig`

### Migration Guide
```python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig, CacheMode
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```

## 🤖 Coming Soon: Intelligent Web Automation

I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:

- **Crawl Agents**: Autonomous crawlers that understand your goals and adapt their strategies
- **Auto JS Generation**: Automatic JavaScript code generation for complex interactions
- **Smart Form Handling**: Intelligent form detection and filling
- **Context-Aware Actions**: Crawlers that understand page context and make decisions

These features are under active development and will revolutionize how we approach web automation. Stay tuned!

## 🚀 Get Started

```bash
pip install crawl4ai==0.7.0
```

Check out the [updated documentation](https://docs.crawl4ai.com).

Questions? Issues? I'm always listening:

- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- Twitter: [@unclecode](https://x.com/unclecode)

Happy crawling! 🕷️

---

*P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.*