Files

ntohidi cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1

- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-07-17 11:35:16 +02:00

14 KiB

Raw Permalink Blame History

🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

January 28, 2025 • 10 min read

Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.

🎯 What's New at a Glance

Adaptive Crawling: Your crawler now learns and adapts to website patterns
Virtual Scroll Support: Complete content extraction from infinite scroll pages
Link Preview with Intelligent Scoring: Intelligent link analysis and prioritization
Async URL Seeder: Discover thousands of URLs in seconds with intelligent filtering
Performance Optimizations: Significant speed and memory improvements

🧠 Adaptive Crawling: Intelligence Through Pattern Learning

The Problem: Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

My Solution: I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.

Technical Deep-Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:

Pattern success rates
Selector stability over time
Content structure variations
Extraction confidence scores

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio

async def main():
    
    # Configure adaptive crawler
    config = AdaptiveConfig(
        strategy="statistical",  # or "embedding" for semantic understanding
        max_pages=10,
        confidence_threshold=0.7,  # Stop at 70% confidence
        top_k_links=3,  # Follow top 3 links per page
        min_gain_threshold=0.05  # Need 5% information gain to continue
    )
    
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        
        print("Starting adaptive crawl about Python decorators...")
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/glossary.html",
            query="python decorators functions wrapping"
        )
        
        print(f"\n✅ Crawling Complete!")
        print(f"• Confidence Level: {adaptive.confidence:.0%}")
        print(f"• Pages Crawled: {len(result.crawled_urls)}")
        print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
        
        # Get most relevant content
        relevant = adaptive.get_relevant_content(top_k=3)
        print(f"\nMost Relevant Pages:")
        for i, page in enumerate(relevant, 1):
            print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")

asyncio.run(main())

Expected Real-World Impact:

News Aggregation: Maintain 95%+ extraction accuracy even as news sites update their templates
E-commerce Monitoring: Track product changes across hundreds of stores without constant maintenance
Research Data Collection: Build robust academic datasets that survive website redesigns
Reduced Maintenance: Cut selector update time by 80% for frequently-changing sites

🌊 Virtual Scroll: Complete Content Capture

The Problem: Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.

My Solution: I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

Implementation Details

from crawl4ai import VirtualScrollConfig

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,                    # Number of scrolls
    scroll_by="container_height",       # Smart scrolling by container size
    wait_after_scroll=1.0              # Let content load
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,                     # Fixed pixel scrolling
    wait_after_scroll=1.5              # Images need time
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",           # Viewport-based scrolling
    wait_after_scroll=0.5              # Wait for content to load
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )
    
    print(f"Captured {len(result.extracted_content['tweets'])} tweets")

Key Capabilities:

DOM Recycling Awareness: Detects and handles virtual DOM element recycling
Smart Scroll Physics: Three modes - container height, page height, or fixed pixels
Content Preservation: Captures content before it's destroyed
Intelligent Stopping: Stops when no new content appears
Memory Efficient: Streams content instead of holding everything in memory

Expected Real-World Impact:

Social Media Analysis: Capture entire Twitter threads with hundreds of replies, not just top 10
E-commerce Scraping: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
News Aggregation: Get all articles from modern news sites, not just above-the-fold content
Research Applications: Complete data extraction from academic databases using virtual pagination

🔗 Link Preview: Intelligent Link Analysis and Scoring

The Problem: You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

My Solution: I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.

Intelligent Link Analysis and Scoring

import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig

async def main():
    # Configure intelligent link analysis
    link_config = LinkPreviewConfig(
        include_internal=True,
        include_external=False,
        max_links=10,
        concurrency=5,
        query="python tutorial",  # For contextual scoring
        score_threshold=0.3,
        verbose=True
    )
    # Use in your crawl
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.geeksforgeeks.org/",
            config=CrawlerRunConfig(
                link_preview_config=link_config,
                score_links=True,  # Enable intrinsic scoring
                cache_mode=CacheMode.BYPASS
            )
        )

        # Access scored and sorted links
        if result.success and result.links:
            for link in result.links.get("internal", []):
                text = link.get('text', 'No text')[:40]
                print(
                    text,
                    f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
                    f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
                    f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
                )

asyncio.run(main())

Scoring Components:

Intrinsic Score: Based on link quality indicators
- Position on page (navigation, content, footer)
- Link attributes (rel, title, class names)
- Anchor text quality and length
- URL structure and depth
Contextual Score: Relevance to your query using BM25 algorithm
- Keyword matching in link text and title
- Meta description analysis
- Content preview scoring
Total Score: Combined score for final ranking

Expected Real-World Impact:

Research Efficiency: Find relevant papers 10x faster by following only high-score links
Competitive Analysis: Automatically identify important pages on competitor sites
Content Discovery: Build topic-focused crawlers that stay on track
SEO Audits: Identify and prioritize high-value internal linking opportunities

🎣 Async URL Seeder: Automated URL Discovery at Scale

The Problem: You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.

My Solution: I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.

Technical Architecture

import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    async with AsyncUrlSeeder() as seeder:
        # Discover Python tutorial URLs
        config = SeedingConfig(
            source="sitemap",  # Use sitemap
            pattern="*python*",  # URL pattern filter
            extract_head=True,  # Get metadata
            query="python tutorial",  # For relevance scoring
            scoring_method="bm25",
            score_threshold=0.2,
            max_urls=10
        )
        
        print("Discovering Python async tutorial URLs...")
        urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
        
        print(f"\n✅ Found {len(urls)} relevant URLs:")
        for i, url_info in enumerate(urls[:5], 1):
            print(f"\n{i}. {url_info['url']}")
            if url_info.get('relevance_score'):
                print(f"   Relevance: {url_info['relevance_score']:.3f}")
            if url_info.get('head_data', {}).get('title'):
                print(f"   Title: {url_info['head_data']['title'][:60]}...")

asyncio.run(main())

Discovery Methods:

Sitemap Mining: Parses robots.txt and all linked sitemaps
Common Crawl: Queries the Common Crawl index for historical URLs
Intelligent Crawling: Follows links with smart depth control
Pattern Analysis: Learns URL structures and generates variations

Expected Real-World Impact:

Migration Projects: Discover 10,000+ URLs from legacy sites in under 60 seconds
Market Research: Map entire competitor ecosystems automatically
Academic Research: Build comprehensive datasets without manual URL collection
SEO Audits: Find every indexable page with content scoring
Content Archival: Ensure no content is left behind during site migrations

⚡ Performance Optimizations

This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.

What We Optimized

# Optimized crawling with v0.7.0 improvements
results = []
for url in urls:
    result = await crawler.arun(
        url,
        config=CrawlerRunConfig(
            # Performance optimizations
            wait_until="domcontentloaded",  # Faster than networkidle
            cache_mode=CacheMode.ENABLED    # Enable caching
        )
    )
    results.append(result)

Performance Gains:

Startup Time: 70% faster browser initialization
Page Loading: 40% reduction with smart resource blocking
Extraction: 3x faster with compiled CSS selectors
Memory Usage: 60% reduction with streaming processing
Concurrent Crawls: Handle 5x more parallel requests

🔧 Important Changes

Breaking Changes

link_extractor renamed to link_preview (better reflects functionality)
Minimum Python version now 3.9
CrawlerConfig split into CrawlerRunConfig and BrowserConfig

Migration Guide

# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

🤖 Coming Soon: Intelligent Web Automation

I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:

Crawl Agents: Autonomous crawlers that understand your goals and adapt their strategies
Auto JS Generation: Automatic JavaScript code generation for complex interactions
Smart Form Handling: Intelligent form detection and filling
Context-Aware Actions: Crawlers that understand page context and make decisions

These features are under active development and will revolutionize how we approach web automation. Stay tuned!

🚀 Get Started

pip install crawl4ai==0.7.0

Check out the updated documentation.

Questions? Issues? I'm always listening:

Happy crawling! 🕷️

P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.

14 KiB Raw Permalink Blame History

🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

🎯 What's New at a Glance

🧠 Adaptive Crawling: Intelligence Through Pattern Learning

Technical Deep-Dive

🌊 Virtual Scroll: Complete Content Capture

Implementation Details

🔗 Link Preview: Intelligent Link Analysis and Scoring

Intelligent Link Analysis and Scoring

🎣 Async URL Seeder: Automated URL Discovery at Scale

Technical Architecture

⚡ Performance Optimizations

What We Optimized

🔧 Important Changes

Breaking Changes

Migration Guide

🤖 Coming Soon: Intelligent Web Automation

🚀 Get Started

14 KiB

Raw Permalink Blame History