# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update

*January 28, 2025 • 10 min read*

---

Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.

## 🎯 What's New at a Glance

- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
- **Link Preview with 3-Layer Scoring**: Intelligent link analysis and prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
- **PDF Parsing**: Extract data from PDF documents
- **Performance Optimizations**: Significant speed and memory improvements

## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning

**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.

**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.

### Technical Deep-Dive

The Adaptive Crawler maintains a persistent state for each domain, tracking:

- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores

```python
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig,
    AdaptiveCrawler, AdaptiveConfig, CrawlState
)

# Initialize with custom learning parameters
config = AdaptiveConfig(
    confidence_threshold=0.7,   # Min confidence to use learned patterns
    max_history=100,            # Remember last 100 crawls per domain
    learning_rate=0.2,          # How quickly to adapt to changes
    patterns_per_page=3,        # Patterns to learn per page type
    extraction_strategy='css'   # 'css' or 'xpath'
)

adaptive_crawler = AdaptiveCrawler(config)

# First crawl - crawler learns the structure
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://news.example.com/article/12345",
        config=CrawlerRunConfig(
            adaptive_config=config,
            extraction_hints={  # Optional hints to speed up learning
                "title": "article h1",
                "content": "article .body-content"
            }
        )
    )

    # Crawler identifies and stores patterns
    if result.success:
        state = adaptive_crawler.get_state("news.example.com")
        print(f"Learned {len(state.patterns)} patterns")
        print(f"Confidence: {state.avg_confidence:.2%}")

    # Subsequent crawls - uses learned patterns
    result2 = await crawler.arun(
        "https://news.example.com/article/67890",
        config=CrawlerRunConfig(adaptive_config=config)
    )
    # Automatically extracts using learned patterns!
```

**Expected Real-World Impact:**

- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates
- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance
- **Research Data Collection**: Build robust academic datasets that survive website redesigns
- **Reduced Maintenance**: Cut selector update time by 80% for frequently changing sites

## 🌊 Virtual Scroll: Complete Content Capture

**The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.

### Implementation Details

```python
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig,
    VirtualScrollConfig, JsonCssExtractionStrategy
)

# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
    container_selector="[data-testid='primaryColumn']",
    scroll_count=20,               # Number of scrolls
    scroll_by="container_height",  # Smart scrolling by container size
    wait_after_scroll=1.0,         # Let content load
    capture_method="incremental",  # Capture new content on each scroll
    deduplicate=True               # Remove duplicate elements
)

# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
    container_selector="main .product-grid",
    scroll_count=30,
    scroll_by=800,            # Fixed pixel scrolling
    wait_after_scroll=1.5,    # Images need time
    stop_on_no_change=True    # Smart stopping
)

# For news feeds with lazy loading
news_config = VirtualScrollConfig(
    container_selector=".article-feed",
    scroll_count=50,
    scroll_by="page_height",            # Viewport-based scrolling
    wait_after_scroll=0.5,
    wait_for_selector=".article-card",  # Wait for specific elements
    timeout=30000                       # Max 30 seconds total
)

# Use it in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://twitter.com/trending",
        config=CrawlerRunConfig(
            virtual_scroll_config=twitter_config,
            # Combine with other features
            extraction_strategy=JsonCssExtractionStrategy({
                "tweets": {
                    "selector": "[data-testid='tweet']",
                    "fields": {
                        "text": {"selector": "[data-testid='tweetText']", "type": "text"},
                        "likes": {"selector": "[data-testid='like']", "type": "text"}
                    }
                }
            })
        )
    )

    print(f"Captured {len(result.extracted_content['tweets'])} tweets")
```

**Key Capabilities:**

- **DOM Recycling Awareness**: Detects and handles virtual DOM element recycling
- **Smart Scroll Physics**: Three modes: container height, page height, or fixed pixels
- **Content Preservation**: Captures content before it's destroyed
- **Intelligent Stopping**: Stops when no new content appears
- **Memory Efficient**: Streams content instead of holding everything in memory

**Expected Real-World Impact:**

- **Social Media Analysis**: Capture entire Twitter threads with hundreds of replies, not just the top 10
- **E-commerce Scraping**: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- **News Aggregation**: Get all articles from modern news sites, not just above-the-fold content
- **Research Applications**: Complete data extraction from academic databases using virtual pagination

## 🔗 Link Preview: Intelligent Link Analysis and Scoring

**The Problem:** You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.

**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
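Before we get to the API, here's the intuition: every link receives three sub-scores, and the final ranking is a weighted blend of them. The sketch below is purely illustrative (the `LinkScores` dataclass and `total_score` helper are mine, not library internals), and it assumes the same weights used in the configuration that follows; the popularity range is an assumption on my part.

```python
from dataclasses import dataclass

@dataclass
class LinkScores:
    intrinsic: float    # 0-10: position, attributes, anchor text, URL depth
    contextual: float   # 0-1: relevance of the link to your query
    popularity: float   # 0-1 (assumed range): prominence of the link on the page

def total_score(s: LinkScores, weights=None) -> float:
    """Blend the three layers into one ranking score (illustrative only)."""
    weights = weights or {"intrinsic": 0.3, "contextual": 0.5, "popularity": 0.2}
    # Normalize the 0-10 intrinsic score to 0-1 so all layers share a scale
    return (
        weights["intrinsic"] * (s.intrinsic / 10)
        + weights["contextual"] * s.contextual
        + weights["popularity"] * s.popularity
    )

# A link in the main content area, highly relevant to the query
print(f"{total_score(LinkScores(intrinsic=8.0, contextual=0.9, popularity=0.6)):.3f}")  # 0.810
# A footer link with little relevance
print(f"{total_score(LinkScores(intrinsic=2.0, contextual=0.1, popularity=0.2)):.3f}")  # 0.150
```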
### The Three-Layer Scoring System

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LinkPreviewConfig

# Configure intelligent link analysis
link_config = LinkPreviewConfig(
    # What to analyze
    include_internal=True,
    include_external=True,
    max_links=100,          # Analyze top 100 links

    # Relevance scoring
    query="machine learning tutorials",  # Your interest
    score_threshold=0.3,    # Minimum relevance score

    # Performance
    concurrent_requests=10,  # Parallel processing
    timeout_per_link=5000,   # 5s per link

    # Advanced scoring weights
    scoring_weights={
        "intrinsic": 0.3,    # Link quality indicators
        "contextual": 0.5,   # Relevance to query
        "popularity": 0.2    # Link prominence
    }
)

# Use in your crawl
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://tech-blog.example.com",
        config=CrawlerRunConfig(
            link_preview_config=link_config,
            score_links=True
        )
    )

    # Access scored and sorted links
    for link in result.links["internal"][:10]:  # Top 10 internal links
        print(f"Score: {link['total_score']:.3f}")
        print(f"  Intrinsic: {link['intrinsic_score']:.1f}/10")   # Position, attributes
        print(f"  Contextual: {link['contextual_score']:.1f}/1")  # Relevance to query
        print(f"  URL: {link['href']}")
        print(f"  Title: {link['head_data']['title']}")
        print(f"  Description: {link['head_data']['meta']['description'][:100]}...")
```

**Scoring Components:**

1. **Intrinsic Score (0-10)**: Based on link quality indicators
   - Position on page (navigation, content, footer)
   - Link attributes (rel, title, class names)
   - Anchor text quality and length
   - URL structure and depth
2. **Contextual Score (0-1)**: Relevance to your query
   - Semantic similarity using embeddings
   - Keyword matching in link text and title
   - Meta description analysis
   - Content preview scoring
3. **Total Score**: Weighted combination for final ranking

**Expected Real-World Impact:**

- **Research Efficiency**: Find relevant papers 10x faster by following only high-scoring links
- **Competitive Analysis**: Automatically identify important pages on competitor sites
- **Content Discovery**: Build topic-focused crawlers that stay on track
- **SEO Audits**: Identify and prioritize high-value internal linking opportunities

## 🎣 Async URL Seeder: Automated URL Discovery at Scale

**The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.

**My Solution:** I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
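One detail worth making concrete before the code: the `pattern` and `ignore_patterns` options in the configuration below behave like shell-style globs over each discovered URL—keep what matches the pattern, drop anything that matches an ignore rule. Here is a minimal, stand-alone sketch of that filtering idea (my own `keep_url` helper, not the seeder's actual implementation):

```python
from fnmatch import fnmatch

def keep_url(url: str, pattern: str, ignore_patterns: list[str]) -> bool:
    """Illustrative glob-style filter: keep URLs matching `pattern`
    unless they also match one of the `ignore_patterns`."""
    path = url.split("://", 1)[-1]  # match against host + path, scheme stripped
    if not fnmatch(path, pattern):
        return False
    return not any(fnmatch(path, bad) for bad in ignore_patterns)

urls = [
    "https://shop.example.com/product/blue-widget",
    "https://shop.example.com/product/blue-widget/reviews/123",
    "https://shop.example.com/about",
]
kept = [u for u in urls if keep_url(u, "*/product/*", ["*/reviews/*", "*/questions/*"])]
print(kept)  # ['https://shop.example.com/product/blue-widget']
```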
### Technical Architecture

```python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Basic discovery - find all product pages
seeder_config = SeedingConfig(
    # Discovery sources
    source="sitemap+cc",   # Sitemap + Common Crawl

    # Filtering
    pattern="*/product/*",  # URL pattern matching
    ignore_patterns=["*/reviews/*", "*/questions/*"],

    # Validation
    live_check=True,       # Verify URLs are alive
    max_urls=5000,         # Stop at 5000 URLs

    # Performance
    concurrency=100,       # Parallel requests
    hits_per_sec=10        # Rate limiting
)

seeder = AsyncUrlSeeder(seeder_config)
urls = await seeder.discover("https://shop.example.com")

# Advanced: Relevance-based discovery
research_config = SeedingConfig(
    source="crawl+sitemap",  # Deep crawl + sitemap
    pattern="*/blog/*",      # Blog posts only

    # Content relevance
    extract_head=True,       # Get meta tags
    query="quantum computing tutorials",
    scoring_method="bm25",   # Or "semantic" (coming soon)
    score_threshold=0.4,     # High relevance only

    # Smart filtering
    filter_nonsense_urls=True,  # Remove .xml, .txt, etc.
    min_content_length=500,     # Skip thin content
    force=True                  # Bypass cache
)

# Discover with progress tracking
discovered = []
async for batch in seeder.discover_iter("https://physics-blog.com", research_config):
    discovered.extend(batch)
    print(f"Found {len(discovered)} relevant URLs so far...")

# Results include scores and metadata
for url_data in discovered[:5]:
    print(f"URL: {url_data['url']}")
    print(f"Score: {url_data['score']:.3f}")
    print(f"Title: {url_data['title']}")
```

**Discovery Methods:**

- **Sitemap Mining**: Parses robots.txt and all linked sitemaps
- **Common Crawl**: Queries the Common Crawl index for historical URLs
- **Intelligent Crawling**: Follows links with smart depth control
- **Pattern Analysis**: Learns URL structures and generates variations

**Expected Real-World Impact:**

- **Migration Projects**: Discover 10,000+ URLs from legacy sites in under 60 seconds
- **Market Research**: Map entire competitor ecosystems automatically
- **Academic Research**: Build comprehensive datasets without manual URL collection
- **SEO Audits**: Find every indexable page with content scoring
- **Content Archival**: Ensure no content is left behind during site migrations

## ⚡ Performance Optimizations

This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
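Numbers like the ones below always depend on your workload, so it's worth timing the difference on your own URL list. A minimal benchmark sketch, assuming the batched `arun_batch` API shown in the next section (the URLs here are placeholders):

```python
import asyncio
import time

from crawl4ai import AsyncWebCrawler

# Placeholder URL list - swap in your own pages
URLS = [f"https://example.com/page/{i}" for i in range(20)]

async def sequential(crawler):
    # One request at a time (the pre-0.7.0 pattern)
    return [await crawler.arun(url) for url in URLS]

async def batched(crawler):
    # Batched API from this release (see the section below)
    return await crawler.arun_batch(URLS)

async def main():
    async with AsyncWebCrawler() as crawler:
        for label, fn in [("sequential", sequential), ("batched", batched)]:
            start = time.perf_counter()
            results = await fn(crawler)
            print(f"{label}: {len(results)} pages in {time.perf_counter() - start:.1f}s")

asyncio.run(main())
```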
### What We Optimized

```python
# Before v0.7.0 (slow)
results = []
for url in urls:
    result = await crawler.arun(url)
    results.append(result)

# After v0.7.0 (fast)
# Automatic batching and connection pooling
results = await crawler.arun_batch(
    urls,
    config=CrawlerRunConfig(
        # New performance options
        batch_size=10,              # Process 10 URLs concurrently
        reuse_browser=True,         # Keep browser warm
        eager_loading=False,        # Load only what's needed
        streaming_extraction=True,  # Stream large extractions

        # Optimized defaults
        wait_until="domcontentloaded",    # Faster than networkidle
        exclude_external_resources=True,  # Skip third-party assets
        block_ads=True                    # Ad blocking built-in
    )
)

# Memory-efficient streaming for large crawls
async for result in crawler.arun_stream(large_url_list):
    # Process results as they complete
    await process_result(result)
    # Memory is freed after each iteration
```

**Performance Gains:**

- **Startup Time**: 70% faster browser initialization
- **Page Loading**: 40% reduction with smart resource blocking
- **Extraction**: 3x faster with compiled CSS selectors
- **Memory Usage**: 60% reduction with streaming processing
- **Concurrent Crawls**: Handle 5x more parallel requests

## 📄 PDF Support

PDF extraction is now natively supported in Crawl4AI.

```python
# Extract data from PDF documents
result = await crawler.arun(
    "https://example.com/report.pdf",
    config=CrawlerRunConfig(
        pdf_extraction=True,
        extraction_strategy=JsonCssExtractionStrategy({
            # Works on the converted PDF structure
            "title": {"selector": "h1", "type": "text"},
            "sections": {"selector": "h2", "type": "list"}
        })
    )
)
```

## 🔧 Important Changes

### Breaking Changes

- `link_extractor` renamed to `link_preview` (better reflects its functionality)
- Minimum Python version is now 3.9
- `CrawlerConfig` split into `CrawlerRunConfig` and `BrowserConfig`

### Migration Guide

```python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)

# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig, CacheMode
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```

## 🤖 Coming Soon: Intelligent Web Automation

I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:

- **Crawl Agents**: Autonomous crawlers that understand your goals and adapt their strategies
- **Auto JS Generation**: Automatic JavaScript code generation for complex interactions
- **Smart Form Handling**: Intelligent form detection and filling
- **Context-Aware Actions**: Crawlers that understand page context and make decisions

These features are under active development and will revolutionize how we approach web automation. Stay tuned!

## 🚀 Get Started

```bash
pip install crawl4ai==0.7.0
```

Check out the [updated documentation](https://docs.crawl4ai.com).

Questions? Issues? I'm always listening:

- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- Twitter: [@unclecode](https://x.com/unclecode)

Happy crawling! 🕷️

---

*P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.*