feat(crawl4ai): Implement adaptive crawling feature

This commit introduces adaptive crawling to the crawl4ai project. The feature determines when sufficient information has been gathered during a crawl, which improves efficiency, reduces unnecessary resource usage, and gives users a more efficient web crawling tool.

The changes add new files for the adaptive crawler (the main adaptive crawler script, utility functions, and configuration and strategy scripts), modify the project's initialization file and utility functions, and update the documentation with detailed explanations and examples of the feature.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
2025-07-04 15:16:53 +08:00
parent 74705c1f67
commit 1a73fb60db
29 changed files with 8800 additions and 3 deletions
--- a/PROGRESSIVE_CRAWLING.md
+++ b/PROGRESSIVE_CRAWLING.md
@@ -0,0 +1,320 @@
 # Progressive Web Crawling with Adaptive Information Foraging
 ## Abstract
 This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets.
 ## 1. Introduction
 Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: *"When do we have enough information to answer a query and similar queries in its domain?"*
 We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow.
 ## 2. Problem Formulation
 ### 2.1 Definitions
 Let:
 - **K** = {d₁, d₂, ..., dₙ} be the current knowledge base (crawled documents)
 - **Q** be the user query
 - **L** = {l₁, l₂, ..., lₘ} be available links with preview metadata
 - **θ** be the confidence threshold for information sufficiency
 ### 2.2 Objectives
 1. **Minimize** |K| (number of crawled pages)
 2. **Maximize** P(answers(Q) | K) (probability of answering Q given K)
 3. **Ensure** coverage of Q's domain (similar queries)
 ## 3. Mathematical Framework
 ### 3.1 Information Sufficiency Metric
 We define Information Sufficiency as:
 ```
 IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q)
 ```
 ### 3.2 Coverage Score
 Coverage measures how well current knowledge covers query terms and related concepts:
 ```
 Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q|
 ```
 Where:
 - df(t, K) = document frequency of term t in knowledge base K
 - idf(t) = inverse document frequency weight
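The coverage formula can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and an optional precomputed IDF table that defaults to 1.0 for unknown terms; the function name `coverage_score` is hypothetical and is not taken from the crawl4ai code.

```python
import math
from typing import Dict, List, Optional


def coverage_score(
    query: str,
    documents: List[str],
    idf: Optional[Dict[str, float]] = None,
) -> float:
    """Unnormalized coverage: mean over query terms of log(df + 1) * idf."""
    idf = idf or {}
    terms = query.lower().split()
    if not terms:
        return 0.0
    # Tokenize each document once so df lookups are cheap.
    doc_terms = [set(doc.lower().split()) for doc in documents]
    total = 0.0
    for term in terms:
        df = sum(1 for words in doc_terms if term in words)
        total += math.log(df + 1) * idf.get(term, 1.0)
    return total / len(terms)
```

The raw value is unbounded, so it would need to be normalized (for example by a maximum attainable score) before being compared with a threshold such as θ.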
 ### 3.3 Consistency Score
 Consistency measures information coherence across documents:
 ```
 Consistency(K, Q) = 1 - Var(answers from random subsets of K)
 ```
 This captures the principle that sufficient knowledge should provide stable answers regardless of document subset.
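One way to make the subset-variance idea concrete is sketched below. It assumes that the "answer" from a subset can be approximated by the fraction of query terms the subset contains; that proxy, the function names, and the sampling parameters are all illustrative assumptions rather than the project's method.

```python
import random
from statistics import pvariance
from typing import List


def term_hit_rate(query: str, documents: List[str]) -> float:
    """Fraction of query terms that appear in at least one document."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    seen = set()
    for doc in documents:
        seen |= terms & set(doc.lower().split())
    return len(seen) / len(terms)


def consistency_score(
    query: str,
    documents: List[str],
    n_subsets: int = 20,
    subset_fraction: float = 0.5,
    seed: int = 0,
) -> float:
    """1 - variance of a proxy answer score over random subsets of K."""
    if len(documents) < 2:
        return 0.0  # too little evidence to judge stability
    rng = random.Random(seed)
    size = max(1, int(len(documents) * subset_fraction))
    scores = [
        term_hit_rate(query, rng.sample(documents, size))
        for _ in range(n_subsets)
    ]
    return 1.0 - pvariance(scores)
```

Because the proxy score lies in [0, 1], its variance is at most 0.25, so this particular sketch returns values between 0.75 and 1.0 whenever at least two documents are available.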
 ### 3.4 Saturation Score
 Saturation detects diminishing returns:
 ```
 Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁))
 ```
 Where ΔInfo represents marginal information gain from the nth crawl.
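A small sketch of the saturation formula follows, assuming that marginal information gain is approximated by the number of previously unseen terms each document contributes. The term-count proxy and the name `saturation_score` are assumptions made for illustration.

```python
from typing import List, Set


def saturation_score(documents: List[str]) -> float:
    """1 - (new terms in the latest document / new terms in the first)."""
    seen: Set[str] = set()
    gains: List[int] = []
    for doc in documents:
        terms = set(doc.lower().split())
        gains.append(len(terms - seen))  # marginal gain of this crawl
        seen |= terms
    if not gains or gains[0] == 0:
        return 0.0
    return 1.0 - gains[-1] / gains[0]
```

A value near 1 means the latest page added almost nothing new. The ratio can go negative when the latest page contributes more new terms than the first one did, so a caller may want to clamp it.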
 ### 3.5 Link Value Prediction
 Expected information gain from uncrawled links:
 ```
 ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l)
 ```
 Components:
 - **Relevance**: BM25(preview_text, Q)
 - **Novelty**: 1 - max_similarity(preview, K)
 - **Authority**: f(url_structure, domain_metrics)
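The product of the three components can be sketched as follows. To stay self-contained, this example swaps in simple term overlap for BM25 and Jaccard similarity for `max_similarity`, and takes authority as a precomputed number; `LinkPreview` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class LinkPreview:
    href: str
    preview_text: str
    authority: float = 1.0  # assumed to be computed elsewhere


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def relevance(preview: str, query: str) -> float:
    """Term-overlap stand-in for BM25(preview_text, Q)."""
    q = _tokens(query)
    return len(q & _tokens(preview)) / len(q) if q else 0.0


def novelty(preview: str, knowledge_base: List[str]) -> float:
    """1 - max Jaccard similarity between the preview and any crawled doc."""
    p = _tokens(preview)
    best = 0.0
    for doc in knowledge_base:
        d = _tokens(doc)
        union = p | d
        if union:
            best = max(best, len(p & d) / len(union))
    return 1.0 - best


def expected_gain(link: LinkPreview, query: str, knowledge_base: List[str]) -> float:
    return (
        relevance(link.preview_text, query)
        * novelty(link.preview_text, knowledge_base)
        * link.authority
    )
```

Because the components are multiplied, a link that is highly relevant but duplicates existing knowledge scores zero, which is the behavior the formula is designed to produce.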
 ## 4. Algorithmic Approach
 ### 4.1 Progressive Crawling Algorithm
 ```
 Algorithm: ProgressiveCrawl(start_url, Q, θ)
  first_doc ← crawl(start_url)
  K ← {first_doc}
  crawled ← {start_url}
  pending ← extract_links(first_doc)
  while IS(K, Q) < θ and |crawled| < max_pages and pending ≠ ∅:
    candidates ← rank_by_expected_gain(pending, Q, K)
    if max(ExpectedGain(candidates)) < min_gain:
      break  // Diminishing returns
    to_crawl ← top_k(candidates)
    new_docs ← parallel_crawl(to_crawl)
    K ← K ∪ new_docs
    crawled ← crawled ∪ to_crawl
    pending ← (pending ∪ extract_new_links(new_docs)) - crawled
  return K
 ```
 ### 4.2 Stopping Criteria
 Crawling terminates when:
 1. IS(K, Q) ≥ θ (sufficient information)
 2. d(IS)/d(crawls) < ε (plateau reached)
 3. |crawled| ≥ max_pages (resource limit)
 4. max(ExpectedGain) < min_gain (no promising links)
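The four criteria can be combined into a single check, sketched below. The default values for `threshold`, `max_pages`, and `min_gain` mirror the balanced example configuration elsewhere in this commit; `epsilon`, the function name, and the returned labels are assumptions for illustration.

```python
from typing import List, Optional


def stop_reason(
    confidence_history: List[float],
    pages_crawled: int,
    best_expected_gain: float,
    threshold: float = 0.7,
    epsilon: float = 0.01,
    max_pages: int = 20,
    min_gain: float = 0.05,
) -> Optional[str]:
    """Return a label for the first stopping criterion met, or None to continue."""
    current = confidence_history[-1] if confidence_history else 0.0
    if current >= threshold:
        return "sufficient"
    if pages_crawled >= max_pages:
        return "max_pages"
    if best_expected_gain < min_gain:
        return "no_promising_links"
    # Plateau: the latest crawl step improved confidence by less than epsilon.
    if len(confidence_history) >= 2 and current - confidence_history[-2] < epsilon:
        return "plateau"
    return None
```

Returning a label instead of a boolean makes it easier to log why a crawl ended, which helps when tuning thresholds.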
 ## 5. Multi-Strategy Architecture
 ### 5.1 Strategy Pattern Design
 ```
 AbstractStrategy
  ├── StatisticalStrategy (no LLM, no embeddings)
  ├── EmbeddingStrategy (with semantic similarity)
  └── LLMStrategy (with language model assistance)
 ```
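The hierarchy above could be expressed with an abstract base class, as in this sketch. The method names `calculate_confidence` and `rank_links` follow Appendix A; the dictionary-based state and the toy `KeywordStrategy` subclass are hypothetical and exist only to show how a concrete strategy plugs in.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class AbstractStrategy(ABC):
    """Interface that every crawl strategy is expected to implement."""

    @abstractmethod
    def calculate_confidence(self, state: Dict[str, Any]) -> float:
        ...

    @abstractmethod
    def rank_links(self, state: Dict[str, Any]) -> List[Tuple[str, float]]:
        ...


class KeywordStrategy(AbstractStrategy):
    """Toy strategy: confidence is the fraction of query terms seen so far."""

    def calculate_confidence(self, state: Dict[str, Any]) -> float:
        terms = set(state["query"].lower().split())
        if not terms:
            return 0.0
        seen = set()
        for doc in state["documents"]:
            seen |= terms & set(doc.lower().split())
        return len(seen) / len(terms)

    def rank_links(self, state: Dict[str, Any]) -> List[Tuple[str, float]]:
        terms = set(state["query"].lower().split())
        scored = []
        for url, preview in state["pending_links"]:
            overlap = len(terms & set(preview.lower().split()))
            scored.append((url, overlap / len(terms) if terms else 0.0))
        # Highest-scoring links first.
        return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this shape, the orchestrator only depends on the two abstract methods, so statistical, embedding, and LLM-assisted strategies can be swapped without changing the crawl loop.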
 ### 5.2 Statistical Strategy
 Pure statistical approach using:
 - BM25 for relevance scoring
 - Term frequency analysis for coverage
 - Graph structure for authority
 - No external models required
 **Advantages**: Fast, no API costs, works offline
 **Best for**: Technical documentation, specific terminology
 ### 5.3 Embedding Strategy (Implemented)
 Semantic understanding through embeddings:
 - Query expansion into semantic variations
 - Coverage mapping in embedding space
 - Gap-driven link selection
 - Validation-based stopping criteria
 **Mathematical Framework**:
 ```
 Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded)
 Gap(q) = 1 - max_similarity(q, K)
 LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K))
 ```
 **Key Parameters**:
 - `embedding_k_exp`: Exponential decay factor for distance-to-score mapping
 - `embedding_coverage_radius`: Distance threshold for query coverage
 - `embedding_min_confidence_threshold`: Minimum relevance threshold
 **Advantages**: Semantic understanding, handles ambiguity, detects irrelevance
 **Best for**: Research queries, conceptual topics, diverse content
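The three formulas in the mathematical framework can be sketched in plain Python. This example assumes that embeddings are already available as lists of floats and that cosine similarity is used for `max_similarity`, `relevance`, and `redundancy`; the function names are illustrative and the decay and radius parameters listed above are not modeled.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(vec: Vector, knowledge_base: List[Vector]) -> float:
    return max((cosine(vec, doc) for doc in knowledge_base), default=0.0)


def coverage(expanded_queries: List[Vector], knowledge_base: List[Vector]) -> float:
    """Mean over expanded queries of the best similarity to any document."""
    if not expanded_queries:
        return 0.0
    total = sum(max_similarity(q, knowledge_base) for q in expanded_queries)
    return total / len(expanded_queries)


def link_score(
    link: Vector,
    expanded_queries: List[Vector],
    knowledge_base: List[Vector],
) -> float:
    """Sum of Gap(q) * relevance(l, q), discounted by redundancy with K."""
    gap_weighted = sum(
        (1.0 - max_similarity(q, knowledge_base)) * cosine(link, q)
        for q in expanded_queries
    )
    return gap_weighted * (1.0 - max_similarity(link, knowledge_base))
```

In this form a link scores highest when it is close to a query variation that the knowledge base does not yet cover and far from everything already crawled.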
 ### 5.4 Progressive Enhancement Path
 1. **Level 0**: Statistical only (implemented)
 2. **Level 1**: + Embeddings for semantic similarity (implemented)
 3. **Level 2**: + LLM for query understanding (future)
 ## 6. Evaluation Methodology
 ### 6.1 Synthetic Dataset Generation
 Using LLM to create evaluation data:
 ```python
 def generate_synthetic_dataset(domain_url):
    # 1. Fully crawl domain
    full_knowledge = exhaustive_crawl(domain_url)
    # 2. Generate answerable queries
    queries = llm_generate_queries(full_knowledge)
    # 3. Create query variations (synonyms, sub/super queries), keyed by query
    variations = {q: generate_variations(q) for q in queries}
    return queries, variations, full_knowledge
 ```
 ### 6.2 Evaluation Metrics
 1. **Efficiency**: Information gained / Pages crawled
 2. **Completeness**: Answerable queries / Total queries
 3. **Redundancy**: 1 - (Unique information / Total information)
 4. **Convergence Rate**: Pages to 95% completeness
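The four metrics reduce to simple ratios, sketched below. The example assumes that "information" is counted in discrete units (for instance distinct terms or facts) and that completeness is recorded after each crawled page; the function and key names are hypothetical.

```python
from typing import Dict, List, Optional


def evaluation_metrics(
    pages_crawled: int,
    answerable_queries: int,
    total_queries: int,
    unique_info: int,
    total_info: int,
) -> Dict[str, float]:
    """Efficiency, completeness, and redundancy as defined in the list above."""
    return {
        "efficiency": unique_info / pages_crawled if pages_crawled else 0.0,
        "completeness": answerable_queries / total_queries if total_queries else 0.0,
        "redundancy": 1.0 - unique_info / total_info if total_info else 0.0,
    }


def pages_to_completeness(
    completeness_by_page: List[float], target: float = 0.95
) -> Optional[int]:
    """Convergence rate: number of pages needed to first reach the target."""
    for pages, value in enumerate(completeness_by_page, start=1):
        if value >= target:
            return pages
    return None  # target never reached
```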
 ### 6.3 Ablation Studies
 - Impact of each score component (coverage, consistency, saturation)
 - Sensitivity to threshold parameters
 - Performance across different domain types
 ## 7. Theoretical Properties
 ### 7.1 Convergence Guarantee
 **Theorem**: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages.
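 **Proof sketch**: Each iteration crawls at least one previously uncrawled page, and a finite website has finitely many reachable pages, so the loop must terminate. IS(K, Q) is bounded above by 1 and is assumed to be non-decreasing with each crawl, so the threshold, once reached, is not lost.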
 ### 7.2 Optimality
 Under certain assumptions about link preview accuracy:
 - Expected crawls ≤ 2 × optimal_crawls
 - Approximation ratio improves with preview quality
 ## 8. Implementation Design
 ### 8.1 Core Components
 1. **CrawlState**: Maintains crawl history and metrics
 2. **AdaptiveConfig**: Configuration parameters
 3. **CrawlStrategy**: Pluggable strategy interface
 4. **AdaptiveCrawler**: Main orchestrator
 ### 8.2 Integration with Crawl4AI
 - Wraps existing AsyncWebCrawler
 - Leverages link preview functionality
 - Maintains backward compatibility
 ### 8.3 Persistence
 Knowledge base serialization for:
 - Resumable crawls
 - Knowledge sharing
 - Offline analysis
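A knowledge base can be serialized one document per line (JSONL), which matches the export format mentioned in the examples README. The sketch below uses only the standard library; the record fields `url` and `content` and the function names are assumptions, not the format written by `export_knowledge_base`.

```python
import json
from typing import Dict, List


def export_jsonl(documents: List[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line so the file can be streamed or appended."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def import_jsonl(path: str) -> List[Dict[str, str]]:
    """Read the documents back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A line-oriented format keeps resumable crawls simple: new documents can be appended without rewriting the file, and a partially written final line is easy to detect.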
 ## 9. Future Directions
 ### 9.1 Advanced Scoring
 - Temporal information value
 - Multi-query optimization
 - Active learning from user feedback
 ### 9.2 Distributed Crawling
 - Collaborative knowledge building
 - Federated information sufficiency
 ### 9.3 Domain Adaptation
 - Transfer learning across domains
 - Meta-learning for threshold selection
 ## 10. Conclusion
 Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources.
 ## References
 1. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
 2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
 3. Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675.
 4. Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems.
 ## Appendix A: Implementation Pseudocode
 ```python
 from math import log
 from statistics import mean
 class StatisticalStrategy:
    def calculate_confidence(self, state):
        coverage = self.calculate_coverage(state)
        consistency = self.calculate_consistency(state)
        saturation = self.calculate_saturation(state)
        return min(coverage, consistency, saturation)
    def calculate_coverage(self, state):
        # Log-scaled, IDF-weighted term coverage
        terms = state.query.split()
        if not terms:
            return 0.0
        term_scores = []
        for term in terms:
            df = state.document_frequencies.get(term, 0)
            idf = self.idf_cache.get(term, 1.0)
            term_scores.append(log(df + 1) * idf)
        # Assumption: self.max_possible_score is a positive normalization constant
        return mean(term_scores) / self.max_possible_score
    def rank_links(self, state):
        scored_links = []
        for link in state.pending_links:
            relevance = self.bm25_score(link.preview_text, state.query)
            novelty = self.calculate_novelty(link, state.knowledge_base)
            authority = self.url_authority(link.href)
            score = relevance * novelty * authority
            scored_links.append((link, score))
        return sorted(scored_links, key=lambda x: x[1], reverse=True)
 ```
 ## Appendix B: Evaluation Protocol
 1. **Dataset Creation**:
   - Select diverse domains (documentation, blogs, e-commerce)
   - Generate 100 queries per domain using LLM
   - Create query variations (5-10 per query)
 2. **Baseline Comparisons**:
   - BFS crawler (depth-limited)
   - DFS crawler (depth-limited)
   - Random crawler
   - Oracle (knows relevant pages)
 3. **Metrics Collection**:
   - Pages crawled vs query answerability
   - Time to sufficient confidence
   - False positive/negative rates
 4. **Statistical Analysis**:
   - ANOVA for strategy comparison
   - Regression for parameter sensitivity
   - Bootstrap for confidence intervals
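For the bootstrap step, a percentile confidence interval for a mean (for example, mean pages crawled per query) can be sketched with the standard library alone. The function name, the resample count, and the fixed seed are illustrative choices, not part of the protocol.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    samples: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    if not samples:
        raise ValueError("samples must not be empty")
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

The percentile method makes no normality assumption, which suits metrics such as pages crawled that are often skewed.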
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -69,6 +69,14 @@ from .deep_crawling import (
 )
 # NEW: Import AsyncUrlSeeder
 from .async_url_seeder import AsyncUrlSeeder
 # Adaptive Crawler
 from .adaptive_crawler import (
    AdaptiveCrawler,
    AdaptiveConfig,
    CrawlState,
    CrawlStrategy,
    StatisticalStrategy
 )
 # C4A Script Language Support
 from .script import (
@@ -97,6 +105,12 @@ __all__ = [
    "VirtualScrollConfig",
    # NEW: Add AsyncUrlSeeder
    "AsyncUrlSeeder",
    # Adaptive Crawler
    "AdaptiveCrawler",
    "AdaptiveConfig", 
    "CrawlState",
    "CrawlStrategy",
    "StatisticalStrategy",
    "DeepCrawlStrategy",
    "BFSDeepCrawlStrategy",
    "BestFirstCrawlingStrategy",
--- a/crawl4ai/adaptive_crawler
+++ b/crawl4ai/adaptive_crawler
--- a/crawl4ai/adaptive_crawler.py
+++ b/crawl4ai/adaptive_crawler.py
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -32,7 +32,6 @@ import hashlib
 from urllib.robotparser import RobotFileParser
 import aiohttp
 from urllib.parse import urlparse, urlunparse
 from functools import lru_cache
 from packaging import version
@@ -43,6 +42,14 @@ from itertools import chain
 from collections import deque
 from typing import  Generator, Iterable
 import numpy as np
 from urllib.parse import (
    urljoin, urlparse, urlunparse,
    parse_qsl, urlencode, quote, unquote
 )
 def chunk_documents(
    documents: Iterable[str],
    chunk_token_threshold: int,
@@ -2071,6 +2078,92 @@ def normalize_url(href, base_url):
    return normalized
 def normalize_url(
    href: str,
    base_url: str,
    *,
    drop_query_tracking=True,
    sort_query=True,
    keep_fragment=False,
    extra_drop_params=None
 ):
    """
    Extended URL normalizer
    Parameters
    ----------
    href : str
        The raw link extracted from a page.
    base_url : str
        The page’s canonical URL (used to resolve relative links).
    drop_query_tracking : bool (default True)
        Remove common tracking query parameters.
    sort_query : bool (default True)
        Alphabetically sort query keys for deterministic output.
    keep_fragment : bool (default False)
        Preserve the hash fragment (#section) if you need in-page links.
    extra_drop_params : Iterable[str] | None
        Additional query keys to strip (case-insensitive).
    Returns
    -------
    str | None
        A clean, canonical URL or None if href is empty/None.
    """
    if not href:
        return None
    # Resolve relative paths first
    full_url = urljoin(base_url, href.strip())
    # Parse once, edit parts, then rebuild
    parsed = urlparse(full_url)
    # ── netloc ──
    netloc = parsed.netloc.lower()
    # ── path ──
    # Normalize percent-encoding and strip trailing “/” (except root)
    path = quote(unquote(parsed.path))
    if path.endswith('/') and path != '/':
        path = path.rstrip('/')
    # ── query ──
    query = parsed.query
    if query:
        # explode, mutate, then rebuild
        params = [(k.lower(), v) for k, v in parse_qsl(query, keep_blank_values=True)]
        if drop_query_tracking:
            default_tracking = {
                'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
                'utm_content', 'gclid', 'fbclid', 'ref', 'ref_src'
            }
            if extra_drop_params:
                default_tracking |= {p.lower() for p in extra_drop_params}
            params = [(k, v) for k, v in params if k not in default_tracking]
        if sort_query:
            params.sort(key=lambda kv: kv[0])
        query = urlencode(params, doseq=True) if params else ''
    # ── fragment ──
    fragment = parsed.fragment if keep_fragment else ''
    # Re-assemble
    normalized = urlunparse((
        parsed.scheme,
        netloc,
        path,
        parsed.params,
        query,
        fragment
    ))
    return normalized
 def normalize_url_for_deep_crawl(href, base_url):
    """Normalize URLs to ensure consistent format"""
    from urllib.parse import urljoin, urlparse, urlunparse, parse_qs, urlencode
@@ -3148,3 +3241,108 @@ def calculate_total_score(
    return max(0.0, min(total, 10.0))
 # Embedding utilities
 async def get_text_embeddings(
    texts: List[str], 
    llm_config: Optional[Dict] = None,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    batch_size: int = 32
 ) -> np.ndarray:
    """
    Compute embeddings for a list of texts using specified model.
    Args:
        texts: List of texts to embed
        llm_config: Optional LLM configuration for API-based embeddings
        model_name: Model name (used when llm_config is None)
        batch_size: Batch size for processing
    Returns:
        numpy array of embeddings
    """
    import numpy as np
    if not texts:
        return np.array([])
    # If LLMConfig provided, use litellm for embeddings
    if llm_config is not None:
        from litellm import aembedding
        # Get embedding model from config or use default
        embedding_model = llm_config.get('provider', 'text-embedding-3-small')
        api_base = llm_config.get('base_url', llm_config.get('api_base'))
        # Prepare kwargs
        kwargs = {
            'model': embedding_model,
            'input': texts,
            'api_key': llm_config.get('api_token', llm_config.get('api_key'))
        }
        if api_base:
            kwargs['api_base'] = api_base
        # Handle OpenAI-compatible endpoints
        if api_base and 'openai/' not in embedding_model:
            kwargs['model'] = f"openai/{embedding_model}"
        # Get embeddings
        response = await aembedding(**kwargs)
        # Extract embeddings from response
        embeddings = []
        for item in response.data:
            embeddings.append(item['embedding'])
        return np.array(embeddings)
    # Default: use sentence-transformers
    else:
        # Lazy load to avoid importing heavy libraries unless needed
        from sentence_transformers import SentenceTransformer
        # Cache the model in function attribute to avoid reloading
        if not hasattr(get_text_embeddings, '_models'):
            get_text_embeddings._models = {}
        if model_name not in get_text_embeddings._models:
            get_text_embeddings._models[model_name] = SentenceTransformer(model_name)
        encoder = get_text_embeddings._models[model_name]
        # Batch encode for efficiency
        embeddings = encoder.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False,
            convert_to_numpy=True
        )
        return embeddings
 def get_text_embeddings_sync(
    texts: List[str],
    llm_config: Optional[Dict] = None,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    batch_size: int = 32
 ) -> np.ndarray:
    """Synchronous wrapper for get_text_embeddings"""
    import numpy as np
    return asyncio.run(get_text_embeddings(texts, llm_config, model_name, batch_size))
 def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Calculate cosine similarity between two vectors"""
    import numpy as np
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return float(dot_product / norm_product) if norm_product != 0 else 0.0
 def cosine_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Calculate cosine distance (1 - similarity) between two vectors"""
    return 1 - cosine_similarity(vec1, vec2)
--- a/docs/examples/adaptive_crawling/README.md
+++ b/docs/examples/adaptive_crawling/README.md
@@ -0,0 +1,85 @@
 # Adaptive Crawling Examples
 This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
 ## Examples Overview
 ### 1. `basic_usage.py`
 - Simple introduction to adaptive crawling
 - Uses default statistical strategy
 - Shows how to get crawl statistics and relevant content
 ### 2. `embedding_strategy.py` ⭐ NEW
 - Demonstrates the embedding-based strategy for semantic understanding
 - Shows query expansion and irrelevance detection
 - Includes configuration for both local and API-based embeddings
 ### 3. `embedding_vs_statistical.py` ⭐ NEW
 - Direct comparison between statistical and embedding strategies
 - Helps you choose the right strategy for your use case
 - Shows performance and accuracy trade-offs
 ### 4. `embedding_configuration.py` ⭐ NEW
 - Advanced configuration options for embedding strategy
 - Parameter tuning guide for different scenarios
 - Examples for research, exploration, and quality-focused crawling
 ### 5. `advanced_configuration.py`
 - Shows various configuration options for both strategies
 - Demonstrates threshold tuning and performance optimization
 ### 6. `custom_strategies.py`
 - How to implement your own crawling strategy
 - Extends the base CrawlStrategy class
 - Advanced use case for specialized requirements
 ### 7. `export_import_kb.py`
 - Export crawled knowledge base to JSONL
 - Import and continue crawling from saved state
 - Useful for building persistent knowledge bases
 ## Quick Start
 For your first adaptive crawling experience, run:
 ```bash
 python basic_usage.py
 ```
 To try the new embedding strategy with semantic understanding:
 ```bash
 python embedding_strategy.py
 ```
 To compare strategies and see which works best for your use case:
 ```bash
 python embedding_vs_statistical.py
 ```
 ## Strategy Selection Guide
 ### Use Statistical Strategy (Default) When:
 - Working with technical documentation
 - Queries contain specific terms or code
 - Speed is critical
 - No API access available
 ### Use Embedding Strategy When:
 - Queries are conceptual or ambiguous
 - Need semantic understanding beyond exact matches
 - Want to detect irrelevant content
 - Working with diverse content sources
 ## Requirements
 - Crawl4AI installed
 - For embedding strategy with local models: `sentence-transformers`
 - For embedding strategy with OpenAI: Set `OPENAI_API_KEY` environment variable
 ## Learn More
 - [Adaptive Crawling Documentation](https://docs.crawl4ai.com/core/adaptive-crawling/)
 - [Mathematical Framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md)
 - [Blog: The Adaptive Crawling Revolution](https://docs.crawl4ai.com/blog/adaptive-crawling-revolution/)
--- a/docs/examples/adaptive_crawling/advanced_configuration.py
+++ b/docs/examples/adaptive_crawling/advanced_configuration.py
@@ -0,0 +1,207 @@
 """
 Advanced Adaptive Crawling Configuration
 This example demonstrates all configuration options available for adaptive crawling,
 including threshold tuning, persistence, and custom parameters.
 """
 import asyncio
 from pathlib import Path
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def main():
    """Demonstrate advanced configuration options"""
    # Example 1: Custom thresholds for different use cases
    print("="*60)
    print("EXAMPLE 1: Custom Confidence Thresholds")
    print("="*60)
    # High-precision configuration (exhaustive crawling)
    high_precision_config = AdaptiveConfig(
        confidence_threshold=0.9,      # Very high confidence required
        max_pages=50,                  # Allow more pages
        top_k_links=5,                 # Follow more links per page
        min_gain_threshold=0.02        # Lower threshold to continue
    )
    # Balanced configuration (default use case)
    balanced_config = AdaptiveConfig(
        confidence_threshold=0.7,      # Moderate confidence
        max_pages=20,                  # Reasonable limit
        top_k_links=3,                 # Moderate branching
        min_gain_threshold=0.05        # Standard gain threshold
    )
    # Quick exploration configuration
    quick_config = AdaptiveConfig(
        confidence_threshold=0.5,      # Lower confidence acceptable
        max_pages=10,                  # Strict limit
        top_k_links=2,                 # Minimal branching
        min_gain_threshold=0.1         # High gain required
    )
    async with AsyncWebCrawler(verbose=False) as crawler:
        # Test different configurations
        for config_name, config in [
            ("High Precision", high_precision_config),
            ("Balanced", balanced_config),
            ("Quick Exploration", quick_config)
        ]:
            print(f"\nTesting {config_name} configuration...")
            adaptive = AdaptiveCrawler(crawler, config=config)
            result = await adaptive.digest(
                start_url="https://httpbin.org",
                query="http headers authentication"
            )
            print(f"  - Pages crawled: {len(result.crawled_urls)}")
            print(f"  - Confidence achieved: {adaptive.confidence:.2%}")
            print(f"  - Coverage score: {adaptive.coverage_stats['coverage']:.2f}")
    # Example 2: Persistence and state management
    print("\n" + "="*60)
    print("EXAMPLE 2: State Persistence")
    print("="*60)
    state_file = "crawl_state_demo.json"
    # Configuration with persistence
    persistent_config = AdaptiveConfig(
        confidence_threshold=0.8,
        max_pages=30,
        save_state=True,              # Enable auto-save
        state_path=state_file         # Specify save location
    )
    async with AsyncWebCrawler(verbose=False) as crawler:
        # First crawl - will be interrupted
        print("\nStarting initial crawl (will interrupt after 5 pages)...")
        interrupt_config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=5,              # Artificially low to simulate interruption
            save_state=True,
            state_path=state_file
        )
        adaptive = AdaptiveCrawler(crawler, config=interrupt_config)
        result1 = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="exception handling try except finally"
        )
        print(f"First crawl completed: {len(result1.crawled_urls)} pages")
        print(f"Confidence reached: {adaptive.confidence:.2%}")
        # Resume crawl with higher page limit
        print("\nResuming crawl from saved state...")
        resume_config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=20,             # Increase limit
            save_state=True,
            state_path=state_file
        )
        adaptive2 = AdaptiveCrawler(crawler, config=resume_config)
        result2 = await adaptive2.digest(
            start_url="https://docs.python.org/3/",
            query="exception handling try except finally",
            resume_from=state_file
        )
        print(f"Resumed crawl completed: {len(result2.crawled_urls)} total pages")
        print(f"Final confidence: {adaptive2.confidence:.2%}")
        # Clean up
        Path(state_file).unlink(missing_ok=True)
    # Example 3: Link selection strategies
    print("\n" + "="*60)
    print("EXAMPLE 3: Link Selection Strategies")
    print("="*60)
    # Conservative link following
    conservative_config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=15,
        top_k_links=1,                # Only follow best link
        min_gain_threshold=0.15       # High threshold
    )
    # Aggressive link following
    aggressive_config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=15,
        top_k_links=10,               # Follow many links
        min_gain_threshold=0.01       # Very low threshold
    )
    async with AsyncWebCrawler(verbose=False) as crawler:
        for strategy_name, config in [
            ("Conservative", conservative_config),
            ("Aggressive", aggressive_config)
        ]:
            print(f"\n{strategy_name} link selection:")
            adaptive = AdaptiveCrawler(crawler, config=config)
            result = await adaptive.digest(
                start_url="https://httpbin.org",
                query="api endpoints"
            )
            # Analyze crawl pattern
            print(f"  - Total pages: {len(result.crawled_urls)}")
            print(f"  - Unique domains: {len(set(url.split('/')[2] for url in result.crawled_urls))}")
            print(f"  - Max depth reached: {max(url.count('/') for url in result.crawled_urls) - 2}")
            # Show saturation trend
            if hasattr(result, 'new_terms_history') and result.new_terms_history:
                print(f"  - New terms discovered: {result.new_terms_history[:5]}...")
                print(f"  - Saturation trend: {'decreasing' if result.new_terms_history[-1] < result.new_terms_history[0] else 'increasing'}")
    # Example 4: Monitoring crawl progress
    print("\n" + "="*60)
    print("EXAMPLE 4: Progress Monitoring")
    print("="*60)
    # Configuration with detailed monitoring
    monitor_config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=10,
        top_k_links=3
    )
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config=monitor_config)
        # Start crawl
        print("\nMonitoring crawl progress...")
        result = await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers"
        )
        # Detailed statistics
        print("\nDetailed crawl analysis:")
        adaptive.print_stats(detailed=True)
        # Export for analysis
        print("\nExporting knowledge base for external analysis...")
        adaptive.export_knowledge_base("knowledge_export_demo.jsonl")
        print("Knowledge base exported to: knowledge_export_demo.jsonl")
        # Show sample of exported data
        with open("knowledge_export_demo.jsonl", 'r') as f:
            first_line = f.readline()
            print(f"Sample export: {first_line[:100]}...")
        # Clean up
        Path("knowledge_export_demo.jsonl").unlink(missing_ok=True)
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/basic_usage.py
+++ b/docs/examples/adaptive_crawling/basic_usage.py
@@ -0,0 +1,76 @@
 """
 Basic Adaptive Crawling Example
 This example demonstrates the simplest use case of adaptive crawling:
 finding information about a specific topic and knowing when to stop.
 """
 import asyncio
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
 async def main():
    """Basic adaptive crawling example"""
    # Initialize the crawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Create an adaptive crawler with default settings (statistical strategy)
        adaptive = AdaptiveCrawler(crawler)
        # Note: You can also use embedding strategy for semantic understanding:
        # from crawl4ai import AdaptiveConfig
        # config = AdaptiveConfig(strategy="embedding")
        # adaptive = AdaptiveCrawler(crawler, config)
        # Start adaptive crawling
        print("Starting adaptive crawl for Python async programming information...")
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await context managers coroutines"
        )
        # Display crawl statistics
        print("\n" + "="*50)
        print("CRAWL STATISTICS")
        print("="*50)
        adaptive.print_stats(detailed=False)
        # Get the most relevant content found
        print("\n" + "="*50)
        print("MOST RELEVANT PAGES")
        print("="*50)
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for i, page in enumerate(relevant_pages, 1):
            print(f"\n{i}. {page['url']}")
            print(f"   Relevance Score: {page['score']:.2%}")
            # Show a snippet of the content
            content = page['content'] or ""
            if content:
                snippet = content[:200].replace('\n', ' ')
                if len(content) > 200:
                    snippet += "..."
                print(f"   Preview: {snippet}")
        # Show final confidence
        print(f"\n{'='*50}")
        print(f"Final Confidence: {adaptive.confidence:.2%}")
        print(f"Total Pages Crawled: {len(result.crawled_urls)}")
        print(f"Knowledge Base Size: {len(adaptive.state.knowledge_base)} documents")
        # Example: Check if we can answer specific questions
        print(f"\n{'='*50}")
        print("INFORMATION SUFFICIENCY CHECK")
        print(f"{'='*50}")
        if adaptive.confidence >= 0.8:
            print("✓ High confidence - can answer detailed questions about async Python")
        elif adaptive.confidence >= 0.6:
            print("~ Moderate confidence - can answer basic questions")
        else:
            print("✗ Low confidence - need more information")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/custom_strategies.py
+++ b/docs/examples/adaptive_crawling/custom_strategies.py
@@ -0,0 +1,373 @@
 """
 Custom Adaptive Crawling Strategies
 This example demonstrates how to implement custom scoring strategies
 for domain-specific crawling needs.
 """
 import asyncio
 import re
 from typing import Dict, Set
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 from crawl4ai.adaptive_crawler import CrawlState, Link
 class APIDocumentationStrategy:
    """
    Custom strategy optimized for API documentation crawling.
    Prioritizes endpoint references, code examples, and parameter descriptions.
    """
    def __init__(self):
        # Keywords that indicate high-value API documentation
        self.api_keywords = {
            'endpoint', 'request', 'response', 'parameter', 'authentication',
            'header', 'body', 'query', 'path', 'method', 'get', 'post', 'put',
            'delete', 'patch', 'status', 'code', 'example', 'curl', 'python'
        }
        # URL patterns that typically contain API documentation
        self.valuable_patterns = [
            r'/api/',
            r'/reference/',
            r'/endpoints?/',
            r'/methods?/',
            r'/resources?/'
        ]
        # Patterns to avoid
        self.avoid_patterns = [
            r'/blog/',
            r'/news/',
            r'/about/',
            r'/contact/',
            r'/legal/'
        ]
    def score_link(self, link: Link, query: str, state: CrawlState) -> float:
        """Custom link scoring for API documentation"""
        score = 1.0
        url = link.href.lower()
        # Boost API-related URLs
        for pattern in self.valuable_patterns:
            if re.search(pattern, url):
                score *= 2.0
                break
        # Reduce score for non-API content
        for pattern in self.avoid_patterns:
            if re.search(pattern, url):
                score *= 0.1
                break
        # Boost if preview contains API keywords
        if link.text:
            preview_lower = link.text.lower()
            keyword_count = sum(1 for kw in self.api_keywords if kw in preview_lower)
            score *= (1 + keyword_count * 0.2)
        # Prioritize shallow URLs (likely overview pages)
        depth = url.count('/') - 2  # Subtract protocol slashes
        if depth <= 3:
            score *= 1.5
        elif depth > 6:
            score *= 0.5
        return score
    def calculate_api_coverage(self, state: CrawlState, query: str) -> Dict[str, float]:
        """Calculate specialized coverage metrics for API documentation"""
        metrics = {
            'endpoint_coverage': 0.0,
            'example_coverage': 0.0,
            'parameter_coverage': 0.0
        }
        # Analyze knowledge base for API-specific content
        endpoint_patterns = [r'GET\s+/', r'POST\s+/', r'PUT\s+/', r'DELETE\s+/']
        example_patterns = [r'```\w+', r'curl\s+-', r'import\s+requests']
        param_patterns = [r'param(?:eter)?s?\s*:', r'required\s*:', r'optional\s*:']
        total_docs = len(state.knowledge_base)
        if total_docs == 0:
            return metrics
        docs_with_endpoints = 0
        docs_with_examples = 0
        docs_with_params = 0
        for doc in state.knowledge_base:
            content = doc.markdown.raw_markdown if hasattr(doc, 'markdown') else str(doc)
            # Check for endpoints
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in endpoint_patterns):
                docs_with_endpoints += 1
            # Check for examples
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in example_patterns):
                docs_with_examples += 1
            # Check for parameters
            if any(re.search(pattern, content, re.IGNORECASE) for pattern in param_patterns):
                docs_with_params += 1
        metrics['endpoint_coverage'] = docs_with_endpoints / total_docs
        metrics['example_coverage'] = docs_with_examples / total_docs
        metrics['parameter_coverage'] = docs_with_params / total_docs
        return metrics
 class ResearchPaperStrategy:
    """
    Strategy optimized for crawling research papers and academic content.
    Prioritizes citations, abstracts, and methodology sections.
    """
    def __init__(self):
        self.academic_keywords = {
            'abstract', 'introduction', 'methodology', 'results', 'conclusion',
            'references', 'citation', 'paper', 'study', 'research', 'analysis',
            'hypothesis', 'experiment', 'findings', 'doi'
        }
        self.citation_patterns = [
            r'\[\d+\]',  # [1] style citations
            r'\(\w+\s+\d{4}\)',  # (Author 2024) style
            r'doi:\s*\S+',  # DOI references
        ]
    def calculate_academic_relevance(self, content: str, query: str) -> float:
        """Calculate relevance score for academic content"""
        score = 0.0
        content_lower = content.lower()
        # Check for academic keywords
        keyword_matches = sum(1 for kw in self.academic_keywords if kw in content_lower)
        score += keyword_matches * 0.1
        # Check for citations
        citation_count = sum(
            len(re.findall(pattern, content)) 
            for pattern in self.citation_patterns
        )
        score += min(citation_count * 0.05, 1.0)  # Cap at 1.0
        # Check for query terms in academic context
        query_terms = query.lower().split()
        for term in query_terms:
            # Boost if term appears near academic keywords
            for keyword in ['abstract', 'conclusion', 'results']:
                if keyword in content_lower:
                    section = content_lower[content_lower.find(keyword):content_lower.find(keyword) + 500]
                    if term in section:
                        score += 0.2
        return min(score, 2.0)  # Cap total score
 async def demo_custom_strategies():
    """Demonstrate custom strategy usage"""
    # Example 1: API Documentation Strategy
    print("="*60)
    print("EXAMPLE 1: Custom API Documentation Strategy")
    print("="*60)
    api_strategy = APIDocumentationStrategy()
    async with AsyncWebCrawler() as crawler:
        # Standard adaptive crawler
        config = AdaptiveConfig(
            confidence_threshold=0.8,
            max_pages=15
        )
        adaptive = AdaptiveCrawler(crawler, config)
        # Override link scoring with custom strategy
        original_rank_links = adaptive._rank_links
        def custom_rank_links(links, query, state):
            # Apply custom scoring
            scored_links = []
            for link in links:
                base_score = api_strategy.score_link(link, query, state)
                scored_links.append((link, base_score))
            # Sort by score
            scored_links.sort(key=lambda x: x[1], reverse=True)
            return [link for link, _ in scored_links[:config.top_k_links]]
        adaptive._rank_links = custom_rank_links
        # Crawl API documentation
        print("\nCrawling API documentation with custom strategy...")
        state = await adaptive.digest(
            start_url="https://httpbin.org",
            query="api endpoints authentication headers"
        )
        # Calculate custom metrics
        api_metrics = api_strategy.calculate_api_coverage(state, "api endpoints")
        print(f"\nResults:")
        print(f"Pages crawled: {len(state.crawled_urls)}")
        print(f"Confidence: {adaptive.confidence:.2%}")
        print(f"\nAPI-Specific Metrics:")
        print(f"  - Endpoint coverage: {api_metrics['endpoint_coverage']:.2%}")
        print(f"  - Example coverage: {api_metrics['example_coverage']:.2%}")
        print(f"  - Parameter coverage: {api_metrics['parameter_coverage']:.2%}")
    # Example 2: Combined Strategy
    print("\n" + "="*60)
    print("EXAMPLE 2: Hybrid Strategy Combining Multiple Approaches")
    print("="*60)
    class HybridStrategy:
        """Combines multiple strategies with weights"""
        def __init__(self):
            self.api_strategy = APIDocumentationStrategy()
            self.research_strategy = ResearchPaperStrategy()
            self.weights = {
                'api': 0.7,
                'research': 0.3
            }
        def score_content(self, content: str, query: str) -> float:
            # Get scores from each strategy
            api_score = self._calculate_api_score(content, query)
            research_score = self.research_strategy.calculate_academic_relevance(content, query)
            # Weighted combination
            total_score = (
                api_score * self.weights['api'] +
                research_score * self.weights['research']
            )
            return total_score
        def _calculate_api_score(self, content: str, query: str) -> float:
            # Simplified API scoring based on keyword presence
            content_lower = content.lower()
            api_keywords = self.api_strategy.api_keywords
            keyword_count = sum(1 for kw in api_keywords if kw in content_lower)
            return min(keyword_count * 0.1, 2.0)
    hybrid_strategy = HybridStrategy()
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Crawl with hybrid scoring
        print("\nTesting hybrid strategy on technical documentation...")
        state = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await coroutines api"
        )
        # Analyze results with hybrid strategy
        print(f"\nHybrid Strategy Analysis:")
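        total_score = 0.0
        top_docs = adaptive.get_relevant_content(top_k=5)
        for doc in top_docs:
            content = doc['content'] or ""
            score = hybrid_strategy.score_content(content, "async await api")
            total_score += score
            print(f"  - {doc['url'][:50]}... Score: {score:.2f}")
        # Average over the pages actually returned, which may be fewer than 5
        average_score = total_score / len(top_docs) if top_docs else 0.0
        print(f"\nAverage hybrid score: {average_score:.2f}")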
 async def demo_performance_optimization():
    """Demonstrate performance optimization with custom strategies"""
    print("\n" + "="*60)
    print("EXAMPLE 3: Performance-Optimized Strategy")
    print("="*60)
    class PerformanceOptimizedStrategy:
        """Strategy that balances thoroughness with speed"""
        def __init__(self):
            self.url_cache: Set[str] = set()
            self.domain_scores: Dict[str, float] = {}
        def should_crawl_domain(self, url: str) -> bool:
            """Implement domain-level filtering"""
            domain = url.split('/')[2] if url.startswith('http') else url
            # Skip if we've already crawled many pages from this domain
            domain_count = sum(1 for cached in self.url_cache if domain in cached)
            if domain_count > 5:
                return False
            # Skip low-scoring domains
            if domain in self.domain_scores and self.domain_scores[domain] < 0.3:
                return False
            return True
        def update_domain_score(self, url: str, relevance: float):
            """Track domain-level performance"""
            domain = url.split('/')[2] if url.startswith('http') else url
            if domain not in self.domain_scores:
                self.domain_scores[domain] = relevance
            else:
                # Moving average
                self.domain_scores[domain] = (
                    0.7 * self.domain_scores[domain] + 0.3 * relevance
                )
    # Instantiated for illustration only; the crawl below does not call its methods
    perf_strategy = PerformanceOptimizedStrategy()
    async with AsyncWebCrawler() as crawler:
        config = AdaptiveConfig(
            confidence_threshold=0.7,
            max_pages=10,
            top_k_links=2  # Fewer links for speed
        )
        adaptive = AdaptiveCrawler(crawler, config)
        # Track performance
        import time
        start_time = time.time()
        state = await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers"
        )
        elapsed = time.time() - start_time
        pages = len(state.crawled_urls)
        # Guard against division by zero when no pages were crawled
        efficiency = adaptive.confidence / pages if pages else 0.0
        print("\nPerformance Results:")
        print(f"  - Time elapsed: {elapsed:.2f} seconds")
        print(f"  - Pages crawled: {pages}")
        print(f"  - Pages per second: {pages/elapsed:.2f}")
        print(f"  - Final confidence: {adaptive.confidence:.2%}")
        print(f"  - Efficiency: {efficiency:.2%} confidence per page")
 async def main():
    """Run all demonstrations"""
    try:
        await demo_custom_strategies()
        await demo_performance_optimization()
        print("\n" + "="*60)
        print("All custom strategy examples completed!")
        print("="*60)
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/embedding_configuration.py
+++ b/docs/examples/adaptive_crawling/embedding_configuration.py
@@ -0,0 +1,206 @@
 """
 Advanced Embedding Configuration Example
 This example demonstrates all configuration options available for the
 embedding strategy, including fine-tuning parameters for different use cases.
 """
 import asyncio
 import os
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def test_configuration(name: str, config: AdaptiveConfig, url: str, query: str):
    """Test a specific configuration"""
    print(f"\n{'='*60}")
    print(f"Configuration: {name}")
    print(f"{'='*60}")
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(start_url=url, query=query)
        print(f"Pages crawled: {len(result.crawled_urls)}")
        print(f"Final confidence: {adaptive.confidence:.1%}")
        print(f"Stopped reason: {result.metrics.get('stopped_reason', 'max_pages')}")
        if result.metrics.get('is_irrelevant', False):
            print("⚠️  Query detected as irrelevant!")
        return result
 async def main():
    """Demonstrate various embedding configurations"""
    print("EMBEDDING STRATEGY CONFIGURATION EXAMPLES")
    print("=" * 60)
    # Base URL and query for testing
    test_url = "https://docs.python.org/3/library/asyncio.html"
    # 1. Default Configuration
    config_default = AdaptiveConfig(
        strategy="embedding",
        max_pages=10
    )
    await test_configuration(
        "Default Settings",
        config_default,
        test_url,
        "async programming patterns"
    )
    # 2. Strict Coverage Requirements
    config_strict = AdaptiveConfig(
        strategy="embedding",
        max_pages=20,
        # Stricter similarity requirements
        embedding_k_exp=5.0,  # Default is 3.0, higher = stricter
        embedding_coverage_radius=0.15,  # Default is 0.2, lower = stricter
        # Higher validation threshold
        embedding_validation_min_score=0.6,  # Default is 0.3
        # More query variations for better coverage
        n_query_variations=15  # Default is 10
    )
    await test_configuration(
        "Strict Coverage (Research/Academic)",
        config_strict,
        test_url,
        "comprehensive guide async await"
    )
    # 3. Fast Exploration
    config_fast = AdaptiveConfig(
        strategy="embedding",
        max_pages=10,
        top_k_links=5,  # Follow more links per page
        # Relaxed requirements for faster convergence
        embedding_k_exp=1.0,  # Lower = more lenient
        embedding_min_relative_improvement=0.05,  # Stop earlier
        # Lower quality thresholds
        embedding_quality_min_confidence=0.5,  # Display lower confidence
        embedding_quality_max_confidence=0.85,
        # Fewer query variations for speed
        n_query_variations=5
    )
    await test_configuration(
        "Fast Exploration (Quick Overview)",
        config_fast,
        test_url,
        "async basics"
    )
    # 4. Irrelevance Detection Focus
    config_irrelevance = AdaptiveConfig(
        strategy="embedding",
        max_pages=5,
        # Aggressive irrelevance detection
        embedding_min_confidence_threshold=0.2,  # Higher threshold (default 0.1)
        embedding_k_exp=5.0,  # Strict similarity
        # Quick stopping for irrelevant content
        embedding_min_relative_improvement=0.15
    )
    await test_configuration(
        "Irrelevance Detection",
        config_irrelevance,
        test_url,
        "recipe for chocolate cake"  # Irrelevant query
    )
    # 5. High-Quality Knowledge Base
    config_quality = AdaptiveConfig(
        strategy="embedding",
        max_pages=30,
        # Deduplication settings
        embedding_overlap_threshold=0.75,  # More aggressive deduplication
        # Quality focus
        embedding_validation_min_score=0.5,
        embedding_quality_scale_factor=1.0,  # Linear quality mapping
        # Balanced parameters
        embedding_k_exp=3.0,
        embedding_nearest_weight=0.8,  # Focus on best matches
        embedding_top_k_weight=0.2
    )
    await test_configuration(
        "High-Quality Knowledge Base",
        config_quality,
        test_url,
        "asyncio advanced patterns best practices"
    )
    # 6. Custom Embedding Provider
    if os.getenv('OPENAI_API_KEY'):
        config_openai = AdaptiveConfig(
            strategy="embedding",
            max_pages=10,
            # Use OpenAI embeddings
            embedding_llm_config={
                'provider': 'openai/text-embedding-3-small',
                'api_token': os.getenv('OPENAI_API_KEY')
            },
            # OpenAI embeddings are high quality, can be stricter
            embedding_k_exp=4.0,
            n_query_variations=12
        )
        await test_configuration(
            "OpenAI Embeddings",
            config_openai,
            test_url,
            "event-driven architecture patterns"
        )
    # Parameter Guide
    print("\n" + "="*60)
    print("PARAMETER TUNING GUIDE")
    print("="*60)
    print("\n📊 Key Parameters and Their Effects:")
    print("\n1. embedding_k_exp (default: 3.0)")
    print("   - Lower (1-2): More lenient, faster convergence")
    print("   - Higher (4-5): Stricter, better precision")
    print("\n2. embedding_coverage_radius (default: 0.2)")
    print("   - Lower (0.1-0.15): Requires closer matches")
    print("   - Higher (0.25-0.3): Accepts broader matches")
    print("\n3. n_query_variations (default: 10)")
    print("   - Lower (5-7): Faster, less comprehensive")
    print("   - Higher (15-20): Better coverage, slower")
    print("\n4. embedding_min_confidence_threshold (default: 0.1)")
    print("   - Set to 0.15-0.2 for aggressive irrelevance detection")
    print("   - Set to 0.05 to crawl even barely relevant content")
    print("\n5. embedding_validation_min_score (default: 0.3)")
    print("   - Higher (0.5-0.6): Requires strong validation")
    print("   - Lower (0.2): More permissive stopping")
    print("\n💡 Tips:")
    print("- For research: High k_exp, more variations, strict validation")
    print("- For exploration: Low k_exp, fewer variations, relaxed thresholds")
    print("- For quality: Focus on overlap_threshold and validation scores")
    print("- For speed: Reduce variations, increase min_relative_improvement")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/embedding_strategy.py
+++ b/docs/examples/adaptive_crawling/embedding_strategy.py
@@ -0,0 +1,109 @@
 """
 Embedding Strategy Example for Adaptive Crawling
 This example demonstrates how to use the embedding-based strategy
 for semantic understanding and intelligent crawling.
 """
 import asyncio
 import os
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def main():
    """Demonstrate embedding strategy for adaptive crawling"""
    # Configure embedding strategy
    config = AdaptiveConfig(
        strategy="embedding",  # Use embedding strategy
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default model
        n_query_variations=10,  # Generate 10 semantic variations
        max_pages=15,
        top_k_links=3,
        min_gain_threshold=0.05,
        # Embedding-specific parameters
        embedding_k_exp=3.0,  # Higher = stricter similarity requirements
        embedding_min_confidence_threshold=0.1,  # Stop if <10% relevant
        embedding_validation_min_score=0.4  # Validation threshold
    )
    # Optional: Use OpenAI embeddings instead
    if os.getenv('OPENAI_API_KEY'):
        config.embedding_llm_config = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY')
        }
        print("Using OpenAI embeddings")
    else:
        print("Using sentence-transformers (local embeddings)")
    async with AsyncWebCrawler(verbose=True) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        # Test 1: Relevant query with semantic understanding
        print("\n" + "="*50)
        print("TEST 1: Semantic Query Understanding")
        print("="*50)
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="concurrent programming event-driven architecture"
        )
        print("\nQuery Expansion:")
        print(f"Original query expanded to {len(result.expanded_queries)} variations")
        for i, q in enumerate(result.expanded_queries[:3], 1):
            print(f"  {i}. {q}")
        print("  ...")
        print("\nResults:")
        adaptive.print_stats(detailed=False)
        # Test 2: Detecting irrelevant queries
        print("\n" + "="*50)
        print("TEST 2: Irrelevant Query Detection")
        print("="*50)
        # Reset crawler for new query
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="how to bake chocolate chip cookies"
        )
        if result.metrics.get('is_irrelevant', False):
            print("\n✅ Successfully detected irrelevant query!")
            print(f"Stopped after just {len(result.crawled_urls)} pages")
            print(f"Reason: {result.metrics.get('stopped_reason', 'unknown')}")
        else:
            print("\n❌ Failed to detect irrelevance")
        print(f"Final confidence: {adaptive.confidence:.1%}")
        # Test 3: Semantic gap analysis
        print("\n" + "="*50)
        print("TEST 3: Semantic Gap Analysis")
        print("="*50)
        # Show how embedding strategy identifies gaps
        adaptive = AdaptiveCrawler(crawler, config)
        result = await adaptive.digest(
            start_url="https://realpython.com",
            query="python decorators advanced patterns"
        )
        print(f"\nSemantic gaps identified: {len(result.semantic_gaps)}")
        print(f"Knowledge base embeddings shape: {result.kb_embeddings.shape if result.kb_embeddings is not None else 'None'}")
        # Show coverage metrics specific to embedding strategy
        print("\nEmbedding-specific metrics:")
        print(f"  Average best similarity: {result.metrics.get('avg_best_similarity', 0):.3f}")
        print(f"  Coverage score: {result.metrics.get('coverage_score', 0):.3f}")
        print(f"  Validation confidence: {result.metrics.get('validation_confidence', 0):.2%}")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/embedding_vs_statistical.py
+++ b/docs/examples/adaptive_crawling/embedding_vs_statistical.py
@@ -0,0 +1,167 @@
 """
 Comparison: Embedding vs Statistical Strategy
 This example demonstrates the differences between statistical and embedding
 strategies for adaptive crawling, showing when to use each approach.
 """
 import asyncio
 import time
 import os
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def crawl_with_strategy(url: str, query: str, strategy: str, **kwargs):
    """Helper function to crawl with a specific strategy"""
    config = AdaptiveConfig(
        strategy=strategy,
        max_pages=20,
        top_k_links=3,
        min_gain_threshold=0.05,
        **kwargs
    )
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        start_time = time.time()
        result = await adaptive.digest(start_url=url, query=query)
        elapsed = time.time() - start_time
        return {
            'result': result,
            'crawler': adaptive,
            'elapsed': elapsed,
            'pages': len(result.crawled_urls),
            'confidence': adaptive.confidence
        }
 async def main():
    """Compare embedding and statistical strategies"""
    # Test scenarios
    test_cases = [
        {
            'name': 'Technical Documentation (Specific Terms)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'asyncio.create_task event_loop.run_until_complete'
        },
        {
            'name': 'Conceptual Query (Semantic Understanding)',
            'url': 'https://docs.python.org/3/library/asyncio.html',
            'query': 'concurrent programming patterns'
        },
        {
            'name': 'Ambiguous Query',
            'url': 'https://realpython.com',
            'query': 'python performance optimization'
        }
    ]
    # Configure embedding strategy
    embedding_config = {}
    if os.getenv('OPENAI_API_KEY'):
        embedding_config['embedding_llm_config'] = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY')
        }
    for test in test_cases:
        print("\n" + "="*70)
        print(f"TEST: {test['name']}")
        print(f"URL: {test['url']}")
        print(f"Query: '{test['query']}'")
        print("="*70)
        # Run statistical strategy
        print("\n📊 Statistical Strategy:")
        stat_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'statistical'
        )
        print(f"  Pages crawled: {stat_result['pages']}")
        print(f"  Time taken: {stat_result['elapsed']:.2f}s")
        print(f"  Confidence: {stat_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
        # Show term coverage
        if hasattr(stat_result['result'], 'term_frequencies'):
            query_terms = test['query'].lower().split()
            covered = sum(1 for term in query_terms 
                         if term in stat_result['result'].term_frequencies)
            print(f"  Term coverage: {covered}/{len(query_terms)} query terms found")
        # Run embedding strategy
        print("\n🧠 Embedding Strategy:")
        emb_result = await crawl_with_strategy(
            test['url'], 
            test['query'], 
            'embedding',
            **embedding_config
        )
        print(f"  Pages crawled: {emb_result['pages']}")
        print(f"  Time taken: {emb_result['elapsed']:.2f}s")
        print(f"  Confidence: {emb_result['confidence']:.1%}")
        print(f"  Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
        # Show semantic understanding
        if emb_result['result'].expanded_queries:
            print(f"  Query variations: {len(emb_result['result'].expanded_queries)}")
            print(f"  Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
        # Compare results
        print("\n📈 Comparison:")
        efficiency_diff = ((stat_result['pages'] - emb_result['pages']) / 
                          stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
        print(f"  Efficiency: ", end="")
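        if efficiency_diff > 0:
            print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
        elif efficiency_diff < 0:
            # The percentage is relative to the statistical page count
            print(f"Embedding used {-efficiency_diff:.0f}% more pages")
        else:
            print("No measurable difference in page count")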
        print(f"  Speed: ", end="")
        if stat_result['elapsed'] < emb_result['elapsed']:
            print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
        else:
            print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
        print(f"  Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
        # Recommendation
        print("\n💡 Recommendation:")
        if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
            print("  → Statistical strategy is likely better for this use case (specific terms)")
        elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
            print("  → Embedding strategy is likely better for this use case (semantic understanding)")
        else:
            if emb_result['confidence'] > stat_result['confidence'] + 0.1:
                print("  → Embedding strategy achieved significantly better understanding")
            elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
                print("  → Statistical strategy is much faster with similar results")
            else:
                print("  → Both strategies performed similarly; choose based on your priorities")
    # Summary recommendations
    print("\n" + "="*70)
    print("STRATEGY SELECTION GUIDE")
    print("="*70)
    print("\n✅ Use STATISTICAL strategy when:")
    print("  - Queries contain specific technical terms")
    print("  - Speed is critical")
    print("  - No API access available")
    print("  - Working with well-structured documentation")
    print("\n✅ Use EMBEDDING strategy when:")
    print("  - Queries are conceptual or ambiguous")
    print("  - Semantic understanding is important")
    print("  - Need to detect irrelevant content")
    print("  - Working with diverse content sources")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/adaptive_crawling/export_import_kb.py
+++ b/docs/examples/adaptive_crawling/export_import_kb.py
@@ -0,0 +1,232 @@
 """
 Knowledge Base Export and Import
 This example demonstrates how to export crawled knowledge bases and
 import them for reuse, sharing, or analysis.
 """
 import asyncio
 import json
 from pathlib import Path
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def build_knowledge_base():
    """Build a knowledge base about web technologies"""
    print("="*60)
    print("PHASE 1: Building Knowledge Base")
    print("="*60)
    async with AsyncWebCrawler(verbose=False) as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Crawl information about HTTP
        print("\n1. Gathering HTTP protocol information...")
        await adaptive.digest(
            start_url="https://httpbin.org",
            query="http methods headers status codes"
        )
        print(f"   - Pages crawled: {len(adaptive.state.crawled_urls)}")
        print(f"   - Confidence: {adaptive.confidence:.2%}")
        # Add more information about APIs
        print("\n2. Adding API documentation knowledge...")
        await adaptive.digest(
            start_url="https://httpbin.org/anything",
            query="rest api json response request"
        )
        print(f"   - Total pages: {len(adaptive.state.crawled_urls)}")
        print(f"   - Confidence: {adaptive.confidence:.2%}")
        # Export the knowledge base
        export_path = "web_tech_knowledge.jsonl"
        print(f"\n3. Exporting knowledge base to {export_path}")
        adaptive.export_knowledge_base(export_path)
        # Show export statistics
        export_size = Path(export_path).stat().st_size / 1024
        with open(export_path, 'r') as f:
            line_count = sum(1 for _ in f)
        print(f"   - Exported {line_count} documents")
        print(f"   - File size: {export_size:.1f} KB")
        return export_path
 async def analyze_knowledge_base(kb_path):
    """Analyze the exported knowledge base"""
    print("\n" + "="*60)
    print("PHASE 2: Analyzing Exported Knowledge Base")
    print("="*60)
    # Read and analyze JSONL
    documents = []
    with open(kb_path, 'r') as f:
        for line in f:
            documents.append(json.loads(line))
    print(f"\nKnowledge base contains {len(documents)} documents:")
    # Analyze document properties
    total_content_length = 0
    urls_by_domain = {}
    for index, doc in enumerate(documents):
        # Content analysis
        content_length = len(doc.get('content', ''))
        total_content_length += content_length
        # URL analysis
        url = doc.get('url', '')
        domain = url.split('/')[2] if url.startswith('http') else 'unknown'
        urls_by_domain[domain] = urls_by_domain.get(domain, 0) + 1
        # Show the first document as a structure sample
        if index == 0:
            print(f"\nSample document structure:")
            print(f"  - URL: {url}")
            print(f"  - Content length: {content_length} chars")
            print(f"  - Has metadata: {'metadata' in doc}")
            print(f"  - Has links: {len(doc.get('links', []))} links")
            print(f"  - Query: {doc.get('query', 'N/A')}")
    print(f"\nContent statistics:")
    print(f"  - Total content: {total_content_length:,} characters")
    print(f"  - Average per document: {total_content_length / max(len(documents), 1):,.0f} chars")
    print(f"\nDomain distribution:")
    for domain, count in urls_by_domain.items():
        print(f"  - {domain}: {count} pages")
 async def import_and_continue():
    """Import a knowledge base and continue crawling"""
    print("\n" + "="*60)
    print("PHASE 3: Importing and Extending Knowledge Base")
    print("="*60)
    kb_path = "web_tech_knowledge.jsonl"
    async with AsyncWebCrawler(verbose=False) as crawler:
        # Create new adaptive crawler
        adaptive = AdaptiveCrawler(crawler)
        # Import existing knowledge base
        print(f"\n1. Importing knowledge base from {kb_path}")
        adaptive.import_knowledge_base(kb_path)
        print(f"   - Imported {len(adaptive.state.knowledge_base)} documents")
        print(f"   - Existing URLs: {len(adaptive.state.crawled_urls)}")
        # Check current state
        print("\n2. Checking imported knowledge state:")
        adaptive.print_stats(detailed=False)
        # Continue crawling with new query
        print("\n3. Extending knowledge with new query...")
        await adaptive.digest(
            start_url="https://httpbin.org/status/200",
            query="error handling retry timeout"
        )
        print("\n4. Final knowledge base state:")
        adaptive.print_stats(detailed=False)
        # Export extended knowledge base
        extended_path = "web_tech_knowledge_extended.jsonl"
        adaptive.export_knowledge_base(extended_path)
        print(f"\n5. Extended knowledge base exported to {extended_path}")
 async def share_knowledge_bases():
    """Demonstrate sharing knowledge bases between projects"""
    print("\n" + "="*60)
    print("PHASE 4: Sharing Knowledge Between Projects")
    print("="*60)
    # Simulate two different projects
    project_a_kb = "project_a_knowledge.jsonl"
    project_b_kb = "project_b_knowledge.jsonl"
    async with AsyncWebCrawler(verbose=False) as crawler:
        # Project A: Security documentation
        print("\n1. Project A: Building security knowledge...")
        crawler_a = AdaptiveCrawler(crawler)
        await crawler_a.digest(
            start_url="https://httpbin.org/basic-auth/user/pass",
            query="authentication security headers"
        )
        crawler_a.export_knowledge_base(project_a_kb)
        print(f"   - Exported {len(crawler_a.state.knowledge_base)} documents")
        # Project B: API testing
        print("\n2. Project B: Building testing knowledge...")
        crawler_b = AdaptiveCrawler(crawler)
        await crawler_b.digest(
            start_url="https://httpbin.org/anything",
            query="testing endpoints mocking"
        )
        crawler_b.export_knowledge_base(project_b_kb)
        print(f"   - Exported {len(crawler_b.state.knowledge_base)} documents")
        # Merge knowledge bases
        print("\n3. Merging knowledge bases...")
        merged_crawler = AdaptiveCrawler(crawler)
        # Import both knowledge bases
        merged_crawler.import_knowledge_base(project_a_kb)
        initial_size = len(merged_crawler.state.knowledge_base)
        merged_crawler.import_knowledge_base(project_b_kb)
        final_size = len(merged_crawler.state.knowledge_base)
        print(f"   - Project A documents: {initial_size}")
        print(f"   - Additional from Project B: {final_size - initial_size}")
        print(f"   - Total merged documents: {final_size}")
        # Export merged knowledge
        merged_kb = "merged_knowledge.jsonl"
        merged_crawler.export_knowledge_base(merged_kb)
        print(f"\n4. Merged knowledge base exported to {merged_kb}")
        # Show combined coverage
        print("\n5. Combined knowledge coverage:")
        merged_crawler.print_stats(detailed=False)
 async def main():
    """Run all examples"""
    try:
        # Build initial knowledge base
        kb_path = await build_knowledge_base()
        # Analyze the export
        await analyze_knowledge_base(kb_path)
        # Import and extend
        await import_and_continue()
        # Demonstrate sharing
        await share_knowledge_bases()
        print("\n" + "="*60)
        print("All examples completed successfully!")
        print("="*60)
    finally:
        # Clean up generated files
        print("\nCleaning up generated files...")
        for file in [
            "web_tech_knowledge.jsonl",
            "web_tech_knowledge_extended.jsonl", 
            "project_a_knowledge.jsonl",
            "project_b_knowledge.jsonl",
            "merged_knowledge.jsonl"
        ]:
            Path(file).unlink(missing_ok=True)
        print("Cleanup complete.")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/md_v2/advanced/adaptive-strategies.md
+++ b/docs/md_v2/advanced/adaptive-strategies.md
@@ -0,0 +1,432 @@
 # Advanced Adaptive Strategies
 ## Overview
 While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements.
 ## The Three-Layer Scoring System
 ### 1. Coverage Score
 Coverage measures how comprehensively your knowledge base covers the query terms and related concepts.
 #### Mathematical Foundation
 ```python
 Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q|
 where score(t, K) = doc_coverage(t) × (1 + freq_boost(t))
 ```
 #### Components
 - **Document Coverage**: Percentage of documents containing the term
 - **Frequency Boost**: Logarithmic bonus for term frequency
 - **Query Decomposition**: Handles multi-word queries intelligently
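The formula and components above can be sketched in a few lines. This is an illustration rather than the library's implementation: the function name `coverage_score`, the whitespace tokenization, the exact shape of the frequency boost, and the cap at 1.0 are all assumptions made for the example.

```python
import math
from collections import Counter


def coverage_score(documents: list[str], query: str) -> float:
    """Average per-term score over the query terms (illustrative only)."""
    query_terms = query.lower().split()
    tokenized = [Counter(doc.lower().split()) for doc in documents]
    total_tokens = sum(sum(counts.values()) for counts in tokenized)
    if not query_terms or total_tokens == 0:
        return 0.0

    total = 0.0
    for term in query_terms:
        # Document coverage: share of documents that contain the term.
        docs_with_term = sum(1 for counts in tokenized if term in counts)
        doc_coverage = docs_with_term / len(tokenized)
        # Assumed boost: logarithmic in the raw count, normalized to [0, 1].
        occurrences = sum(counts[term] for counts in tokenized)
        freq_boost = math.log1p(occurrences) / math.log1p(total_tokens)
        # Cap at 1.0 so one very frequent term cannot dominate the average.
        total += min(1.0, doc_coverage * (1 + freq_boost))
    return total / len(query_terms)
```

A term found in every document scores 1.0, a term found nowhere scores 0.0, and the final value is the mean across query terms.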
 #### Tuning Coverage
 ```python
 # For technical documentation with specific terminology
 config = AdaptiveConfig(
    confidence_threshold=0.85,  # Require high coverage
    top_k_links=5              # Cast wider net
 )
 # For general topics with synonyms
 config = AdaptiveConfig(
    confidence_threshold=0.6,   # Lower threshold
    top_k_links=2              # More focused
 )
 ```
 ### 2. Consistency Score
 Consistency evaluates whether the information across pages is coherent and non-contradictory.
 #### How It Works
 1. Extracts key statements from each document
 2. Compares statements across documents
 3. Measures agreement vs. contradiction
 4. Returns normalized score (0-1)
 #### Practical Impact
 - **High consistency (>0.8)**: Information is reliable and coherent
 - **Medium consistency (0.5-0.8)**: Some variation, but generally aligned
 - **Low consistency (<0.5)**: Conflicting information, need more sources
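If you want to act on these bands in code, a small helper is enough. The name `interpret_consistency` is hypothetical, and placing the exact boundary values 0.5 and 0.8 in the medium band is an assumption, since the ranges above do not say which side they belong to.

```python
def interpret_consistency(score: float) -> str:
    """Map a normalized consistency score to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("consistency score must be in [0, 1]")
    if score > 0.8:
        return "high"      # reliable and coherent
    if score >= 0.5:
        return "medium"    # some variation, generally aligned
    return "low"           # conflicting information, gather more sources
```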
 ### 3. Saturation Score
 Saturation detects when new pages stop providing novel information.
 #### Detection Algorithm
 ```python
 # Tracks new unique terms per page
 new_terms_page_1 = 50
 new_terms_page_2 = 30  # 60% of first
 new_terms_page_3 = 15  # 50% of second
 new_terms_page_4 = 5   # 33% of third
 # Saturation detected: rapidly diminishing returns
 ```
 #### Configuration
 ```python
 config = AdaptiveConfig(
    min_gain_threshold=0.1  # Stop if <10% new information
 )
 ```
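One way to picture how `min_gain_threshold` interacts with term discovery is the sketch below. It assumes that "new information" means the share of a page's unique terms that no earlier page contained; the real crawler may measure gain differently, and both function names are made up for this example.

```python
def new_term_ratios(pages: list[str]) -> list[float]:
    """For each page, the share of its unique terms not seen on earlier pages."""
    seen: set[str] = set()
    ratios: list[float] = []
    for text in pages:
        terms = set(text.lower().split())
        new_terms = terms - seen
        ratios.append(len(new_terms) / len(terms) if terms else 0.0)
        seen |= terms
    return ratios


def is_saturated(pages: list[str], min_gain_threshold: float = 0.1) -> bool:
    """True when the most recent page contributed less than the threshold."""
    ratios = new_term_ratios(pages)
    return bool(ratios) and ratios[-1] < min_gain_threshold
```

With the default of 0.1, a page whose vocabulary is more than 90% familiar marks the crawl as saturated.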
 ## Link Ranking Algorithm
 ### Expected Information Gain
 Each uncrawled link is scored based on:
 ```python
 ExpectedGain(link) = Relevance × Novelty × Authority
 ```
 #### 1. Relevance Scoring
 Uses BM25 algorithm on link preview text:
 ```python
 relevance = BM25(link.preview_text, query)
 ```
 Factors:
 - Term frequency in preview
 - Inverse document frequency
 - Preview length normalization
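For readers unfamiliar with BM25, here is a compact, dependency-free version of the standard Okapi formula applied to link previews. It is a sketch under assumptions: `bm25_scores` is a hypothetical helper, tokenization is plain whitespace splitting, and `k1=1.5`, `b=0.75` are common defaults, not values confirmed by this guide.

```python
import math
from collections import Counter


def bm25_scores(previews: list[str], query: str,
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each preview text against the query with Okapi BM25."""
    docs = [preview.lower().split() for preview in previews]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of previews that contain each term (for the IDF part).
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average previews are penalized via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

Each of the three factors listed above maps to one piece of the loop body: `tf[term]`, `idf`, and `length_norm`.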
 #### 2. Novelty Estimation
 Measures how different the link appears from already-crawled content:
 ```python
 novelty = 1 - max_similarity(preview, knowledge_base)
 ```
 Prevents crawling duplicate or highly similar pages.
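The `max_similarity` call above is not defined in this guide. A minimal stand-in, assuming Jaccard similarity over lowercase token sets (the actual similarity measure may differ), could look like this:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty(preview: str, knowledge_base: list[str]) -> float:
    """1 minus the highest similarity between the preview and any stored document."""
    if not knowledge_base:
        return 1.0  # nothing crawled yet, so everything is novel
    preview_terms = set(preview.lower().split())
    max_similarity = max(
        jaccard(preview_terms, set(doc.lower().split())) for doc in knowledge_base
    )
    return 1.0 - max_similarity
```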
 #### 3. Authority Calculation
 URL structure and domain analysis:
 ```python
 authority = f(domain_rank, url_depth, url_structure)
 ```
 Factors:
 - Domain reputation
 - URL depth (fewer slashes = higher authority)
 - Clean URL structure
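The authority function `f` is left abstract above. The sketch below covers only the two URL-based factors and ignores domain reputation, which needs external data. The `1 / (1 + depth)` decay and the 0.5 penalty for query strings are illustrative choices, and `url_authority` and `expected_gain` are hypothetical names.

```python
from urllib.parse import urlparse


def url_authority(url: str) -> float:
    """Heuristic authority in (0, 1]: shallow, clean URLs score higher."""
    parsed = urlparse(url)
    depth = len([segment for segment in parsed.path.split("/") if segment])
    score = 1.0 / (1 + depth)   # fewer path segments, higher authority
    if parsed.query:            # assumed penalty for query-string URLs
        score *= 0.5
    return score


def expected_gain(relevance: float, novelty: float, authority: float) -> float:
    """Product of the three factors, as in the ExpectedGain formula above."""
    return relevance * novelty * authority
```

Because the factors are multiplied, a link that scores zero on any one of them is never followed, regardless of the other two.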
 ### Custom Link Scoring
 ```python
 class CustomLinkScorer:
    def score(self, link: Link, query: str, state: CrawlState) -> float:
        # Prioritize specific URL patterns
        if "/api/reference/" in link.href:
            return 2.0  # Double the score
        # Deprioritize certain sections
        if "/archive/" in link.href:
            return 0.1  # Reduce score by 90%
        # Default scoring
        return 1.0
 # Use with adaptive crawler
 adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    link_scorer=CustomLinkScorer()
 )
 ```
 ## Domain-Specific Configurations
 ### Technical Documentation
 ```python
 tech_doc_config = AdaptiveConfig(
    confidence_threshold=0.85,
    max_pages=30,
    top_k_links=3,
    min_gain_threshold=0.05  # Keep crawling for small gains
 )
 ```
 Rationale:
 - High threshold ensures comprehensive coverage
 - Lower gain threshold captures edge cases
 - Moderate link following for depth
 ### News & Articles
 ```python
 news_config = AdaptiveConfig(
    confidence_threshold=0.6,
    max_pages=10,
    top_k_links=5,
    min_gain_threshold=0.15  # Stop quickly on repetition
 )
 ```
 Rationale:
 - Lower threshold (articles often repeat information)
 - Higher gain threshold (avoid duplicate stories)
 - More links per page (explore different perspectives)
 ### E-commerce
 ```python
 ecommerce_config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=2,
    min_gain_threshold=0.1
 )
 ```
 Rationale:
 - Balanced threshold for product variations
 - Focused link following (avoid infinite products)
 - Standard gain threshold
 ### Research & Academic
 ```python
 research_config = AdaptiveConfig(
    confidence_threshold=0.9,
    max_pages=50,
    top_k_links=4,
    min_gain_threshold=0.02  # Very low - capture citations
 )
 ```
 Rationale:
 - Very high threshold for completeness
 - Many pages allowed for thorough research
 - Very low gain threshold to capture references
 ## Performance Optimization
 ### Memory Management
 ```python
 # For large crawls, persist crawl state to disk
 config = AdaptiveConfig(
    max_pages=100,
    save_state=True,
    state_path="large_crawl.json"
 )
 # Periodically clean state
 if len(state.knowledge_base) > 1000:
    # Keep only most relevant
    state.knowledge_base = get_top_relevant(state.knowledge_base, 500)
 ```
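`get_top_relevant` is not defined in this guide. One possible shape is sketched below, assuming each knowledge base entry is a dict with a `content` key (as in the exported JSONL) and that relevance is a simple count of query-term hits. Unlike the two-argument call above, this version takes the query explicitly.

```python
import heapq


def get_top_relevant(documents: list[dict], limit: int, query: str) -> list[dict]:
    """Keep the `limit` documents with the most query-term occurrences."""
    query_terms = set(query.lower().split())

    def term_hits(doc: dict) -> int:
        tokens = doc.get("content", "").lower().split()
        return sum(1 for token in tokens if token in query_terms)

    # nlargest avoids sorting the whole list when limit is small.
    return heapq.nlargest(limit, documents, key=term_hits)
```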
 ### Parallel Processing
 ```python
 # Use multiple start points
 start_urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api",
    "https://docs.example.com/guides"
 ]
 # Crawl in parallel
 tasks = [
    adaptive.digest(url, query)
    for url in start_urls
 ]
 results = await asyncio.gather(*tasks)
 ```
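The snippet above launches every crawl at once and shares one `adaptive` instance; whether concurrent `digest()` calls on a single instance are safe is not stated in this guide. The sketch below bounds concurrency with a semaphore and leaves instance creation to the caller. `digest_all` and `digest_one` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def digest_all(
    start_urls: list[str],
    digest_one: Callable[[str], Awaitable[object]],
    max_concurrency: int = 3,
) -> list[object]:
    """Run digest_one for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> object:
        async with semaphore:
            return await digest_one(url)

    # return_exceptions=True returns failures in place instead of raising
    # the first one, so results line up with start_urls.
    return await asyncio.gather(
        *(bounded(url) for url in start_urls), return_exceptions=True
    )
```

A caller could pass something like `lambda url: AdaptiveCrawler(crawler, config).digest(start_url=url, query=query)` as `digest_one`, so that each start URL gets its own crawl state.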
 ### Caching Strategy
 ```python
 # Enable caching for repeated crawls
 async with AsyncWebCrawler(
    config=BrowserConfig(
        cache_mode=CacheMode.ENABLED
    )
 ) as crawler:
    adaptive = AdaptiveCrawler(crawler, config)
 ```
 ## Debugging & Analysis
 ### Enable Verbose Logging
 ```python
 import logging
 logging.basicConfig(level=logging.DEBUG)
 adaptive = AdaptiveCrawler(crawler, config, verbose=True)
 ```
 ### Analyze Crawl Patterns
 ```python
 # After crawling
 state = await adaptive.digest(start_url, query)
 # Analyze link selection
 print("Link selection order:")
 for i, url in enumerate(state.crawl_order):
    print(f"{i+1}. {url}")
 # Analyze term discovery
 print("\nTerm discovery rate:")
 for i, new_terms in enumerate(state.new_terms_history):
    print(f"Page {i+1}: {new_terms} new terms")
 # Analyze score progression
 print("\nScore progression:")
 print(f"Coverage: {state.metrics['coverage_history']}")
 print(f"Saturation: {state.metrics['saturation_history']}")
 ```
 ### Export for Analysis
 ```python
 # Export detailed metrics
 import json
 metrics = {
    "query": query,
    "total_pages": len(state.crawled_urls),
    "confidence": adaptive.confidence,
    "coverage_stats": adaptive.coverage_stats,
    "crawl_order": state.crawl_order,
    "term_frequencies": dict(state.term_frequencies),
    "new_terms_history": state.new_terms_history
 }
 with open("crawl_analysis.json", "w") as f:
    json.dump(metrics, f, indent=2)
 ```
 ## Custom Strategies
 ### Implementing a Custom Strategy
 ```python
 from crawl4ai.adaptive_crawler import BaseStrategy
 class DomainSpecificStrategy(BaseStrategy):
    def calculate_coverage(self, state: CrawlState) -> float:
        # Custom coverage calculation
        # e.g., weight certain terms more heavily
        raise NotImplementedError
    def calculate_consistency(self, state: CrawlState) -> float:
        # Custom consistency logic
        # e.g., domain-specific validation
        raise NotImplementedError
    def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]:
        # Custom link ranking
        # e.g., prioritize specific URL patterns
        raise NotImplementedError
 # Use custom strategy
 adaptive = AdaptiveCrawler(
    crawler,
    config=config,
    strategy=DomainSpecificStrategy()
 )
 ```
 ### Combining Strategies
 ```python
 class HybridStrategy(BaseStrategy):
    def __init__(self):
        self.strategies = [
            TechnicalDocStrategy(),
            SemanticSimilarityStrategy(),
            URLPatternStrategy()
        ]
    def calculate_confidence(self, state: CrawlState) -> float:
        # Weighted combination of strategies
        scores = [s.calculate_confidence(state) for s in self.strategies]
        weights = [0.5, 0.3, 0.2]
        return sum(s * w for s, w in zip(scores, weights))
 ```
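Two details of the weighted sum are easy to get wrong: `zip` silently truncates when the lists differ in length, and weights that do not sum to 1 push the result outside the 0-1 range. A hedged helper that guards against both (the name `combine_confidence` is an assumption, not part of the library):

```python
from typing import Sequence


def combine_confidence(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-strategy confidence scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total keeps the result in [0, 1] for scores in [0, 1].
    return sum(s * w for s, w in zip(scores, weights)) / total_weight
```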
 ## Best Practices
 ### 1. Start Conservative
 Begin with default settings and adjust based on results:
 ```python
 # Start with defaults
 result = await adaptive.digest(url, query)
 # Analyze and adjust
 if adaptive.confidence < 0.7:
    config.max_pages += 10
    config.confidence_threshold -= 0.1
 ```
 ### 2. Monitor Resource Usage
 ```python
 import psutil
 # Check memory before large crawls
 memory_percent = psutil.virtual_memory().percent
 if memory_percent > 80:
    config.max_pages = min(config.max_pages, 20)
 ```
 ### 3. Use Domain Knowledge
 ```python
 # For API documentation
 if "api" in start_url:
    config.top_k_links = 2  # APIs have clear structure
 # For blogs
 if "blog" in start_url:
    config.min_gain_threshold = 0.2  # Avoid similar posts
 ```
 ### 4. Validate Results
 ```python
 # Always validate the knowledge base
 relevant_content = adaptive.get_relevant_content(top_k=10)
 # Check coverage
 query_terms = set(query.lower().split())
 covered_terms = set()
 for doc in relevant_content:
    content_lower = doc['content'].lower()
    for term in query_terms:
        if term in content_lower:
            covered_terms.add(term)
 coverage_ratio = len(covered_terms) / len(query_terms) if query_terms else 0.0
 print(f"Query term coverage: {coverage_ratio:.0%}")
 ```
 ## Next Steps
 - Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md)
 - Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md)
 - See [Performance Benchmarks](../benchmarks/adaptive-performance.md)
--- a/docs/md_v2/api/adaptive-crawler.md
+++ b/docs/md_v2/api/adaptive-crawler.md
@@ -0,0 +1,244 @@
 # AdaptiveCrawler
 The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
 ## Constructor
 ```python
 AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
 )
 ```
 ### Parameters
 - **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
 - **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
 ## Primary Method
 ### digest()
 The main method that performs adaptive crawling starting from a URL with a specific query.
 ```python
 async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
 ) -> CrawlState
 ```
 #### Parameters
 - **start_url** (`str`): The starting URL for crawling
 - **query** (`str`): The search query that guides the crawling process
 - **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from
 #### Returns
 - **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics
 #### Example
 ```python
 async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
 ```
 ## Properties
 ### confidence
 Current confidence score (0-1) indicating information sufficiency.
 ```python
@property
 def confidence(self) -> float
 ```
 ### coverage_stats
 Dictionary containing detailed coverage statistics.
 ```python
@property
 def coverage_stats(self) -> Dict[str, float]
 ```
 Returns:
 - **coverage**: Query term coverage score
 - **consistency**: Information consistency score  
 - **saturation**: Content saturation score
 - **confidence**: Overall confidence score
 ### is_sufficient
 Boolean indicating whether sufficient information has been gathered.
 ```python
@property
 def is_sufficient(self) -> bool
 ```
 ### state
 Access to the current crawl state.
 ```python
@property
 def state(self) -> CrawlState
 ```
 ## Methods
 ### get_relevant_content()
 Retrieve the most relevant content from the knowledge base.
 ```python
 def get_relevant_content(
    self,
    top_k: int = 5
 ) -> List[Dict[str, Any]]
 ```
 #### Parameters
 - **top_k** (`int`): Number of top relevant documents to return (default: 5)
 #### Returns
 List of dictionaries containing:
 - **url**: The URL of the page
 - **content**: The page content
 - **score**: Relevance score
 - **metadata**: Additional page metadata
 ### print_stats()
 Display crawl statistics in formatted output.
 ```python
 def print_stats(
    self,
    detailed: bool = False
 ) -> None
 ```
 #### Parameters
 - **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table.
 ### export_knowledge_base()
 Export the collected knowledge base to a JSONL file.
 ```python
 def export_knowledge_base(
    self,
    path: Union[str, Path]
 ) -> None
 ```
 #### Parameters
 - **path** (`Union[str, Path]`): Output file path for JSONL export
 #### Example
 ```python
 adaptive.export_knowledge_base("my_knowledge.jsonl")
 ```
 ### import_knowledge_base()
 Import a previously exported knowledge base.
 ```python
 def import_knowledge_base(
    self,
    path: Union[str, Path]
 ) -> None
 ```
 #### Parameters
 - **path** (`Union[str, Path]`): Path to JSONL file to import
 ## Configuration
 The `AdaptiveConfig` class controls the behavior of adaptive crawling:
 ```python
@dataclass
 class AdaptiveConfig:
    confidence_threshold: float = 0.8      # Stop when confidence reaches this
    max_pages: int = 50                    # Maximum pages to crawl
    top_k_links: int = 5                   # Links to follow per page
    min_gain_threshold: float = 0.1        # Minimum expected gain to continue
    save_state: bool = False               # Auto-save crawl state
    state_path: Optional[str] = None       # Path for state persistence
 ```
 ### Example with Custom Config
 ```python
 config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
 )
 adaptive = AdaptiveCrawler(crawler, config=config)
 ```
 ## Complete Example
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )
        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()
        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")
        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ## See Also
 - [digest() Method Reference](digest.md)
 - [Adaptive Crawling Guide](../core/adaptive-crawling.md)
 - [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
--- a/docs/md_v2/api/digest.md
+++ b/docs/md_v2/api/digest.md
@@ -0,0 +1,181 @@
 # digest()
 The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered.
 ## Method Signature
 ```python
 async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
 ) -> CrawlState
 ```
 ## Parameters
 ### start_url
 - **Type**: `str`
 - **Required**: Yes
 - **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering.
 ### query
 - **Type**: `str`  
 - **Required**: Yes
 - **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow.
 ### resume_from
 - **Type**: `Optional[Union[str, Path]]`
 - **Default**: `None`
 - **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh.
 ## Return Value
 Returns a `CrawlState` object containing:
 - **crawled_urls** (`Set[str]`): All URLs that have been crawled
 - **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content
 - **pending_links** (`List[Link]`): Links discovered but not yet crawled
 - **metrics** (`Dict[str, float]`): Performance and quality metrics
 - **query** (`str`): The original query
 - Additional statistical information for scoring
 ## How It Works
 The `digest()` method implements an intelligent crawling algorithm:
 1. **Initial Crawl**: Starts from the provided URL
 2. **Link Analysis**: Evaluates all discovered links for relevance
 3. **Scoring**: Uses three metrics to assess information sufficiency:
   - **Coverage**: How well the query terms are covered
   - **Consistency**: Information coherence across pages
   - **Saturation**: Diminishing returns detection
 4. **Adaptive Selection**: Chooses the most promising links to follow
 5. **Stopping Decision**: Automatically stops when confidence threshold is reached
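The five steps can be simulated without any network access. The toy below is not how `digest()` is implemented: it uses an in-memory `site` dict instead of real pages, treats confidence as plain query-term coverage (consistency and saturation are omitted), and ranks links by query terms appearing in the URL. All names are illustrative.

```python
def adaptive_crawl(
    site: dict[str, tuple[str, list[str]]],
    start_url: str,
    query: str,
    confidence_threshold: float = 0.8,
    max_pages: int = 10,
    top_k_links: int = 2,
) -> tuple[list[str], float]:
    """Crawl an in-memory site until the query terms are covered."""
    query_terms = set(query.lower().split())
    crawled: list[str] = []
    covered: set[str] = set()
    confidence = 0.0
    frontier = [start_url]

    while frontier and len(crawled) < max_pages:
        url = frontier.pop(0)
        if url in crawled or url not in site:
            continue
        text, links = site[url]                      # steps 1-2: fetch page, discover links
        crawled.append(url)
        covered |= query_terms & set(text.lower().split())
        confidence = len(covered) / len(query_terms) if query_terms else 1.0  # step 3
        if confidence >= confidence_threshold:       # step 5: stopping decision
            break
        # Step 4: keep only the most promising unseen links.
        candidates = [l for l in links if l not in crawled and l not in frontier]
        candidates.sort(key=lambda l: sum(t in l.lower() for t in query_terms), reverse=True)
        frontier = candidates[:top_k_links] + frontier

    return crawled, confidence
```

The loop ends early as soon as coverage reaches the threshold, so pages that would add nothing to the query are never fetched.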
 ## Examples
 ### Basic Usage
 ```python
 async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org/3/",
        query="async await context managers"
    )
    print(f"Crawled {len(state.crawled_urls)} pages")
    print(f"Confidence: {adaptive.confidence:.0%}")
 ```
 ### With Configuration
 ```python
 config = AdaptiveConfig(
    confidence_threshold=0.9,  # Require high confidence
    max_pages=30,             # Allow more pages
    top_k_links=3             # Follow top 3 links per page
 )
 adaptive = AdaptiveCrawler(crawler, config=config)
 state = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
 )
 ```
 ### Resuming a Previous Crawl
 ```python
 # First crawl - may be interrupted
 state1 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms"
 )
 # Save state (if not auto-saved)
 state1.save("ml_crawl_state.json")
 # Later, resume from saved state
 state2 = await adaptive.digest(
    start_url="https://example.com",
    query="machine learning algorithms",
    resume_from="ml_crawl_state.json"
 )
 ```
 ### With Progress Monitoring
 ```python
 state = await adaptive.digest(
    start_url="https://docs.example.com",
    query="api reference"
 )
 # Inspect the results once the crawl has finished
 print(f"Pages crawled: {len(state.crawled_urls)}")
 print(f"New terms discovered: {state.new_terms_history}")
 print(f"Final confidence: {adaptive.confidence:.2%}")
 # View detailed statistics
 adaptive.print_stats(detailed=True)
 ```
 ## Query Best Practices
 1. **Be Specific**: Use descriptive terms that appear in target content
   ```python
   # Good
   query = "python async context managers implementation"
   # Too broad
   query = "python programming"
   ```
 2. **Include Key Terms**: Add technical terms you expect to find
   ```python
   query = "oauth2 jwt refresh tokens authorization"
   ```
 3. **Multiple Concepts**: Combine related concepts for comprehensive coverage
   ```python
   query = "rest api pagination sorting filtering"
   ```
 ## Performance Considerations
 - **Initial URL**: Choose a page with good navigation (e.g., documentation index)
 - **Query Length**: 3-8 terms typically work best
 - **Link Density**: Sites with clear navigation crawl more efficiently
 - **Caching**: Enable caching for repeated crawls of the same domain
 ## Error Handling
 ```python
 try:
    state = await adaptive.digest(
        start_url="https://example.com",
        query="search terms"
    )
 except Exception as e:
    print(f"Crawl failed: {e}")
    # State is auto-saved if save_state=True in config
 ```
 ## Stopping Conditions
 The crawl stops when any of these conditions are met:
 1. **Confidence Threshold**: Reached the configured confidence level
 2. **Page Limit**: Crawled the maximum number of pages
 3. **Diminishing Returns**: Expected information gain below threshold
 4. **No Relevant Links**: No promising links remain to follow
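The four conditions can be expressed as a single check. The sketch below is an assumption about how they might be combined: the order of evaluation and the name `stop_reason` are not taken from the library, and the defaults mirror the `AdaptiveConfig` values shown in the class reference.

```python
from typing import Optional


def stop_reason(
    confidence: float,
    pages_crawled: int,
    best_expected_gain: float,
    pending_links: int,
    *,
    confidence_threshold: float = 0.8,
    max_pages: int = 50,
    min_gain_threshold: float = 0.1,
) -> Optional[str]:
    """Return the first stopping condition that applies, or None to continue."""
    if confidence >= confidence_threshold:
        return "confidence_threshold"
    if pages_crawled >= max_pages:
        return "page_limit"
    if pending_links == 0:
        return "no_relevant_links"
    if best_expected_gain < min_gain_threshold:
        return "diminishing_returns"
    return None
```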
 ## See Also
 - [AdaptiveCrawler Class](adaptive-crawler.md)
 - [Adaptive Crawling Guide](../core/adaptive-crawling.md)
 - [Configuration Options](../core/adaptive-crawling.md#configuration-options)
--- a/docs/md_v2/blog/articles/adaptive-crawling-revolution.md
+++ b/docs/md_v2/blog/articles/adaptive-crawling-revolution.md
@@ -0,0 +1,369 @@
 # Adaptive Crawling: Building Dynamic Knowledge That Grows on Demand
 *Published on January 29, 2025 • 8 min read*
 *By [unclecode](https://x.com/unclecode) • Follow me on [X/Twitter](https://x.com/unclecode) for more web scraping insights*
 ---
 ## The Knowledge Capacitor
 Imagine a capacitor that stores energy, releasing it precisely when needed. Now imagine that for information. That's Adaptive Crawling—a term I coined to describe a fundamentally different approach to web crawling. Instead of the brute force of traditional deep crawling, we build knowledge dynamically, growing it based on queries and circumstances, like a living organism responding to its environment.
 This isn't just another crawling optimization. It's a paradigm shift from "crawl everything, hope for the best" to "crawl intelligently, know when to stop."
 ## Why I Built This
 I've watched too many startups burn through resources with a dangerous misconception: that LLMs make everything efficient. They don't. They make things *possible*, not necessarily *smart*. When you combine brute-force crawling with LLM processing, you're not just wasting time—you're hemorrhaging money on tokens, compute, and opportunity cost.
 Consider this reality:
 - **Traditional deep crawling**: 500 pages → 50 useful → $15 in LLM tokens → 2 hours wasted
 - **Adaptive crawling**: 15 pages → 14 useful → $2 in tokens → 10 minutes → **7.5x cost reduction**
 But it's not about crawling less. It's about crawling *right*.
 ## The Information Theory Foundation
 <div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
 ### 🧮 **Pure Statistics, No Magic**
 My first principle was crucial: start with classic statistical approaches. No embeddings. No LLMs. Just pure information theory:
 ```python
 # Information gain calculation - the heart of adaptive crawling
 def calculate_information_gain(new_page, knowledge_base):
    new_terms = extract_terms(new_page) - existing_terms(knowledge_base)
    overlap = calculate_overlap(new_page, knowledge_base)
    # High gain = many new terms + low overlap
    gain = len(new_terms) / (1 + overlap)
    return gain
 ```
 This isn't regression to older methods—it's recognition that we've forgotten powerful, efficient solutions in our rush to apply LLMs everywhere.
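The snippet above leans on helpers I did not spell out. A self-contained version, assuming whitespace tokenization and defining overlap as the number of terms the new page shares with the existing knowledge (both are simplifications for illustration), might look like this:

```python
def extract_terms(text: str) -> set[str]:
    """Lowercase whitespace tokens; a real system would also strip stopwords."""
    return set(text.lower().split())


def calculate_information_gain(new_page: str, knowledge_base: list[str]) -> float:
    """Many new terms and little overlap yield a high gain."""
    page_terms = extract_terms(new_page)
    known_terms = set().union(*(extract_terms(doc) for doc in knowledge_base))
    new_terms = page_terms - known_terms
    overlap = len(page_terms & known_terms)
    return len(new_terms) / (1 + overlap)
```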
 </div>
 ## The A* of Web Crawling
 Adaptive crawling implements what I call "information scenting"—like A* pathfinding but for knowledge acquisition. Each link is evaluated not randomly, but by its probability of contributing meaningful information toward answering current and future queries.
 <div style="display: flex; align-items: center; background-color: #3f3f44; padding: 20px; margin: 20px 0; border-left: 4px solid #09b5a5;">
 <div style="font-size: 48px; margin-right: 20px;">🎯</div>
 <div>
 <strong>The Scenting Algorithm:</strong><br>
 From available links, we select those with highest information gain. It's not about following every path—it's about following the <em>right</em> paths. Like a bloodhound following the strongest scent to its target.
 </div>
 </div>
 ## The Three Pillars of Intelligence
 ### 1. Coverage: The Breadth Sensor
 Measures how well your knowledge spans the query space. Not just "do we have pages?" but "do we have the RIGHT pages?"
 ### 2. Consistency: The Coherence Detector  
 Information from multiple sources should align. When pages agree, confidence rises. When they conflict, we need more data.
 ### 3. Saturation: The Efficiency Guardian
 The most crucial metric. When new pages stop adding information, we stop crawling. Simple. Powerful. Ignored by everyone else.
 ## Real Impact: Time, Money, and Sanity
 Let me show you what this means for your bottom line:
 ### Building a Customer Support Knowledge Base
 **Traditional Approach:**
 ```python
 # Crawl entire documentation site
 results = await crawler.crawl_bfs("https://docs.company.com", max_depth=5)
 # Result: 1,200 pages, 18 hours, $150 in API costs
 # Useful content: ~100 pages scattered throughout
 ```
 **Adaptive Approach:**
 ```python
 # Grow knowledge based on actual support queries
 knowledge = await adaptive.digest(
    start_url="https://docs.company.com",
    query="payment processing errors refund policies"
 )
 # Result: 45 pages, 12 minutes, $8 in API costs
 # Useful content: 42 pages, all relevant
 ```
 **Savings: 99% time reduction, 95% cost reduction, 100% more sanity**
 ## The Dynamic Growth Pattern
 <div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
 <div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
 Knowledge grows like crystals in a supersaturated solution
 </div>
 <div style="color: #a3abba;">
 Add a query (seed), and relevant information crystallizes around it.<br>
 Change the query, and the knowledge structure adapts.
 </div>
 </div>
 This is the beauty of adaptive crawling: your knowledge base becomes a living entity that grows based on actual needs, not hypothetical completeness.
 ## Why "Adaptive"? 
 I specifically chose "Adaptive" because it captures the essence: the system adapts to what it finds. Dense technical documentation might need 20 pages for confidence. A simple FAQ might need just 5. The crawler doesn't follow a recipe—it reads the room and adjusts.
 This is my term, my concept, and I have extensive plans for its evolution.
 ## The Progressive Roadmap
 This is just the beginning. My roadmap for Adaptive Crawling:
 ### Phase 1 (Current): Statistical Foundation
 - Pure information theory approach
 - No dependencies on expensive models
 - Proven efficiency gains
 ### Phase 2 (Now Available): Embedding Enhancement
 - Semantic understanding layered onto statistical base
 - Still efficient, now even smarter
 - Optional, not required
 ### Phase 3 (Future): LLM Integration
 - LLMs for complex reasoning tasks only
 - Used surgically, not wastefully
 - Always with statistical foundation underneath
 ## The Efficiency Revolution
 <div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
 ### 💰 **The Economics of Intelligence**
 For a typical SaaS documentation crawl:
 **Traditional Deep Crawling:**
 - Pages crawled: 1,000
 - Useful pages: 80
 - Time spent: 3 hours
 - LLM tokens used: 2.5M
 - Cost: $75
 - Efficiency: 8%
 **Adaptive Crawling:**
 - Pages crawled: 95
 - Useful pages: 88
 - Time spent: 15 minutes
 - LLM tokens used: 200K
 - Cost: $6
 - Efficiency: 93%
 **That's not optimization. That's transformation.**
 </div>
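The percentages in the box follow directly from the listed figures. A quick check, using only the numbers quoted above (the two helper names are mine):

```python
def efficiency(useful_pages: int, pages_crawled: int) -> float:
    """Share of crawled pages that turned out to be useful."""
    return useful_pages / pages_crawled

def reduction(before: float, after: float) -> float:
    """Relative saving when going from `before` to `after`."""
    return (before - after) / before

print(f"Traditional efficiency: {efficiency(80, 1000):.0%}")
print(f"Adaptive efficiency:    {efficiency(88, 95):.0%}")
print(f"Time reduction:  {reduction(180, 15):.0%}")            # minutes: 3 hours vs 15 minutes
print(f"Token reduction: {reduction(2_500_000, 200_000):.0%}")
print(f"Cost reduction:  {reduction(75, 6):.0%}")               # dollars
```

With these inputs, efficiency comes out at 8% versus 93%, and time, tokens, and cost each drop by roughly 92%.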
 ## Missing the Forest for the Trees
 The startup world has a dangerous blind spot. We're so enamored with LLMs that we forget: just because you CAN process everything with an LLM doesn't mean you SHOULD. 
 Classic NLP and statistical methods can:
 - Filter irrelevant content before it reaches LLMs
 - Identify patterns without expensive inference
 - Make intelligent decisions in microseconds
 - Scale without breaking the bank
 Adaptive crawling proves this. It uses battle-tested information theory to make smart decisions BEFORE expensive processing.
 ## Your Knowledge, On Demand
 ```python
 # Monday: Customer asks about authentication
 auth_knowledge = await adaptive.digest(
    "https://docs.api.com",
    "oauth jwt authentication"
 )
 # Tuesday: They ask about rate limiting
 # The crawler adapts, builds on existing knowledge
 rate_limit_knowledge = await adaptive.digest(
    "https://docs.api.com", 
    "rate limiting throttling quotas"
 )
 # Your knowledge base grows intelligently, not indiscriminately
 ```
 ## The Competitive Edge
 Companies using adaptive crawling will have:
 - **90% lower crawling costs**
 - **Knowledge bases that actually answer questions**
 - **Update cycles in minutes, not days**
 - **Happy customers who find answers fast**
 - **Engineers who sleep at night**
 Those still using brute force? They'll wonder why their infrastructure costs keep rising while their customers keep complaining.
 ## The Embedding Evolution (Now Available!)
 <div style="background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px; margin: 20px 0;">
 ### 🧠 **Semantic Understanding Without the Cost**
 The embedding strategy brings semantic intelligence while maintaining efficiency:
 ```python
 # Statistical strategy - great for exact terms
 config_statistical = AdaptiveConfig(
    strategy="statistical"  # Default
 )
 # Embedding strategy - understands concepts
 config_embedding = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    n_query_variations=10
 )
 ```
 **The magic**: It automatically expands your query into semantic variations, maps the coverage space, and identifies gaps to fill intelligently.
 </div>
 ### Real-World Comparison
 <div style="display: flex; gap: 20px; margin: 20px 0;">
 <div style="flex: 1; background-color: #1a1a1c; border: 1px solid #3f3f44; padding: 20px;">
 **Query**: "authentication oauth"
 **Statistical Strategy**:
 - Searches for exact terms
 - 12 pages crawled
 - 78% confidence
 - Fast but literal
 </div>
 <div style="flex: 1; background-color: #1a1a1c; border: 1px solid #09b5a5; padding: 20px;">
 **Embedding Strategy**:
 - Understands "auth", "login", "SSO"
 - 8 pages crawled
 - 92% confidence
 - Semantic comprehension
 </div>
 </div>
 ### Detecting Irrelevance
 One killer feature: the embedding strategy knows when to give up:
 ```python
 # Crawling Python docs with a cooking query
 result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to make spaghetti carbonara"
 )
 # System detects irrelevance and stops
 # Confidence: 5% (below threshold)
 # Pages crawled: 2
 # Stopped reason: "below_minimum_relevance_threshold"
 ```
 No more crawling hundreds of pages hoping to find something that doesn't exist!
 ## Try It Yourself
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 async def main():
    async with AsyncWebCrawler() as crawler:
        # Choose your strategy
        config = AdaptiveConfig(
            strategy="embedding",  # or "statistical"
            embedding_min_confidence_threshold=0.1  # Stop if irrelevant
        )
        adaptive = AdaptiveCrawler(crawler, config)
        # Watch intelligence at work
        result = await adaptive.digest(
            start_url="https://your-docs.com",
            query="your users' actual questions"
        )
        # See the efficiency
        adaptive.print_stats()
        print(f"Found {adaptive.confidence:.0%} of needed information")
        print(f"In just {len(result.crawled_urls)} pages")
        # 1000 is the illustrative page count of an exhaustive crawl
        print(f"Saving you {1000 - len(result.crawled_urls)} unnecessary crawls")
 asyncio.run(main())
 ```
 ## A Personal Note
 I created Adaptive Crawling because I was tired of watching smart people make inefficient choices. We have incredibly powerful statistical tools that we've forgotten in our rush toward LLMs. This is my attempt to bring balance back to the Force.
 This is not just a feature. It's a philosophy: **Grow knowledge on demand. Stop when you have enough. Save time, money, and computational resources for what really matters.**
 ## The Future is Adaptive
 <div style="text-align: center; padding: 40px; background-color: #1a1a1c; border: 1px dashed #3f3f44; margin: 30px 0;">
 <div style="font-size: 24px; color: #09b5a5; margin-bottom: 10px;">
 Traditional Crawling: Drinking from a firehose<br>
 Adaptive Crawling: Sipping exactly what you need
 </div>
 <div style="color: #a3abba;">
 The future of web crawling isn't about processing more data.<br>
 It's about processing the <em>right</em> data.
 </div>
 </div>
 Join me in making web crawling intelligent, efficient, and actually useful. Because in the age of information overload, the winners won't be those who collect the most data—they'll be those who collect the *right* data.
 ---
 *Adaptive Crawling is now part of Crawl4AI. [Get started with the documentation](/core/adaptive-crawling/) or [dive into the mathematical framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md). For updates on my work in information theory and efficient AI, follow me on [X/Twitter](https://x.com/unclecode).*
 <style>
 /* Custom styles for this article */
 .markdown-body pre {
    background-color: #1e1e1e !important;
    border: 1px solid #3f3f44;
 }
 .markdown-body code {
    background-color: #3f3f44;
    color: #50ffff;
    padding: 2px 6px;
    border-radius: 3px;
 }
 .markdown-body pre code {
    background-color: transparent;
    color: #e8e9ed;
    padding: 0;
 }
 .markdown-body blockquote {
    border-left: 4px solid #09b5a5;
    background-color: #1a1a1c;
    padding: 15px 20px;
    margin: 20px 0;
 }
 .markdown-body h2 {
    color: #50ffff;
    border-bottom: 1px dashed #3f3f44;
    padding-bottom: 10px;
 }
 .markdown-body h3 {
    color: #09b5a5;
 }
 .markdown-body strong {
    color: #50ffff;
 }
 </style>
--- a/docs/md_v2/blog/index.md
+++ b/docs/md_v2/blog/index.md
@@ -2,6 +2,22 @@
 Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
 ## Featured Articles
 ### [When to Stop Crawling: The Art of Knowing "Enough"](articles/adaptive-crawling-revolution.md)
 *January 29, 2025*
 Traditional crawlers are like tourists with unlimited time—they'll visit every street, every alley, every dead end. But what if your crawler could think like a researcher with a deadline? Discover how Adaptive Crawling revolutionizes web scraping by knowing when to stop. Learn about the three-layer intelligence system that evaluates coverage, consistency, and saturation to build focused knowledge bases instead of endless page collections.
 [Read the full article →](articles/adaptive-crawling-revolution.md)
 ### [The LLM Context Protocol: Why Your AI Assistant Needs Memory, Reasoning, and Examples](articles/llm-context-revolution.md)
 *January 24, 2025*
 Ever wondered why your AI coding assistant struggles with your library despite comprehensive documentation? This article introduces the three-dimensional context protocol that transforms how AI understands code. Learn why memory, reasoning, and examples together create wisdom—not just information.
 [Read the full article →](articles/llm-context-revolution.md)
 ## Latest Release
--- a/docs/md_v2/core/adaptive-crawling.md
+++ b/docs/md_v2/core/adaptive-crawling.md
@@ -0,0 +1,347 @@
 # Adaptive Web Crawling
 ## Introduction
 Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. **Adaptive Crawling** changes this paradigm by introducing intelligence into the crawling process.
 Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
 ## Key Concepts
 ### The Problem It Solves
 When crawling websites for specific information, you face two challenges:
 1. **Under-crawling**: Stopping too early and missing crucial information
 2. **Over-crawling**: Wasting resources by crawling irrelevant pages
 Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
 ### How It Works
 The AdaptiveCrawler uses three metrics to measure information sufficiency:
 - **Coverage**: How well your collected pages cover the query terms
 - **Consistency**: Whether the information is coherent across pages  
 - **Saturation**: Detecting when new pages aren't adding new information
 When these metrics indicate sufficient information has been gathered, crawling stops automatically.
 ## Quick Start
 ### Basic Usage
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
 async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)
        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        # View statistics
        adaptive.print_stats()
        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ### Configuration Options
 ```python
 from crawl4ai import AdaptiveConfig
 config = AdaptiveConfig(
    confidence_threshold=0.7,    # Stop when 70% confident (default: 0.8)
    max_pages=20,               # Maximum pages to crawl (default: 50)
    top_k_links=3,              # Links to follow per page (default: 5)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
 )
 adaptive = AdaptiveCrawler(crawler, config=config)
 ```
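The options above can be read as three independent reasons to stop. The sketch below shows one plausible way they could interact; `StopSettings`, `should_stop`, and the reason strings are hypothetical names for illustration, and the real decision logic in `AdaptiveCrawler` may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StopSettings:
    """Hypothetical stand-in mirroring three of the AdaptiveConfig fields shown above."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1

def should_stop(confidence: float, pages_crawled: int,
                best_expected_gain: float, settings: StopSettings) -> Optional[str]:
    """Return a reason string when crawling should stop, or None to keep going."""
    if confidence >= settings.confidence_threshold:
        return "confidence_reached"
    if pages_crawled >= settings.max_pages:
        return "max_pages_reached"
    if best_expected_gain < settings.min_gain_threshold:
        # No remaining link is expected to add enough new information
        return "no_promising_links"
    return None

settings = StopSettings(confidence_threshold=0.7, max_pages=20, min_gain_threshold=0.05)
print(should_stop(0.75, 5, 0.50, settings))
print(should_stop(0.40, 5, 0.20, settings))
```

Under this reading, lowering `confidence_threshold` ends crawls sooner, while `max_pages` acts purely as a safety cap.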
 ## Crawling Strategies
 Adaptive Crawling supports two distinct strategies for determining information sufficiency:
 ### Statistical Strategy (Default)
 The statistical strategy uses pure information theory and term-based analysis:
 - **Fast and efficient** - No API calls or model loading
 - **Term-based coverage** - Analyzes query term presence and distribution
 - **No external dependencies** - Works offline
 - **Best for**: Well-defined queries with specific terminology
 ```python
 # Default configuration uses statistical strategy
 config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
 )
 ```
 ### Embedding Strategy
 The embedding strategy uses semantic embeddings for deeper understanding:
 - **Semantic understanding** - Captures meaning beyond exact term matches
 - **Query expansion** - Automatically generates query variations
 - **Gap-driven selection** - Identifies semantic gaps in knowledge
 - **Validation-based stopping** - Uses held-out queries to validate coverage
 - **Best for**: Complex queries, ambiguous topics, conceptual understanding
 ```python
 # Configure embedding strategy
 config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
 )
 # With custom embedding provider (e.g., OpenAI)
 config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
 )
 ```
 ### Strategy Comparison
 | Feature | Statistical | Embedding |
 |---------|------------|-----------|
 | **Speed** | Very fast | Moderate (model inference or API calls) |
 | **Cost** | Free | Depends on provider |
 | **Accuracy** | Good for exact terms | Excellent for concepts |
 | **Dependencies** | None | Embedding model/API |
 | **Query Understanding** | Literal | Semantic |
 | **Best Use Case** | Technical docs, specific terms | Research, broad topics |
 ### Embedding Strategy Configuration
 The embedding strategy offers fine-tuned control through several parameters:
 ```python
 config = AdaptiveConfig(
    strategy="embedding",
    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings
    # Query expansion
    n_query_variations=10,  # Number of query variations to generate
    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,  # Exponential decay factor (higher = stricter)
    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,  # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant
    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication
    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
 )
 ```
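To build intuition for `embedding_coverage_radius` and `embedding_k_exp`, the following sketch scores each query variation by its distance to the nearest document embedding: full credit inside the radius, exponentially decaying credit beyond it. This formula is an assumption made for illustration, written in plain Python with toy 2-D vectors; it is not taken from the library's source.

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def coverage_score(query_vecs: list[list[float]], doc_vecs: list[list[float]],
                   radius: float = 0.2, k_exp: float = 3.0) -> float:
    """Mean per-query score: 1.0 within the radius, exp(-k_exp * excess distance) outside it."""
    if not query_vecs or not doc_vecs:
        return 0.0
    total = 0.0
    for q in query_vecs:
        nearest = min(cosine_distance(q, d) for d in doc_vecs)
        excess = max(0.0, nearest - radius)
        total += math.exp(-k_exp * excess)
    return total / len(query_vecs)

queries = [[1.0, 0.0], [0.0, 1.0]]   # two toy query variations
docs = [[1.0, 0.0]]                  # one crawled page, aligned with the first query only
partial = coverage_score(queries, docs)
strict = coverage_score(queries, docs, k_exp=6.0)
print(partial, strict)
```

Raising `k_exp` penalizes uncovered variations more sharply, which matches the "higher = stricter" comment in the configuration above.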
 ### Handling Irrelevant Queries
 The embedding strategy can detect when a query is completely unrelated to the content:
 ```python
 # This will stop quickly with low confidence
 result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
 )
 # Check if query was irrelevant
 if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
 ```
 ## When to Use Adaptive Crawling
 ### Perfect For:
 - **Research Tasks**: Finding comprehensive information about a topic
 - **Question Answering**: Gathering sufficient context to answer specific queries
 - **Knowledge Base Building**: Creating focused datasets for AI/ML applications
 - **Competitive Intelligence**: Collecting complete information about specific products/features
 ### Not Recommended For:
 - **Full Site Archiving**: When you need every page regardless of content
 - **Structured Data Extraction**: When targeting specific, known page patterns
 - **Real-time Monitoring**: When you need continuous updates
 ## Understanding the Output
 ### Confidence Score
 The confidence score (0-1) indicates how sufficient the gathered information is:
 - **0.0-0.3**: Insufficient information, needs more crawling
 - **0.3-0.6**: Partial information, may answer basic queries
 - **0.6-0.8**: Good coverage, can answer most queries
 - **0.8-1.0**: Excellent coverage, comprehensive information
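If you want to act on these bands in code, a small helper like the one below can translate a score into a label. The function name and the boundary handling (each lower bound inclusive) are assumptions, since the ranges above share their endpoints.

```python
def describe_confidence(score: float) -> str:
    """Map a 0-1 confidence score to the bands listed above (lower bounds inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be within [0, 1], got {score}")
    if score < 0.3:
        return "insufficient"
    if score < 0.6:
        return "partial"
    if score < 0.8:
        return "good"
    return "excellent"

for value in (0.1, 0.45, 0.7, 0.92):
    print(value, describe_confidence(value))
```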
 ### Statistics Display
 ```python
 adaptive.print_stats(detailed=False)  # Summary table
 adaptive.print_stats(detailed=True)   # Detailed metrics
 ```
 The summary shows:
 - Pages crawled vs. confidence achieved
 - Coverage, consistency, and saturation scores
 - Crawling efficiency metrics
 ## Persistence and Resumption
 ### Saving Progress
 ```python
 config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
 )
 adaptive = AdaptiveCrawler(crawler, config=config)
 # Crawl will auto-save progress
 result = await adaptive.digest(start_url, query)
 ```
 ### Resuming a Crawl
 ```python
 # Resume from saved state
 result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
 )
 ```
 ### Exporting Knowledge Base
 ```python
 # Export collected pages to JSONL
 adaptive.export_knowledge_base("knowledge_base.jsonl")
 # Import into another session
 new_adaptive = AdaptiveCrawler(crawler)
 new_adaptive.import_knowledge_base("knowledge_base.jsonl")
 ```
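Because the export is JSONL (one JSON object per line), it can be inspected with the standard library alone. The sketch below assumes each record carries `url` and `content` fields, which this page does not confirm; check a real export before relying on those names. The demo writes a throwaway file so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Demo with a temporary file; the "url" and "content" fields are assumed, not confirmed
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "knowledge_base.jsonl"
    sample.write_text(
        json.dumps({"url": "https://example.com/a", "content": "first page"}) + "\n"
        + json.dumps({"url": "https://example.com/b", "content": "second page"}) + "\n",
        encoding="utf-8",
    )
    records = load_jsonl(str(sample))

print(len(records), [record["url"] for record in records])
```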
 ## Best Practices
 ### 1. Query Formulation
 - Use specific, descriptive queries
 - Include key terms you expect to find
 - Avoid overly broad queries
 ### 2. Threshold Tuning
 - Start with default (0.8) for general use
 - Lower to 0.6-0.7 for exploratory crawling
 - Raise to 0.9+ for exhaustive coverage
 ### 3. Performance Optimization
 - Use appropriate `max_pages` limits
 - Adjust `top_k_links` based on site structure
 - Enable caching for repeat crawls
 ### 4. Link Selection
 - The crawler prioritizes links based on:
  - Relevance to query
  - Expected information gain
  - URL structure and depth
 ## Examples
 ### Research Assistant
 ```python
 # Gather information about a programming concept
 result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
 )
 # Get the most relevant excerpts
 for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
 ```
 ### Knowledge Base Builder
 ```python
 # Build a focused knowledge base about machine learning
 queries = [
    "supervised learning algorithms",
    "neural network architectures", 
    "model evaluation metrics"
 ]
 for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )
 # Export combined knowledge base
 adaptive.export_knowledge_base("ml_knowledge.jsonl")
 ```
 ### API Documentation Crawler
 ```python
 # Intelligently crawl API documentation
 config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
 )
 adaptive = AdaptiveCrawler(crawler, config)
 result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
 )
 ```
 ## Next Steps
 - Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
 - Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
 - See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)
 ## FAQ
 **Q: How is this different from traditional crawling?**
 A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.
 **Q: Can I use this with JavaScript-heavy sites?**
 A: Yes! AdaptiveCrawler wraps an AsyncWebCrawler instance, so every capability of the underlying crawler, including JavaScript execution, remains available.
 **Q: How does it handle large websites?**
 A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.
 **Q: Can I customize the scoring algorithms?**
 A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).
--- a/docs/md_v2/core/examples.md
+++ b/docs/md_v2/core/examples.md
@@ -28,7 +28,11 @@ This page provides a comprehensive list of example scripts that demonstrate vari
 | Example | Description | Link |
 |---------|-------------|------|
 | Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
 | Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter, Instagram. Demonstrates different scrolling scenarios with local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
 | Adaptive Crawling | Demonstrates intelligent crawling that automatically determines when sufficient information has been gathered. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/adaptive_crawling/) |
 | Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
 | Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
 | Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
--- a/docs/md_v2/core/quickstart.md
+++ b/docs/md_v2/core/quickstart.md
@@ -272,7 +272,43 @@ if __name__ == "__main__":
 ---
-## 7. Multi-URL Concurrency (Preview)
+## 7. Adaptive Crawling (New!)
 Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
 async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")
 if __name__ == "__main__":
    asyncio.run(adaptive_example())
 ```
 **What's special about adaptive crawling?**
 - **Automatic stopping**: Stops when sufficient information is gathered
 - **Intelligent link selection**: Follows only relevant links
 - **Confidence scoring**: Know how complete your information is
 [Learn more about Adaptive Crawling →](adaptive-crawling.md)
 ---
 ## 8. Multi-URL Concurrency (Preview)
 If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
--- a/docs/md_v2/index.md
+++ b/docs/md_v2/index.md
@@ -48,6 +48,12 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 > **Note**: If you're looking for the old documentation, you can access it [here](https://old.docs.crawl4ai.com).
 ## 🎯 New: Adaptive Web Crawling
 Crawl4AI now features intelligent adaptive crawling that knows when to stop! Using advanced information foraging algorithms, it determines when sufficient information has been gathered to answer your query.
 [Learn more about Adaptive Crawling →](core/adaptive-crawling.md)
 ## Quick Start
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -44,7 +44,10 @@ dependencies = [
    "aiohttp>=3.11.11",
    "brotli>=1.1.0",
    "humanize>=4.10.0",
-    "lark>=1.2.2"
+    "lark>=1.2.2",
    "sentence-transformers>=2.2.0",
    "alphashape>=1.3.1",
    "shapely>=2.0.0"
 ]
 classifiers = [
    "Development Status :: 4 - Beta",
--- a/requirements.txt
+++ b/requirements.txt
@@ -24,3 +24,6 @@ cssselect>=1.2.0
 chardet>=5.2.0
 brotli>=1.1.0
 httpx[http2]>=0.27.2
 sentence-transformers>=2.2.0
 alphashape>=1.3.1
 shapely>=2.0.0
--- a/tests/adaptive/compare_performance.py
+++ b/tests/adaptive/compare_performance.py
@@ -0,0 +1,98 @@
 """
 Compare performance before and after optimizations
 """
 def read_baseline():
    """Read baseline performance metrics"""
    with open('performance_baseline.txt', 'r') as f:
        content = f.read()
    # Extract key metrics
    metrics = {}
    lines = content.split('\n')
    for i, line in enumerate(lines):
        if 'Total Time:' in line:
            metrics['total_time'] = float(line.split(':')[1].strip().split()[0])
        elif 'Memory Used:' in line:
            metrics['memory_mb'] = float(line.split(':')[1].strip().split()[0])
        elif 'validate_coverage:' in line and i+2 < len(lines) and 'Avg Time:' in lines[i+2]:
            metrics['validate_coverage_ms'] = float(lines[i+2].split(':')[1].strip().split()[0])
        elif 'select_links:' in line and i+2 < len(lines) and 'Avg Time:' in lines[i+2]:
            metrics['select_links_ms'] = float(lines[i+2].split(':')[1].strip().split()[0])
        elif 'calculate_confidence:' in line and i+2 < len(lines) and 'Avg Time:' in lines[i+2]:
            metrics['calculate_confidence_ms'] = float(lines[i+2].split(':')[1].strip().split()[0])
    return metrics
 def print_comparison(before_metrics, after_metrics):
    """Print performance comparison"""
    print("\n" + "="*80)
    print("PERFORMANCE COMPARISON: BEFORE vs AFTER OPTIMIZATIONS")
    print("="*80)
    # Total time
    time_improvement = (before_metrics['total_time'] - after_metrics['total_time']) / before_metrics['total_time'] * 100
    print(f"\n📊 Total Time:")
    print(f"   Before: {before_metrics['total_time']:.2f} seconds")
    print(f"   After:  {after_metrics['total_time']:.2f} seconds")
    print(f"   Improvement: {time_improvement:.1f}% faster ✅" if time_improvement > 0 else f"   Slower: {-time_improvement:.1f}% ❌")
    # Memory
    mem_improvement = (before_metrics['memory_mb'] - after_metrics['memory_mb']) / before_metrics['memory_mb'] * 100
    print(f"\n💾 Memory Usage:")
    print(f"   Before: {before_metrics['memory_mb']:.2f} MB")
    print(f"   After:  {after_metrics['memory_mb']:.2f} MB")
    print(f"   Improvement: {mem_improvement:.1f}% less memory ✅" if mem_improvement > 0 else f"   More memory: {-mem_improvement:.1f}% ❌")
    # Key operations
    print(f"\n⚡ Key Operations:")
    # Validate coverage
    if 'validate_coverage_ms' in before_metrics and 'validate_coverage_ms' in after_metrics:
        val_improvement = (before_metrics['validate_coverage_ms'] - after_metrics['validate_coverage_ms']) / before_metrics['validate_coverage_ms'] * 100
        print(f"\n   validate_coverage:")
        print(f"     Before: {before_metrics['validate_coverage_ms']:.1f} ms")
        print(f"     After:  {after_metrics['validate_coverage_ms']:.1f} ms")
        print(f"     Improvement: {val_improvement:.1f}% faster ✅" if val_improvement > 0 else f"     Slower: {-val_improvement:.1f}% ❌")
    # Select links
    if 'select_links_ms' in before_metrics and 'select_links_ms' in after_metrics:
        sel_improvement = (before_metrics['select_links_ms'] - after_metrics['select_links_ms']) / before_metrics['select_links_ms'] * 100
        print(f"\n   select_links:")
        print(f"     Before: {before_metrics['select_links_ms']:.1f} ms")
        print(f"     After:  {after_metrics['select_links_ms']:.1f} ms")
        print(f"     Improvement: {sel_improvement:.1f}% faster ✅" if sel_improvement > 0 else f"     Slower: {-sel_improvement:.1f}% ❌")
    # Calculate confidence
    if 'calculate_confidence_ms' in before_metrics and 'calculate_confidence_ms' in after_metrics:
        calc_improvement = (before_metrics['calculate_confidence_ms'] - after_metrics['calculate_confidence_ms']) / before_metrics['calculate_confidence_ms'] * 100
        print(f"\n   calculate_confidence:")
        print(f"     Before: {before_metrics['calculate_confidence_ms']:.1f} ms")
        print(f"     After:  {after_metrics['calculate_confidence_ms']:.1f} ms")
        print(f"     Improvement: {calc_improvement:.1f}% faster ✅" if calc_improvement > 0 else f"     Slower: {-calc_improvement:.1f}% ❌")
    print("\n" + "="*80)
    # Overall assessment
    if time_improvement > 50:
        print("🎉 EXCELLENT OPTIMIZATION! More than 50% performance improvement!")
    elif time_improvement > 30:
        print("✅ GOOD OPTIMIZATION! Significant performance improvement!")
    elif time_improvement > 10:
        print("👍 DECENT OPTIMIZATION! Noticeable performance improvement!")
    else:
        print("🤔 MINIMAL IMPROVEMENT. Further optimization may be needed.")
    print("="*80)
 if __name__ == "__main__":
    # Example usage - you'll run this after implementing optimizations
    baseline = read_baseline()
    print("Baseline metrics loaded:")
    for k, v in baseline.items():
        print(f"  {k}: {v}")
    print("\n⚠️  Run the performance test again after optimizations to compare!")
    print("Then update this script with the new metrics to see the comparison.")
--- a/tests/adaptive/test_adaptive_crawler.py
+++ b/tests/adaptive/test_adaptive_crawler.py
@@ -0,0 +1,293 @@
 """
 Test and demo script for Adaptive Crawler
 This script demonstrates the progressive crawling functionality
 with various configurations and use cases.
 """
 import asyncio
 import json
 from pathlib import Path
 import time
 from typing import Dict, List
 from rich.console import Console
 from rich.table import Table
 from rich.progress import Progress
 from rich import print as rprint
 # Add the repository root (two levels above tests/adaptive/) to the path for imports
 import sys
 sys.path.append(str(Path(__file__).resolve().parents[2]))
 from crawl4ai import (
    AsyncWebCrawler,
    AdaptiveCrawler,
    AdaptiveConfig,
    CrawlState
 )
 console = Console()
 def print_relevant_content(crawler: AdaptiveCrawler, top_k: int = 3):
    """Print most relevant content found"""
    relevant = crawler.get_relevant_content(top_k=top_k)
    if not relevant:
        console.print("[yellow]No relevant content found yet.[/yellow]")
        return
    console.print(f"\n[bold cyan]Top {len(relevant)} Most Relevant Pages:[/bold cyan]")
    for i, doc in enumerate(relevant, 1):
        console.print(f"\n[green]{i}. {doc['url']}[/green]")
        console.print(f"   Score: {doc['score']:.2f}")
        # Show a single-line snippet, truncated to 200 characters
        content = (doc['content'] or "").replace('\n', ' ')
        snippet = (content[:200] + "...") if len(content) > 200 else content
        console.print(f"   [dim]{snippet}[/dim]")
 async def test_basic_progressive_crawl():
    """Test basic progressive crawling functionality"""
    console.print("\n[bold yellow]Test 1: Basic Progressive Crawl[/bold yellow]")
    console.print("Testing on Python documentation with query about async/await")
    config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=10,
        top_k_links=2,
        min_gain_threshold=0.1
    )
    # Create crawler
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=config
        )
        # Start progressive crawl
        start_time = time.time()
        state = await prog_crawler.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await context managers"
        )
        elapsed = time.time() - start_time
        # Print results
        prog_crawler.print_stats(detailed=False)
        prog_crawler.print_stats(detailed=True)
        print_relevant_content(prog_crawler)
        console.print(f"\n[green]Crawl completed in {elapsed:.2f} seconds[/green]")
        console.print(f"Final confidence: {prog_crawler.confidence:.2%}")
        console.print(f"URLs crawled: {list(state.crawled_urls)[:5]}...")  # Show first 5
        # Test export functionality
        export_path = "knowledge_base_export.jsonl"
        prog_crawler.export_knowledge_base(export_path)
        console.print(f"[green]Knowledge base exported to {export_path}[/green]")
        # Clean up
        Path(export_path).unlink(missing_ok=True)
 async def test_with_persistence():
    """Test state persistence and resumption"""
    console.print("\n[bold yellow]Test 2: Persistence and Resumption[/bold yellow]")
    console.print("Testing state save/load functionality")
    state_path = "test_crawl_state.json"
    config = AdaptiveConfig(
        confidence_threshold=0.6,
        max_pages=5,
        top_k_links=2,
        save_state=True,
        state_path=state_path
    )
    # First crawl - partial
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=config
        )
        state1 = await prog_crawler.digest(
            start_url="https://httpbin.org",
            query="http headers response"
        )
        console.print(f"[cyan]First crawl: {len(state1.crawled_urls)} pages[/cyan]")
    # Resume crawl
    config.max_pages = 10  # Increase limit
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=config
        )
        state2 = await prog_crawler.digest(
            start_url="https://httpbin.org",
            query="http headers response",
            resume_from=state_path
        )
        console.print(f"[green]Resumed crawl: {len(state2.crawled_urls)} total pages[/green]")
    # Clean up
    Path(state_path).unlink(missing_ok=True)
 async def test_different_domains():
    """Test on different types of websites"""
    console.print("\n[bold yellow]Test 3: Different Domain Types[/bold yellow]")
    test_cases = [
        {
            "name": "Documentation Site",
            "url": "https://docs.python.org/3/",
            "query": "decorators and context managers"
        },
        {
            "name": "API Documentation",  
            "url": "https://httpbin.org",
            "query": "http authentication headers"
        }
    ]
    for test in test_cases:
        console.print(f"\n[cyan]Testing: {test['name']}[/cyan]")
        console.print(f"URL: {test['url']}")
        console.print(f"Query: {test['query']}")
        config = AdaptiveConfig(
            confidence_threshold=0.6,
            max_pages=5,
            top_k_links=2
        )
        async with AsyncWebCrawler() as crawler:
            prog_crawler = AdaptiveCrawler(
                crawler=crawler,
                config=config
            )
            start_time = time.time()
            state = await prog_crawler.digest(
                start_url=test['url'],
                query=test['query']
            )
            elapsed = time.time() - start_time
            # Summary using print_stats
            prog_crawler.print_stats(detailed=False)
            console.print(f"[green]Completed in {elapsed:.2f} seconds[/green]")
 async def test_stopping_criteria():
    """Test different stopping criteria"""
    console.print("\n[bold yellow]Test 4: Stopping Criteria[/bold yellow]")
    # Test 1: High confidence threshold
    console.print("\n[cyan]4.1 High confidence threshold (0.9)[/cyan]")
    config = AdaptiveConfig(
        confidence_threshold=0.9,  # Very high
        max_pages=20,
        top_k_links=3
    )
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        state = await prog_crawler.digest(
            start_url="https://docs.python.org/3/library/",
            query="python standard library"
        )
        console.print(f"Pages needed for 90% confidence: {len(state.crawled_urls)}")
        prog_crawler.print_stats(detailed=False)
    # Test 2: Page limit
    console.print("\n[cyan]4.2 Page limit (3 pages max)[/cyan]")
    config = AdaptiveConfig(
        confidence_threshold=0.9,
        max_pages=3,  # Very low limit
        top_k_links=2
    )
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        state = await prog_crawler.digest(
            start_url="https://docs.python.org/3/library/",
            query="python standard library modules"
        )
        console.print(f"Stopped by: {'Page limit' if len(state.crawled_urls) >= config.max_pages else 'Other'}")
        prog_crawler.print_stats(detailed=False)
 async def test_crawl_patterns():
    """Analyze crawl patterns and link selection"""
    console.print("\n[bold yellow]Test 5: Crawl Pattern Analysis[/bold yellow]")
    config = AdaptiveConfig(
        confidence_threshold=0.7,
        max_pages=8,
        top_k_links=2,
        min_gain_threshold=0.05
    )
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        # Track crawl progress
        console.print("\n[cyan]Crawl Progress:[/cyan]")
        state = await prog_crawler.digest(
            start_url="https://httpbin.org",
            query="http methods post get"
        )
        # Show crawl order
        console.print("\n[green]Crawl Order:[/green]")
        for i, url in enumerate(state.crawl_order, 1):
            console.print(f"{i}. {url}")
        # Show new terms discovered per page
        console.print("\n[green]New Terms Discovered:[/green]")
        for i, new_terms in enumerate(state.new_terms_history, 1):
            console.print(f"Page {i}: {new_terms} new terms")
        # Final metrics
        console.print(f"\n[yellow]Saturation reached: {state.metrics.get('saturation', 0):.2%}[/yellow]")
 async def main():
    """Run all tests"""
    console.print("[bold magenta]Adaptive Crawler Test Suite[/bold magenta]")
    console.print("=" * 50)
    try:
        # Run tests (only the basic crawl is enabled; uncomment the others to run the full suite)
        await test_basic_progressive_crawl()
        # await test_with_persistence()
        # await test_different_domains()
        # await test_stopping_criteria()
        # await test_crawl_patterns()
        console.print("\n[bold green]✅ All tests completed successfully![/bold green]")
    except Exception as e:
        console.print(f"\n[bold red]❌ Test failed with error: {e}[/bold red]")
        import traceback
        traceback.print_exc()
 if __name__ == "__main__":
    # Run the test suite
    asyncio.run(main())
--- a/tests/adaptive/test_confidence_debug.py
+++ b/tests/adaptive/test_confidence_debug.py
@@ -0,0 +1,182 @@
 """
 Test script for debugging confidence calculation in adaptive crawler
 Focus: Testing why confidence decreases when crawling relevant URLs
 """
 import asyncio
 import sys
 from pathlib import Path
 from typing import List, Dict
 import math
 # Add the project root (three levels up from this file) to the path for imports
 sys.path.append(str(Path(__file__).parent.parent.parent))
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.adaptive_crawler import CrawlState, StatisticalStrategy
 from crawl4ai.models import CrawlResult
 class ConfidenceTestHarness:
    """Test harness for analyzing confidence calculation"""
    def __init__(self):
        self.strategy = StatisticalStrategy()
        self.test_urls = [
            'https://docs.python.org/3/library/asyncio.html',
            'https://docs.python.org/3/library/asyncio-runner.html', 
            'https://docs.python.org/3/library/asyncio-api-index.html',
            'https://docs.python.org/3/library/contextvars.html',
            'https://docs.python.org/3/library/asyncio-stream.html'
        ]
        self.query = "async await context manager"
    async def test_confidence_progression(self):
        """Test confidence calculation as we crawl each URL"""
        print(f"Testing confidence for query: '{self.query}'")
        print("=" * 80)
        # Initialize state
        state = CrawlState(query=self.query)
        # Create crawler
        async with AsyncWebCrawler() as crawler:
            for i, url in enumerate(self.test_urls, 1):
                print(f"\n{i}. Crawling: {url}")
                print("-" * 80)
                # Crawl the URL
                result = await crawler.arun(url=url)
                # Unwrap the result container if the crawler returned one
                if hasattr(result, '_results') and result._results:
                    result = result._results[0]
                # Create a lightweight stand-in for CrawlResult exposing only
                # markdown.raw_markdown and url (empty string if markdown is missing or None)
                raw_markdown = getattr(getattr(result, 'markdown', None), 'raw_markdown', '') or ''
                mock_result = type('CrawlResult', (), {
                    'markdown': type('Markdown', (), {
                        'raw_markdown': raw_markdown
                    })(),
                    'url': url
                })()
                # Update state
                state.knowledge_base.append(mock_result)
                await self.strategy.update_state(state, [mock_result])
                # Calculate metrics
                confidence = await self.strategy.calculate_confidence(state)
                # Get individual components
                coverage = state.metrics.get('coverage', 0)
                consistency = state.metrics.get('consistency', 0)
                saturation = state.metrics.get('saturation', 0)
                # Analyze term frequencies
                query_terms = self.strategy._tokenize(self.query.lower())
                term_stats = {}
                for term in query_terms:
                    term_stats[term] = {
                        'tf': state.term_frequencies.get(term, 0),
                        'df': state.document_frequencies.get(term, 0)
                    }
                # Print detailed results
                print(f"State after crawl {i}:")
                print(f"  Total documents: {state.total_documents}")
                print(f"  Unique terms: {len(state.term_frequencies)}")
                print(f"  New terms added: {state.new_terms_history[-1] if state.new_terms_history else 0}")
                print(f"\nQuery term statistics:")
                for term, stats in term_stats.items():
                    print(f"  '{term}': tf={stats['tf']}, df={stats['df']}")
                print(f"\nMetrics:")
                print(f"  Coverage: {coverage:.3f}")
                print(f"  Consistency: {consistency:.3f}")
                print(f"  Saturation: {saturation:.3f}")
                print(f"  → Confidence: {confidence:.3f}")
                # Show coverage calculation details
                print(f"\nCoverage calculation details:")
                self._debug_coverage_calculation(state, query_terms)
                # Alert if confidence decreased
                if i > 1 and confidence < state.metrics.get('prev_confidence', 0):
                    print(f"\n⚠️  WARNING: Confidence decreased from {state.metrics.get('prev_confidence', 0):.3f} to {confidence:.3f}")
                state.metrics['prev_confidence'] = confidence
    def _debug_coverage_calculation(self, state: CrawlState, query_terms: List[str]):
        """Debug coverage calculation step by step"""
        coverage_score = 0.0
        max_possible_score = 0.0
        for term in query_terms:
            tf = state.term_frequencies.get(term, 0)
            df = state.document_frequencies.get(term, 0)
            if df > 0:
                idf = math.log((state.total_documents - df + 0.5) / (df + 0.5) + 1)
                doc_coverage = df / state.total_documents
                tf_boost = min(tf / df, 3.0)
                term_score = doc_coverage * idf * (1 + 0.1 * math.log1p(tf_boost))
                print(f"    '{term}': doc_cov={doc_coverage:.2f}, idf={idf:.2f}, boost={1 + 0.1 * math.log1p(tf_boost):.2f} → score={term_score:.3f}")
                coverage_score += term_score
            else:
                print(f"    '{term}': not found → score=0.000")
            # Reference score per term used for normalization: doc_coverage=1.0, idf=1.0, boost=1.1
            max_possible_score += 1.0 * 1.0 * 1.1
        print(f"    Total: {coverage_score:.3f} / {max_possible_score:.3f} = {coverage_score/max_possible_score if max_possible_score > 0 else 0:.3f}")
        # New coverage calculation
        print(f"\n  NEW Coverage calculation (without IDF):")
        new_coverage = self._calculate_coverage_new(state, query_terms)
        print(f"    → New Coverage: {new_coverage:.3f}")
    def _calculate_coverage_new(self, state: CrawlState, query_terms: List[str]) -> float:
        """New coverage calculation without IDF"""
        if not query_terms or state.total_documents == 0:
            return 0.0
        term_scores = []
        max_tf = max(state.term_frequencies.values()) if state.term_frequencies else 1
        for term in query_terms:
            tf = state.term_frequencies.get(term, 0)
            df = state.document_frequencies.get(term, 0)
            if df > 0:
                # Document coverage: what fraction of docs contain this term
                doc_coverage = df / state.total_documents
                # Frequency signal: normalized log frequency
                freq_signal = math.log(1 + tf) / math.log(1 + max_tf) if max_tf > 0 else 0
                # Combined score: document coverage with frequency boost
                term_score = doc_coverage * (1 + 0.5 * freq_signal)
                print(f"    '{term}': doc_cov={doc_coverage:.2f}, freq_signal={freq_signal:.2f} → score={term_score:.3f}")
                term_scores.append(term_score)
            else:
                print(f"    '{term}': not found → score=0.000")
                term_scores.append(0.0)
        # Average across all query terms
        coverage = sum(term_scores) / len(term_scores)
        return coverage
 async def main():
    """Run the confidence test"""
    tester = ConfidenceTestHarness()
    await tester.test_confidence_progression()
    print("\n" + "=" * 80)
    print("Test complete!")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/tests/adaptive/test_embedding_performance.py
+++ b/tests/adaptive/test_embedding_performance.py
@@ -0,0 +1,254 @@
 """
 Performance test for Embedding Strategy optimizations
 Measures time and memory usage before and after optimizations
 """
 import asyncio
 import time
 import tracemalloc
 import numpy as np
 from pathlib import Path
 import sys
 import os
 # Add parent directory to path for imports
 sys.path.append(str(Path(__file__).parent.parent.parent))
 from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
 from crawl4ai.adaptive_crawler import EmbeddingStrategy, CrawlState
 from crawl4ai.models import CrawlResult
 class PerformanceMetrics:
    def __init__(self):
        self.start_time = 0
        self.end_time = 0
        self.start_memory = 0
        self.peak_memory = 0
        self.operation_times = {}
    def start(self):
        tracemalloc.start()
        self.start_time = time.perf_counter()
        self.start_memory = tracemalloc.get_traced_memory()[0]
    def end(self):
        self.end_time = time.perf_counter()
        current, peak = tracemalloc.get_traced_memory()
        self.peak_memory = peak
        tracemalloc.stop()
    def record_operation(self, name: str, duration: float):
        if name not in self.operation_times:
            self.operation_times[name] = []
        self.operation_times[name].append(duration)
    @property
    def total_time(self):
        return self.end_time - self.start_time
    @property
    def memory_used_mb(self):
        return (self.peak_memory - self.start_memory) / 1024 / 1024
    def print_summary(self, label: str):
        print(f"\n{'='*60}")
        print(f"Performance Summary: {label}")
        print(f"{'='*60}")
        print(f"Total Time: {self.total_time:.3f} seconds")
        print(f"Memory Used: {self.memory_used_mb:.2f} MB")
        if self.operation_times:
            print("\nOperation Breakdown:")
            for op, times in self.operation_times.items():
                avg_time = sum(times) / len(times)
                total_time = sum(times)
                print(f"  {op}:")
                print(f"    - Calls: {len(times)}")
                print(f"    - Avg Time: {avg_time*1000:.2f} ms")
                print(f"    - Total Time: {total_time:.3f} s")
 class MockMarkdown:
    """Minimal stand-in for a markdown result"""
    def __init__(self, content):
        self.raw_markdown = content
 class MockResult:
    """Minimal stand-in for a crawl result"""
    def __init__(self, url, content):
        self.url = url
        self.markdown = MockMarkdown(content)
        self.success = True
 async def create_mock_crawl_results(n: int) -> list:
    """Create mock crawl results for testing"""
    results = []
    for i in range(n):
        content = f"This is test content {i} about async await coroutines event loops. " * 50
        results.append(MockResult(f"https://example.com/page{i}", content))
    return results
 async def test_embedding_performance():
    """Test the performance of embedding strategy operations"""
    # Configuration
    n_kb_docs = 30  # Number of documents in knowledge base
    n_queries = 10  # Number of query variations
    n_links = 50   # Number of candidate links
    n_iterations = 5  # Number of calculation iterations
    print(f"\nTest Configuration:")
    print(f"- Knowledge Base Documents: {n_kb_docs}")
    print(f"- Query Variations: {n_queries}")
    print(f"- Candidate Links: {n_links}")
    print(f"- Iterations: {n_iterations}")
    # Create embedding strategy
    config = AdaptiveConfig(
        strategy="embedding",
        max_pages=50,
        n_query_variations=n_queries,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions
    )
    # Set up API key if available
    if os.getenv('OPENAI_API_KEY'):
        config.embedding_llm_config = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY'),
            'embedding_model': 'text-embedding-3-small'
        }
    else:
        config.embedding_llm_config = {
            'provider': 'openai/gpt-4o-mini',
            'api_token': 'dummy-key'
        }
    strategy = EmbeddingStrategy(
        embedding_model=config.embedding_model,
        llm_config=config.embedding_llm_config
    )
    strategy.config = config
    # Initialize state
    state = CrawlState()
    state.query = "async await coroutines event loops tasks"
    # Start performance monitoring
    metrics = PerformanceMetrics()
    metrics.start()
    # 1. Generate query embeddings
    print("\n1. Generating query embeddings...")
    start = time.perf_counter()
    query_embeddings, expanded_queries = await strategy.map_query_semantic_space(
        state.query, 
        config.n_query_variations
    )
    state.query_embeddings = query_embeddings
    state.expanded_queries = expanded_queries
    metrics.record_operation("query_embedding", time.perf_counter() - start)
    print(f"   Generated {len(query_embeddings)} query embeddings")
    # 2. Build knowledge base incrementally
    print("\n2. Building knowledge base...")
    mock_results = await create_mock_crawl_results(n_kb_docs)
    for i in range(0, n_kb_docs, 5):  # Add 5 documents at a time
        batch = mock_results[i:i+5]
        start = time.perf_counter()
        await strategy.update_state(state, batch)
        metrics.record_operation("update_state", time.perf_counter() - start)
        state.knowledge_base.extend(batch)
    print(f"   Knowledge base has {len(state.kb_embeddings)} documents")
    # 3. Test repeated confidence calculations
    print(f"\n3. Testing {n_iterations} confidence calculations...")
    for i in range(n_iterations):
        start = time.perf_counter()
        confidence = await strategy.calculate_confidence(state)
        duration = time.perf_counter() - start  # time only the call itself
        metrics.record_operation("calculate_confidence", duration)
        print(f"   Iteration {i+1}: {confidence:.3f} ({duration*1000:.1f} ms)")
    # 4. Test coverage gap calculations
    print("\n4. Testing coverage gap calculations...")
    for i in range(n_iterations):
        start = time.perf_counter()
        gaps = strategy.find_coverage_gaps(state.kb_embeddings, state.query_embeddings)
        duration = time.perf_counter() - start
        metrics.record_operation("find_coverage_gaps", duration)
        print(f"   Iteration {i+1}: {len(gaps)} gaps ({duration*1000:.1f} ms)")
    # 5. Test validation
    print("\n5. Testing validation coverage...")
    for i in range(n_iterations):
        start = time.perf_counter()
        val_score = await strategy.validate_coverage(state)
        duration = time.perf_counter() - start
        metrics.record_operation("validate_coverage", duration)
        print(f"   Iteration {i+1}: {val_score:.3f} ({duration*1000:.1f} ms)")
    # 6. Test link selection: build mock candidate links, then rank them
    from crawl4ai.models import Link
    mock_links = []
    for i in range(n_links):
        link = Link(
            href=f"https://example.com/new{i}",
            text=f"Link about async programming {i}",
            title=f"Async Guide {i}"
        )
        mock_links.append(link)
    print(f"\n6. Testing link selection with {n_links} candidates...")
    start = time.perf_counter()
    scored_links = await strategy.select_links_for_expansion(
        mock_links,
        gaps,
        state.kb_embeddings
    )
    duration = time.perf_counter() - start
    metrics.record_operation("select_links", duration)
    print(f"   Scored {len(scored_links)} links in {duration*1000:.1f} ms")
    # End monitoring
    metrics.end()
    return metrics
 async def main():
    """Run the baseline performance test (before optimizations) and report the main bottlenecks"""
    print("="*80)
    print("EMBEDDING STRATEGY PERFORMANCE TEST")
    print("="*80)
    # Test current implementation
    print("\n📊 Testing CURRENT Implementation...")
    metrics_before = await test_embedding_performance()
    metrics_before.print_summary("BEFORE Optimizations")
    # Store key metrics for comparison
    total_time_before = metrics_before.total_time
    memory_before = metrics_before.memory_used_mb
    # Calculate specific operation costs (average seconds per call; 0.0 if never recorded)
    def avg_time(op: str) -> float:
        times = metrics_before.operation_times.get(op, [])
        return sum(times) / len(times) if times else 0.0
    calc_conf_avg = avg_time("calculate_confidence")
    find_gaps_avg = avg_time("find_coverage_gaps")
    validate_avg = avg_time("validate_coverage")
    print(f"\n🔍 Key Bottlenecks Identified:")
    print(f"   - calculate_confidence: {calc_conf_avg*1000:.1f} ms per call")
    print(f"   - find_coverage_gaps: {find_gaps_avg*1000:.1f} ms per call")
    print(f"   - validate_coverage: {validate_avg*1000:.1f} ms per call")
    print("\n" + "="*80)
    print("EXPECTED IMPROVEMENTS AFTER OPTIMIZATION:")
    print("- Distance calculations: 80-90% faster (vectorization)")
    print("- Memory usage: 20-30% reduction (deduplication)")
    print("- Overall performance: 60-70% improvement")
    print("="*80)
 if __name__ == "__main__":
    asyncio.run(main())
--- a/tests/adaptive/test_embedding_strategy.py
+++ b/tests/adaptive/test_embedding_strategy.py
@@ -0,0 +1,634 @@
 """
 Test and demo script for Embedding-based Adaptive Crawler
 This script demonstrates the embedding-based adaptive crawling
 with semantic space coverage and gap-driven expansion.
 """
 import asyncio
 import os
 from pathlib import Path
 import time
 from rich.console import Console
 from rich import print as rprint
 import sys
 # Add parent directory to path for imports
 sys.path.append(str(Path(__file__).parent.parent.parent))
 from crawl4ai import (
    AsyncWebCrawler,
    AdaptiveCrawler,
    AdaptiveConfig,
    CrawlState
 )
 console = Console()
 async def test_basic_embedding_crawl():
    """Test basic embedding-based adaptive crawling"""
    console.print("\n[bold yellow]Test 1: Basic Embedding-based Crawl[/bold yellow]")
    console.print("Testing semantic space coverage with query expansion")
    # Configure with embedding strategy
    config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.7,  # Not used for stopping in embedding strategy
        min_gain_threshold=0.01,
        max_pages=15,
        top_k_links=3,
        n_query_variations=8,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # Fast, good quality
    )
    # For query expansion, we need an LLM config
    llm_config = {
        'provider': 'openai/gpt-4o-mini',
        'api_token': os.getenv('OPENAI_API_KEY')
    }
    if not llm_config['api_token']:
        console.print("[red]Warning: OPENAI_API_KEY not set. Continuing without an API key; query expansion may not work.[/red]")
        # No mock is installed here; the crawl proceeds with an empty token
    config.embedding_llm_config = llm_config
    # Create crawler
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(
            crawler=crawler,
            config=config
        )
        # Start adaptive crawl
        start_time = time.time()
        console.print("\n[cyan]Starting semantic adaptive crawl...[/cyan]")
        state = await prog_crawler.digest(
            start_url="https://docs.python.org/3/library/asyncio.html",
            query="async await coroutines event loops"
        )
        elapsed = time.time() - start_time
        # Print results
        console.print(f"\n[green]Crawl completed in {elapsed:.2f} seconds[/green]")
        prog_crawler.print_stats(detailed=False)
        # Show semantic coverage details
        console.print("\n[bold cyan]Semantic Coverage Details:[/bold cyan]")
        if state.expanded_queries:
            console.print(f"Query expanded to {len(state.expanded_queries)} variations")
            console.print("Sample variations:")
            for i, q in enumerate(state.expanded_queries[:3], 1):
                console.print(f"  {i}. {q}")
        if state.semantic_gaps:
            console.print(f"\nSemantic gaps identified: {len(state.semantic_gaps)}")
        console.print(f"\nFinal confidence: {prog_crawler.confidence:.2%}")
        console.print(f"Is Sufficient: {'Yes (Validated)' if prog_crawler.is_sufficient else 'No'}")
        console.print(f"Pages needed: {len(state.crawled_urls)}")
 async def test_embedding_vs_statistical(use_openai=False):
    """Compare embedding strategy with statistical strategy"""
    console.print("\n[bold yellow]Test 2: Embedding vs Statistical Strategy Comparison[/bold yellow]")
    test_url = "https://httpbin.org"
    test_query = "http headers authentication api"
    # Test 1: Statistical strategy
    console.print("\n[cyan]1. Statistical Strategy:[/cyan]")
    config_stat = AdaptiveConfig(
        strategy="statistical",
        confidence_threshold=0.7,
        max_pages=10
    )
    async with AsyncWebCrawler() as crawler:
        stat_crawler = AdaptiveCrawler(crawler=crawler, config=config_stat)
        start_time = time.time()
        state_stat = await stat_crawler.digest(start_url=test_url, query=test_query)
        stat_time = time.time() - start_time
        stat_pages = len(state_stat.crawled_urls)
        stat_confidence = stat_crawler.confidence
    # Test 2: Embedding strategy
    console.print("\n[cyan]2. Embedding Strategy:[/cyan]")
    config_emb = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.7,  # Not used for stopping
        max_pages=10,
        n_query_variations=5,
        min_gain_threshold=0.01
    )
    # Use OpenAI if available or requested
    if use_openai and os.getenv('OPENAI_API_KEY'):
        config_emb.embedding_llm_config = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY'),
            'embedding_model': 'text-embedding-3-small'
        }
        console.print("[cyan]Using OpenAI embeddings[/cyan]")
    else:
        # Default config will try sentence-transformers
        config_emb.embedding_llm_config = {
            'provider': 'openai/gpt-4o-mini',
            'api_token': os.getenv('OPENAI_API_KEY', 'dummy-key')
        }
    async with AsyncWebCrawler() as crawler:
        emb_crawler = AdaptiveCrawler(crawler=crawler, config=config_emb)
        start_time = time.time()
        state_emb = await emb_crawler.digest(start_url=test_url, query=test_query)
        emb_time = time.time() - start_time
        emb_pages = len(state_emb.crawled_urls)
        emb_confidence = emb_crawler.confidence
    # Compare results
    console.print("\n[bold green]Comparison Results:[/bold green]")
    console.print(f"Statistical: {stat_pages} pages in {stat_time:.2f}s, confidence: {stat_confidence:.2%}, sufficient: {stat_crawler.is_sufficient}")
    console.print(f"Embedding:   {emb_pages} pages in {emb_time:.2f}s, confidence: {emb_confidence:.2%}, sufficient: {emb_crawler.is_sufficient}")
    if emb_pages < stat_pages:
        efficiency = ((stat_pages - emb_pages) / stat_pages) * 100
        console.print(f"\n[green]Embedding strategy used {efficiency:.0f}% fewer pages![/green]")
    # Show validation info for embedding
    if hasattr(state_emb, 'metrics') and 'validation_confidence' in state_emb.metrics:
        console.print(f"Embedding validation score: {state_emb.metrics['validation_confidence']:.2%}")
 async def test_custom_embedding_provider():
    """Test with different embedding providers"""
    console.print("\n[bold yellow]Test 3: Custom Embedding Provider[/bold yellow]")
    # Example with OpenAI embeddings
    config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,  # Not used for stopping
        max_pages=10,
        min_gain_threshold=0.01,
        n_query_variations=5
    )
    # Configure to use OpenAI embeddings instead of sentence-transformers
    config.embedding_llm_config = {
        'provider': 'openai/text-embedding-3-small',
        'api_token': os.getenv('OPENAI_API_KEY'),
        'embedding_model': 'text-embedding-3-small'
    }
    if not config.embedding_llm_config['api_token']:
        console.print("[yellow]Skipping OpenAI embedding test - no API key[/yellow]")
        return
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        console.print("Using OpenAI embeddings for semantic analysis...")
        state = await prog_crawler.digest(
            start_url="https://httpbin.org",
            query="api endpoints json response"
        )
        prog_crawler.print_stats(detailed=False)
async def test_knowledge_export_import():
    """Test exporting and importing semantic knowledge bases"""
    console.print("\n[bold yellow]Test 4: Semantic Knowledge Base Export/Import[/bold yellow]")
    config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.7,  # Not used for stopping
        max_pages=5,
        min_gain_threshold=0.01,
        n_query_variations=4
    )
    # First crawl
    async with AsyncWebCrawler() as crawler:
        crawler1 = AdaptiveCrawler(crawler=crawler, config=config)
        console.print("\n[cyan]Building initial knowledge base...[/cyan]")
        state1 = await crawler1.digest(
            start_url="https://httpbin.org",
            query="http methods headers"
        )
        # Export
        export_path = "semantic_kb.jsonl"
        crawler1.export_knowledge_base(export_path)
        console.print(f"[green]Exported {len(state1.knowledge_base)} documents with embeddings[/green]")
    # Import and continue
    async with AsyncWebCrawler() as crawler:
        crawler2 = AdaptiveCrawler(crawler=crawler, config=config)
        console.print("\n[cyan]Importing knowledge base...[/cyan]")
        crawler2.import_knowledge_base(export_path)
        # Continue with new query - should be faster
        console.print("\n[cyan]Extending with new query...[/cyan]")
        state2 = await crawler2.digest(
            start_url="https://httpbin.org",
            query="authentication oauth tokens"
        )
        console.print(f"[green]Total knowledge base: {len(state2.knowledge_base)} documents[/green]")
    # Cleanup
    Path(export_path).unlink(missing_ok=True)
async def test_gap_visualization():
    """Visualize semantic gaps and coverage"""
    console.print("\n[bold yellow]Test 5: Semantic Gap Analysis[/bold yellow]")
    config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.9,  # Not used for stopping
        max_pages=8,
        n_query_variations=6,
        min_gain_threshold=0.01
    )
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        # Initial crawl
        state = await prog_crawler.digest(
            start_url="https://docs.python.org/3/library/",
            query="concurrency threading multiprocessing"
        )
        # Analyze gaps
        console.print("\n[bold cyan]Semantic Gap Analysis:[/bold cyan]")
        console.print(f"Query variations: {len(state.expanded_queries)}")
        console.print(f"Knowledge documents: {len(state.knowledge_base)}")
        console.print(f"Identified gaps: {len(state.semantic_gaps)}")
        if state.semantic_gaps:
            console.print("\n[yellow]Gap sizes (distance from coverage):[/yellow]")
            for i, (_, distance) in enumerate(state.semantic_gaps[:5], 1):
                console.print(f"  Gap {i}: {distance:.3f}")
        # Show crawl progression
        console.print("\n[cyan]Crawl Order (gap-driven selection):[/cyan]")
        for i, url in enumerate(state.crawl_order[:5], 1):
            console.print(f"  {i}. {url}")
async def test_fast_convergence_with_relevant_query():
    """Test that both strategies reach high confidence quickly with relevant queries"""
    console.print("\n[bold yellow]Test 7: Fast Convergence with Relevant Query[/bold yellow]")
    console.print("Testing that strategies reach 80%+ confidence within 2-3 batches")
    # Test scenarios
    test_cases = [
        {
            "name": "Python Async Documentation",
            "url": "https://docs.python.org/3/library/asyncio.html",
            "query": "async await coroutines event loops tasks"
        }
    ]
    for test_case in test_cases:
        console.print(f"\n[bold cyan]Testing: {test_case['name']}[/bold cyan]")
        console.print(f"URL: {test_case['url']}")
        console.print(f"Query: {test_case['query']}")
        # Test Embedding Strategy
        console.print("\n[yellow]Embedding Strategy:[/yellow]")
        config_emb = AdaptiveConfig(
            strategy="embedding",
            confidence_threshold=0.8,
            max_pages=9,
            top_k_links=3,
            min_gain_threshold=0.01,
            n_query_variations=5
        )
        # Configure embeddings
        config_emb.embedding_llm_config = {
            'provider': 'openai/gpt-4o-mini',
            'api_token': os.getenv('OPENAI_API_KEY'),
        }
        async with AsyncWebCrawler() as crawler:
            emb_crawler = AdaptiveCrawler(crawler=crawler, config=config_emb)
            start_time = time.time()
            state = await emb_crawler.digest(
                start_url=test_case['url'],
                query=test_case['query']
            )
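            # Get batch breakdown (every 3 pages)
            total_pages = len(state.crawled_urls)
            for i in range(0, total_pages, 3):
                batch_num = (i // 3) + 1
                batch_pages = min(3, total_pages - i)
                # Estimate confidence at this point (simplified: scales the final
                # confidence linearly by pages crawled; it is not measured per batch)
                pages_so_far = i + batch_pages
                estimated_confidence = state.metrics.get('confidence', 0) * (pages_so_far / total_pages)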
                console.print(f"Batch {batch_num}: {batch_pages} pages → Confidence: {estimated_confidence:.1%} {'✅' if estimated_confidence >= 0.8 else '❌'}")
            final_confidence = emb_crawler.confidence
            console.print(f"[green]Final: {total_pages} pages → Confidence: {final_confidence:.1%} {'✅ (Sufficient!)' if emb_crawler.is_sufficient else '❌'}[/green]")
            # Show learning metrics for embedding
            if 'avg_min_distance' in state.metrics:
                console.print(f"[dim]Avg gap distance: {state.metrics['avg_min_distance']:.3f}[/dim]")
            if 'validation_confidence' in state.metrics:
                console.print(f"[dim]Validation score: {state.metrics['validation_confidence']:.1%}[/dim]")
        # Test Statistical Strategy
        console.print("\n[yellow]Statistical Strategy:[/yellow]")
        config_stat = AdaptiveConfig(
            strategy="statistical",
            confidence_threshold=0.8,
            max_pages=9,
            top_k_links=3,
            min_gain_threshold=0.01
        )
        async with AsyncWebCrawler() as crawler:
            stat_crawler = AdaptiveCrawler(crawler=crawler, config=config_stat)
            start_time = time.time()
            state = await stat_crawler.digest(
                start_url=test_case['url'],
                query=test_case['query']
            )
            # Get batch breakdown (every 3 pages)
            total_pages = len(state.crawled_urls)
            for i in range(0, total_pages, 3):
                batch_num = (i // 3) + 1
                batch_pages = min(3, total_pages - i)
                # Estimate confidence at this point (simplified)
                pages_so_far = i + batch_pages
                estimated_confidence = state.metrics.get('confidence', 0) * (pages_so_far / total_pages)
                console.print(f"Batch {batch_num}: {batch_pages} pages → Confidence: {estimated_confidence:.1%} {'✅' if estimated_confidence >= 0.8 else '❌'}")
            final_confidence = stat_crawler.confidence
            console.print(f"[green]Final: {total_pages} pages → Confidence: {final_confidence:.1%} {'✅ (Sufficient!)' if stat_crawler.is_sufficient else '❌'}[/green]")
async def test_irrelevant_query_behavior():
    """Test how embedding strategy handles completely irrelevant queries"""
    console.print("\n[bold yellow]Test 8: Irrelevant Query Behavior[/bold yellow]")
    console.print("Testing embedding strategy with a query that has no semantic relevance to the content")
    # Test with irrelevant query on Python async documentation
    test_case = {
        "name": "Irrelevant Query on Python Docs",
        "url": "https://docs.python.org/3/library/asyncio.html",
        "query": "how to cook fried rice with vegetables"
    }
    console.print(f"\n[bold cyan]Testing: {test_case['name']}[/bold cyan]")
    console.print(f"URL: {test_case['url']} (Python async documentation)")
    console.print(f"Query: '{test_case['query']}' (completely irrelevant)")
    console.print("\n[dim]Expected behavior: Low confidence, high distances, no convergence[/dim]")
    # Configure embedding strategy
    config_emb = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,
        max_pages=9,
        top_k_links=3,
        min_gain_threshold=0.01,
        n_query_variations=5,
        embedding_min_relative_improvement=0.05,  # Lower threshold to see more iterations
        embedding_min_confidence_threshold=0.1  # Will stop if confidence < 10%
    )
    # Configure embeddings using the correct format
    config_emb.embedding_llm_config = {
        'provider': 'openai/gpt-4o-mini',
        'api_token': os.getenv('OPENAI_API_KEY'),
    }
    async with AsyncWebCrawler() as crawler:
        emb_crawler = AdaptiveCrawler(crawler=crawler, config=config_emb)
        start_time = time.time()
        state = await emb_crawler.digest(
            start_url=test_case['url'],
            query=test_case['query']
        )
        elapsed = time.time() - start_time
        # Analyze results
        console.print(f"\n[bold]Results after {elapsed:.1f} seconds:[/bold]")
        # Basic metrics
        total_pages = len(state.crawled_urls)
        final_confidence = emb_crawler.confidence
        console.print(f"\nPages crawled: {total_pages}")
        console.print(f"Final confidence: {final_confidence:.1%} {'✅' if emb_crawler.is_sufficient else '❌'}")
        # Distance metrics
        if 'avg_min_distance' in state.metrics:
            console.print("\n[yellow]Distance Metrics:[/yellow]")
            console.print(f"  Average minimum distance: {state.metrics['avg_min_distance']:.3f}")
            console.print(f"  Close neighbors (<0.3): {state.metrics.get('avg_close_neighbors', 0):.1f}")
            console.print(f"  Very close neighbors (<0.2): {state.metrics.get('avg_very_close_neighbors', 0):.1f}")
            # Interpret distances
            avg_dist = state.metrics['avg_min_distance']
            if avg_dist > 0.8:
                console.print("  [red]→ Very poor match (distance > 0.8)[/red]")
            elif avg_dist > 0.6:
                console.print("  [yellow]→ Poor match (distance > 0.6)[/yellow]")
            elif avg_dist > 0.4:
                console.print("  [blue]→ Moderate match (distance > 0.4)[/blue]")
            else:
                console.print("  [green]→ Good match (distance <= 0.4)[/green]")
        # Show sample expanded queries
        if state.expanded_queries:
            console.print("\n[yellow]Sample Query Variations Generated:[/yellow]")
            for i, q in enumerate(state.expanded_queries[:3], 1):
                console.print(f"  {i}. {q}")
        # Show crawl progression
        console.print("\n[yellow]Crawl Progression:[/yellow]")
        for i, url in enumerate(state.crawl_order[:5], 1):
            console.print(f"  {i}. {url}")
        if len(state.crawl_order) > 5:
            console.print(f"  ... and {len(state.crawl_order) - 5} more")
        # Validation score
        if 'validation_confidence' in state.metrics:
            console.print("\n[yellow]Validation:[/yellow]")
            console.print(f"  Validation score: {state.metrics['validation_confidence']:.1%}")
        # Why it stopped
        if 'stopped_reason' in state.metrics:
            console.print(f"\n[yellow]Stopping Reason:[/yellow] {state.metrics['stopped_reason']}")
            if state.metrics.get('is_irrelevant', False):
                console.print("[red]→ Query and content are completely unrelated![/red]")
        elif total_pages >= config_emb.max_pages:
            console.print(f"\n[yellow]Stopping Reason:[/yellow] Reached max pages limit ({config_emb.max_pages})")
        # Summary
        console.print("\n[bold]Summary:[/bold]")
        if final_confidence < 0.2:
            console.print("[red]✗ As expected: Query is completely irrelevant to content[/red]")
            console.print("[green]✓ The embedding strategy correctly identified no semantic match[/green]")
        else:
            console.print(f"[yellow]⚠ Unexpected: Got {final_confidence:.1%} confidence for irrelevant query[/yellow]")
            console.print("[yellow]  This may indicate the query variations are too broad[/yellow]")
async def test_high_dimensional_handling():
    """Test handling of high-dimensional embedding spaces"""
    console.print("\n[bold yellow]Test 6: High-Dimensional Embedding Space Handling[/bold yellow]")
    console.print("Testing how the system handles 384+ dimensional embeddings")
    config = AdaptiveConfig(
        strategy="embedding",
        confidence_threshold=0.8,  # Not used for stopping
        max_pages=5,
        n_query_variations=8,  # Will create 9 points total
        min_gain_threshold=0.01,
        embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # 384 dimensions
    )
    # Use OpenAI if available, otherwise mock
    if os.getenv('OPENAI_API_KEY'):
        config.embedding_llm_config = {
            'provider': 'openai/text-embedding-3-small',
            'api_token': os.getenv('OPENAI_API_KEY'),
            'embedding_model': 'text-embedding-3-small'
        }
    else:
        config.embedding_llm_config = {
            'provider': 'openai/gpt-4o-mini',
            'api_token': 'mock-key'
        }
    async with AsyncWebCrawler() as crawler:
        prog_crawler = AdaptiveCrawler(crawler=crawler, config=config)
        console.print("\n[cyan]Testing with high-dimensional embeddings (384D)...[/cyan]")
        try:
            state = await prog_crawler.digest(
                start_url="https://httpbin.org",
                query="api endpoints json"
            )
            console.print(f"[green]✓ Successfully handled {len(state.expanded_queries)} queries in 384D space[/green]")
            console.print(f"Coverage shape type: {type(state.coverage_shape)}")
            if isinstance(state.coverage_shape, dict):
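                console.print("Coverage model: centroid + radius")
                center = state.coverage_shape.get('center')
                radius = state.coverage_shape.get('radius')
                console.print(f"  - Center shape: {center.shape if center is not None else 'N/A'}")
                # Apply the float format only when a radius exists; formatting 'N/A' with .3f raises ValueError
                console.print(f"  - Radius: {radius:.3f}" if radius is not None else "  - Radius: N/A")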
        except Exception as e:
            console.print(f"[red]Error: {e}[/red]")
            console.print("[yellow]This demonstrates why alpha shapes don't work in high dimensions[/yellow]")
async def main():
    """Run all embedding strategy tests"""
    console.print("[bold magenta]Embedding-based Adaptive Crawler Test Suite[/bold magenta]")
    console.print("=" * 60)
    try:
        # Check if we have required dependencies
        has_sentence_transformers = True
        has_numpy = True
        try:
            import numpy
            console.print("[green]✓ NumPy installed[/green]")
        except ImportError:
            has_numpy = False
            console.print("[red]Missing numpy[/red]")
        # Try to import sentence_transformers but catch numpy compatibility errors
        try:
            import sentence_transformers
            console.print("[green]✓ Sentence-transformers installed[/green]")
        except (ImportError, RuntimeError, ValueError) as e:
            has_sentence_transformers = False
            console.print(f"[yellow]Warning: sentence-transformers not available ({e})[/yellow]")
            console.print("[yellow]Tests will use OpenAI embeddings if available or mock data[/yellow]")
        # Run tests based on available dependencies
        if has_numpy:
            # Check if we should use OpenAI for embeddings
            use_openai = not has_sentence_transformers and bool(os.getenv('OPENAI_API_KEY'))
            if not has_sentence_transformers and not os.getenv('OPENAI_API_KEY'):
                console.print("\n[red]Neither sentence-transformers nor OpenAI API key available[/red]")
                console.print("[yellow]Please set OPENAI_API_KEY or fix sentence-transformers installation[/yellow]")
                return
            # Run all tests
            # await test_basic_embedding_crawl()
            # await test_embedding_vs_statistical(use_openai=use_openai)
            # Run the fast convergence test - this is the most important one
            # await test_fast_convergence_with_relevant_query()
            # Test with irrelevant query
            await test_irrelevant_query_behavior()
            # Only run OpenAI-specific test if we have API key
            # if os.getenv('OPENAI_API_KEY'):
            #     await test_custom_embedding_provider()
            # # Skip tests that require sentence-transformers when it's not available
            # if has_sentence_transformers:
            #     await test_knowledge_export_import()
            #     await test_gap_visualization()
            # else:
            #     console.print("\n[yellow]Skipping tests that require sentence-transformers due to numpy compatibility issues[/yellow]")
            # This test should work with mock data
            # await test_high_dimensional_handling()
        else:
            console.print("\n[red]Cannot run tests without NumPy[/red]")
            return
        console.print("\n[bold green]✅ All tests completed![/bold green]")
    except Exception as e:
        console.print(f"\n[bold red]❌ Test failed: {e}[/bold red]")
        import traceback
        traceback.print_exc()
if __name__ == "__main__":
    asyncio.run(main())