diff --git a/PROGRESSIVE_CRAWLING.md b/PROGRESSIVE_CRAWLING.md new file mode 100644 index 00000000..1d710bd1 --- /dev/null +++ b/PROGRESSIVE_CRAWLING.md @@ -0,0 +1,320 @@ +# Progressive Web Crawling with Adaptive Information Foraging + +## Abstract + +This paper presents a novel approach to web crawling that adaptively determines when sufficient information has been gathered to answer a given query. Unlike traditional exhaustive crawling methods, our Progressive Information Sufficiency (PIS) framework uses statistical measures to balance information completeness against crawling efficiency. We introduce a multi-strategy architecture supporting pure statistical, embedding-enhanced, and LLM-assisted approaches, with theoretical guarantees on convergence and practical evaluation methods using synthetic datasets. + +## 1. Introduction + +Traditional web crawling approaches follow predetermined patterns (breadth-first, depth-first) without consideration for information sufficiency. This work addresses the fundamental question: *"When do we have enough information to answer a query and similar queries in its domain?"* + +We formalize this as an optimal stopping problem in information foraging, introducing metrics for coverage, consistency, and saturation that enable crawlers to make intelligent decisions about when to stop crawling and which links to follow. + +## 2. Problem Formulation + +### 2.1 Definitions + +Let: +- **K** = {d₁, d₂, ..., dₙ} be the current knowledge base (crawled documents) +- **Q** be the user query +- **L** = {l₁, l₂, ..., lₘ} be available links with preview metadata +- **θ** be the confidence threshold for information sufficiency + +### 2.2 Objectives + +1. **Minimize** |K| (number of crawled pages) +2. **Maximize** P(answers(Q) | K) (probability of answering Q given K) +3. **Ensure** coverage of Q's domain (similar queries) + +## 3. 
Mathematical Framework + +### 3.1 Information Sufficiency Metric + +We define Information Sufficiency as: + +``` +IS(K, Q) = min(Coverage(K, Q), Consistency(K, Q), 1 - Redundancy(K)) × DomainCoverage(K, Q) +``` + +### 3.2 Coverage Score + +Coverage measures how well current knowledge covers query terms and related concepts: + +``` +Coverage(K, Q) = Σ(t ∈ Q) log(df(t, K) + 1) × idf(t) / |Q| +``` + +Where: +- df(t, K) = document frequency of term t in knowledge base K +- idf(t) = inverse document frequency weight + +### 3.3 Consistency Score + +Consistency measures information coherence across documents: + +``` +Consistency(K, Q) = 1 - Var(answers from random subsets of K) +``` + +This captures the principle that sufficient knowledge should provide stable answers regardless of document subset. + +### 3.4 Saturation Score + +Saturation detects diminishing returns: + +``` +Saturation(K) = 1 - (ΔInfo(Kₙ) / ΔInfo(K₁)) +``` + +Where ΔInfo represents marginal information gain from the nth crawl. + +### 3.5 Link Value Prediction + +Expected information gain from uncrawled links: + +``` +ExpectedGain(l) = Relevance(l, Q) × Novelty(l, K) × Authority(l) +``` + +Components: +- **Relevance**: BM25(preview_text, Q) +- **Novelty**: 1 - max_similarity(preview, K) +- **Authority**: f(url_structure, domain_metrics) + +## 4. Algorithmic Approach + +### 4.1 Progressive Crawling Algorithm + +``` +Algorithm: ProgressiveCrawl(start_url, query, θ) + K ← ∅ + crawled ← {start_url} + pending ← extract_links(crawl(start_url)) + + while IS(K, Q) < θ and |crawled| < max_pages: + candidates ← rank_by_expected_gain(pending, Q, K) + if max(ExpectedGain(candidates)) < min_gain: + break // Diminishing returns + + to_crawl ← top_k(candidates) + new_docs ← parallel_crawl(to_crawl) + K ← K ∪ new_docs + crawled ← crawled ∪ to_crawl + pending ← extract_new_links(new_docs) - crawled + + return K +``` + +### 4.2 Stopping Criteria + +Crawling terminates when: +1. IS(K, Q) ≥ θ (sufficient information) +2. 
d(IS)/d(crawls) < ε (plateau reached) +3. |crawled| ≥ max_pages (resource limit) +4. max(ExpectedGain) < min_gain (no promising links) + +## 5. Multi-Strategy Architecture + +### 5.1 Strategy Pattern Design + +``` +AbstractStrategy + ├── StatisticalStrategy (no LLM, no embeddings) + ├── EmbeddingStrategy (with semantic similarity) + └── LLMStrategy (with language model assistance) +``` + +### 5.2 Statistical Strategy + +Pure statistical approach using: +- BM25 for relevance scoring +- Term frequency analysis for coverage +- Graph structure for authority +- No external models required + +**Advantages**: Fast, no API costs, works offline +**Best for**: Technical documentation, specific terminology + +### 5.3 Embedding Strategy (Implemented) + +Semantic understanding through embeddings: +- Query expansion into semantic variations +- Coverage mapping in embedding space +- Gap-driven link selection +- Validation-based stopping criteria + +**Mathematical Framework**: +``` +Coverage(K, Q) = mean(max_similarity(q, K) for q in Q_expanded) +Gap(q) = 1 - max_similarity(q, K) +LinkScore(l) = Σ(Gap(q) × relevance(l, q)) × (1 - redundancy(l, K)) +``` + +**Key Parameters**: +- `embedding_k_exp`: Exponential decay factor for distance-to-score mapping +- `embedding_coverage_radius`: Distance threshold for query coverage +- `embedding_min_confidence_threshold`: Minimum relevance threshold + +**Advantages**: Semantic understanding, handles ambiguity, detects irrelevance +**Best for**: Research queries, conceptual topics, diverse content + +### 5.4 Progressive Enhancement Path + +1. **Level 0**: Statistical only (implemented) +2. **Level 1**: + Embeddings for semantic similarity (implemented) +3. **Level 2**: + LLM for query understanding (future) + +## 6. Evaluation Methodology + +### 6.1 Synthetic Dataset Generation + +Using LLM to create evaluation data: + +```python +def generate_synthetic_dataset(domain_url): + # 1. 
Fully crawl domain + full_knowledge = exhaustive_crawl(domain_url) + + # 2. Generate answerable queries + queries = llm_generate_queries(full_knowledge) + + # 3. Create query variations (collected per query; a bare assignment in the loop would keep only the last result) + variations = {q: generate_variations(q) for q in queries} # synonyms, sub/super queries + + return queries, variations, full_knowledge +``` + +### 6.2 Evaluation Metrics + +1. **Efficiency**: Information gained / Pages crawled +2. **Completeness**: Answerable queries / Total queries +3. **Redundancy**: 1 - (Unique information / Total information) +4. **Convergence Rate**: Pages to 95% completeness + +### 6.3 Ablation Studies + +- Impact of each score component (coverage, consistency, saturation) +- Sensitivity to threshold parameters +- Performance across different domain types + +## 7. Theoretical Properties + +### 7.1 Convergence Guarantee + +**Theorem**: For finite websites, ProgressiveCrawl converges to IS(K, Q) ≥ θ or exhausts all reachable pages. + +**Proof sketch**: IS(K, Q) is monotonically non-decreasing with each crawl, bounded above by 1. + +### 7.2 Optimality + +Under certain assumptions about link preview accuracy: +- Expected crawls ≤ 2 × optimal_crawls +- Approximation ratio improves with preview quality + +## 8. Implementation Design + +### 8.1 Core Components + +1. **CrawlState**: Maintains crawl history and metrics +2. **AdaptiveConfig**: Configuration parameters +3. **CrawlStrategy**: Pluggable strategy interface +4. **AdaptiveCrawler**: Main orchestrator + +### 8.2 Integration with Crawl4AI + +- Wraps existing AsyncWebCrawler +- Leverages link preview functionality +- Maintains backward compatibility + +### 8.3 Persistence + +Knowledge base serialization for: +- Resumable crawls +- Knowledge sharing +- Offline analysis + +## 9. 
Future Directions + +### 9.1 Advanced Scoring + +- Temporal information value +- Multi-query optimization +- Active learning from user feedback + +### 9.2 Distributed Crawling + +- Collaborative knowledge building +- Federated information sufficiency + +### 9.3 Domain Adaptation + +- Transfer learning across domains +- Meta-learning for threshold selection + +## 10. Conclusion + +Progressive crawling with adaptive information foraging provides a principled approach to efficient web information extraction. By combining coverage, consistency, and saturation metrics, we can determine information sufficiency without ground truth labels. The multi-strategy architecture allows graceful enhancement from pure statistical to LLM-assisted approaches based on requirements and resources. + +## References + +1. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. + +2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval. + +3. Pirolli, P., & Card, S. (1999). Information Foraging. Psychological Review, 106(4), 643-675. + +4. Dasgupta, S. (2005). Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems. 
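Appendix A below sketches the per-strategy scoring internals; the outer stopping logic of Section 4.2 can be sketched in the same style. The function and parameter names here are illustrative only, not the package API, and the default thresholds are the example values from the text rather than tuned settings:

```python
def should_stop(is_history, pages_crawled, best_expected_gain,
                theta=0.7, epsilon=0.01, max_pages=20, min_gain=0.1):
    """Evaluate the four stopping criteria of Section 4.2.

    is_history is the sequence of IS(K, Q) values observed so far.
    """
    if is_history and is_history[-1] >= theta:
        return True   # 1. sufficient information: IS(K, Q) >= theta
    if len(is_history) >= 2 and (is_history[-1] - is_history[-2]) < epsilon:
        return True   # 2. plateau: d(IS)/d(crawls) < epsilon
    if pages_crawled >= max_pages:
        return True   # 3. resource limit reached
    if best_expected_gain < min_gain:
        return True   # 4. no promising links remain
    return False
```

Because the criteria are checked in order, a crawl that reaches the confidence threshold stops immediately, regardless of how many promising links remain.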
+ +## Appendix A: Implementation Pseudocode + +```python +class StatisticalStrategy: + def calculate_confidence(self, state): + coverage = self.calculate_coverage(state) + consistency = self.calculate_consistency(state) + saturation = self.calculate_saturation(state) + return min(coverage, consistency, saturation) + + def calculate_coverage(self, state): + # BM25-based term coverage + term_scores = [] + for term in state.query.split(): + df = state.document_frequencies.get(term, 0) + idf = self.idf_cache.get(term, 1.0) + term_scores.append(log(df + 1) * idf) + return mean(term_scores) / max_possible_score + + def rank_links(self, state): + scored_links = [] + for link in state.pending_links: + relevance = self.bm25_score(link.preview_text, state.query) + novelty = self.calculate_novelty(link, state.knowledge_base) + authority = self.url_authority(link.href) + score = relevance * novelty * authority + scored_links.append((link, score)) + return sorted(scored_links, key=lambda x: x[1], reverse=True) +``` + +## Appendix B: Evaluation Protocol + +1. **Dataset Creation**: + - Select diverse domains (documentation, blogs, e-commerce) + - Generate 100 queries per domain using LLM + - Create query variations (5-10 per query) + +2. **Baseline Comparisons**: + - BFS crawler (depth-limited) + - DFS crawler (depth-limited) + - Random crawler + - Oracle (knows relevant pages) + +3. **Metrics Collection**: + - Pages crawled vs query answerability + - Time to sufficient confidence + - False positive/negative rates + +4. 
**Statistical Analysis**: + - ANOVA for strategy comparison + - Regression for parameter sensitivity + - Bootstrap for confidence intervals \ No newline at end of file diff --git a/crawl4ai/__init__.py b/crawl4ai/__init__.py index bb5ca0e7..7a75e76d 100644 --- a/crawl4ai/__init__.py +++ b/crawl4ai/__init__.py @@ -69,6 +69,14 @@ from .deep_crawling import ( ) # NEW: Import AsyncUrlSeeder from .async_url_seeder import AsyncUrlSeeder +# Adaptive Crawler +from .adaptive_crawler import ( + AdaptiveCrawler, + AdaptiveConfig, + CrawlState, + CrawlStrategy, + StatisticalStrategy +) # C4A Script Language Support from .script import ( @@ -97,6 +105,12 @@ __all__ = [ "VirtualScrollConfig", # NEW: Add AsyncUrlSeeder "AsyncUrlSeeder", + # Adaptive Crawler + "AdaptiveCrawler", + "AdaptiveConfig", + "CrawlState", + "CrawlStrategy", + "StatisticalStrategy", "DeepCrawlStrategy", "BFSDeepCrawlStrategy", "BestFirstCrawlingStrategy", diff --git a/crawl4ai/adaptive_crawler copy.py b/crawl4ai/adaptive_crawler copy.py new file mode 100644 index 00000000..294a292d --- /dev/null +++ b/crawl4ai/adaptive_crawler copy.py @@ -0,0 +1,1847 @@ +""" +Adaptive Web Crawler for Crawl4AI + +This module implements adaptive information foraging for efficient web crawling. +It determines when sufficient information has been gathered to answer a query, +avoiding unnecessary crawls while ensuring comprehensive coverage. 
+""" + +from abc import ABC, abstractmethod +from typing import Dict, List, Optional, Set, Tuple, Any, Union +from dataclasses import dataclass, field +import asyncio +import pickle +import os +import json +import math +from collections import defaultdict, Counter +import re +from pathlib import Path + +from crawl4ai.async_webcrawler import AsyncWebCrawler +from crawl4ai.async_configs import CrawlerRunConfig, LinkPreviewConfig +from crawl4ai.models import Link, CrawlResult + + +@dataclass +class CrawlState: + """Tracks the current state of adaptive crawling""" + crawled_urls: Set[str] = field(default_factory=set) + knowledge_base: List[CrawlResult] = field(default_factory=list) + pending_links: List[Link] = field(default_factory=list) + query: str = "" + metrics: Dict[str, float] = field(default_factory=dict) + + # Statistical tracking + term_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) + document_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) + documents_with_terms: Dict[str, Set[int]] = field(default_factory=lambda: defaultdict(set)) + total_documents: int = 0 + + # History tracking for saturation + new_terms_history: List[int] = field(default_factory=list) + crawl_order: List[str] = field(default_factory=list) + + # Embedding-specific tracking (only if strategy is embedding) + kb_embeddings: Optional[Any] = None # Will be numpy array + query_embeddings: Optional[Any] = None # Will be numpy array + expanded_queries: List[str] = field(default_factory=list) + coverage_shape: Optional[Any] = None # Alpha shape + semantic_gaps: List[Tuple[List[float], float]] = field(default_factory=list) # Serializable + embedding_model: str = "" + + def save(self, path: Union[str, Path]): + """Save state to disk for persistence""" + path = Path(path) + path.parent.mkdir(parents=True, exist_ok=True) + + # Convert CrawlResult objects to dicts for serialization + state_dict = { + 'crawled_urls': list(self.crawled_urls), + 
'knowledge_base': [self._crawl_result_to_dict(cr) for cr in self.knowledge_base], + 'pending_links': [link.model_dump() for link in self.pending_links], + 'query': self.query, + 'metrics': self.metrics, + 'term_frequencies': dict(self.term_frequencies), + 'document_frequencies': dict(self.document_frequencies), + 'documents_with_terms': {k: list(v) for k, v in self.documents_with_terms.items()}, + 'total_documents': self.total_documents, + 'new_terms_history': self.new_terms_history, + 'crawl_order': self.crawl_order, + # Embedding-specific fields (convert numpy arrays to lists for JSON) + 'kb_embeddings': self.kb_embeddings.tolist() if self.kb_embeddings is not None else None, + 'query_embeddings': self.query_embeddings.tolist() if self.query_embeddings is not None else None, + 'expanded_queries': self.expanded_queries, + 'semantic_gaps': self.semantic_gaps, + 'embedding_model': self.embedding_model + } + + with open(path, 'w') as f: + json.dump(state_dict, f, indent=2) + + @classmethod + def load(cls, path: Union[str, Path]) -> 'CrawlState': + """Load state from disk""" + path = Path(path) + with open(path, 'r') as f: + state_dict = json.load(f) + + state = cls() + state.crawled_urls = set(state_dict['crawled_urls']) + state.knowledge_base = [cls._dict_to_crawl_result(d) for d in state_dict['knowledge_base']] + state.pending_links = [Link(**link_dict) for link_dict in state_dict['pending_links']] + state.query = state_dict['query'] + state.metrics = state_dict['metrics'] + state.term_frequencies = defaultdict(int, state_dict['term_frequencies']) + state.document_frequencies = defaultdict(int, state_dict['document_frequencies']) + state.documents_with_terms = defaultdict(set, {k: set(v) for k, v in state_dict['documents_with_terms'].items()}) + state.total_documents = state_dict['total_documents'] + state.new_terms_history = state_dict['new_terms_history'] + state.crawl_order = state_dict['crawl_order'] + + # Load embedding-specific fields (convert lists back to 
numpy arrays) + import numpy as np + state.kb_embeddings = np.array(state_dict['kb_embeddings']) if state_dict.get('kb_embeddings') is not None else None + state.query_embeddings = np.array(state_dict['query_embeddings']) if state_dict.get('query_embeddings') is not None else None + state.expanded_queries = state_dict.get('expanded_queries', []) + state.semantic_gaps = state_dict.get('semantic_gaps', []) + state.embedding_model = state_dict.get('embedding_model', '') + + return state + + @staticmethod + def _crawl_result_to_dict(cr: CrawlResult) -> Dict: + """Convert CrawlResult to serializable dict""" + # Extract markdown content safely + markdown_content = "" + if hasattr(cr, 'markdown') and cr.markdown: + if hasattr(cr.markdown, 'raw_markdown'): + markdown_content = cr.markdown.raw_markdown + else: + markdown_content = str(cr.markdown) + + return { + 'url': cr.url, + 'content': markdown_content, + 'links': cr.links if hasattr(cr, 'links') else {}, + 'metadata': cr.metadata if hasattr(cr, 'metadata') else {} + } + + @staticmethod + def _dict_to_crawl_result(d: Dict): + """Convert dict back to CrawlResult""" + # Create a mock object that has the minimal interface we need + class MockMarkdown: + def __init__(self, content): + self.raw_markdown = content + + class MockCrawlResult: + def __init__(self, url, content, links, metadata): + self.url = url + self.markdown = MockMarkdown(content) + self.links = links + self.metadata = metadata + + return MockCrawlResult( + url=d['url'], + content=d.get('content', ''), + links=d.get('links', {}), + metadata=d.get('metadata', {}) + ) + + +@dataclass +class AdaptiveConfig: + """Configuration for adaptive crawling""" + confidence_threshold: float = 0.7 + max_depth: int = 5 + max_pages: int = 20 + top_k_links: int = 3 + min_gain_threshold: float = 0.1 + strategy: str = "statistical" # statistical, embedding, llm + + # Advanced parameters + saturation_threshold: float = 0.8 + consistency_threshold: float = 0.7 + coverage_weight: 
float = 0.4 + consistency_weight: float = 0.3 + saturation_weight: float = 0.3 + + # Link scoring parameters + relevance_weight: float = 0.5 + novelty_weight: float = 0.3 + authority_weight: float = 0.2 + + # Persistence + save_state: bool = False + state_path: Optional[str] = None + + # Embedding strategy parameters + embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2" + embedding_llm_config: Optional[Dict] = None # Separate config for embeddings + n_query_variations: int = 10 + coverage_threshold: float = 0.85 + alpha_shape_alpha: float = 0.5 + + # Embedding confidence calculation parameters + embedding_coverage_radius: float = 0.2 # Distance threshold for "covered" query points + # Example: With radius=0.2, a query point is considered covered if ANY document + # is within cosine distance 0.2 (very similar). Smaller = stricter coverage requirement + + embedding_k_exp: float = 3.0 # Exponential decay factor for distance-to-score mapping + # Example: score = exp(-k_exp * distance). With k_exp=1, distance 0.2 → score 0.82, + # distance 0.5 → score 0.61. 
Higher k_exp = steeper decay = more emphasis on very close matches + + embedding_nearest_weight: float = 0.7 # Weight for nearest neighbor in hybrid scoring + embedding_top_k_weight: float = 0.3 # Weight for top-k average in hybrid scoring + # Example: If nearest doc has score 0.9 and top-3 avg is 0.6, final = 0.7*0.9 + 0.3*0.6 = 0.81 + # Higher nearest_weight = more focus on best match vs neighborhood density + + # Embedding link selection parameters + embedding_overlap_threshold: float = 0.85 # Similarity threshold for penalizing redundant links + # Example: Links with >0.85 similarity to existing KB get penalized to avoid redundancy + # Lower = more aggressive deduplication, Higher = allow more similar content + + # Embedding stopping criteria parameters + embedding_min_relative_improvement: float = 0.1 # Minimum relative improvement to continue + # Example: If confidence is 0.6, need improvement > 0.06 per batch to continue crawling + # Lower = more patient crawling, Higher = stop earlier when progress slows + + embedding_validation_min_score: float = 0.4 # Minimum validation score to trust convergence + # Example: Even if learning converged, keep crawling if validation score < 0.4 + # This prevents premature stopping when we haven't truly covered the query space + + # Quality confidence mapping parameters (for display to user) + embedding_quality_min_confidence: float = 0.7 # Minimum confidence for validated systems + embedding_quality_max_confidence: float = 0.95 # Maximum realistic confidence + embedding_quality_scale_factor: float = 0.833 # Scaling factor for confidence mapping + # Example: Validated system with learning_score=0.5 → confidence = 0.7 + (0.5-0.4)*0.833 = 0.78 + # These control how internal scores map to user-friendly confidence percentages + + def validate(self): + """Validate configuration parameters""" + assert 0 <= self.confidence_threshold <= 1, "confidence_threshold must be between 0 and 1" + assert self.max_depth > 0, "max_depth must be 
positive" + assert self.max_pages > 0, "max_pages must be positive" + assert self.top_k_links > 0, "top_k_links must be positive" + assert 0 <= self.min_gain_threshold <= 1, "min_gain_threshold must be between 0 and 1" + + # Check weights sum to 1 + weight_sum = self.coverage_weight + self.consistency_weight + self.saturation_weight + assert abs(weight_sum - 1.0) < 0.001, f"Coverage weights must sum to 1, got {weight_sum}" + + weight_sum = self.relevance_weight + self.novelty_weight + self.authority_weight + assert abs(weight_sum - 1.0) < 0.001, f"Link scoring weights must sum to 1, got {weight_sum}" + + # Validate embedding parameters + assert 0 < self.embedding_coverage_radius < 1, "embedding_coverage_radius must be between 0 and 1" + assert self.embedding_k_exp > 0, "embedding_k_exp must be positive" + assert 0 <= self.embedding_nearest_weight <= 1, "embedding_nearest_weight must be between 0 and 1" + assert 0 <= self.embedding_top_k_weight <= 1, "embedding_top_k_weight must be between 0 and 1" + assert abs(self.embedding_nearest_weight + self.embedding_top_k_weight - 1.0) < 0.001, "Embedding weights must sum to 1" + assert 0 <= self.embedding_overlap_threshold <= 1, "embedding_overlap_threshold must be between 0 and 1" + assert 0 < self.embedding_min_relative_improvement < 1, "embedding_min_relative_improvement must be between 0 and 1" + assert 0 <= self.embedding_validation_min_score <= 1, "embedding_validation_min_score must be between 0 and 1" + assert 0 <= self.embedding_quality_min_confidence <= 1, "embedding_quality_min_confidence must be between 0 and 1" + assert 0 <= self.embedding_quality_max_confidence <= 1, "embedding_quality_max_confidence must be between 0 and 1" + assert self.embedding_quality_scale_factor > 0, "embedding_quality_scale_factor must be positive" + + +class CrawlStrategy(ABC): + """Abstract base class for crawling strategies""" + + @abstractmethod + async def calculate_confidence(self, state: CrawlState) -> float: + """Calculate 
overall confidence that we have sufficient information""" + pass + + @abstractmethod + async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: + """Rank pending links by expected information gain""" + pass + + @abstractmethod + async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: + """Determine if crawling should stop""" + pass + + @abstractmethod + async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: + """Update state with new crawl results""" + pass + + +class StatisticalStrategy(CrawlStrategy): + """Pure statistical approach - no LLM, no embeddings""" + + def __init__(self): + self.idf_cache = {} + self.bm25_k1 = 1.2 # BM25 parameter + self.bm25_b = 0.75 # BM25 parameter + + async def calculate_confidence(self, state: CrawlState) -> float: + """Calculate confidence using coverage, consistency, and saturation""" + if not state.knowledge_base: + return 0.0 + + coverage = self._calculate_coverage(state) + consistency = self._calculate_consistency(state) + saturation = self._calculate_saturation(state) + + # Store individual metrics + state.metrics['coverage'] = coverage + state.metrics['consistency'] = consistency + state.metrics['saturation'] = saturation + + # Weighted combination (weights from config not accessible here, using defaults) + confidence = 0.4 * coverage + 0.3 * consistency + 0.3 * saturation + + return confidence + + def _calculate_coverage(self, state: CrawlState) -> float: + """Coverage scoring - measures query term presence across knowledge base + + Returns a score between 0 and 1, where: + - 0 means no query terms found + - 1 means excellent coverage of all query terms + """ + if not state.query or state.total_documents == 0: + return 0.0 + + query_terms = self._tokenize(state.query.lower()) + if not query_terms: + return 0.0 + + term_scores = [] + max_tf = max(state.term_frequencies.values()) if state.term_frequencies else 1 + + for term in 
query_terms: + tf = state.term_frequencies.get(term, 0) + df = state.document_frequencies.get(term, 0) + + if df > 0: + # Document coverage: what fraction of docs contain this term + doc_coverage = df / state.total_documents + + # Frequency signal: normalized log frequency + freq_signal = math.log(1 + tf) / math.log(1 + max_tf) if max_tf > 0 else 0 + + # Combined score: document coverage with frequency boost + term_score = doc_coverage * (1 + 0.5 * freq_signal) + term_scores.append(term_score) + else: + term_scores.append(0.0) + + # Average across all query terms + coverage = sum(term_scores) / len(term_scores) + + # Apply square root curve to make score more intuitive + # This helps differentiate between partial and good coverage + return min(1.0, math.sqrt(coverage)) + + def _calculate_consistency(self, state: CrawlState) -> float: + """Information overlap between pages - high overlap suggests coherent topic coverage""" + if len(state.knowledge_base) < 2: + return 1.0 # Single or no documents are perfectly consistent + + # Calculate pairwise term overlap + overlaps = [] + + for i in range(len(state.knowledge_base)): + for j in range(i + 1, len(state.knowledge_base)): + # Get terms from both documents + terms_i = set(self._get_document_terms(state.knowledge_base[i])) + terms_j = set(self._get_document_terms(state.knowledge_base[j])) + + if terms_i and terms_j: + # Jaccard similarity + overlap = len(terms_i & terms_j) / len(terms_i | terms_j) + overlaps.append(overlap) + + if overlaps: + # Average overlap as consistency measure + consistency = sum(overlaps) / len(overlaps) + else: + consistency = 0.0 + + return consistency + + def _calculate_saturation(self, state: CrawlState) -> float: + """Diminishing returns indicator - are we still discovering new information?""" + if not state.new_terms_history: + return 0.0 + + if len(state.new_terms_history) < 2: + return 0.0 # Not enough history + + # Calculate rate of new term discovery + recent_rate = 
state.new_terms_history[-1] + # A recent rate of zero means no new terms were discovered, i.e. full saturation; + # only the denominator needs a division-by-zero guard + initial_rate = max(state.new_terms_history[0], 1) + + # Saturation increases as rate decreases + saturation = 1 - (recent_rate / initial_rate) + + return max(0.0, min(saturation, 1.0)) + + async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: + """Rank links by expected information gain""" + scored_links = [] + + for link in state.pending_links: + # Skip already crawled URLs + if link.href in state.crawled_urls: + continue + + # Calculate component scores + relevance = self._calculate_relevance(link, state) + novelty = self._calculate_novelty(link, state) + authority = 1.0 # Authority scoring currently disabled; see _calculate_authority for the heuristic + + # Combined score + score = (config.relevance_weight * relevance + + config.novelty_weight * novelty + + config.authority_weight * authority) + + scored_links.append((link, score)) + + # Sort by score descending + scored_links.sort(key=lambda x: x[1], reverse=True) + + return scored_links + + def _calculate_relevance(self, link: Link, state: CrawlState) -> float: + """BM25 relevance score between link preview and query""" + if not state.query or not link: + return 0.0 + + # Combine available text from link + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.head_data.get('meta', {}).get('title', '') if link.head_data else '', + link.head_data.get('meta', {}).get('description', '') if link.head_data else '', + link.head_data.get('meta', {}).get('keywords', '') if link.head_data else '' + ])).lower() + + if not link_text: + return 0.0 + + # Use contextual score if available (from BM25 scoring during crawl) + if link.contextual_score and link.contextual_score > 0: + return link.contextual_score + + # Otherwise, calculate simple term overlap + query_terms = set(self._tokenize(state.query.lower())) + link_terms = 
set(self._tokenize(link_text)) + + if not query_terms: + return 0.0 + + overlap = len(query_terms & link_terms) / len(query_terms) + return overlap + + def _calculate_novelty(self, link: Link, state: CrawlState) -> float: + """Estimate how much new information this link might provide""" + if not state.knowledge_base: + return 1.0 # First links are maximally novel + + # Get terms from link preview + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.head_data.get('title', '') if link.head_data else '', + link.head_data.get('description', '') if link.head_data else '', + link.head_data.get('keywords', '') if link.head_data else '' + ])).lower() + + link_terms = set(self._tokenize(link_text)) + if not link_terms: + return 0.5 # Unknown novelty + + # Calculate what percentage of link terms are new + existing_terms = set(state.term_frequencies.keys()) + new_terms = link_terms - existing_terms + + novelty = len(new_terms) / len(link_terms) if link_terms else 0.0 + + return novelty + + def _calculate_authority(self, link: Link) -> float: + """Simple authority score based on URL structure and link attributes""" + score = 0.5 # Base score + + if not link.href: + return 0.0 + + url = link.href.lower() + + # Positive indicators + if '/docs/' in url or '/documentation/' in url: + score += 0.2 + if '/api/' in url or '/reference/' in url: + score += 0.2 + if '/guide/' in url or '/tutorial/' in url: + score += 0.1 + + # Check for file extensions + if url.endswith('.pdf'): + score += 0.1 + elif url.endswith(('.jpg', '.png', '.gif')): + score -= 0.3 # Reduce score for images + + # Use intrinsic score if available + if link.intrinsic_score is not None: + score = 0.7 * score + 0.3 * link.intrinsic_score + + return min(score, 1.0) + + async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: + """Determine if crawling should stop""" + # Check confidence threshold + confidence = state.metrics.get('confidence', 0.0) + if confidence >= 
config.confidence_threshold: + return True + + # Check resource limits + if len(state.crawled_urls) >= config.max_pages: + return True + + # Check if we have any links left + if not state.pending_links: + return True + + # Check saturation + if state.metrics.get('saturation', 0.0) >= config.saturation_threshold: + return True + + return False + + async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: + """Update state with new crawl results""" + for result in new_results: + # Track new terms + old_term_count = len(state.term_frequencies) + + # Extract markdown content; fall back to empty if missing + try: + content = result.markdown.raw_markdown + except AttributeError: + print(f"Warning: CrawlResult {result.url} has no markdown content") + content = "" + + terms = self._tokenize(content.lower()) + + # Update term frequencies + term_set = set() + for term in terms: + state.term_frequencies[term] += 1 + term_set.add(term) + + # Update document frequencies + doc_id = state.total_documents + for term in term_set: + if doc_id not in state.documents_with_terms[term]: + state.document_frequencies[term] += 1 + state.documents_with_terms[term].add(doc_id) + + # Track new terms discovered + new_term_count = len(state.term_frequencies) + new_terms = new_term_count - old_term_count + state.new_terms_history.append(new_terms) + + # Update document count + state.total_documents += 1 + + # Add to crawl order + state.crawl_order.append(result.url) + + def _tokenize(self, text: str) -> List[str]: + """Simple tokenization 
- can be enhanced"""
+        # Remove punctuation and split
+        text = re.sub(r'[^\w\s]', ' ', text)
+        tokens = text.split()
+
+        # Filter short tokens and stop words (basic)
+        tokens = [t for t in tokens if len(t) > 2]
+
+        return tokens
+
+    def _get_document_terms(self, crawl_result: CrawlResult) -> List[str]:
+        """Extract terms from a crawl result"""
+        try:
+            content = crawl_result.markdown.raw_markdown or ""
+        except AttributeError:
+            content = ""
+        return self._tokenize(content.lower())
+
+
+class EmbeddingStrategy(CrawlStrategy):
+    """Embedding-based adaptive crawling using semantic space coverage"""
+
+    def __init__(self, embedding_model: str = None, llm_config: Dict = None):
+        self.embedding_model = embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
+        self.llm_config = llm_config
+        self._embedding_cache = {}
+        self._link_embedding_cache = {}  # Cache for link embeddings
+        self._validation_passed = False  # Track if validation passed
+
+        # Performance optimization caches
+        self._distance_matrix_cache = None  # Cache for query-KB distances
+        self._kb_embeddings_hash = None  # Track KB changes
+        self._validation_embeddings_cache = None  # Cache validation query embeddings
+        self._kb_similarity_threshold = 0.95  # Threshold for deduplication
+
+    async def _get_embeddings(self, texts: List[str]) -> Any:
+        """Get embeddings using configured method"""
+        from .utils import get_text_embeddings
+        embedding_llm_config = {
+            'provider': 'openai/text-embedding-3-small',
+            'api_token': os.getenv('OPENAI_API_KEY')
+        }
+        return await get_text_embeddings(
+            texts,
+            embedding_llm_config,
+            self.embedding_model
+        )
+
+    def _compute_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any:
+        """Compute distance matrix using vectorized operations"""
+        import numpy as np
+
+        if kb_embeddings is None or len(kb_embeddings) == 0:
+            return None
+
+        # Ensure proper shapes
+        if len(query_embeddings.shape) == 1:
+            query_embeddings = query_embeddings.reshape(1, -1)
+        if len(kb_embeddings.shape) == 1:
+            kb_embeddings = 
kb_embeddings.reshape(1, -1) + + # Vectorized cosine distance: 1 - cosine_similarity + # Normalize vectors + query_norm = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True) + kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) + + # Compute cosine similarity matrix + similarity_matrix = np.dot(query_norm, kb_norm.T) + + # Convert to distance + distance_matrix = 1 - similarity_matrix + + return distance_matrix + + def _get_cached_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any: + """Get distance matrix with caching""" + import numpy as np + + if kb_embeddings is None or len(kb_embeddings) == 0: + return None + + # Check if KB has changed + kb_hash = hash(kb_embeddings.tobytes()) if kb_embeddings is not None else None + + if (self._distance_matrix_cache is None or + kb_hash != self._kb_embeddings_hash): + # Recompute matrix + self._distance_matrix_cache = self._compute_distance_matrix(query_embeddings, kb_embeddings) + self._kb_embeddings_hash = kb_hash + + return self._distance_matrix_cache + + async def map_query_semantic_space(self, query: str, n_synthetic: int = 10) -> Any: + """Generate a point cloud representing the semantic neighborhood of the query""" + from .utils import perform_completion_with_backoff + + # Generate more variations than needed for train/val split + n_total = int(n_synthetic * 1.3) # Generate 30% more for validation + + # Generate variations using LLM + prompt = f"""Generate {n_total} variations of this query that explore different aspects: '{query}' + + These should be queries a user might ask when looking for similar information. + Include different phrasings, related concepts, and specific aspects. 
+
+    Return as a JSON array of strings."""
+
+        # Use the LLM for query generation
+        provider = self.llm_config.get('provider', 'openai/gpt-4o-mini') if self.llm_config else 'openai/gpt-4o-mini'
+        api_token = self.llm_config.get('api_token') if self.llm_config else None
+
+        response = perform_completion_with_backoff(
+            provider=provider,
+            prompt_with_variables=prompt,
+            api_token=api_token,
+            json_response=True
+        )
+
+        import json
+        variations = json.loads(response.choices[0].message.content)
+
+        # Randomly shuffle for proper train/val split (keeping original query in training)
+        import random
+
+        # Keep original query always in training
+        other_queries = variations['queries'].copy()
+        random.shuffle(other_queries)
+
+        # Split: 80% for training, 20% for validation
+        n_validation = max(2, int(len(other_queries) * 0.2))  # At least 2 for validation
+        val_queries = other_queries[-n_validation:]
+        train_queries = [query] + other_queries[:-n_validation]
+
+        # Embed only training queries for now (faster)
+        train_embeddings = await self._get_embeddings(train_queries)
+
+        # Store validation queries for later (don't embed yet to save time)
+        self._validation_queries = val_queries
+
+        return train_embeddings, train_queries
+
+    def compute_coverage_shape(self, query_points: Any, alpha: float = 0.5):
+        """Approximate the region covered by the query points (centroid + radius model)"""
+        try:
+            import numpy as np
+
+            if len(query_points) < 3:
+                return None
+
+            # For high-dimensional embeddings (e.g., 384-dim, 768-dim),
+            # alpha shapes require exponentially more points than available.
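The centroid-plus-radius fallback assembled just below can be sketched in isolation. A minimal, self-contained illustration with synthetic data (`points` stands in for query-variation embeddings; none of these names are part of the project's API), showing that the stored `radius` encloses every query point by construction:

```python
import numpy as np

# Synthetic stand-in for query-variation embeddings: 8 points in 5 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(8, 5))

# Centroid + max-radius summary, mirroring the coverage dict built below.
center = points.mean(axis=0)
radius = float(np.max(np.linalg.norm(points - center, axis=1)))

# By construction, every query point lies within `radius` of the centroid.
assert np.all(np.linalg.norm(points - center, axis=1) <= radius + 1e-9)
```

The summary costs O(n·d) to build and O(d) to query, which is why it is preferred here over alpha shapes, whose point requirements grow exponentially with dimension.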
+ # Instead, use a statistical coverage model + query_points = np.array(query_points) + + # Store coverage as centroid + radius model + coverage = { + 'center': np.mean(query_points, axis=0), + 'std': np.std(query_points, axis=0), + 'points': query_points, + 'radius': np.max(np.linalg.norm(query_points - np.mean(query_points, axis=0), axis=1)) + } + return coverage + except Exception: + # Fallback if computation fails + return None + + def _sample_boundary_points(self, shape, n_samples: int = 20) -> List[Any]: + """Sample points from the boundary of a shape""" + import numpy as np + + # Simplified implementation - in practice would sample from actual shape boundary + # For now, return empty list if shape is None + if shape is None: + return [] + + # This is a placeholder - actual implementation would depend on shape type + return [] + + def find_coverage_gaps(self, kb_embeddings: Any, query_embeddings: Any) -> List[Tuple[Any, float]]: + """Calculate gap distances for all query variations using vectorized operations""" + import numpy as np + + gaps = [] + + if kb_embeddings is None or len(kb_embeddings) == 0: + # If no KB yet, all query points have maximum gap + for q_emb in query_embeddings: + gaps.append((q_emb, 1.0)) + return gaps + + # Use cached distance matrix + distance_matrix = self._get_cached_distance_matrix(query_embeddings, kb_embeddings) + + if distance_matrix is None: + # Fallback + for q_emb in query_embeddings: + gaps.append((q_emb, 1.0)) + return gaps + + # Find minimum distance for each query (vectorized) + min_distances = np.min(distance_matrix, axis=1) + + # Create gaps list + for i, q_emb in enumerate(query_embeddings): + gaps.append((q_emb, min_distances[i])) + + return gaps + + async def select_links_for_expansion( + self, + candidate_links: List[Link], + gaps: List[Tuple[Any, float]], + kb_embeddings: Any + ) -> List[Tuple[Link, float]]: + """Select links that most efficiently fill the gaps""" + from .utils import cosine_distance, 
cosine_similarity, get_text_embeddings + import numpy as np + import hashlib + + scored_links = [] + + # Prepare for embedding - separate cached vs uncached + links_to_embed = [] + texts_to_embed = [] + link_embeddings_map = {} + + for link in candidate_links: + # Extract text from link + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.meta.get('description', '') if hasattr(link, 'meta') and link.meta else '', + link.head_data.get('meta', {}).get('description', '') if link.head_data else '' + ])) + + if not link_text.strip(): + continue + + # Create cache key from URL + text content + cache_key = hashlib.md5(f"{link.href}:{link_text}".encode()).hexdigest() + + # Check cache + if cache_key in self._link_embedding_cache: + link_embeddings_map[link.href] = self._link_embedding_cache[cache_key] + else: + links_to_embed.append(link) + texts_to_embed.append(link_text) + + # Batch embed only uncached links + if texts_to_embed: + embedding_llm_config = { + 'provider': 'openai/text-embedding-3-small', + 'api_token': os.getenv('OPENAI_API_KEY') + } + new_embeddings = await get_text_embeddings(texts_to_embed, embedding_llm_config, self.embedding_model) + + # Cache the new embeddings + for link, text, embedding in zip(links_to_embed, texts_to_embed, new_embeddings): + cache_key = hashlib.md5(f"{link.href}:{text}".encode()).hexdigest() + self._link_embedding_cache[cache_key] = embedding + link_embeddings_map[link.href] = embedding + + # Get coverage radius from config + coverage_radius = self.config.embedding_coverage_radius if hasattr(self, 'config') else 0.2 + + # Score each link + for link in candidate_links: + if link.href not in link_embeddings_map: + continue # Skip links without embeddings + + link_embedding = link_embeddings_map[link.href] + + if not gaps: + score = 0.0 + else: + # Calculate how many gaps this link helps with + gaps_helped = 0 + total_improvement = 0 + + for gap_point, gap_distance in gaps: + # Only consider gaps that 
actually need filling (outside coverage radius) + if gap_distance > coverage_radius: + new_distance = cosine_distance(link_embedding, gap_point) + if new_distance < gap_distance: + # This link helps this gap + improvement = gap_distance - new_distance + # Scale improvement - moving from 0.5 to 0.3 is valuable + scaled_improvement = improvement * 2 # Amplify the signal + total_improvement += scaled_improvement + gaps_helped += 1 + + # Average improvement per gap that needs help + gaps_needing_help = sum(1 for _, d in gaps if d > coverage_radius) + if gaps_needing_help > 0: + gap_reduction_score = total_improvement / gaps_needing_help + else: + gap_reduction_score = 0 + + # Check overlap with existing KB (vectorized) + if kb_embeddings is not None and len(kb_embeddings) > 0: + # Normalize embeddings + link_norm = link_embedding / np.linalg.norm(link_embedding) + kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) + + # Compute all similarities at once + similarities = np.dot(kb_norm, link_norm) + max_similarity = np.max(similarities) + + # Only penalize if very similar (above threshold) + overlap_threshold = self.config.embedding_overlap_threshold if hasattr(self, 'config') else 0.85 + if max_similarity > overlap_threshold: + overlap_penalty = (max_similarity - overlap_threshold) * 2 # 0 to 0.3 range + else: + overlap_penalty = 0 + else: + overlap_penalty = 0 + + # Final score - emphasize gap reduction + score = gap_reduction_score * (1 - overlap_penalty) + + # Add contextual score boost if available + if hasattr(link, 'contextual_score') and link.contextual_score: + score = score * 0.8 + link.contextual_score * 0.2 + + scored_links.append((link, score)) + + return sorted(scored_links, key=lambda x: x[1], reverse=True) + + async def calculate_confidence(self, state: CrawlState) -> float: + """Coverage-based learning score (0–1).""" + import numpy as np + + # Guard clauses + if state.kb_embeddings is None or state.query_embeddings is None: + 
return 0.0
+        if len(state.kb_embeddings) == 0 or len(state.query_embeddings) == 0:
+            return 0.0
+
+        # Prepare L2-normalised arrays
+        Q = np.asarray(state.query_embeddings, dtype=np.float32)
+        D = np.asarray(state.kb_embeddings, dtype=np.float32)
+        Q /= np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8
+        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
+
+        # Best cosine per query
+        best = (Q @ D.T).max(axis=1)
+
+        # Mean similarity or hit-rate above tau
+        tau = getattr(self.config, 'coverage_tau', None)
+        score = float((best >= tau).mean()) if tau is not None else float(best.mean())
+
+        # Store quick metrics
+        state.metrics['coverage_score'] = score
+        state.metrics['learning_score'] = score  # Consumed by get_quality_confidence()
+        state.metrics['avg_best_similarity'] = float(best.mean())
+        state.metrics['median_best_similarity'] = float(np.median(best))
+
+        return score
+
+    async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]:
+        """Main entry point for link ranking"""
+        # Store config for use in other methods
+        self.config = config
+
+        # Filter out already crawled URLs and remove duplicates
+        seen_urls = set()
+        uncrawled_links = []
+
+        for link in state.pending_links:
+            if link.href not in state.crawled_urls and link.href not in seen_urls:
+                uncrawled_links.append(link)
+                seen_urls.add(link.href)
+
+        if not
uncrawled_links: + return [] + + # Get gaps in coverage (no threshold needed anymore) + gaps = self.find_coverage_gaps( + state.kb_embeddings, + state.query_embeddings + ) + state.semantic_gaps = [(g[0].tolist(), g[1]) for g in gaps] # Store as list for serialization + + # Select links that fill gaps (only from uncrawled) + return await self.select_links_for_expansion( + uncrawled_links, + gaps, + state.kb_embeddings + ) + + async def validate_coverage(self, state: CrawlState) -> float: + """Validate coverage using held-out queries with caching""" + if not hasattr(self, '_validation_queries') or not self._validation_queries: + return state.metrics.get('confidence', 0.0) + + import numpy as np + + # Cache validation embeddings (only embed once!) + if self._validation_embeddings_cache is None: + self._validation_embeddings_cache = await self._get_embeddings(self._validation_queries) + + val_embeddings = self._validation_embeddings_cache + + # Use vectorized distance computation + if state.kb_embeddings is None or len(state.kb_embeddings) == 0: + return 0.0 + + # Compute distance matrix for validation queries + distance_matrix = self._compute_distance_matrix(val_embeddings, state.kb_embeddings) + + if distance_matrix is None: + return 0.0 + + # Find minimum distance for each validation query (vectorized) + min_distances = np.min(distance_matrix, axis=1) + + # Compute scores using same exponential as training + k_exp = self.config.embedding_k_exp if hasattr(self, 'config') else 1.0 + scores = np.exp(-k_exp * min_distances) + + validation_confidence = np.mean(scores) + state.metrics['validation_confidence'] = validation_confidence + + return validation_confidence + + async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: + """Stop based on learning curve convergence""" + confidence = state.metrics.get('confidence', 0.0) + + # Basic limits + if len(state.crawled_urls) >= config.max_pages or not state.pending_links: + return True + + # Track 
confidence history
+        if not hasattr(state, 'confidence_history'):
+            state.confidence_history = []
+
+        state.confidence_history.append(confidence)
+
+        # Need at least 2 data points to measure improvement
+        if len(state.confidence_history) < 2:
+            return False
+
+        improvement_diffs = list(zip(state.confidence_history[:-1], state.confidence_history[1:]))
+
+        # Calculate average improvement
+        avg_improvement = sum(abs(b - a) for a, b in improvement_diffs) / len(improvement_diffs)
+        state.metrics['avg_improvement'] = avg_improvement
+
+        min_relative_improvement = self.config.embedding_min_relative_improvement * confidence if hasattr(self, 'config') else 0.1 * confidence
+        if avg_improvement < min_relative_improvement:
+            # Converged - validate before stopping
+            val_score = await self.validate_coverage(state)
+
+            # Only stop if validation is reasonable
+            validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4
+            if val_score > validation_min:
+                state.metrics['stopped_reason'] = 'converged_validated'
+                self._validation_passed = True
+                return True
+            else:
+                state.metrics['stopped_reason'] = 'low_validation'
+                # Continue crawling despite convergence
+
+        return False
+
+    def get_quality_confidence(self, state: CrawlState) -> float:
+        """Calculate quality-based confidence score for display"""
+        learning_score = state.metrics.get('learning_score', 0.0)
+        validation_score = state.metrics.get('validation_confidence', 0.0)
+
+        # Get config values
+        validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4
+        quality_min = self.config.embedding_quality_min_confidence if hasattr(self, 'config') else 0.7
+        quality_max = self.config.embedding_quality_max_confidence if hasattr(self, 'config') else 0.95
+        scale_factor = self.config.embedding_quality_scale_factor if hasattr(self, 'config') else 0.833
+
+        if self._validation_passed and validation_score > validation_min:
+            # Validated systems get boosted scores
+            # Map
0.4-0.7 learning → quality_min-quality_max confidence + if learning_score < 0.4: + confidence = quality_min # Minimum for validated systems + elif learning_score > 0.7: + confidence = quality_max # Maximum realistic confidence + else: + # Linear mapping in between + confidence = quality_min + (learning_score - 0.4) * scale_factor + else: + # Not validated = conservative mapping + confidence = learning_score * 0.8 + + return confidence + + async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: + """Update embeddings and coverage metrics with deduplication""" + from .utils import get_text_embeddings + import numpy as np + + # Extract text from results + new_texts = [] + valid_results = [] + for result in new_results: + content = result.markdown.raw_markdown if hasattr(result, 'markdown') and result.markdown else "" + if content: # Only process non-empty content + new_texts.append(content[:5000]) # Limit text length + valid_results.append(result) + + if not new_texts: + return + + # Get embeddings for new texts + embedding_llm_config = { + 'provider': 'openai/text-embedding-3-small', + 'api_token': os.getenv('OPENAI_API_KEY') + } + new_embeddings = await get_text_embeddings(new_texts, embedding_llm_config, self.embedding_model) + + # Deduplicate embeddings before adding to KB + if state.kb_embeddings is None: + # First batch - no deduplication needed + state.kb_embeddings = new_embeddings + deduplicated_indices = list(range(len(new_embeddings))) + else: + # Check for duplicates using vectorized similarity + deduplicated_embeddings = [] + deduplicated_indices = [] + + for i, new_emb in enumerate(new_embeddings): + # Compute similarities with existing KB + new_emb_normalized = new_emb / np.linalg.norm(new_emb) + kb_normalized = state.kb_embeddings / np.linalg.norm(state.kb_embeddings, axis=1, keepdims=True) + similarities = np.dot(kb_normalized, new_emb_normalized) + + # Only add if not too similar to existing content + if 
np.max(similarities) < self._kb_similarity_threshold: + deduplicated_embeddings.append(new_emb) + deduplicated_indices.append(i) + + # Add deduplicated embeddings + if deduplicated_embeddings: + state.kb_embeddings = np.vstack([state.kb_embeddings, np.array(deduplicated_embeddings)]) + + # Update crawl order only for non-duplicate results + for idx in deduplicated_indices: + state.crawl_order.append(valid_results[idx].url) + + # Invalidate distance matrix cache since KB changed + self._kb_embeddings_hash = None + self._distance_matrix_cache = None + + # Update coverage shape if needed + if hasattr(state, 'query_embeddings') and state.query_embeddings is not None: + state.coverage_shape = self.compute_coverage_shape(state.query_embeddings, self.config.alpha_shape_alpha if hasattr(self, 'config') else 0.5) + + +class AdaptiveCrawler: + """Main adaptive crawler that orchestrates the crawling process""" + + def __init__(self, + crawler: Optional[AsyncWebCrawler] = None, + config: Optional[AdaptiveConfig] = None, + strategy: Optional[CrawlStrategy] = None): + self.crawler = crawler + self.config = config or AdaptiveConfig() + self.config.validate() + + # Create strategy based on config + if strategy: + self.strategy = strategy + else: + self.strategy = self._create_strategy(self.config.strategy) + + # Initialize state + self.state: Optional[CrawlState] = None + + # Track if we own the crawler (for cleanup) + self._owns_crawler = crawler is None + + def _create_strategy(self, strategy_name: str) -> CrawlStrategy: + """Create strategy instance based on name""" + if strategy_name == "statistical": + return StatisticalStrategy() + elif strategy_name == "embedding": + return EmbeddingStrategy( + embedding_model=self.config.embedding_model, + llm_config=self.config.embedding_llm_config + ) + else: + raise ValueError(f"Unknown strategy: {strategy_name}") + + async def digest(self, + start_url: str, + query: str, + resume_from: Optional[str] = None) -> CrawlState: + """Main 
entry point for adaptive crawling""" + # Initialize or resume state + if resume_from: + self.state = CrawlState.load(resume_from) + self.state.query = query # Update query in case it changed + else: + self.state = CrawlState( + crawled_urls=set(), + knowledge_base=[], + pending_links=[], + query=query, + metrics={} + ) + + # Create crawler if needed + if not self.crawler: + self.crawler = AsyncWebCrawler() + await self.crawler.__aenter__() + + self.strategy.config = self.config # Pass config to strategy + + # If using embedding strategy and not resuming, expand query space + if isinstance(self.strategy, EmbeddingStrategy) and not resume_from: + # Generate query space + query_embeddings, expanded_queries = await self.strategy.map_query_semantic_space( + query, + self.config.n_query_variations + ) + self.state.query_embeddings = query_embeddings + self.state.expanded_queries = expanded_queries[1:] # Skip original query + self.state.embedding_model = self.strategy.embedding_model + + try: + # Initial crawl if not resuming + if start_url not in self.state.crawled_urls: + result = await self._crawl_with_preview(start_url, query) + if result and hasattr(result, 'success') and result.success: + self.state.knowledge_base.append(result) + self.state.crawled_urls.add(start_url) + # Extract links from result - handle both dict and Links object formats + if hasattr(result, 'links') and result.links: + if isinstance(result.links, dict): + # Extract internal and external links from dict + internal_links = [Link(**link) for link in result.links.get('internal', [])] + external_links = [Link(**link) for link in result.links.get('external', [])] + self.state.pending_links.extend(internal_links + external_links) + else: + # Handle Links object + self.state.pending_links.extend(result.links.internal + result.links.external) + + # Update state + await self.strategy.update_state(self.state, [result]) + + # adaptive expansion + depth = 0 + while depth < self.config.max_depth: + # 
Calculate confidence + confidence = await self.strategy.calculate_confidence(self.state) + self.state.metrics['confidence'] = confidence + + # Check stopping criteria + if await self.strategy.should_stop(self.state, self.config): + break + + # Rank candidate links + ranked_links = await self.strategy.rank_links(self.state, self.config) + + if not ranked_links: + break + + # Check minimum gain threshold + if ranked_links[0][1] < self.config.min_gain_threshold: + break + + # Select top K links + to_crawl = [(link, score) for link, score in ranked_links[:self.config.top_k_links] + if link.href not in self.state.crawled_urls] + + if not to_crawl: + break + + # Crawl selected links + new_results = await self._crawl_batch(to_crawl, query) + + if new_results: + # Update knowledge base + self.state.knowledge_base.extend(new_results) + + # Update crawled URLs and pending links + for result, (link, _) in zip(new_results, to_crawl): + if result: + self.state.crawled_urls.add(link.href) + # Extract links from result - handle both dict and Links object formats + if hasattr(result, 'links') and result.links: + new_links = [] + if isinstance(result.links, dict): + # Extract internal and external links from dict + internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])] + external_links = [Link(**link_data) for link_data in result.links.get('external', [])] + new_links = internal_links + external_links + else: + # Handle Links object + new_links = result.links.internal + result.links.external + + # Add new links to pending + for new_link in new_links: + if new_link.href not in self.state.crawled_urls: + self.state.pending_links.append(new_link) + + # Update state with new results + await self.strategy.update_state(self.state, new_results) + + depth += 1 + + # Save state if configured + if self.config.save_state and self.config.state_path: + self.state.save(self.config.state_path) + + # Final confidence calculation + learning_score = await 
self.strategy.calculate_confidence(self.state)
+
+            # For embedding strategy, get quality-based confidence
+            if isinstance(self.strategy, EmbeddingStrategy):
+                self.state.metrics['confidence'] = self.strategy.get_quality_confidence(self.state)
+            else:
+                # For statistical strategy, use the same as before
+                self.state.metrics['confidence'] = learning_score
+
+            self.state.metrics['pages_crawled'] = len(self.state.crawled_urls)
+            self.state.metrics['depth_reached'] = depth
+
+            # Final save
+            if self.config.save_state and self.config.state_path:
+                self.state.save(self.config.state_path)
+
+            return self.state
+
+        finally:
+            # Cleanup if we created the crawler
+            if self._owns_crawler and self.crawler:
+                await self.crawler.__aexit__(None, None, None)
+
+    async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]:
+        """Crawl a URL with link preview enabled"""
+        config = CrawlerRunConfig(
+            link_preview_config=LinkPreviewConfig(
+                include_internal=True,
+                include_external=False,
+                query=query,  # For BM25 scoring
+                concurrency=5,
+                timeout=5,
+                max_links=50,  # Reasonable limit
+                verbose=False
+            ),
+            score_links=True  # Enable intrinsic scoring
+        )
+
+        try:
+            result = await self.crawler.arun(url=url, config=config)
+            # Extract the actual CrawlResult from the container
+            if hasattr(result, '_results') and result._results:
+                result = result._results[0]
+
+            # Filter out all internal links that do not have head_data
+            if hasattr(result, 'links') and isinstance(result.links, dict):
+                result.links['internal'] = [link for link in result.links.get('internal', []) if link.get('head_data')]
+                # For now, ignore external links without head_data
+                # result.links['external'] = [link for link in result.links['external'] if link.get('head_data')]
+
+            return result
+        except Exception as e:
+            print(f"Error crawling {url}: {e}")
+            return None
+
+    async def _crawl_batch(self, links_with_scores: List[Tuple[Link, float]], query: str) -> List[CrawlResult]:
+        """Crawl multiple URLs in parallel"""
+        tasks = []
+        
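`_crawl_batch` fans the selected links out through `asyncio.gather(..., return_exceptions=True)`, so one failed fetch cannot abort its siblings. A minimal sketch of that error-isolation pattern (the `fetch` coroutine and URLs are hypothetical, not part of the crawler):

```python
import asyncio

async def fetch(url: str) -> str:
    # Hypothetical fetcher: fail for one URL to show per-task error isolation.
    if "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    return f"content of {url}"

async def crawl_batch(urls):
    # Exceptions come back as values instead of propagating, mirroring
    # the isinstance(result, Exception) filtering done in _crawl_batch.
    results = await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

ok = asyncio.run(crawl_batch(["https://a.example", "https://bad.example", "https://b.example"]))
assert ok == ["content of https://a.example", "content of https://b.example"]
```

Without `return_exceptions=True`, the first raised exception would cancel the remaining tasks and surface from `gather`, losing the successful results.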
for link, score in links_with_scores: + task = self._crawl_with_preview(link.href, query) + tasks.append(task) + + results = await asyncio.gather(*tasks, return_exceptions=True) + + # Filter out exceptions and failed crawls + valid_results = [] + for result in results: + if isinstance(result, CrawlResult): + # Only include successful crawls + if hasattr(result, 'success') and result.success: + valid_results.append(result) + else: + print(f"Skipping failed crawl: {result.url if hasattr(result, 'url') else 'unknown'}") + elif isinstance(result, Exception): + print(f"Error in batch crawl: {result}") + + return valid_results + + # Status properties + @property + def confidence(self) -> float: + """Current confidence level""" + if self.state: + return self.state.metrics.get('confidence', 0.0) + return 0.0 + + @property + def coverage_stats(self) -> Dict[str, Any]: + """Detailed coverage statistics""" + if not self.state: + return {} + + total_content_length = sum( + len(result.markdown.raw_markdown or "") + for result in self.state.knowledge_base + ) + + return { + 'pages_crawled': len(self.state.crawled_urls), + 'total_content_length': total_content_length, + 'unique_terms': len(self.state.term_frequencies), + 'total_terms': sum(self.state.term_frequencies.values()), + 'pending_links': len(self.state.pending_links), + 'confidence': self.confidence, + 'coverage': self.state.metrics.get('coverage', 0.0), + 'consistency': self.state.metrics.get('consistency', 0.0), + 'saturation': self.state.metrics.get('saturation', 0.0) + } + + @property + def is_sufficient(self) -> bool: + """Check if current knowledge is sufficient""" + if isinstance(self.strategy, EmbeddingStrategy): + # For embedding strategy, sufficient = validation passed + return self.strategy._validation_passed + else: + # For statistical strategy, use threshold + return self.confidence >= self.config.confidence_threshold + + def print_stats(self, detailed: bool = False) -> None: + """Print comprehensive 
statistics about the knowledge base + + Args: + detailed: If True, show detailed statistics including top terms + """ + if not self.state: + print("No crawling state available.") + return + + # Import here to avoid circular imports + try: + from rich.console import Console + from rich.table import Table + console = Console() + use_rich = True + except ImportError: + use_rich = False + + if not detailed and use_rich: + # Summary view with nice table (like original) + table = Table(title=f"Adaptive Crawl Stats - Query: '{self.state.query}'") + table.add_column("Metric", style="cyan", no_wrap=True) + table.add_column("Value", style="magenta") + + # Basic stats + stats = self.coverage_stats + table.add_row("Pages Crawled", str(stats.get('pages_crawled', 0))) + table.add_row("Unique Terms", str(stats.get('unique_terms', 0))) + table.add_row("Total Terms", str(stats.get('total_terms', 0))) + table.add_row("Content Length", f"{stats.get('total_content_length', 0):,} chars") + table.add_row("Pending Links", str(stats.get('pending_links', 0))) + table.add_row("", "") # Spacer + + # Strategy-specific metrics + if isinstance(self.strategy, EmbeddingStrategy): + # Embedding-specific metrics + table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") + table.add_row("Avg Min Distance", f"{self.state.metrics.get('avg_min_distance', 0):.3f}") + table.add_row("Avg Close Neighbors", f"{self.state.metrics.get('avg_close_neighbors', 0):.1f}") + table.add_row("Validation Score", f"{self.state.metrics.get('validation_confidence', 0):.2%}") + table.add_row("", "") # Spacer + table.add_row("Is Sufficient?", "[green]Yes (Validated)[/green]" if self.is_sufficient else "[red]No[/red]") + else: + # Statistical strategy metrics + table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") + table.add_row("Coverage", f"{stats.get('coverage', 0):.2%}") + table.add_row("Consistency", f"{stats.get('consistency', 0):.2%}") + table.add_row("Saturation", f"{stats.get('saturation', 
0):.2%}") + table.add_row("", "") # Spacer + table.add_row("Is Sufficient?", "[green]Yes[/green]" if self.is_sufficient else "[red]No[/red]") + + console.print(table) + else: + # Detailed view or fallback when rich not available + print("\n" + "="*80) + print(f"Adaptive Crawl Statistics - Query: '{self.state.query}'") + print("="*80) + + # Basic stats + print("\n[*] Basic Statistics:") + print(f" Pages Crawled: {len(self.state.crawled_urls)}") + print(f" Pending Links: {len(self.state.pending_links)}") + print(f" Total Documents: {self.state.total_documents}") + + # Content stats + total_content_length = sum( + len(self._get_content_from_result(result)) + for result in self.state.knowledge_base + ) + total_words = sum(self.state.term_frequencies.values()) + unique_terms = len(self.state.term_frequencies) + + print(f"\n[*] Content Statistics:") + print(f" Total Content: {total_content_length:,} characters") + print(f" Total Words: {total_words:,}") + print(f" Unique Terms: {unique_terms:,}") + if total_words > 0: + print(f" Vocabulary Richness: {unique_terms/total_words:.2%}") + + # Strategy-specific output + if isinstance(self.strategy, EmbeddingStrategy): + # Semantic coverage for embedding strategy + print(f"\n[*] Semantic Coverage Analysis:") + print(f" Average Min Distance: {self.state.metrics.get('avg_min_distance', 0):.3f}") + print(f" Avg Close Neighbors (< 0.3): {self.state.metrics.get('avg_close_neighbors', 0):.1f}") + print(f" Avg Very Close Neighbors (< 0.2): {self.state.metrics.get('avg_very_close_neighbors', 0):.1f}") + + # Confidence metrics + print(f"\n[*] Confidence Metrics:") + if self.is_sufficient: + if use_rich: + console.print(f" Overall Confidence: {self.confidence:.2%} [green][VALIDATED][/green]") + else: + print(f" Overall Confidence: {self.confidence:.2%} [VALIDATED]") + else: + if use_rich: + console.print(f" Overall Confidence: {self.confidence:.2%} [red][NOT VALIDATED][/red]") + else: + print(f" Overall Confidence: {self.confidence:.2%} 
[NOT VALIDATED]") + + print(f" Learning Score: {self.state.metrics.get('learning_score', 0):.2%}") + print(f" Validation Score: {self.state.metrics.get('validation_confidence', 0):.2%}") + + else: + # Query coverage for statistical strategy + print(f"\n[*] Query Coverage:") + query_terms = self.strategy._tokenize(self.state.query.lower()) + for term in query_terms: + tf = self.state.term_frequencies.get(term, 0) + df = self.state.document_frequencies.get(term, 0) + if df > 0: + if use_rich: + console.print(f" '{term}': found in {df}/{self.state.total_documents} docs ([green]{df/self.state.total_documents:.0%}[/green]), {tf} occurrences") + else: + print(f" '{term}': found in {df}/{self.state.total_documents} docs ({df/self.state.total_documents:.0%}), {tf} occurrences") + else: + if use_rich: + console.print(f" '{term}': [red][X] not found[/red]") + else: + print(f" '{term}': [X] not found") + + # Confidence metrics + print(f"\n[*] Confidence Metrics:") + status = "[OK]" if self.is_sufficient else "[!!]" + if use_rich: + status_colored = "[green][OK][/green]" if self.is_sufficient else "[red][!!][/red]" + console.print(f" Overall Confidence: {self.confidence:.2%} {status_colored}") + else: + print(f" Overall Confidence: {self.confidence:.2%} {status}") + print(f" Coverage Score: {self.state.metrics.get('coverage', 0):.2%}") + print(f" Consistency Score: {self.state.metrics.get('consistency', 0):.2%}") + print(f" Saturation Score: {self.state.metrics.get('saturation', 0):.2%}") + + # Crawl efficiency + if self.state.new_terms_history: + avg_new_terms = sum(self.state.new_terms_history) / len(self.state.new_terms_history) + print(f"\n[*] Crawl Efficiency:") + print(f" Avg New Terms per Page: {avg_new_terms:.1f}") + print(f" Information Saturation: {self.state.metrics.get('saturation', 0):.2%}") + + if detailed: + print("\n" + "-"*80) + if use_rich: + console.print("[bold cyan]DETAILED STATISTICS[/bold cyan]") + else: + print("DETAILED STATISTICS") + print("-"*80) + + 
# Top terms + print("\n[+] Top 20 Terms by Frequency:") + top_terms = sorted(self.state.term_frequencies.items(), key=lambda x: x[1], reverse=True)[:20] + for i, (term, freq) in enumerate(top_terms, 1): + df = self.state.document_frequencies.get(term, 0) + if use_rich: + console.print(f" {i:2d}. [yellow]'{term}'[/yellow]: {freq} occurrences in {df} docs") + else: + print(f" {i:2d}. '{term}': {freq} occurrences in {df} docs") + + # URLs crawled + print(f"\n[+] URLs Crawled ({len(self.state.crawled_urls)}):") + for i, url in enumerate(self.state.crawl_order, 1): + new_terms = self.state.new_terms_history[i-1] if i <= len(self.state.new_terms_history) else 0 + if use_rich: + console.print(f" {i}. [cyan]{url}[/cyan]") + console.print(f" -> Added [green]{new_terms}[/green] new terms") + else: + print(f" {i}. {url}") + print(f" -> Added {new_terms} new terms") + + # Document frequency distribution + print("\n[+] Document Frequency Distribution:") + df_counts = {} + for df in self.state.document_frequencies.values(): + df_counts[df] = df_counts.get(df, 0) + 1 + + for df in sorted(df_counts.keys()): + count = df_counts[df] + print(f" Terms in {df} docs: {count} terms") + + # Embedding stats + if self.state.embedding_model: + print("\n[+] Semantic Coverage Analysis:") + print(f" Embedding Model: {self.state.embedding_model}") + print(f" Query Variations: {len(self.state.expanded_queries)}") + if self.state.kb_embeddings is not None: + print(f" Knowledge Embeddings: {self.state.kb_embeddings.shape}") + else: + print(f" Knowledge Embeddings: None") + print(f" Semantic Gaps: {len(self.state.semantic_gaps)}") + print(f" Coverage Achievement: {self.confidence:.2%}") + + # Show sample expanded queries + if self.state.expanded_queries: + print("\n[+] Query Space (samples):") + for i, eq in enumerate(self.state.expanded_queries[:5], 1): + if use_rich: + console.print(f" {i}. [yellow]{eq}[/yellow]") + else: + print(f" {i}. 
{eq}") + + print("\n" + "="*80) + + def _get_content_from_result(self, result) -> str: + """Helper to safely extract content from result""" + if hasattr(result, 'markdown') and result.markdown: + if hasattr(result.markdown, 'raw_markdown'): + return result.markdown.raw_markdown or "" + return str(result.markdown) + return "" + + def export_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: + """Export the knowledge base to a file + + Args: + filepath: Path to save the file + format: Export format - currently supports 'jsonl' + """ + if not self.state or not self.state.knowledge_base: + print("No knowledge base to export.") + return + + filepath = Path(filepath) + filepath.parent.mkdir(parents=True, exist_ok=True) + + if format == "jsonl": + # Export as JSONL - one CrawlResult per line + with open(filepath, 'w', encoding='utf-8') as f: + for result in self.state.knowledge_base: + # Convert CrawlResult to dict + result_dict = self._crawl_result_to_export_dict(result) + # Write as single line JSON + f.write(json.dumps(result_dict, ensure_ascii=False) + '\n') + + print(f"Exported {len(self.state.knowledge_base)} documents to {filepath}") + else: + raise ValueError(f"Unsupported export format: {format}") + + def _crawl_result_to_export_dict(self, result) -> Dict[str, Any]: + """Convert CrawlResult to a dictionary for export""" + # Extract all available fields + export_dict = { + 'url': getattr(result, 'url', ''), + 'timestamp': getattr(result, 'timestamp', None), + 'success': getattr(result, 'success', True), + 'query': self.state.query if self.state else '', + } + + # Extract content + if hasattr(result, 'markdown') and result.markdown: + if hasattr(result.markdown, 'raw_markdown'): + export_dict['content'] = result.markdown.raw_markdown + else: + export_dict['content'] = str(result.markdown) + else: + export_dict['content'] = '' + + # Extract metadata + if hasattr(result, 'metadata'): + export_dict['metadata'] = result.metadata + + # 
Extract links if available + if hasattr(result, 'links'): + export_dict['links'] = result.links + + # Add crawl-specific metadata + if self.state: + export_dict['crawl_metadata'] = { + 'crawl_order': self.state.crawl_order.index(export_dict['url']) + 1 if export_dict['url'] in self.state.crawl_order else 0, + 'confidence_at_crawl': self.state.metrics.get('confidence', 0), + 'total_documents': self.state.total_documents + } + + return export_dict + + def import_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: + """Import a knowledge base from a file + + Args: + filepath: Path to the file to import + format: Import format - currently supports 'jsonl' + """ + filepath = Path(filepath) + if not filepath.exists(): + raise FileNotFoundError(f"File not found: {filepath}") + + if format == "jsonl": + imported_results = [] + with open(filepath, 'r', encoding='utf-8') as f: + for line in f: + if line.strip(): + data = json.loads(line) + # Convert back to a mock CrawlResult + mock_result = self._import_dict_to_crawl_result(data) + imported_results.append(mock_result) + + # Initialize state if needed + if not self.state: + self.state = CrawlState() + + # Add imported results + self.state.knowledge_base.extend(imported_results) + + # Update state with imported data + asyncio.run(self.strategy.update_state(self.state, imported_results)) + + print(f"Imported {len(imported_results)} documents from {filepath}") + else: + raise ValueError(f"Unsupported import format: {format}") + + def _import_dict_to_crawl_result(self, data: Dict[str, Any]): + """Convert imported dict back to a mock CrawlResult""" + class MockMarkdown: + def __init__(self, content): + self.raw_markdown = content + + class MockCrawlResult: + def __init__(self, data): + self.url = data.get('url', '') + self.markdown = MockMarkdown(data.get('content', '')) + self.links = data.get('links', {}) + self.metadata = data.get('metadata', {}) + self.success = data.get('success', True) + 
self.timestamp = data.get('timestamp') + + return MockCrawlResult(data) + + def get_relevant_content(self, top_k: int = 5) -> List[Dict[str, Any]]: + """Get most relevant content for the query""" + if not self.state or not self.state.knowledge_base: + return [] + + # Simple relevance ranking based on term overlap + scored_docs = [] + query_terms = set(self.state.query.lower().split()) + + for i, result in enumerate(self.state.knowledge_base): + content = (result.markdown.raw_markdown or "").lower() + content_terms = set(content.split()) + + # Calculate relevance score + overlap = len(query_terms & content_terms) + score = overlap / len(query_terms) if query_terms else 0.0 + + scored_docs.append({ + 'url': result.url, + 'score': score, + 'content': result.markdown.raw_markdown, + 'index': i + }) + + # Sort by score and return top K + scored_docs.sort(key=lambda x: x['score'], reverse=True) + return scored_docs[:top_k] \ No newline at end of file diff --git a/crawl4ai/adaptive_crawler.py b/crawl4ai/adaptive_crawler.py new file mode 100644 index 00000000..a0b8fa9c --- /dev/null +++ b/crawl4ai/adaptive_crawler.py @@ -0,0 +1,1861 @@ +""" +Adaptive Web Crawler for Crawl4AI + +This module implements adaptive information foraging for efficient web crawling. +It determines when sufficient information has been gathered to answer a query, +avoiding unnecessary crawls while ensuring comprehensive coverage. 
+""" + +from abc import ABC, abstractmethod +from typing import Dict, List, Optional, Set, Tuple, Any, Union +from dataclasses import dataclass, field +import asyncio +import pickle +import os +import json +import math +from collections import defaultdict, Counter +import re +from pathlib import Path + +from crawl4ai.async_webcrawler import AsyncWebCrawler +from crawl4ai.async_configs import CrawlerRunConfig, LinkPreviewConfig +from crawl4ai.models import Link, CrawlResult +import numpy as np + +@dataclass +class CrawlState: + """Tracks the current state of adaptive crawling""" + crawled_urls: Set[str] = field(default_factory=set) + knowledge_base: List[CrawlResult] = field(default_factory=list) + pending_links: List[Link] = field(default_factory=list) + query: str = "" + metrics: Dict[str, float] = field(default_factory=dict) + + # Statistical tracking + term_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) + document_frequencies: Dict[str, int] = field(default_factory=lambda: defaultdict(int)) + documents_with_terms: Dict[str, Set[int]] = field(default_factory=lambda: defaultdict(set)) + total_documents: int = 0 + + # History tracking for saturation + new_terms_history: List[int] = field(default_factory=list) + crawl_order: List[str] = field(default_factory=list) + + # Embedding-specific tracking (only if strategy is embedding) + kb_embeddings: Optional[Any] = None # Will be numpy array + query_embeddings: Optional[Any] = None # Will be numpy array + expanded_queries: List[str] = field(default_factory=list) + coverage_shape: Optional[Any] = None # Alpha shape + semantic_gaps: List[Tuple[List[float], float]] = field(default_factory=list) # Serializable + embedding_model: str = "" + + def save(self, path: Union[str, Path]): + """Save state to disk for persistence""" + path = Path(path) + path.parent.mkdir(parents=True, exist_ok=True) + + # Convert CrawlResult objects to dicts for serialization + state_dict = { + 'crawled_urls': 
list(self.crawled_urls), + 'knowledge_base': [self._crawl_result_to_dict(cr) for cr in self.knowledge_base], + 'pending_links': [link.model_dump() for link in self.pending_links], + 'query': self.query, + 'metrics': self.metrics, + 'term_frequencies': dict(self.term_frequencies), + 'document_frequencies': dict(self.document_frequencies), + 'documents_with_terms': {k: list(v) for k, v in self.documents_with_terms.items()}, + 'total_documents': self.total_documents, + 'new_terms_history': self.new_terms_history, + 'crawl_order': self.crawl_order, + # Embedding-specific fields (convert numpy arrays to lists for JSON) + 'kb_embeddings': self.kb_embeddings.tolist() if self.kb_embeddings is not None else None, + 'query_embeddings': self.query_embeddings.tolist() if self.query_embeddings is not None else None, + 'expanded_queries': self.expanded_queries, + 'semantic_gaps': self.semantic_gaps, + 'embedding_model': self.embedding_model + } + + with open(path, 'w') as f: + json.dump(state_dict, f, indent=2) + + @classmethod + def load(cls, path: Union[str, Path]) -> 'CrawlState': + """Load state from disk""" + path = Path(path) + with open(path, 'r') as f: + state_dict = json.load(f) + + state = cls() + state.crawled_urls = set(state_dict['crawled_urls']) + state.knowledge_base = [cls._dict_to_crawl_result(d) for d in state_dict['knowledge_base']] + state.pending_links = [Link(**link_dict) for link_dict in state_dict['pending_links']] + state.query = state_dict['query'] + state.metrics = state_dict['metrics'] + state.term_frequencies = defaultdict(int, state_dict['term_frequencies']) + state.document_frequencies = defaultdict(int, state_dict['document_frequencies']) + state.documents_with_terms = defaultdict(set, {k: set(v) for k, v in state_dict['documents_with_terms'].items()}) + state.total_documents = state_dict['total_documents'] + state.new_terms_history = state_dict['new_terms_history'] + state.crawl_order = state_dict['crawl_order'] + + # Load embedding-specific 
fields (convert lists back to numpy arrays) + + state.kb_embeddings = np.array(state_dict['kb_embeddings']) if state_dict.get('kb_embeddings') is not None else None + state.query_embeddings = np.array(state_dict['query_embeddings']) if state_dict.get('query_embeddings') is not None else None + state.expanded_queries = state_dict.get('expanded_queries', []) + state.semantic_gaps = state_dict.get('semantic_gaps', []) + state.embedding_model = state_dict.get('embedding_model', '') + + return state + + @staticmethod + def _crawl_result_to_dict(cr: CrawlResult) -> Dict: + """Convert CrawlResult to serializable dict""" + # Extract markdown content safely + markdown_content = "" + if hasattr(cr, 'markdown') and cr.markdown: + if hasattr(cr.markdown, 'raw_markdown'): + markdown_content = cr.markdown.raw_markdown + else: + markdown_content = str(cr.markdown) + + return { + 'url': cr.url, + 'content': markdown_content, + 'links': cr.links if hasattr(cr, 'links') else {}, + 'metadata': cr.metadata if hasattr(cr, 'metadata') else {} + } + + @staticmethod + def _dict_to_crawl_result(d: Dict): + """Convert dict back to CrawlResult""" + # Create a mock object that has the minimal interface we need + class MockMarkdown: + def __init__(self, content): + self.raw_markdown = content + + class MockCrawlResult: + def __init__(self, url, content, links, metadata): + self.url = url + self.markdown = MockMarkdown(content) + self.links = links + self.metadata = metadata + + return MockCrawlResult( + url=d['url'], + content=d.get('content', ''), + links=d.get('links', {}), + metadata=d.get('metadata', {}) + ) + + +@dataclass +class AdaptiveConfig: + """Configuration for adaptive crawling""" + confidence_threshold: float = 0.7 + max_depth: int = 5 + max_pages: int = 20 + top_k_links: int = 3 + min_gain_threshold: float = 0.1 + strategy: str = "statistical" # statistical, embedding, llm + + # Advanced parameters + saturation_threshold: float = 0.8 + consistency_threshold: float = 0.7 + 
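# The three weights below blend the statistical metrics into one confidence + # score: confidence = coverage_weight*coverage + consistency_weight*consistency + # + saturation_weight*saturation. + # Illustrative example with the defaults (0.4/0.3/0.3): coverage=0.8, + # consistency=0.6, saturation=0.5 gives 0.4*0.8 + 0.3*0.6 + 0.3*0.5 = 0.65, + # just below the default confidence_threshold of 0.7, so crawling continues.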
coverage_weight: float = 0.4 + consistency_weight: float = 0.3 + saturation_weight: float = 0.3 + + # Link scoring parameters + relevance_weight: float = 0.5 + novelty_weight: float = 0.3 + authority_weight: float = 0.2 + + # Persistence + save_state: bool = False + state_path: Optional[str] = None + + # Embedding strategy parameters + embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2" + embedding_llm_config: Optional[Dict] = None # Separate config for embeddings + n_query_variations: int = 10 + coverage_threshold: float = 0.85 + alpha_shape_alpha: float = 0.5 + + # Minimum confidence threshold for relevance + embedding_min_confidence_threshold: float = 0.1 # Below this, content is considered completely irrelevant + # Example: If confidence < 0.1, stop immediately as query and content are unrelated + + # Embedding confidence calculation parameters + embedding_coverage_radius: float = 0.2 # Distance threshold for "covered" query points + # Example: With radius=0.2, a query point is considered covered if ANY document + # is within cosine distance 0.2 (very similar). Smaller = stricter coverage requirement + + embedding_k_exp: float = 1.0 # Exponential decay factor for distance-to-score mapping + # Example: score = exp(-k_exp * distance). With k_exp=1, distance 0.2 → score 0.82, + # distance 0.5 → score 0.61. 
Higher k_exp = steeper decay = more emphasis on very close matches + + embedding_nearest_weight: float = 0.7 # Weight for nearest neighbor in hybrid scoring + embedding_top_k_weight: float = 0.3 # Weight for top-k average in hybrid scoring + # Example: If nearest doc has score 0.9 and top-3 avg is 0.6, final = 0.7*0.9 + 0.3*0.6 = 0.81 + # Higher nearest_weight = more focus on best match vs neighborhood density + + # Embedding link selection parameters + embedding_overlap_threshold: float = 0.85 # Similarity threshold for penalizing redundant links + # Example: Links with >0.85 similarity to existing KB get penalized to avoid redundancy + # Lower = more aggressive deduplication, Higher = allow more similar content + + # Embedding stopping criteria parameters + embedding_min_relative_improvement: float = 0.1 # Minimum relative improvement to continue + # Example: If confidence is 0.6, need improvement > 0.06 per batch to continue crawling + # Lower = more patient crawling, Higher = stop earlier when progress slows + + embedding_validation_min_score: float = 0.3 # Minimum validation score to trust convergence + # Example: Even if learning converged, keep crawling if validation score < 0.3 + # This prevents premature stopping when we haven't truly covered the query space + + # Quality confidence mapping parameters (for display to user) + embedding_quality_min_confidence: float = 0.7 # Minimum confidence for validated systems + embedding_quality_max_confidence: float = 0.95 # Maximum realistic confidence + embedding_quality_scale_factor: float = 0.833 # Scaling factor for confidence mapping + # Example: Validated system with learning_score=0.5 → confidence = 0.7 + (0.5-0.4)*0.833 = 0.78 + # These control how internal scores map to user-friendly confidence percentages + + def validate(self): + """Validate configuration parameters""" + assert 0 <= self.confidence_threshold <= 1, "confidence_threshold must be between 0 and 1" + assert self.max_depth > 0, "max_depth must be
positive" + assert self.max_pages > 0, "max_pages must be positive" + assert self.top_k_links > 0, "top_k_links must be positive" + assert 0 <= self.min_gain_threshold <= 1, "min_gain_threshold must be between 0 and 1" + + # Check weights sum to 1 + weight_sum = self.coverage_weight + self.consistency_weight + self.saturation_weight + assert abs(weight_sum - 1.0) < 0.001, f"Coverage weights must sum to 1, got {weight_sum}" + + weight_sum = self.relevance_weight + self.novelty_weight + self.authority_weight + assert abs(weight_sum - 1.0) < 0.001, f"Link scoring weights must sum to 1, got {weight_sum}" + + # Validate embedding parameters + assert 0 < self.embedding_coverage_radius < 1, "embedding_coverage_radius must be between 0 and 1" + assert self.embedding_k_exp > 0, "embedding_k_exp must be positive" + assert 0 <= self.embedding_nearest_weight <= 1, "embedding_nearest_weight must be between 0 and 1" + assert 0 <= self.embedding_top_k_weight <= 1, "embedding_top_k_weight must be between 0 and 1" + assert abs(self.embedding_nearest_weight + self.embedding_top_k_weight - 1.0) < 0.001, "Embedding weights must sum to 1" + assert 0 <= self.embedding_overlap_threshold <= 1, "embedding_overlap_threshold must be between 0 and 1" + assert 0 < self.embedding_min_relative_improvement < 1, "embedding_min_relative_improvement must be between 0 and 1" + assert 0 <= self.embedding_validation_min_score <= 1, "embedding_validation_min_score must be between 0 and 1" + assert 0 <= self.embedding_quality_min_confidence <= 1, "embedding_quality_min_confidence must be between 0 and 1" + assert 0 <= self.embedding_quality_max_confidence <= 1, "embedding_quality_max_confidence must be between 0 and 1" + assert self.embedding_quality_scale_factor > 0, "embedding_quality_scale_factor must be positive" + assert 0 <= self.embedding_min_confidence_threshold <= 1, "embedding_min_confidence_threshold must be between 0 and 1" + + +class CrawlStrategy(ABC): + """Abstract base class for crawling 
strategies""" + + @abstractmethod + async def calculate_confidence(self, state: CrawlState) -> float: + """Calculate overall confidence that we have sufficient information""" + pass + + @abstractmethod + async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: + """Rank pending links by expected information gain""" + pass + + @abstractmethod + async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: + """Determine if crawling should stop""" + pass + + @abstractmethod + async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: + """Update state with new crawl results""" + pass + + +class StatisticalStrategy(CrawlStrategy): + """Pure statistical approach - no LLM, no embeddings""" + + def __init__(self): + self.idf_cache = {} + self.bm25_k1 = 1.2 # BM25 parameter + self.bm25_b = 0.75 # BM25 parameter + + async def calculate_confidence(self, state: CrawlState) -> float: + """Calculate confidence using coverage, consistency, and saturation""" + if not state.knowledge_base: + return 0.0 + + coverage = self._calculate_coverage(state) + consistency = self._calculate_consistency(state) + saturation = self._calculate_saturation(state) + + # Store individual metrics + state.metrics['coverage'] = coverage + state.metrics['consistency'] = consistency + state.metrics['saturation'] = saturation + + # Weighted combination (weights from config not accessible here, using defaults) + confidence = 0.4 * coverage + 0.3 * consistency + 0.3 * saturation + + return confidence + + def _calculate_coverage(self, state: CrawlState) -> float: + """Coverage scoring - measures query term presence across knowledge base + + Returns a score between 0 and 1, where: + - 0 means no query terms found + - 1 means excellent coverage of all query terms + """ + if not state.query or state.total_documents == 0: + return 0.0 + + query_terms = self._tokenize(state.query.lower()) + if not query_terms: + return 0.0 
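# Worked example (illustrative numbers): a query term found in 3 of 5 + # documents (df=3) with tf=10, while the most frequent term has max_tf=100: + # doc_coverage = 3/5 = 0.6 + # freq_signal = log(1+10)/log(1+100) ≈ 0.52 + # term_score = 0.6 * (1 + 0.5*0.52) ≈ 0.76 + # Per-term scores are then averaged and passed through sqrt() below.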
+ + term_scores = [] + max_tf = max(state.term_frequencies.values()) if state.term_frequencies else 1 + + for term in query_terms: + tf = state.term_frequencies.get(term, 0) + df = state.document_frequencies.get(term, 0) + + if df > 0: + # Document coverage: what fraction of docs contain this term + doc_coverage = df / state.total_documents + + # Frequency signal: normalized log frequency + freq_signal = math.log(1 + tf) / math.log(1 + max_tf) if max_tf > 0 else 0 + + # Combined score: document coverage with frequency boost + term_score = doc_coverage * (1 + 0.5 * freq_signal) + term_scores.append(term_score) + else: + term_scores.append(0.0) + + # Average across all query terms + coverage = sum(term_scores) / len(term_scores) + + # Apply square root curve to make score more intuitive + # This helps differentiate between partial and good coverage + return min(1.0, math.sqrt(coverage)) + + def _calculate_consistency(self, state: CrawlState) -> float: + """Information overlap between pages - high overlap suggests coherent topic coverage""" + if len(state.knowledge_base) < 2: + return 1.0 # Single or no documents are perfectly consistent + + # Calculate pairwise term overlap + overlaps = [] + + for i in range(len(state.knowledge_base)): + for j in range(i + 1, len(state.knowledge_base)): + # Get terms from both documents + terms_i = set(self._get_document_terms(state.knowledge_base[i])) + terms_j = set(self._get_document_terms(state.knowledge_base[j])) + + if terms_i and terms_j: + # Jaccard similarity + overlap = len(terms_i & terms_j) / len(terms_i | terms_j) + overlaps.append(overlap) + + if overlaps: + # Average overlap as consistency measure + consistency = sum(overlaps) / len(overlaps) + else: + consistency = 0.0 + + return consistency + + def _calculate_saturation(self, state: CrawlState) -> float: + """Diminishing returns indicator - are we still discovering new information?""" + if not state.new_terms_history: + return 0.0 + + if len(state.new_terms_history) 
< 2: + return 0.0 # Not enough history + + # Calculate rate of new term discovery + recent_rate = state.new_terms_history[-1] if state.new_terms_history[-1] > 0 else 1 + initial_rate = state.new_terms_history[0] if state.new_terms_history[0] > 0 else 1 + + # Saturation increases as rate decreases + saturation = 1 - (recent_rate / initial_rate) + + return max(0.0, min(saturation, 1.0)) + + async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]: + """Rank links by expected information gain""" + scored_links = [] + + for link in state.pending_links: + # Skip already crawled URLs + if link.href in state.crawled_urls: + continue + + # Calculate component scores + relevance = self._calculate_relevance(link, state) + novelty = self._calculate_novelty(link, state) + authority = 1.0 + # authority = self._calculate_authority(link) + + # Combined score + score = (config.relevance_weight * relevance + + config.novelty_weight * novelty + + config.authority_weight * authority) + + scored_links.append((link, score)) + + # Sort by score descending + scored_links.sort(key=lambda x: x[1], reverse=True) + + return scored_links + + def _calculate_relevance(self, link: Link, state: CrawlState) -> float: + """BM25 relevance score between link preview and query""" + if not state.query or not link: + return 0.0 + + # Combine available text from link + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.head_data.get('meta', {}).get('title', '') if link.head_data else '', + link.head_data.get('meta', {}).get('description', '') if link.head_data else '', + link.head_data.get('meta', {}).get('keywords', '') if link.head_data else '' + ])).lower() + + if not link_text: + return 0.0 + + # Use contextual score if available (from BM25 scoring during crawl) + # if link.contextual_score is not None: + if link.contextual_score and link.contextual_score > 0: + return link.contextual_score + + # Otherwise, calculate simple term 
overlap + query_terms = set(self._tokenize(state.query.lower())) + link_terms = set(self._tokenize(link_text)) + + if not query_terms: + return 0.0 + + overlap = len(query_terms & link_terms) / len(query_terms) + return overlap + + def _calculate_novelty(self, link: Link, state: CrawlState) -> float: + """Estimate how much new information this link might provide""" + if not state.knowledge_base: + return 1.0 # First links are maximally novel + + # Get terms from link preview + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.head_data.get('title', '') if link.head_data else '', + link.head_data.get('description', '') if link.head_data else '', + link.head_data.get('keywords', '') if link.head_data else '' + ])).lower() + + link_terms = set(self._tokenize(link_text)) + if not link_terms: + return 0.5 # Unknown novelty + + # Calculate what percentage of link terms are new + existing_terms = set(state.term_frequencies.keys()) + new_terms = link_terms - existing_terms + + novelty = len(new_terms) / len(link_terms) if link_terms else 0.0 + + return novelty + + def _calculate_authority(self, link: Link) -> float: + """Simple authority score based on URL structure and link attributes""" + score = 0.5 # Base score + + if not link.href: + return 0.0 + + url = link.href.lower() + + # Positive indicators + if '/docs/' in url or '/documentation/' in url: + score += 0.2 + if '/api/' in url or '/reference/' in url: + score += 0.2 + if '/guide/' in url or '/tutorial/' in url: + score += 0.1 + + # Check for file extensions + if url.endswith('.pdf'): + score += 0.1 + elif url.endswith(('.jpg', '.png', '.gif')): + score -= 0.3 # Reduce score for images + + # Use intrinsic score if available + if link.intrinsic_score is not None: + score = 0.7 * score + 0.3 * link.intrinsic_score + + return min(score, 1.0) + + async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool: + """Determine if crawling should stop""" + # Check confidence 
threshold
+        confidence = state.metrics.get('confidence', 0.0)
+        if confidence >= config.confidence_threshold:
+            return True
+
+        # Check resource limits
+        if len(state.crawled_urls) >= config.max_pages:
+            return True
+
+        # Check if we have any links left
+        if not state.pending_links:
+            return True
+
+        # Check saturation
+        if state.metrics.get('saturation', 0.0) >= config.saturation_threshold:
+            return True
+
+        return False
+
+    async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None:
+        """Update state with new crawl results"""
+        for result in new_results:
+            # Track new terms
+            old_term_count = len(state.term_frequencies)
+
+            # Extract and process content
+            try:
+                content = result.markdown.raw_markdown
+            except AttributeError:
+                print(f"Warning: CrawlResult {result.url} has no markdown content")
+                content = ""
+
+            terms = self._tokenize(content.lower())
+
+            # Update term frequencies
+            term_set = set()
+            for term in terms:
+                state.term_frequencies[term] += 1
+                term_set.add(term)
+
+            # Update document frequencies
+            doc_id = state.total_documents
+            for term in term_set:
+                if term not in state.documents_with_terms[term]:
+                    state.document_frequencies[term] += 1
+                    state.documents_with_terms[term].add(doc_id)
+
+            # Track new terms discovered
+            new_term_count = len(state.term_frequencies)
+            new_terms = new_term_count - old_term_count
+            state.new_terms_history.append(new_terms)
+
+            # Update document count
+            state.total_documents += 1
+
+            # Add to crawl order
+
state.crawl_order.append(result.url) + + def _tokenize(self, text: str) -> List[str]: + """Simple tokenization - can be enhanced""" + # Remove punctuation and split + text = re.sub(r'[^\w\s]', ' ', text) + tokens = text.split() + + # Filter short tokens and stop words (basic) + tokens = [t for t in tokens if len(t) > 2] + + return tokens + + def _get_document_terms(self, crawl_result: CrawlResult) -> List[str]: + """Extract terms from a crawl result""" + content = crawl_result.markdown.raw_markdown or "" + return self._tokenize(content.lower()) + + +class EmbeddingStrategy(CrawlStrategy): + """Embedding-based adaptive crawling using semantic space coverage""" + + def __init__(self, embedding_model: str = None, llm_config: Dict = None): + self.embedding_model = embedding_model or "sentence-transformers/all-MiniLM-L6-v2" + self.llm_config = llm_config + self._embedding_cache = {} + self._link_embedding_cache = {} # Cache for link embeddings + self._validation_passed = False # Track if validation passed + + # Performance optimization caches + self._distance_matrix_cache = None # Cache for query-KB distances + self._kb_embeddings_hash = None # Track KB changes + self._validation_embeddings_cache = None # Cache validation query embeddings + self._kb_similarity_threshold = 0.95 # Threshold for deduplication + + async def _get_embeddings(self, texts: List[str]) -> Any: + """Get embeddings using configured method""" + from .utils import get_text_embeddings + embedding_llm_config = { + 'provider': 'openai/text-embedding-3-small', + 'api_token': os.getenv('OPENAI_API_KEY') + } + return await get_text_embeddings( + texts, + embedding_llm_config, + self.embedding_model + ) + + def _compute_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any: + """Compute distance matrix using vectorized operations""" + + + if kb_embeddings is None or len(kb_embeddings) == 0: + return None + + # Ensure proper shapes + if len(query_embeddings.shape) == 1: + query_embeddings = 
query_embeddings.reshape(1, -1) + if len(kb_embeddings.shape) == 1: + kb_embeddings = kb_embeddings.reshape(1, -1) + + # Vectorized cosine distance: 1 - cosine_similarity + # Normalize vectors + query_norm = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True) + kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) + + # Compute cosine similarity matrix + similarity_matrix = np.dot(query_norm, kb_norm.T) + + # Convert to distance + distance_matrix = 1 - similarity_matrix + + return distance_matrix + + def _get_cached_distance_matrix(self, query_embeddings: Any, kb_embeddings: Any) -> Any: + """Get distance matrix with caching""" + + + if kb_embeddings is None or len(kb_embeddings) == 0: + return None + + # Check if KB has changed + kb_hash = hash(kb_embeddings.tobytes()) if kb_embeddings is not None else None + + if (self._distance_matrix_cache is None or + kb_hash != self._kb_embeddings_hash): + # Recompute matrix + self._distance_matrix_cache = self._compute_distance_matrix(query_embeddings, kb_embeddings) + self._kb_embeddings_hash = kb_hash + + return self._distance_matrix_cache + + async def map_query_semantic_space(self, query: str, n_synthetic: int = 10) -> Any: + """Generate a point cloud representing the semantic neighborhood of the query""" + from .utils import perform_completion_with_backoff + + # Generate more variations than needed for train/val split + n_total = int(n_synthetic * 1.3) # Generate 30% more for validation + + # Generate variations using LLM + prompt = f"""Generate {n_total} variations of this query that explore different aspects: '{query}' + + These should be queries a user might ask when looking for similar information. + Include different phrasings, related concepts, and specific aspects. 
+
+    Return as a JSON array of strings."""
+
+        # Use the LLM for query generation
+        provider = self.llm_config.get('provider', 'openai/gpt-4o-mini') if self.llm_config else 'openai/gpt-4o-mini'
+        api_token = self.llm_config.get('api_token') if self.llm_config else None
+
+        response = perform_completion_with_backoff(
+            provider=provider,
+            prompt_with_variables=prompt,
+            api_token=api_token,
+            json_response=True
+        )
+        variations = json.loads(response.choices[0].message.content)
+
+        # Shuffle for a proper train/val split, always keeping the
+        # original query in the training set
+        import random
+
+        other_queries = variations['queries'].copy()
+        random.shuffle(other_queries)
+
+        # Split: 80% for training, 20% for validation
+        n_validation = max(2, int(len(other_queries) * 0.2))  # At least 2 for validation
+        val_queries = other_queries[-n_validation:]
+        train_queries = [query] + other_queries[:-n_validation]
+
+        # Embed only training queries for now (faster)
+        train_embeddings = await self._get_embeddings(train_queries)
+
+        # Store validation queries for later (don't embed yet to save time)
+        self._validation_queries = val_queries
+
+        return train_embeddings, train_queries
+
+    def compute_coverage_shape(self, query_points: Any, alpha: float = 0.5):
+        """Find the minimal shape that covers all query points using alpha shape"""
+        try:
+            if len(query_points) < 3:
+                return None
+
+            # For high-dimensional embeddings (e.g., 384-dim, 768-dim),
+            # alpha shapes require exponentially more points than available.
+ # Instead, use a statistical coverage model + query_points = np.array(query_points) + + # Store coverage as centroid + radius model + coverage = { + 'center': np.mean(query_points, axis=0), + 'std': np.std(query_points, axis=0), + 'points': query_points, + 'radius': np.max(np.linalg.norm(query_points - np.mean(query_points, axis=0), axis=1)) + } + return coverage + except Exception: + # Fallback if computation fails + return None + + def _sample_boundary_points(self, shape, n_samples: int = 20) -> List[Any]: + """Sample points from the boundary of a shape""" + + + # Simplified implementation - in practice would sample from actual shape boundary + # For now, return empty list if shape is None + if shape is None: + return [] + + # This is a placeholder - actual implementation would depend on shape type + return [] + + def find_coverage_gaps(self, kb_embeddings: Any, query_embeddings: Any) -> List[Tuple[Any, float]]: + """Calculate gap distances for all query variations using vectorized operations""" + + + gaps = [] + + if kb_embeddings is None or len(kb_embeddings) == 0: + # If no KB yet, all query points have maximum gap + for q_emb in query_embeddings: + gaps.append((q_emb, 1.0)) + return gaps + + # Use cached distance matrix + distance_matrix = self._get_cached_distance_matrix(query_embeddings, kb_embeddings) + + if distance_matrix is None: + # Fallback + for q_emb in query_embeddings: + gaps.append((q_emb, 1.0)) + return gaps + + # Find minimum distance for each query (vectorized) + min_distances = np.min(distance_matrix, axis=1) + + # Create gaps list + for i, q_emb in enumerate(query_embeddings): + gaps.append((q_emb, min_distances[i])) + + return gaps + + async def select_links_for_expansion( + self, + candidate_links: List[Link], + gaps: List[Tuple[Any, float]], + kb_embeddings: Any + ) -> List[Tuple[Link, float]]: + """Select links that most efficiently fill the gaps""" + from .utils import cosine_distance, cosine_similarity, get_text_embeddings + + import 
hashlib + + scored_links = [] + + # Prepare for embedding - separate cached vs uncached + links_to_embed = [] + texts_to_embed = [] + link_embeddings_map = {} + + for link in candidate_links: + # Extract text from link + link_text = ' '.join(filter(None, [ + link.text or '', + link.title or '', + link.meta.get('description', '') if hasattr(link, 'meta') and link.meta else '', + link.head_data.get('meta', {}).get('description', '') if link.head_data else '' + ])) + + if not link_text.strip(): + continue + + # Create cache key from URL + text content + cache_key = hashlib.md5(f"{link.href}:{link_text}".encode()).hexdigest() + + # Check cache + if cache_key in self._link_embedding_cache: + link_embeddings_map[link.href] = self._link_embedding_cache[cache_key] + else: + links_to_embed.append(link) + texts_to_embed.append(link_text) + + # Batch embed only uncached links + if texts_to_embed: + embedding_llm_config = { + 'provider': 'openai/text-embedding-3-small', + 'api_token': os.getenv('OPENAI_API_KEY') + } + new_embeddings = await get_text_embeddings(texts_to_embed, embedding_llm_config, self.embedding_model) + + # Cache the new embeddings + for link, text, embedding in zip(links_to_embed, texts_to_embed, new_embeddings): + cache_key = hashlib.md5(f"{link.href}:{text}".encode()).hexdigest() + self._link_embedding_cache[cache_key] = embedding + link_embeddings_map[link.href] = embedding + + # Get coverage radius from config + coverage_radius = self.config.embedding_coverage_radius if hasattr(self, 'config') else 0.2 + + # Score each link + for link in candidate_links: + if link.href not in link_embeddings_map: + continue # Skip links without embeddings + + link_embedding = link_embeddings_map[link.href] + + if not gaps: + score = 0.0 + else: + # Calculate how many gaps this link helps with + gaps_helped = 0 + total_improvement = 0 + + for gap_point, gap_distance in gaps: + # Only consider gaps that actually need filling (outside coverage radius) + if gap_distance > 
coverage_radius: + new_distance = cosine_distance(link_embedding, gap_point) + if new_distance < gap_distance: + # This link helps this gap + improvement = gap_distance - new_distance + # Scale improvement - moving from 0.5 to 0.3 is valuable + scaled_improvement = improvement * 2 # Amplify the signal + total_improvement += scaled_improvement + gaps_helped += 1 + + # Average improvement per gap that needs help + gaps_needing_help = sum(1 for _, d in gaps if d > coverage_radius) + if gaps_needing_help > 0: + gap_reduction_score = total_improvement / gaps_needing_help + else: + gap_reduction_score = 0 + + # Check overlap with existing KB (vectorized) + if kb_embeddings is not None and len(kb_embeddings) > 0: + # Normalize embeddings + link_norm = link_embedding / np.linalg.norm(link_embedding) + kb_norm = kb_embeddings / np.linalg.norm(kb_embeddings, axis=1, keepdims=True) + + # Compute all similarities at once + similarities = np.dot(kb_norm, link_norm) + max_similarity = np.max(similarities) + + # Only penalize if very similar (above threshold) + overlap_threshold = self.config.embedding_overlap_threshold if hasattr(self, 'config') else 0.85 + if max_similarity > overlap_threshold: + overlap_penalty = (max_similarity - overlap_threshold) * 2 # 0 to 0.3 range + else: + overlap_penalty = 0 + else: + overlap_penalty = 0 + + # Final score - emphasize gap reduction + score = gap_reduction_score * (1 - overlap_penalty) + + # Add contextual score boost if available + if hasattr(link, 'contextual_score') and link.contextual_score: + score = score * 0.8 + link.contextual_score * 0.2 + + scored_links.append((link, score)) + + return sorted(scored_links, key=lambda x: x[1], reverse=True) + + async def calculate_confidence(self, state: CrawlState) -> float: + """Coverage-based learning score (0–1).""" + # Guard clauses + if state.kb_embeddings is None or state.query_embeddings is None: + return 0.0 + if len(state.kb_embeddings) == 0 or len(state.query_embeddings) == 0: + 
return 0.0
+
+        # Prepare L2-normalised arrays
+        Q = np.asarray(state.query_embeddings, dtype=np.float32)
+        D = np.asarray(state.kb_embeddings, dtype=np.float32)
+        Q /= np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8
+        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
+
+        # Best cosine per query
+        best = (Q @ D.T).max(axis=1)
+
+        # Mean similarity or hit-rate above tau
+        tau = getattr(self.config, 'coverage_tau', None)
+        score = float((best >= tau).mean()) if tau is not None else float(best.mean())
+
+        # Store quick metrics
+        state.metrics['coverage_score'] = score
+        state.metrics['learning_score'] = score  # Read later by get_quality_confidence
+        state.metrics['avg_best_similarity'] = float(best.mean())
+        state.metrics['median_best_similarity'] = float(np.median(best))
+
+        return score
+
+    async def rank_links(self, state: CrawlState, config: AdaptiveConfig) -> List[Tuple[Link, float]]:
+        """Main entry point for link ranking"""
+        # Store config for use in other methods
+        self.config = config
+
+        # Filter out already crawled URLs and remove duplicates
+        seen_urls = set()
+        uncrawled_links = []
+
+        for link in state.pending_links:
+            if link.href not in state.crawled_urls and link.href not in seen_urls:
+                uncrawled_links.append(link)
+                seen_urls.add(link.href)
+
+        if not uncrawled_links:
+            return []
+
+        # Get gaps in coverage
+        gaps = self.find_coverage_gaps(
+
state.kb_embeddings,
+            state.query_embeddings
+        )
+        state.semantic_gaps = [(g[0].tolist(), g[1]) for g in gaps]  # Store as list for serialization
+
+        # Select links that fill gaps (only from uncrawled)
+        return await self.select_links_for_expansion(
+            uncrawled_links,
+            gaps,
+            state.kb_embeddings
+        )
+
+    async def validate_coverage(self, state: CrawlState) -> float:
+        """Validate coverage using held-out queries with caching"""
+        if not hasattr(self, '_validation_queries') or not self._validation_queries:
+            return state.metrics.get('confidence', 0.0)
+
+        # Cache validation embeddings (only embed once!)
+        if self._validation_embeddings_cache is None:
+            self._validation_embeddings_cache = await self._get_embeddings(self._validation_queries)
+
+        val_embeddings = self._validation_embeddings_cache
+
+        # Use vectorized distance computation
+        if state.kb_embeddings is None or len(state.kb_embeddings) == 0:
+            return 0.0
+
+        # Compute distance matrix for validation queries
+        distance_matrix = self._compute_distance_matrix(val_embeddings, state.kb_embeddings)
+
+        if distance_matrix is None:
+            return 0.0
+
+        # Find minimum distance for each validation query (vectorized)
+        min_distances = np.min(distance_matrix, axis=1)
+        scores = 1.0 - min_distances  # Convert distances to scores (0-1 range)
+
+        validation_confidence = np.mean(scores)
+        state.metrics['validation_confidence'] = validation_confidence
+
+        return validation_confidence
+
+    async def should_stop(self, state: CrawlState, config: AdaptiveConfig) -> bool:
+        """Stop based on learning curve convergence"""
+        confidence = state.metrics.get('confidence', 0.0)
+
+        # Check if confidence is below minimum threshold (completely irrelevant)
+        min_confidence_threshold = config.embedding_min_confidence_threshold if hasattr(config,
'embedding_min_confidence_threshold') else 0.1
+        if confidence < min_confidence_threshold and len(state.crawled_urls) > 0:
+            state.metrics['stopped_reason'] = 'below_minimum_relevance_threshold'
+            state.metrics['is_irrelevant'] = True
+            return True
+
+        # Basic limits
+        if len(state.crawled_urls) >= config.max_pages or not state.pending_links:
+            return True
+
+        # Track confidence history
+        if not hasattr(state, 'confidence_history'):
+            state.confidence_history = []
+
+        state.confidence_history.append(confidence)
+
+        # Need at least two data points to check convergence
+        if len(state.confidence_history) < 2:
+            return False
+
+        improvement_diffs = list(zip(state.confidence_history[:-1], state.confidence_history[1:]))
+
+        # Calculate average improvement
+        avg_improvement = sum(abs(b - a) for a, b in improvement_diffs) / len(improvement_diffs)
+        state.metrics['avg_improvement'] = avg_improvement
+
+        min_relative_improvement = self.config.embedding_min_relative_improvement * confidence if hasattr(self, 'config') else 0.1 * confidence
+        if avg_improvement < min_relative_improvement:
+            # Converged - validate before stopping
+            val_score = await self.validate_coverage(state)
+
+            # Only stop if validation is reasonable
+            validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4
+
+            if val_score > validation_min:
+                state.metrics['stopped_reason'] = 'converged_validated'
+                self._validation_passed = True
+                return True
+            else:
+                state.metrics['stopped_reason'] = 'low_validation'
+                # Continue crawling despite convergence
+
+        return False
+
+    def get_quality_confidence(self, state: CrawlState) -> float:
+        """Calculate quality-based confidence score for display"""
+        learning_score = state.metrics.get('learning_score', 0.0)
+        validation_score = state.metrics.get('validation_confidence', 0.0)
+
+        # Get config values
+
validation_min = self.config.embedding_validation_min_score if hasattr(self, 'config') else 0.4 + quality_min = self.config.embedding_quality_min_confidence if hasattr(self, 'config') else 0.7 + quality_max = self.config.embedding_quality_max_confidence if hasattr(self, 'config') else 0.95 + scale_factor = self.config.embedding_quality_scale_factor if hasattr(self, 'config') else 0.833 + + if self._validation_passed and validation_score > validation_min: + # Validated systems get boosted scores + # Map 0.4-0.7 learning → quality_min-quality_max confidence + if learning_score < 0.4: + confidence = quality_min # Minimum for validated systems + elif learning_score > 0.7: + confidence = quality_max # Maximum realistic confidence + else: + # Linear mapping in between + confidence = quality_min + (learning_score - 0.4) * scale_factor + else: + # Not validated = conservative mapping + confidence = learning_score * 0.8 + + return confidence + + async def update_state(self, state: CrawlState, new_results: List[CrawlResult]) -> None: + """Update embeddings and coverage metrics with deduplication""" + from .utils import get_text_embeddings + + + # Extract text from results + new_texts = [] + valid_results = [] + for result in new_results: + content = result.markdown.raw_markdown if hasattr(result, 'markdown') and result.markdown else "" + if content: # Only process non-empty content + new_texts.append(content[:5000]) # Limit text length + valid_results.append(result) + + if not new_texts: + return + + # Get embeddings for new texts + embedding_llm_config = { + 'provider': 'openai/text-embedding-3-small', + 'api_token': os.getenv('OPENAI_API_KEY') + } + new_embeddings = await get_text_embeddings(new_texts, embedding_llm_config, self.embedding_model) + + # Deduplicate embeddings before adding to KB + if state.kb_embeddings is None: + # First batch - no deduplication needed + state.kb_embeddings = new_embeddings + deduplicated_indices = list(range(len(new_embeddings))) + else: + 
# Check for duplicates using vectorized similarity + deduplicated_embeddings = [] + deduplicated_indices = [] + + for i, new_emb in enumerate(new_embeddings): + # Compute similarities with existing KB + new_emb_normalized = new_emb / np.linalg.norm(new_emb) + kb_normalized = state.kb_embeddings / np.linalg.norm(state.kb_embeddings, axis=1, keepdims=True) + similarities = np.dot(kb_normalized, new_emb_normalized) + + # Only add if not too similar to existing content + if np.max(similarities) < self._kb_similarity_threshold: + deduplicated_embeddings.append(new_emb) + deduplicated_indices.append(i) + + # Add deduplicated embeddings + if deduplicated_embeddings: + state.kb_embeddings = np.vstack([state.kb_embeddings, np.array(deduplicated_embeddings)]) + + # Update crawl order only for non-duplicate results + for idx in deduplicated_indices: + state.crawl_order.append(valid_results[idx].url) + + # Invalidate distance matrix cache since KB changed + self._kb_embeddings_hash = None + self._distance_matrix_cache = None + + # Update coverage shape if needed + if hasattr(state, 'query_embeddings') and state.query_embeddings is not None: + state.coverage_shape = self.compute_coverage_shape(state.query_embeddings, self.config.alpha_shape_alpha if hasattr(self, 'config') else 0.5) + + +class AdaptiveCrawler: + """Main adaptive crawler that orchestrates the crawling process""" + + def __init__(self, + crawler: Optional[AsyncWebCrawler] = None, + config: Optional[AdaptiveConfig] = None, + strategy: Optional[CrawlStrategy] = None): + self.crawler = crawler + self.config = config or AdaptiveConfig() + self.config.validate() + + # Create strategy based on config + if strategy: + self.strategy = strategy + else: + self.strategy = self._create_strategy(self.config.strategy) + + # Initialize state + self.state: Optional[CrawlState] = None + + # Track if we own the crawler (for cleanup) + self._owns_crawler = crawler is None + + def _create_strategy(self, strategy_name: str) -> 
CrawlStrategy: + """Create strategy instance based on name""" + if strategy_name == "statistical": + return StatisticalStrategy() + elif strategy_name == "embedding": + return EmbeddingStrategy( + embedding_model=self.config.embedding_model, + llm_config=self.config.embedding_llm_config + ) + else: + raise ValueError(f"Unknown strategy: {strategy_name}") + + async def digest(self, + start_url: str, + query: str, + resume_from: Optional[str] = None) -> CrawlState: + """Main entry point for adaptive crawling""" + # Initialize or resume state + if resume_from: + self.state = CrawlState.load(resume_from) + self.state.query = query # Update query in case it changed + else: + self.state = CrawlState( + crawled_urls=set(), + knowledge_base=[], + pending_links=[], + query=query, + metrics={} + ) + + # Create crawler if needed + if not self.crawler: + self.crawler = AsyncWebCrawler() + await self.crawler.__aenter__() + + self.strategy.config = self.config # Pass config to strategy + + # If using embedding strategy and not resuming, expand query space + if isinstance(self.strategy, EmbeddingStrategy) and not resume_from: + # Generate query space + query_embeddings, expanded_queries = await self.strategy.map_query_semantic_space( + query, + self.config.n_query_variations + ) + self.state.query_embeddings = query_embeddings + self.state.expanded_queries = expanded_queries[1:] # Skip original query + self.state.embedding_model = self.strategy.embedding_model + + try: + # Initial crawl if not resuming + if start_url not in self.state.crawled_urls: + result = await self._crawl_with_preview(start_url, query) + if result and hasattr(result, 'success') and result.success: + self.state.knowledge_base.append(result) + self.state.crawled_urls.add(start_url) + # Extract links from result - handle both dict and Links object formats + if hasattr(result, 'links') and result.links: + if isinstance(result.links, dict): + # Extract internal and external links from dict + internal_links = 
[Link(**link) for link in result.links.get('internal', [])] + external_links = [Link(**link) for link in result.links.get('external', [])] + self.state.pending_links.extend(internal_links + external_links) + else: + # Handle Links object + self.state.pending_links.extend(result.links.internal + result.links.external) + + # Update state + await self.strategy.update_state(self.state, [result]) + + # adaptive expansion + depth = 0 + while depth < self.config.max_depth: + # Calculate confidence + confidence = await self.strategy.calculate_confidence(self.state) + self.state.metrics['confidence'] = confidence + + # Check stopping criteria + if await self.strategy.should_stop(self.state, self.config): + break + + # Rank candidate links + ranked_links = await self.strategy.rank_links(self.state, self.config) + + if not ranked_links: + break + + # Check minimum gain threshold + if ranked_links[0][1] < self.config.min_gain_threshold: + break + + # Select top K links + to_crawl = [(link, score) for link, score in ranked_links[:self.config.top_k_links] + if link.href not in self.state.crawled_urls] + + if not to_crawl: + break + + # Crawl selected links + new_results = await self._crawl_batch(to_crawl, query) + + if new_results: + # Update knowledge base + self.state.knowledge_base.extend(new_results) + + # Update crawled URLs and pending links + for result, (link, _) in zip(new_results, to_crawl): + if result: + self.state.crawled_urls.add(link.href) + # Extract links from result - handle both dict and Links object formats + if hasattr(result, 'links') and result.links: + new_links = [] + if isinstance(result.links, dict): + # Extract internal and external links from dict + internal_links = [Link(**link_data) for link_data in result.links.get('internal', [])] + external_links = [Link(**link_data) for link_data in result.links.get('external', [])] + new_links = internal_links + external_links + else: + # Handle Links object + new_links = result.links.internal + 
result.links.external
+
+                                # Add new links to pending
+                                for new_link in new_links:
+                                    if new_link.href not in self.state.crawled_urls:
+                                        self.state.pending_links.append(new_link)
+
+                    # Update state with new results
+                    await self.strategy.update_state(self.state, new_results)
+
+                depth += 1
+
+                # Save state if configured
+                if self.config.save_state and self.config.state_path:
+                    self.state.save(self.config.state_path)
+
+            # Final confidence calculation
+            learning_score = await self.strategy.calculate_confidence(self.state)
+
+            # For embedding strategy, get quality-based confidence
+            if isinstance(self.strategy, EmbeddingStrategy):
+                self.state.metrics['confidence'] = self.strategy.get_quality_confidence(self.state)
+            else:
+                # For statistical strategy, use the learning score directly
+                self.state.metrics['confidence'] = learning_score
+
+            self.state.metrics['pages_crawled'] = len(self.state.crawled_urls)
+            self.state.metrics['depth_reached'] = depth
+
+            # Final save
+            if self.config.save_state and self.config.state_path:
+                self.state.save(self.config.state_path)
+
+            return self.state
+
+        finally:
+            # Cleanup if we created the crawler
+            if self._owns_crawler and self.crawler:
+                await self.crawler.__aexit__(None, None, None)
+
+    async def _crawl_with_preview(self, url: str, query: str) -> Optional[CrawlResult]:
+        """Crawl a URL with link preview enabled"""
+        config = CrawlerRunConfig(
+            link_preview_config=LinkPreviewConfig(
+                include_internal=True,
+                include_external=False,
+                query=query,  # For BM25 scoring
+                concurrency=5,
+                timeout=5,
+                max_links=50,  # Reasonable limit
+                verbose=False
+            ),
+            score_links=True  # Enable intrinsic scoring
+        )
+
+        try:
+            result = await self.crawler.arun(url=url, config=config)
+            # Extract the actual CrawlResult from the container
+            if hasattr(result, '_results') and result._results:
+                result = result._results[0]
+
+            # Filter out links that do not have head_data
+            if hasattr(result, 'links') and result.links:
+                result.links['internal'] = [link for
link in result.links['internal'] if link.get('head_data')] + # For now let's ignore external links without head_data + # result.links['external'] = [link for link in result.links['external'] if link.get('head_data')] + + return result + except Exception as e: + print(f"Error crawling {url}: {e}") + return None + + async def _crawl_batch(self, links_with_scores: List[Tuple[Link, float]], query: str) -> List[CrawlResult]: + """Crawl multiple URLs in parallel""" + tasks = [] + for link, score in links_with_scores: + task = self._crawl_with_preview(link.href, query) + tasks.append(task) + + results = await asyncio.gather(*tasks, return_exceptions=True) + + # Filter out exceptions and failed crawls + valid_results = [] + for result in results: + if isinstance(result, CrawlResult): + # Only include successful crawls + if hasattr(result, 'success') and result.success: + valid_results.append(result) + else: + print(f"Skipping failed crawl: {result.url if hasattr(result, 'url') else 'unknown'}") + elif isinstance(result, Exception): + print(f"Error in batch crawl: {result}") + + return valid_results + + # Status properties + @property + def confidence(self) -> float: + """Current confidence level""" + if self.state: + return self.state.metrics.get('confidence', 0.0) + return 0.0 + + @property + def coverage_stats(self) -> Dict[str, Any]: + """Detailed coverage statistics""" + if not self.state: + return {} + + total_content_length = sum( + len(result.markdown.raw_markdown or "") + for result in self.state.knowledge_base + ) + + return { + 'pages_crawled': len(self.state.crawled_urls), + 'total_content_length': total_content_length, + 'unique_terms': len(self.state.term_frequencies), + 'total_terms': sum(self.state.term_frequencies.values()), + 'pending_links': len(self.state.pending_links), + 'confidence': self.confidence, + 'coverage': self.state.metrics.get('coverage', 0.0), + 'consistency': self.state.metrics.get('consistency', 0.0), + 'saturation': 
self.state.metrics.get('saturation', 0.0) + } + + @property + def is_sufficient(self) -> bool: + """Check if current knowledge is sufficient""" + if isinstance(self.strategy, EmbeddingStrategy): + # For embedding strategy, sufficient = validation passed + return self.strategy._validation_passed + else: + # For statistical strategy, use threshold + return self.confidence >= self.config.confidence_threshold + + def print_stats(self, detailed: bool = False) -> None: + """Print comprehensive statistics about the knowledge base + + Args: + detailed: If True, show detailed statistics including top terms + """ + if not self.state: + print("No crawling state available.") + return + + # Import here to avoid circular imports + try: + from rich.console import Console + from rich.table import Table + console = Console() + use_rich = True + except ImportError: + use_rich = False + + if not detailed and use_rich: + # Summary view with nice table (like original) + table = Table(title=f"Adaptive Crawl Stats - Query: '{self.state.query}'") + table.add_column("Metric", style="cyan", no_wrap=True) + table.add_column("Value", style="magenta") + + # Basic stats + stats = self.coverage_stats + table.add_row("Pages Crawled", str(stats.get('pages_crawled', 0))) + table.add_row("Unique Terms", str(stats.get('unique_terms', 0))) + table.add_row("Total Terms", str(stats.get('total_terms', 0))) + table.add_row("Content Length", f"{stats.get('total_content_length', 0):,} chars") + table.add_row("Pending Links", str(stats.get('pending_links', 0))) + table.add_row("", "") # Spacer + + # Strategy-specific metrics + if isinstance(self.strategy, EmbeddingStrategy): + # Embedding-specific metrics + table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") + table.add_row("Avg Min Distance", f"{self.state.metrics.get('avg_min_distance', 0):.3f}") + table.add_row("Avg Close Neighbors", f"{self.state.metrics.get('avg_close_neighbors', 0):.1f}") + table.add_row("Validation Score", 
f"{self.state.metrics.get('validation_confidence', 0):.2%}") + table.add_row("", "") # Spacer + table.add_row("Is Sufficient?", "[green]Yes (Validated)[/green]" if self.is_sufficient else "[red]No[/red]") + else: + # Statistical strategy metrics + table.add_row("Confidence", f"{stats.get('confidence', 0):.2%}") + table.add_row("Coverage", f"{stats.get('coverage', 0):.2%}") + table.add_row("Consistency", f"{stats.get('consistency', 0):.2%}") + table.add_row("Saturation", f"{stats.get('saturation', 0):.2%}") + table.add_row("", "") # Spacer + table.add_row("Is Sufficient?", "[green]Yes[/green]" if self.is_sufficient else "[red]No[/red]") + + console.print(table) + else: + # Detailed view or fallback when rich not available + print("\n" + "="*80) + print(f"Adaptive Crawl Statistics - Query: '{self.state.query}'") + print("="*80) + + # Basic stats + print("\n[*] Basic Statistics:") + print(f" Pages Crawled: {len(self.state.crawled_urls)}") + print(f" Pending Links: {len(self.state.pending_links)}") + print(f" Total Documents: {self.state.total_documents}") + + # Content stats + total_content_length = sum( + len(self._get_content_from_result(result)) + for result in self.state.knowledge_base + ) + total_words = sum(self.state.term_frequencies.values()) + unique_terms = len(self.state.term_frequencies) + + print(f"\n[*] Content Statistics:") + print(f" Total Content: {total_content_length:,} characters") + print(f" Total Words: {total_words:,}") + print(f" Unique Terms: {unique_terms:,}") + if total_words > 0: + print(f" Vocabulary Richness: {unique_terms/total_words:.2%}") + + # Strategy-specific output + if isinstance(self.strategy, EmbeddingStrategy): + # Semantic coverage for embedding strategy + print(f"\n[*] Semantic Coverage Analysis:") + print(f" Average Min Distance: {self.state.metrics.get('avg_min_distance', 0):.3f}") + print(f" Avg Close Neighbors (< 0.3): {self.state.metrics.get('avg_close_neighbors', 0):.1f}") + print(f" Avg Very Close Neighbors (< 0.2): 
{self.state.metrics.get('avg_very_close_neighbors', 0):.1f}") + + # Confidence metrics + print(f"\n[*] Confidence Metrics:") + if self.is_sufficient: + if use_rich: + console.print(f" Overall Confidence: {self.confidence:.2%} [green][VALIDATED][/green]") + else: + print(f" Overall Confidence: {self.confidence:.2%} [VALIDATED]") + else: + if use_rich: + console.print(f" Overall Confidence: {self.confidence:.2%} [red][NOT VALIDATED][/red]") + else: + print(f" Overall Confidence: {self.confidence:.2%} [NOT VALIDATED]") + + print(f" Learning Score: {self.state.metrics.get('learning_score', 0):.2%}") + print(f" Validation Score: {self.state.metrics.get('validation_confidence', 0):.2%}") + + else: + # Query coverage for statistical strategy + print(f"\n[*] Query Coverage:") + query_terms = self.strategy._tokenize(self.state.query.lower()) + for term in query_terms: + tf = self.state.term_frequencies.get(term, 0) + df = self.state.document_frequencies.get(term, 0) + if df > 0: + if use_rich: + console.print(f" '{term}': found in {df}/{self.state.total_documents} docs ([green]{df/self.state.total_documents:.0%}[/green]), {tf} occurrences") + else: + print(f" '{term}': found in {df}/{self.state.total_documents} docs ({df/self.state.total_documents:.0%}), {tf} occurrences") + else: + if use_rich: + console.print(f" '{term}': [red][X] not found[/red]") + else: + print(f" '{term}': [X] not found") + + # Confidence metrics + print(f"\n[*] Confidence Metrics:") + status = "[OK]" if self.is_sufficient else "[!!]" + if use_rich: + status_colored = "[green][OK][/green]" if self.is_sufficient else "[red][!!][/red]" + console.print(f" Overall Confidence: {self.confidence:.2%} {status_colored}") + else: + print(f" Overall Confidence: {self.confidence:.2%} {status}") + print(f" Coverage Score: {self.state.metrics.get('coverage', 0):.2%}") + print(f" Consistency Score: {self.state.metrics.get('consistency', 0):.2%}") + print(f" Saturation Score: {self.state.metrics.get('saturation', 
0):.2%}") + + # Crawl efficiency + if self.state.new_terms_history: + avg_new_terms = sum(self.state.new_terms_history) / len(self.state.new_terms_history) + print(f"\n[*] Crawl Efficiency:") + print(f" Avg New Terms per Page: {avg_new_terms:.1f}") + print(f" Information Saturation: {self.state.metrics.get('saturation', 0):.2%}") + + if detailed: + print("\n" + "-"*80) + if use_rich: + console.print("[bold cyan]DETAILED STATISTICS[/bold cyan]") + else: + print("DETAILED STATISTICS") + print("-"*80) + + # Top terms + print("\n[+] Top 20 Terms by Frequency:") + top_terms = sorted(self.state.term_frequencies.items(), key=lambda x: x[1], reverse=True)[:20] + for i, (term, freq) in enumerate(top_terms, 1): + df = self.state.document_frequencies.get(term, 0) + if use_rich: + console.print(f" {i:2d}. [yellow]'{term}'[/yellow]: {freq} occurrences in {df} docs") + else: + print(f" {i:2d}. '{term}': {freq} occurrences in {df} docs") + + # URLs crawled + print(f"\n[+] URLs Crawled ({len(self.state.crawled_urls)}):") + for i, url in enumerate(self.state.crawl_order, 1): + new_terms = self.state.new_terms_history[i-1] if i <= len(self.state.new_terms_history) else 0 + if use_rich: + console.print(f" {i}. [cyan]{url}[/cyan]") + console.print(f" -> Added [green]{new_terms}[/green] new terms") + else: + print(f" {i}. 
{url}") + print(f" -> Added {new_terms} new terms") + + # Document frequency distribution + print("\n[+] Document Frequency Distribution:") + df_counts = {} + for df in self.state.document_frequencies.values(): + df_counts[df] = df_counts.get(df, 0) + 1 + + for df in sorted(df_counts.keys()): + count = df_counts[df] + print(f" Terms in {df} docs: {count} terms") + + # Embedding stats + if self.state.embedding_model: + print("\n[+] Semantic Coverage Analysis:") + print(f" Embedding Model: {self.state.embedding_model}") + print(f" Query Variations: {len(self.state.expanded_queries)}") + if self.state.kb_embeddings is not None: + print(f" Knowledge Embeddings: {self.state.kb_embeddings.shape}") + else: + print(f" Knowledge Embeddings: None") + print(f" Semantic Gaps: {len(self.state.semantic_gaps)}") + print(f" Coverage Achievement: {self.confidence:.2%}") + + # Show sample expanded queries + if self.state.expanded_queries: + print("\n[+] Query Space (samples):") + for i, eq in enumerate(self.state.expanded_queries[:5], 1): + if use_rich: + console.print(f" {i}. [yellow]{eq}[/yellow]") + else: + print(f" {i}. 
{eq}") + + print("\n" + "="*80) + + def _get_content_from_result(self, result) -> str: + """Helper to safely extract content from result""" + if hasattr(result, 'markdown') and result.markdown: + if hasattr(result.markdown, 'raw_markdown'): + return result.markdown.raw_markdown or "" + return str(result.markdown) + return "" + + def export_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: + """Export the knowledge base to a file + + Args: + filepath: Path to save the file + format: Export format - currently supports 'jsonl' + """ + if not self.state or not self.state.knowledge_base: + print("No knowledge base to export.") + return + + filepath = Path(filepath) + filepath.parent.mkdir(parents=True, exist_ok=True) + + if format == "jsonl": + # Export as JSONL - one CrawlResult per line + with open(filepath, 'w', encoding='utf-8') as f: + for result in self.state.knowledge_base: + # Convert CrawlResult to dict + result_dict = self._crawl_result_to_export_dict(result) + # Write as single line JSON + f.write(json.dumps(result_dict, ensure_ascii=False) + '\n') + + print(f"Exported {len(self.state.knowledge_base)} documents to {filepath}") + else: + raise ValueError(f"Unsupported export format: {format}") + + def _crawl_result_to_export_dict(self, result) -> Dict[str, Any]: + """Convert CrawlResult to a dictionary for export""" + # Extract all available fields + export_dict = { + 'url': getattr(result, 'url', ''), + 'timestamp': getattr(result, 'timestamp', None), + 'success': getattr(result, 'success', True), + 'query': self.state.query if self.state else '', + } + + # Extract content + if hasattr(result, 'markdown') and result.markdown: + if hasattr(result.markdown, 'raw_markdown'): + export_dict['content'] = result.markdown.raw_markdown + else: + export_dict['content'] = str(result.markdown) + else: + export_dict['content'] = '' + + # Extract metadata + if hasattr(result, 'metadata'): + export_dict['metadata'] = result.metadata + + # 
Extract links if available + if hasattr(result, 'links'): + export_dict['links'] = result.links + + # Add crawl-specific metadata + if self.state: + export_dict['crawl_metadata'] = { + 'crawl_order': self.state.crawl_order.index(export_dict['url']) + 1 if export_dict['url'] in self.state.crawl_order else 0, + 'confidence_at_crawl': self.state.metrics.get('confidence', 0), + 'total_documents': self.state.total_documents + } + + return export_dict + + def import_knowledge_base(self, filepath: Union[str, Path], format: str = "jsonl") -> None: + """Import a knowledge base from a file + + Args: + filepath: Path to the file to import + format: Import format - currently supports 'jsonl' + """ + filepath = Path(filepath) + if not filepath.exists(): + raise FileNotFoundError(f"File not found: {filepath}") + + if format == "jsonl": + imported_results = [] + with open(filepath, 'r', encoding='utf-8') as f: + for line in f: + if line.strip(): + data = json.loads(line) + # Convert back to a mock CrawlResult + mock_result = self._import_dict_to_crawl_result(data) + imported_results.append(mock_result) + + # Initialize state if needed + if not self.state: + self.state = CrawlState() + + # Add imported results + self.state.knowledge_base.extend(imported_results) + + # Update state with imported data + asyncio.run(self.strategy.update_state(self.state, imported_results)) + + print(f"Imported {len(imported_results)} documents from {filepath}") + else: + raise ValueError(f"Unsupported import format: {format}") + + def _import_dict_to_crawl_result(self, data: Dict[str, Any]): + """Convert imported dict back to a mock CrawlResult""" + class MockMarkdown: + def __init__(self, content): + self.raw_markdown = content + + class MockCrawlResult: + def __init__(self, data): + self.url = data.get('url', '') + self.markdown = MockMarkdown(data.get('content', '')) + self.links = data.get('links', {}) + self.metadata = data.get('metadata', {}) + self.success = data.get('success', True) + 
+                self.timestamp = data.get('timestamp')
+
+        return MockCrawlResult(data)
+
+    def get_relevant_content(self, top_k: int = 5) -> List[Dict[str, Any]]:
+        """Get most relevant content for the query"""
+        if not self.state or not self.state.knowledge_base:
+            return []
+
+        # Simple relevance ranking based on term overlap
+        scored_docs = []
+        query_terms = set(self.state.query.lower().split())
+
+        for i, result in enumerate(self.state.knowledge_base):
+            content = (result.markdown.raw_markdown or "").lower()
+            content_terms = set(content.split())
+
+            # Calculate relevance score
+            overlap = len(query_terms & content_terms)
+            score = overlap / len(query_terms) if query_terms else 0.0
+
+            scored_docs.append({
+                'url': result.url,
+                'score': score,
+                'content': result.markdown.raw_markdown,
+                'index': i
+            })
+
+        # Sort by score and return top K
+        scored_docs.sort(key=lambda x: x['score'], reverse=True)
+        return scored_docs[:top_k]
\ No newline at end of file
diff --git a/crawl4ai/utils.py b/crawl4ai/utils.py
index fe6957f6..14073ef1 100644
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -32,7 +32,6 @@
 import hashlib
 from urllib.robotparser import RobotFileParser
 import aiohttp
-from urllib.parse import urlparse, urlunparse
 from functools import lru_cache
 from packaging import version
@@ -43,6 +42,14 @@
 from itertools import chain
 from collections import deque
 from typing import Generator, Iterable
+import numpy as np
+
+from urllib.parse import (
+    urljoin, urlparse, urlunparse,
+    parse_qsl, urlencode, quote, unquote
+)
+
+
 def chunk_documents(
     documents: Iterable[str],
     chunk_token_threshold: int,
@@ -2071,6 +2078,92 @@ def normalize_url(href, base_url):
     return normalized
+
+
+def normalize_url(
+    href: str,
+    base_url: str,
+    *,
+    drop_query_tracking=True,
+    sort_query=True,
+    keep_fragment=False,
+    extra_drop_params=None
+):
+    """
+    Extended URL normalizer.
+
+    Parameters
+    ----------
+    href : str
+        The raw link extracted from a page.
+    base_url : str
+        The page's canonical URL (used to resolve relative links).
+    drop_query_tracking : bool (default True)
+        Remove common tracking query parameters.
+    sort_query : bool (default True)
+        Alphabetically sort query keys for deterministic output.
+    keep_fragment : bool (default False)
+        Preserve the hash fragment (#section) if you need in-page links.
+    extra_drop_params : Iterable[str] | None
+        Additional query keys to strip (case-insensitive).
+
+    Returns
+    -------
+    str | None
+        A clean, canonical URL, or None if href is empty/None.
+    """
+    if not href:
+        return None
+
+    # Resolve relative paths first
+    full_url = urljoin(base_url, href.strip())
+
+    # Parse once, edit parts, then rebuild
+    parsed = urlparse(full_url)
+
+    # -- netloc --
+    netloc = parsed.netloc.lower()
+
+    # -- path --
+    # Normalize percent-encoding and strip the trailing '/' (except root)
+    path = quote(unquote(parsed.path))
+    if path.endswith('/') and path != '/':
+        path = path.rstrip('/')
+
+    # -- query --
+    query = parsed.query
+    if query:
+        # explode, mutate, then rebuild
+        params = [(k.lower(), v) for k, v in parse_qsl(query, keep_blank_values=True)]
+
+        if drop_query_tracking:
+            default_tracking = {
+                'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
+                'utm_content', 'gclid', 'fbclid', 'ref', 'ref_src'
+            }
+            if extra_drop_params:
+                default_tracking |= {p.lower() for p in extra_drop_params}
+            params = [(k, v) for k, v in params if k not in default_tracking]
+
+        if sort_query:
+            params.sort(key=lambda kv: kv[0])
+
+        query = urlencode(params, doseq=True) if params else ''
+
+    # -- fragment --
+    fragment = parsed.fragment if keep_fragment else ''
+
+    # Re-assemble
+    normalized = urlunparse((
+        parsed.scheme,
+        netloc,
+        path,
+        parsed.params,
+        query,
+        fragment
+    ))
+
+    return normalized
+
+
 def normalize_url_for_deep_crawl(href, base_url):
     """Normalize URLs to ensure consistent format"""
     from urllib.parse import urljoin, urlparse, urlunparse, parse_qs, urlencode
@@ -3148,3 +3241,108 @@
def calculate_total_score(
     return max(0.0, min(total, 10.0))
+
+
+# Embedding utilities
+async def get_text_embeddings(
+    texts: List[str],
+    llm_config: Optional[Dict] = None,
+    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
+    batch_size: int = 32
+) -> np.ndarray:
+    """
+    Compute embeddings for a list of texts using the specified model.
+
+    Args:
+        texts: List of texts to embed
+        llm_config: Optional LLM configuration for API-based embeddings
+        model_name: Model name (used when llm_config is None)
+        batch_size: Batch size for processing
+
+    Returns:
+        numpy array of embeddings
+    """
+    import numpy as np
+
+    if not texts:
+        return np.array([])
+
+    # If LLMConfig provided, use litellm for embeddings
+    if llm_config is not None:
+        from litellm import aembedding
+
+        # Get embedding model from config or use default
+        embedding_model = llm_config.get('provider', 'text-embedding-3-small')
+        api_base = llm_config.get('base_url', llm_config.get('api_base'))
+
+        # Prepare kwargs
+        kwargs = {
+            'model': embedding_model,
+            'input': texts,
+            'api_key': llm_config.get('api_token', llm_config.get('api_key'))
+        }
+
+        if api_base:
+            kwargs['api_base'] = api_base
+
+        # Handle OpenAI-compatible endpoints
+        if api_base and 'openai/' not in embedding_model:
+            kwargs['model'] = f"openai/{embedding_model}"
+
+        # Get embeddings
+        response = await aembedding(**kwargs)
+
+        # Extract embeddings from response
+        embeddings = []
+        for item in response.data:
+            embeddings.append(item['embedding'])
+
+        return np.array(embeddings)
+
+    # Default: use sentence-transformers
+    else:
+        # Lazy load to avoid importing heavy libraries unless needed
+        from sentence_transformers import SentenceTransformer
+
+        # Cache the model in a function attribute to avoid reloading
+        if not hasattr(get_text_embeddings, '_models'):
+            get_text_embeddings._models = {}
+
+        if model_name not in get_text_embeddings._models:
+            get_text_embeddings._models[model_name] = SentenceTransformer(model_name)
+
+        encoder = get_text_embeddings._models[model_name]
+
+        # Batch encode for efficiency
+        embeddings = encoder.encode(
+            texts,
+            batch_size=batch_size,
+            show_progress_bar=False,
+            convert_to_numpy=True
+        )
+
+        return embeddings
+
+
+def get_text_embeddings_sync(
+    texts: List[str],
+    llm_config: Optional[Dict] = None,
+    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
+    batch_size: int = 32
+) -> np.ndarray:
+    """Synchronous wrapper for get_text_embeddings"""
+    import numpy as np
+    return asyncio.run(get_text_embeddings(texts, llm_config, model_name, batch_size))
+
+
+def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
+    """Calculate cosine similarity between two vectors"""
+    import numpy as np
+    dot_product = np.dot(vec1, vec2)
+    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
+    return float(dot_product / norm_product) if norm_product != 0 else 0.0
+
+
+def cosine_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
+    """Calculate cosine distance (1 - similarity) between two vectors"""
+    return 1 - cosine_similarity(vec1, vec2)
+
diff --git a/docs/examples/adaptive_crawling/README.md b/docs/examples/adaptive_crawling/README.md
new file mode 100644
index 00000000..3b3c41fc
--- /dev/null
+++ b/docs/examples/adaptive_crawling/README.md
@@ -0,0 +1,85 @@
+# Adaptive Crawling Examples
+
+This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
+
+## Examples Overview
+
+### 1. `basic_usage.py`
+- Simple introduction to adaptive crawling
+- Uses the default statistical strategy
+- Shows how to get crawl statistics and relevant content
+
+### 2. `embedding_strategy.py` ⭐ NEW
+- Demonstrates the embedding-based strategy for semantic understanding
+- Shows query expansion and irrelevance detection
+- Includes configuration for both local and API-based embeddings
+
+### 3. `embedding_vs_statistical.py` ⭐ NEW
+- Direct comparison between statistical and embedding strategies
+- Helps you choose the right strategy for your use case
+- Shows performance and accuracy trade-offs
+
+### 4. `embedding_configuration.py` ⭐ NEW
+- Advanced configuration options for the embedding strategy
+- Parameter tuning guide for different scenarios
+- Examples for research, exploration, and quality-focused crawling
+
+### 5. `advanced_configuration.py`
+- Shows various configuration options for both strategies
+- Demonstrates threshold tuning and performance optimization
+
+### 6. `custom_strategies.py`
+- How to implement your own crawling strategy
+- Extends the base CrawlStrategy class
+- Advanced use case for specialized requirements
+
+### 7. `export_import_kb.py`
+- Export a crawled knowledge base to JSONL
+- Import and continue crawling from saved state
+- Useful for building persistent knowledge bases
+
+## Quick Start
+
+For your first adaptive crawling experience, run:
+
+```bash
+python basic_usage.py
+```
+
+To try the new embedding strategy with semantic understanding:
+
+```bash
+python embedding_strategy.py
+```
+
+To compare strategies and see which works best for your use case:
+
+```bash
+python embedding_vs_statistical.py
+```
+
+## Strategy Selection Guide
+
+### Use Statistical Strategy (Default) When:
+- Working with technical documentation
+- Queries contain specific terms or code
+- Speed is critical
+- No API access available
+
+### Use Embedding Strategy When:
+- Queries are conceptual or ambiguous
+- You need semantic understanding beyond exact matches
+- You want to detect irrelevant content
+- Working with diverse content sources
+
+## Requirements
+
+- Crawl4AI installed
+- For the embedding strategy with local models: `sentence-transformers`
+- For the embedding strategy with OpenAI: set the `OPENAI_API_KEY` environment variable
+
+## Learn More
+
+- [Adaptive Crawling Documentation](https://docs.crawl4ai.com/core/adaptive-crawling/)
+- [Mathematical Framework](https://github.com/unclecode/crawl4ai/blob/main/PROGRESSIVE_CRAWLING.md)
+- [Blog: The Adaptive Crawling Revolution](https://docs.crawl4ai.com/blog/adaptive-crawling-revolution/)
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/advanced_configuration.py b/docs/examples/adaptive_crawling/advanced_configuration.py
new file mode 100644
index 00000000..dcfbf76d
--- /dev/null
+++ b/docs/examples/adaptive_crawling/advanced_configuration.py
@@ -0,0 +1,207 @@
+"""
+Advanced Adaptive Crawling Configuration
+
+This example demonstrates all configuration options available for adaptive crawling,
+including threshold tuning, persistence, and custom parameters.
+"""
+
+import asyncio
+from pathlib import Path
+from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
+
+
+async def main():
+    """Demonstrate advanced configuration options"""
+
+    # Example 1: Custom thresholds for different use cases
+    print("="*60)
+    print("EXAMPLE 1: Custom Confidence Thresholds")
+    print("="*60)
+
+    # High-precision configuration (exhaustive crawling)
+    high_precision_config = AdaptiveConfig(
+        confidence_threshold=0.9,   # Very high confidence required
+        max_pages=50,               # Allow more pages
+        top_k_links=5,              # Follow more links per page
+        min_gain_threshold=0.02     # Lower threshold to continue
+    )
+
+    # Balanced configuration (default use case)
+    balanced_config = AdaptiveConfig(
+        confidence_threshold=0.7,   # Moderate confidence
+        max_pages=20,               # Reasonable limit
+        top_k_links=3,              # Moderate branching
+        min_gain_threshold=0.05     # Standard gain threshold
+    )
+
+    # Quick exploration configuration
+    quick_config = AdaptiveConfig(
+        confidence_threshold=0.5,   # Lower confidence acceptable
+        max_pages=10,               # Strict limit
+        top_k_links=2,              # Minimal branching
+        min_gain_threshold=0.1      # High gain required
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # Test different configurations
+        for config_name, config in [
+            ("High Precision", 
             high_precision_config),
+            ("Balanced", balanced_config),
+            ("Quick Exploration", quick_config)
+        ]:
+            print(f"\nTesting {config_name} configuration...")
+            adaptive = AdaptiveCrawler(crawler, config=config)
+
+            result = await adaptive.digest(
+                start_url="https://httpbin.org",
+                query="http headers authentication"
+            )
+
+            print(f"  - Pages crawled: {len(result.crawled_urls)}")
+            print(f"  - Confidence achieved: {adaptive.confidence:.2%}")
+            print(f"  - Coverage score: {adaptive.coverage_stats['coverage']:.2f}")
+
+    # Example 2: Persistence and state management
+    print("\n" + "="*60)
+    print("EXAMPLE 2: State Persistence")
+    print("="*60)
+
+    state_file = "crawl_state_demo.json"
+
+    # Configuration with persistence
+    persistent_config = AdaptiveConfig(
+        confidence_threshold=0.8,
+        max_pages=30,
+        save_state=True,        # Enable auto-save
+        state_path=state_file   # Specify save location
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        # First crawl - will be interrupted
+        print("\nStarting initial crawl (will interrupt after 5 pages)...")
+
+        interrupt_config = AdaptiveConfig(
+            confidence_threshold=0.8,
+            max_pages=5,        # Artificially low to simulate interruption
+            save_state=True,
+            state_path=state_file
+        )
+
+        adaptive = AdaptiveCrawler(crawler, config=interrupt_config)
+        result1 = await adaptive.digest(
+            start_url="https://docs.python.org/3/",
+            query="exception handling try except finally"
+        )
+
+        print(f"First crawl completed: {len(result1.crawled_urls)} pages")
+        print(f"Confidence reached: {adaptive.confidence:.2%}")
+
+        # Resume crawl with a higher page limit
+        print("\nResuming crawl from saved state...")
+
+        resume_config = AdaptiveConfig(
+            confidence_threshold=0.8,
+            max_pages=20,       # Increase limit
+            save_state=True,
+            state_path=state_file
+        )
+
+        adaptive2 = AdaptiveCrawler(crawler, config=resume_config)
+        result2 = await adaptive2.digest(
+            start_url="https://docs.python.org/3/",
+            query="exception handling try except finally",
+            resume_from=state_file
+        )
+
+        print(f"Resumed crawl completed: {len(result2.crawled_urls)} total pages")
+        print(f"Final confidence: {adaptive2.confidence:.2%}")
+
+    # Clean up
+    Path(state_file).unlink(missing_ok=True)
+
+    # Example 3: Link selection strategies
+    print("\n" + "="*60)
+    print("EXAMPLE 3: Link Selection Strategies")
+    print("="*60)
+
+    # Conservative link following
+    conservative_config = AdaptiveConfig(
+        confidence_threshold=0.7,
+        max_pages=15,
+        top_k_links=1,              # Only follow the best link
+        min_gain_threshold=0.15     # High threshold
+    )
+
+    # Aggressive link following
+    aggressive_config = AdaptiveConfig(
+        confidence_threshold=0.7,
+        max_pages=15,
+        top_k_links=10,             # Follow many links
+        min_gain_threshold=0.01     # Very low threshold
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        for strategy_name, config in [
+            ("Conservative", conservative_config),
+            ("Aggressive", aggressive_config)
+        ]:
+            print(f"\n{strategy_name} link selection:")
+            adaptive = AdaptiveCrawler(crawler, config=config)
+
+            result = await adaptive.digest(
+                start_url="https://httpbin.org",
+                query="api endpoints"
+            )
+
+            # Analyze crawl pattern
+            print(f"  - Total pages: {len(result.crawled_urls)}")
+            print(f"  - Unique domains: {len(set(url.split('/')[2] for url in result.crawled_urls))}")
+            print(f"  - Max depth reached: {max(url.count('/') for url in result.crawled_urls) - 2}")
+
+            # Show saturation trend
+            if hasattr(result, 'new_terms_history') and result.new_terms_history:
+                print(f"  - New terms discovered: {result.new_terms_history[:5]}...")
+                print(f"  - Saturation trend: {'decreasing' if result.new_terms_history[-1] < result.new_terms_history[0] else 'increasing'}")
+
+    # Example 4: Monitoring crawl progress
+    print("\n" + "="*60)
+    print("EXAMPLE 4: Progress Monitoring")
+    print("="*60)
+
+    # Configuration with detailed monitoring
+    monitor_config = AdaptiveConfig(
+        confidence_threshold=0.75,
+        max_pages=10,
+        top_k_links=3
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        adaptive = AdaptiveCrawler(crawler, config=monitor_config)
+
+        # Start crawl
+        print("\nMonitoring crawl progress...")
+        result = await adaptive.digest(
+            start_url="https://httpbin.org",
+            query="http methods headers"
+        )
+
+        # Detailed statistics
+        print("\nDetailed crawl analysis:")
+        adaptive.print_stats(detailed=True)
+
+        # Export for analysis
+        print("\nExporting knowledge base for external analysis...")
+        adaptive.export_knowledge_base("knowledge_export_demo.jsonl")
+        print("Knowledge base exported to: knowledge_export_demo.jsonl")
+
+        # Show a sample of the exported data
+        with open("knowledge_export_demo.jsonl", 'r') as f:
+            first_line = f.readline()
+            print(f"Sample export: {first_line[:100]}...")
+
+    # Clean up
+    Path("knowledge_export_demo.jsonl").unlink(missing_ok=True)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/basic_usage.py b/docs/examples/adaptive_crawling/basic_usage.py
new file mode 100644
index 00000000..72f50dc7
--- /dev/null
+++ b/docs/examples/adaptive_crawling/basic_usage.py
@@ -0,0 +1,76 @@
+"""
+Basic Adaptive Crawling Example
+
+This example demonstrates the simplest use case of adaptive crawling:
+finding information about a specific topic and knowing when to stop.
+""" + +import asyncio +from crawl4ai import AsyncWebCrawler, AdaptiveCrawler + + +async def main(): + """Basic adaptive crawling example""" + + # Initialize the crawler + async with AsyncWebCrawler(verbose=True) as crawler: + # Create an adaptive crawler with default settings (statistical strategy) + adaptive = AdaptiveCrawler(crawler) + + # Note: You can also use embedding strategy for semantic understanding: + # from crawl4ai import AdaptiveConfig + # config = AdaptiveConfig(strategy="embedding") + # adaptive = AdaptiveCrawler(crawler, config) + + # Start adaptive crawling + print("Starting adaptive crawl for Python async programming information...") + result = await adaptive.digest( + start_url="https://docs.python.org/3/library/asyncio.html", + query="async await context managers coroutines" + ) + + # Display crawl statistics + print("\n" + "="*50) + print("CRAWL STATISTICS") + print("="*50) + adaptive.print_stats(detailed=False) + + # Get the most relevant content found + print("\n" + "="*50) + print("MOST RELEVANT PAGES") + print("="*50) + + relevant_pages = adaptive.get_relevant_content(top_k=5) + for i, page in enumerate(relevant_pages, 1): + print(f"\n{i}. {page['url']}") + print(f" Relevance Score: {page['score']:.2%}") + + # Show a snippet of the content + content = page['content'] or "" + if content: + snippet = content[:200].replace('\n', ' ') + if len(content) > 200: + snippet += "..." 
+                print(f"   Preview: {snippet}")
+
+        # Show final confidence
+        print(f"\n{'='*50}")
+        print(f"Final Confidence: {adaptive.confidence:.2%}")
+        print(f"Total Pages Crawled: {len(result.crawled_urls)}")
+        print(f"Knowledge Base Size: {len(adaptive.state.knowledge_base)} documents")
+
+        # Example: Check if we can answer specific questions
+        print(f"\n{'='*50}")
+        print("INFORMATION SUFFICIENCY CHECK")
+        print(f"{'='*50}")
+
+        if adaptive.confidence >= 0.8:
+            print("✓ High confidence - can answer detailed questions about async Python")
+        elif adaptive.confidence >= 0.6:
+            print("~ Moderate confidence - can answer basic questions")
+        else:
+            print("✗ Low confidence - need more information")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/custom_strategies.py b/docs/examples/adaptive_crawling/custom_strategies.py
new file mode 100644
index 00000000..5ea622ac
--- /dev/null
+++ b/docs/examples/adaptive_crawling/custom_strategies.py
@@ -0,0 +1,373 @@
+"""
+Custom Adaptive Crawling Strategies
+
+This example demonstrates how to implement custom scoring strategies
+for domain-specific crawling needs.
+"""
+
+import asyncio
+import re
+from typing import List, Dict, Set
+from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
+from crawl4ai.adaptive_crawler import CrawlState, Link
+import math
+
+
+class APIDocumentationStrategy:
+    """
+    Custom strategy optimized for API documentation crawling.
+    Prioritizes endpoint references, code examples, and parameter descriptions.
+    """
+
+    def __init__(self):
+        # Keywords that indicate high-value API documentation
+        self.api_keywords = {
+            'endpoint', 'request', 'response', 'parameter', 'authentication',
+            'header', 'body', 'query', 'path', 'method', 'get', 'post', 'put',
+            'delete', 'patch', 'status', 'code', 'example', 'curl', 'python'
+        }
+
+        # URL patterns that typically contain API documentation
+        self.valuable_patterns = [
+            r'/api/',
+            r'/reference/',
+            r'/endpoints?/',
+            r'/methods?/',
+            r'/resources?/'
+        ]
+
+        # Patterns to avoid
+        self.avoid_patterns = [
+            r'/blog/',
+            r'/news/',
+            r'/about/',
+            r'/contact/',
+            r'/legal/'
+        ]
+
+    def score_link(self, link: Link, query: str, state: CrawlState) -> float:
+        """Custom link scoring for API documentation"""
+        score = 1.0
+        url = link.href.lower()
+
+        # Boost API-related URLs
+        for pattern in self.valuable_patterns:
+            if re.search(pattern, url):
+                score *= 2.0
+                break
+
+        # Reduce score for non-API content
+        for pattern in self.avoid_patterns:
+            if re.search(pattern, url):
+                score *= 0.1
+                break
+
+        # Boost if preview contains API keywords
+        if link.text:
+            preview_lower = link.text.lower()
+            keyword_count = sum(1 for kw in self.api_keywords if kw in preview_lower)
+            score *= (1 + keyword_count * 0.2)
+
+        # Prioritize shallow URLs (likely overview pages)
+        depth = url.count('/') - 2  # Subtract protocol slashes
+        if depth <= 3:
+            score *= 1.5
+        elif depth > 6:
+            score *= 0.5
+
+        return score
+
+    def calculate_api_coverage(self, state: CrawlState, query: str) -> Dict[str, float]:
+        """Calculate specialized coverage metrics for API documentation"""
+        metrics = {
+            'endpoint_coverage': 0.0,
+            'example_coverage': 0.0,
+            'parameter_coverage': 0.0
+        }
+
+        # Analyze knowledge base for API-specific content
+        endpoint_patterns = [r'GET\s+/', r'POST\s+/', r'PUT\s+/', r'DELETE\s+/']
+        example_patterns = [r'```\w+', r'curl\s+-', r'import\s+requests']
+        param_patterns = [r'param(?:eter)?s?\s*:', r'required\s*:', r'optional\s*:']
+
+        total_docs = len(state.knowledge_base)
+        if total_docs == 0:
+            return metrics
+
+        docs_with_endpoints = 0
+        docs_with_examples = 0
+        docs_with_params = 0
+
+        for doc in state.knowledge_base:
+            content = doc.markdown.raw_markdown if hasattr(doc, 'markdown') else str(doc)
+
+            # Check for endpoints
+            if any(re.search(pattern, content, re.IGNORECASE) for pattern in endpoint_patterns):
+                docs_with_endpoints += 1
+
+            # Check for examples
+            if any(re.search(pattern, content, re.IGNORECASE) for pattern in example_patterns):
+                docs_with_examples += 1
+
+            # Check for parameters
+            if any(re.search(pattern, content, re.IGNORECASE) for pattern in param_patterns):
+                docs_with_params += 1
+
+        metrics['endpoint_coverage'] = docs_with_endpoints / total_docs
+        metrics['example_coverage'] = docs_with_examples / total_docs
+        metrics['parameter_coverage'] = docs_with_params / total_docs
+
+        return metrics
+
+
+class ResearchPaperStrategy:
+    """
+    Strategy optimized for crawling research papers and academic content.
+    Prioritizes citations, abstracts, and methodology sections.
+    """
+
+    def __init__(self):
+        self.academic_keywords = {
+            'abstract', 'introduction', 'methodology', 'results', 'conclusion',
+            'references', 'citation', 'paper', 'study', 'research', 'analysis',
+            'hypothesis', 'experiment', 'findings', 'doi'
+        }
+
+        self.citation_patterns = [
+            r'\[\d+\]',          # [1] style citations
+            r'\(\w+\s+\d{4}\)',  # (Author 2024) style
+            r'doi:\s*\S+',       # DOI references
+        ]
+
+    def calculate_academic_relevance(self, content: str, query: str) -> float:
+        """Calculate relevance score for academic content"""
+        score = 0.0
+        content_lower = content.lower()
+
+        # Check for academic keywords
+        keyword_matches = sum(1 for kw in self.academic_keywords if kw in content_lower)
+        score += keyword_matches * 0.1
+
+        # Check for citations
+        citation_count = sum(
+            len(re.findall(pattern, content))
+            for pattern in self.citation_patterns
+        )
+        score += min(citation_count * 0.05, 1.0)  # Cap at 1.0
+
+        # Check for query terms in academic context
+        query_terms = query.lower().split()
+        for term in query_terms:
+            # Boost if term appears near academic keywords
+            for keyword in ['abstract', 'conclusion', 'results']:
+                if keyword in content_lower:
+                    section = content_lower[content_lower.find(keyword):content_lower.find(keyword) + 500]
+                    if term in section:
+                        score += 0.2
+
+        return min(score, 2.0)  # Cap total score
+
+
+async def demo_custom_strategies():
+    """Demonstrate custom strategy usage"""
+
+    # Example 1: API Documentation Strategy
+    print("="*60)
+    print("EXAMPLE 1: Custom API Documentation Strategy")
+    print("="*60)
+
+    api_strategy = APIDocumentationStrategy()
+
+    async with AsyncWebCrawler() as crawler:
+        # Standard adaptive crawler
+        config = AdaptiveConfig(
+            confidence_threshold=0.8,
+            max_pages=15
+        )
+
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        # Override link scoring with custom strategy
+        original_rank_links = adaptive._rank_links  # keep a reference in case it needs restoring
+
+        def custom_rank_links(links, query, state):
+            # Apply custom scoring
+            scored_links = []
+            for link in links:
+                base_score = api_strategy.score_link(link, query, state)
+                scored_links.append((link, base_score))
+
+            # Sort by score
+            scored_links.sort(key=lambda x: x[1], reverse=True)
+            return [link for link, _ in scored_links[:config.top_k_links]]
+
+        adaptive._rank_links = custom_rank_links
+
+        # Crawl API documentation
+        print("\nCrawling API documentation with custom strategy...")
+        state = await adaptive.digest(
+            start_url="https://httpbin.org",
+            query="api endpoints authentication headers"
+        )
+
+        # Calculate custom metrics
+        api_metrics = api_strategy.calculate_api_coverage(state, "api endpoints")
+
+        print(f"\nResults:")
+        print(f"Pages crawled: {len(state.crawled_urls)}")
+        print(f"Confidence: {adaptive.confidence:.2%}")
+        print(f"\nAPI-Specific Metrics:")
+        print(f"  - Endpoint coverage: {api_metrics['endpoint_coverage']:.2%}")
+        print(f"  - Example coverage: {api_metrics['example_coverage']:.2%}")
+        print(f"  - Parameter coverage: {api_metrics['parameter_coverage']:.2%}")
+
+    # Example 2: Combined Strategy
+    print("\n" + "="*60)
+    print("EXAMPLE 2: Hybrid Strategy Combining Multiple Approaches")
+    print("="*60)
+
+    class HybridStrategy:
+        """Combines multiple strategies with weights"""
+
+        def __init__(self):
+            self.api_strategy = APIDocumentationStrategy()
+            self.research_strategy = ResearchPaperStrategy()
+            self.weights = {
+                'api': 0.7,
+                'research': 0.3
+            }
+
+        def score_content(self, content: str, query: str) -> float:
+            # Get scores from each strategy
+            api_score = self._calculate_api_score(content, query)
+            research_score = self.research_strategy.calculate_academic_relevance(content, query)
+
+            # Weighted combination
+            total_score = (
+                api_score * self.weights['api'] +
+                research_score * self.weights['research']
+            )
+
+            return total_score
+
+        def _calculate_api_score(self, content: str, query: str) -> float:
+            # Simplified API scoring based on keyword presence
+            content_lower = content.lower()
+            api_keywords = self.api_strategy.api_keywords
+
+            keyword_count = sum(1 for kw in api_keywords if kw in content_lower)
+            return min(keyword_count * 0.1, 2.0)
+
+    hybrid_strategy = HybridStrategy()
+
+    async with AsyncWebCrawler() as crawler:
+        adaptive = AdaptiveCrawler(crawler)
+
+        # Crawl with hybrid scoring
+        print("\nTesting hybrid strategy on technical documentation...")
+        state = await adaptive.digest(
+            start_url="https://docs.python.org/3/library/asyncio.html",
+            query="async await coroutines api"
+        )
+
+        # Analyze results with hybrid strategy
+        print(f"\nHybrid Strategy Analysis:")
+        total_score = 0
+        top_docs = adaptive.get_relevant_content(top_k=5)
+        for doc in top_docs:
+            content = doc['content'] or ""
+            score = hybrid_strategy.score_content(content, "async await api")
+            total_score += score
+            print(f"  - {doc['url'][:50]}... Score: {score:.2f}")
+
+        # Average over the documents actually returned (may be fewer than 5)
+        avg_score = total_score / len(top_docs) if top_docs else 0.0
+        print(f"\nAverage hybrid score: {avg_score:.2f}")
+
+
+async def demo_performance_optimization():
+    """Demonstrate performance optimization with custom strategies"""
+
+    print("\n" + "="*60)
+    print("EXAMPLE 3: Performance-Optimized Strategy")
+    print("="*60)
+
+    class PerformanceOptimizedStrategy:
+        """Strategy that balances thoroughness with speed"""
+
+        def __init__(self):
+            self.url_cache: Set[str] = set()
+            self.domain_scores: Dict[str, float] = {}
+
+        def should_crawl_domain(self, url: str) -> bool:
+            """Implement domain-level filtering"""
+            domain = url.split('/')[2] if url.startswith('http') else url
+
+            # Skip if we've already crawled many pages from this domain
+            domain_count = sum(1 for cached in self.url_cache if domain in cached)
+            if domain_count > 5:
+                return False
+
+            # Skip low-scoring domains
+            if domain in self.domain_scores and self.domain_scores[domain] < 0.3:
+                return False
+
+            return True
+
+        def update_domain_score(self, url: str, relevance: float):
+            """Track domain-level performance"""
+            domain = url.split('/')[2] if url.startswith('http') else url
+
+            if domain not in self.domain_scores:
+                self.domain_scores[domain] = relevance
+            else:
+                # Moving average
+                self.domain_scores[domain] = (
+                    0.7 * self.domain_scores[domain] + 0.3 * relevance
+                )
+
+    perf_strategy = PerformanceOptimizedStrategy()
+
+    async with AsyncWebCrawler() as crawler:
+        config = AdaptiveConfig(
+            confidence_threshold=0.7,
+            max_pages=10,
+            top_k_links=2  # Fewer links for speed
+        )
+
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        # Track performance
+        import time
+        start_time = time.time()
+
+        state = await adaptive.digest(
+            start_url="https://httpbin.org",
+            query="http methods headers"
+        )
+
+        elapsed = time.time() - start_time
+
+        print(f"\nPerformance Results:")
+        print(f"  - Time elapsed: {elapsed:.2f} seconds")
+        print(f"  - Pages crawled: {len(state.crawled_urls)}")
+        print(f"  - Pages per second: {len(state.crawled_urls)/elapsed:.2f}")
+        print(f"  - Final confidence: {adaptive.confidence:.2%}")
+        print(f"  - Efficiency: {adaptive.confidence/len(state.crawled_urls):.2%} confidence per page")
+
+
+async def main():
+    """Run all demonstrations"""
+    try:
+        await demo_custom_strategies()
+        await demo_performance_optimization()
+
+        print("\n" + "="*60)
+        print("All custom strategy examples completed!")
+        print("="*60)
+
+    except Exception as e:
+        print(f"Error: {e}")
+        import traceback
+        traceback.print_exc()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/embedding_configuration.py b/docs/examples/adaptive_crawling/embedding_configuration.py
new file mode 100644
index 00000000..2ca37dc5
--- /dev/null
+++ b/docs/examples/adaptive_crawling/embedding_configuration.py
@@ -0,0 +1,206 @@
+"""
+Advanced Embedding Configuration Example
+
+This example demonstrates all configuration options available for the
+embedding strategy, including fine-tuning parameters for different use cases.
+"""
+
+import asyncio
+import os
+from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
+
+
+async def test_configuration(name: str, config: AdaptiveConfig, url: str, query: str):
+    """Test a specific configuration"""
+    print(f"\n{'='*60}")
+    print(f"Configuration: {name}")
+    print(f"{'='*60}")
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        adaptive = AdaptiveCrawler(crawler, config)
+        result = await adaptive.digest(start_url=url, query=query)
+
+        print(f"Pages crawled: {len(result.crawled_urls)}")
+        print(f"Final confidence: {adaptive.confidence:.1%}")
+        print(f"Stopped reason: {result.metrics.get('stopped_reason', 'max_pages')}")
+
+        if result.metrics.get('is_irrelevant', False):
+            print("⚠️ Query detected as irrelevant!")
+
+    return result
+
+
+async def main():
+    """Demonstrate various embedding configurations"""
+
+    print("EMBEDDING STRATEGY CONFIGURATION EXAMPLES")
+    print("=" * 60)
+
+    # Base URL and query for testing
+    test_url = "https://docs.python.org/3/library/asyncio.html"
+
+    # 1. Default Configuration
+    config_default = AdaptiveConfig(
+        strategy="embedding",
+        max_pages=10
+    )
+
+    await test_configuration(
+        "Default Settings",
+        config_default,
+        test_url,
+        "async programming patterns"
+    )
+
+    # 2. Strict Coverage Requirements
+    config_strict = AdaptiveConfig(
+        strategy="embedding",
+        max_pages=20,
+
+        # Stricter similarity requirements
+        embedding_k_exp=5.0,  # Default is 3.0, higher = stricter
+        embedding_coverage_radius=0.15,  # Default is 0.2, lower = stricter
+
+        # Higher validation threshold
+        embedding_validation_min_score=0.6,  # Default is 0.3
+
+        # More query variations for better coverage
+        n_query_variations=15  # Default is 10
+    )
+
+    await test_configuration(
+        "Strict Coverage (Research/Academic)",
+        config_strict,
+        test_url,
+        "comprehensive guide async await"
+    )
+
+    # 3. Fast Exploration
+    config_fast = AdaptiveConfig(
+        strategy="embedding",
+        max_pages=10,
+        top_k_links=5,  # Follow more links per page
+
+        # Relaxed requirements for faster convergence
+        embedding_k_exp=1.0,  # Lower = more lenient
+        embedding_min_relative_improvement=0.05,  # Stop earlier
+
+        # Lower quality thresholds
+        embedding_quality_min_confidence=0.5,  # Display lower confidence
+        embedding_quality_max_confidence=0.85,
+
+        # Fewer query variations for speed
+        n_query_variations=5
+    )
+
+    await test_configuration(
+        "Fast Exploration (Quick Overview)",
+        config_fast,
+        test_url,
+        "async basics"
+    )
+
+    # 4. Irrelevance Detection Focus
+    config_irrelevance = AdaptiveConfig(
+        strategy="embedding",
+        max_pages=5,
+
+        # Aggressive irrelevance detection
+        embedding_min_confidence_threshold=0.2,  # Higher threshold (default 0.1)
+        embedding_k_exp=5.0,  # Strict similarity
+
+        # Quick stopping for irrelevant content
+        embedding_min_relative_improvement=0.15
+    )
+
+    await test_configuration(
+        "Irrelevance Detection",
+        config_irrelevance,
+        test_url,
+        "recipe for chocolate cake"  # Irrelevant query
+    )
+
+    # 5. High-Quality Knowledge Base
+    config_quality = AdaptiveConfig(
+        strategy="embedding",
+        max_pages=30,
+
+        # Deduplication settings
+        embedding_overlap_threshold=0.75,  # More aggressive deduplication
+
+        # Quality focus
+        embedding_validation_min_score=0.5,
+        embedding_quality_scale_factor=1.0,  # Linear quality mapping
+
+        # Balanced parameters
+        embedding_k_exp=3.0,
+        embedding_nearest_weight=0.8,  # Focus on best matches
+        embedding_top_k_weight=0.2
+    )
+
+    await test_configuration(
+        "High-Quality Knowledge Base",
+        config_quality,
+        test_url,
+        "asyncio advanced patterns best practices"
+    )
+
+    # 6. Custom Embedding Provider
+    if os.getenv('OPENAI_API_KEY'):
+        config_openai = AdaptiveConfig(
+            strategy="embedding",
+            max_pages=10,
+
+            # Use OpenAI embeddings
+            embedding_llm_config={
+                'provider': 'openai/text-embedding-3-small',
+                'api_token': os.getenv('OPENAI_API_KEY')
+            },
+
+            # OpenAI embeddings are high quality, can be stricter
+            embedding_k_exp=4.0,
+            n_query_variations=12
+        )
+
+        await test_configuration(
+            "OpenAI Embeddings",
+            config_openai,
+            test_url,
+            "event-driven architecture patterns"
+        )
+
+    # Parameter Guide
+    print("\n" + "="*60)
+    print("PARAMETER TUNING GUIDE")
+    print("="*60)
+
+    print("\n📊 Key Parameters and Their Effects:")
+    print("\n1. embedding_k_exp (default: 3.0)")
+    print("   - Lower (1-2): More lenient, faster convergence")
+    print("   - Higher (4-5): Stricter, better precision")
+
+    print("\n2. embedding_coverage_radius (default: 0.2)")
+    print("   - Lower (0.1-0.15): Requires closer matches")
+    print("   - Higher (0.25-0.3): Accepts broader matches")
+
+    print("\n3. n_query_variations (default: 10)")
+    print("   - Lower (5-7): Faster, less comprehensive")
+    print("   - Higher (15-20): Better coverage, slower")
+
+    print("\n4. embedding_min_confidence_threshold (default: 0.1)")
+    print("   - Set to 0.15-0.2 for aggressive irrelevance detection")
+    print("   - Set to 0.05 to crawl even barely relevant content")
+
+    print("\n5. embedding_validation_min_score (default: 0.3)")
+    print("   - Higher (0.5-0.6): Requires strong validation")
+    print("   - Lower (0.2): More permissive stopping")
+
+    print("\n💡 Tips:")
+    print("- For research: High k_exp, more variations, strict validation")
+    print("- For exploration: Low k_exp, fewer variations, relaxed thresholds")
+    print("- For quality: Focus on overlap_threshold and validation scores")
+    print("- For speed: Reduce variations, increase min_relative_improvement")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/embedding_strategy.py b/docs/examples/adaptive_crawling/embedding_strategy.py
new file mode 100644
index 00000000..ee3d88dc
--- /dev/null
+++ b/docs/examples/adaptive_crawling/embedding_strategy.py
@@ -0,0 +1,109 @@
+"""
+Embedding Strategy Example for Adaptive Crawling
+
+This example demonstrates how to use the embedding-based strategy
+for semantic understanding and intelligent crawling.
+"""
+
+import asyncio
+import os
+from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
+
+
+async def main():
+    """Demonstrate embedding strategy for adaptive crawling"""
+
+    # Configure embedding strategy
+    config = AdaptiveConfig(
+        strategy="embedding",  # Use embedding strategy
+        embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default model
+        n_query_variations=10,  # Generate 10 semantic variations
+        max_pages=15,
+        top_k_links=3,
+        min_gain_threshold=0.05,
+
+        # Embedding-specific parameters
+        embedding_k_exp=3.0,  # Higher = stricter similarity requirements
+        embedding_min_confidence_threshold=0.1,  # Stop if <10% relevant
+        embedding_validation_min_score=0.4  # Validation threshold
+    )
+
+    # Optional: Use OpenAI embeddings instead
+    if os.getenv('OPENAI_API_KEY'):
+        config.embedding_llm_config = {
+            'provider': 'openai/text-embedding-3-small',
+            'api_token': os.getenv('OPENAI_API_KEY')
+        }
+        print("Using OpenAI embeddings")
+    else:
+        print("Using sentence-transformers (local embeddings)")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        # Test 1: Relevant query with semantic understanding
+        print("\n" + "="*50)
+        print("TEST 1: Semantic Query Understanding")
+        print("="*50)
+
+        result = await adaptive.digest(
+            start_url="https://docs.python.org/3/library/asyncio.html",
+            query="concurrent programming event-driven architecture"
+        )
+
+        print("\nQuery Expansion:")
+        print(f"Original query expanded to {len(result.expanded_queries)} variations")
+        for i, q in enumerate(result.expanded_queries[:3], 1):
+            print(f"  {i}. {q}")
+        print("  ...")
+
+        print("\nResults:")
+        adaptive.print_stats(detailed=False)
+
+        # Test 2: Detecting irrelevant queries
+        print("\n" + "="*50)
+        print("TEST 2: Irrelevant Query Detection")
+        print("="*50)
+
+        # Reset crawler for new query
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        result = await adaptive.digest(
+            start_url="https://docs.python.org/3/library/asyncio.html",
+            query="how to bake chocolate chip cookies"
+        )
+
+        if result.metrics.get('is_irrelevant', False):
+            print("\n✅ Successfully detected irrelevant query!")
+            print(f"Stopped after just {len(result.crawled_urls)} pages")
+            print(f"Reason: {result.metrics.get('stopped_reason', 'unknown')}")
+        else:
+            print("\n❌ Failed to detect irrelevance")
+
+        print(f"Final confidence: {adaptive.confidence:.1%}")
+
+        # Test 3: Semantic gap analysis
+        print("\n" + "="*50)
+        print("TEST 3: Semantic Gap Analysis")
+        print("="*50)
+
+        # Show how embedding strategy identifies gaps
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        result = await adaptive.digest(
+            start_url="https://realpython.com",
+            query="python decorators advanced patterns"
+        )
+
+        print(f"\nSemantic gaps identified: {len(result.semantic_gaps)}")
+        print(f"Knowledge base embeddings shape: {result.kb_embeddings.shape if result.kb_embeddings is not None else 'None'}")
+
+        # Show coverage metrics specific to embedding strategy
+        print("\nEmbedding-specific metrics:")
+        print(f"  Average best similarity: {result.metrics.get('avg_best_similarity', 0):.3f}")
+        print(f"  Coverage score: {result.metrics.get('coverage_score', 0):.3f}")
+        print(f"  Validation confidence: {result.metrics.get('validation_confidence', 0):.2%}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/embedding_vs_statistical.py b/docs/examples/adaptive_crawling/embedding_vs_statistical.py
new file mode 100644
index 00000000..5ee5075e
--- /dev/null
+++ b/docs/examples/adaptive_crawling/embedding_vs_statistical.py
@@ -0,0 +1,167 @@
+"""
+Comparison: Embedding vs Statistical Strategy
+
+This example demonstrates the differences between statistical and embedding
+strategies for adaptive crawling, showing when to use each approach.
+"""
+
+import asyncio
+import time
+import os
+from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
+
+
+async def crawl_with_strategy(url: str, query: str, strategy: str, **kwargs):
+    """Helper function to crawl with a specific strategy"""
+    config = AdaptiveConfig(
+        strategy=strategy,
+        max_pages=20,
+        top_k_links=3,
+        min_gain_threshold=0.05,
+        **kwargs
+    )
+
+    async with AsyncWebCrawler(verbose=False) as crawler:
+        adaptive = AdaptiveCrawler(crawler, config)
+
+        start_time = time.time()
+        result = await adaptive.digest(start_url=url, query=query)
+        elapsed = time.time() - start_time
+
+        return {
+            'result': result,
+            'crawler': adaptive,
+            'elapsed': elapsed,
+            'pages': len(result.crawled_urls),
+            'confidence': adaptive.confidence
+        }
+
+
+async def main():
+    """Compare embedding and statistical strategies"""
+
+    # Test scenarios
+    test_cases = [
+        {
+            'name': 'Technical Documentation (Specific Terms)',
+            'url': 'https://docs.python.org/3/library/asyncio.html',
+            'query': 'asyncio.create_task event_loop.run_until_complete'
+        },
+        {
+            'name': 'Conceptual Query (Semantic Understanding)',
+            'url': 'https://docs.python.org/3/library/asyncio.html',
+            'query': 'concurrent programming patterns'
+        },
+        {
+            'name': 'Ambiguous Query',
+            'url': 'https://realpython.com',
+            'query': 'python performance optimization'
+        }
+    ]
+
+    # Configure embedding strategy
+    embedding_config = {}
+    if os.getenv('OPENAI_API_KEY'):
+        embedding_config['embedding_llm_config'] = {
+            'provider': 'openai/text-embedding-3-small',
+            'api_token': os.getenv('OPENAI_API_KEY')
+        }
+
+    for test in test_cases:
+        print("\n" + "="*70)
+        print(f"TEST: {test['name']}")
+        print(f"URL: {test['url']}")
+        print(f"Query: '{test['query']}'")
+        print("="*70)
+
+        # Run statistical strategy
+        print("\n📊 Statistical Strategy:")
+        stat_result = await crawl_with_strategy(
+            test['url'],
+            test['query'],
+            'statistical'
+        )
+
+        print(f"   Pages crawled: {stat_result['pages']}")
+        print(f"   Time taken: {stat_result['elapsed']:.2f}s")
+        print(f"   Confidence: {stat_result['confidence']:.1%}")
+        print(f"   Sufficient: {'Yes' if stat_result['crawler'].is_sufficient else 'No'}")
+
+        # Show term coverage
+        if hasattr(stat_result['result'], 'term_frequencies'):
+            query_terms = test['query'].lower().split()
+            covered = sum(1 for term in query_terms
+                          if term in stat_result['result'].term_frequencies)
+            print(f"   Term coverage: {covered}/{len(query_terms)} query terms found")
+
+        # Run embedding strategy
+        print("\n🧠 Embedding Strategy:")
+        emb_result = await crawl_with_strategy(
+            test['url'],
+            test['query'],
+            'embedding',
+            **embedding_config
+        )
+
+        print(f"   Pages crawled: {emb_result['pages']}")
+        print(f"   Time taken: {emb_result['elapsed']:.2f}s")
+        print(f"   Confidence: {emb_result['confidence']:.1%}")
+        print(f"   Sufficient: {'Yes' if emb_result['crawler'].is_sufficient else 'No'}")
+
+        # Show semantic understanding
+        if emb_result['result'].expanded_queries:
+            print(f"   Query variations: {len(emb_result['result'].expanded_queries)}")
+            print(f"   Semantic gaps: {len(emb_result['result'].semantic_gaps)}")
+
+        # Compare results
+        print("\n📈 Comparison:")
+        efficiency_diff = ((stat_result['pages'] - emb_result['pages']) /
+                           stat_result['pages'] * 100) if stat_result['pages'] > 0 else 0
+
+        print(f"   Efficiency: ", end="")
+        if efficiency_diff > 0:
+            print(f"Embedding used {efficiency_diff:.0f}% fewer pages")
+        else:
+            print(f"Statistical used {-efficiency_diff:.0f}% fewer pages")
+
+        print(f"   Speed: ", end="")
+        if stat_result['elapsed'] < emb_result['elapsed']:
+            print(f"Statistical was {emb_result['elapsed']/stat_result['elapsed']:.1f}x faster")
+        else:
+            print(f"Embedding was {stat_result['elapsed']/emb_result['elapsed']:.1f}x faster")
+
+        print(f"   Confidence difference: {abs(stat_result['confidence'] - emb_result['confidence'])*100:.0f} percentage points")
+
+        # Recommendation
+        print("\n💡 Recommendation:")
+        if 'specific' in test['name'].lower() or all(len(term) > 5 for term in test['query'].split()):
+            print("   → Statistical strategy is likely better for this use case (specific terms)")
+        elif 'conceptual' in test['name'].lower() or 'semantic' in test['name'].lower():
+            print("   → Embedding strategy is likely better for this use case (semantic understanding)")
+        else:
+            if emb_result['confidence'] > stat_result['confidence'] + 0.1:
+                print("   → Embedding strategy achieved significantly better understanding")
+            elif stat_result['elapsed'] < emb_result['elapsed'] / 2:
+                print("   → Statistical strategy is much faster with similar results")
+            else:
+                print("   → Both strategies performed similarly; choose based on your priorities")
+
+    # Summary recommendations
+    print("\n" + "="*70)
+    print("STRATEGY SELECTION GUIDE")
+    print("="*70)
+    print("\n✅ Use STATISTICAL strategy when:")
+    print("   - Queries contain specific technical terms")
+    print("   - Speed is critical")
+    print("   - No API access available")
+    print("   - Working with well-structured documentation")
+
+    print("\n✅ Use EMBEDDING strategy when:")
+    print("   - Queries are conceptual or ambiguous")
+    print("   - Semantic understanding is important")
+    print("   - Need to detect irrelevant content")
+    print("   - Working with diverse content sources")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
\ No newline at end of file
diff --git a/docs/examples/adaptive_crawling/export_import_kb.py b/docs/examples/adaptive_crawling/export_import_kb.py
new file mode 100644
index 00000000..c0a72c2c
--- /dev/null
+++ b/docs/examples/adaptive_crawling/export_import_kb.py
@@ -0,0 +1,232 @@
+"""
+Knowledge Base Export and Import
+
+This example demonstrates how to export crawled knowledge bases and
+import them for reuse, sharing, or analysis. +""" + +import asyncio +import json +from pathlib import Path +from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig + + +async def build_knowledge_base(): + """Build a knowledge base about web technologies""" + print("="*60) + print("PHASE 1: Building Knowledge Base") + print("="*60) + + async with AsyncWebCrawler(verbose=False) as crawler: + adaptive = AdaptiveCrawler(crawler) + + # Crawl information about HTTP + print("\n1. Gathering HTTP protocol information...") + await adaptive.digest( + start_url="https://httpbin.org", + query="http methods headers status codes" + ) + print(f" - Pages crawled: {len(adaptive.state.crawled_urls)}") + print(f" - Confidence: {adaptive.confidence:.2%}") + + # Add more information about APIs + print("\n2. Adding API documentation knowledge...") + await adaptive.digest( + start_url="https://httpbin.org/anything", + query="rest api json response request" + ) + print(f" - Total pages: {len(adaptive.state.crawled_urls)}") + print(f" - Confidence: {adaptive.confidence:.2%}") + + # Export the knowledge base + export_path = "web_tech_knowledge.jsonl" + print(f"\n3. 
Exporting knowledge base to {export_path}") + adaptive.export_knowledge_base(export_path) + + # Show export statistics + export_size = Path(export_path).stat().st_size / 1024 + with open(export_path, 'r') as f: + line_count = sum(1 for _ in f) + + print(f" - Exported {line_count} documents") + print(f" - File size: {export_size:.1f} KB") + + return export_path + + +async def analyze_knowledge_base(kb_path): + """Analyze the exported knowledge base""" + print("\n" + "="*60) + print("PHASE 2: Analyzing Exported Knowledge Base") + print("="*60) + + # Read and analyze JSONL + documents = [] + with open(kb_path, 'r') as f: + for line in f: + documents.append(json.loads(line)) + + print(f"\nKnowledge base contains {len(documents)} documents:") + + # Analyze document properties + total_content_length = 0 + urls_by_domain = {} + + for doc in documents: + # Content analysis + content_length = len(doc.get('content', '')) + total_content_length += content_length + + # URL analysis + url = doc.get('url', '') + domain = url.split('/')[2] if url.startswith('http') else 'unknown' + urls_by_domain[domain] = urls_by_domain.get(domain, 0) + 1 + + # Show sample document + if documents.index(doc) == 0: + print(f"\nSample document structure:") + print(f" - URL: {url}") + print(f" - Content length: {content_length} chars") + print(f" - Has metadata: {'metadata' in doc}") + print(f" - Has links: {len(doc.get('links', []))} links") + print(f" - Query: {doc.get('query', 'N/A')}") + + print(f"\nContent statistics:") + print(f" - Total content: {total_content_length:,} characters") + print(f" - Average per document: {total_content_length/len(documents):,.0f} chars") + + print(f"\nDomain distribution:") + for domain, count in urls_by_domain.items(): + print(f" - {domain}: {count} pages") + + +async def import_and_continue(): + """Import a knowledge base and continue crawling""" + print("\n" + "="*60) + print("PHASE 3: Importing and Extending Knowledge Base") + print("="*60) + + kb_path = 
"web_tech_knowledge.jsonl" + + async with AsyncWebCrawler(verbose=False) as crawler: + # Create new adaptive crawler + adaptive = AdaptiveCrawler(crawler) + + # Import existing knowledge base + print(f"\n1. Importing knowledge base from {kb_path}") + adaptive.import_knowledge_base(kb_path) + + print(f" - Imported {len(adaptive.state.knowledge_base)} documents") + print(f" - Existing URLs: {len(adaptive.state.crawled_urls)}") + + # Check current state + print("\n2. Checking imported knowledge state:") + adaptive.print_stats(detailed=False) + + # Continue crawling with new query + print("\n3. Extending knowledge with new query...") + await adaptive.digest( + start_url="https://httpbin.org/status/200", + query="error handling retry timeout" + ) + + print("\n4. Final knowledge base state:") + adaptive.print_stats(detailed=False) + + # Export extended knowledge base + extended_path = "web_tech_knowledge_extended.jsonl" + adaptive.export_knowledge_base(extended_path) + print(f"\n5. Extended knowledge base exported to {extended_path}") + + +async def share_knowledge_bases(): + """Demonstrate sharing knowledge bases between projects""" + print("\n" + "="*60) + print("PHASE 4: Sharing Knowledge Between Projects") + print("="*60) + + # Simulate two different projects + project_a_kb = "project_a_knowledge.jsonl" + project_b_kb = "project_b_knowledge.jsonl" + + async with AsyncWebCrawler(verbose=False) as crawler: + # Project A: Security documentation + print("\n1. Project A: Building security knowledge...") + crawler_a = AdaptiveCrawler(crawler) + await crawler_a.digest( + start_url="https://httpbin.org/basic-auth/user/pass", + query="authentication security headers" + ) + crawler_a.export_knowledge_base(project_a_kb) + print(f" - Exported {len(crawler_a.state.knowledge_base)} documents") + + # Project B: API testing + print("\n2. 
Project B: Building testing knowledge...") + crawler_b = AdaptiveCrawler(crawler) + await crawler_b.digest( + start_url="https://httpbin.org/anything", + query="testing endpoints mocking" + ) + crawler_b.export_knowledge_base(project_b_kb) + print(f" - Exported {len(crawler_b.state.knowledge_base)} documents") + + # Merge knowledge bases + print("\n3. Merging knowledge bases...") + merged_crawler = AdaptiveCrawler(crawler) + + # Import both knowledge bases + merged_crawler.import_knowledge_base(project_a_kb) + initial_size = len(merged_crawler.state.knowledge_base) + + merged_crawler.import_knowledge_base(project_b_kb) + final_size = len(merged_crawler.state.knowledge_base) + + print(f" - Project A documents: {initial_size}") + print(f" - Additional from Project B: {final_size - initial_size}") + print(f" - Total merged documents: {final_size}") + + # Export merged knowledge + merged_kb = "merged_knowledge.jsonl" + merged_crawler.export_knowledge_base(merged_kb) + print(f"\n4. Merged knowledge base exported to {merged_kb}") + + # Show combined coverage + print("\n5. 
Combined knowledge coverage:") + merged_crawler.print_stats(detailed=False) + + +async def main(): + """Run all examples""" + try: + # Build initial knowledge base + kb_path = await build_knowledge_base() + + # Analyze the export + await analyze_knowledge_base(kb_path) + + # Import and extend + await import_and_continue() + + # Demonstrate sharing + await share_knowledge_bases() + + print("\n" + "="*60) + print("All examples completed successfully!") + print("="*60) + + finally: + # Clean up generated files + print("\nCleaning up generated files...") + for file in [ + "web_tech_knowledge.jsonl", + "web_tech_knowledge_extended.jsonl", + "project_a_knowledge.jsonl", + "project_b_knowledge.jsonl", + "merged_knowledge.jsonl" + ]: + Path(file).unlink(missing_ok=True) + print("Cleanup complete.") + + +if __name__ == "__main__": + asyncio.run(main()) \ No newline at end of file diff --git a/docs/md_v2/advanced/adaptive-strategies.md b/docs/md_v2/advanced/adaptive-strategies.md new file mode 100644 index 00000000..4ab5b4cd --- /dev/null +++ b/docs/md_v2/advanced/adaptive-strategies.md @@ -0,0 +1,432 @@ +# Advanced Adaptive Strategies + +## Overview + +While the default adaptive crawling configuration works well for most use cases, understanding the underlying strategies and scoring mechanisms allows you to fine-tune the crawler for specific domains and requirements. + +## The Three-Layer Scoring System + +### 1. Coverage Score + +Coverage measures how comprehensively your knowledge base covers the query terms and related concepts. 
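As a rough illustration of what this score captures, here is a toy version; `doc_coverage` and `freq_boost` below are simplified stand-ins, not the library's exact internals:

```python
import math

def coverage(documents, query_terms):
    """Toy coverage score: average over query terms of how widely (fraction of
    documents) and how often (log-damped frequency) each term appears.
    Illustrative only -- the actual scoring implementation may differ."""
    if not query_terms:
        return 0.0
    total = 0.0
    for term in query_terms:
        containing = [d for d in documents if term in d.lower()]
        doc_coverage = len(containing) / len(documents)      # spread across docs
        occurrences = sum(d.lower().count(term) for d in documents)
        freq_boost = math.log(occurrences + 1) / 10          # mild frequency bonus
        total += doc_coverage * (1 + freq_boost)
    return total / len(query_terms)

docs = ["async context managers simplify cleanup",
        "await is used inside async functions"]
print(round(coverage(docs, ["async", "await"]), 2))  # → 0.82
```

Terms present in every document contribute at least 1 to the sum, while absent terms contribute 0, so broad, well-distributed coverage of the query pushes the score up.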
+ +#### Mathematical Foundation + +```python +Coverage(K, Q) = Σ(t ∈ Q) score(t, K) / |Q| + +where score(t, K) = doc_coverage(t) × (1 + freq_boost(t)) +``` + +#### Components + +- **Document Coverage**: Percentage of documents containing the term +- **Frequency Boost**: Logarithmic bonus for term frequency +- **Query Decomposition**: Handles multi-word queries intelligently + +#### Tuning Coverage + +```python +# For technical documentation with specific terminology +config = AdaptiveConfig( + confidence_threshold=0.85, # Require high coverage + top_k_links=5 # Cast wider net +) + +# For general topics with synonyms +config = AdaptiveConfig( + confidence_threshold=0.6, # Lower threshold + top_k_links=2 # More focused +) +``` + +### 2. Consistency Score + +Consistency evaluates whether the information across pages is coherent and non-contradictory. + +#### How It Works + +1. Extracts key statements from each document +2. Compares statements across documents +3. Measures agreement vs. contradiction +4. Returns normalized score (0-1) + +#### Practical Impact + +- **High consistency (>0.8)**: Information is reliable and coherent +- **Medium consistency (0.5-0.8)**: Some variation, but generally aligned +- **Low consistency (<0.5)**: Conflicting information, need more sources + +### 3. Saturation Score + +Saturation detects when new pages stop providing novel information. + +#### Detection Algorithm + +```python +# Tracks new unique terms per page +new_terms_page_1 = 50 +new_terms_page_2 = 30 # 60% of first +new_terms_page_3 = 15 # 50% of second +new_terms_page_4 = 5 # 33% of third +# Saturation detected: rapidly diminishing returns +``` + +#### Configuration + +```python +config = AdaptiveConfig( + min_gain_threshold=0.1 # Stop if <10% new information +) +``` + +## Link Ranking Algorithm + +### Expected Information Gain + +Each uncrawled link is scored based on: + +```python +ExpectedGain(link) = Relevance × Novelty × Authority +``` + +#### 1. 
Relevance Scoring + +Uses BM25 algorithm on link preview text: + +```python +relevance = BM25(link.preview_text, query) +``` + +Factors: +- Term frequency in preview +- Inverse document frequency +- Preview length normalization + +#### 2. Novelty Estimation + +Measures how different the link appears from already-crawled content: + +```python +novelty = 1 - max_similarity(preview, knowledge_base) +``` + +Prevents crawling duplicate or highly similar pages. + +#### 3. Authority Calculation + +URL structure and domain analysis: + +```python +authority = f(domain_rank, url_depth, url_structure) +``` + +Factors: +- Domain reputation +- URL depth (fewer slashes = higher authority) +- Clean URL structure + +### Custom Link Scoring + +```python +class CustomLinkScorer: + def score(self, link: Link, query: str, state: CrawlState) -> float: + # Prioritize specific URL patterns + if "/api/reference/" in link.href: + return 2.0 # Double the score + + # Deprioritize certain sections + if "/archive/" in link.href: + return 0.1 # Reduce score by 90% + + # Default scoring + return 1.0 + +# Use with adaptive crawler +adaptive = AdaptiveCrawler( + crawler, + config=config, + link_scorer=CustomLinkScorer() +) +``` + +## Domain-Specific Configurations + +### Technical Documentation + +```python +tech_doc_config = AdaptiveConfig( + confidence_threshold=0.85, + max_pages=30, + top_k_links=3, + min_gain_threshold=0.05 # Keep crawling for small gains +) +``` + +Rationale: +- High threshold ensures comprehensive coverage +- Lower gain threshold captures edge cases +- Moderate link following for depth + +### News & Articles + +```python +news_config = AdaptiveConfig( + confidence_threshold=0.6, + max_pages=10, + top_k_links=5, + min_gain_threshold=0.15 # Stop quickly on repetition +) +``` + +Rationale: +- Lower threshold (articles often repeat information) +- Higher gain threshold (avoid duplicate stories) +- More links per page (explore different perspectives) + +### E-commerce + 
+```python +ecommerce_config = AdaptiveConfig( + confidence_threshold=0.7, + max_pages=20, + top_k_links=2, + min_gain_threshold=0.1 +) +``` + +Rationale: +- Balanced threshold for product variations +- Focused link following (avoid infinite products) +- Standard gain threshold + +### Research & Academic + +```python +research_config = AdaptiveConfig( + confidence_threshold=0.9, + max_pages=50, + top_k_links=4, + min_gain_threshold=0.02 # Very low - capture citations +) +``` + +Rationale: +- Very high threshold for completeness +- Many pages allowed for thorough research +- Very low gain threshold to capture references + +## Performance Optimization + +### Memory Management + +```python +# For large crawls, use streaming +config = AdaptiveConfig( + max_pages=100, + save_state=True, + state_path="large_crawl.json" +) + +# Periodically clean state +if len(state.knowledge_base) > 1000: + # Keep only most relevant + state.knowledge_base = get_top_relevant(state.knowledge_base, 500) +``` + +### Parallel Processing + +```python +# Use multiple start points +start_urls = [ + "https://docs.example.com/intro", + "https://docs.example.com/api", + "https://docs.example.com/guides" +] + +# Crawl in parallel +tasks = [ + adaptive.digest(url, query) + for url in start_urls +] +results = await asyncio.gather(*tasks) +``` + +### Caching Strategy + +```python +# Enable caching for repeated crawls +async with AsyncWebCrawler( + config=BrowserConfig( + cache_mode=CacheMode.ENABLED + ) +) as crawler: + adaptive = AdaptiveCrawler(crawler, config) +``` + +## Debugging & Analysis + +### Enable Verbose Logging + +```python +import logging + +logging.basicConfig(level=logging.DEBUG) +adaptive = AdaptiveCrawler(crawler, config, verbose=True) +``` + +### Analyze Crawl Patterns + +```python +# After crawling +state = await adaptive.digest(start_url, query) + +# Analyze link selection +print("Link selection order:") +for i, url in enumerate(state.crawl_order): + print(f"{i+1}. 
{url}") + +# Analyze term discovery +print("\nTerm discovery rate:") +for i, new_terms in enumerate(state.new_terms_history): + print(f"Page {i+1}: {new_terms} new terms") + +# Analyze score progression +print("\nScore progression:") +print(f"Coverage: {state.metrics['coverage_history']}") +print(f"Saturation: {state.metrics['saturation_history']}") +``` + +### Export for Analysis + +```python +# Export detailed metrics +import json + +metrics = { + "query": query, + "total_pages": len(state.crawled_urls), + "confidence": adaptive.confidence, + "coverage_stats": adaptive.coverage_stats, + "crawl_order": state.crawl_order, + "term_frequencies": dict(state.term_frequencies), + "new_terms_history": state.new_terms_history +} + +with open("crawl_analysis.json", "w") as f: + json.dump(metrics, f, indent=2) +``` + +## Custom Strategies + +### Implementing a Custom Strategy + +```python +from crawl4ai.adaptive_crawler import BaseStrategy + +class DomainSpecificStrategy(BaseStrategy): + def calculate_coverage(self, state: CrawlState) -> float: + # Custom coverage calculation + # e.g., weight certain terms more heavily + pass + + def calculate_consistency(self, state: CrawlState) -> float: + # Custom consistency logic + # e.g., domain-specific validation + pass + + def rank_links(self, links: List[Link], state: CrawlState) -> List[Link]: + # Custom link ranking + # e.g., prioritize specific URL patterns + pass + +# Use custom strategy +adaptive = AdaptiveCrawler( + crawler, + config=config, + strategy=DomainSpecificStrategy() +) +``` + +### Combining Strategies + +```python +class HybridStrategy(BaseStrategy): + def __init__(self): + self.strategies = [ + TechnicalDocStrategy(), + SemanticSimilarityStrategy(), + URLPatternStrategy() + ] + + def calculate_confidence(self, state: CrawlState) -> float: + # Weighted combination of strategies + scores = [s.calculate_confidence(state) for s in self.strategies] + weights = [0.5, 0.3, 0.2] + return sum(s * w for s, w in zip(scores, 
weights)) +``` + +## Best Practices + +### 1. Start Conservative + +Begin with default settings and adjust based on results: + +```python +# Start with defaults +result = await adaptive.digest(url, query) + +# Analyze and adjust +if adaptive.confidence < 0.7: + config.max_pages += 10 + config.confidence_threshold -= 0.1 +``` + +### 2. Monitor Resource Usage + +```python +import psutil + +# Check memory before large crawls +memory_percent = psutil.virtual_memory().percent +if memory_percent > 80: + config.max_pages = min(config.max_pages, 20) +``` + +### 3. Use Domain Knowledge + +```python +# For API documentation +if "api" in start_url: + config.top_k_links = 2 # APIs have clear structure + +# For blogs +if "blog" in start_url: + config.min_gain_threshold = 0.2 # Avoid similar posts +``` + +### 4. Validate Results + +```python +# Always validate the knowledge base +relevant_content = adaptive.get_relevant_content(top_k=10) + +# Check coverage +query_terms = set(query.lower().split()) +covered_terms = set() + +for doc in relevant_content: + content_lower = doc['content'].lower() + for term in query_terms: + if term in content_lower: + covered_terms.add(term) + +coverage_ratio = len(covered_terms) / len(query_terms) +print(f"Query term coverage: {coverage_ratio:.0%}") +``` + +## Next Steps + +- Explore [Custom Strategy Implementation](../tutorials/custom-adaptive-strategies.md) +- Learn about [Knowledge Base Management](../tutorials/knowledge-base-management.md) +- See [Performance Benchmarks](../benchmarks/adaptive-performance.md) \ No newline at end of file diff --git a/docs/md_v2/api/adaptive-crawler.md b/docs/md_v2/api/adaptive-crawler.md new file mode 100644 index 00000000..af92ee3a --- /dev/null +++ b/docs/md_v2/api/adaptive-crawler.md @@ -0,0 +1,244 @@ +# AdaptiveCrawler + +The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. 
It uses a three-layer scoring system to evaluate coverage, consistency, and saturation. + +## Constructor + +```python +AdaptiveCrawler( + crawler: AsyncWebCrawler, + config: Optional[AdaptiveConfig] = None +) +``` + +### Parameters + +- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages +- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings. + +## Primary Method + +### digest() + +The main method that performs adaptive crawling starting from a URL with a specific query. + +```python +async def digest( + start_url: str, + query: str, + resume_from: Optional[Union[str, Path]] = None +) -> CrawlState +``` + +#### Parameters + +- **start_url** (`str`): The starting URL for crawling +- **query** (`str`): The search query that guides the crawling process +- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from + +#### Returns + +- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics + +#### Example + +```python +async with AsyncWebCrawler() as crawler: + adaptive = AdaptiveCrawler(crawler) + state = await adaptive.digest( + start_url="https://docs.python.org", + query="async context managers" + ) +``` + +## Properties + +### confidence + +Current confidence score (0-1) indicating information sufficiency. + +```python +@property +def confidence(self) -> float +``` + +### coverage_stats + +Dictionary containing detailed coverage statistics. + +```python +@property +def coverage_stats(self) -> Dict[str, float] +``` + +Returns: +- **coverage**: Query term coverage score +- **consistency**: Information consistency score +- **saturation**: Content saturation score +- **confidence**: Overall confidence score + +### is_sufficient + +Boolean indicating whether sufficient information has been gathered. 
+ +```python +@property +def is_sufficient(self) -> bool +``` + +### state + +Access to the current crawl state. + +```python +@property +def state(self) -> CrawlState +``` + +## Methods + +### get_relevant_content() + +Retrieve the most relevant content from the knowledge base. + +```python +def get_relevant_content( + self, + top_k: int = 5 +) -> List[Dict[str, Any]] +``` + +#### Parameters + +- **top_k** (`int`): Number of top relevant documents to return (default: 5) + +#### Returns + +List of dictionaries containing: +- **url**: The URL of the page +- **content**: The page content +- **score**: Relevance score +- **metadata**: Additional page metadata + +### print_stats() + +Display crawl statistics in formatted output. + +```python +def print_stats( + self, + detailed: bool = False +) -> None +``` + +#### Parameters + +- **detailed** (`bool`): If True, shows detailed metrics with colors. If False, shows summary table. + +### export_knowledge_base() + +Export the collected knowledge base to a JSONL file. + +```python +def export_knowledge_base( + self, + path: Union[str, Path] +) -> None +``` + +#### Parameters + +- **path** (`Union[str, Path]`): Output file path for JSONL export + +#### Example + +```python +adaptive.export_knowledge_base("my_knowledge.jsonl") +``` + +### import_knowledge_base() + +Import a previously exported knowledge base. 
+ +```python +def import_knowledge_base( + self, + path: Union[str, Path] +) -> None +``` + +#### Parameters + +- **path** (`Union[str, Path]`): Path to JSONL file to import + +## Configuration + +The `AdaptiveConfig` class controls the behavior of adaptive crawling: + +```python +@dataclass +class AdaptiveConfig: + confidence_threshold: float = 0.8 # Stop when confidence reaches this + max_pages: int = 50 # Maximum pages to crawl + top_k_links: int = 5 # Links to follow per page + min_gain_threshold: float = 0.1 # Minimum expected gain to continue + save_state: bool = False # Auto-save crawl state + state_path: Optional[str] = None # Path for state persistence +``` + +### Example with Custom Config + +```python +config = AdaptiveConfig( + confidence_threshold=0.7, + max_pages=20, + top_k_links=3 +) + +adaptive = AdaptiveCrawler(crawler, config=config) +``` + +## Complete Example + +```python +import asyncio +from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig + +async def main(): + # Configure adaptive crawling + config = AdaptiveConfig( + confidence_threshold=0.75, + max_pages=15, + save_state=True, + state_path="my_crawl.json" + ) + + async with AsyncWebCrawler() as crawler: + adaptive = AdaptiveCrawler(crawler, config) + + # Start crawling + state = await adaptive.digest( + start_url="https://example.com/docs", + query="authentication oauth2 jwt" + ) + + # Check results + print(f"Confidence achieved: {adaptive.confidence:.0%}") + adaptive.print_stats() + + # Get most relevant pages + for page in adaptive.get_relevant_content(top_k=3): + print(f"- {page['url']} (score: {page['score']:.2f})") + + # Export for later use + adaptive.export_knowledge_base("auth_knowledge.jsonl") + +if __name__ == "__main__": + asyncio.run(main()) +``` + +## See Also + +- [digest() Method Reference](digest.md) +- [Adaptive Crawling Guide](../core/adaptive-crawling.md) +- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md) \ No newline at end of file 
diff --git a/docs/md_v2/api/digest.md b/docs/md_v2/api/digest.md new file mode 100644 index 00000000..9256f526 --- /dev/null +++ b/docs/md_v2/api/digest.md @@ -0,0 +1,181 @@ +# digest() + +The `digest()` method is the primary interface for adaptive web crawling. It intelligently crawls websites starting from a given URL, guided by a query, and automatically determines when sufficient information has been gathered. + +## Method Signature + +```python +async def digest( + start_url: str, + query: str, + resume_from: Optional[Union[str, Path]] = None +) -> CrawlState +``` + +## Parameters + +### start_url +- **Type**: `str` +- **Required**: Yes +- **Description**: The starting URL for the crawl. This should be a valid HTTP/HTTPS URL that serves as the entry point for information gathering. + +### query +- **Type**: `str` +- **Required**: Yes +- **Description**: The search query that guides the crawling process. This should contain key terms related to the information you're seeking. The crawler uses this to evaluate relevance and determine which links to follow. + +### resume_from +- **Type**: `Optional[Union[str, Path]]` +- **Default**: `None` +- **Description**: Path to a previously saved crawl state file. When provided, the crawler resumes from the saved state instead of starting fresh. + +## Return Value + +Returns a `CrawlState` object containing: + +- **crawled_urls** (`Set[str]`): All URLs that have been crawled +- **knowledge_base** (`List[CrawlResult]`): Collection of crawled pages with content +- **pending_links** (`List[Link]`): Links discovered but not yet crawled +- **metrics** (`Dict[str, float]`): Performance and quality metrics +- **query** (`str`): The original query +- Additional statistical information for scoring + +## How It Works + +The `digest()` method implements an intelligent crawling algorithm: + +1. **Initial Crawl**: Starts from the provided URL +2. **Link Analysis**: Evaluates all discovered links for relevance +3. 
**Scoring**: Uses three metrics to assess information sufficiency: + - **Coverage**: How well the query terms are covered + - **Consistency**: Information coherence across pages + - **Saturation**: Diminishing returns detection +4. **Adaptive Selection**: Chooses the most promising links to follow +5. **Stopping Decision**: Automatically stops when confidence threshold is reached + +## Examples + +### Basic Usage + +```python +async with AsyncWebCrawler() as crawler: + adaptive = AdaptiveCrawler(crawler) + + state = await adaptive.digest( + start_url="https://docs.python.org/3/", + query="async await context managers" + ) + + print(f"Crawled {len(state.crawled_urls)} pages") + print(f"Confidence: {adaptive.confidence:.0%}") +``` + +### With Configuration + +```python +config = AdaptiveConfig( + confidence_threshold=0.9, # Require high confidence + max_pages=30, # Allow more pages + top_k_links=3 # Follow top 3 links per page +) + +adaptive = AdaptiveCrawler(crawler, config=config) + +state = await adaptive.digest( + start_url="https://api.example.com/docs", + query="authentication endpoints rate limits" +) +``` + +### Resuming a Previous Crawl + +```python +# First crawl - may be interrupted +state1 = await adaptive.digest( + start_url="https://example.com", + query="machine learning algorithms" +) + +# Save state (if not auto-saved) +state1.save("ml_crawl_state.json") + +# Later, resume from saved state +state2 = await adaptive.digest( + start_url="https://example.com", + query="machine learning algorithms", + resume_from="ml_crawl_state.json" +) +``` + +### With Progress Monitoring + +```python +state = await adaptive.digest( + start_url="https://docs.example.com", + query="api reference" +) + +# Monitor progress +print(f"Pages crawled: {len(state.crawled_urls)}") +print(f"New terms discovered: {state.new_terms_history}") +print(f"Final confidence: {adaptive.confidence:.2%}") + +# View detailed statistics +adaptive.print_stats(detailed=True) +``` + +## Query Best 
Practices + +1. **Be Specific**: Use descriptive terms that appear in target content + ```python + # Good + query = "python async context managers implementation" + + # Too broad + query = "python programming" + ``` + +2. **Include Key Terms**: Add technical terms you expect to find + ```python + query = "oauth2 jwt refresh tokens authorization" + ``` + +3. **Multiple Concepts**: Combine related concepts for comprehensive coverage + ```python + query = "rest api pagination sorting filtering" + ``` + +## Performance Considerations + +- **Initial URL**: Choose a page with good navigation (e.g., documentation index) +- **Query Length**: 3-8 terms typically work best +- **Link Density**: Sites with clear navigation crawl more efficiently +- **Caching**: Enable caching for repeated crawls of the same domain + +## Error Handling + +```python +try: + state = await adaptive.digest( + start_url="https://example.com", + query="search terms" + ) +except Exception as e: + print(f"Crawl failed: {e}") + # State is auto-saved if save_state=True in config +``` + +## Stopping Conditions + +The crawl stops when any of these conditions are met: + +1. **Confidence Threshold**: Reached the configured confidence level +2. **Page Limit**: Crawled the maximum number of pages +3. **Diminishing Returns**: Expected information gain below threshold +4. 
**No Relevant Links**: No promising links remain to follow + +## See Also + +- [AdaptiveCrawler Class](adaptive-crawler.md) +- [Adaptive Crawling Guide](../core/adaptive-crawling.md) +- [Configuration Options](../core/adaptive-crawling.md#configuration-options) \ No newline at end of file diff --git a/docs/md_v2/blog/articles/adaptive-crawling-revolution.md b/docs/md_v2/blog/articles/adaptive-crawling-revolution.md new file mode 100644 index 00000000..c2b06750 --- /dev/null +++ b/docs/md_v2/blog/articles/adaptive-crawling-revolution.md @@ -0,0 +1,369 @@ +# Adaptive Crawling: Building Dynamic Knowledge That Grows on Demand + +*Published on January 29, 2025 • 8 min read* + +*By [unclecode](https://x.com/unclecode) • Follow me on [X/Twitter](https://x.com/unclecode) for more web scraping insights* + +--- + +## The Knowledge Capacitor + +Imagine a capacitor that stores energy, releasing it precisely when needed. Now imagine that for information. That's Adaptive Crawling—a term I coined to describe a fundamentally different approach to web crawling. Instead of the brute force of traditional deep crawling, we build knowledge dynamically, growing it based on queries and circumstances, like a living organism responding to its environment. + +This isn't just another crawling optimization. It's a paradigm shift from "crawl everything, hope for the best" to "crawl intelligently, know when to stop." + +## Why I Built This + +I've watched too many startups burn through resources with a dangerous misconception: that LLMs make everything efficient. They don't. They make things *possible*, not necessarily *smart*. When you combine brute-force crawling with LLM processing, you're not just wasting time—you're hemorrhaging money on tokens, compute, and opportunity cost. 
+ +Consider this reality: +- **Traditional deep crawling**: 500 pages → 50 useful → $15 in LLM tokens → 2 hours wasted +- **Adaptive crawling**: 15 pages → 14 useful → $2 in tokens → 10 minutes → **7.5x cost reduction** + +But it's not about crawling less. It's about crawling *right*. + +## The Information Theory Foundation + +
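The economics above come from ranking links by expected information gain before fetching them. As a concrete warm-up, here is a toy sketch of that ranking step, following the ExpectedGain = Relevance × Novelty idea from the framework; term overlap and Jaccard similarity stand in for BM25 and embedding similarity, and Authority is omitted for brevity:

```python
def expected_gain(preview: str, query: str, crawled_texts: list) -> float:
    """Toy ExpectedGain = Relevance × Novelty.
    Simplified stand-ins for BM25 and similarity scoring -- illustrative only."""
    q_terms = set(query.lower().split())
    p_terms = set(preview.lower().split())
    # Relevance: fraction of query terms found in the link preview
    relevance = len(q_terms & p_terms) / max(len(q_terms), 1)

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    # Novelty: 1 - max similarity against pages we have already crawled
    max_sim = max((jaccard(p_terms, set(t.lower().split())) for t in crawled_texts),
                  default=0.0)
    return relevance * (1 - max_sim)

links = {
    "/docs/auth/oauth2": "OAuth2 token flow and refresh",
    "/docs/auth/basics": "Authentication basics and headers",
    "/blog/company-news": "Our company retreat photos",
}
crawled = ["Authentication basics and headers"]
ranked = sorted(links, key=lambda u: expected_gain(links[u], "oauth2 token refresh", crawled),
                reverse=True)
print(ranked[0])  # → /docs/auth/oauth2
```

Even this crude version already makes the right call: it skips the page we effectively have and the irrelevant one, fetching only the link likely to add new, on-query information. That selectivity is where the cost savings come from.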