AdaptiveCrawler
The AdaptiveCrawler class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
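The paragraph above names the three signals but not how they are merged into one confidence value. As a rough illustration only, the sketch below assumes each signal is normalized to the range 0-1 and combines them with a weighted average. The function name `combine_scores` and the weights are hypothetical and are not taken from the library.

```python
from typing import Dict

# Hypothetical weights, chosen only for illustration.
# The library's actual weighting is not documented in this section.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "coverage": 0.4,
    "consistency": 0.3,
    "saturation": 0.3,
}


def combine_scores(
    scores: Dict[str, float],
    weights: Dict[str, float] = ASSUMED_WEIGHTS,
) -> float:
    """Weighted average of the layer scores, clamped to [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * scores[name] for name in weights)
    return max(0.0, min(1.0, weighted / total_weight))


# Made-up layer scores for a single crawl step.
confidence = combine_scores(
    {"coverage": 0.9, "consistency": 0.8, "saturation": 0.5}
)
print(f"confidence: {confidence:.2f}")
```

A weighted average is only one plausible choice; a product or a minimum over the three signals would penalize a single weak layer more strongly.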
Constructor
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
Parameters
- crawler (AsyncWebCrawler): The underlying web crawler instance to use for fetching pages
- config (Optional[AdaptiveConfig]): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
Primary Method
digest()
The main method that performs adaptive crawling starting from a URL with a specific query.
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> AdaptiveCrawlResult
Parameters
- start_url (str): The starting URL for crawling
- query (str): The search query that guides the crawling process
- resume_from (Optional[Union[str, Path]]): Path to a saved state file to resume from
Returns
- AdaptiveCrawlResult: The final crawl state containing all crawled URLs, knowledge base, and metrics
Example
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
Properties
confidence
Current confidence score (0-1) indicating information sufficiency.
@property
def confidence(self) -> float
coverage_stats
Dictionary containing detailed coverage statistics.
@property
def coverage_stats(self) -> Dict[str, float]
Returns:
- coverage: Query term coverage score
- consistency: Information consistency score
- saturation: Content saturation score
- confidence: Overall confidence score
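Because `coverage_stats` is documented as a plain dictionary with the four keys listed above, it can be inspected with ordinary dictionary code. The sketch below finds the weakest of the three layer scores. The helper `weakest_signal` and the sample values are hypothetical; in real use the dictionary would come from `adaptive.coverage_stats`.

```python
from typing import Dict


def weakest_signal(stats: Dict[str, float]) -> str:
    """Return the name of the lowest-scoring layer, ignoring overall confidence."""
    layers = {name: value for name, value in stats.items() if name != "confidence"}
    return min(layers, key=layers.get)


# Made-up values for illustration; real values come from adaptive.coverage_stats.
sample_stats = {
    "coverage": 0.82,
    "consistency": 0.74,
    "saturation": 0.41,
    "confidence": 0.68,
}
print(weakest_signal(sample_stats))
```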
is_sufficient
Boolean indicating whether sufficient information has been gathered.
@property
def is_sufficient(self) -> bool
state
Access to the current crawl state.
@property
def state(self) -> AdaptiveCrawlResult
Methods
get_relevant_content()
Retrieve the most relevant content from the knowledge base.
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
Parameters
- top_k (int): Number of top relevant documents to return (default: 5)
Returns
List of dictionaries containing:
- url: The URL of the page
- content: The page content
- score: Relevance score
- metadata: Additional page metadata
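Since each returned item is a dictionary with the keys listed above, results can be post-processed without any library support. The following sketch keeps only pages at or above a score cutoff. The helper `filter_by_score`, the cutoff, and the sample pages are all hypothetical; real input would be the list returned by `get_relevant_content()`.

```python
from typing import Any, Dict, List


def filter_by_score(
    pages: List[Dict[str, Any]], min_score: float
) -> List[Dict[str, Any]]:
    """Keep pages scoring at or above min_score, highest score first."""
    kept = [page for page in pages if page["score"] >= min_score]
    return sorted(kept, key=lambda page: page["score"], reverse=True)


# Made-up results shaped like the documented return value.
sample_pages = [
    {"url": "https://example.com/a", "content": "...", "score": 0.35, "metadata": {}},
    {"url": "https://example.com/b", "content": "...", "score": 0.91, "metadata": {}},
    {"url": "https://example.com/c", "content": "...", "score": 0.62, "metadata": {}},
]

strong_pages = filter_by_score(sample_pages, min_score=0.5)
for page in strong_pages:
    print(f"{page['url']} (score: {page['score']:.2f})")
```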
print_stats()
Display crawl statistics in formatted output.
def print_stats(
    self,
    detailed: bool = False
) -> None
Parameters
- detailed (bool): If True, shows detailed metrics with colors. If False, shows a summary table.
export_knowledge_base()
Export the collected knowledge base to a JSONL file.
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
Parameters
- path (Union[str, Path]): Output file path for JSONL export
Example
adaptive.export_knowledge_base("my_knowledge.jsonl")
import_knowledge_base()
Import a previously exported knowledge base.
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
Parameters
- path (Union[str, Path]): Path to JSONL file to import
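The export and import methods exchange a JSONL file, that is, one JSON object per line. For inspecting such a file outside the crawler, a standard-library reader along these lines may be enough. The helpers `write_jsonl` and `read_jsonl` are hypothetical, and the record fields below are made up, since the exact fields the library writes are not listed in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_jsonl(path: Path, records: List[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Made-up records; the real export may use different field names.
docs = [
    {"url": "https://example.com/a", "content": "first page"},
    {"url": "https://example.com/b", "content": "second page"},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "sample_knowledge.jsonl"
    write_jsonl(jsonl_path, docs)
    loaded = read_jsonl(jsonl_path)

print(f"loaded {len(loaded)} records")
```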
Configuration
The AdaptiveConfig class controls the behavior of adaptive crawling:
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
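Read together, the comments on `confidence_threshold`, `max_pages`, and `min_gain_threshold` suggest three separate stopping conditions. The sketch below shows one way those conditions could interact, assuming any one of them is enough to stop. `StopSettings` and `should_stop` are hypothetical stand-ins written for this illustration and are not the library's internal logic.

```python
from dataclasses import dataclass


@dataclass
class StopSettings:
    """Hypothetical stand-in mirroring three AdaptiveConfig fields and defaults."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1


def should_stop(
    confidence: float,
    pages_crawled: int,
    expected_gain: float,
    settings: StopSettings,
) -> bool:
    """Stop when confident enough, out of page budget, or the next page adds too little."""
    return (
        confidence >= settings.confidence_threshold
        or pages_crawled >= settings.max_pages
        or expected_gain < settings.min_gain_threshold
    )


settings = StopSettings()
print(should_stop(0.85, 10, 0.30, settings))  # confidence threshold reached
print(should_stop(0.40, 50, 0.30, settings))  # page budget exhausted
print(should_stop(0.40, 10, 0.05, settings))  # expected gain too small
print(should_stop(0.40, 10, 0.30, settings))  # none of the conditions hold
```

Under this reading, lowering `confidence_threshold` or raising `min_gain_threshold` ends a crawl earlier, while `max_pages` acts as a hard cap regardless of the scores.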
Example with Custom Config
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)
adaptive = AdaptiveCrawler(crawler, config=config)
Complete Example
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())