UncleCode 843457a9cb Refactor adaptive crawling state management

- Renamed `CrawlState` to `AdaptiveCrawlResult` to better reflect its purpose.
- Updated all references to `CrawlState` in the codebase, including method signatures and documentation.
- Modified the `AdaptiveCrawler` class to initialize and manage the new `AdaptiveCrawlResult` state.
- Adjusted example strategies and documentation to align with the new state class.
- Ensured all tests are updated to use `AdaptiveCrawlResult` instead of `CrawlState`.

2025-07-24 20:11:43 +08:00

AdaptiveCrawler

The AdaptiveCrawler class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.

Constructor

AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)

Parameters

crawler (AsyncWebCrawler): The underlying web crawler instance to use for fetching pages
config (Optional[AdaptiveConfig]): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.

Primary Method

digest()

Performs adaptive crawling from a starting URL, guided by a query. This is the main entry point of the class.

async def digest(
    self,
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> AdaptiveCrawlResult

Parameters

start_url (str): The starting URL for crawling
query (str): The search query that guides the crawling process
resume_from (Optional[Union[str, Path]]): Path to a saved state file from which to resume (default: None)

Returns

AdaptiveCrawlResult: The final crawl state, containing all crawled URLs, the knowledge base, and metrics

Example

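import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        state = await adaptive.digest(
            start_url="https://docs.python.org",
            query="async context managers"
        )

asyncio.run(main())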
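
A common pattern with resume_from is to resume only when a saved state file already exists, and to start fresh otherwise. The helper below is a minimal sketch of that check. The name resolve_resume_path is hypothetical and not part of crawl4ai, and the commented call assumes the digest() signature documented above.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_resume_path(state_path: Union[str, Path]) -> Optional[Path]:
    """Return the path if a non-empty state file exists there, else None.

    Hypothetical helper (not part of crawl4ai): the result is meant to be
    passed straight to digest(resume_from=...).
    """
    path = Path(state_path)
    if path.is_file() and path.stat().st_size > 0:
        return path
    return None


# Intended use, assuming the digest() signature documented above:
# state = await adaptive.digest(
#     start_url="https://docs.python.org",
#     query="async context managers",
#     resume_from=resolve_resume_path("my_crawl.json"),
# )
```

Returning None when the file is missing or empty matches the parameter's documented default, so the same call works for both a first run and a resumed run.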

Properties

confidence

Current confidence score, from 0 to 1, indicating how sufficient the gathered information is.

@property
def confidence(self) -> float

coverage_stats

Dictionary containing detailed coverage statistics.

@property
def coverage_stats(self) -> Dict[str, float]

Returns:

coverage: Query term coverage score
consistency: Information consistency score
saturation: Content saturation score
confidence: Overall confidence score
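
To show how the returned dictionary might be consumed, here is a small sketch that renders the four documented keys as a single line of percentages. The function name format_coverage_stats and the sample values are illustrative assumptions, not part of the library or real crawl output.

```python
from typing import Dict

# Keys documented for coverage_stats, in display order.
STAT_KEYS = ("coverage", "consistency", "saturation", "confidence")


def format_coverage_stats(stats: Dict[str, float]) -> str:
    """Render the documented coverage_stats keys as one percentage line.

    Missing keys are shown as "n/a" rather than raising, since this sketch
    does not assume every key is always present.
    """
    parts = []
    for key in STAT_KEYS:
        value = stats.get(key)
        parts.append(f"{key}={value:.0%}" if value is not None else f"{key}=n/a")
    return ", ".join(parts)


# Sample values are illustrative, not real crawl output.
sample = {"coverage": 0.82, "consistency": 0.75, "saturation": 0.6, "confidence": 0.78}
print(format_coverage_stats(sample))
```

In practice the argument would be adaptive.coverage_stats, read after digest() has run.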

is_sufficient

Boolean indicating whether sufficient information has been gathered.

@property
def is_sufficient(self) -> bool

state

Access to the current crawl state.

@property
def state(self) -> AdaptiveCrawlResult

Methods

get_relevant_content()

Retrieve the most relevant content from the knowledge base.

def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]

Parameters

top_k (int): Number of top relevant documents to return (default: 5)

Returns

List of dictionaries containing:

url: The URL of the page
content: The page content
score: Relevance score
metadata: Additional page metadata
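
Because each entry carries a score, callers may want to discard weak matches before using the content. The sketch below filters a list in the documented shape by a minimum score. The function filter_by_score and the sample entries are hypothetical, and the threshold of 0.5 is arbitrary.

```python
from typing import Any, Dict, List


def filter_by_score(
    pages: List[Dict[str, Any]], min_score: float
) -> List[Dict[str, Any]]:
    """Keep entries whose "score" is at least min_score, highest first."""
    kept = [page for page in pages if page["score"] >= min_score]
    return sorted(kept, key=lambda page: page["score"], reverse=True)


# Illustrative data in the documented shape; not real crawl output.
pages = [
    {"url": "https://example.com/a", "content": "...", "score": 0.42, "metadata": {}},
    {"url": "https://example.com/b", "content": "...", "score": 0.91, "metadata": {}},
    {"url": "https://example.com/c", "content": "...", "score": 0.67, "metadata": {}},
]
for page in filter_by_score(pages, min_score=0.5):
    print(f"{page['url']} (score: {page['score']:.2f})")
```

With a live crawler, the input would be adaptive.get_relevant_content(top_k=...) instead of the sample list.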

print_stats()

Display crawl statistics in formatted output.

def print_stats(
    self,
    detailed: bool = False
) -> None

Parameters

detailed (bool): If True, shows detailed metrics with colors. If False (the default), shows a summary table.

export_knowledge_base()

Export the collected knowledge base to a JSONL file.

def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None

Parameters

path (Union[str, Path]): Output file path for JSONL export

Example

adaptive.export_knowledge_base("my_knowledge.jsonl")
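
JSONL stores one JSON value per line, so an exported file can be inspected with the standard library alone. The reader below is a minimal sketch. The field names inside each exported record are not specified in this reference, so the code makes no assumption about them, and read_jsonl is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, List, Union


def read_jsonl(path: Union[str, Path]) -> List[Any]:
    """Parse a JSONL file into a list of records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Intended use on a file written by export_knowledge_base():
# records = read_jsonl("my_knowledge.jsonl")
# print(len(records), "records exported")
```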

import_knowledge_base()

Import a previously exported knowledge base.

def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None

Parameters

path (Union[str, Path]): Path to JSONL file to import
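
Before importing a file that was edited by hand or produced elsewhere, it can help to confirm that every line parses as JSON. The check below is a sketch using only the standard library. The name invalid_jsonl_lines is hypothetical, and how import_knowledge_base() itself reacts to malformed lines is not documented here.

```python
import json
from pathlib import Path
from typing import List, Union


def invalid_jsonl_lines(path: Union[str, Path]) -> List[int]:
    """Return 1-based numbers of non-blank lines that are not valid JSON."""
    bad = []
    with open(path, "r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored, not reported
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
    return bad


# Intended use, assuming `adaptive` is an AdaptiveCrawler instance:
# if not invalid_jsonl_lines("my_knowledge.jsonl"):
#     adaptive.import_knowledge_base("my_knowledge.jsonl")
```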

Configuration

The AdaptiveConfig class controls the behavior of adaptive crawling:

@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8      # Stop when confidence reaches this
    max_pages: int = 50                    # Maximum pages to crawl
    top_k_links: int = 5                   # Links to follow per page
    min_gain_threshold: float = 0.1        # Minimum expected gain to continue
    save_state: bool = False               # Auto-save crawl state
    state_path: Optional[str] = None       # Path for state persistence
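
The comments on confidence_threshold, max_pages, and min_gain_threshold each describe a stopping condition. The sketch below shows one way such conditions could combine. It is hypothetical logic for illustration only: StopSettings is a local stand-in that copies three fields and their defaults from the dataclass above, stop_reason is an invented name, and the library's actual checks and their order may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopSettings:
    """Local stand-in mirroring three AdaptiveConfig fields and defaults."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1


def stop_reason(
    confidence: float,
    pages_crawled: int,
    expected_gain: float,
    settings: StopSettings,
) -> Optional[str]:
    """Return why crawling would stop, or None to keep going.

    Hypothetical logic; not taken from the crawl4ai implementation.
    """
    if confidence >= settings.confidence_threshold:
        return "confidence threshold reached"
    if pages_crawled >= settings.max_pages:
        return "page limit reached"
    if expected_gain < settings.min_gain_threshold:
        return "expected gain too low"
    return None


settings = StopSettings()
print(stop_reason(0.85, 10, 0.30, settings))
print(stop_reason(0.40, 50, 0.30, settings))
print(stop_reason(0.40, 10, 0.05, settings))
print(stop_reason(0.40, 10, 0.30, settings))
```

Read this way, lowering confidence_threshold or max_pages shortens a crawl, while lowering min_gain_threshold lets it continue through links with smaller expected payoff.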

Example with Custom Config

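from crawl4ai import AdaptiveCrawler, AdaptiveConfig

# Stop earlier and follow fewer links than the defaults allow
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)

# Assumes `crawler` is an already-open AsyncWebCrawler instance
adaptive = AdaptiveCrawler(crawler, config=config)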

Complete Example

import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )
    
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        
        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )
        
        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()
        
        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")
        
        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
