AdaptiveCrawler
The AdaptiveCrawler class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.
Constructor
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
Parameters
- crawler (AsyncWebCrawler): The underlying web crawler instance to use for fetching pages
- config (Optional[AdaptiveConfig]): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.
Primary Method
digest()
The main method that performs adaptive crawling starting from a URL with a specific query.
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
Parameters
- start_url (str): The starting URL for crawling
- query (str): The search query that guides the crawling process
- resume_from (Optional[Union[str, Path]]): Path to a saved state file to resume from
Returns
- CrawlState: The final crawl state containing all crawled URLs, knowledge base, and metrics
Example
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
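If an earlier run persisted its state (see save_state and state_path under Configuration below), a later run can continue from it via resume_from. A minimal sketch, assuming a state file named my_crawl.json was written by a previous run:
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers",
        resume_from="my_crawl.json"  # hypothetical path from an earlier run
    )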
Properties
confidence
Current confidence score (0-1) indicating information sufficiency.
@property
def confidence(self) -> float
coverage_stats
Dictionary containing detailed coverage statistics.
@property
def coverage_stats(self) -> Dict[str, float]
Returns:
- coverage: Query term coverage score
- consistency: Information consistency score
- saturation: Content saturation score
- confidence: Overall confidence score
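For instance, the individual scores can be inspected after a crawl to see which signal is limiting the overall confidence (a small sketch, assuming a completed crawl as in the examples above):
stats = adaptive.coverage_stats
print(f"coverage:    {stats['coverage']:.2f}")
print(f"consistency: {stats['consistency']:.2f}")
print(f"saturation:  {stats['saturation']:.2f}")
print(f"confidence:  {stats['confidence']:.2f}")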
is_sufficient
Boolean indicating whether sufficient information has been gathered.
@property
def is_sufficient(self) -> bool
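A typical use, after digest() returns, is to check whether the crawl stopped because enough information was gathered or because another limit (such as max_pages) was reached. A brief sketch, reusing the adaptive instance from the examples above:
state = await adaptive.digest(
    start_url="https://docs.python.org",
    query="async context managers"
)
if adaptive.is_sufficient:
    print("Enough information gathered; crawl stopped confidently.")
else:
    print("Stopped without reaching the confidence threshold (e.g. max_pages hit).")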
state
Access to the current crawl state.
@property
def state(self) -> CrawlState
Methods
get_relevant_content()
Retrieve the most relevant content from the knowledge base.
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
Parameters
- top_k (int): Number of top relevant documents to return (default: 5)
Returns
List of dictionaries containing:
- url: The URL of the page
- content: The page content
- score: Relevance score
- metadata: Additional page metadata
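A short sketch showing how the returned entries might be consumed (the keys follow the list above; the 200-character preview is just illustrative):
for doc in adaptive.get_relevant_content(top_k=5):
    print(f"{doc['score']:.2f}  {doc['url']}")
    print(doc['content'][:200])  # short preview of the page content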
print_stats()
Display crawl statistics in formatted output.
def print_stats(
    self,
    detailed: bool = False
) -> None
Parameters
- detailed (bool): If True, shows detailed metrics with colors. If False, shows a summary table.
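For example:
adaptive.print_stats()               # summary table
adaptive.print_stats(detailed=True)  # detailed, colorized metrics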
export_knowledge_base()
Export the collected knowledge base to a JSONL file.
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
Parameters
- path (Union[str, Path]): Output file path for JSONL export
Example
adaptive.export_knowledge_base("my_knowledge.jsonl")
import_knowledge_base()
Import a previously exported knowledge base.
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
Parameters
- path (Union[str, Path]): Path to JSONL file to import
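A small sketch pairing export and import: one session writes the knowledge base out, and a later session loads it into a fresh AdaptiveCrawler before continuing (the file name is illustrative):
# Session 1: persist what was learned
adaptive.export_knowledge_base("my_knowledge.jsonl")

# Session 2: load it into a new crawler instance
adaptive = AdaptiveCrawler(crawler)
adaptive.import_knowledge_base("my_knowledge.jsonl")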
Configuration
The AdaptiveConfig class controls the behavior of adaptive crawling:
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
Example with Custom Config
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)
adaptive = AdaptiveCrawler(crawler, config=config)
Complete Example
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())