# AdaptiveCrawler

The `AdaptiveCrawler` class implements query-driven web crawling that stops automatically once enough information has been gathered to answer a query. It uses a three-layer scoring system that evaluates coverage, consistency, and saturation.

## Constructor

```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```

### Parameters

- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance used to fetch pages.
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If omitted, default settings are used.

## Primary Method

### digest()

The main method. It performs adaptive crawling starting from a URL, guided by a specific query.

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> AdaptiveCrawlResult
```

#### Parameters

- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from

#### Returns

- **AdaptiveCrawlResult**: The final crawl state containing all crawled URLs, knowledge base, and metrics

#### Example

```python
# Run this inside an async function (for example, one started with asyncio.run).
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```

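
The `resume_from` parameter described above can be wired up so that a crawl resumes only when a saved state file is already present. The following is a minimal sketch, not part of the library: `digest_with_resume` and `state_file` are hypothetical names, and `adaptive` is assumed to be any object whose `digest()` accepts the three documented parameters.

```python
from pathlib import Path
from typing import Any, Optional, Union


async def digest_with_resume(
    adaptive: Any,
    start_url: str,
    query: str,
    state_file: Union[str, Path],
) -> Any:
    """Call digest(), passing resume_from only when the state file exists.

    Hypothetical helper: `adaptive` is expected to be an AdaptiveCrawler,
    but any object with a compatible async digest() works.
    """
    state_path = Path(state_file)
    resume_from: Optional[Path] = state_path if state_path.exists() else None
    return await adaptive.digest(
        start_url=start_url,
        query=query,
        resume_from=resume_from,
    )
```

Inside an `async with AsyncWebCrawler() as crawler:` block, this would be called as `await digest_with_resume(adaptive, url, query, "my_crawl.json")`. Checking for the file first keeps the first run and later runs on the same code path.
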
## Properties

### confidence

Current confidence score, from 0 to 1, indicating how sufficient the gathered information is.

```python
@property
def confidence(self) -> float
```

### coverage_stats

Dictionary containing detailed coverage statistics.

```python
@property
def coverage_stats(self) -> Dict[str, float]
```

Returns:

- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score

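
Because `coverage_stats` is a plain `Dict[str, float]` with the four keys listed above, it is easy to format for logs. The sketch below uses a hand-written dictionary with illustrative values in place of a real crawl; `format_coverage_stats` is a hypothetical helper, not a library function.

```python
from typing import Dict


def format_coverage_stats(stats: Dict[str, float]) -> str:
    """Render the four documented scores as percentages, one per line."""
    keys = ("coverage", "consistency", "saturation", "confidence")
    return "\n".join(f"{key:<12}{stats[key]:.0%}" for key in keys)


# Illustrative values only; real values would come from adaptive.coverage_stats.
example_stats = {
    "coverage": 0.82,
    "consistency": 0.75,
    "saturation": 0.60,
    "confidence": 0.78,
}
print(format_coverage_stats(example_stats))
```
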
### is_sufficient

Boolean indicating whether sufficient information has been gathered.

```python
@property
def is_sufficient(self) -> bool
```

### state

Access to the current crawl state.

```python
@property
def state(self) -> AdaptiveCrawlResult
```

## Methods

### get_relevant_content()

Retrieve the most relevant content from the knowledge base.

```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```

#### Parameters

- **top_k** (`int`): Number of top relevant documents to return (default: 5)

#### Returns

List of dictionaries containing:

- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata

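
The returned list can be post-processed like any list of dictionaries. As a small sketch, the hypothetical helper `filter_by_score` below keeps only pages at or above a chosen score; `sample_pages` is hand-written stand-in data shaped like the documented return value, and the threshold of 0.5 is an arbitrary choice for illustration.

```python
from typing import Any, Dict, List


def filter_by_score(pages: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep pages whose relevance score is at least min_score, highest first."""
    kept = [page for page in pages if page["score"] >= min_score]
    return sorted(kept, key=lambda page: page["score"], reverse=True)


# Stand-in for the result of adaptive.get_relevant_content(top_k=3).
sample_pages = [
    {"url": "https://example.com/a", "content": "...", "score": 0.42, "metadata": {}},
    {"url": "https://example.com/b", "content": "...", "score": 0.91, "metadata": {}},
    {"url": "https://example.com/c", "content": "...", "score": 0.67, "metadata": {}},
]

for page in filter_by_score(sample_pages, min_score=0.5):
    print(f"{page['url']} (score: {page['score']:.2f})")
```
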
### print_stats()

Display crawl statistics as formatted output.

```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```

#### Parameters

- **detailed** (`bool`): If `True`, shows detailed metrics with colors. If `False`, shows a summary table.

### export_knowledge_base()

Export the collected knowledge base to a JSONL file.

```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Output file path for JSONL export

#### Example

```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```

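
Since the export is JSONL (one JSON object per line), the file can be inspected with the standard library alone. The sketch below is an illustration under that assumption only: `iter_jsonl` is a hypothetical helper, and it makes no assumption about which fields each exported record contains, because this page does not specify them.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                yield json.loads(line)
```

For example, `sum(1 for _ in iter_jsonl("my_knowledge.jsonl"))` would count the exported records without loading the whole file into memory.
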
### import_knowledge_base()

Import a previously exported knowledge base.

```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Path to the JSONL file to import

## Configuration

The `AdaptiveConfig` class controls the behavior of adaptive crawling:

```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8   # Stop when confidence reaches this
    max_pages: int = 50                 # Maximum pages to crawl
    top_k_links: int = 5                # Links to follow per page
    min_gain_threshold: float = 0.1     # Minimum expected gain to continue
    save_state: bool = False            # Auto-save crawl state
    state_path: Optional[str] = None    # Path for state persistence
```

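
The comments on `confidence_threshold`, `max_pages`, and `min_gain_threshold` describe three separate stopping conditions. One way to read them together is sketched below. This is an assumption about how the limits interact, not the library's actual logic: `StopSettings` is a local stand-in that mirrors three of the fields, and `should_stop` and `expected_gain` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StopSettings:
    """Local stand-in mirroring three AdaptiveConfig fields; not the library class."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1


def should_stop(
    confidence: float,
    pages_crawled: int,
    expected_gain: float,
    settings: StopSettings,
) -> bool:
    """Return True when any of the three documented limits is reached."""
    if confidence >= settings.confidence_threshold:
        return True  # enough information has been gathered
    if pages_crawled >= settings.max_pages:
        return True  # page budget is exhausted
    # Otherwise stop only if the next page is expected to add too little.
    return expected_gain < settings.min_gain_threshold


settings = StopSettings(confidence_threshold=0.7, max_pages=20)
print(should_stop(confidence=0.72, pages_crawled=5, expected_gain=0.3, settings=settings))
```

Under this reading, lowering `confidence_threshold` or `max_pages` ends a crawl sooner, while lowering `min_gain_threshold` lets it keep following links that promise smaller gains.
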
### Example with Custom Config

```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)

# `crawler` is an existing AsyncWebCrawler instance.
adaptive = AdaptiveCrawler(crawler, config=config)
```

## Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```

## See Also

- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)