This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:

- Added `adaptive_crawler.py` and related scripts
- Modified `__init__.py` and `utils.py`
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456

# AdaptiveCrawler

The `AdaptiveCrawler` class implements intelligent web crawling that automatically determines when sufficient information has been gathered to answer a query. It uses a three-layer scoring system to evaluate coverage, consistency, and saturation.

## Constructor

```python
AdaptiveCrawler(
    crawler: AsyncWebCrawler,
    config: Optional[AdaptiveConfig] = None
)
```

### Parameters

- **crawler** (`AsyncWebCrawler`): The underlying web crawler instance to use for fetching pages
- **config** (`Optional[AdaptiveConfig]`): Configuration settings for adaptive crawling behavior. If not provided, uses default settings.

## Primary Method

### digest()

The main method that performs adaptive crawling starting from a URL with a specific query.

```python
async def digest(
    start_url: str,
    query: str,
    resume_from: Optional[Union[str, Path]] = None
) -> CrawlState
```

#### Parameters

- **start_url** (`str`): The starting URL for crawling
- **query** (`str`): The search query that guides the crawling process
- **resume_from** (`Optional[Union[str, Path]]`): Path to a saved state file to resume from

#### Returns

- **CrawlState**: The final crawl state containing all crawled URLs, knowledge base, and metrics

#### Example

```python
async with AsyncWebCrawler() as crawler:
    adaptive = AdaptiveCrawler(crawler)
    state = await adaptive.digest(
        start_url="https://docs.python.org",
        query="async context managers"
    )
```

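When a crawl may be interrupted, `resume_from` can point at a previously saved state file. The helper below is a minimal sketch of one way to choose that argument: it returns the path only if the file already exists and `None` otherwise, so the same call can serve both a first run and a resumed run. The helper name `resume_path_if_exists` and the file name `my_crawl.json` are illustrative assumptions, not part of the library.

```python
from pathlib import Path
from typing import Optional, Union


def resume_path_if_exists(state_path: Union[str, Path]) -> Optional[Path]:
    """Return state_path as a Path if the file exists, else None.

    Hypothetical helper: the result is meant to be passed as
    digest(..., resume_from=...).
    """
    path = Path(state_path)
    return path if path.is_file() else None


# Usage inside an async function, assuming `adaptive` is an AdaptiveCrawler:
#   state = await adaptive.digest(
#       start_url="https://docs.python.org",
#       query="async context managers",
#       resume_from=resume_path_if_exists("my_crawl.json"),
#   )
```
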
## Properties

### confidence

Current confidence score (0-1) indicating information sufficiency.

```python
@property
def confidence(self) -> float
```

### coverage_stats

Dictionary containing detailed coverage statistics.

```python
@property
def coverage_stats(self) -> Dict[str, float]
```

Returns:

- **coverage**: Query term coverage score
- **consistency**: Information consistency score
- **saturation**: Content saturation score
- **confidence**: Overall confidence score

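One way to use these statistics is to see which signal is holding the overall confidence back. The sketch below takes a dictionary with the four keys listed above and reports the lowest of the three component scores. The helper `weakest_signal` and the sample values are made up for illustration; real values would come from `adaptive.coverage_stats`.

```python
from typing import Dict, Tuple

# The three component scores; "confidence" is the overall score and is skipped.
COMPONENTS = ("coverage", "consistency", "saturation")


def weakest_signal(stats: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, score) of the lowest component score."""
    name = min(COMPONENTS, key=lambda key: stats[key])
    return name, stats[name]


# Sample values, invented for this example.
sample_stats = {
    "coverage": 0.82,
    "consistency": 0.91,
    "saturation": 0.47,
    "confidence": 0.73,
}

name, score = weakest_signal(sample_stats)
print(f"Weakest signal: {name} ({score:.0%})")
```
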
### is_sufficient

Boolean indicating whether sufficient information has been gathered.

```python
@property
def is_sufficient(self) -> bool
```

### state

Access to the current crawl state.

```python
@property
def state(self) -> CrawlState
```

## Methods

### get_relevant_content()

Retrieve the most relevant content from the knowledge base.

```python
def get_relevant_content(
    self,
    top_k: int = 5
) -> List[Dict[str, Any]]
```

#### Parameters

- **top_k** (`int`): Number of top relevant documents to return (default: 5)

#### Returns

List of dictionaries containing:

- **url**: The URL of the page
- **content**: The page content
- **score**: Relevance score
- **metadata**: Additional page metadata

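The returned list can be post-processed like any list of dictionaries. As a minimal sketch, the snippet below keeps only pages above a score cutoff and orders them from most to least relevant. The helper `select_pages`, the cutoff value, and the sample records are assumptions made for illustration; only the four keys come from the description above.

```python
from typing import Any, Dict, List


def select_pages(pages: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Keep pages scoring at least min_score, highest score first."""
    kept = [page for page in pages if page["score"] >= min_score]
    return sorted(kept, key=lambda page: page["score"], reverse=True)


# Invented sample records with the four documented keys.
sample_pages = [
    {"url": "https://example.com/a", "content": "...", "score": 0.42, "metadata": {}},
    {"url": "https://example.com/b", "content": "...", "score": 0.87, "metadata": {}},
    {"url": "https://example.com/c", "content": "...", "score": 0.65, "metadata": {}},
]

for page in select_pages(sample_pages, min_score=0.5):
    print(f"{page['url']} (score: {page['score']:.2f})")
```
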
### print_stats()

Display crawl statistics in formatted output.

```python
def print_stats(
    self,
    detailed: bool = False
) -> None
```

#### Parameters

- **detailed** (`bool`): If `True`, shows detailed metrics with colors. If `False`, shows a summary table.

### export_knowledge_base()

Export the collected knowledge base to a JSONL file.

```python
def export_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Output file path for JSONL export

#### Example

```python
adaptive.export_knowledge_base("my_knowledge.jsonl")
```

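JSONL stores one JSON object per line, so an exported file can be inspected with the standard library alone. The reader below is a minimal sketch under that assumption; the helper name `iter_jsonl` is hypothetical, and the fields inside each record are not specified here, so the sketch makes no assumption about them.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage, assuming the file was written by export_knowledge_base():
#   for record in iter_jsonl("my_knowledge.jsonl"):
#       print(sorted(record.keys()))
```
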
### import_knowledge_base()

Import a previously exported knowledge base.

```python
def import_knowledge_base(
    self,
    path: Union[str, Path]
) -> None
```

#### Parameters

- **path** (`Union[str, Path]`): Path to JSONL file to import

## Configuration

The `AdaptiveConfig` class controls the behavior of adaptive crawling:

```python
@dataclass
class AdaptiveConfig:
    confidence_threshold: float = 0.8    # Stop when confidence reaches this
    max_pages: int = 50                  # Maximum pages to crawl
    top_k_links: int = 5                 # Links to follow per page
    min_gain_threshold: float = 0.1      # Minimum expected gain to continue
    save_state: bool = False             # Auto-save crawl state
    state_path: Optional[str] = None     # Path for state persistence
```

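Read together, the comments on `confidence_threshold`, `max_pages`, and `min_gain_threshold` suggest three stopping conditions. The sketch below is one plausible reading of how they could combine, not the library's actual implementation: `StopSettings` is a hypothetical stand-in that copies the three default values, and `should_stop` and its arguments are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StopSettings:
    """Hypothetical stand-in holding three of the AdaptiveConfig fields."""
    confidence_threshold: float = 0.8
    max_pages: int = 50
    min_gain_threshold: float = 0.1


def should_stop(confidence: float, pages_crawled: int,
                expected_gain: float, settings: StopSettings) -> bool:
    """Return True when any of the three assumed stop conditions holds."""
    if confidence >= settings.confidence_threshold:
        return True  # enough information gathered
    if pages_crawled >= settings.max_pages:
        return True  # page budget exhausted
    return expected_gain < settings.min_gain_threshold  # not worth continuing


settings = StopSettings()
print(should_stop(0.85, 10, 0.30, settings))  # confidence threshold reached
print(should_stop(0.40, 50, 0.30, settings))  # page budget exhausted
print(should_stop(0.40, 10, 0.05, settings))  # expected gain too low
print(should_stop(0.40, 10, 0.30, settings))  # none of the conditions hold
```

Under this reading, lowering `confidence_threshold` or `max_pages` ends a crawl sooner, while lowering `min_gain_threshold` lets it follow links with smaller expected payoff.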
### Example with Custom Config

```python
config = AdaptiveConfig(
    confidence_threshold=0.7,
    max_pages=20,
    top_k_links=3
)

adaptive = AdaptiveCrawler(crawler, config=config)
```

## Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig

async def main():
    # Configure adaptive crawling
    config = AdaptiveConfig(
        confidence_threshold=0.75,
        max_pages=15,
        save_state=True,
        state_path="my_crawl.json"
    )

    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)

        # Start crawling
        state = await adaptive.digest(
            start_url="https://example.com/docs",
            query="authentication oauth2 jwt"
        )

        # Check results
        print(f"Confidence achieved: {adaptive.confidence:.0%}")
        adaptive.print_stats()

        # Get most relevant pages
        for page in adaptive.get_relevant_content(top_k=3):
            print(f"- {page['url']} (score: {page['score']:.2f})")

        # Export for later use
        adaptive.export_knowledge_base("auth_knowledge.jsonl")

if __name__ == "__main__":
    asyncio.run(main())
```

## See Also

- [digest() Method Reference](digest.md)
- [Adaptive Crawling Guide](../core/adaptive-crawling.md)
- [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)