This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456
2.6 KiB
2.6 KiB
Adaptive Crawling Examples
This directory contains examples demonstrating various aspects of Crawl4AI's Adaptive Crawling feature.
Examples Overview
1. basic_usage.py
- Simple introduction to adaptive crawling
- Uses default statistical strategy
- Shows how to get crawl statistics and relevant content
2. embedding_strategy.py ⭐ NEW
- Demonstrates the embedding-based strategy for semantic understanding
- Shows query expansion and irrelevance detection
- Includes configuration for both local and API-based embeddings
3. embedding_vs_statistical.py ⭐ NEW
- Direct comparison between statistical and embedding strategies
- Helps you choose the right strategy for your use case
- Shows performance and accuracy trade-offs
4. embedding_configuration.py ⭐ NEW
- Advanced configuration options for embedding strategy
- Parameter tuning guide for different scenarios
- Examples for research, exploration, and quality-focused crawling
5. advanced_configuration.py
- Shows various configuration options for both strategies
- Demonstrates threshold tuning and performance optimization
6. custom_strategies.py
- How to implement your own crawling strategy
- Extends the base CrawlStrategy class
- Advanced use case for specialized requirements
7. export_import_kb.py
- Export crawled knowledge base to JSONL
- Import and continue crawling from saved state
- Useful for building persistent knowledge bases
Quick Start
For your first adaptive crawling experience, run:
python basic_usage.py
To try the new embedding strategy with semantic understanding:
python embedding_strategy.py
To compare strategies and see which works best for your use case:
python embedding_vs_statistical.py
Strategy Selection Guide
Use Statistical Strategy (Default) When:
- Working with technical documentation
- Queries contain specific terms or code
- Speed is critical
- No API access available
Use Embedding Strategy When:
- Queries are conceptual or ambiguous
- Need semantic understanding beyond exact matches
- Want to detect irrelevant content
- Working with diverse content sources
Requirements
- Crawl4AI installed
- For embedding strategy with local models:
sentence-transformers - For embedding strategy with OpenAI: Set
OPENAI_API_KEYenvironment variable