Refactor: Renamed scrape to traverse and deep_crawl in a few sections where it applies
This commit is contained in:
206
docs/deep_crawl/how_to_use.md
Normal file
206
docs/deep_crawl/how_to_use.md
Normal file
@@ -0,0 +1,206 @@
|
||||
# Scraper Examples Guide
|
||||
|
||||
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
|
||||
|
||||
## Basic Example
|
||||
|
||||
The basic example demonstrates a simple blog scraping scenario:
|
||||
|
||||
```python
|
||||
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
|
||||
|
||||
# Create simple filter chain
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter("*/blog/*"),
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Initialize strategy
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=2,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=None,
|
||||
max_concurrent=3
|
||||
)
|
||||
|
||||
# Create and run scraper
|
||||
crawler = AsyncWebCrawler()
|
||||
scraper = AsyncWebScraper(crawler, strategy)
|
||||
result = await scraper.ascrape("https://example.com/blog/")
|
||||
```
|
||||
|
||||
### Features Demonstrated
|
||||
- Basic URL filtering
|
||||
- Simple content type filtering
|
||||
- Depth control
|
||||
- Concurrent request limiting
|
||||
- Result collection
|
||||
|
||||
## Advanced Example
|
||||
|
||||
The advanced example shows a sophisticated news site scraping setup with all features enabled:
|
||||
|
||||
```python
|
||||
# Create comprehensive filter chain
|
||||
filter_chain = FilterChain([
|
||||
DomainFilter(
|
||||
allowed_domains=["example.com"],
|
||||
blocked_domains=["ads.example.com"]
|
||||
),
|
||||
URLPatternFilter([
|
||||
"*/article/*",
|
||||
re.compile(r"\d{4}/\d{2}/.*")
|
||||
]),
|
||||
ContentTypeFilter(["text/html"])
|
||||
])
|
||||
|
||||
# Create intelligent scorer
|
||||
scorer = CompositeScorer([
|
||||
KeywordRelevanceScorer(
|
||||
keywords=["news", "breaking"],
|
||||
weight=1.0
|
||||
),
|
||||
PathDepthScorer(optimal_depth=3, weight=0.7),
|
||||
FreshnessScorer(weight=0.9)
|
||||
])
|
||||
|
||||
# Initialize advanced strategy
|
||||
strategy = BFSScraperStrategy(
|
||||
max_depth=4,
|
||||
filter_chain=filter_chain,
|
||||
url_scorer=scorer,
|
||||
max_concurrent=5
|
||||
)
|
||||
```
|
||||
|
||||
### Features Demonstrated
|
||||
1. **Advanced Filtering**
|
||||
- Domain filtering
|
||||
- Pattern matching
|
||||
- Content type control
|
||||
|
||||
2. **Intelligent Scoring**
|
||||
- Keyword relevance
|
||||
- Path optimization
|
||||
- Freshness priority
|
||||
|
||||
3. **Monitoring**
|
||||
- Progress tracking
|
||||
- Error handling
|
||||
- Statistics collection
|
||||
|
||||
4. **Resource Management**
|
||||
- Concurrent processing
|
||||
- Rate limiting
|
||||
- Cleanup handling
|
||||
|
||||
## Running the Examples
|
||||
|
||||
```bash
|
||||
# Basic usage
|
||||
python basic_scraper_example.py
|
||||
|
||||
# Advanced usage with logging
|
||||
PYTHONPATH=. python advanced_scraper_example.py
|
||||
```
|
||||
|
||||
## Example Output
|
||||
|
||||
### Basic Example
|
||||
```
|
||||
Crawled 15 pages:
|
||||
- https://example.com/blog/post1: 24560 bytes
|
||||
- https://example.com/blog/post2: 18920 bytes
|
||||
...
|
||||
```
|
||||
|
||||
### Advanced Example
|
||||
```
|
||||
INFO: Starting crawl of https://example.com/news/
|
||||
INFO: Processed: https://example.com/news/breaking/story1
|
||||
DEBUG: KeywordScorer: 0.85
|
||||
DEBUG: FreshnessScorer: 0.95
|
||||
INFO: Progress: 10 URLs processed
|
||||
...
|
||||
INFO: Scraping completed:
|
||||
INFO: - URLs processed: 50
|
||||
INFO: - Errors: 2
|
||||
INFO: - Total content size: 1240.50 KB
|
||||
```
|
||||
|
||||
## Customization
|
||||
|
||||
### Adding Custom Filters
|
||||
```python
|
||||
class CustomFilter(URLFilter):
|
||||
def apply(self, url: str) -> bool:
|
||||
# Your custom filtering logic
|
||||
return True
|
||||
|
||||
filter_chain.add_filter(CustomFilter())
|
||||
```
|
||||
|
||||
### Custom Scoring Logic
|
||||
```python
|
||||
class CustomScorer(URLScorer):
|
||||
def _calculate_score(self, url: str) -> float:
|
||||
# Your custom scoring logic
|
||||
return 1.0
|
||||
|
||||
scorer = CompositeScorer([
|
||||
CustomScorer(weight=1.0),
|
||||
...
|
||||
])
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start Simple**
|
||||
- Begin with basic filtering
|
||||
- Add features incrementally
|
||||
- Test thoroughly at each step
|
||||
|
||||
2. **Monitor Performance**
|
||||
- Watch memory usage
|
||||
- Track processing times
|
||||
- Adjust concurrency as needed
|
||||
|
||||
3. **Handle Errors**
|
||||
- Implement proper error handling
|
||||
- Log important events
|
||||
- Track error statistics
|
||||
|
||||
4. **Optimize Resources**
|
||||
- Set appropriate delays
|
||||
- Limit concurrent requests
|
||||
- Use streaming for large crawls
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Common issues and solutions:
|
||||
|
||||
1. **Too Many Requests**
|
||||
```python
|
||||
strategy = BFSScraperStrategy(
|
||||
max_concurrent=3, # Reduce concurrent requests
|
||||
min_crawl_delay=2 # Increase delay between requests
|
||||
)
|
||||
```
|
||||
|
||||
2. **Memory Issues**
|
||||
```python
|
||||
# Use streaming mode for large crawls
|
||||
async for result in scraper.ascrape(url, stream=True):
|
||||
process_result(result)
|
||||
```
|
||||
|
||||
3. **Missing Content**
|
||||
```python
|
||||
# Check your filter chain
|
||||
filter_chain = FilterChain([
|
||||
URLPatternFilter("*"), # Broaden patterns
|
||||
ContentTypeFilter(["*"]) # Accept all content
|
||||
])
|
||||
```
|
||||
|
||||
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).
|
||||
Reference in New Issue
Block a user