Enhance Crawl4AI with new features and documentation
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.
This commit is contained in:
@@ -2,80 +2,12 @@
|
||||
|
||||
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
|
||||
|
||||
## Content Cleaning
|
||||
|
||||
### Understanding Clean Content
|
||||
When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
|
||||
|
||||
1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
|
||||
2. **Content Relevance**: Identifies and preserves meaningful content blocks
|
||||
3. **Layout Analysis**: Understands page structure to identify main content areas
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Remove blocks with fewer words
|
||||
excluded_tags=['form', 'nav'], # Remove specific HTML tags
|
||||
remove_overlay_elements=True # Remove popups/modals
|
||||
)
|
||||
|
||||
# Get clean content
|
||||
print(result.cleaned_html) # Cleaned HTML
|
||||
print(result.markdown) # Clean markdown version
|
||||
```
|
||||
|
||||
### Fit Markdown: Smart Content Extraction
|
||||
One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
|
||||
|
||||
#### How Fit Markdown Works
|
||||
- Analyzes content density and distribution
|
||||
- Identifies content patterns and structures
|
||||
- Removes boilerplate content (headers, footers, sidebars)
|
||||
- Preserves the most relevant content blocks
|
||||
- Maintains content hierarchy and formatting
|
||||
|
||||
#### Perfect For:
|
||||
- Blog posts and articles
|
||||
- News content
|
||||
- Documentation pages
|
||||
- Any page with a clear main content area
|
||||
|
||||
#### Not Recommended For:
|
||||
- E-commerce product listings
|
||||
- Search results pages
|
||||
- Social media feeds
|
||||
- Pages with multiple equal-weight content sections
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Get the most relevant content
|
||||
main_content = result.fit_markdown
|
||||
|
||||
# Compare with regular markdown
|
||||
all_content = result.markdown
|
||||
|
||||
print(f"Fit Markdown Length: {len(main_content)}")
|
||||
print(f"Regular Markdown Length: {len(all_content)}")
|
||||
```
|
||||
|
||||
#### Example Use Case
|
||||
```python
|
||||
async def extract_article_content(url: str) -> str:
|
||||
"""Extract main article content from a blog or news site."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=url)
|
||||
|
||||
# fit_markdown will focus on the article content,
|
||||
# excluding navigation, ads, and other distractions
|
||||
return result.fit_markdown
|
||||
```
|
||||
|
||||
## Media Processing
|
||||
|
||||
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
|
||||
|
||||
### Image Processing
|
||||
|
||||
The library handles various image scenarios, including:
|
||||
- Regular images
|
||||
- Lazy-loaded images
|
||||
@@ -84,7 +16,10 @@ The library handles various image scenarios, including:
|
||||
- Image metadata and context
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
for image in result.media["images"]:
|
||||
# Each image includes rich metadata
|
||||
@@ -96,20 +31,27 @@ for image in result.media["images"]:
|
||||
```
|
||||
|
||||
### Handling Lazy-Loaded Content
|
||||
Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
|
||||
|
||||
Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
wait_for="css:img[data-src]", # Wait for lazy images
|
||||
delay_before_return_html=2.0 # Additional wait time
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Video and Audio Content
|
||||
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Process videos
|
||||
for video in result.media["videos"]:
|
||||
print(f"Video source: {video['src']}")
|
||||
@@ -129,6 +71,7 @@ for audio in result.media["audios"]:
|
||||
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
@@ -137,7 +80,10 @@ The library automatically categorizes links into:
|
||||
- Content links
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Analyze internal links
|
||||
for link in result.links["internal"]:
|
||||
@@ -154,18 +100,19 @@ for link in result.links["external"]:
|
||||
```
|
||||
|
||||
### Smart Link Filtering
|
||||
Control which links are included in the results:
|
||||
|
||||
Control which links are included in the results with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["ads.example.com"] # Exclude specific domains
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Metadata Extraction
|
||||
@@ -173,7 +120,10 @@ result = await crawler.arun(
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
metadata = result.metadata
|
||||
print(f"Title: {metadata['title']}")
|
||||
@@ -184,40 +134,3 @@ print(f"Published Date: {metadata['published_date']}")
|
||||
print(f"Modified Date: {metadata['modified_date']}")
|
||||
print(f"Language: {metadata['language']}")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use Fit Markdown for Articles**
|
||||
```python
|
||||
# Perfect for blog posts, news articles, documentation
|
||||
content = result.fit_markdown
|
||||
```
|
||||
|
||||
2. **Handle Media Appropriately**
|
||||
```python
|
||||
# Filter by relevance score
|
||||
relevant_images = [
|
||||
img for img in result.media["images"]
|
||||
if img['score'] > 5
|
||||
]
|
||||
```
|
||||
|
||||
3. **Combine Link Analysis with Content**
|
||||
```python
|
||||
# Get content links with context
|
||||
content_links = [
|
||||
link for link in result.links["internal"]
|
||||
if link['type'] == 'content'
|
||||
]
|
||||
```
|
||||
|
||||
4. **Clean Content with Purpose**
|
||||
```python
|
||||
# Customize cleaning based on your needs
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=20, # Adjust based on content type
|
||||
keep_data_attributes=False, # Remove data attributes
|
||||
process_iframes=True # Include iframe content
|
||||
)
|
||||
```
|
||||
Reference in New Issue
Block a user