Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions
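
Note on the `srcset` item in the commit message: a `srcset` value is a comma-separated list of image candidates, each a URL optionally followed by a width (`480w`) or density (`2x`) descriptor. The sketch below shows one way such a value could be split into candidates. It is an illustration only, not the code in this commit; `parse_srcset` is a hypothetical helper name, and the naive comma split assumes URLs that contain no commas (so it would mishandle `data:` URIs).

```python
def parse_srcset(srcset: str) -> list[dict[str, str]]:
    """Split a srcset string into {"url", "descriptor"} candidates.

    Hypothetical helper for illustration; assumes no commas inside URLs.
    """
    candidates = []
    for part in srcset.split(","):
        tokens = part.strip().split()
        if not tokens:
            continue  # skip empty segments such as a trailing comma
        candidates.append({
            "url": tokens[0],
            # A candidate without a descriptor is treated as density 1x
            "descriptor": tokens[1] if len(tokens) > 1 else "1x",
        })
    return candidates


if __name__ == "__main__":
    for candidate in parse_srcset("small.png 480w, large.png 1024w"):
        print(candidate["url"], candidate["descriptor"])
```
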
--- a/docs/md_v2/advanced/content-processing.md
+++ b/docs/md_v2/advanced/content-processing.md
@@ -2,80 +2,12 @@

 Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.

-## Content Cleaning
-
-### Understanding Clean Content
-When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
-
-1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
-2. **Content Relevance**: Identifies and preserves meaningful content blocks
-3. **Layout Analysis**: Understands page structure to identify main content areas
-
-```python
-result = await crawler.arun(
-    url="https://example.com",
-    word_count_threshold=10,        # Remove blocks with fewer words
-    excluded_tags=['form', 'nav'],  # Remove specific HTML tags
-    remove_overlay_elements=True    # Remove popups/modals
-)
-
-# Get clean content
-print(result.cleaned_html)  # Cleaned HTML
-print(result.markdown)      # Clean markdown version
-```
-
-### Fit Markdown: Smart Content Extraction
-One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
-
-#### How Fit Markdown Works
- Analyzes content density and distribution
- Identifies content patterns and structures
- Removes boilerplate content (headers, footers, sidebars)
- Preserves the most relevant content blocks
- Maintains content hierarchy and formatting
-
-#### Perfect For:
- Blog posts and articles
- News content
- Documentation pages
- Any page with a clear main content area
-
-#### Not Recommended For:
- E-commerce product listings
- Search results pages
- Social media feeds
- Pages with multiple equal-weight content sections
-
-```python
-result = await crawler.arun(url="https://example.com")
-
-# Get the most relevant content
-main_content = result.fit_markdown
-
-# Compare with regular markdown
-all_content = result.markdown
-
-print(f"Fit Markdown Length: {len(main_content)}")
-print(f"Regular Markdown Length: {len(all_content)}")
-```
-
-#### Example Use Case
-```python
-async def extract_article_content(url: str) -> str:
-    """Extract main article content from a blog or news site."""
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(url=url)
-        
-        # fit_markdown will focus on the article content,
-        # excluding navigation, ads, and other distractions
-        return result.fit_markdown
-```
-
 ## Media Processing

 Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.

 ### Image Processing
+
 The library handles various image scenarios, including:
 - Regular images
 - Lazy-loaded images
@@ -84,7 +16,10 @@ The library handles various image scenarios, including:
 - Image metadata and context

 ```python
-result = await crawler.arun(url="https://example.com")
+from crawl4ai.async_configs import CrawlerRunConfig
+
+config = CrawlerRunConfig()
+result = await crawler.arun(url="https://example.com", config=config)

 for image in result.media["images"]:
    # Each image includes rich metadata
@@ -96,20 +31,27 @@ for image in result.media["images"]:
 ```

 ### Handling Lazy-Loaded Content
-Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
+
+Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:

 ```python
-result = await crawler.arun(
-    url="https://example.com",
+config = CrawlerRunConfig(
    wait_for="css:img[data-src]",  # Wait for lazy images
    delay_before_return_html=2.0   # Additional wait time
 )
+result = await crawler.arun(url="https://example.com", config=config)
 ```

 ### Video and Audio Content
+
 The library extracts video and audio elements with their metadata:

 ```python
+from crawl4ai.async_configs import CrawlerRunConfig
+
+config = CrawlerRunConfig()
+result = await crawler.arun(url="https://example.com", config=config)
+
 # Process videos
 for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
@@ -129,6 +71,7 @@ for audio in result.media["audios"]:
 Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.

 ### Link Classification
+
 The library automatically categorizes links into:
 - Internal links (same domain)
 - External links (different domains)
@@ -137,7 +80,10 @@ The library automatically categorizes links into:
 - Content links

 ```python
-result = await crawler.arun(url="https://example.com")
+from crawl4ai.async_configs import CrawlerRunConfig
+
+config = CrawlerRunConfig()
+result = await crawler.arun(url="https://example.com", config=config)

 # Analyze internal links
 for link in result.links["internal"]:
@@ -154,18 +100,19 @@ for link in result.links["external"]:
 ```

 ### Smart Link Filtering
-Control which links are included in the results:
+
+Control which links are included in the results with `CrawlerRunConfig`:

 ```python
-result = await crawler.arun(
-    url="https://example.com",
+config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
-    exclude_social_media_domains=[                # Custom social media domains
+    exclude_social_media_domains=[        # Custom social media domains
        "facebook.com", "twitter.com", "instagram.com"
    ],
    exclude_domains=["ads.example.com"]   # Exclude specific domains
 )
+result = await crawler.arun(url="https://example.com", config=config)
 ```

 ## Metadata Extraction
@@ -173,7 +120,10 @@ result = await crawler.arun(
 Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:

 ```python
-result = await crawler.arun(url="https://example.com")
+from crawl4ai.async_configs import CrawlerRunConfig
+
+config = CrawlerRunConfig()
+result = await crawler.arun(url="https://example.com", config=config)

 metadata = result.metadata
 print(f"Title: {metadata['title']}")
@@ -184,40 +134,3 @@ print(f"Published Date: {metadata['published_date']}")
 print(f"Modified Date: {metadata['modified_date']}")
 print(f"Language: {metadata['language']}")
 ```
-
-## Best Practices
-
-1. **Use Fit Markdown for Articles**
-   ```python
-   # Perfect for blog posts, news articles, documentation
-   content = result.fit_markdown
-   ```
-
-2. **Handle Media Appropriately**
-   ```python
-   # Filter by relevance score
-   relevant_images = [
-       img for img in result.media["images"]
-       if img['score'] > 5
-   ]
-   ```
-
-3. **Combine Link Analysis with Content**
-   ```python
-   # Get content links with context
-   content_links = [
-       link for link in result.links["internal"]
-       if link['type'] == 'content'
-   ]
-   ```
-
-4. **Clean Content with Purpose**
-   ```python
-   # Customize cleaning based on your needs
-   result = await crawler.arun(
-       url=url,
-       word_count_threshold=20,      # Adjust based on content type
-       keep_data_attributes=False,   # Remove data attributes
-       process_iframes=True         # Include iframe content
-   )
-   ```
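
Note on the Link Classification section in the diff above: the docs describe internal links as "same domain" and external links as "different domains". A minimal standard-library sketch of that rule follows, assuming an exact host match decides the category. It is not Crawl4AI's implementation; `classify_links` is a hypothetical name, and the real library may treat subdomains, redirects, or non-HTTP schemes differently.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Group hrefs into internal/external by comparing hosts with base_url.

    Hypothetical helper for illustration; exact host match is an assumption.
    """
    base_host = urlparse(base_url).netloc.lower()
    result = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # ignore mailto:, javascript:, tel:, and similar
        key = "internal" if parsed.netloc.lower() == base_host else "external"
        result[key].append(absolute)
    return result


if __name__ == "__main__":
    links = classify_links(
        "https://example.com/docs/",
        ["/about", "page.html", "https://other.org/x", "mailto:a@example.com"],
    )
    print(links["internal"])
    print(links["external"])
```

Resolving each href against the page URL first means relative paths and absolute URLs go through the same host comparison.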