feat(content-filter): add LLMContentFilter for intelligent markdown generation

Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content:
- Implement intelligent content filtering with customizable instructions
- Add chunk processing for handling large documents
- Support parallel processing of content chunks
- Include caching mechanism for filtered results
- Add usage tracking and statistics
- Update documentation with examples and use cases

Also includes minor changes:
- Disable Pydantic warnings in __init__.py
- Add new prompt template for content filtering
Author: UncleCode
Date: 2025-01-18 19:31:07 +08:00
Parent: 2d6b19e1a2
Commit: 3d09b6a221
5 changed files with 495 additions and 5 deletions


@@ -170,6 +170,82 @@ prune_filter = PruningContentFilter(
- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
### 4.3 LLMContentFilter
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize the LLM filter with a specific instruction
    filter = LLMContentFilter(
        provider="openai/gpt-4",     # or your preferred provider
        api_token="your-api-token",  # or use an environment variable
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    config = CrawlerRunConfig(
        content_filter=filter
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.fit_markdown)  # Filtered markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
**Key Features:**

- **Intelligent Filtering**: Uses LLMs to understand and extract relevant content while maintaining context
- **Customizable Instructions**: Tailor the filtering process with specific instructions
- **Chunk Processing**: Handles large documents by processing them in chunks (controlled by `chunk_token_threshold`)
- **Parallel Processing**: For better performance, use a smaller `chunk_token_threshold` (e.g., 2048 or 4096) to enable parallel processing of content chunks
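To make the chunking behavior concrete, here is a minimal, illustrative sketch of how a document might be split under a token threshold. This is not crawl4ai's actual implementation; it assumes a simple whitespace tokenizer, whereas the real filter counts model tokens:

```python
# Illustrative sketch only (not crawl4ai's implementation): greedily pack
# whitespace-delimited "tokens" into chunks that stay under the threshold.

def split_into_chunks(text: str, chunk_token_threshold: int) -> list[str]:
    tokens = text.split()
    chunks: list[str] = []
    current: list[str] = []
    for token in tokens:
        if len(current) >= chunk_token_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = split_into_chunks("one two three four five six seven", 3)
print(chunks)  # three chunks of at most 3 tokens each
```

Each resulting chunk can then be filtered independently, which is what makes parallel processing possible.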

**Two Common Use Cases:**

1. **Exact Content Preservation**:
```python
filter = LLMContentFilter(
instruction="""
Extract the main educational content while preserving its original wording and substance completely.
1. Maintain the exact language and terminology
2. Keep all technical explanations and examples intact
3. Preserve the original flow and structure
4. Remove only clearly irrelevant elements like navigation menus and ads
""",
chunk_token_threshold=4096
)
```
2. **Focused Content Extraction**:
```python
filter = LLMContentFilter(
instruction="""
Focus on extracting specific types of content:
- Technical documentation
- Code examples
- API references
Reformat the content into clear, well-structured markdown
""",
chunk_token_threshold=4096
)
```
> **Performance Tip**: Set a smaller `chunk_token_threshold` (e.g., 2048 or 4096) to enable parallel processing of content chunks. The default value is infinity, which processes the entire content as a single chunk.
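The parallelism described in the tip above can be sketched with `asyncio.gather`. This is an illustrative stand-in, not crawl4ai's code: `filter_chunk` is a hypothetical placeholder for the real per-chunk LLM request, which is network-bound and therefore benefits from concurrent dispatch:

```python
import asyncio

# Illustrative sketch of parallel chunk processing. filter_chunk is a
# hypothetical stand-in for a per-chunk LLM call.

async def filter_chunk(chunk: str) -> str:
    await asyncio.sleep(0)  # placeholder for the network-bound LLM call
    return chunk.upper()    # placeholder "filtered" result

async def filter_all(chunks: list[str]) -> str:
    # All chunks are dispatched concurrently; gather preserves input order.
    filtered = await asyncio.gather(*(filter_chunk(c) for c in chunks))
    return "\n".join(filtered)

result = asyncio.run(filter_all(["intro", "body", "outro"]))
print(result)
```

Because each LLM call spends most of its time waiting on the network, running chunks concurrently cuts total filtering time roughly to that of the slowest single chunk.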
---
## 5. Using Fit Markdown