Enhance crawler capabilities and documentation
- Add llm.txt generator.
- Add SSL certificate extraction in AsyncWebCrawler.
- Introduce new content filters and chunking strategies for more robust data extraction.
- Update documentation.
## Input Formats
All extraction strategies support different input formats to give you more control over how content is processed:
- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with content filter to be configured.
To specify an input format:
```python
strategy = LLMExtractionStrategy(
input_format="html", # or "markdown" or "fit_markdown"
provider="openai/gpt-4",
instruction="Extract product information"
)
```
Note: When using "fit_markdown", ensure your CrawlerRunConfig includes a markdown generator with a content filter:
```python
config = CrawlerRunConfig(
extraction_strategy=strategy,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter() # Content filter goes here for fit_markdown
)
)
```
If fit_markdown is requested but not available (no markdown generator or content filter), the system will automatically fall back to raw markdown with a warning.
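The fallback behavior can be sketched roughly as follows. This is an illustrative snippet, not crawl4ai's actual implementation: the names `MarkdownResult` and `select_input` are hypothetical, standing in for the library's internal result object and content-selection step.

```python
# Illustrative sketch of the fit_markdown fallback; MarkdownResult and
# select_input are hypothetical names, not part of crawl4ai's API.
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkdownResult:
    raw_markdown: str
    fit_markdown: Optional[str] = None  # set only when a content filter ran


def select_input(result: MarkdownResult, input_format: str) -> str:
    """Pick the content an extraction strategy should receive."""
    if input_format == "fit_markdown":
        if result.fit_markdown:
            return result.fit_markdown
        # No markdown generator / content filter produced fit_markdown:
        # fall back to raw markdown and warn, as described above.
        warnings.warn("fit_markdown unavailable; falling back to raw markdown")
    return result.raw_markdown
```

Under these assumptions, requesting "fit_markdown" on a result that has no filtered content returns the raw markdown and emits a warning rather than failing.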
## Best Practices
1. **Choose the Right Strategy**