Enhance crawler capabilities and documentation

- Added llm.txt generator.
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions


@@ -169,6 +169,35 @@ llm_result = await crawler.arun(
)
```
## Input Formats

All extraction strategies support different input formats, giving you more control over how content is processed:

- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with a content filter to be configured.

To specify an input format:
```python
strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information"
)
```
Note: When using `fit_markdown`, ensure your `CrawlerRunConfig` includes a markdown generator with a content filter:
```python
config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # content filter required for fit_markdown
    )
)
```
If `fit_markdown` is requested but not available (no markdown generator or no content filter configured), the system automatically falls back to raw markdown and emits a warning.
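The fallback behavior can be sketched as a small helper. This is an illustrative sketch, not crawl4ai's actual implementation; the function name `resolve_input` and its parameters are hypothetical:

```python
import warnings


def resolve_input(input_format: str, raw_markdown: str, fit_markdown: str) -> str:
    """Hypothetical helper illustrating the documented fallback:
    if fit_markdown was requested but never produced (no markdown
    generator / content filter configured), fall back to raw
    markdown and warn."""
    if input_format == "fit_markdown":
        if not fit_markdown:
            warnings.warn(
                "fit_markdown requested but not available; "
                "falling back to raw markdown"
            )
            return raw_markdown
        return fit_markdown
    # "markdown" (the default) receives the raw markdown conversion
    return raw_markdown
```

The practical takeaway: a missing content filter will not crash the crawl, but the extraction strategy silently receives noisier input, so watch for the warning when results look unfiltered.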
## Best Practices
1. **Choose the Right Strategy**