Enhance crawler capabilities and documentation

- Added llm.txt generator.
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions


@@ -169,6 +169,35 @@ llm_result = await crawler.arun(
)
```
## Input Formats

All extraction strategies support different input formats, giving you more control over how content is processed:

- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with a content filter to be configured.

To specify an input format:
```python
strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information"
)
```
Note: When using `fit_markdown`, ensure your `CrawlerRunConfig` includes a markdown generator with a content filter:
```python
config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter()  # content filter required for fit_markdown
    )
)
```
If `fit_markdown` is requested but not available (no markdown generator or no content filter configured), the system automatically falls back to raw markdown and emits a warning.
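The fallback behavior can be sketched as a small helper. This is an illustrative sketch, not crawl4ai's actual implementation; the function name `resolve_input` and its parameters are hypothetical:

```python
import warnings


def resolve_input(input_format: str, raw_markdown: str, fit_markdown: str) -> str:
    """Hypothetical helper illustrating the documented fallback:
    if fit_markdown was requested but never produced (no markdown
    generator / content filter configured), fall back to raw
    markdown and warn."""
    if input_format == "fit_markdown":
        if not fit_markdown:
            warnings.warn(
                "fit_markdown requested but not available; "
                "falling back to raw markdown"
            )
            return raw_markdown
        return fit_markdown
    # "markdown" (the default) receives the raw markdown conversion
    return raw_markdown
```

The practical takeaway: a missing content filter will not crash the crawl, but the extraction strategy silently receives noisier input, so watch for the warning when results look unfiltered.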
## Best Practices
1. **Choose the Right Strategy**