docs: remove hallucinations from docs for CrawlerRunConfig + Add chunking strategy docs in the table
@@ -69,7 +69,8 @@ We group them by category.
 | **Parameter** | **Type / Default** | **What It Does** |
 |------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
 | **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
 | **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
+| **`chunking_strategy`** | `ChunkingStrategy` (default: RegexChunking) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
 | **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
 | **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
 | **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
@@ -136,11 +136,6 @@ class CrawlerRunConfig:
 wait_for=None,
 screenshot=False,
 pdf=False,
-enable_rate_limiting=False,
-rate_limit_config=None,
-memory_threshold_percent=70.0,
-check_interval=1.0,
-max_session_permit=20,
 display_mode=None,
 verbose=True,
 stream=False, # Enable streaming for arun_many()
@@ -183,25 +178,7 @@ class CrawlerRunConfig:
 - Logs additional runtime details.
 - Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.
 
-9. **`enable_rate_limiting`**:
-- If `True`, enables rate limiting for batch processing.
-- Requires `rate_limit_config` to be set.
-
-10. **`memory_threshold_percent`**:
-- The memory threshold (as a percentage) to monitor.
-- If exceeded, the crawler will pause or slow down.
-
-11. **`check_interval`**:
-- The interval (in seconds) to check system resources.
-- Affects how often memory and CPU usage are monitored.
-
-12. **`max_session_permit`**:
-- The maximum number of concurrent crawl sessions.
-- Helps prevent overwhelming the system.
 
-13. **`display_mode`**:
-- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
-- Affects how much information is printed during the crawl.
 
 ### Helper Methods
 
@@ -236,9 +213,6 @@ The `clone()` method:
 ---
 
-
-
-
 
 ## 3. LLMConfig Essentials
 
 ### Key fields to note
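Note on the first hunk: the `chunking_strategy` row added there and the existing `word_count_threshold` row describe two simple ideas, splitting text into chunks and skipping blocks with too few words. The sketch below illustrates both in plain Python. It is a stand-in under stated assumptions, not crawl4ai's `RegexChunking` or its filtering code: the function names `regex_chunk` and `drop_short_blocks`, the `\n\n` split pattern, and the sample text are all hypothetical.

```python
import re
from typing import List


def regex_chunk(text: str, patterns: List[str]) -> List[str]:
    """Split text on each regex pattern in turn.

    Hypothetical stand-in for a regex-based chunking strategy;
    not the library's implementation.
    """
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Drop empty pieces left behind by leading or trailing separators.
    return [c.strip() for c in chunks if c.strip()]


def drop_short_blocks(blocks: List[str], word_count_threshold: int) -> List[str]:
    """Keep only blocks that have at least `word_count_threshold` words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


# Sample input (made up): a nav-like fragment, a real paragraph, a stray token.
text = "Home | About\n\nThis paragraph has enough words to survive the filter.\n\nOK"

chunks = regex_chunk(text, [r"\n\n"])                      # split on blank lines
kept = drop_short_blocks(chunks, word_count_threshold=5)   # skip trivial blocks
print(kept)
```

With a threshold of 5, only the full paragraph survives; the short navigation fragment and the stray `OK` are skipped, which is the behavior the `word_count_threshold` row describes.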