Enhance crawler features and improve documentation

- Added detailed CrawlerRunConfig parameters documentation.
- Introduced plans for real-time event-driven crawling.
- Updated async logger default level to DEBUG for better insights.
- Improved structure and readability in configuration file.
- Enhanced documentation on future capabilities in new blog entries.
UncleCode
2024-12-16 18:52:51 +08:00
parent ed7bc1909c
commit a11d9646e3
6 changed files with 439 additions and 125 deletions

@@ -0,0 +1,85 @@
# CrawlerRunConfig Parameters Documentation
## Content Processing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
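
To make the interplay of `word_count_threshold`, `excluded_tags`, and `remove_forms` more concrete, here is a minimal standard-library sketch. It is not the library's implementation: `TextBlockExtractor` and `extract_blocks` are hypothetical names, and it assumes the threshold is applied per text block and that excluded tags are container elements such as `nav` or `footer`.

```python
from html.parser import HTMLParser


class TextBlockExtractor(HTMLParser):
    """Collects text blocks, skipping anything inside excluded container tags."""

    def __init__(self, excluded_tags=(), remove_forms=False):
        super().__init__()
        # remove_forms is modelled as "treat <form> as an excluded tag" (assumption).
        self.skip_tags = set(excluded_tags) | ({"form"} if remove_forms else set())
        self.skip_depth = 0
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.blocks.append(text)


def extract_blocks(html, word_count_threshold=200, excluded_tags=(), remove_forms=False):
    """Return text blocks that have at least word_count_threshold words."""
    parser = TextBlockExtractor(excluded_tags, remove_forms)
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= word_count_threshold]
```

With a low threshold, short fragments such as navigation labels survive; raising it or excluding their tags removes them.
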
## Caching Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. When None, `CacheMode.ENABLED` is used internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter; if True, acts like `CacheMode.BYPASS` |
| `disable_cache` | bool | False | Legacy parameter; if True, acts like `CacheMode.DISABLED` |
| `no_cache_read` | bool | False | Legacy parameter; if True, acts like `CacheMode.WRITE_ONLY` |
| `no_cache_write` | bool | False | Legacy parameter; if True, acts like `CacheMode.READ_ONLY` |
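
The mapping from legacy flags to cache modes in the table above can be sketched as a small resolver. This is an illustration only: the `CacheMode` enum below is a local stand-in whose member names follow the table, `resolve_cache_mode` is a hypothetical helper, and the precedence (explicit `cache_mode` first, then the flags in the order shown) is an assumption rather than documented behaviour.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    """Local stand-in for the library's CacheMode (member names from the table)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode: Optional[CacheMode] = None,
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    # Assumed precedence: an explicit cache_mode beats any legacy flag.
    if cache_mode is not None:
        return cache_mode
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY  # never read, still write
    if no_cache_write:
        return CacheMode.READ_ONLY   # still read, never write
    # Nothing set: fall back to the internal default named in the table.
    return CacheMode.ENABLED
```
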
## Page Navigation and Timing Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling `arun_many` |
| `max_range` | float | 0.3 | Max random additional delay range for requests in `arun_many` |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
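
One way to read `mean_delay`, `max_range`, and `semaphore_count` together is sketched below with plain `asyncio`. The names `request_delay` and `crawl_many` are hypothetical, the page fetch is simulated with a sleep, and the delay formula (base delay plus a uniform random extra) is an assumption drawn from the descriptions above, not a statement of how the library schedules requests.

```python
import asyncio
import random


def request_delay(mean_delay: float = 0.1, max_range: float = 0.3) -> float:
    """Base delay plus a random extra drawn from [0, max_range]."""
    return mean_delay + random.uniform(0, max_range)


async def crawl_many(urls, semaphore_count=5, mean_delay=0.1, max_range=0.3):
    """Process urls with at most semaphore_count of them in flight at once."""
    semaphore = asyncio.Semaphore(semaphore_count)
    active = 0
    peak = 0  # highest number of simultaneous operations observed

    async def crawl_one(url):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            # Stand-in for the real fetch: only the pacing delay is waited out.
            await asyncio.sleep(request_delay(mean_delay, max_range))
            active -= 1
            return url

    results = await asyncio.gather(*(crawl_one(u) for u in urls))
    return results, peak
```

Calling `asyncio.run(crawl_many(urls))` returns the URLs in input order together with the peak concurrency, which never exceeds `semaphore_count`.
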
## Page Interaction Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions to help get past anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
## Media Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
## Link and Domain Handling Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
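
The four link options can be pictured as successive filters over the links found on a page. The sketch below uses only `urllib.parse`; `filter_links` is a hypothetical helper, the `SOCIAL_MEDIA_DOMAINS` list is a shortened stand-in for the library's constant, and treating subdomains as matches is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical, shortened stand-in for the library's SOCIAL_MEDIA_DOMAINS constant.
SOCIAL_MEDIA_DOMAINS = ["facebook.com", "twitter.com", "linkedin.com"]


def _host(url):
    """Lower-cased hostname without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def _matches(host, domains):
    """True if host equals a listed domain or is one of its subdomains."""
    return any(host == d or host.endswith("." + d) for d in domains)


def filter_links(
    links,
    base_url,
    exclude_external_links=False,
    exclude_social_media_links=False,
    exclude_social_media_domains=SOCIAL_MEDIA_DOMAINS,
    exclude_domains=(),
):
    base_host = _host(base_url)
    kept = []
    for link in links:
        host = _host(link)
        if exclude_external_links and host != base_host:
            continue
        if exclude_social_media_links and _matches(host, exclude_social_media_domains):
            continue
        if _matches(host, exclude_domains):
            continue
        kept.append(link)
    return kept
```
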
## Debugging and Logging Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |

@@ -0,0 +1,46 @@
## Introducing Event Streams and Interactive Hooks in Crawl4AI
![event-driven-crawl](https://res.cloudinary.com/kidocode/image/upload/t_400x400/v1734344008/15bb8bbb-83ac-43ac-962d-3feb3e0c3bbf_2_tjmr4n.webp)
In the near future, I'm planning to enhance Crawl4AI's capabilities by introducing an event stream mechanism that will give clients deeper, real-time insights into the crawling process. Today, hooks are a powerful feature at the code level: they let developers define custom logic at key points in the crawl. However, when using Crawl4AI as a service (e.g., through a Dockerized API), there isn't an easy way to interact with these hooks at runtime.
**What's Changing?**
I'm working on a solution that will allow the crawler to emit a continuous stream of events, updating clients on the current crawling stage, encountered pages, and any decision points. This event stream could be exposed over a standardized protocol like Server-Sent Events (SSE) or WebSockets, enabling clients to “subscribe” and listen as the crawler works.
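
If SSE ends up being the transport, each crawl event would travel as a small text frame. The sketch below shows the standard SSE wire format (`id`, `event`, and `data` fields, terminated by a blank line); the helper name `format_sse` and the event name and payload in the usage line are illustrative assumptions, since the actual event schema is not defined yet.

```python
import json


def format_sse(event: str, data: dict, event_id=None) -> str:
    """Serialize one event as a Server-Sent Events frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps produces a single line by default, so one data field suffices.
    lines.append(f"data: {json.dumps(data)}")
    # A blank line marks the end of the frame.
    return "\n".join(lines) + "\n\n"
```

For example, `format_sse("page_hit", {"url": "https://example.com"}, event_id=1)` yields a frame a browser `EventSource` could dispatch as a `page_hit` event.
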
**Interactivity Through Process IDs**
A key part of this new design is the concept of a unique process ID for each crawl session. Imagine you're listening to an event stream that informs you:
- The crawler just hit a certain page
- It triggered a hook and is now pausing for instructions
With the event stream in place, you can send a follow-up request back to the server—referencing the unique process ID—to provide extra data, instructions, or parameters. This might include selecting which links to follow next, adjusting extraction strategies, or providing authentication tokens for a protected API. Once the crawler receives these instructions, it resumes execution with the updated context.
```mermaid
sequenceDiagram
participant Client
participant Server
participant Crawler
Client->>Server: Start crawl request
Server->>Crawler: Initiate crawl with Process ID
Crawler-->>Server: Event: Page hit
Server-->>Client: Stream: Page hit event
Client->>Server: Instruction for Process ID
Server->>Crawler: Update crawl with new instructions
Crawler-->>Server: Event: Crawl completed
Server-->>Client: Stream: Crawl completed
```
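
The flow in the diagram can be prototyped in-process with `asyncio` queues before any network layer exists. Everything here is a hypothetical sketch under that assumption: `CrawlSession`, `SESSIONS`, `crawl`, `send_instruction`, and the event names are invented for illustration and are not part of Crawl4AI.

```python
import asyncio
import uuid


class CrawlSession:
    """Hypothetical session: events flow out, instructions flow in."""

    def __init__(self):
        self.process_id = uuid.uuid4().hex
        self.events = asyncio.Queue()
        self.instructions = asyncio.Queue()


SESSIONS = {}  # process_id -> CrawlSession


async def crawl(session, start_url):
    await session.events.put({"type": "page_hit", "url": start_url})
    # Hook point: announce the pause, then block until the client answers.
    await session.events.put({"type": "awaiting_instructions"})
    instruction = await session.instructions.get()
    for url in instruction.get("follow", []):
        await session.events.put({"type": "page_hit", "url": url})
    await session.events.put({"type": "crawl_completed"})


async def send_instruction(process_id, instruction):
    """Server-side effect of a follow-up request that references a process ID."""
    await SESSIONS[process_id].instructions.put(instruction)


async def demo():
    session = CrawlSession()
    SESSIONS[session.process_id] = session
    task = asyncio.create_task(crawl(session, "https://example.com"))
    seen = []
    while True:
        event = await session.events.get()  # the client "listening" to the stream
        seen.append(event["type"])
        if event["type"] == "awaiting_instructions":
            await send_instruction(
                session.process_id, {"follow": ["https://example.com/docs"]}
            )
        if event["type"] == "crawl_completed":
            break
    await task
    return seen
```

Running `asyncio.run(demo())` walks through one pause-and-resume cycle: the crawler stops at the hook, the client replies using the process ID, and the crawl continues with the links it was told to follow.
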
**Benefits for Developers and Users**
1. **Fine-Grained Control**: Instead of predefining all logic upfront, you can dynamically guide the crawler in response to actual data and conditions encountered mid-crawl.
2. **Real-Time Insights**: Monitor progress, errors, or network bottlenecks as they happen, without waiting for the entire crawl to finish.
3. **Enhanced Collaboration**: Different team members or automated systems can watch the same crawl events and provide input, making the crawling process more adaptive and intelligent.
**Next Steps**
I'm currently exploring the best APIs, technologies, and patterns to make this vision a reality. My goal is to deliver a seamless developer experience, one that integrates with existing Crawl4AI workflows while offering new flexibility and power.
Stay tuned for more updates as I continue building this feature out. In the meantime, I'd love to hear any feedback or suggestions you might have to help shape this interactive, event-driven future of web crawling with Crawl4AI.