Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
58 lines
3.9 KiB
Markdown
58 lines
3.9 KiB
Markdown
Below is a structured list of hypothetical questions derived from the file’s content, followed by a bullet-point summary of key topics discussed.
|
||
|
||
### Hypothetical Questions
|
||
|
||
1. **General Hook Usage**
|
||
- *"What are hooks in Crawl4AI, and how do they help customize the crawling process?"*
|
||
- *"Which stages of the crawling lifecycle can I attach hooks to?"*
|
||
|
||
2. **Specific Hooks**
|
||
- *"What does the `on_browser_created` hook allow me to do?"*
|
||
- *"How can I use the `on_page_context_created` hook to modify requests before navigation?"*
|
||
- *"When should I use `before_goto` and `after_goto` hooks?"*
|
||
- *"How does `on_execution_started` help with custom JavaScript execution?"*
|
||
- *"What kind of preprocessing can I do in `before_return_html`?"*
|
||
|
||
3. **Authentication and Customization**
|
||
- *"How can I perform authentication (like logging in) before actual crawling begins?"*
|
||
- *"Can I set cookies, headers, or modify requests using hooks?"*
|
||
|
||
4. **Error Handling and Debugging**
|
||
- *"If my hooks fail or raise errors, how is that handled during the crawling process?"*
|
||
- *"How can I use hooks to troubleshoot issues, like blocking image requests or logging console messages?"*
|
||
|
||
5. **Complex Scenarios**
|
||
- *"Can I combine multiple hooks to handle complex workflows like login, script execution, and dynamic content blocking?"*
|
||
- *"Is it possible to add conditional logic in hooks to treat certain URLs differently?"*
|
||
|
||
6. **Performance and Reliability**
|
||
- *"Do these hooks run asynchronously, and how does that affect the crawler’s performance?"*
|
||
- *"Can I cancel requests or actions via hooks to improve efficiency?"*
|
||
|
||
7. **Integration with `BrowserConfig` and `CrawlerRunConfig`**
|
||
- *"How do I use `BrowserConfig` and `CrawlerRunConfig` in tandem with hooks?"*
|
||
- *"Does setting hooks require changes to the configuration objects or can I apply them at runtime?"*
|
||
|
||
### Topics Discussed in the File
|
||
|
||
- **Hooks in `AsyncWebCrawler`**:
|
||
Hooks are asynchronous callback functions triggered at key points in the crawling lifecycle. They allow advanced customization, such as modifying browser/page contexts, injecting scripts, or altering network requests.
|
||
|
||
- **Hook Types and Purposes**:
|
||
- **`on_browser_created`**: Initialize browser state, handle authentication (login), set cookies.
|
||
- **`on_page_context_created`**: Set up request routing, block resources, or modify requests before navigation.
|
||
- **`before_goto`**: Add or modify HTTP headers, prepare the page before actually navigating to the target URL.
|
||
- **`after_goto`**: Verify the current URL, log details, or ensure that page navigation succeeded.
|
||
- **`on_execution_started`**: Perform actions right after JS execution, like logging console output or checking state.
|
||
- **`before_return_html`**: Analyze, log, or preprocess the extracted HTML before it’s returned.
|
||
|
||
- **Practical Examples**:
|
||
Demonstrations of handling authentication via `on_browser_created`, blocking images using `on_page_context_created` with a custom routing function, adding HTTP headers in `before_goto`, and logging content details in `before_return_html`.
|
||
|
||
- **Integration with Configuration Objects**:
|
||
Using `BrowserConfig` for initial browser settings and `CrawlerRunConfig` for specifying JavaScript code, wait conditions, and more, then combining them with hooks for a fully customizable crawling workflow.
|
||
|
||
- **Asynchronous and Flexible**:
|
||
Hooks are async, fitting seamlessly into the event-driven model of crawling. They can abort requests, continue them, or conditionally modify behavior based on URL patterns.
|
||
|
||
In summary, this file explains how to use hooks in Crawl4AI’s `AsyncWebCrawler` to customize nearly every aspect of the crawling process. By attaching hooks at various lifecycle stages, developers can implement authentication routines, block certain types of requests, tweak headers, run custom JS, and analyze the final HTML—all while maintaining control and flexibility. |