Commit Message:

Enhance Crawl4AI with CLI and documentation updates

- Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
- Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions
--- a/docs/llm.txt/4_browser_context_page.q.md
+++ b/docs/llm.txt/4_browser_context_page.q.md
@@ -0,0 +1,62 @@
+### Questions
+
+1. **Browser Creation and Configuration**
+   - *"How do I create a browser instance with `BrowserConfig` for asynchronous crawling?"*
+   - *"What is the difference between standard browser creation and using persistent contexts?"*
+   - *"How do I configure headless mode and viewport dimensions?"*
+
+2. **Persistent Sessions and `user_data_dir`**
+   - *"How do persistent contexts work with `user_data_dir` to maintain session data?"*
+   - *"How can I reuse cookies and local storage to avoid repetitive logins?"*
+
+3. **Managed Browser**
+   - *"What benefits does `ManagedBrowser` provide over a standard browser instance?"*
+   - *"How do I enable identity preservation and stealth techniques using `ManagedBrowser`?"*
+   - *"How can I integrate debugging tools like Chrome Developer Tools with `ManagedBrowser`?"*
+
+4. **Identity Preservation**
+   - *"How can I simulate human-like behavior (mouse movements, scrolling) to preserve identity?"*
+   - *"What techniques does `crawl4ai` use to bypass CAPTCHA challenges and maintain authenticity?"*
+   - *"How do I use real user profiles to solve CAPTCHAs and save session data?"*
+
+5. **Session Management**
+   - *"How can I maintain state across multiple crawls using `session_id`?"*
+   - *"What are best practices for using sessions to handle multi-step login flows?"*
+   - *"How do I reuse sessions for authenticated workflows and reduce overhead?"*
+
+6. **Dynamic Content Handling**
+   - *"How can I inject JavaScript or wait conditions to ensure dynamic elements load before extraction?"*
+   - *"What strategies can I use to navigate infinite scrolling or ‘Load More’ buttons?"*
+   - *"How do I integrate JS code execution and waiting to handle modern SPA (Single Page Application) layouts?"*
+
+7. **Scaling and Performance**
+   - *"How do I scale crawls to handle thousands of URLs concurrently?"*
+   - *"What options exist for caching and for optimizing resource utilization?"*
+   - *"How do I handle multiple browser instances efficiently for high-volume crawling?"*
+
+8. **Extraction Strategies**
+   - *"How can I use `JsonCssExtractionStrategy` to extract structured data?"*
+   - *"What methods are available to chunk or filter extracted content?"*
+
+9. **Magic Mode vs. Managed Browsers**
+   - *"What is Magic Mode and when should I use it over Managed Browsers?"*
+   - *"Does Magic Mode help with basic sites, and how do I enable it?"*
+
+10. **Troubleshooting and Best Practices**
+    - *"How can I debug browser automation issues with logs and headful mode?"*
+    - *"What best practices should I follow to respect website policies?"*
+    - *"How do I handle authentication flows, form submissions, and CAPTCHA challenges effectively?"*
+
+### Topics Discussed in the File
+
+- **Browser Instance Creation** (Standard vs. Persistent Contexts)  
+- **`BrowserConfig` Customization** (headless mode, viewport, proxies, debugging)  
+- **Managed Browser for Resource Management and Debugging**  
+- **Identity Preservation Techniques** (Stealth, Human-like Behavior, CAPTCHA Bypass)  
+- **Persistent Sessions and `user_data_dir`** (Session Reuse, Authentication Flows)  
+- **Crawling Modern Web Apps** (Dynamic Content, JS Injection, Infinite Scrolling)  
+- **Session Management with `session_id`** (Maintaining State, Multi-Step Flows)  
+- **Magic Mode** (Automation of User-Like Behavior, Simple Setup)  
+- **Extraction Strategies** (`JsonCssExtractionStrategy`, Handling Structured Data)  
+- **Scaling and Performance Optimization** (Multiple URLs, Concurrency, Reusing Sessions)  
+- **Best Practices and Troubleshooting** (Respecting Policies, Debugging Tools, Handling Errors)