Enhance crawler capabilities and documentation

- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions
--- a/docs/llm.txt/4_browser_context_page.q.md
+++ b/docs/llm.txt/4_browser_context_page.q.md
@@ -1,62 +1,10 @@
-### Questions
-
-1. **Browser Creation and Configuration**
-   - *"How do I create a browser instance with `BrowserConfig` for asynchronous crawling?"*
-   - *"What is the difference between standard browser creation and using persistent contexts?"*
-   - *"How do I configure headless mode and viewport dimensions?"*
-
-2. **Persistent Sessions and `user_data_dir`**
-   - *"How do persistent contexts work with `user_data_dir` to maintain session data?"*
-   - *"How can I reuse cookies and local storage to avoid repetitive logins?"*
-
-3. **Managed Browser**
-   - *"What benefits does `ManagedBrowser` provide over a standard browser instance?"*
-   - *"How do I enable identity preservation and stealth techniques using `ManagedBrowser`?"*
-   - *"How can I integrate debugging tools like Chrome Developer Tools with `ManagedBrowser`?"*
-
-4. **Identity Preservation**
-   - *"How can I simulate human-like behavior (mouse movements, scrolling) to preserve identity?"*
-   - *"What techniques does `crawl4ai` use to bypass CAPTCHA challenges and maintain authenticity?"*
-   - *"How do I use real user profiles to solve CAPTCHAs and save session data?"*
-
-5. **Session Management**
-   - *"How can I maintain state across multiple crawls using `session_id`?"*
-   - *"What are best practices for using sessions to handle multi-step login flows?"*
-   - *"How do I reuse sessions for authenticated workflows and reduce overhead?"*
-
-6. **Dynamic Content Handling**
-   - *"How can I inject JavaScript or wait conditions to ensure dynamic elements load before extraction?"*
-   - *"What strategies can I use to navigate infinite scrolling or ‘Load More’ buttons?"*
-   - *"How do I integrate JS code execution and waiting to handle modern SPA (Single Page Application) layouts?"*
-
-7. **Scaling and Performance**
-   - *"How do I scale crawls to handle thousands of URLs concurrently?"*
-   - *"What options exist for caching and resource utilization optimization?"*
-   - *"How do I handle multiple browser instances efficiently for high-volume crawling?"*
-
-8. **Extraction Strategies**
-   - *"How can I use `JsonCssExtractionStrategy` to extract structured data?"*
-   - *"What methods are available to chunk or filter extracted content?"*
-
-9. **Magic Mode vs. Managed Browsers**
-   - *"What is Magic Mode and when should I use it over Managed Browsers?"*
-   - *"Does Magic Mode help with basic sites, and how do I enable it?"*
-
-10. **Troubleshooting and Best Practices**
-    - *"How can I debug browser automation issues with logs and headful mode?"*
-    - *"What best practices should I follow to respect website policies?"*
-    - *"How do I handle authentication flows, form submissions, and CAPTCHA challenges effectively?"*
-
-### Topics Discussed in the File
-
- **Browser Instance Creation** (Standard vs. Persistent Contexts)  
- **`BrowserConfig` Customization** (headless mode, viewport, proxies, debugging)  
- **Managed Browser for Resource Management and Debugging**  
- **Identity Preservation Techniques** (Stealth, Human-like Behavior, Bypass CAPTCHAs)  
- **Persistent Sessions and `user_data_dir`** (Session Reuse, Authentication Flows)  
- **Crawling Modern Web Apps** (Dynamic Content, JS Injection, Infinite Scrolling)  
- **Session Management with `session_id`** (Maintaining State, Multi-Step Flows)  
- **Magic Mode** (Automation of User-Like Behavior, Simple Setup)  
- **Extraction Strategies** (`JsonCssExtractionStrategy`, Handling Structured Data)  
- **Scaling and Performance Optimization** (Multiple URLs, Concurrency, Reusing Sessions)  
- **Best Practices and Troubleshooting** (Respecting Policies, Debugging Tools, Handling Errors)
+browser_creation: Create standard browser instance with default configurations | browser initialization, basic setup, minimal config | AsyncWebCrawler(browser_config=BrowserConfig(browser_type="chromium", headless=True))
+persistent_context: Use persistent browser contexts to maintain session data and cookies | user_data_dir, session storage, login state | BrowserConfig(user_data_dir="/path/to/user/data")
+managed_browser: High-level browser management with resource optimization and debugging | browser process, stealth mode, debugging tools | BrowserConfig(headless=False, debug_port=9222)
+context_config: Configure browser context with custom headers and cookies | headers customization, session reuse | CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
+page_creation: Create and customize browser pages with viewport settings | viewport size, iframe handling, lazy loading | CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
+identity_preservation: Maintain authentic digital identity using Managed Browsers | user profiles, CAPTCHA bypass, persistent login | BrowserConfig(use_managed_browser=True, user_data_dir="/path/to/profile")
+magic_mode: Enable automated user-like behavior and detection bypass | automation masking, cookie handling | crawler.arun(url="example.com", magic=True)
+session_management: Maintain state across multiple requests using session IDs | session reuse, sequential crawling | CrawlerRunConfig(session_id="my_session")
+dynamic_content: Handle JavaScript-rendered content with custom execution hooks | content loading, pagination | js_code="document.querySelector('a.pagination-next').click()"
+best_practices: Follow recommended patterns for efficient crawling | resource management, error handling | crawler.crawler_strategy.kill_session(session_id)