Update documents, upload new version of quickstart.

UncleCode
2024-10-30 20:39:35 +08:00
parent 3529c2e732
commit 9307c19f35
10 changed files with 1481 additions and 799 deletions


@@ -150,4 +150,11 @@ strong,
.tab-content pre {
margin: 0;
max-height: 300px; overflow: auto; border: none;
}
+ol li::before {
+    content: counters(item, ".") ". ";
+    counter-increment: item;
+    /* float: left; */
+    /* padding-right: 5px; */
+}


@@ -9,17 +9,19 @@ Here's a condensed outline of the **Installation and Setup** video content:
---
1. **Introduction to Crawl4AI**:
- Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
2. **Installation Overview**:
- **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).
- **Optional Advanced Installs**:
- `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- `pip install crawl4ai[all]` - Installs all features for complete functionality.
3. **Verifying the Installation**:
- Walk through a simple test script to confirm the setup:
```python
import asyncio
@@ -34,12 +36,13 @@ Here's a condensed outline of the **Installation and Setup** video content:
```
- Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.
4. **Important Tips**:
- **Run** `playwright install` **after installation** to set up dependencies.
- **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- If you encounter issues, refer to the documentation or GitHub issues.
5. **Wrap Up**:
- Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).
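The verification script in step 3 follows the standard `asyncio` pattern: enter the crawler as an async context manager, `await` the crawl, then inspect the result. A minimal sketch of that flow, using a hypothetical `StubCrawler` in place of the real `AsyncWebCrawler` so it runs with no dependencies or network access:

```python
import asyncio
from types import SimpleNamespace

# Hypothetical stub standing in for crawl4ai's AsyncWebCrawler so the
# calling pattern can run without any dependencies or network access.
class StubCrawler:
    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def arun(self, url):
        # The real crawler fetches and processes the page here; the stub
        # only returns an object exposing a markdown attribute.
        return SimpleNamespace(markdown=f"# Stub content for {url}")

async def main():
    # Same shape as the verification script: context manager, await, print.
    async with StubCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:100])

asyncio.run(main())
```

The real script simply swaps `StubCrawler` for `AsyncWebCrawler` imported from `crawl4ai`.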
---


@@ -11,10 +11,12 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
### **Overview of Advanced Features**
1. **Introduction to Advanced Features**:
- Briefly introduce Crawl4AI's advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.
2. **Taking Screenshots**:
- Explain the screenshot capability for capturing page state and verifying content.
- **Example**:
```python
@@ -22,7 +24,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
```
- Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
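Since the screenshot arrives as a base64 string, saving it is a short job with the standard library. A sketch, assuming the string has been pulled off the result object (the dummy payload below stands in for real image data):

```python
import base64

# Hypothetical base64 payload standing in for the screenshot string
# carried by the crawl result.
screenshot_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n<image bytes>").decode()

# Decode back to raw bytes and write them out as a PNG file.
png_bytes = base64.b64decode(screenshot_b64)
with open("screenshot.png", "wb") as f:
    f.write(png_bytes)
print(len(png_bytes), "bytes written")
```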
3. **Media and Link Extraction**:
- Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- **Example**:
```python
@@ -31,14 +34,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
print("Links:", result.links)
```
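Under the hood, this kind of extraction walks the HTML and collects `src`/`href` attributes. A rough stdlib illustration of the idea (not Crawl4AI's actual implementation) using `html.parser`:

```python
from html.parser import HTMLParser

# Collects image sources and link targets, roughly what ends up
# structured in result.media and result.links.
class MediaLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images, self.links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

collector = MediaLinkCollector()
collector.feed('<a href="/about">About</a><img src="/logo.png">')
print("Images:", collector.images)
print("Links:", collector.links)
```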
4. **Custom User Agent**:
- Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
```
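The `user_agent` parameter ultimately just controls a request header. A minimal stdlib sketch of the same idea with `urllib.request` (the request object is only constructed; nothing is sent):

```python
import urllib.request

# Setting the same header by hand that the user_agent parameter automates.
ua = "Mozilla/5.0 (compatible; MyCrawler/1.0)"
req = urllib.request.Request(
    "https://www.example.com",
    headers={"User-Agent": ua},
)

# urllib normalizes header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```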
5. **Custom Hooks for Enhanced Control**:
- Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
- **Example**: Setting a custom header with `before_get_url` hook.
```python
@@ -46,7 +51,8 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
await page.set_extra_http_headers({"X-Test-Header": "test"})
```
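A hook like this is simply an async function the crawler awaits with the page object. A runnable sketch with a hypothetical stub page (the real hook receives Playwright's `Page`):

```python
import asyncio

# Hypothetical stub page; the real hook is called with Playwright's Page.
class StubPage:
    def __init__(self):
        self.extra_headers = {}

    async def set_extra_http_headers(self, headers):
        # Playwright's method applies extra headers to outgoing requests;
        # the stub just records them.
        self.extra_headers.update(headers)

# Hook shaped like the before_get_url example above.
async def before_get_url(page):
    await page.set_extra_http_headers({"X-Test-Header": "test"})

page = StubPage()
asyncio.run(before_get_url(page))
print(page.extra_headers)
```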
6. **CSS Selectors for Targeted Extraction**:
- Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- **Example**:
```python
@@ -54,14 +60,16 @@ Here's a condensed outline for an **Overview of Advanced Features** video coveri
print("H2 Tags:", result.extracted_content)
```
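Selecting `h2` elements boils down to collecting the text of every matching tag, which Crawl4AI handles internally from the selector you pass. A stdlib sketch of that behavior with `html.parser`:

```python
from html.parser import HTMLParser

# Stdlib sketch of what selecting "h2" yields: the text of each match.
class H2Extractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headings.append(data)

extractor = H2Extractor()
extractor.feed("<h1>Title</h1><h2>First</h2><p>Body</p><h2>Second</h2>")
print("H2 Tags:", extractor.headings)
```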
7. **Crawling Inside Iframes**:
- Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- **Example**:
```python
result = await crawler.arun(url="https://www.example.com", process_iframes=True)
```
8. **Wrap-Up**:
- Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- Tease upcoming videos where each feature will be explored in detail.


@@ -42,7 +42,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)
-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
@@ -57,7 +57,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")
-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -73,7 +73,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)
-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
@@ -90,7 +90,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")
-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
@@ -106,7 +106,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
@@ -122,7 +122,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
@@ -138,7 +138,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")
-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
@@ -148,9 +148,9 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
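Behind `set_hook`, a registry of roughly this shape is enough: store one callable per event name and `await` it when it turns out to be a coroutine. A minimal sketch with illustrative names, not Crawl4AI's actual internals:

```python
import asyncio
import inspect

# Minimal registry sketch: hooks are stored by event name and awaited
# when firing them returns an awaitable.
class HookRegistry:
    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        self._hooks[name] = fn

    async def fire(self, name, *args):
        fn = self._hooks.get(name)
        if fn is None:
            return
        result = fn(*args)
        if inspect.isawaitable(result):
            await result

events = []

async def on_browser_created(browser):
    events.append(f"created:{browser}")

registry = HookRegistry()
registry.set_hook("on_browser_created", on_browser_created)      # async hook
registry.set_hook("after_goto", lambda url: events.append(url))  # sync hook

async def main():
    await registry.fire("on_browser_created", "chromium")
    await registry.fire("after_goto", "https://example.com")

asyncio.run(main())
print(events)
```

This is why both plain functions and `async def` functions work as hooks: the dispatcher only awaits when there is something to await.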
#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -160,10 +160,10 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
# Perform the crawl
url = "https://example.com"


@@ -771,9 +771,11 @@ Here's a concise outline for the **Custom Headers, Identity Management, and Us
async with AsyncWebCrawler(
headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
-simulate_user=True
) as crawler:
-result = await crawler.arun(url="https://example.com/secure-page")
+result = await crawler.arun(
+    url="https://example.com/secure-page",
+    simulate_user=True
+)
print(result.markdown[:500]) # Display extracted content
```
- This example enables detailed customization for evading detection and accessing protected pages smoothly.
@@ -1576,7 +1578,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_browser_creation(browser):
print("Browser instance created:", browser)
-crawler.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
```
- **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
@@ -1591,7 +1593,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
def update_user_agent(user_agent):
print(f"User Agent Updated: {user_agent}")
-crawler.set_hook('on_user_agent_updated', update_user_agent)
+crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
```
- **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
@@ -1607,7 +1609,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def log_execution_start(page):
print("Execution started on page:", page.url)
-crawler.set_hook('on_execution_started', log_execution_start)
+crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
```
- **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
@@ -1624,7 +1626,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
print("Custom headers set before navigation")
-crawler.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
```
- **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
@@ -1640,7 +1642,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
print("Scrolled to the bottom after navigation")
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
- **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
@@ -1656,7 +1658,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
print("Advertisements removed before returning HTML")
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
```
- **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
@@ -1672,7 +1674,7 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
await page.wait_for_selector('.main-content')
print("Main content loaded, ready to retrieve HTML")
-crawler.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
+crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
```
- **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
@@ -1682,9 +1684,9 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
- Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- **Example Setup**:
```python
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
```
#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
@@ -1694,10 +1696,10 @@ Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCra
async def custom_crawl():
async with AsyncWebCrawler() as crawler:
# Set hooks for custom workflow
-crawler.set_hook('on_browser_created', log_browser_creation)
-crawler.set_hook('before_goto', modify_headers_before_goto)
-crawler.set_hook('after_goto', post_navigation_scroll)
-crawler.set_hook('before_return_html', remove_advertisements)
+crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
+crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
+crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
+crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
# Perform the crawl
url = "https://example.com"