Push async version last changes for merge to main branch

2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions
--- a/docs/md/full_details/crawl_request_parameters.md
+++ b/docs/md/full_details/crawl_request_parameters.md
@@ -1,6 +1,6 @@
-# Crawl Request Parameters
+# Crawl Request Parameters for AsyncWebCrawler

-The `run` function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `run` function, along with their descriptions, possible values, and examples.
+The `arun` method in Crawl4AI's `AsyncWebCrawler` is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `arun` method, along with their descriptions, possible values, and examples.

 ## Parameters

@@ -13,9 +13,9 @@ url = "https://www.nbcnews.com/business"
 ```

 ### word_count_threshold (int)
-**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is `5`.
+**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is defined by `MIN_WORD_THRESHOLD`.
 **Required:** No
-**Default Value:** `5`
+**Default Value:** `MIN_WORD_THRESHOLD`
 **Example:**
 ```python
 word_count_threshold = 10
@@ -88,43 +88,92 @@ verbose = True
 Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:

 - **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
+- **session_id (str):** A unique identifier for the crawling session. This is useful for maintaining state across multiple requests.
+- **js_code (str or list):** JavaScript code to be executed on the page before extraction.
+- **wait_for (str):** A CSS selector or JavaScript function that must be satisfied before the page is considered fully loaded.

 **Example:**
 ```python
-result = crawler.run(
+result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    css_selector="p",
-    only_text=True
+    only_text=True,
+    session_id="unique_session_123",
+    js_code="window.scrollTo(0, document.body.scrollHeight);",
+    wait_for="article.main-article"
 )
 ```

 ## Example Usage

-Here's an example of how to use the `run` function with various parameters:
+Here's an example of how to use the `arun` method with various parameters:

 ```python
-from crawl4ai import WebCrawler
+import asyncio
+from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import CosineStrategy
 from crawl4ai.chunking_strategy import NlpSentenceChunking

-# Create the WebCrawler instance 
-crawler = WebCrawler() 
+async def main():
+    # Create the AsyncWebCrawler instance
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Run the crawler with custom parameters
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            word_count_threshold=10,
+            extraction_strategy=CosineStrategy(semantic_filter="finance"),
+            chunking_strategy=NlpSentenceChunking(),
+            bypass_cache=True,
+            css_selector="div.article-content",
+            screenshot=True,
+            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
+            verbose=True,
+            only_text=True,
+            session_id="business_news_session",
+            js_code="window.scrollTo(0, document.body.scrollHeight);",
+            wait_for="footer"
+        )

-# Run the crawler with custom parameters
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    word_count_threshold=10,
-    extraction_strategy=CosineStrategy(semantic_filter="finance"),
-    chunking_strategy=NlpSentenceChunking(),
-    bypass_cache=True,
-    css_selector="div.article-content",
-    screenshot=True,
-    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
-    verbose=True,
-    only_text=True
-)
+        print(result)

-print(result)
+# Run the async function
+asyncio.run(main())
 ```

-This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.
+This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.
+
+## Additional Asynchronous Methods
+
+The `AsyncWebCrawler` class also provides other useful asynchronous methods:
+
+### arun_many
+**Description:** Crawl multiple URLs concurrently.
+**Example:**
+```python
+urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
+results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
+```
+
+### aclear_cache
+**Description:** Clear the crawler's cache.
+**Example:**
+```python
+await crawler.aclear_cache()
+```
+
+### aflush_cache
+**Description:** Completely flush the crawler's cache.
+**Example:**
+```python
+await crawler.aflush_cache()
+```
+
+### aget_cache_size
+**Description:** Get the current size of the cache.
+**Example:**
+```python
+cache_size = await crawler.aget_cache_size()
+print(f"Current cache size: {cache_size}")
+```
+
+Together, these methods cover concurrent crawling of multiple URLs (`arun_many`) and cache management (`aclear_cache`, `aflush_cache`, `aget_cache_size`) on `AsyncWebCrawler`.
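
The `arun_many` entry in this change describes crawling multiple URLs concurrently. As a rough illustration of that pattern, here is a minimal sketch using only the standard library. It is not Crawl4AI's implementation, which this diff does not show: `fake_arun`, `run_many`, `crawl_all`, and `max_concurrency` are hypothetical names, and `fake_arun` is a stand-in for `crawler.arun` that performs no network I/O.

```python
import asyncio


async def fake_arun(url, **kwargs):
    """Hypothetical stand-in for crawler.arun: pauses briefly, no network I/O."""
    await asyncio.sleep(0.01)
    return {"url": url, "success": True, "options": kwargs}


async def run_many(urls, max_concurrency=2, **kwargs):
    """Run fake_arun over all URLs, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # The semaphore caps how many crawls are in flight at once
        async with semaphore:
            return await fake_arun(url, **kwargs)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


def crawl_all(urls, **kwargs):
    """Synchronous entry point that drives the event loop."""
    return asyncio.run(run_many(urls, **kwargs))


if __name__ == "__main__":
    urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
    for result in crawl_all(urls, word_count_threshold=10):
        print(result["url"], result["success"])
```

Bounding concurrency with a semaphore is one possible design choice; it keeps a long URL list from opening every page at once. Whether `arun_many` applies a similar limit is an assumption and should be checked against the library source.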