# Crawl Request Parameters for AsyncWebCrawler

The `arun` method in Crawl4AI's `AsyncWebCrawler` is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `arun` method, along with their descriptions, possible values, and examples.
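Before looking at individual parameters, here is a minimal sketch of a call that relies on defaults everywhere. It assumes the returned result object exposes the converted page as a `markdown` attribute; treat that attribute name as an assumption if your version differs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Only the required parameter is set; everything else uses its default
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        # Assumption: the result exposes the page converted to Markdown
        if result.markdown:
            print(result.markdown[:300])

asyncio.run(main())
```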
## Parameters
### url (str)
**Description:** The URL of the webpage to crawl.
**Required:** Yes
**Example:**
```python
url = "https://www.nbcnews.com/business"
```
### word_count_threshold (int)
**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is defined by `MIN_WORD_THRESHOLD`.
**Required:** No
**Default Value:** `MIN_WORD_THRESHOLD`
**Example:**
```python
word_count_threshold = 10
```
### extraction_strategy (ExtractionStrategy)
**Description:** The strategy to use for extracting content from the HTML. It must be an instance of `ExtractionStrategy`. If not provided, the default is `NoExtractionStrategy`.
**Required:** No
**Default Value:** `NoExtractionStrategy()`
**Example:**
```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```
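To see what a strategy produces, read the result after the crawl. A minimal sketch, assuming the result object exposes the strategy's output as an `extracted_content` string (commonly JSON for `CosineStrategy`):

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            extraction_strategy=CosineStrategy(semantic_filter="finance"),
        )
        # Assumption: extracted_content holds the strategy output as a JSON string
        if result.extracted_content:
            blocks = json.loads(result.extracted_content)
            print(f"Extracted {len(blocks)} content block(s)")

asyncio.run(main())
```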
### chunking_strategy (ChunkingStrategy)
**Description:** The strategy to use for chunking the text before processing. It must be an instance of `ChunkingStrategy`. The default value is `RegexChunking()`.
**Required:** No
**Default Value:** `RegexChunking()`
**Example:**
```python
chunking_strategy = NlpSentenceChunking()
```
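Chunking happens before extraction: the page text is split into pieces that the extraction strategy then processes. As a standalone illustration, here is a sketch that applies chunking strategies directly to a string; it assumes each strategy exposes a `chunk()` method returning a list of strings, which is an assumption about the `ChunkingStrategy` interface:

```python
from crawl4ai.chunking_strategy import RegexChunking, NlpSentenceChunking

text = (
    "Crawl4AI splits pages into chunks.\n\n"
    "Each chunk is processed separately. Short chunks may be dropped."
)

# Assumption: chunking strategies expose chunk(text) -> list of strings
paragraph_chunks = RegexChunking().chunk(text)
# NlpSentenceChunking may require NLP model data to be installed
sentence_chunks = NlpSentenceChunking().chunk(text)

print(len(paragraph_chunks), "paragraph chunk(s)")
print(len(sentence_chunks), "sentence chunk(s)")
```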
### bypass_cache (bool)
**Description:** Whether to force a fresh crawl even if the URL has been previously crawled. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
bypass_cache = True
```
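The effect is easiest to see with two consecutive calls to the same URL: the first call populates the cache, and the second forces a fresh fetch instead of reusing it. A small sketch using only the parameters documented here:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        url = "https://www.nbcnews.com/business"
        # First call: fetches the page and stores it in the cache
        await crawler.arun(url=url)
        # Second call: bypass_cache=True forces a fresh crawl instead of a cache hit
        result = await crawler.arun(url=url, bypass_cache=True)
        print(result)

asyncio.run(main())
```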
### css_selector (str)
**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
**Required:** No
**Default Value:** `None`
**Example:**
```python
css_selector = "div.article-content"
```
### screenshot (bool)
**Description:** Whether to take screenshots of the page. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
screenshot = True
```
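The capture is returned on the result object rather than written to disk. A sketch for saving it, assuming the image arrives as a base64-encoded string in `result.screenshot`:

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            screenshot=True,
        )
        # Assumption: result.screenshot is a base64-encoded image string
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())
```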
### user_agent (str)
**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
**Required:** No
**Default Value:** `None`
**Example:**
```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```
### verbose (bool)
**Description:** Whether to enable verbose logging. The default value is `True`.
**Required:** No
**Default Value:** `True`
**Example:**
```python
verbose = True
```
### `**kwargs`
Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:

- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
- **session_id (str):** A unique identifier for the crawling session. This is useful for maintaining state across multiple requests.
- **js_code (str or list):** JavaScript code to be executed on the page before extraction.
- **wait_for (str):** A CSS selector or JavaScript function to wait for before considering the page load complete.

**Example:**
```python
result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True,
    session_id="unique_session_123",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="article.main-article"
)
```
## Example Usage

Here's an example of how to use the `arun` method with various parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

async def main():
    # Create the AsyncWebCrawler instance
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler with custom parameters
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=10,
            extraction_strategy=CosineStrategy(semantic_filter="finance"),
            chunking_strategy=NlpSentenceChunking(),
            bypass_cache=True,
            css_selector="div.article-content",
            screenshot=True,
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            verbose=True,
            only_text=True,
            session_id="business_news_session",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
        )

        print(result)

# Run the async function
asyncio.run(main())
```
This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.
## Additional Asynchronous Methods

The `AsyncWebCrawler` class also provides other useful asynchronous methods:
### arun_many
**Description:** Crawl multiple URLs concurrently.
**Example:**
```python
urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
```
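`arun_many` returns one result per URL, so a typical pattern is to loop over the results and separate successes from failures. A sketch, assuming each result carries `url` and `success` attributes (an assumption about the result object, not confirmed by this page):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
        for result in results:
            # Assumption: each result reports its source URL and a success flag
            status = "ok" if result.success else "failed"
            print(f"{result.url}: {status}")

asyncio.run(main())
```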
### aclear_cache
**Description:** Clear the crawler's cache.
**Example:**
```python
await crawler.aclear_cache()
```
### aflush_cache
**Description:** Completely flush the crawler's cache.
**Example:**
```python
await crawler.aflush_cache()
```
### aget_cache_size
**Description:** Get the current size of the cache.
**Example:**
```python
cache_size = await crawler.aget_cache_size()
print(f"Current cache size: {cache_size}")
```

These asynchronous methods allow for efficient and flexible use of the `AsyncWebCrawler` in various scenarios.