ADD MKDocs

2024-06-21 17:56:54 +08:00
parent 21b110bfd7
commit e7705e661a
34 changed files with 3933 additions and 580 deletions
--- a/docs/md/full_details/advanced_features.md
+++ b/docs/md/full_details/advanced_features.md
@@ -0,0 +1,138 @@
+# Advanced Features
+
+Crawl4AI offers a range of advanced features that allow you to fine-tune your web crawling and data extraction process. This section will cover some of these advanced features, including taking screenshots, extracting media and links, customizing the user agent, using custom hooks, and leveraging CSS selectors.
+
+## Taking Screenshots 📸
+
+One of the cool features of Crawl4AI is the ability to take screenshots of the web pages you're crawling. This can be particularly useful for visual verification or for capturing the state of dynamic content.
+
+Here's how you can take a screenshot:
+
+```python
+from crawl4ai import WebCrawler
+import base64
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+crawler.warmup()
+
+# Run the crawler with the screenshot parameter
+result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
+
+# Save the screenshot to a file
+with open("screenshot.png", "wb") as f:
+    f.write(base64.b64decode(result.screenshot))
+
+print("Screenshot saved to 'screenshot.png'!")
+```
+
+In this example, we create a `WebCrawler` instance, warm it up, and then run it with the `screenshot` parameter set to `True`. The result stores the screenshot as a base64-encoded string, which we decode and write to a PNG file.
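The decoding step can be tried without running a crawl. The sketch below is only an illustration: `fake_screenshot` and `decode_screenshot` are hypothetical names, and the input is just the 8-byte PNG signature rather than a real screenshot. It shows one way to guard against a missing or non-PNG payload before writing the file.

```python
import base64

# Hypothetical stand-in for `result.screenshot`: the 8-byte PNG signature,
# base64-encoded the same way a real screenshot string would be.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
fake_screenshot = base64.b64encode(PNG_SIGNATURE).decode("ascii")


def decode_screenshot(encoded):
    """Decode a base64 string and confirm it starts with the PNG signature."""
    if not encoded:
        raise ValueError("no screenshot data returned")
    data = base64.b64decode(encoded)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


png_bytes = decode_screenshot(fake_screenshot)
print(len(png_bytes), "bytes decoded")
```

A check like this is optional; it simply fails early instead of writing an empty or corrupt file.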
+
+## Extracting Media and Links 🎨🔗
+
+Crawl4AI can extract all media tags (images, audio, and video) and links (both internal and external) from a web page. This feature is useful for collecting multimedia content or analyzing link structures.
+
+Here's an example:
+
+```python
+from crawl4ai import WebCrawler
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+crawler.warmup()
+
+# Run the crawler
+result = crawler.run(url="https://www.nbcnews.com/business")
+
+print("Extracted media:", result.media)
+print("Extracted links:", result.links)
+```
+
+In this example, the `result` object contains dictionaries for media and links, which you can access and use as needed.
+
+## Customizing the User Agent 🕵️‍♂️
+
+Crawl4AI allows you to set a custom user agent for your HTTP requests. This can help you avoid detection by web servers or simulate different browsing environments.
+
+Here's how to set a custom user agent:
+
+```python
+from crawl4ai import WebCrawler
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+crawler.warmup()
+
+# Run the crawler with a custom user agent
+result = crawler.run(url="https://www.nbcnews.com/business", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
+
+print("Crawl result:", result)
+```
+
+In this example, we specify a custom user agent string when running the crawler.
+
+## Using Custom Hooks 🪝
+
+Hooks are a powerful feature in Crawl4AI that allow you to customize the crawling process at various stages. You can define hooks for actions such as driver initialization, before and after URL fetching, and before returning the HTML.
+
+Here's an example of using hooks:
+
+```python
+from crawl4ai import WebCrawler
+from selenium.webdriver.common.by import By
+from selenium.webdriver.support.ui import WebDriverWait
+from selenium.webdriver.support import expected_conditions as EC
+
+# Define the hooks
+def on_driver_created(driver):
+    driver.maximize_window()
+    driver.get('https://example.com/login')
+    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, 'username'))).send_keys('testuser')
+    driver.find_element(By.NAME, 'password').send_keys('password123')
+    driver.find_element(By.NAME, 'login').click()
+    return driver
+
+def before_get_url(driver):
+    driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
+    return driver
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+crawler.warmup()
+
+# Set the hooks
+crawler.set_hook('on_driver_created', on_driver_created)
+crawler.set_hook('before_get_url', before_get_url)
+
+# Run the crawler
+result = crawler.run(url="https://example.com")
+
+print("Crawl result:", result)
+```
+
+In this example, we define hooks to handle driver initialization and custom headers before fetching the URL.
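To make the hook flow easier to picture, here is a minimal sketch of a hook registry. It is not Crawl4AI's internal code: `HookRunner`, `execute_hook`, and the dictionary used as a driver are assumptions made for illustration. Only the hook names and the `set_hook(name, func)` call shape come from the example above.

```python
class HookRunner:
    """Toy hook registry; an illustration, not Crawl4AI's internal code."""

    def __init__(self):
        # Hook names taken from the example above; each starts unset.
        self.hooks = {"on_driver_created": None, "before_get_url": None}

    def set_hook(self, name, func):
        if name not in self.hooks:
            raise ValueError(f"unknown hook: {name}")
        self.hooks[name] = func

    def execute_hook(self, name, driver):
        hook = self.hooks[name]
        # An unset hook leaves the driver untouched.
        return driver if hook is None else hook(driver)


def add_test_header(driver):
    # Mirrors the idea of `before_get_url`: adjust the driver, then return it.
    driver["headers"]["X-Test-Header"] = "test"
    return driver


runner = HookRunner()
runner.set_hook("before_get_url", add_test_header)

fake_driver = {"headers": {}}  # hypothetical stand-in for a Selenium driver
fake_driver = runner.execute_hook("on_driver_created", fake_driver)
fake_driver = runner.execute_hook("before_get_url", fake_driver)
print(fake_driver["headers"])
```

The point of the pattern is that each hook receives the driver and hands it back, so hooks can be chained in a fixed order.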
+
+## Using CSS Selectors 🎯
+
+CSS selectors allow you to target specific elements on a web page for extraction. This can be useful for scraping structured content, such as articles or product details.
+
+Here's an example of using a CSS selector:
+
+```python
+from crawl4ai import WebCrawler
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+crawler.warmup()
+
+# Run the crawler with a CSS selector to extract only H2 tags
+result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
+
+print("Extracted H2 tags:", result.extracted_content)
+```
+
+In this example, we use the `css_selector` parameter to extract only the H2 tags from the web page.
+
+---
+
+With these advanced features, you can leverage Crawl4AI to perform sophisticated web crawling and data extraction tasks. Whether you need to take screenshots, extract specific elements, customize the crawling process, or set custom headers, Crawl4AI provides the flexibility and power to meet your needs. Happy crawling! 🕷️🚀
--- a/docs/md/full_details/chunking_strategies.md
+++ b/docs/md/full_details/chunking_strategies.md
@@ -0,0 +1,133 @@
+## Chunking Strategies 📚
+
+Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.
+
+### RegexChunking
+
+`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.
+
+#### When to Use
+- Great for structured text with consistent delimiters.
+- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.
+
+#### Parameters
+- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import RegexChunking
+
+# Define patterns for splitting text
+patterns = [r'\n\n', r'\. ']
+chunker = RegexChunking(patterns=patterns)
+
+# Sample text
+text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
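For intuition, splitting on several patterns in turn can be sketched with the standard `re` module. This is an assumption about the general technique, not the library's actual implementation; `regex_chunks` is a hypothetical helper.

```python
import re


def regex_chunks(text, patterns):
    """Split text on each pattern in turn, dropping empty pieces."""
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            # re.split consumes the delimiter, so it is not kept in the output.
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [chunk for chunk in chunks if chunk.strip()]


text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
for chunk in regex_chunks(text, [r"\n\n", r"\. "]):
    print(repr(chunk))
```

Note that in this sketch the matched delimiter (for example the period and space) is removed from the resulting chunks.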
+
+### NlpSentenceChunking
+
+`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.
+
+#### When to Use
+- Ideal for texts where sentence boundaries are crucial.
+- Useful for creating chunks that preserve grammatical structures.
+
+#### Parameters
+- None.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import NlpSentenceChunking
+
+chunker = NlpSentenceChunking()
+
+# Sample text
+text = "This is a sample text. It will be split into sentences. Here's another sentence."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
+
+### TopicSegmentationChunking
+
+`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.
+
+#### When to Use
+- Perfect for long documents with distinct topics.
+- Useful when preserving topic continuity is more important than maintaining text order.
+
+#### Parameters
+- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import TopicSegmentationChunking
+
+chunker = TopicSegmentationChunking(num_keywords=3)
+
+# Sample text
+text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
+
+### FixedLengthWordChunking
+
+`FixedLengthWordChunking` splits text into chunks containing a fixed number of words, so every chunk except possibly the last has the same word count.
+
+#### When to Use
+- Suitable for processing large texts where uniform chunk size is important.
+- Useful when the number of words per chunk needs to be controlled.
+
+#### Parameters
+- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import FixedLengthWordChunking
+
+chunker = FixedLengthWordChunking(chunk_size=10)
+
+# Sample text
+text = "This is a sample text. It will be split into chunks of fixed length."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
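The idea behind fixed-length word chunking can be sketched in a few lines. This is an illustrative reimplementation under the assumption that words are separated by whitespace; `fixed_length_chunks` is a hypothetical name, not part of the library.

```python
def fixed_length_chunks(text, chunk_size=100):
    """Group whitespace-separated words into chunks of chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


text = "This is a sample text. It will be split into chunks of fixed length."
for chunk in fixed_length_chunks(text, chunk_size=10):
    print(repr(chunk))
```

With 14 words and a chunk size of 10, the sketch yields one full chunk and one shorter trailing chunk.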
+
+### SlidingWindowChunking
+
+`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.
+
+#### When to Use
+- Ideal for creating overlapping chunks to preserve context.
+- Useful for tasks where context from adjacent chunks is needed.
+
+#### Parameters
+- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
+- `step` (int, optional): Number of words to slide the window. Default is `50`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import SlidingWindowChunking
+
+chunker = SlidingWindowChunking(window_size=10, step=5)
+
+# Sample text
+text = "This is a sample text. It will be split using a sliding window approach to preserve context."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
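A sliding window over words can be sketched as follows. This is a simplified illustration of the technique, not the library's code; `sliding_window_chunks` is a hypothetical helper, and stopping once a window reaches the end of the text is a design choice of this sketch.

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Return overlapping chunks of window_size words, advancing by step."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window already covers the final word.
        if start + window_size >= len(words):
            break
    return chunks


text = "This is a sample text. It will be split using a sliding window approach to preserve context."
for chunk in sliding_window_chunks(text, window_size=10, step=5):
    print(repr(chunk))
```

Because `step` is smaller than `window_size`, consecutive chunks share words, which is what preserves context across chunk boundaries.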
+
+With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking! 📝✨
--- a/docs/md/full_details/crawl_request_parameters.md
+++ b/docs/md/full_details/crawl_request_parameters.md
@@ -0,0 +1,130 @@
+# Crawl Request Parameters
+
+The `run` function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `run` function, along with their descriptions, possible values, and examples.
+
+## Parameters
+
+### url (str)
+**Description:** The URL of the webpage to crawl.
+**Required:** Yes
+**Example:**
+```python
+url = "https://www.nbcnews.com/business"
+```
+
+### word_count_threshold (int)
+**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is `5`.
+**Required:** No
+**Default Value:** `5`
+**Example:**
+```python
+word_count_threshold = 10
+```
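To illustrate what a word-count threshold does, the sketch below filters a list of text blocks. It assumes a block counts as meaningful when it has at least `word_count_threshold` whitespace-separated words; `filter_blocks` and the sample blocks are hypothetical and not taken from the library.

```python
def filter_blocks(blocks, word_count_threshold=5):
    """Keep only blocks with at least word_count_threshold words."""
    return [
        block for block in blocks
        if len(block.split()) >= word_count_threshold
    ]


blocks = [
    "Subscribe now",
    "Markets rallied on Tuesday after the central bank held rates steady.",
    "Read more",
]
print(filter_blocks(blocks, word_count_threshold=5))
```

Raising the threshold removes more short fragments such as navigation labels, at the cost of also dropping short but legitimate sentences.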
+
+### extraction_strategy (ExtractionStrategy)
+**Description:** The strategy to use for extracting content from the HTML. It must be an instance of `ExtractionStrategy`. If not provided, the default is `NoExtractionStrategy`.
+**Required:** No
+**Default Value:** `NoExtractionStrategy()`
+**Example:**
+```python
+extraction_strategy = CosineStrategy(semantic_filter="finance")
+```
+
+### chunking_strategy (ChunkingStrategy)
+**Description:** The strategy to use for chunking the text before processing. It must be an instance of `ChunkingStrategy`. The default value is `RegexChunking()`.
+**Required:** No
+**Default Value:** `RegexChunking()`
+**Example:**
+```python
+chunking_strategy = NlpSentenceChunking()
+```
+
+### bypass_cache (bool)
+**Description:** Whether to force a fresh crawl even if the URL has been previously crawled. The default value is `False`.
+**Required:** No
+**Default Value:** `False`
+**Example:**
+```python
+bypass_cache = True
+```
+
+### css_selector (str)
+**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
+**Required:** No
+**Default Value:** `None`
+**Example:**
+```python
+css_selector = "div.article-content"
+```
+
+### screenshot (bool)
+**Description:** Whether to take screenshots of the page. The default value is `False`.
+**Required:** No
+**Default Value:** `False`
+**Example:**
+```python
+screenshot = True
+```
+
+### user_agent (str)
+**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
+**Required:** No
+**Default Value:** `None`
+**Example:**
+```python
+user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
+```
+
+### verbose (bool)
+**Description:** Whether to enable verbose logging. The default value is `True`.
+**Required:** No
+**Default Value:** `True`
+**Example:**
+```python
+verbose = True
+```
+
+### **kwargs
+Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:
+
+- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
+
+**Example:**
+```python
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    css_selector="p",
+    only_text=True
+)
+```
+
+## Example Usage
+
+Here's an example of how to use the `run` function with various parameters:
+
+```python
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai.chunking_strategy import NlpSentenceChunking
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+
+# Run the crawler with custom parameters
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    word_count_threshold=10,
+    extraction_strategy=CosineStrategy(semantic_filter="finance"),
+    chunking_strategy=NlpSentenceChunking(),
+    bypass_cache=True,
+    css_selector="div.article-content",
+    screenshot=True,
+    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
+    verbose=True,
+    only_text=True
+)
+
+print(result)
+```
+
+This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.
--- a/docs/md/full_details/crawl_result_class.md
+++ b/docs/md/full_details/crawl_result_class.md
@@ -0,0 +1,120 @@
+# Crawl Result
+
+The `CrawlResult` class is the heart of Crawl4AI's output, encapsulating all the data extracted from a crawling session. This class contains various fields that store the results of the web crawling and extraction process. Let's break down each field and see what it holds. 🎉
+
+## Class Definition
+
+```python
+class CrawlResult(BaseModel):
+    url: str
+    html: str
+    success: bool
+    cleaned_html: Optional[str] = None
+    media: Dict[str, List[Dict]] = {}
+    links: Dict[str, List[Dict]] = {}
+    screenshot: Optional[str] = None
+    markdown: Optional[str] = None
+    extracted_content: Optional[str] = None
+    metadata: Optional[dict] = None
+    error_message: Optional[str] = None
+```
+
+## Fields Explanation
+
+### `url: str`
+The URL that was crawled. This field simply stores the URL of the web page that was processed.
+
+### `html: str`
+The raw HTML content of the web page. This is the unprocessed HTML source as retrieved by the crawler.
+
+### `success: bool`
+A flag indicating whether the crawling and extraction were successful. If any error occurs during the process, this will be `False`.
+
+### `cleaned_html: Optional[str]`
+The cleaned HTML content of the web page. This field holds the HTML after removing unwanted tags like `<script>`, `<style>`, and others that do not contribute to the useful content.
+
+### `media: Dict[str, List[Dict]]`
+A dictionary containing lists of extracted media elements from the web page. The media elements are categorized into images, videos, and audios. Here’s how they are structured:
+
+- **Images**: Each image is represented as a dictionary with `src` (source URL), `alt` (alternate text), and `type` (`"image"`).
+- **Videos**: Each video is represented the same way, with `type` set to `"video"`.
+- **Audios**: Each audio is represented the same way, with `type` set to `"audio"`.
+
+```python
+media = {
+    'images': [
+        {'src': 'image_url1', 'alt': 'description1', "type": "image"},
+        {'src': 'image_url2', 'alt': 'description2', "type": "image"}
+    ],
+    'videos': [
+        {'src': 'video_url1', 'alt': 'description1', "type": "video"}
+    ],
+    'audios': [
+        {'src': 'audio_url1', 'alt': 'description1', "type": "audio"}
+    ]
+}
+```
+
+### `links: Dict[str, List[Dict]]`
+A dictionary containing lists of internal and external links extracted from the web page. Each link is represented as a dictionary with `href` (URL) and `text` (link text).
+
+- **Internal Links**: Links pointing to the same domain.
+- **External Links**: Links pointing to different domains.
+
+```python
+links = {
+    'internal': [
+        {'href': 'internal_link1', 'text': 'link_text1'},
+        {'href': 'internal_link2', 'text': 'link_text2'}
+    ],
+    'external': [
+        {'href': 'external_link1', 'text': 'link_text1'}
+    ]
+}
+```
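The internal/external split described above can be sketched with the standard `urllib.parse` module. This is an assumption about how such a classification might work, not the library's implementation; `classify_links` is a hypothetical helper, and comparing the full host name means a subdomain is treated as external in this sketch.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, anchors):
    """Sort (href, text) pairs into internal and external link lists."""
    base_domain = urlparse(page_url).netloc
    links = {"internal": [], "external": []}
    for href, text in anchors:
        # Resolve relative hrefs against the page URL before comparing hosts.
        absolute = urljoin(page_url, href)
        same_domain = urlparse(absolute).netloc == base_domain
        key = "internal" if same_domain else "external"
        links[key].append({"href": absolute, "text": text})
    return links


anchors = [("/about", "About us"), ("https://other.example.org/post", "Partner post")]
links = classify_links("https://www.example.com/news/", anchors)
print(links)
```

The output has the same shape as the `links` dictionary shown above, with `href` and `text` keys for each entry.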
+
+### `screenshot: Optional[str]`
+A base64-encoded screenshot of the web page. This field stores the screenshot data if the crawling was configured to take a screenshot.
+
+### `markdown: Optional[str]`
+The content of the web page converted to Markdown format. This is useful for generating clean, readable text that retains the structure of the original HTML.
+
+### `extracted_content: Optional[str]`
+The content extracted based on the specified extraction strategy. This field holds the meaningful content blocks extracted from the web page, ready for your AI and data processing needs.
+
+### `metadata: Optional[dict]`
+A dictionary containing metadata extracted from the web page, such as title, description, keywords, and other meta tags.
+
+### `error_message: Optional[str]`
+If an error occurs during crawling, this field will contain the error message, helping you debug and understand what went wrong. 🚨
+
+## Example Usage
+
+Here's a quick example to illustrate how you might use the `CrawlResult` in your code:
+
+```python
+from crawl4ai import WebCrawler
+
+# Create the WebCrawler instance
+crawler = WebCrawler()
+
+# Run the crawler on a URL
+result = crawler.run(url="https://www.example.com")
+
+# Check if the crawl was successful
+if result.success:
+    print("Crawl succeeded!")
+    print("URL:", result.url)
+    print("HTML:", result.html[:100])  # Print the first 100 characters of the HTML
+    print("Cleaned HTML:", (result.cleaned_html or "")[:100])  # Optional fields may be None
+    print("Media:", result.media)
+    print("Links:", result.links)
+    print("Screenshot:", result.screenshot)
+    print("Markdown:", (result.markdown or "")[:100])
+    print("Extracted Content:", result.extracted_content)
+    print("Metadata:", result.metadata)
+else:
+    print("Crawl failed with error:", result.error_message)
+```
+
+With this setup, you can easily access all the valuable data extracted from the web page and integrate it into your applications. Happy crawling! 🕷️🤖
--- a/docs/md/full_details/extraction_strategies.md
+++ b/docs/md/full_details/extraction_strategies.md
@@ -0,0 +1,116 @@
+## Extraction Strategies 🧠
+
+Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into two of the most important strategies: `CosineStrategy` and `LLMExtractionStrategy`.
+
+### CosineStrategy
+
+`CosineStrategy` uses hierarchical clustering based on cosine similarity to group text chunks into meaningful clusters. This method converts each chunk into an embedding and then clusters the embeddings to form semantically coherent chunks.
+
+#### When to Use
+- Ideal for fast, accurate semantic segmentation of text.
+- Perfect for scenarios where LLMs might be overkill or too slow.
+- Suitable for narrowing down content based on specific queries or keywords.
+
+#### Parameters
+- `semantic_filter` (str, optional): Keywords for filtering relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
+- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
+- `max_dist` (float, optional): Maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
+- `linkage_method` (str, optional): Linkage method for hierarchical clustering. Default is `'ward'`.
+- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
+- `model_name` (str, optional): Model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
+
+#### Example
+```python
+from crawl4ai.extraction_strategy import CosineStrategy
+from crawl4ai import WebCrawler
+
+crawler = WebCrawler()
+crawler.warmup()
+
+# Define extraction strategy
+strategy = CosineStrategy(
+    semantic_filter="finance economy stock market",
+    word_count_threshold=10,
+    max_dist=0.2,
+    linkage_method='ward',
+    top_k=3,
+    model_name='BAAI/bge-small-en-v1.5'
+)
+
+# Sample URL
+url = "https://www.nbcnews.com/business"
+
+# Run the crawler with the extraction strategy
+result = crawler.run(url=url, extraction_strategy=strategy)
+print(result.extracted_content)
+```
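The similarity measure behind the `semantic_filter` step can be illustrated with plain Python. The sketch below uses tiny hand-written vectors instead of real model embeddings, and the `0.7` cutoff is an arbitrary value chosen for the demo; `cosine_similarity` and `filter_by_similarity` are hypothetical helpers, not library functions. The clustering step is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def filter_by_similarity(query_vec, chunk_vecs, threshold):
    """Return indices of chunks whose similarity to the query meets the threshold."""
    return [
        index for index, vec in enumerate(chunk_vecs)
        if cosine_similarity(query_vec, vec) >= threshold
    ]


# Toy "embeddings"; real ones would come from an embedding model.
query = [1.0, 0.0, 1.0]
chunk_vecs = [[2.0, 0.0, 2.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
print(filter_by_similarity(query, chunk_vecs, threshold=0.7))
```

Cosine similarity ignores vector length, so the first chunk matches the query fully even though its values are scaled by two.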
+
+### LLMExtractionStrategy
+
+`LLMExtractionStrategy` leverages a Large Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions.
+
+#### When to Use
+- Suitable for complex extraction tasks requiring nuanced understanding.
+- Ideal for scenarios where detailed instructions can guide the extraction process.
+- Perfect for extracting specific types of information or content with precise guidelines.
+
+#### Parameters
+- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
+- `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
+- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
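The environment-variable fallback for `api_token` can be pictured with a short sketch. This is an assumption about the general pattern, not the library's code; `resolve_api_token` is a hypothetical helper, and only the variable name `OPENAI_API_KEY` comes from the parameter description above.

```python
import os


def resolve_api_token(api_token=None, env_var="OPENAI_API_KEY"):
    """Prefer an explicit token, otherwise fall back to the environment."""
    token = api_token or os.getenv(env_var)
    if not token:
        raise ValueError(f"no API token given and {env_var} is not set")
    return token


# An explicit token always wins over the environment.
print(resolve_api_token(api_token="explicit-demo-token"))
```

Reading the token from the environment keeps secrets out of source files, which is usually preferable to hard-coding them in examples like the ones below.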
+
+#### Example Without Instructions
+```python
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import WebCrawler
+
+crawler = WebCrawler()
+crawler.warmup()
+
+# Define extraction strategy without instructions
+strategy = LLMExtractionStrategy(
+    provider='openai/gpt-4',
+    api_token='your_api_token'
+)
+
+# Sample URL
+url = "https://www.nbcnews.com/business"
+
+# Run the crawler with the extraction strategy
+result = crawler.run(url=url, extraction_strategy=strategy)
+print(result.extracted_content)
+```
+
+#### Example With Instructions
+```python
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from crawl4ai import WebCrawler
+
+crawler = WebCrawler()
+crawler.warmup()
+
+# Define extraction strategy with instructions
+strategy = LLMExtractionStrategy(
+    provider='openai/gpt-4',
+    api_token='your_api_token',
+    instruction="Extract only financial news and summarize key points."
+)
+
+# Sample URL
+url = "https://www.nbcnews.com/business"
+
+# Run the crawler with the extraction strategy
+result = crawler.run(url=url, extraction_strategy=strategy)
+print(result.extracted_content)
+```
+
+#### Use Cases for LLMExtractionStrategy
+- Extracting specific data types from structured or semi-structured content.
+- Generating summaries, extracting key information, or transforming content into different formats.
+- Performing detailed extractions based on custom instructions.
+
+For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation.
+
+---
+
+By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy` or nuanced, instruction-based extraction with `LLMExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨