ADD MKDocs

This commit is contained in:
unclecode
2024-06-21 17:56:54 +08:00
parent 21b110bfd7
commit e7705e661a
34 changed files with 3933 additions and 580 deletions

@@ -0,0 +1,138 @@
# Advanced Features
Crawl4AI offers a range of advanced features that allow you to fine-tune your web crawling and data extraction process. This section will cover some of these advanced features, including taking screenshots, extracting media and links, customizing the user agent, using custom hooks, and leveraging CSS selectors.
## Taking Screenshots 📸
One of the cool features of Crawl4AI is the ability to take screenshots of the web pages you're crawling. This can be particularly useful for visual verification or for capturing the state of dynamic content.
Here's how you can take a screenshot:
```python
from crawl4ai import WebCrawler
import base64
# Create the WebCrawler instance
crawler = WebCrawler()
crawler.warmup()
# Run the crawler with the screenshot parameter
result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
# Save the screenshot to a file
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
print("Screenshot saved to 'screenshot.png'!")
```
In this example, we create a `WebCrawler` instance, warm it up, and then run it with the `screenshot` parameter set to `True`. The screenshot is returned in `result.screenshot` as a base64-encoded string, which we decode and write to a PNG file.
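Because `screenshot` is an optional field on the result, it may be empty if no screenshot was captured. The sketch below guards against that case before decoding. `save_screenshot` is a hypothetical helper written for illustration, not part of Crawl4AI.

```python
import base64
from pathlib import Path
from typing import Optional


def save_screenshot(screenshot_b64: Optional[str], path: str) -> bool:
    """Decode a base64 screenshot string and write it to `path`.

    Hypothetical helper for illustration only. Returns False when
    there is nothing to save, True after the file has been written.
    """
    if not screenshot_b64:
        return False
    Path(path).write_bytes(base64.b64decode(screenshot_b64))
    return True
```

With a crawl result in hand, this would be called as `save_screenshot(result.screenshot, "screenshot.png")`.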
## Extracting Media and Links 🎨🔗
Crawl4AI can extract all media tags (images, audio, and video) and links (both internal and external) from a web page. This feature is useful for collecting multimedia content or analyzing link structures.
Here's an example:
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
crawler.warmup()
# Run the crawler
result = crawler.run(url="https://www.nbcnews.com/business")
print("Extracted media:", result.media)
print("Extracted links:", result.links)
```
In this example, the `result` object contains dictionaries for media and links, which you can access and use as needed.
## Customizing the User Agent 🕵️‍♂️
Crawl4AI allows you to set a custom user agent for your HTTP requests. This can help you avoid detection by web servers or simulate different browsing environments.
Here's how to set a custom user agent:
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
crawler.warmup()
# Run the crawler with a custom user agent
result = crawler.run(url="https://www.nbcnews.com/business", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
print("Crawl result:", result)
```
In this example, we specify a custom user agent string when running the crawler.
## Using Custom Hooks 🪝
Hooks are a powerful feature in Crawl4AI that allow you to customize the crawling process at various stages. You can define hooks for actions such as driver initialization, before and after URL fetching, and before returning the HTML.
Here's an example of using hooks:
```python
from crawl4ai import WebCrawler
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Define the hooks
def on_driver_created(driver):
    # Log in once, right after the driver is created
    driver.maximize_window()
    driver.get('https://example.com/login')
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.NAME, 'username'))).send_keys('testuser')
    driver.find_element(By.NAME, 'password').send_keys('password123')
    driver.find_element(By.NAME, 'login').click()
    return driver
def before_get_url(driver):
    # Attach a custom header to every request made after this point
    driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
    return driver
# Create the WebCrawler instance
crawler = WebCrawler()
crawler.warmup()
# Set the hooks
crawler.set_hook('on_driver_created', on_driver_created)
crawler.set_hook('before_get_url', before_get_url)
# Run the crawler
result = crawler.run(url="https://example.com")
print("Crawl result:", result)
```
In this example, we define hooks to handle driver initialization and custom headers before fetching the URL.
## Using CSS Selectors 🎯
CSS selectors allow you to target specific elements on a web page for extraction. This can be useful for scraping structured content, such as articles or product details.
Here's an example of using a CSS selector:
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
crawler.warmup()
# Run the crawler with a CSS selector to extract only H2 tags
result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
print("Extracted H2 tags:", result.extracted_content)
```
In this example, we use the `css_selector` parameter to extract only the H2 tags from the web page.
---
With these advanced features, you can leverage Crawl4AI to perform sophisticated web crawling and data extraction tasks. Whether you need to take screenshots, extract specific elements, customize the crawling process, or set custom headers, Crawl4AI provides the flexibility and power to meet your needs. Happy crawling! 🕷️🚀

@@ -0,0 +1,133 @@
## Chunking Strategies 📚
Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.
### RegexChunking
`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.
#### When to Use
- Great for structured text with consistent delimiters.
- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.
#### Parameters
- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).
#### Example
```python
from crawl4ai.chunking_strategy import RegexChunking
# Define patterns for splitting text
patterns = [r'\n\n', r'\. ']
chunker = RegexChunking(patterns=patterns)
# Sample text
text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
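To make the idea of pattern-based splitting concrete, here is a minimal standard-library sketch that applies each pattern in turn with `re.split`. It illustrates the technique only. `split_by_patterns` is a hypothetical function, and whether `RegexChunking` applies its patterns sequentially in this way is an assumption.

```python
import re
from typing import List


def split_by_patterns(text: str, patterns: List[str]) -> List[str]:
    """Split `text` on each regex pattern in turn, dropping empty pieces.

    Illustration only: the matched delimiters themselves are discarded.
    """
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    return [c.strip() for c in chunks if c.strip()]


text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
print(split_by_patterns(text, [r"\n\n", r"\. "]))
```

Note that splitting on `r'\. '` consumes the period and the space, so sentences that are followed by another sentence lose their trailing period.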
### NlpSentenceChunking
`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.
#### When to Use
- Ideal for texts where sentence boundaries are crucial.
- Useful for creating chunks that preserve grammatical structures.
#### Parameters
- None.
#### Example
```python
from crawl4ai.chunking_strategy import NlpSentenceChunking
chunker = NlpSentenceChunking()
# Sample text
text = "This is a sample text. It will be split into sentences. Here's another sentence."
# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
### TopicSegmentationChunking
`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.
#### When to Use
- Perfect for long documents with distinct topics.
- Useful when preserving topic continuity is more important than maintaining text order.
#### Parameters
- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.
#### Example
```python
from crawl4ai.chunking_strategy import TopicSegmentationChunking
chunker = TopicSegmentationChunking(num_keywords=3)
# Sample text
text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."
# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
### FixedLengthWordChunking
`FixedLengthWordChunking` splits text into chunks based on a fixed number of words. This ensures each chunk has approximately the same length.
#### When to Use
- Suitable for processing large texts where uniform chunk size is important.
- Useful when the number of words per chunk needs to be controlled.
#### Parameters
- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.
#### Example
```python
from crawl4ai.chunking_strategy import FixedLengthWordChunking
chunker = FixedLengthWordChunking(chunk_size=10)
# Sample text
text = "This is a sample text. It will be split into chunks of fixed length."
# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
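The fixed-length approach can be sketched in a few lines of plain Python. This is an illustration of the technique, assuming words are separated by whitespace. `fixed_length_word_chunks` is a hypothetical name and not the library's implementation.

```python
from typing import List


def fixed_length_word_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Group whitespace-separated words into chunks of `chunk_size` words.

    The last chunk may be shorter than `chunk_size`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


text = "This is a sample text. It will be split into chunks of fixed length."
for chunk in fixed_length_word_chunks(text, chunk_size=10):
    print(chunk)
```

The sample text has 14 words, so a chunk size of 10 yields one full chunk and one shorter remainder.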
### SlidingWindowChunking
`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.
#### When to Use
- Ideal for creating overlapping chunks to preserve context.
- Useful for tasks where context from adjacent chunks is needed.
#### Parameters
- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
- `step` (int, optional): Number of words to slide the window. Default is `50`.
#### Example
```python
from crawl4ai.chunking_strategy import SlidingWindowChunking
chunker = SlidingWindowChunking(window_size=10, step=5)
# Sample text
text = "This is a sample text. It will be split using a sliding window approach to preserve context."
# Chunk the text
chunks = chunker.chunk(text)
print(chunks)
```
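The relationship between `window_size` and `step` is easier to see in a small standalone sketch. The function below is hypothetical and written for illustration. In particular, how the library treats the final, possibly shorter window is an assumption here.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Return overlapping chunks of `window_size` words, advancing by `step` words.

    Assumption for this sketch: the last window is kept even if it is
    shorter than `window_size`, and iteration stops once a window
    reaches the end of the text.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break
    return chunks


print(sliding_window_chunks("a b c d e f g", window_size=4, step=2))
```

When `step` is smaller than `window_size`, consecutive chunks share `window_size - step` words, which is what preserves context across chunk boundaries.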
With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking! 📝✨

@@ -0,0 +1,130 @@
# Crawl Request Parameters
The `run` function in Crawl4AI is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `run` function, along with their descriptions, possible values, and examples.
## Parameters
### url (str)
**Description:** The URL of the webpage to crawl.
**Required:** Yes
**Example:**
```python
url = "https://www.nbcnews.com/business"
```
### word_count_threshold (int)
**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is `5`.
**Required:** No
**Default Value:** `5`
**Example:**
```python
word_count_threshold = 10
```
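As a rough illustration of what a word-count threshold does, the sketch below drops text blocks that contain fewer words than the threshold. `filter_blocks` is a hypothetical function, and treating the threshold as inclusive (a block with exactly that many words is kept) is an assumption.

```python
from typing import List


def filter_blocks(blocks: List[str], word_count_threshold: int = 5) -> List[str]:
    """Keep only blocks with at least `word_count_threshold` words."""
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


blocks = [
    "Home",
    "Markets rallied on Tuesday after the announcement",
    "Sign in now",
]
print(filter_blocks(blocks, word_count_threshold=5))
```

Short navigation labels and button text fall below the threshold, while the sentence-length block is kept.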
### extraction_strategy (ExtractionStrategy)
**Description:** The strategy to use for extracting content from the HTML. It must be an instance of `ExtractionStrategy`. If not provided, the default is `NoExtractionStrategy`.
**Required:** No
**Default Value:** `NoExtractionStrategy()`
**Example:**
```python
extraction_strategy = CosineStrategy(semantic_filter="finance")
```
### chunking_strategy (ChunkingStrategy)
**Description:** The strategy to use for chunking the text before processing. It must be an instance of `ChunkingStrategy`. The default value is `RegexChunking()`.
**Required:** No
**Default Value:** `RegexChunking()`
**Example:**
```python
chunking_strategy = NlpSentenceChunking()
```
### bypass_cache (bool)
**Description:** Whether to force a fresh crawl even if the URL has been previously crawled. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
bypass_cache = True
```
### css_selector (str)
**Description:** The CSS selector to target specific parts of the HTML for extraction. If not provided, the entire HTML will be processed.
**Required:** No
**Default Value:** `None`
**Example:**
```python
css_selector = "div.article-content"
```
### screenshot (bool)
**Description:** Whether to take screenshots of the page. The default value is `False`.
**Required:** No
**Default Value:** `False`
**Example:**
```python
screenshot = True
```
### user_agent (str)
**Description:** The user agent to use for the HTTP requests. If not provided, a default user agent will be used.
**Required:** No
**Default Value:** `None`
**Example:**
```python
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
```
### verbose (bool)
**Description:** Whether to enable verbose logging. The default value is `True`.
**Required:** No
**Default Value:** `True`
**Example:**
```python
verbose = True
```
### **kwargs
Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:
- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
**Example:**
```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True
)
```
## Example Usage
Here's an example of how to use the `run` function with various parameters:
```python
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler with custom parameters
result = crawler.run(
    url="https://www.nbcnews.com/business",
    word_count_threshold=10,
    extraction_strategy=CosineStrategy(semantic_filter="finance"),
    chunking_strategy=NlpSentenceChunking(),
    bypass_cache=True,
    css_selector="div.article-content",
    screenshot=True,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
    verbose=True,
    only_text=True
)
print(result)
```
This example demonstrates how to configure various parameters to customize the crawling and extraction process using Crawl4AI.

@@ -0,0 +1,120 @@
# Crawl Result
The `CrawlResult` class is the heart of Crawl4AI's output, encapsulating all the data extracted from a crawling session. This class contains various fields that store the results of the web crawling and extraction process. Let's break down each field and see what it holds. 🎉
## Class Definition
```python
from typing import Dict, List, Optional
from pydantic import BaseModel  # BaseModel is assumed to be pydantic's
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    screenshot: Optional[str] = None
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
```
## Fields Explanation
### `url: str`
The URL that was crawled. This field simply stores the URL of the web page that was processed.
### `html: str`
The raw HTML content of the web page. This is the unprocessed HTML source as retrieved by the crawler.
### `success: bool`
A flag indicating whether the crawling and extraction were successful. If any error occurs during the process, this will be `False`.
### `cleaned_html: Optional[str]`
The cleaned HTML content of the web page. This field holds the HTML after removing unwanted tags like `<script>`, `<style>`, and others that do not contribute to the useful content.
### `media: Dict[str, List[Dict]]`
A dictionary containing lists of extracted media elements from the web page. The media elements are categorized into images, videos, and audios. Here's how they are structured:
- **Images**: Each image is represented as a dictionary with `src` (source URL), `alt` (alternate text), and `type` (`"image"`).
- **Videos**: Each video is represented the same way, with `src`, `alt`, and `type` (`"video"`).
- **Audios**: Each audio is represented with `src`, `alt`, and `type` (`"audio"`).
```python
media = {
    'images': [
        {'src': 'image_url1', 'alt': 'description1', 'type': 'image'},
        {'src': 'image_url2', 'alt': 'description2', 'type': 'image'}
    ],
    'videos': [
        {'src': 'video_url1', 'alt': 'description1', 'type': 'video'}
    ],
    'audios': [
        {'src': 'audio_url1', 'alt': 'description1', 'type': 'audio'}
    ]
}
```
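Given the structure above, pulling out the image URLs is a short comprehension. The sketch below uses a hypothetical helper, `image_sources`, and tolerates a missing `images` key or entries without a `src`.

```python
from typing import Dict, List


def image_sources(media: Dict[str, List[Dict]]) -> List[str]:
    """Return the `src` of every image entry that has one."""
    return [item["src"] for item in media.get("images", []) if item.get("src")]


media = {
    "images": [
        {"src": "image_url1", "alt": "description1", "type": "image"},
        {"src": "image_url2", "alt": "description2", "type": "image"},
    ],
    "videos": [{"src": "video_url1", "alt": "description1", "type": "video"}],
    "audios": [],
}
print(image_sources(media))
```

The same pattern works for the `videos` and `audios` lists by changing the key.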
### `links: Dict[str, List[Dict]]`
A dictionary containing lists of internal and external links extracted from the web page. Each link is represented as a dictionary with `href` (URL) and `text` (link text).
- **Internal Links**: Links pointing to the same domain.
- **External Links**: Links pointing to different domains.
```python
links = {
    'internal': [
        {'href': 'internal_link1', 'text': 'link_text1'},
        {'href': 'internal_link2', 'text': 'link_text2'}
    ],
    'external': [
        {'href': 'external_link1', 'text': 'link_text1'}
    ]
}
```
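One way to use this structure for link analysis is to count how often each external domain appears. The sketch below relies only on the standard library. `external_domains` is a hypothetical helper, and it assumes the `href` values are absolute URLs (relative links have no domain and are counted under an empty string).

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlparse


def external_domains(links: Dict[str, List[Dict]]) -> Counter:
    """Count external links per domain, skipping entries without an href."""
    return Counter(
        urlparse(link["href"]).netloc
        for link in links.get("external", [])
        if link.get("href")
    )


links = {
    "internal": [{"href": "https://example.com/about", "text": "About"}],
    "external": [
        {"href": "https://a.example.org/x", "text": "X"},
        {"href": "https://a.example.org/y", "text": "Y"},
        {"href": "https://b.example.net/", "text": "B"},
    ],
}
print(external_domains(links).most_common())
```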
### `screenshot: Optional[str]`
A base64-encoded screenshot of the web page. This field stores the screenshot data if the crawling was configured to take a screenshot.
### `markdown: Optional[str]`
The content of the web page converted to Markdown format. This is useful for generating clean, readable text that retains the structure of the original HTML.
### `extracted_content: Optional[str]`
The content extracted based on the specified extraction strategy. This field holds the meaningful content blocks extracted from the web page, ready for your AI and data processing needs.
### `metadata: Optional[dict]`
A dictionary containing metadata extracted from the web page, such as title, description, keywords, and other meta tags.
### `error_message: Optional[str]`
If an error occurs during crawling, this field will contain the error message, helping you debug and understand what went wrong. 🚨
## Example Usage
Here's a quick example to illustrate how you might use the `CrawlResult` in your code:
```python
from crawl4ai import WebCrawler
# Create the WebCrawler instance
crawler = WebCrawler()
# Run the crawler on a URL
result = crawler.run(url="https://www.example.com")
# Check if the crawl was successful
if result.success:
    print("Crawl succeeded!")
    print("URL:", result.url)
    print("HTML:", result.html[:100])  # Print the first 100 characters of the HTML
    # cleaned_html and markdown are Optional, so guard against None before slicing
    print("Cleaned HTML:", (result.cleaned_html or "")[:100])
    print("Media:", result.media)
    print("Links:", result.links)
    print("Screenshot:", result.screenshot)
    print("Markdown:", (result.markdown or "")[:100])
    print("Extracted Content:", result.extracted_content)
    print("Metadata:", result.metadata)
else:
    print("Crawl failed with error:", result.error_message)
```
With this setup, you can easily access all the valuable data extracted from the web page and integrate it into your applications. Happy crawling! 🕷️🤖

@@ -0,0 +1,116 @@
## Extraction Strategies 🧠
Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into two of the most important strategies: `CosineStrategy` and `LLMExtractionStrategy`.
### CosineStrategy
`CosineStrategy` uses hierarchical clustering based on cosine similarity to group text chunks into meaningful clusters. It converts each chunk into an embedding and then clusters the embeddings to form semantically coherent chunks.
#### When to Use
- Ideal for fast, accurate semantic segmentation of text.
- Perfect for scenarios where LLMs might be overkill or too slow.
- Suitable for narrowing down content based on specific queries or keywords.
#### Parameters
- `semantic_filter` (str, optional): Keywords for filtering relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
- `max_dist` (float, optional): Maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
- `linkage_method` (str, optional): Linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
- `model_name` (str, optional): Model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
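To illustrate the filtering step described for `semantic_filter`, the sketch below computes cosine similarity between toy vectors and keeps the documents that are close enough to a query vector. It is an illustration of the measure only: the two-dimensional vectors stand in for real embeddings, and `filter_by_similarity` and its `threshold` argument are hypothetical, not part of Crawl4AI.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    doc_vectors: List[Sequence[float]],
    query_vector: Sequence[float],
    threshold: float,
) -> List[int]:
    """Return indices of documents whose similarity to the query meets the threshold."""
    return [
        i
        for i, vec in enumerate(doc_vectors)
        if cosine_similarity(vec, query_vector) >= threshold
    ]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
print(filter_by_similarity(docs, query, threshold=0.5))
```

Cosine similarity depends only on direction, not magnitude, which is why it is a common choice for comparing text embeddings.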
#### Example
```python
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai import WebCrawler
crawler = WebCrawler()
crawler.warmup()
# Define extraction strategy
strategy = CosineStrategy(
    semantic_filter="finance economy stock market",
    word_count_threshold=10,
    max_dist=0.2,
    linkage_method='ward',
    top_k=3,
    model_name='BAAI/bge-small-en-v1.5'
)
# Sample URL
url = "https://www.nbcnews.com/business"
# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```
### LLMExtractionStrategy
`LLMExtractionStrategy` leverages a Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions.
#### When to Use
- Suitable for complex extraction tasks requiring nuanced understanding.
- Ideal for scenarios where detailed instructions can guide the extraction process.
- Perfect for extracting specific types of information or content with precise guidelines.
#### Parameters
- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
- `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
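The token lookup described for `api_token` can be mirrored in your own code so that keys stay out of source files. The sketch below is a hypothetical helper, `resolve_api_token`, that prefers an explicit token and falls back to the `OPENAI_API_KEY` environment variable. Raising an error when neither is set is a choice made for this sketch, not documented library behavior.

```python
import os
from typing import Optional


def resolve_api_token(api_token: Optional[str] = None) -> str:
    """Return the explicit token if given, else the OPENAI_API_KEY env variable."""
    token = api_token or os.environ.get("OPENAI_API_KEY")
    if not token:
        raise ValueError("No API token: pass api_token or set OPENAI_API_KEY")
    return token
```

The returned value can then be passed as `api_token=resolve_api_token()` when constructing the strategy.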
#### Example Without Instructions
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler
crawler = WebCrawler()
crawler.warmup()
# Define extraction strategy without instructions
strategy = LLMExtractionStrategy(
    provider='openai/gpt-4',  # provider/model format, as in the parameter description
    api_token='your_api_token'
)
# Sample URL
url = "https://www.nbcnews.com/business"
# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```
#### Example With Instructions
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import WebCrawler
crawler = WebCrawler()
crawler.warmup()
# Define extraction strategy with instructions
strategy = LLMExtractionStrategy(
    provider='openai/gpt-4',  # provider/model format, as in the parameter description
    api_token='your_api_token',
    instruction="Extract only financial news and summarize key points."
)
# Sample URL
url = "https://www.nbcnews.com/business"
# Run the crawler with the extraction strategy
result = crawler.run(url=url, extraction_strategy=strategy)
print(result.extracted_content)
```
#### Use Cases for LLMExtractionStrategy
- Extracting specific data types from structured or semi-structured content.
- Generating summaries, extracting key information, or transforming content into different formats.
- Performing detailed extractions based on custom instructions.
For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation.
---
By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy` or nuanced, instruction-based extraction with `LLMExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨