Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.
This commit is contained in:
UncleCode
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions

View File

@@ -45,13 +45,15 @@ if __name__ == "__main__":
### New Code (Recommended)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode # Import CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig
async def use_proxy():
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) # Use CacheMode in CrawlerRunConfig
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
cache_mode=CacheMode.BYPASS # New way
config=config # Pass the configuration object
)
print(len(result.markdown))
@@ -64,12 +66,12 @@ if __name__ == "__main__":
## Common Migration Patterns
| Old Flag               | New Mode                          |
|------------------------|-----------------------------------|
| `bypass_cache=True`    | `cache_mode=CacheMode.BYPASS`     |
| `disable_cache=True`   | `cache_mode=CacheMode.DISABLED`   |
| `no_cache_read=True`   | `cache_mode=CacheMode.WRITE_ONLY` |
| `no_cache_write=True`  | `cache_mode=CacheMode.READ_ONLY`  |
## Suppressing Deprecation Warnings
If you need time to migrate, you can temporarily suppress deprecation warnings:
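For example, a minimal sketch using Python's standard `warnings` module (the exact warning category Crawl4AI emits is an assumption here; adjust the filter to match it):
```python
import warnings

# Temporarily silence deprecation warnings while you migrate to the new config API
warnings.filterwarnings("ignore", category=DeprecationWarning)
```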

View File

@@ -1,68 +1,58 @@
### Content Selection
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
#### CSS Selectors
Extract specific content using a `CrawlerRunConfig` with CSS selectors:
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(css_selector=".main-article")  # Target main article content
result = await crawler.arun(url="https://crawl4ai.com", config=config)

# Multiple selectors
config = CrawlerRunConfig(css_selector="article h1, article .content")  # Target heading and content
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
#### Content Filtering
Control content inclusion or exclusion with `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    word_count_threshold=10,                             # Minimum words per block
    excluded_tags=['form', 'header', 'footer', 'nav'],   # Excluded tags
    exclude_external_links=True,                         # Remove external links
    exclude_social_media_links=True,                     # Remove social media links
    exclude_external_images=True                         # Remove external images
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
#### Iframe Content
Process iframe content by enabling specific options in `CrawlerRunConfig`:
```python
config = CrawlerRunConfig(
    process_iframes=True,          # Extract iframe content
    remove_overlay_elements=True   # Remove popups/modals that might block iframes
)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```
#### Structured Content Selection Using LLMs
Leverage LLMs for intelligent content extraction:
```python
import json
from pydantic import BaseModel
from typing import List
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class ArticleContent(BaseModel):
    title: str
@@ -70,28 +60,27 @@ class ArticleContent(BaseModel):
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```
#### Pattern-Based Selection
Extract content that follows repetitive patterns, such as product listings or news feeds:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # Repeated element
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
@@ -108,51 +97,19 @@ schema = {
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```
#### Domain-Based Filtering
Control content based on domains:
```python
result = await crawler.arun(
    url="https://example.com",
    exclude_domains=["ads.com", "tracker.com"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Custom social media domains to exclude
    exclude_social_media_links=True
)
```
#### Media Selection
Select specific types of media:
```python
result = await crawler.arun(url="https://example.com")

# Access different media types
images = result.media["images"]   # List of image details
videos = result.media["videos"]   # List of video details
audios = result.media["audios"]   # List of audio details

# Image with metadata
for image in images:
    print(f"URL: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Relevance score: {image['score']}")
```
#### Comprehensive Example
Combine different selection methods using `CrawlerRunConfig`:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
@@ -163,37 +120,16 @@ async def extract_article_content(url: str):
{"name": "content", "selector": ".content", "type": "text"}
]
}
    # Define configuration
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```

View File

@@ -1,136 +1,83 @@
# Content Filtering in Crawl4AI
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
## Relevance Content Filter
The `RelevantContentFilter` is an abstract class providing a common interface for content filtering strategies. Specific algorithms, such as `PruningContentFilter` and `BM25ContentFilter`, inherit from it and implement the `filter_content` method, which takes HTML content as input and returns a list of filtered text blocks.
## Pruning Content Filter
The `PruningContentFilter` is a tree-shaking algorithm that removes less relevant DOM nodes based on metrics like text density, link density, and tag importance. Nodes that fall below a defined threshold are pruned, leaving only high-value content.
### Usage
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

async def filter_content(url):
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            content_filter=PruningContentFilter(
                min_word_threshold=5,
                threshold_type='dynamic',
                threshold=0.45
            ),
            fit_markdown=True  # Activates markdown fitting
        )
        result = await crawler.arun(url=url, config=config)
        if result.success:
            print(f"Cleaned Markdown:\n{result.fit_markdown}")
```
### Parameters
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
  - `'fixed'`: Uses a constant threshold value for all nodes.
  - `'dynamic'`: Adjusts thresholds based on node properties (e.g., tag importance, text/link ratios).
- **`threshold`**: (Optional, default 0.48) Base threshold for pruning (see the example below):
  - Fixed: Nodes scoring below this value are removed.
  - Dynamic: This value adjusts based on node characteristics.
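For instance, a fixed-threshold setup might look like the sketch below; the values are illustrative, not recommendations:
```python
from crawl4ai.content_filter_strategy import PruningContentFilter

# Fixed mode: every node is compared against the same base threshold
fixed_filter = PruningContentFilter(
    min_word_threshold=5,
    threshold_type='fixed',
    threshold=0.48
)
```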
### How It Works
The algorithm evaluates each node using:
- **Text density**: Ratio of text to overall content.
- **Link density**: Proportion of text within links.
- **Tag importance**: Weights based on HTML tag type (e.g., `<article>`, `<p>`, `<div>`).
- **Content quality**: Metrics like text length and structural importance.

This makes it particularly effective at removing boilerplate, navigation menus, and sidebars while preserving the main article content and overall document structure.
## BM25 Algorithm
The `BM25ContentFilter` uses the BM25 ranking function from information retrieval to identify and extract the text chunks most relevant to a user-specified query or to the page's metadata.
### Usage
To use the `BM25ContentFilter`, pass it as the `content_filter` of a `CrawlerRunConfig`:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter

config = CrawlerRunConfig(
    content_filter=BM25ContentFilter(user_query="fruit nutrition health"),
    fit_markdown=True  # Activates markdown fitting
)

result = await crawler.arun(url="https://example.com", config=config)
if result.success:
    print(f"Filtered Content:\n{result.extracted_content}")
    print(f"\nFiltered Markdown:\n{result.fit_markdown}")
    print(f"\nFiltered HTML:\n{result.fit_html}")
else:
    print("Error:", result.error_message)
```
### Parameters
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts metadata (title, description, keywords) from the page and uses it as the query.
- **`bm25_threshold`**: (Optional, default 1.0) Threshold controlling relevance (illustrated below):
  - Higher values return stricter, more relevant results.
  - Lower values allow more lenient filtering.
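A short sketch of both modes described above; the query string and threshold value are illustrative:
```python
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Explicit query with a stricter threshold
strict_filter = BM25ContentFilter(user_query="fruit nutrition health", bm25_threshold=1.2)

# No query: the page's own metadata (title, description, keywords) is used instead
metadata_filter = BM25ContentFilter()
```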
## Fit Markdown Flag
Setting `fit_markdown=True` in `CrawlerRunConfig` activates content filtering during the crawl. It instructs the scraper to extract and clean the HTML, primarily to prepare content for large language models that cannot process large inputs. Enabling it improves the quality of the extracted content and adds two attributes to the returned `CrawlResult`: `fit_markdown` and `fit_html`.
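For example, assuming a `config` with a content filter and `fit_markdown=True` as in the snippets above:
```python
result = await crawler.arun(url="https://example.com", config=config)
if result.success:
    print(result.fit_markdown[:500])  # Filtered markdown
    print(result.fit_html[:500])      # Filtered HTML
```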
## Custom Content Filtering Strategies
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import RelevantContentFilter
from bs4 import BeautifulSoup, Tag
from typing import List

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        # Implement custom filtering logic here
        # Example: extract all paragraphs within divs with class "article-body"
        filtered_paragraphs = []
        for tag in soup.select("div.article-body p"):
            if isinstance(tag, Tag):
                filtered_paragraphs.append(str(tag))  # Add the cleaned HTML element
        return filtered_paragraphs

async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)
```
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
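As a further illustration, here is a hedged sketch of a regex-based variant; the `KeywordRegexFilter` class and its pattern are hypothetical, not part of Crawl4AI:
```python
import re
from typing import List
from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter

KEYWORD_PATTERN = re.compile(r"nutrition|health|vitamin", re.IGNORECASE)  # hypothetical pattern

class KeywordRegexFilter(RelevantContentFilter):
    """Keeps only paragraphs whose text matches KEYWORD_PATTERN (illustrative only)."""
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        return [str(p) for p in soup.find_all('p') if KEYWORD_PATTERN.search(p.get_text())]
```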
## Conclusion
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `PruningContentFilter`, `BM25ContentFilter`, or your own custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.

View File

@@ -1,124 +1,109 @@
# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter in a `BrowserConfig` object and pass it to the crawler.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...
        pass

asyncio.run(main())
```
Or, enable downloads for a specific crawl using `CrawlerRunConfig`:
```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...
```
## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import os
from crawl4ai.async_configs import BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
# ...
```
When setting it globally, pass the configured `BrowserConfig` to the `AsyncWebCrawler`:
```python
async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...
```
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files
The `downloaded_files` attribute of the returned `CrawlResult` object is a list of strings, each holding the absolute path to a downloaded file.
```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import asyncio
import os
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                // Trigger multiple downloads (example)
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000));  // Delay between clicks
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
@@ -126,23 +111,19 @@ async def download_multiple_files(url: str, download_path: str):
print(f"- {file}")
else:
print("No files downloaded.")
# Example usage
# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True) # Create directory if it doesn't exist
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully (see the sketch below).
- **Security:** Scan downloaded files for potential security threats before use.
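For example, a defensive sketch around the download results (the checks shown are assumptions about your environment, not Crawl4AI APIs):
```python
import os

for file_path in (result.downloaded_files or []):
    try:
        if not os.path.isfile(file_path):
            print(f"Missing or moved file: {file_path}")
            continue
        print(f"{file_path}: {os.path.getsize(file_path)} bytes")
        # Run the file through your own malware/virus scanning step before using it
    except OSError as err:
        print(f"Could not inspect {file_path}: {err}")
```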
This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.

View File

@@ -1,6 +1,6 @@
# Output Formats
Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data via LLM or pattern-based extraction, plus versatile markdown outputs.
## Basic Formats
@@ -8,18 +8,20 @@ Crawl4AI provides multiple output formats to suit different needs, from raw HTML
```python
result = await crawler.arun(url="https://example.com")

# Access different formats
raw_html = result.html                       # Original HTML
clean_html = result.cleaned_html             # Sanitized HTML
markdown_v2 = result.markdown_v2             # Detailed markdown generation results
fit_md = result.markdown_v2.fit_markdown     # Most relevant content in markdown
```
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.
## Raw HTML
Original, unmodified HTML from the webpage. Useful when you need to:
- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.
```python
result = await crawler.arun(url="https://example.com")
```
@@ -29,167 +31,72 @@ print(result.html) # Complete HTML including headers, scripts, etc.
## Cleaned HTML
Sanitized HTML with unnecessary elements removed. Automatically:
- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.
```python
config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False                   # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```
## Standard Markdown
HTML converted to clean markdown format. This output is useful for:
- Content analysis.
- Documentation.
- Readability.
```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```
## Fit Markdown
Extract and convert only the most relevant content into markdown format. Best suited for:
- Article extraction.
- Focusing on the main content.
- Removing boilerplate.

To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
```python
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```
## Markdown with Citations
Generate markdown that includes citations for links. This format is ideal for:
- Creating structured documentation.
- Including references for extracted content.
```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```
## Structured Data Extraction
Crawl4AI offers two powerful approaches for structured data extraction:
### 1. LLM-Based Extraction
Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
```python
from pydantic import BaseModel
from typing import List
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class KnowledgeGraph(BaseModel):
    entities: List[dict]
    relationships: List[dict]

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",   # or "huggingface/...", "openai/..."
    api_token="your-token",       # not needed for Ollama
    schema=KnowledgeGraph.schema(),
    instruction="Extract entities and relationships from the content"
)
result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)
knowledge_graph = json.loads(result.extracted_content)
```
### 2. Pattern-Based Extraction
For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",  # Repeated element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"}
    ]
}

strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
)
products = json.loads(result.extracted_content)
```
## Content Customization
### HTML to Text Options
Configure markdown conversion:
```python
result = await crawler.arun(
    url="https://example.com",
    html2text={
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
)
```
### Content Filters
Control what content is included:
```python
result = await crawler.arun(
    url="https://example.com",
    word_count_threshold=10,        # Minimum words per block
    exclude_external_links=True,    # Remove external links
    exclude_external_images=True,   # Remove external images
    excluded_tags=['form', 'nav']   # Remove specific HTML tags
)
```
## Comprehensive Example
Here's how to use multiple output formats together:
```python
async def crawl_content(url: str):
    async with AsyncWebCrawler() as crawler:
        # Extract main content with fit markdown
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,
            exclude_external_links=True
        )
        # Get structured data using LLM
        llm_result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/nemotron",
                schema=YourSchema.schema(),
                instruction="Extract key information"
            )
        )
        # Get repeated patterns (if any)
        pattern_result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(your_schema)
        )
        return {
            "main_content": result.fit_markdown,
            "structured_data": json.loads(llm_result.extracted_content),
            "pattern_data": json.loads(pattern_result.extracted_content),
            "media": result.media
        }
```

View File

@@ -7,11 +7,13 @@ Crawl4AI provides powerful features for interacting with dynamic webpages, handl
### Basic Execution
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Single JavaScript command
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)
result = await crawler.arun(url="https://example.com", config=config)
# Multiple commands
js_commands = [
@@ -19,10 +21,8 @@ js_commands = [
"document.querySelector('.load-more').click();",
"document.querySelector('#consent-button').click();"
]
config = CrawlerRunConfig(js_code=js_commands)
result = await crawler.arun(url="https://example.com", config=config)
```
## Wait Conditions
@@ -32,10 +32,8 @@ result = await crawler.arun(
Wait for elements to appear:
```python
config = CrawlerRunConfig(wait_for="css:.dynamic-content")  # Wait for element with class 'dynamic-content'
result = await crawler.arun(url="https://example.com", config=config)
```
### JavaScript-Based Waiting
@@ -48,10 +46,8 @@ wait_condition = """() => {
return document.querySelectorAll('.item').length > 10;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
result = await crawler.arun(url="https://example.com", config=config)
# Wait for dynamic content to load
wait_for_content = """() => {
@@ -59,10 +55,8 @@ wait_for_content = """() => {
return content && content.innerText.length > 100;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
result = await crawler.arun(url="https://example.com", config=config)
```
## Handling Dynamic Content
@@ -72,18 +66,14 @@ result = await crawler.arun(
Handle infinite scroll or load more buttons:
```python
# Scroll and wait pattern
config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
        "const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();"  # Click load more
    ],
    wait_for="js:() => document.querySelectorAll('.item').length > previousCount"  # Wait for new content
)
result = await crawler.arun(url="https://example.com", config=config)
```
### Form Interaction
@@ -92,17 +82,15 @@ Handle forms and inputs:
```python
js_form_interaction = """
document.querySelector('#search').value = 'search term';  // Fill form fields
document.querySelector('form').submit();                  // Submit form
"""
config = CrawlerRunConfig(
    js_code=js_form_interaction,
    wait_for="css:.results"  # Wait for results to load
)
result = await crawler.arun(url="https://example.com", config=config)
```
## Timing Control
@@ -112,11 +100,11 @@ result = await crawler.arun(
Control timing of interactions:
```python
config = CrawlerRunConfig(
    page_timeout=60000,            # Page load timeout (ms)
    delay_before_return_html=2.0   # Wait before capturing content
)
result = await crawler.arun(url="https://example.com", config=config)
```
## Complex Interactions Example
@@ -124,43 +112,37 @@ result = await crawler.arun(
Here's an example of handling a dynamic page with multiple interactions:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        # Initial page load
        config = CrawlerRunConfig(
            js_code="document.querySelector('.cookie-accept')?.click();",  # Handle cookie consent
            wait_for="css:.main-content"
        )
        result = await crawler.arun(url="https://example.com", config=config)

        # Load more content
        session_id = "dynamic_session"  # Keep session for multiple interactions
        for page in range(3):  # Load 3 pages of content
            config = CrawlerRunConfig(
                session_id=session_id,
                js_code=[
                    "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
                    "window.previousCount = document.querySelectorAll('.item').length;",  # Store item count
                    "document.querySelector('.load-more')?.click();"  # Click load more
                ],
                wait_for="""() => {
                    const currentCount = document.querySelectorAll('.item').length;
                    return currentCount > window.previousCount;
                }""",
                js_only=(page > 0)  # Execute JS without reloading page for subsequent interactions
            )
            result = await crawler.arun(url="https://example.com", config=config)
            print(f"Page {page + 1} items:", len(result.cleaned_html))

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
```
@@ -171,6 +153,7 @@ Combine page interaction with structured extraction:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
from crawl4ai.async_configs import CrawlerRunConfig
# Pattern-based extraction after interaction
schema = {
@@ -182,20 +165,19 @@ schema = {
]
}
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.item:nth-child(10)",  # Wait for 10 items
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
result = await crawler.arun(url="https://example.com", config=config)
# Or use LLM to analyze dynamic content
class ContentAnalysis(BaseModel):
    topics: List[str]
    summary: str

config = CrawlerRunConfig(
    js_code="document.querySelector('.show-more').click();",
    wait_for="css:.full-content",
    extraction_strategy=LLMExtractionStrategy(
@@ -204,4 +186,5 @@ result = await crawler.arun(
instruction="Analyze the full content"
)
)
```
result = await crawler.arun(url="https://example.com", config=config)
```

View File

@@ -2,31 +2,19 @@
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
---
## Crawling a Web URL
To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
@@ -36,20 +24,22 @@ async def crawl_web():
asyncio.run(crawl_web())
```
## Crawling a Local HTML File
To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
@@ -59,20 +49,22 @@ async def crawl_local_file():
asyncio.run(crawl_local_file())
```
## Crawling Raw HTML Content
To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_html_url, config=config)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)

asyncio.run(crawl_raw_html())
```
@@ -84,152 +76,83 @@ asyncio.run(crawl_raw_html())
---
## Complete Example
Below is a comprehensive script that:
1. Crawls the Wikipedia page for "Apple."
2. Saves the HTML content to a local file (`apple.html`).
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
4. Crawls the raw HTML content from the saved file and verifies consistency.
```python
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"

    # Save the file in the same directory as the script
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(bypass_cache=True)
        result = await crawler.arun(url=wikipedia_url, config=web_config)
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        # Save the HTML content to a local file
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        print(f"Saved HTML content to {html_file_path}")
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(bypass_cache=True)
        local_result = await crawler.arun(url=file_url, config=file_config)
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        local_crawl_length = len(local_result.markdown)
        print(f"Length of markdown from local file crawl: {local_crawl_length}")
        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and local file crawl.\n")

        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(bypass_cache=True)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        raw_crawl_length = len(raw_result.markdown)
        print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")
        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and raw HTML crawl.\n")

        print("All tests passed successfully!")

    # Clean up by removing the saved HTML file
    if html_file_path.exists():
        os.remove(html_file_path)
        print(f"Removed the saved HTML file: {html_file_path}")

if __name__ == "__main__":
    asyncio.run(main())
```
### **How It Works**
1. **Step 1: Crawl the Web URL**
- Crawls `https://en.wikipedia.org/wiki/apple`.
- Saves the HTML content to `apple.html`.
- Records the length of the generated markdown.
2. **Step 2: Crawl from the Local HTML File**
- Uses the `file://` prefix to crawl `apple.html`.
- Ensures the markdown length matches the original web crawl.
3. **Step 3: Crawl Using Raw HTML Content**
- Reads the HTML from `apple.html`.
- Prefixes it with `raw:` and crawls.
- Verifies the markdown length matches the previous results.
4. **Cleanup**
- Deletes the `apple.html` file after testing.
### **Running the Example**
1. **Save the Script:**
- Save the above code as `test_crawl4ai.py` in your project directory.
2. **Execute the Script:**
- Run the script using:
```bash
python test_crawl4ai.py
```
3. **Observe the Output:**
- The script will print logs detailing each step.
- Assertions ensure consistency across different crawling methods.
- Upon success, it confirms that all markdown lengths match.
---
## Conclusion
With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly crawl web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration across all scenarios.

View File

@@ -1,49 +1,66 @@
# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟
---
## Getting Started 🛠️
Set up your environment with `BrowserConfig` and create an `AsyncWebCrawler` instance. The async context manager handles the crawler's setup and teardown for you.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Add your crawling logic here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Basic Usage
Provide a URL and let Crawl4AI do the work!
```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Taking Screenshots 📸
Capture and save webpage screenshots with `CrawlerRunConfig`:
```python
from crawl4ai import CacheMode

async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url=url,
        screenshot=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        if result.success and result.screenshot:
            import base64
@@ -55,243 +72,101 @@ async def capture_and_save_screenshot(url: str, output_path: str):
print("Failed to capture screenshot")
```
---
### Browser Selection 🌐
Choose from multiple browser engines using `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig
# Use Firefox
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
async with AsyncWebCrawler(config=firefox_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use WebKit
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
async with AsyncWebCrawler(config=webkit_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use Chromium (default)
chromium_config = BrowserConfig(verbose=True, headless=True)
async with AsyncWebCrawler(config=chromium_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
```
---
### User Simulation 🎭
Simulate real user behavior to bypass detection:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(verbose=True, headless=True)
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    cache_mode=CacheMode.BYPASS,
    simulate_user=True,        # Random mouse movements and clicks
    override_navigator=True    # Makes the browser appear like a real user
)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```
---
### Understanding Parameters 🧠
By default, Crawl4AI caches the results of your crawls, so subsequent crawls of the same URL are much faster. The example below runs a cached crawl and then forces a fresh one:
```python
async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl (caches the result)
        result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force fresh crawl
        result2 = await crawler.arun(
            config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
        )
        print(f"Second crawl result: {result2.markdown[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Adding a Chunking Strategy 🧩
Split content into chunks with `RegexChunking`, which splits text based on a given regex pattern:
```python
from crawl4ai.chunking_strategy import RegexChunking
async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Using LLMExtractionStrategy with Different Providers 🤖
Crawl4AI supports multiple LLM providers for extraction:
```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

# OpenAI
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

# Hugging Face
await extract_structured_data_using_llm(
    "huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    os.getenv("HUGGINGFACE_API_KEY")
)

# Ollama
await extract_structured_data_using_llm("ollama/llama3.2")

# With custom headers
custom_headers = {
    "Authorization": "Bearer your-custom-token",
    "X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
```
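The calls above rely on a helper, `extract_structured_data_using_llm`, that is defined elsewhere in the examples. A rough sketch of such a helper is shown below; the target URL, default provider, and the way extra headers are forwarded are assumptions rather than the library's documented API, and it reuses the `OpenAIModelFee` model from the snippet above:

```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_structured_data_using_llm(provider: str = "openai/gpt-4o",
                                            api_token: str = None,
                                            extra_headers: dict = None):
    # Hypothetical helper: crawl a pricing page and extract model fees with the chosen provider.
    strategy = LLMExtractionStrategy(
        provider=provider,
        api_token=api_token,
        schema=OpenAIModelFee.model_json_schema(),  # model defined in the snippet above
        extraction_type="schema",
        instruction="Extract every model name together with its input and output token fees.",
        # Forwarding of extra_headers to the underlying LLM call is assumed, not shown here.
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",  # assumed example target
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=strategy,
        )
        print(result.extracted_content[:300])

if __name__ == "__main__":
    asyncio.run(extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY")))
```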
### Knowledge Graph Generation 🕸️
Generate knowledge graphs from web content:
```python
import os
from typing import List

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

extraction_strategy = LLMExtractionStrategy(
    provider='openai/gpt-4o-mini',
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="Extract entities and relationships from the given text."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://paulgraham.com/love.html",
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )
```
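Continuing from the block above, the raw extraction can be turned back into typed objects by validating it with the same Pydantic model. This is a sketch that assumes the strategy returned the JSON the schema asked for (it may come back as a list of objects):

```python
import json

# result.extracted_content is a JSON string; depending on the strategy it may be a
# single object or a list of objects, so handle both cases (an assumption, not guaranteed).
data = json.loads(result.extracted_content)
payload = data[0] if isinstance(data, list) else data

graph = KnowledgeGraph.model_validate(payload)
print(f"Extracted {len(graph.entities)} entities and {len(graph.relationships)} relationships")
```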
### Advanced Session-Based Crawling with Dynamic Content 🔄
For modern web applications with dynamic content loading, here's how to handle pagination and content updates:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS,
                headless=False,
            )

        await crawler.crawler_strategy.kill_session(session_id)
```
### Handling Overlays and Fitting Content 📏
Remove overlay elements and fit content appropriately:
```python
async with AsyncWebCrawler(headless=False) as crawler:
    result = await crawler.arun(
        url="your-url-here",
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        remove_overlay_elements=True,
        screenshot=True
    )
```
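As a follow-up to the example above, the requested screenshot can be written to disk; this assumes `result.screenshot` holds the capture as a base64-encoded string:

```python
import base64

# Save the capture, assuming result.screenshot is a base64-encoded image string
if result.screenshot:
    with open("page_screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```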
## Performance Comparison 🏎️
Crawl4AI offers impressive performance compared to other solutions:
```python
import os
import time

# Firecrawl comparison
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
    'https://www.nbcnews.com/business',
    params={'formats': ['markdown', 'html']}
)
end = time.time()

# Crawl4AI comparison
async with AsyncWebCrawler() as crawler:
    start = time.time()
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        word_count_threshold=0,
        cache_mode=CacheMode.BYPASS,
        verbose=False,
    )
    end = time.time()
```
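To make the comparison visible, the two measurements can be wrapped into one self-contained script that prints both elapsed times. This sketch reuses the URL and parameters from the block above; the absolute numbers will vary with network conditions:

```python
import asyncio
import os
import time

from crawl4ai import AsyncWebCrawler, CacheMode
from firecrawl import FirecrawlApp

async def compare():
    # Time Firecrawl
    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
    start = time.time()
    app.scrape_url('https://www.nbcnews.com/business', params={'formats': ['markdown', 'html']})
    firecrawl_elapsed = time.time() - start

    # Time Crawl4AI
    async with AsyncWebCrawler() as crawler:
        start = time.time()
        await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=0,
            cache_mode=CacheMode.BYPASS,
            verbose=False,
        )
        crawl4ai_elapsed = time.time() - start

    print(f"Firecrawl: {firecrawl_elapsed:.2f}s")
    print(f"Crawl4AI:  {crawl4ai_elapsed:.2f}s")

if __name__ == "__main__":
    asyncio.run(compare())
```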
Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.
## Congratulations! 🎉
You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️
Happy crawling! 🚀
For advanced examples (LLM strategies, knowledge graphs, pagination handling), ensure all code aligns with the `BrowserConfig` and `CrawlerRunConfig` pattern shown above.
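For instance, the overlay-removal example above could be restated in the config style roughly as follows; whether `CrawlerRunConfig` accepts `screenshot` directly is an assumption, while the remaining options mirror the complete example later in this commit:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        remove_overlay_elements=True,
        screenshot=True,  # assumed to be accepted here as well
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="your-url-here", config=run_config)
        print(result.success)

if __name__ == "__main__":
    asyncio.run(main())
```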
View File
@@ -4,16 +4,21 @@ This guide covers the basics of web crawling with Crawl4AI. You'll learn how to
## Basic Usage
Here's the simplest way to crawl a webpage:
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async def main():
async with AsyncWebCrawler() as crawler:
browser_config = BrowserConfig() # Default browser configuration
run_config = CrawlerRunConfig() # Default crawl run configuration
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com"
url="https://example.com",
config=run_config
)
print(result.markdown) # Print clean markdown content
@@ -26,7 +31,10 @@ if __name__ == "__main__":
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
result = await crawler.arun(url="https://example.com", fit_markdown=True)
result = await crawler.arun(
url="https://example.com",
config=CrawlerRunConfig(fit_markdown=True)
)
# Different content formats
print(result.html) # Raw HTML
@@ -45,16 +53,20 @@ print(result.links) # Dictionary of internal and external links
## Adding Basic Options
Customize your crawl with these common options:
Customize your crawl using `CrawlerRunConfig`:
```python
result = await crawler.arun(
url="https://example.com",
run_config = CrawlerRunConfig(
word_count_threshold=10, # Minimum words per content block
exclude_external_links=True, # Remove external links
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True # Process iframe content
)
result = await crawler.arun(
url="https://example.com",
config=run_config
)
```
## Handling Errors
@@ -62,7 +74,9 @@ result = await crawler.arun(
Always check if the crawl was successful:
```python
result = await crawler.arun(url="https://example.com")
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)
if not result.success:
print(f"Crawl failed: {result.error_message}")
print(f"Status code: {result.status_code}")
@@ -70,36 +84,45 @@ if not result.success:
## Logging and Debugging
Enable verbose mode for detailed logging:
Enable verbose logging in `BrowserConfig`:
```python
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://example.com")
browser_config = BrowserConfig(verbose=True)
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example
Here's a more comprehensive example showing common usage patterns:
Here's a more comprehensive example demonstrating common usage patterns:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
# Content filtering
word_count_threshold=10,
excluded_tags=['form', 'header'],
exclude_external_links=True,
# Content processing
process_iframes=True,
remove_overlay_elements=True,
# Cache control
cache_mode=CacheMode.ENABLED # Use cache if available
)
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
# Content filtering
word_count_threshold=10,
excluded_tags=['form', 'header'],
exclude_external_links=True,
# Content processing
process_iframes=True,
remove_overlay_elements=True,
# Cache control
cache_mode=CacheMode.ENABLE # Use cache if available
config=run_config
)
if result.success: