Merge next + resolve conflicts

This commit is contained in:
Aravind Karnam
2025-04-23 19:44:50 +05:30
137 changed files with 36997 additions and 8947 deletions

74
docs/md_v2/core/ask-ai.md Normal file
View File

@@ -0,0 +1,74 @@
<div class="ask-ai-container">
<iframe id="ask-ai-frame" src="../../ask_ai/index.html" width="100%" style="border:none; display: block;" title="Crawl4AI Assistant"></iframe>
</div>
<script>
// Iframe height adjustment
function resizeAskAiIframe() {
const iframe = document.getElementById('ask-ai-frame');
if (iframe) {
const headerHeight = parseFloat(getComputedStyle(document.documentElement).getPropertyValue('--header-height') || '55');
// Footer is removed by JS below, so calculate height based on header + small buffer
const topOffset = headerHeight + 20; // Header + buffer/margin
const availableHeight = window.innerHeight - topOffset;
iframe.style.height = Math.max(600, availableHeight) + 'px'; // Min height 600px
}
}
// Run immediately and on resize/load
resizeAskAiIframe(); // Initial call
let resizeTimer;
window.addEventListener('load', resizeAskAiIframe);
window.addEventListener('resize', () => {
clearTimeout(resizeTimer);
resizeTimer = setTimeout(resizeAskAiIframe, 150);
});
// Remove Footer & HR from parent page (DOM Ready might be safer)
document.addEventListener('DOMContentLoaded', () => {
setTimeout(() => { // Add slight delay just in case elements render slowly
const footer = window.parent.document.querySelector('footer'); // Target parent document
if (footer) {
const hrBeforeFooter = footer.previousElementSibling;
if (hrBeforeFooter && hrBeforeFooter.tagName === 'HR') {
hrBeforeFooter.remove();
}
footer.remove();
// Trigger resize again after removing footer
resizeAskAiIframe();
} else {
console.warn("Ask AI Page: Could not find footer in parent document to remove.");
}
}, 100); // Shorter delay
});
</script>
<style>
#terminal-mkdocs-main-content {
padding: 0 !important;
margin: 0;
width: 100%;
height: 100%;
overflow: hidden; /* Prevent body scrollbars, panels handle scroll */
}
/* Ensure iframe container takes full space */
#terminal-mkdocs-main-content .ask-ai-container {
/* Remove negative margins if footer removal handles space */
margin: 0;
padding: 0;
max-width: none;
/* Let the JS set the height */
/* height: 600px; Initial fallback height */
overflow: hidden; /* Hide potential overflow before JS resize */
}
/* Hide title/paragraph if they were part of the markdown */
/* Alternatively, just remove them from the .md file directly */
/* #terminal-mkdocs-main-content > h1,
#terminal-mkdocs-main-content > p:first-of-type {
display: none;
} */
</style>

View File

@@ -1,9 +1,9 @@
# Browser, Crawler & LLM Configuration (Quick Overview)
Crawl4AIs flexibility stems from two key classes:
Crawl4AI's flexibility stems from two key classes:
1. **`BrowserConfig`** Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
1. **`BrowserConfig`** Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
3. **`LLMConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
@@ -36,18 +36,16 @@ class BrowserConfig:
### Key Fields to Note
1. **`browser_type`**
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
2. **`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
3. **`proxy_config`**
- A dictionary with fields like:
```json
{
@@ -58,31 +56,31 @@ class BrowserConfig:
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:
4. **`viewport_width` & `viewport_height`**:
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
5. **`verbose`**:
- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:
6. **`use_persistent_context`**:
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
7. **`cookies`** & **`headers`**:
- If you want to start with specific cookies or add universal HTTP headers, set them here.
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
8. **`user_agent`**:
- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
9. **`text_mode`** & **`light_mode`**:
9. **`text_mode`** & **`light_mode`**:
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
10. **`extra_args`**:
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
@@ -136,6 +134,12 @@ class CrawlerRunConfig:
wait_for=None,
screenshot=False,
pdf=False,
capture_mhtml=False,
# Location and Identity Parameters
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
@@ -151,58 +155,65 @@ class CrawlerRunConfig:
### Key Fields to Note
1. **`word_count_threshold`**:
1. **`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
2. **`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
3. **`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
4. **`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
5. **`js_code`**:
5. **`js_code`**:
- A string or list of JS strings to execute.
- Great for Load More buttons or user interactions.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
6. **`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`** & **`pdf`**:
- If `True`, captures a screenshot or PDF after the page is fully loaded.
- The results go to `result.screenshot` (base64) or `result.pdf` (bytes).
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **`verbose`**:
8. **Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
- See [Identity Based Crawling](../advanced/identity-based-crawling.md#7-locale-timezone-and-geolocation-control)
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browsers verbosity if also set to `True` in `BrowserConfig`.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
9. **`enable_rate_limiting`**:
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
10. **`memory_threshold_percent`**:
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
11. **`check_interval`**:
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
12. **`max_session_permit`**:
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
13. **`display_mode`**:
14. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
@@ -236,23 +247,20 @@ The `clone()` method:
---
## 3. LLMConfig Essentials
### Key fields to note
1. **`provider`**:
1. **`provider`**:
- Which LLM provoder to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
2. **`api_token`**:
- Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables
- API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
- Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`
3. **`base_url`**:
3. **`base_url`**:
- If your provider has a custom endpoint
```python
@@ -261,7 +269,7 @@ llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENA
## 4. Putting It All Together
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each calls needs:
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
```python
import asyncio

View File

@@ -26,6 +26,7 @@ class CrawlResult(BaseModel):
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
@@ -51,6 +52,7 @@ class CrawlResult(BaseModel):
| **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
| **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
| **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
| **mhtml (`Optional[str]`)** | MHTML snapshot of the page if `capture_mhtml=True`. Contains the full page with all resources. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | It holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as JSON string or other text. |
| **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
@@ -190,18 +192,27 @@ for img in images:
print("Image URL:", img["src"], "Alt:", img.get("alt"))
```
### 5.3 `screenshot` and `pdf`
### 5.3 `screenshot`, `pdf`, and `mhtml`
If you set `screenshot=True` or `pdf=True` in **`CrawlerRunConfig`**, then:
If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:
- `result.screenshot` contains a base64-encoded PNG string.
- `result.screenshot` contains a base64-encoded PNG string.
- `result.pdf` contains raw PDF bytes (you can write them to a file).
- `result.mhtml` contains the MHTML snapshot of the page as a string (you can write it to a .mhtml file).
```python
# Save the PDF
with open("page.pdf", "wb") as f:
f.write(result.pdf)
# Save the MHTML
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.
### 5.4 `ssl_certificate`
If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the sites SSL cert, such as issuer, validity dates, etc.

File diff suppressed because it is too large Load Diff

115
docs/md_v2/core/examples.md Normal file
View File

@@ -0,0 +1,115 @@
# Code Examples
This page provides a comprehensive list of example scripts that demonstrate various features and capabilities of Crawl4AI. Each example is designed to showcase specific functionality, making it easier for you to understand how to implement these features in your own projects.
## Getting Started Examples
| Example | Description | Link |
|---------|-------------|------|
| Hello World | A simple introductory example demonstrating basic usage of AsyncWebCrawler with JavaScript execution and content filtering. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world.py) |
| Quickstart | A comprehensive collection of examples showcasing various features including basic crawling, content cleaning, link analysis, JavaScript execution, CSS selectors, media handling, custom hooks, proxy configuration, screenshots, and multiple extraction strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart.py) |
| Quickstart Set 1 | Basic examples for getting started with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_1.py) |
| Quickstart Set 2 | More advanced examples for working with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_examples_set_2.py) |
## Browser & Crawling Features
| Example | Description | Link |
|---------|-------------|------|
| Built-in Browser | Demonstrates how to use the built-in browser capabilities. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/builtin_browser_example.py) |
| Browser Optimization | Focuses on browser performance optimization techniques. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/browser_optimization_example.py) |
| arun vs arun_many | Compares the `arun` and `arun_many` methods for single vs. multiple URL crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/arun_vs_arun_many.py) |
| Multiple URLs | Shows how to crawl multiple URLs asynchronously. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/async_webcrawler_multiple_urls_example.py) |
| Page Interaction | Guide on interacting with dynamic elements through clicks. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/tutorial_dynamic_clicks.md) |
| Crawler Monitor | Shows how to monitor the crawler's activities and status. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawler_monitor_example.py) |
| Full Page Screenshot & PDF | Guide on capturing full-page screenshots and PDFs from massive webpages. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/full_page_screenshot_and_pdf_export.md) |
## Advanced Crawling & Deep Crawling
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |
## Extraction Strategies
| Example | Description | Link |
|---------|-------------|------|
| Extraction Strategies | Demonstrates different extraction strategies with various input formats (markdown, HTML, fit_markdown) and JSON-based extractors (CSS and XPath). | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/extraction_strategies_examples.py) |
| Scraping Strategies | Compares the performance of different scraping strategies. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/scraping_strategies_performance.py) |
| LLM Extraction | Demonstrates LLM-based extraction specifically for OpenAI pricing data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_extraction_openai_pricing.py) |
| LLM Markdown | Shows how to use LLMs to generate markdown from crawled content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_markdown_generator.py) |
| Summarize Page | Shows how to summarize web page content. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/summarize_page.py) |
## E-commerce & Specialized Crawling
| Example | Description | Link |
|---------|-------------|------|
| Amazon Product Extraction | Demonstrates how to extract structured product data from Amazon search results using CSS selectors. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_direct_url.py) |
| Amazon with Hooks | Shows how to use hooks with Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_hooks.py) |
| Amazon with JavaScript | Demonstrates using custom JavaScript for Amazon product extraction. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/amazon_product_extraction_using_use_javascript.py) |
| Crypto Analysis | Demonstrates how to crawl and analyze cryptocurrency data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crypto_analysis_example.py) |
| SERP API | Demonstrates using Crawl4AI with search engine result pages. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/serp_api_project_11_feb.py) |
## Customization & Security
| Example | Description | Link |
|---------|-------------|------|
| Hooks | Illustrates how to use hooks at different stages of the crawling process for advanced customization. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hooks_example.py) |
| Identity-Based Browsing | Illustrates identity-based browsing configurations for authentic browsing experiences. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/identity_based_browsing.py) |
| Proxy Rotation | Shows how to use proxy rotation for web scraping and avoiding IP blocks. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/proxy_rotation_demo.py) |
| SSL Certificate | Illustrates SSL certificate handling and verification. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/ssl_example.py) |
| Language Support | Shows how to handle different languages during crawling. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/language_support_example.py) |
| Geolocation | Demonstrates how to use geolocation features. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/use_geo_location.py) |
## Docker & Deployment
| Example | Description | Link |
|---------|-------------|------|
| Docker Config | Demonstrates how to create and use Docker configuration objects. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_config_obj.py) |
| Docker Basic | A test suite for Docker deployment, showcasing various functionalities through the Docker API. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py) |
| Docker REST API | Shows how to interact with Crawl4AI Docker using REST API calls. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_rest_api.py) |
| Docker SDK | Demonstrates using the Python SDK for Crawl4AI Docker. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_python_sdk.py) |
## Application Examples
| Example | Description | Link |
|---------|-------------|------|
| Research Assistant | Demonstrates how to build a research assistant using Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/research_assistant.py) |
| REST Call | Shows how to make REST API calls with Crawl4AI. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/rest_call.py) |
| Chainlit Integration | Shows how to integrate Crawl4AI with Chainlit. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/chainlit.md) |
| Crawl4AI vs FireCrawl | Compares Crawl4AI with the FireCrawl library. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crawlai_vs_firecrawl.py) |
## Content Generation & Markdown
| Example | Description | Link |
|---------|-------------|------|
| Content Source | Demonstrates how to work with different content sources in markdown generation. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_example.py) |
| Content Source (Short) | A simplified version of content source usage. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/markdown/content_source_short_example.py) |
| Built-in Browser Guide | Guide for using the built-in browser capabilities. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/README_BUILTIN_BROWSER.md) |
## Running the Examples
To run any of these examples, you'll need to have Crawl4AI installed:
```bash
pip install crawl4ai
```
Then, you can run an example script like this:
```bash
python -m docs.examples.hello_world
```
For examples that require additional dependencies or environment variables, refer to the comments at the top of each file.
Some examples may require:
- API keys (for LLM-based examples)
- Docker setup (for Docker-related examples)
- Additional dependencies (specified in the example files)
## Contributing New Examples
If you've created an interesting example that demonstrates a unique use case or feature of Crawl4AI, we encourage you to contribute it to our examples collection. Please see our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.

View File

@@ -4,7 +4,35 @@ In this tutorial, youll learn how to:
1. Extract links (internal, external) from crawled pages
2. Filter or exclude specific domains (e.g., social media or custom domains)
3. Access and manage media data (especially images) in the crawl result
3. Access and ma### 3.2 Excluding Images
#### Excluding External Images
If you're dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
```python
crawler_cfg = CrawlerRunConfig(
exclude_external_images=True
)
```
This setting attempts to discard images from outside the primary domain, keeping only those from the site you're crawling.
#### Excluding All Images
If you want to completely remove all images from the page to maximize performance and reduce memory usage, use:
```python
crawler_cfg = CrawlerRunConfig(
exclude_all_images=True
)
```
This setting removes all images very early in the processing pipeline, which significantly improves memory efficiency and processing speed. This is particularly useful when:
- You don't need image data in your results
- You're crawling image-heavy pages that cause memory issues
- You want to focus only on text content
- You need to maximize crawling speeddata (especially images) in the crawl result
4. Configure your crawler to exclude or prioritize certain images
> **Prerequisites**
@@ -271,8 +299,41 @@ Each extracted table contains:
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
- **`capture_mhtml`**: Set to `True` if you want an MHTML snapshot of the page in `result.mhtml`. This format preserves the entire web page with all its resources (CSS, images, scripts) in a single file, making it perfect for archiving or offline viewing.
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
#### Example: Capturing Page as MHTML
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
crawler_cfg = CrawlerRunConfig(
capture_mhtml=True # Enable MHTML capture
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=crawler_cfg)
if result.success and result.mhtml:
# Save the MHTML snapshot to a file
with open("example.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
print("MHTML snapshot saved to example.mhtml")
else:
print("Failed to capture MHTML:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
The MHTML format is particularly useful because:
- It captures the complete page state including all resources
- It can be opened in most modern browsers for offline viewing
- It preserves the page exactly as it appeared during crawling
- It's a single file, making it easy to store and transfer
---
## 4. Putting It All Together: Link & Media Filtering

View File

@@ -111,13 +111,71 @@ Some commonly used `options`:
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.
## 4. Selecting the HTML Source for Markdown Generation
The `content_source` parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown.
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Option 1: Use the raw HTML directly from the webpage (before any processing)
raw_md_generator = DefaultMarkdownGenerator(
content_source="raw_html",
options={"ignore_links": True}
)
# Option 2: Use the cleaned HTML (after scraping strategy processing - default)
cleaned_md_generator = DefaultMarkdownGenerator(
content_source="cleaned_html", # This is the default
options={"ignore_links": True}
)
# Option 3: Use preprocessed HTML optimized for schema extraction
fit_md_generator = DefaultMarkdownGenerator(
content_source="fit_html",
options={"ignore_links": True}
)
# Use one of the generators in your crawler config
config = CrawlerRunConfig(
markdown_generator=raw_md_generator # Try each of the generators
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
print("Markdown:\n", result.markdown.raw_markdown[:500])
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
### HTML Source Options
- **`"cleaned_html"`** (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed.
- **`"raw_html"`**: Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that might not be relevant to the main content.
- **`"fit_html"`**: Uses HTML preprocessed for schema extraction. This HTML is optimized for structured data extraction and may have certain elements simplified or removed.
### When to Use Each Option
- Use **`"cleaned_html"`** (default) for most cases where you want a balance of content preservation and noise removal.
- Use **`"raw_html"`** when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep.
- Use **`"fit_html"`** when working with structured data or when you need HTML that's optimized for schema extraction.
---
## 4. Content Filters
## 5. Content Filters
**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you dont want.
### 4.1 BM25ContentFilter
### 5.1 BM25ContentFilter
If you have a **search query**, BM25 is a good choice:
@@ -146,7 +204,7 @@ config = CrawlerRunConfig(markdown_generator=md_generator)
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.
### 4.2 PruningContentFilter
### 5.2 PruningContentFilter
If you **dont** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
@@ -170,7 +228,7 @@ prune_filter = PruningContentFilter(
- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
### 4.3 LLMContentFilter
### 5.3 LLMContentFilter
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
@@ -247,7 +305,7 @@ filter = LLMContentFilter(
---
## 5. Using Fit Markdown
## 6. Using Fit Markdown
When a content filter is active, the library produces two forms of markdown inside `result.markdown`:
@@ -284,7 +342,7 @@ if __name__ == "__main__":
---
## 6. The `MarkdownGenerationResult` Object
## 7. The `MarkdownGenerationResult` Object
If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, youll see fields such as:
@@ -315,7 +373,7 @@ Below is a **revised section** under “Combining Filters (BM25 + Pruning)” th
---
## 7. Combining Filters (BM25 + Pruning) in Two Passes
## 8. Combining Filters (BM25 + Pruning) in Two Passes
You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank whats left** against a user query (with `BM25ContentFilter`). You dont have to crawl the page twice. Instead:
@@ -407,7 +465,7 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---
## 8. Common Pitfalls & Tips
## 9. Common Pitfalls & Tips
1. **No Markdown Output?**
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
@@ -427,11 +485,12 @@ If your codebase or pipeline design allows applying multiple filters in one pass
---
## 9. Summary & Next Steps
## 10. Summary & Next Steps
In this **Markdown Generation Basics** tutorial, you learned to:
- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Select different HTML sources using the `content_source` parameter.
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).