Compare commits
14 Commits
fix/exit_w ... release/v0

| Author | SHA1 | Date |
|---|---|---|
|  | a9a2d798b4 |  |
|  | 612270fcb0 |  |
|  | bc099fdd76 |  |
|  | 18504d782e |  |
|  | ad547607b9 |  |
|  | 6b0b5301ba |  |
|  | a5bcac4c9d |  |
|  | 45d8327d23 |  |
|  | 437395e490 |  |
|  | fddae303fb |  |
|  | 805c498adf |  |
|  | 6a728cbe5b |  |
|  | 5c33cbcca2 |  |
|  | 7b80eb6b99 |  |
7
.github/FUNDING.yml
vendored
Normal file
@@ -0,0 +1,7 @@
# These are supported funding model platforms

# GitHub Sponsors
github: unclecode

# Custom links for enterprise inquiries (uncomment when ready)
# custom: ["https://crawl4ai.com/enterprise"]
809
README-first.md
Normal file
@@ -0,0 +1,809 @@
|
||||
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
|
||||
|
||||
<div align="center">
|
||||
|
||||
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
|
||||
|
||||
[GitHub Stars](https://github.com/unclecode/crawl4ai/stargazers)
[GitHub Forks](https://github.com/unclecode/crawl4ai/network/members)

[PyPI Version](https://badge.fury.io/py/crawl4ai)
[Python Versions](https://pypi.org/project/crawl4ai/)
[Downloads](https://pepy.tech/project/crawl4ai)
[GitHub Sponsors](https://github.com/sponsors/unclecode)
|
||||
|
||||
<p align="center">
|
||||
<a href="https://x.com/crawl4ai">
|
||||
<img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
|
||||
</a>
|
||||
<a href="https://www.linkedin.com/company/crawl4ai">
|
||||
<img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
|
||||
</a>
|
||||
<a href="https://discord.gg/jP8KfhDhyN">
|
||||
<img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
|
||||
</a>
|
||||
</p>
|
||||
</div>
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
[✨ Check out latest update v0.7.0](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
|
||||
|
||||
<details>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
|
||||
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.
|
||||
|
||||
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
|
||||
|
||||
I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
|
||||
|
||||
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
|
||||
</details>
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
|
||||
2. **Lightning Fast**: Delivers results faster with real-time, cost-efficient performance.
|
||||
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
|
||||
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
|
||||
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
|
||||
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
1. Install Crawl4AI:
|
||||
```bash
|
||||
# Install the package
|
||||
pip install -U crawl4ai
|
||||
|
||||
# For pre-release versions
|
||||
pip install crawl4ai --pre
|
||||
|
||||
# Run post-installation setup
|
||||
crawl4ai-setup
|
||||
|
||||
# Verify your installation
|
||||
crawl4ai-doctor
|
||||
```
|
||||
|
||||
If you encounter any browser-related issues, you can install them manually:
|
||||
```bash
|
||||
python -m playwright install --with-deps chromium
|
||||
```
|
||||
|
||||
2. Run a simple web crawl with Python:
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import *
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
)
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
3. Or use the new command-line interface:
|
||||
```bash
|
||||
# Basic crawl with markdown output
|
||||
crwl https://www.nbcnews.com/business -o markdown
|
||||
|
||||
# Deep crawl with BFS strategy, max 10 pages
|
||||
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
|
||||
|
||||
# Use LLM extraction with a specific question
|
||||
crwl https://www.example.com/products -q "Extract all product prices"
|
||||
```
|
||||
|
||||
## ✨ Features
|
||||
|
||||
<details>
|
||||
<summary>📝 <strong>Markdown Generation</strong></summary>
|
||||
|
||||
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
|
||||
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
|
||||
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
|
||||
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
|
||||
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content (see the sketch after this list).
|
||||
</details>
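
The BM25 option above can slot straight into the default Markdown generator. A minimal sketch, mirroring the commented-out variant in the advanced example further down; the query string and threshold are placeholders to tune:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # BM25 keeps chunks that score well against the query and drops the rest
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=BM25ContentFilter(user_query="browser automation", bm25_threshold=1.0)
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com", config=config)
        print(result.markdown.fit_markdown)  # query-focused, filtered Markdown

if __name__ == "__main__":
    asyncio.run(main())
```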
|
||||
|
||||
<details>
|
||||
<summary>📊 <strong>Structured Data Extraction</strong></summary>
|
||||
|
||||
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
|
||||
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
|
||||
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
|
||||
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
|
||||
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🌐 <strong>Browser Integration</strong></summary>
|
||||
|
||||
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
|
||||
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
|
||||
- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
|
||||
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
|
||||
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access (see the configuration sketch after this list).
|
||||
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
|
||||
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
|
||||
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
|
||||
|
||||
</details>
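
Tying a few of these together, the sketch below routes the browser through an authenticated proxy and reuses one session across two requests. `proxy_config` and `session_id` are assumed parameter names to verify against the API reference; everything else follows the examples later in this README:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(
        headless=True,
        proxy_config={            # assumed dict shape for an authenticated proxy
            "server": "http://proxy.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    )
    run_config = CrawlerRunConfig(
        session_id="my_session",  # assumed: reuse the same browser state across arun() calls
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First request establishes state (e.g. a login), the second reuses it
        await crawler.arun(url="https://example.com/login", config=run_config)
        result = await crawler.arun(url="https://example.com/dashboard", config=run_config)
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(main())
```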
|
||||
|
||||
<details>
|
||||
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
|
||||
|
||||
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
|
||||
- 🚀 **Dynamic Crawling**: Execute JavaScript and wait on sync or async conditions to extract dynamic content.
|
||||
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
|
||||
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`), as shown in the sketch after this list.
|
||||
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
|
||||
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
|
||||
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
|
||||
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
|
||||
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
|
||||
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
|
||||
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
|
||||
|
||||
</details>
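
As a small illustration of the `raw:` prefix and screenshot capture mentioned above, the sketch below renders an inline HTML string and saves a screenshot. `screenshot=True` on `CrawlerRunConfig` and the base64-encoded `result.screenshot` field are assumptions based on the documented feature, so double-check them against the API reference:

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Crawl an in-memory HTML snippet, no network fetch involved
    raw_page = "raw:<html><body><h1>Hello Crawl4AI</h1><p>Inline HTML.</p></body></html>"
    config = CrawlerRunConfig(screenshot=True)  # assumed flag for capturing a screenshot

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_page, config=config)
        print(result.markdown)  # Markdown built from the raw HTML
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

if __name__ == "__main__":
    asyncio.run(main())
```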
|
||||
|
||||
<details>
|
||||
<summary>🚀 <strong>Deployment</strong></summary>
|
||||
|
||||
- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
|
||||
- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
|
||||
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
|
||||
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
|
||||
- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🎯 <strong>Additional Features</strong></summary>
|
||||
|
||||
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users (see the sketch after this list).
|
||||
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
|
||||
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
|
||||
- 🛡️ **Error Handling**: Robust error management for seamless execution.
|
||||
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
|
||||
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
|
||||
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
|
||||
|
||||
</details>
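
For stealth mode specifically, here is a minimal sketch. It assumes the `enable_stealth` flag that this compare adds to `BrowserConfig` (see the config changes near the end of this diff) and the `magic=True` option used in the custom-profile example later in this README:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # enable_stealth applies playwright-stealth patches; per the config changes in
    # this compare it cannot be combined with browser_mode="builtin"
    browser_config = BrowserConfig(headless=True, enable_stealth=True)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config,
            magic=True,  # extra anti-bot handling, as in the profile example below
        )
        print(f"Success: {result.success}, content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(main())
```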
|
||||
|
||||
## Try it Now!
|
||||
|
||||
✨ Play around with this [Colab notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
|
||||
✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
|
||||
|
||||
## Installation 🛠️
|
||||
|
||||
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
|
||||
|
||||
<details>
|
||||
<summary>🐍 <strong>Using pip</strong></summary>
|
||||
|
||||
Choose the installation option that best fits your needs:
|
||||
|
||||
### Basic Installation
|
||||
|
||||
For basic web crawling and scraping tasks:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup # Setup the browser
|
||||
```
|
||||
|
||||
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
|
||||
|
||||
👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` command should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
|
||||
|
||||
1. Through the command line:
|
||||
|
||||
```bash
|
||||
playwright install
|
||||
```
|
||||
|
||||
2. If the above doesn't work, try this more specific command:
|
||||
|
||||
```bash
|
||||
python -m playwright install chromium
|
||||
```
|
||||
|
||||
This second method has proven to be more reliable in some cases.
|
||||
|
||||
---
|
||||
|
||||
### Installation with Synchronous Version
|
||||
|
||||
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai[sync]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Development Installation
|
||||
|
||||
For contributors who plan to modify the source code:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e . # Basic installation in editable mode
|
||||
```
|
||||
|
||||
Install optional features:
|
||||
|
||||
```bash
|
||||
pip install -e ".[torch]" # With PyTorch features
|
||||
pip install -e ".[transformer]" # With Transformer features
|
||||
pip install -e ".[cosine]" # With cosine similarity features
|
||||
pip install -e ".[sync]" # With synchronous crawling (Selenium)
|
||||
pip install -e ".[all]" # Install all optional features
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🐳 <strong>Docker Deployment</strong></summary>
|
||||
|
||||
> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.
|
||||
|
||||
### New Docker Features
|
||||
|
||||
The new Docker implementation includes:
|
||||
- **Browser pooling** with page pre-warming for faster response times
|
||||
- **Interactive playground** to test and generate request code
|
||||
- **MCP integration** for direct connection to AI tools like Claude Code
|
||||
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
|
||||
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
|
||||
- **Optimized resources** with improved memory management
|
||||
|
||||
### Getting Started
|
||||
|
||||
```bash
|
||||
# Pull and run the latest release candidate
|
||||
docker pull unclecode/crawl4ai:0.7.0
|
||||
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
|
||||
|
||||
# Visit the playground at http://localhost:11235/playground
|
||||
```
|
||||
|
||||
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
### Quick Test
|
||||
|
||||
Run a quick test (works for both Docker options):
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
# Submit a crawl job
|
||||
response = requests.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json={"urls": ["https://example.com"], "priority": 10}
|
||||
)
|
||||
if response.status_code == 200:
|
||||
print("Crawl job submitted successfully.")
|
||||
|
||||
if "results" in response.json():
|
||||
results = response.json()["results"]
|
||||
print("Crawl job completed. Results:")
|
||||
for result in results:
|
||||
print(result)
|
||||
else:
|
||||
task_id = response.json()["task_id"]
|
||||
print(f"Crawl job submitted. Task ID:: {task_id}")
|
||||
result = requests.get(f"http://localhost:11235/task/{task_id}")
|
||||
```
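
When the server answers with a `task_id` instead of inline results, the snippet above queries the task endpoint only once. A hedged polling helper is sketched below; the `status`, `completed`, and `failed` values are assumptions about the task payload, so adjust them to the actual API response:

```python
import time
import requests

def wait_for_task(task_id: str, base_url: str = "http://localhost:11235", timeout: float = 60.0) -> dict:
    """Poll the /task endpoint until the crawl finishes or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        payload = requests.get(f"{base_url}/task/{task_id}").json()
        # "status", "completed", and "failed" are assumed field names
        if payload.get("status") in ("completed", "failed"):
            return payload
        time.sleep(2)
    raise TimeoutError(f"Task {task_id} did not finish within {timeout} seconds")
```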
|
||||
|
||||
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
## 🔬 Advanced Usage Examples 🔬
|
||||
|
||||
You can find a variety of examples in the [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) directory; some popular ones are shared here.
|
||||
|
||||
<details>
|
||||
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def main():
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
verbose=True,
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
|
||||
# ),
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://docs.micronaut.io/4.7.6/guide/",
|
||||
config=run_config
|
||||
)
|
||||
print(len(result.markdown.raw_markdown))
|
||||
print(len(result.markdown.fit_markdown))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🖥️ <strong>Executing JavaScript & Extracting Structured Data without LLMs</strong></summary>
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai import JsonCssExtractionStrategy
|
||||
import json
|
||||
|
||||
async def main():
|
||||
schema = {
|
||||
"name": "KidoCode Courses",
|
||||
"baseSelector": "section.charge-methodology .w-tab-content > div",
|
||||
"fields": [
|
||||
{
|
||||
"name": "section_title",
|
||||
"selector": "h3.heading-50",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "section_description",
|
||||
"selector": ".charge-content",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_name",
|
||||
"selector": ".text-block-93",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_description",
|
||||
"selector": ".course-content-text",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_icon",
|
||||
"selector": ".image-92",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://www.kidocode.com/degrees/technology",
|
||||
config=run_config
|
||||
)
|
||||
|
||||
companies = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(companies)} companies")
|
||||
print(json.dumps(companies[0], indent=2))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
|
||||
|
||||
```python
|
||||
import os
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
|
||||
from crawl4ai import LLMExtractionStrategy
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
|
||||
|
||||
async def main():
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
run_config = CrawlerRunConfig(
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
|
||||
# provider="ollama/qwen2", api_token="no-token",
|
||||
llm_config = LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url='https://openai.com/api/pricing/',
|
||||
config=run_config
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
|
||||
|
||||
```python
|
||||
import os, sys
|
||||
from pathlib import Path
|
||||
import asyncio, time
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def test_news_crawl():
|
||||
# Create a persistent user data directory
|
||||
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
|
||||
os.makedirs(user_data_dir, exist_ok=True)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
verbose=True,
|
||||
headless=True,
|
||||
user_data_dir=user_data_dir,
|
||||
use_persistent_context=True,
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
|
||||
|
||||
result = await crawler.arun(
|
||||
url,
|
||||
config=run_config,
|
||||
magic=True,
|
||||
)
|
||||
|
||||
print(f"Successfully crawled {url}")
|
||||
print(f"Content length: {len(result.markdown)}")
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
## ✨ Recent Updates
|
||||
|
||||
### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
|
||||
|
||||
- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
|
||||
```python
|
||||
config = AdaptiveConfig(
|
||||
confidence_threshold=0.7, # Min confidence to stop crawling
|
||||
max_depth=5, # Maximum crawl depth
|
||||
max_pages=20, # Maximum number of pages to crawl
|
||||
strategy="statistical"
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
adaptive_crawler = AdaptiveCrawler(crawler, config)
|
||||
state = await adaptive_crawler.digest(
|
||||
start_url="https://news.example.com",
|
||||
query="latest news content"
|
||||
)
|
||||
# Crawler learns patterns and improves extraction over time
|
||||
```
|
||||
|
||||
- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
|
||||
```python
|
||||
scroll_config = VirtualScrollConfig(
|
||||
container_selector="[data-testid='feed']",
|
||||
scroll_count=20,
|
||||
scroll_by="container_height",
|
||||
wait_after_scroll=1.0
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config=CrawlerRunConfig(
|
||||
virtual_scroll_config=scroll_config
|
||||
))
|
||||
```
|
||||
|
||||
- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
|
||||
```python
|
||||
link_config = LinkPreviewConfig(
|
||||
query="machine learning tutorials",
|
||||
score_threshold=0.3,
|
||||
concurrent_requests=10
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config=CrawlerRunConfig(
|
||||
link_preview_config=link_config,
|
||||
score_links=True
|
||||
))
|
||||
# Links ranked by relevance and quality
|
||||
```
|
||||
|
||||
- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds (a follow-up sketch after this list feeds the results into the crawler):
|
||||
```python
|
||||
seeder = AsyncUrlSeeder(SeedingConfig(
|
||||
source="sitemap+cc",
|
||||
pattern="*/blog/*",
|
||||
query="python tutorials",
|
||||
score_threshold=0.4
|
||||
))
|
||||
|
||||
urls = await seeder.discover("https://example.com")
|
||||
```
|
||||
|
||||
- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
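
As a follow-up to the URL Seeder example, discovered URLs can be handed straight to the crawler. This sketch assumes `arun_many` accepts a list of URL strings, that each discovery entry is either a string or a dict with a `url` key, and that the imports match the earlier examples:

```python
# Continues from the AsyncUrlSeeder example above
url_strings = [u["url"] if isinstance(u, dict) else u for u in urls]

async with AsyncWebCrawler() as crawler:
    # Crawl the discovered pages as a batch; one result is returned per URL
    results = await crawler.arun_many(url_strings[:50], config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS))
    for r in results:
        if r.success:
            print(r.url, len(r.markdown))
```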
|
||||
|
||||
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
## Version Numbering in Crawl4AI
|
||||
|
||||
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
|
||||
|
||||
### Version Numbers Explained
|
||||
|
||||
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
|
||||
|
||||
#### Pre-release Versions
|
||||
We use different suffixes to indicate development stages:
|
||||
|
||||
- `dev` (0.4.3dev1): Development versions, unstable
|
||||
- `a` (0.4.3a1): Alpha releases, experimental features
|
||||
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
|
||||
- `rc` (0.4.3rc1): Release candidates, potential final version
|
||||
|
||||
#### Installation
|
||||
- Regular installation (stable version):
|
||||
```bash
|
||||
pip install -U crawl4ai
|
||||
```
|
||||
|
||||
- Install pre-release versions:
|
||||
```bash
|
||||
pip install crawl4ai --pre
|
||||
```
|
||||
|
||||
- Install specific version:
|
||||
```bash
|
||||
pip install crawl4ai==0.4.3b1
|
||||
```
|
||||
|
||||
#### Why Pre-releases?
|
||||
We use pre-releases to:
|
||||
- Test new features in real-world scenarios
|
||||
- Gather feedback before final releases
|
||||
- Ensure stability for production users
|
||||
- Allow early adopters to try new features
|
||||
|
||||
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
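
To confirm which build you actually installed (stable or pre-release), print the package version; `crawl4ai.__version__` is defined in `crawl4ai/__version__.py`, shown later in this compare:

```python
import crawl4ai

# Prints e.g. "0.7.3" for a stable build or "0.7.3b1" for a beta
print(crawl4ai.__version__)
```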
|
||||
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
|
||||
|
||||
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
|
||||
|
||||
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
|
||||
|
||||
<details>
|
||||
<summary>📈 <strong>Development TODOs</strong></summary>
|
||||
|
||||
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
|
||||
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
|
||||
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
|
||||
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
|
||||
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
|
||||
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
|
||||
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
|
||||
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
|
||||
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
|
||||
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
|
||||
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
|
||||
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
|
||||
|
||||
</details>
|
||||
|
||||
## 🤝 Contributing
|
||||
|
||||
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
|
||||
|
||||
|
||||
|
||||
## 📄 License & Attribution
|
||||
|
||||
This project is licensed under the Apache License 2.0; attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
|
||||
|
||||
### Attribution Requirements
|
||||
When using Crawl4AI, you are encouraged to include one of the following attribution methods:
|
||||
|
||||
#### 1. Badge Attribution (Recommended)
|
||||
Add one of these badges to your README, documentation, or website:
|
||||
|
||||
| Theme | Badge |
|
||||
|-------|-------|
|
||||
| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
|
||||
| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
|
||||
| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
|
||||
| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |
|
||||
|
||||
|
||||
HTML code for adding the badges:
|
||||
```html
|
||||
<!-- Disco Theme (Animated) -->
|
||||
<a href="https://github.com/unclecode/crawl4ai">
|
||||
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
|
||||
</a>
|
||||
|
||||
<!-- Night Theme (Dark with Neon) -->
|
||||
<a href="https://github.com/unclecode/crawl4ai">
|
||||
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
|
||||
</a>
|
||||
|
||||
<!-- Dark Theme (Classic) -->
|
||||
<a href="https://github.com/unclecode/crawl4ai">
|
||||
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
|
||||
</a>
|
||||
|
||||
<!-- Light Theme (Classic) -->
|
||||
<a href="https://github.com/unclecode/crawl4ai">
|
||||
<img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
|
||||
</a>
|
||||
|
||||
<!-- Simple Shield Badge -->
|
||||
<a href="https://github.com/unclecode/crawl4ai">
|
||||
<img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
|
||||
</a>
|
||||
```
|
||||
|
||||
#### 2. Text Attribution
|
||||
Add this line to your documentation:
|
||||
```
|
||||
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
|
||||
```
|
||||
|
||||
## 📚 Citation
|
||||
|
||||
If you use Crawl4AI in your research or project, please cite:
|
||||
|
||||
```bibtex
|
||||
@software{crawl4ai2024,
|
||||
author = {UncleCode},
|
||||
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler \& Scraper},
|
||||
year = {2024},
|
||||
publisher = {GitHub},
|
||||
journal = {GitHub Repository},
|
||||
howpublished = {\url{https://github.com/unclecode/crawl4ai}},
|
||||
commit = {Please use the commit hash you're working with}
|
||||
}
|
||||
```
|
||||
|
||||
Text citation format:
|
||||
```
|
||||
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
|
||||
GitHub. https://github.com/unclecode/crawl4ai
|
||||
```
|
||||
|
||||
## 📧 Contact
|
||||
|
||||
For questions, suggestions, or feedback, feel free to reach out:
|
||||
|
||||
- GitHub: [unclecode](https://github.com/unclecode)
|
||||
- Twitter: [@unclecode](https://twitter.com/unclecode)
|
||||
- Website: [crawl4ai.com](https://crawl4ai.com)
|
||||
|
||||
Happy Crawling! 🕸️🚀
|
||||
|
||||
## 💖 Support Crawl4AI
|
||||
|
||||
> 🎉 **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame!
|
||||
|
||||
Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your support ensures we stay independent, innovative, and free forever.
|
||||
|
||||
<div align="center">
|
||||
|
||||
[Sponsor Crawl4AI on GitHub](https://github.com/sponsors/unclecode)
|
||||
|
||||
</div>
|
||||
|
||||
### 🤝 Sponsorship Tiers
|
||||
|
||||
- **🌱 Believer ($5/mo)**: Join the movement for data democratization
|
||||
- **🚀 Builder ($50/mo)**: Get priority support and early feature access
|
||||
- **💼 Growing Team ($500/mo)**: Bi-weekly syncs and optimization help
|
||||
- **🏢 Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support
|
||||
|
||||
**Why sponsor?** Every tier includes real benefits. No more rate-limited APIs. Own your data pipeline. Build data sovereignty together.
|
||||
|
||||
[View All Tiers & Benefits →](https://github.com/sponsors/unclecode)
|
||||
|
||||
### 🏆 Our Sponsors
|
||||
|
||||
#### 👑 Founding Sponsors (First 50)
|
||||
*Be part of history - [Become a Founding Sponsor](https://github.com/sponsors/unclecode)*
|
||||
|
||||
<!-- Founding sponsors will be permanently recognized here -->
|
||||
|
||||
#### Current Sponsors
|
||||
Thank you to all our sponsors who make this project possible!
|
||||
|
||||
<!-- Sponsors will be automatically added here -->
|
||||
|
||||
## 🗾 Mission
|
||||
|
||||
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
|
||||
|
||||
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
|
||||
|
||||
<details>
|
||||
<summary>🔑 <strong>Key Opportunities</strong></summary>
|
||||
|
||||
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
|
||||
- **Authentic AI Data**: Provide AI systems with real human insights.
|
||||
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🚀 <strong>Development Pathway</strong></summary>
|
||||
|
||||
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
|
||||
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
|
||||
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
|
||||
|
||||
For more details, see our [full mission statement](./MISSION.md).
|
||||
</details>
|
||||
|
||||
## Star History
|
||||
|
||||
[Star History Chart](https://star-history.com/#unclecode/crawl4ai&Date)
|
||||
190
README.md
@@ -10,6 +10,7 @@
|
||||
[](https://badge.fury.io/py/crawl4ai)
|
||||
[](https://pypi.org/project/crawl4ai/)
|
||||
[](https://pepy.tech/project/crawl4ai)
|
||||
[](https://github.com/sponsors/unclecode)
|
||||
|
||||
<p align="center">
|
||||
<a href="https://x.com/crawl4ai">
|
||||
@@ -24,32 +25,33 @@
|
||||
</p>
|
||||
</div>
|
||||
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
|
||||
|
||||
[✨ Check out latest update v0.7.0](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
|
||||
✨ New in v0.7.0, Adaptive Crawling, Virtual Scroll, Link Preview scoring, Async URL Seeder, big performance gains. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
|
||||
|
||||
<details>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
<summary>🤓 <strong>My Personal Story</strong></summary>
|
||||
|
||||
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
|
||||
I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That’s where I learned how much extraction matters.
|
||||
|
||||
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
|
||||
In 2023, I needed web-to-Markdown. The “open source” option wanted an account, API token, and $16, and still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now it’s the most-starred crawler on GitHub.
|
||||
|
||||
I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
|
||||
|
||||
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
|
||||
I made it open source for **availability**, anyone can use it without a gate. Now I’m building the platform for **affordability**, anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.
|
||||
</details>
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
|
||||
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
|
||||
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
|
||||
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
|
||||
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
|
||||
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
|
||||
<details>
|
||||
<summary>Why developers pick Crawl4AI</summary>
|
||||
|
||||
- **LLM ready output**, smart Markdown with headings, tables, code, citation hints
|
||||
- **Fast in practice**, async browser pool, caching, minimal hops
|
||||
- **Full control**, sessions, proxies, cookies, user scripts, hooks
|
||||
- **Adaptive intelligence**, learns site patterns, explores only what matters
|
||||
- **Deploy anywhere**, zero keys, CLI and Docker, cloud friendly
|
||||
</details>
|
||||
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
@@ -101,6 +103,33 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
|
||||
crwl https://www.example.com/products -q "Extract all product prices"
|
||||
```
|
||||
|
||||
## 💖 Support Crawl4AI
|
||||
|
||||
> 🎉 **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.
|
||||
|
||||
Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community — while giving you direct access to premium benefits.
|
||||
|
||||
<div align="">
|
||||
|
||||
[](https://github.com/sponsors/unclecode)
|
||||
[](https://github.com/sponsors/unclecode)
|
||||
|
||||
</div>
|
||||
|
||||
### 🤝 Sponsorship Tiers
|
||||
|
||||
- **🌱 Believer ($5/mo)** — Join the movement for data democratization
|
||||
- **🚀 Builder ($50/mo)** — Priority support & early access to features
|
||||
- **💼 Growing Team ($500/mo)** — Bi-weekly syncs & optimization help
|
||||
- **🏢 Data Infrastructure Partner ($2000/mo)** — Full partnership with dedicated support
|
||||
*Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact*
|
||||
|
||||
**Why sponsor?**
|
||||
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.
|
||||
|
||||
[See All Tiers & Benefits →](https://github.com/sponsors/unclecode)
|
||||
|
||||
|
||||
## ✨ Features
|
||||
|
||||
<details>
|
||||
@@ -280,12 +309,6 @@ docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.
|
||||
# Visit the playground at http://localhost:11235/playground
|
||||
```
|
||||
|
||||
For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
### Quick Test
|
||||
|
||||
Run a quick test (works for both Docker options):
|
||||
@@ -316,10 +339,11 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
## 🔬 Advanced Usage Examples 🔬
|
||||
|
||||
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
|
||||
You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
|
||||
|
||||
<details>
|
||||
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
|
||||
@@ -478,7 +502,7 @@ if __name__ == "__main__":
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
|
||||
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
|
||||
|
||||
```python
|
||||
import os, sys
|
||||
@@ -583,97 +607,12 @@ async def test_news_crawl():
|
||||
|
||||
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
### Previous Version: 0.6.0 Release Highlights
|
||||
|
||||
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
|
||||
```python
|
||||
crun_cfg = CrawlerRunConfig(
|
||||
url="https://browserleaks.com/geo", # test page that shows your location
|
||||
locale="en-US", # Accept-Language & UI locale
|
||||
timezone_id="America/Los_Angeles", # JS Date()/Intl timezone
|
||||
geolocation=GeolocationConfig( # override GPS coords
|
||||
latitude=34.0522,
|
||||
longitude=-118.2437,
|
||||
accuracy=10.0,
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:
|
||||
```python
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
|
||||
try:
|
||||
# Set up scraping parameters
|
||||
crawl_config = CrawlerRunConfig(
|
||||
table_score_threshold=8, # Strict table detection
|
||||
)
|
||||
|
||||
# Execute market data extraction
|
||||
results: List[CrawlResult] = await crawler.arun(
|
||||
url="https://coinmarketcap.com/?page=1", config=crawl_config
|
||||
)
|
||||
|
||||
# Process results
|
||||
raw_df = pd.DataFrame()
|
||||
for result in results:
|
||||
if result.success and result.media["tables"]:
|
||||
raw_df = pd.DataFrame(
|
||||
result.media["tables"][0]["rows"],
|
||||
columns=result.media["tables"][0]["headers"],
|
||||
)
|
||||
break
|
||||
print(raw_df.head())
|
||||
|
||||
finally:
|
||||
await crawler.stop()
|
||||
```
|
||||
|
||||
- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage
|
||||
|
||||
- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:
|
||||
```python
|
||||
crawler_config = CrawlerRunConfig(
|
||||
capture_network=True,
|
||||
capture_console=True,
|
||||
mhtml=True
|
||||
)
|
||||
```
|
||||
|
||||
- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol
|
||||
```bash
|
||||
# Add Crawl4AI to Claude Code
|
||||
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
|
||||
```
|
||||
|
||||
- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `http://localhost:11235//playground`
|
||||
|
||||
- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency
|
||||
|
||||
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
|
||||
|
||||
|
||||
### Previous Version: 0.5.0 Major Release Highlights
|
||||
|
||||
- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies
|
||||
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
|
||||
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
|
||||
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
|
||||
- **👤 Browser Profiler**: Create and manage persistent browser profiles
|
||||
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
|
||||
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
|
||||
- **🌐 Proxy Rotation**: Built-in support for proxy switching
|
||||
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
|
||||
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
|
||||
|
||||
Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).
|
||||
|
||||
## Version Numbering in Crawl4AI
|
||||
|
||||
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
|
||||
|
||||
### Version Numbers Explained
|
||||
<details>
|
||||
<summary>📈 <strong>Version Numbers Explained</strong></summary>
|
||||
|
||||
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
|
||||
|
||||
@@ -710,6 +649,8 @@ We use pre-releases to:
|
||||
|
||||
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
|
||||
|
||||
</details>
|
||||
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
|
||||
@@ -722,16 +663,16 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
|
||||
<summary>📈 <strong>Development TODOs</strong></summary>
|
||||
|
||||
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
|
||||
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
|
||||
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
|
||||
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
|
||||
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
|
||||
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
|
||||
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
|
||||
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
|
||||
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
|
||||
- [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
|
||||
- [x] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
|
||||
- [x] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
|
||||
- [x] 4. Automated Schema Generator: Convert natural language to extraction schemas
|
||||
- [x] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
|
||||
- [x] 6. Web Embedding Index: Semantic search infrastructure for crawled content
|
||||
- [x] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
|
||||
- [x] 8. Performance Monitor: Real-time insights into crawler operations
|
||||
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
|
||||
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
|
||||
- [x] 10. Sponsorship Program: Structured support system with tiered benefits
|
||||
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
|
||||
|
||||
</details>
|
||||
@@ -746,12 +687,13 @@ Here's the updated license section:
|
||||
|
||||
## 📄 License & Attribution
|
||||
|
||||
This project is licensed under the Apache License 2.0 with a required attribution clause. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
|
||||
This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
|
||||
|
||||
### Attribution Requirements
|
||||
When using Crawl4AI, you must include one of the following attribution methods:
|
||||
|
||||
#### 1. Badge Attribution (Recommended)
|
||||
<details>
|
||||
<summary>📈 <strong>1. Badge Attribution (Recommended)</strong></summary>
|
||||
Add one of these badges to your README, documentation, or website:
|
||||
|
||||
| Theme | Badge |
|
||||
@@ -790,11 +732,15 @@ HTML code for adding the badges:
|
||||
</a>
|
||||
```
|
||||
|
||||
#### 2. Text Attribution
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>📖 <strong>2. Text Attribution</strong></summary>
|
||||
Add this line to your documentation:
|
||||
```
|
||||
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
|
||||
```
|
||||
</details>
|
||||
|
||||
## 📚 Citation
|
||||
|
||||
|
||||
65
SPONSORS.md
Normal file
@@ -0,0 +1,65 @@
# 💖 Sponsors & Supporters

Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this project open-source and actively maintained.

## 👑 Founding Sponsors
*The first 50 sponsors who believed in our vision - permanently recognized*

<!-- Founding sponsors will be listed here with special recognition -->
🎉 **Become a Founding Sponsor!** Only [X/50] spots remaining! [Join now →](https://github.com/sponsors/unclecode)

---

## 🏢 Data Infrastructure Partners ($2000/month)
*These organizations are building their data sovereignty with Crawl4AI at the core*

<!-- Data Infrastructure Partners will be listed here -->
*Be the first Data Infrastructure Partner! [Join us →](https://github.com/sponsors/unclecode)*

---

## 💼 Growing Teams ($500/month)
*Teams scaling their data extraction with Crawl4AI*

<!-- Growing Teams will be listed here -->
*Your team could be here! [Become a sponsor →](https://github.com/sponsors/unclecode)*

---

## 🚀 Builders ($50/month)
*Developers and entrepreneurs building with Crawl4AI*

<!-- Builders will be listed here -->
*Join the builders! [Start sponsoring →](https://github.com/sponsors/unclecode)*

---

## 🌱 Believers ($5/month)
*The community supporting data democratization*

<!-- Believers will be listed here -->
*Thank you to all our community believers!*

---

## 🤝 Want to Sponsor?

Crawl4AI is the #1 trending open-source web crawler. We're building the future of data extraction - where organizations own their data pipelines instead of relying on rate-limited APIs.

### Available Sponsorship Tiers:
- **🌱 Believer** ($5/mo) - Support the movement
- **🚀 Builder** ($50/mo) - Priority support & early access
- **💼 Growing Team** ($500/mo) - Bi-weekly syncs & optimization
- **🏢 Data Infrastructure Partner** ($2000/mo) - Full partnership & dedicated support

[View all tiers and benefits →](https://github.com/sponsors/unclecode)

### Enterprise & Custom Partnerships

Building data extraction at scale? Need dedicated support or infrastructure? Let's talk about a custom partnership.

📧 Contact: [hello@crawl4ai.com](mailto:hello@crawl4ai.com) | 📅 [Schedule a call](https://calendar.app.google/rEpvi2UBgUQjWHfJ9)

---

*This list is updated regularly. Sponsors at $50+ tiers can submit their logos via [hello@crawl4ai.com](mailto:hello@crawl4ai.com)*
@@ -88,6 +88,13 @@ from .script import (
|
||||
ErrorDetail
|
||||
)
|
||||
|
||||
# Browser Adapters
|
||||
from .browser_adapter import (
|
||||
BrowserAdapter,
|
||||
PlaywrightAdapter,
|
||||
UndetectedAdapter
|
||||
)
|
||||
|
||||
from .utils import (
|
||||
start_colab_display_server,
|
||||
setup_colab_environment
|
||||
@@ -174,6 +181,10 @@ __all__ = [
|
||||
"CompilationResult",
|
||||
"ValidationResult",
|
||||
"ErrorDetail",
|
||||
# Browser Adapters
|
||||
"BrowserAdapter",
|
||||
"PlaywrightAdapter",
|
||||
"UndetectedAdapter",
|
||||
"LinkPreviewConfig"
|
||||
]
|
||||
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# crawl4ai/__version__.py
|
||||
|
||||
# This is the version that will be used for stable releases
|
||||
__version__ = "0.7.2"
|
||||
__version__ = "0.7.3"
|
||||
|
||||
# For nightly builds, this gets set during build process
|
||||
__nightly_version__ = None
|
||||
|
||||
@@ -390,6 +390,8 @@ class BrowserConfig:
|
||||
light_mode (bool): Disables certain background features for performance gains. Default: False.
|
||||
extra_args (list): Additional command-line arguments passed to the browser.
|
||||
Default: [].
|
||||
enable_stealth (bool): If True, applies playwright-stealth to bypass basic bot detection.
|
||||
Cannot be used with use_undetected browser mode. Default: False.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
@@ -430,6 +432,7 @@ class BrowserConfig:
|
||||
extra_args: list = None,
|
||||
debugging_port: int = 9222,
|
||||
host: str = "localhost",
|
||||
enable_stealth: bool = False,
|
||||
):
|
||||
self.browser_type = browser_type
|
||||
self.headless = headless
|
||||
@@ -470,6 +473,7 @@ class BrowserConfig:
|
||||
self.verbose = verbose
|
||||
self.debugging_port = debugging_port
|
||||
self.host = host
|
||||
self.enable_stealth = enable_stealth
|
||||
|
||||
fa_user_agenr_generator = ValidUAGenerator()
|
||||
if self.user_agent_mode == "random":
|
||||
@@ -501,6 +505,13 @@ class BrowserConfig:
|
||||
# If persistent context is requested, ensure managed browser is enabled
|
||||
if self.use_persistent_context:
|
||||
self.use_managed_browser = True
|
||||
|
||||
# Validate stealth configuration
|
||||
if self.enable_stealth and self.use_managed_browser and self.browser_mode == "builtin":
|
||||
raise ValueError(
|
||||
"enable_stealth cannot be used with browser_mode='builtin'. "
|
||||
"Stealth mode requires a dedicated browser instance."
|
||||
)
|
||||
|
||||
@staticmethod
|
||||
def from_kwargs(kwargs: dict) -> "BrowserConfig":
|
||||
@@ -537,6 +548,7 @@ class BrowserConfig:
|
||||
extra_args=kwargs.get("extra_args", []),
|
||||
debugging_port=kwargs.get("debugging_port", 9222),
|
||||
host=kwargs.get("host", "localhost"),
|
||||
enable_stealth=kwargs.get("enable_stealth", False),
|
||||
)
|
||||
|
||||
def to_dict(self):
|
||||
@@ -571,6 +583,7 @@ class BrowserConfig:
|
||||
"verbose": self.verbose,
|
||||
"debugging_port": self.debugging_port,
|
||||
"host": self.host,
|
||||
"enable_stealth": self.enable_stealth,
|
||||
}
|
||||
|
||||
|
||||
|
||||
2450
crawl4ai/async_crawler_strategy.back.py
Normal file
File diff suppressed because it is too large
@@ -21,6 +21,7 @@ from .async_logger import AsyncLogger
|
||||
from .ssl_certificate import SSLCertificate
|
||||
from .user_agent_generator import ValidUAGenerator
|
||||
from .browser_manager import BrowserManager
|
||||
from .browser_adapter import BrowserAdapter, PlaywrightAdapter, UndetectedAdapter
|
||||
|
||||
import aiofiles
|
||||
import aiohttp
|
||||
@@ -71,7 +72,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, **kwargs
|
||||
self, browser_config: BrowserConfig = None, logger: AsyncLogger = None, browser_adapter: BrowserAdapter = None, **kwargs
|
||||
):
|
||||
"""
|
||||
Initialize the AsyncPlaywrightCrawlerStrategy with a browser configuration.
|
||||
@@ -80,11 +81,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
browser_config (BrowserConfig): Configuration object containing browser settings.
|
||||
If None, will be created from kwargs for backwards compatibility.
|
||||
logger: Logger instance for recording events and errors.
|
||||
browser_adapter (BrowserAdapter): Browser adapter for handling browser-specific operations.
|
||||
If None, defaults to PlaywrightAdapter.
|
||||
**kwargs: Additional arguments for backwards compatibility and extending functionality.
|
||||
"""
|
||||
# Initialize browser config, either from provided object or kwargs
|
||||
self.browser_config = browser_config or BrowserConfig.from_kwargs(kwargs)
|
||||
self.logger = logger
|
||||
|
||||
# Initialize browser adapter
|
||||
self.adapter = browser_adapter or PlaywrightAdapter()
|
||||
|
||||
# Initialize session management
|
||||
self._downloaded_files = []
|
||||
@@ -104,7 +110,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
# Initialize browser manager with config
|
||||
self.browser_manager = BrowserManager(
|
||||
browser_config=self.browser_config, logger=self.logger
|
||||
browser_config=self.browser_config,
|
||||
logger=self.logger,
|
||||
use_undetected=isinstance(self.adapter, UndetectedAdapter)
|
||||
)
|
||||
|
||||
async def __aenter__(self):
|
||||
@@ -322,7 +330,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
|
||||
try:
|
||||
result = await page.evaluate(wrapper_js)
|
||||
result = await self.adapter.evaluate(page, wrapper_js)
|
||||
return result
|
||||
except Exception as e:
|
||||
if "Error evaluating condition" in str(e):
|
||||
@@ -367,7 +375,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
# Replace the iframe with a div containing the extracted content
|
||||
_iframe = iframe_content.replace("`", "\\`")
|
||||
await page.evaluate(
|
||||
await self.adapter.evaluate(page,
|
||||
f"""
|
||||
() => {{
|
||||
const iframe = document.getElementById('iframe-{i}');
|
||||
@@ -628,91 +636,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
page.on("requestfailed", handle_request_failed_capture)
|
||||
|
||||
# Console Message Capturing
|
||||
handle_console = None
|
||||
handle_error = None
|
||||
if config.capture_console_messages:
|
||||
def handle_console_capture(msg):
|
||||
try:
|
||||
message_type = "unknown"
|
||||
try:
|
||||
message_type = msg.type
|
||||
except:
|
||||
pass
|
||||
|
||||
message_text = "unknown"
|
||||
try:
|
||||
message_text = msg.text
|
||||
except:
|
||||
pass
|
||||
|
||||
# Basic console message with minimal content
|
||||
entry = {
|
||||
"type": message_type,
|
||||
"text": message_text,
|
||||
"timestamp": time.time()
|
||||
}
|
||||
|
||||
captured_console.append(entry)
|
||||
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
|
||||
# Still add something to the list even on error
|
||||
captured_console.append({
|
||||
"type": "console_capture_error",
|
||||
"error": str(e),
|
||||
"timestamp": time.time()
|
||||
})
|
||||
|
||||
def handle_pageerror_capture(err):
|
||||
try:
|
||||
error_message = "Unknown error"
|
||||
try:
|
||||
error_message = err.message
|
||||
except:
|
||||
pass
|
||||
|
||||
error_stack = ""
|
||||
try:
|
||||
error_stack = err.stack
|
||||
except:
|
||||
pass
|
||||
|
||||
captured_console.append({
|
||||
"type": "error",
|
||||
"text": error_message,
|
||||
"stack": error_stack,
|
||||
"timestamp": time.time()
|
||||
})
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE")
|
||||
captured_console.append({
|
||||
"type": "pageerror_capture_error",
|
||||
"error": str(e),
|
||||
"timestamp": time.time()
|
||||
})
|
||||
|
||||
# Add event listeners directly
|
||||
page.on("console", handle_console_capture)
|
||||
page.on("pageerror", handle_pageerror_capture)
|
||||
# Set up console capture using adapter
|
||||
handle_console = await self.adapter.setup_console_capture(page, captured_console)
|
||||
handle_error = await self.adapter.setup_error_capture(page, captured_console)
|
||||
|
||||
# Set up console logging if requested
|
||||
if config.log_console:
|
||||
def log_consol(
|
||||
msg, console_log_type="debug"
|
||||
): # Corrected the parameter syntax
|
||||
if console_log_type == "error":
|
||||
self.logger.error(
|
||||
message=f"Console error: {msg}", # Use f-string for variable interpolation
|
||||
tag="CONSOLE"
|
||||
)
|
||||
elif console_log_type == "debug":
|
||||
self.logger.debug(
|
||||
message=f"Console: {msg}", # Use f-string for variable interpolation
|
||||
tag="CONSOLE"
|
||||
)
|
||||
|
||||
page.on("console", log_consol)
|
||||
page.on("pageerror", lambda e: log_consol(e, "error"))
|
||||
# Note: For undetected browsers, console logging won't work directly
|
||||
# but captured messages can still be logged after retrieval
|
||||
|
||||
try:
|
||||
# Get SSL certificate information if requested and URL is HTTPS
|
||||
@@ -998,7 +931,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
await page.wait_for_load_state("domcontentloaded", timeout=5)
|
||||
except PlaywrightTimeoutError:
|
||||
pass
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
await self.adapter.evaluate(page, update_image_dimensions_js)
|
||||
except Exception as e:
|
||||
self.logger.error(
|
||||
message="Error updating image dimensions: {error}",
|
||||
@@ -1027,7 +960,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
for selector in selectors:
|
||||
try:
|
||||
content = await page.evaluate(
|
||||
content = await self.adapter.evaluate(page,
|
||||
f"""Array.from(document.querySelectorAll("{selector}"))
|
||||
.map(el => el.outerHTML)
|
||||
.join('')"""
|
||||
@@ -1085,6 +1018,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
await asyncio.sleep(delay)
|
||||
return await page.content()
|
||||
|
||||
# For undetected browsers, retrieve console messages before returning
|
||||
if config.capture_console_messages and hasattr(self.adapter, 'retrieve_console_messages'):
|
||||
final_messages = await self.adapter.retrieve_console_messages(page)
|
||||
captured_console.extend(final_messages)
|
||||
|
||||
# Return complete response
|
||||
return AsyncCrawlResponse(
|
||||
html=html,
|
||||
@@ -1123,8 +1061,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
page.remove_listener("response", handle_response_capture)
|
||||
page.remove_listener("requestfailed", handle_request_failed_capture)
|
||||
if config.capture_console_messages:
|
||||
page.remove_listener("console", handle_console_capture)
|
||||
page.remove_listener("pageerror", handle_pageerror_capture)
|
||||
# Retrieve any final console messages for undetected browsers
|
||||
if hasattr(self.adapter, 'retrieve_console_messages'):
|
||||
final_messages = await self.adapter.retrieve_console_messages(page)
|
||||
captured_console.extend(final_messages)
|
||||
|
||||
# Clean up console capture
|
||||
await self.adapter.cleanup_console_capture(page, handle_console, handle_error)
|
||||
|
||||
# Close the page
|
||||
await page.close()
|
||||
@@ -1354,7 +1297,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
|
||||
# Execute virtual scroll capture
|
||||
result = await page.evaluate(virtual_scroll_js, config.to_dict())
|
||||
result = await self.adapter.evaluate(page, virtual_scroll_js, config.to_dict())
|
||||
|
||||
if result.get("replaced", False):
|
||||
self.logger.success(
|
||||
@@ -1438,7 +1381,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
remove_overlays_js = load_js_script("remove_overlay_elements")
|
||||
|
||||
try:
|
||||
await page.evaluate(
|
||||
await self.adapter.evaluate(page,
|
||||
f"""
|
||||
(() => {{
|
||||
try {{
|
||||
@@ -1843,7 +1786,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
# When {script} contains statements (e.g., const link = …; link.click();),
|
||||
# this forms invalid JavaScript, causing Playwright execution error: SyntaxError: Unexpected token 'const'.
|
||||
# """
|
||||
result = await page.evaluate(
|
||||
result = await self.adapter.evaluate(page,
|
||||
f"""
|
||||
(async () => {{
|
||||
try {{
|
||||
@@ -1965,7 +1908,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
for script in scripts:
|
||||
try:
|
||||
# Execute the script and wait for network idle
|
||||
result = await page.evaluate(
|
||||
result = await self.adapter.evaluate(page,
|
||||
f"""
|
||||
(() => {{
|
||||
return new Promise((resolve) => {{
|
||||
@@ -2049,7 +1992,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
Returns:
|
||||
Boolean indicating visibility
|
||||
"""
|
||||
return await page.evaluate(
|
||||
return await self.adapter.evaluate(page,
|
||||
"""
|
||||
() => {
|
||||
const element = document.body;
|
||||
@@ -2090,7 +2033,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
Dict containing scroll status and position information
|
||||
"""
|
||||
try:
|
||||
result = await page.evaluate(
|
||||
result = await self.adapter.evaluate(page,
|
||||
f"""() => {{
|
||||
try {{
|
||||
const startX = window.scrollX;
|
||||
@@ -2147,7 +2090,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
Returns:
|
||||
Dict containing width and height of the page
|
||||
"""
|
||||
return await page.evaluate(
|
||||
return await self.adapter.evaluate(page,
|
||||
"""
|
||||
() => {
|
||||
const {scrollWidth, scrollHeight} = document.documentElement;
|
||||
@@ -2167,7 +2110,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
bool: True if page needs scrolling
|
||||
"""
|
||||
try:
|
||||
need_scroll = await page.evaluate(
|
||||
need_scroll = await self.adapter.evaluate(page,
|
||||
"""
|
||||
() => {
|
||||
const scrollHeight = document.documentElement.scrollHeight;
|
||||
@@ -2186,265 +2129,3 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
return True # Default to scrolling if check fails
|
||||
|
||||
|
||||
####################################################################################################
|
||||
# HTTP Crawler Strategy
|
||||
####################################################################################################
|
||||
|
||||
class HTTPCrawlerError(Exception):
|
||||
"""Base error class for HTTP crawler specific exceptions"""
|
||||
pass
|
||||
|
||||
|
||||
class ConnectionTimeoutError(HTTPCrawlerError):
|
||||
"""Raised when connection timeout occurs"""
|
||||
pass
|
||||
|
||||
|
||||
class HTTPStatusError(HTTPCrawlerError):
|
||||
"""Raised for unexpected status codes"""
|
||||
def __init__(self, status_code: int, message: str):
|
||||
self.status_code = status_code
|
||||
super().__init__(f"HTTP {status_code}: {message}")
|
||||
|
||||
|
||||
class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
Fast, lightweight HTTP-only crawler strategy optimized for memory efficiency.
|
||||
"""
|
||||
|
||||
__slots__ = ('logger', 'max_connections', 'dns_cache_ttl', 'chunk_size', '_session', 'hooks', 'browser_config')
|
||||
|
||||
DEFAULT_TIMEOUT: Final[int] = 30
|
||||
DEFAULT_CHUNK_SIZE: Final[int] = 64 * 1024
|
||||
DEFAULT_MAX_CONNECTIONS: Final[int] = min(32, (os.cpu_count() or 1) * 4)
|
||||
DEFAULT_DNS_CACHE_TTL: Final[int] = 300
|
||||
VALID_SCHEMES: Final = frozenset({'http', 'https', 'file', 'raw'})
|
||||
|
||||
_BASE_HEADERS: Final = MappingProxyType({
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
||||
'Accept-Language': 'en-US,en;q=0.5',
|
||||
'Accept-Encoding': 'gzip, deflate, br',
|
||||
'Connection': 'keep-alive',
|
||||
'Upgrade-Insecure-Requests': '1',
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
||||
})
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
browser_config: Optional[HTTPCrawlerConfig] = None,
|
||||
logger: Optional[AsyncLogger] = None,
|
||||
max_connections: int = DEFAULT_MAX_CONNECTIONS,
|
||||
dns_cache_ttl: int = DEFAULT_DNS_CACHE_TTL,
|
||||
chunk_size: int = DEFAULT_CHUNK_SIZE
|
||||
):
|
||||
"""Initialize the HTTP crawler with config"""
|
||||
self.browser_config = browser_config or HTTPCrawlerConfig()
|
||||
self.logger = logger
|
||||
self.max_connections = max_connections
|
||||
self.dns_cache_ttl = dns_cache_ttl
|
||||
self.chunk_size = chunk_size
|
||||
self._session: Optional[aiohttp.ClientSession] = None
|
||||
|
||||
self.hooks = {
|
||||
k: partial(self._execute_hook, k)
|
||||
for k in ('before_request', 'after_request', 'on_error')
|
||||
}
|
||||
|
||||
# Set default hooks
|
||||
self.set_hook('before_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('after_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('on_error', lambda *args, **kwargs: None)
|
||||
|
||||
|
||||
async def __aenter__(self) -> AsyncHTTPCrawlerStrategy:
|
||||
await self.start()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
|
||||
await self.close()
|
||||
|
||||
@contextlib.asynccontextmanager
|
||||
async def _session_context(self):
|
||||
try:
|
||||
if not self._session:
|
||||
await self.start()
|
||||
yield self._session
|
||||
finally:
|
||||
pass
|
||||
|
||||
def set_hook(self, hook_type: str, hook_func: Callable) -> None:
|
||||
if hook_type in self.hooks:
|
||||
self.hooks[hook_type] = partial(self._execute_hook, hook_type, hook_func)
|
||||
else:
|
||||
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||
|
||||
async def _execute_hook(
|
||||
self,
|
||||
hook_type: str,
|
||||
hook_func: Callable,
|
||||
*args: Any,
|
||||
**kwargs: Any
|
||||
) -> Any:
|
||||
if asyncio.iscoroutinefunction(hook_func):
|
||||
return await hook_func(*args, **kwargs)
|
||||
return hook_func(*args, **kwargs)
|
||||
|
||||
async def start(self) -> None:
|
||||
if not self._session:
|
||||
connector = aiohttp.TCPConnector(
|
||||
limit=self.max_connections,
|
||||
ttl_dns_cache=self.dns_cache_ttl,
|
||||
use_dns_cache=True,
|
||||
force_close=False
|
||||
)
|
||||
self._session = aiohttp.ClientSession(
|
||||
headers=dict(self._BASE_HEADERS),
|
||||
connector=connector,
|
||||
timeout=ClientTimeout(total=self.DEFAULT_TIMEOUT)
|
||||
)
|
||||
|
||||
async def close(self) -> None:
|
||||
if self._session and not self._session.closed:
|
||||
try:
|
||||
await asyncio.wait_for(self._session.close(), timeout=5.0)
|
||||
except asyncio.TimeoutError:
|
||||
if self.logger:
|
||||
self.logger.warning(
|
||||
message="Session cleanup timed out",
|
||||
tag="CLEANUP"
|
||||
)
|
||||
finally:
|
||||
self._session = None
|
||||
|
||||
async def _stream_file(self, path: str) -> AsyncGenerator[memoryview, None]:
|
||||
async with aiofiles.open(path, mode='rb') as f:
|
||||
while chunk := await f.read(self.chunk_size):
|
||||
yield memoryview(chunk)
|
||||
|
||||
async def _handle_file(self, path: str) -> AsyncCrawlResponse:
|
||||
if not os.path.exists(path):
|
||||
raise FileNotFoundError(f"Local file not found: {path}")
|
||||
|
||||
chunks = []
|
||||
async for chunk in self._stream_file(path):
|
||||
chunks.append(chunk.tobytes().decode('utf-8', errors='replace'))
|
||||
|
||||
return AsyncCrawlResponse(
|
||||
html=''.join(chunks),
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
async def _handle_raw(self, content: str) -> AsyncCrawlResponse:
|
||||
return AsyncCrawlResponse(
|
||||
html=content,
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
|
||||
async def _handle_http(
|
||||
self,
|
||||
url: str,
|
||||
config: CrawlerRunConfig
|
||||
) -> AsyncCrawlResponse:
|
||||
async with self._session_context() as session:
|
||||
timeout = ClientTimeout(
|
||||
total=config.page_timeout or self.DEFAULT_TIMEOUT,
|
||||
connect=10,
|
||||
sock_read=30
|
||||
)
|
||||
|
||||
headers = dict(self._BASE_HEADERS)
|
||||
if self.browser_config.headers:
|
||||
headers.update(self.browser_config.headers)
|
||||
|
||||
request_kwargs = {
|
||||
'timeout': timeout,
|
||||
'allow_redirects': self.browser_config.follow_redirects,
|
||||
'ssl': self.browser_config.verify_ssl,
|
||||
'headers': headers
|
||||
}
|
||||
|
||||
if self.browser_config.method == "POST":
|
||||
if self.browser_config.data:
|
||||
request_kwargs['data'] = self.browser_config.data
|
||||
if self.browser_config.json:
|
||||
request_kwargs['json'] = self.browser_config.json
|
||||
|
||||
await self.hooks['before_request'](url, request_kwargs)
|
||||
|
||||
try:
|
||||
async with session.request(self.browser_config.method, url, **request_kwargs) as response:
|
||||
content = memoryview(await response.read())
|
||||
|
||||
if not (200 <= response.status < 300):
|
||||
raise HTTPStatusError(
|
||||
response.status,
|
||||
f"Unexpected status code for {url}"
|
||||
)
|
||||
|
||||
encoding = response.charset
|
||||
if not encoding:
|
||||
encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
|
||||
|
||||
result = AsyncCrawlResponse(
|
||||
html=content.tobytes().decode(encoding, errors='replace'),
|
||||
response_headers=dict(response.headers),
|
||||
status_code=response.status,
|
||||
redirected_url=str(response.url)
|
||||
)
|
||||
|
||||
await self.hooks['after_request'](result)
|
||||
return result
|
||||
|
||||
except aiohttp.ServerTimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except aiohttp.ClientConnectorError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionError(f"Connection failed: {str(e)}")
|
||||
|
||||
except aiohttp.ClientError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP client error: {str(e)}")
|
||||
|
||||
except asyncio.exceptions.TimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP request failed: {str(e)}")
|
||||
|
||||
async def crawl(
|
||||
self,
|
||||
url: str,
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
**kwargs
|
||||
) -> AsyncCrawlResponse:
|
||||
config = config or CrawlerRunConfig.from_kwargs(kwargs)
|
||||
|
||||
parsed = urlparse(url)
|
||||
scheme = parsed.scheme.rstrip('/')
|
||||
|
||||
if scheme not in self.VALID_SCHEMES:
|
||||
raise ValueError(f"Unsupported URL scheme: {scheme}")
|
||||
|
||||
try:
|
||||
if scheme == 'file':
|
||||
return await self._handle_file(parsed.path)
|
||||
elif scheme == 'raw':
|
||||
return await self._handle_raw(parsed.path)
|
||||
else: # http or https
|
||||
return await self._handle_http(url, config)
|
||||
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.error(
|
||||
message="Crawl failed: {error}",
|
||||
tag="CRAWL",
|
||||
params={"error": str(e), "url": url}
|
||||
)
|
||||
raise
|
||||
293
crawl4ai/browser_adapter.py
Normal file
@@ -0,0 +1,293 @@
|
||||
# browser_adapter.py
|
||||
"""
|
||||
Browser adapter for Crawl4AI to support both Playwright and undetected browsers
|
||||
with minimal changes to the existing codebase.
|
||||
"""
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import List, Dict, Any, Optional, Callable
|
||||
import time
|
||||
import json
|
||||
|
||||
# Import both, but use conditionally
|
||||
try:
|
||||
from playwright.async_api import Page
|
||||
except ImportError:
|
||||
Page = Any
|
||||
|
||||
try:
|
||||
from patchright.async_api import Page as UndetectedPage
|
||||
except ImportError:
|
||||
UndetectedPage = Any
|
||||
|
||||
|
||||
class BrowserAdapter(ABC):
|
||||
"""Abstract adapter for browser-specific operations"""
|
||||
|
||||
@abstractmethod
|
||||
async def evaluate(self, page: Page, expression: str, arg: Any = None) -> Any:
|
||||
"""Execute JavaScript in the page"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def setup_console_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup console message capturing, returns handler function if needed"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def setup_error_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup error capturing, returns handler function if needed"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def retrieve_console_messages(self, page: Page) -> List[Dict]:
|
||||
"""Retrieve captured console messages (for undetected browsers)"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def cleanup_console_capture(self, page: Page, handle_console: Optional[Callable], handle_error: Optional[Callable]):
|
||||
"""Clean up console event listeners"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def get_imports(self) -> tuple:
|
||||
"""Get the appropriate imports for this adapter"""
|
||||
pass
|
||||
|
||||
|
||||
class PlaywrightAdapter(BrowserAdapter):
|
||||
"""Adapter for standard Playwright"""
|
||||
|
||||
async def evaluate(self, page: Page, expression: str, arg: Any = None) -> Any:
|
||||
"""Standard Playwright evaluate"""
|
||||
if arg is not None:
|
||||
return await page.evaluate(expression, arg)
|
||||
return await page.evaluate(expression)
|
||||
|
||||
async def setup_console_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup console capture using Playwright's event system"""
|
||||
def handle_console_capture(msg):
|
||||
try:
|
||||
message_type = "unknown"
|
||||
try:
|
||||
message_type = msg.type
|
||||
except:
|
||||
pass
|
||||
|
||||
message_text = "unknown"
|
||||
try:
|
||||
message_text = msg.text
|
||||
except:
|
||||
pass
|
||||
|
||||
entry = {
|
||||
"type": message_type,
|
||||
"text": message_text,
|
||||
"timestamp": time.time()
|
||||
}
|
||||
|
||||
captured_console.append(entry)
|
||||
|
||||
except Exception as e:
|
||||
captured_console.append({
|
||||
"type": "console_capture_error",
|
||||
"error": str(e),
|
||||
"timestamp": time.time()
|
||||
})
|
||||
|
||||
page.on("console", handle_console_capture)
|
||||
return handle_console_capture
|
||||
|
||||
async def setup_error_capture(self, page: Page, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup error capture using Playwright's event system"""
|
||||
def handle_pageerror_capture(err):
|
||||
try:
|
||||
error_message = "Unknown error"
|
||||
try:
|
||||
error_message = err.message
|
||||
except:
|
||||
pass
|
||||
|
||||
error_stack = ""
|
||||
try:
|
||||
error_stack = err.stack
|
||||
except:
|
||||
pass
|
||||
|
||||
captured_console.append({
|
||||
"type": "error",
|
||||
"text": error_message,
|
||||
"stack": error_stack,
|
||||
"timestamp": time.time()
|
||||
})
|
||||
except Exception as e:
|
||||
captured_console.append({
|
||||
"type": "pageerror_capture_error",
|
||||
"error": str(e),
|
||||
"timestamp": time.time()
|
||||
})
|
||||
|
||||
page.on("pageerror", handle_pageerror_capture)
|
||||
return handle_pageerror_capture
|
||||
|
||||
async def retrieve_console_messages(self, page: Page) -> List[Dict]:
|
||||
"""Not needed for Playwright - messages are captured via events"""
|
||||
return []
|
||||
|
||||
async def cleanup_console_capture(self, page: Page, handle_console: Optional[Callable], handle_error: Optional[Callable]):
|
||||
"""Remove event listeners"""
|
||||
if handle_console:
|
||||
page.remove_listener("console", handle_console)
|
||||
if handle_error:
|
||||
page.remove_listener("pageerror", handle_error)
|
||||
|
||||
def get_imports(self) -> tuple:
|
||||
"""Return Playwright imports"""
|
||||
from playwright.async_api import Page, Error
|
||||
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
|
||||
return Page, Error, PlaywrightTimeoutError
|
||||
|
||||
|
||||
class UndetectedAdapter(BrowserAdapter):
|
||||
"""Adapter for undetected browser automation with stealth features"""
|
||||
|
||||
def __init__(self):
|
||||
self._console_script_injected = {}
|
||||
|
||||
async def evaluate(self, page: UndetectedPage, expression: str, arg: Any = None) -> Any:
|
||||
"""Undetected browser evaluate with isolated context"""
|
||||
# For most evaluations, use isolated context for stealth
|
||||
# Only use non-isolated when we need to access our injected console capture
|
||||
isolated = not (
|
||||
"__console" in expression or
|
||||
"__captured" in expression or
|
||||
"__error" in expression or
|
||||
"window.__" in expression
|
||||
)
|
||||
|
||||
if arg is not None:
|
||||
return await page.evaluate(expression, arg, isolated_context=isolated)
|
||||
return await page.evaluate(expression, isolated_context=isolated)
|
||||
|
||||
async def setup_console_capture(self, page: UndetectedPage, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup console capture using JavaScript injection for undetected browsers"""
|
||||
if not self._console_script_injected.get(page, False):
|
||||
await page.add_init_script("""
|
||||
// Initialize console capture
|
||||
window.__capturedConsole = [];
|
||||
window.__capturedErrors = [];
|
||||
|
||||
// Store original console methods
|
||||
const originalConsole = {};
|
||||
['log', 'info', 'warn', 'error', 'debug'].forEach(method => {
|
||||
originalConsole[method] = console[method];
|
||||
console[method] = function(...args) {
|
||||
try {
|
||||
window.__capturedConsole.push({
|
||||
type: method,
|
||||
text: args.map(arg => {
|
||||
try {
|
||||
if (typeof arg === 'object') {
|
||||
return JSON.stringify(arg);
|
||||
}
|
||||
return String(arg);
|
||||
} catch (e) {
|
||||
return '[Object]';
|
||||
}
|
||||
}).join(' '),
|
||||
timestamp: Date.now()
|
||||
});
|
||||
} catch (e) {
|
||||
// Fail silently to avoid detection
|
||||
}
|
||||
|
||||
// Call original method
|
||||
originalConsole[method].apply(console, args);
|
||||
};
|
||||
});
|
||||
""")
|
||||
self._console_script_injected[page] = True
|
||||
|
||||
return None # No handler function needed for undetected browser
|
||||
|
||||
async def setup_error_capture(self, page: UndetectedPage, captured_console: List[Dict]) -> Optional[Callable]:
|
||||
"""Setup error capture using JavaScript injection for undetected browsers"""
|
||||
if not self._console_script_injected.get(page, False):
|
||||
await page.add_init_script("""
|
||||
// Capture errors
|
||||
window.addEventListener('error', (event) => {
|
||||
try {
|
||||
window.__capturedErrors.push({
|
||||
type: 'error',
|
||||
text: event.message,
|
||||
stack: event.error ? event.error.stack : '',
|
||||
filename: event.filename,
|
||||
lineno: event.lineno,
|
||||
colno: event.colno,
|
||||
timestamp: Date.now()
|
||||
});
|
||||
} catch (e) {
|
||||
// Fail silently
|
||||
}
|
||||
});
|
||||
|
||||
// Capture unhandled promise rejections
|
||||
window.addEventListener('unhandledrejection', (event) => {
|
||||
try {
|
||||
window.__capturedErrors.push({
|
||||
type: 'unhandledrejection',
|
||||
text: event.reason ? String(event.reason) : 'Unhandled Promise Rejection',
|
||||
stack: event.reason && event.reason.stack ? event.reason.stack : '',
|
||||
timestamp: Date.now()
|
||||
});
|
||||
} catch (e) {
|
||||
// Fail silently
|
||||
}
|
||||
});
|
||||
""")
|
||||
self._console_script_injected[page] = True
|
||||
|
||||
return None # No handler function needed for undetected browser
|
||||
|
||||
async def retrieve_console_messages(self, page: UndetectedPage) -> List[Dict]:
|
||||
"""Retrieve captured console messages and errors from the page"""
|
||||
messages = []
|
||||
|
||||
try:
|
||||
# Get console messages
|
||||
console_messages = await page.evaluate(
|
||||
"() => { const msgs = window.__capturedConsole || []; window.__capturedConsole = []; return msgs; }",
|
||||
isolated_context=False
|
||||
)
|
||||
messages.extend(console_messages)
|
||||
|
||||
# Get errors
|
||||
errors = await page.evaluate(
|
||||
"() => { const errs = window.__capturedErrors || []; window.__capturedErrors = []; return errs; }",
|
||||
isolated_context=False
|
||||
)
|
||||
messages.extend(errors)
|
||||
|
||||
# Convert timestamps from JS to Python format
|
||||
for msg in messages:
|
||||
if 'timestamp' in msg and isinstance(msg['timestamp'], (int, float)):
|
||||
msg['timestamp'] = msg['timestamp'] / 1000.0 # Convert from ms to seconds
|
||||
|
||||
except Exception:
|
||||
# If retrieval fails, return empty list
|
||||
pass
|
||||
|
||||
return messages
|
||||
|
||||
async def cleanup_console_capture(self, page: UndetectedPage, handle_console: Optional[Callable], handle_error: Optional[Callable]):
|
||||
"""Clean up for undetected browser - retrieve final messages"""
|
||||
# For undetected browser, we don't have event listeners to remove
|
||||
# but we should retrieve any final messages
|
||||
final_messages = await self.retrieve_console_messages(page)
|
||||
return final_messages
|
||||
|
||||
def get_imports(self) -> tuple:
|
||||
"""Return undetected browser imports"""
|
||||
from patchright.async_api import Page, Error
|
||||
from patchright.async_api import TimeoutError as PlaywrightTimeoutError
|
||||
return Page, Error, PlaywrightTimeoutError
|
||||
@@ -573,21 +573,26 @@ class BrowserManager:
|
||||
_playwright_instance = None
|
||||
|
||||
@classmethod
|
||||
async def get_playwright(cls):
|
||||
from playwright.async_api import async_playwright
|
||||
async def get_playwright(cls, use_undetected: bool = False):
|
||||
if use_undetected:
|
||||
from patchright.async_api import async_playwright
|
||||
else:
|
||||
from playwright.async_api import async_playwright
|
||||
cls._playwright_instance = await async_playwright().start()
|
||||
return cls._playwright_instance
|
||||
|
||||
def __init__(self, browser_config: BrowserConfig, logger=None):
|
||||
def __init__(self, browser_config: BrowserConfig, logger=None, use_undetected: bool = False):
|
||||
"""
|
||||
Initialize the BrowserManager with a browser configuration.
|
||||
|
||||
Args:
|
||||
browser_config (BrowserConfig): Configuration object containing all browser settings
|
||||
logger: Logger instance for recording events and errors
|
||||
use_undetected (bool): Whether to use undetected browser (Patchright)
|
||||
"""
|
||||
self.config: BrowserConfig = browser_config
|
||||
self.logger = logger
|
||||
self.use_undetected = use_undetected
|
||||
|
||||
# Browser state
|
||||
self.browser = None
|
||||
@@ -601,7 +606,11 @@ class BrowserManager:
|
||||
|
||||
# Keep track of contexts by a "config signature," so each unique config reuses a single context
|
||||
self.contexts_by_config = {}
|
||||
self._contexts_lock = asyncio.Lock()
|
||||
self._contexts_lock = asyncio.Lock()
|
||||
|
||||
# Stealth-related attributes
|
||||
self._stealth_instance = None
|
||||
self._stealth_cm = None
|
||||
|
||||
# Initialize ManagedBrowser if needed
|
||||
if self.config.use_managed_browser:
|
||||
@@ -630,9 +639,21 @@ class BrowserManager:
|
||||
if self.playwright is not None:
|
||||
await self.close()
|
||||
|
||||
from playwright.async_api import async_playwright
|
||||
if self.use_undetected:
|
||||
from patchright.async_api import async_playwright
|
||||
else:
|
||||
from playwright.async_api import async_playwright
|
||||
|
||||
self.playwright = await async_playwright().start()
|
||||
# Initialize playwright with or without stealth
|
||||
if self.config.enable_stealth and not self.use_undetected:
|
||||
# Import stealth only when needed
|
||||
from playwright_stealth import Stealth
|
||||
# Use the recommended stealth wrapper approach
|
||||
self._stealth_instance = Stealth()
|
||||
self._stealth_cm = self._stealth_instance.use_async(async_playwright())
|
||||
self.playwright = await self._stealth_cm.__aenter__()
|
||||
else:
|
||||
self.playwright = await async_playwright().start()
|
||||
|
||||
if self.config.cdp_url or self.config.use_managed_browser:
|
||||
self.config.use_managed_browser = True
|
||||
@@ -1094,5 +1115,19 @@ class BrowserManager:
|
||||
self.managed_browser = None
|
||||
|
||||
if self.playwright:
|
||||
await self.playwright.stop()
|
||||
# Handle stealth context manager cleanup if it exists
|
||||
if hasattr(self, '_stealth_cm') and self._stealth_cm is not None:
|
||||
try:
|
||||
await self._stealth_cm.__aexit__(None, None, None)
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.error(
|
||||
message="Error closing stealth context: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
self._stealth_cm = None
|
||||
self._stealth_instance = None
|
||||
else:
|
||||
await self.playwright.stop()
|
||||
self.playwright = None
|
||||
|
||||
@@ -119,6 +119,32 @@ def install_playwright():
|
||||
logger.warning(
|
||||
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
|
||||
)
|
||||
|
||||
# Install Patchright browsers for undetected browser support
|
||||
logger.info("Installing Patchright browsers for undetected mode...", tag="INIT")
|
||||
try:
|
||||
subprocess.check_call(
|
||||
[
|
||||
sys.executable,
|
||||
"-m",
|
||||
"patchright",
|
||||
"install",
|
||||
"--with-deps",
|
||||
"--force",
|
||||
"chromium",
|
||||
]
|
||||
)
|
||||
logger.success(
|
||||
"Patchright installation completed successfully.", tag="COMPLETE"
|
||||
)
|
||||
except subprocess.CalledProcessError:
|
||||
logger.warning(
|
||||
f"Please run '{sys.executable} -m patchright install --with-deps' manually after the installation."
|
||||
)
|
||||
except Exception:
|
||||
logger.warning(
|
||||
f"Please run '{sys.executable} -m patchright install --with-deps' manually after the installation."
|
||||
)
|
||||
|
||||
|
||||
def run_migration():
|
||||
|
||||
@@ -1056,7 +1056,7 @@ Your output must:
|
||||
</output_requirements>
|
||||
"""
|
||||
|
||||
GENERATE_SCRIPT_PROMPT = """You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction.
|
||||
GENERATE_SCRIPT_PROMPT = r"""You are a world-class browser automation specialist. Your sole purpose is to convert a natural language objective and a snippet of HTML into the most **efficient, robust, and simple** script possible to prepare a web page for data extraction.
|
||||
|
||||
Your scripts run **before the crawl** to handle dynamic content, user interactions, and other obstacles. You are a master of two tools: raw **JavaScript** and the high-level **Crawl4ai Script (c4a)**.
|
||||
|
||||
|
||||
170
docs/blog/release-v0.7.3.md
Normal file
@@ -0,0 +1,170 @@
# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update

*August 6, 2025 • 5 min read*

---

Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.

## 🎯 What's New at a Glance

- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
- **Bug Fixes**: Resolved several critical issues for better stability
- **Documentation Updates**: Clearer examples and improved API documentation

## 🎨 Multi-URL Configurations: One Size Doesn't Fit All

**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.

**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.

### Technical Implementation

```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
|
||||
|
||||
# Define specialized configs for different content types
|
||||
configs = [
|
||||
# Documentation sites - aggressive caching, include links
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*docs*", "*documentation*"],
|
||||
cache_mode="write",
|
||||
markdown_generator_options={"include_links": True}
|
||||
),
|
||||
|
||||
# News/blog sites - fresh content, scroll for lazy loading
|
||||
CrawlerRunConfig(
|
||||
url_matcher=lambda url: 'blog' in url or 'news' in url,
|
||||
cache_mode="bypass",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight/2);"
|
||||
),
|
||||
|
||||
# API endpoints - structured extraction
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*.json", "*api*"],
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o-mini",
|
||||
extraction_type="structured"
|
||||
)
|
||||
),
|
||||
|
||||
# Default fallback for everything else
|
||||
CrawlerRunConfig() # No url_matcher = matches everything
|
||||
]
|
||||
|
||||
# Crawl multiple URLs with appropriate configs
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
results = await crawler.arun_many(
|
||||
urls=[
|
||||
"https://docs.python.org/3/", # → Uses documentation config
|
||||
"https://blog.python.org/", # → Uses blog config
|
||||
"https://api.github.com/users", # → Uses API config
|
||||
"https://example.com/" # → Uses default config
|
||||
],
|
||||
config=configs
|
||||
)
|
||||
```
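
A quick way to sanity-check a mixed batch like the one above is to loop over the returned results (each entry is a `CrawlResult`); this is a minimal follow-up sketch using the same `success`, `url`, and `error_message` fields shown elsewhere in these docs.

```python
# Inspect the batch returned by arun_many(): report which URLs crawled cleanly.
for result in results:
    status = "ok" if result.success else f"failed: {result.error_message}"
    print(f"{result.url} -> {status}")
```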
**Matching Capabilities:**
- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
- **Function Matchers**: Lambda functions for complex logic
- **Mixed Matchers**: Combine strings and functions with AND/OR logic (see the sketch after this list)
- **Fallback Support**: Default config when nothing matches
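
For the mixed-matcher case, here is a minimal sketch (not part of the release notes themselves) of combining a wildcard string with a function matcher. It assumes `CrawlerRunConfig` takes a `match_mode` parameter that uses the `MatchMode` enum imported above; verify the exact parameter names against the current API reference.

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Hypothetical sketch: match URLs that end in .pdf AND sit under /reports/.
# Assumes url_matcher accepts a mixed list of patterns and callables and that
# match_mode chooses how they combine; check the API docs before relying on it.
pdf_reports_config = CrawlerRunConfig(
    url_matcher=["*.pdf", lambda url: "/reports/" in url],
    match_mode=MatchMode.AND,  # require every matcher to pass (OR accepts any match)
)
```
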
**Expected Real-World Impact:**
|
||||
- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
|
||||
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
|
||||
- **Reduced Complexity**: No more if/else forests in your extraction code
|
||||
- **Better Performance**: Each URL gets exactly the processing it needs
|
||||
|
||||
## 🐳 Docker: Flexible LLM Provider Configuration
|
||||
|
||||
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
|
||||
|
||||
**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
|
||||
|
||||
### Deployment Flexibility
|
||||
|
||||
```bash
|
||||
# Option 1: Direct environment variables
|
||||
docker run -d \
|
||||
-e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
|
||||
-e GROQ_API_KEY="your-key" \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
|
||||
# Option 2: Using .llm.env file (recommended for production)
|
||||
# Create .llm.env file:
|
||||
# LLM_PROVIDER=openai/gpt-4o-mini
|
||||
# OPENAI_API_KEY=your-openai-key
|
||||
# GROQ_API_KEY=your-groq-key
|
||||
|
||||
docker run -d \
|
||||
--env-file .llm.env \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
Override per request when needed:
|
||||
```python
|
||||
# Use default provider from .llm.env
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://example.com",
|
||||
"extraction_strategy": {"type": "llm"}
|
||||
})
|
||||
|
||||
# Override to use different provider for this specific request
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://complex-page.com",
|
||||
"extraction_strategy": {
|
||||
"type": "llm",
|
||||
"provider": "openai/gpt-4" # Override default
|
||||
}
|
||||
})
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex
|
||||
- **A/B Testing**: Compare provider performance without deployment changes
|
||||
- **Fallback Strategies**: Switch providers on-the-fly during outages
|
||||
- **Development Flexibility**: Test locally with one provider, deploy with another
|
||||
- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands
|
||||
|
||||
## 🔧 Bug Fixes & Improvements
|
||||
|
||||
This release includes several important bug fixes that improve stability and reliability:
|
||||
|
||||
- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
|
||||
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
|
||||
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
|
||||
- **Table Extraction**: Improved table detection and extraction accuracy
|
||||
- **Error Handling**: Better error messages and recovery from network failures
|
||||
|
||||
## 📚 Documentation Enhancements
|
||||
|
||||
Based on community feedback, we've updated:
|
||||
- Clearer examples for multi-URL configuration
|
||||
- Improved CrawlResult documentation with all available fields
|
||||
- Fixed typos and inconsistencies across documentation
|
||||
- Added real-world URLs in examples for better understanding
|
||||
- New comprehensive demo showcasing all v0.7.3 features
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and the entire community for feedback and bug reports.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -3,8 +3,8 @@ C4A-Script API Usage Examples
|
||||
Shows how to use the new Result-based API in various scenarios
|
||||
"""
|
||||
|
||||
from c4a_compile import compile, validate, compile_file
|
||||
from c4a_result import CompilationResult, ValidationResult
|
||||
from crawl4ai.script.c4a_compile import compile, validate, compile_file
|
||||
from crawl4ai.script.c4a_result import CompilationResult, ValidationResult
|
||||
import json
|
||||
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@ C4A-Script Hello World
|
||||
A concise example showing how to use the C4A-Script compiler
|
||||
"""
|
||||
|
||||
from c4a_compile import compile
|
||||
from crawl4ai.script.c4a_compile import compile
|
||||
|
||||
# Define your C4A-Script
|
||||
script = """
|
||||
|
||||
@@ -3,7 +3,7 @@ C4A-Script Hello World - Error Example
|
||||
Shows how error handling works
|
||||
"""
|
||||
|
||||
from c4a_compile import compile
|
||||
from crawl4ai.script.c4a_compile import compile
|
||||
|
||||
# Define a script with an error (missing THEN)
|
||||
script = """
|
||||
|
||||
57
docs/examples/hello_world_undetected.py
Normal file
@@ -0,0 +1,57 @@
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
DefaultMarkdownGenerator,
|
||||
PruningContentFilter,
|
||||
CrawlResult,
|
||||
UndetectedAdapter
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
|
||||
async def main():
|
||||
# Create browser config
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# Create the undetected adapter
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
|
||||
# Create the crawler strategy with the undetected adapter
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=undetected_adapter
|
||||
)
|
||||
|
||||
# Create the crawler with our custom strategy
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
# Configure the crawl
|
||||
crawler_config = CrawlerRunConfig(
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter()
|
||||
),
|
||||
capture_console_messages=True, # Enable console capture to test adapter
|
||||
)
|
||||
|
||||
# Test on a site that typically detects bots
|
||||
print("Testing undetected adapter...")
|
||||
result: CrawlResult = await crawler.arun(
|
||||
url="https://www.helloworld.org",
|
||||
config=crawler_config
|
||||
)
|
||||
|
||||
print(f"Status: {result.status_code}")
|
||||
print(f"Success: {result.success}")
|
||||
print(f"Console messages captured: {len(result.console_messages or [])}")
|
||||
print(f"Markdown content (first 500 chars):\n{result.markdown.raw_markdown[:500]}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
59
docs/examples/simple_anti_bot_examples.py
Normal file
@@ -0,0 +1,59 @@
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, UndetectedAdapter
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
# Example 1: Stealth Mode
|
||||
async def stealth_mode_example():
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=False
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
return result.html[:500]
|
||||
|
||||
# Example 2: Undetected Browser
|
||||
async def undetected_browser_example():
|
||||
browser_config = BrowserConfig(
|
||||
headless=False
|
||||
)
|
||||
|
||||
adapter = UndetectedAdapter()
|
||||
strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=adapter
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
return result.html[:500]
|
||||
|
||||
# Example 3: Both Combined
|
||||
async def combined_example():
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=False
|
||||
)
|
||||
|
||||
adapter = UndetectedAdapter()
|
||||
strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=adapter
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
return result.html[:500]
|
||||
|
||||
# Run examples
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(stealth_mode_example())
|
||||
asyncio.run(undetected_browser_example())
|
||||
asyncio.run(combined_example())
|
||||
522
docs/examples/stealth_mode_example.py
Normal file
@@ -0,0 +1,522 @@
|
||||
"""
|
||||
Stealth Mode Example with Crawl4AI
|
||||
|
||||
This example demonstrates how to use the stealth mode feature to bypass basic bot detection.
|
||||
The stealth mode uses playwright-stealth to modify browser fingerprints and behaviors
|
||||
that are commonly used to detect automated browsers.
|
||||
|
||||
Key features demonstrated:
|
||||
1. Comparing crawling with and without stealth mode
|
||||
2. Testing against bot detection sites
|
||||
3. Accessing sites that block automated browsers
|
||||
4. Best practices for stealth crawling
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from typing import Dict, Any
|
||||
from colorama import Fore, Style, init
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
# Initialize colorama for colored output
|
||||
init()
|
||||
|
||||
# Create a logger for better output
|
||||
logger = AsyncLogger(verbose=True)
|
||||
|
||||
|
||||
async def test_bot_detection(use_stealth: bool = False) -> Dict[str, Any]:
|
||||
"""Test against a bot detection service"""
|
||||
|
||||
logger.info(
|
||||
f"Testing bot detection with stealth={'ON' if use_stealth else 'OFF'}",
|
||||
tag="STEALTH"
|
||||
)
|
||||
|
||||
# Configure browser with or without stealth
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Use False to see the browser in action
|
||||
enable_stealth=use_stealth,
|
||||
viewport_width=1280,
|
||||
viewport_height=800
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# JavaScript to extract bot detection results
|
||||
detection_script = """
|
||||
// Comprehensive bot detection checks
|
||||
(() => {
|
||||
const detectionResults = {
|
||||
// Basic WebDriver detection
|
||||
webdriver: navigator.webdriver,
|
||||
|
||||
// Chrome specific
|
||||
chrome: !!window.chrome,
|
||||
chromeRuntime: !!window.chrome?.runtime,
|
||||
|
||||
// Automation indicators
|
||||
automationControlled: navigator.webdriver,
|
||||
|
||||
// Permissions API
|
||||
permissionsPresent: !!navigator.permissions?.query,
|
||||
|
||||
// Plugins
|
||||
pluginsLength: navigator.plugins.length,
|
||||
pluginsArray: Array.from(navigator.plugins).map(p => p.name),
|
||||
|
||||
// Languages
|
||||
languages: navigator.languages,
|
||||
language: navigator.language,
|
||||
|
||||
// User agent
|
||||
userAgent: navigator.userAgent,
|
||||
|
||||
// Screen and window properties
|
||||
screen: {
|
||||
width: screen.width,
|
||||
height: screen.height,
|
||||
availWidth: screen.availWidth,
|
||||
availHeight: screen.availHeight,
|
||||
colorDepth: screen.colorDepth,
|
||||
pixelDepth: screen.pixelDepth
|
||||
},
|
||||
|
||||
// WebGL vendor
|
||||
webglVendor: (() => {
|
||||
try {
|
||||
const canvas = document.createElement('canvas');
|
||||
const gl = canvas.getContext('webgl') || canvas.getContext('experimental-webgl');
|
||||
const ext = gl.getExtension('WEBGL_debug_renderer_info');
|
||||
return gl.getParameter(ext.UNMASKED_VENDOR_WEBGL);
|
||||
} catch (e) {
|
||||
return 'Error';
|
||||
}
|
||||
})(),
|
||||
|
||||
// Platform
|
||||
platform: navigator.platform,
|
||||
|
||||
// Hardware concurrency
|
||||
hardwareConcurrency: navigator.hardwareConcurrency,
|
||||
|
||||
// Device memory
|
||||
deviceMemory: navigator.deviceMemory,
|
||||
|
||||
// Connection
|
||||
connection: navigator.connection?.effectiveType
|
||||
};
|
||||
|
||||
// Log results for console capture
|
||||
console.log('DETECTION_RESULTS:', JSON.stringify(detectionResults, null, 2));
|
||||
|
||||
// Return results
|
||||
return detectionResults;
|
||||
})();
|
||||
"""
|
||||
|
||||
# Crawl bot detection test page
|
||||
config = CrawlerRunConfig(
|
||||
js_code=detection_script,
|
||||
capture_console_messages=True,
|
||||
wait_until="networkidle",
|
||||
delay_before_return_html=2.0 # Give time for all checks to complete
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://bot.sannysoft.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
# Extract detection results from console
|
||||
detection_data = None
|
||||
for msg in result.console_messages or []:
|
||||
if "DETECTION_RESULTS:" in msg.get("text", ""):
|
||||
try:
|
||||
json_str = msg["text"].replace("DETECTION_RESULTS:", "").strip()
|
||||
detection_data = json.loads(json_str)
|
||||
except:
|
||||
pass
|
||||
|
||||
# Also try to get from JavaScript execution result
|
||||
if not detection_data and result.js_execution_result:
|
||||
detection_data = result.js_execution_result
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"url": result.url,
|
||||
"detection_data": detection_data,
|
||||
"page_title": result.metadata.get("title", ""),
|
||||
"stealth_enabled": use_stealth
|
||||
}
|
||||
else:
|
||||
return {
|
||||
"success": False,
|
||||
"error": result.error_message,
|
||||
"stealth_enabled": use_stealth
|
||||
}
|
||||
|
||||
|
||||
async def test_cloudflare_site(use_stealth: bool = False) -> Dict[str, Any]:
|
||||
"""Test accessing a Cloudflare-protected site"""
|
||||
|
||||
logger.info(
|
||||
f"Testing Cloudflare site with stealth={'ON' if use_stealth else 'OFF'}",
|
||||
tag="STEALTH"
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=True, # Cloudflare detection works better in headless mode with stealth
|
||||
enable_stealth=use_stealth,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
wait_until="networkidle",
|
||||
page_timeout=30000, # 30 seconds
|
||||
delay_before_return_html=3.0
|
||||
)
|
||||
|
||||
# Test on a site that often shows Cloudflare challenges
|
||||
result = await crawler.arun(
|
||||
url="https://nowsecure.nl",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Check if we hit Cloudflare challenge
|
||||
cloudflare_detected = False
|
||||
if result.html:
|
||||
cloudflare_indicators = [
|
||||
"Checking your browser",
|
||||
"Just a moment",
|
||||
"cf-browser-verification",
|
||||
"cf-challenge",
|
||||
"ray ID"
|
||||
]
|
||||
cloudflare_detected = any(indicator in result.html for indicator in cloudflare_indicators)
|
||||
|
||||
return {
|
||||
"success": result.success,
|
||||
"url": result.url,
|
||||
"cloudflare_challenge": cloudflare_detected,
|
||||
"status_code": result.status_code,
|
||||
"page_title": result.metadata.get("title", "") if result.metadata else "",
|
||||
"stealth_enabled": use_stealth,
|
||||
"html_snippet": result.html[:500] if result.html else ""
|
||||
}
|
||||
|
||||
|
||||
async def test_anti_bot_site(use_stealth: bool = False) -> Dict[str, Any]:
|
||||
"""Test against sites with anti-bot measures"""
|
||||
|
||||
logger.info(
|
||||
f"Testing anti-bot site with stealth={'ON' if use_stealth else 'OFF'}",
|
||||
tag="STEALTH"
|
||||
)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
enable_stealth=use_stealth,
|
||||
# Additional browser arguments that help with stealth
|
||||
extra_args=[
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-features=site-per-process"
|
||||
] if not use_stealth else [] # These are automatically applied with stealth
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Some sites check for specific behaviors
|
||||
behavior_script = """
|
||||
(async () => {
|
||||
// Simulate human-like behavior
|
||||
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
|
||||
|
||||
// Random mouse movement
|
||||
const moveX = Math.random() * 100;
|
||||
const moveY = Math.random() * 100;
|
||||
|
||||
// Simulate reading time
|
||||
await sleep(1000 + Math.random() * 2000);
|
||||
|
||||
// Scroll slightly
|
||||
window.scrollBy(0, 100 + Math.random() * 200);
|
||||
|
||||
console.log('Human behavior simulation complete');
|
||||
return true;
|
||||
})()
|
||||
"""
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code=behavior_script,
|
||||
wait_until="networkidle",
|
||||
delay_before_return_html=5.0, # Longer delay to appear more human
|
||||
capture_console_messages=True
|
||||
)
|
||||
|
||||
# Test on a site that implements anti-bot measures
|
||||
result = await crawler.arun(
|
||||
url="https://www.g2.com/",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Check for common anti-bot blocks
|
||||
blocked_indicators = [
|
||||
"Access Denied",
|
||||
"403 Forbidden",
|
||||
"Security Check",
|
||||
"Verify you are human",
|
||||
"captcha",
|
||||
"challenge"
|
||||
]
|
||||
|
||||
blocked = False
|
||||
if result.html:
|
||||
blocked = any(indicator.lower() in result.html.lower() for indicator in blocked_indicators)
|
||||
|
||||
return {
|
||||
"success": result.success and not blocked,
|
||||
"url": result.url,
|
||||
"blocked": blocked,
|
||||
"status_code": result.status_code,
|
||||
"page_title": result.metadata.get("title", "") if result.metadata else "",
|
||||
"stealth_enabled": use_stealth
|
||||
}
|
||||
|
||||
|
||||
async def compare_results():
|
||||
"""Run all tests with and without stealth mode and compare results"""
|
||||
|
||||
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}Crawl4AI Stealth Mode Comparison{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}\n")
|
||||
|
||||
# Test 1: Bot Detection
|
||||
print(f"{Fore.YELLOW}1. Bot Detection Test (bot.sannysoft.com){Style.RESET_ALL}")
|
||||
print("-" * 40)
|
||||
|
||||
# Without stealth
|
||||
regular_detection = await test_bot_detection(use_stealth=False)
|
||||
if regular_detection["success"] and regular_detection["detection_data"]:
|
||||
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
|
||||
data = regular_detection["detection_data"]
|
||||
print(f" • WebDriver detected: {data.get('webdriver', 'Unknown')}")
|
||||
print(f" • Chrome: {data.get('chrome', 'Unknown')}")
|
||||
print(f" • Languages: {data.get('languages', 'Unknown')}")
|
||||
print(f" • Plugins: {data.get('pluginsLength', 'Unknown')}")
|
||||
print(f" • User Agent: {data.get('userAgent', 'Unknown')[:60]}...")
|
||||
|
||||
# With stealth
|
||||
stealth_detection = await test_bot_detection(use_stealth=True)
|
||||
if stealth_detection["success"] and stealth_detection["detection_data"]:
|
||||
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
|
||||
data = stealth_detection["detection_data"]
|
||||
print(f" • WebDriver detected: {data.get('webdriver', 'Unknown')}")
|
||||
print(f" • Chrome: {data.get('chrome', 'Unknown')}")
|
||||
print(f" • Languages: {data.get('languages', 'Unknown')}")
|
||||
print(f" • Plugins: {data.get('pluginsLength', 'Unknown')}")
|
||||
print(f" • User Agent: {data.get('userAgent', 'Unknown')[:60]}...")
|
||||
|
||||
# Test 2: Cloudflare Site
|
||||
print(f"\n\n{Fore.YELLOW}2. Cloudflare Protected Site Test{Style.RESET_ALL}")
|
||||
print("-" * 40)
|
||||
|
||||
# Without stealth
|
||||
regular_cf = await test_cloudflare_site(use_stealth=False)
|
||||
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
|
||||
print(f" • Success: {regular_cf['success']}")
|
||||
print(f" • Cloudflare Challenge: {regular_cf['cloudflare_challenge']}")
|
||||
print(f" • Status Code: {regular_cf['status_code']}")
|
||||
print(f" • Page Title: {regular_cf['page_title']}")
|
||||
|
||||
# With stealth
|
||||
stealth_cf = await test_cloudflare_site(use_stealth=True)
|
||||
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
|
||||
print(f" • Success: {stealth_cf['success']}")
|
||||
print(f" • Cloudflare Challenge: {stealth_cf['cloudflare_challenge']}")
|
||||
print(f" • Status Code: {stealth_cf['status_code']}")
|
||||
print(f" • Page Title: {stealth_cf['page_title']}")
|
||||
|
||||
# Test 3: Anti-bot Site
|
||||
print(f"\n\n{Fore.YELLOW}3. Anti-Bot Site Test{Style.RESET_ALL}")
|
||||
print("-" * 40)
|
||||
|
||||
# Without stealth
|
||||
regular_antibot = await test_anti_bot_site(use_stealth=False)
|
||||
print(f"{Fore.RED}Without Stealth:{Style.RESET_ALL}")
|
||||
print(f" • Success: {regular_antibot['success']}")
|
||||
print(f" • Blocked: {regular_antibot['blocked']}")
|
||||
print(f" • Status Code: {regular_antibot['status_code']}")
|
||||
print(f" • Page Title: {regular_antibot['page_title']}")
|
||||
|
||||
# With stealth
|
||||
stealth_antibot = await test_anti_bot_site(use_stealth=True)
|
||||
print(f"\n{Fore.GREEN}With Stealth:{Style.RESET_ALL}")
|
||||
print(f" • Success: {stealth_antibot['success']}")
|
||||
print(f" • Blocked: {stealth_antibot['blocked']}")
|
||||
print(f" • Status Code: {stealth_antibot['status_code']}")
|
||||
print(f" • Page Title: {stealth_antibot['page_title']}")
|
||||
|
||||
# Summary
|
||||
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}Summary:{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
|
||||
print(f"\nStealth mode helps bypass basic bot detection by:")
|
||||
print(f" • Hiding webdriver property")
|
||||
print(f" • Modifying browser fingerprints")
|
||||
print(f" • Adjusting navigator properties")
|
||||
print(f" • Emulating real browser plugin behavior")
|
||||
print(f"\n{Fore.YELLOW}Note:{Style.RESET_ALL} Stealth mode is not a silver bullet.")
|
||||
print(f"Advanced anti-bot systems may still detect automation.")
|
||||
print(f"Always respect robots.txt and website terms of service.")
|
||||
|
||||
|
||||
async def stealth_best_practices():
|
||||
"""Demonstrate best practices for using stealth mode"""
|
||||
|
||||
print(f"\n\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}Stealth Mode Best Practices{Style.RESET_ALL}")
|
||||
print(f"{Fore.CYAN}{'='*60}{Style.RESET_ALL}\n")
|
||||
|
||||
# Best Practice 1: Combine with realistic behavior
|
||||
print(f"{Fore.YELLOW}1. Combine with Realistic Behavior:{Style.RESET_ALL}")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
enable_stealth=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Simulate human-like behavior
|
||||
human_behavior_script = """
|
||||
(async () => {
|
||||
// Wait random time between actions
|
||||
const randomWait = () => Math.random() * 2000 + 1000;
|
||||
|
||||
// Simulate reading
|
||||
await new Promise(resolve => setTimeout(resolve, randomWait()));
|
||||
|
||||
// Smooth scroll
|
||||
const smoothScroll = async () => {
|
||||
const totalHeight = document.body.scrollHeight;
|
||||
const viewHeight = window.innerHeight;
|
||||
let currentPosition = 0;
|
||||
|
||||
while (currentPosition < totalHeight - viewHeight) {
|
||||
const scrollAmount = Math.random() * 300 + 100;
|
||||
window.scrollBy({
|
||||
top: scrollAmount,
|
||||
behavior: 'smooth'
|
||||
});
|
||||
currentPosition += scrollAmount;
|
||||
await new Promise(resolve => setTimeout(resolve, randomWait()));
|
||||
}
|
||||
};
|
||||
|
||||
await smoothScroll();
|
||||
console.log('Human-like behavior simulation completed');
|
||||
return true;
|
||||
})()
|
||||
"""
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code=human_behavior_script,
|
||||
wait_until="networkidle",
|
||||
delay_before_return_html=3.0,
|
||||
capture_console_messages=True
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
print(f" ✓ Simulated human-like scrolling and reading patterns")
|
||||
print(f" ✓ Added random delays between actions")
|
||||
print(f" ✓ Result: {result.success}")
|
||||
|
||||
# Best Practice 2: Use appropriate viewport and user agent
|
||||
print(f"\n{Fore.YELLOW}2. Use Realistic Viewport and User Agent:{Style.RESET_ALL}")
|
||||
|
||||
# Get a realistic user agent
|
||||
from crawl4ai.user_agent_generator import UserAgentGenerator
|
||||
ua_generator = UserAgentGenerator()
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
enable_stealth=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080,
|
||||
user_agent=ua_generator.generate(device_type="desktop", browser_type="chrome")
|
||||
)
|
||||
|
||||
print(f" ✓ Using realistic viewport: 1920x1080")
|
||||
print(f" ✓ Using current Chrome user agent")
|
||||
print(f" ✓ Stealth mode will ensure consistency")
|
||||
|
||||
# Best Practice 3: Manage request rate
|
||||
print(f"\n{Fore.YELLOW}3. Manage Request Rate:{Style.RESET_ALL}")
|
||||
print(f" ✓ Add delays between requests")
|
||||
print(f" ✓ Randomize timing patterns")
|
||||
print(f" ✓ Respect robots.txt")
|
||||
|
||||
# Best Practice 4: Session management
|
||||
print(f"\n{Fore.YELLOW}4. Use Session Management:{Style.RESET_ALL}")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
enable_stealth=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Create a session for multiple requests
|
||||
session_id = "stealth_session_1"
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
session_id=session_id,
|
||||
wait_until="domcontentloaded"
|
||||
)
|
||||
|
||||
# First request
|
||||
result1 = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Subsequent request reuses the same browser context
|
||||
result2 = await crawler.arun(
|
||||
url="https://example.com/about",
|
||||
config=config
|
||||
)
|
||||
|
||||
print(f" ✓ Reused browser session for multiple requests")
|
||||
print(f" ✓ Maintains cookies and state between requests")
|
||||
print(f" ✓ More efficient and realistic browsing pattern")
|
||||
|
||||
print(f"\n{Fore.CYAN}{'='*60}{Style.RESET_ALL}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples"""
|
||||
|
||||
# Run comparison tests
|
||||
await compare_results()
|
||||
|
||||
# Show best practices
|
||||
await stealth_best_practices()
|
||||
|
||||
print(f"\n{Fore.GREEN}Examples completed!{Style.RESET_ALL}")
|
||||
print(f"\n{Fore.YELLOW}Remember:{Style.RESET_ALL}")
|
||||
print(f"• Stealth mode helps with basic bot detection")
|
||||
print(f"• Always respect website terms of service")
|
||||
print(f"• Consider rate limiting and ethical scraping practices")
|
||||
print(f"• For advanced protection, consider additional measures")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
215
docs/examples/stealth_mode_quick_start.py
Normal file
@@ -0,0 +1,215 @@
|
||||
"""
|
||||
Quick Start: Using Stealth Mode in Crawl4AI
|
||||
|
||||
This example shows practical use cases for the stealth mode feature.
|
||||
Stealth mode helps bypass basic bot detection mechanisms.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
|
||||
async def example_1_basic_stealth():
|
||||
"""Example 1: Basic stealth mode usage"""
|
||||
print("\n=== Example 1: Basic Stealth Mode ===")
|
||||
|
||||
# Enable stealth mode in browser config
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True, # This is the key parameter
|
||||
headless=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(f"✓ Crawled {result.url} successfully")
|
||||
print(f"✓ Title: {result.metadata.get('title', 'N/A')}")
|
||||
|
||||
|
||||
async def example_2_stealth_with_screenshot():
|
||||
"""Example 2: Stealth mode with screenshot to show detection results"""
|
||||
print("\n=== Example 2: Stealth Mode Visual Verification ===")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=False # Set to False to see the browser
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
screenshot=True,
|
||||
wait_until="networkidle"
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://bot.sannysoft.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"✓ Successfully crawled bot detection site")
|
||||
print(f"✓ With stealth enabled, many detection tests should show as passed")
|
||||
|
||||
if result.screenshot:
|
||||
# Save screenshot for verification
|
||||
import base64
|
||||
with open("stealth_detection_results.png", "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print(f"✓ Screenshot saved as 'stealth_detection_results.png'")
|
||||
print(f" Check the screenshot to see detection results!")
|
||||
|
||||
|
||||
async def example_3_stealth_for_protected_sites():
|
||||
"""Example 3: Using stealth for sites with bot protection"""
|
||||
print("\n=== Example 3: Stealth for Protected Sites ===")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Add human-like behavior
|
||||
config = CrawlerRunConfig(
|
||||
wait_until="networkidle",
|
||||
delay_before_return_html=2.0, # Wait 2 seconds
|
||||
js_code="""
|
||||
// Simulate human-like scrolling
|
||||
window.scrollTo({
|
||||
top: document.body.scrollHeight / 2,
|
||||
behavior: 'smooth'
|
||||
});
|
||||
"""
|
||||
)
|
||||
|
||||
# Try accessing a site that might have bot protection
|
||||
result = await crawler.arun(
|
||||
url="https://www.g2.com/products/slack/reviews",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"✓ Successfully accessed protected site")
|
||||
print(f"✓ Retrieved {len(result.html)} characters of HTML")
|
||||
else:
|
||||
print(f"✗ Failed to access site: {result.error_message}")
|
||||
|
||||
|
||||
async def example_4_stealth_with_sessions():
|
||||
"""Example 4: Stealth mode with session management"""
|
||||
print("\n=== Example 4: Stealth + Session Management ===")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=False
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
session_id = "my_stealth_session"
|
||||
|
||||
# First request - establish session
|
||||
config = CrawlerRunConfig(
|
||||
session_id=session_id,
|
||||
wait_until="domcontentloaded"
|
||||
)
|
||||
|
||||
result1 = await crawler.arun(
|
||||
url="https://news.ycombinator.com",
|
||||
config=config
|
||||
)
|
||||
print(f"✓ First request completed: {result1.url}")
|
||||
|
||||
# Second request - reuse session
|
||||
await asyncio.sleep(2) # Brief delay between requests
|
||||
|
||||
result2 = await crawler.arun(
|
||||
url="https://news.ycombinator.com/best",
|
||||
config=config
|
||||
)
|
||||
print(f"✓ Second request completed: {result2.url}")
|
||||
print(f"✓ Session reused, maintaining cookies and state")
|
||||
|
||||
|
||||
async def example_5_stealth_comparison():
|
||||
"""Example 5: Compare results with and without stealth using screenshots"""
|
||||
print("\n=== Example 5: Stealth Mode Comparison ===")
|
||||
|
||||
test_url = "https://bot.sannysoft.com"
|
||||
|
||||
# First test WITHOUT stealth
|
||||
print("\nWithout stealth:")
|
||||
regular_config = BrowserConfig(
|
||||
enable_stealth=False,
|
||||
headless=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=regular_config) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
screenshot=True,
|
||||
wait_until="networkidle"
|
||||
)
|
||||
result = await crawler.arun(url=test_url, config=config)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
with open("comparison_without_stealth.png", "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print(f" ✓ Screenshot saved: comparison_without_stealth.png")
|
||||
print(f" Many tests will show as FAILED (red)")
|
||||
|
||||
# Then test WITH stealth
|
||||
print("\nWith stealth:")
|
||||
stealth_config = BrowserConfig(
|
||||
enable_stealth=True,
|
||||
headless=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=stealth_config) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
screenshot=True,
|
||||
wait_until="networkidle"
|
||||
)
|
||||
result = await crawler.arun(url=test_url, config=config)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
with open("comparison_with_stealth.png", "wb") as f:
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print(f" ✓ Screenshot saved: comparison_with_stealth.png")
|
||||
print(f" More tests should show as PASSED (green)")
|
||||
|
||||
print("\nCompare the two screenshots to see the difference!")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples"""
|
||||
print("Crawl4AI Stealth Mode Examples")
|
||||
print("==============================")
|
||||
|
||||
# Run basic example
|
||||
await example_1_basic_stealth()
|
||||
|
||||
# Run screenshot verification example
|
||||
await example_2_stealth_with_screenshot()
|
||||
|
||||
# Run protected site example
|
||||
await example_3_stealth_for_protected_sites()
|
||||
|
||||
# Run session example
|
||||
await example_4_stealth_with_sessions()
|
||||
|
||||
# Run comparison example
|
||||
await example_5_stealth_comparison()
|
||||
|
||||
print("\n" + "="*50)
|
||||
print("Tips for using stealth mode effectively:")
|
||||
print("- Use realistic viewport sizes (1920x1080, 1366x768)")
|
||||
print("- Add delays between requests to appear more human")
|
||||
print("- Combine with session management for better results")
|
||||
print("- Remember: stealth mode is for legitimate scraping only")
|
||||
print("="*50)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
62
docs/examples/stealth_test_simple.py
Normal file
@@ -0,0 +1,62 @@
|
||||
"""
|
||||
Simple test to verify stealth mode is working
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
|
||||
async def test_stealth():
|
||||
"""Test stealth mode effectiveness"""
|
||||
|
||||
# Test WITHOUT stealth
|
||||
print("=== WITHOUT Stealth ===")
|
||||
config1 = BrowserConfig(
|
||||
headless=False,
|
||||
enable_stealth=False
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=config1) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://bot.sannysoft.com",
|
||||
config=CrawlerRunConfig(
|
||||
wait_until="networkidle",
|
||||
screenshot=True
|
||||
)
|
||||
)
|
||||
print(f"Success: {result.success}")
|
||||
# Take screenshot
|
||||
if result.screenshot:
|
||||
with open("without_stealth.png", "wb") as f:
|
||||
import base64
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print("Screenshot saved: without_stealth.png")
|
||||
|
||||
# Test WITH stealth
|
||||
print("\n=== WITH Stealth ===")
|
||||
config2 = BrowserConfig(
|
||||
headless=False,
|
||||
enable_stealth=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=config2) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://bot.sannysoft.com",
|
||||
config=CrawlerRunConfig(
|
||||
wait_until="networkidle",
|
||||
screenshot=True
|
||||
)
|
||||
)
|
||||
print(f"Success: {result.success}")
|
||||
# Take screenshot
|
||||
if result.screenshot:
|
||||
with open("with_stealth.png", "wb") as f:
|
||||
import base64
|
||||
f.write(base64.b64decode(result.screenshot))
|
||||
print("Screenshot saved: with_stealth.png")
|
||||
|
||||
print("\nCheck the screenshots to see the difference in bot detection results!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(test_stealth())
|
||||
74
docs/examples/undetectability/undetected_basic_test.py
Normal file
@@ -0,0 +1,74 @@
|
||||
"""
|
||||
Basic Undetected Browser Test
|
||||
Simple example to test if undetected mode works
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
async def test_regular_mode():
|
||||
"""Test with regular browser"""
|
||||
print("Testing Regular Browser Mode...")
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com")
|
||||
print(f"Regular Mode - Success: {result.success}")
|
||||
print(f"Regular Mode - Status: {result.status_code}")
|
||||
print(f"Regular Mode - Content length: {len(result.markdown.raw_markdown)}")
|
||||
print(f"Regular Mode - First 100 chars: {result.markdown.raw_markdown[:100]}...")
|
||||
return result.success
|
||||
|
||||
async def test_undetected_mode():
|
||||
"""Test with undetected browser"""
|
||||
print("\nTesting Undetected Browser Mode...")
|
||||
from crawl4ai import UndetectedAdapter
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# Create undetected adapter
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
|
||||
# Create strategy with undetected adapter
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=undetected_adapter
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com")
|
||||
print(f"Undetected Mode - Success: {result.success}")
|
||||
print(f"Undetected Mode - Status: {result.status_code}")
|
||||
print(f"Undetected Mode - Content length: {len(result.markdown.raw_markdown)}")
|
||||
print(f"Undetected Mode - First 100 chars: {result.markdown.raw_markdown[:100]}...")
|
||||
return result.success
|
||||
|
||||
async def main():
|
||||
"""Run both tests"""
|
||||
print("🤖 Crawl4AI Basic Adapter Test\n")
|
||||
|
||||
# Test regular mode
|
||||
regular_success = await test_regular_mode()
|
||||
|
||||
# Test undetected mode
|
||||
undetected_success = await test_undetected_mode()
|
||||
|
||||
# Summary
|
||||
print("\n" + "="*50)
|
||||
print("Summary:")
|
||||
print(f"Regular Mode: {'✅ Success' if regular_success else '❌ Failed'}")
|
||||
print(f"Undetected Mode: {'✅ Success' if undetected_success else '❌ Failed'}")
|
||||
print("="*50)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
155
docs/examples/undetectability/undetected_bot_test.py
Normal file
@@ -0,0 +1,155 @@
|
||||
"""
|
||||
Bot Detection Test - Compare Regular vs Undetected
|
||||
Tests browser fingerprinting differences at bot.sannysoft.com
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
UndetectedAdapter,
|
||||
CrawlResult
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
# Bot detection test site
|
||||
TEST_URL = "https://bot.sannysoft.com"
|
||||
|
||||
def analyze_bot_detection(result: CrawlResult) -> dict:
|
||||
"""Analyze bot detection results from the page"""
|
||||
detections = {
|
||||
"webdriver": False,
|
||||
"headless": False,
|
||||
"automation": False,
|
||||
"user_agent": False,
|
||||
"total_tests": 0,
|
||||
"failed_tests": 0
|
||||
}
|
||||
|
||||
if not result.success or not result.html:
|
||||
return detections
|
||||
|
||||
# Look for specific test results in the HTML
|
||||
html_lower = result.html.lower()
|
||||
|
||||
# Check for common bot indicators
|
||||
if "webdriver" in html_lower and ("fail" in html_lower or "true" in html_lower):
|
||||
detections["webdriver"] = True
|
||||
detections["failed_tests"] += 1
|
||||
|
||||
if "headless" in html_lower and ("fail" in html_lower or "true" in html_lower):
|
||||
detections["headless"] = True
|
||||
detections["failed_tests"] += 1
|
||||
|
||||
if "automation" in html_lower and "detected" in html_lower:
|
||||
detections["automation"] = True
|
||||
detections["failed_tests"] += 1
|
||||
|
||||
# Count total tests (approximate)
|
||||
detections["total_tests"] = html_lower.count("test") + html_lower.count("check")
|
||||
|
||||
return detections
|
||||
|
||||
async def test_browser_mode(adapter_name: str, adapter=None):
|
||||
"""Test a browser mode and return results"""
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Testing: {adapter_name}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Run in headed mode for better results
|
||||
verbose=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080,
|
||||
)
|
||||
|
||||
if adapter:
|
||||
# Use undetected mode
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=adapter
|
||||
)
|
||||
crawler = AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
)
|
||||
else:
|
||||
# Use regular mode
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
|
||||
async with crawler:
|
||||
config = CrawlerRunConfig(
|
||||
delay_before_return_html=3.0, # Let detection scripts run
|
||||
wait_for_images=True,
|
||||
screenshot=True,
|
||||
simulate_user=False, # Don't simulate for accurate detection
|
||||
)
|
||||
|
||||
result = await crawler.arun(url=TEST_URL, config=config)
|
||||
|
||||
print(f"\n✓ Success: {result.success}")
|
||||
print(f"✓ Status Code: {result.status_code}")
|
||||
|
||||
if result.success:
|
||||
# Analyze detection results
|
||||
detections = analyze_bot_detection(result)
|
||||
|
||||
print(f"\n🔍 Bot Detection Analysis:")
|
||||
print(f" - WebDriver Detected: {'❌ Yes' if detections['webdriver'] else '✅ No'}")
|
||||
print(f" - Headless Detected: {'❌ Yes' if detections['headless'] else '✅ No'}")
|
||||
print(f" - Automation Detected: {'❌ Yes' if detections['automation'] else '✅ No'}")
|
||||
print(f" - Failed Tests: {detections['failed_tests']}")
|
||||
|
||||
# Show some content
|
||||
if result.markdown.raw_markdown:
|
||||
print(f"\nContent preview:")
|
||||
lines = result.markdown.raw_markdown.split('\n')
|
||||
for line in lines[:20]: # Show first 20 lines
|
||||
if any(keyword in line.lower() for keyword in ['test', 'pass', 'fail', 'yes', 'no']):
|
||||
print(f" {line.strip()}")
|
||||
|
||||
return result, detections if result.success else {}
|
||||
|
||||
async def main():
|
||||
"""Run the comparison"""
|
||||
print("🤖 Crawl4AI - Bot Detection Test")
|
||||
print(f"Testing at: {TEST_URL}")
|
||||
print("This site runs various browser fingerprinting tests\n")
|
||||
|
||||
# Test regular browser
|
||||
regular_result, regular_detections = await test_browser_mode("Regular Browser")
|
||||
|
||||
# Small delay
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Test undetected browser
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
undetected_result, undetected_detections = await test_browser_mode(
|
||||
"Undetected Browser",
|
||||
undetected_adapter
|
||||
)
|
||||
|
||||
# Summary comparison
|
||||
print(f"\n{'='*60}")
|
||||
print("COMPARISON SUMMARY")
|
||||
print(f"{'='*60}")
|
||||
|
||||
print(f"\n{'Test':<25} {'Regular':<15} {'Undetected':<15}")
|
||||
print(f"{'-'*55}")
|
||||
|
||||
if regular_detections and undetected_detections:
|
||||
print(f"{'WebDriver Detection':<25} {'❌ Detected' if regular_detections['webdriver'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['webdriver'] else '✅ Passed':<15}")
|
||||
print(f"{'Headless Detection':<25} {'❌ Detected' if regular_detections['headless'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['headless'] else '✅ Passed':<15}")
|
||||
print(f"{'Automation Detection':<25} {'❌ Detected' if regular_detections['automation'] else '✅ Passed':<15} {'❌ Detected' if undetected_detections['automation'] else '✅ Passed':<15}")
|
||||
print(f"{'Failed Tests':<25} {regular_detections['failed_tests']:<15} {undetected_detections['failed_tests']:<15}")
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
|
||||
if undetected_detections.get('failed_tests', 0) < regular_detections.get('failed_tests', 1):
|
||||
print("✅ Undetected browser performed better at evading detection!")
|
||||
else:
|
||||
print("ℹ️ Both browsers had similar detection results")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
164
docs/examples/undetectability/undetected_cloudflare_test.py
Normal file
@@ -0,0 +1,164 @@
|
||||
"""
|
||||
Undetected Browser Test - Cloudflare Protected Site
|
||||
Tests the difference between regular and undetected modes on a Cloudflare-protected site
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
UndetectedAdapter
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
# Test URL with Cloudflare protection
|
||||
TEST_URL = "https://nowsecure.nl"
|
||||
|
||||
async def test_regular_browser():
|
||||
"""Test with regular browser - likely to be blocked"""
|
||||
print("=" * 60)
|
||||
print("Testing with Regular Browser")
|
||||
print("=" * 60)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
delay_before_return_html=2.0,
|
||||
simulate_user=True,
|
||||
magic=True, # Try with magic mode too
|
||||
)
|
||||
|
||||
result = await crawler.arun(url=TEST_URL, config=config)
|
||||
|
||||
print(f"\n✓ Success: {result.success}")
|
||||
print(f"✓ Status Code: {result.status_code}")
|
||||
print(f"✓ HTML Length: {len(result.html)}")
|
||||
|
||||
# Check for Cloudflare challenge
|
||||
if result.html:
|
||||
cf_indicators = [
|
||||
"Checking your browser",
|
||||
"Please stand by",
|
||||
"cloudflare",
|
||||
"cf-browser-verification",
|
||||
"Access denied",
|
||||
"Ray ID"
|
||||
]
|
||||
|
||||
detected = False
|
||||
for indicator in cf_indicators:
|
||||
if indicator.lower() in result.html.lower():
|
||||
print(f"⚠️ Cloudflare Challenge Detected: '{indicator}' found")
|
||||
detected = True
|
||||
break
|
||||
|
||||
if not detected and len(result.markdown.raw_markdown) > 100:
|
||||
print("✅ Successfully bypassed Cloudflare!")
|
||||
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
|
||||
elif not detected:
|
||||
print("⚠️ Page loaded but content seems minimal")
|
||||
|
||||
return result
|
||||
|
||||
async def test_undetected_browser():
|
||||
"""Test with undetected browser - should bypass Cloudflare"""
|
||||
print("\n" + "=" * 60)
|
||||
print("Testing with Undetected Browser")
|
||||
print("=" * 60)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Headless is easier to detect
|
||||
verbose=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080,
|
||||
)
|
||||
|
||||
# Create undetected adapter
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
|
||||
# Create strategy with undetected adapter
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=undetected_adapter
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
delay_before_return_html=2.0,
|
||||
simulate_user=True,
|
||||
)
|
||||
|
||||
result = await crawler.arun(url=TEST_URL, config=config)
|
||||
|
||||
print(f"\n✓ Success: {result.success}")
|
||||
print(f"✓ Status Code: {result.status_code}")
|
||||
print(f"✓ HTML Length: {len(result.html)}")
|
||||
|
||||
# Check for Cloudflare challenge
|
||||
if result.html:
|
||||
cf_indicators = [
|
||||
"Checking your browser",
|
||||
"Please stand by",
|
||||
"cloudflare",
|
||||
"cf-browser-verification",
|
||||
"Access denied",
|
||||
"Ray ID"
|
||||
]
|
||||
|
||||
detected = False
|
||||
for indicator in cf_indicators:
|
||||
if indicator.lower() in result.html.lower():
|
||||
print(f"⚠️ Cloudflare Challenge Detected: '{indicator}' found")
|
||||
detected = True
|
||||
break
|
||||
|
||||
if not detected and len(result.markdown.raw_markdown) > 100:
|
||||
print("✅ Successfully bypassed Cloudflare!")
|
||||
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
|
||||
elif not detected:
|
||||
print("⚠️ Page loaded but content seems minimal")
|
||||
|
||||
return result
|
||||
|
||||
async def main():
|
||||
"""Compare regular vs undetected browser"""
|
||||
print("🤖 Crawl4AI - Cloudflare Bypass Test")
|
||||
print(f"Testing URL: {TEST_URL}\n")
|
||||
|
||||
# Test regular browser
|
||||
regular_result = await test_regular_browser()
|
||||
|
||||
# Small delay
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Test undetected browser
|
||||
undetected_result = await test_undetected_browser()
|
||||
|
||||
# Summary
|
||||
print("\n" + "=" * 60)
|
||||
print("SUMMARY")
|
||||
print("=" * 60)
|
||||
print(f"Regular Browser:")
|
||||
print(f" - Success: {regular_result.success}")
|
||||
print(f" - Content Length: {len(regular_result.markdown.raw_markdown) if regular_result.markdown else 0}")
|
||||
|
||||
print(f"\nUndetected Browser:")
|
||||
print(f" - Success: {undetected_result.success}")
|
||||
print(f" - Content Length: {len(undetected_result.markdown.raw_markdown) if undetected_result.markdown else 0}")
|
||||
|
||||
if undetected_result.success and len(undetected_result.markdown.raw_markdown) > len(regular_result.markdown.raw_markdown):
|
||||
print("\n✅ Undetected browser successfully bypassed protection!")
|
||||
print("=" * 60)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -0,0 +1,184 @@
|
||||
"""
|
||||
Undetected vs Regular Browser Comparison
|
||||
This example demonstrates the difference between regular and undetected browser modes
|
||||
when accessing sites with bot detection services.
|
||||
|
||||
Based on tested anti-bot services:
|
||||
- Cloudflare
|
||||
- Kasada
|
||||
- Akamai
|
||||
- DataDome
|
||||
- Bet365
|
||||
- And others
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
PlaywrightAdapter,
|
||||
UndetectedAdapter,
|
||||
CrawlResult
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
|
||||
# Test URLs for various bot detection services
|
||||
TEST_SITES = {
|
||||
"Cloudflare Protected": "https://nowsecure.nl",
|
||||
# "Bot Detection Test": "https://bot.sannysoft.com",
|
||||
# "Fingerprint Test": "https://fingerprint.com/products/bot-detection",
|
||||
# "Browser Scan": "https://browserscan.net",
|
||||
# "CreepJS": "https://abrahamjuliot.github.io/creepjs",
|
||||
}
|
||||
|
||||
|
||||
async def test_with_adapter(url: str, adapter_name: str, adapter):
|
||||
"""Test a URL with a specific adapter"""
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Better for avoiding detection
|
||||
viewport_width=1920,
|
||||
viewport_height=1080,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# Create the crawler strategy with the adapter
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=adapter
|
||||
)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Testing with {adapter_name} adapter")
|
||||
print(f"URL: {url}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
try:
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
crawler_config = CrawlerRunConfig(
|
||||
delay_before_return_html=3.0, # Give page time to load
|
||||
wait_for_images=True,
|
||||
screenshot=True,
|
||||
simulate_user=True, # Add user simulation
|
||||
)
|
||||
|
||||
result: CrawlResult = await crawler.arun(
|
||||
url=url,
|
||||
config=crawler_config
|
||||
)
|
||||
|
||||
# Check results
|
||||
print(f"✓ Status Code: {result.status_code}")
|
||||
print(f"✓ Success: {result.success}")
|
||||
print(f"✓ HTML Length: {len(result.html)}")
|
||||
print(f"✓ Markdown Length: {len(result.markdown.raw_markdown)}")
|
||||
|
||||
# Check for common bot detection indicators
|
||||
detection_indicators = [
|
||||
"Access denied",
|
||||
"Please verify you are human",
|
||||
"Checking your browser",
|
||||
"Enable JavaScript",
|
||||
"captcha",
|
||||
"403 Forbidden",
|
||||
"Bot detection",
|
||||
"Security check"
|
||||
]
|
||||
|
||||
content_lower = result.markdown.raw_markdown.lower()
|
||||
detected = False
|
||||
for indicator in detection_indicators:
|
||||
if indicator.lower() in content_lower:
|
||||
print(f"⚠️ Possible detection: Found '{indicator}'")
|
||||
detected = True
|
||||
break
|
||||
|
||||
if not detected:
|
||||
print("✅ No obvious bot detection triggered!")
|
||||
# Show first 200 chars of content
|
||||
print(f"Content preview: {result.markdown.raw_markdown[:200]}...")
|
||||
|
||||
return result.success and not detected
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ Error: {str(e)}")
|
||||
return False
|
||||
|
||||
|
||||
async def compare_adapters(url: str, site_name: str):
|
||||
"""Compare regular and undetected adapters on the same URL"""
|
||||
print(f"\n{'#'*60}")
|
||||
print(f"# Testing: {site_name}")
|
||||
print(f"{'#'*60}")
|
||||
|
||||
# Test with regular adapter
|
||||
regular_adapter = PlaywrightAdapter()
|
||||
regular_success = await test_with_adapter(url, "Regular", regular_adapter)
|
||||
|
||||
# Small delay between tests
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Test with undetected adapter
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
undetected_success = await test_with_adapter(url, "Undetected", undetected_adapter)
|
||||
|
||||
# Summary
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Summary for {site_name}:")
|
||||
print(f"Regular Adapter: {'✅ Passed' if regular_success else '❌ Blocked/Detected'}")
|
||||
print(f"Undetected Adapter: {'✅ Passed' if undetected_success else '❌ Blocked/Detected'}")
|
||||
print(f"{'='*60}")
|
||||
|
||||
return regular_success, undetected_success
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run comparison tests on multiple sites"""
|
||||
print("🤖 Crawl4AI Browser Adapter Comparison")
|
||||
print("Testing regular vs undetected browser modes\n")
|
||||
|
||||
results = {}
|
||||
|
||||
# Test each site
|
||||
for site_name, url in TEST_SITES.items():
|
||||
regular, undetected = await compare_adapters(url, site_name)
|
||||
results[site_name] = {
|
||||
"regular": regular,
|
||||
"undetected": undetected
|
||||
}
|
||||
|
||||
# Delay between different sites
|
||||
await asyncio.sleep(3)
|
||||
|
||||
# Final summary
|
||||
print(f"\n{'#'*60}")
|
||||
print("# FINAL RESULTS")
|
||||
print(f"{'#'*60}")
|
||||
print(f"{'Site':<30} {'Regular':<15} {'Undetected':<15}")
|
||||
print(f"{'-'*60}")
|
||||
|
||||
for site, result in results.items():
|
||||
regular_status = "✅ Passed" if result["regular"] else "❌ Blocked"
|
||||
undetected_status = "✅ Passed" if result["undetected"] else "❌ Blocked"
|
||||
print(f"{site:<30} {regular_status:<15} {undetected_status:<15}")
|
||||
|
||||
# Calculate success rates
|
||||
regular_success = sum(1 for r in results.values() if r["regular"])
|
||||
undetected_success = sum(1 for r in results.values() if r["undetected"])
|
||||
total = len(results)
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
print(f"Success Rates:")
|
||||
print(f"Regular Adapter: {regular_success}/{total} ({regular_success/total*100:.1f}%)")
|
||||
print(f"Undetected Adapter: {undetected_success}/{total} ({undetected_success/total*100:.1f}%)")
|
||||
print(f"{'='*60}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Note: This example may take a while to run as it tests multiple sites
|
||||
# You can comment out sites in TEST_SITES to run faster tests
|
||||
asyncio.run(main())
|
||||
118
docs/examples/undetected_simple_demo.py
Normal file
@@ -0,0 +1,118 @@
|
||||
"""
|
||||
Simple Undetected Browser Demo
|
||||
Demonstrates the basic usage of undetected browser mode
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
UndetectedAdapter
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
async def crawl_with_regular_browser(url: str):
|
||||
"""Crawl with regular browser"""
|
||||
print("\n[Regular Browser Mode]")
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
)
|
||||
|
||||
print(f"Success: {result.success}")
|
||||
print(f"Status: {result.status_code}")
|
||||
print(f"Content length: {len(result.markdown.raw_markdown)}")
|
||||
|
||||
# Check for bot detection keywords
|
||||
content = result.markdown.raw_markdown.lower()
|
||||
if any(word in content for word in ["cloudflare", "checking your browser", "please wait"]):
|
||||
print("⚠️ Bot detection triggered!")
|
||||
else:
|
||||
print("✅ Page loaded successfully")
|
||||
|
||||
return result
|
||||
|
||||
async def crawl_with_undetected_browser(url: str):
|
||||
"""Crawl with undetected browser"""
|
||||
print("\n[Undetected Browser Mode]")
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# Create undetected adapter and strategy
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=undetected_adapter
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=CrawlerRunConfig(
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
)
|
||||
|
||||
print(f"Success: {result.success}")
|
||||
print(f"Status: {result.status_code}")
|
||||
print(f"Content length: {len(result.markdown.raw_markdown)}")
|
||||
|
||||
# Check for bot detection keywords
|
||||
content = result.markdown.raw_markdown.lower()
|
||||
if any(word in content for word in ["cloudflare", "checking your browser", "please wait"]):
|
||||
print("⚠️ Bot detection triggered!")
|
||||
else:
|
||||
print("✅ Page loaded successfully")
|
||||
|
||||
return result
|
||||
|
||||
async def main():
|
||||
"""Demo comparing regular vs undetected modes"""
|
||||
print("🤖 Crawl4AI Undetected Browser Demo")
|
||||
print("="*50)
|
||||
|
||||
# Test URLs - you can change these
|
||||
test_urls = [
|
||||
"https://www.example.com", # Simple site
|
||||
"https://httpbin.org/headers", # Shows request headers
|
||||
]
|
||||
|
||||
for url in test_urls:
|
||||
print(f"\n📍 Testing URL: {url}")
|
||||
|
||||
# Test with regular browser
|
||||
regular_result = await crawl_with_regular_browser(url)
|
||||
|
||||
# Small delay
|
||||
await asyncio.sleep(2)
|
||||
|
||||
# Test with undetected browser
|
||||
undetected_result = await crawl_with_undetected_browser(url)
|
||||
|
||||
# Compare results
|
||||
print(f"\n📊 Comparison for {url}:")
|
||||
print(f"Regular browser content: {len(regular_result.markdown.raw_markdown)} chars")
|
||||
print(f"Undetected browser content: {len(undetected_result.markdown.raw_markdown)} chars")
|
||||
|
||||
if url == "https://httpbin.org/headers":
|
||||
# Show headers for comparison
|
||||
print("\nHeaders seen by server:")
|
||||
print("Regular:", regular_result.markdown.raw_markdown[:500])
|
||||
print("\nUndetected:", undetected_result.markdown.raw_markdown[:500])
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -358,9 +358,77 @@ if __name__ == "__main__":

---

---

## 7. Anti-Bot Features (Stealth Mode & Undetected Browser)

Crawl4AI provides two powerful features to bypass bot detection:

### 7.1 Stealth Mode

Stealth mode uses playwright-stealth to modify browser fingerprints and behaviors. Enable it with a simple flag:

```python
browser_config = BrowserConfig(
    enable_stealth=True,  # Activates stealth mode
    headless=False
)
```

**When to use**: Sites with basic bot detection (checking navigator.webdriver, plugins, etc.)

### 7.2 Undetected Browser

For advanced bot detection, use the undetected browser adapter:

```python
from crawl4ai import UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

# Create undetected adapter
adapter = UndetectedAdapter()
strategy = AsyncPlaywrightCrawlerStrategy(
    browser_config=browser_config,
    browser_adapter=adapter
)

async with AsyncWebCrawler(crawler_strategy=strategy, config=browser_config) as crawler:
    # Your crawling code
    ...
```

**When to use**: Sites with sophisticated bot detection (Cloudflare, DataDome, etc.)

### 7.3 Combining Both

For maximum evasion, combine stealth mode with undetected browser:

```python
browser_config = BrowserConfig(
    enable_stealth=True,  # Enable stealth
    headless=False
)

adapter = UndetectedAdapter()  # Use undetected browser
```
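
The snippet above only builds the two pieces; a minimal sketch of wiring them together, using nothing beyond the pattern already shown in 7.2, looks like this:

```python
strategy = AsyncPlaywrightCrawlerStrategy(
    browser_config=browser_config,  # stealth-enabled config from above
    browser_adapter=adapter         # undetected adapter from above
)

async with AsyncWebCrawler(crawler_strategy=strategy, config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
```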

### Choosing the Right Approach

| Detection Level | Recommended Approach |
|----------------|---------------------|
| No protection | Regular browser |
| Basic checks | Regular + Stealth mode |
| Advanced protection | Undetected browser |
| Maximum evasion | Undetected + Stealth mode |

**Best Practice**: Start with regular browser + stealth mode. Only use undetected browser if needed, as it may be slightly slower.

See [Undetected Browser Mode](undetected-browser.md) for detailed examples.

---

## Conclusion & Next Steps

You've now explored several **advanced** features:

- **Proxy Usage**
- **PDF & Screenshot** capturing for large or critical pages
@@ -368,7 +436,10 @@ You’ve now explored several **advanced** features:
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
- **Robots.txt Compliance**
- **Anti-Bot Features** (Stealth Mode & Undetected Browser)

With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, manage sessions across multiple runs, and bypass bot detection—streamlining your entire data collection pipeline.

**Note**: In future versions, we may enable stealth mode and undetected browser by default. For now, users should explicitly enable these features when needed.

**Last Updated**: 2025-01-17
||||
394
docs/md_v2/advanced/undetected-browser.md
Normal file
@@ -0,0 +1,394 @@

# Undetected Browser Mode

## Overview

Crawl4AI offers two powerful anti-bot features to help you access websites with bot detection:

1. **Stealth Mode** - Uses playwright-stealth to modify browser fingerprints and behaviors
2. **Undetected Browser Mode** - Advanced browser adapter with deep-level patches for sophisticated bot detection

This guide covers both features and helps you choose the right approach for your needs.

## Anti-Bot Features Comparison

| Feature | Regular Browser | Stealth Mode | Undetected Browser |
|---------|----------------|--------------|-------------------|
| WebDriver Detection | ❌ | ✅ | ✅ |
| Navigator Properties | ❌ | ✅ | ✅ |
| Plugin Emulation | ❌ | ✅ | ✅ |
| CDP Detection | ❌ | Partial | ✅ |
| Deep Browser Patches | ❌ | ❌ | ✅ |
| Performance Impact | None | Minimal | Moderate |
| Setup Complexity | None | None | Minimal |

## When to Use Each Approach

### Use Regular Browser + Stealth Mode When:
- Sites have basic bot detection (checking navigator.webdriver, plugins, etc.)
- You need good performance with basic protection
- Sites check for common automation indicators

### Use Undetected Browser When:
- Sites employ sophisticated bot detection services (Cloudflare, DataDome, etc.)
- Stealth mode alone isn't sufficient
- You're willing to trade some performance for better evasion

### Best Practice: Progressive Enhancement
1. **Start with**: Regular browser + Stealth mode
2. **If blocked**: Switch to Undetected browser
3. **If still blocked**: Combine Undetected browser + Stealth mode (see the sketch after this list)
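
A minimal sketch of this escalation is shown below. The helper name and URL are placeholders, and it reuses only the `BrowserConfig(enable_stealth=...)`, `UndetectedAdapter`, and `AsyncPlaywrightCrawlerStrategy` wiring documented later in this guide; real code would also inspect the returned HTML for challenge markers (as the examples below do) rather than relying on `result.success` alone.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy


async def crawl_with_escalation(url: str):
    """Hypothetical helper: try stealth mode first, then escalate to the undetected adapter."""
    # Step 1: regular browser + stealth mode
    stealth_config = BrowserConfig(enable_stealth=True, headless=False)
    async with AsyncWebCrawler(config=stealth_config) as crawler:
        result = await crawler.arun(url=url, config=CrawlerRunConfig())
        if result.success:
            return result

    # Steps 2-3: undetected adapter, keeping stealth enabled
    strategy = AsyncPlaywrightCrawlerStrategy(
        browser_config=stealth_config,
        browser_adapter=UndetectedAdapter()
    )
    async with AsyncWebCrawler(crawler_strategy=strategy, config=stealth_config) as crawler:
        return await crawler.arun(url=url, config=CrawlerRunConfig())


# Usage (placeholder URL):
# asyncio.run(crawl_with_escalation("https://example.com"))
```
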
## Stealth Mode
|
||||
|
||||
Stealth mode is the simpler anti-bot solution that works with both regular and undetected browsers:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
# Enable stealth mode with regular browser
|
||||
browser_config = BrowserConfig(
|
||||
enable_stealth=True, # Simple flag to enable
|
||||
headless=False # Better for avoiding detection
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
```
|
||||
|
||||
### What Stealth Mode Does:
|
||||
- Removes `navigator.webdriver` flag
|
||||
- Modifies browser fingerprints
|
||||
- Emulates realistic plugin behavior
|
||||
- Adjusts navigator properties
|
||||
- Fixes common automation leaks
|
||||
|
||||
## Undetected Browser Mode
|
||||
|
||||
For sites with sophisticated bot detection that stealth mode can't bypass, use the undetected browser adapter:
|
||||
|
||||
### Key Features
|
||||
|
||||
- **Drop-in Replacement**: Uses the same API as regular browser mode
|
||||
- **Enhanced Stealth**: Built-in patches to evade common detection methods
|
||||
- **Browser Adapter Pattern**: Seamlessly switch between regular and undetected modes
|
||||
- **Automatic Installation**: `crawl4ai-setup` installs all necessary browser dependencies
|
||||
|
||||
### Quick Start
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
UndetectedAdapter
|
||||
)
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
async def main():
|
||||
# Create the undetected adapter
|
||||
undetected_adapter = UndetectedAdapter()
|
||||
|
||||
# Create browser config
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,  # Headless mode is detected more easily
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# Create the crawler strategy with undetected adapter
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(
|
||||
browser_config=browser_config,
|
||||
browser_adapter=undetected_adapter
|
||||
)
|
||||
|
||||
# Create the crawler with our custom strategy
|
||||
async with AsyncWebCrawler(
|
||||
crawler_strategy=crawler_strategy,
|
||||
config=browser_config
|
||||
) as crawler:
|
||||
# Your crawling code here
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig()
|
||||
)
|
||||
print(result.markdown[:500])
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
## Combining Both Features
|
||||
|
||||
For maximum evasion, combine stealth mode with undetected browser:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, UndetectedAdapter
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy

# Create browser config with stealth enabled
browser_config = BrowserConfig(
    enable_stealth=True,  # Enable stealth mode
    headless=False
)

# Create undetected adapter
adapter = UndetectedAdapter()

# Create strategy with both features
strategy = AsyncPlaywrightCrawlerStrategy(
    browser_config=browser_config,
    browser_adapter=adapter
)

async with AsyncWebCrawler(
    crawler_strategy=strategy,
    config=browser_config
) as crawler:
    result = await crawler.arun("https://protected-site.com")
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Basic Stealth Mode
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def test_stealth_mode():
    # Simple stealth mode configuration
    browser_config = BrowserConfig(
        enable_stealth=True,
        headless=False
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://bot.sannysoft.com",
            config=CrawlerRunConfig(screenshot=True)
        )

        if result.success:
            print("✓ Successfully accessed bot detection test site")
            # Save screenshot to verify detection results
            if result.screenshot:
                import base64
                with open("stealth_test.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))
                print("✓ Screenshot saved - check for green (passed) tests")

asyncio.run(test_stealth_mode())
```
|
||||
|
||||
### Example 2: Undetected Browser Mode
|
||||
|
||||
```python
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CrawlResult,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    UndetectedAdapter
)
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy


async def main():
    # Create browser config
    browser_config = BrowserConfig(
        headless=False,
        verbose=True,
    )

    # Create the undetected adapter
    undetected_adapter = UndetectedAdapter()

    # Create the crawler strategy with the undetected adapter
    crawler_strategy = AsyncPlaywrightCrawlerStrategy(
        browser_config=browser_config,
        browser_adapter=undetected_adapter
    )

    # Create the crawler with our custom strategy
    async with AsyncWebCrawler(
        crawler_strategy=crawler_strategy,
        config=browser_config
    ) as crawler:
        # Configure the crawl
        crawler_config = CrawlerRunConfig(
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()
            ),
            capture_console_messages=True,  # Test adapter console capture
        )

        # Test on a site that typically detects bots
        print("Testing undetected adapter...")
        result: CrawlResult = await crawler.arun(
            url="https://www.helloworld.org",
            config=crawler_config
        )

        print(f"Status: {result.status_code}")
        print(f"Success: {result.success}")
        print(f"Console messages captured: {len(result.console_messages or [])}")
        print(f"Markdown content (first 500 chars):\n{result.markdown.raw_markdown[:500]}")


if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## Browser Adapter Pattern
|
||||
|
||||
The undetected browser support is implemented using an adapter pattern, allowing seamless switching between different browser implementations:
|
||||
|
||||
```python
# Regular browser adapter (default)
from crawl4ai import PlaywrightAdapter
regular_adapter = PlaywrightAdapter()

# Undetected browser adapter
from crawl4ai import UndetectedAdapter
undetected_adapter = UndetectedAdapter()
```
|
||||
|
||||
The adapter handles:
|
||||
- JavaScript execution
|
||||
- Console message capture
|
||||
- Error handling
|
||||
- Browser-specific optimizations
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Avoid Headless Mode**: Detection is easier in headless mode
   ```python
   browser_config = BrowserConfig(headless=False)
   ```

2. **Use Reasonable Delays**: Don't rush through pages
   ```python
   crawler_config = CrawlerRunConfig(
       wait_time=3.0,  # Wait 3 seconds after page load
       delay_before_return_html=2.0  # Additional delay
   )
   ```

3. **Rotate User Agents**: You can customize user agents (see the rotation sketch after this list)
   ```python
   browser_config = BrowserConfig(
       headers={"User-Agent": "your-user-agent"}
   )
   ```

4. **Handle Failures Gracefully**: Some sites may still detect and block
   ```python
   if not result.success:
       print(f"Crawl failed: {result.error_message}")
   ```
|
||||
|
||||
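For tip 3, instead of hard-coding a single header you can let the browser config rotate realistic user agents for you. This is a sketch rather than a drop-in recipe: `user_agent_mode="random"` is assumed here, so check it against the `BrowserConfig` reference of your installed version (a fixed string via `user_agent` works as a fallback):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Assumed option: ask Crawl4AI to generate a realistic user agent per launch
browser_config = BrowserConfig(
    user_agent_mode="random",
    headless=False
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
```
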
## Advanced Usage Tips
|
||||
|
||||
### Progressive Detection Handling
|
||||
|
||||
```python
async def crawl_with_progressive_evasion(url):
    # Step 1: Try regular browser with stealth
    browser_config = BrowserConfig(
        enable_stealth=True,
        headless=False
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url)
        if result.success and "Access Denied" not in result.html:
            return result

    # Step 2: If blocked, try undetected browser
    print("Regular + stealth blocked, trying undetected browser...")

    adapter = UndetectedAdapter()
    strategy = AsyncPlaywrightCrawlerStrategy(
        browser_config=browser_config,
        browser_adapter=adapter
    )

    async with AsyncWebCrawler(
        crawler_strategy=strategy,
        config=browser_config
    ) as crawler:
        result = await crawler.arun(url)
        return result
```
|
||||
|
||||
## Installation
|
||||
|
||||
The undetected browser dependencies are automatically installed when you run:
|
||||
|
||||
```bash
|
||||
crawl4ai-setup
|
||||
```
|
||||
|
||||
This command installs all necessary browser dependencies for both regular and undetected modes.
|
||||
|
||||
## Limitations
|
||||
|
||||
- **Performance**: Slightly slower than regular mode due to additional patches
|
||||
- **Headless Detection**: Some sites can still detect headless mode
|
||||
- **Resource Usage**: May use more resources than regular mode
|
||||
- **Not 100% Guaranteed**: Advanced anti-bot services are constantly evolving
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Browser Not Found
|
||||
|
||||
Run the setup command:
|
||||
```bash
|
||||
crawl4ai-setup
|
||||
```
|
||||
|
||||
### Detection Still Occurring
|
||||
|
||||
Try combining with other features:
|
||||
```python
crawler_config = CrawlerRunConfig(
    simulate_user=True,  # Add user simulation
    magic=True,          # Enable magic mode
    wait_time=5.0,       # Longer waits
)
```
|
||||
|
||||
### Performance Issues
|
||||
|
||||
If experiencing slow performance:
|
||||
```python
from crawl4ai import PlaywrightAdapter, UndetectedAdapter

def is_protected_site(url: str) -> bool:
    # Placeholder heuristic - swap in whatever signal you track per site
    return "protected-site.com" in url

# Use selective undetected mode only for protected sites
if is_protected_site(url):
    adapter = UndetectedAdapter()
else:
    adapter = PlaywrightAdapter()  # Default adapter
```
|
||||
|
||||
## Future Plans
|
||||
|
||||
**Note**: In future versions of Crawl4AI, we may enable stealth mode and undetected browser by default to provide better out-of-the-box success rates. For now, users should explicitly enable these features when needed.
|
||||
|
||||
## Conclusion
|
||||
|
||||
Crawl4AI provides flexible anti-bot solutions:
|
||||
|
||||
1. **Start Simple**: Use regular browser + stealth mode for most sites
|
||||
2. **Escalate if Needed**: Switch to undetected browser for sophisticated protection
|
||||
3. **Combine for Maximum Effect**: Use both features together when facing the toughest challenges
|
||||
|
||||
Remember:
|
||||
- Always respect robots.txt and website terms of service
|
||||
- Use appropriate delays to avoid overwhelming servers
|
||||
- Consider the performance trade-offs of each approach
|
||||
- Test progressively to find the minimum necessary evasion level
|
||||
|
||||
## See Also
|
||||
|
||||
- [Advanced Features](advanced-features.md) - Overview of all advanced features
|
||||
- [Proxy & Security](proxy-security.md) - Using proxies with anti-bot features
|
||||
- [Session Management](session-management.md) - Maintaining sessions across requests
|
||||
- [Identity Based Crawling](identity-based-crawling.md) - Additional anti-detection strategies
|
||||
@@ -20,24 +20,30 @@ Ever wondered why your AI coding assistant struggles with your library despite c
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md)
|
||||
*January 28, 2025*
|
||||
### [Crawl4AI v0.7.3 – The Multi-Config Intelligence Update](releases/0.7.3.md)
|
||||
*August 6, 2025*
|
||||
|
||||
Crawl4AI v0.7.0 introduces groundbreaking intelligence features that transform how crawlers understand and adapt to websites. This release brings Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, and the powerful Async URL Seeder for massive URL discovery.
|
||||
Crawl4AI v0.7.3 brings smarter URL-specific configurations, flexible Docker deployments, and critical stability improvements. Configure different crawling strategies for different URL patterns in a single batch—perfect for mixed content sites with docs, blogs, and APIs.
|
||||
|
||||
Key highlights:
|
||||
- **Adaptive Crawling**: Crawlers that learn and adapt to website structures automatically
|
||||
- **Virtual Scroll Support**: Complete content extraction from modern infinite scroll pages
|
||||
- **Link Preview**: 3-layer scoring system for intelligent link prioritization
|
||||
- **Async URL Seeder**: Discover thousands of URLs in seconds with smart filtering
|
||||
- **Performance Boost**: Up to 3x faster with optimized resource handling
|
||||
- **Multi-URL Configurations**: Different strategies for different URL patterns in one crawl
|
||||
- **Flexible Docker LLM Providers**: Configure providers via environment variables
|
||||
- **Bug Fixes**: Critical stability improvements for production deployments
|
||||
- **Documentation Updates**: Clearer examples and improved API documentation
|
||||
|
||||
[Read full release notes →](releases/0.7.0.md)
|
||||
[Read full release notes →](releases/0.7.3.md)
|
||||
|
||||
---
|
||||
|
||||
## Previous Releases
|
||||
|
||||
### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md)
|
||||
*January 28, 2025*
|
||||
|
||||
Introduced groundbreaking intelligence features including Adaptive Crawling, Virtual Scroll support, intelligent Link Preview, and the Async URL Seeder for massive URL discovery.
|
||||
|
||||
[Read release notes →](releases/0.7.0.md)
|
||||
|
||||
### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
|
||||
*December 23, 2024*
|
||||
|
||||
|
||||
170
docs/md_v2/blog/releases/0.7.3.md
Normal file
@@ -0,0 +1,170 @@
|
||||
# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update
|
||||
|
||||
*August 6, 2025 • 5 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
|
||||
- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
|
||||
- **Bug Fixes**: Resolved several critical issues for better stability
|
||||
- **Documentation Updates**: Clearer examples and improved API documentation
|
||||
|
||||
## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
|
||||
|
||||
**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.
|
||||
|
||||
**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy, MatchMode

# Define specialized configs for different content types
configs = [
    # Documentation sites - aggressive caching, include links
    CrawlerRunConfig(
        url_matcher=["*docs*", "*documentation*"],
        cache_mode="write",
        markdown_generator_options={"include_links": True}
    ),

    # News/blog sites - fresh content, scroll for lazy loading
    CrawlerRunConfig(
        url_matcher=lambda url: 'blog' in url or 'news' in url,
        cache_mode="bypass",
        js_code="window.scrollTo(0, document.body.scrollHeight/2);"
    ),

    # API endpoints - structured extraction
    CrawlerRunConfig(
        url_matcher=["*.json", "*api*"],
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o-mini",
            extraction_type="structured"
        )
    ),

    # Default fallback for everything else
    CrawlerRunConfig()  # No url_matcher = matches everything
]

# Crawl multiple URLs with appropriate configs
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=[
            "https://docs.python.org/3/",     # → Uses documentation config
            "https://blog.python.org/",       # → Uses blog config
            "https://api.github.com/users",   # → Uses API config
            "https://example.com/"            # → Uses default config
        ],
        config=configs
    )
```
|
||||
|
||||
**Matching Capabilities:**
|
||||
- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
|
||||
- **Function Matchers**: Lambda functions for complex logic
|
||||
- **Mixed Matchers**: Combine strings and functions with AND/OR logic (see the sketch after this list)
|
||||
- **Fallback Support**: Default config when nothing matches
|
||||
|
||||
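The `MatchMode` import in the example above is what lets you combine several matchers on one config. A small sketch (the `match_mode` parameter name is assumed from that import, so verify it against your version):

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Match only PDFs that also live under an /en/ path:
# with MatchMode.AND every matcher in the list must agree.
pdf_english_config = CrawlerRunConfig(
    url_matcher=["*.pdf", lambda url: "/en/" in url],
    match_mode=MatchMode.AND,
)
```
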
**Expected Real-World Impact:**
|
||||
- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
|
||||
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
|
||||
- **Reduced Complexity**: No more if/else forests in your extraction code
|
||||
- **Better Performance**: Each URL gets exactly the processing it needs
|
||||
|
||||
## 🐳 Docker: Flexible LLM Provider Configuration
|
||||
|
||||
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
|
||||
|
||||
**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
|
||||
|
||||
### Deployment Flexibility
|
||||
|
||||
```bash
# Option 1: Direct environment variables
docker run -d \
  -e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
  -e GROQ_API_KEY="your-key" \
  -p 11235:11235 \
  unclecode/crawl4ai:latest

# Option 2: Using .llm.env file (recommended for production)
# Create .llm.env file:
# LLM_PROVIDER=openai/gpt-4o-mini
# OPENAI_API_KEY=your-openai-key
# GROQ_API_KEY=your-groq-key

docker run -d \
  --env-file .llm.env \
  -p 11235:11235 \
  unclecode/crawl4ai:latest
```
|
||||
|
||||
Override per request when needed:
|
||||
```python
import requests

# Use default provider from .llm.env
response = requests.post("http://localhost:11235/crawl", json={
    "url": "https://example.com",
    "extraction_strategy": {"type": "llm"}
})

# Override to use different provider for this specific request
response = requests.post("http://localhost:11235/crawl", json={
    "url": "https://complex-page.com",
    "extraction_strategy": {
        "type": "llm",
        "provider": "openai/gpt-4"  # Override default
    }
})
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex
|
||||
- **A/B Testing**: Compare provider performance without deployment changes
|
||||
- **Fallback Strategies**: Switch providers on-the-fly during outages
|
||||
- **Development Flexibility**: Test locally with one provider, deploy with another
|
||||
- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands
|
||||
|
||||
## 🔧 Bug Fixes & Improvements
|
||||
|
||||
This release includes several important bug fixes that improve stability and reliability:
|
||||
|
||||
- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
|
||||
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
|
||||
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
|
||||
- **Table Extraction**: Improved table detection and extraction accuracy
|
||||
- **Error Handling**: Better error messages and recovery from network failures
|
||||
|
||||
## 📚 Documentation Enhancements
|
||||
|
||||
Based on community feedback, we've updated:
|
||||
- Clearer examples for multi-URL configuration
|
||||
- Improved CrawlResult documentation with all available fields
|
||||
- Fixed typos and inconsistencies across documentation
|
||||
- Added real-world URLs in examples for better understanding
|
||||
- New comprehensive demo showcasing all v0.7.3 features
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and the entire community for feedback and bug reports.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -29,6 +29,7 @@ class BrowserConfig:
|
||||
text_mode=False,
|
||||
light_mode=False,
|
||||
extra_args=None,
|
||||
enable_stealth=False,
|
||||
# ... other advanced parameters omitted here
|
||||
):
|
||||
...
|
||||
@@ -84,6 +85,11 @@ class BrowserConfig:
|
||||
- Additional flags for the underlying browser.
|
||||
- E.g. `["--disable-extensions"]`.
|
||||
|
||||
11. **`enable_stealth`**:
|
||||
- If `True`, enables stealth mode using playwright-stealth.
|
||||
- Modifies browser fingerprints to avoid basic bot detection.
|
||||
- Default is `False`. Recommended for sites with bot protection.
|
||||
|
||||
### Helper Methods
|
||||
|
||||
Both configuration classes provide a `clone()` method to create modified copies:
|
||||
|
||||
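For example, a short sketch of cloning a base configuration with a few overrides (fields you do not name keep their original values):

```python
from crawl4ai import BrowserConfig, CrawlerRunConfig

base_browser = BrowserConfig(headless=True, text_mode=True)
# clone() returns a copy with only the named fields changed
debug_browser = base_browser.clone(headless=False, verbose=True)

base_run = CrawlerRunConfig(cache_mode="bypass")
screenshot_run = base_run.clone(screenshot=True)
```
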
@@ -187,7 +187,7 @@ Here:
|
||||
|
||||
---
|
||||
|
||||
## 5. More Fields: Links, Media, and More
|
||||
## 5. More Fields: Links, Media, Tables and More
|
||||
|
||||
### 5.1 `links`
|
||||
|
||||
@@ -207,7 +207,77 @@ for img in images:
|
||||
print("Image URL:", img["src"], "Alt:", img.get("alt"))
|
||||
```
|
||||
|
||||
### 5.3 `screenshot`, `pdf`, and `mhtml`
|
||||
### 5.3 `tables`
|
||||
|
||||
The `tables` field contains structured data extracted from HTML tables found on the crawled page. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
|
||||
|
||||
- Presence of thead and tbody sections
|
||||
- Use of th elements for headers
|
||||
- Column consistency
|
||||
- Text density
|
||||
- And other factors
|
||||
|
||||
Tables that score above the threshold (default: 7) are extracted and stored in result.tables.
|
||||
|
||||
### Accessing Table data:
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.w3schools.com/html/html_tables.asp",
            config=CrawlerRunConfig(
                table_score_threshold=7  # Minimum score for table detection
            )
        )

        if result.success and result.tables:
            print(f"Found {len(result.tables)} tables")

            for i, table in enumerate(result.tables):
                print(f"\nTable {i+1}:")
                print(f"Caption: {table.get('caption', 'No caption')}")
                print(f"Headers: {table['headers']}")
                print(f"Rows: {len(table['rows'])}")

                # Print first few rows as example
                for j, row in enumerate(table['rows'][:3]):
                    print(f"  Row {j+1}: {row}")

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
### Configuring Table Extraction:
|
||||
|
||||
You can adjust the sensitivity of the table detection algorithm with:
|
||||
|
||||
```python
config = CrawlerRunConfig(
    table_score_threshold=5  # Lower value = more tables detected (default: 7)
)
```
|
||||
|
||||
Each extracted table contains:
|
||||
|
||||
- `headers`: Column header names
|
||||
- `rows`: List of rows, each containing cell values
|
||||
- `caption`: Table caption text (if available)
|
||||
- `summary`: Table summary attribute (if specified)
|
||||
|
||||
### Table Extraction Tips
|
||||
|
||||
- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
|
||||
- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
|
||||
- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
|
||||
|
||||
The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
|
||||
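Because each table already ships as `headers` plus `rows`, it drops straight into a DataFrame if you happen to have pandas available (pandas is not a Crawl4AI dependency, so treat this as an optional convenience):

```python
import pandas as pd  # optional dependency, not installed by Crawl4AI

def tables_to_dataframes(result):
    """Turn every extracted table on a CrawlResult into a pandas DataFrame."""
    return [
        pd.DataFrame(table["rows"], columns=table["headers"])
        for table in (result.tables or [])
    ]
```
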
|
||||
|
||||
### 5.4 `screenshot`, `pdf`, and `mhtml`
|
||||
|
||||
If you set `screenshot=True`, `pdf=True`, or `capture_mhtml=True` in **`CrawlerRunConfig`**, then:
|
||||
|
||||
@@ -228,7 +298,7 @@ if result.mhtml:
|
||||
|
||||
The MHTML (MIME HTML) format is particularly useful as it captures the entire web page including all of its resources (CSS, images, scripts, etc.) in a single file, making it perfect for archiving or offline viewing.
|
||||
|
||||
### 5.4 `ssl_certificate`
|
||||
### 5.5 `ssl_certificate`
|
||||
|
||||
If `fetch_ssl_certificate=True`, `result.ssl_certificate` holds details about the site’s SSL cert, such as issuer, validity dates, etc.
|
||||
|
||||
|
||||
@@ -58,15 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
|
||||
|
||||
#### 1. Pull the Image
|
||||
|
||||
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
||||
Our latest release is `0.7.3`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
||||
|
||||
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
|
||||
> 💡 **Note**: The `latest` tag points to the stable `0.7.3` version.
|
||||
|
||||
```bash
|
||||
# Pull the release candidate (for testing new features)
|
||||
docker pull unclecode/crawl4ai:0.7.0-r1
|
||||
# Pull the latest version
|
||||
docker pull unclecode/crawl4ai:0.7.3
|
||||
|
||||
# Or pull the current stable version (0.6.0)
|
||||
# Or pull using the latest tag
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
@@ -126,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
|
||||
#### Docker Hub Versioning Explained
|
||||
|
||||
* **Image Name:** `unclecode/crawl4ai`
|
||||
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
|
||||
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.3`)
|
||||
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
|
||||
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
|
||||
* **`latest` Tag:** Points to the most recent stable version
|
||||
|
||||
@@ -54,6 +54,16 @@ This page provides a comprehensive list of example scripts that demonstrate vari
|
||||
| Crypto Analysis | Demonstrates how to crawl and analyze cryptocurrency data. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/crypto_analysis_example.py) |
|
||||
| SERP API | Demonstrates using Crawl4AI with search engine result pages. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/serp_api_project_11_feb.py) |
|
||||
|
||||
## Anti-Bot & Stealth Features
|
||||
|
||||
| Example | Description | Link |
|
||||
|---------|-------------|------|
|
||||
| Stealth Mode Quick Start | Five practical examples showing how to use stealth mode for bypassing basic bot detection. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_quick_start.py) |
|
||||
| Stealth Mode Comprehensive | Comprehensive demonstration of stealth mode features with bot detection testing and comparisons. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/stealth_mode_example.py) |
|
||||
| Undetected Browser | Simple example showing how to use the undetected browser adapter. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/hello_world_undetected.py) |
|
||||
| Undetected Browser Demo | Basic demo comparing regular and undetected browser modes. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/undetected_simple_demo.py) |
|
||||
| Undetected Tests | Advanced tests comparing regular vs undetected browsers on various bot detection services. | [View Folder](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/undetectability/) |
|
||||
|
||||
## Customization & Security
|
||||
|
||||
| Example | Description | Link |
|
||||
|
||||
@@ -18,7 +18,7 @@ crawl4ai-setup
|
||||
```
|
||||
|
||||
**What does it do?**
|
||||
- Installs or updates required Playwright browsers (Chromium, Firefox, etc.)
|
||||
- Installs or updates required browser dependencies for both regular and undetected modes
|
||||
- Performs OS-level checks (e.g., missing libs on Linux)
|
||||
- Confirms your environment is ready to crawl
|
||||
|
||||
|
||||
@@ -520,7 +520,8 @@ This approach is handy when you still want external links but need to block cert
|
||||
|
||||
### 4.1 Accessing `result.media`
|
||||
|
||||
By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).
|
||||
By default, Crawl4AI collects images, audio and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
|
||||
**Note: Tables have been moved from `result.media["tables"]` to the new `result.tables` format for better organization and direct access.**
|
||||
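For code written against the old location, the migration is a one-liner:

```python
# Before: tables lived inside result.media
tables = result.media.get("tables", [])

# After: tables are a top-level field
tables = result.tables or []
```
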
|
||||
**Basic Example**:
|
||||
|
||||
@@ -534,14 +535,6 @@ if result.success:
|
||||
print(f" Alt text: {img.get('alt', '')}")
|
||||
print(f" Score: {img.get('score')}")
|
||||
print(f" Description: {img.get('desc', '')}\n")
|
||||
|
||||
# Get tables
|
||||
tables = result.media.get("tables", [])
|
||||
print(f"Found {len(tables)} data tables in total.")
|
||||
for i, table in enumerate(tables):
|
||||
print(f"[Table {i}] Caption: {table.get('caption', 'No caption')}")
|
||||
print(f" Columns: {len(table.get('headers', []))}")
|
||||
print(f" Rows: {len(table.get('rows', []))}")
|
||||
```
|
||||
|
||||
**Structure Example**:
|
||||
@@ -568,19 +561,6 @@ result.media = {
|
||||
"audio": [
|
||||
# Similar structure but with audio-specific fields
|
||||
],
|
||||
"tables": [
|
||||
{
|
||||
"headers": ["Name", "Age", "Location"],
|
||||
"rows": [
|
||||
["John Doe", "34", "New York"],
|
||||
["Jane Smith", "28", "San Francisco"],
|
||||
["Alex Johnson", "42", "Chicago"]
|
||||
],
|
||||
"caption": "Employee Directory",
|
||||
"summary": "Directory of company employees"
|
||||
},
|
||||
# More tables if present
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
@@ -608,53 +588,7 @@ crawler_cfg = CrawlerRunConfig(
|
||||
|
||||
This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.
|
||||
|
||||
### 3.3 Working with Tables
|
||||
|
||||
Crawl4AI can detect and extract structured data from HTML tables. Tables are analyzed based on various criteria to determine if they are actual data tables (as opposed to layout tables), including:
|
||||
|
||||
- Presence of thead and tbody sections
|
||||
- Use of th elements for headers
|
||||
- Column consistency
|
||||
- Text density
|
||||
- And other factors
|
||||
|
||||
Tables that score above the threshold (default: 7) are extracted and stored in `result.media.tables`.
|
||||
|
||||
**Accessing Table Data**:
|
||||
|
||||
```python
|
||||
if result.success:
|
||||
tables = result.media.get("tables", [])
|
||||
print(f"Found {len(tables)} data tables on the page")
|
||||
|
||||
if tables:
|
||||
# Access the first table
|
||||
first_table = tables[0]
|
||||
print(f"Table caption: {first_table.get('caption', 'No caption')}")
|
||||
print(f"Headers: {first_table.get('headers', [])}")
|
||||
|
||||
# Print the first 3 rows
|
||||
for i, row in enumerate(first_table.get('rows', [])[:3]):
|
||||
print(f"Row {i+1}: {row}")
|
||||
```
|
||||
|
||||
**Configuring Table Extraction**:
|
||||
|
||||
You can adjust the sensitivity of the table detection algorithm with:
|
||||
|
||||
```python
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
table_score_threshold=5 # Lower value = more tables detected (default: 7)
|
||||
)
|
||||
```
|
||||
|
||||
Each extracted table contains:
|
||||
- `headers`: Column header names
|
||||
- `rows`: List of rows, each containing cell values
|
||||
- `caption`: Table caption text (if available)
|
||||
- `summary`: Table summary attribute (if specified)
|
||||
|
||||
### 3.4 Additional Media Config
|
||||
### 4.3 Additional Media Config
|
||||
|
||||
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
|
||||
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
|
||||
@@ -695,7 +629,7 @@ The MHTML format is particularly useful because:
|
||||
|
||||
---
|
||||
|
||||
## 4. Putting It All Together: Link & Media Filtering
|
||||
## 5. Putting It All Together: Link & Media Filtering
|
||||
|
||||
Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:
|
||||
|
||||
@@ -743,7 +677,7 @@ if __name__ == "__main__":
|
||||
|
||||
---
|
||||
|
||||
## 5. Common Pitfalls & Tips
|
||||
## 6. Common Pitfalls & Tips
|
||||
|
||||
1. **Conflicting Flags**:
|
||||
- `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.
|
||||
@@ -762,10 +696,3 @@ if __name__ == "__main__":
|
||||
---
|
||||
|
||||
**That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
|
||||
### Table Extraction Tips
|
||||
|
||||
- Not all HTML tables are extracted - only those detected as "data tables" vs. layout tables.
|
||||
- Tables with inconsistent cell counts, nested tables, or those used purely for layout may be skipped.
|
||||
- If you're missing tables, try adjusting the `table_score_threshold` to a lower value (default is 7).
|
||||
|
||||
The table detection algorithm scores tables based on features like consistent columns, presence of headers, text density, and more. Tables scoring above the threshold are considered data tables worth extracting.
|
||||
|
||||
@@ -45,6 +45,7 @@ nav:
|
||||
- "Lazy Loading": "advanced/lazy-loading.md"
|
||||
- "Hooks & Auth": "advanced/hooks-auth.md"
|
||||
- "Proxy & Security": "advanced/proxy-security.md"
|
||||
- "Undetected Browser": "advanced/undetected-browser.md"
|
||||
- "Session Management": "advanced/session-management.md"
|
||||
- "Multi-URL Crawling": "advanced/multi-url-crawling.md"
|
||||
- "Crawl Dispatcher": "advanced/crawl-dispatcher.md"
|
||||
|
||||
@@ -13,34 +13,34 @@ authors = [
|
||||
{name = "Unclecode", email = "unclecode@kidocode.com"}
|
||||
]
|
||||
dependencies = [
|
||||
"aiofiles>=24.1.0",
|
||||
"aiohttp>=3.11.11",
|
||||
"aiosqlite~=0.20",
|
||||
"anyio>=4.0.0",
|
||||
"lxml~=5.3",
|
||||
"litellm>=1.53.1",
|
||||
"numpy>=1.26.0,<3",
|
||||
"pillow>=10.4",
|
||||
"playwright>=1.49.0",
|
||||
"patchright>=1.49.0",
|
||||
"python-dotenv~=1.0",
|
||||
"requests~=2.26",
|
||||
"beautifulsoup4~=4.12",
|
||||
"tf-playwright-stealth>=1.1.0",
|
||||
"xxhash~=3.4",
|
||||
"rank-bm25~=0.2",
|
||||
"aiofiles>=24.1.0",
|
||||
"snowballstemmer~=2.2",
|
||||
"pydantic>=2.10",
|
||||
"pyOpenSSL>=24.3.0",
|
||||
"psutil>=6.1.1",
|
||||
"PyYAML>=6.0",
|
||||
"nltk>=3.9.1",
|
||||
"playwright",
|
||||
"rich>=13.9.4",
|
||||
"cssselect>=1.2.0",
|
||||
"httpx>=0.27.2",
|
||||
"httpx[http2]>=0.27.2",
|
||||
"fake-useragent>=2.0.3",
|
||||
"click>=8.1.7",
|
||||
"pyperclip>=1.8.2",
|
||||
"chardet>=5.2.0",
|
||||
"aiohttp>=3.11.11",
|
||||
"brotli>=1.1.0",
|
||||
"humanize>=4.10.0",
|
||||
"lark>=1.2.2",
|
||||
|
||||
@@ -1,26 +1,29 @@
|
||||
# Note: These requirements are also specified in pyproject.toml
|
||||
# This file is kept for development environment setup and compatibility
|
||||
aiofiles>=24.1.0
|
||||
aiohttp>=3.11.11
|
||||
aiosqlite~=0.20
|
||||
anyio>=4.0.0
|
||||
lxml~=5.3
|
||||
litellm>=1.53.1
|
||||
numpy>=1.26.0,<3
|
||||
pillow>=10.4
|
||||
playwright>=1.49.0
|
||||
patchright>=1.49.0
|
||||
python-dotenv~=1.0
|
||||
requests~=2.26
|
||||
beautifulsoup4~=4.12
|
||||
tf-playwright-stealth>=1.1.0
|
||||
xxhash~=3.4
|
||||
rank-bm25~=0.2
|
||||
aiofiles>=24.1.0
|
||||
colorama~=0.4
|
||||
snowballstemmer~=2.2
|
||||
pydantic>=2.10
|
||||
pyOpenSSL>=24.3.0
|
||||
psutil>=6.1.1
|
||||
PyYAML>=6.0
|
||||
nltk>=3.9.1
|
||||
rich>=13.9.4
|
||||
cssselect>=1.2.0
|
||||
chardet>=5.2.0
|
||||
brotli>=1.1.0
|
||||
httpx[http2]>=0.27.2
|
||||
|
||||
344
tests/check_dependencies.py
Executable file
@@ -0,0 +1,344 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Dependency checker for Crawl4AI
|
||||
Analyzes imports in the codebase and shows which files use them
|
||||
"""
|
||||
|
||||
import ast
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Set, Dict, List, Tuple
|
||||
from collections import defaultdict
|
||||
import re
|
||||
import toml
|
||||
|
||||
# Standard library modules to ignore
|
||||
STDLIB_MODULES = {
|
||||
'abc', 'argparse', 'asyncio', 'base64', 'collections', 'concurrent', 'contextlib',
|
||||
'copy', 'datetime', 'decimal', 'email', 'enum', 'functools', 'glob', 'hashlib',
|
||||
'http', 'importlib', 'io', 'itertools', 'json', 'logging', 'math', 'mimetypes',
|
||||
'multiprocessing', 'os', 'pathlib', 'pickle', 'platform', 'pprint', 'random',
|
||||
're', 'shutil', 'signal', 'socket', 'sqlite3', 'string', 'subprocess', 'sys',
|
||||
'tempfile', 'threading', 'time', 'traceback', 'typing', 'unittest', 'urllib',
|
||||
'uuid', 'warnings', 'weakref', 'xml', 'zipfile', 'dataclasses', 'secrets',
|
||||
'statistics', 'textwrap', 'queue', 'csv', 'gzip', 'tarfile', 'configparser',
|
||||
'inspect', 'operator', 'struct', 'binascii', 'codecs', 'locale', 'gc',
|
||||
'atexit', 'builtins', 'html', 'errno', 'fcntl', 'pwd', 'grp', 'resource',
|
||||
'termios', 'tty', 'pty', 'select', 'selectors', 'ssl', 'zlib', 'bz2',
|
||||
'lzma', 'types', 'copy', 'pydoc', 'profile', 'cProfile', 'timeit',
|
||||
'trace', 'doctest', 'pdb', 'contextvars', 'dataclasses', 'graphlib',
|
||||
'zoneinfo', 'tomllib', 'cgi', 'wsgiref', 'fileinput', 'linecache',
|
||||
'tokenize', 'tabnanny', 'compileall', 'dis', 'pickletools', 'formatter',
|
||||
'__future__', 'array', 'ctypes', 'heapq', 'bisect', 'array', 'weakref',
|
||||
'types', 'copy', 'pprint', 'repr', 'numbers', 'cmath', 'fractions',
|
||||
'statistics', 'itertools', 'functools', 'operator', 'pathlib', 'fileinput',
|
||||
'stat', 'filecmp', 'tempfile', 'glob', 'fnmatch', 'linecache', 'shutil',
|
||||
'pickle', 'copyreg', 'shelve', 'marshal', 'dbm', 'sqlite3', 'zlib', 'gzip',
|
||||
'bz2', 'lzma', 'zipfile', 'tarfile', 'configparser', 'netrc', 'xdrlib',
|
||||
'plistlib', 'hashlib', 'hmac', 'secrets', 'os', 'io', 'time', 'argparse',
|
||||
'getopt', 'logging', 'getpass', 'curses', 'platform', 'errno', 'ctypes',
|
||||
'threading', 'multiprocessing', 'concurrent', 'subprocess', 'sched', 'queue',
|
||||
'contextvars', 'asyncio', 'socket', 'ssl', 'email', 'json', 'mailcap',
|
||||
'mailbox', 'mimetypes', 'base64', 'binhex', 'binascii', 'quopri', 'uu',
|
||||
'html', 'xml', 'webbrowser', 'cgi', 'cgitb', 'wsgiref', 'urllib', 'http',
|
||||
'ftplib', 'poplib', 'imaplib', 'nntplib', 'smtplib', 'smtpd', 'telnetlib',
|
||||
'uuid', 'socketserver', 'xmlrpc', 'ipaddress', 'audioop', 'aifc', 'sunau',
|
||||
'wave', 'chunk', 'colorsys', 'imghdr', 'sndhdr', 'ossaudiodev', 'gettext',
|
||||
'locale', 'turtle', 'cmd', 'shlex', 'tkinter', 'typing', 'pydoc', 'doctest',
|
||||
'unittest', 'test', '2to3', 'distutils', 'venv', 'ensurepip', 'zipapp',
|
||||
'py_compile', 'compileall', 'dis', 'pickletools', 'pdb', 'timeit', 'trace',
|
||||
'tracemalloc', 'warnings', 'faulthandler', 'pdb', 'dataclasses', 'cgi',
|
||||
'cgitb', 'chunk', 'crypt', 'imghdr', 'mailcap', 'nis', 'nntplib', 'optparse',
|
||||
'ossaudiodev', 'pipes', 'smtpd', 'sndhdr', 'spwd', 'sunau', 'telnetlib',
|
||||
'uu', 'xdrlib', 'msilib', 'pstats', 'rlcompleter', 'tkinter', 'ast'
|
||||
}
|
||||
|
||||
# Known package name mappings (import name -> package name)
|
||||
PACKAGE_MAPPINGS = {
|
||||
'bs4': 'beautifulsoup4',
|
||||
'PIL': 'pillow',
|
||||
'cv2': 'opencv-python',
|
||||
'sklearn': 'scikit-learn',
|
||||
'yaml': 'PyYAML',
|
||||
'OpenSSL': 'pyOpenSSL',
|
||||
'sqlalchemy': 'SQLAlchemy',
|
||||
'playwright': 'playwright',
|
||||
'patchright': 'patchright',
|
||||
'dotenv': 'python-dotenv',
|
||||
'fake_useragent': 'fake-useragent',
|
||||
'playwright_stealth': 'tf-playwright-stealth',
|
||||
'sentence_transformers': 'sentence-transformers',
|
||||
'rank_bm25': 'rank-bm25',
|
||||
'snowballstemmer': 'snowballstemmer',
|
||||
'PyPDF2': 'PyPDF2',
|
||||
'pdf2image': 'pdf2image',
|
||||
}
|
||||
|
||||
|
||||
class ImportVisitor(ast.NodeVisitor):
|
||||
"""AST visitor to extract imports from Python files"""
|
||||
|
||||
def __init__(self):
|
||||
self.imports = {} # Changed to dict to store line numbers
|
||||
self.from_imports = {}
|
||||
|
||||
def visit_Import(self, node):
|
||||
for alias in node.names:
|
||||
module_name = alias.name.split('.')[0]
|
||||
if module_name not in self.imports:
|
||||
self.imports[module_name] = []
|
||||
self.imports[module_name].append(node.lineno)
|
||||
|
||||
def visit_ImportFrom(self, node):
|
||||
if node.module and node.level == 0: # absolute imports only
|
||||
module_name = node.module.split('.')[0]
|
||||
if module_name not in self.from_imports:
|
||||
self.from_imports[module_name] = []
|
||||
self.from_imports[module_name].append(node.lineno)
|
||||
|
||||
|
||||
def extract_imports_from_file(filepath: Path) -> Dict[str, List[int]]:
|
||||
"""Extract all imports from a Python file with line numbers"""
|
||||
all_imports = {}
|
||||
|
||||
try:
|
||||
with open(filepath, 'r', encoding='utf-8') as f:
|
||||
content = f.read()
|
||||
|
||||
tree = ast.parse(content)
|
||||
visitor = ImportVisitor()
|
||||
visitor.visit(tree)
|
||||
|
||||
# Merge imports and from_imports
|
||||
for module, lines in visitor.imports.items():
|
||||
if module not in all_imports:
|
||||
all_imports[module] = []
|
||||
all_imports[module].extend(lines)
|
||||
|
||||
for module, lines in visitor.from_imports.items():
|
||||
if module not in all_imports:
|
||||
all_imports[module] = []
|
||||
all_imports[module].extend(lines)
|
||||
|
||||
except Exception as e:
|
||||
# Silently skip files that can't be parsed
|
||||
pass
|
||||
|
||||
return all_imports
|
||||
|
||||
|
||||
def get_codebase_imports_with_files(root_dir: Path) -> Dict[str, List[Tuple[str, List[int]]]]:
|
||||
"""Get all imports from the crawl4ai library and docs folders with file locations and line numbers"""
|
||||
import_to_files = defaultdict(list)
|
||||
|
||||
# Only scan crawl4ai library folder and docs folder
|
||||
target_dirs = [
|
||||
root_dir / 'crawl4ai',
|
||||
root_dir / 'docs'
|
||||
]
|
||||
|
||||
for target_dir in target_dirs:
|
||||
if not target_dir.exists():
|
||||
continue
|
||||
|
||||
for py_file in target_dir.rglob('*.py'):
|
||||
# Skip __pycache__ directories
|
||||
if '__pycache__' in py_file.parts:
|
||||
continue
|
||||
|
||||
# Skip setup.py and similar files
|
||||
if py_file.name in ['setup.py', 'setup.cfg', 'conf.py']:
|
||||
continue
|
||||
|
||||
imports = extract_imports_from_file(py_file)
|
||||
|
||||
# Map each import to the file and line numbers
|
||||
for imp, line_numbers in imports.items():
|
||||
relative_path = py_file.relative_to(root_dir)
|
||||
import_to_files[imp].append((str(relative_path), sorted(line_numbers)))
|
||||
|
||||
return dict(import_to_files)
|
||||
|
||||
|
||||
def get_declared_dependencies() -> Set[str]:
|
||||
"""Get declared dependencies from pyproject.toml and requirements.txt"""
|
||||
declared = set()
|
||||
|
||||
# Read from pyproject.toml
|
||||
if Path('pyproject.toml').exists():
|
||||
with open('pyproject.toml', 'r') as f:
|
||||
data = toml.load(f)
|
||||
|
||||
# Get main dependencies
|
||||
deps = data.get('project', {}).get('dependencies', [])
|
||||
for dep in deps:
|
||||
# Parse dependency string (e.g., "numpy>=1.26.0,<3")
|
||||
match = re.match(r'^([a-zA-Z0-9_-]+)', dep)
|
||||
if match:
|
||||
pkg_name = match.group(1).lower()
|
||||
declared.add(pkg_name)
|
||||
|
||||
# Get optional dependencies
|
||||
optional = data.get('project', {}).get('optional-dependencies', {})
|
||||
for group, deps in optional.items():
|
||||
for dep in deps:
|
||||
match = re.match(r'^([a-zA-Z0-9_-]+)', dep)
|
||||
if match:
|
||||
pkg_name = match.group(1).lower()
|
||||
declared.add(pkg_name)
|
||||
|
||||
# Also check requirements.txt as backup
|
||||
if Path('requirements.txt').exists():
|
||||
with open('requirements.txt', 'r') as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if line and not line.startswith('#'):
|
||||
match = re.match(r'^([a-zA-Z0-9_-]+)', line)
|
||||
if match:
|
||||
pkg_name = match.group(1).lower()
|
||||
declared.add(pkg_name)
|
||||
|
||||
return declared
|
||||
|
||||
|
||||
def normalize_package_name(name: str) -> str:
|
||||
"""Normalize package name for comparison"""
|
||||
# Handle known mappings first
|
||||
if name in PACKAGE_MAPPINGS:
|
||||
return PACKAGE_MAPPINGS[name].lower()
|
||||
|
||||
# Basic normalization
|
||||
return name.lower().replace('_', '-')
|
||||
|
||||
|
||||
def check_missing_dependencies():
|
||||
"""Main function to check for missing dependencies"""
|
||||
print("🔍 Analyzing crawl4ai library and docs folders...\n")
|
||||
|
||||
# Get all imports with their file locations
|
||||
root_dir = Path('.')
|
||||
import_to_files = get_codebase_imports_with_files(root_dir)
|
||||
|
||||
# Get declared dependencies
|
||||
declared_deps = get_declared_dependencies()
|
||||
|
||||
# Normalize declared dependencies
|
||||
normalized_declared = {normalize_package_name(dep) for dep in declared_deps}
|
||||
|
||||
# Categorize imports
|
||||
external_imports = {}
|
||||
local_imports = {}
|
||||
|
||||
# Known local packages
|
||||
local_packages = {'crawl4ai'}
|
||||
|
||||
for imp, file_info in import_to_files.items():
|
||||
# Skip standard library
|
||||
if imp in STDLIB_MODULES:
|
||||
continue
|
||||
|
||||
# Check if it's a local import
|
||||
if any(imp.startswith(local) for local in local_packages):
|
||||
local_imports[imp] = file_info
|
||||
else:
|
||||
external_imports[imp] = file_info
|
||||
|
||||
# Check which external imports are not declared
|
||||
not_declared = {}
|
||||
declared_imports = {}
|
||||
|
||||
for imp, file_info in external_imports.items():
|
||||
normalized_imp = normalize_package_name(imp)
|
||||
|
||||
# Check if import is covered by declared dependencies
|
||||
found = False
|
||||
for declared in normalized_declared:
|
||||
if normalized_imp == declared or normalized_imp.startswith(declared + '.') or declared.startswith(normalized_imp):
|
||||
found = True
|
||||
break
|
||||
|
||||
if found:
|
||||
declared_imports[imp] = file_info
|
||||
else:
|
||||
not_declared[imp] = file_info
|
||||
|
||||
# Print results
|
||||
print(f"📊 Summary:")
|
||||
print(f" - Total unique imports: {len(import_to_files)}")
|
||||
print(f" - External imports: {len(external_imports)}")
|
||||
print(f" - Declared dependencies: {len(declared_deps)}")
|
||||
print(f" - External imports NOT in dependencies: {len(not_declared)}\n")
|
||||
|
||||
if not_declared:
|
||||
print("❌ External imports NOT declared in pyproject.toml or requirements.txt:\n")
|
||||
|
||||
# Sort by import name
|
||||
for imp in sorted(not_declared.keys()):
|
||||
file_info = not_declared[imp]
|
||||
print(f" 📦 {imp}")
|
||||
if imp in PACKAGE_MAPPINGS:
|
||||
print(f" → Package name: {PACKAGE_MAPPINGS[imp]}")
|
||||
|
||||
# Show up to 3 files that use this import
|
||||
for i, (file_path, line_numbers) in enumerate(file_info[:3]):
|
||||
# Format line numbers for clickable output
|
||||
if len(line_numbers) == 1:
|
||||
print(f" - {file_path}:{line_numbers[0]}")
|
||||
else:
|
||||
# Show first few line numbers
|
||||
line_str = ','.join(str(ln) for ln in line_numbers[:3])
|
||||
if len(line_numbers) > 3:
|
||||
line_str += f"... ({len(line_numbers)} imports)"
|
||||
print(f" - {file_path}: lines {line_str}")
|
||||
|
||||
if len(file_info) > 3:
|
||||
print(f" ... and {len(file_info) - 3} more files")
|
||||
print()
|
||||
|
||||
# Check for potentially unused dependencies
|
||||
print("\n🔎 Checking declared dependencies usage...\n")
|
||||
|
||||
# Get all used external packages
|
||||
used_packages = set()
|
||||
for imp in external_imports.keys():
|
||||
normalized = normalize_package_name(imp)
|
||||
used_packages.add(normalized)
|
||||
|
||||
# Find unused
|
||||
unused = []
|
||||
for dep in declared_deps:
|
||||
normalized_dep = normalize_package_name(dep)
|
||||
|
||||
# Check if any import uses this dependency
|
||||
found_usage = False
|
||||
for used in used_packages:
|
||||
if used == normalized_dep or used.startswith(normalized_dep) or normalized_dep.startswith(used):
|
||||
found_usage = True
|
||||
break
|
||||
|
||||
if not found_usage:
|
||||
# Some packages are commonly unused directly
|
||||
indirect_deps = {'wheel', 'setuptools', 'pip', 'colorama', 'certifi', 'packaging', 'urllib3'}
|
||||
if normalized_dep not in indirect_deps:
|
||||
unused.append(dep)
|
||||
|
||||
if unused:
|
||||
print("⚠️ Declared dependencies with NO imports found:")
|
||||
for dep in sorted(unused):
|
||||
print(f" - {dep}")
|
||||
print("\n Note: These might be used indirectly or by other dependencies")
|
||||
else:
|
||||
print("✅ All declared dependencies have corresponding imports")
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("💡 How to use this report:")
|
||||
print(" 1. Check each ❌ import to see if it's legitimate")
|
||||
print(" 2. If legitimate, add the package to pyproject.toml")
|
||||
print(" 3. If it's an internal module or typo, fix the import")
|
||||
print(" 4. Review unused dependencies - remove if truly not needed")
|
||||
print("="*60)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
check_missing_dependencies()