Compare commits
1 commit
fix/docker ... fix/deep-c

| Author | SHA1 | Date |
|---|---|---|
|  | 88a9fbbb7e |  |

.github/FUNDING.yml (vendored) · 7 lines changed

@@ -1,7 +0,0 @@
# These are supported funding model platforms

# GitHub Sponsors
github: unclecode

# Custom links for enterprise inquiries (uncomment when ready)
# custom: ["https://crawl4ai.com/enterprise"]

CHANGELOG.md · 70 lines changed

@@ -5,76 +5,6 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.7.3] - 2025-08-09

### Added
- **🕵️ Undetected Browser Support**: New browser adapter pattern with stealth capabilities
  - `browser_adapter.py` with undetected Chrome integration
  - Bypass sophisticated bot detection systems (Cloudflare, Akamai, custom solutions)
  - Support for headless stealth mode with anti-detection techniques
  - Human-like behavior simulation with random mouse movements and scrolling
  - Comprehensive examples for anti-bot strategies and stealth crawling
  - Full documentation guide for undetected browser usage

- **🎨 Multi-URL Configuration System**: URL-specific crawler configurations for batch processing
  - Different crawling strategies for different URL patterns in a single batch
  - Support for string patterns with wildcards (`"*.pdf"`, `"*/blog/*"`)
  - Lambda function matchers for complex URL logic
  - Mixed matchers combining strings and functions with AND/OR logic
  - Fallback configuration support when no patterns match
  - First-match-wins configuration selection with optional fallback

- **🧠 Memory Monitoring & Optimization**: Comprehensive memory usage tracking
  - New `memory_utils.py` module for memory monitoring and optimization
  - Real-time memory usage tracking during crawl sessions
  - Memory leak detection and reporting
  - Performance optimization recommendations
  - Peak memory usage analysis and efficiency metrics
  - Automatic cleanup suggestions for memory-intensive operations

- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
  - Direct `result.tables` interface replacing generic `result.media` approach
  - Instant pandas DataFrame conversion with `pd.DataFrame(table['data'])`
  - Enhanced table detection algorithms for better accuracy
  - Table metadata including source XPath and headers
  - Improved table structure preservation during extraction

- **💰 GitHub Sponsors Integration**: 4-tier sponsorship system
  - Supporter ($5/month): Community support + early feature previews
  - Professional ($25/month): Priority support + beta access
  - Business ($100/month): Direct consultation + custom integrations
  - Enterprise ($500/month): Dedicated support + feature development
  - Custom arrangement options for larger organizations

- **🐳 Docker LLM Provider Flexibility**: Environment-based LLM configuration
  - `LLM_PROVIDER` environment variable support for dynamic provider switching
  - `.llm.env` file support for secure configuration management
  - Per-request provider override capabilities in API endpoints
  - Support for OpenAI, Groq, and other providers without rebuilding images
  - Enhanced Docker documentation with deployment examples

### Fixed
- **URL Matcher Fallback**: Resolved edge cases in URL pattern matching logic
- **Memory Management**: Fixed memory leaks in long-running crawl sessions
- **Sitemap Processing**: Improved redirect handling in sitemap fetching
- **Table Extraction**: Enhanced table detection and extraction accuracy
- **Error Handling**: Better error messages and recovery from network failures

### Changed
- **Architecture Refactoring**: Major cleanup and optimization
  - Moved 2,450+ lines from main `async_crawler_strategy.py` to backup
  - Cleaner separation of concerns in crawler architecture
  - Better maintainability and code organization
  - Preserved backward compatibility while improving performance

### Documentation
- **Comprehensive Examples**: Added real-world URLs and practical use cases
- **API Documentation**: Complete CrawlResult field documentation with all available fields
- **Migration Guides**: Updated table extraction patterns from `result.media` to `result.tables`
- **Undetected Browser Guide**: Full documentation for stealth mode and anti-bot strategies
- **Multi-Config Examples**: Detailed examples for URL-specific configurations
- **Docker Deployment**: Enhanced Docker documentation with LLM provider configuration

## [0.7.x] - 2025-06-29

### Added

README-first.md · 809 lines changed

@@ -1,809 +0,0 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[](https://github.com/unclecode/crawl4ai/stargazers)
[](https://github.com/unclecode/crawl4ai/network/members)

[](https://badge.fury.io/py/crawl4ai)
[](https://pypi.org/project/crawl4ai/)
[](https://pepy.tech/project/crawl4ai)
[](https://github.com/sponsors/unclecode)

<p align="center">
  <a href="https://x.com/crawl4ai">
    <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
  </a>
  <a href="https://www.linkedin.com/company/crawl4ai">
    <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
  </a>
  <a href="https://discord.gg/jP8KfhDhyN">
    <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
  </a>
</p>
</div>

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

[✨ Check out latest update v0.7.0](#-recent-updates)

🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>

My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.

Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.

I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.

Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>

## 🧐 Why Crawl4AI?

1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.

## 🚀 Quick Start

1. Install Crawl4AI:
```bash
# Install the package
pip install -U crawl4ai

# For pre-release versions
pip install crawl4ai --pre

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor
```

If you encounter any browser-related issues, you can install them manually:
```bash
python -m playwright install --with-deps chromium
```

2. Run a simple web crawl with Python:
```python
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

3. Or use the new command-line interface:
```bash
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```

## ✨ Features

<details>
<summary>📝 <strong>Markdown Generation</strong></summary>

- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>

<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>

- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.

</details>

<details>
<summary>🌐 <strong>Browser Integration</strong></summary>

- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling (see the sketch after this list).
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.

</details>
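
The remote-control and session bullets above are easiest to see in code. Below is a minimal, hedged sketch; the `cdp_url` and `session_id` parameter names are assumptions here rather than guarantees, so verify them against the current `BrowserConfig`/`CrawlerRunConfig` reference before relying on them.

```python
# Hedged sketch: attach to an existing browser over CDP and reuse one session across steps.
# cdp_url and session_id are assumed parameter names; confirm against the current API docs.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Connect to a Chrome instance already exposing the DevTools Protocol
    browser_config = BrowserConfig(cdp_url="ws://localhost:9222/devtools/browser")

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # The same session_id keeps cookies and page state across calls
        first = await crawler.arun(
            url="https://example.com/login",
            config=CrawlerRunConfig(session_id="multi-step-session"),
        )
        second = await crawler.arun(
            url="https://example.com/account",
            config=CrawlerRunConfig(session_id="multi-step-session"),
        )
        print(len(first.markdown), len(second.markdown))

asyncio.run(main())
```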

<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>

- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync conditions for dynamic content extraction.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`); see the sketch after this list.
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.

</details>
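
To make the raw-data bullet above concrete, here is a small, hedged sketch. It assumes the `raw:` and `file://` prefixes mentioned in the list can be passed straight to `arun()`; the local path is purely illustrative.

```python
# Minimal sketch: crawl inline HTML and a local file instead of a live URL.
# Assumes raw: and file:// URLs are accepted directly by arun(); the path is illustrative.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    html = "<html><body><h1>Hello</h1><p>Inline content, no network needed.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        raw_result = await crawler.arun(url=f"raw:{html}")
        print(raw_result.markdown)

        file_result = await crawler.arun(url="file:///tmp/saved_page.html")
        print(file_result.markdown)

asyncio.run(main())
```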

<details>
<summary>🚀 <strong>Deployment</strong></summary>

- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security (a hedged sketch follows this list).
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.

</details>
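
As a rough companion to the authentication bullet above, the sketch below shows what a JWT-protected call to the Dockerized API might look like. The bearer-token header is standard HTTP; how the token is issued (and whether auth is enabled at all) depends on your deployment, so treat the endpoint and payload as illustrative.

```python
# Hedged sketch: calling the Dockerized API with JWT auth enabled.
# The /crawl payload mirrors the Quick Test below; the token value and flow are deployment-specific.
import requests

API = "http://localhost:11235"
token = "YOUR_JWT_TOKEN"  # issue per your Docker deployment's auth setup

response = requests.post(
    f"{API}/crawl",
    headers={"Authorization": f"Bearer {token}"},
    json={"urls": ["https://example.com"], "priority": 10},
)
print(response.status_code)
```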

<details>
<summary>🎯 <strong>Additional Features</strong></summary>

- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.

</details>

## Try it Now!

✨ Play around with this [Colab notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)

## Installation 🛠️

Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

<details>
<summary>🐍 <strong>Using pip</strong></summary>

Choose the installation option that best fits your needs:

### Basic Installation

For basic web crawling and scraping tasks:

```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```

By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:

1. Through the command line:

   ```bash
   playwright install
   ```

2. If the above doesn't work, try this more specific command:

   ```bash
   python -m playwright install chromium
   ```

This second method has proven to be more reliable in some cases.

---

### Installation with Synchronous Version

The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:

```bash
pip install crawl4ai[sync]
```

---

### Development Installation

For contributors who plan to modify the source code:

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .                    # Basic installation in editable mode
```

Install optional features:

```bash
pip install -e ".[torch]"           # With PyTorch features
pip install -e ".[transformer]"     # With Transformer features
pip install -e ".[cosine]"          # With cosine similarity features
pip install -e ".[sync]"            # With synchronous crawling (Selenium)
pip install -e ".[all]"             # Install all optional features
```

</details>

<details>
<summary>🐳 <strong>Docker Deployment</strong></summary>

> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.

### New Docker Features

The new Docker implementation includes:
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
- **Optimized resources** with improved memory management

### Getting Started

```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.7.0
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0

# Visit the playground at http://localhost:11235/playground
```

For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).

</details>

---

### Quick Test

Run a quick test (works for both Docker options):

```python
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10}
)
if response.status_code == 200:
    print("Crawl job submitted successfully.")

if "results" in response.json():
    results = response.json()["results"]
    print("Crawl job completed. Results:")
    for result in results:
        print(result)
else:
    task_id = response.json()["task_id"]
    print(f"Crawl job submitted. Task ID: {task_id}")
    result = requests.get(f"http://localhost:11235/task/{task_id}")
```

For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).

</details>

## 🔬 Advanced Usage Examples 🔬

You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.

<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
        # markdown_generator=DefaultMarkdownGenerator(
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
        print(len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(
        headless=False,
        verbose=True
    )
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:

        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )

        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>

```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token",
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>

```python
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"

        result = await crawler.arun(
            url,
            config=run_config,
            magic=True,
        )

        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")
```

</details>

## ✨ Recent Updates

### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update

- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
  ```python
  config = AdaptiveConfig(
      confidence_threshold=0.7,  # Min confidence to stop crawling
      max_depth=5,               # Maximum crawl depth
      max_pages=20,              # Maximum number of pages to crawl
      strategy="statistical"
  )

  async with AsyncWebCrawler() as crawler:
      adaptive_crawler = AdaptiveCrawler(crawler, config)
      state = await adaptive_crawler.digest(
          start_url="https://news.example.com",
          query="latest news content"
      )
      # Crawler learns patterns and improves extraction over time
  ```

- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
  ```python
  scroll_config = VirtualScrollConfig(
      container_selector="[data-testid='feed']",
      scroll_count=20,
      scroll_by="container_height",
      wait_after_scroll=1.0
  )

  result = await crawler.arun(url, config=CrawlerRunConfig(
      virtual_scroll_config=scroll_config
  ))
  ```

- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
  ```python
  link_config = LinkPreviewConfig(
      query="machine learning tutorials",
      score_threshold=0.3,
      concurrent_requests=10
  )

  result = await crawler.arun(url, config=CrawlerRunConfig(
      link_preview_config=link_config,
      score_links=True
  ))
  # Links ranked by relevance and quality
  ```

- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
  ```python
  seeder = AsyncUrlSeeder(SeedingConfig(
      source="sitemap+cc",
      pattern="*/blog/*",
      query="python tutorials",
      score_threshold=0.4
  ))

  urls = await seeder.discover("https://example.com")
  ```

- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency

Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

## Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.

### Version Numbers Explained

Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)

#### Pre-release Versions
We use different suffixes to indicate development stages:

- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version

#### Installation
- Regular installation (stable version):

  ```bash
  pip install -U crawl4ai
  ```

- Install pre-release versions:

  ```bash
  pip install crawl4ai --pre
  ```

- Install specific version:

  ```bash
  pip install crawl4ai==0.4.3b1
  ```

#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features

For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.

## 📖 Documentation & Roadmap

> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!

For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).

To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).

<details>
<summary>📈 <strong>Development TODOs</strong></summary>

- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials

</details>

## 🤝 Contributing

We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.

## 📄 License & Attribution

This project is licensed under the Apache License 2.0, with attribution recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.

### Attribution Requirements
When using Crawl4AI, you must include one of the following attribution methods:

#### 1. Badge Attribution (Recommended)
Add one of these badges to your README, documentation, or website:

| Theme | Badge |
|-------|-------|
| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |

HTML code for adding the badges:
```html
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
```

#### 2. Text Attribution
Add this line to your documentation:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```

## 📚 Citation

If you use Crawl4AI in your research or project, please cite:

```bibtex
@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}
```

Text citation format:
```
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
```

## 📧 Contact

For questions, suggestions, or feedback, feel free to reach out:

- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)

Happy Crawling! 🕸️🚀

## 💖 Support Crawl4AI

> 🎉 **Sponsorship Program Just Launched!** Be among the first 50 **Founding Sponsors** and get permanent recognition in our Hall of Fame!

Crawl4AI is the #1 trending open-source web crawler with 51K+ stars. Your support ensures we stay independent, innovative, and free forever.

<div align="center">

[](https://github.com/sponsors/unclecode)
[](https://github.com/sponsors/unclecode)

</div>

### 🤝 Sponsorship Tiers

- **🌱 Believer ($5/mo)**: Join the movement for data democratization
- **🚀 Builder ($50/mo)**: Get priority support and early feature access
- **💼 Growing Team ($500/mo)**: Bi-weekly syncs and optimization help
- **🏢 Data Infrastructure Partner ($2000/mo)**: Full partnership with dedicated support

**Why sponsor?** Every tier includes real benefits. No more rate-limited APIs. Own your data pipeline. Build data sovereignty together.

[View All Tiers & Benefits →](https://github.com/sponsors/unclecode)

### 🏆 Our Sponsors

#### 👑 Founding Sponsors (First 50)
*Be part of history - [Become a Founding Sponsor](https://github.com/sponsors/unclecode)*

<!-- Founding sponsors will be permanently recognized here -->

#### Current Sponsors
Thank you to all our sponsors who make this project possible!

<!-- Sponsors will be automatically added here -->

## 🗾 Mission

Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.

We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.

<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>

- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.

</details>

<details>
<summary>🚀 <strong>Development Pathway</strong></summary>

1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.

For more details, see our [full mission statement](./MISSION.md).
</details>

## Star History

[](https://star-history.com/#unclecode/crawl4ai&Date)

README.md · 316 lines changed

@@ -10,7 +10,6 @@
[](https://badge.fury.io/py/crawl4ai)
[](https://pypi.org/project/crawl4ai/)
[](https://pepy.tech/project/crawl4ai)
[](https://github.com/sponsors/unclecode)

<p align="center">
  <a href="https://x.com/crawl4ai">
@@ -25,35 +24,32 @@
</p>
</div>

Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

[✨ Check out latest update v0.7.4](#-recent-updates)
[✨ Check out latest update v0.7.0](#-recent-updates)

✨ New in v0.7.4: Revolutionary LLM Table Extraction with intelligent chunking, enhanced concurrency fixes, memory management refactor, and critical stability improvements. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)

✨ Recent v0.7.3: Undetected Browser Support, Multi-URL Configurations, Memory Monitoring, Enhanced Table Extraction, GitHub Sponsors. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)

<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
<summary>🤓 <strong>My Personal Story</strong></summary>

I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That’s where I learned how much extraction matters.
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.

In 2023, I needed web-to-Markdown. The “open source” option wanted an account, API token, and $16, and still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now it’s the most-starred crawler on GitHub.
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.

I made it open source for **availability**, anyone can use it without a gate. Now I’m building the platform for **affordability**, anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.
I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.

Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>

## 🧐 Why Crawl4AI?

<details>
<summary>Why developers pick Crawl4AI</summary>

- **LLM ready output**, smart Markdown with headings, tables, code, citation hints
- **Fast in practice**, async browser pool, caching, minimal hops
- **Full control**, sessions, proxies, cookies, user scripts, hooks
- **Adaptive intelligence**, learns site patterns, explores only what matters
- **Deploy anywhere**, zero keys, CLI and Docker, cloud friendly
</details>

1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.

## 🚀 Quick Start

@@ -105,33 +101,6 @@ crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
crwl https://www.example.com/products -q "Extract all product prices"
```

## 💖 Support Crawl4AI

> 🎉 **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.

Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community — while giving you direct access to premium benefits.

<div align="">

[](https://github.com/sponsors/unclecode)
[](https://github.com/sponsors/unclecode)

</div>

### 🤝 Sponsorship Tiers

- **🌱 Believer ($5/mo)** — Join the movement for data democratization
- **🚀 Builder ($50/mo)** — Priority support & early access to features
- **💼 Growing Team ($500/mo)** — Bi-weekly syncs & optimization help
- **🏢 Data Infrastructure Partner ($2000/mo)** — Full partnership with dedicated support
*Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact*

**Why sponsor?**
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.

[See All Tiers & Benefits →](https://github.com/sponsors/unclecode)

## ✨ Features

<details>
@@ -311,6 +280,12 @@ docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.
# Visit the playground at http://localhost:11235/playground
```

For complete documentation, see our [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/).

</details>

---

### Quick Test

Run a quick test (works for both Docker options):

@@ -341,11 +316,10 @@ For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4

</details>

---

## 🔬 Advanced Usage Examples 🔬

You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.

<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
@@ -373,7 +347,7 @@ async def main():

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.9.9/guide/",
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
@@ -425,7 +399,7 @@ async def main():
                "type": "attribute",
                "attribute": "src"
            }
        ]
        }
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
@@ -504,7 +478,7 @@ if __name__ == "__main__":
</details>

<details>
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>

```python
import os, sys
@@ -544,123 +518,7 @@ async def test_news_crawl():

## ✨ Recent Updates

<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>

- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables:
  ```python
  from crawl4ai import LLMTableExtraction, LLMConfig

  # Configure intelligent table extraction
  table_strategy = LLMTableExtraction(
      llm_config=LLMConfig(provider="openai/gpt-4.1-mini"),
      enable_chunking=True,          # Handle massive tables
      chunk_token_threshold=5000,    # Smart chunking threshold
      overlap_threshold=100,         # Maintain context between chunks
      extraction_type="structured"   # Get structured data output
  )

  config = CrawlerRunConfig(table_extraction_strategy=table_strategy)
  result = await crawler.arun("https://complex-tables-site.com", config=config)

  # Tables are automatically chunked, processed, and merged
  for table in result.tables:
      print(f"Extracted table: {len(table['data'])} rows")
  ```

- **⚡ Dispatcher Bug Fix**: Fixed sequential processing bottleneck in arun_many for fast-completing tasks
- **🧹 Memory Management Refactor**: Consolidated memory utilities into main utils module for cleaner architecture
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
- **🔗 Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats
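
A rough illustration of the dict-versus-string flexibility mentioned in the last bullet follows; the exact parameter names (`proxy`, `proxy_config`) are assumptions, so confirm them against the current `BrowserConfig` reference.

```python
# Hedged sketch of the two proxy formats; parameter names are assumptions.
from crawl4ai import BrowserConfig

# String form: a single proxy URL with inline credentials
string_config = BrowserConfig(proxy="http://user:pass@proxy.example.com:8080")

# Dict form: server and credentials split into separate fields
dict_config = BrowserConfig(proxy_config={
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass",
})
```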

[Full v0.7.4 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)

</details>

<details>
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>

- **🕵️ Undetected Browser Support**: Bypass sophisticated bot detection systems:
  ```python
  from crawl4ai import AsyncWebCrawler, BrowserConfig

  browser_config = BrowserConfig(
      browser_type="undetected",   # Use undetected Chrome
      headless=True,               # Can run headless with stealth
      extra_args=[
          "--disable-blink-features=AutomationControlled",
          "--disable-web-security"
      ]
  )

  async with AsyncWebCrawler(config=browser_config) as crawler:
      result = await crawler.arun("https://protected-site.com")
      # Successfully bypass Cloudflare, Akamai, and custom bot detection
  ```

- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
  ```python
  from crawl4ai import CrawlerRunConfig, MatchMode

  configs = [
      # Documentation sites - aggressive caching
      CrawlerRunConfig(
          url_matcher=["*docs*", "*documentation*"],
          cache_mode="write",
          markdown_generator_options={"include_links": True}
      ),

      # News/blog sites - fresh content
      CrawlerRunConfig(
          url_matcher=lambda url: 'blog' in url or 'news' in url,
          cache_mode="bypass"
      ),

      # Fallback for everything else
      CrawlerRunConfig()
  ]

  results = await crawler.arun_many(urls, config=configs)
  # Each URL gets the perfect configuration automatically
  ```

- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
  ```python
  from crawl4ai.memory_utils import MemoryMonitor

  monitor = MemoryMonitor()
  monitor.start_monitoring()

  results = await crawler.arun_many(large_url_list)

  report = monitor.get_report()
  print(f"Peak memory: {report['peak_mb']:.1f} MB")
  print(f"Efficiency: {report['efficiency']:.1f}%")
  # Get optimization recommendations
  ```

- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
  ```python
  result = await crawler.arun("https://site-with-tables.com")

  # New way - direct table access
  if result.tables:
      import pandas as pd
      for table in result.tables:
          df = pd.DataFrame(table['data'])
          print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
  ```

- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables
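
The `LLM_PROVIDER` variable and `.llm.env` file are named in the 0.7.3 changelog entry earlier in this diff; everything else in the sketch below (the provider string, key name, and flags) is illustrative rather than authoritative.

```bash
# Hedged sketch: switch the server's default LLM provider without rebuilding the image.
# LLM_PROVIDER and .llm.env come from the release notes; the values below are placeholders.
cat > .llm.env <<'EOF'
LLM_PROVIDER=groq/llama-3.3-70b-versatile
GROQ_API_KEY=your_key_here
EOF

docker run -d -p 11235:11235 --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest
```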
|
||||
|
||||
[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)
|
||||
|
||||
</details>

<details>

<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>

### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update

- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:

@@ -725,14 +583,97 @@ async def test_news_crawl():

Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

</details>

### Previous Version: 0.6.0 Release Highlights

- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:

```python
from crawl4ai import CrawlerRunConfig, GeolocationConfig

crun_cfg = CrawlerRunConfig(
    url="https://browserleaks.com/geo",  # test page that shows your location
    locale="en-US",                      # Accept-Language & UI locale
    timezone_id="America/Los_Angeles",   # JS Date()/Intl timezone
    geolocation=GeolocationConfig(       # override GPS coords
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    )
)
```

- **📊 Table-to-DataFrame Extraction**: Extract HTML tables directly to CSV or pandas DataFrames:

```python
from typing import List

import pandas as pd

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CrawlResult

# browser_config: a BrowserConfig defined earlier
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()

try:
    # Set up scraping parameters
    crawl_config = CrawlerRunConfig(
        table_score_threshold=8,  # Strict table detection
    )

    # Execute market data extraction
    results: List[CrawlResult] = await crawler.arun(
        url="https://coinmarketcap.com/?page=1", config=crawl_config
    )

    # Process results
    raw_df = pd.DataFrame()
    for result in results:
        if result.success and result.tables:
            raw_df = pd.DataFrame(
                result.tables[0]["rows"],
                columns=result.tables[0]["headers"],
            )
            break
    print(raw_df.head())

finally:
    await crawler.close()
```

- **🚀 Browser Pooling**: Pages launch hot with pre-warmed browser instances for lower latency and memory usage

- **🕸️ Network and Console Capture**: Full traffic logs and MHTML snapshots for debugging:

```python
crawler_config = CrawlerRunConfig(
    capture_network=True,
    capture_console=True,
    mhtml=True
)
```

- **🔌 MCP Integration**: Connect to AI tools like Claude Code through the Model Context Protocol

```bash
# Add Crawl4AI to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
```

- **🖥️ Interactive Playground**: Test configurations and generate API requests with the built-in web interface at `http://localhost:11235/playground`

- **🐳 Revamped Docker Deployment**: Streamlined multi-architecture Docker image with improved resource efficiency

- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements

### Previous Version: 0.5.0 Major Release Highlights

- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies (see the sketch after this list)
- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory
- **🔄 Multiple Crawling Strategies**: Browser-based and lightweight HTTP-only crawlers
- **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access
- **👤 Browser Profiler**: Create and manage persistent browser profiles
- **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant
- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library
- **🌐 Proxy Rotation**: Built-in support for proxy switching
- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
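
A minimal deep-crawl sketch: `BestFirstCrawlingStrategy`, its `max_pages` cap, and the `depth`/`score` metadata fields appear elsewhere in this diff, while the import path, the constructor signature, and the `deep_crawl_strategy` parameter on `CrawlerRunConfig` are assumptions:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy  # assumed import path


async def main():
    # Assumed signature: bound both the crawl depth and the page budget
    strategy = BestFirstCrawlingStrategy(max_depth=2, max_pages=50)
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)  # assumed parameter name

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        for result in results:
            meta = result.metadata or {}
            print(meta.get("depth"), meta.get("score"), result.url)


asyncio.run(main())
```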

Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html).

## Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.

<details>

<summary>📈 <strong>Version Numbers Explained</strong></summary>

### Version Numbers Explained

Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)

@@ -769,8 +710,6 @@ We use pre-releases to:
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
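For example: `pip install crawl4ai --pre` installs the latest pre-release.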

</details>

## 📖 Documentation & Roadmap

> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!

@@ -783,16 +722,16 @@ To check our development plans and upcoming features, visit our [Roadmap](https:

<summary>📈 <strong>Development TODOs</strong></summary>

- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [x] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [x] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [x] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [x] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [x] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [x] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [x] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [x] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials

</details>

@@ -807,13 +746,12 @@ Here's the updated license section:

## 📄 License & Attribution

This project is licensed under the Apache License 2.0, attribution is recommended via the badges below. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.
This project is licensed under the Apache License 2.0 with a required attribution clause. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.

### Attribution Requirements

When using Crawl4AI, you must include one of the following attribution methods:

<details>

<summary>📈 <strong>1. Badge Attribution (Recommended)</strong></summary>

#### 1. Badge Attribution (Recommended)

Add one of these badges to your README, documentation, or website:

| Theme | Badge |

@@ -852,15 +790,11 @@ HTML code for adding the badges:

</details>

<details>

<summary>📖 <strong>2. Text Attribution</strong></summary>

#### 2. Text Attribution

Add this line to your documentation:

```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```

</details>

## 📚 Citation

65
SPONSORS.md

@@ -1,65 +0,0 @@

# 💖 Sponsors & Supporters

Thank you to everyone supporting Crawl4AI! Your sponsorship helps keep this project open-source and actively maintained.

## 👑 Founding Sponsors
*The first 50 sponsors who believed in our vision - permanently recognized*

<!-- Founding sponsors will be listed here with special recognition -->
🎉 **Become a Founding Sponsor!** Only [X/50] spots remaining! [Join now →](https://github.com/sponsors/unclecode)

---

## 🏢 Data Infrastructure Partners ($2000/month)
*These organizations are building their data sovereignty with Crawl4AI at the core*

<!-- Data Infrastructure Partners will be listed here -->
*Be the first Data Infrastructure Partner! [Join us →](https://github.com/sponsors/unclecode)*

---

## 💼 Growing Teams ($500/month)
*Teams scaling their data extraction with Crawl4AI*

<!-- Growing Teams will be listed here -->
*Your team could be here! [Become a sponsor →](https://github.com/sponsors/unclecode)*

---

## 🚀 Builders ($50/month)
*Developers and entrepreneurs building with Crawl4AI*

<!-- Builders will be listed here -->
*Join the builders! [Start sponsoring →](https://github.com/sponsors/unclecode)*

---

## 🌱 Believers ($5/month)
*The community supporting data democratization*

<!-- Believers will be listed here -->
*Thank you to all our community believers!*

---

## 🤝 Want to Sponsor?

Crawl4AI is the #1 trending open-source web crawler. We're building the future of data extraction - where organizations own their data pipelines instead of relying on rate-limited APIs.

### Available Sponsorship Tiers:
- **🌱 Believer** ($5/mo) - Support the movement
- **🚀 Builder** ($50/mo) - Priority support & early access
- **💼 Growing Team** ($500/mo) - Bi-weekly syncs & optimization
- **🏢 Data Infrastructure Partner** ($2000/mo) - Full partnership & dedicated support

[View all tiers and benefits →](https://github.com/sponsors/unclecode)

### Enterprise & Custom Partnerships

Building data extraction at scale? Need dedicated support or infrastructure? Let's talk about a custom partnership.

📧 Contact: [hello@crawl4ai.com](mailto:hello@crawl4ai.com) | 📅 [Schedule a call](https://calendar.app.google/rEpvi2UBgUQjWHfJ9)

---

*This list is updated regularly. Sponsors at $50+ tiers can submit their logos via [hello@crawl4ai.com](mailto:hello@crawl4ai.com)*
@@ -29,12 +29,6 @@ from .extraction_strategy import (
|
||||
)
|
||||
from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from .table_extraction import (
|
||||
TableExtractionStrategy,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
LLMTableExtraction,
|
||||
)
|
||||
from .content_filter_strategy import (
|
||||
PruningContentFilter,
|
||||
BM25ContentFilter,
|
||||
@@ -162,9 +156,6 @@ __all__ = [
|
||||
"ChunkingStrategy",
|
||||
"RegexChunking",
|
||||
"DefaultMarkdownGenerator",
|
||||
"TableExtractionStrategy",
|
||||
"DefaultTableExtraction",
|
||||
"NoTableExtraction",
|
||||
"RelevantContentFilter",
|
||||
"PruningContentFilter",
|
||||
"BM25ContentFilter",
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# crawl4ai/__version__.py
|
||||
|
||||
# This is the version that will be used for stable releases
|
||||
__version__ = "0.7.4"
|
||||
__version__ = "0.7.2"
|
||||
|
||||
# For nightly builds, this gets set during build process
|
||||
__nightly_version__ = None
|
||||
|
||||
@@ -20,7 +20,6 @@ from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
|
||||
from .content_scraping_strategy import ContentScrapingStrategy, LXMLWebScrapingStrategy
|
||||
from .deep_crawling import DeepCrawlStrategy
|
||||
from .table_extraction import TableExtractionStrategy, DefaultTableExtraction
|
||||
|
||||
from .cache_context import CacheMode
|
||||
from .proxy_strategy import ProxyRotationStrategy
|
||||
@@ -449,10 +448,6 @@ class BrowserConfig:
|
||||
self.chrome_channel = ""
|
||||
self.proxy = proxy
|
||||
self.proxy_config = proxy_config
|
||||
if isinstance(self.proxy_config, dict):
|
||||
self.proxy_config = ProxyConfig.from_dict(self.proxy_config)
|
||||
if isinstance(self.proxy_config, str):
|
||||
self.proxy_config = ProxyConfig.from_string(self.proxy_config)
|
||||
|
||||
|
||||
self.viewport_width = viewport_width
|
||||
@@ -983,8 +978,6 @@ class CrawlerRunConfig():
|
||||
Default: False.
|
||||
table_score_threshold (int): Minimum score threshold for processing a table.
|
||||
Default: 7.
|
||||
table_extraction (TableExtractionStrategy): Strategy to use for table extraction.
|
||||
Default: DefaultTableExtraction with table_score_threshold.
|
||||
|
||||
# Virtual Scroll Parameters
|
||||
virtual_scroll_config (VirtualScrollConfig or dict or None): Configuration for handling virtual scroll containers.
|
||||
@@ -1111,7 +1104,6 @@ class CrawlerRunConfig():
|
||||
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
|
||||
table_score_threshold: int = 7,
|
||||
table_extraction: TableExtractionStrategy = None,
|
||||
exclude_external_images: bool = False,
|
||||
exclude_all_images: bool = False,
|
||||
# Link and Domain Handling Parameters
|
||||
@@ -1167,11 +1159,6 @@ class CrawlerRunConfig():
|
||||
self.parser_type = parser_type
|
||||
self.scraping_strategy = scraping_strategy or LXMLWebScrapingStrategy()
|
||||
self.proxy_config = proxy_config
|
||||
if isinstance(proxy_config, dict):
|
||||
self.proxy_config = ProxyConfig.from_dict(proxy_config)
|
||||
if isinstance(proxy_config, str):
|
||||
self.proxy_config = ProxyConfig.from_string(proxy_config)
|
||||
|
||||
self.proxy_rotation_strategy = proxy_rotation_strategy
|
||||
|
||||
# Browser Location and Identity Parameters
|
||||
@@ -1228,12 +1215,6 @@ class CrawlerRunConfig():
|
||||
self.exclude_external_images = exclude_external_images
|
||||
self.exclude_all_images = exclude_all_images
|
||||
self.table_score_threshold = table_score_threshold
|
||||
|
||||
# Table extraction strategy (default to DefaultTableExtraction if not specified)
|
||||
if table_extraction is None:
|
||||
self.table_extraction = DefaultTableExtraction(table_score_threshold=table_score_threshold)
|
||||
else:
|
||||
self.table_extraction = table_extraction
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
self.exclude_social_media_domains = (
|
||||
@@ -1505,7 +1486,6 @@ class CrawlerRunConfig():
|
||||
"image_score_threshold", IMAGE_SCORE_THRESHOLD
|
||||
),
|
||||
table_score_threshold=kwargs.get("table_score_threshold", 7),
|
||||
table_extraction=kwargs.get("table_extraction", None),
|
||||
exclude_all_images=kwargs.get("exclude_all_images", False),
|
||||
exclude_external_images=kwargs.get("exclude_external_images", False),
|
||||
# Link and Domain Handling Parameters
|
||||
@@ -1614,7 +1594,6 @@ class CrawlerRunConfig():
|
||||
"image_description_min_word_threshold": self.image_description_min_word_threshold,
|
||||
"image_score_threshold": self.image_score_threshold,
|
||||
"table_score_threshold": self.table_score_threshold,
|
||||
"table_extraction": self.table_extraction,
|
||||
"exclude_all_images": self.exclude_all_images,
|
||||
"exclude_external_images": self.exclude_external_images,
|
||||
"exclude_social_media_domains": self.exclude_social_media_domains,
|
||||
|
||||
@@ -824,7 +824,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Error:
|
||||
visibility_info = await self.check_visibility(page)
|
||||
|
||||
if self.browser_config.verbose:
|
||||
if self.browser_config.config.verbose:
|
||||
self.logger.debug(
|
||||
message="Body visibility info: {info}",
|
||||
tag="DEBUG",
|
||||
|
||||
@@ -2129,265 +2129,3 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
return True # Default to scrolling if check fails
|
||||
|
||||
|
||||
####################################################################################################
|
||||
# HTTP Crawler Strategy
|
||||
####################################################################################################
|
||||
|
||||
class HTTPCrawlerError(Exception):
|
||||
"""Base error class for HTTP crawler specific exceptions"""
|
||||
pass
|
||||
|
||||
|
||||
class ConnectionTimeoutError(HTTPCrawlerError):
|
||||
"""Raised when connection timeout occurs"""
|
||||
pass
|
||||
|
||||
|
||||
class HTTPStatusError(HTTPCrawlerError):
|
||||
"""Raised for unexpected status codes"""
|
||||
def __init__(self, status_code: int, message: str):
|
||||
self.status_code = status_code
|
||||
super().__init__(f"HTTP {status_code}: {message}")
|
||||
|
||||
|
||||
class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"""
|
||||
Fast, lightweight HTTP-only crawler strategy optimized for memory efficiency.
|
||||
"""
|
||||
|
||||
__slots__ = ('logger', 'max_connections', 'dns_cache_ttl', 'chunk_size', '_session', 'hooks', 'browser_config')
|
||||
|
||||
DEFAULT_TIMEOUT: Final[int] = 30
|
||||
DEFAULT_CHUNK_SIZE: Final[int] = 64 * 1024
|
||||
DEFAULT_MAX_CONNECTIONS: Final[int] = min(32, (os.cpu_count() or 1) * 4)
|
||||
DEFAULT_DNS_CACHE_TTL: Final[int] = 300
|
||||
VALID_SCHEMES: Final = frozenset({'http', 'https', 'file', 'raw'})
|
||||
|
||||
_BASE_HEADERS: Final = MappingProxyType({
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
|
||||
'Accept-Language': 'en-US,en;q=0.5',
|
||||
'Accept-Encoding': 'gzip, deflate, br',
|
||||
'Connection': 'keep-alive',
|
||||
'Upgrade-Insecure-Requests': '1',
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
|
||||
})
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
browser_config: Optional[HTTPCrawlerConfig] = None,
|
||||
logger: Optional[AsyncLogger] = None,
|
||||
max_connections: int = DEFAULT_MAX_CONNECTIONS,
|
||||
dns_cache_ttl: int = DEFAULT_DNS_CACHE_TTL,
|
||||
chunk_size: int = DEFAULT_CHUNK_SIZE
|
||||
):
|
||||
"""Initialize the HTTP crawler with config"""
|
||||
self.browser_config = browser_config or HTTPCrawlerConfig()
|
||||
self.logger = logger
|
||||
self.max_connections = max_connections
|
||||
self.dns_cache_ttl = dns_cache_ttl
|
||||
self.chunk_size = chunk_size
|
||||
self._session: Optional[aiohttp.ClientSession] = None
|
||||
|
||||
self.hooks = {
|
||||
k: partial(self._execute_hook, k)
|
||||
for k in ('before_request', 'after_request', 'on_error')
|
||||
}
|
||||
|
||||
# Set default hooks
|
||||
self.set_hook('before_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('after_request', lambda *args, **kwargs: None)
|
||||
self.set_hook('on_error', lambda *args, **kwargs: None)
|
||||
|
||||
|
||||
async def __aenter__(self) -> AsyncHTTPCrawlerStrategy:
|
||||
await self.start()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb) -> None:
|
||||
await self.close()
|
||||
|
||||
@contextlib.asynccontextmanager
|
||||
async def _session_context(self):
|
||||
try:
|
||||
if not self._session:
|
||||
await self.start()
|
||||
yield self._session
|
||||
finally:
|
||||
pass
|
||||
|
||||
def set_hook(self, hook_type: str, hook_func: Callable) -> None:
|
||||
if hook_type in self.hooks:
|
||||
self.hooks[hook_type] = partial(self._execute_hook, hook_type, hook_func)
|
||||
else:
|
||||
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||
|
||||
async def _execute_hook(
|
||||
self,
|
||||
hook_type: str,
|
||||
hook_func: Callable,
|
||||
*args: Any,
|
||||
**kwargs: Any
|
||||
) -> Any:
|
||||
if asyncio.iscoroutinefunction(hook_func):
|
||||
return await hook_func(*args, **kwargs)
|
||||
return hook_func(*args, **kwargs)
|
||||
|
||||
async def start(self) -> None:
|
||||
if not self._session:
|
||||
connector = aiohttp.TCPConnector(
|
||||
limit=self.max_connections,
|
||||
ttl_dns_cache=self.dns_cache_ttl,
|
||||
use_dns_cache=True,
|
||||
force_close=False
|
||||
)
|
||||
self._session = aiohttp.ClientSession(
|
||||
headers=dict(self._BASE_HEADERS),
|
||||
connector=connector,
|
||||
timeout=ClientTimeout(total=self.DEFAULT_TIMEOUT)
|
||||
)
|
||||
|
||||
async def close(self) -> None:
|
||||
if self._session and not self._session.closed:
|
||||
try:
|
||||
await asyncio.wait_for(self._session.close(), timeout=5.0)
|
||||
except asyncio.TimeoutError:
|
||||
if self.logger:
|
||||
self.logger.warning(
|
||||
message="Session cleanup timed out",
|
||||
tag="CLEANUP"
|
||||
)
|
||||
finally:
|
||||
self._session = None
|
||||
|
||||
async def _stream_file(self, path: str) -> AsyncGenerator[memoryview, None]:
|
||||
async with aiofiles.open(path, mode='rb') as f:
|
||||
while chunk := await f.read(self.chunk_size):
|
||||
yield memoryview(chunk)
|
||||
|
||||
async def _handle_file(self, path: str) -> AsyncCrawlResponse:
|
||||
if not os.path.exists(path):
|
||||
raise FileNotFoundError(f"Local file not found: {path}")
|
||||
|
||||
chunks = []
|
||||
async for chunk in self._stream_file(path):
|
||||
chunks.append(chunk.tobytes().decode('utf-8', errors='replace'))
|
||||
|
||||
return AsyncCrawlResponse(
|
||||
html=''.join(chunks),
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
async def _handle_raw(self, content: str) -> AsyncCrawlResponse:
|
||||
return AsyncCrawlResponse(
|
||||
html=content,
|
||||
response_headers={},
|
||||
status_code=200
|
||||
)
|
||||
|
||||
|
||||
async def _handle_http(
|
||||
self,
|
||||
url: str,
|
||||
config: CrawlerRunConfig
|
||||
) -> AsyncCrawlResponse:
|
||||
async with self._session_context() as session:
|
||||
timeout = ClientTimeout(
|
||||
total=config.page_timeout or self.DEFAULT_TIMEOUT,
|
||||
connect=10,
|
||||
sock_read=30
|
||||
)
|
||||
|
||||
headers = dict(self._BASE_HEADERS)
|
||||
if self.browser_config.headers:
|
||||
headers.update(self.browser_config.headers)
|
||||
|
||||
request_kwargs = {
|
||||
'timeout': timeout,
|
||||
'allow_redirects': self.browser_config.follow_redirects,
|
||||
'ssl': self.browser_config.verify_ssl,
|
||||
'headers': headers
|
||||
}
|
||||
|
||||
if self.browser_config.method == "POST":
|
||||
if self.browser_config.data:
|
||||
request_kwargs['data'] = self.browser_config.data
|
||||
if self.browser_config.json:
|
||||
request_kwargs['json'] = self.browser_config.json
|
||||
|
||||
await self.hooks['before_request'](url, request_kwargs)
|
||||
|
||||
try:
|
||||
async with session.request(self.browser_config.method, url, **request_kwargs) as response:
|
||||
content = memoryview(await response.read())
|
||||
|
||||
if not (200 <= response.status < 300):
|
||||
raise HTTPStatusError(
|
||||
response.status,
|
||||
f"Unexpected status code for {url}"
|
||||
)
|
||||
|
||||
encoding = response.charset
|
||||
if not encoding:
|
||||
encoding = chardet.detect(content.tobytes())['encoding'] or 'utf-8'
|
||||
|
||||
result = AsyncCrawlResponse(
|
||||
html=content.tobytes().decode(encoding, errors='replace'),
|
||||
response_headers=dict(response.headers),
|
||||
status_code=response.status,
|
||||
redirected_url=str(response.url)
|
||||
)
|
||||
|
||||
await self.hooks['after_request'](result)
|
||||
return result
|
||||
|
||||
except aiohttp.ServerTimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except aiohttp.ClientConnectorError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionError(f"Connection failed: {str(e)}")
|
||||
|
||||
except aiohttp.ClientError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP client error: {str(e)}")
|
||||
|
||||
except asyncio.exceptions.TimeoutError as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise ConnectionTimeoutError(f"Request timed out: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
await self.hooks['on_error'](e)
|
||||
raise HTTPCrawlerError(f"HTTP request failed: {str(e)}")
|
||||
|
||||
async def crawl(
|
||||
self,
|
||||
url: str,
|
||||
config: Optional[CrawlerRunConfig] = None,
|
||||
**kwargs
|
||||
) -> AsyncCrawlResponse:
|
||||
config = config or CrawlerRunConfig.from_kwargs(kwargs)
|
||||
|
||||
parsed = urlparse(url)
|
||||
scheme = parsed.scheme.rstrip('/')
|
||||
|
||||
if scheme not in self.VALID_SCHEMES:
|
||||
raise ValueError(f"Unsupported URL scheme: {scheme}")
|
||||
|
||||
try:
|
||||
if scheme == 'file':
|
||||
return await self._handle_file(parsed.path)
|
||||
elif scheme == 'raw':
|
||||
return await self._handle_raw(parsed.path)
|
||||
else: # http or https
|
||||
return await self._handle_http(url, config)
|
||||
|
||||
except Exception as e:
|
||||
if self.logger:
|
||||
self.logger.error(
|
||||
message="Crawl failed: {error}",
|
||||
tag="CRAWL",
|
||||
params={"error": str(e), "url": url}
|
||||
)
|
||||
raise
|
||||
|
||||
@@ -22,7 +22,7 @@ from urllib.parse import urlparse
|
||||
import random
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from .utils import get_true_memory_usage_percent
|
||||
from .memory_utils import get_true_memory_usage_percent
|
||||
|
||||
|
||||
class RateLimiter:
|
||||
@@ -407,34 +407,32 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
|
||||
t.cancel()
|
||||
raise exc
|
||||
|
||||
# If memory pressure is low, greedily fill all available slots
|
||||
if not self.memory_pressure_mode:
|
||||
slots = self.max_session_permit - len(active_tasks)
|
||||
while slots > 0:
|
||||
try:
|
||||
# Use get_nowait() to immediately get tasks without blocking
|
||||
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
# If memory pressure is low, start new tasks
|
||||
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
|
||||
try:
|
||||
# Try to get a task with timeout to avoid blocking indefinitely
|
||||
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
|
||||
self.task_queue.get(), timeout=0.1
|
||||
)
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
|
||||
slots -= 1
|
||||
|
||||
except asyncio.QueueEmpty:
|
||||
# No more tasks in queue, exit the loop
|
||||
break
|
||||
except asyncio.TimeoutError:
|
||||
# No tasks in queue, that's fine
|
||||
pass
|
||||
|
||||
# Wait for completion even if queue is starved
|
||||
if active_tasks:
|
||||
@@ -561,34 +559,32 @@ class MemoryAdaptiveDispatcher(BaseDispatcher):
|
||||
for t in active_tasks:
|
||||
t.cancel()
|
||||
raise exc
|
||||
# If memory pressure is low, greedily fill all available slots
|
||||
if not self.memory_pressure_mode:
|
||||
slots = self.max_session_permit - len(active_tasks)
|
||||
while slots > 0:
|
||||
try:
|
||||
# Use get_nowait() to immediately get tasks without blocking
|
||||
priority, (url, task_id, retry_count, enqueue_time) = self.task_queue.get_nowait()
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
# If memory pressure is low, start new tasks
|
||||
if not self.memory_pressure_mode and len(active_tasks) < self.max_session_permit:
|
||||
try:
|
||||
# Try to get a task with timeout
|
||||
priority, (url, task_id, retry_count, enqueue_time) = await asyncio.wait_for(
|
||||
self.task_queue.get(), timeout=0.1
|
||||
)
|
||||
|
||||
# Create and start the task
|
||||
task = asyncio.create_task(
|
||||
self.crawl_url(url, config, task_id, retry_count)
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
active_tasks.append(task)
|
||||
|
||||
# Update waiting time in monitor
|
||||
if self.monitor:
|
||||
wait_time = time.time() - enqueue_time
|
||||
self.monitor.update_task(
|
||||
task_id,
|
||||
wait_time=wait_time,
|
||||
status=CrawlStatus.IN_PROGRESS
|
||||
)
|
||||
|
||||
slots -= 1
|
||||
|
||||
except asyncio.QueueEmpty:
|
||||
# No more tasks in queue, exit the loop
|
||||
break
|
||||
except asyncio.TimeoutError:
|
||||
# No tasks in queue, that's fine
|
||||
pass
|
||||
|
||||
# Process completed tasks and yield results
|
||||
if active_tasks:
|
||||
|
||||
@@ -608,11 +608,6 @@ class BrowserManager:
|
||||
self.contexts_by_config = {}
|
||||
self._contexts_lock = asyncio.Lock()
|
||||
|
||||
# Serialize context.new_page() across concurrent tasks to avoid races
|
||||
# when using a shared persistent context (context.pages may be empty
|
||||
# for all racers). Prevents 'Target page/context closed' errors.
|
||||
self._page_lock = asyncio.Lock()
|
||||
|
||||
# Stealth-related attributes
|
||||
self._stealth_instance = None
|
||||
self._stealth_cm = None
|
||||
@@ -1032,26 +1027,13 @@ class BrowserManager:
|
||||
context = await self.create_browser_context(crawlerRunConfig)
|
||||
ctx = self.default_context # default context, one window only
|
||||
ctx = await clone_runtime_state(context, ctx, crawlerRunConfig, self.config)
|
||||
# Avoid concurrent new_page on shared persistent context
|
||||
# See GH-1198: context.pages can be empty under races
|
||||
async with self._page_lock:
|
||||
page = await ctx.new_page()
|
||||
page = await ctx.new_page()
|
||||
else:
|
||||
context = self.default_context
|
||||
pages = context.pages
|
||||
page = next((p for p in pages if p.url == crawlerRunConfig.url), None)
|
||||
if not page:
|
||||
if pages:
|
||||
page = pages[0]
|
||||
else:
|
||||
# Double-check under lock to avoid TOCTOU and ensure only
|
||||
# one task calls new_page when pages=[] concurrently
|
||||
async with self._page_lock:
|
||||
pages = context.pages
|
||||
if pages:
|
||||
page = pages[0]
|
||||
else:
|
||||
page = await context.new_page()
|
||||
page = context.pages[0] # await context.new_page()
|
||||
else:
|
||||
# Otherwise, check if we have an existing context for this config
|
||||
config_signature = self._make_config_signature(crawlerRunConfig)
|
||||
|
||||
@@ -65,213 +65,6 @@ class BrowserProfiler:
|
||||
self.builtin_config_file = os.path.join(self.builtin_browser_dir, "browser_config.json")
|
||||
os.makedirs(self.builtin_browser_dir, exist_ok=True)
|
||||
|
||||
def _is_windows(self) -> bool:
|
||||
"""Check if running on Windows platform."""
|
||||
return sys.platform.startswith('win') or sys.platform == 'cygwin'
|
||||
|
||||
def _is_macos(self) -> bool:
|
||||
"""Check if running on macOS platform."""
|
||||
return sys.platform == 'darwin'
|
||||
|
||||
def _is_linux(self) -> bool:
|
||||
"""Check if running on Linux platform."""
|
||||
return sys.platform.startswith('linux')
|
||||
|
||||
def _get_quit_message(self, tag: str) -> str:
|
||||
"""Get appropriate quit message based on context."""
|
||||
if tag == "PROFILE":
|
||||
return "Closing browser and saving profile..."
|
||||
elif tag == "CDP":
|
||||
return "Closing browser..."
|
||||
else:
|
||||
return "Closing browser..."
|
||||
|
||||
async def _listen_windows(self, user_done_event, check_browser_process, tag: str):
|
||||
"""Windows-specific keyboard listener using msvcrt."""
|
||||
try:
|
||||
import msvcrt
|
||||
except ImportError:
|
||||
raise ImportError("msvcrt module not available on this platform")
|
||||
|
||||
while True:
|
||||
try:
|
||||
# Check for keyboard input
|
||||
if msvcrt.kbhit():
|
||||
raw = msvcrt.getch()
|
||||
|
||||
# Handle Unicode decoding more robustly
|
||||
key = None
|
||||
try:
|
||||
key = raw.decode("utf-8")
|
||||
except UnicodeDecodeError:
|
||||
try:
|
||||
# Try different encodings
|
||||
key = raw.decode("latin1")
|
||||
except UnicodeDecodeError:
|
||||
# Skip if we can't decode
|
||||
continue
|
||||
|
||||
# Validate key
|
||||
if not key or len(key) != 1:
|
||||
continue
|
||||
|
||||
# Check for printable characters only
|
||||
if not key.isprintable():
|
||||
continue
|
||||
|
||||
# Check for quit command
|
||||
if key.lower() == "q":
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay to prevent busy waiting
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in Windows keyboard listener: {e}", tag=tag)
|
||||
# Continue trying instead of failing completely
|
||||
await asyncio.sleep(0.1)
|
||||
continue
|
||||
|
||||
async def _listen_unix(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
|
||||
"""Unix/Linux/macOS keyboard listener using termios and select."""
|
||||
try:
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
except ImportError:
|
||||
raise ImportError("termios/tty/select modules not available on this platform")
|
||||
|
||||
# Get stdin file descriptor
|
||||
try:
|
||||
fd = sys.stdin.fileno()
|
||||
except (AttributeError, OSError):
|
||||
raise ImportError("stdin is not a terminal")
|
||||
|
||||
# Save original terminal settings
|
||||
old_settings = None
|
||||
try:
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
except termios.error as e:
|
||||
raise ImportError(f"Cannot get terminal attributes: {e}")
|
||||
|
||||
try:
|
||||
# Switch to non-canonical mode (cbreak mode)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
try:
|
||||
# Use select to check if input is available (non-blocking)
|
||||
# Timeout of 0.5 seconds to periodically check browser process
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
|
||||
if readable:
|
||||
# Read one character
|
||||
key = sys.stdin.read(1)
|
||||
|
||||
if key and key.lower() == "q":
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay to prevent busy waiting
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
except (KeyboardInterrupt, EOFError):
|
||||
# Handle Ctrl+C or EOF gracefully
|
||||
self.logger.info("Keyboard interrupt received", tag=tag)
|
||||
user_done_event.set()
|
||||
return
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in Unix keyboard listener: {e}", tag=tag)
|
||||
await asyncio.sleep(0.1)
|
||||
continue
|
||||
|
||||
finally:
|
||||
# Always restore terminal settings
|
||||
if old_settings is not None:
|
||||
try:
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
except Exception as e:
|
||||
self.logger.error(f"Failed to restore terminal settings: {e}", tag=tag)
|
||||
|
||||
async def _listen_fallback(self, user_done_event: asyncio.Event, check_browser_process, tag: str):
|
||||
"""Fallback keyboard listener using simple input() method."""
|
||||
self.logger.info("Using fallback input mode. Type 'q' and press Enter to quit.", tag=tag)
|
||||
|
||||
# Run input in a separate thread to avoid blocking
|
||||
import threading
|
||||
import queue
|
||||
|
||||
input_queue = queue.Queue()
|
||||
|
||||
def input_thread():
|
||||
"""Thread function to handle input."""
|
||||
try:
|
||||
while not user_done_event.is_set():
|
||||
try:
|
||||
# Use input() with a prompt
|
||||
user_input = input("Press 'q' + Enter to quit: ").strip().lower()
|
||||
input_queue.put(user_input)
|
||||
if user_input == 'q':
|
||||
break
|
||||
except (EOFError, KeyboardInterrupt):
|
||||
input_queue.put('q')
|
||||
break
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error in input thread: {e}", tag=tag)
|
||||
break
|
||||
except Exception as e:
|
||||
self.logger.error(f"Input thread failed: {e}", tag=tag)
|
||||
|
||||
# Start input thread
|
||||
thread = threading.Thread(target=input_thread, daemon=True)
|
||||
thread.start()
|
||||
|
||||
try:
|
||||
while not user_done_event.is_set():
|
||||
# Check for user input
|
||||
try:
|
||||
user_input = input_queue.get_nowait()
|
||||
if user_input == 'q':
|
||||
self.logger.info(
|
||||
self._get_quit_message(tag),
|
||||
tag=tag,
|
||||
base_color=LogColor.GREEN
|
||||
)
|
||||
user_done_event.set()
|
||||
return
|
||||
except queue.Empty:
|
||||
pass
|
||||
|
||||
# Check if browser process ended
|
||||
if await check_browser_process():
|
||||
return
|
||||
|
||||
# Small delay
|
||||
await asyncio.sleep(0.5)
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Fallback listener failed: {e}", tag=tag)
|
||||
user_done_event.set()
|
||||
|
||||
async def create_profile(self,
|
||||
profile_name: Optional[str] = None,
|
||||
browser_config: Optional[BrowserConfig] = None) -> Optional[str]:
|
||||
@@ -387,38 +180,42 @@ class BrowserProfiler:
|
||||
|
||||
# Run keyboard input loop in a separate task
|
||||
async def listen_for_quit_command():
|
||||
"""Cross-platform keyboard listener that waits for 'q' key press."""
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(
|
||||
"Press {segment} when you've finished using the browser...",
|
||||
tag="PROFILE",
|
||||
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
|
||||
base_color=LogColor.CYAN
|
||||
)
|
||||
|
||||
async def check_browser_process():
|
||||
"""Check if browser process is still running."""
|
||||
if (
|
||||
managed_browser.browser_process
|
||||
and managed_browser.browser_process.poll() is not None
|
||||
):
|
||||
self.logger.info(
|
||||
"Browser already closed. Ending input listener.", tag="PROFILE"
|
||||
)
|
||||
user_done_event.set()
|
||||
return True
|
||||
return False
|
||||
|
||||
# Try platform-specific implementations with fallback
|
||||
self.logger.info("Press 'q' when you've finished using the browser...", tag="PROFILE")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
|
||||
try:
|
||||
if self._is_windows():
|
||||
await self._listen_windows(user_done_event, check_browser_process, "PROFILE")
|
||||
else:
|
||||
await self._listen_unix(user_done_event, check_browser_process, "PROFILE")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="PROFILE")
|
||||
self.logger.info("Falling back to simple input mode...", tag="PROFILE")
|
||||
await self._listen_fallback(user_done_event, check_browser_process, "PROFILE")
|
||||
# Switch to non-canonical mode (no line buffering)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
# Check if input is available (non-blocking)
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info("Closing browser and saving profile...", tag="PROFILE", base_color=LogColor.GREEN)
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if the browser process has already exited
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="PROFILE")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
finally:
|
||||
# Restore terminal settings
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
|
||||
try:
|
||||
from playwright.async_api import async_playwright
|
||||
@@ -885,33 +682,42 @@ class BrowserProfiler:
|
||||
|
||||
# Run keyboard input loop in a separate task
|
||||
async def listen_for_quit_command():
|
||||
"""Cross-platform keyboard listener that waits for 'q' key press."""
|
||||
import termios
|
||||
import tty
|
||||
import select
|
||||
|
||||
# First output the prompt
|
||||
self.logger.info(
|
||||
"Press {segment} to stop the browser and exit...",
|
||||
tag="CDP",
|
||||
params={"segment": "'q'"}, colors={"segment": LogColor.YELLOW},
|
||||
base_color=LogColor.CYAN
|
||||
)
|
||||
|
||||
async def check_browser_process():
|
||||
"""Check if browser process is still running."""
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
|
||||
user_done_event.set()
|
||||
return True
|
||||
return False
|
||||
|
||||
# Try platform-specific implementations with fallback
|
||||
self.logger.info("Press 'q' to stop the browser and exit...", tag="CDP")
|
||||
|
||||
# Save original terminal settings
|
||||
fd = sys.stdin.fileno()
|
||||
old_settings = termios.tcgetattr(fd)
|
||||
|
||||
try:
|
||||
if self._is_windows():
|
||||
await self._listen_windows(user_done_event, check_browser_process, "CDP")
|
||||
else:
|
||||
await self._listen_unix(user_done_event, check_browser_process, "CDP")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Platform-specific keyboard listener failed: {e}", tag="CDP")
|
||||
self.logger.info("Falling back to simple input mode...", tag="CDP")
|
||||
await self._listen_fallback(user_done_event, check_browser_process, "CDP")
|
||||
# Switch to non-canonical mode (no line buffering)
|
||||
tty.setcbreak(fd)
|
||||
|
||||
while True:
|
||||
# Check if input is available (non-blocking)
|
||||
readable, _, _ = select.select([sys.stdin], [], [], 0.5)
|
||||
if readable:
|
||||
key = sys.stdin.read(1)
|
||||
if key.lower() == 'q':
|
||||
self.logger.info("Closing browser...", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
# Check if the browser process has already exited
|
||||
if managed_browser.browser_process and managed_browser.browser_process.poll() is not None:
|
||||
self.logger.info("Browser already closed. Ending input listener.", tag="CDP")
|
||||
user_done_event.set()
|
||||
return
|
||||
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
finally:
|
||||
# Restore terminal settings
|
||||
termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
|
||||
|
||||
# Function to retrieve and display CDP JSON config
|
||||
async def get_cdp_json(port):
|
||||
|
||||
@@ -242,16 +242,6 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
exclude_domains = set(kwargs.get("exclude_domains", []))
|
||||
|
||||
# Process links
|
||||
try:
|
||||
base_element = element.xpath("//head/base[@href]")
|
||||
if base_element:
|
||||
base_href = base_element[0].get("href", "").strip()
|
||||
if base_href:
|
||||
url = base_href
|
||||
except Exception as e:
|
||||
self._log("error", f"Error extracting base URL: {str(e)}", "SCRAPE")
|
||||
pass
|
||||
|
||||
for link in element.xpath(".//a[@href]"):
|
||||
href = link.get("href", "").strip()
|
||||
if not href:
|
||||
@@ -586,6 +576,117 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
|
||||
return root
|
||||
|
||||
def is_data_table(self, table: etree.Element, **kwargs) -> bool:
|
||||
score = 0
|
||||
# Check for thead and tbody
|
||||
has_thead = len(table.xpath(".//thead")) > 0
|
||||
has_tbody = len(table.xpath(".//tbody")) > 0
|
||||
if has_thead:
|
||||
score += 2
|
||||
if has_tbody:
|
||||
score += 1
|
||||
|
||||
# Check for th elements
|
||||
th_count = len(table.xpath(".//th"))
|
||||
if th_count > 0:
|
||||
score += 2
|
||||
if has_thead or table.xpath(".//tr[1]/th"):
|
||||
score += 1
|
||||
|
||||
# Check for nested tables
|
||||
if len(table.xpath(".//table")) > 0:
|
||||
score -= 3
|
||||
|
||||
# Role attribute check
|
||||
role = table.get("role", "").lower()
|
||||
if role in {"presentation", "none"}:
|
||||
score -= 3
|
||||
|
||||
# Column consistency
|
||||
rows = table.xpath(".//tr")
|
||||
if not rows:
|
||||
return False
|
||||
col_counts = [len(row.xpath(".//td|.//th")) for row in rows]
|
||||
avg_cols = sum(col_counts) / len(col_counts)
|
||||
variance = sum((c - avg_cols)**2 for c in col_counts) / len(col_counts)
|
||||
if variance < 1:
|
||||
score += 2
|
||||
|
||||
# Caption and summary
|
||||
if table.xpath(".//caption"):
|
||||
score += 2
|
||||
if table.get("summary"):
|
||||
score += 1
|
||||
|
||||
# Text density
|
||||
total_text = sum(len(''.join(cell.itertext()).strip()) for row in rows for cell in row.xpath(".//td|.//th"))
|
||||
total_tags = sum(1 for _ in table.iterdescendants())
|
||||
text_ratio = total_text / (total_tags + 1e-5)
|
||||
if text_ratio > 20:
|
||||
score += 3
|
||||
elif text_ratio > 10:
|
||||
score += 2
|
||||
|
||||
# Data attributes
|
||||
data_attrs = sum(1 for attr in table.attrib if attr.startswith('data-'))
|
||||
score += data_attrs * 0.5
|
||||
|
||||
# Size check
|
||||
if avg_cols >= 2 and len(rows) >= 2:
|
||||
score += 2
|
||||
|
||||
threshold = kwargs.get("table_score_threshold", 7)
|
||||
return score >= threshold
|
||||
|
||||
def extract_table_data(self, table: etree.Element) -> dict:
|
||||
caption = table.xpath(".//caption/text()")
|
||||
caption = caption[0].strip() if caption else ""
|
||||
summary = table.get("summary", "").strip()
|
||||
|
||||
# Extract headers with colspan handling
|
||||
headers = []
|
||||
thead_rows = table.xpath(".//thead/tr")
|
||||
if thead_rows:
|
||||
header_cells = thead_rows[0].xpath(".//th")
|
||||
for cell in header_cells:
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
headers.extend([text] * colspan)
|
||||
else:
|
||||
first_row = table.xpath(".//tr[1]")
|
||||
if first_row:
|
||||
for cell in first_row[0].xpath(".//th|.//td"):
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
headers.extend([text] * colspan)
|
||||
|
||||
# Extract rows with colspan handling
|
||||
rows = []
|
||||
for row in table.xpath(".//tr[not(ancestor::thead)]"):
|
||||
row_data = []
|
||||
for cell in row.xpath(".//td"):
|
||||
text = cell.text_content().strip()
|
||||
colspan = int(cell.get("colspan", 1))
|
||||
row_data.extend([text] * colspan)
|
||||
if row_data:
|
||||
rows.append(row_data)
|
||||
|
||||
# Align rows with headers
|
||||
max_columns = len(headers) if headers else (max(len(row) for row in rows) if rows else 0)
|
||||
aligned_rows = []
|
||||
for row in rows:
|
||||
aligned = row[:max_columns] + [''] * (max_columns - len(row))
|
||||
aligned_rows.append(aligned)
|
||||
|
||||
if not headers:
|
||||
headers = [f"Column {i+1}" for i in range(max_columns)]
|
||||
|
||||
return {
|
||||
"headers": headers,
|
||||
"rows": aligned_rows,
|
||||
"caption": caption,
|
||||
"summary": summary,
|
||||
}
|
||||
|
||||
def _scrap(
|
||||
self,
|
||||
@@ -728,16 +829,12 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
# Extract tables using the table extraction strategy if provided
|
||||
if 'table' not in excluded_tags:
|
||||
table_extraction = kwargs.get('table_extraction')
|
||||
if table_extraction:
|
||||
# Pass logger to the strategy if it doesn't have one
|
||||
if not table_extraction.logger:
|
||||
table_extraction.logger = self.logger
|
||||
# Extract tables using the strategy
|
||||
extracted_tables = table_extraction.extract_tables(body, **kwargs)
|
||||
media["tables"].extend(extracted_tables)
|
||||
tables = body.xpath(".//table")
|
||||
for table in tables:
|
||||
if self.is_data_table(table, **kwargs):
|
||||
table_data = self.extract_table_data(table)
|
||||
media["tables"].append(table_data)
|
||||
|
||||
# Handle only_text option
|
||||
if kwargs.get("only_text", False):
|
||||
|
||||
@@ -116,11 +116,6 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
|
||||
valid_links.append(base_url)
|
||||
|
||||
# If we have more valid links than capacity, limit them
|
||||
if len(valid_links) > remaining_capacity:
|
||||
valid_links = valid_links[:remaining_capacity]
|
||||
self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
|
||||
|
||||
# Record the new depths and add to next_links
|
||||
for url in valid_links:
|
||||
depths[url] = new_depth
|
||||
@@ -140,7 +135,8 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
"""
|
||||
queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
|
||||
# Push the initial URL with score 0 and depth 0.
|
||||
await queue.put((0, 0, start_url, None))
|
||||
initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
|
||||
await queue.put((-initial_score, 0, start_url, None))
|
||||
visited: Set[str] = set()
|
||||
depths: Dict[str, int] = {start_url: 0}
|
||||
|
||||
@@ -187,7 +183,7 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
result.metadata = result.metadata or {}
|
||||
result.metadata["depth"] = depth
|
||||
result.metadata["parent_url"] = parent_url
|
||||
result.metadata["score"] = score
|
||||
result.metadata["score"] = -score
|
||||
|
||||
# Count only successful crawls toward max_pages limit
|
||||
if result.success:
|
||||
@@ -208,7 +204,7 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
|
||||
for new_url, new_parent in new_links:
|
||||
new_depth = depths.get(new_url, depth + 1)
|
||||
new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
|
||||
await queue.put((new_score, new_depth, new_url, new_parent))
|
||||
await queue.put((-new_score, new_depth, new_url, new_parent))
|
||||
|
||||
# End of crawl.
|
||||
|
||||
|
||||
79
crawl4ai/memory_utils.py
Normal file
@@ -0,0 +1,79 @@
|
||||
import psutil
|
||||
import platform
|
||||
import subprocess
|
||||
from typing import Tuple
|
||||
|
||||
|
||||
def get_true_available_memory_gb() -> float:
|
||||
"""Get truly available memory including inactive pages (cross-platform)"""
|
||||
vm = psutil.virtual_memory()
|
||||
|
||||
if platform.system() == 'Darwin': # macOS
|
||||
# On macOS, we need to include inactive memory too
|
||||
try:
|
||||
# Use vm_stat to get accurate values
|
||||
result = subprocess.run(['vm_stat'], capture_output=True, text=True)
|
||||
lines = result.stdout.split('\n')
|
||||
|
||||
page_size = 16384 # macOS page size
|
||||
pages = {}
|
||||
|
||||
for line in lines:
|
||||
if 'Pages free:' in line:
|
||||
pages['free'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages inactive:' in line:
|
||||
pages['inactive'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages speculative:' in line:
|
||||
pages['speculative'] = int(line.split()[-1].rstrip('.'))
|
||||
elif 'Pages purgeable:' in line:
|
||||
pages['purgeable'] = int(line.split()[-1].rstrip('.'))
|
||||
|
||||
# Calculate total available (free + inactive + speculative + purgeable)
|
||||
total_available_pages = (
|
||||
pages.get('free', 0) +
|
||||
pages.get('inactive', 0) +
|
||||
pages.get('speculative', 0) +
|
||||
pages.get('purgeable', 0)
|
||||
)
|
||||
available_gb = (total_available_pages * page_size) / (1024**3)
|
||||
|
||||
return available_gb
|
||||
except:
|
||||
# Fallback to psutil
|
||||
return vm.available / (1024**3)
|
||||
else:
|
||||
# For Windows and Linux, psutil.available is accurate
|
||||
return vm.available / (1024**3)
|
||||
|
||||
|
||||
def get_true_memory_usage_percent() -> float:
|
||||
"""
|
||||
Get memory usage percentage that accounts for platform differences.
|
||||
|
||||
Returns:
|
||||
float: Memory usage percentage (0-100)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
|
||||
# Calculate used percentage based on truly available memory
|
||||
used_percent = 100.0 * (total_gb - available_gb) / total_gb
|
||||
|
||||
# Ensure it's within valid range
|
||||
return max(0.0, min(100.0, used_percent))
|
||||
|
||||
|
||||
def get_memory_stats() -> Tuple[float, float, float]:
|
||||
"""
|
||||
Get comprehensive memory statistics.
|
||||
|
||||
Returns:
|
||||
Tuple[float, float, float]: (used_percent, available_gb, total_gb)
|
||||
"""
|
||||
vm = psutil.virtual_memory()
|
||||
total_gb = vm.total / (1024**3)
|
||||
available_gb = get_true_available_memory_gb()
|
||||
used_percent = get_true_memory_usage_percent()
|
||||
|
||||
return used_percent, available_gb, total_gb
|
||||
File diff suppressed because it is too large
@@ -16,7 +16,7 @@ from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD, IM
|
||||
import httpx
|
||||
from socket import gaierror
|
||||
from pathlib import Path
|
||||
from typing import Dict, Any, List, Optional, Callable, Generator, Tuple, Iterable
|
||||
from typing import Dict, Any, List, Optional, Callable
|
||||
from urllib.parse import urljoin
|
||||
import requests
|
||||
from requests.exceptions import InvalidSchema
|
||||
@@ -40,7 +40,8 @@ from typing import Sequence
|
||||
|
||||
from itertools import chain
|
||||
from collections import deque
|
||||
import psutil
|
||||
from typing import Generator, Iterable
|
||||
|
||||
import numpy as np
|
||||
|
||||
from urllib.parse import (
|
||||
@@ -3413,79 +3414,3 @@ def cosine_distance(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Calculate cosine distance (1 - similarity) between two vectors"""
    return 1 - cosine_similarity(vec1, vec2)


# Memory utilities

def get_true_available_memory_gb() -> float:
    """Get truly available memory including inactive pages (cross-platform)"""
    vm = psutil.virtual_memory()

    if platform.system() == 'Darwin':  # macOS
        # On macOS, inactive pages should be counted as available too
        try:
            # Use vm_stat to get accurate values
            result = subprocess.run(['vm_stat'], capture_output=True, text=True)
            lines = result.stdout.split('\n')

            page_size = 16384  # default macOS page size (16 KB on Apple Silicon; Intel Macs use 4 KB)
            pages = {}

            for line in lines:
                if 'Pages free:' in line:
                    pages['free'] = int(line.split()[-1].rstrip('.'))
                elif 'Pages inactive:' in line:
                    pages['inactive'] = int(line.split()[-1].rstrip('.'))
                elif 'Pages speculative:' in line:
                    pages['speculative'] = int(line.split()[-1].rstrip('.'))
                elif 'Pages purgeable:' in line:
                    pages['purgeable'] = int(line.split()[-1].rstrip('.'))

            # Calculate total available (free + inactive + speculative + purgeable)
            total_available_pages = (
                pages.get('free', 0) +
                pages.get('inactive', 0) +
                pages.get('speculative', 0) +
                pages.get('purgeable', 0)
            )
            available_gb = (total_available_pages * page_size) / (1024**3)

            return available_gb
        except Exception:
            # Fallback to psutil if vm_stat is unavailable or its output changes
            return vm.available / (1024**3)
    else:
        # For Windows and Linux, psutil.available is accurate
        return vm.available / (1024**3)


def get_true_memory_usage_percent() -> float:
    """
    Get memory usage percentage that accounts for platform differences.

    Returns:
        float: Memory usage percentage (0-100)
    """
    vm = psutil.virtual_memory()
    total_gb = vm.total / (1024**3)
    available_gb = get_true_available_memory_gb()

    # Calculate used percentage based on truly available memory
    used_percent = 100.0 * (total_gb - available_gb) / total_gb

    # Ensure it's within the valid range
    return max(0.0, min(100.0, used_percent))


def get_memory_stats() -> Tuple[float, float, float]:
    """
    Get comprehensive memory statistics.

    Returns:
        Tuple[float, float, float]: (used_percent, available_gb, total_gb)
    """
    vm = psutil.virtual_memory()
    total_gb = vm.total / (1024**3)
    available_gb = get_true_available_memory_gb()
    used_percent = get_true_memory_usage_percent()

    return used_percent, available_gb, total_gb
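For orientation, a minimal usage sketch of these helpers (assuming they remain importable from `crawl4ai.utils`, as the consolidation described in the v0.7.4 notes further down suggests):

```python
from crawl4ai.utils import get_memory_stats  # assumed import path after the utils consolidation

used_percent, available_gb, total_gb = get_memory_stats()
print(f"RAM: {used_percent:.1f}% used, {available_gb:.1f} GB truly available of {total_gb:.1f} GB")

# Example guard before launching a memory-heavy batch crawl
if used_percent > 85.0:
    print("High memory pressure; consider a smaller batch size")
```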
@@ -692,7 +692,8 @@ app:
# Default LLM Configuration
llm:
  provider: "openai/gpt-4o-mini"  # Can be overridden by LLM_PROVIDER env var
  # api_key: sk-...  # If you pass the API key directly (not recommended)
  api_key_env: "OPENAI_API_KEY"
  # api_key: sk-...  # If you pass the API key directly then api_key_env will be ignored

# Redis Configuration (Used by internal Redis server managed by supervisord)
redis:
@@ -4,7 +4,7 @@ import asyncio
from typing import List, Tuple, Dict
from functools import partial
from uuid import uuid4
from datetime import datetime, timezone
from datetime import datetime
from base64 import b64encode

import logging
@@ -65,7 +65,7 @@ async def handle_llm_qa(
|
||||
) -> str:
|
||||
"""Process QA using LLM with crawled content as context."""
|
||||
try:
|
||||
if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")):
|
||||
if not url.startswith(('http://', 'https://')):
|
||||
url = 'https://' + url
|
||||
# Extract base URL by finding last '?q=' occurrence
|
||||
last_q_index = url.rfind('?q=')
|
||||
@@ -96,7 +96,7 @@ async def handle_llm_qa(
|
||||
response = perform_completion_with_backoff(
|
||||
provider=config["llm"]["provider"],
|
||||
prompt_with_variables=prompt,
|
||||
api_token=get_llm_api_key(config) # Returns None to let litellm handle it
|
||||
api_token=get_llm_api_key(config)
|
||||
)
|
||||
|
||||
return response.choices[0].message.content
|
||||
@@ -127,7 +127,7 @@ async def process_llm_extraction(
|
||||
"error": error_msg
|
||||
})
|
||||
return
|
||||
api_key = get_llm_api_key(config, provider) # Returns None to let litellm handle it
|
||||
api_key = get_llm_api_key(config, provider)
|
||||
llm_strategy = LLMExtractionStrategy(
|
||||
llm_config=LLMConfig(
|
||||
provider=provider or config["llm"]["provider"],
|
||||
@@ -191,7 +191,7 @@ async def handle_markdown_request(
|
||||
detail=error_msg
|
||||
)
|
||||
decoded_url = unquote(url)
|
||||
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
|
||||
if not decoded_url.startswith(('http://', 'https://')):
|
||||
decoded_url = 'https://' + decoded_url
|
||||
|
||||
if filter_type == FilterType.RAW:
|
||||
@@ -203,7 +203,7 @@ async def handle_markdown_request(
|
||||
FilterType.LLM: LLMContentFilter(
|
||||
llm_config=LLMConfig(
|
||||
provider=provider or config["llm"]["provider"],
|
||||
api_token=get_llm_api_key(config, provider), # Returns None to let litellm handle it
|
||||
api_token=get_llm_api_key(config, provider),
|
||||
),
|
||||
instruction=query or "Extract main content"
|
||||
)
|
||||
@@ -328,7 +328,7 @@ async def create_new_task(
|
||||
) -> JSONResponse:
|
||||
"""Create and initialize a new task."""
|
||||
decoded_url = unquote(input_path)
|
||||
if not decoded_url.startswith(('http://', 'https://')) and not decoded_url.startswith(("raw:", "raw://")):
|
||||
if not decoded_url.startswith(('http://', 'https://')):
|
||||
decoded_url = 'https://' + decoded_url
|
||||
|
||||
from datetime import datetime
|
||||
@@ -428,7 +428,7 @@ async def handle_crawl_request(
|
||||
peak_mem_mb = start_mem_mb
|
||||
|
||||
try:
|
||||
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) and not url.startswith(("raw:", "raw://")) else url for url in urls]
|
||||
urls = [('https://' + url) if not url.startswith(('http://', 'https://')) else url for url in urls]
|
||||
browser_config = BrowserConfig.load(browser_config)
|
||||
crawler_config = CrawlerRunConfig.load(crawler_config)
|
||||
|
||||
@@ -576,7 +576,7 @@ async def handle_crawl_job(
|
||||
task_id = f"crawl_{uuid4().hex[:8]}"
|
||||
await redis.hset(f"task:{task_id}", mapping={
|
||||
"status": TaskStatus.PROCESSING, # <-- keep enum values consistent
|
||||
"created_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
|
||||
"created_at": datetime.utcnow().isoformat(),
|
||||
"url": json.dumps(urls), # store list as JSON string
|
||||
"result": "",
|
||||
"error": "",
|
||||
|
||||
@@ -11,7 +11,8 @@ app:
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini"
|
||||
# api_key: sk-... # If you pass the API key directly (not recommended)
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
|
||||
# Redis Configuration
|
||||
redis:
|
||||
|
||||
@@ -237,9 +237,9 @@ async def get_markdown(
|
||||
body: MarkdownRequest,
|
||||
_td: Dict = Depends(token_dep),
|
||||
):
|
||||
if not body.url.startswith(("http://", "https://")) and not body.url.startswith(("raw:", "raw://")):
|
||||
if not body.url.startswith(("http://", "https://")):
|
||||
raise HTTPException(
|
||||
400, "Invalid URL format. Must start with http://, https://, or for raw HTML (raw:, raw://)")
|
||||
400, "URL must be absolute and start with http/https")
|
||||
markdown = await handle_markdown_request(
|
||||
body.url, body.f, body.q, body.c, config, body.provider
|
||||
)
|
||||
@@ -401,7 +401,7 @@ async def llm_endpoint(
|
||||
):
|
||||
if not q:
|
||||
raise HTTPException(400, "Query parameter 'q' is required")
|
||||
if not url.startswith(("http://", "https://")) and not url.startswith(("raw:", "raw://")):
|
||||
if not url.startswith(("http://", "https://")):
|
||||
url = "https://" + url
|
||||
answer = await handle_llm_qa(url, q, config)
|
||||
return JSONResponse({"answer": answer})
|
||||
|
||||
@@ -71,7 +71,7 @@ def decode_redis_hash(hash_data: Dict[bytes, bytes]) -> Dict[str, str]:


def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> Optional[str]:
def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> str:
    """Get the appropriate API key based on the LLM provider.

    Args:
@@ -79,14 +79,19 @@ def get_llm_api_key(config: Dict, provider: Optional[str] = None) -> Optional[st
        provider: Optional provider override (e.g., "openai/gpt-4")

    Returns:
        The API key if directly configured, otherwise None to let litellm handle it
        The API key for the provider, or empty string if not found
    """
    # Check if direct API key is configured (for backward compatibility)

    # Use provided provider or fall back to config
    if not provider:
        provider = config["llm"]["provider"]

    # Check if direct API key is configured
    if "api_key" in config["llm"]:
        return config["llm"]["api_key"]

    # Return None - litellm will automatically find the right environment variable
    return None
    # Fall back to the configured api_key_env if no match
    return os.environ.get(config["llm"].get("api_key_env", ""), "")

def validate_llm_provider(config: Dict, provider: Optional[str] = None) -> tuple[bool, str]:
@@ -99,12 +104,16 @@ def validate_llm_provider(config: Dict, provider: Optional[str] = None) -> tuple
    Returns:
        Tuple of (is_valid, error_message)
    """
    # If a direct API key is configured, validation passes
    if "api_key" in config["llm"]:
        return True, ""
    # Use provided provider or fall back to config
    if not provider:
        provider = config["llm"]["provider"]

    # Get the API key for this provider
    api_key = get_llm_api_key(config, provider)

    if not api_key:
        return False, f"No API key found for provider '{provider}'. Please set the appropriate environment variable."

    # Otherwise, trust that litellm will find the appropriate environment variable
    # We can't easily validate this without reimplementing litellm's logic
    return True, ""
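As a rough illustration of how these two helpers are meant to be called together (hypothetical config dict; return values follow the docstrings above):

```python
config = {"llm": {"provider": "openai/gpt-4o-mini", "api_key_env": "OPENAI_API_KEY"}}

# Validate the default provider from config, then a per-request override
is_valid, error_message = validate_llm_provider(config)
if not is_valid:
    raise RuntimeError(error_message)

is_valid, error_message = validate_llm_provider(config, provider="groq/llama-3.2-3b-preview")
print(is_valid, error_message)

# Resolve the token the same way the endpoints above do
api_token = get_llm_api_key(config, provider="groq/llama-3.2-3b-preview")
```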
@@ -1,350 +0,0 @@
|
||||
# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update
|
||||
|
||||
*August 6, 2025 • 5 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **🕵️ Undetected Browser Support**: Stealth mode for bypassing bot detection systems
|
||||
- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
|
||||
- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables
|
||||
- **🧠 Memory Monitoring**: Enhanced memory usage tracking and optimization tools
|
||||
- **📊 Enhanced Table Extraction**: Improved table access and DataFrame conversion
|
||||
- **💰 GitHub Sponsors**: 4-tier sponsorship system with custom arrangements
|
||||
- **🔧 Bug Fixes**: Resolved several critical issues for better stability
|
||||
- **📚 Documentation Updates**: Clearer examples and improved API documentation
|
||||
|
||||
## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
|
||||
|
||||
**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.
|
||||
|
||||
**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
|
||||
|
||||
# Define specialized configs for different content types
|
||||
configs = [
|
||||
# Documentation sites - aggressive caching, include links
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*docs*", "*documentation*"],
|
||||
cache_mode="write",
|
||||
markdown_generator_options={"include_links": True}
|
||||
),
|
||||
|
||||
# News/blog sites - fresh content, scroll for lazy loading
|
||||
CrawlerRunConfig(
|
||||
url_matcher=lambda url: 'blog' in url or 'news' in url,
|
||||
cache_mode="bypass",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight/2);"
|
||||
),
|
||||
|
||||
# API endpoints - structured extraction
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*.json", "*api*"],
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o-mini",
|
||||
extraction_type="structured"
|
||||
)
|
||||
),
|
||||
|
||||
# Default fallback for everything else
|
||||
CrawlerRunConfig() # No url_matcher = matches everything
|
||||
]
|
||||
|
||||
# Crawl multiple URLs with appropriate configs
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
results = await crawler.arun_many(
|
||||
urls=[
|
||||
"https://docs.python.org/3/", # → Uses documentation config
|
||||
"https://blog.python.org/", # → Uses blog config
|
||||
"https://api.github.com/users", # → Uses API config
|
||||
"https://example.com/" # → Uses default config
|
||||
],
|
||||
config=configs
|
||||
)
|
||||
```
|
||||
|
||||
**Matching Capabilities:**
|
||||
- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
|
||||
- **Function Matchers**: Lambda functions for complex logic
|
||||
- **Mixed Matchers**: Combine strings and functions with AND/OR logic
|
||||
- **Fallback Support**: Default config when nothing matches
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
|
||||
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
|
||||
- **Reduced Complexity**: No more if/else forests in your extraction code
|
||||
- **Better Performance**: Each URL gets exactly the processing it needs
|
||||
|
||||
## 🕵️ Undetected Browser Support: Stealth Mode Activated
|
||||
|
||||
**The Problem:** Modern websites employ sophisticated bot detection systems. Cloudflare, Akamai, and custom solutions block automated crawlers, limiting access to valuable content.
|
||||
|
||||
**My Solution:** I implemented undetected browser support with a flexible adapter pattern. Now Crawl4AI can bypass most bot detection systems using stealth techniques.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
# Enable undetected mode for stealth crawling
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="undetected", # Use undetected Chrome
|
||||
headless=True, # Can run headless with stealth
|
||||
extra_args=[
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-web-security",
|
||||
"--disable-features=VizDisplayCompositor"
|
||||
]
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# This will bypass most bot detection systems
|
||||
result = await crawler.arun("https://protected-site.com")
|
||||
|
||||
if result.success:
|
||||
print("✅ Successfully bypassed bot detection!")
|
||||
print(f"Content length: {len(result.markdown)}")
|
||||
```
|
||||
|
||||
**Advanced Anti-Bot Strategies:**
|
||||
|
||||
```python
|
||||
# Combine multiple stealth techniques
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
# Random user agents and headers
|
||||
headers={
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
"Accept-Encoding": "gzip, deflate, br",
|
||||
"DNT": "1"
|
||||
},
|
||||
|
||||
# Human-like behavior simulation
|
||||
js_code="""
|
||||
// Random mouse movements
|
||||
const simulateHuman = () => {
|
||||
const event = new MouseEvent('mousemove', {
|
||||
clientX: Math.random() * window.innerWidth,
|
||||
clientY: Math.random() * window.innerHeight
|
||||
});
|
||||
document.dispatchEvent(event);
|
||||
};
|
||||
setInterval(simulateHuman, 100 + Math.random() * 200);
|
||||
|
||||
// Random scrolling
|
||||
const randomScroll = () => {
|
||||
const scrollY = Math.random() * (document.body.scrollHeight - window.innerHeight);
|
||||
window.scrollTo(0, scrollY);
|
||||
};
|
||||
setTimeout(randomScroll, 500 + Math.random() * 1000);
|
||||
""",
|
||||
|
||||
# Delay to appear more human
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
|
||||
result = await crawler.arun("https://bot-protected-site.com", config=config)
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Enterprise Scraping**: Access previously blocked corporate sites and databases
|
||||
- **Market Research**: Gather data from competitor sites with protection
|
||||
- **Price Monitoring**: Track e-commerce sites that block automated access
|
||||
- **Content Aggregation**: Collect news and social media despite anti-bot measures
|
||||
- **Compliance Testing**: Verify your own site's bot protection effectiveness
|
||||
|
||||
## 🧠 Memory Monitoring & Optimization
|
||||
|
||||
**The Problem:** Long-running crawl sessions consuming excessive memory, especially when processing large batches or heavy JavaScript sites.
|
||||
|
||||
**My Solution:** Built comprehensive memory monitoring and optimization utilities that track usage patterns and provide actionable insights.
|
||||
|
||||
### Memory Tracking Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai.memory_utils import MemoryMonitor, get_memory_info
|
||||
|
||||
# Monitor memory during crawling
|
||||
monitor = MemoryMonitor()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Start monitoring
|
||||
monitor.start_monitoring()
|
||||
|
||||
# Perform memory-intensive operations
|
||||
results = await crawler.arun_many([
|
||||
"https://heavy-js-site.com",
|
||||
"https://large-images-site.com",
|
||||
"https://dynamic-content-site.com"
|
||||
])
|
||||
|
||||
# Get detailed memory report
|
||||
memory_report = monitor.get_report()
|
||||
print(f"Peak memory usage: {memory_report['peak_mb']:.1f} MB")
|
||||
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
|
||||
|
||||
# Automatic cleanup suggestions
|
||||
if memory_report['peak_mb'] > 1000: # > 1GB
|
||||
print("💡 Consider batch size optimization")
|
||||
print("💡 Enable aggressive garbage collection")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Production Stability**: Prevent memory-related crashes in long-running services
|
||||
- **Cost Optimization**: Right-size server resources based on actual usage
|
||||
- **Performance Tuning**: Identify memory bottlenecks and optimization opportunities
|
||||
- **Scalability Planning**: Understand memory patterns for horizontal scaling
|
||||
|
||||
## 📊 Enhanced Table Extraction
|
||||
|
||||
**The Problem:** Table data was accessed through the generic `result.media` interface, making DataFrame conversion cumbersome and unclear.
|
||||
|
||||
**My Solution:** Dedicated `result.tables` interface with direct DataFrame conversion and improved detection algorithms.
|
||||
|
||||
### New Table Access Pattern
|
||||
|
||||
```python
|
||||
# Old way (deprecated)
|
||||
# tables_data = result.media.get('tables', [])
|
||||
|
||||
# New way (v0.7.3+)
|
||||
result = await crawler.arun("https://site-with-tables.com")
|
||||
|
||||
# Direct table access
|
||||
if result.tables:
|
||||
print(f"Found {len(result.tables)} tables")
|
||||
|
||||
# Convert to pandas DataFrame instantly
|
||||
import pandas as pd
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
df = pd.DataFrame(table['data'])
|
||||
print(f"Table {i}: {df.shape[0]} rows × {df.shape[1]} columns")
|
||||
print(df.head())
|
||||
|
||||
# Table metadata
|
||||
print(f"Source: {table.get('source_xpath', 'Unknown')}")
|
||||
print(f"Headers: {table.get('headers', [])}")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Data Analysis**: Faster transition from web data to analysis-ready DataFrames
|
||||
- **ETL Pipelines**: Cleaner integration with data processing workflows
|
||||
- **Reporting**: Simplified table extraction for automated reporting systems
|
||||
|
||||
## 💰 Community Support: GitHub Sponsors
|
||||
|
||||
I've launched GitHub Sponsors to ensure Crawl4AI's continued development and support our growing community.
|
||||
|
||||
**Sponsorship Tiers:**
|
||||
- **🌱 Supporter ($5/month)**: Community support + early feature previews
|
||||
- **🚀 Professional ($25/month)**: Priority support + beta access
|
||||
- **🏢 Business ($100/month)**: Direct consultation + custom integrations
|
||||
- **🏛️ Enterprise ($500/month)**: Dedicated support + feature development
|
||||
|
||||
**Why Sponsor?**
|
||||
- Ensure continuous development and maintenance
|
||||
- Get priority support and feature requests
|
||||
- Access to premium documentation and examples
|
||||
- Direct line to the development team
|
||||
|
||||
[**Become a Sponsor →**](https://github.com/sponsors/unclecode)
|
||||
|
||||
## 🐳 Docker: Flexible LLM Provider Configuration
|
||||
|
||||
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
|
||||
|
||||
**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
|
||||
|
||||
### Deployment Flexibility
|
||||
|
||||
```bash
|
||||
# Option 1: Direct environment variables
|
||||
docker run -d \
|
||||
-e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
|
||||
-e GROQ_API_KEY="your-key" \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
|
||||
# Option 2: Using .llm.env file (recommended for production)
|
||||
# Create .llm.env file:
|
||||
# LLM_PROVIDER=openai/gpt-4o-mini
|
||||
# OPENAI_API_KEY=your-openai-key
|
||||
# GROQ_API_KEY=your-groq-key
|
||||
|
||||
docker run -d \
|
||||
--env-file .llm.env \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
Override per request when needed:
|
||||
```python
|
||||
# Use default provider from .llm.env
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://example.com",
|
||||
"extraction_strategy": {"type": "llm"}
|
||||
})
|
||||
|
||||
# Override to use different provider for this specific request
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://complex-page.com",
|
||||
"extraction_strategy": {
|
||||
"type": "llm",
|
||||
"provider": "openai/gpt-4" # Override default
|
||||
}
|
||||
})
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex
|
||||
- **A/B Testing**: Compare provider performance without deployment changes
|
||||
- **Fallback Strategies**: Switch providers on-the-fly during outages
|
||||
- **Development Flexibility**: Test locally with one provider, deploy with another
|
||||
- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands
|
||||
|
||||
## 🔧 Bug Fixes & Improvements
|
||||
|
||||
This release includes several important bug fixes that improve stability and reliability:
|
||||
|
||||
- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
|
||||
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
|
||||
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
|
||||
- **Table Extraction**: Improved table detection and extraction accuracy
|
||||
- **Error Handling**: Better error messages and recovery from network failures
|
||||
|
||||
## 📚 Documentation Enhancements
|
||||
|
||||
Based on community feedback, we've updated:
|
||||
- Clearer examples for multi-URL configuration
|
||||
- Improved CrawlResult documentation with all available fields
|
||||
- Fixed typos and inconsistencies across documentation
|
||||
- Added real-world URLs in examples for better understanding
|
||||
- New comprehensive demo showcasing all v0.7.3 features
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and the entire community for feedback and bug reports.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -1,305 +0,0 @@
|
||||
# 🚀 Crawl4AI v0.7.4: The Intelligent Table Extraction & Performance Update
|
||||
|
||||
*August 17, 2025 • 6 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.4—the Intelligent Table Extraction & Performance Update. This release introduces revolutionary LLM-powered table extraction with intelligent chunking, significant performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Enhanced Concurrency**: True concurrency improvements for fast-completing tasks in batch operations
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **⌨️ Cross-Platform Browser Profiler**: Improved keyboard handling and quit mechanisms
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration with dict and string formats
|
||||
- **🐳 Docker Improvements**: Better API handling and raw HTML support
|
||||
|
||||
## 🚀 LLMTableExtraction: Revolutionary Table Processing
|
||||
|
||||
**The Problem:** Complex tables with rowspan, colspan, nested structures, or massive datasets that traditional HTML parsing can't handle effectively. Large tables that exceed token limits crash extraction processes.
|
||||
|
||||
**My Solution:** I developed LLMTableExtraction—an intelligent table extraction strategy that uses Large Language Models with automatic chunking to handle tables of any size and complexity.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
LLMTableExtraction,
|
||||
CacheMode
|
||||
)
|
||||
|
||||
# Configure LLM for table extraction
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
temperature=0.1, # Low temperature for consistency
|
||||
max_tokens=32000
|
||||
)
|
||||
|
||||
# Create intelligent table extraction strategy
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
verbose=True,
|
||||
max_tries=2,
|
||||
enable_chunking=True, # Handle massive tables
|
||||
chunk_token_threshold=5000, # Smart chunking threshold
|
||||
overlap_threshold=100, # Maintain context between chunks
|
||||
extraction_type="structured" # Get structured data output
|
||||
)
|
||||
|
||||
# Apply to crawler configuration
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction_strategy=table_strategy,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Extract complex tables with intelligence
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Access extracted tables directly
|
||||
for i, table in enumerate(result.tables):
|
||||
print(f"Table {i}: {len(table['data'])} rows × {len(table['headers'])} columns")
|
||||
|
||||
# Convert to pandas DataFrame instantly
|
||||
import pandas as pd
|
||||
df = pd.DataFrame(table['data'], columns=table['headers'])
|
||||
print(df.head())
|
||||
```
|
||||
|
||||
**Intelligent Chunking for Massive Tables:**
|
||||
|
||||
```python
|
||||
# Handle tables that exceed token limits
|
||||
large_table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=3000, # Conservative threshold
|
||||
overlap_threshold=150, # Preserve context
|
||||
max_concurrent_chunks=3, # Parallel processing
|
||||
merge_strategy="intelligent" # Smart chunk merging
|
||||
)
|
||||
|
||||
# Process Wikipedia comparison tables, financial reports, etc.
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction_strategy=large_table_strategy,
|
||||
# Target specific table containers
|
||||
css_selector="div.wikitable, table.sortable",
|
||||
delay_before_return_html=2.0
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/Comparison_of_operating_systems",
|
||||
config=config
|
||||
)
|
||||
|
||||
# Tables are automatically chunked, processed, and merged
|
||||
print(f"Extracted {len(result.tables)} complex tables")
|
||||
for table in result.tables:
|
||||
print(f"Merged table: {len(table['data'])} total rows")
|
||||
```
|
||||
|
||||
**Advanced Features:**
|
||||
|
||||
- **Intelligent Chunking**: Automatically splits massive tables while preserving structure
|
||||
- **Context Preservation**: Overlapping chunks maintain column relationships
|
||||
- **Parallel Processing**: Concurrent chunk processing for speed
|
||||
- **Smart Merging**: Reconstructs complete tables from processed chunks
|
||||
- **Complex Structure Support**: Handles rowspan, colspan, nested tables
|
||||
- **Metadata Extraction**: Captures table context, captions, and relationships
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Financial Analysis**: Extract complex earnings tables and financial statements
|
||||
- **Research & Academia**: Process large datasets from Wikipedia, research papers
|
||||
- **E-commerce**: Handle product comparison tables with complex layouts
|
||||
- **Government Data**: Extract census data, statistical tables from official sources
|
||||
- **Competitive Intelligence**: Process competitor pricing and feature tables
|
||||
|
||||
## ⚡ Enhanced Concurrency: True Performance Gains
|
||||
|
||||
**The Problem:** The `arun_many()` method wasn't achieving true concurrency for fast-completing tasks, leading to sequential processing bottlenecks in batch operations.
|
||||
|
||||
**My Solution:** I implemented true concurrency improvements in the dispatcher that enable genuine parallel processing for fast-completing tasks.
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
```python
|
||||
# Before v0.7.4: Sequential-like behavior for fast tasks
|
||||
# After v0.7.4: True concurrency
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# These will now run with true concurrency
|
||||
urls = [
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1",
|
||||
"https://httpbin.org/delay/1"
|
||||
]
|
||||
|
||||
# Processes in truly parallel fashion
|
||||
results = await crawler.arun_many(urls)
|
||||
|
||||
# Performance improvement: ~4x faster for fast-completing tasks
|
||||
print(f"Processed {len(results)} URLs with true concurrency")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **API Crawling**: 3-4x faster processing of REST endpoints and API documentation
|
||||
- **Batch URL Processing**: Significant speedup for large URL lists
|
||||
- **Monitoring Systems**: Faster health checks and status page monitoring
|
||||
- **Data Aggregation**: Improved performance for real-time data collection
|
||||
|
||||
## 🧹 Memory Management Refactor: Cleaner Architecture
|
||||
|
||||
**The Problem:** Memory utilities were scattered and difficult to maintain, with potential import conflicts and unclear organization.
|
||||
|
||||
**My Solution:** I consolidated all memory-related utilities into the main `utils.py` module, creating a cleaner, more maintainable architecture.
|
||||
|
||||
### Improved Memory Handling
|
||||
|
||||
```python
|
||||
# All memory utilities now consolidated
|
||||
from crawl4ai.utils import get_true_memory_usage_percent, MemoryMonitor
|
||||
|
||||
# Enhanced memory monitoring
|
||||
monitor = MemoryMonitor()
|
||||
monitor.start_monitoring()
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Memory-efficient batch processing
|
||||
results = await crawler.arun_many(large_url_list)
|
||||
|
||||
# Get accurate memory metrics
|
||||
memory_usage = get_true_memory_usage_percent()
|
||||
memory_report = monitor.get_report()
|
||||
|
||||
print(f"Memory efficiency: {memory_report['efficiency']:.1f}%")
|
||||
print(f"Peak usage: {memory_report['peak_mb']:.1f} MB")
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Production Stability**: More reliable memory tracking and management
|
||||
- **Code Maintainability**: Cleaner architecture for easier debugging
|
||||
- **Import Clarity**: Resolved potential conflicts and import issues
|
||||
- **Developer Experience**: Simpler API for memory monitoring
|
||||
|
||||
## 🔧 Critical Stability Fixes
|
||||
|
||||
### Browser Manager Race Condition Resolution
|
||||
|
||||
**The Problem:** Concurrent page creation in persistent browser contexts caused "Target page/context closed" errors during high-concurrency operations.
|
||||
|
||||
**My Solution:** Implemented thread-safe page creation with proper locking mechanisms.
|
||||
|
||||
```python
|
||||
# Fixed: Safe concurrent page creation
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
use_persistent_context=True, # Now thread-safe
|
||||
max_concurrent_sessions=10 # Safely handle concurrent requests
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# These concurrent operations are now stable
|
||||
tasks = [crawler.arun(url) for url in url_list]
|
||||
results = await asyncio.gather(*tasks) # No more race conditions
|
||||
```
|
||||
|
||||
### Enhanced Browser Profiler
|
||||
|
||||
**The Problem:** Inconsistent keyboard handling across platforms and unreliable quit mechanisms.
|
||||
|
||||
**My Solution:** Cross-platform keyboard listeners with improved quit handling.
|
||||
|
||||
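The release notes don't include the profiler code itself, so here is only a minimal sketch of the idea, a quit key that behaves the same on Windows, macOS, and Linux by reading stdin in a helper thread rather than using a platform-specific keyboard hook:

```python
import threading

def listen_for_quit(stop_event: threading.Event, quit_key: str = "q") -> None:
    """Read lines from stdin in a background thread and signal shutdown on the quit key."""
    while not stop_event.is_set():
        try:
            if input().strip().lower() == quit_key:
                stop_event.set()
        except EOFError:
            # stdin closed (piped or non-interactive run): treat it as a quit request
            stop_event.set()

stop_event = threading.Event()
threading.Thread(target=listen_for_quit, args=(stop_event,), daemon=True).start()

print("Profiling... press 'q' then Enter to quit.")
while not stop_event.is_set():
    stop_event.wait(0.5)  # placeholder for the actual profiling loop
print("Profiler stopped cleanly.")
```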
### Advanced URL Processing
|
||||
|
||||
**The Problem:** Raw URL formats (`raw://` and `raw:`) weren't properly handled, and base tag link resolution was incomplete.
|
||||
|
||||
**My Solution:** Enhanced URL preprocessing and base tag support.
|
||||
|
||||
```python
|
||||
# Now properly handles all URL formats
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"raw://static-html-content",
|
||||
"raw:file://local-file.html"
|
||||
]
|
||||
|
||||
# Base tag links are now correctly resolved
|
||||
config = CrawlerRunConfig(
|
||||
include_links=True, # Links properly resolved with base tags
|
||||
resolve_absolute_urls=True
|
||||
)
|
||||
```
|
||||
|
||||
## 🛡️ Enhanced Proxy Configuration
|
||||
|
||||
**The Problem:** Proxy configuration only accepted specific formats, limiting flexibility.
|
||||
|
||||
**My Solution:** Enhanced ProxyConfig to support both dictionary and string formats.
|
||||
|
||||
```python
|
||||
# Multiple proxy configuration formats now supported
|
||||
from crawl4ai import BrowserConfig, ProxyConfig
|
||||
|
||||
# String format
|
||||
proxy_config = ProxyConfig("http://proxy.example.com:8080")
|
||||
|
||||
# Dictionary format
|
||||
proxy_config = ProxyConfig({
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
})
|
||||
|
||||
# Use with crawler
|
||||
browser_config = BrowserConfig(proxy_config=proxy_config)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://httpbin.org/ip")
|
||||
```
|
||||
|
||||
## 🐳 Docker & Infrastructure Improvements
|
||||
|
||||
This release includes several Docker and infrastructure improvements:
|
||||
|
||||
- **Better API Token Handling**: Improved Docker example scripts with correct endpoints
|
||||
- **Raw HTML Support**: Enhanced Docker API to handle raw HTML content properly (see the sketch after this list)
|
||||
- **Documentation Updates**: Comprehensive Docker deployment examples
|
||||
- **Test Coverage**: Expanded test suite with better coverage
|
||||
|
||||
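To make the raw-HTML path concrete, a hedged sketch of a request against a local deployment (assumes the server is running on the default port; the payload and response shape follow the updated test script shown later in this diff, so adjust to your deployed version):

```python
import requests

html = "<html><body><h1>Hello</h1><p>Inline content, no network fetch needed.</p></body></html>"

# The raw: prefix tells the server to treat the payload as HTML rather than a URL to fetch
response = requests.post("http://localhost:11235/crawl", json={
    "urls": [f"raw:{html}"],
    "browser_config": {},
    "crawler_config": {}
})
response.raise_for_status()
print(response.json()["results"][0]["markdown"])
```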
## 📚 Documentation & Examples
|
||||
|
||||
Enhanced documentation includes:
|
||||
|
||||
- **LLM Table Extraction Guide**: Comprehensive examples and best practices
|
||||
- **Migration Documentation**: Updated patterns for new table extraction methods
|
||||
- **Docker Deployment**: Complete deployment guide with examples
|
||||
- **Performance Optimization**: Guidelines for concurrent crawling
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and community for feedback, bug reports, and feature requests that made this release possible.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [LLM Table Extraction Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/llm_table_extraction_example.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI v0.7.4 delivers intelligent table extraction and significant performance improvements. The new LLMTableExtraction strategy handles complex tables that were previously impossible to process, while concurrency improvements make batch operations 3-4x faster. Try the intelligent table extraction—it's a game changer for data extraction workflows!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -8,20 +8,26 @@ from typing import Dict, Any
|
||||
|
||||
|
||||
class Crawl4AiTester:
|
||||
def __init__(self, base_url: str = "http://localhost:11235"):
|
||||
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
|
||||
self.base_url = base_url
|
||||
self.api_token = (
|
||||
api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
|
||||
) # Check environment variable as fallback
|
||||
self.headers = (
|
||||
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
|
||||
)
|
||||
|
||||
def submit_and_wait(
|
||||
self, request_data: Dict[str, Any], timeout: int = 300
|
||||
) -> Dict[str, Any]:
|
||||
# Submit crawl job using async endpoint
|
||||
# Submit crawl job
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl/job", json=request_data
|
||||
f"{self.base_url}/crawl", json=request_data, headers=self.headers
|
||||
)
|
||||
response.raise_for_status()
|
||||
job_response = response.json()
|
||||
task_id = job_response["task_id"]
|
||||
print(f"Submitted job with task_id: {task_id}")
|
||||
if response.status_code == 403:
|
||||
raise Exception("API token is invalid or missing")
|
||||
task_id = response.json()["task_id"]
|
||||
print(f"Task ID: {task_id}")
|
||||
|
||||
# Poll for result
|
||||
start_time = time.time()
|
||||
@@ -32,9 +38,8 @@ class Crawl4AiTester:
|
||||
)
|
||||
|
||||
result = requests.get(
|
||||
f"{self.base_url}/crawl/job/{task_id}"
|
||||
f"{self.base_url}/task/{task_id}", headers=self.headers
|
||||
)
|
||||
result.raise_for_status()
|
||||
status = result.json()
|
||||
|
||||
if status["status"] == "failed":
|
||||
@@ -47,10 +52,10 @@ class Crawl4AiTester:
|
||||
time.sleep(2)
|
||||
|
||||
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
# Use synchronous crawl endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl",
|
||||
f"{self.base_url}/crawl_sync",
|
||||
json=request_data,
|
||||
headers=self.headers,
|
||||
timeout=60,
|
||||
)
|
||||
if response.status_code == 408:
|
||||
@@ -58,9 +63,20 @@ class Crawl4AiTester:
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Directly crawl without using task queue"""
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
|
||||
)
|
||||
response.raise_for_status()
|
||||
return response.json()
|
||||
|
||||
|
||||
def test_docker_deployment(version="basic"):
|
||||
tester = Crawl4AiTester(
|
||||
base_url="http://localhost:11235",
|
||||
# base_url="https://api.crawl4ai.com" # just for example
|
||||
# api_token="test" # just for example
|
||||
)
|
||||
print(f"Testing Crawl4AI Docker {version} version")
|
||||
|
||||
@@ -79,8 +95,11 @@ def test_docker_deployment(version="basic"):
|
||||
time.sleep(5)
|
||||
|
||||
# Test cases based on version
|
||||
test_basic_crawl_direct(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl_sync(tester)
|
||||
|
||||
if version in ["full", "transformer"]:
|
||||
test_cosine_extraction(tester)
|
||||
|
||||
@@ -93,129 +112,115 @@ def test_docker_deployment(version="basic"):
|
||||
|
||||
|
||||
def test_basic_crawl(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Async) ===")
|
||||
print("\n=== Testing Basic Crawl ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"Basic crawl result count: {len(result['result']['results'])}")
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["results"]) > 0
|
||||
assert len(result["result"]["results"][0]["markdown"]) > 0
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_sync(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Sync) ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_sync(request)
|
||||
print(f"Basic crawl result count: {len(result['results'])}")
|
||||
assert result["success"]
|
||||
assert len(result["results"]) > 0
|
||||
assert len(result["results"][0]["markdown"]) > 0
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["status"] == "completed"
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_direct(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Direct) ===")
|
||||
request = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 10,
|
||||
# "session_id": "test"
|
||||
"cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
|
||||
}
|
||||
|
||||
result = tester.crawl_direct(request)
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_js_execution(tester: Crawl4AiTester):
|
||||
print("\n=== Testing JS Execution ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if(loadMoreButton) loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "wide-tease-item__wrapper df flex-column flex-row-m flex-nowrap-m enable-new-sports-feed-mobile-design(10)"
|
||||
}
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "article.tease-card:nth-child(10)",
|
||||
"crawler_params": {"headless": True},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"JS execution result count: {len(result['result']['results'])}")
|
||||
print(f"JS execution result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_css_selector(tester: Crawl4AiTester):
|
||||
print("\n=== Testing CSS Selector ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"word_count_threshold": 10
|
||||
}
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 7,
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"crawler_params": {"headless": True},
|
||||
"extra": {"word_count_threshold": 10},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"CSS selector result count: {len(result['result']['results'])}")
|
||||
print(f"CSS selector result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_structured_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Structured Extraction ===")
|
||||
schema = {
|
||||
"name": "Cryptocurrency Prices",
|
||||
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asset_name",
|
||||
"selector": "td:nth-child(2) p.cds-headline-h4steop",
|
||||
"type": "text"
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "asset_symbol",
|
||||
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_image_url",
|
||||
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "asset_url",
|
||||
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "change",
|
||||
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
],
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.coinbase.com/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {"schema": schema}
|
||||
}
|
||||
}
|
||||
}
|
||||
"urls": "https://www.coinbase.com/explore",
|
||||
"priority": 9,
|
||||
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} items")
|
||||
if extracted:
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
assert len(extracted) > 0
|
||||
|
||||
@@ -225,54 +230,43 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"asset_name": {
|
||||
"model_name": {
|
||||
"type": "string",
|
||||
"description": "Name of the asset.",
|
||||
"description": "Name of the OpenAI model.",
|
||||
},
|
||||
"price": {
|
||||
"input_fee": {
|
||||
"type": "string",
|
||||
"description": "Price of the asset.",
|
||||
"description": "Fee for input token for the OpenAI model.",
|
||||
},
|
||||
"change": {
|
||||
"output_fee": {
|
||||
"type": "string",
|
||||
"description": "Change in price of the asset.",
|
||||
"description": "Fee for output token for the OpenAI model.",
|
||||
},
|
||||
},
|
||||
"required": ["asset_name", "price", "change"],
|
||||
"required": ["model_name", "input_fee", "output_fee"],
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.coinbase.com/en-in/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"urls": "https://openai.com/api/pricing",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "gemini/gemini-2.0-flash-exp",
|
||||
"api_token": os.getenv("GEMINI_API_KEY")
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "From the crawled content, extract asset names along with their prices and change in price.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
"provider": "openai/gpt-4o-mini",
|
||||
"api_token": os.getenv("OPENAI_API_KEY"),
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
|
||||
},
|
||||
},
|
||||
"crawler_params": {"word_count_threshold": 1},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} asset pricing entries")
|
||||
if extracted:
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} model pricing entries")
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
|
||||
@@ -280,16 +274,6 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
|
||||
def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
print("\n=== Testing LLM with Ollama ===")
|
||||
|
||||
# Check if Ollama is accessible first
|
||||
try:
|
||||
ollama_response = requests.get("http://localhost:11434/api/tags", timeout=5)
|
||||
ollama_response.raise_for_status()
|
||||
print("Ollama is accessible")
|
||||
except:
|
||||
print("Ollama is not accessible, skipping test")
|
||||
return
|
||||
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
@@ -310,33 +294,24 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"verbose": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama3.2:latest",
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
"provider": "ollama/llama2",
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
},
|
||||
},
|
||||
"extra": {"word_count_threshold": 1},
|
||||
"crawler_params": {"verbose": True},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print("Extracted content:", json.dumps(extracted, indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
@@ -346,30 +321,24 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Cosine Extraction ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "cosine",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "CosineStrategy",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} text clusters")
|
||||
if extracted:
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"Cosine extraction test failed: {str(e)}")
|
||||
@@ -378,25 +347,20 @@ def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
def test_screenshot(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Screenshot ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"screenshot": True
|
||||
}
|
||||
}
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"priority": 5,
|
||||
"screenshot": True,
|
||||
"crawler_params": {"headless": True},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
screenshot_data = result["result"]["results"][0]["screenshot"]
|
||||
print("Screenshot captured:", bool(screenshot_data))
|
||||
print("Screenshot captured:", bool(result["result"]["screenshot"]))
|
||||
|
||||
if screenshot_data:
|
||||
if result["result"]["screenshot"]:
|
||||
# Save screenshot
|
||||
screenshot_bytes = base64.b64decode(screenshot_data)
|
||||
screenshot_data = base64.b64decode(result["result"]["screenshot"])
|
||||
with open("test_screenshot.jpg", "wb") as f:
|
||||
f.write(screenshot_bytes)
|
||||
f.write(screenshot_data)
|
||||
print("Screenshot saved as test_screenshot.jpg")
|
||||
|
||||
assert result["result"]["success"]
|
||||
@@ -404,4 +368,5 @@ def test_screenshot(tester: Crawl4AiTester):
|
||||
|
||||
if __name__ == "__main__":
|
||||
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
|
||||
# version = "full"
|
||||
test_docker_deployment(version)
|
||||
|
||||
@@ -1,356 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Example demonstrating LLM-based table extraction in Crawl4AI.
|
||||
|
||||
This example shows how to use the LLMTableExtraction strategy to extract
|
||||
complex tables from web pages, including handling rowspan, colspan, and nested tables.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
|
||||
# Get the grandparent directory
|
||||
grandparent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
sys.path.append(grandparent_dir)
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
LLMConfig,
|
||||
LLMTableExtraction,
|
||||
CacheMode
|
||||
)
|
||||
import pandas as pd
|
||||
|
||||
|
||||
# Example 1: Basic LLM Table Extraction
|
||||
async def basic_llm_extraction():
|
||||
"""Extract tables using LLM with default settings."""
|
||||
print("\n=== Example 1: Basic LLM Table Extraction ===")
|
||||
|
||||
# Configure LLM (using OpenAI GPT-4o-mini for cost efficiency)
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY", # Uses environment variable
|
||||
temperature=0.1, # Low temperature for consistency
|
||||
max_tokens=32000
|
||||
)
|
||||
|
||||
# Create LLM table extraction strategy
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
verbose=True,
|
||||
# css_selector="div.mw-content-ltr",
|
||||
max_tries=2,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=5000, # Lower threshold to force chunking
|
||||
min_rows_per_chunk=10,
|
||||
max_parallel_chunks=3
|
||||
)
|
||||
|
||||
# Configure crawler with the strategy
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Extract tables from a Wikipedia page
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/List_of_chemical_elements",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"✓ Found {len(result.tables)} tables")
|
||||
|
||||
# Display first table
|
||||
if result.tables:
|
||||
first_table = result.tables[0]
|
||||
print(f"\nFirst table:")
|
||||
print(f" Headers: {first_table['headers'][:5]}...")
|
||||
print(f" Rows: {len(first_table['rows'])}")
|
||||
|
||||
# Convert to pandas DataFrame
|
||||
df = pd.DataFrame(
|
||||
first_table['rows'],
|
||||
columns=first_table['headers']
|
||||
)
|
||||
print(f"\nDataFrame shape: {df.shape}")
|
||||
print(df.head())
|
||||
else:
|
||||
print(f"✗ Extraction failed: {result.error}")
|
||||
|
||||
|
||||
# Example 2: Focused Extraction with CSS Selector
|
||||
async def focused_extraction():
|
||||
"""Extract tables from specific page sections using CSS selectors."""
|
||||
print("\n=== Example 2: Focused Extraction with CSS Selector ===")
|
||||
|
||||
# HTML with multiple tables
|
||||
test_html = """
|
||||
<html>
|
||||
<body>
|
||||
<div class="sidebar">
|
||||
<table role="presentation">
|
||||
<tr><td>Navigation</td></tr>
|
||||
</table>
|
||||
</div>
|
||||
|
||||
<div class="main-content">
|
||||
<table id="data-table">
|
||||
<caption>Quarterly Sales Report</caption>
|
||||
<thead>
|
||||
<tr>
|
||||
<th rowspan="2">Product</th>
|
||||
<th colspan="3">Q1 2024</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>Jan</th>
|
||||
<th>Feb</th>
|
||||
<th>Mar</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td>Widget A</td>
|
||||
<td>100</td>
|
||||
<td>120</td>
|
||||
<td>140</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Widget B</td>
|
||||
<td>200</td>
|
||||
<td>180</td>
|
||||
<td>220</td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY"
|
||||
)
|
||||
|
||||
# Focus only on main content area
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
css_selector=".main-content", # Only extract from main content
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url=f"raw:{test_html}",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.tables:
|
||||
table = result.tables[0]
|
||||
print(f"✓ Extracted table: {table.get('caption', 'No caption')}")
|
||||
print(f" Headers: {table['headers']}")
|
||||
print(f" Metadata: {table['metadata']}")
|
||||
|
||||
# The LLM should have handled the rowspan/colspan correctly
|
||||
print("\nProcessed data (rowspan/colspan handled):")
|
||||
for i, row in enumerate(table['rows']):
|
||||
print(f" Row {i+1}: {row}")
|
||||
|
||||
|
||||
# Example 3: Comparing with Default Extraction
|
||||
async def compare_strategies():
|
||||
"""Compare LLM extraction with default extraction on complex tables."""
|
||||
print("\n=== Example 3: Comparing LLM vs Default Extraction ===")
|
||||
|
||||
# Complex table with nested structure
|
||||
complex_html = """
|
||||
<html>
|
||||
<body>
|
||||
<table>
|
||||
<tr>
|
||||
<th rowspan="3">Category</th>
|
||||
<th colspan="2">2023</th>
|
||||
<th colspan="2">2024</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>H1</th>
|
||||
<th>H2</th>
|
||||
<th>H1</th>
|
||||
<th>H2</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="4">All values in millions</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Revenue</td>
|
||||
<td>100</td>
|
||||
<td>120</td>
|
||||
<td>130</td>
|
||||
<td>145</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Profit</td>
|
||||
<td>20</td>
|
||||
<td>25</td>
|
||||
<td>28</td>
|
||||
<td>32</td>
|
||||
</tr>
|
||||
</table>
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Test with default extraction
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
|
||||
default_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=3,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config_default = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=default_strategy
|
||||
)
|
||||
|
||||
result_default = await crawler.arun(
|
||||
url=f"raw:{complex_html}",
|
||||
config=config_default
|
||||
)
|
||||
|
||||
# Test with LLM extraction
|
||||
llm_strategy = LLMTableExtraction(
|
||||
llm_config=LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY"
|
||||
),
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config_llm = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=llm_strategy
|
||||
)
|
||||
|
||||
result_llm = await crawler.arun(
|
||||
url=f"raw:{complex_html}",
|
||||
config=config_llm
|
||||
)
|
||||
|
||||
# Compare results
|
||||
print("\nDefault Extraction:")
|
||||
if result_default.tables:
|
||||
table = result_default.tables[0]
|
||||
print(f" Headers: {table.get('headers', [])}")
|
||||
print(f" Rows: {len(table.get('rows', []))}")
|
||||
for i, row in enumerate(table.get('rows', [])[:3]):
|
||||
print(f" Row {i+1}: {row}")
|
||||
|
||||
print("\nLLM Extraction (handles complex structure better):")
|
||||
if result_llm.tables:
|
||||
table = result_llm.tables[0]
|
||||
print(f" Headers: {table.get('headers', [])}")
|
||||
print(f" Rows: {len(table.get('rows', []))}")
|
||||
for i, row in enumerate(table.get('rows', [])):
|
||||
print(f" Row {i+1}: {row}")
|
||||
print(f" Metadata: {table.get('metadata', {})}")
|
||||
|
||||
# Example 4: Batch Processing Multiple Pages
|
||||
async def batch_extraction():
|
||||
"""Extract tables from multiple pages efficiently."""
|
||||
print("\n=== Example 4: Batch Table Extraction ===")
|
||||
|
||||
urls = [
|
||||
"https://www.worldometers.info/geography/alphabetical-list-of-countries/",
|
||||
# "https://en.wikipedia.org/wiki/List_of_chemical_elements",
|
||||
]
|
||||
|
||||
llm_config = LLMConfig(
|
||||
provider="openai/gpt-4.1-mini",
|
||||
api_token="env:OPENAI_API_KEY",
|
||||
temperature=0.1,
|
||||
max_tokens=1500
|
||||
)
|
||||
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
css_selector="div.datatable-container", # Wikipedia data tables
|
||||
verbose=False,
|
||||
enable_chunking=True,
|
||||
chunk_token_threshold=5000, # Lower threshold to force chunking
|
||||
min_rows_per_chunk=10,
|
||||
max_parallel_chunks=3
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
all_tables = []
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
for url in urls:
|
||||
print(f"\nProcessing: {url.split('/')[-1][:50]}...")
|
||||
result = await crawler.arun(url=url, config=config)
|
||||
|
||||
if result.success and result.tables:
|
||||
print(f" ✓ Found {len(result.tables)} tables")
|
||||
# Store first table from each page
|
||||
if result.tables:
|
||||
all_tables.append({
|
||||
'url': url,
|
||||
'table': result.tables[0]
|
||||
})
|
||||
|
||||
# Summary
|
||||
print(f"\n=== Summary ===")
|
||||
print(f"Extracted {len(all_tables)} tables from {len(urls)} pages")
|
||||
for item in all_tables:
|
||||
table = item['table']
|
||||
print(f"\nFrom {item['url'].split('/')[-1][:30]}:")
|
||||
print(f" Columns: {len(table['headers'])}")
|
||||
print(f" Rows: {len(table['rows'])}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples."""
|
||||
print("=" * 60)
|
||||
print("LLM TABLE EXTRACTION EXAMPLES")
|
||||
print("=" * 60)
|
||||
|
||||
# Run examples (comment out ones you don't want to run)
|
||||
|
||||
# Basic extraction
|
||||
await basic_llm_extraction()
|
||||
|
||||
# # Focused extraction with CSS
|
||||
# await focused_extraction()
|
||||
|
||||
# # Compare strategies
|
||||
# await compare_strategies()
|
||||
|
||||
# # Batch processing
|
||||
# await batch_extraction()
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("ALL EXAMPLES COMPLETED")
|
||||
print("=" * 60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -1,276 +0,0 @@
|
||||
"""
|
||||
Example: Using Table Extraction Strategies in Crawl4AI
|
||||
|
||||
This example demonstrates how to use different table extraction strategies
|
||||
to extract tables from web pages.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import pandas as pd
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
CacheMode,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
TableExtractionStrategy
|
||||
)
|
||||
from typing import Dict, List, Any
|
||||
|
||||
|
||||
async def example_default_extraction():
|
||||
"""Example 1: Using default table extraction (automatic)."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 1: Default Table Extraction")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# No need to specify table_extraction - uses DefaultTableExtraction automatically
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_score_threshold=7 # Adjust sensitivity (default: 7)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.tables:
|
||||
print(f"Found {len(result.tables)} tables")
|
||||
|
||||
# Convert first table to pandas DataFrame
|
||||
if result.tables:
|
||||
first_table = result.tables[0]
|
||||
df = pd.DataFrame(
|
||||
first_table['rows'],
|
||||
columns=first_table['headers'] if first_table['headers'] else None
|
||||
)
|
||||
print(f"\nFirst table preview:")
|
||||
print(df.head())
|
||||
print(f"Shape: {df.shape}")
|
||||
|
||||
|
||||
async def example_custom_configuration():
|
||||
"""Example 2: Custom table extraction configuration."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 2: Custom Table Configuration")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Create custom extraction strategy with specific settings
|
||||
table_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5, # Lower threshold for more permissive detection
|
||||
min_rows=3, # Only extract tables with at least 3 rows
|
||||
min_cols=2, # Only extract tables with at least 2 columns
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=table_strategy,
|
||||
# Target specific tables using CSS selector
|
||||
css_selector="div.main-content"
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://example.com/data",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Found {len(result.tables)} tables matching criteria")
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
print(f"\nTable {i+1}:")
|
||||
print(f" Caption: {table.get('caption', 'No caption')}")
|
||||
print(f" Size: {table['metadata']['row_count']} rows × {table['metadata']['column_count']} columns")
|
||||
print(f" Has headers: {table['metadata']['has_headers']}")
|
||||
|
||||
|
||||
async def example_disable_extraction():
|
||||
"""Example 3: Disable table extraction when not needed."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 3: Disable Table Extraction")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Use NoTableExtraction to skip table processing entirely
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=NoTableExtraction() # No tables will be extracted
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://example.com",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Tables extracted: {len(result.tables)} (should be 0)")
|
||||
print("Table extraction disabled - better performance for non-table content")
|
||||
|
||||
|
||||
class FinancialTableExtraction(TableExtractionStrategy):
|
||||
"""
|
||||
Custom strategy for extracting financial tables with specific requirements.
|
||||
"""
|
||||
|
||||
def __init__(self, currency_symbols=None, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
"""Extract only tables that appear to contain financial data."""
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check if table contains currency symbols
|
||||
table_text = ''.join(table.itertext())
|
||||
has_currency = any(symbol in table_text for symbol in self.currency_symbols)
|
||||
|
||||
if not has_currency:
|
||||
continue
|
||||
|
||||
# Extract using base logic (could reuse DefaultTableExtraction logic)
|
||||
headers = []
|
||||
rows = []
|
||||
|
||||
# Extract headers
|
||||
for th in table.xpath(".//thead//th | .//tr[1]//th"):
|
||||
headers.append(th.text_content().strip())
|
||||
|
||||
# Extract rows
|
||||
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
|
||||
row = []
|
||||
for td in tr.xpath(".//td"):
|
||||
cell_text = td.text_content().strip()
|
||||
# Clean currency values
|
||||
for symbol in self.currency_symbols:
|
||||
cell_text = cell_text.replace(symbol, '')
|
||||
row.append(cell_text)
|
||||
if row:
|
||||
rows.append(row)
|
||||
|
||||
if headers or rows:
|
||||
tables_data.append({
|
||||
"headers": headers,
|
||||
"rows": rows,
|
||||
"caption": table.xpath(".//caption/text()")[0] if table.xpath(".//caption") else "",
|
||||
"summary": table.get("summary", ""),
|
||||
"metadata": {
|
||||
"type": "financial",
|
||||
"has_currency": True,
|
||||
"row_count": len(rows),
|
||||
"column_count": len(headers) if headers else len(rows[0]) if rows else 0
|
||||
}
|
||||
})
|
||||
|
||||
return tables_data
|
||||
|
||||
|
||||
async def example_custom_strategy():
|
||||
"""Example 4: Custom table extraction strategy."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 4: Custom Financial Table Strategy")
|
||||
print("="*50)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Use custom strategy for financial tables
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
table_extraction=FinancialTableExtraction(
|
||||
currency_symbols=['$', '€'],
|
||||
verbose=True
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://finance.yahoo.com/",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Found {len(result.tables)} financial tables")
|
||||
|
||||
for table in result.tables:
|
||||
if table['metadata'].get('type') == 'financial':
|
||||
print(f" ✓ Financial table with {table['metadata']['row_count']} rows")
|
||||
|
||||
|
||||
async def example_combined_extraction():
|
||||
"""Example 5: Combine table extraction with other strategies."""
|
||||
print("\n" + "="*50)
|
||||
print("Example 5: Combined Extraction Strategies")
|
||||
print("="*50)
|
||||
|
||||
from crawl4ai import LLMExtractionStrategy, LLMConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Define schema for structured extraction
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"page_title": {"type": "string"},
|
||||
"main_topic": {"type": "string"},
|
||||
"key_figures": {
|
||||
"type": "array",
|
||||
"items": {"type": "string"}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
# Table extraction
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
min_rows=2
|
||||
),
|
||||
# LLM extraction for structured data
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
|
||||
schema=schema
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
"https://en.wikipedia.org/wiki/Economy_of_the_United_States",
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print(f"Tables found: {len(result.tables)}")
|
||||
|
||||
# Tables are in result.tables
|
||||
if result.tables:
|
||||
print(f"First table has {len(result.tables[0]['rows'])} rows")
|
||||
|
||||
# Structured data is in result.extracted_content
|
||||
if result.extracted_content:
|
||||
import json
|
||||
structured_data = json.loads(result.extracted_content)
|
||||
print(f"Page title: {structured_data.get('page_title', 'N/A')}")
|
||||
print(f"Main topic: {structured_data.get('main_topic', 'N/A')}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""Run all examples."""
|
||||
print("\n" + "="*60)
|
||||
print("CRAWL4AI TABLE EXTRACTION EXAMPLES")
|
||||
print("="*60)
|
||||
|
||||
# Run examples
|
||||
await example_default_extraction()
|
||||
await example_custom_configuration()
|
||||
await example_disable_extraction()
|
||||
await example_custom_strategy()
|
||||
# await example_combined_extraction() # Requires OpenAI API key
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("EXAMPLES COMPLETED")
|
||||
print("="*60)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -20,22 +20,130 @@ Ever wondered why your AI coding assistant struggles with your library despite c
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [Crawl4AI v0.7.4 – The Intelligent Table Extraction & Performance Update](../blog/release-v0.7.4.md)
|
||||
*August 17, 2025*
|
||||
### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md)
|
||||
*January 28, 2025*
|
||||
|
||||
Crawl4AI v0.7.4 introduces revolutionary LLM-powered table extraction with intelligent chunking, performance improvements for concurrent crawling, enhanced browser management, and critical stability fixes that make Crawl4AI more robust for production workloads.
|
||||
Crawl4AI v0.7.0 introduces groundbreaking intelligence features that transform how crawlers understand and adapt to websites. This release brings Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, and the powerful Async URL Seeder for massive URL discovery.
|
||||
|
||||
Key highlights:
|
||||
- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables
|
||||
- **⚡ Dispatcher Bug Fix**: Fixed sequential processing issue in arun_many for fast-completing tasks
|
||||
- **🧹 Memory Management Refactor**: Streamlined memory utilities and better resource management
|
||||
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation
|
||||
- **🔗 Advanced URL Processing**: Better handling of raw URLs and base tag link resolution
|
||||
- **Adaptive Crawling**: Crawlers that learn and adapt to website structures automatically
|
||||
- **Virtual Scroll Support**: Complete content extraction from modern infinite scroll pages
|
||||
- **Link Preview**: 3-layer scoring system for intelligent link prioritization
|
||||
- **Async URL Seeder**: Discover thousands of URLs in seconds with smart filtering
|
||||
- **Performance Boost**: Up to 3x faster with optimized resource handling
|
||||
|
||||
[Read full release notes →](../blog/release-v0.7.4.md)
|
||||
[Read full release notes →](releases/0.7.0.md)
|
||||
|
||||
---
|
||||
|
||||
## Previous Releases
|
||||
|
||||
### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
|
||||
*December 23, 2024*
|
||||
|
||||
Crawl4AI v0.6.0 brought major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
|
||||
|
||||
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
|
||||
|
||||
Other key changes:
|
||||
|
||||
* Native support for `result.media["tables"]` to export DataFrames
|
||||
* Full network + console logs and MHTML snapshot per crawl
|
||||
* Browser pooling and pre-warming for faster cold starts
|
||||
* New streaming endpoints via MCP API and Playground
|
||||
* Robots.txt support, proxy rotation, and improved session handling
|
||||
* Deprecated old markdown names, legacy modules cleaned up
|
||||
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
|
||||
|
||||
[Read full release notes →](releases/0.6.0.md)
|
||||
|
||||
---
|
||||
|
||||
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
|
||||
|
||||
My dear friends and crawlers, here it is: Crawl4AI v0.5.0! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:
|
||||
|
||||
**Major New Features:**
|
||||
|
||||
* **Deep Crawling:** Explore entire websites with configurable strategies (BFS, DFS, Best-First). Define custom filters and URL scoring for targeted crawls.
|
||||
* **Memory-Adaptive Dispatcher:** Handle large-scale crawls with ease! Our new dispatcher dynamically adjusts concurrency based on available memory and includes built-in rate limiting.
|
||||
* **Multiple Crawler Strategies:** Choose between the full-featured Playwright browser-based crawler or a new, *much* faster HTTP-only crawler for simpler tasks.
|
||||
* **Docker Deployment:** Deploy Crawl4AI as a scalable, self-contained service with built-in API endpoints and optional JWT authentication.
|
||||
* **Command-Line Interface (CLI):** Interact with Crawl4AI directly from your terminal. Crawl, configure, and extract data with simple commands.
|
||||
* **LLM Configuration (`LLMConfig`):** A new, unified way to configure LLM providers (OpenAI, Anthropic, Ollama, etc.) for extraction, filtering, and schema generation. Simplifies API key management and switching between models (see the sketch after this list).
|
||||
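As a minimal sketch of the unified configuration (the provider strings here are illustrative):

```python
from crawl4ai import LLMConfig

# One object describes the provider, credentials, and generation settings
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",    # or "anthropic/...", "ollama/llama3", ...
    api_token="env:OPENAI_API_KEY",   # read the key from an environment variable
)

# The same object is reused wherever an LLM is needed, e.g.
#   LLMExtractionStrategy(llm_config=llm_config, ...)
#   LLMContentFilter(llm_config=llm_config, ...)
```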
|
||||
**Minor Updates & Improvements:**
|
||||
|
||||
* **LXML Scraping Mode:** Faster HTML parsing with `LXMLWebScrapingStrategy`.
|
||||
* **Proxy Rotation:** Added `ProxyRotationStrategy` with a `RoundRobinProxyStrategy` implementation.
|
||||
* **PDF Processing:** Extract text, images, and metadata from PDF files.
|
||||
* **URL Redirection Tracking:** Automatically follows and records redirects.
|
||||
* **Robots.txt Compliance:** Optionally respect website crawling rules.
|
||||
* **LLM-Powered Schema Generation:** Automatically create extraction schemas using an LLM.
|
||||
* **`LLMContentFilter`:** Generate high-quality, focused markdown using an LLM.
|
||||
* **Improved Error Handling & Stability:** Numerous bug fixes and performance enhancements.
|
||||
* **Enhanced Documentation:** Updated guides and examples.
|
||||
|
||||
**Breaking Changes & Migration:**
|
||||
|
||||
This release includes several breaking changes to improve the library's structure and consistency. Here's what you need to know:
|
||||
|
||||
* **`arun_many()` Behavior:** Now uses the `MemoryAdaptiveDispatcher` by default. The return type depends on the `stream` parameter in `CrawlerRunConfig`. Adjust code that relied on unbounded concurrency (a short sketch of both modes follows this list).
|
||||
* **`max_depth` Location:** Moved to `CrawlerRunConfig` and now controls *crawl depth*.
|
||||
* **Deep Crawling Imports:** Import `DeepCrawlStrategy` and related classes from `crawl4ai.deep_crawling`.
|
||||
* **`BrowserContext` API:** Updated; the old `get_context` method is deprecated.
|
||||
* **Optional Model Fields:** Many data model fields are now optional. Handle potential `None` values.
|
||||
* **`ScrapingMode` Enum:** Replaced with strategy pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
|
||||
* **`content_filter` Parameter:** Removed from `CrawlerRunConfig`. Use extraction strategies or markdown generators with filters.
|
||||
* **Removed Functionality:** The synchronous `WebCrawler`, the old CLI, and docs management tools have been removed.
|
||||
* **Docker:** Significant changes to deployment. See the [Docker documentation](../deploy/docker/README.md).
|
||||
* **`ssl_certificate.json`:** This file has been removed.
|
||||
* **Config**: FastFilterChain has been replaced with FilterChain
|
||||
* **Deep-Crawl**: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
|
||||
* **Proxy**: Removed synchronous WebCrawler support and related rate limiting configurations
|
||||
* **LLM Parameters:** Use the new `LLMConfig` object instead of passing `provider`, `api_token`, `base_url`, and `api_base` directly to `LLMExtractionStrategy` and `LLMContentFilter`.
|
||||
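To make the new `arun_many()` contract concrete, here is a minimal sketch of both modes. It assumes the post-0.5.0 API described above; the URLs are placeholders.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def demo(urls):
    async with AsyncWebCrawler() as crawler:
        # stream=False (default): wait for everything, get a list back
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(stream=False))
        for r in results:
            print(r.url, r.success)

        # stream=True: get an async generator and handle results as they finish
        async for r in await crawler.arun_many(urls, config=CrawlerRunConfig(stream=True)):
            print(r.url, r.success)

asyncio.run(demo(["https://example.com", "https://example.org"]))
```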
|
||||
**In short:** Update imports, adjust `arun_many()` usage, check for optional fields, and review the Docker deployment guide.
|
||||
|
||||
## License Change
|
||||
|
||||
Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution. See the updated `LICENSE` file for the full legal text and specific requirements.
|
||||
|
||||
**Get Started:**
|
||||
|
||||
* **Installation:** `pip install "crawl4ai[all]"` (or use the Docker image)
|
||||
* **Documentation:** [https://docs.crawl4ai.com](https://docs.crawl4ai.com)
|
||||
* **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
|
||||
I'm very excited to see what you build with Crawl4AI v0.5.0!
|
||||
|
||||
---
|
||||
|
||||
### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
|
||||
*December 12, 2024*
|
||||
|
||||
The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
|
||||
|
||||
[Read full release notes →](releases/0.4.2.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
|
||||
*December 8, 2024*
|
||||
|
||||
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
|
||||
|
||||
[Read full release notes →](releases/0.4.1.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
|
||||
*December 1, 2024*
|
||||
|
||||
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
|
||||
[Read full release notes →](releases/0.4.0.md)
|
||||
|
||||
## Project History
|
||||
|
||||
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
|
||||
|
||||
@@ -1,170 +0,0 @@
|
||||
# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update
|
||||
|
||||
*August 6, 2025 • 5 min read*
|
||||
|
||||
---
|
||||
|
||||
Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready.
|
||||
|
||||
## 🎯 What's New at a Glance
|
||||
|
||||
- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch
|
||||
- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables
|
||||
- **Bug Fixes**: Resolved several critical issues for better stability
|
||||
- **Documentation Updates**: Clearer examples and improved API documentation
|
||||
|
||||
## 🎨 Multi-URL Configurations: One Size Doesn't Fit All
|
||||
|
||||
**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic.
|
||||
|
||||
**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support.
|
||||
|
||||
### Technical Implementation
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
|
||||
|
||||
# Define specialized configs for different content types
|
||||
configs = [
|
||||
# Documentation sites - aggressive caching, include links
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*docs*", "*documentation*"],
|
||||
cache_mode="write",
|
||||
markdown_generator_options={"include_links": True}
|
||||
),
|
||||
|
||||
# News/blog sites - fresh content, scroll for lazy loading
|
||||
CrawlerRunConfig(
|
||||
url_matcher=lambda url: 'blog' in url or 'news' in url,
|
||||
cache_mode="bypass",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight/2);"
|
||||
),
|
||||
|
||||
# API endpoints - structured extraction
|
||||
CrawlerRunConfig(
|
||||
url_matcher=["*.json", "*api*"],
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o-mini",
|
||||
extraction_type="structured"
|
||||
)
|
||||
),
|
||||
|
||||
# Default fallback for everything else
|
||||
CrawlerRunConfig() # No url_matcher = matches everything
|
||||
]
|
||||
|
||||
# Crawl multiple URLs with appropriate configs
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
results = await crawler.arun_many(
|
||||
urls=[
|
||||
"https://docs.python.org/3/", # → Uses documentation config
|
||||
"https://blog.python.org/", # → Uses blog config
|
||||
"https://api.github.com/users", # → Uses API config
|
||||
"https://example.com/" # → Uses default config
|
||||
],
|
||||
config=configs
|
||||
)
|
||||
```
|
||||
|
||||
**Matching Capabilities:**
|
||||
- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"`
|
||||
- **Function Matchers**: Lambda functions for complex logic
|
||||
- **Mixed Matchers**: Combine strings and functions with AND/OR logic (see the sketch after this list)
|
||||
- **Fallback Support**: Default config when nothing matches
|
||||
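A short sketch of the mixed-matcher case, assuming `url_matcher` accepts a list mixing wildcard strings and callables and that `match_mode` chooses how they combine:

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Match only PDF links that also live under a /reports/ path:
# the wildcard string and the lambda must BOTH match (MatchMode.AND).
pdf_reports_config = CrawlerRunConfig(
    url_matcher=["*.pdf", lambda url: "/reports/" in url],
    match_mode=MatchMode.AND,
)

# Match anything that is either a blog post OR served from a news subdomain.
blog_or_news_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", lambda url: url.startswith("https://news.")],
    match_mode=MatchMode.OR,
)
```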
|
||||
**Expected Real-World Impact:**
|
||||
- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl
|
||||
- **Multi-Domain Crawling**: Different strategies per domain without separate runs
|
||||
- **Reduced Complexity**: No more if/else forests in your extraction code
|
||||
- **Better Performance**: Each URL gets exactly the processing it needs
|
||||
|
||||
## 🐳 Docker: Flexible LLM Provider Configuration
|
||||
|
||||
**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images.
|
||||
|
||||
**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images.
|
||||
|
||||
### Deployment Flexibility
|
||||
|
||||
```bash
|
||||
# Option 1: Direct environment variables
|
||||
docker run -d \
|
||||
-e LLM_PROVIDER="groq/llama-3.2-3b-preview" \
|
||||
-e GROQ_API_KEY="your-key" \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
|
||||
# Option 2: Using .llm.env file (recommended for production)
|
||||
# Create .llm.env file:
|
||||
# LLM_PROVIDER=openai/gpt-4o-mini
|
||||
# OPENAI_API_KEY=your-openai-key
|
||||
# GROQ_API_KEY=your-groq-key
|
||||
|
||||
docker run -d \
|
||||
--env-file .llm.env \
|
||||
-p 11235:11235 \
|
||||
unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
Override per request when needed:
|
||||
```python
|
||||
# Use default provider from .llm.env
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://example.com",
|
||||
"extraction_strategy": {"type": "llm"}
|
||||
})
|
||||
|
||||
# Override to use different provider for this specific request
|
||||
response = requests.post("http://localhost:11235/crawl", json={
|
||||
"url": "https://complex-page.com",
|
||||
"extraction_strategy": {
|
||||
"type": "llm",
|
||||
"provider": "openai/gpt-4" # Override default
|
||||
}
|
||||
})
|
||||
```
|
||||
|
||||
**Expected Real-World Impact:**
|
||||
- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex
|
||||
- **A/B Testing**: Compare provider performance without deployment changes
|
||||
- **Fallback Strategies**: Switch providers on-the-fly during outages
|
||||
- **Development Flexibility**: Test locally with one provider, deploy with another
|
||||
- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands
|
||||
|
||||
## 🔧 Bug Fixes & Improvements
|
||||
|
||||
This release includes several important bug fixes that improve stability and reliability:
|
||||
|
||||
- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic
|
||||
- **Memory Management**: Resolved memory leaks in long-running crawl sessions
|
||||
- **Sitemap Processing**: Fixed redirect handling in sitemap fetching
|
||||
- **Table Extraction**: Improved table detection and extraction accuracy
|
||||
- **Error Handling**: Better error messages and recovery from network failures
|
||||
|
||||
## 📚 Documentation Enhancements
|
||||
|
||||
Based on community feedback, we've updated:
|
||||
- Clearer examples for multi-URL configuration
|
||||
- Improved CrawlResult documentation with all available fields
|
||||
- Fixed typos and inconsistencies across documentation
|
||||
- Added real-world URLs in examples for better understanding
|
||||
- New comprehensive demo showcasing all v0.7.3 features
|
||||
|
||||
## 🙏 Acknowledgments
|
||||
|
||||
Thanks to our contributors and the entire community for feedback and bug reports.
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
- [Full Documentation](https://docs.crawl4ai.com)
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Discord Community](https://discord.gg/crawl4ai)
|
||||
- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py)
|
||||
|
||||
---
|
||||
|
||||
*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!*
|
||||
|
||||
**Happy Crawling! 🕷️**
|
||||
|
||||
*- The Crawl4AI Team*
|
||||
@@ -58,15 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
|
||||
|
||||
#### 1. Pull the Image
|
||||
|
||||
Our latest release is `0.7.3`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
||||
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
|
||||
|
||||
> 💡 **Note**: The `latest` tag points to the stable `0.7.3` version.
|
||||
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
|
||||
|
||||
```bash
|
||||
# Pull the latest version
|
||||
docker pull unclecode/crawl4ai:0.7.3
|
||||
# Pull the release candidate (for testing new features)
|
||||
docker pull unclecode/crawl4ai:0.7.0-r1
|
||||
|
||||
# Or pull using the latest tag
|
||||
# Or pull the current stable version (0.6.0)
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
@@ -126,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
|
||||
#### Docker Hub Versioning Explained
|
||||
|
||||
* **Image Name:** `unclecode/crawl4ai`
|
||||
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.3`)
|
||||
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
|
||||
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
|
||||
* `SUFFIX`: Optional tag for release candidates and revisions (e.g., `r1`)
|
||||
* **`latest` Tag:** Points to the most recent stable version
|
||||
@@ -176,7 +176,7 @@ The Docker setup now supports flexible LLM provider configuration through three
|
||||
|
||||
3. **Config File Default**: Falls back to `config.yml` (default: `openai/gpt-4o-mini`)
|
||||
|
||||
The system automatically selects the appropriate API key based on the provider. LiteLLM handles finding the correct environment variable for each provider (e.g., OPENAI_API_KEY for OpenAI, GEMINI_API_KEY for Google Gemini, etc.).
|
||||
The system automatically selects the appropriate API key based on the configured `api_key_env` in the config file.
|
||||
|
||||
#### 3. Build and Run with Compose
|
||||
|
||||
@@ -693,7 +693,8 @@ app:
|
||||
# Default LLM Configuration
|
||||
llm:
|
||||
provider: "openai/gpt-4o-mini" # Can be overridden by LLM_PROVIDER env var
|
||||
# api_key: sk-... # If you pass the API key directly (not recommended)
|
||||
api_key_env: "OPENAI_API_KEY"
|
||||
# api_key: sk-... # If you pass the API key directly then api_key_env will be ignored
|
||||
|
||||
# Redis Configuration (Used by internal Redis server managed by supervisord)
|
||||
redis:
|
||||
|
||||
@@ -1,807 +0,0 @@
|
||||
# Table Extraction Strategies
|
||||
|
||||
## Overview
|
||||
|
||||
**New in v0.7.3+**: Table extraction now follows the **Strategy Design Pattern**, providing unprecedented flexibility and power for handling different table structures. Don't worry - **your existing code still works!** We maintain full backward compatibility while offering new capabilities.
|
||||
|
||||
### What's Changed?
|
||||
- **Architecture**: Table extraction now uses pluggable strategies
|
||||
- **Backward Compatible**: Your existing code with `table_score_threshold` continues to work
|
||||
- **More Power**: Choose from multiple strategies or create your own
|
||||
- **Same Default Behavior**: By default, uses `DefaultTableExtraction` (same as before)
|
||||
|
||||
### Key Points
|
||||
✅ **Old code still works** - No breaking changes
|
||||
✅ **Same default behavior** - Uses the proven extraction algorithm
|
||||
✅ **New capabilities** - Add LLM extraction or custom strategies when needed
|
||||
✅ **Strategy pattern** - Clean, extensible architecture
|
||||
|
||||
## Quick Start
|
||||
|
||||
### The Simplest Way (Works Like Before)
|
||||
|
||||
If you're already using Crawl4AI, nothing changes:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async def extract_tables():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# This works exactly like before - uses DefaultTableExtraction internally
|
||||
result = await crawler.arun("https://example.com/data")
|
||||
|
||||
# Tables are automatically extracted and available in result.tables
|
||||
for table in result.tables:
|
||||
print(f"Table with {len(table['rows'])} rows and {len(table['headers'])} columns")
|
||||
print(f"Headers: {table['headers']}")
|
||||
print(f"First row: {table['rows'][0] if table['rows'] else 'No data'}")
|
||||
|
||||
asyncio.run(extract_tables())
|
||||
```
|
||||
|
||||
### Using the Old Configuration (Still Supported)
|
||||
|
||||
Your existing code with `table_score_threshold` continues to work:
|
||||
|
||||
```python
|
||||
# This old approach STILL WORKS - we maintain backward compatibility
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7 # Internally creates DefaultTableExtraction(table_score_threshold=7)
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
```
|
||||
|
||||
## Table Extraction Strategies
|
||||
|
||||
### Understanding the Strategy Pattern
|
||||
|
||||
The strategy pattern allows you to choose different table extraction algorithms at runtime. Think of it as having different tools in a toolbox - you pick the right one for the job:
|
||||
|
||||
- **No explicit strategy?** → Uses `DefaultTableExtraction` automatically (same as v0.7.2 and earlier)
|
||||
- **Need complex table handling?** → Choose `LLMTableExtraction` (costs money, use sparingly)
|
||||
- **Want to disable tables?** → Use `NoTableExtraction`
|
||||
- **Have special requirements?** → Create a custom strategy
|
||||
|
||||
### Available Strategies
|
||||
|
||||
| Strategy | Description | Use Case | Cost | When to Use |
|
||||
|----------|-------------|----------|------|-------------|
|
||||
| `DefaultTableExtraction` | **RECOMMENDED**: Same algorithm as before v0.7.3 | General purpose (default) | Free | **Use this first - handles 95% of cases** |
|
||||
| `LLMTableExtraction` | AI-powered extraction for complex tables | Tables with complex rowspan/colspan | **$$$ Per API call** | Only when DefaultTableExtraction fails |
|
||||
| `NoTableExtraction` | Disables table extraction | When tables aren't needed | Free | For text-only extraction |
|
||||
| Custom strategies | User-defined extraction logic | Specialized requirements | Free | Domain-specific needs |
|
||||
|
||||
> **⚠️ CRITICAL COST WARNING for LLMTableExtraction**:
|
||||
>
|
||||
> **DO NOT USE `LLMTableExtraction` UNLESS ABSOLUTELY NECESSARY!**
|
||||
>
|
||||
> - **Always try `DefaultTableExtraction` first** - It's free and handles most tables perfectly
|
||||
> - LLM extraction **costs money** with every API call
|
||||
> - For large tables (100+ rows), LLM extraction can be **very slow**
|
||||
> - **For large tables**: If you must use LLM, choose fast providers:
|
||||
> - ✅ **Groq** (fastest inference)
|
||||
> - ✅ **Cerebras** (optimized for speed)
|
||||
> - ⚠️ Avoid: OpenAI, Anthropic for large tables (slower)
|
||||
>
|
||||
> **🚧 WORK IN PROGRESS**:
|
||||
> We are actively developing an **advanced non-LLM algorithm** that will handle complex table structures (rowspan, colspan, nested tables) for **FREE**. This will replace the need for costly LLM extraction in most cases. Coming soon!
|
||||
|
||||
### DefaultTableExtraction
|
||||
|
||||
The default strategy uses a sophisticated scoring system to identify data tables:
|
||||
|
||||
```python
|
||||
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
|
||||
|
||||
# Customize the default extraction
|
||||
table_strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7, # Scoring threshold (default: 7)
|
||||
min_rows=2, # Minimum rows required
|
||||
min_cols=2, # Minimum columns required
|
||||
verbose=True # Enable detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
```
|
||||
|
||||
#### Scoring System
|
||||
|
||||
The scoring system evaluates multiple factors:
|
||||
|
||||
| Factor | Score Impact | Description |
|
||||
|--------|--------------|-------------|
|
||||
| Has `<thead>` | +2 | Semantic table structure |
|
||||
| Has `<tbody>` | +1 | Organized table body |
|
||||
| Has `<th>` elements | +2 | Header cells present |
|
||||
| Headers in correct position | +1 | Proper semantic structure |
|
||||
| Consistent column count | +2 | Regular data structure |
|
||||
| Has caption | +2 | Descriptive caption |
|
||||
| Has summary | +1 | Summary attribute |
|
||||
| High text density | +2 to +3 | Content-rich cells |
|
||||
| Data attributes | +0.5 each | Data-* attributes |
|
||||
| Nested tables | -3 | Often indicates layout |
|
||||
| Role="presentation" | -3 | Explicitly non-data |
|
||||
| Too few rows | -2 | Insufficient data |
|
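To see how these factors interact with `table_score_threshold`, here is an illustrative calculation (not the real crawl4ai implementation, just the arithmetic implied by the table above):

```python
# Illustrative only: how the factor scores listed above combine against the threshold.
def illustrative_table_score(traits: dict) -> float:
    score = 0.0
    score += 2 if traits.get("has_thead") else 0
    score += 1 if traits.get("has_tbody") else 0
    score += 2 if traits.get("has_th") else 0
    score += 2 if traits.get("consistent_columns") else 0
    score += 2 if traits.get("has_caption") else 0
    score -= 3 if traits.get("nested_tables") else 0
    score -= 3 if traits.get("role_presentation") else 0
    return score

# A semantic data table (thead + th cells + consistent columns + caption) scores 8
# and passes the default threshold of 7; a layout table with role="presentation"
# and nested tables scores negative and is skipped.
illustrative_table_score({"has_thead": True, "has_th": True,
                          "consistent_columns": True, "has_caption": True})  # -> 8.0
```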
||||
|
||||
### LLMTableExtraction (Use Sparingly!)
|
||||
|
||||
**⚠️ WARNING**: Only use this when `DefaultTableExtraction` fails with complex tables!
|
||||
|
||||
LLMTableExtraction uses AI to understand complex table structures that traditional parsers struggle with. It automatically handles large tables through intelligent chunking and parallel processing:
|
||||
|
||||
```python
|
||||
from crawl4ai import LLMTableExtraction, LLMConfig, CrawlerRunConfig
|
||||
|
||||
# Configure LLM (costs money per call!)
|
||||
llm_config = LLMConfig(
|
||||
provider="groq/llama-3.3-70b-versatile", # Fast provider for large tables
|
||||
api_token="your_api_key",
|
||||
temperature=0.1
|
||||
)
|
||||
|
||||
# Create LLM extraction strategy with smart chunking
|
||||
table_strategy = LLMTableExtraction(
|
||||
llm_config=llm_config,
|
||||
max_tries=3, # Retry up to 3 times if extraction fails
|
||||
css_selector="table", # Optional: focus on specific tables
|
||||
enable_chunking=True, # Automatically chunk large tables (default: True)
|
||||
chunk_token_threshold=3000, # Split tables larger than this (default: 3000 tokens)
|
||||
min_rows_per_chunk=10, # Minimum rows per chunk (default: 10)
|
||||
max_parallel_chunks=5, # Process up to 5 chunks in parallel (default: 5)
|
||||
verbose=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=table_strategy
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
```
|
||||
|
||||
#### When to Use LLMTableExtraction
|
||||
|
||||
✅ **Use ONLY when**:
|
||||
- Tables have complex merged cells (rowspan/colspan) that break DefaultTableExtraction
|
||||
- Nested tables that need semantic understanding
|
||||
- Tables with irregular structures
|
||||
- You've tried DefaultTableExtraction and it failed
|
||||
|
||||
❌ **Never use when**:
|
||||
- DefaultTableExtraction works (99% of cases)
|
||||
- Tables are simple or well-structured
|
||||
- You're processing many pages (costs add up!)
|
||||
- Tables have 100+ rows (very slow)
|
||||
|
||||
#### How Smart Chunking Works
|
||||
|
||||
LLMTableExtraction automatically handles large tables through intelligent chunking:
|
||||
|
||||
1. **Automatic Detection**: Tables exceeding the token threshold are automatically split
|
||||
2. **Smart Splitting**: Chunks are created at row boundaries, preserving table structure
|
||||
3. **Header Preservation**: Each chunk includes the original headers for context
|
||||
4. **Parallel Processing**: Multiple chunks are processed simultaneously for speed
|
||||
5. **Intelligent Merging**: Results are merged back into a single, complete table
|
||||
|
||||
**Chunking Parameters**:
|
||||
- `enable_chunking` (default: `True`): Automatically handle large tables
|
||||
- `chunk_token_threshold` (default: `3000`): When to split tables
|
||||
- `min_rows_per_chunk` (default: `10`): Ensures meaningful chunk sizes
|
||||
- `max_parallel_chunks` (default: `5`): Concurrent processing for speed
|
||||
|
||||
The chunking is completely transparent - you get the same output format whether the table was processed in one piece or multiple chunks.
|
||||
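A rough illustration of the chunking arithmetic described above (not the real implementation):

```python
# A table is split only if it exceeds chunk_token_threshold, and each chunk keeps
# at least min_rows_per_chunk rows plus the original header row for context.
def estimate_chunks(total_tokens: int, n_rows: int,
                    chunk_token_threshold: int = 3000,
                    min_rows_per_chunk: int = 10) -> int:
    if total_tokens <= chunk_token_threshold:
        return 1
    by_tokens = -(-total_tokens // chunk_token_threshold)   # ceiling division
    by_rows = max(1, n_rows // min_rows_per_chunk)           # row floor caps the split
    return min(by_tokens, by_rows)

# e.g. a ~12,000-token table with 80 rows -> min(4, 8) = 4 chunks,
# processed up to max_parallel_chunks at a time and merged back into one table.
```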
|
||||
#### Performance Optimization for LLMTableExtraction
|
||||
|
||||
**Provider Recommendations by Table Size**:
|
||||
|
||||
| Table Size | Recommended Providers | Why |
|
||||
|------------|----------------------|-----|
|
||||
| Small (<50 rows) | Any provider | Fast enough |
|
||||
| Medium (50-200 rows) | Groq, Cerebras | Optimized inference |
|
||||
| Large (200+ rows) | **Groq** (best), Cerebras | Fastest inference + automatic chunking |
|
||||
| Very Large (500+ rows) | Groq with chunking | Parallel processing keeps it fast |
|
||||
|
||||
### NoTableExtraction
|
||||
|
||||
Disable table extraction for better performance when tables aren't needed:
|
||||
|
||||
```python
|
||||
from crawl4ai import NoTableExtraction, CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
|
||||
# Tables won't be extracted, improving performance
|
||||
result = await crawler.arun(url, config)
|
||||
assert len(result.tables) == 0
|
||||
```
|
||||
|
||||
## Extracted Table Structure
|
||||
|
||||
Each extracted table contains:
|
||||
|
||||
```python
|
||||
{
|
||||
"headers": ["Column 1", "Column 2", ...], # Column headers
|
||||
"rows": [ # Data rows
|
||||
["Row 1 Col 1", "Row 1 Col 2", ...],
|
||||
["Row 2 Col 1", "Row 2 Col 2", ...],
|
||||
],
|
||||
"caption": "Table Caption", # If present
|
||||
"summary": "Table Summary", # If present
|
||||
"metadata": {
|
||||
"row_count": 10, # Number of rows
|
||||
"column_count": 3, # Number of columns
|
||||
"has_headers": True, # Headers detected
|
||||
"has_caption": True, # Caption exists
|
||||
"has_summary": False, # Summary exists
|
||||
"id": "data-table-1", # Table ID if present
|
||||
"class": "financial-data" # Table class if present
|
||||
}
|
||||
}
|
||||
```
|
||||
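For example, a quick way to walk this structure once a crawl has finished (`result` is the return value of `arun()`):

```python
for i, table in enumerate(result.tables):
    meta = table["metadata"]
    print(f"Table {i}: {meta['row_count']} rows x {meta['column_count']} columns, "
          f"caption: {table.get('caption') or 'none'}")
    if meta["has_headers"]:
        print("  headers:", table["headers"])
```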
|
||||
## Configuration Options
|
||||
|
||||
### Basic Configuration
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
# Table extraction settings
|
||||
table_score_threshold=7, # Default threshold (backward compatible)
|
||||
table_extraction=strategy, # Optional: custom strategy
|
||||
|
||||
# Filter what to process
|
||||
css_selector="main", # Focus on specific area
|
||||
excluded_tags=["nav", "aside"] # Exclude page sections
|
||||
)
|
||||
```
|
||||
|
||||
### Advanced Configuration
|
||||
|
||||
```python
|
||||
from crawl4ai import DefaultTableExtraction, CrawlerRunConfig
|
||||
|
||||
# Fine-tuned extraction
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5, # Lower = more permissive
|
||||
min_rows=3, # Require at least 3 rows
|
||||
min_cols=2, # Require at least 2 columns
|
||||
verbose=True # Detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy,
|
||||
css_selector="article.content", # Target specific content
|
||||
exclude_domains=["ads.com"], # Exclude ad domains
|
||||
cache_mode=CacheMode.BYPASS # Fresh extraction
|
||||
)
|
||||
```
|
||||
|
||||
## Working with Extracted Tables
|
||||
|
||||
### Convert to Pandas DataFrame
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
async def tables_to_dataframes(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url)
|
||||
|
||||
dataframes = []
|
||||
for table_data in result.tables:
|
||||
# Create DataFrame
|
||||
if table_data['headers']:
|
||||
df = pd.DataFrame(
|
||||
table_data['rows'],
|
||||
columns=table_data['headers']
|
||||
)
|
||||
else:
|
||||
df = pd.DataFrame(table_data['rows'])
|
||||
|
||||
# Add metadata as DataFrame attributes
|
||||
df.attrs['caption'] = table_data.get('caption', '')
|
||||
df.attrs['metadata'] = table_data.get('metadata', {})
|
||||
|
||||
dataframes.append(df)
|
||||
|
||||
return dataframes
|
||||
```
|
||||
|
||||
### Filter Tables by Criteria
|
||||
|
||||
```python
|
||||
async def extract_large_tables(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Configure minimum size requirements
|
||||
strategy = DefaultTableExtraction(
|
||||
min_rows=10,
|
||||
min_cols=3,
|
||||
table_score_threshold=6
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Further filter results
|
||||
large_tables = [
|
||||
table for table in result.tables
|
||||
if table['metadata']['row_count'] > 10
|
||||
and table['metadata']['column_count'] > 3
|
||||
]
|
||||
|
||||
return large_tables
|
||||
```
|
||||
|
||||
### Export Tables to Different Formats
|
||||
|
||||
```python
|
||||
import json
|
||||
import csv
|
||||
|
||||
async def export_tables(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url)
|
||||
|
||||
for i, table in enumerate(result.tables):
|
||||
# Export as JSON
|
||||
with open(f'table_{i}.json', 'w') as f:
|
||||
json.dump(table, f, indent=2)
|
||||
|
||||
# Export as CSV
|
||||
with open(f'table_{i}.csv', 'w', newline='') as f:
|
||||
writer = csv.writer(f)
|
||||
if table['headers']:
|
||||
writer.writerow(table['headers'])
|
||||
writer.writerows(table['rows'])
|
||||
|
||||
# Export as Markdown
|
||||
with open(f'table_{i}.md', 'w') as f:
|
||||
# Write headers
|
||||
if table['headers']:
|
||||
f.write('| ' + ' | '.join(table['headers']) + ' |\n')
|
||||
f.write('|' + '---|' * len(table['headers']) + '\n')
|
||||
|
||||
# Write rows
|
||||
for row in table['rows']:
|
||||
f.write('| ' + ' | '.join(str(cell) for cell in row) + ' |\n')
|
||||
```
|
||||
|
||||
## Creating Custom Strategies
|
||||
|
||||
Extend `TableExtractionStrategy` to create custom extraction logic:
|
||||
|
||||
### Example: Financial Table Extractor
|
||||
|
||||
```python
|
||||
from crawl4ai import TableExtractionStrategy
|
||||
from typing import List, Dict, Any
|
||||
import re
|
||||
|
||||
class FinancialTableExtractor(TableExtractionStrategy):
|
||||
"""Extract tables containing financial data."""
|
||||
|
||||
def __init__(self, currency_symbols=None, require_numbers=True, **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.currency_symbols = currency_symbols or ['$', '€', '£', '¥']
|
||||
self.require_numbers = require_numbers
|
||||
self.number_pattern = re.compile(r'\d+[,.]?\d*')
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check if table contains financial indicators
|
||||
table_text = ''.join(table.itertext())
|
||||
|
||||
# Must contain currency symbols
|
||||
has_currency = any(sym in table_text for sym in self.currency_symbols)
|
||||
if not has_currency:
|
||||
continue
|
||||
|
||||
# Must contain numbers if required
|
||||
if self.require_numbers:
|
||||
numbers = self.number_pattern.findall(table_text)
|
||||
if len(numbers) < 3: # Arbitrary minimum
|
||||
continue
|
||||
|
||||
# Extract the table data
|
||||
table_data = self._extract_financial_data(table)
|
||||
if table_data:
|
||||
tables_data.append(table_data)
|
||||
|
||||
return tables_data
|
||||
|
||||
def _extract_financial_data(self, table):
|
||||
"""Extract and clean financial data from table."""
|
||||
headers = []
|
||||
rows = []
|
||||
|
||||
# Extract headers
|
||||
for th in table.xpath(".//thead//th | .//tr[1]//th"):
|
||||
headers.append(th.text_content().strip())
|
||||
|
||||
# Extract and clean rows
|
||||
for tr in table.xpath(".//tbody//tr | .//tr[position()>1]"):
|
||||
row = []
|
||||
for td in tr.xpath(".//td"):
|
||||
text = td.text_content().strip()
|
||||
# Clean currency formatting
|
||||
text = re.sub(r'[$€£¥,]', '', text)
|
||||
row.append(text)
|
||||
if row:
|
||||
rows.append(row)
|
||||
|
||||
return {
|
||||
"headers": headers,
|
||||
"rows": rows,
|
||||
"caption": self._get_caption(table),
|
||||
"summary": table.get("summary", ""),
|
||||
"metadata": {
|
||||
"type": "financial",
|
||||
"row_count": len(rows),
|
||||
"column_count": len(headers) or len(rows[0]) if rows else 0
|
||||
}
|
||||
}
|
||||
|
||||
def _get_caption(self, table):
|
||||
caption = table.xpath(".//caption/text()")
|
||||
return caption[0].strip() if caption else ""
|
||||
|
||||
# Usage
|
||||
strategy = FinancialTableExtractor(
|
||||
currency_symbols=['$', 'EUR'],
|
||||
require_numbers=True
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Example: Specific Table Extractor
|
||||
|
||||
```python
|
||||
import re

from crawl4ai import TableExtractionStrategy


class SpecificTableExtractor(TableExtractionStrategy):
|
||||
"""Extract only tables matching specific criteria."""
|
||||
|
||||
def __init__(self,
|
||||
required_headers=None,
|
||||
id_pattern=None,
|
||||
class_pattern=None,
|
||||
**kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.required_headers = required_headers or []
|
||||
self.id_pattern = id_pattern
|
||||
self.class_pattern = class_pattern
|
||||
|
||||
def extract_tables(self, element, **kwargs):
|
||||
tables_data = []
|
||||
|
||||
for table in element.xpath(".//table"):
|
||||
# Check ID pattern
|
||||
if self.id_pattern:
|
||||
table_id = table.get('id', '')
|
||||
if not re.match(self.id_pattern, table_id):
|
||||
continue
|
||||
|
||||
# Check class pattern
|
||||
if self.class_pattern:
|
||||
table_class = table.get('class', '')
|
||||
if not re.match(self.class_pattern, table_class):
|
||||
continue
|
||||
|
||||
# Extract headers to check requirements
|
||||
headers = self._extract_headers(table)  # helper to implement: pull header cell text, as in the extractor above
|
||||
|
||||
# Check if required headers are present
|
||||
if self.required_headers:
|
||||
if not all(req in headers for req in self.required_headers):
|
||||
continue
|
||||
|
||||
# Extract full table data
|
||||
table_data = self._extract_table_data(table, headers)  # helper to implement: extract rows plus metadata
|
||||
tables_data.append(table_data)
|
||||
|
||||
return tables_data
|
||||
```
|
||||
|
||||
## Combining with Other Strategies
|
||||
|
||||
Table extraction works seamlessly with other Crawl4AI strategies:
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
CrawlerRunConfig,
|
||||
DefaultTableExtraction,
|
||||
LLMExtractionStrategy,
|
||||
JsonCssExtractionStrategy
|
||||
)
|
||||
|
||||
async def combined_extraction(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
# Table extraction
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
min_rows=2
|
||||
),
|
||||
|
||||
# CSS-based extraction for specific elements
|
||||
extraction_strategy=JsonCssExtractionStrategy({
|
||||
"title": "h1",
|
||||
"summary": "p.summary",
|
||||
"date": "time"
|
||||
}),
|
||||
|
||||
# Focus on main content
|
||||
css_selector="main.content"
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Access different extraction results
|
||||
tables = result.tables # Table data
|
||||
structured = json.loads(result.extracted_content) # CSS extraction
|
||||
|
||||
return {
|
||||
"tables": tables,
|
||||
"structured_data": structured,
|
||||
"markdown": result.markdown
|
||||
}
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Optimization Tips
|
||||
|
||||
1. **Disable when not needed**: Use `NoTableExtraction` if tables aren't required
|
||||
2. **Target specific areas**: Use `css_selector` to limit processing scope
|
||||
3. **Set minimum thresholds**: Filter out small/irrelevant tables early
|
||||
4. **Cache results**: Use appropriate cache modes for repeated extractions
|
||||
|
||||
```python
|
||||
# Optimized configuration for large pages
|
||||
config = CrawlerRunConfig(
|
||||
# Only process main content area
|
||||
css_selector="article.main-content",
|
||||
|
||||
# Exclude navigation and sidebars
|
||||
excluded_tags=["nav", "aside", "footer"],
|
||||
|
||||
# Higher threshold for stricter filtering
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=8,
|
||||
min_rows=5,
|
||||
min_cols=3
|
||||
),
|
||||
|
||||
# Enable caching for repeated access
|
||||
cache_mode=CacheMode.ENABLED
|
||||
)
|
||||
```
|
||||
|
||||
## Migration Guide
|
||||
|
||||
### Important: Your Code Still Works!
|
||||
|
||||
**No changes required!** The transition to the strategy pattern is **fully backward compatible**.
|
||||
|
||||
### How It Works Internally
|
||||
|
||||
#### v0.7.2 and Earlier
|
||||
```python
|
||||
# Old way - directly passing table_score_threshold
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
# Internally: No strategy pattern, direct implementation
|
||||
```
|
||||
|
||||
#### v0.7.3+ (Current)
|
||||
```python
|
||||
# Old way STILL WORKS - we handle it internally
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
# Internally: Automatically creates DefaultTableExtraction(table_score_threshold=7)
|
||||
```
|
||||
|
||||
### Taking Advantage of New Features
|
||||
|
||||
While your old code works, you can now use the strategy pattern for more control:
|
||||
|
||||
```python
|
||||
# Option 1: Keep using the old way (perfectly fine!)
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7 # Still supported
|
||||
)
|
||||
|
||||
# Option 2: Use the new strategy pattern (more flexibility)
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
min_rows=2, # New capability!
|
||||
min_cols=2 # New capability!
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
|
||||
# Option 3: Use advanced strategies when needed
|
||||
from crawl4ai import LLMTableExtraction, LLMConfig
|
||||
|
||||
# Only for complex tables that DefaultTableExtraction can't handle
|
||||
# Automatically handles large tables with smart chunking
|
||||
llm_strategy = LLMTableExtraction(
|
||||
llm_config=LLMConfig(
|
||||
provider="groq/llama-3.3-70b-versatile",
|
||||
api_token="your_key"
|
||||
),
|
||||
max_tries=3,
|
||||
enable_chunking=True, # Automatically chunk large tables
|
||||
chunk_token_threshold=3000, # Chunk when exceeding 3000 tokens
|
||||
max_parallel_chunks=5 # Process up to 5 chunks in parallel
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=llm_strategy # Advanced extraction with automatic chunking
|
||||
)
|
||||
```
|
||||
|
||||
### Summary
|
||||
|
||||
- ✅ **No breaking changes** - Old code works as-is
|
||||
- ✅ **Same defaults** - DefaultTableExtraction is automatically used
|
||||
- ✅ **Gradual adoption** - Use new features when you need them
|
||||
- ✅ **Full compatibility** - result.tables structure unchanged
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Choose the Right Strategy (Cost-Conscious Approach)
|
||||
|
||||
**Decision Flow**:
|
||||
```
|
||||
1. Do you need tables?
|
||||
→ No: Use NoTableExtraction
|
||||
→ Yes: Continue to #2
|
||||
|
||||
2. Try DefaultTableExtraction first (FREE)
|
||||
→ Works? Done! ✅
|
||||
→ Fails? Continue to #3
|
||||
|
||||
3. Is the table critical and complex?
|
||||
→ No: Accept DefaultTableExtraction results
|
||||
→ Yes: Continue to #4
|
||||
|
||||
4. Use LLMTableExtraction (COSTS MONEY)
|
||||
→ Small table (<50 rows): Any LLM provider
|
||||
→ Large table (50+ rows): Use Groq or Cerebras
|
||||
→ Very large (500+ rows): Reconsider - maybe chunk the page
|
||||
```
|
||||
|
||||
**Strategy Selection Guide**:
|
||||
- **DefaultTableExtraction**: Use for 99% of cases - it's free and effective
|
||||
- **LLMTableExtraction**: Only for complex tables with merged cells that break DefaultTableExtraction
|
||||
- **NoTableExtraction**: When you only need text/markdown content
|
||||
- **Custom Strategy**: For specialized requirements (financial, scientific, etc.)
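
A compact way to encode the decision flow above is a small helper that returns the cheapest strategy for the situation. This is only an illustrative sketch — `choose_table_strategy` is a hypothetical helper, not part of the Crawl4AI API — and the provider and API token values are placeholders.

```python
from crawl4ai import (
    CrawlerRunConfig,
    DefaultTableExtraction,
    NoTableExtraction,
    LLMTableExtraction,
    LLMConfig,
)

def choose_table_strategy(need_tables: bool, default_failed: bool = False):
    """Pick the cheapest strategy that satisfies the decision flow above (sketch)."""
    if not need_tables:
        return NoTableExtraction()                 # skip table work entirely
    if not default_failed:
        return DefaultTableExtraction(             # free and effective for most tables
            table_score_threshold=7,
        )
    # Only fall back to the paid LLM path for complex/merged-cell tables
    return LLMTableExtraction(
        llm_config=LLMConfig(
            provider="groq/llama-3.3-70b-versatile",  # placeholder provider
            api_token="your_key",                      # placeholder token
        ),
        enable_chunking=True,                      # chunk large tables automatically
    )

config = CrawlerRunConfig(table_extraction=choose_table_strategy(need_tables=True))
```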
|
||||
|
||||
### 2. Validate Extracted Data
|
||||
|
||||
```python
|
||||
def validate_table(table):
|
||||
"""Validate table data quality."""
|
||||
# Check structure
|
||||
if not table.get('rows'):
|
||||
return False
|
||||
|
||||
# Check consistency
|
||||
if table.get('headers'):
|
||||
expected_cols = len(table['headers'])
|
||||
for row in table['rows']:
|
||||
if len(row) != expected_cols:
|
||||
return False
|
||||
|
||||
# Check minimum content
|
||||
total_cells = sum(len(row) for row in table['rows'])
|
||||
non_empty = sum(1 for row in table['rows']
|
||||
for cell in row if cell.strip())
|
||||
|
||||
if total_cells == 0 or non_empty / total_cells < 0.5:  # Less than 50% non-empty
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
# Filter valid tables
|
||||
valid_tables = [t for t in result.tables if validate_table(t)]
|
||||
```
|
||||
|
||||
### 3. Handle Edge Cases
|
||||
|
||||
```python
|
||||
async def robust_table_extraction(url):
|
||||
"""Extract tables with error handling."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
try:
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=6,
|
||||
verbose=True
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
if not result.success:
|
||||
print(f"Crawl failed: {result.error}")
|
||||
return []
|
||||
|
||||
# Process tables safely
|
||||
processed_tables = []
|
||||
for table in result.tables:
|
||||
try:
|
||||
# Validate and process
|
||||
if validate_table(table):
|
||||
processed_tables.append(table)
|
||||
except Exception as e:
|
||||
print(f"Error processing table: {e}")
|
||||
continue
|
||||
|
||||
return processed_tables
|
||||
|
||||
except Exception as e:
|
||||
print(f"Extraction error: {e}")
|
||||
return []
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues and Solutions
|
||||
|
||||
| Issue | Cause | Solution |
|
||||
|-------|-------|----------|
|
||||
| No tables extracted | Score too high | Lower `table_score_threshold` |
|
||||
| Layout tables included | Score too low | Increase `table_score_threshold` |
|
||||
| Missing tables | CSS selector too specific | Broaden or remove `css_selector` |
|
||||
| Incomplete data | Complex table structure | Create custom strategy |
|
||||
| Performance issues | Processing entire page | Use `css_selector` to limit scope |
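
For instance, when no tables come back at all, the first and third rows of the table usually apply together: relax the score threshold and make sure an over-narrow `css_selector` is not hiding the tables. The values below are an illustrative debugging configuration, not recommended defaults.

```python
from crawl4ai import CrawlerRunConfig, DefaultTableExtraction

# Hypothetical "loosened" configuration for diagnosing missing tables
config = CrawlerRunConfig(
    table_extraction=DefaultTableExtraction(
        table_score_threshold=4,   # lower than usual to admit borderline tables
        min_rows=1,
        min_cols=1,
    ),
    # css_selector deliberately omitted so the whole page is scanned
)
```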
|
||||
|
||||
### Debug Logging
|
||||
|
||||
Enable verbose logging to understand extraction decisions:
|
||||
|
||||
```python
|
||||
import logging
|
||||
|
||||
# Configure logging
|
||||
logging.basicConfig(level=logging.DEBUG)
|
||||
|
||||
# Enable verbose mode in strategy
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
verbose=True # Detailed extraction logs
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy,
|
||||
verbose=True # General crawler logs
|
||||
)
|
||||
```
|
||||
|
||||
## See Also
|
||||
|
||||
- [Extraction Strategies](extraction-strategies.md) - Overview of all extraction strategies
|
||||
- [Content Selection](content-selection.md) - Using CSS selectors and filters
|
||||
- [Performance Optimization](../optimization/performance-tuning.md) - Speed up extraction
|
||||
- [Examples](../examples/table_extraction_example.py) - Complete working examples
|
||||
@@ -102,16 +102,16 @@ async def smart_blog_crawler():
|
||||
|
||||
# Step 2: Configure discovery - let's find all blog posts
|
||||
config = SeedingConfig(
|
||||
source="sitemap+cc", # Use the website's sitemap+cc
|
||||
pattern="*/courses/*", # Only courses related posts
|
||||
source="sitemap", # Use the website's sitemap
|
||||
pattern="*/blog/*.html", # Only blog posts
|
||||
extract_head=True, # Get page metadata
|
||||
max_urls=100 # Limit for this example
|
||||
)
|
||||
|
||||
# Step 3: Discover URLs from the Python blog
|
||||
print("🔍 Discovering course posts...")
|
||||
print("🔍 Discovering blog posts...")
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
print(f"✅ Found {len(urls)} course posts")
|
||||
print(f"✅ Found {len(urls)} blog posts")
|
||||
|
||||
# Step 4: Filter for Python tutorials (using metadata!)
|
||||
tutorials = [
|
||||
@@ -134,8 +134,7 @@ async def smart_blog_crawler():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(
|
||||
only_text=True,
|
||||
word_count_threshold=300, # Only substantial articles
|
||||
stream=True
|
||||
word_count_threshold=300 # Only substantial articles
|
||||
)
|
||||
|
||||
# Extract URLs and crawl them
|
||||
@@ -156,7 +155,7 @@ asyncio.run(smart_blog_crawler())
|
||||
|
||||
**What just happened?**
|
||||
|
||||
1. We discovered all blog URLs from the sitemap+cc
|
||||
1. We discovered all blog URLs from the sitemap
|
||||
2. We filtered using metadata (no crawling needed!)
|
||||
3. We crawled only the relevant tutorials
|
||||
4. We saved tons of time and bandwidth
|
||||
@@ -283,8 +282,8 @@ config = SeedingConfig(
|
||||
live_check=True, # Verify each URL is accessible
|
||||
concurrency=20 # Check 20 URLs in parallel
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# Now you can filter by status
|
||||
live_urls = [u for u in urls if u["status"] == "valid"]
|
||||
@@ -312,8 +311,8 @@ This is where URL seeding gets really powerful. Instead of crawling entire pages
|
||||
config = SeedingConfig(
|
||||
extract_head=True # Extract metadata from <head> section
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# Now each URL has rich metadata
|
||||
for url in urls[:3]:
|
||||
@@ -388,8 +387,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25",
|
||||
score_threshold=0.3
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
urls = await seeder.urls("example.com", config)
|
||||
|
||||
# URLs are scored based on:
|
||||
# 1. Domain parts matching (e.g., 'python' in python.example.com)
|
||||
@@ -430,8 +429,8 @@ config = SeedingConfig(
|
||||
extract_head=True,
|
||||
live_check=True
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("blog.example.com", config)
|
||||
|
||||
urls = await seeder.urls("blog.example.com", config)
|
||||
|
||||
# Analyze the results
|
||||
for url in urls[:5]:
|
||||
@@ -489,8 +488,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25", # Use BM25 algorithm
|
||||
score_threshold=0.3 # Minimum relevance score
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
|
||||
urls = await seeder.urls("realpython.com", config)
|
||||
|
||||
# Results are automatically sorted by relevance!
|
||||
for url in urls[:5]:
|
||||
@@ -512,8 +511,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.5,
|
||||
max_urls=20
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("docs.example.com", config)
|
||||
|
||||
urls = await seeder.urls("docs.example.com", config)
|
||||
|
||||
# The highest scoring URLs will be API docs!
|
||||
```
|
||||
@@ -530,8 +529,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.4,
|
||||
pattern="*/product/*" # Combine with pattern matching
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("shop.example.com", config)
|
||||
|
||||
urls = await seeder.urls("shop.example.com", config)
|
||||
|
||||
# Filter further by price (from metadata)
|
||||
affordable = [
|
||||
@@ -551,8 +550,8 @@ config = SeedingConfig(
|
||||
scoring_method="bm25",
|
||||
score_threshold=0.35
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("technews.com", config)
|
||||
|
||||
urls = await seeder.urls("technews.com", config)
|
||||
|
||||
# Filter by date
|
||||
from datetime import datetime, timedelta
|
||||
@@ -592,8 +591,8 @@ for query in queries:
|
||||
score_threshold=0.4,
|
||||
max_urls=10 # Top 10 per topic
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
urls = await seeder.urls("learning-platform.com", config)
|
||||
|
||||
urls = await seeder.urls("learning-platform.com", config)
|
||||
all_tutorials.extend(urls)
|
||||
|
||||
# Remove duplicates while preserving order
|
||||
@@ -626,8 +625,7 @@ config = SeedingConfig(
|
||||
)
|
||||
|
||||
# Returns a dictionary: {domain: [urls]}
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(domains, config)
|
||||
results = await seeder.many_urls(domains, config)
|
||||
|
||||
# Process results
|
||||
for domain, urls in results.items():
|
||||
@@ -656,8 +654,8 @@ config = SeedingConfig(
|
||||
pattern="*/blog/*",
|
||||
max_urls=100
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(competitors, config)
|
||||
|
||||
results = await seeder.many_urls(competitors, config)
|
||||
|
||||
# Analyze content types
|
||||
for domain, urls in results.items():
|
||||
@@ -692,8 +690,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.3,
|
||||
max_urls=20 # Per site
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(educational_sites, config)
|
||||
|
||||
results = await seeder.many_urls(educational_sites, config)
|
||||
|
||||
# Find the best beginner tutorials
|
||||
all_tutorials = []
|
||||
@@ -733,8 +731,8 @@ config = SeedingConfig(
|
||||
score_threshold=0.5, # High threshold for relevance
|
||||
max_urls=10
|
||||
)
|
||||
async with AsyncUrlSeeder() as seeder:
|
||||
results = await seeder.many_urls(news_sites, config)
|
||||
|
||||
results = await seeder.many_urls(news_sites, config)
|
||||
|
||||
# Collect all mentions
|
||||
mentions = []
|
||||
|
||||
@@ -1,376 +0,0 @@
|
||||
# Migration Guide: Table Extraction v0.7.3
|
||||
|
||||
## Overview
|
||||
|
||||
Version 0.7.3 introduces the **Table Extraction Strategy Pattern**, providing a more flexible and extensible approach to table extraction while maintaining full backward compatibility.
|
||||
|
||||
## What's New
|
||||
|
||||
### Strategy Pattern Implementation
|
||||
|
||||
Table extraction now follows the same strategy pattern used throughout Crawl4AI:
|
||||
|
||||
- **Consistent Architecture**: Aligns with extraction, chunking, and markdown strategies
|
||||
- **Extensibility**: Easy to create custom table extraction strategies
|
||||
- **Better Separation**: Table logic moved from content scraping to dedicated module
|
||||
- **Full Control**: Fine-grained control over table detection and extraction
|
||||
|
||||
### New Classes
|
||||
|
||||
```python
|
||||
from crawl4ai import (
|
||||
TableExtractionStrategy, # Abstract base class
|
||||
DefaultTableExtraction, # Current implementation (default)
|
||||
NoTableExtraction # Explicitly disable extraction
|
||||
)
|
||||
```
|
||||
|
||||
## Backward Compatibility
|
||||
|
||||
**✅ All existing code continues to work without changes.**
|
||||
|
||||
### No Changes Required
|
||||
|
||||
If your code looks like this, it will continue to work:
|
||||
|
||||
```python
|
||||
# This still works exactly the same
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
tables = result.tables # Same structure, same data
|
||||
```
|
||||
|
||||
### What Happens Behind the Scenes
|
||||
|
||||
When you don't specify a `table_extraction` strategy:
|
||||
|
||||
1. `CrawlerRunConfig` automatically creates `DefaultTableExtraction`
|
||||
2. It uses your `table_score_threshold` parameter
|
||||
3. Tables are extracted exactly as before
|
||||
4. Results appear in `result.tables` with the same structure
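
In other words, the two configurations below are treated as roughly equivalent — a simplified sketch of the internal mapping, not the exact implementation:

```python
from crawl4ai import CrawlerRunConfig, DefaultTableExtraction

# What you write (legacy style)
config_legacy = CrawlerRunConfig(table_score_threshold=7)

# What the config effectively becomes internally
config_equivalent = CrawlerRunConfig(
    table_extraction=DefaultTableExtraction(table_score_threshold=7)
)
```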
|
||||
|
||||
## New Capabilities
|
||||
|
||||
### 1. Explicit Strategy Configuration
|
||||
|
||||
You can now explicitly configure table extraction:
|
||||
|
||||
```python
|
||||
# New: Explicit control
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7,
|
||||
min_rows=2, # New: minimum row filter
|
||||
min_cols=2, # New: minimum column filter
|
||||
verbose=True # New: detailed logging
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Disable Table Extraction
|
||||
|
||||
Improve performance when tables aren't needed:
|
||||
|
||||
```python
|
||||
# New: Skip table extraction entirely
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
# No CPU cycles spent on table detection/extraction
|
||||
```
|
||||
|
||||
### 3. Custom Extraction Strategies
|
||||
|
||||
Create specialized extractors:
|
||||
|
||||
```python
|
||||
class MyTableExtractor(TableExtractionStrategy):
|
||||
def extract_tables(self, element, **kwargs):
|
||||
# Custom extraction logic
|
||||
return custom_tables
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=MyTableExtractor()
|
||||
)
|
||||
```
|
||||
|
||||
## Migration Scenarios
|
||||
|
||||
### Scenario 1: Basic Usage (No Changes Needed)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url, config)
|
||||
for table in result.tables:
|
||||
print(table['headers'])
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Exactly the same - no changes required
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url, config)
|
||||
for table in result.tables:
|
||||
print(table['headers'])
|
||||
```
|
||||
|
||||
### Scenario 2: Custom Threshold (No Changes Needed)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Still works the same
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
|
||||
# Or use new explicit approach for more control
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5,
|
||||
min_rows=2 # Additional filtering
|
||||
)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Scenario 3: Advanced Filtering (New Feature)
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```python
|
||||
# Had to filter after extraction
|
||||
config = CrawlerRunConfig(
|
||||
table_score_threshold=5
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
|
||||
# Manual filtering
|
||||
large_tables = [
|
||||
t for t in result.tables
|
||||
if len(t['rows']) >= 5 and len(t['headers']) >= 3
|
||||
]
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```python
|
||||
# Filter during extraction (more efficient)
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=5,
|
||||
min_rows=5,
|
||||
min_cols=3
|
||||
)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=strategy
|
||||
)
|
||||
result = await crawler.arun(url, config)
|
||||
# result.tables already filtered
|
||||
```
|
||||
|
||||
## Code Organization Changes
|
||||
|
||||
### Module Structure
|
||||
|
||||
**Before (v0.7.2):**
|
||||
```
|
||||
crawl4ai/
|
||||
content_scraping_strategy.py
|
||||
- LXMLWebScrapingStrategy
|
||||
- is_data_table() # Table detection
|
||||
- extract_table_data() # Table extraction
|
||||
```
|
||||
|
||||
**After (v0.7.3):**
|
||||
```
|
||||
crawl4ai/
|
||||
content_scraping_strategy.py
|
||||
- LXMLWebScrapingStrategy
|
||||
# Table methods removed, uses strategy
|
||||
|
||||
table_extraction.py (NEW)
|
||||
- TableExtractionStrategy # Base class
|
||||
- DefaultTableExtraction # Moved logic here
|
||||
- NoTableExtraction # New option
|
||||
```
|
||||
|
||||
### Import Changes
|
||||
|
||||
**New imports available (optional):**
|
||||
```python
|
||||
# These are now available but not required for existing code
|
||||
from crawl4ai import (
|
||||
TableExtractionStrategy,
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Implications
|
||||
|
||||
### No Performance Impact
|
||||
|
||||
For existing code, performance remains identical:
|
||||
- Same extraction logic
|
||||
- Same scoring algorithm
|
||||
- Same processing time
|
||||
|
||||
### Performance Improvements Available
|
||||
|
||||
New options for better performance:
|
||||
|
||||
```python
|
||||
# Skip tables entirely (faster)
|
||||
config = CrawlerRunConfig(
|
||||
table_extraction=NoTableExtraction()
|
||||
)
|
||||
|
||||
# Process only specific areas (faster)
|
||||
config = CrawlerRunConfig(
|
||||
css_selector="main.content",
|
||||
table_extraction=DefaultTableExtraction(
|
||||
min_rows=5, # Skip small tables
|
||||
min_cols=3
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Testing Your Migration
|
||||
|
||||
### Verification Script
|
||||
|
||||
Run this to verify your extraction still works:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
async def verify_extraction():
|
||||
url = "your_url_here"
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Test 1: Old approach
|
||||
config_old = CrawlerRunConfig(
|
||||
table_score_threshold=7
|
||||
)
|
||||
result_old = await crawler.arun(url, config_old)
|
||||
|
||||
# Test 2: New explicit approach
|
||||
from crawl4ai import DefaultTableExtraction
|
||||
config_new = CrawlerRunConfig(
|
||||
table_extraction=DefaultTableExtraction(
|
||||
table_score_threshold=7
|
||||
)
|
||||
)
|
||||
result_new = await crawler.arun(url, config_new)
|
||||
|
||||
# Compare results
|
||||
assert len(result_old.tables) == len(result_new.tables)
|
||||
print(f"✓ Both approaches extracted {len(result_old.tables)} tables")
|
||||
|
||||
# Verify structure
|
||||
for old, new in zip(result_old.tables, result_new.tables):
|
||||
assert old['headers'] == new['headers']
|
||||
assert old['rows'] == new['rows']
|
||||
|
||||
print("✓ Table content identical")
|
||||
|
||||
asyncio.run(verify_extraction())
|
||||
```
|
||||
|
||||
## Deprecation Notes
|
||||
|
||||
### No Deprecations
|
||||
|
||||
- All existing parameters continue to work
|
||||
- `table_score_threshold` in `CrawlerRunConfig` is still supported
|
||||
- No breaking changes
|
||||
|
||||
### Internal Changes (Transparent to Users)
|
||||
|
||||
- `LXMLWebScrapingStrategy.is_data_table()` - Moved to `DefaultTableExtraction`
|
||||
- `LXMLWebScrapingStrategy.extract_table_data()` - Moved to `DefaultTableExtraction`
|
||||
|
||||
These methods were internal and not part of the public API.
|
||||
|
||||
## Benefits of Upgrading
|
||||
|
||||
While not required, using the new pattern provides:
|
||||
|
||||
1. **Better Control**: Filter tables during extraction, not after
|
||||
2. **Performance Options**: Skip extraction when not needed
|
||||
3. **Extensibility**: Create custom extractors for specific needs
|
||||
4. **Consistency**: Same pattern as other Crawl4AI strategies
|
||||
5. **Future-Proof**: Ready for upcoming advanced strategies
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue: Different Number of Tables
|
||||
|
||||
**Cause**: Threshold or filtering differences
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Ensure same threshold
|
||||
strategy = DefaultTableExtraction(
|
||||
table_score_threshold=7, # Match your old setting
|
||||
min_rows=0, # No filtering (default)
|
||||
min_cols=0 # No filtering (default)
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Import Errors
|
||||
|
||||
**Cause**: Using new classes without importing
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
# Add imports if using new features
|
||||
from crawl4ai import (
|
||||
DefaultTableExtraction,
|
||||
NoTableExtraction,
|
||||
TableExtractionStrategy
|
||||
)
|
||||
```
|
||||
|
||||
### Issue: Custom Strategy Not Working
|
||||
|
||||
**Cause**: Incorrect method signature
|
||||
|
||||
**Solution**:
|
||||
```python
|
||||
class CustomExtractor(TableExtractionStrategy):
|
||||
def extract_tables(self, element, **kwargs): # Correct signature
|
||||
# Not: extract_tables(self, html)
|
||||
# Not: extract(self, element)
|
||||
return tables_list
|
||||
```
|
||||
|
||||
## Getting Help
|
||||
|
||||
If you encounter issues:
|
||||
|
||||
1. Check your `table_score_threshold` matches previous settings
|
||||
2. Verify imports if using new classes
|
||||
3. Enable verbose logging: `DefaultTableExtraction(verbose=True)`
|
||||
4. Review the [Table Extraction Documentation](../core/table_extraction.md)
|
||||
5. Check [examples](../examples/table_extraction_example.py)
|
||||
|
||||
## Summary
|
||||
|
||||
- ✅ **Full backward compatibility** - No code changes required
|
||||
- ✅ **Same results** - Identical extraction behavior by default
|
||||
- ✅ **New options** - Additional control when needed
|
||||
- ✅ **Better architecture** - Consistent with Crawl4AI patterns
|
||||
- ✅ **Ready for future** - Foundation for advanced strategies
|
||||
|
||||
The migration to v0.7.3 is seamless with no required changes while providing new capabilities for those who need them.
|
||||
@@ -91,17 +91,6 @@ async def test_css_selector_extraction():
|
||||
assert result.markdown
|
||||
assert all(heading in result.markdown for heading in ["#", "##", "###"])
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_base_tag_link_extraction():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://sohamkukreti.github.io/portfolio"
|
||||
result = await crawler.arun(url=url)
|
||||
assert result.success
|
||||
assert result.links
|
||||
assert isinstance(result.links, dict)
|
||||
assert "internal" in result.links
|
||||
assert "external" in result.links
|
||||
assert any("github.com" in x["href"] for x in result.links["external"])
|
||||
|
||||
# Entry point for debugging
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -10,13 +10,11 @@ import sys
|
||||
import uuid
|
||||
import shutil
|
||||
|
||||
from crawl4ai import BrowserProfiler
|
||||
from crawl4ai.browser_manager import BrowserManager
|
||||
|
||||
# Add the project root to Python path if running directly
|
||||
if __name__ == "__main__":
|
||||
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../..')))
|
||||
|
||||
from crawl4ai.browser import BrowserManager, BrowserProfileManager
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
@@ -27,7 +25,7 @@ async def test_profile_creation():
|
||||
"""Test creating and managing browser profiles."""
|
||||
logger.info("Testing profile creation and management", tag="TEST")
|
||||
|
||||
profile_manager = BrowserProfiler(logger=logger)
|
||||
profile_manager = BrowserProfileManager(logger=logger)
|
||||
|
||||
try:
|
||||
# List existing profiles
|
||||
@@ -85,7 +83,7 @@ async def test_profile_with_browser():
|
||||
"""Test using a profile with a browser."""
|
||||
logger.info("Testing using a profile with a browser", tag="TEST")
|
||||
|
||||
profile_manager = BrowserProfiler(logger=logger)
|
||||
profile_manager = BrowserProfileManager(logger=logger)
|
||||
test_profile_name = f"test-browser-profile-{uuid.uuid4().hex[:8]}"
|
||||
profile_path = None
|
||||
|
||||
@@ -103,8 +101,6 @@ async def test_profile_with_browser():
|
||||
# Now use this profile with a browser
|
||||
browser_config = BrowserConfig(
|
||||
user_data_dir=profile_path,
|
||||
use_managed_browser=True,
|
||||
use_persistent_context=True,
|
||||
headless=True
|
||||
)
|
||||
|
||||
|
||||
@@ -168,7 +168,7 @@ class SimpleApiTester:
|
||||
print("\n=== CORE APIs ===")
|
||||
|
||||
test_url = "https://example.com"
|
||||
test_raw_html_url = "raw://<html><body><h1>Hello, World!</h1></body></html>"
|
||||
|
||||
# Test markdown endpoint
|
||||
md_payload = {
|
||||
"url": test_url,
|
||||
@@ -180,17 +180,6 @@ class SimpleApiTester:
|
||||
# print(result['data'].get('markdown', ''))
|
||||
self.print_result(result)
|
||||
|
||||
# Test markdown endpoint with raw HTML
|
||||
raw_md_payload = {
|
||||
"url": test_raw_html_url,
|
||||
"f": "fit",
|
||||
"q": "test query",
|
||||
"c": "0"
|
||||
}
|
||||
result = self.test_post_endpoint("/md", raw_md_payload)
|
||||
self.print_result(result)
|
||||
|
||||
|
||||
# Test HTML endpoint
|
||||
html_payload = {"url": test_url}
|
||||
result = self.test_post_endpoint("/html", html_payload)
|
||||
@@ -226,15 +215,6 @@ class SimpleApiTester:
|
||||
result = self.test_post_endpoint("/crawl", crawl_payload)
|
||||
self.print_result(result)
|
||||
|
||||
# Test crawl endpoint with raw HTML
|
||||
crawl_payload = {
|
||||
"urls": [test_raw_html_url],
|
||||
"browser_config": {},
|
||||
"crawler_config": {}
|
||||
}
|
||||
result = self.test_post_endpoint("/crawl", crawl_payload)
|
||||
self.print_result(result)
|
||||
|
||||
# Test config dump
|
||||
config_payload = {"code": "CrawlerRunConfig()"}
|
||||
result = self.test_post_endpoint("/config/dump", config_payload)
|
||||
|
||||
@@ -74,7 +74,7 @@ async def test_direct_api():
|
||||
# Make direct API call
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
"http://localhost:8000/crawl",
|
||||
json=request_data,
|
||||
timeout=300
|
||||
)
|
||||
@@ -100,24 +100,13 @@ async def test_direct_api():
|
||||
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
"http://localhost:8000/crawl",
|
||||
json=request_data
|
||||
)
|
||||
assert response.status_code == 200
|
||||
result = response.json()
|
||||
print("Structured extraction result:", result["success"])
|
||||
|
||||
# Test 3: Raw HTML
|
||||
request_data["urls"] = ["raw://<html><body><h1>Hello, World!</h1><a href='https://example.com'>Example</a></body></html>"]
|
||||
async with httpx.AsyncClient() as client:
|
||||
response = await client.post(
|
||||
"http://localhost:11235/crawl",
|
||||
json=request_data
|
||||
)
|
||||
assert response.status_code == 200
|
||||
result = response.json()
|
||||
print("Raw HTML result:", result["success"])
|
||||
|
||||
# Test 3: Get schema
|
||||
# async with httpx.AsyncClient() as client:
|
||||
# response = await client.get("http://localhost:8000/schema")
|
||||
@@ -129,7 +118,7 @@ async def test_with_client():
|
||||
"""Test using the Crawl4AI Docker client SDK"""
|
||||
print("\n=== Testing Client SDK ===")
|
||||
|
||||
async with Crawl4aiDockerClient(base_url="http://localhost:11235", verbose=True) as client:
|
||||
async with Crawl4aiDockerClient(verbose=True) as client:
|
||||
# Test 1: Basic crawl
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
|
||||
@@ -6,22 +6,28 @@ import base64
|
||||
import os
|
||||
from typing import Dict, Any
|
||||
|
||||
class Crawl4AiTester:
|
||||
def __init__(self, base_url: str = "http://localhost:11235"):
|
||||
self.base_url = base_url
|
||||
|
||||
class Crawl4AiTester:
|
||||
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
|
||||
self.base_url = base_url
|
||||
self.api_token = api_token or os.getenv(
|
||||
"CRAWL4AI_API_TOKEN"
|
||||
) # Check environment variable as fallback
|
||||
self.headers = (
|
||||
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
|
||||
)
|
||||
|
||||
def submit_and_wait(
|
||||
self, request_data: Dict[str, Any], timeout: int = 300
|
||||
) -> Dict[str, Any]:
|
||||
# Submit crawl job using async endpoint
|
||||
# Submit crawl job
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl/job", json=request_data
|
||||
f"{self.base_url}/crawl", json=request_data, headers=self.headers
|
||||
)
|
||||
response.raise_for_status()
|
||||
job_response = response.json()
|
||||
task_id = job_response["task_id"]
|
||||
print(f"Submitted job with task_id: {task_id}")
|
||||
if response.status_code == 403:
|
||||
raise Exception("API token is invalid or missing")
|
||||
task_id = response.json()["task_id"]
|
||||
print(f"Task ID: {task_id}")
|
||||
|
||||
# Poll for result
|
||||
start_time = time.time()
|
||||
@@ -32,9 +38,8 @@ class Crawl4AiTester:
|
||||
)
|
||||
|
||||
result = requests.get(
|
||||
f"{self.base_url}/crawl/job/{task_id}"
|
||||
f"{self.base_url}/task/{task_id}", headers=self.headers
|
||||
)
|
||||
result.raise_for_status()
|
||||
status = result.json()
|
||||
|
||||
if status["status"] == "failed":
|
||||
@@ -47,10 +52,10 @@ class Crawl4AiTester:
|
||||
time.sleep(2)
|
||||
|
||||
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
# Use synchronous crawl endpoint
|
||||
response = requests.post(
|
||||
f"{self.base_url}/crawl",
|
||||
f"{self.base_url}/crawl_sync",
|
||||
json=request_data,
|
||||
headers=self.headers,
|
||||
timeout=60,
|
||||
)
|
||||
if response.status_code == 408:
|
||||
@@ -61,8 +66,9 @@ class Crawl4AiTester:
|
||||
|
||||
def test_docker_deployment(version="basic"):
|
||||
tester = Crawl4AiTester(
|
||||
base_url="http://localhost:11235",
|
||||
#base_url="https://crawl4ai-sby74.ondigitalocean.app",
|
||||
# base_url="http://localhost:11235" ,
|
||||
base_url="https://crawl4ai-sby74.ondigitalocean.app",
|
||||
api_token="test",
|
||||
)
|
||||
print(f"Testing Crawl4AI Docker {version} version")
|
||||
|
||||
@@ -82,60 +88,63 @@ def test_docker_deployment(version="basic"):
|
||||
|
||||
# Test cases based on version
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl_sync(tester)
|
||||
|
||||
if version in ["full", "transformer"]:
|
||||
test_cosine_extraction(tester)
|
||||
# if version in ["full", "transformer"]:
|
||||
# test_cosine_extraction(tester)
|
||||
|
||||
test_js_execution(tester)
|
||||
test_css_selector(tester)
|
||||
test_structured_extraction(tester)
|
||||
test_llm_extraction(tester)
|
||||
test_llm_with_ollama(tester)
|
||||
test_screenshot(tester)
|
||||
# test_js_execution(tester)
|
||||
# test_css_selector(tester)
|
||||
# test_structured_extraction(tester)
|
||||
# test_llm_extraction(tester)
|
||||
# test_llm_with_ollama(tester)
|
||||
# test_screenshot(tester)
|
||||
|
||||
|
||||
def test_basic_crawl(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Async) ===")
|
||||
print("\n=== Testing Basic Crawl ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"Basic crawl result count: {len(result['result']['results'])}")
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["results"]) > 0
|
||||
assert len(result["result"]["results"][0]["markdown"]) > 0
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_basic_crawl_sync(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Basic Crawl (Sync) ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"priority": 10,
|
||||
"session_id": "test",
|
||||
}
|
||||
|
||||
result = tester.submit_sync(request)
|
||||
print(f"Basic crawl result count: {len(result['results'])}")
|
||||
assert result["success"]
|
||||
assert len(result["results"]) > 0
|
||||
assert len(result["results"][0]["markdown"]) > 0
|
||||
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
|
||||
assert result["status"] == "completed"
|
||||
assert result["result"]["success"]
|
||||
assert len(result["result"]["markdown"]) > 0
|
||||
|
||||
|
||||
def test_js_execution(tester: Crawl4AiTester):
|
||||
print("\n=== Testing JS Execution ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if(loadMoreButton) loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "wide-tease-item__wrapper df flex-column flex-row-m flex-nowrap-m enable-new-sports-feed-mobile-design(10)"
|
||||
}
|
||||
"priority": 8,
|
||||
"js_code": [
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
|
||||
],
|
||||
"wait_for": "article.tease-card:nth-child(10)",
|
||||
"crawler_params": {"headless": True},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"JS execution result count: {len(result['result']['results'])}")
|
||||
print(f"JS execution result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
@@ -143,78 +152,51 @@ def test_css_selector(tester: Crawl4AiTester):
|
||||
print("\n=== Testing CSS Selector ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"word_count_threshold": 10
|
||||
}
|
||||
"priority": 7,
|
||||
"css_selector": ".wide-tease-item__description",
|
||||
"crawler_params": {"headless": True},
|
||||
"extra": {"word_count_threshold": 10},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
print(f"CSS selector result count: {len(result['result']['results'])}")
|
||||
print(f"CSS selector result length: {len(result['result']['markdown'])}")
|
||||
assert result["result"]["success"]
|
||||
|
||||
|
||||
def test_structured_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Structured Extraction ===")
|
||||
schema = {
|
||||
"name": "Cryptocurrency Prices",
|
||||
"baseSelector": "table[data-testid=\"prices-table\"] tbody tr",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asset_name",
|
||||
"selector": "td:nth-child(2) p.cds-headline-h4steop",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_symbol",
|
||||
"selector": "td:nth-child(2) p.cds-label2-l1sm09ec",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "asset_image_url",
|
||||
"selector": "td:nth-child(2) img[alt=\"Asset Symbol\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "asset_url",
|
||||
"selector": "td:nth-child(2) a[aria-label^=\"Asset page for\"]",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(3) div.cds-typographyResets-t6muwls.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "change",
|
||||
"selector": "td:nth-child(7) p.cds-body-bwup3gq",
|
||||
"type": "text"
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
},
|
||||
],
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.coinbase.com/explore"],
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "JsonCssExtractionStrategy",
|
||||
"params": {"schema": schema}
|
||||
}
|
||||
}
|
||||
}
|
||||
"priority": 9,
|
||||
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} items")
|
||||
if extracted:
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
print("Sample item:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
assert len(extracted) > 0
|
||||
|
||||
@@ -224,54 +206,43 @@ def test_llm_extraction(tester: Crawl4AiTester):
|
||||
schema = {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"asset_name": {
|
||||
"model_name": {
|
||||
"type": "string",
|
||||
"description": "Name of the asset.",
|
||||
"description": "Name of the OpenAI model.",
|
||||
},
|
||||
"price": {
|
||||
"input_fee": {
|
||||
"type": "string",
|
||||
"description": "Price of the asset.",
|
||||
"description": "Fee for input token for the OpenAI model.",
|
||||
},
|
||||
"change": {
|
||||
"output_fee": {
|
||||
"type": "string",
|
||||
"description": "Change in price of the asset.",
|
||||
"description": "Fee for output token for the OpenAI model.",
|
||||
},
|
||||
},
|
||||
"required": ["asset_name", "price", "change"],
|
||||
"required": ["model_name", "input_fee", "output_fee"],
|
||||
}
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.coinbase.com/en-in/explore"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"urls": ["https://openai.com/api/pricing"],
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "gemini/gemini-2.5-flash",
|
||||
"api_token": os.getenv("GEMINI_API_KEY")
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "From the crawled content tioned asset names along with their prices and change in price.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
"provider": "openai/gpt-4o-mini",
|
||||
"api_token": os.getenv("OPENAI_API_KEY"),
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
|
||||
},
|
||||
},
|
||||
"crawler_params": {"word_count_threshold": 1},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} model pricing entries")
|
||||
if extracted:
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
print("Sample entry:", json.dumps(extracted[0], indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
|
||||
@@ -300,32 +271,23 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
|
||||
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"verbose": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "llm",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "LLMExtractionStrategy",
|
||||
"params": {
|
||||
"llm_config": {
|
||||
"type": "LLMConfig",
|
||||
"params": {
|
||||
"provider": "ollama/llama3.2:latest",
|
||||
}
|
||||
},
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
}
|
||||
},
|
||||
"word_count_threshold": 1
|
||||
}
|
||||
}
|
||||
"provider": "ollama/llama2",
|
||||
"schema": schema,
|
||||
"extraction_type": "schema",
|
||||
"instruction": "Extract the main article information including title, summary, and main topics.",
|
||||
},
|
||||
},
|
||||
"extra": {"word_count_threshold": 1},
|
||||
"crawler_params": {"verbose": True},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print("Extracted content:", json.dumps(extracted, indent=2))
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
@@ -336,29 +298,23 @@ def test_cosine_extraction(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Cosine Extraction ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"priority": 8,
|
||||
"extraction_config": {
|
||||
"type": "cosine",
|
||||
"params": {
|
||||
"extraction_strategy": {
|
||||
"type": "CosineStrategy",
|
||||
"params": {
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
"semantic_filter": "business finance economy",
|
||||
"word_count_threshold": 10,
|
||||
"max_dist": 0.2,
|
||||
"top_k": 3,
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
try:
|
||||
result = tester.submit_and_wait(request)
|
||||
extracted = json.loads(result["result"]["results"][0]["extracted_content"])
|
||||
extracted = json.loads(result["result"]["extracted_content"])
|
||||
print(f"Extracted {len(extracted)} text clusters")
|
||||
if extracted:
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
print("First cluster tags:", extracted[0]["tags"])
|
||||
assert result["result"]["success"]
|
||||
except Exception as e:
|
||||
print(f"Cosine extraction test failed: {str(e)}")
|
||||
@@ -368,24 +324,19 @@ def test_screenshot(tester: Crawl4AiTester):
|
||||
print("\n=== Testing Screenshot ===")
|
||||
request = {
|
||||
"urls": ["https://www.nbcnews.com/business"],
|
||||
"browser_config": {"headless": True},
|
||||
"crawler_config": {
|
||||
"type": "CrawlerRunConfig",
|
||||
"params": {
|
||||
"screenshot": True
|
||||
}
|
||||
}
|
||||
"priority": 5,
|
||||
"screenshot": True,
|
||||
"crawler_params": {"headless": True},
|
||||
}
|
||||
|
||||
result = tester.submit_and_wait(request)
|
||||
screenshot_data = result["result"]["results"][0]["screenshot"]
|
||||
print("Screenshot captured:", bool(screenshot_data))
|
||||
print("Screenshot captured:", bool(result["result"]["screenshot"]))
|
||||
|
||||
if screenshot_data:
|
||||
if result["result"]["screenshot"]:
|
||||
# Save screenshot
|
||||
screenshot_bytes = base64.b64decode(screenshot_data)
|
||||
screenshot_data = base64.b64decode(result["result"]["screenshot"])
|
||||
with open("test_screenshot.jpg", "wb") as f:
|
||||
f.write(screenshot_bytes)
|
||||
f.write(screenshot_data)
|
||||
print("Screenshot saved as test_screenshot.jpg")
|
||||
|
||||
assert result["result"]["success"]
|
||||
|
||||
117
tests/general/test_bff_scoring.py
Normal file
@@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Simple test to verify BestFirstCrawlingStrategy fixes.
|
||||
This test crawls a real website and shows that:
|
||||
1. Higher-scoring pages are crawled first (priority queue fix)
|
||||
2. Links are scored before truncation (link discovery fix)
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
|
||||
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
|
||||
|
||||
async def test_best_first_strategy():
|
||||
"""Test BestFirstCrawlingStrategy with keyword scoring"""
|
||||
|
||||
print("=" * 70)
|
||||
print("Testing BestFirstCrawlingStrategy with Real URL")
|
||||
print("=" * 70)
|
||||
print("\nThis test will:")
|
||||
print("1. Crawl Python.org documentation")
|
||||
print("2. Score pages based on keywords: 'tutorial', 'guide', 'reference'")
|
||||
print("3. Show that higher-scoring pages are crawled first")
|
||||
print("-" * 70)
|
||||
|
||||
# Create a keyword scorer that prioritizes tutorial/guide pages
|
||||
scorer = KeywordRelevanceScorer(
|
||||
keywords=["tutorial", "guide", "reference", "documentation"],
|
||||
weight=1.0,
|
||||
case_sensitive=False
|
||||
)
|
||||
|
||||
# Create the strategy with scoring
|
||||
strategy = BestFirstCrawlingStrategy(
|
||||
max_depth=2, # Crawl 2 levels deep
|
||||
max_pages=10, # Limit to 10 pages total
|
||||
url_scorer=scorer, # Use keyword scoring
|
||||
include_external=False # Only internal links
|
||||
)
|
||||
|
||||
# Configure browser and crawler
|
||||
browser_config = BrowserConfig(
|
||||
headless=True, # Run in background
|
||||
verbose=False # Reduce output noise
|
||||
)
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
deep_crawl_strategy=strategy,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
print("\nStarting crawl of https://docs.python.org/3/")
|
||||
print("Looking for pages with keywords: tutorial, guide, reference, documentation")
|
||||
print("-" * 70)
|
||||
|
||||
crawled_urls = []
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Crawl and collect results
|
||||
results = await crawler.arun(
|
||||
url="https://docs.python.org/3/",
|
||||
config=crawler_config
|
||||
)
|
||||
|
||||
# Process results
|
||||
if isinstance(results, list):
|
||||
for result in results:
|
||||
score = result.metadata.get('score', 0) if result.metadata else 0
|
||||
depth = result.metadata.get('depth', 0) if result.metadata else 0
|
||||
crawled_urls.append({
|
||||
'url': result.url,
|
||||
'score': score,
|
||||
'depth': depth,
|
||||
'success': result.success
|
||||
})
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("CRAWL RESULTS (in order of crawling)")
|
||||
print("=" * 70)
|
||||
|
||||
for i, item in enumerate(crawled_urls, 1):
|
||||
status = "✓" if item['success'] else "✗"
|
||||
# Highlight high-scoring pages
|
||||
if item['score'] > 0.5:
|
||||
print(f"{i:2}. [{status}] Score: {item['score']:.2f} | Depth: {item['depth']} | {item['url']}")
|
||||
print(f" ^ HIGH SCORE - Contains keywords!")
|
||||
else:
|
||||
print(f"{i:2}. [{status}] Score: {item['score']:.2f} | Depth: {item['depth']} | {item['url']}")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("ANALYSIS")
|
||||
print("=" * 70)
|
||||
|
||||
# Check if higher scores appear early in the crawl
|
||||
scores = [item['score'] for item in crawled_urls[1:]] # Skip initial URL
|
||||
high_score_indices = [i for i, s in enumerate(scores) if s > 0.3]
|
||||
|
||||
if high_score_indices and high_score_indices[0] < len(scores) / 2:
|
||||
print("✅ SUCCESS: Higher-scoring pages (with keywords) were crawled early!")
|
||||
print(" This confirms the priority queue fix is working.")
|
||||
else:
|
||||
print("⚠️ Check the crawl order above - higher scores should appear early")
|
||||
|
||||
# Show score distribution
|
||||
print(f"\nScore Statistics:")
|
||||
print(f" - Total pages crawled: {len(crawled_urls)}")
|
||||
print(f" - Average score: {sum(item['score'] for item in crawled_urls) / len(crawled_urls):.2f}")
|
||||
print(f" - Max score: {max(item['score'] for item in crawled_urls):.2f}")
|
||||
print(f" - Pages with keywords: {sum(1 for item in crawled_urls if item['score'] > 0.3)}")
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("TEST COMPLETE")
|
||||
print("=" * 70)
|
||||
|
||||
if __name__ == "__main__":
|
||||
print("\n🔍 BestFirstCrawlingStrategy Simple Test\n")
|
||||
asyncio.run(test_best_first_strategy())
|
||||
@@ -1,43 +0,0 @@
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
# Simple concurrency test for persistent context page creation
|
||||
# Usage: python scripts/test_persistent_context.py
|
||||
|
||||
URLS = [
|
||||
# "https://example.com",
|
||||
"https://httpbin.org/html",
|
||||
"https://www.python.org/",
|
||||
"https://www.rust-lang.org/",
|
||||
]
|
||||
|
||||
async def main():
|
||||
profile_dir = os.path.join(os.path.expanduser("~"), ".crawl4ai", "profiles", "test-persistent-profile")
|
||||
os.makedirs(profile_dir, exist_ok=True)
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
use_persistent_context=True,
|
||||
user_data_dir=profile_dir,
|
||||
use_managed_browser=True,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
run_cfg = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
stream=False,
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
results = await crawler.arun_many(URLS, config=run_cfg)
|
||||
for r in results:
|
||||
print(r.url, r.success, len(r.markdown.raw_markdown) if r.markdown else 0)
|
||||
# r = await crawler.arun(url=URLS[0], config=run_cfg)
|
||||
# print(r.url, r.success, len(r.markdown.raw_markdown) if r.markdown else 0)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -1,55 +0,0 @@
import sys
import pytest
import asyncio
from unittest.mock import patch, MagicMock
from crawl4ai.browser_profiler import BrowserProfiler


@pytest.mark.asyncio
@pytest.mark.skipif(sys.platform != "win32", reason="Windows-specific msvcrt test")
async def test_keyboard_input_handling():
    # Mock sequence of keystrokes: arrow key followed by 'q'
    mock_keys = [b'\x00K', b'q']
    mock_kbhit = MagicMock(side_effect=[True, True, False])
    mock_getch = MagicMock(side_effect=mock_keys)

    with patch('msvcrt.kbhit', mock_kbhit), patch('msvcrt.getch', mock_getch):
        # profiler = BrowserProfiler()
        user_done_event = asyncio.Event()

        # Create a local async function to simulate the keyboard input handling
        async def test_listen_for_quit_command():
            if sys.platform == "win32":
                while True:
                    try:
                        if mock_kbhit():
                            raw = mock_getch()
                            try:
                                key = raw.decode("utf-8")
                            except UnicodeDecodeError:
                                continue

                            if len(key) != 1 or not key.isprintable():
                                continue

                            if key.lower() == "q":
                                user_done_event.set()
                                return

                        await asyncio.sleep(0.1)
                    except Exception as e:
                        continue

        # Run the listener
        listener_task = asyncio.create_task(test_listen_for_quit_command())

        # Wait for the event to be set
        try:
            await asyncio.wait_for(user_done_event.wait(), timeout=1.0)
            assert user_done_event.is_set()
        finally:
            if not listener_task.done():
                listener_task.cancel()
                try:
                    await listener_task
                except asyncio.CancelledError:
                    pass
@@ -1,582 +0,0 @@
"""
Comprehensive test suite for ProxyConfig in different forms:
1. String form (ip:port:username:password)
2. Dict form (dictionary with keys)
3. Object form (ProxyConfig instance)
4. Environment variable form (from env vars)

Tests cover all possible scenarios and edge cases using pytest.
"""

import asyncio
import os
import pytest
import tempfile
from unittest.mock import patch

from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_configs import CrawlerRunConfig, ProxyConfig
from crawl4ai.cache_context import CacheMode
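
# Quick reference (illustrative sketch; values are placeholders, not real proxies):
# the four ProxyConfig construction forms exercised in this suite.
def _proxy_config_quick_reference():
    obj = ProxyConfig(server="http://127.0.0.1:8080", username="user", password="pass")  # object form
    from_dict = ProxyConfig.from_dict({"server": "127.0.0.1:8080", "username": "user"})  # dict form
    from_str = ProxyConfig.from_string("127.0.0.1:8080:user:pass")  # string form (ip:port:username:password)
    from_env = ProxyConfig.from_env("PROXIES")  # env form: comma-separated proxy strings in $PROXIES
    return obj, from_dict, from_str, from_env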


class TestProxyConfig:
    """Comprehensive test suite for ProxyConfig functionality."""

    # Test data for different scenarios
    # Get a free proxy server from webshare.io: https://www.webshare.io/?referral_code=3sqog0y1fvsl
    TEST_PROXY_DATA = {
        "server": "",
        "username": "",
        "password": "",
        "ip": ""
    }

    def setup_method(self):
        """Setup for each test method."""
        self.test_url = "https://httpbin.org/ip"  # Use httpbin for testing

    # ==================== OBJECT FORM TESTS ====================

    def test_proxy_config_object_creation_basic(self):
        """Test basic ProxyConfig object creation."""
        proxy = ProxyConfig(server="127.0.0.1:8080")
        assert proxy.server == "127.0.0.1:8080"
        assert proxy.username is None
        assert proxy.password is None
        assert proxy.ip == "127.0.0.1"  # Should auto-extract IP

    def test_proxy_config_object_creation_full(self):
        """Test ProxyConfig object creation with all parameters."""
        proxy = ProxyConfig(
            server=f"http://{self.TEST_PROXY_DATA['server']}",
            username=self.TEST_PROXY_DATA['username'],
            password=self.TEST_PROXY_DATA['password'],
            ip=self.TEST_PROXY_DATA['ip']
        )
        assert proxy.server == f"http://{self.TEST_PROXY_DATA['server']}"
        assert proxy.username == self.TEST_PROXY_DATA['username']
        assert proxy.password == self.TEST_PROXY_DATA['password']
        assert proxy.ip == self.TEST_PROXY_DATA['ip']

    def test_proxy_config_object_ip_extraction(self):
        """Test automatic IP extraction from server URL."""
        test_cases = [
            ("http://192.168.1.1:8080", "192.168.1.1"),
            ("https://10.0.0.1:3128", "10.0.0.1"),
            ("192.168.1.100:8080", "192.168.1.100"),
            ("proxy.example.com:8080", "proxy.example.com"),
        ]

        for server, expected_ip in test_cases:
            proxy = ProxyConfig(server=server)
            assert proxy.ip == expected_ip, f"Failed for server: {server}"

    def test_proxy_config_object_invalid_server(self):
        """Test ProxyConfig with invalid server formats."""
        # Should not raise an exception, but may not extract IP properly
        proxy = ProxyConfig(server="invalid-format")
        assert proxy.server == "invalid-format"
        # IP extraction might fail, but the object should still be created

    # ==================== DICT FORM TESTS ====================

    def test_proxy_config_from_dict_basic(self):
        """Test creating ProxyConfig from basic dictionary."""
        proxy_dict = {"server": "127.0.0.1:8080"}
        proxy = ProxyConfig.from_dict(proxy_dict)
        assert proxy.server == "127.0.0.1:8080"
        assert proxy.username is None
        assert proxy.password is None

    def test_proxy_config_from_dict_full(self):
        """Test creating ProxyConfig from complete dictionary."""
        proxy_dict = {
            "server": f"http://{self.TEST_PROXY_DATA['server']}",
            "username": self.TEST_PROXY_DATA['username'],
            "password": self.TEST_PROXY_DATA['password'],
            "ip": self.TEST_PROXY_DATA['ip']
        }
        proxy = ProxyConfig.from_dict(proxy_dict)
        assert proxy.server == proxy_dict["server"]
        assert proxy.username == proxy_dict["username"]
        assert proxy.password == proxy_dict["password"]
        assert proxy.ip == proxy_dict["ip"]

    def test_proxy_config_from_dict_missing_keys(self):
        """Test creating ProxyConfig from dictionary with missing keys."""
        proxy_dict = {"server": "127.0.0.1:8080", "username": "user"}
        proxy = ProxyConfig.from_dict(proxy_dict)
        assert proxy.server == "127.0.0.1:8080"
        assert proxy.username == "user"
        assert proxy.password is None
        assert proxy.ip == "127.0.0.1"  # Should auto-extract

    def test_proxy_config_from_dict_empty(self):
        """Test creating ProxyConfig from empty dictionary."""
        proxy_dict = {}
        proxy = ProxyConfig.from_dict(proxy_dict)
        assert proxy.server is None
        assert proxy.username is None
        assert proxy.password is None
        assert proxy.ip is None

    def test_proxy_config_from_dict_none_values(self):
        """Test creating ProxyConfig from dictionary with None values."""
        proxy_dict = {
            "server": "127.0.0.1:8080",
            "username": None,
            "password": None,
            "ip": None
        }
        proxy = ProxyConfig.from_dict(proxy_dict)
        assert proxy.server == "127.0.0.1:8080"
        assert proxy.username is None
        assert proxy.password is None
        assert proxy.ip == "127.0.0.1"  # Should auto-extract despite None

    # ==================== STRING FORM TESTS ====================
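    # Formats accepted by ProxyConfig.from_string (per the tests below):
    # "ip:port" and "ip:port:username:password"; anything else raises ValueError.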

    def test_proxy_config_from_string_full_format(self):
        """Test creating ProxyConfig from full string format (ip:port:username:password)."""
        proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"
        proxy = ProxyConfig.from_string(proxy_str)
        assert proxy.server == f"http://{self.TEST_PROXY_DATA['ip']}:6114"
        assert proxy.username == self.TEST_PROXY_DATA['username']
        assert proxy.password == self.TEST_PROXY_DATA['password']
        assert proxy.ip == self.TEST_PROXY_DATA['ip']

    def test_proxy_config_from_string_ip_port_only(self):
        """Test creating ProxyConfig from string with only ip:port."""
        proxy_str = "192.168.1.1:8080"
        proxy = ProxyConfig.from_string(proxy_str)
        assert proxy.server == "http://192.168.1.1:8080"
        assert proxy.username is None
        assert proxy.password is None
        assert proxy.ip == "192.168.1.1"

    def test_proxy_config_from_string_invalid_format(self):
        """Test creating ProxyConfig from invalid string formats."""
        invalid_formats = [
            "invalid",
            "ip:port:user",  # Missing password (3 parts)
            "ip:port:user:pass:extra",  # Too many parts (5 parts)
            "",
            "::",  # Empty parts but 3 total (invalid)
            "::::",  # Empty parts but 5 total (invalid)
        ]

        for proxy_str in invalid_formats:
            with pytest.raises(ValueError, match="Invalid proxy string format"):
                ProxyConfig.from_string(proxy_str)

    def test_proxy_config_from_string_edge_cases_that_work(self):
        """Test string formats that should work but might be edge cases."""
        # These cases actually work as valid formats
        edge_cases = [
            (":", "http://:", ""),  # ip:port format with empty values
            (":::", "http://:", ""),  # ip:port:user:pass format with empty values
        ]

        for proxy_str, expected_server, expected_ip in edge_cases:
            proxy = ProxyConfig.from_string(proxy_str)
            assert proxy.server == expected_server
            assert proxy.ip == expected_ip

    def test_proxy_config_from_string_edge_cases(self):
        """Test string parsing edge cases."""
        # Test with different port numbers
        proxy_str = "10.0.0.1:3128:user:pass"
        proxy = ProxyConfig.from_string(proxy_str)
        assert proxy.server == "http://10.0.0.1:3128"

        # Test with special characters in credentials
        proxy_str = "10.0.0.1:8080:user@domain:pass:word"
        with pytest.raises(ValueError):  # Should fail due to extra colon in password
            ProxyConfig.from_string(proxy_str)

    # ==================== ENVIRONMENT VARIABLE TESTS ====================
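    # Format accepted by ProxyConfig.from_env (per the tests below): a comma-separated
    # list of proxy strings, e.g. "ip:port:user:pass,ip:port"; empty entries are skipped
    # and malformed entries are handled gracefully.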

    def test_proxy_config_from_env_single_proxy(self):
        """Test loading single proxy from environment variable."""
        proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"

        with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
            proxies = ProxyConfig.from_env('TEST_PROXIES')
            assert len(proxies) == 1
            proxy = proxies[0]
            assert proxy.ip == self.TEST_PROXY_DATA['ip']
            assert proxy.username == self.TEST_PROXY_DATA['username']
            assert proxy.password == self.TEST_PROXY_DATA['password']

    def test_proxy_config_from_env_multiple_proxies(self):
        """Test loading multiple proxies from environment variable."""
        proxy_list = [
            "192.168.1.1:8080:user1:pass1",
            "192.168.1.2:8080:user2:pass2",
            "10.0.0.1:3128"  # No auth
        ]
        proxy_str = ",".join(proxy_list)

        with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
            proxies = ProxyConfig.from_env('TEST_PROXIES')
            assert len(proxies) == 3

            # Check first proxy
            assert proxies[0].ip == "192.168.1.1"
            assert proxies[0].username == "user1"
            assert proxies[0].password == "pass1"

            # Check second proxy
            assert proxies[1].ip == "192.168.1.2"
            assert proxies[1].username == "user2"
            assert proxies[1].password == "pass2"

            # Check third proxy (no auth)
            assert proxies[2].ip == "10.0.0.1"
            assert proxies[2].username is None
            assert proxies[2].password is None

    def test_proxy_config_from_env_empty_var(self):
        """Test loading from empty environment variable."""
        with patch.dict(os.environ, {'TEST_PROXIES': ''}):
            proxies = ProxyConfig.from_env('TEST_PROXIES')
            assert len(proxies) == 0

    def test_proxy_config_from_env_missing_var(self):
        """Test loading from missing environment variable."""
        # Ensure the env var doesn't exist
        with patch.dict(os.environ, {}, clear=True):
            proxies = ProxyConfig.from_env('NON_EXISTENT_VAR')
            assert len(proxies) == 0

    def test_proxy_config_from_env_with_empty_entries(self):
        """Test loading proxies with empty entries in the list."""
        proxy_str = "192.168.1.1:8080:user:pass,,10.0.0.1:3128,"

        with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
            proxies = ProxyConfig.from_env('TEST_PROXIES')
            assert len(proxies) == 2  # Empty entries should be skipped
            assert proxies[0].ip == "192.168.1.1"
            assert proxies[1].ip == "10.0.0.1"

    def test_proxy_config_from_env_with_invalid_entries(self):
        """Test loading proxies with some invalid entries."""
        proxy_str = "192.168.1.1:8080:user:pass,invalid_proxy,10.0.0.1:3128"

        with patch.dict(os.environ, {'TEST_PROXIES': proxy_str}):
            # Should handle errors gracefully and return valid proxies
            proxies = ProxyConfig.from_env('TEST_PROXIES')
            # Depending on implementation, might return partial list or empty
            # This tests error handling
            assert isinstance(proxies, list)

    # ==================== SERIALIZATION TESTS ====================

    def test_proxy_config_to_dict(self):
        """Test converting ProxyConfig to dictionary."""
        proxy = ProxyConfig(
            server=f"http://{self.TEST_PROXY_DATA['server']}",
            username=self.TEST_PROXY_DATA['username'],
            password=self.TEST_PROXY_DATA['password'],
            ip=self.TEST_PROXY_DATA['ip']
        )

        result_dict = proxy.to_dict()
        expected = {
            "server": f"http://{self.TEST_PROXY_DATA['server']}",
            "username": self.TEST_PROXY_DATA['username'],
            "password": self.TEST_PROXY_DATA['password'],
            "ip": self.TEST_PROXY_DATA['ip']
        }
        assert result_dict == expected

    def test_proxy_config_clone(self):
        """Test cloning ProxyConfig with modifications."""
        original = ProxyConfig(
            server="http://127.0.0.1:8080",
            username="user",
            password="pass"
        )

        # Clone with modifications
        cloned = original.clone(username="new_user", password="new_pass")

        # Original should be unchanged
        assert original.username == "user"
        assert original.password == "pass"

        # Clone should have new values
        assert cloned.username == "new_user"
        assert cloned.password == "new_pass"
        assert cloned.server == original.server  # Unchanged value

    def test_proxy_config_roundtrip_serialization(self):
        """Test that ProxyConfig can be serialized and deserialized without loss."""
        original = ProxyConfig(
            server=f"http://{self.TEST_PROXY_DATA['server']}",
            username=self.TEST_PROXY_DATA['username'],
            password=self.TEST_PROXY_DATA['password'],
            ip=self.TEST_PROXY_DATA['ip']
        )

        # Serialize to dict and back
        serialized = original.to_dict()
        deserialized = ProxyConfig.from_dict(serialized)

        assert deserialized.server == original.server
        assert deserialized.username == original.username
        assert deserialized.password == original.password
        assert deserialized.ip == original.ip

    # ==================== INTEGRATION TESTS ====================

    @pytest.mark.asyncio
    async def test_crawler_with_proxy_config_object(self):
        """Test AsyncWebCrawler with ProxyConfig object."""
        proxy_config = ProxyConfig(
            server=f"http://{self.TEST_PROXY_DATA['server']}",
            username=self.TEST_PROXY_DATA['username'],
            password=self.TEST_PROXY_DATA['password']
        )

        browser_config = BrowserConfig(headless=True)

        # Test that the crawler accepts the ProxyConfig object without errors
        async with AsyncWebCrawler(config=browser_config) as crawler:
            try:
                # Note: This might fail due to actual proxy connection, but should not fail due to config issues
                result = await crawler.arun(
                    url=self.test_url,
                    config=CrawlerRunConfig(
                        cache_mode=CacheMode.BYPASS,
                        proxy_config=proxy_config,
                        page_timeout=10000  # Short timeout for testing
                    )
                )
                # If we get here, proxy config was accepted
                assert result is not None
            except Exception as e:
                # We expect connection errors with test proxies, but not config errors
                error_msg = str(e).lower()
                assert "attribute" not in error_msg, f"Config error: {e}"
                assert "proxy_config" not in error_msg, f"Proxy config error: {e}"

    @pytest.mark.asyncio
    async def test_crawler_with_proxy_config_dict(self):
        """Test AsyncWebCrawler with ProxyConfig from dictionary."""
        proxy_dict = {
            "server": f"http://{self.TEST_PROXY_DATA['server']}",
            "username": self.TEST_PROXY_DATA['username'],
            "password": self.TEST_PROXY_DATA['password']
        }
        proxy_config = ProxyConfig.from_dict(proxy_dict)

        browser_config = BrowserConfig(headless=True)

        async with AsyncWebCrawler(config=browser_config) as crawler:
            try:
                result = await crawler.arun(
                    url=self.test_url,
                    config=CrawlerRunConfig(
                        cache_mode=CacheMode.BYPASS,
                        proxy_config=proxy_config,
                        page_timeout=10000
                    )
                )
                assert result is not None
            except Exception as e:
                error_msg = str(e).lower()
                assert "attribute" not in error_msg, f"Config error: {e}"

    @pytest.mark.asyncio
    async def test_crawler_with_proxy_config_from_string(self):
        """Test AsyncWebCrawler with ProxyConfig from string."""
        proxy_str = f"{self.TEST_PROXY_DATA['ip']}:6114:{self.TEST_PROXY_DATA['username']}:{self.TEST_PROXY_DATA['password']}"
        proxy_config = ProxyConfig.from_string(proxy_str)

        browser_config = BrowserConfig(headless=True)

        async with AsyncWebCrawler(config=browser_config) as crawler:
            try:
                result = await crawler.arun(
                    url=self.test_url,
                    config=CrawlerRunConfig(
                        cache_mode=CacheMode.BYPASS,
                        proxy_config=proxy_config,
                        page_timeout=10000
                    )
                )
                assert result is not None
            except Exception as e:
                error_msg = str(e).lower()
                assert "attribute" not in error_msg, f"Config error: {e}"

    # ==================== EDGE CASES AND ERROR HANDLING ====================

    def test_proxy_config_with_none_server(self):
        """Test ProxyConfig behavior with None server."""
        proxy = ProxyConfig(server=None)
        assert proxy.server is None
        assert proxy.ip is None  # Should not crash

    def test_proxy_config_with_empty_string_server(self):
        """Test ProxyConfig behavior with empty string server."""
        proxy = ProxyConfig(server="")
        assert proxy.server == ""
        assert proxy.ip is None or proxy.ip == ""

    def test_proxy_config_special_characters_in_credentials(self):
        """Test ProxyConfig with special characters in username/password."""
        special_chars_tests = [
            ("user@domain.com", "pass!@#$%"),
            ("user_123", "p@ssw0rd"),
            ("user-test", "pass-word"),
        ]

        for username, password in special_chars_tests:
            proxy = ProxyConfig(
                server="http://127.0.0.1:8080",
                username=username,
                password=password
            )
            assert proxy.username == username
            assert proxy.password == password

    def test_proxy_config_unicode_handling(self):
        """Test ProxyConfig with unicode characters."""
        proxy = ProxyConfig(
            server="http://127.0.0.1:8080",
            username="ユーザー",  # Japanese characters
            password="пароль"  # Cyrillic characters
        )
        assert proxy.username == "ユーザー"
        assert proxy.password == "пароль"

    # ==================== PERFORMANCE TESTS ====================

    def test_proxy_config_creation_performance(self):
        """Test that ProxyConfig creation is reasonably fast."""
        import time

        start_time = time.time()
        for i in range(1000):
            proxy = ProxyConfig(
                server=f"http://192.168.1.{i % 255}:8080",
                username=f"user{i}",
                password=f"pass{i}"
            )
        end_time = time.time()

        # Should be able to create 1000 configs in less than 1 second
        assert (end_time - start_time) < 1.0

    def test_proxy_config_from_env_performance(self):
        """Test that loading many proxies from env is reasonably fast."""
        import time

        # Create a large list of proxy strings
        proxy_list = [f"192.168.1.{i}:8080:user{i}:pass{i}" for i in range(100)]
        proxy_str = ",".join(proxy_list)

        with patch.dict(os.environ, {'PERF_TEST_PROXIES': proxy_str}):
            start_time = time.time()
            proxies = ProxyConfig.from_env('PERF_TEST_PROXIES')
            end_time = time.time()

            assert len(proxies) == 100
            # Should be able to parse 100 proxies in less than 1 second
            assert (end_time - start_time) < 1.0


# ==================== STANDALONE TEST FUNCTIONS ====================

@pytest.mark.asyncio
async def test_dict_proxy():
    """Original test function for dict proxy - kept for backward compatibility."""
    proxy_config = {
        "server": "23.95.150.145:6114",
        "username": "cfyswbwn",
        "password": "1gs266hoqysi"
    }
    proxy_config_obj = ProxyConfig.from_dict(proxy_config)

    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        try:
            result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
                stream=False,
                cache_mode=CacheMode.BYPASS,
                proxy_config=proxy_config_obj,
                page_timeout=10000
            ))
            print("Dict proxy test passed!")
            print(result.markdown[:200] if result and result.markdown else "No result")
        except Exception as e:
            print(f"Dict proxy test error (expected): {e}")


@pytest.mark.asyncio
async def test_string_proxy():
    """Test function for string proxy format."""
    proxy_str = "23.95.150.145:6114:cfyswbwn:1gs266hoqysi"
    proxy_config_obj = ProxyConfig.from_string(proxy_str)

    browser_config = BrowserConfig(headless=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        try:
            result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
                stream=False,
                cache_mode=CacheMode.BYPASS,
                proxy_config=proxy_config_obj,
                page_timeout=10000
            ))
            print("String proxy test passed!")
            print(result.markdown[:200] if result and result.markdown else "No result")
        except Exception as e:
            print(f"String proxy test error (expected): {e}")


@pytest.mark.asyncio
async def test_env_proxy():
    """Test function for environment variable proxy."""
    # Set environment variable
    os.environ['TEST_PROXIES'] = "23.95.150.145:6114:cfyswbwn:1gs266hoqysi"

    proxies = ProxyConfig.from_env('TEST_PROXIES')
    if proxies:
        proxy_config_obj = proxies[0]  # Use first proxy

        browser_config = BrowserConfig(headless=True)
        async with AsyncWebCrawler(config=browser_config) as crawler:
            try:
                result = await crawler.arun(url="https://httpbin.org/ip", config=CrawlerRunConfig(
                    stream=False,
                    cache_mode=CacheMode.BYPASS,
                    proxy_config=proxy_config_obj,
                    page_timeout=10000
                ))
                print("Environment proxy test passed!")
                print(result.markdown[:200] if result and result.markdown else "No result")
            except Exception as e:
                print(f"Environment proxy test error (expected): {e}")
    else:
        print("No proxies loaded from environment")


if __name__ == "__main__":
    print("Running comprehensive ProxyConfig tests...")
    print("=" * 50)

    # Run the standalone test functions
    print("\n1. Testing dict proxy format...")
    asyncio.run(test_dict_proxy())

    print("\n2. Testing string proxy format...")
    asyncio.run(test_string_proxy())

    print("\n3. Testing environment variable proxy format...")
    asyncio.run(test_env_proxy())

    print("\n" + "=" * 50)
    print("To run the full pytest suite, use: pytest " + __file__)
    print("=" * 50)
@@ -1,170 +0,0 @@
#!/usr/bin/env python3
"""
Test LLMTableExtraction with controlled HTML
"""

import os
import sys
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    LLMConfig,
    LLMTableExtraction,
    DefaultTableExtraction,
    CacheMode
)


async def test_controlled_html():
    """Test with controlled HTML content."""
    print("\n" + "=" * 60)
    print("LLM TABLE EXTRACTION TEST")
    print("=" * 60)

    url = "https://en.wikipedia.org/wiki/List_of_chemical_elements"
    # url = "https://en.wikipedia.org/wiki/List_of_prime_ministers_of_India"

    # Configure LLM
    llm_config = LLMConfig(
        # provider="openai/gpt-4.1-mini",
        # api_token=os.getenv("OPENAI_API_KEY"),
        provider="groq/llama-3.3-70b-versatile",
        api_token=os.getenv("GROQ_API_TOKEN"),  # read the key from the environment rather than hard-coding it
        temperature=0.1,
        max_tokens=32000
    )

    print("\n1. Testing LLMTableExtraction:")

    # Create LLM extraction strategy
    llm_strategy = LLMTableExtraction(
        llm_config=llm_config,
        verbose=True,
        # css_selector="div.w3-example"
        css_selector="div.mw-content-ltr",
        # css_selector="table.wikitable",
        max_tries=2,
        enable_chunking=True,
        chunk_token_threshold=5000,  # Lower threshold to force chunking
        min_rows_per_chunk=10,
        max_parallel_chunks=3
    )

    config_llm = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        table_extraction=llm_strategy
    )

    async with AsyncWebCrawler() as crawler:
        # Test with LLM extraction
        result_llm = await crawler.arun(
            # url=f"raw:{test_html}",
            url=url,
            config=config_llm
        )

        if result_llm.success:
            print(f"\n ✓ LLM Extraction: Found {len(result_llm.tables)} table(s)")

            for i, table in enumerate(result_llm.tables, 1):
                print(f"\n Table {i}:")
                print(f" - Caption: {table.get('caption', 'No caption')}")
                print(f" - Headers: {table['headers']}")
                print(f" - Rows: {len(table['rows'])}")

                # Show how colspan/rowspan were handled
                print(f" - Sample rows:")
                for j, row in enumerate(table['rows'][:2], 1):
                    print(f" Row {j}: {row}")

                metadata = table.get('metadata', {})
                print(f" - Metadata:")
                print(f" • Has merged cells: {metadata.get('has_merged_cells', False)}")
                print(f" • Table type: {metadata.get('table_type', 'unknown')}")

        # # Compare with default extraction
        # print("\n2. Comparing with DefaultTableExtraction:")

        # default_strategy = DefaultTableExtraction(
        #     table_score_threshold=3,
        #     verbose=False
        # )

        # config_default = CrawlerRunConfig(
        #     cache_mode=CacheMode.BYPASS,
        #     table_extraction=default_strategy
        # )

        # result_default = await crawler.arun(
        #     # url=f"raw:{test_html}",
        #     url=url,
        #     config=config_default
        # )

        # if result_default.success:
        #     print(f" ✓ Default Extraction: Found {len(result_default.tables)} table(s)")

        # # Compare handling of complex structures
        # print("\n3. Comparison Summary:")
        # print(f" LLM found: {len(result_llm.tables)} tables")
        # print(f" Default found: {len(result_default.tables)} tables")

        # if result_llm.tables and result_default.tables:
        #     llm_first = result_llm.tables[0]
        #     default_first = result_default.tables[0]

        #     print(f"\n First table comparison:")
        #     print(f" LLM headers: {len(llm_first['headers'])} columns")
        #     print(f" Default headers: {len(default_first['headers'])} columns")

        #     # Check if LLM better handled the complex structure
        #     if llm_first.get('metadata', {}).get('has_merged_cells'):
        #         print(" ✓ LLM correctly identified merged cells")

        # # Test pandas compatibility
        # try:
        #     import pandas as pd

        #     print("\n4. Testing Pandas compatibility:")

        #     # Create DataFrame from LLM extraction
        #     df_llm = pd.DataFrame(
        #         llm_first['rows'],
        #         columns=llm_first['headers']
        #     )
        #     print(f" ✓ LLM table -> DataFrame: Shape {df_llm.shape}")

        #     # Create DataFrame from default extraction
        #     df_default = pd.DataFrame(
        #         default_first['rows'],
        #         columns=default_first['headers']
        #     )
        #     print(f" ✓ Default table -> DataFrame: Shape {df_default.shape}")

        #     print("\n LLM DataFrame preview:")
        #     print(df_llm.head(2).to_string())

        # except ImportError:
        #     print("\n4. Pandas not installed, skipping DataFrame test")

    print("\n✅ Test completed successfully!")


async def main():
    """Run the test."""
    # Check for the API key of the active provider (Groq) before running
    if not os.getenv("GROQ_API_TOKEN"):
        print("⚠️ GROQ_API_TOKEN not set. Please set it to test LLM extraction.")
        print(" You can set it with: export GROQ_API_TOKEN='your-key-here'")
        return

    await test_controlled_html()


if __name__ == "__main__":
    asyncio.run(main())
@@ -4,7 +4,7 @@
import psutil
import platform
import time
from crawl4ai.utils import get_true_memory_usage_percent, get_memory_stats, get_true_available_memory_gb
from crawl4ai.memory_utils import get_true_memory_usage_percent, get_memory_stats, get_true_available_memory_gb


def test_memory_calculation():