Merge branch 'vr0.4.3b2'
.gitattributes (vendored): 2 lines changed

```diff
@@ -9,4 +9,4 @@ docs/md_v2/* linguist-documentation
 *.py linguist-language=Python
 # Exclude HTML from language statistics
 *.html linguist-detectable=false
```
.gitignore (vendored): 5 lines changed

```diff
@@ -230,4 +230,7 @@ plans/

 # Codeium
 .codeiumignore
 todo/
+
+# windsurf rules
+.windsurfrules
```
CHANGELOG.md: 116 lines changed

```diff
@@ -7,23 +7,123 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ---
-## [0.4.267] - 2025-01-06
-
-### Changed
-
-### Added
+## Version 0.4.3b2 (2025-01-21)
+
+This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
+
+### Features
+
+- **Robots.txt Compliance:**
+  - Added robots.txt compliance support with efficient SQLite-based caching.
+  - New `check_robots_txt` parameter in `CrawlerRunConfig` to enable robots.txt checking before crawling a URL.
+  - Automated robots.txt checking is now integrated into `AsyncWebCrawler`, which returns a 403 status code for blocked URLs.
```
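The gate described above (cache robots.txt rules locally, consult them before fetching) can be sketched with the standard library alone. This is an illustrative stand-in, not crawl4ai's implementation; the `RobotsCache` class and its methods are invented for the sketch.

```python
import sqlite3
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Illustrative robots.txt gate with a SQLite cache (not crawl4ai's code)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS robots (host TEXT PRIMARY KEY, body TEXT)")

    def store(self, host, body):
        # In a real crawler, body would come from fetching the host's /robots.txt.
        self.db.execute("INSERT OR REPLACE INTO robots VALUES (?, ?)", (host, body))

    def allowed(self, host, path, agent="crawl4ai"):
        row = self.db.execute("SELECT body FROM robots WHERE host = ?", (host,)).fetchone()
        if row is None:
            return True  # no cached rules yet; a real crawler would fetch and cache them
        parser = RobotFileParser()
        parser.parse(row[0].splitlines())
        return parser.can_fetch(agent, path)

cache = RobotsCache()
cache.store("example.com", "User-agent: *\nDisallow: /private/")
print(cache.allowed("example.com", "/private/page"))  # False: the crawl would be blocked
```

A disallowed URL is the case the changelog maps to a 403 status code on the crawl result.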
```diff
+
+- **Proxy Configuration:**
+  - Added proxy configuration support to `CrawlerRunConfig`, allowing dynamic proxy settings per crawl request.
+  - Updated documentation with examples for using proxy configuration in crawl operations.
+
+- **LLM-Powered Schema Generation:**
+  - Introduced a new utility for automatic CSS and XPath schema generation using OpenAI or Ollama models.
+  - Added comprehensive documentation and examples for schema generation.
+  - New prompt templates optimized for HTML schema analysis.
+
+- **URL Redirection Tracking:**
+  - Added URL redirection tracking to capture the final URL after any redirects.
+  - The final URL is now available in the `redirected_url` field of the `AsyncCrawlResponse` object.
+
+- **Streamlined Documentation:**
+  - Refactored and improved the documentation structure for clarity and ease of use.
+  - Added detailed explanations of new features and updated examples.
+
+- **Improved Browser Context Management:**
+  - Enhanced the management of browser contexts and added shared-data support.
+  - Introduced the `shared_data` parameter in `CrawlerRunConfig` to pass data between hooks.
```
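The `shared_data` mechanism is essentially a dictionary carried through the crawl so that one hook can leave state for a later hook to read. A minimal sketch of that idea; the hook names and signatures here are hypothetical, not crawl4ai's actual hook API:

```python
# Two illustrative hooks communicating through a shared dict.
def on_page_created(page, context):
    context["shared_data"]["session_started"] = True

def on_page_crawled(page, context):
    # A later hook reads what an earlier hook stored.
    context["shared_data"]["summary"] = f"started={context['shared_data']['session_started']}"

context = {"shared_data": {}}
on_page_created(None, context)
on_page_crawled(None, context)
print(context["shared_data"]["summary"])  # started=True
```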
```diff
+
+- **Memory Dispatcher System:**
+  - Migrated to a memory dispatcher system with enhanced monitoring capabilities.
+  - Introduced `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher` for improved resource management.
+  - Added `RateLimiter` for rate-limiting support.
+  - New `CrawlerMonitor` for real-time monitoring of crawler operations.
```
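The dispatcher idea combines a concurrency cap with a memory back-off. A toy sketch of that control flow, assuming a pluggable `memory_pct` probe (crawl4ai's `MemoryAdaptiveDispatcher` is far more elaborate):

```python
import asyncio

async def dispatch(urls, crawl, max_concurrency=3, memory_threshold=90.0,
                   memory_pct=lambda: 0.0):
    """Run crawl(url) for each url, capped by a semaphore and a memory check."""
    sem = asyncio.Semaphore(max_concurrency)
    results = []

    async def run(url):
        async with sem:
            # Back off while memory pressure is above the threshold.
            while memory_pct() >= memory_threshold:
                await asyncio.sleep(0.01)
            results.append(await crawl(url))

    await asyncio.gather(*(run(u) for u in urls))
    return results

async def fake_crawl(url):
    await asyncio.sleep(0)
    return f"crawled:{url}"

out = asyncio.run(dispatch(["a", "b", "c"], fake_crawl))
print(sorted(out))  # ['crawled:a', 'crawled:b', 'crawled:c']
```

A real probe would read process or system memory (e.g. via psutil); the lambda default keeps the sketch dependency-free.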
```diff
+
+- **Streaming Support:**
+  - Added streaming support so crawled URLs can be consumed as soon as their results are ready.
+  - Enabled streaming mode with the `stream` parameter in `CrawlerRunConfig`.
```
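Streaming versus batch comes down to yielding each result as it completes instead of returning one list at the end. An illustrative async-generator sketch (crawl4ai exposes this via `stream=True`; the `fetch` helper below is a stand-in):

```python
import asyncio

async def fetch(url):
    # Stand-in for a real crawl; longer "URLs" take longer here.
    await asyncio.sleep(0.01 * len(url))
    return url.upper()

async def crawl_stream(urls):
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    for fut in asyncio.as_completed(tasks):
        yield await fut  # the caller sees each result as soon as it is ready

async def main():
    seen = []
    async for result in crawl_stream(["bb", "a", "ccc"]):
        seen.append(result)
    return seen

print(asyncio.run(main()))
```

With the batch style, the caller would wait for the slowest URL before seeing anything; here the fastest result arrives first.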
```diff
+
+- **Content Scraping Strategy:**
+  - Introduced a new `LXMLWebScrapingStrategy` for faster content scraping.
+  - Added support for selecting the scraping strategy via the `scraping_strategy` parameter in `CrawlerRunConfig`.
+
+### Bug Fixes
+
+- **Browser Path Management:** Improved browser path management for consistent behavior across different environments.
+- **Memory Threshold:** Adjusted the default memory threshold to improve resource utilization.
+- **Pydantic Model Fields:** Made several model fields optional with default values to improve flexibility.
+
+### Refactor
+
+- **Documentation Structure:**
+  - Reorganized the documentation structure to improve navigation and readability.
+  - Updated styles and added new sections for advanced features.
+
+- **Scraping Mode:** Replaced the `ScrapingMode` enum with a strategy pattern for more flexible content scraping.
```
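The enum-to-strategy refactor means the crawler is handed an object implementing the scraping interface, so new engines can be added without touching crawler code. A minimal shape of that pattern; the class names mirror the changelog, but the bodies are stand-ins:

```python
from abc import ABC, abstractmethod

class ContentScrapingStrategy(ABC):
    @abstractmethod
    def scrap(self, html: str) -> str: ...

class WebScrapingStrategy(ContentScrapingStrategy):
    def scrap(self, html: str) -> str:
        return f"default:{len(html)} chars"

class LXMLWebScrapingStrategy(ContentScrapingStrategy):
    def scrap(self, html: str) -> str:
        return f"lxml:{len(html)} chars"

def run_crawl(html, strategy: ContentScrapingStrategy):
    # The crawler only depends on the interface, not on a mode enum.
    return strategy.scrap(html)

print(run_crawl("<html></html>", LXMLWebScrapingStrategy()))  # lxml:13 chars
```

An enum would force a branch per mode inside the crawler; the strategy object moves that choice to the caller.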
```diff
+
+- **Version Update:** Updated the version to `0.4.248`.
+
+- **Code Cleanup:**
+  - Removed unused files and improved type hints.
+  - Applied Ruff corrections for code quality.
+
+- **Updated Dependencies:** Updated dependencies to their latest versions to ensure compatibility and security.
+
+- **Ignored Patterns and Directories:** Updated `.gitignore` and `.codeiumignore` to ignore additional patterns and directories, streamlining the development environment.
+
+- **Simplified Personal Story in README:** Streamlined the personal story and project vision in the `README.md` for clarity.
+
+- **Removed Deprecated Files:** Deleted several deprecated files and examples that are no longer relevant.
+
+---
+
+**Previous Releases:**
+
+### 0.4.24x (2024-12-31)
+
+- **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling.
+- **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies.
+- **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction.
+- **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types.
+- **Performance Boost**: Optimized caching, parallel processing, and memory management.
+- **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking.
+- **Security Features**: Improved input validation and safe expression evaluation.
+
+### 0.4.247 (2025-01-06)
+
+#### Added
 - **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
 - **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
```
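The Windows fix noted above comes down to selecting the Proactor event loop, which supports asyncio subprocesses where the selector loop raises `NotImplementedError`. A sketch of what a helper like `configure_windows_event_loop` presumably does (an assumption, not the library's verbatim code):

```python
import asyncio
import sys

def configure_windows_event_loop():
    # The default selector event loop on Windows cannot run asyncio
    # subprocesses; the Proactor loop can.
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

configure_windows_event_loop()  # no-op on non-Windows platforms
```

Callers would invoke this once, before creating any event loop.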
```diff
-### Changed
+#### Changed
 - **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
 - **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
 - **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
 - **Documentation Update**:
   - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
-### Removed
+#### Removed
 - **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
-### Fixed
+#### Fixed
 - **Page Closing to Prevent Memory Leaks**:
   - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
   - **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
@@ -38,9 +138,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
 - **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
-### Other
-- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
+## [0.4.267] - 2025-01-06
+
+### Added
+
+- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
+- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
+
 ## [0.4.24] - 2024-12-31
```
```diff
@@ -6,7 +6,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
 - [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
 - [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
-- [Aravind Karnam](https://github.com/aravindkarnam) - Developer
+- [Aravind Karnam](https://github.com/aravindkarnam) - Head of Community and Product

 ## Community Contributors
```
README.md: 87 lines changed

```diff
@@ -21,9 +21,25 @@
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

-[✨ Check out latest update v0.4.24x](#-recent-updates)
+[✨ Check out latest update v0.4.3bx](#-recent-updates)

-🎉 **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
+
+<details>
+<summary>🤓 <strong>My Personal Story</strong></summary>
+
+My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.
+
+Fast forward to 2023: I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
+
+I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
+
+Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
+
+</details>

 ## 🧐 Why Crawl4AI?
```

```diff
@@ -41,6 +57,9 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 # Install the package
 pip install -U crawl4ai
+
+# For pre-release versions
+pip install crawl4ai --pre
+
 # Run post-installation setup
 crawl4ai-setup
```

```diff
@@ -470,18 +489,64 @@ async def test_news_crawl():
 </details>

 ## ✨ Recent Updates

+- **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
+- **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
+- **🤖 Enhanced LLM Integration**:
+  - **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
+  - **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
+  - **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
+- **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
+- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
+- **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
+- **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
+- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing near pixel-perfect mirroring of websites.
+- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
+- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
+
-- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
-- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
-- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction
-- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
-- ⚡ **Performance Boost**: Optimized caching, parallel processing, and memory management
-- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
-- 🔐 **Security Features**: Improved input validation and safe expression evaluation
-
-Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
+## Version Numbering in Crawl4AI
+
+Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
+
+### Version Numbers Explained
+
+Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
+
+#### Pre-release Versions
+
+We use different suffixes to indicate development stages:
+
+- `dev` (0.4.3.dev1): Development versions, unstable
+- `a` (0.4.3a1): Alpha releases, experimental features
+- `b` (0.4.3b1): Beta releases, feature complete but needs testing
+- `rc` (0.4.3rc1): Release candidates, potential final version
+
+#### Installation
+
+- Regular installation (stable version):
+
+  ```bash
+  pip install -U crawl4ai
+  ```
+
+- Install pre-release versions:
+
+  ```bash
+  pip install crawl4ai --pre
+  ```
+
+- Install a specific version:
+
+  ```bash
+  pip install crawl4ai==0.4.3b1
+  ```
+
+#### Why Pre-releases?
+
+We use pre-releases to:
+
+- Test new features in real-world scenarios
+- Gather feedback before final releases
+- Ensure stability for production users
+- Allow early adopters to try new features
+
+For production environments, we recommend using the stable version. For testing new features, you can opt in to pre-releases using the `--pre` flag.

 ## 📖 Documentation & Roadmap
```

```diff
@@ -511,7 +576,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
 ## 🤝 Contributing

-We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
+We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.

 ## 📄 License
```
crawl4ai/__init__.py

```diff
@@ -2,45 +2,87 @@
 from .async_webcrawler import AsyncWebCrawler, CacheMode
 from .async_configs import BrowserConfig, CrawlerRunConfig
-from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
+from .content_scraping_strategy import (
+    ContentScrapingStrategy,
+    WebScrapingStrategy,
+    LXMLWebScrapingStrategy,
+)
+from .extraction_strategy import (
+    ExtractionStrategy,
+    LLMExtractionStrategy,
+    CosineStrategy,
+    JsonCssExtractionStrategy,
+    JsonXPathExtractionStrategy,
+)
 from .chunking_strategy import ChunkingStrategy, RegexChunking
 from .markdown_generation_strategy import DefaultMarkdownGenerator
-from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
+from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
-from .models import CrawlResult
+from .models import CrawlResult, MarkdownGenerationResult
-from .__version__ import __version__
+from .async_dispatcher import (
+    MemoryAdaptiveDispatcher,
+    SemaphoreDispatcher,
+    RateLimiter,
+    CrawlerMonitor,
+    DisplayMode,
+    BaseDispatcher,
+)

 __all__ = [
     "AsyncWebCrawler",
     "CrawlResult",
     "CacheMode",
-    'BrowserConfig',
-    'CrawlerRunConfig',
-    'ExtractionStrategy',
-    'LLMExtractionStrategy',
-    'CosineStrategy',
-    'JsonCssExtractionStrategy',
-    'ChunkingStrategy',
-    'RegexChunking',
-    'DefaultMarkdownGenerator',
-    'PruningContentFilter',
-    'BM25ContentFilter',
+    "ContentScrapingStrategy",
+    "WebScrapingStrategy",
+    "LXMLWebScrapingStrategy",
+    "BrowserConfig",
+    "CrawlerRunConfig",
+    "ExtractionStrategy",
+    "LLMExtractionStrategy",
+    "CosineStrategy",
+    "JsonCssExtractionStrategy",
+    "JsonXPathExtractionStrategy",
+    "ChunkingStrategy",
+    "RegexChunking",
+    "DefaultMarkdownGenerator",
+    "PruningContentFilter",
+    "BM25ContentFilter",
+    "LLMContentFilter",
+    "BaseDispatcher",
+    "MemoryAdaptiveDispatcher",
+    "SemaphoreDispatcher",
+    "RateLimiter",
+    "CrawlerMonitor",
+    "DisplayMode",
+    "MarkdownGenerationResult",
 ]

 def is_sync_version_installed():
     try:
         import selenium
         return True
     except ImportError:
         return False

 if is_sync_version_installed():
     try:
         from .web_crawler import WebCrawler
         __all__.append("WebCrawler")
     except ImportError:
-        import warnings
-        print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
+        print(
+            "Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
+        )
 else:
     WebCrawler = None
     # import warnings
     # print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
+
+import warnings
+from pydantic import warnings as pydantic_warnings
+
+# Disable all Pydantic warnings
+warnings.filterwarnings("ignore", module="pydantic")
+# pydantic_warnings.filter_warnings()
```
```diff
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.247"
+__version__ = "0.4.3b2"
```
crawl4ai/async_configs.py

```diff
@@ -5,13 +5,13 @@ from .config import (
     PAGE_TIMEOUT,
     IMAGE_SCORE_THRESHOLD,
     SOCIAL_MEDIA_DOMAINS,
 )
 from .user_agent_generator import UserAgentGenerator
 from .extraction_strategy import ExtractionStrategy
-from .chunking_strategy import ChunkingStrategy
+from .chunking_strategy import ChunkingStrategy, RegexChunking
 from .markdown_generation_strategy import MarkdownGenerationStrategy
-from typing import Union, List
+from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
+from typing import Optional, Union, List

 class BrowserConfig:
@@ -38,7 +38,7 @@ class BrowserConfig:
             is "chromium". Default: "chromium".
         channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
             is "chromium". Default: "chromium".
-        proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
+        proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
             Default: None.
         proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
             If None, no additional proxy config. Default: None.
@@ -81,10 +81,10 @@ class BrowserConfig:
         user_data_dir: str = None,
         chrome_channel: str = "chromium",
         channel: str = "chromium",
-        proxy: str = None,
+        proxy: Optional[str] = None,
         proxy_config: dict = None,
         viewport_width: int = 1080,
         viewport_height: int = 600,
         accept_downloads: bool = False,
         downloads_path: str = None,
         storage_state=None,
@@ -103,7 +103,7 @@ class BrowserConfig:
         text_mode: bool = False,
         light_mode: bool = False,
         extra_args: list = None,
-        debugging_port : int = 9222,
+        debugging_port: int = 9222,
     ):
         self.browser_type = browser_type
         self.headless = headless
@@ -112,6 +112,9 @@ class BrowserConfig:
         self.user_data_dir = user_data_dir
         self.chrome_channel = chrome_channel or self.browser_type or "chromium"
         self.channel = channel or self.browser_type or "chromium"
+        if self.browser_type in ["firefox", "webkit"]:
+            self.channel = ""
+            self.chrome_channel = ""
         self.proxy = proxy
         self.proxy_config = proxy_config
         self.viewport_width = viewport_width
@@ -142,7 +145,7 @@ class BrowserConfig:
             self.user_agent = user_agenr_generator.generate()
         else:
             pass

         self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
         self.headers.setdefault("sec-ch-ua", self.browser_hint)
@@ -183,6 +186,50 @@ class BrowserConfig:
             extra_args=kwargs.get("extra_args", []),
         )
+
+    def to_dict(self):
+        return {
+            "browser_type": self.browser_type,
+            "headless": self.headless,
+            "use_managed_browser": self.use_managed_browser,
+            "use_persistent_context": self.use_persistent_context,
+            "user_data_dir": self.user_data_dir,
+            "chrome_channel": self.chrome_channel,
+            "channel": self.channel,
+            "proxy": self.proxy,
+            "proxy_config": self.proxy_config,
+            "viewport_width": self.viewport_width,
+            "viewport_height": self.viewport_height,
+            "accept_downloads": self.accept_downloads,
+            "downloads_path": self.downloads_path,
+            "storage_state": self.storage_state,
+            "ignore_https_errors": self.ignore_https_errors,
+            "java_script_enabled": self.java_script_enabled,
+            "cookies": self.cookies,
+            "headers": self.headers,
+            "user_agent": self.user_agent,
+            "user_agent_mode": self.user_agent_mode,
+            "user_agent_generator_config": self.user_agent_generator_config,
+            "text_mode": self.text_mode,
+            "light_mode": self.light_mode,
+            "extra_args": self.extra_args,
+            "sleep_on_close": self.sleep_on_close,
+            "verbose": self.verbose,
+            "debugging_port": self.debugging_port,
+        }
+
+    def clone(self, **kwargs):
+        """Create a copy of this configuration with updated values.
+
+        Args:
+            **kwargs: Key-value pairs of configuration options to update
+
+        Returns:
+            BrowserConfig: A new instance with the specified updates
+        """
+        config_dict = self.to_dict()
+        config_dict.update(kwargs)
+        return BrowserConfig.from_kwargs(config_dict)

 class CrawlerRunConfig:
```
|
class CrawlerRunConfig:
|
||||||
"""
|
"""
|
||||||
@@ -221,6 +268,10 @@ class CrawlerRunConfig:
|
|||||||
Default: False.
|
Default: False.
|
||||||
parser_type (str): Type of parser to use for HTML parsing.
|
parser_type (str): Type of parser to use for HTML parsing.
|
||||||
Default: "lxml".
|
Default: "lxml".
|
||||||
|
scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
|
||||||
|
Default: WebScrapingStrategy.
|
||||||
|
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
|
||||||
|
If None, no additional proxy config. Default: None.
|
||||||
|
|
||||||
# Caching Parameters
|
# Caching Parameters
|
||||||
cache_mode (CacheMode or None): Defines how caching is handled.
|
cache_mode (CacheMode or None): Defines how caching is handled.
|
||||||
@@ -237,6 +288,8 @@ class CrawlerRunConfig:
|
|||||||
Default: False.
|
Default: False.
|
||||||
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
|
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
|
||||||
Default: False.
|
Default: False.
|
||||||
|
shared_data (dict or None): Shared data to be passed between hooks.
|
||||||
|
Default: None.
|
||||||
|
|
||||||
# Page Navigation and Timing Parameters
|
# Page Navigation and Timing Parameters
|
||||||
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
|
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
|
||||||
@@ -311,6 +364,15 @@ class CrawlerRunConfig:
|
|||||||
Default: True.
|
Default: True.
|
||||||
log_console (bool): If True, log console messages from the page.
|
log_console (bool): If True, log console messages from the page.
|
||||||
Default: False.
|
Default: False.
|
||||||
|
|
||||||
|
# Streaming Parameters
|
||||||
|
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
|
||||||
|
Default: False.
|
||||||
|
|
||||||
|
# Optional Parameters
|
||||||
|
stream (bool): If True, stream the page content as it is being loaded.
|
||||||
|
url: str = None # This is not a compulsory parameter
|
||||||
|
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(
|
def __init__(
|
||||||
@@ -318,7 +380,7 @@ class CrawlerRunConfig:
|
|||||||
# Content Processing Parameters
|
# Content Processing Parameters
|
||||||
word_count_threshold: int = MIN_WORD_THRESHOLD,
|
word_count_threshold: int = MIN_WORD_THRESHOLD,
|
||||||
extraction_strategy: ExtractionStrategy = None,
|
extraction_strategy: ExtractionStrategy = None,
|
||||||
chunking_strategy: ChunkingStrategy = None,
|
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||||
markdown_generator: MarkdownGenerationStrategy = None,
|
markdown_generator: MarkdownGenerationStrategy = None,
|
||||||
content_filter=None,
|
content_filter=None,
|
||||||
only_text: bool = False,
|
only_text: bool = False,
|
||||||
@@ -329,10 +391,10 @@ class CrawlerRunConfig:
|
|||||||
remove_forms: bool = False,
|
remove_forms: bool = False,
|
||||||
prettiify: bool = False,
|
prettiify: bool = False,
|
||||||
parser_type: str = "lxml",
|
parser_type: str = "lxml",
|
||||||
|
scraping_strategy: ContentScrapingStrategy = None,
|
||||||
|
proxy_config: dict = None,
|
||||||
# SSL Parameters
|
# SSL Parameters
|
||||||
fetch_ssl_certificate: bool = False,
|
fetch_ssl_certificate: bool = False,
|
||||||
|
|
||||||
# Caching Parameters
|
# Caching Parameters
|
||||||
cache_mode=None,
|
cache_mode=None,
|
||||||
session_id: str = None,
|
session_id: str = None,
|
||||||
@@ -340,7 +402,7 @@ class CrawlerRunConfig:
|
|||||||
disable_cache: bool = False,
|
disable_cache: bool = False,
|
||||||
no_cache_read: bool = False,
|
no_cache_read: bool = False,
|
||||||
no_cache_write: bool = False,
|
no_cache_write: bool = False,
|
||||||
|
shared_data: dict = None,
|
||||||
# Page Navigation and Timing Parameters
|
# Page Navigation and Timing Parameters
|
||||||
wait_until: str = "domcontentloaded",
|
wait_until: str = "domcontentloaded",
|
||||||
page_timeout: int = PAGE_TIMEOUT,
|
page_timeout: int = PAGE_TIMEOUT,
|
||||||
@@ -350,7 +412,6 @@ class CrawlerRunConfig:
|
|||||||
mean_delay: float = 0.1,
|
mean_delay: float = 0.1,
|
||||||
max_range: float = 0.3,
|
max_range: float = 0.3,
|
||||||
semaphore_count: int = 5,
|
semaphore_count: int = 5,
|
||||||
|
|
||||||
# Page Interaction Parameters
|
# Page Interaction Parameters
|
||||||
js_code: Union[str, List[str]] = None,
|
js_code: Union[str, List[str]] = None,
|
||||||
js_only: bool = False,
|
js_only: bool = False,
|
||||||
@@ -363,7 +424,6 @@ class CrawlerRunConfig:
|
|||||||
override_navigator: bool = False,
|
override_navigator: bool = False,
|
||||||
magic: bool = False,
|
magic: bool = False,
|
||||||
adjust_viewport_to_content: bool = False,
|
adjust_viewport_to_content: bool = False,
|
||||||
|
|
||||||
# Media Handling Parameters
|
# Media Handling Parameters
|
||||||
screenshot: bool = False,
|
screenshot: bool = False,
|
||||||
screenshot_wait_for: float = None,
|
screenshot_wait_for: float = None,
|
||||||
@@ -372,21 +432,21 @@ class CrawlerRunConfig:
|
|||||||
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||||
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
|
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
|
||||||
exclude_external_images: bool = False,
|
exclude_external_images: bool = False,
|
||||||
|
|
||||||
# Link and Domain Handling Parameters
|
# Link and Domain Handling Parameters
|
||||||
exclude_social_media_domains: list = None,
|
exclude_social_media_domains: list = None,
|
||||||
exclude_external_links: bool = False,
|
exclude_external_links: bool = False,
|
||||||
exclude_social_media_links: bool = False,
|
exclude_social_media_links: bool = False,
|
||||||
exclude_domains: list = None,
|
exclude_domains: list = None,
|
||||||
|
|
||||||
# Debugging and Logging Parameters
|
# Debugging and Logging Parameters
|
||||||
verbose: bool = True,
|
verbose: bool = True,
|
||||||
log_console: bool = False,
|
log_console: bool = False,
|
||||||
|
# Streaming Parameters
|
||||||
|
stream: bool = False,
|
||||||
url: str = None,
|
url: str = None,
|
||||||
|
check_robots_txt: bool = False,
|
||||||
):
|
):
|
||||||
self.url = url
|
self.url = url
|
||||||
|
|
||||||
# Content Processing Parameters
|
# Content Processing Parameters
|
||||||
self.word_count_threshold = word_count_threshold
|
self.word_count_threshold = word_count_threshold
|
||||||
self.extraction_strategy = extraction_strategy
|
self.extraction_strategy = extraction_strategy
|
||||||
@@ -401,6 +461,8 @@ class CrawlerRunConfig:
|
|||||||
self.remove_forms = remove_forms
|
self.remove_forms = remove_forms
|
||||||
self.prettiify = prettiify
|
self.prettiify = prettiify
|
||||||
self.parser_type = parser_type
|
self.parser_type = parser_type
|
||||||
|
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
|
||||||
|
self.proxy_config = proxy_config
|
||||||
|
|
||||||
# SSL Parameters
|
# SSL Parameters
|
||||||
self.fetch_ssl_certificate = fetch_ssl_certificate
|
self.fetch_ssl_certificate = fetch_ssl_certificate
|
||||||
@@ -412,6 +474,7 @@ class CrawlerRunConfig:
|
|||||||
self.disable_cache = disable_cache
|
self.disable_cache = disable_cache
|
||||||
self.no_cache_read = no_cache_read
|
self.no_cache_read = no_cache_read
|
||||||
self.no_cache_write = no_cache_write
|
self.no_cache_write = no_cache_write
|
||||||
|
self.shared_data = shared_data
|
||||||
|
|
||||||
# Page Navigation and Timing Parameters
|
# Page Navigation and Timing Parameters
|
||||||
self.wait_until = wait_until
|
self.wait_until = wait_until
|
||||||
@@ -446,7 +509,9 @@ class CrawlerRunConfig:
|
|||||||
self.exclude_external_images = exclude_external_images
|
self.exclude_external_images = exclude_external_images
|
||||||
|
|
||||||
# Link and Domain Handling Parameters
|
# Link and Domain Handling Parameters
|
||||||
self.exclude_social_media_domains = exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
|
self.exclude_social_media_domains = (
|
||||||
|
exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
|
||||||
|
)
|
||||||
self.exclude_external_links = exclude_external_links
|
self.exclude_external_links = exclude_external_links
|
||||||
self.exclude_social_media_links = exclude_social_media_links
|
self.exclude_social_media_links = exclude_social_media_links
|
||||||
self.exclude_domains = exclude_domains or []
|
self.exclude_domains = exclude_domains or []
|
||||||
@@ -455,19 +520,28 @@ class CrawlerRunConfig:
|
|||||||
self.verbose = verbose
|
self.verbose = verbose
|
||||||
self.log_console = log_console
|
self.log_console = log_console
|
||||||
|
|
||||||
|
# Streaming Parameters
|
||||||
|
self.stream = stream
|
||||||
|
|
||||||
|
# Robots.txt Handling Parameters
|
||||||
|
self.check_robots_txt = check_robots_txt
|
||||||
|
|
||||||
# Validate type of extraction strategy and chunking strategy if they are provided
|
# Validate type of extraction strategy and chunking strategy if they are provided
|
||||||
if self.extraction_strategy is not None and not isinstance(
|
if self.extraction_strategy is not None and not isinstance(
|
||||||
self.extraction_strategy, ExtractionStrategy
|
self.extraction_strategy, ExtractionStrategy
|
||||||
):
|
):
|
||||||
raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
|
raise ValueError(
|
||||||
|
"extraction_strategy must be an instance of ExtractionStrategy"
|
||||||
|
)
|
||||||
if self.chunking_strategy is not None and not isinstance(
|
if self.chunking_strategy is not None and not isinstance(
|
||||||
self.chunking_strategy, ChunkingStrategy
|
self.chunking_strategy, ChunkingStrategy
|
||||||
):
|
):
|
||||||
raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
|
raise ValueError(
|
||||||
|
"chunking_strategy must be an instance of ChunkingStrategy"
|
||||||
|
)
|
||||||
|
|
||||||
# Set default chunking strategy if None
|
# Set default chunking strategy if None
|
||||||
if self.chunking_strategy is None:
|
if self.chunking_strategy is None:
|
||||||
from .chunking_strategy import RegexChunking
|
|
||||||
self.chunking_strategy = RegexChunking()
|
self.chunking_strategy = RegexChunking()
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
@@ -476,7 +550,7 @@ class CrawlerRunConfig:
|
|||||||
# Content Processing Parameters
|
# Content Processing Parameters
|
||||||
word_count_threshold=kwargs.get("word_count_threshold", 200),
|
word_count_threshold=kwargs.get("word_count_threshold", 200),
|
||||||
extraction_strategy=kwargs.get("extraction_strategy"),
|
extraction_strategy=kwargs.get("extraction_strategy"),
|
||||||
chunking_strategy=kwargs.get("chunking_strategy"),
|
chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
|
||||||
markdown_generator=kwargs.get("markdown_generator"),
|
markdown_generator=kwargs.get("markdown_generator"),
|
||||||
content_filter=kwargs.get("content_filter"),
|
content_filter=kwargs.get("content_filter"),
|
||||||
only_text=kwargs.get("only_text", False),
|
only_text=kwargs.get("only_text", False),
|
||||||
@@ -487,10 +561,10 @@ class CrawlerRunConfig:
|
|||||||
remove_forms=kwargs.get("remove_forms", False),
|
remove_forms=kwargs.get("remove_forms", False),
|
||||||
prettiify=kwargs.get("prettiify", False),
|
prettiify=kwargs.get("prettiify", False),
|
||||||
parser_type=kwargs.get("parser_type", "lxml"),
|
parser_type=kwargs.get("parser_type", "lxml"),
|
||||||
|
scraping_strategy=kwargs.get("scraping_strategy"),
|
||||||
|
proxy_config=kwargs.get("proxy_config"),
|
||||||
# SSL Parameters
|
# SSL Parameters
|
||||||
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
|
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
|
||||||
|
|
||||||
# Caching Parameters
|
# Caching Parameters
|
||||||
cache_mode=kwargs.get("cache_mode"),
|
cache_mode=kwargs.get("cache_mode"),
|
||||||
session_id=kwargs.get("session_id"),
|
session_id=kwargs.get("session_id"),
|
||||||
@@ -498,7 +572,7 @@ class CrawlerRunConfig:
|
|||||||
disable_cache=kwargs.get("disable_cache", False),
|
disable_cache=kwargs.get("disable_cache", False),
|
||||||
no_cache_read=kwargs.get("no_cache_read", False),
|
no_cache_read=kwargs.get("no_cache_read", False),
|
||||||
no_cache_write=kwargs.get("no_cache_write", False),
|
no_cache_write=kwargs.get("no_cache_write", False),
|
||||||
|
shared_data=kwargs.get("shared_data", None),
|
||||||
# Page Navigation and Timing Parameters
|
# Page Navigation and Timing Parameters
|
||||||
wait_until=kwargs.get("wait_until", "domcontentloaded"),
|
wait_until=kwargs.get("wait_until", "domcontentloaded"),
|
||||||
page_timeout=kwargs.get("page_timeout", 60000),
|
page_timeout=kwargs.get("page_timeout", 60000),
|
||||||
@@ -508,7 +582,6 @@ class CrawlerRunConfig:
|
|||||||
mean_delay=kwargs.get("mean_delay", 0.1),
|
mean_delay=kwargs.get("mean_delay", 0.1),
|
||||||
max_range=kwargs.get("max_range", 0.3),
|
max_range=kwargs.get("max_range", 0.3),
|
||||||
semaphore_count=kwargs.get("semaphore_count", 5),
|
semaphore_count=kwargs.get("semaphore_count", 5),
|
||||||
|
|
||||||
# Page Interaction Parameters
|
# Page Interaction Parameters
|
||||||
js_code=kwargs.get("js_code"),
|
js_code=kwargs.get("js_code"),
|
||||||
js_only=kwargs.get("js_only", False),
|
js_only=kwargs.get("js_only", False),
|
||||||
@@ -521,29 +594,37 @@ class CrawlerRunConfig:
|
|||||||
override_navigator=kwargs.get("override_navigator", False),
|
override_navigator=kwargs.get("override_navigator", False),
|
||||||
magic=kwargs.get("magic", False),
|
magic=kwargs.get("magic", False),
|
||||||
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
|
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
|
||||||
|
|
||||||
# Media Handling Parameters
|
# Media Handling Parameters
|
||||||
screenshot=kwargs.get("screenshot", False),
|
screenshot=kwargs.get("screenshot", False),
|
||||||
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
|
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
|
||||||
screenshot_height_threshold=kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD),
|
screenshot_height_threshold=kwargs.get(
|
||||||
|
"screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
|
||||||
|
),
|
||||||
pdf=kwargs.get("pdf", False),
|
pdf=kwargs.get("pdf", False),
|
||||||
image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
|
image_description_min_word_threshold=kwargs.get(
|
||||||
image_score_threshold=kwargs.get("image_score_threshold", IMAGE_SCORE_THRESHOLD),
|
"image_description_min_word_threshold",
|
||||||
|
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||||
|
),
|
||||||
|
image_score_threshold=kwargs.get(
|
||||||
|
"image_score_threshold", IMAGE_SCORE_THRESHOLD
|
||||||
|
),
|
||||||
exclude_external_images=kwargs.get("exclude_external_images", False),
|
exclude_external_images=kwargs.get("exclude_external_images", False),
|
||||||
|
|
||||||
# Link and Domain Handling Parameters
|
# Link and Domain Handling Parameters
|
||||||
exclude_social_media_domains=kwargs.get("exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS),
|
exclude_social_media_domains=kwargs.get(
|
||||||
|
"exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
|
||||||
|
),
|
||||||
exclude_external_links=kwargs.get("exclude_external_links", False),
|
exclude_external_links=kwargs.get("exclude_external_links", False),
|
||||||
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
|
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
|
||||||
exclude_domains=kwargs.get("exclude_domains", []),
|
exclude_domains=kwargs.get("exclude_domains", []),
|
||||||
|
|
||||||
# Debugging and Logging Parameters
|
# Debugging and Logging Parameters
|
||||||
verbose=kwargs.get("verbose", True),
|
verbose=kwargs.get("verbose", True),
|
||||||
log_console=kwargs.get("log_console", False),
|
log_console=kwargs.get("log_console", False),
|
||||||
|
# Streaming Parameters
|
||||||
|
stream=kwargs.get("stream", False),
|
||||||
url=kwargs.get("url"),
|
url=kwargs.get("url"),
|
||||||
|
check_robots_txt=kwargs.get("check_robots_txt", False),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Create a funciton returns dict of the object
|
# Create a funciton returns dict of the object
|
||||||
def to_dict(self):
|
def to_dict(self):
|
||||||
return {
|
return {
|
||||||
@@ -560,6 +641,8 @@ class CrawlerRunConfig:
|
|||||||
"remove_forms": self.remove_forms,
|
"remove_forms": self.remove_forms,
|
||||||
"prettiify": self.prettiify,
|
"prettiify": self.prettiify,
|
||||||
"parser_type": self.parser_type,
|
"parser_type": self.parser_type,
|
||||||
|
"scraping_strategy": self.scraping_strategy,
|
||||||
|
"proxy_config": self.proxy_config,
|
||||||
"fetch_ssl_certificate": self.fetch_ssl_certificate,
|
"fetch_ssl_certificate": self.fetch_ssl_certificate,
|
||||||
"cache_mode": self.cache_mode,
|
"cache_mode": self.cache_mode,
|
||||||
"session_id": self.session_id,
|
"session_id": self.session_id,
|
||||||
@@ -567,6 +650,7 @@ class CrawlerRunConfig:
|
|||||||
"disable_cache": self.disable_cache,
|
"disable_cache": self.disable_cache,
|
||||||
"no_cache_read": self.no_cache_read,
|
"no_cache_read": self.no_cache_read,
|
||||||
"no_cache_write": self.no_cache_write,
|
"no_cache_write": self.no_cache_write,
|
||||||
|
"shared_data": self.shared_data,
|
||||||
"wait_until": self.wait_until,
|
"wait_until": self.wait_until,
|
||||||
"page_timeout": self.page_timeout,
|
"page_timeout": self.page_timeout,
|
||||||
"wait_for": self.wait_for,
|
"wait_for": self.wait_for,
|
||||||
@@ -599,5 +683,33 @@ class CrawlerRunConfig:
|
|||||||
"exclude_domains": self.exclude_domains,
|
"exclude_domains": self.exclude_domains,
|
||||||
"verbose": self.verbose,
|
"verbose": self.verbose,
|
||||||
"log_console": self.log_console,
|
"log_console": self.log_console,
|
||||||
|
"stream": self.stream,
|
||||||
"url": self.url,
|
"url": self.url,
|
||||||
|
"check_robots_txt": self.check_robots_txt,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
def clone(self, **kwargs):
|
||||||
|
"""Create a copy of this configuration with updated values.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
**kwargs: Key-value pairs of configuration options to update
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
CrawlerRunConfig: A new instance with the specified updates
|
||||||
|
|
||||||
|
Example:
|
||||||
|
```python
|
||||||
|
# Create a new config with streaming enabled
|
||||||
|
stream_config = config.clone(stream=True)
|
||||||
|
|
||||||
|
# Create a new config with multiple updates
|
||||||
|
new_config = config.clone(
|
||||||
|
stream=True,
|
||||||
|
cache_mode=CacheMode.BYPASS,
|
||||||
|
verbose=True
|
||||||
|
)
|
||||||
|
```
|
||||||
|
"""
|
||||||
|
config_dict = self.to_dict()
|
||||||
|
config_dict.update(kwargs)
|
||||||
|
return CrawlerRunConfig.from_kwargs(config_dict)
|
||||||
|
|||||||
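The diff above gives both `BrowserConfig` and `CrawlerRunConfig` the same `to_dict`/`from_kwargs`/`clone` round-trip. A minimal, self-contained sketch of that pattern (a hypothetical simplified `Config` class, not the library's actual one) shows why `clone` leaves the original instance untouched:

```python
# Hypothetical simplified config class illustrating the clone-via-to_dict
# pattern added to BrowserConfig and CrawlerRunConfig in this commit.
class Config:
    def __init__(self, stream=False, check_robots_txt=False, verbose=True):
        self.stream = stream
        self.check_robots_txt = check_robots_txt
        self.verbose = verbose

    def to_dict(self):
        # Serialize current state to a plain dict.
        return {
            "stream": self.stream,
            "check_robots_txt": self.check_robots_txt,
            "verbose": self.verbose,
        }

    @staticmethod
    def from_kwargs(kwargs):
        # Rebuild an instance from a dict of options.
        return Config(**kwargs)

    def clone(self, **kwargs):
        # Copy current state, overlay the updates, and construct a NEW
        # instance; the original object is never mutated.
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return Config.from_kwargs(config_dict)


base = Config()
stream_config = base.clone(stream=True, check_robots_txt=True)
print(base.stream, stream_config.stream)  # False True
```

Because `clone` round-trips through `to_dict`, any field added to the class only needs to appear in `to_dict`/`from_kwargs` once to be cloneable.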
File diff suppressed because it is too large
```diff
@@ -1,27 +1,30 @@
-import os, sys
+import os
 from pathlib import Path
 import aiosqlite
 import asyncio
-from typing import Optional, Tuple, Dict
+from typing import Optional, Dict
 from contextlib import asynccontextmanager
 import logging
 import json  # Added for serialization/deserialization
 from .utils import ensure_content_dirs, generate_content_hash
 from .models import CrawlResult, MarkdownGenerationResult
-import xxhash
 import aiofiles
-from .config import NEED_MIGRATION
 from .version_manager import VersionManager
 from .async_logger import AsyncLogger
 from .utils import get_error_context, create_box_message
-# Set up logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)

-base_directory = DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+# Set up logging
+# logging.basicConfig(level=logging.INFO)
+# logger = logging.getLogger(__name__)
+# logger.setLevel(logging.INFO)
+
+base_directory = DB_PATH = os.path.join(
+    os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
+)
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(base_directory, "crawl4ai.db")


 class AsyncDatabaseManager:
     def __init__(self, pool_size: int = 10, max_retries: int = 3):
         self.db_path = DB_PATH
@@ -32,28 +35,27 @@ class AsyncDatabaseManager:
         self.pool_lock = asyncio.Lock()
         self.init_lock = asyncio.Lock()
         self.connection_semaphore = asyncio.Semaphore(pool_size)
         self._initialized = False
         self.version_manager = VersionManager()
         self.logger = AsyncLogger(
             log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
             verbose=False,
-            tag_width=10
+            tag_width=10,
         )

     async def initialize(self):
         """Initialize the database and connection pool"""
         try:
             self.logger.info("Initializing database", tag="INIT")
             # Ensure the database file exists
             os.makedirs(os.path.dirname(self.db_path), exist_ok=True)

             # Check if version update is needed
             needs_update = self.version_manager.needs_update()

             # Always ensure base table exists
             await self.ainit_db()

             # Verify the table exists
             async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
                 async with db.execute(
@@ -62,33 +64,37 @@ class AsyncDatabaseManager:
                     result = await cursor.fetchone()
                     if not result:
                         raise Exception("crawled_data table was not created")

             # If version changed or fresh install, run updates
             if needs_update:
                 self.logger.info("New version detected, running updates", tag="INIT")
                 await self.update_db_schema()
-                from .migrations import run_migration  # Import here to avoid circular imports
+                from .migrations import (
+                    run_migration,
+                )  # Import here to avoid circular imports
+
                 await run_migration()
                 self.version_manager.update_version()  # Update stored version after successful migration
-                self.logger.success("Version update completed successfully", tag="COMPLETE")
+                self.logger.success(
+                    "Version update completed successfully", tag="COMPLETE"
+                )
             else:
-                self.logger.success("Database initialization completed successfully", tag="COMPLETE")
+                self.logger.success(
+                    "Database initialization completed successfully", tag="COMPLETE"
+                )

         except Exception as e:
             self.logger.error(
                 message="Database initialization error: {error}",
                 tag="ERROR",
-                params={"error": str(e)}
+                params={"error": str(e)},
             )
             self.logger.info(
-                message="Database will be initialized on first use",
-                tag="INIT"
+                message="Database will be initialized on first use", tag="INIT"
             )

             raise

     async def cleanup(self):
         """Cleanup connections when shutting down"""
         async with self.pool_lock:
@@ -107,6 +113,7 @@ class AsyncDatabaseManager:
             self._initialized = True
         except Exception as e:
             import sys
+
             error_context = get_error_context(sys.exc_info())
             self.logger.error(
                 message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
@@ -115,41 +122,52 @@ class AsyncDatabaseManager:
                 params={
                     "error": str(e),
                     "context": error_context["code_context"],
-                    "traceback": error_context["full_traceback"]
-                }
+                    "traceback": error_context["full_traceback"],
+                },
             )
             raise

         await self.connection_semaphore.acquire()
         task_id = id(asyncio.current_task())

         try:
             async with self.pool_lock:
                 if task_id not in self.connection_pool:
                     try:
-                        conn = await aiosqlite.connect(
-                            self.db_path,
-                            timeout=30.0
-                        )
-                        await conn.execute('PRAGMA journal_mode = WAL')
-                        await conn.execute('PRAGMA busy_timeout = 5000')
+                        conn = await aiosqlite.connect(self.db_path, timeout=30.0)
+                        await conn.execute("PRAGMA journal_mode = WAL")
+                        await conn.execute("PRAGMA busy_timeout = 5000")

                         # Verify database structure
-                        async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
+                        async with conn.execute(
+                            "PRAGMA table_info(crawled_data)"
+                        ) as cursor:
                             columns = await cursor.fetchall()
                             column_names = [col[1] for col in columns]
                             expected_columns = {
-                                'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
-                                'success', 'media', 'links', 'metadata', 'screenshot',
-                                'response_headers', 'downloaded_files'
+                                "url",
+                                "html",
+                                "cleaned_html",
+                                "markdown",
+                                "extracted_content",
+                                "success",
+                                "media",
+                                "links",
+                                "metadata",
+                                "screenshot",
+                                "response_headers",
+                                "downloaded_files",
                             }
                             missing_columns = expected_columns - set(column_names)
                             if missing_columns:
-                                raise ValueError(f"Database missing columns: {missing_columns}")
+                                raise ValueError(
+                                    f"Database missing columns: {missing_columns}"
+                                )

                         self.connection_pool[task_id] = conn
                     except Exception as e:
                         import sys

                         error_context = get_error_context(sys.exc_info())
                         error_message = (
                             f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -158,7 +176,7 @@ class AsyncDatabaseManager:
                             f"Code context:\n{error_context['code_context']}"
                         )
                         self.logger.error(
-                            message=create_box_message(error_message, type= "error"),
+                            message=create_box_message(error_message, type="error"),
                         )

                         raise
@@ -167,6 +185,7 @@ class AsyncDatabaseManager:

         except Exception as e:
             import sys
+
             error_context = get_error_context(sys.exc_info())
             error_message = (
                 f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -175,7 +194,7 @@ class AsyncDatabaseManager:
                 f"Code context:\n{error_context['code_context']}"
             )
             self.logger.error(
-                message=create_box_message(error_message, type= "error"),
+                message=create_box_message(error_message, type="error"),
             )
             raise
         finally:
@@ -185,7 +204,6 @@ class AsyncDatabaseManager:
                 del self.connection_pool[task_id]
             self.connection_semaphore.release()
-

     async def execute_with_retry(self, operation, *args):
         """Execute database operations with retry logic"""
         for attempt in range(self.max_retries):
@@ -200,18 +218,16 @@ class AsyncDatabaseManager:
                     message="Operation failed after {retries} attempts: {error}",
                     tag="ERROR",
                     force_verbose=True,
-                    params={
-                        "retries": self.max_retries,
-                        "error": str(e)
-                    }
+                    params={"retries": self.max_retries, "error": str(e)},
                 )
                 raise
             await asyncio.sleep(1 * (attempt + 1))  # Exponential backoff
```
|
||||||
|
|
||||||
async def ainit_db(self):
|
async def ainit_db(self):
|
||||||
"""Initialize database schema"""
|
"""Initialize database schema"""
|
||||||
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
|
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
|
||||||
await db.execute('''
|
await db.execute(
|
||||||
|
"""
|
||||||
CREATE TABLE IF NOT EXISTS crawled_data (
|
CREATE TABLE IF NOT EXISTS crawled_data (
|
||||||
url TEXT PRIMARY KEY,
|
url TEXT PRIMARY KEY,
|
||||||
html TEXT,
|
html TEXT,
|
||||||
@@ -226,21 +242,27 @@ class AsyncDatabaseManager:
|
|||||||
response_headers TEXT DEFAULT "{}",
|
response_headers TEXT DEFAULT "{}",
|
||||||
downloaded_files TEXT DEFAULT "{}" -- New column added
|
downloaded_files TEXT DEFAULT "{}" -- New column added
|
||||||
)
|
)
|
||||||
''')
|
"""
|
||||||
|
)
|
||||||
await db.commit()
|
await db.commit()
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
async def update_db_schema(self):
|
async def update_db_schema(self):
|
||||||
"""Update database schema if needed"""
|
"""Update database schema if needed"""
|
||||||
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
|
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
|
||||||
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
||||||
columns = await cursor.fetchall()
|
columns = await cursor.fetchall()
|
||||||
column_names = [column[1] for column in columns]
|
column_names = [column[1] for column in columns]
|
||||||
|
|
||||||
# List of new columns to add
|
# List of new columns to add
|
||||||
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
|
new_columns = [
|
||||||
|
"media",
|
||||||
|
"links",
|
||||||
|
"metadata",
|
||||||
|
"screenshot",
|
||||||
|
"response_headers",
|
||||||
|
"downloaded_files",
|
||||||
|
]
|
||||||
|
|
||||||
for column in new_columns:
|
for column in new_columns:
|
||||||
if column not in column_names:
|
if column not in column_names:
|
||||||
await self.aalter_db_add_column(column, db)
|
await self.aalter_db_add_column(column, db)
|
||||||
@@ -248,75 +270,95 @@ class AsyncDatabaseManager:
|
|||||||
|
|
||||||
async def aalter_db_add_column(self, new_column: str, db):
|
async def aalter_db_add_column(self, new_column: str, db):
|
||||||
"""Add new column to the database"""
|
"""Add new column to the database"""
|
||||||
if new_column == 'response_headers':
|
if new_column == "response_headers":
|
||||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
|
await db.execute(
|
||||||
|
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
await db.execute(
|
||||||
|
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
|
||||||
|
)
|
||||||
self.logger.info(
|
self.logger.info(
|
||||||
message="Added column '{column}' to the database",
|
message="Added column '{column}' to the database",
|
||||||
tag="INIT",
|
tag="INIT",
|
||||||
params={"column": new_column}
|
params={"column": new_column},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
|
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
|
||||||
"""Retrieve cached URL data as CrawlResult"""
|
"""Retrieve cached URL data as CrawlResult"""
|
||||||
|
|
||||||
async def _get(db):
|
async def _get(db):
|
||||||
async with db.execute(
|
async with db.execute(
|
||||||
'SELECT * FROM crawled_data WHERE url = ?', (url,)
|
"SELECT * FROM crawled_data WHERE url = ?", (url,)
|
||||||
) as cursor:
|
) as cursor:
|
||||||
row = await cursor.fetchone()
|
row = await cursor.fetchone()
|
||||||
if not row:
|
if not row:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
# Get column names
|
# Get column names
|
||||||
columns = [description[0] for description in cursor.description]
|
columns = [description[0] for description in cursor.description]
|
||||||
# Create dict from row data
|
# Create dict from row data
|
||||||
row_dict = dict(zip(columns, row))
|
row_dict = dict(zip(columns, row))
|
||||||
|
|
||||||
# Load content from files using stored hashes
|
# Load content from files using stored hashes
|
||||||
content_fields = {
|
content_fields = {
|
||||||
'html': row_dict['html'],
|
"html": row_dict["html"],
|
||||||
'cleaned_html': row_dict['cleaned_html'],
|
"cleaned_html": row_dict["cleaned_html"],
|
||||||
'markdown': row_dict['markdown'],
|
"markdown": row_dict["markdown"],
|
||||||
'extracted_content': row_dict['extracted_content'],
|
"extracted_content": row_dict["extracted_content"],
|
||||||
'screenshot': row_dict['screenshot'],
|
"screenshot": row_dict["screenshot"],
|
||||||
'screenshots': row_dict['screenshot'],
|
"screenshots": row_dict["screenshot"],
|
||||||
}
|
}
|
||||||
|
|
||||||
for field, hash_value in content_fields.items():
|
for field, hash_value in content_fields.items():
|
||||||
if hash_value:
|
if hash_value:
|
||||||
content = await self._load_content(
|
content = await self._load_content(
|
||||||
hash_value,
|
hash_value,
|
||||||
field.split('_')[0] # Get content type from field name
|
field.split("_")[0], # Get content type from field name
|
||||||
)
|
)
|
||||||
row_dict[field] = content or ""
|
row_dict[field] = content or ""
|
||||||
else:
|
else:
|
||||||
row_dict[field] = ""
|
row_dict[field] = ""
|
||||||
|
|
||||||
# Parse JSON fields
|
# Parse JSON fields
|
||||||
json_fields = ['media', 'links', 'metadata', 'response_headers', 'markdown']
|
json_fields = [
|
||||||
|
"media",
|
||||||
|
"links",
|
||||||
|
"metadata",
|
||||||
|
"response_headers",
|
||||||
|
"markdown",
|
||||||
|
]
|
||||||
for field in json_fields:
|
for field in json_fields:
|
||||||
try:
|
try:
|
||||||
row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {}
|
row_dict[field] = (
|
||||||
|
json.loads(row_dict[field]) if row_dict[field] else {}
|
||||||
|
)
|
||||||
except json.JSONDecodeError:
|
except json.JSONDecodeError:
|
||||||
row_dict[field] = {}
|
# Very UGLY, never mention it to me please
|
||||||
|
if field == "markdown" and isinstance(row_dict[field], str):
|
||||||
|
row_dict[field] = row_dict[field]
|
||||||
|
else:
|
||||||
|
row_dict[field] = {}
|
||||||
|
|
||||||
|
if isinstance(row_dict["markdown"], Dict):
|
||||||
|
row_dict["markdown_v2"] = row_dict["markdown"]
|
||||||
|
if row_dict["markdown"].get("raw_markdown"):
|
||||||
|
row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
|
||||||
|
|
||||||
if isinstance(row_dict['markdown'], Dict):
|
|
||||||
row_dict['markdown_v2'] = row_dict['markdown']
|
|
||||||
if row_dict['markdown'].get('raw_markdown'):
|
|
||||||
row_dict['markdown'] = row_dict['markdown']['raw_markdown']
|
|
||||||
|
|
||||||
# Parse downloaded_files
|
# Parse downloaded_files
|
||||||
try:
|
try:
|
||||||
row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else []
|
row_dict["downloaded_files"] = (
|
||||||
|
json.loads(row_dict["downloaded_files"])
|
||||||
|
if row_dict["downloaded_files"]
|
||||||
|
else []
|
||||||
|
)
|
||||||
except json.JSONDecodeError:
|
except json.JSONDecodeError:
|
||||||
row_dict['downloaded_files'] = []
|
row_dict["downloaded_files"] = []
|
||||||
|
|
||||||
# Remove any fields not in CrawlResult model
|
# Remove any fields not in CrawlResult model
|
||||||
valid_fields = CrawlResult.__annotations__.keys()
|
valid_fields = CrawlResult.__annotations__.keys()
|
||||||
filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
|
filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
|
||||||
|
|
||||||
return CrawlResult(**filtered_dict)
|
return CrawlResult(**filtered_dict)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
@@ -326,7 +368,7 @@ class AsyncDatabaseManager:
|
|||||||
message="Error retrieving cached URL: {error}",
|
message="Error retrieving cached URL: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
return None
|
return None
|
||||||
|
|
||||||
@@ -334,37 +376,52 @@ class AsyncDatabaseManager:
|
|||||||
"""Cache CrawlResult data"""
|
"""Cache CrawlResult data"""
|
||||||
# Store content files and get hashes
|
# Store content files and get hashes
|
||||||
content_map = {
|
content_map = {
|
||||||
'html': (result.html, 'html'),
|
"html": (result.html, "html"),
|
||||||
'cleaned_html': (result.cleaned_html or "", 'cleaned'),
|
"cleaned_html": (result.cleaned_html or "", "cleaned"),
|
||||||
'markdown': None,
|
"markdown": None,
|
||||||
'extracted_content': (result.extracted_content or "", 'extracted'),
|
"extracted_content": (result.extracted_content or "", "extracted"),
|
||||||
'screenshot': (result.screenshot or "", 'screenshots')
|
"screenshot": (result.screenshot or "", "screenshots"),
|
||||||
}
|
}
|
||||||
|
|
||||||
try:
|
try:
|
||||||
if isinstance(result.markdown, MarkdownGenerationResult):
|
if isinstance(result.markdown, MarkdownGenerationResult):
|
||||||
content_map['markdown'] = (result.markdown.model_dump_json(), 'markdown')
|
content_map["markdown"] = (
|
||||||
elif hasattr(result, 'markdown_v2'):
|
result.markdown.model_dump_json(),
|
||||||
content_map['markdown'] = (result.markdown_v2.model_dump_json(), 'markdown')
|
"markdown",
|
||||||
|
)
|
||||||
|
elif hasattr(result, "markdown_v2"):
|
||||||
|
content_map["markdown"] = (
|
||||||
|
result.markdown_v2.model_dump_json(),
|
||||||
|
"markdown",
|
||||||
|
)
|
||||||
elif isinstance(result.markdown, str):
|
elif isinstance(result.markdown, str):
|
||||||
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
|
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
|
||||||
content_map['markdown'] = (markdown_result.model_dump_json(), 'markdown')
|
content_map["markdown"] = (
|
||||||
|
markdown_result.model_dump_json(),
|
||||||
|
"markdown",
|
||||||
|
)
|
||||||
else:
|
else:
|
||||||
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown')
|
content_map["markdown"] = (
|
||||||
|
MarkdownGenerationResult().model_dump_json(),
|
||||||
|
"markdown",
|
||||||
|
)
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
self.logger.warning(
|
self.logger.warning(
|
||||||
message=f"Error processing markdown content: {str(e)}",
|
message=f"Error processing markdown content: {str(e)}", tag="WARNING"
|
||||||
tag="WARNING"
|
|
||||||
)
|
)
|
||||||
# Fallback to empty markdown result
|
# Fallback to empty markdown result
|
||||||
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown')
|
content_map["markdown"] = (
|
||||||
|
MarkdownGenerationResult().model_dump_json(),
|
||||||
|
"markdown",
|
||||||
|
)
|
||||||
|
|
||||||
content_hashes = {}
|
content_hashes = {}
|
||||||
for field, (content, content_type) in content_map.items():
|
for field, (content, content_type) in content_map.items():
|
||||||
content_hashes[field] = await self._store_content(content, content_type)
|
content_hashes[field] = await self._store_content(content, content_type)
|
||||||
|
|
||||||
async def _cache(db):
|
async def _cache(db):
|
||||||
await db.execute('''
|
await db.execute(
|
||||||
|
"""
|
||||||
INSERT INTO crawled_data (
|
INSERT INTO crawled_data (
|
||||||
url, html, cleaned_html, markdown,
|
url, html, cleaned_html, markdown,
|
||||||
extracted_content, success, media, links, metadata,
|
extracted_content, success, media, links, metadata,
|
||||||
@@ -383,20 +440,22 @@ class AsyncDatabaseManager:
|
|||||||
screenshot = excluded.screenshot,
|
screenshot = excluded.screenshot,
|
||||||
response_headers = excluded.response_headers,
|
response_headers = excluded.response_headers,
|
||||||
downloaded_files = excluded.downloaded_files
|
downloaded_files = excluded.downloaded_files
|
||||||
''', (
|
""",
|
||||||
result.url,
|
(
|
||||||
content_hashes['html'],
|
result.url,
|
||||||
content_hashes['cleaned_html'],
|
content_hashes["html"],
|
||||||
content_hashes['markdown'],
|
content_hashes["cleaned_html"],
|
||||||
content_hashes['extracted_content'],
|
content_hashes["markdown"],
|
||||||
result.success,
|
content_hashes["extracted_content"],
|
||||||
json.dumps(result.media),
|
result.success,
|
||||||
json.dumps(result.links),
|
json.dumps(result.media),
|
||||||
json.dumps(result.metadata or {}),
|
json.dumps(result.links),
|
||||||
content_hashes['screenshot'],
|
json.dumps(result.metadata or {}),
|
||||||
json.dumps(result.response_headers or {}),
|
content_hashes["screenshot"],
|
||||||
json.dumps(result.downloaded_files or [])
|
json.dumps(result.response_headers or {}),
|
||||||
))
|
json.dumps(result.downloaded_files or []),
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
await self.execute_with_retry(_cache)
|
await self.execute_with_retry(_cache)
|
||||||
@@ -405,14 +464,14 @@ class AsyncDatabaseManager:
|
|||||||
message="Error caching URL: {error}",
|
message="Error caching URL: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def aget_total_count(self) -> int:
|
async def aget_total_count(self) -> int:
|
||||||
"""Get total number of cached URLs"""
|
"""Get total number of cached URLs"""
|
||||||
|
|
||||||
async def _count(db):
|
async def _count(db):
|
||||||
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
|
async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
|
||||||
result = await cursor.fetchone()
|
result = await cursor.fetchone()
|
||||||
return result[0] if result else 0
|
return result[0] if result else 0
|
||||||
|
|
||||||
@@ -423,14 +482,15 @@ class AsyncDatabaseManager:
|
|||||||
message="Error getting total count: {error}",
|
message="Error getting total count: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
async def aclear_db(self):
|
async def aclear_db(self):
|
||||||
"""Clear all data from the database"""
|
"""Clear all data from the database"""
|
||||||
|
|
||||||
async def _clear(db):
|
async def _clear(db):
|
||||||
await db.execute('DELETE FROM crawled_data')
|
await db.execute("DELETE FROM crawled_data")
|
||||||
|
|
||||||
try:
|
try:
|
||||||
await self.execute_with_retry(_clear)
|
await self.execute_with_retry(_clear)
|
||||||
@@ -439,13 +499,14 @@ class AsyncDatabaseManager:
|
|||||||
message="Error clearing database: {error}",
|
message="Error clearing database: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
|
|
||||||
async def aflush_db(self):
|
async def aflush_db(self):
|
||||||
"""Drop the entire table"""
|
"""Drop the entire table"""
|
||||||
|
|
||||||
async def _flush(db):
|
async def _flush(db):
|
||||||
await db.execute('DROP TABLE IF EXISTS crawled_data')
|
await db.execute("DROP TABLE IF EXISTS crawled_data")
|
||||||
|
|
||||||
try:
|
try:
|
||||||
await self.execute_with_retry(_flush)
|
await self.execute_with_retry(_flush)
|
||||||
@@ -454,42 +515,44 @@ class AsyncDatabaseManager:
|
|||||||
message="Error flushing database: {error}",
|
message="Error flushing database: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
async def _store_content(self, content: str, content_type: str) -> str:
|
async def _store_content(self, content: str, content_type: str) -> str:
|
||||||
"""Store content in filesystem and return hash"""
|
"""Store content in filesystem and return hash"""
|
||||||
if not content:
|
if not content:
|
||||||
return ""
|
return ""
|
||||||
|
|
||||||
content_hash = generate_content_hash(content)
|
content_hash = generate_content_hash(content)
|
||||||
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
||||||
|
|
||||||
# Only write if file doesn't exist
|
# Only write if file doesn't exist
|
||||||
if not os.path.exists(file_path):
|
if not os.path.exists(file_path):
|
||||||
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
|
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
|
||||||
await f.write(content)
|
await f.write(content)
|
||||||
|
|
||||||
return content_hash
|
return content_hash
|
||||||
|
|
||||||
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
|
async def _load_content(
|
||||||
|
self, content_hash: str, content_type: str
|
||||||
|
) -> Optional[str]:
|
||||||
"""Load content from filesystem by hash"""
|
"""Load content from filesystem by hash"""
|
||||||
if not content_hash:
|
if not content_hash:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
||||||
try:
|
try:
|
||||||
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
|
async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
|
||||||
return await f.read()
|
return await f.read()
|
||||||
except:
|
except:
|
||||||
self.logger.error(
|
self.logger.error(
|
||||||
message="Failed to load content: {file_path}",
|
message="Failed to load content: {file_path}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
force_verbose=True,
|
force_verbose=True,
|
||||||
params={"file_path": file_path}
|
params={"file_path": file_path},
|
||||||
)
|
)
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
# Create a singleton instance
|
# Create a singleton instance
|
||||||
async_db_manager = AsyncDatabaseManager()
|
async_db_manager = AsyncDatabaseManager()
|
||||||
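The `_store_content` / `_load_content` pair above implements a content-addressed, write-once store: the database row holds only a hash, and the payload is written to disk at most once. A minimal synchronous sketch of that pattern follows; the `sha256` digest and `store_content` name here are assumptions for illustration (crawl4ai's own `generate_content_hash` may differ).

```python
import hashlib
import os
import tempfile


def store_content(content: str, root: str) -> str:
    """Write-once content store: hash the payload, write only if absent."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(root, content_hash)
    # Skip the write when this exact content was stored before
    if not os.path.exists(file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


root = tempfile.mkdtemp()
h1 = store_content("<html>hello</html>", root)
h2 = store_content("<html>hello</html>", root)  # second call finds the file and skips the write
```

Duplicate crawls of identical pages thus cost one file on disk, while each database row keeps only the short hash.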
647
crawl4ai/async_dispatcher.py
Normal file
@@ -0,0 +1,647 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
    CrawlResult,
    CrawlerTaskResult,
    CrawlStatus,
    DisplayMode,
    CrawlStats,
    DomainState,
)

from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
from collections.abc import AsyncGenerator
import time
import psutil
import asyncio
import uuid

from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod


class RateLimiter:
    def __init__(
        self,
        base_delay: Tuple[float, float] = (1.0, 3.0),
        max_delay: float = 60.0,
        max_retries: int = 3,
        rate_limit_codes: List[int] = None,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.rate_limit_codes = rate_limit_codes or [429, 503]
        self.domains: Dict[str, DomainState] = {}

    def get_domain(self, url: str) -> str:
        return urlparse(url).netloc

    async def wait_if_needed(self, url: str) -> None:
        domain = self.get_domain(url)
        state = self.domains.get(domain)

        if not state:
            self.domains[domain] = DomainState()
            state = self.domains[domain]

        now = time.time()
        if state.last_request_time:
            wait_time = max(0, state.current_delay - (now - state.last_request_time))
            if wait_time > 0:
                await asyncio.sleep(wait_time)

        # Random delay within base range if no current delay
        if state.current_delay == 0:
            state.current_delay = random.uniform(*self.base_delay)

        state.last_request_time = time.time()

    def update_delay(self, url: str, status_code: int) -> bool:
        domain = self.get_domain(url)
        state = self.domains[domain]

        if status_code in self.rate_limit_codes:
            state.fail_count += 1
            if state.fail_count > self.max_retries:
                return False

            # Exponential backoff with random jitter
            state.current_delay = min(
                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
            )
        else:
            # Gradually reduce delay on success
            state.current_delay = max(
                random.uniform(*self.base_delay), state.current_delay * 0.75
            )
            state.fail_count = 0

        return True
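The jittered exponential backoff in `RateLimiter.update_delay` can be sketched in isolation. This is a standalone toy, not the library's API; the doubling step, the 0.75–1.25 jitter band, and the 60 s cap mirror the constants in the class above, and the `next_delay` name is made up for the example.

```python
import random


def next_delay(current: float, max_delay: float = 60.0) -> float:
    # On a rate-limit response: double the delay, jitter by +/-25%, cap at max_delay
    return min(current * 2 * random.uniform(0.75, 1.25), max_delay)


random.seed(42)
delays = []
d = 1.0
for _ in range(8):
    d = next_delay(d)
    delays.append(d)
# Each retry multiplies the delay by a factor in [1.5, 2.5] until the cap kicks in
```

The jitter desynchronizes concurrent crawlers hitting the same domain, so retries do not arrive in lockstep after a 429/503.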
class CrawlerMonitor:
    def __init__(
        self,
        max_visible_rows: int = 15,
        display_mode: DisplayMode = DisplayMode.DETAILED,
    ):
        self.console = Console()
        self.max_visible_rows = max_visible_rows
        self.display_mode = display_mode
        self.stats: Dict[str, CrawlStats] = {}
        self.process = psutil.Process()
        self.start_time = datetime.now()
        self.live = Live(self._create_table(), refresh_per_second=2)

    def start(self):
        self.live.start()

    def stop(self):
        self.live.stop()

    def add_task(self, task_id: str, url: str):
        self.stats[task_id] = CrawlStats(
            task_id=task_id, url=url, status=CrawlStatus.QUEUED
        )
        self.live.update(self._create_table())

    def update_task(self, task_id: str, **kwargs):
        if task_id in self.stats:
            for key, value in kwargs.items():
                setattr(self.stats[task_id], key, value)
            self.live.update(self._create_table())

    def _create_aggregated_table(self) -> Table:
        """Creates a compact table showing only aggregated statistics"""
        table = Table(
            box=box.ROUNDED,
            title="Crawler Status Overview",
            title_style="bold magenta",
            header_style="bold blue",
            show_lines=True,
        )

        # Calculate statistics
        total_tasks = len(self.stats)
        queued = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
        )
        in_progress = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )

        # Memory statistics
        current_memory = self.process.memory_info().rss / (1024 * 1024)
        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
        peak_memory = max(
            (stat.peak_memory for stat in self.stats.values()), default=0.0
        )

        # Duration
        duration = datetime.now() - self.start_time

        # Create status row
        table.add_column("Status", style="bold cyan")
        table.add_column("Count", justify="right")
        table.add_column("Percentage", justify="right")

        table.add_row("Total Tasks", str(total_tasks), "100%")
        table.add_row(
            "[yellow]In Queue[/yellow]",
            str(queued),
            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[blue]In Progress[/blue]",
            str(in_progress),
            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[green]Completed[/green]",
            str(completed),
            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[red]Failed[/red]",
            str(failed),
            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )

        # Add memory information
        table.add_section()
        table.add_row(
            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
        )
        table.add_row(
            "[yellow]Runtime[/yellow]",
            str(timedelta(seconds=int(duration.total_seconds()))),
            "",
        )

        return table

    def _create_detailed_table(self) -> Table:
        table = Table(
            box=box.ROUNDED,
            title="Crawler Performance Monitor",
            title_style="bold magenta",
            header_style="bold blue",
        )

        # Add columns
        table.add_column("Task ID", style="cyan", no_wrap=True)
        table.add_column("URL", style="cyan", no_wrap=True)
        table.add_column("Status", style="bold")
        table.add_column("Memory (MB)", justify="right")
        table.add_column("Peak (MB)", justify="right")
        table.add_column("Duration", justify="right")
        table.add_column("Info", style="italic")

        # Add summary row
        total_memory = sum(stat.memory_usage for stat in self.stats.values())
        active_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )

        table.add_row(
            "[bold yellow]SUMMARY",
            f"Total: {len(self.stats)}",
            f"Active: {active_count}",
            f"{total_memory:.1f}",
            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
            str(
                timedelta(
                    seconds=int((datetime.now() - self.start_time).total_seconds())
                )
            ),
            f"✓{completed_count} ✗{failed_count}",
            style="bold",
        )

        table.add_section()

        # Add rows for each task
        visible_stats = sorted(
            self.stats.values(),
            key=lambda x: (
                x.status != CrawlStatus.IN_PROGRESS,
                x.status != CrawlStatus.QUEUED,
                x.end_time or datetime.max,
            ),
        )[: self.max_visible_rows]

        for stat in visible_stats:
            status_style = {
                CrawlStatus.QUEUED: "white",
                CrawlStatus.IN_PROGRESS: "yellow",
                CrawlStatus.COMPLETED: "green",
                CrawlStatus.FAILED: "red",
            }[stat.status]

            table.add_row(
                stat.task_id[:8],  # Show first 8 chars of task ID
                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
                f"[{status_style}]{stat.status.value}[/{status_style}]",
                f"{stat.memory_usage:.1f}",
                f"{stat.peak_memory:.1f}",
                stat.duration,
                stat.error_message[:40] if stat.error_message else "",
            )

        return table

    def _create_table(self) -> Table:
        """Creates the appropriate table based on display mode"""
        if self.display_mode == DisplayMode.AGGREGATED:
            return self._create_aggregated_table()
        return self._create_detailed_table()


class BaseDispatcher(ABC):
    def __init__(
        self,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        self.crawler = None
        self._domain_last_hit: Dict[str, float] = {}
        self.concurrent_sessions = 0
        self.rate_limiter = rate_limiter
        self.monitor = monitor

    @abstractmethod
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> CrawlerTaskResult:
        pass

    @abstractmethod
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> List[CrawlerTaskResult]:
        pass


class MemoryAdaptiveDispatcher(BaseDispatcher):
    def __init__(
        self,
        memory_threshold_percent: float = 90.0,
        check_interval: float = 1.0,
        max_session_permit: int = 20,
        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.memory_threshold_percent = memory_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit
        self.memory_wait_timeout = memory_wait_timeout
        self.result_queue = asyncio.Queue()  # Queue for storing results

    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
    ) -> CrawlerTaskResult:
        start_time = datetime.now()
        error_message = ""
        memory_usage = peak_memory = 0.0

        try:
            if self.monitor:
                self.monitor.update_task(
                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
                )
            self.concurrent_sessions += 1

            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)

            process = psutil.Process()
            start_memory = process.memory_info().rss / (1024 * 1024)
            result = await self.crawler.arun(url, config=config, session_id=task_id)
            end_memory = process.memory_info().rss / (1024 * 1024)

            memory_usage = peak_memory = end_memory - start_memory

            if self.rate_limiter and result.status_code:
                if not self.rate_limiter.update_delay(url, result.status_code):
                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                    result = CrawlerTaskResult(
                        task_id=task_id,
                        url=url,
                        result=result,
                        memory_usage=memory_usage,
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=datetime.now(),
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
await self.result_queue.put(result)
|
||||||
|
return result
|
||||||
|
|
||||||
|
if not result.success:
|
||||||
|
error_message = result.error_message
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
elif self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
error_message = str(e)
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
result = CrawlResult(
|
||||||
|
url=url, html="", metadata={}, success=False, error_message=str(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
end_time = datetime.now()
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id,
|
||||||
|
end_time=end_time,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
self.concurrent_sessions -= 1
|
||||||
|
|
||||||
|
return CrawlerTaskResult(
|
||||||
|
task_id=task_id,
|
||||||
|
url=url,
|
||||||
|
result=result,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=end_time,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
async def run_urls(
|
||||||
|
self,
|
||||||
|
urls: List[str],
|
||||||
|
crawler: "AsyncWebCrawler", # noqa: F821
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
) -> List[CrawlerTaskResult]:
|
||||||
|
self.crawler = crawler
|
||||||
|
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.start()
|
||||||
|
|
||||||
|
try:
|
||||||
|
pending_tasks = []
|
||||||
|
active_tasks = []
|
||||||
|
task_queue = []
|
||||||
|
|
||||||
|
for url in urls:
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.add_task(task_id, url)
|
||||||
|
task_queue.append((url, task_id))
|
||||||
|
|
||||||
|
while task_queue or active_tasks:
|
||||||
|
wait_start_time = time.time()
|
||||||
|
while len(active_tasks) < self.max_session_permit and task_queue:
|
||||||
|
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
|
||||||
|
# Check if we've exceeded the timeout
|
||||||
|
if time.time() - wait_start_time > self.memory_wait_timeout:
|
||||||
|
raise MemoryError(
|
||||||
|
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
|
||||||
|
)
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
continue
|
||||||
|
|
||||||
|
url, task_id = task_queue.pop(0)
|
||||||
|
task = asyncio.create_task(self.crawl_url(url, config, task_id))
|
||||||
|
active_tasks.append(task)
|
||||||
|
|
||||||
|
if not active_tasks:
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
continue
|
||||||
|
|
||||||
|
done, pending = await asyncio.wait(
|
||||||
|
active_tasks, return_when=asyncio.FIRST_COMPLETED
|
||||||
|
)
|
||||||
|
|
||||||
|
pending_tasks.extend(done)
|
||||||
|
active_tasks = list(pending)
|
||||||
|
|
||||||
|
return await asyncio.gather(*pending_tasks)
|
||||||
|
finally:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.stop()
|
||||||
|
|
||||||
|
async def run_urls_stream(
|
||||||
|
self,
|
||||||
|
urls: List[str],
|
||||||
|
crawler: "AsyncWebCrawler",
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
) -> AsyncGenerator[CrawlerTaskResult, None]:
|
||||||
|
self.crawler = crawler
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.start()
|
||||||
|
|
||||||
|
try:
|
||||||
|
active_tasks = []
|
||||||
|
task_queue = []
|
||||||
|
completed_count = 0
|
||||||
|
total_urls = len(urls)
|
||||||
|
|
||||||
|
# Initialize task queue
|
||||||
|
for url in urls:
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.add_task(task_id, url)
|
||||||
|
task_queue.append((url, task_id))
|
||||||
|
|
||||||
|
while completed_count < total_urls:
|
||||||
|
# Start new tasks if memory permits
|
||||||
|
while len(active_tasks) < self.max_session_permit and task_queue:
|
||||||
|
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
continue
|
||||||
|
|
||||||
|
url, task_id = task_queue.pop(0)
|
||||||
|
task = asyncio.create_task(self.crawl_url(url, config, task_id))
|
||||||
|
active_tasks.append(task)
|
||||||
|
|
||||||
|
if not active_tasks and not task_queue:
|
||||||
|
break
|
||||||
|
|
||||||
|
# Wait for any task to complete and yield results
|
||||||
|
if active_tasks:
|
||||||
|
done, pending = await asyncio.wait(
|
||||||
|
active_tasks,
|
||||||
|
timeout=0.1,
|
||||||
|
return_when=asyncio.FIRST_COMPLETED
|
||||||
|
)
|
||||||
|
for completed_task in done:
|
||||||
|
result = await completed_task
|
||||||
|
completed_count += 1
|
||||||
|
yield result
|
||||||
|
active_tasks = list(pending)
|
||||||
|
else:
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.stop()
|
||||||
|
|
||||||
|
class SemaphoreDispatcher(BaseDispatcher):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
semaphore_count: int = 5,
|
||||||
|
max_session_permit: int = 20,
|
||||||
|
rate_limiter: Optional[RateLimiter] = None,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
):
|
||||||
|
super().__init__(rate_limiter, monitor)
|
||||||
|
self.semaphore_count = semaphore_count
|
||||||
|
self.max_session_permit = max_session_permit
|
||||||
|
|
||||||
|
async def crawl_url(
|
||||||
|
self,
|
||||||
|
url: str,
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
task_id: str,
|
||||||
|
semaphore: asyncio.Semaphore = None,
|
||||||
|
) -> CrawlerTaskResult:
|
||||||
|
start_time = datetime.now()
|
||||||
|
error_message = ""
|
||||||
|
memory_usage = peak_memory = 0.0
|
||||||
|
|
||||||
|
try:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
|
||||||
|
)
|
||||||
|
|
||||||
|
if self.rate_limiter:
|
||||||
|
await self.rate_limiter.wait_if_needed(url)
|
||||||
|
|
||||||
|
async with semaphore:
|
||||||
|
process = psutil.Process()
|
||||||
|
start_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
result = await self.crawler.arun(url, config=config, session_id=task_id)
|
||||||
|
end_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
|
||||||
|
memory_usage = peak_memory = end_memory - start_memory
|
||||||
|
|
||||||
|
if self.rate_limiter and result.status_code:
|
||||||
|
if not self.rate_limiter.update_delay(url, result.status_code):
|
||||||
|
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
return CrawlerTaskResult(
|
||||||
|
task_id=task_id,
|
||||||
|
url=url,
|
||||||
|
result=result,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=datetime.now(),
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not result.success:
|
||||||
|
error_message = result.error_message
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
elif self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
error_message = str(e)
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
result = CrawlResult(
|
||||||
|
url=url, html="", metadata={}, success=False, error_message=str(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
end_time = datetime.now()
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id,
|
||||||
|
end_time=end_time,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
return CrawlerTaskResult(
|
||||||
|
task_id=task_id,
|
||||||
|
url=url,
|
||||||
|
result=result,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=end_time,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
async def run_urls(
|
||||||
|
self,
|
||||||
|
crawler: "AsyncWebCrawler", # noqa: F821
|
||||||
|
urls: List[str],
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
) -> List[CrawlerTaskResult]:
|
||||||
|
self.crawler = crawler
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.start()
|
||||||
|
|
||||||
|
try:
|
||||||
|
semaphore = asyncio.Semaphore(self.semaphore_count)
|
||||||
|
tasks = []
|
||||||
|
|
||||||
|
for url in urls:
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.add_task(task_id, url)
|
||||||
|
task = asyncio.create_task(
|
||||||
|
self.crawl_url(url, config, task_id, semaphore)
|
||||||
|
)
|
||||||
|
tasks.append(task)
|
||||||
|
|
||||||
|
return await asyncio.gather(*tasks, return_exceptions=True)
|
||||||
|
finally:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.stop()
|
||||||
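The `run_urls` scheduler above caps concurrency at `max_session_permit` and drains finished tasks with `asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED)`. A minimal self-contained sketch of that scheduling loop, with a hypothetical `worker` coroutine standing in for `crawler.arun` and the memory check omitted:

```python
import asyncio


async def worker(i: int) -> int:
    # Stand-in for a crawl: finish after a short, staggered delay.
    await asyncio.sleep(0.01 * i)
    return i


async def main() -> list:
    # Mirror of run_urls' core: keep at most max_permit tasks in flight,
    # collect completed ones as soon as any single task finishes.
    max_permit = 2
    queue = list(range(5))
    active, finished = [], []
    while queue or active:
        while len(active) < max_permit and queue:
            active.append(asyncio.create_task(worker(queue.pop(0))))
        done, pending = await asyncio.wait(
            active, return_when=asyncio.FIRST_COMPLETED
        )
        finished.extend(done)
        active = list(pending)
    return sorted(t.result() for t in finished)


print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```

All five jobs complete even though only two run at a time; `FIRST_COMPLETED` lets the loop refill the active set without waiting for the whole batch.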
588 lines
crawl4ai/async_dispatcher_.py (new file)
@@ -0,0 +1,588 @@
|
|||||||
|
from typing import Dict, Optional, List, Tuple
|
||||||
|
from .async_configs import CrawlerRunConfig
|
||||||
|
from .models import (
|
||||||
|
CrawlResult,
|
||||||
|
CrawlerTaskResult,
|
||||||
|
CrawlStatus,
|
||||||
|
DisplayMode,
|
||||||
|
CrawlStats,
|
||||||
|
DomainState,
|
||||||
|
)
|
||||||
|
|
||||||
|
from rich.live import Live
|
||||||
|
from rich.table import Table
|
||||||
|
from rich.console import Console
|
||||||
|
from rich import box
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
|
||||||
|
import time
|
||||||
|
import psutil
|
||||||
|
import asyncio
|
||||||
|
import uuid
|
||||||
|
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
import random
|
||||||
|
from abc import ABC, abstractmethod
|
||||||
|
|
||||||
|
|
||||||
|
class RateLimiter:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
base_delay: Tuple[float, float] = (1.0, 3.0),
|
||||||
|
max_delay: float = 60.0,
|
||||||
|
max_retries: int = 3,
|
||||||
|
rate_limit_codes: List[int] = None,
|
||||||
|
):
|
||||||
|
self.base_delay = base_delay
|
||||||
|
self.max_delay = max_delay
|
||||||
|
self.max_retries = max_retries
|
||||||
|
self.rate_limit_codes = rate_limit_codes or [429, 503]
|
||||||
|
self.domains: Dict[str, DomainState] = {}
|
||||||
|
|
||||||
|
def get_domain(self, url: str) -> str:
|
||||||
|
return urlparse(url).netloc
|
||||||
|
|
||||||
|
async def wait_if_needed(self, url: str) -> None:
|
||||||
|
domain = self.get_domain(url)
|
||||||
|
state = self.domains.get(domain)
|
||||||
|
|
||||||
|
if not state:
|
||||||
|
self.domains[domain] = DomainState()
|
||||||
|
state = self.domains[domain]
|
||||||
|
|
||||||
|
now = time.time()
|
||||||
|
if state.last_request_time:
|
||||||
|
wait_time = max(0, state.current_delay - (now - state.last_request_time))
|
||||||
|
if wait_time > 0:
|
||||||
|
await asyncio.sleep(wait_time)
|
||||||
|
|
||||||
|
# Random delay within base range if no current delay
|
||||||
|
if state.current_delay == 0:
|
||||||
|
state.current_delay = random.uniform(*self.base_delay)
|
||||||
|
|
||||||
|
state.last_request_time = time.time()
|
||||||
|
|
||||||
|
def update_delay(self, url: str, status_code: int) -> bool:
|
||||||
|
domain = self.get_domain(url)
|
||||||
|
state = self.domains[domain]
|
||||||
|
|
||||||
|
if status_code in self.rate_limit_codes:
|
||||||
|
state.fail_count += 1
|
||||||
|
if state.fail_count > self.max_retries:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Exponential backoff with random jitter
|
||||||
|
state.current_delay = min(
|
||||||
|
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# Gradually reduce delay on success
|
||||||
|
state.current_delay = max(
|
||||||
|
random.uniform(*self.base_delay), state.current_delay * 0.75
|
||||||
|
)
|
||||||
|
state.fail_count = 0
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
class CrawlerMonitor:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
max_visible_rows: int = 15,
|
||||||
|
display_mode: DisplayMode = DisplayMode.DETAILED,
|
||||||
|
):
|
||||||
|
self.console = Console()
|
||||||
|
self.max_visible_rows = max_visible_rows
|
||||||
|
self.display_mode = display_mode
|
||||||
|
self.stats: Dict[str, CrawlStats] = {}
|
||||||
|
self.process = psutil.Process()
|
||||||
|
self.start_time = datetime.now()
|
||||||
|
self.live = Live(self._create_table(), refresh_per_second=2)
|
||||||
|
|
||||||
|
def start(self):
|
||||||
|
self.live.start()
|
||||||
|
|
||||||
|
def stop(self):
|
||||||
|
self.live.stop()
|
||||||
|
|
||||||
|
def add_task(self, task_id: str, url: str):
|
||||||
|
self.stats[task_id] = CrawlStats(
|
||||||
|
task_id=task_id, url=url, status=CrawlStatus.QUEUED
|
||||||
|
)
|
||||||
|
self.live.update(self._create_table())
|
||||||
|
|
||||||
|
def update_task(self, task_id: str, **kwargs):
|
||||||
|
if task_id in self.stats:
|
||||||
|
for key, value in kwargs.items():
|
||||||
|
setattr(self.stats[task_id], key, value)
|
||||||
|
self.live.update(self._create_table())
|
||||||
|
|
||||||
|
def _create_aggregated_table(self) -> Table:
|
||||||
|
"""Creates a compact table showing only aggregated statistics"""
|
||||||
|
table = Table(
|
||||||
|
box=box.ROUNDED,
|
||||||
|
title="Crawler Status Overview",
|
||||||
|
title_style="bold magenta",
|
||||||
|
header_style="bold blue",
|
||||||
|
show_lines=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Calculate statistics
|
||||||
|
total_tasks = len(self.stats)
|
||||||
|
queued = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
|
||||||
|
)
|
||||||
|
in_progress = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
|
||||||
|
)
|
||||||
|
completed = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
|
||||||
|
)
|
||||||
|
failed = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
|
||||||
|
)
|
||||||
|
|
||||||
|
# Memory statistics
|
||||||
|
current_memory = self.process.memory_info().rss / (1024 * 1024)
|
||||||
|
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
|
||||||
|
peak_memory = max(
|
||||||
|
(stat.peak_memory for stat in self.stats.values()), default=0.0
|
||||||
|
)
|
||||||
|
|
||||||
|
# Duration
|
||||||
|
duration = datetime.now() - self.start_time
|
||||||
|
|
||||||
|
# Create status row
|
||||||
|
table.add_column("Status", style="bold cyan")
|
||||||
|
table.add_column("Count", justify="right")
|
||||||
|
table.add_column("Percentage", justify="right")
|
||||||
|
|
||||||
|
table.add_row("Total Tasks", str(total_tasks), "100%")
|
||||||
|
table.add_row(
|
||||||
|
"[yellow]In Queue[/yellow]",
|
||||||
|
str(queued),
|
||||||
|
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[blue]In Progress[/blue]",
|
||||||
|
str(in_progress),
|
||||||
|
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[green]Completed[/green]",
|
||||||
|
str(completed),
|
||||||
|
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[red]Failed[/red]",
|
||||||
|
str(failed),
|
||||||
|
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Add memory information
|
||||||
|
table.add_section()
|
||||||
|
table.add_row(
|
||||||
|
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
|
||||||
|
)
|
||||||
|
table.add_row(
|
||||||
|
"[yellow]Runtime[/yellow]",
|
||||||
|
str(timedelta(seconds=int(duration.total_seconds()))),
|
||||||
|
"",
|
||||||
|
)
|
||||||
|
|
||||||
|
return table
|
||||||
|
|
||||||
|
def _create_detailed_table(self) -> Table:
|
||||||
|
table = Table(
|
||||||
|
box=box.ROUNDED,
|
||||||
|
title="Crawler Performance Monitor",
|
||||||
|
title_style="bold magenta",
|
||||||
|
header_style="bold blue",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Add columns
|
||||||
|
table.add_column("Task ID", style="cyan", no_wrap=True)
|
||||||
|
table.add_column("URL", style="cyan", no_wrap=True)
|
||||||
|
table.add_column("Status", style="bold")
|
||||||
|
table.add_column("Memory (MB)", justify="right")
|
||||||
|
table.add_column("Peak (MB)", justify="right")
|
||||||
|
table.add_column("Duration", justify="right")
|
||||||
|
table.add_column("Info", style="italic")
|
||||||
|
|
||||||
|
# Add summary row
|
||||||
|
total_memory = sum(stat.memory_usage for stat in self.stats.values())
|
||||||
|
active_count = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
|
||||||
|
)
|
||||||
|
completed_count = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
|
||||||
|
)
|
||||||
|
failed_count = sum(
|
||||||
|
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
|
||||||
|
)
|
||||||
|
|
||||||
|
table.add_row(
|
||||||
|
"[bold yellow]SUMMARY",
|
||||||
|
f"Total: {len(self.stats)}",
|
||||||
|
f"Active: {active_count}",
|
||||||
|
f"{total_memory:.1f}",
|
||||||
|
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
|
||||||
|
str(
|
||||||
|
timedelta(
|
||||||
|
seconds=int((datetime.now() - self.start_time).total_seconds())
|
||||||
|
)
|
||||||
|
),
|
||||||
|
f"✓{completed_count} ✗{failed_count}",
|
||||||
|
style="bold",
|
||||||
|
)
|
||||||
|
|
||||||
|
table.add_section()
|
||||||
|
|
||||||
|
# Add rows for each task
|
||||||
|
visible_stats = sorted(
|
||||||
|
self.stats.values(),
|
||||||
|
key=lambda x: (
|
||||||
|
x.status != CrawlStatus.IN_PROGRESS,
|
||||||
|
x.status != CrawlStatus.QUEUED,
|
||||||
|
x.end_time or datetime.max,
|
||||||
|
),
|
||||||
|
)[: self.max_visible_rows]
|
||||||
|
|
||||||
|
for stat in visible_stats:
|
||||||
|
status_style = {
|
||||||
|
CrawlStatus.QUEUED: "white",
|
||||||
|
CrawlStatus.IN_PROGRESS: "yellow",
|
||||||
|
CrawlStatus.COMPLETED: "green",
|
||||||
|
CrawlStatus.FAILED: "red",
|
||||||
|
}[stat.status]
|
||||||
|
|
||||||
|
table.add_row(
|
||||||
|
stat.task_id[:8], # Show first 8 chars of task ID
|
||||||
|
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
|
||||||
|
f"[{status_style}]{stat.status.value}[/{status_style}]",
|
||||||
|
f"{stat.memory_usage:.1f}",
|
||||||
|
f"{stat.peak_memory:.1f}",
|
||||||
|
stat.duration,
|
||||||
|
stat.error_message[:40] if stat.error_message else "",
|
||||||
|
)
|
||||||
|
|
||||||
|
return table
|
||||||
|
|
||||||
|
def _create_table(self) -> Table:
|
||||||
|
"""Creates the appropriate table based on display mode"""
|
||||||
|
if self.display_mode == DisplayMode.AGGREGATED:
|
||||||
|
return self._create_aggregated_table()
|
||||||
|
return self._create_detailed_table()
|
||||||
|
|
||||||
|
|
||||||
|
class BaseDispatcher(ABC):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
rate_limiter: Optional[RateLimiter] = None,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
):
|
||||||
|
self.crawler = None
|
||||||
|
self._domain_last_hit: Dict[str, float] = {}
|
||||||
|
self.concurrent_sessions = 0
|
||||||
|
self.rate_limiter = rate_limiter
|
||||||
|
self.monitor = monitor
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
async def crawl_url(
|
||||||
|
self,
|
||||||
|
url: str,
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
task_id: str,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
) -> CrawlerTaskResult:
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
async def run_urls(
|
||||||
|
self,
|
||||||
|
urls: List[str],
|
||||||
|
crawler: "AsyncWebCrawler", # noqa: F821
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
) -> List[CrawlerTaskResult]:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class MemoryAdaptiveDispatcher(BaseDispatcher):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
memory_threshold_percent: float = 90.0,
|
||||||
|
check_interval: float = 1.0,
|
||||||
|
max_session_permit: int = 20,
|
||||||
|
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
|
||||||
|
rate_limiter: Optional[RateLimiter] = None,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
):
|
||||||
|
super().__init__(rate_limiter, monitor)
|
||||||
|
self.memory_threshold_percent = memory_threshold_percent
|
||||||
|
self.check_interval = check_interval
|
||||||
|
self.max_session_permit = max_session_permit
|
||||||
|
self.memory_wait_timeout = memory_wait_timeout
|
||||||
|
|
||||||
|
async def crawl_url(
|
||||||
|
self,
|
||||||
|
url: str,
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
task_id: str,
|
||||||
|
) -> CrawlerTaskResult:
|
||||||
|
start_time = datetime.now()
|
||||||
|
error_message = ""
|
||||||
|
memory_usage = peak_memory = 0.0
|
||||||
|
|
||||||
|
try:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
|
||||||
|
)
|
||||||
|
self.concurrent_sessions += 1
|
||||||
|
|
||||||
|
if self.rate_limiter:
|
||||||
|
await self.rate_limiter.wait_if_needed(url)
|
||||||
|
|
||||||
|
process = psutil.Process()
|
||||||
|
start_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
result = await self.crawler.arun(url, config=config, session_id=task_id)
|
||||||
|
end_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
|
||||||
|
memory_usage = peak_memory = end_memory - start_memory
|
||||||
|
|
||||||
|
if self.rate_limiter and result.status_code:
|
||||||
|
if not self.rate_limiter.update_delay(url, result.status_code):
|
||||||
|
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
return CrawlerTaskResult(
|
||||||
|
task_id=task_id,
|
||||||
|
url=url,
|
||||||
|
result=result,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=datetime.now(),
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not result.success:
|
||||||
|
error_message = result.error_message
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
elif self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
error_message = str(e)
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
|
||||||
|
result = CrawlResult(
|
||||||
|
url=url, html="", metadata={}, success=False, error_message=str(e)
|
||||||
|
)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
end_time = datetime.now()
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id,
|
||||||
|
end_time=end_time,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
self.concurrent_sessions -= 1
|
||||||
|
|
||||||
|
return CrawlerTaskResult(
|
||||||
|
task_id=task_id,
|
||||||
|
url=url,
|
||||||
|
result=result,
|
||||||
|
memory_usage=memory_usage,
|
||||||
|
peak_memory=peak_memory,
|
||||||
|
start_time=start_time,
|
||||||
|
end_time=end_time,
|
||||||
|
error_message=error_message,
|
||||||
|
)
|
||||||
|
|
||||||
|
async def run_urls(
|
||||||
|
self,
|
||||||
|
urls: List[str],
|
||||||
|
crawler: "AsyncWebCrawler", # noqa: F821
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
) -> List[CrawlerTaskResult]:
|
||||||
|
self.crawler = crawler
|
||||||
|
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.start()
|
||||||
|
|
||||||
|
try:
|
||||||
|
pending_tasks = []
|
||||||
|
active_tasks = []
|
||||||
|
task_queue = []
|
||||||
|
|
||||||
|
for url in urls:
|
||||||
|
task_id = str(uuid.uuid4())
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.add_task(task_id, url)
|
||||||
|
task_queue.append((url, task_id))
|
||||||
|
|
||||||
|
while task_queue or active_tasks:
|
||||||
|
wait_start_time = time.time()
|
||||||
|
while len(active_tasks) < self.max_session_permit and task_queue:
|
||||||
|
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
|
||||||
|
# Check if we've exceeded the timeout
|
||||||
|
if time.time() - wait_start_time > self.memory_wait_timeout:
|
||||||
|
raise MemoryError(
|
||||||
|
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
|
||||||
|
)
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
continue
|
||||||
|
|
||||||
|
url, task_id = task_queue.pop(0)
|
||||||
|
task = asyncio.create_task(self.crawl_url(url, config, task_id))
|
||||||
|
active_tasks.append(task)
|
||||||
|
|
||||||
|
if not active_tasks:
|
||||||
|
await asyncio.sleep(self.check_interval)
|
||||||
|
continue
|
||||||
|
|
||||||
|
done, pending = await asyncio.wait(
|
||||||
|
active_tasks, return_when=asyncio.FIRST_COMPLETED
|
||||||
|
)
|
||||||
|
|
||||||
|
pending_tasks.extend(done)
|
||||||
|
active_tasks = list(pending)
|
||||||
|
|
||||||
|
return await asyncio.gather(*pending_tasks)
|
||||||
|
finally:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.stop()
|
||||||
|
|
||||||
|
|
||||||
|
class SemaphoreDispatcher(BaseDispatcher):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
semaphore_count: int = 5,
|
||||||
|
max_session_permit: int = 20,
|
||||||
|
rate_limiter: Optional[RateLimiter] = None,
|
||||||
|
monitor: Optional[CrawlerMonitor] = None,
|
||||||
|
):
|
||||||
|
super().__init__(rate_limiter, monitor)
|
||||||
|
self.semaphore_count = semaphore_count
|
||||||
|
self.max_session_permit = max_session_permit
|
||||||
|
|
||||||
|
async def crawl_url(
|
||||||
|
self,
|
||||||
|
url: str,
|
||||||
|
config: CrawlerRunConfig,
|
||||||
|
task_id: str,
|
||||||
|
semaphore: asyncio.Semaphore = None,
|
||||||
|
) -> CrawlerTaskResult:
|
||||||
|
start_time = datetime.now()
|
||||||
|
error_message = ""
|
||||||
|
memory_usage = peak_memory = 0.0
|
||||||
|
|
||||||
|
try:
|
||||||
|
if self.monitor:
|
||||||
|
self.monitor.update_task(
|
||||||
|
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
|
||||||
|
)
|
||||||
|
|
||||||
|
if self.rate_limiter:
|
||||||
|
await self.rate_limiter.wait_if_needed(url)
|
||||||
|
|
||||||
|
async with semaphore:
|
||||||
|
process = psutil.Process()
|
||||||
|
start_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
result = await self.crawler.arun(url, config=config, session_id=task_id)
|
||||||
|
end_memory = process.memory_info().rss / (1024 * 1024)
|
||||||
|
|
||||||
|
memory_usage = peak_memory = end_memory - start_memory
|
||||||
|
|
||||||
|
if self.rate_limiter and result.status_code:
|
||||||
|
if not self.rate_limiter.update_delay(url, result.status_code):
|
||||||
|
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
|
||||||
|
if self.monitor:
|
||||||
|
                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                return CrawlerTaskResult(
                    task_id=task_id,
                    url=url,
                    result=result,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    start_time=start_time,
                    end_time=datetime.now(),
                    error_message=error_message,
                )

            if not result.success:
                error_message = result.error_message
                if self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            elif self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)

        except Exception as e:
            error_message = str(e)
            if self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )

        finally:
            end_time = datetime.now()
            if self.monitor:
                self.monitor.update_task(
                    task_id,
                    end_time=end_time,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
                )

        return CrawlerTaskResult(
            task_id=task_id,
            url=url,
            result=result,
            memory_usage=memory_usage,
            peak_memory=peak_memory,
            start_time=start_time,
            end_time=end_time,
            error_message=error_message,
        )

    async def run_urls(
        self,
        crawler: "AsyncWebCrawler",  # noqa: F821
        urls: List[str],
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()

        try:
            semaphore = asyncio.Semaphore(self.semaphore_count)
            tasks = []

            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task = asyncio.create_task(
                    self.crawl_url(url, config, task_id, semaphore)
                )
                tasks.append(task)

            return await asyncio.gather(*tasks, return_exceptions=True)
        finally:
            if self.monitor:
                self.monitor.stop()
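The `run_urls` pattern above — one task per URL behind a shared semaphore, collected with `asyncio.gather(..., return_exceptions=True)` — can be sketched standalone. This is an illustrative reduction, not crawl4ai's API; the `fetch` coroutine here is a hypothetical stand-in for the real `crawl_url`:

```python
import asyncio


async def fetch(url: str, semaphore: asyncio.Semaphore) -> str:
    # The semaphore caps how many fetches run concurrently.
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for real network I/O
        return f"done:{url}"


async def run_urls(urls: list, limit: int = 2) -> list:
    semaphore = asyncio.Semaphore(limit)
    tasks = [asyncio.create_task(fetch(u, semaphore)) for u in urls]
    # return_exceptions=True keeps one failing URL from cancelling the rest;
    # results come back in the same order the tasks were created.
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(run_urls(["a", "b", "c"]))
```

`gather` preserving submission order is what lets the dispatcher pair each result back to its URL without extra bookkeeping.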
@@ -1,10 +1,10 @@
 from enum import Enum
-from typing import Optional, Dict, Any, Union
-from colorama import Fore, Back, Style, init
-import time
+from typing import Optional, Dict, Any
+from colorama import Fore, Style, init
 import os
 from datetime import datetime


 class LogLevel(Enum):
     DEBUG = 1
     INFO = 2
@@ -12,23 +12,24 @@ class LogLevel(Enum):
     WARNING = 4
     ERROR = 5


 class AsyncLogger:
     """
     Asynchronous logger with support for colored console output and file logging.
     Supports templated messages with colored components.
     """

     DEFAULT_ICONS = {
-        'INIT': '→',
-        'READY': '✓',
-        'FETCH': '↓',
-        'SCRAPE': '◆',
-        'EXTRACT': '■',
-        'COMPLETE': '●',
-        'ERROR': '×',
-        'DEBUG': '⋯',
-        'INFO': 'ℹ',
-        'WARNING': '⚠',
+        "INIT": "→",
+        "READY": "✓",
+        "FETCH": "↓",
+        "SCRAPE": "◆",
+        "EXTRACT": "■",
+        "COMPLETE": "●",
+        "ERROR": "×",
+        "DEBUG": "⋯",
+        "INFO": "ℹ",
+        "WARNING": "⚠",
     }

     DEFAULT_COLORS = {
@@ -46,11 +47,11 @@ class AsyncLogger:
         tag_width: int = 10,
         icons: Optional[Dict[str, str]] = None,
         colors: Optional[Dict[LogLevel, str]] = None,
-        verbose: bool = True
+        verbose: bool = True,
     ):
         """
         Initialize the logger.

         Args:
             log_file: Optional file path for logging
             log_level: Minimum log level to display
@@ -66,7 +67,7 @@ class AsyncLogger:
         self.icons = icons or self.DEFAULT_ICONS
         self.colors = colors or self.DEFAULT_COLORS
         self.verbose = verbose

         # Create log file directory if needed
         if log_file:
             os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
@@ -77,18 +78,20 @@ class AsyncLogger:

     def _get_icon(self, tag: str) -> str:
         """Get the icon for a tag, defaulting to info icon if not found."""
-        return self.icons.get(tag, self.icons['INFO'])
+        return self.icons.get(tag, self.icons["INFO"])

     def _write_to_file(self, message: str):
         """Write a message to the log file if configured."""
         if self.log_file:
-            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
-            with open(self.log_file, 'a', encoding='utf-8') as f:
+            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
+            with open(self.log_file, "a", encoding="utf-8") as f:
                 # Strip ANSI color codes for file output
-                clean_message = message.replace(Fore.RESET, '').replace(Style.RESET_ALL, '')
+                clean_message = message.replace(Fore.RESET, "").replace(
+                    Style.RESET_ALL, ""
+                )
                 for color in vars(Fore).values():
                     if isinstance(color, str):
-                        clean_message = clean_message.replace(color, '')
+                        clean_message = clean_message.replace(color, "")
                 f.write(f"[{timestamp}] {clean_message}\n")

     def _log(
@@ -99,11 +102,11 @@ class AsyncLogger:
         params: Optional[Dict[str, Any]] = None,
         colors: Optional[Dict[str, str]] = None,
         base_color: Optional[str] = None,
-        **kwargs
+        **kwargs,
     ):
         """
         Core logging method that handles message formatting and output.

         Args:
             level: Log level for this message
             message: Message template string
@@ -120,7 +123,7 @@ class AsyncLogger:
         try:
             # First format the message with raw parameters
             formatted_message = message.format(**params)

             # Then apply colors if specified
             if colors:
                 for key, color in colors.items():
@@ -128,12 +131,13 @@ class AsyncLogger:
                     if key in params:
                         value_str = str(params[key])
                         formatted_message = formatted_message.replace(
-                            value_str,
-                            f"{color}{value_str}{Style.RESET_ALL}"
+                            value_str, f"{color}{value_str}{Style.RESET_ALL}"
                         )

         except KeyError as e:
-            formatted_message = f"LOGGING ERROR: Missing parameter {e} in message template"
+            formatted_message = (
+                f"LOGGING ERROR: Missing parameter {e} in message template"
+            )
             level = LogLevel.ERROR
         else:
             formatted_message = message
@@ -175,11 +179,11 @@ class AsyncLogger:
         success: bool,
         timing: float,
         tag: str = "FETCH",
-        url_length: int = 50
+        url_length: int = 50,
     ):
         """
         Convenience method for logging URL fetch status.

         Args:
             url: The URL being processed
             success: Whether the operation was successful
@@ -195,24 +199,20 @@ class AsyncLogger:
                 "url": url,
                 "url_length": url_length,
                 "status": success,
-                "timing": timing
+                "timing": timing,
             },
             colors={
                 "status": Fore.GREEN if success else Fore.RED,
-                "timing": Fore.YELLOW
-            }
+                "timing": Fore.YELLOW,
+            },
         )

     def error_status(
-        self,
-        url: str,
-        error: str,
-        tag: str = "ERROR",
-        url_length: int = 50
+        self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
     ):
         """
         Convenience method for logging error status.

         Args:
             url: The URL being processed
             error: Error message
@@ -223,9 +223,5 @@ class AsyncLogger:
             level=LogLevel.ERROR,
             message="{url:.{url_length}}... | Error: {error}",
             tag=tag,
-            params={
-                "url": url,
-                "url_length": url_length,
-                "error": error
-            }
+            params={"url": url, "url_length": url_length, "error": error},
         )
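The `_log` hunk above formats the template first and then wraps each selected parameter's string form in a color code via `str.replace`. A minimal standalone version of that substitution, with literal ANSI constants instead of colorama (names here are illustrative):

```python
RESET = "\x1b[0m"
GREEN = "\x1b[32m"


def colorize(message: str, params: dict, colors: dict) -> str:
    # Format the template with raw parameters first, then wrap each
    # colored value in its escape code, mirroring AsyncLogger._log.
    formatted = message.format(**params)
    for key, color in colors.items():
        if key in params:
            value_str = str(params[key])
            formatted = formatted.replace(value_str, f"{color}{value_str}{RESET}")
    return formatted


line = colorize(
    "{url} | {status}",
    {"url": "https://example.com", "status": True},
    {"status": GREEN},
)
```

Note the replace-by-value approach colors every occurrence of the value's text, which is why `_write_to_file` has to strip all `Fore` codes again before writing plain text to disk.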
File diff suppressed because it is too large
@@ -4,7 +4,7 @@ from enum import Enum
 class CacheMode(Enum):
     """
     Defines the caching behavior for web crawling operations.

     Modes:
     - ENABLED: Normal caching behavior (read and write)
     - DISABLED: No caching at all
@@ -12,6 +12,7 @@ class CacheMode(Enum):
     - WRITE_ONLY: Only write to cache, don't read
     - BYPASS: Bypass cache for this operation
     """

     ENABLED = "enabled"
     DISABLED = "disabled"
     READ_ONLY = "read_only"
@@ -22,10 +23,10 @@
 class CacheContext:
     """
     Encapsulates cache-related decisions and URL handling.

     This class centralizes all cache-related logic and URL type checking,
     making the caching behavior more predictable and maintainable.

     Attributes:
         url (str): The URL being processed.
         cache_mode (CacheMode): The cache mode for the current operation.
@@ -36,10 +37,11 @@ class CacheContext:
         is_raw_html (bool): True if the URL is raw HTML, False otherwise.
         _url_display (str): The display name for the URL (web, local file, or raw HTML).
     """

     def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
         """
         Initializes the CacheContext with the provided URL and cache mode.

         Args:
             url (str): The URL being processed.
             cache_mode (CacheMode): The cache mode for the current operation.
@@ -48,42 +50,42 @@ class CacheContext:
         self.url = url
         self.cache_mode = cache_mode
         self.always_bypass = always_bypass
-        self.is_cacheable = url.startswith(('http://', 'https://', 'file://'))
-        self.is_web_url = url.startswith(('http://', 'https://'))
+        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
+        self.is_web_url = url.startswith(("http://", "https://"))
         self.is_local_file = url.startswith("file://")
         self.is_raw_html = url.startswith("raw:")
         self._url_display = url if not self.is_raw_html else "Raw HTML"

     def should_read(self) -> bool:
         """
         Determines if cache should be read based on context.

         How it works:
         1. If always_bypass is True or is_cacheable is False, return False.
         2. If cache_mode is ENABLED or READ_ONLY, return True.

         Returns:
             bool: True if cache should be read, False otherwise.
         """
         if self.always_bypass or not self.is_cacheable:
             return False
         return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]

     def should_write(self) -> bool:
         """
         Determines if cache should be written based on context.

         How it works:
         1. If always_bypass is True or is_cacheable is False, return False.
         2. If cache_mode is ENABLED or WRITE_ONLY, return True.

         Returns:
             bool: True if cache should be written, False otherwise.
         """
         if self.always_bypass or not self.is_cacheable:
             return False
         return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]

     @property
     def display_url(self) -> str:
         """Returns the URL in display format."""
@@ -94,11 +96,11 @@ def _legacy_to_cache_mode(
     disable_cache: bool = False,
     bypass_cache: bool = False,
     no_cache_read: bool = False,
-    no_cache_write: bool = False
+    no_cache_write: bool = False,
 ) -> CacheMode:
     """
     Converts legacy cache parameters to the new CacheMode enum.

     This is an internal function to help transition from the old boolean flags
     to the new CacheMode system.
     """
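The read/write rules in the `CacheContext` hunk above reduce to two small predicates. A standalone sketch of the same logic (free functions here, where the library uses methods on the context object):

```python
from enum import Enum


class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def _cacheable(url: str) -> bool:
    # Only http(s):// and file:// URLs are cacheable; raw: HTML never is.
    return url.startswith(("http://", "https://", "file://"))


def should_read(url: str, mode: CacheMode, always_bypass: bool = False) -> bool:
    if always_bypass or not _cacheable(url):
        return False
    return mode in (CacheMode.ENABLED, CacheMode.READ_ONLY)


def should_write(url: str, mode: CacheMode, always_bypass: bool = False) -> bool:
    if always_bypass or not _cacheable(url):
        return False
    return mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY)
```

The asymmetry is the point: `READ_ONLY` serves stale results without refreshing them, while `WRITE_ONLY` refreshes the cache without ever short-circuiting a live fetch.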
@@ -3,49 +3,53 @@ import re
 from collections import Counter
 import string
 from .model_loader import load_nltk_punkt
-from .utils import *


 # Define the abstract base class for chunking strategies
 class ChunkingStrategy(ABC):
     """
     Abstract base class for chunking strategies.
     """

     @abstractmethod
     def chunk(self, text: str) -> list:
         """
         Abstract method to chunk the given text.

         Args:
             text (str): The text to chunk.

         Returns:
             list: A list of chunks.
         """
         pass


 # Create an identity chunking strategy f(x) = [x]
 class IdentityChunking(ChunkingStrategy):
     """
     Chunking strategy that returns the input text as a single chunk.
     """

     def chunk(self, text: str) -> list:
         return [text]


 # Regex-based chunking
 class RegexChunking(ChunkingStrategy):
     """
     Chunking strategy that splits text based on regular expression patterns.
     """

     def __init__(self, patterns=None, **kwargs):
         """
         Initialize the RegexChunking object.

         Args:
             patterns (list): A list of regular expression patterns to split text.
         """
         if patterns is None:
-            patterns = [r'\n\n']  # Default split pattern
+            patterns = [r"\n\n"]  # Default split pattern
         self.patterns = patterns

     def chunk(self, text: str) -> list:
@@ -56,18 +60,19 @@ class RegexChunking(ChunkingStrategy):
             new_paragraphs.extend(re.split(pattern, paragraph))
         paragraphs = new_paragraphs
         return paragraphs

+
 # NLP-based sentence chunking
 class NlpSentenceChunking(ChunkingStrategy):
     """
     Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
     """

     def __init__(self, **kwargs):
         """
         Initialize the NlpSentenceChunking object.
         """
         load_nltk_punkt()

     def chunk(self, text: str) -> list:
         # Improved regex for sentence splitting
@@ -75,31 +80,34 @@ class NlpSentenceChunking(ChunkingStrategy):
         # r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
         # )
         # sentences = sentence_endings.split(text)
         # sens = [sent.strip() for sent in sentences if sent]
         from nltk.tokenize import sent_tokenize

         sentences = sent_tokenize(text)
         sens = [sent.strip() for sent in sentences]

         return list(set(sens))


 # Topic-based segmentation using TextTiling
 class TopicSegmentationChunking(ChunkingStrategy):
     """
     Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.

     How it works:
     1. Segment the text into topics using TextTilingTokenizer
     2. Extract keywords for each topic segment
     """

     def __init__(self, num_keywords=3, **kwargs):
         """
         Initialize the TopicSegmentationChunking object.

         Args:
             num_keywords (int): The number of keywords to extract for each topic segment.
         """
         import nltk as nl

         self.tokenizer = nl.tokenize.TextTilingTokenizer()
         self.num_keywords = num_keywords
@@ -111,8 +119,14 @@ class TopicSegmentationChunking(ChunkingStrategy):
     def extract_keywords(self, text: str) -> list:
         # Tokenize and remove stopwords and punctuation
         import nltk as nl

         tokens = nl.toknize.word_tokenize(text)
-        tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]
+        tokens = [
+            token.lower()
+            for token in tokens
+            if token not in nl.corpus.stopwords.words("english")
+            and token not in string.punctuation
+        ]

         # Calculate frequency distribution
         freq_dist = Counter(tokens)
@@ -123,23 +137,27 @@ class TopicSegmentationChunking(ChunkingStrategy):
         # Segment the text into topics
         segments = self.chunk(text)
         # Extract keywords for each topic segment
-        segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
+        segments_with_topics = [
+            (segment, self.extract_keywords(segment)) for segment in segments
+        ]
         return segments_with_topics


 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
     """
     Chunking strategy that splits text into fixed-length word chunks.

     How it works:
     1. Split the text into words
     2. Create chunks of fixed length
     3. Return the list of chunks
     """

     def __init__(self, chunk_size=100, **kwargs):
         """
         Initialize the fixed-length word chunking strategy with the given chunk size.

         Args:
             chunk_size (int): The size of each chunk in words.
         """
@@ -147,23 +165,28 @@ class FixedLengthWordChunking(ChunkingStrategy):

     def chunk(self, text: str) -> list:
         words = text.split()
-        return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
+        return [
+            " ".join(words[i : i + self.chunk_size])
+            for i in range(0, len(words), self.chunk_size)
+        ]


 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
     """
     Chunking strategy that splits text into overlapping word chunks.

     How it works:
     1. Split the text into words
     2. Create chunks of fixed length
     3. Return the list of chunks
     """

     def __init__(self, window_size=100, step=50, **kwargs):
         """
         Initialize the sliding window chunking strategy with the given window size and
         step size.

         Args:
             window_size (int): The size of the sliding window in words.
             step (int): The step size for sliding the window in words.
@@ -174,35 +197,37 @@ class SlidingWindowChunking(ChunkingStrategy):
     def chunk(self, text: str) -> list:
         words = text.split()
         chunks = []

         if len(words) <= self.window_size:
             return [text]

         for i in range(0, len(words) - self.window_size + 1, self.step):
-            chunk = ' '.join(words[i:i + self.window_size])
+            chunk = " ".join(words[i : i + self.window_size])
             chunks.append(chunk)

         # Handle the last chunk if it doesn't align perfectly
         if i + self.window_size < len(words):
-            chunks.append(' '.join(words[-self.window_size:]))
+            chunks.append(" ".join(words[-self.window_size :]))

         return chunks


 class OverlappingWindowChunking(ChunkingStrategy):
     """
     Chunking strategy that splits text into overlapping word chunks.

     How it works:
     1. Split the text into words using whitespace
     2. Create chunks of fixed length equal to the window size
     3. Slide the window by the overlap size
     4. Return the list of chunks
     """

     def __init__(self, window_size=1000, overlap=100, **kwargs):
         """
         Initialize the overlapping window chunking strategy with the given window size and
         overlap size.

         Args:
             window_size (int): The size of the window in words.
             overlap (int): The size of the overlap between consecutive chunks in words.
@@ -213,19 +238,19 @@ class OverlappingWindowChunking(ChunkingStrategy):
     def chunk(self, text: str) -> list:
         words = text.split()
         chunks = []

         if len(words) <= self.window_size:
             return [text]

         start = 0
         while start < len(words):
             end = start + self.window_size
-            chunk = ' '.join(words[start:end])
+            chunk = " ".join(words[start:end])
             chunks.append(chunk)

             if end >= len(words):
                 break

             start = end - self.overlap

         return chunks
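The overlapping-window loop in the last hunk is easy to check by hand: each chunk is `window_size` words, and the next window starts `overlap` words before the previous one ended. A standalone sketch of the same logic as a plain function (the class method in the diff behaves identically):

```python
def overlapping_chunks(text: str, window_size: int, overlap: int) -> list:
    # Split on whitespace, emit window_size-word chunks, and step the
    # window back by `overlap` words so consecutive chunks share context.
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks, start = [], 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap
    return chunks


chunks = overlapping_chunks("a b c d e f g", window_size=4, overlap=2)
```

With seven words, a window of 4, and an overlap of 2, this yields `["a b c d", "c d e f", "e f g"]`: the final chunk is allowed to be short, and the `break` on `end >= len(words)` prevents an infinite loop when `overlap` would step the window back past the end.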
@@ -8,15 +8,22 @@ from .async_logger import AsyncLogger
|
|||||||
logger = AsyncLogger(verbose=True)
|
logger = AsyncLogger(verbose=True)
|
||||||
docs_manager = DocsManager(logger)
|
docs_manager = DocsManager(logger)
|
||||||
|
|
||||||
|
|
||||||
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
|
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
|
||||||
"""Print formatted table with headers and rows"""
|
"""Print formatted table with headers and rows"""
|
||||||
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
|
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
|
||||||
border = '+' + '+'.join('-' * (w + 2 * padding) for w in widths) + '+'
|
border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
|
||||||
|
|
||||||
def format_row(row):
|
def format_row(row):
|
||||||
return '|' + '|'.join(f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
|
return (
|
||||||
for cell, w in zip(row, widths)) + '|'
|
"|"
|
||||||
|
+ "|".join(
|
||||||
|
f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
|
||||||
|
for cell, w in zip(row, widths)
|
||||||
|
)
|
||||||
|
+ "|"
|
||||||
|
)
|
||||||
|
|
||||||
click.echo(border)
|
click.echo(border)
|
||||||
click.echo(format_row(headers))
|
click.echo(format_row(headers))
|
||||||
click.echo(border)
|
click.echo(border)
|
||||||
@@ -24,19 +31,24 @@ def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
         click.echo(format_row(row))
     click.echo(border)
 
 
 @click.group()
 def cli():
     """Crawl4AI Command Line Interface"""
     pass
 
 
 @cli.group()
 def docs():
     """Documentation operations"""
     pass
 
 
 @docs.command()
-@click.argument('sections', nargs=-1)
-@click.option('--mode', type=click.Choice(['extended', 'condensed']), default='extended')
+@click.argument("sections", nargs=-1)
+@click.option(
+    "--mode", type=click.Choice(["extended", "condensed"]), default="extended"
+)
 def combine(sections: tuple, mode: str):
     """Combine documentation sections"""
     try:
@@ -46,16 +58,17 @@ def combine(sections: tuple, mode: str):
         logger.error(str(e), tag="ERROR")
         sys.exit(1)
 
 
 @docs.command()
-@click.argument('query')
-@click.option('--top-k', '-k', default=5)
-@click.option('--build-index', is_flag=True, help='Build index if missing')
+@click.argument("query")
+@click.option("--top-k", "-k", default=5)
+@click.option("--build-index", is_flag=True, help="Build index if missing")
 def search(query: str, top_k: int, build_index: bool):
     """Search documentation"""
     try:
         result = docs_manager.search(query, top_k)
         if result == "No search index available. Call build_search_index() first.":
-            if build_index or click.confirm('No search index found. Build it now?'):
+            if build_index or click.confirm("No search index found. Build it now?"):
                 asyncio.run(docs_manager.llm_text.generate_index_files())
                 result = docs_manager.search(query, top_k)
         click.echo(result)
@@ -63,6 +76,7 @@ def search(query: str, top_k: int, build_index: bool):
         click.echo(f"Error: {str(e)}", err=True)
         sys.exit(1)
 
 
 @docs.command()
 def update():
     """Update docs from GitHub"""
@@ -73,22 +87,25 @@ def update():
         click.echo(f"Error: {str(e)}", err=True)
         sys.exit(1)
 
 
 @docs.command()
-@click.option('--force-facts', is_flag=True, help='Force regenerate fact files')
-@click.option('--clear-cache', is_flag=True, help='Clear BM25 cache')
+@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
+@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
 def index(force_facts: bool, clear_cache: bool):
     """Build or rebuild search indexes"""
     try:
         asyncio.run(docs_manager.ensure_docs_exist())
-        asyncio.run(docs_manager.llm_text.generate_index_files(
-            force_generate_facts=force_facts,
-            clear_bm25_cache=clear_cache
-        ))
+        asyncio.run(
+            docs_manager.llm_text.generate_index_files(
+                force_generate_facts=force_facts, clear_bm25_cache=clear_cache
+            )
+        )
         click.echo("Search indexes built successfully")
     except Exception as e:
         click.echo(f"Error: {str(e)}", err=True)
         sys.exit(1)
 
 
 # Add docs list command
 @docs.command()
 def list():
@@ -96,10 +113,11 @@ def list():
     try:
         sections = docs_manager.list()
         print_table(["Sections"], [[section] for section in sections])
 
     except Exception as e:
         click.echo(f"Error: {str(e)}", err=True)
         sys.exit(1)
 
-if __name__ == '__main__':
-    cli()
+
+if __name__ == "__main__":
+    cli()
@@ -8,7 +8,7 @@ DEFAULT_PROVIDER = "openai/gpt-4o-mini"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {
     "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
     "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
     "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
     "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
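The `PROVIDER_MODELS` dictionary maps provider names to API tokens read from environment variables, so an entry is `None` whenever the corresponding variable is unset. A small sketch of how such a mapping can be resolved safely (`resolve_token` is an illustrative helper, not part of the library):

```python
import os

# Hypothetical mirror of the PROVIDER_MODELS pattern: the value is None
# when the environment variable is unset, so a lookup helper can flag
# missing credentials instead of passing None downstream.
PROVIDER_MODELS = {
    "ollama/llama3": "no-token-needed",  # local models need no API token
    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
}


def resolve_token(provider: str) -> str:
    token = PROVIDER_MODELS.get(provider)
    if token is None:
        raise KeyError(f"No API token configured for provider: {provider}")
    return token
```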
@@ -22,27 +22,49 @@ PROVIDER_MODELS = {
 }
 
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
+CHUNK_TOKEN_THRESHOLD = 2**11  # 2048 tokens
 OVERLAP_RATE = 0.1
 WORD_TOKEN_RATE = 1.3
 
 # Threshold for the minimum number of word in a HTML tag to be considered
 MIN_WORD_THRESHOLD = 1
 IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
 
-IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']
-ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
+IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
+ONLY_TEXT_ELIGIBLE_TAGS = [
+    "b",
+    "i",
+    "u",
+    "span",
+    "del",
+    "ins",
+    "sub",
+    "sup",
+    "strong",
+    "em",
+    "code",
+    "kbd",
+    "var",
+    "s",
+    "q",
+    "abbr",
+    "cite",
+    "dfn",
+    "time",
+    "small",
+    "mark",
+]
 SOCIAL_MEDIA_DOMAINS = [
-    'facebook.com',
-    'twitter.com',
-    'x.com',
-    'linkedin.com',
-    'instagram.com',
-    'pinterest.com',
-    'tiktok.com',
-    'snapchat.com',
-    'reddit.com',
+    "facebook.com",
+    "twitter.com",
+    "x.com",
+    "linkedin.com",
+    "instagram.com",
+    "pinterest.com",
+    "tiktok.com",
+    "snapchat.com",
+    "reddit.com",
 ]
 
 # Threshold for the Image extraction - Range is 1 to 6
 # Images are scored based on point based system, to filter based on usefulness. Points are assigned
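The chunking constants above (`CHUNK_TOKEN_THRESHOLD = 2**11`, `OVERLAP_RATE`, `WORD_TOKEN_RATE`) suggest a word-based token estimate with overlapping windows. A sketch of how they might combine, as an assumption about intent rather than the library's exact algorithm:

```python
CHUNK_TOKEN_THRESHOLD = 2**11  # 2048-token budget per chunk
OVERLAP_RATE = 0.1             # re-read 10% of the previous chunk
WORD_TOKEN_RATE = 1.3          # rough tokens-per-word estimate


def chunk_words(words, token_threshold=CHUNK_TOKEN_THRESHOLD,
                overlap_rate=OVERLAP_RATE):
    # Convert the token budget back into a word budget, then slide a
    # window that starts `overlap` words before the previous chunk ends.
    words_per_chunk = int(token_threshold / WORD_TOKEN_RATE)
    overlap = int(words_per_chunk * overlap_rate)
    step = max(words_per_chunk - overlap, 1)
    return [words[i : i + words_per_chunk] for i in range(0, len(words), step)]


chunks = chunk_words([f"w{i}" for i in range(4000)])
```

With the defaults, a 4000-word document yields three chunks of at most 1575 words each, with consecutive chunks sharing a 157-word overlap.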
@@ -60,5 +82,5 @@ NEED_MIGRATION = True
 URL_LOG_SHORTEN_LENGTH = 30
 SHOW_DEPRECATION_WARNINGS = True
 SCREENSHOT_HEIGHT_TRESHOLD = 10000
-PAGE_TIMEOUT=60000
-DOWNLOAD_PAGE_TIMEOUT=60000
+PAGE_TIMEOUT = 60000
+DOWNLOAD_PAGE_TIMEOUT = 60000
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -15,54 +15,53 @@ import logging, time
 import base64
 from PIL import Image, ImageDraw, ImageFont
 from io import BytesIO
-from typing import List, Callable
+from typing import Callable
 import requests
 import os
 from pathlib import Path
 from .utils import *
 
-logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
+logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
 logger.setLevel(logging.WARNING)
 
-logger_driver = logging.getLogger('selenium.webdriver.common.service')
+logger_driver = logging.getLogger("selenium.webdriver.common.service")
 logger_driver.setLevel(logging.WARNING)
 
-urllib3_logger = logging.getLogger('urllib3.connectionpool')
+urllib3_logger = logging.getLogger("urllib3.connectionpool")
 urllib3_logger.setLevel(logging.WARNING)
 
 # Disable http.client logging
-http_client_logger = logging.getLogger('http.client')
+http_client_logger = logging.getLogger("http.client")
 http_client_logger.setLevel(logging.WARNING)
 
 # Disable driver_finder and service logging
-driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
+driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
 driver_finder_logger.setLevel(logging.WARNING)
 
 
 class CrawlerStrategy(ABC):
     @abstractmethod
     def crawl(self, url: str, **kwargs) -> str:
         pass
 
     @abstractmethod
     def take_screenshot(self, save_path: str):
         pass
 
     @abstractmethod
     def update_user_agent(self, user_agent: str):
         pass
 
     @abstractmethod
     def set_hook(self, hook_type: str, hook: Callable):
         pass
 
 
 class CloudCrawlerStrategy(CrawlerStrategy):
-    def __init__(self, use_cached_html = False):
+    def __init__(self, use_cached_html=False):
         super().__init__()
         self.use_cached_html = use_cached_html
 
     def crawl(self, url: str) -> str:
         data = {
             "urls": [url],
@@ -76,6 +75,7 @@ class CloudCrawlerStrategy(CrawlerStrategy):
         html = response["results"][0]["html"]
         return sanitize_input_encode(html)
 
 
 class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
     def __init__(self, use_cached_html=False, js_code=None, **kwargs):
         super().__init__()
@@ -87,20 +87,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
         if kwargs.get("user_agent"):
             self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
         else:
-            user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+            user_agent = kwargs.get(
+                "user_agent",
+                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
+            )
             self.options.add_argument(f"--user-agent={user_agent}")
-            self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+            self.options.add_argument(
+                "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+            )
 
         self.options.headless = kwargs.get("headless", True)
         if self.options.headless:
             self.options.add_argument("--headless")
 
         self.options.add_argument("--disable-gpu")
         self.options.add_argument("--window-size=1920,1080")
         self.options.add_argument("--no-sandbox")
         self.options.add_argument("--disable-dev-shm-usage")
         self.options.add_argument("--disable-blink-features=AutomationControlled")
 
         # self.options.add_argument("--disable-dev-shm-usage")
         self.options.add_argument("--disable-gpu")
         # self.options.add_argument("--disable-extensions")
@@ -120,14 +125,14 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
         self.use_cached_html = use_cached_html
         self.js_code = js_code
         self.verbose = kwargs.get("verbose", False)
 
         # Hooks
         self.hooks = {
-            'on_driver_created': None,
-            'on_user_agent_updated': None,
-            'before_get_url': None,
-            'after_get_url': None,
-            'before_return_html': None
+            "on_driver_created": None,
+            "on_user_agent_updated": None,
+            "before_get_url": None,
+            "after_get_url": None,
+            "before_return_html": None,
         }
 
         # chromedriver_autoinstaller.install()
@@ -137,31 +142,28 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
         # chromedriver_path = chromedriver_autoinstaller.install()
         # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
         # self.service = Service(chromedriver_autoinstaller.install())
 
 
         # chromedriver_path = ChromeDriverManager().install()
         # self.service = Service(chromedriver_path)
         # self.service.log_path = "NUL"
         # self.driver = webdriver.Chrome(service=self.service, options=self.options)
 
         # Use selenium-manager (built into Selenium 4.10.0+)
         self.service = Service()
         self.driver = webdriver.Chrome(options=self.options)
 
-        self.driver = self.execute_hook('on_driver_created', self.driver)
+        self.driver = self.execute_hook("on_driver_created", self.driver)
 
         if kwargs.get("cookies"):
             for cookie in kwargs.get("cookies"):
                 self.driver.add_cookie(cookie)
 
     def set_hook(self, hook_type: str, hook: Callable):
         if hook_type in self.hooks:
             self.hooks[hook_type] = hook
         else:
             raise ValueError(f"Invalid hook type: {hook_type}")
 
     def execute_hook(self, hook_type: str, *args):
         hook = self.hooks.get(hook_type)
         if hook:
@@ -170,7 +172,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
             if isinstance(result, webdriver.Chrome):
                 return result
             else:
-                raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
+                raise TypeError(
+                    f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
+                )
         # If the hook returns None or there is no hook, return self.driver
         return self.driver
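The `set_hook`/`execute_hook` pair above is a small named-callback registry: a hook may transform the driver, and returning `None` leaves the current value in place. The same pattern, stripped of the selenium specifics (this class is illustrative, not part of the library):

```python
from typing import Callable


class HookRegistry:
    """Minimal stand-in for the strategy's hook mechanism (sketch only)."""

    def __init__(self):
        # Only known hook names may be registered, as in set_hook above.
        self.hooks = {"before_get_url": None, "after_get_url": None}

    def set_hook(self, hook_type: str, hook: Callable):
        if hook_type not in self.hooks:
            raise ValueError(f"Invalid hook type: {hook_type}")
        self.hooks[hook_type] = hook

    def execute_hook(self, hook_type: str, value):
        # Run the hook if registered; fall back to the unchanged value,
        # mirroring "if the hook returns None, keep self.driver".
        hook = self.hooks.get(hook_type)
        if hook:
            result = hook(value)
            if result is not None:
                return result
        return value


registry = HookRegistry()
registry.set_hook("before_get_url", lambda v: v.upper())
out = registry.execute_hook("before_get_url", "driver")
```

Reassigning `self.driver = self.execute_hook(...)` after every call is what lets a hook swap in a completely different driver instance.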
@@ -178,60 +182,77 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
         self.options.add_argument(f"user-agent={user_agent}")
         self.driver.quit()
         self.driver = webdriver.Chrome(service=self.service, options=self.options)
-        self.driver = self.execute_hook('on_user_agent_updated', self.driver)
+        self.driver = self.execute_hook("on_user_agent_updated", self.driver)
 
     def set_custom_headers(self, headers: dict):
         # Enable Network domain for sending headers
-        self.driver.execute_cdp_cmd('Network.enable', {})
+        self.driver.execute_cdp_cmd("Network.enable", {})
         # Set extra HTTP headers
-        self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})
+        self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
 
     def _ensure_page_load(self, max_checks=6, check_interval=0.01):
         initial_length = len(self.driver.page_source)
 
         for ix in range(max_checks):
             # print(f"Checking page load: {ix}")
             time.sleep(check_interval)
             current_length = len(self.driver.page_source)
 
             if current_length != initial_length:
                 break
 
         return self.driver.page_source
 
     def crawl(self, url: str, **kwargs) -> str:
         # Create md5 hash of the URL
         import hashlib
 
         url_hash = hashlib.md5(url.encode()).hexdigest()
 
         if self.use_cached_html:
-            cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
+            cache_file_path = os.path.join(
+                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
+                ".crawl4ai",
+                "cache",
+                url_hash,
+            )
             if os.path.exists(cache_file_path):
                 with open(cache_file_path, "r") as f:
                     return sanitize_input_encode(f.read())
 
         try:
-            self.driver = self.execute_hook('before_get_url', self.driver)
+            self.driver = self.execute_hook("before_get_url", self.driver)
             if self.verbose:
                 print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
-            self.driver.get(url) #<html><head></head><body></body></html>
+            self.driver.get(url)  # <html><head></head><body></body></html>
 
             WebDriverWait(self.driver, 20).until(
-                lambda d: d.execute_script('return document.readyState') == 'complete'
+                lambda d: d.execute_script("return document.readyState") == "complete"
             )
             WebDriverWait(self.driver, 10).until(
                 EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
             )
 
-            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
-            self.driver = self.execute_hook('after_get_url', self.driver)
-            html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source
-            can_not_be_done_headless = False # Look at my creativity for naming variables
+            self.driver.execute_script(
+                "window.scrollTo(0, document.body.scrollHeight);"
+            )
+
+            self.driver = self.execute_hook("after_get_url", self.driver)
+            html = sanitize_input_encode(
+                self._ensure_page_load()
+            )  # self.driver.page_source
+            can_not_be_done_headless = (
+                False  # Look at my creativity for naming variables
+            )
 
             # TODO: Very ugly approach, but promise to change it!
-            if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
-                print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
+            if (
+                kwargs.get("bypass_headless", False)
+                or html == "<html><head></head><body></body></html>"
+            ):
+                print(
+                    "[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
+                )
                 can_not_be_done_headless = True
                 options = Options()
                 options.headless = False
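The caching branch in `crawl` keys each page by the hex md5 of its URL under a `.crawl4ai/cache` directory. The key derivation in isolation, writing into a temporary directory rather than the real cache root (the helper name is illustrative):

```python
import hashlib
import os
import tempfile


def cache_path_for(url: str, base_dir: str) -> str:
    # Same key derivation as the crawler: hex md5 digest of the URL bytes.
    url_hash = hashlib.md5(url.encode()).hexdigest()
    cache_dir = os.path.join(base_dir, ".crawl4ai", "cache")
    os.makedirs(cache_dir, exist_ok=True)
    return os.path.join(cache_dir, url_hash)


with tempfile.TemporaryDirectory() as tmp:
    path = cache_path_for("https://example.com", tmp)
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html></html>")
    with open(path, encoding="utf-8") as f:
        cached = f.read()
```

Hashing sidesteps filesystem-unsafe characters in URLs and keeps every cache filename a fixed 32 hex characters.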
@@ -239,27 +260,31 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                 options.add_argument("--window-size=5,5")
                 driver = webdriver.Chrome(service=self.service, options=options)
                 driver.get(url)
-                self.driver = self.execute_hook('after_get_url', driver)
+                self.driver = self.execute_hook("after_get_url", driver)
                 html = sanitize_input_encode(driver.page_source)
                 driver.quit()
 
             # Execute JS code if provided
             self.js_code = kwargs.get("js_code", self.js_code)
             if self.js_code and type(self.js_code) == str:
                 self.driver.execute_script(self.js_code)
                 # Optionally, wait for some condition after executing the JS code
                 WebDriverWait(self.driver, 10).until(
-                    lambda driver: driver.execute_script("return document.readyState") == "complete"
+                    lambda driver: driver.execute_script("return document.readyState")
+                    == "complete"
                 )
             elif self.js_code and type(self.js_code) == list:
                 for js in self.js_code:
                     self.driver.execute_script(js)
                     WebDriverWait(self.driver, 10).until(
-                        lambda driver: driver.execute_script("return document.readyState") == "complete"
+                        lambda driver: driver.execute_script(
+                            "return document.readyState"
+                        )
+                        == "complete"
                     )
 
             # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
-            wait_for = kwargs.get('wait_for', False)
+            wait_for = kwargs.get("wait_for", False)
             if wait_for:
                 if callable(wait_for):
                     print("[LOG] 🔄 Waiting for condition...")
@@ -268,32 +293,37 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                     print("[LOG] 🔄 Waiting for condition...")
                     WebDriverWait(self.driver, 20).until(
                         EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
                     )
 
             if not can_not_be_done_headless:
                 html = sanitize_input_encode(self.driver.page_source)
-            self.driver = self.execute_hook('before_return_html', self.driver, html)
+            self.driver = self.execute_hook("before_return_html", self.driver, html)
 
             # Store in cache
-            cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
+            cache_file_path = os.path.join(
+                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
+                ".crawl4ai",
+                "cache",
+                url_hash,
+            )
             with open(cache_file_path, "w", encoding="utf-8") as f:
                 f.write(html)
 
             if self.verbose:
                 print(f"[LOG] ✅ Crawled {url} successfully!")
 
             return html
         except InvalidArgumentException as e:
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                 e.msg = sanitize_input_encode(str(e))
             raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
         except WebDriverException as e:
             # If e does not have a msg attribute, create it and set it to str(e)
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                 e.msg = sanitize_input_encode(str(e))
             raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
         except Exception as e:
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                 e.msg = sanitize_input_encode(str(e))
             raise Exception(f"Failed to crawl {url}: {e.msg}")
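Each `except` branch above backfills a `.msg` attribute before re-raising, because selenium exceptions expose `.msg` while plain exceptions do not. The pattern in isolation (names here are illustrative):

```python
def ensure_msg(exc: Exception) -> Exception:
    # Selenium exceptions carry .msg; plain exceptions do not, so the
    # crawler backfills it from str(exc) before re-raising with context.
    if not hasattr(exc, "msg"):
        exc.msg = str(exc)
    return exc


class MsgError(Exception):
    """Stand-in for a selenium-style exception that already has .msg."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg


err = ensure_msg(ValueError("boom"))   # .msg added
kept = ensure_msg(MsgError("already")) # .msg left untouched
```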
@@ -301,7 +331,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
         try:
             # Get the dimensions of the page
             total_width = self.driver.execute_script("return document.body.scrollWidth")
-            total_height = self.driver.execute_script("return document.body.scrollHeight")
+            total_height = self.driver.execute_script(
+                "return document.body.scrollHeight"
+            )
 
             # Set the window size to the dimensions of the page
             self.driver.set_window_size(total_width, total_height)
@@ -313,25 +345,27 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
             image = Image.open(BytesIO(screenshot))
 
             # Convert image to RGB mode (this will handle both RGB and RGBA images)
-            rgb_image = image.convert('RGB')
+            rgb_image = image.convert("RGB")
 
             # Convert to JPEG and compress
             buffered = BytesIO()
             rgb_image.save(buffered, format="JPEG", quality=85)
-            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
 
             if self.verbose:
-                print(f"[LOG] 📸 Screenshot taken and converted to base64")
+                print("[LOG] 📸 Screenshot taken and converted to base64")
 
             return img_base64
         except Exception as e:
-            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
+            error_message = sanitize_input_encode(
+                f"Failed to take screenshot: {str(e)}"
+            )
             print(error_message)
 
             # Generate an image with black background
-            img = Image.new('RGB', (800, 600), color='black')
+            img = Image.new("RGB", (800, 600), color="black")
             draw = ImageDraw.Draw(img)
 
             # Load a font
             try:
                 font = ImageFont.truetype("arial.ttf", 40)
@@ -345,16 +379,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
|
|||||||
|
|
||||||
# Calculate text position
|
# Calculate text position
|
||||||
text_position = (10, 10)
|
text_position = (10, 10)
|
||||||
|
|
||||||
# Draw the text on the image
|
# Draw the text on the image
|
||||||
draw.text(text_position, wrapped_text, fill=text_color, font=font)
|
draw.text(text_position, wrapped_text, fill=text_color, font=font)
|
||||||
|
|
||||||
# Convert to base64
|
# Convert to base64
|
||||||
buffered = BytesIO()
|
buffered = BytesIO()
|
||||||
img.save(buffered, format="JPEG")
|
img.save(buffered, format="JPEG")
|
||||||
img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
|
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
|
||||||
|
|
||||||
return img_base64
|
return img_base64
|
||||||
|
|
||||||
def quit(self):
|
def quit(self):
|
||||||
self.driver.quit()
|
self.driver.quit()
|
||||||
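The screenshot path above ends by base64-encoding the JPEG bytes so the result can travel as a plain string. A minimal sketch of that encode/decode round-trip, standard library only; the bytes here are a stand-in for the real JPEG buffer that PIL writes:

```python
import base64
from io import BytesIO

# Stand-in for the JPEG bytes PIL would write into `buffered`
buffered = BytesIO(b"\xff\xd8\xff\xe0 fake jpeg payload")

# Encode exactly as the strategy does: bytes -> base64 -> utf-8 string
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

# A consumer can recover the original bytes losslessly
assert base64.b64decode(img_base64) == buffered.getvalue()
```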
@@ -7,11 +7,13 @@ DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".cra
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")


def init_db():
    global DB_PATH
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS crawled_data (
            url TEXT PRIMARY KEY,
            html TEXT,
@@ -24,31 +26,42 @@ def init_db():
            metadata TEXT DEFAULT "{}",
            screenshot TEXT DEFAULT ""
        )
        """
    )
    conn.commit()
    conn.close()


def alter_db_add_screenshot(new_column: str = "media"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute(
            f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error altering database to add screenshot column: {e}")


def check_db_path():
    if not DB_PATH:
        raise ValueError("Database path is not set or is empty.")


def get_cached_url(
    url: str,
) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute(
            "SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
            (url,),
        )
        result = cursor.fetchone()
        conn.close()
        return result
@@ -56,12 +69,25 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str
        print(f"Error retrieving cached URL: {e}")
        return None


def cache_url(
    url: str,
    html: str,
    cleaned_html: str,
    markdown: str,
    extracted_content: str,
    success: bool,
    media: str = "{}",
    links: str = "{}",
    metadata: str = "{}",
    screenshot: str = "",
):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute(
            """
            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
@@ -74,18 +100,32 @@ def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_c
                links = excluded.links,
                metadata = excluded.metadata,
                screenshot = excluded.screenshot
            """,
            (
                url,
                html,
                cleaned_html,
                markdown,
                extracted_content,
                success,
                media,
                links,
                metadata,
                screenshot,
            ),
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error caching URL: {e}")


def get_total_count() -> int:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute("SELECT COUNT(*) FROM crawled_data")
        result = cursor.fetchone()
        conn.close()
        return result[0]
@@ -93,43 +133,48 @@ def get_total_count() -> int:
        print(f"Error getting total count: {e}")
        return 0


def clear_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute("DELETE FROM crawled_data")
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error clearing database: {e}")


def flush_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute("DROP TABLE crawled_data")
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error flushing database: {e}")


def update_existing_records(new_column: str = "media", default_value: str = "{}"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute(
            f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error updating existing records: {e}")


if __name__ == "__main__":
    # Delete the existing database file
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)
    init_db()
    # alter_db_add_screenshot("COL_NAME")
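The `cache_url` statement above relies on SQLite's upsert clause so a re-crawled URL replaces its old row instead of failing on the primary key. A standalone sketch of the same `ON CONFLICT(url) DO UPDATE` pattern against an in-memory database, with the table trimmed to two columns for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT)")

upsert = """
INSERT INTO crawled_data (url, html) VALUES (?, ?)
ON CONFLICT(url) DO UPDATE SET html = excluded.html
"""
cur.execute(upsert, ("https://example.com", "<p>v1</p>"))
cur.execute(upsert, ("https://example.com", "<p>v2</p>"))  # updates in place, no duplicate row
conn.commit()

rows = cur.execute("SELECT url, html FROM crawled_data").fetchall()
assert rows == [("https://example.com", "<p>v2</p>")]
conn.close()
```

Note that `ON CONFLICT ... DO UPDATE` needs SQLite 3.24 or newer, which ships with every currently supported CPython.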
@@ -4,6 +4,7 @@ from pathlib import Path
from crawl4ai.async_logger import AsyncLogger
from crawl4ai.llmtxt import AsyncLLMTextManager


class DocsManager:
    def __init__(self, logger=None):
        self.docs_dir = Path.home() / ".crawl4ai" / "docs"
@@ -21,11 +22,14 @@ class DocsManager:
        """Copy from local docs or download from GitHub"""
        try:
            # Try local first
            if self.local_docs.exists() and (
                any(self.local_docs.glob("*.md"))
                or any(self.local_docs.glob("*.tokens"))
            ):
                # Empty the local docs directory
                for file_path in self.docs_dir.glob("*.md"):
                    file_path.unlink()
                # for file_path in self.docs_dir.glob("*.tokens"):
                #     file_path.unlink()
                for file_path in self.local_docs.glob("*.md"):
                    shutil.copy2(file_path, self.docs_dir / file_path.name)
@@ -36,14 +40,14 @@ class DocsManager:
            # Fallback to GitHub
            response = requests.get(
                "https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
                headers={"Accept": "application/vnd.github.v3+json"},
            )
            response.raise_for_status()

            for item in response.json():
                if item["type"] == "file" and item["name"].endswith(".md"):
                    content = requests.get(item["download_url"]).text
                    with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
                        f.write(content)
            return True

@@ -57,11 +61,15 @@ class DocsManager:
        # Remove [0-9]+_ prefix
        names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
        # Exclude those end with .xs.md and .q.md
        names = [
            name
            for name in names
            if not name.endswith(".xs") and not name.endswith(".q")
        ]
        return names

    def generate(self, sections, mode="extended"):
        return self.llm_text.generate(sections, mode)

    def search(self, query: str, top_k: int = 5):
        return self.llm_text.search(query, top_k)
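`fetch_docs` above prefers a local copy and only falls back to the GitHub API when no usable files are found. The local-first decision can be sketched with `pathlib` alone; the directory names below are illustrative, not the manager's real paths:

```python
import tempfile
from pathlib import Path

def has_local_docs(local_docs: Path) -> bool:
    # Mirror the check above: directory exists and holds .md or .tokens files
    return local_docs.exists() and (
        any(local_docs.glob("*.md")) or any(local_docs.glob("*.tokens"))
    )

with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "docs"
    assert not has_local_docs(local)      # missing dir -> fall back to GitHub
    local.mkdir()
    assert not has_local_docs(local)      # empty dir -> still fall back
    (local / "intro.md").write_text("# hi")
    assert has_local_docs(local)          # markdown present -> copy locally
```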
File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
@@ -54,13 +54,13 @@ class HTML2Text(html.parser.HTMLParser):
        self.td_count = 0
        self.table_start = False
        self.unicode_snob = config.UNICODE_SNOB  # covered in cli

        self.escape_snob = config.ESCAPE_SNOB  # covered in cli
        self.escape_backslash = config.ESCAPE_BACKSLASH  # covered in cli
        self.escape_dot = config.ESCAPE_DOT  # covered in cli
        self.escape_plus = config.ESCAPE_PLUS  # covered in cli
        self.escape_dash = config.ESCAPE_DASH  # covered in cli

        self.links_each_paragraph = config.LINKS_EACH_PARAGRAPH
        self.body_width = bodywidth  # covered in cli
        self.skip_internal_links = config.SKIP_INTERNAL_LINKS  # covered in cli
@@ -144,8 +144,8 @@ class HTML2Text(html.parser.HTMLParser):

    def update_params(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def feed(self, data: str) -> None:
        data = data.replace("</' + 'script>", "</ignore>")
        super().feed(data)
@@ -903,7 +903,13 @@ class HTML2Text(html.parser.HTMLParser):
            self.empty_link = False

        if not self.code and not self.pre and not entity_char:
            data = escape_md_section(
                data,
                snob=self.escape_snob,
                escape_dot=self.escape_dot,
                escape_plus=self.escape_plus,
                escape_dash=self.escape_dash,
            )
        self.preceding_data = data
        self.o(data, puredata=True)

@@ -1006,6 +1012,7 @@ class HTML2Text(html.parser.HTMLParser):
            newlines += 1
    return result


def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    if bodywidth is None:
        bodywidth = config.BODY_WIDTH
@@ -1013,6 +1020,7 @@ def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) ->

    return h.handle(html)


class CustomHTML2Text(HTML2Text):
    def __init__(self, *args, handle_code_in_pre=False, **kwargs):
        super().__init__(*args, **kwargs)
@@ -1022,8 +1030,8 @@ class CustomHTML2Text(HTML2Text):
        self.current_preserved_tag = None
        self.preserved_content = []
        self.preserve_depth = 0
        self.handle_code_in_pre = handle_code_in_pre

        # Configuration options
        self.skip_internal_links = False
        self.single_line_break = False
@@ -1041,9 +1049,9 @@ class CustomHTML2Text(HTML2Text):
    def update_params(self, **kwargs):
        """Update parameters and set preserved tags."""
        for key, value in kwargs.items():
            if key == "preserve_tags":
                self.preserve_tags = set(value)
            elif key == "handle_code_in_pre":
                self.handle_code_in_pre = value
            else:
                setattr(self, key, value)
@@ -1056,17 +1064,19 @@ class CustomHTML2Text(HTML2Text):
                self.current_preserved_tag = tag
                self.preserved_content = []
                # Format opening tag with attributes
                attr_str = "".join(
                    f' {k}="{v}"' for k, v in attrs.items() if v is not None
                )
                self.preserved_content.append(f"<{tag}{attr_str}>")
                self.preserve_depth += 1
                return
            else:
                self.preserve_depth -= 1
                if self.preserve_depth == 0:
                    self.preserved_content.append(f"</{tag}>")
                    # Output the preserved HTML block with proper spacing
                    preserved_html = "".join(self.preserved_content)
                    self.o("\n" + preserved_html + "\n")
                    self.current_preserved_tag = None
                return

@@ -1074,29 +1084,31 @@ class CustomHTML2Text(HTML2Text):
        if self.preserve_depth > 0:
            if start:
                # Format nested tags with attributes
                attr_str = "".join(
                    f' {k}="{v}"' for k, v in attrs.items() if v is not None
                )
                self.preserved_content.append(f"<{tag}{attr_str}>")
            else:
                self.preserved_content.append(f"</{tag}>")
            return

        # Handle pre tags
        if tag == "pre":
            if start:
                self.o("```\n")  # Markdown code block start
                self.inside_pre = True
            else:
                self.o("\n```\n")  # Markdown code block end
                self.inside_pre = False
        elif tag == "code":
            if self.inside_pre and not self.handle_code_in_pre:
                # Ignore code tags inside pre blocks if handle_code_in_pre is False
                return
            if start:
                self.o("`")  # Markdown inline code start
                self.inside_code = True
            else:
                self.o("`")  # Markdown inline code end
                self.inside_code = False
        else:
            super().handle_tag(tag, attrs, start)
@@ -1113,13 +1125,12 @@ class CustomHTML2Text(HTML2Text):
            return
        if self.inside_code:
            # Inline code: no newlines allowed
            self.o(data.replace("\n", " "))
            return

        # Default behavior for other tags
        super().handle_data(data, entity_char)


# # Handle pre tags
# if tag == 'pre':
#     if start:
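`CustomHTML2Text` keeps configured tags as raw HTML by counting nesting depth (`preserve_depth`) and only flushing the buffer when the outermost preserved tag closes. A stripped-down sketch of that depth-counter idea on top of the standard library's `html.parser`, covering only the preservation bookkeeping (no markdown conversion, and void elements like `<br>` are ignored for simplicity):

```python
from html.parser import HTMLParser

class PreserveTags(HTMLParser):
    """Collect whole preserved-tag subtrees as raw HTML strings."""

    def __init__(self, preserve=("table",)):
        super().__init__()
        self.preserve = set(preserve)
        self.depth = 0       # nesting depth inside a preserved subtree
        self.buf = []        # pieces of the subtree being collected
        self.blocks = []     # completed preserved HTML blocks

    def handle_starttag(self, tag, attrs):
        if tag in self.preserve or self.depth > 0:
            attr_str = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.buf.append(f"<{tag}{attr_str}>")
            self.depth += 1

    def handle_data(self, data):
        if self.depth > 0:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.buf.append(f"</{tag}>")
            self.depth -= 1
            if self.depth == 0:          # outermost preserved tag closed
                self.blocks.append("".join(self.buf))
                self.buf = []

p = PreserveTags()
p.feed("<p>x</p><table><tr><td>1</td></tr></table>")
assert p.blocks == ["<table><tr><td>1</td></tr></table>"]
```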
@@ -1,2 +1,3 @@
class OutCallback:
    def __call__(self, s: str) -> None:
        ...
@@ -210,7 +210,7 @@ def escape_md_section(
    snob: bool = False,
    escape_dot: bool = True,
    escape_plus: bool = True,
    escape_dash: bool = True,
) -> str:
    """
    Escapes markdown-sensitive characters across whole document sections.
@@ -233,6 +233,7 @@ def escape_md_section(

    return text


def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Given the lines of a table
@@ -6,25 +6,44 @@ from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)


def post_install():
    """Run all post-installation tasks"""
    logger.info("Running post-installation setup...", tag="INIT")
    install_playwright()
    run_migration()
    logger.success("Post-installation setup completed!", tag="COMPLETE")


def install_playwright():
    logger.info("Installing Playwright browsers...", tag="INIT")
    try:
        # subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
        subprocess.check_call(
            [
                sys.executable,
                "-m",
                "playwright",
                "install",
                "--with-deps",
                "--force",
                "chromium",
            ]
        )
        logger.success(
            "Playwright installation completed successfully.", tag="COMPLETE"
        )
    except subprocess.CalledProcessError:
        # logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
        logger.warning(
            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
        )
    except Exception:
        # logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
        logger.warning(
            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
        )


def run_migration():
    """Initialize database during installation"""
@@ -33,18 +52,26 @@ def run_migration():
        from crawl4ai.async_database import async_db_manager

        asyncio.run(async_db_manager.initialize())
        logger.success(
            "Database initialization completed successfully.", tag="COMPLETE"
        )
    except ImportError:
        logger.warning("Database module not found. Will initialize on first use.")
    except Exception as e:
        logger.warning(f"Database initialization failed: {e}")
        logger.warning("Database will be initialized on first use")


async def run_doctor():
    """Test if Crawl4AI is working properly"""
    logger.info("Running Crawl4AI health check...", tag="INIT")
    try:
        from .async_webcrawler import (
            AsyncWebCrawler,
            BrowserConfig,
            CrawlerRunConfig,
            CacheMode,
        )

        browser_config = BrowserConfig(
            headless=True,
@@ -52,7 +79,7 @@ async def run_doctor():
            ignore_https_errors=True,
            light_mode=True,
            viewport_width=1280,
            viewport_height=720,
        )

        run_config = CrawlerRunConfig(
@@ -62,10 +89,7 @@ async def run_doctor():

        async with AsyncWebCrawler(config=browser_config) as crawler:
            logger.info("Testing crawling capabilities...", tag="TEST")
            result = await crawler.arun(url="https://crawl4ai.com", config=run_config)

            if result and result.markdown:
                logger.success("✅ Crawling test passed!", tag="COMPLETE")
@@ -77,7 +101,9 @@ async def run_doctor():
        logger.error(f"❌ Test failed: {e}", tag="ERROR")
        return False


def doctor():
    """Entry point for the doctor command"""
    import asyncio

    return asyncio.run(run_doctor())
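`run_doctor` above wraps a single crawl in an async health check: pass if the crawl yields markdown, fail on any exception. The control flow can be sketched with a stubbed crawler standing in for `AsyncWebCrawler`; the stub names below are placeholders, not the real API:

```python
import asyncio
from types import SimpleNamespace

class StubCrawler:
    """Placeholder for AsyncWebCrawler: succeeds and returns markdown."""
    async def __aenter__(self):
        return self
    async def __aexit__(self, *exc):
        return False
    async def arun(self, url, config=None):
        return SimpleNamespace(markdown="# ok")

async def run_doctor(crawler_factory=StubCrawler) -> bool:
    # Same shape as the real doctor: one crawl, boolean verdict
    try:
        async with crawler_factory() as crawler:
            result = await crawler.arun(url="https://crawl4ai.com")
        return bool(result and result.markdown)
    except Exception:
        return False

assert asyncio.run(run_doctor()) is True
```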
@@ -1,15 +1,18 @@
import os


# Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
def load_js_script(script_name):
    # Get the path of the current script
    current_script_path = os.path.dirname(os.path.realpath(__file__))
    # Get the path of the script to load
    script_path = os.path.join(current_script_path, script_name + ".js")
    # Check if the script exists
    if not os.path.exists(script_path):
        raise ValueError(
            f"Script {script_name} not found in the folder {current_script_path}"
        )
    # Load the content of the script
    with open(script_path, "r") as f:
        script_content = f.read()
    return script_content
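`load_js_script` resolves a sibling `.js` file and fails loudly when it is missing. A quick check of that behavior against a temporary directory; the explicit `folder` parameter is a simplification of the real function, which derives the folder from `__file__`:

```python
import os
import tempfile

def load_js_script(folder: str, script_name: str) -> str:
    # Resolve "<folder>/<script_name>.js" and read it, or fail loudly
    script_path = os.path.join(folder, script_name + ".js")
    if not os.path.exists(script_path):
        raise ValueError(f"Script {script_name} not found in the folder {folder}")
    with open(script_path, "r") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "scroll.js"), "w") as f:
        f.write("window.scrollTo(0, 0);")
    assert load_js_script(tmp, "scroll") == "window.scrollTo(0, 0);"
    try:
        load_js_script(tmp, "missing")
    except ValueError as e:
        assert "missing" in str(e)
```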
|
|||||||
@@ -11,16 +11,16 @@ from rank_bm25 import BM25Okapi
|
|||||||
from nltk.tokenize import word_tokenize
|
from nltk.tokenize import word_tokenize
|
||||||
from nltk.corpus import stopwords
|
from nltk.corpus import stopwords
|
||||||
from nltk.stem import WordNetLemmatizer
|
from nltk.stem import WordNetLemmatizer
|
||||||
from litellm import completion, batch_completion
|
from litellm import batch_completion
|
||||||
from .async_logger import AsyncLogger
|
from .async_logger import AsyncLogger
|
||||||
import litellm
|
import litellm
|
||||||
import pickle
|
import pickle
|
||||||
import hashlib # <--- ADDED for file-hash
|
import hashlib # <--- ADDED for file-hash
|
||||||
from fnmatch import fnmatch
|
|
||||||
import glob
|
import glob
|
||||||
|
|
||||||
litellm.set_verbose = False
|
litellm.set_verbose = False
|
||||||
|
|
||||||
|
|
||||||
def _compute_file_hash(file_path: Path) -> str:
|
def _compute_file_hash(file_path: Path) -> str:
|
||||||
"""Compute MD5 hash for the file's entire content."""
|
"""Compute MD5 hash for the file's entire content."""
|
||||||
hash_md5 = hashlib.md5()
|
hash_md5 = hashlib.md5()
|
||||||
@@ -29,13 +29,14 @@ def _compute_file_hash(file_path: Path) -> str:
             hash_md5.update(chunk)
     return hash_md5.hexdigest()
 
 
 class AsyncLLMTextManager:
     def __init__(
         self,
         docs_dir: Path,
         logger: Optional[AsyncLogger] = None,
         max_concurrent_calls: int = 5,
-        batch_size: int = 3
+        batch_size: int = 3,
     ) -> None:
         self.docs_dir = docs_dir
         self.logger = logger
@@ -51,7 +52,7 @@ class AsyncLLMTextManager:
         contents = []
         for file_path in doc_batch:
             try:
-                with open(file_path, 'r', encoding='utf-8') as f:
+                with open(file_path, "r", encoding="utf-8") as f:
                     contents.append(f.read())
             except Exception as e:
                 self.logger.error(f"Error reading {file_path}: {str(e)}")
@@ -77,43 +78,53 @@ Wrap your response in <index>...</index> tags.
         # Prepare messages for batch processing
         messages_list = [
             [
-                {"role": "user", "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}"}
+                {
+                    "role": "user",
+                    "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
+                }
             ]
-            for content in contents if content
+            for content in contents
+            if content
         ]
 
         try:
             responses = batch_completion(
                 model="anthropic/claude-3-5-sonnet-latest",
                 messages=messages_list,
-                logger_fn=None
+                logger_fn=None,
             )
 
             # Process responses and save index files
             for response, file_path in zip(responses, doc_batch):
                 try:
                     index_content_match = re.search(
-                        r'<index>(.*?)</index>',
+                        r"<index>(.*?)</index>",
                         response.choices[0].message.content,
-                        re.DOTALL
+                        re.DOTALL,
                     )
                     if not index_content_match:
-                        self.logger.warning(f"No <index>...</index> content found for {file_path}")
+                        self.logger.warning(
+                            f"No <index>...</index> content found for {file_path}"
+                        )
                         continue
 
                     index_content = re.sub(
                         r"\n\s*\n", "\n", index_content_match.group(1)
                     ).strip()
                     if index_content:
-                        index_file = file_path.with_suffix('.q.md')
-                        with open(index_file, 'w', encoding='utf-8') as f:
+                        index_file = file_path.with_suffix(".q.md")
+                        with open(index_file, "w", encoding="utf-8") as f:
                             f.write(index_content)
                         self.logger.info(f"Created index file: {index_file}")
                     else:
-                        self.logger.warning(f"No index content found in response for {file_path}")
+                        self.logger.warning(
+                            f"No index content found in response for {file_path}"
+                        )
 
                 except Exception as e:
-                    self.logger.error(f"Error processing response for {file_path}: {str(e)}")
+                    self.logger.error(
+                        f"Error processing response for {file_path}: {str(e)}"
+                    )
 
         except Exception as e:
             self.logger.error(f"Error in batch completion: {str(e)}")
@@ -171,7 +182,12 @@ Wrap your response in <index>...</index> tags.
 
         lemmatizer = WordNetLemmatizer()
         stop_words = set(stopwords.words("english")) - {
-            "how", "what", "when", "where", "why", "which",
+            "how",
+            "what",
+            "when",
+            "where",
+            "why",
+            "which",
         }
 
         tokens = []
@@ -222,7 +238,9 @@ Wrap your response in <index>...</index> tags.
         self.logger.info("Checking which .q.md files need (re)indexing...")
 
         # Gather all .q.md files
-        q_files = [self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")]
+        q_files = [
+            self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
+        ]
 
         # We'll store known (unchanged) facts in these lists
         existing_facts: List[str] = []
@@ -243,7 +261,9 @@ Wrap your response in <index>...</index> tags.
             # Otherwise, load the existing cache and compare hash
             cache = self._load_or_create_token_cache(qf)
             # If the .q.tokens was out of date (i.e. changed hash), we reindex
-            if len(cache["facts"]) == 0 or cache.get("content_hash") != _compute_file_hash(qf):
+            if len(cache["facts"]) == 0 or cache.get(
+                "content_hash"
+            ) != _compute_file_hash(qf):
                 needSet.append(qf)
             else:
                 # File is unchanged → retrieve cached token data
@@ -255,20 +275,29 @@ Wrap your response in <index>...</index> tags.
         if not needSet and not clear_cache:
             # If no file needs reindexing, try loading existing index
             if self.maybe_load_bm25_index(clear_cache=False):
-                self.logger.info("No new/changed .q.md files found. Using existing BM25 index.")
+                self.logger.info(
+                    "No new/changed .q.md files found. Using existing BM25 index."
+                )
                 return
             else:
                 # If there's no existing index, we must build a fresh index from the old caches
-                self.logger.info("No existing BM25 index found. Building from cached facts.")
+                self.logger.info(
+                    "No existing BM25 index found. Building from cached facts."
+                )
                 if existing_facts:
-                    self.logger.info(f"Building BM25 index with {len(existing_facts)} cached facts.")
+                    self.logger.info(
+                        f"Building BM25 index with {len(existing_facts)} cached facts."
+                    )
                     self.bm25_index = BM25Okapi(existing_tokens)
                     self.tokenized_facts = existing_facts
                     with open(self.bm25_index_file, "wb") as f:
-                        pickle.dump({
-                            "bm25_index": self.bm25_index,
-                            "tokenized_facts": self.tokenized_facts
-                        }, f)
+                        pickle.dump(
+                            {
+                                "bm25_index": self.bm25_index,
+                                "tokenized_facts": self.tokenized_facts,
+                            },
+                            f,
+                        )
                 else:
                     self.logger.warning("No facts found at all. Index remains empty.")
                 return
@@ -311,7 +340,9 @@ Wrap your response in <index>...</index> tags.
                 self._save_token_cache(file, fresh_cache)
 
                 mem_usage = process.memory_info().rss / 1024 / 1024
-                self.logger.debug(f"Memory usage after {file.name}: {mem_usage:.2f}MB")
+                self.logger.debug(
+                    f"Memory usage after {file.name}: {mem_usage:.2f}MB"
+                )
 
             except Exception as e:
                 self.logger.error(f"Error processing {file}: {str(e)}")
@@ -328,40 +359,49 @@ Wrap your response in <index>...</index> tags.
         all_tokens = existing_tokens + new_tokens
 
         # 3) Build BM25 index from combined facts
-        self.logger.info(f"Building BM25 index with {len(all_facts)} total facts (old + new).")
+        self.logger.info(
+            f"Building BM25 index with {len(all_facts)} total facts (old + new)."
+        )
         self.bm25_index = BM25Okapi(all_tokens)
         self.tokenized_facts = all_facts
 
         # 4) Save the updated BM25 index to disk
         with open(self.bm25_index_file, "wb") as f:
-            pickle.dump({
-                "bm25_index": self.bm25_index,
-                "tokenized_facts": self.tokenized_facts
-            }, f)
+            pickle.dump(
+                {
+                    "bm25_index": self.bm25_index,
+                    "tokenized_facts": self.tokenized_facts,
+                },
+                f,
+            )
 
         final_mem = process.memory_info().rss / 1024 / 1024
         self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
 
-    async def generate_index_files(self, force_generate_facts: bool = False, clear_bm25_cache: bool = False) -> None:
+    async def generate_index_files(
+        self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
+    ) -> None:
         """
         Generate index files for all documents in parallel batches
 
         Args:
             force_generate_facts (bool): If True, regenerate indexes even if they exist
             clear_bm25_cache (bool): If True, clear existing BM25 index cache
         """
         self.logger.info("Starting index generation for documentation files.")
 
         md_files = [
-            self.docs_dir / f for f in os.listdir(self.docs_dir)
-            if f.endswith('.md') and not any(f.endswith(x) for x in ['.q.md', '.xs.md'])
+            self.docs_dir / f
+            for f in os.listdir(self.docs_dir)
+            if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
         ]
 
         # Filter out files that already have .q files unless force=True
         if not force_generate_facts:
             md_files = [
-                f for f in md_files
-                if not (self.docs_dir / f.name.replace('.md', '.q.md')).exists()
+                f
+                for f in md_files
+                if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
            ]
 
         if not md_files:
@@ -369,8 +409,10 @@ Wrap your response in <index>...</index> tags.
         else:
             # Process documents in batches
             for i in range(0, len(md_files), self.batch_size):
-                batch = md_files[i:i + self.batch_size]
-                self.logger.info(f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}")
+                batch = md_files[i : i + self.batch_size]
+                self.logger.info(
+                    f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}"
+                )
                 await self._process_document_batch(batch)
 
         self.logger.info("Index generation complete, building/updating search index.")
@@ -378,21 +420,31 @@ Wrap your response in <index>...</index> tags.
 
     def generate(self, sections: List[str], mode: str = "extended") -> str:
         # Get all markdown files
-        all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + \
-            glob.glob(str(self.docs_dir / "[0-9]*.xs.md"))
+        all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
+            str(self.docs_dir / "[0-9]*.xs.md")
+        )
 
         # Extract base names without extensions
-        base_docs = {Path(f).name.split('.')[0] for f in all_files
-                     if not Path(f).name.endswith('.q.md')}
+        base_docs = {
+            Path(f).name.split(".")[0]
+            for f in all_files
+            if not Path(f).name.endswith(".q.md")
+        }
 
         # Filter by sections if provided
         if sections:
-            base_docs = {doc for doc in base_docs
-                         if any(section.lower() in doc.lower() for section in sections)}
+            base_docs = {
+                doc
+                for doc in base_docs
+                if any(section.lower() in doc.lower() for section in sections)
+            }
 
         # Get file paths based on mode
         files = []
-        for doc in sorted(base_docs, key=lambda x: int(x.split('_')[0]) if x.split('_')[0].isdigit() else 999999):
+        for doc in sorted(
+            base_docs,
+            key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
+        ):
             if mode == "condensed":
                 xs_file = self.docs_dir / f"{doc}.xs.md"
                 regular_file = self.docs_dir / f"{doc}.md"
@@ -404,7 +456,7 @@ Wrap your response in <index>...</index> tags.
         content = []
         for file in files:
             try:
-                with open(file, 'r', encoding='utf-8') as f:
+                with open(file, "r", encoding="utf-8") as f:
                     fname = Path(file).name
                     content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
             except Exception as e:
@@ -443,15 +495,9 @@ Wrap your response in <index>...</index> tags.
         for file, _ in ranked_files:
             main_doc = str(file).replace(".q.md", ".md")
             if os.path.exists(self.docs_dir / main_doc):
-                with open(self.docs_dir / main_doc, "r", encoding='utf-8') as f:
+                with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
                     only_file_name = main_doc.split("/")[-1]
-                    content = [
-                        "#" * 20,
-                        f"# {only_file_name}",
-                        "#" * 20,
-                        "",
-                        f.read()
-                    ]
+                    content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
                     results.append("\n".join(content))
 
         return "\n\n---\n\n".join(results)
@@ -482,7 +528,9 @@ Wrap your response in <index>...</index> tags.
             if len(components) == 3:
                 code_ref = components[2].strip()
                 code_tokens = self.preprocess_text(code_ref)
-                code_match_score = len(set(query_tokens) & set(code_tokens)) / len(query_tokens)
+                code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
+                    query_tokens
+                )
 
             file_data[file_path]["total_score"] += score
             file_data[file_path]["match_count"] += 1
@@ -2,77 +2,94 @@ from abc import ABC, abstractmethod
 from typing import Optional, Dict, Any, Tuple
 from .models import MarkdownGenerationResult
 from .html2text import CustomHTML2Text
-from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
+from .content_filter_strategy import RelevantContentFilter
 import re
 from urllib.parse import urljoin
 
 # Pre-compile the regex pattern
 LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
 
 
 def fast_urljoin(base: str, url: str) -> str:
     """Fast URL joining for common cases."""
-    if url.startswith(('http://', 'https://', 'mailto:', '//')):
+    if url.startswith(("http://", "https://", "mailto:", "//")):
         return url
-    if url.startswith('/'):
+    if url.startswith("/"):
         # Handle absolute paths
-        if base.endswith('/'):
+        if base.endswith("/"):
             return base[:-1] + url
         return base + url
     return urljoin(base, url)
 
 
 class MarkdownGenerationStrategy(ABC):
     """Abstract base class for markdown generation strategies."""
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
+
+    def __init__(
+        self,
+        content_filter: Optional[RelevantContentFilter] = None,
+        options: Optional[Dict[str, Any]] = None,
+    ):
         self.content_filter = content_filter
         self.options = options or {}
 
     @abstractmethod
-    def generate_markdown(self,
-                          cleaned_html: str,
-                          base_url: str = "",
-                          html2text_options: Optional[Dict[str, Any]] = None,
-                          content_filter: Optional[RelevantContentFilter] = None,
-                          citations: bool = True,
-                          **kwargs) -> MarkdownGenerationResult:
+    def generate_markdown(
+        self,
+        cleaned_html: str,
+        base_url: str = "",
+        html2text_options: Optional[Dict[str, Any]] = None,
+        content_filter: Optional[RelevantContentFilter] = None,
+        citations: bool = True,
+        **kwargs,
+    ) -> MarkdownGenerationResult:
         """Generate markdown from cleaned HTML."""
         pass
 
 
 class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
     """
     Default implementation of markdown generation strategy.
 
     How it works:
     1. Generate raw markdown from cleaned HTML.
     2. Convert links to citations.
     3. Generate fit markdown if content filter is provided.
     4. Return MarkdownGenerationResult.
 
     Args:
         content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
         options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
 
     Returns:
         MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
     """
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
+
+    def __init__(
+        self,
+        content_filter: Optional[RelevantContentFilter] = None,
+        options: Optional[Dict[str, Any]] = None,
+    ):
        super().__init__(content_filter, options)
 
-    def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
+    def convert_links_to_citations(
+        self, markdown: str, base_url: str = ""
+    ) -> Tuple[str, str]:
         """
         Convert links in markdown to citations.
 
         How it works:
         1. Find all links in the markdown.
         2. Convert links to citations.
         3. Return converted markdown and references markdown.
 
         Note:
         This function uses a regex pattern to find links in markdown.
 
         Args:
             markdown (str): Markdown text.
             base_url (str): Base URL for URL joins.
 
         Returns:
             Tuple[str, str]: Converted markdown and references markdown.
         """
@@ -81,57 +98,65 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         parts = []
         last_end = 0
         counter = 1
 
         for match in LINK_PATTERN.finditer(markdown):
-            parts.append(markdown[last_end:match.start()])
+            parts.append(markdown[last_end : match.start()])
             text, url, title = match.groups()
 
             # Use cached URL if available, otherwise compute and cache
-            if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
+            if base_url and not url.startswith(("http://", "https://", "mailto:")):
                 if url not in url_cache:
                     url_cache[url] = fast_urljoin(base_url, url)
                 url = url_cache[url]
 
             if url not in link_map:
                 desc = []
-                if title: desc.append(title)
-                if text and text != title: desc.append(text)
+                if title:
+                    desc.append(title)
+                if text and text != title:
+                    desc.append(text)
                 link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
                 counter += 1
 
             num = link_map[url][0]
-            parts.append(f"{text}⟨{num}⟩" if not match.group(0).startswith('!') else f"![{text}⟨{num}⟩]")
+            parts.append(
+                f"{text}⟨{num}⟩"
+                if not match.group(0).startswith("!")
+                else f"![{text}⟨{num}⟩]"
+            )
             last_end = match.end()
 
         parts.append(markdown[last_end:])
-        converted_text = ''.join(parts)
+        converted_text = "".join(parts)
 
         # Pre-build reference strings
         references = ["\n\n## References\n\n"]
         references.extend(
             f"⟨{num}⟩ {url}{desc}\n"
             for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
         )
 
-        return converted_text, ''.join(references)
+        return converted_text, "".join(references)
 
-    def generate_markdown(self,
-                          cleaned_html: str,
-                          base_url: str = "",
-                          html2text_options: Optional[Dict[str, Any]] = None,
-                          options: Optional[Dict[str, Any]] = None,
-                          content_filter: Optional[RelevantContentFilter] = None,
-                          citations: bool = True,
-                          **kwargs) -> MarkdownGenerationResult:
+    def generate_markdown(
+        self,
+        cleaned_html: str,
+        base_url: str = "",
+        html2text_options: Optional[Dict[str, Any]] = None,
+        options: Optional[Dict[str, Any]] = None,
+        content_filter: Optional[RelevantContentFilter] = None,
+        citations: bool = True,
+        **kwargs,
+    ) -> MarkdownGenerationResult:
         """
         Generate markdown with citations from cleaned HTML.
 
         How it works:
         1. Generate raw markdown from cleaned HTML.
         2. Convert links to citations.
         3. Generate fit markdown if content filter is provided.
         4. Return MarkdownGenerationResult.
 
         Args:
             cleaned_html (str): Cleaned HTML content.
             base_url (str): Base URL for URL joins.
@@ -139,7 +164,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
             options (Optional[Dict[str, Any]]): Additional options for markdown generation.
             content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
             citations (bool): Whether to generate citations.
 
         Returns:
             MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
         """
@@ -147,16 +172,16 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         # Initialize HTML2Text with default options for better conversion
         h = CustomHTML2Text(baseurl=base_url)
         default_options = {
-            'body_width': 0,  # Disable text wrapping
-            'ignore_emphasis': False,
-            'ignore_links': False,
-            'ignore_images': False,
-            'protect_links': True,
-            'single_line_break': True,
-            'mark_code': True,
-            'escape_snob': False
+            "body_width": 0,  # Disable text wrapping
+            "ignore_emphasis": False,
+            "ignore_links": False,
+            "ignore_images": False,
+            "protect_links": True,
+            "single_line_break": True,
+            "mark_code": True,
+            "escape_snob": False,
         }
 
         # Update with custom options if provided
         if html2text_options:
             default_options.update(html2text_options)
@@ -164,7 +189,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
             default_options.update(options)
         elif self.options:
             default_options.update(self.options)
 
         h.update_params(**default_options)
 
         # Ensure we have valid input
@@ -178,17 +203,18 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
             raw_markdown = h.handle(cleaned_html)
         except Exception as e:
             raw_markdown = f"Error converting HTML to markdown: {str(e)}"
 
-        raw_markdown = raw_markdown.replace(' ```', '```')
+        raw_markdown = raw_markdown.replace(" ```", "```")
 
         # Convert links to citations
         markdown_with_citations: str = raw_markdown
         references_markdown: str = ""
         if citations:
             try:
-                markdown_with_citations, references_markdown = self.convert_links_to_citations(
-                    raw_markdown, base_url
-                )
+                (
+                    markdown_with_citations,
+                    references_markdown,
+                ) = self.convert_links_to_citations(raw_markdown, base_url)
             except Exception as e:
                 markdown_with_citations = raw_markdown
                 references_markdown = f"Error generating citations: {str(e)}"
@@ -200,7 +226,9 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
             try:
                 content_filter = content_filter or self.content_filter
                 filtered_html = content_filter.filter_content(cleaned_html)
-                filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
+                filtered_html = "\n".join(
+                    "<div>{}</div>".format(s) for s in filtered_html
+                )
                 fit_markdown = h.handle(filtered_html)
             except Exception as e:
                 fit_markdown = f"Error generating fit markdown: {str(e)}"
@@ -1,13 +1,11 @@
 import os
 import asyncio
-import logging
 from pathlib import Path
 import aiosqlite
 from typing import Optional
 import xxhash
 import aiofiles
 import shutil
-import time
 from datetime import datetime
 from .async_logger import AsyncLogger, LogLevel
 
@@ -17,18 +15,19 @@ logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
|
|||||||
# logging.basicConfig(level=logging.INFO)
|
# logging.basicConfig(level=logging.INFO)
|
||||||
# logger = logging.getLogger(__name__)
|
# logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
class DatabaseMigration:
|
class DatabaseMigration:
|
||||||
def __init__(self, db_path: str):
|
def __init__(self, db_path: str):
|
||||||
self.db_path = db_path
|
self.db_path = db_path
|
||||||
self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
|
self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
|
||||||
|
|
||||||
def _ensure_content_dirs(self, base_path: str) -> dict:
|
def _ensure_content_dirs(self, base_path: str) -> dict:
|
||||||
dirs = {
|
dirs = {
|
||||||
'html': 'html_content',
|
"html": "html_content",
|
||||||
'cleaned': 'cleaned_html',
|
"cleaned": "cleaned_html",
|
||||||
'markdown': 'markdown_content',
|
"markdown": "markdown_content",
|
||||||
'extracted': 'extracted_content',
|
"extracted": "extracted_content",
|
||||||
'screenshots': 'screenshots'
|
"screenshots": "screenshots",
|
||||||
}
|
}
|
||||||
content_paths = {}
|
content_paths = {}
|
||||||
for key, dirname in dirs.items():
|
for key, dirname in dirs.items():
|
||||||
@@ -47,43 +46,55 @@ class DatabaseMigration:
|
|||||||
async def _store_content(self, content: str, content_type: str) -> str:
|
async def _store_content(self, content: str, content_type: str) -> str:
|
||||||
if not content:
|
if not content:
|
||||||
return ""
|
return ""
|
||||||
|
|
||||||
content_hash = self._generate_content_hash(content)
|
content_hash = self._generate_content_hash(content)
|
||||||
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
||||||
|
|
||||||
if not os.path.exists(file_path):
|
if not os.path.exists(file_path):
|
||||||
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
|
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
|
||||||
await f.write(content)
|
await f.write(content)
|
||||||
|
|
||||||
return content_hash
|
return content_hash
|
||||||
|
|
||||||
async def migrate_database(self):
|
async def migrate_database(self):
|
||||||
"""Migrate existing database to file-based storage"""
|
"""Migrate existing database to file-based storage"""
|
||||||
# logger.info("Starting database migration...")
|
# logger.info("Starting database migration...")
|
||||||
logger.info("Starting database migration...", tag="INIT")
|
logger.info("Starting database migration...", tag="INIT")
|
||||||
|
|
||||||
try:
|
try:
|
||||||
async with aiosqlite.connect(self.db_path) as db:
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
# Get all rows
|
# Get all rows
|
||||||
async with db.execute(
|
async with db.execute(
|
||||||
'''SELECT url, html, cleaned_html, markdown,
|
"""SELECT url, html, cleaned_html, markdown,
|
||||||
extracted_content, screenshot FROM crawled_data'''
|
extracted_content, screenshot FROM crawled_data"""
|
||||||
) as cursor:
|
) as cursor:
|
||||||
rows = await cursor.fetchall()
|
rows = await cursor.fetchall()
|
||||||
|
|
||||||
migrated_count = 0
|
migrated_count = 0
|
||||||
for row in rows:
|
for row in rows:
|
||||||
url, html, cleaned_html, markdown, extracted_content, screenshot = row
|
(
|
||||||
|
url,
|
||||||
|
html,
|
||||||
|
cleaned_html,
|
||||||
|
markdown,
|
||||||
|
extracted_content,
|
||||||
|
screenshot,
|
||||||
|
) = row
|
||||||
|
|
||||||
# Store content in files and get hashes
|
# Store content in files and get hashes
|
||||||
html_hash = await self._store_content(html, 'html')
|
html_hash = await self._store_content(html, "html")
|
||||||
cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
|
cleaned_hash = await self._store_content(cleaned_html, "cleaned")
|
||||||
markdown_hash = await self._store_content(markdown, 'markdown')
|
markdown_hash = await self._store_content(markdown, "markdown")
|
||||||
extracted_hash = await self._store_content(extracted_content, 'extracted')
|
extracted_hash = await self._store_content(
|
||||||
screenshot_hash = await self._store_content(screenshot, 'screenshots')
|
extracted_content, "extracted"
|
||||||
|
)
|
||||||
|
screenshot_hash = await self._store_content(
|
||||||
|
screenshot, "screenshots"
|
||||||
|
)
|
||||||
|
|
||||||
# Update database with hashes
|
# Update database with hashes
|
||||||
await db.execute('''
|
await db.execute(
|
||||||
|
"""
|
||||||
UPDATE crawled_data
|
UPDATE crawled_data
|
||||||
SET html = ?,
|
SET html = ?,
|
||||||
cleaned_html = ?,
|
cleaned_html = ?,
|
||||||
@@ -91,40 +102,51 @@ class DatabaseMigration:
|
|||||||
extracted_content = ?,
|
extracted_content = ?,
|
||||||
screenshot = ?
|
screenshot = ?
|
||||||
WHERE url = ?
|
WHERE url = ?
|
||||||
''', (html_hash, cleaned_hash, markdown_hash,
|
""",
|
||||||
extracted_hash, screenshot_hash, url))
|
(
|
||||||
|
html_hash,
|
||||||
|
cleaned_hash,
|
||||||
|
markdown_hash,
|
||||||
|
extracted_hash,
|
||||||
|
screenshot_hash,
|
||||||
|
url,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
migrated_count += 1
|
migrated_count += 1
|
||||||
if migrated_count % 100 == 0:
|
if migrated_count % 100 == 0:
|
||||||
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
|
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
|
||||||
|
|
||||||
|
|
||||||
await db.commit()
|
await db.commit()
|
||||||
logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE")
|
logger.success(
|
||||||
|
f"Migration completed. {migrated_count} records processed.",
|
||||||
|
tag="COMPLETE",
|
||||||
|
)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
# logger.error(f"Migration failed: {e}")
|
# logger.error(f"Migration failed: {e}")
|
||||||
logger.error(
|
logger.error(
|
||||||
message="Migration failed: {error}",
|
message="Migration failed: {error}",
|
||||||
tag="ERROR",
|
tag="ERROR",
|
||||||
params={"error": str(e)}
|
params={"error": str(e)},
|
||||||
)
|
)
|
||||||
raise e
|
raise e
|
||||||
|
|
||||||
|
|
||||||
async def backup_database(db_path: str) -> str:
|
async def backup_database(db_path: str) -> str:
|
||||||
"""Create backup of existing database"""
|
"""Create backup of existing database"""
|
||||||
if not os.path.exists(db_path):
|
if not os.path.exists(db_path):
|
||||||
logger.info("No existing database found. Skipping backup.", tag="INIT")
|
logger.info("No existing database found. Skipping backup.", tag="INIT")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
# Create backup with timestamp
|
# Create backup with timestamp
|
||||||
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
|
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
backup_path = f"{db_path}.backup_{timestamp}"
|
backup_path = f"{db_path}.backup_{timestamp}"
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Wait for any potential write operations to finish
|
# Wait for any potential write operations to finish
|
||||||
await asyncio.sleep(1)
|
await asyncio.sleep(1)
|
||||||
|
|
||||||
# Create backup
|
# Create backup
|
||||||
shutil.copy2(db_path, backup_path)
|
shutil.copy2(db_path, backup_path)
|
||||||
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
|
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
|
||||||
@@ -132,37 +154,41 @@ async def backup_database(db_path: str) -> str:
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
# logger.error(f"Backup failed: {e}")
|
# logger.error(f"Backup failed: {e}")
|
||||||
logger.error(
|
logger.error(
|
||||||
message="Migration failed: {error}",
|
message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
|
||||||
tag="ERROR",
|
)
|
||||||
params={"error": str(e)}
|
|
||||||
)
|
|
||||||
raise e
|
raise e
|
||||||
|
|
||||||
|
|
||||||
async def run_migration(db_path: Optional[str] = None):
|
async def run_migration(db_path: Optional[str] = None):
|
||||||
"""Run database migration"""
|
"""Run database migration"""
|
||||||
if db_path is None:
|
if db_path is None:
|
||||||
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
|
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
|
||||||
|
|
||||||
if not os.path.exists(db_path):
|
if not os.path.exists(db_path):
|
||||||
logger.info("No existing database found. Skipping migration.", tag="INIT")
|
logger.info("No existing database found. Skipping migration.", tag="INIT")
|
||||||
return
|
return
|
||||||
|
|
||||||
# Create backup first
|
# Create backup first
|
||||||
backup_path = await backup_database(db_path)
|
backup_path = await backup_database(db_path)
|
||||||
if not backup_path:
|
if not backup_path:
|
||||||
return
|
return
|
||||||
|
|
||||||
migration = DatabaseMigration(db_path)
|
migration = DatabaseMigration(db_path)
|
||||||
await migration.migrate_database()
|
await migration.migrate_database()
|
||||||
|
|
||||||
|
|
||||||
def main():
|
def main():
|
||||||
"""CLI entry point for migration"""
|
"""CLI entry point for migration"""
|
||||||
import argparse
|
import argparse
|
||||||
parser = argparse.ArgumentParser(description='Migrate Crawl4AI database to file-based storage')
|
|
||||||
parser.add_argument('--db-path', help='Custom database path')
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Migrate Crawl4AI database to file-based storage"
|
||||||
|
)
|
||||||
|
parser.add_argument("--db-path", help="Custom database path")
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
asyncio.run(run_migration(args.db_path))
|
asyncio.run(run_migration(args.db_path))
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
main()
|
main()
|
||||||
|
|||||||
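The migration above moves page blobs out of SQLite into content-addressed files: each blob is hashed, written once under its hash, and the database row keeps only the hash. A minimal synchronous sketch of that storage scheme — using stdlib `hashlib` where the real code uses `xxhash`, and plain `open` where it uses `aiofiles`:

```python
import hashlib
import os


def store_content(content: str, content_dir: str) -> str:
    """Write content once under its hash; return the hash ("" for empty input)."""
    if not content:
        return ""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(content_dir, content_hash)
    if not os.path.exists(file_path):  # identical content is stored only once
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


def load_content(content_hash: str, content_dir: str) -> str:
    """Resolve a hash stored in the database back to its file content."""
    if not content_hash:
        return ""
    with open(os.path.join(content_dir, content_hash), encoding="utf-8") as f:
        return f.read()
```

Because file names are derived from content, re-migrating the same page is idempotent and duplicate pages share a single file on disk.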
@@ -2,109 +2,125 @@ from functools import lru_cache
 from pathlib import Path
 import subprocess, os
 import shutil
-import tarfile
 from .model_loader import *
 import argparse
-import urllib.request
 from crawl4ai.config import MODEL_REPO_BRANCH

 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))


 @lru_cache()
 def get_available_memory(device):
     import torch
-    if device.type == 'cuda':
+
+    if device.type == "cuda":
         return torch.cuda.get_device_properties(device).total_memory
-    elif device.type == 'mps':
-        return 48 * 1024 ** 3  # Assuming 8GB for MPS, as a conservative estimate
+    elif device.type == "mps":
+        return 48 * 1024**3  # Assuming 8GB for MPS, as a conservative estimate
     else:
         return 0


 @lru_cache()
 def calculate_batch_size(device):
     available_memory = get_available_memory(device)

-    if device.type == 'cpu':
+    if device.type == "cpu":
         return 16
-    elif device.type in ['cuda', 'mps']:
+    elif device.type in ["cuda", "mps"]:
         # Adjust these thresholds based on your model size and available memory
-        if available_memory >= 31 * 1024 ** 3:  # > 32GB
+        if available_memory >= 31 * 1024**3:  # > 32GB
             return 256
-        elif available_memory >= 15 * 1024 ** 3:  # > 16GB to 32GB
+        elif available_memory >= 15 * 1024**3:  # > 16GB to 32GB
             return 128
-        elif available_memory >= 8 * 1024 ** 3:  # 8GB to 16GB
+        elif available_memory >= 8 * 1024**3:  # 8GB to 16GB
             return 64
         else:
             return 32
     else:
         return 16  # Default batch size


 @lru_cache()
 def get_device():
     import torch

     if torch.cuda.is_available():
-        device = torch.device('cuda')
+        device = torch.device("cuda")
     elif torch.backends.mps.is_available():
-        device = torch.device('mps')
+        device = torch.device("mps")
     else:
-        device = torch.device('cpu')
+        device = torch.device("cpu")
     return device


 def set_model_device(model):
     device = get_device()
     model.to(device)
     return model, device


 @lru_cache()
 def get_home_folder():
-    home_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+    home_folder = os.path.join(
+        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
+    )
     os.makedirs(home_folder, exist_ok=True)
     os.makedirs(f"{home_folder}/cache", exist_ok=True)
     os.makedirs(f"{home_folder}/models", exist_ok=True)
     return home_folder


 @lru_cache()
 def load_bert_base_uncased():
-    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
-    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
-    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
+    from transformers import BertTokenizer, BertModel
+
+    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
+    model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
     model.eval()
     model, device = set_model_device(model)
     return tokenizer, model


 @lru_cache()
 def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
     """Load the Hugging Face model for embedding.

     Args:
         model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".

     Returns:
         tuple: The tokenizer and model.
     """
-    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    from transformers import AutoTokenizer, AutoModel
+
     tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
     model = AutoModel.from_pretrained(model_name, resume_download=None)
     model.eval()
     model, device = set_model_device(model)
     return tokenizer, model


 @lru_cache()
 def load_text_classifier():
     from transformers import AutoTokenizer, AutoModelForSequenceClassification
     from transformers import pipeline
-    import torch

-    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
-    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    tokenizer = AutoTokenizer.from_pretrained(
+        "dstefa/roberta-base_topic_classification_nyt_news"
+    )
+    model = AutoModelForSequenceClassification.from_pretrained(
+        "dstefa/roberta-base_topic_classification_nyt_news"
+    )
     model.eval()
     model, device = set_model_device(model)
     pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
     return pipe


 @lru_cache()
 def load_text_multilabel_classifier():
     from transformers import AutoModelForSequenceClassification, AutoTokenizer
-    import numpy as np
     from scipy.special import expit
     import torch

@@ -116,18 +132,27 @@ def load_text_multilabel_classifier():
     # else:
     #     device = torch.device("cpu")
     #     # return load_spacy_model(), torch.device("cpu")

     MODEL = "cardiffnlp/tweet-topic-21-multi"
     tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
-    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
+    model = AutoModelForSequenceClassification.from_pretrained(
+        MODEL, resume_download=None
+    )
     model.eval()
     model, device = set_model_device(model)
     class_mapping = model.config.id2label

     def _classifier(texts, threshold=0.5, max_length=64):
-        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
-        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device
+        tokens = tokenizer(
+            texts,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=max_length,
+        )
+        tokens = {
+            key: val.to(device) for key, val in tokens.items()
+        }  # Move tokens to the selected device

         with torch.no_grad():
             output = model(**tokens)
@@ -138,35 +163,41 @@ def load_text_multilabel_classifier():

         batch_labels = []
         for prediction in predictions:
-            labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
+            labels = [
+                class_mapping[i] for i, value in enumerate(prediction) if value == 1
+            ]
             batch_labels.append(labels)

         return batch_labels

     return _classifier, device


 @lru_cache()
 def load_nltk_punkt():
     import nltk

     try:
-        nltk.data.find('tokenizers/punkt')
+        nltk.data.find("tokenizers/punkt")
     except LookupError:
-        nltk.download('punkt')
-    return nltk.data.find('tokenizers/punkt')
+        nltk.download("punkt")
+    return nltk.data.find("tokenizers/punkt")


 @lru_cache()
 def load_spacy_model():
     import spacy

     name = "models/reuters"
     home_folder = get_home_folder()
     model_folder = Path(home_folder) / name

     # Check if the model directory already exists
     if not (model_folder.exists() and any(model_folder.iterdir())):
         repo_url = "https://github.com/unclecode/crawl4ai.git"
         branch = MODEL_REPO_BRANCH
         repo_folder = Path(home_folder) / "crawl4ai"

         print("[LOG] ⏬ Downloading Spacy model for the first time...")

         # Remove existing repo folder if it exists
@@ -176,7 +207,9 @@ def load_spacy_model():
             if model_folder.exists():
                 shutil.rmtree(model_folder)
         except PermissionError:
-            print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+            print(
+                "[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
+            )
             print(f"- {repo_folder}")
             print(f"- {model_folder}")
             return None
@@ -187,7 +220,7 @@ def load_spacy_model():
                 ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                 stdout=subprocess.DEVNULL,
                 stderr=subprocess.DEVNULL,
-                check=True
+                check=True,
             )

             # Create the models directory if it doesn't exist
@@ -215,6 +248,7 @@ def load_spacy_model():
         print(f"Error loading spacy model: {e}")
         return None

+
 def download_all_models(remove_existing=False):
     """Download all models required for Crawl4AI."""
     if remove_existing:
@@ -243,14 +277,20 @@ def download_all_models(remove_existing=False):
     load_nltk_punkt()
     print("[LOG] ✅ All models downloaded successfully.")

+
 def main():
     print("[LOG] Welcome to the Crawl4AI Model Downloader!")
     print("[LOG] This script will download all the models required for Crawl4AI.")
     parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
-    parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
+    parser.add_argument(
+        "--remove-existing",
+        action="store_true",
+        help="Remove existing models before downloading",
+    )
     args = parser.parse_args()

     download_all_models(remove_existing=args.remove_existing)


 if __name__ == "__main__":
     main()
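The `calculate_batch_size` helper in this file is a pure threshold function over available memory, which makes the heuristic easy to check in isolation. A sketch of the same tiers, decoupled from `torch` so it runs anywhere (the device strings and thresholds mirror the diff above; the function name is ours):

```python
from functools import lru_cache

GiB = 1024**3


@lru_cache()
def batch_size_for(device_type: str, available_memory: int) -> int:
    """More accelerator memory -> larger batches, same tiers as the diff."""
    if device_type == "cpu":
        return 16
    if device_type in ("cuda", "mps"):
        if available_memory >= 31 * GiB:  # > 32GB
            return 256
        if available_memory >= 15 * GiB:  # > 16GB to 32GB
            return 128
        if available_memory >= 8 * GiB:  # 8GB to 16GB
            return 64
        return 32
    return 16  # default for unknown device types
```

Note the `>=` comparisons use 31 and 15 GiB rather than 32 and 16, leaving headroom for memory the driver reports but the process cannot actually allocate.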
@@ -1,21 +1,83 @@
|
|||||||
from pydantic import BaseModel, HttpUrl
|
from pydantic import BaseModel, HttpUrl
|
||||||
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
||||||
|
from enum import Enum
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
from .ssl_certificate import SSLCertificate
|
from .ssl_certificate import SSLCertificate
|
||||||
|
from datetime import datetime
|
||||||
|
from datetime import timedelta
|
||||||
|
|
||||||
|
|
||||||
|
###############################
|
||||||
|
# Dispatcher Models
|
||||||
|
###############################
|
||||||
|
@dataclass
|
||||||
|
class DomainState:
|
||||||
|
last_request_time: float = 0
|
||||||
|
current_delay: float = 0
|
||||||
|
fail_count: int = 0
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class CrawlerTaskResult:
|
||||||
|
task_id: str
|
||||||
|
url: str
|
||||||
|
result: "CrawlResult"
|
||||||
|
memory_usage: float
|
||||||
|
peak_memory: float
|
||||||
|
start_time: datetime
|
||||||
|
end_time: datetime
|
||||||
|
error_message: str = ""
|
||||||
|
|
||||||
|
|
||||||
|
class CrawlStatus(Enum):
|
||||||
|
QUEUED = "QUEUED"
|
||||||
|
IN_PROGRESS = "IN_PROGRESS"
|
||||||
|
COMPLETED = "COMPLETED"
|
||||||
|
FAILED = "FAILED"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class CrawlStats:
|
||||||
|
task_id: str
|
||||||
|
url: str
|
||||||
|
status: CrawlStatus
|
||||||
|
start_time: Optional[datetime] = None
|
||||||
|
end_time: Optional[datetime] = None
|
||||||
|
memory_usage: float = 0.0
|
||||||
|
peak_memory: float = 0.0
|
||||||
|
error_message: str = ""
|
||||||
|
|
||||||
|
@property
|
||||||
|
def duration(self) -> str:
|
||||||
|
if not self.start_time:
|
||||||
|
return "0:00"
|
||||||
|
end = self.end_time or datetime.now()
|
||||||
|
duration = end - self.start_time
|
||||||
|
return str(timedelta(seconds=int(duration.total_seconds())))
|
||||||
|
|
||||||
|
|
||||||
|
class DisplayMode(Enum):
|
||||||
|
DETAILED = "DETAILED"
|
||||||
|
AGGREGATED = "AGGREGATED"
|
||||||
|
|
||||||
|
|
||||||
|
###############################
|
||||||
|
# Crawler Models
|
||||||
|
###############################
|
||||||
@dataclass
|
@dataclass
|
||||||
class TokenUsage:
|
class TokenUsage:
|
||||||
completion_tokens: int = 0
|
completion_tokens: int = 0
|
||||||
prompt_tokens: int = 0
|
prompt_tokens: int = 0
|
||||||
total_tokens: int = 0
|
total_tokens: int = 0
|
||||||
completion_tokens_details: Optional[dict] = None
|
completion_tokens_details: Optional[dict] = None
|
||||||
prompt_tokens_details: Optional[dict] = None
|
prompt_tokens_details: Optional[dict] = None
|
||||||
|
|
||||||
|
|
||||||
class UrlModel(BaseModel):
|
class UrlModel(BaseModel):
|
||||||
url: HttpUrl
|
url: HttpUrl
|
||||||
forced: bool = False
|
forced: bool = False
|
||||||
|
|
||||||
|
|
||||||
class MarkdownGenerationResult(BaseModel):
|
class MarkdownGenerationResult(BaseModel):
|
||||||
raw_markdown: str
|
raw_markdown: str
|
||||||
markdown_with_citations: str
|
markdown_with_citations: str
|
||||||
@@ -23,6 +85,16 @@ class MarkdownGenerationResult(BaseModel):
|
|||||||
fit_markdown: Optional[str] = None
|
fit_markdown: Optional[str] = None
|
||||||
fit_html: Optional[str] = None
|
fit_html: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
class DispatchResult(BaseModel):
|
||||||
|
task_id: str
|
||||||
|
memory_usage: float
|
||||||
|
peak_memory: float
|
||||||
|
start_time: datetime
|
||||||
|
end_time: datetime
|
||||||
|
error_message: str = ""
|
||||||
|
|
||||||
|
|
||||||
class CrawlResult(BaseModel):
|
class CrawlResult(BaseModel):
|
||||||
url: str
|
url: str
|
||||||
html: str
|
html: str
|
||||||
@@ -32,7 +104,7 @@ class CrawlResult(BaseModel):
|
|||||||
links: Dict[str, List[Dict]] = {}
|
links: Dict[str, List[Dict]] = {}
|
||||||
downloaded_files: Optional[List[str]] = None
|
downloaded_files: Optional[List[str]] = None
|
||||||
screenshot: Optional[str] = None
|
screenshot: Optional[str] = None
|
||||||
pdf : Optional[bytes] = None
|
pdf: Optional[bytes] = None
|
||||||
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
||||||
markdown_v2: Optional[MarkdownGenerationResult] = None
|
markdown_v2: Optional[MarkdownGenerationResult] = None
|
||||||
fit_markdown: Optional[str] = None
|
fit_markdown: Optional[str] = None
|
||||||
@@ -44,9 +116,13 @@ class CrawlResult(BaseModel):
|
|||||||
response_headers: Optional[dict] = None
|
response_headers: Optional[dict] = None
|
||||||
status_code: Optional[int] = None
|
status_code: Optional[int] = None
|
||||||
ssl_certificate: Optional[SSLCertificate] = None
|
ssl_certificate: Optional[SSLCertificate] = None
|
||||||
|
dispatch_result: Optional[DispatchResult] = None
|
||||||
|
redirected_url: Optional[str] = None
|
||||||
|
|
||||||
class Config:
|
class Config:
|
||||||
arbitrary_types_allowed = True
|
arbitrary_types_allowed = True
|
||||||
|
|
||||||
|
|
||||||
class AsyncCrawlResponse(BaseModel):
|
class AsyncCrawlResponse(BaseModel):
|
||||||
html: str
|
html: str
|
||||||
response_headers: Dict[str, str]
|
response_headers: Dict[str, str]
|
||||||
@@ -56,6 +132,51 @@ class AsyncCrawlResponse(BaseModel):
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None

    class Config:
        arbitrary_types_allowed = True


###############################
# Scraping Models
###############################
class MediaItem(BaseModel):
    src: Optional[str] = ""
    alt: Optional[str] = ""
    desc: Optional[str] = ""
    score: Optional[int] = 0
    type: str = "image"
    group_id: Optional[int] = 0
    format: Optional[str] = None
    width: Optional[int] = None


class Link(BaseModel):
    href: Optional[str] = ""
    text: Optional[str] = ""
    title: Optional[str] = ""
    base_domain: Optional[str] = ""


class Media(BaseModel):
    images: List[MediaItem] = []
    videos: List[
        MediaItem
    ] = []  # Using MediaItem model for now, can be extended with Video model if needed
    audios: List[
        MediaItem
    ] = []  # Using MediaItem model for now, can be extended with Audio model if needed


class Links(BaseModel):
    internal: List[Link] = []
    external: List[Link] = []


class ScrapingResult(BaseModel):
    cleaned_html: str
    success: bool
    media: Media = Media()
    links: Links = Links()
    metadata: Dict[str, Any] = {}
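The scraping models above are pydantic classes, but their nesting is easiest to see with a dependency-free sketch. The following stand-ins use stdlib dataclasses instead of `pydantic.BaseModel` (an assumption made purely so the example runs anywhere) and mirror how a `ScrapingResult` composes `Media`, `Links`, and their item models:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified, dependency-free stand-ins for the pydantic models above,
# shown only to illustrate how the pieces nest inside a ScrapingResult.
@dataclass
class MediaItem:
    src: Optional[str] = ""
    alt: Optional[str] = ""
    type: str = "image"

@dataclass
class Link:
    href: Optional[str] = ""
    text: Optional[str] = ""

@dataclass
class Media:
    images: List[MediaItem] = field(default_factory=list)

@dataclass
class Links:
    internal: List[Link] = field(default_factory=list)
    external: List[Link] = field(default_factory=list)

@dataclass
class ScrapingResult:
    cleaned_html: str
    success: bool
    media: Media = field(default_factory=Media)
    links: Links = field(default_factory=Links)

# Build a result the way a scraping strategy would populate it.
result = ScrapingResult(
    cleaned_html="<h1>Hello</h1>",
    success=True,
    media=Media(images=[MediaItem(src="/logo.png", alt="Logo")]),
    links=Links(external=[Link(href="https://example.com", text="Example")]),
)
print(result.media.images[0].src)       # /logo.png
print(result.links.external[0].href)    # https://example.com
```

In the real models the mutable defaults (`images: List[MediaItem] = []`) are safe because pydantic deep-copies field defaults per instance; plain dataclasses need `default_factory` for the same effect.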
@@ -202,3 +202,808 @@ Avoid Common Mistakes:

Result
Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""


PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.

INPUT HTML:
<|HTML_CONTENT_START|>
{HTML}
<|HTML_CONTENT_END|>

SPECIFIC INSTRUCTION:
<|USER_INSTRUCTION_START|>
{REQUEST}
<|USER_INSTRUCTION_END|>

TASK DETAILS:
1. Content Selection
   - DO: Keep essential information, main content, key details
   - DO: Preserve hierarchical structure using markdown headers
   - DO: Keep code blocks, tables, key lists
   - DON'T: Include navigation menus, ads, footers, cookie notices
   - DON'T: Keep social media widgets, sidebars, related content

2. Content Transformation
   - DO: Use proper markdown syntax (#, ##, **, `, etc)
   - DO: Convert tables to markdown tables
   - DO: Preserve code formatting with ```language blocks
   - DO: Maintain link texts but remove tracking parameters
   - DON'T: Include HTML tags in output
   - DON'T: Keep class names, ids, or other HTML attributes

3. Content Organization
   - DO: Maintain logical flow of information
   - DO: Group related content under appropriate headers
   - DO: Use consistent header levels
   - DON'T: Fragment related content
   - DON'T: Duplicate information

Example Input:
<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
<div class="sidebar">Related articles...</div>

Example Output:
# Setup Guide
Follow these steps...

IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.

OUTPUT FORMAT:
Wrap your response in <content> tags. Use proper markdown throughout.
<content>
[Your markdown content here]
</content>

Begin filtering now."""
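The template above is an ordinary `str.format` string with `{HTML}` and `{REQUEST}` placeholders (the `<|...|>` sentinels contain no braces, so they survive formatting untouched). A minimal sketch of filling it, using a cut-down template rather than repeating the full constant:

```python
# A cut-down template with the same {HTML} / {REQUEST} placeholders as
# PROMPT_FILTER_CONTENT above; the full constant is too long to repeat here.
TEMPLATE = (
    "INPUT HTML:\n<|HTML_CONTENT_START|>\n{HTML}\n<|HTML_CONTENT_END|>\n\n"
    "SPECIFIC INSTRUCTION:\n<|USER_INSTRUCTION_START|>\n{REQUEST}\n<|USER_INSTRUCTION_END|>"
)

prompt = TEMPLATE.format(
    HTML="<div class='main-content'><h1>Setup Guide</h1></div>",
    REQUEST="Keep only the installation steps.",
)
print(prompt)
```

If the HTML being substituted could itself contain `{` or `}`, a safer approach is `string.Template` or plain `str.replace`, since `str.format` would treat stray braces as placeholders.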
JSON_SCHEMA_BUILDER = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.

## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection

## Available Schema Types You Can Generate:

<schema_types>
1. Basic Single-Level Schema
   - Use for simple, flat data structures
   - Example: Product cards, user profiles
   - Direct field extractions

2. Nested Object Schema
   - Use for hierarchical data
   - Example: Articles with author details
   - Contains objects within objects

3. List Schema
   - Use for repeating elements
   - Example: Comment sections, product lists
   - Handles arrays of similar items

4. Complex Nested Lists
   - Use for multi-level data
   - Example: Categories with subcategories
   - Multiple levels of nesting

5. Transformation Schema
   - Use for data requiring processing
   - Supports regex and text transformations
   - Special attribute handling
</schema_types>

<schema_structure>
Your output must always be a JSON object with this structure:
{
  "name": "Descriptive name of the pattern",
  "baseSelector": "CSS selector for the repeating element",
  "fields": [
    {
      "name": "field_name",
      "selector": "CSS selector",
      "type": "text|attribute|nested|list|regex",
      "attribute": "attribute_name", // Optional
      "transform": "transformation_type", // Optional
      "pattern": "regex_pattern", // Optional
      "fields": [] // For nested/list types
    }
  ]
}
</schema_structure>

<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>

<behavior_rules>
1. When given a specific query:
   - Focus on extracting requested data points
   - Use most specific selectors possible
   - Include all fields mentioned in the query

2. When no query is provided:
   - Identify main content areas
   - Extract all meaningful data points
   - Use semantic structure to determine importance
   - Include prices, dates, titles, and other common data types

3. Always:
   - Use reliable CSS selectors
   - Handle dynamic class names appropriately
   - Create descriptive field names
   - Follow consistent naming conventions
</behavior_rules>

<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
  <h2 class="product-title">Gaming Laptop</h2>
  <span class="price">$999.99</span>
  <img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>

Generated Schema:
{
  "name": "Product Cards",
  "baseSelector": ".product-card",
  "baseFields": [
    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
  ],
  "fields": [
    {
      "name": "title",
      "selector": ".product-title",
      "type": "text"
    },
    {
      "name": "price",
      "selector": ".price",
      "type": "text"
    },
    {
      "name": "image_url",
      "selector": "img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
}

2. Article with Author Details Example:
<html>
<article>
  <h1>The Future of AI</h1>
  <div class="author-info">
    <span class="author-name">Dr. Smith</span>
    <img src="author.jpg" alt="Dr. Smith">
  </div>
</article>
</html>

Generated Schema:
{
  "name": "Article Details",
  "baseSelector": "article",
  "fields": [
    {
      "name": "title",
      "selector": "h1",
      "type": "text"
    },
    {
      "name": "author",
      "type": "nested",
      "selector": ".author-info",
      "fields": [
        {
          "name": "name",
          "selector": ".author-name",
          "type": "text"
        },
        {
          "name": "avatar",
          "selector": "img",
          "type": "attribute",
          "attribute": "src"
        }
      ]
    }
  ]
}

3. Comments Section Example:
<html>
<div class="comments-container">
  <div class="comment" data-user-id="123">
    <div class="user-name">John123</div>
    <p class="comment-text">Great article!</p>
  </div>
  <div class="comment" data-user-id="456">
    <div class="user-name">Alice456</div>
    <p class="comment-text">Thanks for sharing.</p>
  </div>
</div>
</html>

Generated Schema:
{
  "name": "Comment Section",
  "baseSelector": ".comments-container",
  "baseFields": [
    {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
  ],
  "fields": [
    {
      "name": "comments",
      "type": "list",
      "selector": ".comment",
      "fields": [
        {
          "name": "user",
          "selector": ".user-name",
          "type": "text"
        },
        {
          "name": "content",
          "selector": ".comment-text",
          "type": "text"
        }
      ]
    }
  ]
}

4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
  <h2>Electronics</h2>
  <div class="subcategory">
    <h3>Laptops</h3>
    <div class="product">
      <span class="product-name">MacBook Pro</span>
      <span class="price">$1299</span>
    </div>
    <div class="product">
      <span class="product-name">Dell XPS</span>
      <span class="price">$999</span>
    </div>
  </div>
</div>
</html>

Generated Schema:
{
  "name": "E-commerce Categories",
  "baseSelector": ".category-section",
  "baseFields": [
    {"name": "data_category", "type": "attribute", "attribute": "data-category"}
  ],
  "fields": [
    {
      "name": "category_name",
      "selector": "h2",
      "type": "text"
    },
    {
      "name": "subcategories",
      "type": "nested_list",
      "selector": ".subcategory",
      "fields": [
        {
          "name": "name",
          "selector": "h3",
          "type": "text"
        },
        {
          "name": "products",
          "type": "list",
          "selector": ".product",
          "fields": [
            {
              "name": "name",
              "selector": ".product-name",
              "type": "text"
            },
            {
              "name": "price",
              "selector": ".price",
              "type": "text"
            }
          ]
        }
      ]
    }
  ]
}

5. Job Listings with Transformations Example:
<html>
<div class="job-post">
  <h3 class="job-title">Senior Developer</h3>
  <span class="salary-text">Salary: $120,000/year</span>
  <span class="location"> New York, NY </span>
</div>
</html>

Generated Schema:
{
  "name": "Job Listings",
  "baseSelector": ".job-post",
  "fields": [
    {
      "name": "title",
      "selector": ".job-title",
      "type": "text",
      "transform": "uppercase"
    },
    {
      "name": "salary",
      "selector": ".salary-text",
      "type": "regex",
      "pattern": "\\$([\\d,]+)"
    },
    {
      "name": "location",
      "selector": ".location",
      "type": "text",
      "transform": "strip"
    }
  ]
}

6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
  <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
    <div class="PlaceCard_nameContent__ODUwZ">
      <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
    </div>
    <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
  </div>
  <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
  <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
    <div class="PriceDescription_container__NjEzM">
      <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
    </div>
  </a>
</div>
</html>

Generated Schema:
{
  "name": "Skyscanner Place Cards",
  "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
  "baseFields": [
    {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
  ],
  "fields": [
    {
      "name": "city_name",
      "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
      "type": "text"
    },
    {
      "name": "country_name",
      "selector": "span[class*='PlaceCard_subName__']",
      "type": "text"
    },
    {
      "name": "description",
      "selector": "span[class*='PlaceCard_advertLabel__']",
      "type": "text"
    },
    {
      "name": "flight_price",
      "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
      "type": "text"
    },
    {
      "name": "flight_url",
      "selector": "a[data-testid='flights-link']",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
</examples>

<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid CSS selectors
</output_requirements>
"""

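The `<output_requirements>` above mandate valid JSON in a fixed shape. A caller consuming the model's output would typically sanity-check it before use; the sketch below is not part of crawl4ai, just one way such a validator could look (the `nested_list` type is included because example 4 above uses it):

```python
import json

# Field types the prompt's examples actually use.
ALLOWED_TYPES = {"text", "attribute", "nested", "list", "nested_list", "regex"}

def check_fields(fields):
    """Recursively collect structural problems in a schema's field list."""
    problems = []
    for f in fields:
        if "name" not in f or "type" not in f:
            problems.append(f"field missing 'name' or 'type': {f}")
            continue
        if f["type"] not in ALLOWED_TYPES:
            problems.append(f"unknown type {f['type']!r} in field {f['name']!r}")
        if f["type"] == "attribute" and "attribute" not in f:
            problems.append(f"attribute field {f['name']!r} lacks 'attribute'")
        if f["type"] in ("nested", "list", "nested_list"):
            problems += check_fields(f.get("fields", []))
    return problems

def validate_schema(raw: str):
    """Parse the model output and return a list of problems (empty = OK)."""
    schema = json.loads(raw)  # requirement 1: the output must be valid JSON
    problems = [f"missing top-level key: {k}"
                for k in ("name", "baseSelector", "fields") if k not in schema]
    problems += check_fields(schema.get("fields", []))
    return problems

good = ('{"name": "Job Listings", "baseSelector": ".job-post", '
        '"fields": [{"name": "title", "selector": ".job-title", "type": "text"}]}')
bad = '{"name": "Broken", "fields": [{"name": "img", "type": "attribute"}]}'
print(validate_schema(good))  # []
print(len(validate_schema(bad)))  # 2: missing baseSelector, missing 'attribute'
```

A stricter implementation could also verify each `selector` parses as CSS, but that needs a third-party parser and is omitted here.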
JSON_SCHEMA_BUILDER_XPATH = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.

## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate XPath selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection

## Available Schema Types You Can Generate:

<schema_types>
1. Basic Single-Level Schema
   - Use for simple, flat data structures
   - Example: Product cards, user profiles
   - Direct field extractions

2. Nested Object Schema
   - Use for hierarchical data
   - Example: Articles with author details
   - Contains objects within objects

3. List Schema
   - Use for repeating elements
   - Example: Comment sections, product lists
   - Handles arrays of similar items

4. Complex Nested Lists
   - Use for multi-level data
   - Example: Categories with subcategories
   - Multiple levels of nesting

5. Transformation Schema
   - Use for data requiring processing
   - Supports regex and text transformations
   - Special attribute handling
</schema_types>

<schema_structure>
Your output must always be a JSON object with this structure:
{
  "name": "Descriptive name of the pattern",
  "baseSelector": "XPath selector for the repeating element",
  "fields": [
    {
      "name": "field_name",
      "selector": "XPath selector",
      "type": "text|attribute|nested|list|regex",
      "attribute": "attribute_name", // Optional
      "transform": "transformation_type", // Optional
      "pattern": "regex_pattern", // Optional
      "fields": [] // For nested/list types
    }
  ]
}
</schema_structure>

<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>

<behavior_rules>
1. When given a specific query:
   - Focus on extracting requested data points
   - Use most specific selectors possible
   - Include all fields mentioned in the query

2. When no query is provided:
   - Identify main content areas
   - Extract all meaningful data points
   - Use semantic structure to determine importance
   - Include prices, dates, titles, and other common data types

3. Always:
   - Use reliable XPath selectors
   - Handle dynamic element IDs appropriately
   - Create descriptive field names
   - Follow consistent naming conventions
</behavior_rules>

<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
  <h2 class="product-title">Gaming Laptop</h2>
  <span class="price">$999.99</span>
  <img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>

Generated Schema:
{
  "name": "Product Cards",
  "baseSelector": "//div[@class='product-card']",
  "baseFields": [
    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
  ],
  "fields": [
    {
      "name": "title",
      "selector": ".//h2[@class='product-title']",
      "type": "text"
    },
    {
      "name": "price",
      "selector": ".//span[@class='price']",
      "type": "text"
    },
    {
      "name": "image_url",
      "selector": ".//img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
}

2. Article with Author Details Example:
<html>
<article>
  <h1>The Future of AI</h1>
  <div class="author-info">
    <span class="author-name">Dr. Smith</span>
    <img src="author.jpg" alt="Dr. Smith">
  </div>
</article>
</html>

Generated Schema:
{
  "name": "Article Details",
  "baseSelector": "//article",
  "fields": [
    {
      "name": "title",
      "selector": ".//h1",
      "type": "text"
    },
    {
      "name": "author",
      "type": "nested",
      "selector": ".//div[@class='author-info']",
      "fields": [
        {
          "name": "name",
          "selector": ".//span[@class='author-name']",
          "type": "text"
        },
        {
          "name": "avatar",
          "selector": ".//img",
          "type": "attribute",
          "attribute": "src"
        }
      ]
    }
  ]
}

3. Comments Section Example:
<html>
<div class="comments-container">
  <div class="comment" data-user-id="123">
    <div class="user-name">John123</div>
    <p class="comment-text">Great article!</p>
  </div>
  <div class="comment" data-user-id="456">
    <div class="user-name">Alice456</div>
    <p class="comment-text">Thanks for sharing.</p>
  </div>
</div>
</html>

Generated Schema:
{
  "name": "Comment Section",
  "baseSelector": "//div[@class='comments-container']",
  "fields": [
    {
      "name": "comments",
      "type": "list",
      "selector": ".//div[@class='comment']",
      "baseFields": [
        {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
      ],
      "fields": [
        {
          "name": "user",
          "selector": ".//div[@class='user-name']",
          "type": "text"
        },
        {
          "name": "content",
          "selector": ".//p[@class='comment-text']",
          "type": "text"
        }
      ]
    }
  ]
}

4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
  <h2>Electronics</h2>
  <div class="subcategory">
    <h3>Laptops</h3>
    <div class="product">
      <span class="product-name">MacBook Pro</span>
      <span class="price">$1299</span>
    </div>
    <div class="product">
      <span class="product-name">Dell XPS</span>
      <span class="price">$999</span>
    </div>
  </div>
</div>
</html>

Generated Schema:
{
  "name": "E-commerce Categories",
  "baseSelector": "//div[@class='category-section']",
  "baseFields": [
    {"name": "data_category", "type": "attribute", "attribute": "data-category"}
  ],
  "fields": [
    {
      "name": "category_name",
      "selector": ".//h2",
      "type": "text"
    },
    {
      "name": "subcategories",
      "type": "nested_list",
      "selector": ".//div[@class='subcategory']",
      "fields": [
        {
          "name": "name",
          "selector": ".//h3",
          "type": "text"
        },
        {
          "name": "products",
          "type": "list",
          "selector": ".//div[@class='product']",
          "fields": [
            {
              "name": "name",
              "selector": ".//span[@class='product-name']",
              "type": "text"
            },
            {
              "name": "price",
              "selector": ".//span[@class='price']",
              "type": "text"
            }
          ]
        }
      ]
    }
  ]
}

5. Job Listings with Transformations Example:
<html>
<div class="job-post">
  <h3 class="job-title">Senior Developer</h3>
  <span class="salary-text">Salary: $120,000/year</span>
  <span class="location"> New York, NY </span>
</div>
</html>

Generated Schema:
{
  "name": "Job Listings",
  "baseSelector": "//div[@class='job-post']",
  "fields": [
    {
      "name": "title",
      "selector": ".//h3[@class='job-title']",
      "type": "text",
      "transform": "uppercase"
    },
    {
      "name": "salary",
      "selector": ".//span[@class='salary-text']",
      "type": "regex",
      "pattern": "\\$([\\d,]+)"
    },
    {
      "name": "location",
      "selector": ".//span[@class='location']",
      "type": "text",
      "transform": "strip"
    }
  ]
}

6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
  <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
    <div class="PlaceCard_nameContent__ODUwZ">
      <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
    </div>
    <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
  </div>
  <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
  <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
    <div class="PriceDescription_container__NjEzM">
      <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
    </div>
  </a>
</div>
</html>

Generated Schema:
{
  "name": "Skyscanner Place Cards",
  "baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
  "baseFields": [
    {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
  ],
  "fields": [
    {
      "name": "city_name",
      "selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
      "type": "text"
    },
    {
      "name": "country_name",
      "selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
      "type": "text"
    },
    {
      "name": "description",
      "selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
      "type": "text"
    },
    {
      "name": "flight_price",
      "selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
      "type": "text"
    },
    {
      "name": "flight_url",
      "selector": ".//a[@data-testid='flights-link']",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
</examples>

<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid XPath selectors
</output_requirements>
"""

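An XPath schema like the ones generated above can be applied with nothing but the standard library, since `xml.etree.ElementTree` understands a limited XPath subset (relative `.//` paths and `[@attr='value']` predicates, but not `contains()` or absolute `//` paths). The sketch below is not crawl4ai's extraction strategy, just a minimal illustration against the product-card example (with the `img` tag self-closed so the markup parses as XML):

```python
import xml.etree.ElementTree as ET

HTML = """
<html>
  <div class="product-card">
    <h2 class="product-title">Gaming Laptop</h2>
    <span class="price">$999.99</span>
    <img src="laptop.jpg" alt="Gaming Laptop" />
  </div>
</html>
"""

# Note the relative ".//" base selector: ElementTree does not accept the
# absolute "//div[...]" form used in the prompt's examples.
SCHEMA = {
    "baseSelector": ".//div[@class='product-card']",
    "fields": [
        {"name": "title", "selector": ".//h2[@class='product-title']", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
        {"name": "image_url", "selector": ".//img", "type": "attribute", "attribute": "src"},
    ],
}

def extract(root: ET.Element, schema: dict) -> list:
    """Apply an XPath-style schema: one dict per base element match."""
    items = []
    for base in root.findall(schema["baseSelector"]):
        item = {}
        for f in schema["fields"]:
            el = base.find(f["selector"])
            if el is None:
                continue  # field absent in this element; skip it
            if f["type"] == "text":
                item[f["name"]] = (el.text or "").strip()
            elif f["type"] == "attribute":
                item[f["name"]] = el.get(f["attribute"])
        items.append(item)
    return items

root = ET.fromstring(HTML)
print(extract(root, SCHEMA))
# [{'title': 'Gaming Laptop', 'price': '$999.99', 'image_url': 'laptop.jpg'}]
```

Real pages are rarely well-formed XML, which is why a production extractor would pair these schemas with an HTML-tolerant parser (e.g. lxml.html) rather than ElementTree.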
@@ -13,10 +13,10 @@ from pathlib import Path
|
|||||||
class SSLCertificate:
|
class SSLCertificate:
|
||||||
"""
|
"""
|
||||||
A class representing an SSL certificate with methods to export in various formats.
|
A class representing an SSL certificate with methods to export in various formats.
|
||||||
|
|
||||||
Attributes:
|
Attributes:
|
||||||
cert_info (Dict[str, Any]): The certificate information.
|
cert_info (Dict[str, Any]): The certificate information.
|
||||||
|
|
||||||
Methods:
|
Methods:
|
||||||
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
|
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
|
||||||
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
|
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
|
||||||
@@ -26,32 +26,35 @@ class SSLCertificate:
|
|||||||
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
|
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
|
||||||
export_as_text() -> str: Export the certificate as text format.
|
export_as_text() -> str: Export the certificate as text format.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self, cert_info: Dict[str, Any]):
|
def __init__(self, cert_info: Dict[str, Any]):
|
||||||
self._cert_info = self._decode_cert_data(cert_info)
|
self._cert_info = self._decode_cert_data(cert_info)
|
||||||
    @staticmethod
    def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
        """
        Create SSLCertificate instance from a URL.

        Args:
            url (str): URL of the website.
            timeout (int): Timeout for the connection (default: 10).

        Returns:
            Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
        """
        try:
            hostname = urlparse(url).netloc
            if ":" in hostname:
                hostname = hostname.split(":")[0]

            context = ssl.create_default_context()
            with socket.create_connection((hostname, 443), timeout=timeout) as sock:
                with context.wrap_socket(sock, server_hostname=hostname) as ssock:
                    cert_binary = ssock.getpeercert(binary_form=True)
                    x509 = OpenSSL.crypto.load_certificate(
                        OpenSSL.crypto.FILETYPE_ASN1, cert_binary
                    )

                    cert_info = {
                        "subject": dict(x509.get_subject().get_components()),
                        "issuer": dict(x509.get_issuer().get_components()),
@@ -61,32 +64,33 @@ class SSLCertificate:
                        "not_after": x509.get_notAfter(),
                        "fingerprint": x509.digest("sha256").hex(),
                        "signature_algorithm": x509.get_signature_algorithm(),
                        "raw_cert": base64.b64encode(cert_binary),
                    }

                    # Add extensions
                    extensions = []
                    for i in range(x509.get_extension_count()):
                        ext = x509.get_extension(i)
                        extensions.append(
                            {"name": ext.get_short_name(), "value": str(ext)}
                        )
                    cert_info["extensions"] = extensions

                    return SSLCertificate(cert_info)

        except Exception:
            return None

    @staticmethod
    def _decode_cert_data(data: Any) -> Any:
        """Helper method to decode bytes in certificate data."""
        if isinstance(data, bytes):
            return data.decode("utf-8")
        elif isinstance(data, dict):
            return {
                (
                    k.decode("utf-8") if isinstance(k, bytes) else k
                ): SSLCertificate._decode_cert_data(v)
                for k, v in data.items()
            }
        elif isinstance(data, list):
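The recursive `_decode_cert_data` helper above is easy to exercise on its own; the sketch below reimplements the same bytes-to-str walk as a standalone function (the function name and sample data are illustrative, not part of the library):

```python
from typing import Any


def decode_cert_data(data: Any) -> Any:
    # Recursively turn bytes keys/values (as pyOpenSSL returns them) into str
    if isinstance(data, bytes):
        return data.decode("utf-8")
    if isinstance(data, dict):
        return {
            (k.decode("utf-8") if isinstance(k, bytes) else k): decode_cert_data(v)
            for k, v in data.items()
        }
    if isinstance(data, list):
        return [decode_cert_data(item) for item in data]
    return data


raw = {b"CN": b"example.com", b"extensions": [{b"name": b"keyUsage"}]}
print(decode_cert_data(raw))
# {'CN': 'example.com', 'extensions': [{'name': 'keyUsage'}]}
```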
@@ -96,58 +100,57 @@ class SSLCertificate:
    def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
        """
        Export certificate as JSON.

        Args:
            filepath (Optional[str]): Path to save the JSON file (default: None).

        Returns:
            Optional[str]: JSON string if successful, None otherwise.
        """
        json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
        if filepath:
            Path(filepath).write_text(json_str, encoding="utf-8")
            return None
        return json_str

    def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
        """
        Export certificate as PEM.

        Args:
            filepath (Optional[str]): Path to save the PEM file (default: None).

        Returns:
            Optional[str]: PEM string if successful, None otherwise.
        """
        try:
            x509 = OpenSSL.crypto.load_certificate(
                OpenSSL.crypto.FILETYPE_ASN1,
                base64.b64decode(self._cert_info["raw_cert"]),
            )
            pem_data = OpenSSL.crypto.dump_certificate(
                OpenSSL.crypto.FILETYPE_PEM, x509
            ).decode("utf-8")

            if filepath:
                Path(filepath).write_text(pem_data, encoding="utf-8")
                return None
            return pem_data
        except Exception:
            return None

    def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
        """
        Export certificate as DER.

        Args:
            filepath (Optional[str]): Path to save the DER file (default: None).

        Returns:
            Optional[bytes]: DER bytes if successful, None otherwise.
        """
        try:
            der_data = base64.b64decode(self._cert_info["raw_cert"])
            if filepath:
                Path(filepath).write_bytes(der_data)
                return None
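`to_der` above is just a base64 decode of the stored `raw_cert`; the round-trip can be sketched with stdlib `base64` alone (the bytes are dummy stand-ins, not a real certificate):

```python
import base64

# The class keeps the DER certificate base64-encoded under "raw_cert";
# exporting DER is a plain decode of that field.
raw_der = b"\x30\x82\x01\x0a"  # stand-in for real DER bytes
cert_info = {"raw_cert": base64.b64encode(raw_der)}

recovered = base64.b64decode(cert_info["raw_cert"])
assert recovered == raw_der
```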
@@ -158,24 +161,24 @@ class SSLCertificate:
    @property
    def issuer(self) -> Dict[str, str]:
        """Get certificate issuer information."""
        return self._cert_info.get("issuer", {})

    @property
    def subject(self) -> Dict[str, str]:
        """Get certificate subject information."""
        return self._cert_info.get("subject", {})

    @property
    def valid_from(self) -> str:
        """Get certificate validity start date."""
        return self._cert_info.get("not_before", "")

    @property
    def valid_until(self) -> str:
        """Get certificate validity end date."""
        return self._cert_info.get("not_after", "")

    @property
    def fingerprint(self) -> str:
        """Get certificate fingerprint."""
        return self._cert_info.get("fingerprint", "")
@@ -6,7 +6,7 @@ import re
class UserAgentGenerator:
    """
    Generate random user agents with specified constraints.

    Attributes:
        desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
        mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
@@ -18,7 +18,7 @@ class UserAgentGenerator:
        safari_versions (list): A list of possible Safari browser versions.
        ios_versions (list): A list of possible iOS browser versions.
        android_versions (list): A list of possible Android browser versions.

    Methods:
        generate_user_agent(
            platform: Literal["desktop", "mobile"] = "desktop",
@@ -30,8 +30,9 @@ class UserAgentGenerator:
            safari_version: Optional[str] = None,
            ios_version: Optional[str] = None,
            android_version: Optional[str] = None
        ): Generates a random user agent string based on the specified parameters.
    """

    def __init__(self):
        # Previous platform definitions remain the same...
        self.desktop_platforms = {
@@ -47,7 +48,7 @@ class UserAgentGenerator:
                "generic": "(X11; Linux x86_64)",
                "ubuntu": "(X11; Ubuntu; Linux x86_64)",
                "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
            },
        }

        self.mobile_platforms = {
@@ -60,26 +61,14 @@ class UserAgentGenerator:
            "ios": {
                "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
                "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
            },
        }

        # Browser Combinations
        self.browser_combinations = {
            1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
            2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
            3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
        }

        # Rendering Engines with versions
@@ -90,7 +79,7 @@ class UserAgentGenerator:
                "Gecko/20100101",
                "Gecko/20100101",  # Firefox usually uses this constant version
                "Gecko/2010010",
            ],
        }

        # Browser Versions
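`get_browser_stack` draws from the `browser_combinations` table above via `random.choice`; a minimal standalone sketch of that draw (the dict is copied from the diff, and the seed only makes the demo reproducible):

```python
import random

# One combination is drawn from the list registered for the requested
# browser count; each entry in key N has exactly N browsers.
browser_combinations = {
    1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
    2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
    3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
}

random.seed(0)
combo = random.choice(browser_combinations[2])
print(combo)
```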
@@ -135,25 +124,25 @@ class UserAgentGenerator:
    def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
        """
        Get a valid combination of browser versions.

        How it works:
        1. Check if the number of browsers is supported.
        2. Randomly choose a combination of browsers.
        3. Iterate through the combination and add browser versions.
        4. Return the browser stack.

        Args:
            num_browsers: Number of browser specifications (1-3)

        Returns:
            List[str]: A list of browser versions.
        """
        if num_browsers not in self.browser_combinations:
            raise ValueError(f"Unsupported number of browsers: {num_browsers}")

        combination = random.choice(self.browser_combinations[num_browsers])
        browser_stack = []

        for browser in combination:
            if browser == "chrome":
                browser_stack.append(random.choice(self.chrome_versions))
@@ -167,18 +156,20 @@ class UserAgentGenerator:
                browser_stack.append(random.choice(self.rendering_engines["gecko"]))
            elif browser == "webkit":
                browser_stack.append(self.rendering_engines["chrome_webkit"])

        return browser_stack

    def generate(
        self,
        device_type: Optional[Literal["desktop", "mobile"]] = None,
        os_type: Optional[str] = None,
        device_brand: Optional[str] = None,
        browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
        num_browsers: int = 3,
    ) -> str:
        """
        Generate a random user agent with specified constraints.

        Args:
            device_type: 'desktop' or 'mobile'
            os_type: 'windows', 'macos', 'linux', 'android', 'ios'
@@ -188,23 +179,23 @@ class UserAgentGenerator:
        """
        # Get platform string
        platform = self.get_random_platform(device_type, os_type, device_brand)

        # Start with Mozilla
        components = ["Mozilla/5.0", platform]

        # Add browser stack
        browser_stack = self.get_browser_stack(num_browsers)

        # Add appropriate legacy token based on browser stack
        if "Firefox" in str(browser_stack):
            components.append(random.choice(self.rendering_engines["gecko"]))
        elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
            components.append(self.rendering_engines["chrome_webkit"])
            components.append("(KHTML, like Gecko)")

        # Add browser versions
        components.extend(browser_stack)

        return " ".join(components)

    def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
@@ -215,16 +206,20 @@ class UserAgentGenerator:

    def get_random_platform(self, device_type, os_type, device_brand):
        """Helper method to get random platform based on constraints"""
        platforms = (
            self.desktop_platforms
            if device_type == "desktop"
            else self.mobile_platforms
            if device_type == "mobile"
            else {**self.desktop_platforms, **self.mobile_platforms}
        )

        if os_type:
            for platform_group in [self.desktop_platforms, self.mobile_platforms]:
                if os_type in platform_group:
                    platforms = {os_type: platform_group[os_type]}
                    break

        os_key = random.choice(list(platforms.keys()))
        if device_brand and device_brand in platforms[os_key]:
            return platforms[os_key][device_brand]
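The chained conditional in `get_random_platform` above can be read as: device type selects a platform group, then OS type narrows it to one entry. A trimmed standalone sketch (the dicts are abbreviated stand-ins for the full tables, and the function name is illustrative):

```python
# Abbreviated platform tables for the demo
desktop = {"linux": {"generic": "(X11; Linux x86_64)"}}
mobile = {"ios": {"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)"}}


def pick_platforms(device_type, os_type):
    # Step 1: device_type chooses desktop, mobile, or the merged pool
    platforms = (
        desktop if device_type == "desktop"
        else mobile if device_type == "mobile"
        else {**desktop, **mobile}
    )
    # Step 2: an os_type constraint overrides that choice with a single OS
    if os_type:
        for group in (desktop, mobile):
            if os_type in group:
                platforms = {os_type: group[os_type]}
                break
    return platforms
```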
@@ -233,73 +228,72 @@ class UserAgentGenerator:
    def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
        """Parse a user agent string to extract browser and version information"""
        browsers = {
            "chrome": r"Chrome/(\d+)",
            "edge": r"Edg/(\d+)",
            "safari": r"Version/(\d+)",
            "firefox": r"Firefox/(\d+)",
        }

        result = {}
        for browser, pattern in browsers.items():
            match = re.search(pattern, user_agent)
            if match:
                result[browser] = match.group(1)

        return result

    def generate_client_hints(self, user_agent: str) -> str:
        """Generate Sec-CH-UA header value based on user agent string"""
        browsers = self.parse_user_agent(user_agent)

        # Client hints components
        hints = []

        # Handle different browser combinations
        if "chrome" in browsers:
            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
            hints.append('"Not_A Brand";v="8"')

            if "edge" in browsers:
                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
            else:
                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')

        elif "firefox" in browsers:
            # Firefox doesn't typically send Sec-CH-UA
            return '""'

        elif "safari" in browsers:
            # Safari's format for client hints
            hints.append(f'"Safari";v="{browsers["safari"]}"')
            hints.append('"Not_A Brand";v="8"')

        return ", ".join(hints)


# Example usage:
if __name__ == "__main__":
    generator = UserAgentGenerator()
    print(generator.generate())

    print("\nSingle browser (Chrome):")
    print(generator.generate(num_browsers=1, browser_type="chrome"))

    print("\nTwo browsers (Gecko/Firefox):")
    print(generator.generate(num_browsers=2))

    print("\nThree browsers (Chrome/Safari/Edge):")
    print(generator.generate(num_browsers=3))

    print("\nFirefox on Linux:")
    print(
        generator.generate(
            device_type="desktop",
            os_type="linux",
            browser_type="firefox",
            num_browsers=2,
        )
    )

    print("\nChrome/Safari/Edge on Windows:")
    print(generator.generate(device_type="desktop", os_type="windows", num_browsers=3))
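`parse_user_agent` above is a straight regex scan over four browser patterns; the standalone sketch below reproduces it outside the class (names are illustrative, not the library's API):

```python
import re


def parse_user_agent(user_agent: str) -> dict:
    # Same per-browser version regexes as in the diff above
    patterns = {
        "chrome": r"Chrome/(\d+)",
        "edge": r"Edg/(\d+)",
        "safari": r"Version/(\d+)",
        "firefox": r"Firefox/(\d+)",
    }
    return {
        name: m.group(1)
        for name, pat in patterns.items()
        if (m := re.search(pat, user_agent))
    }


ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
print(parse_user_agent(ua))  # {'chrome': '120', 'edge': '120'}
```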
crawl4ai/utils.py: 1571 changed lines (file diff suppressed because it is too large)
@@ -1,14 +1,14 @@
# version_manager.py
from pathlib import Path
from packaging import version
from . import __version__


class VersionManager:
    def __init__(self):
        self.home_dir = Path.home() / ".crawl4ai"
        self.version_file = self.home_dir / "version.txt"

    def get_installed_version(self):
        """Get the version recorded in home directory"""
        if not self.version_file.exists():
@@ -17,14 +17,13 @@ class VersionManager:
            return version.parse(self.version_file.read_text().strip())
        except:
            return None

    def update_version(self):
        """Update the version file to current library version"""
        self.version_file.write_text(__version__.__version__)

    def needs_update(self):
        """Check if database needs update based on version"""
        installed = self.get_installed_version()
        current = version.parse(__version__.__version__)
        return installed is None or installed < current
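`needs_update` above reduces to one comparison: the recorded version against the current library version, with a missing version file (None) also triggering an update. A standalone sketch, with a simple tuple parse standing in for `packaging.version` (function names are illustrative):

```python
from typing import Optional


def parse(v: str) -> tuple:
    # Dotted-numeric stand-in for packaging.version.parse
    return tuple(int(part) for part in v.split("."))


def needs_update(installed: Optional[str], current: str) -> bool:
    return installed is None or parse(installed) < parse(current)


print(needs_update("0.4.2", "0.4.267"))    # True
print(needs_update(None, "0.4.267"))       # True
print(needs_update("0.4.267", "0.4.267"))  # False
```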
@@ -1,9 +1,10 @@
import os, time

os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path

from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url
from .utils import *
from .chunking_strategy import *
from .extraction_strategy import *
@@ -14,31 +15,44 @@ from .content_scraping_strategy import WebScrapingStrategy
from .config import *
import warnings
import json

warnings.filterwarnings(
    "ignore",
    message='Field "model_name" has conflict with protected namespace "model_".',
)


class WebCrawler:
    def __init__(
        self,
        crawler_strategy: CrawlerStrategy = None,
        always_by_pass_cache: bool = False,
        verbose: bool = False,
    ):
        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
            verbose=verbose
        )
        self.always_by_pass_cache = always_by_pass_cache
        self.crawl4ai_folder = os.path.join(
            os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
        )
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        init_db()
        self.ready = False

    def warmup(self):
        print("[LOG] 🌤️ Warming up the WebCrawler")
        self.run(
            url="https://google.com/",
            word_count_threshold=5,
            extraction_strategy=NoExtractionStrategy(),
            bypass_cache=False,
            verbose=False,
        )
        self.ready = True
        print("[LOG] 🌞 WebCrawler is ready to crawl")

    def fetch_page(
        self,
        url_model: UrlModel,
@@ -80,6 +94,7 @@ class WebCrawler:
        **kwargs,
    ) -> List[CrawlResult]:
        extraction_strategy = extraction_strategy or NoExtractionStrategy()

        def fetch_page_wrapper(url_model, *args, **kwargs):
            return self.fetch_page(url_model, *args, **kwargs)

@@ -104,150 +119,176 @@ class WebCrawler:
        return results

    def run(
        self,
        url: str,
        word_count_threshold=MIN_WORD_THRESHOLD,
        extraction_strategy: ExtractionStrategy = None,
        chunking_strategy: ChunkingStrategy = RegexChunking(),
        bypass_cache: bool = False,
        css_selector: str = None,
        screenshot: bool = False,
        user_agent: str = None,
        verbose=True,
        **kwargs,
    ) -> CrawlResult:
        try:
            extraction_strategy = extraction_strategy or NoExtractionStrategy()
            extraction_strategy.verbose = verbose
            if not isinstance(extraction_strategy, ExtractionStrategy):
                raise ValueError("Unsupported extraction strategy")
            if not isinstance(chunking_strategy, ChunkingStrategy):
                raise ValueError("Unsupported chunking strategy")

            word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)

            cached = None
            screenshot_data = None
            extracted_content = None
            if not bypass_cache and not self.always_by_pass_cache:
                cached = get_cached_url(url)

            if kwargs.get("warmup", True) and not self.ready:
                return None

            if cached:
                html = sanitize_input_encode(cached[1])
                extracted_content = sanitize_input_encode(cached[4])
                if screenshot:
                    screenshot_data = cached[9]
                    if not screenshot_data:
                        cached = None

            if not cached or not html:
                if user_agent:
                    self.crawler_strategy.update_user_agent(user_agent)
                t1 = time.time()
                html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
                t2 = time.time()
                if verbose:
                    print(
                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
                    )
                if screenshot:
                    screenshot_data = self.crawler_strategy.take_screenshot()

            crawl_result = self.process_html(
                url,
                html,
                extracted_content,
                word_count_threshold,
                extraction_strategy,
                chunking_strategy,
                css_selector,
                screenshot_data,
                verbose,
                bool(cached),
                **kwargs,
            )
            crawl_result.success = bool(html)
            return crawl_result
        except Exception as e:
            if not hasattr(e, "msg"):
                e.msg = str(e)
            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
            return CrawlResult(url=url, html="", success=False, error_message=e.msg)

    def process_html(
        self,
        url: str,
        html: str,
        extracted_content: str,
        word_count_threshold: int,
        extraction_strategy: ExtractionStrategy,
        chunking_strategy: ChunkingStrategy,
        css_selector: str,
        screenshot: bool,
        verbose: bool,
        is_cached: bool,
        **kwargs,
    ) -> CrawlResult:
        t = time.time()
        # Extract content from HTML
|
||||||
try:
|
try:
|
||||||
t1 = time.time()
|
t1 = time.time()
|
||||||
scrapping_strategy = WebScrapingStrategy()
|
scrapping_strategy = WebScrapingStrategy()
|
||||||
extra_params = {k: v for k, v in kwargs.items() if k not in ["only_text", "image_description_min_word_threshold"]}
|
extra_params = {
|
||||||
result = scrapping_strategy.scrap(
|
k: v
|
||||||
url,
|
for k, v in kwargs.items()
|
||||||
html,
|
if k not in ["only_text", "image_description_min_word_threshold"]
|
||||||
word_count_threshold=word_count_threshold,
|
}
|
||||||
css_selector=css_selector,
|
result = scrapping_strategy.scrap(
|
||||||
only_text=kwargs.get("only_text", False),
|
url,
|
||||||
image_description_min_word_threshold=kwargs.get(
|
html,
|
||||||
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
|
word_count_threshold=word_count_threshold,
|
||||||
),
|
css_selector=css_selector,
|
||||||
**extra_params,
|
only_text=kwargs.get("only_text", False),
|
||||||
|
image_description_min_word_threshold=kwargs.get(
|
||||||
|
"image_description_min_word_threshold",
|
||||||
|
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||||
|
),
|
||||||
|
**extra_params,
|
||||||
|
)
|
||||||
|
|
||||||
|
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
|
||||||
|
if verbose:
|
||||||
|
print(
|
||||||
|
f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
|
||||||
)
|
)
|
||||||
|
|
||||||
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
|
|
||||||
if verbose:
|
|
||||||
print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
|
|
||||||
|
|
||||||
if result is None:
|
|
||||||
raise ValueError(f"Failed to extract content from the website: {url}")
|
|
||||||
except InvalidCSSSelectorError as e:
|
|
||||||
raise ValueError(str(e))
|
|
||||||
|
|
||||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
|
||||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
|
||||||
media = result.get("media", [])
|
|
||||||
links = result.get("links", [])
|
|
||||||
metadata = result.get("metadata", {})
|
|
||||||
|
|
||||||
if extracted_content is None:
|
|
||||||
if verbose:
|
|
||||||
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
|
|
||||||
|
|
||||||
sections = chunking_strategy.chunk(markdown)
|
if result is None:
|
||||||
extracted_content = extraction_strategy.run(url, sections)
|
raise ValueError(f"Failed to extract content from the website: {url}")
|
||||||
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
|
except InvalidCSSSelectorError as e:
|
||||||
|
raise ValueError(str(e))
|
||||||
|
|
||||||
if verbose:
|
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||||
print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
|
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||||
|
media = result.get("media", [])
|
||||||
screenshot = None if not screenshot else screenshot
|
links = result.get("links", [])
|
||||||
|
metadata = result.get("metadata", {})
|
||||||
if not is_cached:
|
|
||||||
cache_url(
|
if extracted_content is None:
|
||||||
url,
|
if verbose:
|
||||||
html,
|
print(
|
||||||
cleaned_html,
|
f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
|
||||||
markdown,
|
)
|
||||||
extracted_content,
|
|
||||||
True,
|
sections = chunking_strategy.chunk(markdown)
|
||||||
json.dumps(media),
|
extracted_content = extraction_strategy.run(url, sections)
|
||||||
json.dumps(links),
|
extracted_content = json.dumps(
|
||||||
json.dumps(metadata),
|
extracted_content, indent=4, default=str, ensure_ascii=False
|
||||||
screenshot=screenshot,
|
)
|
||||||
)
|
|
||||||
|
if verbose:
|
||||||
return CrawlResult(
|
print(
|
||||||
url=url,
|
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
|
||||||
html=html,
|
)
|
||||||
cleaned_html=format_html(cleaned_html),
|
|
||||||
markdown=markdown,
|
screenshot = None if not screenshot else screenshot
|
||||||
media=media,
|
|
||||||
links=links,
|
if not is_cached:
|
||||||
metadata=metadata,
|
cache_url(
|
||||||
|
url,
|
||||||
|
html,
|
||||||
|
cleaned_html,
|
||||||
|
markdown,
|
||||||
|
extracted_content,
|
||||||
|
True,
|
||||||
|
json.dumps(media),
|
||||||
|
json.dumps(links),
|
||||||
|
json.dumps(metadata),
|
||||||
screenshot=screenshot,
|
screenshot=screenshot,
|
||||||
extracted_content=extracted_content,
|
)
|
||||||
success=True,
|
|
||||||
error_message="",
|
return CrawlResult(
|
||||||
)
|
url=url,
|
||||||
|
html=html,
|
||||||
|
cleaned_html=format_html(cleaned_html),
|
||||||
|
markdown=markdown,
|
||||||
|
media=media,
|
||||||
|
links=links,
|
||||||
|
metadata=metadata,
|
||||||
|
screenshot=screenshot,
|
||||||
|
extracted_content=extracted_content,
|
||||||
|
success=True,
|
||||||
|
error_message="",
|
||||||
|
)
|
||||||
|
|||||||
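The run() flow shown above boils down to: consult the cache, crawl on a miss, then hand the HTML to process_html(). A minimal, self-contained sketch of that cache-or-crawl decision (the helpers here are hypothetical in-memory stand-ins, not the real get_cached_url/cache_url):

```python
# Hypothetical in-memory stand-ins for get_cached_url / cache_url, sketching
# the cache-or-crawl decision that run() makes above (not the real helpers).
_cache = {}


def get_cached_url(url):
    return _cache.get(url)


def cache_url(url, html):
    _cache[url] = html


def run(url, bypass_cache=False, fetch=lambda u: f"<html>{u}</html>"):
    html = None if bypass_cache else get_cached_url(url)
    hit = html is not None
    if html is None:
        html = fetch(url)  # cache miss: actually "crawl"
        cache_url(url, html)
    return html, hit


first = run("https://example.com")   # miss -> fetch + cache
second = run("https://example.com")  # hit -> served from cache
print(first)
print(second)
```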
@@ -9,13 +9,11 @@ from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json


async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(browser_type="chromium", headless=True)

    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(
@@ -27,74 +25,70 @@ async def extract_amazon_products():
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        )
    )

    # Example search URL (you should replace with your actual Amazon URL)
    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
@@ -105,10 +99,12 @@ async def extract_amazon_products():
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio

    asyncio.run(extract_amazon_products())
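With the schema above, crawler.arun() returns extracted_content as a JSON array of product objects; the post-processing step can be exercised on its own against a hypothetical payload (values invented for illustration):

```python
import json

# Hypothetical payload in the shape JsonCssExtractionStrategy emits
# for the schema above (values invented for illustration).
sample = json.dumps(
    [
        {
            "asin": "B0ABC12345",
            "title": "Samsung Galaxy Tab A9",
            "price": "$129.99",
            "sponsored": False,
            "delivery_info": ["FREE delivery", "Tue, Feb 4"],
        },
        {
            "asin": "B0XYZ67890",
            "title": "Samsung Galaxy Tab S9",
            "price": "$799.99",
            "sponsored": True,
            "delivery_info": [],
        },
    ]
)

products = json.loads(sample)
for product in products:
    print(f"Title: {product.get('title')}")
    print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
    if product.get("delivery_info"):
        print(f"Delivery: {' '.join(product['delivery_info'])}")
```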
@@ -10,17 +10,17 @@ from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext


async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )

    # Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,

        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
@@ -30,102 +30,105 @@ async def extract_amazon_products():
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        ),
    )

    url = "https://www.amazon.com/"

    async def after_goto(
        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
    ):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")

        try:
            # Wait for search box to be available
            search_box = await page.wait_for_selector(
                "#twotabsearchtextbox", timeout=1000
            )

            # Type the search query
            await search_box.fill("Samsung Galaxy Tab")

            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector(
                "#nav-search-submit-button", timeout=1000
            )

            # Click with navigation waiting
            await search_button.click()

            # Wait for search results to load
            await page.wait_for_selector(
                '[data-component-type="s-search-result"]', timeout=10000
            )
            print("[HOOK] Search completed and results loaded!")

        except Exception as e:
            print(f"[HOOK] Error during search operation: {str(e)}")

        return page

    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:

        crawler.crawler_strategy.set_hook("after_goto", after_goto)

        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
@@ -136,10 +139,12 @@ async def extract_amazon_products():
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio

    asyncio.run(extract_amazon_products())
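The after_goto hook in the example above is registered through crawler_strategy.set_hook and receives the page after each navigation. The registration/dispatch pattern can be sketched without a browser; every class and page object below is a hypothetical stub, not a crawl4ai or Playwright type:

```python
import asyncio


# Minimal stand-in for the set_hook / after_goto pattern above; the strategy
# and "page" here are hypothetical stubs, not the real crawl4ai classes.
class CrawlerStrategyStub:
    def __init__(self):
        self._hooks = {}

    def set_hook(self, name, fn):
        self._hooks[name] = fn

    async def goto(self, url):
        page = {"url": url}  # stand-in for a Playwright Page
        hook = self._hooks.get("after_goto")
        if hook:
            page = await hook(page, None, url, response={"status": 200})
        return page


async def after_goto(page, context, url, response, **kwargs):
    print(f"[HOOK] after_goto - Successfully loaded: {url}")
    page["searched"] = True  # where the real hook fills and submits the search box
    return page


strategy = CrawlerStrategyStub()
strategy.set_hook("after_goto", after_goto)
page = asyncio.run(strategy.goto("https://www.amazon.com/"))
print(page)
```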
@@ -8,7 +8,7 @@ from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json


async def extract_amazon_products():
    # Initialize browser config
@@ -16,7 +16,7 @@ async def extract_amazon_products():
        # browser_type="chromium",
        headless=True
    )

    js_code_to_search = """
    const task = async () => {
        document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
@@ -30,7 +30,7 @@ async def extract_amazon_products():
    """
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=js_code_to_search,
        wait_for='css:[data-component-type="s-search-result"]',
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
@@ -41,75 +41,70 @@ async def extract_amazon_products():
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        ),
    )

    # Example search URL (you should replace with your actual Amazon URL)
    url = "https://www.amazon.com/"


    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)

        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)

            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
@@ -120,10 +115,12 @@ async def extract_amazon_products():
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)


if __name__ == "__main__":
    import asyncio

    asyncio.run(extract_amazon_products())
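The js_code plus wait_for pairing in the example above injects a script and then blocks until a CSS selector matches. The blocking half can be sketched as a generic poll-until-truthy helper; this is an illustrative sketch, not crawl4ai's implementation:

```python
import time


def wait_for(condition, timeout=10.0, interval=0.05):
    # Poll condition() until it returns truthy or the timeout elapses --
    # a generic sketch of wait_for-style blocking, not crawl4ai's code.
    deadline = time.time() + timeout
    while time.time() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")


# Simulated page state that becomes "ready" on the third poll.
state = {"polls": 0}


def results_loaded():
    state["polls"] += 1
    return state["polls"] >= 3


print(wait_for(results_loaded))
```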
@@ -1,12 +1,16 @@
# File: async_webcrawler_multiple_urls_example.py
import os, sys

# append 2 parent directories to sys.path to import crawl4ai
parent_dir = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
sys.path.append(parent_dir)

import asyncio
from crawl4ai import AsyncWebCrawler


async def main():
    # Initialize the AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
@@ -16,7 +20,7 @@ async def main():
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
            "https://news.ycombinator.com",
        ]

        # Set up crawling parameters
@@ -27,7 +31,7 @@ async def main():
            urls=urls,
            word_count_threshold=word_count_threshold,
            bypass_cache=True,
            verbose=True,
        )

        # Process the results
@@ -36,7 +40,9 @@ async def main():
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Word count: {len(result.markdown.split())}")
                print(
                    f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
                )
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
@@ -44,5 +50,6 @@ async def main():
                print(f"Error: {result.error_message}")
                print("---")


if __name__ == "__main__":
    asyncio.run(main())
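The per-result statistics printed in the multi-URL example above (word count, link count, image count) can be computed on their own; a sketch over a hypothetical result payload rather than a live CrawlResult:

```python
# Minimal sketch of the per-result stats printed above, computed on a
# hypothetical result payload instead of a live CrawlResult.
result = {
    "markdown": "Example Domain This domain is for use in examples.",
    "links": {
        "internal": ["/about"],
        "external": ["https://www.iana.org", "https://example.net"],
    },
    "media": {"images": [{"src": "/logo.png"}]},
}

word_count = len(result["markdown"].split())
num_links = len(result["links"].get("internal", [])) + len(
    result["links"].get("external", [])
)
num_images = len(result["media"].get("images", []))
print(f"Word count: {word_count}")
print(f"Number of links: {num_links}")
print(f"Number of images: {num_images}")
```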
@@ -6,10 +6,8 @@ This example demonstrates optimal browser usage patterns in Crawl4AI:
"""

import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
@@ -1,31 +1,32 @@
 import os, time

 # append the path to the root of the project
 import sys
 import asyncio
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
+sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
 from firecrawl import FirecrawlApp
 from crawl4ai import AsyncWebCrawler
-__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
+__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"


 async def compare():
-    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

     # Tet Firecrawl with a simple crawl
     start = time.time()
     scrape_status = app.scrape_url(
-        'https://www.nbcnews.com/business',
-        params={'formats': ['markdown', 'html']}
+        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
     )
     end = time.time()
     print(f"Time taken: {end - start} seconds")
-    print(len(scrape_status['markdown']))
+    print(len(scrape_status["markdown"]))
     # save the markdown content with provider name
     with open(f"{__data__}/firecrawl_simple.md", "w") as f:
-        f.write(scrape_status['markdown'])
+        f.write(scrape_status["markdown"])
     # Count how many "cldnry.s-nbcnews.com" are in the markdown
-    print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
+    print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))



     async with AsyncWebCrawler() as crawler:
         start = time.time()
@@ -33,13 +34,13 @@ async def compare():
             url="https://www.nbcnews.com/business",
             # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
             word_count_threshold=0,
             bypass_cache=True,
-            verbose=False
+            verbose=False,
         )
         end = time.time()
         print(f"Time taken: {end - start} seconds")
         print(len(result.markdown))
         # save the markdown content with provider name
         with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
             f.write(result.markdown)
         # count how many "cldnry.s-nbcnews.com" are in the markdown
@@ -48,10 +49,12 @@ async def compare():
         start = time.time()
         result = await crawler.arun(
             url="https://www.nbcnews.com/business",
-            js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+            js_code=[
+                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
+            ],
             word_count_threshold=0,
             bypass_cache=True,
-            verbose=False
+            verbose=False,
         )
         end = time.time()
         print(f"Time taken: {end - start} seconds")
@@ -61,7 +64,7 @@ async def compare():
         f.write(result.markdown)
         # count how many "cldnry.s-nbcnews.com" are in the markdown
         print(result.markdown.count("cldnry.s-nbcnews.com"))


 if __name__ == "__main__":
     asyncio.run(compare())

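The comparison script above wraps each scrape/crawl call in wall-clock timing and then counts occurrences of the NBC News image CDN host to compare extraction completeness. A minimal pure-Python sketch of that bookkeeping (the helper names are illustrative, not part of either library):

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) -- the start/end
    pattern the comparison script repeats around each call."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


def count_cdn_links(markdown: str, host: str = "cldnry.s-nbcnews.com") -> int:
    """Count how many times the image-CDN host appears in the markdown,
    as the example does to compare how many images each tool recovered."""
    return markdown.count(host)


sample = "![a](https://cldnry.s-nbcnews.com/1.jpg) text ![b](https://cldnry.s-nbcnews.com/2.jpg)"
markdown, elapsed = timed(lambda: sample)
print(count_cdn_links(markdown))  # 2
```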
136 docs/examples/dispatcher_example.py (new file)
@@ -0,0 +1,136 @@
+import asyncio
+import time
+from rich import print
+from rich.table import Table
+from crawl4ai import (
+    AsyncWebCrawler,
+    BrowserConfig,
+    CrawlerRunConfig,
+    MemoryAdaptiveDispatcher,
+    SemaphoreDispatcher,
+    RateLimiter,
+    CrawlerMonitor,
+    DisplayMode,
+    CacheMode,
+    LXMLWebScrapingStrategy,
+)
+
+
+async def memory_adaptive(urls, browser_config, run_config):
+    """Memory adaptive crawler with monitoring"""
+    start = time.perf_counter()
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        dispatcher = MemoryAdaptiveDispatcher(
+            memory_threshold_percent=70.0,
+            max_session_permit=10,
+            monitor=CrawlerMonitor(
+                max_visible_rows=15, display_mode=DisplayMode.DETAILED
+            ),
+        )
+        results = await crawler.arun_many(
+            urls, config=run_config, dispatcher=dispatcher
+        )
+        duration = time.perf_counter() - start
+        return len(results), duration
+
+
+async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
+    """Memory adaptive crawler with rate limiting"""
+    start = time.perf_counter()
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        dispatcher = MemoryAdaptiveDispatcher(
+            memory_threshold_percent=70.0,
+            max_session_permit=10,
+            rate_limiter=RateLimiter(
+                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
+            ),
+            monitor=CrawlerMonitor(
+                max_visible_rows=15, display_mode=DisplayMode.DETAILED
+            ),
+        )
+        results = await crawler.arun_many(
+            urls, config=run_config, dispatcher=dispatcher
+        )
+        duration = time.perf_counter() - start
+        return len(results), duration
+
+
+async def semaphore(urls, browser_config, run_config):
+    """Basic semaphore crawler"""
+    start = time.perf_counter()
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        dispatcher = SemaphoreDispatcher(
+            semaphore_count=5,
+            monitor=CrawlerMonitor(
+                max_visible_rows=15, display_mode=DisplayMode.DETAILED
+            ),
+        )
+        results = await crawler.arun_many(
+            urls, config=run_config, dispatcher=dispatcher
+        )
+        duration = time.perf_counter() - start
+        return len(results), duration
+
+
+async def semaphore_with_rate_limit(urls, browser_config, run_config):
+    """Semaphore crawler with rate limiting"""
+    start = time.perf_counter()
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        dispatcher = SemaphoreDispatcher(
+            semaphore_count=5,
+            rate_limiter=RateLimiter(
+                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
+            ),
+            monitor=CrawlerMonitor(
+                max_visible_rows=15, display_mode=DisplayMode.DETAILED
+            ),
+        )
+        results = await crawler.arun_many(
+            urls, config=run_config, dispatcher=dispatcher
+        )
+        duration = time.perf_counter() - start
+        return len(results), duration
+
+
+def create_performance_table(results):
+    """Creates a rich table showing performance results"""
+    table = Table(title="Crawler Strategy Performance Comparison")
+    table.add_column("Strategy", style="cyan")
+    table.add_column("URLs Crawled", justify="right", style="green")
+    table.add_column("Time (seconds)", justify="right", style="yellow")
+    table.add_column("URLs/second", justify="right", style="magenta")
+
+    sorted_results = sorted(results.items(), key=lambda x: x[1][1])
+
+    for strategy, (urls_crawled, duration) in sorted_results:
+        urls_per_second = urls_crawled / duration
+        table.add_row(
+            strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
+        )
+
+    return table
+
+
+async def main():
+    urls = [f"https://example.com/page{i}" for i in range(1, 40)]
+    browser_config = BrowserConfig(headless=True, verbose=False)
+    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
+
+    results = {
+        "Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
+        # "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
+        #     urls, browser_config, run_config
+        # ),
+        # "Semaphore": await semaphore(urls, browser_config, run_config),
+        # "Semaphore + Rate Limit": await semaphore_with_rate_limit(
+        #     urls, browser_config, run_config
+        # ),
+    }
+
+    table = create_performance_table(results)
+    print("\nPerformance Summary:")
+    print(table)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
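The summary logic in `create_performance_table` above is independent of the crawling itself: it sorts strategies by elapsed time and derives a URLs/second rate for each. A dependency-free sketch of that computation (the `summarize` helper is illustrative, not part of the crawl4ai API):

```python
def summarize(results):
    """Sort strategies by duration (fastest first) and compute URLs/second,
    mirroring create_performance_table without the `rich` dependency.
    `results` maps strategy name -> (urls_crawled, duration_seconds)."""
    rows = []
    for strategy, (urls_crawled, duration) in sorted(
        results.items(), key=lambda item: item[1][1]
    ):
        rows.append((strategy, urls_crawled, round(urls_crawled / duration, 2)))
    return rows


# Hypothetical timings for two strategies over the same 39 URLs
results = {"Memory Adaptive": (39, 13.0), "Semaphore": (39, 19.5)}
print(summarize(results))
# [('Memory Adaptive', 39, 3.0), ('Semaphore', 39, 2.0)]
```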
@@ -6,63 +6,80 @@ import base64
 import os
 from typing import Dict, Any


 class Crawl4AiTester:
     def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
         self.base_url = base_url
-        self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code" # Check environment variable as fallback
-        self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {}
+        self.api_token = (
+            api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
+        )  # Check environment variable as fallback
+        self.headers = (
+            {"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
+        )

-    def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
+    def submit_and_wait(
+        self, request_data: Dict[str, Any], timeout: int = 300
+    ) -> Dict[str, Any]:
         # Submit crawl job
-        response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers)
+        response = requests.post(
+            f"{self.base_url}/crawl", json=request_data, headers=self.headers
+        )
         if response.status_code == 403:
             raise Exception("API token is invalid or missing")
         task_id = response.json()["task_id"]
         print(f"Task ID: {task_id}")

         # Poll for result
         start_time = time.time()
         while True:
             if time.time() - start_time > timeout:
-                raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
+                raise TimeoutError(
+                    f"Task {task_id} did not complete within {timeout} seconds"
+                )

-            result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers)
+            result = requests.get(
+                f"{self.base_url}/task/{task_id}", headers=self.headers
+            )
             status = result.json()

             if status["status"] == "failed":
                 print("Task failed:", status.get("error"))
                 raise Exception(f"Task failed: {status.get('error')}")

             if status["status"] == "completed":
                 return status

             time.sleep(2)

     def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60)
+        response = requests.post(
+            f"{self.base_url}/crawl_sync",
+            json=request_data,
+            headers=self.headers,
+            timeout=60,
+        )
         if response.status_code == 408:
             raise TimeoutError("Task did not complete within server timeout")
         response.raise_for_status()
         return response.json()

     def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
         """Directly crawl without using task queue"""
         response = requests.post(
-            f"{self.base_url}/crawl_direct",
-            json=request_data,
-            headers=self.headers
+            f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
         )
         response.raise_for_status()
         return response.json()


 def test_docker_deployment(version="basic"):
     tester = Crawl4AiTester(
-        base_url="http://localhost:11235" ,
+        base_url="http://localhost:11235",
         # base_url="https://api.crawl4ai.com" # just for example
         # api_token="test" # just for example
     )
     print(f"Testing Crawl4AI Docker {version} version")

     # Health check with timeout and retry
     max_retries = 5
     for i in range(max_retries):
@@ -70,19 +87,19 @@ def test_docker_deployment(version="basic"):
             health = requests.get(f"{tester.base_url}/health", timeout=10)
             print("Health check:", health.json())
             break
-        except requests.exceptions.RequestException as e:
+        except requests.exceptions.RequestException:
             if i == max_retries - 1:
                 print(f"Failed to connect after {max_retries} attempts")
                 sys.exit(1)
             print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
             time.sleep(5)

     # Test cases based on version
     test_basic_crawl_direct(tester)
     test_basic_crawl(tester)
     test_basic_crawl(tester)
     test_basic_crawl_sync(tester)

     if version in ["full", "transformer"]:
         test_cosine_extraction(tester)

@@ -92,49 +109,52 @@ def test_docker_deployment(version="basic"):
     test_llm_extraction(tester)
     test_llm_with_ollama(tester)
     test_screenshot(tester)


 def test_basic_crawl(tester: Crawl4AiTester):
     print("\n=== Testing Basic Crawl ===")
     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 10,
-        "session_id": "test"
+        "session_id": "test",
     }

     result = tester.submit_and_wait(request)
     print(f"Basic crawl result length: {len(result['result']['markdown'])}")
     assert result["result"]["success"]
     assert len(result["result"]["markdown"]) > 0


 def test_basic_crawl_sync(tester: Crawl4AiTester):
     print("\n=== Testing Basic Crawl (Sync) ===")
     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 10,
-        "session_id": "test"
+        "session_id": "test",
     }

     result = tester.submit_sync(request)
     print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result['status'] == 'completed'
-    assert result['result']['success']
-    assert len(result['result']['markdown']) > 0
+    assert result["status"] == "completed"
+    assert result["result"]["success"]
+    assert len(result["result"]["markdown"]) > 0


 def test_basic_crawl_direct(tester: Crawl4AiTester):
     print("\n=== Testing Basic Crawl (Direct) ===")
     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 10,
         # "session_id": "test"
-        "cache_mode": "bypass" # or "enabled", "disabled", "read_only", "write_only"
+        "cache_mode": "bypass",  # or "enabled", "disabled", "read_only", "write_only"
     }

     result = tester.crawl_direct(request)
     print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result['result']['success']
-    assert len(result['result']['markdown']) > 0
+    assert result["result"]["success"]
+    assert len(result["result"]["markdown"]) > 0


 def test_js_execution(tester: Crawl4AiTester):
     print("\n=== Testing JS Execution ===")
     request = {
@@ -144,32 +164,29 @@ def test_js_execution(tester: Crawl4AiTester):
             "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
         ],
         "wait_for": "article.tease-card:nth-child(10)",
-        "crawler_params": {
-            "headless": True
-        }
+        "crawler_params": {"headless": True},
     }

     result = tester.submit_and_wait(request)
     print(f"JS execution result length: {len(result['result']['markdown'])}")
     assert result["result"]["success"]


 def test_css_selector(tester: Crawl4AiTester):
     print("\n=== Testing CSS Selector ===")
     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 7,
         "css_selector": ".wide-tease-item__description",
-        "crawler_params": {
-            "headless": True
-        },
-        "extra": {"word_count_threshold": 10}
+        "crawler_params": {"headless": True},
+        "extra": {"word_count_threshold": 10},
     }

     result = tester.submit_and_wait(request)
     print(f"CSS selector result length: {len(result['result']['markdown'])}")
     assert result["result"]["success"]


 def test_structured_extraction(tester: Crawl4AiTester):
     print("\n=== Testing Structured Extraction ===")
     schema = {
@@ -190,21 +207,16 @@ def test_structured_extraction(tester: Crawl4AiTester):
                 "name": "price",
                 "selector": "td:nth-child(2)",
                 "type": "text",
-            }
+            },
         ],
     }

     request = {
         "urls": "https://www.coinbase.com/explore",
         "priority": 9,
-        "extraction_config": {
-            "type": "json_css",
-            "params": {
-                "schema": schema
-            }
-        }
+        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
     }

     result = tester.submit_and_wait(request)
     extracted = json.loads(result["result"]["extracted_content"])
     print(f"Extracted {len(extracted)} items")
@@ -212,6 +224,7 @@ def test_structured_extraction(tester: Crawl4AiTester):
     assert result["result"]["success"]
     assert len(extracted) > 0


 def test_llm_extraction(tester: Crawl4AiTester):
     print("\n=== Testing LLM Extraction ===")
     schema = {
@@ -219,20 +232,20 @@ def test_llm_extraction(tester: Crawl4AiTester):
         "properties": {
             "model_name": {
                 "type": "string",
-                "description": "Name of the OpenAI model."
+                "description": "Name of the OpenAI model.",
             },
             "input_fee": {
                 "type": "string",
-                "description": "Fee for input token for the OpenAI model."
+                "description": "Fee for input token for the OpenAI model.",
             },
             "output_fee": {
                 "type": "string",
-                "description": "Fee for output token for the OpenAI model."
-            }
+                "description": "Fee for output token for the OpenAI model.",
+            },
         },
-        "required": ["model_name", "input_fee", "output_fee"]
+        "required": ["model_name", "input_fee", "output_fee"],
     }

     request = {
         "urls": "https://openai.com/api/pricing",
         "priority": 8,
@@ -243,12 +256,12 @@ def test_llm_extraction(tester: Crawl4AiTester):
             "api_token": os.getenv("OPENAI_API_KEY"),
             "schema": schema,
             "extraction_type": "schema",
-            "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
-        }
+            "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
+        },
         },
-        "crawler_params": {"word_count_threshold": 1}
+        "crawler_params": {"word_count_threshold": 1},
     }

     try:
         result = tester.submit_and_wait(request)
         extracted = json.loads(result["result"]["extracted_content"])
@@ -258,6 +271,7 @@ def test_llm_extraction(tester: Crawl4AiTester):
     except Exception as e:
         print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")


 def test_llm_with_ollama(tester: Crawl4AiTester):
     print("\n=== Testing LLM with Ollama ===")
     schema = {
@@ -265,20 +279,20 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
         "properties": {
             "article_title": {
                 "type": "string",
-                "description": "The main title of the news article"
+                "description": "The main title of the news article",
             },
             "summary": {
                 "type": "string",
-                "description": "A brief summary of the article content"
+                "description": "A brief summary of the article content",
             },
             "main_topics": {
                 "type": "array",
                 "items": {"type": "string"},
-                "description": "Main topics or themes discussed in the article"
+                "description": "Main topics or themes discussed in the article",
-            }
-        }
+            },
+        },
     }

     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 8,
@@ -288,13 +302,13 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
             "provider": "ollama/llama2",
             "schema": schema,
             "extraction_type": "schema",
-            "instruction": "Extract the main article information including title, summary, and main topics."
-        }
+            "instruction": "Extract the main article information including title, summary, and main topics.",
+        },
         },
         "extra": {"word_count_threshold": 1},
-        "crawler_params": {"verbose": True}
+        "crawler_params": {"verbose": True},
     }

     try:
         result = tester.submit_and_wait(request)
         extracted = json.loads(result["result"]["extracted_content"])
@@ -303,6 +317,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
     except Exception as e:
         print(f"Ollama extraction test failed: {str(e)}")


 def test_cosine_extraction(tester: Crawl4AiTester):
     print("\n=== Testing Cosine Extraction ===")
     request = {
@@ -314,11 +329,11 @@ def test_cosine_extraction(tester: Crawl4AiTester):
                 "semantic_filter": "business finance economy",
                 "word_count_threshold": 10,
                 "max_dist": 0.2,
-                "top_k": 3
+                "top_k": 3,
-            }
-        }
+            },
+        },
     }

     try:
         result = tester.submit_and_wait(request)
         extracted = json.loads(result["result"]["extracted_content"])
@@ -328,30 +343,30 @@ def test_cosine_extraction(tester: Crawl4AiTester):
     except Exception as e:
         print(f"Cosine extraction test failed: {str(e)}")


 def test_screenshot(tester: Crawl4AiTester):
     print("\n=== Testing Screenshot ===")
     request = {
         "urls": "https://www.nbcnews.com/business",
         "priority": 5,
         "screenshot": True,
-        "crawler_params": {
-            "headless": True
-        }
+        "crawler_params": {"headless": True},
     }

     result = tester.submit_and_wait(request)
     print("Screenshot captured:", bool(result["result"]["screenshot"]))

     if result["result"]["screenshot"]:
         # Save screenshot
         screenshot_data = base64.b64decode(result["result"]["screenshot"])
         with open("test_screenshot.jpg", "wb") as f:
             f.write(screenshot_data)
         print("Screenshot saved as test_screenshot.jpg")

     assert result["result"]["success"]


 if __name__ == "__main__":
     version = sys.argv[1] if len(sys.argv) > 1 else "basic"
     # version = "full"
     test_docker_deployment(version)

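The `submit_and_wait` method in the Docker test above is a standard poll-until-complete loop: it re-fetches task status until the task reports `completed` or `failed`, or a wall-clock timeout elapses. The loop can be sketched with the HTTP call replaced by an injected callable (the `wait_for_completion` helper is illustrative, not part of the tester's API):

```python
import time


def wait_for_completion(get_status, timeout=300, poll_interval=0.0):
    """Poll get_status() until the task completes, mirroring
    Crawl4AiTester.submit_and_wait without the requests dependency."""
    start = time.time()
    while True:
        if time.time() - start > timeout:
            raise TimeoutError(f"Task did not complete within {timeout} seconds")
        status = get_status()
        if status["status"] == "failed":
            raise Exception(f"Task failed: {status.get('error')}")
        if status["status"] == "completed":
            return status
        time.sleep(poll_interval)  # the real tester sleeps 2s between polls


# Simulate a task that completes on the third poll
states = iter([{"status": "pending"}, {"status": "pending"}, {"status": "completed"}])
print(wait_for_completion(lambda: next(states)))  # {'status': 'completed'}
```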
@@ -9,18 +9,17 @@ This example shows how to:
 
 import asyncio
 import os
-from typing import Dict, Any
 
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import (
     LLMExtractionStrategy,
     JsonCssExtractionStrategy,
-    JsonXPathExtractionStrategy
+    JsonXPathExtractionStrategy,
 )
-from crawl4ai.chunking_strategy import RegexChunking, IdentityChunking
 from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 
 
 async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
     """Helper function to run extraction with proper configuration"""
     try:
@@ -30,78 +29,90 @@ async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str
             extraction_strategy=strategy,
             markdown_generator=DefaultMarkdownGenerator(
                 content_filter=PruningContentFilter()  # For fit_markdown support
-            )
+            ),
         )
 
         # Run the crawler
         result = await crawler.arun(url=url, config=config)
 
         if result.success:
             print(f"\n=== {name} Results ===")
             print(f"Extracted Content: {result.extracted_content}")
             print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
-            print(f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}")
+            print(
+                f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
+            )
         else:
             print(f"Error in {name}: Crawl failed")
 
     except Exception as e:
         print(f"Error in {name}: {str(e)}")
 
 
 async def main():
     # Example URL (replace with actual URL)
     url = "https://example.com/product-page"
 
     # Configure browser settings
-    browser_config = BrowserConfig(
-        headless=True,
-        verbose=True
-    )
+    browser_config = BrowserConfig(headless=True, verbose=True)
 
     # Initialize extraction strategies
 
     # 1. LLM Extraction with different input formats
     markdown_strategy = LLMExtractionStrategy(
         provider="openai/gpt-4o-mini",
         api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information including name, price, and description"
+        instruction="Extract product information including name, price, and description",
     )
 
     html_strategy = LLMExtractionStrategy(
         input_format="html",
         provider="openai/gpt-4o-mini",
         api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from HTML including structured data"
+        instruction="Extract product information from HTML including structured data",
     )
 
     fit_markdown_strategy = LLMExtractionStrategy(
         input_format="fit_markdown",
         provider="openai/gpt-4o-mini",
         api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from cleaned markdown"
+        instruction="Extract product information from cleaned markdown",
     )
 
     # 2. JSON CSS Extraction (automatically uses HTML input)
     css_schema = {
         "baseSelector": ".product",
         "fields": [
             {"name": "title", "selector": "h1.product-title", "type": "text"},
             {"name": "price", "selector": ".price", "type": "text"},
-            {"name": "description", "selector": ".description", "type": "text"}
-        ]
+            {"name": "description", "selector": ".description", "type": "text"},
+        ],
     }
     css_strategy = JsonCssExtractionStrategy(schema=css_schema)
 
     # 3. JSON XPath Extraction (automatically uses HTML input)
     xpath_schema = {
         "baseSelector": "//div[@class='product']",
         "fields": [
-            {"name": "title", "selector": ".//h1[@class='product-title']/text()", "type": "text"},
-            {"name": "price", "selector": ".//span[@class='price']/text()", "type": "text"},
-            {"name": "description", "selector": ".//div[@class='description']/text()", "type": "text"}
-        ]
+            {
+                "name": "title",
+                "selector": ".//h1[@class='product-title']/text()",
+                "type": "text",
+            },
+            {
+                "name": "price",
+                "selector": ".//span[@class='price']/text()",
+                "type": "text",
+            },
+            {
+                "name": "description",
+                "selector": ".//div[@class='description']/text()",
+                "type": "text",
+            },
+        ],
     }
     xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
 
     # Use context manager for proper resource handling
     async with AsyncWebCrawler(config=browser_config) as crawler:
         # Run all strategies
@@ -111,5 +122,6 @@ async def main():
         await run_extraction(crawler, url, css_strategy, "CSS Extraction")
         await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
 
 
 if __name__ == "__main__":
     asyncio.run(main())
@@ -1,20 +1,23 @@
 import asyncio
 from crawl4ai import *
 
 
 async def main():
     browser_config = BrowserConfig(headless=True, verbose=True)
     async with AsyncWebCrawler(config=browser_config) as crawler:
         crawler_config = CrawlerRunConfig(
             cache_mode=CacheMode.BYPASS,
             markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
-            )
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                )
+            ),
         )
         result = await crawler.arun(
-            url="https://www.helloworld.org",
-            config=crawler_config
+            url="https://www.helloworld.org", config=crawler_config
         )
         print(result.markdown_v2.raw_markdown[:500])
 
 
 if __name__ == "__main__":
     asyncio.run(main())
@@ -1,19 +1,18 @@
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from playwright.async_api import Page, BrowserContext
 
 
 async def main():
     print("🔗 Hooks Example: Demonstrating different hook use cases")
 
     # Configure browser settings
-    browser_config = BrowserConfig(
-        headless=True
-    )
+    browser_config = BrowserConfig(headless=True)
 
     # Configure crawler settings
     crawler_run_config = CrawlerRunConfig(
         js_code="window.scrollTo(0, document.body.scrollHeight);",
         wait_for="body",
-        cache_mode=CacheMode.BYPASS
+        cache_mode=CacheMode.BYPASS,
     )
 
     # Create crawler instance
@@ -30,16 +29,22 @@ async def main():
         """Hook called after a new page and context are created"""
         print("[HOOK] on_page_context_created - New page created!")
         # Example: Set default viewport size
-        await context.add_cookies([{
-            'name': 'session_id',
-            'value': 'example_session',
-            'domain': '.example.com',
-            'path': '/'
-        }])
-        await page.set_viewport_size({"width": 1920, "height": 1080})
+        await context.add_cookies(
+            [
+                {
+                    "name": "session_id",
+                    "value": "example_session",
+                    "domain": ".example.com",
+                    "path": "/",
+                }
+            ]
+        )
+        await page.set_viewport_size({"width": 1080, "height": 800})
         return page
 
-    async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs):
+    async def on_user_agent_updated(
+        page: Page, context: BrowserContext, user_agent: str, **kwargs
+    ):
         """Hook called when the user agent is updated"""
         print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
         return page
@@ -53,17 +58,17 @@ async def main():
         """Hook called before navigating to each URL"""
         print(f"[HOOK] before_goto - About to visit: {url}")
         # Example: Add custom headers for the request
-        await page.set_extra_http_headers({
-            "Custom-Header": "my-value"
-        })
+        await page.set_extra_http_headers({"Custom-Header": "my-value"})
         return page
 
-    async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
+    async def after_goto(
+        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
+    ):
         """Hook called after navigating to each URL"""
         print(f"[HOOK] after_goto - Successfully loaded: {url}")
         # Example: Wait for a specific element to be loaded
         try:
-            await page.wait_for_selector('.content', timeout=1000)
+            await page.wait_for_selector(".content", timeout=1000)
             print("Content element found!")
         except:
             print("Content element not found, continuing anyway")
@@ -76,7 +81,9 @@ async def main():
         await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
         return page
 
-    async def before_return_html(page: Page, context: BrowserContext, html:str, **kwargs):
+    async def before_return_html(
+        page: Page, context: BrowserContext, html: str, **kwargs
+    ):
         """Hook called before returning the HTML content"""
         print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
         # Example: You could modify the HTML content here if needed
@@ -84,7 +91,9 @@ async def main():
 
     # Set all the hooks
     crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
-    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
+    crawler.crawler_strategy.set_hook(
+        "on_page_context_created", on_page_context_created
+    )
     crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
     crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
     crawler.crawler_strategy.set_hook("before_goto", before_goto)
@@ -95,13 +104,15 @@ async def main():
     await crawler.start()
 
     # Example usage: crawl a simple website
-    url = 'https://example.com'
+    url = "https://example.com"
     result = await crawler.arun(url, config=crawler_run_config)
     print(f"\nCrawled URL: {result.url}")
     print(f"HTML length: {len(result.html)}")
 
     await crawler.close()
 
 
 if __name__ == "__main__":
     import asyncio
-    asyncio.run(main())
+
+    asyncio.run(main())
@@ -1,6 +1,7 @@
 import asyncio
 from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
 
 
 async def main():
     # Example 1: Setting language when creating the crawler
     crawler1 = AsyncWebCrawler(
@@ -9,11 +10,15 @@ async def main():
         )
     )
     result1 = await crawler1.arun("https://www.example.com")
-    print("Example 1 result:", result1.extracted_content[:100])  # Print first 100 characters
+    print(
+        "Example 1 result:", result1.extracted_content[:100]
+    )  # Print first 100 characters
 
     # Example 2: Setting language before crawling
     crawler2 = AsyncWebCrawler()
-    crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
+    crawler2.crawler_strategy.headers[
+        "Accept-Language"
+    ] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
     result2 = await crawler2.arun("https://www.example.com")
     print("Example 2 result:", result2.extracted_content[:100])
 
@@ -21,7 +26,7 @@ async def main():
     crawler3 = AsyncWebCrawler()
     result3 = await crawler3.arun(
         "https://www.example.com",
-        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
+        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
     )
     print("Example 3 result:", result3.extracted_content[:100])
 
@@ -31,15 +36,15 @@ async def main():
         ("https://www.example.org", "es-ES,es;q=0.9"),
         ("https://www.example.net", "de-DE,de;q=0.9"),
     ]
 
     crawler4 = AsyncWebCrawler()
-    results = await asyncio.gather(*[
-        crawler4.arun(url, headers={"Accept-Language": lang})
-        for url, lang in urls
-    ])
+    results = await asyncio.gather(
+        *[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
+    )
 
     for url, result in zip([u for u, _ in urls], results):
         print(f"Result for {url}:", result.extracted_content[:100])
 
 
 if __name__ == "__main__":
     asyncio.run(main())
@@ -3,32 +3,37 @@ from crawl4ai.crawler_strategy import *
 import asyncio
 from pydantic import BaseModel, Field
 
-url = r'https://openai.com/api/pricing/'
+url = r"https://openai.com/api/pricing/"
 
 
 class OpenAIModelFee(BaseModel):
     model_name: str = Field(..., description="Name of the OpenAI model.")
     input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+    output_fee: str = Field(
+        ..., description="Fee for output token for the OpenAI model."
+    )
 
 
 from crawl4ai import AsyncWebCrawler
 
 
 async def main():
     # Use AsyncWebCrawler
     async with AsyncWebCrawler() as crawler:
         result = await crawler.arun(
             url=url,
             word_count_threshold=1,
-            extraction_strategy= LLMExtractionStrategy(
+            extraction_strategy=LLMExtractionStrategy(
                 # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
-                provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
+                provider="groq/llama-3.1-70b-versatile",
+                api_token=os.getenv("GROQ_API_KEY"),
                 schema=OpenAIModelFee.model_json_schema(),
                 extraction_type="schema",
-                instruction="From the crawled content, extract all mentioned model names along with their " \
-                "fees for input and output tokens. Make sure not to miss anything in the entire content. " \
-                'One extracted model JSON format should look like this: ' \
-                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+                instruction="From the crawled content, extract all mentioned model names along with their "
+                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
+                "One extracted model JSON format should look like this: "
+                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
             ),
 
         )
         print("Success:", result.success)
         model_fees = json.loads(result.extracted_content)
@@ -37,4 +42,5 @@ async def main():
     with open(".data/data.json", "w", encoding="utf-8") as f:
         f.write(result.extracted_content)
 
 
 asyncio.run(main())
docs/examples/llm_markdown_generator.py (new file, 87 lines)
@@ -0,0 +1,87 @@
+import os
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai.content_filter_strategy import LLMContentFilter
+
+async def test_llm_filter():
+    # Create an HTML source that needs intelligent filtering
+    url = "https://docs.python.org/3/tutorial/classes.html"
+
+    browser_config = BrowserConfig(
+        headless=True,
+        verbose=True
+    )
+
+    # run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+    run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        # First get the raw HTML
+        result = await crawler.arun(url, config=run_config)
+        html = result.cleaned_html
+
+        # Initialize LLM filter with focused instruction
+        filter = LLMContentFilter(
+            provider="openai/gpt-4o",
+            api_token=os.getenv('OPENAI_API_KEY'),
+            instruction="""
+            Focus on extracting the core educational content about Python classes.
+            Include:
+            - Key concepts and their explanations
+            - Important code examples
+            - Essential technical details
+            Exclude:
+            - Navigation elements
+            - Sidebars
+            - Footer content
+            - Version information
+            - Any non-essential UI elements
+
+            Format the output as clean markdown with proper code blocks and headers.
+            """,
+            verbose=True
+        )
+
+        filter = LLMContentFilter(
+            provider="openai/gpt-4o",
+            api_token=os.getenv('OPENAI_API_KEY'),
+            chunk_token_threshold=2 ** 12 * 2,  # 2048 * 2
+            instruction="""
+            Extract the main educational content while preserving its original wording and substance completely. Your task is to:
+
+            1. Maintain the exact language and terminology used in the main content
+            2. Keep all technical explanations, examples, and educational content intact
+            3. Preserve the original flow and structure of the core content
+            4. Remove only clearly irrelevant elements like:
+               - Navigation menus
+               - Advertisement sections
+               - Cookie notices
+               - Footers with site information
+               - Sidebars with external links
+               - Any UI elements that don't contribute to learning
+
+            The goal is to create a clean markdown version that reads exactly like the original article,
+            keeping all valuable content but free from distracting elements. Imagine you're creating
+            a perfect reading experience where nothing valuable is lost, but all noise is removed.
+            """,
+            verbose=True
+        )
+
+        # Apply filtering
+        filtered_content = filter.filter_content(html, ignore_cache = True)
+
+        # Show results
+        print("\nFiltered Content Length:", len(filtered_content))
+        print("\nFirst 500 chars of filtered content:")
+        if filtered_content:
+            print(filtered_content[0][:500])
+
+        # Save on disc the markdown version
+        with open("filtered_content.md", "w", encoding="utf-8") as f:
+            f.write("\n".join(filtered_content))
+
+        # Show token usage
+        filter.show_usage()
+
+if __name__ == "__main__":
+    asyncio.run(test_llm_filter())
@@ -8,12 +8,12 @@ import asyncio
|
|||||||
import time
|
import time
|
||||||
import json
|
import json
|
||||||
import re
|
import re
|
||||||
from typing import Dict, List
|
from typing import Dict
|
||||||
from bs4 import BeautifulSoup
|
from bs4 import BeautifulSoup
|
||||||
from pydantic import BaseModel, Field
|
from pydantic import BaseModel, Field
|
||||||
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
|
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
|
||||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||||
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
|
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||||
from crawl4ai.extraction_strategy import (
|
from crawl4ai.extraction_strategy import (
|
||||||
JsonCssExtractionStrategy,
|
JsonCssExtractionStrategy,
|
||||||
LLMExtractionStrategy,
|
LLMExtractionStrategy,
|
||||||
@@ -62,6 +62,7 @@ async def clean_content():
|
|||||||
print(f"Full Markdown Length: {full_markdown_length}")
|
print(f"Full Markdown Length: {full_markdown_length}")
|
||||||
print(f"Fit Markdown Length: {fit_markdown_length}")
|
print(f"Fit Markdown Length: {fit_markdown_length}")
|
||||||
|
|
||||||
|
|
||||||
async def link_analysis():
|
async def link_analysis():
|
||||||
crawler_config = CrawlerRunConfig(
|
crawler_config = CrawlerRunConfig(
|
||||||
cache_mode=CacheMode.ENABLED,
|
cache_mode=CacheMode.ENABLED,
|
||||||
@@ -76,9 +77,10 @@ async def link_analysis():
|
|||||||
print(f"Found {len(result.links['internal'])} internal links")
|
print(f"Found {len(result.links['internal'])} internal links")
|
||||||
print(f"Found {len(result.links['external'])} external links")
|
print(f"Found {len(result.links['external'])} external links")
|
||||||
|
|
||||||
for link in result.links['internal'][:5]:
|
for link in result.links["internal"][:5]:
|
||||||
print(f"Href: {link['href']}\nText: {link['text']}\n")
|
print(f"Href: {link['href']}\nText: {link['text']}\n")
|
||||||
|
|
||||||
|
|
||||||
# JavaScript Execution Example
|
# JavaScript Execution Example
|
||||||
async def simple_example_with_running_js_code():
|
async def simple_example_with_running_js_code():
|
||||||
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
||||||
@@ -112,25 +114,29 @@ async def simple_example_with_css_selector():
|
|||||||
)
|
)
|
||||||
print(result.markdown[:500])
|
print(result.markdown[:500])
|
||||||
|
|
||||||
|
|
||||||
async def media_handling():
|
async def media_handling():
|
||||||
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True)
|
crawler_config = CrawlerRunConfig(
|
||||||
|
cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
|
||||||
|
)
|
||||||
async with AsyncWebCrawler() as crawler:
|
async with AsyncWebCrawler() as crawler:
|
||||||
result = await crawler.arun(
|
result = await crawler.arun(
|
||||||
url="https://www.nbcnews.com/business",
|
url="https://www.nbcnews.com/business", config=crawler_config
|
||||||
config=crawler_config
|
|
||||||
)
|
)
|
||||||
for img in result.media['images'][:5]:
|
for img in result.media["images"][:5]:
|
||||||
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
|
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
|
||||||
|
|
||||||
|
|
||||||
async def custom_hook_workflow(verbose=True):
|
async def custom_hook_workflow(verbose=True):
|
||||||
async with AsyncWebCrawler() as crawler:
|
async with AsyncWebCrawler() as crawler:
|
||||||
# Set a 'before_goto' hook to run custom code just before navigation
|
# Set a 'before_goto' hook to run custom code just before navigation
|
||||||
crawler.crawler_strategy.set_hook("before_goto", lambda page, context: print("[Hook] Preparing to navigate..."))
|
crawler.crawler_strategy.set_hook(
|
||||||
|
"before_goto",
|
||||||
|
lambda page, context: print("[Hook] Preparing to navigate..."),
|
||||||
|
)
|
||||||
|
|
||||||
# Perform the crawl operation
|
# Perform the crawl operation
|
||||||
result = await crawler.arun(
|
result = await crawler.arun(url="https://crawl4ai.com")
|
||||||
url="https://crawl4ai.com"
|
|
||||||
)
|
|
||||||
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
|
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
|
||||||
|
|
||||||
|
|
||||||
@@ -225,7 +231,7 @@ async def extract_structured_data_using_css_extractor():
|
|||||||
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
|
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
|
||||||
schema = {
|
schema = {
|
||||||
"name": "KidoCode Courses",
|
"name": "KidoCode Courses",
|
||||||
"baseSelector": "section.charge-methodology .w-tab-content > div",
|
"baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
|
||||||
"fields": [
|
"fields": [
|
||||||
{
|
{
|
||||||
"name": "section_title",
|
"name": "section_title",
|
||||||
@@ -273,6 +279,7 @@ async def extract_structured_data_using_css_extractor():
|
|||||||
cache_mode=CacheMode.BYPASS,
|
cache_mode=CacheMode.BYPASS,
|
||||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||||
js_code=[js_click_tabs],
|
js_code=[js_click_tabs],
|
||||||
|
delay_before_return_html=1
|
||||||
)
|
)
|
||||||
|
|
||||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||||
@@ -412,21 +419,22 @@ async def cosine_similarity_extraction():
         cache_mode=CacheMode.BYPASS,
         extraction_strategy=CosineStrategy(
             word_count_threshold=10,
             max_dist=0.2, # Maximum distance between two words
             linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
             top_k=3, # Number of top keywords to extract
             sim_threshold=0.3, # Similarity threshold for clustering
             semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
-            verbose=True
+            verbose=True,
         ),
     )
     async with AsyncWebCrawler() as crawler:
         result = await crawler.arun(
             url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
-            config=crawl_config
+            config=crawl_config,
         )
         print(json.loads(result.extracted_content)[:5])


 # Browser Comparison
 async def crawl_custom_browser_type():
     print("\n--- Browser Comparison ---")
@@ -484,39 +492,42 @@ async def crawl_with_user_simulation():
         result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
         print(result.markdown)


 async def ssl_certification():
     # Configure crawler to fetch SSL certificate
     config = CrawlerRunConfig(
         fetch_ssl_certificate=True,
-        cache_mode=CacheMode.BYPASS # Bypass cache to always get fresh certificates
+        cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
     )

     async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url='https://example.com',
-            config=config
-        )
+        result = await crawler.arun(url="https://example.com", config=config)

         if result.success and result.ssl_certificate:
             cert = result.ssl_certificate

             # 1. Access certificate properties directly
             print("\nCertificate Information:")
             print(f"Issuer: {cert.issuer.get('CN', '')}")
             print(f"Valid until: {cert.valid_until}")
             print(f"Fingerprint: {cert.fingerprint}")

             # 2. Export certificate in different formats
             cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
             print("\nCertificate exported to:")
             print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")

-            pem_data = cert.to_pem(os.path.join(tmp_dir, "certificate.pem")) # For web servers
+            pem_data = cert.to_pem(
+                os.path.join(tmp_dir, "certificate.pem")
+            ) # For web servers
             print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")

-            der_data = cert.to_der(os.path.join(tmp_dir, "certificate.der")) # For Java apps
+            der_data = cert.to_der(
+                os.path.join(tmp_dir, "certificate.der")
+            ) # For Java apps
             print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")


 # Speed Comparison
 async def speed_comparison():
     print("\n--- Speed Comparison ---")
@@ -581,29 +592,26 @@ async def speed_comparison():
 # Main execution
 async def main():
     # Basic examples
-    # await simple_crawl()
-    # await simple_example_with_running_js_code()
-    # await simple_example_with_css_selector()
+    await simple_crawl()
+    await simple_example_with_running_js_code()
+    await simple_example_with_css_selector()

     # Advanced examples
-    # await extract_structured_data_using_css_extractor()
+    await extract_structured_data_using_css_extractor()
     await extract_structured_data_using_llm(
         "openai/gpt-4o", os.getenv("OPENAI_API_KEY")
     )
-    # await crawl_dynamic_content_pages_method_1()
-    # await crawl_dynamic_content_pages_method_2()
+    await crawl_dynamic_content_pages_method_1()
+    await crawl_dynamic_content_pages_method_2()

     # Browser comparisons
-    # await crawl_custom_browser_type()
+    await crawl_custom_browser_type()

-    # Performance testing
-    # await speed_comparison()
-
     # Screenshot example
-    # await capture_and_save_screenshot(
-    #     "https://www.example.com",
-    #     os.path.join(__location__, "tmp/example_screenshot.jpg")
-    # )
+    await capture_and_save_screenshot(
+        "https://www.example.com",
+        os.path.join(__location__, "tmp/example_screenshot.jpg")
+    )


 if __name__ == "__main__":
@@ -1,6 +1,10 @@
 import os, sys

 # append parent directory to system path
-sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";
+sys.path.append(
+    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+)
+os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
+
 import asyncio
 # import nest_asyncio
@@ -15,7 +19,7 @@ from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.extraction_strategy import (
     JsonCssExtractionStrategy,
     LLMExtractionStrategy,
@@ -32,9 +36,12 @@ print("Website: https://crawl4ai.com")
 async def simple_crawl():
     print("\n--- Basic Usage ---")
     async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS)
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
+        )
         print(result.markdown[:500]) # Print first 500 characters


 async def simple_example_with_running_js_code():
     print("\n--- Executing JavaScript and Using CSS Selectors ---")
     # New code to handle the wait_for parameter
@@ -57,6 +64,7 @@ async def simple_example_with_running_js_code():
     )
     print(result.markdown[:500]) # Print first 500 characters


 async def simple_example_with_css_selector():
     print("\n--- Using CSS Selectors ---")
     async with AsyncWebCrawler(verbose=True) as crawler:
@@ -67,42 +75,44 @@ async def simple_example_with_css_selector():
         )
         print(result.markdown[:500]) # Print first 500 characters


 async def use_proxy():
     print("\n--- Using a Proxy ---")
     print(
         "Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
     )
     # Uncomment and modify the following lines to use a proxy
-    async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
+    async with AsyncWebCrawler(
+        verbose=True, proxy="http://your-proxy-url:port"
+    ) as crawler:
         result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            cache_mode= CacheMode.BYPASS
+            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
         )
         if result.success:
             print(result.markdown[:500]) # Print first 500 characters


 async def capture_and_save_screenshot(url: str, output_path: str):
     async with AsyncWebCrawler(verbose=True) as crawler:
         result = await crawler.arun(
-            url=url,
-            screenshot=True,
-            cache_mode= CacheMode.BYPASS
+            url=url, screenshot=True, cache_mode=CacheMode.BYPASS
         )

         if result.success and result.screenshot:
             import base64

             # Decode the base64 screenshot data
             screenshot_data = base64.b64decode(result.screenshot)

             # Save the screenshot as a JPEG file
-            with open(output_path, 'wb') as f:
+            with open(output_path, "wb") as f:
                 f.write(screenshot_data)

             print(f"Screenshot saved successfully to {output_path}")
         else:
             print("Failed to capture screenshot")


 class OpenAIModelFee(BaseModel):
     model_name: str = Field(..., description="Name of the OpenAI model.")
     input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -110,16 +120,19 @@ class OpenAIModelFee(BaseModel):
         ..., description="Fee for output token for the OpenAI model."
     )

-async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+
+async def extract_structured_data_using_llm(
+    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
+):
     print(f"\n--- Extracting Structured Data with {provider} ---")

     if api_token is None and provider != "ollama":
         print(f"API token is required for {provider}. Skipping this example.")
         return

     # extra_args = {}
-    extra_args={
+    extra_args = {
         "temperature": 0,
         "top_p": 0.9,
         "max_tokens": 2000,
         # any other supported parameters for litellm
@@ -139,52 +152,49 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
             instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
             Do not miss any models in the entire content. One extracted model JSON format should look like this:
             {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
-            extra_args=extra_args
+            extra_args=extra_args,
         ),
         cache_mode=CacheMode.BYPASS,
     )
     print(result.extracted_content)


 async def extract_structured_data_using_css_extractor():
     print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
     schema = {
         "name": "KidoCode Courses",
         "baseSelector": "section.charge-methodology .w-tab-content > div",
         "fields": [
             {
                 "name": "section_title",
                 "selector": "h3.heading-50",
                 "type": "text",
             },
             {
                 "name": "section_description",
                 "selector": ".charge-content",
                 "type": "text",
             },
             {
                 "name": "course_name",
                 "selector": ".text-block-93",
                 "type": "text",
             },
             {
                 "name": "course_description",
                 "selector": ".course-content-text",
                 "type": "text",
             },
             {
                 "name": "course_icon",
                 "selector": ".image-92",
                 "type": "attribute",
-                "attribute": "src"
-            }
-        ]
+                "attribute": "src",
+            },
+        ],
     }

-    async with AsyncWebCrawler(
-        headless=True,
-        verbose=True
-    ) as crawler:
+    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:

         # Create the JavaScript that handles clicking multiple times
         js_click_tabs = """
         (async () => {
@@ -198,19 +208,20 @@ async def extract_structured_data_using_css_extractor():
             await new Promise(r => setTimeout(r, 500));
             }
         })();
         """

         result = await crawler.arun(
             url="https://www.kidocode.com/degrees/technology",
             extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
             js_code=[js_click_tabs],
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
         )

         companies = json.loads(result.extracted_content)
         print(f"Successfully extracted {len(companies)} companies")
         print(json.dumps(companies[0], indent=2))


 # Advanced Session-Based Crawling with Dynamic Content 🔄
 async def crawl_dynamic_content_pages_method_1():
     print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -267,6 +278,7 @@ async def crawl_dynamic_content_pages_method_1():
         await crawler.crawler_strategy.kill_session(session_id)
         print(f"Successfully crawled {len(all_commits)} commits across 3 pages")


 async def crawl_dynamic_content_pages_method_2():
     print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")

@@ -334,8 +346,11 @@ async def crawl_dynamic_content_pages_method_2():
         await crawler.crawler_strategy.kill_session(session_id)
         print(f"Successfully crawled {len(all_commits)} commits across 3 pages")


 async def crawl_dynamic_content_pages_method_3():
-    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")
+    print(
+        "\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
+    )

     async with AsyncWebCrawler(verbose=True) as crawler:
         url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -357,7 +372,7 @@ async def crawl_dynamic_content_pages_method_3():
             const firstCommit = commits[0].textContent.trim();
             return firstCommit !== window.firstCommit;
         }"""

         schema = {
             "name": "Commit Extractor",
             "baseSelector": "li.Box-sc-g0xbh4-0",
@@ -395,40 +410,53 @@ async def crawl_dynamic_content_pages_method_3():
         await crawler.crawler_strategy.kill_session(session_id)
         print(f"Successfully crawled {len(all_commits)} commits across 3 pages")


 async def crawl_custom_browser_type():
     # Use Firefox
     start = time.time()
-    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(
+        browser_type="firefox", verbose=True, headless=True
+    ) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
         print(result.markdown[:500])
         print("Time taken: ", time.time() - start)

     # Use WebKit
     start = time.time()
-    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(
+        browser_type="webkit", verbose=True, headless=True
+    ) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
         print(result.markdown[:500])
         print("Time taken: ", time.time() - start)

     # Use Chromium (default)
     start = time.time()
-    async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
         print(result.markdown[:500])
         print("Time taken: ", time.time() - start)


 async def crawl_with_user_simultion():
     async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
         url = "YOUR-URL-HERE"
         result = await crawler.arun(
             url=url,
             cache_mode=CacheMode.BYPASS,
-            magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
+            magic=True, # Automatically detects and removes overlays, popups, and other elements that block content
             # simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
             # override_navigator = True # Overrides the navigator object to make it look like a real user
         )

         print(result.markdown)


 async def speed_comparison():
     # print("\n--- Speed Comparison ---")
@@ -439,18 +467,18 @@ async def speed_comparison():
     # print()
     # Simulated Firecrawl performance
     from firecrawl import FirecrawlApp
-    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+
+    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
     start = time.time()
     scrape_status = app.scrape_url(
-        'https://www.nbcnews.com/business',
-        params={'formats': ['markdown', 'html']}
+        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
     )
     end = time.time()
     print("Firecrawl:")
     print(f"Time taken: {end - start:.2f} seconds")
     print(f"Content length: {len(scrape_status['markdown'])} characters")
     print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
     print()

     async with AsyncWebCrawler() as crawler:
         # Crawl4AI simple crawl
@@ -474,7 +502,9 @@ async def speed_comparison():
             url="https://www.nbcnews.com/business",
             word_count_threshold=0,
             markdown_generator=DefaultMarkdownGenerator(
-                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                )
                 # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
             ),
             cache_mode=CacheMode.BYPASS,
@@ -498,7 +528,9 @@ async def speed_comparison():
             word_count_threshold=0,
             cache_mode=CacheMode.BYPASS,
             markdown_generator=DefaultMarkdownGenerator(
-                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                )
                 # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
             ),
             verbose=False,
@@ -520,11 +552,12 @@ async def speed_comparison():
     print("If you run these tests in an environment with better network conditions,")
     print("you may observe an even more significant speed advantage for Crawl4AI.")


 async def generate_knowledge_graph():
     class Entity(BaseModel):
         name: str
         description: str

     class Relationship(BaseModel):
         entity1: Entity
         entity2: Entity
@@ -536,11 +569,11 @@ async def generate_knowledge_graph():
         relationships: List[Relationship]

     extraction_strategy = LLMExtractionStrategy(
-        provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
-        api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
+        provider="openai/gpt-4o-mini", # Or any other provider, including Ollama and open source models
+        api_token=os.getenv("OPENAI_API_KEY"), # In case of Ollama just pass "no-token"
         schema=KnowledgeGraph.model_json_schema(),
         extraction_type="schema",
-        instruction="""Extract entities and relationships from the given text."""
+        instruction="""Extract entities and relationships from the given text.""",
     )
     async with AsyncWebCrawler() as crawler:
         url = "https://paulgraham.com/love.html"
@@ -554,27 +587,22 @@ async def generate_knowledge_graph():
         with open(os.path.join(__location__, "kb.json"), "w") as f:
             f.write(result.extracted_content)


 async def fit_markdown_remove_overlay():

     async with AsyncWebCrawler(
         headless=True, # Set to False to see what is happening
         verbose=True,
         user_agent_mode="random",
-        user_agent_generator_config={
-            "device_type": "mobile",
-            "os_type": "android"
-        },
+        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
     ) as crawler:
         result = await crawler.arun(
-            url='https://www.kidocode.com/degrees/technology',
+            url="https://www.kidocode.com/degrees/technology",
             cache_mode=CacheMode.BYPASS,
             markdown_generator=DefaultMarkdownGenerator(
                 content_filter=PruningContentFilter(
                     threshold=0.48, threshold_type="fixed", min_word_threshold=0
                 ),
-                options={
-                    "ignore_links": True
-                }
+                options={"ignore_links": True},
             ),
             # markdown_generator=DefaultMarkdownGenerator(
             #     content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
@@ -583,31 +611,38 @@ async def fit_markdown_remove_overlay():
             #     }
             # ),
         )

         if result.success:
             print(len(result.markdown_v2.raw_markdown))
             print(len(result.markdown_v2.markdown_with_citations))
             print(len(result.markdown_v2.fit_markdown))

             # Save clean html
             with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
                 f.write(result.cleaned_html)

-            with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
+            with open(
+                os.path.join(__location__, "output/output_raw_markdown.md"), "w"
+            ) as f:
                 f.write(result.markdown_v2.raw_markdown)

-            with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
-                f.write(result.markdown_v2.markdown_with_citations)
+            with open(
+                os.path.join(__location__, "output/output_markdown_with_citations.md"),
+                "w",
+            ) as f:
+                f.write(result.markdown_v2.markdown_with_citations)

-            with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
+            with open(
+                os.path.join(__location__, "output/output_fit_markdown.md"), "w"
+            ) as f:
                 f.write(result.markdown_v2.fit_markdown)

             print("Done")


 async def main():
     # await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

     # await simple_crawl()
     # await simple_example_with_running_js_code()
|
||||||
# await simple_example_with_css_selector()
|
# await simple_example_with_css_selector()
|
||||||
@@ -618,7 +653,7 @@ async def main():
|
|||||||
# LLM extraction examples
|
# LLM extraction examples
|
||||||
# await extract_structured_data_using_llm()
|
# await extract_structured_data_using_llm()
|
||||||
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||||
# await extract_structured_data_using_llm("ollama/llama3.2")
|
# await extract_structured_data_using_llm("ollama/llama3.2")
|
||||||
|
|
||||||
# You always can pass custom headers to the extraction strategy
|
# You always can pass custom headers to the extraction strategy
|
||||||
# custom_headers = {
|
# custom_headers = {
|
||||||
@@ -626,13 +661,13 @@ async def main():
|
|||||||
# "X-Custom-Header": "Some-Value"
|
# "X-Custom-Header": "Some-Value"
|
||||||
# }
|
# }
|
||||||
# await extract_structured_data_using_llm(extra_headers=custom_headers)
|
# await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||||
|
|
||||||
# await crawl_dynamic_content_pages_method_1()
|
# await crawl_dynamic_content_pages_method_1()
|
||||||
# await crawl_dynamic_content_pages_method_2()
|
# await crawl_dynamic_content_pages_method_2()
|
||||||
await crawl_dynamic_content_pages_method_3()
|
await crawl_dynamic_content_pages_method_3()
|
||||||
|
|
||||||
# await crawl_custom_browser_type()
|
# await crawl_custom_browser_type()
|
||||||
|
|
||||||
# await speed_comparison()
|
# await speed_comparison()
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
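The example above writes several markdown variants (`raw_markdown`, `markdown_with_citations`, `fit_markdown`) to an `output/` directory. A minimal standalone sketch of that file-writing pattern, using only the standard library (it assumes nothing about the crawl4ai API; the helper name `save_markdown_variants` is hypothetical):

```python
import os
import tempfile


def save_markdown_variants(variants, out_dir):
    """Write each named markdown variant to its own file, creating out_dir if needed."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for name, text in variants.items():
        path = os.path.join(out_dir, f"output_{name}.md")
        with open(path, "w") as f:
            f.write(text)
        paths[name] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = save_markdown_variants(
        {"raw_markdown": "# raw", "fit_markdown": "# fit"},
        os.path.join(tmp, "output"),
    )
    print(sorted(os.path.basename(p) for p in paths.values()))
```

Using `os.makedirs(..., exist_ok=True)` up front avoids the `FileNotFoundError` you would otherwise get when `output/` does not exist yet.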
@@ -10,15 +10,17 @@ from functools import lru_cache

console = Console()


@lru_cache()
def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler


def print_result(result):
    # Print each key in one line and just the first 10 characters of each one's value and three dots
    console.print("\t[bold]Result:[/bold]")
    for key, value in result.model_dump().items():
        if isinstance(value, str) and value:
            console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -33,18 +35,27 @@ def cprint(message, press_any_key=False):
        console.print("Press any key to continue...", style="")
        input()


def basic_usage(crawler):
    cprint(
        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
    )
    result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)


def basic_usage_some_params(crawler):
    cprint(
        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
    )
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)


def screenshot_usage(crawler):
    cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
    result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -55,16 +66,23 @@ def screenshot_usage(crawler):
    cprint("Screenshot saved to 'screenshot.png'!")
    print_result(result)


def understanding_parameters(crawler):
    cprint(
        "\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
    )
    cprint(
        "By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
    )

    # First crawl (reads from cache)
    cprint("1️⃣ First crawl (caches the result):", True)
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business")
    end_time = time.time()
    cprint(
        f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
    )
    print_result(result)

    # Force to crawl again
@@ -72,169 +90,232 @@ def understanding_parameters(crawler):
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
    end_time = time.time()
    cprint(
        f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
    )
    print_result(result)


def add_chunking_strategy(crawler):
    # Adding a chunking strategy: RegexChunking
    cprint(
        "\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
        True,
    )
    cprint(
        "RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"]),
    )
    cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
    print_result(result)

    # Adding another chunking strategy: NlpSentenceChunking
    cprint(
        "\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
        True,
    )
    cprint(
        "NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
    )
    cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
    print_result(result)


def add_extraction_strategy(crawler):
    # Adding an extraction strategy: CosineStrategy
    cprint(
        "\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
        True,
    )
    cprint(
        "CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=CosineStrategy(
            word_count_threshold=10,
            max_dist=0.2,
            linkage_method="ward",
            top_k=3,
            sim_threshold=0.3,
            verbose=True,
        ),
    )
    cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
    print_result(result)

    # Using semantic_filter with CosineStrategy
    cprint(
        "You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=CosineStrategy(
            semantic_filter="inflation rent prices",
        ),
    )
    cprint(
        "[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
    )
    print_result(result)


def add_llm_extraction_strategy(crawler):
    # Adding an LLM extraction strategy without instructions
    cprint(
        "\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
        True,
    )
    cprint(
        "LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        ),
    )
    cprint(
        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
    )
    print_result(result)

    # Adding an LLM extraction strategy with instructions
    cprint(
        "\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
        True,
    )
    cprint(
        "Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv("OPENAI_API_KEY"),
            instruction="I am interested in only financial news",
        ),
    )
    cprint(
        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
    )
    print_result(result)

    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv("OPENAI_API_KEY"),
            instruction="Extract only content related to technology",
        ),
    )
    cprint(
        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
    )
    print_result(result)


def targeted_extraction(crawler):
    # Using a CSS selector to extract only H2 tags
    cprint(
        "\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
        True,
    )
    result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
    cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
    print_result(result)


def interactive_extraction(crawler):
    # Passing JavaScript code to interact with the page
    cprint(
        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
        True,
    )
    cprint(
        "In this example we try to click the 'Load More' button on the page using JavaScript code."
    )
    js_code = """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
    cprint(
        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
    )
    print_result(result)


def multiple_scrip(crawler):
    # Passing JavaScript code to interact with the page
    cprint(
        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
        True,
    )
    cprint(
        "In this example we try to click the 'Load More' button on the page using JavaScript code."
    )
    js_code = [
        """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """
    ] * 2
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
    cprint(
        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
    )
    print_result(result)


def using_crawler_hooks(crawler):
    # Example usage of the hooks for authentication and setting a cookie
    def on_driver_created(driver):
        print("[HOOK] on_driver_created")
        # Example customization: maximize the window
        driver.maximize_window()

        # Example customization: logging in to a hypothetical website
        driver.get("https://example.com/login")

        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC

        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "username"))
        )
        driver.find_element(By.NAME, "username").send_keys("testuser")
        driver.find_element(By.NAME, "password").send_keys("password123")
        driver.find_element(By.NAME, "login").click()
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "welcome"))
        )
        # Add a custom cookie
        driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
        return driver

    def before_get_url(driver):
        print("[HOOK] before_get_url")
        # Example customization: add a custom header
        # Enable Network domain for sending headers
        driver.execute_cdp_cmd("Network.enable", {})
        # Add a custom header
        driver.execute_cdp_cmd(
            "Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
        )
        return driver

    def after_get_url(driver):
        print("[HOOK] after_get_url")
        # Example customization: log the URL
@@ -246,48 +327,59 @@ def using_crawler_hooks(crawler):
        # Example customization: log the HTML
        print(len(html))
        return driver

    cprint(
        "\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
        True,
    )

    crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
    crawler_strategy.set_hook("on_driver_created", on_driver_created)
    crawler_strategy.set_hook("before_get_url", before_get_url)
    crawler_strategy.set_hook("after_get_url", after_get_url)
    crawler_strategy.set_hook("before_return_html", before_return_html)

    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
    crawler.warmup()
    result = crawler.run(url="https://example.com")

    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
    print_result(result=result)


def using_crawler_hooks_dleay_example(crawler):
    def delay(driver):
        print("Delaying for 5 seconds...")
        time.sleep(5)
        print("Resuming...")

    def create_crawler():
        crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
        crawler_strategy.set_hook("after_get_url", delay)
        crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
        crawler.warmup()
        return crawler

    cprint(
        "\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
    )
    crawler = create_crawler()
    result = crawler.run(url="https://google.com", bypass_cache=True)

    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
    print_result(result)


def main():
    cprint(
        "🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
    )
    cprint(
        "⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
    )
    cprint(
        "If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
    )

    crawler = create_crawler()

@@ -295,7 +387,7 @@ def main():
    basic_usage(crawler)
    # basic_usage_some_params(crawler)
    understanding_parameters(crawler)

    crawler.always_by_pass_cache = True
    screenshot_usage(crawler)
    add_chunking_strategy(crawler)
@@ -305,8 +397,10 @@ def main():
    interactive_extraction(crawler)
    multiple_scrip(crawler)

    cprint(
        "\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
    )


if __name__ == "__main__":
    main()
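The `RegexChunking(patterns=["\n\n"])` strategy used above splits a document on a regex pattern. A minimal standalone sketch of that idea, using only the standard library (the function name `regex_chunk` is hypothetical; crawl4ai's own implementation may differ in details such as handling of empty chunks):

```python
import re


def regex_chunk(text, patterns=("\n\n",)):
    """Split text into chunks on each regex pattern in turn, dropping empty pieces."""
    chunks = [text]
    for pattern in patterns:
        new_chunks = []
        for chunk in chunks:
            # Split on the pattern and keep only chunks with visible content
            new_chunks.extend(p for p in re.split(pattern, chunk) if p.strip())
        chunks = new_chunks
    return chunks


print(regex_chunk("Para one.\n\nPara two.\n\n\n"))  # → ['Para one.', 'Para two.']
```

Applying patterns sequentially lets a later pattern further subdivide the chunks produced by an earlier one.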
@@ -11,7 +11,9 @@ from groq import Groq
# Import threadpools to run the crawl_url function in a separate thread
from concurrent.futures import ThreadPoolExecutor

client = AsyncOpenAI(
    base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
)

# Instrument the OpenAI client
cl.instrument_openai()
@@ -25,41 +27,39 @@ settings = {
    "presence_penalty": 0,
}


def extract_urls(text):
    url_pattern = re.compile(r"(https?://\S+)")
    return url_pattern.findall(text)


def crawl_url(url):
    data = {
        "urls": [url],
        "include_raw_html": True,
        "word_count_threshold": 10,
        "extraction_strategy": "NoExtractionStrategy",
        "chunking_strategy": "RegexChunking",
    }
    response = requests.post("https://crawl4ai.com/crawl", json=data)
    response_data = response.json()
    response_data = response_data["results"][0]
    return response_data["markdown"]


@cl.on_chat_start
async def on_chat_start():
    cl.user_session.set("session", {"history": [], "context": {}})
    await cl.Message(content="Welcome to the chat! How can I assist you today?").send()


@cl.on_message
async def on_message(message: cl.Message):
    user_session = cl.user_session.get("session")

    # Extract URLs from the user's message
    urls = extract_urls(message.content)

    futures = []
    with ThreadPoolExecutor() as executor:
        for url in urls:
@@ -69,16 +69,9 @@ async def on_message(message: cl.Message):

    for url, result in zip(urls, results):
        ref_number = f"REF_{len(user_session['context']) + 1}"
        user_session["context"][ref_number] = {"url": url, "content": result}

    user_session["history"].append({"role": "user", "content": message.content})

    # Create a system message that includes the context
    context_messages = [
@@ -95,26 +88,17 @@ async def on_message(message: cl.Message):
                "If not, there is no need to add a references section. "
                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                + "\n\n".join(context_messages)
            ),
        }
    else:
        system_message = {"role": "system", "content": "You are a helpful assistant."}

    msg = cl.Message(content="")
    await msg.send()

    # Get response from the LLM
    stream = await client.chat.completions.create(
        messages=[system_message, *user_session["history"]], stream=True, **settings
    )

    assistant_response = ""
@@ -124,10 +108,7 @@ async def on_message(message: cl.Message):
        await msg.stream_token(token)

    # Add assistant message to the history
    user_session["history"].append({"role": "assistant", "content": assistant_response})
    await msg.update()

    # Append the reference section to the assistant's response
@@ -154,10 +135,11 @@ async def on_audio_chunk(chunk: cl.AudioChunk):

    pass


@cl.step(type="tool")
async def speech_to_text(audio_file):
    cli = Groq()

    response = await client.audio.transcriptions.create(
        model="whisper-large-v3", file=audio_file
    )
@@ -172,24 +154,19 @@ async def on_audio_end(elements: list[ElementBased]):
    audio_buffer.seek(0)  # Move the file pointer to the beginning
    audio_file = audio_buffer.read()
    audio_mime_type: str = cl.user_session.get("audio_mime_type")

    start_time = time.time()
    whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
    transcription = await speech_to_text(whisper_input)
    end_time = time.time()
    print(f"Transcription took {end_time - start_time} seconds")

    user_msg = cl.Message(author="You", type="user_message", content=transcription)
    await user_msg.send()
    await on_message(user_msg)


if __name__ == "__main__":
    from chainlit.cli import run_chainlit

    run_chainlit(__file__)
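The fan-out pattern used in `on_message` above (submit each URL to a thread pool, then `zip` the URLs with their results) can be sketched in isolation with the standard library; `fetch` here is a hypothetical stand-in for `crawl_url`:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for crawl_url: returns fake markdown per URL
    return f"markdown for {url}"


urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor() as executor:
    # Submit preserves order in the futures list, so zip(urls, results) lines up
    futures = [executor.submit(fetch, url) for url in urls]
    results = [f.result() for f in futures]

for url, result in zip(urls, results):
    print(url, "->", result)
```

Collecting `f.result()` in submission order is what keeps each URL paired with its own markdown, regardless of which thread finishes first.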
@@ -1,4 +1,3 @@
import requests, base64, os

data = {
@@ -6,59 +5,50 @@ data = {
    "screenshot": True,
}

response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()["results"][0]
print(result.keys())
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
#            'links', 'screenshot', 'markdown', 'extracted_content',
#            'metadata', 'error_message'])
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result["screenshot"]))

# Example of filtering the content using CSS selectors
data = {
    "urls": ["https://www.nbcnews.com/business"],
    "css_selector": "article",
    "screenshot": True,
}

# Example of executing a JS script on the page before extracting the content
data = {
    "urls": ["https://www.nbcnews.com/business"],
    "screenshot": True,
    "js": [
        """
const loadMoreButton = Array.from(document.querySelectorAll('button')).
    find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
    ],
}

# Example of using a custom extraction strategy
data = {
    "urls": ["https://www.nbcnews.com/business"],
    "extraction_strategy": "CosineStrategy",
    "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
}

# Example of using LLM to extract content
data = {
    "urls": ["https://www.nbcnews.com/business"],
    "extraction_strategy": "LLMExtractionStrategy",
    "extraction_strategy_args": {
        "provider": "groq/llama3-8b-8192",
        "api_token": os.environ.get("GROQ_API_KEY"),
        "instruction": """I am interested in only financial news,
        and translate them in French.""",
    },
}
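The screenshot round-trip above needs only `base64` from the standard library; a minimal sketch with a dummy payload standing in for the `screenshot` field of a real crawl response:

```python
import base64

# Hypothetical payload: stands in for the base64-encoded "screenshot" field
fake_png = b"\x89PNG\r\n\x1a\n" + b"dummy image bytes"
encoded = base64.b64encode(fake_png).decode("utf-8")  # what the API would return

# ...and the decode step the example performs before writing screenshot.png
decoded = base64.b64decode(encoded)
assert decoded == fake_png
```

The JSON response carries the image as base64 text, so decoding back to bytes before writing with mode `"wb"` is required.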
docs/examples/scraping_strategies_performance.py (new file, 135 lines)
@@ -0,0 +1,135 @@
import time, re
import functools
from collections import defaultdict

from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy


class TimingStats:
    def __init__(self):
        self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))

    def add(self, strategy_name, func_name, elapsed):
        self.stats[strategy_name][func_name]["calls"] += 1
        self.stats[strategy_name][func_name]["total_time"] += elapsed

    def report(self):
        for strategy_name, funcs in self.stats.items():
            print(f"\n{strategy_name} Timing Breakdown:")
            print("-" * 60)
            print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
            print("-" * 60)

            for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
                avg_ms = (data["total_time"] / data["calls"]) * 1000
                print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")


timing_stats = TimingStats()


# Timing decorator
def timing_decorator(strategy_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            timing_stats.add(strategy_name, func.__name__, elapsed)
            return result
        return wrapper
    return decorator


# Decorator application
def apply_decorators(cls, method_name, strategy_name):
    try:
        original_method = getattr(cls, method_name)
        decorated_method = timing_decorator(strategy_name)(original_method)
        setattr(cls, method_name, decorated_method)
    except AttributeError:
        print(f"Method {method_name} not found in class {cls.__name__}.")


# Apply to key methods
methods_to_profile = [
    '_scrap',
    # 'process_element',
    '_process_element',
    'process_image',
]

# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
    for method in methods_to_profile:
        apply_decorators(strategy, method, name)


def generate_large_html(n_elements=1000):
    html = ['<!DOCTYPE html><html><head></head><body>']
    for i in range(n_elements):
        html.append(f'''
        <div class="article">
            <h2>Heading {i}</h2>
            <div>
                <div>
                    <p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
                </div>
            </div>
            <img src="image{i}.jpg" alt="Image {i}">
            <ul>
                <li>List item {i}.1</li>
                <li>List item {i}.2</li>
            </ul>
        </div>
        ''')
    html.append('</body></html>')
    return ''.join(html)


def test_scraping():
    # Initialize both scrapers
    original_scraper = WebScrapingStrategy()
    selected_scraper = LXMLWebScrapingStrategy()

    # Generate test HTML
    print("Generating HTML...")
    html = generate_large_html(5000)
    print(f"HTML Size: {len(html)/1024:.2f} KB")

    # Time the scraping
    print("\nStarting scrape...")
    start_time = time.perf_counter()

    kwargs = {
        "url": "http://example.com",
        "html": html,
        "word_count_threshold": 5,
        "keep_data_attributes": True
    }

    t1 = time.perf_counter()
    result_selected = selected_scraper.scrap(**kwargs)
    t2 = time.perf_counter()

    result_original = original_scraper.scrap(**kwargs)
    t3 = time.perf_counter()

    elapsed = t3 - start_time
    print(f"\nScraping completed in {elapsed:.2f} seconds")

    timing_stats.report()

    # Print stats of LXML output
    print("\nTurbo Output:")
    print(f"\nExtracted links: {len(result_selected.links.internal) + len(result_selected.links.external)}")
    print(f"Extracted images: {len(result_selected.media.images)}")
    print(f"Clean HTML size: {len(result_selected.cleaned_html)/1024:.2f} KB")
    print(f"Scraping time: {t2 - t1:.2f} seconds")

    # Print stats of original output
    print("\nOriginal Output:")
    print(f"\nExtracted links: {len(result_original.links.internal) + len(result_original.links.external)}")
    print(f"Extracted images: {len(result_original.media.images)}")
    print(f"Clean HTML size: {len(result_original.cleaned_html)/1024:.2f} KB")
    print(f"Scraping time: {t3 - t1:.2f} seconds")


if __name__ == "__main__":
    test_scraping()
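The profiling script above is built on a standard wrap-and-measure decorator. A self-contained sketch of the same technique, using only the standard library (the names `timed`, `stats`, and `work` are illustrative, not from the crawl4ai API):

```python
import functools
import time
from collections import defaultdict

# Accumulates call count and cumulative wall time per "strategy.function" key
stats = defaultdict(lambda: {"calls": 0, "total_time": 0.0})


def timed(strategy_name):
    def decorator(func):
        @functools.wraps(func)  # preserve func.__name__ for reporting
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            key = f"{strategy_name}.{func.__name__}"
            stats[key]["calls"] += 1
            stats[key]["total_time"] += time.perf_counter() - start
            return result

        return wrapper

    return decorator


@timed("Demo")
def work(n):
    return sum(range(n))


for _ in range(3):
    work(1000)

print(stats["Demo.work"]["calls"])  # 3 calls recorded
```

Using `setattr` to swap a class method for its wrapped version, as the script does, applies exactly this decorator after the fact, without editing the library source.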
@@ -5,42 +5,47 @@ import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

# Create tmp directory if it doesn't exist
parent_dir = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
tmp_dir = os.path.join(parent_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)


async def main():
    # Configure crawler to fetch SSL certificate
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,  # Bypass cache to always get fresh certificates
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate

            # 1. Access certificate properties directly
            print("\nCertificate Information:")
            print(f"Issuer: {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # 2. Export certificate in different formats
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))  # For analysis
            print("\nCertificate exported to:")
            print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")

            pem_data = cert.to_pem(
                os.path.join(tmp_dir, "certificate.pem")
            )  # For web servers
            print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")

            der_data = cert.to_der(
                os.path.join(tmp_dir, "certificate.der")
            )  # For Java apps
            print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")


if __name__ == "__main__":
    asyncio.run(main())
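A certificate fingerprint is a hash over the DER-encoded certificate bytes, so the `fingerprint` property above can be reproduced with `hashlib` alone. A sketch over dummy DER bytes (the data is hypothetical, and SHA-256 is shown for illustration; the algorithm crawl4ai actually uses is an assumption here):

```python
import hashlib

# Dummy DER bytes; stands in for the output of cert.to_der(...)
der_bytes = b"\x30\x82\x01\x0a" + b"dummy certificate body"

# Fingerprint = hex digest of the hash of the DER encoding
fingerprint = hashlib.sha256(der_bytes).hexdigest()
print(f"Fingerprint: {fingerprint}")
```

This is why the DER export is the canonical form for comparison: PEM is just base64-wrapped DER, and hashing the PEM text would give a different digest.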
@@ -1,39 +1,41 @@
import os
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *

url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"

crawler = WebCrawler()
crawler.warmup()

from pydantic import BaseModel, Field


class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")


result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction="From the crawled content, extract the following details: "
        "1. Title of the page "
        "2. Summary of the page, which is a detailed summary "
        "3. Brief summary of the page, which is a paragraph text "
        "4. Keywords assigned to the page, which is a list of keywords. "
        "The extracted JSON format should look like this: "
        '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
    ),
    bypass_cache=True,
)
@@ -1,4 +1,5 @@
import os, sys

# append the parent directory to the sys.path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
@@ -13,19 +14,18 @@ import json
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter


# 1. File Download Processing Example
async def download_example():
    """Example of downloading files from Python.org"""
    # downloads_path = os.path.join(os.getcwd(), "downloads")
    downloads_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
    os.makedirs(downloads_path, exist_ok=True)

    print(f"Downloads will be saved to: {downloads_path}")

    async with AsyncWebCrawler(
        accept_downloads=True, downloads_path=downloads_path, verbose=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.python.org/downloads/",
@@ -40,9 +40,9 @@ async def download_example():
            }
            """,
            delay_before_return_html=1,  # Wait 1 second to ensure download starts
            cache_mode=CacheMode.BYPASS,
        )

        if result.downloaded_files:
            print("\nDownload successful!")
            print("Downloaded files:")
@@ -52,25 +52,26 @@ async def download_example():
        else:
            print("\nNo files were downloaded")


# 2. Local File and Raw HTML Processing Example
async def local_and_raw_html_example():
    """Example of processing local files and raw HTML"""
    # Create a sample HTML file
    sample_file = os.path.join(__data__, "sample.html")
    with open(sample_file, "w") as f:
        f.write(
            """
        <html><body>
            <h1>Test Content</h1>
            <p>This is a test paragraph.</p>
        </body></html>
        """
        )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # Process local file
        local_result = await crawler.arun(url=f"file://{os.path.abspath(sample_file)}")

        # Process raw HTML
        raw_html = """
        <html><body>
@@ -78,16 +79,15 @@ async def local_and_raw_html_example():
            <p>This is a test of raw HTML processing.</p>
        </body></html>
        """
        raw_result = await crawler.arun(url=f"raw:{raw_html}")

        # Clean up
        os.remove(sample_file)

        print("Local file content:", local_result.markdown)
        print("\nRaw HTML content:", raw_result.markdown)


# 3. Enhanced Markdown Generation Example
async def markdown_generation_example():
    """Example of enhanced markdown generation with citations and LLM-friendly features"""
@@ -97,58 +97,66 @@ async def markdown_generation_example():
        # user_query="History and cultivation",
        bm25_threshold=1.0
    )

    result = await crawler.arun(
        url="https://en.wikipedia.org/wiki/Apple",
        css_selector="main div#bodyContent",
        content_filter=content_filter,
        cache_mode=CacheMode.BYPASS,
    )

    from crawl4ai.content_filter_strategy import BM25ContentFilter

    result = await crawler.arun(
        url="https://en.wikipedia.org/wiki/Apple",
        css_selector="main div#bodyContent",
        content_filter=BM25ContentFilter(),
    )
    print(result.markdown_v2.fit_markdown)

    print("\nMarkdown Generation Results:")
    print(f"1. Original markdown length: {len(result.markdown)}")
    print("2. New markdown versions (markdown_v2):")
    print(f"   - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
    print(
        f"   - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}"
    )
    print(
        f"   - References section length: {len(result.markdown_v2.references_markdown)}"
    )
    if result.markdown_v2.fit_markdown:
        print(
            f"   - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}"
        )

    # Save examples to files
    output_dir = os.path.join(__data__, "markdown_examples")
    os.makedirs(output_dir, exist_ok=True)

    # Save different versions
    with open(os.path.join(output_dir, "1_raw_markdown.md"), "w") as f:
        f.write(result.markdown_v2.raw_markdown)

    with open(os.path.join(output_dir, "2_citations_markdown.md"), "w") as f:
        f.write(result.markdown_v2.markdown_with_citations)

    with open(os.path.join(output_dir, "3_references.md"), "w") as f:
        f.write(result.markdown_v2.references_markdown)

    if result.markdown_v2.fit_markdown:
        with open(os.path.join(output_dir, "4_filtered_markdown.md"), "w") as f:
            f.write(result.markdown_v2.fit_markdown)

    print(f"\nMarkdown examples saved to: {output_dir}")

    # Show a sample of citations and references
    print("\nSample of markdown with citations:")
    print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
    print("Sample of references:")
    print(
        "\n".join(result.markdown_v2.references_markdown.split("\n")[:10]) + "..."
    )


# 4. Browser Management Example
|
# 4. Browser Management Example
|
||||||
async def browser_management_example():
|
async def browser_management_example():
|
||||||
@@ -156,38 +164,38 @@ async def browser_management_example():
    # Use the specified user directory path
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    print(f"Browser profile will be saved to: {user_data_dir}")

    async with AsyncWebCrawler(
        use_managed_browser=True,
        user_data_dir=user_data_dir,
        headless=False,
        verbose=True,
    ) as crawler:

        result = await crawler.arun(
            url="https://crawl4ai.com",
            # session_id="persistent_session_1",
            cache_mode=CacheMode.BYPASS,
        )
        # Use GitHub as an example - it's a good test for browser management
        # because it requires proper browser handling
        result = await crawler.arun(
            url="https://github.com/trending",
            # session_id="persistent_session_1",
            cache_mode=CacheMode.BYPASS,
        )

        print("\nBrowser session result:", result.success)
        if result.success:
            print("Page title:", result.metadata.get("title", "No title found"))


# 5. API Usage Example
async def api_example():
    """Example of using the new API endpoints"""
    api_token = os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
    headers = {"Authorization": f"Bearer {api_token}"}
    async with aiohttp.ClientSession() as session:
        # Submit crawl job
        crawl_request = {
@@ -199,25 +207,17 @@ async def api_example():
                    "name": "Hacker News Articles",
                    "baseSelector": ".athing",
                    "fields": [
                        {"name": "title", "selector": ".title a", "type": "text"},
                        {"name": "score", "selector": ".score", "type": "text"},
                        {
                            "name": "url",
                            "selector": ".title a",
                            "type": "attribute",
                            "attribute": "href",
                        },
                    ],
                }
            },
        },
        "crawler_params": {
            "headless": True,
@@ -227,51 +227,50 @@ async def api_example():
            # "screenshot": True,
            # "magic": True
        }

        async with session.post(
            "http://localhost:11235/crawl", json=crawl_request, headers=headers
        ) as response:
            task_data = await response.json()
            task_id = task_data["task_id"]

        # Check task status
        while True:
            async with session.get(
                f"http://localhost:11235/task/{task_id}", headers=headers
            ) as status_response:
                result = await status_response.json()
                print(f"Task status: {result['status']}")

                if result["status"] == "completed":
                    print("Task completed!")
                    print("Results:")
                    news = json.loads(result["results"][0]["extracted_content"])
                    print(json.dumps(news[:4], indent=2))
                    break
                else:
                    await asyncio.sleep(1)


# Main execution
async def main():
    # print("Running Crawl4AI feature examples...")

    # print("\n1. Running Download Example:")
    # await download_example()

    # print("\n2. Running Markdown Generation Example:")
    # await markdown_generation_example()

    # # print("\n3. Running Local and Raw HTML Example:")
    # await local_and_raw_html_example()

    # # print("\n4. Running Browser Management Example:")
    await browser_management_example()

    # print("\n5. Running API Example:")
    await api_example()


if __name__ == "__main__":
    asyncio.run(main())
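The submit-then-poll pattern in `api_example` above generalizes beyond this API. As a dependency-free sketch under stated assumptions (the `poll_until_complete` helper and the fake status sequence are invented for illustration, not part of crawl4ai), the core loop is:

```python
# Distilled submit-then-poll loop: keep asking a status function until it
# reports "completed", sleeping between attempts like the demo does.
import asyncio

async def poll_until_complete(get_status, interval: float = 0.001) -> int:
    """Poll until completion; returns how many polls were needed."""
    polls = 0
    while True:
        polls += 1
        if await get_status() == "completed":
            return polls
        await asyncio.sleep(interval)  # back off before the next poll

statuses = iter(["pending", "processing", "completed"])

async def fake_status() -> str:
    # Stand-in for the GET /task/{task_id} request in the demo above.
    return next(statuses)

print(asyncio.run(poll_until_complete(fake_status)))  # 3
```

In the real demo the sleep is a full second between HTTP status checks; the tiny interval here only keeps the sketch fast.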
@@ -10,18 +10,17 @@ import asyncio
import os
import json
import re
from typing import List
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
)
from crawl4ai.content_filter_strategy import RelevantContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from bs4 import BeautifulSoup

# Sample HTML for demonstrations
@@ -52,17 +51,18 @@ SAMPLE_HTML = """
</div>
"""


async def demo_ssl_features():
    """
    Enhanced SSL & Security Features Demo
    -----------------------------------

    This example demonstrates the new SSL certificate handling and security features:
    1. Custom certificate paths
    2. SSL verification options
    3. HTTPS error handling
    4. Certificate validation configurations

    These features are particularly useful when:
    - Working with self-signed certificates
    - Dealing with corporate proxies
@@ -76,14 +76,11 @@ async def demo_ssl_features():

    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        fetch_ssl_certificate=True,  # Enable SSL certificate fetching
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"SSL Crawl Success: {result.success}")
        result.ssl_certificate.to_json(
            os.path.join(os.getcwd(), "ssl_certificate.json")
@@ -91,11 +88,12 @@ async def demo_ssl_features():
        if not result.success:
            print(f"SSL Error: {result.error_message}")


async def demo_content_filtering():
    """
    Smart Content Filtering Demo
    ----------------------

    Demonstrates advanced content filtering capabilities:
    1. Custom filter to identify and extract specific content
    2. Integration with markdown generation
@@ -110,87 +108,90 @@ async def demo_content_filtering():
            super().__init__()
            # Add news-specific patterns
            self.negative_patterns = re.compile(
                r"nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending",
                re.I,
            )
            self.min_word_count = 30  # Higher threshold for news content

        def filter_content(
            self, html: str, min_word_threshold: int = None
        ) -> List[str]:
            """
            Implements news-specific content filtering logic.

            Args:
                html (str): HTML content to be filtered
                min_word_threshold (int, optional): Minimum word count threshold

            Returns:
                List[str]: List of filtered HTML content blocks
            """
            if not html or not isinstance(html, str):
                return []

            soup = BeautifulSoup(html, "lxml")
            if not soup.body:
                soup = BeautifulSoup(f"<body>{html}</body>", "lxml")

            body = soup.find("body")

            # Extract chunks with metadata
            chunks = self.extract_text_chunks(
                body, min_word_threshold or self.min_word_count
            )

            # Filter chunks based on news-specific criteria
            filtered_chunks = []
            for _, text, tag_type, element in chunks:
                # Skip if element has negative class/id
                if self.is_excluded(element):
                    continue

                # Headers are important in news articles
                if tag_type == "header":
                    filtered_chunks.append(self.clean_element(element))
                    continue

                # For content, check word count and link density
                text = element.get_text(strip=True)
                if len(text.split()) >= (min_word_threshold or self.min_word_count):
                    # Calculate link density
                    links_text = " ".join(
                        a.get_text(strip=True) for a in element.find_all("a")
                    )
                    link_density = len(links_text) / len(text) if text else 1

                    # Accept if link density is reasonable
                    if link_density < 0.5:
                        filtered_chunks.append(self.clean_element(element))

            return filtered_chunks

    # Create markdown generator with custom filter
    markdown_gen = DefaultMarkdownGenerator(content_filter=CustomNewsFilter())

    run_config = CrawlerRunConfig(
        markdown_generator=markdown_gen, cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com", config=run_config
        )
        print("Filtered Content Sample:")
        print(result.markdown[:500])  # Show first 500 chars
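The link-density heuristic inside `CustomNewsFilter` above is worth isolating: it measures what share of a block's text sits inside links, and treats link-heavy blocks as navigation rather than content. A dependency-free sketch (inputs here are plain strings, whereas the filter works on BeautifulSoup elements):

```python
# Isolated link-density check: ratio of link text length to total text
# length; the filter above drops blocks at or above 0.5 as navigation.
def link_density(block_text: str, link_texts: list) -> float:
    links = " ".join(link_texts)
    return len(links) / len(block_text) if block_text else 1.0

article = "A long paragraph of real reporting that cites a single source link."
nav = "Home About Contact"

print(link_density(article, ["source"]) < 0.5)                # True: mostly prose
print(link_density(nav, ["Home", "About", "Contact"]) < 0.5)  # False: all links
```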


async def demo_json_extraction():
    """
    Improved JSON Extraction Demo
    ---------------------------

    Demonstrates the enhanced JSON extraction capabilities:
    1. Base element attributes extraction
    2. Complex nested structures
    3. Multiple extraction patterns

    Key features shown:
    - Extracting attributes from base elements (href, data-* attributes)
    - Processing repeated patterns
@@ -206,7 +207,7 @@ async def demo_json_extraction():
            "baseSelector": "div.article-list",
            "baseFields": [
                {"name": "list_id", "type": "attribute", "attribute": "data-list-id"},
                {"name": "category", "type": "attribute", "attribute": "data-category"},
            ],
            "fields": [
                {
@@ -214,8 +215,16 @@ async def demo_json_extraction():
                    "selector": "article.post",
                    "type": "nested_list",
                    "baseFields": [
                        {
                            "name": "post_id",
                            "type": "attribute",
                            "attribute": "data-post-id",
                        },
                        {
                            "name": "author_id",
                            "type": "attribute",
                            "attribute": "data-author",
                        },
                    ],
                    "fields": [
                        {
@@ -223,60 +232,68 @@ async def demo_json_extraction():
                            "selector": "h2.title a",
                            "type": "text",
                            "baseFields": [
                                {
                                    "name": "url",
                                    "type": "attribute",
                                    "attribute": "href",
                                }
                            ],
                        },
                        {
                            "name": "author",
                            "selector": "div.meta a.author",
                            "type": "text",
                            "baseFields": [
                                {
                                    "name": "profile_url",
                                    "type": "attribute",
                                    "attribute": "href",
                                }
                            ],
                        },
                        {"name": "date", "selector": "span.date", "type": "text"},
                        {
                            "name": "read_more",
                            "selector": "a.read-more",
                            "type": "nested",
                            "fields": [
                                {"name": "text", "type": "text"},
                                {
                                    "name": "url",
                                    "type": "attribute",
                                    "attribute": "href",
                                },
                            ],
                        },
                    ],
                }
            ],
        }
    )

    # Demonstrate extraction from raw HTML
    run_config = CrawlerRunConfig(
        extraction_strategy=json_strategy, cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw:" + SAMPLE_HTML,  # Use raw: prefix for raw HTML
            config=run_config,
        )
        print("Extracted Content:")
        print(result.extracted_content)


async def demo_input_formats():
    """
    Input Format Handling Demo
    ----------------------

    Demonstrates how LLM extraction can work with different input formats:
    1. Markdown (default) - Good for simple text extraction
    2. HTML - Better when you need structure and attributes

    This example shows how HTML input can be beneficial when:
    - You need to understand the DOM structure
    - You want to extract both visible text and HTML attributes
@@ -350,7 +367,7 @@ async def demo_input_formats():
        </footer>
    </div>
    """

    # Use raw:// prefix to pass HTML content directly
    url = f"raw://{dummy_html}"

@@ -359,18 +376,30 @@ async def demo_input_formats():

    # Define our schema using Pydantic
    class JobRequirement(BaseModel):
        category: str = Field(
            description="Category of the requirement (e.g., Technical, Soft Skills)"
        )
        items: List[str] = Field(
            description="List of specific requirements in this category"
        )
        priority: str = Field(
            description="Priority level (Required/Preferred) based on the HTML class or context"
        )

    class JobPosting(BaseModel):
        title: str = Field(description="Job title")
        department: str = Field(description="Department or team")
        location: str = Field(description="Job location, including remote options")
        salary_range: Optional[str] = Field(description="Salary range if specified")
        requirements: List[JobRequirement] = Field(
            description="Categorized job requirements"
        )
        application_deadline: Optional[str] = Field(
            description="Application deadline if specified"
        )
        contact_info: Optional[dict] = Field(
            description="Contact information from footer or contact section"
        )

    # First try with markdown (default)
    markdown_strategy = LLMExtractionStrategy(
@@ -382,7 +411,7 @@ async def demo_input_formats():
        Extract job posting details into structured data. Focus on the visible text content
        and organize requirements into categories.
        """,
        input_format="markdown",  # default
    )

    # Then with HTML for better structure understanding
@@ -400,34 +429,25 @@ async def demo_input_formats():

        Use HTML attributes and classes to enhance extraction accuracy.
        """,
        input_format="html",  # explicitly use HTML
    )

    async with AsyncWebCrawler() as crawler:
        # Try with markdown first
        markdown_config = CrawlerRunConfig(extraction_strategy=markdown_strategy)
        markdown_result = await crawler.arun(url=url, config=markdown_config)
        print("\nMarkdown-based Extraction Result:")
        items = json.loads(markdown_result.extracted_content)
        print(json.dumps(items, indent=2))

        # Then with HTML for better structure understanding
        html_config = CrawlerRunConfig(extraction_strategy=html_strategy)
        html_result = await crawler.arun(url=url, config=html_config)
        print("\nHTML-based Extraction Result:")
        items = json.loads(html_result.extracted_content)
        print(json.dumps(items, indent=2))


# Main execution
async def main():
    print("Crawl4AI v0.4.24 Feature Walkthrough")
@@ -439,5 +459,6 @@ async def main():
    await demo_json_extraction()
    # await demo_input_formats()


if __name__ == "__main__":
    asyncio.run(main())
docs/examples/v0_4_3b2_features_demo.py (new file, 354 lines)
@@ -0,0 +1,354 @@
"""
Crawl4ai v0.4.3b2 Features Demo
============================

This demonstration showcases three major categories of new features in Crawl4ai v0.4.3:

1. Efficiency & Speed:
   - Memory-efficient dispatcher strategies
   - New scraping algorithm
   - Streaming support for batch crawling

2. LLM Integration:
   - Automatic schema generation
   - LLM-powered content filtering
   - Smart markdown generation

3. Core Improvements:
   - Robots.txt compliance
   - Proxy rotation
   - Enhanced URL handling
   - Shared data among hooks
   - Ability to add page routes

Each demo function can be run independently or as part of the full suite.
"""
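The "Robots.txt compliance" item above is exposed through the new `check_robots_txt` parameter on `CrawlerRunConfig` (per this release's changelog), with blocked URLs surfacing as 403 results. As a self-contained illustration of the check itself, here is a stdlib-only sketch; `is_allowed` and the sample rules are stand-ins for crawl4ai's internal, SQLite-cached implementation:

```python
# Hedged, stdlib-only sketch of a robots.txt gate; crawl4ai performs this
# check inside AsyncWebCrawler and caches results in SQLite.
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "/public/page"))   # True
print(is_allowed(rules, "/private/page"))  # False
```

In the real API a disallowed URL does not raise; the crawl result simply comes back unsuccessful with a 403 status code.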

import asyncio
import os
import json
import re
import random
from typing import Optional, Dict
from dotenv import load_dotenv

load_dotenv()

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DisplayMode,
    MemoryAdaptiveDispatcher,
    CrawlerMonitor,
    DefaultMarkdownGenerator,
    LXMLWebScrapingStrategy,
    JsonCssExtractionStrategy,
    LLMContentFilter,
)
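The "Proxy rotation" item from the module docstring pairs with the changelog note that proxy settings can now be supplied per crawl via `CrawlerRunConfig`. A minimal round-robin rotator, sketched with the stdlib; the class name and the `{"server": ...}` mapping shape are illustrative, not crawl4ai's actual strategy API:

```python
# Illustrative round-robin proxy rotation; the chosen settings would be
# handed to each crawl request through CrawlerRunConfig (names hypothetical).
from itertools import cycle

class RoundRobinProxies:
    """Yield proxy server URLs in a fixed rotation."""
    def __init__(self, servers):
        self._servers = cycle(servers)

    def next_proxy(self) -> dict:
        # Shape mirrors a typical per-request proxy mapping.
        return {"server": next(self._servers)}

rotator = RoundRobinProxies(["http://proxy-a:8080", "http://proxy-b:8080"])
print(rotator.next_proxy()["server"])  # http://proxy-a:8080
print(rotator.next_proxy()["server"])  # http://proxy-b:8080
print(rotator.next_proxy()["server"])  # http://proxy-a:8080
```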


async def demo_memory_dispatcher():
    """Demonstrates the new memory-efficient dispatcher system.

    Key Features:
    - Adaptive memory management
    - Real-time performance monitoring
    - Concurrent session control
    """
    print("\n=== Memory Dispatcher Demo ===")

    try:
        # Configuration
        browser_config = BrowserConfig(headless=True, verbose=False)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator()
        )

        # Test URLs
        urls = ["http://example.com", "http://example.org", "http://example.net"] * 3

        print("\n📈 Initializing crawler with memory monitoring...")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            monitor = CrawlerMonitor(
                max_visible_rows=10,
                display_mode=DisplayMode.DETAILED
            )

            dispatcher = MemoryAdaptiveDispatcher(
                memory_threshold_percent=80.0,
                check_interval=0.5,
                max_session_permit=5,
                monitor=monitor
            )

            print("\n🚀 Starting batch crawl...")
            results = await dispatcher.run_urls(
                urls=urls,
                crawler=crawler,
                config=crawler_config,
            )
            print(f"\n✅ Completed {len(results)} URLs successfully")

    except Exception as e:
        print(f"\n❌ Error in memory dispatcher demo: {str(e)}")

|
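# --- Illustration (hypothetical helper, standalone; not part of the demo
# suite): the core admission decision a memory-adaptive dispatcher makes is
# to start a new crawl session only while memory usage stays under the
# configured threshold and the session permit is not exhausted. The
# function name and logic below are a sketch, not the library's internals.
def can_start_session(
    memory_percent: float,
    active_sessions: int,
    memory_threshold_percent: float = 80.0,
    max_session_permit: int = 5,
) -> bool:
    """Return True if a new crawl session may be admitted."""
    return (
        memory_percent < memory_threshold_percent
        and active_sessions < max_session_permit
    )
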
async def demo_streaming_support():
    """
    2. Streaming Support Demo
    ======================
    Shows how to process URLs as they complete using streaming
    """
    print("\n=== 2. Streaming Support Demo ===")

    browser_config = BrowserConfig(headless=True, verbose=False)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)

    # Test URLs
    urls = ["http://example.com", "http://example.org", "http://example.net"] * 2

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Initialize dispatcher for streaming
        dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)

        print("Starting streaming crawl...")
        async for result in dispatcher.run_urls_stream(
            urls=urls, crawler=crawler, config=crawler_config
        ):
            # Process each result as it arrives
            print(
                f"Received result for {result.url} - Success: {result.result.success}"
            )
            if result.result.success:
                print(f"Content length: {len(result.result.markdown)}")

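# --- Illustration (standalone sketch, hypothetical helper — not part of the
# demo suite): the streaming pattern behind run_urls_stream is an async
# generator that yields each result as soon as its task completes, instead
# of collecting the whole batch first. ---
import asyncio

async def stream_as_completed(urls):
    """Yield simulated results in completion order (stand-in for network I/O)."""
    async def fetch(url):
        await asyncio.sleep(0)  # stand-in for the actual crawl
        return f"done:{url}"

    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished
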
async def demo_content_scraping():
    """
    3. Content Scraping Strategy Demo
    ==============================
    Demonstrates the new LXMLWebScrapingStrategy for faster content scraping.
    """
    print("\n=== 3. Content Scraping Strategy Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://example.com/article"

    # Configure with the new LXML strategy
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy(), verbose=True)

    print("Scraping content with LXML strategy...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print("Successfully scraped content using LXML strategy")

async def demo_llm_markdown():
    """
    4. LLM-Powered Markdown Generation Demo
    ===================================
    Shows how to use the new LLM-powered content filtering and markdown generation.
    """
    print("\n=== 4. LLM-Powered Markdown Generation Demo ===")

    crawler = AsyncWebCrawler()
    url = "https://docs.python.org/3/tutorial/classes.html"

    content_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="""
        Focus on extracting the core educational content about Python classes.
        Include:
        - Key concepts and their explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        - Version information
        - Any non-essential UI elements

        Format the output as clean markdown with proper code blocks and headers.
        """,
        verbose=True,
    )

    # Configure LLM-powered markdown generation
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=content_filter
        ),
        cache_mode=CacheMode.BYPASS,
        verbose=True,
    )

    print("Generating focused markdown with LLM...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success and result.markdown_v2:
            print("Successfully generated LLM-filtered markdown")
            print("First 500 chars of filtered content:")
            print(result.markdown_v2.fit_markdown[:500])

async def demo_robots_compliance():
    """
    5. Robots.txt Compliance Demo
    ==========================
    Demonstrates the new robots.txt compliance feature with SQLite caching.
    """
    print("\n=== 5. Robots.txt Compliance Demo ===")

    crawler = AsyncWebCrawler()
    urls = ["https://example.com", "https://facebook.com", "https://twitter.com"]

    # Enable robots.txt checking
    config = CrawlerRunConfig(check_robots_txt=True, verbose=True)

    print("Crawling with robots.txt compliance...")
    async with crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            if result.status_code == 403:
                print(f"Access blocked by robots.txt: {result.url}")
            elif result.success:
                print(f"Successfully crawled: {result.url}")

async def demo_json_schema_generation():
    """
    7. LLM-Powered Schema Generation Demo
    =================================
    Demonstrates automatic CSS and XPath schema generation using LLM models.
    """
    print("\n=== 7. LLM-Powered Schema Generation Demo ===")

    # Example HTML content for a job listing
    html_content = """
    <div class="job-listing">
        <h1 class="job-title">Senior Software Engineer</h1>
        <div class="job-details">
            <span class="location">San Francisco, CA</span>
            <span class="salary">$150,000 - $200,000</span>
            <div class="requirements">
                <h2>Requirements</h2>
                <ul>
                    <li>5+ years Python experience</li>
                    <li>Strong background in web crawling</li>
                </ul>
            </div>
        </div>
    </div>
    """

    print("Generating CSS selectors schema...")
    # Generate CSS selectors with a specific query
    css_schema = JsonCssExtractionStrategy.generate_schema(
        html_content,
        schema_type="CSS",
        query="Extract job title, location, and salary information",
        provider="openai/gpt-4o",  # or use other providers like "ollama"
    )
    print("\nGenerated CSS Schema:")
    print(css_schema)

    # Example of using the generated schema with crawler
    crawler = AsyncWebCrawler()
    url = "https://example.com/job-listing"

    # Create an extraction strategy with the generated schema
    extraction_strategy = JsonCssExtractionStrategy(schema=css_schema)

    config = CrawlerRunConfig(extraction_strategy=extraction_strategy, verbose=True)

    print("\nTesting generated schema with crawler...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print(json.dumps(result.extracted_content, indent=2) if result.extracted_content else None)
            print("Successfully used generated schema for crawling")

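# --- Illustration (hypothetical example output, for reference only): the
# kind of structure generate_schema produces — a name, a base selector, and
# a list of field selectors. The exact selectors below are assumptions
# derived from the sample HTML above, not real LLM output. ---
example_generated_schema = {
    "name": "job_listing",
    "baseSelector": "div.job-listing",
    "fields": [
        {"name": "title", "selector": "h1.job-title", "type": "text"},
        {"name": "location", "selector": "span.location", "type": "text"},
        {"name": "salary", "selector": "span.salary", "type": "text"},
    ],
}
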
async def demo_proxy_rotation():
    """
    8. Proxy Rotation Demo
    ===================
    Demonstrates how to rotate proxies for each request using Crawl4ai.
    """
    print("\n=== 8. Proxy Rotation Demo ===")

    async def get_next_proxy() -> Optional[Dict]:
        """Get the next proxy from the PROXIES environment variable
        (comma-separated entries of the form ip:port:username:password)."""
        try:
            proxies = os.getenv("PROXIES", "").split(",")
            ip, port, username, password = random.choice(proxies).split(":")
            return {
                "server": f"http://{ip}:{port}",
                "username": username,
                "password": password,
                "ip": ip,  # Store original IP for verification
            }
        except Exception as e:
            print(f"Error loading proxy: {e}")
            return None

    # Create test requests to httpbin
    urls = ["https://httpbin.org/ip"] * 2

    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in urls:
            proxy = await get_next_proxy()
            if not proxy:
                print("No proxy available, skipping...")
                continue

            # Create new config with proxy
            current_config = run_config.clone(proxy_config=proxy)
            result = await crawler.arun(url=url, config=current_config)

            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                print(f"Proxy {proxy['ip']} -> Response IP: {ip_match.group(0) if ip_match else 'Not found'}")
                verified = ip_match is not None and ip_match.group(0) == proxy['ip']
                if verified:
                    print(f"✅ Proxy working! IP matches: {proxy['ip']}")
                else:
                    print("❌ Proxy failed or IP mismatch!")
            else:
                print(f"Failed with proxy {proxy['ip']}")

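# --- Illustration (hypothetical helper, standalone): parsing one proxy
# entry of the "ip:port:username:password" form consumed by
# demo_proxy_rotation above. ---
from typing import Dict

def parse_proxy_entry(entry: str) -> Dict:
    """Split a colon-separated proxy entry into the dict shape used above."""
    ip, port, username, password = entry.split(":")
    return {
        "server": f"http://{ip}:{port}",
        "username": username,
        "password": password,
        "ip": ip,
    }
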
async def main():
    """Run all feature demonstrations."""
    print("\n📊 Running Crawl4ai v0.4.3 Feature Demos\n")

    # Efficiency & Speed Demos
    print("\n🚀 EFFICIENCY & SPEED DEMOS")
    await demo_memory_dispatcher()
    await demo_streaming_support()
    await demo_content_scraping()

    # LLM Integration Demos
    print("\n🤖 LLM INTEGRATION DEMOS")
    await demo_json_schema_generation()
    await demo_llm_markdown()

    # Core Improvements
    print("\n🔧 CORE IMPROVEMENT DEMOS")
    await demo_robots_compliance()
    await demo_proxy_rotation()

if __name__ == "__main__":
    asyncio.run(main())
@@ -1,15 +1,17 @@
-# Advanced Features (Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
+# Overview of Some Important Advanced Features
+(Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
 
 Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
 
 1. **Proxy Usage**
 2. **Capturing PDFs & Screenshots**
 3. **Handling SSL Certificates**
 4. **Custom Headers**
 5. **Session Persistence & Local Storage**
+6. **Robots.txt Compliance**
 
 > **Prerequisites**
-> - You have a basic grasp of [AsyncWebCrawler Basics](./async-webcrawler-basics.md)
+> - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
 > - You know how to run or configure your Python environment with Playwright installed
 
 ---
@@ -84,7 +86,7 @@ async def main():
         # Save PDF
         if result.pdf:
             with open("wikipedia_page.pdf", "wb") as f:
-                f.write(b64decode(result.pdf))
+                f.write(result.pdf)
 
             print("[OK] PDF & screenshot captured.")
         else:
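The hunk above reflects that `result.pdf` now holds raw PDF bytes rather than a base64-encoded string, so the `b64decode` step is gone. A minimal standalone illustration of the two payload shapes (the byte string below is a stand-in, not real crawler output):

```python
import base64

# Raw PDF bytes, as `result.pdf` now provides them (stand-in payload):
raw_pdf = b"%PDF-1.4 stand-in"

# Old shape: the payload arrived base64-encoded, so callers had to decode
# before writing to disk.
legacy_payload = base64.b64encode(raw_pdf)
decoded = base64.b64decode(legacy_payload)
assert decoded == raw_pdf

# New shape: the bytes can be written as-is, e.g.
# open("page.pdf", "wb").write(raw_pdf) — no decode step needed.
print(decoded[:8])  # b'%PDF-1.4'
```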
@@ -186,7 +188,7 @@ if __name__ == "__main__":
 
 **Notes**
 - Some sites may react differently to certain headers (e.g., `Accept-Language`).
-- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-anti-bot.md) or use `UserAgentGenerator`.
+- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-based-crawling.md) or use `UserAgentGenerator`.
 
 ---
 
@@ -246,7 +248,43 @@ You can sign in once, export the browser context, and reuse it later—without r
 - **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
 - Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
 
-**See**: [Detailed session management tutorial](./hooks-custom.md#using-storage_state) or [Explanations → Browser Context & Managed Browser](../../explanations/browser-management.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
+**See**: [Detailed session management tutorial](./session-management.md) or [Explanations → Browser Context & Managed Browser](./identity-based-crawling.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
+
+---
+
+## 6. Robots.txt Compliance
+
+Crawl4AI supports respecting robots.txt rules with efficient caching:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+    # Enable robots.txt checking in config
+    config = CrawlerRunConfig(
+        check_robots_txt=True  # Will check and respect robots.txt rules
+    )
+
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            "https://example.com",
+            config=config
+        )
+
+        if not result.success and result.status_code == 403:
+            print("Access denied by robots.txt")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Key Points**
+- Robots.txt files are cached locally for efficiency
+- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
+- Cache has a default TTL of 7 days
+- If robots.txt can't be fetched, crawling is allowed
+- Returns 403 status code if URL is disallowed
 
 ---
 
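Conceptually, the allow/deny decision behind `check_robots_txt` matches what the standard library's robots parser computes. A standalone sketch using an in-memory robots.txt (no network, and none of Crawl4AI's caching):

```python
import urllib.robotparser

# A stand-in robots.txt that disallows /private for all user agents.
robots_txt = """User-agent: *
Disallow: /private
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Allowed path: the crawl proceeds normally.
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
# Disallowed path: this is where Crawl4AI would report a 403.
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
```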
@@ -283,7 +321,10 @@ async def main():
 
     # 3. Crawl
     async with AsyncWebCrawler(config=browser_cfg) as crawler:
-        result = await crawler.arun("https://secure.example.com/protected", config=crawler_cfg)
+        result = await crawler.arun(
+            url="https://secure.example.com/protected",
+            config=crawler_cfg
+        )
 
         if result.success:
             print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
@@ -317,13 +358,8 @@ You’ve now explored several **advanced** features:
 - **SSL Certificate** retrieval & exporting
 - **Custom Headers** for language or specialized requests
 - **Session Persistence** via storage state
+- **Robots.txt Compliance**
 
-**Where to go next**:
-
-- **[Hooks & Custom Code](./hooks-custom.md)**: For multi-step interactions (clicking “Load More,” performing logins, etc.)
-- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: If you need more sophisticated user simulation or stealth.
-- **[Reference → BrowserConfig & CrawlerRunConfig](../../reference/configuration.md)**: Detailed param descriptions for everything you’ve seen here and more.
-
 With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
 
-**Last Updated**: 2024-XX-XX
+**Last Updated**: 2025-01-01
@@ -1,136 +0,0 @@
-# Content Processing
-
-Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
-
-## Media Processing
-
-Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
-
-### Image Processing
-
-The library handles various image scenarios, including:
-- Regular images
-- Lazy-loaded images
-- Background images
-- Responsive images
-- Image metadata and context
-
-```python
-from crawl4ai.async_configs import CrawlerRunConfig
-
-config = CrawlerRunConfig()
-result = await crawler.arun(url="https://example.com", config=config)
-
-for image in result.media["images"]:
-    # Each image includes rich metadata
-    print(f"Source: {image['src']}")
-    print(f"Alt text: {image['alt']}")
-    print(f"Description: {image['desc']}")
-    print(f"Context: {image['context']}")  # Surrounding text
-    print(f"Relevance score: {image['score']}")  # 0-10 score
-```
-
-### Handling Lazy-Loaded Content
-
-Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:
-
-```python
-config = CrawlerRunConfig(
-    wait_for="css:img[data-src]",  # Wait for lazy images
-    delay_before_return_html=2.0  # Additional wait time
-)
-result = await crawler.arun(url="https://example.com", config=config)
-```
-
-### Video and Audio Content
-
-The library extracts video and audio elements with their metadata:
-
-```python
-from crawl4ai.async_configs import CrawlerRunConfig
-
-config = CrawlerRunConfig()
-result = await crawler.arun(url="https://example.com", config=config)
-
-# Process videos
-for video in result.media["videos"]:
-    print(f"Video source: {video['src']}")
-    print(f"Type: {video['type']}")
-    print(f"Duration: {video.get('duration')}")
-    print(f"Thumbnail: {video.get('poster')}")
-
-# Process audio
-for audio in result.media["audios"]:
-    print(f"Audio source: {audio['src']}")
-    print(f"Type: {audio['type']}")
-    print(f"Duration: {audio.get('duration')}")
-```
-
-## Link Analysis
-
-Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
-
-### Link Classification
-
-The library automatically categorizes links into:
-- Internal links (same domain)
-- External links (different domains)
-- Social media links
-- Navigation links
-- Content links
-
-```python
-from crawl4ai.async_configs import CrawlerRunConfig
-
-config = CrawlerRunConfig()
-result = await crawler.arun(url="https://example.com", config=config)
-
-# Analyze internal links
-for link in result.links["internal"]:
-    print(f"Internal: {link['href']}")
-    print(f"Link text: {link['text']}")
-    print(f"Context: {link['context']}")  # Surrounding text
-    print(f"Type: {link['type']}")  # nav, content, etc.
-
-# Analyze external links
-for link in result.links["external"]:
-    print(f"External: {link['href']}")
-    print(f"Domain: {link['domain']}")
-    print(f"Type: {link['type']}")
-```
-
-### Smart Link Filtering
-
-Control which links are included in the results with `CrawlerRunConfig`:
-
-```python
-config = CrawlerRunConfig(
-    exclude_external_links=True,  # Remove external links
-    exclude_social_media_links=True,  # Remove social media links
-    exclude_social_media_domains=[  # Custom social media domains
-        "facebook.com", "twitter.com", "instagram.com"
-    ],
-    exclude_domains=["ads.example.com"]  # Exclude specific domains
-)
-result = await crawler.arun(url="https://example.com", config=config)
-```
-
-## Metadata Extraction
-
-Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
-
-```python
-from crawl4ai.async_configs import CrawlerRunConfig
-
-config = CrawlerRunConfig()
-result = await crawler.arun(url="https://example.com", config=config)
-
-metadata = result.metadata
-print(f"Title: {metadata['title']}")
-print(f"Description: {metadata['description']}")
-print(f"Keywords: {metadata['keywords']}")
-print(f"Author: {metadata['author']}")
-print(f"Published Date: {metadata['published_date']}")
-print(f"Modified Date: {metadata['modified_date']}")
-print(f"Language: {metadata['language']}")
-```
docs/md_v2/advanced/crawl-dispatcher.md — 12 lines (new file)
@@ -0,0 +1,12 @@
+# Crawl Dispatcher
+
+We’re excited to announce a **Crawl Dispatcher** module that can handle **thousands** of crawling tasks simultaneously. By efficiently managing system resources (memory, CPU, network), this dispatcher ensures high-performance data extraction at scale. It also provides **real-time monitoring** of each crawler’s status, memory usage, and overall progress.
+
+Stay tuned—this feature is **coming soon** in an upcoming release of Crawl4AI! For the latest news, keep an eye on our changelogs and follow [@unclecode](https://twitter.com/unclecode) on X.
+
+Below is a **sample** of how the dispatcher’s performance monitor might look in action:
+
+
+
+
+We can’t wait to bring you this streamlined, **scalable** approach to multi-URL crawling—**watch this space** for updates!
@@ -17,18 +17,6 @@ async def main():
     asyncio.run(main())
 ```
 
-Or, enable it for a specific crawl by using `CrawlerRunConfig`:
-
-```python
-from crawl4ai.async_configs import CrawlerRunConfig
-
-async def main():
-    async with AsyncWebCrawler() as crawler:
-        config = CrawlerRunConfig(accept_downloads=True)
-        result = await crawler.arun(url="https://example.com", config=config)
-        # ...
-```
-
 ## Specifying Download Location
 
 Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
@@ -98,7 +86,8 @@ async def download_multiple_files(url: str, download_path: str):
             const downloadLinks = document.querySelectorAll('a[download]');
             for (const link of downloadLinks) {
                 link.click();
-                await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
+                // Delay between clicks
+                await new Promise(r => setTimeout(r, 2000));
             }
         """,
         wait_for=10  # Wait for all downloads to start
@@ -1,121 +1,254 @@
|
|||||||
# Hooks & Auth for AsyncWebCrawler
|
# Hooks & Auth in AsyncWebCrawler
|
||||||
|
|
||||||
Crawl4AI's `AsyncWebCrawler` allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This updated documentation demonstrates how to use hooks, including the new `on_page_context_created` hook, and ensures compatibility with `BrowserConfig` and `CrawlerRunConfig`.
|
Crawl4AI’s **hooks** let you customize the crawler at specific points in the pipeline:
|
||||||
|
|
||||||
## Example: Using Crawler Hooks with AsyncWebCrawler
|
1. **`on_browser_created`** – After browser creation.
|
||||||
|
2. **`on_page_context_created`** – After a new context & page are created.
|
||||||
|
3. **`before_goto`** – Just before navigating to a page.
|
||||||
|
4. **`after_goto`** – Right after navigation completes.
|
||||||
|
5. **`on_user_agent_updated`** – Whenever the user agent changes.
|
||||||
|
6. **`on_execution_started`** – Once custom JavaScript execution begins.
|
||||||
|
7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.
|
||||||
|
8. **`before_return_html`** – Right before returning the HTML content.
|
||||||
|
|
||||||
In this example, we'll:
|
**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
|
||||||
|
|
||||||
1. Configure the browser and set up authentication when it's created.
|
> note "Important Hook Usage Warning"
|
||||||
2. Apply custom routing and initial actions when the page context is created.
|
**Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
|
||||||
3. Add custom headers before navigating to the URL.
|
|
||||||
4. Log the current URL after navigation.
|
|
||||||
5. Perform actions after JavaScript execution.
|
|
||||||
6. Log the length of the HTML before returning it.
|
|
||||||
|
|
||||||
### Hook Definitions
|
> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
|
||||||
|
|
||||||
|
> **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
|
||||||
|
|
||||||
|
> **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
|
||||||
|
|
||||||
|
|
||||||
|
Below is an example demonstration.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Example: Using Hooks in AsyncWebCrawler
|
||||||
|
|
||||||
```python
|
```python
|
||||||
import asyncio
|
import asyncio
|
||||||
from crawl4ai import AsyncWebCrawler
|
import json
|
||||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext
```

### Using the Hooks with AsyncWebCrawler

```python
async def main():
    print("🔗 Hooks Example: Demonstrating recommended usage")

    # 1) Configure the browser
    browser_config = BrowserConfig(
        headless=True,
        verbose=True
    )

    # 2) Configure the crawler run
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS
    )

    # 3) Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)

    #
    # Define Hook Functions
    #

    async def on_browser_created(browser, **kwargs):
        # Called once the browser instance is created (but no pages or contexts yet)
        print("[HOOK] on_browser_created - Browser created successfully!")
        # Typically, do minimal setup here if needed
        return browser

    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        # Called right after a new page + context are created (ideal for auth or route config).
        print("[HOOK] on_page_context_created - Setting up page & context.")

        # Example 1: Route filtering (e.g., block images)
        async def route_filter(route):
            if route.request.resource_type == "image":
                print(f"[HOOK] Blocking image request: {route.request.url}")
                await route.abort()
            else:
                await route.continue_()

        await context.route("**", route_filter)

        # Example 2: (Optional) Simulate a login scenario
        # (We do NOT create or close pages here, just do quick steps if needed)
        # e.g., await page.goto("https://example.com/login")
        # e.g., await page.fill("input[name='username']", "testuser")
        # e.g., await page.fill("input[name='password']", "password123")
        # e.g., await page.click("button[type='submit']")
        # e.g., await page.wait_for_selector("#welcome")
        # e.g., await context.add_cookies([...])
        # Then continue

        # Example 3: Adjust the viewport
        await page.set_viewport_size({"width": 1080, "height": 600})
        return page

    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
        # Called before navigating to each URL.
        print(f"[HOOK] before_goto - About to navigate: {url}")
        # e.g., inject custom headers
        await page.set_extra_http_headers({
            "Custom-Header": "my-value"
        })
        return page

    async def after_goto(page: Page, context: BrowserContext, url: str, response, **kwargs):
        # Called after navigation completes.
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # e.g., wait for a certain element if we want to verify
        try:
            await page.wait_for_selector(".content", timeout=1000)
            print("[HOOK] Found .content element!")
        except Exception:
            print("[HOOK] .content not found, continuing anyway.")
        return page

    async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs):
        # Called whenever the user agent updates.
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page

    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        # Called after custom JavaScript execution begins.
        print("[HOOK] on_execution_started - JS code is running!")
        return page

    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        # Called before final HTML retrieval.
        print("[HOOK] before_retrieve_html - We can do final actions")
        # Example: Scroll again
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page

    async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
        # Called just before returning the HTML in the result.
        print(f"[HOOK] before_return_html - HTML length: {len(html)}")
        return page

    #
    # Attach Hooks
    #
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
    crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)

    await crawler.start()

    # 4) Run the crawler on an example page
    url = "https://example.com"
    result = await crawler.arun(url, config=crawler_run_config)

    if result.success:
        print("\nCrawled URL:", result.url)
        print("HTML length:", len(result.html))
    else:
        print("Error:", result.error_message)

    await crawler.close()

if __name__ == "__main__":
    asyncio.run(main())
```

---

## Hook Lifecycle Summary

1. **`on_browser_created`**:
   - Browser is up, but **no** pages or contexts yet.
   - Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).

2. **`on_page_context_created`**:
   - Perfect for advanced **auth** or route blocking.
   - You have a **page** + **context** ready but haven’t navigated to the target URL yet.

3. **`before_goto`**:
   - Right before navigation. Typically used for setting **custom headers** or logging the target URL.

4. **`after_goto`**:
   - After page navigation is done. Good place for verifying content or waiting on essential elements.

5. **`on_user_agent_updated`**:
   - Whenever the user agent changes (for stealth or different UA modes).

6. **`on_execution_started`**:
   - If you set `js_code` or run custom scripts, this runs once your JS is about to start.

7. **`before_retrieve_html`**:
   - Just before the final HTML snapshot is taken. Often you do a final scroll or trigger lazy loads here.

8. **`before_return_html`**:
   - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
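The lifecycle above can be pictured as a name-to-coroutine registry that the crawler strategy dispatches at each stage. The stdlib-only sketch below illustrates that mechanism; the `HookRegistry` class and `demo` flow are illustrative stand-ins, not Crawl4AI's actual internals.

```python
import asyncio

class HookRegistry:
    """Minimal sketch of named async hooks dispatched at lifecycle stages."""

    def __init__(self):
        self._hooks = {}

    def set_hook(self, name: str, fn):
        # Last registration wins, mirroring a simple set_hook-style API.
        self._hooks[name] = fn

    async def dispatch(self, name: str, value, **kwargs):
        # Run the hook if registered; otherwise pass the value through unchanged.
        hook = self._hooks.get(name)
        if hook is None:
            return value
        return await hook(value, **kwargs)

async def demo():
    registry = HookRegistry()
    calls = []

    async def before_goto(url, **kwargs):
        calls.append(f"before_goto:{url}")
        return url

    registry.set_hook("before_goto", before_goto)
    url = await registry.dispatch("before_goto", "https://example.com")
    return url, calls

url, calls = asyncio.run(demo())
print(url)    # https://example.com
print(calls)  # ['before_goto:https://example.com']
```

Because each hook returns the object it received, a dispatcher like this can chain stages while letting any hook observe or adjust the value in flight.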

---

## When to Handle Authentication

**Recommended**: Use **`on_page_context_created`** if you need to:

- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads

This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
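As a shape reference, such an auth hook typically navigates to the login page, fills the form, then persists the session cookie on the context before returning the page. The sketch below exercises that flow against stand-in objects so it runs without a browser; the selectors, credentials, and the `FakePage`/`FakeContext` classes are illustrative assumptions only.

```python
import asyncio

class FakePage:
    """Stand-in for a Playwright Page; records the auth steps performed."""
    def __init__(self):
        self.actions = []
    async def goto(self, url):
        self.actions.append(("goto", url))
    async def fill(self, selector, value):
        self.actions.append(("fill", selector, value))
    async def click(self, selector):
        self.actions.append(("click", selector))

class FakeContext:
    """Stand-in for a Playwright BrowserContext; stores added cookies."""
    def __init__(self):
        self.cookies = []
    async def add_cookies(self, cookies):
        self.cookies.extend(cookies)

async def on_page_context_created(page, context, **kwargs):
    # Log in once, then persist the session cookie on the context.
    await page.goto("https://example.com/login")
    await page.fill("input[name='username']", "testuser")
    await page.fill("input[name='password']", "password123")
    await page.click("button[type='submit']")
    await context.add_cookies(
        [{"name": "auth_token", "value": "abc123", "url": "https://example.com"}]
    )
    return page

page, context = FakePage(), FakeContext()
asyncio.run(on_page_context_created(page, context))
print(context.cookies[0]["name"])  # auth_token
```

With real Playwright objects the same hook body applies unchanged, since `goto`, `fill`, `click`, and `add_cookies` are the corresponding Page/BrowserContext methods.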

---

## Additional Considerations

- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.
- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.
- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
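On the concurrency point: when several crawls fire the same hook at once, any shared mutable state the hook touches needs async-safe access. A small stdlib sketch using an `asyncio.Lock` around a shared counter (the counter is just an illustration of shared state):

```python
import asyncio

async def crawl_many(urls):
    visit_count = 0
    lock = asyncio.Lock()

    async def before_goto(url, **kwargs):
        # Guard shared mutable state so parallel crawls don't race.
        nonlocal visit_count
        async with lock:
            visit_count += 1
        return url

    # Simulate arun_many() triggering the hook once per URL, in parallel.
    await asyncio.gather(*(before_goto(u) for u in urls))
    return visit_count

total = asyncio.run(crawl_many([f"https://example.com/{i}" for i in range(10)]))
print(total)  # 10
```

Pure-Python increments are rarely interrupted mid-statement, but the lock makes the invariant explicit and stays correct if the critical section ever grows an `await`.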

---

## Conclusion

Hooks provide **fine-grained** control over:

- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval

Follow the recommended usage:

- **Login** or advanced tasks in `on_page_context_created`
- **Custom headers** or logs in `before_goto` / `after_goto`
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
docs/md_v2/advanced/identity-based-crawling.md (new file, 180 lines)
@@ -0,0 +1,180 @@
# Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you’re recognized as a human and not mistaken for a bot. This tutorial covers:

1. **Managed Browsers** – The recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode** – A simplified fallback solution for quick automation without persistent identity.

---

## 1. Managed Browsers: Your Digital Identity Solution

**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.

### Key Benefits

- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you’re a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.

---

## 2. Creating a User Data Directory (Command-Line Approach via Playwright)

Rather than a system-wide Chrome/Edge, you can create your profile with **Playwright’s Chromium** binary: **locate** the binary, launch it with a `--user-data-dir` argument to set up the profile, then point `BrowserConfig.user_data_dir` at that folder for subsequent crawls.

If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:

1. **Find** the Playwright Chromium binary:
   - On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or similar path.
   - To see an overview of installed browsers, run:
     ```bash
     python -m playwright install --dry-run
     ```
     or
     ```bash
     playwright install --dry-run
     ```
     (depending on your environment). This shows where Playwright keeps Chromium.
   - For instance, you might see a path like:
     ```
     ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
     ```
     on Linux, or a corresponding folder on macOS/Windows.

2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
   ```bash
   # Linux example
   ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
       --user-data-dir=/home/<you>/my_chrome_profile
   ```
   ```bash
   # macOS example (Playwright’s internal binary)
   ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
       --user-data-dir=/Users/<you>/my_chrome_profile
   ```
   ```powershell
   # Windows example (PowerShell/cmd)
   "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
       --user-data-dir="C:\Users\<you>\my_chrome_profile"
   ```
   **Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.
   - This **opens** a fresh Chromium with your new or existing data folder.
   - **Log into** any sites or configure your browser the way you want.
   - **Close** when done—your profile data is saved in that folder.

3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/home/<you>/my_chrome_profile",
       browser_type="chromium"
   )
   ```
   - Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
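Locating the binary can also be scripted. The helper below globs a `ms-playwright` cache directory for a Chromium executable using the per-OS layouts shown above; cache locations vary by OS and Playwright version, so treat this as a best-effort sketch and confirm with `playwright install --dry-run`.

```python
from pathlib import Path
from typing import Optional

def find_playwright_chromium(cache_root: Path) -> Optional[Path]:
    """Return the first Chromium executable found under a ms-playwright cache dir."""
    patterns = [
        "chromium-*/chrome-linux/chrome",                              # Linux
        "chromium-*/chrome-mac/Chromium.app/Contents/MacOS/Chromium",  # macOS
        "chromium-*/chrome-win/chrome.exe",                            # Windows
    ]
    for pattern in patterns:
        for candidate in sorted(cache_root.glob(pattern)):
            return candidate
    return None

# Typical cache roots (assumptions; verify on your machine):
#   Linux:   ~/.cache/ms-playwright
#   macOS:   ~/Library/Caches/ms-playwright
#   Windows: %LOCALAPPDATA%\ms-playwright
chromium = find_playwright_chromium(Path.home() / ".cache" / "ms-playwright")
if chromium:
    print(f"Launch with: {chromium} --user-data-dir=/path/to/my_chrome_profile")
```

Sorting the glob results makes the pick deterministic when multiple `chromium-*` versions are cached.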

---

## 3. Using Managed Browsers in Crawl4AI

Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,  # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )

    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### Workflow

1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl** – The site sees your identity as if you’re the same user who just logged in.

---

## 4. Magic Mode: Simplified Automation

If you **don’t** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=CrawlerRunConfig(
            magic=True,  # Simplifies a lot of interaction
            remove_overlay_elements=True,
            page_timeout=60000
        )
    )
```

**Magic Mode**:

- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling

**But** it’s no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.

---

## 5. Comparing Managed Browsers vs. Magic Mode

| Feature                 | **Managed Browsers**                                     | **Magic Mode**                                      |
|-------------------------|----------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in user_data_dir      | No persistent data (fresh each run)                 |
| **Genuine Identity**    | Real user profile with full rights & preferences         | Emulated user-like patterns, but no actual identity |
| **Complex Sites**       | Best for login-gated sites or heavy config               | Simple tasks, minimal login or config needed        |
| **Setup**               | External creation of user_data_dir, then use in Crawl4AI | Single-line approach (`magic=True`)                 |
| **Reliability**         | Extremely consistent (same data across runs)             | Good for smaller tasks, can be less stable          |

---

## 6. Summary

- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.

**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.

With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.
@@ -1,156 +0,0 @@

### Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.

---

### Managed Browsers: Your Digital Identity Solution

**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.

#### Why Use Managed Browsers?

1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.

#### Steps to Use Managed Browsers

1. **Setup the Browser Configuration**:
   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
   from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

   browser_config = BrowserConfig(
       headless=False,  # Set to False for initial setup to view browser actions
       verbose=True,
       user_agent_mode="random",
       use_managed_browser=True,  # Enables persistent browser sessions
       browser_type="chromium",
       user_data_dir="/path/to/user_profile_data"  # Path to save session data
   )
   ```

2. **Perform an Initial Run**:
   - Run the crawler with `headless=False`.
   - Manually interact with the site (e.g., solve CAPTCHA or log in).
   - The browser session saves cookies, local storage, and other required data.

3. **Subsequent Runs**:
   - Switch to `headless=True` for automation.
   - The session data is reused, allowing seamless crawling.

#### Example: Extracting Data Using Managed Browsers

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```

### Benefits of Managed Browsers Over Other Methods

Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.

---

### Magic Mode: Simplified Automation

While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.

#### What Magic Mode Does:

- Simulates human browsing by randomizing interaction patterns and timing.
- Masks browser automation signals.
- Handles cookie popups and modals.
- Modifies navigator properties for enhanced compatibility.

#### Using Magic Mode

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )
```

Magic Mode is particularly useful for:

- Quick prototyping when a Managed Browser setup is not available.
- Basic sites requiring minimal interaction or configuration.

#### Example: Combining Magic Mode with Additional Options

```python
async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout for complex pages
        )

        return result.markdown if result.success else None
```

### Magic Mode vs. Managed Browsers

While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.

---

### Key Comparison: Managed Browsers vs. Magic Mode

| Feature                 | **Managed Browsers**                      | **Magic Mode**                      |
|-------------------------|-------------------------------------------|-------------------------------------|
| **Session Persistence** | Retains cookies and local storage.        | No session retention.               |
| **Human Interaction**   | Uses real user profiles and data.         | Simulates human-like patterns.      |
| **Complex Sites**       | Best suited for heavily configured sites. | Works well with simpler challenges. |
| **Setup Complexity**    | Requires initial manual interaction.      | Fully automated, one-line setup.    |

#### Recommendation:

- Use **Managed Browsers** for reliable, session-based crawling and data extraction.
- Use **Magic Mode** for quick prototyping or when persistent profiles are not required.

---

### Conclusion

- **Use Managed Browsers** to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
- **Leverage Magic Mode** for quick automation or in scenarios where persistent user profiles are not needed.

By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.
docs/md_v2/advanced/lazy-loading.md (new file, 104 lines)
@@ -0,0 +1,104 @@
## Handling Lazy-Loaded Images

Many websites now load images **lazily** as you scroll. If you need to ensure they appear in your final crawl (and in `result.media`), consider:

1. **`wait_for_images=True`** – Wait for images to fully load.
2. **`scan_full_page`** – Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`** – Add small delays between scroll steps.

**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).

### Example: Ensuring Lazy Images Appear

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.async_configs import CacheMode

async def main():
    config = CrawlerRunConfig(
        # Force the crawler to wait until images are fully loaded
        wait_for_images=True,

        # Option 1: If you want to automatically scroll the page to load images
        scan_full_page=True,  # Tells the crawler to try scrolling the entire page
        scroll_delay=0.5,     # Delay (seconds) between scroll steps

        # Option 2: If the site uses a 'Load More' or JS triggers for images,
        # you can also specify js_code or wait_for logic here.

        cache_mode=CacheMode.BYPASS,
        verbose=True
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://www.example.com/gallery", config=config)

        if result.success:
            images = result.media.get("images", [])
            print("Images found:", len(images))
            for i, img in enumerate(images[:5]):
                print(f"[Image {i}] URL: {img['src']}, Score: {img.get('score', 'N/A')}")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**Explanation**:

- **`wait_for_images=True`**
  The crawler tries to ensure images have finished loading before finalizing the HTML.
- **`scan_full_page=True`**
  Tells the crawler to attempt scrolling from top to bottom. Each scroll step helps trigger lazy loading.
- **`scroll_delay=0.5`**
  Pauses half a second between each scroll step, giving the site time to load images before continuing.
||||||
|
|
||||||
|
**When to Use**:
|
||||||
|
|
||||||
|
- **Lazy-Loading**: If images appear only when the user scrolls into view, `scan_full_page` + `scroll_delay` helps the crawler see them.
|
||||||
|
- **Heavier Pages**: If a page is extremely long, be mindful that scanning the entire page can be slow. Adjust `scroll_delay` or the max scroll steps as needed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Combining with Other Link & Media Filters
|
||||||
|
|
||||||
|
You can still combine **lazy-load** logic with the usual **exclude_external_images**, **exclude_domains**, or link filtration:
|
||||||
|
|
||||||
|
```python
|
||||||
|
config = CrawlerRunConfig(
|
||||||
|
wait_for_images=True,
|
||||||
|
scan_full_page=True,
|
||||||
|
scroll_delay=0.5,
|
||||||
|
|
||||||
|
# Filter out external images if you only want local ones
|
||||||
|
exclude_external_images=True,
|
||||||
|
|
||||||
|
# Exclude certain domains for links
|
||||||
|
exclude_domains=["spammycdn.com"],
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
This approach ensures you see **all** images from the main domain while ignoring external ones, and the crawler physically scrolls the entire page so that lazy-loading triggers.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tips & Troubleshooting
|
||||||
|
|
||||||
|
1. **Long Pages**
|
||||||
|
- Setting `scan_full_page=True` on extremely long or infinite-scroll pages can be resource-intensive.
|
||||||
|
- Consider using [hooks](../core/page-interaction.md) or specialized logic to load specific sections or “Load More” triggers repeatedly.
|
||||||
|
|
||||||
|
2. **Mixed Image Behavior**
|
||||||
|
- Some sites load images in batches as you scroll. If you’re missing images, increase your `scroll_delay` or call multiple partial scrolls in a loop with JS code or hooks.
|
||||||
|
|
||||||
|
3. **Combining with Dynamic Wait**
|
||||||
|
- If the site has a placeholder that only changes to a real image after a certain event, you might do `wait_for="css:img.loaded"` or a custom JS `wait_for`.
|
||||||
|
|
||||||
|
4. **Caching**
|
||||||
|
- If `cache_mode` is enabled, repeated crawls might skip some network fetches. If you suspect caching is missing new images, set `cache_mode=CacheMode.BYPASS` for fresh fetches.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
With **lazy-loading** support, **wait_for_images**, and **scan_full_page** settings, you can capture the entire gallery or feed of images you expect—even if the site only loads them as the user scrolls. Combine these with the standard media filtering and domain exclusion for a complete link & media handling strategy.
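For sites that need repeated partial scrolls rather than one full-page scan, one option is to generate the `js_code` snippet programmatically. This is a minimal sketch; `build_scroll_script` is a hypothetical helper, not part of crawl4ai:

```python
# Hypothetical helper: build a js_code string that scrolls the page in `steps`
# increments, pausing `pause_ms` milliseconds between each so lazy images can load.
def build_scroll_script(steps: int = 10, pause_ms: int = 500) -> str:
    return f"""
(async () => {{
    for (let i = 1; i <= {steps}; i++) {{
        window.scrollTo(0, document.body.scrollHeight * i / {steps});
        await new Promise(r => setTimeout(r, {pause_ms}));
    }}
}})();
""".strip()

# Pass the generated snippet via CrawlerRunConfig(js_code=[...]).
script = build_scroll_script(steps=20, pause_ms=300)
```

The generated string can then be supplied as one entry of a `js_code` list, alongside `wait_for_images=True`.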
@@ -1,52 +0,0 @@
# Magic Mode & Anti-Bot Protection

Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.

## Magic Mode

The easiest way to bypass anti-bot protections:

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all anti-detection features
    )
```

Magic Mode automatically:

- Masks browser automation signals
- Simulates human-like behavior
- Overrides navigator properties
- Handles cookie consent popups
- Manages browser fingerprinting
- Randomizes timing patterns

## Manual Anti-Bot Options

While Magic Mode is recommended, you can also configure individual anti-detection features:

```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,      # Simulate human behavior
    override_navigator=True  # Mask automation signals
)
```

Note: When `magic=True` is used, you don't need to set these individual options.

## Example: Handling Protected Sites

```python
async def crawl_protected_site(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000             # Increased timeout for protection checks
        )

        return result.markdown if result.success else None
```
@@ -1,188 +0,0 @@
# Creating Browser Instances, Contexts, and Pages

## 1 Introduction

### Overview of Browser Management in Crawl4AI

Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, anti-bot measures, and session persistence for high-volume, dynamic web crawling.

### Key Objectives

- **Anti-Bot Handling**:
  - Implements stealth techniques to evade detection mechanisms used by modern websites.
  - Simulates human-like behavior, such as mouse movements, scrolling, and key presses.
  - Supports integration with third-party services to bypass CAPTCHA challenges.
- **Persistent Sessions**:
  - Retains session data (cookies, local storage) for workflows requiring user authentication.
  - Allows seamless continuation of tasks across multiple runs without re-authentication.
- **Scalable Crawling**:
  - Optimized resource utilization for handling thousands of URLs concurrently.
  - Flexible configuration options to tailor crawling behavior to specific requirements.

---

## 2 Browser Creation Methods

### Standard Browser Creation

Standard browser creation initializes a browser instance with default or minimal configuration. It is suitable for tasks that do not require session persistence or heavy customization.

#### Features and Limitations

- **Features**:
  - Quick and straightforward setup for small-scale tasks.
  - Supports headless and headful modes.
- **Limitations**:
  - Lacks advanced customization options like session reuse.
  - May struggle with sites employing strict anti-bot measures.

#### Example Usage

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```

### Persistent Contexts

Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.

#### Benefits of Using `user_data_dir`

- **Session Persistence**:
  - Stores cookies, local storage, and cache between crawling sessions.
  - Reduces overhead for repetitive logins or multi-step workflows.
- **Enhanced Performance**:
  - Leverages pre-loaded resources for faster page loading.
- **Flexibility**:
  - Adapts to complex workflows requiring user-specific configurations.

#### Example: Setting Up Persistent Contexts

```python
config = BrowserConfig(user_data_dir="/path/to/user/data")
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```

### Managed Browser

The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.

#### How It Works

- **Browser Process Management**:
  - Automates initialization and cleanup of browser processes.
  - Optimizes resource usage by pooling and reusing browser instances.
- **Debugging Support**:
  - Integrates with debugging tools like Chrome Developer Tools for real-time inspection.
- **Anti-Bot Measures**:
  - Implements stealth plugins to mimic real user behavior and bypass bot detection.

#### Features

- **Customizable Configurations**:
  - Supports advanced options such as viewport resizing, proxy settings, and header manipulation.
- **Debugging and Logging**:
  - Logs detailed browser interactions for debugging and performance analysis.
- **Scalability**:
  - Handles multiple browser instances concurrently, scaling dynamically based on workload.

#### Example: Using `ManagedBrowser`

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

config = BrowserConfig(headless=False, debug_port=9222)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```

---

## 3 Context and Page Management

### Creating and Configuring Browser Contexts

Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.

#### Customizations

- **Headers and Cookies**:
  - Define custom headers to mimic specific devices or browsers.
  - Set cookies for authenticated sessions.
- **Session Reuse**:
  - Retain and reuse session data across multiple requests.
  - Example: Preserve login states for authenticated crawls.

#### Example: Context Initialization

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```

### Creating Pages

Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.

#### Key Features

- **IFrame Handling**:
  - Extract content from embedded iframes.
  - Navigate and interact with nested content.
- **Viewport Customization**:
  - Adjust viewport size to match target device dimensions.
- **Lazy Loading**:
  - Ensure dynamic elements are fully loaded before extraction.

#### Example: Page Initialization

```python
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```

---

## 4 Advanced Features and Best Practices

### Debugging and Logging

Remote debugging provides a powerful way to troubleshoot complex crawling workflows.

#### Example: Enabling Remote Debugging

```python
config = BrowserConfig(debug_port=9222)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
```

### Anti-Bot Techniques

- **Human Behavior Simulation**:
  - Mimic real user actions, such as scrolling, clicking, and typing.
  - Example: Use JavaScript to simulate interactions.
- **Captcha Handling**:
  - Integrate with third-party services like 2Captcha or AntiCaptcha for automated solving.

#### Example: Simulating User Actions

```python
js_code = """
(async () => {
    document.querySelector('input[name="search"]').value = 'test';
    document.querySelector('button[type="submit"]').click();
})();
"""
config = CrawlerRunConfig(js_code=[js_code])
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
```

### Optimizations for Performance and Scalability

- **Persistent Contexts**:
  - Reuse browser contexts to minimize resource consumption.
- **Concurrent Crawls**:
  - Use `arun_many` with a controlled semaphore count for efficient batch processing.

#### Example: Scaling Crawls

```python
urls = ["https://example1.com", "https://example2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls, config=config)
    for result in results:
        print(result.url, result.markdown)
```
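The semaphore-count idea above can be illustrated with plain `asyncio`. This is a simplified sketch of bounded concurrency, not crawl4ai's internal implementation; `fetch` is a stand-in for a real crawl:

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a real network crawl
    return f"crawled {url}"

async def crawl_many(urls, semaphore_count=10):
    sem = asyncio.Semaphore(semaphore_count)  # at most N crawls in flight

    async def bounded(url):
        async with sem:  # acquire a slot, release it when the crawl finishes
            return await fetch(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(crawl_many([f"https://example.com/{i}" for i in range(5)], 2))
```

The semaphore caps how many coroutines run the body concurrently, which is the same back-pressure mechanism a fixed-limit dispatcher provides.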
274
docs/md_v2/advanced/multi-url-crawling.md
Normal file
@@ -0,0 +1,274 @@
# Advanced Multi-URL Crawling with Dispatchers

> **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.

## 1. Introduction

When crawling many URLs:

- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)

**Why Dispatchers?**

- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency

## 2. Core Components

### 2.1 Rate Limiter

```python
class RateLimiter:
    def __init__(
        self,
        base_delay: Tuple[float, float] = (1.0, 3.0),  # Random delay range between requests
        max_delay: float = 60.0,                       # Maximum backoff delay
        max_retries: int = 3,                          # Retries before giving up
        rate_limit_codes: List[int] = [429, 503]       # Status codes triggering backoff
    )
```

The RateLimiter provides:

- Random delays between requests
- Exponential backoff on rate limit responses
- Domain-specific rate limiting
- Automatic retry handling
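The retry-with-backoff behavior described above can be sketched in plain Python. This is an illustration of the pattern, not the library's actual code; `fetch` here is a caller-supplied function returning an HTTP status code:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=(1.0, 3.0), max_delay=60.0,
                       rate_limit_codes=(429, 503)):
    """Retry `fetch()` with exponential backoff on rate-limit status codes."""
    for attempt in range(max_retries + 1):
        status = fetch()
        if status not in rate_limit_codes:
            return status  # success or a non-rate-limit error: stop retrying
        # Random base delay doubled each attempt (exponential backoff), capped at max_delay
        delay = min(random.uniform(*base_delay) * (2 ** attempt), max_delay)
        time.sleep(delay)
    return status  # retries exhausted; return the last rate-limit status
```

Doubling the delay each attempt spreads retries out, while the random base delay avoids many crawlers retrying in lockstep against the same domain.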
### 2.2 Crawler Monitor

The CrawlerMonitor provides real-time visibility into crawling operations:

```python
monitor = CrawlerMonitor(
    max_visible_rows=15,               # Maximum rows in live display
    display_mode=DisplayMode.DETAILED  # DETAILED or AGGREGATED view
)
```

**Display Modes**:

1. **DETAILED**: Shows individual task status, memory usage, and timing
2. **AGGREGATED**: Displays summary statistics and overall progress

## 3. Available Dispatchers

### 3.1 MemoryAdaptiveDispatcher (Default)

Automatically manages concurrency based on system memory usage:

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(1.0, 2.0),
        max_delay=30.0,
        max_retries=2
    ),
    monitor=CrawlerMonitor(         # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)
```

### 3.2 SemaphoreDispatcher

Provides simple concurrency control with a fixed limit:

```python
dispatcher = SemaphoreDispatcher(
    max_session_permit=5,          # Fixed concurrent tasks
    rate_limiter=RateLimiter(      # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
    ),
    monitor=CrawlerMonitor(        # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
)
```

## 4. Usage Examples

### 4.1 Batch Processing (Default)

```python
async def crawl_batch():
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: get all results at once
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Get all results at once
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )

        # Process all results after completion
        for result in results:
            if result.success:
                await process_result(result)
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
```

### 4.2 Streaming Mode

```python
async def crawl_streaming():
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Process results as they become available
        async for result in await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        ):
            if result.success:
                # Process each result immediately
                await process_result(result)
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
```

### 4.3 Semaphore-based Crawling

```python
async def crawl_with_semaphore(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    dispatcher = SemaphoreDispatcher(
        semaphore_count=5,
        rate_limiter=RateLimiter(
            base_delay=(0.5, 1.0),
            max_delay=10.0
        ),
        monitor=CrawlerMonitor(
            max_visible_rows=15,
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls,
            config=run_config,
            dispatcher=dispatcher
        )
        return results
```

### 4.4 Robots.txt Consideration

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Will respect robots.txt for each URL
        semaphore_count=3       # Max concurrent requests
    )

    async with AsyncWebCrawler() as crawler:
        async for result in crawler.arun_many(urls, config=config):
            if result.success:
                print(f"Successfully crawled {result.url}")
            elif result.status_code == 403 and "robots.txt" in result.error_message:
                print(f"Skipped {result.url} - blocked by robots.txt")
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

**Key Points**:

- When `check_robots_txt=True`, each URL's robots.txt is checked before crawling
- Robots.txt files are cached for efficiency
- Failed robots.txt checks return a 403 status code
- The dispatcher handles robots.txt checks automatically for each URL
## 5. Dispatch Results

Each crawl result includes dispatch information:

```python
@dataclass
class DispatchResult:
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""
```

Access via `result.dispatch_result`:

```python
for result in results:
    if result.success:
        dr = result.dispatch_result
        print(f"URL: {result.url}")
        print(f"Memory: {dr.memory_usage:.1f}MB")
        print(f"Duration: {dr.end_time - dr.start_time}")
```
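To report on a whole run, the per-task fields can be folded into a summary. This sketch assumes the `DispatchResult` shape shown above; `summarize` is a hypothetical helper, not part of crawl4ai:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DispatchResult:  # mirrors the dataclass documented above
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""

def summarize(dispatch_results):
    """Fold per-task dispatch stats into a simple report dict."""
    durations = [r.end_time - r.start_time for r in dispatch_results]
    return {
        "tasks": len(dispatch_results),
        "peak_memory_mb": max(r.peak_memory for r in dispatch_results),
        "total_duration": sum(durations, timedelta()),
    }
```

In practice you would collect `result.dispatch_result` from each successful crawl and pass that list to a helper like this.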
## 6. Summary

1. **Two Dispatcher Types**:
   - MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
   - SemaphoreDispatcher: Fixed concurrency limit

2. **Optional Components**:
   - RateLimiter: Smart request pacing and backoff
   - CrawlerMonitor: Real-time progress visualization

3. **Key Benefits**:
   - Automatic memory management
   - Built-in rate limiting
   - Live progress monitoring
   - Flexible concurrency control

Choose the dispatcher that best fits your needs:

- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios
@@ -1,6 +1,4 @@
# Proxy

Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.

## Basic Proxy Setup
@@ -38,58 +36,33 @@ async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Rotating Proxies

Example using a proxy rotation service dynamically:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # For each URL, create a new run config with a different proxy
        for url in urls:
            proxy = await get_next_proxy()
            # Clone the config and update proxy - this creates a new browser context
            current_config = run_config.clone(proxy_config=proxy)
            result = await crawler.arun(url=url, config=current_config)

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
```

## Custom Headers

Add security-related headers via `BrowserConfig`:

```python
from crawl4ai.async_configs import BrowserConfig

headers = {
    "X-Forwarded-For": "203.0.113.195",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
}

browser_config = BrowserConfig(headers=headers)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Combining with Magic Mode

For maximum protection, combine a proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    proxy="http://proxy.example.com:8080",
    headers={"Accept-Language": "en-US"}
)
crawler_config = CrawlerRunConfig(magic=True)  # Enable all anti-detection features

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawler_config)
```
@@ -1,179 +0,0 @@
### Session-Based Crawling for Dynamic Content

In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.

This guide explores advanced techniques for crawling dynamic content using Crawl4AI's session management features.

---

## Understanding Session-Based Crawling

Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:

1. **Efficient handling of dynamic content** without reloading the page.
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling).
3. **State maintenance** for authenticated sessions or multi-step workflows.
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources.

**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.

---

## Basic Concepts

Before diving into examples, here are some key concepts:

- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
- **CSS Selectors**: Target specific elements for interaction or data extraction.
- **Extraction Strategy**: Define rules to extract structured data.
- **Wait Conditions**: Specify conditions to wait for before proceeding.

---

## Example 1: Basic Session-Based Crawling

A simple example using session-based crawling:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def basic_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(basic_session_crawl())
```

This example shows:
1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.

---

## Advanced Technique 1: Custom Execution Hooks

Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:

```python
async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(advanced_session_crawl_with_hooks())
```

This technique ensures new content loads before the next action.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Advanced Technique 2: Integrated JavaScript Execution and Waiting

Combine JavaScript execution and waiting logic for concise handling of dynamic content:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(integrated_js_and_wait_crawl())
```

---
## Best Practices for Session-Based Crawling

1. **Unique Session IDs**: Assign descriptive and unique `session_id` values.
2. **Close Sessions**: Always clean up sessions with `kill_session` after use.
3. **Error Handling**: Anticipate and handle errors gracefully.
4. **Respect Websites**: Follow terms of service and robots.txt.
5. **Delays**: Add delays to avoid overwhelming servers.
6. **Optimize JavaScript**: Keep scripts concise for better performance.
7. **Monitor Resources**: Track memory and CPU usage for long sessions.
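Practices 2, 3, and 5 above can be folded into one small wrapper. The sketch below uses a hypothetical `run_with_cleanup` helper (not part of Crawl4AI): it spaces requests out with a randomized delay and guarantees the session is killed even if a crawl raises.

```python
import asyncio
import random

async def run_with_cleanup(crawler, session_id, configs, mean_delay=1.0):
    """Run several CrawlerRunConfig objects in one session, with a polite
    randomized delay between requests; always kill the session afterwards."""
    results = []
    try:
        for i, config in enumerate(configs):
            if i > 0:
                # Sleep between 0.5x and 1.5x the mean delay (practice 5).
                await asyncio.sleep(mean_delay * (0.5 + random.random()))
            results.append(await crawler.arun(config=config))
    finally:
        # Cleanup runs on success *and* on error (practices 2 and 3).
        await crawler.crawler_strategy.kill_session(session_id)
    return results
```

The `crawler` and config objects are the same ones used in the examples above; only the helper itself is illustrative.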
---

## Conclusion

Session-based crawling in Crawl4AI is a robust solution for handling dynamic content and multi-step workflows. By combining session management, JavaScript execution, and structured extraction strategies, you can effectively navigate and extract data from modern web applications. Always adhere to ethical web scraping practices and respect website policies.
@@ -1,4 +1,4 @@
# Session Management

Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:

@@ -20,8 +20,12 @@ async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # Define configurations
    config1 = CrawlerRunConfig(
        url="https://example.com/page1", session_id=session_id
    )
    config2 = CrawlerRunConfig(
        url="https://example.com/page2", session_id=session_id
    )

    # First request
    result1 = await crawler.arun(config=config1)

@@ -54,7 +58,9 @@ async def crawl_dynamic_content():
    schema = {
        "name": "Commit Extractor",
        "baseSelector": "li.Box-sc-g0xbh4-0",
        "fields": [{
            "name": "title", "selector": "h4.markdown-title", "type": "text"
        }],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema)
@@ -87,51 +93,146 @@ async def crawl_dynamic_content():

---

## Example 1: Basic Session-Based Crawling

A simple example using session-based crawling:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def basic_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(basic_session_crawl())
```

This example shows:

1. Reusing the same `session_id` across multiple requests.
2. Executing JavaScript to load more content dynamically.
3. Properly closing the session to free resources.

---
## Advanced Technique 1: Custom Execution Hooks

> Warning: You might feel confused by the end of the next few examples 😅, so make sure you are comfortable with the order of the parts before you start this.

Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(advanced_session_crawl_with_hooks())
```

This technique ensures new content loads before the next action.

---
## Advanced Technique 2: Integrated JavaScript Execution and Waiting

Combine JavaScript execution and waiting logic for concise handling of dynamic content:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(integrated_js_and_wait_crawl())
```
---

#### Common Use Cases for Sessions

1. **Authentication Flows**: Login and interact with secured pages.
2. **Pagination Handling**: Navigate through multiple pages.
3. **Form Submissions**: Fill forms, submit, and process results.
4. **Multi-step Processes**: Complete workflows that span multiple actions.
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
docs/md_v2/advanced/ssl-certificate.md (new file, 179 lines)
@@ -0,0 +1,179 @@
# `SSLCertificate` Reference

The **`SSLCertificate`** class encapsulates an SSL certificate’s data and allows exporting it in various formats (PEM, DER, JSON, or text). It’s used within **Crawl4AI** whenever you set **`fetch_ssl_certificate=True`** in your **`CrawlerRunConfig`**.

## 1. Overview

**Location**: `crawl4ai/ssl_certificate.py`

```python
class SSLCertificate:
    """
    Represents an SSL certificate with methods to export in various formats.

    Main Methods:
    - from_url(url, timeout=10)
    - from_file(file_path)
    - from_binary(binary_data)
    - to_json(filepath=None)
    - to_pem(filepath=None)
    - to_der(filepath=None)
    ...

    Common Properties:
    - issuer
    - subject
    - valid_from
    - valid_until
    - fingerprint
    """
```

### Typical Use Case

1. You **enable** certificate fetching in your crawl by:
   ```python
   CrawlerRunConfig(fetch_ssl_certificate=True, ...)
   ```
2. After `arun()`, if `result.ssl_certificate` is present, it’s an instance of **`SSLCertificate`**.
3. You can **read** basic properties (issuer, subject, validity) or **export** them in multiple formats.

---
## 2. Construction & Fetching

### 2.1 **`from_url(url, timeout=10)`**
Manually load an SSL certificate from a given URL (port 443). Typically used internally, but you can call it directly if you want:

```python
cert = SSLCertificate.from_url("https://example.com")
if cert:
    print("Fingerprint:", cert.fingerprint)
```

### 2.2 **`from_file(file_path)`**
Load from a file containing certificate data in ASN.1 or DER. Rarely needed unless you have local cert files:

```python
cert = SSLCertificate.from_file("/path/to/cert.der")
```

### 2.3 **`from_binary(binary_data)`**
Initialize from raw binary. E.g., if you captured it from a socket or another source:

```python
cert = SSLCertificate.from_binary(raw_bytes)
```

---

## 3. Common Properties

After obtaining a **`SSLCertificate`** instance (e.g. `result.ssl_certificate` from a crawl), you can read:

1. **`issuer`** *(dict)*
   - E.g. `{"CN": "My Root CA", "O": "..."}`
2. **`subject`** *(dict)*
   - E.g. `{"CN": "example.com", "O": "ExampleOrg"}`
3. **`valid_from`** *(str)*
   - NotBefore date/time. Often in ASN.1/UTC format.
4. **`valid_until`** *(str)*
   - NotAfter date/time.
5. **`fingerprint`** *(str)*
   - The SHA-256 digest (lowercase hex).
   - E.g. `"d14d2e..."`

---
## 4. Export Methods

Once you have a **`SSLCertificate`** object, you can **export** or **inspect** it:

### 4.1 **`to_json(filepath=None)` → `Optional[str]`**
- Returns a JSON string containing the parsed certificate fields.
- If `filepath` is provided, saves it to disk instead, returning `None`.

**Usage**:
```python
json_data = cert.to_json()        # returns JSON string
cert.to_json("certificate.json")  # writes file, returns None
```

### 4.2 **`to_pem(filepath=None)` → `Optional[str]`**
- Returns a PEM-encoded string (common for web servers).
- If `filepath` is provided, saves it to disk instead.

```python
pem_str = cert.to_pem()           # in-memory PEM string
cert.to_pem("/path/to/cert.pem")  # saved to file
```

### 4.3 **`to_der(filepath=None)` → `Optional[bytes]`**
- Returns the original DER (binary ASN.1) bytes.
- If `filepath` is specified, writes the bytes there instead.

```python
der_bytes = cert.to_der()
cert.to_der("certificate.der")
```

### 4.4 (Optional) **`export_as_text()`**
- If you see a method like `export_as_text()`, it typically returns an OpenSSL-style textual representation.
- Not always needed, but can help for debugging or manual inspection.

---
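Since `to_der()` returns the raw DER bytes and `fingerprint` is described as a SHA-256 digest in lowercase hex, the relationship between the two can be sketched in plain Python. This is an illustration of the documented format, not Crawl4AI’s exact implementation (it assumes the digest is computed over the DER-encoded certificate):

```python
import hashlib

def sha256_fingerprint(der_bytes: bytes) -> str:
    # SHA-256 over the DER-encoded certificate, rendered as lowercase hex,
    # matching the documented shape of SSLCertificate.fingerprint.
    return hashlib.sha256(der_bytes).hexdigest()
```

For example, `sha256_fingerprint(cert.to_der())` would produce a 64-character lowercase hex string comparable to `cert.fingerprint` under the stated assumption.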
## 5. Example Usage in Crawl4AI

Below is a minimal sample showing how the crawler obtains an SSL cert from a site, then reads or exports it:

```python
import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    tmp_dir = "tmp"
    os.makedirs(tmp_dir, exist_ok=True)

    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            # 1. Basic Info
            print("Issuer CN:", cert.issuer.get("CN", ""))
            print("Valid until:", cert.valid_until)
            print("Fingerprint:", cert.fingerprint)

            # 2. Export
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))

if __name__ == "__main__":
    asyncio.run(main())
```
---

## 6. Notes & Best Practices

1. **Timeout**: `SSLCertificate.from_url` internally uses a default **10s** socket connect and wraps SSL.
2. **Binary Form**: The certificate is loaded in ASN.1 (DER) form, then re-parsed by `OpenSSL.crypto`.
3. **Validation**: This does **not** validate the certificate chain or trust store. It only fetches and parses.
4. **Integration**: Within Crawl4AI, you typically just set `fetch_ssl_certificate=True` in `CrawlerRunConfig`; the final result’s `ssl_certificate` is automatically built.
5. **Export**: If you need to store or analyze a cert, `to_json` and `to_pem` are quite universal.
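Because, as noted above, Crawl4AI only fetches and parses the certificate, any validity check is up to you. A minimal expiry check can be built on `valid_until`. This is a sketch: it assumes an ASN.1/UTC-style `YYYYMMDDHHMMSSZ` timestamp (the format hinted at for `valid_from`/`valid_until`); adjust the format string if your parsed value differs.

```python
from datetime import datetime, timezone

def days_until_expiry(valid_until: str) -> int:
    # Assumes an ASN.1/UTC-style timestamp such as "20300101120000Z".
    expiry = datetime.strptime(valid_until, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).days
```

A typical use would be warning when `days_until_expiry(cert.valid_until)` drops below 30.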
---

### Summary

- **`SSLCertificate`** is a convenience class for capturing and exporting the **TLS certificate** from your crawled site(s).
- Common usage is in the **`CrawlResult.ssl_certificate`** field, accessible after setting `fetch_ssl_certificate=True`.
- Offers quick access to essential certificate details (`issuer`, `subject`, `fingerprint`) and is easy to export (PEM, DER, JSON) for further analysis or server usage.

Use it whenever you need **insight** into a site’s certificate or require some form of cryptographic or compliance check.
@@ -1,244 +1,305 @@
# `arun()` Parameter Guide (New Approach)

In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```

Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).

---

## 1. Core Usage

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
```

**Key Fields**:
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.

---
## 2. Cache Control

**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:

- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```

**Additional flags**:

- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
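The flag-to-mode equivalences above can be captured as a small lookup. This is a hypothetical helper for illustration only (`CrawlerRunConfig` resolves these flags internally, and the precedence shown here is an assumption):

```python
def resolve_cache_mode(bypass_cache=False, disable_cache=False,
                       no_cache_read=False, no_cache_write=False):
    # Mirrors the documented equivalences; precedence is illustrative.
    if disable_cache:
        return "DISABLED"
    if bypass_cache:
        return "BYPASS"
    if no_cache_read:
        return "WRITE_ONLY"
    if no_cache_write:
        return "READ_ONLY"
    return "ENABLED"  # the documented default
```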
---

## 3. Content Processing & Selection

### 3.1 Text Processing

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```

### 3.2 Content Selection

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```

### 3.3 Link Handling

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```

### 3.4 Media Filtering

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```

---
## 4. Page Navigation & Timing

### 4.1 Basic Browser Flow

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```

**Key Fields**:
- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`
  e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.

### 4.2 JavaScript Execution

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```

- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

### 4.3 Anti-Bot

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```

- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).

---

## 5. Session Management

**`session_id`**:

```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```

If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).

---
## 6. Screenshot, PDF & Media Options
|
||||||
|
|
||||||
|
```python
|
||||||
|
run_config = CrawlerRunConfig(
|
||||||
|
screenshot=True, # Grab a screenshot as base64
|
||||||
|
screenshot_wait_for=1.0, # Wait 1s before capturing
|
||||||
|
pdf=True, # Also produce a PDF
|
||||||
|
image_description_min_word_threshold=5, # If analyzing alt text
|
||||||
|
image_score_threshold=3, # Filter out low-score images
|
||||||
|
)
|
||||||
|
```
|
||||||
|
**Where they appear**:
|
||||||
|
- `result.screenshot` → Base64 screenshot string.
|
||||||
|
- `result.pdf` → Byte array with PDF data.
---

## 7. Extraction Strategy

**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:

```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```

The extracted data will appear in `result.extracted_content`.
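For the CSS/LLM strategies this is a JSON *string*, so it usually goes through `json.loads`. A small sketch using a stand-in string shaped like an article-list extraction:

```python
import json

# Stand-in for result.extracted_content (a JSON string, not a dict)
extracted_content = '[{"title": "Post One", "link": "/posts/1"}]'

# Parse once, then work with plain Python objects
articles = json.loads(extracted_content)
for article in articles:
    print(article["title"], "->", article["link"])
```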
---

## 8. Comprehensive Example

Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
**What we covered**:

1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click ".show-more".
3. **Waiting** for ".loaded-block" to appear.
4. Generating a **screenshot** & **PDF** of the final page.
5. Extracting repeated "article.post" elements with a **CSS-based** extraction strategy.

---

## 9. Best Practices

1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. Keep your **parameters consistent** in run configs—especially if you're part of a large codebase with multiple crawls.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can't handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
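Point 3's consistency advice is easiest to follow with reusable config objects: keep one base dict of keyword arguments and merge per-page overrides. A minimal sketch with plain dicts; in real code each dict would be passed as `CrawlerRunConfig(**cfg)`, and the selector values here are hypothetical:

```python
# Shared defaults for every crawl on this site
base_cfg = {
    "word_count_threshold": 10,
    "excluded_tags": ["nav", "footer"],
    "exclude_external_links": True,
}

# Per-page-type overrides, merged over the base (later keys win)
article_cfg = {**base_cfg, "css_selector": "main.content"}
listing_cfg = {**base_cfg, "css_selector": "ul.results", "word_count_threshold": 5}

# In real code: await crawler.arun(url, config=CrawlerRunConfig(**article_cfg))
print(article_cfg["css_selector"], listing_cfg["word_count_threshold"])
```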
---

## 10. Conclusion

All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:

- Makes code **clearer** and **more maintainable**.
- Minimizes confusion about which arguments affect global vs. per-crawl behavior.
- Allows you to create **reusable** config objects for different pages or tasks.

For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md).

Happy crawling with your **structured, flexible** config approach!
124
docs/md_v2/api/arun_many.md
Normal file

@@ -0,0 +1,124 @@
# `arun_many(...)` Reference

> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you're unfamiliar with `arun()` usage, please read that doc first, then review this for differences.

## Function Signature

```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```

## Differences from `arun()`

1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.

2. **Concurrency & Dispatchers**:
   - **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).

3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
   - Ideal for processing large numbers of URLs without waiting for all to complete.

4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).

### Basic Example (Batch Mode)

```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```

### Streaming Example

```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```

### With a Custom Dispatcher

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```

**Key Points**:

- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- If you need to handle authentication or session IDs, pass them in each individual task or within your run config.

### Return Value

Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item's `extracted_content`, `markdown`, or `dispatch_result`.
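The streaming case behaves like any async generator. A self-contained sketch with a stand-in `fake_arun_many` (not the real API; in real code the call is `async for result in await crawler.arun_many(...)`):

```python
import asyncio

async def fake_arun_many(urls):
    # Stand-in for arun_many with stream=True: yields each result as it finishes
    for url in urls:
        await asyncio.sleep(0)  # pretend each crawl takes time
        yield {"url": url, "success": True}

async def main():
    completed = []
    # Streaming style: handle each result as soon as it is available
    async for result in fake_arun_many(["https://site1.com", "https://site2.com"]):
        if result["success"]:
            completed.append(result["url"])
    # Batch style would instead collect everything first:
    # results = [r async for r in fake_arun_many(urls)]
    return completed

print(asyncio.run(main()))
```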
---

## Dispatcher Reference

- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.

For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).

---

## Common Pitfalls

1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.
2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
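For pitfall 1, one simple mitigation is to feed `arun_many()` fixed-size batches instead of one huge list. The batching itself is plain Python:

```python
def batched(urls, size):
    """Yield successive chunks of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

urls = [f"https://example.com/page{n}" for n in range(7)]

for batch in batched(urls, 3):
    # In real code: results = await crawler.arun_many(batch, config=run_cfg)
    print(len(batch), "urls in this batch")
```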
---

## Conclusion

Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.

@@ -1,320 +1,331 @@

# AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.

**Recommended usage**:

1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.

---

## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:

- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.

---

## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```

Use this style if you have a **long-running** application or need full control of the crawler's lifecycle.
---

## 3. Primary Method: `arun()`

```python
async def arun(
    self,
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
```

### 3.2 Legacy Parameters Still Accepted

For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
---

## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
```
### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently

### 4.2 Example Usage

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
from crawl4ai.dispatcher import DisplayMode

# Configure browser
browser_cfg = BrowserConfig(headless=True)

# Configure crawler with rate limiting
run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),       # Random delay between 1-2 seconds
        max_delay=30.0,              # Maximum delay after rate limit hits
        max_retries=2,               # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,   # Pause if memory exceeds this
    check_interval=0.5,              # How often to check resources
    max_session_permit=3,            # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
)

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
```

### 4.3 Key Features

1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy

2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained

3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics

4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
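The exponential backoff in point 1 can be pictured as a capped doubling schedule. A toy sketch of the delay sequence, with parameter names echoing `RateLimitConfig`; this mirrors the idea only, not the library's actual internals:

```python
def backoff_delays(base_delay=2.0, max_delay=30.0, max_retries=4):
    """Delay before each retry: doubles per attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]

print(backoff_delays())  # delays grow 2, 4, 8, 16 seconds, never past max_delay
```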
---

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.

For details, see [CrawlResult doc](./crawl-result.md).
|
||||||
|
|
||||||

---

## 6. Quick Example

Below is an example hooking it all together:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

**Explanation**:

- We define a **`BrowserConfig`** with Firefox, headless mode disabled, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses the cache**, uses a **CSS** extraction schema, sets `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
---

## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** passing legacy parameters like `css_selector` or `word_count_threshold` directly to `arun()`. Instead:

   ```python
   run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
   result = await crawler.arun(url="...", config=run_cfg)
   ```

4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.

---

## 8. Summary

**AsyncWebCrawler** is your entry point to asynchronous crawling:

- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.

**Migration**:

- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).

---

# CrawlerRunConfig Parameters Documentation

## Content Processing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |

## Caching Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter; if True, acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter; if True, acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter; if True, acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter; if True, acts like CacheMode.READ_ONLY |
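The legacy boolean flags above each collapse into a single `CacheMode`. A rough sketch of that mapping follows; the `CacheMode` enum here is a local stand-in, and the precedence order among the flags is an assumption for illustration, not taken from the source:

```python
from enum import Enum

class CacheMode(Enum):
    """Local stand-in mirroring the cache modes named in the table above."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    BYPASS = "bypass"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"

def resolve_cache_mode(bypass_cache=False, disable_cache=False,
                       no_cache_read=False, no_cache_write=False):
    """Map the legacy boolean flags onto a CacheMode (assumed precedence)."""
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    return CacheMode.ENABLED
```

In new code, prefer setting `cache_mode` directly rather than the legacy flags.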
## Page Navigation and Timing Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
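`mean_delay` and `max_range` together produce a randomized pause between `arun_many` requests. A minimal sketch of that behavior (the helper name is hypothetical; only the two parameters come from the table above):

```python
import random

def request_delay(mean_delay: float = 0.1, max_range: float = 0.3) -> float:
    """Base delay plus a random jitter in [0, max_range] seconds."""
    return mean_delay + random.uniform(0.0, max_range)

# With the defaults, each pause falls somewhere between 0.1s and 0.4s.
```

The jitter makes request timing less uniform, which is friendlier to target servers than a fixed interval.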
## Page Interaction Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |

## Media Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |

## Link and Domain Handling Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |

## Debugging and Logging Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |

---

# `CrawlResult` Reference

The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).

**Location**: `crawl4ai/crawler/models.py` (for reference)

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
```

Below is a **field-by-field** explanation and possible usage patterns.

---
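If you want to prototype result-handling code without running a crawl (or installing a browser), a throwaway stand-in covering a few of the fields above can be handy. This is a plain dataclass for offline experiments, not the real pydantic model:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class FakeCrawlResult:
    """Minimal offline stand-in for a handful of CrawlResult fields."""
    url: str
    html: str
    success: bool
    status_code: Optional[int] = None
    error_message: Optional[str] = None
    media: Dict[str, List[dict]] = field(default_factory=dict)
    links: Dict[str, List[dict]] = field(default_factory=dict)

stub = FakeCrawlResult(url="https://example.com", html="<html></html>",
                       success=True, status_code=200)
```

Feed such stubs to your post-processing functions in unit tests, then swap in real results from `arun()`.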
## 1. Basic Crawl Info

### 1.1 **`url`** *(str)*

**What**: The final crawled URL (after any redirects).

**Usage**:
```python
result = await crawler.arun(url="https://example.com")
print(result.url)  # "https://example.com"
```

### 1.2 **`success`** *(bool)*

**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.

**Usage**:
```python
if not result.success:
    print(f"Crawl failed: {result.error_message}")
```

### 1.3 **`status_code`** *(Optional[int])*

**What**: The page’s HTTP status code (e.g., 200, 404).

**Usage**:
```python
if result.status_code == 404:
    print("Page not found!")
```

### 1.4 **`error_message`** *(Optional[str])*

**What**: If `success=False`, a textual description of the failure.

**Usage**:
```python
if not result.success:
    print("Error:", result.error_message)
```

### 1.5 **`session_id`** *(Optional[str])*

**What**: The ID used for reusing a browser context across multiple calls.

**Usage**:
```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```

### 1.6 **`response_headers`** *(Optional[dict])*

**What**: Final HTTP response headers.

**Usage**:
```python
if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
```

### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*

**What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc.

**Usage**:
```python
if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)
```

---
## 2. Raw / Cleaned Content

### 2.1 **`html`** *(str)*

**What**: The **original** unmodified HTML from the final page load.

**Usage**:
```python
# Possibly large
print(len(result.html))
```

### 2.2 **`cleaned_html`** *(Optional[str])*

**What**: A sanitized HTML version — scripts, styles, and excluded tags are removed based on your `CrawlerRunConfig`.

**Usage**:
```python
print(result.cleaned_html[:500])  # Show a snippet
```

### 2.3 **`fit_html`** *(Optional[str])*

**What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, this holds the “fit” or post-filter version.

**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.

**Usage**:
```python
if result.fit_html:
    print("High-value HTML content:", result.fit_html[:300])
```

---
## 3. Markdown Fields

### 3.1 The Markdown Generation Approach

Crawl4AI can convert HTML→Markdown, optionally including:

- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)

### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*

**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.

**`MarkdownGenerationResult`** includes:

- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.

**Usage**:
```python
if result.markdown_v2:
    md_res = result.markdown_v2
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
```

### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*

**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.

**Usage**:
```python
# Soon, you might see:
if isinstance(result.markdown, MarkdownGenerationResult):
    print(result.markdown.raw_markdown[:200])
else:
    print(result.markdown)
```

### 3.4 **`fit_markdown`** *(Optional[str])*

**What**: A direct reference to the final filtered markdown (legacy approach).

**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.

**Usage**:
```python
print(result.fit_markdown)  # Legacy field; prefer result.markdown_v2.fit_markdown
```

**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.

---
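To make the “links as citations” idea concrete, here is a rough, self-contained sketch of turning inline markdown links into numbered references. This is not crawl4ai’s actual implementation, only an illustration of the kind of transformation that yields `markdown_with_citations` and `references_markdown`:

```python
import re

def links_to_citations(markdown: str):
    """Replace [text](url) links with numbered markers and collect a reference list."""
    refs = []

    def repl(match):
        text, url = match.group(1), match.group(2)
        refs.append(url)
        return f"{text}[{len(refs)}]"

    body = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
    references = "\n".join(f"[{i + 1}]: {url}" for i, url in enumerate(refs))
    return body, references
```

For example, `links_to_citations("See [docs](https://example.com).")` returns the body `"See docs[1]."` together with the reference line `"[1]: https://example.com"`.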
## 4. Media & Links

### 4.1 **`media`** *(Dict[str, List[Dict]])*

**What**: Contains info about discovered images, videos, or audio. Typical keys: `"images"`, `"videos"`, `"audios"`.

**Common Fields** in each item:

- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text

**Usage**:
```python
images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
```

### 4.2 **`links`** *(Dict[str, List[Dict]])*

**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.

**Common Fields**:

- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain

**Usage**:
```python
for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")
```

---
## 5. Additional Fields

### 5.1 **`extracted_content`** *(Optional[str])*

**What**: If you used an **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).

**Usage**:
```python
if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
```

### 5.2 **`downloaded_files`** *(Optional[List[str]])*

**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.

**Usage**:
```python
if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)
```

### 5.3 **`screenshot`** *(Optional[str])*

**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.

**Usage**:
```python
import base64

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```

### 5.4 **`pdf`** *(Optional[bytes])*

**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.

**Usage**:
```python
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

### 5.5 **`metadata`** *(Optional[dict])*

**What**: Page-level metadata if discovered (title, description, OG data, etc.).

**Usage**:
```python
if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))
```

---
## 6. `dispatch_result` (optional)

A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:

- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** *(float)*: The memory (in MB) used at the time of completion.
- **`peak_memory`** *(float)*: The peak memory usage (in MB) recorded during the task’s execution.
- **`start_time`** / **`end_time`** *(datetime)*: Time range for this crawling task.
- **`error_message`** *(str)*: Any dispatcher- or concurrency-related error encountered.

```python
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
```

> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.

---
## 7. Example: Accessing Everything

```python
async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown_v2:
        print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
        print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
        if result.markdown_v2.fit_markdown:
            print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
    else:
        print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
```

---
## 8. Key Points & Future
1. **`markdown_v2` vs `markdown`**
   - Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
   - In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.

2. **Fit Content**
   - **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
   - If no filter is used, they remain `None`.

3. **References & Citations**
   - If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you'll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing.

4. **Links & Media**
   - `links["internal"]` and `links["external"]` group discovered anchors by domain.
   - `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.

5. **Error Cases**
   - If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
   - `status_code` might be `None` if the crawl failed before an HTTP response.

Use **`CrawlResult`** to glean all final outputs and feed them into your data pipelines, AI models, or archives. With the synergy of a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**.
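The fallback order from points 1 and 2 can be captured in a small helper. This is an illustrative utility, not part of crawl4ai; it relies only on the attribute names documented above:

```python
# Illustrative helper (not part of crawl4ai): pick the most refined markdown
# available on a CrawlResult-like object, preferring filtered "fit" content.
def best_markdown(result) -> str:
    md2 = getattr(result, "markdown_v2", None)
    if md2 is not None:
        # fit_markdown stays None unless a content filter was configured
        return md2.fit_markdown or md2.raw_markdown
    # Legacy fallback for results that only expose `markdown`
    return getattr(result, "markdown", None) or ""
```

This mirrors the recommendation to prefer `fit_markdown` when a filter produced it, then fall back to `raw_markdown` or the legacy `markdown` field.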
# 1. **BrowserConfig** – Controlling the Browser

`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```

## 1.1 Parameter Highlights

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`proxy`** | `str` (default: `None`) | Single-proxy URL if you want all traffic to go through it, e.g. `"http://user:pass@proxy:8080"`. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keeps cookies and sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images and other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced "managed" interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |

**Tips**:
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.

---
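The persistent-session tip above can be sketched as a config fragment. The profile directory name here is an arbitrary assumption; any writable path works:

```python
from crawl4ai import BrowserConfig

# Sketch: keep cookies/login state between runs (see Tips above).
# "./browser_profile" is a hypothetical path, not a required location.
persistent_cfg = BrowserConfig(
    headless=True,
    use_persistent_context=True,  # also implies use_managed_browser=True
    user_data_dir="./browser_profile",
)
```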
# 2. **CrawlerRunConfig** – Controlling Each Crawl

While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
    stream=True,  # Enable streaming for arun_many()
)
```

## 2.1 Parameter Highlights

We group them by category.

### A) **Content Processing**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). |
| **`content_filter`** | `RelevantContentFilter` (None) | Filters out irrelevant text blocks. E.g., `PruningContentFilter` or `BM25ContentFilter`. |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
| **`excluded_selector`** | `str` (None) | Like `css_selector` but for exclusion. E.g. `"#ads, .tracker"`. |
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies the final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserves `data-*` attributes in cleaned HTML. |
| **`remove_forms`** | `bool` (False) | If `True`, removes all `<form>` elements. |

---
### B) **Caching & Session**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |

Use these to control whether you read from or write to a local content cache. Handy for large batch crawls or repeated site visits.

---
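A common pattern is a cached first pass plus a forced-refresh pass. This sketch combines `cache_mode` with the `clone()` helper described in section 2.2, assuming crawl4ai is installed:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Normal pass: read and write the local cache.
cached_cfg = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

# Forced-refresh pass: same settings, but skip cache reads.
refresh_cfg = cached_cfg.clone(cache_mode=CacheMode.BYPASS)
```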
### C) **Page Navigation & Timing**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`wait_until`** | `str` (`"domcontentloaded"`) | Condition for navigation to "complete". Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows things down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before the final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If `True`, caches robots.txt for efficiency. |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |

---
### D) **Page Interaction**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates we're reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scrolls the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulates user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Overrides `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
| **`adjust_viewport_to_content`** | `bool` (False) | Resizes the viewport to match page content height. |

If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.

---
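The `session_id` + `js_only` pattern above can be sketched as two config fragments. The `.item-list` and `.load-more` selectors are hypothetical placeholders for your target page:

```python
from crawl4ai import CrawlerRunConfig

# First call: load the page fully and wait for the (hypothetical) list.
first_cfg = CrawlerRunConfig(session_id="spa", wait_for="css:.item-list")

# Follow-up calls: same session_id, JS only — click "load more" without a reload.
next_cfg = CrawlerRunConfig(
    session_id="spa",
    js_only=True,
    js_code="document.querySelector('.load-more')?.click();",
)
```

Pass `first_cfg` to the initial `arun()` and `next_cfg` to each subsequent call on the same tab.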
### E) **Media Handling**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`screenshot`** | `bool` (False) | Captures a screenshot (base64) in `result.screenshot`. |
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image's alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filters out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Excludes images from other domains. |

---
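Since `result.screenshot` is base64-encoded while `result.pdf` is raw bytes, saving them to disk differs slightly. A small illustrative helper, not part of crawl4ai:

```python
import base64

def save_screenshot(b64_png: str, path: str) -> int:
    """Decode a base64 screenshot and write it to disk; returns byte count."""
    data = base64.b64decode(b64_png)
    with open(path, "wb") as f:
        f.write(data)
    return len(data)

def save_pdf(pdf_bytes: bytes, path: str) -> int:
    """result.pdf is already raw bytes, so no decoding step is needed."""
    with open(path, "wb") as f:
        f.write(pdf_bytes)
    return len(pdf_bytes)
```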
### F) **Link/Domain Handling**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list that can be extended. Any link to these domains is removed from the final output. |
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |

Use these for link-level content filtering (often to keep crawls "internal" or to remove spammy domains).

---
### G) **Rate Limiting & Resource Management**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`enable_rate_limiting`** | `bool` (default: `False`) | Enables intelligent rate limiting when crawling multiple URLs. |
| **`rate_limit_config`** | `RateLimitConfig` (default: `None`) | Configuration for rate-limiting behavior. |

The `RateLimitConfig` class has these fields:

| **Field** | **Type / Default** | **What It Does** |
|---|---|---|
| **`base_delay`** | `Tuple[float, float]` (1.0, 3.0) | Random delay range between requests to the same domain. |
| **`max_delay`** | `float` (60.0) | Maximum delay after rate-limit detection. |
| **`max_retries`** | `int` (3) | Number of retries before giving up on rate-limited requests. |
| **`rate_limit_codes`** | `List[int]` ([429, 503]) | HTTP status codes that trigger rate-limiting behavior. |

Resource management is controlled by these parameters:

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`memory_threshold_percent`** | `float` (70.0) | Maximum memory usage before pausing new crawls. |
| **`check_interval`** | `float` (1.0) | How often to check system resources (in seconds). |
| **`max_session_permit`** | `int` (20) | Maximum number of concurrent crawl sessions. |
| **`display_mode`** | `str` (`None`, `"DETAILED"`, `"AGGREGATED"`) | How to display progress information. |

---
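The backoff policy those fields describe can be approximated in a few lines. This is illustrative only; the real dispatcher's internals may differ:

```python
import random

def next_delay(attempt: int,
               base: tuple = (1.0, 3.0),
               max_delay: float = 60.0) -> float:
    """Random base delay, doubled per retry attempt, capped at max_delay."""
    low, high = base
    return min(random.uniform(low, high) * (2 ** attempt), max_delay)
```

With the defaults, a first attempt waits 1–3 seconds; after repeated 429/503 responses the delay grows exponentially until it hits the 60-second cap.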
### H) **Debug & Logging**

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the page's JavaScript console output if you want deeper JS debugging. |

---
## 2.2 Helper Methods

Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:

```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200
)

# Create variations using clone()
stream_config = base_config.clone(stream=True)
no_cache_config = base_config.clone(
    cache_mode=CacheMode.BYPASS,
    stream=True
)
```

The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
## 2.3 Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        ),
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode="DETAILED",
        stream=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
## 2.4 Compliance & Ethics

| **Parameter** | **Type / Default** | **What It Does** |
|---|---|---|
| **`check_robots_txt`** | `bool` (False) | When `True`, checks and respects robots.txt rules before crawling. Uses efficient caching with a SQLite backend. |
| **`user_agent`** | `str` (None) | User-agent string to identify your crawler. Used for robots.txt checking when enabled. |

```python
run_config = CrawlerRunConfig(
    check_robots_txt=True,   # Enable robots.txt compliance
    user_agent="MyBot/1.0"   # Identify your crawler
)
```
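When `check_robots_txt=True`, a disallowed URL comes back unsuccessful with a 403 status, so it helps to distinguish "blocked by robots.txt" from other failures. An illustrative triage helper, not part of crawl4ai:

```python
def classify_result(result) -> str:
    """Triage a CrawlResult-like object when robots.txt checking is enabled."""
    if getattr(result, "success", False):
        return "ok"
    if getattr(result, "status_code", None) == 403:
        # check_robots_txt=True surfaces robots.txt denials as 403 responses
        return "robots_blocked"
    return "error"
```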
## 3. Putting It All Together

- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl's **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).

```python
# Create a modified copy with the clone() method
stream_cfg = run_cfg.clone(
    stream=True,
    cache_mode=CacheMode.BYPASS
)
```
## Best Practices

1. **Choose the Right Strategy**
   - Use `LLMExtractionStrategy` for complex, unstructured content
   - Use `JsonCssExtractionStrategy` for well-structured HTML
   - Use `CosineStrategy` for content similarity and clustering

2. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
       ...
   )
   ```

3. **Handle Errors**
   ```python
   try:
       result = await crawler.arun(
           ...
       )
   except Exception as e:
       print(f"Extraction failed: {e}")
   ```

4. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,  # Enable logging
       ...
   )
   ```
**New file:** `docs/md_v2/assets/images/dispatcher.png` (binary image, 476 KiB; not shown)
@@ -7,6 +7,7 @@
 :root {
   --global-font-size: 16px;
+  --global-code-font-size: 16px;
   --global-line-height: 1.5em;
   --global-space: 10px;
   --font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
@@ -20,6 +21,7 @@
   --invert-font-color: #151515; /* Dark color for inverted elements */
   --primary-color: #1a95e0; /* Primary color can remain the same or be adjusted for better contrast */
   --secondary-color: #727578; /* Secondary color for less important text */
+  --secondary-dimmed-color: #8b857a; /* Dimmed secondary color */
   --error-color: #ff5555; /* Bright color for errors */
   --progress-bar-background: #444; /* Darker background for progress bar */
   --progress-bar-fill: #1a95e0; /* Bright color for progress bar fill */
@@ -37,8 +39,9 @@
   --secondary-color: #a3abba;
   --secondary-color: #d5cec0;
   --tertiary-color: #a3abba;
-  --primary-color: #09b5a5; /* Updated to the brand color */
+  --primary-dimmed-color: #09b5a5; /* Updated to the brand color */
   --primary-color: #50ffff; /* Updated to the brand color */
+  --accent-color: rgb(243, 128, 245);
   --error-color: #ff3c74;
   --progress-bar-background: #3f3f44;
   --progress-bar-fill: #09b5a5; /* Updated to the brand color */
@@ -80,10 +83,16 @@ pre, code {
   line-height: var(--global-line-height);
 }

-strong,
+strong {
+  /* color : var(--primary-dimmed-color); */
+  /* background-color: #50ffff17; */
+  text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color);
+}

 .highlight {
   /* background: url(//s2.svgbox.net/pen-brushes.svg?ic=brush-1&color=50ffff); */
-  background-color: #50ffff33;
+  background-color: #50ffff17;
 }

 .terminal-card > header {
@@ -157,4 +166,80 @@ ol li::before {
   counter-increment: item;
   /* float: left; */
   /* padding-right: 5px; */
 }
+
+/* 8 TERMINAL CSS */
+
+.terminal code {
+  font-size: var(--global-code-font-size);
+  background: var(--block-background-color);
+  /* color: var(--secondary-color); */
+  color: var(--primary-dimmed-color);
+}
+
+.terminal pre code {
+  background: var(--block-background-color);
+  color: var(--secondary-color);
+}
+
+.hljs-keyword, .hljs-selector-tag, .hljs-built_in, .hljs-name, .hljs-tag {
+  color: var(--accent-color);
+}
+
+.hljs-string {
+  color: var(--primary-dimmed-color);
+}
+
+.hljs-comment {
+  color: var(--secondary-dimmed-color);
+  font-style: italic;
+  font-size: 0.9em;
+}
+
+.hljs-number {
+  color: var(--primary-dimmed-color);
+}
+
+.terminal strong > code, .terminal h2 > code, .terminal h3 > code {
+  background-color: transparent;
+  /* color: var(--font-color); */
+  color: var(--primary-dimmed-color);
+  text-shadow: none;
+}
+
+blockquote {
+  background-color: var(--invert-font-color);
+  padding: 1em 2em;
+  border-left: 2px solid var(--primary-dimmed-color);
+}
+
+blockquote::after {
+  content: "💡";
+  white-space: pre;
+  position: absolute;
+  top: 1em;
+  left: 5px;
+  line-height: var(--global-line-height);
+  color: #9ca2ab;
+}
+
+pre {
+  display: block;
+  word-break: break-word;
+  word-wrap: break-word;
+}
+
+.terminal h1 {
+  font-size: 2em;
+}
+
+.terminal h1, .terminal h2, .terminal h3, .terminal h4, .terminal h5, .terminal h6 {
+  text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color), 0 0 0px var(--font-color);
+}
+
+/* Lower max height or width for these images */
+div.badges a {
+  /* no underline */
+  text-decoration: none !important;
+}
+
+div.badges a > img {
+  width: auto;
+}
 }
@@ -1,208 +0,0 @@ (file removed)

# Browser Configuration

Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.

## Browser Types

Choose from three browser engines:

```python
# Chromium (default)
async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")

# Firefox
async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")

# WebKit
async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Basic Configuration

Common browser settings:

```python
async with AsyncWebCrawler(
    headless=True,        # Run in headless mode (no GUI)
    verbose=True,         # Enable detailed logging
    sleep_on_close=False  # No delay when closing browser
) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Identity Management

Control how your crawler appears to websites:

```python
# Custom user agent
async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Custom headers
headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Screenshot Capabilities

Capture page screenshots with enhanced error handling:

```python
result = await crawler.arun(
    url="https://example.com",
    screenshot=True,         # Enable screenshot
    screenshot_wait_for=2.0  # Wait 2 seconds before capture
)

if result.screenshot:  # Base64-encoded image
    import base64
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```

## Timeouts and Waiting

Control page loading behavior:

```python
result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,              # Page load timeout (ms)
    delay_before_return_html=2.0,    # Wait before content capture
    wait_for="css:.dynamic-content"  # Wait for specific element
)
```

## JavaScript Execution

Execute custom JavaScript before crawling:

```python
# Single JavaScript command
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)

# Multiple commands
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();"
]
result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
)
```

## Proxy Configuration

Use proxies for enhanced access:

```python
# Simple proxy
async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
) as crawler:
    result = await crawler.arun(url="https://example.com")

# Proxy with authentication
proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```

## Anti-Detection Features

Enable stealth features to avoid bot detection:

```python
result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,       # Simulate human behavior
    override_navigator=True,  # Mask automation signals
    magic=True                # Enable all anti-detection features
)
```

## Handling Dynamic Content

Configure browser to handle dynamic content:

```python
# Wait for dynamic content
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True  # Process iframe content
)

# Handle lazy-loaded images
result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
)
```

## Comprehensive Example

Here's how to combine various browser configurations:

```python
async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser setup
        browser_type="chromium",
        headless=True,
        verbose=True,

        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},

        # Proxy setup
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content handling
            process_iframes=True,
            screenshot=True,

            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,

            # Anti-detection
            magic=True,
            simulate_user=True,

            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )

        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
```
@@ -1,135 +0,0 @@ (file removed)

### Content Selection

Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.

#### CSS Selectors

Extract specific content using a `CrawlerRunConfig` with CSS selectors:

```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(css_selector=".main-article")  # Target main article content
result = await crawler.arun(url="https://crawl4ai.com", config=config)

config = CrawlerRunConfig(css_selector="article h1, article .content")  # Target heading and content
result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Content Filtering

Control content inclusion or exclusion with `CrawlerRunConfig`:

```python
config = CrawlerRunConfig(
    word_count_threshold=10,                            # Minimum words per block
    excluded_tags=['form', 'header', 'footer', 'nav'],  # Excluded tags
    exclude_external_links=True,                        # Remove external links
    exclude_social_media_links=True,                    # Remove social media links
    exclude_external_images=True                        # Remove external images
)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Iframe Content

Process iframe content by enabling specific options in `CrawlerRunConfig`:

```python
config = CrawlerRunConfig(
    process_iframes=True,         # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
```

#### Structured Content Selection Using LLMs

Leverage LLMs for intelligent content extraction:

```python
import json

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract the main article title, key points, and conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
article = json.loads(result.extracted_content)
```

#### Pattern-Based Selection

Extract content matching repetitive patterns:

```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

strategy = JsonCssExtractionStrategy(schema)
config = CrawlerRunConfig(extraction_strategy=strategy)

result = await crawler.arun(url="https://crawl4ai.com", config=config)
articles = json.loads(result.extracted_content)
```

#### Comprehensive Example

Combine different selection methods using `CrawlerRunConfig`:

```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_article_content(url: str):
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    # Define configuration
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return json.loads(result.extracted_content)
```
@@ -1,83 +0,0 @@ (file removed)

# Content Filtering in Crawl4AI

This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.

## Relevance Content Filter

The `RelevanceContentFilter` is an abstract class providing a common interface for content filtering strategies. Specific algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
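Because every concrete strategy reduces to that one `filter_content` contract, a custom filter is mostly a matter of honoring it. The following is a stdlib-only sketch of the idea — a hypothetical keyword filter that keeps paragraph blocks mentioning a query term. It is illustrative rather than Crawl4AI's API: a real implementation would subclass `RelevanceContentFilter` and reuse the library's HTML parsing.

```python
from html.parser import HTMLParser
from typing import List


class _ParagraphExtractor(HTMLParser):
    """Collects the text content of each <p> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_p = False
        self._buf: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_data(self, data) -> None:
        if self._in_p and data.strip():
            self._buf.append(data.strip())

    def handle_endtag(self, tag) -> None:
        if tag == "p" and self._in_p:
            self._in_p = False
            text = " ".join(self._buf)
            if text:
                self.blocks.append(text)


def filter_content(html: str, keyword: str, min_words: int = 3) -> List[str]:
    """Keyword-based take on the filter_content contract: return only
    text blocks that are long enough and mention the keyword."""
    parser = _ParagraphExtractor()
    parser.feed(html)
    return [
        block for block in parser.blocks
        if len(block.split()) >= min_words and keyword.lower() in block.lower()
    ]


html = ("<article><p>Crawl4AI filters web content well.</p>"
        "<p>Unrelated footer text.</p></article>")
print(filter_content(html, "content"))  # → ['Crawl4AI filters web content well.']
```

Plugged into the real base class, the same logic would run inside `filter_content` while the crawler handles fetching and markdown generation.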
## Pruning Content Filter

The `PruningContentFilter` removes less relevant nodes based on metrics like text density, link density, and tag importance. Nodes that fall below a defined threshold are pruned, leaving only high-value content.

### Usage

```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        min_word_threshold=5,
        threshold_type='dynamic',
        threshold=0.45
    ),
    fit_markdown=True  # Activates markdown fitting
)

result = await crawler.arun(url="https://example.com", config=config)

if result.success:
    print(f"Cleaned Markdown:\n{result.fit_markdown}")
```

### Parameters

- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
- **`threshold_type`**: (Optional, default `'fixed'`) Controls how pruning thresholds are calculated:
  - `'fixed'`: Uses a constant threshold value for all nodes.
  - `'dynamic'`: Adjusts thresholds based on node properties (e.g., tag importance, text/link ratios).
- **`threshold`**: (Optional, default 0.48) Base threshold for pruning:
  - Fixed: Nodes scoring below this value are removed.
  - Dynamic: This value adjusts based on node characteristics.

### How It Works

The algorithm evaluates each node using:

- **Text density**: Ratio of text to overall content.
- **Link density**: Proportion of text within links.
- **Tag importance**: Weights based on HTML tag type (e.g., `<article>`, `<p>`, `<div>`).
- **Content quality**: Metrics like text length and structural importance.
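For intuition, these metrics can be folded into a single per-node score that is compared against the threshold. The tag weights and threshold below are invented for illustration and are not the values `PruningContentFilter` actually uses:

```python
# Illustrative pruning score: tag weight * text density * (1 - link density).
# TAG_WEIGHTS and the 0.45 cutoff are made up for this sketch; the real
# PruningContentFilter derives its metrics from the parsed DOM.
TAG_WEIGHTS = {"article": 1.5, "p": 1.2, "div": 0.8, "aside": 0.4}


def node_score(tag: str, text_len: int, html_len: int, link_text_len: int) -> float:
    text_density = text_len / html_len if html_len else 0.0       # text vs. raw markup
    link_density = link_text_len / text_len if text_len else 1.0  # share of linked text
    return TAG_WEIGHTS.get(tag, 1.0) * text_density * (1.0 - link_density)


def prune(nodes, threshold=0.45):
    """Keep only nodes whose score clears the threshold."""
    return [n for n in nodes if node_score(*n) >= threshold]


nodes = [
    ("article", 800, 1000, 40),  # dense article text, few links
    ("aside", 120, 600, 100),    # link-heavy, navigation-like sidebar
]
print(prune(nodes))  # → [('article', 800, 1000, 40)]
```

The dense article node scores well above the cutoff while the link-heavy sidebar scores near zero, which is exactly the separation the pruning filter relies on.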
## BM25 Algorithm

The `BM25ContentFilter` uses the BM25 algorithm to rank and extract text chunks based on relevance to a search query or page metadata.

### Usage

```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter

config = CrawlerRunConfig(
    content_filter=BM25ContentFilter(user_query="fruit nutrition health"),
    fit_markdown=True  # Activates markdown fitting
)

result = await crawler.arun(url="https://example.com", config=config)

if result.success:
    print(f"Filtered Content:\n{result.extracted_content}")
    print(f"\nFiltered Markdown:\n{result.fit_markdown}")
    print(f"\nFiltered HTML:\n{result.fit_html}")
else:
    print("Error:", result.error_message)
```

### Parameters

- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts page metadata (title, description, keywords) and uses it as the query.
- **`bm25_threshold`**: (Optional, default 1.0) Threshold controlling relevance:
  - Higher values return stricter, more relevant results.
  - Lower values apply more lenient filtering.
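To see why raising `bm25_threshold` tightens the results, here is a stdlib-only sketch of BM25 scoring over text chunks. This is the textbook formula, not Crawl4AI's implementation — tokenization and parameter handling in `BM25ContentFilter` may differ:

```python
import math
import re
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each text chunk against the query with the classic BM25 formula."""
    def tokenize(text):
        return re.findall(r"\w+", text.lower())

    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    scores = []
    for doc in docs:
        tf, dl, score = Counter(doc), len(doc), 0.0
        for term in tokenize(query):
            df = sum(1 for d in docs if term in d)           # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
            f = tf[term]                                      # term frequency
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return scores


chunks = [
    "Fruit like apples supports nutrition and health.",
    "The stadium hosted a football match last night.",
]
scores = bm25_scores("fruit nutrition health", chunks)
# The fruit/nutrition chunk outranks the football chunk; a threshold placed
# between the two scores keeps only the relevant chunk.
```

A stricter threshold simply drops more of the low-scoring chunks, which is the behavior `bm25_threshold` controls.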
@@ -1,102 +0,0 @@ (file removed)

# Output Formats

Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.

## Basic Formats

```python
result = await crawler.arun(url="https://example.com")

# Access different formats
raw_html = result.html                    # Original HTML
clean_html = result.cleaned_html          # Sanitized HTML
markdown_v2 = result.markdown_v2          # Detailed markdown generation results
fit_md = result.markdown_v2.fit_markdown  # Most relevant content in markdown
```

> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to `markdown` for new implementations.

## Raw HTML

Original, unmodified HTML from the webpage. Useful when you need to:

- Preserve the exact page structure.
- Process HTML with your own tools.
- Debug page issues.

```python
result = await crawler.arun(url="https://example.com")
print(result.html)  # Complete HTML including headers, scripts, etc.
```

## Cleaned HTML

Sanitized HTML with unnecessary elements removed. Automatically:

- Removes scripts and styles.
- Cleans up formatting.
- Preserves semantic structure.

```python
config = CrawlerRunConfig(
    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
    keep_data_attributes=False                   # Remove data-* attributes
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.cleaned_html)
```

## Standard Markdown

HTML converted to clean markdown format. This output is useful for:

- Content analysis.
- Documentation.
- Readability.

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"include_links": True}  # Include links in markdown
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.raw_markdown)  # Standard markdown with links
```

## Fit Markdown

Extract and convert only the most relevant content into markdown format. Best suited for:

- Article extraction.
- Focusing on the main content.
- Removing boilerplate.

To generate `fit_markdown`, use a content filter like `PruningContentFilter`:

```python
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.fit_markdown)  # Extracted main content in markdown
```

## Markdown with Citations

Generate markdown that includes citations for links. This format is ideal for:

- Creating structured documentation.
- Including references for extracted content.

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={"citations": True}  # Enable citations
    )
)
result = await crawler.arun(url="https://example.com", config=config)
print(result.markdown_v2.markdown_with_citations)
print(result.markdown_v2.references_markdown)  # Citations section
```
@@ -1,190 +0,0 @@
|
|||||||
# Page Interaction
|
|
||||||
|
|
||||||
Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, and managing page events.
|
|
||||||
|
|
||||||
## JavaScript Execution
|
|
||||||
|
|
||||||
### Basic Execution
|
|
||||||
|
|
||||||
```python
|
|
||||||
from crawl4ai.async_configs import CrawlerRunConfig
|
|
||||||
|
|
||||||
# Single JavaScript command
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
js_code="window.scrollTo(0, document.body.scrollHeight);"
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
|
|
||||||
# Multiple commands
|
|
||||||
js_commands = [
|
|
||||||
"window.scrollTo(0, document.body.scrollHeight);",
|
|
||||||
"document.querySelector('.load-more').click();",
|
|
||||||
"document.querySelector('#consent-button').click();"
|
|
||||||
]
|
|
||||||
config = CrawlerRunConfig(js_code=js_commands)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Wait Conditions
|
|
||||||
|
|
||||||
### CSS-Based Waiting
|
|
||||||
|
|
||||||
Wait for elements to appear:
|
|
||||||
|
|
||||||
```python
|
|
||||||
config = CrawlerRunConfig(wait_for="css:.dynamic-content") # Wait for element with class 'dynamic-content'
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
### JavaScript-Based Waiting
|
|
||||||
|
|
||||||
Wait for custom conditions:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Wait for number of elements
|
|
||||||
wait_condition = """() => {
|
|
||||||
return document.querySelectorAll('.item').length > 10;
|
|
||||||
}"""
|
|
||||||
|
|
||||||
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
|
|
||||||
# Wait for dynamic content to load
|
|
||||||
wait_for_content = """() => {
|
|
||||||
const content = document.querySelector('.content');
|
|
||||||
return content && content.innerText.length > 100;
|
|
||||||
}"""
|
|
||||||
|
|
||||||
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Handling Dynamic Content
|
|
||||||
|
|
||||||
### Load More Content
|
|
||||||
|
|
||||||
Handle infinite scroll or load more buttons:
|
|
||||||
|
|
||||||
```python
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
js_code=[
|
|
||||||
"window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
|
|
||||||
"const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();" # Click load more
|
|
||||||
],
|
|
||||||
wait_for="js:() => document.querySelectorAll('.item').length > previousCount" # Wait for new content
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Form Interaction
|
|
||||||
|
|
||||||
Handle forms and inputs:
|
|
||||||
|
|
||||||
```python
|
|
||||||
js_form_interaction = """
|
|
||||||
document.querySelector('#search').value = 'search term'; // Fill form fields
|
|
||||||
document.querySelector('form').submit(); // Submit form
|
|
||||||
"""
|
|
||||||
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
js_code=js_form_interaction,
|
|
||||||
wait_for="css:.results" # Wait for results to load
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Timing Control
|
|
||||||
|
|
||||||
### Delays and Timeouts
|
|
||||||
|
|
||||||
Control timing of interactions:
|
|
||||||
|
|
||||||
```python
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
page_timeout=60000, # Page load timeout (ms)
|
|
||||||
delay_before_return_html=2.0 # Wait before capturing content
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Complex Interactions Example
|
|
||||||
|
|
||||||
Here's an example of handling a dynamic page with multiple interactions:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
|
||||||
|
|
||||||
async def crawl_dynamic_content():
|
|
||||||
async with AsyncWebCrawler() as crawler:
|
|
||||||
# Initial page load
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
js_code="document.querySelector('.cookie-accept')?.click();", # Handle cookie consent
|
|
||||||
wait_for="css:.main-content"
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
|
|
||||||
# Load more content
|
|
||||||
session_id = "dynamic_session" # Keep session for multiple interactions
|
|
||||||
|
|
||||||
for page in range(3): # Load 3 pages of content
|
|
||||||
config = CrawlerRunConfig(
|
|
||||||
session_id=session_id,
|
|
||||||
js_code=[
|
|
||||||
"window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
|
|
||||||
"window.previousCount = document.querySelectorAll('.item').length;", # Store item count
|
|
||||||
"document.querySelector('.load-more')?.click();" # Click load more
|
|
||||||
],
|
|
||||||
wait_for="""() => {
|
|
||||||
const currentCount = document.querySelectorAll('.item').length;
|
|
||||||
return currentCount > window.previousCount;
|
|
||||||
}""",
|
|
||||||
js_only=(page > 0) # Execute JS without reloading page for subsequent interactions
|
|
||||||
)
|
|
||||||
result = await crawler.arun(url="https://example.com", config=config)
|
|
||||||
print(f"Page {page + 1} items:", len(result.cleaned_html))
|
|
||||||
|
|
||||||
# Clean up session
|
|
||||||
await crawler.crawler_strategy.kill_session(session_id)
|
|
||||||
```
|
|
||||||
|
|
||||||
## Using with Extraction Strategies
|
|
||||||
|
|
||||||
Combine page interaction with structured extraction:
|
|
||||||
|
|
||||||
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
from crawl4ai.async_configs import CrawlerRunConfig
from pydantic import BaseModel
from typing import List

# Pattern-based extraction after interaction
schema = {
    "name": "Dynamic Items",
    "baseSelector": ".item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"}
    ]
}

config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.item:nth-child(10)",  # Wait for 10 items
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
result = await crawler.arun(url="https://example.com", config=config)

# Or use an LLM to analyze dynamic content
class ContentAnalysis(BaseModel):
    topics: List[str]
    summary: str

config = CrawlerRunConfig(
    js_code="document.querySelector('.show-more').click();",
    wait_for="css:.full-content",
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ContentAnalysis.schema(),
        instruction="Analyze the full content"
    )
)
result = await crawler.arun(url="https://example.com", config=config)
```
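
`JsonCssExtractionStrategy` puts its output in `result.extracted_content` as a JSON string. A minimal sketch of the parsing step, using a hypothetical sample payload in place of a real crawl result:

```python
import json

# Hypothetical stand-in for result.extracted_content, which the
# CSS strategy returns as a JSON string; real values depend on the page.
extracted_content = (
    '[{"title": "Item one", "description": "First item"},'
    ' {"title": "Item two", "description": "Second item"}]'
)

items = json.loads(extracted_content)
for item in items:
    print(f"{item['title']}: {item['description']}")

print(f"Extracted {len(items)} items")  # → Extracted 2 items
```
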
# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟

---

## Getting Started 🛠️

Set up your environment with `BrowserConfig` and create an `AsyncWebCrawler` instance.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Add your crawling logic here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Basic Usage

Provide a URL and let Crawl4AI do the work!

```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Taking Screenshots 📸

Capture and save webpage screenshots with `CrawlerRunConfig`:

```python
import base64
from crawl4ai.async_configs import CacheMode

async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url=url,
        screenshot=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)

        if result.success and result.screenshot:
            screenshot_data = base64.b64decode(result.screenshot)
            with open(output_path, 'wb') as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
```
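
The screenshot arrives as base64-encoded text, which is why the helper decodes before writing. A stdlib-only sketch of that round trip, using a stand-in byte string since no real screenshot is available here:

```python
import base64
import os
import tempfile

# Stand-in for result.screenshot: screenshots arrive as base64 text
png_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."
screenshot_b64 = base64.b64encode(png_bytes).decode("ascii")

# The decode-and-save step used by the helper above
output_path = os.path.join(tempfile.mkdtemp(), "page.png")
with open(output_path, "wb") as f:
    f.write(base64.b64decode(screenshot_b64))

print(f"Saved {os.path.getsize(output_path)} bytes")  # round trip is lossless
```
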

---

### Browser Selection 🌐

Choose from multiple browser engines using `BrowserConfig`:

```python
from crawl4ai.async_configs import BrowserConfig

# Use Firefox
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
async with AsyncWebCrawler(config=firefox_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use WebKit
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
async with AsyncWebCrawler(config=webkit_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use Chromium (default)
chromium_config = BrowserConfig(verbose=True, headless=True)
async with AsyncWebCrawler(config=chromium_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
```

---

### User Simulation 🎭

Simulate real user behavior to bypass detection:

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

browser_config = BrowserConfig(verbose=True, headless=True)
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    cache_mode=CacheMode.BYPASS,
    simulate_user=True,       # Random mouse movements and clicks
    override_navigator=True   # Makes the browser appear like a real user
)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```

---

### Understanding Parameters 🧠

Explore caching and forcing fresh crawls:

```python
async def main():
    browser_config = BrowserConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl (uses cache)
        result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force fresh crawl
        result2 = await crawler.arun(
            config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
        )
        print(f"Second crawl result: {result2.markdown[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Adding a Chunking Strategy 🧩

Split content into chunks using `RegexChunking`:

```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
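
Under the hood, regex-based chunking amounts to splitting text on a pattern. A stdlib sketch of the core idea (not the library's actual implementation):

```python
import re

def regex_chunk(text, patterns):
    """Split text on each pattern in turn, dropping empty chunks."""
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks
                  for piece in re.split(pattern, chunk) if piece.strip()]
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = regex_chunk(doc, ["\n\n"])
print(len(chunks))  # → 3
```
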

---

### Advanced Features and Configurations

For advanced examples (LLM strategies, knowledge graphs, pagination handling), ensure all code aligns with the `BrowserConfig` and `CrawlerRunConfig` pattern shown above.
**Benefits for Developers and Users**

1. **Fine-Grained Control**: Instead of predefining all logic upfront, you can dynamically guide the crawler in response to actual data and conditions encountered mid-crawl.
2. **Real-Time Insights**: Monitor progress, errors, or network bottlenecks as they happen, without waiting for the entire crawl to finish.
3. **Enhanced Collaboration**: Different team members or automated systems can watch the same crawl events and provide input, making the crawling process more adaptive and intelligent.

**Next Steps**
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.

## Categories

- [Technical Deep Dives](/blog/technical) - Coming soon
- [Tutorials & Guides](/blog/tutorials) - Coming soon
- [Community Updates](/blog/community) - Coming soon

## Stay Updated

- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
### 🔠 **Use Cases You’ll Love**

1. **Authenticated Crawls**: Login once, export your storage state, and reuse it across multiple requests without the headache.
2. **Long-page Screenshots**: Perfect for blogs, e-commerce pages, or any endless-scroll website.
3. **PDF Export**: Create professional-looking page PDFs in seconds.

---
# Crawl4AI 0.4.3: Major Performance Boost & LLM Integration

We're excited to announce Crawl4AI 0.4.3, focusing on three key areas: Speed & Efficiency, LLM Integration, and Core Platform Improvements. This release significantly improves crawling performance while adding powerful new LLM-powered features.

## ⚡ Speed & Efficiency Improvements

### 1. Memory-Adaptive Dispatcher System

The new dispatcher system provides intelligent resource management and real-time monitoring:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, CrawlerMonitor

async def main():
    urls = ["https://example1.com", "https://example2.com"] * 50

    # Configure memory-aware dispatch
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Auto-throttle at 80% memory
        check_interval=0.5,             # Check every 0.5 seconds
        max_session_permit=20,          # Max concurrent sessions
        monitor=CrawlerMonitor(         # Real-time monitoring
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await dispatcher.run_urls(
            urls=urls,
            crawler=crawler,
            config=CrawlerRunConfig()
        )
```
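
The idea behind memory-adaptive dispatch can be sketched in a few lines: sample a memory reading before admitting new work, and hold tasks while usage sits above the threshold. The memory source below is a stub for illustration; the real dispatcher samples actual system memory:

```python
import asyncio

class MemoryAdaptiveThrottle:
    """Toy dispatcher: admits tasks only while memory is below a threshold."""

    def __init__(self, get_memory_percent, threshold=80.0, check_interval=0.01):
        self.get_memory_percent = get_memory_percent  # injectable for testing
        self.threshold = threshold
        self.check_interval = check_interval

    async def acquire(self):
        # Wait politely until memory drops below the threshold
        while self.get_memory_percent() >= self.threshold:
            await asyncio.sleep(self.check_interval)

async def demo():
    readings = iter([95.0, 92.0, 70.0])  # simulated memory pressure easing off
    throttle = MemoryAdaptiveThrottle(lambda: next(readings))
    await throttle.acquire()  # returns once a reading is below 80%
    return "admitted"

print(asyncio.run(demo()))  # → admitted
```
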

### 2. Streaming Support

Process crawled URLs in real-time instead of waiting for all results:

```python
config = CrawlerRunConfig(stream=True)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun_many(urls, config=config):
        print(f"Got result for {result.url}")
        # Process each result immediately
```
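
Streaming results as they finish, instead of gathering everything first, follows the same pattern as `asyncio.as_completed` over a batch of tasks. A stdlib sketch of the concept:

```python
import asyncio
import random

async def crawl(url):
    await asyncio.sleep(random.uniform(0, 0.05))  # simulated fetch latency
    return f"result for {url}"

async def main():
    urls = [f"https://example.com/{i}" for i in range(5)]
    tasks = [asyncio.create_task(crawl(u)) for u in urls]
    results = []
    # Consume each result as soon as it completes, not in submission order
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return results

results = asyncio.run(main())
print(len(results))  # → 5
```
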

### 3. LXML-Based Scraping

New LXML scraping strategy offering up to 20x faster parsing:

```python
config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy(),
    cache_mode=CacheMode.ENABLED
)
```

## 🤖 LLM Integration

### 1. LLM-Powered Markdown Generation

Smart content filtering and organization using LLMs:

```python
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=LLMContentFilter(
            provider="openai/gpt-4o",
            instruction="Extract technical documentation and code examples"
        )
    )
)
```

### 2. Automatic Schema Generation

Generate extraction schemas instantly using LLMs instead of manual CSS/XPath writing:

```python
schema = JsonCssExtractionStrategy.generate_schema(
    html_content,
    schema_type="CSS",
    query="Extract product name, price, and description"
)
```
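
The generated schema is an ordinary dict in the same shape as the hand-written schemas shown earlier in the docs. A hypothetical example of what the call above might return for a product listing (the actual output depends on the page and the LLM):

```python
# Hypothetical generate_schema() output for a product page
schema = {
    "name": "Products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "name", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"},
    ],
}

# Once generated, it can be passed straight to JsonCssExtractionStrategy(schema)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # → ['name', 'price', 'description']
```
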

## 🔧 Core Improvements

### 1. Proxy Support & Rotation

Integrated proxy support with automatic rotation and verification:

```python
config = CrawlerRunConfig(
    proxy_config={
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```

### 2. Robots.txt Compliance

Built-in robots.txt support with SQLite caching:

```python
config = CrawlerRunConfig(check_robots_txt=True)
result = await crawler.arun(url, config=config)
if result.status_code == 403:
    print("Access blocked by robots.txt")
```
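
The underlying check is standard robots.txt rule evaluation, which Python ships in the standard library. A minimal sketch of the core logic (the library layers SQLite caching and the 403 result on top of a check like this):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (no network; illustrative rules)
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Allow: /
""".splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # → True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data"))  # → False
```
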

### 3. URL Redirection Tracking

Track final URLs after redirects:

```python
result = await crawler.arun(url)
print(f"Initial URL: {url}")
print(f"Final URL: {result.redirected_url}")
```

## Performance Impact

- Memory usage reduced by up to 40% with adaptive dispatcher
- Parsing speed increased up to 20x with LXML strategy
- Streaming reduces memory footprint for large crawls by ~60%

## Getting Started

```bash
pip install -U crawl4ai
```

For complete examples, check our [demo repository](https://github.com/unclecode/crawl4ai/examples).

## Stay Connected

- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode)
- Join our [Discord](https://discord.gg/crawl4ai)

Happy crawling! 🕷️