Compare commits

..

1 Commits

Author SHA1 Message Date
UncleCode
406702a77f Delete .do/deploy.template.yaml 2024-12-31 17:33:39 +08:00
207 changed files with 18188 additions and 30099 deletions

19
.do/app.yaml Normal file
View File

@@ -0,0 +1,19 @@
alerts:
- rule: DEPLOYMENT_FAILED
- rule: DOMAIN_FAILED
name: crawl4ai
region: nyc
services:
- dockerfile_path: Dockerfile
github:
branch: 0.3.74
deploy_on_push: true
repo: unclecode/crawl4ai
health_check:
http_path: /health
http_port: 11235
instance_count: 1
instance_size_slug: professional-xs
name: web
routes:
- path: /

12
.gitattributes vendored
View File

@@ -1,12 +0,0 @@
# Documentation
*.html linguist-documentation
docs/* linguist-documentation
docs/examples/* linguist-documentation
docs/md_v2/* linguist-documentation
# Explicitly mark Python as the main language
*.py linguist-detectable=true
*.py linguist-language=Python
# Exclude HTML from language statistics
*.html linguist-detectable=false

View File

@@ -1,59 +0,0 @@
title: "[Feature Request]: "
labels: ["⚙️ New"]
body:
- type: markdown
attributes:
value: |
Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check if already exists in
this discussions category to avoid duplicates. 😊
- type: textarea
id: needs_to_be_done
attributes:
label: What needs to be done?
description: Please describe the feature or functionality you'd like to see.
placeholder: "e.g., Return alt text along with images scraped from a webpages in Result"
validations:
required: true
- type: textarea
id: problem_to_solve
attributes:
label: What problem does this solve?
description: Explain the pain point or issue this feature will help address.
placeholder: "e.g., Bypass Captchas added by cloudflare"
validations:
required: true
- type: textarea
id: target_users
attributes:
label: Target users/beneficiaries
description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
placeholder: "e.g., Marketing teams, developers"
validations:
required: false
- type: textarea
id: current_workarounds
attributes:
label: Current alternatives/workarounds
description: Are there any existing solutions or workarounds? How does this feature improve upon them?
placeholder: "e.g., Users manually select the css classes mapped to data fields to extract them"
validations:
required: false
- type: markdown
attributes:
value: |
### 💡 Implementation Ideas
- type: textarea
id: proposed_approach
attributes:
label: Proposed approach
description: Share any ideas you have for how this feature could be implemented. Point out any challenges your foresee
and the success metrics for this feature
placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
validations:
required: false

View File

@@ -1,127 +0,0 @@
name: Bug Report
description: Report a bug with the Crawl4AI.
title: "[Bug]: "
labels: ["🐞 Bug","🩺 Needs Triage"]
body:
- type: input
id: crawl4ai_version
attributes:
label: crawl4ai version
description: Specify the version of crawl4ai you are using.
placeholder: "e.g., 2.0.0"
validations:
required: true
- type: textarea
id: expected_behavior
attributes:
label: Expected Behavior
description: Describe what you expected to happen.
placeholder: "Provide a detailed explanation of the expected outcome."
validations:
required: true
- type: textarea
id: current_behavior
attributes:
label: Current Behavior
description: Describe what is happening instead of the expected behavior.
placeholder: "Describe the actual result or issue you encountered."
validations:
required: true
- type: dropdown
id: reproducible
attributes:
label: Is this reproducible?
description: Indicate whether this bug can be reproduced consistently.
options:
- "Yes"
- "No"
validations:
required: true
- type: textarea
id: inputs
attributes:
label: Inputs Causing the Bug
description: Provide details about the inputs causing the issue.
placeholder: |
- URL(s):
- Settings used:
- Input data (if applicable):
render: bash
- type: textarea
id: steps_to_reproduce
attributes:
label: Steps to Reproduce
description: Provide step-by-step instructions to reproduce the issue.
placeholder: |
1. Go to...
2. Click on...
3. Observe the issue...
render: bash
- type: textarea
id: code_snippets
attributes:
label: Code snippets
description: Provide code snippets(if any). Add comments as necessary
placeholder: print("Hello world")
render: python
# Header Section with Title
- type: markdown
attributes:
value: |
## Supporting Information
Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem
- type: input
id: os
attributes:
label: OS
description: Please provide the operating system & distro where the issue occurs.
placeholder: "e.g., Windows, macOS, Linux"
validations:
required: true
- type: input
id: python_version
attributes:
label: Python version
description: Specify the Python version being used.
placeholder: "e.g., 3.8.5"
validations:
required: true
# Browser Field
- type: input
id: browser
attributes:
label: Browser
description: Provide the name of the browser you are using.
placeholder: "e.g., Chrome, Firefox, Safari"
validations:
required: false
# Browser Version Field
- type: input
id: browser_version
attributes:
label: Browser version
description: Provide the version of the browser you are using.
placeholder: "e.g., 91.0.4472.124"
validations:
required: false
# Error Logs Field (Text Area)
- type: textarea
id: error_logs
attributes:
label: Error logs & Screenshots (if applicable)
description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
placeholder: "Paste error logs here and attach your screenshots"
validations:
required: false

View File

@@ -1,8 +0,0 @@
blank_issues_enabled: false
contact_links:
- name: Feature Requests
url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
about: "Suggest new features or enhancements for Crawl4AI"
- name: Forums - Q&A
url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
about: "Ask questions or engage in general discussions about Crawl4AI"

View File

@@ -1,19 +0,0 @@
## Summary
Please include a summary of the change and/or which issues are fixed.
eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
## List of files changed and why
eg: quickstart.py - To update the example as per new changes
## How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
## Checklist:
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

24
.gitignore vendored
View File

@@ -208,7 +208,7 @@ git_issues.md
.next/
.tests/
# .issues/
.issues/
.docs/
.issues/
.gitboss/
@@ -217,24 +217,4 @@ protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt
.local
.scripts
tree.md
tree.md
.scripts
.local
.do
/plans
plans/
# Codeium
.codeiumignore
todo/
# windsurf rules
.windsurfrules
# windsurf rules
.windsurfrules
combined_output.txt

View File

@@ -1,225 +1,19 @@
# Changelog
All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
### Changed
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
## Version 0.4.3b2 (2025-01-21)
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
### Features
- **Robots.txt Compliance:**
- Added robots.txt compliance support with efficient SQLite-based caching.
- New `check_robots_txt` parameter in `CrawlerRunConfig` to enable robots.txt checking before crawling a URL.
- Automated robots.txt checking is now integrated into `AsyncWebCrawler` with 403 status codes for blocked URLs.
- **Proxy Configuration:**
- Added proxy configuration support to `CrawlerRunConfig`, allowing dynamic proxy settings per crawl request.
- Updated documentation with examples for using proxy configuration in crawl operations.
- **LLM-Powered Schema Generation:**
- Introduced a new utility for automatic CSS and XPath schema generation using OpenAI or Ollama models.
- Added comprehensive documentation and examples for schema generation.
- New prompt templates optimized for HTML schema analysis.
- **URL Redirection Tracking:**
- Added URL redirection tracking to capture the final URL after any redirects.
- The final URL is now available in the `redirected_url` field of the `AsyncCrawlResponse` object.
- **Enhanced Streamlined Documentation:**
- Refactored and improved the documentation structure for clarity and ease of use.
- Added detailed explanations of new features and updated examples.
- **Improved Browser Context Management:**
- Enhanced the management of browser contexts and added shared data support.
- Introduced the `shared_data` parameter in `CrawlerRunConfig` to pass data between hooks.
- **Memory Dispatcher System:**
- Migrated to a memory dispatcher system with enhanced monitoring capabilities.
- Introduced `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher` for improved resource management.
- Added `RateLimiter` for rate limiting support.
- New `CrawlerMonitor` for real-time monitoring of crawler operations.
- **Streaming Support:**
- Added streaming support for processing crawled URLs as they are processed.
- Enabled streaming mode with the `stream` parameter in `CrawlerRunConfig`.
- **Content Scraping Strategy:**
- Introduced a new `LXMLWebScrapingStrategy` for faster content scraping.
- Added support for selecting the scraping strategy via the `scraping_strategy` parameter in `CrawlerRunConfig`.
### Bug Fixes
- **Browser Path Management:**
- Improved browser path management for consistent behavior across different environments.
- **Memory Threshold:**
- Adjusted the default memory threshold to improve resource utilization.
- **Pydantic Model Fields:**
- Made several model fields optional with default values to improve flexibility.
### Refactor
- **Documentation Structure:**
- Reorganized documentation structure to improve navigation and readability.
- Updated styles and added new sections for advanced features.
- **Scraping Mode:**
- Replaced the `ScrapingMode` enum with a strategy pattern for more flexible content scraping.
- **Version Update:**
- Updated the version to `0.4.248`.
- **Code Cleanup:**
- Removed unused files and improved type hints.
- Applied Ruff corrections for code quality.
- **Updated dependencies:**
- Updated dependencies to their latest versions to ensure compatibility and security.
- **Ignored certain patterns and directories:**
- Updated `.gitignore` and `.codeiumignore` to ignore additional patterns and directories, streamlining the development environment.
- **Simplified Personal Story in README:**
- Streamlined the personal story and project vision in the `README.md` for clarity.
- **Removed Deprecated Files:**
- Deleted several deprecated files and examples that are no longer relevant.
---
**Previous Releases:**
### 0.4.24x (2024-12-31)
- **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling.
- **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies.
- **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction.
- **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types.
- **Performance Boost**: Optimized caching, parallel processing, and memory management.
- **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking.
- **Security Features**: Improved input validation and safe expression evaluation.
### 0.4.247 (2025-01-06)
#### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
#### Changed
- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
- **Documentation Update**:
- Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
#### Removed
- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
#### Fixed
- **Page Closing to Prevent Memory Leaks**:
- **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
- **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
- **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
- **Code**:
```python
finally:
# If no session_id is given we should close the page
if not config.session_id:
await page.close()
```
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
## [0.4.267] - 2025 - 01 - 06
### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
## [0.4.24] - 2024-12-31
### Added
- **Browser and SSL Handling**
- SSL certificate validation options in extraction strategies
- Custom certificate paths support
- Configurable certificate validation skipping
- Enhanced response status code handling with retry logic
- **Content Processing**
- New content filtering system with regex support
- Advanced chunking strategies for large content
- Memory-efficient parallel processing
- Configurable chunk size optimization
- **JSON Extraction**
- Complex JSONPath expression support
- JSON-CSS and Microdata extraction
- RDFa parsing capabilities
- Advanced data transformation pipeline
- **Field Types**
- New field types: `computed`, `conditional`, `aggregate`, `template`
- Field inheritance system
- Reusable field definitions
- Custom validation rules
### Changed
- **Performance**
- Optimized selector compilation with caching
- Improved HTML parsing efficiency
- Enhanced memory management for large documents
- Batch processing optimizations
- **Error Handling**
- More detailed error messages and categorization
- Enhanced debugging capabilities
- Improved performance metrics tracking
- Better error recovery mechanisms
### Deprecated
- Old field computation method using `eval`
- Direct browser manipulation without proper SSL handling
- Simple text-based content filtering
### Removed
- Legacy extraction patterns without proper error handling
- Unsafe eval-based field computation
- Direct DOM manipulation without sanitization
### Fixed
- Memory leaks in large document processing
- SSL certificate validation issues
- Incorrect handling of nested JSON structures
- Performance bottlenecks in parallel processing
### Security
- Improved input validation and sanitization
- Safe expression evaluation system
- Enhanced resource protection
- Rate limiting implementation
## [0.4.1] - 2024-12-08
## [0.4.1] December 8, 2024
### **File: `crawl4ai/async_crawler_strategy.py`**
#### **New Parameters and Attributes Added**
- **`text_mode` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
- **`text_only` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_mode` mode (default values: 800x600 for `text_mode`, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_mode` mode.
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
#### **Browser Context Adjustments**
- Added **`viewport` adjustments**: Dynamically computed based on `text_mode` or custom configuration.
- Enhanced support for `light_mode` and `text_mode` by adding specific browser arguments to reduce resource consumption.
- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
#### **Dynamic Content Handling**
- **Full Page Scan Feature**:
@@ -915,7 +709,7 @@ This commit introduces several key enhancements, including improved error handli
- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
- Enhanced flexibility in `CosineStrategy`:
- Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
- Updated `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` for better JSON-based extraction.
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
### Fixed
- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
@@ -1186,6 +980,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
- Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
## [v0.2.4] - 2024-06-17
## [0.2.4] - 2024-06-17
### Fixed
- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs

View File

@@ -1,131 +0,0 @@
# Crawl4AI Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of
actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the
community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].
[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations

View File

@@ -6,7 +6,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
- [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
- [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
- [Aravind Karnam](https://github.com/aravindkarnam) - Head of Community and Product
- [Aravind Karnam](https://github.com/aravindkarnam) - Developer
## Community Contributors

451
README.md
View File

@@ -1,41 +1,19 @@
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
<div align="center">
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
![PyPI - Downloads](https://img.shields.io/pypi/dm/Crawl4AI)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
</div>
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.4.3bx](#-recent-updates)
[✨ Check out latest update v0.4.2](#-recent-updates)
🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didnt meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open-source for two reasons. First, its my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>
🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
## 🧐 Why Crawl4AI?
@@ -50,35 +28,20 @@ Thank you to everyone who has supported this project, used it, and shared feedba
1. Install Crawl4AI:
```bash
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
```
If you encounter any browser-related issues, you can install them manually:
```bash
python -m playwright install --with-deps chromium
pip install crawl4ai
crawl4ai-setup # Setup the browser
```
2. Run a simple web crawl:
```python
import asyncio
from crawl4ai import *
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
)
print(result.markdown)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.nbcnews.com/business")
# Soone will be change to result.markdown
print(result.markdown_v2.raw_markdown)
if __name__ == "__main__":
asyncio.run(main())
@@ -164,7 +127,7 @@ if __name__ == "__main__":
✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
## Installation 🛠️
@@ -237,26 +200,193 @@ pip install -e ".[all]" # Install all optional features
</details>
<details>
<summary>🐳 <strong>Docker Deployment</strong></summary>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
Deploy your own instance of Crawl4AI with one click:
### Current Docker Support
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
- ⚠️ Note: This setup will be replaced in the next major release
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
### What's Coming Next?
</details>
Our new Docker implementation will bring:
- Improved performance and resource efficiency
- Streamlined deployment process
- Better integration with Crawl4AI features
- Enhanced scalability options
<details>
<summary>🐳 <strong>Using Docker</strong></summary>
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
@@ -280,7 +410,7 @@ task_id = response.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
```
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
</details>
@@ -294,29 +424,24 @@ You can check the project structure in the directory [https://github.com/uncleco
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
browser_config = BrowserConfig(
async with AsyncWebCrawler(
headless=True,
verbose=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
async with AsyncWebCrawler(config=browser_config) as crawler:
) as crawler:
result = await crawler.arun(
url="https://docs.micronaut.io/4.7.6/guide/",
config=run_config
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
print(len(result.markdown))
print(len(result.fit_markdown))
@@ -333,7 +458,7 @@ if __name__ == "__main__":
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json
@@ -368,26 +493,36 @@ async def main():
"type": "attribute",
"attribute": "src"
}
}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
async with AsyncWebCrawler(
headless=False,
verbose=True
)
run_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
for(let tab of tabs) {
// scroll to the tab
tab.scrollIntoView();
tab.click();
// Wait for content to load and animations to complete
await new Promise(r => setTimeout(r, 500));
}
})();
"""
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
config=run_config
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS
)
companies = json.loads(result.extracted_content)
@@ -407,7 +542,7 @@ if __name__ == "__main__":
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
@@ -417,26 +552,21 @@ class OpenAIModelFee(BaseModel):
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def main():
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url='https://openai.com/api/pricing/',
config=run_config
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
# provider="ollama/qwen2", api_token="no-token",
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.schema(),
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
),
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
@@ -447,35 +577,43 @@ if __name__ == "__main__":
</details>
<details>
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
```python
import os, sys
from pathlib import Path
import asyncio, time
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import AsyncWebCrawler
async def test_news_crawl():
# Create a persistent user data directory
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
browser_config = BrowserConfig(
async with AsyncWebCrawler(
verbose=True,
headless=True,
user_data_dir=user_data_dir,
use_persistent_context=True,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
headers={
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
) as crawler:
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
result = await crawler.arun(
url,
config=run_config,
cache_mode=CacheMode.BYPASS,
magic=True,
)
@@ -485,70 +623,28 @@ async def test_news_crawl():
</details>
## ✨ Recent Updates
- **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
- **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
- **🤖 Enhanced LLM Integration**:
- **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
- **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
- **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
- **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
- **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
- **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
## ✨ Recent Updates
Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
## Version Numbering in Crawl4AI
Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
### Version Numbers Explained
Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
#### Pre-release Versions
We use different suffixes to indicate development stages:
- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version
#### Installation
- Regular installation (stable version):
```bash
pip install -U crawl4ai
```
- Install pre-release versions:
```bash
pip install crawl4ai --pre
```
- Install specific version:
```bash
pip install crawl4ai==0.4.3b1
```
#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features
For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
## 📖 Documentation & Roadmap
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
@@ -572,7 +668,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
## 📄 License
@@ -613,6 +709,9 @@ We envision a future where AI is powered by real human knowledge, ensuring data
For more details, see our [full mission statement](./MISSION.md).
</details>
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)

4214
a.md Normal file

File diff suppressed because it is too large Load Diff

View File

@@ -2,88 +2,45 @@
from .async_webcrawler import AsyncWebCrawler, CacheMode
from .async_configs import BrowserConfig, CrawlerRunConfig
from .content_scraping_strategy import (
ContentScrapingStrategy,
WebScrapingStrategy,
LXMLWebScrapingStrategy,
)
from .extraction_strategy import (
ExtractionStrategy,
LLMExtractionStrategy,
CosineStrategy,
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy
)
from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter, RelevantContentFilter
from .models import CrawlResult, MarkdownGenerationResult
from .async_dispatcher import (
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
BaseDispatcher
)
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
from .models import CrawlResult
from .__version__ import __version__
__all__ = [
"AsyncWebCrawler",
"CrawlResult",
"CacheMode",
"ContentScrapingStrategy",
"WebScrapingStrategy",
"LXMLWebScrapingStrategy",
"BrowserConfig",
"CrawlerRunConfig",
"ExtractionStrategy",
"LLMExtractionStrategy",
"CosineStrategy",
"JsonCssExtractionStrategy",
"JsonXPathExtractionStrategy",
"ChunkingStrategy",
"RegexChunking",
"DefaultMarkdownGenerator",
"RelevantContentFilter",
"PruningContentFilter",
"BM25ContentFilter",
"LLMContentFilter",
"BaseDispatcher",
"MemoryAdaptiveDispatcher",
"SemaphoreDispatcher",
"RateLimiter",
"CrawlerMonitor",
"DisplayMode",
"MarkdownGenerationResult",
'BrowserConfig',
'CrawlerRunConfig',
'ExtractionStrategy',
'LLMExtractionStrategy',
'CosineStrategy',
'JsonCssExtractionStrategy',
'ChunkingStrategy',
'RegexChunking',
'DefaultMarkdownGenerator',
'PruningContentFilter',
'BM25ContentFilter',
]
def is_sync_version_installed():
try:
import selenium
return True
except ImportError:
return False
if is_sync_version_installed():
try:
from .web_crawler import WebCrawler
__all__.append("WebCrawler")
except ImportError:
print(
"Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
)
import warnings
print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
else:
WebCrawler = None
# import warnings
# print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
import warnings
from pydantic import warnings as pydantic_warnings
# Disable all Pydantic warnings
warnings.filterwarnings("ignore", module="pydantic")
# pydantic_warnings.filter_warnings()
# print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.4.3b3"
__version__ = "0.4.22"

View File

@@ -1,22 +1,13 @@
from .config import (
MIN_WORD_THRESHOLD,
MIN_WORD_THRESHOLD,
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
SCREENSHOT_HEIGHT_TRESHOLD,
PAGE_TIMEOUT,
IMAGE_SCORE_THRESHOLD,
SOCIAL_MEDIA_DOMAINS,
PAGE_TIMEOUT
)
from .user_agent_generator import UserAgentGenerator, UAGen, ValidUAGenerator, OnlineUAGenerator
from .user_agent_generator import UserAgentGenerator
from .extraction_strategy import ExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .deep_crawl import DeepCrawlStrategy
from .chunking_strategy import ChunkingStrategy
from .markdown_generation_strategy import MarkdownGenerationStrategy
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, LLMContentFilter, PruningContentFilter
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
from typing import Optional, Union, List
from .cache_context import CacheMode
class BrowserConfig:
"""
@@ -33,22 +24,18 @@ class BrowserConfig:
Default: True.
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
temporary directory may be used. Default: None.
chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chromium".
channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
is "chromium". Default: "chromium".
proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
is "chromium". Default: "chrome".
proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
Default: None.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
viewport_width (int): Default viewport width for pages. Default: 1080.
viewport_height (int): Default viewport height for pages. Default: 600.
viewport_width (int): Default viewport width for pages. Default: 1920.
viewport_height (int): Default viewport height for pages. Default: 1080.
verbose (bool): Enable verbose logging.
Default: True.
accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
@@ -70,7 +57,7 @@ class BrowserConfig:
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
text_mode (bool): If True, disables images and other rich content for potentially faster load times.
text_only (bool): If True, disables images and other rich content for potentially faster load times.
Default: False.
light_mode (bool): Disables certain background features for performance gains. Default: False.
extra_args (list): Additional command-line arguments passed to the browser.
@@ -82,18 +69,16 @@ class BrowserConfig:
browser_type: str = "chromium",
headless: bool = True,
use_managed_browser: bool = False,
cdp_url: str = None,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chromium",
channel: str = "chromium",
chrome_channel: str = "chrome",
proxy: str = None,
proxy_config: dict = None,
viewport_width: int = 1080,
viewport_height: int = 600,
viewport_width: int = 1920,
viewport_height: int = 1080,
accept_downloads: bool = False,
downloads_path: str = None,
storage_state : Union[str, dict, None]=None,
storage_state=None,
ignore_https_errors: bool = True,
java_script_enabled: bool = True,
sleep_on_close: bool = False,
@@ -101,30 +86,28 @@ class BrowserConfig:
cookies: list = None,
headers: dict = None,
user_agent: str = (
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
# "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
),
user_agent_mode: str = "",
user_agent_generator_config: dict = {},
text_mode: bool = False,
user_agent_mode: str = None,
user_agent_generator_config: dict = None,
text_only: bool = False,
light_mode: bool = False,
extra_args: list = None,
debugging_port: int = 9222,
host: str = "localhost",
):
self.browser_type = browser_type
self.headless = headless
self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium"
self.channel = channel or self.browser_type or "chromium"
if self.browser_type in ["firefox", "webkit"]:
self.channel = ""
self.chrome_channel = ""
if self.browser_type == "chromium":
self.chrome_channel = "chrome"
elif self.browser_type == "firefox":
self.chrome_channel = "firefox"
elif self.browser_type == "webkit":
self.chrome_channel = "webkit"
else:
self.chrome_channel = chrome_channel or "chrome"
self.proxy = proxy
self.proxy_config = proxy_config
self.viewport_width = viewport_width
@@ -139,22 +122,18 @@ class BrowserConfig:
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
self.text_mode = text_mode
self.text_only = text_only
self.light_mode = light_mode
self.extra_args = extra_args if extra_args is not None else []
self.sleep_on_close = sleep_on_close
self.verbose = verbose
self.debugging_port = debugging_port
fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode == "random":
self.user_agent = fa_user_agenr_generator.generate(
user_agenr_generator = UserAgentGenerator()
if self.user_agent_mode != "random":
self.user_agent = user_agenr_generator.generate(
**(self.user_agent_generator_config or {})
)
else:
pass
self.browser_hint = UAGen.generate_client_hints(self.user_agent)
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint)
# If persistent context is requested, ensure managed browser is enabled
@@ -167,15 +146,13 @@ class BrowserConfig:
browser_type=kwargs.get("browser_type", "chromium"),
headless=kwargs.get("headless", True),
use_managed_browser=kwargs.get("use_managed_browser", False),
cdp_url=kwargs.get("cdp_url"),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"),
channel=kwargs.get("channel", "chromium"),
chrome_channel=kwargs.get("chrome_channel", "chrome"),
proxy=kwargs.get("proxy"),
proxy_config=kwargs.get("proxy_config"),
viewport_width=kwargs.get("viewport_width", 1080),
viewport_height=kwargs.get("viewport_height", 600),
viewport_width=kwargs.get("viewport_width", 1920),
viewport_height=kwargs.get("viewport_height", 1080),
accept_downloads=kwargs.get("accept_downloads", False),
downloads_path=kwargs.get("downloads_path"),
storage_state=kwargs.get("storage_state"),
@@ -183,63 +160,17 @@ class BrowserConfig:
java_script_enabled=kwargs.get("java_script_enabled", True),
cookies=kwargs.get("cookies", []),
headers=kwargs.get("headers", {}),
user_agent=kwargs.get(
"user_agent",
user_agent=kwargs.get("user_agent",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config"),
text_mode=kwargs.get("text_mode", False),
text_only=kwargs.get("text_only", False),
light_mode=kwargs.get("light_mode", False),
extra_args=kwargs.get("extra_args", []),
extra_args=kwargs.get("extra_args", [])
)
def to_dict(self):
return {
"browser_type": self.browser_type,
"headless": self.headless,
"use_managed_browser": self.use_managed_browser,
"cdp_url": self.cdp_url,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
"channel": self.channel,
"proxy": self.proxy,
"proxy_config": self.proxy_config,
"viewport_width": self.viewport_width,
"viewport_height": self.viewport_height,
"accept_downloads": self.accept_downloads,
"downloads_path": self.downloads_path,
"storage_state": self.storage_state,
"ignore_https_errors": self.ignore_https_errors,
"java_script_enabled": self.java_script_enabled,
"cookies": self.cookies,
"headers": self.headers,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
"text_mode": self.text_mode,
"light_mode": self.light_mode,
"extra_args": self.extra_args,
"sleep_on_close": self.sleep_on_close,
"verbose": self.verbose,
"debugging_port": self.debugging_port,
}
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
BrowserConfig: A new instance with the specified updates
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return BrowserConfig.from_kwargs(config_dict)
class CrawlerRunConfig:
"""
@@ -251,45 +182,22 @@ class CrawlerRunConfig:
By using this class, you have a single place to understand and adjust the crawling options.
Attributes:
# Content Processing Parameters
word_count_threshold (int): Minimum word count threshold before processing content.
Default: MIN_WORD_THRESHOLD (typically 200).
extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
Default: None (NoExtractionStrategy is used if None).
chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
Default: RegexChunking().
markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
Default: None.
content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
Default: None.
only_text (bool): If True, attempt to extract text-only content where applicable.
Default: False.
css_selector (str or None): CSS selector to extract a specific portion of the page.
Default: None.
excluded_tags (list of str or None): List of HTML tags to exclude from processing.
Default: None.
excluded_selector (str or None): CSS selector to exclude from processing.
Default: None.
keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
Default: False.
remove_forms (bool): If True, remove all `<form>` elements from the HTML.
Default: False.
prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
Default: False.
parser_type (str): Type of parser to use for HTML parsing.
Default: "lxml".
scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
Default: WebScrapingStrategy.
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
If None, no additional proxy config. Default: None.
# Caching Parameters
cache_mode (CacheMode or None): Defines how caching is handled.
If None, defaults to CacheMode.ENABLED internally.
Default: None.
session_id (str or None): Optional session ID to persist the browser context and the created
page instance. If the ID already exists, the crawler does not
create a new page and uses the current page to preserve the state.
session_id (str or None): Optional session ID to persist the browser context and the created
page instance. If the ID already exists, the crawler does not
create a new page and uses the current page to preserve the state;
if not, it creates a new page and context then stores it in
memory with the given session ID.
bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
Default: False.
disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
@@ -298,34 +206,36 @@ class CrawlerRunConfig:
Default: False.
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
Default: False.
shared_data (dict or None): Shared data to be passed between hooks.
Default: None.
# Page Navigation and Timing Parameters
css_selector (str or None): CSS selector to extract a specific portion of the page.
Default: None.
screenshot (bool): Whether to take a screenshot after crawling.
Default: False.
pdf (bool): Whether to generate a PDF of the page.
Default: False.
verbose (bool): Enable verbose logging.
Default: True.
only_text (bool): If True, attempt to extract text-only content where applicable.
Default: False.
image_description_min_word_threshold (int): Minimum words for image description extraction.
Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
Default: False.
js_code (str or list of str or None): JavaScript code/snippets to run on the page.
Default: None.
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
Default: None.
js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
Default: False.
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
Default: "domcontentloaded".
page_timeout (int): Timeout in ms for page operations like navigation.
Default: 60000 (60 seconds).
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
Default: None.
wait_for_images (bool): If True, wait for images to load before extracting content.
Default: False.
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
Default: 0.1.
mean_delay (float): Mean base delay between requests when calling arun_many.
Default: 0.1.
max_range (float): Max random additional delay range for requests in arun_many.
Default: 0.3.
semaphore_count (int): Number of concurrent operations allowed.
Default: 5.
# Page Interaction Parameters
js_code (str or list of str or None): JavaScript code/snippets to run on the page.
Default: None.
js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
Default: False.
ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
Default: True.
wait_for_images (bool): If True, wait for images to load before extracting content.
Default: True.
adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
Default: False.
scan_full_page (bool): If True, scroll through the entire page to load all content.
Default: False.
scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
@@ -334,423 +244,163 @@ class CrawlerRunConfig:
Default: False.
remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
Default: False.
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
Default: 0.1.
log_console (bool): If True, log console messages from the page.
Default: False.
simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
Default: False.
override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
Default: False.
magic (bool): If True, attempts automatic handling of overlays/popups.
Default: False.
adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
Default: False.
# Media Handling Parameters
screenshot (bool): Whether to take a screenshot after crawling.
Default: False.
screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
Default: None.
screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
pdf (bool): Whether to generate a PDF of the page.
Default: False.
image_description_min_word_threshold (int): Minimum words for image description extraction.
Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
image_score_threshold (int): Minimum score threshold for processing an image.
Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
exclude_external_images (bool): If True, exclude all external images from processing.
Default: False.
# Link and Domain Handling Parameters
exclude_social_media_domains (list of str): List of domains to exclude for social media links.
Default: SOCIAL_MEDIA_DOMAINS (from config).
exclude_external_links (bool): If True, exclude all external links from the results.
Default: False.
exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
Default: False.
exclude_domains (list of str): List of specific domains to exclude from results.
Default: [].
# Debugging and Logging Parameters
verbose (bool): Enable verbose logging.
Default: True.
log_console (bool): If True, log console messages from the page.
Default: False.
# Streaming Parameters
stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
Default: False.
# Optional Parameters
stream (bool): If True, stream the page content as it is being loaded.
url: str = None # This is not a compulsory parameter
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
user_agent (str): Custom User-Agent string to use. Default: None
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
mean_delay (float): Mean base delay between requests when calling arun_many.
Default: 0.1.
max_range (float): Max random additional delay range for requests in arun_many.
Default: 0.3.
# session_id and semaphore_count might be set at runtime, not needed as defaults here.
"""
def __init__(
self,
# Content Processing Parameters
word_count_threshold: int = MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
deep_crawl_strategy: DeepCrawlStrategy = None,
markdown_generator: MarkdownGenerationStrategy = None,
content_filter : RelevantContentFilter = None,
only_text: bool = False,
css_selector: str = None,
excluded_tags: list = None,
excluded_selector: str = None,
keep_data_attributes: bool = False,
remove_forms: bool = False,
prettiify: bool = False,
parser_type: str = "lxml",
scraping_strategy: ContentScrapingStrategy = None,
proxy_config: dict = None,
# SSL Parameters
fetch_ssl_certificate: bool = False,
# Caching Parameters
cache_mode: CacheMode =None,
word_count_threshold: int = MIN_WORD_THRESHOLD ,
extraction_strategy : ExtractionStrategy=None, # Will default to NoExtractionStrategy if None
chunking_strategy : ChunkingStrategy= None, # Will default to RegexChunking if None
markdown_generator : MarkdownGenerationStrategy = None,
content_filter=None,
cache_mode=None,
session_id: str = None,
bypass_cache: bool = False,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
shared_data: dict = None,
# Page Navigation and Timing Parameters
css_selector: str = None,
screenshot: bool = False,
pdf: bool = False,
verbose: bool = True,
only_text: bool = False,
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
prettiify: bool = False,
js_code=None,
wait_for: str = None,
js_only: bool = False,
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
wait_for: str = None,
wait_for_images: bool = False,
delay_before_return_html: float = 0.1,
mean_delay: float = 0.1,
max_range: float = 0.3,
semaphore_count: int = 5,
# Page Interaction Parameters
js_code: Union[str, List[str]] = None,
js_only: bool = False,
ignore_body_visibility: bool = True,
wait_for_images: bool = True,
adjust_viewport_to_content: bool = False,
scan_full_page: bool = False,
scroll_delay: float = 0.2,
process_iframes: bool = False,
remove_overlay_elements: bool = False,
delay_before_return_html: float = 0.1,
log_console: bool = False,
simulate_user: bool = False,
override_navigator: bool = False,
magic: bool = False,
adjust_viewport_to_content: bool = False,
# Media Handling Parameters
screenshot: bool = False,
screenshot_wait_for: float = None,
screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
pdf: bool = False,
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
exclude_external_images: bool = False,
# Link and Domain Handling Parameters
exclude_social_media_domains: list = None,
exclude_external_links: bool = False,
exclude_social_media_links: bool = False,
exclude_domains: list = None,
# Debugging and Logging Parameters
verbose: bool = True,
log_console: bool = False,
# Streaming Parameters
stream: bool = False,
url: str = None,
check_robots_txt: bool = False,
user_agent: str = None,
user_agent_mode: str = None,
user_agent_generator_config: dict = {},
mean_delay: float = 0.1,
max_range: float = 0.3,
semaphore_count: int = 5,
):
self.url = url
# Content Processing Parameters
self.word_count_threshold = word_count_threshold
self.extraction_strategy = extraction_strategy
self.chunking_strategy = chunking_strategy
self.deep_crawl_strategy = deep_crawl_strategy
self.markdown_generator = markdown_generator
self.content_filter = content_filter
self.only_text = only_text
self.css_selector = css_selector
self.excluded_tags = excluded_tags or []
self.excluded_selector = excluded_selector or ""
self.keep_data_attributes = keep_data_attributes
self.remove_forms = remove_forms
self.prettiify = prettiify
self.parser_type = parser_type
self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
self.proxy_config = proxy_config
# SSL Parameters
self.fetch_ssl_certificate = fetch_ssl_certificate
# Caching Parameters
self.cache_mode = cache_mode
self.session_id = session_id
self.bypass_cache = bypass_cache
self.disable_cache = disable_cache
self.no_cache_read = no_cache_read
self.no_cache_write = no_cache_write
self.shared_data = shared_data
# Page Navigation and Timing Parameters
self.css_selector = css_selector
self.screenshot = screenshot
self.pdf = pdf
self.verbose = verbose
self.only_text = only_text
self.image_description_min_word_threshold = image_description_min_word_threshold
self.prettiify = prettiify
self.js_code = js_code
self.wait_for = wait_for
self.js_only = js_only
self.wait_until = wait_until
self.page_timeout = page_timeout
self.wait_for = wait_for
self.wait_for_images = wait_for_images
self.delay_before_return_html = delay_before_return_html
self.mean_delay = mean_delay
self.max_range = max_range
self.semaphore_count = semaphore_count
# Page Interaction Parameters
self.js_code = js_code
self.js_only = js_only
self.ignore_body_visibility = ignore_body_visibility
self.wait_for_images = wait_for_images
self.adjust_viewport_to_content = adjust_viewport_to_content
self.scan_full_page = scan_full_page
self.scroll_delay = scroll_delay
self.process_iframes = process_iframes
self.remove_overlay_elements = remove_overlay_elements
self.delay_before_return_html = delay_before_return_html
self.log_console = log_console
self.simulate_user = simulate_user
self.override_navigator = override_navigator
self.magic = magic
self.adjust_viewport_to_content = adjust_viewport_to_content
# Media Handling Parameters
self.screenshot = screenshot
self.screenshot_wait_for = screenshot_wait_for
self.screenshot_height_threshold = screenshot_height_threshold
self.pdf = pdf
self.image_description_min_word_threshold = image_description_min_word_threshold
self.image_score_threshold = image_score_threshold
self.exclude_external_images = exclude_external_images
# Link and Domain Handling Parameters
self.exclude_social_media_domains = (
exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
)
self.exclude_external_links = exclude_external_links
self.exclude_social_media_links = exclude_social_media_links
self.exclude_domains = exclude_domains or []
# Debugging and Logging Parameters
self.verbose = verbose
self.log_console = log_console
# Streaming Parameters
self.stream = stream
# Robots.txt Handling Parameters
self.check_robots_txt = check_robots_txt
# User Agent Parameters
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
self.mean_delay = mean_delay
self.max_range = max_range
self.semaphore_count = semaphore_count
# Validate type of extraction strategy and chunking strategy if they are provided
if self.extraction_strategy is not None and not isinstance(
self.extraction_strategy, ExtractionStrategy
):
raise ValueError(
"extraction_strategy must be an instance of ExtractionStrategy"
)
if self.deep_crawl_strategy is not None and not isinstance(
self.deep_crawl_strategy, DeepCrawlStrategy
):
raise ValueError(
"deep_crawl_strategy must be an instance of DeepCrawlStrategy"
)
if self.chunking_strategy is not None and not isinstance(
self.chunking_strategy, ChunkingStrategy
):
raise ValueError(
"chunking_strategy must be an instance of ChunkingStrategy"
)
if self.extraction_strategy is not None and not isinstance(self.extraction_strategy, ExtractionStrategy):
raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
if self.chunking_strategy is not None and not isinstance(self.chunking_strategy, ChunkingStrategy):
raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
# Set default chunking strategy if None
if self.chunking_strategy is None:
from .chunking_strategy import RegexChunking
self.chunking_strategy = RegexChunking()
@staticmethod
def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
return CrawlerRunConfig(
# Content Processing Parameters
word_count_threshold=kwargs.get("word_count_threshold", 200),
extraction_strategy=kwargs.get("extraction_strategy"),
chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
deep_crawl_strategy=kwargs.get("deep_crawl_strategy"),
chunking_strategy=kwargs.get("chunking_strategy"),
markdown_generator=kwargs.get("markdown_generator"),
content_filter=kwargs.get("content_filter"),
only_text=kwargs.get("only_text", False),
css_selector=kwargs.get("css_selector"),
excluded_tags=kwargs.get("excluded_tags", []),
excluded_selector=kwargs.get("excluded_selector", ""),
keep_data_attributes=kwargs.get("keep_data_attributes", False),
remove_forms=kwargs.get("remove_forms", False),
prettiify=kwargs.get("prettiify", False),
parser_type=kwargs.get("parser_type", "lxml"),
scraping_strategy=kwargs.get("scraping_strategy"),
proxy_config=kwargs.get("proxy_config"),
# SSL Parameters
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
# Caching Parameters
cache_mode=kwargs.get("cache_mode"),
session_id=kwargs.get("session_id"),
bypass_cache=kwargs.get("bypass_cache", False),
disable_cache=kwargs.get("disable_cache", False),
no_cache_read=kwargs.get("no_cache_read", False),
no_cache_write=kwargs.get("no_cache_write", False),
shared_data=kwargs.get("shared_data", None),
# Page Navigation and Timing Parameters
css_selector=kwargs.get("css_selector"),
screenshot=kwargs.get("screenshot", False),
pdf=kwargs.get("pdf", False),
verbose=kwargs.get("verbose", True),
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
prettiify=kwargs.get("prettiify", False),
js_code=kwargs.get("js_code"), # If not provided here, will default inside constructor
wait_for=kwargs.get("wait_for"),
js_only=kwargs.get("js_only", False),
wait_until=kwargs.get("wait_until", "domcontentloaded"),
page_timeout=kwargs.get("page_timeout", 60000),
wait_for=kwargs.get("wait_for"),
wait_for_images=kwargs.get("wait_for_images", False),
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3),
semaphore_count=kwargs.get("semaphore_count", 5),
# Page Interaction Parameters
js_code=kwargs.get("js_code"),
js_only=kwargs.get("js_only", False),
ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
scan_full_page=kwargs.get("scan_full_page", False),
scroll_delay=kwargs.get("scroll_delay", 0.2),
process_iframes=kwargs.get("process_iframes", False),
remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
log_console=kwargs.get("log_console", False),
simulate_user=kwargs.get("simulate_user", False),
override_navigator=kwargs.get("override_navigator", False),
magic=kwargs.get("magic", False),
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
# Media Handling Parameters
screenshot=kwargs.get("screenshot", False),
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
screenshot_height_threshold=kwargs.get(
"screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
),
pdf=kwargs.get("pdf", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold",
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
),
image_score_threshold=kwargs.get(
"image_score_threshold", IMAGE_SCORE_THRESHOLD
),
exclude_external_images=kwargs.get("exclude_external_images", False),
# Link and Domain Handling Parameters
exclude_social_media_domains=kwargs.get(
"exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
),
exclude_external_links=kwargs.get("exclude_external_links", False),
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
exclude_domains=kwargs.get("exclude_domains", []),
# Debugging and Logging Parameters
verbose=kwargs.get("verbose", True),
log_console=kwargs.get("log_console", False),
# Streaming Parameters
stream=kwargs.get("stream", False),
url=kwargs.get("url"),
check_robots_txt=kwargs.get("check_robots_txt", False),
user_agent=kwargs.get("user_agent"),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config", {}),
screenshot_height_threshold=kwargs.get("screenshot_height_threshold", 20000),
mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3),
semaphore_count=kwargs.get("semaphore_count", 5)
)
# Create a funciton returns dict of the object
def to_dict(self):
return {
"word_count_threshold": self.word_count_threshold,
"extraction_strategy": self.extraction_strategy,
"chunking_strategy": self.chunking_strategy,
"deep_crawl_strategy": self.deep_crawl_strategy,
"markdown_generator": self.markdown_generator,
"content_filter": self.content_filter,
"only_text": self.only_text,
"css_selector": self.css_selector,
"excluded_tags": self.excluded_tags,
"excluded_selector": self.excluded_selector,
"keep_data_attributes": self.keep_data_attributes,
"remove_forms": self.remove_forms,
"prettiify": self.prettiify,
"parser_type": self.parser_type,
"scraping_strategy": self.scraping_strategy,
"proxy_config": self.proxy_config,
"fetch_ssl_certificate": self.fetch_ssl_certificate,
"cache_mode": self.cache_mode,
"session_id": self.session_id,
"bypass_cache": self.bypass_cache,
"disable_cache": self.disable_cache,
"no_cache_read": self.no_cache_read,
"no_cache_write": self.no_cache_write,
"shared_data": self.shared_data,
"wait_until": self.wait_until,
"page_timeout": self.page_timeout,
"wait_for": self.wait_for,
"wait_for_images": self.wait_for_images,
"delay_before_return_html": self.delay_before_return_html,
"mean_delay": self.mean_delay,
"max_range": self.max_range,
"semaphore_count": self.semaphore_count,
"js_code": self.js_code,
"js_only": self.js_only,
"ignore_body_visibility": self.ignore_body_visibility,
"scan_full_page": self.scan_full_page,
"scroll_delay": self.scroll_delay,
"process_iframes": self.process_iframes,
"remove_overlay_elements": self.remove_overlay_elements,
"simulate_user": self.simulate_user,
"override_navigator": self.override_navigator,
"magic": self.magic,
"adjust_viewport_to_content": self.adjust_viewport_to_content,
"screenshot": self.screenshot,
"screenshot_wait_for": self.screenshot_wait_for,
"screenshot_height_threshold": self.screenshot_height_threshold,
"pdf": self.pdf,
"image_description_min_word_threshold": self.image_description_min_word_threshold,
"image_score_threshold": self.image_score_threshold,
"exclude_external_images": self.exclude_external_images,
"exclude_social_media_domains": self.exclude_social_media_domains,
"exclude_external_links": self.exclude_external_links,
"exclude_social_media_links": self.exclude_social_media_links,
"exclude_domains": self.exclude_domains,
"verbose": self.verbose,
"log_console": self.log_console,
"stream": self.stream,
"url": self.url,
"check_robots_txt": self.check_robots_txt,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
}
def clone(self, **kwargs):
"""Create a copy of this configuration with updated values.
Args:
**kwargs: Key-value pairs of configuration options to update
Returns:
CrawlerRunConfig: A new instance with the specified updates
Example:
```python
# Create a new config with streaming enabled
stream_config = config.clone(stream=True)
# Create a new config with multiple updates
new_config = config.clone(
stream=True,
cache_mode=CacheMode.BYPASS,
verbose=True
)
```
"""
config_dict = self.to_dict()
config_dict.update(kwargs)
return CrawlerRunConfig.from_kwargs(config_dict)

File diff suppressed because it is too large Load Diff

View File

@@ -1,30 +1,27 @@
import os
import os, sys
from pathlib import Path
import aiosqlite
import asyncio
from typing import Optional, Dict
from typing import Optional, Tuple, Dict
from contextlib import asynccontextmanager
import logging
import json # Added for serialization/deserialization
from .utils import ensure_content_dirs, generate_content_hash
from .models import CrawlResult, MarkdownGenerationResult
from .models import CrawlResult
import xxhash
import aiofiles
from .config import NEED_MIGRATION
from .version_manager import VersionManager
from .async_logger import AsyncLogger
from .utils import get_error_context, create_box_message
# Set up logging
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
# logger.setLevel(logging.INFO)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
base_directory = DB_PATH = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
base_directory = DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(base_directory, "crawl4ai.db")
class AsyncDatabaseManager:
def __init__(self, pool_size: int = 10, max_retries: int = 3):
self.db_path = DB_PATH
@@ -35,27 +32,28 @@ class AsyncDatabaseManager:
self.pool_lock = asyncio.Lock()
self.init_lock = asyncio.Lock()
self.connection_semaphore = asyncio.Semaphore(pool_size)
self._initialized = False
self._initialized = False
self.version_manager = VersionManager()
self.logger = AsyncLogger(
log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
verbose=False,
tag_width=10,
tag_width=10
)
async def initialize(self):
"""Initialize the database and connection pool"""
try:
self.logger.info("Initializing database", tag="INIT")
# Ensure the database file exists
os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
# Check if version update is needed
needs_update = self.version_manager.needs_update()
# Always ensure base table exists
await self.ainit_db()
# Verify the table exists
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
async with db.execute(
@@ -64,37 +62,33 @@ class AsyncDatabaseManager:
result = await cursor.fetchone()
if not result:
raise Exception("crawled_data table was not created")
# If version changed or fresh install, run updates
if needs_update:
self.logger.info("New version detected, running updates", tag="INIT")
await self.update_db_schema()
from .migrations import (
run_migration,
) # Import here to avoid circular imports
from .migrations import run_migration # Import here to avoid circular imports
await run_migration()
self.version_manager.update_version() # Update stored version after successful migration
self.logger.success(
"Version update completed successfully", tag="COMPLETE"
)
self.logger.success("Version update completed successfully", tag="COMPLETE")
else:
self.logger.success(
"Database initialization completed successfully", tag="COMPLETE"
)
self.logger.success("Database initialization completed successfully", tag="COMPLETE")
except Exception as e:
self.logger.error(
message="Database initialization error: {error}",
tag="ERROR",
params={"error": str(e)},
params={"error": str(e)}
)
self.logger.info(
message="Database will be initialized on first use", tag="INIT"
message="Database will be initialized on first use",
tag="INIT"
)
raise
async def cleanup(self):
"""Cleanup connections when shutting down"""
async with self.pool_lock:
@@ -113,7 +107,6 @@ class AsyncDatabaseManager:
self._initialized = True
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
self.logger.error(
message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
@@ -122,52 +115,41 @@ class AsyncDatabaseManager:
params={
"error": str(e),
"context": error_context["code_context"],
"traceback": error_context["full_traceback"],
},
"traceback": error_context["full_traceback"]
}
)
raise
await self.connection_semaphore.acquire()
task_id = id(asyncio.current_task())
try:
async with self.pool_lock:
if task_id not in self.connection_pool:
try:
conn = await aiosqlite.connect(self.db_path, timeout=30.0)
await conn.execute("PRAGMA journal_mode = WAL")
await conn.execute("PRAGMA busy_timeout = 5000")
conn = await aiosqlite.connect(
self.db_path,
timeout=30.0
)
await conn.execute('PRAGMA journal_mode = WAL')
await conn.execute('PRAGMA busy_timeout = 5000')
# Verify database structure
async with conn.execute(
"PRAGMA table_info(crawled_data)"
) as cursor:
async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
columns = await cursor.fetchall()
column_names = [col[1] for col in columns]
expected_columns = {
"url",
"html",
"cleaned_html",
"markdown",
"extracted_content",
"success",
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
'success', 'media', 'links', 'metadata', 'screenshot',
'response_headers', 'downloaded_files'
}
missing_columns = expected_columns - set(column_names)
if missing_columns:
raise ValueError(
f"Database missing columns: {missing_columns}"
)
raise ValueError(f"Database missing columns: {missing_columns}")
self.connection_pool[task_id] = conn
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -176,7 +158,7 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
message=create_box_message(error_message, type= "error"),
)
raise
@@ -185,7 +167,6 @@ class AsyncDatabaseManager:
except Exception as e:
import sys
error_context = get_error_context(sys.exc_info())
error_message = (
f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -194,7 +175,7 @@ class AsyncDatabaseManager:
f"Code context:\n{error_context['code_context']}"
)
self.logger.error(
message=create_box_message(error_message, type="error"),
message=create_box_message(error_message, type= "error"),
)
raise
finally:
@@ -204,6 +185,7 @@ class AsyncDatabaseManager:
del self.connection_pool[task_id]
self.connection_semaphore.release()
async def execute_with_retry(self, operation, *args):
"""Execute database operations with retry logic"""
for attempt in range(self.max_retries):
@@ -218,16 +200,18 @@ class AsyncDatabaseManager:
message="Operation failed after {retries} attempts: {error}",
tag="ERROR",
force_verbose=True,
params={"retries": self.max_retries, "error": str(e)},
)
params={
"retries": self.max_retries,
"error": str(e)
}
)
raise
await asyncio.sleep(1 * (attempt + 1)) # Exponential backoff
async def ainit_db(self):
"""Initialize database schema"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
await db.execute(
"""
await db.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
html TEXT,
@@ -242,27 +226,21 @@ class AsyncDatabaseManager:
response_headers TEXT DEFAULT "{}",
downloaded_files TEXT DEFAULT "{}" -- New column added
)
"""
)
''')
await db.commit()
async def update_db_schema(self):
"""Update database schema if needed"""
async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
cursor = await db.execute("PRAGMA table_info(crawled_data)")
columns = await cursor.fetchall()
column_names = [column[1] for column in columns]
# List of new columns to add
new_columns = [
"media",
"links",
"metadata",
"screenshot",
"response_headers",
"downloaded_files",
]
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
for column in new_columns:
if column not in column_names:
await self.aalter_db_add_column(column, db)
@@ -270,95 +248,70 @@ class AsyncDatabaseManager:
async def aalter_db_add_column(self, new_column: str, db):
"""Add new column to the database"""
if new_column == "response_headers":
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
)
if new_column == 'response_headers':
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
else:
await db.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
)
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
self.logger.info(
message="Added column '{column}' to the database",
tag="INIT",
params={"column": new_column},
)
params={"column": new_column}
)
async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
"""Retrieve cached URL data as CrawlResult"""
async def _get(db):
async with db.execute(
"SELECT * FROM crawled_data WHERE url = ?", (url,)
'SELECT * FROM crawled_data WHERE url = ?', (url,)
) as cursor:
row = await cursor.fetchone()
if not row:
return None
# Get column names
columns = [description[0] for description in cursor.description]
# Create dict from row data
row_dict = dict(zip(columns, row))
# Load content from files using stored hashes
content_fields = {
"html": row_dict["html"],
"cleaned_html": row_dict["cleaned_html"],
"markdown": row_dict["markdown"],
"extracted_content": row_dict["extracted_content"],
"screenshot": row_dict["screenshot"],
"screenshots": row_dict["screenshot"],
'html': row_dict['html'],
'cleaned_html': row_dict['cleaned_html'],
'markdown': row_dict['markdown'],
'extracted_content': row_dict['extracted_content'],
'screenshot': row_dict['screenshot'],
'screenshots': row_dict['screenshot'],
}
for field, hash_value in content_fields.items():
if hash_value:
content = await self._load_content(
hash_value,
field.split("_")[0], # Get content type from field name
hash_value,
field.split('_')[0] # Get content type from field name
)
row_dict[field] = content or ""
else:
row_dict[field] = ""
# Parse JSON fields
json_fields = [
"media",
"links",
"metadata",
"response_headers",
"markdown",
]
json_fields = ['media', 'links', 'metadata', 'response_headers']
for field in json_fields:
try:
row_dict[field] = (
json.loads(row_dict[field]) if row_dict[field] else {}
)
row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {}
except json.JSONDecodeError:
# Very UGLY, never mention it to me please
if field == "markdown" and isinstance(row_dict[field], str):
row_dict[field] = row_dict[field]
else:
row_dict[field] = {}
if isinstance(row_dict["markdown"], Dict):
row_dict["markdown_v2"] = row_dict["markdown"]
if row_dict["markdown"].get("raw_markdown"):
row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
row_dict[field] = {}
# Parse downloaded_files
try:
row_dict["downloaded_files"] = (
json.loads(row_dict["downloaded_files"])
if row_dict["downloaded_files"]
else []
)
row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else []
except json.JSONDecodeError:
row_dict["downloaded_files"] = []
row_dict['downloaded_files'] = []
# Remove any fields not in CrawlResult model
valid_fields = CrawlResult.__annotations__.keys()
filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
return CrawlResult(**filtered_dict)
try:
@@ -368,7 +321,7 @@ class AsyncDatabaseManager:
message="Error retrieving cached URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
params={"error": str(e)}
)
return None
@@ -376,52 +329,19 @@ class AsyncDatabaseManager:
"""Cache CrawlResult data"""
# Store content files and get hashes
content_map = {
"html": (result.html, "html"),
"cleaned_html": (result.cleaned_html or "", "cleaned"),
"markdown": None,
"extracted_content": (result.extracted_content or "", "extracted"),
"screenshot": (result.screenshot or "", "screenshots"),
'html': (result.html, 'html'),
'cleaned_html': (result.cleaned_html or "", 'cleaned'),
'markdown': (result.markdown or "", 'markdown'),
'extracted_content': (result.extracted_content or "", 'extracted'),
'screenshot': (result.screenshot or "", 'screenshots')
}
try:
if isinstance(result.markdown, MarkdownGenerationResult):
content_map["markdown"] = (
result.markdown.model_dump_json(),
"markdown",
)
elif hasattr(result, "markdown_v2"):
content_map["markdown"] = (
result.markdown_v2.model_dump_json(),
"markdown",
)
elif isinstance(result.markdown, str):
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
content_map["markdown"] = (
markdown_result.model_dump_json(),
"markdown",
)
else:
content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
except Exception as e:
self.logger.warning(
message=f"Error processing markdown content: {str(e)}", tag="WARNING"
)
# Fallback to empty markdown result
content_map["markdown"] = (
MarkdownGenerationResult().model_dump_json(),
"markdown",
)
content_hashes = {}
for field, (content, content_type) in content_map.items():
content_hashes[field] = await self._store_content(content, content_type)
async def _cache(db):
await db.execute(
"""
await db.execute('''
INSERT INTO crawled_data (
url, html, cleaned_html, markdown,
extracted_content, success, media, links, metadata,
@@ -440,22 +360,20 @@ class AsyncDatabaseManager:
screenshot = excluded.screenshot,
response_headers = excluded.response_headers,
downloaded_files = excluded.downloaded_files
""",
(
result.url,
content_hashes["html"],
content_hashes["cleaned_html"],
content_hashes["markdown"],
content_hashes["extracted_content"],
result.success,
json.dumps(result.media),
json.dumps(result.links),
json.dumps(result.metadata or {}),
content_hashes["screenshot"],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or []),
),
)
''', (
result.url,
content_hashes['html'],
content_hashes['cleaned_html'],
content_hashes['markdown'],
content_hashes['extracted_content'],
result.success,
json.dumps(result.media),
json.dumps(result.links),
json.dumps(result.metadata or {}),
content_hashes['screenshot'],
json.dumps(result.response_headers or {}),
json.dumps(result.downloaded_files or [])
))
try:
await self.execute_with_retry(_cache)
@@ -464,14 +382,14 @@ class AsyncDatabaseManager:
message="Error caching URL: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
params={"error": str(e)}
)
async def aget_total_count(self) -> int:
"""Get total number of cached URLs"""
async def _count(db):
async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
result = await cursor.fetchone()
return result[0] if result else 0
@@ -482,15 +400,14 @@ class AsyncDatabaseManager:
message="Error getting total count: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
params={"error": str(e)}
)
return 0
async def aclear_db(self):
"""Clear all data from the database"""
async def _clear(db):
await db.execute("DELETE FROM crawled_data")
await db.execute('DELETE FROM crawled_data')
try:
await self.execute_with_retry(_clear)
@@ -499,14 +416,13 @@ class AsyncDatabaseManager:
message="Error clearing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
params={"error": str(e)}
)
async def aflush_db(self):
"""Drop the entire table"""
async def _flush(db):
await db.execute("DROP TABLE IF EXISTS crawled_data")
await db.execute('DROP TABLE IF EXISTS crawled_data')
try:
await self.execute_with_retry(_flush)
@@ -515,44 +431,42 @@ class AsyncDatabaseManager:
message="Error flushing database: {error}",
tag="ERROR",
force_verbose=True,
params={"error": str(e)},
params={"error": str(e)}
)
async def _store_content(self, content: str, content_type: str) -> str:
"""Store content in filesystem and return hash"""
if not content:
return ""
content_hash = generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
# Only write if file doesn't exist
if not os.path.exists(file_path):
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(content)
return content_hash
async def _load_content(
self, content_hash: str, content_type: str
) -> Optional[str]:
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
"""Load content from filesystem by hash"""
if not content_hash:
return None
file_path = os.path.join(self.content_paths[content_type], content_hash)
try:
async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
return await f.read()
except:
self.logger.error(
message="Failed to load content: {file_path}",
tag="ERROR",
force_verbose=True,
params={"file_path": file_path},
params={"file_path": file_path}
)
return None
# Create a singleton instance
async_db_manager = AsyncDatabaseManager()

View File

@@ -1,647 +0,0 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
from collections.abc import AsyncGenerator
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"{completed_count}{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
self.result_queue = asyncio.Queue() # Queue for storing results
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
await self.result_queue.put(result)
return result
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
async def run_urls_stream(
self,
urls: List[str],
crawler: "AsyncWebCrawler",
config: CrawlerRunConfig,
) -> AsyncGenerator[CrawlerTaskResult, None]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
active_tasks = []
task_queue = []
completed_count = 0
total_urls = len(urls)
# Initialize task queue
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while completed_count < total_urls:
# Start new tasks if memory permits
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks and not task_queue:
break
# Wait for any task to complete and yield results
if active_tasks:
done, pending = await asyncio.wait(
active_tasks,
timeout=0.1,
return_when=asyncio.FIRST_COMPLETED
)
for completed_task in done:
result = await completed_task
completed_count += 1
yield result
active_tasks = list(pending)
else:
await asyncio.sleep(self.check_interval)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()

View File

@@ -1,588 +0,0 @@
from typing import Dict, Optional, List, Tuple
from .async_configs import CrawlerRunConfig
from .models import (
CrawlResult,
CrawlerTaskResult,
CrawlStatus,
DisplayMode,
CrawlStats,
DomainState,
)
from rich.live import Live
from rich.table import Table
from rich.console import Console
from rich import box
from datetime import datetime, timedelta
import time
import psutil
import asyncio
import uuid
from urllib.parse import urlparse
import random
from abc import ABC, abstractmethod
class RateLimiter:
def __init__(
self,
base_delay: Tuple[float, float] = (1.0, 3.0),
max_delay: float = 60.0,
max_retries: int = 3,
rate_limit_codes: List[int] = None,
):
self.base_delay = base_delay
self.max_delay = max_delay
self.max_retries = max_retries
self.rate_limit_codes = rate_limit_codes or [429, 503]
self.domains: Dict[str, DomainState] = {}
def get_domain(self, url: str) -> str:
return urlparse(url).netloc
async def wait_if_needed(self, url: str) -> None:
domain = self.get_domain(url)
state = self.domains.get(domain)
if not state:
self.domains[domain] = DomainState()
state = self.domains[domain]
now = time.time()
if state.last_request_time:
wait_time = max(0, state.current_delay - (now - state.last_request_time))
if wait_time > 0:
await asyncio.sleep(wait_time)
# Random delay within base range if no current delay
if state.current_delay == 0:
state.current_delay = random.uniform(*self.base_delay)
state.last_request_time = time.time()
def update_delay(self, url: str, status_code: int) -> bool:
domain = self.get_domain(url)
state = self.domains[domain]
if status_code in self.rate_limit_codes:
state.fail_count += 1
if state.fail_count > self.max_retries:
return False
# Exponential backoff with random jitter
state.current_delay = min(
state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
)
else:
# Gradually reduce delay on success
state.current_delay = max(
random.uniform(*self.base_delay), state.current_delay * 0.75
)
state.fail_count = 0
return True
class CrawlerMonitor:
def __init__(
self,
max_visible_rows: int = 15,
display_mode: DisplayMode = DisplayMode.DETAILED,
):
self.console = Console()
self.max_visible_rows = max_visible_rows
self.display_mode = display_mode
self.stats: Dict[str, CrawlStats] = {}
self.process = psutil.Process()
self.start_time = datetime.now()
self.live = Live(self._create_table(), refresh_per_second=2)
def start(self):
self.live.start()
def stop(self):
self.live.stop()
def add_task(self, task_id: str, url: str):
self.stats[task_id] = CrawlStats(
task_id=task_id, url=url, status=CrawlStatus.QUEUED
)
self.live.update(self._create_table())
def update_task(self, task_id: str, **kwargs):
if task_id in self.stats:
for key, value in kwargs.items():
setattr(self.stats[task_id], key, value)
self.live.update(self._create_table())
def _create_aggregated_table(self) -> Table:
"""Creates a compact table showing only aggregated statistics"""
table = Table(
box=box.ROUNDED,
title="Crawler Status Overview",
title_style="bold magenta",
header_style="bold blue",
show_lines=True,
)
# Calculate statistics
total_tasks = len(self.stats)
queued = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
)
in_progress = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
# Memory statistics
current_memory = self.process.memory_info().rss / (1024 * 1024)
total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
peak_memory = max(
(stat.peak_memory for stat in self.stats.values()), default=0.0
)
# Duration
duration = datetime.now() - self.start_time
# Create status row
table.add_column("Status", style="bold cyan")
table.add_column("Count", justify="right")
table.add_column("Percentage", justify="right")
table.add_row("Total Tasks", str(total_tasks), "100%")
table.add_row(
"[yellow]In Queue[/yellow]",
str(queued),
f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[blue]In Progress[/blue]",
str(in_progress),
f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[green]Completed[/green]",
str(completed),
f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
table.add_row(
"[red]Failed[/red]",
str(failed),
f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
)
# Add memory information
table.add_section()
table.add_row(
"[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
)
table.add_row(
"[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
)
table.add_row(
"[yellow]Runtime[/yellow]",
str(timedelta(seconds=int(duration.total_seconds()))),
"",
)
return table
def _create_detailed_table(self) -> Table:
table = Table(
box=box.ROUNDED,
title="Crawler Performance Monitor",
title_style="bold magenta",
header_style="bold blue",
)
# Add columns
table.add_column("Task ID", style="cyan", no_wrap=True)
table.add_column("URL", style="cyan", no_wrap=True)
table.add_column("Status", style="bold")
table.add_column("Memory (MB)", justify="right")
table.add_column("Peak (MB)", justify="right")
table.add_column("Duration", justify="right")
table.add_column("Info", style="italic")
# Add summary row
total_memory = sum(stat.memory_usage for stat in self.stats.values())
active_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
)
completed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
)
failed_count = sum(
1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
)
table.add_row(
"[bold yellow]SUMMARY",
f"Total: {len(self.stats)}",
f"Active: {active_count}",
f"{total_memory:.1f}",
f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
str(
timedelta(
seconds=int((datetime.now() - self.start_time).total_seconds())
)
),
f"{completed_count}{failed_count}",
style="bold",
)
table.add_section()
# Add rows for each task
visible_stats = sorted(
self.stats.values(),
key=lambda x: (
x.status != CrawlStatus.IN_PROGRESS,
x.status != CrawlStatus.QUEUED,
x.end_time or datetime.max,
),
)[: self.max_visible_rows]
for stat in visible_stats:
status_style = {
CrawlStatus.QUEUED: "white",
CrawlStatus.IN_PROGRESS: "yellow",
CrawlStatus.COMPLETED: "green",
CrawlStatus.FAILED: "red",
}[stat.status]
table.add_row(
stat.task_id[:8], # Show first 8 chars of task ID
stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
f"[{status_style}]{stat.status.value}[/{status_style}]",
f"{stat.memory_usage:.1f}",
f"{stat.peak_memory:.1f}",
stat.duration,
stat.error_message[:40] if stat.error_message else "",
)
return table
def _create_table(self) -> Table:
"""Creates the appropriate table based on display mode"""
if self.display_mode == DisplayMode.AGGREGATED:
return self._create_aggregated_table()
return self._create_detailed_table()
class BaseDispatcher(ABC):
def __init__(
self,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
self.crawler = None
self._domain_last_hit: Dict[str, float] = {}
self.concurrent_sessions = 0
self.rate_limiter = rate_limiter
self.monitor = monitor
@abstractmethod
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
monitor: Optional[CrawlerMonitor] = None,
) -> CrawlerTaskResult:
pass
@abstractmethod
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
monitor: Optional[CrawlerMonitor] = None,
) -> List[CrawlerTaskResult]:
pass
class MemoryAdaptiveDispatcher(BaseDispatcher):
def __init__(
self,
memory_threshold_percent: float = 90.0,
check_interval: float = 1.0,
max_session_permit: int = 20,
memory_wait_timeout: float = 300.0, # 5 minutes default timeout
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.memory_threshold_percent = memory_threshold_percent
self.check_interval = check_interval
self.max_session_permit = max_session_permit
self.memory_wait_timeout = memory_wait_timeout
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
self.concurrent_sessions += 1
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
self.concurrent_sessions -= 1
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
urls: List[str],
crawler: "AsyncWebCrawler", # noqa: F821
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
pending_tasks = []
active_tasks = []
task_queue = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task_queue.append((url, task_id))
while task_queue or active_tasks:
wait_start_time = time.time()
while len(active_tasks) < self.max_session_permit and task_queue:
if psutil.virtual_memory().percent >= self.memory_threshold_percent:
# Check if we've exceeded the timeout
if time.time() - wait_start_time > self.memory_wait_timeout:
raise MemoryError(
f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
)
await asyncio.sleep(self.check_interval)
continue
url, task_id = task_queue.pop(0)
task = asyncio.create_task(self.crawl_url(url, config, task_id))
active_tasks.append(task)
if not active_tasks:
await asyncio.sleep(self.check_interval)
continue
done, pending = await asyncio.wait(
active_tasks, return_when=asyncio.FIRST_COMPLETED
)
pending_tasks.extend(done)
active_tasks = list(pending)
return await asyncio.gather(*pending_tasks)
finally:
if self.monitor:
self.monitor.stop()
class SemaphoreDispatcher(BaseDispatcher):
def __init__(
self,
semaphore_count: int = 5,
max_session_permit: int = 20,
rate_limiter: Optional[RateLimiter] = None,
monitor: Optional[CrawlerMonitor] = None,
):
super().__init__(rate_limiter, monitor)
self.semaphore_count = semaphore_count
self.max_session_permit = max_session_permit
async def crawl_url(
self,
url: str,
config: CrawlerRunConfig,
task_id: str,
semaphore: asyncio.Semaphore = None,
) -> CrawlerTaskResult:
start_time = datetime.now()
error_message = ""
memory_usage = peak_memory = 0.0
try:
if self.monitor:
self.monitor.update_task(
task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
)
if self.rate_limiter:
await self.rate_limiter.wait_if_needed(url)
async with semaphore:
process = psutil.Process()
start_memory = process.memory_info().rss / (1024 * 1024)
result = await self.crawler.arun(url, config=config, session_id=task_id)
end_memory = process.memory_info().rss / (1024 * 1024)
memory_usage = peak_memory = end_memory - start_memory
if self.rate_limiter and result.status_code:
if not self.rate_limiter.update_delay(url, result.status_code):
error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=datetime.now(),
error_message=error_message,
)
if not result.success:
error_message = result.error_message
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
elif self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
except Exception as e:
error_message = str(e)
if self.monitor:
self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
result = CrawlResult(
url=url, html="", metadata={}, success=False, error_message=str(e)
)
finally:
end_time = datetime.now()
if self.monitor:
self.monitor.update_task(
task_id,
end_time=end_time,
memory_usage=memory_usage,
peak_memory=peak_memory,
error_message=error_message,
)
return CrawlerTaskResult(
task_id=task_id,
url=url,
result=result,
memory_usage=memory_usage,
peak_memory=peak_memory,
start_time=start_time,
end_time=end_time,
error_message=error_message,
)
async def run_urls(
self,
crawler: "AsyncWebCrawler", # noqa: F821
urls: List[str],
config: CrawlerRunConfig,
) -> List[CrawlerTaskResult]:
self.crawler = crawler
if self.monitor:
self.monitor.start()
try:
semaphore = asyncio.Semaphore(self.semaphore_count)
tasks = []
for url in urls:
task_id = str(uuid.uuid4())
if self.monitor:
self.monitor.add_task(task_id, url)
task = asyncio.create_task(
self.crawl_url(url, config, task_id, semaphore)
)
tasks.append(task)
return await asyncio.gather(*tasks, return_exceptions=True)
finally:
if self.monitor:
self.monitor.stop()

View File

@@ -1,10 +1,10 @@
from enum import Enum
from typing import Optional, Dict, Any
from colorama import Fore, Style, init
from typing import Optional, Dict, Any, Union
from colorama import Fore, Back, Style, init
import time
import os
from datetime import datetime
class LogLevel(Enum):
DEBUG = 1
INFO = 2
@@ -12,24 +12,23 @@ class LogLevel(Enum):
WARNING = 4
ERROR = 5
class AsyncLogger:
"""
Asynchronous logger with support for colored console output and file logging.
Supports templated messages with colored components.
"""
DEFAULT_ICONS = {
"INIT": "",
"READY": "",
"FETCH": "",
"SCRAPE": "",
"EXTRACT": "",
"COMPLETE": "",
"ERROR": "×",
"DEBUG": "",
"INFO": "",
"WARNING": "",
'INIT': '',
'READY': '',
'FETCH': '',
'SCRAPE': '',
'EXTRACT': '',
'COMPLETE': '',
'ERROR': '×',
'DEBUG': '',
'INFO': '',
'WARNING': '',
}
DEFAULT_COLORS = {
@@ -43,15 +42,15 @@ class AsyncLogger:
def __init__(
self,
log_file: Optional[str] = None,
log_level: LogLevel = LogLevel.DEBUG,
log_level: LogLevel = LogLevel.INFO,
tag_width: int = 10,
icons: Optional[Dict[str, str]] = None,
colors: Optional[Dict[LogLevel, str]] = None,
verbose: bool = True,
verbose: bool = True
):
"""
Initialize the logger.
Args:
log_file: Optional file path for logging
log_level: Minimum log level to display
@@ -67,7 +66,7 @@ class AsyncLogger:
self.icons = icons or self.DEFAULT_ICONS
self.colors = colors or self.DEFAULT_COLORS
self.verbose = verbose
# Create log file directory if needed
if log_file:
os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
@@ -78,20 +77,18 @@ class AsyncLogger:
def _get_icon(self, tag: str) -> str:
"""Get the icon for a tag, defaulting to info icon if not found."""
return self.icons.get(tag, self.icons["INFO"])
return self.icons.get(tag, self.icons['INFO'])
def _write_to_file(self, message: str):
"""Write a message to the log file if configured."""
if self.log_file:
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
with open(self.log_file, "a", encoding="utf-8") as f:
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
with open(self.log_file, 'a', encoding='utf-8') as f:
# Strip ANSI color codes for file output
clean_message = message.replace(Fore.RESET, "").replace(
Style.RESET_ALL, ""
)
clean_message = message.replace(Fore.RESET, '').replace(Style.RESET_ALL, '')
for color in vars(Fore).values():
if isinstance(color, str):
clean_message = clean_message.replace(color, "")
clean_message = clean_message.replace(color, '')
f.write(f"[{timestamp}] {clean_message}\n")
def _log(
@@ -102,11 +99,11 @@ class AsyncLogger:
params: Optional[Dict[str, Any]] = None,
colors: Optional[Dict[str, str]] = None,
base_color: Optional[str] = None,
**kwargs,
**kwargs
):
"""
Core logging method that handles message formatting and output.
Args:
level: Log level for this message
message: Message template string
@@ -123,7 +120,7 @@ class AsyncLogger:
try:
# First format the message with raw parameters
formatted_message = message.format(**params)
# Then apply colors if specified
if colors:
for key, color in colors.items():
@@ -131,13 +128,12 @@ class AsyncLogger:
if key in params:
value_str = str(params[key])
formatted_message = formatted_message.replace(
value_str, f"{color}{value_str}{Style.RESET_ALL}"
value_str,
f"{color}{value_str}{Style.RESET_ALL}"
)
except KeyError as e:
formatted_message = (
f"LOGGING ERROR: Missing parameter {e} in message template"
)
formatted_message = f"LOGGING ERROR: Missing parameter {e} in message template"
level = LogLevel.ERROR
else:
formatted_message = message
@@ -179,11 +175,11 @@ class AsyncLogger:
success: bool,
timing: float,
tag: str = "FETCH",
url_length: int = 50,
url_length: int = 50
):
"""
Convenience method for logging URL fetch status.
Args:
url: The URL being processed
success: Whether the operation was successful
@@ -199,20 +195,24 @@ class AsyncLogger:
"url": url,
"url_length": url_length,
"status": success,
"timing": timing,
"timing": timing
},
colors={
"status": Fore.GREEN if success else Fore.RED,
"timing": Fore.YELLOW,
},
"timing": Fore.YELLOW
}
)
def error_status(
self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
self,
url: str,
error: str,
tag: str = "ERROR",
url_length: int = 50
):
"""
Convenience method for logging error status.
Args:
url: The URL being processed
error: Error message
@@ -223,5 +223,9 @@ class AsyncLogger:
level=LogLevel.ERROR,
message="{url:.{url_length}}... | Error: {error}",
tag=tag,
params={"url": url, "url_length": url_length, "error": error},
)
params={
"url": url,
"url_length": url_length,
"error": error
}
)

183
crawl4ai/async_tools.py Normal file
View File

@@ -0,0 +1,183 @@
import asyncio
import base64
import time
from abc import ABC, abstractmethod
from typing import Callable, Dict, Any, List, Optional, Awaitable
import os, sys, shutil
import tempfile, subprocess
from playwright.async_api import async_playwright, Page, Browser, Error
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
from playwright.async_api import ProxySettings
from pydantic import BaseModel
import hashlib
import json
import uuid
from .models import AsyncCrawlResponse
from .utils import create_box_message
from .user_agent_generator import UserAgentGenerator
from playwright_stealth import StealthConfig, stealth_async
class ManagedBrowser:
def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
self.browser_type = browser_type
self.user_data_dir = user_data_dir
self.headless = headless
self.browser_process = None
self.temp_dir = None
self.debugging_port = debugging_port
self.host = host
self.logger = logger
self.shutting_down = False
async def start(self) -> str:
"""
Starts the browser process and returns the CDP endpoint URL.
If user_data_dir is not provided, creates a temporary directory.
"""
# Create temp dir if needed
if not self.user_data_dir:
self.temp_dir = tempfile.mkdtemp(prefix="browser-profile-")
self.user_data_dir = self.temp_dir
# Get browser path and args based on OS and browser type
browser_path = self._get_browser_path()
args = self._get_browser_args()
# Start browser process
try:
self.browser_process = subprocess.Popen(
args,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE
)
# Monitor browser process output for errors
asyncio.create_task(self._monitor_browser_process())
await asyncio.sleep(2) # Give browser time to start
return f"http://{self.host}:{self.debugging_port}"
except Exception as e:
await self.cleanup()
raise Exception(f"Failed to start browser: {e}")
async def _monitor_browser_process(self):
"""Monitor the browser process for unexpected termination."""
if self.browser_process:
try:
stdout, stderr = await asyncio.gather(
asyncio.to_thread(self.browser_process.stdout.read),
asyncio.to_thread(self.browser_process.stderr.read)
)
# Check shutting_down flag BEFORE logging anything
if self.browser_process.poll() is not None:
if not self.shutting_down:
self.logger.error(
message="Browser process terminated unexpectedly | Code: {code} | STDOUT: {stdout} | STDERR: {stderr}",
tag="ERROR",
params={
"code": self.browser_process.returncode,
"stdout": stdout.decode(),
"stderr": stderr.decode()
}
)
await self.cleanup()
else:
self.logger.info(
message="Browser process terminated normally | Code: {code}",
tag="INFO",
params={"code": self.browser_process.returncode}
)
except Exception as e:
if not self.shutting_down:
self.logger.error(
message="Error monitoring browser process: {error}",
tag="ERROR",
params={"error": str(e)}
)
def _get_browser_path(self) -> str:
"""Returns the browser executable path based on OS and browser type"""
if sys.platform == "darwin": # macOS
paths = {
"chromium": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
"firefox": "/Applications/Firefox.app/Contents/MacOS/firefox",
"webkit": "/Applications/Safari.app/Contents/MacOS/Safari"
}
elif sys.platform == "win32": # Windows
paths = {
"chromium": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
"firefox": "C:\\Program Files\\Mozilla Firefox\\firefox.exe",
"webkit": None # WebKit not supported on Windows
}
else: # Linux
paths = {
"chromium": "google-chrome",
"firefox": "firefox",
"webkit": None # WebKit not supported on Linux
}
return paths.get(self.browser_type)
def _get_browser_args(self) -> List[str]:
"""Returns browser-specific command line arguments"""
base_args = [self._get_browser_path()]
if self.browser_type == "chromium":
args = [
f"--remote-debugging-port={self.debugging_port}",
f"--user-data-dir={self.user_data_dir}",
]
if self.headless:
args.append("--headless=new")
elif self.browser_type == "firefox":
args = [
"--remote-debugging-port", str(self.debugging_port),
"--profile", self.user_data_dir,
]
if self.headless:
args.append("--headless")
else:
raise NotImplementedError(f"Browser type {self.browser_type} not supported")
return base_args + args
async def cleanup(self):
"""Cleanup browser process and temporary directory"""
# Set shutting_down flag BEFORE any termination actions
self.shutting_down = True
if self.browser_process:
try:
self.browser_process.terminate()
# Wait for process to end gracefully
for _ in range(10): # 10 attempts, 100ms each
if self.browser_process.poll() is not None:
break
await asyncio.sleep(0.1)
# Force kill if still running
if self.browser_process.poll() is None:
self.browser_process.kill()
await asyncio.sleep(0.1) # Brief wait for kill to take effect
except Exception as e:
self.logger.error(
message="Error terminating browser: {error}",
tag="ERROR",
params={"error": str(e)}
)
if self.temp_dir and os.path.exists(self.temp_dir):
try:
shutil.rmtree(self.temp_dir)
except Exception as e:
self.logger.error(
message="Error removing temporary directory: {error}",
tag="ERROR",
params={"error": str(e)}
)

File diff suppressed because it is too large Load Diff

View File

@@ -4,7 +4,7 @@ from enum import Enum
class CacheMode(Enum):
"""
Defines the caching behavior for web crawling operations.
Modes:
- ENABLED: Normal caching behavior (read and write)
- DISABLED: No caching at all
@@ -12,7 +12,6 @@ class CacheMode(Enum):
- WRITE_ONLY: Only write to cache, don't read
- BYPASS: Bypass cache for this operation
"""
ENABLED = "enabled"
DISABLED = "disabled"
READ_ONLY = "read_only"
@@ -23,69 +22,32 @@ class CacheMode(Enum):
class CacheContext:
"""
Encapsulates cache-related decisions and URL handling.
This class centralizes all cache-related logic and URL type checking,
making the caching behavior more predictable and maintainable.
Attributes:
url (str): The URL being processed.
cache_mode (CacheMode): The cache mode for the current operation.
always_bypass (bool): If True, bypasses caching for this operation.
is_cacheable (bool): True if the URL is cacheable, False otherwise.
is_web_url (bool): True if the URL is a web URL, False otherwise.
is_local_file (bool): True if the URL is a local file, False otherwise.
is_raw_html (bool): True if the URL is raw HTML, False otherwise.
_url_display (str): The display name for the URL (web, local file, or raw HTML).
"""
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
"""
Initializes the CacheContext with the provided URL and cache mode.
Args:
url (str): The URL being processed.
cache_mode (CacheMode): The cache mode for the current operation.
always_bypass (bool): If True, bypasses caching for this operation.
"""
self.url = url
self.cache_mode = cache_mode
self.always_bypass = always_bypass
self.is_cacheable = url.startswith(("http://", "https://", "file://"))
self.is_web_url = url.startswith(("http://", "https://"))
self.is_cacheable = url.startswith(('http://', 'https://', 'file://'))
self.is_web_url = url.startswith(('http://', 'https://'))
self.is_local_file = url.startswith("file://")
self.is_raw_html = url.startswith("raw:")
self._url_display = url if not self.is_raw_html else "Raw HTML"
def should_read(self) -> bool:
"""
Determines if cache should be read based on context.
How it works:
1. If always_bypass is True or is_cacheable is False, return False.
2. If cache_mode is ENABLED or READ_ONLY, return True.
Returns:
bool: True if cache should be read, False otherwise.
"""
"""Determines if cache should be read based on context."""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
def should_write(self) -> bool:
"""
Determines if cache should be written based on context.
How it works:
1. If always_bypass is True or is_cacheable is False, return False.
2. If cache_mode is ENABLED or WRITE_ONLY, return True.
Returns:
bool: True if cache should be written, False otherwise.
"""
"""Determines if cache should be written based on context."""
if self.always_bypass or not self.is_cacheable:
return False
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
@property
def display_url(self) -> str:
"""Returns the URL in display format."""
@@ -96,11 +58,11 @@ def _legacy_to_cache_mode(
disable_cache: bool = False,
bypass_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
no_cache_write: bool = False
) -> CacheMode:
"""
Converts legacy cache parameters to the new CacheMode enum.
This is an internal function to help transition from the old boolean flags
to the new CacheMode system.
"""

View File

@@ -3,53 +3,23 @@ import re
from collections import Counter
import string
from .model_loader import load_nltk_punkt
from .utils import *
# Define the abstract base class for chunking strategies
class ChunkingStrategy(ABC):
"""
Abstract base class for chunking strategies.
"""
@abstractmethod
def chunk(self, text: str) -> list:
"""
Abstract method to chunk the given text.
Args:
text (str): The text to chunk.
Returns:
list: A list of chunks.
"""
pass
# Create an identity chunking strategy f(x) = [x]
class IdentityChunking(ChunkingStrategy):
"""
Chunking strategy that returns the input text as a single chunk.
"""
def chunk(self, text: str) -> list:
return [text]
# Regex-based chunking
class RegexChunking(ChunkingStrategy):
"""
Chunking strategy that splits text based on regular expression patterns.
"""
def __init__(self, patterns=None, **kwargs):
"""
Initialize the RegexChunking object.
Args:
patterns (list): A list of regular expression patterns to split text.
"""
if patterns is None:
patterns = [r"\n\n"] # Default split pattern
patterns = [r'\n\n'] # Default split pattern
self.patterns = patterns
def chunk(self, text: str) -> list:
@@ -60,19 +30,12 @@ class RegexChunking(ChunkingStrategy):
new_paragraphs.extend(re.split(pattern, paragraph))
paragraphs = new_paragraphs
return paragraphs
# NLP-based sentence chunking
# NLP-based sentence chunking
class NlpSentenceChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
"""
def __init__(self, **kwargs):
"""
Initialize the NlpSentenceChunking object.
"""
load_nltk_punkt()
pass
def chunk(self, text: str) -> list:
# Improved regex for sentence splitting
@@ -80,34 +43,18 @@ class NlpSentenceChunking(ChunkingStrategy):
# r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
# )
# sentences = sentence_endings.split(text)
# sens = [sent.strip() for sent in sentences if sent]
# sens = [sent.strip() for sent in sentences if sent]
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
sens = [sent.strip() for sent in sentences]
sens = [sent.strip() for sent in sentences]
return list(set(sens))
# Topic-based segmentation using TextTiling
class TopicSegmentationChunking(ChunkingStrategy):
"""
Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
How it works:
1. Segment the text into topics using TextTilingTokenizer
2. Extract keywords for each topic segment
"""
def __init__(self, num_keywords=3, **kwargs):
"""
Initialize the TopicSegmentationChunking object.
Args:
num_keywords (int): The number of keywords to extract for each topic segment.
"""
import nltk as nl
self.tokenizer = nl.tokenize.TextTilingTokenizer()
self.num_keywords = num_keywords
@@ -119,14 +66,8 @@ class TopicSegmentationChunking(ChunkingStrategy):
def extract_keywords(self, text: str) -> list:
# Tokenize and remove stopwords and punctuation
import nltk as nl
tokens = nl.toknize.word_tokenize(text)
tokens = [
token.lower()
for token in tokens
if token not in nl.corpus.stopwords.words("english")
and token not in string.punctuation
]
tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]
# Calculate frequency distribution
freq_dist = Counter(tokens)
@@ -137,27 +78,15 @@ class TopicSegmentationChunking(ChunkingStrategy):
# Segment the text into topics
segments = self.chunk(text)
# Extract keywords for each topic segment
segments_with_topics = [
(segment, self.extract_keywords(segment)) for segment in segments
]
segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
return segments_with_topics
# Fixed-length word chunks
class FixedLengthWordChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into fixed-length word chunks.
How it works:
1. Split the text into words
2. Create chunks of fixed length
3. Return the list of chunks
"""
def __init__(self, chunk_size=100, **kwargs):
"""
Initialize the fixed-length word chunking strategy with the given chunk size.
Args:
chunk_size (int): The size of each chunk in words.
"""
@@ -165,28 +94,15 @@ class FixedLengthWordChunking(ChunkingStrategy):
def chunk(self, text: str) -> list:
words = text.split()
return [
" ".join(words[i : i + self.chunk_size])
for i in range(0, len(words), self.chunk_size)
]
return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
# Sliding window chunking
class SlidingWindowChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into overlapping word chunks.
How it works:
1. Split the text into words
2. Create chunks of fixed length
3. Return the list of chunks
"""
def __init__(self, window_size=100, step=50, **kwargs):
"""
Initialize the sliding window chunking strategy with the given window size and
step size.
Args:
window_size (int): The size of the sliding window in words.
step (int): The step size for sliding the window in words.
@@ -197,37 +113,27 @@ class SlidingWindowChunking(ChunkingStrategy):
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
for i in range(0, len(words) - self.window_size + 1, self.step):
chunk = " ".join(words[i : i + self.window_size])
chunk = ' '.join(words[i:i + self.window_size])
chunks.append(chunk)
# Handle the last chunk if it doesn't align perfectly
if i + self.window_size < len(words):
chunks.append(" ".join(words[-self.window_size :]))
chunks.append(' '.join(words[-self.window_size:]))
return chunks
class OverlappingWindowChunking(ChunkingStrategy):
"""
Chunking strategy that splits text into overlapping word chunks.
How it works:
1. Split the text into words using whitespace
2. Create chunks of fixed length equal to the window size
3. Slide the window by the overlap size
4. Return the list of chunks
"""
def __init__(self, window_size=1000, overlap=100, **kwargs):
"""
Initialize the overlapping window chunking strategy with the given window size and
overlap size.
Args:
window_size (int): The size of the window in words.
overlap (int): The size of the overlap between consecutive chunks in words.
@@ -238,19 +144,19 @@ class OverlappingWindowChunking(ChunkingStrategy):
def chunk(self, text: str) -> list:
words = text.split()
chunks = []
if len(words) <= self.window_size:
return [text]
start = 0
while start < len(words):
end = start + self.window_size
chunk = " ".join(words[start:end])
chunk = ' '.join(words[start:end])
chunks.append(chunk)
if end >= len(words):
break
start = end - self.overlap
return chunks
return chunks

View File

@@ -1,123 +0,0 @@
import click
import sys
import asyncio
from typing import List
from .docs_manager import DocsManager
from .async_logger import AsyncLogger
logger = AsyncLogger(verbose=True)
docs_manager = DocsManager(logger)
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
"""Print formatted table with headers and rows"""
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
def format_row(row):
return (
"|"
+ "|".join(
f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
for cell, w in zip(row, widths)
)
+ "|"
)
click.echo(border)
click.echo(format_row(headers))
click.echo(border)
for row in rows:
click.echo(format_row(row))
click.echo(border)
@click.group()
def cli():
"""Crawl4AI Command Line Interface"""
pass
@cli.group()
def docs():
"""Documentation operations"""
pass
@docs.command()
@click.argument("sections", nargs=-1)
@click.option(
"--mode", type=click.Choice(["extended", "condensed"]), default="extended"
)
def combine(sections: tuple, mode: str):
"""Combine documentation sections"""
try:
asyncio.run(docs_manager.ensure_docs_exist())
click.echo(docs_manager.generate(sections, mode))
except Exception as e:
logger.error(str(e), tag="ERROR")
sys.exit(1)
@docs.command()
@click.argument("query")
@click.option("--top-k", "-k", default=5)
@click.option("--build-index", is_flag=True, help="Build index if missing")
def search(query: str, top_k: int, build_index: bool):
"""Search documentation"""
try:
result = docs_manager.search(query, top_k)
if result == "No search index available. Call build_search_index() first.":
if build_index or click.confirm("No search index found. Build it now?"):
asyncio.run(docs_manager.llm_text.generate_index_files())
result = docs_manager.search(query, top_k)
click.echo(result)
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
@docs.command()
def update():
"""Update docs from GitHub"""
try:
asyncio.run(docs_manager.fetch_docs())
click.echo("Documentation updated successfully")
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
@docs.command()
@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
def index(force_facts: bool, clear_cache: bool):
"""Build or rebuild search indexes"""
try:
asyncio.run(docs_manager.ensure_docs_exist())
asyncio.run(
docs_manager.llm_text.generate_index_files(
force_generate_facts=force_facts, clear_bm25_cache=clear_cache
)
)
click.echo("Search indexes built successfully")
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
# Add docs list command
@docs.command()
def list():
"""List available documentation sections"""
try:
sections = docs_manager.list()
print_table(["Sections"], [[section] for section in sections])
except Exception as e:
click.echo(f"Error: {str(e)}", err=True)
sys.exit(1)
if __name__ == "__main__":
cli()

View File

@@ -8,13 +8,11 @@ DEFAULT_PROVIDER = "openai/gpt-4o-mini"
MODEL_REPO_BRANCH = "new-release-0.0.2"
# Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
PROVIDER_MODELS = {
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
"groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
"openai/o1-mini": os.getenv("OPENAI_API_KEY"),
"openai/o1-preview": os.getenv("OPENAI_API_KEY"),
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
@@ -22,49 +20,27 @@ PROVIDER_MODELS = {
}
# Chunk token threshold
CHUNK_TOKEN_THRESHOLD = 2**11 # 2048 tokens
CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
OVERLAP_RATE = 0.1
WORD_TOKEN_RATE = 1.3
# Threshold for the minimum number of word in a HTML tag to be considered
# Threshold for the minimum number of word in a HTML tag to be considered
MIN_WORD_THRESHOLD = 1
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
ONLY_TEXT_ELIGIBLE_TAGS = [
"b",
"i",
"u",
"span",
"del",
"ins",
"sub",
"sup",
"strong",
"em",
"code",
"kbd",
"var",
"s",
"q",
"abbr",
"cite",
"dfn",
"time",
"small",
"mark",
]
IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']
ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
SOCIAL_MEDIA_DOMAINS = [
"facebook.com",
"twitter.com",
"x.com",
"linkedin.com",
"instagram.com",
"pinterest.com",
"tiktok.com",
"snapchat.com",
"reddit.com",
]
'facebook.com',
'twitter.com',
'x.com',
'linkedin.com',
'instagram.com',
'pinterest.com',
'tiktok.com',
'snapchat.com',
'reddit.com',
]
# Threshold for the Image extraction - Range is 1 to 6
# Images are scored based on point based system, to filter based on usefulness. Points are assigned
@@ -82,6 +58,5 @@ NEED_MIGRATION = True
URL_LOG_SHORTEN_LENGTH = 30
SHOW_DEPRECATION_WARNINGS = True
SCREENSHOT_HEIGHT_TRESHOLD = 10000
PAGE_TIMEOUT = 60000
DOWNLOAD_PAGE_TIMEOUT = 60000
DEEP_CRAWL_BATCH_SIZE = 5
PAGE_TIMEOUT=60000
DOWNLOAD_PAGE_TIMEOUT=60000

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -15,53 +15,54 @@ import logging, time
import base64
from PIL import Image, ImageDraw, ImageFont
from io import BytesIO
from typing import Callable
from typing import List, Callable
import requests
import os
from pathlib import Path
from .utils import *
logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
logger.setLevel(logging.WARNING)
logger_driver = logging.getLogger("selenium.webdriver.common.service")
logger_driver = logging.getLogger('selenium.webdriver.common.service')
logger_driver.setLevel(logging.WARNING)
urllib3_logger = logging.getLogger("urllib3.connectionpool")
urllib3_logger = logging.getLogger('urllib3.connectionpool')
urllib3_logger.setLevel(logging.WARNING)
# Disable http.client logging
http_client_logger = logging.getLogger("http.client")
http_client_logger = logging.getLogger('http.client')
http_client_logger.setLevel(logging.WARNING)
# Disable driver_finder and service logging
driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
driver_finder_logger.setLevel(logging.WARNING)
class CrawlerStrategy(ABC):
@abstractmethod
def crawl(self, url: str, **kwargs) -> str:
pass
@abstractmethod
def take_screenshot(self, save_path: str):
pass
@abstractmethod
def update_user_agent(self, user_agent: str):
pass
@abstractmethod
def set_hook(self, hook_type: str, hook: Callable):
pass
class CloudCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html=False):
def __init__(self, use_cached_html = False):
super().__init__()
self.use_cached_html = use_cached_html
def crawl(self, url: str) -> str:
data = {
"urls": [url],
@@ -75,7 +76,6 @@ class CloudCrawlerStrategy(CrawlerStrategy):
html = response["results"][0]["html"]
return sanitize_input_encode(html)
class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
super().__init__()
@@ -87,25 +87,20 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if kwargs.get("user_agent"):
self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
else:
user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
)
user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
self.options.add_argument(f"--user-agent={user_agent}")
self.options.add_argument(
"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
self.options.headless = kwargs.get("headless", True)
if self.options.headless:
self.options.add_argument("--headless")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--window-size=1920,1080")
self.options.add_argument("--no-sandbox")
self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-blink-features=AutomationControlled")
self.options.add_argument("--disable-blink-features=AutomationControlled")
# self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-gpu")
# self.options.add_argument("--disable-extensions")
@@ -125,14 +120,14 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.use_cached_html = use_cached_html
self.js_code = js_code
self.verbose = kwargs.get("verbose", False)
# Hooks
self.hooks = {
"on_driver_created": None,
"on_user_agent_updated": None,
"before_get_url": None,
"after_get_url": None,
"before_return_html": None,
'on_driver_created': None,
'on_user_agent_updated': None,
'before_get_url': None,
'after_get_url': None,
'before_return_html': None
}
# chromedriver_autoinstaller.install()
@@ -142,28 +137,31 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
# chromedriver_path = chromedriver_autoinstaller.install()
# chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
# self.service = Service(chromedriver_autoinstaller.install())
# chromedriver_path = ChromeDriverManager().install()
# self.service = Service(chromedriver_path)
# self.service.log_path = "NUL"
# self.driver = webdriver.Chrome(service=self.service, options=self.options)
# Use selenium-manager (built into Selenium 4.10.0+)
self.service = Service()
self.driver = webdriver.Chrome(options=self.options)
self.driver = self.execute_hook("on_driver_created", self.driver)
self.driver = self.execute_hook('on_driver_created', self.driver)
if kwargs.get("cookies"):
for cookie in kwargs.get("cookies"):
self.driver.add_cookie(cookie)
def set_hook(self, hook_type: str, hook: Callable):
if hook_type in self.hooks:
self.hooks[hook_type] = hook
else:
raise ValueError(f"Invalid hook type: {hook_type}")
def execute_hook(self, hook_type: str, *args):
hook = self.hooks.get(hook_type)
if hook:
@@ -172,9 +170,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
if isinstance(result, webdriver.Chrome):
return result
else:
raise TypeError(
f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
)
raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
# If the hook returns None or there is no hook, return self.driver
return self.driver
@@ -182,77 +178,60 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
self.options.add_argument(f"user-agent={user_agent}")
self.driver.quit()
self.driver = webdriver.Chrome(service=self.service, options=self.options)
self.driver = self.execute_hook("on_user_agent_updated", self.driver)
self.driver = self.execute_hook('on_user_agent_updated', self.driver)
def set_custom_headers(self, headers: dict):
# Enable Network domain for sending headers
self.driver.execute_cdp_cmd("Network.enable", {})
self.driver.execute_cdp_cmd('Network.enable', {})
# Set extra HTTP headers
self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})
def _ensure_page_load(self, max_checks=6, check_interval=0.01):
def _ensure_page_load(self, max_checks=6, check_interval=0.01):
initial_length = len(self.driver.page_source)
for ix in range(max_checks):
# print(f"Checking page load: {ix}")
time.sleep(check_interval)
current_length = len(self.driver.page_source)
if current_length != initial_length:
break
return self.driver.page_source
def crawl(self, url: str, **kwargs) -> str:
# Create md5 hash of the URL
import hashlib
url_hash = hashlib.md5(url.encode()).hexdigest()
if self.use_cached_html:
cache_file_path = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
".crawl4ai",
"cache",
url_hash,
)
cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
if os.path.exists(cache_file_path):
with open(cache_file_path, "r") as f:
return sanitize_input_encode(f.read())
try:
self.driver = self.execute_hook("before_get_url", self.driver)
self.driver = self.execute_hook('before_get_url', self.driver)
if self.verbose:
print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
self.driver.get(url) # <html><head></head><body></body></html>
self.driver.get(url) #<html><head></head><body></body></html>
WebDriverWait(self.driver, 20).until(
lambda d: d.execute_script("return document.readyState") == "complete"
lambda d: d.execute_script('return document.readyState') == 'complete'
)
WebDriverWait(self.driver, 10).until(
EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
)
self.driver.execute_script(
"window.scrollTo(0, document.body.scrollHeight);"
)
self.driver = self.execute_hook("after_get_url", self.driver)
html = sanitize_input_encode(
self._ensure_page_load()
) # self.driver.page_source
can_not_be_done_headless = (
False # Look at my creativity for naming variables
)
self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
self.driver = self.execute_hook('after_get_url', self.driver)
html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source
can_not_be_done_headless = False # Look at my creativity for naming variables
# TODO: Very ugly approach, but promise to change it!
if (
kwargs.get("bypass_headless", False)
or html == "<html><head></head><body></body></html>"
):
print(
"[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
)
if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
can_not_be_done_headless = True
options = Options()
options.headless = False
@@ -260,31 +239,27 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
options.add_argument("--window-size=5,5")
driver = webdriver.Chrome(service=self.service, options=options)
driver.get(url)
self.driver = self.execute_hook("after_get_url", driver)
self.driver = self.execute_hook('after_get_url', driver)
html = sanitize_input_encode(driver.page_source)
driver.quit()
# Execute JS code if provided
self.js_code = kwargs.get("js_code", self.js_code)
if self.js_code and type(self.js_code) == str:
self.driver.execute_script(self.js_code)
# Optionally, wait for some condition after executing the JS code
WebDriverWait(self.driver, 10).until(
lambda driver: driver.execute_script("return document.readyState")
== "complete"
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
elif self.js_code and type(self.js_code) == list:
for js in self.js_code:
self.driver.execute_script(js)
WebDriverWait(self.driver, 10).until(
lambda driver: driver.execute_script(
"return document.readyState"
)
== "complete"
lambda driver: driver.execute_script("return document.readyState") == "complete"
)
# Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
wait_for = kwargs.get("wait_for", False)
wait_for = kwargs.get('wait_for', False)
if wait_for:
if callable(wait_for):
print("[LOG] 🔄 Waiting for condition...")
@@ -293,37 +268,32 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
print("[LOG] 🔄 Waiting for condition...")
WebDriverWait(self.driver, 20).until(
EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
)
)
if not can_not_be_done_headless:
html = sanitize_input_encode(self.driver.page_source)
self.driver = self.execute_hook("before_return_html", self.driver, html)
self.driver = self.execute_hook('before_return_html', self.driver, html)
# Store in cache
cache_file_path = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
".crawl4ai",
"cache",
url_hash,
)
cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
with open(cache_file_path, "w", encoding="utf-8") as f:
f.write(html)
if self.verbose:
print(f"[LOG] ✅ Crawled {url} successfully!")
return html
except InvalidArgumentException as e:
if not hasattr(e, "msg"):
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
except WebDriverException as e:
# If e does nlt have msg attribute create it and set it to str(e)
if not hasattr(e, "msg"):
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
except Exception as e:
if not hasattr(e, "msg"):
if not hasattr(e, 'msg'):
e.msg = sanitize_input_encode(str(e))
raise Exception(f"Failed to crawl {url}: {e.msg}")
@@ -331,9 +301,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
try:
# Get the dimensions of the page
total_width = self.driver.execute_script("return document.body.scrollWidth")
total_height = self.driver.execute_script(
"return document.body.scrollHeight"
)
total_height = self.driver.execute_script("return document.body.scrollHeight")
# Set the window size to the dimensions of the page
self.driver.set_window_size(total_width, total_height)
@@ -345,27 +313,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
image = Image.open(BytesIO(screenshot))
# Convert image to RGB mode (this will handle both RGB and RGBA images)
rgb_image = image.convert("RGB")
rgb_image = image.convert('RGB')
# Convert to JPEG and compress
buffered = BytesIO()
rgb_image.save(buffered, format="JPEG", quality=85)
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
if self.verbose:
print("[LOG] 📸 Screenshot taken and converted to base64")
print(f"[LOG] 📸 Screenshot taken and converted to base64")
return img_base64
except Exception as e:
error_message = sanitize_input_encode(
f"Failed to take screenshot: {str(e)}"
)
error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
print(error_message)
# Generate an image with black background
img = Image.new("RGB", (800, 600), color="black")
img = Image.new('RGB', (800, 600), color='black')
draw = ImageDraw.Draw(img)
# Load a font
try:
font = ImageFont.truetype("arial.ttf", 40)
@@ -379,16 +345,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
# Calculate text position
text_position = (10, 10)
# Draw the text on the image
draw.text(text_position, wrapped_text, fill=text_color, font=font)
# Convert to base64
buffered = BytesIO()
img.save(buffered, format="JPEG")
img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
return img_base64
def quit(self):
self.driver.quit()

View File

@@ -7,13 +7,11 @@ DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".cra
os.makedirs(DB_PATH, exist_ok=True)
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
def init_db():
global DB_PATH
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"""
cursor.execute('''
CREATE TABLE IF NOT EXISTS crawled_data (
url TEXT PRIMARY KEY,
html TEXT,
@@ -26,42 +24,31 @@ def init_db():
metadata TEXT DEFAULT "{}",
screenshot TEXT DEFAULT ""
)
"""
)
''')
conn.commit()
conn.close()
def alter_db_add_screenshot(new_column: str = "media"):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
)
cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
conn.commit()
conn.close()
except Exception as e:
print(f"Error altering database to add screenshot column: {e}")
def check_db_path():
if not DB_PATH:
raise ValueError("Database path is not set or is empty.")
def get_cached_url(
url: str,
) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
(url,),
)
cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
result = cursor.fetchone()
conn.close()
return result
@@ -69,25 +56,12 @@ def get_cached_url(
print(f"Error retrieving cached URL: {e}")
return None
def cache_url(
url: str,
html: str,
cleaned_html: str,
markdown: str,
extracted_content: str,
success: bool,
media: str = "{}",
links: str = "{}",
metadata: str = "{}",
screenshot: str = "",
):
def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media : str = "{}", links : str = "{}", metadata : str = "{}", screenshot: str = ""):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"""
cursor.execute('''
INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
@@ -100,32 +74,18 @@ def cache_url(
links = excluded.links,
metadata = excluded.metadata,
screenshot = excluded.screenshot
""",
(
url,
html,
cleaned_html,
markdown,
extracted_content,
success,
media,
links,
metadata,
screenshot,
),
)
''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
conn.commit()
conn.close()
except Exception as e:
print(f"Error caching URL: {e}")
def get_total_count() -> int:
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM crawled_data")
cursor.execute('SELECT COUNT(*) FROM crawled_data')
result = cursor.fetchone()
conn.close()
return result[0]
@@ -133,48 +93,43 @@ def get_total_count() -> int:
print(f"Error getting total count: {e}")
return 0
def clear_db():
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("DELETE FROM crawled_data")
cursor.execute('DELETE FROM crawled_data')
conn.commit()
conn.close()
except Exception as e:
print(f"Error clearing database: {e}")
def flush_db():
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute("DROP TABLE crawled_data")
cursor.execute('DROP TABLE crawled_data')
conn.commit()
conn.close()
except Exception as e:
print(f"Error flushing database: {e}")
def update_existing_records(new_column: str = "media", default_value: str = "{}"):
check_db_path()
try:
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
)
cursor.execute(f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL')
conn.commit()
conn.close()
except Exception as e:
print(f"Error updating existing records: {e}")
if __name__ == "__main__":
# Delete the existing database file
if os.path.exists(DB_PATH):
os.remove(DB_PATH)
init_db()
init_db()
# alter_db_add_screenshot("COL_NAME")

View File

@@ -1,29 +0,0 @@
from .bfs_deep_crawl_strategy import BFSDeepCrawlStrategy
from .filters import (
URLFilter,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
DomainFilter,
)
from .scorers import (
KeywordRelevanceScorer,
PathDepthScorer,
FreshnessScorer,
CompositeScorer,
)
from .deep_crawl_strategty import DeepCrawlStrategy
__all__ = [
"BFSDeepCrawlStrategy",
"FilterChain",
"URLFilter",
"URLPatternFilter",
"ContentTypeFilter",
"DomainFilter",
"KeywordRelevanceScorer",
"PathDepthScorer",
"FreshnessScorer",
"CompositeScorer",
"DeepCrawlStrategy",
]

View File

@@ -1,193 +0,0 @@
from typing import AsyncGenerator, Optional, Dict, Set, List
from datetime import datetime
import asyncio
import logging
from urllib.parse import urlparse
from ..models import CrawlResult, TraversalStats
from .filters import FilterChain
from .scorers import URLScorer
from .deep_crawl_strategty import DeepCrawlStrategy
from ..config import DEEP_CRAWL_BATCH_SIZE
class BFSDeepCrawlStrategy(DeepCrawlStrategy):
"""Best-First Search traversal strategy with filtering and scoring."""
def __init__(
self,
max_depth: int,
filter_chain: FilterChain,
url_scorer: URLScorer,
process_external_links: bool = False,
logger: Optional[logging.Logger] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.logger = logger or logging.getLogger(__name__)
# Crawl control
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self.process_external_links = process_external_links
async def can_process_url(self, url: str, depth: int) -> bool:
"""Check if URL can be processed based on filters
This is our gatekeeper method that determines if a URL should be processed. It:
- Validates URL format using a robust built-in method
- Applies custom filters from the filter chain
- Updates statistics for blocked URLs
- Returns False early if any check fails
"""
try:
result = urlparse(url)
if not all([result.scheme, result.netloc]):
raise ValueError("Invalid URL")
if result.scheme not in ("http", "https"):
raise ValueError("URL must be HTTP or HTTPS")
if not result.netloc or "." not in result.netloc:
raise ValueError("Invalid domain")
except Exception as e:
self.logger.warning(f"Invalid URL: {url}. Error: {str(e)}")
return False
# Apply the filter chain if it's not start page
if depth != 0 and not self.filter_chain.apply(url):
return False
return True
async def _process_links(
self,
result: CrawlResult,
source_url: str,
queue: asyncio.PriorityQueue,
visited: Set[str],
depths: Dict[str, int],
) -> List[str]:
"""Process extracted links from crawl result.
This is our link processor that:
Checks depth limits
Handles both internal and external links
Checks if URL is visited already
Checks if URL can be processed - validates URL, applies Filters with can_process_url
Scores URLs for priority
Updates depth tracking dictionary
Adds valid URLs to the queue
Updates maximum depth statistics
"""
next_depth = depths[source_url] + 1
# If depth limit reached, exit without processing links
if next_depth > self.max_depth:
return
links_to_process = result.links["internal"]
if self.process_external_links:
links_to_process += result.links["external"]
for link in links_to_process:
url = link["href"]
if url in visited:
continue
if not await self.can_process_url(url, next_depth):
self.stats.urls_skipped += 1
continue
score = self.url_scorer.score(url) if self.url_scorer else 0
await queue.put((score, next_depth, url, source_url))
depths[url] = next_depth
self.stats.total_depth_reached = max(
self.stats.total_depth_reached, next_depth
)
async def arun(
self,
start_url: str,
crawler: "AsyncWebCrawler",
crawler_run_config: Optional["CrawlerRunConfig"] = None,
) -> AsyncGenerator[CrawlResult, None]:
"""Implement BFS traversal strategy"""
# Initialize traversal state
"""
queue: A priority queue where items are tuples of (score, depth, url)
Score: Determines traversal priority (lower = higher priority)
Depth: Current distance from start_url
URL: The actual URL to crawl
visited: Keeps track of URLs we've already seen to avoid cycles
depths: Maps URLs to their depths from the start URL
active_crawls: Tracks currently running crawl tasks
"""
queue = asyncio.PriorityQueue()
await queue.put((0, 0, start_url, None))
visited: Set[str] = set()
depths = {start_url: 0}
active_crawls = {} # Track URLs currently being processed with depth and score
active_crawls_lock = (
asyncio.Lock()
) # Create the lock within the same event loop
try:
while (
not queue.empty() or active_crawls
) and not self._cancel_event.is_set():
"""
This sets up our main control loop which:
- Continues while there are URLs to process (not queue.empty())
- Or while there are active crawls still running (arun_many)
- Can be interrupted via cancellation (not self._cancel_event.is_set())
"""
# Collect batch of URLs into active_crawls to process
async with active_crawls_lock:
while (
len(active_crawls) < DEEP_CRAWL_BATCH_SIZE and not queue.empty()
):
score, depth, url, parent_url = await queue.get()
active_crawls[url] = {
"depth": depth,
"score": score,
"parent_url": parent_url,
}
self.stats.current_depth = depth
if not active_crawls:
# If no active crawls exist, wait a bit and continue
await asyncio.sleep(0.1)
continue
# Process batch
try:
# This is very important to ensure recursively you don't deep_crawl down the children.
if crawler_run_config:
crawler_run_config = crawler_run_config.clone(
deep_crawl_strategy=None, stream=True
)
async for result in await crawler.arun_many(
urls=list(active_crawls.keys()),
config=crawler_run_config
):
async with active_crawls_lock:
crawl_info = active_crawls.pop(result.url, None)
if crawl_info and result.success:
await self._process_links(
result, result.url, queue, visited, depths
)
result.depth = crawl_info["depth"]
result.score = crawl_info["score"]
result.parent_url = crawl_info["parent_url"]
yield result
else:
self.logger.warning(
f"Failed to crawl {result.url}: {result.error_message}"
)
except Exception as e:
self.logger.error(f"Batch processing error: {e}")
# Continue processing other batches
continue
except Exception as e:
self.logger.error(f"Error in crawl process: {e}")
raise
finally:
self.stats.end_time = datetime.now()
async def shutdown(self):
"""Clean up resources and stop crawling"""
self._cancel_event.set()

View File

@@ -1,30 +0,0 @@
from abc import ABC, abstractmethod
from typing import AsyncGenerator, Optional
from ..models import CrawlResult
class DeepCrawlStrategy(ABC):
@abstractmethod
async def arun(
self,
url: str,
crawler: "AsyncWebCrawler",
crawler_run_config: Optional["CrawlerRunConfig"] = None,
) -> AsyncGenerator[CrawlResult, None]:
"""Traverse the given URL using the specified crawler.
Args:
url (str): The starting URL for the traversal.
crawler (AsyncWebCrawler): The crawler instance to use for traversal.
crawler_run_config (CrawlerRunConfig, optional): The configuration for the crawler.
Returns:
AsyncGenerator[CrawlResult, None]: An async generator yielding crawl results.
"""
pass
@abstractmethod
async def shutdown(self):
"""Clean up resources used by the strategy"""
pass

View File

@@ -1,868 +0,0 @@
from abc import ABC, abstractmethod
from typing import List, Pattern, Set, Union, FrozenSet
import re, time
from urllib.parse import urlparse
from array import array
import logging
from functools import lru_cache
import fnmatch
from dataclasses import dataclass
from typing import ClassVar
import weakref
import mimetypes
@dataclass
class FilterStats:
# PERF: Using dataclass creates overhead with __init__ and property access
# PERF: Could use __slots__ to reduce memory footprint
# PERF: Consider using array.array('I') for atomic increments
total_urls: int = 0
rejected_urls: int = 0
passed_urls: int = 0
class URLFilter(ABC):
# PERF: Logger creation is expensive, consider lazy initialization
# PERF: stats object creation adds overhead for each filter instance
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FilterStats()
self.logger = logging.getLogger(f"urlfilter.{self.name}")
@abstractmethod
def apply(self, url: str) -> bool:
pass
def _update_stats(self, passed: bool):
# PERF: Already optimized but could use bitwise operations
# PERF: Consider removing stats entirely in production/fast mode
self.stats.total_urls += 1
self.stats.passed_urls += passed
self.stats.rejected_urls += not passed
class FilterChain:
# PERF: List traversal for each URL is expensive
# PERF: Could use array.array instead of list for filters
# PERF: Consider adding fast path for single filter case
def __init__(self, filters: List[URLFilter] = None):
self.filters = filters or []
self.stats = FilterStats()
self.logger = logging.getLogger("urlfilter.chain")
def apply(self, url: str) -> bool:
# PERF: Logging on every rejection is expensive
# PERF: Could reorder filters by rejection rate
# PERF: Consider batch processing mode
self.stats.total_urls += 1
for filter_ in self.filters:
if not filter_.apply(url):
self.stats.rejected_urls += 1
self.logger.debug(f"URL {url} rejected by {filter_.name}")
return False
self.stats.passed_urls += 1
return True
class URLPatternFilter(URLFilter):
# PERF: Converting glob to regex is expensive
# PERF: Multiple regex compilation is slow
# PERF: List of patterns causes multiple regex evaluations
def __init__(
self,
patterns: Union[str, Pattern, List[Union[str, Pattern]]],
use_glob: bool = True,
):
super().__init__()
self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self.use_glob = use_glob
self._compiled_patterns = []
# PERF: This could be consolidated into a single regex with OR conditions
# PERF: glob_to_regex creates complex patterns, could be simplified
for pattern in self.patterns:
if isinstance(pattern, str) and use_glob:
self._compiled_patterns.append(self._glob_to_regex(pattern))
else:
self._compiled_patterns.append(
re.compile(pattern) if isinstance(pattern, str) else pattern
)
def _glob_to_regex(self, pattern: str) -> Pattern:
# PERF: fnmatch.translate creates overly complex patterns
# PERF: Could cache common translations
return re.compile(fnmatch.translate(pattern))
def apply(self, url: str) -> bool:
# PERF: any() with generator is slower than direct loop with early return
# PERF: searching entire string is slower than anchored match
matches = any(pattern.search(url) for pattern in self._compiled_patterns)
self._update_stats(matches)
return matches
class ContentTypeFilter(URLFilter):
# PERF: mimetypes guessing is extremely slow
# PERF: URL parsing on every check is expensive
# PERF: No caching of results for similar extensions
def __init__(
self, allowed_types: Union[str, List[str]], check_extension: bool = True
):
super().__init__()
self.allowed_types = (
[allowed_types] if isinstance(allowed_types, str) else allowed_types
)
self.check_extension = check_extension
self._normalize_types()
def _normalize_types(self):
"""Normalize content type strings"""
self.allowed_types = [t.lower() for t in self.allowed_types]
def _check_extension(self, url: str) -> bool:
# PERF: urlparse is called on every check
# PERF: multiple string splits are expensive
# PERF: mimetypes.guess_type is very slow
ext = (
urlparse(url).path.split(".")[-1].lower()
if "." in urlparse(url).path
else ""
)
if not ext:
return True
# PERF: guess_type is main bottleneck
guessed_type = mimetypes.guess_type(url)[0]
return any(
allowed in (guessed_type or "").lower() for allowed in self.allowed_types
)
def apply(self, url: str) -> bool:
"""Check if URL's content type is allowed"""
result = True
if self.check_extension:
result = self._check_extension(url)
self._update_stats(result)
return result
class DomainFilter(URLFilter):
# PERF: Set lookups are fast but string normalizations on init are not
# PERF: Creating two sets doubles memory usage
def __init__(
self,
allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None,
):
super().__init__()
# PERF: Normalizing domains on every init is wasteful
# PERF: Could use frozenset for immutable lists
self.allowed_domains = (
set(self._normalize_domains(allowed_domains)) if allowed_domains else None
)
self.blocked_domains = (
set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
)
def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
# PERF: strip() and lower() create new strings for each domain
# PERF: List comprehension creates intermediate list
if isinstance(domains, str):
domains = [domains]
return [d.lower().strip() for d in domains]
def _extract_domain(self, url: str) -> str:
# PERF: urlparse is called for every URL check
# PERF: lower() creates new string every time
# PERF: Could cache recent results
return urlparse(url).netloc.lower()
def apply(self, url: str) -> bool:
# PERF: Two separate set lookups in worst case
# PERF: Domain extraction happens before knowing if we have any filters
domain = self._extract_domain(url)
if domain in self.blocked_domains:
self._update_stats(False)
return False
if self.allowed_domains is not None and domain not in self.allowed_domains:
self._update_stats(False)
return False
self._update_stats(True)
return True
# Example usage:
def create_common_filter_chain() -> FilterChain:
"""Create a commonly used filter chain"""
return FilterChain(
[
URLPatternFilter(
[
"*.html",
"*.htm", # HTML files
"*/article/*",
"*/blog/*", # Common content paths
]
),
ContentTypeFilter(["text/html", "application/xhtml+xml"]),
DomainFilter(blocked_domains=["ads.*", "analytics.*"]),
]
)
####################################################################################
# Uncledoe: Optimized Version
####################################################################################
# Use __slots__ and array for maximum memory/speed efficiency
class FastFilterStats:
__slots__ = ("_counters",)
def __init__(self):
# Use array of unsigned ints for atomic operations
self._counters = array("I", [0, 0, 0]) # total, passed, rejected
@property
def total_urls(self):
return self._counters[0]
@property
def passed_urls(self):
return self._counters[1]
@property
def rejected_urls(self):
return self._counters[2]
class FastURLFilter(ABC):
"""Optimized base filter class"""
__slots__ = ("name", "stats", "_logger_ref")
def __init__(self, name: str = None):
self.name = name or self.__class__.__name__
self.stats = FastFilterStats()
# Lazy logger initialization using weakref
self._logger_ref = None
@property
def logger(self):
if self._logger_ref is None or self._logger_ref() is None:
logger = logging.getLogger(f"urlfilter.{self.name}")
self._logger_ref = weakref.ref(logger)
return self._logger_ref()
@abstractmethod
def apply(self, url: str) -> bool:
pass
def _update_stats(self, passed: bool):
# Use direct array index for speed
self.stats._counters[0] += 1 # total
self.stats._counters[1] += passed # passed
self.stats._counters[2] += not passed # rejected
class FastFilterChain:
"""Optimized filter chain"""
__slots__ = ("filters", "stats", "_logger_ref")
def __init__(self, filters: List[FastURLFilter] = None):
self.filters = tuple(filters or []) # Immutable tuple for speed
self.stats = FastFilterStats()
self._logger_ref = None
@property
def logger(self):
if self._logger_ref is None or self._logger_ref() is None:
logger = logging.getLogger("urlfilter.chain")
self._logger_ref = weakref.ref(logger)
return self._logger_ref()
def add_filter(self, filter_: FastURLFilter) -> "FastFilterChain":
"""Add a filter to the chain"""
self.filters.append(filter_)
return self # Enable method chaining
def apply(self, url: str) -> bool:
"""Optimized apply with minimal operations"""
self.stats._counters[0] += 1 # total
# Direct tuple iteration is faster than list
for f in self.filters:
if not f.apply(url):
self.stats._counters[2] += 1 # rejected
return False
self.stats._counters[1] += 1 # passed
return True
class FastURLPatternFilter(FastURLFilter):
"""Pattern filter balancing speed and completeness"""
__slots__ = ('_simple_suffixes', '_simple_prefixes', '_domain_patterns', '_path_patterns')
PATTERN_TYPES = {
'SUFFIX': 1, # *.html
'PREFIX': 2, # /foo/*
'DOMAIN': 3, # *.example.com
'PATH': 4 , # Everything else
'REGEX': 5
}
def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], use_glob: bool = True):
super().__init__()
patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
self._simple_suffixes = set()
self._simple_prefixes = set()
self._domain_patterns = []
self._path_patterns = []
for pattern in patterns:
pattern_type = self._categorize_pattern(pattern)
self._add_pattern(pattern, pattern_type)
def _categorize_pattern(self, pattern: str) -> int:
"""Categorize pattern for specialized handling"""
if not isinstance(pattern, str):
return self.PATTERN_TYPES['PATH']
# Check if it's a regex pattern
if pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern:
return self.PATTERN_TYPES['REGEX']
if pattern.count('*') == 1:
if pattern.startswith('*.'):
return self.PATTERN_TYPES['SUFFIX']
if pattern.endswith('/*'):
return self.PATTERN_TYPES['PREFIX']
if '://' in pattern and pattern.startswith('*.'):
return self.PATTERN_TYPES['DOMAIN']
return self.PATTERN_TYPES['PATH']
def _add_pattern(self, pattern: str, pattern_type: int):
"""Add pattern to appropriate matcher"""
if pattern_type == self.PATTERN_TYPES['REGEX']:
# For regex patterns, compile directly without glob translation
if isinstance(pattern, str) and (pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern):
self._path_patterns.append(re.compile(pattern))
return
elif pattern_type == self.PATTERN_TYPES['SUFFIX']:
self._simple_suffixes.add(pattern[2:])
elif pattern_type == self.PATTERN_TYPES['PREFIX']:
self._simple_prefixes.add(pattern[:-2])
elif pattern_type == self.PATTERN_TYPES['DOMAIN']:
self._domain_patterns.append(
re.compile(pattern.replace('*.', r'[^/]+\.'))
)
else:
if isinstance(pattern, str):
# Handle complex glob patterns
if '**' in pattern:
pattern = pattern.replace('**', '.*')
if '{' in pattern:
# Convert {a,b} to (a|b)
pattern = re.sub(r'\{([^}]+)\}',
lambda m: f'({"|".join(m.group(1).split(","))})',
pattern)
pattern = fnmatch.translate(pattern)
self._path_patterns.append(
pattern if isinstance(pattern, Pattern) else re.compile(pattern)
)
@lru_cache(maxsize=10000)
def apply(self, url: str) -> bool:
"""Hierarchical pattern matching"""
# Quick suffix check (*.html)
if self._simple_suffixes:
path = url.split('?')[0]
if path.split('/')[-1].split('.')[-1] in self._simple_suffixes:
self._update_stats(True)
return True
# Domain check
if self._domain_patterns:
for pattern in self._domain_patterns:
if pattern.match(url):
self._update_stats(True)
return True
# Prefix check (/foo/*)
if self._simple_prefixes:
path = url.split('?')[0]
if any(path.startswith(p) for p in self._simple_prefixes):
self._update_stats(True)
return True
# Complex patterns
if self._path_patterns:
if any(p.search(url) for p in self._path_patterns):
self._update_stats(True)
return True
self._update_stats(False)
return False
class FastContentTypeFilter(FastURLFilter):
"""Optimized content type filter using fast lookups"""
__slots__ = ("allowed_types", "_ext_map", "_check_extension")
# Fast extension to mime type mapping
_MIME_MAP = {
# Text Formats
"txt": "text/plain",
"html": "text/html",
"htm": "text/html",
"xhtml": "application/xhtml+xml",
"css": "text/css",
"csv": "text/csv",
"ics": "text/calendar",
"js": "application/javascript",
# Images
"bmp": "image/bmp",
"gif": "image/gif",
"jpeg": "image/jpeg",
"jpg": "image/jpeg",
"png": "image/png",
"svg": "image/svg+xml",
"tiff": "image/tiff",
"ico": "image/x-icon",
"webp": "image/webp",
# Audio
"mp3": "audio/mpeg",
"wav": "audio/wav",
"ogg": "audio/ogg",
"m4a": "audio/mp4",
"aac": "audio/aac",
# Video
"mp4": "video/mp4",
"mpeg": "video/mpeg",
"webm": "video/webm",
"avi": "video/x-msvideo",
"mov": "video/quicktime",
"flv": "video/x-flv",
"wmv": "video/x-ms-wmv",
"mkv": "video/x-matroska",
# Applications
"json": "application/json",
"xml": "application/xml",
"pdf": "application/pdf",
"zip": "application/zip",
"gz": "application/gzip",
"tar": "application/x-tar",
"rar": "application/vnd.rar",
"7z": "application/x-7z-compressed",
"exe": "application/vnd.microsoft.portable-executable",
"msi": "application/x-msdownload",
# Fonts
"woff": "font/woff",
"woff2": "font/woff2",
"ttf": "font/ttf",
"otf": "font/otf",
# Microsoft Office
"doc": "application/msword",
"dot": "application/msword",
"docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"xls": "application/vnd.ms-excel",
"ppt": "application/vnd.ms-powerpoint",
"pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
# OpenDocument Formats
"odt": "application/vnd.oasis.opendocument.text",
"ods": "application/vnd.oasis.opendocument.spreadsheet",
"odp": "application/vnd.oasis.opendocument.presentation",
# Archives
"tar.gz": "application/gzip",
"tgz": "application/gzip",
"bz2": "application/x-bzip2",
# Others
"rtf": "application/rtf",
"apk": "application/vnd.android.package-archive",
"epub": "application/epub+zip",
"jar": "application/java-archive",
"swf": "application/x-shockwave-flash",
"midi": "audio/midi",
"mid": "audio/midi",
"ps": "application/postscript",
"ai": "application/postscript",
"eps": "application/postscript",
# Custom or less common
"bin": "application/octet-stream",
"dmg": "application/x-apple-diskimage",
"iso": "application/x-iso9660-image",
"deb": "application/x-debian-package",
"rpm": "application/x-rpm",
"sqlite": "application/vnd.sqlite3",
# Placeholder
"unknown": "application/octet-stream", # Fallback for unknown file types
}
@staticmethod
@lru_cache(maxsize=1000)
def _extract_extension(path: str) -> str:
"""Fast extension extraction with caching"""
if "." not in path:
return ""
return path.rpartition(".")[-1].lower()
def __init__(
self, allowed_types: Union[str, List[str]], check_extension: bool = True
):
super().__init__()
# Normalize and store as frozenset for fast lookup
self.allowed_types = frozenset(
t.lower()
for t in (
allowed_types if isinstance(allowed_types, list) else [allowed_types]
)
)
self._check_extension = check_extension
# Pre-compute extension map for allowed types
self._ext_map = frozenset(
ext
for ext, mime in self._MIME_MAP.items()
if any(allowed in mime for allowed in self.allowed_types)
)
@lru_cache(maxsize=1000)
def _check_url_cached(self, url: str) -> bool:
"""Cached URL checking"""
if not self._check_extension:
return True
path = url.split("?")[0] # Fast path split
ext = self._extract_extension(path)
if not ext:
return True
return ext in self._ext_map
def apply(self, url: str) -> bool:
"""Fast extension check with caching"""
result = self._check_url_cached(url)
self._update_stats(result)
return result
class FastDomainFilter(FastURLFilter):
"""Optimized domain filter with fast lookups and caching"""
__slots__ = ("_allowed_domains", "_blocked_domains", "_domain_cache")
# Regex for fast domain extraction
_DOMAIN_REGEX = re.compile(r"://([^/]+)")
def __init__(
self,
allowed_domains: Union[str, List[str]] = None,
blocked_domains: Union[str, List[str]] = None,
):
super().__init__()
# Convert inputs to frozensets for immutable, fast lookups
self._allowed_domains = (
frozenset(self._normalize_domains(allowed_domains))
if allowed_domains
else None
)
self._blocked_domains = (
frozenset(self._normalize_domains(blocked_domains))
if blocked_domains
else frozenset()
)
@staticmethod
def _normalize_domains(domains: Union[str, List[str]]) -> Set[str]:
"""Fast domain normalization"""
if isinstance(domains, str):
return {domains.lower()}
return {d.lower() for d in domains}
@staticmethod
@lru_cache(maxsize=10000)
def _extract_domain(url: str) -> str:
"""Ultra-fast domain extraction with regex and caching"""
match = FastDomainFilter._DOMAIN_REGEX.search(url)
return match.group(1).lower() if match else ""
def apply(self, url: str) -> bool:
"""Optimized domain checking with early returns"""
# Skip processing if no filters
if not self._blocked_domains and self._allowed_domains is None:
self._update_stats(True)
return True
domain = self._extract_domain(url)
# Early return for blocked domains
if domain in self._blocked_domains:
self._update_stats(False)
return False
# If no allowed domains specified, accept all non-blocked
if self._allowed_domains is None:
self._update_stats(True)
return True
# Final allowed domains check
result = domain in self._allowed_domains
self._update_stats(result)
return result
def create_fast_filter_chain() -> FastFilterChain:
"""Create an optimized filter chain with filters ordered by rejection rate"""
return FastFilterChain(
[
# Domain filter first (fastest rejection)
FastDomainFilter(blocked_domains=["ads.*", "analytics.*"]),
# Content filter second (medium speed)
FastContentTypeFilter(["text/html", "application/xhtml+xml"]),
# Pattern filter last (most expensive)
FastURLPatternFilter(
[
"*.html",
"*.htm",
"*/article/*",
"*/blog/*",
]
),
]
)
def run_performance_test():
import time
import random
from itertools import cycle
# Generate test URLs
base_urls = [
"https://example.com/article/123",
"https://blog.example.com/post/456",
"https://ads.example.com/tracking",
"https://example.com/about.html",
"https://analytics.example.com/script.js",
"https://example.com/products.php",
"https://subdomain.example.com/blog/post-123",
"https://example.com/path/file.pdf",
]
# Create more varied test data
test_urls = []
for base in base_urls:
# Add original
test_urls.append(base)
# Add variations
parts = base.split("/")
for i in range(10):
parts[-1] = f"page_{i}.html"
test_urls.append("/".join(parts))
# Multiply to get enough test data
test_urls = test_urls * 10000 # Creates ~800k URLs
def benchmark(name: str, func, *args, warmup=True):
if warmup:
# Warmup run
func(*args)
# Actual timing
start = time.perf_counter_ns()
result = func(*args)
elapsed = (time.perf_counter_ns() - start) / 1_000_000 # Convert to ms
print(
f"{name:<30} {elapsed:>8.3f} ms ({len(test_urls)/elapsed*1000:,.0f} URLs/sec)"
)
return result
print("\nBenchmarking original vs optimized implementations...")
print("-" * 70)
# Original implementation
pattern_filter = URLPatternFilter(["*.html", "*/article/*"])
content_filter = ContentTypeFilter(["text/html"])
domain_filter = DomainFilter(blocked_domains=["ads.*", "analytics.*"])
chain = FilterChain([pattern_filter, content_filter, domain_filter])
# Optimized implementation
fast_pattern_filter = FastURLPatternFilter(["*.html", "*/article/*"])
fast_content_filter = FastContentTypeFilter(["text/html"])
fast_domain_filter = FastDomainFilter(blocked_domains=["ads.*", "analytics.*"])
fast_chain = FastFilterChain(
[fast_domain_filter, fast_content_filter, fast_pattern_filter]
)
# Test individual filters
print("\nSingle filter performance (first 1000 URLs):")
test_subset = test_urls[:1000]
print("\nPattern Filters:")
benchmark(
"Original Pattern Filter",
lambda: [pattern_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Pattern Filter",
lambda: [fast_pattern_filter.apply(url) for url in test_subset],
)
print("\nContent Filters:")
benchmark(
"Original Content Filter",
lambda: [content_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Content Filter",
lambda: [fast_content_filter.apply(url) for url in test_subset],
)
print("\nDomain Filters:")
benchmark(
"Original Domain Filter",
lambda: [domain_filter.apply(url) for url in test_subset],
)
benchmark(
"Optimized Domain Filter",
lambda: [fast_domain_filter.apply(url) for url in test_subset],
)
print("\nFull Chain Performance (all URLs):")
# Test chain
benchmark("Original Chain", lambda: [chain.apply(url) for url in test_urls])
benchmark("Optimized Chain", lambda: [fast_chain.apply(url) for url in test_urls])
# Memory usage
import sys
print("\nMemory Usage per Filter:")
print(f"Original Pattern Filter: {sys.getsizeof(pattern_filter):,} bytes")
print(f"Optimized Pattern Filter: {sys.getsizeof(fast_pattern_filter):,} bytes")
print(f"Original Content Filter: {sys.getsizeof(content_filter):,} bytes")
print(f"Optimized Content Filter: {sys.getsizeof(fast_content_filter):,} bytes")
print(f"Original Domain Filter: {sys.getsizeof(domain_filter):,} bytes")
print(f"Optimized Domain Filter: {sys.getsizeof(fast_domain_filter):,} bytes")
def test_pattern_filter():
import time
from itertools import chain
# Test cases as list of tuples instead of dict for multiple patterns
test_cases = [
# Simple suffix patterns (*.html)
("*.html", {
"https://example.com/page.html": True,
"https://example.com/path/doc.html": True,
"https://example.com/page.htm": False,
"https://example.com/page.html?param=1": True,
}),
# Path prefix patterns (/foo/*)
("*/article/*", {
"https://example.com/article/123": True,
"https://example.com/blog/article/456": True,
"https://example.com/articles/789": False,
"https://example.com/article": False,
}),
# Complex patterns
("blog-*-[0-9]", {
"https://example.com/blog-post-1": True,
"https://example.com/blog-test-9": True,
"https://example.com/blog-post": False,
"https://example.com/blog-post-x": False,
}),
# Multiple patterns case
(["*.pdf", "*/download/*"], {
"https://example.com/doc.pdf": True,
"https://example.com/download/file.txt": True,
"https://example.com/path/download/doc": True,
"https://example.com/uploads/file.txt": False,
}),
# Edge cases
("*", {
"https://example.com": True,
"": True,
"http://test.com/path": True,
}),
# Complex regex
(r"^https?://.*\.example\.com/\d+", {
"https://sub.example.com/123": True,
"http://test.example.com/456": True,
"https://example.com/789": False,
"https://sub.example.com/abc": False,
})
]
def run_accuracy_test():
print("\nAccuracy Tests:")
print("-" * 50)
all_passed = True
for patterns, test_urls in test_cases:
filter_obj = FastURLPatternFilter(patterns)
for url, expected in test_urls.items():
result = filter_obj.apply(url)
if result != expected:
print(f"❌ Failed: Pattern '{patterns}' with URL '{url}'")
print(f" Expected: {expected}, Got: {result}")
all_passed = False
else:
print(f"✅ Passed: Pattern '{patterns}' with URL '{url}'")
return all_passed
def run_speed_test():
print("\nSpeed Tests:")
print("-" * 50)
# Create a large set of test URLs
all_urls = list(chain.from_iterable(urls.keys() for _, urls in test_cases))
test_urls = all_urls * 10000 # 100K+ URLs
# Test both implementations
original = URLPatternFilter(["*.html", "*/article/*", "blog-*"])
optimized = FastURLPatternFilter(["*.html", "*/article/*", "blog-*"])
def benchmark(name, filter_obj):
start = time.perf_counter()
for url in test_urls:
filter_obj.apply(url)
elapsed = time.perf_counter() - start
urls_per_sec = len(test_urls) / elapsed
print(f"{name:<20} {elapsed:.3f}s ({urls_per_sec:,.0f} URLs/sec)")
benchmark("Original Filter:", original)
benchmark("Optimized Filter:", optimized)
# Run tests
print("Running Pattern Filter Tests...")
accuracy_passed = run_accuracy_test()
if accuracy_passed:
print("\n✨ All accuracy tests passed!")
run_speed_test()
else:
print("\n❌ Some accuracy tests failed!")
if __name__ == "__main__":
run_performance_test()
# test_pattern_filter()

File diff suppressed because it is too large Load Diff

View File

@@ -1,75 +0,0 @@
import requests
import shutil
from pathlib import Path
from crawl4ai.async_logger import AsyncLogger
from crawl4ai.llmtxt import AsyncLLMTextManager
class DocsManager:
def __init__(self, logger=None):
self.docs_dir = Path.home() / ".crawl4ai" / "docs"
self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
self.docs_dir.mkdir(parents=True, exist_ok=True)
self.logger = logger or AsyncLogger(verbose=True)
self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
async def ensure_docs_exist(self):
"""Fetch docs if not present"""
if not any(self.docs_dir.iterdir()):
await self.fetch_docs()
async def fetch_docs(self) -> bool:
"""Copy from local docs or download from GitHub"""
try:
# Try local first
if self.local_docs.exists() and (
any(self.local_docs.glob("*.md"))
or any(self.local_docs.glob("*.tokens"))
):
# Empty the local docs directory
for file_path in self.docs_dir.glob("*.md"):
file_path.unlink()
# for file_path in self.docs_dir.glob("*.tokens"):
# file_path.unlink()
for file_path in self.local_docs.glob("*.md"):
shutil.copy2(file_path, self.docs_dir / file_path.name)
# for file_path in self.local_docs.glob("*.tokens"):
# shutil.copy2(file_path, self.docs_dir / file_path.name)
return True
# Fallback to GitHub
response = requests.get(
"https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
headers={"Accept": "application/vnd.github.v3+json"},
)
response.raise_for_status()
for item in response.json():
if item["type"] == "file" and item["name"].endswith(".md"):
content = requests.get(item["download_url"]).text
with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
f.write(content)
return True
except Exception as e:
self.logger.error(f"Failed to fetch docs: {str(e)}")
raise
def list(self) -> list[str]:
"""List available topics"""
names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
# Remove [0-9]+_ prefix
names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
# Exclude those end with .xs.md and .q.md
names = [
name
for name in names
if not name.endswith(".xs") and not name.endswith(".q")
]
return names
def generate(self, sections, mode="extended"):
return self.llm_text.generate(sections, mode)
def search(self, query: str, top_k: int = 5):
return self.llm_text.search(query, top_k)

File diff suppressed because it is too large Load Diff

View File

@@ -54,13 +54,13 @@ class HTML2Text(html.parser.HTMLParser):
self.td_count = 0
self.table_start = False
self.unicode_snob = config.UNICODE_SNOB # covered in cli
self.escape_snob = config.ESCAPE_SNOB # covered in cli
self.escape_backslash = config.ESCAPE_BACKSLASH # covered in cli
self.escape_dot = config.ESCAPE_DOT # covered in cli
self.escape_plus = config.ESCAPE_PLUS # covered in cli
self.escape_dash = config.ESCAPE_DASH # covered in cli
self.links_each_paragraph = config.LINKS_EACH_PARAGRAPH
self.body_width = bodywidth # covered in cli
self.skip_internal_links = config.SKIP_INTERNAL_LINKS # covered in cli
@@ -144,8 +144,8 @@ class HTML2Text(html.parser.HTMLParser):
def update_params(self, **kwargs):
for key, value in kwargs.items():
setattr(self, key, value)
setattr(self, key, value)
def feed(self, data: str) -> None:
data = data.replace("</' + 'script>", "</ignore>")
super().feed(data)
@@ -903,13 +903,7 @@ class HTML2Text(html.parser.HTMLParser):
self.empty_link = False
if not self.code and not self.pre and not entity_char:
data = escape_md_section(
data,
snob=self.escape_snob,
escape_dot=self.escape_dot,
escape_plus=self.escape_plus,
escape_dash=self.escape_dash,
)
data = escape_md_section(data, snob=self.escape_snob, escape_dot=self.escape_dot, escape_plus=self.escape_plus, escape_dash=self.escape_dash)
self.preceding_data = data
self.o(data, puredata=True)
@@ -1012,7 +1006,6 @@ class HTML2Text(html.parser.HTMLParser):
newlines += 1
return result
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
if bodywidth is None:
bodywidth = config.BODY_WIDTH
@@ -1020,7 +1013,6 @@ def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) ->
return h.handle(html)
class CustomHTML2Text(HTML2Text):
def __init__(self, *args, handle_code_in_pre=False, **kwargs):
super().__init__(*args, **kwargs)
@@ -1030,8 +1022,8 @@ class CustomHTML2Text(HTML2Text):
self.current_preserved_tag = None
self.preserved_content = []
self.preserve_depth = 0
self.handle_code_in_pre = handle_code_in_pre
self.handle_code_in_pre = handle_code_in_pre
# Configuration options
self.skip_internal_links = False
self.single_line_break = False
@@ -1049,9 +1041,9 @@ class CustomHTML2Text(HTML2Text):
def update_params(self, **kwargs):
"""Update parameters and set preserved tags."""
for key, value in kwargs.items():
if key == "preserve_tags":
if key == 'preserve_tags':
self.preserve_tags = set(value)
elif key == "handle_code_in_pre":
elif key == 'handle_code_in_pre':
self.handle_code_in_pre = value
else:
setattr(self, key, value)
@@ -1064,19 +1056,17 @@ class CustomHTML2Text(HTML2Text):
self.current_preserved_tag = tag
self.preserved_content = []
# Format opening tag with attributes
attr_str = "".join(
f' {k}="{v}"' for k, v in attrs.items() if v is not None
)
self.preserved_content.append(f"<{tag}{attr_str}>")
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
self.preserve_depth += 1
return
else:
self.preserve_depth -= 1
if self.preserve_depth == 0:
self.preserved_content.append(f"</{tag}>")
self.preserved_content.append(f'</{tag}>')
# Output the preserved HTML block with proper spacing
preserved_html = "".join(self.preserved_content)
self.o("\n" + preserved_html + "\n")
preserved_html = ''.join(self.preserved_content)
self.o('\n' + preserved_html + '\n')
self.current_preserved_tag = None
return
@@ -1084,31 +1074,29 @@ class CustomHTML2Text(HTML2Text):
if self.preserve_depth > 0:
if start:
# Format nested tags with attributes
attr_str = "".join(
f' {k}="{v}"' for k, v in attrs.items() if v is not None
)
self.preserved_content.append(f"<{tag}{attr_str}>")
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
self.preserved_content.append(f'<{tag}{attr_str}>')
else:
self.preserved_content.append(f"</{tag}>")
self.preserved_content.append(f'</{tag}>')
return
# Handle pre tags
if tag == "pre":
if tag == 'pre':
if start:
self.o("```\n") # Markdown code block start
self.o('```\n') # Markdown code block start
self.inside_pre = True
else:
self.o("\n```\n") # Markdown code block end
self.o('\n```\n') # Markdown code block end
self.inside_pre = False
elif tag == "code":
elif tag == 'code':
if self.inside_pre and not self.handle_code_in_pre:
# Ignore code tags inside pre blocks if handle_code_in_pre is False
return
if start:
self.o("`") # Markdown inline code start
self.o('`') # Markdown inline code start
self.inside_code = True
else:
self.o("`") # Markdown inline code end
self.o('`') # Markdown inline code end
self.inside_code = False
else:
super().handle_tag(tag, attrs, start)
@@ -1125,12 +1113,13 @@ class CustomHTML2Text(HTML2Text):
return
if self.inside_code:
# Inline code: no newlines allowed
self.o(data.replace("\n", " "))
self.o(data.replace('\n', ' '))
return
# Default behavior for other tags
super().handle_data(data, entity_char)
# # Handle pre tags
# if tag == 'pre':
# if start:

View File

@@ -1,3 +1,2 @@
class OutCallback:
def __call__(self, s: str) -> None:
...
def __call__(self, s: str) -> None: ...

View File

@@ -210,7 +210,7 @@ def escape_md_section(
snob: bool = False,
escape_dot: bool = True,
escape_plus: bool = True,
escape_dash: bool = True,
escape_dash: bool = True
) -> str:
"""
Escapes markdown-sensitive characters across whole document sections.
@@ -233,7 +233,6 @@ def escape_md_section(
return text
def reformat_table(lines: List[str], right_margin: int) -> List[str]:
"""
Given the lines of a table

View File

@@ -6,45 +6,29 @@ from .async_logger import AsyncLogger, LogLevel
# Initialize logger
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
def post_install():
"""Run all post-installation tasks"""
logger.info("Running post-installation setup...", tag="INIT")
install_playwright()
run_migration()
logger.success("Post-installation setup completed!", tag="COMPLETE")
def install_playwright():
logger.info("Installing Playwright browsers...", tag="INIT")
try:
# subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
subprocess.check_call(
[
sys.executable,
"-m",
"playwright",
"install",
"--with-deps",
"--force",
"chromium",
]
)
logger.success(
"Playwright installation completed successfully.", tag="COMPLETE"
)
except subprocess.CalledProcessError:
# logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
subprocess.check_call([sys.executable, "-m", "playwright", "install"])
logger.success("Playwright installation completed successfully.", tag="COMPLETE")
except subprocess.CalledProcessError as e:
logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
logger.warning(
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
"Please run 'python -m playwright install' manually after the installation."
)
except Exception:
# logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
except Exception as e:
logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
logger.warning(
f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
"Please run 'python -m playwright install' manually after the installation."
)
def run_migration():
"""Initialize database during installation"""
try:
@@ -52,58 +36,9 @@ def run_migration():
from crawl4ai.async_database import async_db_manager
asyncio.run(async_db_manager.initialize())
logger.success(
"Database initialization completed successfully.", tag="COMPLETE"
)
logger.success("Database initialization completed successfully.", tag="COMPLETE")
except ImportError:
logger.warning("Database module not found. Will initialize on first use.")
except Exception as e:
logger.warning(f"Database initialization failed: {e}")
logger.warning("Database will be initialized on first use")
async def run_doctor():
"""Test if Crawl4AI is working properly"""
logger.info("Running Crawl4AI health check...", tag="INIT")
try:
from .async_webcrawler import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
)
browser_config = BrowserConfig(
headless=True,
browser_type="chromium",
ignore_https_errors=True,
light_mode=True,
viewport_width=1280,
viewport_height=720,
)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
logger.info("Testing crawling capabilities...", tag="TEST")
result = await crawler.arun(url="https://crawl4ai.com", config=run_config)
if result and result.markdown:
logger.success("✅ Crawling test passed!", tag="COMPLETE")
return True
else:
raise Exception("Failed to get content")
except Exception as e:
logger.error(f"❌ Test failed: {e}", tag="ERROR")
return False
def doctor():
"""Entry point for the doctor command"""
import asyncio
return asyncio.run(run_doctor())
logger.warning("Database will be initialized on first use")

View File

@@ -1,18 +1,15 @@
import os
import os, sys
# Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
def load_js_script(script_name):
# Get the path of the current script
current_script_path = os.path.dirname(os.path.realpath(__file__))
# Get the path of the script to load
script_path = os.path.join(current_script_path, script_name + ".js")
script_path = os.path.join(current_script_path, script_name + '.js')
# Check if the script exists
if not os.path.exists(script_path):
raise ValueError(
f"Script {script_name} not found in the folder {current_script_path}"
)
raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
# Load the content of the script
with open(script_path, "r") as f:
with open(script_path, 'r') as f:
script_content = f.read()
return script_content

View File

@@ -1,546 +0,0 @@
import os
from pathlib import Path
import re
from typing import Dict, List, Tuple, Optional, Any
import json
from tqdm import tqdm
import time
import psutil
import numpy as np
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from litellm import batch_completion
from .async_logger import AsyncLogger
import litellm
import pickle
import hashlib # <--- ADDED for file-hash
import glob
litellm.set_verbose = False
def _compute_file_hash(file_path: Path) -> str:
"""Compute MD5 hash for the file's entire content."""
hash_md5 = hashlib.md5()
with file_path.open("rb") as f:
for chunk in iter(lambda: f.read(4096), b""):
hash_md5.update(chunk)
return hash_md5.hexdigest()
class AsyncLLMTextManager:
def __init__(
self,
docs_dir: Path,
logger: Optional[AsyncLogger] = None,
max_concurrent_calls: int = 5,
batch_size: int = 3,
) -> None:
self.docs_dir = docs_dir
self.logger = logger
self.max_concurrent_calls = max_concurrent_calls
self.batch_size = batch_size
self.bm25_index = None
self.document_map: Dict[str, Any] = {}
self.tokenized_facts: List[str] = []
self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
async def _process_document_batch(self, doc_batch: List[Path]) -> None:
"""Process a batch of documents in parallel"""
contents = []
for file_path in doc_batch:
try:
with open(file_path, "r", encoding="utf-8") as f:
contents.append(f.read())
except Exception as e:
self.logger.error(f"Error reading {file_path}: {str(e)}")
contents.append("") # Add empty content to maintain batch alignment
prompt = """Given a documentation file, generate a list of atomic facts where each fact:
1. Represents a single piece of knowledge
2. Contains variations in terminology for the same concept
3. References relevant code patterns if they exist
4. Is written in a way that would match natural language queries
Each fact should follow this format:
<main_concept>: <fact_statement> | <related_terms> | <code_reference>
Example Facts:
browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
Wrap your response in <index>...</index> tags.
"""
# Prepare messages for batch processing
messages_list = [
[
{
"role": "user",
"content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
}
]
for content in contents
if content
]
try:
responses = batch_completion(
model="anthropic/claude-3-5-sonnet-latest",
messages=messages_list,
logger_fn=None,
)
# Process responses and save index files
for response, file_path in zip(responses, doc_batch):
try:
index_content_match = re.search(
r"<index>(.*?)</index>",
response.choices[0].message.content,
re.DOTALL,
)
if not index_content_match:
self.logger.warning(
f"No <index>...</index> content found for {file_path}"
)
continue
index_content = re.sub(
r"\n\s*\n", "\n", index_content_match.group(1)
).strip()
if index_content:
index_file = file_path.with_suffix(".q.md")
with open(index_file, "w", encoding="utf-8") as f:
f.write(index_content)
self.logger.info(f"Created index file: {index_file}")
else:
self.logger.warning(
f"No index content found in response for {file_path}"
)
except Exception as e:
self.logger.error(
f"Error processing response for {file_path}: {str(e)}"
)
except Exception as e:
self.logger.error(f"Error in batch completion: {str(e)}")
def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
if "|" not in line:
return False, "Missing separator '|'"
parts = [p.strip() for p in line.split("|")]
if len(parts) != 3:
return False, f"Expected 3 parts, got {len(parts)}"
concept_part = parts[0]
if ":" not in concept_part:
return False, "Missing ':' in concept definition"
return True, None
def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
"""
Load token cache from .q.tokens if present and matching file hash.
Otherwise return a new structure with updated file-hash.
"""
cache_file = fact_file.with_suffix(".q.tokens")
current_hash = _compute_file_hash(fact_file)
if cache_file.exists():
try:
with open(cache_file, "r") as f:
cache = json.load(f)
# If the hash matches, return it directly
if cache.get("content_hash") == current_hash:
return cache
# Otherwise, we signal that it's changed
self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
except json.JSONDecodeError:
self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
except Exception as e:
self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
# Return a fresh cache
return {"facts": {}, "content_hash": current_hash}
def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
cache_file = fact_file.with_suffix(".q.tokens")
# Always ensure we're saving the correct file-hash
cache["content_hash"] = _compute_file_hash(fact_file)
with open(cache_file, "w") as f:
json.dump(cache, f)
def preprocess_text(self, text: str) -> List[str]:
parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
# Remove : after the first word of parts[0]
parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english")) - {
"how",
"what",
"when",
"where",
"why",
"which",
}
tokens = []
for part in parts:
if "(" in part and ")" in part:
code_tokens = re.findall(
r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
)
tokens.extend(code_tokens)
words = word_tokenize(part.lower())
tokens.extend(
[
lemmatizer.lemmatize(token)
for token in words
if token not in stop_words
]
)
return tokens
def maybe_load_bm25_index(self, clear_cache=False) -> bool:
"""
Load existing BM25 index from disk, if present and clear_cache=False.
"""
if not clear_cache and os.path.exists(self.bm25_index_file):
self.logger.info("Loading existing BM25 index from disk.")
with open(self.bm25_index_file, "rb") as f:
data = pickle.load(f)
self.tokenized_facts = data["tokenized_facts"]
self.bm25_index = data["bm25_index"]
return True
return False
def build_search_index(self, clear_cache=False) -> None:
"""
Checks for new or modified .q.md files by comparing file-hash.
If none need reindexing and clear_cache is False, loads existing index if available.
Otherwise, reindexes only changed/new files and merges or creates a new index.
"""
# If clear_cache is True, we skip partial logic: rebuild everything from scratch
if clear_cache:
self.logger.info("Clearing cache and rebuilding full search index.")
if self.bm25_index_file.exists():
self.bm25_index_file.unlink()
process = psutil.Process()
self.logger.info("Checking which .q.md files need (re)indexing...")
# Gather all .q.md files
q_files = [
self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
]
# We'll store known (unchanged) facts in these lists
existing_facts: List[str] = []
existing_tokens: List[List[str]] = []
# Keep track of invalid lines for logging
invalid_lines = []
needSet = [] # files that must be (re)indexed
for qf in q_files:
token_cache_file = qf.with_suffix(".q.tokens")
# If no .q.tokens or clear_cache is True → definitely reindex
if clear_cache or not token_cache_file.exists():
needSet.append(qf)
continue
# Otherwise, load the existing cache and compare hash
cache = self._load_or_create_token_cache(qf)
# If the .q.tokens was out of date (i.e. changed hash), we reindex
if len(cache["facts"]) == 0 or cache.get(
"content_hash"
) != _compute_file_hash(qf):
needSet.append(qf)
else:
# File is unchanged → retrieve cached token data
for line, cache_data in cache["facts"].items():
existing_facts.append(line)
existing_tokens.append(cache_data["tokens"])
self.document_map[line] = qf # track the doc for that fact
if not needSet and not clear_cache:
# If no file needs reindexing, try loading existing index
if self.maybe_load_bm25_index(clear_cache=False):
self.logger.info(
"No new/changed .q.md files found. Using existing BM25 index."
)
return
else:
# If there's no existing index, we must build a fresh index from the old caches
self.logger.info(
"No existing BM25 index found. Building from cached facts."
)
if existing_facts:
self.logger.info(
f"Building BM25 index with {len(existing_facts)} cached facts."
)
self.bm25_index = BM25Okapi(existing_tokens)
self.tokenized_facts = existing_facts
with open(self.bm25_index_file, "wb") as f:
pickle.dump(
{
"bm25_index": self.bm25_index,
"tokenized_facts": self.tokenized_facts,
},
f,
)
else:
self.logger.warning("No facts found at all. Index remains empty.")
return
# ----------------------------------------------------- /Users/unclecode/.crawl4ai/docs/14_proxy_security.q.q.tokens '/Users/unclecode/.crawl4ai/docs/14_proxy_security.q.md'
# If we reach here, we have new or changed .q.md files
# We'll parse them, reindex them, and then combine with existing_facts
# -----------------------------------------------------
self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
# 1) Parse the new or changed .q.md files
new_facts = []
new_tokens = []
with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
for file in needSet:
# We'll build up a fresh cache
fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
try:
with open(file, "r", encoding="utf-8") as f_obj:
content = f_obj.read().strip()
lines = [l.strip() for l in content.split("\n") if l.strip()]
for line in lines:
is_valid, error = self._validate_fact_line(line)
if not is_valid:
invalid_lines.append((file, line, error))
continue
tokens = self.preprocess_text(line)
fresh_cache["facts"][line] = {
"tokens": tokens,
"added": time.time(),
}
new_facts.append(line)
new_tokens.append(tokens)
self.document_map[line] = file
# Save the new .q.tokens with updated hash
self._save_token_cache(file, fresh_cache)
mem_usage = process.memory_info().rss / 1024 / 1024
self.logger.debug(
f"Memory usage after {file.name}: {mem_usage:.2f}MB"
)
except Exception as e:
self.logger.error(f"Error processing {file}: {str(e)}")
file_pbar.update(1)
if invalid_lines:
self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
for file, line, error in invalid_lines:
self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
# 2) Merge newly tokenized facts with the existing ones
all_facts = existing_facts + new_facts
all_tokens = existing_tokens + new_tokens
# 3) Build BM25 index from combined facts
self.logger.info(
f"Building BM25 index with {len(all_facts)} total facts (old + new)."
)
self.bm25_index = BM25Okapi(all_tokens)
self.tokenized_facts = all_facts
# 4) Save the updated BM25 index to disk
with open(self.bm25_index_file, "wb") as f:
pickle.dump(
{
"bm25_index": self.bm25_index,
"tokenized_facts": self.tokenized_facts,
},
f,
)
final_mem = process.memory_info().rss / 1024 / 1024
self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
async def generate_index_files(
self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
) -> None:
"""
Generate index files for all documents in parallel batches
Args:
force_generate_facts (bool): If True, regenerate indexes even if they exist
clear_bm25_cache (bool): If True, clear existing BM25 index cache
"""
self.logger.info("Starting index generation for documentation files.")
md_files = [
self.docs_dir / f
for f in os.listdir(self.docs_dir)
if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
]
# Filter out files that already have .q files unless force=True
if not force_generate_facts:
md_files = [
f
for f in md_files
if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
]
if not md_files:
self.logger.info("All index files exist. Use force=True to regenerate.")
else:
# Process documents in batches
for i in range(0, len(md_files), self.batch_size):
batch = md_files[i : i + self.batch_size]
self.logger.info(
f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}"
)
await self._process_document_batch(batch)
self.logger.info("Index generation complete, building/updating search index.")
self.build_search_index(clear_cache=clear_bm25_cache)
def generate(self, sections: List[str], mode: str = "extended") -> str:
# Get all markdown files
all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
str(self.docs_dir / "[0-9]*.xs.md")
)
# Extract base names without extensions
base_docs = {
Path(f).name.split(".")[0]
for f in all_files
if not Path(f).name.endswith(".q.md")
}
# Filter by sections if provided
if sections:
base_docs = {
doc
for doc in base_docs
if any(section.lower() in doc.lower() for section in sections)
}
# Get file paths based on mode
files = []
for doc in sorted(
base_docs,
key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
):
if mode == "condensed":
xs_file = self.docs_dir / f"{doc}.xs.md"
regular_file = self.docs_dir / f"{doc}.md"
files.append(str(xs_file if xs_file.exists() else regular_file))
else:
files.append(str(self.docs_dir / f"{doc}.md"))
# Read and format content
content = []
for file in files:
try:
with open(file, "r", encoding="utf-8") as f:
fname = Path(file).name
content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
except Exception as e:
self.logger.error(f"Error reading {file}: {str(e)}")
return "\n\n---\n\n".join(content) if content else ""
def search(self, query: str, top_k: int = 5) -> str:
if not self.bm25_index:
return "No search index available. Call build_search_index() first."
query_tokens = self.preprocess_text(query)
doc_scores = self.bm25_index.get_scores(query_tokens)
mean_score = np.mean(doc_scores)
std_score = np.std(doc_scores)
score_threshold = mean_score + (0.25 * std_score)
file_data = self._aggregate_search_scores(
doc_scores=doc_scores,
score_threshold=score_threshold,
query_tokens=query_tokens,
)
ranked_files = sorted(
file_data.items(),
key=lambda x: (
x[1]["code_match_score"] * 2.0
+ x[1]["match_count"] * 1.5
+ x[1]["total_score"]
),
reverse=True,
)[:top_k]
results = []
for file, _ in ranked_files:
main_doc = str(file).replace(".q.md", ".md")
if os.path.exists(self.docs_dir / main_doc):
with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
only_file_name = main_doc.split("/")[-1]
content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
results.append("\n".join(content))
return "\n\n---\n\n".join(results)
def _aggregate_search_scores(
self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
) -> Dict:
file_data = {}
for idx, score in enumerate(doc_scores):
if score <= score_threshold:
continue
fact = self.tokenized_facts[idx]
file_path = self.document_map[fact]
if file_path not in file_data:
file_data[file_path] = {
"total_score": 0,
"match_count": 0,
"code_match_score": 0,
"matched_facts": [],
}
components = fact.split("|") if "|" in fact else [fact]
code_match_score = 0
if len(components) == 3:
code_ref = components[2].strip()
code_tokens = self.preprocess_text(code_ref)
code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
query_tokens
)
file_data[file_path]["total_score"] += score
file_data[file_path]["match_count"] += 1
file_data[file_path]["code_match_score"] = max(
file_data[file_path]["code_match_score"], code_match_score
)
file_data[file_path]["matched_facts"].append(fact)
return file_data
def refresh_index(self) -> None:
"""Convenience method for a full rebuild."""
self.build_search_index(clear_cache=True)

View File

@@ -2,252 +2,130 @@ from abc import ABC, abstractmethod
from typing import Optional, Dict, Any, Tuple
from .models import MarkdownGenerationResult
from .html2text import CustomHTML2Text
from .content_filter_strategy import RelevantContentFilter
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
import re
from urllib.parse import urljoin
# Pre-compile the regex pattern
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
def fast_urljoin(base: str, url: str) -> str:
"""Fast URL joining for common cases."""
if url.startswith(("http://", "https://", "mailto:", "//")):
if url.startswith(('http://', 'https://', 'mailto:', '//')):
return url
if url.startswith("/"):
if url.startswith('/'):
# Handle absolute paths
if base.endswith("/"):
if base.endswith('/'):
return base[:-1] + url
return base + url
return urljoin(base, url)
class MarkdownGenerationStrategy(ABC):
"""Abstract base class for markdown generation strategies."""
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
self.content_filter = content_filter
self.options = options or {}
@abstractmethod
def generate_markdown(
self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown from cleaned HTML."""
pass
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
"""
Default implementation of markdown generation strategy.
How it works:
1. Generate raw markdown from cleaned HTML.
2. Convert links to citations.
3. Generate fit markdown if content filter is provided.
4. Return MarkdownGenerationResult.
Args:
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
"""
def __init__(
self,
content_filter: Optional[RelevantContentFilter] = None,
options: Optional[Dict[str, Any]] = None,
):
"""Default implementation of markdown generation strategy."""
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
super().__init__(content_filter, options)
def convert_links_to_citations(
self, markdown: str, base_url: str = ""
) -> Tuple[str, str]:
"""
Convert links in markdown to citations.
How it works:
1. Find all links in the markdown.
2. Convert links to citations.
3. Return converted markdown and references markdown.
Note:
This function uses a regex pattern to find links in markdown.
Args:
markdown (str): Markdown text.
base_url (str): Base URL for URL joins.
Returns:
Tuple[str, str]: Converted markdown and references markdown.
"""
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
link_map = {}
url_cache = {} # Cache for URL joins
parts = []
last_end = 0
counter = 1
for match in LINK_PATTERN.finditer(markdown):
parts.append(markdown[last_end : match.start()])
parts.append(markdown[last_end:match.start()])
text, url, title = match.groups()
# Use cached URL if available, otherwise compute and cache
if base_url and not url.startswith(("http://", "https://", "mailto:")):
if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
if url not in url_cache:
url_cache[url] = fast_urljoin(base_url, url)
url = url_cache[url]
if url not in link_map:
desc = []
if title:
desc.append(title)
if text and text != title:
desc.append(text)
if title: desc.append(title)
if text and text != title: desc.append(text)
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
counter += 1
num = link_map[url][0]
parts.append(
f"{text}{num}"
if not match.group(0).startswith("!")
else f"![{text}{num}⟩]"
)
parts.append(f"{text}{num}" if not match.group(0).startswith('!') else f"![{text}{num}⟩]")
last_end = match.end()
parts.append(markdown[last_end:])
converted_text = "".join(parts)
converted_text = ''.join(parts)
# Pre-build reference strings
references = ["\n\n## References\n\n"]
references.extend(
f"{num}{url}{desc}\n"
f"{num}{url}{desc}\n"
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
)
return converted_text, ''.join(references)
return converted_text, "".join(references)
def generate_markdown(self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs) -> MarkdownGenerationResult:
"""Generate markdown with citations from cleaned HTML."""
# Initialize HTML2Text with options
h = CustomHTML2Text()
if html2text_options:
h.update_params(**html2text_options)
elif options:
h.update_params(**options)
elif self.options:
h.update_params(**self.options)
def generate_markdown(
self,
cleaned_html: str,
base_url: str = "",
html2text_options: Optional[Dict[str, Any]] = None,
options: Optional[Dict[str, Any]] = None,
content_filter: Optional[RelevantContentFilter] = None,
citations: bool = True,
**kwargs,
) -> MarkdownGenerationResult:
"""
Generate markdown with citations from cleaned HTML.
# Generate raw markdown
raw_markdown = h.handle(cleaned_html)
raw_markdown = raw_markdown.replace(' ```', '```')
How it works:
1. Generate raw markdown from cleaned HTML.
2. Convert links to citations.
3. Generate fit markdown if content filter is provided.
4. Return MarkdownGenerationResult.
Args:
cleaned_html (str): Cleaned HTML content.
base_url (str): Base URL for URL joins.
html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
options (Optional[Dict[str, Any]]): Additional options for markdown generation.
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
citations (bool): Whether to generate citations.
Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
"""
try:
# Initialize HTML2Text with default options for better conversion
h = CustomHTML2Text(baseurl=base_url)
default_options = {
"body_width": 0, # Disable text wrapping
"ignore_emphasis": False,
"ignore_links": False,
"ignore_images": False,
"protect_links": True,
"single_line_break": True,
"mark_code": True,
"escape_snob": False,
}
# Update with custom options if provided
if html2text_options:
default_options.update(html2text_options)
elif options:
default_options.update(options)
elif self.options:
default_options.update(self.options)
h.update_params(**default_options)
# Ensure we have valid input
if not cleaned_html:
cleaned_html = ""
elif not isinstance(cleaned_html, str):
cleaned_html = str(cleaned_html)
# Generate raw markdown
try:
raw_markdown = h.handle(cleaned_html)
except Exception as e:
raw_markdown = f"Error converting HTML to markdown: {str(e)}"
raw_markdown = raw_markdown.replace(" ```", "```")
# Convert links to citations
markdown_with_citations: str = raw_markdown
references_markdown: str = ""
if citations:
try:
(
markdown_with_citations,
references_markdown,
) = self.convert_links_to_citations(raw_markdown, base_url)
except Exception as e:
markdown_with_citations = raw_markdown
references_markdown = f"Error generating citations: {str(e)}"
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
try:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = "\n".join(
"<div>{}</div>".format(s) for s in filtered_html
)
fit_markdown = h.handle(filtered_html)
except Exception as e:
fit_markdown = f"Error generating fit markdown: {str(e)}"
filtered_html = ""
return MarkdownGenerationResult(
raw_markdown=raw_markdown or "",
markdown_with_citations=markdown_with_citations or "",
references_markdown=references_markdown or "",
fit_markdown=fit_markdown or "",
fit_html=filtered_html or "",
)
except Exception as e:
# If anything fails, return empty strings with error message
error_msg = f"Error in markdown generation: {str(e)}"
return MarkdownGenerationResult(
raw_markdown=error_msg,
markdown_with_citations=error_msg,
references_markdown="",
fit_markdown="",
fit_html="",
# Convert links to citations
markdown_with_citations: str = ""
references_markdown: str = ""
if citations:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
)
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
return MarkdownGenerationResult(
raw_markdown=raw_markdown,
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown,
fit_html=filtered_html,
)

View File

@@ -1,11 +1,13 @@
import os
import asyncio
import logging
from pathlib import Path
import aiosqlite
from typing import Optional
import xxhash
import aiofiles
import shutil
import time
from datetime import datetime
from .async_logger import AsyncLogger, LogLevel
@@ -15,19 +17,18 @@ logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
# logging.basicConfig(level=logging.INFO)
# logger = logging.getLogger(__name__)
class DatabaseMigration:
def __init__(self, db_path: str):
self.db_path = db_path
self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
def _ensure_content_dirs(self, base_path: str) -> dict:
dirs = {
"html": "html_content",
"cleaned": "cleaned_html",
"markdown": "markdown_content",
"extracted": "extracted_content",
"screenshots": "screenshots",
'html': 'html_content',
'cleaned': 'cleaned_html',
'markdown': 'markdown_content',
'extracted': 'extracted_content',
'screenshots': 'screenshots'
}
content_paths = {}
for key, dirname in dirs.items():
@@ -46,55 +47,43 @@ class DatabaseMigration:
async def _store_content(self, content: str, content_type: str) -> str:
if not content:
return ""
content_hash = self._generate_content_hash(content)
file_path = os.path.join(self.content_paths[content_type], content_hash)
if not os.path.exists(file_path):
async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
await f.write(content)
return content_hash
async def migrate_database(self):
"""Migrate existing database to file-based storage"""
# logger.info("Starting database migration...")
logger.info("Starting database migration...", tag="INIT")
try:
async with aiosqlite.connect(self.db_path) as db:
# Get all rows
async with db.execute(
"""SELECT url, html, cleaned_html, markdown,
extracted_content, screenshot FROM crawled_data"""
'''SELECT url, html, cleaned_html, markdown,
extracted_content, screenshot FROM crawled_data'''
) as cursor:
rows = await cursor.fetchall()
migrated_count = 0
for row in rows:
(
url,
html,
cleaned_html,
markdown,
extracted_content,
screenshot,
) = row
url, html, cleaned_html, markdown, extracted_content, screenshot = row
# Store content in files and get hashes
html_hash = await self._store_content(html, "html")
cleaned_hash = await self._store_content(cleaned_html, "cleaned")
markdown_hash = await self._store_content(markdown, "markdown")
extracted_hash = await self._store_content(
extracted_content, "extracted"
)
screenshot_hash = await self._store_content(
screenshot, "screenshots"
)
html_hash = await self._store_content(html, 'html')
cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
markdown_hash = await self._store_content(markdown, 'markdown')
extracted_hash = await self._store_content(extracted_content, 'extracted')
screenshot_hash = await self._store_content(screenshot, 'screenshots')
# Update database with hashes
await db.execute(
"""
await db.execute('''
UPDATE crawled_data
SET html = ?,
cleaned_html = ?,
@@ -102,51 +91,40 @@ class DatabaseMigration:
extracted_content = ?,
screenshot = ?
WHERE url = ?
""",
(
html_hash,
cleaned_hash,
markdown_hash,
extracted_hash,
screenshot_hash,
url,
),
)
''', (html_hash, cleaned_hash, markdown_hash,
extracted_hash, screenshot_hash, url))
migrated_count += 1
if migrated_count % 100 == 0:
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
await db.commit()
logger.success(
f"Migration completed. {migrated_count} records processed.",
tag="COMPLETE",
)
logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE")
except Exception as e:
# logger.error(f"Migration failed: {e}")
logger.error(
message="Migration failed: {error}",
tag="ERROR",
params={"error": str(e)},
params={"error": str(e)}
)
raise e
async def backup_database(db_path: str) -> str:
"""Create backup of existing database"""
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping backup.", tag="INIT")
return None
# Create backup with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
backup_path = f"{db_path}.backup_{timestamp}"
try:
# Wait for any potential write operations to finish
await asyncio.sleep(1)
# Create backup
shutil.copy2(db_path, backup_path)
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
@@ -154,41 +132,37 @@ async def backup_database(db_path: str) -> str:
except Exception as e:
# logger.error(f"Backup failed: {e}")
logger.error(
message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
)
message="Migration failed: {error}",
tag="ERROR",
params={"error": str(e)}
)
raise e
async def run_migration(db_path: Optional[str] = None):
"""Run database migration"""
if db_path is None:
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
if not os.path.exists(db_path):
logger.info("No existing database found. Skipping migration.", tag="INIT")
return
# Create backup first
backup_path = await backup_database(db_path)
if not backup_path:
return
migration = DatabaseMigration(db_path)
await migration.migrate_database()
def main():
"""CLI entry point for migration"""
import argparse
parser = argparse.ArgumentParser(
description="Migrate Crawl4AI database to file-based storage"
)
parser.add_argument("--db-path", help="Custom database path")
parser = argparse.ArgumentParser(description='Migrate Crawl4AI database to file-based storage')
parser.add_argument('--db-path', help='Custom database path')
args = parser.parse_args()
asyncio.run(run_migration(args.db_path))
if __name__ == "__main__":
main()
main()

View File

@@ -2,125 +2,109 @@ from functools import lru_cache
from pathlib import Path
import subprocess, os
import shutil
import tarfile
from .model_loader import *
import argparse
import urllib.request
from crawl4ai.config import MODEL_REPO_BRANCH
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@lru_cache()
def get_available_memory(device):
import torch
if device.type == "cuda":
if device.type == 'cuda':
return torch.cuda.get_device_properties(device).total_memory
elif device.type == "mps":
return 48 * 1024**3 # Assuming 8GB for MPS, as a conservative estimate
elif device.type == 'mps':
return 48 * 1024 ** 3 # Assuming 8GB for MPS, as a conservative estimate
else:
return 0
@lru_cache()
def calculate_batch_size(device):
available_memory = get_available_memory(device)
if device.type == "cpu":
if device.type == 'cpu':
return 16
elif device.type in ["cuda", "mps"]:
elif device.type in ['cuda', 'mps']:
# Adjust these thresholds based on your model size and available memory
if available_memory >= 31 * 1024**3: # > 32GB
if available_memory >= 31 * 1024 ** 3: # > 32GB
return 256
elif available_memory >= 15 * 1024**3: # > 16GB to 32GB
elif available_memory >= 15 * 1024 ** 3: # > 16GB to 32GB
return 128
elif available_memory >= 8 * 1024**3: # 8GB to 16GB
elif available_memory >= 8 * 1024 ** 3: # 8GB to 16GB
return 64
else:
return 32
else:
return 16 # Default batch size
return 16 # Default batch size
@lru_cache()
def get_device():
import torch
if torch.cuda.is_available():
device = torch.device("cuda")
device = torch.device('cuda')
elif torch.backends.mps.is_available():
device = torch.device("mps")
device = torch.device('mps')
else:
device = torch.device("cpu")
return device
device = torch.device('cpu')
return device
def set_model_device(model):
device = get_device()
model.to(device)
model.to(device)
return model, device
@lru_cache()
def get_home_folder():
home_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
home_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
os.makedirs(home_folder, exist_ok=True)
os.makedirs(f"{home_folder}/cache", exist_ok=True)
os.makedirs(f"{home_folder}/models", exist_ok=True)
return home_folder
return home_folder
@lru_cache()
def load_bert_base_uncased():
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
model.eval()
model, device = set_model_device(model)
return tokenizer, model
@lru_cache()
def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
"""Load the Hugging Face model for embedding.
Args:
model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
Returns:
tuple: The tokenizer and model.
"""
from transformers import AutoTokenizer, AutoModel
from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
model = AutoModel.from_pretrained(model_name, resume_download=None)
model.eval()
model, device = set_model_device(model)
return tokenizer, model
@lru_cache()
def load_text_classifier():
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained(
"dstefa/roberta-base_topic_classification_nyt_news"
)
model = AutoModelForSequenceClassification.from_pretrained(
"dstefa/roberta-base_topic_classification_nyt_news"
)
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model.eval()
model, device = set_model_device(model)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
return pipe
@lru_cache()
def load_text_multilabel_classifier():
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from scipy.special import expit
import torch
@@ -132,27 +116,18 @@ def load_text_multilabel_classifier():
# else:
# device = torch.device("cpu")
# # return load_spacy_model(), torch.device("cpu")
MODEL = "cardiffnlp/tweet-topic-21-multi"
tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
model = AutoModelForSequenceClassification.from_pretrained(
MODEL, resume_download=None
)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
model.eval()
model, device = set_model_device(model)
class_mapping = model.config.id2label
def _classifier(texts, threshold=0.5, max_length=64):
tokens = tokenizer(
texts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=max_length,
)
tokens = {
key: val.to(device) for key, val in tokens.items()
} # Move tokens to the selected device
tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
tokens = {key: val.to(device) for key, val in tokens.items()} # Move tokens to the selected device
with torch.no_grad():
output = model(**tokens)
@@ -163,41 +138,35 @@ def load_text_multilabel_classifier():
batch_labels = []
for prediction in predictions:
labels = [
class_mapping[i] for i, value in enumerate(prediction) if value == 1
]
labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
batch_labels.append(labels)
return batch_labels
return _classifier, device
@lru_cache()
def load_nltk_punkt():
import nltk
try:
nltk.data.find("tokenizers/punkt")
nltk.data.find('tokenizers/punkt')
except LookupError:
nltk.download("punkt")
return nltk.data.find("tokenizers/punkt")
nltk.download('punkt')
return nltk.data.find('tokenizers/punkt')
@lru_cache()
def load_spacy_model():
import spacy
name = "models/reuters"
home_folder = get_home_folder()
model_folder = Path(home_folder) / name
# Check if the model directory already exists
if not (model_folder.exists() and any(model_folder.iterdir())):
repo_url = "https://github.com/unclecode/crawl4ai.git"
branch = MODEL_REPO_BRANCH
branch = MODEL_REPO_BRANCH
repo_folder = Path(home_folder) / "crawl4ai"
print("[LOG] ⏬ Downloading Spacy model for the first time...")
# Remove existing repo folder if it exists
@@ -207,9 +176,7 @@ def load_spacy_model():
if model_folder.exists():
shutil.rmtree(model_folder)
except PermissionError:
print(
"[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
)
print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
print(f"- {repo_folder}")
print(f"- {model_folder}")
return None
@@ -220,7 +187,7 @@ def load_spacy_model():
["git", "clone", "-b", branch, repo_url, str(repo_folder)],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
check=True,
check=True
)
# Create the models directory if it doesn't exist
@@ -248,7 +215,6 @@ def load_spacy_model():
print(f"Error loading spacy model: {e}")
return None
def download_all_models(remove_existing=False):
"""Download all models required for Crawl4AI."""
if remove_existing:
@@ -277,20 +243,14 @@ def download_all_models(remove_existing=False):
load_nltk_punkt()
print("[LOG] ✅ All models downloaded successfully.")
def main():
print("[LOG] Welcome to the Crawl4AI Model Downloader!")
print("[LOG] This script will download all the models required for Crawl4AI.")
parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
parser.add_argument(
"--remove-existing",
action="store_true",
help="Remove existing models before downloading",
)
parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
args = parser.parse_args()
download_all_models(remove_existing=args.remove_existing)
if __name__ == "__main__":
main()

View File

@@ -1,85 +1,12 @@
from __future__ import annotations
from pydantic import BaseModel, HttpUrl
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
from enum import Enum
from dataclasses import dataclass
from .ssl_certificate import SSLCertificate
from datetime import datetime
from datetime import timedelta
from math import inf
from typing import List, Dict, Optional, Callable, Awaitable, Union
###############################
# Dispatcher Models
###############################
@dataclass
class DomainState:
last_request_time: float = 0
current_delay: float = 0
fail_count: int = 0
@dataclass
class CrawlerTaskResult:
task_id: str
url: str
result: "CrawlResult"
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
class CrawlStatus(Enum):
QUEUED = "QUEUED"
IN_PROGRESS = "IN_PROGRESS"
COMPLETED = "COMPLETED"
FAILED = "FAILED"
@dataclass
class CrawlStats:
task_id: str
url: str
status: CrawlStatus
start_time: Optional[datetime] = None
end_time: Optional[datetime] = None
memory_usage: float = 0.0
peak_memory: float = 0.0
error_message: str = ""
@property
def duration(self) -> str:
if not self.start_time:
return "0:00"
end = self.end_time or datetime.now()
duration = end - self.start_time
return str(timedelta(seconds=int(duration.total_seconds())))
class DisplayMode(Enum):
DETAILED = "DETAILED"
AGGREGATED = "AGGREGATED"
###############################
# Crawler Models
###############################
@dataclass
class TokenUsage:
completion_tokens: int = 0
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens_details: Optional[dict] = None
prompt_tokens_details: Optional[dict] = None
class UrlModel(BaseModel):
url: HttpUrl
forced: bool = False
class MarkdownGenerationResult(BaseModel):
raw_markdown: str
markdown_with_citations: str
@@ -87,28 +14,6 @@ class MarkdownGenerationResult(BaseModel):
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
class DispatchResult(BaseModel):
task_id: str
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
@dataclass
class TraversalStats:
"""Statistics for the traversal process"""
start_time: datetime
urls_processed: int = 0
urls_failed: int = 0
urls_skipped: int = 0
total_depth_reached: int = 0
current_depth: int = 0
class CrawlResult(BaseModel):
url: str
html: str
@@ -118,7 +23,7 @@ class CrawlResult(BaseModel):
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
@@ -129,17 +34,7 @@ class CrawlResult(BaseModel):
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
redirected_url: Optional[str] = None
# Attributes for position
depth: Optional[int] = None
score: Optional[float] = -inf
parent_url: Optional[str] = None
class Config:
arbitrary_types_allowed = True
class AsyncCrawlResponse(BaseModel):
html: str
response_headers: Dict[str, str]
@@ -148,52 +43,8 @@ class AsyncCrawlResponse(BaseModel):
pdf_data: Optional[bytes] = None
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
downloaded_files: Optional[List[str]] = None
ssl_certificate: Optional[SSLCertificate] = None
redirected_url: Optional[str] = None
class Config:
arbitrary_types_allowed = True
###############################
# Scraping Models
###############################
class MediaItem(BaseModel):
src: Optional[str] = ""
alt: Optional[str] = ""
desc: Optional[str] = ""
score: Optional[int] = 0
type: str = "image"
group_id: Optional[int] = 0
format: Optional[str] = None
width: Optional[int] = None
class Link(BaseModel):
href: Optional[str] = ""
text: Optional[str] = ""
title: Optional[str] = ""
base_domain: Optional[str] = ""
class Media(BaseModel):
images: List[MediaItem] = []
videos: List[MediaItem] = (
[]
) # Using MediaItem model for now, can be extended with Video model if needed
audios: List[MediaItem] = (
[]
) # Using MediaItem model for now, can be extended with Audio model if needed
class Links(BaseModel):
internal: List[Link] = []
external: List[Link] = []
class ScrapingResult(BaseModel):
cleaned_html: str
success: bool
media: Media = Media()
links: Links = Links()
metadata: Dict[str, Any] = {}

View File

@@ -202,808 +202,3 @@ Avoid Common Mistakes:
Result
Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
INPUT HTML:
<|HTML_CONTENT_START|>
{HTML}
<|HTML_CONTENT_END|>
SPECIFIC INSTRUCTION:
<|USER_INSTRUCTION_START|>
{REQUEST}
<|USER_INSTRUCTION_END|>
TASK DETAILS:
1. Content Selection
- DO: Keep essential information, main content, key details
- DO: Preserve hierarchical structure using markdown headers
- DO: Keep code blocks, tables, key lists
- DON'T: Include navigation menus, ads, footers, cookie notices
- DON'T: Keep social media widgets, sidebars, related content
2. Content Transformation
- DO: Use proper markdown syntax (#, ##, **, `, etc)
- DO: Convert tables to markdown tables
- DO: Preserve code formatting with ```language blocks
- DO: Maintain link texts but remove tracking parameters
- DON'T: Include HTML tags in output
- DON'T: Keep class names, ids, or other HTML attributes
3. Content Organization
- DO: Maintain logical flow of information
- DO: Group related content under appropriate headers
- DO: Use consistent header levels
- DON'T: Fragment related content
- DON'T: Duplicate information
Example Input:
<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
<div class="sidebar">Related articles...</div>
Example Output:
# Setup Guide
Follow these steps...
IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
OUTPUT FORMAT:
Wrap your response in <content> tags. Use proper markdown throughout.
<content>
[Your markdown content here]
</content>
Begin filtering now."""
JSON_SCHEMA_BUILDER= """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "CSS selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "CSS selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable CSS selectors
- Handle dynamic class names appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": ".product-card",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "article",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".author-info",
"fields": [
{
"name": "name",
"selector": ".author-name",
"type": "text"
},
{
"name": "avatar",
"selector": "img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": ".comments-container",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".comment",
"fields": [
{
"name": "user",
"selector": ".user-name",
"type": "text"
},
{
"name": "content",
"selector": ".comment-text",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": ".category-section",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": "h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".subcategory",
"fields": [
{
"name": "name",
"selector": "h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".product",
"fields": [
{
"name": "name",
"selector": ".product-name",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": ".job-post",
"fields": [
{
"name": "title",
"selector": ".job-title",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".salary-text",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".location",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
"type": "text"
},
{
"name": "country_name",
"selector": "span[class*='PlaceCard_subName__']",
"type": "text"
},
{
"name": "description",
"selector": "span[class*='PlaceCard_advertLabel__']",
"type": "text"
},
{
"name": "flight_price",
"selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
"type": "text"
},
{
"name": "flight_url",
"selector": "a[data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid CSS selectors
</output_requirements>
"""
JSON_SCHEMA_BUILDER_XPATH = """
# HTML Schema Generation Instructions
You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
## Your Core Responsibilities:
1. Analyze HTML structure to identify repeating patterns and important data points
2. Generate valid JSON schemas following the specified format
3. Create appropriate XPath selectors that will work reliably for data extraction
4. Name fields meaningfully based on their content and purpose
5. Handle both specific user requests and autonomous pattern detection
## Available Schema Types You Can Generate:
<schema_types>
1. Basic Single-Level Schema
- Use for simple, flat data structures
- Example: Product cards, user profiles
- Direct field extractions
2. Nested Object Schema
- Use for hierarchical data
- Example: Articles with author details
- Contains objects within objects
3. List Schema
- Use for repeating elements
- Example: Comment sections, product lists
- Handles arrays of similar items
4. Complex Nested Lists
- Use for multi-level data
- Example: Categories with subcategories
- Multiple levels of nesting
5. Transformation Schema
- Use for data requiring processing
- Supports regex and text transformations
- Special attribute handling
</schema_types>
<schema_structure>
Your output must always be a JSON object with this structure:
{
"name": "Descriptive name of the pattern",
"baseSelector": "XPath selector for the repeating element",
"fields": [
{
"name": "field_name",
"selector": "XPath selector",
"type": "text|attribute|nested|list|regex",
"attribute": "attribute_name", // Optional
"transform": "transformation_type", // Optional
"pattern": "regex_pattern", // Optional
"fields": [] // For nested/list types
}
]
}
</schema_structure>
<type_definitions>
Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
</type_definitions>
<behavior_rules>
1. When given a specific query:
- Focus on extracting requested data points
- Use most specific selectors possible
- Include all fields mentioned in the query
2. When no query is provided:
- Identify main content areas
- Extract all meaningful data points
- Use semantic structure to determine importance
- Include prices, dates, titles, and other common data types
3. Always:
- Use reliable XPath selectors
- Handle dynamic element IDs appropriately
- Create descriptive field names
- Follow consistent naming conventions
</behavior_rules>
<examples>
1. Basic Product Card Example:
<html>
<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
<h2 class="product-title">Gaming Laptop</h2>
<span class="price">$999.99</span>
<img src="laptop.jpg" alt="Gaming Laptop">
</div>
</html>
Generated Schema:
{
"name": "Product Cards",
"baseSelector": "//div[@class='product-card']",
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
{"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
],
"fields": [
{
"name": "title",
"selector": ".//h2[@class='product-title']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
},
{
"name": "image_url",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
2. Article with Author Details Example:
<html>
<article>
<h1>The Future of AI</h1>
<div class="author-info">
<span class="author-name">Dr. Smith</span>
<img src="author.jpg" alt="Dr. Smith">
</div>
</article>
</html>
Generated Schema:
{
"name": "Article Details",
"baseSelector": "//article",
"fields": [
{
"name": "title",
"selector": ".//h1",
"type": "text"
},
{
"name": "author",
"type": "nested",
"selector": ".//div[@class='author-info']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='author-name']",
"type": "text"
},
{
"name": "avatar",
"selector": ".//img",
"type": "attribute",
"attribute": "src"
}
]
}
]
}
3. Comments Section Example:
<html>
<div class="comments-container">
<div class="comment" data-user-id="123">
<div class="user-name">John123</div>
<p class="comment-text">Great article!</p>
</div>
<div class="comment" data-user-id="456">
<div class="user-name">Alice456</div>
<p class="comment-text">Thanks for sharing.</p>
</div>
</div>
</html>
Generated Schema:
{
"name": "Comment Section",
"baseSelector": "//div[@class='comments-container']",
"fields": [
{
"name": "comments",
"type": "list",
"selector": ".//div[@class='comment']",
"baseFields": [
{"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
],
"fields": [
{
"name": "user",
"selector": ".//div[@class='user-name']",
"type": "text"
},
{
"name": "content",
"selector": ".//p[@class='comment-text']",
"type": "text"
}
]
}
]
}
4. E-commerce Categories Example:
<html>
<div class="category-section" data-category="electronics">
<h2>Electronics</h2>
<div class="subcategory">
<h3>Laptops</h3>
<div class="product">
<span class="product-name">MacBook Pro</span>
<span class="price">$1299</span>
</div>
<div class="product">
<span class="product-name">Dell XPS</span>
<span class="price">$999</span>
</div>
</div>
</div>
</html>
Generated Schema:
{
"name": "E-commerce Categories",
"baseSelector": "//div[@class='category-section']",
"baseFields": [
{"name": "data_category", "type": "attribute", "attribute": "data-category"}
],
"fields": [
{
"name": "category_name",
"selector": ".//h2",
"type": "text"
},
{
"name": "subcategories",
"type": "nested_list",
"selector": ".//div[@class='subcategory']",
"fields": [
{
"name": "name",
"selector": ".//h3",
"type": "text"
},
{
"name": "products",
"type": "list",
"selector": ".//div[@class='product']",
"fields": [
{
"name": "name",
"selector": ".//span[@class='product-name']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='price']",
"type": "text"
}
]
}
]
}
]
}
5. Job Listings with Transformations Example:
<html>
<div class="job-post">
<h3 class="job-title">Senior Developer</h3>
<span class="salary-text">Salary: $120,000/year</span>
<span class="location"> New York, NY </span>
</div>
</html>
Generated Schema:
{
"name": "Job Listings",
"baseSelector": "//div[@class='job-post']",
"fields": [
{
"name": "title",
"selector": ".//h3[@class='job-title']",
"type": "text",
"transform": "uppercase"
},
{
"name": "salary",
"selector": ".//span[@class='salary-text']",
"type": "regex",
"pattern": "\\$([\\d,]+)"
},
{
"name": "location",
"selector": ".//span[@class='location']",
"type": "text",
"transform": "strip"
}
]
}
6. Skyscanner Place Card Example:
<html>
<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
<div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
<div class="PlaceCard_nameContent__ODUwZ">
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
</div>
<span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
</div>
<span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
<a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
<div class="PriceDescription_container__NjEzM">
<span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
</div>
</a>
</div>
</html>
Generated Schema:
{
"name": "Skyscanner Place Cards",
"baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
"baseFields": [
{"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
],
"fields": [
{
"name": "city_name",
"selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
"type": "text"
},
{
"name": "country_name",
"selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
"type": "text"
},
{
"name": "description",
"selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
"type": "text"
},
{
"name": "flight_price",
"selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
"type": "text"
},
{
"name": "flight_url",
"selector": ".//a[@data-testid='flights-link']",
"type": "attribute",
"attribute": "href"
}
]
}
</examples>
<output_requirements>
Your output must:
1. Be valid JSON only
2. Include no explanatory text
3. Follow the exact schema structure provided
4. Use appropriate field types
5. Include all required fields
6. Use valid XPath selectors
</output_requirements>
"""

View File

@@ -1,184 +0,0 @@
"""SSL Certificate class for handling certificate operations."""
import ssl
import socket
import base64
import json
from typing import Dict, Any, Optional
from urllib.parse import urlparse
import OpenSSL.crypto
from pathlib import Path
class SSLCertificate:
"""
A class representing an SSL certificate with methods to export in various formats.
Attributes:
cert_info (Dict[str, Any]): The certificate information.
Methods:
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
export_as_pem() -> str: Export the certificate as PEM format.
export_as_der() -> bytes: Export the certificate as DER format.
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
export_as_text() -> str: Export the certificate as text format.
"""
def __init__(self, cert_info: Dict[str, Any]):
self._cert_info = self._decode_cert_data(cert_info)
@staticmethod
def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
"""
Create SSLCertificate instance from a URL.
Args:
url (str): URL of the website.
timeout (int): Timeout for the connection (default: 10).
Returns:
Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
"""
try:
hostname = urlparse(url).netloc
if ":" in hostname:
hostname = hostname.split(":")[0]
context = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=timeout) as sock:
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
cert_binary = ssock.getpeercert(binary_form=True)
x509 = OpenSSL.crypto.load_certificate(
OpenSSL.crypto.FILETYPE_ASN1, cert_binary
)
cert_info = {
"subject": dict(x509.get_subject().get_components()),
"issuer": dict(x509.get_issuer().get_components()),
"version": x509.get_version(),
"serial_number": hex(x509.get_serial_number()),
"not_before": x509.get_notBefore(),
"not_after": x509.get_notAfter(),
"fingerprint": x509.digest("sha256").hex(),
"signature_algorithm": x509.get_signature_algorithm(),
"raw_cert": base64.b64encode(cert_binary),
}
# Add extensions
extensions = []
for i in range(x509.get_extension_count()):
ext = x509.get_extension(i)
extensions.append(
{"name": ext.get_short_name(), "value": str(ext)}
)
cert_info["extensions"] = extensions
return SSLCertificate(cert_info)
except Exception:
return None
@staticmethod
def _decode_cert_data(data: Any) -> Any:
"""Helper method to decode bytes in certificate data."""
if isinstance(data, bytes):
return data.decode("utf-8")
elif isinstance(data, dict):
return {
(
k.decode("utf-8") if isinstance(k, bytes) else k
): SSLCertificate._decode_cert_data(v)
for k, v in data.items()
}
elif isinstance(data, list):
return [SSLCertificate._decode_cert_data(item) for item in data]
return data
def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
"""
Export certificate as JSON.
Args:
filepath (Optional[str]): Path to save the JSON file (default: None).
Returns:
Optional[str]: JSON string if successful, None otherwise.
"""
json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
if filepath:
Path(filepath).write_text(json_str, encoding="utf-8")
return None
return json_str
def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
"""
Export certificate as PEM.
Args:
filepath (Optional[str]): Path to save the PEM file (default: None).
Returns:
Optional[str]: PEM string if successful, None otherwise.
"""
try:
x509 = OpenSSL.crypto.load_certificate(
OpenSSL.crypto.FILETYPE_ASN1,
base64.b64decode(self._cert_info["raw_cert"]),
)
pem_data = OpenSSL.crypto.dump_certificate(
OpenSSL.crypto.FILETYPE_PEM, x509
).decode("utf-8")
if filepath:
Path(filepath).write_text(pem_data, encoding="utf-8")
return None
return pem_data
except Exception:
return None
def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
"""
Export certificate as DER.
Args:
filepath (Optional[str]): Path to save the DER file (default: None).
Returns:
Optional[bytes]: DER bytes if successful, None otherwise.
"""
try:
der_data = base64.b64decode(self._cert_info["raw_cert"])
if filepath:
Path(filepath).write_bytes(der_data)
return None
return der_data
except Exception:
return None
@property
def issuer(self) -> Dict[str, str]:
"""Get certificate issuer information."""
return self._cert_info.get("issuer", {})
@property
def subject(self) -> Dict[str, str]:
"""Get certificate subject information."""
return self._cert_info.get("subject", {})
@property
def valid_from(self) -> str:
"""Get certificate validity start date."""
return self._cert_info.get("not_before", "")
@property
def valid_until(self) -> str:
"""Get certificate validity end date."""
return self._cert_info.get("not_after", "")
@property
def fingerprint(self) -> str:
"""Get certificate fingerprint."""
return self._cert_info.get("fingerprint", "")

View File

@@ -2,175 +2,8 @@ import random
from typing import Optional, Literal, List, Dict, Tuple
import re
from abc import ABC, abstractmethod
import random
from fake_useragent import UserAgent
import requests
from lxml import html
import json
from typing import Optional, List, Union, Dict
class UAGen(ABC):
@abstractmethod
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Union[str, Dict]:
pass
@staticmethod
def generate_client_hints( user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
def _parse_user_agent(user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
"chrome": r"Chrome/(\d+)",
"edge": r"Edg/(\d+)",
"safari": r"Version/(\d+)",
"firefox": r"Firefox/(\d+)",
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
browsers = _parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if "chrome" in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if "edge" in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif "firefox" in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif "safari" in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ", ".join(hints)
class ValidUAGenerator(UAGen):
def __init__(self):
self.ua = UserAgent()
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> str:
self.ua = UserAgent(
browsers=browsers or ['Chrome', 'Firefox', 'Edge'],
os=os or ['Windows', 'Mac OS X'],
min_version=min_version,
platforms=platforms or ['desktop'],
fallback=fallback
)
return self.ua.random
class OnlineUAGenerator(UAGen):
def __init__(self):
self.agents = []
self._fetch_agents()
def _fetch_agents(self):
try:
response = requests.get(
'https://www.useragents.me/',
timeout=5,
headers={'Accept': 'text/html,application/xhtml+xml'}
)
response.raise_for_status()
tree = html.fromstring(response.content)
json_text = tree.cssselect('#most-common-desktop-useragents-json-csv > div:nth-child(1) > textarea')[0].text
self.agents = json.loads(json_text)
except Exception as e:
print(f"Error fetching agents: {e}")
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Dict:
if not self.agents:
self._fetch_agents()
filtered_agents = self.agents
if pct_threshold:
filtered_agents = [a for a in filtered_agents if a['pct'] >= pct_threshold]
if browsers:
filtered_agents = [a for a in filtered_agents
if any(b.lower() in a['ua'].lower() for b in browsers)]
if os:
os_list = [os] if isinstance(os, str) else os
filtered_agents = [a for a in filtered_agents
if any(o.lower() in a['ua'].lower() for o in os_list)]
if platforms:
platform_list = [platforms] if isinstance(platforms, str) else platforms
filtered_agents = [a for a in filtered_agents
if any(p.lower() in a['ua'].lower() for p in platform_list)]
return filtered_agents[0] if filtered_agents else {'ua': fallback, 'pct': 0}
class UserAgentGenerator():
"""
Generate random user agents with specified constraints.
Attributes:
desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
chrome_versions (list): A list of possible Chrome browser versions.
firefox_versions (list): A list of possible Firefox browser versions.
edge_versions (list): A list of possible Edge browser versions.
safari_versions (list): A list of possible Safari browser versions.
ios_versions (list): A list of possible iOS browser versions.
android_versions (list): A list of possible Android browser versions.
Methods:
generate_user_agent(
platform: Literal["desktop", "mobile"] = "desktop",
browser: str = "chrome",
rendering_engine: str = "chrome_webkit",
chrome_version: Optional[str] = None,
firefox_version: Optional[str] = None,
edge_version: Optional[str] = None,
safari_version: Optional[str] = None,
ios_version: Optional[str] = None,
android_version: Optional[str] = None
): Generates a random user agent string based on the specified parameters.
"""
class UserAgentGenerator:
def __init__(self):
# Previous platform definitions remain the same...
self.desktop_platforms = {
@@ -186,7 +19,7 @@ class UserAgentGenerator():
"generic": "(X11; Linux x86_64)",
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
},
}
}
self.mobile_platforms = {
@@ -199,14 +32,26 @@ class UserAgentGenerator():
"ios": {
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
},
}
}
# Browser Combinations
self.browser_combinations = {
1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
1: [
["chrome"],
["firefox"],
["safari"],
["edge"]
],
2: [
["gecko", "firefox"],
["chrome", "safari"],
["webkit", "safari"]
],
3: [
["chrome", "safari", "edge"],
["webkit", "chrome", "safari"]
]
}
# Rendering Engines with versions
@@ -217,7 +62,7 @@ class UserAgentGenerator():
"Gecko/20100101",
"Gecko/20100101", # Firefox usually uses this constant version
"Gecko/2010010",
],
]
}
# Browser Versions
@@ -260,27 +105,13 @@ class UserAgentGenerator():
]
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
"""
Get a valid combination of browser versions.
How it works:
1. Check if the number of browsers is supported.
2. Randomly choose a combination of browsers.
3. Iterate through the combination and add browser versions.
4. Return the browser stack.
Args:
num_browsers: Number of browser specifications (1-3)
Returns:
List[str]: A list of browser versions.
"""
"""Get a valid combination of browser versions"""
if num_browsers not in self.browser_combinations:
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
combination = random.choice(self.browser_combinations[num_browsers])
browser_stack = []
for browser in combination:
if browser == "chrome":
browser_stack.append(random.choice(self.chrome_versions))
@@ -294,20 +125,18 @@ class UserAgentGenerator():
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
elif browser == "webkit":
browser_stack.append(self.rendering_engines["chrome_webkit"])
return browser_stack
def generate(
self,
device_type: Optional[Literal["desktop", "mobile"]] = None,
os_type: Optional[str] = None,
device_brand: Optional[str] = None,
browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
num_browsers: int = 3,
) -> str:
def generate(self,
device_type: Optional[Literal['desktop', 'mobile']] = None,
os_type: Optional[str] = None,
device_brand: Optional[str] = None,
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
num_browsers: int = 3) -> str:
"""
Generate a random user agent with specified constraints.
Args:
device_type: 'desktop' or 'mobile'
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
@@ -317,29 +146,23 @@ class UserAgentGenerator():
"""
# Get platform string
platform = self.get_random_platform(device_type, os_type, device_brand)
# Start with Mozilla
components = ["Mozilla/5.0", platform]
# Add browser stack
browser_stack = self.get_browser_stack(num_browsers)
# Add appropriate legacy token based on browser stack
if "Firefox" in str(browser_stack) or browser_type == "firefox":
if "Firefox" in str(browser_stack):
components.append(random.choice(self.rendering_engines["gecko"]))
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack) or browser_type == "chrome":
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
elif "Edge" in str(browser_stack) or browser_type == "edge":
components.append(self.rendering_engines["safari_webkit"])
components.append("(KHTML, like Gecko)")
elif "Safari" in str(browser_stack) or browser_type == "safari":
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
# Add browser versions
components.extend(browser_stack)
return " ".join(components)
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
@@ -350,20 +173,16 @@ class UserAgentGenerator():
def get_random_platform(self, device_type, os_type, device_brand):
"""Helper method to get random platform based on constraints"""
platforms = (
self.desktop_platforms
if device_type == "desktop"
else self.mobile_platforms
if device_type == "mobile"
else {**self.desktop_platforms, **self.mobile_platforms}
)
platforms = self.desktop_platforms if device_type == 'desktop' else \
self.mobile_platforms if device_type == 'mobile' else \
{**self.desktop_platforms, **self.mobile_platforms}
if os_type:
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
if os_type in platform_group:
platforms = {os_type: platform_group[os_type]}
break
os_key = random.choice(list(platforms.keys()))
if device_brand and device_brand in platforms[os_key]:
return platforms[os_key][device_brand]
@@ -372,58 +191,73 @@ class UserAgentGenerator():
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
"chrome": r"Chrome/(\d+)",
"edge": r"Edg/(\d+)",
"safari": r"Version/(\d+)",
"firefox": r"Firefox/(\d+)",
'chrome': r'Chrome/(\d+)',
'edge': r'Edg/(\d+)',
'safari': r'Version/(\d+)',
'firefox': r'Firefox/(\d+)'
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
def generate_client_hints(self, user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
browsers = self.parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if "chrome" in browsers:
if 'chrome' in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if "edge" in browsers:
if 'edge' in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif "firefox" in browsers:
elif 'firefox' in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif "safari" in browsers:
elif 'safari' in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ", ".join(hints)
return ', '.join(hints)
# Example usage:
if __name__ == "__main__":
generator = UserAgentGenerator()
print(generator.generate())
# Usage example:
generator = ValidUAGenerator()
ua = generator.generate()
print(ua)
print("\nSingle browser (Chrome):")
print(generator.generate(num_browsers=1, browser_type='chrome'))
generator = OnlineUAGenerator()
ua = generator.generate()
print(ua)
print("\nTwo browsers (Gecko/Firefox):")
print(generator.generate(num_browsers=2))
print("\nThree browsers (Chrome/Safari/Edge):")
print(generator.generate(num_browsers=3))
print("\nFirefox on Linux:")
print(generator.generate(
device_type='desktop',
os_type='linux',
browser_type='firefox',
num_browsers=2
))
print("\nChrome/Safari/Edge on Windows:")
print(generator.generate(
device_type='desktop',
os_type='windows',
num_browsers=3
))

File diff suppressed because it is too large Load Diff

View File

@@ -1,14 +1,14 @@
# version_manager.py
import os
from pathlib import Path
from packaging import version
from . import __version__
class VersionManager:
def __init__(self):
self.home_dir = Path.home() / ".crawl4ai"
self.version_file = self.home_dir / "version.txt"
def get_installed_version(self):
"""Get the version recorded in home directory"""
if not self.version_file.exists():
@@ -17,13 +17,14 @@ class VersionManager:
return version.parse(self.version_file.read_text().strip())
except:
return None
def update_version(self):
"""Update the version file to current library version"""
self.version_file.write_text(__version__.__version__)
def needs_update(self):
"""Check if database needs update based on version"""
installed = self.get_installed_version()
current = version.parse(__version__.__version__)
return installed is None or installed < current

View File

@@ -1,10 +1,9 @@
import os, time
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from pathlib import Path
from .models import UrlModel, CrawlResult
from .database import init_db, get_cached_url, cache_url
from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
from .utils import *
from .chunking_strategy import *
from .extraction_strategy import *
@@ -15,44 +14,31 @@ from .content_scraping_strategy import WebScrapingStrategy
from .config import *
import warnings
import json
warnings.filterwarnings(
"ignore",
message='Field "model_name" has conflict with protected namespace "model_".',
)
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
class WebCrawler:
def __init__(
self,
crawler_strategy: CrawlerStrategy = None,
always_by_pass_cache: bool = False,
verbose: bool = False,
):
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
verbose=verbose
)
def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False):
self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
self.always_by_pass_cache = always_by_pass_cache
self.crawl4ai_folder = os.path.join(
os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
)
self.crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
os.makedirs(self.crawl4ai_folder, exist_ok=True)
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
init_db()
self.ready = False
def warmup(self):
print("[LOG] 🌤️ Warming up the WebCrawler")
self.run(
url="https://google.com/",
url='https://google.com/',
word_count_threshold=5,
extraction_strategy=NoExtractionStrategy(),
bypass_cache=False,
verbose=False,
verbose=False
)
self.ready = True
print("[LOG] 🌞 WebCrawler is ready to crawl")
def fetch_page(
self,
url_model: UrlModel,
@@ -94,7 +80,6 @@ class WebCrawler:
**kwargs,
) -> List[CrawlResult]:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
def fetch_page_wrapper(url_model, *args, **kwargs):
return self.fetch_page(url_model, *args, **kwargs)
@@ -119,176 +104,150 @@ class WebCrawler:
return results
def run(
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
try:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
self,
url: str,
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
bypass_cache: bool = False,
css_selector: str = None,
screenshot: bool = False,
user_agent: str = None,
verbose=True,
**kwargs,
) -> CrawlResult:
try:
extraction_strategy = extraction_strategy or NoExtractionStrategy()
extraction_strategy.verbose = verbose
if not isinstance(extraction_strategy, ExtractionStrategy):
raise ValueError("Unsupported extraction strategy")
if not isinstance(chunking_strategy, ChunkingStrategy):
raise ValueError("Unsupported chunking strategy")
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
cached = None
screenshot_data = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if kwargs.get("warmup", True) and not self.ready:
return None
if cached:
html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4])
if screenshot:
screenshot_data = cached[9]
if not screenshot_data:
cached = None
if not cached or not html:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
t1 = time.time()
html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
t2 = time.time()
if verbose:
print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds")
if screenshot:
screenshot_data = self.crawler_strategy.take_screenshot()
cached = None
screenshot_data = None
extracted_content = None
if not bypass_cache and not self.always_by_pass_cache:
cached = get_cached_url(url)
if kwargs.get("warmup", True) and not self.ready:
return None
if cached:
html = sanitize_input_encode(cached[1])
extracted_content = sanitize_input_encode(cached[4])
if screenshot:
screenshot_data = cached[9]
if not screenshot_data:
cached = None
if not cached or not html:
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
t1 = time.time()
html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
t2 = time.time()
if verbose:
print(
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
)
if screenshot:
screenshot_data = self.crawler_strategy.take_screenshot()
crawl_result = self.process_html(
url,
html,
extracted_content,
word_count_threshold,
extraction_strategy,
chunking_strategy,
css_selector,
screenshot_data,
verbose,
bool(cached),
**kwargs,
)
crawl_result.success = bool(html)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs)
crawl_result.success = bool(html)
return crawl_result
except Exception as e:
if not hasattr(e, "msg"):
e.msg = str(e)
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
def process_html(
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: bool,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
t1 = time.time()
scrapping_strategy = WebScrapingStrategy()
extra_params = {
k: v
for k, v in kwargs.items()
if k not in ["only_text", "image_description_min_word_threshold"]
}
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold",
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
),
**extra_params,
)
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(
f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
self,
url: str,
html: str,
extracted_content: str,
word_count_threshold: int,
extraction_strategy: ExtractionStrategy,
chunking_strategy: ChunkingStrategy,
css_selector: str,
screenshot: bool,
verbose: bool,
is_cached: bool,
**kwargs,
) -> CrawlResult:
t = time.time()
# Extract content from HTML
try:
t1 = time.time()
scrapping_strategy = WebScrapingStrategy()
extra_params = {k: v for k, v in kwargs.items() if k not in ["only_text", "image_description_min_word_threshold"]}
result = scrapping_strategy.scrap(
url,
html,
word_count_threshold=word_count_threshold,
css_selector=css_selector,
only_text=kwargs.get("only_text", False),
image_description_min_word_threshold=kwargs.get(
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
),
**extra_params,
)
# result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
if verbose:
print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
if result is None:
raise ValueError(f"Failed to extract content from the website: {url}")
except InvalidCSSSelectorError as e:
raise ValueError(str(e))
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
markdown = sanitize_input_encode(result.get("markdown", ""))
media = result.get("media", [])
links = result.get("links", [])
metadata = result.get("metadata", {})
if extracted_content is None:
if verbose:
print(
f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
)
sections = chunking_strategy.chunk(markdown)
extracted_content = extraction_strategy.run(url, sections)
extracted_content = json.dumps(
extracted_content, indent=4, default=str, ensure_ascii=False
)
if verbose:
print(
f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
)
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
if verbose:
print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
screenshot = None if not screenshot else screenshot
if not is_cached:
cache_url(
url,
html,
cleaned_html,
markdown,
extracted_content,
True,
json.dumps(media),
json.dumps(links),
json.dumps(metadata),
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
)
return CrawlResult(
url=url,
html=html,
cleaned_html=format_html(cleaned_html),
markdown=markdown,
media=media,
links=links,
metadata=metadata,
screenshot=screenshot,
extracted_content=extracted_content,
success=True,
error_message="",
)
extracted_content=extracted_content,
success=True,
error_message="",
)

View File

@@ -1,244 +0,0 @@
# BFS Scraper Strategy: Smart Web Traversal
The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
```mermaid
flowchart TB
Start([Start]) --> Init[Initialize BFS Strategy]
Init --> InitStats[Initialize CrawlStats]
InitStats --> InitQueue[Initialize Priority Queue]
InitQueue --> AddStart[Add Start URL to Queue]
AddStart --> CheckState{Queue Empty or\nTasks Pending?}
CheckState -->|No| Cleanup[Cleanup & Stats]
Cleanup --> End([End])
CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
CheckCancel -->|Yes| Cleanup
CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
WaitComplete --> YieldResult[Yield Result]
YieldResult --> CheckState
CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
GetNextURL --> ValidateURL{Already\nVisited?}
ValidateURL -->|Yes| CheckState
ValidateURL -->|No| ProcessURL[Process URL]
subgraph URL_Processing [URL Processing]
ProcessURL --> CheckValid{URL Valid?}
CheckValid -->|No| UpdateStats[Update Skip Stats]
CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
FetchContent --> CheckError{Error?}
CheckError -->|Yes| Retry{Retry\nNeeded?}
Retry -->|Yes| FetchContent
Retry -->|No| UpdateFailStats[Update Fail Stats]
CheckError -->|No| ExtractLinks[Extract & Process Links]
ExtractLinks --> ScoreURLs[Score New URLs]
ScoreURLs --> AddToQueue[Add to Priority Queue]
end
ProcessURL --> CreateTask{Parallel\nProcessing?}
CreateTask -->|Yes| AddTask[Add to Pending Tasks]
CreateTask -->|No| DirectProcess[Process Directly]
AddTask --> CheckState
DirectProcess --> YieldResult
UpdateStats --> CheckState
UpdateRobotStats --> CheckState
UpdateFailStats --> CheckState
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End stats;
class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
```
## How It Works
The BFS strategy crawls a website by:
1. Starting from a root URL
2. Processing all URLs at the current depth
3. Moving to URLs at the next depth level
4. Continuing until maximum depth is reached
This ensures systematic coverage of the website while maintaining control over the crawling process.
## Key Features
### 1. Smart URL Processing
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=my_filters,
url_scorer=my_scorer,
max_concurrent=5
)
```
- Controls crawl depth
- Filters unwanted URLs
- Scores URLs for priority
- Manages concurrent requests
### 2. Polite Crawling
The strategy automatically implements web crawling best practices:
- Respects robots.txt
- Implements rate limiting
- Adds politeness delays
- Manages concurrent requests
### 3. Link Processing Control
```python
strategy = BFSScraperStrategy(
...,
process_external_links=False # Only process internal links
)
```
- Control whether to follow external links
- Default: internal links only
- Enable external links when needed
## Configuration Options
| Parameter | Description | Default |
|-----------|-------------|---------|
| max_depth | Maximum crawl depth | Required |
| filter_chain | URL filtering rules | Required |
| url_scorer | URL priority scoring | Required |
| max_concurrent | Max parallel requests | 5 |
| min_crawl_delay | Seconds between requests | 1 |
| process_external_links | Follow external links | False |
## Best Practices
1. **Set Appropriate Depth**
- Start with smaller depths (2-3)
- Increase based on needs
- Consider site structure
2. **Configure Filters**
- Use URL patterns
- Filter by content type
- Avoid unwanted sections
3. **Tune Performance**
- Adjust max_concurrent
- Set appropriate delays
- Monitor resource usage
4. **Handle External Links**
- Keep external_links=False for focused crawls
- Enable only when needed
- Consider additional filtering
## Example Usage
```python
from crawl4ai.scraper import BFSScraperStrategy
from crawl4ai.scraper.filters import FilterChain
from crawl4ai.scraper.scorers import BasicURLScorer
# Configure strategy
strategy = BFSScraperStrategy(
max_depth=3,
filter_chain=FilterChain([
URLPatternFilter("*.example.com/*"),
ContentTypeFilter(["text/html"])
]),
url_scorer=BasicURLScorer(),
max_concurrent=5,
min_crawl_delay=1,
process_external_links=False
)
# Use with AsyncWebScraper
scraper = AsyncWebScraper(crawler, strategy)
results = await scraper.ascrape("https://example.com")
```
## Common Use Cases
### 1. Site Mapping
```python
strategy = BFSScraperStrategy(
max_depth=5,
filter_chain=site_filter,
url_scorer=depth_scorer,
process_external_links=False
)
```
Perfect for creating complete site maps or understanding site structure.
### 2. Content Aggregation
```python
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=content_filter,
url_scorer=relevance_scorer,
max_concurrent=3
)
```
Ideal for collecting specific types of content (articles, products, etc.).
### 3. Link Analysis
```python
strategy = BFSScraperStrategy(
max_depth=1,
filter_chain=link_filter,
url_scorer=link_scorer,
process_external_links=True
)
```
Useful for analyzing both internal and external link structures.
## Advanced Features
### Progress Monitoring
```python
async for result in scraper.ascrape(url):
print(f"Current depth: {strategy.stats.current_depth}")
print(f"Processed URLs: {strategy.stats.urls_processed}")
```
### Custom URL Scoring
```python
class CustomScorer(URLScorer):
def score(self, url: str) -> float:
# Lower scores = higher priority
return score_based_on_criteria(url)
```
## Troubleshooting
1. **Slow Crawling**
- Increase max_concurrent
- Adjust min_crawl_delay
- Check network conditions
2. **Missing Content**
- Verify max_depth
- Check filter settings
- Review URL patterns
3. **High Resource Usage**
- Reduce max_concurrent
- Increase crawl delay
- Add more specific filters

View File

@@ -1,260 +0,0 @@
from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawl import (
BFSDeepCrawlStrategy,
FilterChain,
URLPatternFilter,
ContentTypeFilter,
DomainFilter,
KeywordRelevanceScorer,
PathDepthScorer,
FreshnessScorer,
CompositeScorer,
)
from crawl4ai.async_webcrawler import AsyncWebCrawler
import re
import time
import logging
browser_config = BrowserConfig(headless=True, viewport_width=800, viewport_height=600)
async def basic_example():
"""
Basic example: Deep crawl a blog site for articles
- Crawls only HTML pages
- Stays within the blog section
- Collects all results at once
"""
# Create a simple filter chain
filter_chain = FilterChain(
[
# Only crawl pages within the blog section
URLPatternFilter("*/basic/*"),
# Only process HTML pages
ContentTypeFilter(["text/html"]),
]
)
# Initialize the strategy with basic configuration
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2, # Only go 2 levels deep
filter_chain=filter_chain,
url_scorer=None, # Use default scoring
process_external_links=True,
)
# Create the crawler
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Start scraping
try:
results = await crawler.arun(
"https://crawl4ai.com/mkdocs",
CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
)
# Process results
print(f"Crawled {len(results)} pages:")
for result in results:
print(f"- {result.url}: {len(result.html)} bytes")
except Exception as e:
print(f"Error during scraping: {e}")
async def advanced_example():
"""
Advanced example: Intelligent news site crawling
- Uses all filter types
- Implements sophisticated scoring
- Streams results
- Includes monitoring and logging
"""
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("advanced_deep_crawler")
# Create sophisticated filter chain
filter_chain = FilterChain(
[
# Domain control
DomainFilter(
allowed_domains=["techcrunch.com"],
blocked_domains=["login.techcrunch.com", "legal.yahoo.com"],
),
# URL patterns
URLPatternFilter(
[
"*/article/*",
"*/news/*",
"*/blog/*",
re.compile(r"\d{4}/\d{2}/.*"), # Date-based URLs
]
),
# Content types
ContentTypeFilter(["text/html", "application/xhtml+xml"]),
]
)
# Create composite scorer
scorer = CompositeScorer(
[
# Prioritize by keywords
KeywordRelevanceScorer(
keywords=["news", "breaking", "update", "latest"], weight=1.0
),
# Prefer optimal URL structure
PathDepthScorer(optimal_depth=3, weight=0.7),
# Prioritize fresh content
FreshnessScorer(weight=0.9),
]
)
# Initialize strategy with advanced configuration
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2, filter_chain=filter_chain, url_scorer=scorer
)
# Create crawler
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Track statistics
stats = {"processed": 0, "errors": 0, "total_size": 0}
try:
# Use streaming mode
results = []
result_generator = await crawler.arun(
"https://techcrunch.com",
config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy, stream=True),
)
async for result in result_generator:
stats["processed"] += 1
if result.success:
stats["total_size"] += len(result.html)
logger.info(
f"Processed at depth: {result.depth} with score: {result.score:.3f} : \n {result.url}"
)
results.append(result)
else:
stats["errors"] += 1
logger.error(
f"Failed to process {result.url}: {result.error_message}"
)
# Log progress regularly
if stats["processed"] % 10 == 0:
logger.info(f"Progress: {stats['processed']} URLs processed")
except Exception as e:
logger.error(f"Scraping error: {e}")
finally:
# Print final statistics
logger.info("Scraping completed:")
logger.info(f"- URLs processed: {stats['processed']}")
logger.info(f"- Errors: {stats['errors']}")
logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
# Print filter statistics
for filter_ in filter_chain.filters:
logger.info(f"{filter_.name} stats:")
logger.info(f"- Passed: {filter_.stats.passed_urls}")
logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
# Print scorer statistics
logger.info("Scoring statistics:")
logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
logger.info(
f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}"
)
async def basic_example_many_urls():
filter_chain = FilterChain(
[
URLPatternFilter("*/basic/*"),
ContentTypeFilter(["text/html"]),
]
)
# Initialize the strategy with basic configuration
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2, # Only go 2 levels deep
filter_chain=filter_chain,
url_scorer=None, # Use default scoring
process_external_links=False,
)
# Create the crawler
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Start scraping
try:
results = await crawler.arun_many(
urls=["https://crawl4ai.com/mkdocs","https://aravindkarnam.com"],
config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
)
# Process results
print(f"Crawled {len(results)} pages:")
for url_result in results:
for result in url_result:
print(f"- {result.url}: {len(result.html)} bytes")
except Exception as e:
print(f"Error during scraping: {e}")
async def basic_example_many_urls_stream():
filter_chain = FilterChain(
[
URLPatternFilter("*/basic/*"),
ContentTypeFilter(["text/html"]),
]
)
# Initialize the strategy with basic configuration
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2, # Only go 2 levels deep
filter_chain=filter_chain,
url_scorer=None, # Use default scoring
process_external_links=False,
)
# Create the crawler
async with AsyncWebCrawler(
config=browser_config,
) as crawler:
# Start scraping
try:
async for result in await crawler.arun_many(
urls=["https://crawl4ai.com/mkdocs","https://aravindkarnam.com"],
config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy,stream=True),
):
# Process results
print(f"- {result.url}: {len(result.html)} bytes")
except Exception as e:
print(f"Error during scraping: {e}")
if __name__ == "__main__":
import asyncio
import time
# Run basic example
start_time = time.perf_counter()
print("Running basic Deep crawl example...")
asyncio.run(basic_example())
end_time = time.perf_counter()
print(f"Basic deep crawl example completed in {end_time - start_time:.2f} seconds")
# Run advanced example
print("\nRunning advanced deep crawl example...")
asyncio.run(advanced_example())
print("\nRunning advanced deep crawl example with arun_many...")
asyncio.run(basic_example_many_urls())
print("\nRunning advanced deep crawl example with arun_many streaming enabled...")
asyncio.run(basic_example_many_urls_stream())

View File

@@ -1,342 +0,0 @@
# URL Filters and Scorers
The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
```mermaid
flowchart TB
Start([URL Input]) --> Chain[Filter Chain]
subgraph Chain Process
Chain --> Pattern{URL Pattern\nFilter}
Pattern -->|Match| Content{Content Type\nFilter}
Pattern -->|No Match| Reject1[Reject URL]
Content -->|Allowed| Domain{Domain\nFilter}
Content -->|Not Allowed| Reject2[Reject URL]
Domain -->|Allowed| Accept[Accept URL]
Domain -->|Blocked| Reject3[Reject URL]
end
subgraph Statistics
Pattern --> UpdatePattern[Update Pattern Stats]
Content --> UpdateContent[Update Content Stats]
Domain --> UpdateDomain[Update Domain Stats]
Accept --> UpdateChain[Update Chain Stats]
Reject1 --> UpdateChain
Reject2 --> UpdateChain
Reject3 --> UpdateChain
end
Accept --> End([End])
Reject1 --> End
Reject2 --> End
Reject3 --> End
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
class Start,End accept;
class Pattern,Content,Domain decision;
class Reject1,Reject2,Reject3 reject;
class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
```
## URL Filters
URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
### Available Filters
1. **URL Pattern Filter**
```python
pattern_filter = URLPatternFilter([
"*.example.com/*", # Glob pattern
"*/article/*", # Path pattern
re.compile(r"blog-\d+") # Regex pattern
])
```
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
2. **Content Type Filter**
```python
content_filter = ContentTypeFilter([
"text/html",
"application/pdf"
], check_extension=True)
```
- Filter by MIME types
- Extension checking
- Support for multiple content types
3. **Domain Filter**
```python
domain_filter = DomainFilter(
allowed_domains=["example.com", "blog.example.com"],
blocked_domains=["ads.example.com"]
)
```
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
### Creating Filter Chains
```python
# Create and configure a filter chain
filter_chain = FilterChain([
URLPatternFilter(["*.example.com/*"]),
ContentTypeFilter(["text/html"]),
DomainFilter(blocked_domains=["ads.*"])
])
# Add more filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*"])
)
```
```mermaid
flowchart TB
Start([URL Input]) --> Composite[Composite Scorer]
subgraph Scoring Process
Composite --> Keywords[Keyword Relevance]
Composite --> Path[Path Depth]
Composite --> Content[Content Type]
Composite --> Fresh[Freshness]
Composite --> Domain[Domain Authority]
Keywords --> KeywordScore[Calculate Score]
Path --> PathScore[Calculate Score]
Content --> ContentScore[Calculate Score]
Fresh --> FreshScore[Calculate Score]
Domain --> DomainScore[Calculate Score]
KeywordScore --> Weight1[Apply Weight]
PathScore --> Weight2[Apply Weight]
ContentScore --> Weight3[Apply Weight]
FreshScore --> Weight4[Apply Weight]
DomainScore --> Weight5[Apply Weight]
end
Weight1 --> Combine[Combine Scores]
Weight2 --> Combine
Weight3 --> Combine
Weight4 --> Combine
Weight5 --> Combine
Combine --> Normalize{Normalize?}
Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
Normalize -->|No| FinalScore[Final Score]
NormalizeScore --> FinalScore
FinalScore --> Stats[Update Statistics]
Stats --> End([End])
classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
class Start,End calc;
class Keywords,Path,Content,Fresh,Domain scorer;
class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
class Normalize decision;
```
## URL Scorers
URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
### Available Scorers
1. **Keyword Relevance Scorer**
```python
keyword_scorer = KeywordRelevanceScorer(
keywords=["python", "programming"],
weight=1.0,
case_sensitive=False
)
```
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
2. **Path Depth Scorer**
```python
path_scorer = PathDepthScorer(
optimal_depth=3, # Preferred URL depth
weight=0.7
)
```
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
3. **Content Type Scorer**
```python
content_scorer = ContentTypeScorer({
r'\.html$': 1.0,
r'\.pdf$': 0.8,
r'\.xml$': 0.6
})
```
- Score based on file types
- Configurable type weights
- Pattern matching support
4. **Freshness Scorer**
```python
freshness_scorer = FreshnessScorer(weight=0.9)
```
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting
5. **Domain Authority Scorer**
```python
authority_scorer = DomainAuthorityScorer({
"python.org": 1.0,
"github.com": 0.9,
"medium.com": 0.7
})
```
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains
### Combining Scorers
```python
# Create a composite scorer
composite_scorer = CompositeScorer([
KeywordRelevanceScorer(["python"], weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.7),
FreshnessScorer(weight=0.8)
], normalize=True)
```
## Best Practices
### Filter Configuration
1. **Start Restrictive**
```python
# Begin with strict filters
filter_chain = FilterChain([
DomainFilter(allowed_domains=["example.com"]),
ContentTypeFilter(["text/html"])
])
```
2. **Layer Filters**
```python
# Add more specific filters
filter_chain.add_filter(
URLPatternFilter(["*/article/*", "*/blog/*"])
)
```
3. **Monitor Filter Statistics**
```python
# Check filter performance
for filter in filter_chain.filters:
print(f"{filter.name}: {filter.stats.rejected_urls} rejected")
```
### Scorer Configuration
1. **Balance Weights**
```python
# Balanced scoring configuration
scorer = create_balanced_scorer()
```
2. **Customize for Content**
```python
# News site configuration
news_scorer = CompositeScorer([
KeywordRelevanceScorer(["news", "article"], weight=1.0),
FreshnessScorer(weight=1.0),
PathDepthScorer(optimal_depth=2, weight=0.5)
])
```
3. **Monitor Scoring Statistics**
```python
# Check scoring distribution
print(f"Average score: {scorer.stats.average_score}")
print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
```
## Common Use Cases
### Blog Crawling
```python
blog_config = {
'filters': FilterChain([
URLPatternFilter(["*/blog/*", "*/post/*"]),
ContentTypeFilter(["text/html"])
]),
'scorer': CompositeScorer([
FreshnessScorer(weight=1.0),
KeywordRelevanceScorer(["blog", "article"], weight=0.8)
])
}
```
### Documentation Sites
```python
docs_config = {
'filters': FilterChain([
URLPatternFilter(["*/docs/*", "*/guide/*"]),
ContentTypeFilter(["text/html", "application/pdf"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=3, weight=1.0),
KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
])
}
```
### E-commerce Sites
```python
ecommerce_config = {
'filters': FilterChain([
URLPatternFilter(["*/product/*", "*/category/*"]),
DomainFilter(blocked_domains=["ads.*", "tracker.*"])
]),
'scorer': CompositeScorer([
PathDepthScorer(optimal_depth=2, weight=1.0),
ContentTypeScorer({
r'/product/': 1.0,
r'/category/': 0.8
})
])
}
```
## Advanced Topics
### Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
```
### Custom Scorers
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
```
For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).

View File

@@ -1,206 +0,0 @@
# Scraper Examples Guide
This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
## Basic Example
The basic example demonstrates a simple blog scraping scenario:
```python
from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
# Create simple filter chain
filter_chain = FilterChain([
URLPatternFilter("*/blog/*"),
ContentTypeFilter(["text/html"])
])
# Initialize strategy
strategy = BFSScraperStrategy(
max_depth=2,
filter_chain=filter_chain,
url_scorer=None,
max_concurrent=3
)
# Create and run scraper
crawler = AsyncWebCrawler()
scraper = AsyncWebScraper(crawler, strategy)
result = await scraper.ascrape("https://example.com/blog/")
```
### Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
## Advanced Example
The advanced example shows a sophisticated news site scraping setup with all features enabled:
```python
# Create comprehensive filter chain
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*/article/*",
re.compile(r"\d{4}/\d{2}/.*")
]),
ContentTypeFilter(["text/html"])
])
# Create intelligent scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(
keywords=["news", "breaking"],
weight=1.0
),
PathDepthScorer(optimal_depth=3, weight=0.7),
FreshnessScorer(weight=0.9)
])
# Initialize advanced strategy
strategy = BFSScraperStrategy(
max_depth=4,
filter_chain=filter_chain,
url_scorer=scorer,
max_concurrent=5
)
```
### Features Demonstrated
1. **Advanced Filtering**
- Domain filtering
- Pattern matching
- Content type control
2. **Intelligent Scoring**
- Keyword relevance
- Path optimization
- Freshness priority
3. **Monitoring**
- Progress tracking
- Error handling
- Statistics collection
4. **Resource Management**
- Concurrent processing
- Rate limiting
- Cleanup handling
## Running the Examples
```bash
# Basic usage
python basic_scraper_example.py
# Advanced usage with logging
PYTHONPATH=. python advanced_scraper_example.py
```
## Example Output
### Basic Example
```
Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
...
```
### Advanced Example
```
INFO: Starting crawl of https://example.com/news/
INFO: Processed: https://example.com/news/breaking/story1
DEBUG: KeywordScorer: 0.85
DEBUG: FreshnessScorer: 0.95
INFO: Progress: 10 URLs processed
...
INFO: Scraping completed:
INFO: - URLs processed: 50
INFO: - Errors: 2
INFO: - Total content size: 1240.50 KB
```
## Customization
### Adding Custom Filters
```python
class CustomFilter(URLFilter):
def apply(self, url: str) -> bool:
# Your custom filtering logic
return True
filter_chain.add_filter(CustomFilter())
```
### Custom Scoring Logic
```python
class CustomScorer(URLScorer):
def _calculate_score(self, url: str) -> float:
# Your custom scoring logic
return 1.0
scorer = CompositeScorer([
CustomScorer(weight=1.0),
...
])
```
## Best Practices
1. **Start Simple**
- Begin with basic filtering
- Add features incrementally
- Test thoroughly at each step
2. **Monitor Performance**
- Watch memory usage
- Track processing times
- Adjust concurrency as needed
3. **Handle Errors**
- Implement proper error handling
- Log important events
- Track error statistics
4. **Optimize Resources**
- Set appropriate delays
- Limit concurrent requests
- Use streaming for large crawls
## Troubleshooting
Common issues and solutions:
1. **Too Many Requests**
```python
strategy = BFSScraperStrategy(
max_concurrent=3, # Reduce concurrent requests
min_crawl_delay=2 # Increase delay between requests
)
```
2. **Memory Issues**
```python
# Use streaming mode for large crawls
async for result in scraper.ascrape(url, stream=True):
process_result(result)
```
3. **Missing Content**
```python
# Check your filter chain
filter_chain = FilterChain([
URLPatternFilter("*"), # Broaden patterns
ContentTypeFilter(["*"]) # Accept all content
])
```
For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).

View File

@@ -1,189 +0,0 @@
# 🐳 Using Docker (Legacy)
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
---
<details>
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
Choose the appropriate image based on your platform and needs:
### For AMD64 (Regular Linux/Windows):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-amd64
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
# With GPU support
docker pull unclecode/crawl4ai:gpu-amd64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Basic version (recommended)
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
# Full ML/LLM support
docker pull unclecode/crawl4ai:all-arm64
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
# With GPU support
docker pull unclecode/crawl4ai:gpu-arm64
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
```
Need more memory? Add `--shm-size`:
```bash
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
Test the installation:
```bash
curl http://localhost:11235/health
```
### For Raspberry Pi (32-bit) (coming soon):
```bash
# Pull and run basic version (recommended for Raspberry Pi)
docker pull unclecode/crawl4ai:basic-armv7
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
# With increased shared memory if needed
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
```
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
</details>
<details>
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
Build the image locally based on your platform:
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# For AMD64 (Regular Linux/Windows)
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
# For ARM64 (M1/M2 Macs, ARM servers)
docker build --platform linux/arm64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=basic \
.
```
Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
Example with all options:
```bash
docker build --platform linux/amd64 \
--tag crawl4ai:local \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=true \
.
```
Run your local build:
```bash
# Regular run
docker run -p 11235:11235 crawl4ai:local
# With increased shared memory
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
```
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
```bash
# Clone the repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
```
### For AMD64 (Regular Linux/Windows):
```bash
# Build and run locally
docker-compose --profile local-amd64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
```
### For ARM64 (M1/M2 Macs, ARM servers):
```bash
# Build and run locally
docker-compose --profile local-arm64 up
# Run from Docker Hub
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
```
Environment variables (optional):
```bash
# Create a .env file
CRAWL4AI_API_TOKEN=your_token
OPENAI_API_KEY=your_openai_key
CLAUDE_API_KEY=your_claude_key
```
The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
Test the installation:
```bash
curl http://localhost:11235/health
```
</details>
<details>
<summary>🚀 <strong>One-Click Deployment</strong></summary>
Deploy your own instance of Crawl4AI with one click:
[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
</details>

View File

@@ -1,110 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(browser_type="chromium", headless=True)
# Initialize crawler config with JSON CSS extraction strategy
crawler_config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
)
)
# Example search URL (you should replace with your actual Amazon URL)
url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,150 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
from playwright.async_api import Page, BrowserContext
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(
# browser_type="chromium",
headless=True
)
# Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
),
)
url = "https://www.amazon.com/"
async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}")
try:
# Wait for search box to be available
search_box = await page.wait_for_selector(
"#twotabsearchtextbox", timeout=1000
)
# Type the search query
await search_box.fill("Samsung Galaxy Tab")
# Get the search button and prepare for navigation
search_button = await page.wait_for_selector(
"#nav-search-submit-button", timeout=1000
)
# Click with navigation waiting
await search_button.click()
# Wait for search results to load
await page.wait_for_selector(
'[data-component-type="s-search-result"]', timeout=10000
)
print("[HOOK] Search completed and results loaded!")
except Exception as e:
print(f"[HOOK] Error during search operation: {str(e)}")
return page
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("after_goto", after_goto)
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,126 +0,0 @@
"""
This example demonstrates how to use JSON CSS extraction to scrape product information
from Amazon search results. It shows how to extract structured data like product titles,
prices, ratings, and other details using CSS selectors.
"""
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import json
async def extract_amazon_products():
# Initialize browser config
browser_config = BrowserConfig(
# browser_type="chromium",
headless=True
)
js_code_to_search = """
const task = async () => {
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
document.querySelector('#nav-search-submit-button').click();
}
await task();
"""
js_code_to_search_sync = """
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
document.querySelector('#nav-search-submit-button').click();
"""
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
js_code=js_code_to_search,
wait_for='css:[data-component-type="s-search-result"]',
extraction_strategy=JsonCssExtractionStrategy(
schema={
"name": "Amazon Product Search Results",
"baseSelector": "[data-component-type='s-search-result']",
"fields": [
{
"name": "asin",
"selector": "",
"type": "attribute",
"attribute": "data-asin",
},
{"name": "title", "selector": "h2 a span", "type": "text"},
{
"name": "url",
"selector": "h2 a",
"type": "attribute",
"attribute": "href",
},
{
"name": "image",
"selector": ".s-image",
"type": "attribute",
"attribute": "src",
},
{
"name": "rating",
"selector": ".a-icon-star-small .a-icon-alt",
"type": "text",
},
{
"name": "reviews_count",
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
"type": "text",
},
{
"name": "price",
"selector": ".a-price .a-offscreen",
"type": "text",
},
{
"name": "original_price",
"selector": ".a-price.a-text-price .a-offscreen",
"type": "text",
},
{
"name": "sponsored",
"selector": ".puis-sponsored-label-text",
"type": "exists",
},
{
"name": "delivery_info",
"selector": "[data-cy='delivery-recipe'] .a-color-base",
"type": "text",
"multiple": True,
},
],
}
),
)
# Example search URL (you should replace with your actual Amazon URL)
url = "https://www.amazon.com/"
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract the data
result = await crawler.arun(url=url, config=crawler_config)
# Process and print the results
if result and result.extracted_content:
# Parse the JSON string into a list of products
products = json.loads(result.extracted_content)
# Process each product in the list
for product in products:
print("\nProduct Details:")
print(f"ASIN: {product.get('asin')}")
print(f"Title: {product.get('title')}")
print(f"Price: {product.get('price')}")
print(f"Original Price: {product.get('original_price')}")
print(f"Rating: {product.get('rating')}")
print(f"Reviews: {product.get('reviews_count')}")
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
if product.get("delivery_info"):
print(f"Delivery: {' '.join(product['delivery_info'])}")
print("-" * 80)
if __name__ == "__main__":
import asyncio
asyncio.run(extract_amazon_products())

View File

@@ -1,16 +1,12 @@
# File: async_webcrawler_multiple_urls_example.py
import os, sys
# append 2 parent directories to sys.path to import crawl4ai
parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(parent_dir)
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
# Initialize the AsyncWebCrawler
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -20,7 +16,7 @@ async def main():
"https://python.org",
"https://github.com",
"https://stackoverflow.com",
"https://news.ycombinator.com",
"https://news.ycombinator.com"
]
# Set up crawling parameters
@@ -31,7 +27,7 @@ async def main():
urls=urls,
word_count_threshold=word_count_threshold,
bypass_cache=True,
verbose=True,
verbose=True
)
# Process the results
@@ -40,9 +36,7 @@ async def main():
print(f"Successfully crawled: {result.url}")
print(f"Title: {result.metadata.get('title', 'N/A')}")
print(f"Word count: {len(result.markdown.split())}")
print(
f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
)
print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
print(f"Number of images: {len(result.media.get('images', []))}")
print("---")
else:
@@ -50,6 +44,5 @@ async def main():
print(f"Error: {result.error_message}")
print("---")
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -1,126 +0,0 @@
"""
This example demonstrates optimal browser usage patterns in Crawl4AI:
1. Sequential crawling with session reuse
2. Parallel crawling with browser instance reuse
3. Performance optimization settings
"""
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def crawl_sequential(urls: List[str]):
"""
Sequential crawling using session reuse - most efficient for moderate workloads
"""
print("\n=== Sequential Crawling with Session Reuse ===")
# Configure browser with optimized settings
browser_config = BrowserConfig(
headless=True,
browser_args=[
"--disable-gpu", # Disable GPU acceleration
"--disable-dev-shm-usage", # Disable /dev/shm usage
"--no-sandbox", # Required for Docker
],
viewport={
"width": 800,
"height": 600,
}, # Smaller viewport for better performance
)
# Configure crawl settings
crawl_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
# content_filter=PruningContentFilter(), In case you need fit_markdown
),
)
# Create single crawler instance
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
session_id = "session1" # Use same session for all URLs
for url in urls:
result = await crawler.arun(
url=url,
config=crawl_config,
session_id=session_id, # Reuse same browser tab
)
if result.success:
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
finally:
await crawler.close()
async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
"""
Parallel crawling while reusing browser instance - best for large workloads
"""
print("\n=== Parallel Crawling with Browser Reuse ===")
browser_config = BrowserConfig(
headless=True,
browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
viewport={"width": 800, "height": 600},
)
crawl_config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
# content_filter=PruningContentFilter(), In case you need fit_markdown
),
)
# Create single crawler instance for all parallel tasks
crawler = AsyncWebCrawler(config=browser_config)
await crawler.start()
try:
# Create tasks in batches to control concurrency
for i in range(0, len(urls), max_concurrent):
batch = urls[i : i + max_concurrent]
tasks = []
for j, url in enumerate(batch):
session_id = (
f"parallel_session_{j}" # Different session per concurrent task
)
task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
tasks.append(task)
# Wait for batch to complete
results = await asyncio.gather(*tasks, return_exceptions=True)
# Process results
for url, result in zip(batch, results):
if isinstance(result, Exception):
print(f"Error crawling {url}: {str(result)}")
elif result.success:
print(f"Successfully crawled {url}")
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
finally:
await crawler.close()
async def main():
# Example URLs
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
"https://example.com/page4",
]
# Demo sequential crawling
await crawl_sequential(urls)
# Demo parallel crawling
await crawl_parallel(urls, max_concurrent=2)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,32 +1,31 @@
import os, time
# append the path to the root of the project
import sys
import asyncio
sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
from firecrawl import FirecrawlApp
from crawl4ai import AsyncWebCrawler
__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
async def compare():
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
# Tet Firecrawl with a simple crawl
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
'https://www.nbcnews.com/business',
params={'formats': ['markdown', 'html']}
)
end = time.time()
print(f"Time taken: {end - start} seconds")
print(len(scrape_status["markdown"]))
print(len(scrape_status['markdown']))
# save the markdown content with provider name
with open(f"{__data__}/firecrawl_simple.md", "w") as f:
f.write(scrape_status["markdown"])
f.write(scrape_status['markdown'])
# Count how many "cldnry.s-nbcnews.com" are in the markdown
print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
async with AsyncWebCrawler() as crawler:
start = time.time()
@@ -34,13 +33,13 @@ async def compare():
url="https://www.nbcnews.com/business",
# js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
word_count_threshold=0,
bypass_cache=True,
verbose=False,
bypass_cache=True,
verbose=False
)
end = time.time()
print(f"Time taken: {end - start} seconds")
print(len(result.markdown))
# save the markdown content with provider name
# save the markdown content with provider name
with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
f.write(result.markdown)
# count how many "cldnry.s-nbcnews.com" are in the markdown
@@ -49,12 +48,10 @@ async def compare():
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=[
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
word_count_threshold=0,
bypass_cache=True,
verbose=False,
bypass_cache=True,
verbose=False
)
end = time.time()
print(f"Time taken: {end - start} seconds")
@@ -64,7 +61,7 @@ async def compare():
f.write(result.markdown)
# count how many "cldnry.s-nbcnews.com" are in the markdown
print(result.markdown.count("cldnry.s-nbcnews.com"))
if __name__ == "__main__":
asyncio.run(compare())

View File

@@ -1,136 +0,0 @@
import asyncio
import time
from rich import print
from rich.table import Table
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
MemoryAdaptiveDispatcher,
SemaphoreDispatcher,
RateLimiter,
CrawlerMonitor,
DisplayMode,
CacheMode,
LXMLWebScrapingStrategy,
)
async def memory_adaptive(urls, browser_config, run_config):
"""Memory adaptive crawler with monitoring"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
"""Memory adaptive crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore(urls, browser_config, run_config):
"""Basic semaphore crawler"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
async def semaphore_with_rate_limit(urls, browser_config, run_config):
"""Semaphore crawler with rate limiting"""
start = time.perf_counter()
async with AsyncWebCrawler(config=browser_config) as crawler:
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
),
monitor=CrawlerMonitor(
max_visible_rows=15, display_mode=DisplayMode.DETAILED
),
)
results = await crawler.arun_many(
urls, config=run_config, dispatcher=dispatcher
)
duration = time.perf_counter() - start
return len(results), duration
def create_performance_table(results):
"""Creates a rich table showing performance results"""
table = Table(title="Crawler Strategy Performance Comparison")
table.add_column("Strategy", style="cyan")
table.add_column("URLs Crawled", justify="right", style="green")
table.add_column("Time (seconds)", justify="right", style="yellow")
table.add_column("URLs/second", justify="right", style="magenta")
sorted_results = sorted(results.items(), key=lambda x: x[1][1])
for strategy, (urls_crawled, duration) in sorted_results:
urls_per_second = urls_crawled / duration
table.add_row(
strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
)
return table
async def main():
urls = [f"https://example.com/page{i}" for i in range(1, 40)]
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
results = {
"Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
# "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
# urls, browser_config, run_config
# ),
# "Semaphore": await semaphore(urls, browser_config, run_config),
# "Semaphore + Rate Limit": await semaphore_with_rate_limit(
# urls, browser_config, run_config
# ),
}
table = create_performance_table(results)
print("\nPerformance Summary:")
print(table)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -6,80 +6,63 @@ import base64
import os
from typing import Dict, Any
class Crawl4AiTester:
def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
self.base_url = base_url
self.api_token = (
api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
) # Check environment variable as fallback
self.headers = (
{"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
)
def submit_and_wait(
self, request_data: Dict[str, Any], timeout: int = 300
) -> Dict[str, Any]:
self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code" # Check environment variable as fallback
self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {}
def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
# Submit crawl job
response = requests.post(
f"{self.base_url}/crawl", json=request_data, headers=self.headers
)
response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers)
if response.status_code == 403:
raise Exception("API token is invalid or missing")
task_id = response.json()["task_id"]
print(f"Task ID: {task_id}")
# Poll for result
start_time = time.time()
while True:
if time.time() - start_time > timeout:
raise TimeoutError(
f"Task {task_id} did not complete within {timeout} seconds"
)
result = requests.get(
f"{self.base_url}/task/{task_id}", headers=self.headers
)
raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers)
status = result.json()
if status["status"] == "failed":
print("Task failed:", status.get("error"))
raise Exception(f"Task failed: {status.get('error')}")
if status["status"] == "completed":
return status
time.sleep(2)
def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
response = requests.post(
f"{self.base_url}/crawl_sync",
json=request_data,
headers=self.headers,
timeout=60,
)
response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60)
if response.status_code == 408:
raise TimeoutError("Task did not complete within server timeout")
response.raise_for_status()
return response.json()
def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
"""Directly crawl without using task queue"""
response = requests.post(
f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
f"{self.base_url}/crawl_direct",
json=request_data,
headers=self.headers
)
response.raise_for_status()
return response.json()
def test_docker_deployment(version="basic"):
tester = Crawl4AiTester(
base_url="http://localhost:11235",
base_url="http://localhost:11235" ,
# base_url="https://api.crawl4ai.com" # just for example
# api_token="test" # just for example
)
print(f"Testing Crawl4AI Docker {version} version")
# Health check with timeout and retry
max_retries = 5
for i in range(max_retries):
@@ -87,19 +70,19 @@ def test_docker_deployment(version="basic"):
health = requests.get(f"{tester.base_url}/health", timeout=10)
print("Health check:", health.json())
break
except requests.exceptions.RequestException:
except requests.exceptions.RequestException as e:
if i == max_retries - 1:
print(f"Failed to connect after {max_retries} attempts")
sys.exit(1)
print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
time.sleep(5)
# Test cases based on version
test_basic_crawl_direct(tester)
test_basic_crawl(tester)
test_basic_crawl(tester)
test_basic_crawl_sync(tester)
if version in ["full", "transformer"]:
test_cosine_extraction(tester)
@@ -109,52 +92,49 @@ def test_docker_deployment(version="basic"):
test_llm_extraction(tester)
test_llm_with_ollama(tester)
test_screenshot(tester)
def test_basic_crawl(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
"priority": 10,
"session_id": "test"
}
result = tester.submit_and_wait(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
def test_basic_crawl_sync(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Sync) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
"session_id": "test",
"session_id": "test"
}
result = tester.submit_sync(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["status"] == "completed"
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
assert result['status'] == 'completed'
assert result['result']['success']
assert len(result['result']['markdown']) > 0
def test_basic_crawl_direct(tester: Crawl4AiTester):
print("\n=== Testing Basic Crawl (Direct) ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 10,
# "session_id": "test"
"cache_mode": "bypass", # or "enabled", "disabled", "read_only", "write_only"
"cache_mode": "bypass" # or "enabled", "disabled", "read_only", "write_only"
}
result = tester.crawl_direct(request)
print(f"Basic crawl result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
assert len(result["result"]["markdown"]) > 0
assert result['result']['success']
assert len(result['result']['markdown']) > 0
def test_js_execution(tester: Crawl4AiTester):
print("\n=== Testing JS Execution ===")
request = {
@@ -164,29 +144,32 @@ def test_js_execution(tester: Crawl4AiTester):
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
],
"wait_for": "article.tease-card:nth-child(10)",
"crawler_params": {"headless": True},
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print(f"JS execution result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_css_selector(tester: Crawl4AiTester):
print("\n=== Testing CSS Selector ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 7,
"css_selector": ".wide-tease-item__description",
"crawler_params": {"headless": True},
"extra": {"word_count_threshold": 10},
"crawler_params": {
"headless": True
},
"extra": {"word_count_threshold": 10}
}
result = tester.submit_and_wait(request)
print(f"CSS selector result length: {len(result['result']['markdown'])}")
assert result["result"]["success"]
def test_structured_extraction(tester: Crawl4AiTester):
print("\n=== Testing Structured Extraction ===")
schema = {
@@ -207,16 +190,21 @@ def test_structured_extraction(tester: Crawl4AiTester):
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
},
}
],
}
request = {
"urls": "https://www.coinbase.com/explore",
"priority": 9,
"extraction_config": {"type": "json_css", "params": {"schema": schema}},
"extraction_config": {
"type": "json_css",
"params": {
"schema": schema
}
}
}
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
print(f"Extracted {len(extracted)} items")
@@ -224,7 +212,6 @@ def test_structured_extraction(tester: Crawl4AiTester):
assert result["result"]["success"]
assert len(extracted) > 0
def test_llm_extraction(tester: Crawl4AiTester):
print("\n=== Testing LLM Extraction ===")
schema = {
@@ -232,20 +219,20 @@ def test_llm_extraction(tester: Crawl4AiTester):
"properties": {
"model_name": {
"type": "string",
"description": "Name of the OpenAI model.",
"description": "Name of the OpenAI model."
},
"input_fee": {
"type": "string",
"description": "Fee for input token for the OpenAI model.",
"description": "Fee for input token for the OpenAI model."
},
"output_fee": {
"type": "string",
"description": "Fee for output token for the OpenAI model.",
},
"description": "Fee for output token for the OpenAI model."
}
},
"required": ["model_name", "input_fee", "output_fee"],
"required": ["model_name", "input_fee", "output_fee"]
}
request = {
"urls": "https://openai.com/api/pricing",
"priority": 8,
@@ -256,12 +243,12 @@ def test_llm_extraction(tester: Crawl4AiTester):
"api_token": os.getenv("OPENAI_API_KEY"),
"schema": schema,
"extraction_type": "schema",
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
},
"instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
}
},
"crawler_params": {"word_count_threshold": 1},
"crawler_params": {"word_count_threshold": 1}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
@@ -271,7 +258,6 @@ def test_llm_extraction(tester: Crawl4AiTester):
except Exception as e:
print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
def test_llm_with_ollama(tester: Crawl4AiTester):
print("\n=== Testing LLM with Ollama ===")
schema = {
@@ -279,20 +265,20 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
"properties": {
"article_title": {
"type": "string",
"description": "The main title of the news article",
"description": "The main title of the news article"
},
"summary": {
"type": "string",
"description": "A brief summary of the article content",
"description": "A brief summary of the article content"
},
"main_topics": {
"type": "array",
"items": {"type": "string"},
"description": "Main topics or themes discussed in the article",
},
},
"description": "Main topics or themes discussed in the article"
}
}
}
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 8,
@@ -302,13 +288,13 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
"provider": "ollama/llama2",
"schema": schema,
"extraction_type": "schema",
"instruction": "Extract the main article information including title, summary, and main topics.",
},
"instruction": "Extract the main article information including title, summary, and main topics."
}
},
"extra": {"word_count_threshold": 1},
"crawler_params": {"verbose": True},
"crawler_params": {"verbose": True}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
@@ -317,7 +303,6 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
except Exception as e:
print(f"Ollama extraction test failed: {str(e)}")
def test_cosine_extraction(tester: Crawl4AiTester):
print("\n=== Testing Cosine Extraction ===")
request = {
@@ -329,11 +314,11 @@ def test_cosine_extraction(tester: Crawl4AiTester):
"semantic_filter": "business finance economy",
"word_count_threshold": 10,
"max_dist": 0.2,
"top_k": 3,
},
},
"top_k": 3
}
}
}
try:
result = tester.submit_and_wait(request)
extracted = json.loads(result["result"]["extracted_content"])
@@ -343,30 +328,30 @@ def test_cosine_extraction(tester: Crawl4AiTester):
except Exception as e:
print(f"Cosine extraction test failed: {str(e)}")
def test_screenshot(tester: Crawl4AiTester):
print("\n=== Testing Screenshot ===")
request = {
"urls": "https://www.nbcnews.com/business",
"priority": 5,
"screenshot": True,
"crawler_params": {"headless": True},
"crawler_params": {
"headless": True
}
}
result = tester.submit_and_wait(request)
print("Screenshot captured:", bool(result["result"]["screenshot"]))
if result["result"]["screenshot"]:
# Save screenshot
screenshot_data = base64.b64decode(result["result"]["screenshot"])
with open("test_screenshot.jpg", "wb") as f:
f.write(screenshot_data)
print("Screenshot saved as test_screenshot.jpg")
assert result["result"]["success"]
if __name__ == "__main__":
version = sys.argv[1] if len(sys.argv) > 1 else "basic"
# version = "full"
test_docker_deployment(version)
test_docker_deployment(version)

View File

@@ -1,127 +0,0 @@
"""
Example demonstrating different extraction strategies with various input formats.
This example shows how to:
1. Use different input formats (markdown, HTML, fit_markdown)
2. Work with JSON-based extractors (CSS and XPath)
3. Use LLM-based extraction with different input formats
4. Configure browser and crawler settings properly
"""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import (
LLMExtractionStrategy,
JsonCssExtractionStrategy,
JsonXPathExtractionStrategy,
)
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
"""Helper function to run extraction with proper configuration"""
try:
# Configure the crawler run settings
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=strategy,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter() # For fit_markdown support
),
)
# Run the crawler
result = await crawler.arun(url=url, config=config)
if result.success:
print(f"\n=== {name} Results ===")
print(f"Extracted Content: {result.extracted_content}")
print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
print(
f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
)
else:
print(f"Error in {name}: Crawl failed")
except Exception as e:
print(f"Error in {name}: {str(e)}")
async def main():
# Example URL (replace with actual URL)
url = "https://example.com/product-page"
# Configure browser settings
browser_config = BrowserConfig(headless=True, verbose=True)
# Initialize extraction strategies
# 1. LLM Extraction with different input formats
markdown_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information including name, price, and description",
)
html_strategy = LLMExtractionStrategy(
input_format="html",
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from HTML including structured data",
)
fit_markdown_strategy = LLMExtractionStrategy(
input_format="fit_markdown",
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract product information from cleaned markdown",
)
# 2. JSON CSS Extraction (automatically uses HTML input)
css_schema = {
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h1.product-title", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "description", "selector": ".description", "type": "text"},
],
}
css_strategy = JsonCssExtractionStrategy(schema=css_schema)
# 3. JSON XPath Extraction (automatically uses HTML input)
xpath_schema = {
"baseSelector": "//div[@class='product']",
"fields": [
{
"name": "title",
"selector": ".//h1[@class='product-title']/text()",
"type": "text",
},
{
"name": "price",
"selector": ".//span[@class='price']/text()",
"type": "text",
},
{
"name": "description",
"selector": ".//div[@class='description']/text()",
"type": "text",
},
],
}
xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
# Use context manager for proper resource handling
async with AsyncWebCrawler(config=browser_config) as crawler:
# Run all strategies
await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
await run_extraction(crawler, url, html_strategy, "HTML LLM")
await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
await run_extraction(crawler, url, css_strategy, "CSS Extraction")
await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -39,8 +39,8 @@ async def main():
f.write(b64decode(result.screenshot))
# Save PDF
if result.pdf:
pdf_bytes = b64decode(result.pdf)
if result.pdf_data:
pdf_bytes = b64decode(result.pdf_data)
with open(os.path.join(__location__, "page.pdf"), "wb") as f:
f.write(pdf_bytes)

View File

@@ -1,23 +0,0 @@
import asyncio
from crawl4ai import *
async def main():
browser_config = BrowserConfig(headless=True, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
),
)
result = await crawler.arun(
url="https://www.helloworld.org", config=crawler_config
)
print(result.markdown_v2.raw_markdown[:500])
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,118 +0,0 @@
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext
async def main():
print("🔗 Hooks Example: Demonstrating different hook use cases")
# Configure browser settings
browser_config = BrowserConfig(headless=True)
# Configure crawler settings
crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="body",
cache_mode=CacheMode.BYPASS,
)
# Create crawler instance
crawler = AsyncWebCrawler(config=browser_config)
# Define and set hook functions
async def on_browser_created(browser, context: BrowserContext, **kwargs):
"""Hook called after the browser is created"""
print("[HOOK] on_browser_created - Browser is ready!")
# Example: Set a cookie that will be used for all requests
return browser
async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
"""Hook called after a new page and context are created"""
print("[HOOK] on_page_context_created - New page created!")
# Example: Set default viewport size
await context.add_cookies(
[
{
"name": "session_id",
"value": "example_session",
"domain": ".example.com",
"path": "/",
}
]
)
await page.set_viewport_size({"width": 1080, "height": 800})
return page
async def on_user_agent_updated(
page: Page, context: BrowserContext, user_agent: str, **kwargs
):
"""Hook called when the user agent is updated"""
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
"""Hook called after custom JavaScript execution"""
print("[HOOK] on_execution_started - Custom JS executed!")
return page
async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
"""Hook called before navigating to each URL"""
print(f"[HOOK] before_goto - About to visit: {url}")
# Example: Add custom headers for the request
await page.set_extra_http_headers({"Custom-Header": "my-value"})
return page
async def after_goto(
page: Page, context: BrowserContext, url: str, response: dict, **kwargs
):
"""Hook called after navigating to each URL"""
print(f"[HOOK] after_goto - Successfully loaded: {url}")
# Example: Wait for a specific element to be loaded
try:
await page.wait_for_selector(".content", timeout=1000)
print("Content element found!")
except:
print("Content element not found, continuing anyway")
return page
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
"""Hook called before retrieving the HTML content"""
print("[HOOK] before_retrieve_html - About to get HTML content")
# Example: Scroll to bottom to trigger lazy loading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page
async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
"""Hook called before returning the HTML content"""
print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
# Example: You could modify the HTML content here if needed
return page
# Set all the hooks
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
await crawler.start()
# Example usage: crawl a simple website
url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config)
print(f"\nCrawled URL: {result.url}")
print(f"HTML length: {len(result.html)}")
await crawler.close()
if __name__ == "__main__":
import asyncio
asyncio.run(main())

View File

@@ -1,7 +1,6 @@
import asyncio
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
async def main():
# Example 1: Setting language when creating the crawler
crawler1 = AsyncWebCrawler(
@@ -10,15 +9,11 @@ async def main():
)
)
result1 = await crawler1.arun("https://www.example.com")
print(
"Example 1 result:", result1.extracted_content[:100]
) # Print first 100 characters
print("Example 1 result:", result1.extracted_content[:100]) # Print first 100 characters
# Example 2: Setting language before crawling
crawler2 = AsyncWebCrawler()
crawler2.crawler_strategy.headers[
"Accept-Language"
] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
result2 = await crawler2.arun("https://www.example.com")
print("Example 2 result:", result2.extracted_content[:100])
@@ -26,7 +21,7 @@ async def main():
crawler3 = AsyncWebCrawler()
result3 = await crawler3.arun(
"https://www.example.com",
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
)
print("Example 3 result:", result3.extracted_content[:100])
@@ -36,15 +31,15 @@ async def main():
("https://www.example.org", "es-ES,es;q=0.9"),
("https://www.example.net", "de-DE,de;q=0.9"),
]
crawler4 = AsyncWebCrawler()
results = await asyncio.gather(
*[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
)
results = await asyncio.gather(*[
crawler4.arun(url, headers={"Accept-Language": lang})
for url, lang in urls
])
for url, result in zip([u for u, _ in urls], results):
print(f"Result for {url}:", result.extracted_content[:100])
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -3,37 +3,32 @@ from crawl4ai.crawler_strategy import *
import asyncio
from pydantic import BaseModel, Field
url = r"https://openai.com/api/pricing/"
url = r'https://openai.com/api/pricing/'
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
)
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
from crawl4ai import AsyncWebCrawler
async def main():
# Use AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
extraction_strategy= LLMExtractionStrategy(
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
provider="groq/llama-3.1-70b-versatile",
api_token=os.getenv("GROQ_API_KEY"),
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "
"fees for input and output tokens. Make sure not to miss anything in the entire content. "
"One extracted model JSON format should look like this: "
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
instruction="From the crawled content, extract all mentioned model names along with their " \
"fees for input and output tokens. Make sure not to miss anything in the entire content. " \
'One extracted model JSON format should look like this: ' \
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
),
)
print("Success:", result.success)
model_fees = json.loads(result.extracted_content)
@@ -42,5 +37,4 @@ async def main():
with open(".data/data.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
asyncio.run(main())

View File

@@ -1,87 +0,0 @@
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import LLMContentFilter
async def test_llm_filter():
# Create an HTML source that needs intelligent filtering
url = "https://docs.python.org/3/tutorial/classes.html"
browser_config = BrowserConfig(
headless=True,
verbose=True
)
# run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
async with AsyncWebCrawler(config=browser_config) as crawler:
# First get the raw HTML
result = await crawler.arun(url, config=run_config)
html = result.cleaned_html
# Initialize LLM filter with focused instruction
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="""
Focus on extracting the core educational content about Python classes.
Include:
- Key concepts and their explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
- Version information
- Any non-essential UI elements
Format the output as clean markdown with proper code blocks and headers.
""",
verbose=True
)
filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
chunk_token_threshold=2 ** 12 * 2, # 2048 * 2
instruction="""
Extract the main educational content while preserving its original wording and substance completely. Your task is to:
1. Maintain the exact language and terminology used in the main content
2. Keep all technical explanations, examples, and educational content intact
3. Preserve the original flow and structure of the core content
4. Remove only clearly irrelevant elements like:
- Navigation menus
- Advertisement sections
- Cookie notices
- Footers with site information
- Sidebars with external links
- Any UI elements that don't contribute to learning
The goal is to create a clean markdown version that reads exactly like the original article,
keeping all valuable content but free from distracting elements. Imagine you're creating
a perfect reading experience where nothing valuable is lost, but all noise is removed.
""",
verbose=True
)
# Apply filtering
filtered_content = filter.filter_content(html, ignore_cache = True)
# Show results
print("\nFiltered Content Length:", len(filtered_content))
print("\nFirst 500 chars of filtered content:")
if filtered_content:
print(filtered_content[0][:500])
# Save on disc the markdown version
with open("filtered_content.md", "w", encoding="utf-8") as f:
f.write("\n".join(filtered_content))
# Show token usage
filter.show_usage()
if __name__ == "__main__":
asyncio.run(test_llm_filter())

View File

@@ -1,23 +1,18 @@
import os, sys
sys.path.append(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692"
import asyncio
import time
import json
import re
from typing import Dict
from typing import Dict, List
from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
)
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@@ -26,188 +21,128 @@ print("GitHub Repository: https://github.com/unclecode/crawl4ai")
print("Twitter: @unclecode")
print("Website: https://crawl4ai.com")
# Basic Example - Simple Crawl
async def simple_crawl():
print("\n--- Basic Usage ---")
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
url="https://www.nbcnews.com/business",
config=crawler_config
)
print(result.markdown[:500])
async def clean_content():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
excluded_tags=["nav", "footer", "aside"],
remove_overlay_elements=True,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
),
options={"ignore_links": True},
),
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
config=crawler_config,
)
full_markdown_length = len(result.markdown_v2.raw_markdown)
fit_markdown_length = len(result.markdown_v2.fit_markdown)
print(f"Full Markdown Length: {full_markdown_length}")
print(f"Fit Markdown Length: {fit_markdown_length}")
async def link_analysis():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
exclude_external_links=True,
exclude_social_media_links=True,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
config=crawler_config,
)
print(f"Found {len(result.links['internal'])} internal links")
print(f"Found {len(result.links['external'])} external links")
for link in result.links["internal"][:5]:
print(f"Href: {link['href']}\nText: {link['text']}\n")
# JavaScript Execution Example
async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---")
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
browser_config = BrowserConfig(
headless=True,
java_script_enabled=True
)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
js_code="const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();",
js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
# wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
url="https://www.nbcnews.com/business",
config=crawler_config
)
print(result.markdown[:500])
# CSS Selector Example
async def simple_example_with_css_selector():
print("\n--- Using CSS Selectors ---")
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, css_selector=".wide-tease-item__description"
cache_mode=CacheMode.BYPASS,
css_selector=".wide-tease-item__description"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
url="https://www.nbcnews.com/business",
config=crawler_config
)
print(result.markdown[:500])
async def media_handling():
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
)
for img in result.media["images"][:5]:
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
async def custom_hook_workflow(verbose=True):
async with AsyncWebCrawler() as crawler:
# Set a 'before_goto' hook to run custom code just before navigation
crawler.crawler_strategy.set_hook(
"before_goto",
lambda page, context: print("[Hook] Preparing to navigate..."),
)
# Perform the crawl operation
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
# Proxy Example
async def use_proxy():
print("\n--- Using a Proxy ---")
browser_config = BrowserConfig(
headless=True,
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "username",
"password": "password",
},
proxy="http://your-proxy-url:port"
)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", config=crawler_config
url="https://www.nbcnews.com/business",
config=crawler_config
)
if result.success:
print(result.markdown[:500])
# Screenshot Example
async def capture_and_save_screenshot(url: str, output_path: str):
browser_config = BrowserConfig(headless=True)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=url, config=crawler_config)
result = await crawler.arun(
url=url,
config=crawler_config
)
if result.success and result.screenshot:
import base64
screenshot_data = base64.b64decode(result.screenshot)
with open(output_path, "wb") as f:
with open(output_path, 'wb') as f:
f.write(screenshot_data)
print(f"Screenshot saved successfully to {output_path}")
else:
print("Failed to capture screenshot")
# LLM Extraction Example
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(
..., description="Fee for output token for the OpenAI model."
)
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def extract_structured_data_using_llm(
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama":
print(f"API token is required for {provider}. Skipping this example.")
return
browser_config = BrowserConfig(headless=True)
extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
extra_args = {
"temperature": 0,
"top_p": 0.9,
"max_tokens": 2000
}
if extra_headers:
extra_args["extra_headers"] = extra_headers
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
word_count_threshold=1,
page_timeout=80000,
page_timeout = 80000,
extraction_strategy=LLMExtractionStrategy(
provider=provider,
api_token=api_token,
@@ -215,23 +150,23 @@ async def extract_structured_data_using_llm(
extraction_type="schema",
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content.""",
extra_args=extra_args,
),
extra_args=extra_args
)
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://openai.com/api/pricing/", config=crawler_config
url="https://openai.com/api/pricing/",
config=crawler_config
)
print(result.extracted_content)
# CSS Extraction Example
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
@@ -257,13 +192,16 @@ async def extract_structured_data_using_css_extractor():
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src",
},
],
"attribute": "src"
}
]
}
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
browser_config = BrowserConfig(
headless=True,
java_script_enabled=True
)
js_click_tabs = """
(async () => {
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
@@ -274,24 +212,23 @@ async def extract_structured_data_using_css_extractor():
}
})();
"""
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=[js_click_tabs],
delay_before_return_html=1
js_code=[js_click_tabs]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology", config=crawler_config
url="https://www.kidocode.com/degrees/technology",
config=crawler_config
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
# Dynamic Content Examples - Method 1
async def crawl_dynamic_content_pages_method_1():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -312,7 +249,10 @@ async def crawl_dynamic_content_pages_method_1():
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
browser_config = BrowserConfig(
headless=False,
java_script_enabled=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
@@ -332,7 +272,7 @@ async def crawl_dynamic_content_pages_method_1():
css_selector="li.Box-sc-g0xbh4-0",
js_code=js_next_page if page > 0 else None,
js_only=page > 0,
session_id=session_id,
session_id=session_id
)
result = await crawler.arun(url=url, config=crawler_config)
@@ -346,12 +286,14 @@ async def crawl_dynamic_content_pages_method_1():
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Dynamic Content Examples - Method 2
async def crawl_dynamic_content_pages_method_2():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
browser_config = BrowserConfig(
headless=False,
java_script_enabled=True
)
js_next_page_and_wait = """
(async () => {
@@ -401,7 +343,7 @@ async def crawl_dynamic_content_pages_method_2():
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
session_id=session_id,
session_id=session_id
)
result = await crawler.arun(url=url, config=crawler_config)
@@ -413,132 +355,88 @@ async def crawl_dynamic_content_pages_method_2():
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def cosine_similarity_extraction():
crawl_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2, # Maximum distance between two words
linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
top_k=3, # Number of top keywords to extract
sim_threshold=0.3, # Similarity threshold for clustering
semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
verbose=True,
),
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
config=crawl_config,
)
print(json.loads(result.extracted_content)[:5])
# Browser Comparison
async def crawl_custom_browser_type():
print("\n--- Browser Comparison ---")
# Firefox
browser_config_firefox = BrowserConfig(browser_type="firefox", headless=True)
browser_config_firefox = BrowserConfig(
browser_type="firefox",
headless=True
)
start = time.time()
async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
print("Firefox:", time.time() - start)
print(result.markdown[:500])
# WebKit
browser_config_webkit = BrowserConfig(browser_type="webkit", headless=True)
browser_config_webkit = BrowserConfig(
browser_type="webkit",
headless=True
)
start = time.time()
async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
print("WebKit:", time.time() - start)
print(result.markdown[:500])
# Chromium (default)
browser_config_chromium = BrowserConfig(browser_type="chromium", headless=True)
browser_config_chromium = BrowserConfig(
browser_type="chromium",
headless=True
)
start = time.time()
async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
result = await crawler.arun(
url="https://www.example.com",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
print("Chromium:", time.time() - start)
print(result.markdown[:500])
# Anti-Bot and User Simulation
async def crawl_with_user_simulation():
browser_config = BrowserConfig(
headless=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
user_agent_generator_config={
"device_type": "mobile",
"os_type": "android"
}
)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
magic=True,
simulate_user=True,
override_navigator=True,
override_navigator=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
result = await crawler.arun(
url="YOUR-URL-HERE",
config=crawler_config
)
print(result.markdown)
async def ssl_certification():
# Configure crawler to fetch SSL certificate
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
# 1. Access certificate properties directly
print("\nCertificate Information:")
print(f"Issuer: {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# 2. Export certificate in different formats
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
# Speed Comparison
async def speed_comparison():
print("\n--- Speed Comparison ---")
# Firecrawl comparison
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
'https://www.nbcnews.com/business',
params={'formats': ['markdown', 'html']}
)
end = time.time()
print("Firecrawl:")
@@ -549,15 +447,16 @@ async def speed_comparison():
# Crawl4AI comparisons
browser_config = BrowserConfig(headless=True)
# Simple crawl
async with AsyncWebCrawler(config=browser_config) as crawler:
start = time.time()
result = await crawler.arun(
url="https://www.nbcnews.com/business",
config=CrawlerRunConfig(
cache_mode=CacheMode.BYPASS, word_count_threshold=0
),
cache_mode=CacheMode.BYPASS,
word_count_threshold=0
)
)
end = time.time()
print("Crawl4AI (simple crawl):")
@@ -575,10 +474,12 @@ async def speed_comparison():
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
threshold=0.48,
threshold_type="fixed",
min_word_threshold=0
)
),
),
)
)
)
end = time.time()
print("Crawl4AI (Markdown Plus):")
@@ -588,31 +489,30 @@ async def speed_comparison():
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
print()
# Main execution
async def main():
# Basic examples
await simple_crawl()
await simple_example_with_running_js_code()
await simple_example_with_css_selector()
# await simple_crawl()
# await simple_example_with_running_js_code()
# await simple_example_with_css_selector()
# Advanced examples
await extract_structured_data_using_css_extractor()
await extract_structured_data_using_llm(
"openai/gpt-4o", os.getenv("OPENAI_API_KEY")
)
await crawl_dynamic_content_pages_method_1()
await crawl_dynamic_content_pages_method_2()
# await extract_structured_data_using_css_extractor()
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
# await crawl_dynamic_content_pages_method_1()
# await crawl_dynamic_content_pages_method_2()
# Browser comparisons
await crawl_custom_browser_type()
# await crawl_custom_browser_type()
# Performance testing
# await speed_comparison()
# Screenshot example
await capture_and_save_screenshot(
"https://www.example.com",
os.path.join(__location__, "tmp/example_screenshot.jpg")
)
# await capture_and_save_screenshot(
# "https://www.example.com",
# os.path.join(__location__, "tmp/example_screenshot.jpg")
# )
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -1,10 +1,6 @@
import os, sys
# append parent directory to system path
sys.path.append(
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
)
os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";
import asyncio
# import nest_asyncio
@@ -19,7 +15,7 @@ from bs4 import BeautifulSoup
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
LLMExtractionStrategy,
@@ -36,12 +32,9 @@ print("Website: https://crawl4ai.com")
async def simple_crawl():
print("\n--- Basic Usage ---")
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_running_js_code():
print("\n--- Executing JavaScript and Using CSS Selectors ---")
# New code to handle the wait_for parameter
@@ -64,7 +57,6 @@ async def simple_example_with_running_js_code():
)
print(result.markdown[:500]) # Print first 500 characters
async def simple_example_with_css_selector():
print("\n--- Using CSS Selectors ---")
async with AsyncWebCrawler(verbose=True) as crawler:
@@ -75,44 +67,42 @@ async def simple_example_with_css_selector():
)
print(result.markdown[:500]) # Print first 500 characters
async def use_proxy():
print("\n--- Using a Proxy ---")
print(
"Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
)
# Uncomment and modify the following lines to use a proxy
async with AsyncWebCrawler(
verbose=True, proxy="http://your-proxy-url:port"
) as crawler:
async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
url="https://www.nbcnews.com/business",
cache_mode= CacheMode.BYPASS
)
if result.success:
print(result.markdown[:500]) # Print first 500 characters
async def capture_and_save_screenshot(url: str, output_path: str):
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url, screenshot=True, cache_mode=CacheMode.BYPASS
url=url,
screenshot=True,
cache_mode= CacheMode.BYPASS
)
if result.success and result.screenshot:
import base64
# Decode the base64 screenshot data
screenshot_data = base64.b64decode(result.screenshot)
# Save the screenshot as a JPEG file
with open(output_path, "wb") as f:
with open(output_path, 'wb') as f:
f.write(screenshot_data)
print(f"Screenshot saved successfully to {output_path}")
else:
print("Failed to capture screenshot")
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -120,19 +110,16 @@ class OpenAIModelFee(BaseModel):
..., description="Fee for output token for the OpenAI model."
)
async def extract_structured_data_using_llm(
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
print(f"\n--- Extracting Structured Data with {provider} ---")
if api_token is None and provider != "ollama":
print(f"API token is required for {provider}. Skipping this example.")
return
# extra_args = {}
extra_args = {
"temperature": 0,
extra_args={
"temperature": 0,
"top_p": 0.9,
"max_tokens": 2000,
# any other supported parameters for litellm
@@ -152,49 +139,52 @@ async def extract_structured_data_using_llm(
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content. One extracted model JSON format should look like this:
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
extra_args=extra_args,
extra_args=extra_args
),
cache_mode=CacheMode.BYPASS,
)
print(result.extracted_content)
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
schema = {
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src",
},
],
}
"name": "KidoCode Courses",
"baseSelector": "section.charge-methodology .w-tab-content > div",
"fields": [
{
"name": "section_title",
"selector": "h3.heading-50",
"type": "text",
},
{
"name": "section_description",
"selector": ".charge-content",
"type": "text",
},
{
"name": "course_name",
"selector": ".text-block-93",
"type": "text",
},
{
"name": "course_description",
"selector": ".course-content-text",
"type": "text",
},
{
"name": "course_icon",
"selector": ".image-92",
"type": "attribute",
"attribute": "src"
}
]
}
async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
async with AsyncWebCrawler(
headless=True,
verbose=True
) as crawler:
# Create the JavaScript that handles clicking multiple times
js_click_tabs = """
(async () => {
@@ -208,20 +198,19 @@ async def extract_structured_data_using_css_extractor():
await new Promise(r => setTimeout(r, 500));
}
})();
"""
"""
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
js_code=[js_click_tabs],
cache_mode=CacheMode.BYPASS,
cache_mode=CacheMode.BYPASS
)
companies = json.loads(result.extracted_content)
print(f"Successfully extracted {len(companies)} companies")
print(json.dumps(companies[0], indent=2))
# Advanced Session-Based Crawling with Dynamic Content 🔄
async def crawl_dynamic_content_pages_method_1():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -278,7 +267,6 @@ async def crawl_dynamic_content_pages_method_1():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_2():
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -346,11 +334,8 @@ async def crawl_dynamic_content_pages_method_2():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_dynamic_content_pages_method_3():
print(
"\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
)
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")
async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -372,7 +357,7 @@ async def crawl_dynamic_content_pages_method_3():
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.firstCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
@@ -410,53 +395,40 @@ async def crawl_dynamic_content_pages_method_3():
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
async def crawl_custom_browser_type():
# Use Firefox
start = time.time()
async with AsyncWebCrawler(
browser_type="firefox", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use WebKit
start = time.time()
async with AsyncWebCrawler(
browser_type="webkit", verbose=True, headless=True
) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
# Use Chromium (default)
start = time.time()
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
result = await crawler.arun(
url="https://www.example.com", cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
print(result.markdown[:500])
print("Time taken: ", time.time() - start)
async def crawl_with_user_simultion():
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
url = "YOUR-URL-HERE"
result = await crawler.arun(
url=url,
url=url,
cache_mode=CacheMode.BYPASS,
magic=True, # Automatically detects and removes overlays, popups, and other elements that block content
magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
# override_navigator = True # Overrides the navigator object to make it look like a real user
)
print(result.markdown)
print(result.markdown)
async def speed_comparison():
# print("\n--- Speed Comparison ---")
@@ -467,18 +439,18 @@ async def speed_comparison():
# print()
# Simulated Firecrawl performance
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
'https://www.nbcnews.com/business',
params={'formats': ['markdown', 'html']}
)
end = time.time()
print("Firecrawl:")
print(f"Time taken: {end - start:.2f} seconds")
print(f"Content length: {len(scrape_status['markdown'])} characters")
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
print()
print()
async with AsyncWebCrawler() as crawler:
# Crawl4AI simple crawl
@@ -502,9 +474,7 @@ async def speed_comparison():
url="https://www.nbcnews.com/business",
word_count_threshold=0,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
cache_mode=CacheMode.BYPASS,
@@ -528,9 +498,7 @@ async def speed_comparison():
word_count_threshold=0,
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
)
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
),
verbose=False,
@@ -552,12 +520,11 @@ async def speed_comparison():
print("If you run these tests in an environment with better network conditions,")
print("you may observe an even more significant speed advantage for Crawl4AI.")
async def generate_knowledge_graph():
class Entity(BaseModel):
name: str
description: str
class Relationship(BaseModel):
entity1: Entity
entity2: Entity
@@ -569,11 +536,11 @@ async def generate_knowledge_graph():
relationships: List[Relationship]
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", # Or any other provider, including Ollama and open source models
api_token=os.getenv("OPENAI_API_KEY"), # In case of Ollama just pass "no-token"
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""Extract entities and relationships from the given text.""",
provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""Extract entities and relationships from the given text."""
)
async with AsyncWebCrawler() as crawler:
url = "https://paulgraham.com/love.html"
@@ -587,22 +554,27 @@ async def generate_knowledge_graph():
with open(os.path.join(__location__, "kb.json"), "w") as f:
f.write(result.extracted_content)
async def fit_markdown_remove_overlay():
async with AsyncWebCrawler(
headless=True, # Set to False to see what is happening
verbose=True,
user_agent_mode="random",
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
headless=True, # Set to False to see what is happening
verbose=True,
user_agent_mode="random",
user_agent_generator_config={
"device_type": "mobile",
"os_type": "android"
},
) as crawler:
result = await crawler.arun(
url="https://www.kidocode.com/degrees/technology",
url='https://www.kidocode.com/degrees/technology',
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48, threshold_type="fixed", min_word_threshold=0
),
options={"ignore_links": True},
options={
"ignore_links": True
}
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
@@ -611,38 +583,31 @@ async def fit_markdown_remove_overlay():
# }
# ),
)
if result.success:
print(len(result.markdown_v2.raw_markdown))
print(len(result.markdown_v2.markdown_with_citations))
print(len(result.markdown_v2.fit_markdown))
# Save clean html
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
f.write(result.cleaned_html)
with open(
os.path.join(__location__, "output/output_raw_markdown.md"), "w"
) as f:
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
f.write(result.markdown_v2.raw_markdown)
with open(
os.path.join(__location__, "output/output_markdown_with_citations.md"),
"w",
) as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(
os.path.join(__location__, "output/output_fit_markdown.md"), "w"
) as f:
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
f.write(result.markdown_v2.fit_markdown)
print("Done")
async def main():
# await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
# await simple_crawl()
# await simple_example_with_running_js_code()
# await simple_example_with_css_selector()
@@ -653,7 +618,7 @@ async def main():
# LLM extraction examples
# await extract_structured_data_using_llm()
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
# await extract_structured_data_using_llm("ollama/llama3.2")
# You always can pass custom headers to the extraction strategy
# custom_headers = {
@@ -661,14 +626,14 @@ async def main():
# "X-Custom-Header": "Some-Value"
# }
# await extract_structured_data_using_llm(extra_headers=custom_headers)
# await crawl_dynamic_content_pages_method_1()
# await crawl_dynamic_content_pages_method_2()
await crawl_dynamic_content_pages_method_1()
await crawl_dynamic_content_pages_method_2()
await crawl_dynamic_content_pages_method_3()
# await crawl_custom_browser_type()
# await speed_comparison()
await crawl_custom_browser_type()
await speed_comparison()
if __name__ == "__main__":

View File

@@ -10,17 +10,15 @@ from functools import lru_cache
console = Console()
@lru_cache()
def create_crawler():
crawler = WebCrawler(verbose=True)
crawler.warmup()
return crawler
def print_result(result):
# Print each key in one line and just the first 10 characters of each one's value and three dots
console.print("\t[bold]Result:[/bold]")
console.print(f"\t[bold]Result:[/bold]")
for key, value in result.model_dump().items():
if isinstance(value, str) and value:
console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -35,27 +33,18 @@ def cprint(message, press_any_key=False):
console.print("Press any key to continue...", style="")
input()
def basic_usage(crawler):
cprint(
"🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", only_text = True)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result)
def basic_usage_some_params(crawler):
cprint(
"🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
)
result = crawler.run(
url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
)
cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text = True)
cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
print_result(result)
def screenshot_usage(crawler):
cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -66,23 +55,16 @@ def screenshot_usage(crawler):
cprint("Screenshot saved to 'screenshot.png'!")
print_result(result)
def understanding_parameters(crawler):
cprint(
"\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
)
cprint(
"By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
)
cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.")
# First crawl (reads from cache)
cprint("1⃣ First crawl (caches the result):", True)
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business")
end_time = time.time()
cprint(
f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
)
cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
print_result(result)
# Force to crawl again
@@ -90,232 +72,169 @@ def understanding_parameters(crawler):
start_time = time.time()
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
end_time = time.time()
cprint(
f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
)
cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
print_result(result)
def add_chunking_strategy(crawler):
# Adding a chunking strategy: RegexChunking
cprint(
"\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
True,
)
cprint(
"RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
)
cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"]),
chunking_strategy=RegexChunking(patterns=["\n\n"])
)
cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
print_result(result)
# Adding another chunking strategy: NlpSentenceChunking
cprint(
"\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
True,
)
cprint(
"NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
)
cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
result = crawler.run(
url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)
cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
print_result(result)
def add_extraction_strategy(crawler):
# Adding an extraction strategy: CosineStrategy
cprint(
"\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
True,
)
cprint(
"CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
)
cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
word_count_threshold=10,
max_dist=0.2,
linkage_method="ward",
top_k=3,
sim_threshold=0.3,
verbose=True,
),
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold = 0.3, verbose=True)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
print_result(result)
# Using semantic_filter with CosineStrategy
cprint(
"You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
)
cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(
semantic_filter="inflation rent prices",
),
)
cprint(
"[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
)
)
cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
print_result(result)
def add_llm_extraction_strategy(crawler):
# Adding an LLM extraction strategy without instructions
cprint(
"\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
True,
)
cprint(
"LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
)
cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
print_result(result)
# Adding an LLM extraction strategy with instructions
cprint(
"\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
True,
)
cprint(
"Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
)
cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="I am interested in only financial news",
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
print_result(result)
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="Extract only content related to technology",
),
)
cprint(
"[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract only content related to technology"
)
)
cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]")
print_result(result)
def targeted_extraction(crawler):
# Using a CSS selector to extract only H2 tags
cprint(
"\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
True,
cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)
result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
print_result(result)
def interactive_extraction(crawler):
# Passing JavaScript code to interact with the page
cprint(
"\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
cprint(
"[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
result = crawler.run(
url="https://www.nbcnews.com/business",
js = js_code
)
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result)
def multiple_scrip(crawler):
# Passing JavaScript code to interact with the page
cprint(
"\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
True,
)
cprint(
"In this example we try to click the 'Load More' button on the page using JavaScript code."
)
js_code = [
"""
cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
] * 2
"""] * 2
# crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
# crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
cprint(
"[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
result = crawler.run(
url="https://www.nbcnews.com/business",
js = js_code
)
cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
print_result(result)
def using_crawler_hooks(crawler):
# Example usage of the hooks for authentication and setting a cookie
def on_driver_created(driver):
print("[HOOK] on_driver_created")
# Example customization: maximize the window
driver.maximize_window()
# Example customization: logging in to a hypothetical website
driver.get("https://example.com/login")
driver.get('https://example.com/login')
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.NAME, "username"))
EC.presence_of_element_located((By.NAME, 'username'))
)
driver.find_element(By.NAME, "username").send_keys("testuser")
driver.find_element(By.NAME, "password").send_keys("password123")
driver.find_element(By.NAME, "login").click()
driver.find_element(By.NAME, 'username').send_keys('testuser')
driver.find_element(By.NAME, 'password').send_keys('password123')
driver.find_element(By.NAME, 'login').click()
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.ID, "welcome"))
EC.presence_of_element_located((By.ID, 'welcome'))
)
# Add a custom cookie
driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
return driver
driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
return driver
def before_get_url(driver):
print("[HOOK] before_get_url")
# Example customization: add a custom header
# Enable Network domain for sending headers
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd('Network.enable', {})
# Add a custom header
driver.execute_cdp_cmd(
"Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
)
driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
return driver
def after_get_url(driver):
print("[HOOK] after_get_url")
# Example customization: log the URL
@@ -327,59 +246,48 @@ def using_crawler_hooks(crawler):
# Example customization: log the HTML
print(len(html))
return driver
cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
True,
)
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook("on_driver_created", on_driver_created)
crawler_strategy.set_hook("before_get_url", before_get_url)
crawler_strategy.set_hook("after_get_url", after_get_url)
crawler_strategy.set_hook("before_return_html", before_return_html)
crawler_strategy.set_hook('on_driver_created', on_driver_created)
crawler_strategy.set_hook('before_get_url', before_get_url)
crawler_strategy.set_hook('after_get_url', after_get_url)
crawler_strategy.set_hook('before_return_html', before_return_html)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup()
crawler.warmup()
result = crawler.run(url="https://example.com")
cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print_result(result=result)
print_result(result= result)
def using_crawler_hooks_dleay_example(crawler):
def delay(driver):
print("Delaying for 5 seconds...")
time.sleep(5)
print("Resuming...")
def create_crawler():
crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
crawler_strategy.set_hook("after_get_url", delay)
crawler_strategy.set_hook('after_get_url', delay)
crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
crawler.warmup()
return crawler
cprint(
"\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
)
cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]")
crawler = create_crawler()
result = crawler.run(url="https://google.com", bypass_cache=True)
result = crawler.run(url="https://google.com", bypass_cache=True)
cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print_result(result)
def main():
cprint(
"🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
)
cprint(
"⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
)
cprint(
"If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
)
cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
crawler = create_crawler()
@@ -387,7 +295,7 @@ def main():
basic_usage(crawler)
# basic_usage_some_params(crawler)
understanding_parameters(crawler)
crawler.always_by_pass_cache = True
screenshot_usage(crawler)
add_chunking_strategy(crawler)
@@ -397,10 +305,8 @@ def main():
interactive_extraction(crawler)
multiple_scrip(crawler)
cprint(
"\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
)
cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")
if __name__ == "__main__":
main()

View File

@@ -702,7 +702,7 @@
"\n",
"Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
"\n",
"For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n",
"For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n",
"\n",
"Happy crawling!"
]

View File

@@ -11,9 +11,7 @@ from groq import Groq
# Import threadpools to run the crawl_url function in a separate thread
from concurrent.futures import ThreadPoolExecutor
client = AsyncOpenAI(
base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
)
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
# Instrument the OpenAI client
cl.instrument_openai()
@@ -27,39 +25,41 @@ settings = {
"presence_penalty": 0,
}
def extract_urls(text):
url_pattern = re.compile(r"(https?://\S+)")
url_pattern = re.compile(r'(https?://\S+)')
return url_pattern.findall(text)
def crawl_url(url):
data = {
"urls": [url],
"include_raw_html": True,
"word_count_threshold": 10,
"extraction_strategy": "NoExtractionStrategy",
"chunking_strategy": "RegexChunking",
"chunking_strategy": "RegexChunking"
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
response_data = response.json()
response_data = response_data["results"][0]
return response_data["markdown"]
response_data = response_data['results'][0]
return response_data['markdown']
@cl.on_chat_start
async def on_chat_start():
cl.user_session.set("session", {"history": [], "context": {}})
await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
cl.user_session.set("session", {
"history": [],
"context": {}
})
await cl.Message(
content="Welcome to the chat! How can I assist you today?"
).send()
@cl.on_message
async def on_message(message: cl.Message):
user_session = cl.user_session.get("session")
# Extract URLs from the user's message
urls = extract_urls(message.content)
futures = []
with ThreadPoolExecutor() as executor:
for url in urls:
@@ -69,9 +69,16 @@ async def on_message(message: cl.Message):
for url, result in zip(urls, results):
ref_number = f"REF_{len(user_session['context']) + 1}"
user_session["context"][ref_number] = {"url": url, "content": result}
user_session["context"][ref_number] = {
"url": url,
"content": result
}
user_session["history"].append({"role": "user", "content": message.content})
user_session["history"].append({
"role": "user",
"content": message.content
})
# Create a system message that includes the context
context_messages = [
@@ -88,17 +95,26 @@ async def on_message(message: cl.Message):
"If not, there is no need to add a references section. "
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
"\n\n".join(context_messages)
),
)
}
else:
system_message = {"role": "system", "content": "You are a helpful assistant."}
system_message = {
"role": "system",
"content": "You are a helpful assistant."
}
msg = cl.Message(content="")
await msg.send()
# Get response from the LLM
stream = await client.chat.completions.create(
messages=[system_message, *user_session["history"]], stream=True, **settings
messages=[
system_message,
*user_session["history"]
],
stream=True,
**settings
)
assistant_response = ""
@@ -108,7 +124,10 @@ async def on_message(message: cl.Message):
await msg.stream_token(token)
# Add assistant message to the history
user_session["history"].append({"role": "assistant", "content": assistant_response})
user_session["history"].append({
"role": "assistant",
"content": assistant_response
})
await msg.update()
# Append the reference section to the assistant's response
@@ -135,11 +154,10 @@ async def on_audio_chunk(chunk: cl.AudioChunk):
pass
@cl.step(type="tool")
async def speech_to_text(audio_file):
cli = Groq()
response = await client.audio.transcriptions.create(
model="whisper-large-v3", file=audio_file
)
@@ -154,19 +172,24 @@ async def on_audio_end(elements: list[ElementBased]):
audio_buffer.seek(0) # Move the file pointer to the beginning
audio_file = audio_buffer.read()
audio_mime_type: str = cl.user_session.get("audio_mime_type")
start_time = time.time()
whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
transcription = await speech_to_text(whisper_input)
end_time = time.time()
print(f"Transcription took {end_time - start_time} seconds")
user_msg = cl.Message(author="You", type="user_message", content=transcription)
user_msg = cl.Message(
author="You",
type="user_message",
content=transcription
)
await user_msg.send()
await on_message(user_msg)
if __name__ == "__main__":
from chainlit.cli import run_chainlit
run_chainlit(__file__)

View File

@@ -1,3 +1,4 @@
import requests, base64, os
data = {
@@ -5,50 +6,59 @@ data = {
"screenshot": True,
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()["results"][0]
response = requests.post("https://crawl4ai.com/crawl", json=data)
result = response.json()['results'][0]
print(result.keys())
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
# 'links', 'screenshot', 'markdown', 'extracted_content',
# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
# 'links', 'screenshot', 'markdown', 'extracted_content',
# 'metadata', 'error_message'])
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result["screenshot"]))
f.write(base64.b64decode(result['screenshot']))
# Example of filtering the content using CSS selectors
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"css_selector": "article",
"screenshot": True,
}
# Example of executing a JS script on the page before extracting the content
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"screenshot": True,
"js": [
"""
'js' : ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).
find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
],
"""]
}
# Example of using a custom extraction strategy
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"extraction_strategy": "CosineStrategy",
"extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
"extraction_strategy_args": {
"semantic_filter": "inflation rent prices"
},
}
# Example of using LLM to extract content
data = {
"urls": ["https://www.nbcnews.com/business"],
"urls": [
"https://www.nbcnews.com/business"
],
"extraction_strategy": "LLMExtractionStrategy",
"extraction_strategy_args": {
"provider": "groq/llama3-8b-8192",
"api_token": os.environ.get("GROQ_API_KEY"),
"instruction": """I am interested in only financial news,
and translate them in French.""",
and translate them in French."""
},
}

View File

@@ -1,135 +0,0 @@
import time, re
from crawl4ai.content_scraping_strategy import WebScrapingStrategy, LXMLWebScrapingStrategy
import time
import functools
from collections import defaultdict
class TimingStats:
def __init__(self):
self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))
def add(self, strategy_name, func_name, elapsed):
self.stats[strategy_name][func_name]["calls"] += 1
self.stats[strategy_name][func_name]["total_time"] += elapsed
def report(self):
for strategy_name, funcs in self.stats.items():
print(f"\n{strategy_name} Timing Breakdown:")
print("-" * 60)
print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
print("-" * 60)
for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
avg_ms = (data["total_time"] / data["calls"]) * 1000
print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")
timing_stats = TimingStats()
# Modify timing decorator
def timing_decorator(strategy_name):
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
elapsed = time.time() - start
timing_stats.add(strategy_name, func.__name__, elapsed)
return result
return wrapper
return decorator
# Modified decorator application
def apply_decorators(cls, method_name, strategy_name):
try:
original_method = getattr(cls, method_name)
decorated_method = timing_decorator(strategy_name)(original_method)
setattr(cls, method_name, decorated_method)
except AttributeError:
print(f"Method {method_name} not found in class {cls.__name__}.")
# Apply to key methods
methods_to_profile = [
'_scrap',
# 'process_element',
'_process_element',
'process_image',
]
# Apply decorators to both strategies
for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
for method in methods_to_profile:
apply_decorators(strategy, method, name)
def generate_large_html(n_elements=1000):
html = ['<!DOCTYPE html><html><head></head><body>']
for i in range(n_elements):
html.append(f'''
<div class="article">
<h2>Heading {i}</h2>
<div>
<div>
<p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
</div>
</div>
<img src="image{i}.jpg" alt="Image {i}">
<ul>
<li>List item {i}.1</li>
<li>List item {i}.2</li>
</ul>
</div>
''')
html.append('</body></html>')
return ''.join(html)
def test_scraping():
# Initialize both scrapers
original_scraper = WebScrapingStrategy()
selected_scraper = LXMLWebScrapingStrategy()
# Generate test HTML
print("Generating HTML...")
html = generate_large_html(5000)
print(f"HTML Size: {len(html)/1024:.2f} KB")
# Time the scraping
print("\nStarting scrape...")
start_time = time.time()
kwargs = {
"url": "http://example.com",
"html": html,
"word_count_threshold": 5,
"keep_data_attributes": True
}
t1 = time.perf_counter()
result_selected = selected_scraper.scrap(**kwargs)
t2 = time.perf_counter()
result_original = original_scraper.scrap(**kwargs)
t3 = time.perf_counter()
elapsed = t3 - start_time
print(f"\nScraping completed in {elapsed:.2f} seconds")
timing_stats.report()
# Print stats of LXML output
print("\Turbo Output:")
print(f"\nExtracted links: {len(result_selected.links.internal) + len(result_selected.links.external)}")
print(f"Extracted images: {len(result_selected.media.images)}")
print(f"Clean HTML size: {len(result_selected.cleaned_html)/1024:.2f} KB")
print(f"Scraping time: {t2 - t1:.2f} seconds")
# Print stats of original output
print("\nOriginal Output:")
print(f"\nExtracted links: {len(result_original.links.internal) + len(result_original.links.external)}")
print(f"Extracted images: {len(result_original.media.images)}")
print(f"Clean HTML size: {len(result_original.cleaned_html)/1024:.2f} KB")
print(f"Scraping time: {t3 - t1:.2f} seconds")
if __name__ == "__main__":
test_scraping()

View File

@@ -1,51 +0,0 @@
"""Example showing how to work with SSL certificates in Crawl4AI."""
import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Create tmp directory if it doesn't exist
parent_dir = os.path.dirname(
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
)
tmp_dir = os.path.join(parent_dir, "tmp")
os.makedirs(tmp_dir, exist_ok=True)
async def main():
# Configure crawler to fetch SSL certificate
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, # Bypass cache to always get fresh certificates
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
# 1. Access certificate properties directly
print("\nCertificate Information:")
print(f"Issuer: {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# 2. Export certificate in different formats
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
print("\nCertificate exported to:")
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
pem_data = cert.to_pem(
os.path.join(tmp_dir, "certificate.pem")
) # For web servers
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
der_data = cert.to_der(
os.path.join(tmp_dir, "certificate.der")
) # For Java apps
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,41 +1,39 @@
import os
import time
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"
url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
crawler = WebCrawler()
crawler.warmup()
from pydantic import BaseModel, Field
class PageSummary(BaseModel):
title: str = Field(..., description="Title of the page.")
summary: str = Field(..., description="Summary of the page.")
brief_summary: str = Field(..., description="Brief summary of the page.")
keywords: list = Field(..., description="Keywords assigned to the page.")
result = crawler.run(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
extraction_strategy= LLMExtractionStrategy(
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
schema=PageSummary.model_json_schema(),
extraction_type="schema",
apply_chunking=False,
instruction="From the crawled content, extract the following details: "
"1. Title of the page "
"2. Summary of the page, which is a detailed summary "
"3. Brief summary of the page, which is a paragraph text "
"4. Keywords assigned to the page, which is a list of keywords. "
"The extracted JSON format should look like this: "
'{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
apply_chunking =False,
instruction="From the crawled content, extract the following details: "\
"1. Title of the page "\
"2. Summary of the page, which is a detailed summary "\
"3. Brief summary of the page, which is a paragraph text "\
"4. Keywords assigned to the page, which is a list of keywords. "\
'The extracted JSON format should look like this: '\
'{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
),
bypass_cache=True,
)

View File

@@ -0,0 +1,281 @@
from openai import AsyncOpenAI
from chainlit.types import ThreadDict
import chainlit as cl
from chainlit.input_widget import Select, Switch, Slider
client = AsyncOpenAI()
# Instrument the OpenAI client
cl.instrument_openai()
settings = {
"model": "gpt-3.5-turbo",
"temperature": 0.5,
"max_tokens": 500,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
}
@cl.action_callback("action_button")
async def on_action(action: cl.Action):
print("The user clicked on the action button!")
return "Thank you for clicking on the action button!"
@cl.set_chat_profiles
async def chat_profile():
return [
cl.ChatProfile(
name="GPT-3.5",
markdown_description="The underlying LLM model is **GPT-3.5**.",
icon="https://picsum.photos/200",
),
cl.ChatProfile(
name="GPT-4",
markdown_description="The underlying LLM model is **GPT-4**.",
icon="https://picsum.photos/250",
),
]
@cl.on_chat_start
async def on_chat_start():
settings = await cl.ChatSettings(
[
Select(
id="Model",
label="OpenAI - Model",
values=["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k"],
initial_index=0,
),
Switch(id="Streaming", label="OpenAI - Stream Tokens", initial=True),
Slider(
id="Temperature",
label="OpenAI - Temperature",
initial=1,
min=0,
max=2,
step=0.1,
),
Slider(
id="SAI_Steps",
label="Stability AI - Steps",
initial=30,
min=10,
max=150,
step=1,
description="Amount of inference steps performed on image generation.",
),
Slider(
id="SAI_Cfg_Scale",
label="Stability AI - Cfg_Scale",
initial=7,
min=1,
max=35,
step=0.1,
description="Influences how strongly your generation is guided to match your prompt.",
),
Slider(
id="SAI_Width",
label="Stability AI - Image Width",
initial=512,
min=256,
max=2048,
step=64,
tooltip="Measured in pixels",
),
Slider(
id="SAI_Height",
label="Stability AI - Image Height",
initial=512,
min=256,
max=2048,
step=64,
tooltip="Measured in pixels",
),
]
).send()
chat_profile = cl.user_session.get("chat_profile")
await cl.Message(
content=f"starting chat using the {chat_profile} chat profile"
).send()
print("A new chat session has started!")
cl.user_session.set("session", {
"history": [],
"context": []
})
image = cl.Image(url="https://c.tenor.com/uzWDSSLMCmkAAAAd/tenor.gif", name="cat image", display="inline")
# Attach the image to the message
await cl.Message(
content="You are such a good girl, aren't you?!",
elements=[image],
).send()
text_content = "Hello, this is a text element."
elements = [
cl.Text(name="simple_text", content=text_content, display="inline")
]
await cl.Message(
content="Check out this text element!",
elements=elements,
).send()
elements = [
cl.Audio(path="./assets/audio.mp3", display="inline"),
]
await cl.Message(
content="Here is an audio file",
elements=elements,
).send()
await cl.Avatar(
name="Tool 1",
url="https://avatars.githubusercontent.com/u/128686189?s=400&u=a1d1553023f8ea0921fba0debbe92a8c5f840dd9&v=4",
).send()
await cl.Message(
content="This message should not have an avatar!", author="Tool 0"
).send()
await cl.Message(
content="This message should have an avatar!", author="Tool 1"
).send()
elements = [
cl.File(
name="quickstart.py",
path="./quickstart.py",
display="inline",
),
]
await cl.Message(
content="This message has a file element", elements=elements
).send()
# Sending an action button within a chatbot message
actions = [
cl.Action(name="action_button", value="example_value", description="Click me!")
]
await cl.Message(content="Interact with this action button:", actions=actions).send()
# res = await cl.AskActionMessage(
# content="Pick an action!",
# actions=[
# cl.Action(name="continue", value="continue", label="✅ Continue"),
# cl.Action(name="cancel", value="cancel", label="❌ Cancel"),
# ],
# ).send()
# if res and res.get("value") == "continue":
# await cl.Message(
# content="Continue!",
# ).send()
# import plotly.graph_objects as go
# fig = go.Figure(
# data=[go.Bar(y=[2, 1, 3])],
# layout_title_text="An example figure",
# )
# elements = [cl.Plotly(name="chart", figure=fig, display="inline")]
# await cl.Message(content="This message has a chart", elements=elements).send()
# Sending a pdf with the local file path
# elements = [
# cl.Pdf(name="pdf1", display="inline", path="./pdf1.pdf")
# ]
# cl.Message(content="Look at this local pdf!", elements=elements).send()
@cl.on_settings_update
async def setup_agent(settings):
print("on_settings_update", settings)
@cl.on_stop
def on_stop():
print("The user wants to stop the task!")
@cl.on_chat_end
def on_chat_end():
print("The user disconnected!")
@cl.on_chat_resume
async def on_chat_resume(thread: ThreadDict):
print("The user resumed a previous chat session!")
# @cl.on_message
async def on_message(message: cl.Message):
cl.user_session.get("session")["history"].append({
"role": "user",
"content": message.content
})
response = await client.chat.completions.create(
messages=[
{
"content": "You are a helpful bot",
"role": "system"
},
*cl.user_session.get("session")["history"]
],
**settings
)
# Add assitanr message to the history
cl.user_session.get("session")["history"].append({
"role": "assistant",
"content": response.choices[0].message.content
})
# msg.content = response.choices[0].message.content
# await msg.update()
# await cl.Message(content=response.choices[0].message.content).send()
@cl.on_message
async def on_message(message: cl.Message):
cl.user_session.get("session")["history"].append({
"role": "user",
"content": message.content
})
msg = cl.Message(content="")
await msg.send()
stream = await client.chat.completions.create(
messages=[
{
"content": "You are a helpful bot",
"role": "system"
},
*cl.user_session.get("session")["history"]
],
stream = True,
**settings
)
async for part in stream:
if token := part.choices[0].delta.content or "":
await msg.stream_token(token)
# Add assitanr message to the history
cl.user_session.get("session")["history"].append({
"role": "assistant",
"content": msg.content
})
await msg.update()
if __name__ == "__main__":
from chainlit.cli import run_chainlit
run_chainlit(__file__)

View File

@@ -0,0 +1,238 @@
# Make sure to install the required packageschainlit and groq
import os, time
from openai import AsyncOpenAI
import chainlit as cl
import re
import requests
from io import BytesIO
from chainlit.element import ElementBased
from groq import Groq
# Import threadpools to run the crawl_url function in a separate thread
from concurrent.futures import ThreadPoolExecutor
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
# Instrument the OpenAI client
cl.instrument_openai()
settings = {
"model": "llama3-8b-8192",
"temperature": 0.5,
"max_tokens": 500,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
}
def extract_urls(text):
url_pattern = re.compile(r'(https?://\S+)')
return url_pattern.findall(text)
def crawl_url(url):
data = {
"urls": [url],
"include_raw_html": True,
"word_count_threshold": 10,
"extraction_strategy": "NoExtractionStrategy",
"chunking_strategy": "RegexChunking"
}
response = requests.post("https://crawl4ai.com/crawl", json=data)
response_data = response.json()
response_data = response_data['results'][0]
return response_data['markdown']
@cl.on_chat_start
async def on_chat_start():
cl.user_session.set("session", {
"history": [],
"context": {}
})
await cl.Message(
content="Welcome to the chat! How can I assist you today?"
).send()
@cl.on_message
async def on_message(message: cl.Message):
user_session = cl.user_session.get("session")
# Extract URLs from the user's message
urls = extract_urls(message.content)
futures = []
with ThreadPoolExecutor() as executor:
for url in urls:
futures.append(executor.submit(crawl_url, url))
results = [future.result() for future in futures]
for url, result in zip(urls, results):
ref_number = f"REF_{len(user_session['context']) + 1}"
user_session["context"][ref_number] = {
"url": url,
"content": result
}
# for url in urls:
# # Crawl the content of each URL and add it to the session context with a reference number
# ref_number = f"REF_{len(user_session['context']) + 1}"
# crawled_content = crawl_url(url)
# user_session["context"][ref_number] = {
# "url": url,
# "content": crawled_content
# }
user_session["history"].append({
"role": "user",
"content": message.content
})
# Create a system message that includes the context
context_messages = [
f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
for ref, data in user_session["context"].items()
]
if context_messages:
system_message = {
"role": "system",
"content": (
"You are a helpful bot. Use the following context for answering questions. "
"Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
"If the question requires any information from the provided appendices or context, refer to the sources. "
"If not, there is no need to add a references section. "
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
"\n\n".join(context_messages)
)
}
else:
system_message = {
"role": "system",
"content": "You are a helpful assistant."
}
msg = cl.Message(content="")
await msg.send()
# Get response from the LLM
stream = await client.chat.completions.create(
messages=[
system_message,
*user_session["history"]
],
stream=True,
**settings
)
assistant_response = ""
async for part in stream:
if token := part.choices[0].delta.content:
assistant_response += token
await msg.stream_token(token)
# Add assistant message to the history
user_session["history"].append({
"role": "assistant",
"content": assistant_response
})
await msg.update()
# Append the reference section to the assistant's response
reference_section = "\n\nReferences:\n"
for ref, data in user_session["context"].items():
reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
msg.content += reference_section
await msg.update()
@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.AudioChunk):
if chunk.isStart:
buffer = BytesIO()
# This is required for whisper to recognize the file type
buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
# Initialize the session for a new audio stream
cl.user_session.set("audio_buffer", buffer)
cl.user_session.set("audio_mime_type", chunk.mimeType)
# Write the chunks to a buffer and transcribe the whole audio at the end
cl.user_session.get("audio_buffer").write(chunk.data)
pass
@cl.step(type="tool")
async def speech_to_text(audio_file):
cli = Groq()
# response = cli.audio.transcriptions.create(
# file=audio_file, #(filename, file.read()),
# model="whisper-large-v3",
# )
response = await client.audio.transcriptions.create(
model="whisper-large-v3", file=audio_file
)
return response.text
@cl.on_audio_end
async def on_audio_end(elements: list[ElementBased]):
# Get the audio buffer from the session
audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
audio_buffer.seek(0) # Move the file pointer to the beginning
audio_file = audio_buffer.read()
audio_mime_type: str = cl.user_session.get("audio_mime_type")
# input_audio_el = cl.Audio(
# mime=audio_mime_type, content=audio_file, name=audio_buffer.name
# )
# await cl.Message(
# author="You",
# type="user_message",
# content="",
# elements=[input_audio_el, *elements]
# ).send()
# answer_message = await cl.Message(content="").send()
start_time = time.time()
whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
transcription = await speech_to_text(whisper_input)
end_time = time.time()
print(f"Transcription took {end_time - start_time} seconds")
user_msg = cl.Message(
author="You",
type="user_message",
content=transcription
)
await user_msg.send()
await on_message(user_msg)
# images = [file for file in elements if "image" in file.mime]
# text_answer = await generate_text_answer(transcription, images)
# output_name, output_audio = await text_to_speech(text_answer, audio_mime_type)
# output_audio_el = cl.Audio(
# name=output_name,
# auto_play=True,
# mime=audio_mime_type,
# content=output_audio,
# )
# answer_message.elements = [output_audio_el]
# answer_message.content = transcription
# await answer_message.update()
if __name__ == "__main__":
from chainlit.cli import run_chainlit
run_chainlit(__file__)

View File

@@ -1,5 +1,4 @@
import os, sys
# append the parent directory to the sys.path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
@@ -14,18 +13,19 @@ import json
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
# 1. File Download Processing Example
async def download_example():
"""Example of downloading files from Python.org"""
# downloads_path = os.path.join(os.getcwd(), "downloads")
downloads_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(downloads_path, exist_ok=True)
print(f"Downloads will be saved to: {downloads_path}")
async with AsyncWebCrawler(
accept_downloads=True, downloads_path=downloads_path, verbose=True
accept_downloads=True,
downloads_path=downloads_path,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://www.python.org/downloads/",
@@ -40,9 +40,9 @@ async def download_example():
}
""",
delay_before_return_html=1, # Wait 5 seconds to ensure download starts
cache_mode=CacheMode.BYPASS,
cache_mode=CacheMode.BYPASS
)
if result.downloaded_files:
print("\nDownload successful!")
print("Downloaded files:")
@@ -52,26 +52,25 @@ async def download_example():
else:
print("\nNo files were downloaded")
# 2. Local File and Raw HTML Processing Example
async def local_and_raw_html_example():
"""Example of processing local files and raw HTML"""
# Create a sample HTML file
sample_file = os.path.join(__data__, "sample.html")
with open(sample_file, "w") as f:
f.write(
"""
f.write("""
<html><body>
<h1>Test Content</h1>
<p>This is a test paragraph.</p>
</body></html>
"""
)
""")
async with AsyncWebCrawler(verbose=True) as crawler:
# Process local file
local_result = await crawler.arun(url=f"file://{os.path.abspath(sample_file)}")
local_result = await crawler.arun(
url=f"file://{os.path.abspath(sample_file)}"
)
# Process raw HTML
raw_html = """
<html><body>
@@ -79,15 +78,16 @@ async def local_and_raw_html_example():
<p>This is a test of raw HTML processing.</p>
</body></html>
"""
raw_result = await crawler.arun(url=f"raw:{raw_html}")
raw_result = await crawler.arun(
url=f"raw:{raw_html}"
)
# Clean up
os.remove(sample_file)
print("Local file content:", local_result.markdown)
print("\nRaw HTML content:", raw_result.markdown)
# 3. Enhanced Markdown Generation Example
async def markdown_generation_example():
"""Example of enhanced markdown generation with citations and LLM-friendly features"""
@@ -97,66 +97,58 @@ async def markdown_generation_example():
# user_query="History and cultivation",
bm25_threshold=1.0
)
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent",
content_filter=content_filter,
cache_mode=CacheMode.BYPASS,
cache_mode=CacheMode.BYPASS
)
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/Apple",
css_selector="main div#bodyContent",
content_filter=BM25ContentFilter(),
content_filter=BM25ContentFilter()
)
print(result.markdown_v2.fit_markdown)
print("\nMarkdown Generation Results:")
print(f"1. Original markdown length: {len(result.markdown)}")
print("2. New markdown versions (markdown_v2):")
print(f"2. New markdown versions (markdown_v2):")
print(f" - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
print(
f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}"
)
print(
f" - References section length: {len(result.markdown_v2.references_markdown)}"
)
print(f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}")
print(f" - References section length: {len(result.markdown_v2.references_markdown)}")
if result.markdown_v2.fit_markdown:
print(
f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}"
)
print(f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}")
# Save examples to files
output_dir = os.path.join(__data__, "markdown_examples")
os.makedirs(output_dir, exist_ok=True)
# Save different versions
with open(os.path.join(output_dir, "1_raw_markdown.md"), "w") as f:
f.write(result.markdown_v2.raw_markdown)
with open(os.path.join(output_dir, "2_citations_markdown.md"), "w") as f:
f.write(result.markdown_v2.markdown_with_citations)
with open(os.path.join(output_dir, "3_references.md"), "w") as f:
f.write(result.markdown_v2.references_markdown)
if result.markdown_v2.fit_markdown:
with open(os.path.join(output_dir, "4_filtered_markdown.md"), "w") as f:
f.write(result.markdown_v2.fit_markdown)
print(f"\nMarkdown examples saved to: {output_dir}")
# Show a sample of citations and references
print("\nSample of markdown with citations:")
print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
print("Sample of references:")
print(
"\n".join(result.markdown_v2.references_markdown.split("\n")[:10]) + "..."
)
print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...")
# 4. Browser Management Example
async def browser_management_example():
@@ -164,38 +156,38 @@ async def browser_management_example():
# Use the specified user directory path
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
os.makedirs(user_data_dir, exist_ok=True)
print(f"Browser profile will be saved to: {user_data_dir}")
async with AsyncWebCrawler(
use_managed_browser=True,
user_data_dir=user_data_dir,
headless=False,
verbose=True,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://crawl4ai.com",
# session_id="persistent_session_1",
cache_mode=CacheMode.BYPASS,
)
cache_mode=CacheMode.BYPASS
)
# Use GitHub as an example - it's a good test for browser management
# because it requires proper browser handling
result = await crawler.arun(
url="https://github.com/trending",
# session_id="persistent_session_1",
cache_mode=CacheMode.BYPASS,
cache_mode=CacheMode.BYPASS
)
print("\nBrowser session result:", result.success)
if result.success:
print("Page title:", result.metadata.get("title", "No title found"))
print("Page title:", result.metadata.get('title', 'No title found'))
# 5. API Usage Example
async def api_example():
"""Example of using the new API endpoints"""
api_token = os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
headers = {"Authorization": f"Bearer {api_token}"}
api_token = os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code"
headers = {'Authorization': f'Bearer {api_token}'}
async with aiohttp.ClientSession() as session:
# Submit crawl job
crawl_request = {
@@ -207,17 +199,25 @@ async def api_example():
"name": "Hacker News Articles",
"baseSelector": ".athing",
"fields": [
{"name": "title", "selector": ".title a", "type": "text"},
{"name": "score", "selector": ".score", "type": "text"},
{
"name": "title",
"selector": ".title a",
"type": "text"
},
{
"name": "score",
"selector": ".score",
"type": "text"
},
{
"name": "url",
"selector": ".title a",
"type": "attribute",
"attribute": "href",
},
],
"attribute": "href"
}
]
}
},
}
},
"crawler_params": {
"headless": True,
@@ -227,50 +227,51 @@ async def api_example():
# "screenshot": True,
# "magic": True
}
async with session.post(
"http://localhost:11235/crawl", json=crawl_request, headers=headers
"http://localhost:11235/crawl",
json=crawl_request,
headers=headers
) as response:
task_data = await response.json()
task_id = task_data["task_id"]
# Check task status
while True:
async with session.get(
f"http://localhost:11235/task/{task_id}", headers=headers
f"http://localhost:11235/task/{task_id}",
headers=headers
) as status_response:
result = await status_response.json()
print(f"Task status: {result['status']}")
if result["status"] == "completed":
print("Task completed!")
print("Results:")
news = json.loads(result["results"][0]["extracted_content"])
news = json.loads(result["results"][0]['extracted_content'])
print(json.dumps(news[:4], indent=2))
break
else:
await asyncio.sleep(1)
# Main execution
async def main():
# print("Running Crawl4AI feature examples...")
# print("\n1. Running Download Example:")
# await download_example()
# print("\n2. Running Markdown Generation Example:")
# await markdown_generation_example()
# # print("\n3. Running Local and Raw HTML Example:")
# await local_and_raw_html_example()
# # print("\n4. Running Browser Management Example:")
await browser_management_example()
# print("\n5. Running API Example:")
await api_example()
if __name__ == "__main__":
asyncio.run(main())
asyncio.run(main())

View File

@@ -1,464 +0,0 @@
"""
Crawl4AI v0.4.24 Feature Walkthrough
===================================
This script demonstrates the new features introduced in Crawl4AI v0.4.24.
Each section includes detailed examples and explanations of the new capabilities.
"""
import asyncio
import os
import json
import re
from typing import List
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
LLMExtractionStrategy,
JsonCssExtractionStrategy,
)
from crawl4ai.content_filter_strategy import RelevantContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from bs4 import BeautifulSoup
# Sample HTML for demonstrations
SAMPLE_HTML = """
<div class="article-list">
<article class="post" data-category="tech" data-author="john">
<h2 class="title"><a href="/post-1">First Post</a></h2>
<div class="meta">
<a href="/author/john" class="author">John Doe</a>
<span class="date">2023-12-31</span>
</div>
<div class="content">
<p>First post content...</p>
<a href="/read-more-1" class="read-more">Read More</a>
</div>
</article>
<article class="post" data-category="science" data-author="jane">
<h2 class="title"><a href="/post-2">Second Post</a></h2>
<div class="meta">
<a href="/author/jane" class="author">Jane Smith</a>
<span class="date">2023-12-30</span>
</div>
<div class="content">
<p>Second post content...</p>
<a href="/read-more-2" class="read-more">Read More</a>
</div>
</article>
</div>
"""
async def demo_ssl_features():
"""
Enhanced SSL & Security Features Demo
-----------------------------------
This example demonstrates the new SSL certificate handling and security features:
1. Custom certificate paths
2. SSL verification options
3. HTTPS error handling
4. Certificate validation configurations
These features are particularly useful when:
- Working with self-signed certificates
- Dealing with corporate proxies
- Handling mixed content websites
- Managing different SSL security levels
"""
print("\n1. Enhanced SSL & Security Demo")
print("--------------------------------")
browser_config = BrowserConfig()
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
fetch_ssl_certificate=True, # Enable SSL certificate fetching
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com", config=run_config)
print(f"SSL Crawl Success: {result.success}")
result.ssl_certificate.to_json(
os.path.join(os.getcwd(), "ssl_certificate.json")
)
if not result.success:
print(f"SSL Error: {result.error_message}")
async def demo_content_filtering():
"""
Smart Content Filtering Demo
----------------------
Demonstrates advanced content filtering capabilities:
1. Custom filter to identify and extract specific content
2. Integration with markdown generation
3. Flexible pruning rules
"""
print("\n2. Smart Content Filtering Demo")
print("--------------------------------")
# Create a custom content filter
class CustomNewsFilter(RelevantContentFilter):
def __init__(self):
super().__init__()
# Add news-specific patterns
self.negative_patterns = re.compile(
r"nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending",
re.I,
)
self.min_word_count = 30 # Higher threshold for news content
def filter_content(
self, html: str, min_word_threshold: int = None
) -> List[str]:
"""
Implements news-specific content filtering logic.
Args:
html (str): HTML content to be filtered
min_word_threshold (int, optional): Minimum word count threshold
Returns:
List[str]: List of filtered HTML content blocks
"""
if not html or not isinstance(html, str):
return []
soup = BeautifulSoup(html, "lxml")
if not soup.body:
soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
body = soup.find("body")
# Extract chunks with metadata
chunks = self.extract_text_chunks(
body, min_word_threshold or self.min_word_count
)
# Filter chunks based on news-specific criteria
filtered_chunks = []
for _, text, tag_type, element in chunks:
# Skip if element has negative class/id
if self.is_excluded(element):
continue
# Headers are important in news articles
if tag_type == "header":
filtered_chunks.append(self.clean_element(element))
continue
# For content, check word count and link density
text = element.get_text(strip=True)
if len(text.split()) >= (min_word_threshold or self.min_word_count):
# Calculate link density
links_text = " ".join(
a.get_text(strip=True) for a in element.find_all("a")
)
link_density = len(links_text) / len(text) if text else 1
# Accept if link density is reasonable
if link_density < 0.5:
filtered_chunks.append(self.clean_element(element))
return filtered_chunks
# Create markdown generator with custom filter
markdown_gen = DefaultMarkdownGenerator(content_filter=CustomNewsFilter())
run_config = CrawlerRunConfig(
markdown_generator=markdown_gen, cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://news.ycombinator.com", config=run_config
)
print("Filtered Content Sample:")
print(result.markdown[:500]) # Show first 500 chars
async def demo_json_extraction():
"""
Improved JSON Extraction Demo
---------------------------
Demonstrates the enhanced JSON extraction capabilities:
1. Base element attributes extraction
2. Complex nested structures
3. Multiple extraction patterns
Key features shown:
- Extracting attributes from base elements (href, data-* attributes)
- Processing repeated patterns
- Handling optional fields
"""
print("\n3. Improved JSON Extraction Demo")
print("--------------------------------")
# Define the extraction schema with base element attributes
json_strategy = JsonCssExtractionStrategy(
schema={
"name": "Blog Posts",
"baseSelector": "div.article-list",
"baseFields": [
{"name": "list_id", "type": "attribute", "attribute": "data-list-id"},
{"name": "category", "type": "attribute", "attribute": "data-category"},
],
"fields": [
{
"name": "posts",
"selector": "article.post",
"type": "nested_list",
"baseFields": [
{
"name": "post_id",
"type": "attribute",
"attribute": "data-post-id",
},
{
"name": "author_id",
"type": "attribute",
"attribute": "data-author",
},
],
"fields": [
{
"name": "title",
"selector": "h2.title a",
"type": "text",
"baseFields": [
{
"name": "url",
"type": "attribute",
"attribute": "href",
}
],
},
{
"name": "author",
"selector": "div.meta a.author",
"type": "text",
"baseFields": [
{
"name": "profile_url",
"type": "attribute",
"attribute": "href",
}
],
},
{"name": "date", "selector": "span.date", "type": "text"},
{
"name": "read_more",
"selector": "a.read-more",
"type": "nested",
"fields": [
{"name": "text", "type": "text"},
{
"name": "url",
"type": "attribute",
"attribute": "href",
},
],
},
],
}
],
}
)
# Demonstrate extraction from raw HTML
run_config = CrawlerRunConfig(
extraction_strategy=json_strategy, cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="raw:" + SAMPLE_HTML, # Use raw: prefix for raw HTML
config=run_config,
)
print("Extracted Content:")
print(result.extracted_content)
async def demo_input_formats():
"""
Input Format Handling Demo
----------------------
Demonstrates how LLM extraction can work with different input formats:
1. Markdown (default) - Good for simple text extraction
2. HTML - Better when you need structure and attributes
This example shows how HTML input can be beneficial when:
- You need to understand the DOM structure
- You want to extract both visible text and HTML attributes
- The content has complex layouts like tables or forms
"""
print("\n4. Input Format Handling Demo")
print("---------------------------")
# Create a dummy HTML with rich structure
dummy_html = """
<div class="job-posting" data-post-id="12345">
<header class="job-header">
<h1 class="job-title">Senior AI/ML Engineer</h1>
<div class="job-meta">
<span class="department">AI Research Division</span>
<span class="location" data-remote="hybrid">San Francisco (Hybrid)</span>
</div>
<div class="salary-info" data-currency="USD">
<span class="range">$150,000 - $220,000</span>
<span class="period">per year</span>
</div>
</header>
<section class="requirements">
<div class="technical-skills">
<h3>Technical Requirements</h3>
<ul class="required-skills">
<li class="skill required" data-priority="must-have">
5+ years experience in Machine Learning
</li>
<li class="skill required" data-priority="must-have">
Proficiency in Python and PyTorch/TensorFlow
</li>
<li class="skill preferred" data-priority="nice-to-have">
Experience with distributed training systems
</li>
</ul>
</div>
<div class="soft-skills">
<h3>Professional Skills</h3>
<ul class="required-skills">
<li class="skill required" data-priority="must-have">
Strong problem-solving abilities
</li>
<li class="skill preferred" data-priority="nice-to-have">
Experience leading technical teams
</li>
</ul>
</div>
</section>
<section class="timeline">
<time class="deadline" datetime="2024-02-28">
Application Deadline: February 28, 2024
</time>
</section>
<footer class="contact-section">
<div class="hiring-manager">
<h4>Hiring Manager</h4>
<div class="contact-info">
<span class="name">Dr. Sarah Chen</span>
<span class="title">Director of AI Research</span>
<span class="email">ai.hiring@example.com</span>
</div>
</div>
<div class="team-info">
<p>Join our team of 50+ researchers working on cutting-edge AI applications</p>
</div>
</footer>
</div>
"""
# Use raw:// prefix to pass HTML content directly
url = f"raw://{dummy_html}"
from pydantic import BaseModel, Field
from typing import List, Optional
# Define our schema using Pydantic
class JobRequirement(BaseModel):
category: str = Field(
description="Category of the requirement (e.g., Technical, Soft Skills)"
)
items: List[str] = Field(
description="List of specific requirements in this category"
)
priority: str = Field(
description="Priority level (Required/Preferred) based on the HTML class or context"
)
class JobPosting(BaseModel):
title: str = Field(description="Job title")
department: str = Field(description="Department or team")
location: str = Field(description="Job location, including remote options")
salary_range: Optional[str] = Field(description="Salary range if specified")
requirements: List[JobRequirement] = Field(
description="Categorized job requirements"
)
application_deadline: Optional[str] = Field(
description="Application deadline if specified"
)
contact_info: Optional[dict] = Field(
description="Contact information from footer or contact section"
)
# First try with markdown (default)
markdown_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
schema=JobPosting.model_json_schema(),
extraction_type="schema",
instruction="""
Extract job posting details into structured data. Focus on the visible text content
and organize requirements into categories.
""",
input_format="markdown", # default
)
# Then with HTML for better structure understanding
html_strategy = LLMExtractionStrategy(
provider="openai/gpt-4",
api_token=os.getenv("OPENAI_API_KEY"),
schema=JobPosting.model_json_schema(),
extraction_type="schema",
instruction="""
Extract job posting details, using HTML structure to:
1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
2. Extract contact info from the page footer or dedicated contact section
3. Parse salary information from specially formatted elements
4. Determine application deadline from timestamp or date elements
Use HTML attributes and classes to enhance extraction accuracy.
""",
input_format="html", # explicitly use HTML
)
async with AsyncWebCrawler() as crawler:
# Try with markdown first
markdown_config = CrawlerRunConfig(extraction_strategy=markdown_strategy)
markdown_result = await crawler.arun(url=url, config=markdown_config)
print("\nMarkdown-based Extraction Result:")
items = json.loads(markdown_result.extracted_content)
print(json.dumps(items, indent=2))
# Then with HTML for better structure understanding
html_config = CrawlerRunConfig(extraction_strategy=html_strategy)
html_result = await crawler.arun(url=url, config=html_config)
print("\nHTML-based Extraction Result:")
items = json.loads(html_result.extracted_content)
print(json.dumps(items, indent=2))
# Main execution
async def main():
print("Crawl4AI v0.4.24 Feature Walkthrough")
print("====================================")
# Run all demos
await demo_ssl_features()
await demo_content_filtering()
await demo_json_extraction()
# await demo_input_formats()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,351 +0,0 @@
"""
Crawl4ai v0.4.3b2 Features Demo
============================
This demonstration showcases three major categories of new features in Crawl4ai v0.4.3:
1. Efficiency & Speed:
- Memory-efficient dispatcher strategies
- New scraping algorithm
- Streaming support for batch crawling
2. LLM Integration:
- Automatic schema generation
- LLM-powered content filtering
- Smart markdown generation
3. Core Improvements:
- Robots.txt compliance
- Proxy rotation
- Enhanced URL handling
- Shared data among hooks
- add page routes
Each demo function can be run independently or as part of the full suite.
"""
import asyncio
import os
import json
import re
import random
from typing import Optional, Dict
from dotenv import load_dotenv
load_dotenv()
from crawl4ai import (
AsyncWebCrawler,
BrowserConfig,
CrawlerRunConfig,
CacheMode,
DisplayMode,
MemoryAdaptiveDispatcher,
CrawlerMonitor,
DefaultMarkdownGenerator,
LXMLWebScrapingStrategy,
JsonCssExtractionStrategy,
LLMContentFilter
)
async def demo_memory_dispatcher():
"""Demonstrates the new memory-efficient dispatcher system.
Key Features:
- Adaptive memory management
- Real-time performance monitoring
- Concurrent session control
"""
print("\n=== Memory Dispatcher Demo ===")
try:
# Configuration
browser_config = BrowserConfig(headless=True, verbose=False)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator()
)
# Test URLs
urls = ["http://example.com", "http://example.org", "http://example.net"] * 3
print("\n📈 Initializing crawler with memory monitoring...")
async with AsyncWebCrawler(config=browser_config) as crawler:
monitor = CrawlerMonitor(
max_visible_rows=10,
display_mode=DisplayMode.DETAILED
)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=80.0,
check_interval=0.5,
max_session_permit=5,
monitor=monitor
)
print("\n🚀 Starting batch crawl...")
results = await crawler.arun_many(
urls=urls,
config=crawler_config,
dispatcher=dispatcher
)
print(f"\n✅ Completed {len(results)} URLs successfully")
except Exception as e:
print(f"\n❌ Error in memory dispatcher demo: {str(e)}")
async def demo_streaming_support():
"""
2. Streaming Support Demo
======================
Shows how to process URLs as they complete using streaming
"""
print("\n=== 2. Streaming Support Demo ===")
browser_config = BrowserConfig(headless=True, verbose=False)
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)
# Test URLs
urls = ["http://example.com", "http://example.org", "http://example.net"] * 2
async with AsyncWebCrawler(config=browser_config) as crawler:
# Initialize dispatcher for streaming
dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)
print("Starting streaming crawl...")
async for result in await crawler.arun_many(
urls=urls,
config=crawler_config,
dispatcher=dispatcher
):
# Process each result as it arrives
print(
f"Received result for {result.url} - Success: {result.success}"
)
if result.success:
print(f"Content length: {len(result.markdown)}")
async def demo_content_scraping():
"""
3. Content Scraping Strategy Demo
==============================
Demonstrates the new LXMLWebScrapingStrategy for faster content scraping.
"""
print("\n=== 3. Content Scraping Strategy Demo ===")
crawler = AsyncWebCrawler()
url = "https://example.com/article"
# Configure with the new LXML strategy
config = CrawlerRunConfig(
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True
)
print("Scraping content with LXML strategy...")
async with crawler:
result = await crawler.arun(url, config=config)
if result.success:
print("Successfully scraped content using LXML strategy")
async def demo_llm_markdown():
"""
4. LLM-Powered Markdown Generation Demo
===================================
Shows how to use the new LLM-powered content filtering and markdown generation.
"""
print("\n=== 4. LLM-Powered Markdown Generation Demo ===")
crawler = AsyncWebCrawler()
url = "https://docs.python.org/3/tutorial/classes.html"
content_filter = LLMContentFilter(
provider="openai/gpt-4o",
api_token=os.getenv("OPENAI_API_KEY"),
instruction="""
Focus on extracting the core educational content about Python classes.
Include:
- Key concepts and their explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
- Version information
- Any non-essential UI elements
Format the output as clean markdown with proper code blocks and headers.
""",
verbose=True,
)
# Configure LLM-powered markdown generation
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=content_filter
),
cache_mode = CacheMode.BYPASS,
verbose=True
)
print("Generating focused markdown with LLM...")
async with crawler:
result = await crawler.arun(url, config=config)
if result.success and result.markdown_v2:
print("Successfully generated LLM-filtered markdown")
print("First 500 chars of filtered content:")
print(result.markdown_v2.fit_markdown[:500])
print("Successfully generated LLM-filtered markdown")
async def demo_robots_compliance():
"""
5. Robots.txt Compliance Demo
==========================
Demonstrates the new robots.txt compliance feature with SQLite caching.
"""
print("\n=== 5. Robots.txt Compliance Demo ===")
crawler = AsyncWebCrawler()
urls = ["https://example.com", "https://facebook.com", "https://twitter.com"]
# Enable robots.txt checking
config = CrawlerRunConfig(check_robots_txt=True, verbose=True)
print("Crawling with robots.txt compliance...")
async with crawler:
results = await crawler.arun_many(urls, config=config)
for result in results:
if result.status_code == 403:
print(f"Access blocked by robots.txt: {result.url}")
elif result.success:
print(f"Successfully crawled: {result.url}")
async def demo_json_schema_generation():
"""
7. LLM-Powered Schema Generation Demo
=================================
Demonstrates automatic CSS and XPath schema generation using LLM models.
"""
print("\n=== 7. LLM-Powered Schema Generation Demo ===")
# Example HTML content for a job listing
html_content = """
<div class="job-listing">
<h1 class="job-title">Senior Software Engineer</h1>
<div class="job-details">
<span class="location">San Francisco, CA</span>
<span class="salary">$150,000 - $200,000</span>
<div class="requirements">
<h2>Requirements</h2>
<ul>
<li>5+ years Python experience</li>
<li>Strong background in web crawling</li>
</ul>
</div>
</div>
</div>
"""
print("Generating CSS selectors schema...")
# Generate CSS selectors with a specific query
css_schema = JsonCssExtractionStrategy.generate_schema(
html_content,
schema_type="CSS",
query="Extract job title, location, and salary information",
provider="openai/gpt-4o", # or use other providers like "ollama"
)
print("\nGenerated CSS Schema:")
print(css_schema)
# Example of using the generated schema with crawler
crawler = AsyncWebCrawler()
url = "https://example.com/job-listing"
# Create an extraction strategy with the generated schema
extraction_strategy = JsonCssExtractionStrategy(schema=css_schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy, verbose=True)
print("\nTesting generated schema with crawler...")
async with crawler:
result = await crawler.arun(url, config=config)
if result.success:
print(json.dumps(result.extracted_content, indent=2) if result.extracted_content else None)
print("Successfully used generated schema for crawling")
async def demo_proxy_rotation():
"""
8. Proxy Rotation Demo
===================
Demonstrates how to rotate proxies for each request using Crawl4ai.
"""
print("\n=== 8. Proxy Rotation Demo ===")
async def get_next_proxy(proxy_file: str = f"proxies.txt") -> Optional[Dict]:
"""Get next proxy from local file"""
try:
proxies = os.getenv("PROXIES", "").split(",")
ip, port, username, password = random.choice(proxies).split(":")
return {
"server": f"http://{ip}:{port}",
"username": username,
"password": password,
"ip": ip # Store original IP for verification
}
except Exception as e:
print(f"Error loading proxy: {e}")
return None
# Create 10 test requests to httpbin
urls = ["https://httpbin.org/ip"] * 2
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler(config=browser_config) as crawler:
for url in urls:
proxy = await get_next_proxy()
if not proxy:
print("No proxy available, skipping...")
continue
# Create new config with proxy
current_config = run_config.clone(proxy_config=proxy, user_agent="")
result = await crawler.arun(url=url, config=current_config)
if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
print(f"Proxy {proxy['ip']} -> Response IP: {ip_match.group(0) if ip_match else 'Not found'}")
verified = ip_match.group(0) == proxy['ip']
if verified:
print(f"✅ Proxy working! IP matches: {proxy['ip']}")
else:
print(f"❌ Proxy failed or IP mismatch!")
else:
print(f"Failed with proxy {proxy['ip']}")
async def main():
"""Run all feature demonstrations."""
print("\n📊 Running Crawl4ai v0.4.3 Feature Demos\n")
# Efficiency & Speed Demos
print("\n🚀 EFFICIENCY & SPEED DEMOS")
await demo_memory_dispatcher()
await demo_streaming_support()
await demo_content_scraping()
# # LLM Integration Demos
print("\n🤖 LLM INTEGRATION DEMOS")
await demo_json_schema_generation()
await demo_llm_markdown()
# # Core Improvements
print("\n🔧 CORE IMPROVEMENT DEMOS")
await demo_robots_compliance()
await demo_proxy_rotation()
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,365 +0,0 @@
# Overview of Some Important Advanced Features
(Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
1. **Proxy Usage**
2. **Capturing PDFs & Screenshots**
3. **Handling SSL Certificates**
4. **Custom Headers**
5. **Session Persistence & Local Storage**
6. **Robots.txt Compliance**
> **Prerequisites**
> - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
> - You know how to run or configure your Python environment with Playwright installed
---
## 1. Proxy Usage
If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser_cfg = BrowserConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "myuser",
"password": "mypass",
},
headless=True
)
crawler_cfg = CrawlerRunConfig(
verbose=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://www.whatismyip.com/",
config=crawler_cfg
)
if result.success:
print("[OK] Page fetched via proxy.")
print("Page HTML snippet:", result.html[:200])
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- **`proxy_config`** expects a dict with `server` and optional auth credentials.
- Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.
- If your proxy doesnt need auth, omit `username`/`password`.
---
## 2. Capturing PDFs & Screenshots
Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
)
if result.success:
# Save screenshot
if result.screenshot:
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Save PDF
if result.pdf:
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
print("[OK] PDF & screenshot captured.")
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Why PDF + Screenshot?**
- Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.
- Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.
**Relevant Parameters**
- **`pdf=True`**: Exports the current page as a PDF (base64-encoded in `result.pdf`).
- **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).
- **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
---
## 3. Handling SSL Certificates
If you need to verify or export a sites SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
```python
import asyncio, os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
tmp_dir = os.path.join(os.getcwd(), "tmp")
os.makedirs(tmp_dir, exist_ok=True)
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
print("\nCertificate Information:")
print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# Export in multiple formats:
cert.to_json(os.path.join(tmp_dir, "certificate.json"))
cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
cert.to_der(os.path.join(tmp_dir, "certificate.der"))
print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
else:
print("[ERROR] No certificate or crawl failed.")
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- **`fetch_ssl_certificate=True`** triggers certificate retrieval.
- `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
---
## 4. Custom Headers
Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
# Option 1: Set headers at the crawler strategy level
crawler1 = AsyncWebCrawler(
# The underlying strategy can accept headers in its constructor
crawler_strategy=None # We'll override below for clarity
)
crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
crawler1.crawler_strategy.set_custom_headers({
"Accept-Language": "fr-FR,fr;q=0.9"
})
result1 = await crawler1.arun("https://www.example.com")
print("Example 1 result success:", result1.success)
# Option 2: Pass headers directly to `arun()`
crawler2 = AsyncWebCrawler()
result2 = await crawler2.arun(
url="https://www.example.com",
headers={"Accept-Language": "es-ES,es;q=0.9"}
)
print("Example 2 result success:", result2.success)
if __name__ == "__main__":
asyncio.run(main())
```
**Notes**
- Some sites may react differently to certain headers (e.g., `Accept-Language`).
- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-based-crawling.md) or use `UserAgentGenerator`.
---
## 5. Session Persistence & Local Storage
Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
### 5.1 `storage_state`
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
storage_dict = {
"cookies": [
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/",
"expires": 1699999999.0,
"httpOnly": False,
"secure": False,
"sameSite": "None"
}
],
"origins": [
{
"origin": "https://example.com",
"localStorage": [
{"name": "token", "value": "my_auth_token"}
]
}
]
}
# Provide the storage state as a dictionary to start "already logged in"
async with AsyncWebCrawler(
headless=True,
storage_state=storage_dict
) as crawler:
result = await crawler.arun("https://example.com/protected")
if result.success:
print("Protected page content length:", len(result.html))
else:
print("Failed to crawl protected page")
if __name__ == "__main__":
asyncio.run(main())
```
### 5.2 Exporting & Reusing State
You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
**See**: [Detailed session management tutorial](./session-management.md) or [Explanations → Browser Context & Managed Browser](./identity-based-crawling.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
---
## 6. Robots.txt Compliance
Crawl4AI supports respecting robots.txt rules with efficient caching:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Enable robots.txt checking in config
config = CrawlerRunConfig(
check_robots_txt=True # Will check and respect robots.txt rules
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=config
)
if not result.success and result.status_code == 403:
print("Access denied by robots.txt")
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- Robots.txt files are cached locally for efficiency
- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
- Cache has a default TTL of 7 days
- If robots.txt can't be fetched, crawling is allowed
- Returns 403 status code if URL is disallowed
---
## Putting It All Together
Heres a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, youd tailor each setting to your projects needs.
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# 1. Browser config with proxy + headless
browser_cfg = BrowserConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "myuser",
"password": "mypass",
},
headless=True,
)
# 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
crawler_cfg = CrawlerRunConfig(
pdf=True,
screenshot=True,
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS,
headers={"Accept-Language": "en-US,en;q=0.8"},
storage_state="my_storage.json", # Reuse session from a previous sign-in
verbose=True,
)
# 3. Crawl
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url = "https://secure.example.com/protected",
config=crawler_cfg
)
if result.success:
print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
# Save PDF & screenshot
if result.pdf:
with open("result.pdf", "wb") as f:
f.write(b64decode(result.pdf))
if result.screenshot:
with open("result.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Check SSL cert
if result.ssl_certificate:
print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
---
## Conclusion & Next Steps
Youve now explored several **advanced** features:
- **Proxy Usage**
- **PDF & Screenshot** capturing for large or critical pages
- **SSL Certificate** retrieval & exporting
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
- **Robots.txt Compliance**
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
**Last Updated**: 2025-01-01

View File

@@ -0,0 +1,223 @@
# Content Processing
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
## Content Cleaning
### Understanding Clean Content
When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
2. **Content Relevance**: Identifies and preserves meaningful content blocks
3. **Layout Analysis**: Understands page structure to identify main content areas
```python
result = await crawler.arun(
url="https://example.com",
word_count_threshold=10, # Remove blocks with fewer words
excluded_tags=['form', 'nav'], # Remove specific HTML tags
remove_overlay_elements=True # Remove popups/modals
)
# Get clean content
print(result.cleaned_html) # Cleaned HTML
print(result.markdown) # Clean markdown version
```
### Fit Markdown: Smart Content Extraction
One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
#### How Fit Markdown Works
- Analyzes content density and distribution
- Identifies content patterns and structures
- Removes boilerplate content (headers, footers, sidebars)
- Preserves the most relevant content blocks
- Maintains content hierarchy and formatting
#### Perfect For:
- Blog posts and articles
- News content
- Documentation pages
- Any page with a clear main content area
#### Not Recommended For:
- E-commerce product listings
- Search results pages
- Social media feeds
- Pages with multiple equal-weight content sections
```python
result = await crawler.arun(url="https://example.com")
# Get the most relevant content
main_content = result.fit_markdown
# Compare with regular markdown
all_content = result.markdown
print(f"Fit Markdown Length: {len(main_content)}")
print(f"Regular Markdown Length: {len(all_content)}")
```
#### Example Use Case
```python
async def extract_article_content(url: str) -> str:
"""Extract main article content from a blog or news site."""
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url)
# fit_markdown will focus on the article content,
# excluding navigation, ads, and other distractions
return result.fit_markdown
```
## Media Processing
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
### Image Processing
The library handles various image scenarios, including:
- Regular images
- Lazy-loaded images
- Background images
- Responsive images
- Image metadata and context
```python
result = await crawler.arun(url="https://example.com")
for image in result.media["images"]:
# Each image includes rich metadata
print(f"Source: {image['src']}")
print(f"Alt text: {image['alt']}")
print(f"Description: {image['desc']}")
print(f"Context: {image['context']}") # Surrounding text
print(f"Relevance score: {image['score']}") # 0-10 score
```
### Handling Lazy-Loaded Content
Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
```python
result = await crawler.arun(
url="https://example.com",
wait_for="css:img[data-src]", # Wait for lazy images
delay_before_return_html=2.0 # Additional wait time
)
```
### Video and Audio Content
The library extracts video and audio elements with their metadata:
```python
# Process videos
for video in result.media["videos"]:
print(f"Video source: {video['src']}")
print(f"Type: {video['type']}")
print(f"Duration: {video.get('duration')}")
print(f"Thumbnail: {video.get('poster')}")
# Process audio
for audio in result.media["audios"]:
print(f"Audio source: {audio['src']}")
print(f"Type: {audio['type']}")
print(f"Duration: {audio.get('duration')}")
```
## Link Analysis
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
### Link Classification
The library automatically categorizes links into:
- Internal links (same domain)
- External links (different domains)
- Social media links
- Navigation links
- Content links
```python
result = await crawler.arun(url="https://example.com")
# Analyze internal links
for link in result.links["internal"]:
print(f"Internal: {link['href']}")
print(f"Link text: {link['text']}")
print(f"Context: {link['context']}") # Surrounding text
print(f"Type: {link['type']}") # nav, content, etc.
# Analyze external links
for link in result.links["external"]:
print(f"External: {link['href']}")
print(f"Domain: {link['domain']}")
print(f"Type: {link['type']}")
```
### Smart Link Filtering
Control which links are included in the results:
```python
result = await crawler.arun(
url="https://example.com",
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_social_media_domains=[ # Custom social media domains
"facebook.com", "twitter.com", "instagram.com"
],
exclude_domains=["ads.example.com"] # Exclude specific domains
)
```
## Metadata Extraction
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
```python
result = await crawler.arun(url="https://example.com")
metadata = result.metadata
print(f"Title: {metadata['title']}")
print(f"Description: {metadata['description']}")
print(f"Keywords: {metadata['keywords']}")
print(f"Author: {metadata['author']}")
print(f"Published Date: {metadata['published_date']}")
print(f"Modified Date: {metadata['modified_date']}")
print(f"Language: {metadata['language']}")
```
## Best Practices
1. **Use Fit Markdown for Articles**
```python
# Perfect for blog posts, news articles, documentation
content = result.fit_markdown
```
2. **Handle Media Appropriately**
```python
# Filter by relevance score
relevant_images = [
img for img in result.media["images"]
if img['score'] > 5
]
```
3. **Combine Link Analysis with Content**
```python
# Get content links with context
content_links = [
link for link in result.links["internal"]
if link['type'] == 'content'
]
```
4. **Clean Content with Purpose**
```python
# Customize cleaning based on your needs
result = await crawler.arun(
url=url,
word_count_threshold=20, # Adjust based on content type
keep_data_attributes=False, # Remove data attributes
process_iframes=True # Include iframe content
)
```

View File

@@ -1,12 +0,0 @@
# Crawl Dispatcher
Were excited to announce a **Crawl Dispatcher** module that can handle **thousands** of crawling tasks simultaneously. By efficiently managing system resources (memory, CPU, network), this dispatcher ensures high-performance data extraction at scale. It also provides **real-time monitoring** of each crawlers status, memory usage, and overall progress.
Stay tuned—this feature is **coming soon** in an upcoming release of Crawl4AI! For the latest news, keep an eye on our changelogs and follow [@unclecode](https://twitter.com/unclecode) on X.
Below is a **sample** of how the dispatchers performance monitor might look in action:
![Crawl Dispatcher Performance Monitor](../assets/images/dispatcher.png)
We cant wait to bring you this streamlined, **scalable** approach to multi-URL crawling—**watch this space** for updates!

View File

@@ -1,118 +0,0 @@
# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
```python
from crawl4ai.async_configs import BrowserConfig, AsyncWebCrawler
async def main():
config = BrowserConfig(accept_downloads=True) # Enable downloads globally
async with AsyncWebCrawler(config=config) as crawler:
# ... your crawling logic ...
asyncio.run(main())
```
## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
from crawl4ai.async_configs import BrowserConfig
import os
downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
os.makedirs(downloads_path, exist_ok=True)
config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
async def main():
async with AsyncWebCrawler(config=config) as crawler:
result = await crawler.arun(url="https://example.com")
# ...
```
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig(
js_code="""
const downloadLink = document.querySelector('a[href$=".exe"]');
if (downloadLink) {
downloadLink.click();
}
""",
wait_for=5 # Wait 5 seconds for the download to start
)
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
```python
if result.downloaded_files:
print("Downloaded files:")
for file_path in result.downloaded_files:
print(f"- {file_path}")
file_size = os.path.getsize(file_path)
print(f"- File size: {file_size} bytes")
else:
print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import os
from pathlib import Path
async def download_multiple_files(url: str, download_path: str):
config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
async with AsyncWebCrawler(config=config) as crawler:
run_config = CrawlerRunConfig(
js_code="""
const downloadLinks = document.querySelectorAll('a[download]');
for (const link of downloadLinks) {
link.click();
// Delay between clicks
await new Promise(r => setTimeout(r, 2000));
}
""",
wait_for=10 # Wait for all downloads to start
)
result = await crawler.arun(url=url, config=run_config)
if result.downloaded_files:
print("Downloaded files:")
for file in result.downloaded_files:
print(f"- {file}")
else:
print("No files downloaded.")
# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
- **Security:** Scan downloaded files for potential security threats before use.
This revised guide ensures consistency with the `Crawl4AI` codebase by using `BrowserConfig` and `CrawlerRunConfig` for all download-related configurations. Let me know if further adjustments are needed!

View File

@@ -1,254 +1,114 @@
# Hooks & Auth in AsyncWebCrawler
# Hooks & Auth for AsyncWebCrawler
Crawl4AIs **hooks** let you customize the crawler at specific points in the pipeline:
Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
1. **`on_browser_created`** After browser creation.
2. **`on_page_context_created`** After a new context & page are created.
3. **`before_goto`** Just before navigating to a page.
4. **`after_goto`** Right after navigation completes.
5. **`on_user_agent_updated`** Whenever the user agent changes.
6. **`on_execution_started`** Once custom JavaScript execution begins.
7. **`before_retrieve_html`** Just before the crawler retrieves final HTML.
8. **`before_return_html`** Right before returning the HTML content.
## Example: Using Crawler Hooks with AsyncWebCrawler
**Important**: Avoid heavy tasks in `on_browser_created` since you dont yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
> note "Important Hook Usage Warning"
**Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
1. Configure the browser when it's created.
2. Add custom headers before navigating to the URL.
3. Log the current URL after navigation.
4. Perform actions after JavaScript execution.
5. Log the length of the HTML before returning it.
> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
> **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
> **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
Below is an example demonstration.
---
## Example: Using Hooks in AsyncWebCrawler
### Hook Definitions
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser, BrowserContext
async def main():
print("🔗 Hooks Example: Demonstrating recommended usage")
# 1) Configure the browser
browser_config = BrowserConfig(
headless=True,
verbose=True
)
# 2) Configure the crawler run
crawler_run_config = CrawlerRunConfig(
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="body",
cache_mode=CacheMode.BYPASS
)
# 3) Create the crawler instance
crawler = AsyncWebCrawler(config=browser_config)
#
# Define Hook Functions
#
async def on_browser_created(browser, **kwargs):
# Called once the browser instance is created (but no pages or contexts yet)
print("[HOOK] on_browser_created - Browser created successfully!")
# Typically, do minimal setup here if needed
return browser
async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
# Called right after a new page + context are created (ideal for auth or route config).
print("[HOOK] on_page_context_created - Setting up page & context.")
# Example 1: Route filtering (e.g., block images)
async def route_filter(route):
if route.request.resource_type == "image":
print(f"[HOOK] Blocking image request: {route.request.url}")
await route.abort()
else:
await route.continue_()
await context.route("**", route_filter)
# Example 2: (Optional) Simulate a login scenario
# (We do NOT create or close pages here, just do quick steps if needed)
# e.g., await page.goto("https://example.com/login")
# e.g., await page.fill("input[name='username']", "testuser")
# e.g., await page.fill("input[name='password']", "password123")
# e.g., await page.click("button[type='submit']")
# e.g., await page.wait_for_selector("#welcome")
# e.g., await context.add_cookies([...])
# Then continue
# Example 3: Adjust the viewport
await page.set_viewport_size({"width": 1080, "height": 600})
return page
async def before_goto(
page: Page, context: BrowserContext, url: str, **kwargs
):
# Called before navigating to each URL.
print(f"[HOOK] before_goto - About to navigate: {url}")
# e.g., inject custom headers
await page.set_extra_http_headers({
"Custom-Header": "my-value"
})
return page
async def after_goto(
page: Page, context: BrowserContext,
url: str, response, **kwargs
):
# Called after navigation completes.
print(f"[HOOK] after_goto - Successfully loaded: {url}")
# e.g., wait for a certain element if we want to verify
try:
await page.wait_for_selector('.content', timeout=1000)
print("[HOOK] Found .content element!")
except:
print("[HOOK] .content not found, continuing anyway.")
return page
async def on_user_agent_updated(
page: Page, context: BrowserContext,
user_agent: str, **kwargs
):
# Called whenever the user agent updates.
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
return page
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
# Called after custom JavaScript execution begins.
print("[HOOK] on_execution_started - JS code is running!")
return page
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
# Called before final HTML retrieval.
print("[HOOK] before_retrieve_html - We can do final actions")
# Example: Scroll again
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
return page
async def before_return_html(
page: Page, context: BrowserContext, html: str, **kwargs
):
# Called just before returning the HTML in the result.
print(f"[HOOK] before_return_html - HTML length: {len(html)}")
return page
#
# Attach Hooks
#
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
crawler.crawler_strategy.set_hook(
"on_page_context_created", on_page_context_created
)
crawler.crawler_strategy.set_hook("before_goto", before_goto)
crawler.crawler_strategy.set_hook("after_goto", after_goto)
crawler.crawler_strategy.set_hook(
"on_user_agent_updated", on_user_agent_updated
)
crawler.crawler_strategy.set_hook(
"on_execution_started", on_execution_started
)
crawler.crawler_strategy.set_hook(
"before_retrieve_html", before_retrieve_html
)
crawler.crawler_strategy.set_hook(
"before_return_html", before_return_html
)
await crawler.start()
# 4) Run the crawler on an example page
url = "https://example.com"
result = await crawler.arun(url, config=crawler_run_config)
async def on_browser_created(browser: Browser):
print("[HOOK] on_browser_created")
# Example customization: set browser viewport size
context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
page = await context.new_page()
if result.success:
print("\nCrawled URL:", result.url)
print("HTML length:", len(result.html))
else:
print("Error:", result.error_message)
# Example customization: logging in to a hypothetical website
await page.goto('https://example.com/login')
await page.fill('input[name="username"]', 'testuser')
await page.fill('input[name="password"]', 'password123')
await page.click('button[type="submit"]')
await page.wait_for_selector('#welcome')
# Add a custom cookie
await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
await page.close()
await context.close()
await crawler.close()
async def before_goto(page: Page):
print("[HOOK] before_goto")
# Example customization: add custom headers
await page.set_extra_http_headers({'X-Test-Header': 'test'})
if __name__ == "__main__":
asyncio.run(main())
async def after_goto(page: Page):
print("[HOOK] after_goto")
# Example customization: log the URL
print(f"Current URL: {page.url}")
async def on_execution_started(page: Page):
print("[HOOK] on_execution_started")
# Example customization: perform actions after JS execution
await page.evaluate("console.log('Custom JS executed')")
async def before_return_html(page: Page, html: str):
print("[HOOK] before_return_html")
# Example customization: log the HTML length
print(f"HTML length: {len(html)}")
return page
```
---
### Using the Hooks with the AsyncWebCrawler
## Hook Lifecycle Summary
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
1. **`on_browser_created`**:
- Browser is up, but **no** pages or contexts yet.
- Light setup only—dont try to open or close pages here (that belongs in `on_page_context_created`).
async def main():
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
initial_cookies = [
{"name": "sessionId", "value": "abc123", "domain": ".example.com"},
{"name": "userId", "value": "12345", "domain": ".example.com"}
]
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
crawler_strategy.set_hook('on_browser_created', on_browser_created)
crawler_strategy.set_hook('before_goto', before_goto)
crawler_strategy.set_hook('after_goto', after_goto)
crawler_strategy.set_hook('on_execution_started', on_execution_started)
crawler_strategy.set_hook('before_return_html', before_return_html)
async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
result = await crawler.arun(
url="https://example.com",
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="footer"
)
2. **`on_page_context_created`**:
- Perfect for advanced **auth** or route blocking.
- You have a **page** + **context** ready but havent navigated to the target URL yet.
print("📦 Crawler Hooks result:")
print(result)
3. **`before_goto`**:
- Right before navigation. Typically used for setting **custom headers** or logging the target URL.
asyncio.run(main())
```
4. **`after_goto`**:
- After page navigation is done. Good place for verifying content or waiting on essential elements.
### Explanation
5. **`on_user_agent_updated`**:
- Whenever the user agent changes (for stealth or different UA modes).
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
6. **`on_execution_started`**:
- If you set `js_code` or run custom scripts, this runs once your JS is about to start.
### Additional Ideas
7. **`before_retrieve_html`**:
- Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.
8. **`before_return_html`**:
- The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
---
## When to Handle Authentication
**Recommended**: Use **`on_page_context_created`** if you need to:
- Navigate to a login page or fill forms
- Set cookies or localStorage tokens
- Block resource routes to avoid ads
This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
---
## Additional Considerations
- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.
- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.
- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
---
## Conclusion
Hooks provide **fine-grained** control over:
- **Browser** creation (light tasks only)
- **Page** and **context** creation (auth, route blocking)
- **Navigation** phases
- **Final HTML** retrieval
Follow the recommended usage:
- **Login** or advanced tasks in `on_page_context_created`
- **Custom headers** or logs in `before_goto` / `after_goto`
- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.

View File

@@ -1,180 +0,0 @@
# Preserve Your Identity with Crawl4AI
Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring youre recognized as a human and not mistaken for a bot. This tutorial covers:
1. **Managed Browsers** The recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode** A simplified fallback solution for quick automation without persistent identity.
---
## 1. Managed Browsers: Your Digital Identity Solution
**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
### Key Benefits
- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though youre a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.
---
Below is a **partial update** to your **Managed Browsers** tutorial, specifically the section about **creating a user-data directory** using **Playwrights Chromium** binary rather than a system-wide Chrome/Edge. Well show how to **locate** that binary and launch it with a `--user-data-dir` argument to set up your profile. You can then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.
---
### Creating a User Data Directory (Command-Line Approach via Playwright)
If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
1. **Find** the Playwright Chromium binary:
- On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or similar path.
- To see an overview of installed browsers, run:
```bash
python -m playwright install --dry-run
```
or
```bash
playwright install --dry-run
```
(depending on your environment). This shows where Playwright keeps Chromium.
- For instance, you might see a path like:
```
~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
```
on Linux, or a corresponding folder on macOS/Windows.
2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
```bash
# Linux example
~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
--user-data-dir=/home/<you>/my_chrome_profile
```
```bash
# macOS example (Playwrights internal binary)
~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
--user-data-dir=/Users/<you>/my_chrome_profile
```
```powershell
# Windows example (PowerShell/cmd)
"C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
--user-data-dir="C:\Users\<you>\my_chrome_profile"
```
**Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.
- This **opens** a fresh Chromium with your new or existing data folder.
- **Log into** any sites or configure your browser the way you want.
- **Close** when done—your profile data is saved in that folder.
3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir="/home/<you>/my_chrome_profile",
browser_type="chromium"
)
```
- Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
---
## 3. Using Managed Browsers in Crawl4AI
Once you have a data directory with your session data, pass it to **`BrowserConfig`**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
# 1) Reference your persistent data directory
browser_config = BrowserConfig(
headless=True, # 'True' for automated runs
verbose=True,
use_managed_browser=True, # Enables persistent browser strategy
browser_type="chromium",
user_data_dir="/path/to/my-chrome-profile"
)
# 2) Standard crawl config
crawl_config = CrawlerRunConfig(
wait_for="css:.logged-in-content"
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url="https://example.com/private", config=crawl_config)
if result.success:
print("Successfully accessed private data with your identity!")
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
### Workflow
1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl** The site sees your identity as if youre the same user who just logged in.
---
## 4. Magic Mode: Simplified Automation
If you **dont** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
config=CrawlerRunConfig(
magic=True, # Simplifies a lot of interaction
remove_overlay_elements=True,
page_timeout=60000
)
)
```
**Magic Mode**:
- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling
**But** its no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.
---
## 5. Comparing Managed Browsers vs. Magic Mode
| Feature | **Managed Browsers** | **Magic Mode** |
|----------------------------|---------------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in user_data_dir | No persistent data (fresh each run) |
| **Genuine Identity** | Real user profile with full rights & preferences | Emulated user-like patterns, but no actual identity |
| **Complex Sites** | Best for login-gated sites or heavy config | Simple tasks, minimal login or config needed |
| **Setup** | External creation of user_data_dir, then use in Crawl4AI | Single-line approach (`magic=True`) |
| **Reliability** | Extremely consistent (same data across runs) | Good for smaller tasks, can be less stable |
---
## 6. Summary
- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.
**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.
With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.

View File

@@ -1,104 +0,0 @@
## Handling Lazy-Loaded Images
Many websites now load images **lazily** as you scroll. If you need to ensure they appear in your final crawl (and in `result.media`), consider:
1. **`wait_for_images=True`** Wait for images to fully load.
2. **`scan_full_page`** Force the crawler to scroll the entire page, triggering lazy loads.
3. **`scroll_delay`** Add small delays between scroll steps.
**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).
### Example: Ensuring Lazy Images Appear
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.async_configs import CacheMode
async def main():
config = CrawlerRunConfig(
# Force the crawler to wait until images are fully loaded
wait_for_images=True,
# Option 1: If you want to automatically scroll the page to load images
scan_full_page=True, # Tells the crawler to try scrolling the entire page
scroll_delay=0.5, # Delay (seconds) between scroll steps
# Option 2: If the site uses a 'Load More' or JS triggers for images,
# you can also specify js_code or wait_for logic here.
cache_mode=CacheMode.BYPASS,
verbose=True
)
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
result = await crawler.arun("https://www.example.com/gallery", config=config)
if result.success:
images = result.media.get("images", [])
print("Images found:", len(images))
for i, img in enumerate(images[:5]):
print(f"[Image {i}] URL: {img['src']}, Score: {img.get('score','N/A')}")
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Explanation**:
- **`wait_for_images=True`**
The crawler tries to ensure images have finished loading before finalizing the HTML.
- **`scan_full_page=True`**
Tells the crawler to attempt scrolling from top to bottom. Each scroll step helps trigger lazy loading.
- **`scroll_delay=0.5`**
Pause half a second between each scroll step. Helps the site load images before continuing.
**When to Use**:
- **Lazy-Loading**: If images appear only when the user scrolls into view, `scan_full_page` + `scroll_delay` helps the crawler see them.
- **Heavier Pages**: If a page is extremely long, be mindful that scanning the entire page can be slow. Adjust `scroll_delay` or the max scroll steps as needed.
---
## Combining with Other Link & Media Filters
You can still combine **lazy-load** logic with the usual **exclude_external_images**, **exclude_domains**, or link filtration:
```python
config = CrawlerRunConfig(
wait_for_images=True,
scan_full_page=True,
scroll_delay=0.5,
# Filter out external images if you only want local ones
exclude_external_images=True,
# Exclude certain domains for links
exclude_domains=["spammycdn.com"],
)
```
This approach ensures you see **all** images from the main domain while ignoring external ones, and the crawler physically scrolls the entire page so that lazy-loading triggers.
---
## Tips & Troubleshooting
1. **Long Pages**
- Setting `scan_full_page=True` on extremely long or infinite-scroll pages can be resource-intensive.
- Consider using [hooks](../core/page-interaction.md) or specialized logic to load specific sections or “Load More” triggers repeatedly.
2. **Mixed Image Behavior**
- Some sites load images in batches as you scroll. If youre missing images, increase your `scroll_delay` or call multiple partial scrolls in a loop with JS code or hooks.
3. **Combining with Dynamic Wait**
- If the site has a placeholder that only changes to a real image after a certain event, you might do `wait_for="css:img.loaded"` or a custom JS `wait_for`.
4. **Caching**
- If `cache_mode` is enabled, repeated crawls might skip some network fetches. If you suspect caching is missing new images, set `cache_mode=CacheMode.BYPASS` for fresh fetches.
---
With **lazy-loading** support, **wait_for_images**, and **scan_full_page** settings, you can capture the entire gallery or feed of images you expect—even if the site only loads them as the user scrolls. Combine these with the standard media filtering and domain exclusion for a complete link & media handling strategy.

View File

@@ -0,0 +1,52 @@
# Magic Mode & Anti-Bot Protection
Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.
## Magic Mode
The easiest way to bypass anti-bot protections:
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
magic=True # Enables all anti-detection features
)
```
Magic Mode automatically:
- Masks browser automation signals
- Simulates human-like behavior
- Overrides navigator properties
- Handles cookie consent popups
- Manages browser fingerprinting
- Randomizes timing patterns
## Manual Anti-Bot Options
While Magic Mode is recommended, you can also configure individual anti-detection features:
```python
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True # Mask automation signals
)
```
Note: When `magic=True` is used, you don't need to set these individual options.
## Example: Handling Protected Sites
```python
async def crawl_protected_site(url: str):
async with AsyncWebCrawler(headless=True) as crawler:
result = await crawler.arun(
url=url,
magic=True,
remove_overlay_elements=True, # Remove popups/modals
page_timeout=60000 # Increased timeout for protection checks
)
return result.markdown if result.success else None
```

View File

@@ -0,0 +1,136 @@
# Content Filtering in Crawl4AI
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
## Relevance Content Filter
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
## Pruning Content Filter
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
### Usage
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import PruningContentFilter
async def filter_content(url):
async with AsyncWebCrawler() as crawler:
content_filter = PruningContentFilter(
min_word_threshold=5,
threshold_type='dynamic',
threshold=0.45
)
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
if result.success:
print(f"Cleaned Markdown:\n{result.fit_markdown}")
```
### Parameters
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
- `'fixed'`: Uses a constant threshold value for all nodes
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
- For fixed threshold: Nodes scoring below this value are removed
- For dynamic threshold: This value is adjusted based on node properties
### How It Works
The pruning algorithm evaluates each node using multiple metrics:
- Text density: Ratio of actual text to overall node content
- Link density: Proportion of text within links
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
- Content quality: Metrics like text length and structural importance
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
The algorithm is particularly effective for:
- Removing boilerplate content
- Eliminating navigation menus and sidebars
- Preserving main article content
- Maintaining document structure while removing noise
## BM25 Algorithm
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
### Usage
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.content_filter_strategy import BM25ContentFilter
async def filter_content(url, query=None):
async with AsyncWebCrawler() as crawler:
content_filter = BM25ContentFilter(user_query=query)
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
if result.success:
print(f"Filtered Content (JSON):\n{result.extracted_content}")
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
else:
print("Error:", result.error_message)
# Example usage:
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
```
### Parameters
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
## Fit Markdown Flag
Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
## Custom Content Filtering Strategies
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
```python
from crawl4ai.content_filter_strategy import RelevantContentFilter
from bs4 import BeautifulSoup, Tag
from typing import List
class MyCustomFilter(RelevantContentFilter):
def filter_content(self, html: str) -> List[str]:
soup = BeautifulSoup(html, 'lxml')
# Implement custom filtering logic here
# Example: extract all paragraphs within divs with class "article-body"
filtered_paragraphs = []
for tag in soup.select("div.article-body p"):
if isinstance(tag, Tag):
filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.
return filtered_paragraphs
async def custom_filter_demo(url: str):
async with AsyncWebCrawler() as crawler:
custom_filter = MyCustomFilter()
result = await crawler.arun(url, extraction_strategy=custom_filter)
if result.success:
print(result.extracted_content)
```
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
## Conclusion
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.

View File

@@ -1,429 +0,0 @@
# Advanced Multi-URL Crawling with Dispatchers
> **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.
## 1. Introduction
When crawling many URLs:
- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)
**Why Dispatchers?**
- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency
---
## 2. Core Components
### 2.1 Rate Limiter
```python
class RateLimiter:
def __init__(
# Random delay range between requests
base_delay: Tuple[float, float] = (1.0, 3.0),
# Maximum backoff delay
max_delay: float = 60.0,
# Retries before giving up
max_retries: int = 3,
# Status codes triggering backoff
rate_limit_codes: List[int] = [429, 503]
)
```
Heres the revised and simplified explanation of the **RateLimiter**, focusing on constructor parameters and adhering to your markdown style and mkDocs guidelines.
#### RateLimiter Constructor Parameters
The **RateLimiter** is a utility that helps manage the pace of requests to avoid overloading servers or getting blocked due to rate limits. It operates internally to delay requests and handle retries but can be configured using its constructor parameters.
**Parameters of the `RateLimiter` constructor:**
1.**`base_delay`** (`Tuple[float, float]`, default: `(1.0, 3.0)`)
The range for a random delay (in seconds) between consecutive requests to the same domain.
- A random delay is chosen between `base_delay[0]` and `base_delay[1]` for each request.
- This prevents sending requests at a predictable frequency, reducing the chances of triggering rate limits.
**Example:**
If `base_delay = (2.0, 5.0)`, delays could be randomly chosen as `2.3s`, `4.1s`, etc.
---
2.**`max_delay`** (`float`, default: `60.0`)
The maximum allowable delay when rate-limiting errors occur.
- When servers return rate-limit responses (e.g., 429 or 503), the delay increases exponentially with jitter.
- The `max_delay` ensures the delay doesnt grow unreasonably high, capping it at this value.
**Example:**
For a `max_delay = 30.0`, even if backoff calculations suggest a delay of `45s`, it will cap at `30s`.
---
3.**`max_retries`** (`int`, default: `3`)
The maximum number of retries for a request if rate-limiting errors occur.
- After encountering a rate-limit response, the `RateLimiter` retries the request up to this number of times.
- If all retries fail, the request is marked as failed, and the process continues.
**Example:**
If `max_retries = 3`, the system retries a failed request three times before giving up.
---
4.**`rate_limit_codes`** (`List[int]`, default: `[429, 503]`)
A list of HTTP status codes that trigger the rate-limiting logic.
- These status codes indicate the server is overwhelmed or actively limiting requests.
- You can customize this list to include other codes based on specific server behavior.
**Example:**
If `rate_limit_codes = [429, 503, 504]`, the crawler will back off on these three error codes.
---
**How to Use the `RateLimiter`:**
Heres an example of initializing and using a `RateLimiter` in your project:
```python
from crawl4ai import RateLimiter
# Create a RateLimiter with custom settings
rate_limiter = RateLimiter(
base_delay=(2.0, 4.0), # Random delay between 2-4 seconds
max_delay=30.0, # Cap delay at 30 seconds
max_retries=5, # Retry up to 5 times on rate-limiting errors
rate_limit_codes=[429, 503] # Handle these HTTP status codes
)
# RateLimiter will handle delays and retries internally
# No additional setup is required for its operation
```
The `RateLimiter` integrates seamlessly with dispatchers like `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher`, ensuring requests are paced correctly without user intervention. Its internal mechanisms manage delays and retries to avoid overwhelming servers while maximizing efficiency.
### 2.2 Crawler Monitor
The CrawlerMonitor provides real-time visibility into crawling operations:
```python
from crawl4ai import CrawlerMonitor, DisplayMode
monitor = CrawlerMonitor(
# Maximum rows in live display
max_visible_rows=15,
# DETAILED or AGGREGATED view
display_mode=DisplayMode.DETAILED
)
```
**Display Modes**:
1. **DETAILED**: Shows individual task status, memory usage, and timing
2. **AGGREGATED**: Displays summary statistics and overall progress
---
## 3. Available Dispatchers
### 3.1 MemoryAdaptiveDispatcher (Default)
Automatically manages concurrency based on system memory usage:
```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=90.0, # Pause if memory exceeds this
check_interval=1.0, # How often to check memory
max_session_permit=10, # Maximum concurrent tasks
rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(1.0, 2.0),
max_delay=30.0,
max_retries=2
),
monitor=CrawlerMonitor( # Optional monitoring
max_visible_rows=15,
display_mode=DisplayMode.DETAILED
)
)
```
**Constructor Parameters:**
1.**`memory_threshold_percent`** (`float`, default: `90.0`)
Specifies the memory usage threshold (as a percentage). If system memory usage exceeds this value, the dispatcher pauses crawling to prevent system overload.
2.**`check_interval`** (`float`, default: `1.0`)
The interval (in seconds) at which the dispatcher checks system memory usage.
3.**`max_session_permit`** (`int`, default: `10`)
The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.
4.**`memory_wait_timeout`** (`float`, default: `300.0`)
Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.
5.**`rate_limiter`** (`RateLimiter`, default: `None`)
Optional rate-limiting logic to avoid server-side blocking (e.g., for handling 429 or 503 errors). See **RateLimiter** for details.
6.**`monitor`** (`CrawlerMonitor`, default: `None`)
Optional monitoring for real-time task tracking and performance insights. See **CrawlerMonitor** for details.
---
### 3.2 SemaphoreDispatcher
Provides simple concurrency control with a fixed limit:
```python
from crawl4ai.async_dispatcher import SemaphoreDispatcher
dispatcher = SemaphoreDispatcher(
max_session_permit=20, # Maximum concurrent tasks
rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(0.5, 1.0),
max_delay=10.0
),
monitor=CrawlerMonitor( # Optional monitoring
max_visible_rows=15,
display_mode=DisplayMode.DETAILED
)
)
```
**Constructor Parameters:**
1.**`max_session_permit`** (`int`, default: `20`)
The maximum number of concurrent crawling tasks allowed, irrespective of semaphore slots.
2.**`rate_limiter`** (`RateLimiter`, default: `None`)
Optional rate-limiting logic to avoid overwhelming servers. See **RateLimiter** for details.
3.**`monitor`** (`CrawlerMonitor`, default: `None`)
Optional monitoring for tracking task progress and resource usage. See **CrawlerMonitor** for details.
---
## 4. Usage Examples
### 4.1 Batch Processing (Default)
```python
async def crawl_batch():
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=False # Default: get all results at once
)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=10,
monitor=CrawlerMonitor(
display_mode=DisplayMode.DETAILED
)
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Get all results at once
results = await crawler.arun_many(
urls=urls,
config=run_config,
dispatcher=dispatcher
)
# Process all results after completion
for result in results:
if result.success:
await process_result(result)
else:
print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Executes a batch crawl with all URLs processed together after crawling is complete.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.
- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.
- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.
---
### 4.2 Streaming Mode
```python
async def crawl_streaming():
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=True # Enable streaming mode
)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=10,
monitor=CrawlerMonitor(
display_mode=DisplayMode.DETAILED
)
)
async with AsyncWebCrawler(config=browser_config) as crawler:
# Process results as they become available
async for result in await crawler.arun_many(
urls=urls,
config=run_config,
dispatcher=dispatcher
):
if result.success:
# Process each result immediately
await process_result(result)
else:
print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Enables streaming to process results as soon as theyre available.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` for concurrency and memory management.
- **Stream:** Enabled (`stream=True`), allowing real-time processing during crawling.
- **Best Use Case:** When you need to act on results immediately, such as for real-time analytics or progressive data storage.
---
### 4.3 Semaphore-based Crawling
```python
async def crawl_with_semaphore(urls):
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
dispatcher = SemaphoreDispatcher(
semaphore_count=5,
rate_limiter=RateLimiter(
base_delay=(0.5, 1.0),
max_delay=10.0
),
monitor=CrawlerMonitor(
max_visible_rows=15,
display_mode=DisplayMode.DETAILED
)
)
async with AsyncWebCrawler(config=browser_config) as crawler:
results = await crawler.arun_many(
urls,
config=run_config,
dispatcher=dispatcher
)
return results
```
**Review:**
- **Purpose:** Uses `SemaphoreDispatcher` to limit concurrency with a fixed number of slots.
- **Dispatcher:** Configured with a semaphore to control parallel crawling tasks.
- **Rate Limiter:** Prevents servers from being overwhelmed by pacing requests.
- **Best Use Case:** When you want precise control over the number of concurrent requests, independent of system memory.
---
### 4.4 Robots.txt Consideration
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
urls = [
"https://example1.com",
"https://example2.com",
"https://example3.com"
]
config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
check_robots_txt=True, # Will respect robots.txt for each URL
semaphore_count=3 # Max concurrent requests
)
async with AsyncWebCrawler() as crawler:
async for result in crawler.arun_many(urls, config=config):
if result.success:
print(f"Successfully crawled {result.url}")
elif result.status_code == 403 and "robots.txt" in result.error_message:
print(f"Skipped {result.url} - blocked by robots.txt")
else:
print(f"Failed to crawl {result.url}: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
**Review:**
- **Purpose:** Ensures compliance with `robots.txt` rules for ethical and legal web crawling.
- **Configuration:** Set `check_robots_txt=True` to validate each URL against `robots.txt` before crawling.
- **Dispatcher:** Handles requests with concurrency limits (`semaphore_count=3`).
- **Best Use Case:** When crawling websites that strictly enforce robots.txt policies or for responsible crawling practices.
---
## 5. Dispatch Results
Each crawl result includes dispatch information:
```python
@dataclass
class DispatchResult:
task_id: str
memory_usage: float
peak_memory: float
start_time: datetime
end_time: datetime
error_message: str = ""
```
Access via `result.dispatch_result`:
```python
for result in results:
if result.success:
dr = result.dispatch_result
print(f"URL: {result.url}")
print(f"Memory: {dr.memory_usage:.1f}MB")
print(f"Duration: {dr.end_time - dr.start_time}")
```
## 6. Summary
1.**Two Dispatcher Types**:
- MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
- SemaphoreDispatcher: Fixed concurrency limit
2.**Optional Components**:
- RateLimiter: Smart request pacing and backoff
- CrawlerMonitor: Real-time progress visualization
3.**Key Benefits**:
- Automatic memory management
- Built-in rate limiting
- Live progress monitoring
- Flexible concurrency control
Choose the dispatcher that best fits your needs:
- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios

View File

@@ -1,68 +1,84 @@
# Proxy
# Proxy & Security
Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
## Basic Proxy Setup
Simple proxy configuration with `BrowserConfig`:
Simple proxy configuration:
```python
from crawl4ai.async_configs import BrowserConfig
# Using proxy URL
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(
proxy="http://proxy.example.com:8080"
) as crawler:
result = await crawler.arun(url="https://example.com")
# Using SOCKS proxy
browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(
proxy="socks5://proxy.example.com:1080"
) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Authenticated Proxy
Use an authenticated proxy with `BrowserConfig`:
Use proxy with authentication:
```python
from crawl4ai.async_configs import BrowserConfig
proxy_config = {
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
}
browser_config = BrowserConfig(proxy_config=proxy_config)
async with AsyncWebCrawler(config=browser_config) as crawler:
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
result = await crawler.arun(url="https://example.com")
```
Here's the corrected documentation:
## Rotating Proxies
## Rotating Proxies
Example using a proxy rotation service dynamically:
Example using a proxy rotation service:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def get_next_proxy():
# Your proxy rotation logic here
return {"server": "http://next.proxy.com:8080"}
async def main():
browser_config = BrowserConfig()
run_config = CrawlerRunConfig()
async with AsyncWebCrawler(config=browser_config) as crawler:
# For each URL, create a new run config with different proxy
for url in urls:
proxy = await get_next_proxy()
# Clone the config and update proxy - this creates a new browser context
current_config = run_config.clone(proxy_config=proxy)
result = await crawler.arun(url=url, config=current_config)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
async with AsyncWebCrawler() as crawler:
# Update proxy for each request
for url in urls:
proxy = await get_next_proxy()
crawler.update_proxy(proxy)
result = await crawler.arun(url=url)
```
## Custom Headers
Add security-related headers:
```python
headers = {
"X-Forwarded-For": "203.0.113.195",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "no-cache",
"Pragma": "no-cache"
}
async with AsyncWebCrawler(headers=headers) as crawler:
result = await crawler.arun(url="https://example.com")
```
## Combining with Magic Mode
For maximum protection, combine proxy with Magic Mode:
```python
async with AsyncWebCrawler(
proxy="http://proxy.example.com:8080",
headers={"Accept-Language": "en-US"}
) as crawler:
result = await crawler.arun(
url="https://example.com",
magic=True # Enable all anti-detection features
)
```

View File

@@ -0,0 +1,276 @@
# Session-Based Crawling for Dynamic Content
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. To effectively crawl such websites, Crawl4AI provides powerful session-based crawling capabilities.
This guide will explore advanced techniques for crawling dynamic content using Crawl4AI's session management features.
## Understanding Session-Based Crawling
Session-based crawling allows you to maintain a persistent browser session across multiple requests. This is crucial when:
1. The content changes dynamically without URL changes
2. You need to interact with the page (e.g., clicking buttons) between requests
3. The site requires authentication or maintains state across pages
Crawl4AI's `AsyncWebCrawler` class supports session-based crawling through the `session_id` parameter and related methods.
## Basic Concepts
Before diving into examples, let's review some key concepts:
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple `arun` calls to maintain state.
- **JavaScript Execution**: Use the `js_code` parameter to execute JavaScript on the page, such as clicking a "Load More" button.
- **CSS Selectors**: Use these to target specific elements for extraction or interaction.
- **Extraction Strategy**: Define how to extract structured data from the page.
- **Wait Conditions**: Specify conditions to wait for before considering the page loaded.
## Example 1: Basic Session-Based Crawling
Let's start with a basic example of session-based crawling:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
async def basic_session_crawl():
async with AsyncWebCrawler(verbose=True) as crawler:
session_id = "my_session"
url = "https://example.com/dynamic-content"
for page in range(3):
result = await crawler.arun(
url=url,
session_id=session_id,
js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
css_selector=".content-item",
cache_mode=CacheMode.BYPASS
)
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(basic_session_crawl())
```
This example demonstrates:
1. Using a consistent `session_id` across multiple `arun` calls
2. Executing JavaScript to load more content after the first page
3. Using a CSS selector to extract specific content
4. Properly closing the session after crawling
## Advanced Technique 1: Custom Execution Hooks
Crawl4AI allows you to set custom hooks that execute at different stages of the crawling process. This is particularly useful for handling complex loading scenarios.
Here's an example that waits for new content to appear before proceeding:
```python
async def advanced_session_crawl_with_hooks():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector("li.commit-item h4")
commit = await page.query_selector("li.commit-item h4")
commit = await commit.evaluate("(element) => element.textContent")
commit = commit.strip()
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
async with AsyncWebCrawler(verbose=True) as crawler:
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
url = "https://github.com/example/repo/commits/main"
session_id = "commit_session"
all_commits = []
js_next_page = """
const button = document.querySelector('a.pagination-next');
if (button) button.click();
"""
for page in range(3):
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.commit-item",
js_code=js_next_page if page > 0 else None,
cache_mode=CacheMode.BYPASS,
js_only=page > 0
)
commits = result.extracted_content.select("li.commit-item")
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
asyncio.run(advanced_session_crawl_with_hooks())
```
This technique uses a custom `on_execution_started` hook to ensure new content has loaded before proceeding to the next step.
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
Instead of using separate hooks, you can integrate the waiting logic directly into your JavaScript execution. This approach can be more concise and easier to manage for some scenarios.
Here's an example:
```python
async def integrated_js_and_wait_crawl():
async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/example/repo/commits/main"
session_id = "integrated_session"
all_commits = []
js_next_page_and_wait = """
(async () => {
const getCurrentCommit = () => {
const commits = document.querySelectorAll('li.commit-item h4');
return commits.length > 0 ? commits[0].textContent.trim() : null;
};
const initialCommit = getCurrentCommit();
const button = document.querySelector('a.pagination-next');
if (button) button.click();
while (true) {
await new Promise(resolve => setTimeout(resolve, 100));
const newCommit = getCurrentCommit();
if (newCommit && newCommit !== initialCommit) {
break;
}
}
})();
"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.commit-item",
"fields": [
{
"name": "title",
"selector": "h4.commit-title",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
for page in range(3):
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.commit-item",
extraction_strategy=extraction_strategy,
js_code=js_next_page_and_wait if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
asyncio.run(integrated_js_and_wait_crawl())
```
This approach combines the JavaScript for clicking the "next" button and waiting for new content to load into a single script.
## Advanced Technique 3: Using the `wait_for` Parameter
Crawl4AI provides a `wait_for` parameter that allows you to specify a condition to wait for before considering the page fully loaded. This can be particularly useful for dynamic content.
Here's an example:
```python
async def wait_for_parameter_crawl():
async with AsyncWebCrawler(verbose=True) as crawler:
url = "https://github.com/example/repo/commits/main"
session_id = "wait_for_session"
all_commits = []
js_next_page = """
const commits = document.querySelectorAll('li.commit-item h4');
if (commits.length > 0) {
window.lastCommit = commits[0].textContent.trim();
}
const button = document.querySelector('a.pagination-next');
if (button) button.click();
"""
wait_for = """() => {
const commits = document.querySelectorAll('li.commit-item h4');
if (commits.length === 0) return false;
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.lastCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li.commit-item",
"fields": [
{
"name": "title",
"selector": "h4.commit-title",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
for page in range(3):
result = await crawler.arun(
url=url,
session_id=session_id,
css_selector="li.commit-item",
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
await crawler.crawler_strategy.kill_session(session_id)
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
asyncio.run(wait_for_parameter_crawl())
```
This technique separates the JavaScript execution (clicking the "next" button) from the waiting condition, providing more flexibility and clarity in some scenarios.
## Best Practices for Session-Based Crawling
1. **Use Unique Session IDs**: Ensure each crawling session has a unique `session_id` to prevent conflicts.
2. **Close Sessions**: Always close sessions using `kill_session` when you're done to free up resources.
3. **Handle Errors**: Implement proper error handling to deal with unexpected situations during crawling.
4. **Respect Website Terms**: Ensure your crawling adheres to the website's terms of service and robots.txt file.
5. **Implement Delays**: Add appropriate delays between requests to avoid overwhelming the target server.
6. **Use Extraction Strategies**: Leverage `JsonCssExtractionStrategy` or other extraction strategies for structured data extraction.
7. **Optimize JavaScript**: Keep your JavaScript execution concise and efficient to improve crawling speed.
8. **Monitor Performance**: Keep an eye on memory usage and crawling speed, especially for long-running sessions.
## Conclusion
Session-based crawling with Crawl4AI provides powerful capabilities for handling dynamic content and complex web applications. By leveraging session management, JavaScript execution, and waiting strategies, you can effectively crawl and extract data from a wide range of modern websites.
Remember to use these techniques responsibly and in compliance with website policies and ethical web scraping practices.
For more advanced usage and API details, refer to the Crawl4AI API documentation.

Some files were not shown because too many files have changed in this diff Show More