Compare commits

27 Commits

Author SHA1 Message Date
UncleCode
149b69c832 Update README.md 2025-01-26 10:59:48 +08:00
UncleCode
d0586f09a9 Merge branch 'vr0.4.3b3' 2025-01-25 21:57:29 +08:00
UncleCode
09ac7ed008 feat(demo): uncomment feature demos and add fake-useragent dependency
Uncomments demonstration code for memory dispatcher, streaming support,
content scraping, JSON schema generation, LLM markdown, and robots compliance
in the v0.4.3b2 features demo file. Also adds fake-useragent package as a
project dependency.

This change makes all feature demonstrations active by default and ensures
proper user agent handling capabilities.
2025-01-25 21:56:08 +08:00
UncleCode
97796f39d2 docs(examples): update proxy rotation demo and disable other demos
Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality.

No breaking changes.
2025-01-25 21:52:35 +08:00
UncleCode
4d7f91b378 refactor(user-agent): improve user agent generation system
Redesign user agent generation to be more modular and reliable:
- Add abstract base class UAGen for user agent generation
- Implement ValidUAGenerator using fake-useragent library
- Add OnlineUAGenerator for fetching real-world user agents
- Update browser configurations to use new UA generation system
- Improve client hints generation

This change makes the user agent system more maintainable and provides better real-world user agent coverage.
2025-01-25 21:16:39 +08:00
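The refactor described above (an abstract base class with interchangeable generators) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `UAGenSketch` and `FixedPoolGenerator` are hypothetical names, and the client-hint format is a simplified assumption.

```python
import random
import re
from abc import ABC, abstractmethod


class UAGenSketch(ABC):
    """Illustrative stand-in for an abstract user agent generator (hypothetical)."""

    @abstractmethod
    def generate(self, **kwargs) -> str:
        """Return a user agent string."""

    @staticmethod
    def generate_client_hints(user_agent: str) -> str:
        """Build a simplified sec-ch-ua value from a Chrome-style user agent.

        Assumption: only the Chrome major version is used; other browsers
        yield an empty hint string.
        """
        match = re.search(r"Chrome/(\d+)", user_agent)
        if not match:
            return ""
        major = match.group(1)
        return f'"Chromium";v="{major}", "Not_A Brand";v="8"'


class FixedPoolGenerator(UAGenSketch):
    """Hypothetical generator that picks a user agent from a fixed pool."""

    def __init__(self, pool, seed=None):
        self._pool = list(pool)
        self._rng = random.Random(seed)

    def generate(self, **kwargs) -> str:
        # Extra keyword arguments are accepted but ignored in this sketch.
        return self._rng.choice(self._pool)
```

Because client hints are derived by a static method on the base class, callers can compute them for any user agent string without instantiating a generator, which matches how the diff calls `UAGen.generate_client_hints(self.user_agent)`.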
UncleCode
69a77222ef feat(browser): add CDP URL configuration support
Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser.

- Added cdp_url parameter to BrowserConfig
- Added cdp_url support in ManagedBrowser.start() method
- Updated documentation for new parameters
2025-01-24 15:53:47 +08:00
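One way to picture the choice between a remote CDP endpoint and a locally launched browser is the sketch below. `resolve_cdp_endpoint` is a hypothetical helper, not part of the library; the fallback URL format and the scheme check are assumptions made for illustration only.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_cdp_endpoint(
    cdp_url: Optional[str] = None,
    host: str = "localhost",
    debugging_port: int = 9222,
) -> str:
    """Return the CDP endpoint to connect to (hypothetical helper).

    An explicit cdp_url takes precedence; otherwise fall back to the local
    debugging host and port, where a locally launched browser would listen.
    """
    if cdp_url:
        scheme = urlparse(cdp_url).scheme
        # Assumption: only WebSocket and HTTP(S) endpoints make sense here.
        if scheme not in {"ws", "wss", "http", "https"}:
            raise ValueError(f"unsupported CDP URL scheme: {scheme!r}")
        return cdp_url
    return f"http://{host}:{debugging_port}"
```

With the change in this commit, the equivalent user-facing step would be passing `cdp_url` to `BrowserConfig`; how `ManagedBrowser.start()` uses it internally is not shown in this diff.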
UncleCode
0afc3e9e5e refactor(examples): update API usage in features demo
Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls()
and fix result access patterns. Also improve code formatting and remove
extra whitespace.

- Replace dispatcher.run_urls with crawler.arun_many
- Update streaming demo to use new API and correct result access
- Clean up whitespace and formatting
- Simplify result property access patterns
2025-01-23 22:37:29 +08:00
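The difference between batch and streaming result access mentioned above can be illustrated with a stand-in crawler. Everything here is hypothetical: `StubCrawler`, `StubResult`, and the `stream` keyword on `arun_many` are assumptions for the sketch (in the diff below, `stream` appears as a `CrawlerRunConfig` field), so treat it as a picture of the access pattern rather than the library's API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    url: str
    success: bool = True


class StubCrawler:
    """Hypothetical stand-in; not the crawl4ai crawler."""

    async def _fetch(self, url):
        await asyncio.sleep(0)  # placeholder for real network work
        return StubResult(url=url)

    async def arun_many(self, urls, stream=False):
        if stream:
            # Streaming mode: hand back an async generator of results.
            return self._stream(urls)
        # Batch mode: wait for everything and return a list in input order.
        return await asyncio.gather(*(self._fetch(u) for u in urls))

    async def _stream(self, urls):
        tasks = [asyncio.ensure_future(self._fetch(u)) for u in urls]
        for finished in asyncio.as_completed(tasks):
            yield await finished


async def demo(urls):
    crawler = StubCrawler()
    batch = await crawler.arun_many(urls)
    streamed = []
    async for result in await crawler.arun_many(urls, stream=True):
        streamed.append(result.url)
    return [r.url for r in batch], sorted(streamed)
```

In batch mode the caller indexes or iterates a finished list; in streaming mode each result is handled as soon as it completes, so completion order is not guaranteed.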
UncleCode
65d33bcc0f style(docs): improve code formatting in features demo
Clean up whitespace and improve readability in v0_4_3b2_features_demo.py:
- Remove excessive blank lines between functions
- Improve config formatting for better readability
- Uncomment memory dispatcher demo in main function

No breaking changes.
2025-01-23 22:36:58 +08:00
UncleCode
6a01008a2b docs(multi-url): improve documentation clarity and update examples
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script

BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
2025-01-23 22:33:36 +08:00
UncleCode
6dc01eae3a refactor(core): improve type hints and remove unused file
- Add RelevantContentFilter to __init__.py exports
- Update version to 0.4.3b3
- Enhance type hints in async_configs.py
- Remove empty utils.scraping.py file
- Update mkdocs configuration with version info and GitHub integration

BREAKING CHANGE: None
2025-01-23 18:53:22 +08:00
UncleCode
7b7fe84e0d docs(readme): resolve merge conflict and update version info
Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements.

No breaking changes.
2025-01-22 20:52:42 +08:00
UncleCode
5c36f4308f Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-22 20:51:52 +08:00
UncleCode
45809d1c91 Merge branch 'vr0.4.3b2' 2025-01-22 20:51:46 +08:00
UncleCode
357414c345 docs(readme): update version references and fix links
Update version numbers to v0.4.3bx throughout README.md
Fix contributing guidelines link to point to CONTRIBUTORS.md
Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product
Add pre-release installation instructions
Fix minor formatting in personal story section

No breaking changes
2025-01-22 20:46:39 +08:00
Aravind
6dfa9cb703 Streamline Feature requests, bug reports and Forums with Forms & Templates (#465)
* config:Add bug report template and issue chooser
* config: updated new bugs to have needs-triage label by default
* Template for PR
* Added FR template
* Config: updated the text for new labels
* config: changed the order of steps to reproduce
* Config: shortened the form for feature request
* Config: Added a code snippet section to the bug report
2025-01-19 16:53:03 +08:00
devatbosch
8878b3d032 Updated the correct link for "Contribution guidelines" in README.md (#445)
Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again.
2025-01-13 20:57:31 +08:00
Jōnin bingi
1ab9d115cf Fixing minor typos in README (#440)
@mcam10 Thx for the support. Appreciate
2025-01-13 20:23:52 +08:00
UncleCode
f9c601eb7e docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain.

Also add todo/ directory to .gitignore.
2025-01-09 16:24:41 +08:00
UncleCode
ad5e5d21ca Remove .codeiumignore from version control and add to .gitignore 2025-01-08 13:09:23 +08:00
UncleCode
26d821c0de Remove .codeiumignore from version control and add to .gitignore 2025-01-08 13:08:19 +08:00
UncleCode
010677cbee chore: add .gitattributes file
Add initial .gitattributes file to standardize line endings and file handling across different operating systems.

This will help prevent issues with line ending inconsistencies between developers working on different platforms.
2025-01-08 13:05:00 +08:00
UncleCode
fe52311bf4 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-06 15:20:30 +08:00
UncleCode
01b73950ee Merge branch 'vr0.4.267' 2025-01-06 15:20:28 +08:00
UncleCode
12880f1ffa Update gitignore 2025-01-06 15:19:01 +08:00
UncleCode
53be88b677 Update gitignore 2025-01-06 15:18:37 +08:00
UncleCode
3427ead8b8 Update CHANGELOG 2025-01-06 15:13:43 +08:00
aravind
32652189b0 Docs: Add Code of Conduct for the project (#410) 2025-01-06 12:52:51 +08:00
21 changed files with 941 additions and 103 deletions

@@ -0,0 +1,59 @@
title: "[Feature Request]: "
labels: ["⚙️ New"]
body:
  - type: markdown
    attributes:
      value: |
        Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check whether it already exists in
        this discussions category to avoid duplicates. 😊
  - type: textarea
    id: needs_to_be_done
    attributes:
      label: What needs to be done?
      description: Please describe the feature or functionality you'd like to see.
      placeholder: "e.g., Return alt text along with images scraped from a webpage in Result"
    validations:
      required: true
  - type: textarea
    id: problem_to_solve
    attributes:
      label: What problem does this solve?
      description: Explain the pain point or issue this feature will help address.
      placeholder: "e.g., Bypass Captchas added by Cloudflare"
    validations:
      required: true
  - type: textarea
    id: target_users
    attributes:
      label: Target users/beneficiaries
      description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
      placeholder: "e.g., Marketing teams, developers"
    validations:
      required: false
  - type: textarea
    id: current_workarounds
    attributes:
      label: Current alternatives/workarounds
      description: Are there any existing solutions or workarounds? How does this feature improve upon them?
      placeholder: "e.g., Users manually select the CSS classes mapped to data fields to extract them"
    validations:
      required: false
  - type: markdown
    attributes:
      value: |
        ### 💡 Implementation Ideas
  - type: textarea
    id: proposed_approach
    attributes:
      label: Proposed approach
      description: Share any ideas you have for how this feature could be implemented. Point out any challenges you foresee
        and the success metrics for this feature.
      placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
    validations:
      required: false

.github/ISSUE_TEMPLATE/bug_report.yml (new file, 127 lines added)

@@ -0,0 +1,127 @@
name: Bug Report
description: Report a bug with Crawl4AI.
title: "[Bug]: "
labels: ["🐞 Bug","🩺 Needs Triage"]
body:
  - type: input
    id: crawl4ai_version
    attributes:
      label: crawl4ai version
      description: Specify the version of crawl4ai you are using.
      placeholder: "e.g., 2.0.0"
    validations:
      required: true
  - type: textarea
    id: expected_behavior
    attributes:
      label: Expected Behavior
      description: Describe what you expected to happen.
      placeholder: "Provide a detailed explanation of the expected outcome."
    validations:
      required: true
  - type: textarea
    id: current_behavior
    attributes:
      label: Current Behavior
      description: Describe what is happening instead of the expected behavior.
      placeholder: "Describe the actual result or issue you encountered."
    validations:
      required: true
  - type: dropdown
    id: reproducible
    attributes:
      label: Is this reproducible?
      description: Indicate whether this bug can be reproduced consistently.
      options:
        - "Yes"
        - "No"
    validations:
      required: true
  - type: textarea
    id: inputs
    attributes:
      label: Inputs Causing the Bug
      description: Provide details about the inputs causing the issue.
      placeholder: |
        - URL(s):
        - Settings used:
        - Input data (if applicable):
      render: bash
  - type: textarea
    id: steps_to_reproduce
    attributes:
      label: Steps to Reproduce
      description: Provide step-by-step instructions to reproduce the issue.
      placeholder: |
        1. Go to...
        2. Click on...
        3. Observe the issue...
      render: bash
  - type: textarea
    id: code_snippets
    attributes:
      label: Code snippets
      description: Provide code snippets (if any). Add comments as necessary.
      placeholder: print("Hello world")
      render: python
  # Header Section with Title
  - type: markdown
    attributes:
      value: |
        ## Supporting Information
        Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem.
  - type: input
    id: os
    attributes:
      label: OS
      description: Please provide the operating system & distro where the issue occurs.
      placeholder: "e.g., Windows, macOS, Linux"
    validations:
      required: true
  - type: input
    id: python_version
    attributes:
      label: Python version
      description: Specify the Python version being used.
      placeholder: "e.g., 3.8.5"
    validations:
      required: true
  # Browser Field
  - type: input
    id: browser
    attributes:
      label: Browser
      description: Provide the name of the browser you are using.
      placeholder: "e.g., Chrome, Firefox, Safari"
    validations:
      required: false
  # Browser Version Field
  - type: input
    id: browser_version
    attributes:
      label: Browser version
      description: Provide the version of the browser you are using.
      placeholder: "e.g., 91.0.4472.124"
    validations:
      required: false
  # Error Logs Field (Text Area)
  - type: textarea
    id: error_logs
    attributes:
      label: Error logs & Screenshots (if applicable)
      description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
      placeholder: "Paste error logs here and attach your screenshots"
    validations:
      required: false

.github/ISSUE_TEMPLATE/config.yml (new file, 8 lines added)

@@ -0,0 +1,8 @@
blank_issues_enabled: false
contact_links:
  - name: Feature Requests
    url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
    about: "Suggest new features or enhancements for Crawl4AI"
  - name: Forums - Q&A
    url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
    about: "Ask questions or engage in general discussions about Crawl4AI"

.github/pull_request_template.md (new file, 19 lines added)

@@ -0,0 +1,19 @@
## Summary
Please include a summary of the change and/or which issues are fixed.
eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
## List of files changed and why
eg: quickstart.py - To update the example as per new changes
## How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
## Checklist:
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

.gitignore (3 lines added)

@@ -226,6 +226,9 @@ tree.md
.local
.do
/plans
plans/
# Codeium
.codeiumignore
todo/

@@ -5,9 +5,12 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
### Changed
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
## Version 0.4.3 (2025-01-21)
## Version 0.4.3b2 (2025-01-21)
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
@@ -135,9 +138,11 @@ This release introduces several powerful new features, including robots.txt comp
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
#### Other
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
## [0.4.267] - 2025-01-06
### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
## [0.4.24] - 2024-12-31

CODE_OF_CONDUCT.md (new file, 131 lines added)

@@ -0,0 +1,131 @@
# Crawl4AI Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of
actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the
community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].
[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations

@@ -6,7 +6,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
- [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
- [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
- [Aravind Karnam](https://github.com/aravindkarnam) - Developer
- [Aravind Karnam](https://github.com/aravindkarnam) - Head of Community and Product
## Community Contributors

@@ -15,14 +15,15 @@
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
</div>
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.4.3b1x](#-recent-updates)
[✨ Check out latest update v0.4.3bx](#-recent-updates)
🎉 **Version 0.4.3b1 is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -31,7 +32,7 @@ My journey with computers started in childhood when my dad, a computer scientist
Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn't meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
I made Crawl4AI open-source for two reasons. First, it's my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI - a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
I made Crawl4AI open-source for two reasons. First, it's my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI, a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
</details>
@@ -52,6 +53,9 @@ Thank you to everyone who has supported this project, used it, and shared feedba
# Install the package
pip install -U crawl4ai
# For pre release versions
pip install crawl4ai --pre
# Run post-installation setup
crawl4ai-setup
@@ -443,7 +447,7 @@ if __name__ == "__main__":
</details>
<details>
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
<summary>🤖 <strong>Using Your own Browser with Custom User Profile</strong></summary>
```python
import os, sys
@@ -497,10 +501,7 @@ async def test_news_crawl():
- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
Read the full details in our [0.4.248 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
Here's a clear markdown explanation for your users about version numbering:
Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
## Version Numbering in Crawl4AI
@@ -571,7 +572,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https:
## 🤝 Contributing
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
## 📄 License

@@ -16,7 +16,7 @@ from .extraction_strategy import (
)
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import DefaultMarkdownGenerator
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter, RelevantContentFilter
from .models import CrawlResult, MarkdownGenerationResult
from .async_dispatcher import (
MemoryAdaptiveDispatcher,
@@ -44,6 +44,7 @@ __all__ = [
"ChunkingStrategy",
"RegexChunking",
"DefaultMarkdownGenerator",
"RelevantContentFilter",
"PruningContentFilter",
"BM25ContentFilter",
"LLMContentFilter",

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.4.3b2"
__version__ = "0.4.3b3"

@@ -6,12 +6,15 @@ from .config import (
IMAGE_SCORE_THRESHOLD,
SOCIAL_MEDIA_DOMAINS,
)
from .user_agent_generator import UserAgentGenerator
from .user_agent_generator import UserAgentGenerator, UAGen, ValidUAGenerator, OnlineUAGenerator
from .extraction_strategy import ExtractionStrategy
from .chunking_strategy import ChunkingStrategy, RegexChunking
from .markdown_generation_strategy import MarkdownGenerationStrategy
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, LLMContentFilter, PruningContentFilter
from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
from typing import Optional, Union, List
from .cache_context import CacheMode
class BrowserConfig:
@@ -29,6 +32,7 @@ class BrowserConfig:
Default: True.
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
advanced manipulation. Default: False.
cdp_url (str): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: "ws://localhost:9222/devtools/browser/".
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
Automatically sets use_managed_browser=True. Default: False.
@@ -77,17 +81,18 @@ class BrowserConfig:
browser_type: str = "chromium",
headless: bool = True,
use_managed_browser: bool = False,
cdp_url: str = None,
use_persistent_context: bool = False,
user_data_dir: str = None,
chrome_channel: str = "chromium",
channel: str = "chromium",
proxy: Optional[str] = None,
proxy: str = None,
proxy_config: dict = None,
viewport_width: int = 1080,
viewport_height: int = 600,
accept_downloads: bool = False,
downloads_path: str = None,
storage_state=None,
storage_state : Union[str, dict, None]=None,
ignore_https_errors: bool = True,
java_script_enabled: bool = True,
sleep_on_close: bool = False,
@@ -95,19 +100,23 @@ class BrowserConfig:
cookies: list = None,
headers: dict = None,
user_agent: str = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
# "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
# "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
),
user_agent_mode: str = None,
user_agent_generator_config: dict = None,
user_agent_mode: str = "",
user_agent_generator_config: dict = {},
text_mode: bool = False,
light_mode: bool = False,
extra_args: list = None,
debugging_port: int = 9222,
host: str = "localhost",
):
self.browser_type = browser_type
self.headless = headless
self.use_managed_browser = use_managed_browser
self.cdp_url = cdp_url
self.use_persistent_context = use_persistent_context
self.user_data_dir = user_data_dir
self.chrome_channel = chrome_channel or self.browser_type or "chromium"
@@ -136,17 +145,15 @@ class BrowserConfig:
self.verbose = verbose
self.debugging_port = debugging_port
user_agenr_generator = UserAgentGenerator()
if self.user_agent_mode != "random" and self.user_agent_generator_config:
self.user_agent = user_agenr_generator.generate(
fa_user_agenr_generator = ValidUAGenerator()
if self.user_agent_mode == "random":
self.user_agent = fa_user_agenr_generator.generate(
**(self.user_agent_generator_config or {})
)
elif self.user_agent_mode == "random":
self.user_agent = user_agenr_generator.generate()
else:
pass
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
self.browser_hint = UAGen.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint)
# If persistent context is requested, ensure managed browser is enabled
@@ -159,6 +166,7 @@ class BrowserConfig:
browser_type=kwargs.get("browser_type", "chromium"),
headless=kwargs.get("headless", True),
use_managed_browser=kwargs.get("use_managed_browser", False),
cdp_url=kwargs.get("cdp_url"),
use_persistent_context=kwargs.get("use_persistent_context", False),
user_data_dir=kwargs.get("user_data_dir"),
chrome_channel=kwargs.get("chrome_channel", "chromium"),
@@ -191,6 +199,7 @@ class BrowserConfig:
"browser_type": self.browser_type,
"headless": self.headless,
"use_managed_browser": self.use_managed_browser,
"cdp_url": self.cdp_url,
"use_persistent_context": self.use_persistent_context,
"user_data_dir": self.user_data_dir,
"chrome_channel": self.chrome_channel,
@@ -373,6 +382,11 @@ class CrawlerRunConfig:
stream (bool): If True, stream the page content as it is being loaded.
url: str = None # This is not a compulsory parameter
check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
user_agent (str): Custom User-Agent string to use. Default: None
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
user_agent as-is. Default: None.
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
Default: None.
"""
def __init__(
@@ -382,7 +396,7 @@ class CrawlerRunConfig:
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
markdown_generator: MarkdownGenerationStrategy = None,
content_filter=None,
content_filter : RelevantContentFilter = None,
only_text: bool = False,
css_selector: str = None,
excluded_tags: list = None,
@@ -396,7 +410,7 @@ class CrawlerRunConfig:
# SSL Parameters
fetch_ssl_certificate: bool = False,
# Caching Parameters
cache_mode=None,
cache_mode: CacheMode =None,
session_id: str = None,
bypass_cache: bool = False,
disable_cache: bool = False,
@@ -444,6 +458,9 @@ class CrawlerRunConfig:
stream: bool = False,
url: str = None,
check_robots_txt: bool = False,
user_agent: str = None,
user_agent_mode: str = None,
user_agent_generator_config: dict = {},
):
self.url = url
@@ -526,6 +543,11 @@ class CrawlerRunConfig:
# Robots.txt Handling Parameters
self.check_robots_txt = check_robots_txt
# User Agent Parameters
self.user_agent = user_agent
self.user_agent_mode = user_agent_mode
self.user_agent_generator_config = user_agent_generator_config
# Validate type of extraction strategy and chunking strategy if they are provided
if self.extraction_strategy is not None and not isinstance(
self.extraction_strategy, ExtractionStrategy
@@ -623,6 +645,9 @@ class CrawlerRunConfig:
stream=kwargs.get("stream", False),
url=kwargs.get("url"),
check_robots_txt=kwargs.get("check_robots_txt", False),
user_agent=kwargs.get("user_agent"),
user_agent_mode=kwargs.get("user_agent_mode"),
user_agent_generator_config=kwargs.get("user_agent_generator_config", {}),
)
# Create a function that returns a dict of the object
@@ -686,6 +711,9 @@ class CrawlerRunConfig:
"stream": self.stream,
"url": self.url,
"check_robots_txt": self.check_robots_txt,
"user_agent": self.user_agent,
"user_agent_mode": self.user_agent_mode,
"user_agent_generator_config": self.user_agent_generator_config,
}
def clone(self, **kwargs):


@@ -23,6 +23,7 @@ from .async_logger import AsyncLogger
from playwright_stealth import StealthConfig
from .ssl_certificate import SSLCertificate
from .utils import get_home_folder, get_chromium_path
from .user_agent_generator import ValidUAGenerator, OnlineUAGenerator
stealth_config = StealthConfig(
webdriver=True,
@@ -102,6 +103,7 @@ class ManagedBrowser:
logger=None,
host: str = "localhost",
debugging_port: int = 9222,
cdp_url: Optional[str] = None,
):
"""
Initialize the ManagedBrowser instance.
@@ -116,6 +118,7 @@ class ManagedBrowser:
logger (logging.Logger): Logger instance for logging messages. Default: None.
host (str): Host for debugging the browser. Default: "localhost".
debugging_port (int): Port for debugging the browser. Default: 9222.
cdp_url (str or None): CDP URL to connect to the browser. Default: None.
"""
self.browser_type = browser_type
self.user_data_dir = user_data_dir
@@ -126,12 +129,20 @@ class ManagedBrowser:
self.host = host
self.logger = logger
self.shutting_down = False
self.cdp_url = cdp_url
async def start(self) -> str:
"""
Starts the browser process and returns the CDP endpoint URL.
If user_data_dir is not provided, creates a temporary directory.
Starts the browser process or returns CDP endpoint URL.
If cdp_url is provided, returns it directly.
If user_data_dir is not provided for local browser, creates a temporary directory.
Returns:
str: CDP endpoint URL
"""
# If CDP URL provided, just return it
if self.cdp_url:
return self.cdp_url
# Create temp dir if needed
if not self.user_data_dir:
@@ -554,7 +565,7 @@ class BrowserManager:
Context: Browser context object with the specified configurations
"""
# Base settings
user_agent = self.config.headers.get("User-Agent", self.config.user_agent)
user_agent = self.config.headers.get("User-Agent", self.config.user_agent)
viewport_settings = {
"width": self.config.viewport_width,
"height": self.config.viewport_height,
@@ -1260,10 +1271,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
self._downloaded_files = []
# Handle user agent with magic mode
user_agent = self.browser_config.user_agent
if config.magic and self.browser_config.user_agent_mode != "random":
self.browser_config.user_agent = UserAgentGenerator().generate(
**(self.browser_config.user_agent_generator_config or {})
user_agent_to_override = config.user_agent
if user_agent_to_override:
self.browser_config.user_agent = user_agent_to_override
elif config.magic or config.user_agent_mode == "random":
self.browser_config.user_agent = ValidUAGenerator().generate(
**(config.user_agent_generator_config or {})
)
# Get page for session


@@ -2,8 +2,146 @@ import random
from typing import Optional, Literal, List, Dict, Tuple
import re
from abc import ABC, abstractmethod
import random
from fake_useragent import UserAgent
import requests
from lxml import html
import json
from typing import Optional, List, Union, Dict
class UserAgentGenerator:
class UAGen(ABC):
@abstractmethod
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Union[str, Dict]:
pass
@staticmethod
def generate_client_hints( user_agent: str) -> str:
"""Generate Sec-CH-UA header value based on user agent string"""
def _parse_user_agent(user_agent: str) -> Dict[str, str]:
"""Parse a user agent string to extract browser and version information"""
browsers = {
"chrome": r"Chrome/(\d+)",
"edge": r"Edg/(\d+)",
"safari": r"Version/(\d+)",
"firefox": r"Firefox/(\d+)",
}
result = {}
for browser, pattern in browsers.items():
match = re.search(pattern, user_agent)
if match:
result[browser] = match.group(1)
return result
browsers = _parse_user_agent(user_agent)
# Client hints components
hints = []
# Handle different browser combinations
if "chrome" in browsers:
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
hints.append('"Not_A Brand";v="8"')
if "edge" in browsers:
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
else:
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
elif "firefox" in browsers:
# Firefox doesn't typically send Sec-CH-UA
return '""'
elif "safari" in browsers:
# Safari's format for client hints
hints.append(f'"Safari";v="{browsers["safari"]}"')
hints.append('"Not_A Brand";v="8"')
return ", ".join(hints)
class ValidUAGenerator(UAGen):
def __init__(self):
self.ua = UserAgent()
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> str:
self.ua = UserAgent(
browsers=browsers or ['Chrome', 'Firefox', 'Edge'],
os=os or ['Windows', 'Mac OS X'],
min_version=min_version,
platforms=platforms or ['desktop'],
fallback=fallback
)
return self.ua.random
class OnlineUAGenerator(UAGen):
def __init__(self):
self.agents = []
self._fetch_agents()
def _fetch_agents(self):
try:
response = requests.get(
'https://www.useragents.me/',
timeout=5,
headers={'Accept': 'text/html,application/xhtml+xml'}
)
response.raise_for_status()
tree = html.fromstring(response.content)
json_text = tree.cssselect('#most-common-desktop-useragents-json-csv > div:nth-child(1) > textarea')[0].text
self.agents = json.loads(json_text)
except Exception as e:
print(f"Error fetching agents: {e}")
def generate(self,
browsers: Optional[List[str]] = None,
os: Optional[Union[str, List[str]]] = None,
min_version: float = 0.0,
platforms: Optional[Union[str, List[str]]] = None,
pct_threshold: Optional[float] = None,
fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Dict:
if not self.agents:
self._fetch_agents()
filtered_agents = self.agents
if pct_threshold:
filtered_agents = [a for a in filtered_agents if a['pct'] >= pct_threshold]
if browsers:
filtered_agents = [a for a in filtered_agents
if any(b.lower() in a['ua'].lower() for b in browsers)]
if os:
os_list = [os] if isinstance(os, str) else os
filtered_agents = [a for a in filtered_agents
if any(o.lower() in a['ua'].lower() for o in os_list)]
if platforms:
platform_list = [platforms] if isinstance(platforms, str) else platforms
filtered_agents = [a for a in filtered_agents
if any(p.lower() in a['ua'].lower() for p in platform_list)]
return filtered_agents[0] if filtered_agents else {'ua': fallback, 'pct': 0}
class UserAgentGenerator():
"""
Generate random user agents with specified constraints.
@@ -187,9 +325,15 @@ class UserAgentGenerator:
browser_stack = self.get_browser_stack(num_browsers)
# Add appropriate legacy token based on browser stack
if "Firefox" in str(browser_stack):
if "Firefox" in str(browser_stack) or browser_type == "firefox":
components.append(random.choice(self.rendering_engines["gecko"]))
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack) or browser_type == "chrome":
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
elif "Edge" in str(browser_stack) or browser_type == "edge":
components.append(self.rendering_engines["safari_webkit"])
components.append("(KHTML, like Gecko)")
elif "Safari" in str(browser_stack) or browser_type == "safari":
components.append(self.rendering_engines["chrome_webkit"])
components.append("(KHTML, like Gecko)")
@@ -273,27 +417,13 @@ class UserAgentGenerator:
# Example usage:
if __name__ == "__main__":
generator = UserAgentGenerator()
print(generator.generate())
# Usage example:
generator = ValidUAGenerator()
ua = generator.generate()
print(ua)
generator = OnlineUAGenerator()
ua = generator.generate()
print(ua)
print("\nSingle browser (Chrome):")
print(generator.generate(num_browsers=1, browser_type="chrome"))
print("\nTwo browsers (Gecko/Firefox):")
print(generator.generate(num_browsers=2))
print("\nThree browsers (Chrome/Safari/Edge):")
print(generator.generate(num_browsers=3))
print("\nFirefox on Linux:")
print(
generator.generate(
device_type="desktop",
os_type="linux",
browser_type="firefox",
num_browsers=2,
)
)
print("\nChrome/Safari/Edge on Windows:")
print(generator.generate(device_type="desktop", os_type="windows", num_browsers=3))


@@ -85,17 +85,16 @@ async def demo_memory_dispatcher():
)
print("\n🚀 Starting batch crawl...")
results = await dispatcher.run_urls(
results = await crawler.arun_many(
urls=urls,
crawler=crawler,
config=crawler_config,
dispatcher=dispatcher
)
print(f"\n✅ Completed {len(results)} URLs successfully")
except Exception as e:
print(f"\n❌ Error in memory dispatcher demo: {str(e)}")
async def demo_streaming_support():
"""
2. Streaming Support Demo
@@ -115,16 +114,17 @@ async def demo_streaming_support():
dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)
print("Starting streaming crawl...")
async for result in dispatcher.run_urls_stream(
urls=urls, crawler=crawler, config=crawler_config
async for result in await crawler.arun_many(
urls=urls,
config=crawler_config,
dispatcher=dispatcher
):
# Process each result as it arrives
print(
f"Received result for {result.url} - Success: {result.result.success}"
f"Received result for {result.url} - Success: {result.success}"
)
if result.result.success:
print(f"Content length: {len(result.result.markdown)}")
if result.success:
print(f"Content length: {len(result.markdown)}")
async def demo_content_scraping():
"""
@@ -138,7 +138,10 @@ async def demo_content_scraping():
url = "https://example.com/article"
# Configure with the new LXML strategy
config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy(), verbose=True)
config = CrawlerRunConfig(
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True
)
print("Scraping content with LXML strategy...")
async with crawler:
@@ -146,7 +149,6 @@ async def demo_content_scraping():
if result.success:
print("Successfully scraped content using LXML strategy")
async def demo_llm_markdown():
"""
4. LLM-Powered Markdown Generation Demo
@@ -197,7 +199,6 @@ async def demo_llm_markdown():
print(result.markdown_v2.fit_markdown[:500])
print("Successfully generated LLM-filtered markdown")
async def demo_robots_compliance():
"""
5. Robots.txt Compliance Demo
@@ -221,8 +222,6 @@ async def demo_robots_compliance():
elif result.success:
print(f"Successfully crawled: {result.url}")
async def demo_json_schema_generation():
"""
7. LLM-Powered Schema Generation Demo
@@ -276,7 +275,6 @@ async def demo_json_schema_generation():
print(json.dumps(result.extracted_content, indent=2) if result.extracted_content else None)
print("Successfully used generated schema for crawling")
async def demo_proxy_rotation():
"""
8. Proxy Rotation Demo
@@ -299,8 +297,7 @@ async def demo_proxy_rotation():
}
except Exception as e:
print(f"Error loading proxy: {e}")
return None
return None
# Create 2 test requests to httpbin
urls = ["https://httpbin.org/ip"] * 2
@@ -316,7 +313,7 @@ async def demo_proxy_rotation():
continue
# Create new config with proxy
current_config = run_config.clone(proxy_config=proxy)
current_config = run_config.clone(proxy_config=proxy, user_agent="")
result = await crawler.arun(url=url, config=current_config)
if result.success:


@@ -5,16 +5,20 @@
## 1. Introduction
When crawling many URLs:
- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)
**Why Dispatchers?**
- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency
---
## 2. Core Components
### 2.1 Rate Limiter
@@ -22,34 +26,116 @@ When crawling many URLs:
```python
class RateLimiter:
def __init__(
base_delay: Tuple[float, float] = (1.0, 3.0), # Random delay range between requests
max_delay: float = 60.0, # Maximum backoff delay
max_retries: int = 3, # Retries before giving up
rate_limit_codes: List[int] = [429, 503] # Status codes triggering backoff
# Random delay range between requests
base_delay: Tuple[float, float] = (1.0, 3.0),
# Maximum backoff delay
max_delay: float = 60.0,
# Retries before giving up
max_retries: int = 3,
# Status codes triggering backoff
rate_limit_codes: List[int] = [429, 503]
)
```
The RateLimiter provides:
- Random delays between requests
- Exponential backoff on rate limit responses
- Domain-specific rate limiting
- Automatic retry handling
Here's the revised and simplified explanation of the **RateLimiter**, focusing on constructor parameters and adhering to your markdown style and mkDocs guidelines.
#### RateLimiter Constructor Parameters
The **RateLimiter** is a utility that helps manage the pace of requests to avoid overloading servers or getting blocked due to rate limits. It operates internally to delay requests and handle retries but can be configured using its constructor parameters.
**Parameters of the `RateLimiter` constructor:**
1. **`base_delay`** (`Tuple[float, float]`, default: `(1.0, 3.0)`)
The range for a random delay (in seconds) between consecutive requests to the same domain.
- A random delay is chosen between `base_delay[0]` and `base_delay[1]` for each request.
- This prevents sending requests at a predictable frequency, reducing the chances of triggering rate limits.
**Example:**
If `base_delay = (2.0, 5.0)`, delays could be randomly chosen as `2.3s`, `4.1s`, etc.
---
2. **`max_delay`** (`float`, default: `60.0`)
The maximum allowable delay when rate-limiting errors occur.
- When servers return rate-limit responses (e.g., 429 or 503), the delay increases exponentially with jitter.
- The `max_delay` ensures the delay doesn't grow unreasonably high, capping it at this value.
**Example:**
For a `max_delay = 30.0`, even if backoff calculations suggest a delay of `45s`, it will cap at `30s`.
---
3. **`max_retries`** (`int`, default: `3`)
The maximum number of retries for a request if rate-limiting errors occur.
- After encountering a rate-limit response, the `RateLimiter` retries the request up to this number of times.
- If all retries fail, the request is marked as failed, and the process continues.
**Example:**
If `max_retries = 3`, the system retries a failed request three times before giving up.
---
4. **`rate_limit_codes`** (`List[int]`, default: `[429, 503]`)
A list of HTTP status codes that trigger the rate-limiting logic.
- These status codes indicate the server is overwhelmed or actively limiting requests.
- You can customize this list to include other codes based on specific server behavior.
**Example:**
If `rate_limit_codes = [429, 503, 504]`, the crawler will back off on these three error codes.
---
**How to Use the `RateLimiter`:**
Here's an example of initializing and using a `RateLimiter` in your project:
```python
from crawl4ai import RateLimiter
# Create a RateLimiter with custom settings
rate_limiter = RateLimiter(
base_delay=(2.0, 4.0), # Random delay between 2-4 seconds
max_delay=30.0, # Cap delay at 30 seconds
max_retries=5, # Retry up to 5 times on rate-limiting errors
rate_limit_codes=[429, 503] # Handle these HTTP status codes
)
# RateLimiter will handle delays and retries internally
# No additional setup is required for its operation
```
The `RateLimiter` integrates seamlessly with dispatchers like `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher`, ensuring requests are paced correctly without user intervention. Its internal mechanisms manage delays and retries to avoid overwhelming servers while maximizing efficiency.
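The capped backoff described above ("the delay increases exponentially with jitter" up to `max_delay`) can be pictured with a small standalone sketch. This is not the library's implementation: the function name `backoff_delay`, the doubling factor, and the 25% jitter range are all assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    """Return a hypothetical wait time (seconds) before retry number `attempt` (1-based)."""
    # Double the base delay on every retry: base, 2*base, 4*base, ...
    exponential = base * (2 ** (attempt - 1))
    # Add up to 25% random jitter so that retries do not line up in time.
    jittered = exponential * random.uniform(1.0, 1.25)
    # Never wait longer than max_delay, however many retries have happened.
    return min(jittered, max_delay)
```

With `max_delay=30.0`, a late retry whose computed delay would exceed 30 seconds is clamped to exactly 30 seconds, which matches the capping behaviour described for `max_delay`.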
### 2.2 Crawler Monitor
The CrawlerMonitor provides real-time visibility into crawling operations:
```python
from crawl4ai import CrawlerMonitor, DisplayMode
monitor = CrawlerMonitor(
max_visible_rows=15, # Maximum rows in live display
display_mode=DisplayMode.DETAILED # DETAILED or AGGREGATED view
# Maximum rows in live display
max_visible_rows=15,
# DETAILED or AGGREGATED view
display_mode=DisplayMode.DETAILED
)
```
**Display Modes**:
1. **DETAILED**: Shows individual task status, memory usage, and timing
2. **AGGREGATED**: Displays summary statistics and overall progress
---
## 3. Available Dispatchers
### 3.1 MemoryAdaptiveDispatcher (Default)
@@ -57,6 +143,8 @@ monitor = CrawlerMonitor(
Automatically manages concurrency based on system memory usage:
```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=90.0, # Pause if memory exceeds this
check_interval=1.0, # How often to check memory
@@ -73,13 +161,37 @@ dispatcher = MemoryAdaptiveDispatcher(
)
```
**Constructor Parameters:**
1. **`memory_threshold_percent`** (`float`, default: `90.0`)
Specifies the memory usage threshold (as a percentage). If system memory usage exceeds this value, the dispatcher pauses crawling to prevent system overload.
2. **`check_interval`** (`float`, default: `1.0`)
The interval (in seconds) at which the dispatcher checks system memory usage.
3. **`max_session_permit`** (`int`, default: `10`)
The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.
4. **`memory_wait_timeout`** (`float`, default: `300.0`)
Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.
5. **`rate_limiter`** (`RateLimiter`, default: `None`)
Optional rate-limiting logic to avoid server-side blocking (e.g., for handling 429 or 503 errors). See **RateLimiter** for details.
6. **`monitor`** (`CrawlerMonitor`, default: `None`)
Optional monitoring for real-time task tracking and performance insights. See **CrawlerMonitor** for details.
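To make the interplay of `memory_threshold_percent`, `check_interval`, and `memory_wait_timeout` concrete, here is a minimal standalone sketch of a memory gate. It is an illustration only, not the dispatcher's actual code: `wait_for_memory` and its `read_memory_percent` callback are hypothetical names, and a real implementation would read system memory from the operating system rather than from an injected function.

```python
import asyncio
import time
from typing import Callable


async def wait_for_memory(
    read_memory_percent: Callable[[], float],
    threshold_percent: float = 90.0,
    check_interval: float = 1.0,
    wait_timeout: float = 300.0,
) -> None:
    """Pause until memory usage is at or below the threshold, or raise on timeout."""
    start = time.monotonic()
    # Keep polling while usage exceeds the threshold.
    while read_memory_percent() > threshold_percent:
        # Give up once we have waited longer than the allowed timeout.
        if time.monotonic() - start > wait_timeout:
            raise MemoryError("memory usage stayed above the threshold for too long")
        # Sleep between checks so the event loop can run other tasks.
        await asyncio.sleep(check_interval)
```

Injecting the memory reader keeps the sketch testable: a fake reader can simulate usage dropping below the threshold after a few checks.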
---
### 3.2 SemaphoreDispatcher
Provides simple concurrency control with a fixed limit:
```python
from crawl4ai.async_dispatcher import SemaphoreDispatcher
dispatcher = SemaphoreDispatcher(
max_session_permit=5, # Fixed concurrent tasks
max_session_permit=20, # Maximum concurrent tasks
rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(0.5, 1.0),
max_delay=10.0
@@ -91,6 +203,19 @@ dispatcher = SemaphoreDispatcher(
)
```
**Constructor Parameters:**
1. **`max_session_permit`** (`int`, default: `20`)
The maximum number of concurrent crawling tasks allowed, irrespective of semaphore slots.
2. **`rate_limiter`** (`RateLimiter`, default: `None`)
Optional rate-limiting logic to avoid overwhelming servers. See **RateLimiter** for details.
3. **`monitor`** (`CrawlerMonitor`, default: `None`)
Optional monitoring for tracking task progress and resource usage. See **CrawlerMonitor** for details.
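The idea of a fixed concurrency limit can be sketched with plain `asyncio`, independent of Crawl4AI. The helper below is hypothetical (`run_with_limit` is not part of the library) and only shows how a semaphore keeps at most `limit` jobs running at once; the `limit` argument plays the role that `max_session_permit` plays above.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(jobs: List[Callable[[], Awaitable]], limit: int = 20) -> list:
    """Run all jobs, allowing at most `limit` of them to execute concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable]):
        # Each job must acquire a slot before it starts and releases it when done.
        async with semaphore:
            return await job()

    # gather preserves the input order of results.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Unlike the memory-adaptive approach, the limit here never changes during the run, which makes the behaviour easy to reason about when the number of parallel requests matters more than memory headroom.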
---
## 4. Usage Examples
### 4.1 Batch Processing (Default)
@@ -128,6 +253,14 @@ async def crawl_batch():
print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Executes a batch crawl and returns all results together once crawling is complete.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.
- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.
- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.
---
### 4.2 Streaming Mode
```python
@@ -161,6 +294,14 @@ async def crawl_streaming():
print(f"Failed to crawl {result.url}: {result.error_message}")
```
**Review:**
- **Purpose:** Enables streaming to process results as soon as they're available.
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` for concurrency and memory management.
- **Stream:** Enabled (`stream=True`), allowing real-time processing during crawling.
- **Best Use Case:** When you need to act on results immediately, such as for real-time analytics or progressive data storage.
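The difference between batch and streaming delivery can be illustrated without the crawler at all. The sketch below uses only `asyncio`; `fetch`, `batch`, and `stream` are hypothetical stand-ins, with a sleep simulating page load time. It is meant to show the ordering behaviour, not how `arun_many()` is implemented.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def fetch(url: str, delay: float) -> str:
    """Stand-in for a crawl: wait `delay` seconds, then return the URL."""
    await asyncio.sleep(delay)
    return url


async def batch(jobs: List[Tuple[str, float]]) -> List[str]:
    # Batch mode: nothing is returned until every job has finished,
    # and results come back in the order the jobs were submitted.
    return await asyncio.gather(*(fetch(url, delay) for url, delay in jobs))


async def stream(jobs: List[Tuple[str, float]]) -> AsyncIterator[str]:
    # Streaming mode: yield each result as soon as its job completes,
    # so fast pages are delivered before slow ones.
    tasks = [asyncio.create_task(fetch(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

A caller consumes the streaming variant with `async for result in stream(jobs): ...`, which is the same shape as the streaming loop used with `stream=True`.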
---
### 4.3 Semaphore-based Crawling
```python
@@ -189,6 +330,14 @@ async def crawl_with_semaphore(urls):
return results
```
**Review:**
- **Purpose:** Uses `SemaphoreDispatcher` to limit concurrency with a fixed number of slots.
- **Dispatcher:** Configured with a semaphore to control parallel crawling tasks.
- **Rate Limiter:** Prevents servers from being overwhelmed by pacing requests.
- **Best Use Case:** When you want precise control over the number of concurrent requests, independent of system memory.
---
### 4.4 Robots.txt Consideration
```python
@@ -221,11 +370,13 @@ if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**:
- When `check_robots_txt=True`, each URL's robots.txt is checked before crawling
- Robots.txt files are cached for efficiency
- Failed robots.txt checks return 403 status code
- Dispatcher handles robots.txt checks automatically for each URL
**Review:**
- **Purpose:** Ensures compliance with `robots.txt` rules for ethical and legal web crawling.
- **Configuration:** Set `check_robots_txt=True` to validate each URL against `robots.txt` before crawling.
- **Dispatcher:** Handles requests with concurrency limits (`semaphore_count=3`).
- **Best Use Case:** When crawling websites that strictly enforce robots.txt policies or for responsible crawling practices.
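For readers unfamiliar with what a robots.txt check involves, the standard library's `urllib.robotparser` shows the core idea. This sketch is independent of Crawl4AI and makes no claim about how `check_robots_txt=True` is implemented internally; `is_allowed` is a hypothetical helper, and the rules are passed in as text so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so the rules can come from a cache or a string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: every agent is barred from /private/ and nothing else.
RULES = "User-agent: *\nDisallow: /private/\n"
```

With these rules, `is_allowed(RULES, "MyBot", "https://example.com/private/page")` is rejected while a URL under `/public/` is permitted.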
---
## 5. Dispatch Results
@@ -255,20 +406,24 @@ for result in results:
## 6. Summary
1. **Two Dispatcher Types**:
1.**Two Dispatcher Types**:
- MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
- SemaphoreDispatcher: Fixed concurrency limit
2. **Optional Components**:
2.**Optional Components**:
- RateLimiter: Smart request pacing and backoff
- CrawlerMonitor: Real-time progress visualization
3. **Key Benefits**:
3.**Key Benefits**:
- Automatic memory management
- Built-in rate limiting
- Live progress monitoring
- Flexible concurrency control
Choose the dispatcher that best fits your needs:
- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios


@@ -95,6 +95,10 @@ strong {
}
div.highlight {
margin-bottom: 2em;
}
.terminal-card > header {
color: var(--font-color);
text-align: center;
@@ -231,6 +235,16 @@ pre {
font-size: 2em;
}
.terminal h2 {
font-size: 1.5em;
margin-bottom: 0.8em;
}
.terminal h3 {
font-size: 1.3em;
margin-bottom: 0.8em;
}
.terminal h1, .terminal h2, .terminal h3, .terminal h4, .terminal h5, .terminal h6 {
text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color), 0 0 0px var(--font-color);
}


@@ -0,0 +1,137 @@
# Installation 💻
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package, use it with Docker, or run it as a local server.
## Option 1: Python Package Installation (Recommended)
Crawl4AI is now available on PyPI, making installation easier than ever. Choose the option that best fits your needs:
### Basic Installation
For basic web crawling and scraping tasks:
```bash
pip install crawl4ai
playwright install # Install Playwright dependencies
```
### Installation with PyTorch
For advanced text clustering (includes CosineSimilarity cluster strategy):
```bash
pip install crawl4ai[torch]
```
### Installation with Transformers
For text summarization and Hugging Face models:
```bash
pip install crawl4ai[transformer]
```
### Full Installation
For all features:
```bash
pip install crawl4ai[all]
```
### Development Installation
For contributors who plan to modify the source code:
```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e ".[all]"
playwright install # Install Playwright dependencies
```
💡 After installation with "torch", "transformer", or "all" options, it's recommended to run the following CLI command to load the required models:
```bash
crawl4ai-download-models
```
This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
## Playwright Installation Note for Ubuntu
If you encounter issues with Playwright installation on Ubuntu, you may need to install additional dependencies:
```bash
sudo apt-get install -y \
libwoff1 \
libopus0 \
libwebp7 \
libwebpdemux2 \
libenchant-2-2 \
libgudev-1.0-0 \
libsecret-1-0 \
libhyphen0 \
libgdk-pixbuf2.0-0 \
libegl1 \
libnotify4 \
libxslt1.1 \
libevent-2.1-7 \
libgles2 \
libxcomposite1 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libepoxy0 \
libgtk-3-0 \
libharfbuzz-icu0 \
libgstreamer-gl1.0-0 \
libgstreamer-plugins-bad1.0-0 \
gstreamer1.0-plugins-good \
gstreamer1.0-plugins-bad \
libxt6 \
libxaw7 \
xvfb \
fonts-noto-color-emoji \
libfontconfig \
libfreetype6 \
xfonts-cyrillic \
xfonts-scalable \
fonts-liberation \
fonts-ipafont-gothic \
fonts-wqy-zenhei \
fonts-tlwg-loma-otf \
fonts-freefont-ttf
```
## Option 2: Using Docker (Coming Soon)
Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.
## Option 3: Local Server Installation
For those who prefer to run Crawl4AI as a local server, instructions will be provided once the Docker implementation is complete.
## Verifying Your Installation
After installation, you can verify that Crawl4AI is working correctly by running a simple Python script:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://www.example.com")
print(result.markdown[:500]) # Print first 500 characters
if __name__ == "__main__":
asyncio.run(main())
```
This script should successfully crawl the example website and print the first 500 characters of the extracted content.
## Getting Help
If you encounter any issues during installation or usage, please check the [documentation](https://docs.crawl4ai.com/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues).
Happy crawling! 🕷️🤖


@@ -1,4 +1,4 @@
site_name: Crawl4AI Documentation
site_name: Crawl4AI Documentation (v0.4.3b2)
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
site_url: https://docs.crawl4ai.com
repo_url: https://github.com/unclecode/crawl4ai
@@ -52,6 +52,11 @@ nav:
theme:
name: 'terminal'
palette: 'dark'
icon:
repo: fontawesome/brands/github
plugins:
- search
markdown_extensions:
- pymdownx.highlight:
@@ -64,6 +69,9 @@ markdown_extensions:
- attr_list
- tables
extra:
version: !ENV [CRAWL4AI_VERSION, 'development']
extra_css:
- assets/styles.css
- assets/highlight.css
@@ -72,3 +80,4 @@ extra_css:
extra_javascript:
- assets/highlight.min.js
- assets/highlight_init.js
- https://buttons.github.io/buttons.js


@@ -37,6 +37,7 @@ dependencies = [
"rich>=13.9.4",
"cssselect>=1.2.0",
"httpx==0.27.2",
"fake-useragent>=2.0.3"
]
classifiers = [
"Development Status :: 4 - Beta",