Merge branch 'main' into next
# Conflicts: # .gitignore
This commit is contained in:
59
.github/DISCUSSION_TEMPLATE/feature-requests.yml
vendored
Normal file
59
.github/DISCUSSION_TEMPLATE/feature-requests.yml
vendored
Normal file
@@ -0,0 +1,59 @@
|
||||
title: "[Feature Request]: "
|
||||
labels: ["⚙️ New"]
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check if already exists in
|
||||
this discussions category to avoid duplicates. 😊
|
||||
|
||||
- type: textarea
|
||||
id: needs_to_be_done
|
||||
attributes:
|
||||
label: What needs to be done?
|
||||
description: Please describe the feature or functionality you'd like to see.
|
||||
placeholder: "e.g., Return alt text along with images scraped from a webpages in Result"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: problem_to_solve
|
||||
attributes:
|
||||
label: What problem does this solve?
|
||||
description: Explain the pain point or issue this feature will help address.
|
||||
placeholder: "e.g., Bypass Captchas added by cloudflare"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: target_users
|
||||
attributes:
|
||||
label: Target users/beneficiaries
|
||||
description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
|
||||
placeholder: "e.g., Marketing teams, developers"
|
||||
validations:
|
||||
required: false
|
||||
|
||||
- type: textarea
|
||||
id: current_workarounds
|
||||
attributes:
|
||||
label: Current alternatives/workarounds
|
||||
description: Are there any existing solutions or workarounds? How does this feature improve upon them?
|
||||
placeholder: "e.g., Users manually select the css classes mapped to data fields to extract them"
|
||||
validations:
|
||||
required: false
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
### 💡 Implementation Ideas
|
||||
|
||||
- type: textarea
|
||||
id: proposed_approach
|
||||
attributes:
|
||||
label: Proposed approach
|
||||
description: Share any ideas you have for how this feature could be implemented. Point out any challenges your foresee
|
||||
and the success metrics for this feature
|
||||
placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
|
||||
validations:
|
||||
required: false
|
||||
127
.github/ISSUE_TEMPLATE/bug_report.yml
vendored
Normal file
127
.github/ISSUE_TEMPLATE/bug_report.yml
vendored
Normal file
@@ -0,0 +1,127 @@
|
||||
name: Bug Report
|
||||
description: Report a bug with the Crawl4AI.
|
||||
title: "[Bug]: "
|
||||
labels: ["🐞 Bug","🩺 Needs Triage"]
|
||||
body:
|
||||
- type: input
|
||||
id: crawl4ai_version
|
||||
attributes:
|
||||
label: crawl4ai version
|
||||
description: Specify the version of crawl4ai you are using.
|
||||
placeholder: "e.g., 2.0.0"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: expected_behavior
|
||||
attributes:
|
||||
label: Expected Behavior
|
||||
description: Describe what you expected to happen.
|
||||
placeholder: "Provide a detailed explanation of the expected outcome."
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: current_behavior
|
||||
attributes:
|
||||
label: Current Behavior
|
||||
description: Describe what is happening instead of the expected behavior.
|
||||
placeholder: "Describe the actual result or issue you encountered."
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: dropdown
|
||||
id: reproducible
|
||||
attributes:
|
||||
label: Is this reproducible?
|
||||
description: Indicate whether this bug can be reproduced consistently.
|
||||
options:
|
||||
- "Yes"
|
||||
- "No"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: inputs
|
||||
attributes:
|
||||
label: Inputs Causing the Bug
|
||||
description: Provide details about the inputs causing the issue.
|
||||
placeholder: |
|
||||
- URL(s):
|
||||
- Settings used:
|
||||
- Input data (if applicable):
|
||||
render: bash
|
||||
|
||||
- type: textarea
|
||||
id: steps_to_reproduce
|
||||
attributes:
|
||||
label: Steps to Reproduce
|
||||
description: Provide step-by-step instructions to reproduce the issue.
|
||||
placeholder: |
|
||||
1. Go to...
|
||||
2. Click on...
|
||||
3. Observe the issue...
|
||||
render: bash
|
||||
|
||||
- type: textarea
|
||||
id: code_snippets
|
||||
attributes:
|
||||
label: Code snippets
|
||||
description: Provide code snippets(if any). Add comments as necessary
|
||||
placeholder: print("Hello world")
|
||||
render: python
|
||||
|
||||
# Header Section with Title
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Supporting Information
|
||||
Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem
|
||||
|
||||
- type: input
|
||||
id: os
|
||||
attributes:
|
||||
label: OS
|
||||
description: Please provide the operating system & distro where the issue occurs.
|
||||
placeholder: "e.g., Windows, macOS, Linux"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: input
|
||||
id: python_version
|
||||
attributes:
|
||||
label: Python version
|
||||
description: Specify the Python version being used.
|
||||
placeholder: "e.g., 3.8.5"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
# Browser Field
|
||||
- type: input
|
||||
id: browser
|
||||
attributes:
|
||||
label: Browser
|
||||
description: Provide the name of the browser you are using.
|
||||
placeholder: "e.g., Chrome, Firefox, Safari"
|
||||
validations:
|
||||
required: false
|
||||
|
||||
# Browser Version Field
|
||||
- type: input
|
||||
id: browser_version
|
||||
attributes:
|
||||
label: Browser version
|
||||
description: Provide the version of the browser you are using.
|
||||
placeholder: "e.g., 91.0.4472.124"
|
||||
validations:
|
||||
required: false
|
||||
|
||||
# Error Logs Field (Text Area)
|
||||
- type: textarea
|
||||
id: error_logs
|
||||
attributes:
|
||||
label: Error logs & Screenshots (if applicable)
|
||||
description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
|
||||
placeholder: "Paste error logs here and attach your screenshots"
|
||||
validations:
|
||||
required: false
|
||||
8
.github/ISSUE_TEMPLATE/config.yml
vendored
Normal file
8
.github/ISSUE_TEMPLATE/config.yml
vendored
Normal file
@@ -0,0 +1,8 @@
|
||||
blank_issues_enabled: false
|
||||
contact_links:
|
||||
- name: Feature Requests
|
||||
url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
|
||||
about: "Suggest new features or enhancements for Crawl4AI"
|
||||
- name: Forums - Q&A
|
||||
url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
|
||||
about: "Ask questions or engage in general discussions about Crawl4AI"
|
||||
19
.github/pull_request_template.md
vendored
Normal file
19
.github/pull_request_template.md
vendored
Normal file
@@ -0,0 +1,19 @@
|
||||
## Summary
|
||||
Please include a summary of the change and/or which issues are fixed.
|
||||
|
||||
eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
|
||||
|
||||
## List of files changed and why
|
||||
eg: quickstart.py - To update the example as per new changes
|
||||
|
||||
## How Has This Been Tested?
|
||||
Please describe the tests that you ran to verify your changes.
|
||||
|
||||
## Checklist:
|
||||
|
||||
- [ ] My code follows the style guidelines of this project
|
||||
- [ ] I have performed a self-review of my own code
|
||||
- [ ] I have commented my code, particularly in hard-to-understand areas
|
||||
- [ ] I have made corresponding changes to the documentation
|
||||
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
|
||||
- [ ] New and existing unit tests pass locally with my changes
|
||||
5
.gitignore
vendored
5
.gitignore
vendored
@@ -226,6 +226,9 @@ tree.md
|
||||
.local
|
||||
.do
|
||||
/plans
|
||||
plans/
|
||||
|
||||
# Codeium
|
||||
.codeiumignore
|
||||
todo/
|
||||
|
||||
@@ -251,4 +254,4 @@ continue_config.json
|
||||
.prompts/
|
||||
|
||||
.llm.env
|
||||
.private/
|
||||
.private/
|
||||
|
||||
11
CHANGELOG.md
11
CHANGELOG.md
@@ -5,9 +5,12 @@ All notable changes to Crawl4AI will be documented in this file.
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
---
|
||||
|
||||
### Changed
|
||||
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
|
||||
|
||||
## Version 0.4.3 (2025-01-21)
|
||||
## Version 0.4.3b2 (2025-01-21)
|
||||
|
||||
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
|
||||
|
||||
@@ -135,9 +138,11 @@ This release introduces several powerful new features, including robots.txt comp
|
||||
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
|
||||
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
|
||||
|
||||
#### Other
|
||||
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
|
||||
## [0.4.267] - 2025 - 01 - 06
|
||||
|
||||
### Added
|
||||
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
|
||||
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
|
||||
|
||||
## [0.4.24] - 2024-12-31
|
||||
|
||||
|
||||
131
CODE_OF_CONDUCT.md
Normal file
131
CODE_OF_CONDUCT.md
Normal file
@@ -0,0 +1,131 @@
|
||||
# Crawl4AI Code of Conduct
|
||||
|
||||
## Our Pledge
|
||||
|
||||
We as members, contributors, and leaders pledge to make participation in our
|
||||
community a harassment-free experience for everyone, regardless of age, body
|
||||
size, visible or invisible disability, ethnicity, sex characteristics, gender
|
||||
identity and expression, level of experience, education, socio-economic status,
|
||||
nationality, personal appearance, race, caste, color, religion, or sexual
|
||||
identity and orientation.
|
||||
|
||||
We pledge to act and interact in ways that contribute to an open, welcoming,
|
||||
diverse, inclusive, and healthy community.
|
||||
|
||||
## Our Standards
|
||||
|
||||
Examples of behavior that contributes to a positive environment for our
|
||||
community include:
|
||||
|
||||
* Demonstrating empathy and kindness toward other people
|
||||
* Being respectful of differing opinions, viewpoints, and experiences
|
||||
* Giving and gracefully accepting constructive feedback
|
||||
* Accepting responsibility and apologizing to those affected by our mistakes,
|
||||
and learning from the experience
|
||||
* Focusing on what is best not just for us as individuals, but for the overall
|
||||
community
|
||||
|
||||
Examples of unacceptable behavior include:
|
||||
|
||||
* The use of sexualized language or imagery, and sexual attention or advances of
|
||||
any kind
|
||||
* Trolling, insulting or derogatory comments, and personal or political attacks
|
||||
* Public or private harassment
|
||||
* Publishing others' private information, such as a physical or email address,
|
||||
without their explicit permission
|
||||
* Other conduct which could reasonably be considered inappropriate in a
|
||||
professional setting
|
||||
|
||||
## Enforcement Responsibilities
|
||||
|
||||
Community leaders are responsible for clarifying and enforcing our standards of
|
||||
acceptable behavior and will take appropriate and fair corrective action in
|
||||
response to any behavior that they deem inappropriate, threatening, offensive,
|
||||
or harmful.
|
||||
|
||||
Community leaders have the right and responsibility to remove, edit, or reject
|
||||
comments, commits, code, wiki edits, issues, and other contributions that are
|
||||
not aligned to this Code of Conduct, and will communicate reasons for moderation
|
||||
decisions when appropriate.
|
||||
|
||||
## Scope
|
||||
|
||||
This Code of Conduct applies within all community spaces, and also applies when
|
||||
an individual is officially representing the community in public spaces.
|
||||
Examples of representing our community include using an official email address,
|
||||
posting via an official social media account, or acting as an appointed
|
||||
representative at an online or offline event.
|
||||
|
||||
## Enforcement
|
||||
|
||||
Instances of abusive, harassing, or otherwise unacceptable behavior may be
|
||||
reported to the community leaders responsible for enforcement at
|
||||
unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
|
||||
|
||||
All community leaders are obligated to respect the privacy and security of the
|
||||
reporter of any incident.
|
||||
|
||||
## Enforcement Guidelines
|
||||
|
||||
Community leaders will follow these Community Impact Guidelines in determining
|
||||
the consequences for any action they deem in violation of this Code of Conduct:
|
||||
|
||||
### 1. Correction
|
||||
|
||||
**Community Impact**: Use of inappropriate language or other behavior deemed
|
||||
unprofessional or unwelcome in the community.
|
||||
|
||||
**Consequence**: A private, written warning from community leaders, providing
|
||||
clarity around the nature of the violation and an explanation of why the
|
||||
behavior was inappropriate. A public apology may be requested.
|
||||
|
||||
### 2. Warning
|
||||
|
||||
**Community Impact**: A violation through a single incident or series of
|
||||
actions.
|
||||
|
||||
**Consequence**: A warning with consequences for continued behavior. No
|
||||
interaction with the people involved, including unsolicited interaction with
|
||||
those enforcing the Code of Conduct, for a specified period of time. This
|
||||
includes avoiding interactions in community spaces as well as external channels
|
||||
like social media. Violating these terms may lead to a temporary or permanent
|
||||
ban.
|
||||
|
||||
### 3. Temporary Ban
|
||||
|
||||
**Community Impact**: A serious violation of community standards, including
|
||||
sustained inappropriate behavior.
|
||||
|
||||
**Consequence**: A temporary ban from any sort of interaction or public
|
||||
communication with the community for a specified period of time. No public or
|
||||
private interaction with the people involved, including unsolicited interaction
|
||||
with those enforcing the Code of Conduct, is allowed during this period.
|
||||
Violating these terms may lead to a permanent ban.
|
||||
|
||||
### 4. Permanent Ban
|
||||
|
||||
**Community Impact**: Demonstrating a pattern of violation of community
|
||||
standards, including sustained inappropriate behavior, harassment of an
|
||||
individual, or aggression toward or disparagement of classes of individuals.
|
||||
|
||||
**Consequence**: A permanent ban from any sort of public interaction within the
|
||||
community.
|
||||
|
||||
## Attribution
|
||||
|
||||
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
|
||||
version 2.1, available at
|
||||
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
|
||||
|
||||
Community Impact Guidelines were inspired by
|
||||
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
|
||||
|
||||
For answers to common questions about this code of conduct, see the FAQ at
|
||||
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
|
||||
[https://www.contributor-covenant.org/translations][translations].
|
||||
|
||||
[homepage]: https://www.contributor-covenant.org
|
||||
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
|
||||
[Mozilla CoC]: https://github.com/mozilla/diversity
|
||||
[FAQ]: https://www.contributor-covenant.org/faq
|
||||
[translations]: https://www.contributor-covenant.org/translations
|
||||
@@ -15,6 +15,7 @@
|
||||
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
|
||||
[](https://github.com/psf/black)
|
||||
[](https://github.com/PyCQA/bandit)
|
||||
[](code_of_conduct.md)
|
||||
|
||||
</div>
|
||||
|
||||
@@ -446,7 +447,7 @@ if __name__ == "__main__":
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
|
||||
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
|
||||
|
||||
```python
|
||||
import os, sys
|
||||
|
||||
137
docs/md_v2/basic/installation.md
Normal file
137
docs/md_v2/basic/installation.md
Normal file
@@ -0,0 +1,137 @@
|
||||
# Installation 💻
|
||||
|
||||
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package, use it with Docker, or run it as a local server.
|
||||
|
||||
## Option 1: Python Package Installation (Recommended)
|
||||
|
||||
Crawl4AI is now available on PyPI, making installation easier than ever. Choose the option that best fits your needs:
|
||||
|
||||
### Basic Installation
|
||||
|
||||
For basic web crawling and scraping tasks:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
playwright install # Install Playwright dependencies
|
||||
```
|
||||
|
||||
### Installation with PyTorch
|
||||
|
||||
For advanced text clustering (includes CosineSimilarity cluster strategy):
|
||||
|
||||
```bash
|
||||
pip install crawl4ai[torch]
|
||||
```
|
||||
|
||||
### Installation with Transformers
|
||||
|
||||
For text summarization and Hugging Face models:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai[transformer]
|
||||
```
|
||||
|
||||
### Full Installation
|
||||
|
||||
For all features:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai[all]
|
||||
```
|
||||
|
||||
### Development Installation
|
||||
|
||||
For contributors who plan to modify the source code:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e ".[all]"
|
||||
playwright install # Install Playwright dependencies
|
||||
```
|
||||
|
||||
💡 After installation with "torch", "transformer", or "all" options, it's recommended to run the following CLI command to load the required models:
|
||||
|
||||
```bash
|
||||
crawl4ai-download-models
|
||||
```
|
||||
|
||||
This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
|
||||
|
||||
## Playwright Installation Note for Ubuntu
|
||||
|
||||
If you encounter issues with Playwright installation on Ubuntu, you may need to install additional dependencies:
|
||||
|
||||
```bash
|
||||
sudo apt-get install -y \
|
||||
libwoff1 \
|
||||
libopus0 \
|
||||
libwebp7 \
|
||||
libwebpdemux2 \
|
||||
libenchant-2-2 \
|
||||
libgudev-1.0-0 \
|
||||
libsecret-1-0 \
|
||||
libhyphen0 \
|
||||
libgdk-pixbuf2.0-0 \
|
||||
libegl1 \
|
||||
libnotify4 \
|
||||
libxslt1.1 \
|
||||
libevent-2.1-7 \
|
||||
libgles2 \
|
||||
libxcomposite1 \
|
||||
libatk1.0-0 \
|
||||
libatk-bridge2.0-0 \
|
||||
libepoxy0 \
|
||||
libgtk-3-0 \
|
||||
libharfbuzz-icu0 \
|
||||
libgstreamer-gl1.0-0 \
|
||||
libgstreamer-plugins-bad1.0-0 \
|
||||
gstreamer1.0-plugins-good \
|
||||
gstreamer1.0-plugins-bad \
|
||||
libxt6 \
|
||||
libxaw7 \
|
||||
xvfb \
|
||||
fonts-noto-color-emoji \
|
||||
libfontconfig \
|
||||
libfreetype6 \
|
||||
xfonts-cyrillic \
|
||||
xfonts-scalable \
|
||||
fonts-liberation \
|
||||
fonts-ipafont-gothic \
|
||||
fonts-wqy-zenhei \
|
||||
fonts-tlwg-loma-otf \
|
||||
fonts-freefont-ttf
|
||||
```
|
||||
|
||||
## Option 2: Using Docker (Coming Soon)
|
||||
|
||||
Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.
|
||||
|
||||
## Option 3: Local Server Installation
|
||||
|
||||
For those who prefer to run Crawl4AI as a local server, instructions will be provided once the Docker implementation is complete.
|
||||
|
||||
## Verifying Your Installation
|
||||
|
||||
After installation, you can verify that Crawl4AI is working correctly by running a simple Python script:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com")
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
This script should successfully crawl the example website and print the first 500 characters of the extracted content.
|
||||
|
||||
## Getting Help
|
||||
|
||||
If you encounter any issues during installation or usage, please check the [documentation](https://docs.crawl4ai.com/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues).
|
||||
|
||||
Happy crawling! 🕷️🤖
|
||||
Reference in New Issue
Block a user