Compare commits
19 Commits
vr0.4.3b3
...
unclecode-
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
149b69c832 | ||
|
|
d0586f09a9 | ||
|
|
0afc3e9e5e | ||
|
|
7b7fe84e0d | ||
|
|
5c36f4308f | ||
|
|
45809d1c91 | ||
|
|
6dfa9cb703 | ||
|
|
8878b3d032 | ||
|
|
1ab9d115cf | ||
|
|
f9c601eb7e | ||
|
|
ad5e5d21ca | ||
|
|
26d821c0de | ||
|
|
010677cbee | ||
|
|
fe52311bf4 | ||
|
|
01b73950ee | ||
|
|
12880f1ffa | ||
|
|
53be88b677 | ||
|
|
3427ead8b8 | ||
|
|
32652189b0 |
59
.github/DISCUSSION_TEMPLATE/feature-requests.yml
vendored
Normal file
59
.github/DISCUSSION_TEMPLATE/feature-requests.yml
vendored
Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
title: "[Feature Request]: "
|
||||||
|
labels: ["⚙️ New"]
|
||||||
|
body:
|
||||||
|
- type: markdown
|
||||||
|
attributes:
|
||||||
|
value: |
|
||||||
|
Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check if already exists in
|
||||||
|
this discussions category to avoid duplicates. 😊
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: needs_to_be_done
|
||||||
|
attributes:
|
||||||
|
label: What needs to be done?
|
||||||
|
description: Please describe the feature or functionality you'd like to see.
|
||||||
|
placeholder: "e.g., Return alt text along with images scraped from a webpages in Result"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: problem_to_solve
|
||||||
|
attributes:
|
||||||
|
label: What problem does this solve?
|
||||||
|
description: Explain the pain point or issue this feature will help address.
|
||||||
|
placeholder: "e.g., Bypass Captchas added by cloudflare"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: target_users
|
||||||
|
attributes:
|
||||||
|
label: Target users/beneficiaries
|
||||||
|
description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
|
||||||
|
placeholder: "e.g., Marketing teams, developers"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: current_workarounds
|
||||||
|
attributes:
|
||||||
|
label: Current alternatives/workarounds
|
||||||
|
description: Are there any existing solutions or workarounds? How does this feature improve upon them?
|
||||||
|
placeholder: "e.g., Users manually select the css classes mapped to data fields to extract them"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
|
|
||||||
|
- type: markdown
|
||||||
|
attributes:
|
||||||
|
value: |
|
||||||
|
### 💡 Implementation Ideas
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: proposed_approach
|
||||||
|
attributes:
|
||||||
|
label: Proposed approach
|
||||||
|
description: Share any ideas you have for how this feature could be implemented. Point out any challenges your foresee
|
||||||
|
and the success metrics for this feature
|
||||||
|
placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
127
.github/ISSUE_TEMPLATE/bug_report.yml
vendored
Normal file
127
.github/ISSUE_TEMPLATE/bug_report.yml
vendored
Normal file
@@ -0,0 +1,127 @@
|
|||||||
|
name: Bug Report
|
||||||
|
description: Report a bug with the Crawl4AI.
|
||||||
|
title: "[Bug]: "
|
||||||
|
labels: ["🐞 Bug","🩺 Needs Triage"]
|
||||||
|
body:
|
||||||
|
- type: input
|
||||||
|
id: crawl4ai_version
|
||||||
|
attributes:
|
||||||
|
label: crawl4ai version
|
||||||
|
description: Specify the version of crawl4ai you are using.
|
||||||
|
placeholder: "e.g., 2.0.0"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: expected_behavior
|
||||||
|
attributes:
|
||||||
|
label: Expected Behavior
|
||||||
|
description: Describe what you expected to happen.
|
||||||
|
placeholder: "Provide a detailed explanation of the expected outcome."
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: current_behavior
|
||||||
|
attributes:
|
||||||
|
label: Current Behavior
|
||||||
|
description: Describe what is happening instead of the expected behavior.
|
||||||
|
placeholder: "Describe the actual result or issue you encountered."
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: dropdown
|
||||||
|
id: reproducible
|
||||||
|
attributes:
|
||||||
|
label: Is this reproducible?
|
||||||
|
description: Indicate whether this bug can be reproduced consistently.
|
||||||
|
options:
|
||||||
|
- "Yes"
|
||||||
|
- "No"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: inputs
|
||||||
|
attributes:
|
||||||
|
label: Inputs Causing the Bug
|
||||||
|
description: Provide details about the inputs causing the issue.
|
||||||
|
placeholder: |
|
||||||
|
- URL(s):
|
||||||
|
- Settings used:
|
||||||
|
- Input data (if applicable):
|
||||||
|
render: bash
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: steps_to_reproduce
|
||||||
|
attributes:
|
||||||
|
label: Steps to Reproduce
|
||||||
|
description: Provide step-by-step instructions to reproduce the issue.
|
||||||
|
placeholder: |
|
||||||
|
1. Go to...
|
||||||
|
2. Click on...
|
||||||
|
3. Observe the issue...
|
||||||
|
render: bash
|
||||||
|
|
||||||
|
- type: textarea
|
||||||
|
id: code_snippets
|
||||||
|
attributes:
|
||||||
|
label: Code snippets
|
||||||
|
description: Provide code snippets(if any). Add comments as necessary
|
||||||
|
placeholder: print("Hello world")
|
||||||
|
render: python
|
||||||
|
|
||||||
|
# Header Section with Title
|
||||||
|
- type: markdown
|
||||||
|
attributes:
|
||||||
|
value: |
|
||||||
|
## Supporting Information
|
||||||
|
Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem
|
||||||
|
|
||||||
|
- type: input
|
||||||
|
id: os
|
||||||
|
attributes:
|
||||||
|
label: OS
|
||||||
|
description: Please provide the operating system & distro where the issue occurs.
|
||||||
|
placeholder: "e.g., Windows, macOS, Linux"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
- type: input
|
||||||
|
id: python_version
|
||||||
|
attributes:
|
||||||
|
label: Python version
|
||||||
|
description: Specify the Python version being used.
|
||||||
|
placeholder: "e.g., 3.8.5"
|
||||||
|
validations:
|
||||||
|
required: true
|
||||||
|
|
||||||
|
# Browser Field
|
||||||
|
- type: input
|
||||||
|
id: browser
|
||||||
|
attributes:
|
||||||
|
label: Browser
|
||||||
|
description: Provide the name of the browser you are using.
|
||||||
|
placeholder: "e.g., Chrome, Firefox, Safari"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
|
|
||||||
|
# Browser Version Field
|
||||||
|
- type: input
|
||||||
|
id: browser_version
|
||||||
|
attributes:
|
||||||
|
label: Browser version
|
||||||
|
description: Provide the version of the browser you are using.
|
||||||
|
placeholder: "e.g., 91.0.4472.124"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
|
|
||||||
|
# Error Logs Field (Text Area)
|
||||||
|
- type: textarea
|
||||||
|
id: error_logs
|
||||||
|
attributes:
|
||||||
|
label: Error logs & Screenshots (if applicable)
|
||||||
|
description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
|
||||||
|
placeholder: "Paste error logs here and attach your screenshots"
|
||||||
|
validations:
|
||||||
|
required: false
|
||||||
8
.github/ISSUE_TEMPLATE/config.yml
vendored
Normal file
8
.github/ISSUE_TEMPLATE/config.yml
vendored
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
blank_issues_enabled: false
|
||||||
|
contact_links:
|
||||||
|
- name: Feature Requests
|
||||||
|
url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
|
||||||
|
about: "Suggest new features or enhancements for Crawl4AI"
|
||||||
|
- name: Forums - Q&A
|
||||||
|
url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
|
||||||
|
about: "Ask questions or engage in general discussions about Crawl4AI"
|
||||||
19
.github/pull_request_template.md
vendored
Normal file
19
.github/pull_request_template.md
vendored
Normal file
@@ -0,0 +1,19 @@
|
|||||||
|
## Summary
|
||||||
|
Please include a summary of the change and/or which issues are fixed.
|
||||||
|
|
||||||
|
eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
|
||||||
|
|
||||||
|
## List of files changed and why
|
||||||
|
eg: quickstart.py - To update the example as per new changes
|
||||||
|
|
||||||
|
## How Has This Been Tested?
|
||||||
|
Please describe the tests that you ran to verify your changes.
|
||||||
|
|
||||||
|
## Checklist:
|
||||||
|
|
||||||
|
- [ ] My code follows the style guidelines of this project
|
||||||
|
- [ ] I have performed a self-review of my own code
|
||||||
|
- [ ] I have commented my code, particularly in hard-to-understand areas
|
||||||
|
- [ ] I have made corresponding changes to the documentation
|
||||||
|
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
|
||||||
|
- [ ] New and existing unit tests pass locally with my changes
|
||||||
3
.gitignore
vendored
3
.gitignore
vendored
@@ -226,6 +226,9 @@ tree.md
|
|||||||
.local
|
.local
|
||||||
.do
|
.do
|
||||||
/plans
|
/plans
|
||||||
|
plans/
|
||||||
|
|
||||||
|
# Codeium
|
||||||
.codeiumignore
|
.codeiumignore
|
||||||
todo/
|
todo/
|
||||||
|
|
||||||
|
|||||||
11
CHANGELOG.md
11
CHANGELOG.md
@@ -5,9 +5,12 @@ All notable changes to Crawl4AI will be documented in this file.
|
|||||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Changed
|
||||||
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
|
Okay, here's a detailed changelog in Markdown format, generated from the provided git diff and commit history. I've focused on user-facing changes, fixes, and features, and grouped them as requested:
|
||||||
|
|
||||||
## Version 0.4.3 (2025-01-21)
|
## Version 0.4.3b2 (2025-01-21)
|
||||||
|
|
||||||
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
|
This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
|
||||||
|
|
||||||
@@ -135,9 +138,11 @@ This release introduces several powerful new features, including robots.txt comp
|
|||||||
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
|
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
|
||||||
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
|
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
|
||||||
|
|
||||||
#### Other
|
## [0.4.267] - 2025 - 01 - 06
|
||||||
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
|
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
|
||||||
|
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
|
||||||
|
|
||||||
## [0.4.24] - 2024-12-31
|
## [0.4.24] - 2024-12-31
|
||||||
|
|
||||||
|
|||||||
131
CODE_OF_CONDUCT.md
Normal file
131
CODE_OF_CONDUCT.md
Normal file
@@ -0,0 +1,131 @@
|
|||||||
|
# Crawl4AI Code of Conduct
|
||||||
|
|
||||||
|
## Our Pledge
|
||||||
|
|
||||||
|
We as members, contributors, and leaders pledge to make participation in our
|
||||||
|
community a harassment-free experience for everyone, regardless of age, body
|
||||||
|
size, visible or invisible disability, ethnicity, sex characteristics, gender
|
||||||
|
identity and expression, level of experience, education, socio-economic status,
|
||||||
|
nationality, personal appearance, race, caste, color, religion, or sexual
|
||||||
|
identity and orientation.
|
||||||
|
|
||||||
|
We pledge to act and interact in ways that contribute to an open, welcoming,
|
||||||
|
diverse, inclusive, and healthy community.
|
||||||
|
|
||||||
|
## Our Standards
|
||||||
|
|
||||||
|
Examples of behavior that contributes to a positive environment for our
|
||||||
|
community include:
|
||||||
|
|
||||||
|
* Demonstrating empathy and kindness toward other people
|
||||||
|
* Being respectful of differing opinions, viewpoints, and experiences
|
||||||
|
* Giving and gracefully accepting constructive feedback
|
||||||
|
* Accepting responsibility and apologizing to those affected by our mistakes,
|
||||||
|
and learning from the experience
|
||||||
|
* Focusing on what is best not just for us as individuals, but for the overall
|
||||||
|
community
|
||||||
|
|
||||||
|
Examples of unacceptable behavior include:
|
||||||
|
|
||||||
|
* The use of sexualized language or imagery, and sexual attention or advances of
|
||||||
|
any kind
|
||||||
|
* Trolling, insulting or derogatory comments, and personal or political attacks
|
||||||
|
* Public or private harassment
|
||||||
|
* Publishing others' private information, such as a physical or email address,
|
||||||
|
without their explicit permission
|
||||||
|
* Other conduct which could reasonably be considered inappropriate in a
|
||||||
|
professional setting
|
||||||
|
|
||||||
|
## Enforcement Responsibilities
|
||||||
|
|
||||||
|
Community leaders are responsible for clarifying and enforcing our standards of
|
||||||
|
acceptable behavior and will take appropriate and fair corrective action in
|
||||||
|
response to any behavior that they deem inappropriate, threatening, offensive,
|
||||||
|
or harmful.
|
||||||
|
|
||||||
|
Community leaders have the right and responsibility to remove, edit, or reject
|
||||||
|
comments, commits, code, wiki edits, issues, and other contributions that are
|
||||||
|
not aligned to this Code of Conduct, and will communicate reasons for moderation
|
||||||
|
decisions when appropriate.
|
||||||
|
|
||||||
|
## Scope
|
||||||
|
|
||||||
|
This Code of Conduct applies within all community spaces, and also applies when
|
||||||
|
an individual is officially representing the community in public spaces.
|
||||||
|
Examples of representing our community include using an official email address,
|
||||||
|
posting via an official social media account, or acting as an appointed
|
||||||
|
representative at an online or offline event.
|
||||||
|
|
||||||
|
## Enforcement
|
||||||
|
|
||||||
|
Instances of abusive, harassing, or otherwise unacceptable behavior may be
|
||||||
|
reported to the community leaders responsible for enforcement at
|
||||||
|
unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
|
||||||
|
|
||||||
|
All community leaders are obligated to respect the privacy and security of the
|
||||||
|
reporter of any incident.
|
||||||
|
|
||||||
|
## Enforcement Guidelines
|
||||||
|
|
||||||
|
Community leaders will follow these Community Impact Guidelines in determining
|
||||||
|
the consequences for any action they deem in violation of this Code of Conduct:
|
||||||
|
|
||||||
|
### 1. Correction
|
||||||
|
|
||||||
|
**Community Impact**: Use of inappropriate language or other behavior deemed
|
||||||
|
unprofessional or unwelcome in the community.
|
||||||
|
|
||||||
|
**Consequence**: A private, written warning from community leaders, providing
|
||||||
|
clarity around the nature of the violation and an explanation of why the
|
||||||
|
behavior was inappropriate. A public apology may be requested.
|
||||||
|
|
||||||
|
### 2. Warning
|
||||||
|
|
||||||
|
**Community Impact**: A violation through a single incident or series of
|
||||||
|
actions.
|
||||||
|
|
||||||
|
**Consequence**: A warning with consequences for continued behavior. No
|
||||||
|
interaction with the people involved, including unsolicited interaction with
|
||||||
|
those enforcing the Code of Conduct, for a specified period of time. This
|
||||||
|
includes avoiding interactions in community spaces as well as external channels
|
||||||
|
like social media. Violating these terms may lead to a temporary or permanent
|
||||||
|
ban.
|
||||||
|
|
||||||
|
### 3. Temporary Ban
|
||||||
|
|
||||||
|
**Community Impact**: A serious violation of community standards, including
|
||||||
|
sustained inappropriate behavior.
|
||||||
|
|
||||||
|
**Consequence**: A temporary ban from any sort of interaction or public
|
||||||
|
communication with the community for a specified period of time. No public or
|
||||||
|
private interaction with the people involved, including unsolicited interaction
|
||||||
|
with those enforcing the Code of Conduct, is allowed during this period.
|
||||||
|
Violating these terms may lead to a permanent ban.
|
||||||
|
|
||||||
|
### 4. Permanent Ban
|
||||||
|
|
||||||
|
**Community Impact**: Demonstrating a pattern of violation of community
|
||||||
|
standards, including sustained inappropriate behavior, harassment of an
|
||||||
|
individual, or aggression toward or disparagement of classes of individuals.
|
||||||
|
|
||||||
|
**Consequence**: A permanent ban from any sort of public interaction within the
|
||||||
|
community.
|
||||||
|
|
||||||
|
## Attribution
|
||||||
|
|
||||||
|
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
|
||||||
|
version 2.1, available at
|
||||||
|
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
|
||||||
|
|
||||||
|
Community Impact Guidelines were inspired by
|
||||||
|
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
|
||||||
|
|
||||||
|
For answers to common questions about this code of conduct, see the FAQ at
|
||||||
|
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
|
||||||
|
[https://www.contributor-covenant.org/translations][translations].
|
||||||
|
|
||||||
|
[homepage]: https://www.contributor-covenant.org
|
||||||
|
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
|
||||||
|
[Mozilla CoC]: https://github.com/mozilla/diversity
|
||||||
|
[FAQ]: https://www.contributor-covenant.org/faq
|
||||||
|
[translations]: https://www.contributor-covenant.org/translations
|
||||||
@@ -15,6 +15,7 @@
|
|||||||
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
|
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
|
||||||
[](https://github.com/psf/black)
|
[](https://github.com/psf/black)
|
||||||
[](https://github.com/PyCQA/bandit)
|
[](https://github.com/PyCQA/bandit)
|
||||||
|
[](code_of_conduct.md)
|
||||||
|
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
@@ -446,7 +447,7 @@ if __name__ == "__main__":
|
|||||||
</details>
|
</details>
|
||||||
|
|
||||||
<details>
|
<details>
|
||||||
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
|
<summary>🤖 <strong>Using You own Browser with Custom User Profile</strong></summary>
|
||||||
|
|
||||||
```python
|
```python
|
||||||
import os, sys
|
import os, sys
|
||||||
|
|||||||
137
docs/md_v2/basic/installation.md
Normal file
137
docs/md_v2/basic/installation.md
Normal file
@@ -0,0 +1,137 @@
|
|||||||
|
# Installation 💻
|
||||||
|
|
||||||
|
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package, use it with Docker, or run it as a local server.
|
||||||
|
|
||||||
|
## Option 1: Python Package Installation (Recommended)
|
||||||
|
|
||||||
|
Crawl4AI is now available on PyPI, making installation easier than ever. Choose the option that best fits your needs:
|
||||||
|
|
||||||
|
### Basic Installation
|
||||||
|
|
||||||
|
For basic web crawling and scraping tasks:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install crawl4ai
|
||||||
|
playwright install # Install Playwright dependencies
|
||||||
|
```
|
||||||
|
|
||||||
|
### Installation with PyTorch
|
||||||
|
|
||||||
|
For advanced text clustering (includes CosineSimilarity cluster strategy):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install crawl4ai[torch]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Installation with Transformers
|
||||||
|
|
||||||
|
For text summarization and Hugging Face models:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install crawl4ai[transformer]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Full Installation
|
||||||
|
|
||||||
|
For all features:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install crawl4ai[all]
|
||||||
|
```
|
||||||
|
|
||||||
|
### Development Installation
|
||||||
|
|
||||||
|
For contributors who plan to modify the source code:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/unclecode/crawl4ai.git
|
||||||
|
cd crawl4ai
|
||||||
|
pip install -e ".[all]"
|
||||||
|
playwright install # Install Playwright dependencies
|
||||||
|
```
|
||||||
|
|
||||||
|
💡 After installation with "torch", "transformer", or "all" options, it's recommended to run the following CLI command to load the required models:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
crawl4ai-download-models
|
||||||
|
```
|
||||||
|
|
||||||
|
This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
|
||||||
|
|
||||||
|
## Playwright Installation Note for Ubuntu
|
||||||
|
|
||||||
|
If you encounter issues with Playwright installation on Ubuntu, you may need to install additional dependencies:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo apt-get install -y \
|
||||||
|
libwoff1 \
|
||||||
|
libopus0 \
|
||||||
|
libwebp7 \
|
||||||
|
libwebpdemux2 \
|
||||||
|
libenchant-2-2 \
|
||||||
|
libgudev-1.0-0 \
|
||||||
|
libsecret-1-0 \
|
||||||
|
libhyphen0 \
|
||||||
|
libgdk-pixbuf2.0-0 \
|
||||||
|
libegl1 \
|
||||||
|
libnotify4 \
|
||||||
|
libxslt1.1 \
|
||||||
|
libevent-2.1-7 \
|
||||||
|
libgles2 \
|
||||||
|
libxcomposite1 \
|
||||||
|
libatk1.0-0 \
|
||||||
|
libatk-bridge2.0-0 \
|
||||||
|
libepoxy0 \
|
||||||
|
libgtk-3-0 \
|
||||||
|
libharfbuzz-icu0 \
|
||||||
|
libgstreamer-gl1.0-0 \
|
||||||
|
libgstreamer-plugins-bad1.0-0 \
|
||||||
|
gstreamer1.0-plugins-good \
|
||||||
|
gstreamer1.0-plugins-bad \
|
||||||
|
libxt6 \
|
||||||
|
libxaw7 \
|
||||||
|
xvfb \
|
||||||
|
fonts-noto-color-emoji \
|
||||||
|
libfontconfig \
|
||||||
|
libfreetype6 \
|
||||||
|
xfonts-cyrillic \
|
||||||
|
xfonts-scalable \
|
||||||
|
fonts-liberation \
|
||||||
|
fonts-ipafont-gothic \
|
||||||
|
fonts-wqy-zenhei \
|
||||||
|
fonts-tlwg-loma-otf \
|
||||||
|
fonts-freefont-ttf
|
||||||
|
```
|
||||||
|
|
||||||
|
## Option 2: Using Docker (Coming Soon)
|
||||||
|
|
||||||
|
Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.
|
||||||
|
|
||||||
|
## Option 3: Local Server Installation
|
||||||
|
|
||||||
|
For those who prefer to run Crawl4AI as a local server, instructions will be provided once the Docker implementation is complete.
|
||||||
|
|
||||||
|
## Verifying Your Installation
|
||||||
|
|
||||||
|
After installation, you can verify that Crawl4AI is working correctly by running a simple Python script:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import asyncio
|
||||||
|
from crawl4ai import AsyncWebCrawler
|
||||||
|
|
||||||
|
async def main():
|
||||||
|
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||||
|
result = await crawler.arun(url="https://www.example.com")
|
||||||
|
print(result.markdown[:500]) # Print first 500 characters
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(main())
|
||||||
|
```
|
||||||
|
|
||||||
|
This script should successfully crawl the example website and print the first 500 characters of the extracted content.
|
||||||
|
|
||||||
|
## Getting Help
|
||||||
|
|
||||||
|
If you encounter any issues during installation or usage, please check the [documentation](https://docs.crawl4ai.com/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues).
|
||||||
|
|
||||||
|
Happy crawling! 🕷️🤖
|
||||||
Reference in New Issue
Block a user