Compare commits
145 Commits
feature/co
...
v0.4.241
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
5313c71a0d | ||
|
|
d36ef3d424 | ||
|
|
4a4f613238 | ||
|
|
dc6a24618e | ||
|
|
74a7c6dbb6 | ||
|
|
67f65f958b | ||
|
|
78b6ba5cef | ||
|
|
3f019d34cc | ||
|
|
304260e484 | ||
|
|
704bd66b63 | ||
|
|
1acc162c18 | ||
|
|
553c97a0c1 | ||
|
|
bd66befcf0 | ||
|
|
19b0a5ae82 | ||
|
|
bd71f7f4ea | ||
|
|
171ce25ba6 | ||
|
|
6c5a44f774 | ||
|
|
5c3c05bf93 | ||
|
|
67d0999bc3 | ||
|
|
553a4622bf | ||
|
|
6f81ef006d | ||
|
|
a04870a662 | ||
|
|
f7d26390c5 | ||
|
|
141783fb2d | ||
|
|
2fedd4876e | ||
|
|
e187b0aaf0 | ||
|
|
e95374d7c6 | ||
|
|
8f2d0cda2f | ||
|
|
9d261d2b9c | ||
|
|
7792fe0e4c | ||
|
|
86259244e4 | ||
|
|
0ec593fa90 | ||
|
|
7391d6be73 | ||
|
|
e4e23065f1 | ||
|
|
fb33a24891 | ||
|
|
78768fd714 | ||
|
|
f2d9912697 | ||
|
|
9a4ed6bbd7 | ||
|
|
d5ed451299 | ||
|
|
bacbeb3ed4 | ||
|
|
84b311760f | ||
|
|
8fbc2e0463 | ||
|
|
849765712f | ||
|
|
393bb911c0 | ||
|
|
4a5f1aebee | ||
|
|
a11d9646e3 | ||
|
|
ed7bc1909c | ||
|
|
e9e5b5642d | ||
|
|
7524aa7b5e | ||
|
|
7af1d32ef6 | ||
|
|
399af801a1 | ||
|
|
4a72c5ea6e | ||
|
|
20d6f5fdf4 | ||
|
|
3d69715dba | ||
|
|
de1766d565 | ||
|
|
0982c639ae | ||
|
|
5188b7a6a0 | ||
|
|
759164831d | ||
|
|
5431fa2d0c | ||
|
|
e130fd8db9 | ||
|
|
ded554d334 | ||
|
|
2d31915f0a | ||
|
|
ba3e808802 | ||
|
|
e3488da194 | ||
|
|
740214e021 | ||
|
|
c51e901f68 | ||
|
|
8c611dcb4b | ||
|
|
a45b8b1eb1 | ||
|
|
56f82f3e7f | ||
|
|
486db3a771 | ||
|
|
b02544bc0b | ||
|
|
e9639ad189 | ||
|
|
95a4f74d2a | ||
|
|
293f299c08 | ||
|
|
80d58ad24c | ||
|
|
3e83893b3f | ||
|
|
8c76a8c7dc | ||
|
|
0780db55e1 | ||
|
|
1ed7c15118 | ||
|
|
569bdb6073 | ||
|
|
1def53b7fe | ||
|
|
f9c98a377d | ||
|
|
93bf3e8a1f | ||
|
|
d202f3539b | ||
|
|
12e73d4898 | ||
|
|
449dd7cc0b | ||
|
|
b0419edda6 | ||
|
|
c0e87abaee | ||
|
|
c8485776fe | ||
|
|
aa3e2d0fe6 | ||
|
|
98c64f9d5f | ||
|
|
7d81c17cca | ||
|
|
652d396a81 | ||
|
|
1d83c493af | ||
|
|
cf35cbe59e | ||
|
|
9221c08418 | ||
|
|
48d43c14b1 | ||
|
|
776efa74a4 | ||
|
|
b14e83f499 | ||
|
|
a9b6b65238 | ||
|
|
a036b7f122 | ||
|
|
0bccf23db3 | ||
|
|
0cbd594512 | ||
|
|
efe93a5f57 | ||
|
|
3fda66b85b | ||
|
|
ddfb6707b4 | ||
|
|
a69f7a9531 | ||
|
|
d583aa43ca | ||
|
|
3abb573142 | ||
|
|
d556dada9f | ||
|
|
ce7d49484f | ||
|
|
e4acd18429 | ||
|
|
c2d4784810 | ||
|
|
76bea6c577 | ||
|
|
3ff0b0b2c4 | ||
|
|
a1c7dc17ce | ||
|
|
24723b2f10 | ||
|
|
f998e9e949 | ||
|
|
73661f7d1f | ||
|
|
b5d4db07d1 | ||
|
|
c6a022132b | ||
|
|
195c0ccf8a | ||
|
|
b09a86c0c1 | ||
|
|
de43505ae4 | ||
|
|
d7c5b900b8 | ||
|
|
edad7b6a74 | ||
|
|
829a1f7992 | ||
|
|
d729aa7d5e | ||
|
|
0d0cef3438 | ||
|
|
d7a112fefe | ||
|
|
a5decaa7cf | ||
|
|
8dea3f470f | ||
|
|
e02935dc5b | ||
|
|
24ad2fe2dd | ||
|
|
571dda6549 | ||
|
|
006bee4a5a | ||
|
|
dbb751c8f0 | ||
|
|
3439f7886d | ||
|
|
d418a04602 | ||
|
|
b654c49e55 | ||
|
|
fbcff85ecb | ||
|
|
788c67c29a | ||
|
|
2f19d38693 | ||
|
|
3aae30ed2a | ||
|
|
593c7ad307 |
220
.codeiumignore
Normal file
220
.codeiumignore
Normal file
@@ -0,0 +1,220 @@
|
||||
# Byte-compiled / optimized / DLL files
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
|
||||
# C extensions
|
||||
*.so
|
||||
|
||||
# Distribution / packaging
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
share/python-wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
MANIFEST
|
||||
|
||||
# PyInstaller
|
||||
# Usually these files are written by a python script from a template
|
||||
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
||||
*.manifest
|
||||
*.spec
|
||||
|
||||
# Installer logs
|
||||
pip-log.txt
|
||||
pip-delete-this-directory.txt
|
||||
|
||||
# Unit test / coverage reports
|
||||
htmlcov/
|
||||
.tox/
|
||||
.nox/
|
||||
.coverage
|
||||
.coverage.*
|
||||
.cache
|
||||
nosetests.xml
|
||||
coverage.xml
|
||||
*.cover
|
||||
*.py,cover
|
||||
.hypothesis/
|
||||
.pytest_cache/
|
||||
cover/
|
||||
|
||||
# Translations
|
||||
*.mo
|
||||
*.pot
|
||||
|
||||
# Django stuff:
|
||||
*.log
|
||||
local_settings.py
|
||||
db.sqlite3
|
||||
db.sqlite3-journal
|
||||
|
||||
# Flask stuff:
|
||||
instance/
|
||||
.webassets-cache
|
||||
|
||||
# Scrapy stuff:
|
||||
.scrapy
|
||||
|
||||
# Sphinx documentation
|
||||
docs/_build/
|
||||
|
||||
# PyBuilder
|
||||
.pybuilder/
|
||||
target/
|
||||
|
||||
# Jupyter Notebook
|
||||
.ipynb_checkpoints
|
||||
|
||||
# IPython
|
||||
profile_default/
|
||||
ipython_config.py
|
||||
|
||||
# pyenv
|
||||
# For a library or package, you might want to ignore these files since the code is
|
||||
# intended to run in multiple environments; otherwise, check them in:
|
||||
# .python-version
|
||||
|
||||
# pipenv
|
||||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
||||
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
||||
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
||||
# install all needed dependencies.
|
||||
#Pipfile.lock
|
||||
|
||||
# poetry
|
||||
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
|
||||
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
||||
# commonly ignored for libraries.
|
||||
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
|
||||
#poetry.lock
|
||||
|
||||
# pdm
|
||||
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
|
||||
#pdm.lock
|
||||
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
|
||||
# in version control.
|
||||
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
|
||||
.pdm.toml
|
||||
.pdm-python
|
||||
.pdm-build/
|
||||
|
||||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
|
||||
__pypackages__/
|
||||
|
||||
# Celery stuff
|
||||
celerybeat-schedule
|
||||
celerybeat.pid
|
||||
|
||||
# SageMath parsed files
|
||||
*.sage.py
|
||||
|
||||
# Environments
|
||||
.env
|
||||
.venv
|
||||
env/
|
||||
venv/
|
||||
ENV/
|
||||
env.bak/
|
||||
venv.bak/
|
||||
|
||||
# Spyder project settings
|
||||
.spyderproject
|
||||
.spyproject
|
||||
|
||||
# Rope project settings
|
||||
.ropeproject
|
||||
|
||||
# mkdocs documentation
|
||||
/site
|
||||
|
||||
# mypy
|
||||
.mypy_cache/
|
||||
.dmypy.json
|
||||
dmypy.json
|
||||
|
||||
# Pyre type checker
|
||||
.pyre/
|
||||
|
||||
# pytype static type analyzer
|
||||
.pytype/
|
||||
|
||||
# Cython debug symbols
|
||||
cython_debug/
|
||||
|
||||
# PyCharm
|
||||
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
|
||||
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
|
||||
# and can be added to the global gitignore or merged into this file. For a more nuclear
|
||||
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
|
||||
#.idea/
|
||||
|
||||
Crawl4AI.egg-info/
|
||||
Crawl4AI.egg-info/*
|
||||
crawler_data.db
|
||||
.vscode/
|
||||
.tests/
|
||||
.test_pads/
|
||||
test_pad.py
|
||||
test_pad*.py
|
||||
.data/
|
||||
Crawl4AI.egg-info/
|
||||
|
||||
requirements0.txt
|
||||
a.txt
|
||||
|
||||
*.sh
|
||||
.idea
|
||||
docs/examples/.chainlit/
|
||||
docs/examples/.chainlit/*
|
||||
.chainlit/config.toml
|
||||
.chainlit/translations/en-US.json
|
||||
|
||||
local/
|
||||
.files/
|
||||
|
||||
a.txt
|
||||
.lambda_function.py
|
||||
ec2*
|
||||
|
||||
update_changelog.sh
|
||||
|
||||
.DS_Store
|
||||
docs/.DS_Store
|
||||
tmp/
|
||||
test_env/
|
||||
**/.DS_Store
|
||||
**/.DS_Store
|
||||
|
||||
todo.md
|
||||
todo_executor.md
|
||||
git_changes.py
|
||||
git_changes.md
|
||||
pypi_build.sh
|
||||
git_issues.py
|
||||
git_issues.md
|
||||
|
||||
.next/
|
||||
.tests/
|
||||
.docs/
|
||||
.gitboss/
|
||||
todo_executor.md
|
||||
protect-all-except-feature.sh
|
||||
manage-collab.sh
|
||||
publish.sh
|
||||
combine.sh
|
||||
combined_output.txt
|
||||
tree.md
|
||||
|
||||
19
.do/app.yaml
19
.do/app.yaml
@@ -1,19 +0,0 @@
|
||||
alerts:
|
||||
- rule: DEPLOYMENT_FAILED
|
||||
- rule: DOMAIN_FAILED
|
||||
name: crawl4ai
|
||||
region: nyc
|
||||
services:
|
||||
- dockerfile_path: Dockerfile
|
||||
github:
|
||||
branch: 0.3.74
|
||||
deploy_on_push: true
|
||||
repo: unclecode/crawl4ai
|
||||
health_check:
|
||||
http_path: /health
|
||||
http_port: 11235
|
||||
instance_count: 1
|
||||
instance_size_slug: professional-xs
|
||||
name: web
|
||||
routes:
|
||||
- path: /
|
||||
@@ -1,22 +0,0 @@
|
||||
spec:
|
||||
name: crawl4ai
|
||||
services:
|
||||
- name: crawl4ai
|
||||
git:
|
||||
branch: 0.3.74
|
||||
repo_clone_url: https://github.com/unclecode/crawl4ai.git
|
||||
dockerfile_path: Dockerfile
|
||||
http_port: 11235
|
||||
instance_count: 1
|
||||
instance_size_slug: professional-xs
|
||||
health_check:
|
||||
http_path: /health
|
||||
envs:
|
||||
- key: INSTALL_TYPE
|
||||
value: "basic"
|
||||
- key: PYTHON_VERSION
|
||||
value: "3.10"
|
||||
- key: ENABLE_GPU
|
||||
value: "false"
|
||||
routes:
|
||||
- path: /
|
||||
18
.gitignore
vendored
18
.gitignore
vendored
@@ -206,10 +206,22 @@ pypi_build.sh
|
||||
git_issues.py
|
||||
git_issues.md
|
||||
|
||||
.next/
|
||||
.tests/
|
||||
.issues/
|
||||
# .issues/
|
||||
.docs/
|
||||
.issues/
|
||||
.gitboss/
|
||||
|
||||
manage-collab.sh
|
||||
todo_executor.md
|
||||
protect-all-except-feature.sh
|
||||
manage-collab.sh
|
||||
publish.sh
|
||||
combine.sh
|
||||
combined_output.txt
|
||||
.local
|
||||
.scripts
|
||||
tree.md
|
||||
tree.md
|
||||
.scripts
|
||||
.local
|
||||
.do
|
||||
|
||||
314
CHANGELOG.md
314
CHANGELOG.md
@@ -1,5 +1,315 @@
|
||||
# Changelog
|
||||
|
||||
All notable changes to Crawl4AI will be documented in this file.
|
||||
|
||||
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
||||
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
## [0.4.24] - 2024-12-31
|
||||
|
||||
### Added
|
||||
- **Browser and SSL Handling**
|
||||
- SSL certificate validation options in extraction strategies
|
||||
- Custom certificate paths support
|
||||
- Configurable certificate validation skipping
|
||||
- Enhanced response status code handling with retry logic
|
||||
|
||||
- **Content Processing**
|
||||
- New content filtering system with regex support
|
||||
- Advanced chunking strategies for large content
|
||||
- Memory-efficient parallel processing
|
||||
- Configurable chunk size optimization
|
||||
|
||||
- **JSON Extraction**
|
||||
- Complex JSONPath expression support
|
||||
- JSON-CSS and Microdata extraction
|
||||
- RDFa parsing capabilities
|
||||
- Advanced data transformation pipeline
|
||||
|
||||
- **Field Types**
|
||||
- New field types: `computed`, `conditional`, `aggregate`, `template`
|
||||
- Field inheritance system
|
||||
- Reusable field definitions
|
||||
- Custom validation rules
|
||||
|
||||
### Changed
|
||||
- **Performance**
|
||||
- Optimized selector compilation with caching
|
||||
- Improved HTML parsing efficiency
|
||||
- Enhanced memory management for large documents
|
||||
- Batch processing optimizations
|
||||
|
||||
- **Error Handling**
|
||||
- More detailed error messages and categorization
|
||||
- Enhanced debugging capabilities
|
||||
- Improved performance metrics tracking
|
||||
- Better error recovery mechanisms
|
||||
|
||||
### Deprecated
|
||||
- Old field computation method using `eval`
|
||||
- Direct browser manipulation without proper SSL handling
|
||||
- Simple text-based content filtering
|
||||
|
||||
### Removed
|
||||
- Legacy extraction patterns without proper error handling
|
||||
- Unsafe eval-based field computation
|
||||
- Direct DOM manipulation without sanitization
|
||||
|
||||
### Fixed
|
||||
- Memory leaks in large document processing
|
||||
- SSL certificate validation issues
|
||||
- Incorrect handling of nested JSON structures
|
||||
- Performance bottlenecks in parallel processing
|
||||
|
||||
### Security
|
||||
- Improved input validation and sanitization
|
||||
- Safe expression evaluation system
|
||||
- Enhanced resource protection
|
||||
- Rate limiting implementation
|
||||
|
||||
## [0.4.1] - 2024-12-08
|
||||
|
||||
### **File: `crawl4ai/async_crawler_strategy.py`**
|
||||
|
||||
#### **New Parameters and Attributes Added**
|
||||
- **`text_mode` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
|
||||
- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
|
||||
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_mode` mode (default values: 800x600 for `text_mode`, 1920x1080 otherwise).
|
||||
- **`extra_args`**: Adds browser-specific flags for `text_mode` mode.
|
||||
- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
|
||||
|
||||
#### **Browser Context Adjustments**
|
||||
- Added **`viewport` adjustments**: Dynamically computed based on `text_mode` or custom configuration.
|
||||
- Enhanced support for `light_mode` and `text_mode` by adding specific browser arguments to reduce resource consumption.
|
||||
|
||||
#### **Dynamic Content Handling**
|
||||
- **Full Page Scan Feature**:
|
||||
- Scrolls through the entire page while dynamically detecting content changes.
|
||||
- Ensures scrolling stops when no new dynamic content is loaded.
|
||||
|
||||
#### **Session Management**
|
||||
- Added **`create_session`** method:
|
||||
- Creates a new browser session and assigns a unique ID.
|
||||
- Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
|
||||
|
||||
#### **Improved Content Loading and Adjustment**
|
||||
- **`adjust_viewport_to_content`**:
|
||||
- Automatically adjusts viewport to match content dimensions.
|
||||
- Includes scaling via Chrome DevTools Protocol (CDP).
|
||||
- Enhanced content loading:
|
||||
- Waits for images to load and ensures network activity is idle before proceeding.
|
||||
|
||||
#### **Error Handling and Logging**
|
||||
- Improved error handling and detailed logging for:
|
||||
- Viewport adjustment (`adjust_viewport_to_content`).
|
||||
- Full page scanning (`scan_full_page`).
|
||||
- Dynamic content loading.
|
||||
|
||||
#### **Refactoring and Cleanup**
|
||||
- Removed hardcoded viewport dimensions in multiple places, replaced with dynamic values (`self.viewport_width`, `self.viewport_height`).
|
||||
- Removed commented-out and unused code for better readability.
|
||||
- Added default value for `delay_before_return_html` parameter.
|
||||
|
||||
#### **Optimizations**
|
||||
- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
|
||||
- Improved compatibility for different browser types (`chrome`, `firefox`, `webkit`).
|
||||
|
||||
---
|
||||
|
||||
### **File: `docs/examples/quickstart_async.py`**
|
||||
|
||||
#### **Schema Adjustment**
|
||||
- Changed schema reference for `LLMExtractionStrategy`:
|
||||
- **Old**: `OpenAIModelFee.schema()`
|
||||
- **New**: `OpenAIModelFee.model_json_schema()`
|
||||
- This likely ensures better compatibility with the `OpenAIModelFee` class and its JSON schema.
|
||||
|
||||
#### **Documentation Comments Updated**
|
||||
- Improved extraction instruction for schema-based LLM strategies.
|
||||
|
||||
---
|
||||
|
||||
### **New Features Added**
|
||||
1. **Text-Only Mode**:
|
||||
- Focuses on minimal resource usage by disabling non-essential browser features.
|
||||
2. **Light Mode**:
|
||||
- Optimizes browser for performance by disabling background tasks and unnecessary services.
|
||||
3. **Full Page Scanning**:
|
||||
- Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
|
||||
4. **Dynamic Viewport Adjustment**:
|
||||
- Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
|
||||
5. **Session Management**:
|
||||
- Simplifies session handling with better support for persistent and non-persistent contexts.
|
||||
|
||||
---
|
||||
|
||||
### **Bug Fixes**
|
||||
- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
|
||||
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## [0.3.75] December 1, 2024
|
||||
|
||||
### PruningContentFilter
|
||||
|
||||
#### 1. Introduced PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
|
||||
A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.
|
||||
|
||||
**Affected Files:**
|
||||
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
|
||||
```diff
|
||||
Implemented effective pruning algorithm with comprehensive scoring.
|
||||
```
|
||||
- `README.md`: Improved documentation regarding new features.
|
||||
```diff
|
||||
Updated to include usage and explanation for the PruningContentFilter.
|
||||
```
|
||||
- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
|
||||
```diff
|
||||
Added detailed section explaining the PruningContentFilter.
|
||||
```
|
||||
|
||||
#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
|
||||
Comprehensive tests added to ensure correct functionality of PruningContentFilter
|
||||
|
||||
**Affected Files:**
|
||||
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
|
||||
```diff
|
||||
Created test cases for various scenarios using the PruningContentFilter.
|
||||
```
|
||||
|
||||
### Development Updates
|
||||
|
||||
#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024) (Dec 01, 2024)
|
||||
Extended testing to cover additional edge cases and performance metrics.
|
||||
|
||||
**Affected Files:**
|
||||
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
|
||||
```diff
|
||||
Added tests for new extraction scenarios including malformed HTML.
|
||||
```
|
||||
|
||||
### Infrastructure & Documentation
|
||||
|
||||
#### 4. Updated Examples (Dec 01, 2024) (Dec 01, 2024)
|
||||
Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.
|
||||
|
||||
**Affected Files:**
|
||||
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
|
||||
- Revised example to illustrate usage of PruningContentFilter.
|
||||
|
||||
## [0.3.746] November 29, 2024
|
||||
|
||||
### Major Features
|
||||
1. Enhanced Docker Support (Nov 29, 2024)
|
||||
- Improved GPU support in Docker images.
|
||||
- Dockerfile refactored for better platform-specific installations.
|
||||
- Introduced new Docker commands for different platforms:
|
||||
- `basic-amd64`, `all-amd64`, `gpu-amd64` for AMD64.
|
||||
- `basic-arm64`, `all-arm64`, `gpu-arm64` for ARM64.
|
||||
|
||||
### Infrastructure & Documentation
|
||||
- Enhanced README.md to improve user guidance and installation instructions.
|
||||
- Added installation instructions for Playwright setup in README.
|
||||
- Created and updated examples in `docs/examples/quickstart_async.py` to be more useful and user-friendly.
|
||||
- Updated `requirements.txt` with a new `pydantic` dependency.
|
||||
- Bumped version number in `crawl4ai/__version__.py` to 0.3.746.
|
||||
|
||||
### Breaking Changes
|
||||
- Streamlined application structure:
|
||||
- Removed static pages and related code from `main.py` which might affect existing deployments relying on static content.
|
||||
|
||||
### Development Updates
|
||||
- Developed `post_install` method in `crawl4ai/install.py` to streamline post-installation setup tasks.
|
||||
- Refined migration processes in `crawl4ai/migrations.py` with enhanced logging for better error visibility.
|
||||
- Updated `docker-compose.yml` to support local and hub services for different architectures, enhancing build and deploy capabilities.
|
||||
- Refactored example test cases in `docs/examples/docker_example.py` to facilitate comprehensive testing.
|
||||
|
||||
### README.md
|
||||
Updated README with new docker commands and setup instructions.
|
||||
Enhanced installation instructions and guidance.
|
||||
|
||||
### crawl4ai/install.py
|
||||
Added post-install script functionality.
|
||||
Introduced `post_install` method for automation of post-installation tasks.
|
||||
|
||||
### crawl4ai/migrations.py
|
||||
Improved migration logging.
|
||||
Refined migration processes and added better logging.
|
||||
|
||||
### docker-compose.yml
|
||||
Refactored docker-compose for better service management.
|
||||
Updated to define services for different platforms and versions.
|
||||
|
||||
### requirements.txt
|
||||
Updated dependencies.
|
||||
Added `pydantic` to requirements file.
|
||||
|
||||
### crawler/__version__.py
|
||||
Updated version number.
|
||||
Bumped version number to 0.3.746.
|
||||
|
||||
### docs/examples/quickstart_async.py
|
||||
Enhanced example scripts.
|
||||
Uncommented example usage in async guide for user functionality.
|
||||
|
||||
### main.py
|
||||
Refactored code to improve maintainability.
|
||||
Streamlined app structure by removing static pages code.
|
||||
|
||||
## [0.3.743] November 27, 2024
|
||||
|
||||
Enhance features and documentation
|
||||
- Updated version to 0.3.743
|
||||
- Improved ManagedBrowser configuration with dynamic host/port
|
||||
- Implemented fast HTML formatting in web crawler
|
||||
- Enhanced markdown generation with a new generator class
|
||||
- Improved sanitization and utility functions
|
||||
- Added contributor details and pull request acknowledgments
|
||||
- Updated documentation for clearer usage scenarios
|
||||
- Adjusted tests to reflect class name changes
|
||||
|
||||
### CONTRIBUTORS.md
|
||||
Added new contributors and pull request details.
|
||||
Updated community contributions and acknowledged pull requests.
|
||||
|
||||
### crawl4ai/__version__.py
|
||||
Version update.
|
||||
Bumped version to 0.3.743.
|
||||
|
||||
### crawl4ai/async_crawler_strategy.py
|
||||
Improved ManagedBrowser configuration.
|
||||
Enhanced browser initialization with configurable host and debugging port; improved hook execution.
|
||||
|
||||
### crawl4ai/async_webcrawler.py
|
||||
Optimized HTML processing.
|
||||
Implemented 'fast_format_html' for optimized HTML formatting; applied it when 'prettiify' is enabled.
|
||||
|
||||
### crawl4ai/content_scraping_strategy.py
|
||||
Enhanced markdown generation strategy.
|
||||
Updated to use DefaultMarkdownGenerator and improved markdown generation with filters option.
|
||||
|
||||
### crawl4ai/markdown_generation_strategy.py
|
||||
Refactored markdown generation class.
|
||||
Renamed DefaultMarkdownGenerationStrategy to DefaultMarkdownGenerator; added content filter handling.
|
||||
|
||||
### crawl4ai/utils.py
|
||||
Enhanced utility functions.
|
||||
Improved input sanitization and enhanced HTML formatting method.
|
||||
|
||||
### docs/md_v2/advanced/hooks-auth.md
|
||||
Improved documentation for hooks.
|
||||
Updated code examples to include cookies in crawler strategy initialization.
|
||||
|
||||
### tests/async/test_markdown_genertor.py
|
||||
Refactored tests to match class renaming.
|
||||
Updated tests to use renamed DefaultMarkdownGenerator class.
|
||||
|
||||
## [0.3.74] November 17, 2024
|
||||
|
||||
This changelog details the updates and changes introduced in Crawl4AI version 0.3.74. It's designed to inform developers about new features, modifications to existing components, removals, and other important information.
|
||||
@@ -466,7 +776,7 @@ This commit introduces several key enhancements, including improved error handli
|
||||
- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
|
||||
- Enhanced flexibility in `CosineStrategy`:
|
||||
- Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
|
||||
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
|
||||
- Updated `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` for better JSON-based extraction.
|
||||
|
||||
### Fixed
|
||||
- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
|
||||
@@ -737,6 +1047,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
|
||||
- Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
|
||||
- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
|
||||
|
||||
## [0.2.4] - 2024-06-17
|
||||
## [v0.2.4] - 2024-06-17
|
||||
### Fixed
|
||||
- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
|
||||
|
||||
@@ -10,11 +10,21 @@ We would like to thank the following people for their contributions to Crawl4AI:
|
||||
|
||||
## Community Contributors
|
||||
|
||||
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
|
||||
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
|
||||
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
|
||||
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
|
||||
- [datehoer](https://github.com/datehoer) - Add browser prxy support
|
||||
|
||||
## Pull Requests
|
||||
|
||||
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
|
||||
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
|
||||
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
|
||||
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
|
||||
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
|
||||
|
||||
|
||||
## Other Contributors
|
||||
|
||||
- [Gokhan](https://github.com/gkhngyk)
|
||||
|
||||
25
Dockerfile
25
Dockerfile
@@ -1,6 +1,9 @@
|
||||
# syntax=docker/dockerfile:1.4
|
||||
|
||||
# Build arguments
|
||||
ARG TARGETPLATFORM
|
||||
ARG BUILDPLATFORM
|
||||
|
||||
# Other build arguments
|
||||
ARG PYTHON_VERSION=3.10
|
||||
|
||||
# Base stage with system dependencies
|
||||
@@ -63,13 +66,13 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# GPU support if enabled and architecture is supported
|
||||
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$(dpkg --print-architecture)" != "arm64" ] ; then \
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
nvidia-cuda-toolkit \
|
||||
&& rm -rf /var/lib/apt/lists/* ; \
|
||||
else \
|
||||
echo "Skipping NVIDIA CUDA Toolkit installation (unsupported architecture or GPU disabled)"; \
|
||||
fi
|
||||
RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETPLATFORM" = "linux/amd64" ] ; then \
|
||||
apt-get update && apt-get install -y --no-install-recommends \
|
||||
nvidia-cuda-toolkit \
|
||||
&& rm -rf /var/lib/apt/lists/* ; \
|
||||
else \
|
||||
echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
|
||||
fi
|
||||
|
||||
# Create and set working directory
|
||||
WORKDIR /app
|
||||
@@ -120,7 +123,11 @@ RUN pip install --no-cache-dir \
|
||||
RUN mkdocs build
|
||||
|
||||
# Install Playwright and browsers
|
||||
RUN playwright install
|
||||
RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
|
||||
playwright install chromium; \
|
||||
elif [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
|
||||
playwright install chromium; \
|
||||
fi
|
||||
|
||||
# Expose port
|
||||
EXPOSE 8000 11235 9222 8080
|
||||
|
||||
@@ -1 +1,2 @@
|
||||
include requirements.txt
|
||||
include requirements.txt
|
||||
recursive-include crawl4ai/js_snippet *.js
|
||||
688
README.md
688
README.md
@@ -1,27 +1,148 @@
|
||||
# 🔥🕷️ Crawl4AI: LLM Friendly Web Crawler & Scraper
|
||||
# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
|
||||
|
||||
<div align="center">
|
||||
|
||||
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
|
||||
|
||||
[](https://github.com/unclecode/crawl4ai/stargazers)
|
||||

|
||||
[](https://github.com/unclecode/crawl4ai/network/members)
|
||||
[](https://github.com/unclecode/crawl4ai/issues)
|
||||
[](https://github.com/unclecode/crawl4ai/pulls)
|
||||
|
||||
[](https://badge.fury.io/py/crawl4ai)
|
||||
[](https://pypi.org/project/crawl4ai/)
|
||||
[](https://pepy.tech/project/crawl4ai)
|
||||
|
||||
[](https://crawl4ai.readthedocs.io/)
|
||||
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
|
||||
[](https://github.com/psf/black)
|
||||
[](https://github.com/PyCQA/bandit)
|
||||
|
||||
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
|
||||
</div>
|
||||
|
||||
## New in 0.3.74 ✨
|
||||
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
|
||||
|
||||
- 🚀 **Blazing Fast Scraping:** The scraping process is now significantly faster, often completing in under 100 milliseconds (excluding web fetch time)!
|
||||
- 📥 **Download Manager:** Integrated file crawling and downloading capabilities, with full control over file management and tracking within the `CrawlResult` object.
|
||||
- 🔎 **Markdown Filter:** Enhanced content extraction using BM25 algorithm to create cleaner markdown with only relevant webpage content.
|
||||
- 🗂️ **Local & Raw HTML:** Crawl local files (`file://`) and raw HTML strings (`raw:`) directly.
|
||||
- 🤖 **Browser Control:** Use your own browser setup for crawling, with persistent contexts and stealth integration to bypass anti-bot measures.
|
||||
- ☁️ **API & Cache Boost:** CORS support, static file serving, and a new filesystem-based cache for blazing-fast performance. Fine-tune caching with the `CacheMode` enum (ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, BYPASS) and the `always_bypass_cache` parameter.
|
||||
- 🐳 **API Gateway:** Run Crawl4AI as a local or cloud API service, enabling cross-platform usage through a containerized server with secure token authentication via `CRAWL4AI_API_TOKEN`.
|
||||
- 🛠️ **Database Improvements:** Enhanced database system for handling larger content sets with improved caching and faster performance.
|
||||
- 🐛 **Squashed Bugs:** Fixed browser context issues in Docker, memory leaks, enhanced error handling, and improved HTML parsing.
|
||||
[✨ Check out latest update v0.4.24](#-recent-updates)
|
||||
|
||||
🎉 **Version 0.4.24 is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
|
||||
|
||||
## 🧐 Why Crawl4AI?
|
||||
|
||||
1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
|
||||
2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.
|
||||
3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.
|
||||
4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
|
||||
5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.
|
||||
6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
|
||||
|
||||
## 🚀 Quick Start
|
||||
|
||||
1. Install Crawl4AI:
|
||||
```bash
|
||||
# Install the package
|
||||
pip install crawl4ai
|
||||
|
||||
# Run post-installation setup
|
||||
crawl4ai-setup
|
||||
|
||||
# Verify your installation
|
||||
crawl4ai-doctor
|
||||
```
|
||||
|
||||
If you encounter any browser-related issues, you can install them manually:
|
||||
```bash
|
||||
python -m playwright install --with-deps chromium
|
||||
```
|
||||
|
||||
2. Run a simple web crawl:
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import *
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
)
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
## ✨ Features
|
||||
|
||||
<details>
|
||||
<summary>📝 <strong>Markdown Generation</strong></summary>
|
||||
|
||||
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
|
||||
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
|
||||
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
|
||||
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
|
||||
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>📊 <strong>Structured Data Extraction</strong></summary>
|
||||
|
||||
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
|
||||
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
|
||||
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
|
||||
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
|
||||
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🌐 <strong>Browser Integration</strong></summary>
|
||||
|
||||
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
|
||||
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
|
||||
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
|
||||
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
|
||||
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
|
||||
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
|
||||
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🔎 <strong>Crawling & Scraping</strong></summary>
|
||||
|
||||
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
|
||||
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
|
||||
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
|
||||
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
|
||||
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
|
||||
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
|
||||
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
|
||||
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
|
||||
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
|
||||
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
|
||||
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🚀 <strong>Deployment</strong></summary>
|
||||
|
||||
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
|
||||
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
|
||||
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
|
||||
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🎯 <strong>Additional Features</strong></summary>
|
||||
|
||||
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
|
||||
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
|
||||
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
|
||||
- 🛡️ **Error Handling**: Robust error management for seamless execution.
|
||||
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
|
||||
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
|
||||
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
|
||||
|
||||
</details>
|
||||
|
||||
## Try it Now!
|
||||
|
||||
@@ -29,53 +150,27 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
|
||||
|
||||
✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
|
||||
|
||||
## Features ✨
|
||||
|
||||
- 🆓 Completely free and open-source
|
||||
- 🚀 Blazing fast performance, outperforming many paid services
|
||||
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
|
||||
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- 🌍 Supports crawling multiple URLs simultaneously
|
||||
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
|
||||
- 🔗 Extracts all external and internal links
|
||||
- 📚 Extracts metadata from the page
|
||||
- 🔄 Custom hooks for authentication, headers, and page modifications
|
||||
- 🕵️ User-agent customization
|
||||
- 🖼️ Takes screenshots of pages with enhanced error handling
|
||||
- 📜 Executes multiple custom JavaScripts before crawling
|
||||
- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
|
||||
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
|
||||
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
|
||||
- 🎯 CSS selector support for precise data extraction
|
||||
- 📝 Passes instructions/keywords to refine extraction
|
||||
- 🔒 Proxy support with authentication for enhanced access
|
||||
- 🔄 Session management for complex multi-page crawling
|
||||
- 🌐 Asynchronous architecture for improved performance
|
||||
- 🖼️ Improved image processing with lazy-loading detection
|
||||
- 🕰️ Enhanced handling of delayed content loading
|
||||
- 🔑 Custom headers support for LLM interactions
|
||||
- 🖼️ iframe content extraction for comprehensive analysis
|
||||
- ⏱️ Flexible timeout and delayed content retrieval options
|
||||
|
||||
## Installation 🛠️
|
||||
|
||||
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
|
||||
|
||||
### Using pip 🐍
|
||||
<details>
|
||||
<summary>🐍 <strong>Using pip</strong></summary>
|
||||
|
||||
Choose the installation option that best fits your needs:
|
||||
|
||||
#### Basic Installation
|
||||
### Basic Installation
|
||||
|
||||
For basic web crawling and scraping tasks:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup # Setup the browser
|
||||
```
|
||||
|
||||
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
|
||||
|
||||
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
|
||||
👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
|
||||
|
||||
1. Through the command line:
|
||||
|
||||
@@ -91,74 +186,70 @@ By default, this will install the asynchronous version of Crawl4AI, using Playwr
|
||||
|
||||
This second method has proven to be more reliable in some cases.
|
||||
|
||||
#### Installation with Synchronous Version
|
||||
---
|
||||
|
||||
If you need the synchronous version using Selenium:
|
||||
### Installation with Synchronous Version
|
||||
|
||||
The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai[sync]
|
||||
```
|
||||
|
||||
#### Development Installation
|
||||
---
|
||||
|
||||
### Development Installation
|
||||
|
||||
For contributors who plan to modify the source code:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
pip install -e .
|
||||
pip install -e . # Basic installation in editable mode
|
||||
```
|
||||
|
||||
## One-Click Deployment 🚀
|
||||
|
||||
Deploy your own instance of Crawl4AI with one click:
|
||||
|
||||
[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
|
||||
|
||||
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
|
||||
|
||||
The deploy will:
|
||||
- Set up a Docker container with Crawl4AI
|
||||
- Configure Playwright and all dependencies
|
||||
- Start the FastAPI server on port 11235
|
||||
- Set up health checks and auto-deployment
|
||||
|
||||
### Using Docker 🐳
|
||||
|
||||
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
|
||||
|
||||
#### Option 1: Docker Hub (Recommended)
|
||||
Install optional features:
|
||||
|
||||
```bash
|
||||
# Pull and run from Docker Hub (choose one):
|
||||
docker pull unclecode/crawl4ai:basic # Basic crawling features
|
||||
docker pull unclecode/crawl4ai:all # Full installation (ML, LLM support)
|
||||
docker pull unclecode/crawl4ai:gpu # GPU-enabled version
|
||||
|
||||
# Run the container
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic # Replace 'basic' with your chosen version
|
||||
|
||||
# In case to allocate more shared memory for the container
|
||||
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic
|
||||
pip install -e ".[torch]" # With PyTorch features
|
||||
pip install -e ".[transformer]" # With Transformer features
|
||||
pip install -e ".[cosine]" # With cosine similarity features
|
||||
pip install -e ".[sync]" # With synchronous crawling (Selenium)
|
||||
pip install -e ".[all]" # Install all optional features
|
||||
```
|
||||
|
||||
#### Option 2: Build from Repository
|
||||
</details>
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
<details>
|
||||
<summary>🐳 <strong>Docker Deployment</strong></summary>
|
||||
|
||||
# Build the image
|
||||
docker build -t crawl4ai:local \
|
||||
--build-arg INSTALL_TYPE=basic \ # Options: basic, all
|
||||
.
|
||||
> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
|
||||
|
||||
# Run your local build
|
||||
docker run -p 11235:11235 crawl4ai:local
|
||||
```
|
||||
### Current Docker Support
|
||||
|
||||
The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
|
||||
|
||||
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
|
||||
- ⚠️ Note: This setup will be replaced in the next major release
|
||||
|
||||
### What's Coming Next?
|
||||
|
||||
Our new Docker implementation will bring:
|
||||
- Improved performance and resource efficiency
|
||||
- Streamlined deployment process
|
||||
- Better integration with Crawl4AI features
|
||||
- Enhanced scalability options
|
||||
|
||||
Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
### Quick Test
|
||||
|
||||
Run a quick test (works for both Docker options):
|
||||
|
||||
Quick test (works for both options):
|
||||
```python
|
||||
import requests
|
||||
|
||||
@@ -169,149 +260,138 @@ response = requests.post(
|
||||
)
|
||||
task_id = response.json()["task_id"]
|
||||
|
||||
# Get results
|
||||
# Continue polling until the task is complete (status="completed")
|
||||
result = requests.get(f"http://localhost:11235/task/{task_id}")
|
||||
```
|
||||
|
||||
For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
|
||||
For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
## Quick Start 🚀
|
||||
## 🔬 Advanced Usage Examples 🔬
|
||||
|
||||
You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
|
||||
|
||||
<details>
|
||||
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
## Advanced Usage 🔬
|
||||
|
||||
### Executing JavaScript and Using CSS Selectors
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
verbose=True,
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
|
||||
# ),
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True
|
||||
url="https://docs.micronaut.io/4.7.6/guide/",
|
||||
config=run_config
|
||||
)
|
||||
print(result.extracted_content)
|
||||
print(len(result.markdown))
|
||||
print(len(result.fit_markdown))
|
||||
print(len(result.markdown_v2.fit_markdown))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Using a Proxy
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
bypass_cache=True
|
||||
)
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Extracting Structured Data without LLM
|
||||
|
||||
The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
import json
|
||||
|
||||
async def extract_news_teasers():
|
||||
async def main():
|
||||
schema = {
|
||||
"name": "News Teaser Extractor",
|
||||
"baseSelector": ".wide-tease-item__wrapper",
|
||||
"fields": [
|
||||
{
|
||||
"name": "category",
|
||||
"selector": ".unibrow span[data-testid='unibrow-text']",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "headline",
|
||||
"selector": ".wide-tease-item__headline",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "summary",
|
||||
"selector": ".wide-tease-item__description",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "time",
|
||||
"selector": "[data-testid='wide-tease-date']",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "image",
|
||||
"type": "nested",
|
||||
"selector": "picture.teasePicture img",
|
||||
"fields": [
|
||||
{"name": "src", "type": "attribute", "attribute": "src"},
|
||||
{"name": "alt", "type": "attribute", "attribute": "alt"},
|
||||
],
|
||||
},
|
||||
{
|
||||
"name": "link",
|
||||
"selector": "a[href]",
|
||||
"type": "attribute",
|
||||
"attribute": "href",
|
||||
},
|
||||
],
|
||||
"name": "KidoCode Courses",
|
||||
"baseSelector": "section.charge-methodology .w-tab-content > div",
|
||||
"fields": [
|
||||
{
|
||||
"name": "section_title",
|
||||
"selector": "h3.heading-50",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "section_description",
|
||||
"selector": ".charge-content",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_name",
|
||||
"selector": ".text-block-93",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_description",
|
||||
"selector": ".course-content-text",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_icon",
|
||||
"selector": ".image-92",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
browser_config = BrowserConfig(
|
||||
headless=False,
|
||||
verbose=True
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True,
|
||||
url="https://www.kidocode.com/degrees/technology",
|
||||
config=run_config
|
||||
)
|
||||
|
||||
assert result.success, "Failed to crawl the page"
|
||||
companies = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(companies)} companies")
|
||||
print(json.dumps(companies[0], indent=2))
|
||||
|
||||
news_teasers = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(news_teasers)} news teasers")
|
||||
print(json.dumps(news_teasers[0], indent=2))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(extract_news_teasers())
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/extraction/css-advanced/) section in the documentation.
|
||||
</details>
|
||||
|
||||
### Extracting Structured Data with OpenAI
|
||||
<details>
|
||||
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
|
||||
|
||||
```python
|
||||
import os
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
@@ -321,19 +401,26 @@ class OpenAIModelFee(BaseModel):
|
||||
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
run_config = CrawlerRunConfig(
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
# Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
|
||||
# provider="ollama/qwen2", api_token="no-token",
|
||||
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url='https://openai.com/api/pricing/',
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
|
||||
),
|
||||
bypass_cache=True,
|
||||
config=run_config
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
@@ -341,143 +428,95 @@ if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Session Management and Dynamic Content Crawling
|
||||
</details>
|
||||
|
||||
Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
|
||||
<details>
|
||||
<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import re
|
||||
from bs4 import BeautifulSoup
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
import os, sys
|
||||
from pathlib import Path
|
||||
import asyncio, time
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def crawl_typescript_commits():
|
||||
first_commit = ""
|
||||
async def on_execution_started(page):
|
||||
nonlocal first_commit
|
||||
try:
|
||||
while True:
|
||||
await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
|
||||
commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
|
||||
commit = await commit.evaluate('(element) => element.textContent')
|
||||
commit = re.sub(r'\s+', '', commit)
|
||||
if commit and commit != first_commit:
|
||||
first_commit = commit
|
||||
break
|
||||
await asyncio.sleep(0.5)
|
||||
except Exception as e:
|
||||
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
|
||||
async def test_news_crawl():
|
||||
# Create a persistent user data directory
|
||||
user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
|
||||
os.makedirs(user_data_dir, exist_ok=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
|
||||
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
for page in range(3): # Crawl 3 pages
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.Box-sc-g0xbh4-0",
|
||||
js=js_next_page if page > 0 else None,
|
||||
bypass_cache=True,
|
||||
js_only=page > 0
|
||||
)
|
||||
|
||||
assert result.success, f"Failed to crawl page {page + 1}"
|
||||
|
||||
soup = BeautifulSoup(result.cleaned_html, 'html.parser')
|
||||
commits = soup.select("li")
|
||||
all_commits.extend(commits)
|
||||
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(crawl_typescript_commits())
|
||||
browser_config = BrowserConfig(
|
||||
verbose=True,
|
||||
headless=True,
|
||||
user_data_dir=user_data_dir,
|
||||
use_persistent_context=True,
|
||||
)
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
|
||||
|
||||
result = await crawler.arun(
|
||||
url,
|
||||
config=run_config,
|
||||
magic=True,
|
||||
)
|
||||
|
||||
print(f"Successfully crawled {url}")
|
||||
print(f"Content length: {len(result.markdown)}")
|
||||
```
|
||||
|
||||
This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
|
||||
|
||||
For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.
|
||||
</details>
|
||||
|
||||
|
||||
## Speed Comparison 🚀
|
||||
## ✨ Recent Updates
|
||||
|
||||
Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
|
||||
- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
|
||||
- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
|
||||
- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction
|
||||
- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
|
||||
- ⚡ **Performance Boost**: Optimized caching, parallel processing, and memory management
|
||||
- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
|
||||
- 🔐 **Security Features**: Improved input validation and safe expression evaluation
|
||||
|
||||
We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
|
||||
Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
|
||||
|
||||
```bash
|
||||
Firecrawl:
|
||||
Time taken: 7.02 seconds
|
||||
Content length: 42074 characters
|
||||
Images found: 49
|
||||
## 📖 Documentation & Roadmap
|
||||
|
||||
Crawl4AI (simple crawl):
|
||||
Time taken: 1.60 seconds
|
||||
Content length: 18238 characters
|
||||
Images found: 49
|
||||
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
|
||||
|
||||
Crawl4AI (with JavaScript execution):
|
||||
Time taken: 4.64 seconds
|
||||
Content length: 40869 characters
|
||||
Images found: 89
|
||||
```
|
||||
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
|
||||
|
||||
As you can see, Crawl4AI outperforms Firecrawl significantly:
|
||||
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
|
||||
|
||||
- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
|
||||
- With JavaScript execution: Even when executing JavaScript to load more content (doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.
|
||||
<details>
|
||||
<summary>📈 <strong>Development TODOs</strong></summary>
|
||||
|
||||
You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.
|
||||
|
||||
## Documentation 📚
|
||||
|
||||
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
|
||||
|
||||
## Crawl4AI Roadmap 🗺️
|
||||
|
||||
For detailed information on our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
|
||||
|
||||
### Advanced Crawling Systems 🔧
|
||||
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
|
||||
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
|
||||
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
|
||||
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
|
||||
|
||||
### Specialized Features 🛠️
|
||||
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
|
||||
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
|
||||
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
|
||||
|
||||
### Development Tools 🔨
|
||||
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
|
||||
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
|
||||
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
|
||||
|
||||
### Community & Growth 🌱
|
||||
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
|
||||
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
|
||||
|
||||
## Contributing 🤝
|
||||
</details>
|
||||
|
||||
## 🤝 Contributing
|
||||
|
||||
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
|
||||
|
||||
## License 📄
|
||||
## 📄 License
|
||||
|
||||
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
|
||||
|
||||
## Contact 📧
|
||||
## 📧 Contact
|
||||
|
||||
For questions, suggestions, or feedback, feel free to reach out:
|
||||
|
||||
@@ -487,33 +526,30 @@ For questions, suggestions, or feedback, feel free to reach out:
|
||||
|
||||
Happy Crawling! 🕸️🚀
|
||||
|
||||
## 🗾 Mission
|
||||
|
||||
# Mission
|
||||
Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.
|
||||
|
||||
Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data remains largely uncapitalized as a true asset.
|
||||
We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
|
||||
|
||||
Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradeable assets, we're creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
|
||||
<details>
|
||||
<summary>🔑 <strong>Key Opportunities</strong></summary>
|
||||
|
||||
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
|
||||
- **Authentic AI Data**: Provide AI systems with real human insights.
|
||||
- **Shared Economy**: Create a fair data marketplace that benefits data creators.
|
||||
|
||||
This democratization of data represents the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we're building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
|
||||
</details>
|
||||
|
||||

|
||||
<details>
|
||||
<summary>🚀 <strong>Development Pathway</strong></summary>
|
||||
|
||||
For a detailed exploration of our vision, opportunities, and pathway forward, please see our [full mission statement](./MISSION.md).
|
||||
|
||||
## Key Opportunities
|
||||
|
||||
- **Data Capitalization**: Transform digital footprints into valuable assets that can appear on personal and enterprise balance sheets
|
||||
- **Authentic Data**: Unlock the vast reservoir of real human insights and knowledge for AI advancement
|
||||
- **Shared Economy**: Create new value streams where data creators directly benefit from their contributions
|
||||
|
||||
## Development Pathway
|
||||
|
||||
1. **Open-Source Foundation**: Building transparent, community-driven data extraction tools
|
||||
2. **Data Capitalization Platform**: Creating tools to structure and value digital assets
|
||||
3. **Shared Data Marketplace**: Establishing an economic platform for ethical data exchange
|
||||
|
||||
For a detailed exploration of our vision, challenges, and solutions, please see our [full mission statement](./MISSION.md).
|
||||
1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
|
||||
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
|
||||
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
|
||||
|
||||
For more details, see our [full mission statement](./MISSION.md).
|
||||
</details>
|
||||
|
||||
## Star History
|
||||
|
||||
|
||||
244
README.sync.md
244
README.sync.md
@@ -1,244 +0,0 @@
|
||||
# Crawl4AI v0.2.77 🕷️🤖
|
||||
|
||||
[](https://github.com/unclecode/crawl4ai/stargazers)
|
||||
[](https://github.com/unclecode/crawl4ai/network/members)
|
||||
[](https://github.com/unclecode/crawl4ai/issues)
|
||||
[](https://github.com/unclecode/crawl4ai/pulls)
|
||||
[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
|
||||
|
||||
Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
|
||||
|
||||
#### [v0.2.77] - 2024-08-02
|
||||
|
||||
Major improvements in functionality, performance, and cross-platform compatibility! 🚀
|
||||
|
||||
- 🐳 **Docker enhancements**:
|
||||
- Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
|
||||
- 🌐 **Official Docker Hub image**:
|
||||
- Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
|
||||
- 🔧 **Selenium upgrade**:
|
||||
- Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
|
||||
- 🖼️ **Image description**:
|
||||
- Implemented ability to generate textual descriptions for extracted images from web pages.
|
||||
- ⚡ **Performance boost**:
|
||||
- Various improvements to enhance overall speed and performance.
|
||||
|
||||
## Try it Now!
|
||||
|
||||
✨ Play around with this [](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
|
||||
|
||||
✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
|
||||
|
||||
✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
|
||||
|
||||
## Features ✨
|
||||
|
||||
- 🆓 Completely free and open-source
|
||||
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
|
||||
- 🌍 Supports crawling multiple URLs simultaneously
|
||||
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
|
||||
- 🔗 Extracts all external and internal links
|
||||
- 📚 Extracts metadata from the page
|
||||
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
|
||||
- 🕵️ User-agent customization
|
||||
- 🖼️ Takes screenshots of the page
|
||||
- 📜 Executes multiple custom JavaScripts before crawling
|
||||
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
|
||||
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
|
||||
- 🎯 CSS selector support
|
||||
- 📝 Passes instructions/keywords to refine extraction
|
||||
|
||||
# Crawl4AI
|
||||
|
||||
## 🌟 Shoutout to Contributors of v0.2.77!
|
||||
|
||||
A big thank you to the amazing contributors who've made this release possible:
|
||||
|
||||
- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
|
||||
- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
|
||||
- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
|
||||
|
||||
Your contributions are driving Crawl4AI forward! 🚀
|
||||
|
||||
## Cool Examples 🚀
|
||||
|
||||
### Quick Start
|
||||
|
||||
```python
|
||||
from crawl4ai import WebCrawler
|
||||
|
||||
# Create an instance of WebCrawler
|
||||
crawler = WebCrawler()
|
||||
|
||||
# Warm up the crawler (load necessary models)
|
||||
crawler.warmup()
|
||||
|
||||
# Run the crawler on a URL
|
||||
result = crawler.run(url="https://www.nbcnews.com/business")
|
||||
|
||||
# Print the extracted content
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
## How to install 🛠
|
||||
|
||||
### Using pip 🐍
|
||||
```bash
|
||||
virtualenv venv
|
||||
source venv/bin/activate
|
||||
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
|
||||
```
|
||||
|
||||
### Using Docker 🐳
|
||||
|
||||
```bash
|
||||
# For Mac users (M1/M2)
|
||||
# docker build --platform linux/amd64 -t crawl4ai .
|
||||
docker build -t crawl4ai .
|
||||
docker run -d -p 8000:80 crawl4ai
|
||||
```
|
||||
|
||||
### Using Docker Hub 🐳
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:latest
|
||||
docker run -d -p 8000:80 unclecode/crawl4ai:latest
|
||||
```
|
||||
|
||||
|
||||
## Speed-First Design 🚀
|
||||
|
||||
Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
|
||||
|
||||
```python
|
||||
import time
|
||||
from crawl4ai.web_crawler import WebCrawler
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
|
||||
start = time.time()
|
||||
url = r"https://www.nbcnews.com/business"
|
||||
result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
|
||||
end = time.time()
|
||||
print(f"Time taken: {end - start}")
|
||||
```
|
||||
|
||||
Let's take a look the calculated time for the above code snippet:
|
||||
|
||||
```bash
|
||||
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
|
||||
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
|
||||
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
|
||||
Time taken: 1.439958095550537
|
||||
```
|
||||
Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
|
||||
|
||||
### Extract Structured Data from Web Pages 📊
|
||||
|
||||
Crawl all OpenAI models and their fees from the official page.
|
||||
|
||||
```python
|
||||
import os
|
||||
from crawl4ai import WebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(..., description="Fee for output token ßfor the OpenAI model.")
|
||||
|
||||
url = 'https://openai.com/api/pricing/'
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
|
||||
result = crawler.run(
|
||||
url=url,
|
||||
word_count_threshold=1,
|
||||
extraction_strategy= LLMExtractionStrategy(
|
||||
provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
print(result.extracted_content)
|
||||
```
|
||||
|
||||
### Execute JS, Filter Data with CSS Selector, and Clustering
|
||||
|
||||
```python
|
||||
from crawl4ai import WebCrawler
|
||||
from crawl4ai.chunking_strategy import CosineStrategy
|
||||
|
||||
js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
|
||||
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js=js_code,
|
||||
css_selector="p",
|
||||
extraction_strategy=CosineStrategy(semantic_filter="technology")
|
||||
)
|
||||
|
||||
print(result.extracted_content)
|
||||
```
|
||||
|
||||
### Extract Structured Data from Web Pages With Proxy and BaseUrl
|
||||
|
||||
```python
|
||||
from crawl4ai import WebCrawler
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
def create_crawler():
|
||||
crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
|
||||
crawler.warmup()
|
||||
return crawler
|
||||
|
||||
crawler = create_crawler()
|
||||
|
||||
crawler.warmup()
|
||||
|
||||
result = crawler.run(
|
||||
url="https://www.nbcnews.com/business",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token="sk-",
|
||||
base_url="https://api.openai.com/v1"
|
||||
)
|
||||
)
|
||||
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
## Documentation 📚
|
||||
|
||||
For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
|
||||
|
||||
## Contributing 🤝
|
||||
|
||||
We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
|
||||
|
||||
## License 📄
|
||||
|
||||
Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
|
||||
|
||||
## Contact 📧
|
||||
|
||||
For questions, suggestions, or feedback, feel free to reach out:
|
||||
|
||||
- GitHub: [unclecode](https://github.com/unclecode)
|
||||
- Twitter: [@unclecode](https://twitter.com/unclecode)
|
||||
- Website: [crawl4ai.com](https://crawl4ai.com)
|
||||
|
||||
Happy Crawling! 🕸️🚀
|
||||
|
||||
## Star History
|
||||
|
||||
[](https://star-history.com/#unclecode/crawl4ai&Date)
|
||||
@@ -1,14 +1,29 @@
|
||||
# __init__.py
|
||||
|
||||
from .async_webcrawler import AsyncWebCrawler, CacheMode
|
||||
from .async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
|
||||
from .chunking_strategy import ChunkingStrategy, RegexChunking
|
||||
from .markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
|
||||
from .models import CrawlResult
|
||||
from .__version__ import __version__
|
||||
# __version__ = "0.3.73"
|
||||
|
||||
__all__ = [
|
||||
"AsyncWebCrawler",
|
||||
"CrawlResult",
|
||||
"CacheMode",
|
||||
'BrowserConfig',
|
||||
'CrawlerRunConfig',
|
||||
'ExtractionStrategy',
|
||||
'LLMExtractionStrategy',
|
||||
'CosineStrategy',
|
||||
'JsonCssExtractionStrategy',
|
||||
'ChunkingStrategy',
|
||||
'RegexChunking',
|
||||
'DefaultMarkdownGenerator',
|
||||
'PruningContentFilter',
|
||||
'BM25ContentFilter',
|
||||
]
|
||||
|
||||
def is_sync_version_installed():
|
||||
|
||||
@@ -1,2 +1,2 @@
|
||||
# crawl4ai/_version.py
|
||||
__version__ = "0.3.74"
|
||||
__version__ = "0.4.241"
|
||||
|
||||
605
crawl4ai/async_configs.py
Normal file
605
crawl4ai/async_configs.py
Normal file
@@ -0,0 +1,605 @@
|
||||
from .config import (
|
||||
MIN_WORD_THRESHOLD,
|
||||
IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||
SCREENSHOT_HEIGHT_TRESHOLD,
|
||||
PAGE_TIMEOUT,
|
||||
IMAGE_SCORE_THRESHOLD,
|
||||
SOCIAL_MEDIA_DOMAINS,
|
||||
|
||||
)
|
||||
from .user_agent_generator import UserAgentGenerator
|
||||
from .extraction_strategy import ExtractionStrategy
|
||||
from .chunking_strategy import ChunkingStrategy
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy
|
||||
from typing import Union, List
|
||||
|
||||
|
||||
class BrowserConfig:
|
||||
"""
|
||||
Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
|
||||
|
||||
This class centralizes all parameters that affect browser and context creation. Instead of passing
|
||||
scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
|
||||
code will then reference these settings to initialize the browser in a consistent, documented manner.
|
||||
|
||||
Attributes:
|
||||
browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
|
||||
Default: "chromium".
|
||||
headless (bool): Whether to run the browser in headless mode (no visible GUI).
|
||||
Default: True.
|
||||
use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
|
||||
advanced manipulation. Default: False.
|
||||
debugging_port (int): Port for the browser debugging protocol. Default: 9222.
|
||||
use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
|
||||
Automatically sets use_managed_browser=True. Default: False.
|
||||
user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
|
||||
temporary directory may be used. Default: None.
|
||||
chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
|
||||
is "chromium". Default: "chrome".
|
||||
proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
|
||||
Default: None.
|
||||
proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
|
||||
If None, no additional proxy config. Default: None.
|
||||
viewport_width (int): Default viewport width for pages. Default: 1080.
|
||||
viewport_height (int): Default viewport height for pages. Default: 600.
|
||||
verbose (bool): Enable verbose logging.
|
||||
Default: True.
|
||||
accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
|
||||
Default: False.
|
||||
downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
|
||||
a default path will be created. Default: None.
|
||||
storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
|
||||
Default: None.
|
||||
ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
|
||||
java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
|
||||
cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
|
||||
{"name": "...", "value": "...", "url": "..."}.
|
||||
Default: [].
|
||||
headers (dict): Extra HTTP headers to apply to all requests in this context.
|
||||
Default: {}.
|
||||
user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
|
||||
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36".
|
||||
user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
|
||||
user_agent as-is. Default: None.
|
||||
user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
|
||||
Default: None.
|
||||
text_mode (bool): If True, disables images and other rich content for potentially faster load times.
|
||||
Default: False.
|
||||
light_mode (bool): Disables certain background features for performance gains. Default: False.
|
||||
extra_args (list): Additional command-line arguments passed to the browser.
|
||||
Default: [].
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
browser_type: str = "chromium",
|
||||
headless: bool = True,
|
||||
use_managed_browser: bool = False,
|
||||
use_persistent_context: bool = False,
|
||||
user_data_dir: str = None,
|
||||
chrome_channel: str = "chrome",
|
||||
proxy: str = None,
|
||||
proxy_config: dict = None,
|
||||
viewport_width: int = 1080,
|
||||
viewport_height: int = 600,
|
||||
accept_downloads: bool = False,
|
||||
downloads_path: str = None,
|
||||
storage_state=None,
|
||||
ignore_https_errors: bool = True,
|
||||
java_script_enabled: bool = True,
|
||||
sleep_on_close: bool = False,
|
||||
verbose: bool = True,
|
||||
cookies: list = None,
|
||||
headers: dict = None,
|
||||
user_agent: str = (
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
|
||||
),
|
||||
user_agent_mode: str = None,
|
||||
user_agent_generator_config: dict = None,
|
||||
text_mode: bool = False,
|
||||
light_mode: bool = False,
|
||||
extra_args: list = None,
|
||||
debugging_port : int = 9222,
|
||||
):
|
||||
self.browser_type = browser_type
|
||||
self.headless = headless
|
||||
self.use_managed_browser = use_managed_browser
|
||||
self.use_persistent_context = use_persistent_context
|
||||
self.user_data_dir = user_data_dir
|
||||
if self.browser_type == "chromium":
|
||||
self.chrome_channel = "chrome"
|
||||
elif self.browser_type == "firefox":
|
||||
self.chrome_channel = "firefox"
|
||||
elif self.browser_type == "webkit":
|
||||
self.chrome_channel = "webkit"
|
||||
else:
|
||||
self.chrome_channel = chrome_channel or "chrome"
|
||||
self.proxy = proxy
|
||||
self.proxy_config = proxy_config
|
||||
self.viewport_width = viewport_width
|
||||
self.viewport_height = viewport_height
|
||||
self.accept_downloads = accept_downloads
|
||||
self.downloads_path = downloads_path
|
||||
self.storage_state = storage_state
|
||||
self.ignore_https_errors = ignore_https_errors
|
||||
self.java_script_enabled = java_script_enabled
|
||||
self.cookies = cookies if cookies is not None else []
|
||||
self.headers = headers if headers is not None else {}
|
||||
self.user_agent = user_agent
|
||||
self.user_agent_mode = user_agent_mode
|
||||
self.user_agent_generator_config = user_agent_generator_config
|
||||
self.text_mode = text_mode
|
||||
self.light_mode = light_mode
|
||||
self.extra_args = extra_args if extra_args is not None else []
|
||||
self.sleep_on_close = sleep_on_close
|
||||
self.verbose = verbose
|
||||
self.debugging_port = debugging_port
|
||||
|
||||
user_agenr_generator = UserAgentGenerator()
|
||||
if self.user_agent_mode != "random" and self.user_agent_generator_config:
|
||||
self.user_agent = user_agenr_generator.generate(
|
||||
**(self.user_agent_generator_config or {})
|
||||
)
|
||||
elif self.user_agent_mode == "random":
|
||||
self.user_agent = user_agenr_generator.generate()
|
||||
else:
|
||||
pass
|
||||
|
||||
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
|
||||
self.headers.setdefault("sec-ch-ua", self.browser_hint)
|
||||
|
||||
# If persistent context is requested, ensure managed browser is enabled
|
||||
if self.use_persistent_context:
|
||||
self.use_managed_browser = True
|
||||
|
||||
@staticmethod
|
||||
def from_kwargs(kwargs: dict) -> "BrowserConfig":
|
||||
return BrowserConfig(
|
||||
browser_type=kwargs.get("browser_type", "chromium"),
|
||||
headless=kwargs.get("headless", True),
|
||||
use_managed_browser=kwargs.get("use_managed_browser", False),
|
||||
use_persistent_context=kwargs.get("use_persistent_context", False),
|
||||
user_data_dir=kwargs.get("user_data_dir"),
|
||||
chrome_channel=kwargs.get("chrome_channel", "chrome"),
|
||||
proxy=kwargs.get("proxy"),
|
||||
proxy_config=kwargs.get("proxy_config"),
|
||||
viewport_width=kwargs.get("viewport_width", 1080),
|
||||
viewport_height=kwargs.get("viewport_height", 600),
|
||||
accept_downloads=kwargs.get("accept_downloads", False),
|
||||
downloads_path=kwargs.get("downloads_path"),
|
||||
storage_state=kwargs.get("storage_state"),
|
||||
ignore_https_errors=kwargs.get("ignore_https_errors", True),
|
||||
java_script_enabled=kwargs.get("java_script_enabled", True),
|
||||
cookies=kwargs.get("cookies", []),
|
||||
headers=kwargs.get("headers", {}),
|
||||
user_agent=kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
|
||||
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
|
||||
),
|
||||
user_agent_mode=kwargs.get("user_agent_mode"),
|
||||
user_agent_generator_config=kwargs.get("user_agent_generator_config"),
|
||||
text_mode=kwargs.get("text_mode", False),
|
||||
light_mode=kwargs.get("light_mode", False),
|
||||
extra_args=kwargs.get("extra_args", []),
|
||||
)
|
||||
|
||||
|
||||
class CrawlerRunConfig:
|
||||
"""
|
||||
Configuration class for controlling how the crawler runs each crawl operation.
|
||||
This includes parameters for content extraction, page manipulation, waiting conditions,
|
||||
caching, and other runtime behaviors.
|
||||
|
||||
This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
|
||||
By using this class, you have a single place to understand and adjust the crawling options.
|
||||
|
||||
Attributes:
|
||||
# Content Processing Parameters
|
||||
word_count_threshold (int): Minimum word count threshold before processing content.
|
||||
Default: MIN_WORD_THRESHOLD (typically 200).
|
||||
extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
|
||||
Default: None (NoExtractionStrategy is used if None).
|
||||
chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
|
||||
Default: RegexChunking().
|
||||
markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
|
||||
Default: None.
|
||||
content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
|
||||
Default: None.
|
||||
only_text (bool): If True, attempt to extract text-only content where applicable.
|
||||
Default: False.
|
||||
css_selector (str or None): CSS selector to extract a specific portion of the page.
|
||||
Default: None.
|
||||
excluded_tags (list of str or None): List of HTML tags to exclude from processing.
|
||||
Default: None.
|
||||
excluded_selector (str or None): CSS selector to exclude from processing.
|
||||
Default: None.
|
||||
keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
|
||||
Default: False.
|
||||
remove_forms (bool): If True, remove all `<form>` elements from the HTML.
|
||||
Default: False.
|
||||
prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
|
||||
Default: False.
|
||||
parser_type (str): Type of parser to use for HTML parsing.
|
||||
Default: "lxml".
|
||||
|
||||
# Caching Parameters
|
||||
cache_mode (CacheMode or None): Defines how caching is handled.
|
||||
If None, defaults to CacheMode.ENABLED internally.
|
||||
Default: None.
|
||||
session_id (str or None): Optional session ID to persist the browser context and the created
|
||||
page instance. If the ID already exists, the crawler does not
|
||||
create a new page and uses the current page to preserve the state.
|
||||
bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
|
||||
Default: False.
|
||||
disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
|
||||
Default: False.
|
||||
no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
|
||||
Default: False.
|
||||
no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
|
||||
Default: False.
|
||||
|
||||
# Page Navigation and Timing Parameters
|
||||
wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
|
||||
Default: "domcontentloaded".
|
||||
page_timeout (int): Timeout in ms for page operations like navigation.
|
||||
Default: 60000 (60 seconds).
|
||||
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
|
||||
Default: None.
|
||||
wait_for_images (bool): If True, wait for images to load before extracting content.
|
||||
Default: True.
|
||||
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
|
||||
Default: 0.1.
|
||||
mean_delay (float): Mean base delay between requests when calling arun_many.
|
||||
Default: 0.1.
|
||||
max_range (float): Max random additional delay range for requests in arun_many.
|
||||
Default: 0.3.
|
||||
semaphore_count (int): Number of concurrent operations allowed.
|
||||
Default: 5.
|
||||
|
||||
# Page Interaction Parameters
|
||||
js_code (str or list of str or None): JavaScript code/snippets to run on the page.
|
||||
Default: None.
|
||||
js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
|
||||
Default: False.
|
||||
ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
|
||||
Default: True.
|
||||
scan_full_page (bool): If True, scroll through the entire page to load all content.
|
||||
Default: False.
|
||||
scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
|
||||
Default: 0.2.
|
||||
process_iframes (bool): If True, attempts to process and inline iframe content.
|
||||
Default: False.
|
||||
remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
|
||||
Default: False.
|
||||
simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
|
||||
Default: False.
|
||||
override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
|
||||
Default: False.
|
||||
magic (bool): If True, attempts automatic handling of overlays/popups.
|
||||
Default: False.
|
||||
adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
|
||||
Default: False.
|
||||
|
||||
# Media Handling Parameters
|
||||
screenshot (bool): Whether to take a screenshot after crawling.
|
||||
Default: False.
|
||||
screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
|
||||
Default: None.
|
||||
screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
|
||||
Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
|
||||
pdf (bool): Whether to generate a PDF of the page.
|
||||
Default: False.
|
||||
image_description_min_word_threshold (int): Minimum words for image description extraction.
|
||||
Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
|
||||
image_score_threshold (int): Minimum score threshold for processing an image.
|
||||
Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
|
||||
exclude_external_images (bool): If True, exclude all external images from processing.
|
||||
Default: False.
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
exclude_social_media_domains (list of str): List of domains to exclude for social media links.
|
||||
Default: SOCIAL_MEDIA_DOMAINS (from config).
|
||||
exclude_external_links (bool): If True, exclude all external links from the results.
|
||||
Default: False.
|
||||
exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
|
||||
Default: False.
|
||||
exclude_domains (list of str): List of specific domains to exclude from results.
|
||||
Default: [].
|
||||
|
||||
# Debugging and Logging Parameters
|
||||
verbose (bool): Enable verbose logging.
|
||||
Default: True.
|
||||
log_console (bool): If True, log console messages from the page.
|
||||
Default: False.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
# Content Processing Parameters
|
||||
word_count_threshold: int = MIN_WORD_THRESHOLD,
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = None,
|
||||
markdown_generator: MarkdownGenerationStrategy = None,
|
||||
content_filter=None,
|
||||
only_text: bool = False,
|
||||
css_selector: str = None,
|
||||
excluded_tags: list = None,
|
||||
excluded_selector: str = None,
|
||||
keep_data_attributes: bool = False,
|
||||
remove_forms: bool = False,
|
||||
prettiify: bool = False,
|
||||
parser_type: str = "lxml",
|
||||
|
||||
# SSL Parameters
|
||||
fetch_ssl_certificate: bool = False,
|
||||
|
||||
# Caching Parameters
|
||||
cache_mode=None,
|
||||
session_id: str = None,
|
||||
bypass_cache: bool = False,
|
||||
disable_cache: bool = False,
|
||||
no_cache_read: bool = False,
|
||||
no_cache_write: bool = False,
|
||||
|
||||
# Page Navigation and Timing Parameters
|
||||
wait_until: str = "domcontentloaded",
|
||||
page_timeout: int = PAGE_TIMEOUT,
|
||||
wait_for: str = None,
|
||||
wait_for_images: bool = True,
|
||||
delay_before_return_html: float = 0.1,
|
||||
mean_delay: float = 0.1,
|
||||
max_range: float = 0.3,
|
||||
semaphore_count: int = 5,
|
||||
|
||||
# Page Interaction Parameters
|
||||
js_code: Union[str, List[str]] = None,
|
||||
js_only: bool = False,
|
||||
ignore_body_visibility: bool = True,
|
||||
scan_full_page: bool = False,
|
||||
scroll_delay: float = 0.2,
|
||||
process_iframes: bool = False,
|
||||
remove_overlay_elements: bool = False,
|
||||
simulate_user: bool = False,
|
||||
override_navigator: bool = False,
|
||||
magic: bool = False,
|
||||
adjust_viewport_to_content: bool = False,
|
||||
|
||||
# Media Handling Parameters
|
||||
screenshot: bool = False,
|
||||
screenshot_wait_for: float = None,
|
||||
screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
|
||||
pdf: bool = False,
|
||||
image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
|
||||
image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
|
||||
exclude_external_images: bool = False,
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
exclude_social_media_domains: list = None,
|
||||
exclude_external_links: bool = False,
|
||||
exclude_social_media_links: bool = False,
|
||||
exclude_domains: list = None,
|
||||
|
||||
# Debugging and Logging Parameters
|
||||
verbose: bool = True,
|
||||
log_console: bool = False,
|
||||
|
||||
url: str = None,
|
||||
):
|
||||
self.url = url
|
||||
|
||||
# Content Processing Parameters
|
||||
self.word_count_threshold = word_count_threshold
|
||||
self.extraction_strategy = extraction_strategy
|
||||
self.chunking_strategy = chunking_strategy
|
||||
self.markdown_generator = markdown_generator
|
||||
self.content_filter = content_filter
|
||||
self.only_text = only_text
|
||||
self.css_selector = css_selector
|
||||
self.excluded_tags = excluded_tags or []
|
||||
self.excluded_selector = excluded_selector or ""
|
||||
self.keep_data_attributes = keep_data_attributes
|
||||
self.remove_forms = remove_forms
|
||||
self.prettiify = prettiify
|
||||
self.parser_type = parser_type
|
||||
|
||||
# SSL Parameters
|
||||
self.fetch_ssl_certificate = fetch_ssl_certificate
|
||||
|
||||
# Caching Parameters
|
||||
self.cache_mode = cache_mode
|
||||
self.session_id = session_id
|
||||
self.bypass_cache = bypass_cache
|
||||
self.disable_cache = disable_cache
|
||||
self.no_cache_read = no_cache_read
|
||||
self.no_cache_write = no_cache_write
|
||||
|
||||
# Page Navigation and Timing Parameters
|
||||
self.wait_until = wait_until
|
||||
self.page_timeout = page_timeout
|
||||
self.wait_for = wait_for
|
||||
self.wait_for_images = wait_for_images
|
||||
self.delay_before_return_html = delay_before_return_html
|
||||
self.mean_delay = mean_delay
|
||||
self.max_range = max_range
|
||||
self.semaphore_count = semaphore_count
|
||||
|
||||
# Page Interaction Parameters
|
||||
self.js_code = js_code
|
||||
self.js_only = js_only
|
||||
self.ignore_body_visibility = ignore_body_visibility
|
||||
self.scan_full_page = scan_full_page
|
||||
self.scroll_delay = scroll_delay
|
||||
self.process_iframes = process_iframes
|
||||
self.remove_overlay_elements = remove_overlay_elements
|
||||
self.simulate_user = simulate_user
|
||||
self.override_navigator = override_navigator
|
||||
self.magic = magic
|
||||
self.adjust_viewport_to_content = adjust_viewport_to_content
|
||||
|
||||
# Media Handling Parameters
|
||||
self.screenshot = screenshot
|
||||
self.screenshot_wait_for = screenshot_wait_for
|
||||
self.screenshot_height_threshold = screenshot_height_threshold
|
||||
self.pdf = pdf
|
||||
self.image_description_min_word_threshold = image_description_min_word_threshold
|
||||
self.image_score_threshold = image_score_threshold
|
||||
self.exclude_external_images = exclude_external_images
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
self.exclude_social_media_domains = exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
|
||||
self.exclude_external_links = exclude_external_links
|
||||
self.exclude_social_media_links = exclude_social_media_links
|
||||
self.exclude_domains = exclude_domains or []
|
||||
|
||||
# Debugging and Logging Parameters
|
||||
self.verbose = verbose
|
||||
self.log_console = log_console
|
||||
|
||||
# Validate type of extraction strategy and chunking strategy if they are provided
|
||||
if self.extraction_strategy is not None and not isinstance(
|
||||
self.extraction_strategy, ExtractionStrategy
|
||||
):
|
||||
raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
|
||||
if self.chunking_strategy is not None and not isinstance(
|
||||
self.chunking_strategy, ChunkingStrategy
|
||||
):
|
||||
raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
|
||||
|
||||
# Set default chunking strategy if None
|
||||
if self.chunking_strategy is None:
|
||||
from .chunking_strategy import RegexChunking
|
||||
self.chunking_strategy = RegexChunking()
|
||||
|
||||
@staticmethod
|
||||
def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
|
||||
return CrawlerRunConfig(
|
||||
# Content Processing Parameters
|
||||
word_count_threshold=kwargs.get("word_count_threshold", 200),
|
||||
extraction_strategy=kwargs.get("extraction_strategy"),
|
||||
chunking_strategy=kwargs.get("chunking_strategy"),
|
||||
markdown_generator=kwargs.get("markdown_generator"),
|
||||
content_filter=kwargs.get("content_filter"),
|
||||
only_text=kwargs.get("only_text", False),
|
||||
css_selector=kwargs.get("css_selector"),
|
||||
excluded_tags=kwargs.get("excluded_tags", []),
|
||||
excluded_selector=kwargs.get("excluded_selector", ""),
|
||||
keep_data_attributes=kwargs.get("keep_data_attributes", False),
|
||||
remove_forms=kwargs.get("remove_forms", False),
|
||||
prettiify=kwargs.get("prettiify", False),
|
||||
parser_type=kwargs.get("parser_type", "lxml"),
|
||||
|
||||
# SSL Parameters
|
||||
fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
|
||||
|
||||
# Caching Parameters
|
||||
cache_mode=kwargs.get("cache_mode"),
|
||||
session_id=kwargs.get("session_id"),
|
||||
bypass_cache=kwargs.get("bypass_cache", False),
|
||||
disable_cache=kwargs.get("disable_cache", False),
|
||||
no_cache_read=kwargs.get("no_cache_read", False),
|
||||
no_cache_write=kwargs.get("no_cache_write", False),
|
||||
|
||||
# Page Navigation and Timing Parameters
|
||||
wait_until=kwargs.get("wait_until", "domcontentloaded"),
|
||||
page_timeout=kwargs.get("page_timeout", 60000),
|
||||
wait_for=kwargs.get("wait_for"),
|
||||
wait_for_images=kwargs.get("wait_for_images", True),
|
||||
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
|
||||
mean_delay=kwargs.get("mean_delay", 0.1),
|
||||
max_range=kwargs.get("max_range", 0.3),
|
||||
semaphore_count=kwargs.get("semaphore_count", 5),
|
||||
|
||||
# Page Interaction Parameters
|
||||
js_code=kwargs.get("js_code"),
|
||||
js_only=kwargs.get("js_only", False),
|
||||
ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
|
||||
scan_full_page=kwargs.get("scan_full_page", False),
|
||||
scroll_delay=kwargs.get("scroll_delay", 0.2),
|
||||
process_iframes=kwargs.get("process_iframes", False),
|
||||
remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
|
||||
simulate_user=kwargs.get("simulate_user", False),
|
||||
override_navigator=kwargs.get("override_navigator", False),
|
||||
magic=kwargs.get("magic", False),
|
||||
adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
|
||||
|
||||
# Media Handling Parameters
|
||||
screenshot=kwargs.get("screenshot", False),
|
||||
screenshot_wait_for=kwargs.get("screenshot_wait_for"),
|
||||
screenshot_height_threshold=kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD),
|
||||
pdf=kwargs.get("pdf", False),
|
||||
image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
|
||||
image_score_threshold=kwargs.get("image_score_threshold", IMAGE_SCORE_THRESHOLD),
|
||||
exclude_external_images=kwargs.get("exclude_external_images", False),
|
||||
|
||||
# Link and Domain Handling Parameters
|
||||
exclude_social_media_domains=kwargs.get("exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS),
|
||||
exclude_external_links=kwargs.get("exclude_external_links", False),
|
||||
exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
|
||||
exclude_domains=kwargs.get("exclude_domains", []),
|
||||
|
||||
# Debugging and Logging Parameters
|
||||
verbose=kwargs.get("verbose", True),
|
||||
log_console=kwargs.get("log_console", False),
|
||||
|
||||
url=kwargs.get("url"),
|
||||
)
|
||||
|
||||
# Create a funciton returns dict of the object
|
||||
def to_dict(self):
|
||||
return {
|
||||
"word_count_threshold": self.word_count_threshold,
|
||||
"extraction_strategy": self.extraction_strategy,
|
||||
"chunking_strategy": self.chunking_strategy,
|
||||
"markdown_generator": self.markdown_generator,
|
||||
"content_filter": self.content_filter,
|
||||
"only_text": self.only_text,
|
||||
"css_selector": self.css_selector,
|
||||
"excluded_tags": self.excluded_tags,
|
||||
"excluded_selector": self.excluded_selector,
|
||||
"keep_data_attributes": self.keep_data_attributes,
|
||||
"remove_forms": self.remove_forms,
|
||||
"prettiify": self.prettiify,
|
||||
"parser_type": self.parser_type,
|
||||
"fetch_ssl_certificate": self.fetch_ssl_certificate,
|
||||
"cache_mode": self.cache_mode,
|
||||
"session_id": self.session_id,
|
||||
"bypass_cache": self.bypass_cache,
|
||||
"disable_cache": self.disable_cache,
|
||||
"no_cache_read": self.no_cache_read,
|
||||
"no_cache_write": self.no_cache_write,
|
||||
"wait_until": self.wait_until,
|
||||
"page_timeout": self.page_timeout,
|
||||
"wait_for": self.wait_for,
|
||||
"wait_for_images": self.wait_for_images,
|
||||
"delay_before_return_html": self.delay_before_return_html,
|
||||
"mean_delay": self.mean_delay,
|
||||
"max_range": self.max_range,
|
||||
"semaphore_count": self.semaphore_count,
|
||||
"js_code": self.js_code,
|
||||
"js_only": self.js_only,
|
||||
"ignore_body_visibility": self.ignore_body_visibility,
|
||||
"scan_full_page": self.scan_full_page,
|
||||
"scroll_delay": self.scroll_delay,
|
||||
"process_iframes": self.process_iframes,
|
||||
"remove_overlay_elements": self.remove_overlay_elements,
|
||||
"simulate_user": self.simulate_user,
|
||||
"override_navigator": self.override_navigator,
|
||||
"magic": self.magic,
|
||||
"adjust_viewport_to_content": self.adjust_viewport_to_content,
|
||||
"screenshot": self.screenshot,
|
||||
"screenshot_wait_for": self.screenshot_wait_for,
|
||||
"screenshot_height_threshold": self.screenshot_height_threshold,
|
||||
"pdf": self.pdf,
|
||||
"image_description_min_word_threshold": self.image_description_min_word_threshold,
|
||||
"image_score_threshold": self.image_score_threshold,
|
||||
"exclude_external_images": self.exclude_external_images,
|
||||
"exclude_social_media_domains": self.exclude_social_media_domains,
|
||||
"exclude_external_links": self.exclude_external_links,
|
||||
"exclude_social_media_links": self.exclude_social_media_links,
|
||||
"exclude_domains": self.exclude_domains,
|
||||
"verbose": self.verbose,
|
||||
"log_console": self.log_console,
|
||||
"url": self.url,
|
||||
}
|
||||
File diff suppressed because it is too large
Load Diff
@@ -1,285 +0,0 @@
|
||||
import os
|
||||
from pathlib import Path
|
||||
import aiosqlite
|
||||
import asyncio
|
||||
from typing import Optional, Tuple, Dict
|
||||
from contextlib import asynccontextmanager
|
||||
import logging
|
||||
import json # Added for serialization/deserialization
|
||||
from .utils import ensure_content_dirs, generate_content_hash
|
||||
import xxhash
|
||||
import aiofiles
|
||||
# Set up logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
DB_PATH = os.path.join(Path.home(), ".crawl4ai")
|
||||
os.makedirs(DB_PATH, exist_ok=True)
|
||||
DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
|
||||
|
||||
class AsyncDatabaseManager:
|
||||
def __init__(self, pool_size: int = 10, max_retries: int = 3):
|
||||
self.db_path = DB_PATH
|
||||
self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
|
||||
self.pool_size = pool_size
|
||||
self.max_retries = max_retries
|
||||
self.connection_pool: Dict[int, aiosqlite.Connection] = {}
|
||||
self.pool_lock = asyncio.Lock()
|
||||
self.connection_semaphore = asyncio.Semaphore(pool_size)
|
||||
|
||||
async def initialize(self):
|
||||
"""Initialize the database and connection pool"""
|
||||
await self.ainit_db()
|
||||
|
||||
async def cleanup(self):
|
||||
"""Cleanup connections when shutting down"""
|
||||
async with self.pool_lock:
|
||||
for conn in self.connection_pool.values():
|
||||
await conn.close()
|
||||
self.connection_pool.clear()
|
||||
|
||||
@asynccontextmanager
|
||||
async def get_connection(self):
|
||||
"""Connection pool manager"""
|
||||
async with self.connection_semaphore:
|
||||
task_id = id(asyncio.current_task())
|
||||
try:
|
||||
async with self.pool_lock:
|
||||
if task_id not in self.connection_pool:
|
||||
conn = await aiosqlite.connect(
|
||||
self.db_path,
|
||||
timeout=30.0
|
||||
)
|
||||
await conn.execute('PRAGMA journal_mode = WAL')
|
||||
await conn.execute('PRAGMA busy_timeout = 5000')
|
||||
self.connection_pool[task_id] = conn
|
||||
|
||||
yield self.connection_pool[task_id]
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Connection error: {e}")
|
||||
raise
|
||||
finally:
|
||||
async with self.pool_lock:
|
||||
if task_id in self.connection_pool:
|
||||
await self.connection_pool[task_id].close()
|
||||
del self.connection_pool[task_id]
|
||||
|
||||
async def execute_with_retry(self, operation, *args):
|
||||
"""Execute database operations with retry logic"""
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
async with self.get_connection() as db:
|
||||
result = await operation(db, *args)
|
||||
await db.commit()
|
||||
return result
|
||||
except Exception as e:
|
||||
if attempt == self.max_retries - 1:
|
||||
logger.error(f"Operation failed after {self.max_retries} attempts: {e}")
|
||||
raise
|
||||
await asyncio.sleep(1 * (attempt + 1)) # Exponential backoff
|
||||
|
||||
async def ainit_db(self):
|
||||
"""Initialize database schema"""
|
||||
async def _init(db):
|
||||
await db.execute('''
|
||||
CREATE TABLE IF NOT EXISTS crawled_data (
|
||||
url TEXT PRIMARY KEY,
|
||||
html TEXT,
|
||||
cleaned_html TEXT,
|
||||
markdown TEXT,
|
||||
extracted_content TEXT,
|
||||
success BOOLEAN,
|
||||
media TEXT DEFAULT "{}",
|
||||
links TEXT DEFAULT "{}",
|
||||
metadata TEXT DEFAULT "{}",
|
||||
screenshot TEXT DEFAULT "",
|
||||
response_headers TEXT DEFAULT "{}",
|
||||
downloaded_files TEXT DEFAULT "{}" -- New column added
|
||||
)
|
||||
''')
|
||||
|
||||
await self.execute_with_retry(_init)
|
||||
await self.update_db_schema()
|
||||
|
||||
async def update_db_schema(self):
|
||||
"""Update database schema if needed"""
|
||||
async def _check_columns(db):
|
||||
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
||||
columns = await cursor.fetchall()
|
||||
return [column[1] for column in columns]
|
||||
|
||||
column_names = await self.execute_with_retry(_check_columns)
|
||||
|
||||
# List of new columns to add
|
||||
new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
|
||||
|
||||
for column in new_columns:
|
||||
if column not in column_names:
|
||||
await self.aalter_db_add_column(column)
|
||||
|
||||
async def aalter_db_add_column(self, new_column: str):
|
||||
"""Add new column to the database"""
|
||||
async def _alter(db):
|
||||
if new_column == 'response_headers':
|
||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
|
||||
else:
|
||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
||||
logger.info(f"Added column '{new_column}' to the database.")
|
||||
|
||||
await self.execute_with_retry(_alter)
|
||||
|
||||
|
||||
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
|
||||
"""Retrieve cached URL data"""
|
||||
async def _get(db):
|
||||
async with db.execute(
|
||||
'''
|
||||
SELECT url, html, cleaned_html, markdown,
|
||||
extracted_content, success, media, links,
|
||||
metadata, screenshot, response_headers,
|
||||
downloaded_files
|
||||
FROM crawled_data WHERE url = ?
|
||||
''',
|
||||
(url,)
|
||||
) as cursor:
|
||||
row = await cursor.fetchone()
|
||||
if row:
|
||||
# Load content from files using stored hashes
|
||||
html = await self._load_content(row[1], 'html') if row[1] else ""
|
||||
cleaned = await self._load_content(row[2], 'cleaned') if row[2] else ""
|
||||
markdown = await self._load_content(row[3], 'markdown') if row[3] else ""
|
||||
extracted = await self._load_content(row[4], 'extracted') if row[4] else ""
|
||||
screenshot = await self._load_content(row[9], 'screenshots') if row[9] else ""
|
||||
|
||||
return (
|
||||
row[0], # url
|
||||
html or "", # Return empty string if file not found
|
||||
cleaned or "",
|
||||
markdown or "",
|
||||
extracted or "",
|
||||
row[5], # success
|
||||
json.loads(row[6] or '{}'), # media
|
||||
json.loads(row[7] or '{}'), # links
|
||||
json.loads(row[8] or '{}'), # metadata
|
||||
screenshot or "",
|
||||
json.loads(row[10] or '{}'), # response_headers
|
||||
json.loads(row[11] or '[]') # downloaded_files
|
||||
)
|
||||
return None
|
||||
|
||||
try:
|
||||
return await self.execute_with_retry(_get)
|
||||
except Exception as e:
|
||||
logger.error(f"Error retrieving cached URL: {e}")
|
||||
return None
|
||||
|
||||
async def acache_url(self, url: str, html: str, cleaned_html: str,
|
||||
markdown: str, extracted_content: str, success: bool,
|
||||
media: str = "{}", links: str = "{}",
|
||||
metadata: str = "{}", screenshot: str = "",
|
||||
response_headers: str = "{}", downloaded_files: str = "[]"):
|
||||
"""Cache URL data with content stored in filesystem"""
|
||||
|
||||
# Store content files and get hashes
|
||||
html_hash = await self._store_content(html, 'html')
|
||||
cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
|
||||
markdown_hash = await self._store_content(markdown, 'markdown')
|
||||
extracted_hash = await self._store_content(extracted_content, 'extracted')
|
||||
screenshot_hash = await self._store_content(screenshot, 'screenshots')
|
||||
|
||||
async def _cache(db):
|
||||
await db.execute('''
|
||||
INSERT INTO crawled_data (
|
||||
url, html, cleaned_html, markdown,
|
||||
extracted_content, success, media, links, metadata,
|
||||
screenshot, response_headers, downloaded_files
|
||||
)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
|
||||
ON CONFLICT(url) DO UPDATE SET
|
||||
html = excluded.html,
|
||||
cleaned_html = excluded.cleaned_html,
|
||||
markdown = excluded.markdown,
|
||||
extracted_content = excluded.extracted_content,
|
||||
success = excluded.success,
|
||||
media = excluded.media,
|
||||
links = excluded.links,
|
||||
metadata = excluded.metadata,
|
||||
screenshot = excluded.screenshot,
|
||||
response_headers = excluded.response_headers,
|
||||
downloaded_files = excluded.downloaded_files
|
||||
''', (url, html_hash, cleaned_hash, markdown_hash, extracted_hash,
|
||||
success, media, links, metadata, screenshot_hash,
|
||||
response_headers, downloaded_files))
|
||||
|
||||
try:
|
||||
await self.execute_with_retry(_cache)
|
||||
except Exception as e:
|
||||
logger.error(f"Error caching URL: {e}")
|
||||
|
||||
|
||||
|
||||
async def aget_total_count(self) -> int:
|
||||
"""Get total number of cached URLs"""
|
||||
async def _count(db):
|
||||
async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
|
||||
result = await cursor.fetchone()
|
||||
return result[0] if result else 0
|
||||
|
||||
try:
|
||||
return await self.execute_with_retry(_count)
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting total count: {e}")
|
||||
return 0
|
||||
|
||||
async def aclear_db(self):
|
||||
"""Clear all data from the database"""
|
||||
async def _clear(db):
|
||||
await db.execute('DELETE FROM crawled_data')
|
||||
|
||||
try:
|
||||
await self.execute_with_retry(_clear)
|
||||
except Exception as e:
|
||||
logger.error(f"Error clearing database: {e}")
|
||||
|
||||
async def aflush_db(self):
|
||||
"""Drop the entire table"""
|
||||
async def _flush(db):
|
||||
await db.execute('DROP TABLE IF EXISTS crawled_data')
|
||||
|
||||
try:
|
||||
await self.execute_with_retry(_flush)
|
||||
except Exception as e:
|
||||
logger.error(f"Error flushing database: {e}")
|
||||
|
||||
|
||||
async def _store_content(self, content: str, content_type: str) -> str:
|
||||
"""Store content in filesystem and return hash"""
|
||||
if not content:
|
||||
return ""
|
||||
|
||||
content_hash = generate_content_hash(content)
|
||||
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
||||
|
||||
# Only write if file doesn't exist
|
||||
if not os.path.exists(file_path):
|
||||
async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
|
||||
await f.write(content)
|
||||
|
||||
return content_hash
|
||||
|
||||
async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
|
||||
"""Load content from filesystem by hash"""
|
||||
if not content_hash:
|
||||
return None
|
||||
|
||||
file_path = os.path.join(self.content_paths[content_type], content_hash)
|
||||
try:
|
||||
async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
|
||||
return await f.read()
|
||||
except:
|
||||
logger.error(f"Failed to load content: {file_path}")
|
||||
return None
|
||||
|
||||
# Create a singleton instance
|
||||
async_db_manager = AsyncDatabaseManager()
|
||||
@@ -1,4 +1,4 @@
|
||||
import os
|
||||
import os, sys
|
||||
from pathlib import Path
|
||||
import aiosqlite
|
||||
import asyncio
|
||||
@@ -7,12 +7,13 @@ from contextlib import asynccontextmanager
|
||||
import logging
|
||||
import json # Added for serialization/deserialization
|
||||
from .utils import ensure_content_dirs, generate_content_hash
|
||||
from .models import CrawlResult
|
||||
from .models import CrawlResult, MarkdownGenerationResult
|
||||
import xxhash
|
||||
import aiofiles
|
||||
from .config import NEED_MIGRATION
|
||||
from .version_manager import VersionManager
|
||||
from .async_logger import AsyncLogger
|
||||
from .utils import get_error_context, create_box_message
|
||||
# Set up logging
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -97,35 +98,84 @@ class AsyncDatabaseManager:
|
||||
|
||||
@asynccontextmanager
|
||||
async def get_connection(self):
|
||||
"""Connection pool manager"""
|
||||
"""Connection pool manager with enhanced error handling"""
|
||||
if not self._initialized:
|
||||
# Use an asyncio.Lock to ensure only one initialization occurs
|
||||
async with self.init_lock:
|
||||
if not self._initialized:
|
||||
await self.initialize()
|
||||
self._initialized = True
|
||||
try:
|
||||
await self.initialize()
|
||||
self._initialized = True
|
||||
except Exception as e:
|
||||
import sys
|
||||
error_context = get_error_context(sys.exc_info())
|
||||
self.logger.error(
|
||||
message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
|
||||
tag="ERROR",
|
||||
force_verbose=True,
|
||||
params={
|
||||
"error": str(e),
|
||||
"context": error_context["code_context"],
|
||||
"traceback": error_context["full_traceback"]
|
||||
}
|
||||
)
|
||||
raise
|
||||
|
||||
await self.connection_semaphore.acquire()
|
||||
task_id = id(asyncio.current_task())
|
||||
|
||||
try:
|
||||
async with self.pool_lock:
|
||||
if task_id not in self.connection_pool:
|
||||
conn = await aiosqlite.connect(
|
||||
self.db_path,
|
||||
timeout=30.0
|
||||
)
|
||||
await conn.execute('PRAGMA journal_mode = WAL')
|
||||
await conn.execute('PRAGMA busy_timeout = 5000')
|
||||
self.connection_pool[task_id] = conn
|
||||
try:
|
||||
conn = await aiosqlite.connect(
|
||||
self.db_path,
|
||||
timeout=30.0
|
||||
)
|
||||
await conn.execute('PRAGMA journal_mode = WAL')
|
||||
await conn.execute('PRAGMA busy_timeout = 5000')
|
||||
|
||||
# Verify database structure
|
||||
async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
|
||||
columns = await cursor.fetchall()
|
||||
column_names = [col[1] for col in columns]
|
||||
expected_columns = {
|
||||
'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
|
||||
'success', 'media', 'links', 'metadata', 'screenshot',
|
||||
'response_headers', 'downloaded_files'
|
||||
}
|
||||
missing_columns = expected_columns - set(column_names)
|
||||
if missing_columns:
|
||||
raise ValueError(f"Database missing columns: {missing_columns}")
|
||||
|
||||
self.connection_pool[task_id] = conn
|
||||
except Exception as e:
|
||||
import sys
|
||||
error_context = get_error_context(sys.exc_info())
|
||||
error_message = (
|
||||
f"Unexpected error in db get_connection at line {error_context['line_no']} "
|
||||
f"in {error_context['function']} ({error_context['filename']}):\n"
|
||||
f"Error: {str(e)}\n\n"
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message=create_box_message(error_message, type= "error"),
|
||||
)
|
||||
|
||||
raise
|
||||
|
||||
yield self.connection_pool[task_id]
|
||||
|
||||
except Exception as e:
|
||||
import sys
|
||||
error_context = get_error_context(sys.exc_info())
|
||||
error_message = (
|
||||
f"Unexpected error in db get_connection at line {error_context['line_no']} "
|
||||
f"in {error_context['function']} ({error_context['filename']}):\n"
|
||||
f"Error: {str(e)}\n\n"
|
||||
f"Code context:\n{error_context['code_context']}"
|
||||
)
|
||||
self.logger.error(
|
||||
message="Connection error: {error}",
|
||||
tag="ERROR",
|
||||
force_verbose=True,
|
||||
params={"error": str(e)}
|
||||
message=create_box_message(error_message, type= "error"),
|
||||
)
|
||||
raise
|
||||
finally:
|
||||
@@ -230,7 +280,8 @@ class AsyncDatabaseManager:
|
||||
'cleaned_html': row_dict['cleaned_html'],
|
||||
'markdown': row_dict['markdown'],
|
||||
'extracted_content': row_dict['extracted_content'],
|
||||
'screenshot': row_dict['screenshot']
|
||||
'screenshot': row_dict['screenshot'],
|
||||
'screenshots': row_dict['screenshot'],
|
||||
}
|
||||
|
||||
for field, hash_value in content_fields.items():
|
||||
@@ -244,13 +295,18 @@ class AsyncDatabaseManager:
|
||||
row_dict[field] = ""
|
||||
|
||||
# Parse JSON fields
|
||||
json_fields = ['media', 'links', 'metadata', 'response_headers']
|
||||
json_fields = ['media', 'links', 'metadata', 'response_headers', 'markdown']
|
||||
for field in json_fields:
|
||||
try:
|
||||
row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {}
|
||||
except json.JSONDecodeError:
|
||||
row_dict[field] = {}
|
||||
|
||||
if isinstance(row_dict['markdown'], Dict):
|
||||
row_dict['markdown_v2'] = row_dict['markdown']
|
||||
if row_dict['markdown'].get('raw_markdown'):
|
||||
row_dict['markdown'] = row_dict['markdown']['raw_markdown']
|
||||
|
||||
# Parse downloaded_files
|
||||
try:
|
||||
row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else []
|
||||
@@ -280,10 +336,28 @@ class AsyncDatabaseManager:
|
||||
content_map = {
|
||||
'html': (result.html, 'html'),
|
||||
'cleaned_html': (result.cleaned_html or "", 'cleaned'),
|
||||
'markdown': (result.markdown or "", 'markdown'),
|
||||
'markdown': None,
|
||||
'extracted_content': (result.extracted_content or "", 'extracted'),
|
||||
'screenshot': (result.screenshot or "", 'screenshots')
|
||||
}
|
||||
|
||||
try:
|
||||
if isinstance(result.markdown, MarkdownGenerationResult):
|
||||
content_map['markdown'] = (result.markdown.model_dump_json(), 'markdown')
|
||||
elif hasattr(result, 'markdown_v2'):
|
||||
content_map['markdown'] = (result.markdown_v2.model_dump_json(), 'markdown')
|
||||
elif isinstance(result.markdown, str):
|
||||
markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
|
||||
content_map['markdown'] = (markdown_result.model_dump_json(), 'markdown')
|
||||
else:
|
||||
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown')
|
||||
except Exception as e:
|
||||
self.logger.warning(
|
||||
message=f"Error processing markdown content: {str(e)}",
|
||||
tag="WARNING"
|
||||
)
|
||||
# Fallback to empty markdown result
|
||||
content_map['markdown'] = (MarkdownGenerationResult().model_dump_json(), 'markdown')
|
||||
|
||||
content_hashes = {}
|
||||
for field, (content, content_type) in content_map.items():
|
||||
|
||||
@@ -42,7 +42,7 @@ class AsyncLogger:
|
||||
def __init__(
|
||||
self,
|
||||
log_file: Optional[str] = None,
|
||||
log_level: LogLevel = LogLevel.INFO,
|
||||
log_level: LogLevel = LogLevel.DEBUG,
|
||||
tag_width: int = 10,
|
||||
icons: Optional[Dict[str, str]] = None,
|
||||
colors: Optional[Dict[LogLevel, str]] = None,
|
||||
|
||||
@@ -1,344 +0,0 @@
|
||||
import os
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
import json
|
||||
import asyncio
|
||||
from .models import CrawlResult
|
||||
from .async_database import async_db_manager
|
||||
from .chunking_strategy import *
|
||||
from .extraction_strategy import *
|
||||
from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
|
||||
from .content_scrapping_strategy import WebScrapingStrategy
|
||||
from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
|
||||
from .utils import (
|
||||
sanitize_input_encode,
|
||||
InvalidCSSSelectorError,
|
||||
format_html
|
||||
)
|
||||
from .__version__ import __version__ as crawl4ai_version
|
||||
|
||||
class AsyncWebCrawler:
|
||||
def __init__(
|
||||
self,
|
||||
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
||||
always_by_pass_cache: bool = False,
|
||||
base_directory: str = str(Path.home()),
|
||||
**kwargs,
|
||||
):
|
||||
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
||||
**kwargs
|
||||
)
|
||||
self.always_by_pass_cache = always_by_pass_cache
|
||||
# self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
self.ready = False
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
async def __aenter__(self):
|
||||
await self.crawler_strategy.__aenter__()
|
||||
await self.awarmup()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
|
||||
|
||||
async def awarmup(self):
|
||||
# Print a message for crawl4ai and its version
|
||||
if self.verbose:
|
||||
print(f"[LOG] 🚀 Crawl4AI {crawl4ai_version}")
|
||||
print("[LOG] 🌤️ Warming up the AsyncWebCrawler")
|
||||
# await async_db_manager.ainit_db()
|
||||
# # await async_db_manager.initialize()
|
||||
# await self.arun(
|
||||
# url="https://google.com/",
|
||||
# word_count_threshold=5,
|
||||
# bypass_cache=False,
|
||||
# verbose=False,
|
||||
# )
|
||||
self.ready = True
|
||||
if self.verbose:
|
||||
print("[LOG] 🌞 AsyncWebCrawler is ready to crawl")
|
||||
|
||||
async def arun(
|
||||
self,
|
||||
url: str,
|
||||
word_count_threshold=MIN_WORD_THRESHOLD,
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||
bypass_cache: bool = False,
|
||||
css_selector: str = None,
|
||||
screenshot: bool = False,
|
||||
user_agent: str = None,
|
||||
verbose=True,
|
||||
disable_cache: bool = False,
|
||||
no_cache_read: bool = False,
|
||||
no_cache_write: bool = False,
|
||||
**kwargs,
|
||||
) -> CrawlResult:
|
||||
"""
|
||||
Runs the crawler for a single source: URL (web, local file, or raw HTML).
|
||||
|
||||
Args:
|
||||
url (str): The URL to crawl. Supported prefixes:
|
||||
- 'http://' or 'https://': Web URL to crawl.
|
||||
- 'file://': Local file path to process.
|
||||
- 'raw:': Raw HTML content to process.
|
||||
... [other existing parameters]
|
||||
|
||||
Returns:
|
||||
CrawlResult: The result of the crawling and processing.
|
||||
"""
|
||||
try:
|
||||
if disable_cache:
|
||||
bypass_cache = True
|
||||
no_cache_read = True
|
||||
no_cache_write = True
|
||||
|
||||
extraction_strategy = extraction_strategy or NoExtractionStrategy()
|
||||
extraction_strategy.verbose = verbose
|
||||
if not isinstance(extraction_strategy, ExtractionStrategy):
|
||||
raise ValueError("Unsupported extraction strategy")
|
||||
if not isinstance(chunking_strategy, ChunkingStrategy):
|
||||
raise ValueError("Unsupported chunking strategy")
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
|
||||
is_web_url = url.startswith(('http://', 'https://'))
|
||||
is_local_file = url.startswith("file://")
|
||||
is_raw_html = url.startswith("raw:")
|
||||
_url = url if not is_raw_html else "Raw HTML"
|
||||
|
||||
start_time = time.perf_counter()
|
||||
cached_result = None
|
||||
if is_web_url and (not bypass_cache or not no_cache_read) and not self.always_by_pass_cache:
|
||||
cached_result = await async_db_manager.aget_cached_url(url)
|
||||
|
||||
if cached_result:
|
||||
html = sanitize_input_encode(cached_result.html)
|
||||
extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
|
||||
if screenshot:
|
||||
screenshot_data = cached_result.screenshot
|
||||
if not screenshot_data:
|
||||
cached_result = None
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 1️⃣ ✅ Page fetched (cache) for {_url}, success: {bool(html)}, time taken: {time.perf_counter() - start_time:.2f} seconds"
|
||||
)
|
||||
|
||||
|
||||
if not cached or not html:
|
||||
t1 = time.perf_counter()
|
||||
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.perf_counter()
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 1️⃣ ✅ Page fetched (no-cache) for {_url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
|
||||
)
|
||||
|
||||
t1 = time.perf_counter()
|
||||
crawl_result = await self.aprocess_html(
|
||||
url=url,
|
||||
html=html,
|
||||
extracted_content=extracted_content,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot_data,
|
||||
verbose=verbose,
|
||||
is_cached=bool(cached),
|
||||
async_response=async_response,
|
||||
bypass_cache=bypass_cache,
|
||||
is_web_url = is_web_url,
|
||||
is_local_file = is_local_file,
|
||||
is_raw_html = is_raw_html,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if async_response:
|
||||
crawl_result.status_code = async_response.status_code
|
||||
crawl_result.response_headers = async_response.response_headers
|
||||
crawl_result.downloaded_files = async_response.downloaded_files
|
||||
else:
|
||||
crawl_result.status_code = 200
|
||||
crawl_result.response_headers = cached_result.response_headers if cached_result else {}
|
||||
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 🔥 🚀 Crawling done for {_url}, success: {crawl_result.success}, time taken: {time.perf_counter() - start_time:.2f} seconds"
|
||||
)
|
||||
|
||||
if not is_raw_html and not no_cache_write:
|
||||
if not bool(cached_result) or kwargs.get("bypass_cache", False) or self.always_by_pass_cache:
|
||||
await async_db_manager.acache_url(crawl_result)
|
||||
|
||||
|
||||
return crawl_result
|
||||
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
print(f"[ERROR] 🚫 arun(): Failed to crawl {_url}, error: {e.msg}")
|
||||
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {_url}, error: {e.msg}", success=False, error_message=e.msg)
|
||||
|
||||
async def arun_many(
|
||||
self,
|
||||
urls: List[str],
|
||||
word_count_threshold=MIN_WORD_THRESHOLD,
|
||||
extraction_strategy: ExtractionStrategy = None,
|
||||
chunking_strategy: ChunkingStrategy = RegexChunking(),
|
||||
bypass_cache: bool = False,
|
||||
css_selector: str = None,
|
||||
screenshot: bool = False,
|
||||
user_agent: str = None,
|
||||
verbose=True,
|
||||
**kwargs,
|
||||
) -> List[CrawlResult]:
|
||||
"""
|
||||
Runs the crawler for multiple sources: URLs (web, local files, or raw HTML).
|
||||
|
||||
Args:
|
||||
urls (List[str]): A list of URLs with supported prefixes:
|
||||
- 'http://' or 'https://': Web URL to crawl.
|
||||
- 'file://': Local file path to process.
|
||||
- 'raw:': Raw HTML content to process.
|
||||
... [other existing parameters]
|
||||
|
||||
Returns:
|
||||
List[CrawlResult]: The results of the crawling and processing.
|
||||
"""
|
||||
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
|
||||
async def crawl_with_semaphore(url):
|
||||
async with semaphore:
|
||||
return await self.arun(
|
||||
url,
|
||||
word_count_threshold=word_count_threshold,
|
||||
extraction_strategy=extraction_strategy,
|
||||
chunking_strategy=chunking_strategy,
|
||||
bypass_cache=bypass_cache,
|
||||
css_selector=css_selector,
|
||||
screenshot=screenshot,
|
||||
user_agent=user_agent,
|
||||
verbose=verbose,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
tasks = [crawl_with_semaphore(url) for url in urls]
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def aprocess_html(
|
||||
self,
|
||||
url: str,
|
||||
html: str,
|
||||
extracted_content: str,
|
||||
word_count_threshold: int,
|
||||
extraction_strategy: ExtractionStrategy,
|
||||
chunking_strategy: ChunkingStrategy,
|
||||
css_selector: str,
|
||||
screenshot: str,
|
||||
verbose: bool,
|
||||
**kwargs,
|
||||
) -> CrawlResult:
|
||||
t = time.perf_counter()
|
||||
# Extract content from HTML
|
||||
try:
|
||||
_url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
|
||||
t1 = time.perf_counter()
|
||||
scrapping_strategy = WebScrapingStrategy()
|
||||
# result = await scrapping_strategy.ascrap(
|
||||
result = scrapping_strategy.scrap(
|
||||
url,
|
||||
html,
|
||||
word_count_threshold=word_count_threshold,
|
||||
css_selector=css_selector,
|
||||
only_text=kwargs.get("only_text", False),
|
||||
image_description_min_word_threshold=kwargs.get(
|
||||
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
if result is None:
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
|
||||
except InvalidCSSSelectorError as e:
|
||||
raise ValueError(str(e))
|
||||
except Exception as e:
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
|
||||
|
||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||
fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
|
||||
fit_html = sanitize_input_encode(result.get("fit_html", ""))
|
||||
media = result.get("media", [])
|
||||
links = result.get("links", [])
|
||||
metadata = result.get("metadata", {})
|
||||
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 2️⃣ ✅ Scraping done for {_url}, success: True, time taken: {time.perf_counter() - t1:.2f} seconds"
|
||||
)
|
||||
|
||||
if extracted_content is None and extraction_strategy and chunking_strategy and not isinstance(extraction_strategy, NoExtractionStrategy):
|
||||
t1 = time.perf_counter()
|
||||
# Check if extraction strategy is type of JsonCssExtractionStrategy
|
||||
if isinstance(extraction_strategy, JsonCssExtractionStrategy) or isinstance(extraction_strategy, JsonCssExtractionStrategy):
|
||||
extraction_strategy.verbose = verbose
|
||||
extracted_content = extraction_strategy.run(url, [html])
|
||||
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
|
||||
else:
|
||||
sections = chunking_strategy.chunk(markdown)
|
||||
extracted_content = extraction_strategy.run(url, sections)
|
||||
extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 3️⃣ ✅ Extraction done for {_url}, time taken: {time.perf_counter() - t1:.2f} seconds"
|
||||
)
|
||||
|
||||
screenshot = None if not screenshot else screenshot
|
||||
|
||||
return CrawlResult(
|
||||
url=url,
|
||||
html=html,
|
||||
cleaned_html=format_html(cleaned_html),
|
||||
markdown=markdown,
|
||||
fit_markdown=fit_markdown,
|
||||
fit_html= fit_html,
|
||||
media=media,
|
||||
links=links,
|
||||
metadata=metadata,
|
||||
screenshot=screenshot,
|
||||
extracted_content=extracted_content,
|
||||
success=True,
|
||||
error_message="",
|
||||
)
|
||||
|
||||
async def aclear_cache(self):
|
||||
# await async_db_manager.aclear_db()
|
||||
await async_db_manager.cleanup()
|
||||
|
||||
async def aflush_cache(self):
|
||||
await async_db_manager.aflush_db()
|
||||
|
||||
async def aget_cache_size(self):
|
||||
return await async_db_manager.aget_total_count()
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -25,8 +25,26 @@ class CacheContext:
|
||||
|
||||
This class centralizes all cache-related logic and URL type checking,
|
||||
making the caching behavior more predictable and maintainable.
|
||||
|
||||
Attributes:
|
||||
url (str): The URL being processed.
|
||||
cache_mode (CacheMode): The cache mode for the current operation.
|
||||
always_bypass (bool): If True, bypasses caching for this operation.
|
||||
is_cacheable (bool): True if the URL is cacheable, False otherwise.
|
||||
is_web_url (bool): True if the URL is a web URL, False otherwise.
|
||||
is_local_file (bool): True if the URL is a local file, False otherwise.
|
||||
is_raw_html (bool): True if the URL is raw HTML, False otherwise.
|
||||
_url_display (str): The display name for the URL (web, local file, or raw HTML).
|
||||
"""
|
||||
def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
|
||||
"""
|
||||
Initializes the CacheContext with the provided URL and cache mode.
|
||||
|
||||
Args:
|
||||
url (str): The URL being processed.
|
||||
cache_mode (CacheMode): The cache mode for the current operation.
|
||||
always_bypass (bool): If True, bypasses caching for this operation.
|
||||
"""
|
||||
self.url = url
|
||||
self.cache_mode = cache_mode
|
||||
self.always_bypass = always_bypass
|
||||
@@ -37,13 +55,31 @@ class CacheContext:
|
||||
self._url_display = url if not self.is_raw_html else "Raw HTML"
|
||||
|
||||
def should_read(self) -> bool:
|
||||
"""Determines if cache should be read based on context."""
|
||||
"""
|
||||
Determines if cache should be read based on context.
|
||||
|
||||
How it works:
|
||||
1. If always_bypass is True or is_cacheable is False, return False.
|
||||
2. If cache_mode is ENABLED or READ_ONLY, return True.
|
||||
|
||||
Returns:
|
||||
bool: True if cache should be read, False otherwise.
|
||||
"""
|
||||
if self.always_bypass or not self.is_cacheable:
|
||||
return False
|
||||
return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
|
||||
|
||||
def should_write(self) -> bool:
|
||||
"""Determines if cache should be written based on context."""
|
||||
"""
|
||||
Determines if cache should be written based on context.
|
||||
|
||||
How it works:
|
||||
1. If always_bypass is True or is_cacheable is False, return False.
|
||||
2. If cache_mode is ENABLED or WRITE_ONLY, return True.
|
||||
|
||||
Returns:
|
||||
bool: True if cache should be written, False otherwise.
|
||||
"""
|
||||
if self.always_bypass or not self.is_cacheable:
|
||||
return False
|
||||
return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
|
||||
|
||||
@@ -7,17 +7,43 @@ from .utils import *
|
||||
|
||||
# Define the abstract base class for chunking strategies
|
||||
class ChunkingStrategy(ABC):
|
||||
"""
|
||||
Abstract base class for chunking strategies.
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
def chunk(self, text: str) -> list:
|
||||
"""
|
||||
Abstract method to chunk the given text.
|
||||
|
||||
Args:
|
||||
text (str): The text to chunk.
|
||||
|
||||
Returns:
|
||||
list: A list of chunks.
|
||||
"""
|
||||
pass
|
||||
|
||||
|
||||
# Create an identity chunking strategy f(x) = [x]
|
||||
class IdentityChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that returns the input text as a single chunk.
|
||||
"""
|
||||
def chunk(self, text: str) -> list:
|
||||
return [text]
|
||||
|
||||
# Regex-based chunking
|
||||
class RegexChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that splits text based on regular expression patterns.
|
||||
"""
|
||||
def __init__(self, patterns=None, **kwargs):
|
||||
"""
|
||||
Initialize the RegexChunking object.
|
||||
|
||||
Args:
|
||||
patterns (list): A list of regular expression patterns to split text.
|
||||
"""
|
||||
if patterns is None:
|
||||
patterns = [r'\n\n'] # Default split pattern
|
||||
self.patterns = patterns
|
||||
@@ -33,9 +59,15 @@ class RegexChunking(ChunkingStrategy):
|
||||
|
||||
# NLP-based sentence chunking
|
||||
class NlpSentenceChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
|
||||
"""
|
||||
def __init__(self, **kwargs):
|
||||
"""
|
||||
Initialize the NlpSentenceChunking object.
|
||||
"""
|
||||
load_nltk_punkt()
|
||||
pass
|
||||
|
||||
|
||||
def chunk(self, text: str) -> list:
|
||||
# Improved regex for sentence splitting
|
||||
@@ -52,8 +84,21 @@ class NlpSentenceChunking(ChunkingStrategy):
|
||||
|
||||
# Topic-based segmentation using TextTiling
|
||||
class TopicSegmentationChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
|
||||
|
||||
How it works:
|
||||
1. Segment the text into topics using TextTilingTokenizer
|
||||
2. Extract keywords for each topic segment
|
||||
"""
|
||||
|
||||
def __init__(self, num_keywords=3, **kwargs):
|
||||
"""
|
||||
Initialize the TopicSegmentationChunking object.
|
||||
|
||||
Args:
|
||||
num_keywords (int): The number of keywords to extract for each topic segment.
|
||||
"""
|
||||
import nltk as nl
|
||||
self.tokenizer = nl.tokenize.TextTilingTokenizer()
|
||||
self.num_keywords = num_keywords
|
||||
@@ -83,6 +128,14 @@ class TopicSegmentationChunking(ChunkingStrategy):
|
||||
|
||||
# Fixed-length word chunks
|
||||
class FixedLengthWordChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that splits text into fixed-length word chunks.
|
||||
|
||||
How it works:
|
||||
1. Split the text into words
|
||||
2. Create chunks of fixed length
|
||||
3. Return the list of chunks
|
||||
"""
|
||||
def __init__(self, chunk_size=100, **kwargs):
|
||||
"""
|
||||
Initialize the fixed-length word chunking strategy with the given chunk size.
|
||||
@@ -98,6 +151,14 @@ class FixedLengthWordChunking(ChunkingStrategy):
|
||||
|
||||
# Sliding window chunking
|
||||
class SlidingWindowChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that splits text into overlapping word chunks.
|
||||
|
||||
How it works:
|
||||
1. Split the text into words
|
||||
2. Create chunks of fixed length
|
||||
3. Return the list of chunks
|
||||
"""
|
||||
def __init__(self, window_size=100, step=50, **kwargs):
|
||||
"""
|
||||
Initialize the sliding window chunking strategy with the given window size and
|
||||
@@ -127,8 +188,16 @@ class SlidingWindowChunking(ChunkingStrategy):
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
class OverlappingWindowChunking(ChunkingStrategy):
|
||||
"""
|
||||
Chunking strategy that splits text into overlapping word chunks.
|
||||
|
||||
How it works:
|
||||
1. Split the text into words using whitespace
|
||||
2. Create chunks of fixed length equal to the window size
|
||||
3. Slide the window by the overlap size
|
||||
4. Return the list of chunks
|
||||
"""
|
||||
def __init__(self, window_size=1000, overlap=100, **kwargs):
|
||||
"""
|
||||
Initialize the overlapping window chunking strategy with the given window size and
|
||||
|
||||
105
crawl4ai/cli.py
Normal file
105
crawl4ai/cli.py
Normal file
@@ -0,0 +1,105 @@
|
||||
import click
|
||||
import sys
|
||||
import asyncio
|
||||
from typing import List
|
||||
from .docs_manager import DocsManager
|
||||
from .async_logger import AsyncLogger
|
||||
|
||||
logger = AsyncLogger(verbose=True)
|
||||
docs_manager = DocsManager(logger)
|
||||
|
||||
def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
|
||||
"""Print formatted table with headers and rows"""
|
||||
widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
|
||||
border = '+' + '+'.join('-' * (w + 2 * padding) for w in widths) + '+'
|
||||
|
||||
def format_row(row):
|
||||
return '|' + '|'.join(f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
|
||||
for cell, w in zip(row, widths)) + '|'
|
||||
|
||||
click.echo(border)
|
||||
click.echo(format_row(headers))
|
||||
click.echo(border)
|
||||
for row in rows:
|
||||
click.echo(format_row(row))
|
||||
click.echo(border)
|
||||
|
||||
@click.group()
|
||||
def cli():
|
||||
"""Crawl4AI Command Line Interface"""
|
||||
pass
|
||||
|
||||
@cli.group()
|
||||
def docs():
|
||||
"""Documentation operations"""
|
||||
pass
|
||||
|
||||
@docs.command()
|
||||
@click.argument('sections', nargs=-1)
|
||||
@click.option('--mode', type=click.Choice(['extended', 'condensed']), default='extended')
|
||||
def combine(sections: tuple, mode: str):
|
||||
"""Combine documentation sections"""
|
||||
try:
|
||||
asyncio.run(docs_manager.ensure_docs_exist())
|
||||
click.echo(docs_manager.generate(sections, mode))
|
||||
except Exception as e:
|
||||
logger.error(str(e), tag="ERROR")
|
||||
sys.exit(1)
|
||||
|
||||
@docs.command()
|
||||
@click.argument('query')
|
||||
@click.option('--top-k', '-k', default=5)
|
||||
@click.option('--build-index', is_flag=True, help='Build index if missing')
|
||||
def search(query: str, top_k: int, build_index: bool):
|
||||
"""Search documentation"""
|
||||
try:
|
||||
result = docs_manager.search(query, top_k)
|
||||
if result == "No search index available. Call build_search_index() first.":
|
||||
if build_index or click.confirm('No search index found. Build it now?'):
|
||||
asyncio.run(docs_manager.llm_text.generate_index_files())
|
||||
result = docs_manager.search(query, top_k)
|
||||
click.echo(result)
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {str(e)}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
@docs.command()
|
||||
def update():
|
||||
"""Update docs from GitHub"""
|
||||
try:
|
||||
asyncio.run(docs_manager.fetch_docs())
|
||||
click.echo("Documentation updated successfully")
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {str(e)}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
@docs.command()
|
||||
@click.option('--force-facts', is_flag=True, help='Force regenerate fact files')
|
||||
@click.option('--clear-cache', is_flag=True, help='Clear BM25 cache')
|
||||
def index(force_facts: bool, clear_cache: bool):
|
||||
"""Build or rebuild search indexes"""
|
||||
try:
|
||||
asyncio.run(docs_manager.ensure_docs_exist())
|
||||
asyncio.run(docs_manager.llm_text.generate_index_files(
|
||||
force_generate_facts=force_facts,
|
||||
clear_bm25_cache=clear_cache
|
||||
))
|
||||
click.echo("Search indexes built successfully")
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {str(e)}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
# Add docs list command
|
||||
@docs.command()
|
||||
def list():
|
||||
"""List available documentation sections"""
|
||||
try:
|
||||
sections = docs_manager.list()
|
||||
print_table(["Sections"], [[section] for section in sections])
|
||||
|
||||
except Exception as e:
|
||||
click.echo(f"Error: {str(e)}", err=True)
|
||||
sys.exit(1)
|
||||
|
||||
if __name__ == '__main__':
|
||||
cli()
|
||||
@@ -13,6 +13,8 @@ PROVIDER_MODELS = {
|
||||
"groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
|
||||
"openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/o1-mini": os.getenv("OPENAI_API_KEY"),
|
||||
"openai/o1-preview": os.getenv("OPENAI_API_KEY"),
|
||||
"anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
|
||||
"anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
|
||||
@@ -56,4 +58,7 @@ MAX_METRICS_HISTORY = 1000
|
||||
|
||||
NEED_MIGRATION = True
|
||||
URL_LOG_SHORTEN_LENGTH = 30
|
||||
SHOW_DEPRECATION_WARNINGS = True
|
||||
SHOW_DEPRECATION_WARNINGS = True
|
||||
SCREENSHOT_HEIGHT_TRESHOLD = 10000
|
||||
PAGE_TIMEOUT=60000
|
||||
DOWNLOAD_PAGE_TIMEOUT=60000
|
||||
@@ -4,15 +4,13 @@ from typing import List, Tuple, Dict
|
||||
from rank_bm25 import BM25Okapi
|
||||
from time import perf_counter
|
||||
from collections import deque
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag
|
||||
from bs4 import BeautifulSoup, NavigableString, Tag, Comment
|
||||
from .utils import clean_tokens
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
import math
|
||||
from snowballstemmer import stemmer
|
||||
|
||||
# from nltk.stem import PorterStemmer
|
||||
# ps = PorterStemmer()
|
||||
class RelevantContentFilter(ABC):
|
||||
"""Abstract base class for content filtering strategies"""
|
||||
def __init__(self, user_query: str = None):
|
||||
self.user_query = user_query
|
||||
self.included_tags = {
|
||||
@@ -57,9 +55,14 @@ class RelevantContentFilter(ABC):
|
||||
query_parts = []
|
||||
|
||||
# Title
|
||||
if soup.title:
|
||||
query_parts.append(soup.title.string)
|
||||
elif soup.find('h1'):
|
||||
try:
|
||||
title = soup.title.string
|
||||
if title:
|
||||
query_parts.append(title)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
if soup.find('h1'):
|
||||
query_parts.append(soup.find('h1').get_text())
|
||||
|
||||
# Meta tags
|
||||
@@ -80,8 +83,7 @@ class RelevantContentFilter(ABC):
|
||||
|
||||
return ' '.join(filter(None, query_parts))
|
||||
|
||||
|
||||
def extract_text_chunks(self, body: Tag) -> List[Tuple[str, str]]:
|
||||
def extract_text_chunks(self, body: Tag, min_word_threshold: int = None) -> List[Tuple[str, str]]:
|
||||
"""
|
||||
Extracts text chunks from a BeautifulSoup body element while preserving order.
|
||||
Returns list of tuples (text, tag_name) for classification.
|
||||
@@ -155,10 +157,12 @@ class RelevantContentFilter(ABC):
|
||||
if text:
|
||||
chunks.append((chunk_index, text, 'content', body))
|
||||
|
||||
if min_word_threshold:
|
||||
chunks = [chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold]
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def extract_text_chunks1(self, soup: BeautifulSoup) -> List[Tuple[int, str, Tag]]:
|
||||
def _deprecated_extract_text_chunks(self, soup: BeautifulSoup) -> List[Tuple[int, str, Tag]]:
|
||||
"""Common method for extracting text chunks"""
|
||||
_text_cache = {}
|
||||
def fast_text(element: Tag) -> str:
|
||||
@@ -256,7 +260,38 @@ class RelevantContentFilter(ABC):
|
||||
return str(tag) # Fallback to original if anything fails
|
||||
|
||||
class BM25ContentFilter(RelevantContentFilter):
|
||||
"""
|
||||
Content filtering using BM25 algorithm with priority tag handling.
|
||||
|
||||
How it works:
|
||||
1. Extracts page metadata with fallbacks.
|
||||
2. Extracts text chunks from the body element.
|
||||
3. Tokenizes the corpus and query.
|
||||
4. Applies BM25 algorithm to calculate scores for each chunk.
|
||||
5. Filters out chunks below the threshold.
|
||||
6. Sorts chunks by score in descending order.
|
||||
7. Returns the top N chunks.
|
||||
|
||||
Attributes:
|
||||
user_query (str): User query for filtering (optional).
|
||||
bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
|
||||
language (str): Language for stemming (default: 'english').
|
||||
|
||||
Methods:
|
||||
filter_content(self, html: str, min_word_threshold: int = None)
|
||||
"""
|
||||
def __init__(self, user_query: str = None, bm25_threshold: float = 1.0, language: str = 'english'):
|
||||
"""
|
||||
Initializes the BM25ContentFilter class, if not provided, falls back to page metadata.
|
||||
|
||||
Note:
|
||||
If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
|
||||
|
||||
Args:
|
||||
user_query (str): User query for filtering (optional).
|
||||
bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
|
||||
language (str): Language for stemming (default: 'english').
|
||||
"""
|
||||
super().__init__(user_query=user_query)
|
||||
self.bm25_threshold = bm25_threshold
|
||||
self.priority_tags = {
|
||||
@@ -274,15 +309,39 @@ class BM25ContentFilter(RelevantContentFilter):
|
||||
}
|
||||
self.stemmer = stemmer(language)
|
||||
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
"""Implements content filtering using BM25 algorithm with priority tag handling"""
|
||||
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
|
||||
"""
|
||||
Implements content filtering using BM25 algorithm with priority tag handling.
|
||||
|
||||
Note:
|
||||
This method implements the filtering logic for the BM25ContentFilter class.
|
||||
It takes HTML content as input and returns a list of filtered text chunks.
|
||||
|
||||
Args:
|
||||
html (str): HTML content to be filtered.
|
||||
min_word_threshold (int): Minimum word threshold for filtering (optional).
|
||||
|
||||
Returns:
|
||||
List[str]: List of filtered text chunks.
|
||||
"""
|
||||
if not html or not isinstance(html, str):
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
|
||||
# Check if body is present
|
||||
if not soup.body:
|
||||
# Wrap in body tag if missing
|
||||
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
|
||||
body = soup.find('body')
|
||||
query = self.extract_page_query(soup.find('head'), body)
|
||||
candidates = self.extract_text_chunks(body)
|
||||
|
||||
query = self.extract_page_query(soup, body)
|
||||
|
||||
if not query:
|
||||
return []
|
||||
# return [self.clean_element(soup)]
|
||||
|
||||
candidates = self.extract_text_chunks(body, min_word_threshold)
|
||||
|
||||
if not candidates:
|
||||
return []
|
||||
@@ -299,6 +358,10 @@ class BM25ContentFilter(RelevantContentFilter):
|
||||
for _, chunk, _, _ in candidates]
|
||||
tokenized_query = [self.stemmer.stemWord(word) for word in query.lower().split()]
|
||||
|
||||
# tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
|
||||
# for _, chunk, _, _ in candidates]
|
||||
# tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
|
||||
|
||||
# Clean from stop words and noise
|
||||
tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
|
||||
tokenized_query = clean_tokens(tokenized_query)
|
||||
@@ -326,3 +389,239 @@ class BM25ContentFilter(RelevantContentFilter):
|
||||
selected_candidates.sort(key=lambda x: x[0])
|
||||
|
||||
return [self.clean_element(tag) for _, _, tag in selected_candidates]
|
||||
|
||||
class PruningContentFilter(RelevantContentFilter):
|
||||
"""
|
||||
Content filtering using pruning algorithm with dynamic threshold.
|
||||
|
||||
How it works:
|
||||
1. Extracts page metadata with fallbacks.
|
||||
2. Extracts text chunks from the body element.
|
||||
3. Applies pruning algorithm to calculate scores for each chunk.
|
||||
4. Filters out chunks below the threshold.
|
||||
5. Sorts chunks by score in descending order.
|
||||
6. Returns the top N chunks.
|
||||
|
||||
Attributes:
|
||||
user_query (str): User query for filtering (optional), if not provided, falls back to page metadata.
|
||||
min_word_threshold (int): Minimum word threshold for filtering (optional).
|
||||
threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
|
||||
threshold (float): Fixed threshold value (default: 0.48).
|
||||
|
||||
Methods:
|
||||
filter_content(self, html: str, min_word_threshold: int = None):
|
||||
"""
|
||||
def __init__(self, user_query: str = None, min_word_threshold: int = None,
|
||||
threshold_type: str = 'fixed', threshold: float = 0.48):
|
||||
"""
|
||||
Initializes the PruningContentFilter class, if not provided, falls back to page metadata.
|
||||
|
||||
Note:
|
||||
If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
|
||||
|
||||
Args:
|
||||
user_query (str): User query for filtering (optional).
|
||||
min_word_threshold (int): Minimum word threshold for filtering (optional).
|
||||
threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
|
||||
threshold (float): Fixed threshold value (default: 0.48).
|
||||
"""
|
||||
super().__init__(None)
|
||||
self.min_word_threshold = min_word_threshold
|
||||
self.threshold_type = threshold_type
|
||||
self.threshold = threshold
|
||||
|
||||
# Add tag importance for dynamic threshold
|
||||
self.tag_importance = {
|
||||
'article': 1.5,
|
||||
'main': 1.4,
|
||||
'section': 1.3,
|
||||
'p': 1.2,
|
||||
'h1': 1.4,
|
||||
'h2': 1.3,
|
||||
'h3': 1.2,
|
||||
'div': 0.7,
|
||||
'span': 0.6
|
||||
}
|
||||
|
||||
# Metric configuration
|
||||
self.metric_config = {
|
||||
'text_density': True,
|
||||
'link_density': True,
|
||||
'tag_weight': True,
|
||||
'class_id_weight': True,
|
||||
'text_length': True,
|
||||
}
|
||||
|
||||
self.metric_weights = {
|
||||
'text_density': 0.4,
|
||||
'link_density': 0.2,
|
||||
'tag_weight': 0.2,
|
||||
'class_id_weight': 0.1,
|
||||
'text_length': 0.1,
|
||||
}
|
||||
|
||||
self.tag_weights = {
|
||||
'div': 0.5,
|
||||
'p': 1.0,
|
||||
'article': 1.5,
|
||||
'section': 1.0,
|
||||
'span': 0.3,
|
||||
'li': 0.5,
|
||||
'ul': 0.5,
|
||||
'ol': 0.5,
|
||||
'h1': 1.2,
|
||||
'h2': 1.1,
|
||||
'h3': 1.0,
|
||||
'h4': 0.9,
|
||||
'h5': 0.8,
|
||||
'h6': 0.7,
|
||||
}
|
||||
|
||||
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
|
||||
"""
|
||||
Implements content filtering using pruning algorithm with dynamic threshold.
|
||||
|
||||
Note:
|
||||
This method implements the filtering logic for the PruningContentFilter class.
|
||||
It takes HTML content as input and returns a list of filtered text chunks.
|
||||
|
||||
Args:
|
||||
html (str): HTML content to be filtered.
|
||||
min_word_threshold (int): Minimum word threshold for filtering (optional).
|
||||
|
||||
Returns:
|
||||
List[str]: List of filtered text chunks.
|
||||
"""
|
||||
if not html or not isinstance(html, str):
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
if not soup.body:
|
||||
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
|
||||
|
||||
# Remove comments and unwanted tags
|
||||
self._remove_comments(soup)
|
||||
self._remove_unwanted_tags(soup)
|
||||
|
||||
# Prune tree starting from body
|
||||
body = soup.find('body')
|
||||
self._prune_tree(body)
|
||||
|
||||
# Extract remaining content as list of HTML strings
|
||||
content_blocks = []
|
||||
for element in body.children:
|
||||
if isinstance(element, str) or not hasattr(element, 'name'):
|
||||
continue
|
||||
if len(element.get_text(strip=True)) > 0:
|
||||
content_blocks.append(str(element))
|
||||
|
||||
return content_blocks
|
||||
|
||||
def _remove_comments(self, soup):
|
||||
"""Removes HTML comments"""
|
||||
for element in soup(text=lambda text: isinstance(text, Comment)):
|
||||
element.extract()
|
||||
|
||||
def _remove_unwanted_tags(self, soup):
|
||||
"""Removes unwanted tags"""
|
||||
for tag in self.excluded_tags:
|
||||
for element in soup.find_all(tag):
|
||||
element.decompose()
|
||||
|
||||
def _prune_tree(self, node):
|
||||
"""
|
||||
Prunes the tree starting from the given node.
|
||||
|
||||
Args:
|
||||
node (Tag): The node from which the pruning starts.
|
||||
"""
|
||||
if not node or not hasattr(node, 'name') or node.name is None:
|
||||
return
|
||||
|
||||
text_len = len(node.get_text(strip=True))
|
||||
tag_len = len(node.encode_contents().decode('utf-8'))
|
||||
link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s)
|
||||
|
||||
metrics = {
|
||||
'node': node,
|
||||
'tag_name': node.name,
|
||||
'text_len': text_len,
|
||||
'tag_len': tag_len,
|
||||
'link_text_len': link_text_len
|
||||
}
|
||||
|
||||
score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
|
||||
|
||||
if self.threshold_type == 'fixed':
|
||||
should_remove = score < self.threshold
|
||||
else: # dynamic
|
||||
tag_importance = self.tag_importance.get(node.name, 0.7)
|
||||
text_ratio = text_len / tag_len if tag_len > 0 else 0
|
||||
link_ratio = link_text_len / text_len if text_len > 0 else 1
|
||||
|
||||
threshold = self.threshold # base threshold
|
||||
if tag_importance > 1:
|
||||
threshold *= 0.8
|
||||
if text_ratio > 0.4:
|
||||
threshold *= 0.9
|
||||
if link_ratio > 0.6:
|
||||
threshold *= 1.2
|
||||
|
||||
should_remove = score < threshold
|
||||
|
||||
if should_remove:
|
||||
node.decompose()
|
||||
else:
|
||||
children = [child for child in node.children if hasattr(child, 'name')]
|
||||
for child in children:
|
||||
self._prune_tree(child)
|
||||
|
||||
def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
|
||||
"""Computes the composite score"""
|
||||
if self.min_word_threshold:
|
||||
# Get raw text from metrics node - avoid extra processing
|
||||
text = metrics['node'].get_text(strip=True)
|
||||
word_count = text.count(' ') + 1
|
||||
if word_count < self.min_word_threshold:
|
||||
return -1.0 # Guaranteed removal
|
||||
score = 0.0
|
||||
total_weight = 0.0
|
||||
|
||||
if self.metric_config['text_density']:
|
||||
density = text_len / tag_len if tag_len > 0 else 0
|
||||
score += self.metric_weights['text_density'] * density
|
||||
total_weight += self.metric_weights['text_density']
|
||||
|
||||
if self.metric_config['link_density']:
|
||||
density = 1 - (link_text_len / text_len if text_len > 0 else 0)
|
||||
score += self.metric_weights['link_density'] * density
|
||||
total_weight += self.metric_weights['link_density']
|
||||
|
||||
if self.metric_config['tag_weight']:
|
||||
tag_score = self.tag_weights.get(metrics['tag_name'], 0.5)
|
||||
score += self.metric_weights['tag_weight'] * tag_score
|
||||
total_weight += self.metric_weights['tag_weight']
|
||||
|
||||
if self.metric_config['class_id_weight']:
|
||||
class_score = self._compute_class_id_weight(metrics['node'])
|
||||
score += self.metric_weights['class_id_weight'] * max(0, class_score)
|
||||
total_weight += self.metric_weights['class_id_weight']
|
||||
|
||||
if self.metric_config['text_length']:
|
||||
score += self.metric_weights['text_length'] * math.log(text_len + 1)
|
||||
total_weight += self.metric_weights['text_length']
|
||||
|
||||
return score / total_weight if total_weight > 0 else 0
|
||||
|
||||
def _compute_class_id_weight(self, node):
|
||||
"""Computes the class ID weight"""
|
||||
class_id_score = 0
|
||||
if 'class' in node.attrs:
|
||||
classes = ' '.join(node['class'])
|
||||
if self.negative_patterns.match(classes):
|
||||
class_id_score -= 0.5
|
||||
if 'id' in node.attrs:
|
||||
element_id = node['id']
|
||||
if self.negative_patterns.match(element_id):
|
||||
class_id_score -= 0.5
|
||||
return class_id_score
|
||||
816
crawl4ai/content_scraping_strategy.py
Normal file
816
crawl4ai/content_scraping_strategy.py
Normal file
@@ -0,0 +1,816 @@
|
||||
import re # Point 1: Pre-Compile Regular Expressions
|
||||
import time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any, Optional
|
||||
from bs4 import BeautifulSoup
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
import asyncio, requests, re, os
|
||||
from .config import *
|
||||
from bs4 import element, NavigableString, Comment
|
||||
from bs4 import PageElement, Tag
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
# from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#, HeuristicContentFilter
|
||||
from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
|
||||
from .models import MarkdownGenerationResult
|
||||
from .utils import (
|
||||
extract_metadata,
|
||||
normalize_url,
|
||||
is_external_url,
|
||||
get_base_domain,
|
||||
)
|
||||
|
||||
|
||||
# Pre-compile regular expressions for Open Graph and Twitter metadata
|
||||
OG_REGEX = re.compile(r'^og:')
|
||||
TWITTER_REGEX = re.compile(r'^twitter:')
|
||||
DIMENSION_REGEX = re.compile(r"(\d+)(\D*)")
|
||||
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
# match = re.match(r"(\d+)(\D*)", dimension)
|
||||
match = DIMENSION_REGEX.match(dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
except InvalidSchema as e:
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
|
||||
class ContentScrapingStrategy(ABC):
|
||||
@abstractmethod
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
pass
|
||||
|
||||
class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
"""
|
||||
Class for web content scraping. Perhaps the most important class.
|
||||
|
||||
How it works:
|
||||
1. Extract content from HTML using BeautifulSoup.
|
||||
2. Clean the extracted content using a content cleaning strategy.
|
||||
3. Filter the cleaned content using a content filtering strategy.
|
||||
4. Generate markdown content from the filtered content.
|
||||
5. Return the markdown content.
|
||||
"""
|
||||
|
||||
def __init__(self, logger=None):
|
||||
self.logger = logger
|
||||
|
||||
def _log(self, level, message, tag="SCRAPE", **kwargs):
|
||||
"""Helper method to safely use logger."""
|
||||
if self.logger:
|
||||
log_method = getattr(self.logger, level)
|
||||
log_method(message=message, tag=tag, **kwargs)
|
||||
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
Main entry point for content scraping.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page to scrape.
|
||||
html (str): The HTML content of the page.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing the scraped content. This dictionary contains the following keys:
|
||||
|
||||
- 'markdown': The generated markdown content, type is str, however soon will become MarkdownGenerationResult via 'markdown.raw_markdown'.
|
||||
- 'fit_markdown': The generated markdown content with relevant content filtered, this will be removed soon and available in 'markdown.fit_markdown'.
|
||||
- 'fit_html': The HTML content with relevant content filtered, this will be removed soon and available in 'markdown.fit_html'.
|
||||
- 'markdown_v2': The generated markdown content with relevant content filtered, this is temporary and will be removed soon and replaced with 'markdown'
|
||||
"""
|
||||
return self._scrap(url, html, is_async=False, **kwargs)
|
||||
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
Main entry point for asynchronous content scraping.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page to scrape.
|
||||
html (str): The HTML content of the page.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing the scraped content. This dictionary contains the following keys:
|
||||
|
||||
- 'markdown': The generated markdown content, type is str, however soon will become MarkdownGenerationResult via 'markdown.raw_markdown'.
|
||||
- 'fit_markdown': The generated markdown content with relevant content filtered, this will be removed soon and available in 'markdown.fit_markdown'.
|
||||
- 'fit_html': The HTML content with relevant content filtered, this will be removed soon and available in 'markdown.fit_html'.
|
||||
- 'markdown_v2': The generated markdown content with relevant content filtered, this is temporary and will be removed soon and replaced with 'markdown'
|
||||
"""
|
||||
return await asyncio.to_thread(self._scrap, url, html, **kwargs)
|
||||
|
||||
def _generate_markdown_content(self, cleaned_html: str,html: str,url: str, success: bool, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
Generate markdown content from cleaned HTML.
|
||||
|
||||
Args:
|
||||
cleaned_html (str): The cleaned HTML content.
|
||||
html (str): The original HTML content.
|
||||
url (str): The URL of the page.
|
||||
success (bool): Whether the content was successfully cleaned.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary containing the generated markdown content.
|
||||
"""
|
||||
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
|
||||
|
||||
if markdown_generator:
|
||||
try:
|
||||
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
|
||||
markdown_generator.content_filter = BM25ContentFilter(
|
||||
user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
)
|
||||
|
||||
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
|
||||
cleaned_html=cleaned_html,
|
||||
base_url=url,
|
||||
html2text_options=kwargs.get('html2text', {})
|
||||
)
|
||||
|
||||
return {
|
||||
'markdown': markdown_result.raw_markdown,
|
||||
'fit_markdown': markdown_result.fit_markdown,
|
||||
'fit_html': markdown_result.fit_html,
|
||||
'markdown_v2': markdown_result
|
||||
}
|
||||
except Exception as e:
|
||||
self._log('error',
|
||||
message="Error using new markdown generation strategy: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
markdown_generator = None
|
||||
return {
|
||||
'markdown': f"Error using new markdown generation strategy: {str(e)}",
|
||||
'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
|
||||
'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
|
||||
'markdown_v2': None
|
||||
}
|
||||
|
||||
# Legacy method
|
||||
"""
|
||||
# h = CustomHTML2Text()
|
||||
# h.update_params(**kwargs.get('html2text', {}))
|
||||
# markdown = h.handle(cleaned_html)
|
||||
# markdown = markdown.replace(' ```', '```')
|
||||
|
||||
# fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
|
||||
# fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
|
||||
|
||||
# if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
|
||||
# content_filter = kwargs.get('content_filter', None)
|
||||
# if not content_filter:
|
||||
# content_filter = BM25ContentFilter(
|
||||
# user_query=kwargs.get('fit_markdown_user_query', None),
|
||||
# bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
# )
|
||||
# fit_html = content_filter.filter_content(html)
|
||||
# fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
|
||||
# fit_markdown = h.handle(fit_html)
|
||||
|
||||
# markdown_v2 = MarkdownGenerationResult(
|
||||
# raw_markdown=markdown,
|
||||
# markdown_with_citations=markdown,
|
||||
# references_markdown=markdown,
|
||||
# fit_markdown=fit_markdown
|
||||
# )
|
||||
|
||||
# return {
|
||||
# 'markdown': markdown,
|
||||
# 'fit_markdown': fit_markdown,
|
||||
# 'fit_html': fit_html,
|
||||
# 'markdown_v2' : markdown_v2
|
||||
# }
|
||||
"""
|
||||
|
||||
def flatten_nested_elements(self, node):
|
||||
"""
|
||||
Flatten nested elements in a HTML tree.
|
||||
|
||||
Args:
|
||||
node (Tag): The root node of the HTML tree.
|
||||
|
||||
Returns:
|
||||
Tag: The flattened HTML tree.
|
||||
"""
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], Tag) and node.contents[0].name == node.name:
|
||||
return self.flatten_nested_elements(node.contents[0])
|
||||
node.contents = [self.flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
def find_closest_parent_with_useful_text(self, tag, **kwargs):
|
||||
"""
|
||||
Find the closest parent with useful text.
|
||||
|
||||
Args:
|
||||
tag (Tag): The starting tag to search from.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
Tag: The closest parent with useful text, or None if not found.
|
||||
"""
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):
|
||||
"""
|
||||
Remove unwanted attributes from an HTML element.
|
||||
|
||||
Args:
|
||||
element (Tag): The HTML element to remove attributes from.
|
||||
important_attrs (list): List of important attributes to keep.
|
||||
keep_data_attributes (bool): Whether to keep data attributes.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
|
||||
def process_image(self, img, url, index, total_images, **kwargs):
|
||||
"""
|
||||
Process an image element.
|
||||
|
||||
How it works:
|
||||
1. Check if the image has valid display and inside undesired html elements.
|
||||
2. Score an image for it's usefulness.
|
||||
3. Extract image file metadata to extract size and extension.
|
||||
4. Generate a dictionary with the processed image information.
|
||||
5. Return the processed image information.
|
||||
|
||||
Args:
|
||||
img (Tag): The image element to process.
|
||||
url (str): The URL of the page containing the image.
|
||||
index (int): The index of the image in the list of images.
|
||||
total_images (int): The total number of images in the list.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
dict: A dictionary containing the processed image information.
|
||||
"""
|
||||
parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w')
|
||||
if ' ' in u else None}
|
||||
for u in [f"http{p}" for p in s.split("http") if p]]
|
||||
|
||||
# Constants for checks
|
||||
classes_to_check = frozenset(['button', 'icon', 'logo'])
|
||||
tags_to_check = frozenset(['button', 'input'])
|
||||
image_formats = frozenset(['jpg', 'jpeg', 'png', 'webp', 'avif', 'gif'])
|
||||
|
||||
# Pre-fetch commonly used attributes
|
||||
style = img.get('style', '')
|
||||
alt = img.get('alt', '')
|
||||
src = img.get('src', '')
|
||||
data_src = img.get('data-src', '')
|
||||
srcset = img.get('srcset', '')
|
||||
data_srcset = img.get('data-srcset', '')
|
||||
width = img.get('width')
|
||||
height = img.get('height')
|
||||
parent = img.parent
|
||||
parent_classes = parent.get('class', [])
|
||||
|
||||
# Quick validation checks
|
||||
if ('display:none' in style or
|
||||
parent.name in tags_to_check or
|
||||
any(c in cls for c in parent_classes for cls in classes_to_check) or
|
||||
any(c in src for c in classes_to_check) or
|
||||
any(c in alt for c in classes_to_check)):
|
||||
return None
|
||||
|
||||
# Quick score calculation
|
||||
score = 0
|
||||
if width and width.isdigit():
|
||||
width_val = int(width)
|
||||
score += 1 if width_val > 150 else 0
|
||||
if height and height.isdigit():
|
||||
height_val = int(height)
|
||||
score += 1 if height_val > 150 else 0
|
||||
if alt:
|
||||
score += 1
|
||||
score += index/total_images < 0.5
|
||||
|
||||
# image_format = ''
|
||||
# if "data:image/" in src:
|
||||
# image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
|
||||
# else:
|
||||
# image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
|
||||
|
||||
# if image_format in ('jpg', 'png', 'webp', 'avif'):
|
||||
# score += 1
|
||||
|
||||
|
||||
# Check for image format in all possible sources
|
||||
def has_image_format(url):
|
||||
return any(fmt in url.lower() for fmt in image_formats)
|
||||
|
||||
# Score for having proper image sources
|
||||
if any(has_image_format(url) for url in [src, data_src, srcset, data_srcset]):
|
||||
score += 1
|
||||
if srcset or data_srcset:
|
||||
score += 1
|
||||
if img.find_parent('picture'):
|
||||
score += 1
|
||||
|
||||
# Detect format from any available source
|
||||
detected_format = None
|
||||
for url in [src, data_src, srcset, data_srcset]:
|
||||
if url:
|
||||
format_matches = [fmt for fmt in image_formats if fmt in url.lower()]
|
||||
if format_matches:
|
||||
detected_format = format_matches[0]
|
||||
break
|
||||
|
||||
if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
|
||||
return None
|
||||
|
||||
# Use set for deduplication
|
||||
unique_urls = set()
|
||||
image_variants = []
|
||||
|
||||
# Generate a unique group ID for this set of variants
|
||||
group_id = index
|
||||
|
||||
# Base image info template
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
base_info = {
|
||||
'alt': alt,
|
||||
'desc': self.find_closest_parent_with_useful_text(img, **kwargs),
|
||||
'score': score,
|
||||
'type': 'image',
|
||||
'group_id': group_id, # Group ID for this set of variants
|
||||
'format': detected_format,
|
||||
}
|
||||
|
||||
# Inline function for adding variants
|
||||
def add_variant(src, width=None):
|
||||
if src and not src.startswith('data:') and src not in unique_urls:
|
||||
unique_urls.add(src)
|
||||
image_variants.append({**base_info, 'src': src, 'width': width})
|
||||
|
||||
# Process all sources
|
||||
add_variant(src)
|
||||
add_variant(data_src)
|
||||
|
||||
# Handle srcset and data-srcset in one pass
|
||||
for attr in ('srcset', 'data-srcset'):
|
||||
if value := img.get(attr):
|
||||
for source in parse_srcset(value):
|
||||
add_variant(source['url'], source['width'])
|
||||
|
||||
# Quick picture element check
|
||||
if picture := img.find_parent('picture'):
|
||||
for source in picture.find_all('source'):
|
||||
if srcset := source.get('srcset'):
|
||||
for src in parse_srcset(srcset):
|
||||
add_variant(src['url'], src['width'])
|
||||
|
||||
# Framework-specific attributes in one pass
|
||||
for attr, value in img.attrs.items():
|
||||
if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
|
||||
add_variant(value)
|
||||
|
||||
return image_variants if image_variants else None
|
||||
|
||||
def process_element(self, url, element: PageElement, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
Process an HTML element.
|
||||
|
||||
How it works:
|
||||
1. Check if the element is an image, video, or audio.
|
||||
2. Extract the element's attributes and content.
|
||||
3. Process the element based on its type.
|
||||
4. Return the processed element information.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page containing the element.
|
||||
element (Tag): The HTML element to process.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
dict: A dictionary containing the processed element information.
|
||||
"""
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
self._process_element(
|
||||
url,
|
||||
element,
|
||||
media,
|
||||
internal_links_dict,
|
||||
external_links_dict,
|
||||
**kwargs
|
||||
)
|
||||
return {
|
||||
'media': media,
|
||||
'internal_links_dict': internal_links_dict,
|
||||
'external_links_dict': external_links_dict
|
||||
}
|
||||
|
||||
def _process_element(self, url, element: PageElement, media: Dict[str, Any], internal_links_dict: Dict[str, Any], external_links_dict: Dict[str, Any], **kwargs) -> bool:
|
||||
"""
|
||||
Process an HTML element.
|
||||
"""
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
base_domain = kwargs.get("base_domain", get_base_domain(url))
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_domains = kwargs.get('exclude_domains', [])
|
||||
# exclude_social_media_domains = kwargs.get('exclude_social_media_domains', set(SOCIAL_MEDIA_DOMAINS))
|
||||
# exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
# exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip(),
|
||||
'base_domain': base_domain
|
||||
}
|
||||
|
||||
is_external = is_external_url(normalized_href, base_domain)
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
link_base_domain = get_base_domain(normalized_href)
|
||||
link_data['base_domain'] = link_base_domain
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
# elif kwargs.get('exclude_social_media_links', False):
|
||||
# if link_base_domain in exclude_social_media_domains:
|
||||
# element.decompose()
|
||||
# return False
|
||||
# if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
# element.decompose()
|
||||
# return False
|
||||
elif exclude_domains:
|
||||
if link_base_domain in exclude_domains:
|
||||
element.decompose()
|
||||
return False
|
||||
# if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
# element.decompose()
|
||||
# return False
|
||||
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# If image src is internal, then skip
|
||||
if not is_external_url(src, base_domain):
|
||||
return True
|
||||
|
||||
image_src_base_domain = get_base_domain(src)
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
element.decompose()
|
||||
return False
|
||||
# src_url_base = src.split('/')[2]
|
||||
# url_base = url.split('/')[2]
|
||||
# if url_base not in src_url_base:
|
||||
# element.decompose()
|
||||
# return False
|
||||
|
||||
# if kwargs.get('exclude_social_media_links', False):
|
||||
# if image_src_base_domain in exclude_social_media_domains:
|
||||
# element.decompose()
|
||||
# return False
|
||||
# src_url_base = src.split('/')[2]
|
||||
# url_base = url.split('/')[2]
|
||||
# if any(domain in src for domain in exclude_social_media_domains):
|
||||
# element.decompose()
|
||||
# return False
|
||||
|
||||
# Handle exclude domains
|
||||
if exclude_domains:
|
||||
if image_src_base_domain in exclude_domains:
|
||||
element.decompose()
|
||||
return False
|
||||
# if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
# element.decompose()
|
||||
# return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': self.find_closest_parent_with_useful_text(element, **kwargs)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
self.remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if self._process_element(url, child, media, internal_links_dict, external_links_dict, **kwargs):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
word_count_threshold = kwargs.get('word_count_threshold', MIN_WORD_THRESHOLD)
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
def _scrap(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
Extract content from HTML using BeautifulSoup.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page to scrape.
|
||||
html (str): The HTML content of the page to scrape.
|
||||
word_count_threshold (int): The minimum word count threshold for content extraction.
|
||||
css_selector (str): The CSS selector to use for content extraction.
|
||||
**kwargs: Additional keyword arguments.
|
||||
|
||||
Returns:
|
||||
dict: A dictionary containing the extracted content.
|
||||
"""
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
parser_type = kwargs.get('parser', 'lxml')
|
||||
soup = BeautifulSoup(html, parser_type)
|
||||
body = soup.body
|
||||
base_domain = get_base_domain(url)
|
||||
|
||||
try:
|
||||
meta = extract_metadata("", soup)
|
||||
except Exception as e:
|
||||
self._log('error',
|
||||
message="Error extracting metadata: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
meta = {}
|
||||
|
||||
# Handle tag-based removal first - faster than CSS selection
|
||||
excluded_tags = set(kwargs.get('excluded_tags', []) or [])
|
||||
if excluded_tags:
|
||||
for element in body.find_all(lambda tag: tag.name in excluded_tags):
|
||||
element.extract()
|
||||
|
||||
# Handle CSS selector-based removal
|
||||
excluded_selector = kwargs.get('excluded_selector', '')
|
||||
if excluded_selector:
|
||||
is_single_selector = ',' not in excluded_selector and ' ' not in excluded_selector
|
||||
if is_single_selector:
|
||||
while element := body.select_one(excluded_selector):
|
||||
element.extract()
|
||||
else:
|
||||
for element in body.select(excluded_selector):
|
||||
element.extract()
|
||||
|
||||
if css_selector:
|
||||
selected_elements = body.select(css_selector)
|
||||
if not selected_elements:
|
||||
return {
|
||||
'markdown': '',
|
||||
'cleaned_html': '',
|
||||
'success': True,
|
||||
'media': {'images': [], 'videos': [], 'audios': []},
|
||||
'links': {'internal': [], 'external': []},
|
||||
'metadata': {},
|
||||
'message': f"No elements found for CSS selector: {css_selector}"
|
||||
}
|
||||
# raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
|
||||
body = soup.new_tag('div')
|
||||
for el in selected_elements:
|
||||
body.append(el)
|
||||
|
||||
kwargs['exclude_social_media_domains'] = set(kwargs.get('exclude_social_media_domains', []) + SOCIAL_MEDIA_DOMAINS)
|
||||
kwargs['exclude_domains'] = set(kwargs.get('exclude_domains', []))
|
||||
if kwargs.get('exclude_social_media_links', False):
|
||||
kwargs['exclude_domains'] = kwargs['exclude_domains'].union(kwargs['exclude_social_media_domains'])
|
||||
|
||||
result_obj = self.process_element(
|
||||
url,
|
||||
body,
|
||||
word_count_threshold = word_count_threshold,
|
||||
base_domain=base_domain,
|
||||
**kwargs
|
||||
)
|
||||
|
||||
links = {'internal': [], 'external': []}
|
||||
media = result_obj['media']
|
||||
internal_links_dict = result_obj['internal_links_dict']
|
||||
external_links_dict = result_obj['external_links_dict']
|
||||
|
||||
# Update the links dictionary with unique links
|
||||
links['internal'] = list(internal_links_dict.values())
|
||||
links['external'] = list(external_links_dict.values())
|
||||
|
||||
# # Process images using ThreadPoolExecutor
|
||||
imgs = body.find_all('img')
|
||||
|
||||
media['images'] = [
|
||||
img for result in (self.process_image(img, url, i, len(imgs))
|
||||
for i, img in enumerate(imgs))
|
||||
if result is not None
|
||||
for img in result
|
||||
]
|
||||
|
||||
body = self.flatten_nested_elements(body)
|
||||
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
str_body = ""
|
||||
try:
|
||||
str_body = body.encode_contents().decode('utf-8')
|
||||
except Exception as e:
|
||||
# Reset body to the original HTML
|
||||
success = False
|
||||
body = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
# Create a new div with a special ID
|
||||
error_div = body.new_tag('div', id='crawl4ai_error_message')
|
||||
error_div.string = '''
|
||||
Crawl4AI Error: This page is not fully supported.
|
||||
|
||||
Possible reasons:
|
||||
1. The page may have restrictions that prevent crawling.
|
||||
2. The page might not be fully loaded.
|
||||
|
||||
Suggestions:
|
||||
- Try calling the crawl function with these parameters:
|
||||
magic=True,
|
||||
- Set headless=False to visualize what's happening on the page.
|
||||
|
||||
If the issue persists, please check the page's structure and any potential anti-crawling measures.
|
||||
'''
|
||||
|
||||
# Append the error div to the body
|
||||
body.body.append(error_div)
|
||||
str_body = body.encode_contents().decode('utf-8')
|
||||
|
||||
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
|
||||
self._log('error',
|
||||
message="After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.",
|
||||
tag="SCRAPE"
|
||||
)
|
||||
|
||||
cleaned_html = str_body.replace('\n\n', '\n').replace(' ', ' ')
|
||||
|
||||
# markdown_content = self._generate_markdown_content(
|
||||
# cleaned_html=cleaned_html,
|
||||
# html=html,
|
||||
# url=url,
|
||||
# success=success,
|
||||
# **kwargs
|
||||
# )
|
||||
|
||||
return {
|
||||
# **markdown_content,
|
||||
'cleaned_html': cleaned_html,
|
||||
'success': success,
|
||||
'media': media,
|
||||
'links': links,
|
||||
'metadata': meta
|
||||
}
|
||||
@@ -1,588 +0,0 @@
|
||||
import re # Point 1: Pre-Compile Regular Expressions
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Dict, Any
|
||||
from bs4 import BeautifulSoup
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
import asyncio, requests, re, os
|
||||
from .config import *
|
||||
from bs4 import element, NavigableString, Comment
|
||||
from urllib.parse import urljoin
|
||||
from requests.exceptions import InvalidSchema
|
||||
# from .content_cleaning_strategy import ContentCleaningStrategy
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
|
||||
|
||||
from .utils import (
|
||||
sanitize_input_encode,
|
||||
sanitize_html,
|
||||
extract_metadata,
|
||||
InvalidCSSSelectorError,
|
||||
# CustomHTML2Text,
|
||||
normalize_url,
|
||||
is_external_url
|
||||
|
||||
)
|
||||
|
||||
from .html2text import HTML2Text
|
||||
class CustomHTML2Text(HTML2Text):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
self.preserve_tags = set() # Set of tags to preserve
|
||||
self.current_preserved_tag = None
|
||||
self.preserved_content = []
|
||||
self.preserve_depth = 0
|
||||
|
||||
# Configuration options
|
||||
self.skip_internal_links = False
|
||||
self.single_line_break = False
|
||||
self.mark_code = False
|
||||
self.include_sup_sub = False
|
||||
self.body_width = 0
|
||||
self.ignore_mailto_links = True
|
||||
self.ignore_links = False
|
||||
self.escape_backslash = False
|
||||
self.escape_dot = False
|
||||
self.escape_plus = False
|
||||
self.escape_dash = False
|
||||
self.escape_snob = False
|
||||
|
||||
def update_params(self, **kwargs):
|
||||
"""Update parameters and set preserved tags."""
|
||||
for key, value in kwargs.items():
|
||||
if key == 'preserve_tags':
|
||||
self.preserve_tags = set(value)
|
||||
else:
|
||||
setattr(self, key, value)
|
||||
|
||||
def handle_tag(self, tag, attrs, start):
|
||||
# Handle preserved tags
|
||||
if tag in self.preserve_tags:
|
||||
if start:
|
||||
if self.preserve_depth == 0:
|
||||
self.current_preserved_tag = tag
|
||||
self.preserved_content = []
|
||||
# Format opening tag with attributes
|
||||
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
|
||||
self.preserved_content.append(f'<{tag}{attr_str}>')
|
||||
self.preserve_depth += 1
|
||||
return
|
||||
else:
|
||||
self.preserve_depth -= 1
|
||||
if self.preserve_depth == 0:
|
||||
self.preserved_content.append(f'</{tag}>')
|
||||
# Output the preserved HTML block with proper spacing
|
||||
preserved_html = ''.join(self.preserved_content)
|
||||
self.o('\n' + preserved_html + '\n')
|
||||
self.current_preserved_tag = None
|
||||
return
|
||||
|
||||
# If we're inside a preserved tag, collect all content
|
||||
if self.preserve_depth > 0:
|
||||
if start:
|
||||
# Format nested tags with attributes
|
||||
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
|
||||
self.preserved_content.append(f'<{tag}{attr_str}>')
|
||||
else:
|
||||
self.preserved_content.append(f'</{tag}>')
|
||||
return
|
||||
|
||||
# Handle pre tags
|
||||
if tag == 'pre':
|
||||
if start:
|
||||
self.o('```\n')
|
||||
self.inside_pre = True
|
||||
else:
|
||||
self.o('\n```')
|
||||
self.inside_pre = False
|
||||
# elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# pass
|
||||
else:
|
||||
super().handle_tag(tag, attrs, start)
|
||||
|
||||
def handle_data(self, data, entity_char=False):
|
||||
"""Override handle_data to capture content within preserved tags."""
|
||||
if self.preserve_depth > 0:
|
||||
self.preserved_content.append(data)
|
||||
return
|
||||
super().handle_data(data, entity_char)
|
||||
|
||||
# Pre-compile regular expressions for Open Graph and Twitter metadata
|
||||
OG_REGEX = re.compile(r'^og:')
|
||||
TWITTER_REGEX = re.compile(r'^twitter:')
|
||||
DIMENSION_REGEX = re.compile(r"(\d+)(\D*)")
|
||||
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
# match = re.match(r"(\d+)(\D*)", dimension)
|
||||
match = DIMENSION_REGEX.match(dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
except InvalidSchema as e:
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
|
||||
class ContentScrapingStrategy(ABC):
|
||||
@abstractmethod
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
pass
|
||||
|
||||
class WebScrapingStrategy(ContentScrapingStrategy):
|
||||
def __init__(self, logger=None):
|
||||
self.logger = logger
|
||||
|
||||
def _log(self, level, message, tag="SCRAPE", **kwargs):
|
||||
"""Helper method to safely use logger."""
|
||||
if self.logger:
|
||||
log_method = getattr(self.logger, level)
|
||||
log_method(message=message, tag=tag, **kwargs)
|
||||
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
|
||||
|
||||
async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
|
||||
|
||||
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
# soup = BeautifulSoup(html, 'html.parser')
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
body = soup.body
|
||||
|
||||
try:
|
||||
meta = extract_metadata("", soup)
|
||||
except Exception as e:
|
||||
self._log('error',
|
||||
message="Error extracting metadata: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
|
||||
image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
|
||||
|
||||
for tag in kwargs.get('excluded_tags', []) or []:
|
||||
for el in body.select(tag):
|
||||
el.decompose()
|
||||
|
||||
if css_selector:
|
||||
selected_elements = body.select(css_selector)
|
||||
if not selected_elements:
|
||||
return {
|
||||
'markdown': '',
|
||||
'cleaned_html': '',
|
||||
'success': True,
|
||||
'media': {'images': [], 'videos': [], 'audios': []},
|
||||
'links': {'internal': [], 'external': []},
|
||||
'metadata': {},
|
||||
'message': f"No elements found for CSS selector: {css_selector}"
|
||||
}
|
||||
# raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
|
||||
body = soup.new_tag('div')
|
||||
for el in selected_elements:
|
||||
body.append(el)
|
||||
|
||||
links = {'internal': [], 'external': []}
|
||||
media = {'images': [], 'videos': [], 'audios': []}
|
||||
internal_links_dict = {}
|
||||
external_links_dict = {}
|
||||
|
||||
# Extract meaningful text for media files from closest parent
|
||||
def find_closest_parent_with_useful_text(tag):
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
if len(text_content.split()) >= image_description_min_word_threshold:
|
||||
return text_content
|
||||
return None
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
|
||||
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_src = img.get('src','')
|
||||
if "data:image/" in image_src:
|
||||
image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
|
||||
else:
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
|
||||
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= IMAGE_SCORE_THRESHOLD:
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', ''),
|
||||
'data-src': img.get('data-src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
|
||||
attrs_to_remove = []
|
||||
for attr in element.attrs:
|
||||
if attr not in important_attrs:
|
||||
if keep_data_attributes:
|
||||
if not attr.startswith('data-'):
|
||||
attrs_to_remove.append(attr)
|
||||
else:
|
||||
attrs_to_remove.append(attr)
|
||||
|
||||
for attr in attrs_to_remove:
|
||||
del element[attr]
|
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
if isinstance(element, NavigableString):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
keep_element = False
|
||||
|
||||
exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
|
||||
exclude_social_media_domains = list(set(exclude_social_media_domains))
|
||||
|
||||
try:
|
||||
if element.name == 'a' and element.get('href'):
|
||||
href = element.get('href', '').strip()
|
||||
if not href: # Skip empty hrefs
|
||||
return False
|
||||
|
||||
url_base = url.split('/')[2]
|
||||
|
||||
# Normalize the URL
|
||||
try:
|
||||
normalized_href = normalize_url(href, url)
|
||||
except ValueError as e:
|
||||
# logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
|
||||
return False
|
||||
|
||||
link_data = {
|
||||
'href': normalized_href,
|
||||
'text': element.get_text().strip(),
|
||||
'title': element.get('title', '').strip()
|
||||
}
|
||||
|
||||
# Check for duplicates and add to appropriate dictionary
|
||||
is_external = is_external_url(normalized_href, url_base)
|
||||
if is_external:
|
||||
if normalized_href not in external_links_dict:
|
||||
external_links_dict[normalized_href] = link_data
|
||||
else:
|
||||
if normalized_href not in internal_links_dict:
|
||||
internal_links_dict[normalized_href] = link_data
|
||||
|
||||
keep_element = True
|
||||
|
||||
# Handle external link exclusions
|
||||
if is_external:
|
||||
if kwargs.get('exclude_external_links', False):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_social_media_links', False):
|
||||
if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
elif kwargs.get('exclude_domains', []):
|
||||
if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"Error processing links: {str(e)}")
|
||||
|
||||
try:
|
||||
if element.name == 'img':
|
||||
potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
|
||||
src = element.get('src', '')
|
||||
while not src and potential_sources:
|
||||
src = element.get(potential_sources.pop(0), '')
|
||||
if not src:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# If it is srcset pick up the first image
|
||||
if 'srcset' in element.attrs:
|
||||
src = element.attrs['srcset'].split(',')[0].split(' ')[0]
|
||||
|
||||
# Check flag if we should remove external images
|
||||
if kwargs.get('exclude_external_images', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if url_base not in src_url_base:
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
|
||||
src_url_base = src.split('/')[2]
|
||||
url_base = url.split('/')[2]
|
||||
if any(domain in src for domain in exclude_social_media_domains):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
# Handle exclude domains
|
||||
if kwargs.get('exclude_domains', []):
|
||||
if any(domain in src for domain in kwargs.get('exclude_domains', [])):
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
return True # Always keep image elements
|
||||
except Exception as e:
|
||||
raise "Error processing images"
|
||||
|
||||
|
||||
# Check if flag to remove all forms is set
|
||||
if kwargs.get('remove_forms', False) and element.name == 'form':
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
if element.name in ['video', 'audio']:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': element.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
source_tags = element.find_all('source')
|
||||
for source_tag in source_tags:
|
||||
media[f"{element.name}s"].append({
|
||||
'src': source_tag.get('src'),
|
||||
'alt': element.get('alt'),
|
||||
'type': element.name,
|
||||
'description': find_closest_parent_with_useful_text(element)
|
||||
})
|
||||
return True # Always keep video and audio elements
|
||||
|
||||
if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
|
||||
if kwargs.get('only_text', False):
|
||||
element.replace_with(element.get_text())
|
||||
|
||||
try:
|
||||
remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
|
||||
except Exception as e:
|
||||
# print('Error removing unwanted attributes:', str(e))
|
||||
self._log('error',
|
||||
message="Error removing unwanted attributes: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
# Process children
|
||||
for child in list(element.children):
|
||||
if isinstance(child, NavigableString) and not isinstance(child, Comment):
|
||||
if len(child.strip()) > 0:
|
||||
keep_element = True
|
||||
else:
|
||||
if process_element(child):
|
||||
keep_element = True
|
||||
|
||||
|
||||
# Check word count
|
||||
if not keep_element:
|
||||
word_count = len(element.get_text(strip=True).split())
|
||||
keep_element = word_count >= word_count_threshold
|
||||
|
||||
if not keep_element:
|
||||
element.decompose()
|
||||
|
||||
return keep_element
|
||||
except Exception as e:
|
||||
# print('Error processing element:', str(e))
|
||||
self._log('error',
|
||||
message="Error processing element: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
return False
|
||||
|
||||
process_element(body)
|
||||
|
||||
# Update the links dictionary with unique links
|
||||
links['internal'] = list(internal_links_dict.values())
|
||||
links['external'] = list(external_links_dict.values())
|
||||
|
||||
|
||||
# # Process images using ThreadPoolExecutor
|
||||
imgs = body.find_all('img')
|
||||
|
||||
with ThreadPoolExecutor() as executor:
|
||||
image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
|
||||
media['images'] = [result for result in image_results if result is not None]
|
||||
|
||||
def flatten_nested_elements(node):
|
||||
if isinstance(node, NavigableString):
|
||||
return node
|
||||
if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
|
||||
return flatten_nested_elements(node.contents[0])
|
||||
node.contents = [flatten_nested_elements(child) for child in node.contents]
|
||||
return node
|
||||
|
||||
body = flatten_nested_elements(body)
|
||||
base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
str_body = ""
|
||||
try:
|
||||
str_body = body.encode_contents().decode('utf-8')
|
||||
except Exception as e:
|
||||
# Reset body to the original HTML
|
||||
success = False
|
||||
body = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
# Create a new div with a special ID
|
||||
error_div = body.new_tag('div', id='crawl4ai_error_message')
|
||||
error_div.string = '''
|
||||
Crawl4AI Error: This page is not fully supported.
|
||||
|
||||
Possible reasons:
|
||||
1. The page may have restrictions that prevent crawling.
|
||||
2. The page might not be fully loaded.
|
||||
|
||||
Suggestions:
|
||||
- Try calling the crawl function with these parameters:
|
||||
magic=True,
|
||||
- Set headless=False to visualize what's happening on the page.
|
||||
|
||||
If the issue persists, please check the page's structure and any potential anti-crawling measures.
|
||||
'''
|
||||
|
||||
# Append the error div to the body
|
||||
body.body.append(error_div)
|
||||
str_body = body.encode_contents().decode('utf-8')
|
||||
|
||||
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
|
||||
self._log('error',
|
||||
message="After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.",
|
||||
tag="SCRAPE"
|
||||
)
|
||||
|
||||
cleaned_html = str_body.replace('\n\n', '\n').replace(' ', ' ')
|
||||
|
||||
try:
|
||||
h = CustomHTML2Text()
|
||||
h.update_params(**kwargs.get('html2text', {}))
|
||||
markdown = h.handle(cleaned_html)
|
||||
except Exception as e:
|
||||
if not h:
|
||||
h = CustomHTML2Text()
|
||||
self._log('error',
|
||||
message="Error converting HTML to markdown: {error}",
|
||||
tag="SCRAPE",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
markdown = h.handle(sanitize_html(cleaned_html))
|
||||
markdown = markdown.replace(' ```', '```')
|
||||
|
||||
|
||||
|
||||
fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
|
||||
fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
|
||||
if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
|
||||
content_filter = kwargs.get('content_filter', None)
|
||||
if not content_filter:
|
||||
content_filter = BM25ContentFilter(
|
||||
user_query= kwargs.get('fit_markdown_user_query', None),
|
||||
bm25_threshold= kwargs.get('fit_markdown_bm25_threshold', 1.0)
|
||||
)
|
||||
fit_html = content_filter.filter_content(html)
|
||||
fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
|
||||
fit_markdown = h.handle(fit_html)
|
||||
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
return {
|
||||
'markdown': markdown,
|
||||
'fit_markdown': fit_markdown,
|
||||
'fit_html': fit_html,
|
||||
'cleaned_html': cleaned_html,
|
||||
'success': success,
|
||||
'media': media,
|
||||
'links': links,
|
||||
'metadata': meta
|
||||
}
|
||||
@@ -283,7 +283,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
|
||||
return html
|
||||
except InvalidArgumentException:
|
||||
except InvalidArgumentException as e:
|
||||
if not hasattr(e, 'msg'):
|
||||
e.msg = sanitize_input_encode(str(e))
|
||||
raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
|
||||
|
||||
67
crawl4ai/docs_manager.py
Normal file
67
crawl4ai/docs_manager.py
Normal file
@@ -0,0 +1,67 @@
|
||||
import requests
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
from crawl4ai.llmtxt import AsyncLLMTextManager
|
||||
|
||||
class DocsManager:
|
||||
def __init__(self, logger=None):
|
||||
self.docs_dir = Path.home() / ".crawl4ai" / "docs"
|
||||
self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
|
||||
self.docs_dir.mkdir(parents=True, exist_ok=True)
|
||||
self.logger = logger or AsyncLogger(verbose=True)
|
||||
self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
|
||||
|
||||
async def ensure_docs_exist(self):
|
||||
"""Fetch docs if not present"""
|
||||
if not any(self.docs_dir.iterdir()):
|
||||
await self.fetch_docs()
|
||||
|
||||
async def fetch_docs(self) -> bool:
|
||||
"""Copy from local docs or download from GitHub"""
|
||||
try:
|
||||
# Try local first
|
||||
if self.local_docs.exists() and (any(self.local_docs.glob("*.md")) or any(self.local_docs.glob("*.tokens"))):
|
||||
# Empty the local docs directory
|
||||
for file_path in self.docs_dir.glob("*.md"):
|
||||
file_path.unlink()
|
||||
# for file_path in self.docs_dir.glob("*.tokens"):
|
||||
# file_path.unlink()
|
||||
for file_path in self.local_docs.glob("*.md"):
|
||||
shutil.copy2(file_path, self.docs_dir / file_path.name)
|
||||
# for file_path in self.local_docs.glob("*.tokens"):
|
||||
# shutil.copy2(file_path, self.docs_dir / file_path.name)
|
||||
return True
|
||||
|
||||
# Fallback to GitHub
|
||||
response = requests.get(
|
||||
"https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
|
||||
headers={'Accept': 'application/vnd.github.v3+json'}
|
||||
)
|
||||
response.raise_for_status()
|
||||
|
||||
for item in response.json():
|
||||
if item['type'] == 'file' and item['name'].endswith('.md'):
|
||||
content = requests.get(item['download_url']).text
|
||||
with open(self.docs_dir / item['name'], 'w', encoding='utf-8') as f:
|
||||
f.write(content)
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Failed to fetch docs: {str(e)}")
|
||||
raise
|
||||
|
||||
def list(self) -> list[str]:
|
||||
"""List available topics"""
|
||||
names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
|
||||
# Remove [0-9]+_ prefix
|
||||
names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
|
||||
# Exclude those end with .xs.md and .q.md
|
||||
names = [name for name in names if not name.endswith(".xs") and not name.endswith(".q")]
|
||||
return names
|
||||
|
||||
def generate(self, sections, mode="extended"):
|
||||
return self.llm_text.generate(sections, mode)
|
||||
|
||||
def search(self, query: str, top_k: int = 5):
|
||||
return self.llm_text.search(query, top_k)
|
||||
1440
crawl4ai/extraction_strategy.bak.py
Normal file
1440
crawl4ai/extraction_strategy.bak.py
Normal file
File diff suppressed because it is too large
Load Diff
@@ -6,18 +6,31 @@ import json, time
|
||||
from .prompts import *
|
||||
from .config import *
|
||||
from .utils import *
|
||||
from .models import *
|
||||
from functools import partial
|
||||
from .model_loader import *
|
||||
import math
|
||||
import numpy as np
|
||||
from lxml import etree
|
||||
import re
|
||||
from bs4 import BeautifulSoup
|
||||
from lxml import html, etree
|
||||
from dataclasses import dataclass
|
||||
|
||||
class ExtractionStrategy(ABC):
|
||||
"""
|
||||
Abstract base class for all extraction strategies.
|
||||
"""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
def __init__(self, input_format: str = "markdown", **kwargs):
|
||||
"""
|
||||
Initialize the extraction strategy.
|
||||
|
||||
Args:
|
||||
input_format: Content format to use for extraction.
|
||||
Options: "markdown" (default), "html", "fit_markdown"
|
||||
**kwargs: Additional keyword arguments
|
||||
"""
|
||||
self.input_format = input_format
|
||||
self.DEL = "<|DEL|>"
|
||||
self.name = self.__class__.__name__
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
@@ -49,24 +62,68 @@ class ExtractionStrategy(ABC):
|
||||
return extracted_content
|
||||
|
||||
class NoExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
A strategy that does not extract any meaningful content from the HTML. It simply returns the entire HTML as a single block.
|
||||
"""
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract meaningful blocks or chunks from the given HTML.
|
||||
"""
|
||||
return [{"index": 0, "content": html}]
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
|
||||
|
||||
|
||||
#######################################################
|
||||
# Strategies using LLM-based extraction for text data #
|
||||
#######################################################
|
||||
class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
A strategy that uses an LLM to extract meaningful content from the HTML.
|
||||
|
||||
Attributes:
|
||||
provider: The provider to use for extraction. It follows the format <provider_name>/<model_name>, e.g., "ollama/llama3.3".
|
||||
api_token: The API token for the provider.
|
||||
instruction: The instruction to use for the LLM model.
|
||||
schema: Pydantic model schema for structured data.
|
||||
extraction_type: "block" or "schema".
|
||||
chunk_token_threshold: Maximum tokens per chunk.
|
||||
overlap_rate: Overlap between chunks.
|
||||
word_token_rate: Word to token conversion rate.
|
||||
apply_chunking: Whether to apply chunking.
|
||||
base_url: The base URL for the API request.
|
||||
api_base: The base URL for the API request.
|
||||
extra_args: Additional arguments for the API request, such as temprature, max_tokens, etc.
|
||||
verbose: Whether to print verbose output.
|
||||
usages: List of individual token usages.
|
||||
total_usage: Accumulated token usage.
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None,
|
||||
instruction:str = None, schema:Dict = None, extraction_type = "block", **kwargs):
|
||||
"""
|
||||
Initialize the strategy with clustering parameters.
|
||||
|
||||
Args:
|
||||
provider: The provider to use for extraction. It follows the format <provider_name>/<model_name>, e.g., "ollama/llama3.3".
|
||||
api_token: The API token for the provider.
|
||||
instruction: The instruction to use for the LLM model.
|
||||
schema: Pydantic model schema for structured data.
|
||||
extraction_type: "block" or "schema".
|
||||
chunk_token_threshold: Maximum tokens per chunk.
|
||||
overlap_rate: Overlap between chunks.
|
||||
word_token_rate: Word to token conversion rate.
|
||||
apply_chunking: Whether to apply chunking.
|
||||
base_url: The base URL for the API request.
|
||||
api_base: The base URL for the API request.
|
||||
extra_args: Additional arguments for the API request, such as temprature, max_tokens, etc.
|
||||
verbose: Whether to print verbose output.
|
||||
usages: List of individual token usages.
|
||||
total_usage: Accumulated token usage.
|
||||
|
||||
:param provider: The provider to use for extraction.
|
||||
:param api_token: The API token for the provider.
|
||||
:param instruction: The instruction to use for the LLM model.
|
||||
"""
|
||||
super().__init__()
|
||||
super().__init__(**kwargs)
|
||||
self.provider = provider
|
||||
self.api_token = api_token or PROVIDER_MODELS.get(provider, "no-token") or os.getenv("OPENAI_API_KEY")
|
||||
self.instruction = instruction
|
||||
@@ -86,14 +143,34 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
self.chunk_token_threshold = 1e9
|
||||
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
self.usages = [] # Store individual usages
|
||||
self.total_usage = TokenUsage() # Accumulated usage
|
||||
|
||||
if not self.api_token:
|
||||
raise ValueError("API token must be provided for LLMExtractionStrategy. Update the config.py or set OPENAI_API_KEY environment variable.")
|
||||
|
||||
|
||||
def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
|
||||
# print("[LOG] Extracting blocks from URL:", url)
|
||||
print(f"[LOG] Call LLM for {url} - block index: {ix}")
|
||||
"""
|
||||
Extract meaningful blocks or chunks from the given HTML using an LLM.
|
||||
|
||||
How it works:
|
||||
1. Construct a prompt with variables.
|
||||
2. Make a request to the LLM using the prompt.
|
||||
3. Parse the response and extract blocks or chunks.
|
||||
|
||||
Args:
|
||||
url: The URL of the webpage.
|
||||
ix: Index of the block.
|
||||
html: The HTML content of the webpage.
|
||||
|
||||
Returns:
|
||||
A list of extracted blocks or chunks.
|
||||
"""
|
||||
if self.verbose:
|
||||
# print("[LOG] Extracting blocks from URL:", url)
|
||||
print(f"[LOG] Call LLM for {url} - block index: {ix}")
|
||||
|
||||
variable_values = {
|
||||
"URL": url,
|
||||
"HTML": escape_json_string(sanitize_html(html)),
|
||||
@@ -120,6 +197,21 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
base_url=self.api_base or self.base_url,
|
||||
extra_args = self.extra_args
|
||||
) # , json_response=self.extract_type == "schema")
|
||||
# Track usage
|
||||
usage = TokenUsage(
|
||||
completion_tokens=response.usage.completion_tokens,
|
||||
prompt_tokens=response.usage.prompt_tokens,
|
||||
total_tokens=response.usage.total_tokens,
|
||||
completion_tokens_details=response.usage.completion_tokens_details.__dict__ if response.usage.completion_tokens_details else {},
|
||||
prompt_tokens_details=response.usage.prompt_tokens_details.__dict__ if response.usage.prompt_tokens_details else {}
|
||||
)
|
||||
self.usages.append(usage)
|
||||
|
||||
# Update totals
|
||||
self.total_usage.completion_tokens += usage.completion_tokens
|
||||
self.total_usage.prompt_tokens += usage.prompt_tokens
|
||||
self.total_usage.total_tokens += usage.total_tokens
|
||||
|
||||
try:
|
||||
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
||||
blocks = json.loads(blocks)
|
||||
@@ -141,6 +233,9 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
return blocks
|
||||
|
||||
def _merge(self, documents, chunk_token_threshold, overlap):
|
||||
"""
|
||||
Merge documents into sections based on chunk_token_threshold and overlap.
|
||||
"""
|
||||
chunks = []
|
||||
sections = []
|
||||
total_tokens = 0
|
||||
@@ -190,6 +285,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionStrategy.
|
||||
|
||||
Args:
|
||||
url: The URL of the webpage.
|
||||
sections: List of sections (strings) to process.
|
||||
|
||||
Returns:
|
||||
A list of extracted blocks or chunks.
|
||||
"""
|
||||
|
||||
merged_sections = self._merge(
|
||||
@@ -229,8 +331,47 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
|
||||
|
||||
return extracted_content
|
||||
|
||||
|
||||
def show_usage(self) -> None:
|
||||
"""Print a detailed token usage report showing total and per-request usage."""
|
||||
print("\n=== Token Usage Summary ===")
|
||||
print(f"{'Type':<15} {'Count':>12}")
|
||||
print("-" * 30)
|
||||
print(f"{'Completion':<15} {self.total_usage.completion_tokens:>12,}")
|
||||
print(f"{'Prompt':<15} {self.total_usage.prompt_tokens:>12,}")
|
||||
print(f"{'Total':<15} {self.total_usage.total_tokens:>12,}")
|
||||
|
||||
print("\n=== Usage History ===")
|
||||
print(f"{'Request #':<10} {'Completion':>12} {'Prompt':>12} {'Total':>12}")
|
||||
print("-" * 48)
|
||||
for i, usage in enumerate(self.usages, 1):
|
||||
print(f"{i:<10} {usage.completion_tokens:>12,} {usage.prompt_tokens:>12,} {usage.total_tokens:>12,}")
|
||||
|
||||
#######################################################
|
||||
# Strategies using clustering for text data extraction #
|
||||
#######################################################
|
||||
|
||||
class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Extract meaningful blocks or chunks from the given HTML using cosine similarity.
|
||||
|
||||
How it works:
|
||||
1. Pre-filter documents using embeddings and semantic_filter.
|
||||
2. Perform clustering using cosine similarity.
|
||||
3. Organize texts by their cluster labels, retaining order.
|
||||
4. Filter clusters by word count.
|
||||
5. Extract meaningful blocks or chunks from the filtered clusters.
|
||||
|
||||
Attributes:
|
||||
semantic_filter (str): A keyword filter for document filtering.
|
||||
word_count_threshold (int): Minimum number of words per cluster.
|
||||
max_dist (float): The maximum cophenetic distance on the dendrogram to form clusters.
|
||||
linkage_method (str): The linkage method for hierarchical clustering.
|
||||
top_k (int): Number of top categories to extract.
|
||||
model_name (str): The name of the sentence-transformers model.
|
||||
sim_threshold (float): The similarity threshold for clustering.
|
||||
"""
|
||||
def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'sentence-transformers/all-MiniLM-L6-v2', sim_threshold = 0.3, **kwargs):
|
||||
"""
|
||||
Initialize the strategy with clustering parameters.
|
||||
@@ -242,7 +383,7 @@ class CosineStrategy(ExtractionStrategy):
|
||||
linkage_method (str): The linkage method for hierarchical clustering.
|
||||
top_k (int): Number of top categories to extract.
|
||||
"""
|
||||
super().__init__()
|
||||
super().__init__(**kwargs)
|
||||
|
||||
import numpy as np
|
||||
|
||||
@@ -308,11 +449,13 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filter embedding.
|
||||
|
||||
:param documents: List of text chunks (documents).
|
||||
:param semantic_filter: A string containing the keywords for filtering.
|
||||
:param threshold: Cosine similarity threshold for filtering documents.
|
||||
:param at_least_k: Minimum number of documents to return.
|
||||
:return: List of filtered documents, ensuring at least `at_least_k` documents.
|
||||
Args:
|
||||
documents (List[str]): A list of document texts.
|
||||
semantic_filter (str): A keyword filter for document filtering.
|
||||
at_least_k (int): The minimum number of documents to return.
|
||||
|
||||
Returns:
|
||||
List[str]: A list of filtered and sorted document texts.
|
||||
"""
|
||||
|
||||
if not semantic_filter:
|
||||
@@ -350,8 +493,11 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Get BERT embeddings for a list of sentences.
|
||||
|
||||
:param sentences: List of text chunks (sentences).
|
||||
:return: NumPy array of embeddings.
|
||||
Args:
|
||||
sentences (List[str]): A list of text chunks (sentences).
|
||||
|
||||
Returns:
|
||||
NumPy array of embeddings.
|
||||
"""
|
||||
# if self.buffer_embeddings.any() and not bypass_buffer:
|
||||
# return self.buffer_embeddings
|
||||
@@ -395,8 +541,11 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Perform hierarchical clustering on sentences and return cluster labels.
|
||||
|
||||
:param sentences: List of text chunks (sentences).
|
||||
:return: NumPy array of cluster labels.
|
||||
Args:
|
||||
sentences (List[str]): A list of text chunks (sentences).
|
||||
|
||||
Returns:
|
||||
NumPy array of cluster labels.
|
||||
"""
|
||||
# Get embeddings
|
||||
from scipy.cluster.hierarchy import linkage, fcluster
|
||||
@@ -412,12 +561,15 @@ class CosineStrategy(ExtractionStrategy):
|
||||
labels = fcluster(linked, self.max_dist, criterion='distance')
|
||||
return labels
|
||||
|
||||
def filter_clusters_by_word_count(self, clusters: Dict[int, List[str]]):
|
||||
def filter_clusters_by_word_count(self, clusters: Dict[int, List[str]]) -> Dict[int, List[str]]:
|
||||
"""
|
||||
Filter clusters to remove those with a word count below the threshold.
|
||||
|
||||
:param clusters: Dictionary of clusters.
|
||||
:return: Filtered dictionary of clusters.
|
||||
Args:
|
||||
clusters (Dict[int, List[str]]): Dictionary of clusters.
|
||||
|
||||
Returns:
|
||||
Dict[int, List[str]]: Filtered dictionary of clusters.
|
||||
"""
|
||||
filtered_clusters = {}
|
||||
for cluster_id, texts in clusters.items():
|
||||
@@ -436,9 +588,12 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Extract clusters from HTML content using hierarchical clustering.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param html: The HTML content of the webpage.
|
||||
:return: A list of dictionaries representing the clusters.
|
||||
Args:
|
||||
url (str): The URL of the webpage.
|
||||
html (str): The HTML content of the webpage.
|
||||
|
||||
Returns:
|
||||
List[Dict[str, Any]]: A list of processed JSON blocks.
|
||||
"""
|
||||
# Assume `html` is a list of text chunks for this strategy
|
||||
t = time.time()
|
||||
@@ -500,170 +655,135 @@ class CosineStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Process sections using hierarchical clustering.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param sections: List of sections (strings) to process.
|
||||
:param provider: The provider to be used for extraction (not used here).
|
||||
:param api_token: Optional API token for the provider (not used here).
|
||||
:return: A list of processed JSON blocks.
|
||||
Args:
|
||||
url (str): The URL of the webpage.
|
||||
sections (List[str]): List of sections (strings) to process.
|
||||
|
||||
Returns:
|
||||
"""
|
||||
# This strategy processes all sections together
|
||||
|
||||
return self.extract(url, self.DEL.join(sections), **kwargs)
|
||||
|
||||
class TopicExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, num_keywords: int = 3, **kwargs):
|
||||
"""
|
||||
Initialize the topic extraction strategy with parameters for topic segmentation.
|
||||
#######################################################
|
||||
# New extraction strategies for JSON-based extraction #
|
||||
#######################################################
|
||||
|
||||
:param num_keywords: Number of keywords to represent each topic segment.
|
||||
"""
|
||||
import nltk
|
||||
super().__init__()
|
||||
self.num_keywords = num_keywords
|
||||
self.tokenizer = nltk.TextTilingTokenizer()
|
||||
class JsonElementExtractionStrategy(ExtractionStrategy):
|
||||
"""
|
||||
Abstract base class for extracting structured JSON from HTML content.
|
||||
|
||||
def extract_keywords(self, text: str) -> List[str]:
|
||||
"""
|
||||
Extract keywords from a given text segment using simple frequency analysis.
|
||||
How it works:
|
||||
1. Parses HTML content using the `_parse_html` method.
|
||||
2. Uses a schema to define base selectors, fields, and transformations.
|
||||
3. Extracts data hierarchically, supporting nested fields and lists.
|
||||
4. Handles computed fields with expressions or functions.
|
||||
|
||||
:param text: The text segment from which to extract keywords.
|
||||
:return: A list of keyword strings.
|
||||
"""
|
||||
import nltk
|
||||
# Tokenize the text and compute word frequency
|
||||
words = nltk.word_tokenize(text)
|
||||
freq_dist = nltk.FreqDist(words)
|
||||
# Get the most common words as keywords
|
||||
keywords = [word for (word, _) in freq_dist.most_common(self.num_keywords)]
|
||||
return keywords
|
||||
Attributes:
|
||||
DEL (str): Delimiter used to combine HTML sections. Defaults to '\n'.
|
||||
schema (Dict[str, Any]): The schema defining the extraction rules.
|
||||
verbose (bool): Enables verbose logging for debugging purposes.
|
||||
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract topics from HTML content using TextTiling for segmentation and keyword extraction.
|
||||
Methods:
|
||||
extract(url, html_content, *q, **kwargs): Extracts structured data from HTML content.
|
||||
_extract_item(element, fields): Extracts fields from a single element.
|
||||
_extract_single_field(element, field): Extracts a single field based on its type.
|
||||
_apply_transform(value, transform): Applies a transformation to a value.
|
||||
_compute_field(item, field): Computes a field value using an expression or function.
|
||||
run(url, sections, *q, **kwargs): Combines HTML sections and runs the extraction strategy.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param html: The HTML content of the webpage.
|
||||
:param provider: The provider to be used for extraction (not used here).
|
||||
:param api_token: Optional API token for the provider (not used here).
|
||||
:return: A list of dictionaries representing the topics.
|
||||
"""
|
||||
# Use TextTiling to segment the text into topics
|
||||
segmented_topics = html.split(self.DEL) # Split by lines or paragraphs as needed
|
||||
Abstract Methods:
|
||||
_parse_html(html_content): Parses raw HTML into a structured format (e.g., BeautifulSoup or lxml).
|
||||
_get_base_elements(parsed_html, selector): Retrieves base elements using a selector.
|
||||
_get_elements(element, selector): Retrieves child elements using a selector.
|
||||
_get_element_text(element): Extracts text content from an element.
|
||||
_get_element_html(element): Extracts raw HTML from an element.
|
||||
_get_element_attribute(element, attribute): Extracts an attribute's value from an element.
|
||||
"""
|
||||
|
||||
# Prepare the output as a list of dictionaries
|
||||
topic_list = []
|
||||
for i, segment in enumerate(segmented_topics):
|
||||
# Extract keywords for each segment
|
||||
keywords = self.extract_keywords(segment)
|
||||
topic_list.append({
|
||||
"index": i,
|
||||
"content": segment,
|
||||
"keywords": keywords
|
||||
})
|
||||
|
||||
return topic_list
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process sections using topic segmentation and keyword extraction.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param sections: List of sections (strings) to process.
|
||||
:param provider: The provider to be used for extraction (not used here).
|
||||
:param api_token: Optional API token for the provider (not used here).
|
||||
:return: A list of processed JSON blocks.
|
||||
"""
|
||||
# Concatenate sections into a single text for coherent topic segmentation
|
||||
|
||||
|
||||
return self.extract(url, self.DEL.join(sections), **kwargs)
|
||||
|
||||
class ContentSummarizationStrategy(ExtractionStrategy):
|
||||
def __init__(self, model_name: str = "sshleifer/distilbart-cnn-12-6", **kwargs):
|
||||
"""
|
||||
Initialize the content summarization strategy with a specific model.
|
||||
DEL = '\n'
|
||||
|
||||
:param model_name: The model to use for summarization.
|
||||
"""
|
||||
from transformers import pipeline
|
||||
self.summarizer = pipeline("summarization", model=model_name)
|
||||
|
||||
def extract(self, url: str, text: str, provider: str = None, api_token: Optional[str] = None) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Summarize a single section of text.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param text: A section of text to summarize.
|
||||
:param provider: The provider to be used for extraction (not used here).
|
||||
:param api_token: Optional API token for the provider (not used here).
|
||||
:return: A dictionary with the summary.
|
||||
"""
|
||||
try:
|
||||
summary = self.summarizer(text, max_length=130, min_length=30, do_sample=False)
|
||||
return {"summary": summary[0]['summary_text']}
|
||||
except Exception as e:
|
||||
print(f"Error summarizing text: {e}")
|
||||
return {"summary": text} # Fallback to original text if summarization fails
|
||||
|
||||
def run(self, url: str, sections: List[str], provider: str = None, api_token: Optional[str] = None) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Process each section in parallel to produce summaries.
|
||||
|
||||
:param url: The URL of the webpage.
|
||||
:param sections: List of sections (strings) to summarize.
|
||||
:param provider: The provider to be used for extraction (not used here).
|
||||
:param api_token: Optional API token for the provider (not used here).
|
||||
:return: A list of dictionaries with summaries for each section.
|
||||
"""
|
||||
# Use a ThreadPoolExecutor to summarize in parallel
|
||||
summaries = []
|
||||
with ThreadPoolExecutor() as executor:
|
||||
# Create a future for each section's summarization
|
||||
future_to_section = {executor.submit(self.extract, url, section, provider, api_token): i for i, section in enumerate(sections)}
|
||||
for future in as_completed(future_to_section):
|
||||
section_index = future_to_section[future]
|
||||
try:
|
||||
summary_result = future.result()
|
||||
summaries.append((section_index, summary_result))
|
||||
except Exception as e:
|
||||
print(f"Error processing section {section_index}: {e}")
|
||||
summaries.append((section_index, {"summary": sections[section_index]})) # Fallback to original text
|
||||
|
||||
# Sort summaries by the original section index to maintain order
|
||||
summaries.sort(key=lambda x: x[0])
|
||||
return [summary for _, summary in summaries]
|
||||
|
||||
class JsonCssExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
"""
|
||||
Initialize the JSON element extraction strategy with a schema.
|
||||
|
||||
Args:
|
||||
schema (Dict[str, Any]): The schema defining the extraction rules.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
self.schema = schema
|
||||
self.verbose = kwargs.get('verbose', False)
|
||||
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
soup = BeautifulSoup(html, 'html.parser')
|
||||
base_elements = soup.select(self.schema['baseSelector'])
|
||||
def extract(self, url: str, html_content: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Extract structured data from HTML content.
|
||||
|
||||
How it works:
|
||||
1. Parses the HTML content using the `_parse_html` method.
|
||||
2. Identifies base elements using the schema's base selector.
|
||||
3. Extracts fields from each base element using `_extract_item`.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page being processed.
|
||||
html_content (str): The raw HTML content to parse and extract.
|
||||
*q: Additional positional arguments.
|
||||
**kwargs: Additional keyword arguments for custom extraction.
|
||||
|
||||
Returns:
|
||||
List[Dict[str, Any]]: A list of extracted items, each represented as a dictionary.
|
||||
"""
|
||||
|
||||
parsed_html = self._parse_html(html_content)
|
||||
base_elements = self._get_base_elements(parsed_html, self.schema['baseSelector'])
|
||||
|
||||
results = []
|
||||
for element in base_elements:
|
||||
item = self._extract_item(element, self.schema['fields'])
|
||||
# Extract base element attributes
|
||||
item = {}
|
||||
if 'baseFields' in self.schema:
|
||||
for field in self.schema['baseFields']:
|
||||
value = self._extract_single_field(element, field)
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
|
||||
# Extract child fields
|
||||
field_data = self._extract_item(element, self.schema['fields'])
|
||||
item.update(field_data)
|
||||
|
||||
if item:
|
||||
results.append(item)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
@abstractmethod
|
||||
def _parse_html(self, html_content: str):
|
||||
"""Parse HTML content into appropriate format"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _get_base_elements(self, parsed_html, selector: str):
|
||||
"""Get all base elements using the selector"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _get_elements(self, element, selector: str):
|
||||
"""Get child elements using the selector"""
|
||||
pass
|
||||
|
||||
def _extract_field(self, element, field):
|
||||
try:
|
||||
if field['type'] == 'nested':
|
||||
nested_element = element.select_one(field['selector'])
|
||||
nested_elements = self._get_elements(element, field['selector'])
|
||||
nested_element = nested_elements[0] if nested_elements else None
|
||||
return self._extract_item(nested_element, field['fields']) if nested_element else {}
|
||||
|
||||
if field['type'] == 'list':
|
||||
elements = element.select(field['selector'])
|
||||
elements = self._get_elements(element, field['selector'])
|
||||
return [self._extract_list_item(el, field['fields']) for el in elements]
|
||||
|
||||
if field['type'] == 'nested_list':
|
||||
elements = element.select(field['selector'])
|
||||
elements = self._get_elements(element, field['selector'])
|
||||
return [self._extract_item(el, field['fields']) for el in elements]
|
||||
|
||||
return self._extract_single_field(element, field)
|
||||
@@ -672,146 +792,25 @@ class JsonCssExtractionStrategy(ExtractionStrategy):
|
||||
print(f"Error extracting field {field['name']}: {str(e)}")
|
||||
return field.get('default')
|
||||
|
||||
def _extract_list_item(self, element, fields):
|
||||
item = {}
|
||||
for field in fields:
|
||||
value = self._extract_single_field(element, field)
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
return item
|
||||
|
||||
def _extract_single_field(self, element, field):
|
||||
if 'selector' in field:
|
||||
selected = element.select_one(field['selector'])
|
||||
if not selected:
|
||||
return field.get('default')
|
||||
else:
|
||||
selected = element
|
||||
"""
|
||||
Extract a single field based on its type.
|
||||
|
||||
value = None
|
||||
if field['type'] == 'text':
|
||||
value = selected.get_text(strip=True)
|
||||
elif field['type'] == 'attribute':
|
||||
value = selected.get(field['attribute'])
|
||||
elif field['type'] == 'html':
|
||||
value = str(selected)
|
||||
elif field['type'] == 'regex':
|
||||
text = selected.get_text(strip=True)
|
||||
match = re.search(field['pattern'], text)
|
||||
value = match.group(1) if match else None
|
||||
How it works:
|
||||
1. Selects the target element using the field's selector.
|
||||
2. Extracts the field value based on its type (e.g., text, attribute, regex).
|
||||
3. Applies transformations if defined in the schema.
|
||||
|
||||
if 'transform' in field:
|
||||
value = self._apply_transform(value, field['transform'])
|
||||
Args:
|
||||
element: The base element to extract the field from.
|
||||
field (Dict[str, Any]): The field definition in the schema.
|
||||
|
||||
return value if value is not None else field.get('default')
|
||||
|
||||
def _extract_item(self, element, fields):
|
||||
item = {}
|
||||
for field in fields:
|
||||
if field['type'] == 'computed':
|
||||
value = self._compute_field(item, field)
|
||||
else:
|
||||
value = self._extract_field(element, field)
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
return item
|
||||
|
||||
def _apply_transform(self, value, transform):
|
||||
if transform == 'lowercase':
|
||||
return value.lower()
|
||||
elif transform == 'uppercase':
|
||||
return value.upper()
|
||||
elif transform == 'strip':
|
||||
return value.strip()
|
||||
return value
|
||||
|
||||
def _compute_field(self, item, field):
|
||||
try:
|
||||
if 'expression' in field:
|
||||
return eval(field['expression'], {}, item)
|
||||
elif 'function' in field:
|
||||
return field['function'](item)
|
||||
except Exception as e:
|
||||
if self.verbose:
|
||||
print(f"Error computing field {field['name']}: {str(e)}")
|
||||
return field.get('default')
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
combined_html = self.DEL.join(sections)
|
||||
return self.extract(url, combined_html, **kwargs)
|
||||
|
||||
class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
super().__init__(**kwargs)
|
||||
self.schema = schema
|
||||
self.use_cssselect = self._check_cssselect()
|
||||
|
||||
def _check_cssselect(self):
|
||||
try:
|
||||
import cssselect
|
||||
return True
|
||||
except ImportError:
|
||||
print("Warning: cssselect is not installed. Falling back to XPath for all selectors.")
|
||||
return False
|
||||
|
||||
def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
self.soup = BeautifulSoup(html, 'lxml')
|
||||
self.tree = etree.HTML(str(self.soup))
|
||||
|
||||
selector_type = 'xpath' if not self.use_cssselect else self.schema.get('selectorType', 'css')
|
||||
base_selector = self.schema.get('baseXPath' if selector_type == 'xpath' else 'baseSelector')
|
||||
base_elements = self._select_elements(base_selector, selector_type)
|
||||
|
||||
results = []
|
||||
for element in base_elements:
|
||||
item = self._extract_item(element, self.schema['fields'])
|
||||
if item:
|
||||
results.append(item)
|
||||
|
||||
return results
|
||||
|
||||
def _select_elements(self, selector, selector_type, element=None):
|
||||
if selector_type == 'xpath' or not self.use_cssselect:
|
||||
return self.tree.xpath(selector) if element is None else element.xpath(selector)
|
||||
else: # CSS
|
||||
return self.tree.cssselect(selector) if element is None else element.cssselect(selector)
|
||||
|
||||
def _extract_field(self, element, field):
|
||||
try:
|
||||
selector_type = 'xpath' if not self.use_cssselect else field.get('selectorType', 'css')
|
||||
selector = field.get('xpathSelector' if selector_type == 'xpath' else 'selector')
|
||||
|
||||
if field['type'] == 'nested':
|
||||
nested_element = self._select_elements(selector, selector_type, element)
|
||||
return self._extract_item(nested_element[0], field['fields']) if nested_element else {}
|
||||
|
||||
if field['type'] == 'list':
|
||||
elements = self._select_elements(selector, selector_type, element)
|
||||
return [self._extract_list_item(el, field['fields']) for el in elements]
|
||||
|
||||
if field['type'] == 'nested_list':
|
||||
elements = self._select_elements(selector, selector_type, element)
|
||||
return [self._extract_item(el, field['fields']) for el in elements]
|
||||
|
||||
return self._extract_single_field(element, field)
|
||||
except Exception as e:
|
||||
if self.verbose:
|
||||
print(f"Error extracting field {field['name']}: {str(e)}")
|
||||
return field.get('default')
|
||||
|
||||
def _extract_list_item(self, element, fields):
|
||||
item = {}
|
||||
for field in fields:
|
||||
value = self._extract_single_field(element, field)
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
return item
|
||||
|
||||
def _extract_single_field(self, element, field):
|
||||
selector_type = field.get('selectorType', 'css')
|
||||
Returns:
|
||||
Any: The extracted field value.
|
||||
"""
|
||||
|
||||
if 'selector' in field:
|
||||
selected = self._select_elements(field['selector'], selector_type, element)
|
||||
selected = self._get_elements(element, field['selector'])
|
||||
if not selected:
|
||||
return field.get('default')
|
||||
selected = selected[0]
|
||||
@@ -820,13 +819,13 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
|
||||
value = None
|
||||
if field['type'] == 'text':
|
||||
value = selected.text_content().strip() if hasattr(selected, 'text_content') else selected.text.strip()
|
||||
value = self._get_element_text(selected)
|
||||
elif field['type'] == 'attribute':
|
||||
value = selected.get(field['attribute'])
|
||||
value = self._get_element_attribute(selected, field['attribute'])
|
||||
elif field['type'] == 'html':
|
||||
value = etree.tostring(selected, encoding='unicode')
|
||||
value = self._get_element_html(selected)
|
||||
elif field['type'] == 'regex':
|
||||
text = selected.text_content().strip() if hasattr(selected, 'text_content') else selected.text.strip()
|
||||
text = self._get_element_text(selected)
|
||||
match = re.search(field['pattern'], text)
|
||||
value = match.group(1) if match else None
|
||||
|
||||
@@ -835,7 +834,31 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
|
||||
return value if value is not None else field.get('default')
|
||||
|
||||
def _extract_list_item(self, element, fields):
|
||||
item = {}
|
||||
for field in fields:
|
||||
value = self._extract_single_field(element, field)
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
return item
|
||||
|
||||
def _extract_item(self, element, fields):
|
||||
"""
|
||||
Extracts fields from a given element.
|
||||
|
||||
How it works:
|
||||
1. Iterates through the fields defined in the schema.
|
||||
2. Handles computed, single, and nested field types.
|
||||
3. Updates the item dictionary with extracted field values.
|
||||
|
||||
Args:
|
||||
element: The base element to extract fields from.
|
||||
fields (List[Dict[str, Any]]): The list of fields to extract.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: A dictionary representing the extracted item.
|
||||
"""
|
||||
|
||||
item = {}
|
||||
for field in fields:
|
||||
if field['type'] == 'computed':
|
||||
@@ -845,8 +868,24 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
if value is not None:
|
||||
item[field['name']] = value
|
||||
return item
|
||||
|
||||
|
||||
def _apply_transform(self, value, transform):
|
||||
"""
|
||||
Apply a transformation to a value.
|
||||
|
||||
How it works:
|
||||
1. Checks the transformation type (e.g., `lowercase`, `strip`).
|
||||
2. Applies the transformation to the value.
|
||||
3. Returns the transformed value.
|
||||
|
||||
Args:
|
||||
value (str): The value to transform.
|
||||
transform (str): The type of transformation to apply.
|
||||
|
||||
Returns:
|
||||
str: The transformed value.
|
||||
"""
|
||||
|
||||
if transform == 'lowercase':
|
||||
return value.lower()
|
||||
elif transform == 'uppercase':
|
||||
@@ -867,5 +906,147 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):
|
||||
return field.get('default')
|
||||
|
||||
def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Run the extraction strategy on a combined HTML content.
|
||||
|
||||
How it works:
|
||||
1. Combines multiple HTML sections using the `DEL` delimiter.
|
||||
2. Calls the `extract` method with the combined HTML.
|
||||
|
||||
Args:
|
||||
url (str): The URL of the page being processed.
|
||||
sections (List[str]): A list of HTML sections.
|
||||
*q: Additional positional arguments.
|
||||
**kwargs: Additional keyword arguments for custom extraction.
|
||||
|
||||
Returns:
|
||||
List[Dict[str, Any]]: A list of extracted items.
|
||||
"""
|
||||
|
||||
combined_html = self.DEL.join(sections)
|
||||
return self.extract(url, combined_html, **kwargs)
|
||||
return self.extract(url, combined_html, **kwargs)
|
||||
|
||||
@abstractmethod
|
||||
def _get_element_text(self, element) -> str:
|
||||
"""Get text content from element"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _get_element_html(self, element) -> str:
|
||||
"""Get HTML content from element"""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def _get_element_attribute(self, element, attribute: str):
|
||||
"""Get attribute value from element"""
|
||||
pass
|
||||
|
||||
class JsonCssExtractionStrategy(JsonElementExtractionStrategy):
|
||||
"""
|
||||
Concrete implementation of `JsonElementExtractionStrategy` using CSS selectors.
|
||||
|
||||
How it works:
|
||||
1. Parses HTML content with BeautifulSoup.
|
||||
2. Selects elements using CSS selectors defined in the schema.
|
||||
3. Extracts field data and applies transformations as defined.
|
||||
|
||||
Attributes:
|
||||
schema (Dict[str, Any]): The schema defining the extraction rules.
|
||||
verbose (bool): Enables verbose logging for debugging purposes.
|
||||
|
||||
Methods:
|
||||
_parse_html(html_content): Parses HTML content into a BeautifulSoup object.
|
||||
_get_base_elements(parsed_html, selector): Selects base elements using a CSS selector.
|
||||
_get_elements(element, selector): Selects child elements using a CSS selector.
|
||||
_get_element_text(element): Extracts text content from a BeautifulSoup element.
|
||||
_get_element_html(element): Extracts the raw HTML content of a BeautifulSoup element.
|
||||
_get_element_attribute(element, attribute): Retrieves an attribute value from a BeautifulSoup element.
|
||||
"""
|
||||
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
kwargs['input_format'] = 'html' # Force HTML input
|
||||
super().__init__(schema, **kwargs)
|
||||
|
||||
def _parse_html(self, html_content: str):
|
||||
return BeautifulSoup(html_content, 'html.parser')
|
||||
|
||||
def _get_base_elements(self, parsed_html, selector: str):
|
||||
return parsed_html.select(selector)
|
||||
|
||||
def _get_elements(self, element, selector: str):
|
||||
selected = element.select_one(selector)
|
||||
return [selected] if selected else []
|
||||
|
||||
def _get_element_text(self, element) -> str:
|
||||
return element.get_text(strip=True)
|
||||
|
||||
def _get_element_html(self, element) -> str:
|
||||
return str(element)
|
||||
|
||||
def _get_element_attribute(self, element, attribute: str):
|
||||
return element.get(attribute)
|
||||
|
||||
class JsonXPathExtractionStrategy(JsonElementExtractionStrategy):
|
||||
"""
|
||||
Concrete implementation of `JsonElementExtractionStrategy` using XPath selectors.
|
||||
|
||||
How it works:
|
||||
1. Parses HTML content into an lxml tree.
|
||||
2. Selects elements using XPath expressions.
|
||||
3. Converts CSS selectors to XPath when needed.
|
||||
|
||||
Attributes:
|
||||
schema (Dict[str, Any]): The schema defining the extraction rules.
|
||||
verbose (bool): Enables verbose logging for debugging purposes.
|
||||
|
||||
Methods:
|
||||
_parse_html(html_content): Parses HTML content into an lxml tree.
|
||||
_get_base_elements(parsed_html, selector): Selects base elements using an XPath selector.
|
||||
_css_to_xpath(css_selector): Converts a CSS selector to an XPath expression.
|
||||
_get_elements(element, selector): Selects child elements using an XPath selector.
|
||||
_get_element_text(element): Extracts text content from an lxml element.
|
||||
_get_element_html(element): Extracts the raw HTML content of an lxml element.
|
||||
_get_element_attribute(element, attribute): Retrieves an attribute value from an lxml element.
|
||||
"""
|
||||
|
||||
def __init__(self, schema: Dict[str, Any], **kwargs):
|
||||
kwargs['input_format'] = 'html' # Force HTML input
|
||||
super().__init__(schema, **kwargs)
|
||||
|
||||
def _parse_html(self, html_content: str):
|
||||
return html.fromstring(html_content)
|
||||
|
||||
def _get_base_elements(self, parsed_html, selector: str):
|
||||
return parsed_html.xpath(selector)
|
||||
|
||||
def _css_to_xpath(self, css_selector: str) -> str:
|
||||
"""Convert CSS selector to XPath if needed"""
|
||||
if '/' in css_selector: # Already an XPath
|
||||
return css_selector
|
||||
return self._basic_css_to_xpath(css_selector)
|
||||
|
||||
def _basic_css_to_xpath(self, css_selector: str) -> str:
|
||||
"""Basic CSS to XPath conversion for common cases"""
|
||||
if ' > ' in css_selector:
|
||||
parts = css_selector.split(' > ')
|
||||
return '//' + '/'.join(parts)
|
||||
if ' ' in css_selector:
|
||||
parts = css_selector.split(' ')
|
||||
return '//' + '//'.join(parts)
|
||||
return '//' + css_selector
|
||||
|
||||
def _get_elements(self, element, selector: str):
|
||||
xpath = self._css_to_xpath(selector)
|
||||
if not xpath.startswith('.'):
|
||||
xpath = '.' + xpath
|
||||
return element.xpath(xpath)
|
||||
|
||||
def _get_element_text(self, element) -> str:
|
||||
return ''.join(element.xpath('.//text()')).strip()
|
||||
|
||||
def _get_element_html(self, element) -> str:
|
||||
return etree.tostring(element, encoding='unicode')
|
||||
|
||||
def _get_element_attribute(self, element, attribute: str):
|
||||
return element.get(attribute)
|
||||
|
||||
|
||||
@@ -1006,10 +1006,136 @@ class HTML2Text(html.parser.HTMLParser):
|
||||
newlines += 1
|
||||
return result
|
||||
|
||||
|
||||
def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
|
||||
if bodywidth is None:
|
||||
bodywidth = config.BODY_WIDTH
|
||||
h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth)
|
||||
|
||||
return h.handle(html)
|
||||
|
||||
class CustomHTML2Text(HTML2Text):
|
||||
def __init__(self, *args, handle_code_in_pre=False, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
self.preserve_tags = set() # Set of tags to preserve
|
||||
self.current_preserved_tag = None
|
||||
self.preserved_content = []
|
||||
self.preserve_depth = 0
|
||||
self.handle_code_in_pre = handle_code_in_pre
|
||||
|
||||
# Configuration options
|
||||
self.skip_internal_links = False
|
||||
self.single_line_break = False
|
||||
self.mark_code = False
|
||||
self.include_sup_sub = False
|
||||
self.body_width = 0
|
||||
self.ignore_mailto_links = True
|
||||
self.ignore_links = False
|
||||
self.escape_backslash = False
|
||||
self.escape_dot = False
|
||||
self.escape_plus = False
|
||||
self.escape_dash = False
|
||||
self.escape_snob = False
|
||||
|
||||
def update_params(self, **kwargs):
|
||||
"""Update parameters and set preserved tags."""
|
||||
for key, value in kwargs.items():
|
||||
if key == 'preserve_tags':
|
||||
self.preserve_tags = set(value)
|
||||
elif key == 'handle_code_in_pre':
|
||||
self.handle_code_in_pre = value
|
||||
else:
|
||||
setattr(self, key, value)
|
||||
|
||||
def handle_tag(self, tag, attrs, start):
|
||||
# Handle preserved tags
|
||||
if tag in self.preserve_tags:
|
||||
if start:
|
||||
if self.preserve_depth == 0:
|
||||
self.current_preserved_tag = tag
|
||||
self.preserved_content = []
|
||||
# Format opening tag with attributes
|
||||
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
|
||||
self.preserved_content.append(f'<{tag}{attr_str}>')
|
||||
self.preserve_depth += 1
|
||||
return
|
||||
else:
|
||||
self.preserve_depth -= 1
|
||||
if self.preserve_depth == 0:
|
||||
self.preserved_content.append(f'</{tag}>')
|
||||
# Output the preserved HTML block with proper spacing
|
||||
preserved_html = ''.join(self.preserved_content)
|
||||
self.o('\n' + preserved_html + '\n')
|
||||
self.current_preserved_tag = None
|
||||
return
|
||||
|
||||
# If we're inside a preserved tag, collect all content
|
||||
if self.preserve_depth > 0:
|
||||
if start:
|
||||
# Format nested tags with attributes
|
||||
attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
|
||||
self.preserved_content.append(f'<{tag}{attr_str}>')
|
||||
else:
|
||||
self.preserved_content.append(f'</{tag}>')
|
||||
return
|
||||
|
||||
# Handle pre tags
|
||||
if tag == 'pre':
|
||||
if start:
|
||||
self.o('```\n') # Markdown code block start
|
||||
self.inside_pre = True
|
||||
else:
|
||||
self.o('\n```\n') # Markdown code block end
|
||||
self.inside_pre = False
|
||||
elif tag == 'code':
|
||||
if self.inside_pre and not self.handle_code_in_pre:
|
||||
# Ignore code tags inside pre blocks if handle_code_in_pre is False
|
||||
return
|
||||
if start:
|
||||
self.o('`') # Markdown inline code start
|
||||
self.inside_code = True
|
||||
else:
|
||||
self.o('`') # Markdown inline code end
|
||||
self.inside_code = False
|
||||
else:
|
||||
super().handle_tag(tag, attrs, start)
|
||||
|
||||
def handle_data(self, data, entity_char=False):
|
||||
"""Override handle_data to capture content within preserved tags."""
|
||||
if self.preserve_depth > 0:
|
||||
self.preserved_content.append(data)
|
||||
return
|
||||
|
||||
if self.inside_pre:
|
||||
# Output the raw content for pre blocks, including content inside code tags
|
||||
self.o(data) # Directly output the data as-is (preserve newlines)
|
||||
return
|
||||
if self.inside_code:
|
||||
# Inline code: no newlines allowed
|
||||
self.o(data.replace('\n', ' '))
|
||||
return
|
||||
|
||||
# Default behavior for other tags
|
||||
super().handle_data(data, entity_char)
|
||||
|
||||
|
||||
# # Handle pre tags
|
||||
# if tag == 'pre':
|
||||
# if start:
|
||||
# self.o('```\n')
|
||||
# self.inside_pre = True
|
||||
# else:
|
||||
# self.o('\n```')
|
||||
# self.inside_pre = False
|
||||
# # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
# # pass
|
||||
# else:
|
||||
# super().handle_tag(tag, attrs, start)
|
||||
|
||||
# def handle_data(self, data, entity_char=False):
|
||||
# """Override handle_data to capture content within preserved tags."""
|
||||
# if self.preserve_depth > 0:
|
||||
# self.preserved_content.append(data)
|
||||
# return
|
||||
# super().handle_data(data, entity_char)
|
||||
|
||||
83
crawl4ai/install.py
Normal file
83
crawl4ai/install.py
Normal file
@@ -0,0 +1,83 @@
|
||||
import subprocess
|
||||
import sys
|
||||
import asyncio
|
||||
from .async_logger import AsyncLogger, LogLevel
|
||||
|
||||
# Initialize logger
|
||||
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
|
||||
|
||||
def post_install():
|
||||
"""Run all post-installation tasks"""
|
||||
logger.info("Running post-installation setup...", tag="INIT")
|
||||
install_playwright()
|
||||
run_migration()
|
||||
logger.success("Post-installation setup completed!", tag="COMPLETE")
|
||||
|
||||
def install_playwright():
|
||||
logger.info("Installing Playwright browsers...", tag="INIT")
|
||||
try:
|
||||
# subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
|
||||
subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chromium"])
|
||||
logger.success("Playwright installation completed successfully.", tag="COMPLETE")
|
||||
except subprocess.CalledProcessError as e:
|
||||
# logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
|
||||
logger.warning(f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation.")
|
||||
except Exception as e:
|
||||
# logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
|
||||
logger.warning(f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation.")
|
||||
|
||||
def run_migration():
|
||||
"""Initialize database during installation"""
|
||||
try:
|
||||
logger.info("Starting database initialization...", tag="INIT")
|
||||
from crawl4ai.async_database import async_db_manager
|
||||
|
||||
asyncio.run(async_db_manager.initialize())
|
||||
logger.success("Database initialization completed successfully.", tag="COMPLETE")
|
||||
except ImportError:
|
||||
logger.warning("Database module not found. Will initialize on first use.")
|
||||
except Exception as e:
|
||||
logger.warning(f"Database initialization failed: {e}")
|
||||
logger.warning("Database will be initialized on first use")
|
||||
|
||||
async def run_doctor():
|
||||
"""Test if Crawl4AI is working properly"""
|
||||
logger.info("Running Crawl4AI health check...", tag="INIT")
|
||||
try:
|
||||
from .async_webcrawler import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
browser_type="chromium",
|
||||
ignore_https_errors=True,
|
||||
light_mode=True,
|
||||
viewport_width=1280,
|
||||
viewport_height=720
|
||||
)
|
||||
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
screenshot=True,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
logger.info("Testing crawling capabilities...", tag="TEST")
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com",
|
||||
config=run_config
|
||||
)
|
||||
|
||||
if result and result.markdown:
|
||||
logger.success("✅ Crawling test passed!", tag="COMPLETE")
|
||||
return True
|
||||
else:
|
||||
raise Exception("Failed to get content")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"❌ Test failed: {e}", tag="ERROR")
|
||||
return False
|
||||
|
||||
def doctor():
|
||||
"""Entry point for the doctor command"""
|
||||
import asyncio
|
||||
return asyncio.run(run_doctor())
|
||||
15
crawl4ai/js_snippet/__init__.py
Normal file
15
crawl4ai/js_snippet/__init__.py
Normal file
@@ -0,0 +1,15 @@
|
||||
import os, sys
|
||||
|
||||
# Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
|
||||
def load_js_script(script_name):
|
||||
# Get the path of the current script
|
||||
current_script_path = os.path.dirname(os.path.realpath(__file__))
|
||||
# Get the path of the script to load
|
||||
script_path = os.path.join(current_script_path, script_name + '.js')
|
||||
# Check if the script exists
|
||||
if not os.path.exists(script_path):
|
||||
raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
|
||||
# Load the content of the script
|
||||
with open(script_path, 'r') as f:
|
||||
script_content = f.read()
|
||||
return script_content
|
||||
25
crawl4ai/js_snippet/navigator_overrider.js
Normal file
25
crawl4ai/js_snippet/navigator_overrider.js
Normal file
@@ -0,0 +1,25 @@
|
||||
// Pass the Permissions Test.
|
||||
const originalQuery = window.navigator.permissions.query;
|
||||
window.navigator.permissions.query = (parameters) =>
|
||||
parameters.name === "notifications"
|
||||
? Promise.resolve({ state: Notification.permission })
|
||||
: originalQuery(parameters);
|
||||
Object.defineProperty(navigator, "webdriver", {
|
||||
get: () => undefined,
|
||||
});
|
||||
window.navigator.chrome = {
|
||||
runtime: {},
|
||||
// Add other properties if necessary
|
||||
};
|
||||
Object.defineProperty(navigator, "plugins", {
|
||||
get: () => [1, 2, 3, 4, 5],
|
||||
});
|
||||
Object.defineProperty(navigator, "languages", {
|
||||
get: () => ["en-US", "en"],
|
||||
});
|
||||
Object.defineProperty(document, "hidden", {
|
||||
get: () => false,
|
||||
});
|
||||
Object.defineProperty(document, "visibilityState", {
|
||||
get: () => "visible",
|
||||
});
|
||||
119
crawl4ai/js_snippet/remove_overlay_elements.js
Normal file
119
crawl4ai/js_snippet/remove_overlay_elements.js
Normal file
@@ -0,0 +1,119 @@
|
||||
async () => {
|
||||
// Function to check if element is visible
|
||||
const isVisible = (elem) => {
|
||||
const style = window.getComputedStyle(elem);
|
||||
return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
|
||||
};
|
||||
|
||||
// Common selectors for popups and overlays
|
||||
const commonSelectors = [
|
||||
// Close buttons first
|
||||
'button[class*="close" i]',
|
||||
'button[class*="dismiss" i]',
|
||||
'button[aria-label*="close" i]',
|
||||
'button[title*="close" i]',
|
||||
'a[class*="close" i]',
|
||||
'span[class*="close" i]',
|
||||
|
||||
// Cookie notices
|
||||
'[class*="cookie-banner" i]',
|
||||
'[id*="cookie-banner" i]',
|
||||
'[class*="cookie-consent" i]',
|
||||
'[id*="cookie-consent" i]',
|
||||
|
||||
// Newsletter/subscription dialogs
|
||||
'[class*="newsletter" i]',
|
||||
'[class*="subscribe" i]',
|
||||
|
||||
// Generic popups/modals
|
||||
'[class*="popup" i]',
|
||||
'[class*="modal" i]',
|
||||
'[class*="overlay" i]',
|
||||
'[class*="dialog" i]',
|
||||
'[role="dialog"]',
|
||||
'[role="alertdialog"]',
|
||||
];
|
||||
|
||||
// Try to click close buttons first
|
||||
for (const selector of commonSelectors.slice(0, 6)) {
|
||||
const closeButtons = document.querySelectorAll(selector);
|
||||
for (const button of closeButtons) {
|
||||
if (isVisible(button)) {
|
||||
try {
|
||||
button.click();
|
||||
await new Promise((resolve) => setTimeout(resolve, 100));
|
||||
} catch (e) {
|
||||
console.log("Error clicking button:", e);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Remove remaining overlay elements
|
||||
const removeOverlays = () => {
|
||||
// Find elements with high z-index
|
||||
const allElements = document.querySelectorAll("*");
|
||||
for (const elem of allElements) {
|
||||
const style = window.getComputedStyle(elem);
|
||||
const zIndex = parseInt(style.zIndex);
|
||||
const position = style.position;
|
||||
|
||||
if (
|
||||
isVisible(elem) &&
|
||||
(zIndex > 999 || position === "fixed" || position === "absolute") &&
|
||||
(elem.offsetWidth > window.innerWidth * 0.5 ||
|
||||
elem.offsetHeight > window.innerHeight * 0.5 ||
|
||||
style.backgroundColor.includes("rgba") ||
|
||||
parseFloat(style.opacity) < 1)
|
||||
) {
|
||||
elem.remove();
|
||||
}
|
||||
}
|
||||
|
||||
// Remove elements matching common selectors
|
||||
for (const selector of commonSelectors) {
|
||||
const elements = document.querySelectorAll(selector);
|
||||
elements.forEach((elem) => {
|
||||
if (isVisible(elem)) {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
}
|
||||
};
|
||||
|
||||
// Remove overlay elements
|
||||
removeOverlays();
|
||||
|
||||
// Remove any fixed/sticky position elements at the top/bottom
|
||||
const removeFixedElements = () => {
|
||||
const elements = document.querySelectorAll("*");
|
||||
elements.forEach((elem) => {
|
||||
const style = window.getComputedStyle(elem);
|
||||
if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
};
|
||||
|
||||
removeFixedElements();
|
||||
|
||||
// Remove empty block elements as: div, p, span, etc.
|
||||
const removeEmptyBlockElements = () => {
|
||||
const blockElements = document.querySelectorAll(
|
||||
"div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
|
||||
);
|
||||
blockElements.forEach((elem) => {
|
||||
if (elem.innerText.trim() === "") {
|
||||
elem.remove();
|
||||
}
|
||||
});
|
||||
};
|
||||
|
||||
// Remove margin-right and padding-right from body (often added by modal scripts)
|
||||
document.body.style.marginRight = "0px";
|
||||
document.body.style.paddingRight = "0px";
|
||||
document.body.style.overflow = "auto";
|
||||
|
||||
// Wait a bit for any animations to complete
|
||||
await new Promise((resolve) => setTimeout(resolve, 100));
|
||||
};
|
||||
54
crawl4ai/js_snippet/update_image_dimensions.js
Normal file
54
crawl4ai/js_snippet/update_image_dimensions.js
Normal file
@@ -0,0 +1,54 @@
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
// Filter out images that are too small
|
||||
if (img.width < 100 && img.height < 100) return false;
|
||||
|
||||
// Filter out images that are not visible
|
||||
const rect = img.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return false;
|
||||
|
||||
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||
if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
|
||||
|
||||
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||
if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
|
||||
|
||||
return true;
|
||||
};
|
||||
|
||||
const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
|
||||
let imagesLeft = images.length;
|
||||
|
||||
if (imagesLeft === 0) {
|
||||
resolve();
|
||||
return;
|
||||
}
|
||||
|
||||
const checkImage = (img) => {
|
||||
if (img.complete && img.naturalWidth !== 0) {
|
||||
img.setAttribute("width", img.naturalWidth);
|
||||
img.setAttribute("height", img.naturalHeight);
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
}
|
||||
};
|
||||
|
||||
images.forEach((img) => {
|
||||
checkImage(img);
|
||||
if (!img.complete) {
|
||||
img.onload = () => {
|
||||
checkImage(img);
|
||||
};
|
||||
img.onerror = () => {
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
};
|
||||
}
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
// setTimeout(() => resolve(), 5000);
|
||||
resolve();
|
||||
});
|
||||
};
|
||||
498
crawl4ai/llmtxt.py
Normal file
498
crawl4ai/llmtxt.py
Normal file
@@ -0,0 +1,498 @@
|
||||
import os
|
||||
from pathlib import Path
|
||||
import re
|
||||
from typing import Dict, List, Tuple, Optional, Any
|
||||
import json
|
||||
from tqdm import tqdm
|
||||
import time
|
||||
import psutil
|
||||
import numpy as np
|
||||
from rank_bm25 import BM25Okapi
|
||||
from nltk.tokenize import word_tokenize
|
||||
from nltk.corpus import stopwords
|
||||
from nltk.stem import WordNetLemmatizer
|
||||
from litellm import completion, batch_completion
|
||||
from .async_logger import AsyncLogger
|
||||
import litellm
|
||||
import pickle
|
||||
import hashlib # <--- ADDED for file-hash
|
||||
from fnmatch import fnmatch
|
||||
import glob
|
||||
|
||||
litellm.set_verbose = False
|
||||
|
||||
def _compute_file_hash(file_path: Path) -> str:
|
||||
"""Compute MD5 hash for the file's entire content."""
|
||||
hash_md5 = hashlib.md5()
|
||||
with file_path.open("rb") as f:
|
||||
for chunk in iter(lambda: f.read(4096), b""):
|
||||
hash_md5.update(chunk)
|
||||
return hash_md5.hexdigest()
|
||||
|
||||
class AsyncLLMTextManager:
|
||||
def __init__(
|
||||
self,
|
||||
docs_dir: Path,
|
||||
logger: Optional[AsyncLogger] = None,
|
||||
max_concurrent_calls: int = 5,
|
||||
batch_size: int = 3
|
||||
) -> None:
|
||||
self.docs_dir = docs_dir
|
||||
self.logger = logger
|
||||
self.max_concurrent_calls = max_concurrent_calls
|
||||
self.batch_size = batch_size
|
||||
self.bm25_index = None
|
||||
self.document_map: Dict[str, Any] = {}
|
||||
self.tokenized_facts: List[str] = []
|
||||
self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
|
||||
|
||||
async def _process_document_batch(self, doc_batch: List[Path]) -> None:
|
||||
"""Process a batch of documents in parallel"""
|
||||
contents = []
|
||||
for file_path in doc_batch:
|
||||
try:
|
||||
with open(file_path, 'r', encoding='utf-8') as f:
|
||||
contents.append(f.read())
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error reading {file_path}: {str(e)}")
|
||||
contents.append("") # Add empty content to maintain batch alignment
|
||||
|
||||
prompt = """Given a documentation file, generate a list of atomic facts where each fact:
|
||||
1. Represents a single piece of knowledge
|
||||
2. Contains variations in terminology for the same concept
|
||||
3. References relevant code patterns if they exist
|
||||
4. Is written in a way that would match natural language queries
|
||||
|
||||
Each fact should follow this format:
|
||||
<main_concept>: <fact_statement> | <related_terms> | <code_reference>
|
||||
|
||||
Example Facts:
|
||||
browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
|
||||
redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
|
||||
pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
|
||||
|
||||
Wrap your response in <index>...</index> tags.
|
||||
"""
|
||||
|
||||
# Prepare messages for batch processing
|
||||
messages_list = [
|
||||
[
|
||||
{"role": "user", "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}"}
|
||||
]
|
||||
for content in contents if content
|
||||
]
|
||||
|
||||
try:
|
||||
responses = batch_completion(
|
||||
model="anthropic/claude-3-5-sonnet-latest",
|
||||
messages=messages_list,
|
||||
logger_fn=None
|
||||
)
|
||||
|
||||
# Process responses and save index files
|
||||
for response, file_path in zip(responses, doc_batch):
|
||||
try:
|
||||
index_content_match = re.search(
|
||||
r'<index>(.*?)</index>',
|
||||
response.choices[0].message.content,
|
||||
re.DOTALL
|
||||
)
|
||||
if not index_content_match:
|
||||
self.logger.warning(f"No <index>...</index> content found for {file_path}")
|
||||
continue
|
||||
|
||||
index_content = re.sub(
|
||||
r"\n\s*\n", "\n", index_content_match.group(1)
|
||||
).strip()
|
||||
if index_content:
|
||||
index_file = file_path.with_suffix('.q.md')
|
||||
with open(index_file, 'w', encoding='utf-8') as f:
|
||||
f.write(index_content)
|
||||
self.logger.info(f"Created index file: {index_file}")
|
||||
else:
|
||||
self.logger.warning(f"No index content found in response for {file_path}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing response for {file_path}: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error in batch completion: {str(e)}")
|
||||
|
||||
def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
|
||||
if "|" not in line:
|
||||
return False, "Missing separator '|'"
|
||||
|
||||
parts = [p.strip() for p in line.split("|")]
|
||||
if len(parts) != 3:
|
||||
return False, f"Expected 3 parts, got {len(parts)}"
|
||||
|
||||
concept_part = parts[0]
|
||||
if ":" not in concept_part:
|
||||
return False, "Missing ':' in concept definition"
|
||||
|
||||
return True, None
|
||||
|
||||
def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
|
||||
"""
|
||||
Load token cache from .q.tokens if present and matching file hash.
|
||||
Otherwise return a new structure with updated file-hash.
|
||||
"""
|
||||
cache_file = fact_file.with_suffix(".q.tokens")
|
||||
current_hash = _compute_file_hash(fact_file)
|
||||
|
||||
if cache_file.exists():
|
||||
try:
|
||||
with open(cache_file, "r") as f:
|
||||
cache = json.load(f)
|
||||
# If the hash matches, return it directly
|
||||
if cache.get("content_hash") == current_hash:
|
||||
return cache
|
||||
# Otherwise, we signal that it's changed
|
||||
self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
|
||||
except json.JSONDecodeError:
|
||||
self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
|
||||
except Exception as e:
|
||||
self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
|
||||
|
||||
# Return a fresh cache
|
||||
return {"facts": {}, "content_hash": current_hash}
|
||||
|
||||
def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
|
||||
cache_file = fact_file.with_suffix(".q.tokens")
|
||||
# Always ensure we're saving the correct file-hash
|
||||
cache["content_hash"] = _compute_file_hash(fact_file)
|
||||
with open(cache_file, "w") as f:
|
||||
json.dump(cache, f)
|
||||
|
||||
def preprocess_text(self, text: str) -> List[str]:
|
||||
parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
|
||||
# Remove : after the first word of parts[0]
|
||||
parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
|
||||
|
||||
lemmatizer = WordNetLemmatizer()
|
||||
stop_words = set(stopwords.words("english")) - {
|
||||
"how", "what", "when", "where", "why", "which",
|
||||
}
|
||||
|
||||
tokens = []
|
||||
for part in parts:
|
||||
if "(" in part and ")" in part:
|
||||
code_tokens = re.findall(
|
||||
r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
|
||||
)
|
||||
tokens.extend(code_tokens)
|
||||
|
||||
words = word_tokenize(part.lower())
|
||||
tokens.extend(
|
||||
[
|
||||
lemmatizer.lemmatize(token)
|
||||
for token in words
|
||||
if token not in stop_words
|
||||
]
|
||||
)
|
||||
|
||||
return tokens
|
||||
|
||||
def maybe_load_bm25_index(self, clear_cache=False) -> bool:
|
||||
"""
|
||||
Load existing BM25 index from disk, if present and clear_cache=False.
|
||||
"""
|
||||
if not clear_cache and os.path.exists(self.bm25_index_file):
|
||||
self.logger.info("Loading existing BM25 index from disk.")
|
||||
with open(self.bm25_index_file, "rb") as f:
|
||||
data = pickle.load(f)
|
||||
self.tokenized_facts = data["tokenized_facts"]
|
||||
self.bm25_index = data["bm25_index"]
|
||||
return True
|
||||
return False
|
||||
|
||||
def build_search_index(self, clear_cache=False) -> None:
|
||||
"""
|
||||
Checks for new or modified .q.md files by comparing file-hash.
|
||||
If none need reindexing and clear_cache is False, loads existing index if available.
|
||||
Otherwise, reindexes only changed/new files and merges or creates a new index.
|
||||
"""
|
||||
# If clear_cache is True, we skip partial logic: rebuild everything from scratch
|
||||
if clear_cache:
|
||||
self.logger.info("Clearing cache and rebuilding full search index.")
|
||||
if self.bm25_index_file.exists():
|
||||
self.bm25_index_file.unlink()
|
||||
|
||||
process = psutil.Process()
|
||||
self.logger.info("Checking which .q.md files need (re)indexing...")
|
||||
|
||||
# Gather all .q.md files
|
||||
q_files = [self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")]
|
||||
|
||||
# We'll store known (unchanged) facts in these lists
|
||||
existing_facts: List[str] = []
|
||||
existing_tokens: List[List[str]] = []
|
||||
|
||||
# Keep track of invalid lines for logging
|
||||
invalid_lines = []
|
||||
needSet = [] # files that must be (re)indexed
|
||||
|
||||
for qf in q_files:
|
||||
token_cache_file = qf.with_suffix(".q.tokens")
|
||||
|
||||
# If no .q.tokens or clear_cache is True → definitely reindex
|
||||
if clear_cache or not token_cache_file.exists():
|
||||
needSet.append(qf)
|
||||
continue
|
||||
|
||||
# Otherwise, load the existing cache and compare hash
|
||||
cache = self._load_or_create_token_cache(qf)
|
||||
# If the .q.tokens was out of date (i.e. changed hash), we reindex
|
||||
if len(cache["facts"]) == 0 or cache.get("content_hash") != _compute_file_hash(qf):
|
||||
needSet.append(qf)
|
||||
else:
|
||||
# File is unchanged → retrieve cached token data
|
||||
for line, cache_data in cache["facts"].items():
|
||||
existing_facts.append(line)
|
||||
existing_tokens.append(cache_data["tokens"])
|
||||
self.document_map[line] = qf # track the doc for that fact
|
||||
|
||||
if not needSet and not clear_cache:
|
||||
# If no file needs reindexing, try loading existing index
|
||||
if self.maybe_load_bm25_index(clear_cache=False):
|
||||
self.logger.info("No new/changed .q.md files found. Using existing BM25 index.")
|
||||
return
|
||||
else:
|
||||
# If there's no existing index, we must build a fresh index from the old caches
|
||||
self.logger.info("No existing BM25 index found. Building from cached facts.")
|
||||
if existing_facts:
|
||||
self.logger.info(f"Building BM25 index with {len(existing_facts)} cached facts.")
|
||||
self.bm25_index = BM25Okapi(existing_tokens)
|
||||
self.tokenized_facts = existing_facts
|
||||
with open(self.bm25_index_file, "wb") as f:
|
||||
pickle.dump({
|
||||
"bm25_index": self.bm25_index,
|
||||
"tokenized_facts": self.tokenized_facts
|
||||
}, f)
|
||||
else:
|
||||
self.logger.warning("No facts found at all. Index remains empty.")
|
||||
return
|
||||
|
||||
# ----------------------------------------------------- /Users/unclecode/.crawl4ai/docs/14_proxy_security.q.q.tokens '/Users/unclecode/.crawl4ai/docs/14_proxy_security.q.md'
|
||||
# If we reach here, we have new or changed .q.md files
|
||||
# We'll parse them, reindex them, and then combine with existing_facts
|
||||
# -----------------------------------------------------
|
||||
|
||||
self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
|
||||
|
||||
# 1) Parse the new or changed .q.md files
|
||||
new_facts = []
|
||||
new_tokens = []
|
||||
with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
|
||||
for file in needSet:
|
||||
# We'll build up a fresh cache
|
||||
fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
|
||||
try:
|
||||
with open(file, "r", encoding="utf-8") as f_obj:
|
||||
content = f_obj.read().strip()
|
||||
lines = [l.strip() for l in content.split("\n") if l.strip()]
|
||||
|
||||
for line in lines:
|
||||
is_valid, error = self._validate_fact_line(line)
|
||||
if not is_valid:
|
||||
invalid_lines.append((file, line, error))
|
||||
continue
|
||||
|
||||
tokens = self.preprocess_text(line)
|
||||
fresh_cache["facts"][line] = {
|
||||
"tokens": tokens,
|
||||
"added": time.time(),
|
||||
}
|
||||
new_facts.append(line)
|
||||
new_tokens.append(tokens)
|
||||
self.document_map[line] = file
|
||||
|
||||
# Save the new .q.tokens with updated hash
|
||||
self._save_token_cache(file, fresh_cache)
|
||||
|
||||
mem_usage = process.memory_info().rss / 1024 / 1024
|
||||
self.logger.debug(f"Memory usage after {file.name}: {mem_usage:.2f}MB")
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error processing {file}: {str(e)}")
|
||||
|
||||
file_pbar.update(1)
|
||||
|
||||
if invalid_lines:
|
||||
self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
|
||||
for file, line, error in invalid_lines:
|
||||
self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
|
||||
|
||||
# 2) Merge newly tokenized facts with the existing ones
|
||||
all_facts = existing_facts + new_facts
|
||||
all_tokens = existing_tokens + new_tokens
|
||||
|
||||
# 3) Build BM25 index from combined facts
|
||||
self.logger.info(f"Building BM25 index with {len(all_facts)} total facts (old + new).")
|
||||
self.bm25_index = BM25Okapi(all_tokens)
|
||||
self.tokenized_facts = all_facts
|
||||
|
||||
# 4) Save the updated BM25 index to disk
|
||||
with open(self.bm25_index_file, "wb") as f:
|
||||
pickle.dump({
|
||||
"bm25_index": self.bm25_index,
|
||||
"tokenized_facts": self.tokenized_facts
|
||||
}, f)
|
||||
|
||||
final_mem = process.memory_info().rss / 1024 / 1024
|
||||
self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
|
||||
|
||||
async def generate_index_files(self, force_generate_facts: bool = False, clear_bm25_cache: bool = False) -> None:
|
||||
"""
|
||||
Generate index files for all documents in parallel batches
|
||||
|
||||
Args:
|
||||
force_generate_facts (bool): If True, regenerate indexes even if they exist
|
||||
clear_bm25_cache (bool): If True, clear existing BM25 index cache
|
||||
"""
|
||||
self.logger.info("Starting index generation for documentation files.")
|
||||
|
||||
md_files = [
|
||||
self.docs_dir / f for f in os.listdir(self.docs_dir)
|
||||
if f.endswith('.md') and not any(f.endswith(x) for x in ['.q.md', '.xs.md'])
|
||||
]
|
||||
|
||||
# Filter out files that already have .q files unless force=True
|
||||
if not force_generate_facts:
|
||||
md_files = [
|
||||
f for f in md_files
|
||||
if not (self.docs_dir / f.name.replace('.md', '.q.md')).exists()
|
||||
]
|
||||
|
||||
if not md_files:
|
||||
self.logger.info("All index files exist. Use force=True to regenerate.")
|
||||
else:
|
||||
# Process documents in batches
|
||||
for i in range(0, len(md_files), self.batch_size):
|
||||
batch = md_files[i:i + self.batch_size]
|
||||
self.logger.info(f"Processing batch {i//self.batch_size + 1}/{(len(md_files)//self.batch_size) + 1}")
|
||||
await self._process_document_batch(batch)
|
||||
|
||||
self.logger.info("Index generation complete, building/updating search index.")
|
||||
self.build_search_index(clear_cache=clear_bm25_cache)
|
||||
|
||||
def generate(self, sections: List[str], mode: str = "extended") -> str:
|
||||
# Get all markdown files
|
||||
all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + \
|
||||
glob.glob(str(self.docs_dir / "[0-9]*.xs.md"))
|
||||
|
||||
# Extract base names without extensions
|
||||
base_docs = {Path(f).name.split('.')[0] for f in all_files
|
||||
if not Path(f).name.endswith('.q.md')}
|
||||
|
||||
# Filter by sections if provided
|
||||
if sections:
|
||||
base_docs = {doc for doc in base_docs
|
||||
if any(section.lower() in doc.lower() for section in sections)}
|
||||
|
||||
# Get file paths based on mode
|
||||
files = []
|
||||
for doc in sorted(base_docs, key=lambda x: int(x.split('_')[0]) if x.split('_')[0].isdigit() else 999999):
|
||||
if mode == "condensed":
|
||||
xs_file = self.docs_dir / f"{doc}.xs.md"
|
||||
regular_file = self.docs_dir / f"{doc}.md"
|
||||
files.append(str(xs_file if xs_file.exists() else regular_file))
|
||||
else:
|
||||
files.append(str(self.docs_dir / f"{doc}.md"))
|
||||
|
||||
# Read and format content
|
||||
content = []
|
||||
for file in files:
|
||||
try:
|
||||
with open(file, 'r', encoding='utf-8') as f:
|
||||
fname = Path(file).name
|
||||
content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
|
||||
except Exception as e:
|
||||
self.logger.error(f"Error reading {file}: {str(e)}")
|
||||
|
||||
return "\n\n---\n\n".join(content) if content else ""
|
||||
|
||||
def search(self, query: str, top_k: int = 5) -> str:
|
||||
if not self.bm25_index:
|
||||
return "No search index available. Call build_search_index() first."
|
||||
|
||||
query_tokens = self.preprocess_text(query)
|
||||
doc_scores = self.bm25_index.get_scores(query_tokens)
|
||||
|
||||
mean_score = np.mean(doc_scores)
|
||||
std_score = np.std(doc_scores)
|
||||
score_threshold = mean_score + (0.25 * std_score)
|
||||
|
||||
file_data = self._aggregate_search_scores(
|
||||
doc_scores=doc_scores,
|
||||
score_threshold=score_threshold,
|
||||
query_tokens=query_tokens,
|
||||
)
|
||||
|
||||
ranked_files = sorted(
|
||||
file_data.items(),
|
||||
key=lambda x: (
|
||||
x[1]["code_match_score"] * 2.0
|
||||
+ x[1]["match_count"] * 1.5
|
||||
+ x[1]["total_score"]
|
||||
),
|
||||
reverse=True,
|
||||
)[:top_k]
|
||||
|
||||
results = []
|
||||
for file, _ in ranked_files:
|
||||
main_doc = str(file).replace(".q.md", ".md")
|
||||
if os.path.exists(self.docs_dir / main_doc):
|
||||
with open(self.docs_dir / main_doc, "r", encoding='utf-8') as f:
|
||||
only_file_name = main_doc.split("/")[-1]
|
||||
content = [
|
||||
"#" * 20,
|
||||
f"# {only_file_name}",
|
||||
"#" * 20,
|
||||
"",
|
||||
f.read()
|
||||
]
|
||||
results.append("\n".join(content))
|
||||
|
||||
return "\n\n---\n\n".join(results)
|
||||
|
||||
def _aggregate_search_scores(
|
||||
self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
|
||||
) -> Dict:
|
||||
file_data = {}
|
||||
|
||||
for idx, score in enumerate(doc_scores):
|
||||
if score <= score_threshold:
|
||||
continue
|
||||
|
||||
fact = self.tokenized_facts[idx]
|
||||
file_path = self.document_map[fact]
|
||||
|
||||
if file_path not in file_data:
|
||||
file_data[file_path] = {
|
||||
"total_score": 0,
|
||||
"match_count": 0,
|
||||
"code_match_score": 0,
|
||||
"matched_facts": [],
|
||||
}
|
||||
|
||||
components = fact.split("|") if "|" in fact else [fact]
|
||||
|
||||
code_match_score = 0
|
||||
if len(components) == 3:
|
||||
code_ref = components[2].strip()
|
||||
code_tokens = self.preprocess_text(code_ref)
|
||||
code_match_score = len(set(query_tokens) & set(code_tokens)) / len(query_tokens)
|
||||
|
||||
file_data[file_path]["total_score"] += score
|
||||
file_data[file_path]["match_count"] += 1
|
||||
file_data[file_path]["code_match_score"] = max(
|
||||
file_data[file_path]["code_match_score"], code_match_score
|
||||
)
|
||||
file_data[file_path]["matched_facts"].append(fact)
|
||||
|
||||
return file_data
|
||||
|
||||
def refresh_index(self) -> None:
|
||||
"""Convenience method for a full rebuild."""
|
||||
self.build_search_index(clear_cache=True)
|
||||
183
crawl4ai/markdown_generation_strategy.py
Normal file
183
crawl4ai/markdown_generation_strategy.py
Normal file
@@ -0,0 +1,183 @@
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Optional, Dict, Any, Tuple
|
||||
from .models import MarkdownGenerationResult
|
||||
from .html2text import CustomHTML2Text
|
||||
from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
|
||||
import re
|
||||
from urllib.parse import urljoin
|
||||
|
||||
# Pre-compile the regex pattern
|
||||
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
|
||||
|
||||
def fast_urljoin(base: str, url: str) -> str:
|
||||
"""Fast URL joining for common cases."""
|
||||
if url.startswith(('http://', 'https://', 'mailto:', '//')):
|
||||
return url
|
||||
if url.startswith('/'):
|
||||
# Handle absolute paths
|
||||
if base.endswith('/'):
|
||||
return base[:-1] + url
|
||||
return base + url
|
||||
return urljoin(base, url)
|
||||
|
||||
class MarkdownGenerationStrategy(ABC):
|
||||
"""Abstract base class for markdown generation strategies."""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
self.content_filter = content_filter
|
||||
self.options = options or {}
|
||||
|
||||
@abstractmethod
|
||||
def generate_markdown(self,
|
||||
cleaned_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
citations: bool = True,
|
||||
**kwargs) -> MarkdownGenerationResult:
|
||||
"""Generate markdown from cleaned HTML."""
|
||||
pass
|
||||
|
||||
class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
|
||||
"""
|
||||
Default implementation of markdown generation strategy.
|
||||
|
||||
How it works:
|
||||
1. Generate raw markdown from cleaned HTML.
|
||||
2. Convert links to citations.
|
||||
3. Generate fit markdown if content filter is provided.
|
||||
4. Return MarkdownGenerationResult.
|
||||
|
||||
Args:
|
||||
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
|
||||
options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
|
||||
|
||||
Returns:
|
||||
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
|
||||
"""
|
||||
def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
|
||||
super().__init__(content_filter, options)
|
||||
|
||||
def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
|
||||
"""
|
||||
Convert links in markdown to citations.
|
||||
|
||||
How it works:
|
||||
1. Find all links in the markdown.
|
||||
2. Convert links to citations.
|
||||
3. Return converted markdown and references markdown.
|
||||
|
||||
Note:
|
||||
This function uses a regex pattern to find links in markdown.
|
||||
|
||||
Args:
|
||||
markdown (str): Markdown text.
|
||||
base_url (str): Base URL for URL joins.
|
||||
|
||||
Returns:
|
||||
Tuple[str, str]: Converted markdown and references markdown.
|
||||
"""
|
||||
link_map = {}
|
||||
url_cache = {} # Cache for URL joins
|
||||
parts = []
|
||||
last_end = 0
|
||||
counter = 1
|
||||
|
||||
for match in LINK_PATTERN.finditer(markdown):
|
||||
parts.append(markdown[last_end:match.start()])
|
||||
text, url, title = match.groups()
|
||||
|
||||
# Use cached URL if available, otherwise compute and cache
|
||||
if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
|
||||
if url not in url_cache:
|
||||
url_cache[url] = fast_urljoin(base_url, url)
|
||||
url = url_cache[url]
|
||||
|
||||
if url not in link_map:
|
||||
desc = []
|
||||
if title: desc.append(title)
|
||||
if text and text != title: desc.append(text)
|
||||
link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
|
||||
counter += 1
|
||||
|
||||
num = link_map[url][0]
|
||||
parts.append(f"{text}⟨{num}⟩" if not match.group(0).startswith('!') else f"![{text}⟨{num}⟩]")
|
||||
last_end = match.end()
|
||||
|
||||
parts.append(markdown[last_end:])
|
||||
converted_text = ''.join(parts)
|
||||
|
||||
# Pre-build reference strings
|
||||
references = ["\n\n## References\n\n"]
|
||||
references.extend(
|
||||
f"⟨{num}⟩ {url}{desc}\n"
|
||||
for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
|
||||
)
|
||||
|
||||
return converted_text, ''.join(references)
|
||||
|
||||
def generate_markdown(self,
|
||||
cleaned_html: str,
|
||||
base_url: str = "",
|
||||
html2text_options: Optional[Dict[str, Any]] = None,
|
||||
options: Optional[Dict[str, Any]] = None,
|
||||
content_filter: Optional[RelevantContentFilter] = None,
|
||||
citations: bool = True,
|
||||
**kwargs) -> MarkdownGenerationResult:
|
||||
"""
|
||||
Generate markdown with citations from cleaned HTML.
|
||||
|
||||
How it works:
|
||||
1. Generate raw markdown from cleaned HTML.
|
||||
2. Convert links to citations.
|
||||
3. Generate fit markdown if content filter is provided.
|
||||
4. Return MarkdownGenerationResult.
|
||||
|
||||
Args:
|
||||
cleaned_html (str): Cleaned HTML content.
|
||||
base_url (str): Base URL for URL joins.
|
||||
html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
|
||||
options (Optional[Dict[str, Any]]): Additional options for markdown generation.
|
||||
content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
|
||||
citations (bool): Whether to generate citations.
|
||||
|
||||
Returns:
|
||||
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
|
||||
"""
|
||||
# Initialize HTML2Text with options
|
||||
h = CustomHTML2Text()
|
||||
if html2text_options:
|
||||
h.update_params(**html2text_options)
|
||||
elif options:
|
||||
h.update_params(**options)
|
||||
elif self.options:
|
||||
h.update_params(**self.options)
|
||||
|
||||
# Generate raw markdown
|
||||
raw_markdown = h.handle(cleaned_html)
|
||||
raw_markdown = raw_markdown.replace(' ```', '```')
|
||||
|
||||
# Convert links to citations
|
||||
markdown_with_citations: str = ""
|
||||
references_markdown: str = ""
|
||||
if citations:
|
||||
markdown_with_citations, references_markdown = self.convert_links_to_citations(
|
||||
raw_markdown, base_url
|
||||
)
|
||||
|
||||
# Generate fit markdown if content filter is provided
|
||||
fit_markdown: Optional[str] = ""
|
||||
filtered_html: Optional[str] = ""
|
||||
if content_filter or self.content_filter:
|
||||
content_filter = content_filter or self.content_filter
|
||||
filtered_html = content_filter.filter_content(cleaned_html)
|
||||
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
|
||||
fit_markdown = h.handle(filtered_html)
|
||||
|
||||
return MarkdownGenerationResult(
|
||||
raw_markdown=raw_markdown,
|
||||
markdown_with_citations=markdown_with_citations,
|
||||
references_markdown=references_markdown,
|
||||
fit_markdown=fit_markdown,
|
||||
fit_html=filtered_html,
|
||||
)
|
||||
|
||||
@@ -9,9 +9,13 @@ import aiofiles
|
||||
import shutil
|
||||
import time
|
||||
from datetime import datetime
|
||||
from .async_logger import AsyncLogger, LogLevel
|
||||
|
||||
logging.basicConfig(level=logging.INFO)
|
||||
logger = logging.getLogger(__name__)
|
||||
# Initialize logger
|
||||
logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
|
||||
|
||||
# logging.basicConfig(level=logging.INFO)
|
||||
# logger = logging.getLogger(__name__)
|
||||
|
||||
class DatabaseMigration:
|
||||
def __init__(self, db_path: str):
|
||||
@@ -55,7 +59,8 @@ class DatabaseMigration:
|
||||
|
||||
async def migrate_database(self):
|
||||
"""Migrate existing database to file-based storage"""
|
||||
logger.info("Starting database migration...")
|
||||
# logger.info("Starting database migration...")
|
||||
logger.info("Starting database migration...", tag="INIT")
|
||||
|
||||
try:
|
||||
async with aiosqlite.connect(self.db_path) as db:
|
||||
@@ -91,19 +96,25 @@ class DatabaseMigration:
|
||||
|
||||
migrated_count += 1
|
||||
if migrated_count % 100 == 0:
|
||||
logger.info(f"Migrated {migrated_count} records...")
|
||||
logger.info(f"Migrated {migrated_count} records...", tag="INIT")
|
||||
|
||||
|
||||
await db.commit()
|
||||
logger.info(f"Migration completed. {migrated_count} records processed.")
|
||||
logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Migration failed: {e}")
|
||||
raise
|
||||
# logger.error(f"Migration failed: {e}")
|
||||
logger.error(
|
||||
message="Migration failed: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
raise e
|
||||
|
||||
async def backup_database(db_path: str) -> str:
|
||||
"""Create backup of existing database"""
|
||||
if not os.path.exists(db_path):
|
||||
logger.info("No existing database found. Skipping backup.")
|
||||
logger.info("No existing database found. Skipping backup.", tag="INIT")
|
||||
return None
|
||||
|
||||
# Create backup with timestamp
|
||||
@@ -116,11 +127,16 @@ async def backup_database(db_path: str) -> str:
|
||||
|
||||
# Create backup
|
||||
shutil.copy2(db_path, backup_path)
|
||||
logger.info(f"Database backup created at: {backup_path}")
|
||||
logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
|
||||
return backup_path
|
||||
except Exception as e:
|
||||
logger.error(f"Backup failed: {e}")
|
||||
raise
|
||||
# logger.error(f"Backup failed: {e}")
|
||||
logger.error(
|
||||
message="Migration failed: {error}",
|
||||
tag="ERROR",
|
||||
params={"error": str(e)}
|
||||
)
|
||||
raise e
|
||||
|
||||
async def run_migration(db_path: Optional[str] = None):
|
||||
"""Run database migration"""
|
||||
@@ -128,7 +144,7 @@ async def run_migration(db_path: Optional[str] = None):
|
||||
db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
|
||||
|
||||
if not os.path.exists(db_path):
|
||||
logger.info("No existing database found. Skipping migration.")
|
||||
logger.info("No existing database found. Skipping migration.", tag="INIT")
|
||||
return
|
||||
|
||||
# Create backup first
|
||||
|
||||
@@ -1,12 +1,28 @@
|
||||
from pydantic import BaseModel, HttpUrl
|
||||
from typing import List, Dict, Optional, Callable, Awaitable
|
||||
|
||||
from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
|
||||
from dataclasses import dataclass
|
||||
from .ssl_certificate import SSLCertificate
|
||||
|
||||
@dataclass
|
||||
class TokenUsage:
|
||||
completion_tokens: int = 0
|
||||
prompt_tokens: int = 0
|
||||
total_tokens: int = 0
|
||||
completion_tokens_details: Optional[dict] = None
|
||||
prompt_tokens_details: Optional[dict] = None
|
||||
|
||||
|
||||
class UrlModel(BaseModel):
|
||||
url: HttpUrl
|
||||
forced: bool = False
|
||||
|
||||
class MarkdownGenerationResult(BaseModel):
|
||||
raw_markdown: str
|
||||
markdown_with_citations: str
|
||||
references_markdown: str
|
||||
fit_markdown: Optional[str] = None
|
||||
fit_html: Optional[str] = None
|
||||
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
@@ -16,7 +32,9 @@ class CrawlResult(BaseModel):
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
downloaded_files: Optional[List[str]] = None
|
||||
screenshot: Optional[str] = None
|
||||
markdown: Optional[str] = None
|
||||
pdf : Optional[bytes] = None
|
||||
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
||||
markdown_v2: Optional[MarkdownGenerationResult] = None
|
||||
fit_markdown: Optional[str] = None
|
||||
fit_html: Optional[str] = None
|
||||
extracted_content: Optional[str] = None
|
||||
@@ -25,14 +43,19 @@ class CrawlResult(BaseModel):
|
||||
session_id: Optional[str] = None
|
||||
response_headers: Optional[dict] = None
|
||||
status_code: Optional[int] = None
|
||||
|
||||
ssl_certificate: Optional[SSLCertificate] = None
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
class AsyncCrawlResponse(BaseModel):
|
||||
html: str
|
||||
response_headers: Dict[str, str]
|
||||
status_code: int
|
||||
screenshot: Optional[str] = None
|
||||
pdf_data: Optional[bytes] = None
|
||||
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||
downloaded_files: Optional[List[str]] = None
|
||||
ssl_certificate: Optional[SSLCertificate] = None
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
181
crawl4ai/ssl_certificate.py
Normal file
181
crawl4ai/ssl_certificate.py
Normal file
@@ -0,0 +1,181 @@
|
||||
"""SSL Certificate class for handling certificate operations."""
|
||||
|
||||
import ssl
|
||||
import socket
|
||||
import base64
|
||||
import json
|
||||
from typing import Dict, Any, Optional
|
||||
from urllib.parse import urlparse
|
||||
import OpenSSL.crypto
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class SSLCertificate:
|
||||
"""
|
||||
A class representing an SSL certificate with methods to export in various formats.
|
||||
|
||||
Attributes:
|
||||
cert_info (Dict[str, Any]): The certificate information.
|
||||
|
||||
Methods:
|
||||
from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
|
||||
from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
|
||||
from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
|
||||
export_as_pem() -> str: Export the certificate as PEM format.
|
||||
export_as_der() -> bytes: Export the certificate as DER format.
|
||||
export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
|
||||
export_as_text() -> str: Export the certificate as text format.
|
||||
"""
|
||||
def __init__(self, cert_info: Dict[str, Any]):
|
||||
self._cert_info = self._decode_cert_data(cert_info)
|
||||
|
||||
@staticmethod
|
||||
def from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']:
|
||||
"""
|
||||
Create SSLCertificate instance from a URL.
|
||||
|
||||
Args:
|
||||
url (str): URL of the website.
|
||||
timeout (int): Timeout for the connection (default: 10).
|
||||
|
||||
Returns:
|
||||
Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
|
||||
"""
|
||||
try:
|
||||
hostname = urlparse(url).netloc
|
||||
if ':' in hostname:
|
||||
hostname = hostname.split(':')[0]
|
||||
|
||||
context = ssl.create_default_context()
|
||||
with socket.create_connection((hostname, 443), timeout=timeout) as sock:
|
||||
with context.wrap_socket(sock, server_hostname=hostname) as ssock:
|
||||
cert_binary = ssock.getpeercert(binary_form=True)
|
||||
x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_ASN1, cert_binary)
|
||||
|
||||
cert_info = {
|
||||
"subject": dict(x509.get_subject().get_components()),
|
||||
"issuer": dict(x509.get_issuer().get_components()),
|
||||
"version": x509.get_version(),
|
||||
"serial_number": hex(x509.get_serial_number()),
|
||||
"not_before": x509.get_notBefore(),
|
||||
"not_after": x509.get_notAfter(),
|
||||
"fingerprint": x509.digest("sha256").hex(),
|
||||
"signature_algorithm": x509.get_signature_algorithm(),
|
||||
"raw_cert": base64.b64encode(cert_binary)
|
||||
}
|
||||
|
||||
# Add extensions
|
||||
extensions = []
|
||||
for i in range(x509.get_extension_count()):
|
||||
ext = x509.get_extension(i)
|
||||
extensions.append({
|
||||
"name": ext.get_short_name(),
|
||||
"value": str(ext)
|
||||
})
|
||||
cert_info["extensions"] = extensions
|
||||
|
||||
return SSLCertificate(cert_info)
|
||||
|
||||
except Exception as e:
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def _decode_cert_data(data: Any) -> Any:
|
||||
"""Helper method to decode bytes in certificate data."""
|
||||
if isinstance(data, bytes):
|
||||
return data.decode('utf-8')
|
||||
elif isinstance(data, dict):
|
||||
return {
|
||||
(k.decode('utf-8') if isinstance(k, bytes) else k): SSLCertificate._decode_cert_data(v)
|
||||
for k, v in data.items()
|
||||
}
|
||||
elif isinstance(data, list):
|
||||
return [SSLCertificate._decode_cert_data(item) for item in data]
|
||||
return data
|
||||
|
||||
def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
|
||||
"""
|
||||
Export certificate as JSON.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the JSON file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[str]: JSON string if successful, None otherwise.
|
||||
"""
|
||||
json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
|
||||
if filepath:
|
||||
Path(filepath).write_text(json_str, encoding='utf-8')
|
||||
return None
|
||||
return json_str
|
||||
|
||||
def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
|
||||
"""
|
||||
Export certificate as PEM.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the PEM file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[str]: PEM string if successful, None otherwise.
|
||||
"""
|
||||
try:
|
||||
x509 = OpenSSL.crypto.load_certificate(
|
||||
OpenSSL.crypto.FILETYPE_ASN1,
|
||||
base64.b64decode(self._cert_info['raw_cert'])
|
||||
)
|
||||
pem_data = OpenSSL.crypto.dump_certificate(
|
||||
OpenSSL.crypto.FILETYPE_PEM,
|
||||
x509
|
||||
).decode('utf-8')
|
||||
|
||||
if filepath:
|
||||
Path(filepath).write_text(pem_data, encoding='utf-8')
|
||||
return None
|
||||
return pem_data
|
||||
except Exception as e:
|
||||
return None
|
||||
|
||||
def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
|
||||
"""
|
||||
Export certificate as DER.
|
||||
|
||||
Args:
|
||||
filepath (Optional[str]): Path to save the DER file (default: None).
|
||||
|
||||
Returns:
|
||||
Optional[bytes]: DER bytes if successful, None otherwise.
|
||||
"""
|
||||
try:
|
||||
der_data = base64.b64decode(self._cert_info['raw_cert'])
|
||||
if filepath:
|
||||
Path(filepath).write_bytes(der_data)
|
||||
return None
|
||||
return der_data
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
@property
|
||||
def issuer(self) -> Dict[str, str]:
|
||||
"""Get certificate issuer information."""
|
||||
return self._cert_info.get('issuer', {})
|
||||
|
||||
@property
|
||||
def subject(self) -> Dict[str, str]:
|
||||
"""Get certificate subject information."""
|
||||
return self._cert_info.get('subject', {})
|
||||
|
||||
@property
|
||||
def valid_from(self) -> str:
|
||||
"""Get certificate validity start date."""
|
||||
return self._cert_info.get('not_before', '')
|
||||
|
||||
@property
|
||||
def valid_until(self) -> str:
|
||||
"""Get certificate validity end date."""
|
||||
return self._cert_info.get('not_after', '')
|
||||
|
||||
@property
|
||||
def fingerprint(self) -> str:
|
||||
"""Get certificate fingerprint."""
|
||||
return self._cert_info.get('fingerprint', '')
|
||||
305
crawl4ai/user_agent_generator.py
Normal file
305
crawl4ai/user_agent_generator.py
Normal file
@@ -0,0 +1,305 @@
|
||||
import random
|
||||
from typing import Optional, Literal, List, Dict, Tuple
|
||||
import re
|
||||
|
||||
|
||||
class UserAgentGenerator:
|
||||
"""
|
||||
Generate random user agents with specified constraints.
|
||||
|
||||
Attributes:
|
||||
desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
|
||||
mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
|
||||
browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
|
||||
rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
|
||||
chrome_versions (list): A list of possible Chrome browser versions.
|
||||
firefox_versions (list): A list of possible Firefox browser versions.
|
||||
edge_versions (list): A list of possible Edge browser versions.
|
||||
safari_versions (list): A list of possible Safari browser versions.
|
||||
ios_versions (list): A list of possible iOS browser versions.
|
||||
android_versions (list): A list of possible Android browser versions.
|
||||
|
||||
Methods:
|
||||
generate_user_agent(
|
||||
platform: Literal["desktop", "mobile"] = "desktop",
|
||||
browser: str = "chrome",
|
||||
rendering_engine: str = "chrome_webkit",
|
||||
chrome_version: Optional[str] = None,
|
||||
firefox_version: Optional[str] = None,
|
||||
edge_version: Optional[str] = None,
|
||||
safari_version: Optional[str] = None,
|
||||
ios_version: Optional[str] = None,
|
||||
android_version: Optional[str] = None
|
||||
): Generates a random user agent string based on the specified parameters.
|
||||
"""
|
||||
def __init__(self):
|
||||
# Previous platform definitions remain the same...
|
||||
self.desktop_platforms = {
|
||||
"windows": {
|
||||
"10_64": "(Windows NT 10.0; Win64; x64)",
|
||||
"10_32": "(Windows NT 10.0; WOW64)",
|
||||
},
|
||||
"macos": {
|
||||
"intel": "(Macintosh; Intel Mac OS X 10_15_7)",
|
||||
"newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
|
||||
},
|
||||
"linux": {
|
||||
"generic": "(X11; Linux x86_64)",
|
||||
"ubuntu": "(X11; Ubuntu; Linux x86_64)",
|
||||
"chrome_os": "(X11; CrOS x86_64 14541.0.0)",
|
||||
}
|
||||
}
|
||||
|
||||
self.mobile_platforms = {
|
||||
"android": {
|
||||
"samsung": "(Linux; Android 13; SM-S901B)",
|
||||
"pixel": "(Linux; Android 12; Pixel 6)",
|
||||
"oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
|
||||
"xiaomi": "(Linux; Android 12; M2102J20SG)",
|
||||
},
|
||||
"ios": {
|
||||
"iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
|
||||
"ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
|
||||
}
|
||||
}
|
||||
|
||||
# Browser Combinations
|
||||
self.browser_combinations = {
|
||||
1: [
|
||||
["chrome"],
|
||||
["firefox"],
|
||||
["safari"],
|
||||
["edge"]
|
||||
],
|
||||
2: [
|
||||
["gecko", "firefox"],
|
||||
["chrome", "safari"],
|
||||
["webkit", "safari"]
|
||||
],
|
||||
3: [
|
||||
["chrome", "safari", "edge"],
|
||||
["webkit", "chrome", "safari"]
|
||||
]
|
||||
}
|
||||
|
||||
# Rendering Engines with versions
|
||||
self.rendering_engines = {
|
||||
"chrome_webkit": "AppleWebKit/537.36",
|
||||
"safari_webkit": "AppleWebKit/605.1.15",
|
||||
"gecko": [ # Added Gecko versions
|
||||
"Gecko/20100101",
|
||||
"Gecko/20100101", # Firefox usually uses this constant version
|
||||
"Gecko/2010010",
|
||||
]
|
||||
}
|
||||
|
||||
# Browser Versions
|
||||
self.chrome_versions = [
|
||||
"Chrome/119.0.6045.199",
|
||||
"Chrome/118.0.5993.117",
|
||||
"Chrome/117.0.5938.149",
|
||||
"Chrome/116.0.5845.187",
|
||||
"Chrome/115.0.5790.171",
|
||||
]
|
||||
|
||||
self.edge_versions = [
|
||||
"Edg/119.0.2151.97",
|
||||
"Edg/118.0.2088.76",
|
||||
"Edg/117.0.2045.47",
|
||||
"Edg/116.0.1938.81",
|
||||
"Edg/115.0.1901.203",
|
||||
]
|
||||
|
||||
self.safari_versions = [
|
||||
"Safari/537.36", # For Chrome-based
|
||||
"Safari/605.1.15",
|
||||
"Safari/604.1",
|
||||
"Safari/602.1",
|
||||
"Safari/601.5.17",
|
||||
]
|
||||
|
||||
# Added Firefox versions
|
||||
self.firefox_versions = [
|
||||
"Firefox/119.0",
|
||||
"Firefox/118.0.2",
|
||||
"Firefox/117.0.1",
|
||||
"Firefox/116.0",
|
||||
"Firefox/115.0.3",
|
||||
"Firefox/114.0.2",
|
||||
"Firefox/113.0.1",
|
||||
"Firefox/112.0",
|
||||
"Firefox/111.0.1",
|
||||
"Firefox/110.0",
|
||||
]
|
||||
|
||||
def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
|
||||
"""
|
||||
Get a valid combination of browser versions.
|
||||
|
||||
How it works:
|
||||
1. Check if the number of browsers is supported.
|
||||
2. Randomly choose a combination of browsers.
|
||||
3. Iterate through the combination and add browser versions.
|
||||
4. Return the browser stack.
|
||||
|
||||
Args:
|
||||
num_browsers: Number of browser specifications (1-3)
|
||||
|
||||
Returns:
|
||||
List[str]: A list of browser versions.
|
||||
"""
|
||||
if num_browsers not in self.browser_combinations:
|
||||
raise ValueError(f"Unsupported number of browsers: {num_browsers}")
|
||||
|
||||
combination = random.choice(self.browser_combinations[num_browsers])
|
||||
browser_stack = []
|
||||
|
||||
for browser in combination:
|
||||
if browser == "chrome":
|
||||
browser_stack.append(random.choice(self.chrome_versions))
|
||||
elif browser == "firefox":
|
||||
browser_stack.append(random.choice(self.firefox_versions))
|
||||
elif browser == "safari":
|
||||
browser_stack.append(random.choice(self.safari_versions))
|
||||
elif browser == "edge":
|
||||
browser_stack.append(random.choice(self.edge_versions))
|
||||
elif browser == "gecko":
|
||||
browser_stack.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif browser == "webkit":
|
||||
browser_stack.append(self.rendering_engines["chrome_webkit"])
|
||||
|
||||
return browser_stack
|
||||
|
||||
def generate(self,
|
||||
device_type: Optional[Literal['desktop', 'mobile']] = None,
|
||||
os_type: Optional[str] = None,
|
||||
device_brand: Optional[str] = None,
|
||||
browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
|
||||
num_browsers: int = 3) -> str:
|
||||
"""
|
||||
Generate a random user agent with specified constraints.
|
||||
|
||||
Args:
|
||||
device_type: 'desktop' or 'mobile'
|
||||
os_type: 'windows', 'macos', 'linux', 'android', 'ios'
|
||||
device_brand: Specific device brand
|
||||
browser_type: 'chrome', 'edge', 'safari', or 'firefox'
|
||||
num_browsers: Number of browser specifications (1-3)
|
||||
"""
|
||||
# Get platform string
|
||||
platform = self.get_random_platform(device_type, os_type, device_brand)
|
||||
|
||||
# Start with Mozilla
|
||||
components = ["Mozilla/5.0", platform]
|
||||
|
||||
# Add browser stack
|
||||
browser_stack = self.get_browser_stack(num_browsers)
|
||||
|
||||
# Add appropriate legacy token based on browser stack
|
||||
if "Firefox" in str(browser_stack):
|
||||
components.append(random.choice(self.rendering_engines["gecko"]))
|
||||
elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
|
||||
components.append(self.rendering_engines["chrome_webkit"])
|
||||
components.append("(KHTML, like Gecko)")
|
||||
|
||||
# Add browser versions
|
||||
components.extend(browser_stack)
|
||||
|
||||
return " ".join(components)
|
||||
|
||||
def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
|
||||
"""Generate both user agent and matching client hints"""
|
||||
user_agent = self.generate(**kwargs)
|
||||
client_hints = self.generate_client_hints(user_agent)
|
||||
return user_agent, client_hints
|
||||
|
||||
def get_random_platform(self, device_type, os_type, device_brand):
|
||||
"""Helper method to get random platform based on constraints"""
|
||||
platforms = self.desktop_platforms if device_type == 'desktop' else \
|
||||
self.mobile_platforms if device_type == 'mobile' else \
|
||||
{**self.desktop_platforms, **self.mobile_platforms}
|
||||
|
||||
if os_type:
|
||||
for platform_group in [self.desktop_platforms, self.mobile_platforms]:
|
||||
if os_type in platform_group:
|
||||
platforms = {os_type: platform_group[os_type]}
|
||||
break
|
||||
|
||||
os_key = random.choice(list(platforms.keys()))
|
||||
if device_brand and device_brand in platforms[os_key]:
|
||||
return platforms[os_key][device_brand]
|
||||
return random.choice(list(platforms[os_key].values()))
|
||||
|
||||
def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
|
||||
"""Parse a user agent string to extract browser and version information"""
|
||||
browsers = {
|
||||
'chrome': r'Chrome/(\d+)',
|
||||
'edge': r'Edg/(\d+)',
|
||||
'safari': r'Version/(\d+)',
|
||||
'firefox': r'Firefox/(\d+)'
|
||||
}
|
||||
|
||||
result = {}
|
||||
for browser, pattern in browsers.items():
|
||||
match = re.search(pattern, user_agent)
|
||||
if match:
|
||||
result[browser] = match.group(1)
|
||||
|
||||
return result
|
||||
|
||||
def generate_client_hints(self, user_agent: str) -> str:
|
||||
"""Generate Sec-CH-UA header value based on user agent string"""
|
||||
browsers = self.parse_user_agent(user_agent)
|
||||
|
||||
# Client hints components
|
||||
hints = []
|
||||
|
||||
# Handle different browser combinations
|
||||
if 'chrome' in browsers:
|
||||
hints.append(f'"Chromium";v="{browsers["chrome"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
if 'edge' in browsers:
|
||||
hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
|
||||
else:
|
||||
hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
|
||||
|
||||
elif 'firefox' in browsers:
|
||||
# Firefox doesn't typically send Sec-CH-UA
|
||||
return '""'
|
||||
|
||||
elif 'safari' in browsers:
|
||||
# Safari's format for client hints
|
||||
hints.append(f'"Safari";v="{browsers["safari"]}"')
|
||||
hints.append('"Not_A Brand";v="8"')
|
||||
|
||||
return ', '.join(hints)
|
||||
|
||||
# Example usage:
|
||||
if __name__ == "__main__":
|
||||
generator = UserAgentGenerator()
|
||||
print(generator.generate())
|
||||
|
||||
print("\nSingle browser (Chrome):")
|
||||
print(generator.generate(num_browsers=1, browser_type='chrome'))
|
||||
|
||||
print("\nTwo browsers (Gecko/Firefox):")
|
||||
print(generator.generate(num_browsers=2))
|
||||
|
||||
print("\nThree browsers (Chrome/Safari/Edge):")
|
||||
print(generator.generate(num_browsers=3))
|
||||
|
||||
print("\nFirefox on Linux:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='linux',
|
||||
browser_type='firefox',
|
||||
num_browsers=2
|
||||
))
|
||||
|
||||
print("\nChrome/Safari/Edge on Windows:")
|
||||
print(generator.generate(
|
||||
device_type='desktop',
|
||||
os_type='windows',
|
||||
num_browsers=3
|
||||
))
|
||||
@@ -1,4 +1,5 @@
|
||||
import time
|
||||
from urllib.parse import urlparse
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
from bs4 import BeautifulSoup, Comment, element, Tag, NavigableString
|
||||
import json
|
||||
@@ -6,7 +7,6 @@ import html
|
||||
import re
|
||||
import os
|
||||
import platform
|
||||
from .html2text import HTML2Text
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS
|
||||
from .config import *
|
||||
from pathlib import Path
|
||||
@@ -14,14 +14,102 @@ from typing import Dict, Any
|
||||
from urllib.parse import urljoin
|
||||
import requests
|
||||
from requests.exceptions import InvalidSchema
|
||||
import hashlib
|
||||
from typing import Optional, Tuple, Dict, Any
|
||||
import xxhash
|
||||
from colorama import Fore, Style, init
|
||||
import textwrap
|
||||
import cProfile
|
||||
import pstats
|
||||
from functools import wraps
|
||||
|
||||
class InvalidCSSSelectorError(Exception):
|
||||
pass
|
||||
|
||||
def create_box_message(message: str, type: str = "info", width: int = 120, add_newlines: bool = True, double_line: bool = False) -> str:
|
||||
"""
|
||||
Create a styled message box with colored borders and formatted text.
|
||||
|
||||
How it works:
|
||||
1. Determines box style and colors based on the message type (e.g., info, warning).
|
||||
2. Wraps text to fit within the specified width.
|
||||
3. Constructs a box using characters (single or double lines) with appropriate formatting.
|
||||
4. Adds optional newlines before and after the box.
|
||||
|
||||
Args:
|
||||
message (str): The message to display inside the box.
|
||||
type (str): Type of the message (e.g., "info", "warning", "error", "success"). Defaults to "info".
|
||||
width (int): Width of the box. Defaults to 120.
|
||||
add_newlines (bool): Whether to add newlines before and after the box. Defaults to True.
|
||||
double_line (bool): Whether to use double lines for the box border. Defaults to False.
|
||||
|
||||
Returns:
|
||||
str: A formatted string containing the styled message box.
|
||||
"""
|
||||
|
||||
init()
|
||||
|
||||
# Define border and text colors for different types
|
||||
styles = {
|
||||
"warning": (Fore.YELLOW, Fore.LIGHTYELLOW_EX, "⚠"),
|
||||
"info": (Fore.BLUE, Fore.LIGHTBLUE_EX, "ℹ"),
|
||||
"success": (Fore.GREEN, Fore.LIGHTGREEN_EX, "✓"),
|
||||
"error": (Fore.RED, Fore.LIGHTRED_EX, "×"),
|
||||
}
|
||||
|
||||
border_color, text_color, prefix = styles.get(type.lower(), styles["info"])
|
||||
|
||||
# Define box characters based on line style
|
||||
box_chars = {
|
||||
"single": ("─", "│", "┌", "┐", "└", "┘"),
|
||||
"double": ("═", "║", "╔", "╗", "╚", "╝")
|
||||
}
|
||||
line_style = "double" if double_line else "single"
|
||||
h_line, v_line, tl, tr, bl, br = box_chars[line_style]
|
||||
|
||||
# Process lines with lighter text color
|
||||
formatted_lines = []
|
||||
raw_lines = message.split('\n')
|
||||
|
||||
if raw_lines:
|
||||
first_line = f"{prefix} {raw_lines[0].strip()}"
|
||||
wrapped_first = textwrap.fill(first_line, width=width-4)
|
||||
formatted_lines.extend(wrapped_first.split('\n'))
|
||||
|
||||
for line in raw_lines[1:]:
|
||||
if line.strip():
|
||||
wrapped = textwrap.fill(f" {line.strip()}", width=width-4)
|
||||
formatted_lines.extend(wrapped.split('\n'))
|
||||
else:
|
||||
formatted_lines.append("")
|
||||
|
||||
# Create the box with colored borders and lighter text
|
||||
horizontal_line = h_line * (width - 1)
|
||||
box = [
|
||||
f"{border_color}{tl}{horizontal_line}{tr}",
|
||||
*[f"{border_color}{v_line}{text_color} {line:<{width-2}}{border_color}{v_line}" for line in formatted_lines],
|
||||
f"{border_color}{bl}{horizontal_line}{br}{Style.RESET_ALL}"
|
||||
]
|
||||
|
||||
result = "\n".join(box)
|
||||
if add_newlines:
|
||||
result = f"\n{result}\n"
|
||||
|
||||
return result
|
||||
|
||||
def calculate_semaphore_count():
|
||||
"""
|
||||
Calculate the optimal semaphore count based on system resources.
|
||||
|
||||
How it works:
|
||||
1. Determines the number of CPU cores and total system memory.
|
||||
2. Sets a base count as half of the available CPU cores.
|
||||
3. Limits the count based on memory, assuming 2GB per semaphore instance.
|
||||
4. Returns the minimum value between CPU and memory-based limits.
|
||||
|
||||
Returns:
|
||||
int: The calculated semaphore count.
|
||||
"""
|
||||
|
||||
cpu_count = os.cpu_count()
|
||||
memory_gb = get_system_memory() / (1024 ** 3) # Convert to GB
|
||||
base_count = max(1, cpu_count // 2)
|
||||
@@ -29,6 +117,21 @@ def calculate_semaphore_count():
|
||||
return min(base_count, memory_based_cap)
|
||||
|
||||
def get_system_memory():
|
||||
"""
|
||||
Get the total system memory in bytes.
|
||||
|
||||
How it works:
|
||||
1. Detects the operating system.
|
||||
2. Reads memory information from system-specific commands or files.
|
||||
3. Converts the memory to bytes for uniformity.
|
||||
|
||||
Returns:
|
||||
int: The total system memory in bytes.
|
||||
|
||||
Raises:
|
||||
OSError: If the operating system is unsupported.
|
||||
"""
|
||||
|
||||
system = platform.system()
|
||||
if system == "Linux":
|
||||
with open('/proc/meminfo', 'r') as mem:
|
||||
@@ -63,6 +166,18 @@ def get_system_memory():
|
||||
raise OSError("Unsupported operating system")
|
||||
|
||||
def get_home_folder():
|
||||
"""
|
||||
Get or create the home folder for Crawl4AI configuration and cache.
|
||||
|
||||
How it works:
|
||||
1. Uses environment variables or defaults to the user's home directory.
|
||||
2. Creates `.crawl4ai` and its subdirectories (`cache`, `models`) if they don't exist.
|
||||
3. Returns the path to the home folder.
|
||||
|
||||
Returns:
|
||||
str: The path to the Crawl4AI home folder.
|
||||
"""
|
||||
|
||||
home_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), ".crawl4ai")
|
||||
os.makedirs(home_folder, exist_ok=True)
|
||||
os.makedirs(f"{home_folder}/cache", exist_ok=True)
|
||||
@@ -133,6 +248,20 @@ def split_and_parse_json_objects(json_string):
|
||||
return parsed_objects, unparsed_segments
|
||||
|
||||
def sanitize_html(html):
|
||||
"""
|
||||
Sanitize an HTML string by escaping quotes.
|
||||
|
||||
How it works:
|
||||
1. Replaces all unwanted and special characters with an empty string.
|
||||
2. Escapes double and single quotes for safe usage.
|
||||
|
||||
Args:
|
||||
html (str): The HTML string to sanitize.
|
||||
|
||||
Returns:
|
||||
str: The sanitized HTML string.
|
||||
"""
|
||||
|
||||
# Replace all unwanted and special characters with an empty string
|
||||
sanitized_html = html
|
||||
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
||||
@@ -145,12 +274,17 @@ def sanitize_html(html):
|
||||
def sanitize_input_encode(text: str) -> str:
|
||||
"""Sanitize input to handle potential encoding issues."""
|
||||
try:
|
||||
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
|
||||
return text.encode('utf-8', errors='ignore').decode('utf-8')
|
||||
except UnicodeEncodeError as e:
|
||||
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
|
||||
# Fall back to ASCII if UTF-8 fails
|
||||
return text.encode('ascii', errors='ignore').decode('ascii')
|
||||
try:
|
||||
if not text:
|
||||
return ''
|
||||
# Attempt to encode and decode as UTF-8 to handle potential encoding issues
|
||||
return text.encode('utf-8', errors='ignore').decode('utf-8')
|
||||
except UnicodeEncodeError as e:
|
||||
print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
|
||||
# Fall back to ASCII if UTF-8 fails
|
||||
return text.encode('ascii', errors='ignore').decode('ascii')
|
||||
except Exception as e:
|
||||
raise ValueError(f"Error sanitizing input: {str(e)}") from e
|
||||
|
||||
def escape_json_string(s):
|
||||
"""
|
||||
@@ -181,51 +315,24 @@ def escape_json_string(s):
|
||||
|
||||
return s
|
||||
|
||||
class CustomHTML2Text_v0(HTML2Text):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.inside_pre = False
|
||||
self.inside_code = False
|
||||
|
||||
self.skip_internal_links = False
|
||||
self.single_line_break = False
|
||||
self.mark_code = False
|
||||
self.include_sup_sub = False
|
||||
self.body_width = 0
|
||||
self.ignore_mailto_links = True
|
||||
self.ignore_links = False
|
||||
self.escape_backslash = False
|
||||
self.escape_dot = False
|
||||
self.escape_plus = False
|
||||
self.escape_dash = False
|
||||
self.escape_snob = False
|
||||
|
||||
|
||||
def handle_tag(self, tag, attrs, start):
|
||||
if tag == 'pre':
|
||||
if start:
|
||||
self.o('```\n')
|
||||
self.inside_pre = True
|
||||
else:
|
||||
self.o('\n```')
|
||||
self.inside_pre = False
|
||||
elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
|
||||
pass
|
||||
|
||||
|
||||
# elif tag == 'code' and not self.inside_pre:
|
||||
# if start:
|
||||
# if not self.inside_pre:
|
||||
# self.o('`')
|
||||
# self.inside_code = True
|
||||
# else:
|
||||
# if not self.inside_pre:
|
||||
# self.o('`')
|
||||
# self.inside_code = False
|
||||
|
||||
super().handle_tag(tag, attrs, start)
|
||||
|
||||
def replace_inline_tags(soup, tags, only_text=False):
|
||||
"""
|
||||
Replace inline HTML tags with Markdown-style equivalents.
|
||||
|
||||
How it works:
|
||||
1. Maps specific tags (e.g., <b>, <i>) to Markdown syntax.
|
||||
2. Finds and replaces all occurrences of these tags in the provided BeautifulSoup object.
|
||||
3. Optionally replaces tags with their text content only.
|
||||
|
||||
Args:
|
||||
soup (BeautifulSoup): Parsed HTML content.
|
||||
tags (List[str]): List of tags to replace.
|
||||
only_text (bool): Whether to replace tags with plain text. Defaults to False.
|
||||
|
||||
Returns:
|
||||
BeautifulSoup: Updated BeautifulSoup object with replaced tags.
|
||||
"""
|
||||
|
||||
tag_replacements = {
|
||||
'b': lambda tag: f"**{tag.text}**",
|
||||
'i': lambda tag: f"*{tag.text}*",
|
||||
@@ -270,6 +377,26 @@ def replace_inline_tags(soup, tags, only_text=False):
|
||||
# return soup
|
||||
|
||||
def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD, css_selector = None, **kwargs):
|
||||
"""
|
||||
Extract structured content, media, and links from website HTML.
|
||||
|
||||
How it works:
|
||||
1. Parses the HTML content using BeautifulSoup.
|
||||
2. Extracts internal/external links and media (images, videos, audios).
|
||||
3. Cleans the content by removing unwanted tags and attributes.
|
||||
4. Converts cleaned HTML to Markdown.
|
||||
5. Collects metadata and returns the extracted information.
|
||||
|
||||
Args:
|
||||
url (str): The website URL.
|
||||
html (str): The HTML content of the website.
|
||||
word_count_threshold (int): Minimum word count for content inclusion. Defaults to MIN_WORD_THRESHOLD.
|
||||
css_selector (Optional[str]): CSS selector to extract specific content. Defaults to None.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: Extracted content including Markdown, cleaned HTML, media, links, and metadata.
|
||||
"""
|
||||
|
||||
try:
|
||||
if not html:
|
||||
return None
|
||||
@@ -740,6 +867,27 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
}
|
||||
|
||||
def extract_metadata(html, soup=None):
|
||||
"""
|
||||
Extract optimized content, media, and links from website HTML.
|
||||
|
||||
How it works:
|
||||
1. Similar to `get_content_of_website`, but optimized for performance.
|
||||
2. Filters and scores images for usefulness.
|
||||
3. Extracts contextual descriptions for media files.
|
||||
4. Handles excluded tags and CSS selectors.
|
||||
5. Cleans HTML and converts it to Markdown.
|
||||
|
||||
Args:
|
||||
url (str): The website URL.
|
||||
html (str): The HTML content of the website.
|
||||
word_count_threshold (int): Minimum word count for content inclusion. Defaults to MIN_WORD_THRESHOLD.
|
||||
css_selector (Optional[str]): CSS selector to extract specific content. Defaults to None.
|
||||
**kwargs: Additional options for customization.
|
||||
|
||||
Returns:
|
||||
Dict[str, Any]: Extracted content including Markdown, cleaned HTML, media, links, and metadata.
|
||||
"""
|
||||
|
||||
metadata = {}
|
||||
|
||||
if not html and not soup:
|
||||
@@ -786,12 +934,36 @@ def extract_metadata(html, soup=None):
|
||||
|
||||
return metadata
|
||||
|
||||
|
||||
def extract_xml_tags(string):
|
||||
"""
|
||||
Extracts XML tags from a string.
|
||||
|
||||
Args:
|
||||
string (str): The input string containing XML tags.
|
||||
|
||||
Returns:
|
||||
List[str]: A list of XML tags extracted from the input string.
|
||||
"""
|
||||
tags = re.findall(r'<(\w+)>', string)
|
||||
return list(set(tags))
|
||||
|
||||
def extract_xml_data(tags, string):
|
||||
"""
|
||||
Extract data for specified XML tags from a string.
|
||||
|
||||
How it works:
|
||||
1. Searches the string for each tag using regex.
|
||||
2. Extracts the content within the tags.
|
||||
3. Returns a dictionary of tag-content pairs.
|
||||
|
||||
Args:
|
||||
tags (List[str]): The list of XML tags to extract.
|
||||
string (str): The input string containing XML data.
|
||||
|
||||
Returns:
|
||||
Dict[str, str]: A dictionary with tag names as keys and extracted content as values.
|
||||
"""
|
||||
|
||||
data = {}
|
||||
|
||||
for tag in tags:
|
||||
@@ -804,7 +976,6 @@ def extract_xml_data(tags, string):
|
||||
|
||||
return data
|
||||
|
||||
# Function to perform the completion with exponential backoff
|
||||
def perform_completion_with_backoff(
|
||||
provider,
|
||||
prompt_with_variables,
|
||||
@@ -813,12 +984,36 @@ def perform_completion_with_backoff(
|
||||
base_url=None,
|
||||
**kwargs
|
||||
):
|
||||
"""
|
||||
Perform an API completion request with exponential backoff.
|
||||
|
||||
How it works:
|
||||
1. Sends a completion request to the API.
|
||||
2. Retries on rate-limit errors with exponential delays.
|
||||
3. Returns the API response or an error after all retries.
|
||||
|
||||
Args:
|
||||
provider (str): The name of the API provider.
|
||||
prompt_with_variables (str): The input prompt for the completion request.
|
||||
api_token (str): The API token for authentication.
|
||||
json_response (bool): Whether to request a JSON response. Defaults to False.
|
||||
base_url (Optional[str]): The base URL for the API. Defaults to None.
|
||||
**kwargs: Additional arguments for the API request.
|
||||
|
||||
Returns:
|
||||
dict: The API response or an error message after all retries.
|
||||
"""
|
||||
|
||||
from litellm import completion
|
||||
from litellm.exceptions import RateLimitError
|
||||
max_attempts = 3
|
||||
base_delay = 2 # Base delay in seconds, you can adjust this based on your needs
|
||||
|
||||
extra_args = {}
|
||||
extra_args = {
|
||||
"temperature": 0.01,
|
||||
'api_key': api_token,
|
||||
'base_url': base_url
|
||||
}
|
||||
if json_response:
|
||||
extra_args["response_format"] = { "type": "json_object" }
|
||||
|
||||
@@ -827,14 +1022,12 @@ def perform_completion_with_backoff(
|
||||
|
||||
for attempt in range(max_attempts):
|
||||
try:
|
||||
|
||||
response =completion(
|
||||
model=provider,
|
||||
messages=[
|
||||
{"role": "user", "content": prompt_with_variables}
|
||||
],
|
||||
temperature=0.01,
|
||||
api_key=api_token,
|
||||
base_url=base_url,
|
||||
**extra_args
|
||||
)
|
||||
return response # Return the successful response
|
||||
@@ -856,6 +1049,25 @@ def perform_completion_with_backoff(
|
||||
}]
|
||||
|
||||
def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None, base_url = None):
|
||||
"""
|
||||
Extract content blocks from website HTML using an AI provider.
|
||||
|
||||
How it works:
|
||||
1. Prepares a prompt by sanitizing and escaping HTML.
|
||||
2. Sends the prompt to an AI provider with optional retries.
|
||||
3. Parses the response to extract structured blocks or errors.
|
||||
|
||||
Args:
|
||||
url (str): The website URL.
|
||||
html (str): The HTML content of the website.
|
||||
provider (str): The AI provider for content extraction. Defaults to DEFAULT_PROVIDER.
|
||||
api_token (Optional[str]): The API token for authentication. Defaults to None.
|
||||
base_url (Optional[str]): The base URL for the API. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List[dict]: A list of extracted content blocks.
|
||||
"""
|
||||
|
||||
# api_token = os.getenv('GROQ_API_KEY', None) if not api_token else api_token
|
||||
api_token = PROVIDER_MODELS.get(provider, None) if not api_token else api_token
|
||||
|
||||
@@ -892,6 +1104,23 @@ def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None, bas
|
||||
return blocks
|
||||
|
||||
def extract_blocks_batch(batch_data, provider = "groq/llama3-70b-8192", api_token = None):
|
||||
"""
|
||||
Extract content blocks from a batch of website HTMLs.
|
||||
|
||||
How it works:
|
||||
1. Prepares prompts for each URL and HTML pair.
|
||||
2. Sends the prompts to the AI provider in a batch request.
|
||||
3. Parses the responses to extract structured blocks or errors.
|
||||
|
||||
Args:
|
||||
batch_data (List[Tuple[str, str]]): A list of (URL, HTML) pairs.
|
||||
provider (str): The AI provider for content extraction. Defaults to "groq/llama3-70b-8192".
|
||||
api_token (Optional[str]): The API token for authentication. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List[dict]: A list of extracted content blocks from all batch items.
|
||||
"""
|
||||
|
||||
api_token = os.getenv('GROQ_API_KEY', None) if not api_token else api_token
|
||||
from litellm import batch_completion
|
||||
messages = []
|
||||
@@ -964,6 +1193,25 @@ def merge_chunks_based_on_token_threshold(chunks, token_threshold):
|
||||
return merged_sections
|
||||
|
||||
def process_sections(url: str, sections: list, provider: str, api_token: str, base_url=None) -> list:
|
||||
"""
|
||||
Process sections of HTML content sequentially or in parallel.
|
||||
|
||||
How it works:
|
||||
1. Sequentially processes sections with delays for "groq/" providers.
|
||||
2. Uses ThreadPoolExecutor for parallel processing with other providers.
|
||||
3. Extracts content blocks for each section.
|
||||
|
||||
Args:
|
||||
url (str): The website URL.
|
||||
sections (List[str]): The list of HTML sections to process.
|
||||
provider (str): The AI provider for content extraction.
|
||||
api_token (str): The API token for authentication.
|
||||
base_url (Optional[str]): The base URL for the API. Defaults to None.
|
||||
|
||||
Returns:
|
||||
List[dict]: The list of extracted content blocks from all sections.
|
||||
"""
|
||||
|
||||
extracted_content = []
|
||||
if provider.startswith("groq/"):
|
||||
# Sequential processing with a delay
|
||||
@@ -980,6 +1228,24 @@ def process_sections(url: str, sections: list, provider: str, api_token: str, ba
|
||||
return extracted_content
|
||||
|
||||
def wrap_text(draw, text, font, max_width):
|
||||
"""
|
||||
Wrap text to fit within a specified width for rendering.
|
||||
|
||||
How it works:
|
||||
1. Splits the text into words.
|
||||
2. Constructs lines that fit within the maximum width using the provided font.
|
||||
3. Returns the wrapped text as a single string.
|
||||
|
||||
Args:
|
||||
draw (ImageDraw.Draw): The drawing context for measuring text size.
|
||||
text (str): The text to wrap.
|
||||
font (ImageFont.FreeTypeFont): The font to use for measuring text size.
|
||||
max_width (int): The maximum width for each line.
|
||||
|
||||
Returns:
|
||||
str: The wrapped text.
|
||||
"""
|
||||
|
||||
# Wrap the text to fit within the specified width
|
||||
lines = []
|
||||
words = text.split()
|
||||
@@ -991,9 +1257,69 @@ def wrap_text(draw, text, font, max_width):
|
||||
return '\n'.join(lines)
|
||||
|
||||
def format_html(html_string):
|
||||
soup = BeautifulSoup(html_string, 'html.parser')
|
||||
"""
|
||||
Prettify an HTML string using BeautifulSoup.
|
||||
|
||||
How it works:
|
||||
1. Parses the HTML string with BeautifulSoup.
|
||||
2. Formats the HTML with proper indentation.
|
||||
3. Returns the prettified HTML string.
|
||||
|
||||
Args:
|
||||
html_string (str): The HTML string to format.
|
||||
|
||||
Returns:
|
||||
str: The prettified HTML string.
|
||||
"""
|
||||
|
||||
soup = BeautifulSoup(html_string, 'lxml.parser')
|
||||
return soup.prettify()
|
||||
|
||||
def fast_format_html(html_string):
|
||||
"""
|
||||
A fast HTML formatter that uses string operations instead of parsing.
|
||||
|
||||
Args:
|
||||
html_string (str): The HTML string to format
|
||||
|
||||
Returns:
|
||||
str: The formatted HTML string
|
||||
"""
|
||||
# Initialize variables
|
||||
indent = 0
|
||||
indent_str = " " # Two spaces for indentation
|
||||
formatted = []
|
||||
in_content = False
|
||||
|
||||
# Split by < and > to separate tags and content
|
||||
parts = html_string.replace('>', '>\n').replace('<', '\n<').split('\n')
|
||||
|
||||
for part in parts:
|
||||
if not part.strip():
|
||||
continue
|
||||
|
||||
# Handle closing tags
|
||||
if part.startswith('</'):
|
||||
indent -= 1
|
||||
formatted.append(indent_str * indent + part)
|
||||
|
||||
# Handle self-closing tags
|
||||
elif part.startswith('<') and part.endswith('/>'):
|
||||
formatted.append(indent_str * indent + part)
|
||||
|
||||
# Handle opening tags
|
||||
elif part.startswith('<'):
|
||||
formatted.append(indent_str * indent + part)
|
||||
indent += 1
|
||||
|
||||
# Handle content between tags
|
||||
else:
|
||||
content = part.strip()
|
||||
if content:
|
||||
formatted.append(indent_str * indent + content)
|
||||
|
||||
return '\n'.join(formatted)
|
||||
|
||||
def normalize_url(href, base_url):
|
||||
"""Normalize URLs to ensure consistent format"""
|
||||
from urllib.parse import urljoin, urlparse
|
||||
@@ -1042,23 +1368,94 @@ def normalize_url_tmp(href, base_url):
|
||||
|
||||
return href.strip()
|
||||
|
||||
def is_external_url(url, base_domain):
|
||||
"""Determine if a URL is external"""
|
||||
special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
|
||||
if any(url.lower().startswith(proto) for proto in special_protocols):
|
||||
def get_base_domain(url: str) -> str:
|
||||
"""
|
||||
Extract the base domain from a given URL, handling common edge cases.
|
||||
|
||||
How it works:
|
||||
1. Parses the URL to extract the domain.
|
||||
2. Removes the port number and 'www' prefix.
|
||||
3. Handles special domains (e.g., 'co.uk') to extract the correct base.
|
||||
|
||||
Args:
|
||||
url (str): The URL to extract the base domain from.
|
||||
|
||||
Returns:
|
||||
str: The extracted base domain or an empty string if parsing fails.
|
||||
"""
|
||||
try:
|
||||
# Get domain from URL
|
||||
domain = urlparse(url).netloc.lower()
|
||||
if not domain:
|
||||
return ""
|
||||
|
||||
# Remove port if present
|
||||
domain = domain.split(':')[0]
|
||||
|
||||
# Remove www
|
||||
domain = re.sub(r'^www\.', '', domain)
|
||||
|
||||
# Extract last two parts of domain (handles co.uk etc)
|
||||
parts = domain.split('.')
|
||||
if len(parts) > 2 and parts[-2] in {
|
||||
'co', 'com', 'org', 'gov', 'edu', 'net',
|
||||
'mil', 'int', 'ac', 'ad', 'ae', 'af', 'ag'
|
||||
}:
|
||||
return '.'.join(parts[-3:])
|
||||
|
||||
return '.'.join(parts[-2:])
|
||||
except Exception:
|
||||
return ""
|
||||
|
||||
def is_external_url(url: str, base_domain: str) -> bool:
|
||||
"""
|
||||
Extract the base domain from a given URL, handling common edge cases.
|
||||
|
||||
How it works:
|
||||
1. Parses the URL to extract the domain.
|
||||
2. Removes the port number and 'www' prefix.
|
||||
3. Handles special domains (e.g., 'co.uk') to extract the correct base.
|
||||
|
||||
Args:
|
||||
url (str): The URL to extract the base domain from.
|
||||
|
||||
Returns:
|
||||
str: The extracted base domain or an empty string if parsing fails.
|
||||
"""
|
||||
special = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
|
||||
if any(url.lower().startswith(p) for p in special):
|
||||
return True
|
||||
|
||||
try:
|
||||
# Handle URLs with protocol
|
||||
if url.startswith(('http://', 'https://')):
|
||||
url_domain = url.split('/')[2]
|
||||
return base_domain.lower() not in url_domain.lower()
|
||||
except IndexError:
|
||||
return False
|
||||
parsed = urlparse(url)
|
||||
if not parsed.netloc: # Relative URL
|
||||
return False
|
||||
|
||||
# Strip 'www.' from both domains for comparison
|
||||
url_domain = parsed.netloc.lower().replace('www.', '')
|
||||
base = base_domain.lower().replace('www.', '')
|
||||
|
||||
return False
|
||||
# Check if URL domain ends with base domain
|
||||
return not url_domain.endswith(base)
|
||||
except Exception:
|
||||
return False
|
||||
|
||||
def clean_tokens(tokens: list[str]) -> list[str]:
|
||||
"""
|
||||
Clean a list of tokens by removing noise, stop words, and short tokens.
|
||||
|
||||
How it works:
|
||||
1. Defines a set of noise words and stop words.
|
||||
2. Filters tokens based on length and exclusion criteria.
|
||||
3. Excludes tokens starting with certain symbols (e.g., "↑", "▲").
|
||||
|
||||
Args:
|
||||
tokens (list[str]): The list of tokens to clean.
|
||||
|
||||
Returns:
|
||||
list[str]: The cleaned list of tokens.
|
||||
"""
|
||||
|
||||
# Set of tokens to remove
|
||||
noise = {'ccp', 'up', '↑', '▲', '⬆️', 'a', 'an', 'at', 'by', 'in', 'of', 'on', 'to', 'the'}
|
||||
|
||||
@@ -1113,6 +1510,50 @@ def clean_tokens(tokens: list[str]) -> list[str]:
|
||||
and not token.startswith('▲')
|
||||
and not token.startswith('⬆')]
|
||||
|
||||
def profile_and_time(func):
|
||||
"""
|
||||
Decorator to profile a function's execution time and performance.
|
||||
|
||||
How it works:
|
||||
1. Records the start time before executing the function.
|
||||
2. Profiles the function's execution using `cProfile`.
|
||||
3. Prints the elapsed time and profiling statistics.
|
||||
|
||||
Args:
|
||||
func (Callable): The function to decorate.
|
||||
|
||||
Returns:
|
||||
Callable: The decorated function with profiling and timing enabled.
|
||||
"""
|
||||
|
||||
@wraps(func)
|
||||
def wrapper(self, *args, **kwargs):
|
||||
# Start timer
|
||||
start_time = time.perf_counter()
|
||||
|
||||
# Setup profiler
|
||||
profiler = cProfile.Profile()
|
||||
profiler.enable()
|
||||
|
||||
# Run function
|
||||
result = func(self, *args, **kwargs)
|
||||
|
||||
# Stop profiler
|
||||
profiler.disable()
|
||||
|
||||
# Calculate elapsed time
|
||||
elapsed_time = time.perf_counter() - start_time
|
||||
|
||||
# Print timing
|
||||
print(f"[PROFILER] Scraping completed in {elapsed_time:.2f} seconds")
|
||||
|
||||
# Print profiling stats
|
||||
stats = pstats.Stats(profiler)
|
||||
stats.sort_stats('cumulative') # Sort by cumulative time
|
||||
stats.print_stats(20) # Print top 20 time-consuming functions
|
||||
|
||||
return result
|
||||
return wrapper
|
||||
|
||||
def generate_content_hash(content: str) -> str:
|
||||
"""Generate a unique hash for content"""
|
||||
@@ -1126,7 +1567,8 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
|
||||
'cleaned': 'cleaned_html',
|
||||
'markdown': 'markdown_content',
|
||||
'extracted': 'extracted_content',
|
||||
'screenshots': 'screenshots'
|
||||
'screenshots': 'screenshots',
|
||||
'screenshot': 'screenshots'
|
||||
}
|
||||
|
||||
content_paths = {}
|
||||
@@ -1135,4 +1577,63 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
|
||||
os.makedirs(path, exist_ok=True)
|
||||
content_paths[key] = path
|
||||
|
||||
return content_paths
|
||||
return content_paths
|
||||
|
||||
def get_error_context(exc_info, context_lines: int = 5):
|
||||
"""
|
||||
Extract error context with more reliable line number tracking.
|
||||
|
||||
Args:
|
||||
exc_info: The exception info from sys.exc_info()
|
||||
context_lines: Number of lines to show before and after the error
|
||||
|
||||
Returns:
|
||||
dict: Error context information
|
||||
"""
|
||||
import traceback
|
||||
import linecache
|
||||
import os
|
||||
|
||||
# Get the full traceback
|
||||
tb = traceback.extract_tb(exc_info[2])
|
||||
|
||||
# Get the last frame (where the error occurred)
|
||||
last_frame = tb[-1]
|
||||
filename = last_frame.filename
|
||||
line_no = last_frame.lineno
|
||||
func_name = last_frame.name
|
||||
|
||||
# Get the source code context using linecache
|
||||
# This is more reliable than inspect.getsourcelines
|
||||
context_start = max(1, line_no - context_lines)
|
||||
context_end = line_no + context_lines + 1
|
||||
|
||||
# Build the context lines with line numbers
|
||||
context_lines = []
|
||||
for i in range(context_start, context_end):
|
||||
line = linecache.getline(filename, i)
|
||||
if line:
|
||||
# Remove any trailing whitespace/newlines and add the pointer for error line
|
||||
line = line.rstrip()
|
||||
pointer = '→' if i == line_no else ' '
|
||||
context_lines.append(f"{i:4d} {pointer} {line}")
|
||||
|
||||
# Join the lines with newlines
|
||||
code_context = '\n'.join(context_lines)
|
||||
|
||||
# Get relative path for cleaner output
|
||||
try:
|
||||
rel_path = os.path.relpath(filename)
|
||||
except ValueError:
|
||||
# Fallback if relpath fails (can happen on Windows with different drives)
|
||||
rel_path = filename
|
||||
|
||||
return {
|
||||
"filename": rel_path,
|
||||
"line_no": line_no,
|
||||
"function": func_name,
|
||||
"code_context": code_context
|
||||
}
|
||||
|
||||
|
||||
|
||||
0
crawl4ai/utils.scraping.py
Normal file
0
crawl4ai/utils.scraping.py
Normal file
@@ -10,7 +10,7 @@ from .extraction_strategy import *
|
||||
from .crawler_strategy import *
|
||||
from typing import List
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .content_scrapping_strategy import WebScrapingStrategy
|
||||
from .content_scraping_strategy import WebScrapingStrategy
|
||||
from .config import *
|
||||
import warnings
|
||||
import json
|
||||
|
||||
@@ -1,19 +0,0 @@
|
||||
# Railway Deployment
|
||||
|
||||
## Quick Deploy
|
||||
[](https://railway.app/template/crawl4ai)
|
||||
|
||||
## Manual Setup
|
||||
1. Fork this repository
|
||||
2. Create a new Railway project
|
||||
3. Configure environment variables:
|
||||
- `INSTALL_TYPE`: basic or all
|
||||
- `ENABLE_GPU`: true/false
|
||||
4. Deploy!
|
||||
|
||||
## Configuration
|
||||
See `railway.toml` for:
|
||||
- Memory limits
|
||||
- Health checks
|
||||
- Restart policies
|
||||
- Scaling options
|
||||
@@ -1,33 +0,0 @@
|
||||
{
|
||||
"name": "Crawl4AI",
|
||||
"description": "LLM Friendly Web Crawler & Scraper",
|
||||
"render": {
|
||||
"dockerfile": {
|
||||
"path": "Dockerfile"
|
||||
}
|
||||
},
|
||||
"env": [
|
||||
{
|
||||
"key": "INSTALL_TYPE",
|
||||
"description": "Installation type (basic/all)",
|
||||
"default": "basic",
|
||||
"required": true
|
||||
},
|
||||
{
|
||||
"key": "ENABLE_GPU",
|
||||
"description": "Enable GPU support",
|
||||
"default": "false",
|
||||
"required": false
|
||||
}
|
||||
],
|
||||
"services": [
|
||||
{
|
||||
"name": "web",
|
||||
"dockerfile": "./Dockerfile",
|
||||
"healthcheck": {
|
||||
"path": "/health",
|
||||
"port": 11235
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
@@ -1,18 +0,0 @@
|
||||
# railway.toml
|
||||
[build]
|
||||
builder = "DOCKERFILE"
|
||||
dockerfilePath = "Dockerfile"
|
||||
|
||||
[deploy]
|
||||
startCommand = "uvicorn main:app --host 0.0.0.0 --port $PORT"
|
||||
healthcheckPath = "/health"
|
||||
restartPolicyType = "ON_FAILURE"
|
||||
restartPolicyMaxRetries = 3
|
||||
|
||||
[deploy.memory]
|
||||
soft = 2048 # 2GB min for Playwright
|
||||
hard = 4096 # 4GB max
|
||||
|
||||
[deploy.scaling]
|
||||
min = 1
|
||||
max = 1
|
||||
@@ -1,27 +0,0 @@
|
||||
services:
|
||||
crawl4ai:
|
||||
image: unclecode/crawl4ai:basic # Pull image from Docker Hub
|
||||
ports:
|
||||
- "11235:11235" # FastAPI server
|
||||
- "8000:8000" # Alternative port
|
||||
- "9222:9222" # Browser debugging
|
||||
- "8080:8080" # Additional port
|
||||
environment:
|
||||
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
|
||||
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
|
||||
volumes:
|
||||
- /dev/shm:/dev/shm # Shared memory for browser operations
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G
|
||||
reservations:
|
||||
memory: 1G
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s
|
||||
@@ -1,33 +0,0 @@
|
||||
services:
|
||||
crawl4ai:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
PYTHON_VERSION: 3.10
|
||||
INSTALL_TYPE: all
|
||||
ENABLE_GPU: false
|
||||
ports:
|
||||
- "11235:11235" # FastAPI server
|
||||
- "8000:8000" # Alternative port
|
||||
- "9222:9222" # Browser debugging
|
||||
- "8080:8080" # Additional port
|
||||
environment:
|
||||
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional API token
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-} # Optional OpenAI API key
|
||||
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-} # Optional Claude API key
|
||||
volumes:
|
||||
- /dev/shm:/dev/shm # Shared memory for browser operations
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G
|
||||
reservations:
|
||||
memory: 1G
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s
|
||||
@@ -1,41 +1,46 @@
|
||||
services:
|
||||
crawl4ai:
|
||||
# Local build services for different platforms
|
||||
crawl4ai-amd64:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
PYTHON_VERSION: 3.10
|
||||
INSTALL_TYPE: all
|
||||
PYTHON_VERSION: "3.10"
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
|
||||
ENABLE_GPU: false
|
||||
profiles: ["local"]
|
||||
ports:
|
||||
- "11235:11235"
|
||||
- "8000:8000"
|
||||
- "9222:9222"
|
||||
- "8080:8080"
|
||||
environment:
|
||||
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
|
||||
- CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
|
||||
volumes:
|
||||
- /dev/shm:/dev/shm
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G
|
||||
reservations:
|
||||
memory: 1G
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s
|
||||
platforms:
|
||||
- linux/amd64
|
||||
profiles: ["local-amd64"]
|
||||
extends: &base-config
|
||||
file: docker-compose.yml
|
||||
service: base-config
|
||||
|
||||
crawl4ai-hub:
|
||||
image: unclecode/crawl4ai:basic
|
||||
profiles: ["hub"]
|
||||
crawl4ai-arm64:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile
|
||||
args:
|
||||
PYTHON_VERSION: "3.10"
|
||||
INSTALL_TYPE: ${INSTALL_TYPE:-basic}
|
||||
ENABLE_GPU: false
|
||||
platforms:
|
||||
- linux/arm64
|
||||
profiles: ["local-arm64"]
|
||||
extends: *base-config
|
||||
|
||||
# Hub services for different platforms and versions
|
||||
crawl4ai-hub-amd64:
|
||||
image: unclecode/crawl4ai:${VERSION:-basic}-amd64
|
||||
profiles: ["hub-amd64"]
|
||||
extends: *base-config
|
||||
|
||||
crawl4ai-hub-arm64:
|
||||
image: unclecode/crawl4ai:${VERSION:-basic}-arm64
|
||||
profiles: ["hub-arm64"]
|
||||
extends: *base-config
|
||||
|
||||
# Base configuration to be extended
|
||||
base-config:
|
||||
ports:
|
||||
- "11235:11235"
|
||||
- "8000:8000"
|
||||
@@ -59,4 +64,4 @@ services:
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 40s
|
||||
start_period: 40s
|
||||
189
docs/deprecated/docker-deployment.md
Normal file
189
docs/deprecated/docker-deployment.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# 🐳 Using Docker (Legacy)
|
||||
|
||||
Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
|
||||
|
||||
---
|
||||
|
||||
<details>
|
||||
<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
|
||||
|
||||
Choose the appropriate image based on your platform and needs:
|
||||
|
||||
### For AMD64 (Regular Linux/Windows):
|
||||
```bash
|
||||
# Basic version (recommended)
|
||||
docker pull unclecode/crawl4ai:basic-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
|
||||
|
||||
# Full ML/LLM support
|
||||
docker pull unclecode/crawl4ai:all-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
|
||||
|
||||
# With GPU support
|
||||
docker pull unclecode/crawl4ai:gpu-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
|
||||
```
|
||||
|
||||
### For ARM64 (M1/M2 Macs, ARM servers):
|
||||
```bash
|
||||
# Basic version (recommended)
|
||||
docker pull unclecode/crawl4ai:basic-arm64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
|
||||
|
||||
# Full ML/LLM support
|
||||
docker pull unclecode/crawl4ai:all-arm64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
|
||||
|
||||
# With GPU support
|
||||
docker pull unclecode/crawl4ai:gpu-arm64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
|
||||
```
|
||||
|
||||
Need more memory? Add `--shm-size`:
|
||||
```bash
|
||||
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
|
||||
```
|
||||
|
||||
Test the installation:
|
||||
```bash
|
||||
curl http://localhost:11235/health
|
||||
```
|
||||
|
||||
### For Raspberry Pi (32-bit) (coming soon):
|
||||
```bash
|
||||
# Pull and run basic version (recommended for Raspberry Pi)
|
||||
docker pull unclecode/crawl4ai:basic-armv7
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
|
||||
|
||||
# With increased shared memory if needed
|
||||
docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
|
||||
```
|
||||
|
||||
Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
|
||||
|
||||
Build the image locally based on your platform:
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
|
||||
# For AMD64 (Regular Linux/Windows)
|
||||
docker build --platform linux/amd64 \
|
||||
--tag crawl4ai:local \
|
||||
--build-arg INSTALL_TYPE=basic \
|
||||
.
|
||||
|
||||
# For ARM64 (M1/M2 Macs, ARM servers)
|
||||
docker build --platform linux/arm64 \
|
||||
--tag crawl4ai:local \
|
||||
--build-arg INSTALL_TYPE=basic \
|
||||
.
|
||||
```
|
||||
|
||||
Build options:
|
||||
- INSTALL_TYPE=basic (default): Basic crawling features
|
||||
- INSTALL_TYPE=all: Full ML/LLM support
|
||||
- ENABLE_GPU=true: Add GPU support
|
||||
|
||||
Example with all options:
|
||||
```bash
|
||||
docker build --platform linux/amd64 \
|
||||
--tag crawl4ai:local \
|
||||
--build-arg INSTALL_TYPE=all \
|
||||
--build-arg ENABLE_GPU=true \
|
||||
.
|
||||
```
|
||||
|
||||
Run your local build:
|
||||
```bash
|
||||
# Regular run
|
||||
docker run -p 11235:11235 crawl4ai:local
|
||||
|
||||
# With increased shared memory
|
||||
docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
|
||||
```
|
||||
|
||||
Test the installation:
|
||||
```bash
|
||||
curl http://localhost:11235/health
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
|
||||
|
||||
Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
|
||||
|
||||
```bash
|
||||
# Clone the repository
|
||||
git clone https://github.com/unclecode/crawl4ai.git
|
||||
cd crawl4ai
|
||||
```
|
||||
|
||||
### For AMD64 (Regular Linux/Windows):
|
||||
```bash
|
||||
# Build and run locally
|
||||
docker-compose --profile local-amd64 up
|
||||
|
||||
# Run from Docker Hub
|
||||
VERSION=basic docker-compose --profile hub-amd64 up # Basic version
|
||||
VERSION=all docker-compose --profile hub-amd64 up # Full ML/LLM support
|
||||
VERSION=gpu docker-compose --profile hub-amd64 up # GPU support
|
||||
```
|
||||
|
||||
### For ARM64 (M1/M2 Macs, ARM servers):
|
||||
```bash
|
||||
# Build and run locally
|
||||
docker-compose --profile local-arm64 up
|
||||
|
||||
# Run from Docker Hub
|
||||
VERSION=basic docker-compose --profile hub-arm64 up # Basic version
|
||||
VERSION=all docker-compose --profile hub-arm64 up # Full ML/LLM support
|
||||
VERSION=gpu docker-compose --profile hub-arm64 up # GPU support
|
||||
```
|
||||
|
||||
Environment variables (optional):
|
||||
```bash
|
||||
# Create a .env file
|
||||
CRAWL4AI_API_TOKEN=your_token
|
||||
OPENAI_API_KEY=your_openai_key
|
||||
CLAUDE_API_KEY=your_claude_key
|
||||
```
|
||||
|
||||
The compose file includes:
|
||||
- Memory management (4GB limit, 1GB reserved)
|
||||
- Shared memory volume for browser support
|
||||
- Health checks
|
||||
- Auto-restart policy
|
||||
- All necessary port mappings
|
||||
|
||||
Test the installation:
|
||||
```bash
|
||||
curl http://localhost:11235/health
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>🚀 <strong>One-Click Deployment</strong></summary>
|
||||
|
||||
Deploy your own instance of Crawl4AI with one click:
|
||||
|
||||
[](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
|
||||
|
||||
> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
|
||||
|
||||
The deploy will:
|
||||
- Set up a Docker container with Crawl4AI
|
||||
- Configure Playwright and all dependencies
|
||||
- Start the FastAPI server on port `11235`
|
||||
- Set up health checks and auto-deployment
|
||||
|
||||
</details>
|
||||
114
docs/examples/amazon_product_extraction_direct_url.py
Normal file
114
docs/examples/amazon_product_extraction_direct_url.py
Normal file
@@ -0,0 +1,114 @@
|
||||
"""
|
||||
This example demonstrates how to use JSON CSS extraction to scrape product information
|
||||
from Amazon search results. It shows how to extract structured data like product titles,
|
||||
prices, ratings, and other details using CSS selectors.
|
||||
"""
|
||||
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
import json
|
||||
|
||||
async def extract_amazon_products():
|
||||
# Initialize browser config
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True
|
||||
)
|
||||
|
||||
# Initialize crawler config with JSON CSS extraction strategy
|
||||
crawler_config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(
|
||||
schema={
|
||||
"name": "Amazon Product Search Results",
|
||||
"baseSelector": "[data-component-type='s-search-result']",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asin",
|
||||
"selector": "",
|
||||
"type": "attribute",
|
||||
"attribute": "data-asin"
|
||||
},
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2 a span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "url",
|
||||
"selector": "h2 a",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "image",
|
||||
"selector": ".s-image",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "rating",
|
||||
"selector": ".a-icon-star-small .a-icon-alt",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "reviews_count",
|
||||
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": ".a-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "original_price",
|
||||
"selector": ".a-price.a-text-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "sponsored",
|
||||
"selector": ".puis-sponsored-label-text",
|
||||
"type": "exists"
|
||||
},
|
||||
{
|
||||
"name": "delivery_info",
|
||||
"selector": "[data-cy='delivery-recipe'] .a-color-base",
|
||||
"type": "text",
|
||||
"multiple": True
|
||||
}
|
||||
]
|
||||
}
|
||||
)
|
||||
)
|
||||
|
||||
# Example search URL (you should replace with your actual Amazon URL)
|
||||
url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
|
||||
|
||||
# Use context manager for proper resource handling
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Extract the data
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
|
||||
# Process and print the results
|
||||
if result and result.extracted_content:
|
||||
# Parse the JSON string into a list of products
|
||||
products = json.loads(result.extracted_content)
|
||||
|
||||
# Process each product in the list
|
||||
for product in products:
|
||||
print("\nProduct Details:")
|
||||
print(f"ASIN: {product.get('asin')}")
|
||||
print(f"Title: {product.get('title')}")
|
||||
print(f"Price: {product.get('price')}")
|
||||
print(f"Original Price: {product.get('original_price')}")
|
||||
print(f"Rating: {product.get('rating')}")
|
||||
print(f"Reviews: {product.get('reviews_count')}")
|
||||
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
|
||||
if product.get('delivery_info'):
|
||||
print(f"Delivery: {' '.join(product['delivery_info'])}")
|
||||
print("-" * 80)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(extract_amazon_products())
|
||||
145
docs/examples/amazon_product_extraction_using_hooks.py
Normal file
145
docs/examples/amazon_product_extraction_using_hooks.py
Normal file
@@ -0,0 +1,145 @@
|
||||
"""
|
||||
This example demonstrates how to use JSON CSS extraction to scrape product information
|
||||
from Amazon search results. It shows how to extract structured data like product titles,
|
||||
prices, ratings, and other details using CSS selectors.
|
||||
"""
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
import json
|
||||
from playwright.async_api import Page, BrowserContext
|
||||
|
||||
async def extract_amazon_products():
|
||||
# Initialize browser config
|
||||
browser_config = BrowserConfig(
|
||||
# browser_type="chromium",
|
||||
headless=True
|
||||
)
|
||||
|
||||
# Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
|
||||
extraction_strategy=JsonCssExtractionStrategy(
|
||||
schema={
|
||||
"name": "Amazon Product Search Results",
|
||||
"baseSelector": "[data-component-type='s-search-result']",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asin",
|
||||
"selector": "",
|
||||
"type": "attribute",
|
||||
"attribute": "data-asin"
|
||||
},
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2 a span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "url",
|
||||
"selector": "h2 a",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "image",
|
||||
"selector": ".s-image",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "rating",
|
||||
"selector": ".a-icon-star-small .a-icon-alt",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "reviews_count",
|
||||
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": ".a-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "original_price",
|
||||
"selector": ".a-price.a-text-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "sponsored",
|
||||
"selector": ".puis-sponsored-label-text",
|
||||
"type": "exists"
|
||||
},
|
||||
{
|
||||
"name": "delivery_info",
|
||||
"selector": "[data-cy='delivery-recipe'] .a-color-base",
|
||||
"type": "text",
|
||||
"multiple": True
|
||||
}
|
||||
]
|
||||
}
|
||||
)
|
||||
)
|
||||
|
||||
url = "https://www.amazon.com/"
|
||||
|
||||
async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
|
||||
"""Hook called after navigating to each URL"""
|
||||
print(f"[HOOK] after_goto - Successfully loaded: {url}")
|
||||
|
||||
try:
|
||||
# Wait for search box to be available
|
||||
search_box = await page.wait_for_selector('#twotabsearchtextbox', timeout=1000)
|
||||
|
||||
# Type the search query
|
||||
await search_box.fill('Samsung Galaxy Tab')
|
||||
|
||||
# Get the search button and prepare for navigation
|
||||
search_button = await page.wait_for_selector('#nav-search-submit-button', timeout=1000)
|
||||
|
||||
# Click with navigation waiting
|
||||
await search_button.click()
|
||||
|
||||
# Wait for search results to load
|
||||
await page.wait_for_selector('[data-component-type="s-search-result"]', timeout=10000)
|
||||
print("[HOOK] Search completed and results loaded!")
|
||||
|
||||
except Exception as e:
|
||||
print(f"[HOOK] Error during search operation: {str(e)}")
|
||||
|
||||
return page
|
||||
|
||||
# Use context manager for proper resource handling
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
|
||||
crawler.crawler_strategy.set_hook("after_goto", after_goto)
|
||||
|
||||
# Extract the data
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
|
||||
# Process and print the results
|
||||
if result and result.extracted_content:
|
||||
# Parse the JSON string into a list of products
|
||||
products = json.loads(result.extracted_content)
|
||||
|
||||
# Process each product in the list
|
||||
for product in products:
|
||||
print("\nProduct Details:")
|
||||
print(f"ASIN: {product.get('asin')}")
|
||||
print(f"Title: {product.get('title')}")
|
||||
print(f"Price: {product.get('price')}")
|
||||
print(f"Original Price: {product.get('original_price')}")
|
||||
print(f"Rating: {product.get('rating')}")
|
||||
print(f"Reviews: {product.get('reviews_count')}")
|
||||
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
|
||||
if product.get('delivery_info'):
|
||||
print(f"Delivery: {' '.join(product['delivery_info'])}")
|
||||
print("-" * 80)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(extract_amazon_products())
|
||||
129
docs/examples/amazon_product_extraction_using_use_javascript.py
Normal file
129
docs/examples/amazon_product_extraction_using_use_javascript.py
Normal file
@@ -0,0 +1,129 @@
|
||||
"""
|
||||
This example demonstrates how to use JSON CSS extraction to scrape product information
|
||||
from Amazon search results. It shows how to extract structured data like product titles,
|
||||
prices, ratings, and other details using CSS selectors.
|
||||
"""
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
import json
|
||||
from playwright.async_api import Page, BrowserContext
|
||||
|
||||
async def extract_amazon_products():
|
||||
# Initialize browser config
|
||||
browser_config = BrowserConfig(
|
||||
# browser_type="chromium",
|
||||
headless=True
|
||||
)
|
||||
|
||||
js_code_to_search = """
|
||||
const task = async () => {
|
||||
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
|
||||
document.querySelector('#nav-search-submit-button').click();
|
||||
}
|
||||
await task();
|
||||
"""
|
||||
js_code_to_search_sync = """
|
||||
document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
|
||||
document.querySelector('#nav-search-submit-button').click();
|
||||
"""
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_code = js_code_to_search,
|
||||
wait_for='css:[data-component-type="s-search-result"]',
|
||||
extraction_strategy=JsonCssExtractionStrategy(
|
||||
schema={
|
||||
"name": "Amazon Product Search Results",
|
||||
"baseSelector": "[data-component-type='s-search-result']",
|
||||
"fields": [
|
||||
{
|
||||
"name": "asin",
|
||||
"selector": "",
|
||||
"type": "attribute",
|
||||
"attribute": "data-asin"
|
||||
},
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2 a span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "url",
|
||||
"selector": "h2 a",
|
||||
"type": "attribute",
|
||||
"attribute": "href"
|
||||
},
|
||||
{
|
||||
"name": "image",
|
||||
"selector": ".s-image",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
},
|
||||
{
|
||||
"name": "rating",
|
||||
"selector": ".a-icon-star-small .a-icon-alt",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "reviews_count",
|
||||
"selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": ".a-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "original_price",
|
||||
"selector": ".a-price.a-text-price .a-offscreen",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "sponsored",
|
||||
"selector": ".puis-sponsored-label-text",
|
||||
"type": "exists"
|
||||
},
|
||||
{
|
||||
"name": "delivery_info",
|
||||
"selector": "[data-cy='delivery-recipe'] .a-color-base",
|
||||
"type": "text",
|
||||
"multiple": True
|
||||
}
|
||||
]
|
||||
}
|
||||
)
|
||||
)
|
||||
|
||||
# Example search URL (you should replace with your actual Amazon URL)
|
||||
url = "https://www.amazon.com/"
|
||||
|
||||
|
||||
# Use context manager for proper resource handling
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Extract the data
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
|
||||
# Process and print the results
|
||||
if result and result.extracted_content:
|
||||
# Parse the JSON string into a list of products
|
||||
products = json.loads(result.extracted_content)
|
||||
|
||||
# Process each product in the list
|
||||
for product in products:
|
||||
print("\nProduct Details:")
|
||||
print(f"ASIN: {product.get('asin')}")
|
||||
print(f"Title: {product.get('title')}")
|
||||
print(f"Price: {product.get('price')}")
|
||||
print(f"Original Price: {product.get('original_price')}")
|
||||
print(f"Rating: {product.get('rating')}")
|
||||
print(f"Reviews: {product.get('reviews_count')}")
|
||||
print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
|
||||
if product.get('delivery_info'):
|
||||
print(f"Delivery: {' '.join(product['delivery_info'])}")
|
||||
print("-" * 80)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(extract_amazon_products())
|
||||
128
docs/examples/browser_optimization_example.py
Normal file
128
docs/examples/browser_optimization_example.py
Normal file
@@ -0,0 +1,128 @@
|
||||
"""
|
||||
This example demonstrates optimal browser usage patterns in Crawl4AI:
|
||||
1. Sequential crawling with session reuse
|
||||
2. Parallel crawling with browser instance reuse
|
||||
3. Performance optimization settings
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from typing import List
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
|
||||
async def crawl_sequential(urls: List[str]):
|
||||
"""
|
||||
Sequential crawling using session reuse - most efficient for moderate workloads
|
||||
"""
|
||||
print("\n=== Sequential Crawling with Session Reuse ===")
|
||||
|
||||
# Configure browser with optimized settings
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
browser_args=[
|
||||
"--disable-gpu", # Disable GPU acceleration
|
||||
"--disable-dev-shm-usage", # Disable /dev/shm usage
|
||||
"--no-sandbox", # Required for Docker
|
||||
],
|
||||
viewport={
|
||||
"width": 800,
|
||||
"height": 600,
|
||||
}, # Smaller viewport for better performance
|
||||
)
|
||||
|
||||
# Configure crawl settings
|
||||
crawl_config = CrawlerRunConfig(
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=PruningContentFilter(), In case you need fit_markdown
|
||||
),
|
||||
)
|
||||
|
||||
# Create single crawler instance
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
|
||||
try:
|
||||
session_id = "session1" # Use same session for all URLs
|
||||
for url in urls:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config=crawl_config,
|
||||
session_id=session_id, # Reuse same browser tab
|
||||
)
|
||||
if result.success:
|
||||
print(f"Successfully crawled {url}")
|
||||
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
|
||||
finally:
|
||||
await crawler.close()
|
||||
|
||||
|
||||
async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
|
||||
"""
|
||||
Parallel crawling while reusing browser instance - best for large workloads
|
||||
"""
|
||||
print("\n=== Parallel Crawling with Browser Reuse ===")
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
|
||||
viewport={"width": 800, "height": 600},
|
||||
)
|
||||
|
||||
crawl_config = CrawlerRunConfig(
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=PruningContentFilter(), In case you need fit_markdown
|
||||
),
|
||||
)
|
||||
|
||||
# Create single crawler instance for all parallel tasks
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
await crawler.start()
|
||||
|
||||
try:
|
||||
# Create tasks in batches to control concurrency
|
||||
for i in range(0, len(urls), max_concurrent):
|
||||
batch = urls[i : i + max_concurrent]
|
||||
tasks = []
|
||||
|
||||
for j, url in enumerate(batch):
|
||||
session_id = (
|
||||
f"parallel_session_{j}" # Different session per concurrent task
|
||||
)
|
||||
task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
|
||||
tasks.append(task)
|
||||
|
||||
# Wait for batch to complete
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
# Process results
|
||||
for url, result in zip(batch, results):
|
||||
if isinstance(result, Exception):
|
||||
print(f"Error crawling {url}: {str(result)}")
|
||||
elif result.success:
|
||||
print(f"Successfully crawled {url}")
|
||||
print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
|
||||
finally:
|
||||
await crawler.close()
|
||||
|
||||
|
||||
async def main():
|
||||
# Example URLs
|
||||
urls = [
|
||||
"https://example.com/page1",
|
||||
"https://example.com/page2",
|
||||
"https://example.com/page3",
|
||||
"https://example.com/page4",
|
||||
]
|
||||
|
||||
# Demo sequential crawling
|
||||
await crawl_sequential(urls)
|
||||
|
||||
# Demo parallel crawling
|
||||
await crawl_parallel(urls, max_concurrent=2)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -78,20 +78,20 @@ def test_docker_deployment(version="basic"):
|
||||
time.sleep(5)
|
||||
|
||||
# Test cases based on version
|
||||
# test_basic_crawl(tester)
|
||||
# test_basic_crawl(tester)
|
||||
# test_basic_crawl_sync(tester)
|
||||
test_basic_crawl_direct(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl(tester)
|
||||
test_basic_crawl_sync(tester)
|
||||
|
||||
# if version in ["full", "transformer"]:
|
||||
# test_cosine_extraction(tester)
|
||||
if version in ["full", "transformer"]:
|
||||
test_cosine_extraction(tester)
|
||||
|
||||
# test_js_execution(tester)
|
||||
# test_css_selector(tester)
|
||||
# test_structured_extraction(tester)
|
||||
# test_llm_extraction(tester)
|
||||
# test_llm_with_ollama(tester)
|
||||
# test_screenshot(tester)
|
||||
test_js_execution(tester)
|
||||
test_css_selector(tester)
|
||||
test_structured_extraction(tester)
|
||||
test_llm_extraction(tester)
|
||||
test_llm_with_ollama(tester)
|
||||
test_screenshot(tester)
|
||||
|
||||
|
||||
def test_basic_crawl(tester: Crawl4AiTester):
|
||||
|
||||
115
docs/examples/extraction_strategies_example.py
Normal file
115
docs/examples/extraction_strategies_example.py
Normal file
@@ -0,0 +1,115 @@
|
||||
"""
|
||||
Example demonstrating different extraction strategies with various input formats.
|
||||
This example shows how to:
|
||||
1. Use different input formats (markdown, HTML, fit_markdown)
|
||||
2. Work with JSON-based extractors (CSS and XPath)
|
||||
3. Use LLM-based extraction with different input formats
|
||||
4. Configure browser and crawler settings properly
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from typing import Dict, Any
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from crawl4ai.extraction_strategy import (
|
||||
LLMExtractionStrategy,
|
||||
JsonCssExtractionStrategy,
|
||||
JsonXPathExtractionStrategy
|
||||
)
|
||||
from crawl4ai.chunking_strategy import RegexChunking, IdentityChunking
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
|
||||
"""Helper function to run extraction with proper configuration"""
|
||||
try:
|
||||
# Configure the crawler run settings
|
||||
config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
extraction_strategy=strategy,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter() # For fit_markdown support
|
||||
)
|
||||
)
|
||||
|
||||
# Run the crawler
|
||||
result = await crawler.arun(url=url, config=config)
|
||||
|
||||
if result.success:
|
||||
print(f"\n=== {name} Results ===")
|
||||
print(f"Extracted Content: {result.extracted_content}")
|
||||
print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
|
||||
print(f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}")
|
||||
else:
|
||||
print(f"Error in {name}: Crawl failed")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error in {name}: {str(e)}")
|
||||
|
||||
async def main():
|
||||
# Example URL (replace with actual URL)
|
||||
url = "https://example.com/product-page"
|
||||
|
||||
# Configure browser settings
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# Initialize extraction strategies
|
||||
|
||||
# 1. LLM Extraction with different input formats
|
||||
markdown_strategy = LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o-mini",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
instruction="Extract product information including name, price, and description"
|
||||
)
|
||||
|
||||
html_strategy = LLMExtractionStrategy(
|
||||
input_format="html",
|
||||
provider="openai/gpt-4o-mini",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
instruction="Extract product information from HTML including structured data"
|
||||
)
|
||||
|
||||
fit_markdown_strategy = LLMExtractionStrategy(
|
||||
input_format="fit_markdown",
|
||||
provider="openai/gpt-4o-mini",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
instruction="Extract product information from cleaned markdown"
|
||||
)
|
||||
|
||||
# 2. JSON CSS Extraction (automatically uses HTML input)
|
||||
css_schema = {
|
||||
"baseSelector": ".product",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1.product-title", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"},
|
||||
{"name": "description", "selector": ".description", "type": "text"}
|
||||
]
|
||||
}
|
||||
css_strategy = JsonCssExtractionStrategy(schema=css_schema)
|
||||
|
||||
# 3. JSON XPath Extraction (automatically uses HTML input)
|
||||
xpath_schema = {
|
||||
"baseSelector": "//div[@class='product']",
|
||||
"fields": [
|
||||
{"name": "title", "selector": ".//h1[@class='product-title']/text()", "type": "text"},
|
||||
{"name": "price", "selector": ".//span[@class='price']/text()", "type": "text"},
|
||||
{"name": "description", "selector": ".//div[@class='description']/text()", "type": "text"}
|
||||
]
|
||||
}
|
||||
xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
|
||||
|
||||
# Use context manager for proper resource handling
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Run all strategies
|
||||
await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
|
||||
await run_extraction(crawler, url, html_strategy, "HTML LLM")
|
||||
await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
|
||||
await run_extraction(crawler, url, css_strategy, "CSS Extraction")
|
||||
await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
58
docs/examples/full_page_screenshot_and_pdf_export.md
Normal file
58
docs/examples/full_page_screenshot_and_pdf_export.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
|
||||
|
||||
When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
|
||||
|
||||
**The New Approach:**
|
||||
We’ve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser’s built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
|
||||
|
||||
**Key Benefits:**
|
||||
- **Reliability:** The PDF export never times out and works regardless of page length.
|
||||
- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
|
||||
- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
|
||||
|
||||
**Simple Example:**
|
||||
```python
|
||||
import os, sys
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
# Adjust paths as needed
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
sys.path.append(parent_dir)
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Request both PDF and screenshot
|
||||
result = await crawler.arun(
|
||||
url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
pdf=True,
|
||||
screenshot=True
|
||||
)
|
||||
|
||||
if result.success:
|
||||
# Save screenshot
|
||||
if result.screenshot:
|
||||
from base64 import b64decode
|
||||
with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
|
||||
f.write(b64decode(result.screenshot))
|
||||
|
||||
# Save PDF
|
||||
if result.pdf:
|
||||
pdf_bytes = b64decode(result.pdf)
|
||||
with open(os.path.join(__location__, "page.pdf"), "wb") as f:
|
||||
f.write(pdf_bytes)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**What Happens Under the Hood:**
|
||||
- Crawl4AI navigates to the target page.
|
||||
- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
|
||||
- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
|
||||
- Finally, you get your PDF and/or screenshot ready to use.
|
||||
|
||||
**Conclusion:**
|
||||
With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
|
||||
107
docs/examples/hooks_example.py
Normal file
107
docs/examples/hooks_example.py
Normal file
@@ -0,0 +1,107 @@
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
from playwright.async_api import Page, BrowserContext
|
||||
|
||||
async def main():
|
||||
print("🔗 Hooks Example: Demonstrating different hook use cases")
|
||||
|
||||
# Configure browser settings
|
||||
browser_config = BrowserConfig(
|
||||
headless=True
|
||||
)
|
||||
|
||||
# Configure crawler settings
|
||||
crawler_run_config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="body",
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
# Create crawler instance
|
||||
crawler = AsyncWebCrawler(config=browser_config)
|
||||
|
||||
# Define and set hook functions
|
||||
async def on_browser_created(browser, context: BrowserContext, **kwargs):
|
||||
"""Hook called after the browser is created"""
|
||||
print("[HOOK] on_browser_created - Browser is ready!")
|
||||
# Example: Set a cookie that will be used for all requests
|
||||
return browser
|
||||
|
||||
async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
|
||||
"""Hook called after a new page and context are created"""
|
||||
print("[HOOK] on_page_context_created - New page created!")
|
||||
# Example: Set default viewport size
|
||||
await context.add_cookies([{
|
||||
'name': 'session_id',
|
||||
'value': 'example_session',
|
||||
'domain': '.example.com',
|
||||
'path': '/'
|
||||
}])
|
||||
await page.set_viewport_size({"width": 1920, "height": 1080})
|
||||
return page
|
||||
|
||||
async def on_user_agent_updated(page: Page, context: BrowserContext, user_agent: str, **kwargs):
|
||||
"""Hook called when the user agent is updated"""
|
||||
print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
|
||||
return page
|
||||
|
||||
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
|
||||
"""Hook called after custom JavaScript execution"""
|
||||
print("[HOOK] on_execution_started - Custom JS executed!")
|
||||
return page
|
||||
|
||||
async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
|
||||
"""Hook called before navigating to each URL"""
|
||||
print(f"[HOOK] before_goto - About to visit: {url}")
|
||||
# Example: Add custom headers for the request
|
||||
await page.set_extra_http_headers({
|
||||
"Custom-Header": "my-value"
|
||||
})
|
||||
return page
|
||||
|
||||
async def after_goto(page: Page, context: BrowserContext, url: str, response: dict, **kwargs):
|
||||
"""Hook called after navigating to each URL"""
|
||||
print(f"[HOOK] after_goto - Successfully loaded: {url}")
|
||||
# Example: Wait for a specific element to be loaded
|
||||
try:
|
||||
await page.wait_for_selector('.content', timeout=1000)
|
||||
print("Content element found!")
|
||||
except:
|
||||
print("Content element not found, continuing anyway")
|
||||
return page
|
||||
|
||||
async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
|
||||
"""Hook called before retrieving the HTML content"""
|
||||
print("[HOOK] before_retrieve_html - About to get HTML content")
|
||||
# Example: Scroll to bottom to trigger lazy loading
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
|
||||
return page
|
||||
|
||||
async def before_return_html(page: Page, context: BrowserContext, html:str, **kwargs):
|
||||
"""Hook called before returning the HTML content"""
|
||||
print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
|
||||
# Example: You could modify the HTML content here if needed
|
||||
return page
|
||||
|
||||
# Set all the hooks
|
||||
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
|
||||
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
|
||||
crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
crawler.crawler_strategy.set_hook("before_goto", before_goto)
|
||||
crawler.crawler_strategy.set_hook("after_goto", after_goto)
|
||||
crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
|
||||
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
|
||||
|
||||
await crawler.start()
|
||||
|
||||
# Example usage: crawl a simple website
|
||||
url = 'https://example.com'
|
||||
result = await crawler.arun(url, config=crawler_run_config)
|
||||
print(f"\nCrawled URL: {result.url}")
|
||||
print(f"HTML length: {len(result.html)}")
|
||||
|
||||
await crawler.close()
|
||||
|
||||
if __name__ == "__main__":
|
||||
import asyncio
|
||||
asyncio.run(main())
|
||||
@@ -1,41 +1,40 @@
|
||||
import os
|
||||
import time
|
||||
from crawl4ai.web_crawler import WebCrawler
|
||||
from crawl4ai.chunking_strategy import *
|
||||
from crawl4ai.extraction_strategy import *
|
||||
from crawl4ai.crawler_strategy import *
|
||||
import asyncio
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
url = r'https://openai.com/api/pricing/'
|
||||
|
||||
crawler = WebCrawler()
|
||||
crawler.warmup()
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
|
||||
|
||||
result = crawler.run(
|
||||
url=url,
|
||||
word_count_threshold=1,
|
||||
extraction_strategy= LLMExtractionStrategy(
|
||||
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
|
||||
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="From the crawled content, extract all mentioned model names along with their "\
|
||||
"fees for input and output tokens. Make sure not to miss anything in the entire content. "\
|
||||
'One extracted model JSON format should look like this: '\
|
||||
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
model_fees = json.loads(result.extracted_content)
|
||||
async def main():
|
||||
# Use AsyncWebCrawler
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=1,
|
||||
extraction_strategy= LLMExtractionStrategy(
|
||||
# provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
|
||||
provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="From the crawled content, extract all mentioned model names along with their " \
|
||||
"fees for input and output tokens. Make sure not to miss anything in the entire content. " \
|
||||
'One extracted model JSON format should look like this: ' \
|
||||
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
|
||||
),
|
||||
|
||||
print(len(model_fees))
|
||||
)
|
||||
print("Success:", result.success)
|
||||
model_fees = json.loads(result.extracted_content)
|
||||
print(len(model_fees))
|
||||
|
||||
with open(".data/data.json", "w", encoding="utf-8") as f:
|
||||
f.write(result.extracted_content)
|
||||
with open(".data/data.json", "w", encoding="utf-8") as f:
|
||||
f.write(result.extracted_content)
|
||||
|
||||
asyncio.run(main())
|
||||
|
||||
610
docs/examples/quickstart_async.config.py
Normal file
610
docs/examples/quickstart_async.config.py
Normal file
@@ -0,0 +1,610 @@
|
||||
import os, sys
|
||||
|
||||
sys.path.append(
|
||||
os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
)
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
import json
|
||||
import re
|
||||
from typing import Dict, List
|
||||
from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
|
||||
from crawl4ai.extraction_strategy import (
|
||||
JsonCssExtractionStrategy,
|
||||
LLMExtractionStrategy,
|
||||
)
|
||||
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
|
||||
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
|
||||
print("Twitter: @unclecode")
|
||||
print("Website: https://crawl4ai.com")
|
||||
|
||||
|
||||
# Basic Example - Simple Crawl
|
||||
async def simple_crawl():
|
||||
print("\n--- Basic Usage ---")
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business", config=crawler_config
|
||||
)
|
||||
print(result.markdown[:500])
|
||||
|
||||
|
||||
async def clean_content():
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
excluded_tags=["nav", "footer", "aside"],
|
||||
remove_overlay_elements=True,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
),
|
||||
options={"ignore_links": True},
|
||||
),
|
||||
)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/Apple",
|
||||
config=crawler_config,
|
||||
)
|
||||
full_markdown_length = len(result.markdown_v2.raw_markdown)
|
||||
fit_markdown_length = len(result.markdown_v2.fit_markdown)
|
||||
print(f"Full Markdown Length: {full_markdown_length}")
|
||||
print(f"Fit Markdown Length: {fit_markdown_length}")
|
||||
|
||||
async def link_analysis():
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.ENABLED,
|
||||
exclude_external_links=True,
|
||||
exclude_social_media_links=True,
|
||||
)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
config=crawler_config,
|
||||
)
|
||||
print(f"Found {len(result.links['internal'])} internal links")
|
||||
print(f"Found {len(result.links['external'])} external links")
|
||||
|
||||
for link in result.links['internal'][:5]:
|
||||
print(f"Href: {link['href']}\nText: {link['text']}\n")
|
||||
|
||||
# JavaScript Execution Example
|
||||
async def simple_example_with_running_js_code():
|
||||
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
||||
|
||||
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_code="const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();",
|
||||
# wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business", config=crawler_config
|
||||
)
|
||||
print(result.markdown[:500])
|
||||
|
||||
|
||||
# CSS Selector Example
|
||||
async def simple_example_with_css_selector():
|
||||
print("\n--- Using CSS Selectors ---")
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS, css_selector=".wide-tease-item__description"
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business", config=crawler_config
|
||||
)
|
||||
print(result.markdown[:500])
|
||||
|
||||
async def media_handling():
|
||||
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
config=crawler_config
|
||||
)
|
||||
for img in result.media['images'][:5]:
|
||||
print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
|
||||
|
||||
async def custom_hook_workflow(verbose=True):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Set a 'before_goto' hook to run custom code just before navigation
|
||||
crawler.crawler_strategy.set_hook("before_goto", lambda page, context: print("[Hook] Preparing to navigate..."))
|
||||
|
||||
# Perform the crawl operation
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com"
|
||||
)
|
||||
print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
|
||||
|
||||
|
||||
# Proxy Example
|
||||
async def use_proxy():
|
||||
print("\n--- Using a Proxy ---")
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "username",
|
||||
"password": "password",
|
||||
},
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business", config=crawler_config
|
||||
)
|
||||
if result.success:
|
||||
print(result.markdown[:500])
|
||||
|
||||
|
||||
# Screenshot Example
|
||||
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
|
||||
screenshot_data = base64.b64decode(result.screenshot)
|
||||
with open(output_path, "wb") as f:
|
||||
f.write(screenshot_data)
|
||||
print(f"Screenshot saved successfully to {output_path}")
|
||||
else:
|
||||
print("Failed to capture screenshot")
|
||||
|
||||
|
||||
# LLM Extraction Example
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(
|
||||
..., description="Fee for output token for the OpenAI model."
|
||||
)
|
||||
|
||||
|
||||
async def extract_structured_data_using_llm(
|
||||
provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
|
||||
):
|
||||
print(f"\n--- Extracting Structured Data with {provider} ---")
|
||||
|
||||
if api_token is None and provider != "ollama":
|
||||
print(f"API token is required for {provider}. Skipping this example.")
|
||||
return
|
||||
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
|
||||
if extra_headers:
|
||||
extra_args["extra_headers"] = extra_headers
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
word_count_threshold=1,
|
||||
page_timeout=80000,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content.""",
|
||||
extra_args=extra_args,
|
||||
),
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://openai.com/api/pricing/", config=crawler_config
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
|
||||
# CSS Extraction Example
|
||||
async def extract_structured_data_using_css_extractor():
|
||||
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
|
||||
schema = {
|
||||
"name": "KidoCode Courses",
|
||||
"baseSelector": "section.charge-methodology .w-tab-content > div",
|
||||
"fields": [
|
||||
{
|
||||
"name": "section_title",
|
||||
"selector": "h3.heading-50",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "section_description",
|
||||
"selector": ".charge-content",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_name",
|
||||
"selector": ".text-block-93",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_description",
|
||||
"selector": ".course-content-text",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_icon",
|
||||
"selector": ".image-92",
|
||||
"type": "attribute",
|
||||
"attribute": "src",
|
||||
},
|
||||
],
|
||||
}
|
||||
|
||||
browser_config = BrowserConfig(headless=True, java_script_enabled=True)
|
||||
|
||||
js_click_tabs = """
|
||||
(async () => {
|
||||
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
|
||||
for(let tab of tabs) {
|
||||
tab.scrollIntoView();
|
||||
tab.click();
|
||||
await new Promise(r => setTimeout(r, 500));
|
||||
}
|
||||
})();
|
||||
"""
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||
js_code=[js_click_tabs],
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.kidocode.com/degrees/technology", config=crawler_config
|
||||
)
|
||||
|
||||
companies = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(companies)} companies")
|
||||
print(json.dumps(companies[0], indent=2))
|
||||
|
||||
|
||||
# Dynamic Content Examples - Method 1
|
||||
async def crawl_dynamic_content_pages_method_1():
|
||||
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
|
||||
first_commit = ""
|
||||
|
||||
async def on_execution_started(page, **kwargs):
|
||||
nonlocal first_commit
|
||||
try:
|
||||
while True:
|
||||
await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
|
||||
commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
|
||||
commit = await commit.evaluate("(element) => element.textContent")
|
||||
commit = re.sub(r"\s+", "", commit)
|
||||
if commit and commit != first_commit:
|
||||
first_commit = commit
|
||||
break
|
||||
await asyncio.sleep(0.5)
|
||||
except Exception as e:
|
||||
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
|
||||
|
||||
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
for page in range(3):
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
css_selector="li.Box-sc-g0xbh4-0",
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
assert result.success, f"Failed to crawl page {page + 1}"
|
||||
|
||||
soup = BeautifulSoup(result.cleaned_html, "html.parser")
|
||||
commits = soup.select("li")
|
||||
all_commits.extend(commits)
|
||||
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
|
||||
# Dynamic Content Examples - Method 2
|
||||
async def crawl_dynamic_content_pages_method_2():
|
||||
print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
|
||||
|
||||
browser_config = BrowserConfig(headless=False, java_script_enabled=True)
|
||||
|
||||
js_next_page_and_wait = """
|
||||
(async () => {
|
||||
const getCurrentCommit = () => {
|
||||
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
|
||||
return commits.length > 0 ? commits[0].textContent.trim() : null;
|
||||
};
|
||||
|
||||
const initialCommit = getCurrentCommit();
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
|
||||
while (true) {
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
const newCommit = getCurrentCommit();
|
||||
if (newCommit && newCommit !== initialCommit) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
})();
|
||||
"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.Box-sc-g0xbh4-0",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.markdown-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
all_commits = []
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema)
|
||||
|
||||
for page in range(3):
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
css_selector="li.Box-sc-g0xbh4-0",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page_and_wait if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
result = await crawler.arun(url=url, config=crawler_config)
|
||||
assert result.success, f"Failed to crawl page {page + 1}"
|
||||
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
|
||||
async def cosine_similarity_extraction():
|
||||
crawl_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
extraction_strategy=CosineStrategy(
|
||||
word_count_threshold=10,
|
||||
max_dist=0.2, # Maximum distance between two words
|
||||
linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
|
||||
top_k=3, # Number of top keywords to extract
|
||||
sim_threshold=0.3, # Similarity threshold for clustering
|
||||
semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
|
||||
verbose=True
|
||||
),
|
||||
)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
|
||||
config=crawl_config
|
||||
)
|
||||
print(json.loads(result.extracted_content)[:5])
|
||||
|
||||
# Browser Comparison
|
||||
async def crawl_custom_browser_type():
|
||||
print("\n--- Browser Comparison ---")
|
||||
|
||||
# Firefox
|
||||
browser_config_firefox = BrowserConfig(browser_type="firefox", headless=True)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.example.com",
|
||||
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
|
||||
)
|
||||
print("Firefox:", time.time() - start)
|
||||
print(result.markdown[:500])
|
||||
|
||||
# WebKit
|
||||
browser_config_webkit = BrowserConfig(browser_type="webkit", headless=True)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.example.com",
|
||||
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
|
||||
)
|
||||
print("WebKit:", time.time() - start)
|
||||
print(result.markdown[:500])
|
||||
|
||||
# Chromium (default)
|
||||
browser_config_chromium = BrowserConfig(browser_type="chromium", headless=True)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.example.com",
|
||||
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
|
||||
)
|
||||
print("Chromium:", time.time() - start)
|
||||
print(result.markdown[:500])
|
||||
|
||||
|
||||
# Anti-Bot and User Simulation
|
||||
async def crawl_with_user_simulation():
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
user_agent_mode="random",
|
||||
user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
|
||||
)
|
||||
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
magic=True,
|
||||
simulate_user=True,
|
||||
override_navigator=True,
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
|
||||
print(result.markdown)
|
||||
|
||||
async def ssl_certification():
|
||||
# Configure crawler to fetch SSL certificate
|
||||
config = CrawlerRunConfig(
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS # Bypass cache to always get fresh certificates
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url='https://example.com',
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.ssl_certificate:
|
||||
cert = result.ssl_certificate
|
||||
|
||||
# 1. Access certificate properties directly
|
||||
print("\nCertificate Information:")
|
||||
print(f"Issuer: {cert.issuer.get('CN', '')}")
|
||||
print(f"Valid until: {cert.valid_until}")
|
||||
print(f"Fingerprint: {cert.fingerprint}")
|
||||
|
||||
# 2. Export certificate in different formats
|
||||
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
|
||||
print("\nCertificate exported to:")
|
||||
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
|
||||
|
||||
pem_data = cert.to_pem(os.path.join(tmp_dir, "certificate.pem")) # For web servers
|
||||
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
|
||||
|
||||
der_data = cert.to_der(os.path.join(tmp_dir, "certificate.der")) # For Java apps
|
||||
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
|
||||
|
||||
# Speed Comparison
|
||||
async def speed_comparison():
|
||||
print("\n--- Speed Comparison ---")
|
||||
|
||||
# Firecrawl comparison
|
||||
from firecrawl import FirecrawlApp
|
||||
|
||||
app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
|
||||
start = time.time()
|
||||
scrape_status = app.scrape_url(
|
||||
"https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
|
||||
)
|
||||
end = time.time()
|
||||
print("Firecrawl:")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(scrape_status['markdown'])} characters")
|
||||
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
|
||||
print()
|
||||
|
||||
# Crawl4AI comparisons
|
||||
browser_config = BrowserConfig(headless=True)
|
||||
|
||||
# Simple crawl
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
start = time.time()
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS, word_count_threshold=0
|
||||
),
|
||||
)
|
||||
end = time.time()
|
||||
print("Crawl4AI (simple crawl):")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(result.markdown)} characters")
|
||||
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
|
||||
print()
|
||||
|
||||
# Advanced filtering
|
||||
start = time.time()
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
config=CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
word_count_threshold=0,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
)
|
||||
),
|
||||
),
|
||||
)
|
||||
end = time.time()
|
||||
print("Crawl4AI (Markdown Plus):")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
|
||||
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
|
||||
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
|
||||
print()
|
||||
|
||||
|
||||
# Main execution
|
||||
async def main():
|
||||
# Basic examples
|
||||
# await simple_crawl()
|
||||
# await simple_example_with_running_js_code()
|
||||
# await simple_example_with_css_selector()
|
||||
|
||||
# Advanced examples
|
||||
# await extract_structured_data_using_css_extractor()
|
||||
await extract_structured_data_using_llm(
|
||||
"openai/gpt-4o", os.getenv("OPENAI_API_KEY")
|
||||
)
|
||||
# await crawl_dynamic_content_pages_method_1()
|
||||
# await crawl_dynamic_content_pages_method_2()
|
||||
|
||||
# Browser comparisons
|
||||
# await crawl_custom_browser_type()
|
||||
|
||||
# Performance testing
|
||||
# await speed_comparison()
|
||||
|
||||
# Screenshot example
|
||||
# await capture_and_save_screenshot(
|
||||
# "https://www.example.com",
|
||||
# os.path.join(__location__, "tmp/example_screenshot.jpg")
|
||||
# )
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -13,7 +13,9 @@ import re
|
||||
from typing import Dict, List
|
||||
from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
|
||||
from crawl4ai.extraction_strategy import (
|
||||
JsonCssExtractionStrategy,
|
||||
LLMExtractionStrategy,
|
||||
@@ -30,7 +32,7 @@ print("Website: https://crawl4ai.com")
|
||||
async def simple_crawl():
|
||||
print("\n--- Basic Usage ---")
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def simple_example_with_running_js_code():
|
||||
@@ -51,7 +53,7 @@ async def simple_example_with_running_js_code():
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
# wait_for=wait_for,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
@@ -61,7 +63,7 @@ async def simple_example_with_css_selector():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
@@ -74,16 +76,17 @@ async def use_proxy():
|
||||
async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
bypass_cache=True
|
||||
cache_mode= CacheMode.BYPASS
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
if result.success:
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
screenshot=True,
|
||||
bypass_cache=True
|
||||
cache_mode= CacheMode.BYPASS
|
||||
)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
@@ -114,7 +117,13 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
|
||||
print(f"API token is required for {provider}. Skipping this example.")
|
||||
return
|
||||
|
||||
extra_args = {}
|
||||
# extra_args = {}
|
||||
extra_args={
|
||||
"temperature": 0,
|
||||
"top_p": 0.9,
|
||||
"max_tokens": 2000,
|
||||
# any other supported parameters for litellm
|
||||
}
|
||||
if extra_headers:
|
||||
extra_args["extra_headers"] = extra_headers
|
||||
|
||||
@@ -125,55 +134,82 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.schema(),
|
||||
schema=OpenAIModelFee.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
|
||||
extra_args=extra_args
|
||||
),
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
async def extract_structured_data_using_css_extractor():
|
||||
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text",
|
||||
"name": "KidoCode Courses",
|
||||
"baseSelector": "section.charge-methodology .w-tab-content > div",
|
||||
"fields": [
|
||||
{
|
||||
"name": "section_title",
|
||||
"selector": "h3.heading-50",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "section_description",
|
||||
"selector": ".charge-content",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_name",
|
||||
"selector": ".text-block-93",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_description",
|
||||
"selector": ".course-content-text",
|
||||
"type": "text",
|
||||
},
|
||||
{
|
||||
"name": "course_icon",
|
||||
"selector": ".image-92",
|
||||
"type": "attribute",
|
||||
"attribute": "src"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
verbose=True
|
||||
) as crawler:
|
||||
|
||||
# Create the JavaScript that handles clicking multiple times
|
||||
js_click_tabs = """
|
||||
(async () => {
|
||||
const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
|
||||
|
||||
for(let tab of tabs) {
|
||||
// scroll to the tab
|
||||
tab.scrollIntoView();
|
||||
tab.click();
|
||||
// Wait for content to load and animations to complete
|
||||
await new Promise(r => setTimeout(r, 500));
|
||||
}
|
||||
],
|
||||
}
|
||||
})();
|
||||
"""
|
||||
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.coinbase.com/explore",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True,
|
||||
url="https://www.kidocode.com/degrees/technology",
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
|
||||
js_code=[js_click_tabs],
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
assert result.success, "Failed to crawl the page"
|
||||
|
||||
news_teasers = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(news_teasers)} news teasers")
|
||||
print(json.dumps(news_teasers[0], indent=2))
|
||||
companies = json.loads(result.extracted_content)
|
||||
print(f"Successfully extracted {len(companies)} companies")
|
||||
print(json.dumps(companies[0], indent=2))
|
||||
|
||||
# Advanced Session-Based Crawling with Dynamic Content 🔄
|
||||
async def crawl_dynamic_content_pages_method_1():
|
||||
@@ -203,8 +239,10 @@ async def crawl_dynamic_content_pages_method_1():
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
(() => {
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
})();
|
||||
"""
|
||||
|
||||
for page in range(3): # Crawl 3 pages
|
||||
@@ -213,7 +251,7 @@ async def crawl_dynamic_content_pages_method_1():
|
||||
session_id=session_id,
|
||||
css_selector="li.Box-sc-g0xbh4-0",
|
||||
js=js_next_page if page > 0 else None,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_only=page > 0,
|
||||
headless=False,
|
||||
)
|
||||
@@ -282,7 +320,7 @@ async def crawl_dynamic_content_pages_method_2():
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page_and_wait if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
headless=False,
|
||||
)
|
||||
|
||||
@@ -343,7 +381,7 @@ async def crawl_dynamic_content_pages_method_3():
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
wait_for=wait_for if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
headless=False,
|
||||
)
|
||||
|
||||
@@ -361,21 +399,21 @@ async def crawl_custom_browser_type():
|
||||
# Use Firefox
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use WebKit
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use Chromium (default)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
@@ -384,7 +422,7 @@ async def crawl_with_user_simultion():
|
||||
url = "YOUR-URL-HERE"
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
|
||||
# simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
|
||||
# override_navigator = True # Overrides the navigator object to make it look like a real user
|
||||
@@ -408,7 +446,7 @@ async def speed_comparison():
|
||||
params={'formats': ['markdown', 'html']}
|
||||
)
|
||||
end = time.time()
|
||||
print("Firecrawl (simulated):")
|
||||
print("Firecrawl:")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(scrape_status['markdown'])} characters")
|
||||
print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
|
||||
@@ -420,7 +458,7 @@ async def speed_comparison():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=0,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
verbose=False,
|
||||
)
|
||||
end = time.time()
|
||||
@@ -430,6 +468,26 @@ async def speed_comparison():
|
||||
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
|
||||
print()
|
||||
|
||||
# Crawl4AI with advanced content filtering
|
||||
start = time.time()
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=0,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
verbose=False,
|
||||
)
|
||||
end = time.time()
|
||||
print("Crawl4AI (Markdown Plus):")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
|
||||
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
|
||||
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
|
||||
print()
|
||||
|
||||
# Crawl4AI with JavaScript execution
|
||||
start = time.time()
|
||||
result = await crawler.arun(
|
||||
@@ -438,13 +496,18 @@ async def speed_comparison():
|
||||
"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
|
||||
],
|
||||
word_count_threshold=0,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
|
||||
# content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
|
||||
),
|
||||
verbose=False,
|
||||
)
|
||||
end = time.time()
|
||||
print("Crawl4AI (with JavaScript execution):")
|
||||
print(f"Time taken: {end - start:.2f} seconds")
|
||||
print(f"Content length: {len(result.markdown)} characters")
|
||||
print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
|
||||
print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
|
||||
|
||||
print("\nNote on Speed Comparison:")
|
||||
@@ -483,7 +546,7 @@ async def generate_knowledge_graph():
|
||||
url = "https://paulgraham.com/love.html"
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
bypass_cache=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
extraction_strategy=extraction_strategy,
|
||||
# magic=True
|
||||
)
|
||||
@@ -492,50 +555,85 @@ async def generate_knowledge_graph():
|
||||
f.write(result.extracted_content)
|
||||
|
||||
async def fit_markdown_remove_overlay():
|
||||
async with AsyncWebCrawler(headless = False) as crawler:
|
||||
url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
headless=True, # Set to False to see what is happening
|
||||
verbose=True,
|
||||
user_agent_mode="random",
|
||||
user_agent_generator_config={
|
||||
"device_type": "mobile",
|
||||
"os_type": "android"
|
||||
},
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
bypass_cache=True,
|
||||
word_count_threshold = 10,
|
||||
remove_overlay_elements=True,
|
||||
screenshot = True
|
||||
url='https://www.kidocode.com/degrees/technology',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.48, threshold_type="fixed", min_word_threshold=0
|
||||
),
|
||||
options={
|
||||
"ignore_links": True
|
||||
}
|
||||
),
|
||||
# markdown_generator=DefaultMarkdownGenerator(
|
||||
# content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
|
||||
# options={
|
||||
# "ignore_links": True
|
||||
# }
|
||||
# ),
|
||||
)
|
||||
# Save markdown to file
|
||||
with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
|
||||
f.write(result.fit_markdown)
|
||||
|
||||
|
||||
if result.success:
|
||||
print(len(result.markdown_v2.raw_markdown))
|
||||
print(len(result.markdown_v2.markdown_with_citations))
|
||||
print(len(result.markdown_v2.fit_markdown))
|
||||
|
||||
# Save clean html
|
||||
with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
|
||||
f.write(result.cleaned_html)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.raw_markdown)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
|
||||
f.write(result.markdown_v2.markdown_with_citations)
|
||||
|
||||
with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.fit_markdown)
|
||||
|
||||
print("Done")
|
||||
|
||||
|
||||
async def main():
|
||||
await simple_crawl()
|
||||
await simple_example_with_running_js_code()
|
||||
await simple_example_with_css_selector()
|
||||
await use_proxy()
|
||||
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||
await extract_structured_data_using_css_extractor()
|
||||
# await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||
|
||||
# await simple_crawl()
|
||||
# await simple_example_with_running_js_code()
|
||||
# await simple_example_with_css_selector()
|
||||
# # await use_proxy()
|
||||
# await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||
# await extract_structured_data_using_css_extractor()
|
||||
|
||||
# LLM extraction examples
|
||||
await extract_structured_data_using_llm()
|
||||
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
# await extract_structured_data_using_llm()
|
||||
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||
# await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
|
||||
# You always can pass custom headers to the extraction strategy
|
||||
custom_headers = {
|
||||
"Authorization": "Bearer your-custom-token",
|
||||
"X-Custom-Header": "Some-Value"
|
||||
}
|
||||
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||
# custom_headers = {
|
||||
# "Authorization": "Bearer your-custom-token",
|
||||
# "X-Custom-Header": "Some-Value"
|
||||
# }
|
||||
# await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||
|
||||
# await crawl_dynamic_content_pages_method_1()
|
||||
# await crawl_dynamic_content_pages_method_2()
|
||||
await crawl_dynamic_content_pages_method_3()
|
||||
|
||||
await crawl_custom_browser_type()
|
||||
# await crawl_custom_browser_type()
|
||||
|
||||
await speed_comparison()
|
||||
# await speed_comparison()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
46
docs/examples/ssl_example.py
Normal file
46
docs/examples/ssl_example.py
Normal file
@@ -0,0 +1,46 @@
|
||||
"""Example showing how to work with SSL certificates in Crawl4AI."""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
|
||||
# Create tmp directory if it doesn't exist
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
tmp_dir = os.path.join(parent_dir, "tmp")
|
||||
os.makedirs(tmp_dir, exist_ok=True)
|
||||
|
||||
async def main():
|
||||
# Configure crawler to fetch SSL certificate
|
||||
config = CrawlerRunConfig(
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS # Bypass cache to always get fresh certificates
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url='https://example.com',
|
||||
config=config
|
||||
)
|
||||
|
||||
if result.success and result.ssl_certificate:
|
||||
cert = result.ssl_certificate
|
||||
|
||||
# 1. Access certificate properties directly
|
||||
print("\nCertificate Information:")
|
||||
print(f"Issuer: {cert.issuer.get('CN', '')}")
|
||||
print(f"Valid until: {cert.valid_until}")
|
||||
print(f"Fingerprint: {cert.fingerprint}")
|
||||
|
||||
# 2. Export certificate in different formats
|
||||
cert.to_json(os.path.join(tmp_dir, "certificate.json")) # For analysis
|
||||
print("\nCertificate exported to:")
|
||||
print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
|
||||
|
||||
pem_data = cert.to_pem(os.path.join(tmp_dir, "certificate.pem")) # For web servers
|
||||
print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
|
||||
|
||||
der_data = cert.to_der(os.path.join(tmp_dir, "certificate.der")) # For Java apps
|
||||
print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
225
docs/examples/storage_state_tutorial.md
Normal file
225
docs/examples/storage_state_tutorial.md
Normal file
@@ -0,0 +1,225 @@
|
||||
### Using `storage_state` to Pre-Load Cookies and LocalStorage
|
||||
|
||||
Crawl4ai’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data—no need to repeat the login flow every time.
|
||||
|
||||
#### What is `storage_state`?
|
||||
|
||||
`storage_state` can be:
|
||||
|
||||
- A dictionary containing cookies and localStorage data.
|
||||
- A path to a JSON file that holds this information.
|
||||
|
||||
When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
|
||||
|
||||
#### Example Structure
|
||||
|
||||
Here’s an example storage state:
|
||||
|
||||
```json
|
||||
{
|
||||
"cookies": [
|
||||
{
|
||||
"name": "session",
|
||||
"value": "abcd1234",
|
||||
"domain": "example.com",
|
||||
"path": "/",
|
||||
"expires": 1675363572.037711,
|
||||
"httpOnly": false,
|
||||
"secure": false,
|
||||
"sameSite": "None"
|
||||
}
|
||||
],
|
||||
"origins": [
|
||||
{
|
||||
"origin": "https://example.com",
|
||||
"localStorage": [
|
||||
{ "name": "token", "value": "my_auth_token" },
|
||||
{ "name": "refreshToken", "value": "my_refresh_token" }
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
|
||||
|
||||
---
|
||||
|
||||
### Passing `storage_state` as a Dictionary
|
||||
|
||||
You can directly provide the data as a dictionary:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
storage_dict = {
|
||||
"cookies": [
|
||||
{
|
||||
"name": "session",
|
||||
"value": "abcd1234",
|
||||
"domain": "example.com",
|
||||
"path": "/",
|
||||
"expires": 1675363572.037711,
|
||||
"httpOnly": False,
|
||||
"secure": False,
|
||||
"sameSite": "None"
|
||||
}
|
||||
],
|
||||
"origins": [
|
||||
{
|
||||
"origin": "https://example.com",
|
||||
"localStorage": [
|
||||
{"name": "token", "value": "my_auth_token"},
|
||||
{"name": "refreshToken", "value": "my_refresh_token"}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
storage_state=storage_dict
|
||||
) as crawler:
|
||||
result = await crawler.arun(url='https://example.com/protected')
|
||||
if result.success:
|
||||
print("Crawl succeeded with pre-loaded session data!")
|
||||
print("Page HTML length:", len(result.html))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Passing `storage_state` as a File
|
||||
|
||||
If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
storage_state="mystate.json" # Uses a JSON file instead of a dictionary
|
||||
) as crawler:
|
||||
result = await crawler.arun(url='https://example.com/protected')
|
||||
if result.success:
|
||||
print("Crawl succeeded with pre-loaded session data!")
|
||||
print("Page HTML length:", len(result.html))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
|
||||
|
||||
A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
|
||||
|
||||
1. Perform the login once in a hook.
|
||||
2. After login completes, export the resulting `storage_state` to a file.
|
||||
3. On subsequent runs, provide that `storage_state` to skip the login step.
|
||||
|
||||
**Step-by-Step Example:**
|
||||
|
||||
**First Run (Perform Login and Save State):**
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def on_browser_created_hook(browser):
|
||||
# Access the default context and create a page
|
||||
context = browser.contexts[0]
|
||||
page = await context.new_page()
|
||||
|
||||
# Navigate to the login page
|
||||
await page.goto("https://example.com/login", wait_until="domcontentloaded")
|
||||
|
||||
# Fill in credentials and submit
|
||||
await page.fill("input[name='username']", "myuser")
|
||||
await page.fill("input[name='password']", "mypassword")
|
||||
await page.click("button[type='submit']")
|
||||
await page.wait_for_load_state("networkidle")
|
||||
|
||||
# Now the site sets tokens in localStorage and cookies
|
||||
# Export this state to a file so we can reuse it
|
||||
await context.storage_state(path="my_storage_state.json")
|
||||
await page.close()
|
||||
|
||||
async def main():
|
||||
# First run: perform login and export the storage_state
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
verbose=True,
|
||||
hooks={"on_browser_created": on_browser_created_hook},
|
||||
use_persistent_context=True,
|
||||
user_data_dir="./my_user_data"
|
||||
) as crawler:
|
||||
|
||||
# After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
|
||||
result = await crawler.arun(
|
||||
url='https://example.com/protected-page',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
|
||||
)
|
||||
print("First run result success:", result.success)
|
||||
if result.success:
|
||||
print("Protected page HTML length:", len(result.html))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Second Run (Reuse Saved State, No Login Needed):**
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
async def main():
|
||||
# Second run: no need to hook on_browser_created this time.
|
||||
# Just provide the previously saved storage state.
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
verbose=True,
|
||||
use_persistent_context=True,
|
||||
user_data_dir="./my_user_data",
|
||||
storage_state="my_storage_state.json" # Reuse previously exported state
|
||||
) as crawler:
|
||||
|
||||
# Now the crawler starts already logged in
|
||||
result = await crawler.arun(
|
||||
url='https://example.com/protected-page',
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
|
||||
)
|
||||
print("Second run result success:", result.success)
|
||||
if result.success:
|
||||
print("Protected page HTML length:", len(result.html))
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**What’s Happening Here?**
|
||||
|
||||
- During the first run, the `on_browser_created_hook` logs into the site.
|
||||
- After logging in, the crawler exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
|
||||
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
|
||||
|
||||
**Sign Out Scenario:**
|
||||
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
|
||||
|
||||
---
|
||||
|
||||
### Conclusion
|
||||
|
||||
By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.
|
||||
@@ -1,281 +0,0 @@
|
||||
from openai import AsyncOpenAI
|
||||
from chainlit.types import ThreadDict
|
||||
import chainlit as cl
|
||||
from chainlit.input_widget import Select, Switch, Slider
|
||||
client = AsyncOpenAI()
|
||||
|
||||
# Instrument the OpenAI client
|
||||
cl.instrument_openai()
|
||||
|
||||
settings = {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"temperature": 0.5,
|
||||
"max_tokens": 500,
|
||||
"top_p": 1,
|
||||
"frequency_penalty": 0,
|
||||
"presence_penalty": 0,
|
||||
}
|
||||
|
||||
@cl.action_callback("action_button")
|
||||
async def on_action(action: cl.Action):
|
||||
print("The user clicked on the action button!")
|
||||
|
||||
return "Thank you for clicking on the action button!"
|
||||
|
||||
@cl.set_chat_profiles
|
||||
async def chat_profile():
|
||||
return [
|
||||
cl.ChatProfile(
|
||||
name="GPT-3.5",
|
||||
markdown_description="The underlying LLM model is **GPT-3.5**.",
|
||||
icon="https://picsum.photos/200",
|
||||
),
|
||||
cl.ChatProfile(
|
||||
name="GPT-4",
|
||||
markdown_description="The underlying LLM model is **GPT-4**.",
|
||||
icon="https://picsum.photos/250",
|
||||
),
|
||||
]
|
||||
|
||||
@cl.on_chat_start
|
||||
async def on_chat_start():
|
||||
|
||||
settings = await cl.ChatSettings(
|
||||
[
|
||||
Select(
|
||||
id="Model",
|
||||
label="OpenAI - Model",
|
||||
values=["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k"],
|
||||
initial_index=0,
|
||||
),
|
||||
Switch(id="Streaming", label="OpenAI - Stream Tokens", initial=True),
|
||||
Slider(
|
||||
id="Temperature",
|
||||
label="OpenAI - Temperature",
|
||||
initial=1,
|
||||
min=0,
|
||||
max=2,
|
||||
step=0.1,
|
||||
),
|
||||
Slider(
|
||||
id="SAI_Steps",
|
||||
label="Stability AI - Steps",
|
||||
initial=30,
|
||||
min=10,
|
||||
max=150,
|
||||
step=1,
|
||||
description="Amount of inference steps performed on image generation.",
|
||||
),
|
||||
Slider(
|
||||
id="SAI_Cfg_Scale",
|
||||
label="Stability AI - Cfg_Scale",
|
||||
initial=7,
|
||||
min=1,
|
||||
max=35,
|
||||
step=0.1,
|
||||
description="Influences how strongly your generation is guided to match your prompt.",
|
||||
),
|
||||
Slider(
|
||||
id="SAI_Width",
|
||||
label="Stability AI - Image Width",
|
||||
initial=512,
|
||||
min=256,
|
||||
max=2048,
|
||||
step=64,
|
||||
tooltip="Measured in pixels",
|
||||
),
|
||||
Slider(
|
||||
id="SAI_Height",
|
||||
label="Stability AI - Image Height",
|
||||
initial=512,
|
||||
min=256,
|
||||
max=2048,
|
||||
step=64,
|
||||
tooltip="Measured in pixels",
|
||||
),
|
||||
]
|
||||
).send()
|
||||
|
||||
chat_profile = cl.user_session.get("chat_profile")
|
||||
await cl.Message(
|
||||
content=f"starting chat using the {chat_profile} chat profile"
|
||||
).send()
|
||||
|
||||
print("A new chat session has started!")
|
||||
cl.user_session.set("session", {
|
||||
"history": [],
|
||||
"context": []
|
||||
})
|
||||
|
||||
image = cl.Image(url="https://c.tenor.com/uzWDSSLMCmkAAAAd/tenor.gif", name="cat image", display="inline")
|
||||
|
||||
# Attach the image to the message
|
||||
await cl.Message(
|
||||
content="You are such a good girl, aren't you?!",
|
||||
elements=[image],
|
||||
).send()
|
||||
|
||||
text_content = "Hello, this is a text element."
|
||||
elements = [
|
||||
cl.Text(name="simple_text", content=text_content, display="inline")
|
||||
]
|
||||
|
||||
await cl.Message(
|
||||
content="Check out this text element!",
|
||||
elements=elements,
|
||||
).send()
|
||||
|
||||
elements = [
|
||||
cl.Audio(path="./assets/audio.mp3", display="inline"),
|
||||
]
|
||||
await cl.Message(
|
||||
content="Here is an audio file",
|
||||
elements=elements,
|
||||
).send()
|
||||
|
||||
await cl.Avatar(
|
||||
name="Tool 1",
|
||||
url="https://avatars.githubusercontent.com/u/128686189?s=400&u=a1d1553023f8ea0921fba0debbe92a8c5f840dd9&v=4",
|
||||
).send()
|
||||
|
||||
await cl.Message(
|
||||
content="This message should not have an avatar!", author="Tool 0"
|
||||
).send()
|
||||
|
||||
await cl.Message(
|
||||
content="This message should have an avatar!", author="Tool 1"
|
||||
).send()
|
||||
|
||||
elements = [
|
||||
cl.File(
|
||||
name="quickstart.py",
|
||||
path="./quickstart.py",
|
||||
display="inline",
|
||||
),
|
||||
]
|
||||
|
||||
await cl.Message(
|
||||
content="This message has a file element", elements=elements
|
||||
).send()
|
||||
|
||||
# Sending an action button within a chatbot message
|
||||
actions = [
|
||||
cl.Action(name="action_button", value="example_value", description="Click me!")
|
||||
]
|
||||
|
||||
await cl.Message(content="Interact with this action button:", actions=actions).send()
|
||||
|
||||
# res = await cl.AskActionMessage(
|
||||
# content="Pick an action!",
|
||||
# actions=[
|
||||
# cl.Action(name="continue", value="continue", label="✅ Continue"),
|
||||
# cl.Action(name="cancel", value="cancel", label="❌ Cancel"),
|
||||
# ],
|
||||
# ).send()
|
||||
|
||||
# if res and res.get("value") == "continue":
|
||||
# await cl.Message(
|
||||
# content="Continue!",
|
||||
# ).send()
|
||||
|
||||
# import plotly.graph_objects as go
|
||||
# fig = go.Figure(
|
||||
# data=[go.Bar(y=[2, 1, 3])],
|
||||
# layout_title_text="An example figure",
|
||||
# )
|
||||
# elements = [cl.Plotly(name="chart", figure=fig, display="inline")]
|
||||
|
||||
# await cl.Message(content="This message has a chart", elements=elements).send()
|
||||
|
||||
# Sending a pdf with the local file path
|
||||
# elements = [
|
||||
# cl.Pdf(name="pdf1", display="inline", path="./pdf1.pdf")
|
||||
# ]
|
||||
|
||||
# cl.Message(content="Look at this local pdf!", elements=elements).send()
|
||||
|
||||
@cl.on_settings_update
|
||||
async def setup_agent(settings):
|
||||
print("on_settings_update", settings)
|
||||
|
||||
@cl.on_stop
|
||||
def on_stop():
|
||||
print("The user wants to stop the task!")
|
||||
|
||||
@cl.on_chat_end
|
||||
def on_chat_end():
|
||||
print("The user disconnected!")
|
||||
|
||||
|
||||
@cl.on_chat_resume
|
||||
async def on_chat_resume(thread: ThreadDict):
|
||||
print("The user resumed a previous chat session!")
|
||||
|
||||
|
||||
|
||||
|
||||
# @cl.on_message
|
||||
async def on_message(message: cl.Message):
|
||||
cl.user_session.get("session")["history"].append({
|
||||
"role": "user",
|
||||
"content": message.content
|
||||
})
|
||||
response = await client.chat.completions.create(
|
||||
messages=[
|
||||
{
|
||||
"content": "You are a helpful bot",
|
||||
"role": "system"
|
||||
},
|
||||
*cl.user_session.get("session")["history"]
|
||||
],
|
||||
**settings
|
||||
)
|
||||
|
||||
|
||||
# Add assitanr message to the history
|
||||
cl.user_session.get("session")["history"].append({
|
||||
"role": "assistant",
|
||||
"content": response.choices[0].message.content
|
||||
})
|
||||
|
||||
# msg.content = response.choices[0].message.content
|
||||
# await msg.update()
|
||||
|
||||
# await cl.Message(content=response.choices[0].message.content).send()
|
||||
|
||||
@cl.on_message
|
||||
async def on_message(message: cl.Message):
|
||||
cl.user_session.get("session")["history"].append({
|
||||
"role": "user",
|
||||
"content": message.content
|
||||
})
|
||||
|
||||
msg = cl.Message(content="")
|
||||
await msg.send()
|
||||
|
||||
stream = await client.chat.completions.create(
|
||||
messages=[
|
||||
{
|
||||
"content": "You are a helpful bot",
|
||||
"role": "system"
|
||||
},
|
||||
*cl.user_session.get("session")["history"]
|
||||
],
|
||||
stream = True,
|
||||
**settings
|
||||
)
|
||||
|
||||
async for part in stream:
|
||||
if token := part.choices[0].delta.content or "":
|
||||
await msg.stream_token(token)
|
||||
|
||||
# Add assitanr message to the history
|
||||
cl.user_session.get("session")["history"].append({
|
||||
"role": "assistant",
|
||||
"content": msg.content
|
||||
})
|
||||
await msg.update()
|
||||
|
||||
if __name__ == "__main__":
|
||||
from chainlit.cli import run_chainlit
|
||||
run_chainlit(__file__)
|
||||
@@ -1,238 +0,0 @@
|
||||
# Make sure to install the required packageschainlit and groq
|
||||
import os, time
|
||||
from openai import AsyncOpenAI
|
||||
import chainlit as cl
|
||||
import re
|
||||
import requests
|
||||
from io import BytesIO
|
||||
from chainlit.element import ElementBased
|
||||
from groq import Groq
|
||||
|
||||
# Import threadpools to run the crawl_url function in a separate thread
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
|
||||
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
|
||||
|
||||
# Instrument the OpenAI client
|
||||
cl.instrument_openai()
|
||||
|
||||
settings = {
|
||||
"model": "llama3-8b-8192",
|
||||
"temperature": 0.5,
|
||||
"max_tokens": 500,
|
||||
"top_p": 1,
|
||||
"frequency_penalty": 0,
|
||||
"presence_penalty": 0,
|
||||
}
|
||||
|
||||
def extract_urls(text):
|
||||
url_pattern = re.compile(r'(https?://\S+)')
|
||||
return url_pattern.findall(text)
|
||||
|
||||
def crawl_url(url):
|
||||
data = {
|
||||
"urls": [url],
|
||||
"include_raw_html": True,
|
||||
"word_count_threshold": 10,
|
||||
"extraction_strategy": "NoExtractionStrategy",
|
||||
"chunking_strategy": "RegexChunking"
|
||||
}
|
||||
response = requests.post("https://crawl4ai.com/crawl", json=data)
|
||||
response_data = response.json()
|
||||
response_data = response_data['results'][0]
|
||||
return response_data['markdown']
|
||||
|
||||
@cl.on_chat_start
|
||||
async def on_chat_start():
|
||||
cl.user_session.set("session", {
|
||||
"history": [],
|
||||
"context": {}
|
||||
})
|
||||
await cl.Message(
|
||||
content="Welcome to the chat! How can I assist you today?"
|
||||
).send()
|
||||
|
||||
@cl.on_message
|
||||
async def on_message(message: cl.Message):
|
||||
user_session = cl.user_session.get("session")
|
||||
|
||||
# Extract URLs from the user's message
|
||||
urls = extract_urls(message.content)
|
||||
|
||||
|
||||
futures = []
|
||||
with ThreadPoolExecutor() as executor:
|
||||
for url in urls:
|
||||
futures.append(executor.submit(crawl_url, url))
|
||||
|
||||
results = [future.result() for future in futures]
|
||||
|
||||
for url, result in zip(urls, results):
|
||||
ref_number = f"REF_{len(user_session['context']) + 1}"
|
||||
user_session["context"][ref_number] = {
|
||||
"url": url,
|
||||
"content": result
|
||||
}
|
||||
|
||||
# for url in urls:
|
||||
# # Crawl the content of each URL and add it to the session context with a reference number
|
||||
# ref_number = f"REF_{len(user_session['context']) + 1}"
|
||||
# crawled_content = crawl_url(url)
|
||||
# user_session["context"][ref_number] = {
|
||||
# "url": url,
|
||||
# "content": crawled_content
|
||||
# }
|
||||
|
||||
user_session["history"].append({
|
||||
"role": "user",
|
||||
"content": message.content
|
||||
})
|
||||
|
||||
# Create a system message that includes the context
|
||||
context_messages = [
|
||||
f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
|
||||
for ref, data in user_session["context"].items()
|
||||
]
|
||||
if context_messages:
|
||||
system_message = {
|
||||
"role": "system",
|
||||
"content": (
|
||||
"You are a helpful bot. Use the following context for answering questions. "
|
||||
"Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
|
||||
"If the question requires any information from the provided appendices or context, refer to the sources. "
|
||||
"If not, there is no need to add a references section. "
|
||||
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
|
||||
"\n\n".join(context_messages)
|
||||
)
|
||||
}
|
||||
else:
|
||||
system_message = {
|
||||
"role": "system",
|
||||
"content": "You are a helpful assistant."
|
||||
}
|
||||
|
||||
|
||||
msg = cl.Message(content="")
|
||||
await msg.send()
|
||||
|
||||
# Get response from the LLM
|
||||
stream = await client.chat.completions.create(
|
||||
messages=[
|
||||
system_message,
|
||||
*user_session["history"]
|
||||
],
|
||||
stream=True,
|
||||
**settings
|
||||
)
|
||||
|
||||
assistant_response = ""
|
||||
async for part in stream:
|
||||
if token := part.choices[0].delta.content:
|
||||
assistant_response += token
|
||||
await msg.stream_token(token)
|
||||
|
||||
# Add assistant message to the history
|
||||
user_session["history"].append({
|
||||
"role": "assistant",
|
||||
"content": assistant_response
|
||||
})
|
||||
await msg.update()
|
||||
|
||||
# Append the reference section to the assistant's response
|
||||
reference_section = "\n\nReferences:\n"
|
||||
for ref, data in user_session["context"].items():
|
||||
reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
|
||||
|
||||
msg.content += reference_section
|
||||
await msg.update()
|
||||
|
||||
|
||||
@cl.on_audio_chunk
|
||||
async def on_audio_chunk(chunk: cl.AudioChunk):
|
||||
if chunk.isStart:
|
||||
buffer = BytesIO()
|
||||
# This is required for whisper to recognize the file type
|
||||
buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
|
||||
# Initialize the session for a new audio stream
|
||||
cl.user_session.set("audio_buffer", buffer)
|
||||
cl.user_session.set("audio_mime_type", chunk.mimeType)
|
||||
|
||||
# Write the chunks to a buffer and transcribe the whole audio at the end
|
||||
cl.user_session.get("audio_buffer").write(chunk.data)
|
||||
|
||||
pass
|
||||
|
||||
@cl.step(type="tool")
|
||||
async def speech_to_text(audio_file):
|
||||
cli = Groq()
|
||||
|
||||
# response = cli.audio.transcriptions.create(
|
||||
# file=audio_file, #(filename, file.read()),
|
||||
# model="whisper-large-v3",
|
||||
# )
|
||||
|
||||
response = await client.audio.transcriptions.create(
|
||||
model="whisper-large-v3", file=audio_file
|
||||
)
|
||||
|
||||
return response.text
|
||||
|
||||
|
||||
@cl.on_audio_end
|
||||
async def on_audio_end(elements: list[ElementBased]):
|
||||
# Get the audio buffer from the session
|
||||
audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
|
||||
audio_buffer.seek(0) # Move the file pointer to the beginning
|
||||
audio_file = audio_buffer.read()
|
||||
audio_mime_type: str = cl.user_session.get("audio_mime_type")
|
||||
|
||||
# input_audio_el = cl.Audio(
|
||||
# mime=audio_mime_type, content=audio_file, name=audio_buffer.name
|
||||
# )
|
||||
# await cl.Message(
|
||||
# author="You",
|
||||
# type="user_message",
|
||||
# content="",
|
||||
# elements=[input_audio_el, *elements]
|
||||
# ).send()
|
||||
|
||||
# answer_message = await cl.Message(content="").send()
|
||||
|
||||
|
||||
start_time = time.time()
|
||||
whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
|
||||
transcription = await speech_to_text(whisper_input)
|
||||
end_time = time.time()
|
||||
print(f"Transcription took {end_time - start_time} seconds")
|
||||
|
||||
user_msg = cl.Message(
|
||||
author="You",
|
||||
type="user_message",
|
||||
content=transcription
|
||||
)
|
||||
await user_msg.send()
|
||||
await on_message(user_msg)
|
||||
|
||||
# images = [file for file in elements if "image" in file.mime]
|
||||
|
||||
# text_answer = await generate_text_answer(transcription, images)
|
||||
|
||||
# output_name, output_audio = await text_to_speech(text_answer, audio_mime_type)
|
||||
|
||||
# output_audio_el = cl.Audio(
|
||||
# name=output_name,
|
||||
# auto_play=True,
|
||||
# mime=audio_mime_type,
|
||||
# content=output_audio,
|
||||
# )
|
||||
|
||||
# answer_message.elements = [output_audio_el]
|
||||
|
||||
# answer_message.content = transcription
|
||||
# await answer_message.update()
|
||||
|
||||
if __name__ == "__main__":
|
||||
from chainlit.cli import run_chainlit
|
||||
run_chainlit(__file__)
|
||||
|
||||
|
||||
117
docs/examples/tutorial_dynamic_clicks.md
Normal file
117
docs/examples/tutorial_dynamic_clicks.md
Normal file
@@ -0,0 +1,117 @@
|
||||
# Tutorial: Clicking Buttons to Load More Content with Crawl4AI
|
||||
|
||||
## Introduction
|
||||
|
||||
When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:
|
||||
|
||||
1. **Step-by-step (Session-based) Approach:** Multiple calls to `arun()` to progressively load more content.
|
||||
2. **Single-call Approach:** Execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks at once before the extraction.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A working installation of Crawl4AI
|
||||
- Basic familiarity with Python’s `async`/`await` syntax
|
||||
|
||||
## Step-by-Step Approach
|
||||
|
||||
Use a session ID to maintain state across multiple `arun()` calls:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
js_code = [
|
||||
# This JS finds the “Next” button and clicks it
|
||||
"const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
|
||||
]
|
||||
|
||||
wait_for_condition = "css:.new-content-class"
|
||||
|
||||
async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
|
||||
# 1. Load the initial page
|
||||
result_initial = await crawler.arun(
|
||||
url="https://example.com",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
session_id="my_session"
|
||||
)
|
||||
|
||||
# 2. Click the 'Next' button and wait for new content
|
||||
result_next = await crawler.arun(
|
||||
url="https://example.com",
|
||||
session_id="my_session",
|
||||
js_code=js_code,
|
||||
wait_for=wait_for_condition,
|
||||
js_only=True,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
# `result_next` now contains the updated HTML after clicking 'Next'
|
||||
```
|
||||
|
||||
**Key Points:**
|
||||
- **`session_id`**: Keeps the same browser context open.
|
||||
- **`js_code`**: Executes JavaScript in the context of the already loaded page.
|
||||
- **`wait_for`**: Ensures the crawler waits until new content is fully loaded.
|
||||
- **`js_only=True`**: Runs the JS in the current session without reloading the page.
|
||||
|
||||
By repeating the `arun()` call multiple times and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content.
|
||||
|
||||
## Single-call Approach
|
||||
|
||||
If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:
|
||||
- Iterates over all the modules or "Next" buttons
|
||||
- Clicks them one by one
|
||||
- Waits for content updates between each click
|
||||
- Once done, returns control to Crawl4AI for extraction.
|
||||
|
||||
Example snippet:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
js_code = [
|
||||
# Example JS that clicks multiple modules:
|
||||
"""
|
||||
(async () => {
|
||||
const modules = document.querySelectorAll('.module-item');
|
||||
for (let i = 0; i < modules.length; i++) {
|
||||
modules[i].scrollIntoView();
|
||||
modules[i].click();
|
||||
// Wait for each module’s content to load, adjust 100ms as needed
|
||||
await new Promise(r => setTimeout(r, 100));
|
||||
}
|
||||
})();
|
||||
"""
|
||||
]
|
||||
|
||||
async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code=js_code,
|
||||
wait_for="css:.final-loaded-content-class",
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
# `result` now contains all content after all modules have been clicked in one go.
|
||||
```
|
||||
|
||||
**Key Points:**
|
||||
- All interactions (clicks and waits) happen before the extraction.
|
||||
- Ideal for pages where all steps can be done in a single pass.
|
||||
|
||||
## Choosing the Right Approach
|
||||
|
||||
- **Step-by-Step (Session-based)**:
|
||||
- Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
|
||||
- Useful if the page requires multiple conditions checked at runtime.
|
||||
|
||||
- **Single-call**:
|
||||
- Perfect if the sequence of interactions is known in advance.
|
||||
- Cleaner code if the page’s structure is consistent and predictable.
|
||||
|
||||
## Conclusion
|
||||
|
||||
Crawl4AI makes it easy to handle dynamic content:
|
||||
- Use session IDs and multiple `arun()` calls for stepwise crawling.
|
||||
- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.
|
||||
|
||||
This flexibility ensures you can handle a wide range of dynamic web pages efficiently.
|
||||
@@ -52,34 +52,7 @@ async def download_example():
|
||||
else:
|
||||
print("\nNo files were downloaded")
|
||||
|
||||
# 2. Content Filtering with BM25 Example
|
||||
async def content_filtering_example():
|
||||
"""Example of using the new BM25 content filtering"""
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Create filter with custom query for OpenAI's blog
|
||||
content_filter = BM25ContentFilter(
|
||||
# user_query="Investment and fundraising",
|
||||
# user_query="Robotic",
|
||||
bm25_threshold=1.0
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://techcrunch.com/",
|
||||
content_filter=content_filter,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
print(f"Filtered content: {len(result.fit_markdown)}")
|
||||
print(f"Filtered content: {result.fit_markdown}")
|
||||
|
||||
# Save html
|
||||
with open(os.path.join(__data__, "techcrunch.html"), "w") as f:
|
||||
f.write(result.fit_html)
|
||||
|
||||
with open(os.path.join(__data__, "filtered_content.md"), "w") as f:
|
||||
f.write(result.fit_markdown)
|
||||
|
||||
# 3. Local File and Raw HTML Processing Example
|
||||
# 2. Local File and Raw HTML Processing Example
|
||||
async def local_and_raw_html_example():
|
||||
"""Example of processing local files and raw HTML"""
|
||||
# Create a sample HTML file
|
||||
@@ -115,6 +88,68 @@ async def local_and_raw_html_example():
|
||||
print("Local file content:", local_result.markdown)
|
||||
print("\nRaw HTML content:", raw_result.markdown)
|
||||
|
||||
# 3. Enhanced Markdown Generation Example
|
||||
async def markdown_generation_example():
|
||||
"""Example of enhanced markdown generation with citations and LLM-friendly features"""
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# Create a content filter (optional)
|
||||
content_filter = BM25ContentFilter(
|
||||
# user_query="History and cultivation",
|
||||
bm25_threshold=1.0
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/Apple",
|
||||
css_selector="main div#bodyContent",
|
||||
content_filter=content_filter,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/Apple",
|
||||
css_selector="main div#bodyContent",
|
||||
content_filter=BM25ContentFilter()
|
||||
)
|
||||
print(result.markdown_v2.fit_markdown)
|
||||
|
||||
print("\nMarkdown Generation Results:")
|
||||
print(f"1. Original markdown length: {len(result.markdown)}")
|
||||
print(f"2. New markdown versions (markdown_v2):")
|
||||
print(f" - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
|
||||
print(f" - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}")
|
||||
print(f" - References section length: {len(result.markdown_v2.references_markdown)}")
|
||||
if result.markdown_v2.fit_markdown:
|
||||
print(f" - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}")
|
||||
|
||||
# Save examples to files
|
||||
output_dir = os.path.join(__data__, "markdown_examples")
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
# Save different versions
|
||||
with open(os.path.join(output_dir, "1_raw_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.raw_markdown)
|
||||
|
||||
with open(os.path.join(output_dir, "2_citations_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.markdown_with_citations)
|
||||
|
||||
with open(os.path.join(output_dir, "3_references.md"), "w") as f:
|
||||
f.write(result.markdown_v2.references_markdown)
|
||||
|
||||
if result.markdown_v2.fit_markdown:
|
||||
with open(os.path.join(output_dir, "4_filtered_markdown.md"), "w") as f:
|
||||
f.write(result.markdown_v2.fit_markdown)
|
||||
|
||||
print(f"\nMarkdown examples saved to: {output_dir}")
|
||||
|
||||
# Show a sample of citations and references
|
||||
print("\nSample of markdown with citations:")
|
||||
print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
|
||||
print("Sample of references:")
|
||||
print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...")
|
||||
|
||||
# 4. Browser Management Example
|
||||
async def browser_management_example():
|
||||
"""Example of using enhanced browser management features"""
|
||||
@@ -208,9 +243,13 @@ async def api_example():
|
||||
headers=headers
|
||||
) as status_response:
|
||||
result = await status_response.json()
|
||||
print(f"Task result: {result}")
|
||||
print(f"Task status: {result['status']}")
|
||||
|
||||
if result["status"] == "completed":
|
||||
print("Task completed!")
|
||||
print("Results:")
|
||||
news = json.loads(result["results"][0]['extracted_content'])
|
||||
print(json.dumps(news[:4], indent=2))
|
||||
break
|
||||
else:
|
||||
await asyncio.sleep(1)
|
||||
@@ -220,15 +259,15 @@ async def main():
|
||||
# print("Running Crawl4AI feature examples...")
|
||||
|
||||
# print("\n1. Running Download Example:")
|
||||
await download_example()
|
||||
# await download_example()
|
||||
|
||||
# print("\n2. Running Content Filtering Example:")
|
||||
await content_filtering_example()
|
||||
# print("\n2. Running Markdown Generation Example:")
|
||||
# await markdown_generation_example()
|
||||
|
||||
# print("\n3. Running Local and Raw HTML Example:")
|
||||
await local_and_raw_html_example()
|
||||
# # print("\n3. Running Local and Raw HTML Example:")
|
||||
# await local_and_raw_html_example()
|
||||
|
||||
# print("\n4. Running Browser Management Example:")
|
||||
# # print("\n4. Running Browser Management Example:")
|
||||
await browser_management_example()
|
||||
|
||||
# print("\n5. Running API Example:")
|
||||
|
||||
443
docs/examples/v0_4_24_walkthrough.py
Normal file
443
docs/examples/v0_4_24_walkthrough.py
Normal file
@@ -0,0 +1,443 @@
|
||||
"""
|
||||
Crawl4AI v0.4.24 Feature Walkthrough
|
||||
===================================
|
||||
|
||||
This script demonstrates the new features introduced in Crawl4AI v0.4.24.
|
||||
Each section includes detailed examples and explanations of the new capabilities.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import json
|
||||
import re
|
||||
from typing import List, Optional, Dict, Any
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import (
|
||||
AsyncWebCrawler,
|
||||
BrowserConfig,
|
||||
CrawlerRunConfig,
|
||||
CacheMode,
|
||||
LLMExtractionStrategy,
|
||||
JsonCssExtractionStrategy
|
||||
)
|
||||
from crawl4ai.content_filter_strategy import RelevantContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
# Sample HTML for demonstrations
|
||||
SAMPLE_HTML = """
|
||||
<div class="article-list">
|
||||
<article class="post" data-category="tech" data-author="john">
|
||||
<h2 class="title"><a href="/post-1">First Post</a></h2>
|
||||
<div class="meta">
|
||||
<a href="/author/john" class="author">John Doe</a>
|
||||
<span class="date">2023-12-31</span>
|
||||
</div>
|
||||
<div class="content">
|
||||
<p>First post content...</p>
|
||||
<a href="/read-more-1" class="read-more">Read More</a>
|
||||
</div>
|
||||
</article>
|
||||
<article class="post" data-category="science" data-author="jane">
|
||||
<h2 class="title"><a href="/post-2">Second Post</a></h2>
|
||||
<div class="meta">
|
||||
<a href="/author/jane" class="author">Jane Smith</a>
|
||||
<span class="date">2023-12-30</span>
|
||||
</div>
|
||||
<div class="content">
|
||||
<p>Second post content...</p>
|
||||
<a href="/read-more-2" class="read-more">Read More</a>
|
||||
</div>
|
||||
</article>
|
||||
</div>
|
||||
"""
|
||||
|
||||
async def demo_ssl_features():
|
||||
"""
|
||||
Enhanced SSL & Security Features Demo
|
||||
-----------------------------------
|
||||
|
||||
This example demonstrates the new SSL certificate handling and security features:
|
||||
1. Custom certificate paths
|
||||
2. SSL verification options
|
||||
3. HTTPS error handling
|
||||
4. Certificate validation configurations
|
||||
|
||||
These features are particularly useful when:
|
||||
- Working with self-signed certificates
|
||||
- Dealing with corporate proxies
|
||||
- Handling mixed content websites
|
||||
- Managing different SSL security levels
|
||||
"""
|
||||
print("\n1. Enhanced SSL & Security Demo")
|
||||
print("--------------------------------")
|
||||
|
||||
browser_config = BrowserConfig()
|
||||
|
||||
run_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
fetch_ssl_certificate=True # Enable SSL certificate fetching
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=run_config
|
||||
)
|
||||
print(f"SSL Crawl Success: {result.success}")
|
||||
result.ssl_certificate.to_json(
|
||||
os.path.join(os.getcwd(), "ssl_certificate.json")
|
||||
)
|
||||
if not result.success:
|
||||
print(f"SSL Error: {result.error_message}")
|
||||
|
||||
async def demo_content_filtering():
|
||||
"""
|
||||
Smart Content Filtering Demo
|
||||
----------------------
|
||||
|
||||
Demonstrates advanced content filtering capabilities:
|
||||
1. Custom filter to identify and extract specific content
|
||||
2. Integration with markdown generation
|
||||
3. Flexible pruning rules
|
||||
"""
|
||||
print("\n2. Smart Content Filtering Demo")
|
||||
print("--------------------------------")
|
||||
|
||||
# Create a custom content filter
|
||||
class CustomNewsFilter(RelevantContentFilter):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
# Add news-specific patterns
|
||||
self.negative_patterns = re.compile(
|
||||
r'nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending',
|
||||
re.I
|
||||
)
|
||||
self.min_word_count = 30 # Higher threshold for news content
|
||||
|
||||
def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
|
||||
"""
|
||||
Implements news-specific content filtering logic.
|
||||
|
||||
Args:
|
||||
html (str): HTML content to be filtered
|
||||
min_word_threshold (int, optional): Minimum word count threshold
|
||||
|
||||
Returns:
|
||||
List[str]: List of filtered HTML content blocks
|
||||
"""
|
||||
if not html or not isinstance(html, str):
|
||||
return []
|
||||
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
if not soup.body:
|
||||
soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
|
||||
|
||||
body = soup.find('body')
|
||||
|
||||
# Extract chunks with metadata
|
||||
chunks = self.extract_text_chunks(body, min_word_threshold or self.min_word_count)
|
||||
|
||||
# Filter chunks based on news-specific criteria
|
||||
filtered_chunks = []
|
||||
for _, text, tag_type, element in chunks:
|
||||
# Skip if element has negative class/id
|
||||
if self.is_excluded(element):
|
||||
continue
|
||||
|
||||
# Headers are important in news articles
|
||||
if tag_type == 'header':
|
||||
filtered_chunks.append(self.clean_element(element))
|
||||
continue
|
||||
|
||||
# For content, check word count and link density
|
||||
text = element.get_text(strip=True)
|
||||
if len(text.split()) >= (min_word_threshold or self.min_word_count):
|
||||
# Calculate link density
|
||||
links_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
|
||||
link_density = len(links_text) / len(text) if text else 1
|
||||
|
||||
# Accept if link density is reasonable
|
||||
if link_density < 0.5:
|
||||
filtered_chunks.append(self.clean_element(element))
|
||||
|
||||
return filtered_chunks
|
||||
|
||||
# Create markdown generator with custom filter
|
||||
markdown_gen = DefaultMarkdownGenerator(
|
||||
content_filter=CustomNewsFilter()
|
||||
)
|
||||
|
||||
run_config = CrawlerRunConfig(
|
||||
markdown_generator=markdown_gen,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://news.ycombinator.com",
|
||||
config=run_config
|
||||
)
|
||||
print("Filtered Content Sample:")
|
||||
print(result.markdown[:500]) # Show first 500 chars
|
||||
|
||||
async def demo_json_extraction():
|
||||
"""
|
||||
Improved JSON Extraction Demo
|
||||
---------------------------
|
||||
|
||||
Demonstrates the enhanced JSON extraction capabilities:
|
||||
1. Base element attributes extraction
|
||||
2. Complex nested structures
|
||||
3. Multiple extraction patterns
|
||||
|
||||
Key features shown:
|
||||
- Extracting attributes from base elements (href, data-* attributes)
|
||||
- Processing repeated patterns
|
||||
- Handling optional fields
|
||||
"""
|
||||
print("\n3. Improved JSON Extraction Demo")
|
||||
print("--------------------------------")
|
||||
|
||||
# Define the extraction schema with base element attributes
|
||||
json_strategy = JsonCssExtractionStrategy(
|
||||
schema={
|
||||
"name": "Blog Posts",
|
||||
"baseSelector": "div.article-list",
|
||||
"baseFields": [
|
||||
{"name": "list_id", "type": "attribute", "attribute": "data-list-id"},
|
||||
{"name": "category", "type": "attribute", "attribute": "data-category"}
|
||||
],
|
||||
"fields": [
|
||||
{
|
||||
"name": "posts",
|
||||
"selector": "article.post",
|
||||
"type": "nested_list",
|
||||
"baseFields": [
|
||||
{"name": "post_id", "type": "attribute", "attribute": "data-post-id"},
|
||||
{"name": "author_id", "type": "attribute", "attribute": "data-author"}
|
||||
],
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h2.title a",
|
||||
"type": "text",
|
||||
"baseFields": [
|
||||
{"name": "url", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "author",
|
||||
"selector": "div.meta a.author",
|
||||
"type": "text",
|
||||
"baseFields": [
|
||||
{"name": "profile_url", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "date",
|
||||
"selector": "span.date",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "read_more",
|
||||
"selector": "a.read-more",
|
||||
"type": "nested",
|
||||
"fields": [
|
||||
{"name": "text", "type": "text"},
|
||||
{"name": "url", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
)
|
||||
|
||||
# Demonstrate extraction from raw HTML
|
||||
run_config = CrawlerRunConfig(
|
||||
extraction_strategy=json_strategy,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="raw:" + SAMPLE_HTML, # Use raw: prefix for raw HTML
|
||||
config=run_config
|
||||
)
|
||||
print("Extracted Content:")
|
||||
print(result.extracted_content)
|
||||
|
||||
async def demo_input_formats():
|
||||
"""
|
||||
Input Format Handling Demo
|
||||
----------------------
|
||||
|
||||
Demonstrates how LLM extraction can work with different input formats:
|
||||
1. Markdown (default) - Good for simple text extraction
|
||||
2. HTML - Better when you need structure and attributes
|
||||
|
||||
This example shows how HTML input can be beneficial when:
|
||||
- You need to understand the DOM structure
|
||||
- You want to extract both visible text and HTML attributes
|
||||
- The content has complex layouts like tables or forms
|
||||
"""
|
||||
print("\n4. Input Format Handling Demo")
|
||||
print("---------------------------")
|
||||
|
||||
# Create a dummy HTML with rich structure
|
||||
dummy_html = """
|
||||
<div class="job-posting" data-post-id="12345">
|
||||
<header class="job-header">
|
||||
<h1 class="job-title">Senior AI/ML Engineer</h1>
|
||||
<div class="job-meta">
|
||||
<span class="department">AI Research Division</span>
|
||||
<span class="location" data-remote="hybrid">San Francisco (Hybrid)</span>
|
||||
</div>
|
||||
<div class="salary-info" data-currency="USD">
|
||||
<span class="range">$150,000 - $220,000</span>
|
||||
<span class="period">per year</span>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<section class="requirements">
|
||||
<div class="technical-skills">
|
||||
<h3>Technical Requirements</h3>
|
||||
<ul class="required-skills">
|
||||
<li class="skill required" data-priority="must-have">
|
||||
5+ years experience in Machine Learning
|
||||
</li>
|
||||
<li class="skill required" data-priority="must-have">
|
||||
Proficiency in Python and PyTorch/TensorFlow
|
||||
</li>
|
||||
<li class="skill preferred" data-priority="nice-to-have">
|
||||
Experience with distributed training systems
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
|
||||
<div class="soft-skills">
|
||||
<h3>Professional Skills</h3>
|
||||
<ul class="required-skills">
|
||||
<li class="skill required" data-priority="must-have">
|
||||
Strong problem-solving abilities
|
||||
</li>
|
||||
<li class="skill preferred" data-priority="nice-to-have">
|
||||
Experience leading technical teams
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="timeline">
|
||||
<time class="deadline" datetime="2024-02-28">
|
||||
Application Deadline: February 28, 2024
|
||||
</time>
|
||||
</section>
|
||||
|
||||
<footer class="contact-section">
|
||||
<div class="hiring-manager">
|
||||
<h4>Hiring Manager</h4>
|
||||
<div class="contact-info">
|
||||
<span class="name">Dr. Sarah Chen</span>
|
||||
<span class="title">Director of AI Research</span>
|
||||
<span class="email">ai.hiring@example.com</span>
|
||||
</div>
|
||||
</div>
|
||||
<div class="team-info">
|
||||
<p>Join our team of 50+ researchers working on cutting-edge AI applications</p>
|
||||
</div>
|
||||
</footer>
|
||||
</div>
|
||||
"""
|
||||
|
||||
# Use raw:// prefix to pass HTML content directly
|
||||
url = f"raw://{dummy_html}"
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
from typing import List, Optional
|
||||
|
||||
# Define our schema using Pydantic
|
||||
class JobRequirement(BaseModel):
|
||||
category: str = Field(description="Category of the requirement (e.g., Technical, Soft Skills)")
|
||||
items: List[str] = Field(description="List of specific requirements in this category")
|
||||
priority: str = Field(description="Priority level (Required/Preferred) based on the HTML class or context")
|
||||
|
||||
class JobPosting(BaseModel):
|
||||
title: str = Field(description="Job title")
|
||||
department: str = Field(description="Department or team")
|
||||
location: str = Field(description="Job location, including remote options")
|
||||
salary_range: Optional[str] = Field(description="Salary range if specified")
|
||||
requirements: List[JobRequirement] = Field(description="Categorized job requirements")
|
||||
application_deadline: Optional[str] = Field(description="Application deadline if specified")
|
||||
contact_info: Optional[dict] = Field(description="Contact information from footer or contact section")
|
||||
|
||||
# First try with markdown (default)
|
||||
markdown_strategy = LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
schema=JobPosting.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""
|
||||
Extract job posting details into structured data. Focus on the visible text content
|
||||
and organize requirements into categories.
|
||||
""",
|
||||
input_format="markdown" # default
|
||||
)
|
||||
|
||||
# Then with HTML for better structure understanding
|
||||
html_strategy = LLMExtractionStrategy(
|
||||
provider="openai/gpt-4",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
schema=JobPosting.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""
|
||||
Extract job posting details, using HTML structure to:
|
||||
1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
|
||||
2. Extract contact info from the page footer or dedicated contact section
|
||||
3. Parse salary information from specially formatted elements
|
||||
4. Determine application deadline from timestamp or date elements
|
||||
|
||||
Use HTML attributes and classes to enhance extraction accuracy.
|
||||
""",
|
||||
input_format="html" # explicitly use HTML
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Try with markdown first
|
||||
markdown_config = CrawlerRunConfig(
|
||||
extraction_strategy=markdown_strategy
|
||||
)
|
||||
markdown_result = await crawler.arun(
|
||||
url=url,
|
||||
config=markdown_config
|
||||
)
|
||||
print("\nMarkdown-based Extraction Result:")
|
||||
items = json.loads(markdown_result.extracted_content)
|
||||
print(json.dumps(items, indent=2))
|
||||
|
||||
# Then with HTML for better structure understanding
|
||||
html_config = CrawlerRunConfig(
|
||||
extraction_strategy=html_strategy
|
||||
)
|
||||
html_result = await crawler.arun(
|
||||
url=url,
|
||||
config=html_config
|
||||
)
|
||||
print("\nHTML-based Extraction Result:")
|
||||
items = json.loads(html_result.extracted_content)
|
||||
print(json.dumps(items, indent=2))
|
||||
|
||||
# Main execution
|
||||
async def main():
|
||||
print("Crawl4AI v0.4.24 Feature Walkthrough")
|
||||
print("====================================")
|
||||
|
||||
# Run all demos
|
||||
await demo_ssl_features()
|
||||
await demo_content_filtering()
|
||||
await demo_json_extraction()
|
||||
# await demo_input_formats()
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -2,80 +2,12 @@
|
||||
|
||||
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
|
||||
|
||||
## Content Cleaning
|
||||
|
||||
### Understanding Clean Content
|
||||
When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
|
||||
|
||||
1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
|
||||
2. **Content Relevance**: Identifies and preserves meaningful content blocks
|
||||
3. **Layout Analysis**: Understands page structure to identify main content areas
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Remove blocks with fewer words
|
||||
excluded_tags=['form', 'nav'], # Remove specific HTML tags
|
||||
remove_overlay_elements=True # Remove popups/modals
|
||||
)
|
||||
|
||||
# Get clean content
|
||||
print(result.cleaned_html) # Cleaned HTML
|
||||
print(result.markdown) # Clean markdown version
|
||||
```
|
||||
|
||||
### Fit Markdown: Smart Content Extraction
|
||||
One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
|
||||
|
||||
#### How Fit Markdown Works
|
||||
- Analyzes content density and distribution
|
||||
- Identifies content patterns and structures
|
||||
- Removes boilerplate content (headers, footers, sidebars)
|
||||
- Preserves the most relevant content blocks
|
||||
- Maintains content hierarchy and formatting
|
||||
|
||||
#### Perfect For:
|
||||
- Blog posts and articles
|
||||
- News content
|
||||
- Documentation pages
|
||||
- Any page with a clear main content area
|
||||
|
||||
#### Not Recommended For:
|
||||
- E-commerce product listings
|
||||
- Search results pages
|
||||
- Social media feeds
|
||||
- Pages with multiple equal-weight content sections
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Get the most relevant content
|
||||
main_content = result.fit_markdown
|
||||
|
||||
# Compare with regular markdown
|
||||
all_content = result.markdown
|
||||
|
||||
print(f"Fit Markdown Length: {len(main_content)}")
|
||||
print(f"Regular Markdown Length: {len(all_content)}")
|
||||
```
|
||||
|
||||
#### Example Use Case
|
||||
```python
|
||||
async def extract_article_content(url: str) -> str:
|
||||
"""Extract main article content from a blog or news site."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=url)
|
||||
|
||||
# fit_markdown will focus on the article content,
|
||||
# excluding navigation, ads, and other distractions
|
||||
return result.fit_markdown
|
||||
```
|
||||
|
||||
## Media Processing
|
||||
|
||||
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
|
||||
|
||||
### Image Processing
|
||||
|
||||
The library handles various image scenarios, including:
|
||||
- Regular images
|
||||
- Lazy-loaded images
|
||||
@@ -84,7 +16,10 @@ The library handles various image scenarios, including:
|
||||
- Image metadata and context
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
for image in result.media["images"]:
|
||||
# Each image includes rich metadata
|
||||
@@ -96,20 +31,27 @@ for image in result.media["images"]:
|
||||
```
|
||||
|
||||
### Handling Lazy-Loaded Content
|
||||
Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
|
||||
|
||||
Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
wait_for="css:img[data-src]", # Wait for lazy images
|
||||
delay_before_return_html=2.0 # Additional wait time
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Video and Audio Content
|
||||
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Process videos
|
||||
for video in result.media["videos"]:
|
||||
print(f"Video source: {video['src']}")
|
||||
@@ -129,6 +71,7 @@ for audio in result.media["audios"]:
|
||||
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
@@ -137,7 +80,10 @@ The library automatically categorizes links into:
|
||||
- Content links
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Analyze internal links
|
||||
for link in result.links["internal"]:
|
||||
@@ -154,18 +100,19 @@ for link in result.links["external"]:
|
||||
```
|
||||
|
||||
### Smart Link Filtering
|
||||
Control which links are included in the results:
|
||||
|
||||
Control which links are included in the results with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["ads.example.com"] # Exclude specific domains
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Metadata Extraction
|
||||
@@ -173,7 +120,10 @@ result = await crawler.arun(
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
metadata = result.metadata
|
||||
print(f"Title: {metadata['title']}")
|
||||
@@ -184,40 +134,3 @@ print(f"Published Date: {metadata['published_date']}")
|
||||
print(f"Modified Date: {metadata['modified_date']}")
|
||||
print(f"Language: {metadata['language']}")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use Fit Markdown for Articles**
|
||||
```python
|
||||
# Perfect for blog posts, news articles, documentation
|
||||
content = result.fit_markdown
|
||||
```
|
||||
|
||||
2. **Handle Media Appropriately**
|
||||
```python
|
||||
# Filter by relevance score
|
||||
relevant_images = [
|
||||
img for img in result.media["images"]
|
||||
if img['score'] > 5
|
||||
]
|
||||
```
|
||||
|
||||
3. **Combine Link Analysis with Content**
|
||||
```python
|
||||
# Get content links with context
|
||||
content_links = [
|
||||
link for link in result.links["internal"]
|
||||
if link['type'] == 'content'
|
||||
]
|
||||
```
|
||||
|
||||
4. **Clean Content with Purpose**
|
||||
```python
|
||||
# Customize cleaning based on your needs
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=20, # Adjust based on content type
|
||||
keep_data_attributes=False, # Remove data attributes
|
||||
process_iframes=True # Include iframe content
|
||||
)
|
||||
```
|
||||
@@ -1,110 +1,121 @@
|
||||
# Hooks & Auth for AsyncWebCrawler
|
||||
|
||||
Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
|
||||
Crawl4AI's `AsyncWebCrawler` allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This updated documentation demonstrates how to use hooks, including the new `on_page_context_created` hook, and ensures compatibility with `BrowserConfig` and `CrawlerRunConfig`.
|
||||
|
||||
## Example: Using Crawler Hooks with AsyncWebCrawler
|
||||
|
||||
Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
|
||||
In this example, we'll:
|
||||
|
||||
1. Configure the browser when it's created.
|
||||
2. Add custom headers before navigating to the URL.
|
||||
3. Log the current URL after navigation.
|
||||
4. Perform actions after JavaScript execution.
|
||||
5. Log the length of the HTML before returning it.
|
||||
1. Configure the browser and set up authentication when it's created.
|
||||
2. Apply custom routing and initial actions when the page context is created.
|
||||
3. Add custom headers before navigating to the URL.
|
||||
4. Log the current URL after navigation.
|
||||
5. Perform actions after JavaScript execution.
|
||||
6. Log the length of the HTML before returning it.
|
||||
|
||||
### Hook Definitions
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
from playwright.async_api import Page, Browser
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from playwright.async_api import Page, Browser, BrowserContext
|
||||
|
||||
async def on_browser_created(browser: Browser):
|
||||
def log_routing(route):
|
||||
# Example: block loading images
|
||||
if route.request.resource_type == "image":
|
||||
print(f"[HOOK] Blocking image request: {route.request.url}")
|
||||
asyncio.create_task(route.abort())
|
||||
else:
|
||||
asyncio.create_task(route.continue_())
|
||||
|
||||
async def on_browser_created(browser: Browser, **kwargs):
|
||||
print("[HOOK] on_browser_created")
|
||||
# Example customization: set browser viewport size
|
||||
context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
|
||||
# Example: Set browser viewport size and log in
|
||||
context = await browser.new_context(viewport={"width": 1920, "height": 1080})
|
||||
page = await context.new_page()
|
||||
|
||||
# Example customization: logging in to a hypothetical website
|
||||
await page.goto('https://example.com/login')
|
||||
await page.fill('input[name="username"]', 'testuser')
|
||||
await page.fill('input[name="password"]', 'password123')
|
||||
await page.click('button[type="submit"]')
|
||||
await page.wait_for_selector('#welcome')
|
||||
|
||||
# Add a custom cookie
|
||||
await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
|
||||
|
||||
await page.goto("https://example.com/login")
|
||||
await page.fill("input[name='username']", "testuser")
|
||||
await page.fill("input[name='password']", "password123")
|
||||
await page.click("button[type='submit']")
|
||||
await page.wait_for_selector("#welcome")
|
||||
await context.add_cookies([{"name": "auth_token", "value": "abc123", "url": "https://example.com"}])
|
||||
await page.close()
|
||||
await context.close()
|
||||
|
||||
async def before_goto(page: Page):
|
||||
print("[HOOK] before_goto")
|
||||
# Example customization: add custom headers
|
||||
await page.set_extra_http_headers({'X-Test-Header': 'test'})
|
||||
async def on_page_context_created(context: BrowserContext, page: Page, **kwargs):
|
||||
print("[HOOK] on_page_context_created")
|
||||
await context.route("**", log_routing)
|
||||
|
||||
async def after_goto(page: Page):
|
||||
async def before_goto(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] before_goto")
|
||||
await page.set_extra_http_headers({"X-Test-Header": "test"})
|
||||
|
||||
async def after_goto(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] after_goto")
|
||||
# Example customization: log the URL
|
||||
print(f"Current URL: {page.url}")
|
||||
|
||||
async def on_execution_started(page: Page):
|
||||
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] on_execution_started")
|
||||
# Example customization: perform actions after JS execution
|
||||
await page.evaluate("console.log('Custom JS executed')")
|
||||
|
||||
async def before_return_html(page: Page, html: str):
|
||||
async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
|
||||
print("[HOOK] before_return_html")
|
||||
# Example customization: log the HTML length
|
||||
print(f"HTML length: {len(html)}")
|
||||
return page
|
||||
```
|
||||
|
||||
### Using the Hooks with the AsyncWebCrawler
|
||||
### Using the Hooks with AsyncWebCrawler
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
async def main():
|
||||
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
|
||||
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
|
||||
crawler_strategy.set_hook('on_browser_created', on_browser_created)
|
||||
crawler_strategy.set_hook('before_goto', before_goto)
|
||||
crawler_strategy.set_hook('after_goto', after_goto)
|
||||
crawler_strategy.set_hook('on_execution_started', on_execution_started)
|
||||
crawler_strategy.set_hook('before_return_html', before_return_html)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="footer"
|
||||
)
|
||||
print("\n🔗 Using Crawler Hooks: Customize AsyncWebCrawler with hooks!")
|
||||
|
||||
print("📦 Crawler Hooks result:")
|
||||
# Configure browser and crawler settings
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080
|
||||
)
|
||||
|
||||
crawler_run_config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="footer"
|
||||
)
|
||||
|
||||
# Initialize crawler
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
|
||||
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
|
||||
crawler.crawler_strategy.set_hook("before_goto", before_goto)
|
||||
crawler.crawler_strategy.set_hook("after_goto", after_goto)
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
|
||||
|
||||
# Run the crawler
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_run_config)
|
||||
|
||||
print("\n📦 Crawler Hooks Result:")
|
||||
print(result)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Explanation
|
||||
### Explanation of Hooks
|
||||
|
||||
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
|
||||
- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
|
||||
- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
|
||||
- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
|
||||
- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
|
||||
- **`on_browser_created`**: Called when the browser is created. Use this to configure the browser or handle authentication (e.g., logging in and setting cookies).
|
||||
- **`on_page_context_created`**: Called when a new page context is created. Use this to apply routing, block resources, or inject custom logic before navigating to the URL.
|
||||
- **`before_goto`**: Called before navigating to the URL. Use this to add custom headers or perform other pre-navigation actions.
|
||||
- **`after_goto`**: Called after navigation. Use this to verify content or log the URL.
|
||||
- **`on_execution_started`**: Called after executing custom JavaScript. Use this to perform additional actions.
|
||||
- **`before_return_html`**: Called before returning the HTML content. Use this to log details or preprocess the content.
|
||||
|
||||
### Additional Ideas
|
||||
### Additional Customizations
|
||||
|
||||
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
|
||||
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
|
||||
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
|
||||
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
|
||||
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
|
||||
- **Resource Management**: Use `on_page_context_created` to block or modify requests (e.g., block images, fonts, or third-party scripts).
|
||||
- **Dynamic Headers**: Use `before_goto` to add or modify headers dynamically based on the URL.
|
||||
- **Authentication**: Use `on_browser_created` to handle login processes and set authentication cookies or tokens.
|
||||
- **Content Analysis**: Use `before_return_html` to analyze or modify the extracted HTML content.
|
||||
|
||||
These hooks provide powerful customization options for tailoring the crawling process to your needs.
|
||||
|
||||
By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.
|
||||
156
docs/md_v2/advanced/identity_based_crawling.md
Normal file
156
docs/md_v2/advanced/identity_based_crawling.md
Normal file
@@ -0,0 +1,156 @@
|
||||
### Preserve Your Identity with Crawl4AI
|
||||
|
||||
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
|
||||
|
||||
---
|
||||
|
||||
### Managed Browsers: Your Digital Identity Solution
|
||||
|
||||
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
|
||||
|
||||
#### Why Use Managed Browsers?
|
||||
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
|
||||
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
|
||||
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
|
||||
|
||||
#### Steps to Use Managed Browsers
|
||||
|
||||
1. **Setup the Browser Configuration**:
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Set to False for initial setup to view browser actions
|
||||
verbose=True,
|
||||
user_agent_mode="random",
|
||||
use_managed_browser=True, # Enables persistent browser sessions
|
||||
browser_type="chromium",
|
||||
user_data_dir="/path/to/user_profile_data" # Path to save session data
|
||||
)
|
||||
```
|
||||
|
||||
2. **Perform an Initial Run**:
|
||||
- Run the crawler with `headless=False`.
|
||||
- Manually interact with the site (e.g., solve CAPTCHA or log in).
|
||||
- The browser session saves cookies, local storage, and other required data.
|
||||
|
||||
3. **Subsequent Runs**:
|
||||
- Switch to `headless=True` for automation.
|
||||
- The session data is reused, allowing seamless crawling.
|
||||
|
||||
#### Example: Extracting Data Using Managed Browsers
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def main():
|
||||
# Define schema for structured data extraction
|
||||
schema = {
|
||||
"name": "Example Data",
|
||||
"baseSelector": "div.example",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1", "type": "text"},
|
||||
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
}
|
||||
|
||||
# Configure crawler
|
||||
browser_config = BrowserConfig(
|
||||
headless=True, # Automate subsequent runs
|
||||
verbose=True,
|
||||
use_managed_browser=True,
|
||||
user_data_dir="/path/to/user_profile_data"
|
||||
)
|
||||
|
||||
crawl_config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||
wait_for="css:div.example" # Wait for the targeted element to load
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=crawl_config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print("Extracted Data:", result.extracted_content)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Benefits of Managed Browsers Over Other Methods
|
||||
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
|
||||
|
||||
---
|
||||
|
||||
### Magic Mode: Simplified Automation
|
||||
|
||||
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
|
||||
|
||||
#### What Magic Mode Does:
|
||||
- Simulates human browsing by randomizing interaction patterns and timing.
|
||||
- Masks browser automation signals.
|
||||
- Handles cookie popups and modals.
|
||||
- Modifies navigator properties for enhanced compatibility.
|
||||
|
||||
#### Using Magic Mode
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True # Enables all automation features
|
||||
)
|
||||
```
|
||||
|
||||
Magic Mode is particularly useful for:
|
||||
- Quick prototyping when a Managed Browser setup is not available.
|
||||
- Basic sites requiring minimal interaction or configuration.
|
||||
|
||||
#### Example: Combining Magic Mode with Additional Options
|
||||
|
||||
```python
|
||||
async def crawl_with_magic_mode(url: str):
|
||||
async with AsyncWebCrawler(headless=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
magic=True,
|
||||
remove_overlay_elements=True, # Remove popups/modals
|
||||
page_timeout=60000 # Increased timeout for complex pages
|
||||
)
|
||||
|
||||
return result.markdown if result.success else None
|
||||
```
|
||||
|
||||
### Magic Mode vs. Managed Browsers
|
||||
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
|
||||
|
||||
---
|
||||
|
||||
### Key Comparison: Managed Browsers vs. Magic Mode
|
||||
|
||||
| Feature | **Managed Browsers** | **Magic Mode** |
|
||||
|-------------------------|------------------------------------------|-------------------------------------|
|
||||
| **Session Persistence** | Retains cookies and local storage. | No session retention. |
|
||||
| **Human Interaction** | Uses real user profiles and data. | Simulates human-like patterns. |
|
||||
| **Complex Sites** | Best suited for heavily configured sites.| Works well with simpler challenges.|
|
||||
| **Setup Complexity** | Requires initial manual interaction. | Fully automated, one-line setup. |
|
||||
|
||||
#### Recommendation:
|
||||
- Use **Managed Browsers** for reliable, session-based crawling and data extraction.
|
||||
- Use **Magic Mode** for quick prototyping or when persistent profiles are not required.
|
||||
|
||||
---
|
||||
|
||||
### Conclusion
|
||||
|
||||
- **Use Managed Browsers** to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
|
||||
- **Leverage Magic Mode** for quick automation or in scenarios where persistent user profiles are not needed.
|
||||
|
||||
By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.
|
||||
|
||||
@@ -1,84 +1,188 @@
|
||||
# Content Filtering in Crawl4AI
|
||||
# Creating Browser Instances, Contexts, and Pages
|
||||
|
||||
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
|
||||
## 1 Introduction
|
||||
|
||||
## Relevance Content Filter
|
||||
### Overview of Browser Management in Crawl4AI
|
||||
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, anti-bot measures, and session persistence for high-volume, dynamic web crawling.
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
### Key Objectives
|
||||
- **Anti-Bot Handling**:
|
||||
- Implements stealth techniques to evade detection mechanisms used by modern websites.
|
||||
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses.
|
||||
- Supports integration with third-party services to bypass CAPTCHA challenges.
|
||||
- **Persistent Sessions**:
|
||||
- Retains session data (cookies, local storage) for workflows requiring user authentication.
|
||||
- Allows seamless continuation of tasks across multiple runs without re-authentication.
|
||||
- **Scalable Crawling**:
|
||||
- Optimized resource utilization for handling thousands of URLs concurrently.
|
||||
- Flexible configuration options to tailor crawling behavior to specific requirements.
|
||||
|
||||
## BM25 Algorithm
|
||||
---
|
||||
|
||||
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
|
||||
## 2 Browser Creation Methods
|
||||
|
||||
### Usage
|
||||
### Standard Browser Creation
|
||||
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
|
||||
|
||||
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
|
||||
#### Features and Limitations
|
||||
- **Features**:
|
||||
- Quick and straightforward setup for small-scale tasks.
|
||||
- Supports headless and headful modes.
|
||||
- **Limitations**:
|
||||
- Lacks advanced customization options like session reuse.
|
||||
- May struggle with sites employing strict anti-bot measures.
|
||||
|
||||
#### Example Usage
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
|
||||
async def filter_content(url, query=None):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = BM25ContentFilter(user_query=query)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
if result.success:
|
||||
print(f"Filtered Content (JSON):\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
|
||||
print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
|
||||
else:
|
||||
print("Error:", result.error_message)
|
||||
|
||||
# Example usage:
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
browser_config = BrowserConfig(browser_type="chromium", headless=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Parameters
|
||||
### Persistent Contexts
|
||||
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
|
||||
|
||||
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
|
||||
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
|
||||
|
||||
|
||||
## Fit Markdown Flag
|
||||
|
||||
Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
|
||||
|
||||
|
||||
## Custom Content Filtering Strategies
|
||||
|
||||
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
|
||||
#### Benefits of Using `user_data_dir`
|
||||
- **Session Persistence**:
|
||||
- Stores cookies, local storage, and cache between crawling sessions.
|
||||
- Reduces overhead for repetitive logins or multi-step workflows.
|
||||
- **Enhanced Performance**:
|
||||
- Leverages pre-loaded resources for faster page loading.
|
||||
- **Flexibility**:
|
||||
- Adapts to complex workflows requiring user-specific configurations.
|
||||
|
||||
#### Example: Setting Up Persistent Contexts
|
||||
```python
|
||||
from crawl4ai.content_filter_strategy import RelevantContentFilter
|
||||
from bs4 import BeautifulSoup, Tag
|
||||
from typing import List
|
||||
|
||||
class MyCustomFilter(RelevantContentFilter):
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
# Implement custom filtering logic here
|
||||
# Example: extract all paragraphs within divs with class "article-body"
|
||||
filtered_paragraphs = []
|
||||
for tag in soup.select("div.article-body p"):
|
||||
if isinstance(tag, Tag):
|
||||
filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.
|
||||
return filtered_paragraphs
|
||||
|
||||
|
||||
|
||||
async def custom_filter_demo(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
custom_filter = MyCustomFilter()
|
||||
result = await crawler.arun(url, extraction_strategy=custom_filter)
|
||||
if result.success:
|
||||
print(result.extracted_content)
|
||||
|
||||
config = BrowserConfig(user_data_dir="/path/to/user/data")
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
|
||||
### Managed Browser
|
||||
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.
|
||||
|
||||
## Conclusion
|
||||
#### How It Works
|
||||
- **Browser Process Management**:
|
||||
- Automates initialization and cleanup of browser processes.
|
||||
- Optimizes resource usage by pooling and reusing browser instances.
|
||||
- **Debugging Support**:
|
||||
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection.
|
||||
- **Anti-Bot Measures**:
|
||||
- Implements stealth plugins to mimic real user behavior and bypass bot detection.
|
||||
|
||||
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.
|
||||
#### Features
|
||||
- **Customizable Configurations**:
|
||||
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation.
|
||||
- **Debugging and Logging**:
|
||||
- Logs detailed browser interactions for debugging and performance analysis.
|
||||
- **Scalability**:
|
||||
- Handles multiple browser instances concurrently, scaling dynamically based on workload.
|
||||
|
||||
#### Example: Using `ManagedBrowser`
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
config = BrowserConfig(headless=False, debug_port=9222)
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 3 Context and Page Management
|
||||
|
||||
### Creating and Configuring Browser Contexts
|
||||
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
|
||||
|
||||
#### Customizations
|
||||
- **Headers and Cookies**:
|
||||
- Define custom headers to mimic specific devices or browsers.
|
||||
- Set cookies for authenticated sessions.
|
||||
- **Session Reuse**:
|
||||
- Retain and reuse session data across multiple requests.
|
||||
- Example: Preserve login states for authenticated crawls.
|
||||
|
||||
#### Example: Context Initialization
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Creating Pages
|
||||
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
|
||||
|
||||
#### Key Features
|
||||
- **IFrame Handling**:
|
||||
- Extract content from embedded iframes.
|
||||
- Navigate and interact with nested content.
|
||||
- **Viewport Customization**:
|
||||
- Adjust viewport size to match target device dimensions.
|
||||
- **Lazy Loading**:
|
||||
- Ensure dynamic elements are fully loaded before extraction.
|
||||
|
||||
#### Example: Page Initialization
|
||||
```python
|
||||
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4 Advanced Features and Best Practices
|
||||
|
||||
### Debugging and Logging
|
||||
Remote debugging provides a powerful way to troubleshoot complex crawling workflows.
|
||||
|
||||
#### Example: Enabling Remote Debugging
|
||||
```python
|
||||
config = BrowserConfig(debug_port=9222)
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
```
|
||||
|
||||
### Anti-Bot Techniques
|
||||
- **Human Behavior Simulation**:
|
||||
- Mimic real user actions, such as scrolling, clicking, and typing.
|
||||
- Example: Use JavaScript to simulate interactions.
|
||||
- **Captcha Handling**:
|
||||
- Integrate with third-party services like 2Captcha or AntiCaptcha for automated solving.
|
||||
|
||||
#### Example: Simulating User Actions
|
||||
```python
|
||||
js_code = """
|
||||
(async () => {
|
||||
document.querySelector('input[name="search"]').value = 'test';
|
||||
document.querySelector('button[type="submit"]').click();
|
||||
})();
|
||||
"""
|
||||
config = CrawlerRunConfig(js_code=[js_code])
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
```
|
||||
|
||||
### Optimizations for Performance and Scalability
|
||||
- **Persistent Contexts**:
|
||||
- Reuse browser contexts to minimize resource consumption.
|
||||
- **Concurrent Crawls**:
|
||||
- Use `arun_many` with a controlled semaphore count for efficient batch processing.
|
||||
|
||||
#### Example: Scaling Crawls
|
||||
```python
|
||||
urls = ["https://example1.com", "https://example2.com"]
|
||||
config = CrawlerRunConfig(semaphore_count=10)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
results = await crawler.arun_many(urls, config=config)
|
||||
for result in results:
|
||||
print(result.url, result.markdown)
|
||||
```
|
||||
|
||||
@@ -4,59 +4,67 @@ Configure proxy settings and enhance security features in Crawl4AI for reliable
|
||||
|
||||
## Basic Proxy Setup
|
||||
|
||||
Simple proxy configuration:
|
||||
Simple proxy configuration with `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
# Using proxy URL
|
||||
async with AsyncWebCrawler(
|
||||
proxy="http://proxy.example.com:8080"
|
||||
) as crawler:
|
||||
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Using SOCKS proxy
|
||||
async with AsyncWebCrawler(
|
||||
proxy="socks5://proxy.example.com:1080"
|
||||
) as crawler:
|
||||
browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Authenticated Proxy
|
||||
|
||||
Use proxy with authentication:
|
||||
Use an authenticated proxy with `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
proxy_config = {
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
|
||||
browser_config = BrowserConfig(proxy_config=proxy_config)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Rotating Proxies
|
||||
|
||||
Example using a proxy rotation service:
|
||||
Example using a proxy rotation service and updating `BrowserConfig` dynamically:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
async def get_next_proxy():
|
||||
# Your proxy rotation logic here
|
||||
return {"server": "http://next.proxy.com:8080"}
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
browser_config = BrowserConfig()
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Update proxy for each request
|
||||
for url in urls:
|
||||
proxy = await get_next_proxy()
|
||||
crawler.update_proxy(proxy)
|
||||
result = await crawler.arun(url=url)
|
||||
browser_config.proxy_config = proxy
|
||||
result = await crawler.arun(url=url, config=browser_config)
|
||||
```
|
||||
|
||||
## Custom Headers
|
||||
|
||||
Add security-related headers:
|
||||
Add security-related headers via `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
headers = {
|
||||
"X-Forwarded-For": "203.0.113.195",
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
@@ -64,21 +72,24 @@ headers = {
|
||||
"Pragma": "no-cache"
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(headers=headers) as crawler:
|
||||
browser_config = BrowserConfig(headers=headers)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Combining with Magic Mode
|
||||
|
||||
For maximum protection, combine proxy with Magic Mode:
|
||||
For maximum protection, combine proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
proxy="http://proxy.example.com:8080",
|
||||
headers={"Accept-Language": "en-US"}
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True # Enable all anti-detection features
|
||||
)
|
||||
```
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(magic=True) # Enable all anti-detection features
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_config)
|
||||
```
|
||||
|
||||
@@ -1,44 +1,53 @@
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
### Session-Based Crawling for Dynamic Content
|
||||
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. To effectively crawl such websites, Crawl4AI provides powerful session-based crawling capabilities.
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
|
||||
|
||||
This guide will explore advanced techniques for crawling dynamic content using Crawl4AI's session management features.
|
||||
This guide explores advanced techniques for crawling dynamic content using Crawl4AI's session management features.
|
||||
|
||||
---
|
||||
|
||||
## Understanding Session-Based Crawling
|
||||
|
||||
Session-based crawling allows you to maintain a persistent browser session across multiple requests. This is crucial when:
|
||||
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
|
||||
|
||||
1. The content changes dynamically without URL changes
|
||||
2. You need to interact with the page (e.g., clicking buttons) between requests
|
||||
3. The site requires authentication or maintains state across pages
|
||||
1. **Efficient handling of dynamic content** without reloading the page.
|
||||
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling).
|
||||
3. **State maintenance** for authenticated sessions or multi-step workflows.
|
||||
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources.
|
||||
|
||||
Crawl4AI's `AsyncWebCrawler` class supports session-based crawling through the `session_id` parameter and related methods.
|
||||
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
|
||||
|
||||
---
|
||||
|
||||
## Basic Concepts
|
||||
|
||||
Before diving into examples, let's review some key concepts:
|
||||
Before diving into examples, here are some key concepts:
|
||||
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple `arun` calls to maintain state.
|
||||
- **JavaScript Execution**: Use the `js_code` parameter to execute JavaScript on the page, such as clicking a "Load More" button.
|
||||
- **CSS Selectors**: Use these to target specific elements for extraction or interaction.
|
||||
- **Extraction Strategy**: Define how to extract structured data from the page.
|
||||
- **Wait Conditions**: Specify conditions to wait for before considering the page loaded.
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
|
||||
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
|
||||
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
|
||||
- **CSS Selectors**: Target specific elements for interaction or data extraction.
|
||||
- **Extraction Strategy**: Define rules to extract structured data.
|
||||
- **Wait Conditions**: Specify conditions to wait for before proceeding.
|
||||
|
||||
---
|
||||
|
||||
## Example 1: Basic Session-Based Crawling
|
||||
|
||||
Let's start with a basic example of session-based crawling:
|
||||
A simple example using session-based crawling:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
async def basic_session_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
session_id = "my_session"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "dynamic_content_session"
|
||||
url = "https://example.com/dynamic-content"
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
|
||||
@@ -46,6 +55,7 @@ async def basic_session_crawl():
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
@@ -53,17 +63,16 @@ async def basic_session_crawl():
|
||||
asyncio.run(basic_session_crawl())
|
||||
```
|
||||
|
||||
This example demonstrates:
|
||||
1. Using a consistent `session_id` across multiple `arun` calls
|
||||
2. Executing JavaScript to load more content after the first page
|
||||
3. Using a CSS selector to extract specific content
|
||||
4. Properly closing the session after crawling
|
||||
This example shows:
|
||||
1. Reusing the same `session_id` across multiple requests.
|
||||
2. Executing JavaScript to load more content dynamically.
|
||||
3. Properly closing the session to free resources.
|
||||
|
||||
---
|
||||
|
||||
## Advanced Technique 1: Custom Execution Hooks
|
||||
|
||||
Crawl4AI allows you to set custom hooks that execute at different stages of the crawling process. This is particularly useful for handling complex loading scenarios.
|
||||
|
||||
Here's an example that waits for new content to appear before proceeding:
|
||||
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
|
||||
|
||||
```python
|
||||
async def advanced_session_crawl_with_hooks():
|
||||
@@ -75,202 +84,96 @@ async def advanced_session_crawl_with_hooks():
|
||||
while True:
|
||||
await page.wait_for_selector("li.commit-item h4")
|
||||
commit = await page.query_selector("li.commit-item h4")
|
||||
commit = await commit.evaluate("(element) => element.textContent")
|
||||
commit = commit.strip()
|
||||
commit = await commit.evaluate("(element) => element.textContent").strip()
|
||||
if commit and commit != first_commit:
|
||||
first_commit = commit
|
||||
break
|
||||
await asyncio.sleep(0.5)
|
||||
except Exception as e:
|
||||
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
|
||||
print(f"Warning: New content didn't appear: {e}")
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "commit_session"
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
session_id = "commit_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
"""
|
||||
js_next_page = """document.querySelector('a.pagination-next').click();"""
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_only=page > 0
|
||||
css_selector="li.commit-item",
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = result.extracted_content.select("li.commit-item")
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(advanced_session_crawl_with_hooks())
|
||||
```
|
||||
|
||||
This technique uses a custom `on_execution_started` hook to ensure new content has loaded before proceeding to the next step.
|
||||
This technique ensures new content loads before the next action.
|
||||
|
||||
---
|
||||
|
||||
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
|
||||
|
||||
Instead of using separate hooks, you can integrate the waiting logic directly into your JavaScript execution. This approach can be more concise and easier to manage for some scenarios.
|
||||
|
||||
Here's an example:
|
||||
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
|
||||
|
||||
```python
|
||||
async def integrated_js_and_wait_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "integrated_session"
|
||||
all_commits = []
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
|
||||
js_next_page_and_wait = """
|
||||
(async () => {
|
||||
const getCurrentCommit = () => {
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
return commits.length > 0 ? commits[0].textContent.trim() : null;
|
||||
};
|
||||
|
||||
const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
|
||||
const initialCommit = getCurrentCommit();
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
|
||||
while (true) {
|
||||
document.querySelector('a.pagination-next').click();
|
||||
while (getCurrentCommit() === initialCommit) {
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
const newCommit = getCurrentCommit();
|
||||
if (newCommit && newCommit !== initialCommit) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
})();
|
||||
"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.commit-item",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.commit-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page_and_wait if page > 0 else None,
|
||||
css_selector="li.commit-item",
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(integrated_js_and_wait_crawl())
|
||||
```
|
||||
|
||||
This approach combines the JavaScript for clicking the "next" button and waiting for new content to load into a single script.
|
||||
|
||||
## Advanced Technique 3: Using the `wait_for` Parameter
|
||||
|
||||
Crawl4AI provides a `wait_for` parameter that allows you to specify a condition to wait for before considering the page fully loaded. This can be particularly useful for dynamic content.
|
||||
|
||||
Here's an example:
|
||||
|
||||
```python
|
||||
async def wait_for_parameter_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
session_id = "wait_for_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
if (commits.length > 0) {
|
||||
window.lastCommit = commits[0].textContent.trim();
|
||||
}
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
wait_for = """() => {
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
if (commits.length === 0) return false;
|
||||
const firstCommit = commits[0].textContent.trim();
|
||||
return firstCommit !== window.lastCommit;
|
||||
}"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.commit-item",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.commit-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
wait_for=wait_for if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(wait_for_parameter_crawl())
|
||||
```
|
||||
|
||||
This technique separates the JavaScript execution (clicking the "next" button) from the waiting condition, providing more flexibility and clarity in some scenarios.
|
||||
---
|
||||
|
||||
## Best Practices for Session-Based Crawling
|
||||
|
||||
1. **Use Unique Session IDs**: Ensure each crawling session has a unique `session_id` to prevent conflicts.
|
||||
2. **Close Sessions**: Always close sessions using `kill_session` when you're done to free up resources.
|
||||
3. **Handle Errors**: Implement proper error handling to deal with unexpected situations during crawling.
|
||||
4. **Respect Website Terms**: Ensure your crawling adheres to the website's terms of service and robots.txt file.
|
||||
5. **Implement Delays**: Add appropriate delays between requests to avoid overwhelming the target server.
|
||||
6. **Use Extraction Strategies**: Leverage `JsonCssExtractionStrategy` or other extraction strategies for structured data extraction.
|
||||
7. **Optimize JavaScript**: Keep your JavaScript execution concise and efficient to improve crawling speed.
|
||||
8. **Monitor Performance**: Keep an eye on memory usage and crawling speed, especially for long-running sessions.
|
||||
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values.
|
||||
2. **Close Sessions**: Always clean up sessions with `kill_session` after use.
|
||||
3. **Error Handling**: Anticipate and handle errors gracefully.
|
||||
4. **Respect Websites**: Follow terms of service and robots.txt.
|
||||
5. **Delays**: Add delays to avoid overwhelming servers.
|
||||
6. **Optimize JavaScript**: Keep scripts concise for better performance.
|
||||
7. **Monitor Resources**: Track memory and CPU usage for long sessions.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
Session-based crawling with Crawl4AI provides powerful capabilities for handling dynamic content and complex web applications. By leveraging session management, JavaScript execution, and waiting strategies, you can effectively crawl and extract data from a wide range of modern websites.
|
||||
|
||||
Remember to use these techniques responsibly and in compliance with website policies and ethical web scraping practices.
|
||||
|
||||
For more advanced usage and API details, refer to the Crawl4AI API documentation.
|
||||
Session-based crawling in Crawl4AI is a robust solution for handling dynamic content and multi-step workflows. By combining session management, JavaScript execution, and structured extraction strategies, you can effectively navigate and extract data from modern web applications. Always adhere to ethical web scraping practices and respect website policies.
|
||||
@@ -1,74 +1,70 @@
|
||||
# Session Management
|
||||
### Session Management
|
||||
|
||||
Session management in Crawl4AI allows you to maintain state across multiple requests and handle complex multi-page crawling tasks, particularly useful for dynamic websites.
|
||||
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
|
||||
|
||||
## Basic Session Usage
|
||||
- **Performing JavaScript actions before and after crawling.**
|
||||
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly.
|
||||
|
||||
Use `session_id` to maintain state between requests:
|
||||
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
|
||||
|
||||
---
|
||||
|
||||
#### Basic Session Usage
|
||||
|
||||
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "my_session"
|
||||
|
||||
|
||||
# Define configurations
|
||||
config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
|
||||
config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)
|
||||
|
||||
# First request
|
||||
result1 = await crawler.arun(
|
||||
url="https://example.com/page1",
|
||||
session_id=session_id
|
||||
)
|
||||
|
||||
# Subsequent request using same session
|
||||
result2 = await crawler.arun(
|
||||
url="https://example.com/page2",
|
||||
session_id=session_id
|
||||
)
|
||||
|
||||
result1 = await crawler.arun(config=config1)
|
||||
|
||||
# Subsequent request using the same session
|
||||
result2 = await crawler.arun(config=config2)
|
||||
|
||||
# Clean up when done
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
## Dynamic Content with Sessions
|
||||
---
|
||||
|
||||
Here's a real-world example of crawling GitHub commits across multiple pages:
|
||||
#### Dynamic Content with Sessions
|
||||
|
||||
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
async def crawl_dynamic_content():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "github_commits_session"
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
all_commits = []
|
||||
|
||||
# Define navigation JavaScript
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
# Define wait condition
|
||||
wait_for = """() => {
|
||||
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
|
||||
if (commits.length === 0) return false;
|
||||
const firstCommit = commits[0].textContent.trim();
|
||||
return firstCommit !== window.firstCommit;
|
||||
}"""
|
||||
|
||||
# Define extraction schema
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.Box-sc-g0xbh4-0",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.markdown-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
"fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema)
|
||||
|
||||
# JavaScript and wait configurations
|
||||
js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
|
||||
wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
|
||||
|
||||
# Crawl multiple pages
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
extraction_strategy=extraction_strategy,
|
||||
@@ -78,6 +74,7 @@ async def crawl_dynamic_content():
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
result = await crawler.arun(config=config)
|
||||
if result.success:
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
@@ -88,46 +85,53 @@ async def crawl_dynamic_content():
|
||||
return all_commits
|
||||
```
|
||||
|
||||
## Session Best Practices
|
||||
---
|
||||
|
||||
1. **Session Naming**:
|
||||
```python
|
||||
# Use descriptive session IDs
|
||||
session_id = "login_flow_session"
|
||||
session_id = "product_catalog_session"
|
||||
```
|
||||
#### Session Best Practices
|
||||
|
||||
1. **Descriptive Session IDs**:
|
||||
Use meaningful names for session IDs to organize workflows:
|
||||
```python
|
||||
session_id = "login_flow_session"
|
||||
session_id = "product_catalog_session"
|
||||
```
|
||||
|
||||
2. **Resource Management**:
|
||||
```python
|
||||
try:
|
||||
# Your crawling code
|
||||
pass
|
||||
finally:
|
||||
# Always clean up sessions
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
Always ensure sessions are cleaned up to free resources:
|
||||
```python
|
||||
try:
|
||||
# Your crawling code here
|
||||
pass
|
||||
finally:
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
3. **State Management**:
|
||||
```python
|
||||
# First page: login
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/login",
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('form').submit();"
|
||||
)
|
||||
3. **State Maintenance**:
|
||||
Reuse the session for subsequent actions within the same workflow:
|
||||
```python
|
||||
# Step 1: Login
|
||||
login_config = CrawlerRunConfig(
|
||||
url="https://example.com/login",
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('form').submit();"
|
||||
)
|
||||
await crawler.arun(config=login_config)
|
||||
|
||||
# Second page: verify login success
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/dashboard",
|
||||
session_id=session_id,
|
||||
wait_for="css:.user-profile" # Wait for authenticated content
|
||||
)
|
||||
```
|
||||
# Step 2: Verify login success
|
||||
dashboard_config = CrawlerRunConfig(
|
||||
url="https://example.com/dashboard",
|
||||
session_id=session_id,
|
||||
wait_for="css:.user-profile" # Wait for authenticated content
|
||||
)
|
||||
result = await crawler.arun(config=dashboard_config)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
---
|
||||
|
||||
1. **Authentication Flows**
|
||||
2. **Pagination Handling**
|
||||
3. **Form Submissions**
|
||||
4. **Multi-step Processes**
|
||||
5. **Dynamic Content Navigation**
|
||||
#### Common Use Cases for Sessions
|
||||
|
||||
1. **Authentication Flows**: Login and interact with secured pages.
|
||||
2. **Pagination Handling**: Navigate through multiple pages.
|
||||
3. **Form Submissions**: Fill forms, submit, and process results.
|
||||
4. **Multi-step Processes**: Complete workflows that span multiple actions.
|
||||
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
|
||||
|
||||
85
docs/md_v2/api/crawl-config.md
Normal file
85
docs/md_v2/api/crawl-config.md
Normal file
@@ -0,0 +1,85 @@
|
||||
# CrawlerRunConfig Parameters Documentation
|
||||
|
||||
## Content Processing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
|
||||
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
|
||||
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
|
||||
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
|
||||
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
|
||||
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
|
||||
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
|
||||
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
|
||||
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
|
||||
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
|
||||
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
|
||||
|
||||
## Caching Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
|
||||
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
|
||||
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
|
||||
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
|
||||
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
|
||||
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |
|
||||
|
||||
## Page Navigation and Timing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
|
||||
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
|
||||
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
|
||||
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
|
||||
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
|
||||
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
|
||||
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
|
||||
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
|
||||
|
||||
## Page Interaction Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
|
||||
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
|
||||
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
|
||||
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
|
||||
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
|
||||
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
|
||||
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
|
||||
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
|
||||
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
|
||||
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
|
||||
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
|
||||
|
||||
## Media Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
|
||||
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
|
||||
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
|
||||
| `pdf` | bool | False | Whether to generate a PDF of the page |
|
||||
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
|
||||
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
|
||||
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
|
||||
|
||||
## Link and Domain Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
|
||||
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
|
||||
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
|
||||
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
|
||||
|
||||
## Debugging and Logging Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
|-----------|------|---------|-------------|
|
||||
| `verbose` | bool | True | Enable verbose logging |
|
||||
| `log_console` | bool | False | If True, log console messages from the page |
|
||||
@@ -1,7 +1,7 @@
|
||||
# Crawl4AI Cache System and Migration Guide
|
||||
|
||||
## Overview
|
||||
Starting from version X.X.X, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.
|
||||
Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.
|
||||
|
||||
## Old vs New Approach
|
||||
|
||||
@@ -45,13 +45,15 @@ if __name__ == "__main__":
|
||||
### New Code (Recommended)
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode # Import CacheMode
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def use_proxy():
|
||||
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) # Use CacheMode in CrawlerRunConfig
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
cache_mode=CacheMode.BYPASS # New way
|
||||
config=config # Pass the configuration object
|
||||
)
|
||||
print(len(result.markdown))
|
||||
|
||||
@@ -64,12 +66,12 @@ if __name__ == "__main__":
|
||||
|
||||
## Common Migration Patterns
|
||||
|
||||
Old Flag | New Mode
|
||||
---------|----------
|
||||
`bypass_cache=True` | `cache_mode=CacheMode.BYPASS`
|
||||
`disable_cache=True` | `cache_mode=CacheMode.DISABLED`
|
||||
`no_cache_read=True` | `cache_mode=CacheMode.WRITE_ONLY`
|
||||
`no_cache_write=True` | `cache_mode=CacheMode.READ_ONLY`
|
||||
| Old Flag | New Mode |
|
||||
|-----------------------|---------------------------------|
|
||||
| `bypass_cache=True` | `cache_mode=CacheMode.BYPASS` |
|
||||
| `disable_cache=True` | `cache_mode=CacheMode.DISABLED`|
|
||||
| `no_cache_read=True` | `cache_mode=CacheMode.WRITE_ONLY` |
|
||||
| `no_cache_write=True` | `cache_mode=CacheMode.READ_ONLY` |
|
||||
|
||||
## Suppressing Deprecation Warnings
|
||||
If you need time to migrate, you can temporarily suppress deprecation warnings:
|
||||
|
||||
@@ -1,68 +1,58 @@
|
||||
# Content Selection
|
||||
### Content Selection
|
||||
|
||||
Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
|
||||
|
||||
## CSS Selectors
|
||||
#### CSS Selectors
|
||||
|
||||
The simplest way to extract specific content:
|
||||
Extract specific content using a `CrawlerRunConfig` with CSS selectors:
|
||||
|
||||
```python
|
||||
# Extract specific content using CSS selector
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
css_selector=".main-article" # Target main article content
|
||||
)
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
# Multiple selectors
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
css_selector="article h1, article .content" # Target heading and content
|
||||
)
|
||||
config = CrawlerRunConfig(css_selector=".main-article") # Target main article content
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
|
||||
config = CrawlerRunConfig(css_selector="article h1, article .content") # Target heading and content
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
```
|
||||
|
||||
## Content Filtering
|
||||
#### Content Filtering
|
||||
|
||||
Control what content is included or excluded:
|
||||
Control content inclusion or exclusion with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
# Content thresholds
|
||||
config = CrawlerRunConfig(
|
||||
word_count_threshold=10, # Minimum words per block
|
||||
|
||||
# Tag exclusions
|
||||
excluded_tags=['form', 'header', 'footer', 'nav'],
|
||||
|
||||
# Link filtering
|
||||
excluded_tags=['form', 'header', 'footer', 'nav'], # Excluded tags
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
|
||||
# Media filtering
|
||||
exclude_external_images=True # Remove external images
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
```
|
||||
|
||||
## Iframe Content
|
||||
#### Iframe Content
|
||||
|
||||
Process content inside iframes:
|
||||
Process iframe content by enabling specific options in `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
process_iframes=True, # Extract iframe content
|
||||
config = CrawlerRunConfig(
|
||||
process_iframes=True, # Extract iframe content
|
||||
remove_overlay_elements=True # Remove popups/modals that might block iframes
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
```
|
||||
|
||||
## Structured Content Selection
|
||||
#### Structured Content Selection Using LLMs
|
||||
|
||||
### Using LLMs for Smart Selection
|
||||
|
||||
Use LLMs to intelligently extract specific types of content:
|
||||
Leverage LLMs for intelligent content extraction:
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
from pydantic import BaseModel
|
||||
from typing import List
|
||||
|
||||
class ArticleContent(BaseModel):
|
||||
title: str
|
||||
@@ -70,28 +60,27 @@ class ArticleContent(BaseModel):
|
||||
conclusion: str
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron", # Works with any supported LLM
|
||||
provider="ollama/nemotron",
|
||||
schema=ArticleContent.schema(),
|
||||
instruction="Extract the main article title, key points, and conclusion"
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
article = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
### Pattern-Based Selection
|
||||
#### Pattern-Based Selection
|
||||
|
||||
For repeated content patterns (like product listings, news feeds):
|
||||
Extract content matching repetitive patterns:
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "News Articles",
|
||||
"baseSelector": "article.news-item", # Repeated element
|
||||
"baseSelector": "article.news-item",
|
||||
"fields": [
|
||||
{"name": "headline", "selector": "h2", "type": "text"},
|
||||
{"name": "summary", "selector": ".summary", "type": "text"},
|
||||
@@ -108,51 +97,19 @@ schema = {
|
||||
}
|
||||
|
||||
strategy = JsonCssExtractionStrategy(schema)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
|
||||
result = await crawler.arun(url="https://crawl4ai.com", config=config)
|
||||
articles = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
## Domain-Based Filtering
|
||||
#### Comprehensive Example
|
||||
|
||||
Control content based on domains:
|
||||
Combine different selection methods using `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
exclude_domains=["ads.com", "tracker.com"],
|
||||
exclude_social_media_domains=["facebook.com", "twitter.com"], # Custom social media domains to exclude
|
||||
exclude_social_media_links=True
|
||||
)
|
||||
```
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
## Media Selection
|
||||
|
||||
Select specific types of media:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Access different media types
|
||||
images = result.media["images"] # List of image details
|
||||
videos = result.media["videos"] # List of video details
|
||||
audios = result.media["audios"] # List of audio details
|
||||
|
||||
# Image with metadata
|
||||
for image in images:
|
||||
print(f"URL: {image['src']}")
|
||||
print(f"Alt text: {image['alt']}")
|
||||
print(f"Description: {image['desc']}")
|
||||
print(f"Relevance score: {image['score']}")
|
||||
```
|
||||
|
||||
## Comprehensive Example
|
||||
|
||||
Here's how to combine different selection methods:
|
||||
|
||||
```python
|
||||
async def extract_article_content(url: str):
|
||||
# Define structured extraction
|
||||
article_schema = {
|
||||
@@ -163,37 +120,16 @@ async def extract_article_content(url: str):
|
||||
{"name": "content", "selector": ".content", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
# Define LLM extraction
|
||||
class ArticleAnalysis(BaseModel):
|
||||
key_points: List[str]
|
||||
sentiment: str
|
||||
category: str
|
||||
|
||||
# Define configuration
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(article_schema),
|
||||
word_count_threshold=10,
|
||||
excluded_tags=['nav', 'footer'],
|
||||
exclude_external_links=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Get structured content
|
||||
pattern_result = await crawler.arun(
|
||||
url=url,
|
||||
extraction_strategy=JsonCssExtractionStrategy(article_schema),
|
||||
word_count_threshold=10,
|
||||
excluded_tags=['nav', 'footer'],
|
||||
exclude_external_links=True
|
||||
)
|
||||
|
||||
# Get semantic analysis
|
||||
analysis_result = await crawler.arun(
|
||||
url=url,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
schema=ArticleAnalysis.schema(),
|
||||
instruction="Analyze the article content"
|
||||
)
|
||||
)
|
||||
|
||||
# Combine results
|
||||
return {
|
||||
"article": json.loads(pattern_result.extracted_content),
|
||||
"analysis": json.loads(analysis_result.extracted_content),
|
||||
"media": pattern_result.media
|
||||
}
|
||||
```
|
||||
result = await crawler.arun(url=url, config=config)
|
||||
return json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
@@ -1,84 +1,83 @@
|
||||
# Content Filtering in Crawl4AI
|
||||
|
||||
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
|
||||
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
|
||||
|
||||
## Relevance Content Filter
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
The `RelevanceContentFilter` is an abstract class providing a common interface for content filtering strategies. Specific algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
|
||||
## BM25 Algorithm
|
||||
## Pruning Content Filter
|
||||
|
||||
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
|
||||
The `PruningContentFilter` removes less relevant nodes based on metrics like text density, link density, and tag importance. Nodes that fall below a defined threshold are pruned, leaving only high-value content.
|
||||
|
||||
### Usage
|
||||
|
||||
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
async def filter_content(url, query=None):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = BM25ContentFilter(user_query=query)
|
||||
result = await crawler.arun(url=url, content_filter=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
if result.success:
|
||||
print(f"Filtered Content (JSON):\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
|
||||
print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
|
||||
else:
|
||||
print("Error:", result.error_message)
|
||||
config = CrawlerRunConfig(
|
||||
content_filter=PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
),
|
||||
fit_markdown=True # Activates markdown fitting
|
||||
)
|
||||
|
||||
# Example usage:
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
|
||||
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes.
|
||||
- `'dynamic'`: Adjusts thresholds based on node properties (e.g., tag importance, text/link ratios).
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold for pruning:
|
||||
- Fixed: Nodes scoring below this value are removed.
|
||||
- Dynamic: This value adjusts based on node characteristics.
|
||||
|
||||
### How It Works
|
||||
|
||||
## Fit Markdown Flag
|
||||
The algorithm evaluates each node using:
|
||||
- **Text density**: Ratio of text to overall content.
|
||||
- **Link density**: Proportion of text within links.
|
||||
- **Tag importance**: Weights based on HTML tag type (e.g., `<article>`, `<p>`, `<div>`).
|
||||
- **Content quality**: Metrics like text length and structural importance.
|
||||
|
||||
Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
|
||||
## BM25 Algorithm
|
||||
|
||||
The `BM25ContentFilter` uses the BM25 algorithm to rank and extract text chunks based on relevance to a search query or page metadata.
|
||||
|
||||
## Custom Content Filtering Strategies
|
||||
|
||||
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from crawl4ai.content_filter_strategy import RelevantContentFilter
|
||||
from bs4 import BeautifulSoup, Tag
|
||||
from typing import List
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
|
||||
class MyCustomFilter(RelevantContentFilter):
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
# Implement custom filtering logic here
|
||||
# Example: extract all paragraphs within divs with class "article-body"
|
||||
filtered_paragraphs = []
|
||||
for tag in soup.select("div.article-body p"):
|
||||
if isinstance(tag, Tag):
|
||||
filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.
|
||||
return filtered_paragraphs
|
||||
config = CrawlerRunConfig(
|
||||
content_filter=BM25ContentFilter(user_query="fruit nutrition health"),
|
||||
fit_markdown=True # Activates markdown fitting
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
|
||||
async def custom_filter_demo(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
custom_filter = MyCustomFilter()
|
||||
result = await crawler.arun(url, content_filter=custom_filter)
|
||||
if result.success:
|
||||
print(result.extracted_content)
|
||||
|
||||
if result.success:
|
||||
print(f"Filtered Content:\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}")
|
||||
print(f"\nFiltered HTML:\n{result.fit_html}")
|
||||
else:
|
||||
print("Error:", result.error_message)
|
||||
```
|
||||
|
||||
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
|
||||
### Parameters
|
||||
|
||||
## Conclusion
|
||||
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts metadata (title, description, keywords) and uses it as the query.
|
||||
- **`bm25_threshold`**: (Optional, default 1.0) Threshold controlling relevance:
|
||||
- Higher values return stricter, more relevant results.
|
||||
- Lower values include more lenient filtering.
|
||||
|
||||
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.
|
||||
|
||||
@@ -310,22 +310,6 @@ response = requests.post("http://localhost:11235/crawl", json=request)
|
||||
> **Note**: Remember to add `.env` to your `.gitignore` to keep your API keys secure!
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Usage Examples 📝
|
||||
|
||||
### Basic Crawling
|
||||
|
||||
@@ -1,124 +1,109 @@
|
||||
# Download Handling in Crawl4AI
|
||||
|
||||
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
|
||||
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
|
||||
|
||||
## Enabling Downloads
|
||||
|
||||
By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.
|
||||
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(accept_downloads=True) as crawler: # Globally enable downloads
|
||||
config = BrowserConfig(accept_downloads=True) # Enable downloads globally
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
# ... your crawling logic ...
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
Or, enable it for a specific crawl:
|
||||
Or, enable it for a specific crawl by using `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="...", accept_downloads=True)
|
||||
config = CrawlerRunConfig(accept_downloads=True)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
# ...
|
||||
```
|
||||
|
||||
## Specifying Download Location
|
||||
|
||||
You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.
|
||||
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
# ... inside your crawl function:
|
||||
|
||||
downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
|
||||
os.makedirs(downloads_path, exist_ok=True)
|
||||
|
||||
result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)
|
||||
config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
|
||||
|
||||
# ...
|
||||
```
|
||||
|
||||
If you are setting it globally, provide the path to the AsyncWebCrawler:
|
||||
```python
|
||||
async def crawl_with_downloads(url: str, download_path: str):
|
||||
async with AsyncWebCrawler(
|
||||
accept_downloads=True,
|
||||
downloads_path=download_path, # or set it on arun
|
||||
verbose=True
|
||||
) as crawler:
|
||||
result = await crawler.arun(url=url) # you still need to enable downloads per call.
|
||||
async def main():
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
# ...
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Triggering Downloads
|
||||
|
||||
Downloads are typically triggered by user interactions on a web page (e.g., clicking a download button). You can simulate these actions with the `js_code` parameter, injecting JavaScript code to be executed within the browser context. The `wait_for` parameter might also be crucial to allowing sufficient time for downloads to initiate before the crawler proceeds.
|
||||
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://www.python.org/downloads/",
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code="""
|
||||
// Find and click the first Windows installer link
|
||||
const downloadLink = document.querySelector('a[href$=".exe"]');
|
||||
if (downloadLink) {
|
||||
downloadLink.click();
|
||||
}
|
||||
""",
|
||||
wait_for=5 # Wait for 5 seconds for the download to start
|
||||
wait_for=5 # Wait 5 seconds for the download to start
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
|
||||
```
|
||||
|
||||
## Accessing Downloaded Files
|
||||
|
||||
Downloaded file paths are stored in the `downloaded_files` attribute of the returned `CrawlResult` object. This is a list of strings, with each string representing the absolute path to a downloaded file.
|
||||
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
|
||||
|
||||
```python
|
||||
if result.downloaded_files:
|
||||
print("Downloaded files:")
|
||||
for file_path in result.downloaded_files:
|
||||
print(f"- {file_path}")
|
||||
# Perform operations with downloaded files, e.g., check file size
|
||||
file_size = os.path.getsize(file_path)
|
||||
print(f"- File size: {file_size} bytes")
|
||||
else:
|
||||
print("No files downloaded.")
|
||||
```
|
||||
|
||||
|
||||
## Example: Downloading Multiple Files
|
||||
## Example: Downloading Multiple Files
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
import os
|
||||
from pathlib import Path
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def download_multiple_files(url: str, download_path: str):
|
||||
|
||||
async with AsyncWebCrawler(
|
||||
accept_downloads=True,
|
||||
downloads_path=download_path,
|
||||
verbose=True
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
run_config = CrawlerRunConfig(
|
||||
js_code="""
|
||||
// Trigger multiple downloads (example)
|
||||
const downloadLinks = document.querySelectorAll('a[download]'); // Or a more specific selector
|
||||
for (const link of downloadLinks) {
|
||||
link.click();
|
||||
await new Promise(r => setTimeout(r, 2000)); // Add a small delay between clicks if needed
|
||||
}
|
||||
const downloadLinks = document.querySelectorAll('a[download]');
|
||||
for (const link of downloadLinks) {
|
||||
link.click();
|
||||
await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
|
||||
}
|
||||
""",
|
||||
wait_for=10 # Adjust the timeout to match the expected time for all downloads to start
|
||||
wait_for=10 # Wait for all downloads to start
|
||||
)
|
||||
result = await crawler.arun(url=url, config=run_config)
|
||||
|
||||
if result.downloaded_files:
|
||||
print("Downloaded files:")
|
||||
@@ -126,23 +111,19 @@ async def download_multiple_files(url: str, download_path: str):
|
||||
print(f"- {file}")
|
||||
else:
|
||||
print("No files downloaded.")
|
||||
|
||||
|
||||
# Example usage
|
||||
# Usage
|
||||
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
|
||||
os.makedirs(download_path, exist_ok=True) # Create directory if it doesn't exist
|
||||
|
||||
os.makedirs(download_path, exist_ok=True)
|
||||
|
||||
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
|
||||
```
|
||||
|
||||
## Important Considerations
|
||||
|
||||
- **Browser Context:** Downloads are managed within the browser context. Ensure your `js_code` correctly targets the download triggers on the specific web page.
|
||||
- **Waiting:** Use `wait_for` to manage the timing of the crawl process if immediate download might not occur.
|
||||
- **Error Handling:** Implement proper error handling to gracefully manage failed downloads or incorrect file paths.
|
||||
- **Security:** Downloaded files should be scanned for potential security threats before use.
|
||||
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
|
||||
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
|
||||
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
|
||||
- **Security:** Scan downloaded files for potential security threats before use.
|
||||
|
||||
|
||||
|
||||
This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.
|
||||
This revised guide ensures consistency with the `Crawl4AI` codebase by using `BrowserConfig` and `CrawlerRunConfig` for all download-related configurations. Let me know if further adjustments are needed!
|
||||
@@ -1,6 +1,6 @@
|
||||
# Output Formats
|
||||
|
||||
Crawl4AI provides multiple output formats to suit different needs, from raw HTML to structured data using LLM or pattern-based extraction.
|
||||
Crawl4AI provides multiple output formats to suit different needs, ranging from raw HTML to structured data using LLM or pattern-based extraction, and versatile markdown outputs.
|
||||
|
||||
## Basic Formats
|
||||
|
||||
@@ -8,18 +8,20 @@ Crawl4AI provides multiple output formats to suit different needs, from raw HTML
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Access different formats
|
||||
raw_html = result.html # Original HTML
|
||||
clean_html = result.cleaned_html # Sanitized HTML
|
||||
markdown = result.markdown # Standard markdown
|
||||
fit_md = result.fit_markdown # Most relevant content in markdown
|
||||
raw_html = result.html # Original HTML
|
||||
clean_html = result.cleaned_html # Sanitized HTML
|
||||
markdown_v2 = result.markdown_v2 # Detailed markdown generation results
|
||||
fit_md = result.markdown_v2.fit_markdown # Most relevant content in markdown
|
||||
```
|
||||
|
||||
> **Note**: The `markdown_v2` property will soon be replaced by `markdown`. It is recommended to start transitioning to using `markdown` for new implementations.
|
||||
|
||||
## Raw HTML
|
||||
|
||||
Original, unmodified HTML from the webpage. Useful when you need to:
|
||||
- Preserve the exact page structure
|
||||
- Process HTML with your own tools
|
||||
- Debug page issues
|
||||
- Preserve the exact page structure.
|
||||
- Process HTML with your own tools.
|
||||
- Debug page issues.
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
@@ -29,167 +31,72 @@ print(result.html) # Complete HTML including headers, scripts, etc.
|
||||
## Cleaned HTML
|
||||
|
||||
Sanitized HTML with unnecessary elements removed. Automatically:
|
||||
- Removes scripts and styles
|
||||
- Cleans up formatting
|
||||
- Preserves semantic structure
|
||||
- Removes scripts and styles.
|
||||
- Cleans up formatting.
|
||||
- Preserves semantic structure.
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
excluded_tags=['form', 'header', 'footer'], # Additional tags to remove
|
||||
keep_data_attributes=False # Remove data-* attributes
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(result.cleaned_html)
|
||||
```
|
||||
|
||||
## Standard Markdown
|
||||
|
||||
HTML converted to clean markdown format. Great for:
|
||||
- Content analysis
|
||||
- Documentation
|
||||
- Readability
|
||||
HTML converted to clean markdown format. This output is useful for:
|
||||
- Content analysis.
|
||||
- Documentation.
|
||||
- Readability.
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
include_links_on_markdown=True # Include links in markdown
|
||||
config = CrawlerRunConfig(
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
options={"include_links": True} # Include links in markdown
|
||||
)
|
||||
)
|
||||
print(result.markdown)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(result.markdown_v2.raw_markdown) # Standard markdown with links
|
||||
```
|
||||
|
||||
## Fit Markdown
|
||||
|
||||
Most relevant content extracted and converted to markdown. Ideal for:
|
||||
- Article extraction
|
||||
- Main content focus
|
||||
- Removing boilerplate
|
||||
Extract and convert only the most relevant content into markdown format. Best suited for:
|
||||
- Article extraction.
|
||||
- Focusing on the main content.
|
||||
- Removing boilerplate.
|
||||
|
||||
To generate `fit_markdown`, use a content filter like `PruningContentFilter`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(result.fit_markdown) # Only the main content
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
content_filter=PruningContentFilter(
|
||||
threshold=0.7,
|
||||
threshold_type="dynamic",
|
||||
min_word_threshold=100
|
||||
)
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(result.markdown_v2.fit_markdown) # Extracted main content in markdown
|
||||
```
|
||||
|
||||
## Structured Data Extraction
|
||||
## Markdown with Citations
|
||||
|
||||
Crawl4AI offers two powerful approaches for structured data extraction:
|
||||
|
||||
### 1. LLM-Based Extraction
|
||||
|
||||
Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
|
||||
Generate markdown that includes citations for links. This format is ideal for:
|
||||
- Creating structured documentation.
|
||||
- Including references for extracted content.
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
class KnowledgeGraph(BaseModel):
|
||||
entities: List[dict]
|
||||
relationships: List[dict]
|
||||
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron", # or "huggingface/...", "ollama/..."
|
||||
api_token="your-token", # not needed for Ollama
|
||||
schema=KnowledgeGraph.schema(),
|
||||
instruction="Extract entities and relationships from the content"
|
||||
config = CrawlerRunConfig(
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
options={"citations": True} # Enable citations
|
||||
)
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
knowledge_graph = json.loads(result.extracted_content)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(result.markdown_v2.markdown_with_citations)
|
||||
print(result.markdown_v2.references_markdown) # Citations section
|
||||
```
|
||||
|
||||
### 2. Pattern-Based Extraction
|
||||
|
||||
For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "Product Listing",
|
||||
"baseSelector": ".product-card", # Repeated element
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"},
|
||||
{"name": "description", "selector": ".desc", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
strategy = JsonCssExtractionStrategy(schema)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
products = json.loads(result.extracted_content)
|
||||
```
|
||||
|
||||
## Content Customization
|
||||
|
||||
### HTML to Text Options
|
||||
|
||||
Configure markdown conversion:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
html2text={
|
||||
"escape_dot": False,
|
||||
"body_width": 0,
|
||||
"protect_links": True,
|
||||
"unicode_snob": True
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
### Content Filters
|
||||
|
||||
Control what content is included:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Minimum words per block
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_external_images=True, # Remove external images
|
||||
excluded_tags=['form', 'nav'] # Remove specific HTML tags
|
||||
)
|
||||
```
|
||||
|
||||
## Comprehensive Example
|
||||
|
||||
Here's how to use multiple output formats together:
|
||||
|
||||
```python
|
||||
async def crawl_content(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Extract main content with fit markdown
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=10,
|
||||
exclude_external_links=True
|
||||
)
|
||||
|
||||
# Get structured data using LLM
|
||||
llm_result = await crawler.arun(
|
||||
url=url,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
schema=YourSchema.schema(),
|
||||
instruction="Extract key information"
|
||||
)
|
||||
)
|
||||
|
||||
# Get repeated patterns (if any)
|
||||
pattern_result = await crawler.arun(
|
||||
url=url,
|
||||
extraction_strategy=JsonCssExtractionStrategy(your_schema)
|
||||
)
|
||||
|
||||
return {
|
||||
"main_content": result.fit_markdown,
|
||||
"structured_data": json.loads(llm_result.extracted_content),
|
||||
"pattern_data": json.loads(pattern_result.extracted_content),
|
||||
"media": result.media
|
||||
}
|
||||
```
|
||||
@@ -7,11 +7,13 @@ Crawl4AI provides powerful features for interacting with dynamic webpages, handl
|
||||
### Basic Execution
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
# Single JavaScript command
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);"
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Multiple commands
|
||||
js_commands = [
|
||||
@@ -19,10 +21,8 @@ js_commands = [
|
||||
"document.querySelector('.load-more').click();",
|
||||
"document.querySelector('#consent-button').click();"
|
||||
]
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code=js_commands
|
||||
)
|
||||
config = CrawlerRunConfig(js_code=js_commands)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Wait Conditions
|
||||
@@ -32,10 +32,8 @@ result = await crawler.arun(
|
||||
Wait for elements to appear:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
wait_for="css:.dynamic-content" # Wait for element with class 'dynamic-content'
|
||||
)
|
||||
config = CrawlerRunConfig(wait_for="css:.dynamic-content") # Wait for element with class 'dynamic-content'
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### JavaScript-Based Waiting
|
||||
@@ -48,10 +46,8 @@ wait_condition = """() => {
|
||||
return document.querySelectorAll('.item').length > 10;
|
||||
}"""
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
wait_for=f"js:{wait_condition}"
|
||||
)
|
||||
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Wait for dynamic content to load
|
||||
wait_for_content = """() => {
|
||||
@@ -59,10 +55,8 @@ wait_for_content = """() => {
|
||||
return content && content.innerText.length > 100;
|
||||
}"""
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
wait_for=f"js:{wait_for_content}"
|
||||
)
|
||||
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Handling Dynamic Content
|
||||
@@ -72,18 +66,14 @@ result = await crawler.arun(
|
||||
Handle infinite scroll or load more buttons:
|
||||
|
||||
```python
|
||||
# Scroll and wait pattern
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
js_code=[
|
||||
# Scroll to bottom
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
# Click load more if exists
|
||||
"const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();"
|
||||
"window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
|
||||
"const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();" # Click load more
|
||||
],
|
||||
# Wait for new content
|
||||
wait_for="js:() => document.querySelectorAll('.item').length > previousCount"
|
||||
wait_for="js:() => document.querySelectorAll('.item').length > previousCount" # Wait for new content
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Form Interaction
|
||||
@@ -92,17 +82,15 @@ Handle forms and inputs:
|
||||
|
||||
```python
|
||||
js_form_interaction = """
|
||||
// Fill form fields
|
||||
document.querySelector('#search').value = 'search term';
|
||||
// Submit form
|
||||
document.querySelector('form').submit();
|
||||
document.querySelector('#search').value = 'search term'; // Fill form fields
|
||||
document.querySelector('form').submit(); // Submit form
|
||||
"""
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
js_code=js_form_interaction,
|
||||
wait_for="css:.results" # Wait for results to load
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Timing Control
|
||||
@@ -112,11 +100,11 @@ result = await crawler.arun(
|
||||
Control timing of interactions:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
page_timeout=60000, # Page load timeout (ms)
|
||||
delay_before_return_html=2.0, # Wait before capturing content
|
||||
delay_before_return_html=2.0 # Wait before capturing content
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Complex Interactions Example
|
||||
@@ -124,43 +112,37 @@ result = await crawler.arun(
|
||||
Here's an example of handling a dynamic page with multiple interactions:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def crawl_dynamic_content():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Initial page load
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
# Handle cookie consent
|
||||
js_code="document.querySelector('.cookie-accept')?.click();",
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('.cookie-accept')?.click();", # Handle cookie consent
|
||||
wait_for="css:.main-content"
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Load more content
|
||||
session_id = "dynamic_session" # Keep session for multiple interactions
|
||||
|
||||
for page in range(3): # Load 3 pages of content
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
session_id=session_id,
|
||||
js_code=[
|
||||
# Scroll to bottom
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
# Store current item count
|
||||
"window.previousCount = document.querySelectorAll('.item').length;",
|
||||
# Click load more
|
||||
"document.querySelector('.load-more')?.click();"
|
||||
"window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
|
||||
"window.previousCount = document.querySelectorAll('.item').length;", # Store item count
|
||||
"document.querySelector('.load-more')?.click();" # Click load more
|
||||
],
|
||||
# Wait for new items
|
||||
wait_for="""() => {
|
||||
const currentCount = document.querySelectorAll('.item').length;
|
||||
return currentCount > window.previousCount;
|
||||
}""",
|
||||
# Only execute JS without reloading page
|
||||
js_only=True if page > 0 else False
|
||||
js_only=(page > 0) # Execute JS without reloading page for subsequent interactions
|
||||
)
|
||||
|
||||
# Process content after each load
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(f"Page {page + 1} items:", len(result.cleaned_html))
|
||||
|
||||
|
||||
# Clean up session
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
@@ -171,6 +153,7 @@ Combine page interaction with structured extraction:
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
# Pattern-based extraction after interaction
|
||||
schema = {
|
||||
@@ -182,20 +165,19 @@ schema = {
|
||||
]
|
||||
}
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="css:.item:nth-child(10)", # Wait for 10 items
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema)
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Or use LLM to analyze dynamic content
|
||||
class ContentAnalysis(BaseModel):
|
||||
topics: List[str]
|
||||
summary: str
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('.show-more').click();",
|
||||
wait_for="css:.full-content",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
@@ -204,4 +186,5 @@ result = await crawler.arun(
|
||||
instruction="Analyze the full content"
|
||||
)
|
||||
)
|
||||
```
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
@@ -2,31 +2,19 @@
|
||||
|
||||
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
|
||||
|
||||
## Table of Contents
|
||||
- [Prefix-Based Input Handling in Crawl4AI](#prefix-based-input-handling-in-crawl4ai)
|
||||
- [Table of Contents](#table-of-contents)
|
||||
- [Crawling a Web URL](#crawling-a-web-url)
|
||||
- [Crawling a Local HTML File](#crawling-a-local-html-file)
|
||||
- [Crawling Raw HTML Content](#crawling-raw-html-content)
|
||||
- [Complete Example](#complete-example)
|
||||
- [**How It Works**](#how-it-works)
|
||||
- [**Running the Example**](#running-the-example)
|
||||
- [Conclusion](#conclusion)
|
||||
## Crawling a Web URL
|
||||
|
||||
---
|
||||
|
||||
|
||||
### Crawling a Web URL
|
||||
|
||||
To crawl a live web page, provide the URL starting with `http://` or `https://`.
|
||||
To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_web():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
|
||||
if result.success:
|
||||
print("Markdown Content:")
|
||||
print(result.markdown)
|
||||
@@ -36,20 +24,22 @@ async def crawl_web():
|
||||
asyncio.run(crawl_web())
|
||||
```
|
||||
|
||||
### Crawling a Local HTML File
|
||||
## Crawling a Local HTML File
|
||||
|
||||
To crawl a local HTML file, prefix the file path with `file://`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_local_file():
|
||||
local_file_path = "/path/to/apple.html" # Replace with your file path
|
||||
file_url = f"file://{local_file_path}"
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url=file_url, bypass_cache=True)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=file_url, config=config)
|
||||
if result.success:
|
||||
print("Markdown Content from Local File:")
|
||||
print(result.markdown)
|
||||
@@ -59,20 +49,22 @@ async def crawl_local_file():
|
||||
asyncio.run(crawl_local_file())
|
||||
```
|
||||
|
||||
### Crawling Raw HTML Content
|
||||
## Crawling Raw HTML Content
|
||||
|
||||
To crawl raw HTML content, prefix the HTML string with `raw:`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_raw_html():
|
||||
raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
|
||||
raw_html_url = f"raw:{raw_html}"
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url=raw_html_url, bypass_cache=True)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=raw_html_url, config=config)
|
||||
if result.success:
|
||||
print("Markdown Content from Raw HTML:")
|
||||
print(result.markdown)
|
||||
@@ -84,152 +76,83 @@ asyncio.run(crawl_raw_html())
|
||||
|
||||
---
|
||||
|
||||
## Complete Example
|
||||
# Complete Example
|
||||
|
||||
Below is a comprehensive script that:
|
||||
1. **Crawls the Wikipedia page for "Apple".**
|
||||
2. **Saves the HTML content to a local file (`apple.html`).**
|
||||
3. **Crawls the local HTML file and verifies the markdown length matches the original crawl.**
|
||||
4. **Crawls the raw HTML content from the saved file and verifies consistency.**
|
||||
|
||||
1. Crawls the Wikipedia page for "Apple."
|
||||
2. Saves the HTML content to a local file (`apple.html`).
|
||||
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
|
||||
4. Crawls the raw HTML content from the saved file and verifies consistency.
|
||||
|
||||
```python
|
||||
import os
|
||||
import sys
|
||||
import asyncio
|
||||
from pathlib import Path
|
||||
|
||||
# Adjust the parent directory to include the crawl4ai module
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
sys.path.append(parent_dir)
|
||||
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
# Define the URL to crawl
|
||||
wikipedia_url = "https://en.wikipedia.org/wiki/apple"
|
||||
|
||||
# Define the path to save the HTML file
|
||||
# Save the file in the same directory as the script
|
||||
script_dir = Path(__file__).parent
|
||||
html_file_path = script_dir / "apple.html"
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Step 1: Crawl the Web URL
|
||||
print("\n=== Step 1: Crawling the Wikipedia URL ===")
|
||||
# Crawl the Wikipedia URL
|
||||
result = await crawler.arun(url=wikipedia_url, bypass_cache=True)
|
||||
|
||||
# Check if crawling was successful
|
||||
web_config = CrawlerRunConfig(bypass_cache=True)
|
||||
result = await crawler.arun(url=wikipedia_url, config=web_config)
|
||||
|
||||
if not result.success:
|
||||
print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
|
||||
return
|
||||
|
||||
# Save the HTML content to a local file
|
||||
|
||||
with open(html_file_path, 'w', encoding='utf-8') as f:
|
||||
f.write(result.html)
|
||||
print(f"Saved HTML content to {html_file_path}")
|
||||
|
||||
# Store the length of the generated markdown
|
||||
web_crawl_length = len(result.markdown)
|
||||
print(f"Length of markdown from web crawl: {web_crawl_length}\n")
|
||||
|
||||
|
||||
# Step 2: Crawl from the Local HTML File
|
||||
print("=== Step 2: Crawling from the Local HTML File ===")
|
||||
# Construct the file URL with 'file://' prefix
|
||||
file_url = f"file://{html_file_path.resolve()}"
|
||||
|
||||
# Crawl the local HTML file
|
||||
local_result = await crawler.arun(url=file_url, bypass_cache=True)
|
||||
|
||||
# Check if crawling was successful
|
||||
file_config = CrawlerRunConfig(bypass_cache=True)
|
||||
local_result = await crawler.arun(url=file_url, config=file_config)
|
||||
|
||||
if not local_result.success:
|
||||
print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
|
||||
return
|
||||
|
||||
# Store the length of the generated markdown from local file
|
||||
|
||||
local_crawl_length = len(local_result.markdown)
|
||||
print(f"Length of markdown from local file crawl: {local_crawl_length}")
|
||||
|
||||
# Compare the lengths
|
||||
assert web_crawl_length == local_crawl_length, (
|
||||
f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
|
||||
)
|
||||
print("✅ Markdown length matches between web crawl and local file crawl.\n")
|
||||
|
||||
assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
|
||||
print("✅ Markdown length matches between web and local file crawl.\n")
|
||||
|
||||
# Step 3: Crawl Using Raw HTML Content
|
||||
print("=== Step 3: Crawling Using Raw HTML Content ===")
|
||||
# Read the HTML content from the saved file
|
||||
with open(html_file_path, 'r', encoding='utf-8') as f:
|
||||
raw_html_content = f.read()
|
||||
|
||||
# Prefix the raw HTML content with 'raw:'
|
||||
raw_html_url = f"raw:{raw_html_content}"
|
||||
|
||||
# Crawl using the raw HTML content
|
||||
raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)
|
||||
|
||||
# Check if crawling was successful
|
||||
raw_config = CrawlerRunConfig(bypass_cache=True)
|
||||
raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
|
||||
|
||||
if not raw_result.success:
|
||||
print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
|
||||
return
|
||||
|
||||
# Store the length of the generated markdown from raw HTML
|
||||
|
||||
raw_crawl_length = len(raw_result.markdown)
|
||||
print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")
|
||||
|
||||
# Compare the lengths
|
||||
assert web_crawl_length == raw_crawl_length, (
|
||||
f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
|
||||
)
|
||||
print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")
|
||||
|
||||
assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
|
||||
print("✅ Markdown length matches between web and raw HTML crawl.\n")
|
||||
|
||||
print("All tests passed successfully!")
|
||||
|
||||
# Clean up by removing the saved HTML file
|
||||
if html_file_path.exists():
|
||||
os.remove(html_file_path)
|
||||
print(f"Removed the saved HTML file: {html_file_path}")
|
||||
|
||||
# Run the main function
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### **How It Works**
|
||||
|
||||
1. **Step 1: Crawl the Web URL**
|
||||
- Crawls `https://en.wikipedia.org/wiki/apple`.
|
||||
- Saves the HTML content to `apple.html`.
|
||||
- Records the length of the generated markdown.
|
||||
|
||||
2. **Step 2: Crawl from the Local HTML File**
|
||||
- Uses the `file://` prefix to crawl `apple.html`.
|
||||
- Ensures the markdown length matches the original web crawl.
|
||||
|
||||
3. **Step 3: Crawl Using Raw HTML Content**
|
||||
- Reads the HTML from `apple.html`.
|
||||
- Prefixes it with `raw:` and crawls.
|
||||
- Verifies the markdown length matches the previous results.
|
||||
|
||||
4. **Cleanup**
|
||||
- Deletes the `apple.html` file after testing.
|
||||
|
||||
### **Running the Example**
|
||||
|
||||
1. **Save the Script:**
|
||||
- Save the above code as `test_crawl4ai.py` in your project directory.
|
||||
|
||||
2. **Execute the Script:**
|
||||
- Run the script using:
|
||||
```bash
|
||||
python test_crawl4ai.py
|
||||
```
|
||||
|
||||
3. **Observe the Output:**
|
||||
- The script will print logs detailing each step.
|
||||
- Assertions ensure consistency across different crawling methods.
|
||||
- Upon success, it confirms that all markdown lengths match.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
With the new prefix-based input handling in **Crawl4AI**, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings using a unified `url` parameter. This enhancement simplifies the API usage and provides greater flexibility for diverse crawling scenarios.
|
||||
# Conclusion
|
||||
|
||||
With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.
|
||||
@@ -1,49 +1,66 @@
|
||||
# Quick Start Guide 🚀
|
||||
|
||||
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI with a friendly and humorous tone. We'll cover everything from basic usage to advanced features like chunking and extraction strategies, all with the power of asynchronous programming. Let's dive in! 🌟
|
||||
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟
|
||||
|
||||
---
|
||||
|
||||
## Getting Started 🛠️
|
||||
|
||||
First, let's import the necessary modules and create an instance of `AsyncWebCrawler`. We'll use an async context manager, which handles the setup and teardown of the crawler for us.
|
||||
Set up your environment with `BrowserConfig` and create an `AsyncWebCrawler` instance.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CasheMode
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# We'll add our crawling code here
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Add your crawling logic here
|
||||
pass
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Basic Usage
|
||||
|
||||
Simply provide a URL and let Crawl4AI do the magic!
|
||||
Provide a URL and let Crawl4AI do the work!
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(config=crawl_config)
|
||||
print(f"Basic crawl result: {result.markdown[:500]}") # Print first 500 characters
|
||||
|
||||
asyncio.run(main())
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Taking Screenshots 📸
|
||||
|
||||
Capture screenshots of web pages easily:
|
||||
Capture and save webpage screenshots with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CacheMode
|
||||
|
||||
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
screenshot=True,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
crawl_config = CrawlerRunConfig(
|
||||
url=url,
|
||||
screenshot=True,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(config=crawl_config)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
@@ -55,243 +72,101 @@ async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
print("Failed to capture screenshot")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Browser Selection 🌐
|
||||
|
||||
Crawl4AI supports multiple browser engines. Here's how to use different browsers:
|
||||
Choose from multiple browser engines using `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
# Use Firefox
|
||||
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
|
||||
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
|
||||
async with AsyncWebCrawler(config=firefox_config) as crawler:
|
||||
result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
|
||||
|
||||
# Use WebKit
|
||||
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
|
||||
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
|
||||
async with AsyncWebCrawler(config=webkit_config) as crawler:
|
||||
result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
|
||||
|
||||
# Use Chromium (default)
|
||||
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", cache_mode=CacheMode.BYPASS)
|
||||
chromium_config = BrowserConfig(verbose=True, headless=True)
|
||||
async with AsyncWebCrawler(config=chromium_config) as crawler:
|
||||
result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### User Simulation 🎭
|
||||
|
||||
Simulate real user behavior to avoid detection:
|
||||
Simulate real user behavior to bypass detection:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="YOUR-URL-HERE",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
simulate_user=True, # Causes random mouse movements and clicks
|
||||
override_navigator=True # Makes the browser appear more like a real user
|
||||
)
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
browser_config = BrowserConfig(verbose=True, headless=True)
|
||||
crawl_config = CrawlerRunConfig(
|
||||
url="YOUR-URL-HERE",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
simulate_user=True, # Random mouse movements and clicks
|
||||
override_navigator=True # Makes the browser appear like a real user
|
||||
)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(config=crawl_config)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Understanding Parameters 🧠
|
||||
|
||||
By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.
|
||||
Explore caching and forcing fresh crawls:
|
||||
|
||||
```python
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# First crawl (caches the result)
|
||||
result1 = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# First crawl (uses cache)
|
||||
result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
|
||||
print(f"First crawl result: {result1.markdown[:100]}...")
|
||||
|
||||
# Force to crawl again
|
||||
result2 = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
|
||||
# Force fresh crawl
|
||||
result2 = await crawler.arun(
|
||||
config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
|
||||
)
|
||||
print(f"Second crawl result: {result2.markdown[:100]}...")
|
||||
|
||||
asyncio.run(main())
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Adding a Chunking Strategy 🧩
|
||||
|
||||
Let's add a chunking strategy: `RegexChunking`! This strategy splits the text based on a given regex pattern.
|
||||
Split content into chunks using `RegexChunking`:
|
||||
|
||||
```python
|
||||
from crawl4ai.chunking_strategy import RegexChunking
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
crawl_config = CrawlerRunConfig(
|
||||
url="https://www.nbcnews.com/business",
|
||||
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
||||
)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(config=crawl_config)
|
||||
print(f"RegexChunking result: {result.extracted_content[:200]}...")
|
||||
|
||||
asyncio.run(main())
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Using LLMExtractionStrategy with Different Providers 🤖
|
||||
---
|
||||
|
||||
Crawl4AI supports multiple LLM providers for extraction:
|
||||
### Advanced Features and Configurations
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
|
||||
|
||||
# OpenAI
|
||||
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||
|
||||
# Hugging Face
|
||||
await extract_structured_data_using_llm(
|
||||
"huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
|
||||
os.getenv("HUGGINGFACE_API_KEY")
|
||||
)
|
||||
|
||||
# Ollama
|
||||
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
|
||||
# With custom headers
|
||||
custom_headers = {
|
||||
"Authorization": "Bearer your-custom-token",
|
||||
"X-Custom-Header": "Some-Value"
|
||||
}
|
||||
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||
```
|
||||
|
||||
### Knowledge Graph Generation 🕸️
|
||||
|
||||
Generate knowledge graphs from web content:
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from typing import List
|
||||
|
||||
class Entity(BaseModel):
|
||||
name: str
|
||||
description: str
|
||||
|
||||
class Relationship(BaseModel):
|
||||
entity1: Entity
|
||||
entity2: Entity
|
||||
description: str
|
||||
relation_type: str
|
||||
|
||||
class KnowledgeGraph(BaseModel):
|
||||
entities: List[Entity]
|
||||
relationships: List[Relationship]
|
||||
|
||||
extraction_strategy = LLMExtractionStrategy(
|
||||
provider='openai/gpt-4o-mini',
|
||||
api_token=os.getenv('OPENAI_API_KEY'),
|
||||
schema=KnowledgeGraph.model_json_schema(),
|
||||
extraction_type="schema",
|
||||
instruction="Extract entities and relationships from the given text."
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://paulgraham.com/love.html",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
extraction_strategy=extraction_strategy
|
||||
)
|
||||
```
|
||||
|
||||
### Advanced Session-Based Crawling with Dynamic Content 🔄
|
||||
|
||||
For modern web applications with dynamic content loading, here's how to handle pagination and content updates:
|
||||
|
||||
```python
|
||||
async def crawl_dynamic_content():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
wait_for = """() => {
|
||||
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
|
||||
if (commits.length === 0) return false;
|
||||
const firstCommit = commits[0].textContent.trim();
|
||||
return firstCommit !== window.firstCommit;
|
||||
}"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.Box-sc-g0xbh4-0",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.markdown-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
for page in range(3): # Crawl 3 pages
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.Box-sc-g0xbh4-0",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
wait_for=wait_for if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
headless=False,
|
||||
)
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
### Handling Overlays and Fitting Content 📏
|
||||
|
||||
Remove overlay elements and fit content appropriately:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(headless=False) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="your-url-here",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
word_count_threshold=10,
|
||||
remove_overlay_elements=True,
|
||||
screenshot=True
|
||||
)
|
||||
```
|
||||
|
||||
## Performance Comparison 🏎️
|
||||
|
||||
Crawl4AI offers impressive performance compared to other solutions:
|
||||
|
||||
```python
|
||||
# Firecrawl comparison
|
||||
from firecrawl import FirecrawlApp
|
||||
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
|
||||
start = time.time()
|
||||
scrape_status = app.scrape_url(
|
||||
'https://www.nbcnews.com/business',
|
||||
params={'formats': ['markdown', 'html']}
|
||||
)
|
||||
end = time.time()
|
||||
|
||||
# Crawl4AI comparison
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
start = time.time()
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
word_count_threshold=0,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
verbose=False,
|
||||
)
|
||||
end = time.time()
|
||||
```
|
||||
|
||||
Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.
|
||||
|
||||
## Congratulations! 🎉
|
||||
|
||||
You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️
|
||||
|
||||
Happy crawling! 🚀
|
||||
For advanced examples (LLM strategies, knowledge graphs, pagination handling), ensure all code aligns with the `BrowserConfig` and `CrawlerRunConfig` pattern shown above.
|
||||
|
||||
@@ -4,16 +4,21 @@ This guide covers the basics of web crawling with Crawl4AI. You'll learn how to
|
||||
|
||||
## Basic Usage
|
||||
|
||||
Here's the simplest way to crawl a webpage:
|
||||
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
browser_config = BrowserConfig() # Default browser configuration
|
||||
run_config = CrawlerRunConfig() # Default crawl run configuration
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com"
|
||||
url="https://example.com",
|
||||
config=run_config
|
||||
)
|
||||
print(result.markdown) # Print clean markdown content
|
||||
|
||||
@@ -26,7 +31,10 @@ if __name__ == "__main__":
|
||||
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com", fit_markdown=True)
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=CrawlerRunConfig(fit_markdown=True)
|
||||
)
|
||||
|
||||
# Different content formats
|
||||
print(result.html) # Raw HTML
|
||||
@@ -45,16 +53,20 @@ print(result.links) # Dictionary of internal and external links
|
||||
|
||||
## Adding Basic Options
|
||||
|
||||
Customize your crawl with these common options:
|
||||
Customize your crawl using `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
run_config = CrawlerRunConfig(
|
||||
word_count_threshold=10, # Minimum words per content block
|
||||
exclude_external_links=True, # Remove external links
|
||||
remove_overlay_elements=True, # Remove popups/modals
|
||||
process_iframes=True # Process iframe content
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=run_config
|
||||
)
|
||||
```
|
||||
|
||||
## Handling Errors
|
||||
@@ -62,7 +74,9 @@ result = await crawler.arun(
|
||||
Always check if the crawl was successful:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
run_config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=run_config)
|
||||
|
||||
if not result.success:
|
||||
print(f"Crawl failed: {result.error_message}")
|
||||
print(f"Status code: {result.status_code}")
|
||||
@@ -70,36 +84,45 @@ if not result.success:
|
||||
|
||||
## Logging and Debugging
|
||||
|
||||
Enable verbose mode for detailed logging:
|
||||
Enable verbose logging in `BrowserConfig`:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
run_config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=run_config)
|
||||
```
|
||||
|
||||
## Complete Example
|
||||
|
||||
Here's a more comprehensive example showing common usage patterns:
|
||||
Here's a more comprehensive example demonstrating common usage patterns:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
browser_config = BrowserConfig(verbose=True)
|
||||
run_config = CrawlerRunConfig(
|
||||
# Content filtering
|
||||
word_count_threshold=10,
|
||||
excluded_tags=['form', 'header'],
|
||||
exclude_external_links=True,
|
||||
|
||||
# Content processing
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
|
||||
# Cache control
|
||||
cache_mode=CacheMode.ENABLED # Use cache if available
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
# Content filtering
|
||||
word_count_threshold=10,
|
||||
excluded_tags=['form', 'header'],
|
||||
exclude_external_links=True,
|
||||
|
||||
# Content processing
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
|
||||
# Cache control
|
||||
cache_mode=CacheMode.ENABLE # Use cache if available
|
||||
config=run_config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
|
||||
46
docs/md_v2/blog/articles/dockerize_hooks.md
Normal file
46
docs/md_v2/blog/articles/dockerize_hooks.md
Normal file
@@ -0,0 +1,46 @@
|
||||
## Introducing Event Streams and Interactive Hooks in Crawl4AI
|
||||
|
||||

|
||||
|
||||
In the near future, I’m planning to enhance Crawl4AI’s capabilities by introducing an event stream mechanism that will give clients deeper, real-time insights into the crawling process. Today, hooks are a powerful feature at the code level—they let developers define custom logic at key points in the crawl. However, when using Crawl4AI as a service (e.g., through a Dockerized API), there isn’t an easy way to interact with these hooks at runtime.
|
||||
|
||||
**What’s Changing?**
|
||||
|
||||
I’m working on a solution that will allow the crawler to emit a continuous stream of events, updating clients on the current crawling stage, encountered pages, and any decision points. This event stream could be exposed over a standardized protocol like Server-Sent Events (SSE) or WebSockets, enabling clients to “subscribe” and listen as the crawler works.
|
||||
|
||||
**Interactivity Through Process IDs**
|
||||
|
||||
A key part of this new design is the concept of a unique process ID for each crawl session. Imagine you’re listening to an event stream that informs you:
|
||||
- The crawler just hit a certain page
|
||||
- It triggered a hook and is now pausing for instructions
|
||||
|
||||
With the event stream in place, you can send a follow-up request back to the server—referencing the unique process ID—to provide extra data, instructions, or parameters. This might include selecting which links to follow next, adjusting extraction strategies, or providing authentication tokens for a protected API. Once the crawler receives these instructions, it resumes execution with the updated context.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant Server
|
||||
participant Crawler
|
||||
|
||||
Client->>Server: Start crawl request
|
||||
Server->>Crawler: Initiate crawl with Process ID
|
||||
Crawler-->>Server: Event: Page hit
|
||||
Server-->>Client: Stream: Page hit event
|
||||
Client->>Server: Instruction for Process ID
|
||||
Server->>Crawler: Update crawl with new instructions
|
||||
Crawler-->>Server: Event: Crawl completed
|
||||
Server-->>Client: Stream: Crawl completed
|
||||
```
|
||||
|
||||
**Benefits for Developers and Users**
|
||||
|
||||
1. **Fine-Grained Control**: Instead of predefining all logic upfront, you can dynamically guide the crawler in response to actual data and conditions encountered mid-crawl.
|
||||
2. **Real-Time Insights**: Monitor progress, errors, or network bottlenecks as they happen, without waiting for the entire crawl to finish.
|
||||
3. **Enhanced Collaboration**: Different team members or automated systems can watch the same crawl events and provide input, making the crawling process more adaptive and intelligent.
|
||||
|
||||
**Next Steps**
|
||||
|
||||
I’m currently exploring the best APIs, technologies, and patterns to make this vision a reality. My goal is to deliver a seamless developer experience—one that integrates with existing Crawl4AI workflows while offering new flexibility and power.
|
||||
|
||||
Stay tuned for more updates as I continue building this feature out. In the meantime, I’d love to hear any feedback or suggestions you might have to help shape this interactive, event-driven future of web crawling with Crawl4AI.
|
||||
|
||||
47
docs/md_v2/blog/index.md
Normal file
47
docs/md_v2/blog/index.md
Normal file
@@ -0,0 +1,47 @@
|
||||
# Crawl4AI Blog
|
||||
|
||||
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
|
||||
|
||||
## Latest Release
|
||||
|
||||
### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
|
||||
*December 12, 2024*
|
||||
|
||||
The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
|
||||
|
||||
[Read full release notes →](releases/0.4.2.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
|
||||
*December 8, 2024*
|
||||
|
||||
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
|
||||
|
||||
[Read full release notes →](releases/0.4.1.md)
|
||||
|
||||
---
|
||||
|
||||
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
|
||||
*December 1, 2024*
|
||||
|
||||
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
|
||||
|
||||
[Read full release notes →](releases/0.4.0.md)
|
||||
|
||||
## Project History
|
||||
|
||||
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
|
||||
|
||||
## Categories
|
||||
|
||||
- [Technical Deep Dives](/blog/technical) - Coming soon
|
||||
- [Tutorials & Guides](/blog/tutorials) - Coming soon
|
||||
- [Community Updates](/blog/community) - Coming soon
|
||||
|
||||
## Stay Updated
|
||||
|
||||
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
|
||||
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
|
||||
- Join our community discussions on GitHub
|
||||
|
||||
62
docs/md_v2/blog/releases/0.4.0.md
Normal file
62
docs/md_v2/blog/releases/0.4.0.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# Release Summary for Version 0.4.0 (December 1, 2024)
|
||||
|
||||
## Overview
|
||||
The 0.4.0 release introduces significant improvements to content filtering, multi-threaded environment handling, user-agent generation, and test coverage. Key highlights include the introduction of the PruningContentFilter, designed to automatically identify and extract the most valuable parts of an HTML document, as well as enhancements to the BM25ContentFilter to extend its versatility and effectiveness.
|
||||
|
||||
## Major Features and Enhancements
|
||||
|
||||
### 1. PruningContentFilter
|
||||
- Introduced a new unsupervised content filtering strategy that scores and prunes less relevant nodes in an HTML document based on metrics like text and link density.
|
||||
- Focuses on retaining the most valuable parts of the content, making it highly effective for extracting relevant information from complex web pages.
|
||||
- Fully documented with updated README and expanded user guides.
|
||||
|
||||
### 2. User-Agent Generator
|
||||
- Added a user-agent generator utility that resolves compatibility issues and supports customizable user-agent strings.
|
||||
- By default, the generator randomizes user agents for each request, adding diversity, but users can customize it for tailored scenarios.
|
||||
|
||||
### 3. Enhanced Thread Safety
|
||||
- Improved handling of multi-threaded environments by adding better thread locks for parallel processing, ensuring consistency and stability when running multiple threads.
|
||||
|
||||
### 4. Extended Content Filtering Strategies
|
||||
- Users now have access to both the PruningContentFilter for unsupervised extraction and the BM25ContentFilter for supervised filtering based on user queries.
|
||||
- Enhanced BM25ContentFilter with improved capabilities to process page titles, meta tags, and descriptions, allowing for more effective classification and clustering of text chunks.
|
||||
|
||||
### 5. Documentation Updates
|
||||
- Updated examples and tutorials to promote the use of the PruningContentFilter alongside the BM25ContentFilter, providing clear instructions for selecting the appropriate filter for each use case.
|
||||
|
||||
### 6. Unit Test Enhancements
|
||||
- Added unit tests for PruningContentFilter to ensure accuracy and reliability.
|
||||
- Enhanced BM25ContentFilter tests to cover additional edge cases and performance metrics, particularly for malformed HTML inputs.
|
||||
|
||||
## Revised Change Logs for Version 0.4.0
|
||||
|
||||
### PruningContentFilter (Dec 01, 2024)
|
||||
- Introduced the PruningContentFilter to optimize content extraction by pruning less relevant HTML nodes.
|
||||
- **Affected Files:**
|
||||
- **crawl4ai/content_filter_strategy.py**: Added a scoring-based pruning algorithm.
|
||||
- **README.md**: Updated to include PruningContentFilter usage.
|
||||
- **docs/md_v2/basic/content_filtering.md**: Expanded user documentation, detailing the use and benefits of PruningContentFilter.
|
||||
|
||||
### Unit Tests for PruningContentFilter (Dec 01, 2024)
|
||||
- Added comprehensive unit tests for PruningContentFilter to ensure correctness and efficiency.
|
||||
- **Affected Files:**
|
||||
- **tests/async/test_content_filter_prune.py**: Created tests covering different pruning scenarios to ensure stability and correctness.
|
||||
|
||||
### Enhanced BM25ContentFilter Tests (Dec 01, 2024)
|
||||
- Expanded tests to cover additional extraction scenarios and performance metrics, improving robustness.
|
||||
- **Affected Files:**
|
||||
- **tests/async/test_content_filter_bm25.py**: Added tests for edge cases, including malformed HTML inputs.
|
||||
|
||||
### Documentation and Example Updates (Dec 01, 2024)
|
||||
- Revised examples to illustrate the use of PruningContentFilter alongside existing content filtering methods.
|
||||
- **Affected Files:**
|
||||
- **docs/examples/quickstart_async.py**: Enhanced example clarity and usability for new users.
|
||||
|
||||
## Experimental Features
|
||||
- The PruningContentFilter is still under experimental development, and we continue to gather feedback for further refinements.
|
||||
|
||||
## Conclusion
|
||||
This release significantly enhances the content extraction capabilities of Crawl4ai with the introduction of the PruningContentFilter, improved supervised filtering with BM25ContentFilter, and robust multi-threaded handling. Additionally, the user-agent generator provides much-needed versatility, resolving compatibility issues faced by many users.
|
||||
|
||||
Users are encouraged to experiment with the new content filtering methods to determine which best suits their needs.
|
||||
|
||||
145
docs/md_v2/blog/releases/0.4.1.md
Normal file
145
docs/md_v2/blog/releases/0.4.1.md
Normal file
@@ -0,0 +1,145 @@
|
||||
# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!
|
||||
|
||||
_This post was generated with the help of ChatGPT, take everything with a grain of salt. 🧂_
|
||||
|
||||
Hi everyone,
|
||||
|
||||
I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.
|
||||
|
||||
---
|
||||
|
||||
### Handling Lazy Loading Better (Images Included)
|
||||
|
||||
One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.
|
||||
|
||||
Here’s how to enable it:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
wait_for_images=True # Add this argument to ensure images are fully loaded
|
||||
)
|
||||
```
|
||||
|
||||
What this does is:
|
||||
1. Waits for the page to reach a "network idle" state.
|
||||
2. Ensures all images on the page have been completely loaded.
|
||||
|
||||
This single change handles the majority of lazy-loading cases you’re likely to encounter.
|
||||
|
||||
---
|
||||
|
||||
### Text-Only Mode (Fast, Lightweight Crawling)
|
||||
|
||||
Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.
|
||||
|
||||
Here’s how to turn it on:
|
||||
|
||||
```python
|
||||
crawler = AsyncPlaywrightCrawlerStrategy(
|
||||
text_mode=True # Set this to True to enable text-only crawling
|
||||
)
|
||||
```
|
||||
|
||||
When `text_mode=True`, the crawler automatically:
|
||||
- Disables GPU processing.
|
||||
- Blocks image and JavaScript resources.
|
||||
- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`).
|
||||
|
||||
If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
|
||||
|
||||
---
|
||||
|
||||
### Adjusting the Viewport Dynamically
|
||||
|
||||
Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.
|
||||
|
||||
Here’s how it works:
|
||||
1. The crawler calculates the page’s width and height after it loads.
|
||||
2. It adjusts the viewport to fit the content dimensions.
|
||||
3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.
|
||||
|
||||
To enable this, use:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
adjust_viewport_to_content=True # Dynamically adjusts the viewport
|
||||
)
|
||||
```
|
||||
|
||||
This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
|
||||
|
||||
---
|
||||
|
||||
### Simulating Full-Page Scrolling
|
||||
|
||||
Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
|
||||
|
||||
Here’s an example:
|
||||
|
||||
```python
|
||||
await crawler.crawl(
|
||||
url="https://example.com",
|
||||
scan_full_page=True, # Enables scrolling
|
||||
scroll_delay=0.2 # Waits 200ms between scrolls (optional)
|
||||
)
|
||||
```
|
||||
|
||||
What happens here:
|
||||
1. The crawler scrolls down in increments, waiting for content to load after each scroll.
|
||||
2. It stops when no new content appears (i.e., dynamic elements stop loading).
|
||||
3. It scrolls back to the top before finishing (if necessary).
|
||||
|
||||
If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
|
||||
|
||||
---
|
||||
|
||||
### Reusing Browser Sessions (Save Time on Setup)
|
||||
|
||||
By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.
|
||||
|
||||
I added a method called `create_session` for this:
|
||||
|
||||
```python
|
||||
session_id = await crawler.create_session()
|
||||
|
||||
# Use the same session for multiple crawls
|
||||
await crawler.crawl(
|
||||
url="https://example.com/page1",
|
||||
session_id=session_id # Reuse the session
|
||||
)
|
||||
await crawler.crawl(
|
||||
url="https://example.com/page2",
|
||||
session_id=session_id
|
||||
)
|
||||
```
|
||||
|
||||
This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
|
||||
|
||||
---
|
||||
|
||||
### Other Updates
|
||||
|
||||
Here are a few smaller updates I’ve made:
|
||||
- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
|
||||
- **Logging**: Improved logs to make debugging easier.
|
||||
- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).
|
||||
|
||||
---
|
||||
|
||||
### How to Get the Update
|
||||
|
||||
You can install or upgrade to version `0.4.1` like this:
|
||||
|
||||
```bash
|
||||
pip install crawl4ai --upgrade
|
||||
```
|
||||
|
||||
As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!
|
||||
|
||||
Enjoy the new features, and happy crawling! 🕷️
|
||||
|
||||
---
|
||||
|
||||
|
||||
86
docs/md_v2/blog/releases/0.4.2.md
Normal file
86
docs/md_v2/blog/releases/0.4.2.md
Normal file
@@ -0,0 +1,86 @@
|
||||
## 🚀 Crawl4AI 0.4.2 Update: Smarter Crawling Just Got Easier (Dec 12, 2024)
|
||||
|
||||
### Hey Developers,
|
||||
|
||||
I’m excited to share Crawl4AI 0.4.2—a major upgrade that makes crawling smarter, faster, and a whole lot more intuitive. I’ve packed in a bunch of new features to simplify your workflows and improve your experience. Let’s cut to the chase!
|
||||
|
||||
---
|
||||
|
||||
### 🔧 **Configurable Browser and Crawler Behavior**
|
||||
|
||||
You’ve asked for better control over how browsers and crawlers are configured, and now you’ve got it. With the new `BrowserConfig` and `CrawlerRunConfig` objects, you can set up your browser and crawling behavior exactly how you want. No more cluttering `arun` with a dozen arguments—just pass in your configs and go.
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig, AsyncWebCrawler
|
||||
|
||||
browser_config = BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
|
||||
crawler_config = CrawlerRunConfig(cache_mode="BYPASS")
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_config)
|
||||
print(result.markdown[:500])
|
||||
```
|
||||
|
||||
This setup is a game-changer for scalability, keeping your code clean and flexible as we add more parameters in the future.
|
||||
|
||||
Remember: If you like to use the old way, you can still pass arguments directly to `arun` as before, no worries!
|
||||
|
||||
---
|
||||
|
||||
### 🔐 **Streamlined Session Management**
|
||||
|
||||
Here’s the big one: You can now pass local storage and cookies directly. Whether it’s setting values programmatically or importing a saved JSON state, managing sessions has never been easier. This is a must-have for authenticated crawls—just export your storage state once and reuse it effortlessly across runs.
|
||||
|
||||
**Example:**
|
||||
1. Open a browser, log in manually, and export the storage state.
|
||||
2. Import the JSON file for seamless authenticated crawling:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/protected",
|
||||
storage_state="my_storage_state.json"
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 🔢 **Handling Large Pages: Supercharged Screenshots and PDF Conversion**
|
||||
|
||||
Two big upgrades here:
|
||||
|
||||
- **Blazing-fast long-page screenshots**: Turn extremely long web pages into clean, high-quality screenshots—without breaking a sweat. It’s optimized to handle large content without lag.
|
||||
|
||||
- **Full-page PDF exports**: Now, you can also convert any page into a PDF with all the details intact. Perfect for archiving or sharing complex layouts.
|
||||
|
||||
---
|
||||
|
||||
### 🔧 **Other Cool Stuff**
|
||||
|
||||
- **Anti-bot enhancements**: Magic mode now handles overlays, user simulation, and anti-detection features like a pro.
|
||||
- **JavaScript execution**: Execute custom JS snippets to handle dynamic content. No more wrestling with endless page interactions.
|
||||
|
||||
---
|
||||
|
||||
### 📊 **Performance Boosts and Dev-friendly Updates**
|
||||
|
||||
- Faster rendering and viewport adjustments for better performance.
|
||||
- Improved cookie and local storage handling for seamless authentication.
|
||||
- Better debugging with detailed logs and actionable error messages.
|
||||
|
||||
---
|
||||
|
||||
### 🔠 **Use Cases You’ll Love**
|
||||
|
||||
1. **Authenticated Crawls**: Login once, export your storage state, and reuse it across multiple requests without the headache.
|
||||
2. **Long-page Screenshots**: Perfect for blogs, e-commerce pages, or any endless-scroll website.
|
||||
3. **PDF Export**: Create professional-looking page PDFs in seconds.
|
||||
|
||||
---
|
||||
|
||||
### Let’s Get Crawling
|
||||
|
||||
Crawl4AI 0.4.2 is ready for you to download and try. I’m always looking for ways to improve, so don’t hold back—share your thoughts and feedback.
|
||||
|
||||
Happy Crawling! 🚀
|
||||
|
||||
@@ -169,6 +169,35 @@ llm_result = await crawler.arun(
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
## Input Formats
|
||||
All extraction strategies support different input formats to give you more control over how content is processed:
|
||||
|
||||
- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
|
||||
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
|
||||
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with content filter to be configured.
|
||||
|
||||
To specify an input format:
|
||||
```python
|
||||
strategy = LLMExtractionStrategy(
|
||||
input_format="html", # or "markdown" or "fit_markdown"
|
||||
provider="openai/gpt-4",
|
||||
instruction="Extract product information"
|
||||
)
|
||||
```
|
||||
|
||||
Note: When using "fit_markdown", ensure your CrawlerRunConfig includes a markdown generator with content filter:
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=strategy,
|
||||
markdown_generator=DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter() # Content filter goes here for fit_markdown
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
If fit_markdown is requested but not available (no markdown generator or content filter), the system will automatically fall back to raw markdown with a warning.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose the Right Strategy**
|
||||
|
||||
329
docs/md_v3/tutorials/advanced-features.md
Normal file
329
docs/md_v3/tutorials/advanced-features.md
Normal file
@@ -0,0 +1,329 @@
|
||||
# Advanced Features (Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
|
||||
|
||||
Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
|
||||
|
||||
1. **Proxy Usage**
|
||||
2. **Capturing PDFs & Screenshots**
|
||||
3. **Handling SSL Certificates**
|
||||
4. **Custom Headers**
|
||||
5. **Session Persistence & Local Storage**
|
||||
|
||||
> **Prerequisites**
|
||||
> - You have a basic grasp of [AsyncWebCrawler Basics](./async-webcrawler-basics.md)
|
||||
> - You know how to run or configure your Python environment with Playwright installed
|
||||
|
||||
---
|
||||
|
||||
## 1. Proxy Usage
|
||||
|
||||
If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
browser_cfg = BrowserConfig(
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "myuser",
|
||||
"password": "mypass",
|
||||
},
|
||||
headless=True
|
||||
)
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
verbose=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.whatismyip.com/",
|
||||
config=crawler_cfg
|
||||
)
|
||||
if result.success:
|
||||
print("[OK] Page fetched via proxy.")
|
||||
print("Page HTML snippet:", result.html[:200])
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Points**
|
||||
- **`proxy_config`** expects a dict with `server` and optional auth credentials.
|
||||
- Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.
|
||||
- If your proxy doesn’t need auth, omit `username`/`password`.
|
||||
|
||||
---
|
||||
|
||||
## 2. Capturing PDFs & Screenshots
|
||||
|
||||
Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
|
||||
|
||||
```python
|
||||
import os, asyncio
|
||||
from base64 import b64decode
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
pdf=True,
|
||||
screenshot=True
|
||||
)
|
||||
|
||||
if result.success:
|
||||
# Save screenshot
|
||||
if result.screenshot:
|
||||
with open("wikipedia_screenshot.png", "wb") as f:
|
||||
f.write(b64decode(result.screenshot))
|
||||
|
||||
# Save PDF
|
||||
if result.pdf:
|
||||
with open("wikipedia_page.pdf", "wb") as f:
|
||||
f.write(b64decode(result.pdf))
|
||||
|
||||
print("[OK] PDF & screenshot captured.")
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Why PDF + Screenshot?**
|
||||
- Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.
|
||||
- Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.
|
||||
|
||||
**Relevant Parameters**
|
||||
- **`pdf=True`**: Exports the current page as a PDF (base64-encoded in `result.pdf`).
|
||||
- **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).
|
||||
- **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
|
||||
|
||||
---
|
||||
|
||||
## 3. Handling SSL Certificates
|
||||
|
||||
If you need to verify or export a site’s SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
|
||||
|
||||
```python
|
||||
import asyncio, os
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
tmp_dir = os.path.join(os.getcwd(), "tmp")
|
||||
os.makedirs(tmp_dir, exist_ok=True)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
if result.success and result.ssl_certificate:
|
||||
cert = result.ssl_certificate
|
||||
print("\nCertificate Information:")
|
||||
print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
|
||||
print(f"Valid until: {cert.valid_until}")
|
||||
print(f"Fingerprint: {cert.fingerprint}")
|
||||
|
||||
# Export in multiple formats:
|
||||
cert.to_json(os.path.join(tmp_dir, "certificate.json"))
|
||||
cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
|
||||
cert.to_der(os.path.join(tmp_dir, "certificate.der"))
|
||||
|
||||
print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
|
||||
else:
|
||||
print("[ERROR] No certificate or crawl failed.")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Points**
|
||||
- **`fetch_ssl_certificate=True`** triggers certificate retrieval.
|
||||
- `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
|
||||
|
||||
---
|
||||
|
||||
## 4. Custom Headers
|
||||
|
||||
Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
# Option 1: Set headers at the crawler strategy level
|
||||
crawler1 = AsyncWebCrawler(
|
||||
# The underlying strategy can accept headers in its constructor
|
||||
crawler_strategy=None # We'll override below for clarity
|
||||
)
|
||||
crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
|
||||
crawler1.crawler_strategy.set_custom_headers({
|
||||
"Accept-Language": "fr-FR,fr;q=0.9"
|
||||
})
|
||||
result1 = await crawler1.arun("https://www.example.com")
|
||||
print("Example 1 result success:", result1.success)
|
||||
|
||||
# Option 2: Pass headers directly to `arun()`
|
||||
crawler2 = AsyncWebCrawler()
|
||||
result2 = await crawler2.arun(
|
||||
url="https://www.example.com",
|
||||
headers={"Accept-Language": "es-ES,es;q=0.9"}
|
||||
)
|
||||
print("Example 2 result success:", result2.success)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Notes**
|
||||
- Some sites may react differently to certain headers (e.g., `Accept-Language`).
|
||||
- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-anti-bot.md) or use `UserAgentGenerator`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Session Persistence & Local Storage
|
||||
|
||||
Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
|
||||
|
||||
### 5.1 `storage_state`
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
storage_dict = {
|
||||
"cookies": [
|
||||
{
|
||||
"name": "session",
|
||||
"value": "abcd1234",
|
||||
"domain": "example.com",
|
||||
"path": "/",
|
||||
"expires": 1699999999.0,
|
||||
"httpOnly": False,
|
||||
"secure": False,
|
||||
"sameSite": "None"
|
||||
}
|
||||
],
|
||||
"origins": [
|
||||
{
|
||||
"origin": "https://example.com",
|
||||
"localStorage": [
|
||||
{"name": "token", "value": "my_auth_token"}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Provide the storage state as a dictionary to start "already logged in"
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
storage_state=storage_dict
|
||||
) as crawler:
|
||||
result = await crawler.arun("https://example.com/protected")
|
||||
if result.success:
|
||||
print("Protected page content length:", len(result.html))
|
||||
else:
|
||||
print("Failed to crawl protected page")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### 5.2 Exporting & Reusing State
|
||||
|
||||
You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
|
||||
|
||||
- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
|
||||
- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
|
||||
|
||||
**See**: [Detailed session management tutorial](./hooks-custom.md#using-storage_state) or [Explanations → Browser Context & Managed Browser](../../explanations/browser-management.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
|
||||
|
||||
---
|
||||
|
||||
## Putting It All Together
|
||||
|
||||
Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
|
||||
|
||||
```python
|
||||
import os, asyncio
|
||||
from base64 import b64decode
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
# 1. Browser config with proxy + headless
|
||||
browser_cfg = BrowserConfig(
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "myuser",
|
||||
"password": "mypass",
|
||||
},
|
||||
headless=True,
|
||||
)
|
||||
|
||||
# 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
pdf=True,
|
||||
screenshot=True,
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
headers={"Accept-Language": "en-US,en;q=0.8"},
|
||||
storage_state="my_storage.json", # Reuse session from a previous sign-in
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# 3. Crawl
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://secure.example.com/protected", config=crawler_cfg)
|
||||
|
||||
if result.success:
|
||||
print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
|
||||
|
||||
# Save PDF & screenshot
|
||||
if result.pdf:
|
||||
with open("result.pdf", "wb") as f:
|
||||
f.write(b64decode(result.pdf))
|
||||
if result.screenshot:
|
||||
with open("result.png", "wb") as f:
|
||||
f.write(b64decode(result.screenshot))
|
||||
|
||||
# Check SSL cert
|
||||
if result.ssl_certificate:
|
||||
print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Conclusion & Next Steps
|
||||
|
||||
You’ve now explored several **advanced** features:
|
||||
|
||||
- **Proxy Usage**
|
||||
- **PDF & Screenshot** capturing for large or critical pages
|
||||
- **SSL Certificate** retrieval & exporting
|
||||
- **Custom Headers** for language or specialized requests
|
||||
- **Session Persistence** via storage state
|
||||
|
||||
**Where to go next**:
|
||||
|
||||
- **[Hooks & Custom Code](./hooks-custom.md)**: For multi-step interactions (clicking “Load More,” performing logins, etc.)
|
||||
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: If you need more sophisticated user simulation or stealth.
|
||||
- **[Reference → BrowserConfig & CrawlerRunConfig](../../reference/configuration.md)**: Detailed param descriptions for everything you’ve seen here and more.
|
||||
|
||||
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
218
docs/md_v3/tutorials/async-webcrawler-basics.md
Normal file
218
docs/md_v3/tutorials/async-webcrawler-basics.md
Normal file
@@ -0,0 +1,218 @@
|
||||
Below is a sample Markdown file (`tutorials/async-webcrawler-basics.md`) illustrating how you might teach new users the fundamentals of `AsyncWebCrawler`. This tutorial builds on the **Getting Started** section by introducing key configuration parameters and the structure of the crawl result. Feel free to adjust the code snippets, wording, or format to match your style.
|
||||
|
||||
---
|
||||
|
||||
# AsyncWebCrawler Basics
|
||||
|
||||
In this tutorial, you’ll learn how to:
|
||||
|
||||
1. Create and configure an `AsyncWebCrawler` instance
|
||||
2. Understand the `CrawlResult` object returned by `arun()`
|
||||
3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl
|
||||
|
||||
> **Prerequisites**
|
||||
> - You’ve already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).
|
||||
> - You have **Crawl4AI** installed and configured with Playwright.
|
||||
|
||||
---
|
||||
|
||||
## 1. What is `AsyncWebCrawler`?
|
||||
|
||||
`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it’s your high-level interface for collecting page data.
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(result)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Creating a Basic `AsyncWebCrawler` Instance
|
||||
|
||||
Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
# 1. Set up configuration objects (optional if you want defaults)
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
page_timeout=30000, # 30 seconds
|
||||
wait_for_images=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# 2. Initialize AsyncWebCrawler with your chosen browser config
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# 3. Run a single crawl
|
||||
url_to_crawl = "https://example.com"
|
||||
result = await crawler.arun(url=url_to_crawl, config=crawler_config)
|
||||
|
||||
# 4. Inspect the result
|
||||
if result.success:
|
||||
print(f"Successfully crawled: {result.url}")
|
||||
print(f"HTML length: {len(result.html)}")
|
||||
print(f"Markdown snippet: {result.markdown[:200]}...")
|
||||
else:
|
||||
print(f"Failed to crawl {result.url}. Error: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Key Points
|
||||
|
||||
1. **`BrowserConfig`** is optional, but it’s the place to specify browser-related settings (e.g., `headless`, `browser_type`).
|
||||
2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
|
||||
3. **`arun()`** is the main method to crawl a single URL. We’ll see how `arun_many()` works in later tutorials.
|
||||
|
||||
---
|
||||
|
||||
## 3. Understanding `CrawlResult`
|
||||
|
||||
When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:
|
||||
|
||||
```python
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
screenshot: Optional[str] = None # base64-encoded screenshot if requested
|
||||
pdf: Optional[bytes] = None # binary PDF data if requested
|
||||
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
||||
markdown_v2: Optional[MarkdownGenerationResult] = None
|
||||
error_message: Optional[str] = None
|
||||
# ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
|
||||
```
|
||||
|
||||
### Commonly Used Fields
|
||||
|
||||
- **`success`**: `True` if the crawl succeeded, `False` otherwise.
|
||||
- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).
|
||||
- **`markdown` / `markdown_v2`**: Contains the automatically generated Markdown representation of the page.
|
||||
- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.
|
||||
- **`links`**: A dictionary with lists of “internal” and “external” link objects.
|
||||
- **`error_message`**: If `success` is `False`, this often contains a description of the error.
|
||||
|
||||
**Example**:
|
||||
|
||||
```python
|
||||
if result.success:
|
||||
print("Page Title or snippet of HTML:", result.html[:200])
|
||||
if result.markdown:
|
||||
print("Markdown snippet:", result.markdown[:200])
|
||||
print("Links found:", len(result.links.get("internal", [])), "internal links")
|
||||
else:
|
||||
print("Error crawling:", result.error_message)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4. Relevant Basic Parameters
|
||||
|
||||
Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We’ll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.
|
||||
|
||||
### 4.1 `BrowserConfig` Essentials
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|--------------------|-----------------------------------------------------------|----------------|
|
||||
| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"` | `"chromium"` |
|
||||
| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
|
||||
| `verbose` | Print extra logs for debugging. | `True` |
|
||||
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
|
||||
|
||||
### 4.2 `CrawlerRunConfig` Essentials
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|-----------------------|--------------------------------------------------------------|--------------------|
|
||||
| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
|
||||
| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
|
||||
| `css_selector` | Target only certain elements for extraction. | `None` |
|
||||
| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.) | `None` |
|
||||
| `verbose` | Print logs for debugging. | `True` |
|
||||
|
||||
> **Tip**: Don’t worry if you see lots of parameters. You’ll learn them gradually in later tutorials.
|
||||
|
||||
---
|
||||
|
||||
## 5. Putting It All Together
|
||||
|
||||
Here’s a slightly more in-depth example that shows off a few key config parameters at once:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
browser_cfg = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
java_script_enabled=True,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
page_timeout=30000, # wait up to 30 seconds
|
||||
wait_for_images=True,
|
||||
css_selector=".article-body", # only extract content under this CSS selector
|
||||
verbose=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://news.example.com", config=crawler_cfg)
|
||||
|
||||
if result.success:
|
||||
print("[OK] Crawled:", result.url)
|
||||
print("HTML length:", len(result.html))
|
||||
print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Observations**:
|
||||
- `css_selector=".article-body"` ensures we only focus on the main content region.
|
||||
- `page_timeout=30000` helps if the site is slow.
|
||||
- We turned off `verbose` logs for the browser but kept them on for the crawler config.
|
||||
|
||||
---
|
||||
|
||||
## 6. Next Steps
|
||||
|
||||
- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
|
||||
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
|
||||
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
You now know the basics of **AsyncWebCrawler**:
|
||||
- How to create it with optional browser/crawler configs
|
||||
- How `arun()` works for single-page crawls
|
||||
- Where to find your crawled data in `CrawlResult`
|
||||
- A handful of frequently used configuration parameters
|
||||
|
||||
From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let’s move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.
|
||||
|
||||
---
|
||||
|
||||
**Last updated**: 2024-XX-XX
|
||||
|
||||
Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.
|
||||
271
docs/md_v3/tutorials/docker-quickstart.md
Normal file
271
docs/md_v3/tutorials/docker-quickstart.md
Normal file
@@ -0,0 +1,271 @@
|
||||
# Deploying with Docker (Quickstart)
|
||||
|
||||
> **⚠️ WARNING: Experimental & Legacy**
|
||||
> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.
|
||||
|
||||
Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.
|
||||
|
||||
---
|
||||
|
||||
## 1. Installation & Environment Setup (Outside Docker)
|
||||
|
||||
Before we jump into Docker usage, here’s a quick reminder of how to install Crawl4AI locally (legacy doc). For **non-Docker** deployments or local dev:
|
||||
|
||||
```bash
|
||||
# 1. Install the package
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
|
||||
# 2. Install playwright dependencies (all browsers or specific ones)
|
||||
playwright install --with-deps
|
||||
# or
|
||||
playwright install --with-deps chromium
|
||||
# or
|
||||
playwright install --with-deps chrome
|
||||
```
|
||||
|
||||
**Testing** your installation:
|
||||
|
||||
```bash
|
||||
# Visible browser test
|
||||
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Docker Overview
|
||||
|
||||
This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:
|
||||
|
||||
1. **POST** a request (e.g., URLs, extraction config)
|
||||
2. **Retrieve** your results from a task-based endpoint
|
||||
|
||||
> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.
|
||||
|
||||
---
|
||||
|
||||
## 3. Pulling and Running the Image
|
||||
|
||||
### Basic Run
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic
|
||||
```
|
||||
|
||||
This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
|
||||
|
||||
### Using an API Token
|
||||
|
||||
```bash
|
||||
docker run -p 11235:11235 \
|
||||
-e CRAWL4AI_API_TOKEN=your_secret_token \
|
||||
unclecode/crawl4ai:basic
|
||||
```
|
||||
|
||||
If **`CRAWL4AI_API_TOKEN`** is set, you must include `Authorization: Bearer <token>` in your requests. Otherwise, the service is open to anyone.
|
||||
|
||||
---
|
||||
|
||||
## 4. Docker Compose for Multi-Container Workflows
|
||||
|
||||
You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:
|
||||
|
||||
```yaml
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
crawl4ai:
|
||||
image: unclecode/crawl4ai:basic
|
||||
ports:
|
||||
- "11235:11235"
|
||||
environment:
|
||||
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
|
||||
# Additional env variables as needed
|
||||
volumes:
|
||||
- /dev/shm:/dev/shm
|
||||
```
|
||||
|
||||
To run:
|
||||
|
||||
```bash
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
And to stop:
|
||||
|
||||
```bash
|
||||
docker-compose down
|
||||
```
|
||||
|
||||
**Troubleshooting**:
|
||||
|
||||
- **Check logs**: `docker-compose logs -f crawl4ai`
|
||||
- **Remove orphan containers**: `docker-compose down --remove-orphans`
|
||||
- **Remove networks**: `docker network rm <network_name>`
|
||||
|
||||
---
|
||||
|
||||
## 5. Making Requests to the Container
|
||||
|
||||
**Base URL**: `http://localhost:11235`
|
||||
|
||||
### Example: Basic Crawl
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
task_request = {
|
||||
"urls": "https://example.com",
|
||||
"priority": 10
|
||||
}
|
||||
|
||||
response = requests.post("http://localhost:11235/crawl", json=task_request)
|
||||
task_id = response.json()["task_id"]
|
||||
|
||||
# Poll for status
|
||||
status_url = f"http://localhost:11235/task/{task_id}"
|
||||
status = requests.get(status_url).json()
|
||||
print(status)
|
||||
```
|
||||
|
||||
If you used an API token, do:
|
||||
|
||||
```python
|
||||
headers = {"Authorization": "Bearer your_secret_token"}
|
||||
response = requests.post(
|
||||
"http://localhost:11235/crawl",
|
||||
headers=headers,
|
||||
json=task_request
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Docker + New Crawler Config Approach
|
||||
|
||||
### Using `BrowserConfig` & `CrawlerRunConfig` in Requests
|
||||
|
||||
The Docker-based solution can accept **crawler configurations** in the request JSON (legacy doc might show direct parameters, but we want to embed them in `crawler_params` or `extra` to align with the new approach). For example:
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
request_data = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"crawler_params": {
|
||||
"headless": True,
|
||||
"browser_type": "chromium",
|
||||
"verbose": True,
|
||||
"page_timeout": 30000,
|
||||
# ... any other BrowserConfig-like fields
|
||||
},
|
||||
"extra": {
|
||||
"word_count_threshold": 50,
|
||||
"bypass_cache": True
|
||||
}
|
||||
}
|
||||
|
||||
response = requests.post("http://localhost:11235/crawl", json=request_data)
|
||||
task_id = response.json()["task_id"]
|
||||
```
|
||||
|
||||
This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.
|
||||
|
||||
---
|
||||
|
||||
## 7. Example: JSON Extraction in Docker
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# Define a schema for CSS extraction
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
request_data = {
|
||||
"urls": "https://www.coinbase.com/explore",
|
||||
"extraction_config": {
|
||||
"type": "json_css",
|
||||
"params": {"schema": schema}
|
||||
},
|
||||
"crawler_params": {
|
||||
"headless": True,
|
||||
"verbose": True
|
||||
}
|
||||
}
|
||||
|
||||
resp = requests.post("http://localhost:11235/crawl", json=request_data)
|
||||
task_id = resp.json()["task_id"]
|
||||
|
||||
# Poll for status
|
||||
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
|
||||
if status["status"] == "completed":
|
||||
extracted_content = status["result"]["extracted_content"]
|
||||
data = json.loads(extracted_content)
|
||||
print("Extracted:", len(data), "entries")
|
||||
else:
|
||||
print("Task still in progress or failed.")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Why This Docker Is Temporary
|
||||
|
||||
**We are building a new, stable approach**:
|
||||
|
||||
- The current Docker container is **experimental** and might break with future releases.
|
||||
- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.
|
||||
- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.
|
||||
|
||||
**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.
|
||||
|
||||
---
|
||||
|
||||
## 9. Known Limitations & Next Steps
|
||||
|
||||
1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
|
||||
2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.
|
||||
3. **LLM Integrations**: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
|
||||
4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
|
||||
5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates.
|
||||
|
||||
### Next Steps
|
||||
|
||||
- **Watch the Repository**: For announcements on the new Docker architecture.
|
||||
- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.
|
||||
- **Contribute**: If you have ideas or improvements, open a PR or discussion.
|
||||
- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.
|
||||
|
||||
---
|
||||
|
||||
## 10. Summary
|
||||
|
||||
**Deploying with Docker** can simplify running Crawl4AI as a service. However:
|
||||
|
||||
- **This Docker** approach is **legacy** and subject to removal/overhaul.
|
||||
- For production, please weigh the risks carefully.
|
||||
- Detailed “new Docker approach” is coming in **2025**.
|
||||
|
||||
We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!
|
||||
272
docs/md_v3/tutorials/getting-started.md
Normal file
272
docs/md_v3/tutorials/getting-started.md
Normal file
@@ -0,0 +1,272 @@
|
||||
# Getting Started with Crawl4AI
|
||||
|
||||
Welcome to **Crawl4AI**, an open-source LLM friendly Web Crawler & Scraper. In this tutorial, you’ll:
|
||||
|
||||
1. **Install** Crawl4AI (both via pip and Docker, with notes on platform challenges).
|
||||
2. Run your **first crawl** using minimal configuration.
|
||||
3. Generate **Markdown** output (and learn how it’s influenced by content filters).
|
||||
4. Experiment with a simple **CSS-based extraction** strategy.
|
||||
5. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
|
||||
|
||||
---
|
||||
|
||||
## 1. Introduction
|
||||
|
||||
Crawl4AI provides:
|
||||
- An asynchronous crawler, **`AsyncWebCrawler`**.
|
||||
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
|
||||
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports additional filters).
|
||||
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
|
||||
|
||||
By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
|
||||
|
||||
---
|
||||
|
||||
## 2. Installation
|
||||
|
||||
### 2.1 Python + Playwright
|
||||
|
||||
#### Basic Pip Installation
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
|
||||
# Verify your installation
|
||||
crawl4ai-doctor
|
||||
```
|
||||
|
||||
If you encounter any browser-related issues, you can install them manually:
|
||||
```bash
|
||||
python -m playwright install --with-deps chrome chromium
|
||||
```
|
||||
|
||||
- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
|
||||
|
||||
We cover advanced installation and Docker in the [Installation](#installation) section.
|
||||
|
||||
---
|
||||
|
||||
## 3. Your First Crawl
|
||||
|
||||
Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(result.markdown[:300]) # Print first 300 chars
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**What’s happening?**
|
||||
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
|
||||
- It fetches `https://example.com`.
|
||||
- Crawl4AI automatically converts the HTML into Markdown.
|
||||
|
||||
You now have a simple, working crawl!
|
||||
|
||||
---
|
||||
|
||||
## 4. Basic Configuration (Light Introduction)
|
||||
|
||||
Crawl4AI’s crawler can be heavily customized using two main classes:
|
||||
|
||||
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
|
||||
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
|
||||
|
||||
Below is an example with minimal usage:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
browser_conf = BrowserConfig(headless=True) # or False to see the browser
|
||||
run_conf = CrawlerRunConfig(cache_mode="BYPASS")
|
||||
|
||||
async with AsyncWebCrawler(config=browser_conf) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=run_conf
|
||||
)
|
||||
print(result.markdown)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
|
||||
|
||||
---
|
||||
|
||||
## 5. Generating Markdown Output
|
||||
|
||||
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
|
||||
|
||||
- **`result.markdown`**:
|
||||
The direct HTML-to-Markdown conversion.
|
||||
- **`result.markdown.fit_markdown`**:
|
||||
The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
|
||||
|
||||
### Example: Using a Filter with `DefaultMarkdownGenerator`
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
|
||||
md_generator = DefaultMarkdownGenerator(
|
||||
content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(markdown_generator=md_generator)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://news.ycombinator.com", config=config)
|
||||
print("Raw Markdown length:", len(result.markdown.raw_markdown))
|
||||
print("Fit Markdown length:", len(result.markdown.fit_markdown))
|
||||
```
|
||||
|
||||
**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
|
||||
|
||||
---
|
||||
|
||||
## 6. Simple Data Extraction (CSS-based)
|
||||
|
||||
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
import json
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def main():
|
||||
schema = {
|
||||
"name": "Example Items",
|
||||
"baseSelector": "div.item",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/items",
|
||||
config=CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema)
|
||||
)
|
||||
)
|
||||
# The JSON output is stored in 'extracted_content'
|
||||
data = json.loads(result.extracted_content)
|
||||
print(data)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Why is this helpful?**
|
||||
- Great for repetitive page structures (e.g., item listings, articles).
|
||||
- No AI usage or costs.
|
||||
- The crawler returns a JSON string you can parse or store.
|
||||
|
||||
---
|
||||
|
||||
## 7. Simple Data Extraction (LLM-based)
|
||||
|
||||
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
|
||||
|
||||
- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)
|
||||
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
|
||||
- Or any provider supported by the underlying library
|
||||
|
||||
Below is an example using **open-source** style (no token) and closed-source:
|
||||
|
||||
```python
|
||||
import os
|
||||
import json
|
||||
import asyncio
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
class PricingInfo(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the AI model")
|
||||
input_fee: str = Field(..., description="Fee for input tokens")
|
||||
output_fee: str = Field(..., description="Fee for output tokens")
|
||||
|
||||
async def main():
|
||||
# 1) Open-Source usage: no token required
|
||||
llm_strategy_open_source = LLMExtractionStrategy(
|
||||
provider="ollama/llama3.3", # or "any-other-local-model"
|
||||
api_token="no_token", # for local models, no API key is typically required
|
||||
schema=PricingInfo.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""
|
||||
From this page, extract all AI model pricing details in JSON format.
|
||||
Each entry should have 'model_name', 'input_fee', and 'output_fee'.
|
||||
""",
|
||||
temperature=0
|
||||
)
|
||||
|
||||
# 2) Closed-Source usage: API key for OpenAI, for example
|
||||
openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
|
||||
llm_strategy_openai = LLMExtractionStrategy(
|
||||
provider="openai/gpt-4",
|
||||
api_token=openai_token,
|
||||
schema=PricingInfo.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""
|
||||
From this page, extract all AI model pricing details in JSON format.
|
||||
Each entry should have 'model_name', 'input_fee', and 'output_fee'.
|
||||
""",
|
||||
temperature=0
|
||||
)
|
||||
|
||||
# We'll demo the open-source approach here
|
||||
config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/pricing",
|
||||
config=config
|
||||
)
|
||||
print("LLM-based extraction JSON:", result.extracted_content)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**What’s happening?**
|
||||
- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
|
||||
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
|
||||
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
|
||||
|
||||
---
|
||||
|
||||
## 8. Next Steps
|
||||
|
||||
Congratulations! You have:
|
||||
1. Installed Crawl4AI (via pip, with Docker as an option).
|
||||
2. Performed a simple crawl and printed Markdown.
|
||||
3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
|
||||
4. Experimented with **CSS-based** extraction for repetitive data.
|
||||
5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
|
||||
|
||||
If you are ready for more, check out:
|
||||
|
||||
- **Installation**: Learn more on how to install Crawl4AI and set up Playwright.
|
||||
- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
|
||||
- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
|
||||
- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
|
||||
- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
|
||||
- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
|
||||
|
||||
Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!
|
||||
527
docs/md_v3/tutorials/getting-warmer.md
Normal file
527
docs/md_v3/tutorials/getting-warmer.md
Normal file
@@ -0,0 +1,527 @@
|
||||
# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
|
||||
|
||||
Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
|
||||
|
||||
**What Crawl4AI is not:**
|
||||
|
||||
Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:
|
||||
|
||||
- To generate perfect, AI-friendly data (particularly for LLMs) from web content
|
||||
- To maximize speed and efficiency in data extraction and processing
|
||||
- To operate at scale, from Raspberry Pi to cloud infrastructures
|
||||
|
||||
Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It's super efficient and fast, optimized to:
|
||||
|
||||
1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
|
||||
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
|
||||
3. Provide a streamlined pipeline for AI data preparation and ingestion
|
||||
|
||||
In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.
|
||||
|
||||
**Key Links:**
|
||||
|
||||
- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
|
||||
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
|
||||
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
|
||||
- [Table of Contents](#table-of-contents)
|
||||
- [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
|
||||
- [2. Installation \& Environment Setup](#2-installation--environment-setup)
|
||||
- [Test Your Installation](#test-your-installation)
|
||||
- [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
|
||||
- [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
|
||||
- [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
|
||||
- [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
|
||||
- [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
|
||||
- [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
|
||||
- [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
|
||||
- [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
|
||||
- [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
|
||||
- [Using `storage_state`](#using-storage_state)
|
||||
- [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
|
||||
- [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
|
||||
- [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
|
||||
- [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
|
||||
- [15. Dockerization \& Scaling](#15-dockerization--scaling)
|
||||
- [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
|
||||
- [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
|
||||
- [18. Further Resources \& Community](#18-further-resources--community)
|
||||
|
||||
---
|
||||
|
||||
## 1. Introduction & Key Concepts
|
||||
|
||||
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
|
||||
|
||||
**Quick Test:**
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def test_run():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(result.markdown)
|
||||
|
||||
asyncio.run(test_run())
|
||||
```
|
||||
|
||||
If you see Markdown output, everything is working!
|
||||
|
||||
**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 2. Installation & Environment Setup
|
||||
|
||||
```bash
|
||||
# Install the package
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
|
||||
# Install Playwright with system dependencies (recommended)
|
||||
playwright install --with-deps # Installs all browsers
|
||||
|
||||
# Or install specific browsers:
|
||||
playwright install --with-deps chrome # Recommended for Colab/Linux
|
||||
playwright install --with-deps firefox
|
||||
playwright install --with-deps webkit
|
||||
playwright install --with-deps chromium
|
||||
|
||||
# Keep Playwright updated periodically
|
||||
playwright install
|
||||
```
|
||||
|
||||
> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably.
|
||||
|
||||
### Test Your Installation
|
||||
Try these one-liners:
|
||||
|
||||
```python
|
||||
# Visible browser test
|
||||
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
|
||||
|
||||
# Headless test (for servers/CI)
|
||||
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
|
||||
```
|
||||
|
||||
You should see a browser window (in visible test) loading example.com. If you get errors, try with Firefox using `playwright install --with-deps firefox`.
|
||||
|
||||
|
||||
**Try in Colab:**
|
||||
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
|
||||
**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)
|
||||
|
||||
---
|
||||
|
||||
## 3. Core Concepts & Configuration
|
||||
|
||||
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
|
||||
|
||||
**Example config:**
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
verbose=True,
|
||||
viewport_width=1080,
|
||||
viewport_height=600,
|
||||
text_mode=False,
|
||||
ignore_https_errors=True,
|
||||
java_script_enabled=True
|
||||
)
|
||||
|
||||
run_config = CrawlerRunConfig(
|
||||
css_selector="article.main",
|
||||
word_count_threshold=50,
|
||||
excluded_tags=['nav','footer'],
|
||||
exclude_external_links=True,
|
||||
wait_for="css:.article-loaded",
|
||||
page_timeout=60000,
|
||||
delay_before_return_html=1.0,
|
||||
mean_delay=0.1,
|
||||
max_range=0.3,
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True,
|
||||
js_code="""
|
||||
(async () => {
|
||||
window.scrollTo(0, document.body.scrollHeight);
|
||||
await new Promise(r => setTimeout(r, 2000));
|
||||
document.querySelector('.load-more')?.click();
|
||||
})();
|
||||
"""
|
||||
)
|
||||
|
||||
# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
|
||||
# run_config.cache_mode = CacheMode.ENABLED
|
||||
```
|
||||
|
||||
**Prefixes:**
|
||||
|
||||
- `http://` or `https://` for live pages
|
||||
- `file://local.html` for local
|
||||
- `raw:<html>` for raw HTML strings
|
||||
|
||||
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 4. Basic Crawling & Simple Extraction
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://news.example.com/article", config=run_config)
|
||||
print(result.markdown) # Basic markdown content
|
||||
```
|
||||
|
||||
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 5. Markdown Generation & AI-Optimized Output
|
||||
|
||||
After crawling, `result.markdown_v2` provides:
|
||||
|
||||
- `raw_markdown`: Unfiltered markdown
|
||||
- `markdown_with_citations`: Links as references at the bottom
|
||||
- `references_markdown`: A separate list of reference links
|
||||
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
|
||||
- `fit_html`: The HTML used to produce `fit_markdown`
|
||||
|
||||
**Example:**
|
||||
|
||||
```python
|
||||
print("RAW:", result.markdown_v2.raw_markdown[:200])
|
||||
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
|
||||
print("REFERENCES:", result.markdown_v2.references_markdown)
|
||||
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
|
||||
```
|
||||
|
||||
For AI training, `fit_markdown` focuses on the most relevant content.
|
||||
|
||||
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 6. Structured Data Extraction (CSS, XPath, LLM)
|
||||
|
||||
Extract JSON data without LLMs:
|
||||
|
||||
**CSS:**
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "Products",
|
||||
"baseSelector": ".product",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "price", "selector": ".price", "type": "text"}
|
||||
]
|
||||
}
|
||||
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
|
||||
```
|
||||
|
||||
**XPath:**
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
|
||||
|
||||
xpath_schema = {
|
||||
"name": "Articles",
|
||||
"baseSelector": "//div[@class='article']",
|
||||
"fields": [
|
||||
{"name":"headline","selector":".//h1","type":"text"},
|
||||
{"name":"summary","selector":".//p[@class='summary']","type":"text"}
|
||||
]
|
||||
}
|
||||
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
|
||||
```
|
||||
|
||||
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 7. Advanced Extraction: LLM & Open-Source Models
|
||||
|
||||
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
|
||||
|
||||
```python
|
||||
from pydantic import BaseModel
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
|
||||
class TravelData(BaseModel):
|
||||
destination: str
|
||||
attractions: list
|
||||
|
||||
run_config.extraction_strategy = LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
schema=TravelData.schema(),
|
||||
instruction="Extract destination and top attractions."
|
||||
)
|
||||
```
|
||||
|
||||
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 8. Page Interactions, JS Execution, & Dynamic Content
|
||||
|
||||
Insert `js_code` and use `wait_for` to ensure content loads. Example:
|
||||
|
||||
```python
|
||||
run_config.js_code = """
|
||||
(async () => {
|
||||
document.querySelector('.load-more')?.click();
|
||||
await new Promise(r => setTimeout(r, 2000));
|
||||
})();
|
||||
"""
|
||||
run_config.wait_for = "css:.item-loaded"
|
||||
```
|
||||
|
||||
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)
|
||||
|
||||
---
|
||||
|
||||
## 9. Media, Links, & Metadata Handling
|
||||
|
||||
`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance.
|
||||
|
||||
`result.media["videos"]`, `result.media["audios"]` similarly hold media info.
|
||||
|
||||
`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`.
|
||||
|
||||
`result.metadata`: Title, description, keywords, author.
|
||||
|
||||
**Example:**
|
||||
|
||||
```python
|
||||
# Images
|
||||
for img in result.media["images"]:
|
||||
print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))
|
||||
|
||||
# Links
|
||||
for link in result.links["external"]:
|
||||
print("External Link:", link["href"], "Text:", link["text"])
|
||||
|
||||
# Metadata
|
||||
print("Page Title:", result.metadata["title"])
|
||||
print("Description:", result.metadata["description"])
|
||||
```
|
||||
|
||||
**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 10. Authentication & Identity Preservation
|
||||
|
||||
### Manual Setup via User Data Directory
|
||||
|
||||
1. **Open Chrome with a custom user data dir:**
|
||||
|
||||
```bash
|
||||
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
|
||||
```
|
||||
|
||||
On macOS:
|
||||
|
||||
```bash
|
||||
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
|
||||
```
|
||||
|
||||
2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
|
||||
The browser saves cookies/localStorage in that directory.
|
||||
|
||||
3. **Use `user_data_dir` in `BrowserConfig`:**
|
||||
|
||||
```python
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
user_data_dir="/Users/username/ChromeProfiles/MyProfile"
|
||||
)
|
||||
```
|
||||
|
||||
Now the crawler starts with those cookies, sessions, etc.
|
||||
|
||||
### Using `storage_state`
|
||||
|
||||
Alternatively, export and reuse storage states:
|
||||
|
||||
```python
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
storage_state="mystate.json" # Pre-saved state
|
||||
)
|
||||
```
|
||||
|
||||
No repeated logins needed.
|
||||
|
||||
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)
|
||||
|
||||
---
|
||||
|
||||
## 11. Proxy & Security Enhancements
|
||||
|
||||
Use `proxy_config` for authenticated proxies:
|
||||
|
||||
```python
|
||||
browser_config.proxy_config = {
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "proxyuser",
|
||||
"password": "proxypass"
|
||||
}
|
||||
```
|
||||
|
||||
Combine with `headers` or `ignore_https_errors` as needed.
|
||||
|
||||
**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)
|
||||
|
||||
---
|
||||
|
||||
## 12. Screenshots, PDFs & File Downloads
|
||||
|
||||
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
run_config.screenshot = True
|
||||
run_config.pdf = True
|
||||
```
|
||||
|
||||
After crawling:
|
||||
|
||||
```python
|
||||
if result.screenshot:
|
||||
with open("page.png", "wb") as f:
|
||||
f.write(result.screenshot)
|
||||
|
||||
if result.pdf:
|
||||
with open("page.pdf", "wb") as f:
|
||||
f.write(result.pdf)
|
||||
```
|
||||
|
||||
**File Downloads:**
|
||||
|
||||
```python
|
||||
browser_config.accept_downloads = True
|
||||
browser_config.downloads_path = "./downloads"
|
||||
run_config.js_code = """document.querySelector('a.download')?.click();"""
|
||||
|
||||
# After crawl:
|
||||
print("Downloaded files:", result.downloaded_files)
|
||||
```
|
||||
|
||||
**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
|
||||
Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)
|
||||
|
||||
---
|
||||
|
||||
## 13. Caching & Performance Optimization
|
||||
|
||||
Set `cache_mode` to reuse fetch results:
|
||||
|
||||
```python
|
||||
from crawl4ai import CacheMode
|
||||
run_config.cache_mode = CacheMode.ENABLED
|
||||
```
|
||||
|
||||
Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
|
||||
|
||||
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)
|
||||
|
||||
---
|
||||
|
||||
## 14. Hooks for Custom Logic
|
||||
|
||||
Hooks let you run code at specific lifecycle events without creating pages manually in `on_browser_created`.
|
||||
|
||||
Use `on_page_context_created` to apply routing or modify page contexts before crawling the URL:
|
||||
|
||||
**Example Hook:**
|
||||
|
||||
```python
|
||||
async def on_page_context_created_hook(context, page, **kwargs):
|
||||
# Block all images to speed up load
|
||||
await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
|
||||
print("[HOOK] Image requests blocked")
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
|
||||
result = await crawler.arun("https://imageheavy.example.com", config=run_config)
|
||||
print("Crawl finished with images blocked.")
|
||||
```
|
||||
|
||||
This hook is clean and doesn’t create a separate page itself—it just modifies the current context/page setup.
|
||||
|
||||
**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)
|
||||
|
||||
---
|
||||
|
||||
## 15. Dockerization & Scaling
|
||||
|
||||
Use Docker images:
|
||||
|
||||
- AMD64 basic:
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
|
||||
```
|
||||
|
||||
- ARM64 for M1/M2:
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic-arm64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
|
||||
```
|
||||
|
||||
- GPU support:
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:gpu-amd64
|
||||
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
|
||||
```
|
||||
|
||||
Scale with load balancers or Kubernetes.
|
||||
|
||||
**More info:** [See /docs/proxy_security (for proxy) or relevant Docker instructions in README](#)
|
||||
|
||||
---
|
||||
|
||||
## 16. Troubleshooting & Common Pitfalls
|
||||
|
||||
- Empty results? Relax filters, check selectors.
|
||||
- Timeouts? Increase `page_timeout` or refine `wait_for`.
|
||||
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
|
||||
- JS errors? Try headful mode for debugging.
|
||||
|
||||
Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.
|
||||
|
||||
---
|
||||
|
||||
## 17. Comprehensive End-to-End Example
|
||||
|
||||
Combine hooks, JS execution, PDF saving, LLM extraction—see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
|
||||
|
||||
---
|
||||
|
||||
## 18. Further Resources & Community
|
||||
|
||||
- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
|
||||
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
|
||||
|
||||
Follow [@unclecode](https://x.com/unclecode) for news & community updates.
|
||||
|
||||
**Happy Crawling!**
|
||||
Leverage Crawl4AI to feed your AI models with clean, structured web data today.
|
||||
335
docs/md_v3/tutorials/hooks-custom.md
Normal file
335
docs/md_v3/tutorials/hooks-custom.md
Normal file
@@ -0,0 +1,335 @@
|
||||
# Hooks & Custom Code
|
||||
|
||||
Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:
|
||||
|
||||
- **Authentication** (log in before navigating)
|
||||
- **Content manipulation** (modify HTML, inject scripts, etc.)
|
||||
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
|
||||
- **Custom data collection** (scrape extra details or track state at each stage)
|
||||
|
||||
In this tutorial, you’ll learn about:
|
||||
|
||||
1. What hooks are available
|
||||
2. How to attach code to each hook
|
||||
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)
|
||||
|
||||
> **Prerequisites**
|
||||
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
|
||||
> - Comfortable with Python async/await.
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview of Available Hooks
|
||||
|
||||
| Hook Name | Called When / Purpose | Context / Objects Provided |
|
||||
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
|
||||
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
|
||||
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
|
||||
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
|
||||
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
|
||||
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
|
||||
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
|
||||
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
|
||||
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |
|
||||
|
||||
### A Note on `on_browser_created` (the “unbrowser” hook)
|
||||
- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
|
||||
- For example, you might control [CDP sessions][cdp] or advanced browser flags here.
|
||||
|
||||
---
|
||||
|
||||
## 2. Registering Hooks
|
||||
|
||||
You can attach hooks by calling:
|
||||
|
||||
```python
|
||||
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
|
||||
```
|
||||
|
||||
or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:
|
||||
|
||||
```python
|
||||
hooks = {
|
||||
"before_goto": my_before_goto_hook,
|
||||
"after_goto": my_after_goto_hook,
|
||||
# ... etc.
|
||||
}
|
||||
async with AsyncWebCrawler(hooks=hooks) as crawler:
|
||||
...
|
||||
```
|
||||
|
||||
### Hook Signature
|
||||
|
||||
Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**—most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
|
||||
|
||||
---
|
||||
|
||||
## 3. Real-Life Examples
|
||||
|
||||
Below are concrete scenarios where hooks come in handy.
|
||||
|
||||
---
|
||||
|
||||
### 3.1 Authentication Before Navigation
|
||||
|
||||
One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).
|
||||
|
||||
#### Using `before_goto`
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def before_goto_auth_hook(page, context, goto_params, **kwargs):
|
||||
"""
|
||||
Example: Set cookies or localStorage to simulate login.
|
||||
This hook runs right before page.goto() is called.
|
||||
"""
|
||||
# Example: Insert cookie-based auth or local storage data
|
||||
# (You could also do more complex actions, like fill forms if you already have a 'page' open.)
|
||||
print("[HOOK] Setting auth data before goto.")
|
||||
await context.add_cookies([
|
||||
{
|
||||
"name": "session",
|
||||
"value": "abcd1234",
|
||||
"domain": "example.com",
|
||||
"path": "/"
|
||||
}
|
||||
])
|
||||
# Optionally manipulate goto_params if needed:
|
||||
# goto_params["url"] = goto_params["url"] + "?debug=1"
|
||||
|
||||
async def main():
|
||||
hooks = {
|
||||
"before_goto": before_goto_auth_hook
|
||||
}
|
||||
|
||||
browser_cfg = BrowserConfig(headless=True)
|
||||
crawler_cfg = CrawlerRunConfig()
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
|
||||
result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
|
||||
if result.success:
|
||||
print("[OK] Logged in and fetched protected page.")
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Points**
|
||||
- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
|
||||
- If you need to run a real login flow (submitting forms), consider `on_browser_created` or `on_page_context_created` if you want to do it once at the start.
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Setting Up the Browser in `on_browser_created`
|
||||
|
||||
If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.
|
||||
|
||||
```python
|
||||
async def on_browser_created_hook(browser, **kwargs):
|
||||
"""
|
||||
Runs immediately after the browser is created, before any pages.
|
||||
'browser' here is a Playwright Browser object.
|
||||
"""
|
||||
print("[HOOK] Browser created. Setting up custom stuff.")
|
||||
# Possibly connect to DevTools or create an incognito context
|
||||
# Example (pseudo-code):
|
||||
# devtools_url = await browser.new_context(devtools=True)
|
||||
|
||||
# Usage:
|
||||
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
|
||||
...
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.3 Adjusting Page or Context in `on_page_context_created`
|
||||
|
||||
If you’d like to set default timeouts or inject scripts right after a page context is spun up:
|
||||
|
||||
```python
|
||||
async def on_page_context_created_hook(page, context, **kwargs):
|
||||
print("[HOOK] Page context created. Setting default timeouts or scripts.")
|
||||
await page.set_default_timeout(20000) # 20 seconds
|
||||
# Possibly inject a script or set user locale
|
||||
|
||||
# Usage:
|
||||
hooks = {
|
||||
"on_page_context_created": on_page_context_created_hook
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.4 Dynamically Updating User Agents
|
||||
|
||||
`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:
|
||||
|
||||
```python
|
||||
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
|
||||
print(f"[HOOK] User agent updated to {new_ua}")
|
||||
# Maybe add a custom header based on new UA
|
||||
await context.set_extra_http_headers({"X-UA-Source": new_ua})
|
||||
|
||||
hooks = {
|
||||
"on_user_agent_updated": on_user_agent_updated_hook
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Initializing Stuff with `on_execution_started`
|
||||
|
||||
`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).
|
||||
|
||||
```python
|
||||
async def on_execution_started_hook(page, context, **kwargs):
|
||||
print("[HOOK] Execution started. Setting a start timestamp or logging.")
|
||||
context.set_default_navigation_timeout(45000) # 45s if your site is slow
|
||||
|
||||
hooks = {
|
||||
"on_execution_started": on_execution_started_hook
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.6 Post-Processing with `after_goto`
|
||||
|
||||
After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying you’re on the right page, or removing interstitials:
|
||||
|
||||
```python
|
||||
async def after_goto_hook(page, context, response, **kwargs):
|
||||
"""
|
||||
Called right after page.goto() finishes, but before the crawler extracts HTML.
|
||||
"""
|
||||
if response and response.ok:
|
||||
print("[HOOK] After goto. Status:", response.status)
|
||||
# Maybe remove popups or check if we landed on a login failure page.
|
||||
await page.evaluate("""() => {
|
||||
const popup = document.querySelector(".annoying-popup");
|
||||
if (popup) popup.remove();
|
||||
}""")
|
||||
else:
|
||||
print("[HOOK] Navigation might have failed, status not ok or no response.")
|
||||
|
||||
hooks = {
|
||||
"after_goto": after_goto_hook
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`
|
||||
|
||||
Sometimes you need to tweak the page or raw HTML right before it’s captured.
|
||||
|
||||
```python
|
||||
async def before_retrieve_html_hook(page, context, **kwargs):
|
||||
"""
|
||||
Modify the DOM just before the crawler finalizes the HTML.
|
||||
"""
|
||||
print("[HOOK] Removing adverts before capturing HTML.")
|
||||
await page.evaluate("""() => {
|
||||
const ads = document.querySelectorAll(".ad-banner");
|
||||
ads.forEach(ad => ad.remove());
|
||||
}""")
|
||||
|
||||
async def before_return_html_hook(page, context, html, **kwargs):
|
||||
"""
|
||||
'html' is the near-finished HTML string. Return an updated string if you like.
|
||||
"""
|
||||
# For example, remove personal data or certain tags from the final text
|
||||
print("[HOOK] Sanitizing final HTML.")
|
||||
sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
|
||||
return sanitized_html
|
||||
|
||||
hooks = {
|
||||
"before_retrieve_html": before_retrieve_html_hook,
|
||||
"before_return_html": before_return_html_hook
|
||||
}
|
||||
```
|
||||
|
||||
**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
|
||||
|
||||
---
|
||||
|
||||
## 4. Putting It All Together
|
||||
|
||||
You can combine multiple hooks in a single run. For instance:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def on_browser_created_hook(browser, **kwargs):
|
||||
print("[HOOK] Browser is up, no page yet. Good for broad config.")
|
||||
|
||||
async def before_goto_auth_hook(page, context, goto_params, **kwargs):
|
||||
print("[HOOK] Adding cookies for auth.")
|
||||
await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com"}])
|
||||
|
||||
async def after_goto_log_hook(page, context, response, **kwargs):
|
||||
if response:
|
||||
print("[HOOK] after_goto: Status code:", response.status)
|
||||
|
||||
async def main():
|
||||
hooks = {
|
||||
"on_browser_created": on_browser_created_hook,
|
||||
"before_goto": before_goto_auth_hook,
|
||||
"after_goto": after_goto_log_hook
|
||||
}
|
||||
|
||||
browser_cfg = BrowserConfig(headless=True)
|
||||
crawler_cfg = CrawlerRunConfig(verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
|
||||
result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
|
||||
if result.success:
|
||||
print("[OK] Protected page length:", len(result.html))
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
This example:
|
||||
|
||||
1. **`on_browser_created`** sets up the brand-new browser instance.
|
||||
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
|
||||
3. **`after_goto`** logs the resulting HTTP status code.
|
||||
|
||||
---
|
||||
|
||||
## 5. Common Pitfalls & Best Practices
|
||||
|
||||
1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
|
||||
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
|
||||
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
|
||||
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
|
||||
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider if a separate “how-to” function with a simpler approach might suffice.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion & Next Steps
|
||||
|
||||
**Hooks** let you bend Crawl4AI to your will:
|
||||
|
||||
- **Authentication** (cookies, localStorage) with `before_goto`
|
||||
- **Browser-level config** with `on_browser_created`
|
||||
- **Page or context config** with `on_page_context_created`
|
||||
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)
|
||||
|
||||
**Where to go next**:
|
||||
|
||||
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
|
||||
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
|
||||
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.
|
||||
|
||||
With the hook system, you have near-complete control over the browser’s lifecycle—whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user