# Compare commits

`v0.4.243` ... `unclecode-`
## Commits (27)

- `b53835d34f`
- `fe52311bf4`
- `01b73950ee`
- `12880f1ffa`
- `53be88b677`
- `3427ead8b8`
- `32652189b0`
- `ae376f15fb`
- `72fbdac467`
- `0857c7b448`
- `07b4c1c0ed`
- `196dc79ec7`
- `24b3da717a`
- `98acc4254d`
- `eac78c7993`
- `da1bc0f7bf`
- `aa4f92f458`
- `a96e05d4ae`
- `5c95fd92b4`
- `4cb2a62551`
- `5b4fad9e25`
- `ea0ac25f38`
- `7688aca7d6`
- `a7215ad972`
- `8e2403a7da`
- `318554e6bf`
- `3e769a9c6c`
## `.codeiumignore` — deleted (220 lines)

```diff
@@ -1,220 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-cover/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-.pybuilder/
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-# For a library or package, you might want to ignore these files since the code is
-# intended to run in multiple environments; otherwise, check them in:
-# .python-version
-
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# poetry
-# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
-# This is especially recommended for binary packages to ensure reproducibility, and is more
-# commonly ignored for libraries.
-# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
-#poetry.lock
-
-# pdm
-# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-#pdm.lock
-# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
-# in version control.
-# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
-.pdm.toml
-.pdm-python
-.pdm-build/
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
-
-# pytype static type analyzer
-.pytype/
-
-# Cython debug symbols
-cython_debug/
-
-# PyCharm
-# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-# and can be added to the global gitignore or merged into this file. For a more nuclear
-# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
-
-Crawl4AI.egg-info/
-Crawl4AI.egg-info/*
-crawler_data.db
-.vscode/
-.tests/
-.test_pads/
-test_pad.py
-test_pad*.py
-.data/
-Crawl4AI.egg-info/
-
-requirements0.txt
-a.txt
-
-*.sh
-.idea
-docs/examples/.chainlit/
-docs/examples/.chainlit/*
-.chainlit/config.toml
-.chainlit/translations/en-US.json
-
-local/
-.files/
-
-a.txt
-.lambda_function.py
-ec2*
-
-update_changelog.sh
-
-.DS_Store
-docs/.DS_Store
-tmp/
-test_env/
-**/.DS_Store
-**/.DS_Store
-
-todo.md
-todo_executor.md
-git_changes.py
-git_changes.md
-pypi_build.sh
-git_issues.py
-git_issues.md
-
-.next/
-.tests/
-.docs/
-.gitboss/
-todo_executor.md
-protect-all-except-feature.sh
-manage-collab.sh
-publish.sh
-combine.sh
-combined_output.txt
-tree.md
```
## `.gitignore` — 2 additions

```diff
@@ -225,3 +225,5 @@ tree.md
 .scripts
 .local
 .do
+/plans
+plans/
```
## `CHANGELOG.md` — 37 additions

````diff
@@ -5,6 +5,43 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+---
+
+## [0.4.247] - 2025-01-06
+
+### Added
+- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
+- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
+
+### Changed
+- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
+- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
+- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
+- **Documentation Update**:
+  - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
+
+### Removed
+- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
+
+### Fixed
+- **Page Closing to Prevent Memory Leaks**:
+  - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
+  - **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
+  - **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
+  - **Code**:
+    ```python
+    finally:
+        # If no session_id is given we should close the page
+        if not config.session_id:
+            await page.close()
+    ```
+- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
+- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
+
+### Other
+- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
+
 ## [0.4.24] - 2024-12-31
 
 ### Added
````
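The Windows event-loop fix named in the changelog can be sketched as follows. Only the function name `configure_windows_event_loop` comes from the changelog; the body below is an assumed minimal implementation, based on the fact that asyncio subprocess support on Windows requires the Proactor event loop:

```python
import asyncio
import sys


def configure_windows_event_loop() -> None:
    """Assumed sketch: on Windows, select an event loop policy that
    supports asyncio subprocesses, avoiding NotImplementedError."""
    if sys.platform == "win32":
        # The Proactor loop implements subprocess transports on Windows.
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
```

In this sketch the function would be called once at startup, before constructing the crawler; on non-Windows platforms it is a no-op.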
## `CODE_OF_CONDUCT.md` — new file (131 lines)

```diff
@@ -0,0 +1,131 @@
+# Crawl4AI Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, caste, color, religion, or sexual
+identity and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the overall
+  community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or advances of
+  any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email address,
+  without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official email address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement at
+unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series of
+actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or permanent
+ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior, harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within the
+community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.1, available at
+[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+Community Impact Guidelines were inspired by
+[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
+
+For answers to common questions about this code of conduct, see the FAQ at
+[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
+[https://www.contributor-covenant.org/translations][translations].
+
+[homepage]: https://www.contributor-covenant.org
+[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+[Mozilla CoC]: https://github.com/mozilla/diversity
+[FAQ]: https://www.contributor-covenant.org/faq
+[translations]: https://www.contributor-covenant.org/translations
```
## `README.md` — badge changes

```diff
@@ -11,10 +11,11 @@
 [](https://pypi.org/project/crawl4ai/)
 [](https://pepy.tech/project/crawl4ai)
 
-[](https://crawl4ai.readthedocs.io/)
+<!-- [](https://crawl4ai.readthedocs.io/) -->
 [](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
 [](https://github.com/psf/black)
 [](https://github.com/PyCQA/bandit)
+[](code_of_conduct.md)
 
 </div>
```
## `crawl4ai/_version.py`

```diff
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.243"
+__version__ = "0.4.247"
```
## `BrowserConfig` / `CrawlerRunConfig`

```diff
@@ -35,7 +35,9 @@ class BrowserConfig:
         user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
             temporary directory may be used. Default: None.
         chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
-            is "chromium". Default: "chrome".
+            is "chromium". Default: "chromium".
+        channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
+            is "chromium". Default: "chromium".
         proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
             Default: None.
         proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
@@ -77,7 +79,8 @@ class BrowserConfig:
         use_managed_browser: bool = False,
         use_persistent_context: bool = False,
         user_data_dir: str = None,
-        chrome_channel: str = "chrome",
+        chrome_channel: str = "chromium",
+        channel: str = "chromium",
         proxy: str = None,
         proxy_config: dict = None,
         viewport_width: int = 1080,
@@ -107,14 +110,8 @@ class BrowserConfig:
         self.use_managed_browser = use_managed_browser
         self.use_persistent_context = use_persistent_context
         self.user_data_dir = user_data_dir
-        if self.browser_type == "chromium":
-            self.chrome_channel = "chrome"
-        elif self.browser_type == "firefox":
-            self.chrome_channel = "firefox"
-        elif self.browser_type == "webkit":
-            self.chrome_channel = "webkit"
-        else:
-            self.chrome_channel = chrome_channel or "chrome"
+        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
+        self.channel = channel or self.browser_type or "chromium"
         self.proxy = proxy
         self.proxy_config = proxy_config
         self.viewport_width = viewport_width
@@ -161,7 +158,8 @@ class BrowserConfig:
             use_managed_browser=kwargs.get("use_managed_browser", False),
             use_persistent_context=kwargs.get("use_persistent_context", False),
             user_data_dir=kwargs.get("user_data_dir"),
-            chrome_channel=kwargs.get("chrome_channel", "chrome"),
+            chrome_channel=kwargs.get("chrome_channel", "chromium"),
+            channel=kwargs.get("channel", "chromium"),
            proxy=kwargs.get("proxy"),
            proxy_config=kwargs.get("proxy_config"),
            viewport_width=kwargs.get("viewport_width", 1080),
@@ -248,7 +246,7 @@ class CrawlerRunConfig:
         wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
             Default: None.
         wait_for_images (bool): If True, wait for images to load before extracting content.
-            Default: True.
+            Default: False.
         delay_before_return_html (float): Delay in seconds before retrieving final HTML.
             Default: 0.1.
         mean_delay (float): Mean base delay between requests when calling arun_many.
@@ -347,7 +345,7 @@ class CrawlerRunConfig:
         wait_until: str = "domcontentloaded",
         page_timeout: int = PAGE_TIMEOUT,
         wait_for: str = None,
-        wait_for_images: bool = True,
+        wait_for_images: bool = False,
         delay_before_return_html: float = 0.1,
         mean_delay: float = 0.1,
         max_range: float = 0.3,
@@ -505,7 +503,7 @@ class CrawlerRunConfig:
             wait_until=kwargs.get("wait_until", "domcontentloaded"),
             page_timeout=kwargs.get("page_timeout", 60000),
             wait_for=kwargs.get("wait_for"),
-            wait_for_images=kwargs.get("wait_for_images", True),
+            wait_for_images=kwargs.get("wait_for_images", False),
            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
            mean_delay=kwargs.get("mean_delay", 0.1),
            max_range=kwargs.get("max_range", 0.3),
```
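The `BrowserConfig` change collapses the old per-browser `if/elif` chain into a single fallback expression. A minimal standalone sketch of that resolution logic (extracted into a free function purely for illustration; the real code assigns `self.channel` inside `BrowserConfig.__init__`):

```python
from typing import Optional


def resolve_channel(channel: Optional[str] = None,
                    browser_type: Optional[str] = None) -> str:
    # First truthy value wins: an explicit channel, then the browser
    # type, then the "chromium" default this change introduces.
    return channel or browser_type or "chromium"
```

For example, `resolve_channel(None, "firefox")` yields `"firefox"`, while calling it with no arguments falls through to `"chromium"`.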
## `crawl4ai/async_crawler_strategy.py`

```diff
@@ -1476,7 +1476,12 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         except Exception as e:
             raise e
+        finally:
+            # If no session_id is given we should close the page
+            if not config.session_id:
+                await page.close()
 
-    async def _handle_full_page_scan(self, page: Page, scroll_delay: float):
+    async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
         """
         Helper method to handle full page scanning.
@@ -1500,7 +1505,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         current_position = viewport_height
 
         # await page.evaluate(f"window.scrollTo(0, {current_position})")
-        await self.safe_scroll(page, 0, current_position)
+        await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
         # await self.csp_scroll_to(page, 0, current_position)
         # await asyncio.sleep(scroll_delay)
 
@@ -1510,7 +1515,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
 
         while current_position < total_height:
             current_position = min(current_position + viewport_height, total_height)
-            await self.safe_scroll(page, 0, current_position)
+            await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
             # await page.evaluate(f"window.scrollTo(0, {current_position})")
             # await asyncio.sleep(scroll_delay)
 
@@ -1639,11 +1644,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         Returns:
             str: The base64-encoded screenshot data
         """
-        dimensions = await self.get_page_dimensions(page)
-        page_height = dimensions['height']
-        if page_height < kwargs.get(
-            "screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
-        ):
+        need_scroll = await self.page_need_scroll(page)
+        if not need_scroll:
             # Page is short enough, just take a screenshot
             return await self.take_screenshot_naive(page)
         else:
@@ -2066,7 +2069,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         }
         """)
 
-    async def safe_scroll(self, page: Page, x: int, y: int):
+    async def safe_scroll(self, page: Page, x: int, y: int, delay: float = 0.1):
         """
         Safely scroll the page with rendering time.
 
@@ -2077,7 +2080,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         """
         result = await self.csp_scroll_to(page, x, y)
         if result['success']:
-            await page.wait_for_timeout(100)  # Allow for rendering
+            await page.wait_for_timeout(delay * 1000)
         return result
 
     async def csp_scroll_to(self, page: Page, x: int, y: int) -> Dict[str, Any]:
@@ -2159,3 +2162,30 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         return {width: scrollWidth, height: scrollHeight};
         }
         """)
+
+    async def page_need_scroll(self, page: Page) -> bool:
+        """
+        Determine whether the page need to scroll
+
+        Args:
+            page: Playwright page object
+
+        Returns:
+            bool: True if page needs scrolling
+        """
+        try:
+            need_scroll = await page.evaluate("""
+            () => {
+                const scrollHeight = document.documentElement.scrollHeight;
+                const viewportHeight = window.innerHeight;
+                return scrollHeight > viewportHeight;
+            }
+            """)
+            return need_scroll
+        except Exception as e:
+            self.logger.warning(
+                message="Failed to check scroll need: {error}. Defaulting to True for safety.",
+                tag="SCROLL",
+                params={"error": str(e)}
+            )
+            return True  # Default to scrolling if check fails
```
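The new `page_need_scroll` check and the parameterized `safe_scroll` delay reduce to two small pieces of logic. The sketch below mirrors them outside Playwright, as plain functions with illustrative names (they are not part of the crawl4ai API):

```python
def needs_scroll(scroll_height: int, viewport_height: int) -> bool:
    # Same comparison page_need_scroll evaluates in the browser:
    # scroll only when the content is taller than the viewport.
    return scroll_height > viewport_height


def scroll_wait_ms(delay_seconds: float = 0.1) -> float:
    # safe_scroll now converts its delay (seconds) into the
    # milliseconds expected by page.wait_for_timeout.
    return delay_seconds * 1000
```

With the default `delay_seconds=0.1`, the wait matches the previous hard-coded 100 ms, so the change is backward-compatible while letting callers slow the scan down.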
## `AsyncWebCrawler`

```diff
@@ -418,34 +418,30 @@ class AsyncWebCrawler:
                 **kwargs
             )
 
-            # crawl_result.status_code = async_response.status_code
-            # crawl_result.response_headers = async_response.response_headers
-            # crawl_result.downloaded_files = async_response.downloaded_files
-            # crawl_result.ssl_certificate = async_response.ssl_certificate # Add SSL certificate
-            # else:
-            #     crawl_result.status_code = 200
-            #     crawl_result.response_headers = cached_result.response_headers if cached_result else {}
-            #     crawl_result.ssl_certificate = cached_result.ssl_certificate if cached_result else None # Add SSL certificate from cache
+            crawl_result.status_code = async_response.status_code
+            crawl_result.response_headers = async_response.response_headers
+            crawl_result.downloaded_files = async_response.downloaded_files
+            crawl_result.ssl_certificate = async_response.ssl_certificate # Add SSL certificate
 
             # # Check and set values from async_response to crawl_result
-            try:
-                for key in vars(async_response):
-                    if hasattr(crawl_result, key):
-                        value = getattr(async_response, key, None)
-                        current_value = getattr(crawl_result, key, None)
-                        if value is not None and not current_value:
-                            try:
-                                setattr(crawl_result, key, value)
-                            except Exception as e:
-                                self.logger.warning(
-                                    message=f"Failed to set attribute {key}: {str(e)}",
-                                    tag="WARNING"
-                                )
-            except Exception as e:
-                self.logger.warning(
-                    message=f"Error copying response attributes: {str(e)}",
-                    tag="WARNING"
-                )
+            # try:
+            #     for key in vars(async_response):
+            #         if hasattr(crawl_result, key):
+            #             value = getattr(async_response, key, None)
+            #             current_value = getattr(crawl_result, key, None)
+            #             if value is not None and not current_value:
+            #                 try:
+            #                     setattr(crawl_result, key, value)
+            #                 except Exception as e:
+            #                     self.logger.warning(
+            #                         message=f"Failed to set attribute {key}: {str(e)}",
+            #                         tag="WARNING"
+            #                     )
+            # except Exception as e:
+            #     self.logger.warning(
+            #         message=f"Error copying response attributes: {str(e)}",
+            #         tag="WARNING"
+            #     )
 
             crawl_result.success = bool(html)
             crawl_result.session_id = getattr(config, 'session_id', None)
@@ -585,8 +581,10 @@ class AsyncWebCrawler:
 
         # Markdown Generation
         markdown_generator: Optional[MarkdownGenerationStrategy] = config.markdown_generator or DefaultMarkdownGenerator()
-        if not config.content_filter and not markdown_generator.content_filter:
-            markdown_generator.content_filter = PruningContentFilter()
+        # Uncomment if by default we want to use PruningContentFilter
+        # if not config.content_filter and not markdown_generator.content_filter:
+        #     markdown_generator.content_filter = PruningContentFilter()
 
         markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
             cleaned_html=cleaned_html,
```
|||||||
@@ -122,92 +122,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
         """
         return await asyncio.to_thread(self._scrap, url, html, **kwargs)

-    def _generate_markdown_content(self, cleaned_html: str, html: str, url: str, success: bool, **kwargs) -> Dict[str, Any]:
-        """
-        Generate markdown content from cleaned HTML.
-
-        Args:
-            cleaned_html (str): The cleaned HTML content.
-            html (str): The original HTML content.
-            url (str): The URL of the page.
-            success (bool): Whether the content was successfully cleaned.
-            **kwargs: Additional keyword arguments.
-
-        Returns:
-            Dict[str, Any]: A dictionary containing the generated markdown content.
-        """
-        markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
-
-        if markdown_generator:
-            try:
-                if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
-                    markdown_generator.content_filter = BM25ContentFilter(
-                        user_query=kwargs.get('fit_markdown_user_query', None),
-                        bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
-                    )
-
-                markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
-                    cleaned_html=cleaned_html,
-                    base_url=url,
-                    html2text_options=kwargs.get('html2text', {})
-                )
-
-                return {
-                    'markdown': markdown_result.raw_markdown,
-                    'fit_markdown': markdown_result.fit_markdown,
-                    'fit_html': markdown_result.fit_html,
-                    'markdown_v2': markdown_result
-                }
-            except Exception as e:
-                self._log('error',
-                    message="Error using new markdown generation strategy: {error}",
-                    tag="SCRAPE",
-                    params={"error": str(e)}
-                )
-                markdown_generator = None
-                return {
-                    'markdown': f"Error using new markdown generation strategy: {str(e)}",
-                    'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
-                    'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
-                    'markdown_v2': None
-                }
-
-        # Legacy method
-        """
-        # h = CustomHTML2Text()
-        # h.update_params(**kwargs.get('html2text', {}))
-        # markdown = h.handle(cleaned_html)
-        # markdown = markdown.replace(' ```', '```')
-
-        # fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
-        # fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
-
-        # if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
-        #     content_filter = kwargs.get('content_filter', None)
-        #     if not content_filter:
-        #         content_filter = BM25ContentFilter(
-        #             user_query=kwargs.get('fit_markdown_user_query', None),
-        #             bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
-        #         )
-        #     fit_html = content_filter.filter_content(html)
-        #     fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
-        #     fit_markdown = h.handle(fit_html)
-
-        # markdown_v2 = MarkdownGenerationResult(
-        #     raw_markdown=markdown,
-        #     markdown_with_citations=markdown,
-        #     references_markdown=markdown,
-        #     fit_markdown=fit_markdown
-        # )
-
-        # return {
-        #     'markdown': markdown,
-        #     'fit_markdown': fit_markdown,
-        #     'fit_html': fit_html,
-        #     'markdown_v2' : markdown_v2
-        # }
-        """

     def flatten_nested_elements(self, node):
         """
         Flatten nested elements in a HTML tree.
@@ -798,13 +712,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):

         cleaned_html = str_body.replace('\n\n', '\n').replace('  ', ' ')

-        # markdown_content = self._generate_markdown_content(
-        #     cleaned_html=cleaned_html,
-        #     html=html,
-        #     url=url,
-        #     success=success,
-        #     **kwargs
-        # )

         return {
             # **markdown_content,
@@ -974,8 +974,9 @@ class JsonCssExtractionStrategy(JsonElementExtractionStrategy):
         return parsed_html.select(selector)

     def _get_elements(self, element, selector: str):
-        selected = element.select_one(selector)
-        return [selected] if selected else []
+        # Return all matching elements using select() instead of select_one()
+        # This ensures that we get all elements that match the selector, not just the first one
+        return element.select(selector)

     def _get_element_text(self, element) -> str:
         return element.get_text(strip=True)
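The `_get_elements` fix above swaps BeautifulSoup's first-match `select_one` for the all-matches `select`: the old code silently dropped every sibling after the first, which breaks list extraction. The same first-match vs all-matches distinction exists in the standard library's `xml.etree.ElementTree` (`find` vs `findall`), so it can be sketched without the third-party dependency:

```python
import xml.etree.ElementTree as ET

html = ET.fromstring(
    "<ul>"
    "<li class='item'>one</li>"
    "<li class='item'>two</li>"
    "<li class='item'>three</li>"
    "</ul>"
)

# Old behaviour: only the first match, wrapped in a list (like select_one)
first_only = [html.find("li")]
# New behaviour: every matching element (like select)
all_matches = html.findall("li")

print(len(first_only), len(all_matches))  # → 1 3
```

Wrapping the single result in a list kept the old call sites working, but any page with repeated elements lost all but the first match, which is exactly what this commit corrects.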
@@ -1049,4 +1050,3 @@ class JsonXPathExtractionStrategy(JsonElementExtractionStrategy):

     def _get_element_attribute(self, element, attribute: str):
         return element.get(attribute)
-
@@ -143,41 +143,83 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
         Returns:
             MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
         """
-        # Initialize HTML2Text with options
-        h = CustomHTML2Text()
-        if html2text_options:
-            h.update_params(**html2text_options)
-        elif options:
-            h.update_params(**options)
-        elif self.options:
-            h.update_params(**self.options)
-
-        # Generate raw markdown
-        raw_markdown = h.handle(cleaned_html)
-        raw_markdown = raw_markdown.replace(' ```', '```')
-
-        # Convert links to citations
-        markdown_with_citations: str = ""
-        references_markdown: str = ""
-        if citations:
-            markdown_with_citations, references_markdown = self.convert_links_to_citations(
-                raw_markdown, base_url
-            )
-
-        # Generate fit markdown if content filter is provided
-        fit_markdown: Optional[str] = ""
-        filtered_html: Optional[str] = ""
-        if content_filter or self.content_filter:
-            content_filter = content_filter or self.content_filter
-            filtered_html = content_filter.filter_content(cleaned_html)
-            filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
-            fit_markdown = h.handle(filtered_html)
-
-        return MarkdownGenerationResult(
-            raw_markdown=raw_markdown,
-            markdown_with_citations=markdown_with_citations,
-            references_markdown=references_markdown,
-            fit_markdown=fit_markdown,
-            fit_html=filtered_html,
-        )
+        try:
+            # Initialize HTML2Text with default options for better conversion
+            h = CustomHTML2Text(baseurl=base_url)
+            default_options = {
+                'body_width': 0,  # Disable text wrapping
+                'ignore_emphasis': False,
+                'ignore_links': False,
+                'ignore_images': False,
+                'protect_links': True,
+                'single_line_break': True,
+                'mark_code': True,
+                'escape_snob': False
+            }
+
+            # Update with custom options if provided
+            if html2text_options:
+                default_options.update(html2text_options)
+            elif options:
+                default_options.update(options)
+            elif self.options:
+                default_options.update(self.options)
+
+            h.update_params(**default_options)
+
+            # Ensure we have valid input
+            if not cleaned_html:
+                cleaned_html = ""
+            elif not isinstance(cleaned_html, str):
+                cleaned_html = str(cleaned_html)
+
+            # Generate raw markdown
+            try:
+                raw_markdown = h.handle(cleaned_html)
+            except Exception as e:
+                raw_markdown = f"Error converting HTML to markdown: {str(e)}"
+
+            raw_markdown = raw_markdown.replace(' ```', '```')
+
+            # Convert links to citations
+            markdown_with_citations: str = raw_markdown
+            references_markdown: str = ""
+            if citations:
+                try:
+                    markdown_with_citations, references_markdown = self.convert_links_to_citations(
+                        raw_markdown, base_url
+                    )
+                except Exception as e:
+                    markdown_with_citations = raw_markdown
+                    references_markdown = f"Error generating citations: {str(e)}"
+
+            # Generate fit markdown if content filter is provided
+            fit_markdown: Optional[str] = ""
+            filtered_html: Optional[str] = ""
+            if content_filter or self.content_filter:
+                try:
+                    content_filter = content_filter or self.content_filter
+                    filtered_html = content_filter.filter_content(cleaned_html)
+                    filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
+                    fit_markdown = h.handle(filtered_html)
+                except Exception as e:
+                    fit_markdown = f"Error generating fit markdown: {str(e)}"
+                    filtered_html = ""
+
+            return MarkdownGenerationResult(
+                raw_markdown=raw_markdown or "",
+                markdown_with_citations=markdown_with_citations or "",
+                references_markdown=references_markdown or "",
+                fit_markdown=fit_markdown or "",
+                fit_html=filtered_html or "",
+            )
+        except Exception as e:
+            # If anything fails, return empty strings with error message
+            error_msg = f"Error in markdown generation: {str(e)}"
+            return MarkdownGenerationResult(
+                raw_markdown=error_msg,
+                markdown_with_citations=error_msg,
+                references_markdown="",
+                fit_markdown="",
+                fit_html="",
+            )
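The rewritten `generate_markdown` no longer feeds `html2text_options`, `options`, or `self.options` straight into `update_params`; it layers exactly one of them over a fixed default dict, then makes a single `update_params` call. That precedence logic can be sketched on its own (`merge_html2text_options` is a stand-in helper named for this sketch, not part of the library's API):

```python
def merge_html2text_options(html2text_options=None, options=None, instance_options=None):
    """Layer exactly one caller-supplied dict over the defaults, mirroring
    the if/elif chain in the diff: html2text_options wins over options,
    which wins over the instance-level options."""
    default_options = {
        'body_width': 0,          # disable text wrapping
        'ignore_emphasis': False,
        'ignore_links': False,
        'ignore_images': False,
        'protect_links': True,
        'single_line_break': True,
        'mark_code': True,
        'escape_snob': False,
    }
    if html2text_options:
        default_options.update(html2text_options)
    elif options:
        default_options.update(options)
    elif instance_options:
        default_options.update(instance_options)
    return default_options

merged = merge_html2text_options(options={'body_width': 80, 'unicode_snob': True})
print(merged['body_width'], merged['mark_code'], merged['unicode_snob'])  # → 80 True True
```

Note the `elif` chain: the three sources are alternatives, never merged together, so passing `html2text_options` causes `options` and the instance-level options to be ignored entirely.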
@@ -21,6 +21,8 @@ import textwrap
 import cProfile
 import pstats
 from functools import wraps
+import asyncio
+

 class InvalidCSSSelectorError(Exception):
     pass
@@ -1579,6 +1581,25 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:

     return content_paths

+def configure_windows_event_loop():
+    """
+    Configure the Windows event loop to use ProactorEventLoop.
+    This resolves the NotImplementedError that occurs on Windows when using asyncio subprocesses.
+
+    This function should only be called on Windows systems and before any async operations.
+    On non-Windows systems, this function does nothing.
+
+    Example:
+        ```python
+        from crawl4ai.async_configs import configure_windows_event_loop
+
+        # Call this before any async operations if you're on Windows
+        configure_windows_event_loop()
+        ```
+    """
+    if platform.system() == 'Windows':
+        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
+
 def get_error_context(exc_info, context_lines: int = 5):
     """
     Extract error context with more reliable line number tracking.
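The new `configure_windows_event_loop` helper is a one-line policy switch behind a platform guard: `WindowsProactorEventLoopPolicy` only exists on Windows builds of CPython, so the attribute access must stay inside the check, which also makes the call a harmless no-op everywhere else. A standalone sketch of the same pattern:

```python
import asyncio
import platform

def configure_windows_event_loop() -> None:
    # The Proactor event loop supports asyncio subprocesses on Windows;
    # the default selector-based loop there does not.
    if platform.system() == 'Windows':
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# Safe to call on any OS; it only changes behaviour on Windows.
configure_windows_event_loop()

async def main() -> str:
    return "event loop ran"

print(asyncio.run(main()))
```

Because `set_event_loop_policy` only affects loops created afterwards, the call has to happen before any async operation starts, which is why both the docstring and the tutorial below place it at the top of the script.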
docs/examples/hello_world.py (new file, 20 lines)
@@ -0,0 +1,20 @@
+import asyncio
+from crawl4ai import *
+
+async def main():
+    browser_config = BrowserConfig(headless=True, verbose=True)
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        crawler_config = CrawlerRunConfig(
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(
+                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+            )
+        )
+        result = await crawler.arun(
+            url="https://www.helloworld.org",
+            config=crawler_config
+        )
+        print(result.markdown_v2.raw_markdown[:500])
+
+if __name__ == "__main__":
+    asyncio.run(main())
@@ -148,7 +148,24 @@ Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might twea

 ---

-## 5. Putting It All Together
+## 5. Windows-Specific Configuration
+
+When using AsyncWebCrawler on Windows, you might encounter a `NotImplementedError` related to `asyncio.create_subprocess_exec`. This is a known Windows-specific issue that occurs because Windows' default event loop doesn't support subprocess operations.
+
+To resolve this, Crawl4AI provides a utility function to configure Windows to use the ProactorEventLoop. Call this function before running any async operations:
+
+```python
+from crawl4ai.utils import configure_windows_event_loop
+
+# Call this before any async operations if you're on Windows
+configure_windows_event_loop()
+
+# Your AsyncWebCrawler code here
+```
+
+---
+
+## 6. Putting It All Together

 Here’s a slightly more in-depth example that shows off a few key config parameters at once:

@@ -193,7 +210,7 @@ if __name__ == "__main__":

 ---

-## 6. Next Steps
+## 7. Next Steps

 - **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
 - **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).