Compare commits

..

20 Commits

Author SHA1 Message Date
UncleCode
b53835d34f Delete .codeiumignore 2025-01-06 19:17:31 +08:00
UncleCode
fe52311bf4 Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-06 15:20:30 +08:00
UncleCode
01b73950ee Merge branch 'vr0.4.267' 2025-01-06 15:20:28 +08:00
UncleCode
12880f1ffa Update gitignore 2025-01-06 15:19:01 +08:00
UncleCode
53be88b677 Update gitignore 2025-01-06 15:18:37 +08:00
UncleCode
3427ead8b8 Update CHANGELOG 2025-01-06 15:13:43 +08:00
aravind
32652189b0 Docs: Add Code of Conduct for the project (#410) 2025-01-06 12:52:51 +08:00
UncleCode
ae376f15fb docs(extraction): add clarifying comments for CSS selector behavior
Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection.

Removed trailing whitespace at end of file.
2025-01-05 19:39:15 +08:00
UncleCode
72fbdac467 fix(extraction): JsonCss selector and crawler improvements
- Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one
- Add robust error handling to page_need_scroll with default fallback
- Improve JSON extraction strategies documentation
- Refactor content scraping strategy
- Update version to 0.4.247
2025-01-05 19:26:46 +08:00
UncleCode
0857c7b448 Merge branch 'main' of https://github.com/unclecode/crawl4ai into next 2025-01-05 17:05:59 +08:00
Guilume
07b4c1c0ed fix: not working long page screenshot (#403) 2025-01-05 17:04:34 +08:00
UncleCode
196dc79ec7 fix: prevent memory leaks by ensuring proper closure of Playwright pages
- Fixes critical memory leak issue where browser pages remained open
- Ensures proper cleanup of Playwright resources after page operations
- Improves resource management in browser farm implementation

This is an urgent fix to address resource leakage that could impact system stability.
2025-01-03 21:17:23 +08:00
UncleCode
24b3da717a refactor():
- Update hello world example
2025-01-02 17:53:30 +08:00
UncleCode
98acc4254d refactor:
- Update hello_world.py example
2025-01-01 19:47:22 +08:00
UncleCode
eac78c7993 Merge branch 'vr0.4.246' 2025-01-01 19:43:01 +08:00
UncleCode
da1bc0f7bf Update version file 2025-01-01 19:42:35 +08:00
UncleCode
aa4f92f458 refactor(crawler):
- Update hello_world example with proper content filtering
2025-01-01 19:39:42 +08:00
UncleCode
a96e05d4ae refactor(crawler): optimize response handling and default settings
- Set wait_for_images default to false for better performance
- Simplify response attribute copying in AsyncWebCrawler
- Update hello_world example with proper content filtering
2025-01-01 19:39:02 +08:00
UncleCode
5c95fd92b4 fix(browser): resolve merge conflicts in browser channel configuration 2025-01-01 19:05:47 +08:00
Arno.Edwards
8e2403a7da fix(browser)!: default to Chromium channel for new headless mode (#387)
BREAKING CHANGE: Updated `chrome_channel` to "chromium" to fix compatibility with the new Chromium headless implementation. This resolves the error `playwright._impl._errors.Error: BrowserType.launch: Chromium distribution 'chrome' is not found`, caused by the removal of the old headless mode in Chromium.

With this change, channels like "chrome" and "msedge" now default to the new headless mode, aligning with upstream updates in Playwright v1.49. The new headless mode uses the real Chrome browser, offering more authenticity, reliability, and feature parity with the full browser.

Additionally, simplified fallback logic by directly assigning `chrome_channel` based on `browser_type` or defaulting to "chromium".

Refer to:
- https://playwright.dev/python/docs/browsers#chromium
- https://github.com/microsoft/playwright/issues/33566
2025-01-01 18:37:50 +08:00
15 changed files with 382 additions and 396 deletions

View File

@@ -1,220 +0,0 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
Crawl4AI.egg-info/
Crawl4AI.egg-info/*
crawler_data.db
.vscode/
.tests/
.test_pads/
test_pad.py
test_pad*.py
.data/
Crawl4AI.egg-info/
requirements0.txt
a.txt
*.sh
.idea
docs/examples/.chainlit/
docs/examples/.chainlit/*
.chainlit/config.toml
.chainlit/translations/en-US.json
local/
.files/
a.txt
.lambda_function.py
ec2*
update_changelog.sh
.DS_Store
docs/.DS_Store
tmp/
test_env/
**/.DS_Store
**/.DS_Store
todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
git_issues.py
git_issues.md
.next/
.tests/
.docs/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh
publish.sh
combine.sh
combined_output.txt
tree.md

2
.gitignore vendored
View File

@@ -225,3 +225,5 @@ tree.md
.scripts
.local
.do
/plans
plans/

View File

@@ -5,6 +5,43 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [0.4.267] - 2025 - 01 - 06
### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
### Changed
- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
- **Documentation Update**:
- Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
### Removed
- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
### Fixed
- **Page Closing to Prevent Memory Leaks**:
- **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
- **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
- **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
- **Code**:
```python
finally:
# If no session_id is given we should close the page
if not config.session_id:
await page.close()
```
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
### Other
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
## [0.4.24] - 2024-12-31
### Added

131
CODE_OF_CONDUCT.md Normal file
View File

@@ -0,0 +1,131 @@
# Crawl4AI Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or advances of
any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series of
actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within the
community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].
[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations

View File

@@ -15,6 +15,7 @@
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
</div>

View File

@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.4.245"
__version__ = "0.4.247"

View File

@@ -246,7 +246,7 @@ class CrawlerRunConfig:
wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
Default: None.
wait_for_images (bool): If True, wait for images to load before extracting content.
Default: True.
Default: False.
delay_before_return_html (float): Delay in seconds before retrieving final HTML.
Default: 0.1.
mean_delay (float): Mean base delay between requests when calling arun_many.
@@ -345,7 +345,7 @@ class CrawlerRunConfig:
wait_until: str = "domcontentloaded",
page_timeout: int = PAGE_TIMEOUT,
wait_for: str = None,
wait_for_images: bool = True,
wait_for_images: bool = False,
delay_before_return_html: float = 0.1,
mean_delay: float = 0.1,
max_range: float = 0.3,
@@ -503,7 +503,7 @@ class CrawlerRunConfig:
wait_until=kwargs.get("wait_until", "domcontentloaded"),
page_timeout=kwargs.get("page_timeout", 60000),
wait_for=kwargs.get("wait_for"),
wait_for_images=kwargs.get("wait_for_images", True),
wait_for_images=kwargs.get("wait_for_images", False),
delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
mean_delay=kwargs.get("mean_delay", 0.1),
max_range=kwargs.get("max_range", 0.3),

View File

@@ -1475,8 +1475,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
except Exception as e:
raise e
finally:
# If no session_id is given we should close the page
if not config.session_id:
await page.close()
async def _handle_full_page_scan(self, page: Page, scroll_delay: float):
async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
"""
Helper method to handle full page scanning.
@@ -1500,7 +1505,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
current_position = viewport_height
# await page.evaluate(f"window.scrollTo(0, {current_position})")
await self.safe_scroll(page, 0, current_position)
await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
# await self.csp_scroll_to(page, 0, current_position)
# await asyncio.sleep(scroll_delay)
@@ -1510,7 +1515,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
while current_position < total_height:
current_position = min(current_position + viewport_height, total_height)
await self.safe_scroll(page, 0, current_position)
await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
# await page.evaluate(f"window.scrollTo(0, {current_position})")
# await asyncio.sleep(scroll_delay)
@@ -1639,11 +1644,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
Returns:
str: The base64-encoded screenshot data
"""
dimensions = await self.get_page_dimensions(page)
page_height = dimensions['height']
if page_height < kwargs.get(
"screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
):
need_scroll = await self.page_need_scroll(page)
if not need_scroll:
# Page is short enough, just take a screenshot
return await self.take_screenshot_naive(page)
else:
@@ -2066,7 +2069,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
}
""")
async def safe_scroll(self, page: Page, x: int, y: int):
async def safe_scroll(self, page: Page, x: int, y: int, delay: float = 0.1):
"""
Safely scroll the page with rendering time.
@@ -2077,7 +2080,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
"""
result = await self.csp_scroll_to(page, x, y)
if result['success']:
await page.wait_for_timeout(100) # Allow for rendering
await page.wait_for_timeout(delay * 1000)
return result
async def csp_scroll_to(self, page: Page, x: int, y: int) -> Dict[str, Any]:
@@ -2158,4 +2161,31 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
const {scrollWidth, scrollHeight} = document.documentElement;
return {width: scrollWidth, height: scrollHeight};
}
""")
""")
async def page_need_scroll(self, page: Page) -> bool:
"""
Determine whether the page need to scroll
Args:
page: Playwright page object
Returns:
bool: True if page needs scrolling
"""
try:
need_scroll = await page.evaluate("""
() => {
const scrollHeight = document.documentElement.scrollHeight;
const viewportHeight = window.innerHeight;
return scrollHeight > viewportHeight;
}
""")
return need_scroll
except Exception as e:
self.logger.warning(
message="Failed to check scroll need: {error}. Defaulting to True for safety.",
tag="SCROLL",
params={"error": str(e)}
)
return True # Default to scrolling if check fails

View File

@@ -418,34 +418,30 @@ class AsyncWebCrawler:
**kwargs
)
# crawl_result.status_code = async_response.status_code
# crawl_result.response_headers = async_response.response_headers
# crawl_result.downloaded_files = async_response.downloaded_files
# crawl_result.ssl_certificate = async_response.ssl_certificate # Add SSL certificate
# else:
# crawl_result.status_code = 200
# crawl_result.response_headers = cached_result.response_headers if cached_result else {}
# crawl_result.ssl_certificate = cached_result.ssl_certificate if cached_result else None # Add SSL certificate from cache
crawl_result.status_code = async_response.status_code
crawl_result.response_headers = async_response.response_headers
crawl_result.downloaded_files = async_response.downloaded_files
crawl_result.ssl_certificate = async_response.ssl_certificate # Add SSL certificate
# # Check and set values from async_response to crawl_result
try:
for key in vars(async_response):
if hasattr(crawl_result, key):
value = getattr(async_response, key, None)
current_value = getattr(crawl_result, key, None)
if value is not None and not current_value:
try:
setattr(crawl_result, key, value)
except Exception as e:
self.logger.warning(
message=f"Failed to set attribute {key}: {str(e)}",
tag="WARNING"
)
except Exception as e:
self.logger.warning(
message=f"Error copying response attributes: {str(e)}",
tag="WARNING"
)
# try:
# for key in vars(async_response):
# if hasattr(crawl_result, key):
# value = getattr(async_response, key, None)
# current_value = getattr(crawl_result, key, None)
# if value is not None and not current_value:
# try:
# setattr(crawl_result, key, value)
# except Exception as e:
# self.logger.warning(
# message=f"Failed to set attribute {key}: {str(e)}",
# tag="WARNING"
# )
# except Exception as e:
# self.logger.warning(
# message=f"Error copying response attributes: {str(e)}",
# tag="WARNING"
# )
crawl_result.success = bool(html)
crawl_result.session_id = getattr(config, 'session_id', None)
@@ -585,8 +581,10 @@ class AsyncWebCrawler:
# Markdown Generation
markdown_generator: Optional[MarkdownGenerationStrategy] = config.markdown_generator or DefaultMarkdownGenerator()
if not config.content_filter and not markdown_generator.content_filter:
markdown_generator.content_filter = PruningContentFilter()
# Uncomment if by default we want to use PruningContentFilter
# if not config.content_filter and not markdown_generator.content_filter:
# markdown_generator.content_filter = PruningContentFilter()
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,

View File

@@ -122,92 +122,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
"""
return await asyncio.to_thread(self._scrap, url, html, **kwargs)
def _generate_markdown_content(self, cleaned_html: str,html: str,url: str, success: bool, **kwargs) -> Dict[str, Any]:
"""
Generate markdown content from cleaned HTML.
Args:
cleaned_html (str): The cleaned HTML content.
html (str): The original HTML content.
url (str): The URL of the page.
success (bool): Whether the content was successfully cleaned.
**kwargs: Additional keyword arguments.
Returns:
Dict[str, Any]: A dictionary containing the generated markdown content.
"""
markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
if markdown_generator:
try:
if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
markdown_generator.content_filter = BM25ContentFilter(
user_query=kwargs.get('fit_markdown_user_query', None),
bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
)
markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
cleaned_html=cleaned_html,
base_url=url,
html2text_options=kwargs.get('html2text', {})
)
return {
'markdown': markdown_result.raw_markdown,
'fit_markdown': markdown_result.fit_markdown,
'fit_html': markdown_result.fit_html,
'markdown_v2': markdown_result
}
except Exception as e:
self._log('error',
message="Error using new markdown generation strategy: {error}",
tag="SCRAPE",
params={"error": str(e)}
)
markdown_generator = None
return {
'markdown': f"Error using new markdown generation strategy: {str(e)}",
'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
'markdown_v2': None
}
# Legacy method
"""
# h = CustomHTML2Text()
# h.update_params(**kwargs.get('html2text', {}))
# markdown = h.handle(cleaned_html)
# markdown = markdown.replace(' ```', '```')
# fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
# fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
# if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
# content_filter = kwargs.get('content_filter', None)
# if not content_filter:
# content_filter = BM25ContentFilter(
# user_query=kwargs.get('fit_markdown_user_query', None),
# bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
# )
# fit_html = content_filter.filter_content(html)
# fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
# fit_markdown = h.handle(fit_html)
# markdown_v2 = MarkdownGenerationResult(
# raw_markdown=markdown,
# markdown_with_citations=markdown,
# references_markdown=markdown,
# fit_markdown=fit_markdown
# )
# return {
# 'markdown': markdown,
# 'fit_markdown': fit_markdown,
# 'fit_html': fit_html,
# 'markdown_v2' : markdown_v2
# }
"""
def flatten_nested_elements(self, node):
"""
Flatten nested elements in a HTML tree.
@@ -798,13 +712,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
cleaned_html = str_body.replace('\n\n', '\n').replace(' ', ' ')
# markdown_content = self._generate_markdown_content(
# cleaned_html=cleaned_html,
# html=html,
# url=url,
# success=success,
# **kwargs
# )
return {
# **markdown_content,

View File

@@ -974,8 +974,9 @@ class JsonCssExtractionStrategy(JsonElementExtractionStrategy):
return parsed_html.select(selector)
def _get_elements(self, element, selector: str):
selected = element.select_one(selector)
return [selected] if selected else []
# Return all matching elements using select() instead of select_one()
# This ensures that we get all elements that match the selector, not just the first one
return element.select(selector)
def _get_element_text(self, element) -> str:
return element.get_text(strip=True)
@@ -1049,4 +1050,3 @@ class JsonXPathExtractionStrategy(JsonElementExtractionStrategy):
def _get_element_attribute(self, element, attribute: str):
return element.get(attribute)

View File

@@ -143,41 +143,83 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
Returns:
MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
"""
# Initialize HTML2Text with options
h = CustomHTML2Text()
if html2text_options:
h.update_params(**html2text_options)
elif options:
h.update_params(**options)
elif self.options:
h.update_params(**self.options)
try:
# Initialize HTML2Text with default options for better conversion
h = CustomHTML2Text(baseurl=base_url)
default_options = {
'body_width': 0, # Disable text wrapping
'ignore_emphasis': False,
'ignore_links': False,
'ignore_images': False,
'protect_links': True,
'single_line_break': True,
'mark_code': True,
'escape_snob': False
}
# Update with custom options if provided
if html2text_options:
default_options.update(html2text_options)
elif options:
default_options.update(options)
elif self.options:
default_options.update(self.options)
h.update_params(**default_options)
# Generate raw markdown
raw_markdown = h.handle(cleaned_html)
raw_markdown = raw_markdown.replace(' ```', '```')
# Ensure we have valid input
if not cleaned_html:
cleaned_html = ""
elif not isinstance(cleaned_html, str):
cleaned_html = str(cleaned_html)
# Convert links to citations
markdown_with_citations: str = ""
references_markdown: str = ""
if citations:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
# Generate raw markdown
try:
raw_markdown = h.handle(cleaned_html)
except Exception as e:
raw_markdown = f"Error converting HTML to markdown: {str(e)}"
raw_markdown = raw_markdown.replace(' ```', '```')
# Convert links to citations
markdown_with_citations: str = raw_markdown
references_markdown: str = ""
if citations:
try:
markdown_with_citations, references_markdown = self.convert_links_to_citations(
raw_markdown, base_url
)
except Exception as e:
markdown_with_citations = raw_markdown
references_markdown = f"Error generating citations: {str(e)}"
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
try:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
except Exception as e:
fit_markdown = f"Error generating fit markdown: {str(e)}"
filtered_html = ""
return MarkdownGenerationResult(
raw_markdown=raw_markdown or "",
markdown_with_citations=markdown_with_citations or "",
references_markdown=references_markdown or "",
fit_markdown=fit_markdown or "",
fit_html=filtered_html or "",
)
except Exception as e:
# If anything fails, return empty strings with error message
error_msg = f"Error in markdown generation: {str(e)}"
return MarkdownGenerationResult(
raw_markdown=error_msg,
markdown_with_citations=error_msg,
references_markdown="",
fit_markdown="",
fit_html="",
)
# Generate fit markdown if content filter is provided
fit_markdown: Optional[str] = ""
filtered_html: Optional[str] = ""
if content_filter or self.content_filter:
content_filter = content_filter or self.content_filter
filtered_html = content_filter.filter_content(cleaned_html)
filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
fit_markdown = h.handle(filtered_html)
return MarkdownGenerationResult(
raw_markdown=raw_markdown,
markdown_with_citations=markdown_with_citations,
references_markdown=references_markdown,
fit_markdown=fit_markdown,
fit_html=filtered_html,
)

View File

@@ -21,6 +21,8 @@ import textwrap
import cProfile
import pstats
from functools import wraps
import asyncio
class InvalidCSSSelectorError(Exception):
pass
@@ -1579,6 +1581,25 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
return content_paths
def configure_windows_event_loop():
"""
Configure the Windows event loop to use ProactorEventLoop.
This resolves the NotImplementedError that occurs on Windows when using asyncio subprocesses.
This function should only be called on Windows systems and before any async operations.
On non-Windows systems, this function does nothing.
Example:
```python
from crawl4ai.async_configs import configure_windows_event_loop
# Call this before any async operations if you're on Windows
configure_windows_event_loop()
```
"""
if platform.system() == 'Windows':
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
def get_error_context(exc_info, context_lines: int = 5):
"""
Extract error context with more reliable line number tracking.

View File

@@ -0,0 +1,20 @@
import asyncio
from crawl4ai import *
async def main():
browser_config = BrowserConfig(headless=True, verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
)
)
result = await crawler.arun(
url="https://www.helloworld.org",
config=crawler_config
)
print(result.markdown_v2.raw_markdown[:500])
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -148,7 +148,24 @@ Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might twea
---
## 5. Putting It All Together
## 5. Windows-Specific Configuration
When using AsyncWebCrawler on Windows, you might encounter a `NotImplementedError` related to `asyncio.create_subprocess_exec`. This is a known Windows-specific issue that occurs because Windows' default event loop doesn't support subprocess operations.
To resolve this, Crawl4AI provides a utility function to configure Windows to use the ProactorEventLoop. Call this function before running any async operations:
```python
from crawl4ai.utils import configure_windows_event_loop
# Call this before any async operations if you're on Windows
configure_windows_event_loop()
# Your AsyncWebCrawler code here
```
---
## 6. Putting It All Together
Heres a slightly more in-depth example that shows off a few key config parameters at once:
@@ -193,7 +210,7 @@ if __name__ == "__main__":
---
## 6. Next Steps
## 7. Next Steps
- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).