build: streamline package discovery and bump to v0.4.244

- Replace explicit package listing with setuptools.find
- Include all crawl4ai.* packages automatically
- Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax
- Bump version to 0.4.244

This change simplifies package maintenance by automatically discovering all subpackages under the crawl4ai namespace instead of listing them manually.
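
To illustrate what the `include = ["crawl4ai*"]` pattern selects, here is a minimal stdlib-only sketch. It is not setuptools' implementation: `discover_packages`, `make_demo_tree`, and the subpackage names `components` and `inner` are hypothetical, and the only rule modelled is "a directory with an `__init__.py` whose dotted name matches an include pattern". Real setuptools discovery has more rules (exclusions and namespace packages, for example).

```python
import fnmatch
import os
import tempfile


def discover_packages(where, include):
    """Return dotted names of packages under `where` matching any include pattern.

    Hypothetical, simplified stand-in for setuptools' discovery: a directory
    counts as a package only if it contains an __init__.py file.
    """
    found = []
    for root, dirs, files in os.walk(where):
        rel = os.path.relpath(root, where)
        if rel == ".":
            continue  # the search root itself is not a package
        if "__init__.py" not in files:
            dirs[:] = []  # do not descend into non-package directories
            continue
        name = rel.replace(os.sep, ".")
        # fnmatch's "*" also matches dots, so "crawl4ai*" covers subpackages
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in include):
            found.append(name)
    return sorted(found)


def make_demo_tree(base):
    """Create a throwaway layout with made-up subpackage names."""
    for pkg in ("crawl4ai", "crawl4ai/components", "crawl4ai/components/inner", "tests"):
        path = os.path.join(base, *pkg.split("/"))
        os.makedirs(path, exist_ok=True)
        open(os.path.join(path, "__init__.py"), "w").close()
    os.makedirs(os.path.join(base, "docs"))  # no __init__.py, so never a package


with tempfile.TemporaryDirectory() as demo_root:
    make_demo_tree(demo_root)
    packages = discover_packages(demo_root, ["crawl4ai*"])
    print(packages)
```

In this sketch the top-level `tests` package is left out because its name does not match the pattern, while every nested package under `crawl4ai` is picked up without being listed by hand.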
2025-01-01 17:53:51 +08:00
15 changed files with 411 additions and 395 deletions
--- a/.codeiumignore
+++ b/.codeiumignore
@@ -0,0 +1,220 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+Crawl4AI.egg-info/
+Crawl4AI.egg-info/*
+crawler_data.db
+.vscode/
+.tests/
+.test_pads/
+test_pad.py
+test_pad*.py
+.data/
+Crawl4AI.egg-info/
+
+requirements0.txt
+a.txt
+
+*.sh
+.idea
+docs/examples/.chainlit/
+docs/examples/.chainlit/*
+.chainlit/config.toml
+.chainlit/translations/en-US.json
+
+local/
+.files/
+
+a.txt
+.lambda_function.py
+ec2*
+
+update_changelog.sh
+
+.DS_Store
+docs/.DS_Store
+tmp/
+test_env/
+**/.DS_Store
+**/.DS_Store
+
+todo.md
+todo_executor.md
+git_changes.py
+git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.next/
+.tests/
+.docs/
+.gitboss/
+todo_executor.md
+protect-all-except-feature.sh
+manage-collab.sh
+publish.sh
+combine.sh
+combined_output.txt
+tree.md
+
--- a/.gitignore
+++ b/.gitignore
@@ -225,5 +225,3 @@ tree.md
 .scripts
 .local
 .do
-/plans
-plans/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,43 +5,6 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---
-
-## [0.4.267] - 2025 - 01 - 06
-
-### Added
- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
-
-### Changed
- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
- **Documentation Update**: 
-  - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
-
-### Removed
- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
-
-### Fixed
- **Page Closing to Prevent Memory Leaks**:
-  - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
-  - **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
-  - **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
-  - **Code**:
-    ```python
-    finally:
-        # If no session_id is given we should close the page
-        if not config.session_id:
-            await page.close()
-    ```
- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
-
-### Other
- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
-
-
 ## [0.4.24] - 2024-12-31

 ### Added
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,131 +0,0 @@
-# Crawl4AI Code of Conduct
-
-## Our Pledge
-
-We as members, contributors, and leaders pledge to make participation in our
-community a harassment-free experience for everyone, regardless of age, body
-size, visible or invisible disability, ethnicity, sex characteristics, gender
-identity and expression, level of experience, education, socio-economic status,
-nationality, personal appearance, race, caste, color, religion, or sexual
-identity and orientation.
-
-We pledge to act and interact in ways that contribute to an open, welcoming,
-diverse, inclusive, and healthy community.
-
-## Our Standards
-
-Examples of behavior that contributes to a positive environment for our
-community include:
-
-* Demonstrating empathy and kindness toward other people
-* Being respectful of differing opinions, viewpoints, and experiences
-* Giving and gracefully accepting constructive feedback
-* Accepting responsibility and apologizing to those affected by our mistakes,
-  and learning from the experience
-* Focusing on what is best not just for us as individuals, but for the overall
-  community
-
-Examples of unacceptable behavior include:
-
-* The use of sexualized language or imagery, and sexual attention or advances of
-  any kind
-* Trolling, insulting or derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others' private information, such as a physical or email address,
-  without their explicit permission
-* Other conduct which could reasonably be considered inappropriate in a
-  professional setting
-
-## Enforcement Responsibilities
-
-Community leaders are responsible for clarifying and enforcing our standards of
-acceptable behavior and will take appropriate and fair corrective action in
-response to any behavior that they deem inappropriate, threatening, offensive,
-or harmful.
-
-Community leaders have the right and responsibility to remove, edit, or reject
-comments, commits, code, wiki edits, issues, and other contributions that are
-not aligned to this Code of Conduct, and will communicate reasons for moderation
-decisions when appropriate.
-
-## Scope
-
-This Code of Conduct applies within all community spaces, and also applies when
-an individual is officially representing the community in public spaces.
-Examples of representing our community include using an official email address,
-posting via an official social media account, or acting as an appointed
-representative at an online or offline event.
-
-## Enforcement
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported to the community leaders responsible for enforcement at
-unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
-
-All community leaders are obligated to respect the privacy and security of the
-reporter of any incident.
-
-## Enforcement Guidelines
-
-Community leaders will follow these Community Impact Guidelines in determining
-the consequences for any action they deem in violation of this Code of Conduct:
-
-### 1. Correction
-
-**Community Impact**: Use of inappropriate language or other behavior deemed
-unprofessional or unwelcome in the community.
-
-**Consequence**: A private, written warning from community leaders, providing
-clarity around the nature of the violation and an explanation of why the
-behavior was inappropriate. A public apology may be requested.
-
-### 2. Warning
-
-**Community Impact**: A violation through a single incident or series of
-actions.
-
-**Consequence**: A warning with consequences for continued behavior. No
-interaction with the people involved, including unsolicited interaction with
-those enforcing the Code of Conduct, for a specified period of time. This
-includes avoiding interactions in community spaces as well as external channels
-like social media. Violating these terms may lead to a temporary or permanent
-ban.
-
-### 3. Temporary Ban
-
-**Community Impact**: A serious violation of community standards, including
-sustained inappropriate behavior.
-
-**Consequence**: A temporary ban from any sort of interaction or public
-communication with the community for a specified period of time. No public or
-private interaction with the people involved, including unsolicited interaction
-with those enforcing the Code of Conduct, is allowed during this period.
-Violating these terms may lead to a permanent ban.
-
-### 4. Permanent Ban
-
-**Community Impact**: Demonstrating a pattern of violation of community
-standards, including sustained inappropriate behavior, harassment of an
-individual, or aggression toward or disparagement of classes of individuals.
-
-**Consequence**: A permanent ban from any sort of public interaction within the
-community.
-
-## Attribution
-
-This Code of Conduct is adapted from the [Contributor Covenant][homepage],
-version 2.1, available at
-[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
-
-Community Impact Guidelines were inspired by
-[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
-
-For answers to common questions about this code of conduct, see the FAQ at
-[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
-[https://www.contributor-covenant.org/translations][translations].
-
-[homepage]: https://www.contributor-covenant.org
-[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
-[Mozilla CoC]: https://github.com/mozilla/diversity
-[FAQ]: https://www.contributor-covenant.org/faq
-[translations]: https://www.contributor-covenant.org/translations
--- a/README.md
+++ b/README.md
@@ -11,19 +11,18 @@
 [![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
 [![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)

-<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
+[![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/)
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
-[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)

 </div>

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  

-[✨ Check out latest update v0.4.24x](#-recent-updates)
+[✨ Check out latest update v0.4.24](#-recent-updates)

-🎉 **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
+🎉 **Version 0.4.24 is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)

 ## 🧐 Why Crawl4AI?

@@ -39,7 +38,7 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 1. Install Crawl4AI:
 ```bash
 # Install the package
-pip install -U crawl4ai
+pip install crawl4ai

 # Run post-installation setup
 crawl4ai-setup
--- a/crawl4ai/version.py
+++ b/crawl4ai/version.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.247"
+__version__ = "0.4.244"
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -35,9 +35,7 @@ class BrowserConfig:
        user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
                                     temporary directory may be used. Default: None.
        chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
-        channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
+                              is "chromium". Default: "chrome".
        proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
                             Default: None.
        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
@@ -79,8 +77,7 @@ class BrowserConfig:
        use_managed_browser: bool = False,
        use_persistent_context: bool = False,
        user_data_dir: str = None,
-        chrome_channel: str = "chromium",
-        channel: str = "chromium",
+        chrome_channel: str = "chrome",
        proxy: str = None,
        proxy_config: dict = None,
        viewport_width: int = 1080,
@@ -110,8 +107,14 @@ class BrowserConfig:
        self.use_managed_browser = use_managed_browser
        self.use_persistent_context = use_persistent_context
        self.user_data_dir = user_data_dir
-        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
-        self.channel = channel or self.browser_type or "chromium"
+        if self.browser_type == "chromium":
+            self.chrome_channel = "chrome"
+        elif self.browser_type == "firefox":
+            self.chrome_channel = "firefox"
+        elif self.browser_type == "webkit":
+            self.chrome_channel = "webkit"
+        else:
+            self.chrome_channel = chrome_channel or "chrome"
        self.proxy = proxy
        self.proxy_config = proxy_config
        self.viewport_width = viewport_width
@@ -158,8 +161,7 @@ class BrowserConfig:
            use_managed_browser=kwargs.get("use_managed_browser", False),
            use_persistent_context=kwargs.get("use_persistent_context", False),
            user_data_dir=kwargs.get("user_data_dir"),
-            chrome_channel=kwargs.get("chrome_channel", "chromium"),
-            channel=kwargs.get("channel", "chromium"),
+            chrome_channel=kwargs.get("chrome_channel", "chrome"),
            proxy=kwargs.get("proxy"),
            proxy_config=kwargs.get("proxy_config"),
            viewport_width=kwargs.get("viewport_width", 1080),
@@ -246,7 +248,7 @@ class CrawlerRunConfig:
        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
                                Default: None.
        wait_for_images (bool): If True, wait for images to load before extracting content.
-                                Default: False.
+                                Default: True.
        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
                                          Default: 0.1.
        mean_delay (float): Mean base delay between requests when calling arun_many.
@@ -345,7 +347,7 @@ class CrawlerRunConfig:
        wait_until: str = "domcontentloaded",
        page_timeout: int = PAGE_TIMEOUT,
        wait_for: str = None,
-        wait_for_images: bool = False,
+        wait_for_images: bool = True,
        delay_before_return_html: float = 0.1,
        mean_delay: float = 0.1,
        max_range: float = 0.3,
@@ -503,7 +505,7 @@ class CrawlerRunConfig:
            wait_until=kwargs.get("wait_until", "domcontentloaded"),
            page_timeout=kwargs.get("page_timeout", 60000),
            wait_for=kwargs.get("wait_for"),
-            wait_for_images=kwargs.get("wait_for_images", False),
+            wait_for_images=kwargs.get("wait_for_images", True),
            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
            mean_delay=kwargs.get("mean_delay", 0.1),
            max_range=kwargs.get("max_range", 0.3),
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -1475,13 +1475,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

        except Exception as e:
            raise e
-        
-        finally:
-            # If no session_id is given we should close the page
-            if not config.session_id:
-                await page.close()

-    async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
+    async def _handle_full_page_scan(self, page: Page, scroll_delay: float):
        """
        Helper method to handle full page scanning. 
        
@@ -1505,7 +1500,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            current_position = viewport_height

            # await page.evaluate(f"window.scrollTo(0, {current_position})")
-            await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
+            await self.safe_scroll(page, 0, current_position)
            # await self.csp_scroll_to(page, 0, current_position)
            # await asyncio.sleep(scroll_delay)

@@ -1515,7 +1510,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            
            while current_position < total_height:
                current_position = min(current_position + viewport_height, total_height)
-                await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
+                await self.safe_scroll(page, 0, current_position)
                # await page.evaluate(f"window.scrollTo(0, {current_position})")
                # await asyncio.sleep(scroll_delay)

@@ -1644,9 +1639,11 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        Returns:
            str: The base64-encoded screenshot data
        """
-        need_scroll = await self.page_need_scroll(page)
-        
-        if not need_scroll:
+        dimensions = await self.get_page_dimensions(page)
+        page_height = dimensions['height']        
+        if page_height < kwargs.get(
+            "screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
+        ):
            # Page is short enough, just take a screenshot
            return await self.take_screenshot_naive(page)
        else:
@@ -2069,7 +2066,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            }
        """)       
        
-    async def safe_scroll(self, page: Page, x: int, y: int, delay: float = 0.1):
+    async def safe_scroll(self, page: Page, x: int, y: int):
        """
        Safely scroll the page with rendering time.
        
@@ -2080,7 +2077,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        """
        result = await self.csp_scroll_to(page, x, y)
        if result['success']:
-            await page.wait_for_timeout(delay * 1000)
+            await page.wait_for_timeout(100)  # Allow for rendering
        return result
            
    async def csp_scroll_to(self, page: Page, x: int, y: int) -> Dict[str, Any]:
@@ -2161,31 +2158,4 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                const {scrollWidth, scrollHeight} = document.documentElement;
                return {width: scrollWidth, height: scrollHeight};
            }
-        """)
-    
-    async def page_need_scroll(self, page: Page) -> bool:
-        """
-        Determine whether the page need to scroll
-        
-        Args:
-            page: Playwright page object
-            
-        Returns:
-            bool: True if page needs scrolling
-        """
-        try:
-            need_scroll = await page.evaluate("""
-            () => {
-                const scrollHeight = document.documentElement.scrollHeight;
-                const viewportHeight = window.innerHeight;
-                return scrollHeight > viewportHeight;
-            }
-            """)
-            return need_scroll
-        except Exception as e:
-            self.logger.warning(
-                message="Failed to check scroll need: {error}. Defaulting to True for safety.",
-                tag="SCROLL",
-                params={"error": str(e)}
-            )
-            return True  # Default to scrolling if check fails
+        """)
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -418,30 +418,34 @@ class AsyncWebCrawler:
                            **kwargs
                        )

-                        crawl_result.status_code = async_response.status_code
-                        crawl_result.response_headers = async_response.response_headers
-                        crawl_result.downloaded_files = async_response.downloaded_files
-                        crawl_result.ssl_certificate = async_response.ssl_certificate  # Add SSL certificate
+                    #     crawl_result.status_code = async_response.status_code
+                    #     crawl_result.response_headers = async_response.response_headers
+                    #     crawl_result.downloaded_files = async_response.downloaded_files
+                    #     crawl_result.ssl_certificate = async_response.ssl_certificate  # Add SSL certificate
+                    # else:
+                    #     crawl_result.status_code = 200
+                    #     crawl_result.response_headers = cached_result.response_headers if cached_result else {}
+                    #     crawl_result.ssl_certificate = cached_result.ssl_certificate if cached_result else None  # Add SSL certificate from cache

                        # # Check and set values from async_response to crawl_result
-                        # try:
-                        #     for key in vars(async_response):
-                        #         if hasattr(crawl_result, key):
-                        #             value = getattr(async_response, key, None)
-                        #             current_value = getattr(crawl_result, key, None)
-                        #             if value is not None and not current_value:
-                        #                 try:
-                        #                     setattr(crawl_result, key, value)
-                        #                 except Exception as e:
-                        #                     self.logger.warning(
-                        #                         message=f"Failed to set attribute {key}: {str(e)}",
-                        #                         tag="WARNING"
-                        #                     )
-                        # except Exception as e:
-                        #     self.logger.warning(
-                        #         message=f"Error copying response attributes: {str(e)}",
-                        #         tag="WARNING"
-                        #     )
+                        try:
+                            for key in vars(async_response):
+                                if hasattr(crawl_result, key):
+                                    value = getattr(async_response, key, None)
+                                    current_value = getattr(crawl_result, key, None)
+                                    if value is not None and not current_value:
+                                        try:
+                                            setattr(crawl_result, key, value)
+                                        except Exception as e:
+                                            self.logger.warning(
+                                                message=f"Failed to set attribute {key}: {str(e)}",
+                                                tag="WARNING"
+                                            )
+                        except Exception as e:
+                            self.logger.warning(
+                                message=f"Error copying response attributes: {str(e)}",
+                                tag="WARNING"
+                            )

                        crawl_result.success = bool(html)
                        crawl_result.session_id = getattr(config, 'session_id', None)
@@ -581,10 +585,8 @@ class AsyncWebCrawler:

            # Markdown Generation
            markdown_generator: Optional[MarkdownGenerationStrategy] = config.markdown_generator or DefaultMarkdownGenerator()
-            
-            # Uncomment if by default we want to use PruningContentFilter
-            # if not config.content_filter and not markdown_generator.content_filter:
-            #     markdown_generator.content_filter = PruningContentFilter()
+            if not config.content_filter and not markdown_generator.content_filter:
+                markdown_generator.content_filter = PruningContentFilter()
            
            markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
                cleaned_html=cleaned_html,
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
@@ -122,6 +122,92 @@ class WebScrapingStrategy(ContentScrapingStrategy):
        """
        return await asyncio.to_thread(self._scrap, url, html, **kwargs)

+    def _generate_markdown_content(self, cleaned_html: str,html: str,url: str, success: bool, **kwargs) -> Dict[str, Any]:
+        """
+        Generate markdown content from cleaned HTML.
+
+        Args:
+            cleaned_html (str): The cleaned HTML content.
+            html (str): The original HTML content.
+            url (str): The URL of the page.
+            success (bool): Whether the content was successfully cleaned.
+            **kwargs: Additional keyword arguments.
+
+        Returns:
+            Dict[str, Any]: A dictionary containing the generated markdown content.
+        """
+        markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
+        
+        if markdown_generator:
+            try:
+                if kwargs.get('fit_markdown', False) and not markdown_generator.content_filter:
+                        markdown_generator.content_filter = BM25ContentFilter(
+                            user_query=kwargs.get('fit_markdown_user_query', None),
+                            bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
+                        )
+                
+                markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
+                    cleaned_html=cleaned_html,
+                    base_url=url,
+                    html2text_options=kwargs.get('html2text', {})
+                )
+                
+                return {
+                    'markdown': markdown_result.raw_markdown,  
+                    'fit_markdown': markdown_result.fit_markdown,
+                    'fit_html': markdown_result.fit_html, 
+                    'markdown_v2': markdown_result
+                }
+            except Exception as e:
+                self._log('error',
+                    message="Error using new markdown generation strategy: {error}",
+                    tag="SCRAPE",
+                    params={"error": str(e)}
+                )
+                markdown_generator = None
+                return {
+                    'markdown': f"Error using new markdown generation strategy: {str(e)}",
+                    'fit_markdown': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
+                    'fit_html': "Set flag 'fit_markdown' to True to get cleaned HTML content.",
+                    'markdown_v2': None                    
+                }
+
+        # Legacy method
+        """
+        # h = CustomHTML2Text()
+        # h.update_params(**kwargs.get('html2text', {}))            
+        # markdown = h.handle(cleaned_html)
+        # markdown = markdown.replace('    ```', '```')
+        
+        # fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
+        # fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
+        
+        # if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
+        #     content_filter = kwargs.get('content_filter', None)
+        #     if not content_filter:
+        #         content_filter = BM25ContentFilter(
+        #             user_query=kwargs.get('fit_markdown_user_query', None),
+        #             bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
+        #         )
+        #     fit_html = content_filter.filter_content(html)
+        #     fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
+        #     fit_markdown = h.handle(fit_html)
+
+        # markdown_v2 = MarkdownGenerationResult(
+        #     raw_markdown=markdown,
+        #     markdown_with_citations=markdown,
+        #     references_markdown=markdown,
+        #     fit_markdown=fit_markdown
+        # )
+        
+        # return {
+        #     'markdown': markdown,
+        #     'fit_markdown': fit_markdown,
+        #     'fit_html': fit_html,
+        #     'markdown_v2' : markdown_v2
+        # }
+        """
+
    def flatten_nested_elements(self, node):
        """
        Flatten nested elements in a HTML tree.
@@ -712,6 +798,13 @@ class WebScrapingStrategy(ContentScrapingStrategy):

        cleaned_html = str_body.replace('\n\n', '\n').replace('  ', ' ')

+        # markdown_content = self._generate_markdown_content(
+        #     cleaned_html=cleaned_html,
+        #     html=html,
+        #     url=url,
+        #     success=success,
+        #     **kwargs
+        # )
        
        return {
            # **markdown_content,
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -974,9 +974,8 @@ class JsonCssExtractionStrategy(JsonElementExtractionStrategy):
        return parsed_html.select(selector)

    def _get_elements(self, element, selector: str):
-        # Return all matching elements using select() instead of select_one()
-        # This ensures that we get all elements that match the selector, not just the first one
-        return element.select(selector)
+        selected = element.select_one(selector)
+        return [selected] if selected else []

    def _get_element_text(self, element) -> str:
        return element.get_text(strip=True)
@@ -1050,3 +1049,4 @@ class JsonXPathExtractionStrategy(JsonElementExtractionStrategy):

    def _get_element_attribute(self, element, attribute: str):
        return element.get(attribute)
+ 
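Review note: the `_get_elements` hunk above swaps `select()` for `select_one()`, so each selector now yields at most one element instead of every match. The sketch below only illustrates that difference. It uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup (`findall` and `find` play the roles of `select` and `select_one`); the helper names and the sample markup are hypothetical and are not part of crawl4ai.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; ElementTree stands in for BeautifulSoup here.
DOC = ET.fromstring("<ul><li>alpha</li><li>beta</li><li>gamma</li></ul>")


def get_elements_all(element, path):
    """Analogue of the removed behaviour: every match is returned."""
    return element.findall(path)


def get_elements_first(element, path):
    """Analogue of the new behaviour: first match wrapped in a list, else []."""
    selected = element.find(path)
    # ElementTree elements without children have historically tested as
    # falsy, so compare against None explicitly instead of using truthiness.
    return [selected] if selected is not None else []


all_items = [e.text for e in get_elements_all(DOC, "li")]
first_item = [e.text for e in get_elements_first(DOC, "li")]
missing = get_elements_first(DOC, "p")

print(all_items)
print(first_item)
print(missing)
```

Schemas that expected a list of several matches from one selector would, under this reading, now receive a single-item list, which may be worth checking against existing extraction schemas.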
--- a/crawl4ai/markdown_generation_strategy.py
+++ b/crawl4ai/markdown_generation_strategy.py
@@ -143,83 +143,41 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
        Returns:
            MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
        """
-        try:
-            # Initialize HTML2Text with default options for better conversion
-            h = CustomHTML2Text(baseurl=base_url)
-            default_options = {
-                'body_width': 0,  # Disable text wrapping
-                'ignore_emphasis': False,
-                'ignore_links': False,
-                'ignore_images': False,
-                'protect_links': True,
-                'single_line_break': True,
-                'mark_code': True,
-                'escape_snob': False
-            }
-            
-            # Update with custom options if provided
-            if html2text_options:
-                default_options.update(html2text_options)
-            elif options:
-                default_options.update(options)
-            elif self.options:
-                default_options.update(self.options)
-            
-            h.update_params(**default_options)
+        # Initialize HTML2Text with options
+        h = CustomHTML2Text()
+        if html2text_options:
+            h.update_params(**html2text_options)
+        elif options:
+            h.update_params(**options)
+        elif self.options:
+            h.update_params(**self.options)

-            # Ensure we have valid input
-            if not cleaned_html:
-                cleaned_html = ""
-            elif not isinstance(cleaned_html, str):
-                cleaned_html = str(cleaned_html)
+        # Generate raw markdown
+        raw_markdown = h.handle(cleaned_html)
+        raw_markdown = raw_markdown.replace('    ```', '```')

-            # Generate raw markdown
-            try:
-                raw_markdown = h.handle(cleaned_html)
-            except Exception as e:
-                raw_markdown = f"Error converting HTML to markdown: {str(e)}"
-            
-            raw_markdown = raw_markdown.replace('    ```', '```')
-
-            # Convert links to citations
-            markdown_with_citations: str = raw_markdown
-            references_markdown: str = ""
-            if citations:
-                try:
-                    markdown_with_citations, references_markdown = self.convert_links_to_citations(
-                        raw_markdown, base_url
-                    )
-                except Exception as e:
-                    markdown_with_citations = raw_markdown
-                    references_markdown = f"Error generating citations: {str(e)}"
-
-            # Generate fit markdown if content filter is provided
-            fit_markdown: Optional[str] = ""
-            filtered_html: Optional[str] = ""
-            if content_filter or self.content_filter:
-                try:
-                    content_filter = content_filter or self.content_filter
-                    filtered_html = content_filter.filter_content(cleaned_html)
-                    filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
-                    fit_markdown = h.handle(filtered_html)
-                except Exception as e:
-                    fit_markdown = f"Error generating fit markdown: {str(e)}"
-                    filtered_html = ""
-
-            return MarkdownGenerationResult(
-                raw_markdown=raw_markdown or "",
-                markdown_with_citations=markdown_with_citations or "",
-                references_markdown=references_markdown or "",
-                fit_markdown=fit_markdown or "",
-                fit_html=filtered_html or "",
-            )
-        except Exception as e:
-            # If anything fails, return empty strings with error message
-            error_msg = f"Error in markdown generation: {str(e)}"
-            return MarkdownGenerationResult(
-                raw_markdown=error_msg,
-                markdown_with_citations=error_msg,
-                references_markdown="",
-                fit_markdown="",
-                fit_html="",
+        # Convert links to citations
+        markdown_with_citations: str = ""
+        references_markdown: str = ""
+        if citations:
+            markdown_with_citations, references_markdown = self.convert_links_to_citations(
+                raw_markdown, base_url
            )
+
+        # Generate fit markdown if content filter is provided
+        fit_markdown: Optional[str] = ""
+        filtered_html: Optional[str] = ""
+        if content_filter or self.content_filter:
+            content_filter = content_filter or self.content_filter
+            filtered_html = content_filter.filter_content(cleaned_html)
+            filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
+            fit_markdown = h.handle(filtered_html)
+
+        return MarkdownGenerationResult(
+            raw_markdown=raw_markdown,
+            markdown_with_citations=markdown_with_citations,
+            references_markdown=references_markdown,
+            fit_markdown=fit_markdown,
+            fit_html=filtered_html,
+        )
+
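Review note: the rewritten `generate_markdown` no longer merges caller options into a defaults dictionary. It applies only the first non-empty source among `html2text_options`, `options`, and `self.options`. The sketch below compares the two resolutions on plain dictionaries. `OLD_DEFAULTS` is copied from the removed lines; `first_non_empty`, `resolve_old`, and `resolve_new` are hypothetical helpers written for illustration, and nothing here calls `CustomHTML2Text` itself.

```python
from typing import Any, Dict, Optional

# Defaults copied from the removed lines of the hunk above.
OLD_DEFAULTS: Dict[str, Any] = {
    "body_width": 0,
    "ignore_emphasis": False,
    "ignore_links": False,
    "ignore_images": False,
    "protect_links": True,
    "single_line_break": True,
    "mark_code": True,
    "escape_snob": False,
}

Options = Optional[Dict[str, Any]]


def first_non_empty(*sources: Options) -> Dict[str, Any]:
    """Return a copy of the first truthy mapping, mirroring the if/elif chain."""
    for source in sources:
        if source:
            return dict(source)
    return {}


def resolve_old(html2text_options: Options = None, options: Options = None,
                instance_options: Options = None) -> Dict[str, Any]:
    """Old behaviour: start from the defaults, then overlay one source."""
    merged = dict(OLD_DEFAULTS)
    merged.update(first_non_empty(html2text_options, options, instance_options))
    return merged


def resolve_new(html2text_options: Options = None, options: Options = None,
                instance_options: Options = None) -> Dict[str, Any]:
    """New behaviour: only the chosen source is passed to update_params."""
    return first_non_empty(html2text_options, options, instance_options)


old = resolve_old(html2text_options={"ignore_links": True}, options={"body_width": 80})
new = resolve_new(html2text_options={"ignore_links": True}, options={"body_width": 80})

print(old["ignore_links"], old["body_width"], old["mark_code"])
print(new)
```

In both versions the lower-priority `options` value is ignored once `html2text_options` is non-empty. The difference is that keys such as `mark_code` or `protect_links` are no longer set explicitly, so they fall back to whatever `CustomHTML2Text` uses on its own.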
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -21,8 +21,6 @@ import textwrap
 import cProfile
 import pstats
 from functools import wraps
-import asyncio
-

 class InvalidCSSSelectorError(Exception):
    pass
@@ -1581,25 +1579,6 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
        
    return content_paths

-def configure_windows_event_loop():
-    """
-    Configure the Windows event loop to use ProactorEventLoop.
-    This resolves the NotImplementedError that occurs on Windows when using asyncio subprocesses.
-    
-    This function should only be called on Windows systems and before any async operations.
-    On non-Windows systems, this function does nothing.
-    
-    Example:
-        ```python
-        from crawl4ai.async_configs import configure_windows_event_loop
-        
-        # Call this before any async operations if you're on Windows
-        configure_windows_event_loop()
-        ```
-    """
-    if platform.system() == 'Windows':
-        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
-
 def get_error_context(exc_info, context_lines: int = 5):
    """
    Extract error context with more reliable line number tracking.
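Review note: callers that imported the removed `configure_windows_event_loop` helper can reproduce it locally with the standard library. The sketch below mirrors the removed body under a few assumptions: the function name `use_proactor_event_loop` is hypothetical, it checks `sys.platform` rather than `platform.system()`, and the boolean return value is an addition for illustration (the removed helper returned `None`).

```python
import asyncio
import sys


def use_proactor_event_loop() -> bool:
    """Switch to the Proactor policy on Windows; do nothing elsewhere.

    Returns True only if the event loop policy was actually changed.
    """
    if not sys.platform.startswith("win"):
        return False
    # The policy class is only defined on Windows builds of Python,
    # so look it up defensively instead of referencing it directly.
    policy_cls = getattr(asyncio, "WindowsProactorEventLoopPolicy", None)
    if policy_cls is None:
        return False
    asyncio.set_event_loop_policy(policy_cls())
    return True


async def _demo() -> str:
    # Placeholder coroutine standing in for real crawler code.
    await asyncio.sleep(0)
    return "ok"


# Call the helper before any event loop is created.
changed = use_proactor_event_loop()
result = asyncio.run(_demo())
print(changed, result)
```

Since Python 3.8 the Proactor loop is already the default on Windows, so a shim like this should mainly matter for older interpreters or for code that switched to a selector policy earlier.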
--- a/docs/examples/hello_world.py
+++ b/docs/examples/hello_world.py
@@ -1,20 +0,0 @@
-import asyncio
-from crawl4ai import *
-
-async def main():
-    browser_config = BrowserConfig(headless=True, verbose=True)
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler_config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
-            )
-        )
-        result = await crawler.arun(
-            url="https://www.helloworld.org",
-            config=crawler_config
-        )
-        print(result.markdown_v2.raw_markdown[:500])
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/md_v3/tutorials/async-webcrawler-basics.md
+++ b/docs/md_v3/tutorials/async-webcrawler-basics.md
@@ -148,24 +148,7 @@ Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might twea

 ---

-## 5. Windows-Specific Configuration
-
-When using AsyncWebCrawler on Windows, you might encounter a `NotImplementedError` related to `asyncio.create_subprocess_exec`. This is a known Windows-specific issue that occurs because Windows' default event loop doesn't support subprocess operations.
-
-To resolve this, Crawl4AI provides a utility function to configure Windows to use the ProactorEventLoop. Call this function before running any async operations:
-
-```python
-from crawl4ai.utils import configure_windows_event_loop
-
-# Call this before any async operations if you're on Windows
-configure_windows_event_loop()
-
-# Your AsyncWebCrawler code here
-```
-
---
-
-## 6. Putting It All Together
+## 5. Putting It All Together

 Here’s a slightly more in-depth example that shows off a few key config parameters at once:

@@ -210,7 +193,7 @@ if __name__ == "__main__":

 ---

-## 7. Next Steps
+## 6. Next Steps

 - **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
 - **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).