From 32652189b0862029f3784d0d477ba64a9500d7ea Mon Sep 17 00:00:00 2001 From: aravind Date: Mon, 6 Jan 2025 10:22:51 +0530 Subject: [PATCH 01/15] Docs: Add Code of Conduct for the project (#410) --- CODE_OF_CONDUCT.md | 131 +++++++++++++++++++++++++++++++++++++++++++++ README.md | 1 + 2 files changed, 132 insertions(+) create mode 100644 CODE_OF_CONDUCT.md diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 00000000..31dad7b9 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,131 @@ +# Crawl4AI Code of Conduct + +## Our Pledge + +We as members, contributors, and leaders pledge to make participation in our +community a harassment-free experience for everyone, regardless of age, body +size, visible or invisible disability, ethnicity, sex characteristics, gender +identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, caste, color, religion, or sexual +identity and orientation. + +We pledge to act and interact in ways that contribute to an open, welcoming, +diverse, inclusive, and healthy community. 
+ +## Our Standards + +Examples of behavior that contributes to a positive environment for our +community include: + +* Demonstrating empathy and kindness toward other people +* Being respectful of differing opinions, viewpoints, and experiences +* Giving and gracefully accepting constructive feedback +* Accepting responsibility and apologizing to those affected by our mistakes, + and learning from the experience +* Focusing on what is best not just for us as individuals, but for the overall + community + +Examples of unacceptable behavior include: + +* The use of sexualized language or imagery, and sexual attention or advances of + any kind +* Trolling, insulting or derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or email address, + without their explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Enforcement Responsibilities + +Community leaders are responsible for clarifying and enforcing our standards of +acceptable behavior and will take appropriate and fair corrective action in +response to any behavior that they deem inappropriate, threatening, offensive, +or harmful. + +Community leaders have the right and responsibility to remove, edit, or reject +comments, commits, code, wiki edits, issues, and other contributions that are +not aligned to this Code of Conduct, and will communicate reasons for moderation +decisions when appropriate. + +## Scope + +This Code of Conduct applies within all community spaces, and also applies when +an individual is officially representing the community in public spaces. +Examples of representing our community include using an official email address, +posting via an official social media account, or acting as an appointed +representative at an online or offline event. 
+ +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported to the community leaders responsible for enforcement at +unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly. + +All community leaders are obligated to respect the privacy and security of the +reporter of any incident. + +## Enforcement Guidelines + +Community leaders will follow these Community Impact Guidelines in determining +the consequences for any action they deem in violation of this Code of Conduct: + +### 1. Correction + +**Community Impact**: Use of inappropriate language or other behavior deemed +unprofessional or unwelcome in the community. + +**Consequence**: A private, written warning from community leaders, providing +clarity around the nature of the violation and an explanation of why the +behavior was inappropriate. A public apology may be requested. + +### 2. Warning + +**Community Impact**: A violation through a single incident or series of +actions. + +**Consequence**: A warning with consequences for continued behavior. No +interaction with the people involved, including unsolicited interaction with +those enforcing the Code of Conduct, for a specified period of time. This +includes avoiding interactions in community spaces as well as external channels +like social media. Violating these terms may lead to a temporary or permanent +ban. + +### 3. Temporary Ban + +**Community Impact**: A serious violation of community standards, including +sustained inappropriate behavior. + +**Consequence**: A temporary ban from any sort of interaction or public +communication with the community for a specified period of time. No public or +private interaction with the people involved, including unsolicited interaction +with those enforcing the Code of Conduct, is allowed during this period. +Violating these terms may lead to a permanent ban. + +### 4. 
Permanent Ban + +**Community Impact**: Demonstrating a pattern of violation of community +standards, including sustained inappropriate behavior, harassment of an +individual, or aggression toward or disparagement of classes of individuals. + +**Consequence**: A permanent ban from any sort of public interaction within the +community. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], +version 2.1, available at +[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1]. + +Community Impact Guidelines were inspired by +[Mozilla's code of conduct enforcement ladder][Mozilla CoC]. + +For answers to common questions about this code of conduct, see the FAQ at +[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at +[https://www.contributor-covenant.org/translations][translations]. + +[homepage]: https://www.contributor-covenant.org +[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html +[Mozilla CoC]: https://github.com/mozilla/diversity +[FAQ]: https://www.contributor-covenant.org/faq +[translations]: https://www.contributor-covenant.org/translations diff --git a/README.md b/README.md index af4278ab..482cd065 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,7 @@ [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit) +[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md) From 3427ead8b8854f70aef2b8fd485648ba22623e21 Mon Sep 17 00:00:00 2001 From: UncleCode Date: Mon, 6 Jan 2025 15:13:43 +0800 Subject: [PATCH 02/15] Update CHANGELOG --- CHANGELOG.md | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 
37 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index b654953f..afa841c9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,43 @@ All notable changes to Crawl4AI will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +--- + +## [0.4.247] - 2025-01-06 + +### Added +- **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md)) +- **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py)) + +### Changed +- **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py)) +- **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py)) +- **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py)) +- **Documentation Update**: + - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md)) + +### Removed +- **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`.
([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py)) + +### Fixed +- **Page Closing to Prevent Memory Leaks**: + - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided. + - **Impact**: Prevents memory leaks caused by lingering pages after a crawl. + - **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py) + - **Code**: + ```python + finally: + # If no session_id is given we should close the page + if not config.session_id: + await page.close() + ``` +- **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py)) +- **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py)) + +### Other +- **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. 
([#.gitignore](.gitignore)) + + ## [0.4.24] - 2024-12-31 ### Added From 53be88b6776c077853de4dc629ccae326ad3ce46 Mon Sep 17 00:00:00 2001 From: UncleCode Date: Mon, 6 Jan 2025 15:18:37 +0800 Subject: [PATCH 03/15] Update gitignore --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 6a3b65f0..f022c9ef 100644 --- a/.gitignore +++ b/.gitignore @@ -225,3 +225,4 @@ tree.md .scripts .local .do +plans/ \ No newline at end of file From 12880f1ffad9702aad6adbca3e0f16e391c081ba Mon Sep 17 00:00:00 2001 From: UncleCode Date: Mon, 6 Jan 2025 15:19:01 +0800 Subject: [PATCH 04/15] Update gitignore --- .gitignore | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 7ce3ee0c..943c059c 100644 --- a/.gitignore +++ b/.gitignore @@ -225,4 +225,5 @@ tree.md .scripts .local .do -/plans \ No newline at end of file +/plans +plans/ \ No newline at end of file From 010677cbeeeaa0c2f8dee19e0da7205286c72c2b Mon Sep 17 00:00:00 2001 From: UncleCode Date: Wed, 8 Jan 2025 13:05:00 +0800 Subject: [PATCH 05/15] chore: add .gitattributes file Add initial .gitattributes file to standardize line endings and file handling across different operating systems. This will help prevent issues with line ending inconsistencies between developers working on different platforms. 
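Stepping back to the changelog entry above: the `configure_windows_event_loop` utility it mentions addresses asyncio's inability to spawn subprocesses under Windows' default selector event loop, which is what raises the `NotImplementedError`. A minimal sketch of what such a helper plausibly does — the actual `crawl4ai.utils` implementation is not shown in this patch series, so treat the body as an assumption:

```python
import asyncio
import sys

def configure_windows_event_loop() -> None:
    # On Windows, the default selector-based event loop cannot create
    # subprocesses and raises NotImplementedError; the proactor loop can.
    # Sketch only -- the real crawl4ai helper may differ in detail.
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# Call once at startup, before asyncio.run(...):
configure_windows_event_loop()
```

Calling it before any event loop is created matters, because switching the policy does not affect a loop that is already running.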
--- .gitattributes | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 .gitattributes diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 00000000..144fe136 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,12 @@ +# Documentation +*.html linguist-documentation +docs/* linguist-documentation +docs/examples/* linguist-documentation +docs/md_v2/* linguist-documentation + +# Explicitly mark Python as the main language +*.py linguist-detectable=true +*.py linguist-language=Python + +# Exclude HTML from language statistics +*.html linguist-detectable=false \ No newline at end of file From 26d821c0def3b5c6ceaad1e9c972f0bac326e3f7 Mon Sep 17 00:00:00 2001 From: UncleCode Date: Wed, 8 Jan 2025 13:08:19 +0800 Subject: [PATCH 06/15] Remove .codeiumignore from version control and add to .gitignore --- .codeiumignore | 220 ------------------------------------------------- 1 file changed, 220 deletions(-) delete mode 100644 .codeiumignore diff --git a/.codeiumignore b/.codeiumignore deleted file mode 100644 index 76ff6caa..00000000 --- a/.codeiumignore +++ /dev/null @@ -1,220 +0,0 @@ -# Byte-compiled / optimized / DLL files -__pycache__/ -*.py[cod] -*$py.class - -# C extensions -*.so - -# Distribution / packaging -.Python -build/ -develop-eggs/ -dist/ -downloads/ -eggs/ -.eggs/ -lib/ -lib64/ -parts/ -sdist/ -var/ -wheels/ -share/python-wheels/ -*.egg-info/ -.installed.cfg -*.egg -MANIFEST - -# PyInstaller -# Usually these files are written by a python script from a template -# before PyInstaller builds the exe, so as to inject date/other infos into it. 
-*.manifest -*.spec - -# Installer logs -pip-log.txt -pip-delete-this-directory.txt - -# Unit test / coverage reports -htmlcov/ -.tox/ -.nox/ -.coverage -.coverage.* -.cache -nosetests.xml -coverage.xml -*.cover -*.py,cover -.hypothesis/ -.pytest_cache/ -cover/ - -# Translations -*.mo -*.pot - -# Django stuff: -*.log -local_settings.py -db.sqlite3 -db.sqlite3-journal - -# Flask stuff: -instance/ -.webassets-cache - -# Scrapy stuff: -.scrapy - -# Sphinx documentation -docs/_build/ - -# PyBuilder -.pybuilder/ -target/ - -# Jupyter Notebook -.ipynb_checkpoints - -# IPython -profile_default/ -ipython_config.py - -# pyenv -# For a library or package, you might want to ignore these files since the code is -# intended to run in multiple environments; otherwise, check them in: -# .python-version - -# pipenv -# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. -# However, in case of collaboration, if having platform-specific dependencies or dependencies -# having no cross-platform support, pipenv may install dependencies that don't work, or not -# install all needed dependencies. -#Pipfile.lock - -# poetry -# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. -# This is especially recommended for binary packages to ensure reproducibility, and is more -# commonly ignored for libraries. -# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control -#poetry.lock - -# pdm -# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. -#pdm.lock -# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it -# in version control. -# https://pdm.fming.dev/latest/usage/project/#working-with-version-control -.pdm.toml -.pdm-python -.pdm-build/ - -# PEP 582; used by e.g. 
github.com/David-OConnor/pyflow and github.com/pdm-project/pdm -__pypackages__/ - -# Celery stuff -celerybeat-schedule -celerybeat.pid - -# SageMath parsed files -*.sage.py - -# Environments -.env -.venv -env/ -venv/ -ENV/ -env.bak/ -venv.bak/ - -# Spyder project settings -.spyderproject -.spyproject - -# Rope project settings -.ropeproject - -# mkdocs documentation -/site - -# mypy -.mypy_cache/ -.dmypy.json -dmypy.json - -# Pyre type checker -.pyre/ - -# pytype static type analyzer -.pytype/ - -# Cython debug symbols -cython_debug/ - -# PyCharm -# JetBrains specific template is maintained in a separate JetBrains.gitignore that can -# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore -# and can be added to the global gitignore or merged into this file. For a more nuclear -# option (not recommended) you can uncomment the following to ignore the entire idea folder. -#.idea/ - -Crawl4AI.egg-info/ -Crawl4AI.egg-info/* -crawler_data.db -.vscode/ -.tests/ -.test_pads/ -test_pad.py -test_pad*.py -.data/ -Crawl4AI.egg-info/ - -requirements0.txt -a.txt - -*.sh -.idea -docs/examples/.chainlit/ -docs/examples/.chainlit/* -.chainlit/config.toml -.chainlit/translations/en-US.json - -local/ -.files/ - -a.txt -.lambda_function.py -ec2* - -update_changelog.sh - -.DS_Store -docs/.DS_Store -tmp/ -test_env/ -**/.DS_Store -**/.DS_Store - -todo.md -todo_executor.md -git_changes.py -git_changes.md -pypi_build.sh -git_issues.py -git_issues.md - -.next/ -.tests/ -.docs/ -.gitboss/ -todo_executor.md -protect-all-except-feature.sh -manage-collab.sh -publish.sh -combine.sh -combined_output.txt -tree.md - From ad5e5d21ca265e966ad2a8d9dcd56dd93da0cc5a Mon Sep 17 00:00:00 2001 From: UncleCode Date: Wed, 8 Jan 2025 13:09:23 +0800 Subject: [PATCH 07/15] Remove .codeiumignore from version control and add to .gitignore --- .gitignore | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/.gitignore b/.gitignore index 943c059c..b28b377a 100644 --- 
a/.gitignore +++ b/.gitignore @@ -226,4 +226,7 @@ tree.md .local .do /plans -plans/ \ No newline at end of file +plans/ + +# Codeium +.codeiumignore \ No newline at end of file From f9c601eb7e9007baa068df8d4a21de2da9ae58f0 Mon Sep 17 00:00:00 2001 From: UncleCode Date: Thu, 9 Jan 2025 16:24:41 +0800 Subject: [PATCH 08/15] docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain. Also add todo/ directory to .gitignore. --- .gitignore | 3 ++- README.md | 8 ++++---- docs/examples/quickstart_v0.ipynb | 2 +- docs/md_v2/basic/docker-deploymeny.md | 2 +- docs/md_v2/basic/installation.md | 2 +- 5 files changed, 9 insertions(+), 8 deletions(-) diff --git a/.gitignore b/.gitignore index b28b377a..c7ebf2e4 100644 --- a/.gitignore +++ b/.gitignore @@ -229,4 +229,5 @@ tree.md plans/ # Codeium -.codeiumignore \ No newline at end of file +.codeiumignore +todo/ \ No newline at end of file diff --git a/README.md b/README.md index 482cd065..af55c22d 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant [✨ Check out latest update v0.4.24x](#-recent-updates) -πŸŽ‰ **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes β†’](https://crawl4ai.com/mkdocs/blog) +πŸŽ‰ **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes β†’](https://docs.crawl4ai.com/blog) ## 🧐 Why Crawl4AI? 
@@ -149,7 +149,7 @@ if __name__ == "__main__": ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing) -✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/) +✨ Visit our [Documentation Website](https://docs.crawl4ai.com/) ## Installation πŸ› οΈ @@ -265,7 +265,7 @@ task_id = response.json()["task_id"] result = requests.get(f"http://localhost:11235/task/{task_id}") ``` -For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/). +For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/). @@ -487,7 +487,7 @@ Read the full details of this release in our [0.4.24 Release Notes](https://gith > 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide! -For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/). +For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/). To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md). 
diff --git a/docs/examples/quickstart_v0.ipynb b/docs/examples/quickstart_v0.ipynb index 71f23acb..0282aa12 100644 --- a/docs/examples/quickstart_v0.ipynb +++ b/docs/examples/quickstart_v0.ipynb @@ -702,7 +702,7 @@ "\n", "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n", "\n", - "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n", + "For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n", "\n", "Happy crawling!" ] diff --git a/docs/md_v2/basic/docker-deploymeny.md b/docs/md_v2/basic/docker-deploymeny.md index 31d33e8c..8cbc76c4 100644 --- a/docs/md_v2/basic/docker-deploymeny.md +++ b/docs/md_v2/basic/docker-deploymeny.md @@ -699,4 +699,4 @@ Content-Type: application/json GET /task/{task_id} ``` -For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/). \ No newline at end of file +For more details, visit the [official documentation](https://docs.crawl4ai.com/). \ No newline at end of file diff --git a/docs/md_v2/basic/installation.md b/docs/md_v2/basic/installation.md index de8aeafa..10e312f7 100644 --- a/docs/md_v2/basic/installation.md +++ b/docs/md_v2/basic/installation.md @@ -132,6 +132,6 @@ This script should successfully crawl the example website and print the first 50 ## Getting Help -If you encounter any issues during installation or usage, please check the [documentation](https://crawl4ai.com/mkdocs/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues). 
+If you encounter any issues during installation or usage, please check the [documentation](https://docs.crawl4ai.com/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues). Happy crawling! πŸ•·οΈπŸ€– \ No newline at end of file From 1ab9d115cf5632b229446b727a77850bfcece413 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=C5=8Dnin=20bingi?= <42009541+mcam10@users.noreply.github.com> Date: Mon, 13 Jan 2025 04:23:52 -0800 Subject: [PATCH 09/15] Fixing minor typos in README (#440) @mcam10 Thx for the support. Appreciate --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index af55c22d..7ab02b10 100644 --- a/README.md +++ b/README.md @@ -432,7 +432,7 @@ if __name__ == "__main__":
-πŸ€– Using You own Browswer with Custome User Profile +πŸ€– Using Your own Browser with Custom User Profile ```python import os, sys From 8878b3d032fb21ce3567b34db128bfa64687198a Mon Sep 17 00:00:00 2001 From: devatbosch Date: Mon, 13 Jan 2025 18:27:31 +0530 Subject: [PATCH 10/15] Updated the correct link for "Contribution guidelines" in README.md (#445) Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7ab02b10..dbccf547 100644 --- a/README.md +++ b/README.md @@ -511,7 +511,7 @@ To check our development plans and upcoming features, visit our [Roadmap](https: ## 🀝 Contributing -We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information. +We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.
## πŸ“„ License From 6dfa9cb703231d0bd8a3c9b5e4ce527efae415ef Mon Sep 17 00:00:00 2001 From: Aravind Date: Sun, 19 Jan 2025 14:23:03 +0530 Subject: [PATCH 11/15] Streamline Feature requests, bug reports and Forums with Forms & Templates (#465) * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config: updated new bugs to have needs-triage label by default * Template for PR * Template for PR * Template for PR * Template for PR * Added FR template * Added FR template * Added FR template * Added FR template * Config: updated the text for new labels * config: changed the order of steps to reproduce * Config: shortened the form for feature request * Config: Added a code snippet section to the bug report --- .../DISCUSSION_TEMPLATE/feature-requests.yml | 59 ++++++++ .github/ISSUE_TEMPLATE/bug_report.yml | 127 ++++++++++++++++++ .github/ISSUE_TEMPLATE/config.yml | 8 ++ .github/pull_request_template.md | 19 +++ 4 files changed, 213 insertions(+) create mode 100644 .github/DISCUSSION_TEMPLATE/feature-requests.yml create mode 100644 .github/ISSUE_TEMPLATE/bug_report.yml create mode 100644 .github/ISSUE_TEMPLATE/config.yml create mode 100644 .github/pull_request_template.md diff --git a/.github/DISCUSSION_TEMPLATE/feature-requests.yml b/.github/DISCUSSION_TEMPLATE/feature-requests.yml new file mode 100644 index 00000000..24c21565 --- /dev/null +++ b/.github/DISCUSSION_TEMPLATE/feature-requests.yml @@ -0,0 +1,59 @@ +title: "[Feature Request]: " +labels: ["βš™οΈ New"] +body: + - type: markdown + attributes: + value: | + Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check if already exists in + this discussions category to avoid duplicates. 
😊 + + - type: textarea + id: needs_to_be_done + attributes: + label: What needs to be done? + description: Please describe the feature or functionality you'd like to see. + placeholder: "e.g., Return alt text along with images scraped from webpages in Result" + validations: + required: true + + - type: textarea + id: problem_to_solve + attributes: + label: What problem does this solve? + description: Explain the pain point or issue this feature will help address. + placeholder: "e.g., Bypass CAPTCHAs added by Cloudflare" + validations: + required: true + + - type: textarea + id: target_users + attributes: + label: Target users/beneficiaries + description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.) + placeholder: "e.g., Marketing teams, developers" + validations: + required: false + + - type: textarea + id: current_workarounds + attributes: + label: Current alternatives/workarounds + description: Are there any existing solutions or workarounds? How does this feature improve upon them? + placeholder: "e.g., Users manually select the CSS classes mapped to data fields to extract them" + validations: + required: false + + - type: markdown + attributes: + value: | + ### πŸ’‘ Implementation Ideas + + - type: textarea + id: proposed_approach + attributes: + label: Proposed approach + description: Share any ideas you have for how this feature could be implemented. Point out any challenges you foresee + and the success metrics for this feature. + placeholder: "e.g., Implement a breadth-first traversal algorithm for the scraper" + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 00000000..0ff926be --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,127 @@ +name: Bug Report +description: Report a bug with Crawl4AI.
+title: "[Bug]: " +labels: ["🐞 Bug","🩺 Needs Triage"] +body: + - type: input + id: crawl4ai_version + attributes: + label: crawl4ai version + description: Specify the version of crawl4ai you are using. + placeholder: "e.g., 2.0.0" + validations: + required: true + + - type: textarea + id: expected_behavior + attributes: + label: Expected Behavior + description: Describe what you expected to happen. + placeholder: "Provide a detailed explanation of the expected outcome." + validations: + required: true + + - type: textarea + id: current_behavior + attributes: + label: Current Behavior + description: Describe what is happening instead of the expected behavior. + placeholder: "Describe the actual result or issue you encountered." + validations: + required: true + + - type: dropdown + id: reproducible + attributes: + label: Is this reproducible? + description: Indicate whether this bug can be reproduced consistently. + options: + - "Yes" + - "No" + validations: + required: true + + - type: textarea + id: inputs + attributes: + label: Inputs Causing the Bug + description: Provide details about the inputs causing the issue. + placeholder: | + - URL(s): + - Settings used: + - Input data (if applicable): + render: bash + + - type: textarea + id: steps_to_reproduce + attributes: + label: Steps to Reproduce + description: Provide step-by-step instructions to reproduce the issue. + placeholder: | + 1. Go to... + 2. Click on... + 3. Observe the issue... + render: bash + + - type: textarea + id: code_snippets + attributes: + label: Code snippets + description: Provide code snippets(if any). Add comments as necessary + placeholder: print("Hello world") + render: python + + # Header Section with Title + - type: markdown + attributes: + value: | + ## Supporting Information + Please provide the following details to help us understand and resolve your issue. 
This will assist us in reproducing and diagnosing the problem. + + - type: input + id: os + attributes: + label: OS + description: Please provide the operating system & distro where the issue occurs. + placeholder: "e.g., Windows, macOS, Linux" + validations: + required: true + + - type: input + id: python_version + attributes: + label: Python version + description: Specify the Python version being used. + placeholder: "e.g., 3.8.5" + validations: + required: true + + # Browser Field + - type: input + id: browser + attributes: + label: Browser + description: Provide the name of the browser you are using. + placeholder: "e.g., Chrome, Firefox, Safari" + validations: + required: false + + # Browser Version Field + - type: input + id: browser_version + attributes: + label: Browser version + description: Provide the version of the browser you are using. + placeholder: "e.g., 91.0.4472.124" + validations: + required: false + + # Error Logs Field (Text Area) + - type: textarea + id: error_logs + attributes: + label: Error logs & Screenshots (if applicable) + description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
+ placeholder: "Paste error logs here and attach your screenshots" + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 00000000..5f877d13 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,8 @@ +blank_issues_enabled: false +contact_links: + - name: Feature Requests + url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests + about: "Suggest new features or enhancements for Crawl4AI" + - name: Forums - Q&A + url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a + about: "Ask questions or engage in general discussions about Crawl4AI" diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 00000000..7366dad4 --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,19 @@ +## Summary +Please include a summary of the change and/or which issues are fixed. + +eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR) + +## List of files changed and why +eg: quickstart.py - To update the example as per new changes + +## How Has This Been Tested? +Please describe the tests that you ran to verify your changes. 
+
+## Checklist:
+
+- [ ] My code follows the style guidelines of this project
+- [ ] I have performed a self-review of my own code
+- [ ] I have commented my code, particularly in hard-to-understand areas
+- [ ] I have made corresponding changes to the documentation
+- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
+- [ ] New and existing unit tests pass locally with my changes

From 7b7fe84e0d47fdeeb99e11aa91b5664c5e1c2447 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Wed, 22 Jan 2025 20:52:42 +0800
Subject: [PATCH 12/15] docs(readme): resolve merge conflict and update version info

Resolves merge conflict in README.md by removing outdated version 0.4.24x
information and keeping current version 0.4.3bx details. Updates release
notes description to reflect current features including Memory Dispatcher
System, Streaming Support, and other improvements.

No breaking changes.
---
 README.md | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/README.md b/README.md
index 5cb60452..8987d19d 100644
--- a/README.md
+++ b/README.md
@@ -23,9 +23,6 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 
 [✨ Check out latest update v0.4.3bx](#-recent-updates)
 
-<<<<<<< HEAD
-🎉 **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://docs.crawl4ai.com/blog)
-=======
 🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
From 0afc3e9e5e38b09d0995042ecaa9c77de66842e1 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Thu, 23 Jan 2025 22:37:29 +0800
Subject: [PATCH 13/15] refactor(examples): update API usage in features demo

Update the demo script to use the new crawler.arun_many() API instead of
dispatcher.run_urls() and fix result access patterns. Also improve code
formatting and remove extra whitespace.

- Replace dispatcher.run_urls with crawler.arun_many
- Update streaming demo to use new API and correct result access
- Clean up whitespace and formatting
- Simplify result property access patterns
---
 docs/examples/v0_4_3b2_features_demo.py | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/docs/examples/v0_4_3b2_features_demo.py b/docs/examples/v0_4_3b2_features_demo.py
index 6e091423..7771c3f8 100644
--- a/docs/examples/v0_4_3b2_features_demo.py
+++ b/docs/examples/v0_4_3b2_features_demo.py
@@ -85,17 +85,16 @@ async def demo_memory_dispatcher():
         )
 
         print("\n🚀 Starting batch crawl...")
-        results = await dispatcher.run_urls(
+        results = await crawler.arun_many(
             urls=urls,
-            crawler=crawler,
             config=crawler_config,
+            dispatcher=dispatcher
         )
 
         print(f"\n✅ Completed {len(results)} URLs successfully")
 
     except Exception as e:
         print(f"\n❌ Error in memory dispatcher demo: {str(e)}")
 
-
 async def demo_streaming_support():
     """
     2.
Streaming Support Demo
@@ -115,16 +114,17 @@ async def demo_streaming_support():
     dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)
 
     print("Starting streaming crawl...")
-    async for result in dispatcher.run_urls_stream(
-        urls=urls, crawler=crawler, config=crawler_config
+    async for result in await crawler.arun_many(
+        urls=urls,
+        config=crawler_config,
+        dispatcher=dispatcher
     ):
         # Process each result as it arrives
         print(
-            f"Received result for {result.url} - Success: {result.result.success}"
+            f"Received result for {result.url} - Success: {result.success}"
         )
-        if result.result.success:
-            print(f"Content length: {len(result.result.markdown)}")
-
+        if result.success:
+            print(f"Content length: {len(result.markdown)}")
 
 async def demo_content_scraping():
     """
@@ -138,7 +138,10 @@ async def demo_content_scraping():
     url = "https://example.com/article"
 
     # Configure with the new LXML strategy
-    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy(), verbose=True)
+    config = CrawlerRunConfig(
+        scraping_strategy=LXMLWebScrapingStrategy(),
+        verbose=True
+    )
 
     print("Scraping content with LXML strategy...")
     async with crawler:
@@ -146,7 +149,6 @@ async def demo_content_scraping():
         if result.success:
             print("Successfully scraped content using LXML strategy")
 
-
 async def demo_llm_markdown():
     """
     4. LLM-Powered Markdown Generation Demo
@@ -197,7 +199,6 @@ async def demo_llm_markdown():
             print(result.markdown_v2.fit_markdown[:500])
             print("Successfully generated LLM-filtered markdown")
 
-
 async def demo_robots_compliance():
     """
     5. Robots.txt Compliance Demo
@@ -221,8 +222,6 @@ async def demo_robots_compliance():
         elif result.success:
             print(f"Successfully crawled: {result.url}")
 
-
-
 async def demo_json_schema_generation():
     """
     7.
LLM-Powered Schema Generation Demo
@@ -276,7 +275,6 @@ async def demo_json_schema_generation():
             print(json.dumps(result.extracted_content, indent=2) if result.extracted_content else None)
             print("Successfully used generated schema for crawling")
 
-
 async def demo_proxy_rotation():
     """
     8. Proxy Rotation Demo

From dde14eba7db2de240d7a1dc80f436f5c821571e8 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Sun, 26 Jan 2025 04:00:28 +0100
Subject: [PATCH 14/15] Update README.md (#562)

---
 README.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/README.md b/README.md
index 8987d19d..a9fcdd19 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,6 @@ I made Crawl4AI open-source for two reasons. First, it's my way of giving back
 Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
->>>>>>> vr0.4.3b2
 
 ## 🧐 Why Crawl4AI?

From f00dcc276f7a14599a670f928110421bb9320ac7 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Sun, 26 Jan 2025 04:00:28 +0100
Subject: [PATCH 15/15] Update README.md (#562)

---
 .gitignore | 1 +
 README.md  | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 44b90f82..b5431429 100644
--- a/.gitignore
+++ b/.gitignore
@@ -234,3 +234,4 @@ todo/
 
 # windsurf rules
 .windsurfrules
+.private
\ No newline at end of file

diff --git a/README.md b/README.md
index 8987d19d..a9fcdd19 100644
--- a/README.md
+++ b/README.md
@@ -36,7 +36,6 @@ I made Crawl4AI open-source for two reasons. First, it's my way of giving back
 Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
->>>>>>> vr0.4.3b2
 
 ## 🧐 Why Crawl4AI?
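The refactor in PATCH 13/15 above moves batch crawling from `dispatcher.run_urls(urls, crawler=..., config=...)` to `crawler.arun_many(urls, config=..., dispatcher=...)`, and flattens result access from `result.result.success` to `result.success`. The following is a stdlib-only mock sketching that call-shape change; the class and method names mirror the diff, but the bodies here are illustrative stand-ins, not the real crawl4ai implementation:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CrawlResult:
    """Stand-in for crawl4ai's result object (the real one is richer)."""
    url: str
    success: bool
    markdown: str = ""


class MemoryAdaptiveDispatcher:
    """Stand-in dispatcher: here it only caps concurrency."""
    def __init__(self, max_session_permit: int = 3, check_interval: float = 0.5):
        self.max_session_permit = max_session_permit
        self.check_interval = check_interval


class AsyncWebCrawler:
    async def arun_many(self, urls, config=None, dispatcher=None):
        # Old shape: results = await dispatcher.run_urls(urls, crawler=c, config=cfg)
        # New shape: the crawler owns the entry point; the dispatcher is injected.
        limit = dispatcher.max_session_permit if dispatcher else 1
        sem = asyncio.Semaphore(limit)

        async def crawl(url: str) -> CrawlResult:
            async with sem:
                await asyncio.sleep(0)  # placeholder for real network I/O
                return CrawlResult(url=url, success=True, markdown=f"# {url}")

        # gather() preserves input order, so results line up with urls.
        return await asyncio.gather(*(crawl(u) for u in urls))


async def main():
    crawler = AsyncWebCrawler()
    dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3)
    results = await crawler.arun_many(
        urls=["https://example.com/a", "https://example.com/b"],
        dispatcher=dispatcher,
    )
    # Results are accessed directly (result.success), no result.result indirection.
    for r in results:
        print(f"{r.url} -> success={r.success}")
    return results


results = asyncio.run(main())
```

The point of the new shape is that the crawler is the single entry point for one URL or many, while the dispatcher becomes an optional, swappable concurrency policy passed in as a parameter.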