Fixes#1553
The code imports from `playwright_stealth` module (e.g., StealthConfig,
stealth_async, stealth_sync with capital S) which is only available in
the `playwright-stealth` package, not `tf-playwright-stealth`.
Changed dependency from:
- tf-playwright-stealth>=1.1.0 (wrong package, exports lowercase names)
to:
- playwright-stealth>=2.0.0 (correct package, exports StealthConfig)
This fixes the mismatch between declared dependency and actual imports
in the codebase.
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing
Fix: Wrong URL variable used for extraction of raw html
* Fix#1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
* Refactor Pydantic model configuration to use ConfigDict for arbitrary types
* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
* fix: ensure BrowserConfig.to_dict serializes proxy_config
* feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
through LLMExtractionStrategy, LLMContentFilter, table extraction, and
Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
and document them in the md_v2 guides
* reproduced AttributeError from #1642
* pass timeout parameter to docker client request
* added missing deep crawling objects to init
* generalized query in ContentRelevanceFilter to be a str or list
* import modules from enhanceable deserialization
* parameterized tests
* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
* announcement: add application form for cloud API closed beta
* Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
* docs: add section for Crawl4AI Cloud API closed beta with application link
* fix: add disk cleanup step to Docker workflow
---------
Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
- Updates pyOpenSSL from >=24.3.0 to >=25.3.0
- This resolves CVE affecting cryptography package versions >=37.0.0 & <43.0.1
- pyOpenSSL 25.3.0 requires cryptography>=45.0.7, which is above the vulnerable range
- Fixes issue #1545🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fixes bug reported in issue #1405
[Bug]: Excluded selector (excluded_selector) doesn't work
This commit reintroduces the cssselect library which was removed by PR (https://github.com/unclecode/crawl4ai/pull/1368) and merged via (437395e490).
Integration tested against 0.7.4 Docker container. Reintroducing cssselector package eliminated errors seen in logs and excluded_selector functionality was restored.
Refs: #1405
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.
The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.
The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.
Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature
BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.
Refs: #123, #456
* fix: Update export of URLPatternFilter
* chore: Add dependancy for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* Docs: update the example for memory dispatcher
* docs: updated example for crawl strategies
* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.
* chore: removed cchardet from dependancy list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependancies
* feat:Make PyPDF2 a conditional dependancy
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing.
LXML mode offers 10-20x better performance for large HTML documents.
Key changes:
- Added ScrapingMode enum with BEAUTIFULSOUP and LXML options
- Implemented LXMLWebScrapingStrategy class
- Added LXML-based metadata extraction
- Updated documentation with scraping mode usage and performance considerations
- Added cssselect dependency
BREAKING CHANGE: None
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.
BREAKING CHANGE: Documentation structure has been significantly reorganized
- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
- Added a post-installation setup script for initialization.
- Updated README with installation notes for Playwright setup.
- Enhanced migration logging for better error visibility.
- Added 'pydantic' to requirements.
- Bumped version to 0.3.746.
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
- Implemented new async crawler strategy using Playwright.
- Introduced ManagedBrowser for better browser management.
- Added support for persistent browser sessions and improved error handling.
- Updated version from 0.3.73 to 0.3.731.
- Enhanced logic in main.py for conditional mounting of static files.
- Updated requirements to replace playwright_stealth with tf-playwright-stealth.
According to #102 the requirements specified are minimum version. Currently they are defined as fixed versions in requirements.txt and setup.py leading to projects consuming this package are limited to using exactly these requirements instead of a more flexible range. This PR addresses this.
This commit updates the Dockerfile to install Crawl4AI with the specified options. The `INSTALL_OPTION` build argument is used to determine which additional packages to install. If the option is set to "all", all models will be downloaded. If the option is set to "torch", only torch models will be downloaded. If the option is set to "transformer", only transformer models will be downloaded. If no option is specified, the default installation will be used. This change improves the flexibility and customization of the Crawl4AI installation process.
Set the `image_size` variable to 0 in the `get_content_of_website_optimized` function to temporarily disable fetching the image file size. This change addresses performance issues and will be improved in a future update.
Update Dockerfile for linuz users