feat(core): release version 0.5.0 with deep crawling and CLI
This major release adds deep crawling capabilities, a memory-adaptive dispatcher, multiple crawling strategies, Docker deployment, and a new CLI. It also includes significant improvements to proxy handling, PDF processing, and LLM integration.

BREAKING CHANGES:
- Add memory-adaptive dispatcher as default for arun_many()
- Move max_depth to CrawlerRunConfig
- Replace ScrapingMode enum with strategy pattern
- Update BrowserContext API
- Make model fields optional with defaults
- Remove content_filter parameter from CrawlerRunConfig
- Remove synchronous WebCrawler and old CLI
- Update Docker deployment configuration
- Replace FastFilterChain with FilterChain
- Change license to Apache 2.0 with attribution clause
109	CHANGELOG.md
@@ -5,10 +5,109 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## Version 0.5.0 (2025-02-21)

### Added

- *(crawler)* [**breaking**] Add memory-adaptive dispatcher with rate limiting
- *(scraping)* [**breaking**] Add LXML-based scraping mode for improved performance
- *(content-filter)* Add LLMContentFilter for intelligent markdown generation
- *(dispatcher)* [**breaking**] Add streaming support for URL processing
- *(browser)* [**breaking**] Improve browser context management and add shared data support
- *(config)* [**breaking**] Add streaming support and config cloning
- *(crawler)* Add URL redirection tracking
- *(extraction)* Add LLM-powered schema generation utility
- *(proxy)* Add proxy configuration support to CrawlerRunConfig
- *(robots)* Add robots.txt compliance support
- *(release)* [**breaking**] Prepare v0.4.3 beta release
- *(proxy)* Add proxy rotation support and documentation
- *(browser)* Add CDP URL configuration support
- *(demo)* Uncomment feature demos and add fake-useragent dependency
- *(pdf)* Add PDF processing capabilities
- *(crawler)* [**breaking**] Enhance JavaScript execution and PDF processing
- *(docker)* Add Docker deployment configuration and API server
- *(docker)* Add Docker service integration and config serialization
- *(docker)* [**breaking**] Enhance Docker deployment setup and configuration
- *(api)* Improve cache handling and add API tests
- *(crawler)* [**breaking**] Add deep crawling capabilities with BFS strategy
- *(proxy)* [**breaking**] Add proxy rotation strategy
- *(deep-crawling)* Add DFS strategy and update exports; refactor CLI entry point
- *(cli)* Add command line interface with comprehensive features
- *(config)* Enhance serialization and add deep crawling exports
- *(crawler)* Add HTTP crawler strategy for lightweight web scraping
- *(docker)* [**breaking**] Implement supervisor and secure API endpoints
- *(docker)* [**breaking**] Add JWT authentication and improve server architecture

### Changed

- *(browser)* Update browser channel default to 'chromium' in BrowserConfig.from_args method
- *(crawler)* Optimize response handling and default settings
- *(crawler)* Update hello_world example with proper content filtering
- Update hello_world.py example
- *(docs)* [**breaking**] Reorganize documentation structure and update styles
- *(dispatcher)* [**breaking**] Migrate to modular dispatcher system with enhanced monitoring
- *(scraping)* [**breaking**] Replace ScrapingMode enum with strategy pattern
- *(browser)* Improve browser path management
- *(models)* Rename final_url to redirected_url for consistency
- *(core)* [**breaking**] Improve type hints and remove unused file
- *(docs)* Improve code formatting in features demo
- *(user-agent)* Improve user agent generation system
- *(core)* [**breaking**] Reorganize project structure and remove legacy code
- *(docker)* Clean up import statements in server.py
- *(docker)* Remove unused models and utilities for cleaner codebase
- *(docker)* [**breaking**] Improve server architecture and configuration
- *(deep-crawl)* [**breaking**] Reorganize deep crawling functionality into dedicated module
- *(deep-crawling)* [**breaking**] Reorganize deep crawling strategies and add new implementations
- *(crawling)* [**breaking**] Improve type hints and code cleanup
- *(crawler)* [**breaking**] Improve HTML handling and cleanup codebase
- *(crawler)* [**breaking**] Remove content filter functionality
- *(examples)* Update API usage in features demo
- *(config)* [**breaking**] Enhance serialization and config handling

### Documentation

- Add Code of Conduct for the project (#410)
- *(extraction)* Add clarifying comments for CSS selector behavior
- *(readme)* Update personal story and project vision
- *(urls)* [**breaking**] Update documentation URLs to new domain
- *(api)* Add streaming mode documentation and examples
- *(readme)* Update version and feature announcements for v0.4.3b1
- *(examples)* Update demo scripts and fix output formats
- *(examples)* Update v0.4.3 features demo to v0.4.3b2
- *(readme)* Update version references and fix links
- *(multi-url)* [**breaking**] Improve documentation clarity and update examples
- *(examples)* Update proxy rotation demo and disable other demos
- *(api)* Improve formatting and readability of API documentation
- *(examples)* Add SERP API project example
- *(urls)* Update documentation URLs to new domain
- *(readme)* Resolve merge conflict and update version info

### Fixed

- *(browser)* Update default browser channel to chromium and simplify channel selection logic
- *(browser)* [**breaking**] Default to Chromium channel for new headless mode (#387)
- *(browser)* Resolve merge conflicts in browser channel configuration
- Prevent memory leaks by ensuring proper closure of Playwright pages
- Fix broken long-page screenshot capture (#403)
- *(extraction)* JsonCss selector and crawler improvements
- *(models)* [**breaking**] Make model fields optional with default values
- *(dispatcher)* Adjust memory threshold and fix dispatcher initialization
- *(install)* Ensure proper exit after running doctor command

### Miscellaneous Tasks

- *(cleanup)* Remove unused files and improve type hints
- Add .gitattributes file

## License Update

Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution. See the updated `LICENSE` file for the full legal text and specific requirements.

---

## Version 0.4.3b2 (2025-01-21)

@@ -286,12 +385,6 @@ This release introduces several powerful new features, including robots.txt comp

- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
## [0.3.75] December 1, 2024

### PruningContentFilter
20	LICENSE
@@ -48,4 +48,22 @@ You may add Your own copyright statement to Your modifications and may provide a
9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

---

Attribution Requirement

All distributions, publications, or public uses of this software, or derivative works based on this software, must include the following attribution:

"This product includes software developed by UncleCode (https://x.com/unclecode) as part of the Crawl4AI project (https://github.com/unclecode/crawl4ai)."

This attribution must be displayed in a prominent and easily accessible location, such as:

- For software distributions: In a NOTICE file, README file, or equivalent documentation.
- For publications (research papers, articles, blog posts): In the acknowledgments section or a footnote.
- For websites/web applications: In an "About" or "Credits" section.
- For command-line tools: In the help/usage output.

This requirement ensures proper credit is given for the use of Crawl4AI and helps promote the project.

---
78	README.md
@@ -574,9 +574,83 @@ To check our development plans and upcoming features, visit our [Roadmap](https:

We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.

## 📄 License

Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).

## 📄 License & Attribution

This project is licensed under the Apache License 2.0 with a required attribution clause. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.

### Attribution Requirements

When using Crawl4AI, you must include one of the following attribution methods:

#### 1. Badge Attribution (Recommended)

Add one of these badges to your README, documentation, or website:

| Theme | Badge |
|-------|-------|
| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |

HTML code for adding the badges:

```html
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
```

#### 2. Text Attribution

Add this line to your documentation:

```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```

## 📚 Citation

If you use Crawl4AI in your research or project, please cite:

```bibtex
@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}
```

Text citation format:

```
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
```

## 📧 Contact
24	cliff.toml (new file)
@@ -0,0 +1,24 @@

[changelog]
# Template format
header = """
# Changelog\n
All notable changes to this project will be documented in this file.\n
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).\n
"""

# Organize commits by type
[git]
conventional_commits = true
filter_unconventional = true
commit_parsers = [
  { message = "^feat", group = "Added" },
  { message = "^fix", group = "Fixed" },
  { message = "^doc", group = "Documentation" },
  { message = "^perf", group = "Performance" },
  { message = "^refactor", group = "Changed" },
  { message = "^style", group = "Changed" },
  { message = "^test", group = "Testing" },
  { message = "^chore\\(release\\): prepare for", skip = true },
  { message = "^chore", group = "Miscellaneous Tasks" },
]
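The parser list above is order-sensitive: git-cliff assigns each commit to the first pattern that matches, which is why the `chore\(release\)` skip rule must come before the generic `chore` rule. A small self-contained sketch of that first-match behavior (plain Python, illustrative only, not part of git-cliff):

```python
import re

# Mirrors the cliff.toml parser order: first match wins, so the
# release-prep skip rule shadows the generic "chore" rule.
PARSERS = [
    (r"^feat", "Added"),
    (r"^fix", "Fixed"),
    (r"^doc", "Documentation"),
    (r"^perf", "Performance"),
    (r"^refactor", "Changed"),
    (r"^style", "Changed"),
    (r"^test", "Testing"),
    (r"^chore\(release\): prepare for", None),  # skip = true
    (r"^chore", "Miscellaneous Tasks"),
]

def classify(message: str):
    """Return the changelog group for a commit message, or None to skip it."""
    for pattern, group in PARSERS:
        if re.match(pattern, message):
            return group
    return None  # unconventional commits are filtered out

print(classify("feat(crawler): add deep crawling"))     # Added
print(classify("chore(release): prepare for 0.5.0"))    # None (skipped)
print(classify("chore(cleanup): remove unused files"))  # Miscellaneous Tasks
```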
25	docs/assets/powered-by-dark.svg (new file)
@@ -0,0 +1,25 @@

<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
  <!-- Dark Theme -->
  <g>
    <defs>
      <pattern id="halftoneDark" width="4" height="4" patternUnits="userSpaceOnUse">
        <circle cx="2" cy="2" r="1" fill="#eee" opacity="0.1"/>
      </pattern>
      <pattern id="halftoneTextDark" width="3" height="3" patternUnits="userSpaceOnUse">
        <circle cx="1.5" cy="1.5" r="2" fill="#aaa" opacity="0.2"/>
      </pattern>
    </defs>
    <!-- White border - added as outer rectangle -->
    <rect width="120" height="35" rx="5" fill="#111"/>
    <!-- Dark background slightly smaller to show thicker border -->
    <rect x="2" y="2" width="116" height="31" rx="4" fill="#1a1a1a"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#halftoneDark)"/>

    <!-- Logo with halftone -->
    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#eee" stroke-width="2"/>
    <path d="M18 17.5 L27 17.5" stroke="#eee" stroke-width="2"/>
    <circle cx="22.5" cy="17.5" r="2" fill="#eee"/>

    <text x="40" y="23" fill="#eee" font-family="Arial, sans-serif" font-weight="500" font-size="14">Crawl4AI</text>
  </g>
</svg>
64	docs/assets/powered-by-disco.svg (new file)
@@ -0,0 +1,64 @@

<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
  <g>
    <defs>
      <pattern id="cyberdots" width="4" height="4" patternUnits="userSpaceOnUse">
        <circle cx="2" cy="2" r="1">
          <animate attributeName="fill"
                   values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
                   dur="6s"
                   repeatCount="indefinite"/>
          <animate attributeName="opacity"
                   values="0.2;0.4;0.2"
                   dur="4s"
                   repeatCount="indefinite"/>
        </circle>
      </pattern>
      <filter id="neonGlow" x="-20%" y="-20%" width="140%" height="140%">
        <feGaussianBlur stdDeviation="1" result="blur"/>
        <feFlood flood-color="#FF2EC4" flood-opacity="0.2">
          <animate attributeName="flood-color"
                   values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
                   dur="8s"
                   repeatCount="indefinite"/>
        </feFlood>
        <feComposite in2="blur" operator="in"/>
        <feMerge>
          <feMergeNode/>
          <feMergeNode in="SourceGraphic"/>
        </feMerge>
      </filter>
    </defs>

    <rect width="120" height="35" rx="5" fill="#0A0A0F"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="#16161E"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#cyberdots)"/>

    <!-- Logo with animated neon -->
    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)">
      <animate attributeName="stroke"
               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
               dur="8s"
               repeatCount="indefinite"/>
    </path>
    <path d="M18 17.5 L27 17.5" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)">
      <animate attributeName="stroke"
               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
               dur="8s"
               repeatCount="indefinite"/>
    </path>
    <circle cx="22.5" cy="17.5" r="2" fill="#0BC5EA">
      <animate attributeName="fill"
               values="#0BC5EA;#FF2EC4;#8B5CF6;#0BC5EA"
               dur="8s"
               repeatCount="indefinite"/>
    </circle>

    <text x="40" y="23" font-family="Arial, sans-serif" font-weight="500" font-size="14" filter="url(#neonGlow)">
      <animate attributeName="fill"
               values="#FF2EC4;#8B5CF6;#0BC5EA;#FF2EC4"
               dur="8s"
               repeatCount="indefinite"/>
      Crawl4AI
    </text>
  </g>
</svg>
21	docs/assets/powered-by-light.svg (new file)
@@ -0,0 +1,21 @@

<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
  <g>
    <defs>
      <pattern id="halftoneLight" width="4" height="4" patternUnits="userSpaceOnUse">
        <circle cx="2" cy="2" r="1" fill="#111" opacity="0.1"/>
      </pattern>
    </defs>
    <!-- Dark border -->
    <rect width="120" height="35" rx="5" fill="#DDD"/>
    <!-- Light background -->
    <rect x="2" y="2" width="116" height="31" rx="4" fill="#fff"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#halftoneLight)"/>

    <!-- Logo -->
    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#111" stroke-width="2"/>
    <path d="M18 17.5 L27 17.5" stroke="#111" stroke-width="2"/>
    <circle cx="22.5" cy="17.5" r="2" fill="#111"/>

    <text x="40" y="23" fill="#111" font-family="Arial, sans-serif" font-weight="500" font-size="14">Crawl4AI</text>
  </g>
</svg>
28	docs/assets/powered-by-night.svg (new file)
@@ -0,0 +1,28 @@

<svg xmlns="http://www.w3.org/2000/svg" width="120" height="35" viewBox="0 0 120 35">
  <g>
    <defs>
      <pattern id="halftoneDark" width="4" height="4" patternUnits="userSpaceOnUse">
        <circle cx="2" cy="2" r="1" fill="#8B5CF6" opacity="0.1"/>
      </pattern>
      <filter id="neonGlow" x="-20%" y="-20%" width="140%" height="140%">
        <feGaussianBlur stdDeviation="1" result="blur"/>
        <feFlood flood-color="#8B5CF6" flood-opacity="0.2"/>
        <feComposite in2="blur" operator="in"/>
        <feMerge>
          <feMergeNode/>
          <feMergeNode in="SourceGraphic"/>
        </feMerge>
      </filter>
    </defs>
    <rect width="120" height="35" rx="5" fill="#0A0A0F"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="#16161E"/>
    <rect x="2" y="2" width="116" height="31" rx="4" fill="url(#halftoneDark)"/>

    <!-- Logo with neon glow -->
    <path d="M30 17.5 a7.5 7.5 0 1 1 -15 0 a7.5 7.5 0 1 1 15 0" fill="none" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)"/>
    <path d="M18 17.5 L27 17.5" stroke="#8B5CF6" stroke-width="2" filter="url(#neonGlow)"/>
    <circle cx="22.5" cy="17.5" r="2" fill="#8B5CF6"/>

    <text x="40" y="23" fill="#fff" font-family="Arial, sans-serif" font-weight="500" font-size="14" filter="url(#neonGlow)">Crawl4AI</text>
  </g>
</svg>
@@ -4,6 +4,67 @@ Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical

## Latest Release


### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)

My dear friends and crawlers, there you go: Crawl4AI v0.5.0 is here! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:

**Major New Features:**

* **Deep Crawling:** Explore entire websites with configurable strategies (BFS, DFS, Best-First). Define custom filters and URL scoring for targeted crawls.
* **Memory-Adaptive Dispatcher:** Handle large-scale crawls with ease! Our new dispatcher dynamically adjusts concurrency based on available memory and includes built-in rate limiting.
* **Multiple Crawler Strategies:** Choose between the full-featured Playwright browser-based crawler or a new, *much* faster HTTP-only crawler for simpler tasks.
* **Docker Deployment:** Deploy Crawl4AI as a scalable, self-contained service with built-in API endpoints and optional JWT authentication.
* **Command-Line Interface (CLI):** Interact with Crawl4AI directly from your terminal. Crawl, configure, and extract data with simple commands.
* **LLM Configuration (`LlmConfig`):** A new, unified way to configure LLM providers (OpenAI, Anthropic, Ollama, etc.) for extraction, filtering, and schema generation. Simplifies API key management and switching between models.
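The `LlmConfig` change is essentially the parameter-object pattern: one object carries `provider`, `api_token`, and `base_url` instead of each strategy accepting them separately. A rough self-contained sketch of the idea (class and field names mirror the release notes; the real implementation lives in `crawl4ai` and differs in detail):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LlmConfig:
    # Fields mirror the parameters the release notes say were previously
    # passed individually to LLMExtractionStrategy and LLMContentFilter.
    provider: str
    api_token: Optional[str] = None
    base_url: Optional[str] = None

class ExtractionStrategy:
    """Stand-in for an LLM-backed strategy that now takes one config object."""
    def __init__(self, llm_config: LlmConfig):
        self.llm_config = llm_config

    def describe(self) -> str:
        return f"extracting via {self.llm_config.provider}"

config = LlmConfig(provider="ollama/llama3", base_url="http://localhost:11434")
print(ExtractionStrategy(config).describe())  # extracting via ollama/llama3
```

Switching providers then means swapping one object rather than editing several keyword arguments at every call site.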

**Minor Updates & Improvements:**

* **LXML Scraping Mode:** Faster HTML parsing with `LXMLWebScrapingStrategy`.
* **Proxy Rotation:** Added `ProxyRotationStrategy` with a `RoundRobinProxyStrategy` implementation.
* **PDF Processing:** Extract text, images, and metadata from PDF files.
* **URL Redirection Tracking:** Automatically follows and records redirects.
* **Robots.txt Compliance:** Optionally respect website crawling rules.
* **LLM-Powered Schema Generation:** Automatically create extraction schemas using an LLM.
* **`LLMContentFilter`:** Generate high-quality, focused markdown using an LLM.
* **Improved Error Handling & Stability:** Numerous bug fixes and performance enhancements.
* **Enhanced Documentation:** Updated guides and examples.
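Round-robin rotation simply hands each request the next proxy in a fixed list, wrapping around at the end. Conceptually (this sketch is independent of the actual `RoundRobinProxyStrategy` class and invents its own names):

```python
from itertools import cycle

class RoundRobinRotation:
    """Hand each request the next proxy in the list, wrapping around."""
    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)  # infinite iterator over the fixed list

    def next_proxy(self) -> str:
        return next(self._pool)

rotation = RoundRobinRotation(["http://p1:8080", "http://p2:8080"])
print([rotation.next_proxy() for _ in range(3)])
# ['http://p1:8080', 'http://p2:8080', 'http://p1:8080']
```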

**Breaking Changes & Migration:**

This release includes several breaking changes to improve the library's structure and consistency. Here's what you need to know:

* **`arun_many()` Behavior:** Now uses the `MemoryAdaptiveDispatcher` by default. The return type depends on the `stream` parameter in `CrawlerRunConfig`. Adjust code that relied on unbounded concurrency.
* **`max_depth` Location:** Moved to `CrawlerRunConfig` and now controls *crawl depth*.
* **Deep Crawling Imports:** Import `DeepCrawlStrategy` and related classes from `crawl4ai.deep_crawling`.
* **`BrowserContext` API:** Updated; the old `get_context` method is deprecated.
* **Optional Model Fields:** Many data model fields are now optional. Handle potential `None` values.
* **`ScrapingMode` Enum:** Replaced with strategy pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
* **`content_filter` Parameter:** Removed from `CrawlerRunConfig`. Use extraction strategies or markdown generators with filters.
* **Removed Functionality:** The synchronous `WebCrawler`, the old CLI, and docs management tools have been removed.
* **Docker:** Significant changes to deployment. See the [Docker documentation](../deploy/docker/README.md).
* **`ssl_certificate.json`:** This file has been removed.
* **Config:** `FastFilterChain` has been replaced with `FilterChain`.
* **Deep Crawl:** `DeepCrawlStrategy.arun` now returns `Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]`.
* **Proxy:** Removed synchronous `WebCrawler` support and related rate-limiting configurations.
* **LLM Parameters:** Use the new `LlmConfig` object instead of passing `provider`, `api_token`, `base_url`, and `api_base` directly to `LLMExtractionStrategy` and `LLMContentFilter`.

**In short:** Update imports, adjust `arun_many()` usage, check for optional fields, and review the Docker deployment guide.
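The `stream`-dependent return type noted above means callers must branch on how they consume results. A toy model of that dual contract in plain asyncio (not the crawl4ai implementation, just the shape of the API):

```python
import asyncio

async def run_many(urls, stream=False):
    """Toy model: stream=True returns an async generator that yields
    results as they finish; stream=False gathers everything into a list."""
    async def fetch(url):
        await asyncio.sleep(0)  # stand-in for real crawling work
        return f"result:{url}"

    if stream:
        async def generator():
            for url in urls:
                yield await fetch(url)
        return generator()  # consume with: async for item in await run_many(...)
    return [await fetch(url) for url in urls]  # plain list

async def main():
    batch = await run_many(["a", "b"])  # list mode
    print(batch)
    async for item in await run_many(["a", "b"], stream=True):  # streaming mode
        print(item)

asyncio.run(main())
```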

## License Change

Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution. See the updated `LICENSE` file for the full legal text and specific requirements.

**Get Started:**

* **Installation:** `pip install "crawl4ai[all]"` (or use the Docker image)
* **Documentation:** [https://docs.crawl4ai.com](https://docs.crawl4ai.com)
* **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)

I'm very excited to see what you build with Crawl4AI v0.5.0!

---

### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
*December 12, 2024*

331	docs/md_v2/blog/releases/0.5.0.md (new file)
@@ -0,0 +1,331 @@

# Crawl4AI v0.5.0 Release Notes

**Release Theme: Power, Flexibility, and Scalability**

Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability. Key improvements include a new **deep crawling** system, a **memory-adaptive dispatcher** for handling large-scale crawls, **multiple crawling strategies** (including a fast HTTP-only crawler), **Docker** deployment options, and a powerful **command-line interface (CLI)**. This release also includes numerous bug fixes, performance optimizations, and documentation updates.

**Important Note:** This release contains several **breaking changes**. Please review the "Breaking Changes" section carefully and update your code accordingly.

## Key Features

### 1. Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the initial URLs. This is controlled by the `deep_crawl_strategy` parameter in `CrawlerRunConfig`. Several strategies are available:

* **`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores the website level by level. (Default)
* **`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each branch as deeply as possible before backtracking.
* **`BestFirstCrawlingStrategy`:** Uses a scoring function to prioritize which URLs to crawl next.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.deep_crawling import DomainFilter, ContentTypeFilter, FilterChain

# Configure a deep crawl with BFS, limiting to a specific domain and content type.
filter_chain = FilterChain(
    filters=[
        DomainFilter(allowed_domains=["example.com"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ]
)
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=5, filter_chain=filter_chain),
    stream=True  # Process results as they arrive
)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun(url="https://example.com", config=deep_crawl_config):
        print(f"Crawled: {result.url} (Depth: {result.metadata['depth']})")
```

**Breaking Change:** The `max_depth` parameter is now part of `CrawlerRunConfig` and controls the *depth* of the crawl, not the number of concurrent crawls. The `arun()` and `arun_many()` methods are now decorated to handle deep crawling strategies, and the import paths for deep crawling strategies have changed. See the [Deep Crawling documentation](../deep_crawling/README.md) for more details.
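The practical difference between the BFS and DFS strategies above is the frontier discipline: a queue visits pages level by level, a stack dives down one branch first. A conceptual, pure-Python sketch (the link graph and `crawl_order` helper are illustrative only, not part of the crawl4ai API):

```python
from collections import deque

# A made-up link graph: page -> pages it links to
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl_order(start, depth_first=False):
    """Return the visit order for a BFS (queue) or DFS (stack) frontier."""
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        # DFS pops the newest URL (stack); BFS pops the oldest (queue)
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in LINKS[url]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("/"))                    # BFS: level by level
print(crawl_order("/", depth_first=True))  # DFS: one branch at a time
```

`BestFirstCrawlingStrategy` generalizes this by replacing the queue/stack with a priority queue ordered by a scoring function.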
### 2. Memory-Adaptive Dispatcher

The new `MemoryAdaptiveDispatcher` dynamically adjusts concurrency based on available system memory and includes built-in rate limiting. This prevents out-of-memory errors and avoids overwhelming target websites.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

# Configure the dispatcher (optional; defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5              # Check memory every 0.5 seconds
)

urls = ["https://example.com/1", "https://example.com/2"]

async with AsyncWebCrawler() as crawler:
    # Batch mode: collect all results, then return them
    results = await crawler.arun_many(
        urls=urls,
        config=CrawlerRunConfig(stream=False),
        dispatcher=dispatcher
    )

    # OR, for streaming: process each result as it completes
    async for result in await crawler.arun_many(
        urls=urls,
        config=CrawlerRunConfig(stream=True),
        dispatcher=dispatcher
    ):
        print(result.url)
```

**Breaking Change:** `AsyncWebCrawler.arun_many()` now uses `MemoryAdaptiveDispatcher` by default. Existing code that relied on unbounded concurrency may require adjustments.

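The adaptive idea itself is easy to sketch: throttle the number of tasks dispatched as memory usage approaches the threshold, and pause entirely above it. The following is a conceptual, pure-Python sketch with simulated memory readings, not the library's implementation:

```python
def adaptive_batch_sizes(memory_readings, max_concurrency=8, threshold=80.0):
    """For each simulated memory reading (percent used), decide how many
    tasks to dispatch: scale concurrency with the remaining headroom,
    and dispatch nothing once usage reaches the threshold."""
    sizes = []
    for pct in memory_readings:
        if pct >= threshold:
            sizes.append(0)  # pause: memory pressure too high
        else:
            headroom = (threshold - pct) / threshold
            sizes.append(max(1, int(max_concurrency * headroom)))
    return sizes

# Memory climbs past the threshold, then recovers
print(adaptive_batch_sizes([20.0, 60.0, 85.0, 70.0]))
```

The real dispatcher samples actual process memory on each `check_interval` tick and also applies per-domain rate limits.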
### 3. Multiple Crawling Strategies (Playwright and HTTP)

Crawl4AI now offers two crawling strategies:

* **`AsyncPlaywrightCrawlerStrategy` (Default):** Uses Playwright for browser-based crawling, supporting JavaScript rendering and complex interactions.
* **`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and memory-efficient HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is unnecessary.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True
)

async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)) as crawler:
    result = await crawler.arun("https://example.com")
    print(f"Status code: {result.status_code}")
    print(f"Content length: {len(result.html)}")
```

### 4. Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a consistent and isolated environment. The Docker image includes a FastAPI server with both streaming and non-streaming endpoints.

```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```

**API Endpoints:**

* `/crawl` (POST): Non-streaming crawl.
* `/crawl/stream` (POST): Streaming crawl (NDJSON).
* `/health` (GET): Health check.
* `/schema` (GET): Returns configuration schemas.
* `/md/{url}` (GET): Returns the markdown content of the URL.
* `/llm/{url}` (GET): Returns LLM-extracted content.
* `/token` (POST): Gets a JWT token.

**Breaking Changes:**

* Docker deployment now requires a `.llm.env` file for API keys.
* Docker deployment now requires Redis and a new `config.yml` structure.
* Server startup now uses `supervisord` instead of direct process management.
* The Docker server now requires authentication by default (JWT tokens).

See the [Docker deployment documentation](../deploy/docker/README.md) for detailed instructions.

### 5. Command-Line Interface (CLI)

A new CLI (`crwl`) provides convenient access to Crawl4AI's functionality from the terminal.

```bash
# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example
```

See the [CLI documentation](../docs/md_v2/core/cli.md) for more details.

### 6. LXML Scraping Mode

Added `LXMLWebScrapingStrategy` for faster HTML parsing using the `lxml` library. This can significantly improve scraping performance, especially for large or complex pages. Set `scraping_strategy=LXMLWebScrapingStrategy()` in your `CrawlerRunConfig`.

**Breaking Change:** The `ScrapingMode` enum has been replaced with a strategy pattern. Use `WebScrapingStrategy` (the default) or `LXMLWebScrapingStrategy`.
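A minimal configuration sketch (the top-level import path for `LXMLWebScrapingStrategy` is assumed here; adjust to your installed version if it differs):

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy

config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
```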

### 7. Proxy Rotation

Added a `ProxyRotationStrategy` abstract base class with a concrete `RoundRobinProxyStrategy` implementation.

```python
import os

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    RoundRobinProxyStrategy,
)

def load_proxies_from_env():
    # Minimal helper: read a comma-separated list of proxy URLs
    # from the PROXIES environment variable.
    return [p.strip() for p in os.getenv("PROXIES", "").split(",") if p.strip()]

# Load proxies and create the rotation strategy
proxies = load_proxies_from_env()
if not proxies:
    raise SystemExit("No proxies found in environment. Set the PROXIES env variable!")

proxy_strategy = RoundRobinProxyStrategy(proxies)

# Create configs
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    proxy_rotation_strategy=proxy_strategy
)
```

## Other Changes and Improvements

* **Added: `LLMContentFilter` for intelligent markdown generation.** This new filter uses an LLM to create more focused and relevant markdown output.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llmConfig=llm_config, instruction="Extract key concepts and summaries")
)

config = CrawlerRunConfig(markdown_generator=markdown_generator)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com/article", config=config)
    print(result.markdown)  # Output will be filtered and formatted by the LLM
```

* **Added: URL redirection tracking.** The crawler now automatically follows HTTP redirects (301, 302, 307, 308) and records the final URL in the `redirected_url` field of the `CrawlResult` object. This is automatic; no code changes are required.

* **Added: LLM-powered schema generation utility.** A new `generate_schema` method has been added to `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. This greatly simplifies creating extraction schemas.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'><h2>Product Name</h2><span class='price'>$99</span></div>",
    llmConfig=llm_config,
    query="Extract product name and price"
)
print(schema)
# Expected output (may vary slightly depending on the LLM):
# {
#     "name": "ProductExtractor",
#     "baseSelector": "div.product",
#     "fields": [
#         {"name": "name", "selector": "h2", "type": "text"},
#         {"name": "price", "selector": ".price", "type": "text"}
#     ]
# }
```

* **Added: robots.txt compliance support.** The crawler can now respect `robots.txt` rules. Enable this by setting `check_robots_txt=True` in `CrawlerRunConfig`:

```python
config = CrawlerRunConfig(check_robots_txt=True)
```

* **Added: PDF processing capabilities.** Crawl4AI can now extract text, images, and metadata from PDF files (both local and remote), using the new `PDFCrawlerStrategy` and `PDFContentScrapingStrategy`.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
    result = await crawler.arun(
        "https://example.com/document.pdf",
        config=CrawlerRunConfig(
            scraping_strategy=PDFContentScrapingStrategy()
        )
    )
    print(result.markdown)  # Access extracted text
    print(result.metadata)  # Access PDF metadata (title, author, etc.)
```

* **Added: Support for frozenset serialization.** Improves configuration serialization, especially for sets of allowed/blocked domains. No code changes required.

* **Added: New `LlmConfig` object.** This object can be passed to extraction, filtering, and schema generation tasks. It centralizes provider strings, API tokens, and base URLs everywhere an LLM configuration is needed, enables reuse, and allows quick experimentation between different LLM configurations.

```python
from crawl4ai.async_configs import LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Example of using LlmConfig with LLMExtractionStrategy
llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
strategy = LLMExtractionStrategy(llmConfig=llm_config, schema=...)

# Example usage within a crawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )
```

**Breaking Change:** Removed the old `provider`, `api_token`, `base_url`, and `api_base` parameters from `LLMExtractionStrategy` and `LLMContentFilter`. Migrate to the `LlmConfig` object.

* **Changed: Improved browser context management and added shared data support.** Browser contexts are now managed more efficiently, reducing resource usage, and a new `shared_data` dictionary on the `BrowserContext` allows passing data between different stages of the crawling process. **Breaking Change:** The `BrowserContext` API has changed, and the old `get_context` method is deprecated.

* **Changed:** Renamed `final_url` to `redirected_url` in `CrawledURL` for consistency and clarity. Update any code referencing the old field name.

* **Changed:** Improved type hints and removed unused files. This is an internal improvement and should not require code changes.

* **Changed:** Reorganized deep crawling functionality into a dedicated module. (**Breaking Change:** Import paths for `DeepCrawlStrategy` and related classes have changed.) Update imports to use the new `crawl4ai.deep_crawling` module.

* **Changed:** Improved HTML handling and cleaned up the codebase. (**Breaking Change:** Removed the unused `ssl_certificate.json` file.) If you were relying on this file for custom certificate validation, you'll need to implement an alternative approach.

* **Changed:** Enhanced serialization and config handling. (**Breaking Change:** `FastFilterChain` has been replaced with `FilterChain`.) This simplifies configuration and improves serialization.

* **Changed:** The license is now Apache 2.0 *with a required attribution clause*. See the `LICENSE` file for details. All users must now clearly attribute the Crawl4AI project when using, distributing, or creating derivative works.

* **Fixed:** Prevent memory leaks by ensuring proper closure of Playwright pages. No code changes required.

* **Fixed:** Made model fields optional with default values. (**Breaking Change:** Code relying on all fields being present may need adjustment.) Fields in data models such as `CrawledURL` are now optional, with default values (usually `None`). Update code to handle potential `None` values.

* **Fixed:** Adjusted the memory threshold and fixed dispatcher initialization. Internal bug fix; no code changes required.

* **Fixed:** Ensure a proper exit after running the `doctor` command. No code changes required.

* **Fixed:** JsonCss selector and crawler improvements.

* **Fixed:** Broken screenshots of long pages (#403).

* **Documentation:** Updated documentation URLs to the new domain.

* **Documentation:** Added a SERP API project example.

* **Documentation:** Added clarifying comments for CSS selector behavior.

* **Documentation:** Added a Code of Conduct for the project (#410).

## Breaking Changes Summary

* **Dispatcher:** `MemoryAdaptiveDispatcher` is now the default for `arun_many()`, changing concurrency behavior. The return type of `arun_many()` depends on the `stream` parameter.
* **Deep Crawling:** `max_depth` is now part of `CrawlerRunConfig` and controls crawl depth. Import paths for deep crawling strategies have changed.
* **Browser Context:** The `BrowserContext` API has been updated.
* **Models:** Many fields in data models are now optional, with default values.
* **Scraping Mode:** The `ScrapingMode` enum has been replaced by a strategy pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
* **Content Filter:** Removed the `content_filter` parameter from `CrawlerRunConfig`. Use extraction strategies or markdown generators with filters instead.
* **Removed:** Synchronous `WebCrawler`, the old CLI, and docs management functionality.
* **Docker:** Significant changes to Docker deployment, including new requirements and configuration.
* **File Removed:** Removed the `ssl_certificate.json` file, which might affect existing certificate validations.
* **Renamed:** `final_url` to `redirected_url` for consistency.
* **Config:** `FastFilterChain` has been replaced with `FilterChain`.
* **Deep Crawl:** `DeepCrawlStrategy.arun` now returns `Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]`.
* **Proxy:** Removed synchronous `WebCrawler` support and related rate limiting configurations.

## Migration Guide

1. **Update imports:** Adjust imports for `DeepCrawlStrategy`, `BreadthFirstSearchStrategy`, and related classes to the new `deep_crawling` module structure.
2. **`CrawlerRunConfig`:** Move `max_depth` to `CrawlerRunConfig`. If using `content_filter`, migrate to an extraction strategy or a markdown generator with a filter.
3. **`arun_many()`:** Adapt code to the new `MemoryAdaptiveDispatcher` behavior and return type.
4. **`BrowserContext`:** Update code using the `BrowserContext` API.
5. **Models:** Handle potential `None` values for optional fields in data models.
6. **Scraping:** Replace the `ScrapingMode` enum with `WebScrapingStrategy` or `LXMLWebScrapingStrategy`.
7. **Docker:** Review the updated Docker documentation and adjust your deployment accordingly.
8. **CLI:** Migrate to the new `crwl` command and update any scripts using the old CLI.
9. **Proxy:** Remove any use of the synchronous `WebCrawler` and its related rate limiting configurations.
10. **Config:** Replace `FastFilterChain` with `FilterChain`.