Commit Graph

224 Commits

Author SHA1 Message Date
UncleCode
dee5fe9851 feat(proxy): add proxy rotation support and documentation
Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include:
- Added proxy rotation demo in features example
- Updated proxy configuration handling in BrowserManager
- Added proxy rotation documentation
- Updated README with new proxy rotation feature
- Bumped version to 0.4.3b2

This change enables users to dynamically switch between proxies and verify IP addresses for each request.
2025-01-22 16:11:01 +08:00
UncleCode
88697c4630 docs(readme): update version and feature announcements for v0.4.3b1
Update README.md to announce version 0.4.3b1 release with new features including:
- Memory Dispatcher System
- Streaming Support
- LLM-Powered Markdown Generation
- Schema Generation
- Robots.txt Compliance

Add detailed version numbering explanation section to help users understand pre-release versions.
2025-01-21 21:20:04 +08:00
devatbosch
8878b3d032 Updated the correct link for "Contribution guidelines" in README.md (#445)
Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again.
2025-01-13 20:57:31 +08:00
Jōnin bingi
1ab9d115cf Fixing minor typos in README (#440)
@mcam10 Thx for the support. Appreciate
2025-01-13 20:23:52 +08:00
UncleCode
f9c601eb7e docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain.

Also add todo/ directory to .gitignore.
2025-01-09 16:24:41 +08:00
UncleCode
e8b4ac6046 docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com
Improve badges styling and layout in documentation
Increase code font size in documentation CSS

BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com
2025-01-09 16:22:41 +08:00
UncleCode
051a6cf974 docs(readme): update personal story and project vision
Revise the README's personal story section to better reflect the project's
origins, motivation, and vision for open-source data accessibility. Add more
detail about the creator's background and the project's mission to
democratize AI through open data access.

Also includes a minor TODO comment addition in async crawler strategy.
2025-01-08 21:13:31 +08:00
UncleCode
56fa4e1e42 refactor(doc)
Update README
2025-01-07 20:53:10 +08:00
UncleCode
ca3e33122e refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
2025-01-07 20:49:50 +08:00
aravind
32652189b0 Docs: Add Code of Conduct for the project (#410) 2025-01-06 12:52:51 +08:00
UncleCode
4cb2a62551 Update README 2025-01-01 18:59:55 +08:00
UncleCode
c64979b8dd docs: update README 2025-01-01 18:10:38 +08:00
UncleCode
5313c71a0d docs: update REAME browser installation command
- Remove Chrome from manual installation command
- Keep Chromium as the only default browser in docs
2025-01-01 17:24:44 +08:00
UncleCode
4a4f613238 docs: simplify installation instructions
- Add crawl4ai-doctor command to verify installation
- Update browser installation instructions in README and docs
- Move optional features to documentation
- Add manual browser installation steps as fallback
- Update getting-started guide with verification step
2025-01-01 16:54:03 +08:00
UncleCode
171ce25ba6 Fixe typo in CHANGELOG 2024-12-31 19:49:00 +08:00
UncleCode
5c3c05bf93 docs: update README badges and Docker section, reorganize documentation structure 2024-12-31 19:45:02 +08:00
UncleCode
67d0999bc3 chore: resolve merge conflicts for v0.4.24 2024-12-31 19:24:03 +08:00
UncleCode
553a4622bf chore: prepare for version 0.4.24 2024-12-31 19:18:36 +08:00
UncleCode
7391d6be73 Update README.md (#390) 2024-12-30 21:24:43 +08:00
UncleCode
e4e23065f1 Update README.md (#389) 2024-12-30 21:24:06 +08:00
UncleCode
7af1d32ef6 Update README for version 0.4.2: Reflect new features and enhancements 2024-12-12 20:18:44 +08:00
UncleCode
2d31915f0a Commit Message:
Enhance Async Crawler with storage state handling
  - Updated Async Crawler to support storage state management.
  - Added error handling for URL validation in Async Web Crawler.
  - Modified README logo and improved .gitignore entries.
  - Fixed issues in multiple files for better code robustness.
2024-12-09 20:04:59 +08:00
UncleCode
c51e901f68 feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management
### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
2024-12-08 20:04:44 +08:00
UncleCode
b02544bc0b docs: update README and blog for version 0.4.0 release, highlighting new features and improvements 2024-12-03 21:28:52 +08:00
unclecode
293f299c08 Add PruningContentFilter with unit tests and update documentation
- Introduced the PruningContentFilter for better content relevance.
  - Implemented comprehensive unit tests for verification of functionality.
  - Enhanced existing BM25ContentFilter tests for edge case coverage.
  - Updated documentation to include usage examples for new filter.
2024-12-01 19:17:33 +08:00
UncleCode
1def53b7fe docs: update Raspberry Pi section to indicate upcoming support 2024-11-29 20:53:43 +08:00
UncleCode
f9c98a377d Enhance Docker support and improve installation process
- Added new Docker commands for platform-specific builds.
  - Updated README with comprehensive installation and setup instructions.
  - Introduced `post_install` method in setup script for automation.
  - Refined migration processes with enhanced error logging.
  - Bump version to 0.3.746 and updated dependencies.
2024-11-29 20:52:51 +08:00
UncleCode
d202f3539b Enhance installation and migration processes
- Added a post-installation setup script for initialization.
  - Updated README with installation notes for Playwright setup.
  - Enhanced migration logging for better error visibility.
  - Added 'pydantic' to requirements.
  - Bumped version to 0.3.746.
2024-11-29 18:48:44 +08:00
UncleCode
c8485776fe docs: update README to reflect latest version v0.3.745 2024-11-28 20:04:16 +08:00
UncleCode
48d43c14b1 docs: fix link formatting for recent updates section in README 2024-11-28 19:33:02 +08:00
UncleCode
776efa74a4 docs: fix link formatting for recent updates section in README 2024-11-28 19:32:32 +08:00
UncleCode
b14e83f499 docs: fix link formatting for recent updates section in README 2024-11-28 19:31:09 +08:00
UncleCode
0cbd594512 Merge branch 'next' - Update README, and quickstart examples 2024-11-28 16:43:16 +08:00
UncleCode
efe93a5f57 docs: enhance README with development TODOs and refine mission statement for clarity 2024-11-28 16:41:11 +08:00
UncleCode
3fda66b85b docs: refine README content for clarity and conciseness, improving descriptions and formatting 2024-11-28 16:36:24 +08:00
UncleCode
ddfb6707b4 docs: update README to reflect new branding and improve section headings for clarity 2024-11-28 16:34:08 +08:00
UncleCode
a69f7a9531 fix: correct typo in function documentation for clarity and accuracy 2024-11-28 16:31:41 +08:00
UncleCode
d583aa43ca refactor: update cache handling in quickstart_async example to use CacheMode enum 2024-11-28 15:53:25 +08:00
UncleCode
3abb573142 docs: update README for version 0.3.743 with improved formatting and contributor acknowledgments 2024-11-28 13:07:59 +08:00
UncleCode
d556dada9f docs: update README to keep details open for extraction capabilities, browser integration, input/output flexibility, utility & debugging, security & accessibility, community & documentation, and cutting-edge features 2024-11-28 13:07:33 +08:00
UncleCode
ce7d49484f docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:46 +08:00
UncleCode
e4acd18429 docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments 2024-11-28 13:06:30 +08:00
zhounan
73661f7d1f docs: enhance development installation instructions (#286)
Thanks for your contribution. I'm merging your changes and I'll add your name to our contributor list. Thank you so much.
2024-11-27 15:04:20 +08:00
unclecode
d7c5b900b8 feat: add support for arm64 platform in Docker commands and update INSTALL_TYPE variable in docker-compose 2024-11-24 19:35:53 +08:00
UncleCode
8dea3f470f chore: update README to include new features and improvements for version 0.3.74 2024-11-22 18:50:12 +08:00
UncleCode
e02935dc5b chore: update README to reflect new features and improvements in version 0.3.74 2024-11-22 18:49:22 +08:00
UncleCode
571dda6549 Update Redme 2024-11-22 18:27:43 +08:00
UncleCode
006bee4a5a feat: enhance image processing capabilities
- Enhanced image processing with srcset support and validation checks for better image selection.
2024-11-22 16:00:17 +08:00
UncleCode
b6af94cbbb Merge remote-tracking branch 'origin/main' into 0.3.74 2024-11-18 21:15:04 +08:00
UncleCode
152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00