From ce7d49484fc097a834d1eac883ecce6f444ceb1e Mon Sep 17 00:00:00 2001 From: UncleCode Date: Thu, 28 Nov 2024 13:06:46 +0800 Subject: [PATCH] docs: update README for version 0.3.743 with new features, enhancements, and contributor acknowledgments --- README.md | 125 +++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 87 insertions(+), 38 deletions(-) diff --git a/README.md b/README.md index 5ba33dea..16d154b5 100644 --- a/README.md +++ b/README.md @@ -11,20 +11,15 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. ๐Ÿ†“๐ŸŒ -## New in 0.3.74 โœจ +## New in 0.3.743 โœจ -- ๐Ÿš€ **Blazing Fast Scraping**: Significantly improved scraping speed. -- ๐Ÿ“ฅ **Download Manager**: Integrated file crawling, downloading, and tracking within `CrawlResult`. -- ๐Ÿ“ **Markdown Strategy**: Flexible system for custom markdown generation and formats. -- ๐Ÿ”— **LLM-Friendly Citations**: Auto-converts links to numbered citations with reference lists. -- ๐Ÿ”Ž **Markdown Filter**: BM25-based content extraction for cleaner, relevant markdown. -- ๐Ÿ–ผ๏ธ **Image Extraction**: Supports `srcset`, `picture`, and responsive image formats. -- ๐Ÿ—‚๏ธ **Local/Raw HTML**: Crawl `file://` paths and raw HTML (`raw:`) directly. -- ๐Ÿค– **Browser Control**: Custom browser setups with stealth integration to bypass bots. -- โ˜๏ธ **API & Cache Boost**: CORS, static serving, and enhanced filesystem-based caching. -- ๐Ÿณ **API Gateway**: Run as an API service with secure token authentication. -- ๐Ÿ› ๏ธ **Database Upgrades**: Optimized for larger content sets with faster caching. -- ๐Ÿ› **Bug Fixes**: Resolved browser context issues, memory leaks, and improved error handling. +๐Ÿš€ **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management. +๐Ÿ“ **Enhanced Markdown Generation**: New generator class for better formatting and customization. +โšก **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler. +๐Ÿ› ๏ธ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows. +๐Ÿ‘ฅ **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency. +๐Ÿ“– **Documentation Updates**: Clearer usage scenarios and updated guidance for better user onboarding. +๐Ÿงช **Test Adjustments**: Refined tests to align with recent class name changes. ## Try it Now! @@ -35,31 +30,85 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc ## Features โœจ -- ๐Ÿ†“ Completely free and open-source -- ๐Ÿš€ Blazing fast performance, outperforming many paid services -- ๐Ÿค– LLM-friendly output formats (JSON, cleaned HTML, markdown) -- ๐ŸŒ Multi-browser support (Chromium, Firefox, WebKit) -- ๐ŸŒ Supports crawling multiple URLs simultaneously -- ๐ŸŽจ Extracts and returns all media tags (Images, Audio, and Video) -- ๐Ÿ”— Extracts all external and internal links -- ๐Ÿ“š Extracts metadata from the page -- ๐Ÿ”„ Custom hooks for authentication, headers, and page modifications -- ๐Ÿ•ต๏ธ User-agent customization -- ๐Ÿ–ผ๏ธ Takes screenshots of pages with enhanced error handling -- ๐Ÿ“œ Executes multiple custom JavaScripts before crawling -- ๐Ÿ“Š Generates structured output without LLM using JsonCssExtractionStrategy -- ๐Ÿ“š Various chunking strategies: topic-based, regex, sentence, and more -- ๐Ÿง  Advanced extraction strategies: cosine clustering, LLM, and more -- ๐ŸŽฏ CSS selector support for precise data extraction -- ๐Ÿ“ Passes instructions/keywords to refine extraction -- ๐Ÿ”’ Proxy support with authentication for enhanced access -- ๐Ÿ”„ Session management for complex multi-page crawling -- ๐ŸŒ Asynchronous architecture for improved performance -- ๐Ÿ–ผ๏ธ Improved image processing with lazy-loading detection -- ๐Ÿ•ฐ๏ธ Enhanced handling of delayed content loading -- ๐Ÿ”‘ Custom headers support for LLM interactions -- ๐Ÿ–ผ๏ธ iframe content extraction for comprehensive analysis -- โฑ๏ธ Flexible timeout and delayed content retrieval options +
+๐Ÿš€ Performance & Scalability + +- โšก **Blazing Fast Scraping**: Outperforms many paid services with cutting-edge optimization. +- ๐Ÿ”„ **Asynchronous Architecture**: Enhanced performance for complex multi-page crawling. +- โšก **Dynamic HTML Formatting**: New, fast HTML formatting for streamlined workflows. +- ๐Ÿ—‚๏ธ **Large Dataset Optimization**: Improved caching for handling massive content sets. + +
+ +
+๐Ÿ”Ž Extraction Capabilities + +- ๐Ÿ–ผ๏ธ **Comprehensive Media Support**: Extracts images, audio, video, and responsive image formats like `srcset` and `picture`. +- ๐Ÿ“š **Advanced Content Chunking**: Topic-based, regex, sentence-level, and cosine clustering strategies. +- ๐ŸŽฏ **Precise Data Extraction**: Supports CSS selectors and keyword-based refinements. +- ๐Ÿ”— **All-Inclusive Link Crawling**: Extracts internal and external links. +- ๐Ÿ“ **Markdown Generation**: Enhanced markdown generator class for custom, clean, LLM-friendly outputs. +- ๐Ÿท๏ธ **Metadata Extraction**: Fetches metadata directly from pages. + +
+ +
+๐ŸŒ Browser Integration + +- ๐ŸŒ **Multi-Browser Support**: Works with Chromium, Firefox, and WebKit. +- ๐Ÿ–ฅ๏ธ **ManagedBrowser with Dynamic Config**: Flexible host/port control for tailored setups. +- โš™๏ธ **Custom Browser Hooks**: Authentication, headers, and page modifications. +- ๐Ÿ•ถ๏ธ **Stealth Mode**: Bypasses bot detection with advanced techniques. +- ๐Ÿ“ธ **Screenshots & JavaScript Execution**: Takes screenshots and executes custom JavaScript before crawling. + +
+ +
+๐Ÿ“ Input/Output Flexibility + +- ๐Ÿ“‚ **Local & Raw HTML Crawling**: Directly processes `file://` paths and raw HTML. +- ๐ŸŒ **Custom Headers for LLM**: Tailored headers for enhanced AI interactions. +- ๐Ÿ› ๏ธ **Structured Output Options**: Supports JSON, cleaned HTML, and markdown outputs. + +
+ +
+๐Ÿ”ง Utility & Debugging + +- ๐Ÿ›ก๏ธ **Error Handling**: Robust error management for seamless execution. +- ๐Ÿ” **Session Management**: Handles complex, multi-page interactions. +- ๐Ÿงน **Utility Functions**: Enhanced sanitization and flexible extraction helpers. +- ๐Ÿ•ฐ๏ธ **Delayed Content Loading**: Improved handling of lazy-loading and dynamic content. + +
+ +
+๐Ÿ” Security & Accessibility + +- ๐Ÿ•ต๏ธ **Proxy Support**: Enables authenticated access for restricted pages. +- ๐Ÿšช **API Gateway**: Deploy as an API service with secure token authentication. +- ๐ŸŒ **CORS & Static Serving**: Enhanced support for filesystem-based caching and cross-origin requests. + +
+ +
+๐ŸŒŸ Community & Documentation + +- ๐Ÿ™Œ **Contributor Acknowledgments**: Recognition for pull requests and contributions. +- ๐Ÿ“– **Clear Documentation**: Simplified and updated for better onboarding and usage. + +
+ +
+๐ŸŽฏ Cutting-Edge Features + +- ๐Ÿ› ๏ธ **BM25-Based Markdown Filtering**: Extracts cleaner, context-relevant markdown. +- ๐Ÿ“š **LLM-Friendly Citations**: Auto-converts links to numbered citations with reference lists. +- ๐Ÿ“ก **IFrame Content Extraction**: Comprehensive analysis for embedded content. +- ๐Ÿ•ฐ๏ธ **Flexible Content Retrieval**: Combines timing-based strategies for reliable extractions. + +
+ ## Installation ๐Ÿ› ๏ธ