crawl4ai

Author	SHA1	Message	Date
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	dee5fe9851	feat(proxy): add proxy rotation support and documentation Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include: - Added proxy rotation demo in features example - Updated proxy configuration handling in BrowserManager - Added proxy rotation documentation - Updated README with new proxy rotation feature - Bumped version to 0.4.3b2 This change enables users to dynamically switch between proxies and verify IP addresses for each request.	2025-01-22 16:11:01 +08:00
UncleCode	9247877037	feat(proxy): add proxy configuration support to CrawlerRunConfig Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config. - Added proxy_config parameter to CrawlerRunConfig - Updated BrowserManager to apply proxy settings from CrawlerRunConfig - Updated proxy-security documentation with new usage examples	2025-01-20 22:14:05 +08:00
UncleCode	4b1309cbf2	feat(crawler): add URL redirection tracking Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections. Changes include: - Added final_url tracking in AsyncPlaywrightCrawlerStrategy - Added redirected_url field to CrawlResult model - Updated AsyncWebCrawler to properly handle and store redirect URLs - Fixed typo in documentation signature	2025-01-19 19:53:38 +08:00
UncleCode	1221be30a3	feat(browser): improve browser context management and add shared data support Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks. Implement browser context reuse based on config signatures to improve memory usage. Fix Firefox/Webkit channel settings. Add config parameter to hook callbacks for better context access. Remove debug print statements. BREAKING CHANGE: Hook callback signatures now include config parameter	2025-01-19 17:12:03 +08:00
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	051a6cf974	docs(readme): update personal story and project vision Revise the README's personal story section to better reflect the project's origins, motivation, and vision for open-source data accessibility. Add more detail about the creator's background and the project's mission to democratize AI through open data access. Also includes a minor TODO comment addition in async crawler strategy.	2025-01-08 21:13:31 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	72fbdac467	fix(extraction): JsonCss selector and crawler improvements - Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one - Add robust error handling to page_need_scroll with default fallback - Improve JSON extraction strategies documentation - Refactor content scraping strategy - Update version to 0.4.247	2025-01-05 19:26:46 +08:00
UncleCode	0857c7b448	Merge branch 'main' of https://github.com/unclecode/crawl4ai into next	2025-01-05 17:05:59 +08:00
Guilume	07b4c1c0ed	fix: long page screenshot not working (#403)	2025-01-05 17:04:34 +08:00
UncleCode	196dc79ec7	fix: prevent memory leaks by ensuring proper closure of Playwright pages - Fixes critical memory leak issue where browser pages remained open - Ensures proper cleanup of Playwright resources after page operations - Improves resource management in browser farm implementation This is an urgent fix to address resource leakage that could impact system stability.	2025-01-03 21:17:23 +08:00
UncleCode	fb33a24891	- Added examples for Amazon product data extraction methods - Updated configuration options and enhanced documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	f2d9912697	Renames browser_config param to config in AsyncWebCrawler Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor. Updates all documentation examples and internal usages to reflect the new parameter name for consistency. Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.	2024-12-26 16:34:36 +08:00
UncleCode	9a4ed6bbd7	Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.	2024-12-26 15:17:07 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	8fbc2e0463	Refactor deployment configuration and enhance browser debugging options	2024-12-20 20:35:28 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
UncleCode	393bb911c0	Enhance crawler strategies with new features - ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction. - Updated existing extraction strategies for better performance. - Improved handling of response status codes during crawls.	2024-12-17 22:40:10 +08:00
UncleCode	a11d9646e3	Enhance crawler features and improve documentation - Added detailed CrawlerRunConfig parameters documentation. - Introduced plans for real-time event-driven crawling. - Updated async logger default level to DEBUG for better insights. - Improved structure and readability in configuration file. - Enhanced documentation on future capabilities in new blog entries.	2024-12-16 18:52:51 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5431fa2d0c	Add PDF & screenshot functionality, new tutorial - Added support for exporting pages as PDFs - Enhanced screenshot functionality for long pages - Created a tutorial on dynamic content loading with 'Load More' buttons. - Updated web crawler to handle PDF data in responses.	2024-12-10 20:10:39 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
UncleCode	2d31915f0a	Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management ### New Features: - Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features. - Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling. - Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling. - Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements. - Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs. ### Improvements: - Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`. - Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation. - Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations. - Improved handling of cookies, headers, and proxies in session creation. ### Refactoring: - Removed hardcoded viewport dimensions and replaced them with dynamic configurations. - Cleaned up unused and commented-out code for better readability and maintainability. - Introduced defaults for frequently used parameters like `delay_before_return_html`. ### Fixes: - Resolved potential inconsistencies in viewport handling. - Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts. ### Docs Update: - Updated schema usage in `quickstart_async.py` example: - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility. - Enhanced LLM extraction instruction documentation. This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
UncleCode	486db3a771	Updated to version 0.4.0 with new features - Enhanced error handling in async crawler. - Added flexible options in Markdown generation. - Updated user agent settings for improved reliability. - Reflected changes in documentation and examples.	2024-12-04 20:26:39 +08:00
UncleCode	3e83893b3f	Enhance User-Agent Handling - Added a new UserAgentGenerator class for generating random User-Agents. - Integrated User-Agent generation in AsyncPlaywrightCrawlerStrategy for randomization. - Enhanced HTTP headers with generated Client Hints.	2024-11-30 18:13:12 +08:00
UncleCode	0780db55e1	fix: handle errors during image dimension updates in AsyncPlaywrightCrawlerStrategy	2024-11-29 21:12:19 +08:00
UncleCode	a036b7f122	feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler	2024-11-28 19:24:07 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
UncleCode	dbb751c8f0	In this commit, we introduce the new concept of MarkdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction.	2024-11-21 18:21:43 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	152ac35bc2	feat(docs): update README for version 0.3.74 with new features and improvements fix(version): update version number to 0.3.74 refactor(async_webcrawler): enhance logging and add domain-based request delay	2024-11-17 21:09:26 +08:00
UncleCode	3a66aa8a60	feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior chore(requirements): add colorama dependency refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code fix(docs): update example scripts for clarity and consistency	2024-11-17 15:30:56 +08:00
UncleCode	3d00fee6c2	- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crawl result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or to a given user query. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files.	2024-11-14 22:50:59 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
UncleCode	bf91adf3f8	fix: Resolve unexpected BrowserContext closure during crawl in Docker - Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers. - Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess. - Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes. - Improved error handling and resource cleanup for browser instances, particularly in Docker environments. Resolves Issue #256	2024-11-13 15:37:16 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	b6d6631b12	Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth.	2024-11-12 12:10:58 +08:00
UncleCode	b120965b6a	Fixed issues with ManagedBrowser, including its inability to connect to the user directory and inability to create new pages within the ManagedBrowser context; all issues are now resolved.	2024-11-07 20:15:03 +08:00
unclecode	54d5a3a259	Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.	2024-11-04 13:22:13 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	60ba131ac8	[v0.3.72] Enhance content extraction and proxy support - Add ContentCleaningStrategy for improved content extraction - Implement advanced proxy configuration with authentication - Enhance image source detection and handling - Add fit_markdown and fit_html for refined content output - Improve external link and image handling flexibility	2024-10-22 20:19:22 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
UncleCode	4e2852d5ff	[v0.3.71] Enhance chunking strategies and improve overall performance - Add OverlappingWindowChunking and improve SlidingWindowChunking - Update CHUNK_TOKEN_THRESHOLD to 2048 tokens - Optimize AsyncPlaywrightCrawlerStrategy close method - Enhance flexibility in CosineStrategy with generic embedding model loading - Improve JSON-based extraction strategies - Add knowledge graph generation example	2024-10-19 18:36:59 +08:00
UncleCode	b8147b64e0	chore: Bump version to 0.3.71 and improve error handling - Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation	2024-10-18 13:31:12 +08:00

60 Commits