crawl4ai

Author	SHA1	Message	Date
UncleCode	0780db55e1	fix: handle errors during image dimension updates in AsyncPlaywrightCrawlerStrategy	2024-11-29 21:12:19 +08:00
UncleCode	d202f3539b	Enhance installation and migration processes - Added a post-installation setup script for initialization. - Updated README with installation notes for Playwright setup. - Enhanced migration logging for better error visibility. - Added 'pydantic' to requirements. - Bumped version to 0.3.746.	2024-11-29 18:48:44 +08:00
UncleCode	652d396a81	chore: update version to 0.3.745	2024-11-28 20:00:29 +08:00
UncleCode	a9b6b65238	chore: update version to 0.3.744 and add publish.sh to .gitignore	2024-11-28 19:26:50 +08:00
UncleCode	a036b7f122	feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler	2024-11-28 19:24:07 +08:00
UncleCode	c2d4784810	fix: resolve merge conflict in DefaultMarkdownGenerator affecting fit_markdown generation	2024-11-28 12:56:31 +08:00
UncleCode	76bea6c577	Merge branch 'main' into 0.3.743	2024-11-28 12:53:30 +08:00
UncleCode	a1c7dc17ce	Merge branch 'next' of https://github.com/unclecode/crawl4ai into next	2024-11-28 12:45:57 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
Hamza Farhan	f998e9e949	Fix: handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined. (#293) Thanks, dear Farhan, for the changes you made in the code. I accepted and merged them into the main branch. Also, I will add your name to our contributor list. Thank you so much.	2024-11-27 19:20:54 +08:00
unclecode	de43505ae4	feat: update version to 0.3.742	2024-11-24 19:36:30 +08:00
UncleCode	829a1f7992	feat: update version to 0.3.741 and enhance content filtering with heuristic strategy. Fix the case where the HTML passed to the BM25 content filter does not contain any HTML elements.	2024-11-23 19:45:41 +08:00
UncleCode	d729aa7d5e	refactor: Add group ID to images extracted from srcset.	2024-11-23 18:00:32 +08:00
UncleCode	d7a112fefe	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-11-22 19:56:56 +08:00
UncleCode	24ad2fe2dd	feat: enhance Markdown generation to include fit_html attribute	2024-11-22 18:47:17 +08:00
UncleCode	006bee4a5a	feat: enhance image processing capabilities - Enhanced image processing with srcset support and validation checks for better image selection.	2024-11-22 16:00:17 +08:00
UncleCode	dbb751c8f0	This commit introduces the new concept of MarkdownGenerationStrategy, which lets us add future strategies for generating better markdown. Raw markdown is still generated as before. There is a new BM25-based algorithm for fitting markdown, and markdown can now be refined into a citation form: links are extracted and replaced by citation reference numbers, and a references section at the very end lists all the links with their descriptions. This format is more suitable for large language models. When links do not need to be passed, the markdown can be reduced significantly in size, and the list of references can be attached as a separate file for a large language model. This commit contains changes in this direction.	2024-11-21 18:21:43 +08:00
程序员阿江(Relakkes)	3439f7886d	fix: crawler strategy exception handling and fixes (#271)	2024-11-20 20:30:25 +08:00
Darwing Medina	d418a04602	Fix #260 prevent passing duplicated kwargs to scrapping_strategy (#269) Thank you for the suggestions. It totally makes sense now. Change to pop operator.	2024-11-20 18:52:11 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	852729ff38	feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile feat(requirements): update requirements.txt to include snowballstemmer fix(version_manager): correct version parsing to use __version__.__version__ feat(main): introduce chunking strategy and content filter in CrawlRequest model feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance feat(logger): implement new async logger engine replacing print statements throughout library fix(database): resolve version-related deadlock and circular lock issues in database operations docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose	2024-11-18 21:00:06 +08:00
UncleCode	152ac35bc2	feat(docs): update README for version 0.3.74 with new features and improvements fix(version): update version number to 0.3.74 refactor(async_webcrawler): enhance logging and add domain-based request delay	2024-11-17 21:09:26 +08:00
UncleCode	df63a40606	feat(docs): update examples and documentation to replace bypass_cache with cache_mode for improved clarity	2024-11-17 19:44:45 +08:00
UncleCode	f9fe6f89fe	feat(database): implement version management and migration checks during initialization	2024-11-17 18:09:33 +08:00
UncleCode	3a66aa8a60	feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior chore(requirements): add colorama dependency refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code fix(docs): update example scripts for clarity and consistency	2024-11-17 15:30:56 +08:00
UncleCode	5098442086	refactor: migrate versioning to __version__.py and remove deprecated _version.py	2024-11-16 15:30:24 +08:00
UncleCode	d0014c6793	New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.	2024-11-16 14:54:41 +08:00
UncleCode	3d00fee6c2	- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files is also added to the crawl result object. - This commit also introduces the concept of the Relevance Content Filter, an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page: the part that really matters and is useful to process. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page that are relevant to its title, descriptions, and keywords, or to a user query when one is given. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files.	2024-11-14 22:50:59 +08:00
UncleCode	17913f5acf	feat(crawler): support local files and raw HTML input in AsyncWebCrawler	2024-11-13 20:00:29 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
UncleCode	bf91adf3f8	fix: Resolve unexpected BrowserContext closure during crawl in Docker - Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers. - Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess. - Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes. - Improved error handling and resource cleanup for browser instances, particularly in Docker environments. Resolves Issue #256	2024-11-13 15:37:16 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	b6d6631b12	Enhance Async Crawler with Playwright support - Implemented new async crawler strategy using Playwright. - Introduced ManagedBrowser for better browser management. - Added support for persistent browser sessions and improved error handling. - Updated version from 0.3.73 to 0.3.731. - Enhanced logic in main.py for conditional mounting of static files. - Updated requirements to replace playwright_stealth with tf-playwright-stealth.	2024-11-12 12:10:58 +08:00
UncleCode	bcdd80911f	Remove some old files.	2024-11-08 19:08:58 +08:00
UncleCode	b120965b6a	Fixed issues with the ManagedBrowser, including its inability to connect to the user directory and to create new pages within the ManagedBrowser context; all issues are now resolved.	2024-11-07 20:15:03 +08:00
UncleCode	16f918621f	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-11-07 19:30:22 +08:00
UncleCode	9f5eef1f38	Refactored the `CustomHTML2Text` class in `content_scrapping_strategy.py` to remove the handling logic for header tags (h1-h6), which are now commented out. This cleanup improves code readability and reduces maintenance overhead.	2024-11-06 21:50:09 +08:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00
UncleCode	3cf19a1bc2	chore(version): bump version to 0.3.73	2024-11-05 20:05:58 +08:00
UncleCode	67a23c3182	feat(core): Release v0.3.73 with Browser Takeover and Docker Support Major changes: - Add browser takeover feature using CDP for authentic browsing - Implement Docker support with full API server documentation - Enhance Markdown with tag preservation system - Improve parallel crawling performance This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for detailed migration guide.	2024-11-05 20:04:18 +08:00
bizrockman	0bba0e074f	Preventing NoneType has no attribute get Errors Sometimes the list contains Tag elements that do not have attrs set, resulting in this Error.	2024-11-04 20:12:24 +01:00
UncleCode	e6c914d2fa	Refactor version management and remove deprecated gitignore.dev file	2024-11-04 16:51:59 +08:00
unclecode	54d5a3a259	Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.	2024-11-04 13:22:13 +08:00
UncleCode	4239654722	Update Documentation	2024-10-27 19:24:46 +08:00
UncleCode	38474bd66a	Update version	2024-10-24 20:24:21 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	60ba131ac8	[v0.3.72] Enhance content extraction and proxy support - Add ContentCleaningStrategy for improved content extraction - Implement advanced proxy configuration with authentication - Enhance image source detection and handling - Add fit_markdown and fit_html for refined content output - Improve external link and image handling flexibility	2024-10-22 20:19:22 +08:00
UncleCode	04d16e6d2b	Fix Base64 image parsing in WebScrappingStrategy (issue 182) - Add support for extracting Base64 encoded images - Improve image format detection to include Base64 images - Enhance compatibility with locally saved HTML files using Base64 image encoding	2024-10-20 19:25:25 +08:00
UncleCode	1dd36f9035	Refactor content scrapping strategy and improve error handling	2024-10-20 19:11:18 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
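
Note on commit dbb751c8f0 (citation-form markdown): the message describes replacing inline links with citation reference numbers and collecting the links in a references section at the end. The snippet below is only a minimal sketch of that idea, not the code from the commit. The function name `to_citations`, the regular expression, and the `[n] url: text` reference layout are all assumptions made for illustration.

```python
import re

# Hypothetical illustration only; this is NOT crawl4ai's implementation.
# Matches inline markdown links [text](url) but skips images ![alt](src).
LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)\)')


def to_citations(markdown: str) -> tuple[str, str]:
    """Replace inline links with numbered citations and build a reference list."""
    numbers: dict[str, int] = {}  # URL -> citation number (first-seen order)
    labels: dict[str, str] = {}   # URL -> link text from its first occurrence

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        if url not in numbers:
            numbers[url] = len(numbers) + 1
            labels[url] = text
        # Keep the visible text and append the citation number.
        return f"{text}[{numbers[url]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(
        f"[{n}] {url}: {labels[url]}" for url, n in numbers.items()
    )
    return body, references


body, refs = to_citations(
    "See [docs](https://a.example) and [repo](https://b.example)."
)
print(body)
print(refs)
```

Returning the body and the reference list separately mirrors the point made in the commit message: the references can be dropped or attached as a separate file when the links are not needed.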

204 Commits