crawl4ai

Author	SHA1	Message	Date
UncleCode	f9c601eb7e	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain. Also add todo/ directory to .gitignore.	2025-01-09 16:24:41 +08:00
UncleCode	ad5e5d21ca	Remove .codeiumignore from version control and add to .gitignore	2025-01-08 13:09:23 +08:00
UncleCode	12880f1ffa	Update gitignore	2025-01-06 15:19:01 +08:00
UncleCode	196dc79ec7	fix: prevent memory leaks by ensuring proper closure of Playwright pages - Fixes critical memory leak issue where browser pages remained open - Ensures proper cleanup of Playwright resources after page operations - Improves resource management in browser farm implementation This is an urgent fix to address resource leakage that could impact system stability.	2025-01-03 21:17:23 +08:00
UncleCode	67d0999bc3	chore: resolve merge conflicts for v0.4.24	2024-12-31 19:24:03 +08:00
UncleCode	2fedd4876e	Update gitignore	2024-12-31 17:35:34 +08:00
UncleCode	e187b0aaf0	update gitignore	2024-12-31 17:34:31 +08:00
UncleCode	7792fe0e4c	Recreate .do folder for removal	2024-12-31 17:31:51 +08:00
UncleCode	86259244e4	Add ".do" to gitignore	2024-12-31 17:30:09 +08:00
UncleCode	0ec593fa90	Update the Tutorial section for new document version	2024-12-31 17:27:31 +08:00
UncleCode	fb33a24891	Commit Message: - Added examples for Amazon product data extraction methods - Updated configuration options and enhance documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	9a4ed6bbd7	Commit Message: Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.	2024-12-26 15:17:07 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	84b311760f	Commit Message: Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`	2024-12-21 14:26:56 +08:00
UncleCode	399af801a1	Merge branch 'next'	2024-12-12 20:17:27 +08:00
UncleCode	3d69715dba	chore: Update .gitignore to include new files and directories	2024-12-12 19:57:59 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	2d31915f0a	Commit Message: Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
UncleCode	a9b6b65238	chore: update version to 0.3.744 and add publish.sh to .gitignore	2024-11-28 19:26:50 +08:00
UncleCode	a5decaa7cf	Merge branch '0.3.74'	2024-11-22 19:55:52 +08:00
UncleCode	2bdec1fa5a	chore: add manage-collab.sh to .gitignore	2024-11-19 19:33:04 +08:00
UncleCode	b654c49e55	Update .gitignore to exclude additional scripts and files	2024-11-19 19:32:06 +08:00
UncleCode	2f19d38693	Update .gitignore to include .gitboss/ and todo_executor.md	2024-11-19 19:02:41 +08:00
UncleCode	73658c758a	chore: update .gitignore to include manage-collab.sh	2024-11-19 16:10:43 +08:00
UncleCode	ae7ebc0bd8	chore: update .gitignore and enhance changelog with major feature additions and examples	2024-11-15 20:16:13 +08:00
UncleCode	bf91adf3f8	fix: Resolve unexpected BrowserContext closure during crawl in Docker - Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers. - Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess. - Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes. - Improved error handling and resource cleanup for browser instances, particularly in Docker environments. Resolves Issue #256	2024-11-13 15:37:16 +08:00
unclecode	54d5a3a259	Improved database management and error handling, updated README instructions, refined .gitignore, enhanced async web crawling capabilities, and updated dependencies.	2024-11-04 13:22:13 +08:00
UncleCode	c2a71a5abe	Update Docs folder, prepare branch for new version 0.3.73	2024-10-27 19:35:13 +08:00
UncleCode	4e2852d5ff	[v0.3.71] Enhance chunking strategies and improve overall performance - Add OverlappingWindowChunking and improve SlidingWindowChunking - Update CHUNK_TOKEN_THRESHOLD to 2048 tokens - Optimize AsyncPlaywrightCrawlerStrategy close method - Enhance flexibility in CosineStrategy with generic embedding model loading - Improve JSON-based extraction strategies - Add knowledge graph generation example	2024-10-19 18:36:59 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by defining a new flag `include_links_on_markdown` in `crawl` method.	2024-10-17 21:37:48 +08:00
unclecode	740802c491	Merge branch '0.3.6'	2024-10-14 22:55:24 +08:00
unclecode	b9ac96c332	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-10-14 22:54:23 +08:00
unclecode	d06535388a	Update gitignore	2024-10-14 22:53:56 +08:00
unclecode	6aa803d712	Update gitignore	2024-10-14 21:03:40 +08:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
unclecode	ff3524d9b1	feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts - Implement screenshot capture functionality - Add delayed content retrieval method - Introduce custom page timeout parameter - Enhance LLM support with multiple providers - Improve database schema auto-updates - Optimize image processing in WebScrappingStrategy - Update error handling and logging - Expand examples in quickstart_async.py	2024-10-12 13:42:42 +08:00
unclecode	b99d20b725	Add pypi_build.sh to .gitignore	2024-10-08 18:10:57 +08:00
unclecode	4750810a67	Enhance AsyncWebCrawler with smart waiting and screenshot capabilities - Implement smart_wait function in AsyncPlaywrightCrawlerStrategy - Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler - Improve error handling and timeout management in crawling process - Fix typo in CrawlResult model (responser_headers -> response_headers) - Update .gitignore to exclude additional files - Adjust import path in test_basic_crawling.py	2024-10-02 17:34:56 +08:00
unclecode	8b6e88c85c	Update .gitignore to ignore temporary and test directories	2024-09-26 15:09:49 +08:00
unclecode	10cdad039d	Update documents and README	2024-09-25 16:52:11 +08:00
unclecode	8463aabedf	chore: Remove .test_pads/ directory from .gitignore	2024-07-19 17:09:29 +08:00
unclecode	7f30144ef2	chore: Remove .tests/ directory from .gitignore	2024-07-09 15:10:18 +08:00
unclecode	d11a83c232	## [0.2.71] 2024-06-26 • Refactored `crawler_strategy.py` to handle exceptions and improve error messages • Improved `get_content_of_website_optimized` function in `utils.py` for better performance • Updated `utils.py` with latest changes • Migrated to `ChromeDriverManager` for resolving Chrome driver download issues	2024-06-26 15:34:15 +08:00
unclecode	2217904876	Update .gitignore	2024-06-22 18:12:12 +08:00
unclecode	19d3d39115	Update Marge the DOCS branch	2024-06-21 18:04:13 +08:00
unclecode	4a50781453	chore: Remove local and .files folders from .gitignore	2024-06-17 15:57:34 +08:00
unclecode	8b8683f22e	Add research assistant example using Chainlit	2024-06-04 22:43:09 +08:00
QIN2DIM	5cee084340	fix(main): UnicodeDecodeError File "T:\_GitHubProjects\Forks\crawl4ai\main.py", line 70, in read_index partials[filename[:-5]] = file.read() UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 149: illegal multibyte sequence	2024-05-18 23:31:11 +08:00
Unclecode	bf00c26a83	chore: Update Dockerfile to install chromium-chromedriver and spacy library	2024-05-18 09:16:52 +00:00
unclecode	199c66114c	chore: Update pip installation command and requirements, add new dependencies	2024-05-16 20:58:36 +08:00


56 Commits