crawl4ai

Author	SHA1	Message	Date
UncleCode	30ec4f571f	feat(docs): add comprehensive Docker API demo script. Add a new example script demonstrating Docker API usage with extensive features: basic crawling with single/multi URL support; markdown generation with various filters; parameter demonstrations (CSS, JS, screenshots, SSL, proxies); extraction strategies using CSS and LLM; deep crawling capabilities with streaming; integration examples with proxy rotation and SSL certificate fetching. Also includes minor formatting improvements in async_webcrawler.py.	2025-04-17 20:16:11 +08:00
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation. Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: cleaned_html (default) uses post-processed HTML; raw_html uses original webpage HTML; fit_html uses preprocessed HTML for schema extraction. Changes include: added content_source parameter to MarkdownGenerationStrategy; updated AsyncWebCrawler to handle HTML source selection; added examples and tests for the new feature; updated documentation with new parameter details. BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
ntohidi	0886153d6a	fix(async_playwright_crawler): improve segment handling and viewport adjustments during screenshot capture (Fixed bug: Capturing Screenshot Twice and Increasing Image Size)	2025-04-17 12:48:11 +02:00
ntohidi	0ec3c4a788	fix(crawler): handle navigation aborts during file downloads in AsyncPlaywrightCrawlerStrategy	2025-04-17 12:11:12 +02:00
Aravind Karnam	eed7f88f29	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-04-17 10:50:02 +05:30
UncleCode	94d486579c	docs(tests): clarify server URL comments in deep crawl tests. Improve documentation of test configuration URLs by adding clearer comments explaining when to use each URL configuration (Docker vs development mode). No functional changes, only comment improvements.	2025-04-15 22:32:27 +08:00
UncleCode	5206c6f2d6	Modify the test file	2025-04-15 22:28:01 +08:00
UncleCode	230f22da86	refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling. Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization. Improved LLM token handling with new PROVIDER_MODELS_PREFIXES. Added test cases for deep crawling and proxy rotation. Removed docker_config from BrowserConfig as it's handled separately. BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai	2025-04-15 22:27:18 +08:00
ntohidi	05085b6e3d	fix(requirements): add fake-useragent to requirements	2025-04-15 13:05:19 +02:00
UncleCode	793668a413	Remove parameter_updates.txt	2025-04-14 23:05:24 +08:00
UncleCode	82aa53aa59	Merge branch 'next-alpine-docker' into next	2025-04-14 23:01:22 +08:00
UncleCode	cd7ff6f9c1	feat(docs): add AI assistant interface and code copy button. Add new AI assistant chat interface with features: real-time chat with markdown support; chat history management; citation tracking; selection-to-query functionality. Also adds code copy button to documentation code blocks and adjusts layout/styling. Breaking changes: None	2025-04-14 23:00:47 +08:00
UncleCode	c56974cf59	feat(docs): enhance documentation UI with ToC and GitHub stats. Add new features to documentation UI: add table of contents with scroll spy functionality; add GitHub repository statistics badge; implement new centered layout system with fixed sidebar; add conditional Playwright installation based on CRAWL4AI_MODE. Breaking changes: None	2025-04-14 20:46:32 +08:00
ntohidi	1f3b1251d0	docs(cli): add Crawl4AI CLI installation instructions to the CLI guide	2025-04-14 12:16:31 +02:00
ntohidi	7b9aabc64a	fix(crawler): ensure max_pages limit is respected during batch processing in crawling strategies	2025-04-14 12:11:22 +02:00
Aravind Karnam	dcc265458c	fix: Add a nominal wait time for remove overlay elements since it's already controllable through delay_before_return_html	2025-04-14 12:39:05 +05:30
UncleCode	ecec53a8c1	Docker tested on Windows machine.	2025-04-13 20:14:41 +08:00
Aravind Karnam	7d8e81fb2e	fix: fix target_elements, in a less invasive and more efficient way simply by changing order of execution :) https://github.com/unclecode/crawl4ai/issues/902	2025-04-12 12:44:00 +05:30
Aravind Karnam	9fc5d315af	fix: revert the old target_elms code in LXMLwebscraping strategy	2025-04-12 12:07:04 +05:30
Aravind Karnam	d84508b4d5	fix: revert the old target_elms code in regular webscraping strategy	2025-04-12 12:05:17 +05:30
Aravind Karnam	022f5c9e25	Merged next branch	2025-04-12 10:47:02 +05:30
UncleCode	3179d6ad0c	fix(core): improve error handling and stability in core components. Enhance error handling and stability across multiple components: add safety checks in async_configs.py for type and params existence; fix browser manager initialization and cleanup logic; add default LLM config fallback in extraction strategy; add comprehensive Docker deployment guide and server tests. BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances	2025-04-11 20:58:39 +08:00
wakaka6	b2f3cb0dfa	WIP: migrate logger to rich	2025-04-11 00:44:43 +08:00
UncleCode	18e8227dfb	feat(crawler): add console message capture functionality. Add ability to capture browser console messages during crawling: implement _capture_console_messages method to collect console logs; update crawl method to support console message capture; modify browser_manager page creation to accept full CrawlerRunConfig; fix request failure text formatting. This enhancement allows debugging and monitoring of JavaScript console output during crawling operations.	2025-04-10 23:26:09 +08:00
UncleCode	7c358a1aee	fix(browser): add null check for crawlerRunConfig.url. Add additional null check when accessing crawlerRunConfig.url in cookie configuration to prevent potential null pointer exceptions. Previously, the code only checked if crawlerRunConfig existed but not its url property. Fixes potential runtime error when crawlerRunConfig.url is undefined.	2025-04-10 23:25:07 +08:00
UncleCode	108b2a8bfb	Fixed capturing console messages for the case where the URL is a local file. Updated Docker configuration (work in progress).	2025-04-10 23:22:38 +08:00
unclecode	66ac07b4f3	feat(crawler): add network request and console message capturing. Implement comprehensive network request and console message capturing functionality: add capture_network_requests and capture_console_messages config parameters; add network_requests and console_messages fields to models; implement Playwright event listeners to capture requests, responses, and console output; create detailed documentation and examples; add comprehensive tests. This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.	2025-04-10 16:03:48 +08:00
UncleCode	a2061bf31e	feat(crawler): add MHTML capture functionality. Add ability to capture web pages in MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. Changes: add capture_mhtml parameter to CrawlerRunConfig; implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy; add mhtml field to CrawlResult and AsyncCrawlResponse models; add comprehensive tests for MHTML capture functionality; update documentation with MHTML capture details; add exclude_all_images option for better memory management. Breaking changes: None	2025-04-09 15:39:04 +08:00
Aravind Karnam	6f7ab9c927	fix: Revert changes to session management in AsyncHttpWebcrawler and solve the underlying issue by removing the session closure in finally block of session context.	2025-04-08 18:31:00 +05:30
UncleCode	9038e9acbd	Merge branch 'main' into next	2025-04-08 17:43:42 +08:00
UncleCode	02e627e0bd	fix(crawler): simplify page retrieval logic in AsyncPlaywrightCrawlerStrategy	2025-04-08 17:43:36 +08:00
UncleCode	5b66208a7e	Refactor next branch	2025-04-06 18:33:09 +08:00
UncleCode	591f55edc7	refactor(browser): rename methods and update type hints in BrowserHub for clarity	2025-04-06 18:22:05 +08:00
UncleCode	e1d9e2489c	refactor(docs): update import statement in quickstart.py for improved clarity	2025-04-05 23:12:06 +08:00
UncleCode	b1693b1c21	Remove old quickstart files	2025-04-05 23:10:25 +08:00
UncleCode	49d904ca0a	refactor(docs): enhance quickstart_examples.py with improved configuration and file handling	2025-04-05 22:57:45 +08:00
UncleCode	ca9351252a	refactor(docs): update import paths and clean up example code in quickstart_examples.py	2025-04-05 22:55:56 +08:00
UncleCode	935d9d39f8	Add quickstart example set	2025-04-05 21:37:25 +08:00
UncleCode	f8213c32b9	Merge branch 'vr0.5.0.post8'	2025-04-05 21:36:17 +08:00
UncleCode	14894b4d70	feat(config): set DefaultMarkdownGenerator as the default markdown generator in CrawlerRunConfig; feat(logger): add color mapping for log message formatting options	2025-04-03 20:34:19 +08:00
Aravind Karnam	7155778eac	chore: move from faust-cchardet to chardet	2025-04-03 17:42:51 +05:30
Aravind Karnam	4133e5460d	typo-fix: https://github.com/unclecode/crawl4ai/pull/918	2025-04-03 17:42:24 +05:30
Aravind Karnam	73fda8a6ec	fix: address the PR review: https://github.com/unclecode/crawl4ai/pull/899#discussion_r2024639193	2025-04-03 13:47:13 +05:30
UncleCode	86df20234b	fix(crawler): handle exceptions in get_page call to ensure page retrieval	2025-04-02 21:25:24 +08:00
UncleCode	179921a131	fix(crawler): update get_page call to include additional return value	2025-04-02 19:01:30 +08:00
Aravind Karnam	9e16a4bb26	Merge next and resolve conflicts	2025-04-02 12:18:23 +05:30
UncleCode	c5cac2b459	feat(browser): add BrowserHub for centralized browser management and resource sharing	2025-04-01 20:35:02 +08:00
UncleCode	555455d710	feat(browser): implement browser pooling and page pre-warming. Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities: adds support for managing multiple browser instances per configuration; implements page pre-warming for improved performance; adds configurable behavior for when no browsers are available; includes comprehensive status reporting and monitoring; maintains backward compatibility with existing API; adds demo script showcasing new features. BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context	2025-03-31 21:55:07 +08:00
Aravind	765f856ed4	Merge pull request #808 from dvschuyl/bug/parse-srcset-fix-float-width: 🐛 Truncate width to integer string in srcset	2025-03-31 18:21:09 +05:30
Aravind Karnam	757e3177ed	fix: https://github.com/unclecode/crawl4ai/issues/839	2025-03-31 17:10:04 +05:30

973 Commits