crawl4ai

Author	SHA1	Message	Date
Aravind Karnam	9e16a4bb26	Merge next and resolve conflicts	2025-04-02 12:18:23 +05:30
UncleCode	c5cac2b459	feat(browser): add BrowserHub for centralized browser management and resource sharing	2025-04-01 20:35:02 +08:00
UncleCode	555455d710	feat(browser): implement browser pooling and page pre-warming Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities: - Adds support for managing multiple browser instances per configuration - Implements page pre-warming for improved performance - Adds configurable behavior for when no browsers are available - Includes comprehensive status reporting and monitoring - Maintains backward compatibility with existing API - Adds demo script showcasing new features BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context	2025-03-31 21:55:07 +08:00
Aravind	765f856ed4	Merge pull request #808 from dvschuyl/bug/parse-srcset-fix-float-width 🐛 Truncate width to integer string in srcset	2025-03-31 18:21:09 +05:30
Aravind Karnam	757e3177ed	fix: https://github.com/unclecode/crawl4ai/issues/839	2025-03-31 17:10:04 +05:30
Aravind	d8357e80d2	Merge pull request #915 from maggie-edkey/css-selector fix(#911): css_selector is not working properly	2025-03-31 13:03:35 +05:30
Aravind Karnam	ef1f0c4102	fix: https://github.com/unclecode/crawl4ai/issues/701	2025-03-31 12:43:32 +05:30
maggie.wang	1119f2f5b5	fix: https://github.com/unclecode/crawl4ai/issues/911	2025-03-31 14:05:54 +08:00
UncleCode	bb02398086	refactor(browser): improve browser strategy architecture and lifecycle management Major refactoring of browser strategy implementations to improve code organization and reliability: - Move CrawlResultContainer and RunManyReturn types from async_webcrawler to models.py - Simplify browser lifecycle management in AsyncWebCrawler - Standardize browser strategy interface with _generate_page method - Improve headless mode handling and browser args construction - Clean up Docker and Playwright strategy implementations - Fix session management and context handling across strategies BREAKING CHANGE: Browser strategy interface has changed with new _generate_page method requirement	2025-03-30 20:58:39 +08:00
UncleCode	3ff7eec8f3	refactor(browser): consolidate browser strategy implementations Moves common browser functionality into BaseBrowserStrategy class to reduce code duplication and improve maintainability. Key changes: - Adds shared browser argument building and session management to base class - Standardizes storage state handling across strategies - Improves process cleanup and error handling - Consolidates CDP URL management and container lifecycle BREAKING CHANGE: Changes browser_mode="custom" to "cdp" for consistency	2025-03-28 22:47:28 +08:00
Aravind Karnam	d8cbeff386	fix: https://github.com/unclecode/crawl4ai/issues/842	2025-03-28 19:31:05 +05:30
UncleCode	64f20ab44a	refactor(docker): update Dockerfile and browser strategy to use Chromium	2025-03-28 15:59:02 +08:00
Aravind Karnam	57e0423b3a	fix: target_element should not affect link extraction -> https://github.com/unclecode/crawl4ai/issues/902	2025-03-28 12:56:37 +05:30
UncleCode	c635f6b9a2	refactor(browser): reorganize browser strategies and improve Docker implementation Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with: - Add Alpine and Debian-based Dockerfiles for better container options - Enhance Docker registry to share configuration with BuiltinBrowserStrategy - Add CPU and memory limits to container configuration - Improve error handling and logging - Update documentation and examples BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.	2025-03-27 21:35:13 +08:00
Aravind Karnam	7be5427283	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-27 12:29:32 +05:30
UncleCode	7f93e88379	refactor(tests): remove unused imports in test_docker_browser.py	2025-03-26 15:19:29 +08:00
UncleCode	40d4dd36c9	chore(version): bump version to 0.5.0.post8 and update post-installation setup	2025-03-25 21:56:49 +08:00
UncleCode	d8f38f2298	chore(version): bump version to 0.5.0.post7	2025-03-25 21:47:19 +08:00
UncleCode	5c88d1310d	feat(cli): add output file option and integrate LXML web scraping strategy	2025-03-25 21:38:24 +08:00
UncleCode	4a20d7f7c2	feat(cli): add quick JSON extraction and global config management Adds new features to improve user experience and configuration: - Quick JSON extraction with -j flag for direct LLM-based structured data extraction - Global configuration management with 'crwl config' commands - Enhanced LLM extraction with better JSON handling and error management - New user settings for default behaviors (LLM provider, browser settings, etc.) Breaking changes: None	2025-03-25 20:30:25 +08:00
Aravind Karnam	585e5e5973	fix: https://github.com/unclecode/crawl4ai/issues/733	2025-03-25 15:17:59 +05:30
Aravind Karnam	e3111d0a32	fix: prevent session closing after each request to maintain connection pool. Fixes: https://github.com/unclecode/crawl4ai/issues/867	2025-03-25 13:46:55 +05:30
Aravind Karnam	2f0e217751	Chore: Add brotli as a dependency to fix: https://github.com/unclecode/crawl4ai/issues/867	2025-03-25 13:44:41 +05:30
UncleCode	6405cf0a6f	Merge branch 'vr0.5.0.post5' into next	2025-03-25 14:51:29 +08:00
UncleCode	bdd9db579a	chore(version): bump version to 0.5.0.post6 refactor(cli): remove unused import from FastAPI	2025-03-25 12:01:36 +08:00
UncleCode	1107fa1d62	feat(cli): enhance markdown generation with default content filters Add DefaultMarkdownGenerator integration and automatic content filtering for markdown output formats. When using 'markdown-fit' or 'md-fit' output formats, automatically apply PruningContentFilter with default settings if no filter config is provided. This change improves the user experience by providing sensible defaults for markdown generation while maintaining the ability to customize filtering behavior.	2025-03-25 11:56:00 +08:00
Aravind Karnam	efa73257c5	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-24 21:57:29 +05:30
UncleCode	8c08521301	feat(browser): add Docker-based browser automation strategy Implements a new browser strategy that runs Chrome in Docker containers, providing better isolation and cross-platform consistency. Features include: - Connect and launch modes for different container configurations - Persistent storage support for maintaining browser state - Container registry for efficient reuse - Comprehensive test suite for Docker browser functionality This addition allows users to run browser automation workloads in isolated containers, improving security and resource management.	2025-03-24 21:36:58 +08:00
UncleCode	462d5765e2	fix(browser): improve storage state persistence in CDP strategy Enhance storage state persistence mechanism in CDP browser strategy by: - Explicitly saving storage state for each browser context - Using proper file path for storage state - Removing unnecessary sleep delay Also includes test improvements: - Simplified test configurations in playwright tests - Temporarily disabled some CDP tests	2025-03-23 21:06:41 +08:00
UncleCode	6eeb2e4076	feat(browser): enhance browser context creation with user data directory support and improved storage state handling	2025-03-23 19:07:13 +08:00
UncleCode	0094cac675	refactor(browser): improve parallel crawling and browser management Remove PagePoolConfig in favor of direct page management in browser strategies. Add get_pages() method for efficient parallel page creation. Improve storage state handling and persistence. Add comprehensive parallel crawling tests and performance analysis. BREAKING CHANGE: Removed PagePoolConfig class and related functionality.	2025-03-23 18:53:24 +08:00
UncleCode	4ab0893ffb	feat(browser): implement modular browser management system Adds a new browser management system with strategy pattern implementation: - Introduces BrowserManager class with strategy pattern support - Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy - Implements BrowserProfileManager for profile management - Adds PagePoolConfig for browser page pooling - Includes comprehensive test suite for all browser strategies BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.	2025-03-21 22:50:00 +08:00
Aravind Karnam	e01d1e73e1	fix: link normalisation in BestFirstStrategy	2025-03-21 17:34:13 +05:30
Aravind Karnam	471d110c5e	fix: url normalisation ref: https://github.com/unclecode/crawl4ai/issues/841	2025-03-21 16:48:07 +05:30
Aravind Karnam	f89113377a	fix: Add URLs to the 'visited' set when queueing them instead of after dequeuing, to prevent duplicate crawls. https://github.com/unclecode/crawl4ai/issues/843	2025-03-21 13:44:57 +05:30
Aravind Karnam	6740e87b4d	fix: remove trailing slash when the path is empty. This was causing duplicate crawls	2025-03-21 13:41:31 +05:30
Aravind Karnam	8b761f232b	fix: improve logged url readability by decoding encoded urls	2025-03-21 13:40:23 +05:30
Aravind Karnam	e0c2a7c284	chore: remove mistakenly committed deps.txt file	2025-03-21 11:06:46 +05:30
Aravind Karnam	ac2f9ae533	fix: streamline url status logging via single entrypoint i.e. logger.url_status	2025-03-20 18:59:15 +05:30
Aravind Karnam	eedda1ae5c	fix: Truncate long URLs in the middle rather than at the end, since users were confused into thinking the same URL was being scraped several times. Also replace the status and timer labels with symbols to save space and display more of the URL	2025-03-20 18:56:19 +05:30
Aravind Karnam	8cecbec7a7	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-20 17:07:53 +05:30
UncleCode	6432ff1257	feat(browser): add builtin browser management system Implements a persistent browser management system that allows running a single shared browser instance that can be reused across multiple crawler sessions. Key changes include: - Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes - Implemented builtin browser management in BrowserProfiler - Added CLI commands for managing builtin browser (start, stop, status, restart, view) - Modified browser process handling to support detached processes - Added automatic builtin browser setup during package installation BREAKING CHANGE: The browser_mode config option changes how browser instances are managed	2025-03-20 12:13:59 +08:00
Aravind Karnam	4359b12003	docs + fix: Update example for full page screenshot & PDF export. Fix the bug Error: crawl4ai.async_webcrawler.AsyncWebCrawler.aprocess_html() got multiple values for keyword argument - for screenshot param. https://github.com/unclecode/crawl4ai/issues/822#issuecomment-2732602118	2025-03-18 17:20:24 +05:30
UncleCode	5358ac0fc2	refactor: clean up imports and improve JSON schema generation instructions	2025-03-18 18:53:34 +08:00
Aravind Karnam	529a79725e	docs: remove hallucinations from docs for CrawlerRunConfig + Add chunking strategy docs in the table	2025-03-18 16:14:00 +05:30
Aravind Karnam	9109ecd8fc	chore: Raise an exception with clear messaging when body tag is missing in the fetched html. The message should warn users to add appropriate wait_for condition to wait until body tag is loaded into DOM. fixes: https://github.com/unclecode/crawl4ai/issues/804	2025-03-18 15:26:44 +05:30
Aravind Karnam	84883be513	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-18 15:12:21 +05:30
UncleCode	a24799918c	feat(llm): add additional LLM configuration parameters Extend LLMConfig class to support more fine-grained control over LLM behavior by adding: - temperature control - max tokens limit - top_p sampling - frequency and presence penalties - stop sequences - number of completions These parameters allow for better customization of LLM responses.	2025-03-14 21:36:23 +08:00
UncleCode	a31d7b86be	feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes	2025-03-14 15:26:37 +08:00
UncleCode	7884a98be7	feat(crawler): add experimental parameters support and optimize browser handling - Add experimental parameters dictionary to CrawlerRunConfig to support beta features - Make CSP nonce headers optional via experimental config - Remove default cookie injection - Clean up browser context creation code - Improve code formatting in API handler BREAKING CHANGE: Default cookie injection has been removed from page initialization	2025-03-14 14:39:24 +08:00

725 Commits