crawl4ai

Author	SHA1	Message	Date
UncleCode	a2061bf31e	feat(crawler): add MHTML capture functionality Add ability to capture web pages as MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. - Add capture_mhtml parameter to CrawlerRunConfig - Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy - Add mhtml field to CrawlResult and AsyncCrawlResponse models - Add comprehensive tests for MHTML capture functionality - Update documentation with MHTML capture details - Add exclude_all_images option for better memory management Breaking changes: None	2025-04-09 15:39:04 +08:00
UncleCode	9038e9acbd	Merge branch 'main' into next	2025-04-08 17:43:42 +08:00
UncleCode	e1d9e2489c	refactor(docs): update import statement in quickstart.py for improved clarity	2025-04-05 23:12:06 +08:00
UncleCode	b1693b1c21	Remove old quickstart files	2025-04-05 23:10:25 +08:00
UncleCode	49d904ca0a	refactor(docs): enhance quickstart_examples.py with improved configuration and file handling	2025-04-05 22:57:45 +08:00
UncleCode	ca9351252a	refactor(docs): update import paths and clean up example code in quickstart_examples.py	2025-04-05 22:55:56 +08:00
UncleCode	935d9d39f8	Add quickstart example set	2025-04-05 21:37:25 +08:00
Aravind Karnam	9e16a4bb26	Merge next and resolve conflicts	2025-04-02 12:18:23 +05:30
UncleCode	c635f6b9a2	refactor(browser): reorganize browser strategies and improve Docker implementation Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with: - Add Alpine and Debian-based Dockerfiles for better container options - Enhance Docker registry to share configuration with BuiltinBrowserStrategy - Add CPU and memory limits to container configuration - Improve error handling and logging - Update documentation and examples BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.	2025-03-27 21:35:13 +08:00
Aravind Karnam	efa73257c5	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-24 21:57:29 +05:30
UncleCode	4ab0893ffb	feat(browser): implement modular browser management system Adds a new browser management system with strategy pattern implementation: - Introduces BrowserManager class with strategy pattern support - Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy - Implements BrowserProfileManager for profile management - Adds PagePoolConfig for browser page pooling - Includes comprehensive test suite for all browser strategies BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.	2025-03-21 22:50:00 +08:00
Aravind Karnam	8cecbec7a7	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-20 17:07:53 +05:30
UncleCode	6432ff1257	feat(browser): add builtin browser management system Implements a persistent browser management system that allows running a single shared browser instance that can be reused across multiple crawler sessions. Key changes include: - Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes - Implemented builtin browser management in BrowserProfiler - Added CLI commands for managing builtin browser (start, stop, status, restart, view) - Modified browser process handling to support detached processes - Added automatic builtin browser setup during package installation BREAKING CHANGE: The browser_mode config option changes how browser instances are managed	2025-03-20 12:13:59 +08:00
Aravind Karnam	4359b12003	docs + fix: Update example for full page screenshot & PDF export. Fix the bug Error: crawl4ai.async_webcrawler.AsyncWebCrawler.aprocess_html() got multiple values for keyword argument - for screenshot param. https://github.com/unclecode/crawl4ai/issues/822#issuecomment-2732602118	2025-03-18 17:20:24 +05:30
Aravind Karnam	529a79725e	docs: remove hallucinations from docs for CrawlerRunConfig + Add chunking strategy docs in the table	2025-03-18 16:14:00 +05:30
Aravind Karnam	84883be513	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-18 15:12:21 +05:30
UncleCode	b750542e6d	feat(crawler): optimize single URL handling and add performance comparison Add special handling for single URL requests in Docker API to use arun() instead of arun_many() Add new example script demonstrating performance differences between sequential and parallel crawling Update cache mode from aggressive to bypass in examples and tests Remove unused dependencies (zstandard, msgpack) BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples	2025-03-13 22:15:15 +08:00
Aravind Karnam	cbb8755972	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-13 10:42:22 +05:30
UncleCode	1630fbdafe	feat(monitor): add real-time crawler monitoring system with memory management Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes: - Terminal-based dashboard with rich UI for displaying task statuses - Memory pressure monitoring and adaptive dispatch control - Queue statistics and performance metrics tracking - Detailed task progress visualization - Stress testing framework for memory management This addition helps operators track crawler performance and manage memory usage more effectively.	2025-03-12 19:05:24 +08:00
UncleCode	9547bada3a	feat(content): add target_elements parameter for selective content extraction Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media. Key changes: - Added target_elements list parameter to CrawlerRunConfig - Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements - Updated documentation with examples and comparison between css_selector and target_elements - Fixed table extraction in content_scraping_strategy.py BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures	2025-03-10 18:54:51 +08:00
UncleCode	9d69fce834	feat(scraping): add smart table extraction and analysis capabilities Add comprehensive table detection and extraction functionality to the web scraping system: - Implement intelligent table detection algorithm with scoring system - Add table extraction with support for headers, rows, captions - Update models to include tables in Media class - Add table_score_threshold configuration option - Add documentation and examples for table extraction - Include crypto analysis example demonstrating table usage This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.	2025-03-09 21:31:33 +08:00
UncleCode	4aeb7ef9ad	refactor(proxy): consolidate proxy configuration handling Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location. Key changes: - Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py - Updated type hints in async_configs.py to support ProxyConfig - Fixed proxy configuration handling in browser_manager.py - Updated documentation and examples to use new import path BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy	2025-03-07 23:14:11 +08:00
UncleCode	a68cbb232b	feat(browser): add standalone CDP browser launch and lxml extraction strategy Add new features to enhance browser automation and HTML extraction: - Add CDP browser launch capability with customizable ports and profiles - Implement JsonLxmlExtractionStrategy for faster HTML parsing - Add CLI command 'crwl cdp' for launching standalone CDP browsers - Support connecting to external CDP browsers via URL - Optimize selector caching and context-sensitive queries BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai	2025-03-07 20:55:56 +08:00
UncleCode	f78c46446b	feat(deep-crawling): improve URL normalization and domain filtering Enhance URL handling in deep crawling with: - New URL normalization functions for consistent URL formats - Improved domain filtering with subdomain support - Added URLPatternFilter to public API - Better URL deduplication in BFS strategy These changes improve crawling accuracy and reduce duplicate visits.	2025-03-06 22:45:57 +08:00
UncleCode	b3ec7ce960	Merge branch 'vr0.5.0.post1' into next	2025-03-05 14:17:19 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
UncleCode	9c58e4ce2e	fix(docs): correct section numbering in deepcrawl_example.py tutorial	2025-03-04 20:57:33 +08:00
UncleCode	df6a6d5f4f	refactor(docs): reorganize tutorial sections and update wrap-up example	2025-03-04 20:55:09 +08:00
UncleCode	56bc3c6e45	refactor(cli): improve CLI default command handling Make 'crawl' the default command when no command is specified. This improves user experience by allowing direct URL input without explicitly specifying the 'crawl' command. Also removes unnecessary blank lines in example code for better readability.	2025-03-04 20:28:16 +08:00
UncleCode	415c1c5bee	refactor(core): replace float('inf') with math.inf Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code. No breaking changes.	2025-03-04 18:23:55 +08:00
Aravind Karnam	504207faa6	docs: update text in llm-strategies.md to reflect new changes in LlmConfig	2025-03-03 19:24:44 +05:30
UncleCode	d024749633	refactor(deep-crawl): add max_pages limit and improve crawl control Add max_pages parameter to all deep crawling strategies to limit total pages crawled. Add score_threshold parameter to BFS/DFS strategies for quality control. Remove legacy parameter handling in AsyncWebCrawler. Improve error handling and logging in crawl strategies. BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()	2025-03-03 21:51:11 +08:00
Aravind	f14e4a4b67	Merge pull request #776 from jawshoeadan/patch-1 Fix LiteLLM branding and link	2025-03-03 19:01:30 +05:30
Aravind Karnam	1e819cdb26	fixes: https://github.com/unclecode/crawl4ai/issues/774	2025-03-03 11:53:15 +05:30
jawshoeadan	5edfea279d	Fix LiteLLM branding and link	2025-03-02 16:58:00 +01:00
UncleCode	c612f9a852	feat(profiles): add CLI command for crawling with browser profiles Adds new functionality to crawl websites using saved browser profiles directly from the CLI. This includes: - New CLI option to use profiles for crawling - Helper functions for profile-based crawling - Fixed type hints for config parameters - Updated example to show browser window by default This makes it easier for users to leverage saved browser profiles for crawling without writing code.	2025-03-02 21:33:33 +08:00
UncleCode	cba4a466e5	feat(browser): add BrowserProfiler class for identity-based browsing Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling. Features include: - Interactive profile creation and management - Profile listing, retrieval, and deletion - Guided console interface - Migration of profile management from ManagedBrowser - New example script for identity-based browsing ALSO: - Updates logging format in AsyncWebCrawler - Removes content filter from hello_world example - Relaxes httpx version constraint BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler	2025-03-02 20:32:29 +08:00
Aravind	a9e24307cc	Release prep (#749) * fix: Update export of URLPatternFilter * chore: Add dependency for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependency list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependencies * feat: Make PyPDF2 a conditional dependency * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	4bcd4cbda1	refactor(pdf): improve PDF processor dependency handling Make PyPDF2 an optional dependency and improve import handling in PDF processor. Move imports inside methods to allow for lazy loading and better error handling. Add new 'pdf' optional dependency group in pyproject.toml. Clean up unused imports and remove deprecated files. BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.	2025-02-25 22:27:55 +08:00
UncleCode	c6d48080a4	feat(logger): add abstract logger base class and file logger implementation Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization. BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.	2025-02-23 21:23:41 +08:00
UncleCode	367cd71db9	feat(core): release version 0.5.0 with deep crawling and CLI This major release adds deep crawling capabilities, memory-adaptive dispatcher, multiple crawling strategies, Docker deployment, and a new CLI. It also includes significant improvements to proxy handling, PDF processing, and LLM integration. BREAKING CHANGES: - Add memory-adaptive dispatcher as default for arun_many() - Move max_depth to CrawlerRunConfig - Replace ScrapingMode enum with strategy pattern - Update BrowserContext API - Make model fields optional with defaults - Remove content_filter parameter from CrawlerRunConfig - Remove synchronous WebCrawler and old CLI - Update Docker deployment configuration - Replace FastFilterChain with FilterChain - Change license to Apache 2.0 with attribution clause	2025-02-21 19:55:02 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00
UncleCode	3cb28875c3	refactor(config): enhance serialization and config handling - Add ignore_default_value option to to_serializable_dict - Add viewport dict support in BrowserConfig - Replace FastFilterChain with FilterChain - Add deprecation warnings for unwanted properties - Clean up unused imports - Rename example files for consistency - Add comprehensive Docker configuration tutorial BREAKING CHANGE: FastFilterChain has been replaced with FilterChain	2025-02-19 17:23:25 +08:00
Aravind	dad592c801	2025 feb alpha 1 (#685) * spelling change in prompt * gpt-4o-mini support * Remove leading Y before here * prompt spell correction * (Docs) Fix numbered list end-of-line formatting Added the missing "two spaces" to add a line break * fix: access downloads_path through browser_config in _handle_download method - Fixes #585 * crawl * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/583 * Docs update: https://github.com/unclecode/crawl4ai/issues/649 * fix: https://github.com/unclecode/crawl4ai/issues/570 * Docs: updated example for content-selection to reflect new changes in yc newsfeed css * Refactor: Removed old filters and replaced with optimised filters * fix: Fixed imports as per the new names of filters * Tests: For deep crawl filters * Refactor: Remove old scorers and replace with optimised ones: Fix imports for all filters and scorers. * fix: awaiting on filters that are async in nature eg: content relevance and seo filters * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/715 --------- Co-authored-by: DarshanTank <darshan.tank@gnani.ai> Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com> Co-authored-by: Serhat Soydan <ssoydan@gmail.com> Co-authored-by: cardit1 <maneesh@cardit.in> Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>	2025-02-19 14:13:17 +08:00
UncleCode	c171891999	Merge branch 'main' into next # Conflicts: # .gitignore	2025-02-19 13:26:42 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	063df572b0	docs(examples): add SERP API project example Add comprehensive example demonstrating Google Search Results Page (SERP) API implementation using crawl4ai. The example includes: - Basic web crawling setup - LLM-based extraction - Schema generation - Golden standard implementation - CrawlerHub usage The example serves as a reference for implementing SERP API functionality with various extraction strategies.	2025-02-14 23:06:16 +08:00
UncleCode	43e09da694	refactor(crawler): remove content filter functionality Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes: - Removing ContentFilter import and related classes - Removing content_filter parameter from CrawlerRunConfig - Cleaning up LLMExtractionStrategy constructor parameters BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.	2025-02-12 21:59:19 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	19df96ed56	feat(proxy): add proxy rotation strategy Implements a new proxy rotation system with the following changes: - Add ProxyRotationStrategy abstract base class - Add RoundRobinProxyStrategy concrete implementation - Integrate proxy rotation with AsyncWebCrawler - Add proxy_rotation_strategy parameter to CrawlerRunConfig - Add example script demonstrating proxy rotation usage - Remove deprecated synchronous WebCrawler code - Clean up rate limiting documentation BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations	2025-02-09 18:49:10 +08:00
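
Commit 19df96ed56 describes an abstract rotation base class with a round-robin implementation. The sketch below illustrates that general pattern only; it is not the crawl4ai code. The class names `ProxyRotationSketch` and `RoundRobinSketch`, the `next_proxy` method, and the proxy URLs are hypothetical names chosen for this example, and the real `ProxyRotationStrategy` / `RoundRobinProxyStrategy` signatures may differ.

```python
from abc import ABC, abstractmethod
from itertools import cycle
from typing import Iterator, List


class ProxyRotationSketch(ABC):
    """Hypothetical interface: pick the proxy to use for the next request."""

    @abstractmethod
    def next_proxy(self) -> str:
        ...


class RoundRobinSketch(ProxyRotationSketch):
    """Hands out proxies in a fixed order, wrapping around at the end."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        # cycle() repeats the sequence forever; copy the list so later
        # changes by the caller do not affect the rotation order.
        self._pool: Iterator[str] = cycle(list(proxies))

    def next_proxy(self) -> str:
        return next(self._pool)


# Placeholder proxy URLs, assumed for illustration only.
rotation = RoundRobinSketch(["http://proxy-a:8080", "http://proxy-b:8080"])
picked = [rotation.next_proxy() for _ in range(3)]
```

With two proxies and three requests, the third request wraps around to the first proxy again. Keeping the selection policy behind an abstract class means a different policy (for example random or least-used) could be swapped in without touching the caller.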


414 Commits