crawl4ai

Author	SHA1	Message	Date
UncleCode	4aeb7ef9ad	refactor(proxy): consolidate proxy configuration handling Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location. Key changes: - Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py - Updated type hints in async_configs.py to support ProxyConfig - Fixed proxy configuration handling in browser_manager.py - Updated documentation and examples to use new import path BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy	2025-03-07 23:14:11 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00
UncleCode	3cb28875c3	refactor(config): enhance serialization and config handling - Add ignore_default_value option to to_serializable_dict - Add viewport dict support in BrowserConfig - Replace FastFilterChain with FilterChain - Add deprecation warnings for unwanted properties - Clean up unused imports - Rename example files for consistency - Add comprehensive Docker configuration tutorial BREAKING CHANGE: FastFilterChain has been replaced with FilterChain	2025-02-19 17:23:25 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	8bb799068e	feat(crawler): add HTTP crawler strategy for lightweight web scraping Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include: - Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts - File and raw content handling capabilities - Streaming response processing for large files - Customizable request/response hooks - Comprehensive error handling Also refactors browser management code into separate module for better organization.	2025-02-15 19:26:30 +08:00
UncleCode	966fb47e64	feat(config): enhance serialization and add deep crawling exports Improve configuration serialization with better handling of frozensets and slots. Expand deep crawling module exports and documentation. Add comprehensive API usage examples in Docker README. - Add support for frozenset serialization - Improve error handling in config loading - Export additional deep crawling components - Enhance Docker API documentation with detailed examples - Fix ContentTypeFilter initialization	2025-02-13 21:45:19 +08:00
UncleCode	43e09da694	refactor(crawler): remove content filter functionality Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes: - Removing ContentFilter import and related classes - Removing content_filter parameter from CrawlerRunConfig - Cleaning up LLMExtractionStrategy constructor parameters BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.	2025-02-12 21:59:19 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	19df96ed56	feat(proxy): add proxy rotation strategy Implements a new proxy rotation system with the following changes: - Add ProxyRotationStrategy abstract base class - Add RoundRobinProxyStrategy concrete implementation - Integrate proxy rotation with AsyncWebCrawler - Add proxy_rotation_strategy parameter to CrawlerRunConfig - Add example script demonstrating proxy rotation usage - Remove deprecated synchronous WebCrawler code - Clean up rate limiting documentation BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations	2025-02-09 18:49:10 +08:00
UncleCode	c308a794e8	refactor(deep-crawl): reorganize deep crawling functionality into dedicated module Restructure deep crawling code into a dedicated module with improved organization: - Move deep crawl logic from async_deep_crawl.py to deep_crawling/ - Create separate files for BFS strategy, filters, and scorers - Improve code organization and maintainability - Add optimized implementations for URL filtering and scoring - Rename DeepCrawlHandler to DeepCrawlDecorator for clarity BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure	2025-02-04 23:28:17 +08:00
UncleCode	bc7559586f	feat(crawler): add deep crawling capabilities with BFS strategy Implements deep crawling functionality with a new BreadthFirstSearch strategy: - Add DeepCrawlStrategy base class and BFS implementation - Integrate deep crawling with AsyncWebCrawler via decorator pattern - Update CrawlerRunConfig to support deep crawling parameters - Add pagination support for Google Search crawler BREAKING CHANGE: AsyncWebCrawler.arun and arun_many return types now include deep crawl results	2025-02-04 01:24:49 +08:00
UncleCode	2f15976b34	feat(docker): enhance Docker deployment setup and configuration Add comprehensive Docker deployment configuration with: - New .dockerignore and .llm.env.example files - Enhanced Dockerfile with multi-stage build and optimizations - Detailed README with setup instructions and environment configurations - Improved requirements.txt with Gunicorn - Better error handling in async_configs.py BREAKING CHANGE: Docker deployment now requires .llm.env file for API keys	2025-02-01 19:33:27 +08:00
UncleCode	53ac3ec0b4	feat(docker): add Docker service integration and config serialization Add Docker service integration with FastAPI server and client implementation. Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support Docker service communication. Clean up imports and improve error handling. - Add Crawl4aiDockerClient class - Implement config serialization/deserialization - Add FastAPI server with streaming support - Add health check endpoint - Clean up imports and type hints	2025-01-31 18:00:16 +08:00
UncleCode	f8fd9d9eff	feat(pdf): add PDF processing capabilities Add new PDF processing module with the following features: - PDF text extraction and formatting to HTML/Markdown - Image extraction with multiple format support (JPEG, PNG, TIFF) - Link extraction from PDF documents - Metadata extraction including title, author, dates - Support for both local and remote PDF files Also includes: - New configuration options for HTML attribute handling - Internal/external link filtering improvements - Version bump to 0.4.300b4	2025-01-27 21:24:15 +08:00
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	6dc01eae3a	refactor(core): improve type hints and remove unused file - Add RelevantContentFilter to __init__.py exports - Update version to 0.4.3b3 - Enhance type hints in async_configs.py - Remove empty utils.scraping.py file - Update mkdocs configuration with version info and GitHub integration BREAKING CHANGE: None	2025-01-23 18:53:22 +08:00
UncleCode	d09c611d15	feat(robots): add robots.txt compliance support Add support for checking and respecting robots.txt rules before crawling websites: - Implement RobotsParser class with SQLite caching - Add check_robots_txt parameter to CrawlerRunConfig - Integrate robots.txt checking in AsyncWebCrawler - Update documentation with robots.txt compliance examples - Add tests for robot parser functionality The cache uses WAL mode for better concurrency and has a default TTL of 7 days.	2025-01-21 17:54:13 +08:00
UncleCode	9247877037	feat(proxy): add proxy configuration support to CrawlerRunConfig Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config. - Added proxy_config parameter to CrawlerRunConfig - Updated BrowserManager to apply proxy settings from CrawlerRunConfig - Updated proxy-security documentation with new usage examples	2025-01-20 22:14:05 +08:00
UncleCode	91463e34f1	feat(config): add streaming support and config cloning Add streaming capability to crawler configurations and introduce clone() methods for both BrowserConfig and CrawlerRunConfig to support immutable config updates. Move stream parameter from arun_many() method to CrawlerRunConfig. BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method. Use config.stream=True instead.	2025-01-19 17:51:47 +08:00
UncleCode	1221be30a3	feat(browser): improve browser context management and add shared data support Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks. Implement browser context reuse based on config signatures to improve memory usage. Fix Firefox/Webkit channel settings. Add config parameter to hook callbacks for better context access. Remove debug print statements. BREAKING CHANGE: Hook callback signatures now include config parameter	2025-01-19 17:12:03 +08:00
UncleCode	20c027b79c	chore(cleanup): remove unused files and improve type hints - Remove .pre-commit-config.yaml and duplicate mkdocs configuration files - Add Optional type hint for proxy parameter in BrowserConfig - Fix type annotation for results list in AsyncWebCrawler - Move calculate_batch_size function import to model_loader - Update prompt imports in extraction_strategy.py No breaking changes.	2025-01-14 13:07:18 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	c3370ec5da	refactor(scraping): replace ScrapingMode enum with strategy pattern Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping. This change introduces: - New ContentScrapingStrategy abstract base class - Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations - New Pydantic models for structured scraping results - Updated documentation reflecting the new strategy-based approach BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.	2025-01-13 17:53:12 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	ac5f461d40	feat(crawler): add memory-adaptive dispatcher with rate limiting Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include: - Added RateLimitConfig dataclass for configuring rate limiting behavior - Extended CrawlerRunConfig with dispatcher-related settings - Refactored arun_many to use the new dispatcher system - Added memory threshold and session permit controls - Integrated optional progress monitoring display BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior	2025-01-10 16:01:18 +08:00
UncleCode	a96e05d4ae	refactor(crawler): optimize response handling and default settings - Set wait_for_images default to false for better performance - Simplify response attribute copying in AsyncWebCrawler - Update hello_world example with proper content filtering	2025-01-01 19:39:02 +08:00
UncleCode	ea0ac25f38	refactor(browser): Update browser channel default to 'chromium' in BrowserConfig.from_args method	2025-01-01 18:58:15 +08:00
UncleCode	a7215ad972	fix(browser): update default browser channel to chromium and simplify channel selection logic	2025-01-01 18:38:33 +08:00
UncleCode	fb33a24891	Commit Message: - Added examples for Amazon product data extraction methods - Updated configuration options and enhance documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	8fbc2e0463	Refactor deployment configuration and enhance browser debugging options	2024-12-20 20:35:28 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
UncleCode	a11d9646e3	Enhance crawler features and improve documentation - Added detailed CrawlerRunConfig parameters documentation. - Introduced plans for real-time event-driven crawling. - Updated async logger default level to DEBUG for better insights. - Improved structure and readability in configuration file. - Enhanced documentation on future capabilities in new blog entries.	2024-12-16 18:52:51 +08:00
UncleCode	7524aa7b5e	Feature: Add Markdown generation to CrawlerRunConfig - Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`. - Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`. - Updated version number to 0.4.21 in `__version__.py`.	2024-12-13 21:51:38 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00

38 Commits
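
Most messages in the table above open with a conventional-commit style subject such as "feat(proxy): ..." and several end with a "BREAKING CHANGE:" note. The following is a minimal sketch of how such rows could be scanned for breaking changes. It is not part of crawl4ai. It assumes the four tab-separated columns shown in the header row, and the helper names (Commit, parse_row, commit_type, breaking_note) are hypothetical. The sample rows are abbreviated copies of three entries from the table.

```python
import re
from typing import NamedTuple, Optional


class Commit(NamedTuple):
    """One row of the log: Author, SHA1, Message, Date (hypothetical helper type)."""
    author: str
    sha: str
    message: str
    date: str


# Subject prefix such as "feat(proxy): " or "refactor(llm): ".
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?:\s")
# Everything after the "BREAKING CHANGE:" marker up to the end of the message.
BREAKING_RE = re.compile(r"BREAKING CHANGE:\s*(?P<note>.+)$")


def parse_row(row: str) -> Commit:
    """Split one tab-separated row into its four columns."""
    author, sha, message, date = row.rstrip("\n").split("\t")
    return Commit(author, sha, message, date)


def commit_type(commit: Commit) -> Optional[str]:
    """Return the subject type ("feat", "refactor", ...) or None if the subject has no prefix."""
    match = SUBJECT_RE.match(commit.message)
    return match.group("type") if match else None


def breaking_note(commit: Commit) -> Optional[str]:
    """Return the breaking-change note, treating an explicit "None" as no breaking change."""
    match = BREAKING_RE.search(commit.message)
    if match is None:
        return None
    note = match.group("note").strip()
    return None if note.lower() == "none" else note


# Abbreviated copies of three rows from the table (messages shortened for the example).
SAMPLE_ROWS = [
    "UncleCode\tbaee4949d3\trefactor(llm): rename LlmConfig to LLMConfig for consistency "
    "BREAKING CHANGE: LlmConfig has been renamed to LLMConfig.\t2025-03-05 14:17:04 +08:00",
    "UncleCode\t6dc01eae3a\trefactor(core): improve type hints and remove unused file "
    "BREAKING CHANGE: None\t2025-01-23 18:53:22 +08:00",
    "UncleCode\t8ec12d7d68\tApply Ruff Corrections\t2025-01-13 19:19:58 +08:00",
]

if __name__ == "__main__":
    for commit in map(parse_row, SAMPLE_ROWS):
        note = breaking_note(commit)
        if note is not None:
            print(f"{commit.sha} [{commit_type(commit)}] {note}")
```

Treating "BREAKING CHANGE: None" as no breaking change is a design choice made for this sketch, since two entries in the table use that wording to state explicitly that nothing breaks.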