crawl4ai

Author	SHA1	Message	Date
UncleCode	c171891999	Merge branch 'main' into next # Conflicts: # .gitignore	2025-02-19 13:26:42 +08:00
UncleCode	3b1025abbb	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2025-02-19 13:24:26 +08:00
UncleCode	f00dcc276f	Update README.md (#562)	2025-02-19 13:24:04 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	2864015469	feat(docker): implement supervisor and secure API endpoints - Add supervisor configuration for managing Redis and Gunicorn processes - Replace direct process management with supervisord - Add secure and token-free API server variants - Implement JWT authentication for protected endpoints - Update datetime handling in async dispatcher - Add email domain verification BREAKING CHANGE: Server startup now uses supervisord instead of direct process management	2025-02-17 20:31:20 +08:00
UncleCode	8bb799068e	feat(crawler): add HTTP crawler strategy for lightweight web scraping Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include: - Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts - File and raw content handling capabilities - Streaming response processing for large files - Customizable request/response hooks - Comprehensive error handling Also refactors browser management code into separate module for better organization.	2025-02-15 19:26:30 +08:00
UncleCode	063df572b0	docs(examples): add SERP API project example Add comprehensive example demonstrating Google Search Results Page (SERP) API implementation using crawl4ai. The example includes: - Basic web crawling setup - LLM-based extraction - Schema generation - Golden standard implementation - CrawlerHub usage The example serves as a reference for implementing SERP API functionality with various extraction strategies.	2025-02-14 23:06:16 +08:00
UncleCode	966fb47e64	feat(config): enhance serialization and add deep crawling exports Improve configuration serialization with better handling of frozensets and slots. Expand deep crawling module exports and documentation. Add comprehensive API usage examples in Docker README. - Add support for frozenset serialization - Improve error handling in config loading - Export additional deep crawling components - Enhance Docker API documentation with detailed examples - Fix ContentTypeFilter initialization	2025-02-13 21:45:19 +08:00
UncleCode	43e09da694	refactor(crawler): remove content filter functionality Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes: - Removing ContentFilter import and related classes - Removing content_filter parameter from CrawlerRunConfig - Cleaning up LLMExtractionStrategy constructor parameters BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.	2025-02-12 21:59:19 +08:00
UncleCode	69705df0b3	fix(install): ensure proper exit after running doctor command	2025-02-11 19:48:23 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	467be9ac76	feat(deep-crawling): add DFS strategy and update exports; refactor CLI entry point	2025-02-09 20:23:40 +08:00
UncleCode	19df96ed56	feat(proxy): add proxy rotation strategy Implements a new proxy rotation system with the following changes: - Add ProxyRotationStrategy abstract base class - Add RoundRobinProxyStrategy concrete implementation - Integrate proxy rotation with AsyncWebCrawler - Add proxy_rotation_strategy parameter to CrawlerRunConfig - Add example script demonstrating proxy rotation usage - Remove deprecated synchronous WebCrawler code - Clean up rate limiting documentation BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations	2025-02-09 18:49:10 +08:00
UncleCode	b957ff2ecd	refactor(crawler): improve HTML handling and cleanup codebase - Add HTML attribute preservation in GoogleSearchCrawler - Fix lxml import references in utils.py - Remove unused ssl_certificate.json - Clean up imports and code organization in hub.py - Update test case formatting and remove unused image search test BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations	2025-02-07 21:56:27 +08:00
UncleCode	91073c1244	refactor(crawling): improve type hints and code cleanup - Added proper return type hints for DeepCrawlStrategy.arun method - Added __call__ method to DeepCrawlStrategy for easier usage - Removed redundant comments and imports - Cleaned up type hints in DFS strategy - Removed empty docker_client.py and .continuerules - Added .private/ to gitignore BREAKING CHANGE: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]	2025-02-07 19:01:59 +08:00
Sezer Bozkır	926beee832	base-config structure is changed (#618) refactor(docker): restructure docker-compose for modular configuration - Added reusable base configuration block (x-base-config) for ports, environment variables, volumes, deployment resources, restart policy, and health check. - Updated services to include base configuration directly using `<<: *base-config` syntax. - Removed redundant `base-config` service definition.	2025-02-07 17:11:51 +08:00
UncleCode	a9415aaaf6	refactor(deep-crawling): reorganize deep crawling strategies and add new implementations Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types. BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.	2025-02-05 22:50:39 +08:00
UncleCode	c308a794e8	refactor(deep-crawl): reorganize deep crawling functionality into dedicated module Restructure deep crawling code into a dedicated module with improved organization: - Move deep crawl logic from async_deep_crawl.py to deep_crawling/ - Create separate files for BFS strategy, filters, and scorers - Improve code organization and maintainability - Add optimized implementations for URL filtering and scoring - Rename DeepCrawlHandler to DeepCrawlDecorator for clarity BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure	2025-02-04 23:28:17 +08:00
UncleCode	bc7559586f	feat(crawler): add deep crawling capabilities with BFS strategy Implements deep crawling functionality with a new BreadthFirstSearch strategy: - Add DeepCrawlStrategy base class and BFS implementation - Integrate deep crawling with AsyncWebCrawler via decorator pattern - Update CrawlerRunConfig to support deep crawling parameters - Add pagination support for Google Search crawler BREAKING CHANGE: AsyncWebCrawler.arun and arun_many return types now include deep crawl results	2025-02-04 01:24:49 +08:00
UncleCode	04bc643cec	feat(api): improve cache handling and add API tests Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure results are still cached for future use. Also adds error handling for non-JSON LLM responses and comprehensive API test suite. - Changes default cache fallback from BYPASS to WRITE_ONLY - Adds error handling for LLM JSON parsing - Introduces new test suite for API endpoints	2025-02-02 20:53:31 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	7b1ef07c41	refactor(docker): remove unused models and utilities for cleaner codebase	2025-02-01 20:10:13 +08:00
UncleCode	2f15976b34	feat(docker): enhance Docker deployment setup and configuration Add comprehensive Docker deployment configuration with: - New .dockerignore and .llm.env.example files - Enhanced Dockerfile with multi-stage build and optimizations - Detailed README with setup instructions and environment configurations - Improved requirements.txt with Gunicorn - Better error handling in async_configs.py BREAKING CHANGE: Docker deployment now requires .llm.env file for API keys	2025-02-01 19:33:27 +08:00
UncleCode	20920fa17b	refactor(docker): clean up import statements in server.py	2025-02-01 14:28:28 +08:00
UncleCode	53ac3ec0b4	feat(docker): add Docker service integration and config serialization Add Docker service integration with FastAPI server and client implementation. Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support Docker service communication. Clean up imports and improve error handling. - Add Crawl4aiDockerClient class - Implement config serialization/deserialization - Add FastAPI server with streaming support - Add health check endpoint - Clean up imports and type hints	2025-01-31 18:00:16 +08:00
UncleCode	ce4f04dad2	feat(docker): add Docker deployment configuration and API server Add Docker deployment setup with FastAPI server implementation for Crawl4AI: - Create Dockerfile with Python 3.10 and Playwright dependencies - Implement FastAPI server with streaming and non-streaming endpoints - Add request/response models and JSON serialization - Include test script for API verification Also includes: - Update .gitignore for Continue development files - Add project rules in .continuerules - Clean up async_dispatcher.py formatting	2025-01-31 15:22:21 +08:00
UncleCode	f81712eb91	refactor(core): reorganize project structure and remove legacy code Major reorganization of the project structure: - Moved legacy synchronous crawler code to legacy folder - Removed deprecated CLI and docs manager - Consolidated version manager into utils.py - Added CrawlerHub to __init__.py exports - Fixed type hints in async_webcrawler.py - Fixed minor bugs in chunking and crawler strategies BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.	2025-01-30 19:35:06 +08:00
UncleCode	31938fb922	feat(crawler): enhance JavaScript execution and PDF processing Add JavaScript execution result handling and improve PDF processing capabilities: - Add js_execution_result to CrawlResult and AsyncCrawlResponse models - Implement execution result capture in AsyncPlaywrightCrawlerStrategy - Add batch processing for PDF pages with configurable batch size - Enhance JsonElementExtractionStrategy with better schema generation - Add HTML optimization utilities BREAKING CHANGE: PDF processing now uses batch processing by default	2025-01-29 21:03:39 +08:00
UncleCode	f8fd9d9eff	feat(pdf): add PDF processing capabilities Add new PDF processing module with the following features: - PDF text extraction and formatting to HTML/Markdown - Image extraction with multiple format support (JPEG, PNG, TIFF) - Link extraction from PDF documents - Metadata extraction including title, author, dates - Support for both local and remote PDF files Also includes: - New configuration options for HTML attribute handling - Internal/external link filtering improvements - Version bump to 0.4.300b4	2025-01-27 21:24:15 +08:00
UncleCode	dde14eba7d	Update README.md (#562)	2025-01-26 11:00:28 +08:00
UncleCode	54c84079c4	docs(api): improve formatting and readability of API documentation Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files: - arun.md - arun_many.md - async-webcrawler.md - parameters.md Changes include: - Consistent list formatting and indentation - Better spacing between sections - Clearer separation of content blocks - Fixed quotation marks and code block formatting	2025-01-25 22:06:11 +08:00
UncleCode	d0586f09a9	Merge branch 'vr0.4.3b3'	2025-01-25 21:57:29 +08:00
UncleCode	09ac7ed008	feat(demo): uncomment feature demos and add fake-useragent dependency Uncomments demonstration code for memory dispatcher, streaming support, content scraping, JSON schema generation, LLM markdown, and robots compliance in the v0.4.3b2 features demo file. Also adds fake-useragent package as a project dependency. This change makes all feature demonstrations active by default and ensures proper user agent handling capabilities.	2025-01-25 21:56:08 +08:00
UncleCode	97796f39d2	docs(examples): update proxy rotation demo and disable other demos Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality. No breaking changes.	2025-01-25 21:52:35 +08:00
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	0afc3e9e5e	refactor(examples): update API usage in features demo Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls() and fix result access patterns. Also improve code formatting and remove extra whitespace. - Replace dispatcher.run_urls with crawler.arun_many - Update streaming demo to use new API and correct result access - Clean up whitespace and formatting - Simplify result property access patterns	2025-01-23 22:37:29 +08:00
UncleCode	65d33bcc0f	style(docs): improve code formatting in features demo Clean up whitespace and improve readability in v0_4_3b2_features_demo.py: - Remove excessive blank lines between functions - Improve config formatting for better readability - Uncomment memory dispatcher demo in main function No breaking changes.	2025-01-23 22:36:58 +08:00
UncleCode	6a01008a2b	docs(multi-url): improve documentation clarity and update examples - Restructure multi-URL crawling documentation with better formatting and examples - Update code examples to use new API syntax (arun_many) - Add detailed parameter explanations for RateLimiter and Dispatchers - Enhance CSS styling for better documentation readability - Fix outdated method calls in feature demo script BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples	2025-01-23 22:33:36 +08:00
UncleCode	6dc01eae3a	refactor(core): improve type hints and remove unused file - Add RelevantContentFilter to __init__.py exports - Update version to 0.4.3b3 - Enhance type hints in async_configs.py - Remove empty utils.scraping.py file - Update mkdocs configuration with version info and GitHub integration BREAKING CHANGE: None	2025-01-23 18:53:22 +08:00
UncleCode	7b7fe84e0d	docs(readme): resolve merge conflict and update version info Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements. No breaking changes.	2025-01-22 20:52:42 +08:00
UncleCode	5c36f4308f	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2025-01-22 20:51:52 +08:00
UncleCode	45809d1c91	Merge branch 'vr0.4.3b2'	2025-01-22 20:51:46 +08:00
UncleCode	357414c345	docs(readme): update version references and fix links - Update version numbers to v0.4.3bx throughout README.md - Fix contributing guidelines link to point to CONTRIBUTORS.md - Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product - Add pre-release installation instructions - Fix minor formatting in personal story section No breaking changes	2025-01-22 20:46:39 +08:00
UncleCode	260b9120c3	docs(examples): update v0.4.3 features demo to v0.4.3b2 Rename and replace the features demo file to reflect the beta 2 version number. The old v0.4.3 demo file is removed and replaced with a new beta 2 version. Renames: - docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py	2025-01-22 20:41:43 +08:00
UncleCode	976ea52167	docs(examples): update demo scripts and fix output formats Update example scripts to reflect latest API changes and improve demonstrations: - Increase test URLs in dispatcher example from 20 to 40 pages - Comment out unused dispatcher strategies for cleaner output - Fix scraping strategies performance script to use correct object notation - Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections These changes make the examples more current and better aligned with the actual API.	2025-01-22 20:40:03 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	dee5fe9851	feat(proxy): add proxy rotation support and documentation Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include: - Added proxy rotation demo in features example - Updated proxy configuration handling in BrowserManager - Added proxy rotation documentation - Updated README with new proxy rotation feature - Bumped version to 0.4.3b2 This change enables users to dynamically switch between proxies and verify IP addresses for each request.	2025-01-22 16:11:01 +08:00
UncleCode	88697c4630	docs(readme): update version and feature announcements for v0.4.3b1 Update README.md to announce version 0.4.3b1 release with new features including: - Memory Dispatcher System - Streaming Support - LLM-Powered Markdown Generation - Schema Generation - Robots.txt Compliance Add detailed version numbering explanation section to help users understand pre-release versions.	2025-01-21 21:20:04 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00

623 Commits
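
Most messages in the log above follow a `type(scope): subject` header, and several carry a `BREAKING CHANGE:` note. A minimal sketch of how such a message could be classified is shown below. It is an illustration only and not part of crawl4ai: `ParsedCommit` and `parse_message` are hypothetical names, and the header pattern is an assumption inferred from the entries in this log.

```python
import re
from typing import NamedTuple, Optional

# Assumed header shape, inferred from the log: type(scope): subject, scope optional.
HEADER_RE = re.compile(r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?:\s")

# Text that follows a "BREAKING CHANGE:" marker anywhere in the message.
BREAKING_RE = re.compile(r"BREAKING CHANGE:\s*(.*)", re.DOTALL)


class ParsedCommit(NamedTuple):
    """Hypothetical container for the fields pulled out of one message."""
    type: Optional[str]
    scope: Optional[str]
    breaking: bool


def parse_message(message: str) -> ParsedCommit:
    """Classify one flattened commit message from the Message column."""
    text = message.strip()
    note = BREAKING_RE.search(text)
    # "BREAKING CHANGE: None" (as in commit 6dc01eae3a) declares no break.
    breaking = note is not None and note.group(1).strip().lower() != "none"
    header = HEADER_RE.match(text)
    if header is None:
        # Merge commits and free-form subjects have no typed header.
        return ParsedCommit(None, None, breaking)
    return ParsedCommit(header.group("type"), header.group("scope"), breaking)


if __name__ == "__main__":
    print(parse_message("fix(install): ensure proper exit after running doctor command"))
```

Applied to the rows above, a function like this would let a reader filter the history down to the entries that flag a breaking change before upgrading.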