crawl4ai

Author	SHA1	Message	Date
UncleCode	4bcd4cbda1	refactor(pdf): improve PDF processor dependency handling Make PyPDF2 an optional dependency and improve import handling in PDF processor. Move imports inside methods to allow for lazy loading and better error handling. Add new 'pdf' optional dependency group in pyproject.toml. Clean up unused imports and remove deprecated files. BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.	2025-02-25 22:27:55 +08:00
UncleCode	71ce01c9e1	feat(browser): add cdp_url parameter to BrowserManager initialization	2025-02-24 14:48:02 +08:00
UncleCode	c6d48080a4	feat(logger): add abstract logger base class and file logger implementation Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization. BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.	2025-02-23 21:23:41 +08:00
UncleCode	46d2f12851	chore: remove old Dockerfile and server script	2025-02-22 13:45:04 +08:00
UncleCode	367cd71db9	feat(core): release version 0.5.0 with deep crawling and CLI This major release adds deep crawling capabilities, memory-adaptive dispatcher, multiple crawling strategies, Docker deployment, and a new CLI. It also includes significant improvements to proxy handling, PDF processing, and LLM integration. BREAKING CHANGES: - Add memory-adaptive dispatcher as default for arun_many() - Move max_depth to CrawlerRunConfig - Replace ScrapingMode enum with strategy pattern - Update BrowserContext API - Make model fields optional with defaults - Remove content_filter parameter from CrawlerRunConfig - Remove synchronous WebCrawler and old CLI - Update Docker deployment configuration - Replace FastFilterChain with FilterChain - Change license to Apache 2.0 with attribution clause	2025-02-21 19:55:02 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00
UncleCode	3cb28875c3	refactor(config): enhance serialization and config handling - Add ignore_default_value option to to_serializable_dict - Add viewport dict support in BrowserConfig - Replace FastFilterChain with FilterChain - Add deprecation warnings for unwanted properties - Clean up unused imports - Rename example files for consistency - Add comprehensive Docker configuration tutorial BREAKING CHANGE: FastFilterChain has been replaced with FilterChain	2025-02-19 17:23:25 +08:00
Aravind	dad592c801	2025 feb alpha 1 (#685) * spelling change in prompt * gpt-4o-mini support * Remove leading Y before here * prompt spell correction * (Docs) Fix numbered list end-of-line formatting Added the missing "two spaces" to add a line break * fix: access downloads_path through browser_config in _handle_download method - Fixes #585 * crawl * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/583 * Docs update: https://github.com/unclecode/crawl4ai/issues/649 * fix: https://github.com/unclecode/crawl4ai/issues/570 * Docs: updated example for content-selection to reflect new changes in yc newsfeed css * Refactor: Removed old filters and replaced with optimised filters * fix:Fixed imports as per the new names of filters * Tests: For deep crawl filters * Refactor: Remove old scorers and replace with optimised ones: Fix imports forall filters and scorers. * fix: awaiting on filters that are async in nature eg: content relevance and seo filters * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/715 --------- Co-authored-by: DarshanTank <darshan.tank@gnani.ai> Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com> Co-authored-by: Serhat Soydan <ssoydan@gmail.com> Co-authored-by: cardit1 <maneesh@cardit.in> Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>	2025-02-19 14:13:17 +08:00
UncleCode	c171891999	Merge branch 'main' into next # Conflicts: # .gitignore	2025-02-19 13:26:42 +08:00
UncleCode	3b1025abbb	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2025-02-19 13:24:26 +08:00
UncleCode	f00dcc276f	Update README.md (#562)	2025-02-19 13:24:04 +08:00
UncleCode	392c923980	feat(docker): add JWT authentication and improve server architecture Add JWT token-based authentication to Docker server and client. Refactor server architecture for better code organization and error handling. Move Dockerfile to root deploy directory and update configuration. Add comprehensive documentation and examples. BREAKING CHANGE: Docker server now requires authentication by default. Endpoints require JWT tokens when security.jwt_enabled is true in config.	2025-02-18 22:07:13 +08:00
UncleCode	2864015469	feat(docker): implement supervisor and secure API endpoints Add supervisor configuration for managing Redis and Gunicorn processes Replace direct process management with supervisord Add secure and token-free API server variants Implement JWT authentication for protected endpoints Update datetime handling in async dispatcher Add email domain verification BREAKING CHANGE: Server startup now uses supervisord instead of direct process management	2025-02-17 20:31:20 +08:00
UncleCode	8bb799068e	feat(crawler): add HTTP crawler strategy for lightweight web scraping Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include: - Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts - File and raw content handling capabilities - Streaming response processing for large files - Customizable request/response hooks - Comprehensive error handling Also refactors browser management code into separate module for better organization.	2025-02-15 19:26:30 +08:00
UncleCode	063df572b0	docs(examples): add SERP API project example Add comprehensive example demonstrating Google Search Results Page (SERP) API implementation using crawl4ai. The example includes: - Basic web crawling setup - LLM-based extraction - Schema generation - Golden standard implementation - CrawlerHub usage The example serves as a reference for implementing SERP API functionality with various extraction strategies.	2025-02-14 23:06:16 +08:00
UncleCode	966fb47e64	feat(config): enhance serialization and add deep crawling exports Improve configuration serialization with better handling of frozensets and slots. Expand deep crawling module exports and documentation. Add comprehensive API usage examples in Docker README. - Add support for frozenset serialization - Improve error handling in config loading - Export additional deep crawling components - Enhance Docker API documentation with detailed examples - Fix ContentTypeFilter initialization	2025-02-13 21:45:19 +08:00
UncleCode	43e09da694	refactor(crawler): remove content filter functionality Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes: - Removing ContentFilter import and related classes - Removing content_filter parameter from CrawlerRunConfig - Cleaning up LLMExtractionStrategy constructor parameters BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.	2025-02-12 21:59:19 +08:00
UncleCode	69705df0b3	fix(install): ensure proper exit after running doctor command	2025-02-11 19:48:23 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	467be9ac76	feat(deep-crawling): add DFS strategy and update exports; refactor CLI entry point	2025-02-09 20:23:40 +08:00
UncleCode	19df96ed56	feat(proxy): add proxy rotation strategy Implements a new proxy rotation system with the following changes: - Add ProxyRotationStrategy abstract base class - Add RoundRobinProxyStrategy concrete implementation - Integrate proxy rotation with AsyncWebCrawler - Add proxy_rotation_strategy parameter to CrawlerRunConfig - Add example script demonstrating proxy rotation usage - Remove deprecated synchronous WebCrawler code - Clean up rate limiting documentation BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations	2025-02-09 18:49:10 +08:00
UncleCode	b957ff2ecd	refactor(crawler): improve HTML handling and cleanup codebase - Add HTML attribute preservation in GoogleSearchCrawler - Fix lxml import references in utils.py - Remove unused ssl_certificate.json - Clean up imports and code organization in hub.py - Update test case formatting and remove unused image search test BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations	2025-02-07 21:56:27 +08:00
UncleCode	91073c1244	refactor(crawling): improve type hints and code cleanup - Added proper return type hints for DeepCrawlStrategy.arun method - Added __call__ method to DeepCrawlStrategy for easier usage - Removed redundant comments and imports - Cleaned up type hints in DFS strategy - Removed empty docker_client.py and .continuerules - Added .private/ to gitignore BREAKING CHANGE: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]	2025-02-07 19:01:59 +08:00
Sezer Bozkır	926beee832	base-config structure is changed (#618) refactor(docker): restructure docker-compose for modular configuration - Added reusable base configuration block (x-base-config) for ports, environment variables, volumes, deployment resources, restart policy, and health check. - Updated services to include base configuration directly using `<<: *base-config` syntax. - Removed redundant `base-config` service definition.	2025-02-07 17:11:51 +08:00
UncleCode	a9415aaaf6	refactor(deep-crawling): reorganize deep crawling strategies and add new implementations Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types. BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.	2025-02-05 22:50:39 +08:00
UncleCode	c308a794e8	refactor(deep-crawl): reorganize deep crawling functionality into dedicated module Restructure deep crawling code into a dedicated module with improved organization: - Move deep crawl logic from async_deep_crawl.py to deep_crawling/ - Create separate files for BFS strategy, filters, and scorers - Improve code organization and maintainability - Add optimized implementations for URL filtering and scoring - Rename DeepCrawlHandler to DeepCrawlDecorator for clarity BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure	2025-02-04 23:28:17 +08:00
UncleCode	bc7559586f	feat(crawler): add deep crawling capabilities with BFS strategy Implements deep crawling functionality with a new BreadthFirstSearch strategy: - Add DeepCrawlStrategy base class and BFS implementation - Integrate deep crawling with AsyncWebCrawler via decorator pattern - Update CrawlerRunConfig to support deep crawling parameters - Add pagination support for Google Search crawler BREAKING CHANGE: AsyncWebCrawler.arun and arun_many return types now include deep crawl results	2025-02-04 01:24:49 +08:00
UncleCode	04bc643cec	feat(api): improve cache handling and add API tests Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure results are still cached for future use. Also adds error handling for non-JSON LLM responses and comprehensive API test suite. - Changes default cache fallback from BYPASS to WRITE_ONLY - Adds error handling for LLM JSON parsing - Introduces new test suite for API endpoints	2025-02-02 20:53:31 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	7b1ef07c41	refactor(docker): remove unused models and utilities for cleaner codebase	2025-02-01 20:10:13 +08:00
UncleCode	2f15976b34	feat(docker): enhance Docker deployment setup and configuration Add comprehensive Docker deployment configuration with: - New .dockerignore and .llm.env.example files - Enhanced Dockerfile with multi-stage build and optimizations - Detailed README with setup instructions and environment configurations - Improved requirements.txt with Gunicorn - Better error handling in async_configs.py BREAKING CHANGE: Docker deployment now requires .llm.env file for API keys	2025-02-01 19:33:27 +08:00
UncleCode	20920fa17b	refactor(docker): clean up import statements in server.py	2025-02-01 14:28:28 +08:00
UncleCode	53ac3ec0b4	feat(docker): add Docker service integration and config serialization Add Docker service integration with FastAPI server and client implementation. Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support Docker service communication. Clean up imports and improve error handling. - Add Crawl4aiDockerClient class - Implement config serialization/deserialization - Add FastAPI server with streaming support - Add health check endpoint - Clean up imports and type hints	2025-01-31 18:00:16 +08:00
UncleCode	ce4f04dad2	feat(docker): add Docker deployment configuration and API server Add Docker deployment setup with FastAPI server implementation for Crawl4AI: - Create Dockerfile with Python 3.10 and Playwright dependencies - Implement FastAPI server with streaming and non-streaming endpoints - Add request/response models and JSON serialization - Include test script for API verification Also includes: - Update .gitignore for Continue development files - Add project rules in .continuerules - Clean up async_dispatcher.py formatting	2025-01-31 15:22:21 +08:00
UncleCode	f81712eb91	refactor(core): reorganize project structure and remove legacy code Major reorganization of the project structure: - Moved legacy synchronous crawler code to legacy folder - Removed deprecated CLI and docs manager - Consolidated version manager into utils.py - Added CrawlerHub to __init__.py exports - Fixed type hints in async_webcrawler.py - Fixed minor bugs in chunking and crawler strategies BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.	2025-01-30 19:35:06 +08:00
UncleCode	31938fb922	feat(crawler): enhance JavaScript execution and PDF processing Add JavaScript execution result handling and improve PDF processing capabilities: - Add js_execution_result to CrawlResult and AsyncCrawlResponse models - Implement execution result capture in AsyncPlaywrightCrawlerStrategy - Add batch processing for PDF pages with configurable batch size - Enhance JsonElementExtractionStrategy with better schema generation - Add HTML optimization utilities BREAKING CHANGE: PDF processing now uses batch processing by default	2025-01-29 21:03:39 +08:00
UncleCode	f8fd9d9eff	feat(pdf): add PDF processing capabilities Add new PDF processing module with the following features: - PDF text extraction and formatting to HTML/Markdown - Image extraction with multiple format support (JPEG, PNG, TIFF) - Link extraction from PDF documents - Metadata extraction including title, author, dates - Support for both local and remote PDF files Also includes: - New configuration options for HTML attribute handling - Internal/external link filtering improvements - Version bump to 0.4.300b4	2025-01-27 21:24:15 +08:00
UncleCode	dde14eba7d	Update README.md (#562)	2025-01-26 11:00:28 +08:00
UncleCode	54c84079c4	docs(api): improve formatting and readability of API documentation Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files: - arun.md - arun_many.md - async-webcrawler.md - parameters.md Changes include: - Consistent list formatting and indentation - Better spacing between sections - Clearer separation of content blocks - Fixed quotation marks and code block formatting	2025-01-25 22:06:11 +08:00
UncleCode	d0586f09a9	Merge branch 'vr0.4.3b3'	2025-01-25 21:57:29 +08:00
UncleCode	09ac7ed008	feat(demo): uncomment feature demos and add fake-useragent dependency Uncomments demonstration code for memory dispatcher, streaming support, content scraping, JSON schema generation, LLM markdown, and robots compliance in the v0.4.3b2 features demo file. Also adds fake-useragent package as a project dependency. This change makes all feature demonstrations active by default and ensures proper user agent handling capabilities.	2025-01-25 21:56:08 +08:00
UncleCode	97796f39d2	docs(examples): update proxy rotation demo and disable other demos Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality. No breaking changes.	2025-01-25 21:52:35 +08:00
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	0afc3e9e5e	refactor(examples): update API usage in features demo Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls() and fix result access patterns. Also improve code formatting and remove extra whitespace. - Replace dispatcher.run_urls with crawler.arun_many - Update streaming demo to use new API and correct result access - Clean up whitespace and formatting - Simplify result property access patterns	2025-01-23 22:37:29 +08:00
UncleCode	65d33bcc0f	style(docs): improve code formatting in features demo Clean up whitespace and improve readability in v0_4_3b2_features_demo.py: - Remove excessive blank lines between functions - Improve config formatting for better readability - Uncomment memory dispatcher demo in main function No breaking changes.	2025-01-23 22:36:58 +08:00
UncleCode	6a01008a2b	docs(multi-url): improve documentation clarity and update examples - Restructure multi-URL crawling documentation with better formatting and examples - Update code examples to use new API syntax (arun_many) - Add detailed parameter explanations for RateLimiter and Dispatchers - Enhance CSS styling for better documentation readability - Fix outdated method calls in feature demo script BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples	2025-01-23 22:33:36 +08:00
UncleCode	6dc01eae3a	refactor(core): improve type hints and remove unused file - Add RelevantContentFilter to __init__.py exports - Update version to 0.4.3b3 - Enhance type hints in async_configs.py - Remove empty utils.scraping.py file - Update mkdocs configuration with version info and GitHub integration BREAKING CHANGE: None	2025-01-23 18:53:22 +08:00
UncleCode	7b7fe84e0d	docs(readme): resolve merge conflict and update version info Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements. No breaking changes.	2025-01-22 20:52:42 +08:00
UncleCode	5c36f4308f	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2025-01-22 20:51:52 +08:00

631 Commits
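
Note on commit 4bcd4cbda1: the message describes moving the PyPDF2 import inside methods so that the package is only needed when PDF features are actually used. The following is a minimal sketch of that lazy-import pattern, not the project's actual code. The names `require_optional` and `count_pdf_pages` are hypothetical and exist only for illustration; the install hint `pip install crawl4ai[pdf]` is taken from the commit message, and `PdfReader` assumes a PyPDF2 release that provides it (2.x or later).

```python
import importlib


def require_optional(module_name: str, extra: str, package: str = "crawl4ai"):
    """Import an optional dependency at call time.

    Hypothetical helper: raises ImportError with an install hint when the
    module is missing, instead of failing when the package is first imported.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install {package}[{extra}]"
        ) from exc


def count_pdf_pages(path: str) -> int:
    """Hypothetical PDF helper that needs PyPDF2 only when it is called."""
    # The import happens here, inside the function, not at module load time.
    PyPDF2 = require_optional("PyPDF2", "pdf")
    with open(path, "rb") as fh:
        return len(PyPDF2.PdfReader(fh).pages)
```

With this shape, importing the surrounding module never fails on a missing PyPDF2; only a call to `count_pdf_pages` does, and the error tells the user which optional dependency group to install.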