812 Commits

Author SHA1 Message Date
UncleCode
467be9ac76 feat(deep-crawling): add DFS strategy and update exports; refactor CLI entry point 2025-02-09 20:23:40 +08:00
UncleCode
19df96ed56 feat(proxy): add proxy rotation strategy
Implements a new proxy rotation system with the following changes:
- Add ProxyRotationStrategy abstract base class
- Add RoundRobinProxyStrategy concrete implementation
- Integrate proxy rotation with AsyncWebCrawler
- Add proxy_rotation_strategy parameter to CrawlerRunConfig
- Add example script demonstrating proxy rotation usage
- Remove deprecated synchronous WebCrawler code
- Clean up rate limiting documentation

BREAKING CHANGE: Removed synchronous WebCrawler support and related rate limiting configurations
2025-02-09 18:49:10 +08:00
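Note on 19df96ed56: the split between an abstract rotation strategy and a round-robin implementation can be sketched roughly as below. Only the two class names come from the commit message; the `next_proxy` method name, the dict shape, and the proxy addresses are assumptions made for illustration, not the project's actual API.

```python
from abc import ABC, abstractmethod
from itertools import cycle
from typing import Dict, Iterator, List


class ProxyRotationStrategy(ABC):
    """Base class: decides which proxy the next request should use."""

    @abstractmethod
    def next_proxy(self) -> Dict[str, str]:
        """Return proxy settings for the next request (hypothetical method name)."""


class RoundRobinProxyStrategy(ProxyRotationStrategy):
    """Cycle through a fixed proxy list in order, wrapping around at the end."""

    def __init__(self, proxies: List[Dict[str, str]]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle: Iterator[Dict[str, str]] = cycle(proxies)

    def next_proxy(self) -> Dict[str, str]:
        return next(self._cycle)


# Placeholder proxy addresses, not real servers.
strategy = RoundRobinProxyStrategy(
    [
        {"server": "http://proxy-a.example:8080"},
        {"server": "http://proxy-b.example:8080"},
    ]
)
picked = [strategy.next_proxy()["server"] for _ in range(3)]
```

With two proxies, the third request wraps back to the first one. Other policies (random, weighted, health-aware) would subclass the same base without touching the caller.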
UncleCode
b957ff2ecd refactor(crawler): improve HTML handling and cleanup codebase
- Add HTML attribute preservation in GoogleSearchCrawler
- Fix lxml import references in utils.py
- Remove unused ssl_certificate.json
- Clean up imports and code organization in hub.py
- Update test case formatting and remove unused image search test

BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations
2025-02-07 21:56:27 +08:00
UncleCode
91073c1244 refactor(crawling): improve type hints and code cleanup
- Added proper return type hints for DeepCrawlStrategy.arun method
- Added __call__ method to DeepCrawlStrategy for easier usage
- Removed redundant comments and imports
- Cleaned up type hints in DFS strategy
- Removed empty docker_client.py and .continuerules
- Added .private/ to gitignore

BREAKING CHANGE: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
2025-02-07 19:01:59 +08:00
Sezer Bozkır
926beee832 base-config structure is changed (#618)
refactor(docker): restructure docker-compose for modular configuration

- Added reusable base configuration block (x-base-config) for ports, environment variables, volumes, deployment resources, restart policy, and health check.
- Updated services to include base configuration directly using `<<: *base-config` syntax.
- Removed redundant `base-config` service definition.
2025-02-07 17:11:51 +08:00
UncleCode
a9415aaaf6 refactor(deep-crawling): reorganize deep crawling strategies and add new implementations
Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types.

BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.
2025-02-05 22:50:39 +08:00
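Note on a9415aaaf6: a best-first crawl is commonly built on a priority queue, so that the most promising known URL is expanded next. The sketch below shows that general technique on an in-memory link graph; the function name, the toy site, and the scoring rule are all assumptions and are not taken from the repository.

```python
import heapq
from typing import Callable, Dict, List


def best_first_order(
    start: str,
    links: Dict[str, List[str]],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, always expanding the highest-scoring known URL."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score(start), start)]
    seen = {start}
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-score(child), child))
    return visited


# Toy link graph standing in for fetched pages.
site = {
    "/": ["/blog", "/docs"],
    "/docs": ["/docs/api", "/docs/faq"],
    "/blog": ["/blog/post-1"],
}
# Toy scorer: URLs under /docs are treated as more relevant.
order = best_first_order(
    "/", site, lambda u: 1.0 if u.startswith("/docs") else 0.1, max_pages=4
)
```

Here the page budget is spent on the `/docs` subtree before `/blog` is ever expanded, which is the behaviour that distinguishes best-first from plain BFS or DFS.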
UncleCode
c308a794e8 refactor(deep-crawl): reorganize deep crawling functionality into dedicated module
Restructure deep crawling code into a dedicated module with improved organization:
- Move deep crawl logic from async_deep_crawl.py to deep_crawling/
- Create separate files for BFS strategy, filters, and scorers
- Improve code organization and maintainability
- Add optimized implementations for URL filtering and scoring
- Rename DeepCrawlHandler to DeepCrawlDecorator for clarity

BREAKING CHANGE: DeepCrawlStrategy and BreadthFirstSearchStrategy imports need to be updated to new package structure
2025-02-04 23:28:17 +08:00
UncleCode
bc7559586f feat(crawler): add deep crawling capabilities with BFS strategy
Implements deep crawling functionality with a new BreadthFirstSearch strategy:
- Add DeepCrawlStrategy base class and BFS implementation
- Integrate deep crawling with AsyncWebCrawler via decorator pattern
- Update CrawlerRunConfig to support deep crawling parameters
- Add pagination support for Google Search crawler

BREAKING CHANGE: AsyncWebCrawler.arun and arun_many return types now include deep crawl results
2025-02-04 01:24:49 +08:00
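Note on bc7559586f: the breadth-first traversal this commit describes can be illustrated with a queue and a depth limit. This is a minimal sketch over an in-memory link graph, assuming a `max_depth` cutoff; the function name and the sample graph are hypothetical, and real crawling would fetch pages instead of reading a dict.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_crawl_order(
    start: str, links: Dict[str, List[str]], max_depth: int
) -> List[Tuple[str, int]]:
    """Return (url, depth) pairs in breadth-first order, up to max_depth."""
    queue = deque([(start, 0)])
    seen = {start}
    order: List[Tuple[str, int]] = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return order


# Toy link graph; "/a" links back to "/" to show that cycles are handled.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/a/1", "/b/1"],
    "/a/1": ["/deep"],
}
levels = bfs_crawl_order("/", site, max_depth=2)
```

Every page at depth 1 is visited before any page at depth 2, and `/deep` is never reached because it sits at depth 3.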
UncleCode
04bc643cec feat(api): improve cache handling and add API tests
Changes cache mode from BYPASS to WRITE_ONLY when cache is disabled to ensure
results are still cached for future use. Also adds error handling for non-JSON
LLM responses and comprehensive API test suite.

- Changes default cache fallback from BYPASS to WRITE_ONLY
- Adds error handling for LLM JSON parsing
- Introduces new test suite for API endpoints
2025-02-02 20:53:31 +08:00
UncleCode
33a21d6a7a refactor(docker): improve server architecture and configuration
Complete overhaul of Docker deployment setup with improved architecture:
- Add Redis integration for task management
- Implement rate limiting and security middleware
- Add Prometheus metrics and health checks
- Improve error handling and logging
- Add support for streaming responses
- Implement proper configuration management
- Add platform-specific optimizations for ARM64/AMD64

BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure
2025-02-02 20:19:51 +08:00
UncleCode
7b1ef07c41 refactor(docker): remove unused models and utilities for cleaner codebase 2025-02-01 20:10:13 +08:00
UncleCode
2f15976b34 feat(docker): enhance Docker deployment setup and configuration
Add comprehensive Docker deployment configuration with:
- New .dockerignore and .llm.env.example files
- Enhanced Dockerfile with multi-stage build and optimizations
- Detailed README with setup instructions and environment configurations
- Improved requirements.txt with Gunicorn
- Better error handling in async_configs.py

BREAKING CHANGE: Docker deployment now requires .llm.env file for API keys
2025-02-01 19:33:27 +08:00
UncleCode
20920fa17b refactor(docker): clean up import statements in server.py 2025-02-01 14:28:28 +08:00
UncleCode
53ac3ec0b4 feat(docker): add Docker service integration and config serialization
Add Docker service integration with FastAPI server and client implementation.
Implement serialization utilities for BrowserConfig and CrawlerRunConfig to support
Docker service communication. Clean up imports and improve error handling.

- Add Crawl4aiDockerClient class
- Implement config serialization/deserialization
- Add FastAPI server with streaming support
- Add health check endpoint
- Clean up imports and type hints
2025-01-31 18:00:16 +08:00
UncleCode
ce4f04dad2 feat(docker): add Docker deployment configuration and API server
Add Docker deployment setup with FastAPI server implementation for Crawl4AI:
- Create Dockerfile with Python 3.10 and Playwright dependencies
- Implement FastAPI server with streaming and non-streaming endpoints
- Add request/response models and JSON serialization
- Include test script for API verification

Also includes:
- Update .gitignore for Continue development files
- Add project rules in .continuerules
- Clean up async_dispatcher.py formatting
2025-01-31 15:22:21 +08:00
UncleCode
f81712eb91 refactor(core): reorganize project structure and remove legacy code
Major reorganization of the project structure:
- Moved legacy synchronous crawler code to legacy folder
- Removed deprecated CLI and docs manager
- Consolidated version manager into utils.py
- Added CrawlerHub to __init__.py exports
- Fixed type hints in async_webcrawler.py
- Fixed minor bugs in chunking and crawler strategies

BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.
2025-01-30 19:35:06 +08:00
UncleCode
31938fb922 feat(crawler): enhance JavaScript execution and PDF processing
Add JavaScript execution result handling and improve PDF processing capabilities:
- Add js_execution_result to CrawlResult and AsyncCrawlResponse models
- Implement execution result capture in AsyncPlaywrightCrawlerStrategy
- Add batch processing for PDF pages with configurable batch size
- Enhance JsonElementExtractionStrategy with better schema generation
- Add HTML optimization utilities

BREAKING CHANGE: PDF processing now uses batch processing by default
2025-01-29 21:03:39 +08:00
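Note on 31938fb922: batch processing with a configurable batch size usually comes down to slicing the page list into fixed-size chunks and handling one chunk at a time. The helper below is a generic sketch of that idea; `iter_batches` and the page numbers are illustrative and do not reflect the project's PDF code.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Seven hypothetical page numbers, processed three at a time.
page_numbers = list(range(1, 8))
batches = list(iter_batches(page_numbers, batch_size=3))
```

The last batch is simply shorter when the page count is not a multiple of the batch size.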
UncleCode
f8fd9d9eff feat(pdf): add PDF processing capabilities
Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files

Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
2025-01-27 21:24:15 +08:00
UncleCode
dde14eba7d Update README.md (#562) 2025-01-26 11:00:28 +08:00
UncleCode
54c84079c4 docs(api): improve formatting and readability of API documentation
Enhanced markdown formatting, fixed list indentation, and improved readability across multiple API documentation files:
- arun.md
- arun_many.md
- async-webcrawler.md
- parameters.md

Changes include:
- Consistent list formatting and indentation
- Better spacing between sections
- Clearer separation of content blocks
- Fixed quotation marks and code block formatting
2025-01-25 22:06:11 +08:00
UncleCode
d0586f09a9 Merge branch 'vr0.4.3b3' 2025-01-25 21:57:29 +08:00
UncleCode
09ac7ed008 feat(demo): uncomment feature demos and add fake-useragent dependency
Uncomments demonstration code for memory dispatcher, streaming support,
content scraping, JSON schema generation, LLM markdown, and robots compliance
in the v0.4.3b2 features demo file. Also adds fake-useragent package as a
project dependency.

This change makes all feature demonstrations active by default and ensures
proper user agent handling capabilities.
2025-01-25 21:56:08 +08:00
UncleCode
97796f39d2 docs(examples): update proxy rotation demo and disable other demos
Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality.

No breaking changes.
2025-01-25 21:52:35 +08:00
UncleCode
4d7f91b378 refactor(user-agent): improve user agent generation system
Redesign user agent generation to be more modular and reliable:
- Add abstract base class UAGen for user agent generation
- Implement ValidUAGenerator using fake-useragent library
- Add OnlineUAGenerator for fetching real-world user agents
- Update browser configurations to use new UA generation system
- Improve client hints generation

This change makes the user agent system more maintainable and provides better real-world user agent coverage.
2025-01-25 21:16:39 +08:00
UncleCode
69a77222ef feat(browser): add CDP URL configuration support
Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser.

- Added cdp_url parameter to BrowserConfig
- Added cdp_url support in ManagedBrowser.start() method
- Updated documentation for new parameters
2025-01-24 15:53:47 +08:00
UncleCode
0afc3e9e5e refactor(examples): update API usage in features demo
Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls()
and fix result access patterns. Also improve code formatting and remove
extra whitespace.

- Replace dispatcher.run_urls with crawler.arun_many
- Update streaming demo to use new API and correct result access
- Clean up whitespace and formatting
- Simplify result property access patterns
2025-01-23 22:37:29 +08:00
UncleCode
65d33bcc0f style(docs): improve code formatting in features demo
Clean up whitespace and improve readability in v0_4_3b2_features_demo.py:
- Remove excessive blank lines between functions
- Improve config formatting for better readability
- Uncomment memory dispatcher demo in main function

No breaking changes.
2025-01-23 22:36:58 +08:00
UncleCode
6a01008a2b docs(multi-url): improve documentation clarity and update examples
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script

BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
2025-01-23 22:33:36 +08:00
UncleCode
6dc01eae3a refactor(core): improve type hints and remove unused file
- Add RelevantContentFilter to __init__.py exports
- Update version to 0.4.3b3
- Enhance type hints in async_configs.py
- Remove empty utils.scraping.py file
- Update mkdocs configuration with version info and GitHub integration

BREAKING CHANGE: None
2025-01-23 18:53:22 +08:00
UncleCode
7b7fe84e0d docs(readme): resolve merge conflict and update version info
Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements.

No breaking changes.
2025-01-22 20:52:42 +08:00
UncleCode
5c36f4308f Merge branch 'main' of https://github.com/unclecode/crawl4ai 2025-01-22 20:51:52 +08:00
UncleCode
45809d1c91 Merge branch 'vr0.4.3b2' 2025-01-22 20:51:46 +08:00
UncleCode
357414c345 docs(readme): update version references and fix links
Update version numbers to v0.4.3bx throughout README.md
Fix contributing guidelines link to point to CONTRIBUTORS.md
Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product
Add pre-release installation instructions
Fix minor formatting in personal story section

No breaking changes
2025-01-22 20:46:39 +08:00
UncleCode
260b9120c3 docs(examples): update v0.4.3 features demo to v0.4.3b2
Rename and replace the features demo file to reflect the beta 2 version number.
The old v0.4.3 demo file is removed and replaced with a new beta 2 version.

Renames:
- docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py
2025-01-22 20:41:43 +08:00
UncleCode
976ea52167 docs(examples): update demo scripts and fix output formats
Update example scripts to reflect latest API changes and improve demonstrations:
- Increase test URLs in dispatcher example from 20 to 40 pages
- Comment out unused dispatcher strategies for cleaner output
- Fix scraping strategies performance script to use correct object notation
- Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections

These changes make the examples more current and better aligned with the actual API.
2025-01-22 20:40:03 +08:00
UncleCode
2d69bf2366 refactor(models): rename final_url to redirected_url for consistency
Renames the final_url field to redirected_url across all components to maintain
consistent terminology throughout the codebase. This change affects:
- AsyncCrawlResponse model
- AsyncPlaywrightCrawlerStrategy
- Documentation and examples

No functional changes, purely naming consistency improvement.
2025-01-22 17:14:24 +08:00
UncleCode
dee5fe9851 feat(proxy): add proxy rotation support and documentation
Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include:
- Added proxy rotation demo in features example
- Updated proxy configuration handling in BrowserManager
- Added proxy rotation documentation
- Updated README with new proxy rotation feature
- Bumped version to 0.4.3b2

This change enables users to dynamically switch between proxies and verify IP addresses for each request.
2025-01-22 16:11:01 +08:00
UncleCode
88697c4630 docs(readme): update version and feature announcements for v0.4.3b1
Update README.md to announce version 0.4.3b1 release with new features including:
- Memory Dispatcher System
- Streaming Support
- LLM-Powered Markdown Generation
- Schema Generation
- Robots.txt Compliance

Add detailed version numbering explanation section to help users understand pre-release versions.
2025-01-21 21:20:04 +08:00
UncleCode
16b8d4945b feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
2025-01-21 21:03:11 +08:00
UncleCode
d09c611d15 feat(robots): add robots.txt compliance support
Add support for checking and respecting robots.txt rules before crawling websites:
- Implement RobotsParser class with SQLite caching
- Add check_robots_txt parameter to CrawlerRunConfig
- Integrate robots.txt checking in AsyncWebCrawler
- Update documentation with robots.txt compliance examples
- Add tests for robot parser functionality

The cache uses WAL mode for better concurrency and has a default TTL of 7 days.
2025-01-21 17:54:13 +08:00
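Note on d09c611d15: a SQLite-backed robots.txt cache with WAL mode and a 7-day TTL might look roughly like the sketch below. It uses only the standard library (`sqlite3`, `urllib.robotparser`); the `RobotsCache` class, its table layout, and the `is_allowed` helper are assumptions for illustration and are not the project's `RobotsParser`.

```python
import os
import sqlite3
import tempfile
import time
from typing import Optional
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # default TTL from the commit message, in seconds


class RobotsCache:
    """Hypothetical helper: stores robots.txt bodies per domain in SQLite."""

    def __init__(self, db_path: str, ttl: int = SEVEN_DAYS) -> None:
        self.ttl = ttl
        self.conn = sqlite3.connect(db_path)
        # WAL lets readers proceed while a writer is active.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS robots ("
            "domain TEXT PRIMARY KEY, rules TEXT NOT NULL, fetched_at REAL NOT NULL)"
        )
        self.conn.commit()

    def put(self, domain: str, rules: str, now: Optional[float] = None) -> None:
        fetched_at = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO robots VALUES (?, ?, ?)",
            (domain, rules, fetched_at),
        )
        self.conn.commit()

    def get(self, domain: str, now: Optional[float] = None) -> Optional[str]:
        """Return cached rules, or None if missing or older than the TTL."""
        current = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT rules, fetched_at FROM robots WHERE domain = ?", (domain,)
        ).fetchone()
        if row is None or current - row[1] > self.ttl:
            return None
        return row[0]


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


db_path = os.path.join(tempfile.mkdtemp(), "robots_cache.db")
cache = RobotsCache(db_path)
cache.put("example.com", "User-agent: *\nDisallow: /private/", now=0.0)
fresh = cache.get("example.com", now=60.0)
expired = cache.get("example.com", now=SEVEN_DAYS + 1.0)
```

The `now` parameter exists only so expiry can be exercised without waiting; a caller that gets `None` back would refetch robots.txt and call `put` again.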
UncleCode
9247877037 feat(proxy): add proxy configuration support to CrawlerRunConfig
Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config.

- Added proxy_config parameter to CrawlerRunConfig
- Updated BrowserManager to apply proxy settings from CrawlerRunConfig
- Updated proxy-security documentation with new usage examples
2025-01-20 22:14:05 +08:00
UncleCode
2cec527a22 feat(extraction): add LLM-powered schema generation utility
Adds new static method generate_schema() to JsonElementExtractionStrategy classes
that can automatically generate extraction schemas using LLM (OpenAI or Ollama).
This provides a convenient way to bootstrap extraction schemas while maintaining
the performance benefits of selector-based extraction.

Key changes:
- Added generate_schema() static method to base extraction strategy
- Added support for both CSS and XPath schema generation
- Updated documentation with examples and best practices
- Added new prompt templates for schema generation
2025-01-20 17:28:00 +08:00
UncleCode
4b1309cbf2 feat(crawler): add URL redirection tracking
Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections.

Changes include:
- Added final_url tracking in AsyncPlaywrightCrawlerStrategy
- Added redirected_url field to CrawlResult model
- Updated AsyncWebCrawler to properly handle and store redirect URLs
- Fixed typo in documentation signature
2025-01-19 19:53:38 +08:00
UncleCode
8b6fe6a98f docs(api): add streaming mode documentation and examples
Add comprehensive documentation for the new streaming mode feature in arun_many():
- Update arun_many() API docs to reflect streaming return type
- Add streaming examples in quickstart and multi-url guides
- Document stream parameter in configuration classes
- Add clone() helper method documentation for configs

This change improves documentation for processing large numbers of URLs efficiently.
2025-01-19 18:21:34 +08:00
UncleCode
91463e34f1 feat(config): add streaming support and config cloning
Add streaming capability to crawler configurations and introduce clone() methods
for both BrowserConfig and CrawlerRunConfig to support immutable config updates.
Move stream parameter from arun_many() method to CrawlerRunConfig.

BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method.
Use config.stream=True instead.
2025-01-19 17:51:47 +08:00
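Note on 91463e34f1: the "immutable config update" idea behind `clone()` can be illustrated with a frozen dataclass and `dataclasses.replace`. The `RunConfig` class and its `page_timeout_ms` field are hypothetical stand-ins; only the `stream` flag and the `clone()` name come from the commit message, and the real classes may be implemented differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in for a crawler config; fields are illustrative."""

    stream: bool = False
    page_timeout_ms: int = 60_000

    def clone(self, **overrides) -> "RunConfig":
        """Return a copy with the given fields replaced; the original is untouched."""
        return replace(self, **overrides)


base = RunConfig()
streaming = base.clone(stream=True)
```

Because the base object never changes, one shared default config can safely be specialised per call.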
UncleCode
1221be30a3 feat(browser): improve browser context management and add shared data support
Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks.
Implement browser context reuse based on config signatures to improve memory usage.
Fix Firefox/Webkit channel settings.
Add config parameter to hook callbacks for better context access.
Remove debug print statements.

BREAKING CHANGE: Hook callback signatures now include config parameter
2025-01-19 17:12:03 +08:00
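Note on 1221be30a3: reusing a browser context "based on config signatures" suggests hashing the relevant settings and using the hash as a cache key. The sketch below shows that pattern in isolation; `config_signature`, `ContextPool`, and the config keys are assumptions, and a plain `object()` stands in for a real browser context.

```python
import hashlib
import json
from typing import Any, Dict


def config_signature(config: Dict[str, Any]) -> str:
    """Stable hash of a config dict; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ContextPool:
    """Hypothetical pool that hands out one context object per distinct config."""

    def __init__(self) -> None:
        self._contexts: Dict[str, object] = {}

    def get(self, config: Dict[str, Any]) -> object:
        key = config_signature(config)
        if key not in self._contexts:
            # Stand-in for creating a real browser context.
            self._contexts[key] = object()
        return self._contexts[key]


pool = ContextPool()
first = pool.get({"user_agent": "bot/1.0", "viewport": [1280, 720]})
second = pool.get({"viewport": [1280, 720], "user_agent": "bot/1.0"})
third = pool.get({"user_agent": "bot/2.0", "viewport": [1280, 720]})
```

Two configs that differ only in key order share a context, while any change in a value produces a new one.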
Aravind
6dfa9cb703 Streamline Feature requests, bug reports and Forums with Forms & Templates (#465)
* config:Add bug report template and issue chooser

* config: updated new bugs to have needs-triage label by default

* Template for PR

* Added FR template

* Config: updated the text for new labels

* config: changed the order of steps to reproduce

* Config: shortened the form for feature request

* Config: Added a code snippet section to the bug report
2025-01-19 16:53:03 +08:00
UncleCode
e363234172 feat(dispatcher): add streaming support for URL processing
Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler
to allow processing URLs with real-time result streaming. This enables
processing results as they become available rather than waiting for all
URLs to complete.

Key changes:
- Add run_urls_stream method to MemoryAdaptiveDispatcher
- Update AsyncWebCrawler.arun_many to support streaming mode
- Add result queue for better result handling
- Improve type hints and documentation

BREAKING CHANGE: The return type of arun_many now depends on the 'stream'
parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]
2025-01-19 14:03:34 +08:00
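Note on e363234172: a return type that depends on a `stream` flag can be sketched with `asyncio` as below. The `run_many` and `fetch` names are hypothetical and no real crawling happens; the sketch only shows the two call patterns (a list when `stream` is false, an async generator when it is true).

```python
import asyncio
from typing import AsyncGenerator, List, Union


async def fetch(url: str) -> str:
    """Stand-in for crawling one URL."""
    await asyncio.sleep(0)
    return f"result:{url}"


async def _stream(urls: List[str]) -> AsyncGenerator[str, None]:
    """Yield each result as soon as its task finishes."""
    tasks = [asyncio.ensure_future(fetch(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def run_many(
    urls: List[str], stream: bool = False
) -> Union[List[str], AsyncGenerator[str, None]]:
    if stream:
        return _stream(urls)  # caller iterates with `async for`
    return list(await asyncio.gather(*(fetch(u) for u in urls)))


async def demo():
    urls = ["https://example.com/a", "https://example.com/b"]
    batch = await run_many(urls)
    streamed = [r async for r in await run_many(urls, stream=True)]
    return batch, streamed


batch, streamed = asyncio.run(demo())
```

In batch mode the results come back in input order; in streaming mode they arrive in completion order, so callers should not rely on positions matching the input list.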
UncleCode
3d09b6a221 feat(content-filter): add LLMContentFilter for intelligent markdown generation
Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content:
- Implement intelligent content filtering with customizable instructions
- Add chunk processing for handling large documents
- Support parallel processing of content chunks
- Include caching mechanism for filtered results
- Add usage tracking and statistics
- Update documentation with examples and use cases

Also includes minor changes:
- Disable Pydantic warnings in __init__.py
- Add new prompt template for content filtering
2025-01-18 19:31:07 +08:00
UncleCode
2d6b19e1a2 refactor(browser): improve browser path management
Implement more robust browser executable path handling using playwright's built-in browser management. This change:
- Adds async browser path resolution
- Implements path caching in the home folder
- Removes hardcoded browser paths
- Adds httpx dependency
- Removes obsolete test result files

This change makes the browser path resolution more reliable across different platforms and environments.
2025-01-17 22:14:37 +08:00