Commit Graph

689 Commits

Author SHA1 Message Date
UncleCode
14894b4d70 feat(config): set DefaultMarkdownGenerator as the default markdown generator in CrawlerRunConfig
feat(logger): add color mapping for log message formatting options
2025-04-03 20:34:19 +08:00
UncleCode
86df20234b fix(crawler): handle exceptions in get_page call to ensure page retrieval 2025-04-02 21:25:24 +08:00
UncleCode
179921a131 fix(crawler): update get_page call to include additional return value 2025-04-02 19:01:30 +08:00
UncleCode
c5cac2b459 feat(browser): add BrowserHub for centralized browser management and resource sharing 2025-04-01 20:35:02 +08:00
UncleCode
555455d710 feat(browser): implement browser pooling and page pre-warming
Adds a new BrowserManager implementation with browser pooling and page pre-warming capabilities:
- Adds support for managing multiple browser instances per configuration
- Implements page pre-warming for improved performance
- Adds configurable behavior for when no browsers are available
- Includes comprehensive status reporting and monitoring
- Maintains backward compatibility with existing API
- Adds demo script showcasing new features

BREAKING CHANGE: BrowserManager API now returns a strategy instance along with page and context
2025-03-31 21:55:07 +08:00
UncleCode
bb02398086 refactor(browser): improve browser strategy architecture and lifecycle management
Major refactoring of browser strategy implementations to improve code organization and reliability:
- Move CrawlResultContainer and RunManyReturn types from async_webcrawler to models.py
- Simplify browser lifecycle management in AsyncWebCrawler
- Standardize browser strategy interface with _generate_page method
- Improve headless mode handling and browser args construction
- Clean up Docker and Playwright strategy implementations
- Fix session management and context handling across strategies

BREAKING CHANGE: Browser strategy interface has changed with new _generate_page method requirement
2025-03-30 20:58:39 +08:00
UncleCode
3ff7eec8f3 refactor(browser): consolidate browser strategy implementations
Moves common browser functionality into BaseBrowserStrategy class to reduce code duplication and improve maintainability. Key changes:
- Adds shared browser argument building and session management to base class
- Standardizes storage state handling across strategies
- Improves process cleanup and error handling
- Consolidates CDP URL management and container lifecycle

BREAKING CHANGE: Changes browser_mode="custom" to "cdp" for consistency
2025-03-28 22:47:28 +08:00
UncleCode
64f20ab44a refactor(docker): update Dockerfile and browser strategy to use Chromium 2025-03-28 15:59:02 +08:00
UncleCode
c635f6b9a2 refactor(browser): reorganize browser strategies and improve Docker implementation
Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with:
- Add Alpine and Debian-based Dockerfiles for better container options
- Enhance Docker registry to share configuration with BuiltinBrowserStrategy
- Add CPU and memory limits to container configuration
- Improve error handling and logging
- Update documentation and examples

BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.
2025-03-27 21:35:13 +08:00
UncleCode
7f93e88379 refactor(tests): remove unused imports in test_docker_browser.py 2025-03-26 15:19:29 +08:00
UncleCode
40d4dd36c9 chore(version): bump version to 0.5.0.post8 and update post-installation setup 2025-03-25 21:56:49 +08:00
UncleCode
d8f38f2298 chore(version): bump version to 0.5.0.post7 2025-03-25 21:47:19 +08:00
UncleCode
5c88d1310d feat(cli): add output file option and integrate LXML web scraping strategy 2025-03-25 21:38:24 +08:00
UncleCode
4a20d7f7c2 feat(cli): add quick JSON extraction and global config management
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)

Breaking changes: None
2025-03-25 20:30:25 +08:00
UncleCode
6405cf0a6f Merge branch 'vr0.5.0.post5' into next 2025-03-25 14:51:29 +08:00
UncleCode
bdd9db579a chore(version): bump version to 0.5.0.post6
refactor(cli): remove unused import from FastAPI
2025-03-25 12:01:36 +08:00
UncleCode
1107fa1d62 feat(cli): enhance markdown generation with default content filters
Add DefaultMarkdownGenerator integration and automatic content filtering for markdown output formats. When using 'markdown-fit' or 'md-fit' output formats, automatically apply PruningContentFilter with default settings if no filter config is provided.

This change improves the user experience by providing sensible defaults for markdown generation while maintaining the ability to customize filtering behavior.
2025-03-25 11:56:00 +08:00
UncleCode
8c08521301 feat(browser): add Docker-based browser automation strategy
Implements a new browser strategy that runs Chrome in Docker containers,
providing better isolation and cross-platform consistency. Features include:
- Connect and launch modes for different container configurations
- Persistent storage support for maintaining browser state
- Container registry for efficient reuse
- Comprehensive test suite for Docker browser functionality

This addition allows users to run browser automation workloads in isolated
containers, improving security and resource management.
2025-03-24 21:36:58 +08:00
UncleCode
462d5765e2 fix(browser): improve storage state persistence in CDP strategy
Enhance storage state persistence mechanism in CDP browser strategy by:
- Explicitly saving storage state for each browser context
- Using proper file path for storage state
- Removing unnecessary sleep delay

Also includes test improvements:
- Simplified test configurations in playwright tests
- Temporarily disabled some CDP tests
2025-03-23 21:06:41 +08:00
UncleCode
6eeb2e4076 feat(browser): enhance browser context creation with user data directory support and improved storage state handling 2025-03-23 19:07:13 +08:00
UncleCode
0094cac675 refactor(browser): improve parallel crawling and browser management
Remove PagePoolConfig in favor of direct page management in browser strategies.
Add get_pages() method for efficient parallel page creation.
Improve storage state handling and persistence.
Add comprehensive parallel crawling tests and performance analysis.

BREAKING CHANGE: Removed PagePoolConfig class and related functionality.
2025-03-23 18:53:24 +08:00
UncleCode
4ab0893ffb feat(browser): implement modular browser management system
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
2025-03-21 22:50:00 +08:00
UncleCode
6432ff1257 feat(browser): add builtin browser management system
Implements a persistent browser management system that allows running a single shared browser instance
that can be reused across multiple crawler sessions. Key changes include:

- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation

BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
2025-03-20 12:13:59 +08:00
UncleCode
5358ac0fc2 refactor: clean up imports and improve JSON schema generation instructions 2025-03-18 18:53:34 +08:00
UncleCode
a24799918c feat(llm): add additional LLM configuration parameters
Extend LLMConfig class to support more fine-grained control over LLM behavior by adding:
- temperature control
- max tokens limit
- top_p sampling
- frequency and presence penalties
- stop sequences
- number of completions

These parameters allow for better customization of LLM responses.
2025-03-14 21:36:23 +08:00
UncleCode
a31d7b86be feat(changelog): update CHANGELOG for version 0.5.0.post5 with new features, changes, fixes, and breaking changes 2025-03-14 15:26:37 +08:00
UncleCode
7884a98be7 feat(crawler): add experimental parameters support and optimize browser handling
Add experimental parameters dictionary to CrawlerRunConfig to support beta features
Make CSP nonce headers optional via experimental config
Remove default cookie injection
Clean up browser context creation code
Improve code formatting in API handler

BREAKING CHANGE: Default cookie injection has been removed from page initialization
2025-03-14 14:39:24 +08:00
UncleCode
6e3c048328 feat(api): refactor crawl request handling to streamline single and multiple URL processing 2025-03-13 22:30:38 +08:00
UncleCode
b750542e6d feat(crawler): optimize single URL handling and add performance comparison
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
2025-03-13 22:15:15 +08:00
UncleCode
dc36997a08 feat(schema): improve HTML preprocessing for schema generation
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
2025-03-12 22:40:46 +08:00
UncleCode
1630fbdafe feat(monitor): add real-time crawler monitoring system with memory management
Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management

This addition helps operators track crawler performance and manage memory usage more effectively.
2025-03-12 19:05:24 +08:00
UncleCode
9547bada3a feat(content): add target_elements parameter for selective content extraction
Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media.

Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py

BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
2025-03-10 18:54:51 +08:00
UncleCode
9d69fce834 feat(scraping): add smart table extraction and analysis capabilities
Add comprehensive table detection and extraction functionality to the web scraping system:
- Implement intelligent table detection algorithm with scoring system
- Add table extraction with support for headers, rows, captions
- Update models to include tables in Media class
- Add table_score_threshold configuration option
- Add documentation and examples for table extraction
- Include crypto analysis example demonstrating table usage

This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.
2025-03-09 21:31:33 +08:00
UncleCode
c6a605ccce feat(filters): add reverse option to URLPatternFilter
Adds a new 'reverse' parameter to URLPatternFilter that allows inverting the filter's logic. When reverse=True, URLs that would normally match are rejected and vice versa.

Also removes unused 'scraped_html' from WebScrapingStrategy output to reduce memory usage.

BREAKING CHANGE: WebScrapingStrategy no longer returns 'scraped_html' in its output dictionary
2025-03-08 18:54:41 +08:00
UncleCode
4aeb7ef9ad refactor(proxy): consolidate proxy configuration handling
Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.

Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path

BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
2025-03-07 23:14:11 +08:00
UncleCode
a68cbb232b feat(browser): add standalone CDP browser launch and lxml extraction strategy
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries

BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
2025-03-07 20:55:56 +08:00
UncleCode
f78c46446b feat(deep-crawling): improve URL normalization and domain filtering
Enhance URL handling in deep crawling with:
- New URL normalization functions for consistent URL formats
- Improved domain filtering with subdomain support
- Added URLPatternFilter to public API
- Better URL deduplication in BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
2025-03-06 22:45:57 +08:00
UncleCode
1b72880007 chore(version): bump version to 0.5.0.post3 2025-03-06 20:32:32 +08:00
UncleCode
29f7915b79 fix(models): support float timestamps in CrawlStats
Modify CrawlStats class to handle both datetime and float timestamp formats for start_time and end_time fields. This change improves compatibility with different time formats while maintaining existing functionality.

Other minor changes:
- Add datetime import in async_dispatcher
- Update JsonElementExtractionStrategy kwargs handling

No breaking changes.
2025-03-06 20:30:57 +08:00
UncleCode
2327db6fdc refactor(crawler): introduce CrawlResultContainer and simplify interfaces
Introduces a new generic CrawlResultContainer class to standardize return types and
improve type safety. Removes legacy parameter handling and simplifies method signatures.
This change makes the API more consistent and easier to maintain.

BREAKING CHANGE: Synchronous crawler methods now always return CrawlResultContainer
instead of raw CrawlResult or List[CrawlResult]. Legacy parameters have been removed
from method signatures.
2025-03-05 22:23:08 +08:00
UncleCode
3a234ec950 fix(auth): make JWT authentication optional with fallback
Modify authentication system to gracefully handle cases where JWT is not enabled or token is missing. This includes:
- Making HTTPBearer auto_error=False to prevent automatic 403 errors
- Updating token dependency to return None when JWT is disabled
- Fixing model deserialization in CrawlResult
- Updating documentation links
- Cleaning up imports

BREAKING CHANGE: Authentication behavior changed to be more permissive when JWT is disabled
2025-03-05 17:14:42 +08:00
UncleCode
9e89d27fcd chore(version): bump version to 0.5.0.post2 2025-03-05 14:18:29 +08:00
UncleCode
b3ec7ce960 Merge branch 'vr0.5.0.post1' into next 2025-03-05 14:17:19 +08:00
UncleCode
baee4949d3 refactor(llm): rename LlmConfig to LLMConfig for consistency
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.

BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
2025-03-05 14:17:04 +08:00
UncleCode
9c58e4ce2e fix(docs): correct section numbering in deepcrawl_example.py tutorial v0.5.0.post1 2025-03-04 20:57:33 +08:00
UncleCode
df6a6d5f4f refactor(docs): reorganize tutorial sections and update wrap-up example 2025-03-04 20:55:09 +08:00
UncleCode
e896c08f9c chore(version): bump version to 0.5.0.post1 2025-03-04 20:29:27 +08:00
UncleCode
56bc3c6e45 refactor(cli): improve CLI default command handling
Make 'crawl' the default command when no command is specified.
This improves user experience by allowing direct URL input without
explicitly specifying the 'crawl' command.

Also removes unnecessary blank lines in example code for better readability.
2025-03-04 20:28:16 +08:00
UncleCode
cbef406f9b docs: update README for version 0.5.0 release with new features and CLI commands 2025-03-04 19:24:46 +08:00
UncleCode
8a76563018 chore(docs): update site version to v0.5.x in mkdocs configuration 2025-03-04 18:30:03 +08:00