Commit Graph

383 Commits

Author SHA1 Message Date
UncleCode
b0aa8bc9f7 Update README 2025-04-22 23:21:42 +08:00
UncleCode
4812f08a73 feat(docker): update Docker deployment for v0.6.0
Major updates to Docker deployment infrastructure:
- Switch default port to 11235 for all services
- Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints
- Simplify docker-compose.yml with auto-platform detection
- Update documentation with new features and examples
- Consolidate configuration and improve resource management

BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.
2025-04-22 22:35:25 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
0007aea204 Update changelog 2025-04-21 23:21:49 +08:00
UncleCode
b5c25731e6 feat(browser): add geolocation, locale and timezone support
Add support for controlling browser geolocation, locale and timezone settings:
- New GeolocationConfig class for managing GPS coordinates
- Add locale and timezone_id parameters to CrawlerRunConfig
- Update browser context creation to handle location settings
- Add example script for geolocation usage
- Update documentation with location-based identity features

This enables more precise control over browser identity and location reporting.
2025-04-21 23:20:59 +08:00
ntohidi
14a31456ef fix(docs): update browser-crawler-config example to include LLMContentFilter and DefaultMarkdownGenerator, fix syntax errors 2025-04-21 13:59:49 +02:00
Aravind Karnam
b27bb367e8 merge next. Resolve conflicts. Fix some import errors and error handling in server.py 2025-04-19 20:27:47 +05:30
UncleCode
3bf78ff47a refactor(docker-demo): enhance error handling and output formatting
Improve the Docker API demo script with better error handling, more detailed output,
and enhanced visualization:
- Add detailed error messages and stack traces for debugging
- Implement better status code handling and display
- Enhance JSON output formatting with monokai theme and word wrap
- Add depth information display for deep crawls
- Improve proxy usage reporting
- Fix port number inconsistency

No breaking changes.
2025-04-17 22:32:58 +08:00
UncleCode
fd899f66aa Merge branch 'next-fix-markdown-source' into next 2025-04-17 20:16:15 +08:00
UncleCode
30ec4f571f feat(docs): add comprehensive Docker API demo script
Add a new example script demonstrating Docker API usage with extensive features:
- Basic crawling with single/multi URL support
- Markdown generation with various filters
- Parameter demonstrations (CSS, JS, screenshots, SSL, proxies)
- Extraction strategies using CSS and LLM
- Deep crawling capabilities with streaming
- Integration examples with proxy rotation and SSL certificate fetching

Also includes minor formatting improvements in async_webcrawler.py
2025-04-17 20:16:11 +08:00
UncleCode
7db6b468d9 feat(markdown): add content source selection for markdown generation
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction

Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details

BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
2025-04-17 20:13:53 +08:00
Aravind Karnam
eed7f88f29 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-04-17 10:50:02 +05:30
UncleCode
230f22da86 refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.

BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
2025-04-15 22:27:18 +08:00
UncleCode
cd7ff6f9c1 feat(docs): add AI assistant interface and code copy button
Add new AI assistant chat interface with features:
- Real-time chat with markdown support
- Chat history management
- Citation tracking
- Selection-to-query functionality

Also adds code copy button to documentation code blocks and adjusts layout/styling.

Breaking changes: None
2025-04-14 23:00:47 +08:00
UncleCode
c56974cf59 feat(docs): enhance documentation UI with ToC and GitHub stats
Add new features to documentation UI:
- Add table of contents with scroll spy functionality
- Add GitHub repository statistics badge
- Implement new centered layout system with fixed sidebar
- Add conditional Playwright installation based on CRAWL4AI_MODE

Breaking changes: None
2025-04-14 20:46:32 +08:00
ntohidi
1f3b1251d0 docs(cli): add Crawl4AI CLI installation instructions to the CLI guide 2025-04-14 12:16:31 +02:00
Aravind Karnam
022f5c9e25 Merged next branch 2025-04-12 10:47:02 +05:30
UncleCode
108b2a8bfb Fixed capturing console messages for case the url is the local file. Update docker configuration (work in progress) 2025-04-10 23:22:38 +08:00
unclecode
66ac07b4f3 feat(crawler): add network request and console message capturing
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
2025-04-10 16:03:48 +08:00
UncleCode
a2061bf31e feat(crawler): add MHTML capture functionality
Add ability to capture web pages as MHTML format, which includes all page resources
in a single file. This enables complete page archival and offline viewing.

- Add capture_mhtml parameter to CrawlerRunConfig
- Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy
- Add mhtml field to CrawlResult and AsyncCrawlResponse models
- Add comprehensive tests for MHTML capture functionality
- Update documentation with MHTML capture details
- Add exclude_all_images option for better memory management

Breaking changes: None
2025-04-09 15:39:04 +08:00
UncleCode
9038e9acbd Merge branch 'main' into next 2025-04-08 17:43:42 +08:00
UncleCode
e1d9e2489c refactor(docs): update import statement in quickstart.py for improved clarity 2025-04-05 23:12:06 +08:00
UncleCode
b1693b1c21 Remove old quickstart files 2025-04-05 23:10:25 +08:00
UncleCode
49d904ca0a refactor(docs): enhance quickstart_examples.py with improved configuration and file handling 2025-04-05 22:57:45 +08:00
UncleCode
ca9351252a refactor(docs): update import paths and clean up example code in quickstart_examples.py 2025-04-05 22:55:56 +08:00
UncleCode
935d9d39f8 Add quickstart example set 2025-04-05 21:37:25 +08:00
Aravind Karnam
9e16a4bb26 Merge next and resolve conflicts 2025-04-02 12:18:23 +05:30
UncleCode
c635f6b9a2 refactor(browser): reorganize browser strategies and improve Docker implementation
Reorganize browser strategy code into separate modules for better maintainability and separation of concerns. Improve Docker implementation with:
- Add Alpine and Debian-based Dockerfiles for better container options
- Enhance Docker registry to share configuration with BuiltinBrowserStrategy
- Add CPU and memory limits to container configuration
- Improve error handling and logging
- Update documentation and examples

BREAKING CHANGE: DockerConfig, DockerRegistry, and DockerUtils have been moved to new locations and their APIs have been updated.
2025-03-27 21:35:13 +08:00
Aravind Karnam
efa73257c5 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-24 21:57:29 +05:30
UncleCode
4ab0893ffb feat(browser): implement modular browser management system
Adds a new browser management system with strategy pattern implementation:
- Introduces BrowserManager class with strategy pattern support
- Adds PlaywrightBrowserStrategy, CDPBrowserStrategy, and BuiltinBrowserStrategy
- Implements BrowserProfileManager for profile management
- Adds PagePoolConfig for browser page pooling
- Includes comprehensive test suite for all browser strategies

BREAKING CHANGE: Browser management has been moved to browser/ module. Direct usage of browser_manager.py and browser_profiler.py is deprecated.
2025-03-21 22:50:00 +08:00
Aravind Karnam
8cecbec7a7 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-20 17:07:53 +05:30
UncleCode
6432ff1257 feat(browser): add builtin browser management system
Implements a persistent browser management system that allows running a single shared browser instance
that can be reused across multiple crawler sessions. Key changes include:

- Added browser_mode config option with 'builtin', 'dedicated', and 'custom' modes
- Implemented builtin browser management in BrowserProfiler
- Added CLI commands for managing builtin browser (start, stop, status, restart, view)
- Modified browser process handling to support detached processes
- Added automatic builtin browser setup during package installation

BREAKING CHANGE: The browser_mode config option changes how browser instances are managed
2025-03-20 12:13:59 +08:00
Aravind Karnam
4359b12003 docs + fix: Update example for full page screenshot & PDF export. Fix the bug Error: crawl4ai.async_webcrawler.AsyncWebCrawler.aprocess_html() got multiple values for keyword argument - for screenshot param. https://github.com/unclecode/crawl4ai/issues/822#issuecomment-2732602118 2025-03-18 17:20:24 +05:30
Aravind Karnam
529a79725e docs: remove hallucinations from docs for CrawlerRunConfig + Add chunking strategy docs in the table 2025-03-18 16:14:00 +05:30
Aravind Karnam
84883be513 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-18 15:12:21 +05:30
UncleCode
b750542e6d feat(crawler): optimize single URL handling and add performance comparison
Add special handling for single URL requests in Docker API to use arun() instead of arun_many()
Add new example script demonstrating performance differences between sequential and parallel crawling
Update cache mode from aggressive to bypass in examples and tests
Remove unused dependencies (zstandard, msgpack)

BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples
2025-03-13 22:15:15 +08:00
Aravind Karnam
cbb8755972 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-03-13 10:42:22 +05:30
UncleCode
1630fbdafe feat(monitor): add real-time crawler monitoring system with memory management
Implements a comprehensive monitoring and visualization system for tracking web crawler operations in real-time. The system includes:
- Terminal-based dashboard with rich UI for displaying task statuses
- Memory pressure monitoring and adaptive dispatch control
- Queue statistics and performance metrics tracking
- Detailed task progress visualization
- Stress testing framework for memory management

This addition helps operators track crawler performance and manage memory usage more effectively.
2025-03-12 19:05:24 +08:00
UncleCode
9547bada3a feat(content): add target_elements parameter for selective content extraction
Adds new target_elements parameter to CrawlerRunConfig that allows more flexible content selection than css_selector. This enables focusing markdown generation and data extraction on specific elements while still processing the entire page for links and media.

Key changes:
- Added target_elements list parameter to CrawlerRunConfig
- Modified WebScrapingStrategy and LXMLWebScrapingStrategy to handle target_elements
- Updated documentation with examples and comparison between css_selector and target_elements
- Fixed table extraction in content_scraping_strategy.py

BREAKING CHANGE: Table extraction logic has been modified to better handle thead/tbody structures
2025-03-10 18:54:51 +08:00
UncleCode
9d69fce834 feat(scraping): add smart table extraction and analysis capabilities
Add comprehensive table detection and extraction functionality to the web scraping system:
- Implement intelligent table detection algorithm with scoring system
- Add table extraction with support for headers, rows, captions
- Update models to include tables in Media class
- Add table_score_threshold configuration option
- Add documentation and examples for table extraction
- Include crypto analysis example demonstrating table usage

This change enables users to extract structured data from HTML tables while intelligently filtering out layout tables.
2025-03-09 21:31:33 +08:00
UncleCode
4aeb7ef9ad refactor(proxy): consolidate proxy configuration handling
Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location.

Key changes:
- Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py
- Updated type hints in async_configs.py to support ProxyConfig
- Fixed proxy configuration handling in browser_manager.py
- Updated documentation and examples to use new import path

BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy
2025-03-07 23:14:11 +08:00
UncleCode
a68cbb232b feat(browser): add standalone CDP browser launch and lxml extraction strategy
Add new features to enhance browser automation and HTML extraction:
- Add CDP browser launch capability with customizable ports and profiles
- Implement JsonLxmlExtractionStrategy for faster HTML parsing
- Add CLI command 'crwl cdp' for launching standalone CDP browsers
- Support connecting to external CDP browsers via URL
- Optimize selector caching and context-sensitive queries

BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai
2025-03-07 20:55:56 +08:00
UncleCode
f78c46446b feat(deep-crawling): improve URL normalization and domain filtering
Enhance URL handling in deep crawling with:
- New URL normalization functions for consistent URL formats
- Improved domain filtering with subdomain support
- Added URLPatternFilter to public API
- Better URL deduplication in BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
2025-03-06 22:45:57 +08:00
UncleCode
b3ec7ce960 Merge branch 'vr0.5.0.post1' into next 2025-03-05 14:17:19 +08:00
UncleCode
baee4949d3 refactor(llm): rename LlmConfig to LLMConfig for consistency
Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions.
Update all imports and usages to use the new name.
Update documentation and examples to reflect the change.

BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.
2025-03-05 14:17:04 +08:00
UncleCode
9c58e4ce2e fix(docs): correct section numbering in deepcrawl_example.py tutorial 2025-03-04 20:57:33 +08:00
UncleCode
df6a6d5f4f refactor(docs): reorganize tutorial sections and update wrap-up example 2025-03-04 20:55:09 +08:00
UncleCode
56bc3c6e45 refactor(cli): improve CLI default command handling
Make 'crawl' the default command when no command is specified.
This improves user experience by allowing direct URL input without
explicitly specifying the 'crawl' command.

Also removes unnecessary blank lines in example code for better readability.
2025-03-04 20:28:16 +08:00
UncleCode
415c1c5bee refactor(core): replace float('inf') with math.inf
Replace float('inf') and float('-inf') with math.inf and -math.inf from the math module for better readability and performance. Also clean up imports and remove unused speed comparison code.

No breaking changes.
2025-03-04 18:23:55 +08:00
Aravind Karnam
504207faa6 docs: update text in llm-strategies.md to reflect new changes in LlmConfig 2025-03-03 19:24:44 +05:30