Commit Graph

23 Commits

Author SHA1 Message Date
UncleCode
230f22da86 refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.

BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
2025-04-15 22:27:18 +08:00
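For code that has to run against versions on either side of the breaking change above, one option is to try the new import location first and fall back to the old one. The sketch below is illustrative only: `import_first` is a hypothetical helper, not part of crawl4ai, and the two module paths are taken from the commit message.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that has it.

    Hypothetical helper for illustration; it is not part of crawl4ai.
    """
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module not available in this install; try the next path
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of: {', '.join(module_paths)}")


# Intended use (requires crawl4ai to be installed), new path first, old path second:
# ProxyConfig = import_first("ProxyConfig", ["crawl4ai", "crawl4ai.proxy_strategy"])
```

Listing the new location first means the fallback is only exercised on older installs.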
UncleCode
4a20d7f7c2 feat(cli): add quick JSON extraction and global config management
Adds new features to improve user experience and configuration:
- Quick JSON extraction with -j flag for direct LLM-based structured data extraction
- Global configuration management with 'crwl config' commands
- Enhanced LLM extraction with better JSON handling and error management
- New user settings for default behaviors (LLM provider, browser settings, etc.)

Breaking changes: None
2025-03-25 20:30:25 +08:00
UncleCode
dc36997a08 feat(schema): improve HTML preprocessing for schema generation
Add new preprocess_html_for_schema utility function to better handle HTML cleaning
for schema generation. This replaces the previous optimize_html function in the
GoogleSearchCrawler and includes smarter attribute handling and pattern detection.

Other changes:
- Update default provider to gpt-4o
- Add DEFAULT_PROVIDER_API_KEY constant
- Make LLMConfig creation more flexible with create_llm_config helper
- Add new dependencies: zstandard and msgpack

This change improves schema generation reliability while reducing noise in the
processed HTML.
2025-03-12 22:40:46 +08:00
Aravind
2af958e12c Feat/llm config (#724)
* feature: Add LlmConfig to easily configure and pass LLM configs to different strategies

* pulled in next branch and resolved conflicts

* feat: Add gemini and deepseek providers. Set ignore_cache in the LLM content filter to true by default to avoid confusion

* Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params

* updated tests, docs and readme
2025-02-21 15:41:37 +08:00
UncleCode
91a5fea11f feat(cli): add command line interface with comprehensive features
Implements a full-featured CLI for Crawl4AI with the following capabilities:
- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples

Also includes:
- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality
2025-02-10 16:58:52 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections
2025-01-13 19:19:58 +08:00
UncleCode
a11d9646e3 Enhance crawler features and improve documentation
- Added detailed CrawlerRunConfig parameters documentation.
- Introduced plans for real-time event-driven crawling.
- Updated async logger default level to DEBUG for better insights.
- Improved structure and readability in configuration file.
- Enhanced documentation on future capabilities in new blog entries.
2024-12-16 18:52:51 +08:00
UncleCode
0982c639ae Enhance AsyncWebCrawler and related configurations
- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
2024-12-12 19:35:09 +08:00
UncleCode
5431fa2d0c Add PDF & screenshot functionality, new tutorial
- Added support for exporting pages as PDFs
- Enhanced screenshot functionality for long pages
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.
2024-12-10 20:10:39 +08:00
UncleCode
152ac35bc2 feat(docs): update README for version 0.3.74 with new features and improvements
fix(version): update version number to 0.3.74
refactor(async_webcrawler): enhance logging and add domain-based request delay
2024-11-17 21:09:26 +08:00
UncleCode
3a66aa8a60 feat(cache): introduce CacheMode and CacheContext for enhanced caching behavior
chore(requirements): add colorama dependency
refactor(config): add SHOW_DEPRECATION_WARNINGS flag and clean up code
fix(docs): update example scripts for clarity and consistency
2024-11-17 15:30:56 +08:00
UncleCode
d0014c6793 New async database manager and migration support
- Introduced AsyncDatabaseManager for async DB management.
- Added migration feature to transition to file-based storage.
- Enhanced web crawler with improved caching logic.
- Updated requirements and setup for async processing.
2024-11-16 14:54:41 +08:00
UncleCode
bf91adf3f8 fix: Resolve unexpected BrowserContext closure during crawl in Docker
- Removed __del__ method in AsyncPlaywrightCrawlerStrategy to ensure reliable browser lifecycle management by using explicit context managers.
- Added process monitoring in ManagedBrowser to detect and log unexpected terminations of the browser subprocess.
- Updated Docker configuration to expose port 9222 for remote debugging and allocate extra shared memory to prevent browser crashes.
- Improved error handling and resource cleanup for browser instances, particularly in Docker environments.

Resolves Issue #256
2024-11-13 15:37:16 +08:00
UncleCode
bcfe83f702 feat: enhance crawler with overlay removal and improved screenshot capabilities
• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.

Breaking changes: None
Issue numbers: None
2024-10-24 20:22:47 +08:00
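The URL normalization and duplicate detection mentioned in the commit above can be pictured with a small standard-library sketch. This is an assumption-laden illustration, not the crawler's actual code: `normalize_url` and `unique_links` are hypothetical names, and the rules chosen here (resolve against the page URL, lowercase scheme and host, drop the fragment) are only one plausible set.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(href, base_url):
    """Resolve `href` against `base_url` and return a canonical form.

    Hypothetical rules for illustration: lowercase scheme and netloc,
    use "/" for an empty path, and drop the fragment.
    """
    absolute = urljoin(base_url, href.strip())
    parts = urlsplit(absolute)
    # Note: lowercasing the whole netloc also lowercases any user:password part.
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


def unique_links(hrefs, base_url):
    """Return normalized links in first-seen order, skipping duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        url = normalize_url(href, base_url)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Dropping the fragment treats `/a` and `/a#top` as the same page, which is usually what a crawler wants but would be wrong for sites that route on the fragment.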
UncleCode
6ec4cb33ca Enhance Markdown generation and external content control
- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading
2024-10-20 18:56:58 +08:00
UncleCode
4e2852d5ff [v0.3.71] Enhance chunking strategies and improve overall performance
- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
2024-10-19 18:36:59 +08:00
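As a rough picture of what overlapping-window chunking means, the sketch below splits text into word windows that share a fixed number of words with their neighbour. It is a minimal illustration under stated assumptions, not the OverlappingWindowChunking class from this commit: the function name, parameters, and word-based splitting are all hypothetical.

```python
def overlapping_window_chunks(text, window_size=5, overlap=2):
    """Split `text` into chunks of `window_size` words, each sharing
    `overlap` words with the previous chunk (hypothetical sketch)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    words = text.split()
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary visible in both neighbouring chunks, at the cost of processing some words twice.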
unclecode
9ee988753d refactor: Update image description minimum word threshold in get_content_of_website_optimized
2024-08-02 14:53:11 +08:00
Aravind Karnam
cf6c835e18 Moved score threshold to config.py and replaced the separator for tag.get_text in the find_closest_parent_with_useful_text function from a period (.) to a space ( ) to keep the text more neutral.
2024-07-21 15:18:23 +05:30
unclecode
539263a8ba chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README
2024-06-19 18:32:20 +08:00
unclecode
c8589f8da3 Update:
- Fix Spacy model issue
- Update Readme and requirements.txt
2024-05-16 19:50:20 +08:00
unclecode
f6e59157bf - Test all methods
- Update index.html
- Update Readme
- Resolve some bugs
2024-05-14 21:27:41 +08:00
unclecode
88643612e8 chore: Update environment variable usage in config files
2024-05-09 22:37:01 +08:00
unclecode
3ff1d15702 Change the project folder name from crawler to crawl4ai
2024-05-09 22:16:28 +08:00