crawl4ai

Author	SHA1	Message	Date
ntohidi	6cd34b3157	Merge branch '2025-MAY-2' of https://github.com/unclecode/crawl4ai into 2025-MAY-2	2025-06-13 11:26:17 +02:00
ntohidi	871d4f1158	fix(extraction_strategy): rename response variable to content for clarity in LLMExtractionStrategy. ref #1146	2025-06-13 11:26:05 +02:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
UncleCode	1fc45ffac8	Fix temperature typo and enhance LinkedIn extraction with Colab support - Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files - Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction - Added Google Colab display server support for running Crawl4AI in notebook environments - Improved browser debugging with verbose startup args logging - Updated LinkedIn schemas and HTML snippets for better parsing accuracy 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-05-25 16:47:12 +08:00
ntohidi	32966bea11	fix(extraction): resolve `'str' object has no attribute 'choices'` error in LLMExtractionStrategy. Refs: #979 This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition of the `response` variable, which caused downstream exceptions during error handling.	2025-05-15 10:09:19 +02:00
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
UncleCode	3179d6ad0c	fix(core): improve error handling and stability in core components Enhance error handling and stability across multiple components: - Add safety checks in async_configs.py for type and params existence - Fix browser manager initialization and cleanup logic - Add default LLM config fallback in extraction strategy - Add comprehensive Docker deployment guide and server tests BREAKING CHANGE: BrowserManager.start() now automatically closes existing instances	2025-04-11 20:58:39 +08:00
UncleCode	4a20d7f7c2	feat(cli): add quick JSON extraction and global config management Adds new features to improve user experience and configuration: - Quick JSON extraction with -j flag for direct LLM-based structured data extraction - Global configuration management with 'crwl config' commands - Enhanced LLM extraction with better JSON handling and error management - New user settings for default behaviors (LLM provider, browser settings, etc.) Breaking changes: None	2025-03-25 20:30:25 +08:00
UncleCode	5358ac0fc2	refactor: clean up imports and improve JSON schema generation instructions	2025-03-18 18:53:34 +08:00
UncleCode	dc36997a08	feat(schema): improve HTML preprocessing for schema generation Add new preprocess_html_for_schema utility function to better handle HTML cleaning for schema generation. This replaces the previous optimize_html function in the GoogleSearchCrawler and includes smarter attribute handling and pattern detection. Other changes: - Update default provider to gpt-4o - Add DEFAULT_PROVIDER_API_KEY constant - Make LLMConfig creation more flexible with create_llm_config helper - Add new dependencies: zstandard and msgpack This change improves schema generation reliability while reducing noise in the processed HTML.	2025-03-12 22:40:46 +08:00
UncleCode	a68cbb232b	feat(browser): add standalone CDP browser launch and lxml extraction strategy Add new features to enhance browser automation and HTML extraction: - Add CDP browser launch capability with customizable ports and profiles - Implement JsonLxmlExtractionStrategy for faster HTML parsing - Add CLI command 'crwl cdp' for launching standalone CDP browsers - Support connecting to external CDP browsers via URL - Optimize selector caching and context-sensitive queries BREAKING CHANGE: LLMConfig import path changed from crawl4ai.types to crawl4ai	2025-03-07 20:55:56 +08:00
UncleCode	29f7915b79	fix(models): support float timestamps in CrawlStats Modify CrawlStats class to handle both datetime and float timestamp formats for start_time and end_time fields. This change improves compatibility with different time formats while maintaining existing functionality. Other minor changes: - Add datetime import in async_dispatcher - Update JsonElementExtractionStrategy kwargs handling No breaking changes.	2025-03-06 20:30:57 +08:00
UncleCode	baee4949d3	refactor(llm): rename LlmConfig to LLMConfig for consistency Rename LlmConfig to LLMConfig across the codebase to follow consistent naming conventions. Update all imports and usages to use the new name. Update documentation and examples to reflect the change. BREAKING CHANGE: LlmConfig has been renamed to LLMConfig. Users need to update their imports and usage.	2025-03-05 14:17:04 +08:00
Aravind	2af958e12c	Feat/llm config (#724) * feature: Add LlmConfig to easily configure and pass LLM configs to different strategies * pulled in next branch and resolved conflicts * feat: Add gemini and deepseek providers. Make ignore_cache in llm content filter to true by default to avoid confusions * Refactor: Update LlmConfig in LLMExtractionStrategy class and deprecate old params * updated tests, docs and readme	2025-02-21 15:41:37 +08:00
UncleCode	43e09da694	refactor(crawler): remove content filter functionality Remove content filter related code and parameters as part of simplifying the crawler configuration. This includes: - Removing ContentFilter import and related classes - Removing content_filter parameter from CrawlerRunConfig - Cleaning up LLMExtractionStrategy constructor parameters BREAKING CHANGE: Removed content_filter parameter from CrawlerRunConfig. Users should migrate to using extraction strategies for content filtering.	2025-02-12 21:59:19 +08:00
UncleCode	91a5fea11f	feat(cli): add command line interface with comprehensive features Implements a full-featured CLI for Crawl4AI with the following capabilities: - Basic and advanced web crawling - Configuration management via YAML/JSON files - Multiple extraction strategies (CSS, XPath, LLM) - Content filtering and optimization - Interactive Q&A capabilities - Various output formats - Comprehensive documentation and examples Also includes: - Home directory setup for configuration and cache - Environment variable support for API tokens - Test suite for CLI functionality	2025-02-10 16:58:52 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	31938fb922	feat(crawler): enhance JavaScript execution and PDF processing Add JavaScript execution result handling and improve PDF processing capabilities: - Add js_execution_result to CrawlResult and AsyncCrawlResponse models - Implement execution result capture in AsyncPlaywrightCrawlerStrategy - Add batch processing for PDF pages with configurable batch size - Enhance JsonElementExtractionStrategy with better schema generation - Add HTML optimization utilities BREAKING CHANGE: PDF processing now uses batch processing by default	2025-01-29 21:03:39 +08:00
UncleCode	f8fd9d9eff	feat(pdf): add PDF processing capabilities Add new PDF processing module with the following features: - PDF text extraction and formatting to HTML/Markdown - Image extraction with multiple format support (JPEG, PNG, TIFF) - Link extraction from PDF documents - Metadata extraction including title, author, dates - Support for both local and remote PDF files Also includes: - New configuration options for HTML attribute handling - Internal/external link filtering improvements - Version bump to 0.4.300b4	2025-01-27 21:24:15 +08:00
UncleCode	2cec527a22	feat(extraction): add LLM-powered schema generation utility Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation	2025-01-20 17:28:00 +08:00
UncleCode	20c027b79c	chore(cleanup): remove unused files and improve type hints - Remove .pre-commit-config.yaml and duplicate mkdocs configuration files - Add Optional type hint for proxy parameter in BrowserConfig - Fix type annotation for results list in AsyncWebCrawler - Move calculate_batch_size function import to model_loader - Update prompt imports in extraction_strategy.py No breaking changes.	2025-01-14 13:07:18 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	ae376f15fb	docs(extraction): add clarifying comments for CSS selector behavior Add explanatory comments to JsonCssExtractionStrategy._get_elements() method to clarify that it returns all matching elements using select() instead of select_one(). This helps developers understand the method's behavior and its difference from single element selection. Removed trailing whitespace at end of file.	2025-01-05 19:39:15 +08:00
UncleCode	72fbdac467	fix(extraction): JsonCss selector and crawler improvements - Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one - Add robust error handling to page_need_scroll with default fallback - Improve JSON extraction strategies documentation - Refactor content scraping strategy - Update version to 0.4.247	2025-01-05 19:26:46 +08:00
UncleCode	fb33a24891	Commit Message: - Added examples for Amazon product data extraction methods - Updated configuration options and enhance documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	849765712f	Enhance Crawl4AI with new features and documentation - Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.	2024-12-19 21:02:29 +08:00
UncleCode	393bb911c0	Enhance crawler strategies with new features - ReImplemented JsonXPathExtractionStrategy for enhanced JSON data extraction. - Updated existing extraction strategies for better performance. - Improved handling of response status codes during crawls.	2024-12-17 22:40:10 +08:00
UncleCode	399af801a1	Merge branch 'next'	2024-12-12 20:17:27 +08:00
UncleCode	2d31915f0a	Commit Message: Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
lu4nx	ba3e808802	fix: The extract method logs output only when self.verbose is set to True. (#314) Co-authored-by: lu4nx <lu4nx@lx-pc>	2024-12-09 17:19:26 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
UncleCode	4e2852d5ff	[v0.3.71] Enhance chunking strategies and improve overall performance - Add OverlappingWindowChunking and improve SlidingWindowChunking - Update CHUNK_TOKEN_THRESHOLD to 2048 tokens - Optimize AsyncPlaywrightCrawlerStrategy close method - Enhance flexibility in CosineStrategy with generic embedding model loading - Improve JSON-based extraction strategies - Add knowledge graph generation example	2024-10-19 18:36:59 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	2fada16abb	chore: Update crawl4ai package with AsyncWebCrawler and JsonCssExtractionStrategy	2024-09-03 23:32:27 +08:00
unclecode	c37614cbc8	Add Async Version, JsonCss Extrator	2024-09-03 01:27:00 +08:00
unclecode	b0e8b66666	Merge branch 'proxy-support' into staging	2024-09-01 16:35:14 +08:00
datehoer	fe9ff498ce	add proxy and add ai base_url	2024-08-26 16:12:49 +08:00
unclecode	dec3d44224	refactor: Update extraction strategy to handle schema extraction with non-empty schema This code change updates the `LLMExtractionStrategy` class to handle schema extraction when the schema is non-empty. Previously, the schema extraction was only triggered when the `extract_type` was set to "schema", regardless of whether a schema was provided. With this update, the schema extraction will only be performed if the `extract_type` is "schema" and a non-empty schema is provided. This ensures that the extraction strategy behaves correctly and avoids unnecessary schema extraction when not needed. Also "numpy" is removed from default installation mode.	2024-08-19 15:37:07 +08:00
unclecode	e5e6a34e80	## [v0.2.77] - 2024-08-04 Significant improvements in text processing and performance: - 🚀 Dependency reduction: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy. - 🤖 Transformer upgrade: Implemented text sequence classification using a transformer model for labeling text chunks. - ⚡ Performance enhancement: Improved model loading speed due to removal of spaCy dependency. - 🔧 Future-proofing: Laid groundwork for potential complete removal of spaCy dependency in future versions. These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.	2024-08-04 14:54:18 +08:00
unclecode	4d283ab386	## [v0.2.74] - 2024-07-08 A slew of exciting updates to improve the crawler's stability and robustness! 🎉 - 💻 UTF encoding fix: Resolved the Windows "charmap" error by adding UTF encoding. - 🛡️ Error handling: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy. - 🧹 Input sanitization: Improved input sanitization and handled encoding issues in LLMExtractionStrategy. - 🚮 Database cleanup: Removed existing database file and initialized a new one.	2024-07-08 16:33:25 +08:00
unclecode	9926eb9f95	feat: Bump version to v0.2.73 and update documentation This commit updates the version number to v0.2.73 and makes corresponding changes in the README.md and Dockerfile. Docker file install the default mode, this resolve many of installation issues. Additionally, the installation instructions are updated to include support for different modes. Setup.py doesn't have anymore dependancy on Spacy. The change log is also updated to reflect these changes. Supporting websites need with-head browser.	2024-07-03 15:19:22 +08:00
unclecode	61ae2de841	1/Update setup.py to support following modes: - default (most frequent mode) - torch - transformers - all 2/ Update Docker file 3/ Update documentation as well.	2024-06-30 00:15:29 +08:00
unclecode	21b110bfd7	Update LLMExtractionStrategy to disable chunking if specified, Add example of summarization for a web page.	2024-06-19 19:03:35 +08:00
unclecode	350ca1511b	chore: Update configuration values, create new example, and update Dockerfile and README	2024-06-19 18:48:20 +08:00
unclecode	539263a8ba	chore: Update configuration values for chunk token threshold, overlap rate, and minimum word threshold. Create a new example for LLMExtraction Strategy, update Dockerfile, and README	2024-06-19 18:32:20 +08:00
unclecode	51f26d12fe	Update for v0.2.2 - Support multiple JS scripts - Fixed some of bugs - Resolved a few issue relevant to Colab installation	2024-06-02 15:40:18 +08:00
Unclecode	53d1176d53	chore: Update extraction strategy to support GPU, MPS, and CPU, add batch processing for CPU devices	2024-05-19 16:18:58 +00:00

65 Commits
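
Commits 32966bea11 and 871d4f1158 describe a bug where reusing the name `response` for the extracted text caused `'str' object has no attribute 'choices'` during error handling. The following is a minimal sketch of that failure mode, not the library's actual code: `FakeResponse`, `Choice`, `Message`, `extract_buggy`, and `extract_fixed` are hypothetical names invented for illustration, and the JSON parsing step is an assumed stand-in for whatever the real method does with the content.

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Message:
    content: str


@dataclass
class Choice:
    message: Message


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an LLM completion response object."""
    choices: List[Choice]


def extract_buggy(response: FakeResponse) -> Any:
    try:
        # Rebinding `response` to a string hides the original object.
        response = response.choices[0].message.content
        return json.loads(response)
    except json.JSONDecodeError:
        # The handler still expects the response object, but `response`
        # is now a str, so this line raises AttributeError.
        return [{"error": True, "content": response.choices[0].message.content}]


def extract_fixed(response: FakeResponse) -> Any:
    # A separate name keeps the original response object intact.
    content = response.choices[0].message.content
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return [{"error": True, "content": content}]
```

Calling `extract_buggy` with content that is not valid JSON raises `AttributeError` inside the handler, while `extract_fixed` returns the error record. Keeping the extracted text under its own name (`content`) is the design choice the later commit message describes.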