crawl4ai

Author	SHA1	Message	Date
UncleCode	9546773a07	fix: Move sentence-transformers to optional dependencies - Moved sentence-transformers from core to optional dependencies in pyproject.toml - Removed sentence-transformers from requirements.txt - Added proper ImportError handling with helpful installation message - This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default - Users who need embedding features can install with: pip install 'crawl4ai[transformer]'	2025-07-24 21:24:40 +08:00
ntohidi	36429a63de	fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105	2025-07-08 12:54:33 +02:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	1a73fb60db	feat(crawl4ai): Implement adaptive crawling feature This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456	2025-07-04 15:16:53 +08:00
UncleCode	5c9c305dbf	feat: Add advanced link head extraction with three-layer scoring system (#1 ) Squashed commit from feature/link-extractor branch implementing comprehensive link analysis: - Extract HTML head content from discovered links with parallel processing - Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores - New LinkExtractionConfig class for type-safe configuration - Pattern-based filtering for internal/external links - Comprehensive documentation and examples	2025-06-27 20:06:04 +08:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
ntohidi	773ed7b281	Merge branch '2025-APR-1' into 2025-MAY-2	2025-06-02 20:25:58 +02:00
UncleCode	7d0b447e1c	Update setup script to clarify virtual display setup message	2025-05-25 16:55:18 +08:00
UncleCode	33b0e222ca	Add Colab utilities and rename setup function for clarity	2025-05-25 16:50:56 +08:00
UncleCode	1fc45ffac8	Fix temperature typo and enhance LinkedIn extraction with Colab support - Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files - Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction - Added Google Colab display server support for running Crawl4AI in notebook environments - Improved browser debugging with verbose startup args logging - Updated LinkedIn schemas and HTML snippets for better parsing accuracy 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-05-25 16:47:12 +08:00
Ahmed-Tawfik94	b4fc60a555	#1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes	2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94	137ac014fb	#1105 :fix(metadata): optimize article metadata extraction using XPath for improved performance	2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94	faa98eefbc	#1105 got fixed (metadata now matches with meta property article:*	2025-05-19 11:35:13 +08:00
UncleCode	754ba731fa	Fix chunk splitting utilities (#1122 ) * Fix merge_chunks splitter usage and remove incorrect return * 📝 Add docstrings to `codex/find-and-fix-a-bug` (#1123) Docstrings generation was requested by @unclecode. * https://github.com/unclecode/crawl4ai/pull/1122#issuecomment-2887985865 The following files were modified: * `crawl4ai/utils.py` Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-05-17 15:06:53 +08:00
Aravind Karnam	f6e25e2a6b	fix: check_robots_txt to support wildcard rules ref: #699	2025-05-07 17:53:30 +05:30
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
UncleCode	0e5d672763	Merge branch 'pr-971' into merge-pr971	2025-05-01 18:57:28 +08:00
wakaka6	b2f3cb0dfa	WIP: logger migriate to rich	2025-04-11 00:44:43 +08:00
Aravind Karnam	7be5427283	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-03-27 12:29:32 +05:30
UncleCode	4a20d7f7c2	feat(cli): add quick JSON extraction and global config management Adds new features to improve user experience and configuration: - Quick JSON extraction with -j flag for direct LLM-based structured data extraction - Global configuration management with 'crwl config' commands - Enhanced LLM extraction with better JSON handling and error management - New user settings for default behaviors (LLM provider, browser settings, etc.) Breaking changes: None	2025-03-25 20:30:25 +08:00
Aravind Karnam	471d110c5e	fix: url normalisation ref: https://github.com/unclecode/crawl4ai/issues/841	2025-03-21 16:48:07 +05:30
Aravind Karnam	6740e87b4d	fix: remove trailing slash when the path is empty. This is causing dupicate crawls	2025-03-21 13:41:31 +05:30
UncleCode	dc36997a08	feat(schema): improve HTML preprocessing for schema generation Add new preprocess_html_for_schema utility function to better handle HTML cleaning for schema generation. This replaces the previous optimize_html function in the GoogleSearchCrawler and includes smarter attribute handling and pattern detection. Other changes: - Update default provider to gpt-4o - Add DEFAULT_PROVIDER_API_KEY constant - Make LLMConfig creation more flexible with create_llm_config helper - Add new dependencies: zstandard and msgpack This change improves schema generation reliability while reducing noise in the processed HTML.	2025-03-12 22:40:46 +08:00
UncleCode	f78c46446b	feat(deep-crawling): improve URL normalization and domain filtering Enhance URL handling in deep crawling with: - New URL normalization functions for consistent URL formats - Improved domain filtering with subdomain support - Added URLPatternFilter to public API - Better URL deduplication in BFS strategy These changes improve crawling accuracy and reduce duplicate visits.	2025-03-06 22:45:57 +08:00
UncleCode	b957ff2ecd	refactor(crawler): improve HTML handling and cleanup codebase - Add HTML attribute preservation in GoogleSearchCrawler - Fix lxml import references in utils.py - Remove unused ssl_certificate.json - Clean up imports and code organization in hub.py - Update test case formatting and remove unused image search test BREAKING CHANGE: Removed ssl_certificate.json file which might affect existing certificate validations	2025-02-07 21:56:27 +08:00
UncleCode	a9415aaaf6	refactor(deep-crawling): reorganize deep crawling strategies and add new implementations Split deep crawling code into separate strategy files for better organization and maintainability. Added new BFF (Best First) and DFS crawling strategies. Introduced base strategy class and common types. BREAKING CHANGE: Deep crawling implementation has been split into multiple files. Import paths for deep crawling strategies have changed.	2025-02-05 22:50:39 +08:00
UncleCode	33a21d6a7a	refactor(docker): improve server architecture and configuration Complete overhaul of Docker deployment setup with improved architecture: - Add Redis integration for task management - Implement rate limiting and security middleware - Add Prometheus metrics and health checks - Improve error handling and logging - Add support for streaming responses - Implement proper configuration management - Add platform-specific optimizations for ARM64/AMD64 BREAKING CHANGE: Docker deployment now requires Redis and new config.yml structure	2025-02-02 20:19:51 +08:00
UncleCode	f81712eb91	refactor(core): reorganize project structure and remove legacy code Major reorganization of the project structure: - Moved legacy synchronous crawler code to legacy folder - Removed deprecated CLI and docs manager - Consolidated version manager into utils.py - Added CrawlerHub to __init__.py exports - Fixed type hints in async_webcrawler.py - Fixed minor bugs in chunking and crawler strategies BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.	2025-01-30 19:35:06 +08:00
UncleCode	31938fb922	feat(crawler): enhance JavaScript execution and PDF processing Add JavaScript execution result handling and improve PDF processing capabilities: - Add js_execution_result to CrawlResult and AsyncCrawlResponse models - Implement execution result capture in AsyncPlaywrightCrawlerStrategy - Add batch processing for PDF pages with configurable batch size - Enhance JsonElementExtractionStrategy with better schema generation - Add HTML optimization utilities BREAKING CHANGE: PDF processing now uses batch processing by default	2025-01-29 21:03:39 +08:00
UncleCode	d09c611d15	feat(robots): add robots.txt compliance support Add support for checking and respecting robots.txt rules before crawling websites: - Implement RobotsParser class with SQLite caching - Add check_robots_txt parameter to CrawlerRunConfig - Integrate robots.txt checking in AsyncWebCrawler - Update documentation with robots.txt compliance examples - Add tests for robot parser functionality The cache uses WAL mode for better concurrency and has a default TTL of 7 days.	2025-01-21 17:54:13 +08:00
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	72fbdac467	fix(extraction): JsonCss selector and crawler improvements - Fix JsonCssExtractionStrategy._get_elements to return all matching elements instead of just one - Add robust error handling to page_need_scroll with default fallback - Improve JSON extraction strategies documentation - Refactor content scraping strategy - Update version to 0.4.247	2025-01-05 19:26:46 +08:00
UncleCode	fb33a24891	Commit Message: - Added examples for Amazon product data extraction methods - Updated configuration options and enhance documentation - Minor refactoring for improved performance and readability - Cleaned up version control settings.	2024-12-29 20:05:18 +08:00
UncleCode	d5ed451299	Enhance crawler capabilities and documentation - Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.	2024-12-25 21:34:31 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5188b7a6a0	Add full-page screenshot and PDF export features - Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance. - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`. - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters. - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.	2024-12-10 20:59:31 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
UncleCode	2d31915f0a	Commit Message: Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
UncleCode	8c611dcb4b	Refactored web scraping components - Enhanced the web scraping strategy with new methods for optimized media handling. - Added new utility functions for better content processing. - Refined existing features for improved accuracy and efficiency in scraping tasks. - Introduced more robust filtering criteria for media elements.	2024-12-05 22:33:47 +08:00
UncleCode	a036b7f122	feat: implement create_box_message utility for formatted error messages and enhance error logging in AsyncWebCrawler	2024-11-28 19:24:07 +08:00
UncleCode	24723b2f10	Enhance features and documentation - Updated version to 0.3.743 - Improved ManagedBrowser configuration with dynamic host/port - Implemented fast HTML formatting in web crawler - Enhanced markdown generation with a new generator class - Improved sanitization and utility functions - Added contributor details and pull request acknowledgments - Updated documentation for clearer usage scenarios - Adjusted tests to reflect class name changes	2024-11-28 12:45:05 +08:00
UncleCode	dbb751c8f0	In this commit, we introduce the new concept of MakrdownGenerationStrategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction.	2024-11-21 18:21:43 +08:00
UncleCode	b6af94cbbb	Merge remote-tracking branch 'origin/main' into 0.3.74	2024-11-18 21:15:04 +08:00
UncleCode	d0014c6793	New async database manager and migration support - Introduced AsyncDatabaseManager for async DB management. - Added migration feature to transition to file-based storage. - Enhanced web crawler with improved caching logic. - Updated requirements and setup for async processing.	2024-11-16 14:54:41 +08:00
UncleCode	3d00fee6c2	- In this commit, the library is updated to process file downloads. Users can now specify a download folder and trigger the download process via JavaScript or other means, with all files being saved. The list of downloaded files will also be added to the crowd result object. - Another thing this commit introduces is the concept of the Relevance Content Filter. This is an improvement over Fit Markdown. This class of strategies aims to extract the main content from a given page - the part that really matters and is useful to be processed. One strategy has been created using the BM25 algorithm, which finds chunks of text from the web page relevant to its title, descriptions, and keywords, or supports a given user query and matches them. The result is then returned to the main engine to be converted to Markdown. Plans include adding approaches using language models as well. - The cache database was updated to hold information about response headers and downloaded files.	2024-11-14 22:50:59 +08:00
UncleCode	c38ac29edb	perf(crawler): major performance improvements & raw HTML support - Switch to lxml parser (~4x speedup) - Add raw HTML & local file crawling support - Fix cache headers & async cleanup - Add browser process monitoring - Optimize BeautifulSoup operations - Pre-compile regex patterns Breaking: Raw HTML handling requires new URL prefixes Fixes: #256, #253	2024-11-13 19:40:40 +08:00
Mahesh	00026b5f8b	feat(config): Adding a configurable way of setting the cache directory for constrained environments	2024-11-12 14:52:51 -07:00
UncleCode	c5aa1bec18	Merge pull request #229 from bizrockman/main Preventing NoneType has no attribute get Errors	2024-11-06 07:31:07 +01:00

1 2

90 Commits