crawl4ai

Author	SHA1	Message	Date
UncleCode	d0586f09a9	Merge branch 'vr0.4.3b3'	2025-01-25 21:57:29 +08:00
UncleCode	09ac7ed008	feat(demo): uncomment feature demos and add fake-useragent dependency Uncomments demonstration code for memory dispatcher, streaming support, content scraping, JSON schema generation, LLM markdown, and robots compliance in the v0.4.3b2 features demo file. Also adds fake-useragent package as a project dependency. This change makes all feature demonstrations active by default and ensures proper user agent handling capabilities.	2025-01-25 21:56:08 +08:00
UncleCode	97796f39d2	docs(examples): update proxy rotation demo and disable other demos Modify proxy rotation example to include empty user agent setting and comment out other demo functions for focused testing. This change simplifies the demo file to focus specifically on proxy rotation functionality. No breaking changes.	2025-01-25 21:52:35 +08:00
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	0afc3e9e5e	refactor(examples): update API usage in features demo Update the demo script to use the new crawler.arun_many() API instead of dispatcher.run_urls() and fix result access patterns. Also improve code formatting and remove extra whitespace. - Replace dispatcher.run_urls with crawler.arun_many - Update streaming demo to use new API and correct result access - Clean up whitespace and formatting - Simplify result property access patterns	2025-01-23 22:37:29 +08:00
UncleCode	65d33bcc0f	style(docs): improve code formatting in features demo Clean up whitespace and improve readability in v0_4_3b2_features_demo.py: - Remove excessive blank lines between functions - Improve config formatting for better readability - Uncomment memory dispatcher demo in main function No breaking changes.	2025-01-23 22:36:58 +08:00
UncleCode	6a01008a2b	docs(multi-url): improve documentation clarity and update examples - Restructure multi-URL crawling documentation with better formatting and examples - Update code examples to use new API syntax (arun_many) - Add detailed parameter explanations for RateLimiter and Dispatchers - Enhance CSS styling for better documentation readability - Fix outdated method calls in feature demo script BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples	2025-01-23 22:33:36 +08:00
UncleCode	6dc01eae3a	refactor(core): improve type hints and remove unused file - Add RelevantContentFilter to __init__.py exports - Update version to 0.4.3b3 - Enhance type hints in async_configs.py - Remove empty utils.scraping.py file - Update mkdocs configuration with version info and GitHub integration BREAKING CHANGE: None	2025-01-23 18:53:22 +08:00
UncleCode	7b7fe84e0d	docs(readme): resolve merge conflict and update version info Resolves merge conflict in README.md by removing outdated version 0.4.24x information and keeping current version 0.4.3bx details. Updates release notes description to reflect current features including Memory Dispatcher System, Streaming Support, and other improvements. No breaking changes.	2025-01-22 20:52:42 +08:00
UncleCode	5c36f4308f	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2025-01-22 20:51:52 +08:00
UncleCode	45809d1c91	Merge branch 'vr0.4.3b2'	2025-01-22 20:51:46 +08:00
UncleCode	357414c345	docs(readme): update version references and fix links Update version numbers to v0.4.3bx throughout README.md Fix contributing guidelines link to point to CONTRIBUTORS.md Update Aravind's role in CONTRIBUTORS.md to Head of Community and Product Add pre-release installation instructions Fix minor formatting in personal story section No breaking changes	2025-01-22 20:46:39 +08:00
UncleCode	260b9120c3	docs(examples): update v0.4.3 features demo to v0.4.3b2 Rename and replace the features demo file to reflect the beta 2 version number. The old v0.4.3 demo file is removed and replaced with a new beta 2 version. Renames: - docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py	2025-01-22 20:41:43 +08:00
UncleCode	976ea52167	docs(examples): update demo scripts and fix output formats Update example scripts to reflect latest API changes and improve demonstrations: - Increase test URLs in dispatcher example from 20 to 40 pages - Comment out unused dispatcher strategies for cleaner output - Fix scraping strategies performance script to use correct object notation - Update v0_4_3_features_demo with additional feature mentions and uncomment demo sections These changes make the examples more current and better aligned with the actual API.	2025-01-22 20:40:03 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00
UncleCode	dee5fe9851	feat(proxy): add proxy rotation support and documentation Implements dynamic proxy rotation functionality with authentication support and IP verification. Updates include: - Added proxy rotation demo in features example - Updated proxy configuration handling in BrowserManager - Added proxy rotation documentation - Updated README with new proxy rotation feature - Bumped version to 0.4.3b2 This change enables users to dynamically switch between proxies and verify IP addresses for each request.	2025-01-22 16:11:01 +08:00
UncleCode	88697c4630	docs(readme): update version and feature announcements for v0.4.3b1 Update README.md to announce version 0.4.3b1 release with new features including: - Memory Dispatcher System - Streaming Support - LLM-Powered Markdown Generation - Schema Generation - Robots.txt Compliance Add detailed version numbering explanation section to help users understand pre-release versions.	2025-01-21 21:20:04 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00
UncleCode	d09c611d15	feat(robots): add robots.txt compliance support Add support for checking and respecting robots.txt rules before crawling websites: - Implement RobotsParser class with SQLite caching - Add check_robots_txt parameter to CrawlerRunConfig - Integrate robots.txt checking in AsyncWebCrawler - Update documentation with robots.txt compliance examples - Add tests for robot parser functionality The cache uses WAL mode for better concurrency and has a default TTL of 7 days.	2025-01-21 17:54:13 +08:00
UncleCode	9247877037	feat(proxy): add proxy configuration support to CrawlerRunConfig Add proxy_config parameter to CrawlerRunConfig to support dynamic proxy configuration per crawl request. This enables users to specify different proxy settings for each crawl operation without modifying the browser config. - Added proxy_config parameter to CrawlerRunConfig - Updated BrowserManager to apply proxy settings from CrawlerRunConfig - Updated proxy-security documentation with new usage examples	2025-01-20 22:14:05 +08:00
UncleCode	2cec527a22	feat(extraction): add LLM-powered schema generation utility Adds new static method generate_schema() to JsonElementExtractionStrategy classes that can automatically generate extraction schemas using LLM (OpenAI or Ollama). This provides a convenient way to bootstrap extraction schemas while maintaining the performance benefits of selector-based extraction. Key changes: - Added generate_schema() static method to base extraction strategy - Added support for both CSS and XPath schema generation - Updated documentation with examples and best practices - Added new prompt templates for schema generation	2025-01-20 17:28:00 +08:00
UncleCode	4b1309cbf2	feat(crawler): add URL redirection tracking Add capability to track and return final URLs after redirects in crawler responses. This enhancement helps users understand the actual destination of crawled URLs after any redirections. Changes include: - Added final_url tracking in AsyncPlaywrightCrawlerStrategy - Added redirected_url field to CrawlResult model - Updated AsyncWebCrawler to properly handle and store redirect URLs - Fixed typo in documentation signature	2025-01-19 19:53:38 +08:00
UncleCode	8b6fe6a98f	docs(api): add streaming mode documentation and examples Add comprehensive documentation for the new streaming mode feature in arun_many(): - Update arun_many() API docs to reflect streaming return type - Add streaming examples in quickstart and multi-url guides - Document stream parameter in configuration classes - Add clone() helper method documentation for configs This change improves documentation for processing large numbers of URLs efficiently.	2025-01-19 18:21:34 +08:00
UncleCode	91463e34f1	feat(config): add streaming support and config cloning Add streaming capability to crawler configurations and introduce clone() methods for both BrowserConfig and CrawlerRunConfig to support immutable config updates. Move stream parameter from arun_many() method to CrawlerRunConfig. BREAKING CHANGE: Removed stream parameter from AsyncWebCrawler.arun_many() method. Use config.stream=True instead.	2025-01-19 17:51:47 +08:00
UncleCode	1221be30a3	feat(browser): improve browser context management and add shared data support Add shared_data parameter to CrawlerRunConfig to allow data sharing between hooks. Implement browser context reuse based on config signatures to improve memory usage. Fix Firefox/Webkit channel settings. Add config parameter to hook callbacks for better context access. Remove debug print statements. BREAKING CHANGE: Hook callback signatures now include config parameter	2025-01-19 17:12:03 +08:00
Aravind	6dfa9cb703	Streamline Feature requests, bug reports and Forums with Forms & Templates (#465) * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config:Add bug report template and issue chooser * config: updated new bugs to have needs-triage label by default * Template for PR * Template for PR * Template for PR * Template for PR * Added FR template * Added FR template * Added FR template * Added FR template * Config: updated the text for new labels * config: changed the order of steps to reproduce * Config: shortened the form for feature request * Config: Added a code snippet section to the bug report	2025-01-19 16:53:03 +08:00
UncleCode	e363234172	feat(dispatcher): add streaming support for URL processing Add new streaming capability to the MemoryAdaptiveDispatcher and AsyncWebCrawler to allow processing URLs with real-time result streaming. This enables processing results as they become available rather than waiting for all URLs to complete. Key changes: - Add run_urls_stream method to MemoryAdaptiveDispatcher - Update AsyncWebCrawler.arun_many to support streaming mode - Add result queue for better result handling - Improve type hints and documentation BREAKING CHANGE: The return type of arun_many now depends on the 'stream' parameter, returning either List[CrawlResult] or AsyncGenerator[CrawlResult, None]	2025-01-19 14:03:34 +08:00
UncleCode	3d09b6a221	feat(content-filter): add LLMContentFilter for intelligent markdown generation Add new LLMContentFilter class that uses LLMs to generate high-quality markdown content: - Implement intelligent content filtering with customizable instructions - Add chunk processing for handling large documents - Support parallel processing of content chunks - Include caching mechanism for filtered results - Add usage tracking and statistics - Update documentation with examples and use cases Also includes minor changes: - Disable Pydantic warnings in __init__.py - Add new prompt template for content filtering	2025-01-18 19:31:07 +08:00
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	ece9202b61	fix(dispatcher): adjust memory threshold and fix dispatcher initialization - Increase memory threshold from 70% to 90% for better resource utilization - Remove incorrect self parameter from MemoryAdaptiveDispatcher initialization These changes improve the crawler's performance by allowing more memory usage before throttling and fix a bug in dispatcher initialization.	2025-01-16 21:58:52 +08:00
UncleCode	9d694da939	fix(models): make model fields optional with default values Make fields in MediaItem and Link models optional with default values to prevent validation errors when data is incomplete. Also expose BaseDispatcher in __init__ and fix markdown field handling in database manager. BREAKING CHANGE: MediaItem and Link model fields are now optional with default values which may affect existing code expecting required fields.	2025-01-15 22:58:14 +08:00
UncleCode	20c027b79c	chore(cleanup): remove unused files and improve type hints - Remove .pre-commit-config.yaml and duplicate mkdocs configuration files - Add Optional type hint for proxy parameter in BrowserConfig - Fix type annotation for results list in AsyncWebCrawler - Move calculate_batch_size function import to model_loader - Update prompt imports in extraction_strategy.py No breaking changes.	2025-01-14 13:07:18 +08:00
devatbosch	8878b3d032	Updated the correct link for "Contribution guidelines" in README.md (#445) Thank you for pointing this out. I am creating a contributing guide, which is why I changed the name to the contributors, but I forgot to update some other places. Thanks again.	2025-01-13 20:57:31 +08:00
Jōnin bingi	1ab9d115cf	Fixing minor typos in README (#440) @mcam10 Thx for the support. Appreciate	2025-01-13 20:23:52 +08:00
UncleCode	8ec12d7d68	Apply Ruff Corrections	2025-01-13 19:19:58 +08:00
UncleCode	c3370ec5da	refactor(scraping): replace ScrapingMode enum with strategy pattern Replace the ScrapingMode enum with a proper strategy pattern implementation for content scraping. This change introduces: - New ContentScrapingStrategy abstract base class - Concrete WebScrapingStrategy and LXMLWebScrapingStrategy implementations - New Pydantic models for structured scraping results - Updated documentation reflecting the new strategy-based approach BREAKING CHANGE: ScrapingMode enum has been removed. Users should now use ContentScrapingStrategy implementations instead.	2025-01-13 17:53:12 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	825c78a048	refactor(dispatcher): migrate to modular dispatcher system with enhanced monitoring Reorganize dispatcher functionality into separate components: - Create dedicated dispatcher classes (MemoryAdaptive, Semaphore) - Add RateLimiter for smart request throttling - Implement CrawlerMonitor for real-time progress tracking - Move dispatcher config from CrawlerRunConfig to separate classes BREAKING CHANGE: Dispatcher configuration moved from CrawlerRunConfig to dedicated dispatcher classes. Users need to update their configuration approach for multi-URL crawling.	2025-01-11 21:10:27 +08:00
UncleCode	3865342c93	Merge branch 'next' into next-cdp	2025-01-10 16:01:49 +08:00
UncleCode	ac5f461d40	feat(crawler): add memory-adaptive dispatcher with rate limiting Implements a new MemoryAdaptiveDispatcher class to manage concurrent crawling operations with memory monitoring and rate limiting capabilities. Changes include: - Added RateLimitConfig dataclass for configuring rate limiting behavior - Extended CrawlerRunConfig with dispatcher-related settings - Refactored arun_many to use the new dispatcher system - Added memory threshold and session permit controls - Integrated optional progress monitoring display BREAKING CHANGE: The arun_many method now uses MemoryAdaptiveDispatcher by default, which may affect concurrent crawling behavior	2025-01-10 16:01:18 +08:00
UncleCode	f9c601eb7e	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com across README, examples, and documentation files. This change reflects the new documentation hosting domain. Also add todo/ directory to .gitignore.	2025-01-09 16:24:41 +08:00
UncleCode	e8b4ac6046	docs(urls): update documentation URLs to new domain Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com Improve badges styling and layout in documentation Increase code font size in documentation CSS BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com	2025-01-09 16:22:41 +08:00
UncleCode	051a6cf974	docs(readme): update personal story and project vision Revise the README's personal story section to better reflect the project's origins, motivation, and vision for open-source data accessibility. Add more detail about the creator's background and the project's mission to democratize AI through open data access. Also includes a minor TODO comment addition in async crawler strategy.	2025-01-08 21:13:31 +08:00
UncleCode	1c9464b988	Update all documents	2025-01-08 19:31:31 +08:00
UncleCode	6838901788	Update All docs 2025 8th Jan	2025-01-08 19:31:17 +08:00
UncleCode	ad5e5d21ca	Remove .codeiumignore from version control and add to .gitignore	2025-01-08 13:09:23 +08:00
UncleCode	26d821c0de	Remove .codeiumignore from version control and add to .gitignore	2025-01-08 13:08:19 +08:00
UncleCode	010677cbee	chore: add .gitattributes file Add initial .gitattributes file to standardize line endings and file handling across different operating systems. This will help prevent issues with line ending inconsistencies between developers working on different platforms.	2025-01-08 13:05:00 +08:00
UncleCode	c110d459fb	Update .gitattributes	2025-01-07 21:20:17 +08:00

592 Commits