crawl4ai

Author	SHA1	Message	Date
ntohidi	437395e490	Merge branch 'feat/undetected-browser' into develop-future	2025-08-06 15:03:30 +08:00
UncleCode	9546773a07	fix: Move sentence-transformers to optional dependencies - Moved sentence-transformers from core to optional dependencies in pyproject.toml - Removed sentence-transformers from requirements.txt - Added proper ImportError handling with helpful installation message - This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default - Users who need embedding features can install with: pip install 'crawl4ai[transformer]'	2025-07-24 21:24:40 +08:00
unclecode	6a728cbe5b	feat: add stealth mode and enhance undetected browser support - Add playwright-stealth integration with enable_stealth parameter in BrowserConfig - Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern - Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching - Update install.py to install both playwright and patchright browsers automatically - Add comprehensive documentation for anti-bot features (stealth mode + undetected browser) - Create examples demonstrating stealth mode usage and comparison tests - Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies - Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium) - Add dependency checker tool in tests/check_dependencies.py Breaking changes: None - all existing functionality preserved 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 16:59:10 +08:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	1a73fb60db	feat(crawl4ai): Implement adaptive crawling feature This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456	2025-07-04 15:16:53 +08:00
UncleCode	c0fd36982d	Update all documentation to import extraction strategies directly from crawl4ai.	2025-06-10 18:08:27 +08:00
UncleCode	2a0c0ed18d	chore(deps): add httpx extras (#1195)	2025-06-10 15:47:03 +08:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
UncleCode	8a5e23d374	feat(crawler): add separate timeout for wait_for condition Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying a separate timeout for the wait_for condition, independent of the page_timeout. This provides more granular control over waiting behaviors in the crawler. Also removes unused colorama dependency and updates LinkedIn crawler example. BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing	2025-05-16 17:00:45 +08:00
ntohidi	25d97d56e4	fix(dependencies): remove duplicated aiofiles from project dependencies. REF #1045	2025-05-13 13:56:12 +02:00
ntohidi	12783fabda	fix(dependencies): update pillow version constraint to allow newer releases. ref #709	2025-05-07 11:18:13 +02:00
UncleCode	c4f5651199	chore(deps): upgrade to Python 3.12 and prepare for 0.6.0 release - Update Docker base image to Python 3.12-slim-bookworm - Bump version from 0.6.0rc1 to 0.6.0 - Update documentation to reflect release version changes - Fix license specification in pyproject.toml and setup.py - Clean up code formatting in demo_docker_api.py BREAKING CHANGE: Base Python version upgraded from 3.10 to 3.12	2025-04-23 16:35:15 +08:00
UncleCode	4812f08a73	feat(docker): update Docker deployment for v0.6.0 Major updates to Docker deployment infrastructure: - Switch default port to 11235 for all services - Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints - Simplify docker-compose.yml with auto-platform detection - Update documentation with new features and examples - Consolidate configuration and improve resource management BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.	2025-04-22 22:35:25 +08:00
Aravind Karnam	7155778eac	chore: move from faust-cchardet to chardet	2025-04-03 17:42:51 +05:30
Aravind Karnam	2f0e217751	chore: add brotli as a dependency to fix https://github.com/unclecode/crawl4ai/issues/867	2025-03-25 13:44:41 +05:30
UncleCode	b750542e6d	feat(crawler): optimize single URL handling and add performance comparison Add special handling for single URL requests in Docker API to use arun() instead of arun_many() Add new example script demonstrating performance differences between sequential and parallel crawling Update cache mode from aggressive to bypass in examples and tests Remove unused dependencies (zstandard, msgpack) BREAKING CHANGE: Changed default cache_mode from aggressive to bypass in examples	2025-03-13 22:15:15 +08:00
UncleCode	dc36997a08	feat(schema): improve HTML preprocessing for schema generation Add new preprocess_html_for_schema utility function to better handle HTML cleaning for schema generation. This replaces the previous optimize_html function in the GoogleSearchCrawler and includes smarter attribute handling and pattern detection. Other changes: - Update default provider to gpt-4o - Add DEFAULT_PROVIDER_API_KEY constant - Make LLMConfig creation more flexible with create_llm_config helper - Add new dependencies: zstandard and msgpack This change improves schema generation reliability while reducing noise in the processed HTML.	2025-03-12 22:40:46 +08:00
UncleCode	56bc3c6e45	refactor(cli): improve CLI default command handling Make 'crawl' the default command when no command is specified. This improves user experience by allowing direct URL input without explicitly specifying the 'crawl' command. Also removes unnecessary blank lines in example code for better readability.	2025-03-04 20:28:16 +08:00
UncleCode	95175cb394	feat(cli): add browser profile management functionality Adds new interactive browser profile management system that allows users to: - Create and manage browser profiles for authenticated crawling - List existing profiles with detailed information - Delete unused profiles - Use profiles during crawling with the new -p/--profile flag Also restructures CLI to use Click groups and adds humanize dependency for better size formatting.	2025-03-02 20:54:45 +08:00
UncleCode	cba4a466e5	feat(browser): add BrowserProfiler class for identity-based browsing Adds a new BrowserProfiler class that provides comprehensive management of browser profiles for identity-based crawling. Features include: - Interactive profile creation and management - Profile listing, retrieval, and deletion - Guided console interface - Migration of profile management from ManagedBrowser - New example script for identity-based browsing ALSO: - Updates logging format in AsyncWebCrawler - Removes content filter from hello_world example - Relaxes httpx version constraint BREAKING CHANGE: Profile management methods from ManagedBrowser are now deprecated and delegate to BrowserProfiler	2025-03-02 20:32:29 +08:00
Aravind	a9e24307cc	Release prep (#749) * fix: Update export of URLPatternFilter * chore: Add dependency for cchardet in requirements * docs: Update example for deep crawl in release note for v0.5 * Docs: update the example for memory dispatcher * docs: updated example for crawl strategies * Refactor: Removed wrapping in if __name__==main block since this is a markdown file. * chore: removed cchardet from dependency list, since unclecode is planning to remove it * docs: updated the example for proxy rotation to a working example * feat: Introduced ProxyConfig param * Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1 * chore: update and test new dependencies * feat: Make PyPDF2 a conditional dependency * updated tutorial and release note for v0.5 * docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename * refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult * fix: Bug in serialisation of markdown in acache_url * Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown * fix: remove deprecated markdown_v2 from docker * Refactor: remove deprecated fit_markdown and fit_html from result * refactor: fix cache retrieval for markdown as a string * chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown	2025-02-28 19:53:35 +08:00
UncleCode	3a87b4e43b	fix(dependencies): update cchardet to faust-cchardet for compatibility	2025-02-26 18:25:58 +08:00
UncleCode	4bcd4cbda1	refactor(pdf): improve PDF processor dependency handling Make PyPDF2 an optional dependency and improve import handling in PDF processor. Move imports inside methods to allow for lazy loading and better error handling. Add new 'pdf' optional dependency group in pyproject.toml. Clean up unused imports and remove deprecated files. BREAKING CHANGE: PyPDF2 is now an optional dependency. Users need to install with 'pip install crawl4ai[pdf]' to use PDF processing features.	2025-02-25 22:27:55 +08:00
UncleCode	8bb799068e	feat(crawler): add HTTP crawler strategy for lightweight web scraping Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include: - Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts - File and raw content handling capabilities - Streaming response processing for large files - Customizable request/response hooks - Comprehensive error handling Also refactors browser management code into separate module for better organization.	2025-02-15 19:26:30 +08:00
UncleCode	467be9ac76	feat(deep-crawling): add DFS strategy and update exports; refactor CLI entry point	2025-02-09 20:23:40 +08:00
UncleCode	09ac7ed008	feat(demo): uncomment feature demos and add fake-useragent dependency Uncomments demonstration code for memory dispatcher, streaming support, content scraping, JSON schema generation, LLM markdown, and robots compliance in the v0.4.3b2 features demo file. Also adds fake-useragent package as a project dependency. This change makes all feature demonstrations active by default and ensures proper user agent handling capabilities.	2025-01-25 21:56:08 +08:00
UncleCode	16b8d4945b	feat(release): prepare v0.4.3 beta release Prepare the v0.4.3 beta release with major feature additions and improvements: - Add JsonXPathExtractionStrategy and LLMContentFilter to exports - Update version to 0.4.3b1 - Improve documentation for dispatchers and markdown generation - Update development status to Beta - Reorganize changelog format BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit	2025-01-21 21:03:11 +08:00
UncleCode	2d6b19e1a2	refactor(browser): improve browser path management Implement more robust browser executable path handling using playwright's built-in browser management. This change: - Adds async browser path resolution - Implements path caching in the home folder - Removes hardcoded browser paths - Adds httpx dependency - Removes obsolete test result files This change makes the browser path resolution more reliable across different platforms and environments.	2025-01-17 22:14:37 +08:00
UncleCode	f3ae5a657c	feat(scraping): add LXML-based scraping mode for improved performance Adds a new ScrapingMode enum to allow switching between BeautifulSoup and LXML parsing. LXML mode offers 10-20x better performance for large HTML documents. Key changes: - Added ScrapingMode enum with BEAUTIFULSOUP and LXML options - Implemented LXMLWebScrapingStrategy class - Added LXML-based metadata extraction - Updated documentation with scraping mode usage and performance considerations - Added cssselect dependency BREAKING CHANGE: None	2025-01-12 20:46:23 +08:00
UncleCode	ca3e33122e	refactor(docs): reorganize documentation structure and update styles Reorganize documentation into core/advanced/extraction sections for better navigation. Update terminal theme styles and add rich library for better CLI output. Remove redundant tutorial files and consolidate content into core sections. Add personal story to index page for project context. BREAKING CHANGE: Documentation structure has been significantly reorganized	2025-01-07 20:49:50 +08:00
UncleCode	bfe21b29d4	build: streamline package discovery and bump to v0.4.243 - Replace explicit package listing with setuptools.find - Include all crawl4ai.* packages automatically - Use `packages = {find = {where = ["."], include = ["crawl4ai*"]}}` syntax - Bump version to 0.4.243 This change simplifies package maintenance by automatically discovering all subpackages under crawl4ai namespace instead of listing them manually.	2025-01-01 17:55:59 +08:00
UncleCode	e9d9a6ffe8	fix: ensure js_snippet files are included in package - Add js_snippet to packages list in pyproject.toml - Verified JS files are properly included in installed package - Bump version to 0.4.242	2025-01-01 17:38:59 +08:00
UncleCode	dc6a24618e	feat(install): add doctor command and force browser install - Add --force flag to Playwright browser installation - Add doctor command to test crawling functionality - Install Chrome and Chromium browsers explicitly - Add crawl4ai-doctor entry point in pyproject.toml - Implement simple health check focused on crawling test	2025-01-01 16:33:43 +08:00
UncleCode	78b6ba5cef	build: modernize package configuration with pyproject.toml - Add pyproject.toml for PEP 517 build system support - Configure dependencies, scripts, and metadata in pyproject.toml - Set Python requirement to >=3.9 and add support up to 3.13 - Keep setup.py for backwards compatibility - Move package dependencies and entry points to pyproject.toml	2025-01-01 15:45:27 +08:00
UncleCode	3f019d34cc	docs: update project description emojis - Change project description emojis from 🔥🕷️ to 🚀🤖 - Update emojis consistently in both setup.py and pyproject.toml	2025-01-01 15:39:33 +08:00

35 Commits
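
Commits 9546773a07 and 4bcd4cbda1 both move a heavy package into an optional dependency group and defer its import until the feature is used, raising an ImportError with an installation hint when it is missing. The snippet below is a minimal sketch of that general pattern, not the project's actual code: the helper name `require_optional` and its parameters are hypothetical, and only the extras names `transformer` and `pdf` come from the commit messages above.

```python
import importlib


def require_optional(module_name: str, extra: str, package: str = "crawl4ai"):
    """Import an optional dependency lazily (hypothetical helper).

    Returns the imported module, or raises ImportError with an
    installation hint that names the optional dependency group.
    """
    try:
        # The import happens at call time, so the base install never
        # needs the heavy package unless this feature is actually used.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc
```

A feature that needs embeddings would call `require_optional("sentence_transformers", "transformer")` inside the method that uses it rather than at module import time, so importing the package itself never triggers the heavy import.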