crawl4ai

Author	SHA1	Message	Date
ntohidi	437395e490	Merge branch 'feat/undetected-browser' into develop-future	2025-08-06 15:03:30 +08:00
Vinit Agrawal	3a9e2c716e	Removed the incorrect reference in browser_config variable	2025-07-18 10:01:00 +05:30
unclecode	6a728cbe5b	feat: add stealth mode and enhance undetected browser support - Add playwright-stealth integration with enable_stealth parameter in BrowserConfig - Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern - Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching - Update install.py to install both playwright and patchright browsers automatically - Add comprehensive documentation for anti-bot features (stealth mode + undetected browser) - Create examples demonstrating stealth mode usage and comparison tests - Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies - Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium) - Add dependency checker tool in tests/check_dependencies.py Breaking changes: None - all existing functionality preserved 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 16:59:10 +08:00
ntohidi	0ebce590f8	Merge branch '2025-JUN-1' into next-MAY	2025-07-09 09:41:03 +02:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
ntohidi	5ac19a61d7	feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168	2025-06-05 16:40:34 +02:00
ntohidi	cc95d3abd4	Fix raw URL parsing logic to correctly handle "raw://" and "raw:" prefixes. REF #1118	2025-06-03 11:19:08 +02:00
Nasrin	5ce3e682f3	Merge pull request #752 from jl-martins/fix-raw-url-parsing Fix `raw://` URL parsing logic. issue ref #1118	2025-06-03 11:10:29 +02:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
ntohidi	773ed7b281	Merge branch '2025-APR-1' into 2025-MAY-2	2025-06-02 20:25:58 +02:00
João Martins	58c1e17170	Merge branch 'main' into fix-raw-url-parsing	2025-05-30 13:03:25 +01:00
UncleCode	08ad7ef257	feat(browser): improve browser session management and profile handling Enhance browser session management with the following improvements: - Add state cloning between browser contexts - Implement smarter page closing logic based on total pages and browser config - Add storage state persistence during profile creation - Improve managed browser context handling with storage state support This change improves browser session reliability and persistence across runs.	2025-05-21 20:23:17 +08:00
UncleCode	8a5e23d374	feat(crawler): add separate timeout for wait_for condition Adds a new wait_for_timeout parameter to CrawlerRunConfig that allows specifying a separate timeout for the wait_for condition, independent of the page_timeout. This provides more granular control over waiting behaviors in the crawler. Also removes unused colorama dependency and updates LinkedIn crawler example. BREAKING CHANGE: LinkedIn crawler example now uses different wait_for_images timing	2025-05-16 17:00:45 +08:00
ntohidi	22725ca87b	fix(crawler): initialize `captured_console` to prevent unbound local error for local HTML files. REF: #1072 Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False` (default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.	2025-05-15 11:29:36 +02:00
Aravind Karnam	98a56e6e01	Merge next branch	2025-05-13 17:12:11 +05:30
UncleCode	a3e9ef91ad	fix(crawler): remove automatic page closure in screenshot methods Removes automatic page closure in take_screenshot and take_screenshot_naive methods to prevent premature closure of pages that might still be needed in the calling context. This allows for more flexible page lifecycle management by the caller. BREAKING CHANGE: Page objects are no longer automatically closed after taking screenshots. Callers must explicitly handle page closure when appropriate.	2025-05-12 21:17:57 +08:00
UncleCode	206a9dfabd	feat(crawler): add session management and view-source support Add session_id feature to allow reusing browser pages across multiple crawls. Add support for view-source: protocol in URL handling. Fix browser config reference and string formatting issues. Update examples to demonstrate new session management features. BREAKING CHANGE: Browser page handling now persists when using session_id	2025-05-08 17:13:35 +08:00
ntohidi	ee93acbd06	fix(async_playwright_crawler): use config directly instead of self.config for verbosity check	2025-05-07 12:32:38 +02:00
Aravind Karnam	39e3b792a1	Merge branch 'next' into 2025-APR-1	2025-05-07 10:25:25 +05:30
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
ntohidi	e0cd3e10de	fix(crawler): initialize captured_console variable for local file processing	2025-05-02 10:35:35 +02:00
ntohidi	1d6a2b9979	fix(crawler): surface real redirect status codes and keep redirect chain. the 30x response instead of always returning 200. Refs #660	2025-04-30 12:29:17 +02:00
Aravind Karnam	094201ab2a	Merge next + resolve conflicts	2025-04-23 19:44:50 +05:30
ntohidi	0886153d6a	fix(async_playwright_crawler): improve segment handling and viewport adjustments during screenshot capture (Fixed bug: Capturing Screenshot Twice and Increasing Image Size)	2025-04-17 12:48:11 +02:00
ntohidi	0ec3c4a788	fix(crawler): handle navigation aborts during file downloads in AsyncPlaywrightCrawlerStrategy	2025-04-17 12:11:12 +02:00
Aravind Karnam	022f5c9e25	Merged next branch	2025-04-12 10:47:02 +05:30
UncleCode	18e8227dfb	feat(crawler): add console message capture functionality Add ability to capture browser console messages during crawling: - Implement _capture_console_messages method to collect console logs - Update crawl method to support console message capture - Modify browser_manager page creation to accept full CrawlerRunConfig - Fix request failure text formatting This enhancement allows debugging and monitoring of JavaScript console output during crawling operations.	2025-04-10 23:26:09 +08:00
unclecode	66ac07b4f3	feat(crawler): add network request and console message capturing Implement comprehensive network request and console message capturing functionality: - Add capture_network_requests and capture_console_messages config parameters - Add network_requests and console_messages fields to models - Implement Playwright event listeners to capture requests, responses, and console output - Create detailed documentation and examples - Add comprehensive tests This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.	2025-04-10 16:03:48 +08:00
UncleCode	a2061bf31e	feat(crawler): add MHTML capture functionality Add ability to capture web pages as MHTML format, which includes all page resources in a single file. This enables complete page archival and offline viewing. - Add capture_mhtml parameter to CrawlerRunConfig - Implement MHTML capture using CDP in AsyncPlaywrightCrawlerStrategy - Add mhtml field to CrawlResult and AsyncCrawlResponse models - Add comprehensive tests for MHTML capture functionality - Update documentation with MHTML capture details - Add exclude_all_images option for better memory management Breaking changes: None	2025-04-09 15:39:04 +08:00
Aravind Karnam	6f7ab9c927	fix: Revert changes to session management in AsyncHttpWebcrawler and solve the underlying issue by removing the session closure in finally block of session context.	2025-04-08 18:31:00 +05:30
UncleCode	02e627e0bd	fix(crawler): simplify page retrieval logic in AsyncPlaywrightCrawlerStrategy	2025-04-08 17:43:36 +08:00
Aravind Karnam	7155778eac	chore: move from faust-cchardet to chardet	2025-04-03 17:42:51 +05:30
UncleCode	86df20234b	fix(crawler): handle exceptions in get_page call to ensure page retrieval	2025-04-02 21:25:24 +08:00
UncleCode	179921a131	fix(crawler): update get_page call to include additional return value	2025-04-02 19:01:30 +08:00
Aravind Karnam	757e3177ed	fix: https://github.com/unclecode/crawl4ai/issues/839	2025-03-31 17:10:04 +05:30
maggie.wang	1119f2f5b5	fix: https://github.com/unclecode/crawl4ai/issues/911	2025-03-31 14:05:54 +08:00
Aravind Karnam	d8cbeff386	fix: https://github.com/unclecode/crawl4ai/issues/842	2025-03-28 19:31:05 +05:30
Aravind Karnam	e3111d0a32	fix: prevent session closing after each request to maintain connection pool. Fixes: https://github.com/unclecode/crawl4ai/issues/867	2025-03-25 13:46:55 +05:30
UncleCode	7884a98be7	feat(crawler): add experimental parameters support and optimize browser handling Add experimental parameters dictionary to CrawlerRunConfig to support beta features Make CSP nonce headers optional via experimental config Remove default cookie injection Clean up browser context creation code Improve code formatting in API handler BREAKING CHANGE: Default cookie injection has been removed from page initialization	2025-03-14 14:39:24 +08:00
UncleCode	4aeb7ef9ad	refactor(proxy): consolidate proxy configuration handling Moves ProxyConfig from configs/ directory into proxy_strategy.py to improve code organization and reduce fragmentation. Updates all imports and type hints to reflect the new location. Key changes: - Moved ProxyConfig class from configs/proxy_config.py to proxy_strategy.py - Updated type hints in async_configs.py to support ProxyConfig - Fixed proxy configuration handling in browser_manager.py - Updated documentation and examples to use new import path BREAKING CHANGE: ProxyConfig import path has changed from crawl4ai.configs to crawl4ai.proxy_strategy	2025-03-07 23:14:11 +08:00
Aravind	dad592c801	2025 feb alpha 1 (#685 ) * spelling change in prompt * gpt-4o-mini support * Remove leading Y before here * prompt spell correction * (Docs) Fix numbered list end-of-line formatting Added the missing "two spaces" to add a line break * fix: access downloads_path through browser_config in _handle_download method - Fixes #585 * crawl * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/583 * Docs update: https://github.com/unclecode/crawl4ai/issues/649 * fix: https://github.com/unclecode/crawl4ai/issues/570 * Docs: updated example for content-selection to reflect new changes in yc newsfeed css * Refactor: Removed old filters and replaced with optimised filters * fix:Fixed imports as per the new names of filters * Tests: For deep crawl filters * Refactor: Remove old scorers and replace with optimised ones: Fix imports forall filters and scorers. * fix: awaiting on filters that are async in nature eg: content relevance and seo filters * fix: https://github.com/unclecode/crawl4ai/issues/592 * fix: https://github.com/unclecode/crawl4ai/issues/715 --------- Co-authored-by: DarshanTank <darshan.tank@gnani.ai> Co-authored-by: Tuhin Mallick <tuhin.mllk@gmail.com> Co-authored-by: Serhat Soydan <ssoydan@gmail.com> Co-authored-by: cardit1 <maneesh@cardit.in> Co-authored-by: Tautik Agrahari <tautikagrahari@gmail.com>	2025-02-19 14:13:17 +08:00
João Martins	27af4cc27b	Fix "raw://" URL parsing logic Closes https://github.com/unclecode/crawl4ai/issues/686	2025-02-15 15:34:59 +00:00
UncleCode	8bb799068e	feat(crawler): add HTTP crawler strategy for lightweight web scraping Implements a new AsyncHTTPCrawlerStrategy class that provides a fast, memory-efficient alternative to browser-based crawling. Features include: - Support for HTTP/HTTPS requests with configurable methods, headers, and timeouts - File and raw content handling capabilities - Streaming response processing for large files - Customizable request/response hooks - Comprehensive error handling Also refactors browser management code into separate module for better organization.	2025-02-15 19:26:30 +08:00
UncleCode	f81712eb91	refactor(core): reorganize project structure and remove legacy code Major reorganization of the project structure: - Moved legacy synchronous crawler code to legacy folder - Removed deprecated CLI and docs manager - Consolidated version manager into utils.py - Added CrawlerHub to __init__.py exports - Fixed type hints in async_webcrawler.py - Fixed minor bugs in chunking and crawler strategies BREAKING CHANGE: Removed synchronous WebCrawler, CLI, and docs management functionality. Users should migrate to AsyncWebCrawler.	2025-01-30 19:35:06 +08:00
UncleCode	31938fb922	feat(crawler): enhance JavaScript execution and PDF processing Add JavaScript execution result handling and improve PDF processing capabilities: - Add js_execution_result to CrawlResult and AsyncCrawlResponse models - Implement execution result capture in AsyncPlaywrightCrawlerStrategy - Add batch processing for PDF pages with configurable batch size - Enhance JsonElementExtractionStrategy with better schema generation - Add HTML optimization utilities BREAKING CHANGE: PDF processing now uses batch processing by default	2025-01-29 21:03:39 +08:00
UncleCode	4d7f91b378	refactor(user-agent): improve user agent generation system Redesign user agent generation to be more modular and reliable: - Add abstract base class UAGen for user agent generation - Implement ValidUAGenerator using fake-useragent library - Add OnlineUAGenerator for fetching real-world user agents - Update browser configurations to use new UA generation system - Improve client hints generation This change makes the user agent system more maintainable and provides better real-world user agent coverage.	2025-01-25 21:16:39 +08:00
UncleCode	69a77222ef	feat(browser): add CDP URL configuration support Add support for direct CDP URL configuration in BrowserConfig and ManagedBrowser classes. This allows connecting to remote browser instances using custom CDP endpoints instead of always launching a local browser. - Added cdp_url parameter to BrowserConfig - Added cdp_url support in ManagedBrowser.start() method - Updated documentation for new parameters	2025-01-24 15:53:47 +08:00
UncleCode	2d69bf2366	refactor(models): rename final_url to redirected_url for consistency Renames the final_url field to redirected_url across all components to maintain consistent terminology throughout the codebase. This change affects: - AsyncCrawlResponse model - AsyncPlaywrightCrawlerStrategy - Documentation and examples No functional changes, purely naming consistency improvement.	2025-01-22 17:14:24 +08:00

107 Commits