crawl4ai

Author	SHA1	Message	Date
Rachel Bushrian	7771ed3894	Merge branch 'develop' into fix/wrong_url_raw	2025-11-24 13:54:07 +02:00
Soham Kukreti	2dc6588573	fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396 - Fix critical bug where overlay removal JS function was injected but never called - Change remove_overlay_elements() to properly execute the injected async function - Wrap JS execution in async to handle the async overlay removal logic - Add test_remove_overlay_elements() test case to verify functionality works - Ensure overlay elements (cookie banners, popups, modals) are actually removed The remove_overlay_elements feature now works as intended: - Before: Function definition injected but never executed (silent failure) - After: Function injected and called, successfully removing overlay elements	2025-09-29 20:40:08 +05:30
Nasrin	23431d8109	Merge pull request #1389 from unclecode/fix/deep-crawl-scoring fix(deep-crawl): BestFirst priority inversion	2025-09-16 15:45:54 +08:00
rbushria	edd0b576b1	Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html	2025-09-01 23:15:56 +03:00
ntohidi	96c4b0de67	fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198 - Add _page_lock and guarded creation; handle empty context.pages safely - Prevents BrowserContext.new_page “Target page/context closed” during concurrent arun_many	2025-08-11 18:55:43 +08:00
ntohidi	88a9fbbb7e	fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253 Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.	2025-08-11 18:16:57 +08:00
ntohidi	0ebce590f8	Merge branch '2025-JUN-1' into next-MAY	2025-07-09 09:41:03 +02:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
UncleCode	c0fd36982d	Update all documentation to import extraction strategies directly from crawl4ai.	2025-06-10 18:08:27 +08:00
ntohidi	4679ee023d	fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003	2025-06-10 11:19:18 +02:00
ntohidi	5ac19a61d7	feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168	2025-06-05 16:40:34 +02:00
UncleCode	3048cc1ff9	feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling. ## Core Features ### AsyncUrlSeeder Component - Discovers URLs from multiple sources: - Sitemaps (including nested and gzipped) - Common Crawl index - Combined sources for maximum coverage - Extracts page metadata without full crawling: - Title, description, keywords - Open Graph and Twitter Card tags - JSON-LD structured data - Language and charset information - BM25 relevance scoring for intelligent filtering: - Query-based URL discovery - Configurable score thresholds - Automatic ranking by relevance - Performance optimizations: - Async/concurrent processing with configurable workers - Rate limiting (hits per second) - Automatic caching with TTL - Streaming results for large datasets ### SeedingConfig - Comprehensive configuration for URL seeding: - Source selection (sitemap, cc, or both) - URL pattern filtering with wildcards - Live URL validation options - Metadata extraction controls - BM25 scoring parameters - Concurrency and rate limiting ### Integration with AsyncWebCrawler - Seamless pipeline: discover → filter → crawl - Direct compatibility with arun_many() - Significant resource savings by pre-filtering URLs ## Documentation - Comprehensive guide comparing URL seeding vs deep crawling - Complete API reference with parameter tables - Practical examples showing all features - Performance benchmarks and best practices - Integration patterns with AsyncWebCrawler ## Examples - url_seeder_demo.py: Interactive Rich-based demo with: - Basic discovery - Cache management - Live validation - BM25 scoring - Multi-domain discovery - Complete pipeline integration - url_seeder_quick_demo.py: Screenshot-friendly examples: - Pattern-based filtering - Metadata exploration - Smart search with BM25 ## Testing - Comprehensive test suite (test_async_url_seeder_bm25.py) - Coverage of all major features - Edge cases and error handling - Performance and consistency tests ## Implementation Details - Built on httpx with HTTP/2 support - Optional dependencies: lxml, brotli, rank_bm25 - Cache management in ~/.crawl4ai/seeder_cache/ - Logger integration with AsyncLoggerBase - Proper error handling and retry logic ## Bug Fixes - Fixed logger color compatibility (lightblack → bright_black) - Corrected URL extraction from seeder results for arun_many() - Updated all examples and documentation with proper usage This feature enables users to crawl smarter, not harder, by discovering and analyzing URLs before committing resources to crawling them.	2025-06-03 23:27:12 +08:00
João Martins	58c1e17170	Merge branch 'main' into fix-raw-url-parsing	2025-05-30 13:03:25 +01:00
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
UncleCode	230f22da86	refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization. Improved LLM token handling with new PROVIDER_MODELS_PREFIXES. Added test cases for deep crawling and proxy rotation. Removed docker_config from BrowserConfig as it's handled separately. BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai	2025-04-15 22:27:18 +08:00
unclecode	66ac07b4f3	feat(crawler): add network request and console message capturing Implement comprehensive network request and console message capturing functionality: - Add capture_network_requests and capture_console_messages config parameters - Add network_requests and console_messages fields to models - Implement Playwright event listeners to capture requests, responses, and console output - Create detailed documentation and examples - Add comprehensive tests This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.	2025-04-10 16:03:48 +08:00

16 Commits
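
Commit 88a9fbbb7e describes the BestFirst fix as using negative scores in the priority queue so that high-score URLs are visited first. The snippet below is only a minimal sketch of that idea, assuming a min-heap like Python's standard heapq; it is not crawl4ai's actual code, and the function name best_first_order and the sample URLs are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def best_first_order(scored_urls: Iterable[Tuple[float, str]]) -> List[str]:
    """Return URLs ordered from highest score to lowest (hypothetical helper).

    heapq is a min-heap, so each score is negated on push; the smallest
    stored value then corresponds to the largest original score.
    """
    heap: List[Tuple[float, int, str]] = []
    for index, (score, url) in enumerate(scored_urls):
        # The index breaks ties, so equal scores pop in insertion order.
        heapq.heappush(heap, (-score, index, url))

    ordered: List[str] = []
    while heap:
        _neg_score, _index, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered


if __name__ == "__main__":
    # Illustrative scores only; real scores would come from a URL scorer.
    sample = [
        (0.2, "https://example.com/about"),
        (0.9, "https://example.com/docs"),
        (0.5, "https://example.com/blog"),
    ]
    for url in best_first_order(sample):
        print(url)
```

Pushing the raw score instead of its negation would reproduce the inversion the commit message mentions: a min-heap would then hand back the lowest-scoring URL first.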