Commit Graph

74 Commits

Author SHA1 Message Date
ntohidi
31a435fb0e Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:12:19 +08:00
Nasrin
5de6a28055 Merge pull request #1361 from unclecode/fix/crawler-result-docs
Update CrawlResult documentation with missing fields
2025-08-04 19:12:09 +08:00
ntohidi
de1561ad14 Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-08-04 19:04:50 +08:00
Nasrin
337b588732 Merge pull request #1358 from shonenada/patch-1
Fix typos in examples.md
2025-08-04 19:04:42 +08:00
ntohidi
7a6ad547f0 Squashed commit of the following:
commit 2def6524cdacb69c72760bf55a41089257c0bb07
Author: ntohidi <nasrin@kidocode.com>
Date:   Mon Aug 4 18:59:10 2025 +0800

    refactor: consolidate WebScrapingStrategy to use LXML implementation only

    BREAKING CHANGE: None - full backward compatibility maintained

    This commit simplifies the content scraping architecture by removing the
    redundant BeautifulSoup-based WebScrapingStrategy implementation and making
    it an alias for LXMLWebScrapingStrategy.

    Changes:
    - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code
    - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy
    - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy
    - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy
    - Maintain 100% backward compatibility - existing code continues to work

    Code changes:
    - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias
    - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports
    - crawl4ai/__init__.py: Update imports to show alias relationship
    - crawl4ai/types.py: Update type definitions
    - crawl4ai/legacy/web_crawler.py: Update import to use alias
    - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy
    - docs/examples/scraping_strategies_performance.py: Update to use single strategy

    Documentation updates:
    - docs/md_v2/core/content-selection.md: Update scraping modes section
    - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide
    - CHANGELOG.md: Document the refactoring under [Unreleased]

    Benefits:
    - 10-20x faster HTML parsing for large documents
    - Reduced memory usage and simplified codebase
    - Consistent parsing behavior
    - No migration required for existing users

    All existing code using WebScrapingStrategy continues to work without
    modification, while benefiting from LXML's superior performance.
2025-08-04 19:02:01 +08:00
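A minimal sketch of the backward compatibility described in the commit above: code that imports WebScrapingStrategy keeps working because the name now resolves to the LXML implementation. The class names come from the commit; the scraping_strategy parameter and the example URL are illustrative assumptions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import WebScrapingStrategy, LXMLWebScrapingStrategy

async def main():
    # Per the commit, WebScrapingStrategy is now an alias for
    # LXMLWebScrapingStrategy, so both names should refer to the same class.
    print(WebScrapingStrategy is LXMLWebScrapingStrategy)

    # Existing code that passed WebScrapingStrategy continues to work unchanged,
    # while getting the faster LXML-based parsing under the hood.
    config = CrawlerRunConfig(scraping_strategy=WebScrapingStrategy())
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.success)

asyncio.run(main())
```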
Soham Kukreti
e6692b987d docs: Update CrawlResult documentation with missing fields.
- Add missing fields: fit_html, js_execution_result, redirected_url, network_requests, console_messages, tables
2025-08-04 15:43:40 +05:30
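A short, hedged sketch of reading the newly documented CrawlResult fields (field names are taken from the commit above; whether each value is populated depends on the crawl configuration, e.g. network/console capture being enabled):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        # Fields added to the CrawlResult documentation in this commit:
        print(result.redirected_url)       # final URL after any redirects
        print(result.fit_html)             # preprocessed HTML used for extraction
        print(result.js_execution_result)  # output of any executed js_code
        print(result.network_requests)     # captured requests, if capture was enabled
        print(result.console_messages)     # captured console output, if enabled
        print(result.tables)               # tables detected on the page

asyncio.run(main())
```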
Yaoda Liu
438a103b17 Fix typos in examples.md 2025-08-03 14:33:10 +08:00
ntohidi
a03e68fa2f feat: Add URL-specific crawler configurations for multi-URL crawling
Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00
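A hedged sketch of the per-URL configuration flow described in the commit above. The url_matcher, match_mode and MatchMode names come from the commit; the exact arun_many() keyword for passing a list of configs and the top-level MatchMode import are assumptions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode

async def main():
    # One config per URL family; the first matching config wins,
    # with a catch-all default placed last.
    pdf_config = CrawlerRunConfig(url_matcher="*.pdf")
    blog_config = CrawlerRunConfig(url_matcher=["*/blog/*", "*/article/*"])
    api_config = CrawlerRunConfig(
        url_matcher=["*api*", lambda url: url.endswith(".json")],
        match_mode=MatchMode.OR,  # any matcher may fire; AND would require all
    )
    default_config = CrawlerRunConfig()  # fallback when nothing matches

    urls = [
        "https://example.com/report.pdf",
        "https://example.com/blog/post-1",
        "https://example.com/api/items.json",
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls,
            config=[pdf_config, blog_config, api_config, default_config],
        )
        for r in results:
            print(r.url, r.success)

asyncio.run(main())
```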
ntohidi
db6ad7a79d fix: update links in README and C4A-Script documentation for accuracy 2025-07-23 09:47:18 +02:00
ntohidi
cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
ntohidi
1d1970ae69 docs: Update release notes and docs for v0.7.0 with the correct parameters and explanations 2025-07-15 11:32:04 +02:00
UncleCode
7b9ba3015f Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update 2025-07-12 18:54:20 +08:00
UncleCode
0c8bb742b7 Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
2025-07-12 18:51:13 +08:00
ntohidi
0ebce590f8 Merge branch '2025-JUN-1' into next-MAY 2025-07-09 09:41:03 +02:00
ntohidi
a3d41c7951 fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086 2025-07-08 12:24:33 +02:00
ntohidi
fee4c5c783 fix: Consolidate import statements in local-files.md for clarity 2025-07-08 11:46:24 +02:00
ntohidi
0f210f6e02 Merge branch '2025-MAY-2' into next-MAY 2025-07-08 11:46:13 +02:00
UncleCode
1a73fb60db feat(crawl4ai): Implement adaptive crawling feature
This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage.

The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature.

The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool.

Significant modifications:
- Added adaptive_crawler.py and related scripts
- Modified __init__.py and utils.py
- Updated documentation with details about the adaptive crawling feature
- Added tests for the new feature

BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature.

Refs: #123, #456
2025-07-04 15:16:53 +08:00
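The commit above does not show the adaptive crawling API, so the following is only a loosely hedged sketch: AdaptiveCrawler, AdaptiveConfig, digest(), confidence and crawled_urls are assumed names, not taken from the commit text.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
# Assumed entry points for the adaptive crawling feature; check the docs
# added by this commit for the actual names and parameters.
from crawl4ai import AdaptiveCrawler, AdaptiveConfig

async def main():
    # Stop crawling once enough information has been gathered for the query,
    # instead of exhausting a fixed link budget.
    config = AdaptiveConfig(confidence_threshold=0.8, max_pages=30)
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler, config)
        state = await adaptive.digest(
            start_url="https://docs.example.com",
            query="async configuration options",
        )
        print(adaptive.confidence)      # how sufficient the gathered information is
        print(len(state.crawled_urls))  # pages actually visited before stopping

asyncio.run(main())
```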
UncleCode
a353515271 feat: Add virtual scroll support for modern web scraping
Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc).

Key features:
- New VirtualScrollConfig class for configuring virtual scroll behavior
- Automatic detection of three scrolling scenarios: no change, content appended, content replaced
- Intelligent HTML chunk capture and merging with deduplication
- 100% content capture from virtual scroll pages
- Seamless integration with existing extraction strategies
- JavaScript-based detection and capture for performance
- Tree-based DOM merging with text-based deduplication

Documentation:
- Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md
- API reference updates in parameters.md and page-interaction.md
- Blog article explaining the solution and techniques
- Complete examples with local test server

Testing:
- Full test suite achieving 100% capture of 1000 items
- Examples for Twitter timeline, Instagram grid scenarios
- Local test server with different scrolling behaviors

This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.
2025-06-29 20:41:37 +08:00
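A hedged sketch of capturing a virtual-scroll feed with the VirtualScrollConfig introduced above. The class name comes from the commit; the individual parameter names (container_selector, scroll_count, scroll_by, wait_after_scroll) and the virtual_scroll_config keyword are assumptions based on the commit summary.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def main():
    # Scroll a DOM-recycling container step by step and let the crawler
    # merge the captured HTML chunks, deduplicating repeated items.
    scroll_config = VirtualScrollConfig(
        container_selector="#timeline",  # the element that recycles its children
        scroll_count=20,                 # number of scroll steps to perform
        scroll_by="container_height",    # scroll one container height per step
        wait_after_scroll=0.5,           # seconds to wait for new items to render
    )
    config = CrawlerRunConfig(virtual_scroll_config=scroll_config)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/feed", config=config)
        print(result.success)

asyncio.run(main())
```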
UncleCode
539a324cf6 refactor(link_extractor): remove link_extractor and rename to link_preview
This change removes the link_extractor module and renames it to link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability. Other files have been updated to reflect this change, ensuring consistency across the project.

BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.
2025-06-27 21:54:22 +08:00
UncleCode
5c9c305dbf feat: Add advanced link head extraction with three-layer scoring system (#1)
Squashed commit from feature/link-extractor branch implementing comprehensive link analysis:

- Extract HTML head content from discovered links with parallel processing
- Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores
- New LinkExtractionConfig class for type-safe configuration
- Pattern-based filtering for internal/external links
- Comprehensive documentation and examples
2025-06-27 20:06:04 +08:00
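A hedged sketch of configuring the three-layer link scoring described above. LinkExtractionConfig is the class named in this commit, but every parameter shown here is an illustrative assumption, and the module was subsequently renamed to link_preview (see the commit listed just above), so the import path may differ in current releases.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import LinkExtractionConfig  # assumed top-level export

async def main():
    link_config = LinkExtractionConfig(
        include_internal=True,       # assumed: fetch head content of internal links
        include_external=False,
        query="pricing and plans",   # assumed: drives the contextual (BM25) score
        concurrency=10,              # heads are fetched in parallel per the commit
    )
    config = CrawlerRunConfig(link_extraction_config=link_config)  # assumed keyword
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        for link in result.links.get("internal", []):
            # Per the commit, each link should carry intrinsic, contextual
            # and total scores alongside its extracted head data.
            print(link.get("href"), link.get("total_score"))

asyncio.run(main())
```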
UncleCode
c0fd36982d Update all documentation to import extraction strategies directly from crawl4ai. 2025-06-10 18:08:27 +08:00
Nasrin
f9b7090084 Merge pull request #1186 from zimmski/fix-typo-provoder
fix, Typo
2025-06-10 10:26:45 +02:00
UncleCode
ef6f4329fa Add use_stemming option to BM25ContentFilter (#1192) 2025-06-10 15:44:45 +08:00
UncleCode
926592649e Add Crawl4AI Assistant Chrome Extension
- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
2025-06-08 18:34:05 +08:00
UncleCode
b4bb0ccea0 Update simple-crawling.md
Fixing wrong documentation that presented fit_markdown as a direct parameter of CrawlerRunConfig, while it is NOT.
2025-06-08 11:33:28 +08:00
UncleCode
08a2cdae53 Add C4A-Script support and documentation
- Generate OneShot JS code generator
- Introduced a new C4A-Script tutorial example for login flow using Blockly.
- Updated index.html to include Blockly theme and event editor modal for script editing.
- Created a test HTML file for testing Blockly integration.
- Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples.
- Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples.
- Updated mkdocs.yml to include new C4A-Script documentation in navigation.
2025-06-07 23:07:19 +08:00
Markus Zimmermann
022cc2d92a fix, Typo 2025-06-05 15:30:38 +02:00
UncleCode
82a25c037a feat(async_url_seeder): add smart URL filtering to exclude nonsense URLs
This update introduces a new feature in the URL seeding process that automatically filters out utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The AsyncUrlSeeder class has been enhanced with a new filtering parameter, which is enabled by default. This change aims to improve the efficiency of the crawling process by reducing the number of irrelevant URLs processed.

Significant modifications include:
- Added the new filtering parameter to the seeder configuration.
- Implemented logic in the seeder to check and filter out nonsense URLs during the seeding process.
- Updated documentation to reflect the new filtering feature and provide examples of its usage.

This change enhances the overall functionality of the URL seeder, making it smarter and more efficient in identifying and excluding non-content URLs.

BREAKING CHANGE: The seeder now requires the filtering parameter to be explicitly set if the default behavior is to be altered.

Related issues: #123
2025-06-05 15:46:24 +08:00
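The parameter name is stripped from the commit message above, so this sketch uses filter_nonsense_urls as a purely hypothetical stand-in on SeedingConfig; the real name and location may differ.

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    # filter_nonsense_urls is a hypothetical stand-in for the parameter this
    # commit adds; per the commit it is enabled by default, dropping utility
    # URLs such as robots.txt and sitemap.xml from the seeding results.
    config = SeedingConfig(source="sitemap", filter_nonsense_urls=True)
    seeder = AsyncUrlSeeder()
    urls = await seeder.urls("example.com", config)
    print(len(urls))

asyncio.run(main())
```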
UncleCode
c6fc5c0518 docs(linkedin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.

The changes include:
- Added two new Jupyter notebooks for comprehensive LinkedIn data discovery.
- Removed the previous workshop notebook to eliminate outdated content.
- Updated the accompanying material to reflect new data visualization requirements.
- Introduced a new tutorial and supporting examples to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.

These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
2025-06-05 15:06:25 +08:00
UncleCode
3048cc1ff9 feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
2025-06-03 23:27:12 +08:00
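A hedged sketch of the discover → filter → crawl pipeline described above. The SeedingConfig parameter names follow the commit summary (source, pattern, extract_head, query, score_threshold); the exact urls() signature and the shape of the returned entries are assumptions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AsyncUrlSeeder, SeedingConfig

async def main():
    seeder = AsyncUrlSeeder()
    config = SeedingConfig(
        source="sitemap+cc",         # combine sitemap and Common Crawl discovery
        pattern="*/docs/*",          # wildcard URL filtering
        extract_head=True,           # pull title/description/OG tags without crawling
        query="installation guide",  # BM25 relevance query
        scoring_method="bm25",
        score_threshold=0.3,         # keep only sufficiently relevant URLs
    )
    candidates = await seeder.urls("example.com", config)
    urls = [c["url"] for c in candidates]  # assumed result shape

    # Feed the pre-filtered URLs straight into the crawler.
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls[:10])
        for r in results:
            print(r.url, r.success)

asyncio.run(main())
```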
ntohidi
28125c1980 Merge branch 'next' into 2025-MAY-2 2025-06-02 20:26:40 +02:00
ntohidi
773ed7b281 Merge branch '2025-APR-1' into 2025-MAY-2 2025-06-02 20:25:58 +02:00
devin-ai-integration[bot]
9c2cc7f73c Fix BM25ContentFilter documentation to use language parameter instead of use_stemming (#1152)
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: UncleCode <unclecode@kidocode.com>
2025-05-25 10:02:13 +08:00
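A small sketch reflecting the corrected documentation: stemming is driven by a language parameter on BM25ContentFilter rather than a use_stemming flag. Attaching the filter through the markdown generator is shown for context and follows the docs update further down this log.

```python
from crawl4ai import CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Per the documentation fix, pass a language (used for stemming)
# rather than the older use_stemming flag.
bm25 = BM25ContentFilter(
    user_query="pricing plans",
    bm25_threshold=1.0,
    language="english",
)
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(content_filter=bm25),
)
```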
UncleCode
a06710ff03 Adding LLMContext generator to website. 2025-05-24 20:37:09 +08:00
ntohidi
cb8d581e47 fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125 2025-05-19 18:03:05 +02:00
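The corrected pattern from that docs fix, in brief: cache bypassing is expressed through CacheMode on CrawlerRunConfig.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Bypass the cache for this run, as the updated documentation describes.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.success)

asyncio.run(main())
```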
Ahmed-Tawfik94
a97654270b #1086 fix(markdown): update BM25 filter to use language parameter for stemming 2025-05-19 14:11:46 +08:00
Aravind Karnam
2b17f234f8 docs: update direct passing of content_filter to CrawlerRunConfig and instead pass it via MarkdownGenerator. Ref: #603 2025-05-07 15:20:36 +05:30
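The corrected wiring from that docs update, sketched with PruningContentFilter as a purely illustrative filter choice:

```python
from crawl4ai import CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

# content_filter is no longer passed to CrawlerRunConfig directly;
# it is attached to the markdown generator instead.
md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.48),
)
config = CrawlerRunConfig(markdown_generator=md_generator)
```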
Aravind Karnam
094201ab2a Merge next + resolve conflicts 2025-04-23 19:44:50 +05:30
UncleCode
ad4dfb21e1 Remove "rc1" 2025-04-23 21:00:00 +08:00
UncleCode
949a93982e feat(docs): update documentation and disable Ask AI feature
Major documentation updates including:
- Add comprehensive code examples page
- Add video tutorial to homepage
- Update Docker deployment instructions for v0.6.0
- Temporarily disable Ask AI feature
- Add table border styling
- Update site version to v0.6.x

BREAKING CHANGE: Ask AI feature temporarily disabled pending launch
2025-04-23 19:02:39 +08:00
UncleCode
b0aa8bc9f7 Update README 2025-04-22 23:21:42 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
0007aea204 Update changelog 2025-04-21 23:21:49 +08:00
UncleCode
b5c25731e6 feat(browser): add geolocation, locale and timezone support
Add support for controlling browser geolocation, locale and timezone settings:
- New GeolocationConfig class for managing GPS coordinates
- Add locale and timezone_id parameters to CrawlerRunConfig
- Update browser context creation to handle location settings
- Add example script for geolocation usage
- Update documentation with location-based identity features

This enables more precise control over browser identity and location reporting.
2025-04-21 23:20:59 +08:00
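A hedged sketch of the location-based identity settings described above. GeolocationConfig and the locale / timezone_id parameters are named in the commit; the specific GeolocationConfig fields (latitude, longitude, accuracy) are assumptions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig

async def main():
    config = CrawlerRunConfig(
        # Report a specific GPS position to the page.
        geolocation=GeolocationConfig(latitude=48.8566, longitude=2.3522, accuracy=10.0),
        locale="fr-FR",              # language/locale reported by the browser
        timezone_id="Europe/Paris",  # timezone seen by JS Date / Intl APIs
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.success)

asyncio.run(main())
```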
ntohidi
14a31456ef fix(docs): update browser-crawler-config example to include LLMContentFilter and DefaultMarkdownGenerator, fix syntax errors 2025-04-21 13:59:49 +02:00
Aravind Karnam
b27bb367e8 merge next. Resolve conflicts. Fix some import errors and error handling in server.py 2025-04-19 20:27:47 +05:30
UncleCode
7db6b468d9 feat(markdown): add content source selection for markdown generation
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction

Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details

BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
2025-04-17 20:13:53 +08:00
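A brief sketch of the content_source selection added above, using the values listed in the commit (cleaned_html, raw_html, fit_html); DefaultMarkdownGenerator is assumed to expose the new parameter as a concrete MarkdownGenerationStrategy.

```python
from crawl4ai import CrawlerRunConfig, DefaultMarkdownGenerator

# Generate markdown from the original page HTML instead of the
# post-processed (cleaned) HTML, which remains the default.
md_generator = DefaultMarkdownGenerator(content_source="raw_html")
config = CrawlerRunConfig(markdown_generator=md_generator)
```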
Aravind Karnam
eed7f88f29 Merge branch 'next' into 2025-MAR-ALPHA-1 2025-04-17 10:50:02 +05:30
UncleCode
cd7ff6f9c1 feat(docs): add AI assistant interface and code copy button
Add new AI assistant chat interface with features:
- Real-time chat with markdown support
- Chat history management
- Citation tracking
- Selection-to-query functionality

Also adds code copy button to documentation code blocks and adjusts layout/styling.

Breaking changes: None
2025-04-14 23:00:47 +08:00