crawl4ai

Author	SHA1	Message	Date
UncleCode	40640badad	feat: add Script Builder to Chrome Extension and reorganize LLM context files This commit introduces significant enhancements to the Crawl4AI ecosystem: Chrome Extension - Script Builder (Alpha): - Add recording functionality to capture user interactions (clicks, typing, scrolling) - Implement smart event grouping for cleaner script generation - Support export to both JavaScript and C4A script formats - Add timeline view for visualizing and editing recorded actions - Include wait commands (time-based and element-based) - Add saved flows functionality for reusing automation scripts - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents) - Release new extension versions: v1.1.0, v1.2.0, v1.2.1 LLM Context Builder Improvements: - Reorganize context files from llmtxt/ to llm.txt/ with better structure - Separate diagram templates from text content (diagrams/ and txt/ subdirectories) - Add comprehensive context files for all major Crawl4AI components - Improve file naming convention for better discoverability Documentation Updates: - Update apps index page to match main documentation theme - Standardize color scheme: "Available" tags use primary color (#50ffff) - Change "Coming Soon" tags to dark gray for better visual hierarchy - Add interactive two-column layout for extension landing page - Include code examples for both Schema Builder and Script Builder features Technical Improvements: - Enhance event capture mechanism with better element selection - Add support for contenteditable elements and complex form interactions - Implement proper scroll event handling for both window and element scrolling - Add meta key support for keyboard shortcuts - Improve selector generation for more reliable element targeting The Script Builder is released as Alpha, acknowledging potential bugs while providing early access to this powerful automation recording feature.	2025-06-08 22:02:12 +08:00
UncleCode	926592649e	Add Crawl4AI Assistant Chrome Extension - Created manifest.json for the Crawl4AI Assistant extension. - Added popup HTML, CSS, and JS files for the extension interface. - Included icons and favicon for the extension. - Implemented functionality for schema capture and code generation. - Updated index.md to reflect the availability of the new extension. - Enhanced LLM Context Builder layout and styles for consistency. - Adjusted global styles for better branding and responsiveness.	2025-06-08 18:34:05 +08:00
UncleCode	6f3a0ea38e	Create "Apps" section in documentation and Add interactive c4a-script playground and LLM context builder for Crawl4AI - Created a new HTML page (`index.html`) for the interactive LLM context builder, allowing users to select and combine different `crawl4ai` context files. - Implemented JavaScript functionality (`llmtxt.js`) to manage component selection, context types, and file downloads. - Added CSS styles (`llmtxt.css`) for a terminal-themed UI. - Introduced a new Markdown file (`build.md`) detailing the requirements and functionality of the context builder. - Updated the navigation in `mkdocs.yml` to include links to the new context builder and demo apps. - Added a new Markdown file (`why.md`) explaining the motivation behind the new context structure and its benefits for AI coding assistants.	2025-06-08 15:48:17 +08:00
UncleCode	b4bb0ccea0	Update simple-crawling.md Fix documentation that wrongly treated fit_markdown as a direct parameter of CrawlerRunConfig, which it is NOT.	2025-06-08 11:33:28 +08:00
UncleCode	08a2cdae53	Add C4A-Script support and documentation - Generate OneShot js code generator - Introduced a new C4A-Script tutorial example for login flow using Blockly. - Updated index.html to include Blockly theme and event editor modal for script editing. - Created a test HTML file for testing Blockly integration. - Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples. - Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples. - Updated mkdocs.yml to include new C4A-Script documentation in navigation.	2025-06-07 23:07:19 +08:00
Markus Zimmermann	022cc2d92a	fix, Typo	2025-06-05 15:30:38 +02:00
UncleCode	82a25c037a	feat(async_url_seeder): add smart URL filtering to exclude nonsense URLs This update introduces a new feature in the URL seeding process that allows for the automatic filtering of utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The class has been enhanced with a new parameter, , which is enabled by default. This change aims to improve the efficiency of the crawling process by reducing the number of irrelevant URLs processed. Significant modifications include: - Added parameter to in . - Implemented logic in to check and filter out nonsense URLs during the seeding process in . - Updated documentation to reflect the new filtering feature and provide examples of its usage in . This change enhances the overall functionality of the URL seeder, making it smarter and more efficient in identifying and excluding non-content URLs. BREAKING CHANGE: The now requires the parameter to be explicitly set if the default behavior is to be altered. Related issues: #123	2025-06-05 15:46:24 +08:00
UncleCode	c6fc5c0518	docs(linkedin, url_seeder): update and reorganize LinkedIn data discovery and URL seeder documentation This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity. The changes include: - Added and for comprehensive LinkedIn data discovery. - Removed to eliminate outdated content. - Updated to reflect new data visualization requirements. - Introduced and to facilitate easier access to URL seeding techniques. - Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples. These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.	2025-06-05 15:06:25 +08:00
UncleCode	3048cc1ff9	feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling. ## Core Features ### AsyncUrlSeeder Component - Discovers URLs from multiple sources: - Sitemaps (including nested and gzipped) - Common Crawl index - Combined sources for maximum coverage - Extracts page metadata without full crawling: - Title, description, keywords - Open Graph and Twitter Card tags - JSON-LD structured data - Language and charset information - BM25 relevance scoring for intelligent filtering: - Query-based URL discovery - Configurable score thresholds - Automatic ranking by relevance - Performance optimizations: - Async/concurrent processing with configurable workers - Rate limiting (hits per second) - Automatic caching with TTL - Streaming results for large datasets ### SeedingConfig - Comprehensive configuration for URL seeding: - Source selection (sitemap, cc, or both) - URL pattern filtering with wildcards - Live URL validation options - Metadata extraction controls - BM25 scoring parameters - Concurrency and rate limiting ### Integration with AsyncWebCrawler - Seamless pipeline: discover → filter → crawl - Direct compatibility with arun_many() - Significant resource savings by pre-filtering URLs ## Documentation - Comprehensive guide comparing URL seeding vs deep crawling - Complete API reference with parameter tables - Practical examples showing all features - Performance benchmarks and best practices - Integration patterns with AsyncWebCrawler ## Examples - url_seeder_demo.py: Interactive Rich-based demo with: - Basic discovery - Cache management - Live validation - BM25 scoring - Multi-domain discovery - Complete pipeline integration - url_seeder_quick_demo.py: Screenshot-friendly examples: - Pattern-based filtering - Metadata exploration - Smart search with BM25 ## Testing - Comprehensive test suite (test_async_url_seeder_bm25.py) - Coverage of all major features - Edge cases and error handling - Performance and consistency tests ## Implementation Details - Built on httpx with HTTP/2 support - Optional dependencies: lxml, brotli, rank_bm25 - Cache management in ~/.crawl4ai/seeder_cache/ - Logger integration with AsyncLoggerBase - Proper error handling and retry logic ## Bug Fixes - Fixed logger color compatibility (lightblack → bright_black) - Corrected URL extraction from seeder results for arun_many() - Updated all examples and documentation with proper usage This feature enables users to crawl smarter, not harder, by discovering and analyzing URLs before committing resources to crawling them.	2025-06-03 23:27:12 +08:00
ntohidi	fcc2abe4db	(fix): Update document about LLM extraction strategy to use LLMConfig. REF #1146	2025-06-03 12:53:59 +02:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
ntohidi	773ed7b281	Merge branch '2025-APR-1' into 2025-MAY-2	2025-06-02 20:25:58 +02:00
UncleCode	1fc45ffac8	Fix temperature typo and enhance LinkedIn extraction with Colab support - Fixed widespread typo: `temprature` → `temperature` across LLMConfig and related files - Enhanced CSS/XPath selector guidance for more reliable LinkedIn data extraction - Added Google Colab display server support for running Crawl4AI in notebook environments - Improved browser debugging with verbose startup args logging - Updated LinkedIn schemas and HTML snippets for better parsing accuracy 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-05-25 16:47:12 +08:00
devin-ai-integration[bot]	9c2cc7f73c	Fix BM25ContentFilter documentation to use language parameter instead of use_stemming (#1152) Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: UncleCode <unclecode@kidocode.com>	2025-05-25 10:02:13 +08:00
UncleCode	1c5e76d51a	Adjust positioning and set only core component as selected item by default	2025-05-24 20:49:44 +08:00
UncleCode	7665a6832f	Add LLMContext article and update JS to not show all components.	2025-05-24 20:46:24 +08:00
UncleCode	a06710ff03	Adding LLMContext generator to website.	2025-05-24 20:37:09 +08:00
Aravind Karnam	3d46d89759	docs: fix https://github.com/unclecode/crawl4ai/issues/1109	2025-05-22 17:21:42 +05:30
ntohidi	cb8d581e47	fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125	2025-05-19 18:03:05 +02:00
Ahmed-Tawfik94	a97654270b	#1086 fix(markdown): update BM25 filter to use language parameter for stemming	2025-05-19 14:11:46 +08:00
UncleCode	becc4624bb	feat(favicon): add new favicon images for improved branding	2025-05-17 19:03:51 +08:00
UncleCode	ac9981a1f5	feat(favicon): add favicon image and update mkdocs configuration	2025-05-16 21:59:23 +08:00
UncleCode	83ef15fd47	feat(favicon): add favicon.ico for improved branding	2025-05-16 21:55:07 +08:00
UncleCode	a3cb938675	feat(theme): enable dark color mode in mkdocs configuration	2025-05-16 21:44:56 +08:00
UncleCode	9b60988232	feat(feedback): add feedback modal styles and integrate into mkdocs configuration	2025-05-16 21:25:10 +08:00
UncleCode	baca2df8df	feat(analytics): add Google Tag Manager script and gtag.js for tracking	2025-05-16 20:49:02 +08:00
Aravind Karnam	2b17f234f8	docs: update direct passing of content_filter to CrawlerRunConfig and instead pass it via MarkdownGenerator. Ref: #603	2025-05-07 15:20:36 +05:30
Aravind Karnam	39e3b792a1	Merge branch 'next' into 2025-APR-1	2025-05-07 10:25:25 +05:30
UncleCode	9b5ccac76e	feat(extraction): add RegexExtractionStrategy for pattern-based extraction Add new RegexExtractionStrategy for fast, zero-LLM extraction of common data types: - Built-in patterns for emails, URLs, phones, dates, and more - Support for custom regex patterns - LLM-assisted pattern generation utility - Optimized HTML preprocessing with fit_html field - Enhanced network response body capture Breaking changes: None	2025-05-02 21:15:24 +08:00
Aravind Karnam	094201ab2a	Merge next + resolve conflicts	2025-04-23 19:44:50 +05:30
UncleCode	ad4dfb21e1	Remove "rc1"	2025-04-23 21:00:00 +08:00
UncleCode	7784b2468e	feat(docs): enhance Ask AI button UX and add v0.6.0 release notes Improve Ask AI button with better mobile support, animations, and positioning: - Add button animations and hover effects - Improve mobile responsiveness - Add icon to button - Fix positioning logic for different viewport sizes - Add keyboard (Escape) support Add comprehensive v0.6.0 release documentation: - Create detailed release notes - Update blog index with latest release - Document all major features and breaking changes BREAKING CHANGE: Documentation structure updated with new v0.6.0 section	2025-04-23 20:07:03 +08:00
UncleCode	37fd80e4b9	feat(docs): add mobile-friendly navigation menu Implements a responsive hamburger menu for mobile devices with the following changes: - Add new mobile_menu.js for handling mobile navigation - Update layout.css with mobile-specific styles and animations - Enhance README with updated geolocation example - Register mobile_menu.js in mkdocs.yml The mobile menu includes: - Hamburger button animation - Slide-out sidebar - Backdrop overlay - Touch-friendly navigation - Proper event handling	2025-04-23 19:44:25 +08:00
UncleCode	949a93982e	feat(docs): update documentation and disable Ask AI feature Major documentation updates including: - Add comprehensive code examples page - Add video tutorial to homepage - Update Docker deployment instructions for v0.6.0 - Temporarily disable Ask AI feature - Add table border styling - Update site version to v0.6.x BREAKING CHANGE: Ask AI feature temporarily disabled pending launch	2025-04-23 19:02:39 +08:00
UncleCode	b0aa8bc9f7	Update README	2025-04-22 23:21:42 +08:00
UncleCode	4812f08a73	feat(docker): update Docker deployment for v0.6.0 Major updates to Docker deployment infrastructure: - Switch default port to 11235 for all services - Add MCP (Model Context Protocol) support with WebSocket/SSE endpoints - Simplify docker-compose.yml with auto-platform detection - Update documentation with new features and examples - Consolidate configuration and improve resource management BREAKING CHANGE: Default port changed from 8020 to 11235. Update your configurations and deployment scripts accordingly.	2025-04-22 22:35:25 +08:00
unclecode	f3ebb38edf	Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md	2025-04-22 14:56:47 +08:00
UncleCode	0007aea204	Update changelog	2025-04-21 23:21:49 +08:00
UncleCode	b5c25731e6	feat(browser): add geolocation, locale and timezone support Add support for controlling browser geolocation, locale and timezone settings: - New GeolocationConfig class for managing GPS coordinates - Add locale and timezone_id parameters to CrawlerRunConfig - Update browser context creation to handle location settings - Add example script for geolocation usage - Update documentation with location-based identity features This enables more precise control over browser identity and location reporting.	2025-04-21 23:20:59 +08:00
ntohidi	14a31456ef	fix(docs): update browser-crawler-config example to include LLMContentFilter and DefaultMarkdownGenerator, fix syntax errors	2025-04-21 13:59:49 +02:00
Aravind Karnam	b27bb367e8	merge next. Resolve conflicts. Fix some import errors and error handling in server.py	2025-04-19 20:27:47 +05:30
UncleCode	7db6b468d9	feat(markdown): add content source selection for markdown generation Adds a new content_source parameter to MarkdownGenerationStrategy that allows selecting which HTML content to use for markdown generation: - cleaned_html (default): uses post-processed HTML - raw_html: uses original webpage HTML - fit_html: uses preprocessed HTML for schema extraction Changes include: - Added content_source parameter to MarkdownGenerationStrategy - Updated AsyncWebCrawler to handle HTML source selection - Added examples and tests for the new feature - Updated documentation with new parameter details BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown() method signature to better reflect its generalized purpose	2025-04-17 20:13:53 +08:00
Aravind Karnam	eed7f88f29	Merge branch 'next' into 2025-MAR-ALPHA-1	2025-04-17 10:50:02 +05:30
UncleCode	230f22da86	refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization. Improved LLM token handling with new PROVIDER_MODELS_PREFIXES. Added test cases for deep crawling and proxy rotation. Removed docker_config from BrowserConfig as it's handled separately. BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai	2025-04-15 22:27:18 +08:00
UncleCode	cd7ff6f9c1	feat(docs): add AI assistant interface and code copy button Add new AI assistant chat interface with features: - Real-time chat with markdown support - Chat history management - Citation tracking - Selection-to-query functionality Also adds code copy button to documentation code blocks and adjusts layout/styling. Breaking changes: None	2025-04-14 23:00:47 +08:00
UncleCode	c56974cf59	feat(docs): enhance documentation UI with ToC and GitHub stats Add new features to documentation UI: - Add table of contents with scroll spy functionality - Add GitHub repository statistics badge - Implement new centered layout system with fixed sidebar - Add conditional Playwright installation based on CRAWL4AI_MODE Breaking changes: None	2025-04-14 20:46:32 +08:00
ntohidi	1f3b1251d0	docs(cli): add Crawl4AI CLI installation instructions to the CLI guide	2025-04-14 12:16:31 +02:00
Aravind Karnam	022f5c9e25	Merged next branch	2025-04-12 10:47:02 +05:30
UncleCode	108b2a8bfb	Fixed capturing console messages when the URL is a local file. Update docker configuration (work in progress)	2025-04-10 23:22:38 +08:00
unclecode	66ac07b4f3	feat(crawler): add network request and console message capturing Implement comprehensive network request and console message capturing functionality: - Add capture_network_requests and capture_console_messages config parameters - Add network_requests and console_messages fields to models - Implement Playwright event listeners to capture requests, responses, and console output - Create detailed documentation and examples - Add comprehensive tests This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.	2025-04-10 16:03:48 +08:00


223 Commits
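
Commit 82a25c037a describes filtering utility URLs such as robots.txt and sitemap.xml out of the seeding results. The sketch below illustrates that idea using only the Python standard library. It is not the AsyncUrlSeeder implementation: the names `UTILITY_FILENAMES`, `is_utility_url`, and `filter_content_urls` are hypothetical, and the actual parameter that toggles this behavior is not named in the commit message as shown here.

```python
from urllib.parse import urlparse

# Hypothetical set of utility file names. The commit message gives
# robots.txt and sitemap.xml as examples; the real list may differ.
UTILITY_FILENAMES = {"robots.txt", "sitemap.xml"}


def is_utility_url(url: str) -> bool:
    """Return True if the last path segment of the URL is a known utility file."""
    path = urlparse(url).path          # query string and fragment are excluded
    filename = path.rsplit("/", 1)[-1].lower()
    return filename in UTILITY_FILENAMES


def filter_content_urls(urls):
    """Keep only URLs that do not point at utility files, preserving order."""
    return [u for u in urls if not is_utility_url(u)]


candidates = [
    "https://example.com/robots.txt",
    "https://example.com/blog/post",
    "https://example.com/sitemap.xml?page=1",
]
content_urls = filter_content_urls(candidates)
```

Matching on the final path segment rather than on the full URL string means a page such as `/blog/robots.txt-explained` would still be kept, which is one reasonable design choice among several.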