crawl4ai

Author	SHA1	Message	Date
UncleCode	14f690d751	docs: Update documentation for v0.7.0 release - Update mkdocs.yml site name to v0.7.x - Add v0.7.0 to blog index as latest release - Move v0.6.0 to Previous Releases section - Copy release notes to proper location in docs/md_v2/blog/releases/	2025-07-12 19:08:17 +08:00
UncleCode	7b9ba3015f	Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update v0.7.0	2025-07-12 18:54:20 +08:00
UncleCode	0c8bb742b7	Release v0.7.0-r1: The Adaptive Intelligence Update - Bump version to 0.7.0 - Add release notes and demo files - Update README with v0.7.0 features - Update Docker configurations for v0.7.0-r1 - Move v0.7.0 demo files to releases_review - Fix BM25 scoring bug in URLSeeder Major features: - Adaptive Crawling with pattern learning - Virtual Scroll support for infinite pages - Link Preview with 3-layer scoring - Async URL Seeder for massive discovery - Performance optimizations	2025-07-12 18:51:13 +08:00
UncleCode	ba2ed53ff1	test(releases): Add test cases for release 0.7.0	2025-07-11 22:27:18 +08:00
UncleCode	a93efcb650	Merge PR #1285 : 2025 APR, MAY, and JUN bug fixes	2025-07-11 21:22:34 +08:00
UncleCode	8794852a26	Merge PR #1285 : 2025 APR, MAY, and JUN bug fixes	2025-07-11 21:22:03 +08:00
UncleCode	fb25a4a769	docs(examples): update crawl4ai showcase script The crawl4ai showcase script has been significantly expanded to include more detailed examples and demonstrations. This includes live code examples, more detailed explanations, and a new real-world example. A new file, uv.lock, has also been added.	2025-07-11 20:55:37 +08:00
ntohidi	afe852935e	fix: show /llm API response in playground. ref #1288	2025-07-09 16:59:17 +02:00
ntohidi	0ebce590f8	Merge branch '2025-JUN-1' into next-MAY	2025-07-09 09:41:03 +02:00
ntohidi	026e96a2df	feat: Add social media and community links to README and index documentation	2025-07-08 15:48:40 +02:00
ntohidi	36429a63de	fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105	2025-07-08 12:54:33 +02:00
ntohidi	a3d41c7951	fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086	2025-07-08 12:24:33 +02:00
ntohidi	fee4c5c783	fix: Consolidate import statements in local-files.md for clarity	2025-07-08 11:46:24 +02:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	1a73fb60db	feat(crawl4ai): Implement adaptive crawling feature This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456	2025-07-04 15:16:53 +08:00
UncleCode	74705c1f67	Move release scripts to private .scripts folder - Remove release-agent.py, build-nightly.py from public repo - Add .scripts/ to .gitignore for private tools - Maintain clean public repository while keeping internal tools	2025-07-04 15:02:25 +08:00
UncleCode	048d9b0f5b	feat: Implement nightly build script and update version handling	2025-07-03 20:53:03 +08:00
ntohidi	ee25c771d8	feat(cli): add deep crawling options with configurable strategies and max pages. ref #874	2025-07-02 14:07:23 +02:00
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
UncleCode	539a324cf6	refactor(link_extractor): remove link_extractor and rename to link_preview This change removes the link_extractor module and renames it to link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability. Other files have been updated to reflect this change, ensuring consistency across the project. BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.	2025-06-27 21:54:22 +08:00
UncleCode	5c9c305dbf	feat: Add advanced link head extraction with three-layer scoring system (#1 ) Squashed commit from feature/link-extractor branch implementing comprehensive link analysis: - Extract HTML head content from discovered links with parallel processing - Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores - New LinkExtractionConfig class for type-safe configuration - Pattern-based filtering for internal/external links - Comprehensive documentation and examples	2025-06-27 20:06:04 +08:00
Aravind	02f3127ded	Track Stargazers (#1249 ) * Webhook for when repo is starred * Send star data to google sheets to be saved * change event name to watch * Change message displayed on Discord * Update .github/workflows/main.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: UncleCode <unclecode@kidocode.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-06-25 22:26:19 +08:00
UncleCode	e528086341	test(async_assistant): add new tests for extract pipeline Introduced two new test files to enhance coverage for the extract pipeline functionality. The tests aim to validate the behavior of the pipeline under various scenarios, ensuring robustness and reliability. No breaking changes. Closes issue #123.	2025-06-23 10:44:27 +08:00
ntohidi	414f16e975	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:05:44 +02:00
ntohidi	b7a6e02236	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:04:32 +02:00
AHMET YILMAZ	9332326457	feat: Add PDF parsing documentation and navigation entry	2025-06-16 18:18:32 +08:00
ntohidi	6cd34b3157	Merge branch '2025-MAY-2' of https://github.com/unclecode/crawl4ai into 2025-MAY-2	2025-06-13 11:26:17 +02:00
ntohidi	871d4f1158	fix(extraction_strategy): rename response variable to content for clarity in LLMExtractionStrategy. ref #1146	2025-06-13 11:26:05 +02:00
prokopis3	c4d625fb3c	chore(profile-test): fix filename typo (test_crteate_profile.py → test_create_profile.py) - Rename file to correct spelling - No content changes	2025-06-12 14:38:32 +03:00
prokopis3	ef722766f0	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-06-12 14:33:12 +03:00
ntohidi	dc85481180	refactor: Update LLM extraction example with the updated structure	2025-06-12 12:23:03 +02:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
UncleCode	c0fd36982d	Update all documentation to import extraction strategies directly from crawl4ai.	2025-06-10 18:08:27 +08:00
ntohidi	4679ee023d	fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003	2025-06-10 11:19:18 +02:00
Nasrin	f9b7090084	Merge pull request #1186 from zimmski/fix-typo-provoder fix, Typo	2025-06-10 10:26:45 +02:00
UncleCode	cab457e9c7	Merge branch 'next' of https://github.com/unclecode/crawl4ai into next	2025-06-10 15:54:20 +08:00
UncleCode	2a0c0ed18d	chore(deps): add httpx extras (#1195)	2025-06-10 15:47:03 +08:00
UncleCode	c73a130c50	Set memory_wait_timeout default to 10 minutes (#1193)	2025-06-10 15:47:03 +08:00
UncleCode	ef6f4329fa	Add use_stemming option to BM25ContentFilter (#1192)	2025-06-10 15:44:45 +08:00
UncleCode	4eb90b41b6	Refactor Crawl4AI Assistant: Rename Schema Builder to Click2Crawl, update UI elements, and remove deprecated files - Updated overlay.css to add gap in titlebar. - Deleted schemaBuilder_v1.js and associated zip files (v1.0.0 to v1.2.0). - Modified index.html to reflect new Click2Crawl feature and updated descriptions. - Updated manifest.json to include new JavaScript files for Click2Crawl and markdown extraction. - Refined popup styles and HTML to align with new feature names and functionalities. - Enhanced user instructions and tooltips to guide users on the new Click2Crawl and Markdown Extraction features.	2025-06-10 15:40:26 +08:00
AHMET YILMAZ	9442597f81	#1127 : Improve URL handling and normalization in scraping strategies	2025-06-10 11:57:06 +08:00
UncleCode	0ac12da9f3	feat: Major Chrome Extension overhaul with Click2Crawl, instant Schema extraction, and modular architecture ✨ New Features: - Click2Crawl: Visual element selection with markdown conversion - Ctrl/Cmd+Click to select multiple elements - Visual text mode for WYSIWYG extraction - Real-time markdown preview with syntax highlighting - Export to .md file or clipboard - Schema Builder Enhancement: Instant data extraction without LLMs - Test schemas directly in browser - See JSON results immediately - Export data or Python code - Cloud deployment ready (coming soon) - Modular Architecture: - Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js - Added contentAnalyzer.js and markdownConverter.js modules - Shared utilities and CSS reset system - Integrated marked.js for markdown rendering 🎨 UI/UX Improvements: - Added edgy cloud announcement banner with seamless shimmer animation - Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud." - Enhanced feature cards with emojis - Fixed CSS conflicts with targeted reset approach - Improved badge hover effects (red on hover) - Added wrap toggle for code preview 📚 Documentation Updates: - Split extraction diagrams into LLM and no-LLM versions - Updated llms-full.txt with latest content - Added versioned LLM context (v0.1.1) 🔧 Technical Enhancements: - Refactored 3464 lines of monolithic content.js into modules - Added proper event handling and cleanup - Improved z-index management - Better scroll position tracking for badges - Enhanced error handling throughout This release transforms the Chrome Extension from a simple tool into a powerful visual data extraction suite, making web scraping accessible to everyone.	2025-06-09 23:18:27 +08:00
AHMET YILMAZ	74b06d4b80	#1167 Add PHP MIME types to ContentTypeFilter for better file handling	2025-06-09 11:49:33 +08:00
UncleCode	40640badad	feat: add Script Builder to Chrome Extension and reorganize LLM context files This commit introduces significant enhancements to the Crawl4AI ecosystem: Chrome Extension - Script Builder (Alpha): - Add recording functionality to capture user interactions (clicks, typing, scrolling) - Implement smart event grouping for cleaner script generation - Support export to both JavaScript and C4A script formats - Add timeline view for visualizing and editing recorded actions - Include wait commands (time-based and element-based) - Add saved flows functionality for reusing automation scripts - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents) - Release new extension versions: v1.1.0, v1.2.0, v1.2.1 LLM Context Builder Improvements: - Reorganize context files from llmtxt/ to llm.txt/ with better structure - Separate diagram templates from text content (diagrams/ and txt/ subdirectories) - Add comprehensive context files for all major Crawl4AI components - Improve file naming convention for better discoverability Documentation Updates: - Update apps index page to match main documentation theme - Standardize color scheme: "Available" tags use primary color (#50ffff) - Change "Coming Soon" tags to dark gray for better visual hierarchy - Add interactive two-column layout for extension landing page - Include code examples for both Schema Builder and Script Builder features Technical Improvements: - Enhance event capture mechanism with better element selection - Add support for contenteditable elements and complex form interactions - Implement proper scroll event handling for both window and element scrolling - Add meta key support for keyboard shortcuts - Improve selector generation for more reliable element targeting The Script Builder is released as Alpha, acknowledging potential bugs while providing early access to this powerful automation recording feature.	2025-06-08 22:02:12 +08:00
UncleCode	926592649e	Add Crawl4AI Assistant Chrome Extension - Created manifest.json for the Crawl4AI Assistant extension. - Added popup HTML, CSS, and JS files for the extension interface. - Included icons and favicon for the extension. - Implemented functionality for schema capture and code generation. - Updated index.md to reflect the availability of the new extension. - Enhanced LLM Context Builder layout and styles for consistency. - Adjusted global styles for better branding and responsiveness.	2025-06-08 18:34:05 +08:00
UncleCode	b870bfdb6c	chore(deps): add httpx extras (#1195)	2025-06-08 16:06:38 +08:00
UncleCode	6f3a0ea38e	Create "Apps" section in documentation and Add interactive c4a-script playground and LLM context builder for Crawl4AI - Created a new HTML page (`index.html`) for the interactive LLM context builder, allowing users to select and combine different `crawl4ai` context files. - Implemented JavaScript functionality (`llmtxt.js`) to manage component selection, context types, and file downloads. - Added CSS styles (`llmtxt.css`) for a terminal-themed UI. - Introduced a new Markdown file (`build.md`) detailing the requirements and functionality of the context builder. - Updated the navigation in `mkdocs.yml` to include links to the new context builder and demo apps. - Added a new Markdown file (`why.md`) explaining the motivation behind the new context structure and its benefits for AI coding assistants.	2025-06-08 15:48:17 +08:00
UncleCode	451b0d6c9a	Set memory_wait_timeout default to 10 minutes (#1193)	2025-06-08 13:53:09 +08:00
UncleCode	8b215e17af	Add use_stemming option to BM25ContentFilter (#1192)	2025-06-08 12:57:37 +08:00
UncleCode	b4bb0ccea0	Update simple-crawling.md Fix documentation that wrongly presented fit_markdown as a direct parameter of CrawlerRunConfig, which it is NOT.	2025-06-08 11:33:28 +08:00


1196 Commits