This change replaces the link_extractor module with link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability, and other files have been updated to reflect the change, ensuring consistency across the project.
BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.
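For downstream code, the migration is a one-line import change. A minimal sketch, assuming the renamed module keeps its location under the `crawl4ai` package; the exported class name below is illustrative, not confirmed by this change:

```python
# Old import: the link_extractor module has been deleted.
# from crawl4ai.link_extractor import LinkExtractor

# New import: assuming the renamed module sits at the same package level
# and exposes an equivalent public entry point (name is illustrative).
from crawl4ai.link_preview import LinkPreview
```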
Squashed commit from feature/link-extractor branch implementing comprehensive link analysis:
- Extract HTML head content from discovered links with parallel processing
- Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores
- New LinkExtractionConfig class for type-safe configuration (see the sketch after this list)
- Pattern-based filtering for internal/external links
- Comprehensive documentation and examples
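A minimal configuration sketch for the new class. The field names below are inferred from the feature list (pattern filtering, BM25 query scoring, parallel head extraction) and are assumptions, not a verified signature:

```python
# Sketch of type-safe link-extraction configuration; field names are
# inferred from the feature list above, not a confirmed API.
from crawl4ai.link_preview import LinkExtractionConfig  # assumed import path

config = LinkExtractionConfig(
    include_internal=True,           # pattern-based internal/external filtering
    include_patterns=["*/docs/*"],   # only score links matching these patterns
    query="async crawling guide",    # drives the contextual (BM25) score
    concurrency=10,                  # parallel HTML head extraction
    score_threshold=0.3,             # drop links below this total score
)
```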
* Add webhook for when the repo is starred
* Send star data to Google Sheets for storage
* Change event name to watch
* Change message displayed on Discord
* Update .github/workflows/main.yml
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: UncleCode <unclecode@kidocode.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Introduced two new test files to broaden coverage of the extract pipeline. The tests validate pipeline behavior under various scenarios, ensuring robustness and reliability.
No breaking changes. Closes issue #123.
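As an illustration of the shape these tests take (a sketch only; the actual test files are not shown in this changelog, and `pytest-asyncio` is assumed for async test support):

```python
# Hypothetical extract-pipeline test in the spirit of the new test files;
# requires pytest-asyncio for the async marker used below.
import pytest
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

@pytest.mark.asyncio
async def test_extract_pipeline_produces_markdown():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=CrawlerRunConfig())
        assert result.success                 # pipeline completed
        assert result.markdown is not None    # content was extracted
```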
- Updated overlay.css to add gap in titlebar.
- Deleted schemaBuilder_v1.js and associated zip files (v1.0.0 to v1.2.0).
- Modified index.html to reflect new Click2Crawl feature and updated descriptions.
- Updated manifest.json to include new JavaScript files for Click2Crawl and markdown extraction.
- Refined popup styles and HTML to align with new feature names and functionalities.
- Enhanced user instructions and tooltips to guide users on the new Click2Crawl and Markdown Extraction features.
✨ New Features:
- Click2Crawl: Visual element selection with markdown conversion
- Ctrl/Cmd+Click to select multiple elements
- Visual text mode for WYSIWYG extraction
- Real-time markdown preview with syntax highlighting
- Export to .md file or clipboard
- Schema Builder Enhancement: Instant data extraction without LLMs
- Test schemas directly in browser
- See JSON results immediately
- Export data or Python code
  - Cloud deployment support (coming soon)
- Modular Architecture:
- Separated into schemaBuilder.js, scriptBuilder.js, click2CrawlBuilder.js
- Added contentAnalyzer.js and markdownConverter.js modules
- Shared utilities and CSS reset system
- Integrated marked.js for markdown rendering
🎨 UI/UX Improvements:
- Added edgy cloud announcement banner with seamless shimmer animation
- Direct, technical copy: "You don't need Puppeteer. You need Crawl4AI Cloud."
- Enhanced feature cards with emojis
- Fixed CSS conflicts with targeted reset approach
- Improved badge hover effects (red on hover)
- Added wrap toggle for code preview
📚 Documentation Updates:
- Split extraction diagrams into LLM and no-LLM versions
- Updated llms-full.txt with latest content
- Added versioned LLM context (v0.1.1)
🔧 Technical Enhancements:
- Refactored 3464 lines of monolithic content.js into modules
- Added proper event handling and cleanup
- Improved z-index management
- Better scroll position tracking for badges
- Enhanced error handling throughout
This release transforms the Chrome Extension from a simple tool into a powerful
visual data extraction suite, making web scraping accessible to everyone.
This commit introduces significant enhancements to the Crawl4AI ecosystem:
Chrome Extension - Script Builder (Alpha):
- Add recording functionality to capture user interactions (clicks, typing, scrolling)
- Implement smart event grouping for cleaner script generation
- Support export to both JavaScript and C4A script formats
- Add timeline view for visualizing and editing recorded actions
- Include wait commands (time-based and element-based)
- Add saved flows functionality for reusing automation scripts
- Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
- Release new extension versions: v1.1.0, v1.2.0, v1.2.1
LLM Context Builder Improvements:
- Reorganize context files from llmtxt/ to llm.txt/ with better structure
- Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
- Add comprehensive context files for all major Crawl4AI components
- Improve file naming convention for better discoverability
Documentation Updates:
- Update apps index page to match main documentation theme
- Standardize color scheme: "Available" tags use primary color (#50ffff)
- Change "Coming Soon" tags to dark gray for better visual hierarchy
- Add interactive two-column layout for extension landing page
- Include code examples for both Schema Builder and Script Builder features
Technical Improvements:
- Enhance event capture mechanism with better element selection
- Add support for contenteditable elements and complex form interactions
- Implement proper scroll event handling for both window and element scrolling
- Add meta key support for keyboard shortcuts
- Improve selector generation for more reliable element targeting
The Script Builder is released as Alpha, acknowledging potential bugs while providing
early access to this powerful automation recording feature.
- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
- Created a new HTML page (`index.html`) for the interactive LLM context builder, allowing users to select and combine different `crawl4ai` context files.
- Implemented JavaScript functionality (`llmtxt.js`) to manage component selection, context types, and file downloads.
- Added CSS styles (`llmtxt.css`) for a terminal-themed UI.
- Introduced a new Markdown file (`build.md`) detailing the requirements and functionality of the context builder.
- Updated the navigation in `mkdocs.yml` to include links to the new context builder and demo apps.
- Added a new Markdown file (`why.md`) explaining the motivation behind the new context structure and its benefits for AI coding assistants.
- Add OneShot JS code generator
- Introduced a new C4A-Script tutorial example for login flow using Blockly.
- Updated index.html to include Blockly theme and event editor modal for script editing.
- Created a test HTML file for testing Blockly integration.
- Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples.
- Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples.
- Updated mkdocs.yml to include new C4A-Script documentation in navigation.
This commit introduces a comprehensive set of new scripts and examples that expand the scripting capabilities of the crawl4ai project. The changes add several Python scripts for compiling and executing automation scripts, along with a variety of example scripts demonstrating login flows, data extraction, and multi-step workflows. Detailed documentation has also been created to guide users in using these new features effectively.
The following significant modifications were made:
- Added core scripting files for script compilation and execution.
- Created a new documentation file to provide an overview of the new features.
- Introduced multiple example scripts to showcase various use cases.
- Updated existing modules to integrate the new functionality.
- Added font assets for improved documentation presentation.
These changes significantly expand the functionality of the crawl4ai project, allowing users to create more complex and varied scripts with ease.
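A rough sketch of the intended workflow. The `c4a_script` config parameter and the C4A command syntax shown are assumptions drawn from this changelog's description, not a pinned public API:

```python
# Hypothetical usage: run a multi-step C4A login flow during a crawl.
# The c4a_script parameter is an assumption, not a confirmed API.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

LOGIN_FLOW = """
# Illustrative C4A-style commands for a login flow
CLICK `#login`
TYPE "user@example.com"
CLICK `#submit`
WAIT `#dashboard` 10
"""

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com/login",
            config=CrawlerRunConfig(c4a_script=LOGIN_FLOW),
        )
        print(result.success)

asyncio.run(main())
```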
This update introduces a new feature in the URL seeding process that automatically filters out utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The seeding configuration class has been enhanced with a new filtering parameter, enabled by default. This change improves crawling efficiency by reducing the number of irrelevant URLs processed.
Significant modifications include:
- Added the new filtering parameter to the seeding configuration (see the sketch below).
- Implemented logic to detect and filter out nonsense URLs during the seeding process.
- Updated documentation to reflect the new filtering feature and provide usage examples.
This change enhances the overall functionality of the URL seeder, making it smarter and more efficient in identifying and excluding non-content URLs.
BREAKING CHANGE: The new filtering parameter must now be explicitly set if the default behavior is to be altered.
Related issues: #123
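A sketch of the new default, assuming the parameter is named `filter_nonsense_urls` after the wording above (the changelog elides the exact name):

```python
# Sketch of the default utility-URL filtering; the parameter name
# filter_nonsense_urls is inferred from this changelog, not verified.
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def discover(domain: str):
    async with AsyncUrlSeeder() as seeder:
        # Default behavior: robots.txt, sitemap.xml, and similar utility
        # URLs are filtered out before crawling.
        config = SeedingConfig(source="sitemap")
        # To restore the old unfiltered behavior, set it explicitly:
        # config = SeedingConfig(source="sitemap", filter_nonsense_urls=False)
        return await seeder.urls(domain, config)
```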
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.
The changes include:
- Added two new Jupyter notebooks for comprehensive LinkedIn data discovery.
- Removed the previous workshop notebook to eliminate outdated content.
- Updated supporting files to reflect new data visualization requirements.
- Introduced a new tutorial and accompanying scripts to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.
These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
- Implemented a comprehensive research pipeline using URLSeeder (outlined in the sketch after this list).
- Steps include user query input, optional LLM enhancement, URL discovery and ranking, content crawling, and synthesis generation.
- Introduced caching mechanism for enhanced query results and crawled content.
- Configurable settings for testing and production modes.
- Output results in JSON and Markdown formats with detailed research insights and citations.
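An outline of the pipeline stages in code. This is a sketch: helper names and the cache layout are placeholders, and the LLM enhancement and synthesis steps are elided:

```python
# Hypothetical outline of the research pipeline stages listed above.
import asyncio, hashlib, json
from pathlib import Path
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig

CACHE = Path(".research_cache")  # placeholder cache location

async def research(query: str, domain: str, top_k: int = 10):
    CACHE.mkdir(exist_ok=True)
    key = CACHE / (hashlib.sha256(f"{domain}:{query}".encode()).hexdigest() + ".json")
    if key.exists():                     # reuse cached results
        return json.loads(key.read_text())

    # Discover URLs and rank them against the (optionally LLM-enhanced) query.
    async with AsyncUrlSeeder() as seeder:
        cfg = SeedingConfig(extract_head=True, query=query, scoring_method="bm25")
        ranked = await seeder.urls(domain, cfg)

    # Crawl only the top-ranked candidates; synthesis and citations omitted.
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many([r["url"] for r in ranked[:top_k]])

    payload = [{"url": r.url, "markdown": str(r.markdown)} for r in results if r.success]
    key.write_text(json.dumps(payload))
    return payload
```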
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.
## Core Features
### AsyncUrlSeeder Component
- Discovers URLs from multiple sources (see the discovery sketch after this feature list):
- Sitemaps (including nested and gzipped)
- Common Crawl index
- Combined sources for maximum coverage
- Extracts page metadata without full crawling:
- Title, description, keywords
- Open Graph and Twitter Card tags
- JSON-LD structured data
- Language and charset information
- BM25 relevance scoring for intelligent filtering:
- Query-based URL discovery
- Configurable score thresholds
- Automatic ranking by relevance
- Performance optimizations:
- Async/concurrent processing with configurable workers
- Rate limiting (hits per second)
- Automatic caching with TTL
- Streaming results for large datasets
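A short discovery sketch covering the features above. Field names follow the seeder's documented result shape; treat specifics as illustrative:

```python
# Discover, score, and inspect URLs without running a full crawl.
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def main():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap+cc",        # combine sitemap and Common Crawl
            extract_head=True,          # pull <head> metadata, no full crawl
            query="python tutorials",   # BM25 relevance query
            scoring_method="bm25",
            score_threshold=0.3,        # keep only sufficiently relevant URLs
        )
        urls = await seeder.urls("example.com", config)
    for item in urls[:5]:
        print(item["url"], item.get("relevance_score"))
        print("  title:", item.get("head_data", {}).get("title"))

asyncio.run(main())
```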
### SeedingConfig
- Comprehensive configuration for URL seeding:
- Source selection (sitemap, cc, or both)
- URL pattern filtering with wildcards
- Live URL validation options
- Metadata extraction controls
- BM25 scoring parameters
- Concurrency and rate limiting
### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many() (see the sketch below)
- Significant resource savings by pre-filtering URLs
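The full pipeline reduces to a few lines. A sketch; unwrapping the seeder result dicts into plain URL strings matches the arun_many() fix noted in the Bug Fixes section:

```python
# Minimal discover -> filter -> crawl pipeline.
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig

async def main():
    async with AsyncUrlSeeder() as seeder:
        found = await seeder.urls(
            "example.com",
            SeedingConfig(source="sitemap", pattern="*/blog/*"),
        )
    async with AsyncWebCrawler() as crawler:
        # arun_many expects plain URL strings, so unwrap the seeder results.
        results = await crawler.arun_many([item["url"] for item in found])
    print(sum(r.success for r in results), "pages crawled")

asyncio.run(main())
```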
## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler
## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
- Basic discovery
- Cache management
- Live validation
- BM25 scoring
- Multi-domain discovery
- Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
- Pattern-based filtering
- Metadata exploration
- Smart search with BM25
## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests
## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic
## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage
This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
- Added Colab badge linking to the demo notebook
- Added call-to-action encouraging users to try the demo in Colab
- Provides zero-setup cloud environment for testing
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>