Commit Graph

10 Commits

Author SHA1 Message Date
UncleCode
926592649e Add Crawl4AI Assistant Chrome Extension
- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
2025-06-08 18:34:05 +08:00
UncleCode
3048cc1ff9 feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
2025-06-03 23:27:12 +08:00
UncleCode
949a93982e feat(docs): update documentation and disable Ask AI feature
Major documentation updates including:
- Add comprehensive code examples page
- Add video tutorial to homepage
- Update Docker deployment instructions for v0.6.0
- Temporarily disable Ask AI feature
- Add table border styling
- Update site version to v0.6.x

BREAKING CHANGE: Ask AI feature temporarily disabled pending launch
2025-04-23 19:02:39 +08:00
UncleCode
cd7ff6f9c1 feat(docs): add AI assistant interface and code copy button
Add new AI assistant chat interface with features:
- Real-time chat with markdown support
- Chat history management
- Citation tracking
- Selection-to-query functionality

Also adds code copy button to documentation code blocks and adjusts layout/styling.

Breaking changes: None
2025-04-14 23:00:47 +08:00
UncleCode
c56974cf59 feat(docs): enhance documentation UI with ToC and GitHub stats
Add new features to documentation UI:
- Add table of contents with scroll spy functionality
- Add GitHub repository statistics badge
- Implement new centered layout system with fixed sidebar
- Add conditional Playwright installation based on CRAWL4AI_MODE

Breaking changes: None
2025-04-14 20:46:32 +08:00
UncleCode
6a01008a2b docs(multi-url): improve documentation clarity and update examples
- Restructure multi-URL crawling documentation with better formatting and examples
- Update code examples to use new API syntax (arun_many)
- Add detailed parameter explanations for RateLimiter and Dispatchers
- Enhance CSS styling for better documentation readability
- Fix outdated method calls in feature demo script

BREAKING CHANGE: Updated dispatcher.run_urls() to crawler.arun_many() in examples
2025-01-23 22:33:36 +08:00
UncleCode
e8b4ac6046 docs(urls): update documentation URLs to new domain
Update all documentation URLs from crawl4ai.com/mkdocs to docs.crawl4ai.com
Improve badges styling and layout in documentation
Increase code font size in documentation CSS

BREAKING CHANGE: Documentation URLs have changed from crawl4ai.com/mkdocs to docs.crawl4ai.com
2025-01-09 16:22:41 +08:00
UncleCode
ca3e33122e refactor(docs): reorganize documentation structure and update styles
Reorganize documentation into core/advanced/extraction sections for better navigation.
Update terminal theme styles and add rich library for better CLI output.
Remove redundant tutorial files and consolidate content into core sections.
Add personal story to index page for project context.

BREAKING CHANGE: Documentation structure has been significantly reorganized
2025-01-07 20:49:50 +08:00
UncleCode
9307c19f35 Update documents, upload new version of quickstart. 2024-10-30 20:39:35 +08:00
UncleCode
4239654722 Update Documentation 2024-10-27 19:24:46 +08:00