Commit Graph

16 Commits

Author SHA1 Message Date
Rachel Bushrian
7771ed3894 Merge branch 'develop' into fix/wrong_url_raw 2025-11-24 13:54:07 +02:00
Soham Kukreti
2dc6588573 fix: remove_overlay_elements functionality by calling injected JS function. ref: #1396
- Fix critical bug where overlay removal JS function was injected but never called
- Change remove_overlay_elements() to properly execute the injected async function
- Wrap JS execution in async to handle the async overlay removal logic
- Add test_remove_overlay_elements() test case to verify functionality works
- Ensure overlay elements (cookie banners, popups, modals) are actually removed

The remove_overlay_elements feature now works as intended:
- Before: Function definition injected but never executed (silent failure)
- After: Function injected and called, successfully removing overlay elements
2025-09-29 20:40:08 +05:30
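The failure mode in the commit above (a function defined in the page but never invoked) can be illustrated with a minimal sketch. The helper `build_overlay_script` and the JS function name `removeOverlays` are hypothetical placeholders, not names taken from the crawl4ai source; the sketch only shows the "define, then await inside an async wrapper" pattern.

```python
def build_overlay_script(function_source: str, function_name: str) -> str:
    """Wrap an injected async JS function so it is both defined and awaited.

    Both arguments are hypothetical placeholders used for illustration.
    """
    return (
        "(async () => {\n"
        f"{function_source}\n"
        f"  await {function_name}();\n"  # the call that was missing before the fix
        "})()"
    )


# Placeholder JS definition; the real overlay removal logic is not reproduced here.
JS_DEFINITION = "  async function removeOverlays() { /* remove banners, popups, modals */ }"

script = build_overlay_script(JS_DEFINITION, "removeOverlays")
print(script)
```

Passing only `JS_DEFINITION` to the page would reproduce the silent failure: the script evaluates without error, but nothing is removed.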
Nasrin
23431d8109 Merge pull request #1389 from unclecode/fix/deep-crawl-scoring
fix(deep-crawl): BestFirst priority inversion
2025-09-16 15:45:54 +08:00
rbushria
edd0b576b1 Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
2025-09-01 23:15:56 +03:00
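A rough sketch of the idea behind this fix, assuming raw HTML input is marked with a `raw:` or `raw://` prefix. The helper name `split_input` and the `"Raw HTML"` placeholder label are assumptions made for illustration, not the project's actual code.

```python
from typing import Optional, Tuple

# Assumed prefixes marking inline HTML input; the longer one is checked first
# so that "raw://" is stripped completely.
RAW_PREFIXES = ("raw://", "raw:")


def split_input(url: str) -> Tuple[str, Optional[str]]:
    """Return (url_for_extraction, inline_html).

    For raw input the extraction strategy receives a short placeholder
    instead of the whole HTML document.
    """
    for prefix in RAW_PREFIXES:
        if url.startswith(prefix):
            return "Raw HTML", url[len(prefix):]
    return url, None


print(split_input("raw:<p>hi</p>"))
print(split_input("https://example.com/page"))
```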
ntohidi
96c4b0de67 fix(browser_manager): serialize new_page on persistent context to avoid races ref #1198
- Add _page_lock and guarded creation; handle empty context.pages safely
- Prevents BrowserContext.new_page “Target page/context closed” errors during concurrent arun_many
2025-08-11 18:55:43 +08:00
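The serialization described above can be sketched with `asyncio.Lock`. `FakeContext` and `PageFactory` are hypothetical stand-ins, not Playwright or crawl4ai classes; only the `_page_lock` name comes from the commit message.

```python
import asyncio


class FakeContext:
    """Stand-in for a persistent browser context (hypothetical, not Playwright)."""

    def __init__(self):
        self.pages = []
        self.in_flight = 0
        self.max_in_flight = 0

    async def new_page(self):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0)  # yield control so overlapping calls would be visible
        page = f"page-{len(self.pages)}"
        self.pages.append(page)
        self.in_flight -= 1
        return page


class PageFactory:
    def __init__(self, context):
        self.context = context
        self._page_lock = asyncio.Lock()

    async def get_page(self):
        # Only one coroutine at a time may create a page on the shared context.
        async with self._page_lock:
            return await self.context.new_page()


async def main():
    ctx = FakeContext()
    factory = PageFactory(ctx)
    pages = await asyncio.gather(*(factory.get_page() for _ in range(5)))
    return ctx, pages


ctx, pages = asyncio.run(main())
print(ctx.max_in_flight, pages)
```

Without the lock, `max_in_flight` in this sketch would rise above 1, which is the kind of overlap the fix is meant to rule out.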
ntohidi
88a9fbbb7e fix(deep-crawl): BestFirst priority inversion; remove pre-scoring truncation. ref #1253
Use negative scores in PQ to visit high-score URLs first and drop link cap prior to scoring; add test for ordering.
2025-08-11 18:16:57 +08:00
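A minimal sketch of the negative-score technique, using the standard library's `heapq` rather than whatever queue the crawler actually uses. The function name `visit_order` and the sample URLs are illustrative assumptions.

```python
import heapq


def visit_order(scored_urls):
    """Return URLs in best-first order (highest score first)."""
    heap = []
    for index, (score, url) in enumerate(scored_urls):
        # heapq is a min-heap, so the score is negated to pop the highest first;
        # the index breaks ties in insertion order.
        heapq.heappush(heap, (-score, index, url))
    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)
    return order


links = [
    (0.2, "https://example.com/about"),
    (0.9, "https://example.com/docs"),
    (0.5, "https://example.com/blog"),
]
print(visit_order(links))
```

Pushing the raw score instead of its negation would visit `/about` first, which is the inversion the commit describes.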
ntohidi
0ebce590f8 Merge branch '2025-JUN-1' into next-MAY 2025-07-09 09:41:03 +02:00
ntohidi
5d9213a0e9 fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215 2025-06-12 12:21:40 +02:00
UncleCode
c0fd36982d Update all documentation to import extraction strategies directly from crawl4ai. 2025-06-10 18:08:27 +08:00
ntohidi
4679ee023d fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003 2025-06-10 11:19:18 +02:00
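A path boundary check for prefix matching might look like the sketch below. `path_has_prefix` is a hypothetical helper, not the URLPatternFilter implementation; it only illustrates why a plain `startswith` test is too permissive.

```python
def path_has_prefix(path: str, prefix: str) -> bool:
    """True if `path` equals `prefix` or continues past it at a path boundary."""
    prefix = prefix.rstrip("/")
    if not path.startswith(prefix):
        return False
    remainder = path[len(prefix):]
    # The prefix must end exactly at the end of the path or at a separator.
    return remainder == "" or remainder[0] in "/?#"


print(path_has_prefix("/api/users", "/api"))    # boundary at "/"
print(path_has_prefix("/apiv2/users", "/api"))  # no boundary after "/api"
```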
ntohidi
5ac19a61d7 feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168 2025-06-05 16:40:34 +02:00
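The effect of a step cap on full page scanning can be sketched as follows. Only the parameter name `max_scroll_steps` comes from the commit; the function `scroll_positions`, its other arguments, and the assumption that `None` means "no cap" are illustrative.

```python
from typing import List, Optional


def scroll_positions(
    page_height: int,
    viewport_height: int,
    max_scroll_steps: Optional[int] = None,
) -> List[int]:
    """Return the scroll offsets visited while scanning a page top to bottom."""
    positions = []
    position = 0
    steps = 0
    while position + viewport_height < page_height:
        # Stop early once the cap is reached (assumed: None disables the cap).
        if max_scroll_steps is not None and steps >= max_scroll_steps:
            break
        position = min(position + viewport_height, page_height - viewport_height)
        positions.append(position)
        steps += 1
    return positions


print(scroll_positions(5000, 1000))
print(scroll_positions(5000, 1000, max_scroll_steps=2))
```

A cap like this matters most on pages whose height keeps growing as content loads, where an uncapped loop might never reach the bottom.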
UncleCode
3048cc1ff9 feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
2025-06-03 23:27:12 +08:00
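The discover → filter → crawl idea from this commit can be approximated in plain Python. This is not the AsyncUrlSeeder API: `bm25_scores`, `seed`, the candidate records, and the raw (unnormalized) threshold are all assumptions for illustration, and the scorer is a small hand-written BM25 rather than the optional rank_bm25 dependency.

```python
import math
from fnmatch import fnmatchcase


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula.

    Assumes `documents` is a non-empty list of non-empty strings.
    """
    tokenized = [doc.lower().split() for doc in documents]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in terms:
            df = sum(1 for other in tokenized if term in other)
            idf = math.log((len(tokenized) - df + 0.5) / (df + 0.5) + 1)
            tf = tokens.count(term)
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def seed(candidates, pattern, query, threshold):
    """Filter candidates by wildcard pattern, then rank by BM25 on their titles."""
    matched = [c for c in candidates if fnmatchcase(c["url"], pattern)]
    if not matched:
        return []
    scores = bm25_scores(query, [c["title"] for c in matched])
    ranked = sorted(zip(scores, matched), key=lambda pair: pair[0], reverse=True)
    return [c["url"] for score, c in ranked if score >= threshold]


candidates = [
    {"url": "https://example.com/blog/python-async-guide", "title": "python async crawling guide"},
    {"url": "https://example.com/blog/cooking", "title": "weekend cooking recipes"},
    {"url": "https://example.com/shop/python-book", "title": "python book"},
]
urls = seed(candidates, "*/blog/*", "python crawling", threshold=0.1)
print(urls)
```

The resulting list of URLs is the kind of input that the commit says can be handed to arun_many(), so only pages that passed the filter are crawled.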
João Martins
58c1e17170 Merge branch 'main' into fix-raw-url-parsing 2025-05-30 13:03:25 +01:00
UncleCode
7db6b468d9 feat(markdown): add content source selection for markdown generation
Adds a new content_source parameter to MarkdownGenerationStrategy that allows
selecting which HTML content to use for markdown generation:
- cleaned_html (default): uses post-processed HTML
- raw_html: uses original webpage HTML
- fit_html: uses preprocessed HTML for schema extraction

Changes include:
- Added content_source parameter to MarkdownGenerationStrategy
- Updated AsyncWebCrawler to handle HTML source selection
- Added examples and tests for the new feature
- Updated documentation with new parameter details

BREAKING CHANGE: Renamed cleaned_html parameter to input_html in generate_markdown()
method signature to better reflect its generalized purpose
2025-04-17 20:13:53 +08:00
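The source selection described in this commit amounts to a lookup keyed by `content_source`. The sketch below assumes a hypothetical helper `select_html`; only the three option names and the `cleaned_html` default come from the commit message.

```python
def select_html(raw_html, cleaned_html, fit_html, content_source="cleaned_html"):
    """Pick the HTML variant that markdown generation should consume."""
    sources = {
        "cleaned_html": cleaned_html,  # default: post-processed HTML
        "raw_html": raw_html,          # original webpage HTML
        "fit_html": fit_html,          # preprocessed HTML for schema extraction
    }
    if content_source not in sources:
        raise ValueError(f"unknown content_source: {content_source!r}")
    return sources[content_source]


chosen = select_html("<html>raw</html>", "<p>clean</p>", "<p>fit</p>", content_source="raw_html")
print(chosen)
```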
UncleCode
230f22da86 refactor(proxy): move ProxyConfig to async_configs and improve LLM token handling
Moved ProxyConfig class from proxy_strategy.py to async_configs.py for better organization.
Improved LLM token handling with new PROVIDER_MODELS_PREFIXES.
Added test cases for deep crawling and proxy rotation.
Removed docker_config from BrowserConfig as it's handled separately.

BREAKING CHANGE: ProxyConfig import path changed from crawl4ai.proxy_strategy to crawl4ai
2025-04-15 22:27:18 +08:00
unclecode
66ac07b4f3 feat(crawler): add network request and console message capturing
Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.
2025-04-10 16:03:48 +08:00
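The listener-based capture described above can be sketched without a browser. `FakePage` is a hypothetical event emitter, not Playwright's page object, and `attach_capture` is an illustrative helper; the flag and field names mirror those in the commit message, while the `event_type` key is an assumption.

```python
from collections import defaultdict


class FakePage:
    """Tiny event emitter standing in for a browser page (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        for handler in self._handlers[event]:
            handler(payload)


def attach_capture(page, capture_network_requests=False, capture_console_messages=False):
    """Register listeners only for the capture flags that are enabled."""
    captured = {"network_requests": [], "console_messages": []}
    if capture_network_requests:
        page.on("request", lambda r: captured["network_requests"].append({"event_type": "request", **r}))
        page.on("response", lambda r: captured["network_requests"].append({"event_type": "response", **r}))
    if capture_console_messages:
        page.on("console", lambda m: captured["console_messages"].append(m))
    return captured


page = FakePage()
captured = attach_capture(page, capture_network_requests=True)
page.emit("request", {"url": "https://example.com/api"})
page.emit("response", {"url": "https://example.com/api", "status": 200})
page.emit("console", {"type": "log", "text": "ignored: console capture is off"})
print(captured)
```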