ntohidi
a03e68fa2f
feat: Add URL-specific crawler configurations for multi-URL crawling
...
Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.
Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default
Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching
This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript
Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00
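The matching rules described in the commit above can be sketched roughly as follows. This is a standalone illustration and not crawl4ai's actual code: `UrlRule` is a hypothetical stand-in for `CrawlerRunConfig`, the standard library's `fnmatchcase` is assumed as the glob matcher, and the behaviour of a rule without a matcher is a guess. Only the names `url_matcher`, `match_mode`, `MatchMode`, `is_match` and `select_config` come from the commit message.

```python
from dataclasses import dataclass
from enum import Enum
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

Matcher = Union[str, Callable[[str], bool]]


class MatchMode(Enum):
    OR = "or"
    AND = "and"


@dataclass
class UrlRule:
    """Hypothetical stand-in for a config carrying url_matcher/match_mode."""
    name: str
    url_matcher: Optional[Union[Matcher, List[Matcher]]] = None
    match_mode: MatchMode = MatchMode.OR

    def is_match(self, url: str) -> bool:
        # Assumption: a rule without a matcher never matches on its own.
        if self.url_matcher is None:
            return False
        matchers = (
            self.url_matcher
            if isinstance(self.url_matcher, list)
            else [self.url_matcher]
        )
        # Strings are treated as glob patterns, callables as predicates.
        results = [m(url) if callable(m) else fnmatchcase(url, m) for m in matchers]
        return all(results) if self.match_mode is MatchMode.AND else any(results)


def select_config(url: str, rules: List[UrlRule], default: UrlRule) -> UrlRule:
    # First matching rule wins; otherwise fall back to the default.
    for rule in rules:
        if rule.is_match(url):
            return rule
    return default


DEFAULT = UrlRule("default")
RULES = [
    UrlRule("pdf", "*.pdf"),
    UrlRule("blog", ["*/blog/*", lambda u: "medium.com" in u]),
    UrlRule("api", ["*api*", lambda u: u.startswith("https://")], MatchMode.AND),
]

if __name__ == "__main__":
    for u in ("https://example.com/docs/report.pdf", "http://api.example.com/v1"):
        print(u, "->", select_config(u, RULES, DEFAULT).name)
```

Because the first match wins, the order of the list matters: more specific rules would normally go before broader ones.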
Nasrin
864d87afb2
Merge pull request #1339 from charlaie/fix-sitemap-redirect
...
Fix: URL Seeder sitemap redirect
2025-07-31 15:21:03 +08:00
Charlie C
508b6fc233
fix: Enable following redirects in sitemap fetching for seeder
2025-07-31 12:06:10 +08:00
Emmanuel Ferdman
8e3c411a3e
Merge branch 'main' into main
2025-07-29 14:05:35 +03:00
UncleCode
e3281935bc
fix: Add write permissions for GitHub release creation
2025-07-25 18:22:45 +08:00
UncleCode
48647300b4
chore: Bump version to 0.7.2
v0.7.2
2025-07-25 17:42:48 +08:00
UncleCode
9f9ea3bb3b
chore: Clean up test artifacts and disable test workflow
2025-07-25 17:31:52 +08:00
UncleCode
d58b93c207
fix: Re-enable multi-platform Docker builds for ARM64 support
2025-07-25 16:38:11 +08:00
UncleCode
e2b4705010
fix: Use hardcoded Docker repository name to avoid masking issues
2025-07-25 15:52:26 +08:00
UncleCode
4a1abd5086
fix: Handle existing version on Test PyPI gracefully
2025-07-25 15:41:16 +08:00
UncleCode
04258cd4f2
fix: Speed up Docker test builds by using single platform and caching
2025-07-25 15:37:44 +08:00
UncleCode
84e462d9f8
Merge remote-tracking branch 'origin/develop'
2025-07-25 15:35:53 +08:00
UncleCode
9546773a07
fix: Move sentence-transformers to optional dependencies
...
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
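The optional-dependency handling mentioned in the commit above might look something like this. It is a minimal sketch under stated assumptions: `require_optional` is a hypothetical helper name, and the real error text in crawl4ai may differ. Only the install command `pip install 'crawl4ai[transformer]'` is taken from the commit message.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str = "crawl4ai") -> ModuleType:
    """Import an optional module, or fail with an actionable install hint.

    Hypothetical helper: the name and message wording are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is needed for this feature but is not installed. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc


if __name__ == "__main__":
    try:
        require_optional("sentence_transformers", "transformer")
    except ImportError as err:
        print(err)
```

Deferring the import to the point of use keeps the heavy package out of the default install while still telling users exactly how to opt in.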
UncleCode
66a979ad11
fix: Install dependencies before version check in workflows
2025-07-24 21:01:36 +08:00
UncleCode
0c31e91b53
feat: Add CI/CD workflows for automated PyPI and Docker releases
2025-07-24 20:58:43 +08:00
ntohidi
1b6a31f88f
fix: encode PDF results to base64 in /crawl endpoint. ref #1301
2025-07-23 13:52:18 +02:00
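Raw PDF bytes cannot be placed in a JSON response directly, which is presumably why the commit above encodes them. A minimal sketch of the idea, assuming the result is a plain dict with a `pdf` field holding bytes (both the field name and the helper name `encode_pdf_for_json` are assumptions, not the endpoint's actual code):

```python
import base64
import json
from typing import Any, Dict


def encode_pdf_for_json(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `result` whose binary 'pdf' field is a base64 string."""
    encoded = dict(result)  # shallow copy; the caller's dict is left untouched
    pdf = encoded.get("pdf")
    if isinstance(pdf, (bytes, bytearray)):
        encoded["pdf"] = base64.b64encode(bytes(pdf)).decode("ascii")
    return encoded


if __name__ == "__main__":
    payload = encode_pdf_for_json({"url": "https://example.com/a.pdf", "pdf": b"%PDF-1.7"})
    print(json.dumps(payload))
```

A client would reverse this with `base64.b64decode(payload["pdf"])` to get the original bytes back.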
Nasrin
b8c261780f
Merge pull request #1319 from volumetric/fix_for_bug_#1310
...
Removed the incorrect reference in browser_config variable
2025-07-23 12:45:12 +02:00
ntohidi
db6ad7a79d
fix: update links in README and C4A-Script documentation for accuracy
2025-07-23 09:47:18 +02:00
Nasrin
004d514f33
Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl
...
Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874
2025-07-23 09:40:33 +02:00
Vinit Agrawal
3a9e2c716e
Removed the incorrect reference in browser_config variable
2025-07-18 10:01:00 +05:30
unclecode
0163bd797c
Merge branch 'release/v0.7.1'
v0.7.1
2025-07-17 17:42:04 +08:00
ntohidi
26bad799e4
chore: update version to 0.7.1
2025-07-17 11:37:41 +02:00
ntohidi
cf8badfe27
feat: cleanup unused code and enhance documentation for v0.7.1
...
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
unclecode
805c498adf
docs: add simple anti-bot examples
...
- Add simple_anti_bot_examples.py with minimal code examples
- Demonstrates stealth mode, undetected browser, and combined usage
- Clean examples without logging for easy reference
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 17:05:35 +08:00
unclecode
6a728cbe5b
feat: add stealth mode and enhance undetected browser support
...
- Add playwright-stealth integration with enable_stealth parameter in BrowserConfig
- Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern
- Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching
- Update install.py to install both playwright and patchright browsers automatically
- Add comprehensive documentation for anti-bot features (stealth mode + undetected browser)
- Create examples demonstrating stealth mode usage and comparison tests
- Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies
- Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium)
- Add dependency checker tool in tests/check_dependencies.py
Breaking changes: None - all existing functionality preserved
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 16:59:10 +08:00
ntohidi
ccbe3c105c
refactor: improve link scoring output format in release notes
2025-07-17 09:13:20 +02:00
Nasrin
761c19d54b
Merge pull request #1307 from unclecode/fix/json-infinity-serialization
...
fix: Handle infinity values in JSON serialization for API responses
2025-07-16 13:34:25 +02:00
Nasrin
14b0ecb137
Merge pull request #1305 from unclecode/fix/release-notes-demo-code
...
Fix: Update release notes and demo code
2025-07-16 13:33:53 +02:00
ntohidi
0eaa9f9895
fix: handle infinity values in JSON serialization for API responses
...
- Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings
- Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf
- Fix /crawl endpoint batch responses with infinity values
- Fix /crawl/stream endpoint streaming responses with infinity values
- Fix /crawl/job endpoint background job responses with infinity values
The sanitize_json_data() function recursively processes response data:
- float('inf') → "Infinity"
- float('-inf') → "-Infinity"
- float('nan') → "NaN"
This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON.
Fixes: API endpoints crashing with infinity JSON serialization errors
Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints
2025-07-15 13:49:07 +02:00
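The recursive conversion described in the commit above can be sketched as follows. The function name `sanitize_json_data` and the three mappings come from the commit message; the body is an assumed reimplementation for illustration, not the project's code. The strict `allow_nan=False` call is used here to reproduce the class of error quoted in the commit.

```python
import json
import math
from typing import Any


def sanitize_json_data(data: Any) -> Any:
    """Recursively replace inf/-inf/NaN floats with JSON-safe strings."""
    if isinstance(data, float):
        if math.isnan(data):
            return "NaN"
        if math.isinf(data):
            return "Infinity" if data > 0 else "-Infinity"
        return data
    if isinstance(data, dict):
        return {key: sanitize_json_data(value) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        # Tuples become lists, which is what JSON serialization does anyway.
        return [sanitize_json_data(item) for item in data]
    return data


if __name__ == "__main__":
    raw = {"score": float("inf"), "values": [1.5, float("-inf"), float("nan")]}
    try:
        json.dumps(raw, allow_nan=False)  # strict mode rejects inf/NaN
    except ValueError as err:
        print("unsanitized:", err)
    print("sanitized:", json.dumps(sanitize_json_data(raw), allow_nan=False))
```

One trade-off of this approach is that consumers receive strings where they might expect numbers, so clients need to handle "Infinity", "-Infinity" and "NaN" explicitly.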
ntohidi
1d1970ae69
docs: Update release notes and docs for v0.7.0 with the correct parameters and explanations
2025-07-15 11:32:04 +02:00
ntohidi
205df1e330
docs: Fix virtual scroll configuration
2025-07-15 10:29:47 +02:00
ntohidi
2640dc73a5
docs: Enhance session management example for dynamic content crawling with improved JavaScript handling and extraction schema. ref #226
2025-07-15 10:19:29 +02:00
ntohidi
58024755c5
docs: Update adaptive crawling parameters and examples in README and release notes
2025-07-15 10:15:05 +02:00
unclecode
5c33cbcca2
feat: add undetected browser support with adapter pattern
2025-07-14 17:29:50 +08:00
UncleCode
dd5ee752cf
docs: Add missing documentation pages to mkdocs.yml
...
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:58:26 +08:00
UncleCode
bde1bba6a2
docs: Add missing documentation pages to mkdocs.yml
...
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:56:33 +08:00
UncleCode
7b80eb6b99
docs: Add missing documentation pages to mkdocs.yml
...
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:55:35 +08:00
UncleCode
14f690d751
docs: Update documentation for v0.7.0 release
...
- Update mkdocs.yml site name to v0.7.x
- Add v0.7.0 to blog index as latest release
- Move v0.6.0 to Previous Releases section
- Copy release notes to proper location in docs/md_v2/blog/releases/
2025-07-12 19:08:17 +08:00
UncleCode
7b9ba3015f
Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update
v0.7.0
2025-07-12 18:54:20 +08:00
UncleCode
0c8bb742b7
Release v0.7.0-r1: The Adaptive Intelligence Update
...
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder
Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
2025-07-12 18:51:13 +08:00
UncleCode
ba2ed53ff1
test(releases): Add test cases for release 0.7.0
2025-07-11 22:27:18 +08:00
UncleCode
a93efcb650
Merge PR #1285: 2025 APR, MAY, and JUN bug fixes
2025-07-11 21:22:34 +08:00
UncleCode
8794852a26
Merge PR #1285: 2025 APR, MAY, and JUN bug fixes
2025-07-11 21:22:03 +08:00
UncleCode
fb25a4a769
docs(examples): update crawl4ai showcase script
...
The crawl4ai showcase script has been significantly expanded to include more detailed examples and demonstrations. This includes live code examples, more detailed explanations, and a new real-world example. A new file, uv.lock, has also been added.
2025-07-11 20:55:37 +08:00
ntohidi
afe852935e
fix: show /llm API response in playground. ref #1288
2025-07-09 16:59:17 +02:00
ntohidi
0ebce590f8
Merge branch '2025-JUN-1' into next-MAY
2025-07-09 09:41:03 +02:00
ntohidi
026e96a2df
feat: Add social media and community links to README and index documentation
2025-07-08 15:48:40 +02:00
ntohidi
36429a63de
fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105
2025-07-08 12:54:33 +02:00
ntohidi
a3d41c7951
fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086
2025-07-08 12:24:33 +02:00
ntohidi
fee4c5c783
fix: Consolidate import statements in local-files.md for clarity
2025-07-08 11:46:24 +02:00