crawl4ai

Author	SHA1	Message	Date
Soham Kukreti	18ad3ef159	fix: Implement base tag support in link extraction (#1147) - Extract base href from <head><base> tag using XPath in _process_element method - Use base URL as the primary URL for link normalization when present - Add error handling with logging for malformed or problematic base tags - Maintain backward compatibility when no base tag is present - Add test to verify the functionality of the base tag extraction.	2025-08-08 20:11:57 +05:30
AHMET YILMAZ	0541b61405	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-08 11:18:34 +08:00
AHMET YILMAZ	b61b2ee676	feat(browser-profiler): implement cross-platform keyboard listeners and improve quit handling	2025-08-08 11:18:34 +08:00
AHMET YILMAZ	89cf5aba2b	#1057: enhance ProxyConfig initialization to support dict and string formats	2025-08-06 18:34:58 +08:00
ntohidi	6b0b5301ba	Release v0.7.3: - Updated version to 0.7.3 - Added release notes - Updated documentation	2025-08-06 17:52:01 +08:00
Nezar Ali	7a8190ecb6	Fix examples in README.md	2025-08-06 11:58:29 +03:00
Nasrin	6735c68288	Merge pull request #1170 from prokopis3/fix/create-profile fix(browser_profiler): cross-platform 'q' to quit - create profile	2025-08-06 16:29:14 +08:00
Nasrin	64f37792a7	Merge pull request #1170 from prokopis3/fix/create-profile fix(browser_profiler): cross-platform 'q' to quit - create profile	2025-08-06 16:29:14 +08:00
ntohidi	a5bcac4c9d	feat(docs): enhance table data access example with a real url	2025-08-06 15:19:37 +08:00
Nasrin	45d8327d23	Merge pull request #1366 from unclecode/fix/update-tables-documentation docs: Update README.md and modify Media and Tables Documentation.(#1271)	2025-08-06 15:15:24 +08:00
ntohidi	437395e490	Merge branch 'feat/undetected-browser' into develop-future	2025-08-06 15:03:30 +08:00
Soham Kukreti	fddae303fb	docs: Update README.md and modify Media and Tables Documentation. (#1271) - Update Table-to-DataFrame Extraction example in README.md - Replace old method of accessing tables via result.media directly with result.tables in the documentation - Remove tables section from links & media page. - Add tables section to crawler result page.	2025-08-05 23:29:19 +05:30
ntohidi	ff6ea41ac3	feat(docker): add flexible LLM provider configuration - Support LLM_PROVIDER env var to override default provider (openai/gpt-4o-mini) - Add optional 'provider' parameter to API endpoints for per-request overrides - Implement provider validation to ensure API keys exist - Update documentation and examples with new configuration options Closes the need to hardcode providers in config.yml	2025-08-05 14:09:54 +08:00
ntohidi	31a435fb0e	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-04 19:12:19 +08:00
Nasrin	5de6a28055	Merge pull request #1361 from unclecode/fix/crawler-result-docs Update CrawlResult documentation with missing fields	2025-08-04 19:12:09 +08:00
ntohidi	de1561ad14	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-08-04 19:04:50 +08:00
Nasrin	337b588732	Merge pull request #1358 from shonenada/patch-1 Fix typos in examples.md	2025-08-04 19:04:42 +08:00
ntohidi	7a6ad547f0	Squashed commit of the following: commit 2def6524cdacb69c72760bf55a41089257c0bb07 Author: ntohidi <nasrin@kidocode.com> Date: Mon Aug 4 18:59:10 2025 +0800 refactor: consolidate WebScrapingStrategy to use LXML implementation only BREAKING CHANGE: None - full backward compatibility maintained This commit simplifies the content scraping architecture by removing the redundant BeautifulSoup-based WebScrapingStrategy implementation and making it an alias for LXMLWebScrapingStrategy. Changes: - Remove ~1000 lines of BeautifulSoup-based WebScrapingStrategy code - Make WebScrapingStrategy an alias for LXMLWebScrapingStrategy - Update LXMLWebScrapingStrategy to inherit directly from ContentScrapingStrategy - Add required methods (scrap, ascrap, process_element, _log) to LXMLWebScrapingStrategy - Maintain 100% backward compatibility - existing code continues to work Code changes: - crawl4ai/content_scraping_strategy.py: Remove WebScrapingStrategy class, add alias - crawl4ai/async_configs.py: Remove WebScrapingStrategy from imports - crawl4ai/__init__.py: Update imports to show alias relationship - crawl4ai/types.py: Update type definitions - crawl4ai/legacy/web_crawler.py: Update import to use alias - tests/async/test_content_scraper_strategy.py: Update to use LXMLWebScrapingStrategy - docs/examples/scraping_strategies_performance.py: Update to use single strategy Documentation updates: - docs/md_v2/core/content-selection.md: Update scraping modes section - docs/md_v2/migration/webscraping-strategy-migration.md: Add migration guide - CHANGELOG.md: Document the refactoring under [Unreleased] Benefits: - 10-20x faster HTML parsing for large documents - Reduced memory usage and simplified codebase - Consistent parsing behavior - No migration required for existing users All existing code using WebScrapingStrategy continues to work without modification, while benefiting from LXML's superior performance.	2025-08-04 19:02:01 +08:00
Soham Kukreti	e6692b987d	docs: Update CrawlResult documentation with missing fields. - Add missing fields: fit_html, js_execution_result, redirected_url, network_requests, console_messages, tables	2025-08-04 15:43:40 +05:30
ntohidi	307fe28b32	fix: Correct URL matcher fallback behavior and improve memory monitoring Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring. Bug fixes: - Change select_config() to return None when no config matches instead of using first config - Add proper error handling in dispatchers when no config matches a URL - Return failed CrawlResult with "No matching configuration found" error message - Fix is_match() to return True when url_matcher is None (matches all URLs) - Import and use get_true_memory_usage_percent() for more accurate memory monitoring Behavior clarification: - CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing) - This is the intended behavior for default/fallback configurations - Enables clean pattern: specific configs first, default config last Documentation updates: - Clarify that configs without url_matcher match everything - Explain "No matching configuration found" error when no default config - Add examples showing proper default config usage - Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md - Simplify API config examples by removing extraction_strategy Demo and test updates: - Update demo_multi_config_clean.py with commented default config to show behavior - Change example URL to w3schools.com to demonstrate no-match scenario - Uncomment all test URLs in test_multi_config.py for comprehensive testing Breaking changes: None - this restores the intended behavior This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.	2025-08-03 16:50:54 +08:00
Yaoda Liu	438a103b17	Fix typos in examples.md	2025-08-03 14:33:10 +08:00
ntohidi	a03e68fa2f	feat: Add URL-specific crawler configurations for multi-URL crawling Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns. Key additions: - Add url_matcher and match_mode parameters to CrawlerRunConfig - Implement is_match() method supporting string patterns, functions, and mixed lists - Add MatchMode enum for OR/AND logic when combining multiple matchers - Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig] - Add select_config() method to dispatchers for runtime config selection - First matching config wins, with fallback to default Pattern matching supports: - Glob-style strings: .pdf, /blog/, api* - Lambda functions: lambda url: 'github.com' in url - Mixed patterns with AND/OR logic for complex matching This enables optimal per-URL configuration: - PDFs: Use PDFContentScrapingStrategy without JavaScript - Blogs: Apply content filtering to reduce noise - APIs: Skip JavaScript, use JSON extraction - Dynamic sites: Execute only necessary JavaScript Breaking changes: None - fully backward compatible	2025-08-02 19:10:36 +08:00
Nasrin	864d87afb2	Merge pull request #1339 from charlaie/fix-sitemap-redirect Fix: URL Seeder sitemap redirect	2025-07-31 15:21:03 +08:00
Charlie C	508b6fc233	fix: Enable following redirects in sitemap fetching for seeder	2025-07-31 12:06:10 +08:00
Emmanuel Ferdman	8e3c411a3e	Merge branch 'main' into main	2025-07-29 14:05:35 +03:00
UncleCode	e3281935bc	fix: Add write permissions for GitHub release creation	2025-07-25 18:22:45 +08:00
UncleCode	48647300b4	chore: Bump version to 0.7.2 v0.7.2	2025-07-25 17:42:48 +08:00
UncleCode	9f9ea3bb3b	chore: Clean up test artifacts and disable test workflow	2025-07-25 17:31:52 +08:00
UncleCode	d58b93c207	fix: Re-enable multi-platform Docker builds for ARM64 support	2025-07-25 16:38:11 +08:00
UncleCode	e2b4705010	fix: Use hardcoded Docker repository name to avoid masking issues	2025-07-25 15:52:26 +08:00
UncleCode	4a1abd5086	fix: Handle existing version on Test PyPI gracefully	2025-07-25 15:41:16 +08:00
UncleCode	04258cd4f2	fix: Speed up Docker test builds by using single platform and caching	2025-07-25 15:37:44 +08:00
UncleCode	84e462d9f8	Merge remote-tracking branch 'origin/develop'	2025-07-25 15:35:53 +08:00
UncleCode	9546773a07	fix: Move sentence-transformers to optional dependencies - Moved sentence-transformers from core to optional dependencies in pyproject.toml - Removed sentence-transformers from requirements.txt - Added proper ImportError handling with helpful installation message - This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default - Users who need embedding features can install with: pip install 'crawl4ai[transformer]'	2025-07-24 21:24:40 +08:00
UncleCode	66a979ad11	fix: Install dependencies before version check in workflows	2025-07-24 21:01:36 +08:00
UncleCode	0c31e91b53	feat: Add CI/CD workflows for automated PyPI and Docker releases	2025-07-24 20:58:43 +08:00
ntohidi	1b6a31f88f	fix: encode PDF results to base64 in /crawl endpoint. ref #1301	2025-07-23 13:52:18 +02:00
Nasrin	b8c261780f	Merge pull request #1319 from volumetric/fix_for_bug_#1310 Removed the incorrect reference in browser_config variable	2025-07-23 12:45:12 +02:00
ntohidi	db6ad7a79d	fix: update links in README and C4A-Script documentation for accuracy	2025-07-23 09:47:18 +02:00
Nasrin	004d514f33	Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874	2025-07-23 09:40:33 +02:00
Vinit Agrawal	3a9e2c716e	Removed the incorrect reference in browser_config variable	2025-07-18 10:01:00 +05:30
unclecode	0163bd797c	Merge branch 'release/v0.7.1' v0.7.1	2025-07-17 17:42:04 +08:00
ntohidi	26bad799e4	chore: update version to 0.7.1	2025-07-17 11:37:41 +02:00
ntohidi	cf8badfe27	feat: cleanup unused code and enhance documentation for v0.7.1 - Remove unused StealthConfig from browser_manager.py - Update LinkPreviewConfig import path in __init__.py and examples - Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf')) - Remove sanitize_json_data functions from API endpoints - Add comprehensive C4A Script documentation to release notes - Update v0.7.0 release notes with improved code examples - Create v0.7.1 release notes focusing on cleanup and documentation improvements - Update demo files with corrected import paths and examples - Fix virtual scroll and adaptive crawling examples across documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 11:35:16 +02:00
unclecode	805c498adf	docs: add simple anti-bot examples - Add simple_anti_bot_examples.py with minimal code examples - Demonstrates stealth mode, undetected browser, and combined usage - Clean examples without logging for easy reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 17:05:35 +08:00
unclecode	6a728cbe5b	feat: add stealth mode and enhance undetected browser support - Add playwright-stealth integration with enable_stealth parameter in BrowserConfig - Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern - Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching - Update install.py to install both playwright and patchright browsers automatically - Add comprehensive documentation for anti-bot features (stealth mode + undetected browser) - Create examples demonstrating stealth mode usage and comparison tests - Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies - Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium) - Add dependency checker tool in tests/check_dependencies.py Breaking changes: None - all existing functionality preserved 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 16:59:10 +08:00
ntohidi	ccbe3c105c	refactor: improve link scoring output format in release notes	2025-07-17 09:13:20 +02:00
Nasrin	761c19d54b	Merge pull request #1307 from unclecode/fix/json-infinity-serialization fix: Handle infinity values in JSON serialization for API responses	2025-07-16 13:34:25 +02:00
Nasrin	14b0ecb137	Merge pull request #1305 from unclecode/fix/release-notes-demo-code Fix: Update release notes and demo code	2025-07-16 13:33:53 +02:00
ntohidi	0eaa9f9895	fix: handle infinity values in JSON serialization for API responses - Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings - Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf - Fix /crawl endpoint batch responses with infinity values - Fix /crawl/stream endpoint streaming responses with infinity values - Fix /crawl/job endpoint background job responses with infinity values The sanitize_json_data() function recursively processes response data: - float('inf') → "Infinity" - float('-inf') → "-Infinity" - float('nan') → "NaN" This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON. Fixes: API endpoints crashing with infinity JSON serialization errors Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints	2025-07-15 13:49:07 +02:00
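
The recursive mapping described in commit 0eaa9f9895 (non-finite floats converted to JSON-compliant strings) can be sketched as below. This is a minimal illustration written from the commit message alone, not the project's actual code: only the function name and the three mappings come from the log, and the handling of dicts, lists, and tuples is an assumption. Note that the later commit cf8badfe27 reports removing the sanitize_json_data functions again and using 0 instead of float('inf') at the source.

```python
import json
import math
from typing import Any


def sanitize_json_data(data: Any) -> Any:
    """Recursively replace inf/-inf/NaN floats with JSON-compliant strings.

    Sketch only: container handling (dict, list, tuple) is an assumption.
    """
    if isinstance(data, float):
        if math.isnan(data):
            return "NaN"
        if math.isinf(data):
            return "Infinity" if data > 0 else "-Infinity"
        return data
    if isinstance(data, dict):
        return {key: sanitize_json_data(value) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        # Tuples become lists, which is what JSON encoding would do anyway.
        return [sanitize_json_data(item) for item in data]
    return data


# Hypothetical payload resembling a response that contains non-finite scores.
payload = {"score": float("inf"), "values": [1.5, float("-inf"), float("nan")]}
clean = sanitize_json_data(payload)

# allow_nan=False makes json.dumps raise ValueError on any remaining
# non-finite float, so a successful call shows the payload is strict JSON.
encoded = json.dumps(clean, allow_nan=False)
```

Passing the unsanitized payload to `json.dumps(..., allow_nan=False)` raises the same kind of ValueError the commit message quotes for the /execute_js endpoint.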

1204 Commits
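
The selection rules described in commits a03e68fa2f and 307fe28b32 (first matching config wins, a config without a matcher matches every URL, and no match yields None rather than falling back to the first config) can be illustrated with a small stand-alone sketch. It deliberately does not import crawl4ai: `SketchConfig`, its `require_all` flag, and the `*.pdf` and `*/blog/*` patterns are hypothetical stand-ins for the real CrawlerRunConfig, MatchMode, and pattern syntax, and glob matching via `fnmatchcase` is an assumption.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

Matcher = Union[str, Callable[[str], bool]]


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a per-URL crawler configuration."""

    name: str
    url_matcher: Optional[Union[Matcher, List[Matcher]]] = None
    require_all: bool = False  # False: any matcher may match (OR); True: all must (AND)

    def is_match(self, url: str) -> bool:
        # No matcher means "match every URL", i.e. a default/fallback config.
        if self.url_matcher is None:
            return True
        matchers = (
            self.url_matcher
            if isinstance(self.url_matcher, list)
            else [self.url_matcher]
        )
        results = [
            m(url) if callable(m) else fnmatchcase(url, m) for m in matchers
        ]
        return all(results) if self.require_all else any(results)


def select_config(url: str, configs: List[SketchConfig]) -> Optional[SketchConfig]:
    """Return the first config that matches, or None when nothing matches."""
    for config in configs:
        if config.is_match(url):
            return config
    return None


# Specific configs first, default config last.
configs = [
    SketchConfig("pdf", url_matcher="*.pdf"),
    SketchConfig("blog", url_matcher=["*/blog/*", lambda u: "example.com" in u],
                 require_all=True),
    SketchConfig("default"),
]

chosen = select_config("https://example.com/blog/post-1", configs)
unmatched = select_config("https://example.com/about", configs[:2])
```

With the default config present every URL resolves to some config; dropping it leaves `unmatched` as None, which corresponds to the "No matching configuration found" case the fix commit describes.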