crawl4ai

Author	SHA1	Message	Date
Vinit Agrawal	3a9e2c716e	Removed the incorrect reference in browser_config variable	2025-07-18 10:01:00 +05:30
unclecode	0163bd797c	Merge branch 'release/v0.7.1' v0.7.1	2025-07-17 17:42:04 +08:00
ntohidi	26bad799e4	chore: update version to 0.7.1	2025-07-17 11:37:41 +02:00
ntohidi	cf8badfe27	feat: cleanup unused code and enhance documentation for v0.7.1 - Remove unused StealthConfig from browser_manager.py - Update LinkPreviewConfig import path in __init__.py and examples - Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf')) - Remove sanitize_json_data functions from API endpoints - Add comprehensive C4A Script documentation to release notes - Update v0.7.0 release notes with improved code examples - Create v0.7.1 release notes focusing on cleanup and documentation improvements - Update demo files with corrected import paths and examples - Fix virtual scroll and adaptive crawling examples across documentation 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 11:35:16 +02:00
unclecode	805c498adf	docs: add simple anti-bot examples - Add simple_anti_bot_examples.py with minimal code examples - Demonstrates stealth mode, undetected browser, and combined usage - Clean examples without logging for easy reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 17:05:35 +08:00
unclecode	6a728cbe5b	feat: add stealth mode and enhance undetected browser support - Add playwright-stealth integration with enable_stealth parameter in BrowserConfig - Merge undetected browser strategy into main async_crawler_strategy.py using adapter pattern - Add browser adapters (BrowserAdapter, PlaywrightAdapter, UndetectedAdapter) for flexible browser switching - Update install.py to install both playwright and patchright browsers automatically - Add comprehensive documentation for anti-bot features (stealth mode + undetected browser) - Create examples demonstrating stealth mode usage and comparison tests - Update pyproject.toml and requirements.txt with patchright>=1.49.0 and other dependencies - Remove duplicate/unused dependencies (alphashape, cssselect, pyperclip, shapely, selenium) - Add dependency checker tool in tests/check_dependencies.py Breaking changes: None - all existing functionality preserved 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-07-17 16:59:10 +08:00
ntohidi	ccbe3c105c	refactor: improve link scoring output format in release notes	2025-07-17 09:13:20 +02:00
Nasrin	761c19d54b	Merge pull request #1307 from unclecode/fix/json-infinity-serialization fix: Handle infinity values in JSON serialization for API responses	2025-07-16 13:34:25 +02:00
Nasrin	14b0ecb137	Merge pull request #1305 from unclecode/fix/release-notes-demo-code Fix: Update release notes and demo code	2025-07-16 13:33:53 +02:00
ntohidi	0eaa9f9895	fix: handle infinity values in JSON serialization for API responses - Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings - Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf - Fix /crawl endpoint batch responses with infinity values - Fix /crawl/stream endpoint streaming responses with infinity values - Fix /crawl/job endpoint background job responses with infinity values The sanitize_json_data() function recursively processes response data: - float('inf') → "Infinity" - float('-inf') → "-Infinity" - float('nan') → "NaN" This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON. Fixes: API endpoints crashing with infinity JSON serialization errors Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints	2025-07-15 13:49:07 +02:00
ntohidi	1d1970ae69	docs: Update release notes and docs for v0.7.0 with the correct parameters and explanations	2025-07-15 11:32:04 +02:00
ntohidi	205df1e330	docs: Fix virtual scroll configuration	2025-07-15 10:29:47 +02:00
ntohidi	2640dc73a5	docs: Enhance session management example for dynamic content crawling with improved JavaScript handling and extraction schema. ref #226	2025-07-15 10:19:29 +02:00
ntohidi	58024755c5	docs: Update adaptive crawling parameters and examples in README and release notes	2025-07-15 10:15:05 +02:00
unclecode	5c33cbcca2	feat: add undetected browser support with adapter pattern	2025-07-14 17:29:50 +08:00
UncleCode	dd5ee752cf	docs: Add missing documentation pages to mkdocs.yml - Added Adaptive Crawling to Core section - Added URL Seeding to Core section - Added Adaptive Strategies to Advanced section	2025-07-12 19:58:26 +08:00
UncleCode	bde1bba6a2	docs: Add missing documentation pages to mkdocs.yml - Added Adaptive Crawling to Core section - Added URL Seeding to Core section - Added Adaptive Strategies to Advanced section	2025-07-12 19:56:33 +08:00
UncleCode	7b80eb6b99	docs: Add missing documentation pages to mkdocs.yml - Added Adaptive Crawling to Core section - Added URL Seeding to Core section - Added Adaptive Strategies to Advanced section	2025-07-12 19:55:35 +08:00
UncleCode	14f690d751	docs: Update documentation for v0.7.0 release - Update mkdocs.yml site name to v0.7.x - Add v0.7.0 to blog index as latest release - Move v0.6.0 to Previous Releases section - Copy release notes to proper location in docs/md_v2/blog/releases/	2025-07-12 19:08:17 +08:00
UncleCode	7b9ba3015f	Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update v0.7.0	2025-07-12 18:54:20 +08:00
UncleCode	0c8bb742b7	Release v0.7.0-r1: The Adaptive Intelligence Update - Bump version to 0.7.0 - Add release notes and demo files - Update README with v0.7.0 features - Update Docker configurations for v0.7.0-r1 - Move v0.7.0 demo files to releases_review - Fix BM25 scoring bug in URLSeeder Major features: - Adaptive Crawling with pattern learning - Virtual Scroll support for infinite pages - Link Preview with 3-layer scoring - Async URL Seeder for massive discovery - Performance optimizations	2025-07-12 18:51:13 +08:00
UncleCode	ba2ed53ff1	test(releases): Add test cases for release 0.7.0	2025-07-11 22:27:18 +08:00
UncleCode	a93efcb650	Merge PR #1285 : 2025 APR, MAY, and JUN bug fixes	2025-07-11 21:22:34 +08:00
UncleCode	8794852a26	Merge PR #1285 : 2025 APR, MAY, and JUN bug fixes	2025-07-11 21:22:03 +08:00
UncleCode	fb25a4a769	docs(examples): update crawl4ai showcase script The crawl4ai showcase script has been significantly expanded to include more detailed examples and demonstrations. This includes live code examples, more detailed explanations, and a new real-world example. A new file, uv.lock, has also been added.	2025-07-11 20:55:37 +08:00
ntohidi	afe852935e	fix: show /llm API response in playground. ref #1288	2025-07-09 16:59:17 +02:00
ntohidi	0ebce590f8	Merge branch '2025-JUN-1' into next-MAY	2025-07-09 09:41:03 +02:00
ntohidi	026e96a2df	feat: Add social media and community links to README and index documentation	2025-07-08 15:48:40 +02:00
ntohidi	36429a63de	fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105	2025-07-08 12:54:33 +02:00
ntohidi	a3d41c7951	fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086	2025-07-08 12:24:33 +02:00
ntohidi	fee4c5c783	fix: Consolidate import statements in local-files.md for clarity	2025-07-08 11:46:24 +02:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
UncleCode	1a73fb60db	feat(crawl4ai): Implement adaptive crawling feature This commit introduces the adaptive crawling feature to the crawl4ai project. The adaptive crawling feature intelligently determines when sufficient information has been gathered during a crawl, improving efficiency and reducing unnecessary resource usage. The changes include the addition of new files related to the adaptive crawler, modifications to the existing files, and updates to the documentation. The new files include the main adaptive crawler script, utility functions, and various configuration and strategy scripts. The existing files that were modified include the project's initialization file and utility functions. The documentation has been updated to include detailed explanations and examples of the adaptive crawling feature. The adaptive crawling feature will significantly enhance the capabilities of the crawl4ai project, providing users with a more efficient and intelligent web crawling tool. Significant modifications: - Added adaptive_crawler.py and related scripts - Modified __init__.py and utils.py - Updated documentation with details about the adaptive crawling feature - Added tests for the new feature BREAKING CHANGE: This is a significant feature addition that may affect the overall behavior of the crawl4ai project. Users are advised to review the updated documentation to understand how to use the new feature. Refs: #123, #456	2025-07-04 15:16:53 +08:00
UncleCode	74705c1f67	Move release scripts to private .scripts folder - Remove release-agent.py, build-nightly.py from public repo - Add .scripts/ to .gitignore for private tools - Maintain clean public repository while keeping internal tools	2025-07-04 15:02:25 +08:00
UncleCode	048d9b0f5b	feat: Implement nightly build script and update version handling	2025-07-03 20:53:03 +08:00
ntohidi	ee25c771d8	feat(cli): add deep crawling options with configurable strategies and max pages. ref #874	2025-07-02 14:07:23 +02:00
UncleCode	a353515271	feat: Add virtual scroll support for modern web scraping Add comprehensive virtual scroll handling to capture all content from pages that use DOM recycling techniques (Twitter, Instagram, etc). Key features: - New VirtualScrollConfig class for configuring virtual scroll behavior - Automatic detection of three scrolling scenarios: no change, content appended, content replaced - Intelligent HTML chunk capture and merging with deduplication - 100% content capture from virtual scroll pages - Seamless integration with existing extraction strategies - JavaScript-based detection and capture for performance - Tree-based DOM merging with text-based deduplication Documentation: - Comprehensive guide at docs/md_v2/advanced/virtual-scroll.md - API reference updates in parameters.md and page-interaction.md - Blog article explaining the solution and techniques - Complete examples with local test server Testing: - Full test suite achieving 100% capture of 1000 items - Examples for Twitter timeline, Instagram grid scenarios - Local test server with different scrolling behaviors This enables scraping of modern websites that were previously impossible to fully capture with traditional scrolling techniques.	2025-06-29 20:41:37 +08:00
UncleCode	539a324cf6	refactor(link_extractor): remove link_extractor and rename to link_preview This change removes the link_extractor module and renames it to link_preview, streamlining the codebase. The removal of 395 lines of code reduces complexity and improves maintainability. Other files have been updated to reflect this change, ensuring consistency across the project. BREAKING CHANGE: The link_extractor module has been deleted and replaced with link_preview. Update imports accordingly.	2025-06-27 21:54:22 +08:00
UncleCode	5c9c305dbf	feat: Add advanced link head extraction with three-layer scoring system (#1 ) Squashed commit from feature/link-extractor branch implementing comprehensive link analysis: - Extract HTML head content from discovered links with parallel processing - Three-layer scoring: Intrinsic (URL quality), Contextual (BM25), and Total scores - New LinkExtractionConfig class for type-safe configuration - Pattern-based filtering for internal/external links - Comprehensive documentation and examples	2025-06-27 20:06:04 +08:00
Aravind	02f3127ded	Track Stargazers (#1249) * Webhook for when repo is starred * Send star data to google sheets to be saved * change event name to watch * Change message displayed on Discord * Update .github/workflows/main.yml Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: UncleCode <unclecode@kidocode.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2025-06-25 22:26:19 +08:00
UncleCode	e528086341	test(async_assistant): add new tests for extract pipeline Introduced two new test files to enhance coverage for the extract pipeline functionality. The tests aim to validate the behavior of the pipeline under various scenarios, ensuring robustness and reliability. No breaking changes. Closes issue #123.	2025-06-23 10:44:27 +08:00
ntohidi	414f16e975	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:05:44 +02:00
ntohidi	b7a6e02236	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:04:32 +02:00
AHMET YILMAZ	9332326457	feat: Add PDF parsing documentation and navigation entry	2025-06-16 18:18:32 +08:00
ntohidi	6cd34b3157	Merge branch '2025-MAY-2' of https://github.com/unclecode/crawl4ai into 2025-MAY-2	2025-06-13 11:26:17 +02:00
ntohidi	871d4f1158	fix(extraction_strategy): rename response variable to content for clarity in LLMExtractionStrategy. ref #1146	2025-06-13 11:26:05 +02:00
prokopis3	c4d625fb3c	chore(profile-test): fix filename typo ( test_crteate_profile.py → test_create_profile.py ) - Rename file to correct spelling - No content changes	2025-06-12 14:38:32 +03:00
prokopis3	ef722766f0	fix(browser_profiler): improve keyboard input handling - fix handling of special keys in Windows msvcrt implementation - Guard against UnicodeDecodeError from multi-byte key sequences - Filter out non-printable characters and control sequences - Add error handling to prevent coroutine crashes - Add unit test to verify keyboard input handling Key changes: - Safe UTF-8 decoding with try/except for special keys - Skip non-printable and multi-byte character sequences - Add broad exception handling in keyboard listener Test runs on Windows only due to msvcrt dependency.	2025-06-12 14:33:12 +03:00
ntohidi	dc85481180	refactor: Update LLM extraction example with the updated structure	2025-06-12 12:23:03 +02:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
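
Note on commit 0eaa9f9895: the message describes a sanitize_json_data() function that recursively converts infinity and NaN floats to JSON-compliant strings. The sketch below is one way such a function could look. It is not the project's actual code (which commit cf8badfe27 later removed); the body, the handling of tuples, and the sample payload are assumptions made for illustration.

```python
import json
import math


def sanitize_json_data(data):
    """Recursively replace non-finite floats with JSON-safe strings.

    Hypothetical reconstruction based only on the mapping in the commit
    message: inf -> "Infinity", -inf -> "-Infinity", nan -> "NaN".
    """
    if isinstance(data, float):
        if math.isnan(data):
            return "NaN"
        if math.isinf(data):
            return "Infinity" if data > 0 else "-Infinity"
        return data
    if isinstance(data, dict):
        return {key: sanitize_json_data(value) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        # Assumption: tuples are emitted as lists, as json.dumps would do anyway.
        return [sanitize_json_data(item) for item in data]
    return data


# Illustrative payload; the field names are made up for this example.
payload = {"score": float("inf"), "values": [1.5, float("-inf"), float("nan")]}
clean = sanitize_json_data(payload)

# allow_nan=False makes json.dumps raise ValueError on any remaining
# non-finite float, so a successful dump confirms the data is strict JSON.
print(json.dumps(clean, allow_nan=False))
```

Without the sanitizing step, json.dumps(payload, allow_nan=False) raises the "Out of range float values are not JSON compliant" ValueError quoted in the commit message.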

1114 Commits