crawl4ai

Author	SHA1	Message	Date
UncleCode	c85f56b085	Merge pull request #1677 from unclecode/sponsors/thor_data	2025-12-25 12:08:21 +08:00
Aravind Karnam	a234959b12	sponsors: Add thor data as sponsor	2025-12-23 20:45:00 +05:30
Aravind Karnam	da82f0ada5	sponsors: Add thor data as sponsor	2025-12-23 16:28:26 +05:30
Nasrin	a87e8c1c9e	Release/v0.7.8 (#1662) * Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html * Fix #1181: Preserve whitespace in code blocks during HTML scraping The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant. * Refactor Pydantic model configuration to use ConfigDict for arbitrary types * Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 * Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 * fix: ensure BrowserConfig.to_dict serializes proxy_config * feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides * reproduced AttributeError from #1642 * pass timeout parameter to docker client request * added missing deep crawling objects to init * generalized query in ContentRelevanceFilter to be a str or list * import modules from enhanceable deserialization * parameterized tests * Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 * refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 * announcement: add application form for cloud API closed beta * Release v0.7.8: Stability & Bug Fix Release - Updated version to 0.7.8 - Introduced focused stability release addressing 11 community-reported bugs. - Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates. - Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended. - Updated documentation to reflect recent changes and improvements. * docs: add section for Crawl4AI Cloud API closed beta with application link * fix: add disk cleanup step to Docker workflow --------- Co-authored-by: rbushria <rbushri@gmail.com> Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com> Co-authored-by: Soham Kukreti <kukretisoham@gmail.com> Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com> Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>	2025-12-11 11:04:52 +01:00
UncleCode	835e3c56fe	Add disk cleanup step in Docker release workflow Added a step to free up disk space before the build process.	2025-12-11 09:49:27 +01:00
Aravind	3a07c5962c	Sponsors/new (#1643)	2025-12-02 00:49:39 +01:00
Aravind	0024c82cdc	Sponsors/new (#1637)	2025-11-24 13:29:33 +01:00
Aravind	f68e7531e3	Sponsors/scrapeless (#1619)	2025-11-17 07:44:52 +01:00
UncleCode	cb637fb5c4	Merge pull request #1613 from unclecode/release/v0.7.7	2025-11-16 12:26:54 +01:00
ntohidi	6244f56f36	Release v0.7.7 - Updated version to 0.7.7 - Added comprehensive demo and release notes - Updated all documentation v0.7.7 docker-rebuild-v0.7.7	2025-11-14 10:23:31 +01:00
ntohidi	2c973b1183	Merge branch 'develop' into release/v0.7.7	2025-11-13 14:54:05 +01:00
Nasrin	f3146de969	Merge pull request #1609 from unclecode/fix/update-config-documentation Update browser and crawler run config documentation to match async_configs.py implementation	2025-11-13 21:52:53 +08:00
Soham Kukreti	d6b6d11a2d	docs: update browser and crawler run config documentation to match async_configs.py implementation Updated browser-crawler-config.md and parameters.md to ensure complete accuracy with the actual BrowserConfig and CrawlerRunConfig implementations. Changes: - Removed non-existent parameters from documentation: * enable_rate_limiting, rate_limit_config (never implemented) * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher) * display_mode (doesn't exist) - Added missing BrowserConfig parameters (14 total): * browser_mode, use_managed_browser, cdp_url, debugging_port, host * viewport, chrome_channel, channel * accept_downloads, downloads_path, storage_state, sleep_on_close * user_agent_mode, user_agent_generator_config, enable_stealth - Added missing CrawlerRunConfig parameters (29 total): * chunking_strategy, keep_attrs, parser_type, scraping_strategy * proxy_config, proxy_rotation_strategy * locale, timezone_id, geolocation, fetch_ssl_certificate * shared_data, wait_for_timeout * c4a_script, max_scroll_steps * exclude_all_images, table_score_threshold, table_extraction * exclude_internal_links, score_links * capture_network_requests, capture_console_messages * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental - Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write) - Reorganized parameters into logical sections (Content Processing, Browser Location & Identity, Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features) - Ensured all parameter descriptions match source code docstrings - Added proper default values from __init__ signatures	2025-11-13 14:54:16 +05:30
ntohidi	b58579548c	Bump version to 0.7.7 for stable release	2025-11-13 09:52:18 +01:00
Nasrin	466be69e72	Merge pull request #1607 from unclecode/fix/dfs_deep_crawling Fix/dfs deep crawling	2025-11-13 16:43:47 +08:00
AHMET YILMAZ	ceade853c3	Enhance DFSDeepCrawlStrategy documentation for clarity and detail	2025-11-13 16:39:08 +08:00
ntohidi	998c809e08	Rename folder name for NSTProxy integration examples for crawl4ai	2025-11-13 09:36:39 +01:00
ntohidi	d0fb53540d	Update proxy-security documentation	2025-11-13 09:23:44 +01:00
Nasrin	8116b15b63	Merge pull request #1596 from unclecode/docs-proxy-security #1591 enhance proxy configuration with security, SSL analysis, and rotation examples	2025-11-13 16:22:28 +08:00
AHMET YILMAZ	fe353c4e27	Refactor proxy configuration documentation for clarity and consistency	2025-11-13 11:20:24 +08:00
ntohidi	89cc29fe44	Merge branch 'fix/docker' into develop	2025-11-12 17:06:31 +01:00
Nasrin	cdcb8836b7	Merge pull request #1605 from Nstproxy/feat/nstproxy feat: Add Nstproxy Proxies	2025-11-12 23:56:14 +08:00
Nasrin	b207ae2848	Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing Add CDP endpoint verification with exponential backoff for managed browsers	2025-11-12 23:53:57 +08:00
Nasrin	be00fc3a42	Merge pull request #1598 from unclecode/fix/sitemap_seeder #1559 :Add tests for sitemap parsing and URL normalization in AsyncUr…	2025-11-12 18:09:34 +08:00
Nasrin	124ac583bb	Merge pull request #1599 from unclecode/docs-llm-strategies-update #1551 : Fix casing and variable name consistency for LLMConfig in doc…	2025-11-12 17:54:26 +08:00
AHMET YILMAZ	1bd3de6a47	#1510 : Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking	2025-11-12 17:44:43 +08:00
nstproxy	80452166c8	feat: Add Nstproxy Proxies	2025-11-12 16:25:39 +08:00
UncleCode	a99cd37c0e	Merge pull request #1597 from unclecode/sponsors/capsolver	2025-11-11 14:50:44 +08:00
AHMET YILMAZ	2e8f8c9b49	#1551 : Fix casing and variable name consistency for LLMConfig in documentation	2025-11-10 15:38:14 +08:00
AHMET YILMAZ	80745bceb9	#1559 :Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder	2025-11-10 14:15:54 +08:00
Aravind Karnam	4bee230c37	docs: Add a tip for captcha solving usecases using a third party integration	2025-11-10 11:20:48 +05:30
Aravind	006e29f308	Merge pull request #1589 from capsolver/main Add some examples of using capsolver to solve captcha	2025-11-10 10:45:16 +05:30
AHMET YILMAZ	263ac890fd	#1591 : Enhance proxy configuration documentation with security features, SSL analysis, and improved examples	2025-11-10 11:42:07 +08:00
unclecode	1a22fb4d4f	docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system. Changes: - Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition - Updated mkdocs.yml navigation to "Self-Hosting Guide" - Completely rewrote introduction emphasizing self-hosting benefits: * Data privacy and ownership * Cost control and transparency * Performance and security advantages * Full customization capabilities - Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with: * Monitoring Dashboard section documenting the /monitor UI * Complete feature breakdown (system health, requests, browsers, janitor, errors) * Monitor API Endpoints with all REST endpoints and examples * WebSocket Streaming integration guide with Python examples * Control Actions for manual browser management * Production Integration patterns (Prometheus, custom dashboards, alerting) * Key production metrics to track - Enhanced summary section: * What users learned checklist * Why self-hosting matters * Clear next steps * Key resources with monitoring dashboard URL The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable. Users will understand they have complete operational visibility at http://localhost:11235/monitor with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs. This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level monitoring capabilities, not just a Docker deployment.	2025-11-09 13:31:52 +08:00
unclecode	81b5312629	Update gitignore	2025-11-09 10:49:42 +08:00
Nasrin	d56b0eb9a9	Merge pull request #1495 from unclecode/fix/viewport_in_managed_browser feat(ManagedBrowser): add viewport size configuration for browser launch	2025-11-06 18:42:45 +08:00
Nasrin	66175e132b	Merge pull request #1590 from unclecode/fix/async-llm-extraction-arunMany This commit resolves issue #1055 where LLM extraction was blocking async	2025-11-06 18:40:42 +08:00
ntohidi	a30548a98f	This commit resolves issue #1055 where LLM extraction was blocking async execution, causing URLs to be processed sequentially instead of in parallel. Changes: - Added aperform_completion_with_backoff() using litellm.acompletion for async LLM calls - Implemented arun() method in ExtractionStrategy base class with thread pool fallback - Created async arun() and aextract() methods in LLMExtractionStrategy using asyncio.gather - Updated AsyncWebCrawler.arun() to detect and use arun() when available - Added comprehensive test suite to verify parallel execution Impact: - LLM extraction now runs truly in parallel across multiple URLs - Significant performance improvement for multi-URL crawls with LLM strategies - Backward compatible - existing extraction strategies continue to work - No breaking changes to public API Technical details: - Uses litellm.acompletion for non-blocking LLM calls - Leverages asyncio.gather for concurrent chunk processing - Maintains backward compatibility via asyncio.to_thread fallback - Works seamlessly with MemoryAdaptiveDispatcher and other dispatchers	2025-11-06 11:22:45 +01:00
CapSolver	2ae9899eac	Clarify CapSolver integration instructions Updated text for clarity and capitalization.	2025-11-06 15:49:30 +08:00
CapSolver	57aeb70f00	Add CapSolver Captcha Solver	2025-11-06 15:37:31 +08:00
Nasrin	2c918155aa	Merge pull request #1529 from unclecode/fix/remove_overlay_elements Fix remove_overlay_elements functionality by calling injected JS function.	2025-11-06 00:10:32 +08:00
Nasrin	854694ef33	Merge pull request #1537 from unclecode/fix/docker-compose-llm-env fix(docker): Remove environment variable overrides in docker-compose.yml	2025-11-06 00:07:51 +08:00
Nasrin	6534ece026	Merge pull request #1532 from unclecode/fix/update-documentation Standardize C4A-Script tutorial, add CLI identity-based crawling, and add sponsorship CTA	2025-11-05 23:37:05 +08:00
Nasrin	89e28d4eee	Merge pull request #1558 from unclecode/claude/fix-update-pyopenssl-security-011CUPexU25DkNvoxfu5ZrnB Claude/fix update pyopenssl security 011 cu pex u25 dk nvoxfu5 zrn b	2025-10-28 17:09:11 +08:00
ntohidi	c0f1865287	feat(api): update marketplace version and build date in root endpoint response	2025-10-26 11:35:39 +01:00
ntohidi	46ef1116c4	fix(app-detail): enhance tab functionality, hide documentation and support tabs in marketplace	2025-10-26 11:21:29 +01:00
Nasrin	4df83893ac	Merge pull request #1560 from unclecode/fix/marketplace Fix/marketplace	2025-10-23 22:17:06 +08:00
ntohidi	13e116610d	fix(marketplace): improve app detail page content rendering and UX Fixed multiple issues with app detail page content display and formatting	2025-10-23 16:12:30 +02:00
Claude	613097d121	test: add verification tests for pyOpenSSL security update - Add lightweight security test to verify version requirements - Add comprehensive integration test for crawl4ai functionality - Tests verify pyOpenSSL >= 25.3.0 and cryptography >= 45.0.7 - All tests passing: security vulnerability is resolved Related to #1545 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 06:57:25 +00:00
Claude	44ef0682b0	fix: update pyOpenSSL to >=25.3.0 to address security vulnerability - Updates pyOpenSSL from >=24.3.0 to >=25.3.0 - This resolves CVE affecting cryptography package versions >=37.0.0 & <43.0.1 - pyOpenSSL 25.3.0 requires cryptography>=45.0.7, which is above the vulnerable range - Fixes issue #1545 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-10-23 06:51:25 +00:00
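
Commit a30548a98f above describes preferring an async arun() method when an extraction strategy provides one, falling back to asyncio.to_thread for blocking strategies, and using asyncio.gather to process several URLs in parallel. The following is a minimal sketch of that general pattern only, not the crawl4ai implementation: the classes SyncOnlyStrategy and AsyncStrategy and the helpers extract and extract_many are hypothetical names invented for illustration, and the sleeps stand in for LLM calls.

```python
import asyncio
import time


class SyncOnlyStrategy:
    """Hypothetical strategy exposing only a blocking run() method."""

    def run(self, url: str) -> str:
        time.sleep(0.1)  # stands in for a blocking LLM call
        return f"sync:{url}"


class AsyncStrategy:
    """Hypothetical strategy exposing a native coroutine arun() method."""

    async def arun(self, url: str) -> str:
        await asyncio.sleep(0.1)  # stands in for a non-blocking LLM call
        return f"async:{url}"


async def extract(strategy, url: str) -> str:
    # Prefer the coroutine when the strategy provides one.
    arun = getattr(strategy, "arun", None)
    if arun is not None:
        return await arun(url)
    # Otherwise run the blocking method in a worker thread so the
    # event loop stays free to make progress on other URLs.
    return await asyncio.to_thread(strategy.run, url)


async def extract_many(strategy, urls: list[str]) -> list[str]:
    # gather() runs all extractions concurrently and preserves input order.
    return await asyncio.gather(*(extract(strategy, u) for u in urls))


def demo() -> tuple[list[str], list[str]]:
    urls = ["https://example.com/a", "https://example.com/b"]
    return (
        asyncio.run(extract_many(AsyncStrategy(), urls)),
        asyncio.run(extract_many(SyncOnlyStrategy(), urls)),
    )
```

Calling demo() runs both hypothetical strategies over the same two URLs. In either case the two extractions overlap rather than run back to back, which is the behaviour the commit message attributes to the fix.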


1214 Commits
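
Two entries in this log mention exponential backoff: commit b207ae2848 (CDP endpoint verification for managed browsers) and the v0.7.8 release, which adds backoff delay/attempt/factor fields to LLMConfig. The sketch below shows how those three knobs typically interact in a retry loop. It is an illustration under assumptions, not crawl4ai code: the function retry_with_backoff, its parameter names, and its default values are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    base_delay: float = 2.0,   # hypothetical default: first wait in seconds
    max_attempts: int = 3,     # hypothetical default: total tries
    factor: float = 2.0,       # hypothetical default: delay multiplier
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # Re-raise the last failure once the attempt budget is spent.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= factor  # grow the wait geometrically
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the delay schedule can be inspected without actually waiting, for example by passing a list's append method.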