ntohidi
48c31c4cb9
Release v0.7.8: Stability & Bug Fix Release
...
- Updated version to 0.7.8
- Introduced a focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
2025-12-08 15:42:29 +01:00
Nasrin
5a8fb57795
Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter
...
[Fix]: Docker server does not decode ContentRelevanceFilter
2025-12-03 18:36:07 +08:00
ntohidi
df4d87ed78
refactor: replace PyPDF2 with pypdf across the codebase. ref #1412
2025-12-03 10:59:18 +01:00
Nasrin
f32cfc6db0
Merge pull request #1645 from unclecode/fix/configurable-backoff
...
Make LLM backoff configurable end-to-end
2025-12-02 21:07:49 +08:00
Nasrin
d06c39e8ab
Merge pull request #1641 from unclecode/fix/serialize-proxy-config
...
Fix BrowserConfig proxy_config serialization
2025-12-02 21:06:02 +08:00
ntohidi
afc31e144a
Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop
2025-12-02 13:01:11 +01:00
ntohidi
07ccf13be6
Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268
2025-12-02 13:00:54 +01:00
Chris Murphy
6893094f58
parameterized tests
2025-12-01 16:19:19 -05:00
Chris Murphy
3a8f8298d3
import modules from enhanceable deserialization
2025-12-01 16:18:59 -05:00
Chris Murphy
e95e8e1a97
generalized query in ContentRelevanceFilter to be a str or list
2025-12-01 16:16:31 -05:00
Chris Murphy
eb76df2c0d
added missing deep crawling objects to init
2025-12-01 16:15:58 -05:00
Chris Murphy
6ec6bc4d8a
pass timeout parameter to docker client request
2025-12-01 16:15:27 -05:00
Chris Murphy
33a3cc3933
reproduced AttributeError from #1642
2025-12-01 11:31:07 -05:00
Soham Kukreti
7a133e22cc
feat: make LLM backoff configurable end-to-end
...
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
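The delay/attempt/factor knobs described in commit 7a133e22cc above correspond to a standard exponential-backoff loop. The following is a minimal sketch of that general technique, not the project's implementation: `BackoffSettings`, its field names, and `call_with_backoff` are hypothetical names chosen for illustration and are not taken from `LLMConfig` or `perform_completion_with_backoff`.

```python
import time
from dataclasses import dataclass


@dataclass
class BackoffSettings:
    """Hypothetical container; field names are illustrative, not the LLMConfig API."""
    base_delay: float = 2.0  # seconds to wait before the first retry
    max_attempts: int = 3    # total number of calls, including the first one
    factor: float = 2.0      # multiplier applied to the delay after each failure


def call_with_backoff(fn, settings, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    The wait between attempts grows geometrically: base_delay,
    base_delay * factor, base_delay * factor**2, and so on.
    """
    delay = settings.base_delay
    for attempt in range(1, settings.max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == settings.max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= settings.factor
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets a test record the waits instead of actually sleeping.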
Nasrin
dcb77c94bf
Merge pull request #1623 from unclecode/fix/deprecated_pydantic
...
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
2025-11-27 20:05:42 +08:00
Soham Kukreti
a0c5f0f79a
fix: ensure BrowserConfig.to_dict serializes proxy_config
2025-11-26 17:44:06 +05:30
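Commit a0c5f0f79a above does not say how the fix works. One common pattern for this kind of bug is to convert a nested config object to a plain dict inside the parent's `to_dict`, so the result can be passed to `json.dumps`. The sketch below illustrates that pattern under stated assumptions: `ProxySettings` and `BrowserSettings` are hypothetical stand-ins and do not reproduce the real `BrowserConfig` fields.

```python
import json


class ProxySettings:
    """Hypothetical stand-in for a nested proxy configuration object."""

    def __init__(self, server, username=None, password=None):
        self.server = server
        self.username = username
        self.password = password

    def to_dict(self):
        return {
            "server": self.server,
            "username": self.username,
            "password": self.password,
        }


class BrowserSettings:
    """Hypothetical stand-in for a browser configuration object."""

    def __init__(self, headless=True, proxy_config=None):
        self.headless = headless
        self.proxy_config = proxy_config

    def to_dict(self):
        proxy = self.proxy_config
        # Convert the nested object to a plain dict; pass None or an
        # already-plain dict through unchanged.
        if hasattr(proxy, "to_dict"):
            proxy = proxy.to_dict()
        return {"headless": self.headless, "proxy_config": proxy}


settings = BrowserSettings(proxy_config=ProxySettings("http://proxy.example:8080"))
payload = json.dumps(settings.to_dict())  # would raise TypeError if proxy_config stayed an object
```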
ntohidi
b36c6daa5c
Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638
2025-11-25 11:51:59 +01:00
Nasrin
94c8a833bf
Merge pull request #1447 from rbushri/fix/wrong_url_raw
...
Fix: Wrong URL variable used for extraction of raw html
2025-11-25 17:49:44 +08:00
ntohidi
84bfea8bd1
Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621
2025-11-25 10:46:00 +01:00
Aravind
0024c82cdc
Sponsors/new (#1637)
2025-11-24 13:29:33 +01:00
Rachel Bushrian
7771ed3894
Merge branch 'develop' into fix/wrong_url_raw
2025-11-24 13:54:07 +02:00
AHMET YILMAZ
eca04b0368
Refactor Pydantic model configuration to use ConfigDict for arbitrary types
2025-11-18 15:40:17 +08:00
ntohidi
c2c4d42be4
Fix #1181: Preserve whitespace in code blocks during HTML scraping
...
The remove_empty_elements_fast() method was removing whitespace-only
span elements inside <pre> and <code> tags, causing import statements
like "import torch" to become "importtorch". Now skips elements inside
code blocks where whitespace is significant.
2025-11-17 12:21:23 +01:00
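The behaviour described in commit c2c4d42be4 above (dropping empty elements everywhere except inside `<pre>` and `<code>`) can be illustrated with a small tree walk. This is a sketch of the idea using the standard library's `xml.etree.ElementTree` on well-formed markup, not the actual `remove_empty_elements_fast()` code. The helper names are hypothetical, and a real scraper would also need to exempt void elements such as `<br>` or `<img>`.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = {"pre", "code"}  # whitespace is significant inside these


def is_empty(el):
    """True for a childless element whose text is missing or whitespace-only."""
    return len(el) == 0 and not (el.text or "").strip()


def drop_child(parent, child):
    """Remove child but keep its tail text attached to the surrounding content."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def remove_empty_elements(el, in_code=False):
    """Recursively drop empty elements, skipping anything inside a code block."""
    in_code = in_code or el.tag in CODE_TAGS
    for child in list(el):
        remove_empty_elements(child, in_code)
        if not in_code and is_empty(child):
            drop_child(el, child)


root = ET.fromstring(
    "<div><p><span> </span>text</p>"
    "<pre><code><span>import</span><span> </span><span>torch</span></code></pre></div>"
)
remove_empty_elements(root)
code_text = "".join(root.find(".//code").itertext())  # "import torch", space kept
```

Without the `in_code` check, the whitespace-only `<span>` between the two tokens would be removed and the text would collapse to `importtorch`, which is the bug the commit message describes.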
Aravind
f68e7531e3
Sponsors/scrapeless (#1619)
2025-11-17 07:44:52 +01:00
UncleCode
cb637fb5c4
Merge pull request #1613 from unclecode/release/v0.7.7
2025-11-16 12:26:54 +01:00
ntohidi
6244f56f36
Release v0.7.7
...
- Updated version to 0.7.7
- Added comprehensive demo and release notes
- Updated all documentation
v0.7.7
docker-rebuild-v0.7.7
2025-11-14 10:23:31 +01:00
ntohidi
2c973b1183
Merge branch 'develop' into release/v0.7.7
2025-11-13 14:54:05 +01:00
Nasrin
f3146de969
Merge pull request #1609 from unclecode/fix/update-config-documentation
...
Update browser and crawler run config documentation to match async_configs.py implementation
2025-11-13 21:52:53 +08:00
Soham Kukreti
d6b6d11a2d
docs: update browser and crawler run config documentation to match async_configs.py implementation
...
Updated browser-crawler-config.md and parameters.md to ensure complete
accuracy with the actual BrowserConfig and CrawlerRunConfig implementations.
Changes:
- Removed non-existent parameters from documentation:
  * enable_rate_limiting, rate_limit_config (never implemented)
  * memory_threshold_percent, check_interval, max_session_permit (internal to AsyncDispatcher)
  * display_mode (doesn't exist)
- Added missing BrowserConfig parameters (14 total):
  * browser_mode, use_managed_browser, cdp_url, debugging_port, host
  * viewport, chrome_channel, channel
  * accept_downloads, downloads_path, storage_state, sleep_on_close
  * user_agent_mode, user_agent_generator_config, enable_stealth
- Added missing CrawlerRunConfig parameters (29 total):
  * chunking_strategy, keep_attrs, parser_type, scraping_strategy
  * proxy_config, proxy_rotation_strategy
  * locale, timezone_id, geolocation, fetch_ssl_certificate
  * shared_data, wait_for_timeout
  * c4a_script, max_scroll_steps
  * exclude_all_images, table_score_threshold, table_extraction
  * exclude_internal_links, score_links
  * capture_network_requests, capture_console_messages
  * method, stream, url, user_agent, user_agent_mode, user_agent_generator_config
  * deep_crawl_strategy, link_preview_config, url_matcher, match_mode, experimental
- Marked deprecated cache parameters (bypass_cache, disable_cache, no_cache_read, no_cache_write)
- Reorganized parameters into logical sections (Content Processing, Browser Location & Identity,
  Caching & Session, Page Navigation & Timing, Page Interaction, Media Handling, Link/Domain
  Handling, Debug & Logging, Connection & HTTP, Virtual Scroll, URL Matching, Advanced Features)
- Ensured all parameter descriptions match source code docstrings
- Added proper default values from __init__ signatures
2025-11-13 14:54:16 +05:30
ntohidi
b58579548c
Bump version to 0.7.7 for stable release
2025-11-13 09:52:18 +01:00
Nasrin
466be69e72
Merge pull request #1607 from unclecode/fix/dfs_deep_crawling
...
Fix/dfs deep crawling
2025-11-13 16:43:47 +08:00
AHMET YILMAZ
ceade853c3
Enhance DFSDeepCrawlStrategy documentation for clarity and detail
2025-11-13 16:39:08 +08:00
ntohidi
998c809e08
Rename folder name for NSTProxy integration examples for crawl4ai
2025-11-13 09:36:39 +01:00
ntohidi
d0fb53540d
Update proxy-security documentation
2025-11-13 09:23:44 +01:00
Nasrin
8116b15b63
Merge pull request #1596 from unclecode/docs-proxy-security
...
#1591 enhance proxy configuration with security, SSL analysis, and rotation examples
2025-11-13 16:22:28 +08:00
AHMET YILMAZ
fe353c4e27
Refactor proxy configuration documentation for clarity and consistency
2025-11-13 11:20:24 +08:00
ntohidi
89cc29fe44
Merge branch 'fix/docker' into develop
2025-11-12 17:06:31 +01:00
Nasrin
cdcb8836b7
Merge pull request #1605 from Nstproxy/feat/nstproxy
...
feat: Add Nstproxy Proxies
2025-11-12 23:56:14 +08:00
Nasrin
b207ae2848
Merge pull request #1528 from unclecode/fix/managed-browser-cdp-timing
...
Add CDP endpoint verification with exponential backoff for managed browsers
2025-11-12 23:53:57 +08:00
Nasrin
be00fc3a42
Merge pull request #1598 from unclecode/fix/sitemap_seeder
...
#1559: Add tests for sitemap parsing and URL normalization in AsyncUr…
2025-11-12 18:09:34 +08:00
Nasrin
124ac583bb
Merge pull request #1599 from unclecode/docs-llm-strategies-update
...
#1551: Fix casing and variable name consistency for LLMConfig in doc…
2025-11-12 17:54:26 +08:00
AHMET YILMAZ
1bd3de6a47
#1510: Add DFS deep crawler demonstration script and enhance DFS strategy with seen URL tracking
2025-11-12 17:44:43 +08:00
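The "seen URL tracking" mentioned in commit 1bd3de6a47 above is a standard guard against revisiting pages during a depth-first traversal. The following is a minimal sketch of that technique over an in-memory link graph. `dfs_crawl`, `get_links`, and `max_depth` are hypothetical names for illustration and do not describe the `DFSDeepCrawlStrategy` interface.

```python
def dfs_crawl(start, get_links, max_depth=2):
    """Depth-first traversal that visits each URL at most once.

    get_links(url) returns the outgoing links of a page. URLs are marked
    as seen at discovery time, so a page is never queued twice.
    """
    seen = {start}
    order = []
    stack = [(start, 0)]
    while stack:
        url, depth = stack.pop()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(get_links(url)):
            if link not in seen:
                seen.add(link)
                stack.append((link, depth + 1))
    return order


# Toy link graph with cycles (a <-> b, d -> a).
graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": ["a"]}
visited = dfs_crawl("a", lambda url: graph.get(url, []), max_depth=2)
```

With the toy graph, the cycles back to `a` are ignored because `a` is already in `seen`, so the traversal terminates even without the depth limit.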
nstproxy
80452166c8
feat: Add Nstproxy Proxies
2025-11-12 16:25:39 +08:00
UncleCode
a99cd37c0e
Merge pull request #1597 from unclecode/sponsors/capsolver
2025-11-11 14:50:44 +08:00
AHMET YILMAZ
2e8f8c9b49
#1551: Fix casing and variable name consistency for LLMConfig in documentation
2025-11-10 15:38:14 +08:00
AHMET YILMAZ
80745bceb9
#1559: Add tests for sitemap parsing and URL normalization in AsyncUrlSeeder
2025-11-10 14:15:54 +08:00
Aravind Karnam
4bee230c37
docs: Add a tip for captcha solving usecases using a third party integration
2025-11-10 11:20:48 +05:30
Aravind
006e29f308
Merge pull request #1589 from capsolver/main
...
Add some examples of using capsolver to solve captcha
2025-11-10 10:45:16 +05:30
AHMET YILMAZ
263ac890fd
#1591: Enhance proxy configuration documentation with security features, SSL analysis, and improved examples
2025-11-10 11:42:07 +08:00
unclecode
1a22fb4d4f
docs: rename Docker deployment to self-hosting guide with comprehensive monitoring documentation
...
Major documentation restructuring to emphasize self-hosting capabilities and fully document the real-time monitoring system.
Changes:
- Renamed docker-deployment.md → self-hosting.md to better reflect the value proposition
- Updated mkdocs.yml navigation to "Self-Hosting Guide"
- Completely rewrote introduction emphasizing self-hosting benefits:
  * Data privacy and ownership
  * Cost control and transparency
  * Performance and security advantages
  * Full customization capabilities
- Expanded "Metrics & Monitoring" → "Real-time Monitoring & Operations" with:
  * Monitoring Dashboard section documenting the /monitor UI
  * Complete feature breakdown (system health, requests, browsers, janitor, errors)
  * Monitor API Endpoints with all REST endpoints and examples
  * WebSocket Streaming integration guide with Python examples
  * Control Actions for manual browser management
  * Production Integration patterns (Prometheus, custom dashboards, alerting)
  * Key production metrics to track
- Enhanced summary section:
  * What users learned checklist
  * Why self-hosting matters
  * Clear next steps
  * Key resources with monitoring dashboard URL
The monitoring dashboard built 2-3 weeks ago is now fully documented and discoverable.
Users will understand they have complete operational visibility at http://localhost:11235/monitor
with real-time updates, browser pool management, and programmatic control via REST/WebSocket APIs.
This positions Crawl4AI as an enterprise-grade self-hosting solution with DevOps-level
monitoring capabilities, not just a Docker deployment.
2025-11-09 13:31:52 +08:00