From bdacf61ca91d046a9bfebe154c1f561055c303ef Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Thu, 28 Aug 2025 17:48:12 +0800
Subject: [PATCH] feat: update documentation for
 preserve_https_for_internal_links. ref #1410

---
 CHANGELOG.md                     | 10 ++++++++++
 docs/md_v2/api/parameters.md     |  1 +
 docs/md_v2/core/deep-crawling.md | 11 +++++++++++
 3 files changed, 22 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9788caf2..ce63516f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Added
+- **🔒 HTTPS Preservation for Internal Links**: New `preserve_https_for_internal_links` configuration flag
+  - Maintains HTTPS scheme for internal links even when servers redirect to HTTP
+  - Prevents security downgrades during deep crawling
+  - Useful for security-conscious crawling and sites supporting both protocols
+  - Fully backward compatible with opt-in flag (default: `False`)
+  - Fixes issue #1410 where HTTPS URLs were being downgraded to HTTP
+
 ## [0.7.3] - 2025-08-09
 
 ### Added
diff --git a/docs/md_v2/api/parameters.md b/docs/md_v2/api/parameters.md
index ba526fb7..47f719c8 100644
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -155,6 +155,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
 | **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
 | **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
 | **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
+| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
 
 Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
 
diff --git a/docs/md_v2/core/deep-crawling.md b/docs/md_v2/core/deep-crawling.md
index 00834787..93760f23 100644
--- a/docs/md_v2/core/deep-crawling.md
+++ b/docs/md_v2/core/deep-crawling.md
@@ -472,6 +472,17 @@ Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pag
 
 5.**Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
 
+6.**Preserve HTTPS for security.** If crawling HTTPS sites that redirect to HTTP, use `preserve_https_for_internal_links=True` to maintain secure connections:
+
+```python
+config = CrawlerRunConfig(
+    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
+    preserve_https_for_internal_links=True  # Keep HTTPS even if server redirects to HTTP
+)
+```
+
+This is especially useful for security-conscious crawling or when dealing with sites that support both protocols.
+
 ---
 
 ## 10. Summary & Next Steps
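Note for reviewers: the behavior this patch documents can be illustrated with plain standard-library code. The sketch below is not Crawl4AI's actual implementation — it is a minimal, hypothetical model of what `preserve_https_for_internal_links` is documented to do: when the seed URL is HTTPS, internal links that a server downgraded to `http://` are rewritten back to `https://`, while external links are left untouched.

```python
from urllib.parse import urlparse, urlunparse


def preserve_https(seed_url: str, discovered_url: str) -> str:
    """Re-upgrade an internal link to HTTPS if the seed was HTTPS.

    Illustrative sketch only -- not Crawl4AI's internal code. It mirrors
    the documented behavior of preserve_https_for_internal_links:
    internal (same-host) links downgraded to http:// keep https://.
    """
    seed = urlparse(seed_url)
    link = urlparse(discovered_url)
    # Only internal links (same host) are upgraded; external links pass through.
    if seed.scheme == "https" and link.scheme == "http" and link.netloc == seed.netloc:
        link = link._replace(scheme="https")
    return urlunparse(link)


# Internal link downgraded by a redirect: scheme is restored.
print(preserve_https("https://example.com", "http://example.com/page"))
# External link: left as served.
print(preserve_https("https://example.com", "http://other.com/page"))
```

With the flag left at its default (`False`), the downgraded scheme would be kept as served — which is exactly the backward-compatible behavior the changelog entry calls out.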