From bdacf61ca91d046a9bfebe154c1f561055c303ef Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Thu, 28 Aug 2025 17:48:12 +0800
Subject: [PATCH] feat: update documentation for
 preserve_https_for_internal_links. ref #1410

---
 CHANGELOG.md                     | 10 ++++++++++
 docs/md_v2/api/parameters.md     |  1 +
 docs/md_v2/core/deep-crawling.md | 11 +++++++++++
 3 files changed, 22 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9788caf2..ce63516f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,16 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Added
+- **🔒 HTTPS Preservation for Internal Links**: New `preserve_https_for_internal_links` configuration flag
+  - Maintains HTTPS scheme for internal links even when servers redirect to HTTP
+  - Prevents security downgrades during deep crawling
+  - Useful for security-conscious crawling and sites supporting both protocols
+  - Fully backward compatible with opt-in flag (default: `False`)
+  - Fixes issue #1410 where HTTPS URLs were being downgraded to HTTP
+
 ## [0.7.3] - 2025-08-09
 
 ### Added
diff --git a/docs/md_v2/api/parameters.md b/docs/md_v2/api/parameters.md
index ba526fb7..47f719c8 100644
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -155,6 +155,7 @@ If your page is a single-page app with repeated JS updates, set `js_only=True` i
 | **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
 | **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
 | **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
+| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
 
 Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
 
diff --git a/docs/md_v2/core/deep-crawling.md b/docs/md_v2/core/deep-crawling.md
index 00834787..93760f23 100644
--- a/docs/md_v2/core/deep-crawling.md
+++ b/docs/md_v2/core/deep-crawling.md
@@ -472,6 +472,17 @@ Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pag
 
 5.**Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
 
+6.**Preserve HTTPS for security.** If crawling HTTPS sites that redirect to HTTP, use `preserve_https_for_internal_links=True` to maintain secure connections:
+
+```python
+config = CrawlerRunConfig(
+    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2),
+    preserve_https_for_internal_links=True  # Keep HTTPS even if server redirects to HTTP
+)
+```
+
+This is especially useful for security-conscious crawling or when dealing with sites that support both protocols.
+
 ---
 
 ## 10. Summary & Next Steps
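Note for reviewers: the behavior this patch documents can be illustrated with plain standard-library code. The sketch below is not Crawl4AI's actual implementation — it is a minimal, hypothetical model of what `preserve_https_for_internal_links` is documented to do: when the seed URL is HTTPS, internal links that a server downgraded to `http://` are rewritten back to `https://`, while external links are left untouched.

```python
from urllib.parse import urlparse, urlunparse


def preserve_https(seed_url: str, discovered_url: str) -> str:
    """Re-upgrade an internal link to HTTPS if the seed was HTTPS.

    Illustrative sketch only -- not Crawl4AI's internal code. It mirrors
    the documented behavior of preserve_https_for_internal_links:
    internal (same-host) links downgraded to http:// keep https://.
    """
    seed = urlparse(seed_url)
    link = urlparse(discovered_url)
    # Only internal links (same host) are upgraded; external links pass through.
    if seed.scheme == "https" and link.scheme == "http" and link.netloc == seed.netloc:
        link = link._replace(scheme="https")
    return urlunparse(link)


# Internal link downgraded by a redirect: scheme is restored.
print(preserve_https("https://example.com", "http://example.com/page"))
# External link: left as served.
print(preserve_https("https://example.com", "http://other.com/page"))
```

With the flag left at its default (`False`), the downgraded scheme would be kept as served — which is exactly the backward-compatible behavior the changelog entry calls out.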