Fix redirect target verification in AsyncUrlSeeder and enhance tests

- Added `verify_redirect_targets` parameter to control redirect verification.
- Modified `_resolve_head()` to verify redirect targets based on the new parameter.
- Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.
Fix #1181: Preserve whitespace in code blocks during HTML scraping
2025-11-18 11:43:47 +08:00 · 2025-11-17 12:21:23 +01:00 · 2025-11-17 07:44:52 +01:00 · 2025-11-16 12:26:54 +01:00
4 changed files with 98 additions and 9 deletions
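The redirect handling described in the first commit message can be exercised without network access. The following is a minimal sketch, not the crawl4ai implementation: `FakeResponse`, `FakeClient`, `resolve_head`, `ROUTES`, `run`, and the `example.test` URLs are hypothetical names introduced for illustration, and the sketch omits the logging and exception handling that `_resolve_head()` performs.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response (not part of crawl4ai)."""
    url: str
    status_code: int
    headers: Dict[str, str] = field(default_factory=dict)


class FakeClient:
    """Hypothetical stub exposing only an async head() method."""

    def __init__(self, routes: Dict[str, FakeResponse]):
        self.routes = routes

    async def head(self, url: str, **kwargs) -> FakeResponse:
        # Unknown URLs answer 404, so the sketch never touches the network.
        return self.routes.get(url, FakeResponse(url, 404))


async def resolve_head(client, url: str,
                       verify_redirect_targets: bool = True) -> Optional[str]:
    """Simplified version of the logic in the diff, minus logging and try/except."""
    r = await client.head(url, timeout=10, follow_redirects=False)
    if 200 <= r.status_code < 300:          # direct 2xx hit
        return str(r.url)
    if r.status_code in REDIRECT_CODES:     # single-level redirect
        loc = r.headers.get("location")
        if not loc:
            return None
        target = urljoin(url, loc)
        if target == url:                   # self-redirect
            return None
        if not verify_redirect_targets:     # legacy behavior: trust the target
            return target
        r2 = await client.head(target, timeout=10, follow_redirects=False)
        return str(r2.url) if 200 <= r2.status_code < 300 else None
    return None


ROUTES = {
    "http://example.test/ok": FakeResponse("http://example.test/ok", 200),
    "http://example.test/moved": FakeResponse(
        "http://example.test/moved", 301, {"location": "/ok"}),
    "http://example.test/dead": FakeResponse(
        "http://example.test/dead", 302, {"location": "/gone"}),
}


def run(url: str, verify: bool = True) -> Optional[str]:
    """Resolve one URL against the stubbed routes."""
    return asyncio.run(resolve_head(FakeClient(ROUTES), url, verify))
```

With the stub, `run("http://example.test/dead")` yields `None` because the target answers 404, while `run("http://example.test/dead", verify=False)` returns the unverified target, which mirrors the two modes the commit describes. A stub like this could also make the new tests independent of live sites, though that is a suggestion rather than something the commit does.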
--- a/README.md
+++ b/README.md
@@ -1034,11 +1034,14 @@ Our enterprise sponsors and technology partners help scale Crawl4AI to power pro

 | Company | About | Sponsorship Tier |
 |------|------|----------------------------|
-| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥈 Silver |
+| <a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=crawl4ai" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.githubusercontent.com/aravindkarnam/0d275b942705604263e5c32d2db27bc1/raw/Scrapeless-light-logo.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"><img alt="Scrapeless" src="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"></picture></a>  | Scrapeless is the best full-stack web scraping toolkit offering Scraping API, Scraping Browser, Web Unlocker, Captcha Solver, and Proxies, designed to handle all your data collection needs. | 🥈 Silver |
+| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
 | <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="DataSync" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives.| 🥇 Gold |
 | <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 5–18, offering both online and on-campus education. | 🥇 Gold |
 | <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based  Aleph Null is Asia’s leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |

+
+
 ### 🧑‍🤝 Individual Sponsors

 A heartfelt thanks to our individual supporters! Every contribution helps us keep our opensource mission alive and thriving!
--- a/crawl4ai/async_url_seeder.py
+++ b/crawl4ai/async_url_seeder.py
@@ -166,6 +166,22 @@ class AsyncUrlSeeder:
    Async version of UrlSeeder.
    Call pattern is await/async for / async with.

+    Parameters
+    ----------
+    ttl : timedelta, default TTL
+        Time-to-live for cached results.
+    client : httpx.AsyncClient, optional
+        HTTP client to use. If None, creates a new one.
+    logger : AsyncLoggerBase, optional
+        Logger instance for logging messages.
+    base_directory : str or pathlib.Path, optional
+        Base directory for cache storage. Defaults to home directory.
+    cache_root : str or pathlib.Path, optional
+        Root directory for URL seeder cache. Defaults to ~/.cache/url_seeder.
+    verify_redirect_targets : bool, default True
+        Whether to verify that redirect targets are alive (2xx status) before returning them.
+        When False, returns redirect targets without verification (legacy behavior).
+
    Public coroutines
    -----------------
    await seed.urls(...)
@@ -203,6 +219,8 @@ class AsyncUrlSeeder:
        # NEW: Add base_directory
        base_directory: Optional[Union[str, pathlib.Path]] = None,
        cache_root: Optional[Union[str, Path]] = None,
+        # NEW: Control redirect target verification
+        verify_redirect_targets: bool = True,
    ):
        self.ttl = ttl
        self._owns_client = client is None  # Track if we created the client
@@ -227,6 +245,9 @@ class AsyncUrlSeeder:
            cache_root or "~/.cache/url_seeder"))
        (self.cache_root / "live").mkdir(parents=True, exist_ok=True)
        (self.cache_root / "head").mkdir(exist_ok=True)
+        
+        # Store redirect verification setting
+        self.verify_redirect_targets = verify_redirect_targets

    def _log(self, level: str, message: str, tag: str = "URL_SEED", **kwargs: Any):
        """Helper to log messages using the provided logger, if available."""
@@ -682,24 +703,47 @@ class AsyncUrlSeeder:

        Returns:
            * the same URL if it answers 2xx,
-            * the absolute redirect target if it answers 3xx,
+            * the absolute redirect target if it answers 3xx (and if verify_redirect_targets=True, only if target is alive/2xx),
            * None on any other status or network error.
        """
        try:
            r = await self.client.head(url, timeout=10, follow_redirects=False)
-
-            # direct hit
+            # direct 2xx hit
            if 200 <= r.status_code < 300:
                return str(r.url)
-
-            # single level redirect
+            # single-level redirect (3xx)
            if r.status_code in (301, 302, 303, 307, 308):
                loc = r.headers.get("location")
                if loc:
-                    return urljoin(url, loc)
-
+                    target = urljoin(url, loc)
+                    # Avoid infinite loop on self-redirect
+                    if target == url:
+                        return None
+                    
+                    # If not verifying redirect targets, return immediately (old behavior)
+                    if not self.verify_redirect_targets:
+                        return target
+                    
+                    # Verify redirect target is alive (new behavior)
+                    try:
+                        r2 = await self.client.head(target, timeout=10, follow_redirects=False)
+                        if 200 <= r2.status_code < 300:
+                            return str(r2.url)
+                        # Optionally, could handle another 3xx here for 2-step chains, but spec only says 1
+                        else:
+                            self._log(
+                                "debug",
+                                "HEAD redirect target {target} did not resolve: status {status}",
+                                params={"target": target, "status": r2.status_code},
+                                tag="URL_SEED",
+                            )
+                            return None
+                    except Exception as e2:
+                        self._log("debug", "HEAD {target} failed: {err}",
+                            params={"target": target, "err": str(e2)}, tag="URL_SEED")
+                        return None
+            # all other cases
            return None
-
        except Exception as e:
            self._log("debug", "HEAD {url} failed: {err}",
                      params={"url": url, "err": str(e)}, tag="URL_SEED")
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
@@ -542,6 +542,19 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
            if el.tag in bypass_tags:
                continue

+            # Skip elements inside <pre> or <code> tags where whitespace is significant
+            # This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
+            is_in_code_block = False
+            ancestor = el.getparent()
+            while ancestor is not None:
+                if ancestor.tag in ("pre", "code"):
+                    is_in_code_block = True
+                    break
+                ancestor = ancestor.getparent()
+
+            if is_in_code_block:
+                continue
+
            text_content = (el.text_content() or "").strip()
            if (
                len(text_content.split()) < word_count_threshold
--- a/tests/test_async_url_seeder.py
+++ b/tests/test_async_url_seeder.py
@@ -0,0 +1,29 @@
+import pytest
+import asyncio
+from crawl4ai.async_url_seeder import AsyncUrlSeeder
+
+@pytest.mark.asyncio
+async def test_resolve_head_handles_dead_redirects():
+    seeder = AsyncUrlSeeder()
+    # Should return None – redirects to a dead URL
+    assert await seeder._resolve_head("http://youtube.com/sitemap.xml") is None
+    assert await seeder._resolve_head("https://stripe.com/sitemap.xml") is None
+
+@pytest.mark.asyncio
+async def test_resolve_head_direct_hit():
+    seeder = AsyncUrlSeeder()
+    # Test with a known live URL, e.g., httpbin
+    result = await seeder._resolve_head("https://httpbin.org/status/200")
+    assert result == "https://httpbin.org/status/200"
+
+@pytest.mark.asyncio
+async def test_resolve_head_verify_redirect_targets_false():
+    # Test with verification disabled - should return redirect target without checking if alive
+    seeder = AsyncUrlSeeder(verify_redirect_targets=False)
+    # This should return the redirect target even if it's dead (old behavior)
+    result = await seeder._resolve_head("http://youtube.com/sitemap.xml")
+    # The exact redirect target might vary, but it should not be None
+    assert result is not None
+    assert isinstance(result, str)
+    # Should be different from the input URL (indicating redirect was followed)
+    assert result != "http://youtube.com/sitemap.xml"
Author	SHA1	Message	Date
AHMET YILMAZ	43a2088eb0	Fix redirect target verification in AsyncUrlSeeder and enhance tests - Added `verify_redirect_targets` parameter to control redirect verification. - Modified `_resolve_head()` to verify redirect targets based on the new parameter. - Implemented tests for both verification modes, ensuring dead redirects are filtered out and legacy behavior is preserved.	2025-11-18 11:43:47 +08:00
ntohidi	c2c4d42be4	Fix #1181: Preserve whitespace in code blocks during HTML scraping. The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant.	2025-11-17 12:21:23 +01:00
Aravind	f68e7531e3	Sponsors/scrapeless (#1619)	2025-11-17 07:44:52 +01:00
UncleCode	cb637fb5c4	Merge pull request #1613 from unclecode/release/v0.7.7	2025-11-16 12:26:54 +01:00
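
The whitespace fix from commit c2c4d42be4 can be illustrated with a small, self-contained sketch. It is not the crawl4ai code: it uses the standard library's `xml.etree.ElementTree` in place of lxml (so a parent map replaces `getparent()`), treats "empty" as whitespace-only text rather than applying the word-count threshold, and the names `remove_empty_elements`, `drop_element`, `text_after_cleanup`, and `HTML` are hypothetical.

```python
import xml.etree.ElementTree as ET

CODE_TAGS = ("pre", "code")

# Markup shaped like syntax-highlighter output: the space between
# "import" and "torch" lives in a whitespace-only <span>.
HTML = (
    '<div><pre><code>'
    '<span class="kn">import</span><span class="w"> </span>'
    '<span class="nn">torch</span>'
    '</code></pre><p> </p></div>'
)


def drop_element(parent, el):
    """Remove el from parent while keeping any tail text in the tree."""
    idx = list(parent).index(el)
    if el.tail:
        if idx == 0:
            parent.text = (parent.text or "") + el.tail
        else:
            prev = parent[idx - 1]
            prev.tail = (prev.tail or "") + el.tail
    parent.remove(el)


def remove_empty_elements(root, skip_code_blocks=True):
    """Drop whitespace-only elements, optionally sparing code-block descendants."""
    # ElementTree has no getparent(), so build a child -> parent map up front.
    parents = {child: parent for parent in root.iter() for child in parent}

    def in_code_block(el):
        # Walk up the ancestor chain, as the patched loop does with getparent().
        ancestor = parents.get(el)
        while ancestor is not None:
            if ancestor.tag in CODE_TAGS:
                return True
            ancestor = parents.get(ancestor)
        return False

    for el in list(root.iter()):
        if el is root or (skip_code_blocks and in_code_block(el)):
            continue
        if not "".join(el.itertext()).strip():
            drop_element(parents[el], el)


def text_after_cleanup(skip_code_blocks):
    """Parse HTML, run the cleanup, and return the remaining text."""
    root = ET.fromstring(HTML)
    remove_empty_elements(root, skip_code_blocks)
    return "".join(root.itertext())
```

Calling `text_after_cleanup(False)` reproduces the reported symptom (`"importtorch"`), while `text_after_cleanup(True)` keeps the separator span and returns `"import torch"`. In both cases the whitespace-only `<p>` outside the code block is still removed.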