Compare commits
4 Commits
docker-reb
...
fix-cors-d
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
af77800a6b | ||
|
|
c2c4d42be4 | ||
|
|
f68e7531e3 | ||
|
|
cb637fb5c4 |
@@ -1034,11 +1034,14 @@ Our enterprise sponsors and technology partners help scale Crawl4AI to power pro
|
||||
|
||||
| Company | About | Sponsorship Tier |
|
||||
|------|------|----------------------------|
|
||||
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥈 Silver |
|
||||
| <a href="https://app.scrapeless.com/passport/register?utm_source=official&utm_term=crawl4ai" target="_blank"><picture><source width="250" media="(prefers-color-scheme: dark)" srcset="https://gist.githubusercontent.com/aravindkarnam/0d275b942705604263e5c32d2db27bc1/raw/Scrapeless-light-logo.svg"><source width="250" media="(prefers-color-scheme: light)" srcset="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"><img alt="Scrapeless" src="https://gist.githubusercontent.com/aravindkarnam/22d0525cc0f3021bf19ebf6e11a69ccd/raw/Scrapeless-dark-logo.svg"></picture></a> | Scrapeless is the best full-stack web scraping toolkit offering Scraping API, Scraping Browser, Web Unlocker, Captcha Solver, and Proxies, designed to handle all your data collection needs. | 🥈 Silver |
|
||||
| <a href="https://dashboard.capsolver.com/passport/register?inviteCode=ESVSECTX5Q23" target="_blank"><picture><source width="120" media="(prefers-color-scheme: dark)" srcset="https://docs.crawl4ai.com/uploads/sponsors/20251013045338_72a71fa4ee4d2f40.png"><source width="120" media="(prefers-color-scheme: light)" srcset="https://www.capsolver.com/assets/images/logo-text.png"><img alt="Capsolver" src="https://www.capsolver.com/assets/images/logo-text.png"></picture></a> | AI-powered Captcha solving service. Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |
|
||||
| <a href="https://kipo.ai" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045751_2d54f57f117c651e.png" alt="Kipo" width="120"/></a> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives. | 🥇 Gold |
|
||||
| <a href="https://www.kidocode.com/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013045045_bb8dace3f0440d65.svg" alt="Kidocode" width="120"/><p align="center">KidoCode</p></a> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 5–18, offering both online and on-campus education. | 🥇 Gold |
|
||||
| <a href="https://www.alephnull.sg/" target="_blank"><img src="https://docs.crawl4ai.com/uploads/sponsors/20251013050323_a9e8e8c4c3650421.svg" alt="Aleph null" width="120"/></a> | Singapore-based Aleph Null is Asia’s leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |
|
||||
|
||||
|
||||
|
||||
### 🧑‍🤝‍🧑 Individual Sponsors
|
||||
|
||||
A heartfelt thanks to our individual supporters! Every contribution helps us keep our open-source mission alive and thriving!
|
||||
|
||||
@@ -674,6 +674,11 @@ class BrowserManager:
|
||||
self.default_context = await self.create_browser_context()
|
||||
await self.setup_context(self.default_context)
|
||||
else:
|
||||
# Handle --disable-web-security requiring a separate user data directory
|
||||
if "--disable-web-security" in (self.config.extra_args or []) and not self.config.user_data_dir:
|
||||
import tempfile
|
||||
self.config.user_data_dir = tempfile.mkdtemp()
|
||||
|
||||
browser_args = self._build_browser_args()
|
||||
|
||||
# Launch appropriate browser type
|
||||
@@ -682,9 +687,15 @@ class BrowserManager:
|
||||
elif self.config.browser_type == "webkit":
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
|
||||
self.default_context = self.browser
|
||||
if "--disable-web-security" in (self.config.extra_args or []):
|
||||
# Use persistent context for --disable-web-security
|
||||
browser_args["args"] = [arg for arg in browser_args["args"] if not arg.startswith("--user-data-dir")]
|
||||
self.default_context = await self.playwright.chromium.launch_persistent_context(self.config.user_data_dir, **browser_args)
|
||||
self.browser = self.default_context
|
||||
self.config.use_managed_browser = True # Treat as managed for get_page logic
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
self.default_context = self.browser
|
||||
|
||||
async def _verify_cdp_ready(self, cdp_url: str) -> bool:
|
||||
"""Verify CDP endpoint is ready with exponential backoff"""
|
||||
@@ -748,6 +759,9 @@ class BrowserManager:
|
||||
if self.config.extra_args:
|
||||
args.extend(self.config.extra_args)
|
||||
|
||||
if self.config.user_data_dir:
|
||||
args.append(f"--user-data-dir={self.config.user_data_dir}")
|
||||
|
||||
# Deduplicate args
|
||||
args = list(dict.fromkeys(args))
|
||||
|
||||
|
||||
@@ -542,6 +542,19 @@ class LXMLWebScrapingStrategy(ContentScrapingStrategy):
|
||||
if el.tag in bypass_tags:
|
||||
continue
|
||||
|
||||
# Skip elements inside <pre> or <code> tags where whitespace is significant
|
||||
# This preserves whitespace-only spans (e.g., <span class="w"> </span>) in code blocks
|
||||
is_in_code_block = False
|
||||
ancestor = el.getparent()
|
||||
while ancestor is not None:
|
||||
if ancestor.tag in ("pre", "code"):
|
||||
is_in_code_block = True
|
||||
break
|
||||
ancestor = ancestor.getparent()
|
||||
|
||||
if is_in_code_block:
|
||||
continue
|
||||
|
||||
text_content = (el.text_content() or "").strip()
|
||||
if (
|
||||
len(text_content.split()) < word_count_threshold
|
||||
|
||||
73
tests/test_browser_manager_cors.py
Normal file
73
tests/test_browser_manager_cors.py
Normal file
@@ -0,0 +1,73 @@
|
||||
import os
|
||||
import sys
|
||||
import pytest
|
||||
|
||||
# Add the parent directory to the Python path
|
||||
parent_dir = os.path.dirname(
|
||||
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
)
|
||||
sys.path.append(parent_dir)
|
||||
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
|
||||
@pytest.mark.asyncio
async def test_normal_browser_launch():
    """Smoke test: the default AsyncWebCrawler (no --disable-web-security) works.

    Crawls a known-good page and checks that the crawl succeeds and that both
    the raw HTML and the derived markdown are non-empty.
    """
    async with AsyncWebCrawler() as crawler:
        page = await crawler.arun(url="https://example.com", bypass_cache=True)
        assert page.success
        assert page.html
        assert page.markdown
|
||||
|
||||
|
||||
@pytest.mark.asyncio
async def test_cors_bypass_with_disable_web_security():
    """--disable-web-security should let a synchronous XMLHttpRequest cross origins.

    The JS below issues a synchronous XHR to a host that does not emit CORS
    headers; with web security disabled the request should complete with a
    200 and a non-empty body.
    """
    # Headless so the test can run in CI; the flag under test goes in extra_args.
    browser_config = BrowserConfig(
        extra_args=['--disable-web-security'],
        headless=True  # Run headless for test
    )

    # Cross-origin fetch that a normal browser profile would block under CORS.
    js_code = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', 'https://raw.githubusercontent.com/tatsu-lab/alpaca_eval/main/docs/data_AlpacaEval_2/weighted_alpaca_eval_gpt4_turbo_leaderboard.csv', false);
    xhr.send();
    if (xhr.status == 200) {
        return {success: true, length: xhr.responseText.length};
    } else {
        return {success: false, status: xhr.status, error: xhr.statusText};
    }
    """

    crawler_config = CrawlerRunConfig(js_code=js_code)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        page = await crawler.arun(url="https://example.com", config=crawler_config, bypass_cache=True)
        assert page.success, f"Crawl failed: {page.error_message}"

        js_result = page.js_execution_result
        assert js_result is not None, "JS execution result is None"
        # NOTE(review): this asserts 'success' on the wrapper dict as well as on
        # the unwrapped entry below — confirm the wrapper actually carries a
        # top-level 'success' key in crawl4ai's result schema.
        assert js_result.get('success') == True, f"XMLHttpRequest failed: {js_result}"

        # The per-snippet return values are wrapped in a 'results' list.
        wrapped = js_result.get('results', [])
        assert len(wrapped) > 0, "No results in JS execution"
        xhr_result = wrapped[0]
        assert xhr_result.get('success') == True, f"XMLHttpRequest failed: {xhr_result}"
        assert xhr_result.get('length', 0) > 0, f"No data received from XMLHttpRequest: {xhr_result}"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
async def test_browser_manager_without_cors_flag():
    """Regression guard: a headless config without --disable-web-security still crawls."""
    config = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=config) as crawler:
        outcome = await crawler.arun(url="https://example.com", bypass_cache=True)
        assert outcome.success
        assert outcome.html
|
||||
|
||||
|
||||
# Allow running this file directly for local debugging.
if __name__ == "__main__":
    argv = [__file__, "-v"]
    pytest.main(argv)
|
||||
Reference in New Issue
Block a user