Fix: HTTP strategy raw: URL parsing truncates at # character

The AsyncHTTPCrawlerStrategy.crawl() method used urlparse() to extract
content from raw: URLs. This caused HTML with CSS color codes like #eee
to be truncated because # is treated as a URL fragment delimiter.

Before: raw:body{background:#eee} -> parsed.path = 'body{background:'
After:  raw:body{background:#eee} -> raw_content = 'body{background:#eee'

Fix: Strip the raw: or raw:// prefix directly instead of using urlparse,
matching how the browser strategy handles it.
This commit is contained in:
unclecode
2025-12-24 04:31:57 +00:00
parent 31ebf37252
commit 624e34164d

View File

@@ -2475,7 +2475,10 @@ class AsyncHTTPCrawlerStrategy(AsyncCrawlerStrategy):
if scheme == 'file':
return await self._handle_file(parsed.path)
elif scheme == 'raw':
return await self._handle_raw(parsed.path)
# Don't use parsed.path - urlparse truncates at '#' which is common in CSS
# Strip prefix directly: "raw://" (6 chars) or "raw:" (4 chars)
raw_content = url[6:] if url.startswith("raw://") else url[4:]
return await self._handle_raw(raw_content)
else: # http or https
return await self._handle_http(url, config)