Release/v0.7.8 (#1662)

* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
This commit is contained in:
Nasrin
2025-12-11 18:04:52 +08:00
committed by GitHub
parent 835e3c56fe
commit a87e8c1c9e
32 changed files with 2123 additions and 135 deletions

View File

@@ -0,0 +1,118 @@
"""Test delayed redirect WITH wait_for - does link resolution use correct URL?"""
import asyncio
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler
class RedirectTestHandler(SimpleHTTPRequestHandler):
def log_message(self, format, *args):
pass
def do_GET(self):
if self.path == "/page-a":
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
content = """
<!DOCTYPE html>
<html>
<head><title>Page A</title></head>
<body>
<h1>Page A - Will redirect after 200ms</h1>
<script>
setTimeout(function() {
window.location.href = '/redirect-target/';
}, 200);
</script>
</body>
</html>
"""
self.wfile.write(content.encode())
elif self.path.startswith("/redirect-target"):
self.send_response(200)
self.send_header("Content-type", "text/html")
self.end_headers()
content = """
<!DOCTYPE html>
<html>
<head><title>Redirect Target</title></head>
<body>
<h1>Redirect Target</h1>
<nav id="target-nav">
<a href="subpage-1">Subpage 1</a>
<a href="subpage-2">Subpage 2</a>
</nav>
</body>
</html>
"""
self.wfile.write(content.encode())
else:
self.send_response(404)
self.end_headers()
async def main():
import socket
class ReuseAddrHTTPServer(HTTPServer):
allow_reuse_address = True
server = ReuseAddrHTTPServer(("localhost", 8769), RedirectTestHandler)
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()
try:
import sys
sys.path.insert(0, '/Users/nasrin/vscode/c4ai-uc/develop')
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
print("=" * 60)
print("TEST: Delayed JS redirect WITH wait_for='css:#target-nav'")
print("This waits for the redirect to complete")
print("=" * 60)
browser_config = BrowserConfig(headless=True, verbose=False)
crawl_config = CrawlerRunConfig(
cache_mode="bypass",
wait_for="css:#target-nav" # Wait for element on redirect target
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="http://localhost:8769/page-a",
config=crawl_config
)
print(f"Original URL: http://localhost:8769/page-a")
print(f"Redirected URL returned: {result.redirected_url}")
print(f"HTML contains 'Redirect Target': {'Redirect Target' in result.html}")
print()
if "/redirect-target" in (result.redirected_url or ""):
print("✓ redirected_url is CORRECT")
else:
print("✗ BUG #1: redirected_url is WRONG - still shows original URL!")
# Check links
all_links = []
if isinstance(result.links, dict):
all_links = result.links.get("internal", []) + result.links.get("external", [])
print(f"\nLinks found ({len(all_links)} total):")
bug_found = False
for link in all_links:
href = link.get("href", "") if isinstance(link, dict) else getattr(link, 'href', "")
if "subpage" in href:
print(f" {href}")
if "/page-a/" in href:
print(" ^^^ BUG #2: Link resolved with WRONG base URL!")
bug_found = True
elif "/redirect-target/" in href:
print(" ^^^ CORRECT")
if not bug_found and all_links:
print("\n✓ Link resolution is CORRECT")
finally:
server.shutdown()
if __name__ == "__main__":
asyncio.run(main())