Compare commits


3 Commits

Aravind Karnam
d5a0866e03 fix: pdf processing to target only css_selector, thereby giving users a choice to discard unnecessary elements from the page in the generated PDF 2025-12-18 16:32:16 +05:30
Nasrin
a87e8c1c9e Release/v0.7.8 (#1662)
* Fix: Use correct URL variable for raw HTML extraction (#1116)

- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw HTML

* Fix #1181: Preserve whitespace in code blocks during HTML scraping

  The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
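  The fix amounts to treating `<pre>`/`<code>` subtrees as whitespace-significant. A minimal stdlib sketch of the idea (the function name and use of `xml.etree` are illustrative assumptions; the real logic lives in `remove_empty_elements_fast()` with the project's own parser):

  ```python
  import xml.etree.ElementTree as ET

  CODE_TAGS = {"pre", "code"}  # whitespace is significant inside these

  def remove_blank_elements(elem, inside_code=False):
      """Drop empty/whitespace-only leaf elements, except inside code blocks."""
      inside_code = inside_code or elem.tag in CODE_TAGS
      for child in list(elem):
          remove_blank_elements(child, inside_code)
          if inside_code:
              continue  # keep whitespace-only spans so "import torch" stays intact
          if (len(child) == 0
                  and not (child.text or "").strip()
                  and not (child.tail or "").strip()):
              elem.remove(child)

  tree = ET.fromstring('<pre><code>import<span> </span>torch</code></pre>')
  remove_blank_elements(tree)
  print("".join(tree.itertext()))  # import torch
  ```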

* Refactor Pydantic model configuration to use ConfigDict for arbitrary types

* Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621

* Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638

* fix: ensure BrowserConfig.to_dict serializes proxy_config

* feat: make LLM backoff configurable end-to-end

- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
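  Configurable exponential backoff of the kind described can be sketched as follows (a standalone illustration; the parameter names are assumptions, not the exact `LLMConfig` field names):

  ```python
  import random
  import time

  def perform_with_backoff(call, *, base_delay=1.0, max_attempts=3, backoff_factor=2.0):
      """Retry `call`, sleeping base_delay * backoff_factor**attempt between tries."""
      for attempt in range(max_attempts):
          try:
              return call()
          except Exception:
              if attempt == max_attempts - 1:
                  raise  # out of attempts: surface the original error
              # Geometric growth plus a little jitter to avoid thundering herds
              time.sleep(base_delay * (backoff_factor ** attempt) + random.uniform(0, 0.1))
  ```

  Threading these three knobs (delay, attempts, factor) from the config object down to such a retry loop is what the commit describes.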

* reproduced AttributeError from #1642

* pass timeout parameter to docker client request

* added missing deep crawling objects to init

* generalized query in ContentRelevanceFilter to be a str or list
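  Accepting either shape typically reduces to a small normalization step; a hedged sketch (the function name is illustrative, not the actual `ContentRelevanceFilter` code):

  ```python
  from typing import List, Union

  def normalize_query(query: Union[str, List[str]]) -> List[str]:
      """Accept one query string or a list of them; always work with a list."""
      if isinstance(query, str):
          return [query]
      return list(query)
  ```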

* import modules from enhanceable deserialization

* parameterized tests

* Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268

* refactor: replace PyPDF2 with pypdf across the codebase. ref #1412

* announcement: add application form for cloud API closed beta

* Release v0.7.8: Stability & Bug Fix Release

- Updated version to 0.7.8
- Introduced a focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.

* docs: add section for Crawl4AI Cloud API closed beta with application link

* fix: add disk cleanup step to Docker workflow

---------

Co-authored-by: rbushria <rbushri@gmail.com>
Co-authored-by: AHMET YILMAZ <tawfik@kidocode.com>
Co-authored-by: Soham Kukreti <kukretisoham@gmail.com>
Co-authored-by: Chris Murphy <chris.murphy@klaviyo.com>
Co-authored-by: Aravind Karnam <aravind.karanam@gmail.com>
2025-12-11 11:04:52 +01:00
UncleCode
835e3c56fe Add disk cleanup step in Docker release workflow
Added a step to free up disk space before the build process.
2025-12-11 09:49:27 +01:00
3 changed files with 76 additions and 2 deletions


@@ -11,6 +11,25 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Free up disk space
        run: |
          echo "=== Disk space before cleanup ==="
          df -h
          # Remove unnecessary tools and libraries (frees ~25GB)
          sudo rm -rf /usr/share/dotnet
          sudo rm -rf /usr/local/lib/android
          sudo rm -rf /opt/ghc
          sudo rm -rf /opt/hostedtoolcache/CodeQL
          sudo rm -rf /usr/local/share/boost
          sudo rm -rf /usr/share/swift
          # Clean apt cache
          sudo apt-get clean
          echo "=== Disk space after cleanup ==="
          df -h
      - name: Checkout code
        uses: actions/checkout@v4


@@ -989,8 +989,53 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        mhtml_data = None
        if config.pdf:
            pdf_data = await self.export_pdf(page)
            if config.css_selector:
                # Extract content with styles and fixed image URLs
                content_with_styles = await page.evaluate(f"""
                    () => {{
                        const element = document.querySelector("{config.css_selector}");
                        const clone = element.cloneNode(true);
                        // Fix all image URLs to absolute
                        clone.querySelectorAll('img').forEach(img => {{
                            if (img.src) img.src = img.src; // This converts to absolute URL
                        }});
                        // Get all styles
                        const styles = Array.from(document.styleSheets)
                            .map(sheet => {{
                                try {{
                                    return Array.from(sheet.cssRules).map(rule => rule.cssText).join('\\n');
                                }} catch(e) {{
                                    return '';
                                }}
                            }}).join('\\n');
                        return {{
                            html: clone.outerHTML,
                            styles: styles,
                            baseUrl: window.location.origin
                        }};
                    }}
                """)
                # Create page with base URL for relative resources
                temp_page = await context.new_page()
                await temp_page.goto(content_with_styles['baseUrl'])  # Set the base URL
                await temp_page.set_content(f"""
                    <html>
                    <head>
                        <base href="{content_with_styles['baseUrl']}">
                        <style>{content_with_styles['styles']}</style>
                    </head>
                    <body>{content_with_styles['html']}</body>
                    </html>
                """)
                pdf_data = await self.export_pdf(temp_page)
                await temp_page.close()
            else:
                pdf_data = await self.export_pdf(page)
        if config.capture_mhtml:
            mhtml_data = await self.capture_mhtml(page)


@@ -55,6 +55,16 @@
</div>
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_We'll be onboarding in phases and working closely with early users.
Limited slots._
---
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
> Enjoy using Crawl4AI? Consider **[becoming a sponsor](https://github.com/sponsors/unclecode)** to support ongoing development and community growth!