Compare commits


25 Commits

Author SHA1 Message Date
Nasrin
60d6173914 Merge pull request #1661 from unclecode/waitlist
announcement: add application form for cloud API closed beta
2025-12-09 16:44:15 +08:00
ntohidi
48c31c4cb9 Release v0.7.8: Stability & Bug Fix Release
- Updated version to 0.7.8
- Introduced focused stability release addressing 11 community-reported bugs.
- Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates.
- Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended.
- Updated documentation to reflect recent changes and improvements.
2025-12-08 15:42:29 +01:00
Aravind Karnam
48b6283e71 announcement: add application form for cloud API closed beta 2025-12-08 14:00:57 +05:30
Nasrin
5a8fb57795 Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter
[Fix]: Docker server does not decode ContentRelevanceFilter
2025-12-03 18:36:07 +08:00
ntohidi
df4d87ed78 refactor: replace PyPDF2 with pypdf across the codebase. ref #1412 2025-12-03 10:59:18 +01:00
Nasrin
f32cfc6db0 Merge pull request #1645 from unclecode/fix/configurable-backoff
Make LLM backoff configurable end-to-end
2025-12-02 21:07:49 +08:00
Nasrin
d06c39e8ab Merge pull request #1641 from unclecode/fix/serialize-proxy-config
Fix BrowserConfig proxy_config serialization
2025-12-02 21:06:02 +08:00
ntohidi
afc31e144a Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop 2025-12-02 13:01:11 +01:00
ntohidi
07ccf13be6 Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268 2025-12-02 13:00:54 +01:00
Chris Murphy
6893094f58 parameterized tests 2025-12-01 16:19:19 -05:00
Chris Murphy
3a8f8298d3 import modules from enhanceable deserialization 2025-12-01 16:18:59 -05:00
Chris Murphy
e95e8e1a97 generalized query in ContentRelevanceFilter to be a str or list 2025-12-01 16:16:31 -05:00
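The str-or-list generalization described in this commit can be sketched roughly as below; `normalize_query` is an illustrative helper, not the actual `ContentRelevanceFilter` internals.

```python
from typing import List, Union


def normalize_query(query: Union[str, List[str]]) -> str:
    """Accept either a single query string or a list of query terms.

    A list is collapsed into one space-joined string so downstream
    relevance scoring only ever sees a single string (sketch only;
    the real filter logic may differ).
    """
    if isinstance(query, str):
        return query
    return " ".join(query)
```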
Chris Murphy
eb76df2c0d added missing deep crawling objects to init 2025-12-01 16:15:58 -05:00
Chris Murphy
6ec6bc4d8a pass timeout parameter to docker client request 2025-12-01 16:15:27 -05:00
Chris Murphy
33a3cc3933 reproduced AttributeError from #1642 2025-12-01 11:31:07 -05:00
Soham Kukreti
7a133e22cc feat: make LLM backoff configurable end-to-end
- extend LLMConfig with backoff delay/attempt/factor fields and thread them
  through LLMExtractionStrategy, LLMContentFilter, table extraction, and
  Docker API handlers
- expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff
  and document them in the md_v2 guides
2025-11-28 18:50:04 +05:30
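The delay/attempt/factor knobs described above suggest a standard exponential-backoff retry loop. The sketch below is a minimal stand-in with illustrative parameter names; it is not the actual `perform_completion_with_backoff` signature.

```python
import time


def completion_with_backoff(call, max_attempts=3, base_delay=2.0, backoff_factor=2.0):
    """Retry `call` with exponential backoff between attempts.

    Sleeps base_delay * backoff_factor**attempt after each failure and
    re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (backoff_factor ** attempt))
```

Making all three values caller-configurable (rather than hard-coded) is what "configurable end-to-end" refers to: the same knobs are threaded through every layer that retries LLM calls.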
Nasrin
dcb77c94bf Merge pull request #1623 from unclecode/fix/deprecated_pydantic
Refactor Pydantic model configuration to use ConfigDict for arbitrary…
2025-11-27 20:05:42 +08:00
Soham Kukreti
a0c5f0f79a fix: ensure BrowserConfig.to_dict serializes proxy_config 2025-11-26 17:44:06 +05:30
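The bug class this fix addresses is a nested config object surviving as an object inside the parent's `to_dict()` output instead of being serialized itself. The classes below are simplified stand-ins, not the real crawl4ai `BrowserConfig`/`ProxyConfig`.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ProxyConfig:
    server: str
    username: Optional[str] = None


@dataclass
class BrowserConfig:
    headless: bool = True
    proxy_config: Optional[ProxyConfig] = None

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, so proxy_config
        # comes out as a plain dict rather than a live object.
        return asdict(self)
```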
ntohidi
b36c6daa5c Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638 2025-11-25 11:51:59 +01:00
Nasrin
94c8a833bf Merge pull request #1447 from rbushri/fix/wrong_url_raw
Fix: Wrong URL variable used for extraction of raw html
2025-11-25 17:49:44 +08:00
ntohidi
84bfea8bd1 Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621 2025-11-25 10:46:00 +01:00
Rachel Bushrian
7771ed3894 Merge branch 'develop' into fix/wrong_url_raw 2025-11-24 13:54:07 +02:00
AHMET YILMAZ
eca04b0368 Refactor Pydantic model configuration to use ConfigDict for arbitrary types 2025-11-18 15:40:17 +08:00
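In Pydantic v2 the class-based inner `Config` is deprecated in favor of `model_config = ConfigDict(...)`. A minimal sketch of the pattern this refactor applies (the model and field names are illustrative):

```python
from pydantic import BaseModel, ConfigDict


class RawHandle:
    """Plain class with no pydantic schema; only accepted when
    arbitrary_types_allowed is enabled on the model."""
    def __init__(self, value):
        self.value = value


class Holder(BaseModel):
    # v2 replacement for the deprecated:
    #   class Config:
    #       arbitrary_types_allowed = True
    model_config = ConfigDict(arbitrary_types_allowed=True)

    handle: RawHandle
```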
ntohidi
c2c4d42be4 Fix #1181: Preserve whitespace in code blocks during HTML scraping
The remove_empty_elements_fast() method was removing whitespace-only
  span elements inside <pre> and <code> tags, causing import statements
  like "import torch" to become "importtorch". Now skips elements inside
  code blocks where whitespace is significant.
2025-11-17 12:21:23 +01:00
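The rule described in this fix can be sketched as a small predicate: a whitespace-only element is removable unless any ancestor is a whitespace-significant tag. Names here are illustrative, not the real crawl4ai internals.

```python
# Tags inside which whitespace carries meaning and must be preserved.
WHITESPACE_SIGNIFICANT = {"pre", "code"}


def can_remove(text: str, ancestor_tags: list) -> bool:
    """Return True if an element whose text content is `text` may be pruned.

    Elements with real (non-whitespace) content are always kept, as are
    whitespace-only elements nested anywhere inside <pre> or <code>,
    so "import torch" never collapses to "importtorch".
    """
    if text.strip():
        return False  # has real content
    return not any(tag in WHITESPACE_SIGNIFICANT for tag in ancestor_tags)
```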
rbushria
edd0b576b1 Fix: Use correct URL variable for raw HTML extraction (#1116)
- Prevents full HTML content from being passed as URL to extraction strategies
- Added unit tests to verify raw HTML and regular URL processing

Fix: Wrong URL variable used for extraction of raw html
2025-09-01 23:15:56 +03:00
3 changed files with 2 additions and 76 deletions


@@ -11,25 +11,6 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Free up disk space
run: |
echo "=== Disk space before cleanup ==="
df -h
# Remove unnecessary tools and libraries (frees ~25GB)
sudo rm -rf /usr/share/dotnet
sudo rm -rf /usr/local/lib/android
sudo rm -rf /opt/ghc
sudo rm -rf /opt/hostedtoolcache/CodeQL
sudo rm -rf /usr/local/share/boost
sudo rm -rf /usr/share/swift
# Clean apt cache
sudo apt-get clean
echo "=== Disk space after cleanup ==="
df -h
- name: Checkout code
uses: actions/checkout@v4


@@ -989,53 +989,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
mhtml_data = None
if config.pdf:
if config.css_selector:
# Extract content with styles and fixed image URLs
content_with_styles = await page.evaluate(f"""
() => {{
const element = document.querySelector("{config.css_selector}");
const clone = element.cloneNode(true);
// Fix all image URLs to absolute
clone.querySelectorAll('img').forEach(img => {{
if (img.src) img.src = img.src; // This converts to absolute URL
}});
// Get all styles
const styles = Array.from(document.styleSheets)
.map(sheet => {{
try {{
return Array.from(sheet.cssRules).map(rule => rule.cssText).join('\\n');
}} catch(e) {{
return '';
}}
}}).join('\\n');
return {{
html: clone.outerHTML,
styles: styles,
baseUrl: window.location.origin
}};
}}
""")
# Create page with base URL for relative resources
temp_page = await context.new_page()
await temp_page.goto(content_with_styles['baseUrl']) # Set the base URL
await temp_page.set_content(f"""
<html>
<head>
<base href="{content_with_styles['baseUrl']}">
<style>{content_with_styles['styles']}</style>
</head>
<body>{content_with_styles['html']}</body>
</html>
""")
pdf_data = await self.export_pdf(temp_page)
await temp_page.close()
else:
pdf_data = await self.export_pdf(page)
pdf_data = await self.export_pdf(page)
if config.capture_mhtml:
mhtml_data = await self.capture_mhtml(page)


@@ -55,16 +55,6 @@
</div>
---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
_We'll be onboarding in phases and working closely with early users.
Limited slots._
---
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
> Enjoy using Crawl4AI? Consider **[becoming a sponsor](https://github.com/sponsors/unclecode)** to support ongoing development and community growth!