Compare commits
3 Commits
v0.7.8 ... pdf_proces

| Author | SHA1 | Date |
|---|---|---|
| | d5a0866e03 | |
| | a87e8c1c9e | |
| | 835e3c56fe | |
.github/workflows/docker-release.yml (vendored, 19 changes)
@@ -11,6 +11,25 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
+      - name: Free up disk space
+        run: |
+          echo "=== Disk space before cleanup ==="
+          df -h
+
+          # Remove unnecessary tools and libraries (frees ~25GB)
+          sudo rm -rf /usr/share/dotnet
+          sudo rm -rf /usr/local/lib/android
+          sudo rm -rf /opt/ghc
+          sudo rm -rf /opt/hostedtoolcache/CodeQL
+          sudo rm -rf /usr/local/share/boost
+          sudo rm -rf /usr/share/swift
+
+          # Clean apt cache
+          sudo apt-get clean
+
+          echo "=== Disk space after cleanup ==="
+          df -h
+
       - name: Checkout code
         uses: actions/checkout@v4
 
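For context: Docker image builds can exhaust the limited free disk (roughly 14 GB) on a standard `ubuntu-latest` runner, so this new step deletes large preinstalled toolchains before the build starts. Below is a minimal Python sketch that mirrors the step, useful for measuring the effect on a disposable Ubuntu VM; the script is illustrative only, not part of this repo, and it really deletes the listed directories.

```python
# Hypothetical helper mirroring the "Free up disk space" workflow step.
# WARNING: destructive; intended for a throwaway Ubuntu VM only.
import shutil
import subprocess

# Paths the workflow step removes on the GitHub-hosted runner
PATHS = [
    "/usr/share/dotnet",
    "/usr/local/lib/android",
    "/opt/ghc",
    "/opt/hostedtoolcache/CodeQL",
    "/usr/local/share/boost",
    "/usr/share/swift",
]

def free_gib() -> float:
    """Free space on the root filesystem, in GiB (what `df -h` reports)."""
    return shutil.disk_usage("/").free / 2**30

before = free_gib()
for path in PATHS:
    # check=False mirrors `rm -rf`: a missing path is not an error
    subprocess.run(["sudo", "rm", "-rf", path], check=False)
subprocess.run(["sudo", "apt-get", "clean"], check=False)

print(f"Reclaimed about {free_gib() - before:.1f} GiB")
```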
@@ -989,8 +989,53 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         mhtml_data = None
 
         if config.pdf:
-            pdf_data = await self.export_pdf(page)
+            if config.css_selector:
+                # Extract content with styles and fixed image URLs
+                content_with_styles = await page.evaluate(f"""
+                    () => {{
+                        const element = document.querySelector("{config.css_selector}");
+                        const clone = element.cloneNode(true);
+
+                        // Fix all image URLs to absolute
+                        clone.querySelectorAll('img').forEach(img => {{
+                            if (img.src) img.src = img.src; // This converts to absolute URL
+                        }});
+
+                        // Get all styles
+                        const styles = Array.from(document.styleSheets)
+                            .map(sheet => {{
+                                try {{
+                                    return Array.from(sheet.cssRules).map(rule => rule.cssText).join('\\n');
+                                }} catch(e) {{
+                                    return '';
+                                }}
+                            }}).join('\\n');
+
+                        return {{
+                            html: clone.outerHTML,
+                            styles: styles,
+                            baseUrl: window.location.origin
+                        }};
+                    }}
+                """)
+
+                # Create page with base URL for relative resources
+                temp_page = await context.new_page()
+                await temp_page.goto(content_with_styles['baseUrl'])  # Set the base URL
+                await temp_page.set_content(f"""
+                    <html>
+                    <head>
+                        <base href="{content_with_styles['baseUrl']}">
+                        <style>{content_with_styles['styles']}</style>
+                    </head>
+                    <body>{content_with_styles['html']}</body>
+                    </html>
+                """)
+
+                pdf_data = await self.export_pdf(temp_page)
+                await temp_page.close()
+            else:
+                pdf_data = await self.export_pdf(page)
         if config.capture_mhtml:
             mhtml_data = await self.capture_mhtml(page)
 
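From the caller's side, the new branch means a crawl configured with both `pdf=True` and a `css_selector` produces a PDF of just the selected element, re-rendered with the page's stylesheets and a `<base>` tag so relative image URLs still resolve. A minimal usage sketch, assuming the public `AsyncWebCrawler`/`CrawlerRunConfig` API; the URL and selector are placeholders:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # pdf=True routes through export_pdf(); per the diff above, css_selector
    # now scopes the PDF to one element instead of the whole page
    config = CrawlerRunConfig(pdf=True, css_selector="article.main-content")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.pdf:  # raw PDF bytes when export succeeded
            with open("element.pdf", "wb") as f:
                f.write(result.pdf)

asyncio.run(main())
```

Routing the snapshot through a temporary page rather than printing the live one keeps the original tab's state untouched and lets the `<base href>` resolve relative resources before `export_pdf` runs.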
@@ -55,6 +55,16 @@
 </div>
 
+---
+#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
+Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.
+
+👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**
+_We’ll be onboarding in phases and working closely with early users.
+Limited slots._
+
+---
+
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
 
 > Enjoy using Crawl4AI? Consider **[becoming a sponsor](https://github.com/sponsors/unclecode)** to support ongoing development and community growth!