fix:Make JsonCssExtractionStrategy.generate_schema resilient to markdown tags generated by LLMs https://github.com/unclecode/crawl4ai/issues/1663

3 changed files with 5 additions and 68 deletions
--- a/.github/workflows/docker-release.yml
+++ b/.github/workflows/docker-release.yml
@@ -11,25 +11,6 @@ jobs:
    runs-on: ubuntu-latest
    steps:
-      - name: Free up disk space
-        run: |
-          echo "=== Disk space before cleanup ==="
-          df -h
-          # Remove unnecessary tools and libraries (frees ~25GB)
-          sudo rm -rf /usr/share/dotnet
-          sudo rm -rf /usr/local/lib/android
-          sudo rm -rf /opt/ghc
-          sudo rm -rf /opt/hostedtoolcache/CodeQL
-          sudo rm -rf /usr/local/share/boost
-          sudo rm -rf /usr/share/swift
-          # Clean apt cache
-          sudo apt-get clean
-          echo "=== Disk space after cleanup ==="
-          df -h
      - name: Checkout code
        uses: actions/checkout@v4
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -989,53 +989,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            mhtml_data = None
            if config.pdf:
-                if config.css_selector:
-                    # Extract content with styles and fixed image URLs
-                    content_with_styles = await page.evaluate(f"""
-                        () => {{
-                            const element = document.querySelector("{config.css_selector}");
-                            const clone = element.cloneNode(true);
-                            // Fix all image URLs to absolute
-                            clone.querySelectorAll('img').forEach(img => {{
-                                if (img.src) img.src = img.src;  // This converts to absolute URL
-                            }});
-                            // Get all styles
-                            const styles = Array.from(document.styleSheets)
-                                .map(sheet => {{
-                                    try {{
-                                        return Array.from(sheet.cssRules).map(rule => rule.cssText).join('\\n');
-                                    }} catch(e) {{
-                                        return '';
-                                    }}
-                                }}).join('\\n');
-                            return {{
-                                html: clone.outerHTML,
-                                styles: styles,
-                                baseUrl: window.location.origin
-                            }};
-                        }}
-                    """)
-                    # Create page with base URL for relative resources
-                    temp_page = await context.new_page()
-                    await temp_page.goto(content_with_styles['baseUrl'])  # Set the base URL
-                    await temp_page.set_content(f"""
-                        <html>
-                        <head>
-                            <base href="{content_with_styles['baseUrl']}">
-                            <style>{content_with_styles['styles']}</style>
-                        </head>
-                        <body>{content_with_styles['html']}</body>
-                        </html>
-                    """)
-                    pdf_data = await self.export_pdf(temp_page)
-                    await temp_page.close()
-                else:
+                pdf_data = await self.export_pdf(page)
            if config.capture_mhtml:
                mhtml_data = await self.capture_mhtml(page)
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -1378,9 +1378,10 @@ In this scenario, use your best judgment to generate the schema. You need to exa
                base_url=llm_config.base_url,
                extra_args=kwargs
            )
-            
+              # Simply strip the markdown formatting
+            raw_json = response.choices[0].message.content.replace('```json\n', '').replace('\n```', '')
            # Extract and return schema
-            return json.loads(response.choices[0].message.content)
+            return json.loads(raw_json)
        except Exception as e:
            raise Exception(f"Failed to generate schema: {str(e)}")
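
The `replace` chain in the hunk above covers the common case where the model wraps its answer in a `json` fence with exact newlines. A more tolerant variant is sketched below as an illustration only; it is not what the commit ships. The helper names `strip_markdown_fence` and `parse_llm_json` are hypothetical and do not exist in crawl4ai. The sketch assumes the response is either bare JSON or a single fenced block, with or without a language tag.

```python
import json
import re

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3

# Opening fence with an optional language tag, the payload, then a closing fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\r?\n(.*?)\r?\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Return the payload of a single fenced block, or the stripped text if unfenced."""
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text.strip()


def parse_llm_json(text: str):
    """Parse JSON from an LLM reply that may or may not be wrapped in a fence."""
    return json.loads(strip_markdown_fence(text))
```

Unlike chained `replace` calls, the anchored pattern only removes a fence that wraps the whole reply, so backticks that appear inside a JSON string value are left alone.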
Author	SHA1	Message	Date
Aravind Karnam	b0b2b2761c	fix:Make JsonCssExtractionStrategy.generate_schema resilient to markdown tags generated by LLMs https://github.com/unclecode/crawl4ai/issues/1663	2025-12-09 15:23:56 +05:30
ntohidi	9672afded2	docs: add section for Crawl4AI Cloud API closed beta with application link	2025-12-09 10:27:15 +01:00
Nasrin	60d6173914	Merge pull request #1661 from unclecode/waitlist announcement: add application form for cloud API closed beta	2025-12-09 16:44:15 +08:00
ntohidi	48c31c4cb9	Release v0.7.8: Stability & Bug Fix Release - Updated version to 0.7.8 - Introduced focused stability release addressing 11 community-reported bugs. - Key fixes include Docker API improvements, LLM extraction enhancements, URL handling corrections, and dependency updates. - Added detailed release notes for v0.7.8 in the blog and created a dedicated verification script to ensure all fixes are functioning as intended. - Updated documentation to reflect recent changes and improvements.	2025-12-08 15:42:29 +01:00
Aravind Karnam	48b6283e71	announcement: add application form for cloud API closed beta	2025-12-08 14:00:57 +05:30
Nasrin	5a8fb57795	Merge pull request #1648 from christopher-w-murphy/fix/content-relevance-filter [Fix]: Docker server does not decode ContentRelevanceFilter	2025-12-03 18:36:07 +08:00
ntohidi	df4d87ed78	refactor: replace PyPDF2 with pypdf across the codebase. ref #1412	2025-12-03 10:59:18 +01:00
Nasrin	f32cfc6db0	Merge pull request #1645 from unclecode/fix/configurable-backoff Make LLM backoff configurable end-to-end	2025-12-02 21:07:49 +08:00
Nasrin	d06c39e8ab	Merge pull request #1641 from unclecode/fix/serialize-proxy-config Fix BrowserConfig proxy_config serialization	2025-12-02 21:06:02 +08:00
ntohidi	afc31e144a	Merge branch 'develop' of https://github.com/unclecode/crawl4ai into develop	2025-12-02 13:01:11 +01:00
ntohidi	07ccf13be6	Fix: capture current page URL to reflect JavaScript navigation and add test for delayed redirects. ref #1268	2025-12-02 13:00:54 +01:00
Chris Murphy	6893094f58	parameterized tests	2025-12-01 16:19:19 -05:00
Chris Murphy	3a8f8298d3	import modules from enhanceable deserialization	2025-12-01 16:18:59 -05:00
Chris Murphy	e95e8e1a97	generalized query in ContentRelevanceFilter to be a str or list	2025-12-01 16:16:31 -05:00
Chris Murphy	eb76df2c0d	added missing deep crawling objects to init	2025-12-01 16:15:58 -05:00
Chris Murphy	6ec6bc4d8a	pass timeout parameter to docker client request	2025-12-01 16:15:27 -05:00
Chris Murphy	33a3cc3933	reproduced AttributeError from #1642	2025-12-01 11:31:07 -05:00
Soham Kukreti	7a133e22cc	feat: make LLM backoff configurable end-to-end - extend LLMConfig with backoff delay/attempt/factor fields and thread them through LLMExtractionStrategy, LLMContentFilter, table extraction, and Docker API handlers - expose the backoff parameter knobs on perform_completion_with_backoff/aperform_completion_with_backoff and document them in the md_v2 guides	2025-11-28 18:50:04 +05:30
Nasrin	dcb77c94bf	Merge pull request #1623 from unclecode/fix/deprecated_pydantic Refactor Pydantic model configuration to use ConfigDict for arbitrary…	2025-11-27 20:05:42 +08:00
Soham Kukreti	a0c5f0f79a	fix: ensure BrowserConfig.to_dict serializes proxy_config	2025-11-26 17:44:06 +05:30
ntohidi	b36c6daa5c	Fix: permission issues with .cache/url_seeder and other runtime cache dirs. ref #1638	2025-11-25 11:51:59 +01:00
Nasrin	94c8a833bf	Merge pull request #1447 from rbushri/fix/wrong_url_raw Fix: Wrong URL variable used for extraction of raw html	2025-11-25 17:49:44 +08:00
ntohidi	84bfea8bd1	Fix EmbeddingStrategy: Uncomment response handling for the variations and clean up mock data. ref #1621	2025-11-25 10:46:00 +01:00
Rachel Bushrian	7771ed3894	Merge branch 'develop' into fix/wrong_url_raw	2025-11-24 13:54:07 +02:00
AHMET YILMAZ	eca04b0368	Refactor Pydantic model configuration to use ConfigDict for arbitrary types	2025-11-18 15:40:17 +08:00
ntohidi	c2c4d42be4	Fix #1181: Preserve whitespace in code blocks during HTML scraping The remove_empty_elements_fast() method was removing whitespace-only span elements inside <pre> and <code> tags, causing import statements like "import torch" to become "importtorch". Now skips elements inside code blocks where whitespace is significant.	2025-11-17 12:21:23 +01:00
rbushria	edd0b576b1	Fix: Use correct URL variable for raw HTML extraction (#1116) - Prevents full HTML content from being passed as URL to extraction strategies - Added unit tests to verify raw HTML and regular URL processing Fix: Wrong URL variable used for extraction of raw html	2025-09-01 23:15:56 +03:00