feat(pdf): add PDF processing capabilities

Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files

Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
2025-01-27 21:24:15 +08:00
parent 54c84079c4
commit f8fd9d9eff
9 changed files with 933 additions and 49 deletions
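Reviewer note: the `extraction_strategy.py` hunk in this commit reorders the LLM user message so that the HTML comes first, the schema instructions follow it, and an optional `query` is appended at the end as extra notes. The sketch below shows that layout in isolation. The helper name `build_user_message` and its signature are hypothetical and do not appear in the commit; only the section labels are taken from the diff.

```python
from typing import Optional


def build_user_message(
    html: str, prompt_template: str, query: Optional[str] = None
) -> dict:
    """Hypothetical helper that mirrors the message layout after this commit."""
    fence = "`" * 3  # built at runtime so this example nests cleanly in a fence

    # The HTML goes first, followed by the instructions that refer back to it.
    content = (
        "HTML to analyze:\n"
        f"{fence}html\n"
        f"{html}\n"
        f"{fence}\n\n"
        "Instructions to extract schema for the above given HTML:\n"
        f"{prompt_template}\n"
    )

    # The query is appended only when the caller supplies one.
    if query:
        content += f"\n\nImportant Notes to Consider:\n{query}"

    return {"role": "user", "content": content}
```

With no `query`, the message contains only the HTML and the instructions. Previously the `else` branch of the inline conditional added a fallback sentence asking the model to analyze the structure, and the hunk removes it.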
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -1098,17 +1098,19 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
        user_message = {
            "role": "user",
            "content": f"""
-                Instructions:
-                {prompt_template}
-
                HTML to analyze:
                ```html
                {html}
                ```

-                {"Extract the following data: " + query if query else "Please analyze the HTML structure and create the most appropriate schema for data extraction."}
+                Instructions to extract schema for the above given HTML:
+                {prompt_template}
+
                """
        }
+        
+        if query:
+            user_message["content"] += f"\n\nImportant Notes to Consider:\n{query}"

        try:
            # Call LLM with backoff handling