feat(pdf): add PDF processing capabilities
Add new PDF processing module with the following features: - PDF text extraction and formatting to HTML/Markdown - Image extraction with multiple format support (JPEG, PNG, TIFF) - Link extraction from PDF documents - Metadata extraction including title, author, dates - Support for both local and remote PDF files Also includes: - New configuration options for HTML attribute handling - Internal/external link filtering improvements - Version bump to 0.4.300b4
This commit is contained in:
@@ -1098,17 +1098,19 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
|
||||
user_message = {
|
||||
"role": "user",
|
||||
"content": f"""
|
||||
Instructions:
|
||||
{prompt_template}
|
||||
|
||||
HTML to analyze:
|
||||
```html
|
||||
{html}
|
||||
```
|
||||
|
||||
{"Extract the following data: " + query if query else "Please analyze the HTML structure and create the most appropriate schema for data extraction."}
|
||||
Instructions to extract schema for the above given HTML:
|
||||
{prompt_template}
|
||||
|
||||
"""
|
||||
}
|
||||
|
||||
if query:
|
||||
user_message["content"] += f"\n\nImportant Notes to Consider:\n{query}"
|
||||
|
||||
try:
|
||||
# Call LLM with backoff handling
|
||||
|
||||
Reference in New Issue
Block a user