feat(pdf): add PDF processing capabilities

Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files

Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
UncleCode
2025-01-27 21:24:15 +08:00
parent 54c84079c4
commit f8fd9d9eff
9 changed files with 933 additions and 49 deletions
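
Note on the first hunk: the `+`/`-` markers did not survive in this view, so old and new lines appear interleaved. One plausible reading (an assumption, not confirmed by the diff itself) is that the HTML now comes first, the schema instructions follow it, and any `query` is appended afterwards as notes. A minimal sketch of that reading, using a hypothetical helper `build_user_message` that is not part of the codebase:

```python
from typing import Optional

# Build the Markdown fence marker programmatically so this example
# does not contain a literal fence inside its own code block.
FENCE = "`" * 3


def build_user_message(html: str, prompt_template: str,
                       query: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble the chat message under the assumed new ordering."""
    content = (
        "HTML to analyze:\n"
        f"{FENCE}html\n"
        f"{html}\n"
        f"{FENCE}\n"
        "Instructions to extract schema for the above given HTML:\n"
        f"{prompt_template}\n"
    )
    message = {"role": "user", "content": content}
    # The optional query is appended last, so it reads as extra notes
    # instead of replacing the main instructions.
    if query:
        message["content"] += f"\n\nImportant Notes to Consider:\n{query}"
    return message
```

Under this reading, a call without `query` produces only the HTML followed by the instructions, and a call with `query` adds a trailing "Important Notes to Consider" section.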

@@ -1098,17 +1098,19 @@ class JsonElementExtractionStrategy(ExtractionStrategy):
user_message = {
"role": "user",
"content": f"""
Instructions:
{prompt_template}
HTML to analyze:
```html
{html}
```
{"Extract the following data: " + query if query else "Please analyze the HTML structure and create the most appropriate schema for data extraction."}
Instructions to extract schema for the above given HTML:
{prompt_template}
"""
}
if query:
user_message["content"] += f"\n\nImportant Notes to Consider:\n{query}"
try:
# Call LLM with backoff handling