feat(pdf): add PDF processing capabilities

Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files

Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4

Author: UncleCode
Date: 2025-01-27 21:24:15 +08:00
Parent: 54c84079c4
Commit: f8fd9d9eff
9 changed files with 933 additions and 49 deletions

@@ -143,6 +143,7 @@ class AsyncCrawlResponse(BaseModel):
 ###############################
 class MediaItem(BaseModel):
     src: Optional[str] = ""
+    data: Optional[str] = ""
     alt: Optional[str] = ""
     desc: Optional[str] = ""
     score: Optional[int] = 0
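
The hunk shows a `data` field on `MediaItem` but does not say how it is filled. As a rough sketch only, the snippet below assumes `data` carries base64-encoded image bytes for media that has no URL, such as an image extracted from a PDF. The helper `media_item_from_bytes` is hypothetical and not part of this commit, and a stdlib dataclass stands in for the pydantic `BaseModel` so the example runs without third-party packages.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaItem:
    # Stand-in for the pydantic model in the hunk: same field names and defaults.
    src: Optional[str] = ""
    data: Optional[str] = ""
    alt: Optional[str] = ""
    desc: Optional[str] = ""
    score: Optional[int] = 0


def media_item_from_bytes(raw: bytes, alt: str = "") -> MediaItem:
    """Hypothetical helper: wrap raw image bytes as a MediaItem without a URL.

    Assumption: `data` holds a base64 string; the commit does not confirm this.
    """
    encoded = base64.b64encode(raw).decode("ascii")
    return MediaItem(data=encoded, alt=alt)


if __name__ == "__main__":
    # The eight-byte PNG signature serves as placeholder image content.
    item = media_item_from_bytes(b"\x89PNG\r\n\x1a\n", alt="page 1, image 1")
    print(item.src == "", len(item.data))
```

Keeping `src` empty and placing the payload in `data` would let consumers distinguish embedded media from linked media with a single check, though the actual convention used by the PDF module may differ.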