feat(pdf): add PDF processing capabilities

Add new PDF processing module with the following features:
- PDF text extraction and formatting to HTML/Markdown
- Image extraction with multiple format support (JPEG, PNG, TIFF)
- Link extraction from PDF documents
- Metadata extraction including title, author, dates
- Support for both local and remote PDF files

Also includes:
- New configuration options for HTML attribute handling
- Internal/external link filtering improvements
- Version bump to 0.4.300b4
This commit is contained in:
UncleCode
2025-01-27 21:24:15 +08:00
parent 54c84079c4
commit f8fd9d9eff
9 changed files with 933 additions and 49 deletions

View File

@@ -1,2 +1,3 @@
# crawl4ai/_version.py
__version__ = "0.4.3b3"
# __version__ = "0.4.3b3"
__version__ = "0.4.300b4"