` with page number information.
+ - `media`: A dictionary where `media["images"]` will contain information about extracted images if `extract_images` was `True`.
+ - `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
+ - `metadata`: A dictionary holding PDF metadata (e.g., title, author, num_pages).
+- **`async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`**:
+ - The asynchronous version of `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
+- **`_get_pdf_path(self, url: str) -> str`**:
+ - A private helper method to manage PDF file access. If the `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
+
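The two helpers above can be pictured with a short, library-agnostic sketch. This is an illustrative re-creation, not the actual `PDFContentScrapingStrategy` code: `resolve_pdf_path`, `process_sync`, and `process_async` are hypothetical names standing in for `_get_pdf_path`, `scrap`, and `ascrap`.

```python
import asyncio
import tempfile
import urllib.parse
import urllib.request

def resolve_pdf_path(url: str) -> str:
    """Remote URLs are downloaded to a temp file; local ones are returned as paths."""
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme in ("http", "https"):
        # Remote PDF: download into a temporary file and hand back its path.
        tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
        with urllib.request.urlopen(url) as resp:
            tmp.write(resp.read())
        tmp.close()
        return tmp.name
    if parsed.scheme == "file":
        # file:// URL: convert back to a plain filesystem path.
        return urllib.request.url2pathname(parsed.path)
    return url  # Already a plain local path.

def process_sync(url: str) -> str:
    # Stand-in for the blocking `scrap` work (parsing pages, images, ...).
    return f"processed:{resolve_pdf_path(url)}"

async def process_async(url: str) -> str:
    # The `ascrap` pattern: push the blocking call onto a worker thread
    # so the asyncio event loop stays responsive.
    return await asyncio.to_thread(process_sync, url)
```

`asyncio.to_thread` (Python 3.9+) runs the callable in the default thread-pool executor, which is why the synchronous and asynchronous entry points can share one implementation.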
+### Example Usage
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
+import os # For creating image directory
+
+async def main():
+ # Define the directory for saving extracted images
+ image_output_dir = "./my_pdf_images"
+ os.makedirs(image_output_dir, exist_ok=True)
+
+ # Configure the PDF content scraping strategy
+ # Enable image extraction and specify where to save them
+ pdf_scraping_cfg = PDFContentScrapingStrategy(
+ extract_images=True,
+ save_images_locally=True,
+ image_save_dir=image_output_dir,
+ batch_size=2 # Process 2 pages at a time for demonstration
+ )
+
+ # The PDFCrawlerStrategy is needed to tell AsyncWebCrawler how to "crawl" a PDF
+ pdf_crawler_cfg = PDFCrawlerStrategy()
+
+ # Configure the overall crawl run
+ run_cfg = CrawlerRunConfig(
+ scraping_strategy=pdf_scraping_cfg # Use our PDF scraping strategy
+ )
+
+ # Initialize the crawler with the PDF-specific crawler strategy
+ async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
+ pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # Example PDF
+
+ print(f"Starting PDF processing for: {pdf_url}")
+ result = await crawler.arun(url=pdf_url, config=run_cfg)
+
+ if result.success:
+ print("\n--- PDF Processing Successful ---")
+ print(f"Processed URL: {result.url}")
+
+ print("\n--- Metadata ---")
+ for key, value in result.metadata.items():
+ print(f" {key.replace('_', ' ').title()}: {value}")
+
+ if result.markdown and hasattr(result.markdown, 'raw_markdown'):
+ print(f"\n--- Extracted Text (Markdown Snippet) ---")
+ print(result.markdown.raw_markdown[:500].strip() + "...")
+ else:
+ print("\nNo text (markdown) content extracted.")
+
+ if result.media and result.media.get("images"):
+ print(f"\n--- Image Extraction ---")
+ print(f"Extracted {len(result.media['images'])} image(s).")
+ for i, img_info in enumerate(result.media["images"][:2]): # Show info for first 2 images
+ print(f" Image {i+1}:")
+ print(f" Page: {img_info.get('page')}")
+ print(f" Format: {img_info.get('format', 'N/A')}")
+ if img_info.get('path'):
+ print(f" Saved at: {img_info.get('path')}")
+ else:
+ print("\nNo images were extracted (or extract_images was False).")
+ else:
+ print(f"\n--- PDF Processing Failed ---")
+ print(f"Error: {result.error_message}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+### Pros and Cons
+
+**Pros:**
+- Provides a comprehensive way to extract text, metadata, and (optionally) images from PDF documents.
+- Handles both remote PDFs (via URL) and local PDF files.
+- Configurable image extraction allows saving images to disk or accessing their data.
+- Integrates smoothly with the `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
+- The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.
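That memory benefit comes from handling pages in fixed-size chunks instead of materializing every page's output at once. A minimal, library-agnostic sketch of the batching pattern (`batched_pages` is a hypothetical helper for illustration, not part of the API):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched_pages(pages: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield pages in chunks of `batch_size`, so only one chunk is live at a time."""
    batch: List[T] = []
    for page in pages:
        batch.append(page)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # Trailing partial batch (e.g. the last page of an odd count).
        yield batch

# With batch_size=2, a five-page PDF is processed as three small batches:
# [1, 2], [3, 4], [5]
```

Results from each batch can be appended to the aggregate output (text, images, links) before the next batch is parsed, keeping peak memory roughly proportional to `batch_size` rather than to the page count.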
+
+**Cons:**
+- Extraction quality and performance can vary significantly depending on the PDF's complexity, encoding, and whether it's image-based (scanned) or text-based.
+- Image extraction can be resource-intensive (both CPU and disk space if `save_images_locally` is true).
+- Relies on `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
+- Link extraction from PDFs can be basic and depends on how hyperlinks are embedded in the document.
diff --git a/mkdocs.yml b/mkdocs.yml
index 38b19afe..72e09397 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -43,6 +43,7 @@ nav:
- "Identity Based Crawling": "advanced/identity-based-crawling.md"
- "SSL Certificate": "advanced/ssl-certificate.md"
- "Network & Console Capture": "advanced/network-console-capture.md"
+ - "PDF Parsing": "advanced/pdf-parsing.md"
- Extraction:
- "LLM-Free Strategies": "extraction/no-llm-strategies.md"
- "LLM Strategies": "extraction/llm-strategies.md"
From b7a6e02236f9da30c1bb21b8a5bb3dab86d97233 Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Wed, 18 Jun 2025 19:04:32 +0200
Subject: [PATCH 47/53] fix: Update pdf and screenshot usage documentation. ref
#1230
---
deploy/docker/c4ai-doc-context.md | 29 ++++++++++++++++--------
docs/md_v2/advanced/advanced-features.md | 29 ++++++++++++++++--------
2 files changed, 38 insertions(+), 20 deletions(-)
diff --git a/deploy/docker/c4ai-doc-context.md b/deploy/docker/c4ai-doc-context.md
index 6591c265..f8b83088 100644
--- a/deploy/docker/c4ai-doc-context.md
+++ b/deploy/docker/c4ai-doc-context.md
@@ -5433,29 +5433,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ screenshot=True,
+ pdf=True
+ )
+
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
- cache_mode=CacheMode.BYPASS,
- pdf=True,
- screenshot=True
+ config=run_config
)
-
if result.success:
- # Save screenshot
+ print(f"Screenshot data present: {result.screenshot is not None}")
+ print(f"PDF data present: {result.pdf is not None}")
+
if result.screenshot:
+ print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
-
- # Save PDF
+ else:
+ print("[WARN] Screenshot data is None.")
+
if result.pdf:
+ print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
-
- print("[OK] PDF & screenshot captured.")
+ else:
+ print("[WARN] PDF data is None.")
+
else:
print("[ERROR]", result.error_message)
diff --git a/docs/md_v2/advanced/advanced-features.md b/docs/md_v2/advanced/advanced-features.md
index b56f216e..3563fd40 100644
--- a/docs/md_v2/advanced/advanced-features.md
+++ b/docs/md_v2/advanced/advanced-features.md
@@ -66,29 +66,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ screenshot=True,
+ pdf=True
+ )
+
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
- cache_mode=CacheMode.BYPASS,
- pdf=True,
- screenshot=True
+ config=run_config
)
-
if result.success:
- # Save screenshot
+ print(f"Screenshot data present: {result.screenshot is not None}")
+ print(f"PDF data present: {result.pdf is not None}")
+
if result.screenshot:
+ print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
-
- # Save PDF
+ else:
+ print("[WARN] Screenshot data is None.")
+
if result.pdf:
+ print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
-
- print("[OK] PDF & screenshot captured.")
+ else:
+ print("[WARN] PDF data is None.")
+
else:
print("[ERROR]", result.error_message)
From 414f16e975cc2ca29abe3531d5ab91a4b17a4163 Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Wed, 18 Jun 2025 19:05:44 +0200
Subject: [PATCH 48/53] fix: Update pdf and screenshot usage documentation. ref
#1230
---
.../crawl4ai_all_reasoning_content.llm.txt | 29 ++++++++++++-------
1 file changed, 19 insertions(+), 10 deletions(-)
diff --git a/docs/md_v2/assets/llmtxt/crawl4ai_all_reasoning_content.llm.txt b/docs/md_v2/assets/llmtxt/crawl4ai_all_reasoning_content.llm.txt
index 850c1237..c3350fb5 100644
--- a/docs/md_v2/assets/llmtxt/crawl4ai_all_reasoning_content.llm.txt
+++ b/docs/md_v2/assets/llmtxt/crawl4ai_all_reasoning_content.llm.txt
@@ -5359,29 +5359,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
```python
import os, asyncio
from base64 import b64decode
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ screenshot=True,
+ pdf=True
+ )
+
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
- cache_mode=CacheMode.BYPASS,
- pdf=True,
- screenshot=True
+ config=run_config
)
-
if result.success:
- # Save screenshot
+ print(f"Screenshot data present: {result.screenshot is not None}")
+ print(f"PDF data present: {result.pdf is not None}")
+
if result.screenshot:
+ print(f"[OK] Screenshot captured, size: {len(result.screenshot)} bytes")
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
-
- # Save PDF
+ else:
+ print("[WARN] Screenshot data is None.")
+
if result.pdf:
+ print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
with open("wikipedia_page.pdf", "wb") as f:
f.write(result.pdf)
-
- print("[OK] PDF & screenshot captured.")
+ else:
+ print("[WARN] PDF data is None.")
+
else:
print("[ERROR]", result.error_message)
From fee4c5c78306b1fe13846344621f7cc06f70a3f0 Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Tue, 8 Jul 2025 11:46:24 +0200
Subject: [PATCH 49/53] fix: Consolidate import statements in local-files.md
for clarity
---
docs/md_v2/core/local-files.md | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/docs/md_v2/core/local-files.md b/docs/md_v2/core/local-files.md
index 31fe7792..2fccea81 100644
--- a/docs/md_v2/core/local-files.md
+++ b/docs/md_v2/core/local-files.md
@@ -8,8 +8,7 @@ To crawl a live web page, provide the URL starting with `http://` or `https://`,
```python
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def crawl_web():
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
@@ -33,8 +32,7 @@ To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def crawl_local_file():
local_file_path = "/path/to/apple.html" # Replace with your file path
@@ -93,8 +91,7 @@ import os
import sys
import asyncio
from pathlib import Path
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
async def main():
wikipedia_url = "https://en.wikipedia.org/wiki/apple"
From a3d41c795132a8858535e1ce60406e2f36bdd40f Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Tue, 8 Jul 2025 12:24:33 +0200
Subject: [PATCH 50/53] fix: Clarify description of 'use_stemming' parameter in
markdown generation documentation ref #1086
---
docs/md_v2/core/markdown-generation.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/docs/md_v2/core/markdown-generation.md b/docs/md_v2/core/markdown-generation.md
index 1b95b965..af9b35b5 100644
--- a/docs/md_v2/core/markdown-generation.md
+++ b/docs/md_v2/core/markdown-generation.md
@@ -200,7 +200,7 @@ config = CrawlerRunConfig(markdown_generator=md_generator)
- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
-- **`use_stemming`** *(default `True`)*: If enabled, variations of words match (e.g., “learn,” “learning,” “learnt”).
+- **`use_stemming`** *(default `True`)*: Whether to apply stemming to the query and content.
- **`language (str)`**: Language for stemming (default: 'english').
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.
From 36429a63ded80920e37d4925be33bd0d5582fda0 Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Tue, 8 Jul 2025 12:54:33 +0200
Subject: [PATCH 51/53] fix: Improve comments for article metadata extraction
in extract_metadata functions. ref #1105
---
crawl4ai/utils.py | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/crawl4ai/utils.py b/crawl4ai/utils.py
index e029a004..8735dee0 100644
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -1547,7 +1547,8 @@ def extract_metadata_using_lxml(html, doc=None):
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
- # Article metadata - using starts-with() for performance
+
+ # Article metadata
article_tags = head.xpath('.//meta[starts-with(@property, "article:")]')
for tag in article_tags:
property_name = tag.get("property", "").strip()
@@ -1629,12 +1630,15 @@ def extract_metadata(html, soup=None):
content = tag.get("content", "").strip()
if property_name and content:
metadata[property_name] = content
- # getting the article Values
- metadata.update({
- tag['property'].strip():tag["content"].strip()
- for tag in head.find_all("meta", attrs={"property": re.compile(r"^article:")})
- if tag.has_attr('property') and tag.has_attr('content')
- })
+
+ # Article metadata
+ article_tags = head.find_all("meta", attrs={"property": re.compile(r"^article:")})
+ for tag in article_tags:
+ property_name = tag.get("property", "").strip()
+ content = tag.get("content", "").strip()
+ if property_name and content:
+ metadata[property_name] = content
+
return metadata
From 026e96a2df790af8c387704f4cc6fd3ef6caa521 Mon Sep 17 00:00:00 2001
From: ntohidi
Date: Tue, 8 Jul 2025 15:48:40 +0200
Subject: [PATCH 52/53] feat: Add social media and community links to README
and index documentation
---
README.md | 17 +++++++++++------
docs/md_v2/index.md | 11 +++++++++++
2 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index 23e40fef..8e6980d8 100644
--- a/README.md
+++ b/README.md
@@ -11,12 +11,17 @@
[](https://pypi.org/project/crawl4ai/)
[](https://pepy.tech/project/crawl4ai)
-
-[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
-[](https://github.com/psf/black)
-[](https://github.com/PyCQA/bandit)
-[](code_of_conduct.md)
-
+
+
+
+
+
+
+
+
+
+
+
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
diff --git a/docs/md_v2/index.md b/docs/md_v2/index.md
index a02bb41d..d497ca89 100644
--- a/docs/md_v2/index.md
+++ b/docs/md_v2/index.md
@@ -41,6 +41,17 @@
alt="License"/>