Release prep (#749)
* fix: Update export of URLPatternFilter
* chore: Add dependency for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* docs: Update the example for memory dispatcher
* docs: Update example for crawl strategies
* refactor: Remove wrapping in `if __name__ == "__main__"` block since this is a markdown file
* chore: Remove cchardet from dependency list, since unclecode is planning to remove it
* docs: Update the example for proxy rotation to a working example
* feat: Introduce ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in Feb alpha-1
* chore: Update and test new dependencies
* feat: Make PyPDF2 a conditional dependency
* docs: Update tutorial and release note for v0.5
* docs: Update docs for deep crawl, and fix a typo in the docker-deployment markdown filename
* refactor: 1. Deprecate markdown_v2. 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in CLI. 4. Deprecate markdown_v2 in CLI. 5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* refactor: Add deprecation errors for accessing fit_html and fit_markdown directly on the result. Now access them via markdown
* fix: Remove deprecated markdown_v2 from docker
* refactor: Remove deprecated fit_markdown and fit_html from result
* refactor: Fix cache retrieval for markdown as a string
* chore: Update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
@@ -27,7 +27,6 @@ class CrawlResult(BaseModel):
     screenshot: Optional[str] = None
     pdf : Optional[bytes] = None
     markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    markdown_v2: Optional[MarkdownGenerationResult] = None
     extracted_content: Optional[str] = None
     metadata: Optional[dict] = None
     error_message: Optional[str] = None
@@ -52,8 +51,7 @@ class CrawlResult(BaseModel):
 | **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
 | **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
 | **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
-| **markdown (`Optional[str or MarkdownGenerationResult]`)** | For now, `markdown_v2` holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
-| **markdown_v2 (`Optional[MarkdownGenerationResult]`)** | Legacy field for detailed markdown output. This will be replaced by `markdown` soon. |
+| **markdown (`Optional[str or MarkdownGenerationResult]`)** | Holds either a markdown string or a `MarkdownGenerationResult`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
 | **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as a JSON string or other text. |
 | **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
 | **error_message (`Optional[str]`)** | If `success=False`, contains a short description of what went wrong. |
@@ -90,10 +88,10 @@ print(result.cleaned_html)  # Freed of forms, header, footer, data-* attributes

 ## 3. Markdown Generation

-### 3.1 `markdown_v2` (Legacy) vs `markdown`
+### 3.1 `markdown`

-- **`markdown_v2`**: The current location for detailed markdown output, returning a **`MarkdownGenerationResult`** object.
-- **`markdown`**: Eventually, we’re merging these fields. For now, you might see `result.markdown_v2` used widely in code examples.
+- **`markdown`**: The current location for detailed markdown output, returning a **`MarkdownGenerationResult`** object.
+- **`markdown_v2`**: Deprecated since v0.5.

 **`MarkdownGenerationResult`** Fields:
@@ -118,7 +116,7 @@ config = CrawlerRunConfig(
 )
 result = await crawler.arun(url="https://example.com", config=config)

-md_res = result.markdown_v2  # or eventually 'result.markdown'
+md_res = result.markdown
 print(md_res.raw_markdown[:500])
 print(md_res.markdown_with_citations)
 print(md_res.references_markdown)
@@ -224,15 +222,17 @@ Check any field:

 if result.success:
     print(result.status_code, result.response_headers)
     print("Links found:", len(result.links.get("internal", [])))
-    if result.markdown_v2:
-        print("Markdown snippet:", result.markdown_v2.raw_markdown[:200])
+    if result.markdown:
+        print("Markdown snippet:", result.markdown.raw_markdown[:200])
     if result.extracted_content:
         print("Structured JSON:", result.extracted_content)
 else:
     print("Error:", result.error_message)
 ```

-**Remember**: Use `result.markdown_v2` for now. It will eventually become `result.markdown`.
+**Deprecation**: Since v0.5, `result.markdown_v2`, `result.fit_html`, and `result.fit_markdown` are deprecated. Use `result.markdown` instead; it holds a `MarkdownGenerationResult`, which includes `fit_html` and `fit_markdown` as its properties.

---
docs/md_v2/core/deep-crawling.md (436 lines, new file)
@@ -0,0 +1,436 @@
# Deep Crawling

One of Crawl4AI's most powerful features is its ability to perform **configurable deep crawling** that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn:

1. How to set up a **Basic Deep Crawler** with BFS strategy
2. Understanding the difference between **streamed and non-streamed** output
3. Implementing **filters and scorers** to target specific content
4. Creating **advanced filtering chains** for sophisticated crawls
5. Using **BestFirstCrawling** for intelligent exploration prioritization

> **Prerequisites**
> - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.

---

## 1. Quick Example

Here's a minimal code snippet that implements a basic deep crawl using the **BFSDeepCrawlStrategy**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
```

**What's happening?**
- `BFSDeepCrawlStrategy(max_depth=2, include_external=False)` instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete

---

## 2. Understanding Deep Crawling Strategy Options

### 2.1 BFSDeepCrawlStrategy (Breadth-First Search)

The **BFSDeepCrawlStrategy** uses a breadth-first approach, exploring all links at one depth before moving deeper:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,             # Crawl initial page + 2 levels deep
    include_external=False,  # Stay within the same domain
)
```

**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
### 2.2 DFSDeepCrawlStrategy (Depth-First Search)

The **DFSDeepCrawlStrategy** uses a depth-first approach, exploring as far down a branch as possible before backtracking:

```python
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,             # Crawl initial page + 2 levels deep
    include_external=False,  # Stay within the same domain
)
```

**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
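The difference between the two traversal orders is easy to see on a toy link graph. This sketch is plain Python, not Crawl4AI internals: BFS visits pages level by level, while DFS follows each branch to the bottom before backtracking.

```python
from collections import deque

# Toy link graph: each page lists the pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def bfs_order(start, max_depth):
    """Visit pages level by level, like a breadth-first strategy."""
    order, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth < max_depth:
            for link in LINKS[url]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return order

def dfs_order(start, max_depth):
    """Follow each branch to the bottom before backtracking."""
    order, seen = [], set()
    def visit(url, depth):
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            for link in LINKS[url]:
                if link not in seen:
                    visit(link, depth + 1)
    visit(start, 0)
    return order

print(bfs_order("home", 2))  # ['home', 'a', 'b', 'a1', 'a2', 'b1']
print(dfs_order("home", 2))  # ['home', 'a', 'a1', 'a2', 'b', 'b1']
```

Same pages, different order: with BFS all depth-1 pages come before any depth-2 page, while DFS reaches `a1` before it ever sees `b`.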
### 2.3 BestFirstCrawlingStrategy (⭐️ recommended deep crawl strategy)

For more intelligent crawling, use **BestFirstCrawlingStrategy** with scorers to prioritize the most relevant pages:

```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer
)
```

This crawling approach:
- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content

---
## 3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

### 3.1 Non-Streaming Mode (Default)

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)
```

**When to use non-streaming mode:**
- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor
### 3.2 Streaming Mode

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)
```

**Benefits of streaming mode:**
- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages
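Under the hood, the streaming contract is simply an async iterator. Here is a library-independent sketch (`fake_crawl` is a stand-in generator, not a Crawl4AI API) showing why you can act on each result before the whole crawl finishes:

```python
import asyncio

async def fake_crawl(urls):
    """Yield one result at a time as each 'page' finishes."""
    for url in urls:
        await asyncio.sleep(0)  # stand-in for network latency
        yield {"url": url, "success": True}

async def main():
    handled = []
    # Consume results as they arrive instead of waiting for the full list
    async for result in fake_crawl(["https://a.example", "https://b.example"]):
        handled.append(result["url"])
    return handled

print(asyncio.run(main()))  # ['https://a.example', 'https://b.example']
```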
---

## 4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters using **FilterChain** for powerful targeting.

### 4.1 Basic URL Pattern Filter

```python
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)
```

### 4.2 Combining Multiple Filters

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)
```

### 4.3 Available Filter Types

Crawl4AI includes several specialized filters:

- **`URLPatternFilter`**: Matches URL patterns using wildcard syntax
- **`DomainFilter`**: Controls which domains to include or exclude
- **`ContentTypeFilter`**: Filters based on HTTP Content-Type
- **`ContentRelevanceFilter`**: Uses similarity to a text query
- **`SEOFilter`**: Evaluates SEO elements (meta tags, headers, etc.)
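If you're unsure what a wildcard pattern will match, you can sanity-check it with Python's built-in `fnmatch`, whose `*` semantics are conceptually similar to the patterns shown above (the library's exact matching rules may differ):

```python
from fnmatch import fnmatch

def matches_any(url: str, patterns: list[str]) -> bool:
    # A URL passes if it matches at least one wildcard pattern.
    return any(fnmatch(url, p) for p in patterns)

patterns = ["*guide*", "*tutorial*"]
print(matches_any("https://docs.example.com/guide/intro", patterns))  # True
print(matches_any("https://docs.example.com/pricing", patterns))      # False
```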

---

## 5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

### 5.1 KeywordRelevanceScorer

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")
```

**How scorers work:**
- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order
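Conceptually, a URL scorer is just a function from a URL to a number. The sketch below is a simplified illustration, not the library's actual `KeywordRelevanceScorer` formula: it scores a URL by the fraction of keywords it contains, scaled by `weight`.

```python
def keyword_score(url: str, keywords: list[str], weight: float = 1.0) -> float:
    """Fraction of keywords that appear in the URL, scaled by weight."""
    url_l = url.lower()
    hits = sum(1 for kw in keywords if kw in url_l)
    return weight * hits / len(keywords)

kws = ["crawl", "async", "config"]
# 2 of 3 keywords hit, so the score is 0.7 * 2/3
print(keyword_score("https://example.com/async-crawl-guide", kws, weight=0.7))
```

A best-first frontier then becomes a priority queue ordered by these scores, so high-scoring URLs are dequeued and crawled first.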

---

## 6. Advanced Filtering Techniques

### 6.1 SEO Filter for Quality Assessment

The **SEOFilter** helps you identify pages with strong SEO characteristics:

```python
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)
```

### 6.2 Content Relevance Filter

The **ContentRelevanceFilter** analyzes the actual content of pages:

```python
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)
```

This filter:
- Measures relevance between the query and each page's content
- Uses BM25 scoring over the page's head-section text
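To build intuition for what BM25-based filtering does, here's a from-scratch BM25 scorer with the standard `k1`/`b` defaults. The library's implementation and preprocessing will differ, so treat this as a conceptual sketch:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with classic BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    q_terms = query.lower().split()
    # Document frequency per query term
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "web crawling and data extraction with python",
    "cooking recipes for busy weeknights",
]
scores = bm25_scores("python web crawling", docs)
print(scores[0] > scores[1])  # True: the crawling page outranks the recipes page
```

A relevance filter then keeps only pages whose score clears the configured threshold.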

---

## 7. Building a Complete Advanced Crawler

This example combines multiple techniques for a sophisticated crawl:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),

        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),

        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results (guard against an empty crawl to avoid ZeroDivisionError)
    if results:
        print(f"Crawled {len(results)} high-value pages")
        print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

        # Group by depth
        depth_counts = {}
        for result in results:
            depth = result.metadata.get("depth", 0)
            depth_counts[depth] = depth_counts.get(depth, 0) + 1

        print("Pages crawled by depth:")
        for depth, count in sorted(depth_counts.items()):
            print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
```

---

## 8. Common Pitfalls & Tips

1. **Set realistic depth limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size.
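To see how quickly depth compounds: if each page links to roughly `b` previously unseen pages, a crawl of depth `d` can touch on the order of `1 + b + b^2 + ... + b^d` pages. A quick back-of-the-envelope check (the branching factor of 20 is an assumption, not a measurement):

```python
def max_pages(branching: int, max_depth: int) -> int:
    # Upper bound on pages reached: sum of branching**d for d = 0..max_depth
    return sum(branching ** d for d in range(max_depth + 1))

for depth in range(1, 5):
    print(f"max_depth={depth}: up to {max_pages(20, depth)} pages")
# With 20 links per page, depth 3 already allows up to 8421 pages,
# and depth 4 allows up to 168421.
```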

2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.

3. **Be a good web citizen.** Respect robots.txt (checking is disabled by default).
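Recent Crawl4AI releases expose a switch for this on `CrawlerRunConfig`; the parameter name below reflects the current docs but may vary by version, so verify it against your installed release:

```python
from crawl4ai import CrawlerRunConfig

# Opt in to robots.txt compliance (off by default)
config = CrawlerRunConfig(
    check_robots_txt=True,  # parameter name assumed from recent releases
)
```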

4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results.

---

## 9. Summary & Next Steps

In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Build a complete advanced crawler with combined techniques

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.
@@ -10,11 +10,10 @@

 In **`CrawlerRunConfig`**, you can specify a **`content_filter`** to shape how content is pruned or ranked before final markdown generation. A filter’s logic is applied **before** or **during** the HTML→Markdown process, producing:

-- **`result.markdown_v2.raw_markdown`** (unfiltered)
-- **`result.markdown_v2.fit_markdown`** (filtered or “fit” version)
-- **`result.markdown_v2.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)
+- **`result.markdown.raw_markdown`** (unfiltered)
+- **`result.markdown.fit_markdown`** (filtered or “fit” version)
+- **`result.markdown.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)

-> **Note**: We’re currently storing the result in `markdown_v2`, but eventually we’ll unify it as `result.markdown`.

 ### 1.2 Common Filters
@@ -62,8 +61,8 @@ async def main():

     if result.success:
         # 'fit_markdown' is your pruned content, focusing on "denser" text
-        print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
-        print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
+        print("Raw Markdown length:", len(result.markdown.raw_markdown))
+        print("Fit Markdown length:", len(result.markdown.fit_markdown))
     else:
         print("Error:", result.error_message)
@@ -123,7 +122,7 @@ async def main():
     )
     if result.success:
         print("Fit Markdown (BM25 query-based):")
-        print(result.markdown_v2.fit_markdown)
+        print(result.markdown.fit_markdown)
     else:
         print("Error:", result.error_message)
@@ -144,11 +143,11 @@ if __name__ == "__main__":

 ## 4. Accessing the “Fit” Output

-After the crawl, your “fit” content is found in **`result.markdown_v2.fit_markdown`**. In future versions, it will be **`result.markdown.fit_markdown`**. Meanwhile:
+After the crawl, your “fit” content is found in **`result.markdown.fit_markdown`**.

 ```python
-fit_md = result.markdown_v2.fit_markdown
-fit_html = result.markdown_v2.fit_html
+fit_md = result.markdown.fit_markdown
+fit_html = result.markdown.fit_html
 ```

 If the content filter is **BM25**, you might see additional logic or references in `fit_markdown` that highlight relevant segments. If it’s **Pruning**, the text is typically well-cleaned but not necessarily matched to a query.
@@ -167,7 +166,6 @@ prune_filter = PruningContentFilter(
 )
 md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
 config = CrawlerRunConfig(markdown_generator=md_generator)
-# => result.markdown_v2.fit_markdown
 ```

 ### 5.2 BM25
@@ -179,7 +177,6 @@ bm25_filter = BM25ContentFilter(
 )
 md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
 config = CrawlerRunConfig(markdown_generator=md_generator)
-# => result.markdown_v2.fit_markdown
 ```

 ---
@@ -203,7 +200,7 @@ Thus, **multi-level** filtering occurs:

 1. The crawler’s `excluded_tags` are removed from the HTML first.
 2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
-3. The final “fit” content is generated in `result.markdown_v2.fit_markdown`.
+3. The final “fit” content is generated in `result.markdown.fit_markdown`.

 ---
@@ -241,7 +238,7 @@ class MyCustomFilter(RelevantContentFilter):
 - **PruningContentFilter**: Great if you just want the “meatiest” text without a user query.
 - **BM25ContentFilter**: Perfect for query-based extraction or searching.
 - Combine with **`excluded_tags`, `exclude_external_links`, `word_count_threshold`** to refine your final “fit” text.
-- Fit markdown ends up in **`result.markdown_v2.fit_markdown`**; eventually **`result.markdown.fit_markdown`** in future versions.
+- Fit markdown ends up in **`result.markdown.fit_markdown`**.

 With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!
@@ -204,7 +204,7 @@ async def main():

     async with AsyncWebCrawler() as crawler:
         result = await crawler.arun("https://example.com", config=config)
-        print(result.fit_markdown)  # Filtered markdown content
+        print(result.markdown.fit_markdown)  # Filtered markdown content
 ```

 **Key Features:**
@@ -249,14 +249,11 @@ filter = LLMContentFilter(

 ## 5. Using Fit Markdown

-When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:
+When a content filter is active, the library produces two forms of markdown inside `result.markdown`:

 1. **`raw_markdown`**: The full unfiltered markdown.
 2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.

-**Note**:
-> In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you’ll have a raw version and a filtered (“fit”) version if a filter is used.

 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
@@ -276,7 +273,7 @@ async def main():
         print("Raw markdown:\n", result.markdown)

         # If a filter is used, we also have .fit_markdown:
-        md_object = result.markdown_v2  # or your equivalent
+        md_object = result.markdown  # or your equivalent
         print("Filtered markdown:\n", md_object.fit_markdown)
     else:
         print("Crawl failed:", result.error_message)
@@ -300,7 +297,7 @@ If your library stores detailed markdown output in an object like `MarkdownGener

 **Example**:

 ```python
-md_obj = result.markdown_v2  # your library’s naming may vary
+md_obj = result.markdown  # your library’s naming may vary
 print("RAW:\n", md_obj.raw_markdown)
 print("CITED:\n", md_obj.markdown_with_citations)
 print("REFERENCES:\n", md_obj.references_markdown)
@@ -296,7 +296,7 @@ async def quick_parallel_example():
         # Stream results as they complete
         async for result in await crawler.arun_many(urls, config=run_conf):
             if result.success:
-                print(f"[OK] {result.url}, length: {len(result.markdown_v2.raw_markdown)}")
+                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
             else:
                 print(f"[ERROR] {result.url} => {result.error_message}")
@@ -305,7 +305,7 @@ async def quick_parallel_example():
     results = await crawler.arun_many(urls, config=run_conf)
     for res in results:
         if res.success:
-            print(f"[OK] {res.url}, length: {len(res.markdown_v2.raw_markdown)}")
+            print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
         else:
             print(f"[ERROR] {res.url} => {res.error_message}")
@@ -39,8 +39,8 @@ result = await crawler.arun(
 # Different content formats
 print(result.html)            # Raw HTML
 print(result.cleaned_html)    # Cleaned HTML
-print(result.markdown)        # Markdown version
-print(result.fit_markdown)    # Most relevant content in markdown
+print(result.markdown.raw_markdown)  # Raw markdown from cleaned html
+print(result.markdown.fit_markdown)  # Most relevant content in markdown

 # Check success status
 print(result.success)  # True if crawl succeeded