Release prep (#749)

* fix: Update export of URLPatternFilter

* chore: Add dependency for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependency list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependencies

* feat: Make PyPDF2 a conditional dependency

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown accessed directly on the result. Access them via markdown now

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Aravind
2025-02-28 17:23:35 +05:30
committed by GitHub
parent 3a87b4e43b
commit a9e24307cc
38 changed files with 2040 additions and 326 deletions

View File

@@ -27,7 +27,6 @@ class CrawlResult(BaseModel):
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
@@ -52,8 +51,7 @@ class CrawlResult(BaseModel):
| **downloaded_files (`Optional[List[str]]`)** | If `accept_downloads=True` in `BrowserConfig`, this lists the filepaths of saved downloads. |
| **screenshot (`Optional[str]`)** | Screenshot of the page (base64-encoded) if `screenshot=True`. |
| **pdf (`Optional[bytes]`)** | PDF of the page if `pdf=True`. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | For now, `markdown_v2` holds a `MarkdownGenerationResult`. Over time, this will be consolidated into `markdown`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **markdown_v2 (`Optional[MarkdownGenerationResult]`)** | Legacy field for detailed markdown output. This will be replaced by `markdown` soon. |
| **markdown (`Optional[str or MarkdownGenerationResult]`)** | Holds a `MarkdownGenerationResult`. The generator can provide raw markdown, citations, references, and optionally `fit_markdown`. |
| **extracted_content (`Optional[str]`)** | The output of a structured extraction (CSS/LLM-based) stored as JSON string or other text. |
| **metadata (`Optional[dict]`)** | Additional info about the crawl or extracted data. |
| **error_message (`Optional[str]`)** | If `success=False`, contains a short description of what went wrong. |
@@ -90,10 +88,10 @@ print(result.cleaned_html) # Freed of forms, header, footer, data-* attributes
## 3. Markdown Generation
### 3.1 `markdown_v2` (Legacy) vs `markdown`
### 3.1 `markdown`
- **`markdown_v2`**: The current location for detailed markdown output, returning a **`MarkdownGenerationResult`** object.
- **`markdown`**: Eventually, we're merging these fields. For now, you might see `result.markdown_v2` used widely in code examples.
- **`markdown`**: The current location for detailed markdown output, returning a **`MarkdownGenerationResult`** object.
- **`markdown_v2`**: Deprecated since v0.5.
**`MarkdownGenerationResult`** Fields:
@@ -118,7 +116,7 @@ config = CrawlerRunConfig(
)
result = await crawler.arun(url="https://example.com", config=config)
md_res = result.markdown_v2 # or eventually 'result.markdown'
md_res = result.markdown  # a MarkdownGenerationResult
print(md_res.raw_markdown[:500])
print(md_res.markdown_with_citations)
print(md_res.references_markdown)
@@ -224,15 +222,17 @@ Check any field:
if result.success:
print(result.status_code, result.response_headers)
print("Links found:", len(result.links.get("internal", [])))
if result.markdown_v2:
print("Markdown snippet:", result.markdown_v2.raw_markdown[:200])
if result.markdown:
print("Markdown snippet:", result.markdown.raw_markdown[:200])
if result.extracted_content:
print("Structured JSON:", result.extracted_content)
else:
print("Error:", result.error_message)
```
**Remember**: Use `result.markdown_v2` for now. It will eventually become `result.markdown`.
**Deprecation**: Since v0.5, `result.markdown_v2`, `result.fit_html`, and `result.fit_markdown` are deprecated. Use `result.markdown` instead! It holds a `MarkdownGenerationResult`, which includes `fit_html` and `fit_markdown` as its properties.
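For reference, a minimal sketch of the post-v0.5 access pattern (the URL is a placeholder; treating `result.markdown` as a plain string relies on the backward-compatibility behaviour noted in this release):
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        if result.success:
            # result.markdown still behaves like a string for simple use...
            print(str(result.markdown)[:200])
            # ...and exposes the MarkdownGenerationResult fields directly
            print(result.markdown.raw_markdown[:200])
            print(result.markdown.fit_markdown)  # typically empty unless a content filter is configured
            print(result.markdown.fit_html)

if __name__ == "__main__":
    asyncio.run(main())
```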
---

View File

@@ -0,0 +1,436 @@
# Deep Crawling
One of Crawl4AI's most powerful features is its ability to perform **configurable deep crawling** that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.
In this tutorial, you'll learn:
1. How to set up a **Basic Deep Crawler** with BFS strategy
2. Understanding the difference between **streamed and non-streamed** output
3. Implementing **filters and scorers** to target specific content
4. Creating **advanced filtering chains** for sophisticated crawls
5. Using **BestFirstCrawling** for intelligent exploration prioritization
> **Prerequisites**
> - You've completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.
---
## 1. Quick Example
Here's a minimal code snippet that implements a basic deep crawl using the **BFSDeepCrawlStrategy**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
```
**What's happening?**
- `BFSDeepCrawlStrategy(max_depth=2, include_external=False)` instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete
---
## 2. Understanding Deep Crawling Strategy Options
### 2.1 BFSDeepCrawlStrategy (Breadth-First Search)
The **BFSDeepCrawlStrategy** uses a breadth-first approach, exploring all links at one depth before moving deeper:
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
)
```
**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
### 2.2 DFSDeepCrawlStrategy (Depth-First Search)
The **DFSDeepCrawlStrategy** uses a depth-first approach, exploring as far down a branch as possible before backtracking:
```python
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,               # Crawl initial page + 2 levels deep
    include_external=False,    # Stay within the same domain
)
```
**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
### 2.3 BestFirstCrawlingStrategy (⭐️ recommended deep crawl strategy)
For more intelligent crawling, use **BestFirstCrawlingStrategy** with scorers to prioritize the most relevant pages:
```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer
)
```
This crawling approach:
- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
---
## 3. Streaming vs. Non-Streaming Results
Crawl4AI can return results in two modes:
### 3.1 Non-Streaming Mode (Default)
```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)
    for result in results:
        process_result(result)
```
**When to use non-streaming mode:**
- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor
### 3.2 Streaming Mode
```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)
```
**Benefits of streaming mode:**
- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages
---
## 4. Filtering Content with Filter Chains
Filters help you narrow down which pages to crawl. Combine multiple filters using **FilterChain** for powerful targeting.
### 4.1 Basic URL Pattern Filter
```python
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)
```
### 4.2 Combining Multiple Filters
```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)
```
### 4.3 Available Filter Types
Crawl4AI includes several specialized filters:
- **`URLPatternFilter`**: Matches URL patterns using wildcard syntax
- **`DomainFilter`**: Controls which domains to include or exclude
- **`ContentTypeFilter`**: Filters based on HTTP Content-Type
- **`ContentRelevanceFilter`**: Uses similarity to a text query
- **`SEOFilter`**: Evaluates SEO elements (meta tags, headers, etc.)
---
## 5. Using Scorers for Prioritized Crawling
Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.
### 5.1 KeywordRelevanceScorer
```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")
```
**How scorers work:**
- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order
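If you want to post-process the scores yourself, here is a small sketch that reuses the `config` from the snippet above and keeps the ten highest-scoring pages (the cutoff is arbitrary):
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def rank_crawled_pages(config, top_n=10):
    """Collect streamed results, then rank them by the scorer's value."""
    scored_pages = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            # "score" is the value the url_scorer assigned to this page
            scored_pages.append((result.metadata.get("score", 0), result.url))
    # Highest-scoring pages first
    return sorted(scored_pages, reverse=True)[:top_n]

# Usage, with the `config` defined above:
# asyncio.run(rank_crawled_pages(config))
```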
---
## 6. Advanced Filtering Techniques
### 6.1 SEO Filter for Quality Assessment
The **SEOFilter** helps you identify pages with strong SEO characteristics:
```python
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)
```
### 6.2 Content Relevance Filter
The **ContentRelevanceFilter** analyzes the actual content of pages:
```python
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)
```
This filter:
- Measures semantic similarity between the query and the page content
- Uses BM25-based relevance scoring over the page's head-section content
---
## 7. Building a Complete Advanced Crawler
This example combines multiple techniques for a sophisticated crawl:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),
        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),
        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
```
---
## 8. Common Pitfalls & Tips
1. **Set realistic depth limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size.
2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.
3. **Be a good web citizen.** Respect robots.txt (checking is disabled by default).
4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results (see the sketch below).
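A short sketch of tips 3 and 4 in code. The `check_robots_txt` flag is an assumption about how robots.txt support is exposed; verify the parameter name against your installed version:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def polite_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
        check_robots_txt=True,  # assumed flag name: skip URLs disallowed by robots.txt
        stream=True
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            if not result.success:
                # Not every page is reachable: log the error and keep crawling
                print(f"Skipped {result.url}: {result.error_message}")
                continue
            print(f"OK {result.url}")

if __name__ == "__main__":
    asyncio.run(polite_crawl())
```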
---
## 9. Summary & Next Steps
In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Build a complete advanced crawler with combined techniques
With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.

View File

@@ -10,11 +10,10 @@
In **`CrawlerRunConfig`**, you can specify a **`content_filter`** to shape how content is pruned or ranked before final markdown generation. A filter's logic is applied **before** or **during** the HTML→Markdown process, producing:
- **`result.markdown_v2.raw_markdown`** (unfiltered)
- **`result.markdown_v2.fit_markdown`** (filtered or “fit” version)
- **`result.markdown_v2.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)
- **`result.markdown.raw_markdown`** (unfiltered)
- **`result.markdown.fit_markdown`** (filtered or “fit” version)
- **`result.markdown.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)
> **Note**: We're currently storing the result in `markdown_v2`, but eventually we'll unify it as `result.markdown`.
### 1.2 Common Filters
@@ -62,8 +61,8 @@ async def main():
if result.success:
# 'fit_markdown' is your pruned content, focusing on "denser" text
print("Raw Markdown length:", len(result.markdown_v2.raw_markdown))
print("Fit Markdown length:", len(result.markdown_v2.fit_markdown))
print("Raw Markdown length:", len(result.markdown.raw_markdown))
print("Fit Markdown length:", len(result.markdown.fit_markdown))
else:
print("Error:", result.error_message)
@@ -123,7 +122,7 @@ async def main():
)
if result.success:
print("Fit Markdown (BM25 query-based):")
print(result.markdown_v2.fit_markdown)
print(result.markdown.fit_markdown)
else:
print("Error:", result.error_message)
@@ -144,11 +143,11 @@ if __name__ == "__main__":
## 4. Accessing the “Fit” Output
After the crawl, your “fit” content is found in **`result.markdown_v2.fit_markdown`**. In future versions, it will be **`result.markdown.fit_markdown`**. Meanwhile:
After the crawl, your “fit” content is found in **`result.markdown.fit_markdown`**.
```python
fit_md = result.markdown_v2.fit_markdown
fit_html = result.markdown_v2.fit_html
fit_md = result.markdown.fit_markdown
fit_html = result.markdown.fit_html
```
If the content filter is **BM25**, you might see additional logic or references in `fit_markdown` that highlight relevant segments. If it's **Pruning**, the text is typically well-cleaned but not necessarily matched to a query.
@@ -167,7 +166,6 @@ prune_filter = PruningContentFilter(
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
```
### 5.2 BM25
@@ -179,7 +177,6 @@ bm25_filter = BM25ContentFilter(
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
# => result.markdown_v2.fit_markdown
```
---
@@ -203,7 +200,7 @@ Thus, **multi-level** filtering occurs:
1. The crawler's `excluded_tags` are removed from the HTML first.
2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
3. The final “fit” content is generated in `result.markdown_v2.fit_markdown`.
3. The final “fit” content is generated in `result.markdown.fit_markdown`.
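To make the layering concrete, here is a small end-to-end sketch; the import paths and the `threshold` value are assumptions to adjust for your installed version:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

# Level 1: excluded_tags strips whole sections from the HTML first
# Level 2: the content filter prunes/ranks the remaining text blocks
# Level 3: the "fit" output lands in result.markdown.fit_markdown
prune_filter = PruningContentFilter(threshold=0.5)  # threshold is illustrative
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

config = CrawlerRunConfig(
    excluded_tags=["nav", "header", "footer"],
    markdown_generator=md_generator
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success:
            print(result.markdown.fit_markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())
```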
---
@@ -241,7 +238,7 @@ class MyCustomFilter(RelevantContentFilter):
- **PruningContentFilter**: Great if you just want the “meatiest” text without a user query.
- **BM25ContentFilter**: Perfect for query-based extraction or searching.
- Combine with **`excluded_tags`, `exclude_external_links`, `word_count_threshold`** to refine your final “fit” text.
- Fit markdown ends up in **`result.markdown_v2.fit_markdown`**; eventually **`result.markdown.fit_markdown`** in future versions.
- Fit markdown ends up in **`result.markdown.fit_markdown`**.
With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!

View File

@@ -204,7 +204,7 @@ async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
print(result.fit_markdown) # Filtered markdown content
print(result.markdown.fit_markdown) # Filtered markdown content
```
**Key Features:**
@@ -249,14 +249,11 @@ filter = LLMContentFilter(
## 5. Using Fit Markdown
When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:
When a content filter is active, the library produces two forms of markdown inside `result.markdown`:
1. **`raw_markdown`**: The full unfiltered markdown.
2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.
**Note**:
> In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you'll have a raw version and a filtered (“fit”) version if a filter is used.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
@@ -276,7 +273,7 @@ async def main():
print("Raw markdown:\n", result.markdown)
# If a filter is used, we also have .fit_markdown:
md_object = result.markdown_v2 # or your equivalent
md_object = result.markdown # or your equivalent
print("Filtered markdown:\n", md_object.fit_markdown)
else:
print("Crawl failed:", result.error_message)
@@ -300,7 +297,7 @@ If your library stores detailed markdown output in an object like `MarkdownGener
**Example**:
```python
md_obj = result.markdown_v2 # your library's naming may vary
md_obj = result.markdown # your library's naming may vary
print("RAW:\n", md_obj.raw_markdown)
print("CITED:\n", md_obj.markdown_with_citations)
print("REFERENCES:\n", md_obj.references_markdown)

View File

@@ -296,7 +296,7 @@ async def quick_parallel_example():
# Stream results as they complete
async for result in await crawler.arun_many(urls, config=run_conf):
if result.success:
print(f"[OK] {result.url}, length: {len(result.markdown_v2.raw_markdown)}")
print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
else:
print(f"[ERROR] {result.url} => {result.error_message}")
@@ -305,7 +305,7 @@ async def quick_parallel_example():
results = await crawler.arun_many(urls, config=run_conf)
for res in results:
if res.success:
print(f"[OK] {res.url}, length: {len(res.markdown_v2.raw_markdown)}")
print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
else:
print(f"[ERROR] {res.url} => {res.error_message}")

View File

@@ -39,8 +39,8 @@ result = await crawler.arun(
# Different content formats
print(result.html) # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown) # Markdown version
print(result.fit_markdown) # Most relevant content in markdown
print(result.markdown.raw_markdown) # Raw markdown from cleaned html
print(result.markdown.fit_markdown) # Most relevant content in markdown
# Check success status
print(result.success) # True if crawl succeeded