Release prep (#749)

* fix: Update export of URLPatternFilter

* chore: Add dependancy for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependancy list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependancies

* feat:Make PyPDF2 a conditional dependancy

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
This commit is contained in:
Aravind
2025-02-28 17:23:35 +05:30
committed by GitHub
parent 3a87b4e43b
commit a9e24307cc
38 changed files with 2040 additions and 326 deletions

View File

@@ -200,7 +200,7 @@ Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
- `markdown_v2`: Deprecated. Instead just use regular `markdown`
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.

View File

@@ -16,9 +16,6 @@ class CrawlResult(BaseModel):
screenshot: Optional[str] = None
pdf : Optional[bytes] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
fit_markdown: Optional[str] = None
fit_html: Optional[str] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
@@ -116,8 +113,8 @@ print(result.cleaned_html[:500]) # Show a snippet
**When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
**Usage**:
```python
if result.fit_html:
print("High-value HTML content:", result.fit_html[:300])
if result.markdown.fit_html:
print("High-value HTML content:", result.markdown.fit_html[:300])
```
---
@@ -132,8 +129,6 @@ Crawl4AI can convert HTML→Markdown, optionally including:
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*
**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
@@ -144,8 +139,8 @@ Crawl4AI can convert HTML→Markdown, optionally including:
**Usage**:
```python
if result.markdown_v2:
md_res = result.markdown_v2
if result.markdown:
md_res = result.markdown
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
@@ -153,26 +148,15 @@ if result.markdown_v2:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.
### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: Holds the `MarkdownGenerationResult`.
**Usage**:
```python
# Soon, you might see:
if isinstance(result.markdown, MarkdownGenerationResult):
print(result.markdown.raw_markdown[:200])
else:
print(result.markdown)
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
```
### 3.4 **`fit_markdown`** *(Optional[str])*
**What**: A direct reference to the final filtered markdown (legacy approach).
**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.
**Usage**:
```python
print(result.fit_markdown) # Legacy field, prefer result.markdown_v2.fit_markdown
```
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
**Important**: “Fit” content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
---
@@ -304,13 +288,11 @@ async def handle_result(result: CrawlResult):
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown_v2:
print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
if result.markdown_v2.fit_markdown:
print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
else:
print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
if result.markdown:
print("Raw Markdown:", result.markdown.raw_markdown[:300])
print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown:
print("Fit Markdown:", result.markdown.fit_markdown[:200])
# Media & Links
if "images" in result.media:
@@ -333,12 +315,12 @@ async def handle_result(result: CrawlResult):
## 8. Key Points & Future
1. **`markdown_v2` vs `markdown`**
- Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
- In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now!
- `fit_markdown` and `fit_html` - Deprecated in v0.5. They can now be accessed via `MarkdownGenerationResult` in `result.markdown`. eg: `result.markdown.fit_markdown` and `result.markdown.fit_html`
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- **`fit_markdown`** and **`fit_html`** appear in MarkdownGenerationResult, only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**