Release prep (#749)
* fix: Update export of URLPatternFilter
* chore: Add dependency for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* Docs: update the example for memory dispatcher
* docs: updated example for crawl strategies
* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.
* chore: removed cchardet from dependency list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependencies
* feat: Make PyPDF2 a conditional dependency
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor:
  1. Deprecate markdown_v2
  2. Make markdown backward compatible to behave as a string when needed.
  3. Fix LlmConfig usage in cli
  4. Deprecate markdown_v2 in cli
  5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
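
One way to picture the refactor item "Make markdown backward compatible to behave as a string when needed" is a `str` subclass that carries the structured fields alongside the raw text. The sketch below is only an illustration under that assumption, not the implementation in this commit: `MarkdownResultSketch` and `StringCompatibleMarkdown` are hypothetical names, and only the field names (`raw_markdown`, `markdown_with_citations`, `references_markdown`, `fit_markdown`, `fit_html`) are taken from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkdownResultSketch:
    """Hypothetical stand-in for MarkdownGenerationResult (field names from the diff)."""
    raw_markdown: str
    markdown_with_citations: str = ""
    references_markdown: str = ""
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None


class StringCompatibleMarkdown(str):
    """Hypothetical wrapper: a str whose value is raw_markdown, plus structured fields."""

    def __new__(cls, result: MarkdownResultSketch):
        # The string value is the raw markdown, so slicing and printing keep working.
        obj = super().__new__(cls, result.raw_markdown)
        obj._result = result
        return obj

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails; forward public names
        # to the structured result and refuse private ones to avoid recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._result, name)


md = StringCompatibleMarkdown(
    MarkdownResultSketch(raw_markdown="# Title\nBody text", fit_markdown="Body text")
)
print(md[:7])           # legacy string-style access
print(md.fit_markdown)  # structured access on the same object
```

With a wrapper of this shape, older call sites that slice or print `result.markdown` and newer ones that read `result.markdown.fit_markdown` can both work against one object.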
@@ -200,7 +200,7 @@ Each `arun()` returns a **`CrawlResult`** containing:
 - `url`: Final URL (if redirected).
 - `html`: Original HTML.
 - `cleaned_html`: Sanitized HTML.
-- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
+- `markdown_v2`: Deprecated. Instead just use regular `markdown`
 - `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
 - `screenshot`, `pdf`: If screenshots/PDF requested.
 - `media`, `links`: Information about discovered images/links.
@@ -16,9 +16,6 @@ class CrawlResult(BaseModel):
     screenshot: Optional[str] = None
     pdf : Optional[bytes] = None
     markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    markdown_v2: Optional[MarkdownGenerationResult] = None
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
     extracted_content: Optional[str] = None
     metadata: Optional[dict] = None
     error_message: Optional[str] = None
@@ -116,8 +113,8 @@ print(result.cleaned_html[:500]) # Show a snippet
 **When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.
 **Usage**:
 ```python
-if result.fit_html:
-    print("High-value HTML content:", result.fit_html[:300])
+if result.markdown.fit_html:
+    print("High-value HTML content:", result.markdown.fit_html[:300])
 ```
 
 ---
@@ -132,8 +129,6 @@ Crawl4AI can convert HTML→Markdown, optionally including:
 - **Links as citations** (with a references section)
 - **Fit** markdown if a **content filter** is used (like Pruning or BM25)
 
-### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*
-**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.
 
 **`MarkdownGenerationResult`** includes:
 - **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
@@ -144,8 +139,8 @@ Crawl4AI can convert HTML→Markdown, optionally including:
 
 **Usage**:
 ```python
-if result.markdown_v2:
-    md_res = result.markdown_v2
+if result.markdown:
+    md_res = result.markdown
     print("Raw MD:", md_res.raw_markdown[:300])
     print("Citations MD:", md_res.markdown_with_citations[:300])
     print("References:", md_res.references_markdown)
@@ -153,26 +148,15 @@ if result.markdown_v2:
         print("Pruned text:", md_res.fit_markdown[:300])
 ```
 
-### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
-**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.
+### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
+**What**: Holds the `MarkdownGenerationResult`.
 **Usage**:
 ```python
-# Soon, you might see:
-if isinstance(result.markdown, MarkdownGenerationResult):
-    print(result.markdown.raw_markdown[:200])
-else:
-    print(result.markdown)
+print(result.markdown.raw_markdown[:200])
+print(result.markdown.fit_markdown)
+print(result.markdown.fit_html)
 ```
 
-### 3.4 **`fit_markdown`** *(Optional[str])*
-**What**: A direct reference to the final filtered markdown (legacy approach).
-**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.
-**Usage**:
-```python
-print(result.fit_markdown) # Legacy field, prefer result.markdown_v2.fit_markdown
-```
-
-**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
+**Important**: “Fit” content (in `fit_markdown`/`fit_html`) exists in result.markdown, only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
 
 ---
 
@@ -304,13 +288,11 @@ async def handle_result(result: CrawlResult):
     print("Cleaned HTML size:", len(result.cleaned_html or ""))
 
     # Markdown output
-    if result.markdown_v2:
-        print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
-        print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
-        if result.markdown_v2.fit_markdown:
-            print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
-    else:
-        print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
+    if result.markdown:
+        print("Raw Markdown:", result.markdown.raw_markdown[:300])
+        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
+        if result.markdown.fit_markdown:
+            print("Fit Markdown:", result.markdown.fit_markdown[:200])
 
     # Media & Links
     if "images" in result.media:
@@ -333,12 +315,12 @@ async def handle_result(result: CrawlResult):
 
 ## 8. Key Points & Future
 
-1. **`markdown_v2` vs `markdown`**
-   - Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.
-   - In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
+1. **Deprecated legacy properties of CrawlResult**
+   - `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now!
+   - `fit_markdown` and `fit_html` - Deprecated in v0.5. They can now be accessed via `MarkdownGenerationResult` in `result.markdown`. eg: `result.markdown.fit_markdown` and `result.markdown.fit_html`
 
 2. **Fit Content**
-   - **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
+   - **`fit_markdown`** and **`fit_html`** appear in MarkdownGenerationResult, only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
    - If no filter is used, they remain `None`.
 
 3. **References & Citations**
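
The diff above moves `fit_markdown` and `fit_html` from `CrawlResult` onto `result.markdown`, which is still declared `Optional` and whose fit fields stay `None` when no content filter ran. A minimal sketch of a defensive accessor for calling code follows. It assumes only the attribute names shown in the diff; `best_markdown` is a hypothetical helper that is not part of Crawl4AI, and the `SimpleNamespace` objects are stand-ins for real crawl results.

```python
from types import SimpleNamespace


def best_markdown(result) -> str:
    """Hypothetical helper: prefer fit_markdown, fall back to raw_markdown, else ''."""
    md = getattr(result, "markdown", None)
    if md is None:
        # No markdown was produced at all (for example, a failed crawl).
        return ""
    fit = getattr(md, "fit_markdown", None)
    if fit:
        # A content filter ran and produced "fit" output.
        return fit
    # No filter was used; fall back to the raw conversion (or the plain string form).
    return getattr(md, "raw_markdown", str(md))


# Stand-in objects that mimic only the attributes named in the diff.
filtered = SimpleNamespace(
    markdown=SimpleNamespace(
        raw_markdown="# Full page", fit_markdown="Core text", fit_html="<p>Core text</p>"
    )
)
unfiltered = SimpleNamespace(
    markdown=SimpleNamespace(raw_markdown="# Full page", fit_markdown=None, fit_html=None)
)
failed = SimpleNamespace(markdown=None)

for label, res in [("filtered", filtered), ("unfiltered", unfiltered), ("failed", failed)]:
    print(label, "->", repr(best_markdown(res)))
```

Checking `markdown` for `None` first avoids the `AttributeError` that an unguarded `result.markdown.fit_html` would raise when no markdown was generated.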