Release prep (#749)

* fix: Update export of URLPatternFilter
* chore: Add dependency for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* Docs: update the example for memory dispatcher
* docs: updated example for crawl strategies
* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.
* chore: removed cchardet from dependency list, since unclecode is planning to remove it
* docs: updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1
* chore: update and test new dependencies
* feat: Make PyPDF2 a conditional dependency
* updated tutorial and release note for v0.5
* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename
* refactor:
  1. Deprecate markdown_v2
  2. Make markdown backward compatible to behave as a string when needed.
  3. Fix LlmConfig usage in cli
  4. Deprecate markdown_v2 in cli
  5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown
* fix: remove deprecated markdown_v2 from docker
* Refactor: remove deprecated fit_markdown and fit_html from result
* refactor: fix cache retrieval for markdown as a string
* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
2025-02-28 17:23:35 +05:30
parent 3a87b4e43b
commit a9e24307cc
38 changed files with 2040 additions and 326 deletions
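
The access pattern this commit moves the docs toward can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: `MarkdownResultSketch`, `CrawlResultSketch`, and `fit_or_raw` are hypothetical stand-ins, with field names borrowed from the documentation changed in this diff. It assumes that the deprecated attributes raise an error on access (as the commit message suggests) and that the markdown object falls back to its raw text when used as a string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkdownResultSketch:
    """Hypothetical stand-in for MarkdownGenerationResult."""
    raw_markdown: str
    markdown_with_citations: str = ""
    references_markdown: str = ""
    fit_markdown: Optional[str] = None  # stays None unless a content filter ran
    fit_html: Optional[str] = None

    def __str__(self) -> str:
        # Assumption: string use falls back to the raw markdown text.
        return self.raw_markdown


@dataclass
class CrawlResultSketch:
    """Hypothetical stand-in for CrawlResult, reduced to the markdown field."""
    markdown: Optional[MarkdownResultSketch] = None

    @property
    def markdown_v2(self):
        # Assumption: deprecated access fails loudly instead of returning data.
        raise AttributeError("'markdown_v2' is deprecated; use 'markdown' instead")

    @property
    def fit_markdown(self):
        raise AttributeError(
            "'fit_markdown' is deprecated; use 'markdown.fit_markdown' instead"
        )


def fit_or_raw(result: CrawlResultSketch) -> str:
    """Return the filtered markdown when present, else the raw markdown."""
    md = result.markdown
    if md is None:
        return ""
    return md.fit_markdown or md.raw_markdown


if __name__ == "__main__":
    filtered = CrawlResultSketch(
        markdown=MarkdownResultSketch(raw_markdown="# Title\nbody", fit_markdown="body")
    )
    print(fit_or_raw(filtered))    # filtered text, since fit_markdown is set
    print(str(filtered.markdown))  # raw text via the string fallback
```

Calling code that previously read `result.markdown_v2` or `result.fit_markdown` would, under these assumptions, switch to `result.markdown` and `result.markdown.fit_markdown`, guarding against `None` when no content filter was configured.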
--- a/docs/md_v2/api/async-webcrawler.md
+++ b/docs/md_v2/api/async-webcrawler.md
@@ -200,7 +200,7 @@ Each `arun()` returns a **`CrawlResult`** containing:
 - `url`: Final URL (if redirected).
 - `html`: Original HTML.
 - `cleaned_html`: Sanitized HTML.
- `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
+- `markdown_v2`: Deprecated. Use `markdown` instead.
 - `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
 - `screenshot`, `pdf`: If screenshots/PDF requested.
 - `media`, `links`: Information about discovered images/links.
--- a/docs/md_v2/api/crawl-result.md
+++ b/docs/md_v2/api/crawl-result.md
@@ -16,9 +16,6 @@ class CrawlResult(BaseModel):
    screenshot: Optional[str] = None
    pdf : Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    markdown_v2: Optional[MarkdownGenerationResult] = None
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
@@ -116,8 +113,8 @@ print(result.cleaned_html[:500])  # Show a snippet
 **When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.  
 **Usage**:
 ```python
-if result.fit_html:
-    print("High-value HTML content:", result.fit_html[:300])
+if result.markdown.fit_html:
+    print("High-value HTML content:", result.markdown.fit_html[:300])
 ```

 ---
@@ -132,8 +129,6 @@ Crawl4AI can convert HTML→Markdown, optionally including:
 - **Links as citations** (with a references section)  
 - **Fit** markdown if a **content filter** is used (like Pruning or BM25)

-### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*  
-**What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.  

 **`MarkdownGenerationResult`** includes:
 - **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.  
@@ -144,8 +139,8 @@ Crawl4AI can convert HTML→Markdown, optionally including:

 **Usage**:
 ```python
-if result.markdown_v2:
-    md_res = result.markdown_v2
+if result.markdown:
+    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
@@ -153,26 +148,15 @@ if result.markdown_v2:
        print("Pruned text:", md_res.fit_markdown[:300])
 ```

-### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*  
-**What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.  
+### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*  
+**What**: Holds the `MarkdownGenerationResult`.  
 **Usage**:
 ```python
-# Soon, you might see:
-if isinstance(result.markdown, MarkdownGenerationResult):
-    print(result.markdown.raw_markdown[:200])
-else:
-    print(result.markdown)
+print(result.markdown.raw_markdown[:200])
+print(result.markdown.fit_markdown)
+print(result.markdown.fit_html)
 ```
-
-### 3.4 **`fit_markdown`** *(Optional[str])*  
-**What**: A direct reference to the final filtered markdown (legacy approach).  
-**When**: This is set if a filter or content strategy explicitly writes there. Usually overshadowed by `markdown_v2.fit_markdown`.  
-**Usage**:
-```python
-print(result.fit_markdown)  # Legacy field, prefer result.markdown_v2.fit_markdown
-```
-
-**Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
+**Important**: “Fit” content (in `fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.

 ---

@@ -304,13 +288,11 @@ async def handle_result(result: CrawlResult):
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
-    if result.markdown_v2:
-        print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
-        print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
-        if result.markdown_v2.fit_markdown:
-            print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
-    else:
-        print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
+    if result.markdown:
+        print("Raw Markdown:", result.markdown.raw_markdown[:300])
+        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
+        if result.markdown.fit_markdown:
+            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
@@ -333,12 +315,12 @@ async def handle_result(result: CrawlResult):

 ## 8. Key Points & Future

-1. **`markdown_v2` vs `markdown`**  
-   - Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.  
-   - In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
+1. **Deprecated legacy properties of CrawlResult**  
+   - `markdown_v2`: Deprecated in v0.5. Use `markdown` instead; it now holds the `MarkdownGenerationResult`.
+   - `fit_markdown` and `fit_html`: Deprecated in v0.5. Access them through the `MarkdownGenerationResult` in `result.markdown`, e.g. `result.markdown.fit_markdown` and `result.markdown.fit_html`.

 2. **Fit Content**  
-   - **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.  
+   - **`fit_markdown`** and **`fit_html`** appear in `MarkdownGenerationResult` only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.  
   - If no filter is used, they remain `None`.

 3. **References & Citations**