Update the Tutorial section for new document version

This commit is contained in:
UncleCode
2024-12-31 17:27:31 +08:00
parent fb33a24891
commit 0ec593fa90
85 changed files with 3412 additions and 9152 deletions

View File

@@ -0,0 +1,329 @@
# Advanced Features (Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
1. **Proxy Usage**
2. **Capturing PDFs & Screenshots**
3. **Handling SSL Certificates**
4. **Custom Headers**
5. **Session Persistence & Local Storage**
> **Prerequisites**
> - You have a basic grasp of [AsyncWebCrawler Basics](./async-webcrawler-basics.md)
> - You know how to run or configure your Python environment with Playwright installed
---
## 1. Proxy Usage
If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser_cfg = BrowserConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "myuser",
"password": "mypass",
},
headless=True
)
crawler_cfg = CrawlerRunConfig(
verbose=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://www.whatismyip.com/",
config=crawler_cfg
)
if result.success:
print("[OK] Page fetched via proxy.")
print("Page HTML snippet:", result.html[:200])
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- **`proxy_config`** expects a dict with `server` and optional auth credentials.
- Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.
- If your proxy doesn't need auth, omit `username`/`password`.
---
## 2. Capturing PDFs & Screenshots
Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
cache_mode=CacheMode.BYPASS,
pdf=True,
screenshot=True
)
if result.success:
# Save screenshot
if result.screenshot:
with open("wikipedia_screenshot.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Save PDF
if result.pdf:
with open("wikipedia_page.pdf", "wb") as f:
f.write(b64decode(result.pdf))
print("[OK] PDF & screenshot captured.")
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Why PDF + Screenshot?**
- Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.
- Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.
**Relevant Parameters**
- **`pdf=True`**: Exports the current page as a PDF (base64-encoded in `result.pdf`).
- **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).
- **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
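For example, here's a minimal sketch (assuming `scan_full_page` is accepted by `CrawlerRunConfig`, as referenced above) that moves the capture flags into a run config instead of passing them directly to `arun()`:
```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Hedged sketch: capture a PDF and screenshot after scrolling the full page
capture_cfg = CrawlerRunConfig(
    pdf=True,
    screenshot=True,
    scan_full_page=True,      # scroll through the page before capturing (see note above)
    cache_mode=CacheMode.BYPASS,
)
# Pass this to crawler.arun(url=..., config=capture_cfg), exactly as in the example above.
```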
---
## 3. Handling SSL Certificates
If you need to verify or export a site's SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
```python
import asyncio, os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
tmp_dir = os.path.join(os.getcwd(), "tmp")
os.makedirs(tmp_dir, exist_ok=True)
config = CrawlerRunConfig(
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.ssl_certificate:
cert = result.ssl_certificate
print("\nCertificate Information:")
print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
print(f"Valid until: {cert.valid_until}")
print(f"Fingerprint: {cert.fingerprint}")
# Export in multiple formats:
cert.to_json(os.path.join(tmp_dir, "certificate.json"))
cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
cert.to_der(os.path.join(tmp_dir, "certificate.der"))
print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
else:
print("[ERROR] No certificate or crawl failed.")
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- **`fetch_ssl_certificate=True`** triggers certificate retrieval.
- `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
---
## 4. Custom Headers
Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
# Option 1: Set headers at the crawler strategy level
crawler1 = AsyncWebCrawler(
# The underlying strategy can accept headers in its constructor
crawler_strategy=None # We'll override below for clarity
)
crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
crawler1.crawler_strategy.set_custom_headers({
"Accept-Language": "fr-FR,fr;q=0.9"
})
result1 = await crawler1.arun("https://www.example.com")
print("Example 1 result success:", result1.success)
# Option 2: Pass headers directly to `arun()`
crawler2 = AsyncWebCrawler()
result2 = await crawler2.arun(
url="https://www.example.com",
headers={"Accept-Language": "es-ES,es;q=0.9"}
)
print("Example 2 result success:", result2.success)
if __name__ == "__main__":
asyncio.run(main())
```
**Notes**
- Some sites may react differently to certain headers (e.g., `Accept-Language`).
- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-anti-bot.md) or use `UserAgentGenerator`.
---
## 5. Session Persistence & Local Storage
Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
### 5.1 `storage_state`
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
storage_dict = {
"cookies": [
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/",
"expires": 1699999999.0,
"httpOnly": False,
"secure": False,
"sameSite": "None"
}
],
"origins": [
{
"origin": "https://example.com",
"localStorage": [
{"name": "token", "value": "my_auth_token"}
]
}
]
}
# Provide the storage state as a dictionary to start "already logged in"
async with AsyncWebCrawler(
headless=True,
storage_state=storage_dict
) as crawler:
result = await crawler.arun("https://example.com/protected")
if result.success:
print("Protected page content length:", len(result.html))
else:
print("Failed to crawl protected page")
if __name__ == "__main__":
asyncio.run(main())
```
### 5.2 Exporting & Reusing State
You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
**See**: [Detailed session management tutorial](./hooks-custom.md#using-storage_state) or [Explanations → Browser Context & Managed Browser](../../explanations/browser-management.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
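A minimal sketch of that round trip, assuming the `after_goto` hook receives the Playwright `context` as described in the [Hooks tutorial](./hooks-custom.md) (hook names and signatures follow that table; the URLs are placeholders):
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def save_state_after_goto(page, context, **kwargs):
    # Export cookies + localStorage once the (logged-in) page has loaded
    await context.storage_state(path="my_storage.json")

async def main():
    # Run 1: perform the sign-in (e.g., via hooks or js_code), then export the context
    async with AsyncWebCrawler(headless=True) as crawler:
        crawler.crawler_strategy.set_hook("after_goto", save_state_after_goto)
        await crawler.arun("https://example.com/login")

    # Run 2: reuse the saved state to start "already logged in"
    async with AsyncWebCrawler(headless=True, storage_state="my_storage.json") as crawler:
        result = await crawler.arun("https://example.com/protected")
        print("Protected page fetched:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```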
---
## Putting It All Together
Here's a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you'd tailor each setting to your project's needs.
```python
import os, asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# 1. Browser config with proxy + headless
browser_cfg = BrowserConfig(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "myuser",
"password": "mypass",
},
headless=True,
)
# 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
crawler_cfg = CrawlerRunConfig(
pdf=True,
screenshot=True,
fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS,
headers={"Accept-Language": "en-US,en;q=0.8"},
storage_state="my_storage.json", # Reuse session from a previous sign-in
verbose=True,
)
# 3. Crawl
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://secure.example.com/protected", config=crawler_cfg)
if result.success:
print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
# Save PDF & screenshot
if result.pdf:
with open("result.pdf", "wb") as f:
f.write(b64decode(result.pdf))
if result.screenshot:
with open("result.png", "wb") as f:
f.write(b64decode(result.screenshot))
# Check SSL cert
if result.ssl_certificate:
print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
---
## Conclusion & Next Steps
You've now explored several **advanced** features:
- **Proxy Usage**
- **PDF & Screenshot** capturing for large or critical pages
- **SSL Certificate** retrieval & exporting
- **Custom Headers** for language or specialized requests
- **Session Persistence** via storage state
**Where to go next**:
- **[Hooks & Custom Code](./hooks-custom.md)**: For multi-step interactions (clicking “Load More,” performing logins, etc.)
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: If you need more sophisticated user simulation or stealth.
- **[Reference → BrowserConfig & CrawlerRunConfig](../../reference/configuration.md)**: Detailed param descriptions for everything you've seen here and more.
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
**Last Updated**: 2024-XX-XX

View File

@@ -0,0 +1,218 @@
# AsyncWebCrawler Basics
In this tutorial, you'll learn how to:
1. Create and configure an `AsyncWebCrawler` instance
2. Understand the `CrawlResult` object returned by `arun()`
3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl
> **Prerequisites**
> - You've already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).
> - You have **Crawl4AI** installed and configured with Playwright.
---
## 1. What is `AsyncWebCrawler`?
`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it's your high-level interface for collecting page data.
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result)
```
---
## 2. Creating a Basic `AsyncWebCrawler` Instance
Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig
async def main():
# 1. Set up configuration objects (optional if you want defaults)
browser_config = BrowserConfig(
browser_type="chromium",
headless=True,
verbose=True
)
crawler_config = CrawlerRunConfig(
page_timeout=30000, # 30 seconds
wait_for_images=True,
verbose=True
)
# 2. Initialize AsyncWebCrawler with your chosen browser config
async with AsyncWebCrawler(config=browser_config) as crawler:
# 3. Run a single crawl
url_to_crawl = "https://example.com"
result = await crawler.arun(url=url_to_crawl, config=crawler_config)
# 4. Inspect the result
if result.success:
print(f"Successfully crawled: {result.url}")
print(f"HTML length: {len(result.html)}")
print(f"Markdown snippet: {result.markdown[:200]}...")
else:
print(f"Failed to crawl {result.url}. Error: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
### Key Points
1. **`BrowserConfig`** is optional, but it's the place to specify browser-related settings (e.g., `headless`, `browser_type`).
2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
3. **`arun()`** is the main method to crawl a single URL. We'll see how `arun_many()` works in later tutorials.
---
## 3. Understanding `CrawlResult`
When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
screenshot: Optional[str] = None # base64-encoded screenshot if requested
pdf: Optional[bytes] = None # binary PDF data if requested
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
markdown_v2: Optional[MarkdownGenerationResult] = None
error_message: Optional[str] = None
# ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
```
### Commonly Used Fields
- **`success`**: `True` if the crawl succeeded, `False` otherwise.
- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).
- **`markdown` / `markdown_v2`**: Contains the automatically generated Markdown representation of the page.
- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.
- **`links`**: A dictionary with lists of “internal” and “external” link objects.
- **`error_message`**: If `success` is `False`, this often contains a description of the error.
**Example**:
```python
if result.success:
print("Page Title or snippet of HTML:", result.html[:200])
if result.markdown:
print("Markdown snippet:", result.markdown[:200])
print("Links found:", len(result.links.get("internal", [])), "internal links")
else:
print("Error crawling:", result.error_message)
```
---
## 4. Relevant Basic Parameters
Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We'll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.
### 4.1 `BrowserConfig` Essentials
| Parameter | Description | Default |
|--------------------|-----------------------------------------------------------|----------------|
| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"` | `"chromium"` |
| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
| `verbose` | Print extra logs for debugging. | `True` |
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
### 4.2 `CrawlerRunConfig` Essentials
| Parameter | Description | Default |
|-----------------------|--------------------------------------------------------------|--------------------|
| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
| `css_selector` | Target only certain elements for extraction. | `None` |
| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.) | `None` |
| `verbose` | Print logs for debugging. | `True` |
> **Tip**: Don't worry if you see lots of parameters. You'll learn them gradually in later tutorials.
---
## 5. Putting It All Together
Here's a slightly more in-depth example that shows off a few key config parameters at once:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai import BrowserConfig, CrawlerRunConfig
async def main():
browser_cfg = BrowserConfig(
browser_type="chromium",
headless=True,
java_script_enabled=True,
verbose=False
)
crawler_cfg = CrawlerRunConfig(
page_timeout=30000, # wait up to 30 seconds
wait_for_images=True,
css_selector=".article-body", # only extract content under this CSS selector
verbose=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun("https://news.example.com", config=crawler_cfg)
if result.success:
print("[OK] Crawled:", result.url)
print("HTML length:", len(result.html))
print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Observations**:
- `css_selector=".article-body"` ensures we only focus on the main content region.
- `page_timeout=30000` helps if the site is slow.
- We turned off `verbose` logs for the browser but kept them on for the crawler config.
---
## 6. Next Steps
- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).
---
## Summary
You now know the basics of **AsyncWebCrawler**:
- How to create it with optional browser/crawler configs
- How `arun()` works for single-page crawls
- Where to find your crawled data in `CrawlResult`
- A handful of frequently used configuration parameters
From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let's move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.
---
**Last updated**: 2024-XX-XX
Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.

View File

@@ -0,0 +1,271 @@
# Deploying with Docker (Quickstart)
> **⚠️ WARNING: Experimental & Legacy**
> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.
Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.
---
## 1. Installation & Environment Setup (Outside Docker)
Before we jump into Docker usage, here's a quick reminder of how to install Crawl4AI locally (legacy doc). For **non-Docker** deployments or local dev:
```bash
# 1. Install the package
pip install crawl4ai
crawl4ai-setup
# 2. Install playwright dependencies (all browsers or specific ones)
playwright install --with-deps
# or
playwright install --with-deps chromium
# or
playwright install --with-deps chrome
```
**Testing** your installation:
```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
```
---
## 2. Docker Overview
This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:
1. **POST** a request (e.g., URLs, extraction config)
2. **Retrieve** your results from a task-based endpoint
> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.
---
## 3. Pulling and Running the Image
### Basic Run
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
### Using an API Token
```bash
docker run -p 11235:11235 \
-e CRAWL4AI_API_TOKEN=your_secret_token \
unclecode/crawl4ai:basic
```
If **`CRAWL4AI_API_TOKEN`** is set, you must include `Authorization: Bearer <token>` in your requests. Otherwise, the service is open to anyone.
---
## 4. Docker Compose for Multi-Container Workflows
You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:
```yaml
version: '3.8'
services:
crawl4ai:
image: unclecode/crawl4ai:basic
ports:
- "11235:11235"
environment:
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
# Additional env variables as needed
volumes:
- /dev/shm:/dev/shm
```
To run:
```bash
docker-compose up -d
```
And to stop:
```bash
docker-compose down
```
**Troubleshooting**:
- **Check logs**: `docker-compose logs -f crawl4ai`
- **Remove orphan containers**: `docker-compose down --remove-orphans`
- **Remove networks**: `docker network rm <network_name>`
---
## 5. Making Requests to the Container
**Base URL**: `http://localhost:11235`
### Example: Basic Crawl
```python
import requests
task_request = {
"urls": "https://example.com",
"priority": 10
}
response = requests.post("http://localhost:11235/crawl", json=task_request)
task_id = response.json()["task_id"]
# Poll for status
status_url = f"http://localhost:11235/task/{task_id}"
status = requests.get(status_url).json()
print(status)
```
If you used an API token, do:
```python
headers = {"Authorization": "Bearer your_secret_token"}
response = requests.post(
"http://localhost:11235/crawl",
headers=headers,
json=task_request
)
```
---
## 6. Docker + New Crawler Config Approach
### Using `BrowserConfig` & `CrawlerRunConfig` in Requests
The Docker-based solution can accept **crawler configurations** in the request JSON (legacy doc might show direct parameters, but we want to embed them in `crawler_params` or `extra` to align with the new approach). For example:
```python
import requests
request_data = {
"urls": "https://www.nbcnews.com/business",
"crawler_params": {
"headless": True,
"browser_type": "chromium",
"verbose": True,
"page_timeout": 30000,
# ... any other BrowserConfig-like fields
},
"extra": {
"word_count_threshold": 50,
"bypass_cache": True
}
}
response = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = response.json()["task_id"]
```
This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.
---
## 7. Example: JSON Extraction in Docker
```python
import requests
import json
# Define a schema for CSS extraction
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text"
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text"
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text"
}
]
}
request_data = {
"urls": "https://www.coinbase.com/explore",
"extraction_config": {
"type": "json_css",
"params": {"schema": schema}
},
"crawler_params": {
"headless": True,
"verbose": True
}
}
resp = requests.post("http://localhost:11235/crawl", json=request_data)
task_id = resp.json()["task_id"]
# Poll for status
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
if status["status"] == "completed":
extracted_content = status["result"]["extracted_content"]
data = json.loads(extracted_content)
print("Extracted:", len(data), "entries")
else:
print("Task still in progress or failed.")
```
---
## 8. Why This Docker Is Temporary
**We are building a new, stable approach**:
- The current Docker container is **experimental** and might break with future releases.
- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.
- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.
**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.
---
## 9. Known Limitations & Next Steps
1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.
3. **LLM Integrations**: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates.
### Next Steps
- **Watch the Repository**: For announcements on the new Docker architecture.
- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.
- **Contribute**: If you have ideas or improvements, open a PR or discussion.
- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.
---
## 10. Summary
**Deploying with Docker** can simplify running Crawl4AI as a service. However:
- **This Docker** approach is **legacy** and subject to removal/overhaul.
- For production, please weigh the risks carefully.
- Detailed “new Docker approach” is coming in **2025**.
We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!

View File

@@ -0,0 +1,265 @@
# Getting Started with Crawl4AI
Welcome to **Crawl4AI**, an open-source, LLM-friendly Web Crawler & Scraper. In this tutorial, you'll:
1. **Install** Crawl4AI (both via pip and Docker, with notes on platform challenges).
2. Run your **first crawl** using minimal configuration.
3. Generate **Markdown** output (and learn how it's influenced by content filters).
4. Experiment with a simple **CSS-based extraction** strategy.
5. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
---
## 1. Introduction
Crawl4AI provides:
- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports additional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
By the end of this guide, you'll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
---
## 2. Installation
### 2.1 Python + Playwright
#### Basic Pip Installation
```bash
pip install crawl4ai
crawl4ai-setup
playwright install --with-deps
```
- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
We cover advanced installation and Docker in the [Installation](#installation) section.
---
## 3. Your First Crawl
Here's a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:300]) # Print first 300 chars
if __name__ == "__main__":
asyncio.run(main())
```
**What's happening?**
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.
You now have a simple, working crawl!
---
## 4. Basic Configuration (Light Introduction)
Crawl4AI's crawler can be heavily customized using two main classes:
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
Below is an example with minimal usage:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
browser_conf = BrowserConfig(headless=True) # or False to see the browser
run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler(config=browser_conf) as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_conf
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
We'll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
---
## 5. Generating Markdown Output
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
- **`result.markdown`**:
The direct HTML-to-Markdown conversion.
- **`result.markdown.fit_markdown`**:
The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
md_generator = DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)
config = CrawlerRunConfig(markdown_generator=md_generator)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://news.ycombinator.com", config=config)
print("Raw Markdown length:", len(result.markdown.raw_markdown))
print("Fit Markdown length:", len(result.markdown.fit_markdown))
```
**Note**: If you do **not** specify a content filter or markdown generator, you'll typically see only the raw Markdown. We'll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
---
## 6. Simple Data Extraction (CSS-based)
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main():
schema = {
"name": "Example Items",
"baseSelector": "div.item",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/items",
config=CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema)
)
)
# The JSON output is stored in 'extracted_content'
data = json.loads(result.extracted_content)
print(data)
if __name__ == "__main__":
asyncio.run(main())
```
**Why is this helpful?**
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
---
## 7. Simple Data Extraction (LLM-based)
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library
Below is an example using **open-source** style (no token) and closed-source:
```python
import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class PricingInfo(BaseModel):
model_name: str = Field(..., description="Name of the AI model")
input_fee: str = Field(..., description="Fee for input tokens")
output_fee: str = Field(..., description="Fee for output tokens")
async def main():
# 1) Open-Source usage: no token required
llm_strategy_open_source = LLMExtractionStrategy(
provider="ollama/llama3.3", # or "any-other-local-model"
api_token="no_token", # for local models, no API key is typically required
schema=PricingInfo.schema(),
extraction_type="schema",
instruction="""
From this page, extract all AI model pricing details in JSON format.
Each entry should have 'model_name', 'input_fee', and 'output_fee'.
""",
temperature=0
)
# 2) Closed-Source usage: API key for OpenAI, for example
openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
llm_strategy_openai = LLMExtractionStrategy(
provider="openai/gpt-4",
api_token=openai_token,
schema=PricingInfo.schema(),
extraction_type="schema",
instruction="""
From this page, extract all AI model pricing details in JSON format.
Each entry should have 'model_name', 'input_fee', and 'output_fee'.
""",
temperature=0
)
# We'll demo the open-source approach here
config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/pricing",
config=config
)
print("LLM-based extraction JSON:", result.extracted_content)
if __name__ == "__main__":
asyncio.run(main())
```
**What's happening?**
- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
---
## 8. Next Steps
Congratulations! You have:
1. Installed Crawl4AI (via pip, with Docker as an option).
2. Performed a simple crawl and printed Markdown.
3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
4. Experimented with **CSS-based** extraction for repetitive data.
5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
If you are ready for more, check out:
- **Installation**: Learn more on how to install Crawl4AI and set up Playwright.
- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!

View File

@@ -0,0 +1,527 @@
# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
**What Crawl4AI is not:**
Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:
- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from Raspberry Pi to cloud infrastructures
Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It's super efficient and fast, optimized to:
1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion
In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.
**Key Links:**
- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)
---
## Table of Contents
- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
- [Table of Contents](#table-of-contents)
- [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
- [2. Installation \& Environment Setup](#2-installation--environment-setup)
- [Test Your Installation](#test-your-installation)
- [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
- [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
- [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
- [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
- [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
- [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
- [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
- [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
- [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
- [Using `storage_state`](#using-storage_state)
- [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
- [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
- [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
- [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
- [15. Dockerization \& Scaling](#15-dockerization--scaling)
- [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
- [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
- [18. Further Resources \& Community](#18-further-resources--community)
---
## 1. Introduction & Key Concepts
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
**Quick Test:**
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def test_run():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown)
asyncio.run(test_run())
```
If you see Markdown output, everything is working!
**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)
---
## 2. Installation & Environment Setup
```bash
# Install the package
pip install crawl4ai
crawl4ai-setup
# Install Playwright with system dependencies (recommended)
playwright install --with-deps # Installs all browsers
# Or install specific browsers:
playwright install --with-deps chrome # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium
# Keep Playwright updated periodically
playwright install
```
> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably.
### Test Your Installation
Try these one-liners:
```python
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```
You should see a browser window (in visible test) loading example.com. If you get errors, try with Firefox using `playwright install --with-deps firefox`.
**Try in Colab:**
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)
---
## 3. Core Concepts & Configuration
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
**Example config:**
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
headless=True,
verbose=True,
viewport_width=1080,
viewport_height=600,
text_mode=False,
ignore_https_errors=True,
java_script_enabled=True
)
run_config = CrawlerRunConfig(
css_selector="article.main",
word_count_threshold=50,
excluded_tags=['nav','footer'],
exclude_external_links=True,
wait_for="css:.article-loaded",
page_timeout=60000,
delay_before_return_html=1.0,
mean_delay=0.1,
max_range=0.3,
process_iframes=True,
remove_overlay_elements=True,
js_code="""
(async () => {
window.scrollTo(0, document.body.scrollHeight);
await new Promise(r => setTimeout(r, 2000));
document.querySelector('.load-more')?.click();
})();
"""
)
# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```
**Prefixes:**
- `http://` or `https://` for live pages
- `file://local.html` for local
- `raw:<html>` for raw HTML strings
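Here's a quick sketch showing all three prefixes in one run (the local file path is hypothetical):
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        live = await crawler.arun("https://example.com")                          # live page
        local = await crawler.arun("file:///path/to/local.html")                  # local HTML file
        raw = await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>")  # raw HTML string
        print(len(live.markdown), len(local.markdown), len(raw.markdown))

asyncio.run(main())
```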
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)
---
## 4. Basic Crawling & Simple Extraction
```python
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://news.example.com/article", config=run_config)
print(result.markdown) # Basic markdown content
```
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)
---
## 5. Markdown Generation & AI-Optimized Output
After crawling, `result.markdown_v2` provides:
- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
- `fit_html`: The HTML used to produce `fit_markdown`
**Example:**
```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```
For AI training, `fit_markdown` focuses on the most relevant content.
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)
---
## 6. Structured Data Extraction (CSS, XPath, LLM)
Extract JSON data without LLMs:
**CSS:**
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"}
]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```
**XPath:**
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
xpath_schema = {
"name": "Articles",
"baseSelector": "//div[@class='article']",
"fields": [
{"name":"headline","selector":".//h1","type":"text"},
{"name":"summary","selector":".//p[@class='summary']","type":"text"}
]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
---
## 7. Advanced Extraction: LLM & Open-Source Models
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class TravelData(BaseModel):
destination: str
attractions: list
run_config.extraction_strategy = LLMExtractionStrategy(
provider="ollama/nemotron",
schema=TravelData.schema(),
instruction="Extract destination and top attractions."
)
```
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
---
## 8. Page Interactions, JS Execution, & Dynamic Content
Insert `js_code` and use `wait_for` to ensure content loads. Example:
```python
run_config.js_code = """
(async () => {
document.querySelector('.load-more')?.click();
await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)
---
## 9. Media, Links, & Metadata Handling
`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance.
`result.media["videos"]`, `result.media["audios"]` similarly hold media info.
`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`.
`result.metadata`: Title, description, keywords, author.
**Example:**
```python
# Images
for img in result.media["images"]:
print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))
# Links
for link in result.links["external"]:
print("External Link:", link["href"], "Text:", link["text"])
# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```
**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)
---
## 10. Authentication & Identity Preservation
### Manual Setup via User Data Directory
1. **Open Chrome with a custom user data dir:**
```bash
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
```
On macOS:
```bash
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
```
2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
The browser saves cookies/localStorage in that directory.
3. **Use `user_data_dir` in `BrowserConfig`:**
```python
browser_config = BrowserConfig(
headless=True,
user_data_dir="/Users/username/ChromeProfiles/MyProfile"
)
```
Now the crawler starts with those cookies, sessions, etc.
### Using `storage_state`
Alternatively, export and reuse storage states:
```python
browser_config = BrowserConfig(
headless=True,
storage_state="mystate.json" # Pre-saved state
)
```
No repeated logins needed.
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)
---
## 11. Proxy & Security Enhancements
Use `proxy_config` for authenticated proxies:
```python
browser_config.proxy_config = {
"server": "http://proxy.example.com:8080",
"username": "proxyuser",
"password": "proxypass"
}
```
Combine with `headers` or `ignore_https_errors` as needed.
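A small sketch of that combination (all of these parameters appear elsewhere in this guide; adjust to your own proxy and site):
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    ignore_https_errors=True,            # tolerate self-signed certs behind the proxy
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxyuser",
        "password": "proxypass",
    },
)
run_config = CrawlerRunConfig(
    headers={"Accept-Language": "en-US,en;q=0.8"},   # extra per-request headers
)
```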
**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)
---
## 12. Screenshots, PDFs & File Downloads
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:
```python
run_config.screenshot = True
run_config.pdf = True
```
After crawling:
```python
if result.screenshot:
with open("page.png", "wb") as f:
f.write(result.screenshot)
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
**File Downloads:**
```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""
# After crawl:
print("Downloaded files:", result.downloaded_files)
```
**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)
---
## 13. Caching & Performance Optimization
Set `cache_mode` to reuse fetch results:
```python
from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED
```
Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
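For example, a sketch combining caching with the lighter `text_mode` and the delay settings shown in the config example earlier:
```python
from crawl4ai import CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    text_mode=True,                  # lighter, text-oriented rendering for faster loads
)
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,    # reuse previously fetched results
    mean_delay=0.1,                  # small randomized delay between requests
    max_range=0.3,
)
```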
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)
---
## 14. Hooks for Custom Logic
Hooks let you run code at specific lifecycle events without creating pages manually in `on_browser_created`.
Use `on_page_context_created` to apply routing or modify page contexts before crawling the URL:
**Example Hook:**
```python
async def on_page_context_created_hook(context, page, **kwargs):
# Block all images to speed up load
await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
print("[HOOK] Image requests blocked")
async with AsyncWebCrawler(config=browser_config) as crawler:
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
result = await crawler.arun("https://imageheavy.example.com", config=run_config)
print("Crawl finished with images blocked.")
```
This hook is clean and doesn't create a separate page itself—it just modifies the current context/page setup.
**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)
---
## 15. Dockerization & Scaling
Use Docker images:
- AMD64 basic:
```bash
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
```
- ARM64 for M1/M2:
```bash
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
```
- GPU support:
```bash
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
```
Scale with load balancers or Kubernetes.
**More info:** [See /docs/proxy_security (for proxy) or relevant Docker instructions in README](#)
---
## 16. Troubleshooting & Common Pitfalls
- Empty results? Relax filters, check selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
- JS errors? Try headful mode for debugging.
Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.
---
## 17. Comprehensive End-to-End Example
Combine hooks, JS execution, PDF saving, LLM extraction—see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
---
## 18. Further Resources & Community
- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
Follow [@unclecode](https://x.com/unclecode) for news & community updates.
**Happy Crawling!**
Leverage Crawl4AI to feed your AI models with clean, structured web data today.

View File

@@ -0,0 +1,335 @@
# Hooks & Custom Code
Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:
- **Authentication** (log in before navigating)
- **Content manipulation** (modify HTML, inject scripts, etc.)
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
- **Custom data collection** (scrape extra details or track state at each stage)
In this tutorial, you'll learn about:
1. What hooks are available
2. How to attach code to each hook
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)
> **Prerequisites**
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - Comfortable with Python async/await.
---
## 1. Overview of Available Hooks
| Hook Name | Called When / Purpose | Context / Objects Provided |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page's HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |
### A Note on `on_browser_created` (the browser-only hook)
- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
- For example, you might control [CDP sessions][cdp] or advanced browser flags here.
---
## 2. Registering Hooks
You can attach hooks by calling:
```python
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
```
or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:
```python
hooks = {
"before_goto": my_before_goto_hook,
"after_goto": my_after_goto_hook,
# ... etc.
}
async with AsyncWebCrawler(hooks=hooks) as crawler:
...
```
### Hook Signature
Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**—most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
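For example, a minimal sketch of a hook using the `(page, context, ...)` form from the table above:
```python
async def my_after_goto_hook(page, context, response=None, **kwargs):
    # Runs right after navigation completes, before scraping starts
    print("[HOOK] after_goto fired for:", page.url)

# Register it on the strategy:
# crawler.crawler_strategy.set_hook("after_goto", my_after_goto_hook)
```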
---
## 3. Real-Life Examples
Below are concrete scenarios where hooks come in handy.
---
### 3.1 Authentication Before Navigation
One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).
#### Using `before_goto`
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def before_goto_auth_hook(page, context, goto_params, **kwargs):
"""
Example: Set cookies or localStorage to simulate login.
This hook runs right before page.goto() is called.
"""
# Example: Insert cookie-based auth or local storage data
# (You could also do more complex actions, like fill forms if you already have a 'page' open.)
print("[HOOK] Setting auth data before goto.")
await context.add_cookies([
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/"
}
])
# Optionally manipulate goto_params if needed:
# goto_params["url"] = goto_params["url"] + "?debug=1"
async def main():
hooks = {
"before_goto": before_goto_auth_hook
}
browser_cfg = BrowserConfig(headless=True)
crawler_cfg = CrawlerRunConfig()
async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
if result.success:
print("[OK] Logged in and fetched protected page.")
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Points**
- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
- If you need to run a real login flow (submitting forms), consider `on_browser_created` or `on_page_context_created` if you want to do it once at the start.
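For instance, here is a rough sketch of a one-time, form-based login performed in `on_page_context_created`; the login URL and selectors are purely illustrative assumptions:
```python
async def on_page_context_created_login_hook(page, context, **kwargs):
    # Hypothetical login page and selectors -- adjust for your target site.
    await page.goto("https://example.com/login")
    await page.fill("#username", "myuser")
    await page.fill("#password", "mypass")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")
    # Cookies created here persist in the context, so the subsequent
    # navigation to the protected URL is already authenticated.
```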
---
### 3.2 Setting Up the Browser in `on_browser_created`
If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), youll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.
```python
async def on_browser_created_hook(browser, **kwargs):
"""
Runs immediately after the browser is created, before any pages.
'browser' here is a Playwright Browser object.
"""
print("[HOOK] Browser created. Setting up custom stuff.")
# Possibly connect to DevTools or create an incognito context
# Example (pseudo-code):
# devtools_url = await browser.new_context(devtools=True)
# Usage:
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
...
```
---
### 3.3 Adjusting Page or Context in `on_page_context_created`
If youd like to set default timeouts or inject scripts right after a page context is spun up:
```python
async def on_page_context_created_hook(page, context, **kwargs):
print("[HOOK] Page context created. Setting default timeouts or scripts.")
    page.set_default_timeout(20000)  # 20 seconds (sync call in Playwright, no await needed)
# Possibly inject a script or set user locale
# Usage:
hooks = {
"on_page_context_created": on_page_context_created_hook
}
```
---
### 3.4 Dynamically Updating User Agents
`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:
```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
print(f"[HOOK] User agent updated to {new_ua}")
# Maybe add a custom header based on new UA
await context.set_extra_http_headers({"X-UA-Source": new_ua})
hooks = {
"on_user_agent_updated": on_user_agent_updated_hook
}
```
---
### 3.5 Initializing Stuff with `on_execution_started`
`on_execution_started` runs before your main crawling logic. Its a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).
```python
async def on_execution_started_hook(page, context, **kwargs):
print("[HOOK] Execution started. Setting a start timestamp or logging.")
context.set_default_navigation_timeout(45000) # 45s if your site is slow
hooks = {
"on_execution_started": on_execution_started_hook
}
```
---
### 3.6 Post-Processing with `after_goto`
After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying youre on the right page, or removing interstitials:
```python
async def after_goto_hook(page, context, response, **kwargs):
"""
Called right after page.goto() finishes, but before the crawler extracts HTML.
"""
if response and response.ok:
print("[HOOK] After goto. Status:", response.status)
# Maybe remove popups or check if we landed on a login failure page.
await page.evaluate("""() => {
const popup = document.querySelector(".annoying-popup");
if (popup) popup.remove();
}""")
else:
print("[HOOK] Navigation might have failed, status not ok or no response.")
hooks = {
"after_goto": after_goto_hook
}
```
---
### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`
Sometimes you need to tweak the page or raw HTML right before its captured.
```python
async def before_retrieve_html_hook(page, context, **kwargs):
"""
Modify the DOM just before the crawler finalizes the HTML.
"""
print("[HOOK] Removing adverts before capturing HTML.")
await page.evaluate("""() => {
const ads = document.querySelectorAll(".ad-banner");
ads.forEach(ad => ad.remove());
}""")
async def before_return_html_hook(page, context, html, **kwargs):
"""
'html' is the near-finished HTML string. Return an updated string if you like.
"""
# For example, remove personal data or certain tags from the final text
print("[HOOK] Sanitizing final HTML.")
sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
return sanitized_html
hooks = {
"before_retrieve_html": before_retrieve_html_hook,
"before_return_html": before_return_html_hook
}
```
**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
---
## 4. Putting It All Together
You can combine multiple hooks in a single run. For instance:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def on_browser_created_hook(browser, **kwargs):
print("[HOOK] Browser is up, no page yet. Good for broad config.")
async def before_goto_auth_hook(page, context, goto_params, **kwargs):
print("[HOOK] Adding cookies for auth.")
await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com"}])
async def after_goto_log_hook(page, context, response, **kwargs):
if response:
print("[HOOK] after_goto: Status code:", response.status)
async def main():
hooks = {
"on_browser_created": on_browser_created_hook,
"before_goto": before_goto_auth_hook,
"after_goto": after_goto_log_hook
}
browser_cfg = BrowserConfig(headless=True)
crawler_cfg = CrawlerRunConfig(verbose=True)
async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
if result.success:
print("[OK] Protected page length:", len(result.html))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
This example:
1. **`on_browser_created`** sets up the brand-new browser instance.
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
3. **`after_goto`** logs the resulting HTTP status code.
---
## 5. Common Pitfalls & Best Practices
1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwrights `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider if a separate “how-to” function with a simpler approach might suffice.
---
## Conclusion & Next Steps
**Hooks** let you bend Crawl4AI to your will:
- **Authentication** (cookies, localStorage) with `before_goto`
- **Browser-level config** with `on_browser_created`
- **Page or context config** with `on_page_context_created`
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)
**Where to go next**:
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.
With the hook system, you have near-complete control over the browsers lifecycle—whether its setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!
**Last Updated**: 2024-XX-XX

View File

@@ -0,0 +1,395 @@
# Extracting JSON (No LLM)
One of Crawl4AIs **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. By defining a **schema** with CSS or XPath selectors, you can extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
**Why avoid LLM for basic extractions?**
1. **Faster & Cheaper**: No API calls or GPU overhead.
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. A well-defined schema is practically carbon-free.
3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.
Below, well explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). Well also highlight advanced features like **nested fields** and **base element attributes**.
---
## 1. Intro to Schema-Based Extraction
A schema defines:
1. A **base selector** that identifies each “container” element on the page (e.g., a product row, a blog post card).
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
3. **Nested** or **list** types for repeated or hierarchical structures.
For example, if you have a list of products, each one might have a name, price, reviews, and “related products.” This approach is faster and more reliable than an LLM for consistent, structured pages.
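As a rough sketch, such a schema is just a dictionary (the selectors below are illustrative):
```python
product_schema = {
    "name": "Products",
    "baseSelector": "div.product-card",   # one JSON object per matching element
    "fields": [
        {"name": "title", "selector": "h3.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}
```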
---
## 2. Simple Example: Crypto Prices
Lets begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **dont** call any LLM:
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def extract_crypto_prices():
# 1. Define a simple extraction schema
schema = {
"name": "Crypto Prices",
"baseSelector": "div.crypto-row", # Repeated elements
"fields": [
{
"name": "coin_name",
"selector": "h2.coin-name",
"type": "text"
},
{
"name": "price",
"selector": "span.coin-price",
"type": "text"
}
]
}
# 2. Create the extraction strategy
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
# 3. Set up your crawler config (if needed)
config = CrawlerRunConfig(
# e.g., pass js_code or wait_for if the page is dynamic
# wait_for="css:.crypto-row:nth-child(20)"
cache_mode = CacheMode.BYPASS,
extraction_strategy=extraction_strategy,
)
async with AsyncWebCrawler(verbose=True) as crawler:
# 4. Run the crawl and extraction
result = await crawler.arun(
url="https://example.com/crypto-prices",
config=config
)
if not result.success:
print("Crawl failed:", result.error_message)
return
# 5. Parse the extracted JSON
data = json.loads(result.extracted_content)
print(f"Extracted {len(data)} coin entries")
print(json.dumps(data[0], indent=2) if data else "No data found")
asyncio.run(extract_crypto_prices())
```
**Highlights**:
- **`baseSelector`**: Tells us where each “item” (crypto row) is.
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).
No LLM is needed, and the performance is **near-instant** for hundreds or thousands of items.
---
### **XPath Example with `raw://` HTML**
Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. Well pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
async def extract_crypto_prices_xpath():
# 1. Minimal dummy HTML with some repeating rows
dummy_html = """
<html>
<body>
<div class='crypto-row'>
<h2 class='coin-name'>Bitcoin</h2>
<span class='coin-price'>$28,000</span>
</div>
<div class='crypto-row'>
<h2 class='coin-name'>Ethereum</h2>
<span class='coin-price'>$1,800</span>
</div>
</body>
</html>
"""
# 2. Define the JSON schema (XPath version)
schema = {
"name": "Crypto Prices via XPath",
"baseSelector": "//div[@class='crypto-row']",
"fields": [
{
"name": "coin_name",
"selector": ".//h2[@class='coin-name']",
"type": "text"
},
{
"name": "price",
"selector": ".//span[@class='coin-price']",
"type": "text"
}
]
}
# 3. Place the strategy in the CrawlerRunConfig
config = CrawlerRunConfig(
extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
)
# 4. Use raw:// scheme to pass dummy_html directly
raw_url = f"raw://{dummy_html}"
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=raw_url,
config=config
)
if not result.success:
print("Crawl failed:", result.error_message)
return
data = json.loads(result.extracted_content)
print(f"Extracted {len(data)} coin rows")
if data:
print("First item:", data[0])
asyncio.run(extract_crypto_prices_xpath())
```
**Key Points**:
1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
2. **`baseSelector`** and each fields `"selector"` use **XPath** instead of CSS.
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.
Thats how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
---
## 3. Advanced Schema & Nested Structures
Real sites often have **nested** or repeated data—like categories containing products, which themselves have a list of reviews or features. For that, we can define **nested** or **list** (and even **nested_list**) fields.
### Sample E-Commerce HTML
We have a **sample e-commerce** HTML file on GitHub (example):
```
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
```
This snippet includes categories, products, features, reviews, and related items. Lets see how to define a schema that fully captures that structure **without LLM**.
```python
schema = {
"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
# (1) We can define optional baseFields if we want to extract attributes from the category container
"baseFields": [
{"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
],
"fields": [
{
"name": "category_name",
"selector": "h2.category-name",
"type": "text"
},
{
"name": "products",
"selector": "div.product",
"type": "nested_list", # repeated sub-objects
"fields": [
{
"name": "name",
"selector": "h3.product-name",
"type": "text"
},
{
"name": "price",
"selector": "p.product-price",
"type": "text"
},
{
"name": "details",
"selector": "div.product-details",
"type": "nested", # single sub-object
"fields": [
{"name": "brand", "selector": "span.brand", "type": "text"},
{"name": "model", "selector": "span.model", "type": "text"}
]
},
{
"name": "features",
"selector": "ul.product-features li",
"type": "list",
"fields": [
{"name": "feature", "type": "text"}
]
},
{
"name": "reviews",
"selector": "div.review",
"type": "nested_list",
"fields": [
{"name": "reviewer", "selector": "span.reviewer", "type": "text"},
{"name": "rating", "selector": "span.rating", "type": "text"},
{"name": "comment", "selector": "p.review-text", "type": "text"}
]
},
{
"name": "related_products",
"selector": "ul.related-products li",
"type": "list",
"fields": [
{"name": "name", "selector": "span.related-name", "type": "text"},
{"name": "price", "selector": "span.related-price", "type": "text"}
]
}
]
}
]
}
```
Key Takeaways:
- **Nested vs. List**:
- **`type: "nested"`** means a **single** sub-object (like `details`).
- **`type: "list"`** means multiple items that are **simple** dictionaries or single text fields.
- **`type: "nested_list"`** means repeated **complex** objects (like `products` or `reviews`).
- **Base Fields**: We can extract **attributes** from the container element via `"baseFields"`. For instance, `"data_cat_id"` might be `data-cat-id="elect123"`.
- **Transforms**: We can also define a `transform` if we want to lower/upper case, strip whitespace, or even run a custom function.
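For example, a field-level transform might look like the sketch below; treat the exact transform keywords as assumptions that may vary by version:
```python
brand_field = {
    "name": "brand",
    "selector": "span.brand",
    "type": "text",
    "transform": "lowercase",  # assumed keyword; strip/uppercase-style options may also exist
}
```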
### Running the Extraction
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
ecommerce_schema = {
# ... the advanced schema from above ...
}
async def extract_ecommerce_data():
strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
config = CrawlerRunConfig()
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
extraction_strategy=strategy,
config=config
)
if not result.success:
print("Crawl failed:", result.error_message)
return
# Parse the JSON output
data = json.loads(result.extracted_content)
print(json.dumps(data, indent=2) if data else "No data found.")
asyncio.run(extract_ecommerce_data())
```
If all goes well, you get a **structured** JSON array with each “category,” containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.
---
## 4. Why “No LLM” Is Often Better
1. **Zero Hallucination**: Schema-based extraction doesnt guess text. It either finds it or not.
2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
3. **Speed**: LLM-based extraction can be 10-1000x slower for large-scale crawling.
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.
**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema approach first for repeated or consistent data patterns.
---
## 5. Base Element Attributes & Additional Fields
Its easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:
```json
{
"name": "href",
"type": "attribute",
"attribute": "href",
"default": null
}
```
You can define them in **`baseFields`** (extracted from the main container element) or in each fields sub-lists. This is especially helpful if you need an items link or ID stored in the parent `<div>`.
---
## 6. Putting It All Together: Larger Example
Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
```python
schema = {
"name": "Blog Posts",
"baseSelector": "a.blog-post-card",
"baseFields": [
{"name": "post_url", "type": "attribute", "attribute": "href"}
],
"fields": [
{"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
{"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
{"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
{"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
]
}
```
Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
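A minimal run using that schema might look like this sketch (the blog URL is a placeholder, and `schema` refers to the dict defined above):
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_blog_posts():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/blog", config=config)
        if result.success:
            posts = json.loads(result.extracted_content)
            print(f"Extracted {len(posts)} blog posts")
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(extract_blog_posts())
```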
---
## 7. Tips & Best Practices
1. **Inspect the DOM** in Chrome DevTools or Firefoxs Inspector to find stable selectors.
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
3. **Test** your schema on partial HTML or a test page before a big crawl.
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig` (see the sketch after this list).
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, itll often show warnings.
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the “parent” item.
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
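As a sketch of tip 4, the dynamic-content options sit alongside the extraction strategy in the same config (the JS snippet and selector are illustrative, and `schema` is whatever schema you are using):
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy loading
    wait_for="css:.product-card:nth-child(20)",  # wait until enough items are rendered
)
```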
---
## 8. Conclusion
With **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy**), you can build powerful, **LLM-free** pipelines that:
- Scrape any consistent site for structured data.
- Support nested objects, repeating lists, or advanced transformations.
- Scale to thousands of pages quickly and reliably.
**Next Steps**:
- Explore the [Advanced Usage of JSON Extraction](../../explanations/extraction-chunking.md) for deeper details on schema nesting, transformations, or hooking.
- Combine your extracted JSON with advanced filtering or summarization in a second pass if needed.
- For dynamic pages, combine strategies with `js_code` or infinite scroll hooking to ensure all content is loaded.
**Remember**: For repeated, structured data, you dont need to pay for or wait on an LLM. A well-crafted schema plus CSS or XPath gets you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
**Last Updated**: 2024-XX-XX
---
Thats it for **Extracting JSON (No LLM)**! Youve seen how schema-based approaches (either CSS or XPath) can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!

View File

@@ -0,0 +1,334 @@
# Extracting JSON (LLM)
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
1. Works with **any** large language model supported by [LightLLM](https://github.com/LightLLM) (Ollama, OpenAI, Claude, and more).
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.
**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./json-extraction-basic.md) or [`JsonXPathExtractionStrategy`](./json-extraction-basic.md) first. But if you need AI to interpret or reorganize content, read on!
---
## 1. Why Use an LLM?
- **Complex Reasoning**: If the sites data is unstructured, scattered, or full of natural language context.
- **Semantic Extraction**: Summaries, knowledge graphs, or relational data that require comprehension.
- **Flexible**: You can pass instructions to the model to do more advanced transformations or classification.
---
## 2. Provider-Agnostic via LightLLM
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
- **`api_base`** (optional): If your provider has a custom endpoint.
This means you **arent locked** into a single LLM vendor. Switch or experiment easily.
---
## 3. How LLM Extraction Works
### 3.1 Flow
1. **Chunking** (optional): The HTML or markdown is split into smaller segments if its very long (based on `chunk_token_threshold`, overlap, etc.).
2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
4. **Combining**: The results from each chunk are merged and parsed into JSON.
### 3.2 `extraction_type`
- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.
For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
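For instance, a minimal Pydantic model and the schema dict it produces (Pydantic v2 shown):
```python
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str

schema = Product.model_json_schema()
# Roughly: {"properties": {"name": {...}, "price": {...}},
#           "required": ["name", "price"], "title": "Product", "type": "object"}
```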
---
## 4. Key Parameters
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
4. **`extraction_type`** (str): `"schema"` or `"block"`.
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
- `"markdown"`: The raw markdown (default).
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
- `"html"`: The cleaned or raw HTML.
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
**Example**:
```python
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4",
api_token="YOUR_OPENAI_KEY",
schema=MyModel.model_json_schema(),
extraction_type="schema",
instruction="Extract a list of items from the text with 'name' and 'price' fields.",
chunk_token_threshold=1200,
overlap_rate=0.1,
apply_chunking=True,
input_format="html",
extra_args={"temperature": 0.1, "max_tokens": 1000},
verbose=True
)
```
---
## 5. Putting It in `CrawlerRunConfig`
**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Heres a full example:
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class Product(BaseModel):
name: str
price: str
async def main():
# 1. Define the LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", # e.g. "ollama/llama2"
api_token=os.getenv('OPENAI_API_KEY'),
schema=Product.schema_json(), # Or use model_json_schema()
extraction_type="schema",
instruction="Extract all product objects with 'name' and 'price' from the content.",
chunk_token_threshold=1000,
overlap_rate=0.0,
apply_chunking=True,
input_format="markdown", # or "html", "fit_markdown"
extra_args={"temperature": 0.0, "max_tokens": 800}
)
# 2. Build the crawler config
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS
)
# 3. Create a browser config if needed
browser_cfg = BrowserConfig(headless=True)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
# 4. Let's say we want to crawl a single page
result = await crawler.arun(
url="https://example.com/products",
config=crawl_config
)
if result.success:
# 5. The extracted content is presumably JSON
data = json.loads(result.extracted_content)
print("Extracted items:", data)
# 6. Show usage stats
llm_strategy.show_usage() # prints token usage
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
---
## 6. Chunking Details
### 6.1 `chunk_token_threshold`
If your page is large, you might exceed your LLMs context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates word→token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
### 6.2 `overlap_rate`
To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunks text. This is helpful if your needed info might straddle chunk boundaries.
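As a rough back-of-the-envelope sketch (assuming `word_token_rate` is read as words per token; the library's exact accounting may differ):
```python
chunk_token_threshold = 1200   # max tokens per chunk
word_token_rate = 0.75         # assumed ~0.75 words per token
overlap_rate = 0.1             # 10% of each chunk repeated in the next

words_per_chunk = int(chunk_token_threshold * word_token_rate)  # ~900 words
overlap_words = int(words_per_chunk * overlap_rate)             # ~90 words shared with the next chunk
print(words_per_chunk, overlap_words)
```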
### 6.3 Performance & Parallelism
By chunking, you can potentially process multiple chunks in parallel (depending on your concurrency settings and the LLM provider). This reduces total time if the site is huge or has many sections.
---
## 7. Input Format
By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawlers final markdown** is fed to the LLM. You can change to:
- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
- **`markdown`**: Standard markdown output from the crawlers `markdown_generator`.
This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
```python
LLMExtractionStrategy(
# ...
input_format="html", # Instead of "markdown" or "fit_markdown"
)
```
---
## 8. Token Usage & Show Usage
To keep track of tokens and cost, each chunk is processed with an LLM call. We record usage in:
- **`usages`** (list): token usage per chunk or call.
- **`total_usage`**: sum of all chunk calls.
- **`show_usage()`**: prints a usage report (if the provider returns usage data).
```python
llm_strategy = LLMExtractionStrategy(...)
# ...
llm_strategy.show_usage()
# e.g. “Total usage: 1241 tokens across 2 chunk calls”
```
If your model provider doesnt return usage info, these fields might be partial or empty.
---
## 9. Example: Building a Knowledge Graph
Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
```python
import os
import json
import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class Entity(BaseModel):
name: str
description: str
class Relationship(BaseModel):
entity1: Entity
entity2: Entity
description: str
relation_type: str
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
async def main():
# LLM extraction strategy
llm_strat = LLMExtractionStrategy(
provider="openai/gpt-4",
api_token=os.getenv('OPENAI_API_KEY'),
schema=KnowledgeGraph.schema_json(),
extraction_type="schema",
instruction="Extract entities and relationships from the content. Return valid JSON.",
chunk_token_threshold=1400,
apply_chunking=True,
input_format="html",
extra_args={"temperature": 0.1, "max_tokens": 1500}
)
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strat,
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
# Example page
url = "https://www.nbcnews.com/business"
result = await crawler.arun(url=url, config=crawl_config)
if result.success:
with open("kb_result.json", "w", encoding="utf-8") as f:
f.write(result.extracted_content)
llm_strat.show_usage()
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Observations**:
- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
- **`input_format="html"`** means we feed HTML to the model.
- **`instruction`** guides the model to output a structured knowledge graph.
---
## 10. Best Practices & Caveats
1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup.
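For that last point, a small validation pass over the extracted JSON might look like this sketch (assuming the `Product` model and `result` from the earlier example):
```python
import json
from pydantic import ValidationError

items = json.loads(result.extracted_content)
valid, rejected = [], []
for item in items:
    try:
        valid.append(Product.model_validate(item))   # Pydantic v2; use parse_obj() on v1
    except ValidationError as exc:
        rejected.append((item, str(exc)))
print(f"{len(valid)} valid items, {len(rejected)} rejected")
```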
---
## 11. Conclusion
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LightLLM. Its perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, its **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
- Put your LLM strategy **in `CrawlerRunConfig`**.
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
- Monitor token usage with `show_usage()`.
If your sites data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./json-extraction-basic.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.
**Next Steps**:
1. **Experiment with Different Providers**
- Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
- Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
2. **Combine With Other Strategies**
- Use [content filters](../../how-to/content-filters.md) like BM25 or Pruning prior to LLM extraction to remove noise and reduce token usage.
- Apply a [CSS or XPath extraction strategy](./json-extraction-basic.md) first for obvious, structured data, then send only the tricky parts to the LLM.
3. **Performance Tuning**
- If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
- Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
4. **Validate Outputs**
- If using `extraction_type="schema"`, parse the LLMs JSON with a Pydantic model for a final validation step.
- Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.
5. **Explore Hooks & Automation**
- Integrate LLM extraction with [hooks](./hooks-custom.md) for complex pre/post-processing.
- Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.
6. **Scale and Deploy**
- Combine your LLM extraction setup with [Docker or other deployment solutions](./docker-quickstart.md) to run at scale.
- Monitor memory usage and concurrency if you call LLMs frequently.
**Last Updated**: 2024-XX-XX
---
Thats it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!

View File

@@ -0,0 +1,295 @@
# Link & Media Analysis
In this tutorial, youll learn how to:
1. Extract links (internal, external) from crawled pages
2. Filter or exclude specific domains (e.g., social media or custom domains)
3. Access and manage media data (especially images) in the crawl result
4. Configure your crawler to exclude or prioritize certain images
> **Prerequisites**
> - You have completed or are familiar with the [AsyncWebCrawler Basics](./async-webcrawler-basics.md) tutorial.
> - You can run Crawl4AI in your environment (Playwright, Python, etc.).
---
## 1. Link Extraction
### 1.1 `result.links`
When you call `arun()` or `arun_many()` on a URL, Crawl4AI automatically extracts links and stores them in the `links` field of `CrawlResult`. By default, the crawler tries to distinguish **internal** links (same domain) from **external** links (different domains).
**Basic Example**:
```python
from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://www.example.com")
if result.success:
internal_links = result.links.get("internal", [])
external_links = result.links.get("external", [])
print(f"Found {len(internal_links)} internal links, {len(external_links)} external links.")
# Each link is typically a dictionary with fields like:
# { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
if internal_links:
print("Sample Internal Link:", internal_links[0])
else:
print("Crawl failed:", result.error_message)
```
**Structure Example**:
```python
result.links = {
"internal": [
{
"href": "https://kidocode.com/",
"text": "",
"title": "",
"base_domain": "kidocode.com"
},
{
"href": "https://kidocode.com/degrees/technology",
"text": "Technology Degree",
"title": "KidoCode Tech Program",
"base_domain": "kidocode.com"
},
# ...
],
"external": [
# possibly other links leading to third-party sites
]
}
```
- **`href`**: The raw hyperlink URL.
- **`text`**: The link text (if any) within the `<a>` tag.
- **`title`**: The `title` attribute of the link (if present).
- **`base_domain`**: The domain extracted from `href`. Helpful for filtering or grouping by domain.
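Because every link carries its `base_domain`, a quick post-processing pass can group or count them (reusing `result` from the example above):
```python
from collections import Counter

external_links = result.links.get("external", [])
domain_counts = Counter(link.get("base_domain", "") for link in external_links)
print("Top external domains:", domain_counts.most_common(5))
```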
---
## 2. Domain Filtering
Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at **crawl time** by configuring the crawler. The most relevant parameters in `CrawlerRunConfig` are:
- **`exclude_external_links`**: If `True`, discard any link pointing outside the root domain.
- **`exclude_social_media_domains`**: Provide a list of social media platforms (e.g., `["facebook.com", "twitter.com"]`) to exclude from your crawl.
- **`exclude_social_media_links`**: If `True`, automatically skip known social platforms.
- **`exclude_domains`**: Provide a list of custom domains you want to exclude (e.g., `["spammyads.com", "tracker.net"]`).
### 2.1 Example: Excluding External & Social Media Links
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
crawler_cfg = CrawlerRunConfig(
exclude_external_links=True, # No links outside primary domain
exclude_social_media_links=True # Skip recognized social media domains
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.example.com",
config=crawler_cfg
)
if result.success:
print("[OK] Crawled:", result.url)
print("Internal links count:", len(result.links.get("internal", [])))
print("External links count:", len(result.links.get("external", [])))
# Likely zero external links in this scenario
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
### 2.2 Example: Excluding Specific Domains
If you want to let external links in, but specifically exclude a domain (e.g., `suspiciousads.com`), do this:
```python
crawler_cfg = CrawlerRunConfig(
exclude_domains=["suspiciousads.com"]
)
```
This approach is handy when you still want external links but need to block certain sites you consider spammy.
---
## 3. Media Extraction
### 3.1 Accessing `result.media`
By default, Crawl4AI collects images, audio, and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
**Basic Example**:
```python
if result.success:
images_info = result.media.get("images", [])
print(f"Found {len(images_info)} images in total.")
for i, img in enumerate(images_info[:5]): # Inspect just the first 5
print(f"[Image {i}] URL: {img['src']}")
print(f" Alt text: {img.get('alt', '')}")
print(f" Score: {img.get('score')}")
print(f" Description: {img.get('desc', '')}\n")
```
**Structure Example**:
```python
result.media = {
"images": [
{
"src": "https://cdn.prod.website-files.com/.../Group%2089.svg",
"alt": "coding school for kids",
"desc": "Trial Class Degrees degrees All Degrees AI Degree Technology ...",
"score": 3,
"type": "image",
"group_id": 0,
"format": None,
"width": None,
"height": None
},
# ...
],
"videos": [
# Similar structure but with video-specific fields
],
"audio": [
# Similar structure but with audio-specific fields
]
}
```
Depending on your Crawl4AI version or scraping strategy, these dictionaries can include fields like:
- **`src`**: The media URL (e.g., image source)
- **`alt`**: The alt text for images (if present)
- **`desc`**: A snippet of nearby text or a short description (optional)
- **`score`**: A heuristic relevance score if youre using content-scoring features
- **`width`**, **`height`**: If the crawler detects dimensions for the image/video
- **`type`**: Usually `"image"`, `"video"`, or `"audio"`
- **`group_id`**: If youre grouping related media items, the crawler might assign an ID
With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or a different domain), or gather metadata for analytics.
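For example, a small post-filter on scores might look like this (the threshold of 3 is an arbitrary illustration):
```python
images = result.media.get("images", [])
relevant_images = [
    img for img in images
    if (img.get("score") or 0) >= 3 and img.get("src")
]
print(f"Keeping {len(relevant_images)} of {len(images)} images")
```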
### 3.2 Excluding External Images
If youre dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
```python
crawler_cfg = CrawlerRunConfig(
exclude_external_images=True
)
```
This setting attempts to discard images from outside the primary domain, keeping only those from the site youre crawling.
### 3.3 Additional Media Config
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
---
## 4. Putting It All Together: Link & Media Filtering
Heres a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
# Suppose we want to keep only internal links, remove certain domains,
# and discard external images from the final crawl data.
crawler_cfg = CrawlerRunConfig(
exclude_external_links=True,
exclude_domains=["spammyads.com"],
exclude_social_media_links=True, # skip Twitter, Facebook, etc.
exclude_external_images=True, # keep only images from main domain
wait_for_images=True, # ensure images are loaded
verbose=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://www.example.com", config=crawler_cfg)
if result.success:
print("[OK] Crawled:", result.url)
# 1. Links
in_links = result.links.get("internal", [])
ext_links = result.links.get("external", [])
print("Internal link count:", len(in_links))
print("External link count:", len(ext_links)) # should be zero with exclude_external_links=True
# 2. Images
images = result.media.get("images", [])
print("Images found:", len(images))
# Let's see a snippet of these images
for i, img in enumerate(images[:3]):
print(f" - {img['src']} (alt={img.get('alt','')}, score={img.get('score','N/A')})")
else:
print("[ERROR] Failed to crawl. Reason:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
---
## 5. Common Pitfalls & Tips
1. **Conflicting Flags**:
- `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.
- `exclude_external_images=True` but want to keep some external images? Currently no partial domain-based setting for images, so you might need a custom approach or hook logic.
2. **Relevancy Scores**:
- If your version of Crawl4AI or your scraping strategy includes an `img["score"]`, its typically a heuristic based on size, position, or content analysis. Evaluate carefully if you rely on it.
3. **Performance**:
- Excluding certain domains or external images can speed up your crawl, especially for large, media-heavy pages.
- If you want a “full” link map, do *not* exclude them. Instead, you can post-filter in your own code.
4. **Social Media Lists**:
- `exclude_social_media_links=True` typically references an internal list of known social domains like Facebook, Twitter, LinkedIn, etc. If you need to add or remove from that list, look for library settings or a local config file (depending on your version).
---
## 6. Next Steps
Now that you understand how to manage **Link & Media Analysis**, you can:
- Fine-tune which links are stored or discarded in your final results
- Control which images (or other media) appear in `result.media`
- Filter out entire domains or social media platforms to keep your dataset relevant
**Recommended Follow-Ups**:
- **[Advanced Features (Proxy, PDF, Screenshots)](./advanced-features.md)**: If you want to capture screenshots or save the page as a PDF for archival or debugging.
- **[Hooks & Custom Code](./hooks-custom.md)**: For more specialized logic, such as automated “infinite scroll” or repeated “Load More” button clicks.
- **Reference**: Check out [CrawlerRunConfig Reference](../../reference/configuration.md) for a comprehensive parameter list.
**Last updated**: 2024-XX-XX
---
**Thats it for Link & Media Analysis!** Youre now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.

View File

@@ -0,0 +1,382 @@
# Markdown Generation Basics
One of Crawl4AIs core features is generating **clean, structured markdown** from web pages. Originally built to solve the problem of extracting only the “actual” content and discarding boilerplate or noise, Crawl4AIs markdown system remains one of its biggest draws for AI workflows.
In this tutorial, youll learn:
1. How to configure the **Default Markdown Generator**
2. How **content filters** (BM25 or Pruning) help you refine markdown and discard junk
3. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)
> **Prerequisites**
> - Youve completed or read [AsyncWebCrawler Basics](./async-webcrawler-basics.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.
---
## 1. Quick Example
Heres a minimal code snippet that uses the **DefaultMarkdownGenerator** with no additional filtering:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator()
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
print("Raw Markdown Output:\n")
print(result.markdown) # The unfiltered markdown from the page
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Whats happening?**
- `CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
- The resulting markdown is accessible via `result.markdown`.
---
## 2. How Markdown Generation Works
### 2.1 HTML-to-Text Conversion (Forked & Modified)
Under the hood, **DefaultMarkdownGenerator** uses a specialized HTML-to-text approach that:
- Preserves headings, code blocks, bullet points, etc.
- Removes extraneous tags (scripts, styles) that dont add meaningful content.
- Can optionally generate references for links or skip them altogether.
A set of **options** (passed as a dict) allows you to customize precisely how HTML converts to markdown. These map to standard html2text-like configuration plus your own enhancements (e.g., ignoring internal links, preserving certain tags verbatim, or adjusting line widths).
### 2.2 Link Citations & References
By default, the generator can convert `<a href="...">` elements into `[text][1]` citations, then place the actual links at the bottom of the document. This is handy for research workflows that demand references in a structured manner.
### 2.3 Optional Content Filters
Before or after the HTML-to-Markdown step, you can apply a **content filter** (like BM25 or Pruning) to reduce noise and produce a “fit_markdown”—a heavily pruned version focusing on the pages main text. Well cover these filters shortly.
---
## 3. Configuring the Default Markdown Generator
You can tweak the output by passing an `options` dict to `DefaultMarkdownGenerator`. For example:
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
# Example: ignore all links, don't escape HTML, and wrap text at 80 characters
md_generator = DefaultMarkdownGenerator(
options={
"ignore_links": True,
"escape_html": False,
"body_width": 80
}
)
config = CrawlerRunConfig(
markdown_generator=md_generator
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/docs", config=config)
if result.success:
print("Markdown:\n", result.markdown[:500]) # Just a snippet
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
import asyncio
asyncio.run(main())
```
Some commonly used `options`:
- **`ignore_links`** (bool): Whether to remove all hyperlinks in the final markdown.
- **`ignore_images`** (bool): Remove all `![image]()` references.
- **`escape_html`** (bool): Turn HTML entities into text (default is often `True`).
- **`body_width`** (int): Wrap text at N characters. `0` or `None` means no wrapping.
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.
---
## 4. Content Filters
**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you dont want.
### 4.1 BM25ContentFilter
If you have a **search query**, BM25 is a good choice:
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai import CrawlerRunConfig
bm25_filter = BM25ContentFilter(
user_query="machine learning",
bm25_threshold=1.2,
use_stemming=True
)
md_generator = DefaultMarkdownGenerator(
content_filter=bm25_filter,
options={"ignore_links": True}
)
config = CrawlerRunConfig(markdown_generator=md_generator)
```
- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
- **`use_stemming`**: If `True`, variations of words match (e.g., “learn,” “learning,” “learnt”).
**No query provided?** BM25 tries to infer context from page metadata; without one, it falls back to discarding blocks with low generic relevance scores. In practice, supply a query for the best results.
### 4.2 PruningContentFilter
If you **dont** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
```python
from crawl4ai.content_filter_strategy import PruningContentFilter
prune_filter = PruningContentFilter(
threshold=0.5,
threshold_type="fixed", # or "dynamic"
min_word_threshold=50
)
```
- **`threshold`**: Score boundary. Blocks below this score get removed.
- **`threshold_type`**:
- `"fixed"`: Straight comparison (`score >= threshold` keeps the block).
- `"dynamic"`: The filter adjusts threshold in a data-driven manner.
- **`min_word_threshold`**: Discard blocks under N words as likely too short or unhelpful.
**When to Use PruningContentFilter**
- You want a broad cleanup without a user query.
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
---
## 5. Using Fit Markdown
When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:
1. **`raw_markdown`**: The full unfiltered markdown.
2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.
**Note**:
- In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: youll have a raw version and a filtered (“fit”) version if a filter is used.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
async def main():
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={"ignore_links": True}
)
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://news.example.com/tech", config=config)
if result.success:
print("Raw markdown:\n", result.markdown)
# If a filter is used, we also have .fit_markdown:
md_object = result.markdown_v2 # or your equivalent
print("Filtered markdown:\n", md_object.fit_markdown)
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
---
## 6. The `MarkdownGenerationResult` Object
If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, youll see fields such as:
- **`raw_markdown`**: The direct HTML-to-markdown transformation (no filtering).
- **`markdown_with_citations`**: A version that moves links to reference-style footnotes.
- **`references_markdown`**: A separate string or section containing the gathered references.
- **`fit_markdown`**: The filtered markdown if you used a content filter.
- **`fit_html`**: The corresponding HTML snippet used to generate `fit_markdown` (helpful for debugging or advanced usage).
**Example**:
```python
md_obj = result.markdown_v2 # your librarys naming may vary
print("RAW:\n", md_obj.raw_markdown)
print("CITED:\n", md_obj.markdown_with_citations)
print("REFERENCES:\n", md_obj.references_markdown)
print("FIT:\n", md_obj.fit_markdown)
```
**Why Does This Matter?**
- You can supply `raw_markdown` to an LLM if you want the entire text.
- Or feed `fit_markdown` into a vector database to reduce token usage.
- `references_markdown` can help you keep track of link provenance.
---
## 7. Combining Filters (BM25 + Pruning) in Two Passes
You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank whats left** against a user query (with `BM25ContentFilter`). You dont have to crawl the page twice. Instead:
1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawlers downloaded HTML).
2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.
### Two-Pass Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
async def main():
# 1. Crawl with minimal or no markdown generator, just get raw HTML
config = CrawlerRunConfig(
# If you only want raw HTML, you can skip passing a markdown_generator
# or provide one but focus on .html in this example
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/tech-article", config=config)
if not result.success or not result.html:
print("Crawl failed or no HTML content.")
return
raw_html = result.html
# 2. First pass: PruningContentFilter on raw HTML
pruning_filter = PruningContentFilter(threshold=0.5, min_word_threshold=50)
# filter_content returns a list of "text chunks" or cleaned HTML sections
pruned_chunks = pruning_filter.filter_content(raw_html)
# This list is basically pruned content blocks, presumably in HTML or text form
# For demonstration, let's combine these chunks back into a single HTML-like string
# or you could do further processing. It's up to your pipeline design.
pruned_html = "\n".join(pruned_chunks)
# 3. Second pass: BM25ContentFilter with a user query
bm25_filter = BM25ContentFilter(
user_query="machine learning",
bm25_threshold=1.2,
language="english"
)
bm25_chunks = bm25_filter.filter_content(pruned_html) # returns a list of text chunks
if not bm25_chunks:
print("Nothing matched the BM25 query after pruning.")
return
# 4. Combine or display final results
final_text = "\n---\n".join(bm25_chunks)
print("==== PRUNED OUTPUT (first pass) ====")
print(pruned_html[:500], "... (truncated)") # preview
print("\n==== BM25 OUTPUT (second pass) ====")
print(final_text[:500], "... (truncated)")
if __name__ == "__main__":
asyncio.run(main())
```
### Whats Happening?
1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
2. **PruningContentFilter**: Takes HTML + optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed “noise.” It returns a **list of text chunks**.
3. **Combine or Transform**: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic—whatever suits your pipeline.)
4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query. This second pass further narrows the content to chunks relevant to “machine learning.”
**No Re-Crawling**: We used `raw_html` from the first pass, so theres no need to run `arun()` again—**no second network request**.
### Tips & Variations
- **Plain Text vs. HTML**: If your pruned output is mostly text, BM25 can still handle it; just keep in mind it expects a valid string input. If you supply partial HTML (like `"<p>some text</p>"`), it will parse it as HTML. A tiny sketch of calling the filter directly this way follows this list.
- **Chaining in a Single Pipeline**: If your code supports it, you can chain multiple filters automatically. Otherwise, manual two-pass filtering (as shown) is straightforward.
- **Adjust Thresholds**: If you see too much or too little text in step one, tweak `threshold=0.5` or `min_word_threshold=50`. Similarly, `bm25_threshold=1.2` can be raised/lowered for more or fewer chunks in step two.
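Here is that sketch — calling `filter_content` directly on a small HTML fragment (the snippet text is invented purely for illustration):
```python
from crawl4ai.content_filter_strategy import BM25ContentFilter

snippet = "<p>Gradient descent is a core optimization method in machine learning.</p>"

bm25 = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
chunks = bm25.filter_content(snippet)  # list of matching text chunks (may be empty if nothing clears the threshold)

print(chunks)
```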
### One-Pass Combination?
If your codebase or pipeline design allows applying multiple filters in one pass, you could do so. But often its simpler—and more transparent—to run them sequentially, analyzing each steps result.
**Bottom Line**: By **manually chaining** your filtering logic in two passes, you get powerful incremental control over the final content. First, remove “global” clutter with Pruning, then refine further with BM25-based query relevance—without incurring a second network crawl.
---
## 8. Common Pitfalls & Tips
1. **No Markdown Output?**
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
- Check if your content filter is too aggressive. Lower thresholds or disable the filter to see if content reappears.
2. **Performance Considerations**
- Very large pages with multiple filters can be slower. Consider `cache_mode` to avoid re-downloading (see the sketch after this list).
- If your final use case is LLM ingestion, consider summarizing further or chunking big texts.
3. **Take Advantage of `fit_markdown`**
- Great for RAG pipelines, semantic search, or any scenario where extraneous boilerplate is unwanted.
- Still verify the textual quality—some sites have crucial data in footers or sidebars.
4. **Adjusting `html2text` Options**
- If you see lots of raw HTML slipping into the text, turn on `escape_html`.
- If code blocks look messy, experiment with `mark_code` or `handle_code_in_pre`.
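To illustrate tips 2 and 4, here's a minimal sketch that enables caching and tweaks the HTML-to-text options. It assumes `CacheMode.ENABLED` is available in your version (some versions also accept `cache_mode` directly on `arun()`) and that these option names pass straight through to the underlying HTML-to-text converter:
```python
from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,  # reuse previously downloaded pages where possible
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5),
        options={
            "ignore_links": True,
            "escape_html": True,  # keep stray HTML out of the markdown text
            "mark_code": True     # mark code-like blocks explicitly
        }
    )
)
```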
---
## 9. Summary & Next Steps
In this **Markdown Generation Basics** tutorial, you learned to:
- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).
**Where to go from here**:
- **[Extracting JSON (No LLM)](./json-extraction-basic.md)**: If you need structured data instead of markdown, check out the librarys JSON extraction strategies.
- **[Advanced Features](./advanced-features.md)**: Combine markdown generation with proxies, PDF exports, and more.
- **[Explanations → Content Filters vs. Extraction Strategies](../../explanations/extraction-chunking.md)**: Dive deeper into how filters differ from chunking or semantic extraction.
Now you can produce high-quality Markdown from any website, focusing on exactly the content you need—an essential step for powering AI models, summarization pipelines, or knowledge-base queries.
**Last Updated**: 2024-XX-XX
---
Thats it for **Markdown Generation Basics**! Enjoy generating clean, noise-free markdown for your LLM workflows, content archives, or research.

View File

@@ -0,0 +1,227 @@
# Smart Crawling Techniques
In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now its time to explore some of the **targeted crawling** features that let you:
1. Select specific parts of a webpage using CSS selectors
2. Exclude or ignore certain page elements
3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
4. (Optionally) Handle iframes if your target site embeds additional content
> **Prerequisites**
> - Youve read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
> - You have a working environment for Crawl4AI (Playwright installed, etc.).
---
## 1. Targeting Specific Elements with CSS Selectors
### 1.1 Simple CSS Selector Usage
Lets say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser_cfg = BrowserConfig(headless=True)
crawler_cfg = CrawlerRunConfig(
css_selector=".article-body", # Only capture .article-body content
excluded_tags=["nav", "footer"] # Optional: skip big nav & footer sections
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://news.example.com/story/12345",
config=crawler_cfg
)
if result.success:
print("[OK] Extracted content length:", len(result.html))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Key Parameters**:
- **`css_selector`**: Tells the crawler to focus on `.article-body`.
- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).
**Tip**: For extremely noisy pages, you can further refine how you exclude certain elements by using `excluded_selector`, which takes a CSS selector you want removed from the final output.
### 1.2 Excluding Content with `excluded_selector`
If you want to remove certain sections within `.article-body` (like “related stories” sidebars), set:
```python
CrawlerRunConfig(
css_selector=".article-body",
excluded_selector=".related-stories, .ads-banner"
)
```
This combination grabs the main article content while filtering out sidebars or ads.
---
## 2. Handling Iframes
Some sites embed extra content via `<iframe>` elements—for example, embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:
```python
crawler_cfg = CrawlerRunConfig(
process_iframes=True
)
```
- **`process_iframes=True`**: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into `result.html` and `result.markdown`.
**Warning**: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
---
## 3. Waiting for Dynamic Content
Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Lets look at `wait_for`.
### 3.1 `wait_for` Basics
In `CrawlerRunConfig`, `wait_for` can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses `smart_wait` to interpret what you provide.
```python
crawler_cfg = CrawlerRunConfig(
wait_for="css:.main-article-loaded",
page_timeout=30000
)
```
**Example**: `css:.main-article-loaded` means “Wait for an element with the class `.main-article-loaded` to appear in the DOM.” If it doesnt appear within `30` seconds, youll get a timeout.
### 3.2 Using Explicit Prefixes
**`js:`** and **`css:`** can explicitly tell the crawler which approach to use:
- **`wait_for="css:.comments-section"`** → Wait for `.comments-section` to appear
- **`wait_for="js:() => document.querySelectorAll('.comments').length > 5"`** → Wait until there are at least 6 comment elements
**Code Example**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main():
config = CrawlerRunConfig(
wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
page_timeout=20000 # 20s
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/async-list",
config=config
)
if result.success:
print("[OK] Dynamic items loaded. HTML length:", len(result.html))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
### 3.3 Fallback Logic
If you **dont** prefix `js:` or `css:`, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. Itll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. Its often best to be explicit:
- **`"css:.my-selector"`** → Force CSS
- **`"js:() => myAppState.isReady()"`** → Force JavaScript
**What Should My JavaScript Return?**
- A function that returns `true` once the condition is met (or `false` if it fails).
- The function can be sync or async, but note that the crawler wraps it in an async loop to poll until `true` or timeout.
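For example, a minimal sketch that polls a (hypothetical) `window.myAppState` flag before extraction:
```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    # Poll until the app reports readiness; fail with a timeout after 15s otherwise
    wait_for="js:() => window.myAppState && window.myAppState.isReady === true",
    page_timeout=15000
)
```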
---
## 4. Example: Targeted Crawl with Iframes & Wait-For
Below is a more advanced snippet combining these features:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser_cfg = BrowserConfig(headless=True)
crawler_cfg = CrawlerRunConfig(
css_selector=".main-content",
process_iframes=True,
wait_for="css:.loaded-indicator", # Wait for .loaded-indicator to appear
excluded_tags=["script", "style"], # Remove script/style tags
page_timeout=30000,
verbose=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com/iframe-heavy",
config=crawler_cfg
)
if result.success:
print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
else:
print("[ERROR]", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
**Whats Happening**:
1. **`css_selector=".main-content"`** → Focus only on `.main-content` for final extraction.
2. **`process_iframes=True`** → Recursively handle `<iframe>` content.
3. **`wait_for="css:.loaded-indicator"`** → Dont extract until the page shows `.loaded-indicator`.
4. **`excluded_tags=["script", "style"]`** → Remove script and style tags for a cleaner result.
---
## 5. Common Pitfalls & Tips
1. **Be Explicit**: Using `"js:"` or `"css:"` can spare you headaches if the library guesses incorrectly.
2. **Timeouts**: If the site never triggers your wait condition, a `TimeoutError` can occur. Check your logs or use `verbose=True` for more clues.
3. **Infinite Scroll**: If you have repeated “load more” loops, you might use [Hooks & Custom Code](./hooks-custom.md) or add your own JavaScript for repeated scrolling. A minimal sketch follows this list.
4. **Iframes**: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
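For tip 3, here is a minimal sketch that injects a scroll loop and then waits for a sentinel element. It assumes your version of `CrawlerRunConfig` exposes a `js_code` parameter for custom JavaScript, and both selectors are placeholders:
```python
from crawl4ai import CrawlerRunConfig

scroll_js = """
(async () => {
    for (let i = 0; i < 5; i++) {                        // scroll a few times
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 1000));     // give new items time to load
    }
})();
"""

config = CrawlerRunConfig(
    js_code=scroll_js,             # custom scrolling logic
    wait_for="css:.end-of-feed",   # placeholder element that appears when the feed is exhausted
    page_timeout=45000
)
```
Pass this config to `crawler.arun()` exactly as in the earlier examples.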
---
## 6. Summary & Next Steps
With these **smart crawling techniques** you can:
- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using `wait_for`.
- Merge iframe content into your main page result.
### Where to Go Next?
- **[Link & Media Analysis](./link-media-analysis.md)**: Dive deeper into analyzing extracted links and media items.
- **[Hooks & Custom Code](./hooks-custom.md)**: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- **Reference**: For an exhaustive list of parameters and advanced usage, see [CrawlerRunConfig Reference](../../reference/configuration.md).
If you run into issues or want to see real examples from other users, check the [How-To Guides](../../how-to/) or raise a question on GitHub.
**Last updated**: 2024-XX-XX
---
Thats it for **Smart Crawling Techniques**! Youre now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.