`.
+- **Structural context** – If a node is deeply nested or in a suspected sidebar, it might be deprioritized.
+
+---
+
+## 3. BM25ContentFilter
+
+**BM25** is a classical text ranking algorithm often used in search engines. If you have a **user query** or rely on page metadata to derive a query, BM25 can identify which text chunks best match that query.
+
+### 3.1 Usage Example
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.content_filter_strategy import BM25ContentFilter
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+async def main():
+ # 1) A BM25 filter with a user query
+ bm25_filter = BM25ContentFilter(
+ user_query="startup fundraising tips",
+ # Adjust for stricter or looser results
+ bm25_threshold=1.2
+ )
+
+ # 2) Insert into a Markdown Generator
+ md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
+
+ # 3) Pass to crawler config
+ config = CrawlerRunConfig(
+ markdown_generator=md_generator
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://news.ycombinator.com",
+ config=config
+ )
+ if result.success:
+ print("Fit Markdown (BM25 query-based):")
+ print(result.markdown_v2.fit_markdown)
+ else:
+ print("Error:", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+### 3.2 Parameters
+
+- **`user_query`** (str, optional): E.g. `"machine learning"`. If blank, the filter tries to glean a query from page metadata.
+- **`bm25_threshold`** (float, default 1.0):
+  - Higher → fewer chunks but more relevant.
+  - Lower → more inclusive.
+
+> In more advanced scenarios, you might see parameters like `use_stemming`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.
+
+---
+
+## 4. Accessing the "Fit" Output
+
+After the crawl, your "fit" content is found in **`result.markdown_v2.fit_markdown`**. In future versions, it will be **`result.markdown.fit_markdown`**. Meanwhile:
+
+```python
+fit_md = result.markdown_v2.fit_markdown
+fit_html = result.markdown_v2.fit_html
+```
+
+If the content filter is **BM25**, you might see additional logic or references in `fit_markdown` that highlight relevant segments. If it's **Pruning**, the text is typically well-cleaned but not necessarily matched to a query.
+
+---
+
+## 5. Code Patterns Recap
+
+### 5.1 Pruning
+
+```python
+prune_filter = PruningContentFilter(
+ threshold=0.5,
+ threshold_type="fixed",
+ min_word_threshold=10
+)
+md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
+config = CrawlerRunConfig(markdown_generator=md_generator)
+# => result.markdown_v2.fit_markdown
+```
+
+### 5.2 BM25
+
+```python
+bm25_filter = BM25ContentFilter(
+ user_query="health benefits fruit",
+ bm25_threshold=1.2
+)
+md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
+config = CrawlerRunConfig(markdown_generator=md_generator)
+# => result.markdown_v2.fit_markdown
+```
+
+---
+
+## 6. Combining with `word_count_threshold` & Exclusions
+
+Remember you can also specify:
+
+```python
+config = CrawlerRunConfig(
+ word_count_threshold=10,
+ excluded_tags=["nav", "footer", "header"],
+ exclude_external_links=True,
+ markdown_generator=DefaultMarkdownGenerator(
+ content_filter=PruningContentFilter(threshold=0.5)
+ )
+)
+```
+
+Thus, **multi-level** filtering occurs:
+
+1. The crawler's `excluded_tags` are removed from the HTML first.
+2. The content filter (Pruning, BM25, or custom) prunes or ranks the remaining text blocks.
+3. The final "fit" content is generated in `result.markdown_v2.fit_markdown`.
+
+---
+
+## 7. Custom Filters
+
+If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from `RelevantContentFilter` and implement `filter_content(html)`. Then inject it into your **markdown generator**:
+
+```python
+from bs4 import BeautifulSoup  # assumed HTML parser; swap in your own tooling
+from crawl4ai.content_filter_strategy import RelevantContentFilter
+
+class MyCustomFilter(RelevantContentFilter):
+    def filter_content(self, html, min_word_threshold=None):
+        # Parse the HTML and keep only paragraph blocks with enough words
+        soup = BeautifulSoup(html, "html.parser")
+        min_words = min_word_threshold or 10
+        blocks = (p.get_text(strip=True) for p in soup.find_all("p"))
+        return [b for b in blocks if len(b.split()) >= min_words]
+
+```
+
+**Steps**:
+
+1. Subclass `RelevantContentFilter`.
+2. Implement `filter_content(...)`.
+3. Use it in your `DefaultMarkdownGenerator(content_filter=MyCustomFilter(...))` (see the sketch below).
+
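+For example, a minimal wiring sketch (constructor arguments, if any, are up to your filter):
+
+```python
+md_generator = DefaultMarkdownGenerator(content_filter=MyCustomFilter())
+config = CrawlerRunConfig(markdown_generator=md_generator)
+# => result.markdown_v2.fit_markdown now reflects your custom logic
+```
+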
+---
+
+## 8. Final Thoughts
+
+**Fit Markdown** is a crucial feature for:
+
+- **Summaries**: Quickly get the important text from a cluttered page.
+- **Search**: Combine with **BM25** to produce content relevant to a query.
+- **AI Pipelines**: Filter out boilerplate so LLM-based extraction or summarization runs on denser text.
+
+**Key Points**:
+- **PruningContentFilter**: Great if you just want the "meatiest" text without a user query.
+- **BM25ContentFilter**: Perfect for query-based extraction or searching.
+- Combine with **`excluded_tags`, `exclude_external_links`, `word_count_threshold`** to refine your final "fit" text.
+- Fit markdown ends up in **`result.markdown_v2.fit_markdown`**; eventually **`result.markdown.fit_markdown`** in future versions.
+
+With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant "fit markdown" for your AI or data pipelines. Happy pruning and searching!
+
+- Last Updated: 2025-01-01
\ No newline at end of file
diff --git a/docs/md_v2/core/installation.md b/docs/md_v2/core/installation.md
new file mode 100644
index 00000000..2e1fd431
--- /dev/null
+++ b/docs/md_v2/core/installation.md
@@ -0,0 +1,129 @@
+# Installation & Setup (2025 Edition)
+
+## 1. Basic Installation
+
+```bash
+pip install crawl4ai
+```
+
+This installs the **core** Crawl4AI library along with essential dependencies. **No** advanced features (like transformers or PyTorch) are included yet.
+
+## 2. Initial Setup & Diagnostics
+
+### 2.1 Run the Setup Command
+After installing, call:
+
+```bash
+crawl4ai-setup
+```
+
+**What does it do?**
+- Installs or updates required Playwright browsers (Chromium, Firefox, etc.)
+- Performs OS-level checks (e.g., missing libs on Linux)
+- Confirms your environment is ready to crawl
+
+### 2.2 Diagnostics
+Optionally, you can run **diagnostics** to confirm everything is functioning:
+
+```bash
+crawl4ai-doctor
+```
+
+This command attempts to:
+- Check Python version compatibility
+- Verify Playwright installation
+- Inspect environment variables or library conflicts
+
+If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.
+
+---
+
+## 3. Verifying Installation: A Simple Crawl (Skip this step if you already ran `crawl4ai-doctor`)
+
+Below is a minimal Python script demonstrating a **basic** crawl. It uses our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+async def main():
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://www.example.com",
+ )
+ print(result.markdown[:300]) # Show the first 300 characters of extracted text
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Expected** outcome:
+- A headless browser session loads `example.com`
+- The script prints the first ~300 characters of the extracted markdown.
+
+If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.
+
+---
+
+## 4. Advanced Installation (Optional)
+
+**Warning**: Only install these **if you truly need them**. They bring in larger dependencies, including big models, which can increase disk usage and memory load significantly.
+
+### 4.1 Torch, Transformers, or All
+
+- **Text Clustering (Torch)**
+ ```bash
+ pip install crawl4ai[torch]
+ crawl4ai-setup
+ ```
+ Installs PyTorch-based features (e.g., cosine similarity or advanced semantic chunking).
+
+- **Transformers**
+ ```bash
+ pip install crawl4ai[transformer]
+ crawl4ai-setup
+ ```
+ Adds Hugging Face-based summarization or generation strategies.
+
+- **All Features**
+ ```bash
+ pip install crawl4ai[all]
+ crawl4ai-setup
+ ```
+
+#### (Optional) Pre-Fetching Models
+```bash
+crawl4ai-download-models
+```
+This step caches large models locally (if needed). **Only do this** if your workflow requires them.
+
+---
+
+## 5. Docker (Experimental)
+
+We provide a **temporary** Docker approach for testing. **It's not stable and may break** with future releases. We plan a major Docker revamp in a stable release (targeted for 2025 Q1). If you still want to try:
+
+```bash
+docker pull unclecode/crawl4ai:basic
+docker run -p 11235:11235 unclecode/crawl4ai:basic
+```
+
+You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until our new Docker approach is ready (planned in Jan or Feb 2025).
+
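+For a quick smoke test from Python (the request body below is an assumption; check the Docker docs for the exact schema):
+
+```python
+import requests
+
+# Hypothetical payload shape; the actual API schema may differ
+payload = {"urls": ["https://example.com"]}
+resp = requests.post("http://localhost:11235/crawl", json=payload)
+print(resp.status_code, resp.json())
+```
+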
+---
+
+## 6. Local Server Mode (Legacy)
+
+Some older docs mention running Crawl4AI as a local server. This approach has been **partially replaced** by the new Docker-based prototype and upcoming stable server release. You can experiment, but expect major changes. Official local server instructions will arrive once the new Docker architecture is finalized.
+
+---
+
+## Summary
+
+1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
+2. **Diagnose** with `crawl4ai-doctor` if you see errors.
+3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.
+4. **Advanced** features (Torch, Transformers) are **optional**; avoid them if you don't need them (they significantly increase resource usage).
+5. **Docker** is **experimental**; use at your own risk until the stable version is released.
+6. **Local server** references in older docs are largely deprecated; a new solution is in progress.
+
+**Got questions?** Check [GitHub issues](https://github.com/unclecode/crawl4ai/issues) for updates or ask the community!
\ No newline at end of file
diff --git a/docs/md_v3/tutorials/link-media-analysis.md b/docs/md_v2/core/link-media.md
similarity index 87%
rename from docs/md_v3/tutorials/link-media-analysis.md
rename to docs/md_v2/core/link-media.md
index 229fad8d..ed56e8fb 100644
--- a/docs/md_v3/tutorials/link-media-analysis.md
+++ b/docs/md_v2/core/link-media.md
@@ -1,8 +1,4 @@
-Below is a **draft** of the **"Link & Media Analysis"** tutorial. It demonstrates how to access and filter links, handle domain restrictions, and manage media (especially images) using Crawl4AI's configuration options. Feel free to adjust examples and text to match your exact workflow or preferences.
-
----
-
-# Link & Media Analysis
+# Link & Media
 In this tutorial, you'll learn how to:
@@ -12,7 +8,7 @@ In this tutorial, you'll learn how to:
4. Configure your crawler to exclude or prioritize certain images
> **Prerequisites**
-> - You have completed or are familiar with the [AsyncWebCrawler Basics](./async-webcrawler-basics.md) tutorial.
+> - You have completed or are familiar with the [AsyncWebCrawler Basics](../core/simple-crawling.md) tutorial.
> - You can run Crawl4AI in your environment (Playwright, Python, etc.).
---
@@ -37,8 +33,10 @@ async with AsyncWebCrawler() as crawler:
if result.success:
internal_links = result.links.get("internal", [])
external_links = result.links.get("external", [])
- print(f"Found {len(internal_links)} internal links, {len(external_links)} external links.")
-
+        print(f"Found {len(internal_links)} internal links.")
+        print(f"Found {len(external_links)} external links.")
+        print(f"Found {len(result.media)} media items.")
+
# Each link is typically a dictionary with fields like:
# { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
if internal_links:
@@ -259,37 +257,20 @@ if __name__ == "__main__":
## 5. Common Pitfalls & Tips
-1. **Conflicting Flags**:
+1. **Conflicting Flags**:
- `exclude_external_links=True` but then also specifying `exclude_social_media_links=True` is typically fine, but understand that the first setting already discards *all* external links. The second becomes somewhat redundant.
- `exclude_external_images=True` but want to keep some external images? Currently no partial domain-based setting for images, so you might need a custom approach or hook logic.
-2. **Relevancy Scores**:
+2. **Relevancy Scores**:
   - If your version of Crawl4AI or your scraping strategy includes an `img["score"]`, it's typically a heuristic based on size, position, or content analysis. Evaluate carefully if you rely on it.
-3. **Performance**:
+3. **Performance**:
- Excluding certain domains or external images can speed up your crawl, especially for large, media-heavy pages.
   - If you want a "full" link map, do *not* exclude them. Instead, you can post-filter in your own code.
-4. **Social Media Lists**:
+4. **Social Media Lists**:
- `exclude_social_media_links=True` typically references an internal list of known social domains like Facebook, Twitter, LinkedIn, etc. If you need to add or remove from that list, look for library settings or a local config file (depending on your version).
---
-## 6. Next Steps
-
-Now that you understand how to manage **Link & Media Analysis**, you can:
-
-- Fine-tune which links are stored or discarded in your final results
-- Control which images (or other media) appear in `result.media`
-- Filter out entire domains or social media platforms to keep your dataset relevant
-
-**Recommended Follow-Ups**:
-- **[Advanced Features (Proxy, PDF, Screenshots)](./advanced-features.md)**: If you want to capture screenshots or save the page as a PDF for archival or debugging.
-- **[Hooks & Custom Code](./hooks-custom.md)**: For more specialized logic, such as automated âinfinite scrollâ or repeated âLoad Moreâ button clicks.
-- **Reference**: Check out [CrawlerRunConfig Reference](../../reference/configuration.md) for a comprehensive parameter list.
-
-**Last updated**: 2024-XX-XX
-
----
-
 **That's it for Link & Media Analysis!** You're now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
\ No newline at end of file
diff --git a/docs/md_v2/basic/prefix-based-input.md b/docs/md_v2/core/local-files.md
similarity index 97%
rename from docs/md_v2/basic/prefix-based-input.md
rename to docs/md_v2/core/local-files.md
index 6dfae9d4..ddf27f8c 100644
--- a/docs/md_v2/basic/prefix-based-input.md
+++ b/docs/md_v2/core/local-files.md
@@ -14,7 +14,10 @@ from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_web():
config = CrawlerRunConfig(bypass_cache=True)
async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
+ result = await crawler.arun(
+ url="https://en.wikipedia.org/wiki/apple",
+ config=config
+ )
if result.success:
print("Markdown Content:")
print(result.markdown)
diff --git a/docs/md_v3/tutorials/markdown-basics.md b/docs/md_v2/core/markdown-generation.md
similarity index 83%
rename from docs/md_v3/tutorials/markdown-basics.md
rename to docs/md_v2/core/markdown-generation.md
index 48498709..1f2b190b 100644
--- a/docs/md_v3/tutorials/markdown-basics.md
+++ b/docs/md_v2/core/markdown-generation.md
@@ -1,7 +1,3 @@
-Below is a **draft** of the **Markdown Generation Basics** tutorial that incorporates your current Crawl4AI design and terminology. It introduces the default markdown generator, explains the concept of content filters (BM25 and Pruning), and covers the `MarkdownGenerationResult` object in a coherent, step-by-step manner. Adjust parameters or naming as needed to align with your actual codebase.
-
----
-
# Markdown Generation Basics
 One of Crawl4AI's core features is generating **clean, structured markdown** from web pages. Originally built to solve the problem of extracting only the "actual" content and discarding boilerplate or noise, Crawl4AI's markdown system remains one of its biggest draws for AI workflows.
@@ -13,7 +9,7 @@ In this tutorial, you'll learn:
3. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)
> **Prerequisites**
-> - You've completed or read [AsyncWebCrawler Basics](./async-webcrawler-basics.md) to understand how to run a simple crawl.
+> - You've completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.
---
@@ -45,7 +41,7 @@ if __name__ == "__main__":
```
 **What's happening?**
-- `CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
+- `CrawlerRunConfig( markdown_generator = DefaultMarkdownGenerator() )` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
- The resulting markdown is accessible via `result.markdown`.
---
@@ -166,8 +162,8 @@ prune_filter = PruningContentFilter(
- **`threshold`**: Score boundary. Blocks below this score get removed.
- **`threshold_type`**:
- - `"fixed"`: Straight comparison (`score >= threshold` keeps the block).
- - `"dynamic"`: The filter adjusts threshold in a data-driven manner.
+ - `"fixed"`: Straight comparison (`score >= threshold` keeps the block).
+ - `"dynamic"`: The filter adjusts threshold in a data-driven manner.
- **`min_word_threshold`**: Discard blocks under N words as likely too short or unhelpful.
**When to Use PruningContentFilter**
@@ -180,11 +176,11 @@ prune_filter = PruningContentFilter(
When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:
-1. **`raw_markdown`**: The full unfiltered markdown.
-2. **`fit_markdown`**: A "fit" version where the filter has removed or trimmed noisy segments.
+1. **`raw_markdown`**: The full unfiltered markdown.
+2. **`fit_markdown`**: A "fit" version where the filter has removed or trimmed noisy segments.
**Note**:
-- In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you'll have a raw version and a filtered ("fit") version if a filter is used.
+> In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you'll have a raw version and a filtered ("fit") version if a filter is used.
```python
import asyncio
@@ -251,8 +247,8 @@ Below is a **revised section** under "Combining Filters (BM25 + Pruning)" th
 You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what's left** against a user query (with `BM25ContentFilter`). You don't have to crawl the page twice. Instead:
-1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawler's downloaded HTML).
-2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.
+1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawler's downloaded HTML).
+2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.
### Two-Pass Example
@@ -296,7 +292,8 @@ async def main():
language="english"
)
- bm25_chunks = bm25_filter.filter_content(pruned_html) # returns a list of text chunks
+ # returns a list of text chunks
+ bm25_chunks = bm25_filter.filter_content(pruned_html)
if not bm25_chunks:
print("Nothing matched the BM25 query after pruning.")
@@ -317,10 +314,10 @@ if __name__ == "__main__":
 ### What's Happening?
-1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
-2. **PruningContentFilter**: Takes HTML + optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed "noise." It returns a **list of text chunks**.
-3. **Combine or Transform**: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic, whatever suits your pipeline.)
-4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query. This second pass further narrows the content to chunks relevant to "machine learning."
+1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
+2. **PruningContentFilter**: Takes HTML + optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed "noise." It returns a **list of text chunks**.
+3. **Combine or Transform**: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic, whatever suits your pipeline.)
+4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query. This second pass further narrows the content to chunks relevant to "machine learning."
 **No Re-Crawling**: We used `raw_html` from the first pass, so there's no need to run `arun()` again: **no second network request**.
@@ -340,19 +337,19 @@ If your codebase or pipeline design allows applying multiple filters in one pass
## 8. Common Pitfalls & Tips
-1. **No Markdown Output?**
+1. **No Markdown Output?**
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
- Check if your content filter is too aggressive. Lower thresholds or disable the filter to see if content reappears.
-2. **Performance Considerations**
+2. **Performance Considerations**
- Very large pages with multiple filters can be slower. Consider `cache_mode` to avoid re-downloading.
- If your final use case is LLM ingestion, consider summarizing further or chunking big texts.
-3. **Take Advantage of `fit_markdown`**
+3. **Take Advantage of `fit_markdown`**
   - Great for RAG pipelines, semantic search, or any scenario where extraneous boilerplate is unwanted.
   - Still verify the textual quality; some sites have crucial data in footers or sidebars.
-4. **Adjusting `html2text` Options**
+4. **Adjusting `html2text` Options**
- If you see lots of raw HTML slipping into the text, turn on `escape_html`.
- If code blocks look messy, experiment with `mark_code` or `handle_code_in_pre`.
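+
+   For instance, a sketch (assuming `DefaultMarkdownGenerator` forwards an `options` dict to the underlying html2text converter):
+
+   ```python
+   md_generator = DefaultMarkdownGenerator(
+       options={
+           "escape_html": True,  # escape stray HTML entities
+           "mark_code": True,    # fence code-like blocks
+           "body_width": 0       # disable hard line wrapping
+       }
+   )
+   ```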
@@ -367,16 +364,6 @@ In this **Markdown Generation Basics** tutorial, you learned to:
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).
-**Where to go from here**:
-
-- **[Extracting JSON (No LLM)](./json-extraction-basic.md)**: If you need structured data instead of markdown, check out the libraryâs JSON extraction strategies.
-- **[Advanced Features](./advanced-features.md)**: Combine markdown generation with proxies, PDF exports, and more.
-- **[Explanations â Content Filters vs. Extraction Strategies](../../explanations/extraction-chunking.md)**: Dive deeper into how filters differ from chunking or semantic extraction.
-
 Now you can produce high-quality Markdown from any website, focusing on exactly the content you need, an essential step for powering AI models, summarization pipelines, or knowledge-base queries.
-**Last Updated**: 2024-XX-XX
-
----
-
-That's it for **Markdown Generation Basics**! Enjoy generating clean, noise-free markdown for your LLM workflows, content archives, or research.
\ No newline at end of file
+**Last Updated**: 2025-01-01
diff --git a/docs/md_v2/core/page-interaction.md b/docs/md_v2/core/page-interaction.md
new file mode 100644
index 00000000..5fadc692
--- /dev/null
+++ b/docs/md_v2/core/page-interaction.md
@@ -0,0 +1,343 @@
+# Page Interaction
+
+Crawl4AI provides powerful features for interacting with **dynamic** webpages, handling JavaScript execution, waiting for conditions, and managing multi-step flows. By combining **js_code**, **wait_for**, and certain **CrawlerRunConfig** parameters, you can:
+
+1. Click "Load More" buttons
+2. Fill forms and submit them
+3. Wait for elements or data to appear
+4. Reuse sessions across multiple steps
+
+Below is a quick overview of how to do it.
+
+---
+
+## 1. JavaScript Execution
+
+### Basic Execution
+
+**`js_code`** in **`CrawlerRunConfig`** accepts either a single JS string or a list of JS snippets.
+**Example**: We'll scroll to the bottom of the page, then optionally click a "Load More" button.
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+ # Single JS command
+ config = CrawlerRunConfig(
+ js_code="window.scrollTo(0, document.body.scrollHeight);"
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://news.ycombinator.com", # Example site
+ config=config
+ )
+ print("Crawled length:", len(result.cleaned_html))
+
+ # Multiple commands
+ js_commands = [
+ "window.scrollTo(0, document.body.scrollHeight);",
+ # 'More' link on Hacker News
+ "document.querySelector('a.morelink')?.click();",
+ ]
+ config = CrawlerRunConfig(js_code=js_commands)
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://news.ycombinator.com", # Another pass
+ config=config
+ )
+ print("After scroll+click, length:", len(result.cleaned_html))
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Relevant `CrawlerRunConfig` params**:
+- **`js_code`**: A string or list of strings with JavaScript to run after the page loads.
+- **`js_only`**: If set to `True` on subsequent calls, indicates we're continuing an existing session without a new full navigation.
+- **`session_id`**: If you want to keep the same page across multiple calls, specify an ID.
+
+---
+
+## 2. Wait Conditions
+
+### 2.1 CSS-Based Waiting
+
+Sometimes, you just want to wait for a specific element to appear. For example:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+ config = CrawlerRunConfig(
+ # Wait for at least 30 items on Hacker News
+ wait_for="css:.athing:nth-child(30)"
+ )
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://news.ycombinator.com",
+ config=config
+ )
+ print("We have at least 30 items loaded!")
+ # Rough check
+ print("Total items in HTML:", result.cleaned_html.count("athing"))
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key param**:
+- **`wait_for="css:..."`**: Tells the crawler to wait until that CSS selector is present.
+
+### 2.2 JavaScript-Based Waiting
+
+For more complex conditions (e.g., waiting for content length to exceed a threshold), prefix `js:`:
+
+```python
+wait_condition = """() => {
+ const items = document.querySelectorAll('.athing');
+ return items.length > 50; // Wait for at least 51 items
+}"""
+
+config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
+```
+
+**Behind the Scenes**: Crawl4AI keeps polling the JS function until it returns `true` or a timeout occurs.
+
+---
+
+## 3. Handling Dynamic Content
+
+Many modern sites require **multiple steps**: scrolling, clicking "Load More," or updating via JavaScript. Below are typical patterns.
+
+### 3.1 Load More Example (Hacker News "More" Link)
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+ # Step 1: Load initial Hacker News page
+ config = CrawlerRunConfig(
+ wait_for="css:.athing:nth-child(30)" # Wait for 30 items
+ )
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://news.ycombinator.com",
+ config=config
+ )
+ print("Initial items loaded.")
+
+ # Step 2: Let's scroll and click the "More" link
+ load_more_js = [
+ "window.scrollTo(0, document.body.scrollHeight);",
+ # The "More" link at page bottom
+ "document.querySelector('a.morelink')?.click();"
+ ]
+
+ next_page_conf = CrawlerRunConfig(
+ js_code=load_more_js,
+ wait_for="""js:() => {
+ return document.querySelectorAll('.athing').length > 30;
+ }""",
+ # Mark that we do not re-navigate, but run JS in the same session:
+ js_only=True,
+ session_id="hn_session"
+ )
+
+ # Re-use the same crawler session
+ result2 = await crawler.arun(
+ url="https://news.ycombinator.com", # same URL but continuing session
+ config=next_page_conf
+ )
+ total_items = result2.cleaned_html.count("athing")
+ print("Items after load-more:", total_items)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key params**:
+- **`session_id="hn_session"`**: Keep the same page across multiple calls to `arun()`.
+- **`js_only=True`**: We're not performing a full reload, just applying JS in the existing page.
+- **`wait_for`** with `js:`: Wait for item count to grow beyond 30.
+
+---
+
+### 3.2 Form Interaction
+
+If the site has a search or login form, you can fill fields and submit them with **`js_code`**. For instance, if GitHub had a local search form:
+
+```python
+js_form_interaction = """
+document.querySelector('#your-search').value = 'TypeScript commits';
+document.querySelector('form').submit();
+"""
+
+config = CrawlerRunConfig(
+ js_code=js_form_interaction,
+ wait_for="css:.commit"
+)
+result = await crawler.arun(url="https://github.com/search", config=config)
+```
+
+**In reality**: Replace IDs or classes with the real site's form selectors.
+
+---
+
+## 4. Timing Control
+
+1. **`page_timeout`** (ms): Overall page load or script execution time limit.
+2. **`delay_before_return_html`** (seconds): Wait an extra moment before capturing the final HTML.
+3. **`mean_delay`** & **`max_range`**: If you call `arun_many()` with multiple URLs, these add a random pause between each request.
+
+**Example**:
+
+```python
+config = CrawlerRunConfig(
+ page_timeout=60000, # 60s limit
+ delay_before_return_html=2.5
+)
+```
+
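+For batch crawls, a sketch (assuming `arun_many()` accepts a list of URLs plus the same config):
+
+```python
+config = CrawlerRunConfig(
+    mean_delay=1.0,  # average pause between requests (seconds)
+    max_range=0.5    # random jitter added on top
+)
+results = await crawler.arun_many(
+    ["https://example.com/a", "https://example.com/b"],
+    config=config
+)
+```
+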
+---
+
+## 5. Multi-Step Interaction Example
+
+Below is a simplified script that does multiple "Load More" clicks on GitHub's TypeScript commits page. It **re-uses** the same session to accumulate new commits each time. The code includes the relevant **`CrawlerRunConfig`** parameters you'd rely on.
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+
+async def multi_page_commits():
+ browser_cfg = BrowserConfig(
+ headless=False, # Visible for demonstration
+ verbose=True
+ )
+ session_id = "github_ts_commits"
+
+ base_wait = """js:() => {
+ const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+ return commits.length > 0;
+ }"""
+
+ # Step 1: Load initial commits
+ config1 = CrawlerRunConfig(
+ wait_for=base_wait,
+ session_id=session_id,
+ cache_mode=CacheMode.BYPASS,
+ # Not using js_only yet since it's our first load
+ )
+
+ async with AsyncWebCrawler(config=browser_cfg) as crawler:
+ result = await crawler.arun(
+ url="https://github.com/microsoft/TypeScript/commits/main",
+ config=config1
+ )
+ print("Initial commits loaded. Count:", result.cleaned_html.count("commit"))
+
+ # Step 2: For subsequent pages, we run JS to click 'Next Page' if it exists
+ js_next_page = """
+ const selector = 'a[data-testid="pagination-next-button"]';
+ const button = document.querySelector(selector);
+ if (button) button.click();
+ """
+
+ # Wait until new commits appear
+ wait_for_more = """js:() => {
+ const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+ if (!window.firstCommit && commits.length>0) {
+ window.firstCommit = commits[0].textContent;
+ return false;
+ }
+ // If top commit changes, we have new commits
+ const topNow = commits[0]?.textContent.trim();
+ return topNow && topNow !== window.firstCommit;
+ }"""
+
+ for page in range(2): # let's do 2 more "Next" pages
+ config_next = CrawlerRunConfig(
+ session_id=session_id,
+ js_code=js_next_page,
+ wait_for=wait_for_more,
+ js_only=True, # We're continuing from the open tab
+ cache_mode=CacheMode.BYPASS
+ )
+ result2 = await crawler.arun(
+ url="https://github.com/microsoft/TypeScript/commits/main",
+ config=config_next
+ )
+ print(f"Page {page+2} commits count:", result2.cleaned_html.count("commit"))
+
+ # Optionally kill session
+ await crawler.crawler_strategy.kill_session(session_id)
+
+async def main():
+ await multi_page_commits()
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key Points**:
+
+- **`session_id`**: Keep the same page open.
+- **`js_code`** + **`wait_for`** + **`js_only=True`**: We do partial refreshes, waiting for new commits to appear.
+- **`cache_mode=CacheMode.BYPASS`** ensures we always see fresh data each step.
+
+---
+
+## 6. Combine Interaction with Extraction
+
+Once dynamic content is loaded, you can attach an **`extraction_strategy`** (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:
+
+```python
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+schema = {
+ "name": "Commits",
+ "baseSelector": "li.Box-sc-g0xbh4-0",
+ "fields": [
+ {"name": "title", "selector": "h4.markdown-title", "type": "text"}
+ ]
+}
+config = CrawlerRunConfig(
+ session_id="ts_commits_session",
+ js_code=js_next_page,
+ wait_for=wait_for_more,
+ extraction_strategy=JsonCssExtractionStrategy(schema)
+)
+```
+
+When done, check `result.extracted_content` for the JSON.
+
+---
+
+## 7. Relevant `CrawlerRunConfig` Parameters
+
+Below are the key interaction-related parameters in `CrawlerRunConfig`. For a full list, see [Configuration Parameters](../api/parameters.md).
+
+- **`js_code`**: JavaScript to run after initial load.
+- **`js_only`**: If `True`, no new page navigation; only JS runs in the existing session.
+- **`wait_for`**: CSS (`"css:..."`) or JS (`"js:..."`) expression to wait for.
+- **`session_id`**: Reuse the same page across calls.
+- **`cache_mode`**: Whether to read/write from the cache or bypass.
+- **`remove_overlay_elements`**: Remove certain popups automatically.
+- **`simulate_user`, `override_navigator`, `magic`**: Anti-bot or "human-like" interactions.
+
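+Several of these combined in one config, as a sketch (the selector is hypothetical):
+
+```python
+config = CrawlerRunConfig(
+    js_code="window.scrollTo(0, document.body.scrollHeight);",
+    wait_for="css:.content-loaded",  # hypothetical selector
+    session_id="my_session",
+    cache_mode=CacheMode.BYPASS,
+    remove_overlay_elements=True,
+    magic=True
+)
+```
+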
+---
+
+## 8. Conclusion
+
+Crawl4AI's **page interaction** features let you:
+
+1. **Execute JavaScript** for scrolling, clicks, or form filling.
+2. **Wait** for CSS or custom JS conditions before capturing data.
+3. **Handle** multi-step flows (like "Load More") with partial reloads or persistent sessions.
+4. Combine with **structured extraction** for dynamic sites.
+
+With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
\ No newline at end of file
diff --git a/docs/md_v2/core/quickstart.md b/docs/md_v2/core/quickstart.md
new file mode 100644
index 00000000..c4e6561e
--- /dev/null
+++ b/docs/md_v2/core/quickstart.md
@@ -0,0 +1,362 @@
+# Getting Started with Crawl4AI
+
+Welcome to **Crawl4AI**, an open-source LLM-friendly Web Crawler & Scraper. In this tutorial, you'll:
+
+1. Run your **first crawl** using minimal configuration.
+2. Generate **Markdown** output (and learn how it's influenced by content filters).
+3. Experiment with a simple **CSS-based extraction** strategy.
+4. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
+5. Crawl a **dynamic** page that loads content via JavaScript.
+
+---
+
+## 1. Introduction
+
+Crawl4AI provides:
+
+- An asynchronous crawler, **`AsyncWebCrawler`**.
+- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
+- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
+- Multiple extraction strategies (LLM-based or "traditional" CSS/XPath-based).
+
+By the end of this guide, you'll have performed a basic crawl, generated Markdown, tried out two extraction strategies, and crawled a dynamic page that uses "Load More" buttons or JavaScript updates.
+
+---
+
+## 2. Your First Crawl
+
+Here's a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun("https://example.com")
+ print(result.markdown[:300]) # Print first 300 chars
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**What's happening?**
+- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
+- It fetches `https://example.com`.
+- Crawl4AI automatically converts the HTML into Markdown.
+
+You now have a simple, working crawl!
+
+---
+
+## 3. Basic Configuration (Light Introduction)
+
+Crawl4AI's crawler can be heavily customized using two main classes:
+
+1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
+2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
+
+Below is an example with minimal usage:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+
+async def main():
+ browser_conf = BrowserConfig(headless=True) # or False to see the browser
+ run_conf = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler(config=browser_conf) as crawler:
+ result = await crawler.arun(
+ url="https://example.com",
+ config=run_conf
+ )
+ print(result.markdown)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+> IMPORTANT: By default, the cache mode is set to `CacheMode.ENABLED`, so to get fresh content you need to set it to `CacheMode.BYPASS`.
+
+We'll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
+
+---
+
+## 4. Generating Markdown Output
+
+By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
+
+- **`result.markdown`**:
+ The direct HTML-to-Markdown conversion.
+- **`result.markdown.fit_markdown`**:
+ The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
+
+### Example: Using a Filter with `DefaultMarkdownGenerator`
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+md_generator = DefaultMarkdownGenerator(
+    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+)
+
+config = CrawlerRunConfig(
+    cache_mode=CacheMode.BYPASS,
+    markdown_generator=md_generator
+)
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        print("Raw Markdown length:", len(result.markdown.raw_markdown))
+        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+
+asyncio.run(main())
+```
+
+**Note**: If you do **not** specify a content filter or markdown generator, you'll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We'll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
+
+---
+
+## 5. Simple Data Extraction (CSS-based)
+
+Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def main():
+ schema = {
+ "name": "Example Items",
+ "baseSelector": "div.item",
+ "fields": [
+ {"name": "title", "selector": "h2", "type": "text"},
+ {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
+ ]
+ }
+
+    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="raw://" + raw_html,
+ config=CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ extraction_strategy=JsonCssExtractionStrategy(schema)
+ )
+ )
+ # The JSON output is stored in 'extracted_content'
+ data = json.loads(result.extracted_content)
+ print(data)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Why is this helpful?**
+- Great for repetitive page structures (e.g., item listings, articles).
+- No AI usage or costs.
+- The crawler returns a JSON string you can parse or store.
+
+> Tip: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
+
+---
+
+## 6. Simple Data Extraction (LLM-based)
+
+For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
+
+- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)
+- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
+- Or any provider supported by the underlying library
+
+Below is an example using **open-source** style (no token) and closed-source:
+
+```python
+import os
+import json
+import asyncio
+from typing import Dict
+from pydantic import BaseModel, Field
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+class OpenAIModelFee(BaseModel):
+ model_name: str = Field(..., description="Name of the OpenAI model.")
+ input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+ output_fee: str = Field(
+ ..., description="Fee for output token for the OpenAI model."
+ )
+
+async def extract_structured_data_using_llm(
+ provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
+):
+ print(f"\n--- Extracting Structured Data with {provider} ---")
+
+ if api_token is None and provider != "ollama":
+ print(f"API token is required for {provider}. Skipping this example.")
+ return
+
+ browser_config = BrowserConfig(headless=True)
+
+ extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
+ if extra_headers:
+ extra_args["extra_headers"] = extra_headers
+
+ crawler_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ word_count_threshold=1,
+ page_timeout=80000,
+ extraction_strategy=LLMExtractionStrategy(
+ provider=provider,
+ api_token=api_token,
+ schema=OpenAIModelFee.model_json_schema(),
+ extraction_type="schema",
+ instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
+ Do not miss any models in the entire content.""",
+ extra_args=extra_args,
+ ),
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(
+ url="https://openai.com/api/pricing/", config=crawler_config
+ )
+ print(result.extracted_content)
+
+if __name__ == "__main__":
+ # Use ollama with llama3.3
+ # asyncio.run(
+ # extract_structured_data_using_llm(
+ # provider="ollama/llama3.3", api_token="no-token"
+ # )
+ # )
+
+ asyncio.run(
+ extract_structured_data_using_llm(
+ provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
+ )
+ )
+```
+
+**What's happening?**
+- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.
+- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
+- Depending on the **provider** and **api_token**, you can use local models or a remote API.
+
+---
+
+## 7. Dynamic Content Example
+
+Some sites require multiple "page clicks" or dynamic JavaScript updates. Below is an example that uses **`js_code`** to click through tabbed content and a CSS extraction schema to capture what each tab reveals, using **`BrowserConfig`** and **`CrawlerRunConfig`**:
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def extract_structured_data_using_css_extractor():
+ print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
+ schema = {
+ "name": "KidoCode Courses",
+ "baseSelector": "section.charge-methodology .w-tab-content > div",
+ "fields": [
+ {
+ "name": "section_title",
+ "selector": "h3.heading-50",
+ "type": "text",
+ },
+ {
+ "name": "section_description",
+ "selector": ".charge-content",
+ "type": "text",
+ },
+ {
+ "name": "course_name",
+ "selector": ".text-block-93",
+ "type": "text",
+ },
+ {
+ "name": "course_description",
+ "selector": ".course-content-text",
+ "type": "text",
+ },
+ {
+ "name": "course_icon",
+ "selector": ".image-92",
+ "type": "attribute",
+ "attribute": "src",
+ },
+ ],
+ }
+
+ browser_config = BrowserConfig(headless=True, java_script_enabled=True)
+
+ js_click_tabs = """
+ (async () => {
+ const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
+ for(let tab of tabs) {
+ tab.scrollIntoView();
+ tab.click();
+ await new Promise(r => setTimeout(r, 500));
+ }
+ })();
+ """
+
+ crawler_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ extraction_strategy=JsonCssExtractionStrategy(schema),
+ js_code=[js_click_tabs],
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(
+ url="https://www.kidocode.com/degrees/technology", config=crawler_config
+ )
+
+    courses = json.loads(result.extracted_content)
+    print(f"Successfully extracted {len(courses)} course entries")
+    print(json.dumps(courses[0], indent=2))
+
+async def main():
+ await extract_structured_data_using_css_extractor()
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key Points**:
+
+- **`BrowserConfig(headless=True, java_script_enabled=True)`**: Runs headless with JavaScript enabled; set `headless=False` to watch the tabs being clicked.
+- **`js_code`**: Scrolls to and clicks each tab in turn, pausing briefly so the content can render.
+- **`JsonCssExtractionStrategy`**: Pulls structured JSON out of the revealed content using the CSS schema.
+- **`cache_mode=CacheMode.BYPASS`**: Ensures we fetch fresh content rather than a cached copy.
+
+---
+
+## 8. Next Steps
+
+Congratulations! You have:
+
+1. Performed a basic crawl and printed Markdown.
+2. Used **content filters** with a markdown generator.
+3. Extracted JSON via **CSS** or **LLM** strategies.
+4. Handled **dynamic** pages with JavaScript triggers.
+
+If you're ready for more, check out:
+
+- **Installation**: A deeper dive into advanced installs, Docker usage (experimental), or optional dependencies.
+- **Hooks & Auth**: Learn how to run custom JavaScript or handle logins with cookies, local storage, etc.
+- **Deployment**: Explore ephemeral testing in Docker or plan for the upcoming stable Docker release.
+- **Browser Management**: Delve into user simulation, stealth modes, and concurrency best practices.
+
+Crawl4AI is a powerful, flexible tool. Enjoy building out your scrapers, data pipelines, or AI-driven extraction flows. Happy crawling!
\ No newline at end of file
diff --git a/docs/md_v2/basic/simple-crawling.md b/docs/md_v2/core/simple-crawling.md
similarity index 100%
rename from docs/md_v2/basic/simple-crawling.md
rename to docs/md_v2/core/simple-crawling.md
diff --git a/docs/md_v2/extraction/chunking.md b/docs/md_v2/extraction/chunking.md
index f429310f..2a04a60e 100644
--- a/docs/md_v2/extraction/chunking.md
+++ b/docs/md_v2/extraction/chunking.md
@@ -1,133 +1,144 @@
-## Chunking Strategies
+# Chunking Strategies
+Chunking strategies are critical for dividing large texts into manageable parts, enabling effective content processing and extraction. These strategies are foundational in cosine similarity-based extraction techniques, which allow users to retrieve only the most relevant chunks of content for a given query. Additionally, they facilitate direct integration into RAG (Retrieval-Augmented Generation) systems for structured and scalable workflows.
-Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.
+### Why Use Chunking?
+1. **Cosine Similarity and Query Relevance**: Prepares chunks for semantic similarity analysis.
+2. **RAG System Integration**: Seamlessly processes and stores chunks for retrieval.
+3. **Structured Processing**: Allows for diverse segmentation methods, such as sentence-based, topic-based, or windowed approaches.
-### RegexChunking
+### Methods of Chunking
-`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.
+#### 1. Regex-Based Chunking
+Splits text based on regular expression patterns, useful for coarse segmentation.
-#### When to Use
-- Great for structured text with consistent delimiters.
-- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.
-
-#### Parameters
-- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).
-
-#### Example
+**Code Example**:
```python
-from crawl4ai.chunking_strategy import RegexChunking
+import re
+
+class RegexChunking:
+ def __init__(self, patterns=None):
+ self.patterns = patterns or [r'\n\n'] # Default pattern for paragraphs
-# Define patterns for splitting text
-patterns = [r'\n\n', r'\. ']
-chunker = RegexChunking(patterns=patterns)
+ def chunk(self, text):
+ paragraphs = [text]
+ for pattern in self.patterns:
+ paragraphs = [seg for p in paragraphs for seg in re.split(pattern, p)]
+ return paragraphs
-# Sample text
-text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
+# Example Usage
+text = """This is the first paragraph.
-# Chunk the text
-chunks = chunker.chunk(text)
-print(chunks)
+
+This is the second paragraph."""
+chunker = RegexChunking()
+print(chunker.chunk(text))
```
-### NlpSentenceChunking
+#### 2. Sentence-Based Chunking
+Divides text into sentences using NLP tools, ideal for extracting meaningful statements.
-`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.
-
-#### When to Use
-- Ideal for texts where sentence boundaries are crucial.
-- Useful for creating chunks that preserve grammatical structures.
-
-#### Parameters
-- None.
-
-#### Example
+**Code Example**:
```python
-from crawl4ai.chunking_strategy import NlpSentenceChunking
+from nltk.tokenize import sent_tokenize
+class NlpSentenceChunking:
+ def chunk(self, text):
+ sentences = sent_tokenize(text)
+ return [sentence.strip() for sentence in sentences]
+
+# Example Usage
+text = "This is sentence one. This is sentence two."
chunker = NlpSentenceChunking()
-
-# Sample text
-text = "This is a sample text. It will be split into sentences. Here's another sentence."
-
-# Chunk the text
-chunks = chunker.chunk(text)
-print(chunks)
+print(chunker.chunk(text))
```
-### TopicSegmentationChunking
+#### 3. Topic-Based Segmentation
+Uses algorithms like TextTiling to create topic-coherent chunks.
-`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.
-
-#### When to Use
-- Perfect for long documents with distinct topics.
-- Useful when preserving topic continuity is more important than maintaining text order.
-
-#### Parameters
-- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.
-
-#### Example
+**Code Example**:
```python
-from crawl4ai.chunking_strategy import TopicSegmentationChunking
+from nltk.tokenize import TextTilingTokenizer
-chunker = TopicSegmentationChunking(num_keywords=3)
+class TopicSegmentationChunking:
+ def __init__(self):
+ self.tokenizer = TextTilingTokenizer()
-# Sample text
-text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."
+ def chunk(self, text):
+ return self.tokenizer.tokenize(text)
-# Chunk the text
-chunks = chunker.chunk(text)
-print(chunks)
+# Example Usage
+text = """This is an introduction.
+
+This is a detailed discussion on the topic."""
+chunker = TopicSegmentationChunking()
+print(chunker.chunk(text))
```
-### FixedLengthWordChunking
+#### 4. Fixed-Length Word Chunking
+Segments text into chunks of a fixed word count.
-`FixedLengthWordChunking` splits text into chunks based on a fixed number of words. This ensures each chunk has approximately the same length.
-
-#### When to Use
-- Suitable for processing large texts where uniform chunk size is important.
-- Useful when the number of words per chunk needs to be controlled.
-
-#### Parameters
-- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.
-
-#### Example
+**Code Example**:
```python
-from crawl4ai.chunking_strategy import FixedLengthWordChunking
+class FixedLengthWordChunking:
+ def __init__(self, chunk_size=100):
+ self.chunk_size = chunk_size
-chunker = FixedLengthWordChunking(chunk_size=10)
+ def chunk(self, text):
+ words = text.split()
+ return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
-# Sample text
-text = "This is a sample text. It will be split into chunks of fixed length."
-
-# Chunk the text
-chunks = chunker.chunk(text)
-print(chunks)
+# Example Usage
+text = "This is a long text with many words to be chunked into fixed sizes."
+chunker = FixedLengthWordChunking(chunk_size=5)
+print(chunker.chunk(text))
```
-### SlidingWindowChunking
+#### 5. Sliding Window Chunking
+Generates overlapping chunks for better contextual coherence.
-`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.
-
-#### When to Use
-- Ideal for creating overlapping chunks to preserve context.
-- Useful for tasks where context from adjacent chunks is needed.
-
-#### Parameters
-- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
-- `step` (int, optional): Number of words to slide the window. Default is `50`.
-
-#### Example
+**Code Example**:
```python
-from crawl4ai.chunking_strategy import SlidingWindowChunking
+class SlidingWindowChunking:
+ def __init__(self, window_size=100, step=50):
+ self.window_size = window_size
+ self.step = step
-chunker = SlidingWindowChunking(window_size=10, step=5)
+ def chunk(self, text):
+ words = text.split()
+ chunks = []
+ for i in range(0, len(words) - self.window_size + 1, self.step):
+ chunks.append(' '.join(words[i:i + self.window_size]))
+ return chunks
-# Sample text
-text = "This is a sample text. It will be split using a sliding window approach to preserve context."
-
-# Chunk the text
-chunks = chunker.chunk(text)
-print(chunks)
+# Example Usage
+text = "This is a long text to demonstrate sliding window chunking."
+chunker = SlidingWindowChunking(window_size=5, step=2)
+print(chunker.chunk(text))
```
-With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking!
+### Combining Chunking with Cosine Similarity
+To enhance the relevance of extracted content, chunking strategies can be paired with cosine similarity techniques. Here's an example workflow:
+
+**Code Example**:
+```python
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+
+class CosineSimilarityExtractor:
+ def __init__(self, query):
+ self.query = query
+ self.vectorizer = TfidfVectorizer()
+
+ def find_relevant_chunks(self, chunks):
+ vectors = self.vectorizer.fit_transform([self.query] + chunks)
+ similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
+ return [(chunks[i], similarities[i]) for i in range(len(chunks))]
+
+# Example Workflow
+text = """This is a sample document. It has multiple sentences.
+We are testing chunking and similarity."""
+
+chunker = SlidingWindowChunking(window_size=5, step=3)
+chunks = chunker.chunk(text)
+query = "testing chunking"
+extractor = CosineSimilarityExtractor(query)
+relevant_chunks = extractor.find_relevant_chunks(chunks)
+
+print(relevant_chunks)
+```
diff --git a/docs/md_v2/extraction/cosine.md b/docs/md_v2/extraction/clustring-strategies.md
similarity index 96%
rename from docs/md_v2/extraction/cosine.md
rename to docs/md_v2/extraction/clustring-strategies.md
index 9ce49e40..3fe00fa1 100644
--- a/docs/md_v2/extraction/cosine.md
+++ b/docs/md_v2/extraction/clustring-strategies.md
@@ -56,12 +56,12 @@ CosineStrategy(
### Parameter Details
-1. **semantic_filter**
+1. **semantic_filter**
- Sets the target topic or content type
- Use keywords relevant to your desired content
- Example: "technical specifications", "user reviews", "pricing information"
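+   A quick illustration, using one of the example filter strings above:
+   ```python
+   strategy = CosineStrategy(semantic_filter="pricing information")
+   ```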
-2. **sim_threshold**
+2. **sim_threshold**
- Controls how similar content must be to be grouped together
- Higher values (e.g., 0.8) mean stricter matching
- Lower values (e.g., 0.3) allow more variation
@@ -73,7 +73,7 @@ CosineStrategy(
strategy = CosineStrategy(sim_threshold=0.3)
```
-3. **word_count_threshold**
+3. **word_count_threshold**
- Filters out short content blocks
- Helps eliminate noise and irrelevant content
```python
@@ -81,7 +81,7 @@ CosineStrategy(
strategy = CosineStrategy(word_count_threshold=50)
```
-4. **top_k**
+4. **top_k**
- Number of top content clusters to return
- Higher values return more diverse content
```python
@@ -163,17 +163,17 @@ async def extract_pricing_features(url: str):
## Best Practices
-1. **Adjust Thresholds Iteratively**
+1. **Adjust Thresholds Iteratively**
- Start with default values
- Adjust based on results
- Monitor clustering quality
-2. **Choose Appropriate Word Count Thresholds**
+2. **Choose Appropriate Word Count Thresholds**
- Higher for articles (100+)
- Lower for reviews/comments (20+)
- Medium for product descriptions (50+)
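+   For instance, matching the guidance above:
+   ```python
+   # Articles: keep only substantial blocks (illustrative value)
+   strategy = CosineStrategy(word_count_threshold=100)
+   ```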
-3. **Optimize Performance**
+3. **Optimize Performance**
```python
strategy = CosineStrategy(
word_count_threshold=10, # Filter early
@@ -182,7 +182,7 @@ async def extract_pricing_features(url: str):
)
```
-4. **Handle Different Content Types**
+4. **Handle Different Content Types**
```python
# For mixed content pages
strategy = CosineStrategy(
diff --git a/docs/md_v3/tutorials/json-extraction-llm.md b/docs/md_v2/extraction/llm-strategies.md
similarity index 77%
rename from docs/md_v3/tutorials/json-extraction-llm.md
rename to docs/md_v2/extraction/llm-strategies.md
index 5b9369d9..eddb1072 100644
--- a/docs/md_v3/tutorials/json-extraction-llm.md
+++ b/docs/md_v2/extraction/llm-strategies.md
@@ -1,7 +1,3 @@
-Below is a **draft** of the **Extracting JSON (LLM)** tutorial, illustrating how to use large language models for structured data extraction in Crawl4AI. It highlights key parameters (like chunking, overlap, instruction, schema) and explains how the system remains **provider-agnostic** via LightLLM. Adjust field names or code snippets to match your repository's specifics.
-
----
-
# Extracting JSON (LLM)
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
@@ -24,7 +20,7 @@ In some cases, you need to extract **complex or unstructured** information from
## 2. Provider-Agnostic via LightLLM
-Crawl4AI uses a âprovider stringâ (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
+Crawl4AI uses a "provider string" (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LightLLM supports is fair game. You just provide:
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
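+
+A minimal sketch of wiring these two together (the environment-variable name is just an example):
+
+```python
+import os
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+strategy = LLMExtractionStrategy(
+    provider="openai/gpt-4o",               # any LightLLM-supported provider string
+    api_token=os.getenv("OPENAI_API_KEY"),  # often unnecessary for local models like Ollama
+)
+```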
@@ -38,10 +34,10 @@ This means you **aren't locked** into a single LLM vendor. Switch or experimen
### 3.1 Flow
-1. **Chunking** (optional): The HTML or markdown is split into smaller segments if itâs very long (based on `chunk_token_threshold`, overlap, etc.).
-2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
-3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
-4. **Combining**: The results from each chunk are merged and parsed into JSON.
+1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it's very long (based on `chunk_token_threshold`, overlap, etc.).
+2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
+3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
+4. **Combining**: The results from each chunk are merged and parsed into JSON.
### 3.2 `extraction_type`
@@ -56,20 +52,20 @@ For structured data, `"schema"` is recommended. You provide `schema=YourPydantic
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
-1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
-2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
-3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
-4. **`extraction_type`** (str): `"schema"` or `"block"`.
-5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., âExtract these fields as a JSON array.â
-6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
-7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
-8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
-9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
+1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
+2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
+3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
+4. **`extraction_type`** (str): `"schema"` or `"block"`.
+5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., "Extract these fields as a JSON array."
+6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
+7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
+8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
+9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
- `"markdown"`: The raw markdown (default).
- `"fit_markdown"`: The filtered âfitâ markdown if you used a content filter.
- `"html"`: The cleaned or raw HTML.
-10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
-11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
+10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
+11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
**Example**:
@@ -159,7 +155,7 @@ if __name__ == "__main__":
### 6.1 `chunk_token_threshold`
-If your page is large, you might exceed your LLM's context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates the word-to-token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
+If your page is large, you might exceed your LLM's context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates the word-to-token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
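+
+A hedged sketch of a chunked configuration (the threshold value is illustrative only):
+
+```python
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+strategy = LLMExtractionStrategy(
+    provider="openai/gpt-4o",
+    instruction="Extract the key article fields as JSON.",
+    apply_chunking=True,          # split long pages automatically
+    chunk_token_threshold=1200,   # approximate max tokens per chunk
+    overlap_rate=0.1,             # repeat 10% of each chunk for continuity
+)
+```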
### 6.2 `overlap_rate`
@@ -281,12 +277,12 @@ if __name__ == "__main__":
## 10. Best Practices & Caveats
-1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
-2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
-3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
-4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
-5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
-6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup.
+1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
+2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
+3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
+4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
+5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
+6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup (see the sketch below).
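+
+A small sketch of that post-validation step (the model and its fields are hypothetical):
+
+```python
+import json
+from pydantic import BaseModel, ValidationError
+
+class Product(BaseModel):
+    name: str
+    price: float
+
+def validate_products(raw_json: str) -> list:
+    # `result.extracted_content` arrives as a JSON string
+    items = json.loads(raw_json)
+    valid = []
+    for item in items:
+        try:
+            valid.append(Product(**item))
+        except ValidationError as err:
+            print("Skipping malformed item:", err)
+    return valid
+```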
---
@@ -303,31 +299,31 @@ If your site's data is consistent or repetitive, consider [`JsonCssExtractionS
**Next Steps**:
-1. **Experiment with Different Providers**
+1. **Experiment with Different Providers**
- Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
- Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
-2. **Combine With Other Strategies**
+2. **Combine With Other Strategies**
- Use [content filters](../../how-to/content-filters.md) like BM25 or Pruning prior to LLM extraction to remove noise and reduce token usage.
- Apply a [CSS or XPath extraction strategy](./json-extraction-basic.md) first for obvious, structured data, then send only the tricky parts to the LLM.
-3. **Performance Tuning**
+3. **Performance Tuning**
- If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
- Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
-4. **Validate Outputs**
+4. **Validate Outputs**
- If using `extraction_type="schema"`, parse the LLM's JSON with a Pydantic model for a final validation step.
- Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.
-5. **Explore Hooks & Automation**
+5. **Explore Hooks & Automation**
- Integrate LLM extraction with [hooks](./hooks-custom.md) for complex pre/post-processing.
- Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.
-6. **Scale and Deploy**
+6. **Scale and Deploy**
- Combine your LLM extraction setup with [Docker or other deployment solutions](./docker-quickstart.md) to run at scale.
- Monitor memory usage and concurrency if you call LLMs frequently.
-**Last Updated**: 2024-XX-XX
+**Last Updated**: 2025-01-01
---
diff --git a/docs/md_v3/tutorials/json-extraction-basic.md b/docs/md_v2/extraction/no-llm-strategies.md
similarity index 86%
rename from docs/md_v3/tutorials/json-extraction-basic.md
rename to docs/md_v2/extraction/no-llm-strategies.md
index 1a9b79e6..7429a68b 100644
--- a/docs/md_v3/tutorials/json-extraction-basic.md
+++ b/docs/md_v2/extraction/no-llm-strategies.md
@@ -4,10 +4,10 @@ One of Crawl4AI's **most powerful** features is extracting **structured JSON**
**Why avoid LLM for basic extractions?**
-1. **Faster & Cheaper**: No API calls or GPU overhead.
-2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. A well-defined schema is practically carbon-free.
-3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
-4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.
+1. **Faster & Cheaper**: No API calls or GPU overhead.
+2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. A well-defined schema is practically carbon-free.
+3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
+4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.
Below, we'll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We'll also highlight advanced features like **nested fields** and **base element attributes**.
@@ -18,8 +18,8 @@ Below, we'll explore how to craft these schemas and use them with **JsonCssExt
A schema defines:
1. A **base selector** that identifies each "container" element on the page (e.g., a product row, a blog post card).
-2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
-3. **Nested** or **list** types for repeated or hierarchical structures.
+2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
+3. **Nested** or **list** types for repeated or hierarchical structures.
For example, if you have a list of products, each one might have a name, price, reviews, and "related products." This approach is faster and more reliable than an LLM for consistent, structured pages.
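+
+As a quick taste, a minimal schema for that product example might look like this (the selectors are hypothetical):
+
+```python
+schema = {
+    "name": "Products",
+    "baseSelector": "div.product",  # one match per product card
+    "fields": [
+        {"name": "title", "selector": "h2", "type": "text"},
+        {"name": "price", "selector": ".price", "type": "text"},
+        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
+    ],
+}
+```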
@@ -168,9 +168,9 @@ asyncio.run(extract_crypto_prices_xpath())
**Key Points**:
-1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
-2. **`baseSelector`** and each field's `"selector"` use **XPath** instead of CSS.
-3. **`raw://`** lets us pass `dummy_html` with no real network request (handy for local testing).
+1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
+2. **`baseSelector`** and each field's `"selector"` use **XPath** instead of CSS.
+3. **`raw://`** lets us pass `dummy_html` with no real network request (handy for local testing).
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.
That's how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input, all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
@@ -310,10 +310,10 @@ If all goes well, you get a **structured** JSON array with each "category,"
## 4. Why âNo LLMâ Is Often Better
-1. **Zero Hallucination**: Schema-based extraction doesn't guess text. It either finds it or not.
-2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
-3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
-4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.
+1. **Zero Hallucination**: Schema-based extraction doesn't guess text. It either finds it or not.
+2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
+3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
+4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.
**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema approach first for repeated or consistent data patterns.
@@ -362,13 +362,13 @@ Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post o
## 7. Tips & Best Practices
-1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
-2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
-3. **Test** your schema on partial HTML or a test page before a big crawl.
-4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
-5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
-6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
-7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
+1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
+2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
+3. **Test** your schema on partial HTML or a test page before a big crawl.
+4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig` (see the sketch below).
+5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
+6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
+7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
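+
+For tip 4, a minimal sketch of pairing extraction with dynamic loading (the selectors are hypothetical):
+
+```python
+from crawl4ai import CrawlerRunConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+schema = {"name": "Products", "baseSelector": "div.product",
+          "fields": [{"name": "title", "selector": "h2", "type": "text"}]}
+
+config = CrawlerRunConfig(
+    js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy loading
+    wait_for="css:.product",  # proceed once product cards are rendered
+    extraction_strategy=JsonCssExtractionStrategy(schema),
+)
+```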
---
@@ -388,7 +388,7 @@ With **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy**), you can
**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. A well-crafted schema plus CSS or XPath gets you the data faster, cleaner, and cheaper, which is **the real power** of Crawl4AI.
-**Last Updated**: 2024-XX-XX
+**Last Updated**: 2025-01-01
---
diff --git a/docs/md_v2/extraction/css-advanced.md b/docs/md_v2/extraction/old/css-advanced.md
similarity index 88%
rename from docs/md_v2/extraction/css-advanced.md
rename to docs/md_v2/extraction/old/css-advanced.md
index 393b79a5..0b36fe75 100644
--- a/docs/md_v2/extraction/css-advanced.md
+++ b/docs/md_v2/extraction/old/css-advanced.md
@@ -152,10 +152,10 @@ schema = {
This schema demonstrates several advanced features:
-1. **Nested Objects**: The `details` field is a nested object within each product.
-2. **Simple Lists**: The `features` field is a simple list of text items.
-3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
-4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects.
+1. **Nested Objects**: The `details` field is a nested object within each product.
+2. **Simple Lists**: The `features` field is a simple list of text items.
+3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
+4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects.
Let's break down the key concepts:
@@ -272,11 +272,11 @@ This will produce a structured JSON output that captures the complex hierarchy o
## Tips for Advanced Usage
-1. **Start Simple**: Begin with a basic schema and gradually add complexity.
-2. **Test Incrementally**: Test each part of your schema separately before combining them.
-3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
-4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing.
-5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data (e.g., converting prices to numbers).
-6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
+1. **Start Simple**: Begin with a basic schema and gradually add complexity.
+2. **Test Incrementally**: Test each part of your schema separately before combining them.
+3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
+4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing.
+5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data (e.g., converting prices to numbers); see the field sketch below.
+6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
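+
+For tips 4 and 5, a single field definition might combine both keys (a sketch; the available `transform` values depend on your library version):
+
+```python
+field = {
+    "name": "title",
+    "selector": "h2.product-title",
+    "type": "text",
+    "default": "",              # fallback when the element is missing
+    "transform": "lowercase",   # normalize the extracted text
+}
+```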
By mastering these advanced techniques, you can use JsonCssExtractionStrategy to extract highly structured data from even the most complex web pages, making it a powerful tool for web scraping and data analysis tasks.
\ No newline at end of file
diff --git a/docs/md_v2/extraction/css.md b/docs/md_v2/extraction/old/css.md
similarity index 85%
rename from docs/md_v2/extraction/css.md
rename to docs/md_v2/extraction/old/css.md
index 3b5075a6..9eec8fdc 100644
--- a/docs/md_v2/extraction/css.md
+++ b/docs/md_v2/extraction/old/css.md
@@ -83,17 +83,17 @@ The schema defines how to extract the data:
## Advantages of JsonCssExtractionStrategy
-1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
-2. **Precision**: You can target exactly the elements you need.
-3. **Structured Output**: The result is already structured as JSON, ready for further processing.
-4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.
+1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
+2. **Precision**: You can target exactly the elements you need.
+3. **Structured Output**: The result is already structured as JSON, ready for further processing.
+4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.
## Tips for Using JsonCssExtractionStrategy
-1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
-2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
-3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
-4. **Error Handling**: Always check the `result.success` flag and handle potential failures.
+1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
+2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
+3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
+4. **Error Handling**: Always check the `result.success` flag and handle potential failures, as sketched below.
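+
+For tip 4, a minimal error-handling pattern (a sketch; assumes an `AsyncWebCrawler` context and a `strategy` defined as in this page's examples):
+
+```python
+result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
+if result.success:
+    print(result.extracted_content)
+else:
+    print("Crawl failed:", result.error_message)
+```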
## Advanced Usage: Combining with JavaScript Execution
diff --git a/docs/md_v2/extraction/llm.md b/docs/md_v2/extraction/old/llm.md
similarity index 100%
rename from docs/md_v2/extraction/llm.md
rename to docs/md_v2/extraction/old/llm.md
diff --git a/docs/md_v2/extraction/overview.md b/docs/md_v2/extraction/old/overview.md
similarity index 96%
rename from docs/md_v2/extraction/overview.md
rename to docs/md_v2/extraction/old/overview.md
index 7c524475..2d2883a4 100644
--- a/docs/md_v2/extraction/overview.md
+++ b/docs/md_v2/extraction/old/overview.md
@@ -97,17 +97,17 @@ result = await crawler.arun(
Choose your strategy based on these factors:
-1. **Content Structure**
+1. **Content Structure**
- Well-structured HTML â Use CSS Strategy
- Natural language text â Use LLM Strategy
- Mixed/Complex content â Use Cosine Strategy
-2. **Performance Requirements**
+2. **Performance Requirements**
- Fastest: CSS Strategy
- Moderate: Cosine Strategy
- Variable: LLM Strategy (depends on provider)
-3. **Accuracy Needs**
+3. **Accuracy Needs**
- Highest structure accuracy: CSS Strategy
- Best semantic understanding: LLM Strategy
- Best content relevance: Cosine Strategy
@@ -132,7 +132,7 @@ llm_result = await crawler.arun(
## Common Use Cases
-1. **E-commerce Scraping**
+1. **E-commerce Scraping**
```python
# CSS Strategy for product listings
schema = {
@@ -145,7 +145,7 @@ llm_result = await crawler.arun(
}
```
-2. **News Article Extraction**
+2. **News Article Extraction**
```python
# LLM Strategy for article content
class Article(BaseModel):
@@ -160,7 +160,7 @@ llm_result = await crawler.arun(
)
```
-3. **Content Analysis**
+3. **Content Analysis**
```python
# Cosine Strategy for topic analysis
strategy = CosineStrategy(
@@ -200,17 +200,17 @@ If fit_markdown is requested but not available (no markdown generator or content
## Best Practices
-1. **Choose the Right Strategy**
+1. **Choose the Right Strategy**
- Start with CSS for structured data
- Use LLM for complex interpretation
- Try Cosine for content relevance
-2. **Optimize Performance**
+2. **Optimize Performance**
- Cache LLM results
- Keep CSS selectors specific
- Tune similarity thresholds
-3. **Handle Errors**
+3. **Handle Errors**
```python
result = await crawler.arun(
url="https://example.com",
diff --git a/docs/md_v2/index.md b/docs/md_v2/index.md
index 65ea6da8..a522ea13 100644
--- a/docs/md_v2/index.md
+++ b/docs/md_v2/index.md
@@ -1,113 +1,93 @@
-# Crawl4AI
+# 🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper
-Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
+
-## Introduction
+

-Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you are using it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution with full asynchronous support.
+
-## Quick Start
+[](https://github.com/unclecode/crawl4ai/stargazers)
+[](https://github.com/unclecode/crawl4ai/network/members)
+[](https://badge.fury.io/py/crawl4ai)
+[](https://pypi.org/project/crawl4ai/)
+[](https://pepy.tech/project/crawl4ai)
+[](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
+[](https://github.com/psf/black)
+[](https://github.com/PyCQA/bandit)
-Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
+Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for large language models, AI agents, and data pipelines. Fully open source, flexible, and built for real-time performance, **Crawl4AI** empowers developers with unmatched speed, precision, and deployment ease.
-async def main():
- # Create an instance of AsyncWebCrawler
- async with AsyncWebCrawler(verbose=True) as crawler:
- # Run the crawler on a URL
- result = await crawler.arun(url="https://www.nbcnews.com/business")
+---
- # Print the extracted content
- print(result.markdown)
+## My Personal Journey
-# Run the async main function
-asyncio.run(main())
-```
+I've always loved exploring web development, going back to when HTML and JavaScript were barely intertwined. My curiosity drove me into web development, mathematics, AI, and machine learning, always keeping a close tie to real industrial applications. In 2009–2010, as a postgraduate student, I created platforms to gather and organize published papers for Master's and PhD researchers. Faced with post-grad students' data challenges, I built a helper app to crawl newly published papers and public data. Relying on Internet Explorer and DLL hacks was far more cumbersome than modern tools, which highlights my longtime background in data extraction.
-## Key Features ✨
+Fast-forward to 2023: I needed to fetch web data and transform it into neat **markdown** for my AI pipeline. All the solutions I found were either **closed-source**, overpriced, or produced low-quality output. As someone who has built large edu-tech ventures (like KidoCode), I believe **data belongs to the people**. We shouldn't pay $16 just to parse the web's publicly available content. This friction led me to create my own library, **Crawl4AI**, in a matter of days to meet my immediate needs. Unexpectedly, it went **viral**, accumulating thousands of GitHub stars.
- Completely free and open-source
- Blazing fast performance, outperforming many paid services
- LLM-friendly output formats (JSON, cleaned HTML, markdown)
- Fit markdown generation for extracting main article content.
- Multi-browser support (Chromium, Firefox, WebKit)
- Supports crawling multiple URLs simultaneously
- Extracts and returns all media tags (Images, Audio, and Video)
- Extracts all external and internal links
- Extracts metadata from the page
- Custom hooks for authentication, headers, and page modifications
- User-agent customization
- Takes screenshots of pages with enhanced error handling
- Executes multiple custom JavaScripts before crawling
- Generates structured output without LLM using JsonCssExtractionStrategy
- Various chunking strategies: topic-based, regex, sentence, and more
- Advanced extraction strategies: cosine clustering, LLM, and more
- CSS selector support for precise data extraction
- Passes instructions/keywords to refine extraction
- Proxy support with authentication for enhanced access
- Session management for complex multi-page crawling
- Asynchronous architecture for improved performance
- Improved image processing with lazy-loading detection
- Enhanced handling of delayed content loading
- Custom headers support for LLM interactions
- iframe content extraction for comprehensive analysis
- Flexible timeout and delayed content retrieval options
+Now, in **January 2025**, Crawl4AI has surpassed **21,000 stars** and remains the #1 trending repository. It's my way of giving back to the community after benefiting from open source for years, and I'm thrilled by how many of you share that passion. Thank you for being here. Join our Discord, file issues, submit PRs, or just spread the word. Let's build the best data extraction, crawling, and scraping library **together**.
+
+---
+
+## What Does Crawl4AI Do?
+
+Crawl4AI is a feature-rich crawler and scraper that aims to:
+
+1. **Generate Clean Markdown**: Perfect for RAG pipelines or direct ingestion into LLMs (see the snippet below).
+2. **Structured Extraction**: Parse repeated patterns with CSS, XPath, or LLM-based extraction.
+3. **Advanced Browser Control**: Hooks, proxies, stealth modes, session re-use, fine-grained control.
+4. **High Performance**: Parallel crawling, chunk-based extraction, real-time use cases.
+5. **Open Source**: No forced API keys, no paywalls; everyone can access their data.
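+
+A glimpse of the simplest possible crawl (a sketch; see the Quick Start section for the full walkthrough):
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="https://example.com")
+        print(result.markdown[:300])  # clean, LLM-ready markdown
+
+asyncio.run(main())
+```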
+
+**Core Philosophies**:
+- **Democratize Data**: Free to use, transparent, and highly configurable.
+- **LLM Friendly**: Minimally processed, well-structured text, images, and metadata, so AI models can easily consume it.
+
+---
## Documentation Structure
-Our documentation is organized into several sections:
+To help you get started, we've organized our docs into clear sections:
-### Basic Usage
-- [Installation](basic/installation.md)
-- [Quick Start](basic/quickstart.md)
-- [Simple Crawling](basic/simple-crawling.md)
-- [Browser Configuration](basic/browser-config.md)
-- [Content Selection](basic/content-selection.md)
-- [Output Formats](basic/output-formats.md)
-- [Page Interaction](basic/page-interaction.md)
+- **Setup & Installation**
+ Basic instructions to install Crawl4AI via pip or Docker.
+- **Quick Start**
+ A hands-on introduction showing how to do your first crawl, generate Markdown, and do a simple extraction.
+- **Core**
+ Deeper guides on single-page crawling, advanced browser/crawler parameters, content filtering, and caching.
+- **Advanced**
+ Explore link & media handling, lazy loading, hooking & authentication, proxies, session management, and more.
+- **Extraction**
+ Detailed references for no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and clustering approaches.
+- **API Reference**
+ Find the technical specifics of each class and method, including `AsyncWebCrawler`, `arun()`, and `CrawlResult`.
-### Advanced Features
-- [Magic Mode](advanced/magic-mode.md)
-- [Session Management](advanced/session-management.md)
-- [Hooks & Authentication](advanced/hooks-auth.md)
-- [Proxy & Security](advanced/proxy-security.md)
-- [Content Processing](advanced/content-processing.md)
+Throughout these sections, you'll find code samples you can **copy-paste** into your environment. If something is missing or unclear, raise an issue or PR.
-### Extraction & Processing
-- [Extraction Strategies Overview](extraction/overview.md)
-- [LLM Integration](extraction/llm.md)
-- [CSS-Based Extraction](extraction/css.md)
-- [Cosine Strategy](extraction/cosine.md)
-- [Chunking Strategies](extraction/chunking.md)
+---
-### API Reference
-- [AsyncWebCrawler](api/async-webcrawler.md)
-- [CrawlResult](api/crawl-result.md)
-- [Extraction Strategies](api/strategies.md)
-- [arun() Method Parameters](api/arun.md)
+## How You Can Support
-### Examples
-- Coming soon!
+- **Star & Fork**: If you find Crawl4AI helpful, star the repo on GitHub or fork it to add your own features.
+- **File Issues**: Encounter a bug or missing feature? Let us know by filing an issue, so we can improve.
+- **Pull Requests**: Whether it's a small fix, a big feature, or better docs, contributions are always welcome.
+- **Join Discord**: Come chat about web scraping, crawling tips, or AI workflows with the community.
+- **Spread the Word**: Mention Crawl4AI in your blog posts, talks, or on social media.
-## Getting Started
+**Our mission**: to empower everyone, from students and researchers to entrepreneurs and data scientists, to access, parse, and shape the world's data with speed, cost-efficiency, and creative freedom.
-1. Install Crawl4AI:
-```bash
-pip install crawl4ai
-```
+---
-2. Check out our [Quick Start Guide](basic/quickstart.md) to begin crawling web pages.
+## Quick Links
-3. Explore our [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) to see Crawl4AI in action.
+- **[GitHub Repo](https://github.com/unclecode/crawl4ai)**
+- **[Installation Guide](./core/installation.md)**
+- **[Quick Start](./core/quickstart.md)**
+- **[API Reference](./api/async-webcrawler.md)**
+- **[Changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md)**
-## Support
+Thank you for joining me on this journey. Let's keep building an **open, democratic** approach to data extraction and AI together.
-For questions, suggestions, or issues:
-- GitHub Issues: [Report a Bug](https://github.com/unclecode/crawl4ai/issues)
-- Twitter: [@unclecode](https://twitter.com/unclecode)
-- Website: [crawl4ai.com](https://crawl4ai.com)
-
-Happy Crawling! 🕸️🚀
\ No newline at end of file
+Happy Crawling!
+- *Unclecode, Founder & Maintainer of Crawl4AI*
diff --git a/docs/md_v2/tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md b/docs/md_v2/tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md
deleted file mode 100644
index fb1846b5..00000000
--- a/docs/md_v2/tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md
+++ /dev/null
@@ -1,51 +0,0 @@
-# Crawl4AI
-
-## Episode 1: Introduction to Crawl4AI and Basic Installation
-
-### Quick Intro
-Walk through installation from PyPI, setup, and verification. Show how to install with options like `torch` or `transformer` for advanced capabilities.
-
-Here's a condensed outline of the **Installation and Setup** video content:
-
----
-
-1) **Introduction to Crawl4AI**: Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
-
-2) **Installation Overview**:
-
- - **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).
-
- - **Optional Advanced Installs**:
- - `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- - `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- - `pip install crawl4ai[all]` - Installs all features for complete functionality.
-
-3) **Verifying the Installation**:
-
- - Walk through a simple test script to confirm the setup:
- ```python
- import asyncio
- from crawl4ai import AsyncWebCrawler
-
- async def main():
- async with AsyncWebCrawler(verbose=True) as crawler:
- result = await crawler.arun(url="https://www.example.com")
- print(result.markdown[:500]) # Show first 500 characters
-
- asyncio.run(main())
- ```
- - Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.
-
-4) **Important Tips**:
-
- - **Run** `playwright install` **after installation** to set up dependencies.
- - **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- - If you encounter issues, refer to the documentation or GitHub issues.
-
-5) **Wrap Up**:
-
- - Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).
-
----
-
-This structure provides a concise, effective guide to get viewers up and running with Crawl4AI in minutes.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md b/docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
deleted file mode 100644
index c4fd09df..00000000
--- a/docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# Crawl4AI
-
-## Episode 2: Overview of Advanced Features
-
-### Quick Intro
-A general overview of advanced features like hooks, CSS selectors, and JSON CSS extraction.
-
-Here's a condensed outline for an **Overview of Advanced Features** video covering Crawl4AI's powerful customization and extraction options:
-
----
-
-### **Overview of Advanced Features**
-
-1) **Introduction to Advanced Features**:
-
- - Briefly introduce Crawl4AI's advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.
-
-2) **Taking Screenshots**:
-
- - Explain the screenshot capability for capturing page state and verifying content.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", screenshot=True)
- ```
- - Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
-
-3) **Media and Link Extraction**:
-
- - Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com")
- print("Media:", result.media)
- print("Links:", result.links)
- ```
-
-4) **Custom User Agent**:
-
- - Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
- ```
-
-5) **Custom Hooks for Enhanced Control**:
-
- - Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
- - **Example**: Setting a custom header with `before_get_url` hook.
- ```python
- async def before_get_url(page):
- await page.set_extra_http_headers({"X-Test-Header": "test"})
- ```
-
-6) **CSS Selectors for Targeted Extraction**:
-
- - Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", css_selector="h2")
- print("H2 Tags:", result.extracted_content)
- ```
-
-7) **Crawling Inside Iframes**:
-
- - Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", process_iframes=True)
- ```
-
-8) **Wrap-Up**:
-
- - Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- - Tease upcoming videos where each feature will be explored in detail.
-
----
-
-This covers each advanced feature with a brief example, providing a useful overview to prepare viewers for the more in-depth videos.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_03_Browser_Configurations_&_Headless_Crawling.md b/docs/md_v2/tutorial/episode_03_Browser_Configurations_&_Headless_Crawling.md
deleted file mode 100644
index 45f1a353..00000000
--- a/docs/md_v2/tutorial/episode_03_Browser_Configurations_&_Headless_Crawling.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# Crawl4AI
-
-## Episode 3: Browser Configurations & Headless Crawling
-
-### Quick Intro
-Explain browser options (`chromium`, `firefox`, `webkit`) and settings for headless mode, caching, and verbose logging.
-
-Here's a streamlined outline for the **Browser Configurations & Headless Crawling** video:
-
----
-
-### **Browser Configurations & Headless Crawling**
-
-1) **Overview of Browser Options**:
-
- - Crawl4AI supports three browser engines:
- - **Chromium** (default) - Highly compatible.
- - **Firefox** - Great for specialized use cases.
- - **Webkit** - Lightweight, ideal for basic needs.
- - **Example**:
- ```python
- # Using Chromium (default)
- crawler = AsyncWebCrawler(browser_type="chromium")
-
- # Using Firefox
- crawler = AsyncWebCrawler(browser_type="firefox")
-
- # Using WebKit
- crawler = AsyncWebCrawler(browser_type="webkit")
- ```
-
-2) **Headless Mode**:
-
- - Headless mode runs the browser without a visible GUI, making it faster and less resource-intensive.
- - To enable or disable:
- ```python
- # Headless mode (default is True)
- crawler = AsyncWebCrawler(headless=True)
-
- # Disable headless mode for debugging
- crawler = AsyncWebCrawler(headless=False)
- ```
-
-3) **Verbose Logging**:
- - Use `verbose=True` to get detailed logs for each action, useful for debugging:
- ```python
- crawler = AsyncWebCrawler(verbose=True)
- ```
-
-4) **Running a Basic Crawl with Configuration**:
- - Example of a simple crawl with custom browser settings:
- ```python
- async with AsyncWebCrawler(browser_type="firefox", headless=True, verbose=True) as crawler:
- result = await crawler.arun(url="https://www.example.com")
- print(result.markdown[:500]) # Show first 500 characters
- ```
- - This example uses Firefox in headless mode with logging enabled, demonstrating the flexibility of Crawl4AI's setup.
-
-5) **Recap & Next Steps**:
- - Recap the power of selecting different browsers and running headless mode for speed and efficiency.
- - Tease the next video: **Proxy & Security Settings** for navigating blocked or restricted content and protecting IP identity.
-
----
-
-This breakdown covers browser configuration essentials in Crawl4AI, providing users with practical steps to optimize their scraping setup.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_04_Advanced_Proxy_and_Security_Settings.md b/docs/md_v2/tutorial/episode_04_Advanced_Proxy_and_Security_Settings.md
deleted file mode 100644
index ea235962..00000000
--- a/docs/md_v2/tutorial/episode_04_Advanced_Proxy_and_Security_Settings.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# Crawl4AI
-
-## Episode 4: Advanced Proxy and Security Settings
-
-### Quick Intro
-Showcase proxy configurations (HTTP, SOCKS5, authenticated proxies). Demo: Use rotating proxies and set custom headers to avoid IP blocking and enhance security.
-
-Here's a focused outline for the **Proxy and Security Settings** video:
-
----
-
-### **Proxy & Security Settings**
-
-1) **Why Use Proxies in Web Crawling**:
-
- - Proxies are essential for bypassing IP-based restrictions, improving anonymity, and managing rate limits.
- - Crawl4AI supports simple proxies, authenticated proxies, and proxy rotation for robust web scraping.
-
-2) **Basic Proxy Setup**:
-
- - **Using a Simple Proxy**:
- ```python
- # HTTP proxy
- crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
-
- # SOCKS proxy
- crawler = AsyncWebCrawler(proxy="socks5://proxy.example.com:1080")
- ```
-
-3) **Authenticated Proxies**:
-
- - Use `proxy_config` for proxies requiring a username and password:
- ```python
- proxy_config = {
- "server": "http://proxy.example.com:8080",
- "username": "user",
- "password": "pass"
- }
- crawler = AsyncWebCrawler(proxy_config=proxy_config)
- ```
-
-4) **Rotating Proxies**:
-
- - Rotating proxies helps avoid IP bans by switching IP addresses for each request:
- ```python
- async def get_next_proxy():
- # Define proxy rotation logic here
- return {"server": "http://next.proxy.com:8080"}
-
- async with AsyncWebCrawler() as crawler:
- for url in urls:
- proxy = await get_next_proxy()
- crawler.update_proxy(proxy)
- result = await crawler.arun(url=url)
- ```
- - This setup periodically switches the proxy for enhanced security and access.
-
-5) **Custom Headers for Additional Security**:
-
- - Set custom headers to mask the crawler's identity and avoid detection:
- ```python
- headers = {
- "X-Forwarded-For": "203.0.113.195",
- "Accept-Language": "en-US,en;q=0.9",
- "Cache-Control": "no-cache",
- "Pragma": "no-cache"
- }
- crawler = AsyncWebCrawler(headers=headers)
- ```
-
-6) **Combining Proxies with Magic Mode for Anti-Bot Protection**:
-
- - For sites with aggressive bot detection, combine `proxy` settings with `magic=True`:
- ```python
- async with AsyncWebCrawler(proxy="http://proxy.example.com:8080", headers={"Accept-Language": "en-US"}) as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Enables anti-detection features
- )
- ```
- - **Magic Mode** automatically enables user simulation, random timing, and browser property masking.
-
-7) **Wrap Up & Next Steps**:
-
- - Summarize the importance of proxies and anti-detection in accessing restricted content and avoiding bans.
- - Tease the next video: **JavaScript Execution and Handling Dynamic Content** for working with interactive and dynamically loaded pages.
-
----
-
-This outline provides a practical guide to setting up proxies and security configurations, empowering users to navigate restricted sites while staying undetected.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md b/docs/md_v2/tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md
deleted file mode 100644
index 98d0968f..00000000
--- a/docs/md_v2/tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Crawl4AI
-
-## Episode 5: JavaScript Execution and Dynamic Content Handling
-
-### Quick Intro
-Explain JavaScript code injection with examples (e.g., simulating scrolling, clicking âload moreâ). Demo: Extract content from a page that uses dynamic loading with lazy-loaded images.
-
-Here's a focused outline for the **JavaScript Execution and Dynamic Content Handling** video:
-
----
-
-### **JavaScript Execution & Dynamic Content Handling**
-
-1) **Why JavaScript Execution Matters**:
-
- - Many modern websites load content dynamically via JavaScript, requiring special handling to access all elements.
- - Crawl4AI can execute JavaScript on pages, enabling it to interact with elements like "load more" buttons, infinite scrolls, and content that appears only after certain actions.
-
-2) **Basic JavaScript Execution**:
-
- - Use `js_code` to execute JavaScript commands on a page:
- ```python
- # Scroll to bottom of the page
- result = await crawler.arun(
- url="https://example.com",
- js_code="window.scrollTo(0, document.body.scrollHeight);"
- )
- ```
- - This command scrolls to the bottom, triggering any lazy-loaded or dynamically added content.
-
-3) **Multiple Commands & Simulating Clicks**:
-
- - Combine multiple JavaScript commands to interact with elements like "load more" buttons:
- ```python
- js_commands = [
- "window.scrollTo(0, document.body.scrollHeight);",
- "document.querySelector('.load-more').click();"
- ]
- result = await crawler.arun(
- url="https://example.com",
- js_code=js_commands
- )
- ```
- - This script scrolls down and then clicks the "load more" button, useful for loading additional content blocks.
-
-4) **Waiting for Dynamic Content**:
-
- - Use `wait_for` to ensure the page loads specific elements before proceeding:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- js_code="window.scrollTo(0, document.body.scrollHeight);",
- wait_for="css:.dynamic-content" # Wait for elements with class `.dynamic-content`
- )
- ```
- - This example waits until elements with `.dynamic-content` are loaded, helping to capture content that appears after JavaScript actions.
-
-5) **Handling Complex Dynamic Content (e.g., Infinite Scroll)**:
-
- - Combine JavaScript execution with conditional waiting to handle infinite scrolls or paginated content:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- js_code=[
- "window.scrollTo(0, document.body.scrollHeight);",
- "const loadMore = document.querySelector('.load-more'); if (loadMore) loadMore.click();"
- ],
- wait_for="js:() => document.querySelectorAll('.item').length > 10" # Wait until 10 items are loaded
- )
- ```
- - This example scrolls and clicks "load more" repeatedly, waiting each time for a specified number of items to load.
-
-6) **Complete Example: Dynamic Content Handling with Extraction**:
-
- - Full example demonstrating a dynamic load and content extraction in one process:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- js_code=[
- "window.scrollTo(0, document.body.scrollHeight);",
- "document.querySelector('.load-more').click();"
- ],
- wait_for="css:.main-content",
- css_selector=".main-content"
- )
- print(result.markdown[:500]) # Output the main content extracted
- ```
-
-7) **Wrap Up & Next Steps**:
-
- - Recap how JavaScript execution allows access to dynamic content, enabling powerful interactions.
- - Tease the next video: **Content Cleaning and Fit Markdown** to show how Crawl4AI can extract only the most relevant content from complex pages.
-
----
-
-This outline explains how to handle dynamic content and JavaScript-based interactions effectively, enabling users to scrape and interact with complex, modern websites.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md b/docs/md_v2/tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md
deleted file mode 100644
index dfc3e5a2..00000000
--- a/docs/md_v2/tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# Crawl4AI
-
-## Episode 6: Magic Mode and Anti-Bot Protection
-
-### Quick Intro
-Highlight `Magic Mode` and anti-bot features like user simulation, navigator overrides, and timing randomization. Demo: Access a site with anti-bot protection and show how `Magic Mode` seamlessly handles it.
-
-Here's a concise outline for the **Magic Mode and Anti-Bot Protection** video:
-
----
-
-### **Magic Mode & Anti-Bot Protection**
-
-1) **Why Anti-Bot Protection is Important**:
-
- - Many websites use bot detection mechanisms to block automated scraping. Crawl4AI's anti-detection features help avoid IP bans, CAPTCHAs, and access restrictions.
- - **Magic Mode** is a one-step solution to enable a range of anti-bot features without complex configuration.
-
-2) **Enabling Magic Mode**:
-
- - Simply set `magic=True` to activate Crawl4AI's full anti-bot suite:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Enables all anti-detection features
- )
- ```
- - This enables a blend of stealth techniques, including masking automation signals, randomizing timings, and simulating real user behavior.
-
-3) **What Magic Mode Does Behind the Scenes**:
-
- - **User Simulation**: Mimics human actions like mouse movements and scrolling.
- - **Navigator Overrides**: Hides signals that indicate an automated browser.
- - **Timing Randomization**: Adds random delays to simulate natural interaction patterns.
- - **Cookie Handling**: Accepts and manages cookies dynamically to avoid triggers from cookie pop-ups.
-
-4) **Manual Anti-Bot Options (If Not Using Magic Mode)**:
-
- - For granular control, you can configure individual settings without Magic Mode:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- simulate_user=True, # Enables human-like behavior
- override_navigator=True # Hides automation fingerprints
- )
- ```
- - **Use Cases**: This approach allows more specific adjustments when certain anti-bot features are needed but others are not.
-
-5) **Combining Proxies with Magic Mode**:
-
- - To avoid rate limits or IP blocks, combine Magic Mode with a proxy:
- ```python
- async with AsyncWebCrawler(
- proxy="http://proxy.example.com:8080",
- headers={"Accept-Language": "en-US"}
- ) as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Full anti-detection
- )
- ```
- - This setup maximizes stealth by pairing anti-bot detection with IP obfuscation.
-
-6) **Example of Anti-Bot Protection in Action**:
-
- - Full example with Magic Mode and proxies to scrape a protected page:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/protected-content",
- magic=True,
- proxy="http://proxy.example.com:8080",
- wait_for="css:.content-loaded" # Wait for the main content to load
- )
- print(result.markdown[:500]) # Display first 500 characters of the content
- ```
- - This example ensures seamless access to protected content by combining anti-detection and waiting for full content load.
-
-7) **Wrap Up & Next Steps**:
-
- - Recap the power of Magic Mode and anti-bot features for handling restricted websites.
- - Tease the next video: **Content Cleaning and Fit Markdown** to show how to extract clean and focused content from a page.
-
----
-
-This outline shows users how to easily avoid bot detection and access restricted content, demonstrating both the power and simplicity of Magic Mode in Crawl4AI.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md b/docs/md_v2/tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md
deleted file mode 100644
index 60ef9eea..00000000
--- a/docs/md_v2/tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md
+++ /dev/null
@@ -1,89 +0,0 @@
-# Crawl4AI
-
-## Episode 7: Content Cleaning and Fit Markdown
-
-### Quick Intro
-Explain content cleaning options, including `fit_markdown` to keep only the most relevant content. Demo: Extract and compare regular vs. fit markdown from a news site or blog.
-
-Here's a streamlined outline for the **Content Cleaning and Fit Markdown** video:
-
----
-
-### **Content Cleaning & Fit Markdown**
-
-1) **Overview of Content Cleaning in Crawl4AI**:
-
- - Explain that web pages often include extra elements like ads, navigation bars, footers, and popups.
- - Crawl4AI's content cleaning features help extract only the main content, reducing noise and enhancing readability.
-
-2) **Basic Content Cleaning Options**:
-
- - **Removing Unwanted Elements**: Exclude specific HTML tags, like forms or navigation bars:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- word_count_threshold=10, # Filter out blocks with fewer than 10 words
- excluded_tags=['form', 'nav'], # Exclude specific tags
- remove_overlay_elements=True # Remove popups and modals
- )
- ```
- - This example extracts content while excluding forms, navigation, and modal overlays, ensuring clean results.
-
-3) **Fit Markdown for Main Content Extraction**:
-
- - **What is Fit Markdown**: Uses advanced analysis to identify the most relevant content (ideal for articles, blogs, and documentation).
- - **How it Works**: Analyzes content density, removes boilerplate elements, and maintains formatting for a clear output.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
- main_content = result.fit_markdown # Extracted main content
- print(main_content[:500]) # Display first 500 characters
- ```
- - Fit Markdown is especially helpful for long-form content like news articles or blog posts.
-
-4) **Comparing Fit Markdown with Regular Markdown**:
-
- - **Fit Markdown** returns the primary content without extraneous elements.
- - **Regular Markdown** includes all extracted text in markdown format.
- - Example to show the difference:
- ```python
- all_content = result.markdown # Full markdown
- main_content = result.fit_markdown # Only the main content
-
- print(f"All Content Length: {len(all_content)}")
- print(f"Main Content Length: {len(main_content)}")
- ```
- - This comparison shows the effectiveness of Fit Markdown in focusing on essential content.
-
-5) **Media and Metadata Handling with Content Cleaning**:
-
- - **Media Extraction**: Crawl4AI captures images and videos with metadata like alt text, descriptions, and relevance scores:
- ```python
- for image in result.media["images"]:
- print(f"Source: {image['src']}, Alt Text: {image['alt']}, Relevance Score: {image['score']}")
- ```
- - **Use Case**: Useful for saving only relevant images or videos from an article or content-heavy page.
-
-6) **Example of Clean Content Extraction in Action**:
-
- - Full example extracting cleaned content and Fit Markdown:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- word_count_threshold=10,
- excluded_tags=['nav', 'footer'],
- remove_overlay_elements=True
- )
- print(result.fit_markdown[:500]) # Show main content
- ```
- - This example demonstrates content cleaning with settings for filtering noise and focusing on the core text.
-
-7) **Wrap Up & Next Steps**:
-
-   - Summarize the power of Crawl4AI's content cleaning features and Fit Markdown for capturing clean, relevant content.
- - Tease the next video: **Link Analysis and Smart Filtering** to focus on analyzing and filtering links within crawled pages.
-
----
-
-This outline covers Crawl4AI's content cleaning features and the unique benefits of Fit Markdown, showing users how to retrieve focused, high-quality content from web pages.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md b/docs/md_v2/tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md
deleted file mode 100644
index c0daacad..00000000
--- a/docs/md_v2/tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md
+++ /dev/null
@@ -1,116 +0,0 @@
-# Crawl4AI
-
-## Episode 8: Media Handling: Images, Videos, and Audio
-
-### Quick Intro
-Showcase Crawl4AI's media extraction capabilities, including lazy-loaded media and metadata. Demo: Crawl a multimedia page, extract images, and show metadata (alt text, context, relevance score).
-
-Here's a clear and focused outline for the **Media Handling: Images, Videos, and Audio** video:
-
----
-
-### **Media Handling: Images, Videos, and Audio**
-
-1) **Overview of Media Extraction in Crawl4AI**:
-
- - Crawl4AI can detect and extract different types of media (images, videos, and audio) along with useful metadata.
- - This functionality is essential for gathering visual content from multimedia-heavy pages like e-commerce sites, news articles, and social media feeds.
-
-2) **Image Extraction and Metadata**:
-
- - Crawl4AI captures images with detailed metadata, including:
- - **Source URL**: The direct URL to the image.
- - **Alt Text**: Image description if available.
-     - **Relevance Score**: A score (0–10) indicating how relevant the image is to the main content.
- - **Context**: Text surrounding the image on the page.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
-
- for image in result.media["images"]:
- print(f"Source: {image['src']}")
- print(f"Alt Text: {image['alt']}")
- print(f"Relevance Score: {image['score']}")
- print(f"Context: {image['context']}")
- ```
-   - This example shows how to access each image's metadata, making it easy to filter for the most relevant visuals.
-
-3) **Handling Lazy-Loaded Images**:
-
- - Crawl4AI automatically supports lazy-loaded images, which are commonly used to optimize webpage loading.
- - **Example with Wait for Lazy-Loaded Content**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- wait_for="css:img[data-src]", # Wait for lazy-loaded images
- delay_before_return_html=2.0 # Allow extra time for images to load
- )
- ```
- - This setup waits for lazy-loaded images to appear, ensuring they are fully captured.
-
-4) **Video Extraction and Metadata**:
-
- - Crawl4AI captures video elements, including:
-     - **Source URL**: The video's direct URL.
- - **Type**: Format of the video (e.g., MP4).
- - **Thumbnail**: A poster or thumbnail image if available.
- - **Duration**: Video length, if metadata is provided.
- - **Example**:
- ```python
- for video in result.media["videos"]:
- print(f"Video Source: {video['src']}")
- print(f"Type: {video['type']}")
- print(f"Thumbnail: {video.get('poster')}")
- print(f"Duration: {video.get('duration')}")
- ```
- - This allows users to gather video content and relevant details for further processing or analysis.
-
-5) **Audio Extraction and Metadata**:
-
- - Audio elements can also be extracted, with metadata like:
-     - **Source URL**: The audio file's direct URL.
- - **Type**: Format of the audio file (e.g., MP3).
- - **Duration**: Length of the audio, if available.
- - **Example**:
- ```python
- for audio in result.media["audios"]:
- print(f"Audio Source: {audio['src']}")
- print(f"Type: {audio['type']}")
- print(f"Duration: {audio.get('duration')}")
- ```
- - Useful for sites with podcasts, sound bites, or other audio content.
-
-6) **Filtering Media by Relevance**:
-
- - Use metadata like relevance score to filter only the most useful media content:
- ```python
- relevant_images = [img for img in result.media["images"] if img['score'] > 5]
- ```
- - This is especially helpful for content-heavy pages where you only want media directly related to the main content.
-
-7) **Example: Full Media Extraction with Content Filtering**:
-
- - Full example extracting images, videos, and audio along with filtering by relevance:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- word_count_threshold=10, # Filter content blocks for relevance
- exclude_external_images=True # Only keep internal images
- )
-
-         # Filter images by relevance score (as in step 6) before summarizing
-         relevant_images = [img for img in result.media["images"] if img['score'] > 5]
-
-         # Display media summaries
-         print(f"Relevant Images: {len(relevant_images)}")
- print(f"Videos: {len(result.media['videos'])}")
- print(f"Audio Clips: {len(result.media['audios'])}")
- ```
-   - This example shows how to capture and filter various media types, focusing on what's most relevant.
-
-8) **Wrap Up & Next Steps**:
-
- - Recap the comprehensive media extraction capabilities, emphasizing how metadata helps users focus on relevant content.
- - Tease the next video: **Link Analysis and Smart Filtering** to explore how Crawl4AI handles internal, external, and social media links for more focused data gathering.
-
----
-
-This outline provides users with a complete guide to handling images, videos, and audio in Crawl4AI, using metadata to enhance relevance and precision in multimedia extraction.
diff --git a/docs/md_v2/tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md b/docs/md_v2/tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md
deleted file mode 100644
index 263d77bb..00000000
--- a/docs/md_v2/tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md
+++ /dev/null
@@ -1,95 +0,0 @@
-# Crawl4AI
-
-## Episode 9: Link Analysis and Smart Filtering
-
-### Quick Intro
-Walk through internal and external link classification, social media link filtering, and custom domain exclusion. Demo: Analyze links on a website, focusing on internal navigation vs. external or ad links.
-
-Here's a focused outline for the **Link Analysis and Smart Filtering** video:
-
----
-
-### **Link Analysis & Smart Filtering**
-
-1) **Importance of Link Analysis in Web Crawling**:
-
- - Explain that web pages often contain numerous links, including internal links, external links, social media links, and ads.
-   - Crawl4AI's link analysis and filtering options help extract only relevant links, enabling more targeted and efficient crawls.
-
-2) **Automatic Link Classification**:
-
- - Crawl4AI categorizes links automatically into internal, external, and social media links.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
-
- # Access internal and external links
- internal_links = result.links["internal"]
- external_links = result.links["external"]
-
- # Print first few links for each type
- print("Internal Links:", internal_links[:3])
- print("External Links:", external_links[:3])
- ```
-
-3) **Filtering Out Unwanted Links**:
-
- - **Exclude External Links**: Remove all links pointing to external sites.
- - **Exclude Social Media Links**: Filter out social media domains like Facebook or Twitter.
- - **Example**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- exclude_external_links=True, # Remove external links
- exclude_social_media_links=True # Remove social media links
- )
- ```
-
-4) **Custom Domain Filtering**:
-
- - **Exclude Specific Domains**: Filter links from particular domains, e.g., ad sites.
- - **Custom Social Media Domains**: Add additional social media domains if needed.
- - **Example**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- exclude_domains=["ads.com", "trackers.com"],
- exclude_social_media_domains=["facebook.com", "linkedin.com"]
- )
- ```
-
-5) **Accessing Link Context and Metadata**:
-
- - Crawl4AI provides additional metadata for each link, including its text, type (e.g., navigation or content), and surrounding context.
- - **Example**:
- ```python
- for link in result.links["internal"]:
- print(f"Link: {link['href']}, Text: {link['text']}, Context: {link['context']}")
- ```
- - **Use Case**: Helps users understand the relevance of links based on where they are placed on the page (e.g., navigation vs. article content).
-
-6) **Example of Comprehensive Link Filtering and Analysis**:
-
- - Full example combining link filtering, metadata access, and contextual information:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- exclude_external_links=True,
- exclude_social_media_links=True,
- exclude_domains=["ads.com"],
- css_selector=".main-content" # Focus only on main content area
- )
- for link in result.links["internal"]:
- print(f"Internal Link: {link['href']}, Text: {link['text']}, Context: {link['context']}")
- ```
- - This example filters unnecessary links, keeping only internal and relevant links from the main content area.
-
-7) **Wrap Up & Next Steps**:
-
- - Summarize the benefits of link filtering for efficient crawling and relevant content extraction.
- - Tease the next video: **Custom Headers, Identity Management, and User Simulation** to explain how to configure identity settings and simulate user behavior for stealthier crawls.
-
----
-
-This outline provides a practical overview of Crawl4AI's link analysis and filtering features, helping users target only essential links while eliminating distractions.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md b/docs/md_v2/tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md
deleted file mode 100644
index 6eb928f0..00000000
--- a/docs/md_v2/tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md
+++ /dev/null
@@ -1,93 +0,0 @@
-# Crawl4AI
-
-## Episode 10: Custom Headers, Identity, and User Simulation
-
-### Quick Intro
-Teach how to use custom headers, user-agent strings, and simulate real user interactions. Demo: Set custom user-agent and headers to access a site that blocks typical crawlers.
-
-Here's a concise outline for the **Custom Headers, Identity Management, and User Simulation** video:
-
----
-
-### **Custom Headers, Identity Management, & User Simulation**
-
-1) **Why Customize Headers and Identity in Crawling**:
-
- - Websites often track request headers and browser properties to detect bots. Customizing headers and managing identity help make requests appear more human, improving access to restricted sites.
-
-2) **Setting Custom Headers**:
-
- - Customize HTTP headers to mimic genuine browser requests or meet site-specific requirements:
- ```python
- headers = {
- "Accept-Language": "en-US,en;q=0.9",
- "X-Requested-With": "XMLHttpRequest",
- "Cache-Control": "no-cache"
- }
- crawler = AsyncWebCrawler(headers=headers)
- ```
- - **Use Case**: Customize the `Accept-Language` header to simulate local user settings, or `Cache-Control` to bypass cache for fresh content.
-
-3) **Setting a Custom User Agent**:
-
- - Some websites block requests from common crawler user agents. Setting a custom user agent string helps bypass these restrictions:
- ```python
- crawler = AsyncWebCrawler(
- user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
- )
- ```
- - **Tip**: Use user-agent strings from popular browsers (e.g., Chrome, Firefox) to improve access and reduce detection risks.
-
-4) **User Simulation for Human-like Behavior**:
-
- - Enable `simulate_user=True` to mimic natural user interactions, such as random timing and simulated mouse movements:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- simulate_user=True # Simulates human-like behavior
- )
- ```
- - **Behavioral Effects**: Adds subtle variations in interactions, making the crawler harder to detect on bot-protected sites.
-
-5) **Navigator Overrides and Magic Mode for Full Identity Masking**:
-
- - Use `override_navigator=True` to mask automation indicators like `navigator.webdriver`, which websites check to detect bots:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- override_navigator=True # Masks bot-related signals
- )
- ```
- - **Combining with Magic Mode**: For a complete anti-bot setup, combine these identity options with `magic=True` for maximum protection:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True, # Enables all anti-bot detection features
- user_agent="Custom-Agent", # Custom agent with Magic Mode
- )
- ```
- - This setup includes all anti-detection techniques like navigator masking, random timing, and user simulation.
-
-6) **Example: Comprehensive Setup for Identity Management**:
-
- - A full example combining custom headers, user-agent, and user simulation for a realistic browsing profile:
- ```python
- async with AsyncWebCrawler(
- headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
- user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
- simulate_user=True
- ) as crawler:
- result = await crawler.arun(url="https://example.com/secure-page")
- print(result.markdown[:500]) # Display extracted content
- ```
- - This example enables detailed customization for evading detection and accessing protected pages smoothly.
-
-7) **Wrap Up & Next Steps**:
-
- - Recap the value of headers, user-agent customization, and simulation in bypassing bot detection.
- - Tease the next video: **Extraction Strategies: JSON CSS, LLM, and Cosine** to dive into structured data extraction methods for high-quality content retrieval.
-
----
-
-This outline equips users with tools for managing crawler identity and human-like behavior, essential for accessing bot-protected or restricted websites.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md b/docs/md_v2/tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md
deleted file mode 100644
index b460ff8c..00000000
--- a/docs/md_v2/tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md
+++ /dev/null
@@ -1,186 +0,0 @@
-Here's a detailed outline for the **JSON-CSS Extraction Strategy** video, covering all key aspects and supported structures in Crawl4AI:
-
----
-
-### **10.1 JSON-CSS Extraction Strategy**
-
-#### **1. Introduction to JSON-CSS Extraction**
- - JSON-CSS Extraction is used for pulling structured data from pages with repeated patterns, like product listings, article feeds, or directories.
- - This strategy allows defining a schema with CSS selectors and data fields, making it easy to capture nested, list-based, or singular elements.
-
-#### **2. Basic Schema Structure**
- - **Schema Fields**: The schema has two main components:
- - `baseSelector`: A CSS selector to locate the main elements you want to extract (e.g., each article or product block).
- - `fields`: Defines the data fields for each element, supporting various data types and structures.
-
-#### **3. Simple Field Extraction**
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Sample Product</h2>
-       <span class="price">$19.99</span>
-       <p class="description">This is a sample product.</p>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "price", "selector": ".price", "type": "text"},
- {"name": "description", "selector": ".description", "type": "text"}
- ]
- }
- ```
- - **Explanation**: Each field captures text content from specified CSS selectors within each `.product` element.
-
-#### **4. Supported Field Types: Text, Attribute, HTML, Regex**
- - **Field Type Options**:
- - `text`: Extracts visible text.
- - `attribute`: Captures an HTML attribute (e.g., `src`, `href`).
- - `html`: Extracts the raw HTML of an element.
- - `regex`: Allows regex patterns to extract part of the text.
-
- - **Example HTML** (including an image):
- ```html
-     <div class="product">
-       <h2 class="title">Sample Product</h2>
-       <img class="product-image" src="product-image.jpg" alt="Sample Product">
-       <span class="price">$19.99</span>
-       <p class="description">Limited time offer.</p>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
- {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
- {"name": "description_html", "selector": ".description", "type": "html"}
- ]
- }
- ```
- - **Explanation**:
- - `attribute`: Extracts the `src` attribute from `.product-image`.
- - `regex`: Extracts the numeric part from `$19.99`.
- - `html`: Retrieves the full HTML of the description element.
-
-#### **5. Nested Field Extraction**
- - **Use Case**: Useful when content contains sub-elements, such as an article with author details within it.
- - **Example HTML**:
- ```html
-     <div class="article">
-       <h1 class="title">Sample Article</h1>
-       <div class="author">
-         <span class="name">John Doe</span>
-         <span class="bio">Writer and editor</span>
-       </div>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".article",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "author", "type": "nested", "selector": ".author", "fields": [
- {"name": "name", "selector": ".name", "type": "text"},
- {"name": "bio", "selector": ".bio", "type": "text"}
- ]}
- ]
- }
- ```
- - **Explanation**:
- - `nested`: Extracts `name` and `bio` within `.author`, grouping the author details in a single `author` object.
-
-#### **6. List and Nested List Extraction**
- - **List**: Extracts multiple elements matching the selector as a list.
- - **Nested List**: Allows lists within lists, useful for items with sub-lists (e.g., specifications for each product).
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Product with Features</h2>
-       <ul class="features">
-         <li class="feature">Feature 1</li>
-         <li class="feature">Feature 2</li>
-         <li class="feature">Feature 3</li>
-       </ul>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
- {"name": "feature", "type": "text"}
- ]}
- ]
- }
- ```
- - **Explanation**:
- - `list`: Captures each `.feature` item within `.features`, outputting an array of features under the `features` field.
-
-#### **7. Transformations for Field Values**
- - Transformations allow you to modify extracted values (e.g., converting to lowercase).
- - Supported transformations: `lowercase`, `uppercase`, `strip`.
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Special Product</h2>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"}
- ]
- }
- ```
- - **Explanation**: The `transform` property changes the `title` to uppercase, useful for standardized outputs.
-
-#### **8. Full JSON-CSS Extraction Example**
- - Combining all elements in a single schema example for a comprehensive crawl:
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Featured Product</h2>
-       <img class="product-image" src="featured-product.jpg" alt="Featured Product">
-       <span class="price">$99.99</span>
-       <p class="description">Best product of the year.</p>
-       <ul class="features">
-         <li class="feature">Durable</li>
-         <li class="feature">Eco-friendly</li>
-       </ul>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
- {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
- {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
- {"name": "description", "selector": ".description", "type": "html"},
- {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
- {"name": "feature", "type": "text"}
- ]}
- ]
- }
- ```
-   - **Explanation**: This schema captures and transforms each aspect of the product, illustrating the JSON-CSS strategy's versatility for structured extraction.
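-   - **Running the Schema**: To execute a schema like this, pass it to the crawler as an extraction strategy. Below is a minimal sketch, assuming `JsonCssExtractionStrategy` from `crawl4ai.extraction_strategy` and the `schema` defined above (the URL is a placeholder):
-     ```python
-     import asyncio
-     import json
-     from crawl4ai import AsyncWebCrawler
-     from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-     async def extract_products():
-         async with AsyncWebCrawler() as crawler:
-             result = await crawler.arun(
-                 url="https://example.com/products",  # placeholder listing page
-                 extraction_strategy=JsonCssExtractionStrategy(schema)
-             )
-             # extracted_content is a JSON string: one object per `.product` element
-             products = json.loads(result.extracted_content)
-             print(products)
-
-     if __name__ == "__main__":
-         asyncio.run(extract_products())
-     ```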
-
-#### **9. Wrap Up & Next Steps**
-   - Summarize JSON-CSS Extraction's flexibility for structured, pattern-based extraction.
- - Tease the next video: **10.2 LLM Extraction Strategy**, focusing on using language models to extract data based on intelligent content analysis.
-
----
-
-This outline covers each JSON-CSS Extraction option in Crawl4AI, with practical examples and schema configurations, making it a thorough guide for users.
diff --git a/docs/md_v2/tutorial/episode_11_2_Extraction_Strategies_LLM.md b/docs/md_v2/tutorial/episode_11_2_Extraction_Strategies_LLM.md
deleted file mode 100644
index a9f00e92..00000000
--- a/docs/md_v2/tutorial/episode_11_2_Extraction_Strategies_LLM.md
+++ /dev/null
@@ -1,153 +0,0 @@
-# Crawl4AI
-
-## Episode 11: Extraction Strategies: JSON CSS, LLM, and Cosine
-
-### Quick Intro
-Introduce JSON CSS Extraction Strategy for structured data, LLM Extraction Strategy for intelligent parsing, and Cosine Strategy for clustering similar content. Demo: Use JSON CSS to scrape product details from an e-commerce site.
-
-Here's a comprehensive outline for the **LLM Extraction Strategy** video, covering key details and example applications.
-
----
-
-### **10.2 LLM Extraction Strategy**
-
-#### **1. Introduction to LLM Extraction Strategy**
- - The LLM Extraction Strategy leverages language models to interpret and extract structured data from complex web content.
- - Unlike traditional CSS selectors, this strategy uses natural language instructions and schemas to guide the extraction, ideal for unstructured or diverse content.
- - Supports **OpenAI**, **Azure OpenAI**, **HuggingFace**, and **Ollama** models, enabling flexibility with both proprietary and open-source providers.
-
-#### **2. Key Components of LLM Extraction Strategy**
- - **Provider**: Specifies the LLM provider (e.g., OpenAI, HuggingFace, Azure).
- - **API Token**: Required for most providers, except Ollama (local LLM model).
- - **Instruction**: Custom extraction instructions sent to the model, providing flexibility in how the data is structured and extracted.
- - **Schema**: Optional, defines structured fields to organize extracted data into JSON format.
- - **Extraction Type**: Supports `"block"` for simpler text blocks or `"schema"` when a structured output format is required.
- - **Chunking Parameters**: Breaks down large documents, with options to adjust chunk size and overlap rate for more accurate extraction across lengthy texts.
-
-#### **3. Basic Extraction Example: OpenAI Model Pricing**
- - **Goal**: Extract model names and their input and output fees from the OpenAI pricing page.
- - **Schema Definition**:
- - **Model Name**: Text for model identification.
- - **Input Fee**: Token cost for input processing.
- - **Output Fee**: Token cost for output generation.
-
- - **Schema**:
-     ```python
-     from pydantic import BaseModel, Field
-
-     class OpenAIModelFee(BaseModel):
-         model_name: str = Field(..., description="Name of the OpenAI model.")
-         input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-         output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
-     ```
-
- - **Example Code**:
- ```python
- async def extract_openai_pricing():
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://openai.com/api/pricing/",
- extraction_strategy=LLMExtractionStrategy(
- provider="openai/gpt-4o",
- api_token=os.getenv("OPENAI_API_KEY"),
- schema=OpenAIModelFee.schema(),
- extraction_type="schema",
- instruction="Extract model names and fees for input and output tokens from the page."
- ),
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - The extraction strategy combines a schema and detailed instruction to guide the LLM in capturing structured data.
-     - Each model's name, input fee, and output fee are extracted in JSON format.
-
-#### **4. Knowledge Graph Extraction Example**
- - **Goal**: Extract entities and their relationships from a document for use in a knowledge graph.
- - **Schema Definition**:
- - **Entities**: Individual items with descriptions (e.g., people, organizations).
- - **Relationships**: Connections between entities, including descriptions and relationship types.
-
- - **Schema**:
- ```python
- class Entity(BaseModel):
- name: str
- description: str
-
- class Relationship(BaseModel):
- entity1: Entity
- entity2: Entity
- description: str
- relation_type: str
-
- class KnowledgeGraph(BaseModel):
- entities: List[Entity]
- relationships: List[Relationship]
- ```
-
- - **Example Code**:
- ```python
- async def extract_knowledge_graph():
- extraction_strategy = LLMExtractionStrategy(
- provider="azure/gpt-4o-mini",
- api_token=os.getenv("AZURE_API_KEY"),
- schema=KnowledgeGraph.schema(),
- extraction_type="schema",
- instruction="Extract entities and relationships from the content to build a knowledge graph."
- )
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/some-article",
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - In this setup, the LLM extracts entities and their relationships based on the schema and instruction.
- - The schema organizes results into a JSON-based knowledge graph format.
-
-#### **5. Key Settings in LLM Extraction**
- - **Chunking Options**:
- - For long pages, set `chunk_token_threshold` to specify maximum token count per section.
- - Adjust `overlap_rate` to control the overlap between chunks, useful for contextual consistency.
- - **Example**:
- ```python
- extraction_strategy = LLMExtractionStrategy(
- provider="openai/gpt-4",
- api_token=os.getenv("OPENAI_API_KEY"),
- chunk_token_threshold=3000,
- overlap_rate=0.2, # 20% overlap between chunks
- instruction="Extract key insights and relationships."
- )
- ```
- - This setup ensures that longer texts are divided into manageable chunks with slight overlap, enhancing the quality of extraction.
-
-#### **6. Flexible Provider Options for LLM Extraction**
- - **Using Proprietary Models**: OpenAI, Azure, and HuggingFace provide robust language models, often suited for complex or detailed extractions.
- - **Using Open-Source Models**: Ollama and other open-source models can be deployed locally, suitable for offline or cost-effective extraction.
- - **Example Call**:
- ```python
- await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
- await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
- await extract_structured_data_using_llm("ollama/llama3.2")
- ```
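-   - **Helper Sketch**: `extract_structured_data_using_llm` is not defined in this outline; a minimal sketch of what it might look like, reusing the `OpenAIModelFee` schema and pricing URL from the earlier example (only the provider string and optional token vary):
-     ```python
-     async def extract_structured_data_using_llm(provider, api_token=None):
-         # Same pattern as extract_openai_pricing, parameterized by provider
-         async with AsyncWebCrawler() as crawler:
-             result = await crawler.arun(
-                 url="https://openai.com/api/pricing/",
-                 extraction_strategy=LLMExtractionStrategy(
-                     provider=provider,
-                     api_token=api_token,  # Ollama runs locally and needs no token
-                     schema=OpenAIModelFee.schema(),
-                     extraction_type="schema",
-                     instruction="Extract model names and fees for input and output tokens."
-                 ),
-                 cache_mode=CacheMode.BYPASS
-             )
-             print(result.extracted_content)
-     ```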
-
-#### **7. Complete Example of LLM Extraction Setup**
- - Code to run both the OpenAI pricing and Knowledge Graph extractions, using various providers:
- ```python
-     import asyncio
-
-     async def main():
- await extract_openai_pricing()
- await extract_knowledge_graph()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **8. Wrap Up & Next Steps**
- - Recap the power of LLM extraction for handling unstructured or complex data extraction tasks.
- - Tease the next video: **10.3 Cosine Similarity Strategy** for clustering similar content based on semantic similarity.
-
----
-
-This outline explains LLM Extraction in Crawl4AI, with examples showing how to extract structured data using custom schemas and instructions. It demonstrates flexibility with multiple providers, ensuring practical application for different use cases.
diff --git a/docs/md_v2/tutorial/episode_11_3_Extraction_Strategies_Cosine.md b/docs/md_v2/tutorial/episode_11_3_Extraction_Strategies_Cosine.md
deleted file mode 100644
index 6100ae4c..00000000
--- a/docs/md_v2/tutorial/episode_11_3_Extraction_Strategies_Cosine.md
+++ /dev/null
@@ -1,136 +0,0 @@
-# Crawl4AI
-
-## Episode 11: Extraction Strategies: JSON CSS, LLM, and Cosine
-
-### Quick Intro
-Introduce JSON CSS Extraction Strategy for structured data, LLM Extraction Strategy for intelligent parsing, and Cosine Strategy for clustering similar content. Demo: Use JSON CSS to scrape product details from an e-commerce site.
-
-Here's a structured outline for the **Cosine Similarity Strategy** video, covering key concepts, configuration, and a practical example.
-
----
-
-### **10.3 Cosine Similarity Strategy**
-
-#### **1. Introduction to Cosine Similarity Strategy**
- - The Cosine Similarity Strategy clusters content by semantic similarity, offering an efficient alternative to LLM-based extraction, especially when speed is a priority.
- - Ideal for grouping similar sections of text, this strategy is well-suited for pages with content sections that may need to be classified or tagged, like news articles, product descriptions, or reviews.
-
-#### **2. Key Configuration Options**
- - **semantic_filter**: A keyword-based filter to focus on relevant content.
- - **word_count_threshold**: Minimum number of words per cluster, filtering out shorter, less meaningful clusters.
- - **max_dist**: Maximum allowable distance between elements in clusters, impacting cluster tightness.
- - **linkage_method**: Method for hierarchical clustering, such as `'ward'` (for well-separated clusters).
- - **top_k**: Specifies the number of top categories for each cluster.
- - **model_name**: Defines the model for embeddings, such as `sentence-transformers/all-MiniLM-L6-v2`.
- - **sim_threshold**: Minimum similarity threshold for filtering, allowing control over cluster relevance.
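-   - **All Options Together**: A sketch instantiating `CosineStrategy` with every option listed above (the values are illustrative, not recommendations):
-     ```python
-     extraction_strategy = CosineStrategy(
-         semantic_filter="technology",      # keyword focus
-         word_count_threshold=10,           # minimum words per cluster
-         max_dist=0.2,                      # cluster tightness
-         linkage_method="ward",             # hierarchical clustering linkage
-         top_k=3,                           # top categories per cluster
-         model_name="sentence-transformers/all-MiniLM-L6-v2",
-         sim_threshold=0.3                  # minimum similarity for filtering
-     )
-     ```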
-
-#### **3. How Cosine Similarity Clustering Works**
- - **Step 1**: Embeddings are generated for each text section, transforming them into vectors that capture semantic meaning.
- - **Step 2**: Hierarchical clustering groups similar sections based on cosine similarity, forming clusters with related content.
- - **Step 3**: Clusters are filtered based on word count, removing those below the `word_count_threshold`.
- - **Step 4**: Each cluster is then categorized with tags, if enabled, providing context to each grouped content section.
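-   - **Standalone Sketch**: A toy illustration of Steps 1 and 2, using `sentence-transformers` and scikit-learn directly; this shows the mechanism, not Crawl4AI's internal code:
-     ```python
-     from sentence_transformers import SentenceTransformer
-     from sklearn.metrics.pairwise import cosine_similarity
-
-     model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
-     sections = [
-         "Markets are up this quarter.",
-         "Analysts expect tech stocks to keep growing.",
-         "The team is preparing for the upcoming season."
-     ]
-     embeddings = model.encode(sections)            # Step 1: semantic vectors
-     print(cosine_similarity(embeddings).round(2))  # Step 2: pairwise similarity matrix
-     # The two finance sentences score high together and low against the sports one
-     ```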
-
-#### **4. Example Use Case: Clustering Blog Article Sections**
- - **Goal**: Group related sections of a blog or news page to identify distinct topics or discussion areas.
- - **Example HTML Sections**:
- ```text
- "The economy is showing signs of recovery, with markets up this quarter.",
- "In the sports world, several major teams are preparing for the upcoming season.",
- "New advancements in AI technology are reshaping the tech landscape.",
- "Market analysts are optimistic about continued growth in tech stocks."
- ```
-
- - **Code Setup**:
- ```python
- async def extract_blog_sections():
- extraction_strategy = CosineStrategy(
- word_count_threshold=15,
- max_dist=0.3,
- sim_threshold=0.2,
- model_name="sentence-transformers/all-MiniLM-L6-v2",
- top_k=2
- )
- async with AsyncWebCrawler() as crawler:
- url = "https://example.com/blog-page"
- result = await crawler.arun(
- url=url,
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - **word_count_threshold**: Ensures only clusters with meaningful content are included.
- - **sim_threshold**: Filters out clusters with low similarity, focusing on closely related sections.
- - **top_k**: Selects top tags, useful for identifying main topics.
-
-#### **5. Applying Semantic Filtering with Cosine Similarity**
-   - **Semantic Filter**: Filters sections based on relevance to a specific keyword, such as "technology" for tech articles.
- - **Example Code**:
- ```python
- extraction_strategy = CosineStrategy(
- semantic_filter="technology",
- word_count_threshold=10,
- max_dist=0.25,
- model_name="sentence-transformers/all-MiniLM-L6-v2"
- )
- ```
- - **Explanation**:
-     - **semantic_filter**: Only sections with high similarity to the "technology" keyword will be included in the clustering, making it easy to focus on specific topics within a mixed-content page.
-
-#### **6. Clustering Product Reviews by Similarity**
-   - **Goal**: Organize product reviews by themes, such as "price," "quality," or "durability."
- - **Example Reviews**:
- ```text
- "The quality of this product is outstanding and well worth the price.",
- "I found the product to be durable but a bit overpriced.",
- "Great value for the money and long-lasting.",
- "The build quality is good, but I expected a lower price point."
- ```
-
- - **Code Setup**:
- ```python
- async def extract_product_reviews():
- extraction_strategy = CosineStrategy(
- word_count_threshold=20,
- max_dist=0.35,
- sim_threshold=0.25,
- model_name="sentence-transformers/all-MiniLM-L6-v2"
- )
- async with AsyncWebCrawler() as crawler:
- url = "https://example.com/product-reviews"
- result = await crawler.arun(
- url=url,
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - This configuration clusters similar reviews, grouping feedback by common themes, helping businesses understand customer sentiments around particular product aspects.
-
-#### **7. Performance Advantages of Cosine Strategy**
-   - **Speed**: The Cosine Similarity Strategy is faster than LLM-based extraction, as it doesn't rely on API calls to external LLMs.
- - **Local Processing**: The strategy runs locally with pre-trained sentence embeddings, ideal for high-throughput scenarios where cost and latency are concerns.
- - **Comparison**: With a well-optimized local model, this method can perform clustering on large datasets quickly, making it suitable for tasks requiring rapid, repeated analysis.
-
-#### **8. Full Code Example: Running Both Clustering Demos**
- - **Code**:
- ```python
-     import asyncio
-
-     async def main():
- await extract_blog_sections()
- await extract_product_reviews()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **9. Wrap Up & Next Steps**
- - Recap the efficiency and effectiveness of Cosine Similarity for clustering related content quickly.
-   - Close with a reminder of Crawl4AI's flexibility across extraction strategies, and prompt users to experiment with different settings to optimize clustering for their specific content.
-
----
-
-This outline covers the Cosine Similarity Strategy's speed and effectiveness, providing examples that showcase its potential for clustering various content types efficiently.
diff --git a/docs/md_v2/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md b/docs/md_v2/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md
deleted file mode 100644
index d1ab813d..00000000
--- a/docs/md_v2/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md
+++ /dev/null
@@ -1,140 +0,0 @@
-# Crawl4AI
-
-## Episode 12: Session-Based Crawling for Dynamic Websites
-
-### Quick Intro
-Show session management for handling websites with multiple pages or actions (like "load more" buttons). Demo: Crawl a paginated content page, persisting session data across multiple requests.
-
-Here's a detailed outline for the **Session-Based Crawling for Dynamic Websites** video, explaining why sessions are necessary, how to use them, and providing practical examples and a visual diagram to illustrate the concept.
-
----
-
-### **11. Session-Based Crawling for Dynamic Websites**
-
-#### **1. Introduction to Session-Based Crawling**
- - **What is Session-Based Crawling**: Session-based crawling maintains a continuous browsing session across multiple page states, allowing the crawler to interact with a page and retrieve content that loads dynamically or based on user interactions.
-   - **Why It's Needed**:
- - In static pages, all content is available directly from a single URL.
-     - In dynamic websites, content often loads progressively or based on user actions (e.g., clicking "load more," submitting forms, scrolling).
- - Session-based crawling helps simulate user actions, capturing content that is otherwise hidden until specific actions are taken.
-
-#### **2. Conceptual Diagram for Session-Based Crawling**
-
- ```mermaid
- graph TD
-       Start[Start Session] --> S1["Initial State (S1)"]
- S1 -->|Crawl| Content1[Extract Content S1]
- S1 -->|Action: Click Load More| S2[State S2]
- S2 -->|Crawl| Content2[Extract Content S2]
- S2 -->|Action: Scroll Down| S3[State S3]
- S3 -->|Crawl| Content3[Extract Content S3]
- S3 -->|Action: Submit Form| S4[Final State]
- S4 -->|Crawl| Content4[Extract Content S4]
- Content4 --> End[End Session]
- ```
-
- - **Explanation of Diagram**:
- - **Start**: Initializes the session and opens the starting URL.
-     - **State Transitions**: Each action (e.g., clicking "load more," scrolling) transitions to a new state, where additional content becomes available.
- - **Session Persistence**: Keeps the same browsing session active, preserving the state and allowing for a sequence of actions to unfold.
- - **End**: After reaching the final state, the session ends, and all accumulated content has been extracted.
-
-#### **3. Key Components of Session-Based Crawling in Crawl4AI**
-   - **Session ID**: A unique identifier to maintain the state across requests, allowing the crawler to "remember" previous actions.
- - **JavaScript Execution**: Executes JavaScript commands (e.g., clicks, scrolls) to simulate interactions.
- - **Wait Conditions**: Ensures the crawler waits for content to load in each state before moving on.
- - **Sequential State Transitions**: By defining actions and wait conditions between states, the crawler can navigate through the page as a user would.
-
-#### **4. Basic Session Example: Multi-Step Content Loading**
-   - **Goal**: Crawl an article feed that requires several "load more" clicks to display additional content.
- - **Code**:
- ```python
- async def crawl_article_feed():
- async with AsyncWebCrawler() as crawler:
- session_id = "feed_session"
-
- for page in range(3):
- result = await crawler.arun(
- url="https://example.com/articles",
- session_id=session_id,
- js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
- wait_for="css:.article",
- css_selector=".article" # Target article elements
- )
- print(f"Page {page + 1}: Extracted {len(result.extracted_content)} articles")
- ```
- - **Explanation**:
- - **session_id**: Ensures all requests share the same browsing state.
-     - **js_code**: Clicks the "load more" button after the initial page load, expanding content on each iteration.
- - **wait_for**: Ensures articles have loaded after each click before extraction.
-
-#### **5. Advanced Example: E-Commerce Product Search with Filter Selection**
- - **Goal**: Interact with filters on an e-commerce page to extract products based on selected criteria.
- - **Example Steps**:
- 1. **State 1**: Load the main product page.
-     2. **State 2**: Apply a filter (e.g., "On Sale") by selecting a checkbox.
- 3. **State 3**: Scroll to load additional products and capture updated results.
-
- - **Code**:
- ```python
- async def extract_filtered_products():
- async with AsyncWebCrawler() as crawler:
- session_id = "product_session"
-
- # Step 1: Open product page
- result = await crawler.arun(
- url="https://example.com/products",
- session_id=session_id,
- wait_for="css:.product-item"
- )
-
- # Step 2: Apply filter (e.g., "On Sale")
- result = await crawler.arun(
- url="https://example.com/products",
- session_id=session_id,
- js_code="document.querySelector('#sale-filter-checkbox').click();",
- wait_for="css:.product-item"
- )
-
- # Step 3: Scroll to load additional products
- for _ in range(2): # Scroll down twice
- result = await crawler.arun(
- url="https://example.com/products",
- session_id=session_id,
- js_code="window.scrollTo(0, document.body.scrollHeight);",
- wait_for="css:.product-item"
- )
- print(f"Loaded {len(result.extracted_content)} products after scroll")
- ```
- - **Explanation**:
- - **State Persistence**: Each action (filter selection and scroll) builds on the previous session state.
- - **Multiple Interactions**: Combines clicking a filter with scrolling, demonstrating how the session preserves these actions.
-
-#### **6. Key Benefits of Session-Based Crawling**
- - **Accessing Hidden Content**: Retrieves data that loads only after user actions.
-   - **Simulating User Behavior**: Handles interactive elements such as "load more" buttons, dropdowns, and filters.
- - **Maintaining Continuity Across States**: Enables a sequential process, moving logically from one state to the next, capturing all desired content without reloading the initial state each time.
-
-#### **7. Additional Configuration Tips**
- - **Manage Session End**: Always conclude the session after the final state to release resources.
- - **Optimize with Wait Conditions**: Use `wait_for` to ensure complete loading before each extraction.
- - **Handling Errors in Session-Based Crawling**: Include error handling for interactions that may fail, ensuring robustness across state transitions.
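-   - **Cleanup Sketch**: A minimal example of explicit session cleanup, assuming the `kill_session` helper exposed by the crawler strategy (this is how the session-end tip above can be implemented):
-     ```python
-     async def crawl_with_cleanup():
-         session_id = "product_session"
-         async with AsyncWebCrawler() as crawler:
-             try:
-                 await crawler.arun(
-                     url="https://example.com/products",
-                     session_id=session_id,
-                     wait_for="css:.product-item"
-                 )
-                 # ... further steps in the same session ...
-             finally:
-                 # Release the browser page held by this session
-                 await crawler.crawler_strategy.kill_session(session_id)
-     ```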
-
-#### **8. Complete Code Example: Multi-Step Session Workflow**
- - **Example**:
- ```python
-     import asyncio
-
-     async def main():
- await crawl_article_feed()
- await extract_filtered_products()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **9. Wrap Up & Next Steps**
- - Recap the usefulness of session-based crawling for dynamic content extraction.
- - Tease the next video: **Hooks and Custom Workflow with AsyncWebCrawler** to cover advanced customization options for further control over the crawling process.
-
----
-
-This outline covers session-based crawling from both a conceptual and practical perspective, helping users understand its importance, configure it effectively, and use it to handle complex dynamic content.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md b/docs/md_v2/tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md
deleted file mode 100644
index eda07e8b..00000000
--- a/docs/md_v2/tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md
+++ /dev/null
@@ -1,138 +0,0 @@
-# Crawl4AI
-
-## Episode 13: Chunking Strategies for Large Text Processing
-
-### Quick Intro
-Explain Regex, NLP, and Fixed-Length chunking, and when to use each. Demo: Chunk a large article or document for processing by topics or sentences.
-
-Here's a structured outline for the **Chunking Strategies for Large Text Processing** video, explaining each strategy and when to use it, and emphasizing how chunking works within extraction for effective data aggregation.
-
----
-
-### **12. Chunking Strategies for Large Text Processing**
-
-#### **1. Introduction to Chunking in Crawl4AI**
-   - **What is Chunking**: Chunking is the process of dividing large text into manageable sections or "chunks," enabling efficient processing in extraction tasks.
-   - **Why It's Needed**:
- - When processing large text, feeding it directly into an extraction function (like `F(x)`) can overwhelm memory or token limits.
-     - Chunking breaks down `x` (the text) into smaller pieces, which are processed sequentially or in parallel by the extraction function, with the final result being an aggregation of all chunks' processed output.
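-   - **Toy Sketch**: The pattern in miniature, with a hypothetical `F` standing in for any size-limited extraction function:
-     ```python
-     def F(chunk: str) -> list[str]:
-         # Stand-in extractor: pull capitalized words from one chunk
-         return [w for w in chunk.split() if w.istitle()]
-
-     def chunked_extract(text: str, chunk_size: int = 100) -> list[str]:
-         words = text.split()
-         chunks = [" ".join(words[i:i + chunk_size])
-                   for i in range(0, len(words), chunk_size)]
-         results = []
-         for chunk in chunks:          # Process each chunk independently...
-             results.extend(F(chunk))  # ...then aggregate the outputs
-         return results
-     ```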
-
-#### **2. Key Chunking Strategies and Use Cases**
- - Crawl4AI offers various chunking strategies to suit different text structures, chunk sizes, and processing requirements.
- - **Choosing a Strategy**: Select based on the type of text (e.g., articles, transcripts) and extraction needs (e.g., simple splitting or context-sensitive processing).
-
-#### **3. Strategy 1: Regex-Based Chunking**
- - **Description**: Uses regular expressions to split text based on specified patterns (e.g., paragraphs or section breaks).
- - **Use Case**: Ideal for dividing text by paragraphs or larger logical blocks where sections are clearly separated by line breaks or punctuation.
- - **Example**:
- - **Pattern**: `r'\n\n'` for double line breaks.
- ```python
-       from crawl4ai.chunking_strategy import RegexChunking  # all chunkers below live in this module
-
-       chunker = RegexChunking(patterns=[r'\n\n'])
- text_chunks = chunker.chunk(long_text)
- print(text_chunks) # Output: List of paragraphs
- ```
- - **Pros**: Flexible for pattern-based chunking.
- - **Cons**: Limited to text with consistent formatting.
-
-#### **4. Strategy 2: NLP Sentence-Based Chunking**
- - **Description**: Uses NLP to split text by sentences, ensuring grammatically complete segments.
- - **Use Case**: Useful for extracting individual statements, such as in news articles, quotes, or legal text.
- - **Example**:
- ```python
- chunker = NlpSentenceChunking()
- sentence_chunks = chunker.chunk(long_text)
- print(sentence_chunks) # Output: List of sentences
- ```
- - **Pros**: Maintains sentence structure, ideal for tasks needing semantic completeness.
- - **Cons**: May create very small chunks, which could limit contextual extraction.
-
-#### **5. Strategy 3: Topic-Based Segmentation Using TextTiling**
- - **Description**: Segments text into topics using TextTiling, identifying topic shifts and key segments.
- - **Use Case**: Ideal for long articles, reports, or essays where each section covers a different topic.
- - **Example**:
- ```python
- chunker = TopicSegmentationChunking(num_keywords=3)
- topic_chunks = chunker.chunk_with_topics(long_text)
- print(topic_chunks) # Output: List of topic segments with keywords
- ```
- - **Pros**: Groups related content, preserving topical coherence.
- - **Cons**: Depends on identifiable topic shifts, which may not be present in all texts.
-
-#### **6. Strategy 4: Fixed-Length Word Chunking**
- - **Description**: Splits text into chunks based on a fixed number of words.
- - **Use Case**: Ideal for text where exact segment size is required, such as processing word-limited documents for LLMs.
- - **Example**:
- ```python
- chunker = FixedLengthWordChunking(chunk_size=100)
- word_chunks = chunker.chunk(long_text)
- print(word_chunks) # Output: List of 100-word chunks
- ```
- - **Pros**: Ensures uniform chunk sizes, suitable for token-based extraction limits.
- - **Cons**: May split sentences, affecting semantic coherence.
-
-#### **7. Strategy 5: Sliding Window Chunking**
- - **Description**: Uses a fixed window size with a step, creating overlapping chunks to maintain context.
- - **Use Case**: Useful for maintaining context across sections, as with documents where context is needed for neighboring sections.
- - **Example**:
- ```python
- chunker = SlidingWindowChunking(window_size=100, step=50)
- window_chunks = chunker.chunk(long_text)
- print(window_chunks) # Output: List of overlapping word chunks
- ```
- - **Pros**: Retains context across adjacent chunks, ideal for complex semantic extraction.
- - **Cons**: Overlap increases data size, potentially impacting processing time.
-
-#### **8. Strategy 6: Overlapping Window Chunking**
- - **Description**: Similar to sliding windows but with a defined overlap, allowing chunks to share content at the edges.
- - **Use Case**: Suitable for handling long texts with essential overlapping information, like research articles or medical records.
- - **Example**:
- ```python
- chunker = OverlappingWindowChunking(window_size=1000, overlap=100)
- overlap_chunks = chunker.chunk(long_text)
- print(overlap_chunks) # Output: List of overlapping chunks with defined overlap
- ```
- - **Pros**: Allows controlled overlap for consistent content coverage across chunks.
- - **Cons**: Redundant data in overlapping areas may increase computation.
-
-#### **9. Practical Example: Using Chunking with an Extraction Strategy**
- - **Goal**: Combine chunking with an extraction strategy to process large text effectively.
- - **Example Code**:
- ```python
-     from crawl4ai import AsyncWebCrawler
-     from crawl4ai.chunking_strategy import FixedLengthWordChunking
-     from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
- async def extract_large_text():
- # Initialize chunker and extraction strategy
- chunker = FixedLengthWordChunking(chunk_size=200)
- extraction_strategy = LLMExtractionStrategy(provider="openai/gpt-4", api_token="your_api_token")
-
- # Split text into chunks
- text_chunks = chunker.chunk(large_text)
-
- async with AsyncWebCrawler() as crawler:
- for chunk in text_chunks:
- result = await crawler.arun(
- url="https://example.com",
- extraction_strategy=extraction_strategy,
- content=chunk
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - `chunker.chunk()`: Divides the `large_text` into smaller segments based on the chosen strategy.
- - `extraction_strategy`: Processes each chunk separately, and results are then aggregated to form the final output.
-
-#### **10. Choosing the Right Chunking Strategy**
- - **Text Structure**: If text has clear sections (e.g., paragraphs, topics), use Regex or Topic Segmentation.
- - **Extraction Needs**: If context is crucial, consider Sliding or Overlapping Window Chunking.
- - **Processing Constraints**: For word-limited extractions (e.g., LLMs with token limits), Fixed-Length Word Chunking is often most effective.
-
-#### **11. Wrap Up & Next Steps**
- - Recap the benefits of each chunking strategy and when to use them in extraction workflows.
- - Tease the next video: **Hooks and Custom Workflow with AsyncWebCrawler**, focusing on customizing crawler behavior with hooks for a fine-tuned extraction process.
-
----
-
-This outline provides a complete understanding of chunking strategies, explaining each method's strengths and best-use scenarios to help users process large texts effectively in Crawl4AI.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md b/docs/md_v2/tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md
deleted file mode 100644
index 87a3d217..00000000
--- a/docs/md_v2/tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md
+++ /dev/null
@@ -1,185 +0,0 @@
-# Crawl4AI
-
-## Episode 14: Hooks and Custom Workflow with AsyncWebCrawler
-
-### Quick Intro
-Cover hooks (`on_browser_created`, `before_goto`, `after_goto`) to add custom workflows. Demo: Use hooks to add custom cookies or headers, log HTML, or trigger specific events on page load.
-
-Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCrawler** video, covering each hook's purpose, usage, and example implementations.
-
----
-
-### **13. Hooks and Custom Workflow with AsyncWebCrawler**
-
-#### **1. Introduction to Hooks in Crawl4AI**
- - **What are Hooks**: Hooks are customizable entry points in the crawling process that allow users to inject custom actions or logic at specific stages.
- - **Why Use Hooks**:
- - They enable fine-grained control over the crawling workflow.
- - Useful for performing additional tasks (e.g., logging, modifying headers) dynamically during the crawl.
- - Hooks provide the flexibility to adapt the crawler to complex site structures or unique project needs.
-
-#### **2. Overview of Available Hooks**
- - Crawl4AI offers seven key hooks to modify and control different stages in the crawling lifecycle:
- - `on_browser_created`
- - `on_user_agent_updated`
- - `on_execution_started`
- - `before_goto`
- - `after_goto`
- - `before_return_html`
- - `before_retrieve_html`
-
-#### **3. Hook-by-Hook Explanation and Examples**
-
----
-
-##### **Hook 1: `on_browser_created`**
- - **Purpose**: Triggered right after the browser instance is created.
- - **Use Case**:
- - Initializing browser-specific settings or performing setup actions.
- - Configuring browser extensions or scripts before any page is opened.
- - **Example**:
- ```python
- async def log_browser_creation(browser):
- print("Browser instance created:", browser)
-
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- ```
- - **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
-
----
-
-##### **Hook 2: `on_user_agent_updated`**
- - **Purpose**: Called whenever the user agent string is updated.
- - **Use Case**:
- - Modifying the user agent based on page requirements, e.g., changing to a mobile user agent for mobile-only pages.
- - **Example**:
- ```python
- def update_user_agent(user_agent):
- print(f"User Agent Updated: {user_agent}")
-
- crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
- crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
- ```
- - **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
-
----
-
-##### **Hook 3: `on_execution_started`**
- - **Purpose**: Called right before the crawler begins any interaction (e.g., JavaScript execution, clicks).
- - **Use Case**:
- - Performing setup actions, such as inserting cookies or initiating custom scripts.
- - **Example**:
- ```python
- async def log_execution_start(page):
- print("Execution started on page:", page.url)
-
- crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
- ```
- - **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
-
----
-
-##### **Hook 4: `before_goto`**
- - **Purpose**: Triggered before navigating to a new URL with `page.goto()`.
- - **Use Case**:
- - Modifying request headers or setting up conditions right before the page loads.
- - Adding headers or dynamically adjusting options for specific URLs.
- - **Example**:
- ```python
- async def modify_headers_before_goto(page):
- await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
- print("Custom headers set before navigation")
-
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- ```
-   - **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
-
----
-
-##### **Hook 5: `after_goto`**
- - **Purpose**: Executed immediately after a page has loaded (after `page.goto()`).
- - **Use Case**:
- - Checking the loaded page state, modifying the DOM, or performing post-navigation actions (e.g., scrolling).
- - **Example**:
- ```python
- async def post_navigation_scroll(page):
- await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
- print("Scrolled to the bottom after navigation")
-
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- ```
- - **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
-
----
-
-##### **Hook 6: `before_return_html`**
- - **Purpose**: Called right before HTML content is retrieved and returned.
- - **Use Case**:
- - Removing overlays or cleaning up the page for a cleaner HTML extraction.
- - **Example**:
- ```python
- async def remove_advertisements(page, html):
- await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
- print("Advertisements removed before returning HTML")
-
- crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
- ```
-   - **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
-
----
-
-##### **Hook 7: `before_retrieve_html`**
- - **Purpose**: Runs right before Crawl4AI initiates HTML retrieval.
- - **Use Case**:
- - Finalizing any page adjustments (e.g., setting timers, waiting for specific elements).
- - **Example**:
- ```python
- async def wait_for_content_before_retrieve(page):
- await page.wait_for_selector('.main-content')
- print("Main content loaded, ready to retrieve HTML")
-
- crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
- ```
- - **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
-
-#### **4. Setting Hooks in Crawl4AI**
- - **How to Set Hooks**:
- - Use `set_hook` to define a custom function for each hook.
- - Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- - **Example Setup**:
- ```python
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- ```
-
-#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
- - **Goal**: Log each key step, set custom headers before navigation, and clean up the page before retrieving HTML.
- - **Example Code**:
- ```python
- async def custom_crawl():
- async with AsyncWebCrawler() as crawler:
- # Set hooks for custom workflow
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
-
- # Perform the crawl
- url = "https://example.com"
- result = await crawler.arun(url=url)
- print(result.html) # Display or process HTML
- ```
-
-#### **6. Benefits of Using Hooks in Custom Crawling Workflows**
- - **Enhanced Control**: Hooks offer precise control over each stage, allowing adjustments based on content and structure.
- - **Efficient Modifications**: Avoid reloading or restarting the session; hooks can alter actions dynamically.
- - **Context-Sensitive Actions**: Hooks enable custom logic tailored to specific pages or sections, maximizing extraction quality.
-
-#### **7. Wrap Up & Next Steps**
- - Recap how hooks empower customized workflows in Crawl4AI, enabling flexibility at every stage.
- - Tease the next video: **Automating Post-Processing with Crawl4AI**, covering automated steps after data extraction.
-
----
-
-This outline provides a thorough understanding of hooks, their practical applications, and examples for customizing the crawling workflow in Crawl4AI.
\ No newline at end of file
diff --git a/docs/md_v2/tutorial/tutorial.md b/docs/md_v2/tutorial/tutorial.md
deleted file mode 100644
index 7bead842..00000000
--- a/docs/md_v2/tutorial/tutorial.md
+++ /dev/null
@@ -1,1789 +0,0 @@
-# Crawl4AI
-
-## Episode 1: Introduction to Crawl4AI and Basic Installation
-
-### Quick Intro
-Walk through installation from PyPI, setup, and verification. Show how to install with options like `torch` or `transformer` for advanced capabilities.
-
-Here's a condensed outline of the **Installation and Setup** video content:
-
----
-
-1) **Introduction to Crawl4AI**:
-
- - Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
-
-2) **Installation Overview**:
-
- - **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).
- - **Optional Advanced Installs**:
- - `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
- - `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
- - `pip install crawl4ai[all]` - Installs all features for complete functionality.
-
-3) **Verifying the Installation**:
-
- - Walk through a simple test script to confirm the setup:
- ```python
- import asyncio
- from crawl4ai import AsyncWebCrawler, CacheMode
-
- async def main():
- async with AsyncWebCrawler(verbose=True) as crawler:
- result = await crawler.arun(url="https://www.example.com")
- print(result.markdown[:500]) # Show first 500 characters
-
- asyncio.run(main())
- ```
- - Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.
-
-4) **Important Tips**:
-
- - **Run** `playwright install` **after installation** to set up dependencies.
- - **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
- - If you encounter issues, refer to the documentation or GitHub issues.
-
-5) **Wrap Up**:
-
- - Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).
-
----
-
-This structure provides a concise, effective guide to get viewers up and running with Crawl4AI in minutes.
-
-# Crawl4AI
-
-## Episode 2: Overview of Advanced Features
-
-### Quick Intro
-A general overview of advanced features like hooks, CSS selectors, and JSON CSS extraction.
-
-Here's a condensed outline for an **Overview of Advanced Features** video covering Crawl4AI's powerful customization and extraction options:
-
----
-
-### **Overview of Advanced Features**
-
-1) **Introduction to Advanced Features**:
-
-   - Briefly introduce Crawl4AI's advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.
-
-2) **Taking Screenshots**:
-
- - Explain the screenshot capability for capturing page state and verifying content.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", screenshot=True)
- ```
- - Mention that screenshots are saved as a base64 string in `result`, allowing easy decoding and saving.
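-   - For example, decoding and saving the screenshot as a PNG might look like this (a minimal sketch, assuming the base64 data is exposed as `result.screenshot`):
-     ```python
-     import base64
-
-     # Decode the base64 string and write it to disk as a PNG
-     if result.screenshot:
-         with open("screenshot.png", "wb") as f:
-             f.write(base64.b64decode(result.screenshot))
-     ```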
-
-3) **Media and Link Extraction**:
-
- - Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com")
- print("Media:", result.media)
- print("Links:", result.links)
- ```
-
-4) **Custom User Agent**:
-
- - Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
- ```
-
-5) **Custom Hooks for Enhanced Control**:
-
- - Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
-   - **Example**: Setting a custom header with the `before_goto` hook.
-     ```python
-     async def add_test_header(page):
-         await page.set_extra_http_headers({"X-Test-Header": "test"})
-
-     crawler.crawler_strategy.set_hook('before_goto', add_test_header)
-     ```
-
-6) **CSS Selectors for Targeted Extraction**:
-
- - Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", css_selector="h2")
- print("H2 Tags:", result.extracted_content)
- ```
-
-7) **Crawling Inside Iframes**:
-
- - Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://www.example.com", process_iframes=True)
- ```
-
-8) **Wrap-Up**:
-
- - Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
- - Tease upcoming videos where each feature will be explored in detail.
-
----
-
-This covers each advanced feature with a brief example, providing a useful overview to prepare viewers for the more in-depth videos.
-
-# Crawl4AI
-
-## Episode 3: Browser Configurations & Headless Crawling
-
-### Quick Intro
-Explain browser options (`chromium`, `firefox`, `webkit`) and settings for headless mode, caching, and verbose logging.
-
-Here's a streamlined outline for the **Browser Configurations & Headless Crawling** video:
-
----
-
-### **Browser Configurations & Headless Crawling**
-
-1) **Overview of Browser Options**:
-
- - Crawl4AI supports three browser engines:
- - **Chromium** (default) - Highly compatible.
- - **Firefox** - Great for specialized use cases.
- - **Webkit** - Lightweight, ideal for basic needs.
- - **Example**:
- ```python
- # Using Chromium (default)
- crawler = AsyncWebCrawler(browser_type="chromium")
-
- # Using Firefox
- crawler = AsyncWebCrawler(browser_type="firefox")
-
- # Using WebKit
- crawler = AsyncWebCrawler(browser_type="webkit")
- ```
-
-2) **Headless Mode**:
-
- - Headless mode runs the browser without a visible GUI, making it faster and less resource-intensive.
- - To enable or disable:
- ```python
- # Headless mode (default is True)
- crawler = AsyncWebCrawler(headless=True)
-
- # Disable headless mode for debugging
- crawler = AsyncWebCrawler(headless=False)
- ```
-
-3) **Verbose Logging**:
-
- - Use `verbose=True` to get detailed logs for each action, useful for debugging:
- ```python
- crawler = AsyncWebCrawler(verbose=True)
- ```
-
-4) **Running a Basic Crawl with Configuration**:
-
- - Example of a simple crawl with custom browser settings:
- ```python
- async with AsyncWebCrawler(browser_type="firefox", headless=True, verbose=True) as crawler:
- result = await crawler.arun(url="https://www.example.com")
- print(result.markdown[:500]) # Show first 500 characters
- ```
-   - This example uses Firefox in headless mode with logging enabled, demonstrating the flexibility of Crawl4AI's setup.
-
-5) **Recap & Next Steps**:
-
- - Recap the power of selecting different browsers and running headless mode for speed and efficiency.
- - Tease the next video: **Proxy & Security Settings** for navigating blocked or restricted content and protecting IP identity.
-
----
-
-This breakdown covers browser configuration essentials in Crawl4AI, providing users with practical steps to optimize their scraping setup.
-
-# Crawl4AI
-
-## Episode 4: Advanced Proxy and Security Settings
-
-### Quick Intro
-Showcase proxy configurations (HTTP, SOCKS5, authenticated proxies). Demo: Use rotating proxies and set custom headers to avoid IP blocking and enhance security.
-
-Here's a focused outline for the **Proxy and Security Settings** video:
-
----
-
-### **Proxy & Security Settings**
-
-1) **Why Use Proxies in Web Crawling**:
-
- - Proxies are essential for bypassing IP-based restrictions, improving anonymity, and managing rate limits.
- - Crawl4AI supports simple proxies, authenticated proxies, and proxy rotation for robust web scraping.
-
-2) **Basic Proxy Setup**:
-
- - **Using a Simple Proxy**:
- ```python
- # HTTP proxy
- crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
-
- # SOCKS proxy
- crawler = AsyncWebCrawler(proxy="socks5://proxy.example.com:1080")
- ```
-
-3) **Authenticated Proxies**:
-
- - Use `proxy_config` for proxies requiring a username and password:
- ```python
- proxy_config = {
- "server": "http://proxy.example.com:8080",
- "username": "user",
- "password": "pass"
- }
- crawler = AsyncWebCrawler(proxy_config=proxy_config)
- ```
-
-4) **Rotating Proxies**:
-
- - Rotating proxies helps avoid IP bans by switching IP addresses for each request:
- ```python
-     import itertools
-
-     # Cycle through a pool of proxies so each request uses the next one
-     proxy_pool = itertools.cycle([
-         {"server": "http://proxy1.example.com:8080"},
-         {"server": "http://proxy2.example.com:8080"},
-     ])
-
-     async def get_next_proxy():
-         return next(proxy_pool)
-
-     async with AsyncWebCrawler() as crawler:
-         for url in urls:
-             proxy = await get_next_proxy()
-             crawler.update_proxy(proxy)
-             result = await crawler.arun(url=url)
- ```
- - This setup periodically switches the proxy for enhanced security and access.
-
-5) **Custom Headers for Additional Security**:
-
-   - Set custom headers to mask the crawler's identity and avoid detection:
- ```python
- headers = {
- "X-Forwarded-For": "203.0.113.195",
- "Accept-Language": "en-US,en;q=0.9",
- "Cache-Control": "no-cache",
- "Pragma": "no-cache"
- }
- crawler = AsyncWebCrawler(headers=headers)
- ```
-
-6) **Combining Proxies with Magic Mode for Anti-Bot Protection**:
-
- - For sites with aggressive bot detection, combine `proxy` settings with `magic=True`:
- ```python
- async with AsyncWebCrawler(proxy="http://proxy.example.com:8080", headers={"Accept-Language": "en-US"}) as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Enables anti-detection features
- )
- ```
- - **Magic Mode** automatically enables user simulation, random timing, and browser property masking.
-
-7) **Wrap Up & Next Steps**:
-
- - Summarize the importance of proxies and anti-detection in accessing restricted content and avoiding bans.
- - Tease the next video: **JavaScript Execution and Handling Dynamic Content** for working with interactive and dynamically loaded pages.
-
----
-
-This outline provides a practical guide to setting up proxies and security configurations, empowering users to navigate restricted sites while staying undetected.
-
-# Crawl4AI
-
-## Episode 5: JavaScript Execution and Dynamic Content Handling
-
-### Quick Intro
-Explain JavaScript code injection with examples (e.g., simulating scrolling, clicking "load more"). Demo: Extract content from a page that uses dynamic loading with lazy-loaded images.
-
-Here's a focused outline for the **JavaScript Execution and Dynamic Content Handling** video:
-
----
-
-### **JavaScript Execution & Dynamic Content Handling**
-
-1) **Why JavaScript Execution Matters**:
-
- - Many modern websites load content dynamically via JavaScript, requiring special handling to access all elements.
-   - Crawl4AI can execute JavaScript on pages, enabling it to interact with elements like "load more" buttons, infinite scrolls, and content that appears only after certain actions.
-
-2) **Basic JavaScript Execution**:
-
- - Use `js_code` to execute JavaScript commands on a page:
- ```python
- # Scroll to bottom of the page
- result = await crawler.arun(
- url="https://example.com",
- js_code="window.scrollTo(0, document.body.scrollHeight);"
- )
- ```
- - This command scrolls to the bottom, triggering any lazy-loaded or dynamically added content.
-
-3) **Multiple Commands & Simulating Clicks**:
-
-   - Combine multiple JavaScript commands to interact with elements like "load more" buttons:
- ```python
- js_commands = [
- "window.scrollTo(0, document.body.scrollHeight);",
- "document.querySelector('.load-more').click();"
- ]
- result = await crawler.arun(
- url="https://example.com",
- js_code=js_commands
- )
- ```
-   - This script scrolls down and then clicks the "load more" button, useful for loading additional content blocks.
-
-4) **Waiting for Dynamic Content**:
-
- - Use `wait_for` to ensure the page loads specific elements before proceeding:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- js_code="window.scrollTo(0, document.body.scrollHeight);",
- wait_for="css:.dynamic-content" # Wait for elements with class `.dynamic-content`
- )
- ```
- - This example waits until elements with `.dynamic-content` are loaded, helping to capture content that appears after JavaScript actions.
-
-5) **Handling Complex Dynamic Content (e.g., Infinite Scroll)**:
-
- - Combine JavaScript execution with conditional waiting to handle infinite scrolls or paginated content:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- js_code=[
- "window.scrollTo(0, document.body.scrollHeight);",
- "const loadMore = document.querySelector('.load-more'); if (loadMore) loadMore.click();"
- ],
- wait_for="js:() => document.querySelectorAll('.item').length > 10" # Wait until 10 items are loaded
- )
- ```
- - This example scrolls and clicks "load more" repeatedly, waiting each time for a specified number of items to load.
-
-6) **Complete Example: Dynamic Content Handling with Extraction**:
-
- - Full example demonstrating a dynamic load and content extraction in one process:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- js_code=[
- "window.scrollTo(0, document.body.scrollHeight);",
- "document.querySelector('.load-more').click();"
- ],
- wait_for="css:.main-content",
- css_selector=".main-content"
- )
- print(result.markdown[:500]) # Output the main content extracted
- ```
-
-7) **Wrap Up & Next Steps**:
-
- - Recap how JavaScript execution allows access to dynamic content, enabling powerful interactions.
- - Tease the next video: **Content Cleaning and Fit Markdown** to show how Crawl4AI can extract only the most relevant content from complex pages.
-
----
-
-This outline explains how to handle dynamic content and JavaScript-based interactions effectively, enabling users to scrape and interact with complex, modern websites.
-
-# Crawl4AI
-
-## Episode 6: Magic Mode and Anti-Bot Protection
-
-### Quick Intro
-Highlight `Magic Mode` and anti-bot features like user simulation, navigator overrides, and timing randomization. Demo: Access a site with anti-bot protection and show how `Magic Mode` seamlessly handles it.
-
-Here's a concise outline for the **Magic Mode and Anti-Bot Protection** video:
-
----
-
-### **Magic Mode & Anti-Bot Protection**
-
-1) **Why Anti-Bot Protection is Important**:
-
-   - Many websites use bot detection mechanisms to block automated scraping. Crawl4AI's anti-detection features help avoid IP bans, CAPTCHAs, and access restrictions.
- - **Magic Mode** is a one-step solution to enable a range of anti-bot features without complex configuration.
-
-2) **Enabling Magic Mode**:
-
-   - Simply set `magic=True` to activate Crawl4AI's full anti-bot suite:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Enables all anti-detection features
- )
- ```
- - This enables a blend of stealth techniques, including masking automation signals, randomizing timings, and simulating real user behavior.
-
-3) **What Magic Mode Does Behind the Scenes**:
-
- - **User Simulation**: Mimics human actions like mouse movements and scrolling.
- - **Navigator Overrides**: Hides signals that indicate an automated browser.
- - **Timing Randomization**: Adds random delays to simulate natural interaction patterns.
- - **Cookie Handling**: Accepts and manages cookies dynamically to avoid triggers from cookie pop-ups.
-
-4) **Manual Anti-Bot Options (If Not Using Magic Mode)**:
-
- - For granular control, you can configure individual settings without Magic Mode:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- simulate_user=True, # Enables human-like behavior
- override_navigator=True # Hides automation fingerprints
- )
- ```
- - **Use Cases**: This approach allows more specific adjustments when certain anti-bot features are needed but others are not.
-
-5) **Combining Proxies with Magic Mode**:
-
- - To avoid rate limits or IP blocks, combine Magic Mode with a proxy:
- ```python
- async with AsyncWebCrawler(
- proxy="http://proxy.example.com:8080",
- headers={"Accept-Language": "en-US"}
- ) as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True # Full anti-detection
- )
- ```
- - This setup maximizes stealth by pairing anti-bot detection with IP obfuscation.
-
-6) **Example of Anti-Bot Protection in Action**:
-
- - Full example with Magic Mode and proxies to scrape a protected page:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/protected-content",
- magic=True,
- proxy="http://proxy.example.com:8080",
- wait_for="css:.content-loaded" # Wait for the main content to load
- )
- print(result.markdown[:500]) # Display first 500 characters of the content
- ```
- - This example ensures seamless access to protected content by combining anti-detection and waiting for full content load.
-
-7) **Wrap Up & Next Steps**:
-
- - Recap the power of Magic Mode and anti-bot features for handling restricted websites.
- - Tease the next video: **Content Cleaning and Fit Markdown** to show how to extract clean and focused content from a page.
-
----
-
-This outline shows users how to easily avoid bot detection and access restricted content, demonstrating both the power and simplicity of Magic Mode in Crawl4AI.
-
-# Crawl4AI
-
-## Episode 7: Content Cleaning and Fit Markdown
-
-### Quick Intro
-Explain content cleaning options, including `fit_markdown` to keep only the most relevant content. Demo: Extract and compare regular vs. fit markdown from a news site or blog.
-
-Here's a streamlined outline for the **Content Cleaning and Fit Markdown** video:
-
----
-
-### **Content Cleaning & Fit Markdown**
-
-1) **Overview of Content Cleaning in Crawl4AI**:
-
- - Explain that web pages often include extra elements like ads, navigation bars, footers, and popups.
-   - Crawl4AI's content cleaning features help extract only the main content, reducing noise and enhancing readability.
-
-2) **Basic Content Cleaning Options**:
-
- - **Removing Unwanted Elements**: Exclude specific HTML tags, like forms or navigation bars:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- word_count_threshold=10, # Filter out blocks with fewer than 10 words
- excluded_tags=['form', 'nav'], # Exclude specific tags
- remove_overlay_elements=True # Remove popups and modals
- )
- ```
- - This example extracts content while excluding forms, navigation, and modal overlays, ensuring clean results.
-
-3) **Fit Markdown for Main Content Extraction**:
-
- - **What is Fit Markdown**: Uses advanced analysis to identify the most relevant content (ideal for articles, blogs, and documentation).
- - **How it Works**: Analyzes content density, removes boilerplate elements, and maintains formatting for a clear output.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
- main_content = result.fit_markdown # Extracted main content
- print(main_content[:500]) # Display first 500 characters
- ```
- - Fit Markdown is especially helpful for long-form content like news articles or blog posts.
-
-4) **Comparing Fit Markdown with Regular Markdown**:
-
- - **Fit Markdown** returns the primary content without extraneous elements.
- - **Regular Markdown** includes all extracted text in markdown format.
- - Example to show the difference:
- ```python
- all_content = result.markdown # Full markdown
- main_content = result.fit_markdown # Only the main content
-
- print(f"All Content Length: {len(all_content)}")
- print(f"Main Content Length: {len(main_content)}")
- ```
- - This comparison shows the effectiveness of Fit Markdown in focusing on essential content.
-
-5) **Media and Metadata Handling with Content Cleaning**:
-
- - **Media Extraction**: Crawl4AI captures images and videos with metadata like alt text, descriptions, and relevance scores:
- ```python
- for image in result.media["images"]:
- print(f"Source: {image['src']}, Alt Text: {image['alt']}, Relevance Score: {image['score']}")
- ```
- - **Use Case**: Useful for saving only relevant images or videos from an article or content-heavy page.
-
-6) **Example of Clean Content Extraction in Action**:
-
- - Full example extracting cleaned content and Fit Markdown:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- word_count_threshold=10,
- excluded_tags=['nav', 'footer'],
- remove_overlay_elements=True
- )
- print(result.fit_markdown[:500]) # Show main content
- ```
- - This example demonstrates content cleaning with settings for filtering noise and focusing on the core text.
-
-7) **Wrap Up & Next Steps**:
-
-   - Summarize the power of Crawl4AI's content cleaning features and Fit Markdown for capturing clean, relevant content.
- - Tease the next video: **Link Analysis and Smart Filtering** to focus on analyzing and filtering links within crawled pages.
-
----
-
-This outline covers Crawl4AI's content cleaning features and the unique benefits of Fit Markdown, showing users how to retrieve focused, high-quality content from web pages.
-
-# Crawl4AI
-
-## Episode 8: Media Handling: Images, Videos, and Audio
-
-### Quick Intro
-Showcase Crawl4AI's media extraction capabilities, including lazy-loaded media and metadata. Demo: Crawl a multimedia page, extract images, and show metadata (alt text, context, relevance score).
-
-Here's a clear and focused outline for the **Media Handling: Images, Videos, and Audio** video:
-
----
-
-### **Media Handling: Images, Videos, and Audio**
-
-1) **Overview of Media Extraction in Crawl4AI**:
-
- - Crawl4AI can detect and extract different types of media (images, videos, and audio) along with useful metadata.
- - This functionality is essential for gathering visual content from multimedia-heavy pages like e-commerce sites, news articles, and social media feeds.
-
-2) **Image Extraction and Metadata**:
-
- - Crawl4AI captures images with detailed metadata, including:
- - **Source URL**: The direct URL to the image.
- - **Alt Text**: Image description if available.
-     - **Relevance Score**: A score (0–10) indicating how relevant the image is to the main content.
- - **Context**: Text surrounding the image on the page.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
-
- for image in result.media["images"]:
- print(f"Source: {image['src']}")
- print(f"Alt Text: {image['alt']}")
- print(f"Relevance Score: {image['score']}")
- print(f"Context: {image['context']}")
- ```
-   - This example shows how to access each image's metadata, making it easy to filter for the most relevant visuals.
-
-3) **Handling Lazy-Loaded Images**:
-
- - Crawl4AI automatically supports lazy-loaded images, which are commonly used to optimize webpage loading.
- - **Example with Wait for Lazy-Loaded Content**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- wait_for="css:img[data-src]", # Wait for lazy-loaded images
- delay_before_return_html=2.0 # Allow extra time for images to load
- )
- ```
- - This setup waits for lazy-loaded images to appear, ensuring they are fully captured.
-
-4) **Video Extraction and Metadata**:
-
- - Crawl4AI captures video elements, including:
-     - **Source URL**: The video's direct URL.
- - **Type**: Format of the video (e.g., MP4).
- - **Thumbnail**: A poster or thumbnail image if available.
- - **Duration**: Video length, if metadata is provided.
- - **Example**:
- ```python
- for video in result.media["videos"]:
- print(f"Video Source: {video['src']}")
- print(f"Type: {video['type']}")
- print(f"Thumbnail: {video.get('poster')}")
- print(f"Duration: {video.get('duration')}")
- ```
- - This allows users to gather video content and relevant details for further processing or analysis.
-
-5) **Audio Extraction and Metadata**:
-
- - Audio elements can also be extracted, with metadata like:
-     - **Source URL**: The audio file's direct URL.
- - **Type**: Format of the audio file (e.g., MP3).
- - **Duration**: Length of the audio, if available.
- - **Example**:
- ```python
- for audio in result.media["audios"]:
- print(f"Audio Source: {audio['src']}")
- print(f"Type: {audio['type']}")
- print(f"Duration: {audio.get('duration')}")
- ```
- - Useful for sites with podcasts, sound bites, or other audio content.
-
-6) **Filtering Media by Relevance**:
-
- - Use metadata like relevance score to filter only the most useful media content:
- ```python
- relevant_images = [img for img in result.media["images"] if img['score'] > 5]
- ```
- - This is especially helpful for content-heavy pages where you only want media directly related to the main content.
-
-7) **Example: Full Media Extraction with Content Filtering**:
-
- - Full example extracting images, videos, and audio along with filtering by relevance:
- ```python
-     async with AsyncWebCrawler() as crawler:
-         result = await crawler.arun(
-             url="https://example.com",
-             word_count_threshold=10,      # Filter content blocks for relevance
-             exclude_external_images=True  # Only keep internal images
-         )
-
-         # Filter images by relevance score, then display media summaries
-         relevant_images = [img for img in result.media["images"] if img['score'] > 5]
-         print(f"Relevant Images: {len(relevant_images)}")
-         print(f"Videos: {len(result.media['videos'])}")
-         print(f"Audio Clips: {len(result.media['audios'])}")
- ```
-   - This example shows how to capture and filter various media types, focusing on what's most relevant.
-
-8) **Wrap Up & Next Steps**:
-
- - Recap the comprehensive media extraction capabilities, emphasizing how metadata helps users focus on relevant content.
- - Tease the next video: **Link Analysis and Smart Filtering** to explore how Crawl4AI handles internal, external, and social media links for more focused data gathering.
-
----
-
-This outline provides users with a complete guide to handling images, videos, and audio in Crawl4AI, using metadata to enhance relevance and precision in multimedia extraction.
-
-# Crawl4AI
-
-## Episode 9: Link Analysis and Smart Filtering
-
-### Quick Intro
-Walk through internal and external link classification, social media link filtering, and custom domain exclusion. Demo: Analyze links on a website, focusing on internal navigation vs. external or ad links.
-
-Here's a focused outline for the **Link Analysis and Smart Filtering** video:
-
----
-
-### **Link Analysis & Smart Filtering**
-
-1) **Importance of Link Analysis in Web Crawling**:
-
- - Explain that web pages often contain numerous links, including internal links, external links, social media links, and ads.
-   - Crawl4AI's link analysis and filtering options help extract only relevant links, enabling more targeted and efficient crawls.
-
-2) **Automatic Link Classification**:
-
- - Crawl4AI categorizes links automatically into internal, external, and social media links.
- - **Example**:
- ```python
- result = await crawler.arun(url="https://example.com")
-
- # Access internal and external links
- internal_links = result.links["internal"]
- external_links = result.links["external"]
-
- # Print first few links for each type
- print("Internal Links:", internal_links[:3])
- print("External Links:", external_links[:3])
- ```
-
-3) **Filtering Out Unwanted Links**:
-
- - **Exclude External Links**: Remove all links pointing to external sites.
- - **Exclude Social Media Links**: Filter out social media domains like Facebook or Twitter.
- - **Example**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- exclude_external_links=True, # Remove external links
- exclude_social_media_links=True # Remove social media links
- )
- ```
-
-4) **Custom Domain Filtering**:
-
- - **Exclude Specific Domains**: Filter links from particular domains, e.g., ad sites.
- - **Custom Social Media Domains**: Add additional social media domains if needed.
- - **Example**:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- exclude_domains=["ads.com", "trackers.com"],
- exclude_social_media_domains=["facebook.com", "linkedin.com"]
- )
- ```
-
-5) **Accessing Link Context and Metadata**:
-
- - Crawl4AI provides additional metadata for each link, including its text, type (e.g., navigation or content), and surrounding context.
- - **Example**:
- ```python
- for link in result.links["internal"]:
- print(f"Link: {link['href']}, Text: {link['text']}, Context: {link['context']}")
- ```
- - **Use Case**: Helps users understand the relevance of links based on where they are placed on the page (e.g., navigation vs. article content).
-
-6) **Example of Comprehensive Link Filtering and Analysis**:
-
- - Full example combining link filtering, metadata access, and contextual information:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- exclude_external_links=True,
- exclude_social_media_links=True,
- exclude_domains=["ads.com"],
- css_selector=".main-content" # Focus only on main content area
- )
- for link in result.links["internal"]:
- print(f"Internal Link: {link['href']}, Text: {link['text']}, Context: {link['context']}")
- ```
- - This example filters unnecessary links, keeping only internal and relevant links from the main content area.
-
-7) **Wrap Up & Next Steps**:
-
- - Summarize the benefits of link filtering for efficient crawling and relevant content extraction.
- - Tease the next video: **Custom Headers, Identity Management, and User Simulation** to explain how to configure identity settings and simulate user behavior for stealthier crawls.
-
----
-
-This outline provides a practical overview of Crawl4AI's link analysis and filtering features, helping users target only essential links while eliminating distractions.
-
-# Crawl4AI
-
-## Episode 10: Custom Headers, Identity, and User Simulation
-
-### Quick Intro
-Teach how to use custom headers, user-agent strings, and simulate real user interactions. Demo: Set custom user-agent and headers to access a site that blocks typical crawlers.
-
-Here's a concise outline for the **Custom Headers, Identity Management, and User Simulation** video:
-
----
-
-### **Custom Headers, Identity Management, & User Simulation**
-
-1) **Why Customize Headers and Identity in Crawling**:
-
- - Websites often track request headers and browser properties to detect bots. Customizing headers and managing identity help make requests appear more human, improving access to restricted sites.
-
-2) **Setting Custom Headers**:
-
- - Customize HTTP headers to mimic genuine browser requests or meet site-specific requirements:
- ```python
- headers = {
- "Accept-Language": "en-US,en;q=0.9",
- "X-Requested-With": "XMLHttpRequest",
- "Cache-Control": "no-cache"
- }
- crawler = AsyncWebCrawler(headers=headers)
- ```
- - **Use Case**: Customize the `Accept-Language` header to simulate local user settings, or `Cache-Control` to bypass cache for fresh content.
-
-3) **Setting a Custom User Agent**:
-
- - Some websites block requests from common crawler user agents. Setting a custom user agent string helps bypass these restrictions:
- ```python
- crawler = AsyncWebCrawler(
- user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
- )
- ```
- - **Tip**: Use user-agent strings from popular browsers (e.g., Chrome, Firefox) to improve access and reduce detection risks.
-
-4) **User Simulation for Human-like Behavior**:
-
- - Enable `simulate_user=True` to mimic natural user interactions, such as random timing and simulated mouse movements:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- simulate_user=True # Simulates human-like behavior
- )
- ```
- - **Behavioral Effects**: Adds subtle variations in interactions, making the crawler harder to detect on bot-protected sites.
-
-5) **Navigator Overrides and Magic Mode for Full Identity Masking**:
-
- - Use `override_navigator=True` to mask automation indicators like `navigator.webdriver`, which websites check to detect bots:
- ```python
- result = await crawler.arun(
- url="https://example.com",
- override_navigator=True # Masks bot-related signals
- )
- ```
- - **Combining with Magic Mode**: For a complete anti-bot setup, combine these identity options with `magic=True` for maximum protection:
- ```python
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com",
- magic=True, # Enables all anti-bot detection features
- user_agent="Custom-Agent", # Custom agent with Magic Mode
- )
- ```
- - This setup includes all anti-detection techniques like navigator masking, random timing, and user simulation.
-
-6) **Example: Comprehensive Setup for Identity Management**:
-
- - A full example combining custom headers, user-agent, and user simulation for a realistic browsing profile:
- ```python
- async with AsyncWebCrawler(
- headers={"Accept-Language": "en-US", "Cache-Control": "no-cache"},
- user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0",
- ) as crawler:
- result = await crawler.arun(
- url="https://example.com/secure-page",
- simulate_user=True
- )
- print(result.markdown[:500]) # Display extracted content
- ```
- - This example enables detailed customization for evading detection and accessing protected pages smoothly.
-
-7) **Wrap Up & Next Steps**:
-
- - Recap the value of headers, user-agent customization, and simulation in bypassing bot detection.
- - Tease the next video: **Extraction Strategies: JSON CSS, LLM, and Cosine** to dive into structured data extraction methods for high-quality content retrieval.
-
----
-
-This outline equips users with tools for managing crawler identity and human-like behavior, essential for accessing bot-protected or restricted websites.
-
-Here's a detailed outline for the **JSON-CSS Extraction Strategy** video, covering all key aspects and supported structures in Crawl4AI:
-
----
-
-### **10.1 JSON-CSS Extraction Strategy**
-
-#### **1. Introduction to JSON-CSS Extraction**
- - JSON-CSS Extraction is used for pulling structured data from pages with repeated patterns, like product listings, article feeds, or directories.
- - This strategy allows defining a schema with CSS selectors and data fields, making it easy to capture nested, list-based, or singular elements.
-
-#### **2. Basic Schema Structure**
- - **Schema Fields**: The schema has two main components:
- - `baseSelector`: A CSS selector to locate the main elements you want to extract (e.g., each article or product block).
- - `fields`: Defines the data fields for each element, supporting various data types and structures.
-
-#### **3. Simple Field Extraction**
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Sample Product</h2>
-       <span class="price">$19.99</span>
-       <p class="description">This is a sample product.</p>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "price", "selector": ".price", "type": "text"},
- {"name": "description", "selector": ".description", "type": "text"}
- ]
- }
- ```
- - **Explanation**: Each field captures text content from specified CSS selectors within each `.product` element.
-
-#### **4. Supported Field Types: Text, Attribute, HTML, Regex**
- - **Field Type Options**:
- - `text`: Extracts visible text.
- - `attribute`: Captures an HTML attribute (e.g., `src`, `href`).
- - `html`: Extracts the raw HTML of an element.
- - `regex`: Allows regex patterns to extract part of the text.
-
- - **Example HTML** (including an image):
- ```html
-     <div class="product">
-       <h2 class="title">Sample Product</h2>
-       <img class="product-image" src="product.jpg" alt="Sample Product">
-       <span class="price">$19.99</span>
-       <p class="description">Limited time offer.</p>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
- {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
- {"name": "description_html", "selector": ".description", "type": "html"}
- ]
- }
- ```
- - **Explanation**:
- - `attribute`: Extracts the `src` attribute from `.product-image`.
- - `regex`: Extracts the numeric part from `$19.99`.
- - `html`: Retrieves the full HTML of the description element.
-
-#### **5. Nested Field Extraction**
- - **Use Case**: Useful when content contains sub-elements, such as an article with author details within it.
- - **Example HTML**:
- ```html
-     <div class="article">
-       <h1 class="title">Sample Article</h1>
-       <div class="author">
-         <span class="name">John Doe</span>
-         <span class="bio">Writer and editor</span>
-       </div>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".article",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "author", "type": "nested", "selector": ".author", "fields": [
- {"name": "name", "selector": ".name", "type": "text"},
- {"name": "bio", "selector": ".bio", "type": "text"}
- ]}
- ]
- }
- ```
- - **Explanation**:
- - `nested`: Extracts `name` and `bio` within `.author`, grouping the author details in a single `author` object.
-
-#### **6. List and Nested List Extraction**
- - **List**: Extracts multiple elements matching the selector as a list.
- - **Nested List**: Allows lists within lists, useful for items with sub-lists (e.g., specifications for each product).
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Product with Features</h2>
-       <ul class="features">
-         <li class="feature">Feature 1</li>
-         <li class="feature">Feature 2</li>
-         <li class="feature">Feature 3</li>
-       </ul>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text"},
- {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
- {"name": "feature", "type": "text"}
- ]}
- ]
- }
- ```
- - **Explanation**:
- - `list`: Captures each `.feature` item within `.features`, outputting an array of features under the `features` field.
-
-#### **7. Transformations for Field Values**
- - Transformations allow you to modify extracted values (e.g., converting to lowercase).
- - Supported transformations: `lowercase`, `uppercase`, `strip`.
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Special Product</h2>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"}
- ]
- }
- ```
- - **Explanation**: The `transform` property changes the `title` to uppercase, useful for standardized outputs.
-
-#### **8. Full JSON-CSS Extraction Example**
- - Combining all elements in a single schema example for a comprehensive crawl:
- - **Example HTML**:
- ```html
-     <div class="product">
-       <h2 class="title">Featured Product</h2>
-       <img class="product-image" src="featured.jpg" alt="Featured Product">
-       <span class="price">$99.99</span>
-       <p class="description">Best product of the year.</p>
-       <ul class="features">
-         <li class="feature">Durable</li>
-         <li class="feature">Eco-friendly</li>
-       </ul>
-     </div>
- ```
- - **Schema**:
- ```python
- schema = {
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": ".title", "type": "text", "transform": "uppercase"},
- {"name": "image_url", "selector": ".product-image", "type": "attribute", "attribute": "src"},
- {"name": "price", "selector": ".price", "type": "regex", "pattern": r"\$(\d+\.\d+)"},
- {"name": "description", "selector": ".description", "type": "html"},
- {"name": "features", "type": "list", "selector": ".features .feature", "fields": [
- {"name": "feature", "type": "text"}
- ]}
- ]
- }
- ```
-   - **Explanation**: This schema captures and transforms each aspect of the product, illustrating the JSON-CSS strategy's versatility for structured extraction.
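-   - **Running the Schema**: The outline stops at the schema itself; to actually execute it, wrap the schema in `JsonCssExtractionStrategy` and pass it to `arun` (a minimal sketch, assuming the class is importable from `crawl4ai.extraction_strategy` and that `extracted_content` comes back as a JSON string):
-     ```python
-     import json
-     from crawl4ai import AsyncWebCrawler
-     from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-     async def extract_products(schema: dict):
-         strategy = JsonCssExtractionStrategy(schema, verbose=True)
-         async with AsyncWebCrawler() as crawler:
-             result = await crawler.arun(
-                 url="https://example.com/products",  # hypothetical listing page
-                 extraction_strategy=strategy
-             )
-             products = json.loads(result.extracted_content)
-             print(f"Extracted {len(products)} product(s)")
-     ```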
-
-#### **9. Wrap Up & Next Steps**
-   - Summarize JSON-CSS Extraction's flexibility for structured, pattern-based extraction.
- - Tease the next video: **10.2 LLM Extraction Strategy**, focusing on using language models to extract data based on intelligent content analysis.
-
----
-
-This outline covers each JSON-CSS Extraction option in Crawl4AI, with practical examples and schema configurations, making it a thorough guide for users.
-
-# Crawl4AI
-
-## Episode 11: Extraction Strategies: JSON CSS, LLM, and Cosine
-
-### Quick Intro
-Introduce JSON CSS Extraction Strategy for structured data, LLM Extraction Strategy for intelligent parsing, and Cosine Strategy for clustering similar content. Demo: Use JSON CSS to scrape product details from an e-commerce site.
-
-Here's a comprehensive outline for the **LLM Extraction Strategy** video, covering key details and example applications.
-
----
-
-### **10.2 LLM Extraction Strategy**
-
-#### **1. Introduction to LLM Extraction Strategy**
- - The LLM Extraction Strategy leverages language models to interpret and extract structured data from complex web content.
- - Unlike traditional CSS selectors, this strategy uses natural language instructions and schemas to guide the extraction, ideal for unstructured or diverse content.
- - Supports **OpenAI**, **Azure OpenAI**, **HuggingFace**, and **Ollama** models, enabling flexibility with both proprietary and open-source providers.
-
-#### **2. Key Components of LLM Extraction Strategy**
- - **Provider**: Specifies the LLM provider (e.g., OpenAI, HuggingFace, Azure).
- - **API Token**: Required for most providers, except Ollama (local LLM model).
- - **Instruction**: Custom extraction instructions sent to the model, providing flexibility in how the data is structured and extracted.
- - **Schema**: Optional, defines structured fields to organize extracted data into JSON format.
- - **Extraction Type**: Supports `"block"` for simpler text blocks or `"schema"` when a structured output format is required.
- - **Chunking Parameters**: Breaks down large documents, with options to adjust chunk size and overlap rate for more accurate extraction across lengthy texts.
-
-#### **3. Basic Extraction Example: OpenAI Model Pricing**
- - **Goal**: Extract model names and their input and output fees from the OpenAI pricing page.
- - **Schema Definition**:
- - **Model Name**: Text for model identification.
- - **Input Fee**: Token cost for input processing.
- - **Output Fee**: Token cost for output generation.
-
- - **Schema**:
- ```python
- class OpenAIModelFee(BaseModel):
- model_name: str = Field(..., description="Name of the OpenAI model.")
- input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
- output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
- ```
-
- - **Example Code**:
- ```python
- async def extract_openai_pricing():
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://openai.com/api/pricing/",
- extraction_strategy=LLMExtractionStrategy(
- provider="openai/gpt-4o",
- api_token=os.getenv("OPENAI_API_KEY"),
- schema=OpenAIModelFee.schema(),
- extraction_type="schema",
- instruction="Extract model names and fees for input and output tokens from the page."
- ),
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - The extraction strategy combines a schema and detailed instruction to guide the LLM in capturing structured data.
-     - Each model's name, input fee, and output fee are extracted in a JSON format.
-
-#### **4. Knowledge Graph Extraction Example**
- - **Goal**: Extract entities and their relationships from a document for use in a knowledge graph.
- - **Schema Definition**:
- - **Entities**: Individual items with descriptions (e.g., people, organizations).
- - **Relationships**: Connections between entities, including descriptions and relationship types.
-
- - **Schema**:
- ```python
- class Entity(BaseModel):
- name: str
- description: str
-
- class Relationship(BaseModel):
- entity1: Entity
- entity2: Entity
- description: str
- relation_type: str
-
- class KnowledgeGraph(BaseModel):
- entities: List[Entity]
- relationships: List[Relationship]
- ```
-
- - **Example Code**:
- ```python
- async def extract_knowledge_graph():
- extraction_strategy = LLMExtractionStrategy(
- provider="azure/gpt-4o-mini",
- api_token=os.getenv("AZURE_API_KEY"),
- schema=KnowledgeGraph.schema(),
- extraction_type="schema",
- instruction="Extract entities and relationships from the content to build a knowledge graph."
- )
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/some-article",
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - In this setup, the LLM extracts entities and their relationships based on the schema and instruction.
- - The schema organizes results into a JSON-based knowledge graph format.
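-   - Since `extracted_content` comes back as a JSON string, you can parse it before further processing (a small sketch; it assumes the LLM actually returned data matching the schema, which is not guaranteed):
-     ```python
-     import json
-
-     # Parse the JSON string and preview the structure
-     data = json.loads(result.extracted_content)
-     print(json.dumps(data, indent=2)[:500])
-     ```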
-
-#### **5. Key Settings in LLM Extraction**
- - **Chunking Options**:
- - For long pages, set `chunk_token_threshold` to specify maximum token count per section.
- - Adjust `overlap_rate` to control the overlap between chunks, useful for contextual consistency.
- - **Example**:
- ```python
- extraction_strategy = LLMExtractionStrategy(
- provider="openai/gpt-4",
- api_token=os.getenv("OPENAI_API_KEY"),
- chunk_token_threshold=3000,
- overlap_rate=0.2, # 20% overlap between chunks
- instruction="Extract key insights and relationships."
- )
- ```
- - This setup ensures that longer texts are divided into manageable chunks with slight overlap, enhancing the quality of extraction.
-
-#### **6. Flexible Provider Options for LLM Extraction**
- - **Using Proprietary Models**: OpenAI, Azure, and HuggingFace provide robust language models, often suited for complex or detailed extractions.
- - **Using Open-Source Models**: Ollama and other open-source models can be deployed locally, suitable for offline or cost-effective extraction.
- - **Example Call**:
- ```python
- await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
- await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
- await extract_structured_data_using_llm("ollama/llama3.2")
- ```
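-   - Note that `extract_structured_data_using_llm` is not defined in this outline; a minimal sketch of such a helper might look like the following (the URL, instruction, and function body are illustrative assumptions):
-     ```python
-     from crawl4ai import AsyncWebCrawler, CacheMode
-     from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-     async def extract_structured_data_using_llm(provider: str, api_token: str = None):
-         # The same strategy works across providers; only the credentials change
-         strategy = LLMExtractionStrategy(
-             provider=provider,
-             api_token=api_token,
-             instruction="Extract the key facts from the page as structured JSON."
-         )
-         async with AsyncWebCrawler() as crawler:
-             result = await crawler.arun(
-                 url="https://example.com",
-                 extraction_strategy=strategy,
-                 cache_mode=CacheMode.BYPASS
-             )
-             print(result.extracted_content)
-     ```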
-
-#### **7. Complete Example of LLM Extraction Setup**
- - Code to run both the OpenAI pricing and Knowledge Graph extractions, using various providers:
- ```python
- async def main():
- await extract_openai_pricing()
- await extract_knowledge_graph()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **8. Wrap Up & Next Steps**
- - Recap the power of LLM extraction for handling unstructured or complex data extraction tasks.
- - Tease the next video: **10.3 Cosine Similarity Strategy** for clustering similar content based on semantic similarity.
-
----
-
-This outline explains LLM Extraction in Crawl4AI, with examples showing how to extract structured data using custom schemas and instructions. It demonstrates flexibility with multiple providers, ensuring practical application for different use cases.
-
-Here's a structured outline for the **Cosine Similarity Strategy** video, covering key concepts, configuration, and a practical example.
-
----
-
-### **10.3 Cosine Similarity Strategy**
-
-#### **1. Introduction to Cosine Similarity Strategy**
- - The Cosine Similarity Strategy clusters content by semantic similarity, offering an efficient alternative to LLM-based extraction, especially when speed is a priority.
- - Ideal for grouping similar sections of text, this strategy is well-suited for pages with content sections that may need to be classified or tagged, like news articles, product descriptions, or reviews.
-
-#### **2. Key Configuration Options**
- - **semantic_filter**: A keyword-based filter to focus on relevant content.
- - **word_count_threshold**: Minimum number of words per cluster, filtering out shorter, less meaningful clusters.
- - **max_dist**: Maximum allowable distance between elements in clusters, impacting cluster tightness.
- - **linkage_method**: Method for hierarchical clustering, such as `'ward'` (for well-separated clusters).
- - **top_k**: Specifies the number of top categories for each cluster.
- - **model_name**: Defines the model for embeddings, such as `sentence-transformers/all-MiniLM-L6-v2`.
- - **sim_threshold**: Minimum similarity threshold for filtering, allowing control over cluster relevance.
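-   - Putting the options together, a fully specified strategy might look like this sketch (values are illustrative, not recommendations; the import path is assumed to match the other strategies):
-     ```python
-     from crawl4ai.extraction_strategy import CosineStrategy
-
-     extraction_strategy = CosineStrategy(
-         semantic_filter="technology",  # keyword focus for relevant sections
-         word_count_threshold=10,       # drop clusters under 10 words
-         max_dist=0.2,                  # maximum distance within a cluster
-         linkage_method="ward",         # hierarchical clustering linkage
-         top_k=3,                       # number of top tags per cluster
-         model_name="sentence-transformers/all-MiniLM-L6-v2",
-         sim_threshold=0.3              # minimum similarity to keep a cluster
-     )
-     ```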
-
-#### **3. How Cosine Similarity Clustering Works**
- - **Step 1**: Embeddings are generated for each text section, transforming them into vectors that capture semantic meaning.
- - **Step 2**: Hierarchical clustering groups similar sections based on cosine similarity, forming clusters with related content.
- - **Step 3**: Clusters are filtered based on word count, removing those below the `word_count_threshold`.
- - **Step 4**: Each cluster is then categorized with tags, if enabled, providing context to each grouped content section.
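-   - To make Steps 1-2 concrete, here is a standalone sketch of the underlying idea using `sentence-transformers` and scikit-learn; this illustrates the technique, not Crawl4AI's internal code (it assumes scikit-learn >= 1.2, where the keyword is `metric` rather than `affinity`):
-     ```python
-     from sentence_transformers import SentenceTransformer
-     from sklearn.cluster import AgglomerativeClustering
-
-     sections = [
-         "The economy is showing signs of recovery, with markets up this quarter.",
-         "Market analysts are optimistic about continued growth in tech stocks.",
-         "In the sports world, several major teams are preparing for the season.",
-     ]
-
-     # Step 1: embed each section into a semantic vector
-     model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
-     embeddings = model.encode(sections)
-
-     # Step 2: hierarchical clustering on cosine distance
-     # (distance_threshold plays the role of max_dist)
-     clustering = AgglomerativeClustering(
-         n_clusters=None,
-         distance_threshold=0.3,
-         metric="cosine",
-         linkage="average"
-     )
-     labels = clustering.fit_predict(embeddings)
-     print(labels)  # sections sharing a label were judged semantically similar
-     ```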
-
-#### **4. Example Use Case: Clustering Blog Article Sections**
- - **Goal**: Group related sections of a blog or news page to identify distinct topics or discussion areas.
- - **Example HTML Sections**:
- ```text
- "The economy is showing signs of recovery, with markets up this quarter.",
- "In the sports world, several major teams are preparing for the upcoming season.",
- "New advancements in AI technology are reshaping the tech landscape.",
- "Market analysts are optimistic about continued growth in tech stocks."
- ```
-
- - **Code Setup**:
- ```python
- async def extract_blog_sections():
- extraction_strategy = CosineStrategy(
- word_count_threshold=15,
- max_dist=0.3,
- sim_threshold=0.2,
- model_name="sentence-transformers/all-MiniLM-L6-v2",
- top_k=2
- )
- async with AsyncWebCrawler() as crawler:
- url = "https://example.com/blog-page"
- result = await crawler.arun(
- url=url,
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - **word_count_threshold**: Ensures only clusters with meaningful content are included.
- - **sim_threshold**: Filters out clusters with low similarity, focusing on closely related sections.
- - **top_k**: Selects top tags, useful for identifying main topics.
-
-#### **5. Applying Semantic Filtering with Cosine Similarity**
-   - **Semantic Filter**: Filters sections based on relevance to a specific keyword, such as "technology" for tech articles.
- - **Example Code**:
- ```python
- extraction_strategy = CosineStrategy(
- semantic_filter="technology",
- word_count_threshold=10,
- max_dist=0.25,
- model_name="sentence-transformers/all-MiniLM-L6-v2"
- )
- ```
- - **Explanation**:
-     - **semantic_filter**: Only sections with high similarity to the "technology" keyword will be included in the clustering, making it easy to focus on specific topics within a mixed-content page.
-
-#### **6. Clustering Product Reviews by Similarity**
-   - **Goal**: Organize product reviews by themes, such as "price," "quality," or "durability."
- - **Example Reviews**:
- ```text
- "The quality of this product is outstanding and well worth the price.",
- "I found the product to be durable but a bit overpriced.",
- "Great value for the money and long-lasting.",
- "The build quality is good, but I expected a lower price point."
- ```
-
- - **Code Setup**:
- ```python
- async def extract_product_reviews():
- extraction_strategy = CosineStrategy(
- word_count_threshold=20,
- max_dist=0.35,
- sim_threshold=0.25,
- model_name="sentence-transformers/all-MiniLM-L6-v2"
- )
- async with AsyncWebCrawler() as crawler:
- url = "https://example.com/product-reviews"
- result = await crawler.arun(
- url=url,
- extraction_strategy=extraction_strategy,
- cache_mode=CacheMode.BYPASS
- )
- print(result.extracted_content)
- ```
-
- - **Explanation**:
- - This configuration clusters similar reviews, grouping feedback by common themes, helping businesses understand customer sentiments around particular product aspects.
-
-#### **7. Performance Advantages of Cosine Strategy**
-   - **Speed**: The Cosine Similarity Strategy is faster than LLM-based extraction, as it doesn't rely on API calls to external LLMs.
- - **Local Processing**: The strategy runs locally with pre-trained sentence embeddings, ideal for high-throughput scenarios where cost and latency are concerns.
- - **Comparison**: With a well-optimized local model, this method can perform clustering on large datasets quickly, making it suitable for tasks requiring rapid, repeated analysis.
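-   - A quick way to gauge local embedding throughput on your own hardware (a sketch; the batch size and section count are arbitrary):
-     ```python
-     import time
-     from sentence_transformers import SentenceTransformer
-
-     model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
-     texts = ["A sample content section of moderate length."] * 200
-
-     start = time.perf_counter()
-     model.encode(texts, batch_size=64)
-     print(f"Embedded {len(texts)} sections in {time.perf_counter() - start:.2f}s")
-     ```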
-
-#### **8. Full Code Example for Clustering News Articles**
- - **Code**:
- ```python
- async def main():
- await extract_blog_sections()
- await extract_product_reviews()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **9. Wrap Up & Next Steps**
- - Recap the efficiency and effectiveness of Cosine Similarity for clustering related content quickly.
-   - Close with a reminder of Crawl4AI's flexibility across extraction strategies, and prompt users to experiment with different settings to optimize clustering for their specific content.
-
----
-
-This outline covers the Cosine Similarity Strategy's speed and effectiveness, providing examples that showcase its potential for clustering various content types efficiently.
-
-# Crawl4AI
-
-## Episode 12: Session-Based Crawling for Dynamic Websites
-
-### Quick Intro
-Show session management for handling websites with multiple pages or actions (like "load more" buttons). Demo: Crawl a paginated content page, persisting session data across multiple requests.
-
-Here's a detailed outline for the **Session-Based Crawling for Dynamic Websites** video, explaining why sessions are necessary, how to use them, and providing practical examples and a visual diagram to illustrate the concept.
-
----
-
-### **11. Session-Based Crawling for Dynamic Websites**
-
-#### **1. Introduction to Session-Based Crawling**
- - **What is Session-Based Crawling**: Session-based crawling maintains a continuous browsing session across multiple page states, allowing the crawler to interact with a page and retrieve content that loads dynamically or based on user interactions.
-   - **Why It's Needed**:
-     - In static pages, all content is available directly from a single URL.
-     - In dynamic websites, content often loads progressively or based on user actions (e.g., clicking "load more," submitting forms, scrolling).
- - Session-based crawling helps simulate user actions, capturing content that is otherwise hidden until specific actions are taken.
-
-#### **2. Conceptual Diagram for Session-Based Crawling**
-
- ```mermaid
- graph TD
-       Start[Start Session] --> S1["Initial State (S1)"]
- S1 -->|Crawl| Content1[Extract Content S1]
- S1 -->|Action: Click Load More| S2[State S2]
- S2 -->|Crawl| Content2[Extract Content S2]
- S2 -->|Action: Scroll Down| S3[State S3]
- S3 -->|Crawl| Content3[Extract Content S3]
- S3 -->|Action: Submit Form| S4[Final State]
- S4 -->|Crawl| Content4[Extract Content S4]
- Content4 --> End[End Session]
- ```
-
- - **Explanation of Diagram**:
- - **Start**: Initializes the session and opens the starting URL.
-     - **State Transitions**: Each action (e.g., clicking "load more," scrolling) transitions to a new state, where additional content becomes available.
- - **Session Persistence**: Keeps the same browsing session active, preserving the state and allowing for a sequence of actions to unfold.
- - **End**: After reaching the final state, the session ends, and all accumulated content has been extracted.
-
-#### **3. Key Components of Session-Based Crawling in Crawl4AI**
-   - **Session ID**: A unique identifier to maintain the state across requests, allowing the crawler to "remember" previous actions.
- - **JavaScript Execution**: Executes JavaScript commands (e.g., clicks, scrolls) to simulate interactions.
- - **Wait Conditions**: Ensures the crawler waits for content to load in each state before moving on.
- - **Sequential State Transitions**: By defining actions and wait conditions between states, the crawler can navigate through the page as a user would.
-
-#### **4. Basic Session Example: Multi-Step Content Loading**
-   - **Goal**: Crawl an article feed that requires several "load more" clicks to display additional content.
- - **Code**:
- ```python
-     import asyncio
-     from crawl4ai import AsyncWebCrawler
-
-     async def crawl_article_feed():
-         async with AsyncWebCrawler() as crawler:
-             session_id = "feed_session"
-
-             for page in range(3):
-                 result = await crawler.arun(
-                     url="https://example.com/articles",
-                     session_id=session_id,
-                     js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
-                     js_only=page > 0,  # after the first load, run JS inside the existing page
-                     wait_for="css:.article",
-                     css_selector=".article"  # Target article elements
-                 )
-                 # extracted_content is a JSON string (or None), so report its size
-                 print(f"Page {page + 1}: extracted {len(result.extracted_content or '')} characters")
- ```
- - **Explanation**:
- - **session_id**: Ensures all requests share the same browsing state.
-     - **js_code**: Clicks the "load more" button after the initial page load, expanding content on each iteration.
-     - **js_only**: After the first load, runs the JavaScript inside the already-open page instead of re-navigating to the URL.
-     - **wait_for**: Ensures articles have loaded after each click before extraction.
-
-#### **5. Advanced Example: E-Commerce Product Search with Filter Selection**
- - **Goal**: Interact with filters on an e-commerce page to extract products based on selected criteria.
- - **Example Steps**:
- 1. **State 1**: Load the main product page.
-     2. **State 2**: Apply a filter (e.g., "On Sale") by selecting a checkbox.
- 3. **State 3**: Scroll to load additional products and capture updated results.
-
- - **Code**:
- ```python
-     import asyncio
-     from crawl4ai import AsyncWebCrawler
-
-     async def extract_filtered_products():
-         async with AsyncWebCrawler() as crawler:
-             session_id = "product_session"
-
-             # Step 1: Open product page
-             result = await crawler.arun(
-                 url="https://example.com/products",
-                 session_id=session_id,
-                 wait_for="css:.product-item"
-             )
-
-             # Step 2: Apply filter (e.g., "On Sale") in the same page
-             result = await crawler.arun(
-                 url="https://example.com/products",
-                 session_id=session_id,
-                 js_code="document.querySelector('#sale-filter-checkbox').click();",
-                 js_only=True,  # run JS inside the existing page instead of re-navigating
-                 wait_for="css:.product-item"
-             )
-
-             # Step 3: Scroll to load additional products
-             for _ in range(2):  # Scroll down twice
-                 result = await crawler.arun(
-                     url="https://example.com/products",
-                     session_id=session_id,
-                     js_code="window.scrollTo(0, document.body.scrollHeight);",
-                     js_only=True,  # keep the existing page state between scrolls
-                     wait_for="css:.product-item"
-                 )
-                 print(f"Content length after scroll: {len(result.html)}")
- ```
- - **Explanation**:
- - **State Persistence**: Each action (filter selection and scroll) builds on the previous session state.
- - **Multiple Interactions**: Combines clicking a filter with scrolling, demonstrating how the session preserves these actions.
-
-#### **6. Key Benefits of Session-Based Crawling**
- - **Accessing Hidden Content**: Retrieves data that loads only after user actions.
-   - **Simulating User Behavior**: Handles interactive elements such as "load more" buttons, dropdowns, and filters.
- - **Maintaining Continuity Across States**: Enables a sequential process, moving logically from one state to the next, capturing all desired content without reloading the initial state each time.
-
-#### **7. Additional Configuration Tips**
-   - **Manage Session End**: Always conclude the session after the final state to release browser resources (see the cleanup sketch below).
- - **Optimize with Wait Conditions**: Use `wait_for` to ensure complete loading before each extraction.
- - **Handling Errors in Session-Based Crawling**: Include error handling for interactions that may fail, ensuring robustness across state transitions.
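-
-   Combining the first and last tips, below is a minimal sketch of per-step error handling plus explicit cleanup. It assumes the legacy `kill_session` method on the crawler strategy; adapt the call if your installed version differs.
-
-   ```python
-   import asyncio
-   from crawl4ai import AsyncWebCrawler
-
-   async def robust_session_crawl():
-       session_id = "robust_session"
-       async with AsyncWebCrawler() as crawler:
-           try:
-               result = await crawler.arun(
-                   url="https://example.com/products",
-                   session_id=session_id,
-                   wait_for="css:.product-item"
-               )
-               if not result.success:
-                   print("Step failed:", result.error_message)
-           finally:
-               # Release the page/tab held by this session (assumed legacy API)
-               await crawler.crawler_strategy.kill_session(session_id)
-
-   if __name__ == "__main__":
-       asyncio.run(robust_session_crawl())
-   ```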
-
-#### **8. Complete Code Example: Multi-Step Session Workflow**
- - **Example**:
- ```python
- async def main():
- await crawl_article_feed()
- await extract_filtered_products()
-
- if __name__ == "__main__":
- asyncio.run(main())
- ```
-
-#### **9. Wrap Up & Next Steps**
- - Recap the usefulness of session-based crawling for dynamic content extraction.
- - Tease the next video: **Hooks and Custom Workflow with AsyncWebCrawler** to cover advanced customization options for further control over the crawling process.
-
----
-
-This outline covers session-based crawling from both a conceptual and practical perspective, helping users understand its importance, configure it effectively, and use it to handle complex dynamic content.
-
-# Crawl4AI
-
-## Episode 13: Chunking Strategies for Large Text Processing
-
-### Quick Intro
-Explain Regex, NLP, and Fixed-Length chunking, and when to use each. Demo: Chunk a large article or document for processing by topics or sentences.
-
-Here's a structured outline for the **Chunking Strategies for Large Text Processing** video, explaining each strategy, when to use it, and why chunking is crucial for effective data aggregation, with examples to illustrate.
-
----
-
-### **12. Chunking Strategies for Large Text Processing**
-
-#### **1. Introduction to Chunking in Crawl4AI**
-   - **What is Chunking**: Chunking is the process of dividing large text into manageable sections or "chunks," enabling efficient processing in extraction tasks.
-   - **Why It's Needed**:
-     - When processing large text, feeding it directly into an extraction function (like `F(x)`) can overwhelm memory or token limits.
-     - Chunking breaks down `x` (the text) into smaller pieces, which are processed sequentially or in parallel by the extraction function, with the final result being an aggregation of all chunks' processed output (see the sketch below).
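-
-   As a minimal, plain-Python sketch of this chunk-then-aggregate pattern (independent of any Crawl4AI class), `process` below stands in for the extraction function `F(x)`:
-
-   ```python
-   def process(chunk: str) -> dict:
-       # Placeholder for F(x); here it just counts words per chunk
-       return {"words": len(chunk.split())}
-
-   def chunked_extract(text: str, chunk_size: int = 100) -> list:
-       words = text.split()
-       chunks = [
-           " ".join(words[i:i + chunk_size])
-           for i in range(0, len(words), chunk_size)
-       ]
-       # The final result is the aggregation of all per-chunk outputs
-       return [process(c) for c in chunks]
-   ```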
-
-#### **2. Key Chunking Strategies and Use Cases**
- - Crawl4AI offers various chunking strategies to suit different text structures, chunk sizes, and processing requirements.
- - **Choosing a Strategy**: Select based on the type of text (e.g., articles, transcripts) and extraction needs (e.g., simple splitting or context-sensitive processing).
-
-#### **3. Strategy 1: Regex-Based Chunking**
- - **Description**: Uses regular expressions to split text based on specified patterns (e.g., paragraphs or section breaks).
- - **Use Case**: Ideal for dividing text by paragraphs or larger logical blocks where sections are clearly separated by line breaks or punctuation.
- - **Example**:
- - **Pattern**: `r'\n\n'` for double line breaks.
- ```python
-     from crawl4ai.chunking_strategy import RegexChunking
-     # (the other chunkers shown below also come from crawl4ai.chunking_strategy)
-
-     chunker = RegexChunking(patterns=[r'\n\n'])
- text_chunks = chunker.chunk(long_text)
- print(text_chunks) # Output: List of paragraphs
- ```
- - **Pros**: Flexible for pattern-based chunking.
- - **Cons**: Limited to text with consistent formatting.
-
-#### **4. Strategy 2: NLP Sentence-Based Chunking**
- - **Description**: Uses NLP to split text by sentences, ensuring grammatically complete segments.
- - **Use Case**: Useful for extracting individual statements, such as in news articles, quotes, or legal text.
- - **Example**:
- ```python
- chunker = NlpSentenceChunking()
- sentence_chunks = chunker.chunk(long_text)
- print(sentence_chunks) # Output: List of sentences
- ```
- - **Pros**: Maintains sentence structure, ideal for tasks needing semantic completeness.
- - **Cons**: May create very small chunks, which could limit contextual extraction.
-
-#### **5. Strategy 3: Topic-Based Segmentation Using TextTiling**
- - **Description**: Segments text into topics using TextTiling, identifying topic shifts and key segments.
- - **Use Case**: Ideal for long articles, reports, or essays where each section covers a different topic.
- - **Example**:
- ```python
- chunker = TopicSegmentationChunking(num_keywords=3)
- topic_chunks = chunker.chunk_with_topics(long_text)
- print(topic_chunks) # Output: List of topic segments with keywords
- ```
- - **Pros**: Groups related content, preserving topical coherence.
- - **Cons**: Depends on identifiable topic shifts, which may not be present in all texts.
-
-#### **6. Strategy 4: Fixed-Length Word Chunking**
- - **Description**: Splits text into chunks based on a fixed number of words.
- - **Use Case**: Ideal for text where exact segment size is required, such as processing word-limited documents for LLMs.
- - **Example**:
- ```python
- chunker = FixedLengthWordChunking(chunk_size=100)
- word_chunks = chunker.chunk(long_text)
- print(word_chunks) # Output: List of 100-word chunks
- ```
- - **Pros**: Ensures uniform chunk sizes, suitable for token-based extraction limits.
- - **Cons**: May split sentences, affecting semantic coherence.
-
-#### **7. Strategy 5: Sliding Window Chunking**
- - **Description**: Uses a fixed window size with a step, creating overlapping chunks to maintain context.
- - **Use Case**: Useful for maintaining context across sections, as with documents where context is needed for neighboring sections.
- - **Example**:
- ```python
- chunker = SlidingWindowChunking(window_size=100, step=50)
- window_chunks = chunker.chunk(long_text)
- print(window_chunks) # Output: List of overlapping word chunks
- ```
- - **Pros**: Retains context across adjacent chunks, ideal for complex semantic extraction.
- - **Cons**: Overlap increases data size, potentially impacting processing time.
-
-#### **8. Strategy 6: Overlapping Window Chunking**
- - **Description**: Similar to sliding windows but with a defined overlap, allowing chunks to share content at the edges.
- - **Use Case**: Suitable for handling long texts with essential overlapping information, like research articles or medical records.
- - **Example**:
- ```python
- chunker = OverlappingWindowChunking(window_size=1000, overlap=100)
- overlap_chunks = chunker.chunk(long_text)
- print(overlap_chunks) # Output: List of overlapping chunks with defined overlap
- ```
- - **Pros**: Allows controlled overlap for consistent content coverage across chunks.
- - **Cons**: Redundant data in overlapping areas may increase computation.
-
-#### **9. Practical Example: Using Chunking with an Extraction Strategy**
- - **Goal**: Combine chunking with an extraction strategy to process large text effectively.
- - **Example Code**:
-     ```python
-     import asyncio
-     from crawl4ai import AsyncWebCrawler
-     from crawl4ai.chunking_strategy import FixedLengthWordChunking
-     from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-     async def extract_large_text(large_text: str):
-         # Initialize chunker and extraction strategy
-         chunker = FixedLengthWordChunking(chunk_size=200)
-         extraction_strategy = LLMExtractionStrategy(provider="openai/gpt-4", api_token="your_api_token")
-
-         # Split text into chunks
-         text_chunks = chunker.chunk(large_text)
-
-         async with AsyncWebCrawler() as crawler:
-             for chunk in text_chunks:
-                 # Feed each chunk back in via the raw: prefix for raw content
-                 result = await crawler.arun(
-                     url=f"raw:{chunk}",
-                     extraction_strategy=extraction_strategy
-                 )
-                 print(result.extracted_content)
-     ```
-
- - **Explanation**:
- - `chunker.chunk()`: Divides the `large_text` into smaller segments based on the chosen strategy.
- - `extraction_strategy`: Processes each chunk separately, and results are then aggregated to form the final output.
-
-#### **10. Choosing the Right Chunking Strategy**
- - **Text Structure**: If text has clear sections (e.g., paragraphs, topics), use Regex or Topic Segmentation.
- - **Extraction Needs**: If context is crucial, consider Sliding or Overlapping Window Chunking.
-   - **Processing Constraints**: For word-limited extractions (e.g., LLMs with token limits), Fixed-Length Word Chunking is often most effective (a small chooser sketch follows below).
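-
-   As a rough illustration of these rules of thumb, here is a small chooser; the class names match the examples above, while the thresholds and heuristics are illustrative simplifications rather than library defaults:
-
-   ```python
-   from crawl4ai.chunking_strategy import (
-       FixedLengthWordChunking,
-       RegexChunking,
-       SlidingWindowChunking,
-   )
-
-   def choose_chunker(text: str, needs_context: bool = False, token_limited: bool = False):
-       if token_limited:
-           # Uniform chunks for strict token budgets
-           return FixedLengthWordChunking(chunk_size=200)
-       if needs_context:
-           # Overlapping windows preserve context across neighboring chunks
-           return SlidingWindowChunking(window_size=100, step=50)
-       if "\n\n" in text:
-           # Clear paragraph breaks: cheap pattern-based splitting
-           return RegexChunking(patterns=[r"\n\n"])
-       return FixedLengthWordChunking(chunk_size=200)
-   ```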
-
-#### **11. Wrap Up & Next Steps**
- - Recap the benefits of each chunking strategy and when to use them in extraction workflows.
- - Tease the next video: **Hooks and Custom Workflow with AsyncWebCrawler**, focusing on customizing crawler behavior with hooks for a fine-tuned extraction process.
-
----
-
-This outline provides a complete understanding of chunking strategies, explaining each method's strengths and best-use scenarios to help users process large texts effectively in Crawl4AI.
-
-# Crawl4AI
-
-## Episode 14: Hooks and Custom Workflow with AsyncWebCrawler
-
-### Quick Intro
-Cover hooks (`on_browser_created`, `before_goto`, `after_goto`) to add custom workflows. Demo: Use hooks to add custom cookies or headers, log HTML, or trigger specific events on page load.
-
-Here's a detailed outline for the **Hooks and Custom Workflow with AsyncWebCrawler** video, covering each hook's purpose, usage, and example implementations.
-
----
-
-### **13. Hooks and Custom Workflow with AsyncWebCrawler**
-
-#### **1. Introduction to Hooks in Crawl4AI**
- - **What are Hooks**: Hooks are customizable entry points in the crawling process that allow users to inject custom actions or logic at specific stages.
- - **Why Use Hooks**:
- - They enable fine-grained control over the crawling workflow.
- - Useful for performing additional tasks (e.g., logging, modifying headers) dynamically during the crawl.
- - Hooks provide the flexibility to adapt the crawler to complex site structures or unique project needs.
-
-#### **2. Overview of Available Hooks**
- - Crawl4AI offers seven key hooks to modify and control different stages in the crawling lifecycle:
- - `on_browser_created`
- - `on_user_agent_updated`
- - `on_execution_started`
- - `before_goto`
- - `after_goto`
- - `before_return_html`
- - `before_retrieve_html`
-
-#### **3. Hook-by-Hook Explanation and Examples**
-
----
-
-##### **Hook 1: `on_browser_created`**
- - **Purpose**: Triggered right after the browser instance is created.
- - **Use Case**:
- - Initializing browser-specific settings or performing setup actions.
- - Configuring browser extensions or scripts before any page is opened.
- - **Example**:
- ```python
- async def log_browser_creation(browser):
- print("Browser instance created:", browser)
-
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- ```
- - **Explanation**: This hook logs the browser creation event, useful for tracking when a new browser instance starts.
-
----
-
-##### **Hook 2: `on_user_agent_updated`**
- - **Purpose**: Called whenever the user agent string is updated.
- - **Use Case**:
- - Modifying the user agent based on page requirements, e.g., changing to a mobile user agent for mobile-only pages.
- - **Example**:
- ```python
- def update_user_agent(user_agent):
- print(f"User Agent Updated: {user_agent}")
-
- crawler.crawler_strategy.set_hook('on_user_agent_updated', update_user_agent)
- crawler.update_user_agent("Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)")
- ```
- - **Explanation**: This hook provides a callback every time the user agent changes, helpful for debugging or dynamically altering user agent settings based on conditions.
-
----
-
-##### **Hook 3: `on_execution_started`**
- - **Purpose**: Called right before the crawler begins any interaction (e.g., JavaScript execution, clicks).
- - **Use Case**:
- - Performing setup actions, such as inserting cookies or initiating custom scripts.
- - **Example**:
- ```python
- async def log_execution_start(page):
- print("Execution started on page:", page.url)
-
- crawler.crawler_strategy.set_hook('on_execution_started', log_execution_start)
- ```
- - **Explanation**: Logs the start of any major interaction on the page, ideal for cases where you want to monitor each interaction.
-
----
-
-##### **Hook 4: `before_goto`**
- - **Purpose**: Triggered before navigating to a new URL with `page.goto()`.
- - **Use Case**:
- - Modifying request headers or setting up conditions right before the page loads.
- - Adding headers or dynamically adjusting options for specific URLs.
- - **Example**:
- ```python
- async def modify_headers_before_goto(page):
- await page.set_extra_http_headers({"X-Custom-Header": "CustomValue"})
- print("Custom headers set before navigation")
-
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- ```
-   - **Explanation**: This hook allows injecting headers or altering settings based on the page's needs, particularly useful for pages with custom requirements.
-
----
-
-##### **Hook 5: `after_goto`**
- - **Purpose**: Executed immediately after a page has loaded (after `page.goto()`).
- - **Use Case**:
- - Checking the loaded page state, modifying the DOM, or performing post-navigation actions (e.g., scrolling).
- - **Example**:
- ```python
- async def post_navigation_scroll(page):
- await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
- print("Scrolled to the bottom after navigation")
-
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- ```
- - **Explanation**: This hook scrolls to the bottom of the page after loading, which can help load dynamically added content like infinite scroll elements.
-
----
-
-##### **Hook 6: `before_return_html`**
- - **Purpose**: Called right before HTML content is retrieved and returned.
- - **Use Case**:
- - Removing overlays or cleaning up the page for a cleaner HTML extraction.
- - **Example**:
- ```python
- async def remove_advertisements(page, html):
- await page.evaluate("document.querySelectorAll('.ad-banner').forEach(el => el.remove());")
- print("Advertisements removed before returning HTML")
-
- crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
- ```
-   - **Explanation**: The hook removes ad banners from the HTML before it's retrieved, ensuring a cleaner data extraction.
-
----
-
-##### **Hook 7: `before_retrieve_html`**
- - **Purpose**: Runs right before Crawl4AI initiates HTML retrieval.
- - **Use Case**:
- - Finalizing any page adjustments (e.g., setting timers, waiting for specific elements).
- - **Example**:
- ```python
- async def wait_for_content_before_retrieve(page):
- await page.wait_for_selector('.main-content')
- print("Main content loaded, ready to retrieve HTML")
-
- crawler.crawler_strategy.set_hook('before_retrieve_html', wait_for_content_before_retrieve)
- ```
- - **Explanation**: This hook waits for the main content to load before retrieving the HTML, ensuring that all essential content is captured.
-
-#### **4. Setting Hooks in Crawl4AI**
- - **How to Set Hooks**:
- - Use `set_hook` to define a custom function for each hook.
- - Each hook function can be asynchronous (useful for actions like waiting or retrieving async data).
- - **Example Setup**:
- ```python
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- ```
-
-#### **5. Complete Example: Using Hooks for a Customized Crawl Workflow**
- - **Goal**: Log each key step, set custom headers before navigation, and clean up the page before retrieving HTML.
- - **Example Code**:
- ```python
- async def custom_crawl():
- async with AsyncWebCrawler() as crawler:
- # Set hooks for custom workflow
- crawler.crawler_strategy.set_hook('on_browser_created', log_browser_creation)
- crawler.crawler_strategy.set_hook('before_goto', modify_headers_before_goto)
- crawler.crawler_strategy.set_hook('after_goto', post_navigation_scroll)
- crawler.crawler_strategy.set_hook('before_return_html', remove_advertisements)
-
- # Perform the crawl
- url = "https://example.com"
- result = await crawler.arun(url=url)
- print(result.html) # Display or process HTML
- ```
-
-#### **6. Benefits of Using Hooks in Custom Crawling Workflows**
- - **Enhanced Control**: Hooks offer precise control over each stage, allowing adjustments based on content and structure.
- - **Efficient Modifications**: Avoid reloading or restarting the session; hooks can alter actions dynamically.
- - **Context-Sensitive Actions**: Hooks enable custom logic tailored to specific pages or sections, maximizing extraction quality.
-
-#### **7. Wrap Up & Next Steps**
- - Recap how hooks empower customized workflows in Crawl4AI, enabling flexibility at every stage.
- - Tease the next video: **Automating Post-Processing with Crawl4AI**, covering automated steps after data extraction.
-
----
-
-This outline provides a thorough understanding of hooks, their practical applications, and examples for customizing the crawling workflow in Crawl4AI.
\ No newline at end of file
diff --git a/docs/md_v3/tutorials/async-webcrawler-basics.md b/docs/md_v3/tutorials/async-webcrawler-basics.md
deleted file mode 100644
index 6236d899..00000000
--- a/docs/md_v3/tutorials/async-webcrawler-basics.md
+++ /dev/null
@@ -1,235 +0,0 @@
-Below is a sample Markdown file (`tutorials/async-webcrawler-basics.md`) illustrating how you might teach new users the fundamentals of `AsyncWebCrawler`. This tutorial builds on the **Getting Started** section by introducing key configuration parameters and the structure of the crawl result. Feel free to adjust the code snippets, wording, or format to match your style.
-
----
-
-# AsyncWebCrawler Basics
-
-In this tutorial, you'll learn how to:
-
-1. Create and configure an `AsyncWebCrawler` instance
-2. Understand the `CrawlResult` object returned by `arun()`
-3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl
-
-> **Prerequisites**
-> - You've already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).
-> - You have **Crawl4AI** installed and configured with Playwright.
-
----
-
-## 1. What is `AsyncWebCrawler`?
-
-`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it's your high-level interface for collecting page data.
-
-```python
-from crawl4ai import AsyncWebCrawler
-
-async with AsyncWebCrawler() as crawler:
- result = await crawler.arun("https://example.com")
- print(result)
-```
-
----
-
-## 2. Creating a Basic `AsyncWebCrawler` Instance
-
-Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai import BrowserConfig, CrawlerRunConfig
-
-async def main():
- # 1. Set up configuration objects (optional if you want defaults)
- browser_config = BrowserConfig(
- browser_type="chromium",
- headless=True,
- verbose=True
- )
- crawler_config = CrawlerRunConfig(
- page_timeout=30000, # 30 seconds
- wait_for_images=True,
- verbose=True
- )
-
- # 2. Initialize AsyncWebCrawler with your chosen browser config
- async with AsyncWebCrawler(config=browser_config) as crawler:
- # 3. Run a single crawl
- url_to_crawl = "https://example.com"
- result = await crawler.arun(url=url_to_crawl, config=crawler_config)
-
- # 4. Inspect the result
- if result.success:
- print(f"Successfully crawled: {result.url}")
- print(f"HTML length: {len(result.html)}")
- print(f"Markdown snippet: {result.markdown[:200]}...")
- else:
- print(f"Failed to crawl {result.url}. Error: {result.error_message}")
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-### Key Points
-
-1. **`BrowserConfig`** is optional, but it's the place to specify browser-related settings (e.g., `headless`, `browser_type`).
-2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
-3. **`arun()`** is the main method to crawl a single URL. We'll see how `arun_many()` works in later tutorials.
-
----
-
-## 3. Understanding `CrawlResult`
-
-When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:
-
-```python
-class CrawlResult(BaseModel):
- url: str
- html: str
- success: bool
- cleaned_html: Optional[str] = None
- media: Dict[str, List[Dict]] = {}
- links: Dict[str, List[Dict]] = {}
- screenshot: Optional[str] = None # base64-encoded screenshot if requested
- pdf: Optional[bytes] = None # binary PDF data if requested
- markdown: Optional[Union[str, MarkdownGenerationResult]] = None
- markdown_v2: Optional[MarkdownGenerationResult] = None
- error_message: Optional[str] = None
- # ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
-```
-
-### Commonly Used Fields
-
-- **`success`**: `True` if the crawl succeeded, `False` otherwise.
-- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).
-- **`markdown` / `markdown_v2`**: Contains the automatically generated Markdown representation of the page.
-- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.
-- **`links`**: A dictionary with lists of "internal" and "external" link objects.
-- **`error_message`**: If `success` is `False`, this often contains a description of the error.
-
-**Example**:
-
-```python
-if result.success:
- print("Page Title or snippet of HTML:", result.html[:200])
- if result.markdown:
- print("Markdown snippet:", result.markdown[:200])
- print("Links found:", len(result.links.get("internal", [])), "internal links")
-else:
- print("Error crawling:", result.error_message)
-```
-
----
-
-## 4. Relevant Basic Parameters
-
-Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We'll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.
-
-### 4.1 `BrowserConfig` Essentials
-
-| Parameter | Description | Default |
-|--------------------|-----------------------------------------------------------|----------------|
-| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"` | `"chromium"` |
-| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
-| `verbose` | Print extra logs for debugging. | `True` |
-| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
-
-### 4.2 `CrawlerRunConfig` Essentials
-
-| Parameter | Description | Default |
-|-----------------------|--------------------------------------------------------------|--------------------|
-| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
-| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
-| `css_selector` | Target only certain elements for extraction. | `None` |
-| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.) | `None` |
-| `verbose` | Print logs for debugging. | `True` |
-
-> **Tip**: Don't worry if you see lots of parameters. You'll learn them gradually in later tutorials.
-
----
-
-## 5. Windows-Specific Configuration
-
-When using AsyncWebCrawler on Windows, you might encounter a `NotImplementedError` related to `asyncio.create_subprocess_exec`. This is a known Windows-specific issue that occurs because Windows' default event loop doesn't support subprocess operations.
-
-To resolve this, Crawl4AI provides a utility function to configure Windows to use the ProactorEventLoop. Call this function before running any async operations:
-
-```python
-from crawl4ai.utils import configure_windows_event_loop
-
-# Call this before any async operations if you're on Windows
-configure_windows_event_loop()
-
-# Your AsyncWebCrawler code here
-```
-
----
-
-## 6. Putting It All Together
-
-Here's a slightly more in-depth example that shows off a few key config parameters at once:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai import BrowserConfig, CrawlerRunConfig
-
-async def main():
- browser_cfg = BrowserConfig(
- browser_type="chromium",
- headless=True,
- java_script_enabled=True,
- verbose=False
- )
-
- crawler_cfg = CrawlerRunConfig(
- page_timeout=30000, # wait up to 30 seconds
- wait_for_images=True,
- css_selector=".article-body", # only extract content under this CSS selector
- verbose=True
- )
-
- async with AsyncWebCrawler(config=browser_cfg) as crawler:
- result = await crawler.arun("https://news.example.com", config=crawler_cfg)
-
- if result.success:
- print("[OK] Crawled:", result.url)
- print("HTML length:", len(result.html))
- print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
- else:
- print("[ERROR]", result.error_message)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**Key Observations**:
-- `css_selector=".article-body"` ensures we only focus on the main content region.
-- `page_timeout=30000` helps if the site is slow.
-- We turned off `verbose` logs for the browser but kept them on for the crawler config.
-
----
-
-## 7. Next Steps
-
-- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
-- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
-- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).
-
----
-
-## Summary
-
-You now know the basics of **AsyncWebCrawler**:
-- How to create it with optional browser/crawler configs
-- How `arun()` works for single-page crawls
-- Where to find your crawled data in `CrawlResult`
-- A handful of frequently used configuration parameters
-
-From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let's move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.
-
----
-
-**Last updated**: 2024-XX-XX
-
-Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.
\ No newline at end of file
diff --git a/docs/md_v3/tutorials/docker-quickstart.md b/docs/md_v3/tutorials/docker-quickstart.md
deleted file mode 100644
index 73070baa..00000000
--- a/docs/md_v3/tutorials/docker-quickstart.md
+++ /dev/null
@@ -1,271 +0,0 @@
-# Deploying with Docker (Quickstart)
-
-> **⚠️ WARNING: Experimental & Legacy**
-> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.
-
-Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.
-
----
-
-## 1. Installation & Environment Setup (Outside Docker)
-
-Before we jump into Docker usage, here's a quick reminder of how to install Crawl4AI locally (legacy doc). For **non-Docker** deployments or local dev:
-
-```bash
-# 1. Install the package
-pip install crawl4ai
-crawl4ai-setup
-
-# 2. Install playwright dependencies (all browsers or specific ones)
-playwright install --with-deps
-# or
-playwright install --with-deps chromium
-# or
-playwright install --with-deps chrome
-```
-
-**Testing** your installation:
-
-```bash
-# Visible browser test
-python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
-```
-
----
-
-## 2. Docker Overview
-
-This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:
-
-1. **POST** a request (e.g., URLs, extraction config)
-2. **Retrieve** your results from a task-based endpoint
-
-> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.
-
----
-
-## 3. Pulling and Running the Image
-
-### Basic Run
-
-```bash
-docker pull unclecode/crawl4ai:basic
-docker run -p 11235:11235 unclecode/crawl4ai:basic
-```
-
-This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
-
-### Using an API Token
-
-```bash
-docker run -p 11235:11235 \
- -e CRAWL4AI_API_TOKEN=your_secret_token \
- unclecode/crawl4ai:basic
-```
-
-If **`CRAWL4AI_API_TOKEN`** is set, you must include an `Authorization: Bearer <your_secret_token>` header in your requests. Otherwise, the service is open to anyone.
-
----
-
-## 4. Docker Compose for Multi-Container Workflows
-
-You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:
-
-```yaml
-version: '3.8'
-
-services:
- crawl4ai:
- image: unclecode/crawl4ai:basic
- ports:
- - "11235:11235"
- environment:
- - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
- - OPENAI_API_KEY=${OPENAI_API_KEY:-}
- # Additional env variables as needed
- volumes:
- - /dev/shm:/dev/shm
-```
-
-To run:
-
-```bash
-docker-compose up -d
-```
-
-And to stop:
-
-```bash
-docker-compose down
-```
-
-**Troubleshooting**:
-
-- **Check logs**: `docker-compose logs -f crawl4ai`
-- **Remove orphan containers**: `docker-compose down --remove-orphans`
-- **Remove networks**: `docker network rm <network_name>`
-
----
-
-## 5. Making Requests to the Container
-
-**Base URL**: `http://localhost:11235`
-
-### Example: Basic Crawl
-
-```python
-import requests
-
-task_request = {
- "urls": "https://example.com",
- "priority": 10
-}
-
-response = requests.post("http://localhost:11235/crawl", json=task_request)
-task_id = response.json()["task_id"]
-
-# Poll for status
-status_url = f"http://localhost:11235/task/{task_id}"
-status = requests.get(status_url).json()
-print(status)
-```
-
-If you used an API token, do:
-
-```python
-headers = {"Authorization": "Bearer your_secret_token"}
-response = requests.post(
- "http://localhost:11235/crawl",
- headers=headers,
- json=task_request
-)
-```
-
----
-
-## 6. Docker + New Crawler Config Approach
-
-### Using `BrowserConfig` & `CrawlerRunConfig` in Requests
-
-The Docker-based solution can accept **crawler configurations** in the request JSON (legacy doc might show direct parameters, but we want to embed them in `crawler_params` or `extra` to align with the new approach). For example:
-
-```python
-import requests
-
-request_data = {
- "urls": "https://www.nbcnews.com/business",
- "crawler_params": {
- "headless": True,
- "browser_type": "chromium",
- "verbose": True,
- "page_timeout": 30000,
- # ... any other BrowserConfig-like fields
- },
- "extra": {
- "word_count_threshold": 50,
- "bypass_cache": True
- }
-}
-
-response = requests.post("http://localhost:11235/crawl", json=request_data)
-task_id = response.json()["task_id"]
-```
-
-This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.
-
----
-
-## 7. Example: JSON Extraction in Docker
-
-```python
-import requests
-import json
-
-# Define a schema for CSS extraction
-schema = {
- "name": "Coinbase Crypto Prices",
- "baseSelector": ".cds-tableRow-t45thuk",
- "fields": [
- {
- "name": "crypto",
- "selector": "td:nth-child(1) h2",
- "type": "text"
- },
- {
- "name": "symbol",
- "selector": "td:nth-child(1) p",
- "type": "text"
- },
- {
- "name": "price",
- "selector": "td:nth-child(2)",
- "type": "text"
- }
- ]
-}
-
-request_data = {
- "urls": "https://www.coinbase.com/explore",
- "extraction_config": {
- "type": "json_css",
- "params": {"schema": schema}
- },
- "crawler_params": {
- "headless": True,
- "verbose": True
- }
-}
-
-resp = requests.post("http://localhost:11235/crawl", json=request_data)
-task_id = resp.json()["task_id"]
-
-# Poll for status
-status = requests.get(f"http://localhost:11235/task/{task_id}").json()
-if status["status"] == "completed":
- extracted_content = status["result"]["extracted_content"]
- data = json.loads(extracted_content)
- print("Extracted:", len(data), "entries")
-else:
- print("Task still in progress or failed.")
-```
-
----
-
-## 8. Why This Docker Is Temporary
-
-**We are building a new, stable approach**:
-
-- The current Docker container is **experimental** and might break with future releases.
-- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.
-- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.
-
-**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.
-
----
-
-## 9. Known Limitations & Next Steps
-
-1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
-2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.
-3. **LLM Integrations**: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
-4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
-5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates.
-
-### Next Steps
-
-- **Watch the Repository**: For announcements on the new Docker architecture.
-- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.
-- **Contribute**: If you have ideas or improvements, open a PR or discussion.
-- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.
-
----
-
-## 10. Summary
-
-**Deploying with Docker** can simplify running Crawl4AI as a service. However:
-
-- **This Docker** approach is **legacy** and subject to removal/overhaul.
-- For production, please weigh the risks carefully.
-- A detailed "new Docker approach" is coming in **2025**.
-
-We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!
\ No newline at end of file
diff --git a/docs/md_v3/tutorials/getting-started.md b/docs/md_v3/tutorials/getting-started.md
deleted file mode 100644
index b148e6e1..00000000
--- a/docs/md_v3/tutorials/getting-started.md
+++ /dev/null
@@ -1,272 +0,0 @@
-# Getting Started with Crawl4AI
-
-Welcome to **Crawl4AI**, an open-source, LLM-friendly Web Crawler & Scraper. In this tutorial, you'll:
-
-1. **Install** Crawl4AI (both via pip and Docker, with notes on platform challenges).
-2. Run your **first crawl** using minimal configuration.
-3. Generate **Markdown** output (and learn how itâs influenced by content filters).
-4. Experiment with a simple **CSS-based extraction** strategy.
-5. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
-
----
-
-## 1. Introduction
-
-Crawl4AI provides:
-- An asynchronous crawler, **`AsyncWebCrawler`**.
-- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
-- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports additional filters).
-- Multiple extraction strategies (LLM-based or "traditional" CSS/XPath-based).
-
-By the end of this guide, you'll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
-
----
-
-## 2. Installation
-
-### 2.1 Python + Playwright
-
-#### Basic Pip Installation
-
-```bash
-pip install crawl4ai
-crawl4ai-setup
-
-# Verify your installation
-crawl4ai-doctor
-```
-
-If you encounter any browser-related issues, you can install them manually:
-```bash
-python -m playwright install --with-deps chrome chromium
-```
-
-- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
-
-We cover advanced installation and Docker in the [Installation](#installation) section.
-
----
-
-## 3. Your First Crawl
-
-Here's a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-
-async def main():
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun("https://example.com")
- print(result.markdown[:300]) # Print first 300 chars
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**What's happening?**
-- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
-- It fetches `https://example.com`.
-- Crawl4AI automatically converts the HTML into Markdown.
-
-You now have a simple, working crawl!
-
----
-
-## 4. Basic Configuration (Light Introduction)
-
-Crawl4AI's crawler can be heavily customized using two main classes:
-
-1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
-2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
-
-Below is an example with minimal usage:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-
-async def main():
- browser_conf = BrowserConfig(headless=True) # or False to see the browser
-    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
-
- async with AsyncWebCrawler(config=browser_conf) as crawler:
- result = await crawler.arun(
- url="https://example.com",
- config=run_conf
- )
- print(result.markdown)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-We'll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
-
----
-
-## 5. Generating Markdown Output
-
-By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
-
-- **`result.markdown`**:
- The direct HTML-to-Markdown conversion.
-- **`result.markdown.fit_markdown`**:
- The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
-
-### Example: Using a Filter with `DefaultMarkdownGenerator`
-
-```python
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-md_generator = DefaultMarkdownGenerator(
- content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
-)
-
-config = CrawlerRunConfig(markdown_generator=md_generator)
-
-async with AsyncWebCrawler() as crawler:
- result = await crawler.arun("https://news.ycombinator.com", config=config)
- print("Raw Markdown length:", len(result.markdown.raw_markdown))
- print("Fit Markdown length:", len(result.markdown.fit_markdown))
-```
-
-**Note**: If you do **not** specify a content filter or markdown generator, you'll typically see only the raw Markdown. We'll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
-
----
-
-## 6. Simple Data Extraction (CSS-based)
-
-Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
-
-```python
-import asyncio
-import json
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-async def main():
- schema = {
- "name": "Example Items",
- "baseSelector": "div.item",
- "fields": [
- {"name": "title", "selector": "h2", "type": "text"},
- {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
- ]
- }
-
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/items",
- config=CrawlerRunConfig(
- extraction_strategy=JsonCssExtractionStrategy(schema)
- )
- )
- # The JSON output is stored in 'extracted_content'
- data = json.loads(result.extracted_content)
- print(data)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**Why is this helpful?**
-- Great for repetitive page structures (e.g., item listings, articles).
-- No AI usage or costs.
-- The crawler returns a JSON string you can parse or store.
-
----
-
-## 7. Simple Data Extraction (LLM-based)
-
-For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
-
-- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)
-- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
-- Or any provider supported by the underlying library
-
-Below is an example using **open-source** style (no token) and closed-source:
-
-```python
-import os
-import json
-import asyncio
-from pydantic import BaseModel, Field
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-class PricingInfo(BaseModel):
- model_name: str = Field(..., description="Name of the AI model")
- input_fee: str = Field(..., description="Fee for input tokens")
- output_fee: str = Field(..., description="Fee for output tokens")
-
-async def main():
- # 1) Open-Source usage: no token required
- llm_strategy_open_source = LLMExtractionStrategy(
- provider="ollama/llama3.3", # or "any-other-local-model"
- api_token="no_token", # for local models, no API key is typically required
- schema=PricingInfo.schema(),
- extraction_type="schema",
- instruction="""
- From this page, extract all AI model pricing details in JSON format.
- Each entry should have 'model_name', 'input_fee', and 'output_fee'.
- """,
- temperature=0
- )
-
- # 2) Closed-Source usage: API key for OpenAI, for example
- openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
- llm_strategy_openai = LLMExtractionStrategy(
- provider="openai/gpt-4",
- api_token=openai_token,
- schema=PricingInfo.schema(),
- extraction_type="schema",
- instruction="""
- From this page, extract all AI model pricing details in JSON format.
- Each entry should have 'model_name', 'input_fee', and 'output_fee'.
- """,
- temperature=0
- )
-
- # We'll demo the open-source approach here
- config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)
-
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun(
- url="https://example.com/pricing",
- config=config
- )
- print("LLM-based extraction JSON:", result.extracted_content)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**What's happening?**
-- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
-- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
-- Depending on the **provider** and **api_token**, you can use local models or a remote API.
-
----
-
-## 8. Next Steps
-
-Congratulations! You have:
-1. Installed Crawl4AI (via pip, with Docker as an option).
-2. Performed a simple crawl and printed Markdown.
-3. Seen how adding a **markdown generator** + **content filter** can produce "fit" Markdown.
-4. Experimented with **CSS-based** extraction for repetitive data.
-5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
-
-If you are ready for more, check out:
-
-- **Installation**: Learn more on how to install Crawl4AI and set up Playwright.
-- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
-- **Markdown Generation Basics**: Dive deeper into content filtering and "fit markdown" usage.
-- **Dynamic Pages & Hooks**: Tackle sites with "Load More" buttons, login forms, or JavaScript complexities.
-- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
-- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
-
-Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!
diff --git a/docs/md_v3/tutorials/getting-warmer.md b/docs/md_v3/tutorials/getting-warmer.md
deleted file mode 100644
index b2deb414..00000000
--- a/docs/md_v3/tutorials/getting-warmer.md
+++ /dev/null
@@ -1,527 +0,0 @@
-# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
-
-Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
-
-**What Crawl4AI is not:**
-
-Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright. It's not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:
-
-- To generate perfect, AI-friendly data (particularly for LLMs) from web content
-- To maximize speed and efficiency in data extraction and processing
-- To operate at scale, from Raspberry Pi to cloud infrastructures
-
-Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It's super efficient and fast, optimized to:
-
-1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
-2. Implement intelligent extraction strategies to reduce reliance on costly API calls
-3. Provide a streamlined pipeline for AI data preparation and ingestion
-
-In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than offering broad web automation capabilities.
-
-**Key Links:**
-
-- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
-- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
-- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
-- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)
-
----
-
-## Table of Contents
-
-- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
- - [Table of Contents](#table-of-contents)
- - [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
- - [2. Installation \& Environment Setup](#2-installation--environment-setup)
- - [Test Your Installation](#test-your-installation)
- - [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
- - [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
- - [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
- - [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
- - [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
- - [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
- - [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
- - [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
- - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
- - [Using `storage_state`](#using-storage_state)
- - [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
- - [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
- - [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
- - [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
- - [15. Dockerization \& Scaling](#15-dockerization--scaling)
- - [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
- - [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
- - [18. Further Resources \& Community](#18-further-resources--community)
-
----
-
-## 1. Introduction & Key Concepts
-
-Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
-
-**Quick Test:**
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-
-async def test_run():
- async with AsyncWebCrawler() as crawler:
- result = await crawler.arun("https://example.com")
- print(result.markdown)
-
-asyncio.run(test_run())
-```
-
-If you see Markdown output, everything is working!
-
-**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)
-
----
-
-## 2. Installation & Environment Setup
-
-```bash
-# Install the package
-pip install crawl4ai
-crawl4ai-setup
-
-# Install Playwright with system dependencies (recommended)
-playwright install --with-deps # Installs all browsers
-
-# Or install specific browsers:
-playwright install --with-deps chrome # Recommended for Colab/Linux
-playwright install --with-deps firefox
-playwright install --with-deps webkit
-playwright install --with-deps chromium
-
-# Keep Playwright updated periodically
-playwright install
-```
-
-> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably.
-
-### Test Your Installation
-Try these one-liners:
-
-```python
-# Visible browser test
-python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
-
-# Headless test (for servers/CI)
-python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
-```
-
-You should see a browser window (in visible test) loading example.com. If you get errors, try with Firefox using `playwright install --with-deps firefox`.
-
-
-**Try in Colab:**
-[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-
-**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)
-
----
-
-## 3. Core Concepts & Configuration
-
-Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
-
-**Example config:**
-
-```python
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-
-browser_config = BrowserConfig(
- headless=True,
- verbose=True,
- viewport_width=1080,
- viewport_height=600,
- text_mode=False,
- ignore_https_errors=True,
- java_script_enabled=True
-)
-
-run_config = CrawlerRunConfig(
- css_selector="article.main",
- word_count_threshold=50,
- excluded_tags=['nav','footer'],
- exclude_external_links=True,
- wait_for="css:.article-loaded",
- page_timeout=60000,
- delay_before_return_html=1.0,
- mean_delay=0.1,
- max_range=0.3,
- process_iframes=True,
- remove_overlay_elements=True,
- js_code="""
- (async () => {
- window.scrollTo(0, document.body.scrollHeight);
- await new Promise(r => setTimeout(r, 2000));
- document.querySelector('.load-more')?.click();
- })();
- """
-)
-
-# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
-# run_config.cache_mode = CacheMode.ENABLED
-```
-
-**Prefixes:**
-
-- `http://` or `https://` for live pages
-- `file://local.html` for local
-- `raw:` for raw HTML strings
-
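-For instance, a quick sketch of the `raw:` prefix with an inline HTML string:
-
-```python
-# Crawl an in-memory HTML snippet instead of a live URL
-result = await crawler.arun("raw:<h1>Hello</h1><p>Inline HTML</p>", config=run_config)
-print(result.markdown)
-```
-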
-**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)
-
----
-
-## 4. Basic Crawling & Simple Extraction
-
-```python
-async with AsyncWebCrawler(config=browser_config) as crawler:
- result = await crawler.arun("https://news.example.com/article", config=run_config)
- print(result.markdown) # Basic markdown content
-```
-
-**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)
-
----
-
-## 5. Markdown Generation & AI-Optimized Output
-
-After crawling, `result.markdown_v2` provides:
-
-- `raw_markdown`: Unfiltered markdown
-- `markdown_with_citations`: Links as references at the bottom
-- `references_markdown`: A separate list of reference links
-- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
-- `fit_html`: The HTML used to produce `fit_markdown`
-
-**Example:**
-
-```python
-print("RAW:", result.markdown_v2.raw_markdown[:200])
-print("CITED:", result.markdown_v2.markdown_with_citations[:200])
-print("REFERENCES:", result.markdown_v2.references_markdown)
-print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
-```
-
-For AI training, `fit_markdown` focuses on the most relevant content.
-
-**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)
-
----
-
-## 6. Structured Data Extraction (CSS, XPath, LLM)
-
-Extract JSON data without LLMs:
-
-**CSS:**
-
-```python
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-
-schema = {
- "name": "Products",
- "baseSelector": ".product",
- "fields": [
- {"name": "title", "selector": "h2", "type": "text"},
- {"name": "price", "selector": ".price", "type": "text"}
- ]
-}
-run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
-```
-
-**XPath:**
-
-```python
-from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
-
-xpath_schema = {
- "name": "Articles",
- "baseSelector": "//div[@class='article']",
- "fields": [
- {"name":"headline","selector":".//h1","type":"text"},
- {"name":"summary","selector":".//p[@class='summary']","type":"text"}
- ]
-}
-run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
-```
-
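-Whichever schema you use, the structured records come back as a JSON string in `result.extracted_content`. A minimal sketch of consuming it after `crawler.arun(...)`:
-
-```python
-import json
-
-if result.success and result.extracted_content:
-    items = json.loads(result.extracted_content)  # List of dicts matching the schema fields
-    for item in items:
-        print(item)
-```
-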
-**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
-
----
-
-## 7. Advanced Extraction: LLM & Open-Source Models
-
-Use `LLMExtractionStrategy` for complex tasks. It works with OpenAI or open-source models (e.g., Ollama).
-
-```python
-from pydantic import BaseModel
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-class TravelData(BaseModel):
- destination: str
- attractions: list
-
-run_config.extraction_strategy = LLMExtractionStrategy(
- provider="ollama/nemotron",
- schema=TravelData.schema(),
- instruction="Extract destination and top attractions."
-)
-```
-
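-For a hosted provider, you typically also supply credentials. A hedged variant, assuming an OpenAI key in your environment (parameter names may differ across versions):
-
-```python
-import os
-
-run_config.extraction_strategy = LLMExtractionStrategy(
-    provider="openai/gpt-4o-mini",
-    api_token=os.getenv("OPENAI_API_KEY"),  # Required for hosted providers
-    schema=TravelData.schema(),
-    instruction="Extract destination and top attractions."
-)
-```
-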
-**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
-
----
-
-## 8. Page Interactions, JS Execution, & Dynamic Content
-
-Insert `js_code` and use `wait_for` to ensure content loads. Example:
-
-```python
-run_config.js_code = """
-(async () => {
- document.querySelector('.load-more')?.click();
- await new Promise(r => setTimeout(r, 2000));
-})();
-"""
-run_config.wait_for = "css:.item-loaded"
-```
-
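-Besides `css:`, `wait_for` also accepts a `js:` prefix that waits until a boolean JavaScript expression becomes true:
-
-```python
-# Wait until at least 10 items have rendered on the page
-run_config.wait_for = "js:() => document.querySelectorAll('.item').length >= 10"
-```
-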
-**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)
-
----
-
-## 9. Media, Links, & Metadata Handling
-
-`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance.
-
-`result.media["videos"]`, `result.media["audios"]` similarly hold media info.
-
-`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`.
-
-`result.metadata`: Title, description, keywords, author.
-
-**Example:**
-
-```python
-# Images
-for img in result.media["images"]:
- print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))
-
-# Links
-for link in result.links["external"]:
- print("External Link:", link["href"], "Text:", link["text"])
-
-# Metadata
-print("Page Title:", result.metadata["title"])
-print("Description:", result.metadata["description"])
-```
-
-**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)
-
----
-
-## 10. Authentication & Identity Preservation
-
-### Manual Setup via User Data Directory
-
-1. **Open Chrome with a custom user data dir:**
-
- ```bash
- "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
- ```
-
- On macOS:
-
- ```bash
- "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
- ```
-
-2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
- The browser saves cookies/localStorage in that directory.
-
-3. **Use `user_data_dir` in `BrowserConfig`:**
-
- ```python
- browser_config = BrowserConfig(
- headless=True,
- user_data_dir="/Users/username/ChromeProfiles/MyProfile"
- )
- ```
-
- Now the crawler starts with those cookies, sessions, etc.
-
-### Using `storage_state`
-
-Alternatively, export and reuse storage states:
-
-```python
-browser_config = BrowserConfig(
- headless=True,
- storage_state="mystate.json" # Pre-saved state
-)
-```
-
-No repeated logins needed.
-
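-One way to produce `mystate.json` is with Playwright itself; a sketch where you log in manually in the window it opens:
-
-```python
-from playwright.sync_api import sync_playwright
-
-with sync_playwright() as p:
-    browser = p.chromium.launch(headless=False)
-    context = browser.new_context()
-    page = context.new_page()
-    page.goto("https://example.com/login")
-    input("Log in manually, then press Enter...")
-    context.storage_state(path="mystate.json")  # Saves cookies + localStorage
-    browser.close()
-```
-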
-**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)
-
----
-
-## 11. Proxy & Security Enhancements
-
-Use `proxy_config` for authenticated proxies:
-
-```python
-browser_config.proxy_config = {
- "server": "http://proxy.example.com:8080",
- "username": "proxyuser",
- "password": "proxypass"
-}
-```
-
-Combine with `headers` or `ignore_https_errors` as needed.
-
-**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)
-
----
-
-## 12. Screenshots, PDFs & File Downloads
-
-Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:
-
-```python
-run_config.screenshot = True
-run_config.pdf = True
-```
-
-After crawling:
-
-```python
-if result.screenshot:
- with open("page.png", "wb") as f:
- f.write(result.screenshot)
-
-if result.pdf:
- with open("page.pdf", "wb") as f:
- f.write(result.pdf)
-```
-
-**File Downloads:**
-
-```python
-browser_config.accept_downloads = True
-browser_config.downloads_path = "./downloads"
-run_config.js_code = """document.querySelector('a.download')?.click();"""
-
-# After crawl:
-print("Downloaded files:", result.downloaded_files)
-```
-
-**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
-Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)
-
----
-
-## 13. Caching & Performance Optimization
-
-Set `cache_mode` to reuse fetch results:
-
-```python
-from crawl4ai import CacheMode
-run_config.cache_mode = CacheMode.ENABLED
-```
-
-Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
-
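-To force fresh content for a particular run, pick a different mode from the list in Section 3:
-
-```python
-# Ignore any cached copy and fetch fresh content for this run
-run_config.cache_mode = CacheMode.BYPASS
-```
-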
-**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)
-
----
-
-## 14. Hooks for Custom Logic
-
-Hooks let you run custom code at specific points in the crawl lifecycle. Rather than creating pages manually in `on_browser_created`, use `on_page_context_created` to apply routing or modify the page context before the URL is crawled:
-
-**Example Hook:**
-
-```python
-async def on_page_context_created_hook(context, page, **kwargs):
- # Block all images to speed up load
- await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
- print("[HOOK] Image requests blocked")
-
-async with AsyncWebCrawler(config=browser_config) as crawler:
- crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
- result = await crawler.arun("https://imageheavy.example.com", config=run_config)
- print("Crawl finished with images blocked.")
-```
-
-This hook is clean and doesn't create a separate page itself; it just modifies the current context/page setup.
-
-**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)
-
----
-
-## 15. Dockerization & Scaling
-
-Use Docker images:
-
-- AMD64 basic:
-
-```bash
-docker pull unclecode/crawl4ai:basic-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
-```
-
-- ARM64 for M1/M2:
-
-```bash
-docker pull unclecode/crawl4ai:basic-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
-```
-
-- GPU support:
-
-```bash
-docker pull unclecode/crawl4ai:gpu-amd64
-docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
-```
-
-Scale with load balancers or Kubernetes.
-
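-Once running, the container exposes an HTTP API on port 11235. A hedged sketch of submitting a job (the `/crawl` endpoint and payload follow the project README for these legacy images; verify against your image version):
-
-```python
-import requests
-
-resp = requests.post(
-    "http://localhost:11235/crawl",
-    json={"urls": "https://example.com", "priority": 10},
-)
-print("Server response:", resp.json())  # Typically includes a task id to poll
-```
-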
-**More info:** [See /docs/proxy_security (for proxy) or relevant Docker instructions in README](#)
-
----
-
-## 16. Troubleshooting & Common Pitfalls
-
-- Empty results? Relax filters, check selectors.
-- Timeouts? Increase `page_timeout` or refine `wait_for`.
-- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
-- JS errors? Try headful mode for debugging (see the one-line sketch below).
-
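-Switching to headful mode is a one-line change:
-
-```python
-browser_config = BrowserConfig(headless=False, verbose=True)  # Watch the browser work
-```
-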
-Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.
-
----
-
-## 17. Comprehensive End-to-End Example
-
-Combine hooks, JS execution, PDF saving, LLM extraction; see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
-
----
-
-## 18. Further Resources & Community
-
-- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
-- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
-
-Follow [@unclecode](https://x.com/unclecode) for news & community updates.
-
-**Happy Crawling!**
-Leverage Crawl4AI to feed your AI models with clean, structured web data today.
diff --git a/docs/md_v3/tutorials/hooks-custom.md b/docs/md_v3/tutorials/hooks-custom.md
deleted file mode 100644
index 2f144065..00000000
--- a/docs/md_v3/tutorials/hooks-custom.md
+++ /dev/null
@@ -1,335 +0,0 @@
-# Hooks & Custom Code
-
-Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:
-
-- **Authentication** (log in before navigating)
-- **Content manipulation** (modify HTML, inject scripts, etc.)
-- **Session or browser configuration** (e.g., adjusting user agents, local storage)
-- **Custom data collection** (scrape extra details or track state at each stage)
-
-In this tutorial, you'll learn about:
-
-1. What hooks are available
-2. How to attach code to each hook
-3. Practical examples (auth flows, user agent changes, content manipulation, etc.)
-
-> **Prerequisites**
-> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
-> - Comfortable with Python async/await.
-
----
-
-## 1. Overview of Available Hooks
-
-| Hook Name | Called When / Purpose | Context / Objects Provided |
-|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
-| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
-| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
-| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
-| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
-| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
-| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
-| **`before_retrieve_html`** | Right before retrieving or finalizing the pageâs HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
-| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |
-
-### A Note on `on_browser_created` (the "unbrowser" hook)
-- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
-- For example, you might control CDP sessions or advanced browser flags here.
-
----
-
-## 2. Registering Hooks
-
-You can attach hooks by calling:
-
-```python
-crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
-```
-
-or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:
-
-```python
-hooks = {
- "before_goto": my_before_goto_hook,
- "after_goto": my_after_goto_hook,
- # ... etc.
-}
-async with AsyncWebCrawler(hooks=hooks) as crawler:
- ...
-```
-
-### Hook Signature
-
-Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**, most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
-
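-A minimal sketch of a conforming hook, using the `before_goto` parameters listed in the table above:
-
-```python
-async def log_navigation_hook(page, context, goto_params, **kwargs):
-    # Runs right before page.goto(); inspect or adjust navigation here
-    print("About to visit:", goto_params.get("url"))
-```
-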
----
-
-## 3. Real-Life Examples
-
-Below are concrete scenarios where hooks come in handy.
-
----
-
-### 3.1 Authentication Before Navigation
-
-One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).
-
-#### Using `before_goto`
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-
-async def before_goto_auth_hook(page, context, goto_params, **kwargs):
- """
- Example: Set cookies or localStorage to simulate login.
- This hook runs right before page.goto() is called.
- """
- # Example: Insert cookie-based auth or local storage data
- # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
- print("[HOOK] Setting auth data before goto.")
- await context.add_cookies([
- {
- "name": "session",
- "value": "abcd1234",
- "domain": "example.com",
- "path": "/"
- }
- ])
- # Optionally manipulate goto_params if needed:
- # goto_params["url"] = goto_params["url"] + "?debug=1"
-
-async def main():
- hooks = {
- "before_goto": before_goto_auth_hook
- }
-
- browser_cfg = BrowserConfig(headless=True)
- crawler_cfg = CrawlerRunConfig()
-
- async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
- result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
- if result.success:
- print("[OK] Logged in and fetched protected page.")
- else:
- print("[ERROR]", result.error_message)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**Key Points**
-- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
-- If you need to run a real login flow (submitting forms), consider `on_browser_created` or `on_page_context_created` if you want to do it once at the start.
-
----
-
-### 3.2 Setting Up the Browser in `on_browser_created`
-
-If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you'll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.
-
-```python
-async def on_browser_created_hook(browser, **kwargs):
- """
- Runs immediately after the browser is created, before any pages.
- 'browser' here is a Playwright Browser object.
- """
- print("[HOOK] Browser created. Setting up custom stuff.")
- # Possibly connect to DevTools or create an incognito context
- # Example (pseudo-code):
- # devtools_url = await browser.new_context(devtools=True)
-
-# Usage:
-async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
- ...
-```
-
----
-
-### 3.3 Adjusting Page or Context in `on_page_context_created`
-
-If you'd like to set default timeouts or inject scripts right after a page context is spun up:
-
-```python
-async def on_page_context_created_hook(page, context, **kwargs):
- print("[HOOK] Page context created. Setting default timeouts or scripts.")
-    page.set_default_timeout(20000)  # 20 seconds (synchronous in Playwright's async API, so no await)
- # Possibly inject a script or set user locale
-
-# Usage:
-hooks = {
- "on_page_context_created": on_page_context_created_hook
-}
-```
-
----
-
-### 3.4 Dynamically Updating User Agents
-
-`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:
-
-```python
-async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
- print(f"[HOOK] User agent updated to {new_ua}")
- # Maybe add a custom header based on new UA
- await context.set_extra_http_headers({"X-UA-Source": new_ua})
-
-hooks = {
- "on_user_agent_updated": on_user_agent_updated_hook
-}
-```
-
----
-
-### 3.5 Initializing Stuff with `on_execution_started`
-
-`on_execution_started` runs before your main crawling logic. It's a good place for short, one-time setup tasks (like clearing old caches or storing a timestamp).
-
-```python
-async def on_execution_started_hook(page, context, **kwargs):
- print("[HOOK] Execution started. Setting a start timestamp or logging.")
- context.set_default_navigation_timeout(45000) # 45s if your site is slow
-
-hooks = {
- "on_execution_started": on_execution_started_hook
-}
-```
-
----
-
-### 3.6 Post-Processing with `after_goto`
-
-After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations, like verifying you're on the right page or removing interstitials:
-
-```python
-async def after_goto_hook(page, context, response, **kwargs):
- """
- Called right after page.goto() finishes, but before the crawler extracts HTML.
- """
- if response and response.ok:
- print("[HOOK] After goto. Status:", response.status)
- # Maybe remove popups or check if we landed on a login failure page.
- await page.evaluate("""() => {
- const popup = document.querySelector(".annoying-popup");
- if (popup) popup.remove();
- }""")
- else:
- print("[HOOK] Navigation might have failed, status not ok or no response.")
-
-hooks = {
- "after_goto": after_goto_hook
-}
-```
-
----
-
-### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`
-
-Sometimes you need to tweak the page or raw HTML right before it's captured.
-
-```python
-async def before_retrieve_html_hook(page, context, **kwargs):
- """
- Modify the DOM just before the crawler finalizes the HTML.
- """
- print("[HOOK] Removing adverts before capturing HTML.")
- await page.evaluate("""() => {
- const ads = document.querySelectorAll(".ad-banner");
- ads.forEach(ad => ad.remove());
- }""")
-
-async def before_return_html_hook(page, context, html, **kwargs):
- """
- 'html' is the near-finished HTML string. Return an updated string if you like.
- """
- # For example, remove personal data or certain tags from the final text
- print("[HOOK] Sanitizing final HTML.")
- sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
- return sanitized_html
-
-hooks = {
- "before_retrieve_html": before_retrieve_html_hook,
- "before_return_html": before_return_html_hook
-}
-```
-
-**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
-
----
-
-## 4. Putting It All Together
-
-You can combine multiple hooks in a single run. For instance:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-
-async def on_browser_created_hook(browser, **kwargs):
- print("[HOOK] Browser is up, no page yet. Good for broad config.")
-
-async def before_goto_auth_hook(page, context, goto_params, **kwargs):
- print("[HOOK] Adding cookies for auth.")
- await context.add_cookies([{"name": "session", "value": "abcd1234", "domain": "example.com"}])
-
-async def after_goto_log_hook(page, context, response, **kwargs):
- if response:
- print("[HOOK] after_goto: Status code:", response.status)
-
-async def main():
- hooks = {
- "on_browser_created": on_browser_created_hook,
- "before_goto": before_goto_auth_hook,
- "after_goto": after_goto_log_hook
- }
-
- browser_cfg = BrowserConfig(headless=True)
- crawler_cfg = CrawlerRunConfig(verbose=True)
-
- async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
- result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
- if result.success:
- print("[OK] Protected page length:", len(result.html))
- else:
- print("[ERROR]", result.error_message)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-This example:
-
-1. **`on_browser_created`** sets up the brand-new browser instance.
-2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
-3. **`after_goto`** logs the resulting HTTP status code.
-
----
-
-## 5. Common Pitfalls & Best Practices
-
-1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
-2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
-3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright's `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
-4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
-5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider whether a separate "how-to" function with a simpler approach might suffice.
-
----
-
-## Conclusion & Next Steps
-
-**Hooks** let you bend Crawl4AI to your will:
-
-- **Authentication** (cookies, localStorage) with `before_goto`
-- **Browser-level config** with `on_browser_created`
-- **Page or context config** with `on_page_context_created`
-- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)
-
-**Where to go next**:
-
-- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
-- **[Reference â AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
-- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated "Load More" clicks.
-
-With the hook system, you have near-complete control over the browser's lifecycle, whether it's setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!
-
-**Last Updated**: 2024-XX-XX
diff --git a/docs/md_v3/tutorials/targeted-crawling.md b/docs/md_v3/tutorials/targeted-crawling.md
deleted file mode 100644
index f5fe2b77..00000000
--- a/docs/md_v3/tutorials/targeted-crawling.md
+++ /dev/null
@@ -1,227 +0,0 @@
-Below is a **draft** of a follow-up tutorial, **"Smart Crawling Techniques,"** building on the **"AsyncWebCrawler Basics"** tutorial. This tutorial focuses on three main points:
-
-1. **Advanced usage of CSS selectors** (e.g., partial extraction, exclusions)
-2. **Handling iframes** (if relevant for your workflow)
-3. **Waiting for dynamic content** using `wait_for`, including the new `css:` and `js:` prefixes
-
-Feel free to adjust code snippets, wording, or emphasis to match your library updates or user feedback.
-
----
-
-# Smart Crawling Techniques
-
-In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now it's time to explore some of the **targeted crawling** features that let you:
-
-1. Select specific parts of a webpage using CSS selectors
-2. Exclude or ignore certain page elements
-3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
-4. (Optionally) Handle iframes if your target site embeds additional content
-
-> **Prerequisites**
-> - Youâve read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
-> - You have a working environment for Crawl4AI (Playwright installed, etc.).
-
----
-
-## 1. Targeting Specific Elements with CSS Selectors
-
-### 1.1 Simple CSS Selector Usage
-
-Let's say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-
-async def main():
- browser_cfg = BrowserConfig(headless=True)
- crawler_cfg = CrawlerRunConfig(
- css_selector=".article-body", # Only capture .article-body content
- excluded_tags=["nav", "footer"] # Optional: skip big nav & footer sections
- )
-
- async with AsyncWebCrawler(config=browser_cfg) as crawler:
- result = await crawler.arun(
- url="https://news.example.com/story/12345",
- config=crawler_cfg
- )
- if result.success:
- print("[OK] Extracted content length:", len(result.html))
- else:
- print("[ERROR]", result.error_message)
-
-if __name__ == "__main__":
- asyncio.run(main())
-```
-
-**Key Parameters**:
-- **`css_selector`**: Tells the crawler to focus on `.article-body`.
-- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).
-
-**Tip**: For extremely noisy pages, you can further refine how you exclude certain elements by using `excluded_selector`, which takes a CSS selector you want removed from the final output.
-
-### 1.2 Excluding Content with `excluded_selector`
-
-If you want to remove certain sections within `.article-body` (like ârelated storiesâ sidebars), set:
-
-```python
-CrawlerRunConfig(
- css_selector=".article-body",
- excluded_selector=".related-stories, .ads-banner"
-)
-```
-
-This combination grabs the main article content while filtering out sidebars or ads.
-
----
-
-## 2. Handling Iframes
-
-Some sites embed extra content via `