Update the Tutorial section for new document version

2024-12-31 17:27:31 +08:00
parent fb33a24891
commit 0ec593fa90
85 changed files with 3412 additions and 9152 deletions
--- a/docs/md_v3/tutorials/getting-started.md
+++ b/docs/md_v3/tutorials/getting-started.md
@@ -0,0 +1,265 @@
+# Getting Started with Crawl4AI
+
+Welcome to **Crawl4AI**, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:
+
+1. **Install** Crawl4AI via pip (Docker and advanced setups are covered in the Installation guide).
+2. Run your **first crawl** using minimal configuration.
+3. Generate **Markdown** output (and learn how it’s influenced by content filters).
+4. Experiment with a simple **CSS-based extraction** strategy.
+5. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
+
+---
+
+## 1. Introduction
+
+Crawl4AI provides:
+- An asynchronous crawler, **`AsyncWebCrawler`**.
+- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
+- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports additional filters).
+- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
+
+By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
+
+---
+
+## 2. Installation
+
+### 2.1 Python + Playwright
+
+#### Basic Pip Installation
+
+```bash
+pip install crawl4ai
+crawl4ai-setup
+playwright install --with-deps
+```
+
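+- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
+- **`playwright install --with-deps`** installs the browser binaries together with the operating-system packages they depend on.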
+
+We cover advanced installation and Docker in the [Installation](#installation) section.
+
+---
+
+## 3. Your First Crawl
+
+Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://example.com")
+        print(result.markdown[:300])  # Print first 300 chars
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**What’s happening?**
+- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
+- It fetches `https://example.com`.
+- Crawl4AI automatically converts the HTML into Markdown.
+
+You now have a simple, working crawl!
+
+---
+
+## 4. Basic Configuration (Light Introduction)
+
+Crawl4AI’s crawler can be heavily customized using two main classes:
+
+1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
+2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
+
+Below is an example with minimal usage:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
+
+async def main():
+    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
+    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # skip the cache and fetch a fresh copy
+
+    async with AsyncWebCrawler(config=browser_conf) as crawler:
+        result = await crawler.arun(
+            url="https://example.com",
+            config=run_conf
+        )
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
+
+---
+
+## 5. Generating Markdown Output
+
+By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
+
+- **`result.markdown`**:  
+  The direct HTML-to-Markdown conversion.  
+- **`result.markdown.fit_markdown`**:  
+  The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
+
+### Example: Using a Filter with `DefaultMarkdownGenerator`
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+md_generator = DefaultMarkdownGenerator(
+    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
+)
+
+config = CrawlerRunConfig(markdown_generator=md_generator)
+
+async def main():
+    # "async with" is only valid inside a coroutine, so the crawl lives in main()
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun("https://news.ycombinator.com", config=config)
+        print("Raw Markdown length:", len(result.markdown.raw_markdown))
+        print("Fit Markdown length:", len(result.markdown.fit_markdown))
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
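
The two lengths printed above are easier to compare as a single ratio. The following is a minimal sketch, not part of Crawl4AI: `reduction_ratio` is a hypothetical helper, and the `raw` and `fit` strings are made-up stand-ins for `result.markdown.raw_markdown` and `result.markdown.fit_markdown`.

```python
def reduction_ratio(raw_markdown: str, fit_markdown: str) -> float:
    """Fraction of the raw Markdown's characters that the filtered version dropped."""
    if not raw_markdown:
        return 0.0  # nothing to compare against
    return 1 - len(fit_markdown) / len(raw_markdown)


# Hypothetical sample strings; in a real crawl these come from the result object.
raw = "# Title\n\nMain article text.\n\n[Home](/) | [Login](/login) | [About](/about)\n"
fit = "# Title\n\nMain article text.\n"

print(f"Filter removed {reduction_ratio(raw, fit):.0%} of the raw Markdown")
```

A ratio close to zero suggests the filter kept almost everything, while a ratio close to one suggests the threshold may be too aggressive for that page.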
+
+---
+
+## 6. Simple Data Extraction (CSS-based)
+
+Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def main():
+    schema = {
+        "name": "Example Items",
+        "baseSelector": "div.item",
+        "fields": [
+            {"name": "title", "selector": "h2", "type": "text"},
+            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
+        ]
+    }
+
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/items",
+            config=CrawlerRunConfig(
+                extraction_strategy=JsonCssExtractionStrategy(schema)
+            )
+        )
+        # The JSON output is stored in 'extracted_content'
+        data = json.loads(result.extracted_content)
+        print(data)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Why is this helpful?**
+- Great for repetitive page structures (e.g., item listings, articles).
+- No LLM calls, so no usage costs.
+- The crawler returns a JSON string you can parse or store.
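
As a sketch of the "parse or store" step mentioned above, the snippet below turns the JSON string into CSV text using only the standard library. It assumes the string holds a list of objects keyed by the schema's field names (`title`, `link`). The sample data and the `rows_to_csv` helper are hypothetical and stand in for a real `result.extracted_content`.

```python
import csv
import io
import json

# Hypothetical stand-in for result.extracted_content: a JSON string with
# one object per matched element, keyed by the schema's field names.
extracted_content = json.dumps([
    {"title": "First item", "link": "/items/1"},
    {"title": "Second item", "link": "/items/2"},
])


def rows_to_csv(json_text: str, fieldnames: list[str]) -> str:
    """Parse the JSON string and render its rows as CSV text."""
    rows = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        # Keep only the requested fields; missing ones become empty cells.
        writer.writerow({name: row.get(name, "") for name in fieldnames})
    return buffer.getvalue()


csv_text = rows_to_csv(extracted_content, ["title", "link"])
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of side effects; swapping `io.StringIO()` for an open file handle would store the rows on disk instead.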
+
+---
+
+## 7. Simple Data Extraction (LLM-based)
+
+For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
+
+- **Open-Source Models** (e.g., `ollama/llama3.3`, with `api_token="no_token"`)  
+- **OpenAI Models** (e.g., `openai/gpt-4`, requires a real `api_token`)  
+- Or any provider supported by the underlying library
+
+The example below defines both an **open-source** strategy (no token) and a **closed-source** one, then runs the open-source variant:
+
+```python
+import os
+import json
+import asyncio
+from pydantic import BaseModel, Field
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+class PricingInfo(BaseModel):
+    model_name: str = Field(..., description="Name of the AI model")
+    input_fee: str = Field(..., description="Fee for input tokens")
+    output_fee: str = Field(..., description="Fee for output tokens")
+
+async def main():
+    # 1) Open-Source usage: no token required
+    llm_strategy_open_source = LLMExtractionStrategy(
+        provider="ollama/llama3.3",  # or "any-other-local-model"
+        api_token="no_token",       # for local models, no API key is typically required
+        schema=PricingInfo.model_json_schema(),  # Pydantic v2 name for the deprecated .schema()
+        extraction_type="schema",
+        instruction="""
+            From this page, extract all AI model pricing details in JSON format.
+            Each entry should have 'model_name', 'input_fee', and 'output_fee'.
+        """,
+        temperature=0
+    )
+
+    # 2) Closed-Source usage: API key for OpenAI, for example
+    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
+    llm_strategy_openai = LLMExtractionStrategy(
+        provider="openai/gpt-4",
+        api_token=openai_token,
+        schema=PricingInfo.model_json_schema(),  # Pydantic v2 name for the deprecated .schema()
+        extraction_type="schema",
+        instruction="""
+            From this page, extract all AI model pricing details in JSON format.
+            Each entry should have 'model_name', 'input_fee', and 'output_fee'.
+        """,
+        temperature=0
+    )
+
+    # We'll demo the open-source approach here
+    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)
+
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/pricing",
+            config=config
+        )
+        print("LLM-based extraction JSON:", result.extracted_content)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**What’s happening?**
+- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
+- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.  
+- Depending on the **provider** and **api_token**, you can use local models or a remote API.
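
Because a language model can return incomplete entries, it may be worth checking the JSON before using it. The following is a minimal standard-library sketch, assuming the output is a JSON array (or a single object) with the three `PricingInfo` fields. The `split_valid_entries` helper and the sample data are hypothetical and stand in for a real `result.extracted_content`.

```python
import json

# Field names taken from the PricingInfo model.
REQUIRED_FIELDS = ("model_name", "input_fee", "output_fee")


def split_valid_entries(json_text):
    """Split a JSON array string into (valid, rejected) entries."""
    entries = json.loads(json_text)
    if isinstance(entries, dict):  # tolerate a single object instead of a list
        entries = [entries]
    valid, rejected = [], []
    for entry in entries:
        # An entry is complete when every required field is a non-empty string.
        complete = isinstance(entry, dict) and all(
            isinstance(entry.get(name), str) and entry[name].strip()
            for name in REQUIRED_FIELDS
        )
        (valid if complete else rejected).append(entry)
    return valid, rejected


# Hypothetical stand-in for result.extracted_content.
sample_output = json.dumps([
    {"model_name": "example-model", "input_fee": "$1.00 / 1M tokens", "output_fee": "$2.00 / 1M tokens"},
    {"model_name": "incomplete-model", "input_fee": "$0.50 / 1M tokens"},
])

valid, rejected = split_valid_entries(sample_output)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Keeping the rejected entries rather than discarding them makes it easier to spot when the instruction or the schema needs adjusting.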
+
+---
+
+## 8. Next Steps
+
+Congratulations! You have:
+1. Installed Crawl4AI (via pip, with Docker as an option).
+2. Performed a simple crawl and printed Markdown.
+3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
+4. Experimented with **CSS-based** extraction for repetitive data.
+5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
+
+If you are ready for more, check out:
+
+- **Installation**: Learn more about installing Crawl4AI and setting up Playwright.
+- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
+- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
+- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
+- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
+- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
+
+Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!