ADD MkDocs
docs/md/examples/hooks_auth.md (new file, 96 lines)
# Hooks & Auth

Crawl4AI lets you customize the web crawler's behavior using hooks: functions that are called at specific points in the crawling process so you can modify the crawler's behavior or perform additional actions. This example demonstrates how to use several hooks to customize a crawl.

## Example: Using Crawler Hooks

Let's see how we can customize the crawler using hooks! In this example, we'll:

1. Maximize the browser window and log in to a website when the driver is created.
2. Add a custom header before fetching the URL.
3. Log the current URL after fetching it.
4. Log the length of the HTML before returning it.

### Hook Definitions
```python
def on_driver_created(driver):
    print("[HOOK] on_driver_created")
    # Example customization: maximize the window
    driver.maximize_window()

    # Example customization: log in to a hypothetical website
    driver.get('https://example.com/login')

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.NAME, 'username'))
    )
    driver.find_element(By.NAME, 'username').send_keys('testuser')
    driver.find_element(By.NAME, 'password').send_keys('password123')
    driver.find_element(By.NAME, 'login').click()
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'welcome'))
    )
    # Add a custom cookie
    driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
    return driver


def before_get_url(driver):
    print("[HOOK] before_get_url")
    # Example customization: add a custom header
    # Enable the Network domain so extra headers can be sent
    driver.execute_cdp_cmd('Network.enable', {})
    # Add a custom header
    driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
    return driver


def after_get_url(driver):
    print("[HOOK] after_get_url")
    # Example customization: log the URL
    print(driver.current_url)
    return driver


def before_return_html(driver, html):
    print("[HOOK] before_return_html")
    # Example customization: log the length of the HTML
    print(len(html))
    return driver
```
### Using the Hooks with the WebCrawler

```python
from crawl4ai import WebCrawler

print("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
crawler = WebCrawler(verbose=True)
crawler.warmup()
crawler.set_hook('on_driver_created', on_driver_created)
crawler.set_hook('before_get_url', before_get_url)
crawler.set_hook('after_get_url', after_get_url)
crawler.set_hook('before_return_html', before_return_html)

result = crawler.run(url="https://example.com")

print("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
print(result)
```
### Explanation

- `on_driver_created`: Called when the Selenium driver is created. In this example, it maximizes the window, logs in to a website, and adds a custom cookie.
- `before_get_url`: Called right before Selenium fetches the URL. In this example, it adds a custom HTTP header.
- `after_get_url`: Called after Selenium fetches the URL. In this example, it logs the current URL.
- `before_return_html`: Called before returning the HTML content. In this example, it logs the length of the HTML content.

### Additional Ideas

- **Add custom headers to requests**: You can add custom headers, such as authentication tokens, using the `before_get_url` hook (see the sketch below).
- **Perform safety checks**: Use the hooks to perform safety checks before the crawling process starts.
- **Modify the HTML content**: Use the `before_return_html` hook to modify the HTML content before it is returned.
- **Log additional information**: Use the hooks to log additional information for debugging or monitoring purposes.
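For instance, here is a minimal sketch of an authentication-flavored `before_get_url` hook. It reuses the CDP calls shown above; the `MY_SITE_TOKEN` environment variable and the `Authorization` header value are illustrative placeholders, not part of Crawl4AI itself.

```python
import os

def before_get_url(driver):
    print("[HOOK] before_get_url")
    # Illustrative: read a token from the environment and send it as a bearer header.
    # Replace MY_SITE_TOKEN and the header value with whatever your target site expects.
    token = os.getenv("MY_SITE_TOKEN", "")
    driver.execute_cdp_cmd('Network.enable', {})
    driver.execute_cdp_cmd(
        'Network.setExtraHTTPHeaders',
        {'headers': {'Authorization': f'Bearer {token}'}}
    )
    return driver
```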
By using these hooks, you can customize the behavior of the crawler to suit your specific needs.
docs/md/examples/index.md (new file, 29 lines)
# Examples

Welcome to the examples section of the Crawl4AI documentation! In this section, you will find practical examples demonstrating how to use Crawl4AI for various web crawling and data extraction tasks. Each example is designed to showcase different features and capabilities of the library.

## Examples Index

### [LLM Extraction](llm_extraction.md)

This example demonstrates how to use Crawl4AI to extract information using Large Language Models (LLMs). You will learn how to configure the `LLMExtractionStrategy` to get structured data from web pages.

### [JS Execution & CSS Filtering](js_execution_css_filtering.md)

Learn how to execute custom JavaScript code and filter data using CSS selectors. This example shows how to perform complex web interactions and extract specific content from web pages.

### [Hooks & Auth](hooks_auth.md)

This example covers the use of custom hooks for authentication and other pre-crawling tasks. You will see how to set up hooks to modify headers, authenticate sessions, and perform other preparatory actions before crawling.

### [Summarization](summarization.md)

Discover how to use Crawl4AI to summarize web page content. This example demonstrates the summarization capabilities of the library, helping you extract concise information from lengthy web pages.

### [Research Assistant](research_assistant.md)

In this example, Crawl4AI is used as a research assistant to gather and organize information from multiple sources. You will learn how to use various extraction and chunking strategies to compile a comprehensive report.

---

Each example includes detailed explanations and code snippets to help you understand and implement the features in your projects. Click on the links to explore each example and start making the most of Crawl4AI!
docs/md/examples/js_execution_css_filtering.md (new file, 44 lines)
# JS Execution & CSS Filtering

In this example, we'll demonstrate how to use Crawl4AI to execute JavaScript, filter data with CSS selectors, and use a cosine similarity strategy to extract relevant content. This approach is particularly useful when you need to interact with dynamic content on web pages, such as clicking "Load More" buttons.

## Example: Extracting Structured Data
```python
# Import necessary modules
from crawl4ai import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *

# Define the JavaScript code to click the "Load More" button
js_code = ["""
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""]

crawler = WebCrawler(verbose=True)
crawler.warmup()

# Run the crawler with keyword filtering and CSS selector
result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(
        semantic_filter="technology",
    ),
)

# Display the extracted result
print(result)
```
### Explanation

1. **JavaScript Execution**: The `js_code` variable contains JavaScript code that simulates clicking a "Load More" button. This is useful for loading additional content dynamically.
2. **CSS Selector**: The `css_selector="p"` parameter ensures that only paragraph (`<p>`) tags are extracted from the web page.
3. **Extraction Strategy**: The `CosineStrategy` is used with a semantic filter for "technology" to extract relevant content based on cosine similarity.

## Try It Yourself

This example demonstrates the power and flexibility of Crawl4AI in handling complex web interactions and extracting meaningful data. You can customize the JavaScript code, CSS selectors, and extraction strategies to suit your specific requirements.
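As one possible variation, here is a sketch that reuses the same `crawler.run()` parameters shown above but scrolls the page to trigger lazy-loaded content and narrows the selector to paragraphs inside articles. The scroll snippet and the `article p` selector are illustrative choices, not requirements of the library.

```python
# Illustrative variation: scroll to the bottom to trigger lazy loading,
# then extract paragraphs inside article elements instead of all paragraphs.
js_scroll = ["window.scrollTo(0, document.body.scrollHeight);"]

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_scroll,
    css_selector="article p",
    extraction_strategy=CosineStrategy(
        semantic_filter="technology",
    ),
)
print(result)
```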
docs/md/examples/llm_extraction.md (new file, 90 lines)
# LLM Extraction

Crawl4AI allows you to use Large Language Models (LLMs) to extract structured data or relevant content from web pages. Below are two examples demonstrating how to use `LLMExtractionStrategy` for different purposes.

## Example 1: Extract Structured Data

In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.
```python
import os
import time
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
from pydantic import BaseModel, Field

url = 'https://openai.com/api/pricing/'

crawler = WebCrawler()
crawler.warmup()

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="From the crawled content, extract all mentioned model names along with their "
                    "fees for input and output tokens. Make sure not to miss anything in the entire content. "
                    'One extracted model JSON format should look like this: '
                    '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
    ),
    bypass_cache=True,
)

model_fees = json.loads(result.extracted_content)
print(len(model_fees))

with open(".data/data.json", "w") as f:
    f.write(result.extracted_content)
```
## Example 2: Extract Relevant Content

In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.
```python
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="Extract only content related to technology"
    ),
    bypass_cache=True,
)

# The extracted content is a JSON string; load it to inspect the results
tech_content = json.loads(result.extracted_content)
print(len(tech_content))

with open(".data/data.json", "w") as f:
    f.write(result.extracted_content)
```
## Customizing LLM Provider

Under the hood, Crawl4AI uses the `litellm` library, which allows you to use any LLM provider you want. Just pass the correct model name and API token.

```python
extraction_strategy=LLMExtractionStrategy(
    provider="your_llm_provider/model_name",
    api_token="your_api_token",
    instruction="Your extraction instruction"
)
```
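For example, switching the same strategy to a Groq-hosted model might look like the sketch below. The exact model identifier and the `GROQ_API_KEY` environment variable name are illustrative; check them against litellm's and your provider's documentation.

```python
import os

extraction_strategy = LLMExtractionStrategy(
    # provider is passed through to litellm as "<provider>/<model_name>"
    provider="groq/llama3-70b-8192",        # illustrative model identifier
    api_token=os.getenv("GROQ_API_KEY"),    # illustrative environment variable
    instruction="Extract only content related to technology"
)
```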
This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.
docs/md/examples/research_assistant.md (new file, 248 lines)
## Research Assistant Example

This example demonstrates how to build a research assistant using `Chainlit` and `Crawl4AI`. The assistant can crawl web pages for information and answer questions based on the crawled content. It also integrates speech-to-text functionality for audio inputs.

### Step-by-Step Guide

1. **Install Required Packages**

    Ensure you have the necessary packages installed. You need `chainlit`, `groq`, `requests`, and `openai`.

    ```bash
    pip install chainlit groq requests openai
    ```
2. **Import Libraries**

    Import all the necessary modules and initialize the OpenAI client.

    ```python
    import os
    import time
    import re
    import requests
    from io import BytesIO
    from concurrent.futures import ThreadPoolExecutor

    from openai import AsyncOpenAI
    import chainlit as cl
    from chainlit.element import ElementBased
    from groq import Groq

    # Point the OpenAI-compatible client at Groq's API
    client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))

    # Instrument the OpenAI client so Chainlit can trace LLM calls
    cl.instrument_openai()
    ```
3. **Set Configuration**

    Define the model settings for the assistant.

    ```python
    settings = {
        "model": "llama3-8b-8192",
        "temperature": 0.5,
        "max_tokens": 500,
        "top_p": 1,
        "frequency_penalty": 0,
        "presence_penalty": 0,
    }
    ```
4. **Define Utility Functions**

    - **Extract URLs from Text**: Use a regex to find URLs in messages.

    ```python
    def extract_urls(text):
        url_pattern = re.compile(r'(https?://\S+)')
        return url_pattern.findall(text)
    ```

    - **Crawl URL**: Send a request to the `Crawl4AI` service to fetch the content of a URL as markdown.

    ```python
    def crawl_url(url):
        data = {
            "urls": [url],
            "include_raw_html": True,
            "word_count_threshold": 10,
            "extraction_strategy": "NoExtractionStrategy",
            "chunking_strategy": "RegexChunking"
        }
        response = requests.post("https://crawl4ai.com/crawl", json=data)
        response_data = response.json()
        response_data = response_data['results'][0]
        return response_data['markdown']
    ```
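    As a quick, illustrative check of these helpers (note that `crawl_url` performs a real HTTP request to the hosted Crawl4AI endpoint, so it needs network access; the URLs below are placeholders):

    ```python
    # Hypothetical usage of the two helpers above
    urls = extract_urls("Compare https://example.com and https://www.python.org for me")
    print(urls)            # ['https://example.com', 'https://www.python.org']

    page_md = crawl_url(urls[0])
    print(page_md[:200])   # first 200 characters of the returned markdown
    ```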
5. **Initialize Chat Start Event**

    Set up the initial chat message and user session.

    ```python
    @cl.on_chat_start
    async def on_chat_start():
        cl.user_session.set("session", {
            "history": [],
            "context": {}
        })
        await cl.Message(
            content="Welcome to the chat! How can I assist you today?"
        ).send()
    ```
6. **Handle Incoming Messages**

    Process user messages, extract URLs, and crawl them concurrently. Update the chat history and the system message.

    ```python
    @cl.on_message
    async def on_message(message: cl.Message):
        user_session = cl.user_session.get("session")

        # Extract URLs from the user's message and crawl them concurrently
        urls = extract_urls(message.content)

        futures = []
        with ThreadPoolExecutor() as executor:
            for url in urls:
                futures.append(executor.submit(crawl_url, url))

        results = [future.result() for future in futures]

        # Store each crawled page in the session context under a REF number
        for url, result in zip(urls, results):
            ref_number = f"REF_{len(user_session['context']) + 1}"
            user_session["context"][ref_number] = {
                "url": url,
                "content": result
            }

        user_session["history"].append({
            "role": "user",
            "content": message.content
        })

        # Create a system message that includes the crawled context
        context_messages = [
            f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
            for ref, data in user_session["context"].items()
        ]
        if context_messages:
            system_message = {
                "role": "system",
                "content": (
                    "You are a helpful bot. Use the following context for answering questions. "
                    "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
                    "If the question requires any information from the provided appendices or context, refer to the sources. "
                    "If not, there is no need to add a references section. "
                    "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                    + "\n\n".join(context_messages)
                )
            }
        else:
            system_message = {
                "role": "system",
                "content": "You are a helpful assistant."
            }

        msg = cl.Message(content="")
        await msg.send()

        # Stream the response from the LLM
        stream = await client.chat.completions.create(
            messages=[
                system_message,
                *user_session["history"]
            ],
            stream=True,
            **settings
        )

        assistant_response = ""
        async for part in stream:
            if token := part.choices[0].delta.content:
                assistant_response += token
                await msg.stream_token(token)

        # Add the assistant message to the history
        user_session["history"].append({
            "role": "assistant",
            "content": assistant_response
        })
        await msg.update()

        # Append the reference section to the assistant's response
        reference_section = "\n\nReferences:\n"
        for ref, data in user_session["context"].items():
            reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"

        msg.content += reference_section
        await msg.update()
    ```
7. **Handle Audio Input**

    Capture and buffer audio input, then transcribe it when the audio ends.

    ```python
    @cl.on_audio_chunk
    async def on_audio_chunk(chunk: cl.AudioChunk):
        if chunk.isStart:
            buffer = BytesIO()
            buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
            cl.user_session.set("audio_buffer", buffer)
            cl.user_session.set("audio_mime_type", chunk.mimeType)

        cl.user_session.get("audio_buffer").write(chunk.data)


    @cl.step(type="tool")
    async def speech_to_text(audio_file):
        # The Groq-backed client defined earlier handles the transcription request
        response = await client.audio.transcriptions.create(
            model="whisper-large-v3", file=audio_file
        )
        return response.text


    @cl.on_audio_end
    async def on_audio_end(elements: list[ElementBased]):
        audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
        audio_buffer.seek(0)
        audio_file = audio_buffer.read()
        audio_mime_type: str = cl.user_session.get("audio_mime_type")

        start_time = time.time()
        transcription = await speech_to_text((audio_buffer.name, audio_file, audio_mime_type))
        end_time = time.time()
        print(f"Transcription took {end_time - start_time} seconds")

        user_msg = cl.Message(
            author="You",
            type="user_message",
            content=transcription
        )
        await user_msg.send()
        await on_message(user_msg)
    ```
8. **Run the Chat Application**

    Start the Chainlit application.

    ```python
    if __name__ == "__main__":
        from chainlit.cli import run_chainlit
        run_chainlit(__file__)
    ```
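    Assuming the script is saved as `research_assistant.py` (the filename is just an example), it can be launched either directly with Python or with the Chainlit CLI:

    ```bash
    # Run the script directly (run_chainlit starts the server)
    python research_assistant.py

    # Or use the Chainlit CLI; -w reloads the app on file changes during development
    chainlit run research_assistant.py -w
    ```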
### Explanation

- **Libraries and Configuration**: Import the necessary libraries and configure the OpenAI-compatible client.
- **Utility Functions**: Define functions to extract URLs and crawl them.
- **Chat Start Event**: Initialize the chat session and send a welcome message.
- **Message Handling**: Extract URLs, crawl them concurrently, and update the chat history and context.
- **Audio Handling**: Capture, buffer, and transcribe audio input, then process the transcription as text.
- **Running the Application**: Start the Chainlit server to interact with the assistant.

This example showcases how to create an interactive research assistant that can fetch, process, and summarize web content, while also handling audio inputs for a seamless user experience.
docs/md/examples/summarization.md (new file, 108 lines)
## Summarization Example

This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.

### Step-by-Step Guide

1. **Import Necessary Modules**

    First, import the necessary modules and classes.

    ```python
    import os
    import time
    import json
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    from crawl4ai.crawler_strategy import *
    from pydantic import BaseModel, Field
    ```
2. **Define the URL to be Crawled**

    Set the URL of the web page you want to summarize.

    ```python
    url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
    ```

3. **Initialize the WebCrawler**

    Create an instance of the `WebCrawler` and call the `warmup` method.

    ```python
    crawler = WebCrawler()
    crawler.warmup()
    ```
4. **Define the Data Model**

    Use Pydantic to define the structure of the extracted data.

    ```python
    class PageSummary(BaseModel):
        title: str = Field(..., description="Title of the page.")
        summary: str = Field(..., description="Summary of the page.")
        brief_summary: str = Field(..., description="Brief summary of the page.")
        keywords: list = Field(..., description="Keywords assigned to the page.")
    ```
5. **Run the Crawler**

    Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.

    ```python
    result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            schema=PageSummary.model_json_schema(),
            extraction_type="schema",
            apply_chunking=False,
            instruction=(
                "From the crawled content, extract the following details: "
                "1. Title of the page "
                "2. Summary of the page, which is a detailed summary "
                "3. Brief summary of the page, which is a paragraph text "
                "4. Keywords assigned to the page, which is a list of keywords. "
                'The extracted JSON format should look like this: '
                '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
                '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
            )
        ),
        bypass_cache=True,
    )
    ```
6. **Process the Extracted Data**

    Load the extracted content into a JSON object and print it.

    ```python
    page_summary = json.loads(result.extracted_content)
    print(page_summary)
    ```
7. **Save the Extracted Data**

    Save the extracted data to a file for further use.

    ```python
    with open(".data/page_summary.json", "w") as f:
        f.write(result.extracted_content)
    ```
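    Optionally, you can validate the extracted data against the `PageSummary` model. This is an illustrative sketch: depending on the extraction strategy's output, `extracted_content` may decode to a single object or to a list of objects, so both cases are handled.

    ```python
    # Illustrative validation step (assumes the JSON matches the PageSummary fields)
    data = json.loads(result.extracted_content)
    items = data if isinstance(data, list) else [data]
    summaries = [PageSummary(**item) for item in items]
    print(summaries[0].title)
    print(summaries[0].keywords)
    ```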
### Explanation

- **Importing Modules**: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from `Crawl4AI`.
- **URL Definition**: Set the URL of the web page you want to crawl and summarize.
- **WebCrawler Initialization**: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
- **Data Model Definition**: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
- **Crawler Execution**: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
- **Data Processing**: Load the extracted content into a JSON object and print it to verify the results.
- **Data Saving**: Save the extracted data to a file for further use.

This example demonstrates how to harness the power of `Crawl4AI` to perform advanced web crawling and data extraction tasks with minimal code.