chore: Update README, generate new notebook for quickstart

2024-09-04 14:46:22 +08:00
parent 2fada16abb
commit 5c15837677
6 changed files with 939 additions and 239 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# Crawl4AI v0.2.77 🕷️🤖
+# Crawl4AI Async Version 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
@@ -6,34 +6,22 @@
 [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
-Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
-#### [v0.2.77] - 2024-08-02
+> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md).
 Major improvements in functionality, performance, and cross-platform compatibility! 🚀
 - 🐳 **Docker enhancements**:
  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
 - 🌐 **Official Docker Hub image**:
  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
 - 🔧 **Selenium upgrade**:
  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
 - 🖼️ **Image description**:
  - Implemented ability to generate textual descriptions for extracted images from web pages.
 - ⚡ **Performance boost**:
  - Various improvements to enhance overall speed and performance.
 ## Try it Now!
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
-✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
+✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
-✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
+✨ Check out the [Demo](https://crawl4ai.com/mkdocs/demo)
 ## Features ✨
 - 🆓 Completely free and open-source
 - 🚀 Blazing fast performance, outperforming many paid services
 - 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
 - 🌍 Supports crawling multiple URLs simultaneously
 - 🎨 Extracts and returns all media tags (Images, Audio, and Video)
@@ -43,44 +31,17 @@ Major improvements in functionality, performance, and cross-platform compatibili
 - 🕵️ User-agent customization
 - 🖼️ Takes screenshots of the page
 - 📜 Executes multiple custom JavaScripts before crawling
 - 📊 Generates structured output without LLM using JsonCssExtractionStrategy
 - 📚 Various chunking strategies: topic-based, regex, sentence, and more
 - 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
+- 🎯 CSS selector support for precise data extraction
 - 📝 Passes instructions/keywords to refine extraction
 - 🔒 Proxy support for enhanced privacy and access
 - 🔄 Session management for complex multi-page crawling scenarios
 - 🌐 Asynchronous architecture for improved performance and scalability
 # Crawl4AI
-## 🌟 Shoutout to Contributors of v0.2.77!
+## Installation 🛠️
 A big thank you to the amazing contributors who've made this release possible:
 - [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
 - [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
 - [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
 Your contributions are driving Crawl4AI forward! 🚀
 ## Cool Examples 🚀
 ### Quick Start
 ```python
 from crawl4ai import WebCrawler
 # Create an instance of WebCrawler
 crawler = WebCrawler()
 # Warm up the crawler (load necessary models)
 crawler.warmup()
 # Run the crawler on a URL
 result = crawler.run(url="https://www.nbcnews.com/business")
 # Print the extracted content
 print(result.markdown)
 ```
 ## How to install 🛠 
 ### Using pip 🐍
 ```bash
@@ -105,55 +66,80 @@ docker pull unclecode/crawl4ai:latest
 docker run -d -p 8000:80 unclecode/crawl4ai:latest
 ```
-
+## Quick Start 🚀
 ## Speed-First Design 🚀
 Speed is perhaps the most important design principle for this library: it needs to handle many links and resources in parallel as quickly as possible. Combined with a fast LLM inference service such as Groq, this keeps the whole crawl-and-extract pipeline quick.
 ```python
-import time
+import asyncio
-from crawl4ai.web_crawler import WebCrawler
+from crawl4ai import AsyncWebCrawler
 crawler = WebCrawler()
 crawler.warmup()
-start = time.time()
+async def main():
-url = r"https://www.nbcnews.com/business"
+    async with AsyncWebCrawler(verbose=True) as crawler:
-result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
-end = time.time()
+        print(result.markdown)
-print(f"Time taken: {end - start}")
+
 if __name__ == "__main__":
    asyncio.run(main())
 ```
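The parallel crawling mentioned above can be sketched with `asyncio.gather` and a semaphore. This is a minimal sketch, not the library's own batching API: `FakeCrawler` is a hypothetical stand-in that only mimics the `arun(url=...)` coroutine from the quick start, `crawl_many` and `max_concurrency` are illustrative names, and whether a single `AsyncWebCrawler` instance can safely serve overlapping `arun` calls is an assumption you should verify.

```python
import asyncio


class FakeCrawler:
    """Hypothetical stand-in that mimics the `arun(url=...)` coroutine."""

    async def arun(self, url):
        await asyncio.sleep(0.01)  # simulate network latency
        return f"markdown for {url}"


async def crawl_many(crawler, urls, max_concurrency=5):
    """Crawl `urls` concurrently with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_one(url):
        async with semaphore:
            return await crawler.arun(url=url)

    # gather keeps the input order; failures are returned as exception objects
    return await asyncio.gather(
        *(crawl_one(url) for url in urls), return_exceptions=True
    )


async def main():
    urls = [f"https://example.com/page/{i}" for i in range(3)]
    results = await crawl_many(FakeCrawler(), urls, max_concurrency=2)
    for url, result in zip(urls, results):
        print(url, "->", result)


if __name__ == "__main__":
    asyncio.run(main())
```

Capping concurrency with a semaphore keeps the number of open pages bounded, which tends to matter more than raw parallelism once a browser is involved.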
-Let's take a look the calculated time for the above code snippet:
+## Advanced Usage 🔬
-```bash
+### Executing JavaScript and Using CSS Selectors
-[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
+
-[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
+```python
-[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
+import asyncio
-Time taken: 1.439958095550537
+from crawl4ai import AsyncWebCrawler
 async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            css_selector="article.tease-card",
            bypass_cache=True
        )
        print(result.extracted_content)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
-### Extract Structured Data from Web Pages 📊
+### Using a Proxy
-Crawl all OpenAI models and their fees from the official page.
+```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True
        )
        print(result.markdown)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ### Extracting Structured Data with OpenAI
 ```python
 import os
-from crawl4ai import WebCrawler
+import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 from pydantic import BaseModel, Field
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(..., description="Fee for output token ßfor the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
-url = 'https://openai.com/api/pricing/'
+async def main():
-crawler = WebCrawler()
+    async with AsyncWebCrawler(verbose=True) as crawler:
-crawler.warmup()
+        result = await crawler.arun(
-
+            url='https://openai.com/api/pricing/',
 result = crawler.run(
        url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
@@ -165,58 +151,179 @@ result = crawler.run(
            ),            
            bypass_cache=True,
        )
        print(result.extracted_content)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
-### Execute JS, Filter Data with CSS Selector, and Clustering
+### Advanced Multi-Page Crawling with JavaScript Execution
 Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:
 ```python
-from crawl4ai import WebCrawler
+import asyncio
-from crawl4ai.chunking_strategy import CosineStrategy
+import re
 from bs4 import BeautifulSoup
 from crawl4ai import AsyncWebCrawler
-js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
+async def crawl_typescript_commits():
    first_commit = ""
    async def on_execution_started(page):
        nonlocal first_commit 
        try:
            while True:
                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
                commit = await commit.evaluate('(element) => element.textContent')
                commit = re.sub(r'\s+', '', commit)
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
-crawler = WebCrawler()
+    async with AsyncWebCrawler(verbose=True) as crawler:
-crawler.warmup()
+        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
-result = crawler.run(
+        url = "https://github.com/microsoft/TypeScript/commits/main"
-    url="https://www.nbcnews.com/business",
+        session_id = "typescript_commits_session"
-    js=js_code,
+        all_commits = []
-    css_selector="p",
+
-    extraction_strategy=CosineStrategy(semantic_filter="technology")
+        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """
        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
                bypass_cache=True,
                js_only=page > 0
            )
-print(result.extracted_content)
+            assert result.success, f"Failed to crawl page {page + 1}"
            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
            commits = soup.select("li")
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 if __name__ == "__main__":
    asyncio.run(crawl_typescript_commits())
 ```
-### Extract Structured Data from Web Pages With Proxy and BaseUrl
+This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
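The hook in the example polls until the first commit title changes, and it would loop indefinitely if the title never changed. A minimal sketch of the same polling idea with an upper time bound is shown here; `wait_for_change` and `read_value` are hypothetical names, and the reader is a plain coroutine rather than a Playwright page so the snippet runs on its own.

```python
import asyncio


async def wait_for_change(read_value, previous, timeout=5.0, interval=0.5):
    """Poll `read_value()` until it yields a non-empty value other than `previous`.

    Returns the new value, or None if `timeout` seconds pass without a change.
    """

    async def poll():
        while True:
            value = await read_value()
            if value and value != previous:
                return value
            await asyncio.sleep(interval)

    try:
        return await asyncio.wait_for(poll(), timeout=timeout)
    except asyncio.TimeoutError:
        return None


async def demo():
    # Hypothetical reader: the old title is seen twice before the new one appears.
    titles = iter(["old commit", "old commit", "new commit"])

    async def read_value():
        return next(titles)

    new_title = await wait_for_change(
        read_value, "old commit", timeout=1.0, interval=0.01
    )
    print(new_title)


if __name__ == "__main__":
    asyncio.run(demo())
```

Inside a real hook, `read_value` could wrap the `page.query_selector(...)` and `evaluate(...)` calls from the example, and a `None` result would be the signal to log a warning and move on.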
 ### Using JsonCssExtractionStrategy
 The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
 ```python
-from crawl4ai import WebCrawler
+import asyncio
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
+import json
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-def create_crawler():
+async def extract_news_teasers():
-    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
+    schema = {
-    crawler.warmup()
+        "name": "News Teaser Extractor",
-    return crawler
+        "baseSelector": ".wide-tease-item__wrapper",
        "fields": [
            {
                "name": "category",
                "selector": ".unibrow span[data-testid='unibrow-text']",
                "type": "text",
            },
            {
                "name": "headline",
                "selector": ".wide-tease-item__headline",
                "type": "text",
            },
            {
                "name": "summary",
                "selector": ".wide-tease-item__description",
                "type": "text",
            },
            {
                "name": "time",
                "selector": "[data-testid='wide-tease-date']",
                "type": "text",
            },
            {
                "name": "image",
                "type": "nested",
                "selector": "picture.teasePicture img",
                "fields": [
                    {"name": "src", "type": "attribute", "attribute": "src"},
                    {"name": "alt", "type": "attribute", "attribute": "alt"},
                ],
            },
            {
                "name": "link",
                "selector": "a[href]",
                "type": "attribute",
                "attribute": "href",
            },
        ],
    }
-crawler = create_crawler()
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
-crawler.warmup()
+    async with AsyncWebCrawler(verbose=True) as crawler:
-
+        result = await crawler.arun(
 result = crawler.run(
            url="https://www.nbcnews.com/business",
-    extraction_strategy=LLMExtractionStrategy(
+            extraction_strategy=extraction_strategy,
-        provider="openai/gpt-4o",
+            bypass_cache=True,
        api_token="sk-",
        base_url="https://api.openai.com/v1"
    )
        )
-print(result.markdown)
+        assert result.success, "Failed to crawl the page"
        news_teasers = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(news_teasers)} news teasers")
        print(json.dumps(news_teasers[0], indent=2))
 if __name__ == "__main__":
    asyncio.run(extract_news_teasers())
 ```
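Because the schema is a plain dictionary, a typo in it may only surface as empty results. The following is a rough sketch of a pre-flight check, assuming only the keys and field types that appear in the schema above (`text`, `attribute`, `nested`); `validate_schema` and `validate_fields` are hypothetical helpers, not part of Crawl4AI, and the library may accept field types this check does not know about.

```python
# Only the field types used in the example schema; an assumption, not the full set.
KNOWN_TYPES = {"text", "attribute", "nested"}


def validate_fields(fields, path="fields"):
    """Return a list of human-readable problems found in a list of field specs."""
    problems = []
    for index, field in enumerate(fields):
        where = f"{path}[{index}]"
        if "name" not in field:
            problems.append(f"{where}: missing 'name'")
        field_type = field.get("type")
        if field_type not in KNOWN_TYPES:
            problems.append(f"{where}: unknown type {field_type!r}")
        elif field_type == "attribute" and "attribute" not in field:
            problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
        elif field_type == "nested":
            if "fields" not in field:
                problems.append(f"{where}: type 'nested' needs a 'fields' list")
            else:
                problems.extend(validate_fields(field["fields"], f"{where}.fields"))
    return problems


def validate_schema(schema):
    """Check the top-level keys, then every field (including nested ones)."""
    problems = [
        f"missing top-level key {key!r}"
        for key in ("name", "baseSelector", "fields")
        if key not in schema
    ]
    problems.extend(validate_fields(schema.get("fields", [])))
    return problems


schema = {
    "name": "News Teaser Extractor",
    "baseSelector": ".wide-tease-item__wrapper",
    "fields": [
        {"name": "headline", "selector": ".wide-tease-item__headline", "type": "text"},
        {
            "name": "image",
            "type": "nested",
            "selector": "picture.teasePicture img",
            "fields": [{"name": "src", "type": "attribute", "attribute": "src"}],
        },
    ],
}

print(validate_schema(schema) or "schema looks consistent")
```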
 ## Speed Comparison 🚀
 Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
 We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
 ```
 Firecrawl:
 Time taken: 7.02 seconds
 Content length: 42074 characters
 Images found: 49
 Crawl4AI (simple crawl):
 Time taken: 1.60 seconds
 Content length: 18238 characters
 Images found: 49
 Crawl4AI (with JavaScript execution):
 Time taken: 4.64 seconds
 Content length: 40869 characters
 Images found: 89
 ```
 As you can see, Crawl4AI outperforms Firecrawl significantly:
 - Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
 - With JavaScript execution: Even when executing JavaScript to load more content (nearly doubling the number of images found, 89 vs. 49), Crawl4AI is still faster than Firecrawl's simple crawl.
 You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.
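The ratios behind these statements follow directly from the figures in the output block above. A small worked calculation, using only the numbers reported there:

```python
# Timings (seconds) and image counts copied from the comparison output above.
firecrawl_time = 7.02
crawl4ai_simple_time = 1.60
crawl4ai_js_time = 4.64
simple_images = 49
js_images = 89

simple_speedup = firecrawl_time / crawl4ai_simple_time
js_speedup = firecrawl_time / crawl4ai_js_time
image_ratio = js_images / simple_images

print(f"Simple crawl speedup: {simple_speedup:.1f}x")          # 4.4x
print(f"JavaScript crawl speedup: {js_speedup:.1f}x")          # 1.5x
print(f"Images found with JavaScript: {image_ratio:.1f}x")     # 1.8x
```

These are single measurements from one run, so the ratios are indicative rather than a benchmark.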
 ## Documentation 📚
 For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
--- a/README.sync.md
+++ b/README.sync.md
@@ -0,0 +1,244 @@
 # Crawl4AI v0.2.77 🕷️🤖
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
 [![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
 [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
 Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
 #### [v0.2.77] - 2024-08-02
 Major improvements in functionality, performance, and cross-platform compatibility! 🚀
 - 🐳 **Docker enhancements**:
  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
 - 🌐 **Official Docker Hub image**:
  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
 - 🔧 **Selenium upgrade**:
  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
 - 🖼️ **Image description**:
  - Implemented ability to generate textual descriptions for extracted images from web pages.
 - ⚡ **Performance boost**:
  - Various improvements to enhance overall speed and performance.
 ## Try it Now!
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
 ✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
 ✨ Check out the [Demo](https://crawl4ai.com/mkdocs/demo)
 ## Features ✨
 - 🆓 Completely free and open-source
 - 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
 - 🌍 Supports crawling multiple URLs simultaneously
 - 🎨 Extracts and returns all media tags (Images, Audio, and Video)
 - 🔗 Extracts all external and internal links
 - 📚 Extracts metadata from the page
 - 🔄 Custom hooks for authentication, headers, and page modifications before crawling
 - 🕵️ User-agent customization
 - 🖼️ Takes screenshots of the page
 - 📜 Executes multiple custom JavaScripts before crawling
 - 📚 Various chunking strategies: topic-based, regex, sentence, and more
 - 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
 - 🎯 CSS selector support
 - 📝 Passes instructions/keywords to refine extraction
 # Crawl4AI
 ## 🌟 Shoutout to Contributors of v0.2.77!
 A big thank you to the amazing contributors who've made this release possible:
 - [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
 - [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
 - [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
 Your contributions are driving Crawl4AI forward! 🚀
 ## Cool Examples 🚀
 ### Quick Start
 ```python
 from crawl4ai import WebCrawler
 # Create an instance of WebCrawler
 crawler = WebCrawler()
 # Warm up the crawler (load necessary models)
 crawler.warmup()
 # Run the crawler on a URL
 result = crawler.run(url="https://www.nbcnews.com/business")
 # Print the extracted content
 print(result.markdown)
 ```
 ## How to install 🛠 
 ### Using pip 🐍
 ```bash
 virtualenv venv
 source venv/bin/activate
 pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
 ```
 ### Using Docker 🐳
 ```bash
 # For Mac users (M1/M2)
 # docker build --platform linux/amd64 -t crawl4ai .
 docker build -t crawl4ai .
 docker run -d -p 8000:80 crawl4ai
 ```
 ### Using Docker Hub 🐳
 ```bash
 docker pull unclecode/crawl4ai:latest
 docker run -d -p 8000:80 unclecode/crawl4ai:latest
 ```
 ## Speed-First Design 🚀
 Speed is perhaps the most important design principle for this library: it needs to handle many links and resources in parallel as quickly as possible. Combined with a fast LLM inference service such as Groq, this keeps the whole crawl-and-extract pipeline quick.
 ```python
 import time
 from crawl4ai.web_crawler import WebCrawler
 crawler = WebCrawler()
 crawler.warmup()
 start = time.time()
 url = r"https://www.nbcnews.com/business"
 result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
 end = time.time()
 print(f"Time taken: {end - start}")
 ```
 Let's take a look at the calculated time for the above code snippet:
 ```bash
 [LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
 [LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
 [LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
 Time taken: 1.439958095550537
 ```
 Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
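If you want to time several stages without repeating the `start`/`end` bookkeeping, a small context manager is one option. This is a sketch using only the standard library; `timed` and `timings` are hypothetical names, and `time.sleep` stands in for the `crawler.run(...)` call so the snippet runs without a browser.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=None):
    """Measure the wall-clock time of the enclosed block with a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if sink is not None:
            sink[label] = elapsed  # keep the raw value for later comparison
        print(f"{label}: {elapsed:.4f} seconds")


timings = {}
with timed("crawl", timings):
    time.sleep(0.05)  # stand-in for crawler.run(url, bypass_cache=True)
```

`time.perf_counter()` is preferred over `time.time()` for intervals because it is monotonic and has higher resolution.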
 ### Extract Structured Data from Web Pages 📊
 Crawl all OpenAI models and their fees from the official page.
 ```python
 import os
 from crawl4ai import WebCrawler
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 from pydantic import BaseModel, Field
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
 url = 'https://openai.com/api/pricing/'
 crawler = WebCrawler()
 crawler.warmup()
 result = crawler.run(
        url=url,
        word_count_threshold=1,
        extraction_strategy= LLMExtractionStrategy(
            provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),            
        bypass_cache=True,
    )
 print(result.extracted_content)
 ```
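The fees come back as strings, so comparing models usually needs a numeric form. Below is a minimal sketch that parses the fee format quoted in the instruction above (`"US$10.00 / 1M tokens"`). It assumes `result.extracted_content` is a JSON array of objects with the three schema keys; `ParsedFee`, `fee_per_million`, and `parse_fees` are hypothetical helpers, and the sample record is the one from the instruction text, not real crawl output.

```python
import json
import re
from dataclasses import dataclass

# Matches strings such as "US$10.00 / 1M tokens" or "$0.50 / 1K tokens".
FEE_PATTERN = re.compile(
    r"\$\s*([0-9]+(?:\.[0-9]+)?)\s*/\s*([0-9]+)\s*([KM])\s+tokens", re.IGNORECASE
)
UNIT_SIZE = {"K": 1_000, "M": 1_000_000}


@dataclass
class ParsedFee:
    model_name: str
    input_usd_per_million: float
    output_usd_per_million: float


def fee_per_million(text):
    """Convert a fee string into US dollars per one million tokens."""
    match = FEE_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unrecognised fee format: {text!r}")
    amount, count, unit = match.groups()
    tokens = int(count) * UNIT_SIZE[unit.upper()]
    return float(amount) * 1_000_000 / tokens


def parse_fees(extracted_content):
    """Parse the JSON string of extracted records into ParsedFee objects."""
    records = json.loads(extracted_content)
    return [
        ParsedFee(
            record["model_name"],
            fee_per_million(record["input_fee"]),
            fee_per_million(record["output_fee"]),
        )
        for record in records
    ]


# Sample record taken from the instruction text above, not from a real crawl.
sample = json.dumps(
    [
        {
            "model_name": "GPT-4",
            "input_fee": "US$10.00 / 1M tokens",
            "output_fee": "US$30.00 / 1M tokens",
        }
    ]
)
for fee in parse_fees(sample):
    print(fee)
```

Since the LLM decides the exact wording of each fee, the `ValueError` branch is worth keeping: it flags records that drift from the requested format instead of silently skipping them.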
 ### Execute JS, Filter Data with CSS Selector, and Clustering
 ```python
 from crawl4ai import WebCrawler
 from crawl4ai.chunking_strategy import CosineStrategy
 js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
 crawler = WebCrawler()
 crawler.warmup()
 result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology")
 )
 print(result.extracted_content)
 ```
 ### Extract Structured Data from Web Pages With Proxy and BaseUrl
 ```python
 from crawl4ai import WebCrawler
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 def create_crawler():
    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
    crawler.warmup()
    return crawler
 crawler = create_crawler()
 crawler.warmup()
 result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="sk-",
        base_url="https://api.openai.com/v1"
    )
 )
 print(result.markdown)
 ```
 ## Documentation 📚
 For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
 ## Contributing 🤝
 We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
 ## License 📄
 Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
 ## Contact 📧
 For questions, suggestions, or feedback, feel free to reach out:
 - GitHub: [unclecode](https://github.com/unclecode)
 - Twitter: [@unclecode](https://twitter.com/unclecode)
 - Website: [crawl4ai.com](https://crawl4ai.com)
 Happy Crawling! 🕸️🚀
 ## Star History
 [![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -80,47 +80,6 @@ def load_bge_small_en_v1_5():
    model, device = set_model_device(model)
    return tokenizer, model
@lru_cache()
 def load_onnx_all_MiniLM_l6_v2():
    from crawl4ai.onnx_embedding import DefaultEmbeddingModel
    model_path = "models/onnx.tar.gz"
    model_url = "https://unclecode-files.s3.us-west-2.amazonaws.com/onnx.tar.gz"
    __location__ = os.path.realpath(
        os.path.join(os.getcwd(), os.path.dirname(__file__)))
    download_path = os.path.join(__location__, model_path)
    onnx_dir = os.path.join(__location__, "models/onnx")
    # Create the models directory if it does not exist
    os.makedirs(os.path.dirname(download_path), exist_ok=True)
    # Download the tar.gz file if it does not exist
    if not os.path.exists(download_path):
        def download_with_progress(url, filename):
            def reporthook(block_num, block_size, total_size):
                downloaded = block_num * block_size
                percentage = 100 * downloaded / total_size
                if downloaded < total_size:
                    print(f"\rDownloading: {percentage:.2f}% ({downloaded / (1024 * 1024):.2f} MB of {total_size / (1024 * 1024):.2f} MB)", end='')
                else:
                    print("\rDownload complete!")
            urllib.request.urlretrieve(url, filename, reporthook)
        download_with_progress(model_url, download_path)
    # Extract the tar.gz file if the onnx directory does not exist
    if not os.path.exists(onnx_dir):
        with tarfile.open(download_path, "r:gz") as tar:
            tar.extractall(path=os.path.join(__location__, "models"))
        # remove the tar.gz file
        os.remove(download_path)
    model = DefaultEmbeddingModel()
    return model
@lru_cache()
 def load_text_classifier():
--- a/crawl4ai/onnx_embedding.py
+++ b/crawl4ai/onnx_embedding.py
@@ -1,50 +0,0 @@
 # A dependency-light way to run the onnx model
 import numpy as np
 from typing import List
 import os
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
 MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
 def normalize(v):
    norm = np.linalg.norm(v, axis=1)
    norm[norm == 0] = 1e-12
    return v / norm[:, np.newaxis]
 # Sample implementation of the default sentence-transformers model using ONNX
 class DefaultEmbeddingModel():
    def __init__(self):
        from tokenizers import Tokenizer
        import onnxruntime as ort
        # max_seq_length = 256, for some reason sentence-transformers uses 256 even though the HF config has a max length of 128
        # https://github.com/UKPLab/sentence-transformers/blob/3e1929fddef16df94f8bc6e3b10598a98f46e62d/docs/_static/html/models_en_sentence_embeddings.html#LL480
        self.tokenizer = Tokenizer.from_file(os.path.join(__location__, "models/onnx/tokenizer.json"))
        self.tokenizer.enable_truncation(max_length=256)
        self.tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=256)
        self.model = ort.InferenceSession(os.path.join(__location__,"models/onnx/model.onnx"))
    def __call__(self, documents: List[str], batch_size: int = 32):
        all_embeddings = []
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            encoded = [self.tokenizer.encode(d) for d in batch]
            input_ids = np.array([e.ids for e in encoded])
            attention_mask = np.array([e.attention_mask for e in encoded])
            onnx_input = {
                "input_ids": np.array(input_ids, dtype=np.int64),
                "attention_mask": np.array(attention_mask, dtype=np.int64),
                "token_type_ids": np.array([np.zeros(len(e), dtype=np.int64) for e in input_ids], dtype=np.int64),
            }
            model_output = self.model.run(None, onnx_input)
            last_hidden_state = model_output[0]
            # Perform mean pooling with attention weighting
            input_mask_expanded = np.broadcast_to(np.expand_dims(attention_mask, -1), last_hidden_state.shape)
            embeddings = np.sum(last_hidden_state * input_mask_expanded, 1) / np.clip(input_mask_expanded.sum(1), a_min=1e-9, a_max=None)
            embeddings = normalize(embeddings).astype(np.float32)
            all_embeddings.append(embeddings)
        return np.concatenate(all_embeddings)
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -0,0 +1,442 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Crawl4AI: Advanced Web Crawling and Data Extraction\n",
    "\n",
    "Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n",
    "\n",
    "- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n",
    "- Twitter: [@unclecode](https://twitter.com/unclecode)\n",
    "- Website: [https://crawl4ai.com](https://crawl4ai.com)\n",
    "\n",
    "Let's explore the powerful features of Crawl4AI!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "First, let's install Crawl4AI from GitHub:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
    "!pip install nest-asyncio\n",
    "!playwright install"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's import the necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "import nest_asyncio\n",
    "from crawl4ai import AsyncWebCrawler\n",
    "from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n",
    "import json\n",
    "import time\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage\n",
    "\n",
    "Let's start with a simple crawl example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def simple_crawl():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
    "        print(result.markdown[:500])  # Print first 500 characters\n",
    "\n",
    "await simple_crawl()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Features\n",
    "\n",
    "### Executing JavaScript and Using CSS Selectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def js_and_css():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"]\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            js_code=js_code,\n",
    "            css_selector=\"article.tease-card\",\n",
    "            bypass_cache=True\n",
    "        )\n",
    "        print(result.extracted_content[:500])  # Print first 500 characters\n",
    "\n",
    "await js_and_css()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a Proxy\n",
    "\n",
    "Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def use_proxy():\n",
    "    async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            bypass_cache=True\n",
    "        )\n",
    "        print(result.markdown[:500])  # Print first 500 characters\n",
    "\n",
    "# Uncomment the following line to run the proxy example\n",
    "# await use_proxy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting Structured Data with OpenAI\n",
    "\n",
    "Note: Set your OpenAI API key in the `OPENAI_API_KEY` environment variable before running this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "class OpenAIModelFee(BaseModel):\n",
    "    model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n",
    "    input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n",
    "    output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n",
    "\n",
    "async def extract_openai_fees():\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url='https://openai.com/api/pricing/',\n",
    "            word_count_threshold=1,\n",
    "            extraction_strategy=LLMExtractionStrategy(\n",
    "                provider=\"openai/gpt-4o\", api_token=os.getenv('OPENAI_API_KEY'), \n",
    "                schema=OpenAIModelFee.model_json_schema(),  # Pydantic v2 replacement for the deprecated .schema()\n",
    "                extraction_type=\"schema\",\n",
    "                instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens. \n",
    "                Do not miss any models in the entire content. One extracted model JSON format should look like this: \n",
    "                {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n",
    "            ),\n",
    "            bypass_cache=True,\n",
    "        )\n",
    "        print(result.extracted_content)\n",
    "\n",
    "# Uncomment the following line to run the OpenAI extraction example\n",
    "# await extract_openai_fees()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Advanced Multi-Page Crawling with JavaScript Execution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "async def crawl_typescript_commits():\n",
    "    first_commit = \"\"\n",
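    "    async def on_execution_started(page):\n",
    "        # Hook: block until the first commit title differs from the previous page's,\n",
    "        # which signals that the next page of results has rendered.\n",
    "        nonlocal first_commit\n",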
    "        try:\n",
    "            while True:\n",
    "                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n",
    "                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n",
    "                commit = await commit.evaluate('(element) => element.textContent')\n",
    "                commit = re.sub(r'\\s+', '', commit)\n",
    "                if commit and commit != first_commit:\n",
    "                    first_commit = commit\n",
    "                    break\n",
    "                await asyncio.sleep(0.5)\n",
    "        except Exception as e:\n",
    "            print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n",
    "\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n",
    "\n",
    "        url = \"https://github.com/microsoft/TypeScript/commits/main\"\n",
    "        session_id = \"typescript_commits_session\"\n",
    "        all_commits = []\n",
    "\n",
    "        js_next_page = \"\"\"\n",
    "        const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n",
    "        if (button) button.click();\n",
    "        \"\"\"\n",
    "\n",
    "        for page in range(3):  # Crawl 3 pages\n",
    "            result = await crawler.arun(\n",
    "                url=url,\n",
    "                session_id=session_id,\n",
    "                css_selector=\"li.Box-sc-g0xbh4-0\",\n",
    "                js_code=js_next_page if page > 0 else None,\n",
    "                bypass_cache=True,\n",
    "                js_only=page > 0\n",
    "            )\n",
    "\n",
    "            assert result.success, f\"Failed to crawl page {page + 1}\"\n",
    "\n",
    "            soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n",
    "            commits = soup.select(\"li\")\n",
    "            all_commits.extend(commits)\n",
    "\n",
    "            print(f\"Page {page + 1}: Found {len(commits)} commits\")\n",
    "\n",
    "        await crawler.crawler_strategy.kill_session(session_id)\n",
    "        print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n",
    "\n",
    "await crawl_typescript_commits()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using JsonCssExtractionStrategy for Fast Structured Output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "async def extract_news_teasers():\n",
    "    schema = {\n",
    "        \"name\": \"News Teaser Extractor\",\n",
    "        \"baseSelector\": \".wide-tease-item__wrapper\",\n",
    "        \"fields\": [\n",
    "            {\n",
    "                \"name\": \"category\",\n",
    "                \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"headline\",\n",
    "                \"selector\": \".wide-tease-item__headline\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"summary\",\n",
    "                \"selector\": \".wide-tease-item__description\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"time\",\n",
    "                \"selector\": \"[data-testid='wide-tease-date']\",\n",
    "                \"type\": \"text\",\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"image\",\n",
    "                \"type\": \"nested\",\n",
    "                \"selector\": \"picture.teasePicture img\",\n",
    "                \"fields\": [\n",
    "                    {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n",
    "                    {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n",
    "                ],\n",
    "            },\n",
    "            {\n",
    "                \"name\": \"link\",\n",
    "                \"selector\": \"a[href]\",\n",
    "                \"type\": \"attribute\",\n",
    "                \"attribute\": \"href\",\n",
    "            },\n",
    "        ],\n",
    "    }\n",
    "\n",
    "    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n",
    "\n",
    "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            extraction_strategy=extraction_strategy,\n",
    "            bypass_cache=True,\n",
    "        )\n",
    "\n",
    "        assert result.success, \"Failed to crawl the page\"\n",
    "\n",
    "        news_teasers = json.loads(result.extracted_content)\n",
    "        print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n",
    "        print(json.dumps(news_teasers[0], indent=2))\n",
    "\n",
    "await extract_news_teasers()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Speed Comparison\n",
    "\n",
    "Let's compare the speed of Crawl4AI with Firecrawl, a paid service. Firecrawl can't be run in this Colab environment, so its figures below are hardcoded from a previously recorded run rather than measured live."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "async def speed_comparison():\n",
    "    # Firecrawl figures hardcoded from a previously recorded run (not measured here)\n",
    "    print(\"Firecrawl (simulated):\")\n",
    "    print(\"Time taken: 7.02 seconds\")\n",
    "    print(\"Content length: 42074 characters\")\n",
    "    print(\"Images found: 49\")\n",
    "    print()\n",
    "\n",
    "    async with AsyncWebCrawler() as crawler:\n",
    "        # Crawl4AI simple crawl\n",
    "        start = time.time()\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            word_count_threshold=0,\n",
    "            bypass_cache=True, \n",
    "            verbose=False\n",
    "        )\n",
    "        end = time.time()\n",
    "        print(\"Crawl4AI (simple crawl):\")\n",
    "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
    "        print(f\"Content length: {len(result.markdown)} characters\")\n",
    "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
    "        print()\n",
    "\n",
    "        # Crawl4AI with JavaScript execution\n",
    "        start = time.time()\n",
    "        result = await crawler.arun(\n",
    "            url=\"https://www.nbcnews.com/business\",\n",
    "            js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n",
    "            word_count_threshold=0,\n",
    "            bypass_cache=True, \n",
    "            verbose=False\n",
    "        )\n",
    "        end = time.time()\n",
    "        print(\"Crawl4AI (with JavaScript execution):\")\n",
    "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
    "        print(f\"Content length: {len(result.markdown)} characters\")\n",
    "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
    "\n",
    "await speed_comparison()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared with the recorded Firecrawl run (7.02 seconds), Crawl4AI is significantly faster:\n",
    "- Simple crawl: Crawl4AI is typically over 4 times faster than Firecrawl.\n",
    "- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n",
    "\n",
    "Please note that actual performance may vary depending on network conditions and the specific content being crawled."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this notebook, we've explored the powerful features of Crawl4AI, including:\n",
    "\n",
    "1. Basic crawling\n",
    "2. JavaScript execution and CSS selector usage\n",
    "3. Proxy support\n",
    "4. Structured data extraction with OpenAI\n",
    "5. Advanced multi-page crawling with JavaScript execution\n",
    "6. Fast structured output using JsonCssExtractionStrategy\n",
    "7. Speed comparison with Firecrawl\n",
    "\n",
    "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
    "\n",
    "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n",
    "\n",
    "Happy crawling!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
 }
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,24 +1,22 @@
-numpy==1.25.0
+numpy>=1.25.0
-aiohttp==3.9.5
+aiohttp>=3.9.5
-aiosqlite==0.20.0
+aiosqlite>=0.20.0
-beautifulsoup4==4.12.3
+beautifulsoup4>=4.12.3
-fastapi==0.111.0
+fastapi>=0.111.0
-html2text==2024.2.26
+html2text>=2024.2.26
-httpx==0.27.0
+httpx>=0.27.0
-litellm==1.40.17
+litellm>=1.40.17
-nltk==3.8.1
+nltk>=3.8.1
-pydantic==2.7.4
+pydantic>=2.7.4
-python-dotenv==1.0.1
+python-dotenv>=1.0.1
-requests==2.32.3
+requests>=2.32.3
-rich==13.7.1
+rich>=13.7.1
-scikit-learn==1.5.0
+scikit-learn>=1.5.0
-selenium==4.23.1
+selenium>=4.23.1
-uvicorn==0.30.1
+uvicorn>=0.30.1
-transformers==4.41.2
+transformers>=4.41.2
-# webdriver-manager==4.0.1
+torch>=2.3.1
-# chromedriver-autoinstaller==0.6.4
+tokenizers>=0.19.1
-torch==2.3.1
+pillow>=10.3.0
-onnxruntime==1.18.0
+slowapi>=0.1.9
-tokenizers==0.19.1
+playwright>=1.46.0
 pillow==10.3.0
 slowapi==0.1.9