Update quickstart_async.py to improve performance and add Firecrawl simulation

2024-09-28 00:11:39 +08:00
parent 8b6e88c85c
commit 5d4e92db7d
2 changed files with 74 additions and 60 deletions
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -30,14 +30,14 @@
    },
    {
      "cell_type": "code",
-      "source": [
-        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "mSnaxLf3zMog"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
+      ]
    },
    {
      "cell_type": "code",
@@ -94,7 +94,7 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 4,
+      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
@@ -104,14 +104,14 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.18 seconds.\n",
-            "18219\n"
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds\n",
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.\n",
+            "18102\n"
          ]
        }
      ],
@@ -125,12 +125,12 @@
    },
    {
      "cell_type": "markdown",
-      "source": [
-        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
-      ],
      "metadata": {
        "id": "9rtkgHI28uI4"
-      }
+      },
+      "source": [
+        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
+      ]
    },
    {
      "cell_type": "markdown",
@@ -145,7 +145,7 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 9,
+      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
@@ -155,18 +155,18 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
-            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 9.78 seconds\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.44 seconds\n",
+            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 6.06 seconds\n",
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.10 seconds\n",
            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.45 seconds.\n",
-            "34241\n"
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.11 seconds.\n",
+            "41135\n"
          ]
        }
      ],
@@ -239,8 +239,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -306,16 +306,16 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "tfkcVQ0b7mw-"
+      },
      "source": [
        "## Advanced Multi-Page Crawling with JavaScript Execution\n",
        "\n",
        "This example demonstrates Crawl4AI's ability to handle complex crawling scenarios, specifically extracting commits from multiple pages of a GitHub repository. The challenge here is that clicking the \"Next\" button doesn't load a new page, but instead uses asynchronous JavaScript to update the content. This is a common hurdle in modern web crawling.\n",
        "\n",
        "To overcome this, we use Crawl4AI's custom JavaScript execution to simulate clicking the \"Next\" button, and implement a custom hook to detect when new data has loaded. Our strategy involves comparing the first commit's text before and after \"clicking\" Next, waiting until it changes to confirm new data has rendered. This showcases Crawl4AI's flexibility in handling dynamic content and its ability to implement custom logic for even the most challenging crawling tasks."
-      ],
-      "metadata": {
-        "id": "tfkcVQ0b7mw-"
-      }
+      ]
    },
    {
      "cell_type": "code",
@@ -329,8 +329,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -427,6 +427,9 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "1ZMqIzB_8SYp"
+      },
      "source": [
        "The JsonCssExtractionStrategy is a powerful feature of Crawl4AI that allows for precise, structured data extraction from web pages. Here's how it works:\n",
        "\n",
@@ -440,10 +443,7 @@
        "This approach allows for highly flexible and precise data extraction, transforming semi-structured web content into clean, structured JSON data. It's particularly useful for extracting consistent data patterns from pages like product listings, news articles, or search results.\n",
        "\n",
        "For more details and advanced usage, check out the full documentation on the Crawl4AI website."
-      ],
-      "metadata": {
-        "id": "1ZMqIzB_8SYp"
-      }
+      ]
    },
    {
      "cell_type": "code",
@@ -457,8 +457,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -558,6 +558,9 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "agDD186f3wig"
+      },
      "source": [
        "💡 **Note on Speed Comparison:**\n",
        "\n",
@@ -566,21 +569,18 @@
        "For a more accurate comparison, it's recommended to run these tests on your own servers or computers with a stable and fast internet connection. Despite these limitations, Crawl4AI still demonstrates faster performance in this environment.\n",
        "\n",
        "If you run these tests locally, you may observe an even more significant speed advantage for Crawl4AI compared to other services."
-      ],
-      "metadata": {
-        "id": "agDD186f3wig"
-      }
+      ]
    },
    {
      "cell_type": "code",
-      "source": [
-        "!pip install firecrawl"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "F7KwHv8G1LbY"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!pip install firecrawl"
+      ]
    },
    {
      "cell_type": "code",
@@ -594,8 +594,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "Firecrawl (simulated):\n",
            "Time taken: 4.38 seconds\n",
@@ -710,6 +710,9 @@
    }
  ],
  "metadata": {
+    "colab": {
+      "provenance": []
+    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
@@ -725,12 +728,9 @@
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
-      "version": "3.8.10"
-    },
-    "colab": {
-      "provenance": []
+      "version": "3.10.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -1,6 +1,6 @@
 import os, sys
 # append parent directory to system path
-sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))  # FIRECRAWL_API_KEY must be set in the environment, never hard-coded here

 import asyncio
 # import nest_asyncio
@@ -46,12 +46,12 @@ async def js_and_css():
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
-            # js_code=js_code,
-            css_selector="article.tease-card",
+            js_code=js_code,
+            # css_selector="article.tease-card",
            # wait_for=wait_for,
            bypass_cache=True,
        )
-        print(result.extracted_content[:500])  # Print first 500 characters
+        print(result.markdown[:500])  # Print first 500 characters

 async def use_proxy():
    print("\n--- Using a Proxy ---")
@@ -270,7 +270,7 @@ async def crawl_dynamic_content_pages_method_3():
        js_next_page = """
        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
        if (commits.length > 0) {
-            window.lastCommit = commits[0].textContent.trim();
+            window.firstCommit = commits[0].textContent.trim();
        }
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
@@ -280,7 +280,7 @@ async def crawl_dynamic_content_pages_method_3():
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
-            return firstCommit !== window.lastCommit;
+            return firstCommit !== window.firstCommit;
        }"""
        
        schema = {
@@ -321,12 +321,26 @@ async def crawl_dynamic_content_pages_method_3():
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

 async def speed_comparison():
-    print("\n--- Speed Comparison ---")
+    # print("\n--- Speed Comparison ---")
+    # print("Firecrawl (simulated):")
+    # print("Time taken: 7.02 seconds")
+    # print("Content length: 42074 characters")
+    # print("Images found: 49")
+    # print()
+    # Live Firecrawl request; reads FIRECRAWL_API_KEY from the environment
+    from firecrawl import FirecrawlApp
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    start = time.time()
+    scrape_status = app.scrape_url(
+        'https://www.nbcnews.com/business',
+        params={'formats': ['markdown', 'html']}
+    )
+    end = time.time()
    print("Firecrawl (simulated):")
-    print("Time taken: 7.02 seconds")
-    print("Content length: 42074 characters")
-    print("Images found: 49")
-    print()
+    print(f"Time taken: {end - start:.2f} seconds")
+    print(f"Content length: {len(scrape_status['markdown'])} characters")
+    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
+    print()

    async with AsyncWebCrawler() as crawler:
        # Crawl4AI simple crawl
@@ -375,10 +389,10 @@ async def main():
    await simple_crawl()
    await js_and_css()
    await use_proxy()
-    await extract_structured_data_using_llm()
    await extract_structured_data_using_css_extractor()
-    await crawl_dynamic_content_pages_method_1()
-    await crawl_dynamic_content_pages_method_2()
+    await extract_structured_data_using_llm()
+    # await crawl_dynamic_content_pages_method_1()
+    # await crawl_dynamic_content_pages_method_2()
    await crawl_dynamic_content_pages_method_3()
    await speed_comparison()
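
Review note: `speed_comparison()` now depends on an API key and on wall-clock timing. The sketch below shows one way to fail early with a readable error when the key is missing and to time a call in one place. The helper names `require_env` and `timed` are hypothetical and do not exist in this repository; only the standard library is used.

```python
import os
import time


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: keeps secrets such as FIRECRAWL_API_KEY out of source.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds).

    Hypothetical helper: perf_counter is monotonic, so the measurement is
    not affected by system clock adjustments.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; in the example script this would be the scrape call.
    result, elapsed = timed(sum, range(1_000_000))
    print(f"Time taken: {elapsed:.2f} seconds")
```

With helpers like these, the script could call `require_env("FIRECRAWL_API_KEY")` once at the top of `speed_comparison()` instead of assigning the key at import time.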