Update quickstart_async.py to improve performance and add Firecrawl simulation

2024-09-28 00:11:39 +08:00
parent 8b6e88c85c
commit 5d4e92db7d
2 changed files with 74 additions and 60 deletions
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -30,14 +30,14 @@
    },
    {
      "cell_type": "code",
-      "source": [
-        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "mSnaxLf3zMog"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
+      ]
    },
    {
      "cell_type": "code",
@@ -94,7 +94,7 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 4,
+      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
@@ -104,14 +104,14 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.18 seconds\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.18 seconds.\n",
-            "18219\n"
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds\n",
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.\n",
+            "18102\n"
          ]
        }
      ],
@@ -125,12 +125,12 @@
    },
    {
      "cell_type": "markdown",
-      "source": [
-        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
-      ],
      "metadata": {
        "id": "9rtkgHI28uI4"
-      }
+      },
+      "source": [
+        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
+      ]
    },
    {
      "cell_type": "markdown",
@@ -145,7 +145,7 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 9,
+      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
@@ -155,18 +155,18 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
-            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 9.78 seconds\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.44 seconds\n",
+            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 6.06 seconds\n",
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.10 seconds\n",
            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.45 seconds.\n",
-            "34241\n"
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.11 seconds.\n",
+            "41135\n"
          ]
        }
      ],
@@ -239,8 +239,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -306,16 +306,16 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "tfkcVQ0b7mw-"
+      },
      "source": [
        "## Advanced Multi-Page Crawling with JavaScript Execution\n",
        "\n",
        "This example demonstrates Crawl4AI's ability to handle complex crawling scenarios, specifically extracting commits from multiple pages of a GitHub repository. The challenge here is that clicking the \"Next\" button doesn't load a new page, but instead uses asynchronous JavaScript to update the content. This is a common hurdle in modern web crawling.\n",
        "\n",
        "To overcome this, we use Crawl4AI's custom JavaScript execution to simulate clicking the \"Next\" button, and implement a custom hook to detect when new data has loaded. Our strategy involves comparing the first commit's text before and after \"clicking\" Next, waiting until it changes to confirm new data has rendered. This showcases Crawl4AI's flexibility in handling dynamic content and its ability to implement custom logic for even the most challenging crawling tasks."
-      ],
-      "metadata": {
-        "id": "tfkcVQ0b7mw-"
-      }
+      ]
    },
    {
      "cell_type": "code",
@@ -329,8 +329,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -427,6 +427,9 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "1ZMqIzB_8SYp"
+      },
      "source": [
        "The JsonCssExtractionStrategy is a powerful feature of Crawl4AI that allows for precise, structured data extraction from web pages. Here's how it works:\n",
        "\n",
@@ -440,10 +443,7 @@
        "This approach allows for highly flexible and precise data extraction, transforming semi-structured web content into clean, structured JSON data. It's particularly useful for extracting consistent data patterns from pages like product listings, news articles, or search results.\n",
        "\n",
        "For more details and advanced usage, check out the full documentation on the Crawl4AI website."
-      ],
-      "metadata": {
-        "id": "1ZMqIzB_8SYp"
-      }
+      ]
    },
    {
      "cell_type": "code",
@@ -457,8 +457,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
@@ -558,6 +558,9 @@
    },
    {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "agDD186f3wig"
+      },
      "source": [
        "💡 **Note on Speed Comparison:**\n",
        "\n",
@@ -566,21 +569,18 @@
        "For a more accurate comparison, it's recommended to run these tests on your own servers or computers with a stable and fast internet connection. Despite these limitations, Crawl4AI still demonstrates faster performance in this environment.\n",
        "\n",
        "If you run these tests locally, you may observe an even more significant speed advantage for Crawl4AI compared to other services."
-      ],
-      "metadata": {
-        "id": "agDD186f3wig"
-      }
+      ]
    },
    {
      "cell_type": "code",
-      "source": [
-        "!pip install firecrawl"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "F7KwHv8G1LbY"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!pip install firecrawl"
+      ]
    },
    {
      "cell_type": "code",
@@ -594,8 +594,8 @@
      },
      "outputs": [
        {
-          "output_type": "stream",
          "name": "stdout",
+          "output_type": "stream",
          "text": [
            "Firecrawl (simulated):\n",
            "Time taken: 4.38 seconds\n",
@@ -710,6 +710,9 @@
    }
  ],
  "metadata": {
+    "colab": {
+      "provenance": []
+    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
@@ -725,12 +728,9 @@
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
-      "version": "3.8.10"
-    },
-    "colab": {
-      "provenance": []
+      "version": "3.10.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
-}
+}
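
Note on the multi-page crawling cell whose metadata is reordered above: its markdown describes comparing the first commit's text before and after a simulated click on "Next", and waiting until that text changes. A minimal sketch of that wait-for-change idea in plain asyncio follows. It is independent of Crawl4AI: `wait_for_change`, `FakePage`, and `demo` are hypothetical names introduced only for illustration, and a real hook would read the text from the browser page rather than from an in-memory object.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_change(
    read_value: Callable[[], Awaitable[str]],
    previous: str,
    timeout: float = 5.0,
    interval: float = 0.05,
) -> str:
    """Poll read_value() until it differs from `previous`, or raise TimeoutError."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        current = await read_value()
        if current != previous:
            return current
        if loop.time() >= deadline:
            raise TimeoutError("content did not change before the timeout")
        await asyncio.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose content updates asynchronously."""

    def __init__(self) -> None:
        self.first_commit = "commit A"

    async def click_next(self) -> None:
        # Simulate JavaScript replacing the list shortly after the click,
        # without navigating to a new page.
        async def update() -> None:
            await asyncio.sleep(0.1)
            self.first_commit = "commit B"

        self._task = asyncio.create_task(update())

    async def read_first_commit(self) -> str:
        return self.first_commit


async def demo() -> str:
    page = FakePage()
    before = await page.read_first_commit()  # text before the "click"
    await page.click_next()                  # returns before the content changes
    return await wait_for_change(page.read_first_commit, before)
```

Calling `asyncio.run(demo())` returns the updated text once the simulated update lands. Polling against a deadline keeps the wait bounded, so a page that never updates raises `TimeoutError` instead of hanging.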