- Debug
- Refactor code for new version
This commit is contained in:
unclecode
2024-05-16 17:31:44 +08:00
parent f6e59157bf
commit 5b80be956d
23 changed files with 3116 additions and 1019 deletions

36
pages/partial/footer.html Normal file
View File

@@ -0,0 +1,36 @@
<section class="hero bg-zinc-900 py-8 px-20 text-zinc-400">
<div class="container mx-auto px-4">
<h2 class="text-3xl font-bold mb-4">🤔 Why building this?</h2>
<p class="text-lg mb-4">
In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging
for services that should rightfully be accessible to everyone. 🌍💸 One such example is scraping and
crawling web pages and transforming them into a format suitable for Large Language Models (LLMs).
🕸️🤖 We believe that building a business around this is not the right approach; instead, it should
definitely be open-source. 🆓🌟 So, if you possess the skills to build such tools and share our
philosophy, we invite you to join our "Robinhood" band and help set these products free for the
benefit of all. 🤝💪
</p>
</div>
</section>
<footer class="bg-zinc-900 text-zinc-400 py-4">
<div class="container mx-auto px-4">
<div class="flex justify-between items-center">
<p>© 2024 Crawl4AI. All rights reserved.</p>
<div class="social-links">
<a
href="https://github.com/unclecode/crawl4ai"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>😺 GitHub</a
>
<a
href="https://twitter.com/unclecode"
class="text-zinc-400 hover:text-gray-300 mx-2"
target="_blank"
>🐦 Twitter</a
>
</div>
</div>
</div>
</footer>

View File

@@ -0,0 +1,160 @@
<section id="how-to-guide" class="content-section">
<h1 class="text-2xl font-bold">How to Guide</h1>
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
<!-- Step 1 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🌟
<strong
>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling
fun!</strong
>
</div>
<div class="">
First Step: Create an instance of WebCrawler and call the
<code>warmup()</code> function.
</div>
<div>
<pre><code class="language-python">crawler = WebCrawler()
crawler.warmup()</code></pre>
</div>
<!-- Step 2 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
</div>
<div class="">First crawl (caches the result):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<div class="">Second crawl (Force to crawl again):</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
<div class="bg-red-900 p-2 text-zinc-50">
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set <code>`always_by_pass_cache`</code> in constructor to True to always bypass the cache.
</div>
</div>
<div class="">Crawl result without raw HTML content:</div>
<div>
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
</div>
<!-- Step 3 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📄
<strong
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
in the response. By default, it is set to True.</strong
>
</div>
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
<div>
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
</div>
<!-- Step 4 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
</div>
<div class="">Using RegexChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=RegexChunking(patterns=["\n\n"])
)</code></pre>
</div>
<div class="">Using NlpSentenceChunking:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
chunking_strategy=NlpSentenceChunking()
)</code></pre>
</div>
<!-- Step 5 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
</div>
<div class="">Using CosineStrategy:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)</code></pre>
</div>
<!-- Step 6 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🤖
<strong
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy without instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)</code></pre>
</div>
<!-- Step 7 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
📜
<strong
>Let's make it even more interesting: LLMExtractionStrategy with
instructions!</strong
>
</div>
<div class="">Using LLMExtractionStrategy with instructions:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)</code></pre>
</div>
<!-- Step 8 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎯
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
</div>
<div class="">Using CSS selector to extract H2 tags:</div>
<div>
<pre><code class="language-python">result = crawler.run(
url="https://www.nbcnews.com/business",
css_selector="h2"
)</code></pre>
</div>
<!-- Step 9 -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🖱️
<strong
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
>
</div>
<div class="">Using JavaScript to click 'Load More' button:</div>
<div>
<pre><code class="language-python">js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
</div>
<!-- Conclusion -->
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
🎉
<strong
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
and crawl the web like a pro! 🕸️</strong
>
</div>
</div>
</section>

View File

@@ -0,0 +1,56 @@
<section id="installation" class="content-section active">
<h1 class="text-2xl font-bold">Installation 💻</h1>
<p class="mb-4">
There are three ways to use Crawl4AI:
<ol class="list-decimal list-inside mb-4">
<li class="">
As a library
</li>
<li class="">
As a local server (Docker)
</li>
<li class="">
As a Google Colab notebook. <a href="https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk"
><img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"
style="display: inline-block; width: 100px; height: 20px"
/></a>
</li>
</p>
<p class="my-4">To install Crawl4AI as a library, follow these steps:</p>
<ol class="list-decimal list-inside mb-4">
<li class="mb-4">
Install the package from GitHub:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code>pip install git+https://github.com/unclecode/crawl4ai.git</code></pre>
</li>
<li class="mb-4">
Alternatively, you can clone the repository and install the package locally:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
</code></pre>
</li>
<li class="">
Use docker to run the local server:
<pre
class="bg-zinc-800 p-4 rounded mt-2 text-zinc-100"
><code class = "language-python bash">docker build -t crawl4ai .
# docker build --platform linux/amd64 -t crawl4ai . For Mac users
docker run -d -p 8000:80 crawl4ai</code></pre>
</li>
</ol>
<p class="mb-4">
For more information about how to run Crawl4AI as a local server, please refer to the
<a href="https://github.com/unclecode/crawl4ai" class="text-blue-400">GitHub repository</a>.
</p>
</section>

204
pages/partial/try_it.html Normal file
View File

@@ -0,0 +1,204 @@
<section class="try-it py-8 px-16 pb-20 bg-zinc-900">
<div class="container mx-auto ">
<h2 class="text-2xl font-bold mb-4 text-lime-500">Try It Now</h2>
<div class="flex gap-4">
<div class="flex flex-col flex-1 gap-2">
<div class="flex flex-col">
<label for="url-input" class="text-lime-500 font-bold text-xs">URL(s)</label>
<input
type="text"
id="url-input"
value="https://www.nbcnews.com/business"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter URL(s) separated by commas"
/>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="threshold" class="text-lime-500 font-bold text-xs">Min Words Threshold</label>
<select
id="threshold"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="5">5</option>
<option value="10" selected>10</option>
<option value="15">15</option>
<option value="20">20</option>
<option value="25">25</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="css-selector" class="text-lime-500 font-bold text-xs">CSS Selector</label>
<input
type="text"
id="css-selector"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-lime-700"
placeholder="CSS Selector (e.g. .content, #main, article)"
/>
</div>
</div>
<div class="flex gap-2">
<div class="flex flex-col">
<label for="extraction-strategy-select" class="text-lime-500 font-bold text-xs"
>Extraction Strategy</label
>
<select
id="extraction-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="CosineStrategy">CosineStrategy</option>
<option value="LLMExtractionStrategy">LLMExtractionStrategy</option>
<option value="NoExtractionStrategy">NoExtractionStrategy</option>
</select>
</div>
<div class="flex flex-col">
<label for="chunking-strategy-select" class="text-lime-500 font-bold text-xs"
>Chunking Strategy</label
>
<select
id="chunking-strategy-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="RegexChunking">RegexChunking</option>
<option value="NlpSentenceChunking">NlpSentenceChunking</option>
<option value="TopicSegmentationChunking">TopicSegmentationChunking</option>
<option value="FixedLengthWordChunking">FixedLengthWordChunking</option>
<option value="SlidingWindowChunking">SlidingWindowChunking</option>
</select>
</div>
</div>
<div id = "llm_settings" class="flex gap-2 hidden hidden">
<div class="flex flex-col">
<label for="provider-model-select" class="text-lime-500 font-bold text-xs"
>Provider Model</label
>
<select
id="provider-model-select"
class="border border-zinc-700 rounded px-4 py-1 bg-zinc-900 text-zinc-300"
>
<option value="groq/llama3-70b-8192">groq/llama3-70b-8192</option>
<option value="groq/llama3-8b-8192">groq/llama3-8b-8192</option>
<option value="groq/mixtral-8x7b-32768">groq/mixtral-8x7b-32768</option>
<option value="openai/gpt-4-turbo">gpt-4-turbo</option>
<option value="openai/gpt-3.5-turbo">gpt-3.5-turbo</option>
<option value="openai/gpt-4o">gpt-4o</option>
<option value="anthropic/claude-3-haiku-20240307">claude-3-haiku</option>
<option value="anthropic/claude-3-opus-20240229">claude-3-opus</option>
<option value="anthropic/claude-3-sonnet-20240229">claude-3-sonnet</option>
</select>
</div>
<div class="flex flex-col flex-1">
<label for="token-input" class="text-lime-500 font-bold text-xs">API Token</label>
<input
type="password"
id="token-input"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300"
placeholder="Enter Groq API token"
/>
</div>
</div>
<div class="flex gap-2">
<!-- Add two textarea one for getting Keyword Filter and another one Instruction, make both grow whole with-->
<div id = "semantic_filter_div" class="flex flex-col flex-1">
<label for="keyword-filter" class="text-lime-500 font-bold text-xs">Keyword Filter</label>
<textarea
id="semantic_filter"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter keywords for CosineStrategy to narrow down the content."
></textarea>
</div>
<div id = "instruction_div" class="flex flex-col flex-1 hidden">
<label for="instruction" class="text-lime-500 font-bold text-xs">Instruction</label>
<textarea
id="instruction"
rows="3"
class="border border-zinc-700 rounded px-4 py-0 bg-zinc-900 text-zinc-300 placeholder-zinc-700"
placeholder="Enter instruction for the LLMEstrategy to instruct the model."
></textarea>
</div>
</div>
<div class="flex gap-3">
<div class="flex items-center gap-2">
<input type="checkbox" id="bypass-cache-checkbox" />
<label for="bypass-cache-checkbox" class="text-lime-500 font-bold">Bypass Cache</label>
</div>
<div class="flex items-center gap-2">
<input type="checkbox" id="extract-blocks-checkbox" checked />
<label for="extract-blocks-checkbox" class="text-lime-500 font-bold">Extract Blocks</label>
</div>
<button id="crawl-btn" class="bg-lime-600 text-black font-bold px-4 py-0 rounded">Crawl</button>
</div>
</div>
<div id="result" class="flex-1">
<div id="loading" class="hidden">
<p class="text-white">Loading... Please wait.</p>
</div>
<div class="tab-buttons flex gap-2">
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="json">
JSON
</button>
<button
class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="cleaned-html"
>
Cleaned HTML
</button>
<button class="tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="markdown">
Markdown
</button>
</div>
<div class="tab-content code bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex"><code id="json-result" class="language-json"></code></pre>
<pre class="hidden h-full flex"><code id="cleaned-html-result" class="language-html"></code></pre>
<pre class="hidden h-full flex"><code id="markdown-result" class="language-markdown"></code></pre>
</div>
</div>
<div id="code_help" class="flex-1">
<div class="tab-buttons flex gap-2">
<button class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500" data-tab="curl">
cURL
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="library"
>
Python
</button>
<button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="python"
>
REST API
</button>
<!-- <button
class="code-tab-btn px-4 py-1 text-sm bg-zinc-700 rounded-t text-lime-500"
data-tab="nodejs"
>
Node.js
</button> -->
</div>
<div class="tab-content result bg-zinc-900 p-2 rounded h-full border border-zinc-700 text-sm">
<pre class="h-full flex relative">
<code id="curl-code" class="language-bash"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="curl-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="python-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="python-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="nodejs-code" class="language-javascript"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="nodejs-code">Copy</button>
</pre>
<pre class="hidden h-full flex relative">
<code id="library-code" class="language-python"></code>
<button class="absolute top-2 right-2 bg-zinc-700 text-white px-2 py-1 rounded copy-btn" data-target="library-code">Copy</button>
</pre>
</div>
</div>
</div>
</div>
</section>