174 lines
7.1 KiB
HTML
174 lines
7.1 KiB
HTML
<section id="how-to-guide" class="content-section">
|
|
<h1 class="text-2xl font-bold">How to Guide</h1>
|
|
<div class="flex flex-col gap-4 p-4 bg-zinc-900 text-lime-500">
|
|
<!-- Step 1 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🌟
|
|
<strong
|
|
>Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling
|
|
fun!</strong
|
|
>
|
|
</div>
|
|
<div class="">
|
|
First Step: Create an instance of WebCrawler and call the
|
|
<code>warmup()</code> function.
|
|
</div>
|
|
<div>
|
|
<pre><code class="language-python">crawler = WebCrawler()
|
|
crawler.warmup()</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 2 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🧠 <strong>Understanding 'bypass_cache' and 'include_raw_html' parameters:</strong>
|
|
</div>
|
|
<div class="">First crawl (caches the result):</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business")</code></pre>
|
|
</div>
|
|
<div class="">Second crawl (Force to crawl again):</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)</code></pre>
|
|
<div class="bg-red-900 p-2 text-zinc-50">
|
|
⚠️ Don't forget to set <code>`bypass_cache`</code> to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set <code>`always_by_pass_cache`</code> in constructor to True to always bypass the cache.
|
|
</div>
|
|
</div>
|
|
<div class="">Crawl result without raw HTML content:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 3 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
📄
|
|
<strong
|
|
>The 'include_raw_html' parameter, when set to True, includes the raw HTML content
|
|
in the response. By default, it is set to True.</strong
|
|
>
|
|
</div>
|
|
<div class="">Set <code>always_by_pass_cache</code> to True:</div>
|
|
<div>
|
|
<pre><code class="language-python">crawler.always_by_pass_cache = True</code></pre>
|
|
</div>
|
|
<!-- Step 3.5 Screenshot -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
📸
|
|
<strong>Let's take a screenshot of the page!</strong>
|
|
</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
screenshot=True
|
|
)
|
|
with open("screenshot.png", "wb") as f:
|
|
f.write(base64.b64decode(result.screenshot))</code></pre>
|
|
</div>
|
|
|
|
|
|
<!-- Step 4 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🧩 <strong>Let's add a chunking strategy: RegexChunking!</strong>
|
|
</div>
|
|
<div class="">Using RegexChunking:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
chunking_strategy=RegexChunking(patterns=["\n\n"])
|
|
)</code></pre>
|
|
</div>
|
|
<div class="">Using NlpSentenceChunking:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
chunking_strategy=NlpSentenceChunking()
|
|
)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 5 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🧠 <strong>Let's get smarter with an extraction strategy: CosineStrategy!</strong>
|
|
</div>
|
|
<div class="">Using CosineStrategy:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
|
|
)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 6 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🤖
|
|
<strong
|
|
>Time to bring in the big guns: LLMExtractionStrategy without instructions!</strong
|
|
>
|
|
</div>
|
|
<div class="">Using LLMExtractionStrategy without instructions:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
|
|
)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 7 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
📜
|
|
<strong
|
|
>Let's make it even more interesting: LLMExtractionStrategy with
|
|
instructions!</strong
|
|
>
|
|
</div>
|
|
<div class="">Using LLMExtractionStrategy with instructions:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
extraction_strategy=LLMExtractionStrategy(
|
|
provider="openai/gpt-4o",
|
|
api_token=os.getenv('OPENAI_API_KEY'),
|
|
instruction="I am interested in only financial news"
|
|
)
|
|
)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 8 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🎯
|
|
<strong>Targeted extraction: Let's use a CSS selector to extract only H2 tags!</strong>
|
|
</div>
|
|
<div class="">Using CSS selector to extract H2 tags:</div>
|
|
<div>
|
|
<pre><code class="language-python">result = crawler.run(
|
|
url="https://www.nbcnews.com/business",
|
|
css_selector="h2"
|
|
)</code></pre>
|
|
</div>
|
|
|
|
<!-- Step 9 -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🖱️
|
|
<strong
|
|
>Let's get interactive: Passing JavaScript code to click 'Load More' button!</strong
|
|
>
|
|
</div>
|
|
<div class="">Using JavaScript to click 'Load More' button:</div>
|
|
<div>
|
|
<pre><code class="language-python">js_code = ["""
|
|
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
|
|
loadMoreButton && loadMoreButton.click();
|
|
"""]
|
|
crawler = WebCrawler(verbos=crawler_strategy, always_by_pass_cache=True)
|
|
result = crawler.run(url="https://www.nbcnews.com/business", js = js_code)</code></pre>
|
|
<div class="">Remember that you can pass multiple JavaScript code snippets in the list. They all will be executed in the order they are passed.</div>
|
|
</div>
|
|
|
|
<!-- Conclusion -->
|
|
<div class="col-span-2 bg-lime-800 p-2 rounded text-zinc-50">
|
|
🎉
|
|
<strong
|
|
>Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth
|
|
and crawl the web like a pro! 🕸️</strong
|
|
>
|
|
</div>
|
|
</div>
|
|
</section> |