πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the warmup() function.
crawler = WebCrawler()
crawler.warmup()
🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters:
First crawl (caches the result):
result = crawler.run(url="https://www.nbcnews.com/business")
Second crawl (Force to crawl again):
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
Crawl result without raw HTML content:
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
πŸ“„ The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.
To bypass the cache on every crawl, set always_by_pass_cache to True:
crawler.always_by_pass_cache = True
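The caching behaviour above can be pictured with a small stand-alone sketch. This is not Crawl4AI's internals; run(), fetch(), and the cache dict here are hypothetical stand-ins that only illustrate when a page is actually re-downloaded:

```python
# Conceptual sketch of bypass_cache semantics (not the library's code).
cache = {}

def run(url, fetch, bypass_cache=False):
    if not bypass_cache and url in cache:
        return cache[url]          # cache hit: no new crawl
    result = fetch(url)            # fresh crawl
    cache[url] = result
    return result

calls = []
def fetch(url):
    # Stand-in for a real page download; records each network hit.
    calls.append(url)
    return f"content of {url}"

run("https://example.com", fetch)                      # first crawl: fetches
run("https://example.com", fetch)                      # served from cache
run("https://example.com", fetch, bypass_cache=True)   # forced re-crawl
```

After these three calls, fetch() has run only twice: the middle call was answered from the cache.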
🧩 Let's add a chunking strategy: RegexChunking!
Using RegexChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
Using NlpSentenceChunking:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
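The idea behind RegexChunking is simple: split the page text on the given patterns (e.g. double newlines for paragraphs). A minimal sketch of that idea — regex_chunk is a hypothetical helper, not the library's implementation:

```python
import re

def regex_chunk(text, patterns):
    # Split the text on each pattern in turn, dropping empty pieces,
    # mimicking a paragraph-based chunker.
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c.strip() for c in chunks if c.strip()]

chunks = regex_chunk("First paragraph.\n\nSecond paragraph.", [r"\n\n"])
# chunks -> ["First paragraph.", "Second paragraph."]
```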
🧠 Let's get smarter with an extraction strategy: CosineStrategy!
Using CosineStrategy:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
)
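CosineStrategy groups text blocks by similarity, keeping clusters within max_dist of each other. To build intuition for the distance measure involved, here is a bag-of-words cosine distance — a hypothetical illustration only; the library may well use embeddings rather than raw word counts:

```python
import math
from collections import Counter

def cosine_distance(a, b):
    # Bag-of-words cosine distance: 0.0 for identical texts,
    # 1.0 for texts sharing no words.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / (na * nb)
```

With max_dist=0.2, only blocks whose pairwise distance stays below 0.2 would end up clustered together.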
πŸ€– Time to bring in the big guns: LLMExtractionStrategy without instructions!
Using LLMExtractionStrategy without instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
πŸ“œ Let's make it even more interesting: LLMExtractionStrategy with instructions!
Using LLMExtractionStrategy with instructions:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags!
Using CSS selector to extract H2 tags:
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
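Conceptually, css_selector="h2" keeps only the matching elements from the fetched page. The effect can be sketched with the standard library alone (H2Collector is a hypothetical helper for illustration; Crawl4AI does the selection for you):

```python
from html.parser import HTMLParser

class H2Collector(HTMLParser):
    # Collects the text content of every <h2> tag, roughly what
    # css_selector="h2" would keep from the page.
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings[-1] += data

parser = H2Collector()
parser.feed("<h1>Top</h1><h2>Markets</h2><p>x</p><h2>Economy</h2>")
# parser.headings -> ["Markets", "Economy"]
```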
πŸ–±οΈ Let's get interactive: Passing JavaScript code to click 'Load More' button!
Using JavaScript to click 'Load More' button:
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
πŸŽ‰ Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! πŸ•ΈοΈ

Installation πŸ’»

There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.

You can also try Crawl4AI in a Google Colab Open In Colab

Using Crawl4AI as a Library πŸ“š

To install Crawl4AI as a library, follow these steps:

  1. Install the package from GitHub:
    pip install git+https://github.com/unclecode/crawl4ai.git
  2. Alternatively, you can clone the repository and install the package locally:
    virtualenv venv
    source venv/bin/activate
    git clone https://github.com/unclecode/crawl4ai.git
    cd crawl4ai
    pip install -e .
            
  3. Import the necessary modules in your Python script:
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    import os
    
    crawler = WebCrawler()
    crawler.warmup()
    
    # Single page crawl
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
        chunking_strategy=RegexChunking(patterns=["\n\n"]),  # Default is RegexChunking
        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
        # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
        bypass_cache=False,
        extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
        css_selector="",  # E.g. "div.article-body"
        verbose=True,
        include_raw_html=True,  # Whether to include the raw HTML content in the response
    )
    print(result.model_dump())
            

For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

πŸ“– Parameters

| Parameter | Description | Required | Default Value |
|---|---|---|---|
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | true |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., CosineStrategy). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., RegexChunking). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |
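The interplay of word_count_threshold and extract_blocks can be pictured as a simple filter over candidate text blocks. This is a conceptual sketch only — filter_blocks is a hypothetical helper, not the library's code:

```python
def filter_blocks(blocks, word_count_threshold=5):
    # Drop text blocks with fewer words than the threshold,
    # keeping only "meaningful" blocks, as word_count_threshold does.
    return [b for b in blocks if len(b.split()) >= word_count_threshold]

blocks = ["Too short", "This block easily clears the five word minimum threshold"]
kept = filter_blocks(blocks)
# kept -> ["This block easily clears the five word minimum threshold"]
```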


πŸ€” Why build this?

In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all. 🀝πŸ’ͺ

βš™οΈ Installation

To install and run Crawl4AI as a library or a local server, please refer to the πŸ“š GitHub repository.