πŸ”₯πŸ•·οΈ Crawl4AI: Web Data for your Thoughts

Installation πŸ’»

There are two ways to use Crawl4AI: as a library in your Python projects or as a standalone local server.

You can also try Crawl4AI in a Google Colab notebook.

Using Crawl4AI as a Library πŸ“š

To install Crawl4AI as a library, follow these steps:

  1. Install the package from GitHub:
    pip install git+https://github.com/unclecode/crawl4ai.git
  2. Alternatively, you can clone the repository and install the package locally:
    virtualenv venv
    source venv/bin/activate
    git clone https://github.com/unclecode/crawl4ai.git
    cd crawl4ai
    pip install -e .
            
  3. Import the necessary modules in your Python script:
    from crawl4ai.web_crawler import WebCrawler
    from crawl4ai.chunking_strategy import *
    from crawl4ai.extraction_strategy import *
    import os
    
    crawler = WebCrawler()
    
    # Single page crawl
    result = crawler.run(
        url='https://www.nbcnews.com/business',
        word_count_threshold=5,  # Minimum word count for an HTML tag to be considered a worthy block
        chunking_strategy=RegexChunking(patterns=["\\n\\n"]),  # Default is RegexChunking
        extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method='ward', top_k=3),  # Default is CosineStrategy
        # extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
        bypass_cache=False,  # Set to True to force a fresh crawl
        extract_blocks=True,  # Whether to extract semantic blocks of text from the HTML
        css_selector="",  # E.g., "div.article-body" to target part of the page
        verbose=True,
        include_raw_html=True,  # Whether to include the raw HTML content in the response
    )
    print(result.model_dump())
            

For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

πŸ“– Parameters

| Parameter | Description | Required | Default Value |
|-----------|-------------|----------|---------------|
| urls | A list of URLs to crawl and extract data from. | Yes | - |
| include_raw_html | Whether to include the raw HTML content in the response. | No | false |
| bypass_cache | Whether to force a fresh crawl even if the URL has been previously crawled. | No | false |
| extract_blocks | Whether to extract semantic blocks of text from the HTML. | No | true |
| word_count_threshold | The minimum number of words a block must contain to be considered meaningful (minimum value is 5). | No | 5 |
| extraction_strategy | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy"). | No | CosineStrategy |
| chunking_strategy | The strategy to use for chunking the text before processing (e.g., "RegexChunking"). | No | RegexChunking |
| css_selector | The CSS selector to target specific parts of the HTML for extraction. | No | None |
| verbose | Whether to enable verbose logging. | No | true |

Extraction Strategies
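
The default extraction strategy, CosineStrategy, clusters text blocks by similarity (the example above passes `max_dist`, `linkage_method='ward'`, and `top_k` to tune that clustering). A rough illustration of the underlying idea follows; this is a sketch, not the library's implementation, and `cosine_similarity` and `group_similar` are hypothetical helpers that use bag-of-words counts rather than real embeddings:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts using bag-of-words counts.

    A toy stand-in for the embedding-based similarity a
    CosineStrategy-style extractor would use to relate blocks.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def group_similar(blocks, threshold=0.5):
    """Greedy grouping: attach each block to the first group whose
    seed block is at least `threshold` similar, else start a new group."""
    groups = []
    for block in blocks:
        for group in groups:
            if cosine_similarity(group[0], block) >= threshold:
                group.append(block)
                break
        else:
            groups.append([block])
    return groups

blocks = ["stock market news", "market news today", "cooking pasta recipe"]
print(group_similar(blocks, threshold=0.3))
```

Blocks about the same topic share vocabulary, so they end up in the same group; `top_k` in the real strategy would then keep only the highest-ranked clusters.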

Chunking Strategies
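
The default chunking strategy, RegexChunking, splits the page text on regex patterns (paragraph breaks, `"\n\n"`, by default) before extraction runs. A minimal sketch of that idea, assuming nothing about the library's internals (`SimpleRegexChunker` is a hypothetical stand-in, not the real class):

```python
import re

class SimpleRegexChunker:
    """Illustrative stand-in for a RegexChunking-style strategy.

    Applies each pattern in turn, splitting every current chunk,
    mirroring the default of chunking on blank lines ("\n\n").
    """

    def __init__(self, patterns=None):
        # Default: split on paragraph boundaries.
        self.patterns = patterns or [r"\n\n"]

    def chunk(self, text):
        chunks = [text]
        for pattern in self.patterns:
            new_chunks = []
            for c in chunks:
                new_chunks.extend(re.split(pattern, c))
            chunks = new_chunks
        # Drop empty chunks produced by consecutive delimiters.
        return [c.strip() for c in chunks if c.strip()]

chunker = SimpleRegexChunker()
print(chunker.chunk("First paragraph.\n\nSecond paragraph."))
```

Passing a different `patterns` list (as in the `RegexChunking(patterns=["\\n\\n"])` call above) changes the boundaries on which the text is cut.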

πŸ€” Why build this?

In recent times, we've witnessed a surge of startups emerging, riding the AI hype wave and charging for services that should rightfully be accessible to everyone. πŸŒπŸ’Έ One such example is scraping and crawling web pages and transforming them into a format suitable for Large Language Models (LLMs). πŸ•ΈοΈπŸ€– We believe that building a business around this is not the right approach; instead, it should definitely be open-source. πŸ†“πŸŒŸ So, if you possess the skills to build such tools and share our philosophy, we invite you to join our "Robinhood" band and help set these products free for the benefit of all. 🀝πŸ’ͺ

βš™οΈ Installation

To install and run Crawl4AI as a library or a local server, please refer to the πŸ“š GitHub repository.