Installation πŸ’»

There are three ways to use Crawl4AI:

  1. As a library
  2. As a local server (Docker)
  3. As a Google Colab notebook

To install Crawl4AI as a library, follow these steps:

    1. Install the package from GitHub:
      pip install git+https://github.com/unclecode/crawl4ai.git
    2. Alternatively, you can clone the repository and install the package locally:
      virtualenv venv
      source venv/bin/activate
      git clone https://github.com/unclecode/crawl4ai.git
      cd crawl4ai
      pip install -e .
      
    3. Use Docker to run the local server:
      docker build -t crawl4ai .
      # For Mac users: docker build --platform linux/amd64 -t crawl4ai .
      docker run -d -p 8000:80 crawl4ai

    For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.

How-To Guide

🌟 Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun!
First Step: Create an instance of WebCrawler and call the warmup() function.

  crawler = WebCrawler()
  crawler.warmup()

🧠 Understanding 'bypass_cache' and 'include_raw_html' parameters:

First crawl (caches the result):

  result = crawler.run(url="https://www.nbcnews.com/business")

Second crawl (forces a fresh crawl):

  result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
⚠️ Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL; otherwise, the cached result will be returned. You can also set `always_by_pass_cache` to True in the constructor to always bypass the cache.
Crawl result without raw HTML content:

  result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)

📄 The 'include_raw_html' parameter, when set to True, includes the raw HTML content in the response. By default, it is set to True.

Set always_by_pass_cache to True:

  crawler.always_by_pass_cache = True
🧩 Let's add a chunking strategy: RegexChunking!

Using RegexChunking:

  result = crawler.run(
      url="https://www.nbcnews.com/business",
      chunking_strategy=RegexChunking(patterns=["\n\n"])
  )

Using NlpSentenceChunking:

  result = crawler.run(
      url="https://www.nbcnews.com/business",
      chunking_strategy=NlpSentenceChunking()
  )
🧠 Let's get smarter with an extraction strategy: CosineStrategy!

Using CosineStrategy:

  result = crawler.run(
      url="https://www.nbcnews.com/business",
      extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3)
  )
πŸ€– Time to bring in the big guns: LLMExtractionStrategy without instructions!
Using LLMExtractionStrategy without instructions:
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
πŸ“œ Let's make it even more interesting: LLMExtractionStrategy with instructions!
Using LLMExtractionStrategy with instructions:
result = crawler.run(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="I am interested in only financial news"
)
)
🎯 Targeted extraction: Let's use a CSS selector to extract only H2 tags!

Using a CSS selector to extract H2 tags:

  result = crawler.run(
      url="https://www.nbcnews.com/business",
      css_selector="h2"
  )
πŸ–±οΈ Let's get interactive: Passing JavaScript code to click 'Load More' button!
Using JavaScript to click 'Load More' button:
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""
crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
πŸŽ‰ Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! πŸ•ΈοΈ

RegexChunking

RegexChunking is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.

Constructor Parameters:

  • patterns (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (['\n\n']).

Example usage:

chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
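The idea behind this chunker can be sketched in a few lines of plain Python. This is a simplified illustration built on `re.split`, not the library's actual implementation, and `regex_chunk` is a hypothetical name:

```python
import re

def regex_chunk(text, patterns=None):
    """Split text on each regex pattern in turn, keeping non-empty pieces."""
    if patterns is None:
        patterns = [r"\n\n"]  # default: split on double newlines
    chunks = [text]
    for pattern in patterns:
        new_chunks = []
        for chunk in chunks:
            # apply the pattern to every chunk produced so far
            new_chunks.extend(p for p in re.split(pattern, chunk) if p.strip())
        chunks = new_chunks
    return chunks

print(regex_chunk("Intro paragraph.\n\nBody text. More text.", patterns=[r"\n\n", r"\. "]))
```

Each pattern is applied in order, so later patterns further subdivide the pieces produced by earlier ones.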

NlpSentenceChunking

NlpSentenceChunking uses a natural language processing model to chunk a given text into sentences. This approach leverages spaCy to accurately split text based on sentence boundaries.

Constructor Parameters:

  • None.

Example usage:

chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
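To see what sentence chunking produces without installing spaCy, here is a rough regex-based stand-in. It only approximates the behavior (a real spaCy model also handles abbreviations and other edge cases), and `naive_sentence_chunk` is a hypothetical name:

```python
import re

def naive_sentence_chunk(text):
    """Split after ., !, or ? followed by whitespace -- a crude
    approximation of model-based sentence boundary detection."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

print(naive_sentence_chunk("This is a sample text. It will be split into sentences."))
```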

TopicSegmentationChunking

TopicSegmentationChunking uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.

Constructor Parameters:

  • num_keywords (int, optional): The number of keywords to extract for each topic segment. Default is 3.

Example usage:

chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
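TextTiling itself finds segment boundaries from lexical cohesion; the per-segment keyword step can be illustrated with a simple frequency count. This is a conceptual sketch only (the stopword list and `top_keywords` name are made up for the example):

```python
import re
from collections import Counter

# Tiny illustrative stopword set; a real implementation uses a full list.
STOPWORDS = {"the", "a", "is", "it", "will", "be", "into", "this"}

def top_keywords(segment, num_keywords=3):
    """Return the most frequent non-stopword tokens in a segment."""
    words = re.findall(r"[a-z]+", segment.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(num_keywords)]

print(top_keywords("dogs dogs dogs cats cats fish", num_keywords=2))
```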

FixedLengthWordChunking

FixedLengthWordChunking splits a given text into chunks of fixed length, based on the number of words.

Constructor Parameters:

  • chunk_size (int, optional): The number of words in each chunk. Default is 100.

Example usage:

chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
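The mechanics of fixed-length word chunking reduce to slicing a word list. A minimal sketch (hypothetical function name, not the library's code):

```python
def fixed_length_word_chunks(text, chunk_size=100):
    """Split text into chunks of at most chunk_size words each."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

print(fixed_length_word_chunks("one two three four five", chunk_size=2))
```

The final chunk may be shorter than `chunk_size` when the word count is not an exact multiple.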

SlidingWindowChunking

SlidingWindowChunking uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.

Constructor Parameters:

  • window_size (int, optional): The number of words in each chunk. Default is 100.
  • step (int, optional): The number of words to slide the window. Default is 50.

Example usage:

chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
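The difference from fixed-length chunking is that consecutive windows overlap when `step` is smaller than `window_size`. A minimal sketch of the idea (hypothetical function name):

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Overlapping word windows: each chunk holds window_size words,
    and consecutive chunks start step words apart."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

print(sliding_window_chunks("a b c d e", window_size=3, step=1))
```

With `window_size=100` and `step=50`, each chunk shares its second half with the start of the next chunk, which helps preserve context across boundaries.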

NoExtractionStrategy

NoExtractionStrategy is a basic extraction strategy that returns the entire HTML content without modification. It is useful when no specific extraction is required and you only need the cleaned HTML and markdown.

Constructor Parameters:

None.

Example usage:

extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)

LLMExtractionStrategy

LLMExtractionStrategy uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.

Constructor Parameters:

  • provider (str, optional): The provider to use for the language model completions. Default is DEFAULT_PROVIDER (e.g., openai/gpt-4).
  • api_token (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable OPENAI_API_KEY.
  • instruction (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is None.

Example usage:

extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)

By providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.

CosineStrategy

CosineStrategy uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.

Constructor Parameters:

  • semantic_filter (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is None.
  • word_count_threshold (int, optional): Minimum number of words per cluster. Default is 20.
  • max_dist (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is 0.2.
  • linkage_method (str, optional): The linkage method for hierarchical clustering. Default is 'ward'.
  • top_k (int, optional): Number of top categories to extract. Default is 3.
  • model_name (str, optional): The model name for embedding generation. Default is 'BAAI/bge-small-en-v1.5'.

Example usage:

extractor = CosineStrategy(
    semantic_filter='artificial intelligence',
    word_count_threshold=10,
    max_dist=0.2,
    linkage_method='ward',
    top_k=3,
    model_name='BAAI/bge-small-en-v1.5'
)
extracted_content = extractor.extract(url, html)

Cosine Similarity Filtering

When a semantic_filter is provided, the CosineStrategy applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.
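At its core, this filtering step is a cosine-similarity test between each document's embedding and the filter's embedding. Here is a dependency-free sketch; the function names and the 0.5 threshold are illustrative (real embeddings come from the model named in `model_name`):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_by_similarity(doc_embeddings, filter_embedding, threshold=0.5):
    """Keep indices of documents whose embedding is close enough
    to the semantic_filter embedding."""
    return [i for i, emb in enumerate(doc_embeddings)
            if cosine_similarity(emb, filter_embedding) >= threshold]

# Documents pointing "along" the filter vector survive; orthogonal ones do not.
print(filter_by_similarity([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], threshold=0.9))
```

Documents that pass the filter then move on to the hierarchical clustering step described above.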

TopicExtractionStrategy

TopicExtractionStrategy uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.

Constructor Parameters:

  • num_keywords (int, optional): Number of keywords to represent each topic segment. Default is 3.

Example usage:

extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)