There are three ways to use Crawl4AI: as a library installed with pip, as a library installed from source, or as a local server via Docker.
To install Crawl4AI as a library with pip:

```bash
pip install git+https://github.com/unclecode/crawl4ai.git
```

Alternatively, to install from source inside a virtual environment:

```bash
virtualenv venv
source venv/bin/activate
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .
```
To run Crawl4AI as a local server, build and start the Docker image:

```bash
docker build -t crawl4ai .
# For Mac users:
# docker build --platform linux/amd64 -t crawl4ai .
docker run -d -p 8000:80 crawl4ai
```
For more information about how to run Crawl4AI as a local server, please refer to the GitHub repository.
First, create a `WebCrawler` instance and call its `warmup()` function to load the required models; then run it on a URL:

```python
from crawl4ai import WebCrawler

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(url="https://www.nbcnews.com/business")
```
Results are cached by default. Set `bypass_cache` to `True` if you want to try different strategies on the same URL; otherwise the cached result is returned. You can also set `always_by_pass_cache` to `True` in the constructor to always bypass the cache.

```python
result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
```
To exclude the raw HTML from the result, set `include_raw_html` to `False`:

```python
result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
```
You can also toggle the flag on an existing crawler by setting `always_by_pass_cache` to `True`:

```python
crawler.always_by_pass_cache = True
```
To chunk the content with a chunking strategy, for example `RegexChunking` splitting on double newlines:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=RegexChunking(patterns=["\n\n"])
)
```
Or chunk by sentence boundaries with `NlpSentenceChunking`:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()
)
```
To extract clusters of related content with `CosineStrategy`:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=CosineStrategy(word_count_threshold=20, max_dist=0.2, linkage_method="ward", top_k=3)
)
```
To extract content with a language model via `LLMExtractionStrategy`:

```python
import os

result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
)
```
You can pass an `instruction` to guide the extraction:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        instruction="I am interested in only financial news"
    )
)
```
To target specific elements with a CSS selector:

```python
result = crawler.run(
    url="https://www.nbcnews.com/business",
    css_selector="h2"
)
```
To execute JavaScript before crawling (for example, clicking a "Load More" button), pass the script to `LocalSeleniumCrawlerStrategy`:

```python
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
loadMoreButton && loadMoreButton.click();
"""

crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
result = crawler.run(url="https://www.nbcnews.com/business")
```
`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.

Parameters:

- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).

```python
chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
```
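Under the hood, regex-based chunking amounts to splitting on each pattern in turn. A minimal standard-library sketch of the idea (not Crawl4AI's actual implementation — the `regex_chunk` helper is illustrative only):

```python
import re

def regex_chunk(text, patterns=(r"\n\n",)):
    """Split text on each regex pattern in turn, dropping empty chunks."""
    chunks = [text]
    for pattern in patterns:
        chunks = [piece for chunk in chunks
                  for piece in re.split(pattern, chunk) if piece.strip()]
    return chunks

print(regex_chunk("First paragraph.\n\nSecond paragraph."))
# ['First paragraph.', 'Second paragraph.']
```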
`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.

```python
chunker = NlpSentenceChunking()
chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
```
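SpaCy does the heavy lifting here. As a rough illustration of what sentence-boundary splitting means, here is a crude regex stand-in (explicitly not what `NlpSentenceChunking` does — a real NLP model handles abbreviations, quotes, and edge cases far more robustly):

```python
import re

def naive_sentences(text):
    # Split after '.', '!' or '?' followed by whitespace -- a crude
    # heuristic that SpaCy's sentence segmenter improves on considerably.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(naive_sentences("This is a sample text. It will be split into sentences."))
# ['This is a sample text.', 'It will be split into sentences.']
```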
`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.

Parameters:

- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is 3.

```python
chunker = TopicSegmentationChunking(num_keywords=3)
chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
```
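TextTiling itself finds boundaries by comparing lexical similarity between adjacent blocks of text. To illustrate only the "segment plus keywords" output shape, here is a much simpler frequency-based sketch that segments on blank lines — explicitly not the TextTiling algorithm, and the `naive_topic_chunks` helper and its tiny stopword list are made up for this example:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "it", "of", "and", "to", "in", "will", "be"}

def naive_topic_chunks(text, num_keywords=3):
    """Segment on blank lines and tag each segment with its most frequent words."""
    segments = [s.strip() for s in text.split("\n\n") if s.strip()]
    result = []
    for seg in segments:
        words = [w.strip(".,!?").lower() for w in seg.split()]
        counts = Counter(w for w in words if w and w not in STOPWORDS)
        keywords = [w for w, _ in counts.most_common(num_keywords)]
        result.append({"text": seg, "keywords": keywords})
    return result
```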
`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.

Parameters:

- `chunk_size` (int, optional): The number of words in each chunk. Default is 100.

```python
chunker = FixedLengthWordChunking(chunk_size=100)
chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
```
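The technique is straightforward: group the word list into runs of at most `chunk_size` words. A minimal sketch of the idea (the `fixed_length_chunks` helper is illustrative, not Crawl4AI's implementation):

```python
def fixed_length_chunks(text, chunk_size=100):
    """Group whitespace-separated words into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

print(fixed_length_chunks("one two three four five", chunk_size=2))
# ['one two', 'three four', 'five']
```

Note the final chunk may be shorter than `chunk_size`.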
`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.

Parameters:

- `window_size` (int, optional): The number of words in each chunk. Default is 100.
- `step` (int, optional): The number of words to slide the window. Default is 50.

```python
chunker = SlidingWindowChunking(window_size=100, step=50)
chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
```
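Because the step is smaller than the window, consecutive chunks overlap, which helps preserve context that would be cut at a hard chunk boundary. A sketch of the mechanism (illustrative only; Crawl4AI's edge-case handling may differ from the short-text fallback assumed here):

```python
def sliding_window_chunks(text, window_size=100, step=50):
    """Slide a window of window_size words forward by step words per chunk."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]  # assumed fallback for short texts
    return [" ".join(words[i:i + window_size])
            for i in range(0, len(words) - window_size + 1, step)]

print(sliding_window_chunks("a b c d e f", window_size=4, step=2))
# ['a b c d', 'c d e f']
```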
`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required; only the cleaned HTML and markdown are produced.

Parameters: none.

```python
extractor = NoExtractionStrategy()
extracted_content = extractor.extract(url, html)
```
`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.

Parameters:

- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., `openai/gpt-4`).
- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load it from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.

```python
extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
extracted_content = extractor.extract(url, html)
```
By providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.
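Conceptually, the instruction is folded into the prompt sent to the model alongside the page content. The sketch below is purely hypothetical — `build_extraction_prompt` and its wording are invented for illustration and do not reflect Crawl4AI's actual prompt:

```python
def build_extraction_prompt(html_chunk, instruction=None):
    # Hypothetical prompt assembly; the real strategy's prompt differs.
    base = "Extract the meaningful content blocks from the HTML below."
    if instruction:
        base += f" Follow this instruction: {instruction}"
    return f"{base}\n\nHTML:\n{html_chunk}"

print(build_extraction_prompt("<p>AI beats benchmark</p>",
                              instruction="Extract only news about AI."))
```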
`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.

Parameters:

- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is 20.
- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is 0.2.
- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is 3.
- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.

```python
extractor = CosineStrategy(
    semantic_filter='artificial intelligence',
    word_count_threshold=10,
    max_dist=0.2,
    linkage_method='ward',
    top_k=3,
    model_name='BAAI/bge-small-en-v1.5'
)
extracted_content = extractor.extract(url, html)
```
When a `semantic_filter` is provided, `CosineStrategy` applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.
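The filtering step relies on cosine similarity between embedding vectors. Here is the measure itself on plain vectors, with a toy threshold filter; in the real strategy the embeddings come from the model (e.g. `BAAI/bge-small-en-v1.5`), and the `filter_by_similarity` helper and its threshold are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filter_by_similarity(docs, query_embedding, threshold=0.5):
    # docs: list of (text, embedding) pairs; keep those close to the query.
    return [text for text, emb in docs
            if cosine_similarity(emb, query_embedding) >= threshold]

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```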
`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.

Parameters:

- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is 3.

```python
extractor = TopicExtractionStrategy(num_keywords=3)
extracted_content = extractor.extract(url, html)
```